44 changes: 44 additions & 0 deletions CONTRIBUTING.md
@@ -31,6 +31,50 @@ Our pull request template has an extensive checklist that must be completed prio

Note that the maintainers reserve the right to close any submission without review for any reason.

## Expectations for WDL contributions

We have some opinionated rules and guidelines we use while writing WDL for this repository. These include:

- See `template/common-parameter-meta.txt` for common description strings.
- If applicable, use the same parameter name and help text as the underlying tool called by the task.

Comment on lines +38 to +39

Member:
Not at all related to this PR, but I wonder if there is a way to tell Copilot about this file. It'd be great if its autocomplete suggestions would consider that file as a set of options when writing parameter_meta.

- All requirement values are overridable at runtime. However, tasks should have easily configurable memory and disk space allocations.
- See the various tasks in the template directory for possible ways to allocate resources.
- Contributors can mix and match the available templates, copying and pasting subsections as appropriate.
- A task may contain both statically and dynamically allocated resources.
- Multi-core tasks should *always* follow the conventions laid out in the `use_all_cores_task` example (see `template/task-examples.wdl`; a condensed sketch also appears after this list).
- This caters to cloud users, who may be allocated a machine with more cores than are specified by the `ncpu` parameter.

Member:
We may want to revisit this paradigm in the future. With Planetary + AKS, the container only gets allocated the requested number of cores within the VM.

Member Author:
Q you may not have the answer to: how does nproc work in those environments? I think that nproc on the cluster is "smart" and only returns what was allocated.

Member:
It does. With Planetary + AKS, the container is allocated n CPUs, and it can only see that many. I'm not actually confident that the idea behind nproc is sound. As long as the task is running in a container, the Docker engine will have provided some number of cores to the container. Presumably, whatever engine is queueing the Docker container will provide the cpu argument to Docker as the number of cores. The use_all_cores idea is probably better implemented at the engine-level. If the engine allocates 1 VM per task, then it should use_all_cores. If it is capable of sharing VMs between tasks, then it shouldn't use_all_cores.

Member Author:
The point of use_all_cores is so that you can tell called tools what level of parallelism should be used. Most bioinformatics tools will only spin up as many threads as you ask for (i.e. without specifying some kind of `--thread [>1 value]`, everything will run in one thread regardless of how many cores the task/engine/backend/VM allocates).

Member:
Yeah, I meant more that the engine, itself, should set use_all_cores if it isn't capable of scheduling multiple tasks to a node.

With our experience with LSF though, I'm curious how other HPC systems function (e.g. Slurm). LSF does something so that the underlying task only sees the allocated cores. If the scheduling systems reliably did that, then we could just always use_all_cores.

- Note that future versions of WDL will likely cause a change to this convention.
- We plan to deprecate the `ncpu` param in favor of accessing the runtime section directly (`n_cores=~{task.runtime.cpu}`).
- Output file names should *always* be determined with either the `outfile_name` parameter or the `prefix` parameter.
- `outfile_name` should be preferred if no downstream tasks/tools rely on the file name/extension.
- Tasks with multiple outputs should always use the `prefix` convention.
- After the input sorting rules in `sprocket lint` have been applied, follow the below rules for further sorting.
- "sample" files come before "reference" files.
- If present, `use_all_cores` should be the last `Boolean` in its block.
- The `ncpu` parameter comes before inputs that allocate memory, which come before inputs that allocate disk space.
- This block of 2-3 inputs should come after all other inputs.
- If a task uses multiple cores or is multithreaded, then at least 2 CPUs should be specified.
- Use the `as` keyword sparingly; only in the case of increased readability or to avoid name collisions.
- Prefer using `as` in the import block rather than at the task/workflow call level.
- When using `as` to rename an invalid URI, attempt to make as few changes to the filename as possible (i.e. try not to abbreviate).
- To disambiguate a task or workflow file from its contents, you can respectively add the `_tasks` or `_wf` suffix in the import section.
- Whenever possible, prefer a Docker image maintained by an external source (such as BioContainers) rather than creating your own image.
- When adding a `Dockerfile` to this repository, follow the below conventions:
- Create a directory under the `docker/` directory and choose an appropriate name (likely shared with the underlying tool). The `Dockerfile` should be nested under this new directory. Then create a `package.json` alongside the `Dockerfile`. The `package.json` file is required to contain two JSON fields (`name` and `version`). It can optionally contain a `revision` field. An illustrative `package.json` appears after this list.
- Docker images should be versioned according to the following convention:
- The `version` should be shared with whatever underlying tool is being used.
- If no specific tool is named (e.g. the `util` image), default to SemVer. Ignore the next 3 bullet points.
- The `revision` should start with zero (`0`).
- If the Docker image gets updated *without* updating the base tool's version, increment the `revision` by one.
- If the Docker image gets updated *including* updating the base tool's version, reset the `revision` to zero.
- Any tasks which are deprecated should have a `deprecated: true` key in their `meta` section (see the sketch after this list).
- Never include a `deprecated: false` key in any production tasks. All tasks are assumed to not be deprecated unless otherwise noted.
- In addition, there should be a `warning` key which starts with the text `**[DEPRECATED]**`.
- No other text or explanation is required after the above text, but it can be added for further context.
- These two conventions allow for a task's deprecated status to be communicated in multiple ways, ensuring no user misses the notice.
- Deprecated tasks should be placed at the end of their file.
- While WDL allows embedded scripts in the `command` block sections, this repository requires scripts (e.g. R, Python) to be separate and placed in the `scripts` folder. The relevant Docker image build for your task should then include the script during the build so the task can access it. This separation of concerns improves the developer experience by improving syntax highlighting in the WDL document and enabling linting and formatting checks for the scripting languages.
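
The resource-allocation, `use_all_cores`, and output-naming conventions above might combine as in the following condensed sketch. It is illustrative only: the tool, container tag, and default values are hypothetical, and the canonical patterns live in `template/task-examples.wdl`.

```wdl
version 1.1

task flagstat {
    meta {
        description: "Generates a `samtools flagstat` report for the input BAM"
        outputs: {
            flagstat_report: "`samtools flagstat` report"
        }
    }

    parameter_meta {
        bam: "Input BAM format file to generate flagstat for"
        outfile_name: "Name for the flagstat report file"
        use_all_cores: "Use all cores? Recommended for cloud environments."
        ncpu: "Number of cores to allocate for task"
        memory_gb: "RAM to allocate for task, specified in GB"
        disk_size_gb: "Disk space to allocate for task, specified in GB"
    }

    input {
        File bam
        String outfile_name = basename(bam, ".bam") + ".flagstat.txt"
        Boolean use_all_cores = false
        Int ncpu = 2
        Int memory_gb = 4
        Int disk_size_gb = ceil(size(bam, "GiB")) + 10
    }

    command <<<
        set -euo pipefail

        # Default to the requested core count, but use every visible core
        # when the caller opts in via `use_all_cores`.
        n_cores=~{ncpu}
        if ~{use_all_cores}; then
            n_cores=$(nproc)
        fi

        samtools flagstat --threads "$n_cores" "~{bam}" > "~{outfile_name}"
    >>>

    output {
        File flagstat_report = outfile_name
    }

    runtime {
        cpu: ncpu
        memory: "~{memory_gb} GB"
        disks: "~{disk_size_gb} GB"
        container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"  # illustrative tag
        maxRetries: 1
    }
}
```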
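
An illustrative `package.json` for a hypothetical `samtools` image directory (the tool name and version are made up; match the field formatting of existing entries under `docker/`):

```json
{
    "name": "samtools",
    "version": "1.19.2",
    "revision": "0"
}
```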
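
The deprecation keys described above might look like this in a task's `meta` section (the replacement task named in the warning is hypothetical):

```wdl
meta {
    description: "Counts reads in the input BAM"
    deprecated: true
    warning: "**[DEPRECATED]** Use the `count_reads` task instead."
}
```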

## FAQs

### Can I use Artificial Intelligence (AI)?
93 changes: 27 additions & 66 deletions best-practices.md
@@ -1,69 +1,30 @@
# WDL Best Practices

All rules below should be followed by contributors to this repo. Contributors should also follow the rules enforced by [Sprocket](https://sprocket.bio/). Pull Requests which do not conform to these specifications will be asked to change.

## Rules

- All WDL should be written in v1.1+
- All tasks with multiple commands (including any pipes (`|`)) should have `set -euo pipefail` before any other commands.
- All tasks should run in a persistently versioned container
- This ensures reproducibility across time and environments
- See `template/common-parameter-meta.txt` for common description strings.
- If applicable, use the same parameter name, help string, and parameter ordering as the underlying tool called by the task
- Check all assumptions made about workflow inputs before beginning long-running executions
- Common examples of assumptions that should be checked: valid `String` choice, mutually exclusive parameters, missing optional file for selected parameters, filename extensions
- This can commonly be handled by a `parse_input` task (defined in the same file as the workflow in question)
- When possible, avoid passing in entire files to the `parse_input` task. Coerce files to `Boolean`s or `String`s to avoid unnecessary disk space usage
- Tasks with string parameters for which a limited number of choices are valid must be documented following the template in `string_choices_task` (see `template/task-examples.wdl`)
- they should also fail quickly with an informative error message if an invalid input is provided
- In most cases, just passing the parameter to the underlying tool should produce a satisfactory error, but this must be checked for each tool
- While redundant, it is still best practice to validate these strings in the `parse_input` task of any workflow which calls the task
- This ensures the workflow will fail as fast as possible to save users time and resources
- All requirement values are overridable at runtime. However, tasks should have easily configurable memory and disk space allocations
- see the various tasks in the template directory for possible ways to allocate resources
- Contributors can mix and match the available templates, copying and pasting subsections as appropriate
- It is allowed to have one resource allocated dynamically, and another allocated statically in the same task.
- multi-core tasks should *always* follow the conventions laid out in the `use_all_cores_task` example (see `template/task-examples.wdl`)
- this is catering to cloud users, who may be allocated a machine with more cores than are specified by the `ncpu` parameter
- Note that future versions of WDL will likely cause a change to this convention.
- We plan to deprecate the `ncpu` param in favor of accessing the runtime section directly (`n_cores=~{task.runtime.cpu}`)
- Tasks which assume a file and any accessory files (e.g. a BAM and a BAI) have specific extensions and/or are in the same directory should *always* follow the conventions laid out in the `localize_files_task` example (see `template/task-examples.wdl`)
- This is to accommodate as many backends as possible
- output file names should *always* be determined with either the `outfile_name` parameter or the `prefix` parameter.
- `outfile_name` should be preferred if no downstream tasks/tools rely on the file name/extension
- tasks with multiple outputs should always use the `prefix` convention
- After the input sorting rules in `sprocket lint` have been applied, follow the below rules for further sorting.
- "sample" files come before "reference" files
- If present, `use_all_cores` should be the last `Boolean` in its block
- the `ncpu` parameter comes before inputs that allocate memory, which come before inputs that allocate disk space
- This block of 2-3 inputs should come after all other inputs.
- Most tasks should have a default `maxRetries` of 1
- Certain tasks are prone to intermittent failure (often if an internet connection is involved) and can have a higher default `maxRetries`.
- If a task uses multiple cores or is multithreaded, then at least 2 cpu should be specified.
- Use the `as` keyword sparingly; only in the case of increased readability or to avoid name collisions
- Prefer using `as` in the import block rather than at the task/workflow call level
- When using `as` to rename an invalid URI, attempt to make as few changes to the filename as possible (i.e. try not to abbreviate)
- To disambiguate a task or workflow file from its contents, you can respectively add the `_tasks` or `_wf` suffix in the import section
- the non-empty qualifier (`+`) of arrays and maps should be avoided
- Whenever possible, prefer a Docker image maintained by an external source (such as BioContainers) rather than creating your own image
- When adding a Dockerfile to this repository, follow the below conventions
- Create a directory under the `docker/` directory and choose an appropriate name (likely shared with the underlying tool). The `Dockerfile` should be nested under this new directory. Then create a `package.json` alongside the `Dockerfile`. The `package.json` file is required to contain two JSON fields (`name` and `version`). It can optionally contain a `revision` field.
- Docker images should be versioned according to the following convention
- The `version` should be shared with whatever underlying tool is being used
- If no specific tool is named (e.g. the `util` image), default to SemVer. Ignore the next 3 bullet points.
- The revision should start with zero (`0`)
- If the Docker image gets updated, *without* updating the base tool's version, increment the number by one
- If the Docker image gets updated, *including* updating the base tool's version, revert back to zero
- general purpose tasks can use the `util` image maintained in this repo
- The `description` key in WDL meta sections should be in active voice, beginning the first sentence with a verb
- Each task/workflow is _doing_ something. The first sentence should be a succinct description of what that "something" is.
- The `description` key should be succinct. Generally, one sentence shorter than 140 characters is appropriate.
- If documenting a workflow, task, input, or output and you need to be more verbose than is appropriate in a `description` field, you may include _in addition_ a `help` key with extended prose or an `external_help` key with a URL
- the presence of `help` or `external_help` is _not_ a substitute for a `description`
- Any tasks which are deprecated should have a `deprecated: true` key in their `meta` section
- It is allowed (but redundant and discouraged) to include a `deprecated: false` key in any production tasks. All tasks are assumed to not be deprecated unless otherwise noted.
- In addition, there should be a `warning` key which starts with the text `**[DEPRECATED]**`
- No other text or explanation is required after the above text, but it can be added for further context
- These two rules allow for a task's deprecated status to be communicated in multiple ways, ensuring no user misses the notice
- Deprecated tasks should be placed at the end of their file
- While WDL allows embedded scripts in the `command` block sections, this repository requires scripts (e.g. R, Python) to be separate and placed in the `scripts` folder. The relevant Docker image build for your task should then include the script during the build so the task can access it. This separation of concerns improves the developer experience by improving syntax highlighting in the WDL document and enabling linting and formatting checks for the scripting languages.
- Tasks without multiple commands or pipes can omit `set -euo pipefail`.
- These options will cause common classes of bugs in Bash scripts to fail immediately and loudly, instead of causing silent or subtle bugs in your task behavior.
- All tasks should run in a persistently versioned container.
- e.g. do not use `latest` tags for Docker images.
- This helps ensure reproducibility across time and environments.
- Check all assumptions made about workflow inputs before beginning long-running executions.
- Common examples of assumptions that should be checked:
- valid `String` choice (for WDL 1.3 and later, an `enum` should be used in place of `String`s with a fixed set of valid options)
- mutually exclusive parameters
- missing optional file for selected parameters
- filename extensions
- Use `after` clauses in workflows to ensure that all these assumptions are valid before beginning tasks with heavy computation (a sketch of this fail-fast pattern appears at the end of this list).
- If the _contents_ of a `File` are not read or do not need to be localized for a task, try to coerce the `File` variable to a `Boolean` (with `defined()`) or a `String` (with `basename()`) to avoid unnecessary disk space usage and networking.
- All requirement values are overridable at runtime. However, tasks should have easily configurable memory and disk space allocations.
- Often, tasks have a dynamic calculation for resource requirements based on input sizes. Users of a WDL should have an easy way to fine tune this calculation.
- This may mean incorporating an `Int` or `Float` in the inputs of the task that is applied to the dynamic calculation (see the sketch at the end of this list).
- For WDL 1.3 and later, WDL authors can change resource requirements between retry attempts. This enables mitigation of errors relating to resource limits, but users may inadvertently disable these mitigations by introducing runtime overrides. WDL authors should expose resource fine tuning via the input section and incorporate those user values in any dynamic calculations to prevent runtime locking.
- Tasks which assume a file and any accessory files (e.g. a BAM and a BAI) have specific extensions and/or are in the same directory should *always* create symlinks from the mounted inputs to the work directory of the task
- This is because individual `File` types are not guaranteed to be in the same mounted directory.
- The `command` may include something like: `ln -s "~{<input name>}" "./<expected name>"` (a fuller sketch appears at the end of this list).
- Tasks should `rm` any temporary or intermediate files created in the work directory (including symlinks).
- This helps reduce disk bloat from keeping unnecessary files around.
- This is especially important for any large or uncompressed files, such as reference FASTAs or databases.
- Most tasks should have a default `maxRetries` of 1.
- This is because many WDL backends are prone to intermittent failures that can be recovered from with a retry.
- Certain tasks are especially prone to intermittent failure (often if any networking is involved) and can have a higher default `maxRetries`.
- Certain tasks with potentially high compute costs in cloud environments may default to `0`. This should be used in combination with call caching to aid rerunning while minimizing costs.
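
A minimal sketch of the fail-fast validation pattern described above, assuming a hypothetical `strandedness` parameter with three valid choices (the container and the exact structure of this repository's `parse_input` tasks may differ):

```wdl
version 1.1

task parse_input {
    meta {
        description: "Validates inputs to a hypothetical workflow before heavy computation begins"
    }

    parameter_meta {
        strandedness: "Hypothetical String parameter with a limited set of valid choices"
    }

    input {
        String strandedness
    }

    command <<<
        set -euo pipefail

        if [ "~{strandedness}" != "yes" ] \
            && [ "~{strandedness}" != "no" ] \
            && [ "~{strandedness}" != "reverse" ]
        then
            >&2 echo "Invalid value for 'strandedness': '~{strandedness}'"
            >&2 echo "Valid choices are 'yes', 'no', and 'reverse'"
            exit 1
        fi
    >>>

    runtime {
        memory: "4 GB"
        disks: "10 GB"
        container: "ubuntu:22.04"  # illustrative; a general-purpose image such as the repo's `util` image would normally be used
        maxRetries: 1
    }
}
```

Calls to expensive tasks can then be declared with `after parse_input` so the workflow fails before any heavy computation starts if validation does not pass.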
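
An abbreviated sketch of the user-tunable dynamic resource calculation described above (the `modify_memory_gb`/`modify_disk_size_gb` names and the scaling factors are hypothetical; `meta` and `parameter_meta` are omitted for brevity):

```wdl
version 1.1

task sort {
    input {
        File bam
        Int modify_memory_gb = 0
        Int modify_disk_size_gb = 0
    }

    Float bam_size = size(bam, "GiB")
    # Dynamic calculations based on input size, adjusted by the user-supplied modifiers
    Int memory_gb = ceil(bam_size * 0.5) + 4 + modify_memory_gb
    Int disk_size_gb = ceil(bam_size * 2) + 10 + modify_disk_size_gb

    command <<<
        set -euo pipefail
        samtools sort -o sorted.bam "~{bam}"
    >>>

    output {
        File sorted_bam = "sorted.bam"
    }

    runtime {
        memory: "~{memory_gb} GB"
        disks: "~{disk_size_gb} GB"
        container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"  # illustrative tag
        maxRetries: 1
    }
}
```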
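
An abbreviated sketch of the accessory-file symlink pattern described above (input names are illustrative; `samtools idxstats` is used only because it requires the index to sit next to the BAM):

```wdl
version 1.1

task idxstats {
    input {
        File bam
        File bam_index
    }

    command <<<
        set -euo pipefail

        # Symlink the inputs into the working directory so the BAM and its
        # index sit side by side with the expected extensions, regardless of
        # where the backend mounted each individual `File`.
        ln -s "~{bam}" sample.bam
        ln -s "~{bam_index}" sample.bam.bai

        samtools idxstats sample.bam > idxstats.txt

        # Remove the symlinks when done, per the cleanup convention above.
        rm sample.bam sample.bam.bai
    >>>

    output {
        File idxstats_report = "idxstats.txt"
    }

    runtime {
        container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"  # illustrative tag
        maxRetries: 1
    }
}
```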