Building Common Workflow Language Tools (Using the Planemo Appliance)

This tutorial is a gentle introduction to writing Common Workflow Language tools using the Planemo virtual appliance (available for Docker and Vagrant). Check out ` these instructions <https://planemo.readthedocs.org/en/latest/appliance.html>`__ for obtaining the virtual appliance if you have not done so already.

The Basics

This guide is going to demonstrate building up tools for commands from Heng Li’s Seqtk package - a package for processing sequence data in FASTA and FASTQ files.

To get started let’s install Seqtk. Here we are going to use conda to install Seqtk - but however you obtain it should be fine.

$ conda install --force --yes -c conda-forge -c bioconda seqtk=1.2
    ... seqtk installation ...
$ seqtk seq
        Usage:   seqtk seq [options] <in.fq>|<in.fa>
        Options: -q INT    mask bases with quality lower than INT [0]
                 -X INT    mask bases with quality higher than INT [255]
                 -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]
                 -l INT    number of residues per line; 0 for 2^32-1 [0]
                 -Q INT    quality shift: ASCII-INT gives base quality [33]
                 -s INT    random seed (effective with -f) [11]
                 -f FLOAT  sample FLOAT fraction of sequences [1]
                 -M FILE   mask regions in BED or name list FILE [null]
                 -L INT    drop sequences with length shorter than INT [0]
                 -c        mask complement region (effective with -M)
                 -r        reverse complement
                 -A        force FASTA output (discard quality)
                 -C        drop comments at the header lines
                 -N        drop sequences containing ambiguous bases
                 -1        output the 2n-1 reads only
                 -2        output the 2n reads only
                 -V        shift quality by '(-Q) - 33'

Next we will download an example FASTQ file and test out the a simple Seqtk command - seq which converts FASTQ files into FASTA.

$ wget https://raw.githubusercontent.com/galaxyproject/galaxy-test-data/master/2.fastq
$ seqtk seq -A 2.fastq > 2.fasta
$ cat 2.fasta
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG

Common Workflow Language tool files are just simple YAML files, so at this point one could just open a text editor and start implementing the tool. Planemo has a command tool_init to quickly generate a skeleton to work from, so let’s start by doing that.

$ planemo tool_init --cwl --id 'seqtk_seq' --name 'Convert to FASTA (seqtk)'

The tool_init command can take various complex arguments - but three two most basic ones are shown above --cwl, --id and --name. The --cwl flag tells Planemo to generate a Common Workflow Language tool. --id is a short identifier for this tool and it should be unique across all tools. --name is a short, human-readable name for the the tool - it corresponds to the label attribute in the CWL tool document.

The above command will generate the file seqtk_seq.cwl - which should look like this.

#!/usr/bin/env cwl-runner
cwlVersion: 'v1.0'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
inputs: [] # TODO
outputs: [] # TODO
baseCommand: []
arguments: []
doc: |
   TODO: Fill in description.

This tool file has the common fields required for a CWL tool with TODO notes, but you will still need to open up the editor and fill out the command, describe input parameters, tool outputs, writeup usage documentation (doc), etc..

The tool_init command can do a little bit better than this as well. We can use the test command we tried above seqtk seq -A 2.fastq > 2.fasta as an example to generate a command block by specifing the inputs and the outputs as follows.

$ planemo tool_init --force \
                    --cwl \
                    --id 'seqtk_seq' \
                    --name 'Convert to FASTA (seqtk)' \
                    --example_command 'seqtk seq -A 2.fastq > 2.fasta' \
                    --example_input 2.fastq \
                    --example_output 2.fasta

This will generate the following CWL tool definition - which now has correct definitions for the input, output, and command specified. These represent a best guess by planemo, and in most cases will need to be tweaked manually after the tool is generated.

#!/usr/bin/env cwl-runner
cwlVersion: 'v1.0'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
inputs:
  input1:
    type: File
    doc: |
      TODO
    inputBinding:
      position: 1
      prefix: "-a"
outputs:
  output1:
    type: File
    outputBinding:
      glob: out
baseCommand:
  - "seqtk"
  - "seq"
arguments: []
stdout: out
doc: |
   TODO: Fill in description.

As shown at the beginning of this section, the command seqtk seq generates a help message for the seq command. tool_init can take that help message and stick it right in the generated tool file using the help_from_command option.

Generally command help messages aren’t exactly appropriate for tools since they mention argument names and simillar details that are abstracted away by the tool - but they can be an excellent place to start.

The following Planemo’s tool_init call has been enhanced to use --help_from_command.

$ planemo tool_init --force \
                    --cwl \
                    --id 'seqtk_seq' \
                    --name 'Convert to FASTA (seqtk)' \
                    --example_command 'seqtk seq -A 2.fastq > 2.fasta' \
                    --example_input 2.fastq \
                    --example_output 2.fasta \
                    --requirement seqtk@1.2 \
                    --container 'quay.io/biocontainers/seqtk:1.2--0' \
                    --test_case \
                    --help_from_command 'seqtk seq'

This command generates the following CWL YAML file.

#!/usr/bin/env cwl-runner
cwlVersion: 'v1.0'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/seqtk:1.2--1
  SoftwareRequirement:
    packages:
    - package: seqtk
      version:
      - "1.2"
inputs:
  input1:
    type: File
    doc: |
      TODO
    inputBinding:
      position: 1
      prefix: "-a"
outputs:
  output1:
    type: File
    outputBinding:
      glob: out
baseCommand:
  - "seqtk"
  - "seq"
arguments: []
stdout: out
doc: |
  
  Usage:   seqtk seq [options] <in.fq>|<in.fa>

  Options: -q INT    mask bases with quality lower than INT [0]
           -X INT    mask bases with quality higher than INT [255]
           -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]
           -l INT    number of residues per line; 0 for 2^32-1 [0]
           -Q INT    quality shift: ASCII-INT gives base quality [33]
           -s INT    random seed (effective with -f) [11]
           -f FLOAT  sample FLOAT fraction of sequences [1]
           -M FILE   mask regions in BED or name list FILE [null]
           -L INT    drop sequences with length shorter than INT [0]
           -c        mask complement region (effective with -M)
           -r        reverse complement
           -A        force FASTA output (discard quality)
           -C        drop comments at the header lines
           -N        drop sequences containing ambiguous bases
           -1        output the 2n-1 reads only
           -2        output the 2n reads only
           -V        shift quality by '(-Q) - 33'
           -U        convert all bases to uppercases
           -S        strip of white spaces in sequences

In addition to generating a CWL tool adding the --test_case flag generates from more files that are useful including seqtk_seq_job.yml as shown below:

input1:
  class: File
  path: test-data/2.fastq

This is a CWL job input document and should allow you to run the example command using any CWL implementation. For instance if you have cwltool (cwltool) or Toil (cwltoil) on your PATH the following examples should work.

$ cwltool seqtk_seq.cwl seqtk_seq_job.yml
/Users/john/workspace/planemo/.venv/bin/cwltool 1.0.20180508202931
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl'
[job seqtk_seq.cwl] /private/tmp/docker_tmpXgtSLt$ docker \
    run \
    -i \
    --volume=/private/tmp/docker_tmpXgtSLt:/private/var/spool/cwl:rw \
    --volume=/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpGG1thW:/tmp:rw \
    --volume=/Users/john/tool_init_exercise/test-data/2.fastq:/private/var/lib/cwl/stg7db12d3a-2375-42ed-ba60-8a0ef69ffe80/2.fastq:ro \
    --workdir=/private/var/spool/cwl \
    --read-only=true \
    --log-driver=none \
    --user=502:20 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/private/var/spool/cwl \
    quay.io/biocontainers/seqtk:1.2--1 \
    seqtk \
    seq \
    -A \
    /private/var/lib/cwl/stg7db12d3a-2375-42ed-ba60-8a0ef69ffe80/2.fastq > /private/tmp/docker_tmpXgtSLt/out
[job seqtk_seq.cwl] completed success
{
    "output1": {
        "checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
        "basename": "out",
        "location": "file:///Users/john/tool_init_exercise/out",
        "path": "/Users/john/tool_init_exercise/out",
        "class": "File",
        "size": 150
    }
}

$ cwltoil seqtk_seq.cwl seqtk_seq_job.yml
jlaptop17.local 2018-05-21 15:25:30,630 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
jlaptop17.local 2018-05-21 15:25:30,648 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: '55a08d91-1852-4069-97a9-741abd2ea04e'
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl'
jlaptop17.local 2018-05-21 15:25:30,650 MainThread INFO cwltool: Resolved 'seqtk_seq.cwl' to 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl'
jlaptop17.local 2018-05-21 15:25:31,793 MainThread INFO toil.common: Using the single machine batch system
jlaptop17.local 2018-05-21 15:25:31,793 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxCores to CPU count of system (8).
jlaptop17.local 2018-05-21 15:25:31,793 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxMemory to physically available memory (17179869184).
jlaptop17.local 2018-05-21 15:25:31,800 MainThread INFO toil.common: Created the workflow directory at /var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/toil-55a08d91-1852-4069-97a9-741abd2ea04e-132281828025877
jlaptop17.local 2018-05-21 15:25:31,800 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxDisk to physically available disk (206962089984).
jlaptop17.local 2018-05-21 15:25:31,808 MainThread INFO toil.common: User script ModuleDescriptor(dirPath='/Users/john/workspace/planemo/.venv/lib/python2.7/site-packages', name='toil.cwl.cwltoil', fromVirtualEnv=True) belongs to Toil. No need to auto-deploy it.
jlaptop17.local 2018-05-21 15:25:31,809 MainThread INFO toil.common: No user script to auto-deploy.
jlaptop17.local 2018-05-21 15:25:31,809 MainThread INFO toil.common: Written the environment for the jobs to the environment file
jlaptop17.local 2018-05-21 15:25:31,809 MainThread INFO toil.common: Caching all jobs in job store
jlaptop17.local 2018-05-21 15:25:31,809 MainThread INFO toil.common: 0 jobs downloaded.
jlaptop17.local 2018-05-21 15:25:31,825 MainThread INFO toil: Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.
jlaptop17.local 2018-05-21 15:25:31,825 MainThread INFO toil.realtimeLogger: Real-time logging disabled
jlaptop17.local 2018-05-21 15:25:31,832 MainThread INFO toil.toilState: (Re)building internal scheduler state
2018-05-21 15:25:31,832 - toil.toilState - INFO - (Re)building internal scheduler state
jlaptop17.local 2018-05-21 15:25:31,832 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
2018-05-21 15:25:31,832 - toil.leader - INFO - Found 1 jobs to start and 0 jobs with successors to run
jlaptop17.local 2018-05-21 15:25:31,832 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
2018-05-21 15:25:31,832 - toil.leader - INFO - Checked batch system has no running jobs and no updated jobs
jlaptop17.local 2018-05-21 15:25:31,833 MainThread INFO toil.leader: Starting the main loop
2018-05-21 15:25:31,833 - toil.leader - INFO - Starting the main loop
jlaptop17.local 2018-05-21 15:25:31,834 MainThread INFO toil.leader: Issued job 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl' seqtk seq u/c/jobzYKQ3V with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
2018-05-21 15:25:31,834 - toil.leader - INFO - Issued job 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl' seqtk seq u/c/jobzYKQ3V with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
jlaptop17.local 2018-05-21 15:25:33,953 MainThread INFO toil.leader: Job ended successfully: 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl' seqtk seq u/c/jobzYKQ3V
2018-05-21 15:25:33,953 - toil.leader - INFO - Job ended successfully: 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl' seqtk seq u/c/jobzYKQ3V
jlaptop17.local 2018-05-21 15:25:33,955 MainThread INFO toil.leader: Finished the main loop: no jobs left to run
2018-05-21 15:25:33,955 - toil.leader - INFO - Finished the main loop: no jobs left to run
jlaptop17.local 2018-05-21 15:25:33,955 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
2018-05-21 15:25:33,955 - toil.serviceManager - INFO - Waiting for service manager thread to finish ...
jlaptop17.local 2018-05-21 15:25:34,841 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 0.885795116425 seconds
2018-05-21 15:25:34,841 - toil.serviceManager - INFO - ... finished shutting down the service manager. Took 0.885795116425 seconds
jlaptop17.local 2018-05-21 15:25:34,842 MainThread INFO toil.statsAndLogging: Waiting for stats and logging collator thread to finish ...
2018-05-21 15:25:34,842 - toil.statsAndLogging - INFO - Waiting for stats and logging collator thread to finish ...
jlaptop17.local 2018-05-21 15:25:34,854 MainThread INFO toil.statsAndLogging: ... finished collating stats and logs. Took 0.0120511054993 seconds
2018-05-21 15:25:34,854 - toil.statsAndLogging - INFO - ... finished collating stats and logs. Took 0.0120511054993 seconds
jlaptop17.local 2018-05-21 15:25:34,855 MainThread INFO toil.leader: Finished toil run successfully
2018-05-21 15:25:34,855 - toil.leader - INFO - Finished toil run successfully
{
    "output1": {
        "checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
        "basename": "out",
        "nameext": "",
        "nameroot": "out",
        "http://commonwl.org/cwltool#generation": 0,
        "location": "file:///Users/john/tool_init_exercise/out",
        "class": "File",
        "size": 150
    }
jlaptop17.local 2018-05-21 15:25:34,866 MainThread INFO toil.common: Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x1057205d0>
}2018-05-21 15:25:34,866 - toil.common - INFO - Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x1057205d0>

At this point we have a fairly a functional CWL tool with test and usage documentation. This was a pretty simple example - usually you will need to put more work into the tool to get to this point - tool_init is really just designed to get you started.

Now lets lint and test the tool we have developed. The Planemo’s lint (or just l) command will review tool for validity, obvious mistakes, and Planemo “best practices”.

$ planemo l seqtk_seq.cwl
Linting tool /Users/john/workspace/planemo/docs/writing/seqtk_seq.cwl
Applying linter general... CHECK
.. CHECK: Tool defines a version [0.0.1].
.. CHECK: Tool defines a name [Convert to FASTA (seqtk)].
.. CHECK: Tool defines an id [seqtk_seq_v3].
.. CHECK: Tool specifies profile version [16.04].
Applying linter cwl_validation... CHECK
.. INFO: CWL appears to be valid.
Applying linter docker_image... CHECK
.. INFO: Tool will run in Docker image [quay.io/biocontainers/seqtk:1.2--1].
Applying linter new_draft... CHECK
.. INFO: Modern CWL version [v1.0]

In addition to the actual tool and job files, --test_case caused a test file to be generated using the example command and provided test data. The file contents are as follows:

- doc: test generated from example command
  job: seqtk_seq_job.yml
  outputs:
    output1:
      path: test-data/2.fasta

Unlike the job file, this file is a Planemo-specific artifact. This file may contain 1 or more tests - each test is an element of the top-level list. tool_init will use the example command to build just one test.

Each test consists of a few parts:

doc - this attribute provides a short description for the test.
job - this can be the path to a CWL job description or a job description embedded right in the test (tool_init builds the latter).
outputs - this section describes the expected output for a test. Each output ID of the tool or workflow under test can appear as a key. The example above just describes expected specific output file contents exactly but many more expectations can be described.

For more information on the test file format check out the Test Format docs.

The tests described in this file can be run using the planemo test command on the original file.

$ planemo test --no-container seqtk_seq.cwl
Enable beta testing mode for testing.
cwltool INFO: /Users/john/workspace/planemo/.venv/bin/planemo 1.0.20180508202931
cwltool INFO: Resolved '/Users/john/tool_init_exercise/seqtk_seq.cwl' to 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl'
cwltool INFO: [job seqtk_seq.cwl] /private/tmp/docker_tmpvLE9SS$ seqtk \
    seq \
    -A \
    /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpGM22d_/stg0c0cad75-7ca0-4f3a-9d77-63e9c49f5353/2.fastq > /private/tmp/docker_tmpvLE9SS/out
cwltool INFO: [job seqtk_seq.cwl] completed success
cwltool INFO: Final process status is success
All 1 test(s) executed passed.
seqtk_seq_0: passed

This is a bit a different than running the job. For one thing, we don’t need to specify an input job - instead Planemo will automatically find the test file and run all the jobs described inside that file. Additionally, Planemo will check the outputs to ensure the match the test expectations.

In addition to the in console display of test results as red (failing) or green (passing), Planemo also creates an HTML report for the test results by default. Many more test report options are available such --test_output_xunit which is useful in certain continuous integration environments. See planemo test --help for more options, as well as the test_reports command.

The machine-readable JSON report written by --test_output_json uses the Planemo test report shape: a top-level version, tests, optional summary, and optional exit_code. test_reports validates this JSON before rendering derived reports. merge_test_reports also validates each input report and writes the same full report shape, including recalculated summary and exit_code fields. Older reports without summary are still accepted and summarized during rendering.

The above test example used cwltool to run our test and disabled containerization. By dropping the --no-container argument we can run the tool in a Docker container. By passing an engine argument as --engine toil we can run our test in Toil, an alternative CWL implementation.

$ planemo test seqtk_seq.cwl
Enable beta testing mode for testing.
cwltool INFO: /Users/john/workspace/planemo/.venv/bin/planemo 1.0.20180508202931
cwltool INFO: Resolved '/Users/john/tool_init_exercise/seqtk_seq.cwl' to 'file:///Users/john/tool_init_exercise/seqtk_seq.cwl'
cwltool INFO: [job seqtk_seq.cwl] /private/tmp/docker_tmpUeIpXJ$ docker \
    run \
    -i \
    --volume=/private/tmp/docker_tmpUeIpXJ:/private/var/spool/cwl:rw \
    --volume=/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpteo_2Z:/tmp:rw \
    --volume=/Users/john/tool_init_exercise/test-data/2.fastq:/private/var/lib/cwl/stg939ee60b-a194-4177-8410-c40a1acb38ea/2.fastq:ro \
    --workdir=/private/var/spool/cwl \
    --read-only=true \
    --log-driver=none \
    --user=502:20 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/private/var/spool/cwl \
    quay.io/biocontainers/seqtk:1.2--1 \
    seqtk \
    seq \
    -A \
    /private/var/lib/cwl/stg939ee60b-a194-4177-8410-c40a1acb38ea/2.fastq > /private/tmp/docker_tmpUeIpXJ/out
cwltool INFO: [job seqtk_seq.cwl] completed success
cwltool INFO: Final process status is success
All 1 test(s) executed passed.
seqtk_seq_0: passed
$ planemo test --no-container --engine toil seqtk_seq.cwl
Enable beta testing mode for testing.
All 1 test(s) executed passed.
seqtk_seq_0: passed

For more information on the Common Workflow Language check out the Draft 3 User Guide and Specification.