Building Common Workflow Language Tools Using Planemo

This tutorial is a gentle introduction to writing Common Workflow Language tools using Planemo. Please read the installation instructions for Planemo if you have not already installed it.

The Basics

This guide is going to demonstrate building up tools for commands from Heng Li’s Seqtk package - a package for processing sequence data in FASTA and FASTQ files.

To get started let’s install Seqtk. Here we are going to use conda to install Seqtk - but however you obtain it should be fine.

$ conda install --force -c bioconda seqtk=1.2
    ... seqtk installation ...
$ seqtk seq
        Usage:   seqtk seq [options] <in.fq>|<in.fa>
        Options: -q INT    mask bases with quality lower than INT [0]
                 -X INT    mask bases with quality higher than INT [255]
                 -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]
                 -l INT    number of residues per line; 0 for 2^32-1 [0]
                 -Q INT    quality shift: ASCII-INT gives base quality [33]
                 -s INT    random seed (effective with -f) [11]
                 -f FLOAT  sample FLOAT fraction of sequences [1]
                 -M FILE   mask regions in BED or name list FILE [null]
                 -L INT    drop sequences with length shorter than INT [0]
                 -c        mask complement region (effective with -M)
                 -r        reverse complement
                 -A        force FASTA output (discard quality)
                 -C        drop comments at the header lines
                 -N        drop sequences containing ambiguous bases
                 -1        output the 2n-1 reads only
                 -2        output the 2n reads only
                 -V        shift quality by '(-Q) - 33'

Next we will download an example FASTQ file and try out a simple Seqtk command: seq, which converts FASTQ files into FASTA.

$ wget https://raw.githubusercontent.com/galaxyproject/galaxy-test-data/master/2.fastq
$ seqtk seq -A 2.fastq > 2.fasta
$ cat 2.fasta
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
>EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
>EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
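
To make concrete what this conversion does, here is a minimal Python sketch. It is not how seqtk is implemented; it assumes a simple FASTQ layout with exactly four lines per record (header, sequence, separator, quality) and no wrapped sequences. The function name fastq_to_fasta is hypothetical, and the quality string in the usage example is a made-up placeholder.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, discarding quality lines.

    Assumes each record is exactly: '@' header, sequence, '+' line, quality.
    """
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    fasta_lines = []
    # Step through the input one four-line record at a time.
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header starting with '@': %r" % header)
        fasta_lines.append(">" + header[1:])  # FASTA headers start with '>'
        fasta_lines.append(sequence)
    return "".join(line + "\n" for line in fasta_lines)


# Usage: the header and sequence come from the output above; the quality
# string is a placeholder invented for this example.
record = "@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n" + "I" * 25 + "\n"
print(fastq_to_fasta(record), end="")
```

The sketch only covers the -A behaviour (dropping quality); the other options listed in the help message are not modelled.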

Common Workflow Language tool files are just simple YAML files, so at this point one could just open a text editor and start implementing the tool. Planemo has a command tool_init to quickly generate a skeleton to work from, so let’s start by doing that.

$ planemo tool_init --cwl --id 'seqtk_seq' --name 'Convert to FASTA (seqtk)'

The tool_init command can take various complex arguments - but the three most basic ones are shown above: --cwl, --id and --name. The --cwl flag tells Planemo to generate a Common Workflow Language tool. --id is a short identifier for this tool and it should be unique across all tools. --name is a short, human-readable name for the tool - it corresponds to the label attribute in the CWL tool document.

The above command will generate the file seqtk_seq.cwl - which should look like this.

#!/usr/bin/env cwl-runner
cwlVersion: 'cwl:draft-3'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
inputs: [] # TODO
outputs: [] # TODO
baseCommand: []
arguments: []
description: |
   TODO: Fill in description.

This tool file has the common fields required for a CWL tool with TODO notes, but you will still need to open an editor and fill out the command, describe the input parameters and tool outputs, write up a help section, and so on.

The tool_init command can do a little better than this. We can use the test command we tried above, seqtk seq -A 2.fastq > 2.fasta, as an example to generate a command block by specifying the inputs and the outputs as follows.

$ planemo tool_init --force \
                    --cwl \
                    --id 'seqtk_seq' \
                    --name 'Convert to FASTA (seqtk)' \
                    --example_command 'seqtk seq -A 2.fastq > 2.fasta' \
                    --example_input 2.fastq \
                    --example_output 2.fasta

This will generate the following CWL tool definition - which now has correct definitions for the input, output, and command specified. These represent a best guess by planemo, and in most cases will need to be tweaked manually after the tool is generated.

#!/usr/bin/env cwl-runner
cwlVersion: 'cwl:draft-3'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
inputs:
  - id: input1
    type: File
    description: |
      TODO
    inputBinding:
      position: 1
      prefix: "-a"
outputs:
  - id: output1
    type: File
    outputBinding:
      glob: out
baseCommand:
  - "seqtk"
  - "seq"
arguments: []
stdout: out
description: |
   TODO: Fill in description.
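
To see how the fields of such a definition map onto a command line, the following Python sketch assembles one from a simplified, dict-based copy of the tool. It is an illustration under stated assumptions, not the algorithm used by cwltool or Planemo: it only handles baseCommand, plain string arguments, and inputs with an optional prefix and position, and it ignores the stdout redirection. The function build_command and the dict layout are hypothetical.

```python
def build_command(tool, job):
    """Assemble an argv list from a simplified CWL-like tool description."""
    argv = list(tool.get("baseCommand", []))
    # Plain string arguments are treated as having position 0 in this sketch.
    parts = [(0, [arg]) for arg in tool.get("arguments", [])]
    for inp in tool.get("inputs", []):
        binding = inp.get("inputBinding")
        if binding is None or inp["id"] not in job:
            continue  # unbound or unset inputs add nothing to the command
        tokens = []
        if "prefix" in binding:
            tokens.append(binding["prefix"])
        tokens.append(str(job[inp["id"]]))
        parts.append((binding.get("position", 0), tokens))
    # sorted() is stable, so entries with equal positions keep their order.
    for _, tokens in sorted(parts, key=lambda part: part[0]):
        argv.extend(tokens)
    return argv


# A dict mirroring the relevant fields of the generated tool above.
generated = {
    "baseCommand": ["seqtk", "seq"],
    "arguments": [],
    "inputs": [{"id": "input1", "inputBinding": {"position": 1, "prefix": "-a"}}],
}
print(build_command(generated, {"input1": "2.fastq"}))
# prints ['seqtk', 'seq', '-a', '2.fastq']
```

Note that the example command used -A, while the generated binding attaches -a as a prefix to input1. This is the kind of best-guess detail worth reviewing by hand.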

As shown at the beginning of this section, the command seqtk seq generates a help message for the seq command. tool_init can take that help message and stick it right in the generated tool file using the help_from_command option.

Generally, command help messages aren’t exactly appropriate for tools since they mention argument names and similar details that are abstracted away by the tool - but they can be an excellent place to start.

The following tool_init call has been enhanced to use --help_from_command. It also adds the --container and --test_case options.

$ planemo tool_init --force \
                    --cwl \
                    --id 'seqtk_seq' \
                    --name 'Convert to FASTA (seqtk)' \
                    --example_command 'seqtk seq -A 2.fastq > 2.fasta' \
                    --example_input 2.fastq \
                    --example_output 2.fasta \
                    --container 'dukegcb/seqtk' \
                    --test_case \
                    --help_from_command 'seqtk seq'

This command generates the following CWL YAML file.

#!/usr/bin/env cwl-runner
cwlVersion: 'cwl:draft-3'
class: CommandLineTool
id: "seqtk_seq"
label: "Convert to FASTA (seqtk)"
hints:
  - class: DockerRequirement
    dockerPull: dukegcb/seqtk
inputs:
  - id: input1
    type: File
    description: |
      TODO
    inputBinding:
      position: 1
      prefix: "-a"
outputs:
  - id: output1
    type: File
    outputBinding:
      glob: out
baseCommand:
  - "seqtk"
  - "seq"
arguments: []
stdout: out
description: |
  
  Usage:   seqtk seq [options] <in.fq>|<in.fa>
  
  Options: -q INT    mask bases with quality lower than INT [0]
           -X INT    mask bases with quality higher than INT [255]
           -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]
           -l INT    number of residues per line; 0 for 2^32-1 [0]
           -Q INT    quality shift: ASCII-INT gives base quality [33]
           -s INT    random seed (effective with -f) [11]
           -f FLOAT  sample FLOAT fraction of sequences [1]
           -M FILE   mask regions in BED or name list FILE [null]
           -L INT    drop sequences with length shorter than INT [0]
           -c        mask complement region (effective with -M)
           -r        reverse complement
           -A        force FASTA output (discard quality)
           -C        drop comments at the header lines
           -N        drop sequences containing ambiguous bases
           -1        output the 2n-1 reads only
           -2        output the 2n reads only
           -V        shift quality by '(-Q) - 33'
           -U        convert all bases to uppercases
  

At this point we have a fairly functional CWL tool with a test and help. This was a pretty simple example - usually you will need to put more work into the tool to get to this point - tool_init is really just designed to get you started.

Now let’s lint and test the tool we have developed. Planemo’s lint (or just l) command will review the tool for CWL validity, obvious mistakes, and compliance with IUC best practices.

$ planemo l seqtk_seq.cwl
Linting tool /home/john/workspace/planemo/docs/writing/seqtk_seq.cwl
Applying linter general... CHECK
.. CHECK: Tool defines a version [0.0.1].
.. CHECK: Tool defines a name [Convert to FASTA (seqtk)].
.. CHECK: Tool defines an id [seqtk_seq_v3].
Applying linter cwl_validation... CHECK
.. INFO: CWL appears to be valid.
Applying linter docker_image... CHECK
.. INFO: Tool will run in Docker image [dukegcb/seqtk].
Applying linter new_draft... CHECK
.. INFO: Modern CWL version [cwl:draft-3]

In addition to the actual tool file, a test file will be generated using the example command and provided test data. The file contents are as follows:

#!planemo test
- doc: test generated from example command
  job:
    input1:
      class: File
      value: test-data/2.fastq
  outputs:
    output1:
      path: test-data/2.fasta

This file is a Planemo-specific artifact. It may contain one or more tests - each test is an element of the top-level list. tool_init will use the example command to build just one test.

Each test consists of a few parts:

  • doc - this attribute provides a short description for the test.
  • job - this can be the path to a CWL job description or a job description embedded right in the test (tool_init builds the latter).
  • outputs - this section describes the expected output for a test. Each output ID of the tool or workflow under test can appear as a key. The example above only checks that the output file contents match an expected file exactly, but many more expectations can be described.
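
The exact-contents expectation described in the last item can be sketched in Python as follows. This is only an illustration of the idea, not Planemo's implementation: the function check_outputs and the actual_outputs mapping (output ID to the path a run produced) are hypothetical, and the test entry is represented as a plain dict mirroring the file above.

```python
import filecmp


def check_outputs(test, actual_outputs):
    """Return the IDs of outputs that are missing or differ from expectations.

    `test` mirrors one entry of the test file; `actual_outputs` maps output
    IDs to the paths produced by a run (a hypothetical structure).
    """
    failures = []
    for output_id, expectation in test["outputs"].items():
        actual_path = actual_outputs.get(output_id)
        # shallow=False forces a byte-by-byte comparison of file contents.
        if actual_path is None or not filecmp.cmp(
            expectation["path"], actual_path, shallow=False
        ):
            failures.append(output_id)
    return failures


# One test entry, written as the dict the YAML above would load into.
example_test = {
    "doc": "test generated from example command",
    "job": {"input1": {"class": "File", "value": "test-data/2.fastq"}},
    "outputs": {"output1": {"path": "test-data/2.fasta"}},
}
```

Calling check_outputs(example_test, {"output1": "out"}) would return an empty list when the file out matches test-data/2.fasta byte for byte, assuming both files exist.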

The tests described in this file can be run using the planemo test (or just t) command on the original tool file. By default, planemo will run tool tests with Galaxy, but we can also specify the use of cwltool (the reference implementation of CWL), which will be quicker and more robust while Galaxy support for CWL is still in development.

$ planemo test --no-container --engine cwltool seqtk_seq.cwl
Enable beta testing mode to test artifact that isn't a Galaxy tool.
All 1 test(s) executed passed.
seqtk_seq_0: passed

We can also open up the Galaxy web interface with this tool loaded using the serve (or just s) command.

$ planemo s --cwl seqtk_seq.cwl
...
serving on http://127.0.0.1:9090

Open up http://127.0.0.1:9090 in a web browser to view your new tool.

For more information on the Common Workflow Language check out the Draft 3 User Guide and Specification.