Advanced Tool Development Topics

This tutorial covers some more advanced tool development topics. It assumes basic knowledge of developing CWL tools and an environment with Planemo available. If you have never developed a CWL tool, check out the CWL User Guide and the Planemo CWL intro tutorial first.

Dependencies and Conda

Specifying and Using Software Requirements

Note

Planemo requires a Conda installation to target with its various Conda-related commands. A properly configured Conda installation can be initialized with the conda_init command. This should only need to be executed once per development machine.

$ planemo conda_init
galaxy.tools.deps.conda_util INFO: Installing conda, this may take several minutes.
wget -q --recursive -O /var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/conda_installLW5zn1.sh https://repo.continuum.io/miniconda/Miniconda3-4.3.31-MacOSX-x86_64.sh
bash /var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/conda_installLW5zn1.sh -b -p /Users/john/miniconda3
PREFIX=/Users/john/miniconda3
installing: python-3.6.3-h47c878a_7 ...
Python 3.6.3 :: Anaconda, Inc.
installing: ca-certificates-2017.08.26-ha1e5d58_0 ...
installing: conda-env-2.6.0-h36134e3_0 ...
installing: libcxxabi-4.0.1-hebd6815_0 ...
installing: tk-8.6.7-h35a86e2_3 ...
installing: xz-5.2.3-h0278029_2 ...
installing: yaml-0.1.7-hc338f04_2 ...
installing: zlib-1.2.11-hf3cbc9b_2 ...
installing: libcxx-4.0.1-h579ed51_0 ...
installing: openssl-1.0.2n-hdbc3d79_0 ...
installing: libffi-3.2.1-h475c297_4 ...
installing: ncurses-6.0-hd04f020_2 ...
installing: libedit-3.1-hb4e282d_0 ...
installing: readline-7.0-hc1231fa_4 ...
installing: sqlite-3.20.1-h7e4c145_2 ...
installing: asn1crypto-0.23.0-py36h782d450_0 ...
installing: certifi-2017.11.5-py36ha569be9_0 ...
installing: chardet-3.0.4-py36h96c241c_1 ...
installing: idna-2.6-py36h8628d0a_1 ...
installing: pycosat-0.6.3-py36hee92d8f_0 ...
installing: pycparser-2.18-py36h724b2fc_1 ...
installing: pysocks-1.6.7-py36hfa33cec_1 ...
installing: python.app-2-py36h54569d5_7 ...
installing: ruamel_yaml-0.11.14-py36h9d7ade0_2 ...
installing: six-1.11.0-py36h0e22d5e_1 ...
installing: cffi-1.11.2-py36hd3e6348_0 ...
installing: setuptools-36.5.0-py36h2134326_0 ...
installing: cryptography-2.1.4-py36h842514c_0 ...
installing: wheel-0.30.0-py36h5eb2c71_1 ...
installing: pip-9.0.1-py36h1555ced_4 ...
installing: pyopenssl-17.5.0-py36h51e4350_0 ...
installing: urllib3-1.22-py36h68b9469_0 ...
installing: requests-2.18.4-py36h4516966_1 ...
installing: conda-4.3.31-py36_0 ...
installation finished.
/Users/john/miniconda3/bin/conda install -y --override-channels --channel iuc --channel conda-forge --channel bioconda --channel defaults conda=4.3.33 conda-build=2.1.18
Fetching package metadata ...................
Solving package specifications: .

Package plan for installation in environment /Users/john/miniconda3:

The following NEW packages will be INSTALLED:

    beautifulsoup4: 4.6.0-py36_0  conda-forge
    conda-build:    2.1.18-py36_0 conda-forge
    conda-verify:   2.0.0-py36_0  conda-forge
    filelock:       3.0.4-py36_0  conda-forge
    jinja2:         2.10-py36_0   conda-forge
    markupsafe:     1.0-py36_0    conda-forge
    pkginfo:        1.4.2-py36_0  conda-forge
    pycrypto:       2.6.1-py36_1  conda-forge
    pyyaml:         3.12-py36_1   conda-forge

The following packages will be UPDATED:

    conda:          4.3.31-py36_0             --> 4.3.33-py36_0 conda-forge

beautifulsoup4 100% |###################################################################| Time: 0:00:00 782.08 kB/s
filelock-3.0.4 100% |###################################################################| Time: 0:00:00   7.95 MB/s
markupsafe-1.0 100% |###################################################################| Time: 0:00:00   5.82 MB/s
pkginfo-1.4.2- 100% |###################################################################| Time: 0:00:00   1.18 MB/s
pycrypto-2.6.1 100% |###################################################################| Time: 0:00:00   1.69 MB/s
pyyaml-3.12-py 100% |###################################################################| Time: 0:00:00   3.31 MB/s
conda-verify-2 100% |###################################################################| Time: 0:00:00   6.91 MB/s
jinja2-2.10-py 100% |###################################################################| Time: 0:00:00   2.81 MB/s
conda-4.3.33-p 100% |###################################################################| Time: 0:00:00 621.27 kB/s
conda-build-2. 100% |###################################################################| Time: 0:00:00   2.16 MB/s
Conda installation succeeded - Conda is available at '/Users/john/miniconda3/bin/conda'

Note

Why not just use containers?

Containers are great. Use containers (be it Docker, Singularity, etc.) whenever possible to increase the reproducibility and portability of your tools and workflows. Building ad hoc containers to support CWL tools (e.g. custom Dockerfile definitions) has serious limitations, however; in the next tutorial on containers we will argue that using BioContainers built or discovered from your tool’s Software Requirements is a superior approach.

Besides leading to better containers, there are other reasons to describe Software Requirements: doing so allows your tool to be used in environments without container runtimes available, and it provides valuable and actionable metadata about the computation described by the tool.

Read more about this whole dependency stack in our preprint Practical computational reproducibility in the life sciences.

The Common Workflow Language specification loosely describes Software Requirements - a way to map CWL hints to packages, environment modules, or any other mechanism to describe dependencies for running a tool outside of a container. The large and active Galaxy tool development community has built an open source library and set of best practices for describing dependencies for Galaxy that should work just as well for CWL. The library has been integrated with cwltool and Toil to enable CWL tool authors and users to leverage the power and flexibility of the Galaxy dependency management and best practices.

While Software Requirements can be configured to resolve dependencies in various ways, Planemo is configured with opinionated defaults geared toward making it as easy as possible to build CWL tools that target Conda, and to build tools with requirements compatible with cwltool and Toil when running outside containers.

During the tool development introductory tutorial, we called planemo tool_init with the argument --requirement seqtk@1.2, and the resulting tool contained a SoftwareRequirement in the form of the following YAML fragment:

SoftwareRequirement:
  packages:
  - package: seqtk
    version:
    - "1.2"

Planemo (and cwltool and Toil) can interpret these SoftwareRequirement annotations in various ways, including as Conda packages. When interpreting them as Conda packages, these runtimes can set up isolated, reproducible Conda environments for tool execution with the correct packages installed (e.g. seqtk in the above example).
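To make this mapping concrete, here is a minimal Python sketch that turns a SoftwareRequirement structure like the one above into Conda package specs and environment names. The helper names are hypothetical and this is not the code used by Planemo, cwltool, or the Galaxy library; the name=version spec and the __name@version environment name are assumptions copied from the conda_install log output shown later in this tutorial, and only versioned packages are handled.

```python
# Hypothetical helpers; not part of Planemo, cwltool, or the Galaxy library.

def conda_targets(software_requirement):
    """Return (package, version) pairs for every versioned package listed."""
    targets = []
    for entry in software_requirement.get("packages", []):
        for version in entry.get("version", []):
            targets.append((entry["package"], version))
    return targets


def conda_spec(package, version):
    """Format a target as a Conda package spec, e.g. seqtk=1.2."""
    return f"{package}={version}"


def env_name(package, version):
    """Assumed naming scheme, copied from the log output: __seqtk@1.2."""
    return f"__{package}@{version}"


# The YAML fragment above, written as the equivalent Python mapping.
requirement = {"packages": [{"package": "seqtk", "version": ["1.2"]}]}

for package, version in conda_targets(requirement):
    print(env_name(package, version), conda_spec(package, version))
```

Keeping the environment name tied to the package and version is what lets a later run find and reuse an environment that was created earlier.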

Note

Why Conda?

Many different package managers could potentially be targeted here, but we focus on Conda for a few key reasons.

  • No compilation at install time - binaries with their dependencies and libraries

  • Support for all operating systems

  • Easy to manage multiple versions of the same recipe

  • HPC-ready: no root privileges needed

  • Easy-to-write YAML recipes

  • Vibrant communities

Note

Conda Terminology

Diagram describing the relationship between Conda, Miniconda, and Anaconda.

Conda recipes build packages that are published to channels.

Planemo is set up to target a few channels by default: iuc, bioconda, conda-forge, and defaults. The whole dependency management scheme outlined here works a lot better if packages can be found in one of these “best practice” channels.

We can check if the requirements on a tool are available in best practice Conda channels using an extended form of the planemo lint command (planemo lint was introduced in the introductory tutorial). Passing the --conda_requirements flag will ensure all listed requirements are found.

$ planemo lint --conda_requirements seqtk_seq.cwl
Linting tool /Users/john/workspace/planemo/docs/writing/seqtk_seq.cwl
  ...
Applying linter requirements_in_conda... CHECK
.. INFO: Requirement [seqtk@1.2] matches target in best practice Conda channel [https://conda.anaconda.org/bioconda/osx-64].

Note

You can download a more complete version of the CWL seqtk seq from the Planemo tutorial using the command:

$ planemo project_init --template=seqtk_complete_cwl seqtk_example
$ cd seqtk_example

We can verify these tool requirements install with the conda_install command. With its default parameters conda_install processes tools and creates isolated environments for their declared Software Requirements (mirroring what can be done in production with cwltool and Toil).

$ planemo conda_install seqtk_seq.cwl
Install conda target CondaTarget[seqtk,version=1.2]
/home/john/miniconda3/bin/conda create -y --name __seqtk@1.2 seqtk=1.2
Fetching package metadata ...............
Solving package specifications: ..........

Package plan for installation in environment /home/john/miniconda2/envs/__seqtk@1.2:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    seqtk-1.2                  |                0          29 KB  bioconda

The following NEW packages will be INSTALLED:

    seqtk: 1.2-0   bioconda
    zlib:  1.2.8-3

Fetching packages ...
seqtk-1.2-0.ta 100% |#############################################################| Time: 0:00:00 444.71 kB/s
Extracting packages ...
[      COMPLETE      ]|################################################################################| 100%
Linking packages ...
[      COMPLETE      ]|################################################################################| 100%
#
# To activate this environment, use:
# > source activate __seqtk@1.2
#
# To deactivate this environment, use:
# > source deactivate __seqtk@1.2
#
$ which seqtk
seqtk not found
$

The above install worked properly, but seqtk is not on your PATH because this merely created an environment within the Conda directory for the seqtk installation. Planemo will configure cwltool during testing to reuse this environment. If you wish to interactively explore the resulting environment, for example to inspect the installed tool or produce test data, the output of the conda_env command can be sourced.
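The reason seqtk is not found can be illustrated with a short Python sketch: the executable lives in the bin directory of the named environment, so a process only sees it if that directory is placed on its PATH. The function names are hypothetical, the envs/NAME/bin layout is the usual POSIX layout of a Conda installation, and the paths are taken from the log above. This is a simplification, since real activation may set more than PATH.

```python
import os


def env_bin_dir(conda_prefix, env_name):
    """Directory holding an environment's executables (POSIX layout assumed)."""
    return os.path.join(conda_prefix, "envs", env_name, "bin")


def environ_with_env(conda_prefix, env_name, base_environ=None):
    """Copy of an environment mapping with the Conda env's bin dir first on PATH."""
    environ = dict(os.environ if base_environ is None else base_environ)
    bin_dir = env_bin_dir(conda_prefix, env_name)
    environ["PATH"] = bin_dir + os.pathsep + environ.get("PATH", "")
    return environ


# Paths copied from the conda_install output above; adjust for your machine.
environ = environ_with_env(
    "/home/john/miniconda3", "__seqtk@1.2", {"PATH": "/usr/bin"}
)
print(environ["PATH"])
```

A mapping built this way could be passed as the env argument of subprocess.run to launch seqtk without changing the interactive shell.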

$ . <(planemo conda_env seqtk_seq.cwl)
Deactivate environment with conda_env_deactivate
(seqtk_seq) $ which seqtk
/home/planemo/miniconda3/envs/jobdepsiJClEUfecc6d406196737781ff4456ec60975c137e04884e4f4b05dc68192f7cec4656/bin/seqtk
(seqtk_seq) $ seqtk seq

Usage:   seqtk seq [options] <in.fq>|<in.fa>

Options: -q INT    mask bases with quality lower than INT [0]
         -X INT    mask bases with quality higher than INT [255]
         -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]
         -l INT    number of residues per line; 0 for 2^32-1 [0]
         -Q INT    quality shift: ASCII-INT gives base quality [33]
         -s INT    random seed (effective with -f) [11]
         -f FLOAT  sample FLOAT fraction of sequences [1]
         -M FILE   mask regions in BED or name list FILE [null]
         -L INT    drop sequences with length shorter than INT [0]
         -c        mask complement region (effective with -M)
         -r        reverse complement
         -A        force FASTA output (discard quality)
         -C        drop comments at the header lines
         -N        drop sequences containing ambiguous bases
         -1        output the 2n-1 reads only
         -2        output the 2n reads only
         -V        shift quality by '(-Q) - 33'
         -U        convert all bases to uppercases
         -S        strip of white spaces in sequences
(seqtk_seq) $ conda_env_deactivate
$

As shown above, a conda_env_deactivate command will be created in this environment and can be used to restore your initial shell configuration.

Here is a portion of the output from the testing command planemo test --no-container seqtk_seq.cwl, demonstrating the use of this tool.

$ planemo test --no-container seqtk_seq.cwl
Enable beta testing mode for testing.
cwltool INFO: /Users/john/workspace/planemo/.venv/bin/planemo 1.0.20170828135420
cwltool INFO: Resolved '/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
cwltool INFO: [job seqtk_seq.cwl] /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpaDQ1nK$ seqtk \
    seq \
    -a \
    /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpJtPKCr/stg24cf7e67-5ca6-44a4-a46b-26cbe104e1d4/2.fastq > /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpaDQ1nK/out
cwltool INFO: [job seqtk_seq.cwl] completed success
cwltool INFO: Final process status is success
galaxy.tools.parser.factory INFO: Loading CWL tool - this is experimental - tool likely will not function in future at least in same way.
All 1 test(s) executed passed.
seqtk_seq_0: passed

Since seqtk isn’t on the path and we did not use a container, we can see the SoftwareRequirement resolution was successful and it found the environment we previously installed with conda_install.

This can be used outside of Planemo testing as well. The following invocation shows running a job with cwltool using an environment like the one created above:

$ cwltool --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
/Users/john/workspace/planemo/.venv/bin/cwltool 1.0.20180508202931
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
No handlers could be found for logger "rdflib.term"
[job seqtk_seq.cwl] /private/tmp/docker_tmpDQYeqC$ seqtk \
    seq \
    -a \
    /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpQwBqPo/stg8cf2282a-d807-4f90-b94d-feeda004cacd/2.fastq > /private/tmp/docker_tmpDQYeqC/out
PREFIX=/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda
installing: python-3.6.3-h47c878a_7 ...
Python 3.6.3 :: Anaconda, Inc.
installing: ca-certificates-2017.08.26-ha1e5d58_0 ...
installing: conda-env-2.6.0-h36134e3_0 ...
installing: libcxxabi-4.0.1-hebd6815_0 ...
installing: tk-8.6.7-h35a86e2_3 ...
installing: xz-5.2.3-h0278029_2 ...
installing: yaml-0.1.7-hc338f04_2 ...
installing: zlib-1.2.11-hf3cbc9b_2 ...
installing: libcxx-4.0.1-h579ed51_0 ...
installing: openssl-1.0.2n-hdbc3d79_0 ...
installing: libffi-3.2.1-h475c297_4 ...
installing: ncurses-6.0-hd04f020_2 ...
installing: libedit-3.1-hb4e282d_0 ...
installing: readline-7.0-hc1231fa_4 ...
installing: sqlite-3.20.1-h7e4c145_2 ...
installing: asn1crypto-0.23.0-py36h782d450_0 ...
installing: certifi-2017.11.5-py36ha569be9_0 ...
installing: chardet-3.0.4-py36h96c241c_1 ...
installing: idna-2.6-py36h8628d0a_1 ...
installing: pycosat-0.6.3-py36hee92d8f_0 ...
installing: pycparser-2.18-py36h724b2fc_1 ...
installing: pysocks-1.6.7-py36hfa33cec_1 ...
installing: python.app-2-py36h54569d5_7 ...
installing: ruamel_yaml-0.11.14-py36h9d7ade0_2 ...
installing: six-1.11.0-py36h0e22d5e_1 ...
installing: cffi-1.11.2-py36hd3e6348_0 ...
installing: setuptools-36.5.0-py36h2134326_0 ...
installing: cryptography-2.1.4-py36h842514c_0 ...
installing: wheel-0.30.0-py36h5eb2c71_1 ...
installing: pip-9.0.1-py36h1555ced_4 ...
installing: pyopenssl-17.5.0-py36h51e4350_0 ...
installing: urllib3-1.22-py36h68b9469_0 ...
installing: requests-2.18.4-py36h4516966_1 ...
installing: conda-4.3.31-py36_0 ...
installation finished.
Fetching package metadata .................
Solving package specifications: .

Package plan for installation in environment /Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda:

The following packages will be UPDATED:

    conda: 4.3.31-py36_0 --> 4.3.33-py36_0 conda-forge

conda-4.3.33-p 100% |#################################################################| Time: 0:00:00   1.13 MB/s


Package plan for installation in environment /Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/cwltool_deps/_conda/envs/__seqtk@1.2:

The following NEW packages will be INSTALLED:

    seqtk: 1.2-1    bioconda
    zlib:  1.2.11-0 conda-forge


[job seqtk_seq.cwl] completed success
{
    "output1": {
        "checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
        "basename": "out",
        "location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
        "path": "/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
        "class": "File",
        "size": 150
    }
}
Final process status is success

This demonstrates that cwltool will install the packages needed on the first run. If we rerun cwltool, it will reuse that previous environment.
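The install-once, reuse-afterwards behavior can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions (an environment counts as installed if its directory exists), not cwltool's actual implementation; ensure_env and fake_create are hypothetical names, and fake_create stands in for a real conda create call.

```python
import os
import tempfile


def ensure_env(envs_dir, env_name, create):
    """Create the environment on first use and reuse it afterwards.

    Returns True when `create` had to be called, False when the env was reused.
    """
    env_path = os.path.join(envs_dir, env_name)
    if os.path.isdir(env_path):
        return False
    create(env_path)
    return True


def fake_create(env_path):
    # Stand-in for running `conda create`; it only makes the directory.
    os.makedirs(env_path)


with tempfile.TemporaryDirectory() as envs_dir:
    first_run = ensure_env(envs_dir, "__seqtk@1.2", fake_create)
    second_run = ensure_env(envs_dir, "__seqtk@1.2", fake_create)

print(first_run, second_run)
```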

$ cwltool --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
/Users/john/workspace/planemo/.venv/bin/cwltool 1.0.20180508202931
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
No handlers could be found for logger "rdflib.term"
[job seqtk_seq.cwl] /private/tmp/docker_tmp4vvE_i$ seqtk \
    seq \
    -a \
    /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpcvQ3Ph/stg2ef3a21c-9fb0-4099-88c2-36e24719901d/2.fastq > /private/tmp/docker_tmp4vvE_i/out
[job seqtk_seq.cwl] completed success
{
    "output1": {
        "checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
        "basename": "out",
        "location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
        "path": "/Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
        "class": "File",
        "size": 150
    }
}
Final process status is success

And the same thing is possible with Toil.

$ cwltoil --no-container --beta-conda-dependencies seqtk_seq.cwl seqtk_seq_job.yml
jlaptop17.local 2018-05-23 15:27:25,754 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
jlaptop17.local 2018-05-23 15:27:25,785 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: '92328fb2-33b7-44cd-879f-41d8cbf94555'
Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
jlaptop17.local 2018-05-23 15:27:25,787 MainThread INFO cwltool: Resolved 'seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl'
jlaptop17.local 2018-05-23 15:27:27,002 MainThread WARNING rdflib.term: http://schema.org/docs/!DOCTYPE html does not look like a valid URI, trying to serialize this will break.
jlaptop17.local 2018-05-23 15:27:27,396 MainThread INFO rdflib.plugins.parsers.pyRdfa: Current options:
        preserve space                         : True
        output processor graph                 : True
        output default graph                   : True
        host language                          : RDFa Core
        accept embedded RDF                    : False
        check rdfa lite                        : False
        cache vocabulary graphs                : False

jlaptop17.local 2018-05-23 15:27:29,797 MainThread INFO toil.common: Using the single machine batch system
jlaptop17.local 2018-05-23 15:27:29,798 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxCores to CPU count of system (8).
jlaptop17.local 2018-05-23 15:27:29,798 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxMemory to physically available memory (17179869184).
jlaptop17.local 2018-05-23 15:27:29,808 MainThread INFO toil.common: Created the workflow directory at /var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/toil-92328fb2-33b7-44cd-879f-41d8cbf94555-132281828025877
jlaptop17.local 2018-05-23 15:27:29,808 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxDisk to physically available disk (202669449216).
jlaptop17.local 2018-05-23 15:27:29,815 MainThread INFO toil.common: User script ModuleDescriptor(dirPath='/Users/john/workspace/planemo/.venv/lib/python2.7/site-packages', name='toil.cwl.cwltoil', fromVirtualEnv=True) belongs to Toil. No need to auto-deploy it.
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: No user script to auto-deploy.
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: Written the environment for the jobs to the environment file
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: Caching all jobs in job store
jlaptop17.local 2018-05-23 15:27:29,816 MainThread INFO toil.common: 0 jobs downloaded.
jlaptop17.local 2018-05-23 15:27:29,911 MainThread INFO toil: Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.
jlaptop17.local 2018-05-23 15:27:29,911 MainThread INFO toil.realtimeLogger: Real-time logging disabled
jlaptop17.local 2018-05-23 15:27:29,937 MainThread INFO toil.toilState: (Re)building internal scheduler state
2018-05-23 15:27:29,937 - toil.toilState - INFO - (Re)building internal scheduler state
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
2018-05-23 15:27:29,938 - toil.leader - INFO - Found 1 jobs to start and 0 jobs with successors to run
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
2018-05-23 15:27:29,938 - toil.leader - INFO - Checked batch system has no running jobs and no updated jobs
jlaptop17.local 2018-05-23 15:27:29,938 MainThread INFO toil.leader: Starting the main loop
2018-05-23 15:27:29,938 - toil.leader - INFO - Starting the main loop
jlaptop17.local 2018-05-23 15:27:29,939 MainThread INFO toil.leader: Issued job 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
2018-05-23 15:27:29,939 - toil.leader - INFO - Issued job 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
jlaptop17.local 2018-05-23 15:27:31,409 MainThread INFO toil.leader: Job ended successfully: 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU
2018-05-23 15:27:31,409 - toil.leader - INFO - Job ended successfully: 'file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/seqtk_seq.cwl' seqtk seq e/V/jobsxUpTU
jlaptop17.local 2018-05-23 15:27:31,411 MainThread INFO toil.leader: Finished the main loop: no jobs left to run
2018-05-23 15:27:31,411 - toil.leader - INFO - Finished the main loop: no jobs left to run
jlaptop17.local 2018-05-23 15:27:31,411 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...
2018-05-23 15:27:31,411 - toil.serviceManager - INFO - Waiting for service manager thread to finish ...
jlaptop17.local 2018-05-23 15:27:31,946 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 0.535056114197 seconds
2018-05-23 15:27:31,946 - toil.serviceManager - INFO - ... finished shutting down the service manager. Took 0.535056114197 seconds
jlaptop17.local 2018-05-23 15:27:31,947 MainThread INFO toil.statsAndLogging: Waiting for stats and logging collator thread to finish ...
2018-05-23 15:27:31,947 - toil.statsAndLogging - INFO - Waiting for stats and logging collator thread to finish ...
jlaptop17.local 2018-05-23 15:27:31,960 MainThread INFO toil.statsAndLogging: ... finished collating stats and logs. Took 0.0131621360779 seconds
2018-05-23 15:27:31,960 - toil.statsAndLogging - INFO - ... finished collating stats and logs. Took 0.0131621360779 seconds
jlaptop17.local 2018-05-23 15:27:31,961 MainThread INFO toil.leader: Finished toil run successfully
2018-05-23 15:27:31,961 - toil.leader - INFO - Finished toil run successfully
{
    "output1": {
        "checksum": "sha1$322e001e5a99f19abdce9f02ad0f02a17b5066c2",
        "basename": "out",
        "nameext": "",
        "nameroot": "out",
        "http://commonwl.org/cwltool#generation": 0,
        "location": "file:///Users/john/workspace/planemo/project_templates/seqtk_complete_cwl/out",
        "class": "File",
        "size": 150
    }
}
jlaptop17.local 2018-05-23 15:27:31,972 MainThread INFO toil.common: Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x10554d490>
2018-05-23 15:27:31,972 - toil.common - INFO - Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x10554d490>

Finding Existing Conda Packages

How did we know what software name and software version to use? We found the existing packages available for Conda and referenced them. To do this yourself, you can simply use the planemo command conda_search. If we do a search for seqt it will show all the software and all the versions available matching that search term - including seqtk.

$ planemo conda_search seqt
/Users/john/miniconda3/bin/conda search --override-channels --channel iuc --channel conda-forge --channel bioconda --channel defaults '*seqt*'
Loading channels: done
# Name                  Version           Build  Channel
bioconductor-htseqtools          1.26.0        r3.4.1_0  bioconda
bioconductor-seqtools          1.10.0        r3.3.2_0  bioconda
bioconductor-seqtools          1.10.0        r3.4.1_0  bioconda
bioconductor-seqtools          1.12.0        r3.4.1_0  bioconda
seqtk                       r75               0  bioconda
seqtk                       r82               0  bioconda
seqtk                       r82               1  bioconda
seqtk                       r93               0  bioconda
seqtk                       1.2               0  bioconda
seqtk                       1.2               1  bioconda
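If you want to pick versions out of such a listing programmatically, a small Python sketch like the following can help. It assumes the four-column layout (name, version, build, channel) shown above, which may differ between Conda versions, and parse_search_rows is a hypothetical helper rather than anything Planemo provides. The sample text is a subset of the output above.

```python
def parse_search_rows(output):
    """Parse `conda search` table rows into dicts, skipping non-table lines."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header comment and anything that is not a four-column row.
        if line.startswith("#") or len(fields) != 4:
            continue
        name, version, build, channel = fields
        rows.append(
            {"name": name, "version": version, "build": build, "channel": channel}
        )
    return rows


sample = """\
Loading channels: done
# Name                  Version           Build  Channel
bioconductor-seqtools          1.12.0        r3.4.1_0  bioconda
seqtk                       r93               0  bioconda
seqtk                       1.2               0  bioconda
seqtk                       1.2               1  bioconda
"""

seqtk_versions = sorted(
    {row["version"] for row in parse_search_rows(sample) if row["name"] == "seqtk"}
)
print(seqtk_versions)
```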

Note

The Planemo command conda_search is a light wrapper around the underlying conda search command but configured to use the same channels and other options as Planemo and Galaxy. The following Conda command would also work to search:

$ $HOME/miniconda3/bin/conda search -c iuc -c conda-forge -c bioconda '*seqt*'

For Conda versions 4.3.X or earlier, the search invocation would be a bit different:

$ $HOME/miniconda3/bin/conda search -c iuc -c conda-forge -c bioconda seqt
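As a rough illustration of what such a wrapper does, the Python sketch below assembles a channel-restricted search command. The channel order and the --override-channels and --channel flags are copied from the planemo conda_search output shown earlier; conda_search_command is a hypothetical helper, not Planemo's own code, and the command is only printed, not executed.

```python
import shlex

# Channel order copied from the planemo conda_search output shown earlier.
CHANNELS = ("iuc", "conda-forge", "bioconda", "defaults")


def conda_search_command(term, conda_exec="conda", channels=CHANNELS):
    """Build the argument list for a channel-restricted `conda search`."""
    command = [conda_exec, "search", "--override-channels"]
    for channel in channels:
        command += ["--channel", channel]
    # Wrap the term in wildcards so partial names such as "seqt" match.
    command.append(f"*{term}*")
    return command


command = conda_search_command("seqt")
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the command as a list keeps the wildcard pattern away from shell expansion if the list is later handed to subprocess.run.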

Alternatively, the Anaconda website can be used to search for packages. Typing seqtk into the search form on that page and clicking the top result will bring you to a page with information about the Bioconda package.

When using the website to search, though, you need to be aware of what channel you are using. By default, Planemo and Galaxy will search a few different Conda channels. While it is possible to configure a local Planemo or Galaxy to target different channels, the current best practice is to add tools to the existing channels.

The existing channels include:

  • Bioconda (github | conda) - best practice channel for various bioinformatics packages.

  • Conda-Forge (github | conda) - best practice channel for general purpose and widely useful computing packages and libraries.

  • iuc (github | conda) - best practice channel for other more Galaxy specific packages.

Exercise - Leveraging Bioconda

Use the project_init command to download this exercise.

$ planemo project_init --template conda_exercises_cwl conda_exercises
$ cd conda_exercises/exercise_1
$ ls
pear.cwl              test-data

This project template contains a few exercises. The first uses a CWL tool for PEAR - Paired-End reAd mergeR. This tool however has no SoftwareRequirement or container annotations and so will not work properly without modification.

  1. Run planemo test pear.cwl to verify the tool does not function without dependencies defined.

  2. Use the --conda_requirements flag with planemo lint to verify it does indeed lack requirements.

  3. Use planemo conda_search or the Anaconda website to search for the correct package and version in a best practice channel.

  4. Update pear.cwl with the correct SoftwareRequirement hints.

  5. Re-run the lint command from above to verify the tool now has the correct dependency definition.

  6. Re-run the test command from above to verify the tool test now works properly.
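What "no SoftwareRequirement or container annotations" means structurally can be sketched in Python. The check below looks for SoftwareRequirement or DockerRequirement entries under hints or requirements, accepting both the list form (entries with a class field) and the map form (keyed by class name) that CWL allows. It is a simplified illustration with hypothetical helper names and trimmed, made-up tool mappings; it is not what planemo lint actually runs.

```python
def declared_classes(tool, section):
    """Names of the process requirements listed under `hints` or `requirements`."""
    entries = tool.get(section) or []
    if isinstance(entries, dict):
        # Map form: keys are the class names.
        return set(entries)
    # List form: each entry carries a `class` field.
    return {entry.get("class") for entry in entries if isinstance(entry, dict)}


def has_dependency_annotations(tool):
    """True if the tool declares a SoftwareRequirement or DockerRequirement."""
    classes = declared_classes(tool, "hints") | declared_classes(tool, "requirements")
    return bool(classes & {"SoftwareRequirement", "DockerRequirement"})


# Hypothetical, heavily trimmed tool descriptions written as Python mappings.
bare_tool = {"class": "CommandLineTool", "baseCommand": ["pear"]}
annotated_tool = {
    "class": "CommandLineTool",
    "baseCommand": ["seqtk", "seq"],
    "hints": {
        "SoftwareRequirement": {
            "packages": [{"package": "seqtk", "version": ["1.2"]}]
        }
    },
}

print(has_dependency_annotations(bare_tool), has_dependency_annotations(annotated_tool))
```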

Building New Conda Packages

Frequently, packages your tool requires are not yet found in Bioconda or conda-forge. In these cases, it is likely best to contribute your package to one of these projects. Unless the tool is exceedingly general, Bioconda is usually the correct starting point.

Note

Many things that are not strictly or even remotely “bio” have been accepted into Bioconda - including tools for image analysis, natural language processing, and cheminformatics.

To quickly learn to write Conda recipes for typical Galaxy tools, please read the following pieces of external documentation.

These guidelines in particular can be skimmed depending on your recipe type; for instance, that document provides specific advice for:

To go a little deeper, you may want to read:

And finally to debug problems the Bioconda troubleshooting documentation may prove useful.

Exercise - Build a Recipe

If you have just completed the exercise above, this exercise can be found in the parent folder. Get there with cd ../exercise_2. If not, the exercise can be downloaded with:

$ planemo project_init --template conda_exercises_cwl conda_exercises
$ cd conda_exercises/exercise_2
$ ls
fleeqtk_seq.cwl      fleeqtk_seq_tests.yml         test-data

This is the skeleton of a tool wrapping the parody bioinformatics software package fleeqtk. fleeqtk is a fork of the project seqtk that many Planemo tutorials are built around and the example tool should hopefully be fairly familiar. fleeqtk version 1.3 can be downloaded from here and built using make. The result of make includes a single executable fleeqtk.

  1. Clone and branch Bioconda.

  2. Build a recipe for fleeqtk version 1.3. You may wish to start from scratch (conda skeleton is not available for C programs like fleeqtk), or copy the recipe of seqtk and modify it for fleeqtk.

  3. Use conda build or Bioconda tooling to build the recipe.

  4. Run planemo test --conda_use_local fleeqtk_seq.cwl to verify the resulting package works as expected.

Congratulations on writing a Conda recipe and building a package! Upon successfully building and testing such a Bioconda package, you would normally push your branch to GitHub and open a pull request. This step is skipped here so as not to pollute Bioconda with unneeded software packages.

Dependencies and Containers

Note

This section is a continuation of Dependencies and Conda; please review that section for background information on resolving Software Requirements with Conda.

Common Workflow Language tools can be annotated with arbitrary Docker requirements, see the CWL User Guide for a discussion about how to do this in general.

This document will discuss some techniques to find containers automatically from the SoftwareRequirement annotations when using Planemo, cwltool, or Toil. You will ultimately want to explicitly annotate your tools with the containers we describe here so that other CWL implementations will be able to find containers for your tool. There are real advantages to using these containers instead of ad hoc ones you may build with a Dockerfile:

  • They provide superior reproducibility because the same binary Conda packages will automatically be used for both bare metal dependencies and inside containers.

  • They are constructed automatically from existing Conda packages so you as a tool developer won’t need to write Dockerfile s or register projects on Docker Hub.

  • They are produced using mulled, which produces very small containers that make deployment easier regardless of the CWL implementation you are using.

  • Annotating Software Requirements reduces the opaqueness of the Docker process. With this method, it is entirely traceable how the container was constructed: which sources were fetched, which exact build of every dependency was used, and how the packages in the container were built. Beyond that, metadata about the packages can be fetched from Bioconda (e.g. this).

Read more about this reproducibility stack in our preprint Practical computational reproducibility in the life sciences.

BioContainers

Note

This section is a continuation of Dependencies and Conda. Please review that section for background information on resolving Software Requirements with Conda.

Finding BioContainers

If a tool contains Software Requirements in best practice Conda channels, a BioContainers-style container can be found or built for it.

As a reminder, planemo lint --conda_requirements <tool.cwl> can be used to check if a tool contains only best-practice requirement tags. The lint command can also be fed the --biocontainers flag to check if a BioContainers container has been registered that is compatible with that tool.

The biocontainer_registered linter, the last one in the lint output shown below, indicates that a container compatible with this tool has indeed been registered: quay.io/biocontainers/seqtk:1.2--1. We didn’t do any extra work to build this container for this tool; all Bioconda recipes are packaged into containers and registered on quay.io as part of the BioContainers project.

The tool can be tested inside its BioContainers Docker container by passing the --biocontainers flag to planemo test, as shown below.

The Conda exercises project template has an example tool (exercise_3) that we can use to demonstrate --biocontainers. If you are continuing from the Conda tutorial, simply move to ../exercise_3; otherwise, use planemo project_init to grab the exercise as shown below.

$ planemo project_init --template conda_exercises_cwl conda_exercises
$ cd conda_exercises/exercise_3
$ planemo lint --biocontainers seqtk_seq.cwl
Linting tool /home/planemo/conda_exercises_cwl/exercise_3/seqtk_seq.cwl
Applying linter general... CHECK
.. CHECK: Tool defines a version [0.0.1].
.. CHECK: Tool defines a name [Convert to FASTA (seqtk)].
.. CHECK: Tool defines an id [seqtk_seq].
.. CHECK: Tool specifies profile version [16.04].
Applying linter cwl_validation... CHECK
.. INFO: CWL appears to be valid.
Applying linter docker_image... WARNING
.. WARNING: Tool does not specify a DockerPull source.
Applying linter new_draft... CHECK
.. INFO: Modern CWL version [v1.0]
Applying linter biocontainer_registered... CHECK
.. INFO: BioContainer best-practice container found [quay.io/biocontainers/seqtk:1.2--1].
Failed linting
$ planemo test --biocontainers seqtk_seq.cwl
Enable beta testing mode for testing.
cwltool INFO: /Users/john/workspace/planemo/.venv/bin/planemo 1.0.20180508202931
cwltool INFO: Resolved '/Users/john/workspace/planemo/project_templates/conda_exercises_cwl/exercise_3/seqtk_seq.cwl' to 'file:///Users/john/workspace/planemo/project_templates/conda_exercises_cwl/exercise_3/seqtk_seq.cwl'
galaxy.tools.deps.containers INFO: Checking with container resolver [ExplicitContainerResolver[]] found description [None]
galaxy.tools.deps.containers INFO: Checking with container resolver [CachedMulledDockerContainerResolver[namespace=biocontainers]] found description [None]
galaxy.tools.deps.containers INFO: Checking with container resolver [MulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/seqtk:1.2--1,type=docker]]
cwltool INFO: [job seqtk_seq.cwl] /private/tmp/docker_tmpMEipaU$ docker \
    run \
    -i \
    --volume=/private/tmp/docker_tmpMEipaU:/private/tmp/docker_tmpMEipaU:rw \
    --volume=/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpxkm9dp:/tmp:rw \
    --volume=/Users/john/workspace/planemo/project_templates/conda_exercises_cwl/exercise_3/test-data/2.fastq:/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpjAVM_1/stgddf6fc2a-dd13-4322-9b88-68571a1697dd/2.fastq:ro \
    --workdir=/private/tmp/docker_tmpMEipaU \
    --read-only=true \
    --log-driver=none \
    --user=502:20 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/private/tmp/docker_tmpMEipaU \
    quay.io/biocontainers/seqtk:1.2--1 \
    seqtk \
    seq \
    -a \
    /private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpjAVM_1/stgddf6fc2a-dd13-4322-9b88-68571a1697dd/2.fastq > /private/tmp/docker_tmpMEipaU/out
cwltool INFO: [job seqtk_seq.cwl] completed success
cwltool INFO: Final process status is success
All 1 test(s) executed passed.
seqtk_seq_0: passed

Exercise - Leveraging Bioconda

  1. Try the above command without the --biocontainers argument. Verify the tool does not run in a container by default.

  2. Add a DockerRequirement based on the lint output above to annotate this tool with a BioContainers Docker container and rerun test to verify the tool works now.
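
As a starting point for the second step, the Python sketch below pulls the image name out of the biocontainer_registered lint line shown earlier and prints a DockerRequirement entry in CWL's map syntax. The helper names image_from_lint and docker_hint are hypothetical, and placing the entry under hints rather than requirements is an assumption; adapt both to your tool.

```python
import re

# Lint line copied from the planemo lint --biocontainers output above.
LINT_LINE = (
    ".. INFO: BioContainer best-practice container found "
    "[quay.io/biocontainers/seqtk:1.2--1]."
)


def image_from_lint(line):
    """Return the image named inside [...] on a 'container found' line, or None."""
    match = re.search(r"container found \[([^\]]+)\]", line)
    return match.group(1) if match else None


def docker_hint(image):
    """Render a CWL hints block asking the runner to pull the given image."""
    # Assumption: the tool has no existing hints section to merge into.
    return "hints:\n  DockerRequirement:\n    dockerPull: {}\n".format(image)


image = image_from_lint(LINT_LINE)
print(docker_hint(image))
```

If the tool already has a hints section, merge the DockerRequirement entry into it rather than adding a second hints key.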

Building BioContainers

In the seqtk example above, the relevant BioContainer already existed on quay.io, but this won’t always be the case. For tools that contain multiple Software Requirements, a pre-built container likely won’t exist. The mulled toolkit (distributed with Planemo or available standalone) can be used to build containers for such tools. If cwltool or Toil is configured to use BioContainers, it will attempt to build these containers on the fly by default (though this behavior can be disabled).

You can try it directly using the mull command in Planemo. The conda_testing Planemo project template has a toy example tool with two requirements for demonstrating this - bwa_and_samtools.cwl.

$ planemo project_init --template=conda_testing_cwl conda_testing
$ cd conda_testing/
$ planemo mull bwa_and_samtools.cwl
/Users/john/.planemo/involucro -v=3 -f /Users/john/workspace/planemo/.venv/lib/python2.7/site-packages/galaxy_lib-17.9.0-py2.7.egg/galaxy/tools/deps/mulled/invfile.lua -set CHANNELS='iuc,bioconda,r,defaults,conda-forge' -set TEST='true' -set TARGETS='samtools=1.3.1,bwa=0.7.15' -set REPO='quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820' -set BINDS='build/dist:/usr/local/' -set PREINSTALL='conda install --quiet --yes conda=4.3' build
/Users/john/.planemo/involucro -v=3 -f /Users/john/workspace/planemo/.venv/lib/python2.7/site-packages/galaxy_lib-17.9.0-py2.7.egg/galaxy/tools/deps/mulled/invfile.lua -set CHANNELS='iuc,bioconda,r,defaults,conda-forge' -set TEST='true' -set TARGETS='samtools=1.3.1,bwa=0.7.15' -set REPO='quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820' -set BINDS='build/dist:/usr/local/' -set PREINSTALL='conda install --quiet --yes conda=4.3' build
[Jun 19 11:28:35] DEBU Run file [/Users/john/workspace/planemo/.venv/lib/python2.7/site-packages/galaxy_lib-17.9.0-py2.7.egg/galaxy/tools/deps/mulled/invfile.lua]
[Jun 19 11:28:35] STEP Run image [continuumio/miniconda:latest] with command [[rm -rf /data/dist]]
[Jun 19 11:28:35] DEBU Creating container [step-730a02d79e]
[Jun 19 11:28:35] DEBU Created container [5e4b5f83c455 step-730a02d79e], starting it
[Jun 19 11:28:35] DEBU Container [5e4b5f83c455 step-730a02d79e] started, waiting for completion
[Jun 19 11:28:36] DEBU Container [5e4b5f83c455 step-730a02d79e] completed with exit code [0] as expected
[Jun 19 11:28:36] DEBU Container [5e4b5f83c455 step-730a02d79e] removed
[Jun 19 11:28:36] STEP Run image [continuumio/miniconda:latest] with command [[/bin/sh -c conda install --quiet --yes conda=4.3 && conda install  -c iuc -c bioconda -c r -c defaults -c conda-forge  samtools=1.3.1 bwa=0.7.15 -p /usr/local --copy --yes --quiet]]
[Jun 19 11:28:36] DEBU Creating container [step-e95bf001c8]
[Jun 19 11:28:36] DEBU Created container [72b9ca0e56f8 step-e95bf001c8], starting it
[Jun 19 11:28:37] DEBU Container [72b9ca0e56f8 step-e95bf001c8] started, waiting for completion
[Jun 19 11:28:46] SOUT Fetching package metadata .........
[Jun 19 11:28:47] SOUT Solving package specifications: .
[Jun 19 11:28:50] SOUT
[Jun 19 11:28:50] SOUT Package plan for installation in environment /opt/conda:
[Jun 19 11:28:50] SOUT
[Jun 19 11:28:50] SOUT The following packages will be UPDATED:
[Jun 19 11:28:50] SOUT
[Jun 19 11:28:50] SOUT conda: 4.3.11-py27_0 --> 4.3.22-py27_0
[Jun 19 11:28:50] SOUT
[Jun 19 11:29:04] SOUT Fetching package metadata .................
[Jun 19 11:29:06] SOUT Solving package specifications: .
[Jun 19 11:29:56] SOUT
[Jun 19 11:29:56] SOUT Package plan for installation in environment /usr/local:
[Jun 19 11:29:56] SOUT
[Jun 19 11:29:56] SOUT The following NEW packages will be INSTALLED:
[Jun 19 11:29:56] SOUT
[Jun 19 11:29:56] SOUT bwa:        0.7.15-1      bioconda
[Jun 19 11:29:56] SOUT curl:       7.52.1-0
[Jun 19 11:29:56] SOUT libgcc:     5.2.0-0
[Jun 19 11:29:56] SOUT openssl:    1.0.2l-0
[Jun 19 11:29:56] SOUT pip:        9.0.1-py27_1
[Jun 19 11:29:56] SOUT python:     2.7.13-0
[Jun 19 11:29:56] SOUT readline:   6.2-2
[Jun 19 11:29:56] SOUT samtools:   1.3.1-5       bioconda
[Jun 19 11:29:56] SOUT setuptools: 27.2.0-py27_0
[Jun 19 11:29:56] SOUT sqlite:     3.13.0-0
[Jun 19 11:29:56] SOUT tk:         8.5.18-0
[Jun 19 11:29:56] SOUT wheel:      0.29.0-py27_0
[Jun 19 11:29:56] SOUT zlib:       1.2.8-3
[Jun 19 11:29:56] SOUT
[Jun 19 11:29:57] DEBU Container [72b9ca0e56f8 step-e95bf001c8] completed with exit code [0] as expected
[Jun 19 11:29:57] DEBU Container [72b9ca0e56f8 step-e95bf001c8] removed
[Jun 19 11:29:57] STEP Wrap [build/dist] as [quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0]
[Jun 19 11:29:57] DEBU Creating container [step-6f1c176372]
[Jun 19 11:29:58] DEBU Packing succeeded

As the output indicates, this command built the container named quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0. This is the same namespace / URL that would be used if or when published by the BioContainers project.

Note

The first part of this mulled-v2 name is a hash of the names of the packages that went into the container; the second part encodes the package versions used and the build number. Check out the Multi-package Containers web application to explore best practice channels and build such hashes.
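
To make the naming scheme concrete, here is a minimal Python sketch of how such a name could be derived. It assumes the scheme is a SHA-1 digest over the sorted, newline-joined package names, a second SHA-1 digest over the versions in the same order, and a trailing build number. That is an assumption about mulled's behavior rather than something this tutorial states, and mulled_v2_name is a hypothetical helper, not part of the mulled toolkit.

```python
import hashlib


def mulled_v2_name(targets, image_build=None):
    """Build a mulled-v2 style name from (package_name, version) pairs.

    Assumed scheme: SHA-1 over the sorted package names, then SHA-1 over
    the versions in the same order, then an optional "-<build>" suffix.
    """
    ordered = sorted(targets, key=lambda target: target[0])
    names = "\n".join(name for name, _ in ordered)
    versions = "\n".join(version for _, version in ordered)
    name_hash = hashlib.sha1(names.encode()).hexdigest()
    version_hash = hashlib.sha1(versions.encode()).hexdigest()
    tag = version_hash if image_build is None else "{}-{}".format(version_hash, image_build)
    return "mulled-v2-{}:{}".format(name_hash, tag)


# The two requirements of bwa_and_samtools.cwl, with build number 0.
print(mulled_v2_name([("samtools", "1.3.1"), ("bwa", "0.7.15")], image_build="0"))
```

Because the packages are sorted before hashing, the order in which requirements are listed does not affect the name. Compare the printed name with the one in the planemo mull output above; if they differ, treat the mulled toolkit's own output as authoritative.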

We can see this new container by running docker images, and we can explore it interactively with docker run.

$ docker images
REPOSITORY                                                                 TAG                                          IMAGE ID            CREATED              SIZE
quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40   03dc1d2818d9de56938078b8b78b82d967c1f820-0   a740fe1e6a9e        16 hours ago         104 MB
quay.io/biocontainers/seqtk                                                1.2--0                                       10bc359ebd30        2 days ago           7.34 MB
continuumio/miniconda                                                      latest                                       6965a4889098        3 weeks ago          437 MB
bgruening/busybox-bash                                                     0.1                                          3d974f51245c        9 months ago         6.73 MB
$ docker run -i -t quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0 /bin/bash
bash-4.2# which samtools
/usr/local/bin/samtools
bash-4.2# which bwa
/usr/local/bin/bwa

As before, we can test running the tool inside its container in cwltool using the --biocontainers flag.

$ planemo test --biocontainers bwa_and_samtools.cwl
Enable beta testing mode for testing.
cwltool INFO: /Users/john/workspace/planemo/.venv/bin/planemo 1.0.20180508202931
cwltool INFO: Resolved '/Users/john/workspace/planemo/project_templates/conda_testing_cwl/bwa_and_samtools.cwl' to 'file:///Users/john/workspace/planemo/project_templates/conda_testing_cwl/bwa_and_samtools.cwl'
galaxy.tools.deps.containers INFO: Checking with container resolver [ExplicitContainerResolver[]] found description [None]
galaxy.tools.deps.containers INFO: Checking with container resolver [CachedMulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0,type=docker]]
cwltool INFO: [job bwa_and_samtools.cwl] /private/tmp/docker_tmpYJnmO4$ docker \
    run \
    -i \
    --volume=/private/tmp/docker_tmpYJnmO4:/private/tmp/docker_tmpYJnmO4:rw \
    --volume=/private/var/folders/78/zxz5mz4d0jn53xf0l06j7ppc0000gp/T/tmpVI06me:/tmp:rw \
    --workdir=/private/tmp/docker_tmpYJnmO4 \
    --read-only=true \
    --user=502:20 \
    --rm \
    --env=TMPDIR=/tmp \
    --env=HOME=/private/tmp/docker_tmpYJnmO4 \
    quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0 \
    sh \
    -c \
    'bwa > bwa_help.txt 2>&1; samtools > samtools_help.txt 2>&1'
cwltool INFO: [job bwa_and_samtools.cwl] completed success
cwltool INFO: Final process status is success
All 1 test(s) executed passed.
bwa_and_samtools_0: passed

In particular take note of the line:

galaxy.tools.deps.containers INFO: Checking with container resolver [CachedMulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0,type=docker]]

Here we can see that the container ID (quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:03dc1d2818d9de56938078b8b78b82d967c1f820-0) from earlier has been cached on our Docker host and is picked up by cwltool. This container is used to run the simple tool tests, and indeed they pass.

In our initial seqtk example, the container resolver that matched was of type MulledDockerContainerResolver, indicating that the Docker image would be downloaded from the BioContainers repository. This time, the resolver that matched was of type CachedMulledDockerContainerResolver, meaning that cwltool would just use the locally cached version from the Docker host (i.e. the one we built with planemo mull above).
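
The fallback order seen in the two logs can be modeled in a few lines of Python. This is a toy model only: the resolver names are taken from the log output above, but resolve_container and its arguments are hypothetical, and the real resolvers query Docker and quay.io rather than in-memory sets.

```python
def resolve_container(image, explicit=None, local_images=(), registry_images=()):
    """Return (resolver_name, image) from the first resolver that matches.

    Resolvers are tried in the order they appear in the logs above; None is
    returned when no resolver can supply a container.
    """
    resolvers = [
        # An image named directly by the tool (e.g. via DockerRequirement).
        ("ExplicitContainerResolver", lambda: explicit),
        # An image already present on the local Docker host.
        ("CachedMulledDockerContainerResolver",
         lambda: image if image in local_images else None),
        # An image published in the BioContainers registry.
        ("MulledDockerContainerResolver",
         lambda: image if image in registry_images else None),
    ]
    for name, lookup in resolvers:
        found = lookup()
        if found is not None:
            return name, found
    return None


SEQTK = "quay.io/biocontainers/seqtk:1.2--1"
MULLED = ("quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40"
          ":03dc1d2818d9de56938078b8b78b82d967c1f820-0")

# seqtk: not cached locally but published, so the remote resolver matches.
print(resolve_container(SEQTK, registry_images={SEQTK})[0])
# bwa + samtools: built locally with planemo mull, so the cached resolver matches.
print(resolve_container(MULLED, local_images={MULLED})[0])
```

Trying the cached resolver before the remote one is what lets a locally built image be tested before anything has been published.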

Note

Planemo doesn’t yet expose options that make it possible to build mulled containers for local packages that have yet to be published to anaconda.org, but the mulled toolkit allows this. See the mulled documentation for more information. Once a container for a local package has been built with mulled-build-tool, the --biocontainers flag should work to test it.

Publishing BioContainers

Building unpublished BioContainers on the fly is great for testing, but for production use and to increase reproducibility, such containers should ideally be published as well.

BioContainers maintains a registry of package combinations to be published using these long mulled hashes. This registry is represented as a GitHub repository named multi-package-containers. The Planemo command container_register will inspect a tool and open a GitHub pull request to add the tool’s combination of packages to the registry. Once merged, this pull request will result in the corresponding BioContainers image being published (with the correct mulled hash as its name); these images can subsequently be picked up by Galaxy.

Various GitHub-related settings need to be configured in order for Planemo to be able to open pull requests on your behalf as part of the container_register command. To simplify all of this, the Planemo community maintains a list of GitHub repositories containing Galaxy and/or CWL tools that are scanned daily by Travis. For each such repository, the Travis job will run container_register across all tools in the repository, resulting in registry pull requests for all new combinations of packages. This list is maintained in a script named monitor.sh in the planemo-monitor repository. The easiest way to ensure new containers are built for your tools is simply to open a pull request to add your tool repositories to this list.