Converting an Illumina workflow to a Singularity container

As part of my job, I support various labs (and other users) on campus.  My work includes hardware maintenance, system administration, and software development.  One of the labs on campus (the Quake Lab) asked me to automate part of their DNA post-sequencing demultiplexing and delivery process.  I did that, but I did it in a fairly non-portable way: The code works with multiple sequencers, and it can be moved to other Linux systems, but it has a number of annoying dependencies.

As time goes on, I know that I will need to move this software to other hardware.  With multiple external dependencies—a third-party package, a few things from EPEL, and some other code—this is not an easy process.  However, there is a solution!  Containers!!

However however, this code needs to be run by (and as) regular people, and it needs to interact with the local filesystem.  Docker might not be the best option here.  Instead, I am going to use Singularity for my containers.

This document describes how to take an existing workflow and build it as a Singularity container, so that it may be easily moved between systems.

Organization

Here are all the sections of this post.  You should feel free to read at your own pace, going back to previous sections whenever you want.

Here are some skills and tools necessary to get the most from this post:

  • You should have basic to intermediate Linux shell knowledge: How to run programs, list files, etc.
  • You will need access to a system where you can run things as root, either by logging in directly as root or by using the sudo command.
  • Knowing how to build software (configure; make; make install) is helpful.
  • It would be great if you had a place where you could run VirtualBox yourself, so that you could try to containerize some of your own stuff as you read this post.

With that, let’s begin!

The Workflow


Before we get into the conversion, let’s go over the existing workflow!

This code occupies the second stage of a pipeline involving DNA sequencing.  Here’s a very brief description of the start of the pipeline:

  • Stage 1: Given samples, prepare the library (that is, the DNA to be sequenced) along with a samplesheet.  The library, along with a flow cell, other supplies, and (sometimes) a samplesheet are put into the sequencer.
    This stage is performed by hand, partially in a wet lab (working with the samples & library) and partially at a computer (preparing the samplesheet and related sequence paperwork).
    The output of this stage is a run folder, containing (among other things) BCL (or CBCL) files containing the direct output of the sequencer run.
  • Stage 2: Given a runfolder and a samplesheet, read in the BCL (or CBCL) files and output multiple FASTQ files.
    This stage is automated, and is what we will be containerizing!
    The output of this stage is a set of project directories, created based on the samplesheet, and delivered to the person who requested them.  Another product of this stage is an HTML file describing the results of the conversion, so we can make sure the conversion went well before delivering the project directories to the researcher who wants the data.
  • Stage 3+: Given a folder of FASTQ files, do work on them!
    This is the most variable stage, because it’s where the researcher does her research!

The sequencers we are working with are Illumina MiSeq and NextSeq sequencers, along with the NovaSeq series.  The conversion software in use is Illumina’s bcl2fastq software.  Our workflow software is mainly concerned with watching the run folder (looking for the signs that the run is complete), grabbing the samplesheet (if it’s not already in the run folder), running bcl2fastq, emailing the results HTML file, and delivering the results to the researcher.

The workflow is written in Perl.  bcl2fastq is available from Illumina as source (C++) and as a pre-built RPM package.

About Containers


From an outside perspective, a container is a file.  The file (in Singularity terms, a container image) has almost everything that you would have on a hard drive.  Container files include most of a base operating system, with core libraries (like libc), the shared library loader, and a shell.  On top of the base operating system, you add the software that your code relies on.  This includes things like Perl, Python, OpenSSL, etc.  Finally, you add your code to the container.

One thing that containers do not have is a kernel.  Containers use the kernel that the host system is running, so a container is not something you can boot by itself.  If you had that, you would have a virtual machine!

Up to now, I’ve been describing container attributes that are pretty common across container technologies.  What comes next, and what is specific to Singularity, is the level of separation.  Unlike many container technologies, Singularity does not try to completely separate the container from the outside world.  Singularity containers are designed to be read-only, so they rely on the host operating system for all custom storage.  Also, Singularity does not try to completely separate the container environment from the host OS: Most directories (for example, /home) are visible inside the container by default.  When you run something inside a Singularity container, it runs as you, with all of your access.

The Environment


Now that I’ve explained what the workflow does, and why I’m using Singularity, but before I start trying to convert anything, I should quickly cover the environment I used to create the containers.

I am doing all of this work in a VirtualBox virtual machine, running Ubuntu 16.10.  The virtual machine has one CPU and 1 GB of RAM.  For this article, I set up the VM from scratch, using the normal Ubuntu Desktop installer.

For my needs, this was perfectly fine.  I was able to build the container and test it.  You may have different needs.

It is important that you have a system where you can get root access, because part of the container creation process involves running a command as root (or using sudo).  If you don’t have a system where you can get root access, and you don’t have the resources to create a virtual machine, you can get a free, short-term (2 hours) cloud instance over at dply.co.

Now, we can actually get to talking about Singularity!  Or, if you already have the latest Singularity (version 2.3 or later), feel free to skip to the next section.

Getting Singularity


At the time I wrote this article, Singularity 2.3 was close to release.  For that reason, I’m going to be building Singularity from source, using the development branch on GitHub (that’s where the 2.3 work was happening).  If you choose to build from source, you should build instead from one of the tags (such as “2.2.1” for release 2.2.1).

Here’s what I’m doing to build Singularity on Ubuntu:

  1. I install Git, a basic build environment (build-essential), and the Autotools (autoconf, automake, and libtool).  The Autotools are needed because configure, the script normally included in a release, doesn’t exist in the Git repository: The configure script is generated at release-time, so for now you need to build it.  This step is done as root.
  2. I clone Singularity’s Git repository and check out the development branch.
  3. I run autogen.sh.  This script is included in the Git repo, and is responsible for running autoconf and automake in the appropriate order (or, if a newer tool called autoreconf is available, that is called).  At the end, you have your configure script, as well as a number of other files that you don’t really care about.
  4. I run configure.  If I wanted, I could include the --prefix option to tell configure where the software should be installed (like, in my home directory), but I’m not going to do that.
  5. I run make to build everything, and then make install to install it.  make install is run as root.
    NOTE: Running make install as root is important!  Some of Singularity’s actions (like singularity exec) can be run as a regular user; but in order for that to work, Singularity needs to be “suid” (or “set uid”).  This means Singularity is able to perform some actions as root, even though it isn’t root.  This gets set up at install time, which is why make install must be run as root.  This applies even if using --prefix!

When everything is done, I should be able to run singularity --version and get a result.  No, it’s not a proper acceptance test, but it’ll serve us well as a simple “Does this crash right away?” test!
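The build steps above can be sketched as a small shell function.  This is only a sketch: the repository URL and branch name are assumptions (check where the project lives for the version you want), and the run helper is something I made up so the commands can be previewed with DRY_RUN=1 before anything is executed.

```shell
#!/bin/sh
# Sketch of the build steps. Nothing runs until build_singularity is called;
# with DRY_RUN=1 the commands are printed instead of executed.

# Assumed repository location and branch name; adjust to the real ones.
REPO_URL="${REPO_URL:-https://github.com/singularityware/singularity.git}"
BRANCH="${BRANCH:-development}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The body is a subshell, so the "cd" does not affect the caller.
build_singularity() (
    # Step 1: Git, a build environment, and the Autotools (needs root).
    run sudo apt-get install -y git build-essential autoconf automake libtool &&
    # Step 2: fetch the source and pick a branch (or a release tag).
    run git clone "$REPO_URL" singularity &&
    run cd singularity &&
    run git checkout "$BRANCH" &&
    # Step 3: generate the configure script.
    run ./autogen.sh &&
    # Steps 4 and 5: configure, build, and install (install as root).
    run ./configure &&
    run make &&
    run sudo make install &&
    # Quick "does this crash right away?" check.
    run singularity --version
)
```

To preview the commands without running them, call the function in a subshell with the variable set: `(DRY_RUN=1; build_singularity)`.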

Now that I’ve explained what I’m going to do, let’s watch me do it!

(You can ignore the “2.2.99” version number.  That’s just temporary, to indicate that I’m building something which hasn’t been released yet.)

Singularity has now been built and installed on my virtual machine.  Let’s start working with it!

Making a Container


In a way, this entire post is about making a container.  But to start, I am just going to create a simple container, containing nothing but a minimal operating system.

To create my Singularity container, I’m going to use a bootstrap definition file.  This is the ideal way to make a Singularity container, because it helps you meet one of Singularity’s goals: reproducibility!  Ideally, someone should be able to build an identical container, using the bootstrap definition, and any files referenced by the bootstrap definition.  The container should be able to be bootstrapped without the need for manual intervention.

Bootstrap definition files contain several parts, although only one part is technically required:

  1. How to get & install the base OS.  (This is the only required part.)
  2. What setup actions to perform outside of the container.
  3. What setup actions to perform inside the container.
  4. What files to copy into the container from the outside.
  5. What run script to install in the container.  (More on this later.)
  6. What tests to run inside the container to validate that the bootstrap was done successfully.

Let’s start with my basic bootstrap file:

Here’s what I’m doing:

  • “BootStrap: yum” tells Singularity that it should use yum to install an RPM-based OS.
  • “OSVersion” just sets a variable, though it’s one that the yum bootstrap method requires.
  • “MirrorURL” is the URL to the yum repository that will be used for bootstrapping.  In my case, I’m using Stanford’s yum repository, because for me it’s the fastest.  Note how I can include the OSVersion field in the URL by using %{OSVERSION}, but that’s optional.  Singularity also provides $basearch, which is set appropriately for the architecture on which the container is built, but I’m hard-coding things to x86_64, because I don’t know if all the things I’m putting into the container would work on other architectures.
  • “GPG” tells the bootstrap mechanism where to find the yum repository’s GPG key.  This is important, because packages are virtually always pulled from a mirror of some sort (like, the one I’m using), and the GPG key lets us be sure that the packages came from the source.  Note that I am pulling the GPG key from a remote server, but I’m pulling the key directly from CentOS, and I’m using HTTPS (not HTTP).
  • “Include” is optional, and gives the bootstrap mechanism a (space-separated) list of additional packages to install.  Since my workflow is written in Perl, we’ll start by installing Perl.
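For illustration, a definition file with those five fields might look like the sketch below, written out with a shell here-document.  The mirror host is a placeholder (not the Stanford mirror), the GPG key URL is my assumption of where CentOS publishes its CentOS 7 key, and the file name basic.def is arbitrary.

```shell
#!/bin/sh
# Write a minimal yum-based bootstrap definition file.
# The mirror host below is a placeholder, and the key URL is an assumption.
DEF_FILE="${DEF_FILE:-basic.def}"

# The quoted 'EOF' keeps the shell from expanding anything in the body,
# so %{OSVERSION} reaches Singularity untouched.
cat > "$DEF_FILE" <<'EOF'
BootStrap: yum
OSVersion: 7
MirrorURL: https://mirror.example.edu/centos/%{OSVERSION}/os/x86_64/
GPG: https://www.centos.org/keys/RPM-GPG-KEY-CentOS-7
Include: perl
EOF

echo "Wrote $DEF_FILE"
```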

The yum bootstrap mechanism is designed to bootstrap from a single repository, which is enough to install a minimal OS.  If you need to install from multiple repositories, you will need to bring in the supplemental repositories yourself.  We’ll get to this later.

Now that I have a bootstrap file, let’s build an image!  I’m going to create a container image, and bootstrap it, using the basic bootstrap definition above.
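As a rough sketch, creating and bootstrapping an image with the Singularity 2.x command line comes down to two commands.  The file names and the 768 MiB size are assumptions for illustration, and the run helper is hypothetical (it only exists so the commands can be previewed with DRY_RUN=1).

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create an empty image file, then bootstrap it from a definition file.
# Arguments: image path, definition file, optional size in MiB.
make_image() {
    image="$1"
    def_file="$2"
    size_mib="${3:-768}"
    # Creating the image can be done as a regular user...
    run singularity create --size "$size_mib" "$image" &&
    # ...but bootstrapping has to happen as root.
    run sudo singularity bootstrap "$image" "$def_file"
}

# Example: make_image image.img basic.def
```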

In retrospect, of course I should have expected this error.  Singularity doesn’t bring its own package manager, so if I want to build a container using yum, then I need to have yum installed!

Let’s try again:

Welp, this is different. Let’s make a small digression to explain what’s going on.

RPM-based systems have something known as the “rpmdb”. The rpmdb is a directory, containing database files which record information about the packages installed. The rpmdb normally lives at /var/lib/rpm, except on Debian-based systems, where Debian’s rpm defaults to looking in ~/.rpmdb (that is, .rpmdb within your home directory). If this were left alone, the container would be bootstrapped with the rpmdb living at /root/.rpmdb inside the container. That’s wrong, so the yum bootstrap mechanism has caught it, and given us instructions on how to fix it!

Let’s follow the instructions and try the bootstrap again.
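The fix amounts to telling rpm, through an .rpmmacros file, to keep its database under /var/lib/rpm.  The sketch below shows the idea; the two macro lines are what I would expect the instructions to contain, but treat them as an assumption and prefer whatever Singularity actually prints on your system.  Note that the function overwrites any existing file at the given path.

```shell
#!/bin/sh
# Write rpm macros that point the rpmdb at the conventional location.
# Argument: path of the macros file to (over)write.
write_rpmmacros() {
    macros_file="$1"
    # Quoted 'EOF' so that %{_var} is written literally.
    cat > "$macros_file" <<'EOF'
%_var /var
%_dbpath %{_var}/lib/rpm
EOF
}

# Example (the bootstrap runs as root, so root's home is what matters):
#   write_rpmmacros /root/.rpmmacros
```

Afterwards, `rpm --eval '%{_dbpath}'` (run as the same user) should report /var/lib/rpm.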

Much better!  Singularity took our simple bootstrap file, leveraged the system’s yum, and has created a minimal CentOS 7 install (plus perl).

Before I go on, I should note that from a reproducibility standpoint, I’m OK with what I’m doing here: Although I’m installing the latest CentOS 7 available, I know that CentOS updates within a release do not normally break things.  However, it does mean that my container will be different from bootstrap to bootstrap, because bug fixes do happen, and those may (inadvertently) change my code’s behavior.  However however, I don’t want to ship buggy containers, so I at least need to get security updates!

Anyway, now that we have a minimal bootstrap file, let’s extend it to install some of our software!

Extending the Bootstrap


Now that we have a simple bootstrap file, let’s add code to install my workflow software.

Here’s what my bootstrap file looks like now:

My bootstrap file now has two new sections!  The %setup section contains shell script (bash script) that is run by Singularity (as root) outside of the container, and which is run as soon as the OS is installed.  The %post section contains shell script that is run by Singularity inside the container, once the %setup section is complete.

Here’s what my new bootstrap code does:

  1. Outside of the container, download a .zip file of the workflow software into the container’s /tmp directory.  The download is coming directly from our local Gitlab server, and I’m downloading a specific tag.
  2. Inside the container, create a local directory called /workflow.  This is where my code will live.
  3. Still inside the container, unzip the archive, and move the files into the /workflow directory.  The archive Gitlab creates places the repository’s code into a sub-directory, so I have to use one level of wildcards.
  4. After the move, delete the now-empty sub-directory, along with the .zip file.

Note how I’m using rmdir, and how I’m using rm without -f.  Bootstrapping a container happens as root, so it’s best to be as safe as possible, especially with destructive commands like rm!

Also, reproducibility purists may not like how I’m using a Git tag, because Git tags can change.  If I wanted to be sure that I was getting the same code every time, I would download an archive of a specific commit ID (that is, a specific Git commit hash).

Anyway, let’s see what happens when I bootstrap with this new file!

Uh oh, something didn’t work.

This is an important thing to remember: My %post script runs unzip, and the %post script runs inside the container, but I didn’t install unzip inside the container.  The bootstrap installs a minimal OS, so almost everything you use (even things like vi or even yum) is not installed by default.  So I need to add unzip to the list of packages to install.
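For illustration, a definition file with %setup and %post sections, and with unzip added to the package list, might look like the sketch below.  The mirror host, the Gitlab URL, and the tag name are all placeholders.  I am also assuming curl is available on the host, and I unpack into a scratch directory (/tmp/unpack) so the wildcard cannot match anything else in /tmp; that detail is my own variation.

```shell
#!/bin/sh
# Write a definition file that downloads a .zip outside the container
# (%setup) and unpacks it inside the container (%post).
# All URLs below are placeholders.
DEF_FILE="${DEF_FILE:-workflow.def}"

cat > "$DEF_FILE" <<'EOF'
BootStrap: yum
OSVersion: 7
MirrorURL: https://mirror.example.edu/centos/%{OSVERSION}/os/x86_64/
GPG: https://www.centos.org/keys/RPM-GPG-KEY-CentOS-7
Include: perl unzip

%setup
    # Runs on the host, as root. $SINGULARITY_ROOTFS is the container's root.
    curl --fail --location \
        --output "$SINGULARITY_ROOTFS/tmp/workflow.zip" \
        "https://gitlab.example.edu/lab/workflow/repository/archive.zip?ref=v1.0"

%post
    # Runs inside the container, so it can only use what is installed there.
    mkdir /workflow
    mkdir /tmp/unpack
    unzip /tmp/workflow.zip -d /tmp/unpack
    # The archive holds a single top-level directory, hence the wildcard.
    mv /tmp/unpack/*/* /workflow/
    # rmdir only removes empty directories, and rm is used without -f.
    rmdir /tmp/unpack/* /tmp/unpack
    rm /tmp/workflow.zip
EOF

echo "Wrote $DEF_FILE"
```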

Here’s my updated bootstrap file:

And here’s what happens when I try to bootstrap it:

Finally!  I now have my workflow code installed in the container.  Next, let’s install the Illumina software.

Installing Single Packages


Illumina’s bcl2fastq software is what the workflow script runs.  Illumina ships bcl2fastq as source, and also as an RPM.  Since we’re using CentOS, I will download the RPM and install it.

Before I show you the updated bootstrap in action, I’m going to note that you should find a better way to do things than I am here.  Downloading a file from a third-party web site is guaranteed to fail eventually.  If you pull assets from a site that is not under your control, then you need to store those assets in a place that is under your control. Also, note how I’m downloading using HTTPS. That’s an absolute requirement.

Anyway, here’s the updated bootstrap file:

I’m doing a number of new things here:

  • I use mktemp to create a temporary directory outside the container (keeping all of my temp. stuff in one place).
  • I download the bcl2fastq RPM into the temporary directory.
  • I use rpm, running outside the container, to install the package into the container.
  • I clean things up (again, without using rm -f).
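The four steps above can be sketched as a shell function that would form the body of a %setup section.  The download URL is a placeholder (not Illumina’s real link), and the run helper is hypothetical; inside a real %setup section the function would be called with "$SINGULARITY_ROOTFS" as its argument.

```shell
#!/bin/sh
# Placeholder URL; better still, point this at a copy you control.
RPM_URL="${RPM_URL:-https://downloads.example.org/bcl2fastq2.x86_64.rpm}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Download one RPM and install it into a container root from the outside.
# Argument: path to the container's root filesystem.
install_single_rpm() {
    rootfs="$1"
    # Keep all temporary files in one private directory.
    tmpdir=$(mktemp -d) || return 1
    run curl --fail --location --output "$tmpdir/package.rpm" "$RPM_URL" &&
    # The host's rpm does the install; --root points it at the container.
    run rpm --root "$rootfs" --install "$tmpdir/package.rpm"
    status=$?
    # Clean up without rm -f: only remove the file if it exists.
    if [ -e "$tmpdir/package.rpm" ]; then
        rm "$tmpdir/package.rpm"
    fi
    rmdir "$tmpdir"
    return "$status"
}

# Example, inside %setup: install_single_rpm "$SINGULARITY_ROOTFS"
```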

Let’s check out the bootstrap in action, and then we’re going to try a slightly different way.

As you can see, I am using Ubuntu’s rpm to install bcl2fastq from outside the container.  But, with the zip file, I do the unpacking inside the container.  What if I also did the package install inside the container?

Here’s what the bootstrap looks like, with the package install happening inside the container:

Things certainly look cleaner, don’t they!  I’ve moved the package-install code to the %post section, and I’ve added rpm to the list of packages to install. Let’s see what happens when I bootstrap the above container:

Although the bootstrap worked, what we are doing here is pretty dangerous.

I mentioned above that the rpmdb is a directory of database files, most likely BerkeleyDB files. BerkeleyDB files are closely tied to the version of BerkeleyDB that RPM was built with.  In my case, since I’m using an Ubuntu 16.04 system to build a CentOS 7 container, both rpms (the one outside the container and the one inside the container) use compatible BerkeleyDB versions, but you can’t always guarantee that.

The safest thing to do, if you have to run rpm or yum yourself, is to run them in the %setup section, so that you are using the host’s version of rpm and yum, the same version that bootstrapped the OS. If you must do package installation inside the container, then you may have to run rm /var/lib/rpm/__db* (which deletes some temporary BerkeleyDB files), followed by rpm --rebuilddb (which tries to rebuild the BerkeleyDB files), but it might not work.

In the remaining sections, I’m going to switch back to doing all package installation outside the container.

Installing Packages from a Repository


Although bcl2fastq is now installed, my workflow code is still not runnable, because dependencies are missing.

The easiest way to check this is to look at your code, and see what modules are required.  In my case, I need Perl’s Email::MIME and Email::Send.  Both of these modules are available in packages from EPEL, but my container doesn’t know anything about EPEL, and the bootstrap mechanism only supports bootstrapping from one repository.

Luckily, EPEL makes it easy for us: EPEL provides a package (called epel-release), which installs all of the configuration that yum requires to find and install packages from EPEL.  EPEL also provides a direct link to their GPG key.

Here’s what my bootstrap looks like now:

Other than moving everything into the %setup section, I am now downloading and installing EPEL’s GPG key, along with the epel-release package.  Then I install my Perl packages.
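As a sketch of that sequence: fetch EPEL’s GPG key and the epel-release package, install both into the container root with the host’s rpm, then let the host’s yum install the Perl modules into the same root.  The URLs and the package names (perl-Email-MIME, perl-Email-Send) are my assumptions based on the usual EPEL layout and naming; check them before relying on this.  The run helper is hypothetical.

```shell
#!/bin/sh
# Assumed location of EPEL's key and release package.
EPEL_BASE="${EPEL_BASE:-https://dl.fedoraproject.org/pub/epel}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Argument: path to the container's root filesystem.
add_epel_and_perl_modules() {
    rootfs="$1"
    tmpdir=$(mktemp -d) || return 1
    key="$tmpdir/epel.key"
    release="$tmpdir/epel-release.rpm"
    run curl --fail --location --output "$key" \
        "$EPEL_BASE/RPM-GPG-KEY-EPEL-7" &&
    run curl --fail --location --output "$release" \
        "$EPEL_BASE/epel-release-latest-7.noarch.rpm" &&
    # Trust EPEL's key inside the container's rpmdb, then add its repo config.
    run rpm --root "$rootfs" --import "$key" &&
    run rpm --root "$rootfs" --install "$release" &&
    # With the repository configured, yum can pull in the Perl modules.
    run yum --installroot="$rootfs" -y install perl-Email-MIME perl-Email-Send
    status=$?
    # Clean up without rm -f.
    for f in "$key" "$release"; do
        if [ -e "$f" ]; then
            rm "$f"
        fi
    done
    rmdir "$tmpdir"
    return "$status"
}

# Example, inside %setup: add_epel_and_perl_modules "$SINGULARITY_ROOTFS"
```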

Let’s watch the updated bootstrap in action:

Near the end of the bootstrap you may notice warnings about running RPM (telling me to use Alien instead), and about how the rpmdb was modified outside of yum.  Both of those warnings are expected: Alien is an RPM converter for Debian, which I’m not using because I’m not installing these packages into a Debian system; and rpmdb was modified because I used the rpm command to install the epel-release package.  I could have used yum, but rpm works fine too.

Despite the (totally expected) warnings, the bootstrap completed and my workflow script was able to run!  Although, not without issues.  We now need to make a way to easily run our code, and it looks like we have a permission issue to address as well.

Making a Runscript


My workflow has two programs that can be executed: bbox-workflow-miseq and bbox-workflow-nextseq.  One is used with MiSeq-generated runfolders; the other is used with NextSeq-generated runfolders.  Each program takes several parameters.  Without a runscript, to run my workflow I would have to do something like this:

singularity exec image.img /workflow/bbox-workflow-nextseq workflow_parameters

That’s really annoying!  Not only do I want to avoid having to type so much, I don’t want people running arbitrary commands inside my container:  I only want people to run the two workflow commands.  Plus, I want to make things as easy as possible for my users!

There are two parts to this solution.  The first part is a shortcut.

When you create a container image, you may be surprised to see that the container image is executable!  Let’s look at it:

If you execute the container image, a set of scripts converts what you type…

./image.img workflow_parameters

… into …

singularity run ./image.img workflow_parameters

That’s great!  We no longer have extra stuff to type, but on the other hand, we’re looking at a new command: singularity run.

singularity run is used to execute a runscript.  The runscript is an executable shell script located at path /singularity inside the container.  singularity run image.img is essentially shorthand for singularity exec image.img /singularity.

So, in order to use the ./image.img method of execution, we need a runscript.  Here’s mine:

The runscript is really simple:

  • If no arguments are provided ($# is “number of arguments”), then print a helpful message with examples, and exit.
  • Take the first argument ($1), and convert it to lowercase.  Use shift to remove that first argument from the list.
  • If nextseq or miseq (case-insensitive) is the first argument, then pass the remaining arguments to the appropriate workflow script.
  • If something else is the first argument, then exit with an error.
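A runscript following those four rules can be sketched as below.  The snippet writes the script to a local file named runscript.sh so it can be tried outside a container.  The wording of the messages is made up, and the WORKFLOW_DIR override is an assumption added only to make testing easier; inside the container the scripts live in /workflow.

```shell
#!/bin/sh
# Write a sketch of the runscript to a local file and make it executable.
cat > runscript.sh <<'EOF'
#!/bin/sh
# Where the workflow programs live (overridable for testing only).
WORKFLOW_DIR="${WORKFLOW_DIR:-/workflow}"

# No arguments: print a short usage message and exit.
if [ "$#" -eq 0 ]; then
    echo "Usage: $0 {miseq|nextseq} workflow_parameters" >&2
    echo "Example: $0 nextseq workflow_parameters" >&2
    exit 1
fi

# Lowercase the first argument, then drop it from the argument list.
sequencer=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
shift

# Hand the remaining arguments to the matching workflow program.
case "$sequencer" in
    miseq)
        exec "$WORKFLOW_DIR/bbox-workflow-miseq" "$@"
        ;;
    nextseq)
        exec "$WORKFLOW_DIR/bbox-workflow-nextseq" "$@"
        ;;
    *)
        echo "Unknown sequencer type: $sequencer" >&2
        exit 1
        ;;
esac
EOF
chmod +x runscript.sh
```

In a bootstrap definition file, the body of this script (everything between the two EOF markers) is what would go under a %runscript section.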

The runscript can be automatically placed into the container by including it in the bootstrap definition file, like so:

And here’s the whole create-bootstrap-run process in action:

Woooo!  My workflow is now containerized, and easy to run.  However, I see I’m getting an error, as my workflow tries to create a file.

Adding a File to the Container


In the asciicasts above, every time I run one of my workflow programs, it reports “Permission denied” when it tries creating a file.  The file it is trying to create is a list of email addresses.

My workflow code sends emails when long-running actions complete.  The list of email addresses lives inside a separate configuration file.  If the file does not exist, then the workflow code creates it automatically.  The workflow code expects the file to live in the same directory as the code; that means, inside the container.  The files inside the container are owned by root, so we get a permission denied error.  But, even if we were running the workflow as root, we would still get an error because the container is mounted read-only.

To deal with this problem, there are two solutions: I can manually enter the container and create the file.  Or, I can create the file outside of the container, and copy it during the bootstrap.

Let’s explore both methods!

Manually Adding Files

To manually create a file inside the container, I need to treat the container image as if it were an ordinary disk image.  Here’s what I need to do:

  1. Create a local directory to act as the mount point.
  2. Mount the container to that directory.
  3. As root, create/edit the file.
  4. Unmount the container.

Luckily we already have root access, so let’s do it!

The asciicast this time starts with the container created in the previous section.  We also have a file, email-recipients.txt, ready to copy into the container.
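A sketch of the four steps, assuming the image format described in the bonus section at the end of this post (a one-line “#!” header followed by an ext3 filesystem).  The filesystem is loop-mounted at an offset equal to the length of that first line, which the sketch measures rather than hard-codes.  The mount point and the run helper are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Length in bytes of the image's first line (the "#!" header).
header_length() {
    head -n 1 "$1" | wc -c | tr -d ' '
}

# Arguments: image, file to copy, destination directory inside the
# container, and (optionally) a mount point on the host.
add_file_to_image() {
    image="$1"
    source_file="$2"
    dest_dir="$3"
    mount_point="${4:-/mnt/container}"
    offset=$(header_length "$image") || return 1
    # Steps 1 and 2: create the mount point and mount the filesystem.
    run sudo mkdir -p "$mount_point" || return 1
    run sudo mount -o "loop,offset=$offset" "$image" "$mount_point" || return 1
    # Step 3: as root, copy the file in.
    run sudo cp "$source_file" "$mount_point$dest_dir/"
    status=$?
    # Step 4: always unmount, even if the copy failed.
    run sudo umount "$mount_point" || return 1
    return "$status"
}

# Example: add_file_to_image image.img email-recipients.txt /workflow
```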

If I wanted to, instead of copying an existing file, I could have started a text editor (as root) and created the file directly inside the container.  This is also the method I would use if I needed to modify an existing configuration file.  But, in my opinion it’s not the best way, because this is the more-complicated of the two methods.

Adding Files Via the Bootstrap

Using the bootstrap process is the easiest way of getting a file into the container, but it requires that you have the file ready at bootstrap time.  You also have to keep the file with your bootstrap file, or be able to generate it when necessary.

Here’s my bootstrap file, with a new section, %files, listing the files to copy and where to put them in the container:
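As a sketch, a %files section is a list of “source destination” pairs, one per line.  The snippet below appends one to a definition file.  The definition file name is an assumption, and so is the idea that /workflow already exists by the time the files are copied; the order in which sections run may depend on your Singularity version.

```shell
#!/bin/sh
# Append a %files section to an existing definition file.
DEF_FILE="${DEF_FILE:-workflow.def}"

cat >> "$DEF_FILE" <<'EOF'

%files
email-recipients.txt /workflow/
EOF

echo "Appended %files section to $DEF_FILE"
```

The source path is looked up on the host, so email-recipients.txt has to be reachable from wherever the bootstrap is run.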

Here’s what the bootstrap and running process looks like now:

Everything works!  My workflow scripts no longer complain about being unable to create a configuration file.  The basic containerization process is now complete.

Next Steps


At this point, the workflow has been migrated to the container!  It’s now time to ship the container and test with real data, before installing Singularity on the workflow server, and instructing people on how to use the new code.

That being said, although my container is done, it’s not done.  There are a few things about this container that are not ideal:

  • As mentioned in the previous section, my code tries to create a file inside the container.  That’s not good: All files should be created outside the container, not inside.
  • On the other hand, maybe I should have the configuration files installed as part of the bootstrap process?  It wouldn’t affect portability too badly; people could easily fork my repository and simply overwrite those config files.
  • If I do decide to copy in configuration files during the bootstrap process, I should also use the optional %test phase of the bootstrap to validate those configuration files.
  • My code sends email.  It already uses SMTP directly, which is good because trying to use a local sendmail (or mail) command would not work (those commands write to local files, which would be inside the read-only container).  Although I use SMTP, I don’t have any support for modern STARTTLS, which more and more SMTP servers require.
  • Some of the emails I send describe commands to run.  For example, I describe the exact command to run if the analysis needs to be done over.  Those commands are going to be different now.
  • I have my SMTP server hard-coded in my script.  That makes the code non-portable.

Each of these problems can be fixed, but they will change the way the program works, which means the users of my code will need to be educated as to how the new code should be used.

I also have to remember to rebuild the container on a regular basis.  Every container “contains” (lol) a minimal OS and libraries, which need to be updated.  Yes, the attack surface is smaller than an entire server’s or VM’s attack surface, but it should still be kept updated.  Every one to three months is a good schedule.

Conclusion

That’s it!  I have taken a simple workflow program, along with its various dependencies, and I have converted it into a Singularity container.  My workflow should now work on other Linux systems, where the only requirement is Singularity.  The other systems don’t need to have Perl, or the specific modules, or the bcl2fastq program.  That is the goal, the mobility of compute.

I hope you enjoyed this blog post!  I also hope it was informative.  I know it was big, but I hope it was worth it.  Feel free to leave a comment (but don’t expect an immediate reply), or reach out to me on Twitter.

Also, thanks to Vanessa, who I annoyed with lots of Singularity questions; and both Vanessa & Greg for handling the issues I submitted during the development of this article.

Good luck using Singularity.  Have fun!

Bonus: Shrinking the Image


WARNING: This involves messing with Singularity images without using Singularity tools.  It could break at any time.  Make a backup before you mess with stuff!

As a final thing before I go, I wanted to talk about free space.

When you create a Singularity container from scratch, you have to set an image size.  If your image is too small, that’s OK, because singularity expand lets you grow an existing image.

In the end, you are always going to have an image that is larger than it needs to be, because there is no singularity shrink command.  That’s not unusual: It’s rare to want to (or to be able to) shrink something, but it is possible here.

This section describes one way to shrink a Singularity image.

In all of my previous examples, the container image has appeared as an executable file.  Let’s look at it:

The Singularity image is executable, and is set up like a script.  Except the “script” is data, and it is being “executed” by the run-singularity program. run-singularity does one of three things:

  • If there is a runscript, it runs that.
  • If there is no runscript, it exits with an error.
  • If the image hasn’t been bootstrapped, or /bin/sh is missing, it exits with a different error.

This is an interesting use of Linux’s script mechanism: When a program is run, Linux looks at the first two bytes to see what kind of code is being run.  If Linux sees #! as the first two characters in the file, the rest of the line gives the program to run and any additional arguments, and the script path is then added to the end.  So, if you type ./image.img ARG into a shell, the OS translates that to (effectively) run-singularity image.img ARG, which then executes singularity run image.img ARG!
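The mechanism is easy to see with a toy file, using /bin/echo as a stand-in for run-singularity.  This is just a demonstration (the file name is made up), and it assumes a Linux system with /bin/echo present.

```shell
#!/bin/sh
# Build a file whose first line is a "#!" header and whose body is just data.
printf '#!/bin/echo interpreter-arg\nthis line is data, not code\n' > shebang-demo.img
chmod +x shebang-demo.img

# The kernel turns this into: /bin/echo interpreter-arg ./shebang-demo.img ARG
./shebang-demo.img ARG
```

On Linux this prints `interpreter-arg ./shebang-demo.img ARG`: the program named in the header is started, and the path of the “script” is placed ahead of whatever was typed on the command line.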

If we strip off the first line of the Singularity image, we get an ext3 filesystem.  We can shrink that using the resize2fs command.  So, here’s what we are going to do:

  1. Create an empty Singularity image.
  2. Split off the header and the ext3 filesystem into separate files.
  3. Run a full filesystem check (which resize2fs requires before attempting to shrink).
  4. Shrink the filesystem file.
  5. Recombine the header and the (now-smaller) filesystem into one file.

Let’s do it!
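The split, shrink, and recombine steps can be sketched with standard tools.  The function and file names are made up for illustration.  The sketch assumes the layout described above (one header line, then the filesystem) and uses resize2fs -M, which asks for the smallest size resize2fs can manage.

```shell
#!/bin/sh
# Step 2: split an image into its one-line header and the filesystem.
# Arguments: image, header output file, filesystem output file.
split_image() {
    image="$1"
    header_out="$2"
    fs_out="$3"
    head -n 1 "$image" > "$header_out" || return 1
    header_len=$(wc -c < "$header_out") || return 1
    # tail -c +N starts output at byte N (1-based), so skip the header bytes.
    tail -c "+$((header_len + 1))" "$image" > "$fs_out"
}

# Steps 3 and 4: force a full check, then shrink to the minimum size.
shrink_filesystem() {
    fs_file="$1"
    e2fsck -f "$fs_file" &&
    resize2fs -M "$fs_file"
}

# Step 5: glue the header back on and restore the executable bit.
join_image() {
    header_in="$1"
    fs_in="$2"
    image_out="$3"
    cat "$header_in" "$fs_in" > "$image_out" &&
    chmod +x "$image_out"
}

# Example (after backing up image.img):
#   split_image image.img image.header image.fs
#   shrink_filesystem image.fs
#   join_image image.header image.fs image-small.img
```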

In the end, resize2fs was able to shrink an empty 768 MiB filesystem down to 19 MiB.  This will also work with Singularity container images that have been bootstrapped.

Of course, shrinking is the last thing you should do to your container image, and you should not shrink container images if you plan on modifying them.  That includes mounting and executing in read/write mode.

I have submitted a request (singularity GitHub issue #623) to create a singularity shrink command.  In the meantime, if you want to save as much space as possible, feel free to give this a try!
