Smaller containers - part 1

This is the first in a series of blog posts about building better Docker images.

Docker Inc is widely credited with bringing containers out of geekdom and into the real world
inhabited by us developers, and it did this by providing easy-to-use tools for building, sharing
and running containers. Key to this are the docker build command and the
Dockerfile.

But whilst this makes building a container image fairly easy, it doesn’t necessarily make it easy
to build a good container image. What do we mean by ‘good’? Well, many factors are involved,
but this series of posts will focus on one aspect: the need to create small images.
There are two key reasons for this.

  • Firstly, images are frequently and repeatedly pulled across the internet, so the smaller the image
    the faster this will happen. Clearly a good thing.

  • Secondly, small containers contain less ‘stuff’, and the less ‘stuff’ you have in your container the
    smaller the attack surface for hackers to get into your containers and cause damage. Hence,
    well-designed lightweight containers are good not only because they load faster but also because
    they are more secure. This series of posts describes approaches for achieving this.

As an example we’ll use a Docker image that contains the RDKit cheminformatics
toolkit. Our Squonk Computational Notebook uses RDKit extensively, and this
container image and related ones are used frequently.

Let’s look at how we first went about this. The RDKit docs
provide good information about how to build RDKit from the source code in
GitHub. We wanted to be able to build versions at any time, including
from the different branches and tags, so building from source seemed to be a sensible approach.

So we created a Dockerfile to handle this. It took a little bit of trial and error to define all the
packages that were needed, but the end result is a well defined and repeatable process that builds a
container image for RDKit.

The Dockerfile looks like this:

FROM debian:stretch
MAINTAINER Tim Dudgeon <tdudgeon@informaticsmatters.com>

RUN apt-get update && apt-get install -y \
 build-essential\
 python-numpy\
 cmake\
 python-dev\
 python-pip\
 sqlite3\
 libsqlite3-dev\
 libboost-dev\
 libboost-system-dev\
 libboost-thread-dev\
 libboost-serialization-dev\
 libboost-python-dev\
 libboost-regex-dev\
 swig\
 git\
 wget\
 zip &&\
 apt-get upgrade -y &&\
 apt-get clean -y

ENV RDKIT_BRANCH=master
RUN git clone -b $RDKIT_BRANCH\
  --single-branch https://github.com/rdkit/rdkit.git

ENV RDBASE=/rdkit
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$RDBASE/lib:/usr/lib/x86_64-linux-gnu
ENV PYTHONPATH=$PYTHONPATH:$RDBASE

RUN mkdir $RDBASE/build
WORKDIR $RDBASE/build

RUN cmake -DRDK_BUILD_INCHI_SUPPORT=ON .. &&\
 make &&\
 make install &&\
 make clean

WORKDIR $RDBASE

The approach should be reasonably clear:

  1. Install all the dependencies using the apt package manager (we use a Debian base image).
  2. Clone the RDKit Git repository with the source code.
  3. Build RDKit according to the instructions in the RDKit docs and then install it.

It takes about an hour to build, but eventually you get an image that can be used to run RDKit:

$ docker build -f Dockerfile-all-in-one .
...
... lots and lots of output ...
...
Successfully built bae5c2ce64a8
$ docker run -it --rm bae5c2ce64a8 python
Python 2.7.13 (default, Nov 24 2017, 17:33:09)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rdkit
>>>

Whilst this is a nice way to illustrate and reliably reproduce the process of building RDKit, it
does have a number of significant issues.

  1. It takes a long time to build.
  2. This long build time prevents use of automated builds on DockerHub as the build times out.
  3. The resulting image is a hefty 1.25 GB in size.
  4. The resulting image contains a large number of packages and files that are present but are not needed to run RDKit.
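
To see where that size comes from, something along the following lines can help. It is a sketch
rather than part of our build: the image ID is the one from the build output above (yours will
differ), and bytes_to_gb is a small helper defined here purely for illustration, not a Docker command.

```shell
# Image ID from the build output above; substitute your own.
IMAGE="${IMAGE:-bae5c2ce64a8}"

# Illustrative helper (an assumption of this sketch): convert a byte count to GB.
bytes_to_gb() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1000000000 }'
}

# Only talk to Docker if the CLI is actually installed.
if command -v docker >/dev/null 2>&1; then
  # Size contributed by each Dockerfile instruction, newest layer first.
  docker history "$IMAGE"

  # Total image size in bytes, converted to GB by the helper above.
  bytes_to_gb "$(docker image inspect --format '{{.Size}}' "$IMAGE")"
fi
```

The docker history output is the more interesting part, as it shows which instructions in the
Dockerfile account for most of the bulk.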

It’s the last of these that we want to focus on. This image is an extreme case of a nasty anti-pattern
that affects nearly all container images you will find on DockerHub or other repositories: the
resulting image contains various artefacts that are needed to build the image, but are
not needed to run the container once it is built.

Specifically, it contains git, wget and the entire build infrastructure including make, cmake, gcc
and g++, as well as the apt package manager. And it also contains the checked out RDKit GitHub
repository. Lots of extra fluff, none of which is needed to actually run RDKit, which is the sole
purpose of this container image.
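
One way to confirm this for yourself is to ask the container which of those tools are still on its
PATH. The following is a minimal sketch, assuming the image ID from the build output above; the
BUILD_TOOLS and CHECK variable names are ours, chosen just for this example.

```shell
# Build-time tools named in the text above.
BUILD_TOOLS="git wget make cmake gcc g++ apt-get"

# Shell snippet that reports which of its arguments are on the PATH where it runs.
CHECK='for t in "$@"; do
  if command -v "$t" >/dev/null 2>&1; then echo "present: $t"; else echo "absent: $t"; fi
done'

# Image ID from the build output above; substitute your own.
IMAGE="${IMAGE:-bae5c2ce64a8}"

# Run the check inside the container (skipped if the docker CLI is not installed).
# BUILD_TOOLS is deliberately unquoted so each tool becomes a separate argument.
if command -v docker >/dev/null 2>&1; then
  docker run --rm "$IMAGE" sh -c "$CHECK" sh $BUILD_TOOLS
fi
```

For the image built above we would expect every one of these to be reported as present, even
though none of them is needed at run time.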

So whilst this Dockerfile is useful for illustrating how to build different versions of RDKit, and
could even be useful for an RDKit developer who needs to rebuild things and do some hacking, it’s a
poor example of how to build a container for just running RDKit, as it’s huge and has a pretty
large attack surface with all those unnecessary extras.

We can do much better, and later posts will show various approaches for doing this. Take a look at the
[next post]({% post_url 2018-04-15-smaller-containers-part-2 %}).

If these Docker images are of use to you, you can find the source code on
GitHub
and the images on DockerHub.