28 October 2018

Why use conda over pip?

If you're in science, packaging is a multi-language problem. You can sort of get by using system-wide installations of non-Python libraries, but then you're usually stuck if you're working on multiple projects that depend on different versions of those tools. Conda is really the only way to go. In addition, Anaconda ships packages that are compiled against the Intel MKL (Math Kernel Library), which can lead to much faster numerical performance.

[image: pip builds are slower]
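
For example, here's a quick way to check whether the numpy in a fresh environment is an MKL-linked build (a rough sketch; the install prefix ~/miniconda3 and the environment name are just assumptions for illustration):

    # create a throwaway environment with numpy from the defaults channel,
    # which ships MKL-linked builds
    conda create -n mkl-check -y numpy

    # print numpy's build configuration by calling the env's python directly;
    # an MKL-linked build mentions "mkl" in the BLAS/LAPACK info
    ~/miniconda3/envs/mkl-check/bin/python -c "import numpy; numpy.show_config()"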

Stop using the root environment

When people get started with conda/anaconda, they tend to use the root environment (the one that is created by the installer) for everything. Then they start installing lots of stuff into it, and eventually it gets broken, and they're sad.

[image: how I feel when my root env is broken]

It's much easier to create environments for every project. If you break those, you can always delete them and start over easily. The root environment is where conda itself is installed. Really, you should only touch the root environment to upgrade conda. Rely on other environments for everything else, and your life will be much happier.
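
A typical per-project workflow looks something like this (the environment name and packages are just placeholders):

    # one environment per project, with the big dependencies pinned
    conda create -n projectA python=3.6 pandas scikit-learn

    # if the environment ever gets into a bad state, throw it away and rebuild
    conda env remove -n projectA
    conda create -n projectA python=3.6 pandas scikit-learn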

Start using Conda constructor

If you manage Anaconda environments for a bunch of users, Constructor is a great tool! The installers that you use on OSX/Linux/Windows to install Anaconda and Miniconda are built using Constructor. Constructor allows you to build your own installers with different specifications, so you no longer have to rely on the Anaconda installers, and you can build installers tailored to your team's desired data science environment.

Installers created by Constructor are a decent way to deploy production applications. The files do tend to be large (since all the binaries are shoved into the executable), but the upside is that no network access is needed at install time.
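
As a rough sketch, building an installer comes down to writing a construct.yaml and running constructor against it (the name, version, and package list here are made up for illustration):

    # install constructor into the root environment
    conda install constructor

    # a minimal construct.yaml describing the installer you want to build
    cat > construct.yaml <<'EOF'
    name: team-datasci
    version: "2018.10"
    channels:
      - defaults
    specs:
      - python=3.6
      - numpy
      - pandas
      - scikit-learn
    EOF

    # build a self-contained installer (.sh on Linux/OSX, .exe on Windows)
    constructor .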

Conda constructor doesn't work with noarch packages though, which can be inconvenient. (Noarch packages are platform independent - you build one package that works on Linux, OSX, and Windows.) There are many libraries (like Django) which are actually platform independent but which already have platform-specific packages built.

Don't rely too heavily on modifying environments

It's generally better to build environments and discard them rather than continuing to modify them over time. The reason is that an environment can eventually get into a weird state. I know this sounds vague - the primary reason it happens is that as you upgrade conda over time, it can become incompatible with older environments. In addition, as you continue to modify environments, they generally grow, which makes them harder (slower) for conda to deal with.
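
In practice, that means keeping a spec file under version control and rebuilding from it, rather than running conda install into the same environment for months (the file and environment names here are illustrative):

    # environment.yml is the source of truth; edit it instead of the live env
    conda env remove -n analysis
    conda env create -n analysis -f environment.yml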

Be careful with NFS.

Sysadmins and devops folks often install centralized conda environments for multiple users. These environments are backed by a package cache located in a world-readable but not writable location. When users create their own conda environments, they end up downloading packages into their home directories (typically, since they don't have permission to write to the system-managed package cache). Conda can symlink packages from multiple package caches, and that's fine. The problem occurs when home directories are NFS-mounted and users try to use their environments on multiple machines AND those machines do not have the same package cache.

On machine A, I have a centralized Anaconda installation at /opt/anaconda. It contains zeromq=4.2.5=hf484d3e_1, so when I create my own environment in ~/.conda, that package is symlinked into my environment. However, if I try to use that environment on machine B, where the package cache does not contain zeromq=4.2.5=hf484d3e_1, my environment will be broken.
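
If you suspect this has happened, one quick diagnostic is to look for dangling symlinks inside the environment (assuming GNU find; the environment path is illustrative):

    # list symlinks in the environment whose targets no longer exist, i.e.
    # packages that were linked from a cache that isn't present on this machine
    find ~/.conda/envs/myenv -xtype l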

You have a few options:

  1. Always make sure the centralized environment is identical on every machine
  2. Disallow symlinks in conda

I would argue for doing both. Having identical centralized environments is just nice, because users will always get the same thing no matter where they go. In addition, in shared environments, disabling symlinks is useful because if you ever upgrade the central environment, you could end up breaking user environments (by removing packages they link to).
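
Disabling symlinks is a one-line conda config change - to my knowledge the relevant setting is allow_softlinks, which you can set per user or in a system-wide .condarc:

    # force conda to copy packages into environments instead of symlinking
    # them from the package cache
    conda config --set allow_softlinks false

    # confirm the setting took effect
    conda config --show allow_softlinks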

Don't worry too much about conda activate, especially in production

Conda activate sets your PATH and a few environment variables that tell conda which environment it should install things into. It's almost certainly a distraction when you go to deploy applications. When you deploy, it's much easier to reference the absolute path to the binary in your startup script rather than calling activate. EDIT: this is generally true, but I've heard some packages (pyspark/rpy2) rely on the pathing that activate sets up.
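
For example, a deployment startup script can point straight at the environment's interpreter (the paths here are made up for illustration):

    #!/bin/bash
    # no `conda activate` needed - just call the environment's python directly
    /opt/anaconda/envs/myapp/bin/python /opt/myapp/run_server.py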

Never resolve dependencies when you deploy to production

So you write an application that depends on scikit-learn, and you build a conda recipe for the application to deploy it to production. Because it's not your first rodeo, you do the smart thing and set

 - scikit-learn=0.20

in your recipe. That's great! Unfortunately, scikit-learn depends on

- scipy

and doesn't specify which version it depends on. So if you tested with scipy 1.1.0, and scipy 1.2.0 is released by the time you deploy, the conda solver will pick up the latest version and you're running 1.2.0 in production without ever testing how it works with the rest of your application.

Instead, make sure that all nested dependencies are completely resolved before you go to production. There are two ways to do this. One is to use conda env export to produce a very precise (down to the build number) specification of every single package you want to deploy. The export has to be generated on the same platform that you are going to deploy to. The other way is to use Constructor. Constructor is especially nice because when you install something built with Constructor, you have zero dependencies on conda repositories. The only downside is that because everything is shoved into the Constructor installer, the file you deploy is usually at least a few hundred megabytes, possibly a few gigs.
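
The conda env export route looks something like this; the exported file pins every package down to the build string, and you recreate the environment from it on the production machine (the environment and file names are illustrative):

    # on the build/test machine: capture the exact, fully resolved environment
    conda env export -n myapp > myapp-pinned.yml

    # on the production machine (same platform!): recreate it exactly
    conda env create -n myapp -f myapp-pinned.yml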

What if the package I need isn't in Anaconda?

First, check conda-forge. Conda Forge is a community project to build all the things. There is a really good chance that the package you need is there and can be installed by specifying the conda-forge channel in your conda command: conda install -c conda-forge my_package.
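
If you use conda-forge a lot, you can also add it to your channel configuration instead of passing -c every time:

    # one-off install from conda-forge
    conda install -c conda-forge my_package

    # or add conda-forge to your channel list so future installs search it too
    conda config --add channels conda-forge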

If that fails, you can use conda skeleton to generate a conda recipe from the PyPI package. You'll also need to build the package and upload it to your own repository, so this can be cumbersome.
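
The skeleton workflow is roughly as follows (the package name is a placeholder, you'll need conda-build and anaconda-client installed, and the path to the built package depends on where conda is installed):

    conda install conda-build anaconda-client

    # generate a recipe directory from the package's PyPI metadata
    conda skeleton pypi some_pypi_package

    # build a conda package from the generated recipe
    conda build some_pypi_package

    # upload the built package to your own channel on anaconda.org
    anaconda upload /path/to/conda-bld/linux-64/some_pypi_package-*.tar.bz2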

If building your own packages is too cumbersome, you can fall back on pip. I do this for exploratory work all the time. However, I'd recommend getting proper conda packages built before you release to production. The reason is that the conda mechanisms for ensuring that your dependencies are all fully resolved before you deploy (using Constructor or conda env export) won't completely work with pip packages.
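
For exploratory work, mixing pip into a conda environment looks like this; just keep in mind that the pip-installed pieces won't be pinned as precisely by the conda tooling:

    # install what you can from conda first, then fill the gaps with pip
    conda create -n explore python=3.6 numpy
    conda activate explore
    pip install some_pure_python_package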


Thanks For Reading!