28 October 2018
If you're in science, packaging is a multi-language problem. You can kind of get by using system-based installations of non-python libraries, but then you're usually stuck if you are working on multiple projects that depend on multiple versions of those tools. Conda is really the only way to go. In addition, Anaconda provides packages compiled against the Intel MKL (Math Kernel Library), which can lead to much faster performance.
pip builds are slower
When people get started with conda/anaconda, they tend to use the root environment (the one that is created by the installer) for everything. Then they start installing lots of stuff into it, and eventually it gets broken, and they're sad.
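Instead, keep the root environment pristine and give each project its own environment. A quick sketch (the project names and versions here are made up):

```shell
# one environment per project; the root env is never touched
conda create -n projA python=3.6 pandas
conda create -n projB python=2.7 scipy
# if an environment breaks, just throw it away and rebuild it
conda env remove -n projA
```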
how I feel when my root env is broken
If you manage anaconda environments for a bunch of users, constructor is a great tool! The installers that you use on OSX/Linux/Windows to install Anaconda and Miniconda are built using constructor. Constructor allows you to build your own installers with different specifications, so you no longer have to rely on the Anaconda installers, and you can build installers tailored to your team's desired data science environment.
Installers created by constructor are a decent way to deploy production applications. The files do tend to be large (since all the binaries are shoved into the executable), but the upside is that no network access is needed when installing.
Conda constructor doesn't work with noarch packages though - which can be inconvenient. (noarch packages are packages that are platform independent - meaning you build one package that works on Linux/OSX/Windows.) There are many libraries (like django) which are actually platform independent but that have platform specific packages already built.
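A minimal construct.yaml sketch - the name, version, and package specs here are hypothetical, so swap in whatever your team actually needs:

```yaml
# construct.yaml -- hypothetical example specification
name: TeamDataScience
version: 1.0.0
channels:
  - defaults
specs:
  - python=3.6
  - numpy
  - pandas
  - scikit-learn
```

Running `constructor` in the directory containing this file produces a self-contained installer for the current platform.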
It's generally better to build environments and discard them rather than continuing to modify them over time, because eventually an environment can get into a weird state. I know this sounds bad - the primary reason this happens is that over time, as you upgrade conda, it can become incompatible with older environments. In addition, as you continue to modify environments over time, they generally grow, which makes them harder (slower) for conda to deal with.
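One way to make that workflow cheap is to keep an environment file in version control and rebuild from it. A hypothetical environment.yml:

```yaml
# environment.yml -- the package list is illustrative
name: myproject
channels:
  - defaults
dependencies:
  - python=3.6
  - pandas=0.23
```

Then, rather than upgrading an aging environment in place, you can run `conda env remove -n myproject` followed by `conda env create -f environment.yml` to get a fresh one.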
Sysadmins and devops often install centralized conda environments for multiple users. These environments are backed by a package cache located in a world-readable but not writable location. When users create their own conda environments, they end up downloading their own packages into their home directories (typically, since they don't have permissions to write to the system-managed package cache). Conda can symlink packages from multiple package caches and it's fine. The problem occurs when home directories are NFS mounted and users try to use their environments on multiple machines AND those machines do not have the same package cache.
On machine A, I have a centralized anaconda installation at /opt/anaconda. Its package cache contains zeromq=4.2.5=hf484d3e_1. So when I create my own environment that uses zeromq, the package is symlinked into my environment. However, if I try to use that environment on machine B, where the package cache does not contain zeromq=4.2.5=hf484d3e_1, my environment will be broken.
You have a few options: keep the centralized installations (and their package caches) identical on every machine, or disable symlinking so that packages are copied into user environments. I would argue for doing both of these. Having identical centralized environments is just nice, because users will always get the same thing no matter where they go. In addition, in shared environments, disabling symlinks is useful because if you ever upgrade the central environment, you could end up breaking user environments (by removing packages they link to).
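Disabling symlinks is a one-line condarc setting - this goes in each user's ~/.condarc (or the system condarc):

```yaml
# force conda to copy packages into environments instead of symlinking
allow_softlinks: false
```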
Conda activate sets your PATH and a few environment variables that tell conda which environment it should install things into. It's almost certainly a distraction when you go to deploy applications. When you deploy applications, it's much easier just to reference the absolute path to the binary in your startup script, rather than calling activate. EDIT: this is generally true, but I've heard some packages (pyspark/rpy2) rely on this pathing.
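A small demonstration of the idea: once you hold the absolute path to an interpreter, you can invoke it with no activation at all - even with an empty PATH (python3 here stands in for a binary inside your deployed environment):

```shell
# resolve the interpreter's absolute path once, up front
PY="$(command -v python3)"
# invoke it directly -- no activate, and PATH isn't even consulted
env PATH="" "$PY" -c 'print("ok")'
```

In a real startup script you'd hardcode the environment's path, e.g. /opt/anaconda/envs/myapp/bin/python (a hypothetical location).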
So you write an application that depends on scikit-learn, and you build a conda recipe for the application to deploy it to production. Because it's not your first rodeo, you do the smart thing and pin scikit-learn to an exact version in your recipe. That's great! Unfortunately, scikit-learn depends on scipy and doesn't specify which version it depends on. So if you tested with scipy 1.1.0, and scipy 1.2.0 is released before you deploy, the conda solver will pick up the latest version and you're running scipy 1.2.0 in production, without ever testing how it works with the rest of your application.
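In recipe terms, the fix is to pin the nested dependency yourself (the version numbers below are illustrative):

```yaml
# meta.yaml fragment -- pin transitive dependencies explicitly
requirements:
  run:
    - scikit-learn ==0.19.1
    - scipy ==1.1.0   # scikit-learn won't pin this for you
```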
Instead, make sure that all nested dependencies are completely resolved before you go to production. There are two ways to do this. One is to use conda env export to export a very precise (down to the build number) specification of every single package you want to deploy. This has to be done on the same platform that you are going to deploy to. The other way is to use constructor. Constructor is especially nice because when you go to install something built with constructor, you have zero dependencies on conda repositories. The only downside is that because everything is shoved into the constructor installer, the file you're going to deploy is usually at least a few hundred megabytes, possibly a few gigs.
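For reference, conda env export pins every package as name=version=build, which is what makes the spec exact - a sketch of its output:

```yaml
# fragment of `conda env export -n myapp > environment.yml` output
name: myapp
dependencies:
  - zeromq=4.2.5=hf484d3e_1
  # ...one fully pinned line per package in the environment
```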
First - check conda-forge. Conda-forge is a community project to build all the things. There is a really good chance that the package you need is there, and can be installed by specifying the conda-forge channel in your conda command: conda install -c conda-forge my_package.
If that fails, you can use conda skeleton to generate a conda recipe from the PyPI package. You'll also need to build and upload the package to your own repository, so this can be cumbersome.
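The workflow looks roughly like this (the package name is hypothetical, and the upload step assumes an anaconda.org account or a repository of your own):

```shell
# generate a recipe from PyPI metadata, then build it
conda skeleton pypi some_pypi_package
conda build some_pypi_package
# upload the built package to your own repository/channel
anaconda upload <path-to-built-package>.tar.bz2
```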
If building your own packages is too cumbersome, you can fall back on pip. I do this for exploratory work all the time. However, I'd recommend getting proper conda packages built before you release to production. The reason is that the conda mechanisms for ensuring that your dependencies are all fully resolved before you go to deploy (using constructor, or conda env export) won't completely work on pip packages.