How to Correctly Install CuDF for Python

Published on: Category: Analytics

Dependencies error

Introduction of CuDF

CuDF is a powerful tool in the era of big data. It utilizes the GPU computing framework Cuda to speed up data ETL and offers a Pandas-like interface. The tool is developed by the RapidAI team.

You can check out their Git repo here.

I love the tool. It gives me a way to make full use of my expensive graphic card, which most of the time is only used for gaming. Most importantly, for a company like Owler, which has to handle 14 million+ company profiles, even a basic data transformation task might take days. This tool can help speed up the process by about 80+ times. GPU computing has been a norm for ML/AL. CuDF makes it also good for the upstream of the flow, the ETL. And ETL is in high demand for almost every company with digital capacity in the world.

The Challenges

It’s nice to have this tool for our day-to-day data work. However, the convenience comes at a cost. The installation of CuDF is quite confusing and hard to follow. It also has some limitations in OS and Python versions. Currently, it only works with Linux and Python 3.7+. And it only provides a conda-forge way to install; otherwise, you need to build from the source.

The errors range from solving environment fail, dependencies conflict, inability to find GPU, and such. I have been installing CuDF into a couple of servers, including personal desktops and AWS. Each time, I have to spend hours dealing with multiple kinds of errors and try them again and again. When it finally works, I don’t know which one is the critical step because there were so many variables. Most ugly, when you have a dependency conflict error, you have to wait for a very long time after 4 solving environment attempts until it displays the conflicting package for you.

But the good news is, from the most recent installation, I can finally understand the cause for the complication and summarize an easy-to-follow guide for anyone who wants to enjoy this tool.

In short, the key is, use miniconda or create a new environment in anaconda to install.

Let me walk through the steps.

Installing the Nvidia Cuda framework (Ubuntu)

Installing Cuda is simple when you have the right machine. You can follow the guide here from Nvidia’s official webpage. If you encounter an installation error, please check if you are selecting the right architecture and meet the hardware/driver requirements.

However, if you have an older version of Cuda installed and wish to upgrade that, the Nvidia guide won’t help you. The correct way is to uninstall the older version of Cuda first before doing anything from the guide. The reason is that, at least in Ubuntu, the installation step will change your apt-get source library; once you do that, you will no longer be able to uninstall the older version, and it may cause conflict.

To uninstall Cuda, you can try the following steps (for Ubuntu).

  • Remove nvidia-cuda-toolkit
sudo apt-get remove nvidia-cuda-toolkit

You may want to use this command instead to remove all the dependencies as well.

sudo apt-get remove --auto-remove nvidia-cuda-toolkit
  • Remove Cuda
sudo apt-get remove cuda

If you forgot to remove the older version of Cuda before installing the new version, you would need to remove all dependencies for Cuda and start over the new version installation.

sudo apt-get remove --auto-remove cuda
  • Install Cuda by following the Nvidia official guide. Link above.

Install CuDF

I highly recommend installing CuDF in miniconda. This will avoid most of the package dependency conflicts. If you have dependency conflicts, you will probably get the below error. Or you will be waiting forever during the last solving environment step.

Miniconda is a much cleaner version of Anaconda. It has very few packages installed out of the box, so it will avoid the CuDF installation running into conflict.

If you already have Anaconda installed and wish to keep it, you can try creating a new conda environment with no default packages for CuDF.

conda create --no-default-packages -n myenv python=3.7