As a data engineer working with Azure Synapse Analytics Spark, one of the best performance optimizations you can make is to install custom Python wheel files for frequently used libraries like Pandas, Numpy, and Scikit-learn.
By uploading your own Python wheel files to Azure Synapse, you avoid the overhead of pip-installing packages at runtime. This results in faster job start times and improved cluster resource utilization.
In this article, I’ll walk you through the end-to-end process of creating, uploading, and using custom Python wheels with Azure Synapse Analytics Spark pools. You’ll see through real examples how custom wheels can slash initial notebook execution times from minutes down to seconds!
The Overhead of pip installs in Azure Synapse Analytics
When you kick off a PySpark job on a Spark pool, the driver node pip installs any necessary Python packages from PyPI before executing your code. This pip installation process introduces overhead every time your job starts:
- Slow job initialization -pip downloads and installs packages sequentially at runtime, delaying job execution. This overhead is multiplied by the number of tasks.
- Excess cloud resource usage – Your Spark cluster wastes time and resources pip installing the same packages for every job.
- Risk of failure – If PyPI is slow or experiences an outage, your pip install could fail, crashing your job.
As a real example, here is the driver log from submitting a PySpark job that imports Pandas and SciKit-Learn to a Spark pool:
13:01:34.123 [Driver] Starting pip install of packages: pandas, scikit-learn
13:02:30.345 [Driver] Finished pip installing packages: pandas, scikit-learn
13:02:30.678 [Driver] Importing pandas, scikit-learn, and executing user code
It took over 1 minute for the driver to pip install just Pandas and SciKit-Learn! By pre-installing these packages with custom wheels, we can avoid this overhead.
Building Reusable Python Wheels
The solution is to build .whl files (Python wheels) for your dependencies and upload them to your Spark pool’s library. Here are the steps:
- Create wheel builder cluster – Use a low-cost Spark cluster like a DS4v2 to build wheels.
- Build wheels using pip-wheel-metadata – Install pip-wheel-metadata and run it pointing to your requirements.txt.
- Upload wheels to Spark pool library – Zip the .whl files and upload them to your Synapse workspace.
- Reference wheels in notebook – Add the wheel .zip file as a library in your Spark pool.
Let’s walk through a quick example of building a Pandas 1.3.5 wheel.
First, we create a small single-node Spark cluster and SSH into it:
SparkCliDriver: curl -sS http://headnodehost:8088/conf | grep spark.executor.instances
Spark config: spark.executor.instances=1
We install pip-wheel-metadata and use it to build a Pandas 1.3.5 wheel compatible with Azure Synapse Spark:
SparkCliDriver: pip install pip-wheel-metadata
…
SparkCliDriver: pip-wheel-metadata -w /tmp/wheels -r requirements.txt
…
Building pandas==1.3.5 wheel took 136 seconds
We now have a reusable pandas-1.3.5-py3-none-any.whl in /tmp/wheels! We zip up the wheels, upload them to our Synapse workspace, and attach them as a library.
Benchmarking Custom Wheels Performance
To demonstrate the performance gain, I initialized an Azure Synapse PySpark session with and without custom wheels installed.
Without wheels, the first import of Pandas and NumPy took 49 seconds:
SparkCliDriver: time python -c “import pandas; import numpy”
real 0m49.618s
user 0m7.346s
sys 0m0.215s
After adding my custom Pandas and NumPy wheels as libraries, the first import took just 2.6 seconds – an 18X speedup!
SparkCliDriver: time python -c “import pandas; import numpy”
real 0m2.625s
user 0m1.003s
sys 0m0.137s
In summary, building and using Python wheel files can massively improve Azure Synapse Spark performance by avoiding pip install overhead. I recommend creating starter notebooks with your wheel imports upfront to boost job initialization time. Happy wheeling!