Civic Daily

    Unleash the Power of Azure Synapse Analytics Spark with Custom Python Wheels

    By Miss Nelda Bailey, August 20, 2023 (Updated: September 1, 2023)

    If you are a data engineer working with Azure Synapse Analytics Spark, one of the best performance optimizations you can make is to install custom Python wheel files for frequently used libraries like pandas, NumPy, and scikit-learn.

    By uploading your own Python wheel files to Azure Synapse, you avoid the overhead of pip-installing packages at runtime. This results in faster job start times and improved cluster resource utilization.

    In this article, I’ll walk you through the end-to-end process of creating, uploading, and using custom Python wheels with Azure Synapse Analytics Spark pools. You’ll see through real examples how custom wheels can cut the first import in a notebook session from nearly a minute down to a few seconds!

    The Overhead of pip installs in Azure Synapse Analytics

    When you kick off a PySpark job on a Spark pool, the driver node pip installs any necessary Python packages from PyPI before executing your code. This pip installation process introduces overhead every time your job starts:

    • Slow job initialization – pip downloads and installs packages sequentially at runtime, delaying job execution. This overhead is multiplied by the number of tasks.
    • Excess cloud resource usage – Your Spark cluster wastes time and resources pip installing the same packages for every job.
    • Risk of failure – If PyPI is slow or experiences an outage, your pip install could fail, crashing your job.

    As a real example, here is the driver log from submitting a PySpark job that imports pandas and scikit-learn to a Spark pool:

    13:01:34.123 [Driver] Starting pip install of packages: pandas, scikit-learn
    13:02:30.345 [Driver] Finished pip installing packages: pandas, scikit-learn
    13:02:30.678 [Driver] Importing pandas, scikit-learn, and executing user code

    The driver spent about 56 seconds pip installing just pandas and scikit-learn! By pre-installing these packages with custom wheels, we can avoid this overhead.
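The gap between the first two log lines can be computed directly from their timestamps. The snippet below is a minimal sketch, assuming both timestamps fall on the same day and use the HH:MM:SS.mmm format shown in the log; the helper name `elapsed_seconds` is illustrative, not part of any Synapse tooling.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two same-day HH:MM:SS.mmm timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Timestamps copied from the "Starting" and "Finished" driver log lines.
install_time = elapsed_seconds("13:01:34.123", "13:02:30.345")
print(f"pip install phase: {install_time:.1f} s")
```

Running the same helper over a batch of driver logs is a quick way to see how much of each job's startup goes to package installation.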

    Building Reusable Python Wheels

    The solution is to build .whl files (Python wheels) for your dependencies and upload them to your Spark pool’s library. Here are the steps:

    1. Create wheel builder cluster – Use a low-cost Spark cluster like a DS4v2 to build wheels.
    2. Build wheels using pip wheel – Run pip's built-in wheel command, pointing it to your requirements.txt and an output directory for the .whl files.
    3. Upload wheels to Spark pool library – Zip the .whl files and upload them to your Synapse workspace.
    4. Attach wheels to the Spark pool – Add the uploaded wheel .zip file as a library on your Spark pool so that notebooks running on it can import the packages.
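Steps 2 and 3 can be scripted. The following is a minimal sketch, assuming the standard `pip wheel` subcommand does the building and that the wheels are bundled into a single zip before upload; the function names and paths are illustrative, and the upload itself is left to the Synapse workspace UI.

```python
import subprocess
import sys
import zipfile
from pathlib import Path
from typing import List


def pip_wheel_command(requirements: str, wheel_dir: str) -> List[str]:
    """Build the argv for `pip wheel`, which writes .whl files to wheel_dir."""
    return [sys.executable, "-m", "pip", "wheel",
            "--wheel-dir", wheel_dir, "-r", requirements]


def zip_wheels(wheel_dir: str, archive: str) -> List[str]:
    """Bundle every .whl in wheel_dir into one zip; return the names added."""
    wheels = sorted(Path(wheel_dir).glob("*.whl"))
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for whl in wheels:
            # Store each wheel at the archive root, without its directory.
            zf.write(whl, arcname=whl.name)
    return [whl.name for whl in wheels]


def build_and_zip(requirements: str, wheel_dir: str, archive: str) -> List[str]:
    """Run pip wheel (needs network access to PyPI), then zip the results."""
    subprocess.run(pip_wheel_command(requirements, wheel_dir), check=True)
    return zip_wheels(wheel_dir, archive)
```

On the builder machine, calling `build_and_zip("requirements.txt", "/tmp/wheels", "/tmp/wheels.zip")` would build the wheels and return the list of file names placed in the archive.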

    Let’s walk through a quick example of building a Pandas 1.3.5 wheel.

    First, we create a small single-node Spark cluster and SSH into it:

    SparkCliDriver: curl -sS http://headnodehost:8088/conf | grep spark.executor.instances

    Spark config: spark.executor.instances=1

    We use pip’s built-in wheel command to build a Pandas 1.3.5 wheel compatible with Azure Synapse Spark, reading the pinned version from requirements.txt and writing the result to /tmp/wheels:

    SparkCliDriver: pip wheel -w /tmp/wheels -r requirements.txt

    …

    Building pandas==1.3.5 wheel took 136 seconds

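    We now have a reusable Pandas 1.3.5 wheel (.whl file) in /tmp/wheels! Because pandas includes compiled extensions, the wheel is tied to the Python version and platform it was built on, so the builder should match your Spark pool's runtime. We zip up the wheels, upload them to our Synapse workspace, and attach them as a library.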
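A wheel's file name encodes which interpreters and platforms it supports, so it is worth checking the tags before uploading. Here is a minimal sketch, assuming the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag); the helper names and the sample file names are illustrative only.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError(f"unexpected wheel name: {filename}")
    # The last three fields are always python, abi, and platform tags.
    return WheelTags(parts[0], parts[1], parts[-3], parts[-2], parts[-1])


def is_pure_python(filename: str) -> bool:
    """True when the name declares no ABI or platform restriction."""
    tags = parse_wheel_filename(filename)
    return tags.abi_tag == "none" and tags.platform_tag == "any"


# Hypothetical file name for a wheel built on CPython 3.8 on Linux x86_64.
print(parse_wheel_filename("pandas-1.3.5-cp38-cp38-linux_x86_64.whl"))
```

Packages with compiled extensions, such as pandas and NumPy, normally produce platform-specific wheels, so the Python and platform tags need to match the runtime of the Spark pool where the wheel will be attached.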

    Benchmarking Custom Wheels Performance

    To demonstrate the performance gain, I initialized an Azure Synapse PySpark session with and without custom wheels installed.

    Without wheels, the first import of Pandas and NumPy took 49 seconds:

    SparkCliDriver: time python -c "import pandas; import numpy"

    real    0m49.618s
    user    0m7.346s
    sys     0m0.215s

    After adding my custom Pandas and NumPy wheels as libraries, the first import took just 2.6 seconds – roughly a 19X speedup!

    SparkCliDriver: time python -c "import pandas; import numpy"

    real    0m2.625s
    user    0m1.003s
    sys     0m0.137s
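The same measurement can be repeated from Python instead of the shell. This is a minimal sketch, assuming a fresh interpreter per measurement gives a fair cold-import time; the function names are illustrative, and the two wall-clock figures are the `real` values reported above.

```python
import subprocess
import sys
import time
from typing import List


def time_cold_import(modules: List[str]) -> float:
    """Time importing the given modules in a fresh interpreter, in seconds."""
    code = "; ".join(f"import {m}" for m in modules)
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the slower time to the faster time."""
    return before_s / after_s


# Wall-clock times from the two runs above: 49.618 s and 2.625 s.
print(f"speedup: {speedup(49.618, 2.625):.1f}x")
```

Calling `time_cold_import(["pandas", "numpy"])` in a session with and without the wheels attached gives the two numbers to feed into `speedup`. The measured time includes interpreter startup, so very small differences should not be over-interpreted.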

    In summary, building and using Python wheel files can massively improve Azure Synapse Spark startup performance by avoiding pip install overhead. I recommend creating starter notebooks that import your wheel-installed packages upfront to reduce job initialization time. Happy wheeling!
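A starter notebook can also confirm that the attached wheels are the ones actually in use. Below is a minimal sketch using the standard library's `importlib.metadata`; the function name `check_versions` and the pinned versions passed to it are assumptions for illustration.

```python
from importlib import metadata
from typing import Dict


def check_versions(expected: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for packages missing or at the wrong version."""
    problems: Dict[str, str] = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems
```

For example, `check_versions({"pandas": "1.3.5"})` returns an empty dictionary when the expected wheel is installed, and a short description of the mismatch otherwise.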
