still don't have a title

Investigating the popularity of Python build backends over time (II)

Last year, I analyzed the popularity of build backends used in pyproject.toml files over time. This post is the update for 2024.

Analysis

Like last year, I’m using Tom Forbes’ fantastic dataset containing information about every file within every release uploaded to PyPI. To get the current dataset, I followed the same process as in last year’s analysis, so I won’t repeat all the details here. Instead, I’ll highlight the main steps:

Downloading all the parquet files took roughly a week due to GitHub’s rate limiting. Tom suggested leveraging the Git v2 protocol to fetch the data directly. This approach could bypass rate limiting and complete the download of all pyproject.toml files in just 20 minutes(!). However, I couldn’t find sufficient documentation that would help me to implement this method, so this will have to wait until next year’s analysis.

Once all the data is downloaded, I perform some preprocessing:

Results

I modified the plots a bit from last year to make them easier to read. Most notably, I binned the data into quarters to make the plots less noisy, and secondly, I stopped stacking the relative distribution plots to make the percentages directly readable.

The first plot shows the absolute number of uploads (in thousands) by quarter and build backend.

Absolute distribution of build backends by quarter

The second plot shows the relative distribution of build backends by quarter.

Relative distribution of build backends by quarter

In 2024, we observe that:

The script for downloading and analyzing the data is available in my GitHub repository. If someone has insights or examples on implementing the Git v2 protocol to download the pyproject.toml file given the repository URL and its hash, I’d love to hear from you!