
pyladiesams/scalable-data-harvesting-for-ai-may2026


Scalable Data Harvesting for AI

Workshop Description

Large-scale training sets are the foundation of AI model development. Many companies that build large-scale ML/AI models rely, at least in part, on web scraping frameworks, which are often required to scrape terabytes of data from various sources. In this workshop we will work with one of these frameworks, and arguably one of the most popular: Scrapy.

In the workshop folder, you will find a collection of exercises that teach you how to create a new Scrapy project from scratch and collect documents at high throughput.
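As a rough illustration of what "high throughput" can mean in Scrapy, the sketch below shows a fragment of a project's settings.py that raises concurrency while keeping adaptive throttling on. The setting names are standard Scrapy settings, but the values are illustrative assumptions and are not taken from the workshop exercises.

```python
# settings.py fragment (a sketch; all values are assumptions, tune per target site)

# Respect robots.txt rules of the sites being crawled.
ROBOTSTXT_OBEY = True

# Upper bound on requests Scrapy keeps in flight overall (Scrapy's default is 16).
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to any single domain (default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Fixed delay in seconds between requests to the same site (0 means no fixed delay).
DOWNLOAD_DELAY = 0

# Let the AutoThrottle extension adapt the delay to observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Give up on slow responses sooner than the 180-second default.
DOWNLOAD_TIMEOUT = 30

# Number of retries for failed requests, in addition to the first attempt.
RETRY_TIMES = 2
```

Higher concurrency increases throughput only as long as the target servers and your own bandwidth can keep up, which is why the sketch pairs it with AutoThrottle rather than disabling politeness controls.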

Setting Up the Environment

Before working on the exercises, we recommend setting up a virtual environment. PyLadies typically uses uv for this.

  1. Install uv (if not done already)
pip install uv
  2. Create a virtual environment
uv venv
  3. Install dependencies
cd workshop/ # Enter the workspace
uv sync
  4. Install additional packages as needed
uv add PACKAGE
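To confirm that the environment is usable after these steps, a small check like the following may help. It is a minimal sketch that assumes the workshop's dependencies include the scrapy package (the README does not list them), and the file name check_env.py is hypothetical.

```python
# check_env.py (hypothetical file name): report whether expected packages are importable
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" is an assumed dependency; extend the tuple with other packages as needed.
    for name in ("scrapy",):
        status = "ok" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

Running it with `uv run python check_env.py` from inside workshop/ uses the project's environment rather than the system Python.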

Video recording

Re-watch this YouTube stream.

Credits

This workshop was set up by @pyladiesams, @mmbc2008 and @gCaglia.

About

An introduction to scalable, high-throughput web scraping methods for building large training datasets.
