Presentation: Scalable Data Harvesting for AI
Large-scale training sets are the foundation of AI model development. Many companies that build large-scale ML/AI models rely, at least in part, on web scraping frameworks, which are often required to scrape terabytes of data from various sources. In this workshop we will work with one of these frameworks, arguably one of the most popular: Scrapy.
In the workshop folder, you will find a collection of exercises that teach you how to create a new Scrapy project from scratch and collect documents at high throughput rates.
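As a rough illustration of what tuning for throughput can look like, here is a minimal sketch of a Scrapy project's `settings.py`. The setting names are standard Scrapy settings, but every value is an illustrative assumption and not the configuration used in the workshop exercises.

```python
# settings.py (sketch): all values are illustrative assumptions,
# not the workshop's actual configuration.

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Raise the global request concurrency above Scrapy's default of 16.
CONCURRENT_REQUESTS = 64
# Cap per-domain concurrency so a single site is not overloaded.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# No fixed minimum delay; AutoThrottle adjusts the delay from observed latency.
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0

# Give up on slow responses sooner than the default 180 seconds.
DOWNLOAD_TIMEOUT = 30
# Number of retries for failed requests, in addition to the first attempt.
RETRY_TIMES = 2
```

Higher concurrency raises throughput only as long as the target sites and your own bandwidth can keep up, so values like these are best adjusted per crawl.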
Before working on the exercises, we recommend setting up a virtual environment.
PyLadies typically uses uv for that.
- Installing UV (if not done already)
pip install uv
- Creating a virtual environment
uv venv
- Install dependencies
cd workshop/ # Enter the workspace
uv sync
- Install additional packages as needed
uv add PACKAGE

Re-watch this YouTube stream.
This workshop was set up by @pyladiesams, @mmbc2008 and @gCaglia.