Scalable Data Harvesting for AI

Presentation: Scalable Data Harvesting for AI

Workshop Description

Large-scale training sets are the foundation of AI model development. Many companies that build large-scale ML/AI models rely, at least in part, on web scraping frameworks, which are often required to collect terabytes of data from various sources. In this workshop we look at one of these frameworks, arguably one of the most popular: Scrapy.

In the workshop folder you will find a collection of exercises that teach you how to create a new Scrapy project from scratch and collect documents at high throughput rates.
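To give a rough idea of what "high throughput" means in practice, here is a minimal sketch of the kind of values that could go into a Scrapy project's settings.py. The setting names are standard Scrapy settings, but the numbers are illustrative assumptions, not the values used in the workshop exercises.

```python
# Sketch of throughput-related entries for a Scrapy project's settings.py.
# All numeric values below are assumptions chosen for illustration.

# Upper bound on requests Scrapy keeps in flight across all domains.
CONCURRENT_REQUESTS = 64

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Fixed wait (in seconds) between requests to the same site; 0 means no wait.
DOWNLOAD_DELAY = 0

# AutoThrottle adapts the delay to the observed server latency, so the crawl
# slows down when the target site struggles.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Give up on slow responses and retry a limited number of times.
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 2

# Respect robots.txt rules published by the target sites.
ROBOTSTXT_OBEY = True
```

Raising the concurrency limits increases throughput, while AutoThrottle and the per-domain cap keep the load on any single site bounded. Suitable values depend on the target sites and on your own bandwidth.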

Setting Up the Environment

Before working on the exercises, we recommend setting up a virtual environment. PyLadies typically uses uv for that.

Installing uv (if not done already)

pip install uv

Creating a virtual environment

uv venv

Install dependencies

cd workshop/ # Enter the workspace
uv sync
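After syncing, a quick sanity check can confirm that the interpreter runs inside a virtual environment and that Scrapy can be imported. This is a small sketch using only the standard library; it assumes that scrapy is among the project's dependencies, and the file name check_env.py used below is hypothetical.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # Assumption: the workshop dependencies include scrapy.
    print("scrapy importable:", is_installed("scrapy"))
```

Saved as check_env.py in the workshop folder, the script can be run with `uv run python check_env.py`.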

Install additional packages as needed

uv add PACKAGE

Video recording

Re-watch this YouTube stream.

Credits

This workshop was set up by @pyladiesams, @mmbc2008 and @gCaglia.
