PyTorch Day France
7 May 2025 | Paris, France
Wednesday May 7, 2025 14:20 - 14:40 CEST
This presentation looks at effective strategies for using Common Crawl's web archive in large-scale research, specifically for AI and other ML applications. We will discuss practical approaches to processing and filtering Common Crawl's datasets, with a focus on how to overcome computational challenges and optimise data pipelines. We will also discuss some of the challenges that users might encounter because of the multilingual and heterogeneous nature of Common Crawl's data. The talk will cover best practices for data filtering, pre-processing, and storage that help ensure the quality and relevance of the information extracted for research tasks. Additionally, we will briefly discuss the ranking mechanism used to determine whether a URL is crawled, and demonstrate how to use the Web Graph as a framework for further research.
Speakers
Pedro Ortis

Senior Research Scientist, Common Crawl
Pedro is a senior research scientist at the Common Crawl Foundation. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro's research has mainly focused on how data quality impacts ML models' performance and how to improve these models...
STATION F 5 Parv. Alan Turing, 75013 Paris, France
