GM of AI, Executive Director, PyTorch, Linux Foundation
Matt White is the Executive Director of the PyTorch Foundation and GM of AI at the Linux Foundation. He is also the Director of the Generative AI Commons. Matt has nearly 30 years of experience in applied research and standards in AI and data in telecom, media and gaming industries...
Wednesday May 7, 2025 09:30 - 10:00 CEST Station 5
GM of AI, Executive Director, PyTorch, Linux Foundation
Matt White is the Executive Director of the PyTorch Foundation and GM of AI at the Linux Foundation. He is also the Director of the Generative AI Commons. Matt has nearly 30 years of experience in applied research and standards in AI and data in telecom, media and gaming industries...
Wednesday May 7, 2025 10:30 - 10:50 CEST STATION F, 5 Parv. Alan Turing, 75013 Paris, France
As AI continues to push the boundaries of perception and decision-making, robotics emerges as one of its most exciting and demanding playgrounds. In this talk, we’ll explore how the intersection of machine learning and robotics opens up powerful avenues for interaction, manipulation, and embodied intelligence. We will emphasize the critical role of real-world experimentation and data collection in bridging the gap between simulation and deployment. Interestingly, tasks traditionally viewed as complex, like locomotion, have seen significant progress, while seemingly simple behaviors—such as dexterous manipulation—remain open challenges. By grounding AI systems in physical environments, we gain deeper insight into their capabilities and limitations, and identify new directions for research at the intersection of learning, control, and embodiment.
TorchCodec is a new PyTorch library for decoding video and audio data into tensors, on CPU and on CUDA GPUs. It aims to be fast, easy to install, easy to use, and well integrated into the PyTorch ecosystem. In this talk, we’ll present the various decoding capabilities of TorchCodec, how to sample video frames, and we’ll describe more advanced use-cases like streaming videos from the cloud.
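As a small illustration of the kind of decoding TorchCodec enables, here is a minimal sketch; the file path is illustrative and the API details follow recent releases, so check the library docs for your version.

```python
# Minimal sketch: decode video frames into tensors with TorchCodec.
# The file path is illustrative; pass device="cuda" to decode on GPU.
from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("my_video.mp4")
print(decoder.metadata)   # duration, frame rate, resolution, ...
frame = decoder[0]        # a uint8 tensor of shape (C, H, W)
```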
Nicolas is a software engineer in the PyTorch team at Meta, where he mainly contributes to the torchvision library. Prior to that, Nicolas was a research scientist at Columbia University, where he became part of the scikit-learn core development team. Nicolas holds a PhD in machine...
Wednesday May 7, 2025 11:10 - 11:30 CEST STATION F, 5 Parv. Alan Turing, 75013 Paris, France
vLLM has become the community-standard engine for low-latency LLM inference, achieving a 10× increase in usage in 2024 and surpassing 100,000 daily installs by January 2025. Supported by hundreds of contributors and productized through Red Hat AI, vLLM provides a vendor-neutral solution for serving cutting-edge models at scale. This talk outlines a practical blueprint for scaling LLM inference using vLLM, integrating both system-level techniques and model-level optimizations.
We begin by addressing the challenges of deploying LLMs with chain-of-thought reasoning in production. Building on vLLM’s engine architecture, we show how multi-accelerator deployments that combine tensor parallelism, paged-attention scheduling, and prefill–decode disaggregation let a single node efficiently drive multiple AI accelerators, improving throughput without compromising latency.
The second optimization layer focuses on quantization. Based on over 500,000 evaluations across language and vision-language models, we examine the accuracy–speed trade-offs of weight and activation quantization. We introduce new pathways that significantly reduce memory usage while maintaining model quality. Attendees will leave with data-driven insights and ready-to-use configurations for deploying state-of-the-art quantized models in scalable enterprise inference pipelines.
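As a rough illustration of the deployment knobs discussed above, the hedged sketch below shows offline inference with vLLM; the model name, parallelism degree, and quantization setting are illustrative, not the talk’s exact configuration.

```python
# Hedged sketch: vLLM inference sharded across 4 accelerators with optional FP8 quantization.
# Model id and settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported checkpoint
    tensor_parallel_size=4,                     # shard weights across 4 GPUs
    quantization="fp8",                         # optional weight/activation quantization
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```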
This presentation explores the development of Llama 4, a state-of-the-art foundation model designed to excel across a wide range of tasks. We will discuss its key features, including long-context and multimodal understanding. We will also examine Llama 4's potential uses in agentic settings, such as autonomous decision-making and human-AI collaboration, through real-world examples and case studies.
Christian Keller is a Product Manager at Meta AI leading product for PyTorch. He works on enabling AI at scale for the PyTorch community and billions of Meta AI users. Prior to this, Christian was an entrepreneur with a dual machine learning engineer and business background. He has...
Wednesday May 7, 2025 14:00 - 14:20 CEST STATION F, 5 Parv. Alan Turing, 75013 Paris, France
This presentation looks at effective strategies for using Common Crawl's web archive in large-scale research applications, specifically for AI and other ML applications. We will discuss practical approaches to processing and filtering Common Crawl’s datasets, with a focus on how to overcome computational challenges and optimize data pipelines. We will also discuss some of the challenges that users might encounter related to the multilingual and heterogeneous nature of Common Crawl’s data. The talk will cover best practices for data filtering, pre-processing, and storage, to ensure the quality and relevance of extracted information for research tasks. Additionally, we will briefly discuss the ranking mechanism used to determine whether a URL is crawled, and demonstrate how to use the Web Graph as a framework for further research.
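To make this concrete, the hedged sketch below streams a single WARC file with the warcio library; the file URL is a placeholder (real paths come from each crawl's warc.paths.gz listing), and production pipelines would parallelize this across many files.

```python
# Hedged sketch: stream one Common Crawl WARC file and iterate its response records.
# The URL is a placeholder; real paths are listed in each crawl's warc.paths.gz.
import requests
from warcio.archiveiterator import ArchiveIterator

warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/.../example.warc.gz"  # placeholder
with requests.get(warc_url, stream=True) as resp:
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            page_url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            # filtering, language identification, and text extraction would go here
```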
Pedro is a senior research scientist at the Common Crawl Foundation. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro’s research has mainly focused on how data quality impacts ML models’ performance and how to improve these models...
Wednesday May 7, 2025 14:20 - 14:40 CEST STATION F, 5 Parv. Alan Turing, 75013 Paris, France
Training large language models (LLMs) demands more than just raw compute—it requires infrastructure, strategy, and a deep understanding of parallelism. What begins as a single-GPU prototype must eventually scale across thousands of devices, each step introducing new complexity.
This talk dives into the practicalities of ultra-scale training. We'll explore how 5D parallelism—spanning data, tensor, pipeline, context, and expert dimensions—makes it possible to stretch a single training run across massive GPU clusters. Along the way, we’ll cover performance tuning, communication patterns, and architecture choices that impact throughput and hardware efficiency.
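As a small, hedged illustration of how such parallel groups are composed (not the talk's exact recipe), PyTorch's DeviceMesh can lay out data- and tensor-parallel dimensions for a job launched with torchrun; the sizes below are illustrative for 16 GPUs.

```python
# Conceptual sketch: a 2D mesh combining 4-way data parallelism with 4-way tensor
# parallelism (16 ranks total). Run under torchrun; sizes are illustrative.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (4, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh["dp"].get_group()   # gradients are all-reduced within this group
tp_group = mesh["tp"].get_group()   # sharded matmuls communicate within this group
```

Pipeline, context, and expert dimensions extend the same idea to additional mesh axes.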
A key reference for this session is the Ultra-Scale Playbook, which distills best practices and hard-earned lessons from real-world LLM scaling efforts. We’ll walk through highlights of the playbook, tying them into case studies, benchmarks, and hands-on recommendations.
Scaling isn’t just about size—it’s about doing more with what you have. This session offers a comprehensive look at what it really takes to train state-of-the-art models at scale, designed for engineers, researchers, and practitioners ready to move beyond “it fits on one GPU” toward infrastructure that powers trillion-parameter models—efficiently, and at speed.
Post-training techniques have become essential as demand for Reasoning AI systems explodes. This talk provides a practical overview of how to enhance the reasoning capabilities of open-weight models—using Mistral as a working example. We’ll explore the full pipeline: sourcing high-quality reasoning datasets, selecting the right model checkpoints, and using tools that extend the functionality of PyTorch like NVIDIA NeMo and TensorRT-LLM. Whether you’re working on chatbots, agents, or task-specific models, you’ll leave with a clear understanding of the tools and workflows to take advantage of open models.
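As a minimal, hedged sketch of the starting point for such post-training (not the talk's NeMo/TensorRT-LLM pipeline), one supervised fine-tuning step on an open-weight checkpoint with Hugging Face transformers looks roughly like this; the model id and training text are illustrative.

```python
# Hedged sketch: one supervised fine-tuning step on an open Mistral checkpoint.
# Model id and training text are illustrative; real pipelines add data loading,
# optimizers, and distributed training on top of this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

batch = tok("Question: 2+2? Reasoning: adding two and two gives 4.", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])  # causal-LM loss over the sequence
out.loss.backward()
```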
Deep Learning (DL) is driving unprecedented progress across Artificial Intelligence domains, including natural language processing, vision, speech, and multimodal applications. Sustaining this rapid pace of the AI revolution, however, requires practical solutions to the extreme demands that scaling places on the compute, memory, communication, and storage components of modern computing hardware. To address this challenge, we created a deep learning optimization library called DeepSpeed to make distributed model training efficient, effective, and easy on commodity hardware. This talk will focus on DeepSpeed optimizations for improving compute, communication, and I/O of extreme-scale model training.
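A minimal sketch of how a training script adopts DeepSpeed is shown below; the ZeRO stage, precision, and batch-size values are illustrative settings, not prescriptions.

```python
# Hedged sketch: wrapping a model with DeepSpeed. Config values are illustrative.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real network
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},   # partition optimizer state and gradients
    "bf16": {"enabled": True},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# engine.backward(loss) and engine.step() replace loss.backward() / optimizer.step()
```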
Mamba layers are efficient alternatives to standard attention: their training complexity is linear in sequence length, while inference is sequence-length-independent and only requires a small cache. I will discuss a selection of IBM's ongoing work in advancing the state of Mamba training in PyTorch, including: context-parallel training for long-sequence data, Mamba + mixture-of-experts support with expert parallelism, torch-native associative scan ops, and improved DTensor op support.
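For intuition, the recurrence underlying these layers can be written as a plain sequential loop, as in the hedged sketch below; production kernels replace the Python loop with a parallel associative scan, which is what the torch-native scan ops mentioned above target.

```python
# Conceptual sketch of the linear recurrence h_t = a_t * h_{t-1} + b_t * x_t
# behind Mamba-style layers, written as a sequential loop. Real implementations
# fuse and parallelize this with an associative scan kernel.
import torch

def linear_scan(a: torch.Tensor, bx: torch.Tensor) -> torch.Tensor:
    # a, bx: (batch, seq_len, dim); returns all hidden states, same shape
    h = torch.zeros_like(bx[:, 0])
    states = []
    for t in range(bx.shape[1]):
        h = a[:, t] * h + bx[:, t]
        states.append(h)
    return torch.stack(states, dim=1)
```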
Modern GPUs like Hopper and Blackwell are fast, but only after careful optimization. Thunder compiles “education-style” PyTorch models into optimized, distributed PyTorch code. Through a composable plugin system, Thunder lets developers layer in kernel fusion, low-precision operations, memory optimizations, and flexible parallelism strategies, to achieve performance and scale while leaving the original PyTorch code unchanged. This talk will cover how Thunder bridges the gap between ease-of-use and peak performance, and enables teams to easily write custom code transformations to scale models efficiently, reduce GPU waste, and stay in control of their stack.
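In practice the entry point is small, as in the hedged sketch below; the model is a stand-in and the trace-inspection call reflects recent Thunder releases.

```python
# Hedged sketch: compile an unchanged PyTorch module with Lightning Thunder.
import torch
import thunder

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
compiled = thunder.jit(model)             # same call interface as the original module
out = compiled(torch.randn(8, 512))       # first call traces and optimizes the model
print(thunder.last_traces(compiled)[-1])  # inspect the final transformed trace
```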
Parsing errors, unexpected outputs. If you've felt the frustration of trying to wrangle LLMs into producing consistently formatted results, you've likely built complex post-processing pipelines and elaborate prompting schemes. What if there were a way to guarantee structured outputs without these workarounds? Enter structured outputs.
In this talk, we'll explore how model outputs can be precisely constrained using formal specifications (e.g. JSON Schema), why this dramatically improves reliability, and how it reduces sensitivity to prompt engineering. We'll demonstrate advanced use cases using our open source library Outlines, which adds structured output support to the `transformers`, `vllm`, and other inference libraries.
By the end of the session, you'll understand how to implement these techniques in your applications today, enabling your models to generate flawless JSON with minimal latency overhead compared to unconstrained generation.
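As a hedged sketch of the approach (the API shown follows the Outlines 0.x releases, and the model id is illustrative), constrained JSON generation against a Pydantic schema looks like this:

```python
# Hedged sketch: constrain generation to a JSON schema with Outlines (0.x-style API).
from pydantic import BaseModel
import outlines

class Flight(BaseModel):
    origin: str
    destination: str
    price_eur: float

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  # illustrative
generator = outlines.generate.json(model, Flight)
result = generator("Extract the flight: Paris to Berlin for 129 euros.")
print(result)  # a validated Flight instance, guaranteed to parse
```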
Multilingual language models seem to be getting better, but how do we know? In general, language model evaluation is made more uncertain by automatic evaluations which correlate poorly with human ratings, low-quality datasets, and a lack of reproducibility. But for languages other than high-resource languages like English and Mandarin Chinese, these problems are even more consequential. We provide a set of best practices for using existing evaluations. Given the limited number of evaluations for many languages, we highlight languages and tasks that need more benchmarks and outline key considerations for developing new multilingual benchmarks.
The HuggingFace Transformers library is a flagship example of what makes PyTorch special: a dynamic, readable, and hackable framework that scales from quick experiments to production-ready architectures. It began as an implementation of BERT, moved to a "one model, one file" setup—ideal for iteration—and grew into a modular codebase now defining 315+ models. Transformers has become a reference implementation for the field: a source of truth for model architectures, behaviors, and pretraining conventions. Its evolution reflects PyTorch’s own: grounded in Pythonic values, but pragmatic enough to diverge when needed.
PyTorch’s ecosystem has replaced entire toolchains. Scaling models has become simpler: torch.compile brings compiler-level speedups with minimal code changes, and new abstractions like DTensor offer serious performance gains without the low-level complexity.
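For reference, the one-line change mentioned above looks like the sketch below; the module and shapes are stand-ins.

```python
# Minimal illustration of torch.compile; the module and input shapes are illustrative.
import torch

model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
compiled = torch.compile(model)           # compiler-level speedups, one-line change
out = compiled(torch.randn(4, 128, 256))  # (batch, seq, d_model)
```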
Both PyTorch and Transformers inherit Python’s spirit—clarity, flexibility, expressiveness—without being bound by it. PyTorch leans on ATen and C++ kernels under the hood; Transformers increasingly relies on optimized community kernels and hardware-aware implementations from the Hub.
Modularity and readability didn’t just improve maintainability—they grew the community. Lowering the barrier to entry encourages experimentation, contributions, and faster innovation. This talk tracks that journey—from how PyTorch enabled Transformers, to how the virtuous cycle of design, performance, and pragmatism continues to shape the tools driving modern AI.