Stop Thinking, Just Do!

Sungsoo Kim's Blog

The Hierarchy of Needs for Training Dataset Development

20 October 2024



Abstract

Training and fine-tuning models depend critically on how you construct your dataset. Dataset construction is part art, part science. We’ll share practical lessons from dataset construction at Character AI and show how to build a data platform that supports rapid, iterative refinement of training data. For LLMs, data scale is much larger and workloads are more diverse, especially for multimodal datasets. To deal with these challenges, we’ll show how LanceDB is used in production to solve many pain points around storing, managing, and querying large-scale AI data.

Recorded live in San Francisco at the AI Engineer World’s Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World’s Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Chang

Chang She is the CEO and cofounder of LanceDB, the developer-friendly, open-source database for multi-modal AI. A serial entrepreneur, Chang has been building DS/ML tooling for nearly two decades and is one of the original contributors to the pandas library. Prior to founding LanceDB, Chang was VP of Engineering at TubiTV, where he focused on personalized recommendations and ML experimentation.

About Noah

Noah is a Research Engineer with a passion for building data systems and ML platforms from the ground up.

He leads the Data Platform team at Character, focusing on accelerating foundation model research, alignment, and product development through internet-scale data mining, prompting tools, and retrieval systems. Making data go vroom while GPUs go brrrr is what makes him (and the team) tick!

