#10 Data Collection and Curation for LLM Training
In my last post, I discussed how LLMs work at a high level. Today, I will go into depth on one of the most crucial stages in training a Large Language Model: data collection and curation. While many discussions of AI focus on model architecture or computing power, the true backbone of any powerful LLM is the quality and cleanliness of its dataset. In this post, we’ll dive deep into how data is gathered, filtered, and refined for training cutting-edge AI models, and highlight the startups and companies leading the charge in this space.
1. The Role of Data in LLM Training
The performance of an LLM is only as good as the data it’s trained on. Poor-quality data can lead to inaccurate, biased, or unhelpful AI responses, while well-curated data ensures better generalization, factual accuracy, and ethical AI behavior. The famous adage “Garbage In, Garbage Out” has never been more true than it is here.
Key aspects of data collection and curation include the following (a rough pipeline sketch follows this list):
Finding high-quality sources
Sourcing diverse and representative datasets
Filtering noise and reducing bias
Balancing general knowledge with domain-specific information
Ensuring ethical and legal compliance
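To make this concrete, here is a minimal sketch of how these stages could be composed into a single curation pipeline. The stage functions and the length threshold are hypothetical placeholders of my own, not any particular company's pipeline, but they show how each aspect above becomes a filtering or transformation step.

```python
from typing import Iterable, Iterator

# Hypothetical stage functions -- real pipelines are far more sophisticated.
def is_high_quality(doc: str) -> bool:
    # e.g. language identification, length checks, quality classifiers
    return len(doc.split()) > 50

def remove_noise(doc: str) -> str:
    # e.g. strip boilerplate, markup, and navigation text
    return " ".join(doc.split())

def passes_compliance(doc: str) -> bool:
    # e.g. licensing checks, PII scrubbing, opt-out lists
    return True

def curate(raw_docs: Iterable[str]) -> Iterator[str]:
    """Compose the stages: source -> quality filter -> cleaning -> compliance."""
    for doc in raw_docs:
        if not is_high_quality(doc):
            continue
        doc = remove_noise(doc)
        if passes_compliance(doc):
            yield doc

sample = ["too short", "a much longer article about data curation " * 10]
print(len(list(curate(sample))))  # -> 1: the short snippet is filtered out
```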
Now that we understand why data is crucial, let's explore where it comes from.
2. Where Does the Data Come From?
The Internet is a huge trove of information; however, not all data is created equal, and careful selection is required to ensure that only high-quality data makes it into training.
Open and Public datasets: Most AI models use publicly available data sources (a short loading sketch follows this list). Some of the most well-known sources include:
Common Crawl, a non-profit that maintains a free, open repository of over 250 billion web pages spanning over 18 years, with 3-5 billion new pages added every month
The Pile by EleutherAI, a roughly 800GB diverse, open-source language-modeling dataset
LAION, which provides open datasets for multimodal AI models spanning text, images, and videos
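As a concrete example, several of these corpora can be streamed through the Hugging Face datasets library without downloading hundreds of gigabytes up front. The sketch below streams C4, a cleaned, Common Crawl-derived corpus; the dataset identifier, config, and field name are taken from the public Hub listing as I understand it, so treat them as assumptions and check the dataset card before relying on them.

```python
from datasets import load_dataset  # pip install datasets

# Stream a Common Crawl-derived corpus (C4) record by record.
# The dataset id/config ("allenai/c4", "en") are assumptions based on the Hub listing.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])  # each record carries the raw page text
    if i >= 2:
        break
```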
Proprietary and Licensed data: Since public data may not be sufficient, AI companies also buy or license proprietary data to augment their training corpora and improve model quality. Examples include:
Google using YouTube data to improve its own Gemini models
Companies purchasing news archives from sources like Reuters and Bloomberg
Acquiring books from publishers such as Elsevier and Springer to further enrich their models
Unfortunately, simply collecting vast amounts of data isn’t enough. The real challenge lies in making it useful, which is where data cleaning and filtering come in.
3. Data Cleaning and Filtering: Removing the Noise
Open-source datasets are vast but often contain redundant, missing, or improperly formatted data that can hinder training. To address this, AI companies use sophisticated techniques for cleaning and filtering data.
Deduplication: Duplicate text must be removed to 1) reduce unnecessary computational load and 2) prevent overfitting, where a model becomes overly dependent on specific sources. For instance, without deduplication, a model trained on duplicated Bloomberg pages might incorrectly assume that repeated information is more “true” than less common facts. Key players: MosaicML specializes in dataset deduplication for optimized training, while Hugging Face provides open-source tools for data cleaning. A minimal deduplication sketch follows the side note below.
Side note: In some cases, higher-quality sources may be deliberately trained on multiple times (upsampled) to ensure the model is especially familiar with them.
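Here is a minimal, illustrative take on exact deduplication: normalize each document, hash it, and drop repeats. Production pipelines typically go further with fuzzy matching (e.g. MinHash/LSH) to catch near-duplicates, but the basic idea is the same.

```python
import hashlib
from typing import Iterable, Iterator

def normalize(doc: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(doc.lower().split())

def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Exact dedup via content hashing; fuzzy methods (MinHash/LSH) would also
    catch paraphrased or lightly edited copies."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Markets rose today.", "markets  rose  today.", "A different article."]
print(list(deduplicate(docs)))  # the second, near-identical item is dropped
```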
Filtering Out Toxicity and Bias: It is crucial to prevent AI models from learning harmful biases, misinformation, or toxic content. This is an active area of AI research, as biased training data has led to real-world AI failures. For instance, biased hiring models trained on historical job data have reinforced gender discrimination, and chatbots trained on unfiltered internet text have developed inappropriate behaviors. A toy filtering sketch follows below.
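To show where such a filter sits in the pipeline, here is a deliberately toy heuristic: score each document against a placeholder blocklist and drop anything above a threshold. Real systems rely on trained toxicity and bias classifiers rather than keyword lists, which are crude and easy to evade, so treat every name and number here as illustrative.

```python
# Placeholder terms only -- not a real lexicon, and keyword lists are a crude
# stand-in for the trained classifiers that production pipelines actually use.
BLOCKLIST = {"some_slur", "another_slur"}

def toxicity_score(doc: str) -> float:
    """Hypothetical heuristic: fraction of tokens that appear in the blocklist."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t in BLOCKLIST for t in tokens) / len(tokens)

def keep(doc: str, threshold: float = 0.01) -> bool:
    return toxicity_score(doc) < threshold

docs = ["a perfectly ordinary sentence", "some_slur repeated some_slur"]
print([d for d in docs if keep(d)])  # only the first document survives
```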
Aligning the data into a standardized format: With so many different formats and layouts on the internet, all of the data needs to be condensed into a uniform text stream for the model to digest. Snorkel AI is a key player in this field, automating data structuring and labelling. A minimal normalization sketch follows below.
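As a rough illustration, raw HTML pages are often flattened to plain text and written out as one JSON record per line (JSONL), a common interchange format for training data. The snippet below uses BeautifulSoup for the markup stripping; the record fields are my own choice, not a standard schema or Snorkel AI's API.

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def html_to_record(html: str, source_url: str) -> str:
    """Strip markup and emit one JSONL line: plain text plus minimal metadata."""
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return json.dumps({"text": text, "source": source_url})

page = "<html><body><h1>Title</h1><p>Some article text.</p></body></html>"
print(html_to_record(page, "https://example.com/article"))
# -> {"text": "Title\nSome article text.", "source": "https://example.com/article"}
```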
Legal and Ethical Considerations: Perhaps the biggest challenge in data sourcing is ethical and legal compliance. The creators of online content deserve recognition and compensation, and AI companies have faced growing scrutiny over data scraping practices.
Lawsuits & Controversies: Major publishers, artists, and writers are actively suing AI companies over unauthorized data usage.
Future Implications: As AI evolves, regulations on data ownership and usage rights will play a key role in shaping the industry.
Data collection and curation are the unsung heroes of LLM success. Without high-quality, well-structured, and ethically sourced data, even the most advanced neural architectures would be ineffective.
Moving forward, the future of LLMs will depend on innovations in data sourcing, cleaning, and ethical curation—not just model architecture. The companies and researchers solving these challenges will define the next generation of AI.