Prepare a dataset for training and validation of a Large Language Model (LLM)

In this short tutorial, we will learn how to prepare a balanced dataset that can be used to train a large language model (LLM).

By Shaoni Mukherjee

Generating a dataset for training a Large Language Model (LLM) involves several crucial steps to ensure it captures the nuances of language. From selecting diverse text sources to preprocessing and splitting the dataset, each stage requires attention to detail. It is also important to balance the dataset's size and complexity to optimize the model's learning process. By curating a well-structured dataset, one lays a strong foundation for training an LLM capable of understanding and generating natural language with proficiency and accuracy.

This brief guide will walk you through generating a classification dataset to train and validate a Large Language Model (LLM). While the dataset created here is small, it lays a solid foundation for exploration and further development.

Datasets for Fine-Tuning and Training LLMs

Several sources provide great datasets for fine-tuning and training your LLMs. A few of them are listed below:

  1. Kaggle: Kaggle hosts datasets across a wide range of domains. You can find datasets for NLP tasks, including text classification, sentiment analysis, and more. Visit: Kaggle Datasets
  2. Hugging Face Datasets: Hugging Face provides large datasets specifically curated for natural language processing tasks, with easy integration into their transformers library for model training (see the short loading sketch after this list). Visit: Hugging Face Datasets
  3. Google Dataset Search: Google Dataset Search is a search engine specifically designed to help researchers locate online data that is freely available for use. You can find a variety of datasets for language modeling tasks here. Visit: Google Dataset Search
  4. UCI Machine Learning Repository: While not exclusively focused on NLP, the UCI Machine Learning Repository contains various datasets that can be used for language modeling and related tasks. Visit: UCI Machine Learning Repository
  5. GitHub: GitHub hosts numerous repositories that contain datasets for different purposes, including NLP. You can search for repositories related to your specific task or model architecture. Visit: GitHub
  6. Common Crawl: Common Crawl is a nonprofit organization that crawls the web and freely provides its archives and datasets to the public. It can be a valuable resource for collecting text data for language modeling. Visit: Common Crawl
  7. OpenAI Datasets: OpenAI periodically releases datasets for research purposes. These datasets often include large-scale text corpora that can be used for training LLMs. Visit: OpenAI Datasets
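
As an example of how quickly one of these sources can be used, the sketch below pulls a spam-detection corpus from the Hugging Face Hub with the datasets library. The "sms_spam" dataset identifier is used here only for illustration and is an assumption on our part; any dataset listed on the Hub can be substituted.

# Minimal sketch: load a dataset from the Hugging Face Hub.
# The "sms_spam" identifier is assumed for illustration; swap in any Hub dataset.
from datasets import load_dataset

dataset = load_dataset("sms_spam")   # downloads and caches the dataset
print(dataset)                       # shows the available splits and columns
print(dataset["train"][0])           # inspect the first example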

Code to Create and Prepare the Dataset

The code and concept for this article are inspired by Sebastian Raschka's excellent course, which provides comprehensive insights into constructing a large language model from the ground up.

  1. We will start by importing the necessary packages.
import pandas as pd  # for data processing and manipulation
import urllib.request  # for downloading files from URLs
import zipfile  # for extracting zip archives
import os  # for operating system utilities such as renaming files
from pathlib import Path  # for working with file paths
  2. The lines below define the URL of the raw dataset and the local paths used for downloading and extracting it.
# URL of the dataset zip file and local paths for download and extraction
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
data_zip_path = "sms_spam_collection.zip"
data_extracted_path = "sms_spam_collection"
data_file_path = Path(data_extracted_path) / "SMSSpamCollection.tsv"
  3. Next, we will use the 'with' statement both to open the URL and to write the downloaded bytes to a local file, and then unzip the archive.
# Downloading the file
with urllib.request.urlopen(url) as response:
    with open(data_zip_path, "wb") as out_file:
        out_file.write(response.read())

# Unzipping the file
with zipfile.ZipFile(data_zip_path, "r") as zip_ref:
    zip_ref.extractall(data_extracted_path)
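
If you run this notebook more than once, you can optionally skip re-downloading when the file is already present. The guard below is a small addition wrapped around the same urllib and zipfile calls, not part of the original tutorial.

# Optional: skip the download and extraction if the .tsv file already exists
if data_file_path.exists():
    print(f"{data_file_path} already exists, skipping download and extraction.")
else:
    with urllib.request.urlopen(url) as response:
        with open(data_zip_path, "wb") as out_file:
            out_file.write(response.read())
    with zipfile.ZipFile(data_zip_path, "r") as zip_ref:
        zip_ref.extractall(data_extracted_path)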
  4. The code below renames the extracted file so that it carries the ".tsv" extension.
# Add .tsv file extension
original_file_path = Path(data_extracted_path) / "SMSSpamCollection"
os.rename(original_file_path, data_file_path)
print(f"File downloaded and saved as {data_file_path}")

After successful execution of this code, we will see the message "File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv".

  5. Use the pandas library to load the saved dataset and explore the data.
raw_text_df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
print(raw_text_df.head())
print(raw_text_df["Label"].value_counts())

Label
ham 4825
spam 747
Name: count, dtype: int64

  6. Let's define a function with pandas to generate a balanced dataset. First, we count the number of 'spam' messages, then randomly sample the same number of 'ham' messages so that both classes have equal counts.
def create_balanced_dataset(df):
    # Count the instances of "spam"
    num_spam = df[df["Label"] == "spam"].shape[0]
    # Randomly sample "ham" instances to match the number of "spam" instances
    ham_subset_df = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    # Combine the "ham" subset with all "spam" instances
    balanced_df = pd.concat([ham_subset_df, df[df["Label"] == "spam"]])
    return balanced_df

balanced_df = create_balanced_dataset(raw_text_df)

Let us run value_counts again to check the counts of 'spam' and 'ham'.

print(balanced_df["Label"].value_counts())

Label
ham 747
spam 747
Name: count, dtype: int64

As we can see, the data frame is now balanced.

# Change the 'Label' column from string labels to integer classes
balanced_df['Label'] = balanced_df['Label'].map({"ham": 1, "spam": 0})
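
As a side note, recent versions of pandas (1.1 and later) can express the same balancing step with a single groupby().sample() chain. The sketch below is an equivalent alternative shown for comparison, not the code used above; balanced_alt_df is just an illustrative name.

# Alternative balancing: sample as many rows from each label as the rarest label has
min_count = raw_text_df["Label"].value_counts().min()
balanced_alt_df = (
    raw_text_df.groupby("Label")
    .sample(n=min_count, random_state=123)
    .reset_index(drop=True)
)
print(balanced_alt_df["Label"].value_counts())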
  7. Next, we will write a function that randomly splits the dataset into train, validation, and test sets.
def random_split(df, train_frac, valid_frac):
    # Shuffle the entire data frame
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    # Compute the split boundaries
    train_end = int(len(df) * train_frac)
    valid_end = train_end + int(len(df) * valid_frac)

    train_df = df[:train_end]
    valid_df = df[train_end:valid_end]
    # The remaining rows form the test set
    test_df = df[valid_end:]

    return train_df, valid_df, test_df

train_df, valid_df, test_df = random_split(balanced_df, 0.7, 0.1)
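
A quick sanity check, added here for convenience, confirms that the splits follow the intended 70/10/20 proportions.

# With 1,494 balanced rows this prints roughly 1045, 149, and 300
print(len(train_df), len(valid_df), len(test_df))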

Next, save the three splits locally as CSV files.

train_df.to_csv("train_df.csv", index=False)
valid_df.to_csv("valid_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)
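
To confirm that the files round-trip cleanly, they can be read back with pandas; this verification step is an addition and not required for the tutorial.

# Reload the saved splits and confirm shapes and label distributions are preserved
for name in ["train_df.csv", "valid_df.csv", "test_df.csv"]:
    split_df = pd.read_csv(name)
    print(name, split_df.shape, split_df["Label"].value_counts().to_dict())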

Conclusion

Building a large language model (LLM) is quite complex. However, as the AI field evolves and new tools emerge, the process is becoming more approachable. From laying the groundwork with robust algorithms to fine-tuning hyperparameters and managing vast datasets, every step is critical in creating a model capable of understanding and generating human-like text.

One crucial aspect of training LLMs is creating high-quality datasets. This involves sourcing diverse and representative text corpora, preprocessing them to ensure consistency and relevance, and, perhaps most importantly, curating balanced datasets to avoid biases and enhance model performance.

With this, we come to the end of the article, having seen how easy it is to create a classification dataset from a delimited file. We highly recommend using this article as a base and building more complex datasets from it.

We hope you enjoyed reading the article!
