Provides a LLMDataset class for generating and adding data to .csv datasets using LLMs (OpenAI API)
Install the following packages:
pip install openai==1.3.5 pandas==2.1.3 python-dotenv==1.0.0
1. Create a .env file in the root directory of the project and add your OpenAI API key to it:
OPENAI_API_KEY=<your-openai-api-key>
2. Create an empty dataset file using the create_dataset.py script
You can skip this step if you already have a dataset file
3. Create an instance of the LLMDataset class and provide a dataset_path:
from llm_dataset_gen import LLMDataset
data_filepath = "./data/Dataset.csv"
dataset = LLMDataset(dataset_path=data_filepath)4. Call the add_data method by providing the context and num_samples parameters:
dataset_context="For Context, this dataset represents requirements engineering excerpts and their corresponding Language Construct (LC) and Language Quality (LQ) codings"
dataset.add_data(context=dataset_context, num_samples=20)- The
add_datamethod will automatically overwrite/save the dataset file after appending the new data - The
contextparameter is the prompt that will be used to generate the data - The
num_samplesparameter is the number of data samples to generate and add to the dataset
The LLMDataset class is designed to manage a dataset and interact with the OpenAI API to generate new data entries. By using the JSON Mode of the OpenAI API and the gpt-4-1106-preview or gpt-3.5-turbo-1106 model, it can generate new data entries (as JSON Objects) that match the structure of a given dataset, and easily append them to the dataset.
When calling the API, two messages are sent to the model: a dataset_description, and a context
- The
dataset_descriptionis automatically generated by theLLMDatasetclass and describes the column names in the dataset, the number of data entries to generate, and how to format the data entries. This ensures that the generated data is consistent with the structure of the dataset. - The
contextis the prompt that is used to describe the data entries. This is provided by the user as a parameter in theadd_datamethod. - If the dataset contains an
IDcolumn, theLLMDatasetwill ignore the LLM's generated ID and instead use the next available ID in the dataset.