How to create new examples using prompting
If you're starting to prototype a new chain or agent, you may not have much data to use to check how the component will behave. Simulating data and using LLMs to augment your dataset can help you kickstart your development process. Below are a couple of approaches you could adapt to your use case.
Synthetic data generation is not a full substitute for real data. The quality of the data generated by these methods depends on many factors, including the model, the prompt, and the existing data. This approach may not be appropriate for the task you are trying to evaluate. Always inspect your datasets to ensure they capture the information you want to model.
Prerequisites
These guides assume you have already connected to LangSmith and have access to a dataset you want to expand.
Prompting for New Inputs
Expanding the dataset to cover a broader semantic range requires more than just paraphrasing: we need to generate completely new and plausible inputs. Here, we demonstrate a simple method for doing so. We sample 5 examples at random and ask the model for 6 novel inputs that are consistent with the application the examples imply, yet distinct enough to have plausibly come from different users.
Step 1: Define the generator for new inputs
First, we will create a chain for generating new inputs. Similar to the paraphrasing chain, we'll use the ChatOpenAI model and provide it with system prompts containing instructions. We will also reuse the NumberedListOutputParser to parse the output into a list of strings.
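# Import paths follow the legacy LangChain layout that LLMChain belongs to;
# adjust them to match your installed version.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import NumberedListOutputParser
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

input_gen_llm = ChatOpenAI(temperature=0.5)
input_gen_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(
            "You are a creative assistant tasked with coming up with new inputs for an application."
        ),
        SystemMessagePromptTemplate.from_template(
            "The following are some examples of inputs: \n{examples}"
        ),
        HumanMessagePromptTemplate.from_template(
            "Can you generate {n_inputs} unique and plausible inputs that could be asked by different users?"
        ),
    ]
)
input_gen_chain = LLMChain(
    llm=input_gen_llm,
    prompt=input_gen_prompt_template,
    output_parser=NumberedListOutputParser(),
)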
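To make the parser's role concrete, here is a minimal standard-library sketch of what parsing a numbered list into strings might look like. It is an approximation for illustration only, not LangChain's implementation; the function name parse_numbered_list and the sample reply are hypothetical.

```python
import re
from typing import List


def parse_numbered_list(text: str) -> List[str]:
    """Extract the items of a numbered list such as '1. foo' from model output."""
    # Match a leading number, a dot, and whitespace, then capture the rest of the line.
    pattern = re.compile(r"^\s*\d+\.\s+(.*\S)\s*$", re.MULTILINE)
    return pattern.findall(text)


# Hypothetical model reply: a preamble line followed by numbered items.
reply = (
    "Sure, here are some inputs:\n"
    "1. How do I reset my password?\n"
    "2. Can I export my data?"
)
print(parse_numbered_list(reply))
```

Lines that do not start with a number, such as the preamble, are skipped, so only the list items reach the dataset.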
Step 2: Generate new inputs concurrently
Next, write a function that generates new inputs from a set of example inputs. As before, the function is asynchronous, so several generation calls can be awaited concurrently with asyncio.
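from typing import List


async def generate_new_inputs(
    example_inputs: List[str], n_inputs: int
) -> List[str]:
    """
    A function to generate new inputs based on a list of example inputs.
    """
    # Format the examples as a bulleted list for the {examples} prompt variable.
    example_inputs_str = "\n".join(f"- {example}" for example in example_inputs)
    # The chain's output parser turns the numbered reply into a list of strings.
    return await input_gen_chain.arun(
        examples=example_inputs_str, n_inputs=n_inputs
    )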
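A single call to the function above does not run anything in parallel by itself; the benefit of asyncio appears when several calls are awaited together. The following is a minimal sketch of that pattern, assuming you want one generation call per batch of examples. The helper generate_in_batches and the stand-in fake_generate are hypothetical names introduced here so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable, List


async def generate_in_batches(
    generate: Callable[[List[str], int], Awaitable[List[str]]],
    example_batches: List[List[str]],
    n_inputs: int,
) -> List[str]:
    """Run one generation call per batch concurrently and flatten the results."""
    # asyncio.gather awaits all calls together and returns results in batch order.
    results = await asyncio.gather(
        *(generate(batch, n_inputs) for batch in example_batches)
    )
    return [new_input for batch_result in results for new_input in batch_result]


async def fake_generate(example_inputs: List[str], n_inputs: int) -> List[str]:
    """Hypothetical stand-in for generate_new_inputs so the sketch runs offline."""
    await asyncio.sleep(0)  # Yield control, as a real network call would.
    return [f"{example_inputs[0]} (variant {i + 1})" for i in range(n_inputs)]
```

In a real run you would pass generate_new_inputs in place of fake_generate, for example asyncio.run(generate_in_batches(generate_new_inputs, batches, 6)) from a script.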
Step 3: Generate new inputs for the dataset
Now you can use the Client from LangSmith to access your dataset and generate new inputs for it. Note that unlike paraphrasing, new inputs do not come with corresponding outputs. Therefore, you might need to manually label them or use a separate model to generate the outputs.
In the main function, example_inputs contains a random sample of 5 inputs from your dataset. We then call generate_new_inputs to create 6 new inputs based on these examples.
import random
from typing import List

from langsmith import Client


async def main_input_gen(
    client: Client, n_inputs: int, dataset_name: str, sample_size: int = 5
) -> List[str]:
    """
    The main function to generate new inputs for a dataset.
    """
    dataset = client.read_dataset(dataset_name=dataset_name)
    examples = list(client.list_examples(dataset_id=dataset.id))
    # random.sample raises ValueError when asked for more items than exist,
    # so cap the sample size at the number of available examples.
    sampled = random.sample(examples, min(sample_size, len(examples)))
    # Use the first input value of each sampled example as its text.
    example_inputs = [next(iter(example.inputs.values())) for example in sampled]
    new_inputs = await generate_new_inputs(example_inputs, n_inputs)
    for new_input in new_inputs:
        # As we don't have corresponding outputs for new inputs, we leave outputs as an empty dict.
        client.create_example(
            inputs={"input": new_input},
            outputs={},
            dataset_id=dataset.id,
        )
    return new_inputs
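Now, run the main function to generate new inputs for your dataset. The top-level await below works in a notebook; in a script, wrap the call in asyncio.run(...). The client and dataset_name variables are assumed to be defined already, per the prerequisites.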
n_inputs = 6
results = await main_input_gen(client, n_inputs, dataset_name)
Once this is complete, your dataset should be populated with new examples that differ from the original ones to a greater extent than paraphrases do.
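Because the new examples are stored with empty outputs, you may want to prepare draft labels for review. The sketch below is one possible way to do that, assuming a separate model is exposed as a plain callable. The helper draft_outputs, the predict parameter, and the "output" key are all hypothetical choices made for illustration; drafts produced this way should still be checked by a person.

```python
from typing import Callable, Dict, List


def draft_outputs(
    new_inputs: List[str], predict: Callable[[str], str]
) -> List[Dict[str, Dict[str, str]]]:
    """Pair each new input with a draft output produced by `predict`."""
    return [
        # The "input" key mirrors the one used when creating the examples;
        # the "output" key is an assumption for this sketch.
        {"inputs": {"input": text}, "outputs": {"output": predict(text)}}
        for text in new_inputs
    ]


# str.upper stands in for a real labeling model so the sketch runs offline.
drafts = draft_outputs(["how do i reset my password?"], str.upper)
print(drafts)
```

Keeping the labeling step as a pure function makes it easy to swap in a different model or to route the drafts through manual review before they are written to the dataset.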
Considerations
Remember that the quality and relevance of the generated inputs depend heavily on the model and prompt used, and this approach may not be suitable for all use cases. Always verify your generated inputs to ensure they align with the system's context and are appropriate for your application.
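One lightweight check that can complement manual review is to drop generated inputs that are nearly identical to existing ones. The following is a rough sketch using character-level similarity from the standard library; the helper filter_near_duplicates and the 0.85 threshold are assumptions chosen for illustration, and character similarity is only a crude proxy for semantic overlap.

```python
from difflib import SequenceMatcher
from typing import List


def filter_near_duplicates(
    candidates: List[str], existing: List[str], threshold: float = 0.85
) -> List[str]:
    """Keep candidates that are not too similar to existing or already kept inputs."""
    kept: List[str] = []
    for candidate in candidates:
        # Compare against the original inputs and the candidates accepted so far.
        seen = existing + kept
        is_duplicate = any(
            SequenceMatcher(None, candidate.lower(), other.lower()).ratio() >= threshold
            for other in seen
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept
```

For example, passing the list returned by the main function as candidates and the sampled example inputs as existing would filter out near-copies before you inspect the remainder by hand.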
By using a combination of paraphrasing, generating new inputs, and other augmentation methods, you can expand and diversify your dataset effectively in the early stages of development to verify whether a feature is technically feasible and robust enough to be deployed in production.