How to paraphrase examples to expand your dataset
If you're starting to prototype a new chain or agent, you may not have much data on hand to check how the component will behave. Simulating data and using LLMs to augment your dataset can help you kickstart your development process. Below is an approach you can adapt to your use case.
Synthetic data generation is not a full substitute for real data. The quality of the data generated by these methods depends on many factors, including the model, the prompt, and your existing data, and the approach may not be appropriate for the task you are trying to evaluate. Always inspect your datasets to ensure they capture the information you want to model.
Prerequisites
These guides assume you have already connected to LangSmith and have access to a dataset you want to expand.
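If you haven't configured credentials yet, the LangSmith client and the OpenAI model used below read API keys from environment variables. A minimal sketch (the key values are placeholders, not real keys):
import os

# Placeholder values; substitute your real keys before running the code below
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"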
Using Paraphrasing
Once your chain produces satisfactory results on a few example inputs, it's often helpful to check that this behavior is consistent across inputs similar to what you already have. One way to do this is to prompt an LLM to paraphrase the input examples, generating multiple variants of each input. Since paraphrasing is a semantically invariant transformation, we can assume the expected outputs should be the same as for the originals. Below is some code that does exactly this.
Step 1: Define the paraphrase generator
The first step is to create a chain for generating paraphrases. We'll use the ChatOpenAI model and provide it with system prompts containing instructions.
import re
from typing import List

from langchain.chains import LLMChain
from langchain.output_parsers import ListOutputParser
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI


class NumberedListOutputParser(ListOutputParser):
    """Parse a numbered list ("1. ...", "2. ...") into a list of strings."""

    def parse(self, output: str) -> List[str]:
        # Capture the text after each numbered bullet; matching to the end
        # of each line also keeps a final item with no trailing newline
        return re.findall(r"\d+\.\s+(.*)", output)


paraphrase_llm = ChatOpenAI(temperature=0.5)
prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(
            "You are a helpful paraphrasing assistant tasked with rephrasing text."
        ),
        SystemMessagePromptTemplate.from_template("Input: <INPUT>{query}</INPUT>"),
        HumanMessagePromptTemplate.from_template(
            "What are {n_paraphrases} different ways you could paraphrase the INPUT text?"
            " Do not significantly change the meaning."
            " Respond using numbered bullets. If you cannot think of any,"
            " just say 'I don't know.'"
        ),
    ]
)
paraphrase_chain = LLMChain(
    llm=paraphrase_llm,
    prompt=prompt_template,
    output_parser=NumberedListOutputParser(),
)
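Before wiring the chain into the dataset loop, it can help to sanity-check it on a single input. A minimal sketch (the sample query is purely illustrative, and the exact paraphrases will vary by model and run):
# Try the chain on one input; the chain applies the output parser,
# so this returns a list of paraphrase strings
sample = paraphrase_chain.run(
    query="How do I update my billing information?", n_paraphrases=3
)
print(sample)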
Step 2: Generate paraphrases concurrently
The next step is to write a function that will generate paraphrases for each example in your dataset. To speed up this process, we will use asyncio to generate paraphrases for different examples concurrently.
import asyncio
from typing import Any, Coroutine, List, Tuple

from langsmith.schemas import Example


async def gather_with_concurrency(n: int, *coros: Coroutine) -> list:
    """Run coroutines concurrently, allowing at most `n` at a time."""
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro: Coroutine) -> Any:
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))


async def generate_paraphrases(
    example: Example, n_paraphrases: int
) -> Tuple[Example, List[str]]:
    """Generate paraphrases for a single dataset example."""
    # Assumes the example has a single string input; additional
    # preprocessing might be needed for multi-key inputs
    example_input = next(iter(example.inputs.values()))
    # The chain's output parser already splits the completion into a
    # list of paraphrase strings
    result = await paraphrase_chain.arun(
        query=example_input, n_paraphrases=n_paraphrases
    )
    return example, result
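If you want to verify the concurrency helper in isolation before making real LLM calls, you can run it over inexpensive stand-in coroutines. A small sketch using asyncio.sleep in place of the chain call:
async def _demo() -> list:
    async def work(i: int) -> int:
        await asyncio.sleep(0.1)  # stand-in for a model call
        return i

    # At most 2 of the 6 coroutines run at any given time
    return await gather_with_concurrency(2, *(work(i) for i in range(6)))

# In a script: asyncio.run(_demo())  -> [0, 1, 2, 3, 4, 5]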
Step 3: Paraphrase the dataset
Now you can use the Client from LangSmith to access your dataset and generate paraphrases for it.
In the main function, examples contains the data from the dataset you want to expand, and coros contains the coroutines for generating paraphrases for each example. For each generated paraphrase, we create a new example with the same outputs as the original, on the assumption that paraphrasing is a semantically invariant transformation and the expected response is unchanged.
from langsmith import Client


async def main(
    client: Client, n_paraphrases: int, dataset_name: str, concurrency: int = 20
) -> list:
    """Generate paraphrases for every example in a dataset."""
    dataset = client.read_dataset(dataset_name=dataset_name)
    examples = client.list_examples(dataset_id=dataset.id)
    coros = (generate_paraphrases(example, n_paraphrases) for example in examples)
    results = await gather_with_concurrency(concurrency, *coros)
    flattened_results = []
    for example, batch_r in results:
        input_key = next(iter(example.inputs))
        for r in batch_r:
            # Reuse the original outputs: we assume paraphrasing is a
            # semantically invariant transformation
            client.create_example(
                inputs={input_key: r},
                outputs=example.outputs,
                dataset_id=dataset.id,
            )
            flattened_results.append(r)
    return flattened_results
Now, run the main function to generate paraphrases for your dataset.
client = Client()
n_paraphrases = 3
dataset_name = "Your Dataset Name"
results = await main(client, n_paraphrases, dataset_name)
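The top-level await above works in a notebook. In a plain Python script, wrap the call with asyncio.run instead:
import asyncio

results = asyncio.run(main(client, n_paraphrases, dataset_name))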
Replace "Your Dataset Name" with the name of your dataset. Once this is completed, your dataset should be roughly 3x the original size.
Considerations
Remember that the quality of the paraphrases will depend on the model and prompt used, and this approach may not be appropriate for the task you are trying to evaluate. Always check your paraphrased data to ensure it maintains the original meaning and is appropriate for your use case. This technique is mainly useful early in the development process, when you are trying to get a sense of how sensitive your chain or model is to small changes in the input.
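One lightweight way to do that check is to pull a few of the newly created examples back down and compare them against the originals by eye. A sketch, reusing the client and dataset name from above:
# Print a handful of examples from the expanded dataset for manual review
dataset = client.read_dataset(dataset_name=dataset_name)
for example in list(client.list_examples(dataset_id=dataset.id))[:5]:
    print(example.inputs, "->", example.outputs)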