How to manage datasets programmatically

You can create a dataset from existing runs or upload a CSV file (or pandas dataframe in python).

Once you have a dataset created, you can continue to add new runs to it as examples. We recommend that you organize datasets to target a single "task", usually served by a single chain or LLM. For more discussions on datasets and evaluations, check out the recommendations.

Create from list of values

The most flexible way to make a dataset using the client is by creating examples from a list of inputs and optional outputs. Below is an example.

Python SDK
TypeScript SDK

from langsmith import Client

example_inputs = [
  ("What is the largest mammal?", "The blue whale"),
  ("What do mammals and birds have in common?", "They are both warm-blooded"),
  ("What are reptiles known for?", "Having scales"),
  ("What's the main characteristic of amphibians?", "They live both in water and on land"),
]

client = Client()
dataset_name = "Elementary Animal Questions"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Questions and answers about animal phylogenetics.",
)
for input_prompt, output_answer in example_inputs:
    client.create_example(
        inputs={"question": input_prompt},
        outputs={"answer": output_answer},
        dataset_id=dataset.id,
    )

import { Client } from "langsmith";

const client = new Client({
  // apiUrl: "https://api.langchain.com", // Defaults to the LANGCHAIN_ENDPOINT env var
  // apiKey: "my_api_key", // Defaults to the LANGCHAIN_API_KEY env var
  /* callerOptions: {
         maxConcurrency?: Infinity; // Maximum number of concurrent requests to make
         maxRetries?: 6; // Maximum number of retries to make
  }*/
});

const exampleInputs: [string, string][] = [
  ["What is the largest mammal?", "The blue whale"],
  ["What do mammals and birds have in common?", "They are both warm-blooded"],
  ["What are reptiles known for?", "Having scales"],
  ["What's the main characteristic of amphibians?", "They live both in water and on land"],
];

const datasetName = "Elementary Animal Questions";

// Storing inputs in a dataset lets us
// run chains and LLMs over a shared set of examples.
const dataset = await client.createDataset(datasetName, {
  description: "Questions and answers about animal phylogenetics",
});

for (const [inputPrompt, outputAnswer] of exampleInputs) {
  await client.createExample(
    { question: inputPrompt },
    { answer: outputAnswer },
    {
      datasetId: dataset.id,
    }
  );
}

Create from existing runs

To create datasets from existing runs, you can use the same approach. Below is an example:

Python SDK
TypeScript SDK

from langsmith import Client

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<YOUR-LANGSMITH-API-KEY>"
client = Client()
dataset_name = "Example Dataset"

# Filter runs to add to the dataset
runs = client.list_runs(
    project_name="my_project",
    execution_order=1,
    error=False,
)

dataset = client.create_dataset(dataset_name, description="An example dataset")
for run in runs:
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )

import { Client, Run } from "langsmith";
const client = new Client({
  // apiUrl: "https://api.langchain.com", // Defaults to the LANGCHAIN_ENDPOINT env var
  // apiKey: "my_api_key", // Defaults to the LANGCHAIN_API_KEY env var
  /* callerOptions: {
         maxConcurrency?: Infinity; // Maximum number of concurrent requests to make
         maxRetries?: 6; // Maximum number of retries to make
  }*/
});

const datasetName = "Example Dataset";
// Filter runs to add to the dataset
const runs: Run[] = [];
for await (const run of client.listRuns({
  projectName: "my_project",
  executionOrder: 1,
  error: false,
})) {
  runs.push(run);
}

const dataset = await client.createDataset(datasetName, {
  description: "An example dataset",
  dataType: "kv",
});

for (const run of runs) {
  await client.createExample(run.inputs, run.outputs ?? {}, {
    datasetId: dataset.id,
  });
}

Create dataset from CSV

In this section, we will demonstrate how you can create a dataset by uploading a CSV file.

First, ensure your CSV file is properly formatted with columns that represent your input and output keys. These keys will be utilized to map your data properly during the upload. You can specify an optional name and description for your dataset. Otherwise, the file name will be used as the dataset name and no description will be provided.

Python SDK
TypeScript SDK

from langsmith import Client
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<YOUR-LANGSMITH-API-KEY>"

client = Client()

csv_file = 'path/to/your/csvfile.csv'
input_keys = ['column1', 'column2'] # replace with your input column names
output_keys = ['output1', 'output2'] # replace with your output column names

dataset = client.upload_csv(
    csv_file=csv_file,
    input_keys=input_keys,
    output_keys=output_keys,
    name="My CSV Dataset",
    description="Dataset created from a CSV file"
    data_type="kv"
)

import { Client } from "langsmith";

const client = new Client();

const csvFile = 'path/to/your/csvfile.csv';
const inputKeys = ['column1', 'column2']; // replace with your input column names
const outputKeys = ['output1', 'output2']; // replace with your output column names

const dataset = await client.uploadCsv({
    csvFile: csvFile,
    fileName: "My CSV Dataset",
    inputKeys: inputKeys,
    outputKeys: outputKeys,
    description: "Dataset created from a CSV file",
    dataType: "kv"
});

Create dataset from pandas dataframe

The python client offers an additional convenience method to upload a dataset from a pandas dataframe.

from langsmith import Client
import os
import pandas as pd

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<YOUR-LANGSMITH-API-KEY>"
client = Client()

df = pd.read_parquet('path/to/your/myfile.parquet')
input_keys = ['column1', 'column2'] # replace with your input column names
output_keys = ['output1', 'output2'] # replace with your output column names

dataset = client.upload_dataframe(
    df=df,
    input_keys=input_keys,
    output_keys=output_keys,
    name="My Parquet Dataset",
    description="Dataset created from a parquet file",
    data_type="kv" # The default
)

How to manage datasets programmatically

Create from list of values​

Create from existing runs​

Create dataset from CSV​

Create dataset from pandas dataframe​

Help us out by providing feedback on this documentation page:

Create from list of values

Create from existing runs

Create dataset from CSV

Create dataset from pandas dataframe