aevaluate
- async langsmith.evaluation._arunner.aevaluate(target: Callable[[dict], Awaitable[dict]] | AsyncIterable[dict], /, data: str | UUID | Iterable[Example] | Dataset | AsyncIterable[Example], evaluators: Sequence[RunEvaluator | Callable[[Run, Example | None], EvaluationResult | EvaluationResults] | Callable[[...], dict | EvaluationResults | EvaluationResult] | Callable[[Run, Example | None], Awaitable[EvaluationResult | EvaluationResults]]] | None = None, summary_evaluators: Sequence[Callable[[Sequence[Run], Sequence[Example]], EvaluationResult | EvaluationResults] | Callable[[List[Run], List[Example]], EvaluationResult | EvaluationResults]] | None = None, metadata: dict | None = None, experiment_prefix: str | None = None, description: str | None = None, max_concurrency: int | None = None, num_repetitions: int = 1, client: Client | None = None, blocking: bool = True, experiment: TracerSession | str | UUID | None = None, upload_results: bool = True) → AsyncExperimentResults [source]
Evaluate an async target system or function on a given dataset.
- Parameters:
target (Union[AsyncCallable[[dict], dict], AsyncIterable[dict]]) – The async target system or function to evaluate.
data (Union[DATA_T, AsyncIterable[schemas.Example]]) – The dataset to evaluate on. Can be a dataset name, a list of examples, an async generator of examples, or an async iterable of examples.
evaluators (Optional[Sequence[EVALUATOR_T]]) – A list of evaluators to run on each example. Defaults to None.
summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]) – A list of summary evaluators to run on the entire dataset. Defaults to None.
metadata (Optional[dict]) – Metadata to attach to the experiment. Defaults to None.
experiment_prefix (Optional[str]) – A prefix to provide for your experiment name. Defaults to None.
description (Optional[str]) – A description of the experiment.
max_concurrency (Optional[int]) – The maximum number of concurrent evaluations to run. Defaults to None.
num_repetitions (int) – The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. Defaults to 1.
client (Optional[langsmith.Client]) – The LangSmith client to use. Defaults to None.
blocking (bool) – Whether to block until the evaluation is complete. Defaults to True.
experiment (Optional[schemas.TracerSession]) – An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only.
upload_results (bool) – Whether to upload the results to LangSmith. Defaults to True.
- Returns:
An async iterator over the experiment results.
- Return type:
AsyncIterator[ExperimentResultRow]
- Environment:
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Recommended to commit the cache files to your repository for faster CI/CD runs. Requires the "langsmith[vcr]" package to be installed.
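A minimal sketch of enabling the cache by pointing the environment variable at a cache directory before running an evaluation (not from the original docstring; the directory path here is hypothetical):
>>> import os
>>> os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"  # hypothetical cache directory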
Examples
>>> from typing import Sequence
>>> from langsmith import Client, aevaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
...     "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage:
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> import asyncio
>>> async def apredict(inputs: dict) -> dict:
...     # This can be any async function or just an API call to your app.
...     await asyncio.sleep(0.1)
...     return {"output": "Yes"}
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...         experiment_prefix="My Experiment",
...         description="Evaluate the accuracy of the model asynchronously.",
...         metadata={
...             "my-prompt-version": "abcd-1234",
...         },
...     )
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples using an async generator:
>>> async def example_generator():
...     examples = client.list_examples(dataset_name=dataset_name, limit=5)
...     for example in examples:
...         yield example
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=example_generator(),
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...         experiment_prefix="My Subset Experiment",
...         description="Evaluate a subset of examples asynchronously.",
...     )
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly:
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...         experiment_prefix="My Streaming Experiment",
...         description="Streaming predictions for debugging.",
...         blocking=False,
...     )
... )
View the evaluation results for experiment:...
>>> async def aenumerate(iterable):
...     async for elem in iterable:
...         print(elem)
>>> asyncio.run(aenumerate(results))
Running without concurrency:
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...         experiment_prefix="My Experiment Without Concurrency",
...         description="This was run without concurrency.",
...         max_concurrency=0,
...     )
... )
View the evaluation results for experiment:...
Using async evaluators:
>>> async def helpfulness(run: Run, example: Example):
...     # Row-level evaluator for helpfulness.
...     await asyncio.sleep(5)  # Replace with your LLM API call
...     return {"score": run.outputs["output"] == "Yes"}
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...         evaluators=[helpfulness],
...         summary_evaluators=[precision],
...         experiment_prefix="My Helpful Experiment",
...         description="Applying async evaluators example.",
...     )
... )
View the evaluation results for experiment:...
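Repeating each example multiple times. The num_repetitions parameter documented above is not exercised in the examples; the sketch below is an illustration rather than part of the original docstring, reusing apredict, accuracy, and precision from the earlier examples (the experiment prefix and description are hypothetical):
>>> results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...         experiment_prefix="My Repeated Experiment",  # hypothetical prefix
...         description="Run and evaluate each example three times.",
...         num_repetitions=3,
...     )
... )
View the evaluation results for experiment:...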