evaluate_existing#
- langsmith.evaluation._runner.evaluate_existing(experiment: str | UUID | TracerSession, /, evaluators: Sequence[RunEvaluator | Callable[[Run, Example | None], EvaluationResult | EvaluationResults] | Callable[[...], dict | EvaluationResults | EvaluationResult]] | None = None, summary_evaluators: Sequence[Callable[[Sequence[Run], Sequence[Example]], EvaluationResult | EvaluationResults] | Callable[[List[Run], List[Example]], EvaluationResult | EvaluationResults]] | None = None, metadata: dict | None = None, max_concurrency: int | None = None, client: Client | None = None, load_nested: bool = False, blocking: bool = True) → ExperimentResults[source]#
- Evaluate existing experiment runs.
- Parameters:
- experiment (Union[str, uuid.UUID]) – The identifier of the experiment to evaluate.
- evaluators (Optional[Sequence[EVALUATOR_T]]) – Optional sequence of evaluators to use for individual run evaluation; a sketch of the expected callable shapes follows this list.
- summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]) – Optional sequence of evaluators to apply over the entire dataset.
- metadata (Optional[dict]) – Optional metadata to include in the evaluation results.
- max_concurrency (Optional[int]) – Optional maximum number of concurrent evaluations.
- client (Optional[langsmith.Client]) – Optional LangSmith client to use for evaluation.
- load_nested (bool) – Whether to load all child runs for the experiment. Default is to only load the top-level root runs.
- blocking (bool) – Whether to block until evaluation is complete.
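A minimal sketch of the callable shapes these evaluator parameters expect. It assumes the runs store their prediction under an "output" key and the examples store their reference under an "answer" key; the function names and the experiment name are illustrative, not part of the documented API:

>>> from typing import Sequence
>>> from langsmith.evaluation import evaluate_existing
>>> from langsmith.schemas import Example, Run
>>> def exact_match(run: Run, example: Example) -> dict:
...     # Row-level evaluator: scores a single run against its reference example.
...     return {"key": "exact_match", "score": run.outputs["output"] == example.outputs["answer"]}
>>> def pass_rate(runs: Sequence[Run], examples: Sequence[Example]) -> dict:
...     # Summary evaluator: aggregates one metric over all runs in the experiment.
...     matches = [r.outputs["output"] == e.outputs["answer"] for r, e in zip(runs, examples)]
...     return {"key": "pass_rate", "score": sum(matches) / len(matches)}
>>> results = evaluate_existing(
...     "My Experiment:64e6e91",  # name or UUID of an experiment that has already run
...     evaluators=[exact_match],
...     summary_evaluators=[pass_rate],
...     max_concurrency=4,
... )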
 
- Returns:
- The evaluation results.
- Return type:
- ExperimentResults
- Environment:
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Recommended to commit the cache files to your repository for faster CI/CD runs. Requires the "langsmith[vcr]" package to be installed.
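A rough sketch of enabling the cache, assuming a "tests/cassettes" directory (an arbitrary choice for illustration, not a documented default):

>>> import os
>>> os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"  # hypothetical cache directory
>>> # With the "langsmith[vcr]" extra installed, subsequent evaluate_existing(...) calls
>>> # record API responses here on the first run and replay them on later runs.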
 
 
- Examples

>>> from langsmith.evaluation import evaluate, evaluate_existing
>>> dataset_name = "Evaluate Examples"
>>> def predict(inputs: dict) -> dict:
...     # This can be any function or just an API call to your app.
...     return {"output": "Yes"}
>>> # First run inference on the dataset
... results = evaluate(
...     predict,
...     data=dataset_name,
... )
View the evaluation results for experiment:...
>>> # Then apply evaluators to the experiment
... def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> experiment_name = (
...     results.experiment_name
... )  # Can use the returned experiment name
>>> experiment_name = "My Experiment:64e6e91"  # Or manually specify
>>> results = evaluate_existing(
...     experiment_name,
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
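The documented example above applies only the experiment-level precision evaluator. As a follow-up sketch (not part of the documented example), the row-level accuracy evaluator it defines can be attached to the same existing experiment, and, assuming your langsmith version exposes ExperimentResults.wait(), the call can be made non-blocking and awaited later:

>>> results = evaluate_existing(
...     experiment_name,
...     evaluators=[accuracy],           # row-level feedback on each run
...     summary_evaluators=[precision],  # experiment-level metric
...     max_concurrency=4,
...     blocking=False,                  # return immediately; evaluation continues in the background
... )
>>> results.wait()  # block here once the scores are actually needed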