Core Concepts
This page covers the core concepts of evaluation and what's available in Langfuse.
Ready to start?
- Create a dataset to measure your LLM application's performance consistently
- Run an experiment to get an overview of how your application is doing
- Set up LLM-as-a-Judge to evaluate your live traces
The Evaluation Loop
LLM application development typically follows a continuous loop of testing and monitoring.
Offline evaluation lets you test your application against a fixed dataset before you deploy. You run your new prompt or model against test cases, review the scores, iterate until the results look good, then deploy your changes. In Langfuse, you can do that by running Experiments.
Online evaluation scores live traces to catch issues in real traffic. When you find edge cases your dataset didn't cover, you add them back to your dataset so future experiments will catch them.
Here's an example workflow for building a customer support chatbot:
- You update your prompt to make responses less formal.
- Before deploying, you run an experiment: test the new prompt against your dataset of customer questions (offline evaluation).
- You review the scores and outputs. The tone improved, but responses are longer and some miss important links.
- You refine the prompt and run the experiment again.
- The results look good now. You deploy the new prompt to production.
- You monitor with online evaluation to catch any new edge cases.
- You notice that a customer asked a question in French, but the bot responded in English.
- You add this French query to your dataset so future experiments will catch this issue.
- You update your prompt to support French responses and run another experiment.
Over time, your dataset grows from a couple of examples to a diverse, representative set of real-world test cases.
Scores
Scores are Langfuse's universal data object for storing evaluation results. Any time you want to assign a quality judgment to an LLM output, whether from human annotation, an LLM judge, a programmatic check, or end-user feedback, the result is stored as a score.
Every score has a name (like "correctness" or "helpfulness"), a value, and a data type. Scores also support an optional comment for additional context.
Scores can be attached to traces, observations, sessions, or dataset runs. Most commonly, scores are attached to traces to evaluate a single end-to-end interaction.
Once you have scores, they show up in score analytics, can be visualized in custom dashboards, and can be queried via the API.
When to Use Scores
Scores become useful when you want to go beyond observing what your application does and start measuring how well it does it. Common use cases:
- Collecting user feedback: Capture thumbs up/down or star ratings from your users and attach them to traces. See the user feedback guide.
- Monitoring production quality: Set up automated evaluators (like LLM-as-a-Judge) to continuously score live traces for things like hallucination, relevance, or tone.
- Running guardrails: Score whether outputs pass safety checks like PII detection, format validation, or content policy compliance.
- Comparing changes with experiments: When you change a prompt, model, or pipeline, run an experiment to score the new version against a dataset.
Score Types
Langfuse supports four score data types:
| Type | Value | Use when |
|---|---|---|
| NUMERIC | Float (e.g. 0.9) | Continuous judgments like accuracy, relevance, or similarity scores |
| CATEGORICAL | String from predefined categories (e.g. "correct", "partially correct") | Discrete classifications where the set of possible values is known upfront |
| BOOLEAN | 0 or 1 | Pass/fail checks like hallucination detection or format validation |
| TEXT | Free-form string (1–500 characters) | Open-ended annotations like reviewer notes or qualitative feedback. Often used for open coding before formalizing into quantifiable scores via axial coding. |
Text scores are designed for qualitative, open-ended scoring. Because free-form text cannot be meaningfully aggregated or compared, text scores are not supported in experiments, LLM-as-a-Judge, or score analytics.
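To make the constraints of the four data types concrete, here is a minimal sketch of a local validation helper. This is a hypothetical function for illustration, not part of the Langfuse SDK (Langfuse validates scores server-side):

```python
def validate_score_value(data_type: str, value) -> bool:
    """Check a score value against the constraints of its data type.

    Hypothetical helper for illustration; not part of the Langfuse SDK.
    """
    if data_type == "NUMERIC":
        # Any float or int (excluding Python bools, which are ints).
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if data_type == "CATEGORICAL":
        # A string, ideally from a predefined set of categories.
        return isinstance(value, str)
    if data_type == "BOOLEAN":
        # Stored as 0 or 1.
        return value in (0, 1)
    if data_type == "TEXT":
        # Free-form string, 1 to 500 characters.
        return isinstance(value, str) and 1 <= len(value) <= 500
    return False
```

For example, `validate_score_value("NUMERIC", 0.9)` passes, while `validate_score_value("TEXT", "")` fails because text scores must contain at least one character.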
How to Create Scores
There are four ways to add scores:
- LLM-as-a-Judge: Set up automated evaluators that score traces based on custom criteria (e.g. hallucination, tone, relevance). These can return numeric or categorical scores plus reasoning, and can run on live production traces or on experiment results.
- Scores via UI: Team members manually score traces, observations, or sessions directly in the Langfuse UI. Requires a score config to be set up first.
- Annotation Queues: Set up structured review workflows where reviewers work through batches of traces.
- Scores via API/SDK: Programmatically add scores from your application code. This is the way to go for user feedback (thumbs up/down, star ratings), guardrail results, or custom evaluation pipelines.
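As a sketch of the API/SDK path, the helper below builds a BOOLEAN user-feedback score payload for a trace. The helper itself is hypothetical; with the Langfuse Python SDK you would pass these fields to the SDK's score-creation method (exact method names vary between SDK versions, so check the SDK reference):

```python
from typing import Optional


def user_feedback_score(trace_id: str, thumbs_up: bool,
                        comment: Optional[str] = None) -> dict:
    """Build a BOOLEAN user-feedback score payload for a given trace.

    Hypothetical helper for illustration. The resulting fields would be
    passed to the Langfuse SDK's score-creation method.
    """
    payload = {
        "trace_id": trace_id,
        "name": "user-feedback",
        "value": 1 if thumbs_up else 0,  # thumbs up -> 1, thumbs down -> 0
        "data_type": "BOOLEAN",
    }
    if comment:
        # Optional free-form context shown alongside the score in the UI.
        payload["comment"] = comment
    return payload
```

Keeping payload construction in one place like this makes it easy to reuse the same score name and data type across your application, so the resulting scores aggregate cleanly in analytics.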
Should I Use Scores or Tags?
| | Scores | Tags |
|---|---|---|
| Purpose | Measure how good something is | Describe what something is |
| Data | Numeric, categorical, boolean, or text value | Simple string label |
| When added | Can be added at any time, including long after the trace was created | Set during tracing and cannot be changed afterwards |
| Used for | Quality measurement, analytics, experiments | Filtering, segmentation, organizing |
As a rule of thumb: if you already know the category at tracing time (e.g. which feature or API endpoint triggered the trace), use a tag. If you need to classify or evaluate traces later, use a score.
Score Comments
Every score supports an optional comment field. Use it to capture reasoning (e.g. why an LLM judge assigned a particular score), reviewer notes, or context that helps others understand the score value. Comments are shown alongside scores in the Langfuse UI.
Use a TEXT score instead of comments to capture standalone qualitative feedback — comments are best for additional reasoning on an existing score.
Evaluation Methods
Evaluation methods are the functions that score traces, observations, sessions, or dataset runs. You can use a variety of evaluation methods to add scores.
| Method | What | Use when |
|---|---|---|
| LLM-as-a-Judge | Use an LLM to evaluate outputs based on custom criteria | Subjective assessments at scale (tone, accuracy, helpfulness) |
| Scores via UI | Manually add scores to traces directly in the Langfuse UI | Quick quality spot checks, reviewing individual traces |
| Annotation Queues | Structured human review workflows with customizable queues | Building ground truth, systematic labeling, team collaboration |
| Scores via API/SDK | Programmatically add scores using the Langfuse API or SDK | Custom evaluation pipelines, deterministic checks, automated workflows |
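As an example of a deterministic check you might run via the API/SDK, the sketch below tests whether a model output parses as JSON and packages the result as a BOOLEAN score record. The function and field names are illustrative, not the exact Langfuse schema:

```python
import json


def json_format_check(output: str) -> dict:
    """Deterministic evaluator: does the model output parse as JSON?

    Illustrative sketch; the returned record mirrors the kind of fields
    you would send to Langfuse as a BOOLEAN score.
    """
    try:
        json.loads(output)
        passed = 1  # valid JSON -> pass
    except json.JSONDecodeError:
        passed = 0  # parse error -> fail
    return {"name": "valid-json", "value": passed, "data_type": "BOOLEAN"}
```

Checks like this are cheap and fully reproducible, which makes them a good complement to LLM-as-a-Judge for format validation and guardrails.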
When setting up new evaluation methods, you can use Score Analytics to analyze or sense-check the scores you produce.
Experiments
An experiment runs your application against a dataset and evaluates the outputs. This is how you test changes before deploying to production.
Definitions
Before diving into experiments, it's helpful to understand the building blocks in Langfuse: datasets, dataset items, tasks, scores, and experiments.
| Object | Definition |
|---|---|
| Dataset | A collection of test cases (dataset items). You can run experiments on a dataset. |
| Dataset item | One item in a dataset. Each dataset item contains an input (the scenario to test) and optionally an expected output. |
| Task | The application code you want to test in an experiment. It runs once per dataset item and produces the output that will be scored. |
| Evaluation Method | A function that scores experiment results. In the context of a Langfuse experiment, this can be a deterministic check, or LLM-as-a-Judge. |
| Score | The output of an evaluation. See Scores for the available data types and details. |
| Experiment Run | A single execution of your task against all items in a dataset, producing outputs (and scores). |
You can find the data model for these objects here.
How these work together
This is what happens conceptually:
When you run an experiment on a dataset, each dataset item is passed to the task function you defined. The task function is typically the part of your application you want to test, usually one or more LLM calls, and it produces an output for each dataset item. This process is called an experiment run. The resulting collection of outputs, linked to their dataset items, forms the experiment results.
Often, you want to score these experiment results. You can use various evaluation methods that take in the dataset item and the output produced by the task function, and produce a score based on criteria you define. Based on these scores, you can then get a complete picture of how your application performs across all test cases.
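The flow described above can be sketched as a simple loop. This is a conceptual illustration, not the Langfuse SDK API: the task runs once per dataset item, and each evaluation method scores the (item, output) pair.

```python
def run_experiment(dataset, task, evaluators):
    """Conceptual sketch of an experiment run (not the Langfuse SDK API)."""
    results = []
    for item in dataset:
        # The task is the application code under test.
        output = task(item["input"])
        # Each evaluator scores the dataset item together with the output.
        scores = [evaluate(item, output) for evaluate in evaluators]
        results.append({"item": item, "output": output, "scores": scores})
    return results


# Toy example: an "application" that uppercases its input,
# scored by an exact-match check against the expected output.
dataset = [
    {"input": "hello", "expected_output": "HELLO"},
    {"input": "bye", "expected_output": "BYE!"},
]
task = lambda text: text.upper()
exact_match = lambda item, output: {
    "name": "exact-match",
    "value": 1 if output == item["expected_output"] else 0,
    "data_type": "BOOLEAN",
}
results = run_experiment(dataset, task, [exact_match])
```

In the toy run above, the first item matches its expected output and the second does not, which is exactly the kind of per-item breakdown you would inspect in the experiment results.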
You can compare experiment runs to see if a new prompt version improves scores, or identify specific inputs where your application struggles. Based on these experiment results, you can decide whether the change is ready to be deployed to production.
You can find more details on how these objects link together under the hood on the data model page.
Two ways to run experiments
You can run experiments programmatically using the Langfuse SDK. This gives you full control over the task, evaluation logic, and more. Learn more about running experiments via SDK.
Another way is to run experiments directly from the Langfuse interface by selecting a dataset and prompt version. This is useful for quick iterations on prompts without writing code. Learn more about running experiments via UI.
| | Langfuse Execution | Local/CI Execution |
|---|---|---|
| Langfuse Dataset | Supported | Supported |
| Local Dataset | Not supported | Supported |
While optional, we recommend managing the underlying datasets in Langfuse, as this allows for [1] in-UI comparison tables of different experiments on the same data and [2] iteratively improving the dataset based on production/staging traces.
Online Evaluation
For online evaluation, you can configure evaluation methods to automatically score production traces. This helps you catch issues immediately.
Langfuse currently supports LLM-as-a-Judge and human annotation checks for online evaluation. Deterministic checks are on the roadmap.
Monitoring with dashboards
Langfuse offers dashboards to monitor your application's performance, including scores, in real time. You can find more details on how to use dashboards here.
Overview
With Langfuse you can capture all your LLM evaluations in one place. You can combine a variety of different evaluation metrics like model-based evaluations (LLM-as-a-Judge), human annotations or fully custom evaluation workflows via API/SDKs. This allows you to measure quality, tonality, factual accuracy, completeness, and other dimensions of your LLM application.