Evaluation with dataset
Evaluate the performance of your pipelines against a dataset.
Overview
Configuring RAG pipelines requires iterating across different parameters, from pre-processing loaders and chunkers to the embedding model being used. To assist in testing different configurations, Neum AI provides several tools to test, evaluate, and compare pipelines.
Datasets
Datasets provide the ability to create a list of tests to run against a pipeline. Datasets are made up of DatasetEntry objects, each of which represents a test. Each DatasetEntry object contains a query, an expected output, and an id.
Datasets can be configured to run an evaluation. Evaluations supported include:
- Cosine Evaluation: Compares the vector embeddings between the retrieved chunk and the expected output.
- LLM Evaluation: Uses an LLM to check the quality and correctness of the retrieved information in answering the query at hand. (Requires an OpenAI key to be set as the OPENAI_API_KEY environment variable.)
To create a dataset:
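Below is a minimal sketch of building a dataset with two test entries and a cosine evaluation. The import path and the Dataset, DatasetEntry, and CosineEvaluation names are assumptions for illustration; verify them against the Neum AI SDK.

```python
# Sketch: building a dataset of test queries.
# NOTE: the module path and class names (neumai.datasets, Dataset,
# DatasetEntry, CosineEvaluation) are assumptions for illustration --
# check the Neum AI SDK for the exact layout.
from neumai.datasets import Dataset, DatasetEntry, CosineEvaluation

dataset = Dataset(
    name="rag_regression_tests",
    dataset_entries=[
        DatasetEntry(
            id="entry-1",
            query="What is the refund policy?",
            expected_output="Refunds are available within 30 days of purchase.",
        ),
        DatasetEntry(
            id="entry-2",
            query="How do I contact support?",
            expected_output="Support can be reached through the in-app chat.",
        ),
    ],
    # How each retrieved result is scored against expected_output.
    evaluation_type=CosineEvaluation,
)
```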
Run a test
Once a dataset is created, we can run it against a pipeline. We also support running a dataset against a pipeline collection to compare the results from multiple pipelines at the same time, as sketched below.
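The sketch below assumes a pipeline and a pipeline collection have already been configured, and that the Dataset object exposes run_with_pipeline and run_with_pipeline_collection methods; those method and field names are assumptions for illustration, not confirmed API.

```python
# Sketch: running the dataset. Assumes `dataset` from the previous
# snippet and an existing `pipeline` / `pipeline_collection`.
# Method and attribute names below are assumptions for illustration.
single_results = dataset.run_with_pipeline(pipeline=pipeline)

# Run the same dataset against several pipeline configurations at once.
collection_results = dataset.run_with_pipeline_collection(
    pipeline_collection=pipeline_collection
)

# Inspect the per-entry scores for the single-pipeline run.
for result in single_results:
    print(result.dataset_entry_id, result.score)
```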
The result will include a score or an evaluation matrix depending on the type of evaluation being used.
Output:
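The exact output depends on the SDK version and the evaluation type configured. As a purely illustrative sketch, a cosine evaluation run might print one similarity score per dataset entry, along these lines:

```
entry-1  0.91
entry-2  0.87
```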