This project demonstrates how to generate synthetic test data for Retrieval Augmented Generation (RAG) using Ragas.
To get started, clone the repository and install the required dependencies.
!pip install ragas langchain_community langchain_groq sentence_transformers xmltodict -q
- Import necessary libraries and set up environment variables.
- Initialize the Groq API for data generation and critique models.
- Set up the HuggingFace BGE embeddings for document processing.
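The setup steps above can be sketched as follows. This is a minimal sketch, not the project's exact configuration: the Groq model ID `llama3-70b-8192`, the BGE model `BAAI/bge-small-en-v1.5`, and the temperature values are assumptions — substitute the models and API key you actually use.

```python
import os

from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Groq API key used by both the generation and critique models
# (placeholder value; set your real key here or in your shell).
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

# Two chat models: one generates candidate questions, one critiques them.
# "llama3-70b-8192" is an example model ID; any Groq-hosted model works.
data_generation_model = ChatGroq(model="llama3-70b-8192", temperature=0.3)
critic_model = ChatGroq(model="llama3-70b-8192", temperature=0)

# HuggingFace BGE embeddings for document processing.
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```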
Use the `PubMedLoader` from `langchain_community` to load documents related to a specific query (e.g., "cancer"). In this project, we load a maximum of 5 documents.
from langchain_community.document_loaders import PubMedLoader

loader = PubMedLoader("cancer", load_max_docs=5)
documents = loader.load()
We use the TestsetGenerator from ragas to generate test sets based on the loaded documents. The test set generation includes:
- Simple questions: 50%
- Multi-context questions: 40%
- Reasoning-based questions: 10%
The following code sets up the test set generation:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, multi_context, reasoning

generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(
    documents, test_size=5, distributions=distributions
)
test_df = testset.to_pandas()
The output is a Pandas DataFrame containing the generated test sets, which can be further analyzed or used for model evaluation.
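As a quick illustration of what that downstream analysis can look like, here is a self-contained sketch. The column names (`question`, `contexts`, `ground_truth`, `evolution_type`) follow the schema Ragas test sets typically use, but the rows below are illustrative placeholders, not real generator output — verify the columns against your Ragas version.

```python
import pandas as pd

# Toy stand-in for testset.to_pandas(); the columns mirror the Ragas
# test set schema, but these rows are illustrative placeholders.
test_df = pd.DataFrame({
    "question": ["What is X?", "How do A and B relate?", "Why does C imply D?"],
    "contexts": [["ctx1"], ["ctx2", "ctx3"], ["ctx4"]],
    "ground_truth": ["X is ...", "A and B ...", "Because ..."],
    "evolution_type": ["simple", "multi_context", "reasoning"],
})

# Check how the generated questions are distributed across question types.
counts = test_df["evolution_type"].value_counts().to_dict()
print(counts)

# Persist the test set for later model evaluation.
test_df.to_csv("rag_testset.csv", index=False)
```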
This project is licensed under the MIT License.