This project demonstrates how to generate synthetic test data for Retrieval Augmented Generation (RAG) using Ragas.
To get started, clone the repository and install the required dependencies.
!pip install ragas langchain_community langchain_groq sentence_transformers xmltodict -q
- Import necessary libraries and set up environment variables.
- Initialize the Groq API for data generation and critique models.
- Set up the HuggingFace BGE embeddings for document processing.
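The setup steps above can be sketched as follows. This is a minimal sketch, not the project's exact configuration: the Groq model ID `llama3-70b-8192`, the BGE model `BAAI/bge-small-en-v1.5`, and the temperature values are assumptions — substitute the models and API key you actually use.

```python
import os

from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Groq API key used by both the generation and critique models
# (placeholder value; set your real key here or in your shell).
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

# Two chat models: one generates candidate questions, one critiques them.
# "llama3-70b-8192" is an example model ID; any Groq-hosted model works.
data_generation_model = ChatGroq(model="llama3-70b-8192", temperature=0.3)
critic_model = ChatGroq(model="llama3-70b-8192", temperature=0)

# HuggingFace BGE embeddings for document processing.
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```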
Use the `PubMedLoader` from `langchain_community` to load documents related to a specific query (e.g., "cancer"). In this project, we load a maximum of 5 documents.
from langchain_community.document_loaders import PubMedLoader

loader = PubMedLoader("cancer", load_max_docs=5)
documents = loader.load()
We use the TestsetGenerator from ragas to generate test sets based on the loaded documents. The test set generation includes:
- Simple questions: 50%
- Multi-context questions: 40%
- Reasoning-based questions: 10%
The following code sets up the test set generation:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, multi_context, reasoning

generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(
    documents, test_size=5, distributions=distributions
)
test_df = testset.to_pandas()
The output is a Pandas DataFrame containing the generated test sets, which can be further analyzed or used for model evaluation.
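As a quick illustration of what that downstream analysis can look like, here is a self-contained sketch. The column names (`question`, `contexts`, `ground_truth`, `evolution_type`) follow the schema Ragas test sets typically use, but the rows below are illustrative placeholders, not real generator output — verify the columns against your Ragas version.

```python
import pandas as pd

# Toy stand-in for testset.to_pandas(); the columns mirror the Ragas
# test set schema, but these rows are illustrative placeholders.
test_df = pd.DataFrame({
    "question": ["What is X?", "How do A and B relate?", "Why does C imply D?"],
    "contexts": [["ctx1"], ["ctx2", "ctx3"], ["ctx4"]],
    "ground_truth": ["X is ...", "A and B ...", "Because ..."],
    "evolution_type": ["simple", "multi_context", "reasoning"],
})

# Check how the generated questions are distributed across question types.
counts = test_df["evolution_type"].value_counts().to_dict()
print(counts)

# Persist the test set for later model evaluation.
test_df.to_csv("rag_testset.csv", index=False)
```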
This project is licensed under the MIT License.