This repository contains an experiment that tests Gemini 1.5 Flash's ability to answer questions over up to 1 million tokens of context. The experiment uses a Kaggle dataset of App Store data to systematically evaluate the model's performance at increasing context lengths and check whether accuracy degrades as the context grows.
Unlike traditional Needle-in-a-Haystack (NIAH) tests, this experiment uses a real-world dataset in which all of the information is potentially relevant. It tests the model's ability to synthesize and reason across a large, cohesive body of information, rather than simply retrieving planted pieces of data. When the needles are irrelevant to the haystack, the challenge becomes more of an anomaly detection problem than a comprehension one. This experiment is meant to give a sense of how confidently we can trust answers from an LLM in long-context use cases, such as asking questions about your company's data.
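As a reference point, here is a minimal sketch of how contexts of increasing token length can be carved out of a dataset like this one. The names (`build_context`, `max_tokens`) are illustrative rather than the notebook's actual code, and tiktoken is used only as a rough token-count proxy, since Gemini has its own tokenizer.

```python
# Illustrative sketch, not the notebook's exact code: serialize dataset
# rows into text until a target token budget is reached.
import pandas as pd
import tiktoken

# cl100k_base is a proxy tokenizer; Gemini's own tokenization differs.
encoding = tiktoken.get_encoding("cl100k_base")

def build_context(df: pd.DataFrame, max_tokens: int) -> str:
    """Concatenate rows as "column: value" lines up to the token budget."""
    pieces, used = [], 0
    for _, row in df.iterrows():
        text = ", ".join(f"{col}: {row[col]}" for col in df.columns)
        n = len(encoding.encode(text))
        if used + n > max_tokens:
            break
        pieces.append(text)
        used += n
    return "\n".join(pieces)
```

Calling `build_context(df, 1_000_000)` with successively smaller budgets yields the family of contexts the experiment sweeps over.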
The main notebook, Context_Length_AppStoreV2.ipynb, includes:
- Data collection and preparation using the App Store Apple Data Set
- Implementation of evaluation and prediction functions (a sketch follows this list)
- Experiment setup to test Gemini 1.5 Flash across increasing context lengths
- Visualization of test results
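As a rough illustration of the prediction and evaluation functions listed above, the sketch below uses langchain-google-genai to query Gemini 1.5 Flash. `predict` and `evaluate` are hypothetical names, and the notebook's real grading may use LangSmith or an LLM-as-judge rather than the naive string match shown here. It assumes `GOOGLE_API_KEY` is set in the environment.

```python
# Hypothetical sketch of the prediction/evaluation step; the notebook's
# actual functions may differ.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

def predict(context: str, question: str) -> str:
    """Ask the model a question grounded in the provided context."""
    prompt = (
        "Answer the question using only the App Store data below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

def evaluate(prediction: str, reference: str) -> bool:
    # Naive containment check; an LLM-as-judge is a common alternative.
    return reference.strip().lower() in prediction.strip().lower()
```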
To run this experiment, you'll need:
- A LangSmith account and API key
- A Google AI Studio API key
- An OpenAI API key

Setup:
- Clone this repository
- Create a copy of the `.env.sample` file and save it as `.env`
- Add your API keys to the `.env` file
- Install the required libraries:

```
pip install -qU pandas tiktoken langchain langchain-openai langchain-google-genai matplotlib langsmith python-dotenv seaborn
```
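Once the keys are in place, a quick sanity check like the one below can confirm they load correctly. The environment variable names shown are common defaults and may differ from those in your `.env.sample`.

```python
# Illustrative check that the API keys load from .env before running
# the notebook. Variable names are assumptions, not the notebook's spec.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
for key in ("LANGCHAIN_API_KEY", "GOOGLE_API_KEY", "OPENAI_API_KEY"):
    assert os.getenv(key), f"Missing {key} in .env"
```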
Open the Context_Length_AppStoreV2.ipynb notebook and run the cells sequentially. The notebook will guide you through:
- Data preparation
- Setting up the evaluation dataset
- Implementing the evaluation and prediction functions
- Running the experiment across different context lengths
- Visualizing the results (a plotting sketch follows this list)
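The final visualization can be produced along these lines. Here `results` is a hypothetical stand-in for the notebook's output DataFrame, and the accuracy values are placeholders consistent with the result reported below.

```python
# Illustrative plot of accuracy vs. context length; column names and
# values are assumptions, not the notebook's actual output schema.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

results = pd.DataFrame({
    "context_tokens": [1_000, 10_000, 100_000, 500_000, 1_000_000],
    "accuracy": [1.0, 1.0, 1.0, 1.0, 1.0],  # placeholder values
})
sns.lineplot(data=results, x="context_tokens", y="accuracy", marker="o")
plt.xscale("log")  # context lengths span several orders of magnitude
plt.ylim(0, 1.05)
plt.xlabel("Context length (tokens)")
plt.ylabel("Accuracy")
plt.title("Gemini 1.5 Flash accuracy vs. context length")
plt.show()
```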
Gemini 1.5 Flash achieved 100% accuracy across all context lengths tested!