Scalable Data Warehouse for LLM Finetuning

Project Overview

This project builds a scalable data warehouse for collecting, cleaning, processing, and storing text and audio data in Swahili, Yoruba, and Amharic. It encompasses web scraping, database management, API development for high-throughput data ingestion and RAG retrieval, and automated workflows to enhance NLP capabilities for African languages. The system supports fine-tuning of Llama 2 7B, with benchmarking performed on both Llama 2 7B and Llama 2 13B. Data sources include Hugging Face, Zenodo, and web scraping.

Features

  • Data Collection: Aggregates data from Hugging Face, Zenodo, and web scraping.
  • Data Processing: Cleans and preprocesses text and audio data for NLP tasks (see the cleaning sketch after this list).
  • API Development: Creates high-throughput APIs for data ingestion and retrieval (see the API sketch after this list).
  • Backend Development: Implements backend services using Flask.
  • Frontend Development: Develops user interfaces with React.
  • Workflow Orchestration: Automates workflows with Apache Airflow.
  • Models: Benchmarking was done using Llama 2 13B and Llama 2 7B.
  • Metrics: Model performance is evaluated on text classification, summarization, and translation.
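
As a rough illustration of the cleaning step, the snippet below normalizes Unicode, strips leftover HTML tags, and collapses whitespace. The function name and the specific rules are assumptions made for this sketch, not the repository's actual pipeline.

    import re
    import unicodedata

    def clean_text(raw: str) -> str:
        # Hypothetical cleaning helper; the project's real rules may differ.
        text = unicodedata.normalize("NFC", raw)  # unify Unicode code points
        text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
        text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
        return text.strip()

    print(clean_text("  Habari   <b>ya</b> dunia! "))  # -> Habari ya dunia!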
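
The ingestion and retrieval endpoints could look roughly like the Flask sketch below. The /documents route, the JSON payload shape, and the in-memory list standing in for the PostgreSQL store are illustrative assumptions, and the substring filter is only a placeholder for real RAG retrieval.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    documents = []  # in-memory stand-in for the PostgreSQL store (assumption)

    @app.route("/documents", methods=["POST"])
    def ingest():
        # Accept a JSON list of records, e.g. [{"text": "..."}].
        records = request.get_json()
        documents.extend(records)
        return jsonify({"ingested": len(records)}), 201

    @app.route("/documents", methods=["GET"])
    def retrieve():
        # Naive substring match as a placeholder for RAG retrieval.
        query = request.args.get("q", "").lower()
        hits = [d for d in documents if query in d.get("text", "").lower()]
        return jsonify(hits)

    if __name__ == "__main__":
        app.run(port=5000)

With the server running, POST a JSON list to /documents to ingest records and GET /documents?q=habari to retrieve matches.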

Technologies Used

  • Backend: Flask
  • Frontend: React
  • Database: PostgreSQL
  • Data Ingestion: Apache Airflow (a minimal DAG sketch follows this list)
  • API: RESTful API with Flask
  • LLMs: Llama 2 7B, Llama 2 13B
  • Data Sources: Hugging Face, Zenodo, and web scraping
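
Airflow pipelines are declared as DAGs of tasks. A collect-clean-load workflow for this warehouse might be sketched as follows; the DAG id, task names, and callables are hypothetical, not the project's real DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def scrape():
        print("collect raw text from Hugging Face, Zenodo, and the web")

    def clean():
        print("normalize and deduplicate the collected records")

    def load():
        print("write the cleaned records to PostgreSQL")

    with DAG(
        dag_id="swahili_text_pipeline",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
        clean_task = PythonOperator(task_id="clean", python_callable=clean)
        load_task = PythonOperator(task_id="load", python_callable=load)
        scrape_task >> clean_task >> load_task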

Table of Contents

  • Introduction
  • Getting Started
  • Contributing
  • Acknowledgments

Introduction

This project addresses the need for a scalable system to enhance NLP capabilities for Swahili and other African languages. It integrates data collection, processing, API development, and machine learning to support the fine-tuning of Llama 2 models. The project also includes a user-friendly React interface and automated workflows for seamless data handling.

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Node.js and npm
  • PostgreSQL
  • Docker (optional, for containerization)
  • Apache Airflow

Installation

  1. Clone the repository:
    git clone https://github.com/cheronodaisy/NLPWarehouse.git
  2. Navigate to the project directory:
    cd NLPWarehouse
  3. Install the Python (backend) and Node.js (frontend) dependencies:
    pip install -r requirements.txt
    npm install

Contributing

Contributions from the community are welcome. To contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch:
    git checkout -b feature-branch
  3. Make your changes and commit them:
    git commit -m 'Add some feature'
  4. Push to the branch:
    git push origin feature-branch
  5. Open a pull request.

Acknowledgments

  • Hugging Face and Zenodo for the datasets.
