Scalable Data Warehouse for LLM Finetuning

Project Overview

This project builds a scalable data warehouse for collecting, cleaning, processing, and storing text and audio data in Swahili, Yoruba, and Amharic. It encompasses web scraping, database management, API development for high-throughput data ingestion and RAG retrieval, and automated workflows to enhance NLP capabilities for African languages. The system supports fine-tuning of Llama 2 7B, with benchmarking performed on both Llama 2 7B and Llama 2 13B. Data sources include Hugging Face, Zenodo, and web scraping.

Features

  • Data Collection: Aggregates data from Hugging Face, Zenodo, and web scraping.
  • Data Processing: Cleans and preprocesses text and audio data for NLP tasks (see the cleaning sketch after this list).
  • API Development: Creates high-throughput APIs for data ingestion and retrieval (see the API sketch after this list).
  • Backend Development: Implements backend services using Flask.
  • Frontend Development: Develops user interfaces with React.
  • Workflow Orchestration: Automates workflows with Apache Airflow.
  • Models: Benchmarking was done using Llama 2 13B and Llama 2 7B.
  • Metrics: Model performance is evaluated on text classification, summarization, and translation.
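
As a rough illustration of the cleaning step, the snippet below normalizes Unicode, strips leftover HTML tags, and collapses whitespace. The function name and the specific rules are assumptions made for this sketch, not the repository's actual pipeline.

    import re
    import unicodedata

    def clean_text(raw: str) -> str:
        # Hypothetical cleaning helper; the project's real rules may differ.
        text = unicodedata.normalize("NFC", raw)  # unify Unicode code points
        text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
        text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
        return text.strip()

    print(clean_text("  Habari   <b>ya</b> dunia! "))  # -> Habari ya dunia!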
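
The ingestion and retrieval endpoints could look roughly like the Flask sketch below. The /documents route, the JSON payload shape, and the in-memory list standing in for the PostgreSQL store are illustrative assumptions, and the substring filter is only a placeholder for real RAG retrieval.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    documents = []  # in-memory stand-in for the PostgreSQL store (assumption)

    @app.route("/documents", methods=["POST"])
    def ingest():
        # Accept a JSON list of records, e.g. [{"text": "..."}].
        records = request.get_json()
        documents.extend(records)
        return jsonify({"ingested": len(records)}), 201

    @app.route("/documents", methods=["GET"])
    def retrieve():
        # Naive substring match as a placeholder for RAG retrieval.
        query = request.args.get("q", "").lower()
        hits = [d for d in documents if query in d.get("text", "").lower()]
        return jsonify(hits)

    if __name__ == "__main__":
        app.run(port=5000)

With the server running, POST a JSON list to /documents to ingest records and GET /documents?q=habari to retrieve matches.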

Technologies Used

  • Backend: Flask
  • Frontend: React
  • Database: PostgreSQL
  • Data Ingestion: Apache Airflow (a minimal DAG sketch follows this list)
  • API: RESTful API with Flask
  • LLMs: Llama 2 7B, Llama 2 13B
  • Data Sources: Hugging Face, Zenodo, and web scraping
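
Airflow pipelines are declared as DAGs of tasks. A collect-clean-load workflow for this warehouse might be sketched as follows; the DAG id, task names, and callables are hypothetical, not the project's real DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def scrape():
        print("collect raw text from Hugging Face, Zenodo, and the web")

    def clean():
        print("normalize and deduplicate the collected records")

    def load():
        print("write the cleaned records to PostgreSQL")

    with DAG(
        dag_id="swahili_text_pipeline",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        scrape_task = PythonOperator(task_id="scrape", python_callable=scrape)
        clean_task = PythonOperator(task_id="clean", python_callable=clean)
        load_task = PythonOperator(task_id="load", python_callable=load)
        scrape_task >> clean_task >> load_task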

Table of Contents

  • Introduction
  • Getting Started
  • Contributing
  • Acknowledgments

Introduction

This project addresses the need for a scalable system to enhance NLP capabilities for Swahili and other African languages. It integrates data collection, processing, API development, and machine learning to support the fine-tuning of Llama 2 models. The project also includes a user-friendly React interface and automated workflows for seamless data handling.

Getting Started

Prerequisites

  • Python 3.8 or higher
  • Node.js and npm
  • PostgreSQL
  • Docker (optional, for containerization)
  • Apache Airflow

Installation

  1. Clone the repository:
    git clone https://github.com/cheronodaisy/NLPWarehouse.git
  2. Navigate to the project directory:
    cd NLPWarehouse
  3. Install the Python (backend) and Node.js (frontend) dependencies:
    pip install -r requirements.txt
    npm install

Contributing

Contributions from the community are welcome. To contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch:
    git checkout -b feature-branch
  3. Make your changes and commit them:
    git commit -m 'Add some feature'
  4. Push to the branch:
    git push origin feature-branch
  5. Open a pull request.

Acknowledgments

  • Hugging Face and Zenodo for the datasets.
