This project aims to build a scalable data warehouse for collecting, cleaning, processing, and storing text data in Swahili. It encompasses web scraping, database management, API development, and automated workflows to enhance NLP capabilities for African languages. The system facilitates fine-tuning of Llama 2 7B, with benchmarking performed on Llama 2 13B and Llama 2 7B. Data sources include Hugging Face, Zenodo, and web scraping.
- Data Collection: Aggregates data from Hugging Face, Zenodo, and web scraping.
- Data Processing: Cleans and preprocesses text and audio data for NLP tasks.
- API Development: Creates high-throughput APIs for data ingestion and retrieval.
- Backend Development: Implements backend services using Flask.
- Frontend Development: Develops user interfaces with React.
- Workflow Orchestration: Automates workflows with Apache Airflow.
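The data processing step above is not specified in detail here, so the following is only a minimal sketch of what text cleaning for scraped Swahili content might look like. The function names `clean_text` and `clean_corpus` are hypothetical, and the chosen steps (Unicode normalization, stripping HTML tags and URLs, collapsing whitespace, case-insensitive deduplication) are assumptions rather than the project's actual pipeline.

```python
import re
import unicodedata

# Patterns are deliberately simple; real scraped pages may need an HTML parser.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+")
_WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize one scraped string (hypothetical helper, not from the repo)."""
    text = unicodedata.normalize("NFC", raw)  # unify composed/decomposed characters
    text = _TAG_RE.sub(" ", text)             # drop leftover HTML tags
    text = _URL_RE.sub(" ", text)             # drop bare URLs
    return _WS_RE.sub(" ", text).strip()      # collapse runs of whitespace


def clean_corpus(records):
    """Clean every record, skipping empty strings and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for raw in records:
        text = clean_text(raw)
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

For example, `clean_corpus(["<p>Habari  za asubuhi</p>", "habari za asubuhi"])` keeps a single entry, because the second string duplicates the first once tags, spacing, and case are normalized.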
- Models:
- Benchmarking was done using Llama 2 13B and Llama 2 7B.
- Metrics:
- Evaluate model performance on text classification, summarization, and translation.
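The specific metrics are not named here. For the text classification task, two common choices are accuracy and macro-averaged F1, and the sketch below computes both in plain Python. It is an illustration under that assumption, not the project's evaluation code. The function names and the sample category labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical news-category labels, used only for illustration.
reference = ["michezo", "siasa", "siasa", "uchumi"]
predicted = ["michezo", "siasa", "uchumi", "uchumi"]
print(accuracy(reference, predicted))  # 0.75
```

Macro F1 weights every class equally, which can be useful when some news categories are much rarer than others. Summarization and translation need different, overlap-based metrics and are not covered by this sketch.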
- Backend: Flask
- Frontend: React
- Database: PostgreSQL
- Data Ingestion: Apache Airflow
- API: RESTful API with Flask
- LLMs: Llama 2 7B, Llama 2 13B
- Data Sources:
- Hugging Face: Swahili News Dataset
- Zenodo: Swahili Dataset
This project addresses the need for a scalable system to enhance NLP capabilities for Swahili. It integrates data collection, processing, API development, and machine learning to support the fine-tuning of Llama 2 models. The project also involves creating a user-friendly interface and automating workflows for seamless data handling.
- Python 3.8 or higher
- Node.js and npm
- Docker (optional, for containerization)
- Apache Airflow
- Clone the repository:
git clone https://github.com/cheronodaisy/NLPWarehouse.git
- Navigate to the project directory:
cd NLPWarehouse
- Install the required dependencies:
pip install -r requirements.txt
npm install
Contributions from the community are welcome. To contribute, please follow these steps:
- Fork the repository.
- Create a new branch:
git checkout -b feature-branch
- Make your changes and commit them:
git commit -m 'Add some feature'
- Push to the branch:
git push origin feature-branch
- Open a pull request.
- Hugging Face and Zenodo for the datasets.