spb-balasubrahmanyam-llm-project

Project Overview

This project aims to emulate the vocal style of the legendary Indian playback singer SP Balasubrahmanyam using a large language model (LLM). The study involves collecting datasets on his career and vocal performances, training an LLM to generate songs based on his unique vocal characteristics, and analyzing the results. The findings demonstrate the potential of LLMs to capture and replicate the vocal essence of iconic singers.

Installation

To set up the project, follow these steps:

Clone the repository:

```
git clone https://github.com/kasinadhsarma/spb-balasubrahmanyam-llm-project.git
cd spb-balasubrahmanyam-llm-project
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

To generate a song in Telugu using the pre-trained model, run the following script:

```
python3 generate_song_indic_bart.py
```

LLM Model

Purpose

The large language model (LLM) used in this project generates song lyrics that emulate the vocal style of SP Balasubrahmanyam. The model is fine-tuned on a dataset of Telugu song lyrics to capture the nuances of the language and the singer's unique vocal characteristics.

Model Details

Model Name: ai4bharat/IndicBART
Tokenizer: ./albert-indic64k
Checkpoint: ./model_output/pytorch_model.bin
Language Token: <2te> for Telugu
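
The language token marks which language a sequence is in. A minimal sketch of how text might be tagged before tokenization, assuming the `text </s> <2xx>` layout described on the IndicBART model card. The helper names `tag_source` and `tag_target` are hypothetical and are not taken from this repository's scripts:

```python
# Hypothetical helpers. The "text </s> <2xx>" layout is an assumption
# based on the IndicBART model card, not on this repository's scripts.

def tag_source(text: str, lang: str = "te") -> str:
    """Append the end-of-sentence marker and language token to encoder input."""
    return f"{text.strip()} </s> <2{lang}>"


def tag_target(text: str, lang: str = "te") -> str:
    """Prefix decoder-side text with the language token."""
    return f"<2{lang}> {text.strip()} </s>"


if __name__ == "__main__":
    print(tag_source("ఒక పాట"))
    print(tag_target("ఒక పాట"))
```

With the default `lang="te"`, both helpers emit the `<2te>` token listed above.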

Fine-Tuning

The model is fine-tuned with the llm_fine_tuning.py script on the telugu_lyrics_dataset.txt dataset. The goal of fine-tuning is to improve the model's ability to generate coherent and contextually relevant song lyrics in Telugu.

Parameters

Max Length: 1000
Temperature: 1.5
Early Stopping: False
No Repeat Ngram Size: 2
Num Return Sequences: 3
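
The parameters above map naturally onto keyword arguments of the Hugging Face `generate` method. The sketch below collects them in one place. It is an illustration, not the contents of generate_song_indic_bart.py: the function name `build_generation_kwargs` is hypothetical, and `do_sample=True` is an assumption added because temperature only influences sampling-based decoding.

```python
# Assumption: these keys match keyword arguments accepted by the Hugging
# Face `generate` method. `do_sample=True` is NOT listed in this README;
# it is added here because temperature only affects sampling.

def build_generation_kwargs() -> dict:
    """Return the README's generation settings as a keyword dictionary."""
    return {
        "max_length": 1000,          # upper bound on generated sequence length
        "temperature": 1.5,          # values above 1.0 flatten the token distribution
        "early_stopping": False,     # only relevant when beam search is used
        "no_repeat_ngram_size": 2,   # forbid repeating any 2-token sequence
        "num_return_sequences": 3,   # number of candidate outputs per prompt
        "do_sample": True,           # assumption, see the note above
    }


if __name__ == "__main__":
    for name, value in build_generation_kwargs().items():
        print(f"{name} = {value}")
```

With a loaded model and tokenized input, the dictionary could be passed as `model.generate(input_ids, **build_generation_kwargs())`.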

Dependencies

torch
transformers
sentencepiece
indic-nlp-library
accelerate
safetensors
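
A quick way to confirm that the packages above are importable before running any script is sketched below. The module names are assumptions about how each package is imported (in particular, `indicnlp` for indic-nlp-library), and `find_missing` is a hypothetical helper rather than part of this repository:

```python
import importlib.util

# Import names assumed for the packages listed above; "indicnlp" is the
# module name assumed for the indic-nlp-library distribution.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "sentencepiece",
    "indicnlp",
    "accelerate",
    "safetensors",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```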

PYTHONPATH

Ensure that PYTHONPATH includes the user's local packages (adjust the path to match your own user and Python version):

```
export PYTHONPATH=/home/ubuntu/.local/lib/python3.10/site-packages
```
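
Because the path above is specific to one machine, it may help to look up the equivalent directory on your own system. The following sketch uses the standard library `site` module to do that; the function name `pythonpath_export_line` is hypothetical:

```python
import site


def pythonpath_export_line() -> str:
    """Build an export command pointing at the current user's site-packages."""
    # site.getusersitepackages() returns the per-user package directory
    # for the running interpreter.
    return f"export PYTHONPATH={site.getusersitepackages()}"


if __name__ == "__main__":
    print(pythonpath_export_line())
```

Running it with the same interpreter used for the project prints an export line that can be pasted into the shell.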

Current Progress

As of now, the project is approximately 80% complete. The following tasks have been completed:

Analyzed the provided data and reviewed the datasets.
Attempted to clean the dataset with clean_dataset.py.
Updated the README.md file with the currently completed percentage.
Edited the clean_dataset.py script to ensure it correctly removes line numbers.
Verified the contents of the cleaned_spb_texts.txt file to ensure the data is correctly formatted.
Drafted the research paper in IEEE format.
Created and pushed the LLM fine-tuning script to the repository.
Started and completed the fine-tuning process for the LLM using the formatted_spb_texts.txt dataset.
Generated new song demos in Telugu using the generate_song_indic_bart.py script.
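
The line-number removal step mentioned in the list above can be sketched as follows. This is a hypothetical illustration, not the logic of the repository's clean_dataset.py: the regular expression and the names `strip_line_number` and `clean_lines` are assumptions about what such a cleaner might look like.

```python
import re

# Hypothetical pattern: optional indent, digits, an optional separator
# such as "." or ")", then whitespace. This is an illustration, not the
# logic of the repository's clean_dataset.py.
LINE_NUMBER = re.compile(r"^\s*\d+\s*[.):|-]?\s+")


def strip_line_number(line: str) -> str:
    """Remove one leading line number from a line, if present."""
    return LINE_NUMBER.sub("", line, count=1)


def clean_lines(lines):
    """Strip leading line numbers and drop lines that end up empty."""
    cleaned = (strip_line_number(line).strip() for line in lines)
    return [line for line in cleaned if line]


if __name__ == "__main__":
    sample = ["1. first lyric line", "", "2) second lyric line"]
    for line in clean_lines(sample):
        print(line)
```

Note that a pattern like this would also strip a number that legitimately begins a lyric line, so the output should be spot-checked against the source text.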

Future Work

The following tasks are still pending:

Upload all datasets to GitHub, pre-train the LLM, and deploy the model.
Find an alternative pre-trained model suitable for fine-tuning on SP Balasubrahmanyam's songs.
Ensure generated songs accurately capture multiple languages.
Request user assistance to resolve persistent shell and browser timeout issues.
Continue refining the LLM to generate songs that accurately emulate SP Balasubrahmanyam's voice.
Upload the refined song demos to GitHub after reaching 100% completion.

References

SP Balasubrahmanyam's Wikipedia page
SP Balasubrahmanyam's official YouTube channel
Song lyrics websites (Genius, Gaana, Musixmatch, tamil2lyrics.com, Smule)
transformers library documentation: https://huggingface.co/transformers/
torch library documentation: https://pytorch.org/
beautifulsoup4 library documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests library documentation: https://docs.python-requests.org/en/latest/
accelerate library documentation: https://huggingface.co/docs/accelerate/index
