- The data is publicly available through the HuggingFace Datasets 🤗 library!
- The model has been updated with the HuggingFace Transformers 🤗 library!
- Usage instructions included!
- The paper was accepted at ICDAR 2024!
This GitHub repository contains the implementation of the Sheet Music Transformer (SMT), a novel model for Optical Music Recognition (OMR) beyond monophonic-level transcription. Traditional approaches mostly reuse monophonic transcription techniques on complex score layouts. The SMT overcomes this limitation with a robust image-to-sequence solution that transcribes polyphonic musical scores directly from images.
This implementation has been developed with Python 3.9, PyTorch 2.0 and CUDA 12.0.
It should also work with earlier versions.
To set up the project, follow these configuration instructions:
Create a virtual environment using either virtualenv or conda and run the following:
git clone https://github.com/antoniorv6/SMT.git
cd SMT
pip install -r requirements.txt
mkdir Data
If you are using Docker to run experiments, create an image with the provided Dockerfile:
docker build -t <image_tag> .
docker run -itd --rm --gpus all --shm-size=8gb -v <repository_path>:/workspace/ <image_tag>
docker exec -it <docker_container_id> /bin/bash
Transcribing scores with the SMT is straightforward thanks to the HuggingFace Transformers 🤗 library. The following code loads the model and transcribes an excerpt:
import torch
import cv2
from data_augmentation.data_augmentation import convert_img_to_tensor
from smt_model import SMTModelForCausalLM

# Load the input image and use the GPU when one is available
image = cv2.imread("sample.jpg")
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pretrained weights and move the model to the chosen device
model = SMTModelForCausalLM.from_pretrained("antoniorv6/<smt-weights>").to(device)
# Add a batch dimension and run the prediction
predictions, _ = model.predict(convert_img_to_tensor(image).unsqueeze(0).to(device),
                               convert_to_str=True)
# Replace the layout tokens with a line break, a space and a tab
print("".join(predictions).replace('<b>', '\n').replace('<s>', ' ').replace('<t>', '\t'))
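The post-processing in the last line of the snippet can be factored into a small helper. This is a minimal sketch: the function name tokens_to_text is hypothetical, and it assumes only that the predictions are a list of token strings in which <b>, <s> and <t> stand for a line break, a space and a tab, as the replacements in the snippet suggest.

```python
def tokens_to_text(tokens):
    """Join predicted tokens and expand the layout markers used in the snippet above."""
    text = "".join(tokens)
    # Same replacements as the original one-liner: <b> = line break, <s> = space, <t> = tab.
    return text.replace("<b>", "\n").replace("<s>", " ").replace("<t>", "\t")


if __name__ == "__main__":
    # Made-up token list for illustration only, not real model output.
    example = ["**kern", "<t>", "**kern", "<b>", "4c", "<t>", "4e"]
    print(tokens_to_text(example))
```

With the model loaded as above, the call would be tokens_to_text(predictions).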
The datasets created to run the experiments are publicly available for replication purposes.
Everything is implemented through the HuggingFace Datasets 🤗 library, so any of these datasets can be loaded with a single line of code:
import datasets
dataset = datasets.load_dataset('antoniorv6/<dataset-name>')
The dataset has two columns: image, which contains the original image of the music excerpt used as input, and transcription, which contains the corresponding bekern notation ground truth that represents the content of this input.
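A quick way to confirm that a loaded sample follows this layout is to check for the two column names. The sketch below is hypothetical helper code, not part of the repository. It works on any dict-like sample and assumes only the column names image and transcription stated above.

```python
EXPECTED_COLUMNS = ("image", "transcription")


def missing_columns(sample):
    """Return the expected column names that are absent from a dict-like sample."""
    return [name for name in EXPECTED_COLUMNS if name not in sample]


if __name__ == "__main__":
    # Stand-in sample; with the real data you would pass one row of the loaded dataset.
    fake_sample = {"image": object(), "transcription": "**kern\n4c\n"}
    print(missing_columns(fake_sample))  # prints []
```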
These experiments are run with the Weights & Biases API and configured through JSON files. To replicate an experiment, run the following commands:
wandb login
python train.py --config <config-path>
The config files are located in the config/ folder. The config file you pass determines which experiment is run.
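To see which experiments are available before launching one, the config folder can be scanned from Python. This is a minimal sketch under two assumptions that the text does not confirm in detail: that the config files use the .json extension, and that they may sit in subfolders of config/. The helper name list_configs is hypothetical, and no particular schema is assumed for the file contents.

```python
import json
from pathlib import Path


def list_configs(config_dir="config"):
    """Map each JSON file found under config_dir to its parsed content."""
    root = Path(config_dir)
    if not root.is_dir():
        return {}  # nothing to list when the folder is absent
    configs = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            configs[str(path)] = json.load(handle)
    return configs


if __name__ == "__main__":
    for path, content in list_configs().items():
        # Only the top-level keys are printed, since the schema is not documented here.
        keys = sorted(content) if isinstance(content, dict) else type(content).__name__
        print(path, keys)
```

Any path printed this way can then be passed to train.py through --config.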
You can write your own config files to train the SMT on your own data! If you use this code, I highly recommend storing your datasets in the same format as the ones provided through HuggingFace Datasets, so that they work with this model. Otherwise, I suggest writing your own data.py file from scratch.
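One possible way to put your own data into the two-column layout (image and transcription) is sketched below. Everything about the input is an assumption made for illustration: a flat folder where each image has a same-named text file holding its transcription, the .png and .txt extensions, and the helper names collect_columns and to_hf_dataset. Whether the training code accepts the result unchanged has not been verified here.

```python
from pathlib import Path


def collect_columns(folder, image_ext=".png", text_ext=".txt"):
    """Pair each image with the same-named text file and return two parallel columns."""
    columns = {"image": [], "transcription": []}
    for image_path in sorted(Path(folder).glob(f"*{image_ext}")):
        text_path = image_path.with_suffix(text_ext)
        if not text_path.is_file():
            continue  # skip images that have no transcription next to them
        columns["image"].append(str(image_path))
        columns["transcription"].append(text_path.read_text(encoding="utf-8"))
    return columns


def to_hf_dataset(columns):
    """Wrap the columns in a HuggingFace Dataset whose image column decodes from paths."""
    import datasets  # imported lazily so collect_columns works without the library

    dataset = datasets.Dataset.from_dict(columns)
    return dataset.cast_column("image", datasets.Image())
```

Calling to_hf_dataset(collect_columns("my_scores")) would then give a dataset with the same column names as the published ones, where my_scores is a placeholder folder name.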
@InProceedings{RiosVila:ICDAR:2024,
author="R{\'i}os-Vila, Antonio
and Calvo-Zaragoza, Jorge
and Paquet, Thierry",
title="Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription",
booktitle="Document Analysis and Recognition - ICDAR 2024",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="20--37",
isbn="978-3-031-70552-6"
}
This work is part of the I+D+i PID2020-118447RA-I00 (MultiScore) project, funded by MCIN/AEI/10.13039/501100011033. Computational resources were provided by the Valencian Government and FEDER funding through IDIFEDER/2020/003.
This work is released under the MIT license.