The Swiss Institute of Bioinformatics (SIB/ELIXIR-CH), the Database Center for Life Science (DBCLS-Japan) and RIKEN-Japan are joining efforts to develop an open-source, artificial intelligence (AI)-driven system for intuitive querying of scientific datasets, with the aim of accelerating scientific innovation. We call for contributions to these efforts, which align with the BioHackathon's goal of fostering an open-source infrastructure for data integration and address the urgent need for effective data retrieval methods.
Our goal is to make it easier for life scientists to use databases by converting their questions into SPARQL queries using large language models (LLMs). Researchers struggle with the complexity of SPARQL and with unfamiliar knowledge base schemas, so we propose a user interface that combines LLMs and knowledge bases. This will allow direct interaction with data in natural language, simplifying the research process. Because it uses LLMs to generate SPARQL queries grounded in validated scientific data, our approach will support data discovery and retrieval with the accuracy that scientific research requires.
Despite their abilities in areas such as code generation, LLMs often struggle with the semantic accuracy of SPARQL queries. Our project focuses on addressing these limitations, so that conversational AI can accurately interpret research questions and translate them into precise queries. It aligns with the objectives of the ELIXIR 2024-26 Programme and lays the groundwork for future research collaborations, offering a practical solution for data-driven discovery in the life sciences.
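As a rough illustration of the question-to-SPARQL flow described above, the following minimal Python sketch builds a schema-grounded prompt, extracts the query from the model's reply, and applies a shallow sanity check. It is not the project's implementation. Everything named here is an assumption made for illustration: the `ex:` vocabulary, the `<query>` tag convention, the helper names, and `fake_llm`, which stands in for a real model call.

```python
import re
from typing import Callable, Sequence

# Hypothetical schema hints; a real system would derive them from the target knowledge base.
SCHEMA_HINTS = [
    "PREFIX ex: <http://example.org/>",
    "ex:Protein is the class of protein records",
    "ex:organism links a protein to its organism",
]


def build_prompt(question: str, schema_hints: Sequence[str]) -> str:
    """Combine the user's question with schema hints so the model is grounded in the schema."""
    hints = "\n".join(f"- {hint}" for hint in schema_hints)
    return (
        "Translate the question into a SPARQL query.\n"
        f"Schema hints:\n{hints}\n"
        f"Question: {question}\n"
        "Wrap the query in <query> and </query> tags."
    )


def extract_query(llm_output: str) -> str:
    """Pull the query text out of the (assumed) <query>...</query> wrapper."""
    match = re.search(r"<query>(.*?)</query>", llm_output, re.DOTALL)
    if match is None:
        raise ValueError("no <query> block found in LLM output")
    return match.group(1).strip()


def looks_like_sparql(query: str) -> bool:
    """Shallow check only: a query form keyword is present and braces are balanced."""
    has_form = re.search(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", query, re.IGNORECASE) is not None
    balanced = query.count("{") == query.count("}")
    return has_form and balanced


def question_to_sparql(question: str, llm: Callable[[str], str]) -> str:
    """Run the full flow: prompt, model call, extraction, sanity check."""
    query = extract_query(llm(build_prompt(question, SCHEMA_HINTS)))
    if not looks_like_sparql(query):
        raise ValueError("generated text does not look like a SPARQL query")
    return query


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned answer."""
    return (
        "Here is the query:\n<query>\n"
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?protein WHERE { ?protein a ex:Protein ; ex:organism ex:Human . }\n"
        "</query>"
    )


if __name__ == "__main__":
    print(question_to_sparql("Which proteins are found in humans?", fake_llm))
```

The check in `looks_like_sparql` is syntactic only. It says nothing about semantic accuracy, which is the harder problem this project targets; a fuller pipeline would at least parse the query and run it against the endpoint.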
If you want to contribute or are just curious about our work, see:
👩‍💻 Project code: https://github.com/jcrangel/SPARQL4ELIXIR
📝 Project backlog: https://github.com/users/jcrangel/projects/9
- Tarcisio Mendes de Farias, SIB Swiss Institute of Bioinformatics (ELIXIR-CH)
- Julio Rangel, RIKEN - JAPAN
- Vincent Emonet, SIB Swiss Institute of Bioinformatics (ELIXIR-CH)
- TBD