This repository contains the source code for an implementation of the base paper: *Multi-Lingual Question Generation with Language Agnostic Language Model*.
The project focuses on multilingual automatic multiple-choice question (MCQ) generation: building a robust, efficient system that can automatically generate high-quality multiple-choice questions in several languages.

Automating MCQ generation has several benefits. It significantly reduces the time and effort required to produce a large number of questions, letting educators and trainers focus on other essential aspects of teaching and content development. By using predefined rules and algorithms, the system can produce questions that adhere to specific guidelines, styles, and difficulty levels; this consistency helps maintain fairness and reliability in assessments, ensuring that all learners are evaluated on an equal basis. In short, automation brings efficiency, scalability, standardization, customization, and improved learning experiences to educational institutions, trainers, and assessment organizations. It streamlines the question-creation process, supports diverse question types and languages, and contributes to fair and effective assessments. Hence, the project aims to create a language-agnostic model for generating multiple-choice questions in multiple languages.
- First open the Colab notebook (linked below). Go to **File** in the menu bar, select **Save a copy in Drive**, and rename your copy accordingly.
- Run all cells. These install the required libraries, train an LSTM model for each language (the low-level module), fine-tune the Transformer that is shared across all languages (the high-level module), and run the distractor code, which uses WordNet.
- The front end is built with Anvil, which provides a unified platform for creating and sharing apps and analyses. Anvil Uplink links the back-end code to the Anvil app from anywhere on the Internet; a server uplink makes Python code behave like a server module.
- Functions defined in the uplink code can be called from the application using `anvil.server.call`.
- Once the connection is established, the application can be run by providing input text and receiving MCQs in return.
- Now click the link and enter an input paragraph. (The paragraph must be in one of these languages: English, Hindi, Korean, French, or Chinese.) Then click **Generate MCQ**.
- Generating the MCQs takes around 2 minutes, as the distractor module needs time to formulate plausible wrong options.
- IMPORTANT: You will find all resources for this project in the links at the bottom of this README.
First of all, download the Wikipedia dumps from https://dumps.wikimedia.org/. The base paper uses 10 languages for pre-training:
Language | Short name | Size |
---|---|---|
Chinese | zh | 1.4G |
English | en | 14G |
Korean | ko | 679M |
French | fr | 4.4G |
Hindi | hi | 430M |
Burmese | bu | 208M |
German | de | 5.8G |
Vietnamese | vi | 979M |
Japanese | ja | 2.8G |
Min Nan Chinese | mi | 124M |
Note that the number of pre-training languages can be larger than the number of fine-tuning languages.
- The first step of implementation is training a separate LSTM model for each of the five languages. Each LSTM provides basic-level understanding of its own language.
- Transformers offer thousands of pretrained models; training one further on data specific to our task is called fine-tuning. The Transformer provides high-level understanding of the subject and is shared across languages: the output of each LSTM is fed into this one common Transformer.
- We use WordNet to generate distractors for the target questions. WordNet is a large lexical database that stores and tags many semantic relationships between words.
- For example, it links synonyms such as *car* and *automobile*. It also captures the different senses of a word: *mouse* can refer to an animal or to a computer mouse. This covers the back-end implementation so far.
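The distractor-selection idea can be sketched in a few lines. Note that this sketch uses a tiny hand-built lexicon as a stand-in for WordNet (the real implementation queries the WordNet database itself); the words and function names here are illustrative only.

```python
import random

# Toy stand-in for WordNet: maps a word to semantically related words.
# (The real implementation looks these up in the WordNet database; this
# tiny hand-built lexicon is only for illustration.)
RELATED = {
    "car": ["automobile", "truck", "bicycle", "bus"],
    "mouse": ["rat", "keyboard", "hamster", "trackpad"],
}

def generate_distractors(answer, n=3, seed=0):
    """Pick n related-but-wrong options for the given correct answer."""
    candidates = [w for w in RELATED.get(answer, []) if w != answer]
    random.Random(seed).shuffle(candidates)
    return candidates[:n]
```

Related words make good distractors precisely because they are plausible: a learner who only half-knows the material may confuse *car* with *truck*, but never with a random unrelated word.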
- For the collaboration with the front end, we use Anvil, which provides a unified platform for non-programmers and data scientists to create and share apps and analyses. Anvil Uplink links the back-end code to the Anvil app from anywhere on the Internet; a server uplink makes Python code behave like a server module. Functions defined in the uplink code can then be called from the application with `anvil.server.call`. After the connection is established, the application runs by taking input text and returning MCQs.
- If you need more insight into the workflow, with code snippets and explanations, VISIT THIS LINK.
The proposed system architecture for generating multilingual multiple-choice questions using LSTM and Transformer models consists of the following sequence of processes:
- Input Data: The system takes a multilingual text document as input, containing the content from which questions need to be generated.
- Language Preprocessing: The input text is preprocessed to handle language-specific challenges such as tokenization, stemming, and stop-word removal. This step ensures that the text is prepared for further processing in a language-agnostic manner.
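The preprocessing step can be sketched as follows. The stop-word sets below are tiny illustrative samples, stemming is omitted, and whitespace-free scripts such as Chinese would need a dedicated tokenizer; this is only a minimal sketch of the idea.

```python
import re

# Tiny illustrative stop-word samples; real lists are much larger.
STOP_WORDS = {
    "en": {"the", "is", "a", "an", "of", "and"},
    "fr": {"le", "la", "les", "de", "et", "un"},
}

def preprocess(text, lang="en"):
    """Lowercase, tokenize on word characters, and drop stop words.

    Sketch only: stemming is omitted, and languages written without
    spaces (e.g. Chinese) require a different tokenization strategy.
    """
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    stops = STOP_WORDS.get(lang, set())
    return [t for t in tokens if t not in stops]
```

For example, `preprocess("The capital of France is Paris")` keeps only the content words `capital`, `france`, and `paris`.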
- Language Understanding (LSTM Model): The preprocessed text is fed into an LSTM (Long Short-Term Memory) model for language understanding. The LSTM model is trained on multilingual text data to capture the context and semantics of the sentences effectively, and its output represents the learned representations of the input sentences. There is a separate LSTM for each language.
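To make the LSTM step concrete, here is a single LSTM cell unrolled over a sequence in plain Python, with a scalar state and hand-picked toy weights shared across the gates. This is a pedagogical sketch, not the trained model; real cells use learned weight matrices and vector-valued states.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, w=0.5, u=0.1, b=0.0):
    """One LSTM time step with scalar state and shared toy weights.

    i/f/o are the input, forget, and output gates; g is the candidate
    cell value. Real models use separate learned weight matrices per gate.
    """
    i = sigmoid(w * x + u * h_prev + b)    # input gate
    f = sigmoid(w * x + u * h_prev + b)    # forget gate
    o = sigmoid(w * x + u * h_prev + b)    # output gate
    g = math.tanh(w * x + u * h_prev + b)  # candidate cell state
    c = f * c_prev + i * g                 # new cell state
    h = o * math.tanh(c)                   # new hidden state
    return h, c

def encode(sequence):
    """Run the cell over a sequence; the final h summarizes the input."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_cell_step(x, h, c)
    return h
```

The gating is what lets the cell remember information across long inputs: the forget gate decides how much of the old cell state to keep, and the input gate decides how much new information to write.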
- Question Generation (Transformer Model): The LSTM model's output is then used as input to a Transformer model for question generation. The Transformer model is responsible for generating meaningful and grammatically correct questions based on the input sentences. It consists of an encoder layer to encode the input sentences and a decoder layer to generate questions from the encoded representations.
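The core operation inside each Transformer layer is scaled dot-product attention, shown below in plain Python over small toy vectors. This sketch omits the learned projection matrices and multiple heads of a real Transformer; it only illustrates how each position builds its output as a weighted average over all positions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query attends to every key; the output is the attention-weighted
    average of the value vectors. Real Transformers add learned query/key/
    value projections and multiple heads around this same core.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Because the attention weights always sum to 1, each output vector lies inside the range spanned by the value vectors, i.e. it is a convex combination of them.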
- Answer Generation: The Transformer model generates a set of candidate answers based on the encoded representations. These candidates can be produced by conditioning the decoder on the encoded representations or by using a separate answer-generation module, and they can include possible correct answers as well as distractors.
- Ranking and Selection: The generated candidate answers are ranked by their relevance to the input sentences. Various techniques, such as similarity measures, can be used to assess relevance. The top-ranked answer is selected as the correct answer, and the remaining candidates serve as distractors.
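One simple relevance measure for this ranking step is Jaccard overlap between the candidate's tokens and the source sentence's tokens, sketched below. This is an assumption for illustration; the real system could use embedding similarity or any other relevance score instead.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_candidates(context_tokens, candidates):
    """Rank candidate answers by token overlap with the source sentence.

    The top-ranked candidate is taken as the correct answer; the rest can
    serve as distractors. Jaccard overlap is just one simple relevance
    measure; embedding-based similarity is a common alternative.
    """
    scored = [(jaccard(context_tokens, c.lower().split()), c)
              for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored]
```

Ties keep their original order because Python's sort is stable, which makes the ranking deterministic for reproducible tests.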
- Multiple-Choice Options: Distractors can be generated by substituting or altering words in the correct answer or by extracting alternative answers from the input text. The correct answer and distractors are then randomly shuffled to create the multiple-choice options.
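Assembling the final option list is straightforward; the sketch below shuffles the correct answer in among the distractors and returns its index so the grader knows which choice to accept. Function names and the seeded shuffle are illustrative choices, not the project's exact code.

```python
import random

def build_options(correct, distractors, seed=0):
    """Shuffle the correct answer in among the distractors.

    Returns the option list and the index of the correct answer. A seeded
    RNG is used here only to keep the example reproducible.
    """
    options = [correct] + list(distractors)
    rng = random.Random(seed)
    rng.shuffle(options)
    return options, options.index(correct)
```

Shuffling matters: if the correct answer always appeared in the same position, test-takers could exploit the pattern.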
- Iterative Process: Steps 3 to 7 are repeated for each sentence in the input text document to generate multiple-choice questions for the entire document.
- Output: The system outputs the generated multiple-choice questions along with the correct answer and distractors, forming a complete set of questions for the multilingual text document.

This proposed architecture combines the strengths of the LSTM for language understanding and the Transformer for question generation, enabling the system to generate multilingual multiple-choice questions accurately and effectively.
We propose a model that is divided into two modules: the low-level module and the high-level module. The overall model is trained for five languages: English, Hindi, Korean, French, and Chinese. The low-level module, which is developed individually for each language, implements an LSTM (Long Short-Term Memory) encoder for low-level understanding of that language. The high-level module implements the Transformer model for higher-level understanding of the information and is common to all the languages.
To implement the project we used Python as the programming language and a Google Colab notebook as the platform to run the code. We first installed the basic dependencies, such as transformers and sentencepiece, and then implemented the different modules of the project. Low-level module: first we trained five different LSTMs, one for each of our five listed languages:
- English
- Hindi
- Korean
- French
- Chinese
Here is the implementation demonstration for each language:
- Improving the distractor generator.
- The project can be expanded to support a wider range of languages. By leveraging multilingual NLP techniques and resources, the system can generate MCQs in multiple languages, catering to a broader user base.
- Generating MCQs in a language different from that of the input text.
Video Demo: Link
Google Colab Link: Link
Webpage link: Link
Base Paper: Link
Paper published at IEEE (In Process): Link
Final Year Project Report: Link
This project was selected for the final round of the 12th CSI InApp International Student Project Awards 2023, placing 8th among 900+ teams.