Named Entity Recognition (NER) and Hyponym/Hypernym Relationship Detection on a Dataset of Patents
The main goals of this project are:
- Train an NER model on a dataset of patents in a specific domain
- Fine-tune it with Prodigy
- Implement automatic detection of hyponyms/hypernyms with Hearst patterns
- Validate the detection results with several methods, including Wikidata
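A Hearst pattern is a lexical template such as "X such as A, B and C", where X is the hypernym and A, B, C are its hyponyms. The block below is a minimal sketch of that single pattern using plain regular expressions. It is not the spaCy configuration from hearst_patterns/patterns.json; the names `SUCH_AS` and `extract_pairs` are made up for illustration, and terms are assumed to be single words to keep the regex readable.

```python
import re

# "X such as A, B and C": X is the hypernym, A/B/C are hyponyms.
# Assumption for this sketch: every term is a single word.
SUCH_AS = re.compile(
    r"\b(?P<hyper>\w+),? such as "
    r"(?P<hypos>\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)


def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        # Split the enumeration on ", ", " and ", " or " (with optional comma).
        hypos = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        pairs.extend((hypo, match.group("hyper")) for hypo in hypos)
    return pairs


sentence = "The device uses sensors such as cameras, microphones and accelerometers."
pairs = extract_pairs(sentence)
```

A parser-based matcher can handle multi-word noun phrases, which this regex cannot; the sketch only shows the shape of the extracted pairs.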
- project.ipynb - main notebook
- G06K.txt.gz - archive with patent texts
- configs/base_config.cfg - base config for training the spaCy NER pipeline
- hearst_patterns/patterns.json - Hearst pattern configuration for spaCy
- extracted_patterns - Extracted patterns (EL) from G06K texts
- Install dependencies from requirements.txt
- Unpack data:
tar -xvf G06K.txt.gz
- Open project.ipynb and run the first cell to check that all imports work properly
Here is a brief overview of the project.ipynb parts.
In this section, the patent text is read and processed to extract potential named entities using a curated list of terms, manyterms.lower.txt.
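As a rough illustration of this step, the sketch below matches a lowercase term list against text and returns character-offset spans. It is an assumption about how such matching could be done with the standard library, not the notebook's actual code; the function names and the `TERM` label are hypothetical.

```python
import re


def load_terms(lines):
    """Normalize a term list: strip whitespace, lowercase, drop blank lines."""
    terms = {line.strip().lower() for line in lines if line.strip()}
    # Longest terms first so that "neural network" wins over "network".
    return sorted(terms, key=len, reverse=True)


def find_term_spans(text, terms, label="TERM"):
    """Return non-overlapping (start, end, label) character spans in text."""
    if not terms:
        return []
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


terms = load_terms(["Neural Network\n", "network\n", "\n", "image sensor\n"])
text = "A neural network processes image sensor data over a network."
spans = find_term_spans(text, terms)
matched = [text[start:end] for start, end, _ in spans]
```

With the real term file, `load_terms` could be given the open file handle for manyterms.lower.txt. The resulting offsets can then be attached to a spaCy document, for example with `Doc.char_span`, to build training annotations.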
Next, the model is trained on the resulting dataset.
Additionally, if you have access to Prodigy, you can apply active learning to tune the model.