Releases: tchewik/isanlp_rst
v3.0
Key Features and Improvements
- Multiple End-to-End Models for Russian and English:
  - Russian (`rstreebank`): As always, this version includes a model trained on RuRSTreebank, providing robust parsing capabilities for Russian texts.
  - Bilingual (`gumrrg`): A new bilingual model trained on a mix of GUM and RRG, offering enhanced parsing performance across multiple genres.
  - English (`rstdt`): An English model trained on the RST-DT benchmark.
- Modified DMRST Architecture: We have implemented modifications to the DMRST architecture, improving both segmentation and tree construction.
v2.1
The algorithm is the same as in v2.0, but with improved data, fast and accurate morphosyntactic analysis, and minor bug fixes.
Data
RuRSTreebank updated (attached jul22 version):
- All files have been fixed and are now readable. No more non-constituency subtrees, dangling EDUs, or XML-breaking symbols!
- Fixed all paragraph boundaries. "##### " always marks the beginning of a new line. Illustration markers (IMG, IMG-TXT, [код]) are now combined into separate EDUs that can be easily filtered out.
- Fixed punctuation and word fragments that were mistakenly placed in the next EDU in segmentation annotations.
- Improved sentence integrity, especially in Blogs:

  | | News1 | News2 | Blogs |
  |---|---|---|---|
  | Before | 93.55% | 81.79% | 74.17% |
  | Now | 94.79% | 82.37% | 81.63% |

- Improved the consistency of formal structures in the corpus. Titles, subtitles, lists, illustrations, and conclusions are now annotated uniformly throughout the corpus, and identical relation-focused structures are annotated in the same way. For example, Attribution satellites now lie strictly within continuous citation boundaries.
- Improved the consistency of relation annotation. The number of obvious relation-assignment errors, judged against the annotator instructions and the statistics of the rest of the data, has been significantly reduced.
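Since illustration markers now form separate EDUs, they are easy to drop in post-processing. A minimal sketch, assuming EDUs are available as plain strings (the function name and representation are illustrative, not part of the library API):

```python
# Markers from the annotation notes above; the set can be extended as needed.
ILLUSTRATION_MARKERS = ("IMG", "IMG-TXT", "[код]")

def filter_illustration_edus(edus):
    """Drop EDUs that consist solely of an illustration marker."""
    return [edu for edu in edus if edu.strip() not in ILLUSTRATION_MARKERS]
```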
Morphosyntactic analysis
Now uses the recent `ru_core_news_lg` model from spaCy, as it is fast and accurate.
Minor bug fixes
- Fixed a bug in determining some top-level DU boundaries when extracting features for the feature-rich classifiers; parsing of long texts that are not pre-tokenized now works correctly.
- The parser also outputs the DUs without tokenization.
Evaluation
Evaluation of end-to-end parsing on RuRSTreebank, macro-averaged over test documents of different genres (attached jul22 version):
Level | S | N | R | F |
---|---|---|---|---|
Sentence | 71.2 | 54.9 | 45.9 | 45.4 |
Paragraph | 72.7 | 52.1 | 41.4 | 41.1 |
Document | 66.5 | 47.2 | 37.1 | 36.8 |
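For reference, "macro-averaged over test documents" means every document contributes equally to the reported score, regardless of its length or genre. A minimal sketch of that computation (the function and score layout are illustrative, not library code):

```python
def macro_average(per_doc_scores):
    """Average parseval-style scores over documents, each document weighted equally.

    `per_doc_scores` is a list of dicts, one per document, e.g. {'S': ..., 'N': ...}.
    """
    keys = per_doc_scores[0].keys()
    n = len(per_doc_scores)
    return {k: sum(doc[k] for doc in per_doc_scores) / n for k in keys}
```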
Usage
For a usage example, see the README.
Speed up
You can roughly double the parsing speed by turning off the prediction of relations between paragraphs: replace the last value in this line with 1.0. In this case, each output tree will correspond to a single paragraph.
v2.0
Paragraph-level trees are constructed with a top-down algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier that uses contextual embeddings and granularity features.
Described and applied in Discourse-aware Text Classification for Argument Mining.
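The ensembling amounts to soft voting: the two classifiers' class-probability distributions are averaged and the argmax is taken. A minimal sketch of the idea (illustrative; the real ensemble combines the sklearn and allennlp models):

```python
def ensemble_predict(proba_a, proba_b, labels):
    """Soft vote: average two classifiers' class probabilities, take the argmax.

    `proba_a` and `proba_b` are probability distributions over `labels`.
    """
    averaged = [(pa + pb) / 2 for pa, pb in zip(proba_a, proba_b)]
    return labels[max(range(len(averaged)), key=averaged.__getitem__)]
```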
- Optimized feature-rich classifiers
- Speed-up of about 5x (~70 s per document in RuRSTreebank)
- EDUs sharing the "same-unit" relation are now joined into a single EDU
- RuRSTreebank updated (attached feb22 version):
  - All documents are now in .rs3 format only.
  - All documents contain "#####" paragraph boundary markers.
  - Some files have been fixed and are now readable.
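The "same-unit" joining can be pictured as merging adjacent EDUs linked by that relation. A simplified sketch over a flat EDU list (the actual parser operates on tree nodes; the names and data layout here are assumptions):

```python
def merge_same_unit(edus, relations):
    """Join adjacent EDUs connected by 'same-unit' into a single EDU.

    `edus` is a list of EDU strings; `relations` maps an EDU index to the
    relation it holds with the next EDU.
    """
    merged, buffer = [], edus[0]
    for i in range(1, len(edus)):
        if relations.get(i - 1) == 'same-unit':
            buffer = buffer + ' ' + edus[i]  # extend the current merged unit
        else:
            merged.append(buffer)
            buffer = edus[i]
    merged.append(buffer)
    return merged
```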
End-to-end parsing evaluation on RuRSTreebank (attached feb22 version):
Level | S | N | R | F |
---|---|---|---|---|
Sentence | 68.5 | 50.6 | 38.1 | 37.7 |
Paragraph | 59.8 | 38.8 | 27.5 | 27.3 |
Document | 52.5 | 34.2 | 24.2 | 23.9 |
Requires the running Docker containers `tchewik/isanlp_udpipe` (syntax) and `tchewik/isanlp_rst:2.0` (RST).
Usage in Python:

```python
from isanlp import PipelineCommon
from isanlp.processor_razdel import ProcessorRazdel
from isanlp.processor_remote import ProcessorRemote
from isanlp.ru.processor_mystem import ProcessorMystem
from isanlp.ru.converter_mystem_to_ud import ConverterMystemToUd
import razdel

# put the address here ->
address_syntax = ('', 3134)
address_rst = ('', 3335)

# Highly recommended to pre-tokenize texts
def tokenize(text):
    """ Tokenize text, but keep paragraph boundaries. """
    while '\n\n' in text:
        text = text.replace('\n\n', '\n')

    result = []
    for paragraph in text.split('\n'):
        result.append(' '.join([tok.text for tok in razdel.tokenize(paragraph)]))

    return '\n'.join(result).strip()

ppl = PipelineCommon([
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorRemote(address_syntax[0], address_syntax[1], '0'),
     ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'syntax_dep_tree': 'syntax_dep_tree',
      'postag': 'ud_postag'}),
    (ProcessorMystem(delay_init=False),
     ['tokens', 'sentences'],
     {'postag': 'postag'}),
    (ConverterMystemToUd(),
     ['postag'],
     {'morph': 'morph',
      'postag': 'postag'}),
    (ProcessorRemote(address_rst[0], address_rst[1], 'default'),
     ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
     {'rst': 'rst'})
])
```
v1.0.1
Trees are constructed with a greedy bottom-up algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier that uses contextual embeddings and granularity features.
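The greedy bottom-up construction can be sketched as repeatedly merging the adjacent pair of units that the structure classifier scores highest, until a single tree remains (the `score` function is an illustrative stand-in for the actual classifiers):

```python
def greedy_bottom_up(units, score):
    """Greedy bottom-up tree construction over a sequence of units.

    On each step, merge the adjacent pair with the highest `score(left, right)`
    into a binary node (modeled here as a tuple) until one tree remains.
    """
    trees = list(units)
    while len(trees) > 1:
        i = max(range(len(trees) - 1), key=lambda j: score(trees[j], trees[j + 1]))
        trees[i:i + 2] = [(trees[i], trees[i + 1])]
    return trees[0]
```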
Trained and evaluated on the first version of the RuRSTreebank corpus (see `src/maintenance/corpus/`).
Described in https://link.springer.com/chapter/10.1007/978-3-030-72610-2_8