Releases · tchewik/isanlp_rst

26 Aug 12:00

tchewik

v3.0

06879ff

Key Features and Improvements

Multiple End-to-End Models for Russian and English:
- Russian (rstreebank): As always, this version includes a model trained on RuRSTreebank, providing robust parsing capabilities for Russian texts.
- Bilingual Model (gumrrg): A new bilingual model trained on a mix of GUM and RRG, offering enhanced parsing performance across multiple genres.
- English Model (rstdt): An English model trained on the RST-DT benchmark.
Modified DMRST Architecture: We have implemented modifications to the DMRST architecture, improving both segmentation and tree construction.
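
The three model names above can be summarised as a small lookup table. The sketch below is only an illustration: `MODEL_VERSIONS` and `pick_model_versions` are hypothetical names that are not part of the isanlp_rst API, and the language codes are an assumed encoding of what the notes state (the bilingual model is taken to cover Russian and English).

```python
# Hypothetical lookup of the v3.0 model versions listed above.
# The "languages" sets are an assumption derived from the release notes.
MODEL_VERSIONS = {
    "rstreebank": {"languages": {"ru"}, "trained_on": "RuRSTreebank"},
    "gumrrg": {"languages": {"ru", "en"}, "trained_on": "GUM + RRG"},
    "rstdt": {"languages": {"en"}, "trained_on": "RST-DT"},
}


def pick_model_versions(language):
    """Return the names of all listed versions that cover `language`, sorted."""
    return sorted(
        name
        for name, info in MODEL_VERSIONS.items()
        if language in info["languages"]
    )


candidates_ru = pick_model_versions("ru")
candidates_en = pick_model_versions("en")
```

A lookup like this only narrows down the candidates; which version suits a given genre still has to be checked against the project's README.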

25 Aug 09:39

tchewik

v2.1

2ef241e

The algorithm is unchanged from v2.0; this release adds improved data, fast and accurate morphosyntactic analysis, and minor bug fixes.

Data

RuRSTreebank updated (attached jul22 version):

All files have been fixed and are now readable. No more non-constituency subtrees, dangling EDUs, or XML-breaking symbols!
Fixed all paragraph boundaries. "##### " always means the beginning of a new line. Illustration markers (IMG, IMG-TXT, [код]) are now combined into separate EDUs which can be easily filtered.
Fixed punctuation and word parts mistakenly placed in the next EDU in segmentation annotations.
Improved sentence integrity, especially in Blogs:

	News1	News2	Blogs
Before	93.55%	81.79%	74.17%
Now	94.79%	82.37%	81.63%
The consistency of formal structures in the corpus has been improved. Titles, subtitles, lists, illustrations, and conclusions are now annotated similarly throughout the corpus. Identical relation-focused structures are now annotated in the same way. For example, Attribution satellites are now strictly within continuous citation boundaries.
The consistency of relation annotation has been improved. The number of obvious relation assignment errors has been significantly reduced, following the annotator instructions and the statistics in the rest of the data.
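
Because paragraph starts are marked with "#####" and illustration markers now sit in their own EDUs, both can be handled with plain string operations. The sketch below assumes the EDUs are already available as a list of strings and that an illustration EDU consists solely of marker tokens; `is_illustration_edu` and `split_paragraphs` are hypothetical helpers, not functions from the repository.

```python
PARAGRAPH_MARKER = "#####"
ILLUSTRATION_MARKERS = {"IMG", "IMG-TXT", "[код]"}


def is_illustration_edu(edu_text):
    """True if the EDU, ignoring a paragraph marker, holds only illustration markers."""
    tokens = [tok for tok in edu_text.split() if tok != PARAGRAPH_MARKER]
    return bool(tokens) and all(tok in ILLUSTRATION_MARKERS for tok in tokens)


def split_paragraphs(edus):
    """Group EDU strings into paragraphs; an EDU starting with the marker opens a new one."""
    paragraphs = []
    for edu in edus:
        starts_new = edu.startswith(PARAGRAPH_MARKER)
        text = edu[len(PARAGRAPH_MARKER):].strip() if starts_new else edu.strip()
        if starts_new or not paragraphs:
            paragraphs.append([])
        if text:
            paragraphs[-1].append(text)
    return paragraphs


# Made-up example EDUs (assumed input format, not taken from the corpus).
edus = ["##### Title", "##### IMG", "##### First clause,", "which continues here.", "IMG-TXT"]
text_edus = [edu for edu in edus if not is_illustration_edu(edu)]
paragraphs = split_paragraphs(text_edus)
```

Filtering before grouping drops paragraphs that contain nothing but an illustration, which is usually what is wanted when only the running text matters.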

Morphosyntactic analysis

Now uses the recent ru_core_news_lg model from spaCy, because it is fast and accurate.

Minor bug fixes

Fixed a bug in determining some top-level DU boundaries when extracting features for the feature-rich classifiers, so the parser now works correctly on long texts that are not pre-tokenized.
It also produces the DUs without tokenization.

Evaluation

Evaluation of end-to-end parsing on RuRSTreebank, macro averaged over test documents of different genres (attached jul version):

Level	S	N	R	F
Sentence	71.2	54.9	45.9	45.4
Paragraph	72.7	52.1	41.4	41.1
Document	66.5	47.2	37.1	36.8
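
Macro-averaging gives every test document the same weight regardless of its length, which is why the figures can differ noticeably from pooled (micro-averaged) scores. The sketch below contrasts the two, assuming a per-document F1 computed from matched, predicted, and gold constituent counts. The counts are placeholders chosen for the arithmetic, not results from this evaluation, and the function names are hypothetical.

```python
def f1(matched, predicted, gold):
    """F1 from the number of matched, predicted, and gold constituents."""
    if matched == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision = matched / predicted
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)


def macro_f1(docs):
    """Average the per-document F1 scores: each document counts equally."""
    return sum(f1(*doc) for doc in docs) / len(docs)


def micro_f1(docs):
    """Pool the counts over all documents first: long documents dominate."""
    matched = sum(doc[0] for doc in docs)
    predicted = sum(doc[1] for doc in docs)
    gold = sum(doc[2] for doc in docs)
    return f1(matched, predicted, gold)


# Placeholder counts (matched, predicted, gold): one short and one long document.
placeholder_docs = [(8, 10, 10), (50, 100, 100)]
macro = macro_f1(placeholder_docs)
micro = micro_f1(placeholder_docs)
```

With these placeholder counts the short, well-parsed document pulls the macro average above the micro average, since the long document no longer dominates.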

Usage

For a usage example, see the README.

Speed up

You can roughly double the parsing speed by turning off the prediction of relations between paragraphs. To do this, replace the last value in this line with 1.0. Each output tree will then correspond to a single paragraph.

06 Jun 09:14

tchewik

v2.0

8f5cc1d

Paragraph-level trees are constructed with a top-down algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier using contextual embeddings and granularity features.
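
As a rough illustration of what top-down construction means, the sketch below recursively splits a span of EDUs at the boundary that a scoring function rates highest. It is a generic version of the technique, not the parser's own code: `build_top_down` and the toy scorer `balanced` are hypothetical, and in the real system the score would come from the structure classifier.

```python
def build_top_down(edus, score_split, start=0, end=None):
    """Recursively split edus[start:end] at the best-scoring boundary.

    Returns a nested tuple tree whose leaves are the EDU strings.
    """
    if end is None:
        end = len(edus)
    if end <= start:
        raise ValueError("cannot build a tree over an empty span")
    if end - start == 1:
        return edus[start]
    # Try every boundary strictly inside the span and keep the best one.
    best = max(
        range(start + 1, end),
        key=lambda k: score_split(edus, start, k, end),
    )
    return (
        build_top_down(edus, score_split, start, best),
        build_top_down(edus, score_split, best, end),
    )


def balanced(edus, start, k, end):
    """Toy scorer (an assumption for the demo): prefer the most balanced split."""
    return -abs((k - start) - (end - k))


tree = build_top_down(["a", "b", "c", "d"], balanced)
```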

Described and applied in Discourse-aware Text Classification for Argument Mining.

Optimized feature-rich classifiers.
5x speed-up (~70 s per document in RuRSTreebank).
EDUs sharing the "same-unit" relation are now joined into a single EDU.
RuRSTreebank updated (attached feb22 version):
- all documents are now in .rs3 format only;
- all documents contain "#####" paragraph boundary markers;
- some files have been fixed and are now readable.
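
Joining "same-unit" EDUs can be pictured as collapsing every subtree labelled "same-unit" into one leaf. The sketch below assumes a simple binary tree representation; the `Node` class, the "elementary" label for leaves, and `join_same_unit` are hypothetical and do not mirror the repository's data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Binary discourse tree node; leaves carry text and have no children."""
    text: str = ""
    relation: str = "elementary"
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_text(node):
    """Concatenate the text of all leaves under `node`, left to right."""
    if node.left is None and node.right is None:
        return node.text
    return leaf_text(node.left) + " " + leaf_text(node.right)


def join_same_unit(node):
    """Return a copy of the tree with every "same-unit" subtree collapsed into one EDU."""
    if node.left is None and node.right is None:
        return node
    if node.relation == "same-unit":
        return Node(text=leaf_text(node))
    return Node(
        relation=node.relation,
        left=join_same_unit(node.left),
        right=join_same_unit(node.right),
    )


# Made-up example tree (assumed structure, not taken from the corpus).
example = Node(
    relation="elaboration",
    left=Node(relation="same-unit", left=Node("The report,"), right=Node("was published.")),
    right=Node("It is long."),
)
joined = join_same_unit(example)
```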

End-to-end parsing evaluation on RuRSTreebank (attached feb22 version):

Level	S	N	R	F
Sentence	68.5	50.6	38.1	37.7
Paragraph	59.8	38.8	27.5	27.3
Document	52.5	34.2	24.2	23.9

Requires running Docker containers: tchewik/isanlp_udpipe (syntax), tchewik/isanlp_rst:2.0 (RST)

Usage in Python:

from isanlp import PipelineCommon
from isanlp.processor_razdel import ProcessorRazdel
from isanlp.processor_remote import ProcessorRemote
from isanlp.ru.processor_mystem import ProcessorMystem
from isanlp.ru.converter_mystem_to_ud import ConverterMystemToUd
import razdel

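# Set the host of each running Docker container here ->
address_syntax = ('', 3134)  # tchewik/isanlp_udpipe (syntax)
address_rst = ('', 3335)     # tchewik/isanlp_rst:2.0 (RST)

# Pre-tokenizing texts is highly recommended
def tokenize(text):
    """Tokenize text with razdel, keeping paragraph boundaries as newlines."""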

    while '\n\n' in text:
        text = text.replace('\n\n', '\n')
    result = []
    for paragraph in text.split('\n'):
        result.append(' '.join([tok.text for tok in razdel.tokenize(paragraph)]))
    return '\n'.join(result).strip()

ppl = PipelineCommon([
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorRemote(address_syntax[0], address_syntax[1], '0'),
     ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'syntax_dep_tree': 'syntax_dep_tree',
      'postag': 'ud_postag'}),
    (ProcessorMystem(delay_init=False),
     ['tokens', 'sentences'],
     {'postag': 'postag'}),
    (ConverterMystemToUd(),
     ['postag'],
     {'morph': 'morph',
      'postag': 'postag'}),
    (ProcessorRemote(address_rst[0], address_rst[1], 'default'),
     ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
     {'rst': 'rst'})
])
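
To see what the `tokenize` helper above does to a text without installing razdel, the sketch below reproduces its paragraph handling with a simple regular-expression tokenizer. The regex is only a stand-in for `razdel.tokenize` and splits less carefully; `tokenize_keep_paragraphs` is a hypothetical name. Unlike the original, it also drops lines that contain only whitespace.

```python
import re

# Stand-in tokenizer (assumption): words, or single non-space punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_keep_paragraphs(text):
    """Space-separate the tokens of each paragraph; keep one newline between paragraphs."""
    paragraphs = [line for line in text.split("\n") if line.strip()]
    return "\n".join(" ".join(TOKEN_RE.findall(line)) for line in paragraphs)


sample = "Hello, world!\n\n\nHow are you?"
pretokenized = tokenize_keep_paragraphs(sample)
```

The important property is that every paragraph ends up on its own line, since the parser relies on newlines as paragraph boundaries.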

03 Jun 11:18

tchewik

v1.0.1

8579580

Trees are constructed with a greedy bottom-up algorithm. The default structure and label classifiers are both ensembles of a feature-rich sklearn classifier and a neural allennlp classifier using contextual embeddings and granularity features.
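
Greedy bottom-up construction can be sketched as repeatedly merging the adjacent pair of units that a scoring function rates highest until a single tree remains. This is a generic illustration of the technique rather than the parser's implementation: `build_bottom_up`, `count_leaves`, and the toy scorer `prefer_small` are hypothetical, and the real score would come from the structure classifier.

```python
def build_bottom_up(edus, score_pair):
    """Greedily merge the best-scoring adjacent pair until one tree is left.

    Returns a nested tuple tree whose leaves are the EDU strings.
    """
    if not edus:
        raise ValueError("cannot build a tree from an empty EDU list")
    nodes = list(edus)
    while len(nodes) > 1:
        # Index of the left member of the best adjacent pair.
        best = max(
            range(len(nodes) - 1),
            key=lambda i: score_pair(nodes[i], nodes[i + 1]),
        )
        nodes[best:best + 2] = [(nodes[best], nodes[best + 1])]
    return nodes[0]


def count_leaves(node):
    """Number of EDUs under a node of the nested tuple tree."""
    if isinstance(node, tuple):
        return count_leaves(node[0]) + count_leaves(node[1])
    return 1


def prefer_small(left, right):
    """Toy scorer (an assumption for the demo): merge the smallest units first."""
    return -(count_leaves(left) + count_leaves(right))


greedy_tree = build_bottom_up(["a", "b", "c", "d"], prefer_small)
```

Each merge is final, so an early mistake cannot be undone later.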

Trained and evaluated on the first version of the RuRSTreebank corpus (see src/maintenance/corpus/).

Described in https://link.springer.com/chapter/10.1007/978-3-030-72610-2_8
