Skip to content

A tool to transform JmdictFurigana dictionaries for compatibility with SudachiPy. It restructures the data and converts readings to katakana for efficient furigana lookup during morpheme tokenization in SudachiPy.

Notifications You must be signed in to change notification settings

katanabana/JmdictFurigana-SudachiPy-Transformer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Making JmdictFurigana compatible with SudachiPy

Create transformed dictionary files

  1. install wanakana: pip install wanakana
  2. run in terminal:
python main.py initial_directory transformed_directory
  • initial_directory - directory with initial Jmdictfurigana json files
  • transformed_directory - directory where json files with transformed dictionaries should be created

Initial vs Transformed dictionary formats

For fast items accessing JmdictFurigana dictionaries should be transformed

from:

[
    {text: surface,
     reading: reading,
     furigana: [{ruby: part_surface, rt: part_reading}, ...]
     },
    ...
]

to:

{
    surface: {
              to_katakana(reading_1): 
                    [[surface_part_1, to_katakana(reading_1_part_1)], [surface_part_2, to_katakana(reading_1_part_2)], ...],
              to_katakana(reading_2): 
                    [[surface_part_1, to_katakana(reading_2_part_1)], [surface_part_2, to_katakana(reading_2_part_2)], ...],
              ...
              },
    ...
}

For items that have the same (surface, to_katakana(surface_reading)) furigana with the greatest number of parts (maximum length) is chosen.

Readings are transformed to katakana for making them compatible with sudachipy.Morpheme since morpheme.reading_form() is always written in katakana.

Dictionaries usage

from json import load
from sudachipy import Dictionary

text = '可笑しい'
tokenizer = Dictionary().create()
morphemes = tokenizer.tokenize(text)
path = 'path/to/transformed/dictionary/file.json'

with open(path, encoding='utf-8-sig') as file:
    dictionary = load(file)
    for morpheme in morphemes:
        parts = dictionary[morpheme.surface()][morpheme.reading_form()]
        print(parts)
[['可', 'オ'], ['笑', 'カ'], ['しい', 'シイ']]

About

A tool to transform JmdictFurigana dictionaries for compatibility with SudachiPy. It restructures the data and converts readings to katakana for efficient furigana lookup during morpheme tokenization in SudachiPy.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages