This is an implementation of the LSTM-CRF tagger proposed in the paper Neural Architectures for Named Entity Recognition. The model is trained and tested on the SIGHAN 2005 bake-off dataset for Chinese word segmentation.
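At its core, the model is a character-level BiLSTM whose per-character tag scores feed a CRF layer. The sketch below is purely illustrative (it assumes tensorflow_addons for the CRF log-likelihood; the actual model code in this repo may differ in layer sizes and details):

```python
import tensorflow as tf
import tensorflow_addons as tfa

class BiLSTMCRF(tf.keras.Model):
    """Illustrative BiLSTM-CRF sketch, not the repo's exact implementation."""

    def __init__(self, vocab_size, num_tags, emb_dim=100, lstm_units=100):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, emb_dim, mask_zero=True)
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(lstm_units, return_sequences=True))
        self.projection = tf.keras.layers.Dense(num_tags)
        # CRF transition scores, learned jointly with the network.
        self.transitions = tf.Variable(tf.random.uniform([num_tags, num_tags]))

    def call(self, char_ids):
        x = self.embedding(char_ids)
        x = self.bilstm(x)
        return self.projection(x)  # per-character emission scores

    def crf_loss(self, emissions, tag_ids, seq_lens):
        log_likelihood, _ = tfa.text.crf_log_likelihood(
            emissions, tag_ids, seq_lens, self.transitions)
        return -tf.reduce_mean(log_likelihood)
```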
Python 3, NumPy, TensorFlow 2.0
You can directly use the prepared corpus ./corpus/pku_training.txt to train your own tagger. Here, pku_training.txt was created by running process_corpus.py on pku_training.utf8, which is provided by the SIGHAN 2005 bake-off.
If you have downloaded the SIGHAN 2005 bake-off and want to train your model on another corpus included in the bake-off (e.g. msr_training.utf8, cityu_training.utf8), you need to run process_corpus.py to get the training-ready file. The corpus provided by SIGHAN 2005 is in the original word-segmented format:
中共中央 总书记 、 国家 主席 江 泽民
( 一九九七年 十二月 三十一日 )
You can run process_corpus.py like this:
python process_corpus.py /path/of/pku_training.utf8 SIGHAN2005 /path/of/pku_training_ready.txt
Now you've converted the original file to the training-ready file in this format (S: single, B: beginning, M: middle, E: end):
中 B
共 M
中 M
央 E
总 B
书 M
记 E
、 S
国 B
家 E
主 B
席 E
江 S
泽 B
民 E
( S
一 B
九 M
九 M
七 M
年 E
十 B
二 M
月 E
三 B
十 M
一 M
日 E
) S
...
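The conversion itself just expands each space-separated word into per-character tags: a one-character word becomes S, and a longer word becomes B, M..., E. A rough sketch of that logic (illustrative only, not the actual process_corpus.py):

```python
def word_to_tags(word):
    """Map one segmented word to its per-character S/B/M/E tags."""
    if len(word) == 1:
        return [(word, "S")]
    return ([(word[0], "B")]
            + [(c, "M") for c in word[1:-1]]
            + [(word[-1], "E")])

def convert_line(line):
    """Convert one space-separated corpus line into 'character tag' rows."""
    rows = []
    for word in line.split():
        rows.extend(word_to_tags(word))
    return "\n".join(f"{c} {t}" for c, t in rows)

print(convert_line("中共中央 总书记"))
# 中 B
# 共 M
# 中 M
# 央 E
# 总 B
# 书 M
# 记 E
```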
For the Chinese NER task, I've provided the processed SIGHAN 2006 bake-off MSRA dataset.
- Create a TFRecord dataset for Chinese word segmentation.
python build_tfrecords.py /path/of/pku_training_ready.txt /path/of/dataset /path/of/tag_dict
This script converts your training-ready file to TFRecord format, shards the records, and writes them to the directory /path/of/dataset. Meanwhile, it generates the vocabulary file vocab.txt based on the training corpus.
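For orientation, each shard holds serialized tf.train.Example records containing one sentence's character ids and tag ids. The feature names and shard name below are illustrative, not necessarily the ones build_tfrecords.py actually uses:

```python
import tensorflow as tf

def make_example(char_ids, tag_ids):
    """Serialize one sentence (character ids and tag ids) as a tf.train.Example."""
    feature = {
        "char_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=char_ids)),
        "tag_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=tag_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write one shard; the real script writes several shards named like train-00000-of-00005.
with tf.io.TFRecordWriter("/path/of/dataset/train-00000-of-00005") as writer:
    writer.write(make_example([12, 7, 3], [0, 1, 2]).SerializeToString())
```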
- Train the model.
python train_model.py '/path/of/dataset/train-?????-of-?????' '/path/of/dataset/valid-?????-of-?????' ./corpus/people_vec.txt /path/of/dataset/vocab.txt /path/to/save/checkpoints
Here, you can use the Chinese character vectors ./corpus/people_vec.txt to create the embedding layer. These vectors were obtained by applying word2vec to the People's Daily dataset (199801-199806, 2014):
./word2vec -train /path/of/peoplesdaily.txt -output people_vec.txt -size 100 -window 5 -sample 1e-5 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 5
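Since -binary 0 is used, people_vec.txt is in the plain-text word2vec format: a header line, then one character followed by its 100 floats per line. A rough sketch of loading it into an embedding matrix aligned with vocab.txt (function and variable names here are illustrative):

```python
import numpy as np

def load_pretrained_embeddings(vec_path, vocab, dim=100):
    """Build an embedding matrix aligned with the vocabulary.

    Characters missing from the vector file keep a random initialization.
    """
    vectors = {}
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # skip the word2vec header line: "<vocab_size> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    for i, ch in enumerate(vocab):
        if ch in vectors:
            matrix[i] = vectors[ch]
    return matrix
```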
I use almost the same hyperparameters (see configuration.py) as the training setup reported in Neural Architectures for Named Entity Recognition, and I train the model for 100 epochs. In addition, I use L2 regularization with a lambda of 0.0001 to prevent the model from overfitting.
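With the TF 2.0 Keras API, that L2 penalty would typically be attached to the layer kernels, roughly like this (illustrative layers, not the exact ones in configuration.py):

```python
import tensorflow as tf

num_tags = 4  # S, B, M, E
l2 = tf.keras.regularizers.l2(1e-4)  # lambda = 0.0001
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(100, return_sequences=True, kernel_regularizer=l2))
projection = tf.keras.layers.Dense(num_tags, kernel_regularizer=l2)
```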
Note that the paper says, "To prevent the learner from depending too heavily on one representation class, dropout is used." However, there is only one source representation in my implementation, so dropout over the input embedding layer is not used by default (emb_dropout = 0). In fact, the macro F1 score only reaches around 0.79 if I set emb_dropout to 0.5.
The pre-trained embeddings can either be fine-tuned or kept static during training. However, fine-tuning the Chinese character embeddings via backpropagation outperforms keeping them static by a large margin, as shown below:
Method | Macro F1 |
---|---|
static | 81.34 |
non-static | 93.25 |
Furthermore, using static embeddings throughout training leads to overfitting around epoch 12, while overfitting is not observed even at epoch 100 for non-static embeddings.
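Static vs. non-static simply comes down to whether the embedding weights stay trainable. A minimal sketch (the random matrix below is just a stand-in for the one built from people_vec.txt):

```python
import numpy as np
import tensorflow as tf

# Stand-in for the pre-trained matrix built from people_vec.txt.
embedding_matrix = np.random.uniform(-0.25, 0.25, (5000, 100)).astype(np.float32)

embedding = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True,  # non-static: fine-tune via backpropagation (macro F1 93.25 above)
)
# Setting trainable=False keeps the embeddings static (macro F1 81.34 above).
```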