UGCNormal

UGCNormal is a normalizer for user-generated content (UGC) in Brazilian Portuguese. To use it as a service, see ugcnormal_interface. A dockerized service exposing the UGCNormal features is also available: ugcnormal-microservice.


                            UGC-Normalizer

INPUT
|
|    -----------------------------      -------------      -----------
---> | SentenceBoundaryDetection | ---> | tokenizer | ---> | speller | ----
     -----------------------------      -------------      -----------    |
                                                                          |
                                                                          |
     ----------------------------------------------------------------------
     |
     |    --------------      ------------------      ----------
     ---> | siglas_map | ---> | internetes_map | ---> | np_map | ---> OUTPUT
          --------------      ------------------      ----------



>>> HOW TO USE:

Before anything else, run the ./configure.sh script to check and resolve all
dependencies. After that you can run the normalizer script.

The main script is ugc_norm.sh, which applies the normalization pipeline. Run
it with two parameters: INPUT_dir and OUTPUT_dir. INPUT_dir must contain all
the text files to be processed.

You can test the normalizer with the data in the "test" directory:

./ugc_norm.sh ./test/input/ ./test/output/
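
If the normalizer has to be called from a Python program, the command above
can be wrapped with the standard subprocess module. This is only a sketch: the
names build_command and run_normalizer are hypothetical, and it assumes that
ugc_norm.sh is in the current directory and that ./configure.sh has already
been run.

```python
import subprocess


def build_command(input_dir, output_dir, script="./ugc_norm.sh"):
    """Return the argument list for one ugc_norm.sh run (hypothetical helper)."""
    return [script, str(input_dir), str(output_dir)]


def run_normalizer(input_dir, output_dir, script="./ugc_norm.sh"):
    """Run the pipeline; raises CalledProcessError if the script exits non-zero."""
    return subprocess.run(build_command(input_dir, output_dir, script), check=True)
```

Calling run_normalizer("./test/input/", "./test/output/") would then be
equivalent to the shell command shown above.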



>>> MORE INFO:

******************************* test
Input and output directories for testing the normalizer. The output directory
tree holds the output produced by each step of the pipeline (sent -> tok ->
checked -> siglas -> internetes -> nomes). The deepest directory ('nomes')
holds the result of the full pipeline, which is probably the only result you
are interested in.
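
To pick up the final result programmatically, one option is to search the
output tree for the 'nomes' directory. The sketch below makes no assumption
about how the step directories are nested; find_final_output is a hypothetical
helper name, not part of this repository.

```python
import os


def find_final_output(output_dir, final_step="nomes"):
    """Return the first directory named `final_step` under output_dir, or None."""
    for root, dirs, _files in os.walk(output_dir):
        if final_step in dirs:
            return os.path.join(root, final_step)
    return None
```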


******************************* internetes_map.pl
Perl script that translates web language (internet slang) into standard words
using a dictionary
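
The idea of a dictionary-based translation can be illustrated in Python. This
is not the Perl script itself, only a sketch of the technique: the two
dictionary entries are hypothetical samples, and the function name
map_internetes is made up for the example.

```python
import re

# Hypothetical sample entries; the real dictionary lives in ./resources.
INTERNETES = {"vc": "você", "tb": "também"}


def map_internetes(text, dictionary=INTERNETES):
    """Replace every whole word that has a dictionary entry (case-insensitive lookup)."""
    def replace(match):
        word = match.group(0)
        return dictionary.get(word.lower(), word)

    # \w+ matches whole words, so "tb" inside a longer word is left alone.
    return re.sub(r"\w+", replace, text)
```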


******************************* np_map.pl
Perl script that normalizes NPs (names, the 'nomes' step) listed in
./resources/np_data.txt. It only capitalizes the first letter
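
A Python sketch of the same idea, for illustration only. The sample names are
hypothetical, the format of np_data.txt is not assumed (the names are given as
an in-memory set), and capitalize_names is a made-up function name.

```python
# Hypothetical sample entries; the real list is ./resources/np_data.txt.
NAMES = {"maria", "brasil"}


def capitalize_names(tokens, names=NAMES):
    """Capitalize the first letter of each token that is a known name."""
    return [t[:1].upper() + t[1:] if t.lower() in names else t for t in tokens]
```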


******************************* siglas_map.pl
Perl script that converts a word to all upper case if it is listed in
./resources/lexico_siglas.txt (siglas = acronyms)
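
The lookup-and-uppercase step can be sketched in Python as follows. The
lexicon entries are hypothetical samples standing in for lexico_siglas.txt,
and uppercase_siglas is a made-up name, not the Perl implementation.

```python
# Hypothetical sample entries; the real lexicon is ./resources/lexico_siglas.txt.
SIGLAS = {"usb", "cpu"}


def uppercase_siglas(tokens, siglas=SIGLAS):
    """Upper-case every token found in the acronym lexicon; leave the rest as-is."""
    return [t.upper() if t.lower() in siglas else t for t in tokens]
```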


******************************* upper_handler.py
Checks whether a text file is entirely in uppercase. If it is, only words
after punctuation are capitalized and all the others are set to lowercase
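
A minimal sketch of this behaviour, not the code of upper_handler.py. Two
points are assumptions: "punctuation" is taken to mean '.', '!' or '?', and
the first word of the text is capitalized as well. The function name
handle_all_caps is hypothetical.

```python
import re


def handle_all_caps(text):
    """Lowercase an all-uppercase text, then re-capitalize sentence starts."""
    letters = [c for c in text if c.isalpha()]
    if not letters or any(c.islower() for c in letters):
        return text  # mixed case (or no letters at all): leave untouched
    lowered = text.lower()
    # Assumption: capitalize at the start of the text and after . ! or ?
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )
```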


******************************* upper_periods.py
Capitalizes the first word after each period
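
As an illustration (not the code of upper_periods.py), the rule can be written
as a single regular-expression substitution. The sketch assumes that a period
must be followed by whitespace to count as a sentence end, so decimals such as
"1.5" are not touched; capitalize_after_periods is a hypothetical name.

```python
import re


def capitalize_after_periods(text):
    """Upper-case the first word character that follows a period plus whitespace."""
    return re.sub(
        r"(\.\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
```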


******************************* README.txt
This file


******************************* resources
Directory with dictionaries for NPs and web language


******************************* SentenceBoundaryDetection
Sentence boundary detection tool. It appends an <S> tag at the end of each
sentence
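
Text tagged this way can be split back into sentences with a few lines of
Python. The sketch below only consumes the tagged output; it does not
reproduce the detection itself. It assumes nothing about the whitespace around
the tag, and split_tagged_sentences is a hypothetical helper name.

```python
def split_tagged_sentences(text, tag="<S>"):
    """Split <S>-tagged text into sentences, dropping the tag and empty pieces."""
    return [part.strip() for part in text.split(tag) if part.strip()]
```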


******************************* speller
Speller tool directory


******************************* tokenizer
Tokenizer tool directory. You can change the lex rules in webtok.lex and
rebuild the tokenizer by running make with the provided Makefile


******************************* utils
- ./utils/extract.sh
This script extracts all opinions (text files) from a corpus spread over many
subdirectories

References

Duran, M. S.; Avanço, L. V.; Nunes, M. G. V. (2015). A Normalizer for UGC in Brazilian Portuguese. In: ACL 2015, Workshop on Noisy User-generated Text - WNUT, 2015, Beijing, China, p. 38-47. http://aclanthology.info/papers/W15-4305/a-normalizer-for-ugc-in-brazilian-portuguese
