What is the convention to filter non-words? #1251

otakutyrant · 2023-05-29T06:09:03Z


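I lemmatized a whole book and counted the lemmas, but the most frequent entries are almost all symbols:

, 44348
. 19035
I 9928
— 6975
'' 6835
my 3714
; 3421
- 3172
! 2036
) 1183
's 1154
( 1148
? 1116
' 941
| 575
: 543
[ 313
] 312
* 281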

I know how to filter them: just use isalpha() and isascii(). But I wonder what the convention is in NLP. Is there some built-in API that I do not know about?
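For reference, the filter mentioned in the question can be sketched in plain Python as follows. This is only a minimal illustration, not any library's API: the function name `count_word_lemmas` and the sample list are hypothetical, and the sample stands in for whatever a lemmatizer returns.

```python
from collections import Counter


def count_word_lemmas(lemmas):
    """Count lemmas, keeping only purely alphabetic ASCII tokens."""
    return Counter(
        lemma for lemma in lemmas
        if lemma.isascii() and lemma.isalpha()
    )


# Hypothetical sample standing in for a lemmatizer's output.
sample = [",", "I", "my", ".", "'s", "—", "book", "I", "(", "book", "''"]
counts = count_word_lemmas(sample)
print(counts.most_common())
```

Note that this filter is strict: `isalpha()` also rejects tokens such as `'s` or hyphenated words, and `isascii()` rejects accented words, so it may be too aggressive for some texts.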

Answered by AngledLuffa


AngledLuffa · 2023-05-29T06:41:03Z

Maintainer

We don't provide that facility, but of course you can filter punctuation as you need, or you can read about stopwords. This seems to cover some of it: https://byteiota.com/stopwords/


On Sun, May 28, 2023 at 11:09 PM otakutyrant wrote:

> I lemmatized the whole book and counted them. But the top are symbols:
>
> , 44348
> . 19035
> I 9928
> — 6975
> '' 6835
> my 3714
> ; 3421
> - 3172
> ! 2036
> ) 1183
> 's 1154
> ( 1148
> ? 1116
> ' 941
> | 575
> : 543
> [ 313
> ] 312
> * 281
>
> I know how to filter them. Just use isalpha() and isascii(). But I wonder what the convention is in NLP, like internal API which I do not know?
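As a rough sketch of the two suggestions in the answer (filter punctuation yourself, and drop stopwords), something like the following could work. Everything here is an assumption for illustration: the helper names `is_punct_or_symbol` and `count_content_lemmas` are hypothetical, the stopword set is a tiny hand-picked stand-in for a real list, and punctuation is detected with the standard library's Unicode categories rather than any NLP toolkit feature.

```python
import unicodedata
from collections import Counter

# Tiny illustrative stopword set (an assumption); a real list would be much longer.
STOPWORDS = {"i", "my", "the", "a", "and", "'s"}


def is_punct_or_symbol(token):
    """True if every character is Unicode punctuation (P*) or a symbol (S*)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


def count_content_lemmas(lemmas, stopwords=STOPWORDS):
    """Count lowercased lemmas that are neither pure punctuation nor stopwords."""
    return Counter(
        lemma.lower() for lemma in lemmas
        if not is_punct_or_symbol(lemma) and lemma.lower() not in stopwords
    )


# Hypothetical sample standing in for a lemmatizer's output.
sample = [",", "I", "my", "café", "—", "''", "'s", "|", "*", "café", "well-known"]
print(count_content_lemmas(sample).most_common())
```

Unlike an `isalpha()`/`isascii()` check, this keeps accented and hyphenated words and only discards tokens made up entirely of punctuation or symbols; which behaviour is preferable depends on the text being counted.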

0 replies
