What is the convention to filter non-words? #1251
Answered by AngledLuffa
otakutyrant asked this question in Q&A
We don't provide that facility, but of course you can filter punctuation as you need, or you can read about stopwords.
This seems to cover some of it:
https://byteiota.com/stopwords/
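As a rough sketch of the stopword route (not a facility of the library): the `STOPWORDS` set and the `drop_stopwords` helper below are hypothetical names with a hand-picked word list, used only for illustration. A real list would come from a resource such as the one linked above.

```python
from collections import Counter

# Hypothetical, hand-picked stopword set for illustration only;
# a real stopword list would be much longer.
STOPWORDS = {"i", "my", "the", "a", "and", "of", "to"}


def drop_stopwords(counts, stopwords=STOPWORDS):
    """Return a new Counter without entries whose lowercased key is a stopword."""
    return Counter(
        {tok: n for tok, n in counts.items() if tok.lower() not in stopwords}
    )


# "ship" is an invented entry for illustration; the other counts are
# taken from the question quoted in this thread.
counts = Counter({",": 44348, "I": 9928, "my": 3714, "ship": 100})
filtered = drop_stopwords(counts)
```

Stopword removal only drops the listed words, so punctuation such as `,` survives and still needs its own filter.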
On Sun, May 28, 2023 at 11:09 PM otakutyrant wrote:
I lemmatized the whole book and counted the lemmas, but the top entries are mostly symbols:
, 44348
. 19035
I 9928
— 6975
'' 6835
my 3714
; 3421
- 3172
! 2036
) 1183
's 1154
( 1148
? 1116
' 941
| 575
: 543
[ 313
] 312
* 281
I know how to filter them: just use isalpha() and isascii(). But I wonder
what the convention is in NLP, such as an internal API that I do not know about?
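The check described above can be sketched in plain Python as follows. The helper name `is_plain_word` is hypothetical, and the counts are a handful of the ones quoted above. Note that `str.isascii()` requires Python 3.7 or later.

```python
from collections import Counter


def is_plain_word(token):
    """True only for non-empty tokens made entirely of ASCII letters."""
    return token.isascii() and token.isalpha()


# A few of the counts quoted above.
counts = Counter(
    {",": 44348, ".": 19035, "I": 9928, "—": 6975, "my": 3714, "'s": 1154}
)

# Keep only the entries whose key passes the check.
words = Counter({tok: n for tok, n in counts.items() if is_plain_word(tok)})
print(words.most_common())  # [('I', 9928), ('my', 3714)]
```

This test is strict: it also discards clitics such as `'s` and any word containing a non-ASCII letter (for example `café`), so dropping the `isascii()` half may be preferable for text with accented words.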
Answer selected by otakutyrant
I lemmatized a book and counted the lemmas, but the most frequent entries are almost all symbols.
I know how to filter them: just use `isalpha()` and `isascii()`. But I wonder what the convention is in NLP, such as some internal API that I do not know about?