ML bot that detects toxicity in Russian texts.
Runs on TeleBot (Telegram Bot API).
Data: Train&Test Test
The bot combines three models that classify insults, threats and obscenities.
Each word in a sentence is represented as a vector using a word2vec model (Model 204 on nlppl.eu, trained on RNC, Wikipedia, News and ARM). The sentence vector is the average of its word vectors.
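The averaging step can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: `sentence_vector` is a hypothetical helper, the tiny 3-dimensional dictionary stands in for the real 300-dimensional word2vec model, and skipping out-of-vocabulary words (with a zero vector as fallback) is an assumption.

```python
import numpy as np

DIM = 300  # dimensionality of the word2vec vectors


def sentence_vector(tokens, word_vectors, dim=DIM):
    """Average the vectors of all tokens found in the vocabulary.

    `word_vectors` is any mapping that supports `in` and `[]`
    (a plain dict here; a loaded word2vec model is assumed to
    behave the same way).
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Assumption: a sentence with no known words maps to zeros.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


# Toy vocabulary with 3-dimensional vectors, for illustration only.
toy_vectors = {
    "привет": np.array([1.0, 2.0, 3.0]),
    "мир": np.array([3.0, 4.0, 5.0]),
}
vec = sentence_vector(["привет", "мир", "абвгд"], toy_vectors, dim=3)
```

Here the unknown token is ignored and `vec` is the element-wise mean of the two known vectors.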
Training data shape: N x 300, where N is the number of sentences. I train three CatBoostClassifier models, one each on the insult, threat and obscenity datasets.
This architecture works well for capturing the main topic of a sentence (because the mean word2vec vector captures semantics well), but it is not well suited to predicting the tone of a sentence. For that task it would be better to use a different way of vectorizing sentences and different models (neural networks such as an RNN or CNN rather than decision trees).
Maybe I'll come back to this task later with a better method.