A machine learning model that predicts tags for a Stack Overflow question, given its title and body.
Dataset Link: https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate
Stack Overflow is an open community for anyone that codes. They help you get answers to your toughest coding questions, share knowledge with your coworkers in private, and find your next dream job.
Their mission is to help developers write the script of the future. This means helping you find and hire skilled developers for your business and providing them the tools they need to share knowledge and work effectively.
Given the Title and the Body of a question, we have to predict the relevant tags so that the question gets recommended to the right domain expert, who can then answer it correctly.
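Because a question can carry several tags at once, one common way to frame this task is multi-label classification, where each question's tags become a 0/1 vector over the tag vocabulary. The sketch below illustrates that encoding with the standard library only. The sample tag lists and the helper name `to_multi_hot` are hypothetical, and this is not necessarily the encoding used in this project.

```python
# Hypothetical sample: each question is represented only by its list of tags.
question_tags = [
    ["python", "pandas"],
    ["java"],
    ["python", "django", "html"],
]

# Build a sorted vocabulary so every tag gets a stable column index.
vocabulary = sorted({tag for tags in question_tags for tag in tags})
tag_index = {tag: i for i, tag in enumerate(vocabulary)}


def to_multi_hot(tags, tag_index):
    """Return a 0/1 vector with a 1 in the column of every tag present."""
    vector = [0] * len(tag_index)
    for tag in tags:
        vector[tag_index[tag]] = 1
    return vector


# One target row per question; a classifier would predict each column.
targets = [to_multi_hot(tags, tag_index) for tags in question_tags]

print(vocabulary)
for row in targets:
    print(row)
```

Each column can then be treated as its own yes/no prediction, which is what allows a single question to receive more than one tag.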
- Predict as many tags as possible with very high precision and recall. Incorrect tags could impact the customer experience on Stack Overflow.
- No strict latency constraints. The model should be able to generate the relevant tags in a reasonable amount of time.
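To make the precision and recall objective concrete, here is a minimal sketch that computes micro-averaged precision and recall over predicted tag sets. Micro-averaging (pooling counts across all questions) is an assumption here, since the averaging scheme is not stated above, and the function name and sample data are hypothetical.

```python
def micro_precision_recall(true_tags, predicted_tags):
    """Pool true/false positives and false negatives over all questions."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and actually present
        fp += len(pred - truth)   # tags predicted but not present
        fn += len(truth - pred)   # tags present but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and predictions for three questions.
true_tags = [["python", "pandas"], ["java"], ["html", "css"]]
predicted_tags = [["python"], ["java", "spring"], ["html", "css"]]

precision, recall = micro_precision_recall(true_tags, predicted_tags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample, one wrong tag (`spring`) lowers precision and one missed tag (`pandas`) lowers recall, which mirrors the two ways a tag prediction can go wrong.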
train.csv = 48 MB
test.csv = 16 MB
The data consists of 6 columns.
- Id: Represents the ID of the question
- Title: Represents the title of the question
- Body: Represents the body of the question where the question is explained properly
- Tags: The tags relevant for the question asked
- CreationDate: The date at which the question was asked
- Type: Represents the quality rating of the question
The most important columns in the dataset are Title, Body, and Tags.
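As a rough illustration of how these three columns might be read, the sketch below parses a small in-memory CSV with the standard library. It assumes the Tags column stores tags in angle brackets, such as `<python><pandas>`; the sample rows, the `HQ`/`LQ_EDIT` values, and the helper name `parse_tags` are hypothetical.

```python
import csv
import io
import re

# Hypothetical rows using the six columns described above.
sample_csv = io.StringIO(
    "Id,Title,Body,Tags,CreationDate,Type\n"
    "1,How to merge frames?,<p>I have two frames</p>,<python><pandas>,"
    "2016-01-01 00:21:59,HQ\n"
    "2,Null pointer error,<p>My code crashes</p>,<java>,"
    "2016-01-01 02:03:20,LQ_EDIT\n"
)


def parse_tags(raw):
    """Split a string like '<python><pandas>' into ['python', 'pandas'].

    Assumes the angle-bracket tag format; adjust if the file differs.
    """
    return re.findall(r"<([^<>]+)>", raw)


records = []
for row in csv.DictReader(sample_csv):
    records.append(
        {
            # Title and Body together form the model input text.
            "text": row["Title"] + " " + row["Body"],
            # Tags are the prediction target.
            "tags": parse_tags(row["Tags"]),
        }
    )

for record in records:
    print(record["tags"], "<-", record["text"])
```

For the real files, the `io.StringIO` object would be replaced by an opened `train.csv` (for example `open("train.csv", newline="", encoding="utf-8")`).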
This is the count plot of the number of tags per question.
The key takeaway from the above plot is that most questions have 2 or 3 tags.
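The numbers behind a count plot like this can be computed by counting how many questions have each number of tags. The following is a minimal sketch using hypothetical tag lists, not the actual dataset.

```python
from collections import Counter

# Hypothetical tag lists, one per question.
question_tags = [
    ["python", "pandas"],
    ["java"],
    ["python", "django", "html"],
    ["javascript", "html"],
    ["c++", "pointers", "memory"],
]

# Map "number of tags" -> "number of questions with that many tags".
tags_per_question = Counter(len(tags) for tags in question_tags)

# Print the bars of the count plot in order of tag count.
for n_tags in sorted(tags_per_question):
    print(f"{n_tags} tag(s): {tags_per_question[n_tags]} question(s)")
```

Each printed line corresponds to one bar of the plot.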
This is the distribution of the number of times each tag appears in questions.
The key takeaway from the above plot is that a tag appears at most 5 times.
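A distribution like this can be derived in two steps: count how often each tag occurs, then count how many tags share each occurrence count. The sketch below shows both steps on hypothetical tag lists, so its numbers do not reflect the real dataset.

```python
from collections import Counter

# Hypothetical tag lists, one per question.
question_tags = [
    ["python", "pandas"],
    ["java"],
    ["python", "django", "html"],
    ["javascript", "html"],
]

# Step 1: how many questions each tag appears in.
tag_counts = Counter(tag for tags in question_tags for tag in tags)

# Step 2: how many tags appear exactly k times (the plotted distribution).
frequency_distribution = Counter(tag_counts.values())

print(tag_counts.most_common(2))
for times_used in sorted(frequency_distribution):
    print(f"{frequency_distribution[times_used]} tag(s) appear {times_used} time(s)")
```

The largest key in `frequency_distribution` is the maximum number of times any single tag appears.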
This is the word cloud generated from the tags and their counts.
More frequent tags appear larger in the word cloud, and less frequent tags appear smaller.