Project by Ranit Bhowmick and Sayanti Chatterjee
This project aims to predict the gender of individuals based on their app usage patterns. The project leverages data collected through a custom survey to train machine learning models. By analyzing various app usage statistics, such as time spent on different categories of apps, we aim to determine the user's gender with high accuracy.
The project encompasses data collection, preprocessing, and model training, utilizing techniques such as one-hot encoding, outlier detection, and machine learning algorithms like Decision Trees and Random Forest.
To gather data for this project, we conducted a survey titled "What's Your App Usage?". The survey was designed to capture detailed information about participants' app usage across various categories. The data collected is crucial for training our machine learning models, and we appreciate the participation of everyone who took the time to contribute.
The survey was structured to be clean and organized, with each question focusing on specific aspects of app usage. Participants could fill out the form anonymously, and the sections included:
- Basic demographic information
- App usage duration across various categories (e.g., Social, Gaming, Banking)
- Time of app usage during the day
You can view and participate in the survey here.
The dataset used for this project was constructed from the survey responses. The raw data includes columns such as Transportation Usage, Social Usage, Meet Usage, and more, all of which represent the time spent on various app categories.
The main features in the dataset are as follows:
- App Usage Duration: Time spent on apps in different categories, in HH:MM:SS format.
- App Usage Time: Time of day when apps in different categories were used, in HH:MM format.
- Demographic Information: Gender, Employment Status, Field of Work, and Date of Birth.
The dataset contained some missing values in the app usage duration columns. These missing values were filled with '00:00:00', indicating no usage.
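As a minimal sketch of this fill step, the snippet below uses pandas on a small made-up frame. The sample values and the rule for picking duration columns (names ending in " Usage") are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the survey export.
df = pd.DataFrame({
    "Social Usage": ["01:30:00", None, "00:45:10"],
    "Gaming Usage": [None, "02:00:00", "00:10:00"],
})

# Assumption: every duration column has a name ending in " Usage".
duration_cols = [col for col in df.columns if col.endswith(" Usage")]

# Replace missing durations with '00:00:00', meaning "no usage".
df[duration_cols] = df[duration_cols].fillna("00:00:00")

print(df)
```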
We implemented a custom outlier detection and correction function to handle invalid time entries. For instance, any time value with hours exceeding 23 was corrected to '00:00:00'.
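The project's own correction function is not shown here, so the following is only a sketch of how such a check might look. The name `correct_time` is hypothetical, and the extra checks on minutes, seconds, and unparseable strings are assumptions that go beyond the hours rule described above.

```python
def correct_time(value, default="00:00:00"):
    """Return value if it is a valid HH:MM:SS string, otherwise default."""
    try:
        hours, minutes, seconds = (int(part) for part in str(value).split(":"))
    except ValueError:
        # Not three integer fields (wrong shape or non-numeric text).
        return default
    # Rule from the text: hours above 23 are invalid.
    # Assumption: out-of-range minutes or seconds are treated the same way.
    if 0 <= hours <= 23 and 0 <= minutes <= 59 and 0 <= seconds <= 59:
        return value
    return default


print(correct_time("12:30:00"))  # 12:30:00
print(correct_time("25:00:00"))  # 00:00:00
```

With pandas, a helper like this could be applied per column, for example `df[col] = df[col].map(correct_time)`.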
To make the time data suitable for machine learning models, we converted the HH:MM:SS and HH:MM formats into total seconds. This conversion allows the models to process and analyze the time data effectively.
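The conversion can be sketched with a small helper. The function name `to_seconds` is hypothetical, and treating an HH:MM value as seconds since midnight is an assumption about how time-of-day entries were handled.

```python
def to_seconds(value):
    """Convert an 'HH:MM:SS' or 'HH:MM' string to a total number of seconds."""
    parts = [int(part) for part in value.split(":")]
    if len(parts) == 2:
        # HH:MM has no seconds field, so treat it as zero seconds.
        parts.append(0)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


print(to_seconds("01:30:15"))  # 5415
print(to_seconds("14:05"))     # 50700
```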
The Date of Birth was converted into the number of days since birth, which provides a numerical representation of the participant's age.
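A minimal sketch of this step is shown below. It assumes the date of birth is stored as an ISO `YYYY-MM-DD` string, which may not match the survey export, and the helper name `days_since_birth` is hypothetical.

```python
from datetime import date


def days_since_birth(dob, reference=None):
    """Return the number of days from a date of birth to a reference date.

    Assumption: dob is an ISO 'YYYY-MM-DD' string; a different survey
    format would need a different parser.
    """
    reference = reference or date.today()
    return (reference - date.fromisoformat(dob)).days


# A fixed reference date keeps the example reproducible.
print(days_since_birth("2000-01-01", reference=date(2000, 1, 31)))  # 30
```

Passing a fixed reference date (for instance, the day the survey closed) keeps the feature stable between runs, whereas `date.today()` changes the values every day.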
Categorical variables such as Gender, Employment Status, and Field were converted into numerical values using one-hot encoding. This step is crucial for feeding the data into machine learning algorithms.
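One common way to do this is `pandas.get_dummies`. Whether the project used this function is an assumption, and the sample rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows using the column names mentioned in the text.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Employment Status": ["Student", "Employed", "Student"],
    "Field": ["IT", "Finance", "IT"],
    "Social Usage": [5400, 0, 2710],  # already converted to seconds
})

# Each category becomes its own indicator column, e.g. "Field_IT".
encoded = pd.get_dummies(df, columns=["Gender", "Employment Status", "Field"])

print(encoded.columns.tolist())
```

Since Gender is the prediction target, one of its indicator columns would serve as the label, and all Gender columns would need to be kept out of the feature matrix.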
We initially employed a Decision Tree Classifier to predict gender based on app usage patterns. The model was trained using the preprocessed dataset, and it achieved an accuracy of around 82% on the test data.
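from sklearn import tree
# x_train, y_train, x_test, y_test come from splitting the preprocessed dataset
model = tree.DecisionTreeClassifier()
model.fit(x_train, y_train)
# score() returns the mean accuracy on the held-out test data
accuracy = model.score(x_test, y_test)
print(f"Decision Tree Accuracy: {accuracy}")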
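The snippet above relies on `x_train`, `x_test`, `y_train`, and `y_test`, which it does not define. One way such a split might be produced is with scikit-learn's `train_test_split`. The synthetic data, the 80/20 ratio, and the stratification below are all assumptions, not the project's documented settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and gender labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 86400, size=(100, 4))  # usage values in seconds
y = rng.integers(0, 2, size=100)           # binary label

# Assumption: hold out 20% for testing and keep class proportions similar.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
```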
To try to improve on this, we used a Random Forest Classifier with 15 estimators. It achieved accuracy similar to the Decision Tree's but with improved stability.
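from sklearn.ensemble import RandomForestClassifier
# An ensemble of 15 trees, trained on the same split as the Decision Tree
model2 = RandomForestClassifier(n_estimators=15)
model2.fit(x_train, y_train)
# score() returns the mean accuracy on the held-out test data
accuracy_rf = model2.score(x_test, y_test)
print(f"Random Forest Accuracy: {accuracy_rf}")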
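A single train/test split gives one accuracy figure per model, which says little about stability. As a sketch of how the two models could be compared more systematically, the example below runs 5-fold cross-validation on synthetic data from `make_classification`. The data and the fold count are assumptions, so the printed numbers say nothing about the survey dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed survey features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=15, random_state=0),
}

# One accuracy score per fold; the spread across folds hints at stability.
scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A lower standard deviation across folds would be one way to quantify "more consistent results".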
- The Random Forest model reached about 82% accuracy on the test data, which is promising given the limited dataset size.
- The Decision Tree model also performed well, but Random Forest's ensemble approach provided more consistent results.
- The accuracy can be further improved with a larger and more diverse dataset, as well as by fine-tuning the hyperparameters of the models.
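Hyperparameter tuning of the kind mentioned in the last point could be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the fold count, and the synthetic data are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed survey features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumption: a deliberately small grid around the 15-estimator baseline.
param_grid = {
    "n_estimators": [15, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

With a small dataset, a modest grid and cross-validation help avoid tuning to the quirks of a single test split.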
Make sure you have Python 3.x installed along with the required libraries:
pip install pandas numpy scikit-learn
git clone https://github.com/Kawai-Senpai/Info-Through-App-Usage.git
cd Info-Through-App-Usage
Ensure that you have the dataset (app_usage.csv) in the same directory and run the Jupyter notebook or the Python script provided.
After running the models, you can view the accuracy scores and model performance metrics. The models can be further fine-tuned to improve predictions.
- Expand the Dataset: Collect more survey responses to enhance the training dataset's size and diversity.
- Feature Engineering: Explore additional features that might improve model accuracy, such as app usage frequency or session length.
- Model Optimization: Experiment with other machine learning models, such as Gradient Boosting or Support Vector Machines, and fine-tune hyperparameters.
- Deployment: Consider deploying the model as a web service or integrating it into an app to provide real-time gender predictions based on app usage.
We thank all participants who took the time to complete our survey. Your contributions have been invaluable to this project. Special thanks to Sayanti Chatterjee for her collaboration and support.
For more information, feel free to reach out: