
🚖Project: Uber Fare Rate Prediction in New York City using Regression Models


👉 Google Colab Code: https://colab.research.google.com/drive/1H3pNjBhPNxVNt37EQMbkHRW_-k2zOkN7?usp=sharing

👉 GitHub Code: https://github.com/Shibli-Nomani/project--Uber-Fare-Prediction-in-New-York-City-/blob/main/project_Uber_Fare_in_New_York_City_Dataset.ipynb

👉 Kaggle Code: https://www.kaggle.com/code/shiblinomani/uber-fare-rate-prediction-in-new-york-city

🐍 About Dataset:

This project focuses on Uber Inc., the world's largest ride-hailing company, and aims to predict fares for future trips. Uber serves millions of customers daily, so managing its data well is essential for developing new business ideas and achieving the best results. In particular, estimating fare prices accurately becomes very important. The dataset contains 200,000 user trips in New York City, USA.

📌 dataset link: https://www.kaggle.com/datasets/shiblinomani/uber-fare-newyorkcity

🤖 Machine Learning:

A branch of AI where systems learn from data to make decisions or predictions without explicit programming.

📚 Supervised Machine Learning:

Training a model on labeled data, where inputs are paired with corresponding outputs, to make predictions or classifications.

🔋 Regression:

Predicting continuous outcomes, like predicting house prices based on features such as size and location.

📊 Example: Predicting stock prices based on historical data.
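A regression model in this sense can be sketched in a few lines. The sketch below uses synthetic data (the 2.5 slope and 3.0 intercept are made-up values standing in for a fare-per-distance relationship), not the project's actual dataset:

```python
# Minimal regression sketch: fit a line to synthetic data and predict.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                      # one feature, e.g. trip distance
y = 2.5 * X.ravel() + 3.0 + rng.normal(0, 0.5, size=100)   # fare = 2.5*dist + 3 + noise

model = LinearRegression().fit(X, y)
print(model.predict([[4.0]]))                              # predicted fare for a 4-unit trip
```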

🎯 Classification:

Assigning categories or labels to inputs based on their features.

🔍 Example: Classifying emails as spam or non-spam based on their content and features.

♎ Python Libraries Used in this Project

📊 Pandas: Data manipulation and analysis library.

➕ NumPy: Mathematical computing library for arrays and matrices.

📈 Matplotlib and Seaborn: Visualization libraries for creating static plots.

🗺️ Geopandas and Shapely: Libraries for working with geospatial data and geometries.

📊 Plotly: Interactive visualization library.

📍 Geopy: Library for calculating distances based on latitude and longitude.
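Geopy's distance helpers compute great-circle (geodesic) distances from coordinate pairs. As a dependency-free sketch of the same idea, the haversine formula below approximates what those helpers return; the NYC coordinates are illustrative, not taken from the dataset:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two (lat, lon) points.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Times Square -> JFK Airport (illustrative pickup/dropoff), roughly 22 km
print(round(haversine_km(40.7580, -73.9855, 40.6413, -73.7781), 1))
```

In the project itself this per-trip distance, derived from the pickup and dropoff coordinates, is the kind of feature a fare model can learn from.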

📅 DateTime Conversion:

datetime: Library for handling dates and times.
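A typical use of date/time handling here is parsing pickup timestamps and deriving features such as hour of day and weekday. The sketch below uses pandas with a hypothetical `pickup_datetime` column and made-up timestamps:

```python
# Sketch: parse timestamps, then extract hour and weekday as model features.
import pandas as pd

df = pd.DataFrame({"pickup_datetime": ["2015-05-07 19:52:06", "2015-07-17 09:15:00"]})
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["hour"] = df["pickup_datetime"].dt.hour
df["weekday"] = df["pickup_datetime"].dt.dayofweek  # Monday = 0
print(df[["hour", "weekday"]].values.tolist())
```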

🔢 Data Preprocessing:

train_test_split, StandardScaler, SMOTE: Tools for splitting data into train/test sets, scaling features, and resampling.
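The split-then-scale flow can be sketched as follows on a toy feature matrix; the key point is that the scaler is fitted on the training data only and reused on the test data, so test statistics never leak into training:

```python
# Sketch: train/test split followed by standardization (fit on train only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # toy feature matrix
y = X.sum(axis=1)                               # toy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)          # learn mean/std from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)             # reuse the same statistics on test data
print(X_train_s.mean(axis=0).round(6))          # ~0 after standardization
```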

📈 Regression Models:

LinearRegression, Lasso, Ridge, KNeighborsRegressor, XGBRegressor, RandomForestRegressor: Various regression algorithms for modeling relationships between variables.

📈 LinearRegression: Fits a straight line to the data, suitable for linear relationships.

🔍 Lasso: Performs feature selection by penalizing coefficients to zero, helpful for reducing overfitting and selecting important features.

🏞️ Ridge: Reduces model complexity and multicollinearity by adding L2 regularization term, preventing overfitting.

🤝 KNeighborsRegressor: Predicts based on the average of the 'k' nearest neighbors, robust for non-linear relationships.

🌳 XGBRegressor: Implements gradient boosting, an ensemble technique that builds trees sequentially to correct earlier errors, enhancing prediction accuracy.

🌲 RandomForestRegressor: Constructs multiple decision trees and averages predictions, robust against overfitting and noise.
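Since scikit-learn gives all of these regressors the same fit/score interface, they can be compared in a loop. The sketch below does this on synthetic data (XGBRegressor is omitted to keep the sketch to scikit-learn only); the scores shown by `print` are for the toy data, not the project's results:

```python
# Sketch: fit several of the listed regressors and compare test-set R² scores.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.3, 300)   # mildly non-linear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, m in models.items():
    print(name, round(m.fit(X_tr, y_tr).score(X_te, y_te), 3))  # R² on held-out data
```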

❌ Error Metrics:

mean_absolute_error, r2_score, explained_variance_score: Metrics for evaluating model performance.
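All three metrics compare predicted values against actual values; a minimal sketch on made-up numbers:

```python
# Sketch: evaluate toy predictions with the three metrics used in the project.
from sklearn.metrics import mean_absolute_error, r2_score, explained_variance_score

y_true = [3.0, 5.0, 7.5, 10.0]   # actual fares (made-up)
y_pred = [2.8, 5.3, 7.0, 10.4]   # model predictions (made-up)

print(mean_absolute_error(y_true, y_pred))              # average |error|; lower is better
print(round(r2_score(y_true, y_pred), 3))               # closer to 1 is better
print(round(explained_variance_score(y_true, y_pred), 3))
```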

🔍 Hyperparameter Tuning:

GridSearchCV: Tool for finding the best parameters through exhaustive search.
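For the KNN model that the summary highlights, a grid search might look like this; the parameter grid and synthetic data are illustrative, not the project's actual search space:

```python
# Sketch: exhaustive search over KNN hyperparameters with 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)                    # tries every combination, keeps the best by CV score
print(grid.best_params_)
```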

🔢 Data Evaluation:

StratifiedKFold: Cross-validation splitter that preserves class proportions in each fold; it is designed for classification, so for a continuous regression target, plain KFold is the usual choice.

💾 Model Saving:

joblib: Library for saving and loading models.
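Saving and reloading a fitted model with joblib can be sketched as follows; the file name `fare_model.joblib` is illustrative:

```python
# Sketch: persist a fitted model to disk, then reload it for later predictions.
import joblib
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])
joblib.dump(model, "fare_model.joblib")          # serialize the fitted model

restored = joblib.load("fare_model.joblib")      # later: load without retraining
print(round(restored.predict([[3.0]])[0], 1))    # same predictions as the original
```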

⚠️ Warnings:

warnings: Library for managing warnings, suppressing them in this case.

✨ Model Evaluation:

Explained Variance Score: 📊

Measures the proportion of variance in the target variable that is explained by the model. Good value: Closer to 1, indicating a better fit.

Mean Absolute Error (MAE): 🔍

Average of absolute differences between predicted and actual values, representing model accuracy. Good value: Lower, with 0 being perfect accuracy.

R-squared (R2): 📈

Represents the proportion of variance in the dependent variable that is explained by the independent variables. Good value: Closer to 1, indicating a better fit of the model to the data.

⭐ Summary

Our KNN (K-Nearest Neighbors) model shows the best performance for Uber fare prediction compared to the other models. Through meticulous parameter tuning with GridSearchCV, we optimized the model to deliver more accurate fare estimates. Moving forward, future enhancements could include exploring additional features, trying different models, further fine-tuning, training on more powerful infrastructure, integrating real-time data, and improving the experience with a user-friendly interface. 🚖🔍🚀

About

Apply Different Regression Models to Predict the Fare of Uber in New York
