This project examines which features most influence house sale prices in King County, WA. Using a Kaggle dataset of houses in the area, the goal is to build a web application that helps real estate developers, individual buyers, and financial institutions make informed decisions. The app provides insights into the best areas in King County for developing new apartment buildings or purchasing property.
Housing is a significant economic indicator with a substantial impact on a country's GDP (Gross Domestic Product). The housing sector has a strong multiplier effect: increased housing activity generates demand in other sectors such as manufacturing, retail, and professional services, further boosting GDP. The housing market is also often seen as a barometer of economic health. A thriving housing market generally reflects a healthy economy, while a slowdown can indicate broader economic challenges.
This repository contains a machine learning and data visualization project on King County housing data. Its primary objective is to predict housing prices from features such as square footage, number of bedrooms and bathrooms, and house condition. The process includes exploratory data analysis (EDA) and statistical learning models built with a range of supervised and unsupervised techniques.
Features in the Dataset:
- price: Price of each house.
- sqrt_price: Square root of the price.
- price2: Dichotomous price variable (0 for houses <$600,000, 1 for houses >=$600,000).
- sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_living15, sqft_lot15: Square footage features.
- bedrooms, bathrooms, floors, waterfront, condition, grade: House features.
- yr_built, zipcode, lat, long: Location and year built features.
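
The two derived outcome variables can be reproduced from `price` in a few lines of base R. This is a minimal sketch, not the repository's own code: the toy `houses` data frame and its prices are made up for illustration, and treating a price of exactly $600,000 as class 1 is an assumption about how the cutoff is applied.

```r
# Toy data frame standing in for the Kaggle data (values are illustrative only).
houses <- data.frame(price = c(250000, 600000, 810000, 1440000))

# Square-root transform of price, used as the regression outcome.
houses$sqrt_price <- sqrt(houses$price)

# Dichotomous price: 0 below $600,000, 1 otherwise (assumed cutoff handling).
houses$price2 <- as.integer(houses$price >= 600000)

print(houses)
```

The square root compresses the long right tail of prices, and `price2` gives the classification scripts a binary outcome to predict.
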
- EDA and Visualizations (1 - EDA and visualizations.R):
Map of houses in King County. Summary statistics and plots of variables. Correlations among variables. Relationship between sqrt_price (outcome) and predictors.
- Linear Regression (2 - linear regression.R):
Simple and multiple linear regression. Polynomial regression with cross-validation. Plotting polynomial regression fits.
- Subset Selection (3 - subset selection.R):
Best subset selection. Forward and backward stepwise selection. Model selection using validation set and cross-validation.
- Ridge Regression and Lasso (4 - ridge regression and lasso.R):
Ridge regression with and without cross-validation. Lasso regression with and without cross-validation.
- PCR and PLS (5 - PCR & PLS.R):
Principal components regression (PCR) with cross-validation. Partial least squares (PLS) with cross-validation.
- Logistic Regression (6 - logistic regression.R):
Logistic regression and polynomial logistic regression with cross-validation. Plotting polynomial logistic regression fits.
- LDA, QDA, and KNN (7 - LDA, QDA, and KNN.R):
Linear discriminant analysis (LDA). Quadratic discriminant analysis (QDA). K-nearest neighbors (KNN) classification.
- Decision Trees (8 - decision trees.R):
Classification and regression trees. Bagging, random forests, and boosting.
- Support Vector Machines (9 - support vector machines.R):
Support vector classifiers and support vector machines for classification.
- Unsupervised Learning (10 - unsupervised learning.R):
Principal components analysis (PCA). K-means and hierarchical clustering.
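
Several of the scripts listed above choose a model by cross-validation. As a rough illustration of the idea, here is a sketch of 5-fold cross-validation over polynomial degree in base R. It is not taken from the repository's scripts: the data are simulated rather than read from the Kaggle file, and the names `houses`, `fold`, and `cv_mse`, the choice of 5 folds, and the use of `sqft_living` as the single predictor are all assumptions made for the example.

```r
set.seed(1)

# Simulated stand-in for the housing data (not the real Kaggle values).
n <- 200
sqft_living <- runif(n, 500, 5000)
sqrt_price  <- 300 + 0.15 * sqft_living + rnorm(n, sd = 60)
houses <- data.frame(sqft_living, sqrt_price)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Cross-validated mean squared error for a polynomial of the given degree.
cv_mse <- function(degree) {
  fold_mse <- sapply(1:k, function(i) {
    train <- houses[fold != i, ]
    test  <- houses[fold == i, ]
    fit   <- lm(sqrt_price ~ poly(sqft_living, degree), data = train)
    mean((test$sqrt_price - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

# Compare degrees 1 to 4 and keep the one with the lowest CV error.
degrees     <- 1:4
cv_errors   <- sapply(degrees, cv_mse)
best_degree <- degrees[which.min(cv_errors)]

print(data.frame(degree = degrees, cv_mse = cv_errors))
```

Because the simulated relationship is linear, higher degrees should not lower the cross-validated error by much here. On the real data the preferred degree may differ.
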
- A concise report is available in HTML format: king_county_markdown.html.
- The same content is also provided in PDF format: king_county_markdown_report.pdf.
- Sample plots are included in the repository: maps, correlation plots, distributions, and relationships between variables.
- Credit to "An Introduction to Statistical Learning" and "R Graphics Cookbook" for guidance.
- The scripts apply a variety of supervised and unsupervised learning methods to regression and classification problems.
- The classification and unsupervised learning methods are included to demonstrate their potential applications.
The King County Housing Price Prediction Web App project examines housing prices in the area and offers insights for developers, buyers, and financial institutions. Through exploratory data visualization and a range of machine learning models, it aims to give stakeholders the information they need to make informed decisions about property development and investment in King County. Combining supervised and unsupervised learning techniques gives a fuller understanding of the dataset and a foundation for the predictive model within the web application.