ML modeling for house price prediction in Belarus.
- Collecting the data from a web source via scraping
- Data processing: handling duplicates, NaN values, feature types and outliers; processing numeric & categorical features
- Generating new features
- EDA
- Modeling, feature selection, hyperparameter optimization
- Model inference
The primary dataset was parsed from the website realt.by. It contains ads for houses, cottages & country houses in Belarus as of April 2024: more than 16,000 rows with 52 columns initially.
Below is a visualization of the objects' distribution on the map.
The evaluation metrics for the project were R² & MAE.
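For reference, both metrics are available in scikit-learn; a minimal sketch (the values here are illustrative, not project results):

```python
from sklearn.metrics import r2_score, mean_absolute_error

# y_true: actual prices, y_pred: model predictions (illustrative values)
y_true = [50_000, 120_000, 80_000]
y_pred = [55_000, 110_000, 85_000]

print("R2 :", r2_score(y_true, y_pred))             # closer to 1 is better
print("MAE:", mean_absolute_error(y_true, y_pred))  # same units as the price
```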
The project goal required a full-stack approach: from parsing/collecting the data, through data processing and modeling, to model inference, covering both the back end and the front end.
At the 1st iteration of the project the main goal was to build a model predicting prices for country houses only. Approximately 3,000+ ads were parsed from realt.by. Given the relatively small size of the dataset, it seemed reasonable to use the R² score to evaluate model quality. Once an R² score close to 0.7 was obtained, I moved to the 2nd iteration with a reformulated goal: building a model predicting prices for country houses, houses & cottages. This goal required more data for analysis, so at the 2nd iteration more than 13,000 additional ads were parsed from the same web source, yielding a dataset with the shape (16187, 52).
Parsing (parsing.ipynb)
The data was parsed from the realt.by website using the Selenium library & the Chrome WebDriver engine. At the first stage, the site feed containing brief information about the objects was scanned; the main goal here was collecting ad URLs containing a unique identifier (ID) (data/pars_res{int}.txt). At the second stage, the saved URLs were iterated over, and detailed information about each object was collected and saved in CSV format (data/country_houses.csv & data/country_houses_2.csv).
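A minimal sketch of this two-stage approach with Selenium (the feed URL and CSS selectors below are assumptions for illustration, not the exact ones used in parsing.ipynb):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires Chrome and a matching ChromeDriver

# Stage 1: scan a feed page and collect ad URLs (selector is a placeholder)
driver.get("https://realt.by/sale/cottages/")  # hypothetical feed URL
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "a.ad-link")]

# Stage 2: visit each saved URL and extract the object's details
rows = []
for url in links:
    driver.get(url)
    # field selectors are placeholders for the real page structure
    price = driver.find_element(By.CSS_SELECTOR, ".price").text
    rows.append({"url": url, "price": price})

driver.quit()
```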
Data processing (main.ipynb)
All parsed data was initially saved as type 'object', so at this step the numeric data had to be converted to int or float. The main part of the collected data was categorical: some binary, the rest multi-option. Because some options of the multi-categorical features were similar, such options were merged after assessing the statistical significance of the differences between them. Both numeric and categorical features required handling NaN values, which was managed by filling with the median/mode or with a string value, depending on the feature type. Because the target feature (price) is not normally distributed, outliers were handled using the +/- 1.5 IQR (interquartile range) method.
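A condensed sketch of these steps with pandas (the column names "area", "heating" and "price" are illustrative):

```python
import pandas as pd

df = pd.read_csv("data/country_houses.csv")

# Convert numeric columns stored as 'object' to real numbers
df["area"] = pd.to_numeric(df["area"], errors="coerce")

# Fill NaNs: median for numeric features, mode for categorical ones
df["area"] = df["area"].fillna(df["area"].median())
df["heating"] = df["heating"].fillna(df["heating"].mode()[0])

# Drop price outliers with the +/- 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```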
Generating new features (main.ipynb)
Some additional features were generated and added to the dataset. They are mainly based on the house coordinates and include distances to the district city, the regional city and the Belarusian capital.
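One way to derive such distances from coordinates is the haversine formula; a sketch (Minsk's coordinates are real, the column names are assumptions):

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

MINSK = (53.9006, 27.5590)  # the capital's coordinates

# Illustrative one-row frame (a house near Grodno)
df = pd.DataFrame({"latitude": [53.6884], "longitude": [23.8258]})
df["dist_to_capital_km"] = haversine_km(df["latitude"], df["longitude"], *MINSK)
```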
EDA (main.ipynb)
EDA helped to identify the main dependencies between the features and the target variable, price.
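Dependencies of this kind can be screened, for example, with a simple correlation ranking (a sketch over the processed frame; the "price" column name is an assumption):

```python
import pandas as pd

df = pd.read_csv("data/country_houses.csv")  # or the processed frame from main.ipynb

# Rank numeric features by absolute correlation with the target
corr = df.corr(numeric_only=True)["price"].drop("price")
print(corr.abs().sort_values(ascending=False).head(10))
```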
Modeling (main.ipynb)
LinearRegression was used as a baseline model for comparison. GradientBoosting turned out to be one of the most effective models in terms of the metrics. The 40 most useful features were selected using SelectKBest.
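A minimal sketch of this setup as a scikit-learn pipeline (the score function, hyperparameters and synthetic data are assumptions, not the exact choices in main.ipynb):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real preprocessed matrix (52 columns, like the dataset)
X, y = make_regression(n_samples=200, n_features=52, noise=10, random_state=0)

model = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=40)),  # keep the 40 best features
    ("gbr", GradientBoostingRegressor(random_state=42)),
])
model.fit(X, y)
```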
Model inference (django folder)
The GradientBoosting model fitted on the selected features was serialized and served as the back end of a web application built with the Django framework. The web application is deployed at https://housespriceprediction-production.up.railway.app/, where it is currently available for testing.
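In outline, the inference endpoint can look like this (a minimal sketch; the model path, input format and view wiring are assumptions, not the exact code in the django folder):

```python
import pickle

import pandas as pd
from django.http import JsonResponse

# Load the serialized model once at import time (path is illustrative)
with open("model/gb_model.pkl", "rb") as f:
    MODEL = pickle.load(f)

def predict(request):
    """Return a price prediction for features passed as GET parameters."""
    features = pd.DataFrame([request.GET.dict()])  # illustrative input format
    price = float(MODEL.predict(features)[0])
    return JsonResponse({"predicted_price": price})
```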
- For dependencies, see requirements.txt