This is the codebase for a web crawler I built to extract residential rental listings in Singapore.
The crawler currently extracts the following fields for each listing (a sketch of the corresponding Scrapy Item fields follows the list):
- Property Name
- Property Type
- Address
- Number of Beds
- Number of Bathrooms
- Rental
- Rental per Square Foot
- Completion Year
- Total Units in the Project
- Building Tenure (Freehold or Leasehold)
- Amenities
- Key Details (rules set by the landlord, etc.)
- Nearest MRT station and distance
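The class below is a minimal sketch of what the Scrapy Item in sgpropbot/items.py might look like for these fields. The class and field names are illustrative assumptions, not the project's actual definitions.

```python
import scrapy


class PropertyItem(scrapy.Item):
    # Hypothetical field names -- check sgpropbot/items.py for the real ones.
    property_name = scrapy.Field()
    property_type = scrapy.Field()
    address = scrapy.Field()
    beds = scrapy.Field()
    bathrooms = scrapy.Field()
    rental = scrapy.Field()
    rental_psf = scrapy.Field()
    completion_year = scrapy.Field()
    total_units = scrapy.Field()
    tenure = scrapy.Field()
    amenities = scrapy.Field()
    key_details = scrapy.Field()
    nearest_mrt = scrapy.Field()
    mrt_distance = scrapy.Field()
```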
Prerequisites:
- Basic understanding of HTML structure
- Python
- Basic understanding of shell commands
This scraper was built with the Scrapy framework on Python 3.7. You need to install Scrapy in order to run it.
Pip:
pip install Scrapy
Conda:
conda install -c conda-forge scrapy
Otherwise, you can download the requirements.txt file and use one of the following commands to install the listed packages (the second falls back to pip when a package is not available through conda):
$ while read requirement; do conda install --yes $requirement; done < requirements.txt
$ while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt
The project is structured as follows:
.
├── properties_data_2019-07-12T16-19-53.json
├── README.md
├── scrapy.cfg
└── sgpropbot
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── __pycache__
├── settings.py
└── spiders
├── __init__.py
├── properties.py
├── __pycache__
├── run.log
└── scrapy_shell_test_linkextractor.txt
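As context for the layout above, pipelines.py is where scraped items can be post-processed before export. The snippet below is only a hedged illustration of such a pipeline, not the project's actual code; it assumes the item has a rental field holding a raw price string.

```python
import re


class CleanRentalPipeline:
    """Hypothetical pipeline: normalise the raw rental price into an integer."""

    def process_item(self, item, spider):
        rental = item.get("rental")
        if rental:
            # Keep only the digits, e.g. "S$ 4,500 /mo" -> 4500
            digits = re.sub(r"[^\d]", "", str(rental))
            item["rental"] = int(digits) if digits else None
        return item

# To activate a pipeline, register it in settings.py, e.g.:
# ITEM_PIPELINES = {"sgpropbot.pipelines.CleanRentalPipeline": 300}
```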
To run the crawler, navigate to the sgpropbot/spiders folder and run one of the following commands:
scrapy crawl properties
or
scrapy runspider properties.py
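Alternatively, the spider can be launched from a Python script using Scrapy's CrawlerProcess. The snippet below is a minimal sketch and assumes it is run from the project root so that the project settings and the properties spider can be found.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the sgpropbot project settings (run this from the project root,
# i.e. the folder that contains scrapy.cfg).
process = CrawlerProcess(get_project_settings())

# "properties" is the spider name used by `scrapy crawl properties`.
process.crawl("properties")
process.start()  # blocks until the crawl has finished
```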
The output is written to a JSON file by default; this can easily be changed by amending settings.py. A sample of the output, properties_data_2019-07-12T16-19-53.json, is included in this repository.
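As an illustration, a Scrapy 1.x-style feed export configuration in settings.py could look like the lines below (for example, to switch the output to CSV). Treat this as a hedged sketch; the setting names and values in the project's settings.py may differ.

```python
# settings.py (sketch) -- Scrapy 1.x feed export options.
FEED_FORMAT = "csv"                        # e.g. "json", "jsonlines", "csv", "xml"
FEED_URI = "properties_data_%(time)s.csv"  # %(time)s is expanded by Scrapy at run time
```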
In follow-up projects, I will use the scraped data to:
- Provide a walkthrough of how to export the data into a database such as MySQL or PostgreSQL
- Conduct an exploratory data analysis
- Build an interactive dashboard
- Build a rental prediction model
The most essential part of building a good scraper is a solid understanding of the website's layout, so that you can extract the right items. The most effective way to do this is through selectors (CSS or XPath), and for extracting specific pieces of text you also have to be familiar with regular expressions.
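As a quick illustration, the snippet below combines a CSS selector, an XPath selector, and a regular expression using Scrapy's Selector API. The HTML is made up for the example and is not taken from the actual target site.

```python
from scrapy.selector import Selector

html = """
<div class="listing">
  <h2 class="name">Example Towers</h2>
  <span class="price">S$ 4,500 /mo</span>
</div>
"""

sel = Selector(text=html)

# CSS selector: extract the property name.
name = sel.css("h2.name::text").get()

# XPath selector: the same element with different syntax.
name_xpath = sel.xpath('//h2[@class="name"]/text()').get()

# Regex on top of a selector: pull just the number out of the price string.
price = sel.css("span.price::text").re_first(r"[\d,]+")

print(name, name_xpath, price)  # Example Towers Example Towers 4,500
```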