This part is extracted from the project "Analysis on Apartment rental prices between Switzerland and USA Markets", realized as part of my master's studies at HSLU.
In this article, I explain how you can:
- Get the rental data by web scraping and convert it into a Pandas dataframe
- Analyze the rental data (for the rental data analysis, see this notebook)
- Use the Geopandas library to convert maps in shapefile format into Geopandas dataframes and GeoJSON maps
- Make interactive Choropleth maps embedded with rental data, which look like this:
If available, it is easier (and recommended) to use an API, similar to the Twitter or Reddit APIs. However, most websites don't have public APIs, or if they do, they provide very limited public access. In such cases, web scraping might be necessary.
This was the case for me when I wanted to study rental data in Geneva. So, I decided to develop my own web crawler to get the data for analysis (not for any business purposes) from Homegate.ch. Below I explain this web crawler. You can find the corresponding Jupyter notebook here.
After going to the Homegate.ch webpage, we select rent, type the city name into the search box, and select the language (English) from the top right:
The information that I want to extract is the price, size, number of rooms, and the address:
- The first step is to import the necessary packages, including the BeautifulSoup and requests modules for scraping:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
- We define a page number here for looping over the pagination:
cur_page = 1
- After carefully inspecting the website, I realized that apartment rental advertisements come in two categories: premium ones (paid subscription) and simple ones (without subscription). We therefore create two empty lists, then a function for changing the page number, and a while True loop that runs until it breaks:
premium = []
simple = []
def getLink(page):
    return f"https://www.homegate.ch/rent/apartment/city-geneva/matching-list?ep={page}"

while True:
    print("Page ->", cur_page)
    link = getLink(cur_page)
    res = requests.get(link)
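    # Added note (not part of the original crawler): some sites reject requests
    # that lack a browser-like User-Agent header; if res.text comes back empty,
    # you could try, e.g.:
    # res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})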
- We parse the HTML with Beautiful Soup (I strongly suggest taking a look at the Python documentation of BeautifulSoup):
    bs = BeautifulSoup(res.text, features='html.parser')
- We define two variables, a for premium and b for simple announcements, and with find_all() we return all div containers with the mentioned class names that match our filters:
    a = bs.find_all('div', {'class': 'ListItemTopPremium_item_K9dLF'})
    b = bs.find_all('div', {'class': 'ListItem_item_1GcIZ'})
- If we get zero results, we break the while loop defined earlier (we have 21 pages of results); otherwise, we loop over the findings in a and b and append them to the corresponding empty lists created earlier:
    if len(a) == 0 and len(b) == 0:
        break
    for offer in a:
        premium.append(offer)
    for offer in b:
        simple.append(offer)
- Finally, we print the running totals and increment the page number by one:
    print(len(premium), len(simple))
    cur_page += 1
- Next, we define a function that takes a listing block and returns a result dictionary with the keys price, size, rooms, and address. For each of them, we use a try/except block to get the needed information from the span tag with the mentioned class name:
def extractPremiumInfo(block):
    result = {
        'price': None,
        'size': None,
        'rooms': None,
        'address': None
    }
    try:
        price = block.find('span', {'class': 'ListItemPrice_price_1o0i3'}).find_all('span')[1].text
        result['price'] = price
    except (AttributeError, IndexError):
        pass
    try:
        m2 = block.find('span', {'class': 'ListItemLivingSpace_value_2zFir'}).text
        result['size'] = m2
    except AttributeError:
        pass
    try:
        rn = block.find('span', {'class': 'ListItemRoomNumber_value_Hpn8O'}).text
        result['rooms'] = rn
    except AttributeError:
        pass
    address = block.find('div', {'class': 'ListItemTopPremium_data_3i7Ca'})
    if address is None:
        address = block.find('div', {'class': 'ListItem_data_18_z_'})
    address = address.find_all('p')[1].text
    result['address'] = address
    return result
- Again, with a for loop, we go over the premium and simple lists and append the extracted results to the finish list:
finish = []
for i in premium:
    finish.append(extractPremiumInfo(i))
for i in simple:
    finish.append(extractPremiumInfo(i))
print(f"Found {len(finish)} apartments")
- We save the extracted data into a Pandas dataframe and write it to a CSV file:
df = pd.DataFrame(finish)
df.to_csv('Geneva_listings_src.csv', index=False, encoding='utf-8')
Once you have the rental data in the form of a Pandas dataframe, you can follow the usual data analysis pipeline. That is, you start by preprocessing the data (handling the missing data, outliers, etc.). For the data analysis, you can include new interesting features such as rent per room, rent per area, zip code of the apartments, etc. These are all done in this notebook (a minimal cleaning sketch follows the table below). Perhaps the trickiest part of the data analysis pipeline for this example is spotting and handling the outliers (which are indeed mostly due to wrong inputs from the users). Here are the first five rows of the resulting dataframe:
Price | Size | Rooms | Address |
---|---|---|---|
4,150.– | 104m2 | 2.5rm | Rue de l'Athénée 38, 1206 Genf |
1,250.– | 26m2 | 1rm | Rue de la Dôle 15, 1203 Genève |
4,000.– | 90m2 | 2.5rm | Rue de l'Athénée 36, 1206 Genève |
3,100.– | 82m2 | 4rm | Rue Liotard, 1202 Geneva |
1,580.– | NaN | 2.5rm | Rue de Lyon, 1201 Genève |
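As a rough sketch of that cleaning step (my own illustration, assuming the raw column names Price, Size, Rooms, and Address shown above; the linked notebook is the authoritative version), the raw strings can be turned into the numeric features used below like this:
import pandas as pd

df = pd.read_csv('Geneva_listings_src.csv')

# '4,150.–' -> 4150.0: drop the thousands separator and the trailing '.–'
df['Price'] = pd.to_numeric(df['Price'].str.replace(',', '').str.rstrip('.–'), errors='coerce')
# '104m2' -> 104.0 and '2.5rm' -> 2.5
df['SurfaceArea'] = pd.to_numeric(df['Size'].str.replace('m2', ''), errors='coerce')
df['Rooms'] = pd.to_numeric(df['Rooms'].str.replace('rm', ''), errors='coerce')
# The Swiss zip code is the four-digit number inside the address string
df['ZipCode'] = pd.to_numeric(df['Address'].str.extract(r'(\d{4})', expand=False))

# Derived features used in the analysis below
df['RentPerArea'] = df['Price'] / df['SurfaceArea']
df['RentPerRoom'] = df['Price'] / df['Rooms']
df['AreaPerRoom'] = df['SurfaceArea'] / df['Rooms']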
Let's say you are interested in the distribution of rental prices as a function of zip code. Then you can use the groupby() method of Pandas on the above dataframe as follows:
zipVsRentMean = df[['ZipCode', 'RentPerArea', 'RentPerRoom', 'AreaPerRoom', 'SurfaceArea']]\
.groupby(['ZipCode'], as_index = False).mean()
Here is zipVsRentMean:
ZipCode | RentPerArea | RentPerRoom | AreaPerRoom | SurfaceArea |
---|---|---|---|---|
1200 | 40.787924 | 899.814815 | 22.503367 | 106.666667 |
1201 | 41.403102 | 882.109565 | 21.283923 | 82.142857 |
1202 | 37.059230 | 818.934074 | 22.243254 | 85.266667 |
1203 | 37.527645 | 716.131490 | 19.320108 | 64.234043 |
1204 | 44.574250 | 1117.337317 | 25.277993 | 88.071429 |
1205 | 35.181856 | 698.735049 | 20.106478 | 75.918919 |
1206 | 39.905645 | 1103.584285 | 27.531853 | 143.326923 |
1207 | 41.646907 | 904.179500 | 21.860883 | 100.052632 |
1208 | 36.857806 | 909.245248 | 24.852548 | 88.071429 |
1209 | 37.278602 | 999.223665 | 27.183622 | 129.666667 |
Next, we would like to show the results of the zip-code table above on a map. To this end, we should first be able to read maps in Python. Maps are usually available in the shapefile format (*.shp). Let's first download this shapefile map, and then I'll discuss how you can read it in Python.
Download Switzerland's zip-code shapefiles from Swiss opendata. I have downloaded PLZO_SHP_LV95 (from here). Extract the folder, and note the path where you saved the zip-code shapefile (called PLZO_PLZ.shp). You can also get it here.
Okay, now you have the shapefile. How would you read/manipulate this in Python? Luckily, the Geopandas library, a powerful Python library for geospatial data processing and analysis, has a method to convert shapefiles into a Geopandas dataframe:
import geopandas as gpd
gdf = gpd.read_file('.../PLZO_SHP_LV95/PLZO_PLZ.shp')
The Coordinate Reference System (CRS) in which the data is displayed can be found via gdf.crs. I convert it to a more common CRS with the following command:
gdf = gdf.to_crs({'init': 'epsg:4326'})
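Note that the {'init': ...} syntax is deprecated in recent versions of Geopandas/pyproj; if the line above raises a warning or an error for you, the equivalent modern call is:
gdf = gdf.to_crs(epsg=4326)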
Here are the first four rows of the geopandas dataframe gdf:
Index | UUID | OS_UUID | STATUS | INAEND | PLZ | ZUSZIFF | geometry |
---|---|---|---|---|---|---|---|
3370 | {54A45D65-97A3-45A1-8DB2-FA3E6E540269} | {5DF8DDBE-8D41-42A3-8F30-F9E716E39C75} | real | nein | 1203 | 0 | POLYGON ((6.13514 46.20837, 6.13470 46.20798, ... |
3456 | {D924C540-1604-4E4A-9C30-A31E36299921} | {5DF8DDBE-8D41-42A3-8F30-F9E716E39C75} | real | nein | 1206 | 0 | POLYGON ((6.15383 46.17984, 6.15387 46.18019, ... |
3485 | {F97E72AA-A260-4075-B3AE-F87FEDE38726} | {5DF8DDBE-8D41-42A3-8F30-F9E716E39C75} | real | nein | 1205 | 0 | POLYGON ((6.13394 46.20368, 6.13408 46.20308, ... |
3531 | {B5EA9714-EF37-41F0-B481-F59A93221892} | {5DF8DDBE-8D41-42A3-8F30-F9E716E39C75} | real | nein | 1207 | 0 | POLYGON ((6.15741 46.20996, 6.15746 46.21001, ... |
The geometry column defines the shape of each polygon. Since we are only looking at data for the city of Geneva, I extract the Geneva rows from gdf (note that gdf covers the whole of Switzerland):
First, I create a list of the zip codes I have for Geneva:
geneva = [1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209]
Then I create a geopandas dataframe for Geneva by keeping only the rows of gdf whose zip codes are contained in the geneva list:
gdf_gen = gdf[gdf['PLZ'].isin(geneva)]
Now you can plot the zip-code map of Geneva with the following code:
gdf_gen.plot()
Which would result in the following figure:
While geopandas can plot such minimal maps, I would like to have an interactive Choropleth map (where you can hover over the map and see the rental results) that also looks a bit nicer than this one. To create such a map, I decided to use the Altair library.
First off, let's merge the gdf_gen dataframe, which only contains geographical data, with the zipVsRentMean Pandas dataframe, which includes the rental data for each zip code in Geneva:
# Merge the dataframes gdf_gen and zipVsRentMean on their zip-code columns
gdf_gen = gdf_gen.merge(zipVsRentMean, left_on='PLZ', right_on='ZipCode')
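A caveat from my side (not in the original notebook): merge silently drops rows whose keys don't match, so if gdf_gen comes back empty, check that PLZ and ZipCode share the same dtype and align them if needed:
print(gdf['PLZ'].dtype, zipVsRentMean['ZipCode'].dtype)
zipVsRentMean['ZipCode'] = zipVsRentMean['ZipCode'].astype(gdf['PLZ'].dtype)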
This will simply add the columns of zipVsRentMean to the right of gdf_gen. Okay, now we have a geopandas dataframe gdf_gen, which includes both the rental data and the geographical information of Geneva. Next, we want to visualize this on an interactive Choropleth map, for which I use the Altair library.
In order for the gdf_gen data to be readable by the Altair library, we need to do some preprocessing as follows:
- Altair currently can only handle GeoJSON or TopoJSON maps
- So, first we need to convert the geopandas data into a format readable by Altair:
import altair as alt
import json
json_gen = json.loads(gdf_gen.to_json())
alt_gen = alt.Data(values = json_gen['features'])
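One caveat: the chart below encodes properties.x and properties.y, which are not columns of the shapefile itself. I assume they are the centroid coordinates of each zip-code polygon, added before the to_json() conversion; a minimal sketch of that step:
# Centroids of each zip-code polygon, used to place the labels
# (newer Geopandas versions warn about centroids in a geographic CRS;
# for labeling purposes this is fine)
gdf_gen['x'] = gdf_gen.geometry.centroid.x
gdf_gen['y'] = gdf_gen.geometry.centroid.y
json_gen = json.loads(gdf_gen.to_json())
alt_gen = alt.Data(values=json_gen['features'])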
alt_gen now has a form that Altair can read. The code below uses Altair to create a choropleth map that displays the average rent per room in the different zip codes of Geneva:
- alt_rentPerRoom creates the choropleth using the mark_geoshape() method, drawing each zip-code polygon with a white stroke and coloring it by the average rent per room (properties.RentPerRoom), represented with a color scale.
- text adds labels to the map, displaying the zip code of each area. This is achieved with the mark_text() method, placing each label at the polygon's centroid (properties.x, properties.y) and setting the text value to the ZipCode property.
- Finally, chart combines the choropleth map and the text labels into a single chart using the + operator. The resulting chart displays the map with color shades representing the average rent per room, each zip code labeled with its corresponding code.
alt_rentPerRoom = alt.Chart(alt_gen).mark_geoshape(
    stroke = 'white'
).encode(
    latitude = 'properties.y:Q',
    longitude = 'properties.x:Q',
    color = 'properties.RentPerRoom:Q'
).properties(
    width = 700,
    height = 600
)

text = alt.Chart(alt_gen).mark_text(
    color = 'black',
    fontWeight = 'bold'
).encode(
    longitude = 'properties.x:Q',
    latitude = 'properties.y:Q',
    text = 'properties.ZipCode:Q',
)
chart = alt_rentPerRoom + text
chart
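In a Jupyter notebook, the last line renders the chart inline. If you are running a plain script instead, you can save the interactive map as a standalone HTML file (the filename here is my own choice):
chart.save('geneva_rent_map.html')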
Here is the result: