In Mexico, the price of gasoline is constantly changing since its prices depend a lot on the prices of barrel, transfer and competitors. That is why keeping track of prices can save money when looking for the best price to fill your tank.
The objective of this project is proces this fuel prices data obtained from CRE. (Comision Regulatoria de Energía en México) in the cloud to storage a daily basis in Storage and in Big Query Data Warehouse
Tools that were used to accomplish this task:
- Python
- Pandas
- Requests
- reverse_geocoder
- unidecode
- pyarrow
- google-cloud-bigquery
- google-cloud-storage
- gcsfs
- Google Cloud Storage
- BigQuery
- Cloud Functions
The schema of the data is as follows:
[
{
"mode": "REQUIRED",
"name": "Permiso",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Marca",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Nombre",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Direccion",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Producto",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Precio",
"type": "FLOAT64"
},
{
"mode": "REQUIRED",
"name": "Estado",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Municipio",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "Fecha",
"type": "DATE"
},
{
"mode": "REQUIRED",
"name": "y",
"type": "FLOAT64"
},
{
"mode": "REQUIRED",
"name": "x",
"type": "FLOAT64"
}
]
The procedure that was taken to process the information is as follows:
- Get xml element tree from CRE web page
- Struct data in json objects to then convert it to pandas dataframe
- Transforms data with pandas and other libraries to insert geo coordinates and insert new columns
- Store transformed data to storage
- Send the final dataframe to Big Query
There are scripts to run all localy to test the data transformations. This proccess runs at 7:30 am with a cloud scheduler using pubsub.
Here is the GCP architecture, this shows the resources that are used