A simple Dataflow sample that loads TSV files from GCS into BigQuery
- File name pattern: "test_file_{target_data_date (YYYYMMDD)}_{output_date (YYYYMMDDHHMMSS)}". If several files share the same target date, the one with the latest output date is selected automatically.
- The first line of each file is a header line. To skip it in Beam, we specify the string that the header line begins with.
- The BigQuery table is assumed to be created in advance with a matching schema. (When the table is partitioned on tdate, the program sets tdate to the date taken from the file name suffix, which keeps the amount of data scanned by queries under control.)
- The commands are assumed to be run from the GCP console
- The Dataflow API must be enabled for the project
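The file selection, header skipping, and tdate handling described above can be sketched in plain Python. This is only an illustration, not the code in this repository: the function names, the optional file extension, the column list, and the "YYYY-MM-DD" form of tdate are all assumptions.

```python
import re
from typing import Dict, Iterable, List, Optional

# Assumed layout: test_file_<YYYYMMDD>_<YYYYMMDDHHMMSS>, optionally followed
# by an extension such as ".tsv" (the extension is a guess).
FILE_NAME_RE = re.compile(r"test_file_(\d{8})_(\d{14})(?:\.\w+)?$")


def pick_latest_file(names: Iterable[str], tdate: str) -> Optional[str]:
    """Return the file for tdate with the newest output date, or None."""
    candidates = []
    for name in names:
        match = FILE_NAME_RE.search(name)
        if match and match.group(1) == tdate:
            candidates.append((match.group(2), name))
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(candidates)[1] if candidates else None


def is_header(line: str, header_prefix: str) -> bool:
    """True when the line starts with the string that marks the header."""
    return line.startswith(header_prefix)


def to_row(line: str, columns: List[str], tdate: str) -> Dict[str, str]:
    """Split one TSV line into a row dict and attach the partition date."""
    values = line.rstrip("\n").split("\t")
    row = dict(zip(columns, values))
    # Assumes tdate is a DATE column, so YYYYMMDD becomes YYYY-MM-DD.
    row["tdate"] = f"{tdate[:4]}-{tdate[4:6]}-{tdate[6:]}"
    return row
```

In a Beam pipeline, is_header would typically sit inside a filter step and to_row inside a map step before the BigQuery write.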
Clone this repository (in this case, save it in a directory called "dataflow")
cd dataflow/
sudo pip3 install -U pip
sudo pip3 install --upgrade virtualenv
virtualenv -p python3.7 env
source env/bin/activate
cd dataflow/
source env/bin/activate
gcloud config set project [PROJECT_ID]
[Execution command]
python --project [PROJECT_ID] --storagebucket [STORAGE_BUCKET_NAME] --workbucket [WORK_BUCKET_NAME] --dataset [BIGQUERY_DATASET_NAME] --tdate [TARGET_DATE]
PROJECT_ID: The ID of the GCP project in which to run the job.
STORAGE_BUCKET_NAME: The name of the Cloud Storage bucket in which the files are stored.
WORK_BUCKET_NAME: The name of the Cloud Storage bucket used as the Dataflow work directory.
BIGQUERY_DATASET_NAME: The name of the BigQuery dataset.
TARGET_DATE: Date to be processed, in YYYYMMDD format (e.g. 20200401).
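For reference, the five options above could be parsed roughly as follows. This is a minimal sketch using argparse, not the script's actual argument handling; the helper names valid_tdate and parse_args are hypothetical, and rejecting malformed dates up front is an added assumption.

```python
import argparse
import re
from datetime import datetime
from typing import List, Optional


def valid_tdate(value: str) -> str:
    """Accept only real calendar dates written as exactly eight digits."""
    if not re.fullmatch(r"\d{8}", value):
        raise argparse.ArgumentTypeError(f"not a YYYYMMDD date: {value!r}")
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a YYYYMMDD date: {value!r}")
    return value


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command-line options listed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--storagebucket", required=True)
    parser.add_argument("--workbucket", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--tdate", required=True, type=valid_tdate)
    return parser.parse_args(argv)
```

With this check, a nine-digit value is rejected before any job is submitted.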
cd dataflow/
source env/bin/activate
gcloud config set project [hopstar-dev/hopstar-prod]
[Execution command]
python --project [PROJECT_ID] --storagebucket [STORAGE_BUCKET_NAME] --workbucket [WORK_BUCKET_NAME] --dataset [BIGQUERY_DATASET_NAME] --tdate [TARGET_DATE]
PROJECT_ID: The ID of the GCP project in which to run the job.
STORAGE_BUCKET_NAME: The name of the Cloud Storage bucket in which the files are stored.
WORK_BUCKET_NAME: The name of the Cloud Storage bucket used as the Dataflow work directory.
BIGQUERY_DATASET_NAME: The name of the BigQuery dataset.
TARGET_DATE: Date to be processed, in YYYYMMDD format (e.g. 20200401).
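As an illustration of how PROJECT_ID and WORK_BUCKET_NAME might feed into the Dataflow job, the sketch below builds Beam-style option flags. The function name and the temp/staging sub-directories are assumptions, not taken from this repository; only the flag names are standard Beam pipeline options.

```python
from typing import List


def build_pipeline_flags(project: str, workbucket: str) -> List[str]:
    """Build command-line style Beam options for a Dataflow run."""
    work_dir = f"gs://{workbucket}"
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        # Assumed layout inside the work bucket; any writable paths would do.
        f"--temp_location={work_dir}/temp",
        f"--staging_location={work_dir}/staging",
    ]
```

A list like this can be handed to apache_beam's PipelineOptions when the pipeline is constructed. Depending on the SDK version, further options such as a region may also be needed.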