Feat/6971 data silver templating (#437)
* Update datastacks config table

* dq section

* Add data processing quickstart

* Add processing steps

* GX bullet

* Tidy up GX

* lint

* fix links
adurkan-amido authored Sep 26, 2023
1 parent c998184 commit 0ea4e84
Showing 9 changed files with 299 additions and 54 deletions.
13 changes: 6 additions & 7 deletions docs/workloads/azure/data/etl_pipelines/data_processing.md
@@ -15,7 +15,7 @@ keywords:
Data processing workloads in Ensono Stacks are jobs which:

1. Take data in the data lake as input (this can be in various formats, e.g. CSV, Parquet, JSON, Delta).
2. Perform some form of data transformation / cleansing / modelling over the data.
2. Perform some form of data transformation / cleansing / modelling / aggregation over the data.
3. Output the data into the data lake in a structured [Delta Lake](https://delta.io/) format, as sketched below.
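
Taken together, a minimal processing job of this shape might look like the following PySpark sketch. It is illustrative only: the storage paths, column names and aggregation are assumptions rather than anything generated by Ensono Stacks, and Delta Lake support is assumed to be available on the cluster (as it is on Databricks).

```python
# Illustrative sketch of a data processing job: read raw data, transform it,
# and write the result to the data lake in Delta format.
# The paths, column names and aggregation below are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # on Databricks a session already exists

# 1. Take data in the data lake as input (here, Parquet in the raw container)
orders = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/example/orders/")

# 2. Cleanse and aggregate the data
daily_totals = (
    orders.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# 3. Output the result to the data lake in Delta format
#    (assumes Delta Lake is available on the cluster, as it is on Databricks)
daily_totals.write.format("delta").mode("overwrite").save(
    "abfss://staging@mydatalake.dfs.core.windows.net/example/daily_order_totals/"
)
```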

While [data ingest workloads](./ingest_data_azure.md) in Ensono Stacks utilise Azure Data Factory's inbuilt connectors and Copy activity, data processing workloads are based upon running Apache Spark / Python jobs on Databricks. These workloads may be used for various levels of data transformation and preparation within the data lake. Within the [medallion architecture](./etl_intro_data_azure.md#medallion-architecture) these will include:
@@ -49,18 +49,17 @@ reliable and accurate.

## Data processing pipeline overview

The diagram below gives an overview of a data processing data pipeline in Data Factory.
Within Stacks, processing activities are performed using Python PySpark jobs. These jobs are orchestrated by pipelines in Data Factory and executed in Databricks. Using PySpark jobs rather than notebooks gives full control over the processing activities, for example ensuring thorough [test coverage](./testing_data_azure.md) and quality control.

![ADF_SilverPipelineDesign.png](../images/ADF_SilverPipelineDesign.png)
The diagram below gives an example of a data processing pipeline in Data Factory.

The processing is executed as Python Databricks job, with repeatable data transformation processes packaged within
our [PySparkle](pysparkle.md) library.
![ADF_SilverPipelineDesign.png](../images/ADF_SilverPipelineDesign.png)

Transformation and processing logic specific to particular datasets is kept inside the `spark_jobs` directory for the workload.
The Python PySpark script executed as part of a data workload is kept inside the `spark_jobs` directory for the workload. This job will utilise the [PySparkle library](./pysparkle.md), which provides a wealth of reusable utilities to assist with data transformations and with loading data from and into the data lake.
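
As a rough illustration of how such a script can be laid out, the sketch below is a hand-written example rather than the generated template; it deliberately avoids naming specific PySparkle functions, and the `--run-date` argument is an assumption.

```python
# Hypothetical layout for a workload's spark_jobs entry script (illustrative only).
import argparse

from pyspark.sql import SparkSession


def run(spark: SparkSession, run_date: str) -> None:
    """Dataset-specific transformation logic for this workload.

    In the generated workloads, loading data from and into the data lake is
    assisted by PySparkle utilities rather than being hand-rolled here.
    """
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Example processing job")
    parser.add_argument("--run-date", required=True, help="Processing window date, e.g. 2010-01-31")
    args = parser.parse_args()

    run(SparkSession.builder.getOrCreate(), args.run_date)
```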

### Data Factory pipeline design

Data processing pipelines are kept within the `Process` folder in Data Factory:
Within Data Factory, the processing pipelines are kept in the `Process` folder:

![ADF_SilverPipelinesList.png](../images/ADF_SilverPipelinesList.png)

4 changes: 2 additions & 2 deletions docs/workloads/azure/data/etl_pipelines/data_quality_azure.md
@@ -71,8 +71,8 @@ Here is the description of the main elements:
5. `validation_config`: A list of validation configurations where each configuration contains the following fields:
1. `column_name`: Name of the validated column.
2. `expectations`: List of expectations where each expectation has the following fields:
3. `expectation_type`: Name of the Great Expectations expectation class to use.
4. `expectation_kwargs`: The keyword arguments to pass to the expectation class.
* `expectation_type`: Name of the Great Expectations [expectation class](https://greatexpectations.io/expectations/) to use.
* `expectation_kwargs`: The keyword arguments to pass to the expectation class (an illustrative entry is sketched below).
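
For instance, a single entry for a mandatory, unique key column could look like the fragment below. It is shown as a Python literal purely for illustration; the column name and the chosen expectations are assumptions, not taken from a shipped config file.

```python
# Illustrative validation_config entry (not taken from a shipped config file).
validation_config = [
    {
        "column_name": "customer_id",
        "expectations": [
            {
                "expectation_type": "expect_column_values_to_not_be_null",
                "expectation_kwargs": {},
            },
            {
                "expectation_type": "expect_column_values_to_be_unique",
                "expectation_kwargs": {"mostly": 1.0},
            },
        ],
    }
]
```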

### Example

60 changes: 38 additions & 22 deletions docs/workloads/azure/data/etl_pipelines/datastacks.md
@@ -34,63 +34,79 @@ poetry run datastacks --help
Datastacks can be used to generate all the resources required for a new data engineering workload, for example a data ingest pipeline. Resources are generated from templates within the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) directory.

The [deployment architecture](../architecture/architecture_data_azure.md#data-engineering-workloads) section shows the workflow for using Datastacks to generate a new workload.
See [ETL Pipeline Deployment](../getting_started/etl_pipelines_deployment_azure.md) for a step-by-step guide on deploying a new workload using Datastacks.
The [getting started](../getting_started/getting_started.md) section includes step-by-step instructions on deploying a new ingest or processing workload using Datastacks.

### Commands

* **`generate`**: This command contains subcommands which generate components for the data platform given a config file.
* **`ingest`**: This subcommand utilises the template for ingest data pipelines, and uses a given config file to generate the required code for a new ingest pipeline ready for use. A flag can be included to specify whether to include data quality components in the pipeline.
* **`generate`**: Top-level command for generating resources for a new data workload.
* **`ingest`**: Subcommand to generate a new [data ingest workload](./ingest_data_azure.md), using the provided configuration file. An optional flag (`--data-quality` or `-dq`) can be included to specify whether to include data quality components in the workload.
* **`processing`**: Subcommand to generate a new [data processing workload](./data_processing.md), using the provided configuration file. An optional flag (`--data-quality` or `-dq`) can be included to specify whether to include data quality components in the workload.

### Examples

```bash
# Activate virtual environment
poetry shell

# Generate resources for an ingest pipeline
# Generate resources for an ingest workload
datastacks generate ingest --config="de_templates/test_config_ingest.yaml"

# Generate resources for an ingest pipeline, with added Data Quality steps
# Generate resources for an ingest workload, with added data quality steps
datastacks generate ingest --config="de_templates/test_config_ingest.yaml" --data-quality

# Generate resources for a processing workload
datastacks generate processing --config="de_templates/test_config_processing.yaml"

# Generate resources for a processing workload, with added data quality steps
datastacks generate processing --config="de_templates/test_config_processing.yaml" --data-quality
```

### Configuration

In order to generate a new data engineering workload the Datastacks CLI takes a path to a config file. This config file should be YAML format, and contain configuration values as specified in the table below. A sample config file is included in the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) folder. The structure of config is [validated using Pydantic](https://github.com/ensono/stacks-azure-data/tree/main/datastacks/datastacks/config.py).
In order to generate a new data engineering workload, the Datastacks CLI takes a path to a config file. This config file should be in YAML format and contain the configuration values specified in the tables below. Sample config files are included in the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) folder. The structure of the config is [validated using Pydantic](https://github.com/ensono/stacks-azure-data/tree/main/datastacks/datastacks/config.py).

#### All workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
| dataset_name | Dataset name, used to derive pipeline and linked service names, e.g. AzureSql_Example. | Yes | String | _n/a_ | AzureSql_Demo |
| pipeline_description | Description of the pipeline to be created. Will be used for the Data Factory pipeline description. | Yes | String | _n/a_ | "Ingest from demo Azure SQL database using ingest config file." |
| data_source_type | Data source type. | Yes | String<br /><br />Allowed values[^1]:<br />"azure_sql" | _n/a_ | azure_sql |
| key_vault_linked_service_name | Name of the Key Vault linked service in Data Factory. | No | String | ls_KeyVault | ls_KeyVault |
| data_source_password_key_vault_secret_name | Secret name of the data source password in Key Vault. | Yes | String | _n/a_ | sql-password |
| data_source_connection_string_variable_name | Variable name for the connection string. | Yes | String | _n/a_ | sql_connection |
| ado_variable_groups_nonprod | List of required variable groups in non-production environment. | Yes | List[String] | _n/a_ | - amido-stacks-de-pipeline-nonprod<br />- stacks-credentials-nonprod-kv |
| ado_variable_groups_prod | List of required variable groups in production environment. | Yes | List[String] | _n/a_ | - amido-stacks-de-pipeline-prod<br />- stacks-credentials-prod-kv |
| default_arm_deployment_mode | Deployment mode for terraform. | No | String | "Incremental" | Incremental |
| window_start_default | Default window start date in the Data Factory pipeline. | No | Date | "2010-01-01" | 2010-01-01 |
| window_end_default | Default window end date in the Data Factory pipeline. | No | Date | "2010-01-31" | 2010-01-31 |
| bronze_container | Name of container for Bronze data | Yes | String | raw | raw |
| silver_container | Name of container for Silver data | No | String | staging | staging |
| gold_container | Name of container for Gold data | No | String | curated | curated |

#### Ingest workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
| dataset_name | Dataset name, used to derive pipeline and linked service names, e.g. AzureSql_Example. | Yes | String | _n/a_ | azure_sql_demo |
| data_source_password_key_vault_secret_name | Secret name of the data source password in Key Vault. | Yes | String | _n/a_ | sql-password |
| data_source_connection_string_variable_name | Variable name for the connection string. | Yes | String | _n/a_ | sql_connection |
| data_source_type | Data source type. | Yes | String<br /><br />Allowed values[^1]:<br />"azure_sql" | _n/a_ | azure_sql |
| bronze_container | Name of container for landing ingested data. | No | String | raw | raw |
| key_vault_linked_service_name | Name of the Key Vault linked service in Data Factory. | No | String | ls_KeyVault | ls_KeyVault |
| trigger_start | Start datetime for Data Factory pipeline trigger. | No | Datetime | _n/a_ | 2010-01-01T00:00:00Z |
| trigger_end | Datetime to set as end time for pipeline trigger. | No | Datetime | _n/a_ | 2011-12-31T23:59:59Z |
| trigger_frequency | Frequency for the Data Factory pipeline trigger. | No | String<br /><br />Allowed values:<br />"Minute"<br />"Hour"<br />"Day"<br />"Week"<br />"Month" | "Month" | Month |
| trigger_interval | Interval value for the Data Factory pipeline trigger. | No | Integer | 1 | 1 |
| trigger_delay | Delay between Data Factory pipeline triggers, formatted HH:mm:ss | No | String | "02:00:00" | 02:00:00 |
| window_start_default | Default window start date in the Data Factory pipeline. | No | Date | "2010-01-01" | 2010-01-01 |
| window_end_default | Default window end date in the Data Factory pipeline. | No | Date | "2010-01-31" | 2010-01-31 |

[^1]: Additional data source types will be supported in future.

#### Processing workloads

| Config field | Description | Required? | Format | Default value | Example value |
| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
| pipeline_name | Name of the data pipeline / workload. | Yes | String | _n/a_ | processing_demo |

## Performing data quality checks

Datastacks provides a CLI to conduct data quality checks using the [PySparkle](./pysparkle.md) library as a backend.
Datastacks can be used to execute data quality checks using [PySparkle](./pysparkle.md), for example:

```bash
datastacks dq --help
datastacks dq --config-path "ingest/Ingest_AzureSql_Example/data_quality/ingest_dq.json" --container config
# Execute data quality checks using the provided config
datastacks dq --config-path "ingest/ingest_azure_sql_example/data_quality/ingest_dq.json" --container config
```

### Required configuration

For details regarding the required environment settings and the configuration file please read the [Data Quality](./data_quality_azure.md#usage) section.
For details regarding the required environment settings and the configuration file see [Data Quality](./data_quality_azure.md#usage).
@@ -87,4 +87,4 @@ To run scripts within a Databricks cluster, you will need to:

## Next steps

Once you setup your local development environment, you can continue with the Getting Started tutorial by [deploying the Shared Resources](shared_resources_deployment_azure.md).
Once you have set up your local development environment, you can continue with the Getting Started tutorial by [deploying the shared resources](shared_resources_deployment_azure.md).
13 changes: 7 additions & 6 deletions docs/workloads/azure/data/getting_started/getting_started.md
@@ -19,9 +19,10 @@ A more [detailed workflow diagram](../architecture/architecture_data_azure.md#de

## Steps

1. [Generate a Data Project](generate_project.md) - Generate a new data project
2. [Infrastructure Deployment](core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment
3. [Local Development Quickstart](dev_quickstart_data_azure.md) - Once your project has been generated, setup your local environment to start developing
4. [Shared Resources Deployment](shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines
5. (Optional) [Example Data Source](example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to setup the Example Data Source
6. [Data Ingest Pipeline Deployment](etl_pipelines_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks utility
1. [Generate a Data Project](./generate_project.md) - Generate a new data project.
2. [Infrastructure Deployment](./core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment.
3. [Local Development Quickstart](./dev_quickstart_data_azure.md) - Once your project has been generated, set up your local environment to start developing.
4. [Shared Resources Deployment](./shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines.
5. (Optional) [Example Data Source](./example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to set up the Example Data Source.
6. [Data Ingest Pipeline Deployment](./ingest_pipeline_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks utility.
7. [Data Processing Pipeline Deployment](./processing_pipeline_deployment_azure.md) - Generate and deploy a data processing pipeline using the Datastacks utility.
@@ -1,5 +1,5 @@
---
id: etl_pipelines_deployment_azure
id: ingest_pipeline_deployment_azure
title: Data Ingest Pipeline Deployment
sidebar_label: 6. Data Ingest Pipeline Deployment
hide_title: false
@@ -33,7 +33,7 @@ This process will deploy the following resources into the project:
* Trigger
* Data ingest config files (JSON)
* Azure DevOps CI/CD pipeline (YAML)
* (optional) Spark jobs for data quality tests (Python)
* (optional) Spark job and config file for data quality tests (Python)
* Template unit tests (Python)
* Template end-to-end tests (Python, Behave)

@@ -58,7 +58,7 @@ The password will need to be accessed dynamically by Data Factory on each connec
Before creating a new workload using Datastacks, open the project locally and create a new branch for the workload being created, e.g.:

```bash
git checkout -b feat/my-new-data-pipeline
git checkout -b feat/my-new-ingest-pipeline
```

## Step 2: Prepare the Datastacks config file
@@ -100,7 +100,7 @@ window_end_default: 2010-01-31
## Step 3: Generate project artifacts using Datastacks
Use the [Datastacks CLI](../etl_pipelines/datastacks.md#using-the-datastacks-cli) to generate the artifacts for the new workload, using the prepared config file (replacing `path_to_config_file/my_config.yaml` with the appropriate path). Note, a workload with Data Quality steps requires a data platform with a Databricks workspace:
Use the [Datastacks CLI](../etl_pipelines/datastacks.md#using-the-datastacks-cli) to generate the artifacts for the new workload, using the prepared config file (replacing `path_to_config_file/my_config.yaml` with the appropriate path):

```bash
# Activate virtual environment
@@ -113,7 +113,7 @@ datastacks generate ingest --config="path_to_config_file/my_config.yaml"
datastacks generate ingest --config="path_to_config_file/my_config.yaml" --data-quality
```

This should add new project artifacts for the workload under `de_workloads/ingest/Ingest_AzureSql_MyNewExample`, based on the ingest workload templates. Review the resources that have been generated.
This will add new project artifacts for the workload under `de_workloads/ingest/Ingest_AzureSql_MyNewExample`, based on the ingest workload templates. Review the resources that have been generated.

## Step 4: Update ingest configuration

@@ -163,12 +163,12 @@ The [end-to-end tests](../etl_pipelines/testing_data_azure.md#end-to-end-tests)

The generated workload contains a YAML file containing a template Azure DevOps CI/CD pipeline for the workload, named `de-ingest-ado-pipeline.yaml`. This should be added as the definition for a new pipeline in Azure DevOps.

1. Sign-in to your Azure DevOps organization and go to your project
2. Go to Pipelines, and then select New pipeline
3. Name the new pipeline to match the name of your new workload, e.g. `de-ingest-azuresql-mynewexample`
4. For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. `de-ingest-ado-pipeline.yaml`) and save
1. Sign-in to your Azure DevOps organization and go to your project.
2. Go to Pipelines, and then select New pipeline.
3. Name the new pipeline to match the name of your new workload, e.g. `de-ingest-azuresql-mynewexample`.
4. For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. `de-ingest-ado-pipeline.yaml`) and save.
5. The new pipeline will require access to any Azure DevOps pipeline variable groups specified in the [datastacks config file](#step-2-prepare-the-datastacks-config-file). Under each variable group, go to 'Pipeline permissions' and add the new pipeline.
6. Run the new pipeline
6. Run the new pipeline.

Running this pipeline in Azure DevOps will deploy the artifacts into the non-production (nonprod) environment and run tests. If successful, the generated resources will now be available in the nonprod Ensono Stacks environment.

@@ -189,4 +189,8 @@ In the example pipeline templates:
* Deployment to the non-production (nonprod) environment is triggered on a feature branch when a pull request is open.
* Deployment to the production (prod) environment is triggered on merging to the `main` branch, followed by manual approval of the release step.

It is recommended in any Ensono Stacks data platform that processes for deploying and releasing to further should be agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod) - but these may be amended depending upon the specific requirements of your project and organisation.
ℹ️ It is recommended in any data platform that the processes for deploying and releasing across environments are agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod), but these may be amended depending upon the specific requirements of your project and organisation.

## Next steps

Now that you have ingested some data into the bronze data lake layer, you can generate a [data processing pipeline](./processing_pipeline_deployment_azure.md) to transform and model the data.