diff --git a/docs/workloads/azure/data/etl_pipelines/data_processing.md b/docs/workloads/azure/data/etl_pipelines/data_processing.md
index b25ec937c..2ec1a631b 100644
--- a/docs/workloads/azure/data/etl_pipelines/data_processing.md
+++ b/docs/workloads/azure/data/etl_pipelines/data_processing.md
@@ -15,7 +15,7 @@ keywords:
Data processing workloads in Ensono Stacks are jobs which:
1. Take data in the data lake as input (this can be various formats e.g. CSV, Parquet, JSON, Delta).
-2. Perform some form of data transformation / cleansing / modelling over the data.
+2. Perform some form of data transformation / cleansing / modelling / aggregation over the data.
3. Output the data into the data lake in a structured [Delta Lake](https://delta.io/) format.
While [data ingest workloads](./ingest_data_azure.md) in Ensono Stacks utilise Azure Data Factory's inbuilt connectors and Copy activity, data processing workloads are based upon running Apache Spark / Python jobs on Databricks. These workloads may be used for various levels of data transformation and preparation within the data lake. Within the [medallion architecture](./etl_intro_data_azure.md#medallion-architecture) these will include:
@@ -49,18 +49,17 @@ reliable and accurate.
## Data processing pipeline overview
-The diagram below gives an overview of a data processing data pipeline in Data Factory.
+Within Stacks, processing activities are performed using Python PySpark jobs. These jobs are orchestrated by pipelines in Data Factory, and executed in Databricks. Using PySpark jobs - as opposed to notebooks - gives full control over the processing activities (for example ensuring thorough [test coverage](./testing_data_azure.md) and quality control).
-![ADF_SilverPipelineDesign.png](../images/ADF_SilverPipelineDesign.png)
+The diagram below gives an example of a data processing pipeline in Data Factory.
-The processing is executed as Python Databricks job, with repeatable data transformation processes packaged within
-our [PySparkle](pysparkle.md) library.
+![ADF_SilverPipelineDesign.png](../images/ADF_SilverPipelineDesign.png)
-Transformation and processing logic specific to particular datasets is kept inside the `spark_jobs` directory for the workload.
+The Python PySpark script executed as part of a data workload is kept inside the `spark_jobs` directory for the workload. This job will utilise the [PySparkle library](./pysparkle.md), which provides a wealth of reusable utilities to assist with data transformations and loading data from/into the data lake.
### Data Factory pipeline design
-Data processing pipelines are kept within the `Process` folder in Data Factory:
+Within Data Factory, the processing pipelines are kept within the `Process` folder:
![ADF_SilverPipelinesList.png](../images/ADF_SilverPipelinesList.png)
diff --git a/docs/workloads/azure/data/etl_pipelines/data_quality_azure.md b/docs/workloads/azure/data/etl_pipelines/data_quality_azure.md
index efaee5a68..ab53123ae 100644
--- a/docs/workloads/azure/data/etl_pipelines/data_quality_azure.md
+++ b/docs/workloads/azure/data/etl_pipelines/data_quality_azure.md
@@ -71,8 +71,8 @@ Here is the description of the main elements:
5. `validation_config`: A list of validation configurations where each configuration contains the following fields:
1. `column_name`: Name of the validated column.
2. `expectations`: List of expectations where each expectation has the following fields:
- 3. `expectation_type`: Name of the Great Expectations expectation class to use.
- 4. `expectation_kwargs`: The keyword arguments to pass to the expectation class.
+ * `expectation_type`: Name of the Great Expectations [expectation class](https://greatexpectations.io/expectations/) to use.
+ * `expectation_kwargs`: The keyword arguments to pass to the expectation class.
### Example
diff --git a/docs/workloads/azure/data/etl_pipelines/datastacks.md b/docs/workloads/azure/data/etl_pipelines/datastacks.md
index 05a72ac5e..51f2449e8 100644
--- a/docs/workloads/azure/data/etl_pipelines/datastacks.md
+++ b/docs/workloads/azure/data/etl_pipelines/datastacks.md
@@ -34,12 +34,13 @@ poetry run datastacks --help
Datastacks can be used to generate all the resources required for a new data engineering workload - for example a data ingest pipeline. This will create all resources required for the workload, based upon templates within the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) directory.
The [deployment architecture](../architecture/architecture_data_azure.md#data-engineering-workloads) section shows the workflow for using Datastacks to generate a new workload.
-See [ETL Pipeline Deployment](../getting_started/etl_pipelines_deployment_azure.md) for a step-by-step guide on deploying a new workload using Datastacks.
+The [getting started](../getting_started/getting_started.md) section includes step-by-step instructions on deploying a new ingest or processing workload using Datastacks.
### Commands
-* **`generate`**: This command contains subcommands which generate components for the data platform given a config file.
- * **`ingest`**: This subcommand utilises the template for ingest data pipelines, and uses a given config file to generate the required code for a new ingest pipeline ready for use. A flag can be included to specify whether to include data quality components in the pipeline.
+* **`generate`**: Top-level command for generating resources for a new data workload.
+  * **`ingest`**: Subcommand to generate a new [data ingest workload](./ingest_data_azure.md), using the provided configuration file. An optional flag (`--data-quality` or `-dq`) can be included to specify whether to include data quality components in the workload.
+  * **`processing`**: Subcommand to generate a new [data processing workload](./data_processing.md), using the provided configuration file. An optional flag (`--data-quality` or `-dq`) can be included to specify whether to include data quality components in the workload.
### Examples
@@ -47,50 +48,65 @@ See [ETL Pipeline Deployment](../getting_started/etl_pipelines_deployment_azure.
# Activate virtual environment
poetry shell
-# Generate resources for an ingest pipeline
+# Generate resources for an ingest workload
datastacks generate ingest --config="de_templates/test_config_ingest.yaml"
-# Generate resources for an ingest pipeline, with added Data Quality steps
+# Generate resources for an ingest workload, with added data quality steps
datastacks generate ingest --config="de_templates/test_config_ingest.yaml" --data-quality
+
+# Generate resources for a processing workload
+datastacks generate processing --config="de_templates/test_config_processing.yaml"
+
+# Generate resources for a processing workload, with added data quality steps
+datastacks generate processing --config="de_templates/test_config_processing.yaml" --data-quality
```
### Configuration
-In order to generate a new data engineering workload the Datastacks CLI takes a path to a config file. This config file should be YAML format, and contain configuration values as specified in the table below. A sample config file is included in the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) folder. The structure of config is [validated using Pydantic](https://github.com/ensono/stacks-azure-data/tree/main/datastacks/datastacks/config.py).
+To generate a new data engineering workload, the Datastacks CLI takes the path to a config file. This config file should be in YAML format and contain the configuration values specified in the tables below. Sample config files are included in the [de_templates](https://github.com/ensono/stacks-azure-data/tree/main/de_templates) folder. The structure of the config is [validated using Pydantic](https://github.com/ensono/stacks-azure-data/tree/main/datastacks/datastacks/config.py) - an illustrative model is sketched after the tables.
+
+#### All workloads
| Config field | Description | Required? | Format | Default value | Example value |
| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
-| dataset_name | Dataset name, used to derive pipeline and linked service names, e.g. AzureSql_Example. | Yes | String | _n/a_ | AzureSql_Demo |
| pipeline_description | Description of the pipeline to be created. Will be used for the Data Factory pipeline description. | Yes | String | _n/a_ | "Ingest from demo Azure SQL database using ingest config file." |
-| data_source_type | Data source type. | Yes | String<br>Allowed values[^1]:<br>"azure_sql" | _n/a_ | azure_sql |
-| key_vault_linked_service_name | Name of the Key Vault linked service in Data Factory. | No | String | ls_KeyVault | ls_KeyVault |
-| data_source_password_key_vault_secret_name | Secret name of the data source password in Key Vault. | Yes | String | _n/a_ | sql-password |
-| data_source_connection_string_variable_name | Variable name for the connection string. | Yes | String | _n/a_ | sql_connection |
| ado_variable_groups_nonprod | List of required variable groups in non-production environment. | Yes | List[String] | _n/a_ | - amido-stacks-de-pipeline-nonprod<br>- stacks-credentials-nonprod-kv |
| ado_variable_groups_prod | List of required variable groups in production environment. | Yes | List[String] | _n/a_ | - amido-stacks-de-pipeline-prod<br>- stacks-credentials-prod-kv |
| default_arm_deployment_mode | Deployment mode for terraform. | No | String | "Incremental" | Incremental |
-| window_start_default | Default window start date in the Data Factory pipeline. | No | Date | "2010-01-01" | 2010-01-01 |
-| window_end_default | Default window end date in the Data Factory pipeline. | No | Date | "2010-01-31" | 2010-01-31 |
-| bronze_container | Name of container for Bronze data | Yes | String | raw | raw |
-| silver_container | Name of container for Silver data | No | String | staging | staging |
-| gold_container | Name of container for Gold data | No | String | curated | curated |
+
+#### Ingest workloads
+
+| Config field | Description | Required? | Format | Default value | Example value |
+| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
+| dataset_name | Dataset name, used to derive pipeline and linked service names, e.g. AzureSql_Example. | Yes | String | _n/a_ | azure_sql_demo |
+| data_source_password_key_vault_secret_name | Secret name of the data source password in Key Vault. | Yes | String | _n/a_ | sql-password |
+| data_source_connection_string_variable_name | Variable name for the connection string. | Yes | String | _n/a_ | sql_connection |
+| data_source_type | Data source type. | Yes | String<br>Allowed values[^1]:<br>"azure_sql" | _n/a_ | azure_sql |
+| bronze_container | Name of container for landing ingested data. | No | String | raw | raw |
+| key_vault_linked_service_name | Name of the Key Vault linked service in Data Factory. | No | String | ls_KeyVault | ls_KeyVault |
| trigger_start | Start datetime for Data Factory pipeline trigger. | No | Datetime | _n/a_ | 2010-01-01T00:00:00Z |
| trigger_end | Datetime to set as end time for pipeline trigger. | No | Datetime | _n/a_ | 2011-12-31T23:59:59Z |
| trigger_frequency | Frequency for the Data Factory pipeline trigger. | No | String<br>Allowed values:<br>"Minute"<br>"Hour"<br>"Day"<br>"Week"<br>"Month" | "Month" | Month |
| trigger_interval | Interval value for the Data Factory pipeline trigger. | No | Integer | 1 | 1 |
| trigger_delay | Delay between Data Factory pipeline triggers, formatted HH:mm:ss | No | String | "02:00:00" | 02:00:00 |
+| window_start_default | Default window start date in the Data Factory pipeline. | No | Date | "2010-01-01" | 2010-01-01 |
+| window_end_default | Default window end date in the Data Factory pipeline. | No | Date | "2010-01-31" | 2010-01-31 |
[^1]: Additional data source types will be supported in future.
+#### Processing workloads
+
+| Config field | Description | Required? | Format | Default value | Example value |
+| --------------------------------------------- | ----------------------------------------------------------------- | --------------- | ------------ | ------------------- | ------------------- |
+| pipeline_name | Name of the data pipeline / workload. | Yes | String | _n/a_ | processing_demo |
+
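+To illustrate what that validation looks like, below is a minimal, hypothetical Pydantic model combining the all-workloads and processing fields above - the definitive models live in the `config.py` linked above and may differ:
+
+```python
+# Illustrative sketch only - field names mirror the configuration tables above.
+from typing import List
+
+from pydantic import BaseModel
+
+
+class ProcessingConfigSketch(BaseModel):
+    pipeline_name: str
+    pipeline_description: str
+    ado_variable_groups_nonprod: List[str]
+    ado_variable_groups_prod: List[str]
+    default_arm_deployment_mode: str = "Incremental"
+
+
+# Validating a config loaded from YAML raises a clear error if a required
+# field is missing or has the wrong type.
+config = ProcessingConfigSketch(
+    pipeline_name="my_processing_pipeline",
+    pipeline_description="My example processing pipeline.",
+    ado_variable_groups_nonprod=["amido-stacks-de-pipeline-nonprod"],
+    ado_variable_groups_prod=["amido-stacks-de-pipeline-prod"],
+)
+```
+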
## Performing data quality checks
-Datastacks provides a CLI to conduct data quality checks using the [PySparkle](./pysparkle.md) library as a backend.
+Datastacks can be used to execute data quality checks using [PySparkle](./pysparkle.md), for example:
```bash
-datastacks dq --help
-datastacks dq --config-path "ingest/Ingest_AzureSql_Example/data_quality/ingest_dq.json" --container config
+# Execute data quality checks using the provided config
+datastacks dq --config-path "ingest/ingest_azure_sql_example/data_quality/ingest_dq.json" --container config
```
-### Required configuration
-
-For details regarding the required environment settings and the configuration file please read the [Data Quality](./data_quality_azure.md#usage) section.
+For details regarding the required environment settings and the configuration file see [Data Quality](./data_quality_azure.md#usage).
diff --git a/docs/workloads/azure/data/getting_started/dev_quickstart_data_azure.md b/docs/workloads/azure/data/getting_started/dev_quickstart_data_azure.md
index 2c94de4c3..5e229eaca 100644
--- a/docs/workloads/azure/data/getting_started/dev_quickstart_data_azure.md
+++ b/docs/workloads/azure/data/getting_started/dev_quickstart_data_azure.md
@@ -87,4 +87,4 @@ To run scripts within a Databricks cluster, you will need to:
## Next steps
-Once you setup your local development environment, you can continue with the Getting Started tutorial by [deploying the Shared Resources](shared_resources_deployment_azure.md).
+Once you have set up your local development environment, you can continue with the Getting Started tutorial by [deploying the shared resources](shared_resources_deployment_azure.md).
diff --git a/docs/workloads/azure/data/getting_started/getting_started.md b/docs/workloads/azure/data/getting_started/getting_started.md
index 2c2f8a992..469bb22cc 100644
--- a/docs/workloads/azure/data/getting_started/getting_started.md
+++ b/docs/workloads/azure/data/getting_started/getting_started.md
@@ -19,9 +19,10 @@ A more [detailed workflow diagram](../architecture/architecture_data_azure.md#de
## Steps
-1. [Generate a Data Project](generate_project.md) - Generate a new data project
-2. [Infrastructure Deployment](core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment
-3. [Local Development Quickstart](dev_quickstart_data_azure.md) - Once your project has been generated, setup your local environment to start developing
-4. [Shared Resources Deployment](shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines
-5. (Optional) [Example Data Source](example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to setup the Example Data Source
-6. [Data Ingest Pipeline Deployment](etl_pipelines_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks utility
+1. [Generate a Data Project](./generate_project.md) - Generate a new data project.
+2. [Infrastructure Deployment](./core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment.
+3. [Local Development Quickstart](./dev_quickstart_data_azure.md) - Once your project has been generated, set up your local environment to start developing.
+4. [Shared Resources Deployment](./shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines.
+5. (Optional) [Example Data Source](./example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to set up the Example Data Source.
+6. [Data Ingest Pipeline Deployment](./ingest_pipeline_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks utility.
+7. [Data Processing Pipeline Deployment](./processing_pipeline_deployment_azure.md) - Generate and deploy a data processing pipeline using the Datastacks utility.
diff --git a/docs/workloads/azure/data/getting_started/etl_pipelines_deployment_azure.md b/docs/workloads/azure/data/getting_started/ingest_pipeline_deployment_azure.md
similarity index 89%
rename from docs/workloads/azure/data/getting_started/etl_pipelines_deployment_azure.md
rename to docs/workloads/azure/data/getting_started/ingest_pipeline_deployment_azure.md
index 5c41d908a..d5558c290 100644
--- a/docs/workloads/azure/data/getting_started/etl_pipelines_deployment_azure.md
+++ b/docs/workloads/azure/data/getting_started/ingest_pipeline_deployment_azure.md
@@ -1,5 +1,5 @@
---
-id: etl_pipelines_deployment_azure
+id: ingest_pipeline_deployment_azure
title: Data Ingest Pipeline Deployment
sidebar_label: 6. Data Ingest Pipeline Deployment
hide_title: false
@@ -33,7 +33,7 @@ This process will deploy the following resources into the project:
* Trigger
* Data ingest config files (JSON)
* Azure DevOps CI/CD pipeline (YAML)
-* (optional) Spark jobs for data quality tests (Python)
+* (optional) Spark job and config file for data quality tests (Python)
* Template unit tests (Python)
* Template end-to-end tests (Python, Behave)
@@ -58,7 +58,7 @@ The password will need to be accessed dynamically by Data Factory on each connec
Before creating a new workload using Datastacks, open the project locally and create a new branch for the workload being created, e.g.:
```bash
-git checkout -b feat/my-new-data-pipeline
+git checkout -b feat/my-new-ingest-pipeline
```
## Step 2: Prepare the Datastacks config file
@@ -100,7 +100,7 @@ window_end_default: 2010-01-31
## Step 3: Generate project artifacts using Datastacks
-Use the [Datastacks CLI](../etl_pipelines/datastacks.md#using-the-datastacks-cli) to generate the artifacts for the new workload, using the prepared config file (replacing `path_to_config_file/my_config.yaml` with the appropriate path). Note, a workload with Data Quality steps requires a data platform with a Databricks workspace:
+Use the [Datastacks CLI](../etl_pipelines/datastacks.md#using-the-datastacks-cli) to generate the artifacts for the new workload, using the prepared config file (replacing `path_to_config_file/my_config.yaml` with the appropriate path):
```bash
# Activate virtual environment
@@ -113,7 +113,7 @@ datastacks generate ingest --config="path_to_config_file/my_config.yaml"
datastacks generate ingest --config="path_to_config_file/my_config.yaml" --data-quality
```
-This should add new project artifacts for the workload under `de_workloads/ingest/Ingest_AzureSql_MyNewExample`, based on the ingest workload templates. Review the resources that have been generated.
+This will add new project artifacts for the workload under `de_workloads/ingest/Ingest_AzureSql_MyNewExample`, based on the ingest workload templates. Review the resources that have been generated.
## Step 4: Update ingest configuration
@@ -163,12 +163,12 @@ The [end-to-end tests](../etl_pipelines/testing_data_azure.md#end-to-end-tests)
The generated workload contains a YAML file containing a template Azure DevOps CI/CD pipeline for the workload, named `de-ingest-ado-pipeline.yaml`. This should be added as the definition for a new pipeline in Azure DevOps.
-1. Sign-in to your Azure DevOps organization and go to your project
-2. Go to Pipelines, and then select New pipeline
-3. Name the new pipeline to match the name of your new workload, e.g. `de-ingest-azuresql-mynewexample`
-4. For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. `de-ingest-ado-pipeline.yaml`) and save
+1. Sign-in to your Azure DevOps organization and go to your project.
+2. Go to Pipelines, and then select New pipeline.
+3. Name the new pipeline to match the name of your new workload, e.g. `de-ingest-azuresql-mynewexample`.
+4. For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. `de-ingest-ado-pipeline.yaml`) and save.
5. The new pipeline will require access to any Azure DevOps pipeline variable groups specified in the [datastacks config file](#step-2-prepare-the-datastacks-config-file). Under each variable group, go to 'Pipeline permissions' and add the new pipeline.
-6. Run the new pipeline
+6. Run the new pipeline.
Running this pipeline in Azure DevOps will deploy the artifacts into the non-production (nonprod) environment and run tests. If successful, the generated resources will now be available in the nonprod Ensono Stacks environment.
@@ -189,4 +189,8 @@ In the example pipeline templates:
* Deployment to the non-production (nonprod) environment is triggered on a feature branch when a pull request is open
* Deployment to the production (prod) environment is triggered on merging to the `main` branch, followed by manual approval of the release step.
-It is recommended in any Ensono Stacks data platform that processes for deploying and releasing to further should be agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod) - but these may be amended depending upon the specific requirements of your project and organisation.
+ℹ️ It is recommended that, in any data platform, processes for deploying and releasing across environments are agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod), but these may be amended depending upon the specific requirements of your project and organisation.
+
+## Next steps
+
+Now that you have ingested some data into the bronze data lake layer, you can generate a [data processing pipeline](./processing_pipeline_deployment_azure.md) to transform and model the data.
diff --git a/docs/workloads/azure/data/getting_started/processing_pipeline_deployment_azure.md b/docs/workloads/azure/data/getting_started/processing_pipeline_deployment_azure.md
new file mode 100644
index 000000000..cf74ec426
--- /dev/null
+++ b/docs/workloads/azure/data/getting_started/processing_pipeline_deployment_azure.md
@@ -0,0 +1,226 @@
+---
+id: processing_pipeline_deployment_azure
+title: Data Processing Pipeline Deployment
+sidebar_label: 7. Data Processing Pipeline Deployment
+hide_title: false
+hide_table_of_contents: false
+description: Data processing pipelines development & deployment
+keywords:
+ - datastacks
+ - data
+ - python
+ - etl
+ - cli
+ - azure
+ - template
+---
+
+This section provides an overview of generating a new [data processing pipeline](../etl_pipelines/data_processing.md) workload and deploying it into an Ensono Stacks Data Platform, using the [Datastacks](../etl_pipelines/datastacks.md) utility.
+
+This guide assumes the following are in place:
+
+* A [deployed Ensono Stacks data platform](./core_data_platform_deployment_azure.md)
+* [Development environment set up](./dev_quickstart_data_azure.md)
+* [Deployed shared resources](./shared_resources_deployment_azure.md)
+* [Data ingested into the bronze layer of the data lake](./ingest_pipeline_deployment_azure.md)
+
+This process will deploy the following resources into the project:
+
+* Azure Data Factory Pipeline resource (defined in Terraform / ARM)
+* Boilerplate script for performing data processing activities using PySpark (Python)
+* Azure DevOps CI/CD pipeline (YAML)
+* (optional) Spark job and config file for data quality tests (Python)
+* Template unit tests (Python)
+
+## Background
+
+Once you have landed data in the bronze layer of your data lake (e.g. through a [data ingest pipeline](./ingest_pipeline_deployment_azure.md)), you will typically need to perform additional processing activities upon the data in order to make it usable for your needs - for example, loading it into the silver and gold layers of your data lake. For more information, see [data processing workloads](../etl_pipelines/data_processing.md).
+
+In the steps below, we will generate a data processing pipeline that uses data ingested in the [previous task](./ingest_pipeline_deployment_azure.md) as its source, and loads it into the silver layer. The same approach can be used to load data from silver to gold.
+
+## Step 1: Create feature branch
+
+Before creating a new workload using Datastacks, open the project locally and create a new branch for the workload being created, e.g.:
+
+```bash
+git checkout -b feat/my-new-processing-pipeline
+```
+
+## Step 2: Prepare the Datastacks config file
+
+Datastacks requires a YAML config file for generating a new processing workload - see [Datastacks configuration](../etl_pipelines/datastacks.md#configuration) for further details.
+
+Create a new YAML file and populate the values relevant to your new processing pipeline. The example below will generate resources for a processing workload named **my_processing_pipeline**.
+
+```yaml
+#######################
+# Required parameters #
+#######################
+
+# Pipeline configurations
+pipeline_name: my_processing_pipeline
+
+# Description of pipeline shown in Azure Data Factory
+pipeline_description: "My example processing pipeline."
+
+# Azure DevOps configurations
+ado_variable_groups_nonprod:
+ - amido-stacks-de-pipeline-nonprod
+ - stacks-credentials-nonprod-kv
+
+ado_variable_groups_prod:
+ - amido-stacks-de-pipeline-prod
+ - stacks-credentials-prod-kv
+
+```
+
+## Step 3: Generate project artifacts using Datastacks
+
+Use the [Datastacks CLI](../etl_pipelines/datastacks.md#using-the-datastacks-cli) to generate the artifacts for the new workload, using the prepared config file (replacing `path_to_config_file/my_config.yaml` with the appropriate path):
+
+```bash
+# Activate virtual environment
+poetry shell
+
+# Generate resources for a processing pipeline (without Data Quality steps)
+datastacks generate processing --config="path_to_config_file/my_config.yaml"
+
+# Generate resources for a processing pipeline (with added Data Quality steps)
+datastacks generate processing --config="path_to_config_file/my_config.yaml" --data-quality
+```
+
+This will add new project artifacts for the workload under `de_workloads/processing/my_processing_pipeline`, based on the processing workload templates. Review the resources that have been generated.
+
+## Step 4: Update PySpark job
+
+Within the generated workload, the following Python file will be used as the entrypoint for the processing job: `spark_jobs/process.py`. The file is structured so you can start adding any logic specific to your particular workload using Python / Spark. It will reference [PySparkle](../etl_pipelines/pysparkle.md) utilities to simplify interactions with the data platform and standard transformation activities.
+
+```python
+import logging
+from pysparkle.logger import setup_logger
+
+WORKLOAD_NAME = "processing_demo"
+
+logger_library = "pysparkle"
+logger = logging.getLogger(logger_library)
+
+
+def etl_main() -> None:
+ """Execute data processing and transformation jobs."""
+ logger.info(f"Running {WORKLOAD_NAME} processing...")
+
+ #######################
+ # Add processing here #
+ #######################
+
+ logger.info(f"Finished: {WORKLOAD_NAME} processing.")
+
+
+if __name__ == "__main__":
+ setup_logger(name=logger_library, log_level=logging.INFO)
+ etl_main()
+
+```
+
+For the getting started guide, we have provided a simple example - you may extend this based on whatever your workload requires. Copy the following additional imports and constants into the top of your `process.py` file:
+
+```python
+from pysparkle.etl import (
+ TableTransformation,
+ get_spark_session_for_adls,
+ read_latest_rundate_data,
+ transform_and_save_as_delta,
+)
+from pysparkle.pyspark_utils import rename_columns_to_snake_case
+
+BRONZE_CONTAINER = "raw"
+SILVER_CONTAINER = "staging"
+SOURCE_DATA_TYPE = "parquet"
+INPUT_PATH_PATTERN = "Ingest_AzureSql_Example/movies.{table_name}/v1/full/"
+OUTPUT_PATH_PATTERN = "movies/{table_name}"
+```
+
+Next, copy the following into the `etl_main` function in `process.py`, replacing the `# Add processing here #` comment block:
+
+```python
+ spark = get_spark_session_for_adls(WORKLOAD_NAME)
+
+ tables = [
+ TableTransformation("links", rename_columns_to_snake_case),
+ TableTransformation("ratings_small", rename_columns_to_snake_case)
+ ]
+
+ for table in tables:
+ df = read_latest_rundate_data(
+ spark,
+ BRONZE_CONTAINER,
+ INPUT_PATH_PATTERN.format(table_name=table.table_name),
+ datasource_type=SOURCE_DATA_TYPE,
+ )
+
+ output_path = OUTPUT_PATH_PATTERN.format(table_name=table.table_name)
+
+ transform_and_save_as_delta(spark, df, table.transformation_function, SILVER_CONTAINER, output_path)
+```
+
+The processing script is now prepared to perform the following steps:
+
+1. Initiate a Spark session and connectivity to the data lake.
+2. Define `TableTransformation` objects - these consist of an input table name and a transformation function. Here we are specifying two tables - _links_ and _ratings_small_ - and assigning the `rename_columns_to_snake_case` function as their transformation function (a custom transformation is sketched after this list).
+3. For each of the tables:
+ 1. Read the latest data from the bronze layer into a Spark DataFrame.
+ 2. Define an output path for the data in the silver layer.
+ 3. Execute the transformation function against the DataFrame.
+ 4. Save the transformed DataFrame into the silver layer in Delta format.
+
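+Instead of `rename_columns_to_snake_case`, a workload may need its own transformation logic. The sketch below is illustrative only - it assumes, as `rename_columns_to_snake_case` does, that a transformation function takes a PySpark DataFrame and returns a DataFrame:
+
+```python
+# Illustrative sketch of a custom transformation function (not generated by Datastacks).
+from pyspark.sql import DataFrame
+
+
+def drop_duplicate_rows(df: DataFrame) -> DataFrame:
+    """Example transformation: remove exact duplicate rows."""
+    return df.dropDuplicates()
+
+
+# The function can then be assigned to a table in the same way, e.g.:
+# tables = [TableTransformation("links", drop_duplicate_rows)]
+```
+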
+In order to run / debug the code during development, you may wish to use [Databricks for development](./dev_quickstart_data_azure.md#optional-pyspark-development-in-databricks).
+
+## Step 5: Update tests
+
+The workload is created with placeholders for adding unit and end-to-end tests - see [testing](../etl_pipelines/testing_data_azure.md) for further details.
+
+### Unit tests
+
+A placeholder for adding unit tests is located within the workload under `tests/unit/test_processing.py`. The unit tests are intended as a first step to ensure the code is performing as intended and ensure no breaking changes have been introduced. The unit tests will run as part of the deployment pipeline, and can be run locally by developers.
+
+Within the same directory a `conftest.py` is provided. This contains a PyTest fixture to enable a local Spark session to be used for running the unit tests in isolation - for examples of this you can refer to the [example silver workload](https://github.com/Ensono/stacks-azure-data/blob/main/de_workloads/processing/silver_movies_example/tests/unit/).
+
+Add any unit tests you require to `test_processing.py` (although they are not strictly required for the getting started guide). You may also add these tests to the project's `Makefile` under the `test` command, to easily run them alongside other unit tests in the project.
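+
+As an illustration only, a minimal unit test could use the local Spark fixture to exercise a workload-specific transformation such as the `drop_duplicate_rows` sketch above (the `spark` fixture name and the import path below are assumptions and will depend on your project layout):
+
+```python
+# Illustrative sketch - assumes a `spark` SparkSession fixture provided by conftest.py
+# and that the workload's transformation functions are importable from spark_jobs.process.
+from spark_jobs.process import drop_duplicate_rows
+
+
+def test_drop_duplicate_rows_removes_exact_duplicates(spark):
+    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])
+
+    result = drop_duplicate_rows(df)
+
+    assert result.count() == 2
+```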
+
+### End-to-end tests
+
+A placeholder for end-to-end tests for the workload is also provided within the workload's `tests` directory. These will run as part of the deployment pipeline.
+
+End-to-end tests are not required for the getting started demo, but are recommended when developing any production workload.
+
+## Step 6: Deploy new workload in non-production environment
+
+As with ingest workloads, the generated processing workload contains a YAML file defining a template Azure DevOps CI/CD pipeline, named `de-process-ado-pipeline.yaml`. This should be added as the definition for a new pipeline in Azure DevOps.
+
+1. Sign-in to your Azure DevOps organization and go to your project.
+2. Go to Pipelines, and then select New pipeline.
+3. Name the new pipeline to match the name of your new workload, e.g. `de-process-my-processing-pipeline`.
+4. For the pipeline definition, specify the YAML file in the project repository feature branch (e.g. `de-process-ado-pipeline.yaml`) and save.
+5. The new pipeline will require access to any Azure DevOps pipeline variable groups specified in the [datastacks config file](#step-2-prepare-the-datastacks-config-file). Under each variable group, go to 'Pipeline permissions' and add the new pipeline.
+6. Run the new pipeline.
+
+Running this pipeline in Azure DevOps will deploy the artifacts into the non-production (nonprod) environment and run tests. If successful, the generated resources will now be available in the nonprod Ensono Stacks environment.
+
+## Step 7: Review deployed resources
+
+If successful, the workload's resources will now be deployed into the non-production resource group in Azure - these can be viewed through the [Azure Portal](https://portal.azure.com/#home) or CLI.
+
+The Azure Data Factory resources can be viewed through the [Data Factory UI](https://adf.azure.com/). You may also wish to run/debug the newly generated pipeline from here (see [Microsoft documentation](https://learn.microsoft.com/en-us/azure/data-factory/iterative-development-debugging)).
+
+ℹ️ Note: The structure of the data platform and Data Factory resources are defined in the project's code repository, and deployed through the Azure DevOps pipelines. Changes made to Data Factory resources directly through the UI will lead to them being overwritten when pipelines are next run. If you wish to update Data Factory resources, update the appropriate files within the workload (under the `data_factory` path).
+
+Continue to make any further amendments needed to the new workload, re-running the DevOps pipeline as required. If including data quality checks, update the `data_quality_config.json` file in the repository with details of the checks required on the data.
+
+## Step 8: Deploy new workload in further environments
+
+In the example pipeline templates:
+
+* Deployment to the non-production (nonprod) environment is triggered on a feature branch when a pull request is open
+* Deployment to the production (prod) environment is triggered on merging to the `main` branch, followed by manual approval of the release step.
+
+ℹ️ It is recommended that, in any data platform, processes for deploying and releasing across environments are agreed and documented, ensuring sufficient review and quality assurance of any new workloads. The template CI/CD pipelines provided are based upon two platform environments (nonprod and prod), but these may be amended depending upon the specific requirements of your project and organisation.
diff --git a/docs/workloads/azure/data/getting_started/shared_resources_deployment_azure.md b/docs/workloads/azure/data/getting_started/shared_resources_deployment_azure.md
index 4fbdc8dea..7c0e9cbfe 100644
--- a/docs/workloads/azure/data/getting_started/shared_resources_deployment_azure.md
+++ b/docs/workloads/azure/data/getting_started/shared_resources_deployment_azure.md
@@ -56,14 +56,13 @@ This YAML file should be added as the definition for a new pipeline in Azure Dev
4. For the pipeline definition, specify the YAML file in the project repository feature branch (`de-shared-resources.yml`) and save
5. The new pipeline will require access to any Azure DevOps pipeline variable groups specified in the pipeline YAML. Under each variable group, go to 'Pipeline permissions' and add the pipeline.
-
## Step 3: Deploy shared resources in non-production environment
Run the pipeline configured in Step 2 to commence the build and deployment process.
Running this pipeline in Azure DevOps will initiate the deployment of artifacts into the non-production (nonprod) environment. It's important to monitor the progress of this deployment to ensure its success. You can track the progress and status of the deployment within the Pipelines section of Azure DevOps.
-If successful, the core DE shared resources will now be available in the nonprod Ensono Stacks environment. To view the deployed resources, navigate to the relevant resource group in the [Azure portal](https://portal.azure.com/). The deployed Data Factory resources can be viewed through the [Data Factory UI](https://adf.azure.com/).
+If successful, the core DE shared resources will now be available in the non-production environment. To view the deployed resources, navigate to the relevant resource group in the [Azure portal](https://portal.azure.com/). The deployed Data Factory resources can be viewed through the [Data Factory UI](https://adf.azure.com/).
ℹ️ Note: The structure of the data platform and Data Factory resources are defined in the project's code repository, and deployed through the Azure DevOps pipelines. Changes to Data Factory resources directly through the UI will lead to them being overwritten when pipelines are next run. If you wish to update shared Data Factory resources, update the appropriate files under the path `de_workloads/shared_resources/data_factory`.
@@ -75,7 +74,6 @@ The template CI/CD pipelines provided are based upon these two platform environm
* Deployment to the non-production (nonprod) environment is triggered on a feature branch when a pull request is open
* Deployment to the production (prod) environment is triggered on merging to the `main` branch, followed by manual approval of the release step.
-
## Next steps
-Once the shared resources are deployed you may now [generate a new Data Ingest Pipeline](etl_pipelines_deployment_azure.md) (optionally implementing the [Example Data Source](example_data_source.md) beforehand).
+Once the shared resources are deployed, you may now [generate a new data ingest pipeline](./ingest_pipeline_deployment_azure.md) (optionally implementing the [example data source](example_data_source.md) beforehand).
diff --git a/sidebars.js b/sidebars.js
index 7a34d6068..4da8e33f9 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -234,7 +234,8 @@ module.exports = {
"workloads/azure/data/getting_started/dev_quickstart_data_azure",
"workloads/azure/data/getting_started/shared_resources_deployment_azure",
"workloads/azure/data/getting_started/example_data_source",
- "workloads/azure/data/getting_started/etl_pipelines_deployment_azure",
+ "workloads/azure/data/getting_started/ingest_pipeline_deployment_azure",
+ "workloads/azure/data/getting_started/processing_pipeline_deployment_azure",
],
},
"workloads/azure/data/security_data_azure",