aws-samples/aws-glue-data-catalog-replication-utility

AWS Glue Data Catalog Replication Utility

This utility replicates the AWS Glue Data Catalog from one AWS account to another. With it, you can replicate databases, tables, and partitions from one source AWS account to one or more target AWS accounts. It uses the AWS Glue APIs through the AWS SDK for Java, together with serverless technologies such as AWS Lambda, Amazon SQS, and Amazon SNS. The architecture of this utility is shown in the following diagram.

[Figure: architecture diagram]

Automated Deployment

Follow the instructions in this README.md to deploy this utility through AWS CloudFormation in your AWS accounts. Otherwise, follow the guide below for a manual deployment.

Build Instructions

  1. The source code is a Maven project. Build it with standard Maven commands, e.g. mvn -X clean install, or use the build options available in your IDE.
  2. The build generates a Jar file, e.g. aws-glue-data-catalog-replication-utility-1.0.0.jar

AWS Service Requirements

This utility requires the following AWS services:

Source Account

  • 3 AWS Lambda functions
  • 3 Amazon DynamoDB tables
  • 2 Amazon SNS Topics
  • 1 Amazon SQS Queue
  • 1 Amazon S3 Bucket

Each Target Account

  • 3 AWS Lambda functions
  • 2 Amazon DynamoDB tables
  • 2 Amazon SQS Queues

Lambda Functions Overview

Class | Purpose
GDCReplicationPlannerLambda | Lambda function that determines the list of databases to export. It is the driver program that initiates the replication process.
ExportLambda | Lambda function to export databases and tables.
ExportLargeTableLambda | Lambda function to export large tables (tables with more than 10 partitions).
ImportLambda | Lambda function to import databases and tables.
ImportLargeTableLambda | Lambda function to import large tables.
DLQProcessorLambda | Lambda function used to process errors generated by ImportLambda.
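
The split between the regular and the large-table export path can be pictured with a small sketch. It only encodes the threshold stated in the table above (more than 10 partitions). The class name ExportRouter, the Route enum, and the idea that the decision is made in a single helper are assumptions for illustration and are not taken from the utility's source code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the utility's code base. */
final class ExportRouter {
    /** Threshold from the table above: more than 10 partitions makes a table "large". */
    static final int LARGE_TABLE_PARTITION_THRESHOLD = 10;

    /** Assumed names for the two export paths. */
    enum Route { REGULAR_EXPORT, LARGE_TABLE_EXPORT }

    /** Picks the export path for a table based on its partition count. */
    static Route routeFor(int partitionCount) {
        if (partitionCount < 0) {
            throw new IllegalArgumentException("partitionCount must be >= 0");
        }
        return partitionCount > LARGE_TABLE_PARTITION_THRESHOLD
                ? Route.LARGE_TABLE_EXPORT
                : Route.REGULAR_EXPORT;
    }

    public static void main(String[] args) {
        // Made-up table names and partition counts.
        Map<String, Integer> partitionCounts = new LinkedHashMap<>();
        partitionCounts.put("orders", 3);
        partitionCounts.put("clickstream", 250);
        for (Map.Entry<String, Integer> e : partitionCounts.entrySet()) {
            System.out.println(e.getKey() + " -> " + routeFor(e.getValue()));
        }
    }
}
```

A table with exactly 10 partitions stays on the regular path, because the wording is "more than 10".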

Deployment Instructions - Source Account

  1. Create DynamoDB tables as defined in the following table

    Table | Purpose | Schema | Capacity
    glue_database_export_task | audit data for replication planner | Partition key - db_id (String), Sort key - export_run_id (Number) | On-Demand
    db_status | audit data for databases exported | Partition key - db_id (String), Sort key - export_run_id (Number) | On-Demand
    table_status | audit data for tables exported | Partition key - table_id (String), Sort key - export_run_id (Number) | On-Demand
  2. Create two SNS Topics

    1. Topic 1: Name = e.g. ReplicationPlannerSNSTopic
    2. Topic 2: Name = e.g. SchemaDistributionSNSTopic
  3. Create an S3 Bucket. It is used to save partitions for large tables (partitions > 10). This bucket must provide cross-account permissions to the IAM roles used by the ImportLargeTableLambda function in the Target Account. Refer to the following AWS resources for more details:

    1. https://aws.amazon.com/premiumsupport/knowledge-center/cross-account-access-s3/
    2. https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
  4. Create one SQS Queue

    1. Queue Name = e.g. LargeTableSQSQueue
    2. Queue Type = Standard
    3. Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: It must be higher than the execution timeout of the ExportLargeTableLambda function
  5. Create a Lambda execution IAM role and attach it to the Lambda functions deployed in the Source Account. This role needs multiple permissions. Refer to the following IAM policies for the required permissions:

    1. You can use AWS managed policy called AWSLambdaExecute (Policy ARN # arn:aws:iam::aws:policy/AWSLambdaExecute)
    2. sample_sqs_policy_source_and_target_accounts
    3. sample_sns_policy_source_account
    4. sample_glue_policy_source_account
    5. sample_ddb_policy_source_and_target_accounts
  6. Deploy GDCReplicationPlannerLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.GDCReplicationPlanner
    4. Timeout = e.g. 5 minutes
    5. Memory = e.g. 128 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    source_glue_catalog_id | Source AWS Account Id
    ddb_name_gdc_replication_planner | Name of the DDB Table for glue_database_export_task of source account
    database_prefix_list | List of database prefixes separated by a token, e.g. raw_data_,processed_data_. To export all databases, do not add this variable.
    separator | The separator used in the database_prefix_list, e.g. a comma (,). This can be skipped when database_prefix_list is not added.
    region | e.g. us-east-1
    sns_topic_arn_gdc_replication_planner | SNS Topic ARN for ReplicationPlannerSNSTopic
  7. Deploy ExportLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportDatabaseWithTables
    4. Timeout = e.g. 5 minutes
    5. Memory = e.g. 192 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    source_glue_catalog_id | Source AWS Account Id
    ddb_name_db_export_status | Name of the DDB Table for db_status of source account
    ddb_name_table_export_status | Name of the DDB Table for table_status of source account
    region | e.g. us-east-1
    sns_topic_arn_export_dbs_tables | SNS Topic ARN for SchemaDistributionSNSTopic
    sqs_queue_url_large_tables | SQS Queue URL for LargeTableSQSQueue
  8. Add ReplicationPlannerSNSTopic as a trigger to ExportLambda function

  9. Deploy ExportLargeTableLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.ExportLargeTable
    4. Timeout = e.g. 3 minutes
    5. Memory = e.g. 256 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    s3_bucket_name | Name of the S3 Bucket used to save partitions for large tables
    ddb_name_table_export_status | Name of the DDB Table for table_status of source account
    region | e.g. us-east-1
    sns_topic_arn_export_dbs_tables | SNS Topic ARN for SchemaDistributionSNSTopic
  10. Add LargeTableSQSQueue as a trigger to ExportLargeTableLambda function

    1. Batch size = 1
  11. Set up cross-account permissions in the Source Account. Grant the Target Account permission to subscribe to the second SNS Topic (SchemaDistributionSNSTopic):

    aws sns add-permission --label lambda-access --aws-account-id TargetAccount \
    --topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
    --action-name Subscribe ListSubscriptionsByTopic Receive
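
The database_prefix_list and separator variables described for GDCReplicationPlannerLambda can be illustrated with a minimal sketch. It assumes that a database is selected when its name starts with any configured prefix, that matching is case-sensitive, and that an absent list selects every database. The class and method names are hypothetical and the utility's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the utility's code base. */
final class DatabasePrefixFilter {

    /** Splits the value of database_prefix_list using the configured separator. */
    static List<String> parsePrefixes(String prefixList, String separator) {
        List<String> prefixes = new ArrayList<>();
        if (prefixList == null || prefixList.trim().isEmpty()) {
            return prefixes; // variable not set: no filtering
        }
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator is required when database_prefix_list is set");
        }
        // Pattern.quote treats separators such as "|" or "." literally.
        for (String token : prefixList.split(Pattern.quote(separator))) {
            String prefix = token.trim();
            if (!prefix.isEmpty()) {
                prefixes.add(prefix);
            }
        }
        return prefixes;
    }

    /** Keeps databases whose name starts with any prefix; keeps all when no prefixes are given. */
    static List<String> selectDatabases(List<String> databaseNames, List<String> prefixes) {
        if (prefixes.isEmpty()) {
            return new ArrayList<>(databaseNames);
        }
        List<String> selected = new ArrayList<>();
        for (String name : databaseNames) {
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    selected.add(name);
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // In a Lambda function these values would come from System.getenv(...).
        List<String> prefixes = parsePrefixes("raw_data_,processed_data_", ",");
        List<String> all = Arrays.asList("raw_data_sales", "processed_data_sales", "scratch");
        System.out.println(selectDatabases(all, prefixes));
    }
}
```

With the sample values above, only raw_data_sales and processed_data_sales are selected; the database named scratch is left out.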
    

Deployment Instructions - Target Account

  1. Create DynamoDB tables as defined in the following table

    Table | Purpose | Schema | Capacity
    db_status | audit data for databases imported | Partition key - db_id (String), Sort key - import_run_id (Number) | On-Demand
    table_status | audit data for tables imported | Partition key - table_id (String), Sort key - import_run_id (Number) | On-Demand
  2. Create SQS Queue

    1. Queue Name = LargeTableSQSQueue
    2. Queue Type = Standard
    3. Default Visibility Timeout = e.g. 3 minutes 15 seconds. Note: It must be higher than the execution timeout of the ImportLargeTableLambda function
  3. Create SQS Queue - dead letter queue processing

    1. Queue Name = DeadLetterQueue
    2. Queue Type = Standard
    3. Default Visibility Timeout = e.g. 3 minutes 15 seconds
  4. Create a Lambda execution IAM role and attach it to the Lambda functions deployed in the Target Account. This role needs multiple permissions. Refer to the following IAM policies for the required permissions:

    1. You can use AWS managed policy called AWSLambdaExecute (Policy ARN # arn:aws:iam::aws:policy/AWSLambdaExecute)
    2. sample_sqs_policy_source_and_target_accounts
    3. sample_glue_policy_target_account
    4. sample_ddb_policy_source_and_target_accounts
  5. Deploy ImportLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportDatabaseOrTable
    4. Timeout = e.g. 5 minutes
    5. Memory = e.g. 192 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    target_glue_catalog_id | Target AWS Account Id
    ddb_name_db_import_status | Name of the DDB Table for db_status of target account
    ddb_name_table_import_status | Name of the DDB Table for table_status of target account
    skip_archive | true
    region | e.g. us-east-1
    sqs_queue_url_large_tables | SQS Queue URL for LargeTableSQSQueue
    dlq_url_sqs | SQS Queue URL for DeadLetterQueue
  6. Give SchemaDistributionSNSTopic permission to invoke the ImportLambda function

    aws lambda add-permission --function-name ImportLambda \
    --source-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
    --statement-id sns-x-account --action "lambda:InvokeFunction" \
    --principal sns.amazonaws.com
    
  7. Subscribe ImportLambda function to SchemaDistributionSNSTopic

    aws sns subscribe --protocol lambda \
    --topic-arn arn:aws:sns:us-east-1:SourceAccount:SchemaDistributionSNSTopic \
    --notification-endpoint arn:aws:lambda:us-east-1:TargetAccount:function:ImportLambda
    


  8. Deploy ImportLargeTableLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.ImportLargeTable
    4. Timeout = e.g. 3 minutes
    5. Memory = e.g. 256 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    target_glue_catalog_id | Target AWS Account Id
    ddb_name_table_import_status | Name of the DDB Table for table_status of target account
    skip_archive | true
    region | e.g. us-east-1
  9. Add LargeTableSQSQueue as a trigger to ImportLargeTableLambda function

    1. Batch size = 1
  10. Deploy DLQProcessorLambda function

    1. Runtime = Java 8
    2. Function package = Use the Jar file generated. Refer to the Build Instructions section
    3. Lambda Handler = com.amazonaws.gdcreplication.lambda.DLQImportDatabaseOrTable
    4. Timeout = e.g. 3 minutes
    5. Memory = e.g. 192 MB
    6. Environment variables = as defined in the following table
    Variable Name | Variable Value
    target_glue_catalog_id | Target AWS Account Id
    ddb_name_db_import_status | Name of the DDB Table for db_status of target account
    ddb_name_table_import_status | Name of the DDB Table for table_status of target account
    skip_archive | true
    dlq_url_sqs | SQS Queue URL for DeadLetterQueue
    region | e.g. us-east-1
  11. Add DeadLetterQueue as a trigger to the DLQProcessorLambda function

    1. Batch size = 1
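
Both deployment lists note that a queue's Default Visibility Timeout must be higher than the timeout of the Lambda function that consumes it. Otherwise a message can become visible again while the function is still working on it and be processed a second time. The check below is a minimal sketch of that rule. The class name VisibilityTimeoutCheck is hypothetical, and the durations are the example values from this guide.

```java
import java.time.Duration;

/** Hypothetical pre-deployment check for illustration; not part of the utility's code base. */
final class VisibilityTimeoutCheck {

    /** Returns true when the queue's visibility timeout is strictly longer than the Lambda timeout. */
    static boolean isSafe(Duration queueVisibilityTimeout, Duration lambdaTimeout) {
        return queueVisibilityTimeout.compareTo(lambdaTimeout) > 0;
    }

    public static void main(String[] args) {
        // Example values from this guide: 3 minutes 15 seconds for the queue,
        // 3 minutes for the large-table Lambda functions.
        Duration visibility = Duration.ofMinutes(3).plusSeconds(15);
        Duration lambdaTimeout = Duration.ofMinutes(3);
        System.out.println("Configuration is safe: " + isSafe(visibility, lambdaTimeout));
    }
}
```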

Advantages

This solution was designed around three main tenets: simplicity, scalability, and cost-effectiveness. The following are direct benefits:

  1. Target AWS accounts are independent, allowing the solution to scale efficiently.
  2. After each replication run, the target accounts see the latest table information.
  3. The solution is lightweight and dependable at scale.
  4. The implementation is fully customizable.

Limitations

The following are the primary limitations:

  1. This utility is NOT intended for real-time replication. Refer to the section Use Case 2: Ongoing replication to learn how to run the replication process as a scheduled job.
  2. This utility is NOT intended for two-way replication between AWS Accounts.
  3. This utility does NOT attempt to resolve database and table name conflicts, which may result in undesirable behavior.

Applicable Use Cases

Use Case 1: One-time replication

To do this, run the GDCReplicationPlannerLambda function using a Test event in the AWS Lambda console.

Use Case 2: Ongoing replication

To do this, create a CloudWatch Event Rule in the Source Account and add GDCReplicationPlannerLambda as its target. Refer to the following AWS documentation for more details:

  1. Schedule Expressions for Rules
  2. Tutorial: Schedule AWS Lambda Functions Using CloudWatch Events
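
A scheduled rule needs a schedule expression, and one common form is rate(value unit). The helper below is a small sketch that builds such a string. It uses the singular unit when the value is 1 and the plural otherwise, which is what the rate syntax expects. The class name ScheduleExpressions is hypothetical, and the replication frequencies shown are examples only, not recommendations from this guide.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the utility's code base. */
final class ScheduleExpressions {

    /** Units accepted by rate expressions. */
    enum Unit { MINUTE, HOUR, DAY }

    /** Builds a rate expression such as "rate(1 day)" or "rate(6 hours)". */
    static String rate(int value, Unit unit) {
        if (value < 1) {
            throw new IllegalArgumentException("value must be a positive integer");
        }
        String name = unit.name().toLowerCase(Locale.ROOT);
        // Singular unit for a value of 1, plural for anything larger.
        return "rate(" + value + " " + (value == 1 ? name : name + "s") + ")";
    }

    public static void main(String[] args) {
        System.out.println(rate(1, Unit.DAY));
        System.out.println(rate(6, Unit.HOUR));
    }
}
```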

Replication Mechanism in Target Account

For databases and tables, the actions taken by the import Lambda functions depend on the state of the Glue Data Catalog in the target account. Those actions are summarized in the following table.

Input Message Type | State of Target Glue Data Catalog | Action Taken in Target Glue Data Catalog
Database | Database already exists | Skip the message
Database | Database does not exist | Create Database
Table | Table already exists | Update Table
Table | Table does not exist | Create Table
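
The four rows above amount to a small decision function. The sketch below encodes them directly. The class, enum, and method names are hypothetical, and the real import Lambda functions determine whether an object exists by calling the Glue APIs, which is not shown here.

```java
import java.util.Objects;

/** Hypothetical encoding of the table above; not part of the utility's code base. */
final class CatalogImportRules {

    enum MessageType { DATABASE, TABLE }

    enum Action { SKIP_MESSAGE, CREATE_DATABASE, UPDATE_TABLE, CREATE_TABLE }

    /** Maps the message type and the state of the target catalog to the action to take. */
    static Action decide(MessageType type, boolean existsInTarget) {
        Objects.requireNonNull(type, "type");
        if (type == MessageType.DATABASE) {
            // Existing databases are left untouched.
            return existsInTarget ? Action.SKIP_MESSAGE : Action.CREATE_DATABASE;
        }
        // Existing tables are updated so the target reflects the exported definition.
        return existsInTarget ? Action.UPDATE_TABLE : Action.CREATE_TABLE;
    }

    public static void main(String[] args) {
        for (MessageType type : MessageType.values()) {
            for (boolean exists : new boolean[] {true, false}) {
                System.out.println(type + ", exists=" + exists + " -> " + decide(type, exists));
            }
        }
    }
}
```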

For partitions, the actions are summarized in the following table:

Partitions in Export | State in Target Glue Data Catalog | Action Taken in Target Account
Partitions DO NOT exist | Target Table has no partitions | No action taken
Partitions DO NOT exist | Target Table has partitions | Delete current partitions
Partitions exist | Target Table has no partitions | Create new partitions
Partitions exist | Target Table has partitions | Delete current partitions, create new partitions
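
Read together, the partition rows describe a replace-all strategy: whatever partitions the target table has are deleted, and whatever partitions the export contains are created. The sketch below derives the ordered steps from the two conditions. The names PartitionSyncRules, Step, and plan are hypothetical, and the sketch only plans the steps without calling any Glue API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical encoding of the partition table above; not part of the utility's code base. */
final class PartitionSyncRules {

    enum Step { DELETE_CURRENT_PARTITIONS, CREATE_NEW_PARTITIONS }

    /** Returns the ordered steps for one table; an empty list means no action is taken. */
    static List<Step> plan(boolean exportHasPartitions, boolean targetHasPartitions) {
        List<Step> steps = new ArrayList<>();
        if (targetHasPartitions) {
            steps.add(Step.DELETE_CURRENT_PARTITIONS); // existing partitions are always removed first
        }
        if (exportHasPartitions) {
            steps.add(Step.CREATE_NEW_PARTITIONS); // then the exported partitions are created
        }
        return Collections.unmodifiableList(steps);
    }

    public static void main(String[] args) {
        boolean[] states = {false, true};
        for (boolean exportHas : states) {
            for (boolean targetHas : states) {
                System.out.println("export=" + exportHas + ", target=" + targetHas
                        + " -> " + plan(exportHas, targetHas));
            }
        }
    }
}
```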

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.