Skip to content

Tutorial for using open source data-diff as part of a PR workflow.

Notifications You must be signed in to change notification settings

datafold/os-diff-pr-tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 

Repository files navigation

Running data-diff to improve your dbt development workflow

Let's walk through the process of using Datafold's open source data-diff tool to make sure the code changes in your development (dev) branch doesn't lead to unexpected data changes in production (prod).

How does data-diff work?

data-diff compares every data value that is generated by your dev branch code and compares those values with the data in prod.

Why is data-diff important?

data-diff catches unexpected data changes that can't be caught by dbt tests.

✨Imagine✨ this scenario:

  • Your SQL code includes a complex case statement that is modified as part of the code update.
  • The logic is confusing for a reviewer, but they think it's good to go. They approve the code update, and it is merged.
  • They didn't realize the status column values for some rows have swapped PENDING AND UNDER REVIEW.
  • There was also an edit to a union statement which caused some rows to disappear.
  • The following issues now exist in production data:
    • Inaccurate values in the status column due to a logic error in the case statement.
    • Missing primary keys.

There's no dbt test that could catch this. But data-diff can.

data-diff alerts you to data changes that:

  • Isn't be immediately clear from reading the code ...
    • ... due to complex or confusing logic.
  • Won't be caught by dbt tests because the values don't violate any rules ...
    • ... such as not_null, unique, or even a custom test.
  • May break your downstream dashboards, data apps, and executive dashboards ...
    • ... which, as data practitioners, is our deepest fear.

Enough preamble. Let's diff.

Intro to Diff

Install data-diff.

We recommend installing data-diff in the same virtual environment that you use when running dbt CLI.

pip install data-diff --pre

Then, install a driver specific to the database you want to connect to. We'll assume you're using Snowflake, but data-diff supports additional databases as well.

pip install 'data-diff[snowflake]'

Run data-diff to compare dev and prod data.

To run data-diff from the command line and compare the data in your prod schema to your dev schema, run this command:

data-diff \
  "snowflake://<USER>:<PASSWORD>@<ACCOUNT>/<DATABASE>/<PROD_SCEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_NAME> \
  <DEV_SCHEMA>.<TABLE_NAME> \
  -k <your_primary_key_column>
  -c <altered_column_1> <altered_column_2> <altered_column_3>
  -m test_results_%t \
  --materialize-all-rows \
  --table-write-limit 10000

A few notes about this command:

  • Everything in carrots (<>) should be replaced by your values, including the carrots.
    • In other words, your command should include no carrots! 🚫🥕
  • Each line beginning with a hyphen represents an option. In the command above, we've included the main options you will likely need:
    • -k specifies the name of your primary key column.
    • -c specifies the other columns you want to inspect for differences
      • To simplify your output and to improve performance, we recommend including only the columns that have altered lines of code in your dev branch.
      • Alternatively, you can include all columns by writing -c %.
      • If you're only investigating whether entire rows are missing (e.g., due to a modification of a UNION statement or a WHERE clause in your SQL model), we suggest omitting -c.
    • -m specifies a prefix that will be used to write the results of your data-diff run to a Diff Results Table into your warehouse.
    • --materialize_all_rows instructs data-diff to write all rows to the Diff Results Table, not only conflicts. This is helpful if you want to query the results and calculate statistics such as the percentage of rows that have a conflicting value in a given column.
    • --table-write-limit limits how many rows are written to the Diff Results Table in the warehouse. The limit is 1000 rows by default. Increase this limit for large tables to ensure your results are complete.
  • There are additional options you can add to your command, which are detailed in the Options section below.

Materialize the Diff Results Table in your warehouse.

Once the Diff Results Table has been written to the warehouse, you can analyze the data using SQL queries. You can also creating visualizations that use your Diff Results Table as source data.

Wait, I have to write a bunch of SQL? How is this better than what I was doing before?

It's totally different! With one very simple query of the Diff Results Table, you can immediately learn something like, "How many values conflict in these columns?" You can also immediately surface the values that are conflicting.

It's far more powerful to compare every value with a data-diff instead of writing ad hoc SQL statements to count rows and generate summaries. And you don't have to write a single join.

Structure of the Diff Results Table

We'll assume you're investigating the primary key, org_id, and the status column, because the dev code contains edits to the status column. We've also included the created_at column in our data-diff command to support our analysis of the results.

Column Name Value
is_exclusive_a true if the value exists only in the prod schema.
is_exclusive_b true if the value exists only in the dev schema.
is_diff_org_id 1 if the value is different between dev and prod; else 0.
is_diff_created_at 1 if the value is different between dev and prod; else 0.
is_diff_status 1 if the value is different between dev and prod; else 0.
org_id_a Prod org_id.
org_id_b Dev org_id.
created_at_a Prod created_at.
created_at_b Dev created_at.
status_a Prod status.
status_b Dev status.

The Diff Results Table has both information about whether the values conflict AND the actual values from the columns you've selected from both the dev and prod schema. This structure gives you a high degree of flexibility to easily investigate row-level value differences and quickly identify the root cause of data conflicts.

Write SQL to analyze Diff Results Table and determine whether your dev branch can be merged into prod.

Once you have the Diff Results Table in your warehouse, you can write SQL to understand how the tables differ.

Writing SQL to get that high value information is easy! It's also extensible if you want to write complex analysis and joins. That's the value you get out of using data-diff instead of doing this manually.

Here are some examples of SQL queries that you can use to interpet the Diff Results Table.

Check for any conflicts between values in dev and prod. This is an easy way to get started analysing your Diff Results Table. If there are no conflicts, you can stop here! 🛑 ✅ 🥳

select 
  sum("is_diff_org_id") as org_id_conflicts,
  sum("is_diff_created_at") as created_at_conflicts,
  sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>;

Screen Shot 2022-11-11 at 12 03 55 PM

Oops. Since the primary key org_id has conflicting values, and it's not a composite key (comprised of multiple columns), we can assume that 123 of the conflicting values in the other columns can also be explained by these 123 org_id conflicts.

We can confirm that with this query:

select 
  sum("is_diff_org_id") as org_id_conflicts,
  sum("is_diff_created_at") as created_at_conflicts,
  sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>
where "is_diff_org_id" = 0
;

Screen Shot 2022-11-11 at 12 16 12 PM

As expected, 123 of the conflicts are explained by missing primary key values. We'll proceed to explore both the missing primary keys as well as the conflicting values in the status column.

Check whether missing primary keys are missing from dev or prod.

select 
    sum(case when "is_exclusive_a" then 1 else 0 end) as "PK in Prod and missing from Dev",
    sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in Dev and missing from Prod"
from <DEV_SCHEMA>.<TEST_RESULTS>;

Screen Shot 2022-11-17 at 9 33 32 PM

It looks like all of the 123 primary keys are missing from the data generated by the dev code. Perhaps you're expecting this, or perhaps it's an error. This is of course something that can only be answered knowing the goal of the code update.

You can easily view which order_id values are present in prod and missing from dev.

select 
    "org_id_a"
from <DEV_SCHEMA>.<TEST_RESULTS>
where "org_id_b" is null;

Screen Shot 2022-11-16 at 10 17 22 AM

This is useful if you want to investigate individual rows. You can also include additional columns in your select statement to see what values exist in the rows that have missing primary keys in the dev data.

Flexibility to join the Diff Results Table with other tables in your warehouse.

Around now, you might notice that this table is structured so that you can easily join to the actual Prod table, or any other table, using the primary key as a join key.

with missing_ids as (
    select 
        "org_id_a"
    from <DEV_SCHEMA>.<TEST_RESULTS>
    where "org_id_b" is null;
)
select 
    *
from <PROD_SCHEMA>.<TABLE_NAME> prod_table
inner join missing_ids on prod_table.order_id = missing_ids.order_id

Explore distribution of conflicting or missing values.

Back to the Diff Results Table. We've established that there are primary keys in prod that are missing from dev. But it's important to understand what's missing. Is it random, or are they grouped in some way?

In this example, we'll use the Diff Results Table to explore whether the missing rows are evenly distributed across created_at values.

select 
  YEAR(to_timestamp("created_at_b")) as year, 
  MONTH(to_timestamp("created_at_b")) as month, 
  sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in prod and missing from dev",
  count(*) as "Total PKs in Prod",
  ROUND(100*sum(case when "is_exclusive_b" then 1 else 0 end)/count(*),1) as "% PK in prod and missing from dev"
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2 
order by 1 desc, 2 desc;

Screen Shot 2022-11-17 at 10 09 19 PM

We see that a much higher percentage of primary keys are missing from the dev data among rows with a created_at date from the summer of 2021.

See the changes that lead to conflicting values. What was the value before, and what is it now?

Now, we want to dig into the status column and understand how the values changed.

Instead of relying only on a human reviewer to figure out the impact of the code change, we can use the out-of-the-box Diff Results Table that we've materialized in the warehouse to fully understand the value-level differences between dev and prod data.

select 
  "status_a" as "Prod status value",
  "status_b" as "Dev status value",
  COUNT(*)
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2
;

Screen Shot 2022-11-17 at 10 59 05 PM

In the Dev status value column, we can clearly see that all the values are either "UNDER REVIEW" or NULL. This is in contrast to the Prod status values, which are either "UNDER REVIEW" or "PENDING".

More specifically, we can see that:

  • 1223 rows were "UNDER REVIEW" in prod data and remained "UNDER REVIEW" in dev data.
  • 43 rows changed from "PENDING" in prod data to "UNDER REVIEW" in dev data.
  • 123 rows changed from having string values of "UNDER REVIEW" or "PENDING" in prod data to being NULL in dev data.
    • These are the primary keys that are missing from dev data.

By materializing the results of our data-diff in the warehouse, we gained insight into data changes on the level of individual values, as well as the ability to quickly summarize the Diff Results Table into meaningful conclusions about the impact the dev code will have on the data.

Once we've completed this analysis, we can make any changes needed to the code, run data-diff again, and finally, ask for a codereview from your teammate.

Here's the best part.

When you ask for a code review, you can say: "Hi, can you please review this PR? I've already run a data-diff, so you don't need to worry about any unexpected impacts on the data."

😌

Command Line Options

Category CLI switch                                        Description
--help Show help message and exit.
Schema -k or --key-columns Name of the primary key column. If none provided, default is 'id'.
Schema -t or --update-column Name of updated_at/last_updated column.
Schema -c or --columns Names of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern, like in SQL. Example: -c col% -c another_col -c %foorb.r%
Filtering --min-age Considers only rows older than specified. Useful for specifying replication lag. Example: --min-age=5min ignores rows from the last 5 minutes. Valid units: d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years
Filtering --max-age Considers only rows younger than specified. See --min-age.
Filtering -w, --where An additional 'where' expression to restrict the search space.
Performance -l or --limit Maximum number of differences to find (limits maximum bandwidth and runtime).
Performance -j or --threads Number of worker threads to use per database. Default=1.
Output -d or --debug Print debug info.
Output -i or --interactive Confirm queries, implies --debug
Output -v or --verbose Print extra info.
Output --materialize-all-rows Materialize every row, even if they are the same, instead of just the differing rows.
Output -m, --materialize Materialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use %t in the name to place a timestamp. Example: -m test_mat_%t
Output --table-write-limit Maximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.
Settings --no-tracking data-diff sends home anonymous usage data. Use this to disable it.

About

Tutorial for using open source data-diff as part of a PR workflow.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published