Let's walk through the process of using Datafold's open source data-diff
tool to make sure the code changes in your development (dev) branch doesn't lead to unexpected data changes in production (prod).
data-diff
compares every data value that is generated by your dev branch code and compares those values with the data in prod.
data-diff
catches unexpected data changes that can't be caught by dbt tests.
✨Imagine✨ this scenario:
- Your SQL code includes a complex case statement that is modified as part of the code update.
- The logic is confusing for a reviewer, but they think it's good to go. They approve the code update, and it is merged.
- They didn't realize the
status
column values for some rows have swappedPENDING
ANDUNDER REVIEW
. - There was also an edit to a
union
statement which caused some rows to disappear. - The following issues now exist in production data:
- Inaccurate values in the
status
column due to a logic error in the case statement. - Missing primary keys.
- Inaccurate values in the
There's no dbt test that could catch this. But data-diff
can.
data-diff
alerts you to data changes that:
- Isn't be immediately clear from reading the code ...
- ... due to complex or confusing logic.
- Won't be caught by dbt tests because the values don't violate any rules ...
- ... such as
not_null
,unique
, or even a custom test.
- ... such as
- May break your downstream dashboards, data apps, and executive dashboards ...
- ... which, as data practitioners, is our deepest fear.
We recommend installing data-diff
in the same virtual environment that you use when running dbt CLI.
pip install data-diff --pre
Then, install a driver specific to the database you want to connect to. We'll assume you're using Snowflake, but data-diff
supports additional databases as well.
pip install 'data-diff[snowflake]'
To run data-diff
from the command line and compare the data in your prod schema to your dev schema, run this command:
data-diff \
"snowflake://<USER>:<PASSWORD>@<ACCOUNT>/<DATABASE>/<PROD_SCEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_NAME> \
<DEV_SCHEMA>.<TABLE_NAME> \
-k <your_primary_key_column>
-c <altered_column_1> <altered_column_2> <altered_column_3>
-m test_results_%t \
--materialize-all-rows \
--table-write-limit 10000
A few notes about this command:
- Everything in carrots (
<>
) should be replaced by your values, including the carrots.- In other words, your command should include no carrots! 🚫🥕
- Each line beginning with a hyphen represents an option. In the command above, we've included the main options you will likely need:
-k
specifies the name of your primary key column.-c
specifies the other columns you want to inspect for differences- To simplify your output and to improve performance, we recommend including only the columns that have altered lines of code in your dev branch.
- Alternatively, you can include all columns by writing
-c %
. - If you're only investigating whether entire rows are missing (e.g., due to a modification of a
UNION
statement or aWHERE
clause in your SQL model), we suggest omitting-c
.
-m
specifies a prefix that will be used to write the results of yourdata-diff
run to a Diff Results Table into your warehouse.--materialize_all_rows
instructsdata-diff
to write all rows to the Diff Results Table, not only conflicts. This is helpful if you want to query the results and calculate statistics such as the percentage of rows that have a conflicting value in a given column.--table-write-limit
limits how many rows are written to the Diff Results Table in the warehouse. The limit is 1000 rows by default. Increase this limit for large tables to ensure your results are complete.
- There are additional options you can add to your command, which are detailed in the Options section below.
Once the Diff Results Table has been written to the warehouse, you can analyze the data using SQL queries. You can also creating visualizations that use your Diff Results Table as source data.
Wait, I have to write a bunch of SQL? How is this better than what I was doing before?
It's totally different! With one very simple query of the Diff Results Table, you can immediately learn something like, "How many values conflict in these columns?" You can also immediately surface the values that are conflicting.
It's far more powerful to compare every value with a data-diff
instead of writing ad hoc SQL statements to count rows and generate summaries. And you don't have to write a single join.
Structure of the Diff Results Table
We'll assume you're investigating the primary key, org_id
, and the status
column, because the dev code contains edits to the status
column. We've also included the created_at
column in our data-diff
command to support our analysis of the results.
Column Name | Value |
---|---|
is_exclusive_a | true if the value exists only in the prod schema. |
is_exclusive_b | true if the value exists only in the dev schema. |
is_diff_org_id | 1 if the value is different between dev and prod; else 0 . |
is_diff_created_at | 1 if the value is different between dev and prod; else 0 . |
is_diff_status | 1 if the value is different between dev and prod; else 0 . |
org_id_a | Prod org_id . |
org_id_b | Dev org_id . |
created_at_a | Prod created_at . |
created_at_b | Dev created_at . |
status_a | Prod status . |
status_b | Dev status . |
The Diff Results Table has both information about whether the values conflict AND the actual values from the columns you've selected from both the dev and prod schema. This structure gives you a high degree of flexibility to easily investigate row-level value differences and quickly identify the root cause of data conflicts.
Write SQL to analyze Diff Results Table and determine whether your dev branch can be merged into prod.
Once you have the Diff Results Table in your warehouse, you can write SQL to understand how the tables differ.
Writing SQL to get that high value information is easy! It's also extensible if you want to write complex analysis and joins. That's the value you get out of using data-diff
instead of doing this manually.
Check for any conflicts between values in dev and prod. This is an easy way to get started analysing your Diff Results Table. If there are no conflicts, you can stop here! 🛑 ✅ 🥳
select
sum("is_diff_org_id") as org_id_conflicts,
sum("is_diff_created_at") as created_at_conflicts,
sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>;
Oops. Since the primary key org_id
has conflicting values, and it's not a composite key (comprised of multiple columns), we can assume that 123 of the conflicting values in the other columns can also be explained by these 123 org_id
conflicts.
We can confirm that with this query:
select
sum("is_diff_org_id") as org_id_conflicts,
sum("is_diff_created_at") as created_at_conflicts,
sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>
where "is_diff_org_id" = 0
;
As expected, 123 of the conflicts are explained by missing primary key values. We'll proceed to explore both the missing primary keys as well as the conflicting values in the status
column.
Check whether missing primary keys are missing from dev or prod.
select
sum(case when "is_exclusive_a" then 1 else 0 end) as "PK in Prod and missing from Dev",
sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in Dev and missing from Prod"
from <DEV_SCHEMA>.<TEST_RESULTS>;
It looks like all of the 123 primary keys are missing from the data generated by the dev code. Perhaps you're expecting this, or perhaps it's an error. This is of course something that can only be answered knowing the goal of the code update.
You can easily view which order_id
values are present in prod and missing from dev.
select
"org_id_a"
from <DEV_SCHEMA>.<TEST_RESULTS>
where "org_id_b" is null;
This is useful if you want to investigate individual rows. You can also include additional columns in your select
statement to see what values exist in the rows that have missing primary keys in the dev data.
Around now, you might notice that this table is structured so that you can easily join to the actual Prod table, or any other table, using the primary key as a join key.
with missing_ids as (
select
"org_id_a"
from <DEV_SCHEMA>.<TEST_RESULTS>
where "org_id_b" is null;
)
select
*
from <PROD_SCHEMA>.<TABLE_NAME> prod_table
inner join missing_ids on prod_table.order_id = missing_ids.order_id
Back to the Diff Results Table. We've established that there are primary keys in prod that are missing from dev. But it's important to understand what's missing. Is it random, or are they grouped in some way?
In this example, we'll use the Diff Results Table to explore whether the missing rows are evenly distributed across created_at
values.
select
YEAR(to_timestamp("created_at_b")) as year,
MONTH(to_timestamp("created_at_b")) as month,
sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in prod and missing from dev",
count(*) as "Total PKs in Prod",
ROUND(100*sum(case when "is_exclusive_b" then 1 else 0 end)/count(*),1) as "% PK in prod and missing from dev"
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2
order by 1 desc, 2 desc;
We see that a much higher percentage of primary keys are missing from the dev data among rows with a created_at
date from the summer of 2021.
Now, we want to dig into the status
column and understand how the values changed.
Instead of relying only on a human reviewer to figure out the impact of the code change, we can use the out-of-the-box Diff Results Table that we've materialized in the warehouse to fully understand the value-level differences between dev and prod data.
select
"status_a" as "Prod status value",
"status_b" as "Dev status value",
COUNT(*)
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2
;
In the Dev status value
column, we can clearly see that all the values are either "UNDER REVIEW" or NULL. This is in contrast to the Prod status values, which are either "UNDER REVIEW" or "PENDING".
More specifically, we can see that:
- 1223 rows were "UNDER REVIEW" in prod data and remained "UNDER REVIEW" in dev data.
- 43 rows changed from "PENDING" in prod data to "UNDER REVIEW" in dev data.
- 123 rows changed from having string values of "UNDER REVIEW" or "PENDING" in prod data to being
NULL
in dev data.- These are the primary keys that are missing from dev data.
By materializing the results of our data-diff
in the warehouse, we gained insight into data changes on the level of individual values, as well as the ability to quickly summarize the Diff Results Table into meaningful conclusions about the impact the dev code will have on the data.
Once we've completed this analysis, we can make any changes needed to the code, run data-diff
again, and finally, ask for a codereview from your teammate.
When you ask for a code review, you can say: "Hi, can you please review this PR? I've already run a data-diff
, so you don't need to worry about any unexpected impacts on the data."
😌
Category | CLI switch | Description |
---|---|---|
--help |
Show help message and exit. | |
Schema | -k or --key-columns |
Name of the primary key column. If none provided, default is 'id'. |
Schema | -t or --update-column |
Name of updated_at/last_updated column. |
Schema | -c or --columns |
Names of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern, like in SQL. Example: -c col% -c another_col -c %foorb.r% |
Filtering | --min-age |
Considers only rows older than specified. Useful for specifying replication lag. Example: --min-age=5min ignores rows from the last 5 minutes. Valid units: d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years |
Filtering | --max-age |
Considers only rows younger than specified. See --min-age . |
Filtering | -w , --where |
An additional 'where' expression to restrict the search space. |
Performance | -l or --limit |
Maximum number of differences to find (limits maximum bandwidth and runtime). |
Performance | -j or --threads |
Number of worker threads to use per database. Default=1. |
Output | -d or --debug |
Print debug info. |
Output | -i or --interactive |
Confirm queries, implies --debug |
Output | -v or --verbose |
Print extra info. |
Output | --materialize-all-rows |
Materialize every row, even if they are the same, instead of just the differing rows. |
Output | -m , --materialize |
Materialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use %t in the name to place a timestamp. Example: -m test_mat_%t |
Output | --table-write-limit |
Maximum number of rows to write when creating materialized or sample tables, per thread. Default=1000. |
Settings | --no-tracking |
data-diff sends home anonymous usage data. Use this to disable it. |