Running `data-diff` to improve your dbt development workflow

Let's walk through the process of using Datafold's open source data-diff tool to make sure the code changes in your development (dev) branch doesn't lead to unexpected data changes in production (prod).

How does `data-diff` work?

data-diff compares every data value that is generated by your dev branch code and compares those values with the data in prod.

Why is `data-diff` important?

data-diff catches unexpected data changes that can't be caught by dbt tests.

✨Imagine✨ this scenario:

Your SQL code includes a complex case statement that is modified as part of the code update.
The logic is confusing for a reviewer, but they think it's good to go. They approve the code update, and it is merged.
They didn't realize the status column values for some rows have swapped PENDING AND UNDER REVIEW.
There was also an edit to a union statement which caused some rows to disappear.
The following issues now exist in production data:
- Inaccurate values in the status column due to a logic error in the case statement.
- Missing primary keys.

There's no dbt test that could catch this. But data-diff can.

data-diff alerts you to data changes that:

Isn't be immediately clear from reading the code ...
- ... due to complex or confusing logic.
Won't be caught by dbt tests because the values don't violate any rules ...
- ... such as not_null, unique, or even a custom test.
May break your downstream dashboards, data apps, and executive dashboards ...
- ... which, as data practitioners, is our deepest fear.

Enough preamble. Let's diff.

Install `data-diff`.

We recommend installing data-diff in the same virtual environment that you use when running dbt CLI.

pip install data-diff --pre

Then, install a driver specific to the database you want to connect to. We'll assume you're using Snowflake, but data-diff supports additional databases as well.

pip install 'data-diff[snowflake]'

Run `data-diff` to compare dev and prod data.

To run data-diff from the command line and compare the data in your prod schema to your dev schema, run this command:

data-diff \
  "snowflake://<USER>:<PASSWORD>@<ACCOUNT>/<DATABASE>/<PROD_SCEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_NAME> \
  <DEV_SCHEMA>.<TABLE_NAME> \
  -k <your_primary_key_column>
  -c <altered_column_1> <altered_column_2> <altered_column_3>
  -m test_results_%t \
  --materialize-all-rows \
  --table-write-limit 10000

A few notes about this command:

Everything in carrots (<>) should be replaced by your values, including the carrots.
- In other words, your command should include no carrots! 🚫🥕
Each line beginning with a hyphen represents an option. In the command above, we've included the main options you will likely need:
- -k specifies the name of your primary key column.
- -c specifies the other columns you want to inspect for differences
  - To simplify your output and to improve performance, we recommend including only the columns that have altered lines of code in your dev branch.
  - Alternatively, you can include all columns by writing -c %.
  - If you're only investigating whether entire rows are missing (e.g., due to a modification of a UNION statement or a WHERE clause in your SQL model), we suggest omitting -c.
- -m specifies a prefix that will be used to write the results of your data-diff run to a Diff Results Table into your warehouse.
- --materialize_all_rows instructs data-diff to write all rows to the Diff Results Table, not only conflicts. This is helpful if you want to query the results and calculate statistics such as the percentage of rows that have a conflicting value in a given column.
- --table-write-limit limits how many rows are written to the Diff Results Table in the warehouse. The limit is 1000 rows by default. Increase this limit for large tables to ensure your results are complete.
There are additional options you can add to your command, which are detailed in the Options section below.

Materialize the Diff Results Table in your warehouse.

Once the Diff Results Table has been written to the warehouse, you can analyze the data using SQL queries. You can also creating visualizations that use your Diff Results Table as source data.

Wait, I have to write a bunch of SQL? How is this better than what I was doing before?

It's totally different! With one very simple query of the Diff Results Table, you can immediately learn something like, "How many values conflict in these columns?" You can also immediately surface the values that are conflicting.

It's far more powerful to compare every value with a data-diff instead of writing ad hoc SQL statements to count rows and generate summaries. And you don't have to write a single join.

Structure of the Diff Results Table

We'll assume you're investigating the primary key, org_id, and the status column, because the dev code contains edits to the status column. We've also included the created_at column in our data-diff command to support our analysis of the results.

Column Name	Value
is_exclusive_a	`true` if the value exists only in the prod schema.
is_exclusive_b	`true` if the value exists only in the dev schema.
is_diff_org_id	`1` if the value is different between dev and prod; else `0`.
is_diff_created_at	`1` if the value is different between dev and prod; else `0`.
is_diff_status	`1` if the value is different between dev and prod; else `0`.
org_id_a	Prod `org_id`.
org_id_b	Dev `org_id`.
created_at_a	Prod `created_at`.
created_at_b	Dev `created_at`.
status_a	Prod `status`.
status_b	Dev `status`.

The Diff Results Table has both information about whether the values conflict AND the actual values from the columns you've selected from both the dev and prod schema. This structure gives you a high degree of flexibility to easily investigate row-level value differences and quickly identify the root cause of data conflicts.

Write SQL to analyze Diff Results Table and determine whether your dev branch can be merged into prod.

Once you have the Diff Results Table in your warehouse, you can write SQL to understand how the tables differ.

Writing SQL to get that high value information is easy! It's also extensible if you want to write complex analysis and joins. That's the value you get out of using data-diff instead of doing this manually.

Here are some examples of SQL queries that you can use to interpet the Diff Results Table.

Check for any conflicts between values in dev and prod. This is an easy way to get started analysing your Diff Results Table. If there are no conflicts, you can stop here! 🛑 ✅ 🥳

select 
  sum("is_diff_org_id") as org_id_conflicts,
  sum("is_diff_created_at") as created_at_conflicts,
  sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>;

Oops. Since the primary key org_id has conflicting values, and it's not a composite key (comprised of multiple columns), we can assume that 123 of the conflicting values in the other columns can also be explained by these 123 org_id conflicts.

We can confirm that with this query:

select 
  sum("is_diff_org_id") as org_id_conflicts,
  sum("is_diff_created_at") as created_at_conflicts,
  sum("is_diff_status") as status_conflicts
from <DEV_SCHEMA>.<TEST_RESULTS>
where "is_diff_org_id" = 0
;

As expected, 123 of the conflicts are explained by missing primary key values. We'll proceed to explore both the missing primary keys as well as the conflicting values in the status column.

Check whether missing primary keys are missing from dev or prod.

select 
    sum(case when "is_exclusive_a" then 1 else 0 end) as "PK in Prod and missing from Dev",
    sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in Dev and missing from Prod"
from <DEV_SCHEMA>.<TEST_RESULTS>;

It looks like all of the 123 primary keys are missing from the data generated by the dev code. Perhaps you're expecting this, or perhaps it's an error. This is of course something that can only be answered knowing the goal of the code update.

You can easily view which order_id values are present in prod and missing from dev.

select 
    "org_id_a"
from <DEV_SCHEMA>.<TEST_RESULTS>
where "org_id_b" is null;

This is useful if you want to investigate individual rows. You can also include additional columns in your select statement to see what values exist in the rows that have missing primary keys in the dev data.

Flexibility to join the Diff Results Table with other tables in your warehouse.

Around now, you might notice that this table is structured so that you can easily join to the actual Prod table, or any other table, using the primary key as a join key.

with missing_ids as (
    select 
        "org_id_a"
    from <DEV_SCHEMA>.<TEST_RESULTS>
    where "org_id_b" is null;
)
select 
    *
from <PROD_SCHEMA>.<TABLE_NAME> prod_table
inner join missing_ids on prod_table.order_id = missing_ids.order_id

Explore distribution of conflicting or missing values.

Back to the Diff Results Table. We've established that there are primary keys in prod that are missing from dev. But it's important to understand what's missing. Is it random, or are they grouped in some way?

In this example, we'll use the Diff Results Table to explore whether the missing rows are evenly distributed across created_at values.

select 
  YEAR(to_timestamp("created_at_b")) as year, 
  MONTH(to_timestamp("created_at_b")) as month, 
  sum(case when "is_exclusive_b" then 1 else 0 end) as "PK in prod and missing from dev",
  count(*) as "Total PKs in Prod",
  ROUND(100*sum(case when "is_exclusive_b" then 1 else 0 end)/count(*),1) as "% PK in prod and missing from dev"
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2 
order by 1 desc, 2 desc;

We see that a much higher percentage of primary keys are missing from the dev data among rows with a created_at date from the summer of 2021.

See the changes that lead to conflicting values. What was the value before, and what is it now?

Now, we want to dig into the status column and understand how the values changed.

Instead of relying only on a human reviewer to figure out the impact of the code change, we can use the out-of-the-box Diff Results Table that we've materialized in the warehouse to fully understand the value-level differences between dev and prod data.

select 
  "status_a" as "Prod status value",
  "status_b" as "Dev status value",
  COUNT(*)
from <DEV_SCHEMA>.<TEST_RESULTS>
group by 1,2
;

In the Dev status value column, we can clearly see that all the values are either "UNDER REVIEW" or NULL. This is in contrast to the Prod status values, which are either "UNDER REVIEW" or "PENDING".

More specifically, we can see that:

1223 rows were "UNDER REVIEW" in prod data and remained "UNDER REVIEW" in dev data.
43 rows changed from "PENDING" in prod data to "UNDER REVIEW" in dev data.
123 rows changed from having string values of "UNDER REVIEW" or "PENDING" in prod data to being NULL in dev data.
- These are the primary keys that are missing from dev data.

By materializing the results of our data-diff in the warehouse, we gained insight into data changes on the level of individual values, as well as the ability to quickly summarize the Diff Results Table into meaningful conclusions about the impact the dev code will have on the data.

Once we've completed this analysis, we can make any changes needed to the code, run data-diff again, and finally, ask for a codereview from your teammate.

Here's the best part.

When you ask for a code review, you can say: "Hi, can you please review this PR? I've already run a data-diff, so you don't need to worry about any unexpected impacts on the data."

😌

Command Line Options

Category	CLI switch	Description
	`--help`	Show help message and exit.
Schema	`-k` or `--key-columns`	Name of the primary key column. If none provided, default is 'id'.
Schema	`-t` or `--update-column`	Name of updated_at/last_updated column.
Schema	`-c` or `--columns`	Names of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern, like in SQL. Example: `-c col% -c another_col -c %foorb.r%`
Filtering	`--min-age`	Considers only rows older than specified. Useful for specifying replication lag. Example: `--min-age=5min` ignores rows from the last 5 minutes. Valid units: `d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years`
Filtering	`--max-age`	Considers only rows younger than specified. See `--min-age`.
Filtering	`-w`, `--where`	An additional 'where' expression to restrict the search space.
Performance	`-l` or `--limit`	Maximum number of differences to find (limits maximum bandwidth and runtime).
Performance	`-j` or `--threads`	Number of worker threads to use per database. Default=1.
Output	`-d` or `--debug`	Print debug info.
Output	`-i` or `--interactive`	Confirm queries, implies `--debug`
Output	`-v` or `--verbose`	Print extra info.
Output	`--materialize-all-rows`	Materialize every row, even if they are the same, instead of just the differing rows.
Output	`-m`, `--materialize`	Materialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use `%t` in the name to place a timestamp. Example: `-m test_mat_%t`
Output	`--table-write-limit`	Maximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.
Settings	`--no-tracking`	data-diff sends home anonymous usage data. Use this to disable it.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Running `data-diff` to improve your dbt development workflow

How does `data-diff` work?

Why is `data-diff` important?

Enough preamble. Let's diff.

Install `data-diff`.

Run `data-diff` to compare dev and prod data.

Materialize the Diff Results Table in your warehouse.

Write SQL to analyze Diff Results Table and determine whether your dev branch can be merged into prod.

Here are some examples of SQL queries that you can use to interpet the Diff Results Table.

Flexibility to join the Diff Results Table with other tables in your warehouse.

Explore distribution of conflicting or missing values.

See the changes that lead to conflicting values. What was the value before, and what is it now?

Here's the best part.

Command Line Options

About

Releases

Packages

datafold/os-diff-pr-tutorial

Folders and files

Latest commit

History

Repository files navigation

Running data-diff to improve your dbt development workflow

How does data-diff work?

Why is data-diff important?

Enough preamble. Let's diff.

Install data-diff.

Run data-diff to compare dev and prod data.

Materialize the Diff Results Table in your warehouse.

Write SQL to analyze Diff Results Table and determine whether your dev branch can be merged into prod.

Here are some examples of SQL queries that you can use to interpet the Diff Results Table.

Flexibility to join the Diff Results Table with other tables in your warehouse.

Explore distribution of conflicting or missing values.

See the changes that lead to conflicting values. What was the value before, and what is it now?

Here's the best part.

Command Line Options

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Running `data-diff` to improve your dbt development workflow

How does `data-diff` work?

Why is `data-diff` important?

Install `data-diff`.

Run `data-diff` to compare dev and prod data.

Packages