Anti Join is useful
In this post, we are going to look at how we can use an anti join to figure out which rows of data don't match. This can serve two purposes - it can:
- Identify data quality errors that we need to fix.
- Provide insight into which records are NOT present in the selected dataset.
Identify data quality errors
Take the two tables below as an example:
If we choose Left Outer join, the result is:
We know that the United States is a valid country with an existing population. Same with United Kingdom. So why are we getting null values?
With this small dataset, it's easy to see that USA and Great Britain/Northern Ireland are the problems - they appear in the Right Table, but they don't exactly match the records in the Left Table.
With larger sets of data, it can be more difficult to find those null values, which equate to mismatches in our data.
Right Anti Join
If we choose RIGHT ANTI, the second table will be filtered, to only give results that do NOT appear in the first table. In this case, I want view rows that appear in the Population table but do not appear in the Sales table, so that I can identify why United States and United Kingdom have both returned null for population:
Now that we know the problem, we can fix the data.
Transform Data to Improve Data Quality
In this example, we only need to do two simple fixes in our Population table:
- Replace USA with United States
- Group all countries in United Kingdom
Both of these fixes can be done directly in Power Query.
Replace Values
United States can be fixed with a simple Transform > Replace Values.
Group by
- Northern Ireland
- Great Britain
And replace all with "United Kingdom".
Then, we will Group By Country.
NOTE: Group By is a great way to ensure the column(s) selected in your Group By have unique values. This is highlighted extensively in my other Merge Ahead post and will ensure I do not get any unwanted duplicate data in my merge!
My Right Anti join is now an EMPTY table.
And my Left Outer join now looks like this:
Provide Insight into which records are NOT in selected data
For this example, I am going to use Olympic Medal data which is readily available from Wikipedia. For stable data matching this post, you can use:
Objective:
Our goal is to see which countries have never won any Olympic medals.
Select Table 4 from the Population by Country website. Rename the query to 'Population'.
Select Unranked medal table (sortable) from the Population by Country website. Rename the query to 'Medals'.
Transform Medals table
Rename the first column to 'Country'
Extract text before delimiter (
Trim Country column
Transform Population table
Rename the first column to 'Country'
Extract text before delimiter (
Extract text before delimiter [
Trim Country column
Left Anti Join
Note a Left Anti Join does not require us to keep the null columns. We will get a column of Table values, but because we have chosen an Anti join, all Tables in this column will be empty, so we can remove this column.
Result:
Now we can see which countries have not won any Olympic medals:
Some of these results, like the United Kingdom, we will recognize and are related to Data Quality Errors. Many of these results though provide insight into opportunity for growth.
Conclusion
An anti join can help add insight to your reporting as well as provide you with the information you need to cleanse and transform your data effectively. Use a Left Anti join with the "Merge Queries as New" option and remove that unneeded column of empty Table values you get and simply focus on the filtered rows that result from the unmatched data.