Search

Search this blog:

Search This Blog

Merge Ahead: Don't skip a row


 Anti Join is useful

In this post, we are going to look at how we can use an anti join to figure out which rows of data don't match. This can serve two purposes - it can: 

  • Identify data quality errors that we need to fix.
  • Provide insight into which records are NOT present in the selected dataset.

Identify data quality errors

Take the two tables below as an example: 

If we choose Left Outer join, the result is: 

We know that the United States is a valid country with an existing population. Same with United Kingdom. So why are we getting null values? 

With this small dataset, it's easy to see that USA and Great Britain/Northern Ireland are the problems - they appear in the Right Table, but they don't exactly match the records in the Left Table.

With larger sets of data, it can be more difficult to find those null values, which equate to mismatches in our data. 

Right Anti Join 

If we choose RIGHT ANTI, the second table will be filtered, to only give results that do NOT appear in the first table. In this case, I want view rows that appear in the Population table but do not appear in the Sales table, so that I can identify why United States and United Kingdom have both returned null for population: 




When doing a RIGHT ANTI join, we end up with many columns of unhelpful null values, but when we expand the Merge column, we get the results we are looking for. Below you can see the countries that have no matching record in the Sales table, and we can easily identify which should match United States and United Kingdom: 

Now that we know the problem, we can fix the data. 

Transform Data to Improve Data Quality

In this example, we only need to do two simple fixes in our Population table: 

  • Replace USA with United States
  • Group all countries in United Kingdom

Both of these fixes can be done directly in Power Query. 

Replace Values

United States can be fixed with a simple Transform > Replace Values.

Group by

United Kindgom needs two steps. 
First, we will replace the values to find: 
  • Northern Ireland
  • Great Britain

And replace all with "United Kingdom".

Then, we will Group By Country. 

NOTE: Group By is a great way to ensure the column(s) selected in your Group By have unique values. This is highlighted extensively in my other Merge Ahead post and will ensure I do not get any unwanted duplicate data in my merge!

My Right Anti join is now an EMPTY table. 

And my Left Outer join now looks like this: 


Provide Insight into which records are NOT in selected data

For this example, I am going to use Olympic Medal data which is readily available from Wikipedia. For stable data matching this post, you can use: 

Objective: 

Our goal is to see which countries have never won any Olympic medals.

Select Table 4 from the Population by Country website. Rename the query to 'Population'.

Select Unranked medal table (sortable) from the Population by Country website. Rename the query to 'Medals'.

Transform Medals table

Rename the first column to 'Country'

Extract text before delimiter (

Trim Country column

Transform Population table

Rename the first column to 'Country'

Extract text before delimiter (

Extract text before delimiter [

Trim Country column

Left Anti Join

If we choose Left Anti Join, we will need Population as the Left table and Medals as the Right table in order to see which countries exist in the Population table but have no record of winning any Olympic medals in the Medals table.


Note a Left Anti Join does not require us to keep the null columns. We will get a column of Table values, but because we have chosen an Anti join, all Tables in this column will be empty, so we can remove this column. 

Result:

Now we can see which countries have not won any Olympic medals: 

Some of these results, like the United Kingdom, we will recognize and are related to Data Quality Errors. Many of these results though provide insight into opportunity for growth.

Conclusion

An anti join can help add insight to your reporting as well as provide you with the information you need to cleanse and transform your data effectively. Use a Left Anti join with the "Merge Queries as New" option and remove that unneeded column of empty Table values you get and simply focus on the filtered rows that result from the unmatched data. 



Custom Visual Review: Charticulator

This is not your ordinary custom visual - this is EVERY custom visual. Charticulator puts the power to design and develop custom visuals to ...