Combine over 400 tables using Expand Data in Power Query
๐ต On the second day of Christmas, my true love gave to me...
Table expanded columns
and
a PDF in Power BI. ๐ต
There are many ways to combine and append data in Power BI. As we saw in yesterday's post, all our tables have the same format, so we don't need to do any transformations to them before the append. The source step of getting data from PDF lists all the tables, including their underlying data, in one big query. This means we can just expand the columns in the Source query to access all the data.
To do this, we need to go back a few steps to the Source step so we can see all the data in our PDF.
Skip to the How To: section below to see step by step instructions.
As we noted before, Power BI recognises PDF tables and pages. We don't want to bring all the data in twice, so we'll just focus on the Tables for this example (they are typically tidier).
Pro tip: Filtering data is one of the first things you should do in Power Query with any dataset. It can speed up your refresh times and in some datasets supports a nifty thing called 'Query Folding' that basically enables Power BI to outsource a lot of the work (which ultimately speeds things up).
This means two of the first steps you should do (even before the Changed Type step that Power BI automatically inserts) are:
- Choose Columns
- Filter Rows
Okay, so let's put that into practice. Start by deleting all the steps except 'Source'.
Click Choose Columns from the Home tab in the ribbon. In this example, we only need to know what "Kind" of object and the "Data" within that object, so we'll select just these two columns:
Now we can filter for tables only:
Wonderful, with the efficiencies out of the way, now we are free to focus on what we came here for; extracting that beautiful data from over 400 tables! Ok, so here's the fun part. So many data sources have embedded tables; databases, SharePoint lists, PDFs, Excel files, merged queries, .... I'm sure there's more that I can't think of at the moment. Basically, if you can learn how to access these embedded tables, and demystify the confusing interface that Power Query uses to display them, you will open endless opportunities for data analysis and exploration.
This dataset is unique, in that it mashes the tables side by side and imports them as one query. The data we want is only two columns:
- Name
- Naughty or Nice
But when we try to expand the Data column to extract the columns from the tables, you'll see we have four columns:
- Column1
- Column2
- Column3
- Column4
Expanding all four columns gives us Name, Naughty or Nice, Name, Naughty or Nice. Hang on, what's all this gibberish at the top?
Go back and look at the PDF if you need to - that's another valuable step when working in Power Query. Always understand your data source, what it means, where it comes from and how it might change. Now I can see that gibberish relates to the table of contents.
I really only want two columns, and I need to remove all the rows with table of contents (or anything that doesn't relate to the Naughty or Nice list).
I know Power BI does not allow me to simply cut and paste that data into the format I want. So what to do?
Today we will create two different queries:
- SantaList1 for Column1 and Column2
- SantaList2 for Column3 and Column4
They will be exactly the same otherwise. Tomorrow we'll see how to combine them into one seamless table.
Okay, so we have two challenges:
- Select only Column1 and Column2.
- Filter out the table of contents gibberish that we don't need (note I used that key word 'filter' again).
Pro tip: You can edit Applied Steps that you've already completed using the Settings cog.
That means we don't have to start over, just click the settings cog for Expanded Data step, and deselect Column3 and Column4.
Finally, we can filter out that gibberish. How is this PDF likely to change? I know Column1 has all the names and there's no pattern to that data - each name is unique. What about Column2? What information do we expect to see in Column2 that is valuable to us for this report?
"Nice" and "Naughty"
Anything else is useless to us. That's easy, just do a filter on Column2 to display only values equal to "Naughty" or "Nice":
Now we have to do that all over again with Column3 and Column4 right? Wrong! Tomorrow we'll see how to duplicate, edit and append this query to capture all the data in one table.
How to:
- Open the SantaList.pbix file you created on Day 1.
- Click Transform Data to open the Power Query Editor.
- Delete all steps except the Source step.
- Rename the Query to SantaList1.
- Click Choose Columns in the Home tab in ribbon.
- Select Kind and Data. Click OK.
- Filter the Kind column to show only Table.
- Right click the Kind column and select Remove.
- Click the double arrow at the Data column heading to Expand:
- Tick Column1
- Tick Column2
- Untick Use original column name as prefix
- Click OK.
- Filter the Column2 column to show only "Naughty" and Nice".
- Save the SantaList.pbix report. If asked, choose Apply Later.
That's enough for today. Tune in again tomorrow for the next gift in the 12 Days of Christmas series where we'll see how to quickly edit, duplicate and append queries.