Search

Search this blog:

Search This Blog

First Day of Christmas: Get Power BI Data from PDF


Get Naughty or Nice List Data from PDF (or web)

๐ŸŽต On the first day of Christmas, my true love gave to me... PDF data in Power BI. ๐ŸŽต

What? You mean I don't have to convert to csv or copy/paste my data? That's right folks, Power BI connects directly to PDF files. How cool is that? You can follow along with me as I'm using data from the North Pole Government Department of Christmas Affairs




As you can see from the screenshot above, it's 461 glorious pages of, well, names! There are no column headings. When I highlight the table, it selects the left side first, then the right side. This is kind of handy for copy and paste - it keeps my names all stacked together. That is, until I get to the end of the page - then it selects the page header! What a mess. 

Copying and pasting the data above looks like this: 

Aerith Nice

Aerix Naughty

Aero Naughty

North Pole Government, Department of Christmas Affairs | Naughty and Nice List, 2020

13 | © Copyright North Pole Government 2020 christmasaffairs.com

Aeron Nice

Aeryn Naughty

Aeryn-Jessie Nice

Aesa Nice

Aeson Naughty

Aesyn Naughty

Aethan Naughty

Aetheria Nice

Aeva Nice

Aevlyn Naughty

Affie Nice

Afiba Nice

Afnan Nice

Afonso Naughty

Afrasenia Naughty

Africa Naughty

Afton Nice

Agastya Nice

Agata Nice

Agatha Nice

Aggie Naughty

Aghy Nice

Agianna Naughty

Agnes Nice

Agness Naughty

Agripina Naughty

Agusta Naughty

Agustin Nice

Agustina Nice

Agustus Nice

Ah Nice

Ahana Naughty

If I had to do this manually, I would render this data unobtainable. Lucky for us, PDF Connector has been available as General Availability in Power BI Desktop since April, 2019. 

When you first navigate to a PDF file in Power BI desktop, you'll notice Power BI picks up on each and every page, as well as each and every table. Good news about PDF files, they all started as something and typically have some nicely formatted tables that we can access. 




In this example we have over 400 tables. It appears to create a new table whenever there is a page break in the PDF. If you don't know how many rows or lines of text your data may have, this can make it challenging to write a query that will work for future data refresh cycles - how many tables do you need to select? What data will be in each one? 

Additionally, bringing these in as separate queries will be too time consuming. PDF data load is notoriously slow to begin with, and querying 400 tables individually will take ages. Santa will never make it on time!

Since I want all the tables in one query, I'm just going to select the first table for now and customize the results in Power Query. This trick is handy for many PDF files, as single datasets can often be split across tables when there is a page break in the PDF.

Pro tip: If you have only a few tables and want to import them all as separate queries, you can use the Shift key to select multiple files. Don't tick the box next to these tables. Just click on the text for the table name and Power BI will do the multi-select for you. If you tick the box, it has a tendency to skip the last table you select.

How to access local PDF file:

  1. Save the PDF file to your desktop where you can find it easily. 
  2. Open a new blank Power BI report.
  3. In the Home tab, click Get Data > More.
  4. Select PDF from the list.
  5. Browse to find the Naughty and Nice list on your desktop.
  6. Tick the box next to Table002 (the first table is not the data format we're looking for).
  7. Click Transform Data to open Power Query - we still have some work to do.
  8. Save the report as Santa List.
  9. From the Query Settings pane, select the Source step and note the formula bar starts with =Pdf.Tables(
  10. Examine the preview pane: 

You can see some very interesting information here - note I have clicked in the white space next to the yellow Table link for Table002. This gives me a preview of all the data that's in that table. 

Click in the white space next to Table003 - looks similar. Hmm, this is VERY interesting. Tomorrow I will demonstrate how we can extract all this information in one query. 

(Alternate) How to access PDF online:

If you don't want to download the PDF to your desktop, you can access it directly from the web URL instead. Power BI is smart enough to figure out it's a PDF, and will behave exactly the same as if you grabbed it from local desktop file.
  1. Open a new blank Power BI report.
  2. In the Home tab, click Get Data from Web.
  3. Paste the following link: https://christmasaffairs.com/docs/Naughty-and-Nice-List-2020.pdf
  4. Select Anonymous for the connection method.
  5. Select the first table of names (Table002)
  6. Click Transform Data to open Power Query.
  7. Save the report as Santa List.
  8. From the Query Settings pane, select the Source step. You'll notice the Formula bar starts with = Pdf.Tables( just as it did in the previous method.

That's all for today. 

Tune in tomorrow for the next gift in the 12 Days of Christmas series where we'll see a clever trick to access all the information in this entire 461 page document in one go. 

Custom Visual Review: Charticulator

This is not your ordinary custom visual - this is EVERY custom visual. Charticulator puts the power to design and develop custom visuals to ...