top of page
  • Writer's pictureJeff Taylor

Parquet, CSV, or SQL Server with PowerBI

Updated: Mar 23

Recently a question came up on the various file types that would be best to load into PowerBI, so I ran a few tests to see how it would play out. First, I wanted to find one decently large table to load and work with that. I decided on the Stack Overflow database and the Votes table, which has 231 million rows.









My computer is a MacBook Pro, Intel, i9 2.4GHz 8 core, 64GB Ram, and NVMe SSD. This transfers at about 3k Mbps. I'm running Windows 10 64bit with SQL Server 2022 64bit Developer Edition.


The first hurdle I encountered was exporting this data without using ADF to a Parquet file. I found several blog posts by David Mason (Import and Export Parquet files with SQL Server, RStudio and Package Install for Machine Learning Services, Machine Learning Services and File System Interaction) detailing setting up and exporting Parquet files directly from SQL Server. Needless to say, in SQL Server 2022, it didn't seem as easy as it did in 2019 for Machine Learning Services, as well as adjusting permissions.


In the current release of PowerBI there is also a bug in loading Parquet files, so I had to go back to the December release to load the files.


Tests

Here are the tests I did to load data into PowerBI:

  • Parquet

  • CSV

  • SQL Server - Import

  • SQL Server - Direct Query

Data Size

Here is the size of the data for each on disk:


Votes.CSV: 13.9GB - Largest

SQL Server Table Size: 3.16GB

Votes.Parquet (3 files): 2.37GB - Smallest


(I could not export the table to one Parquet file, the Apache Arrow component seemed to time out over 100 million records.)


PowerBI File Size

Here is the size of the PowerBI pbix files after loading each:


VotesParquet.pbix = 6.50GB - Largest

VotesCSV.pbix = 4.21GB

VotesSQLImport.pbix = 4.21GB

VotesSQLDirectQuery.pbix = 15.6kb - Smallest


I timed a load of each file type, the time it took to save the file after loading, and how long it took to create one visual.


Initial Load Time


CSV = 43 minutes, 46 seconds - Slowest

Parquet = 7 minutes, 45 seconds

SQL Server Import = 7 minutes

SQL Server Direct Query = A couple of seconds - Fastest


Initial Save Time


Parquet = 3 minutes, 50 seconds - Slowest

CSV = 3 minutes

SQL Server Import = 2 minutes, 30 seconds

SQL Server Direct Query = A couple of seconds - Fastest


Creating a visual using the date and count of votes was instant for all types.


Save Time - After Visual Creation


Parquet = 50 seconds - Slowest

CSV = 29 seconds

SQL Server Import = 28 seconds

SQL Server Direct Query = Instant - Fastest


In conclusion, CSV cannot even be considered an option to load with over 43 minutes to import. If you don't have a database server, parquet might be a decent option, but SQL Server Direct Query, with a couple of seconds to load the data, is the fastest.

67 views

Recent Posts

See All
bottom of page