Recently a question came up on the various file types that would be best to load into PowerBI, so I ran a few tests to see how it would play out. First, I wanted to find one decently large table to load and work with that. I decided on the Stack Overflow database and the Votes table, which has 231 million rows.
|
My computer is a MacBook Pro, Intel, i9 2.4GHz 8 core, 64GB Ram, and NVMe SSD. This transfers at about 3k Mbps. I'm running Windows 10 64bit with SQL Server 2022 64bit Developer Edition.
The first hurdle I encountered was exporting this data without using ADF to a Parquet file. I found several blog posts by David Mason (Import and Export Parquet files with SQL Server, RStudio and Package Install for Machine Learning Services, Machine Learning Services and File System Interaction) detailing setting up and exporting Parquet files directly from SQL Server. Needless to say, in SQL Server 2022, it didn't seem as easy as it did in 2019 for Machine Learning Services, as well as adjusting permissions.
In the current release of PowerBI there is also a bug in loading Parquet files, so I had to go back to the December release to load the files.
Tests
Here are the tests I did to load data into PowerBI:
Parquet
CSV
SQL Server - Import
SQL Server - Direct Query
Data Size
Here is the size of the data for each on disk:
Votes.CSV: 13.9GB - Largest
SQL Server Table Size: 3.16GB
Votes.Parquet (3 files): 2.37GB - Smallest
(I could not export the table to one Parquet file, the Apache Arrow component seemed to time out over 100 million records.)
PowerBI File Size
Here is the size of the PowerBI pbix files after loading each:
VotesParquet.pbix = 6.50GB - Largest
VotesCSV.pbix = 4.21GB
VotesSQLImport.pbix = 4.21GB
VotesSQLDirectQuery.pbix = 15.6kb - Smallest
I timed a load of each file type, the time it took to save the file after loading, and how long it took to create one visual.
Initial Load Time
CSV = 43 minutes, 46 seconds - Slowest
Parquet = 7 minutes, 45 seconds
SQL Server Import = 7 minutes
SQL Server Direct Query = A couple of seconds - Fastest
Initial Save Time
Parquet = 3 minutes, 50 seconds - Slowest
CSV = 3 minutes
SQL Server Import = 2 minutes, 30 seconds
SQL Server Direct Query = A couple of seconds - Fastest
Creating a visual using the date and count of votes was instant for all types.
Save Time - After Visual Creation
Parquet = 50 seconds - Slowest
CSV = 29 seconds
SQL Server Import = 28 seconds
SQL Server Direct Query = Instant - Fastest
In conclusion, CSV cannot even be considered an option to load with over 43 minutes to import. If you don't have a database server, parquet might be a decent option, but SQL Server Direct Query, with a couple of seconds to load the data, is the fastest.
Comments