To quickly summarize where we are: we started testing Microsoft Fabric with a 10GB tar file, which took over 20 hours to extract, and we have since opened a ticket with Microsoft.
While waiting for a response, let's start the next step: decompressing all 535,678 .gz files that came out of the first file.
I created a new pipeline, pointed it to the parent folder of the newly extracted files, specified another folder to hold all of the JSON files, and kicked off the job after setting it to the 'max' options available.
Below are the screenshots of the configuration:
I set the timeout to 24 hours up front in case it took as long as the first file.
Again, I selected Maximum for throughput and 32 for the degree of parallelism.
Now, off to the races!
This time, it took 1 hour, 32 minutes, and 17 seconds to extract 535,678 files, expanding 10.53GB of compressed data to 97.54GB!
Notice (image below) that the runtime performance setting dropped from the 'Maximum' I selected to 'Balanced'. That is still better than the first run, which fell all the way to 'Standard'. This run also used all 32 parallel copies, vs. 1 for the first file, so that was an improvement as well.
An update on the Microsoft ticket: it has been passed around to several teams, but there has been no response regarding the issue or things to try.
While waiting, I thought I would try to run this on my laptop to see how long both steps would take.
My laptop is a MacBook Pro with 16 logical processors at 2.4 GHz, 64GB of memory, and an NVMe drive with a transfer rate of over 3,000 MB/s.
Starting with the first 10GB file, 7-Zip took only 38 minutes to extract it. WOW! Compared to Microsoft Fabric's 20+ hours, that is much faster.
Once that completed, I extracted all of the second-level gzip files, again using 7-Zip. This took 5 hours and 14 minutes for 535,678 files.
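For anyone who wants to repeat the second step on their own machine without 7-Zip, here is a minimal sketch of decompressing a folder of .gz files with several workers at once. This is not what I ran for the timing above. The function names, the folder layout, and the default of 16 workers (one per logical processor on my laptop) are all assumptions for illustration.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gunzip_one(src: Path, src_root: Path, dest_root: Path) -> Path:
    """Decompress one .gz file, mirroring its relative path under dest_root."""
    # "sub/data.json.gz" under src_root becomes "sub/data.json" under dest_root
    dest = dest_root / src.relative_to(src_root).with_suffix("")
    dest.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks, so large files are fine
    return dest


def gunzip_all(src_root: Path, dest_root: Path, workers: int = 16) -> int:
    """Decompress every .gz file under src_root; return how many were written."""
    sources = sorted(src_root.rglob("*.gz"))
    # Threads are an assumption here: the work is mostly file I/O plus zlib calls.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: gunzip_one(p, src_root, dest_root), sources))
    return len(results)
```

A call would look like `gunzip_all(Path("extracted"), Path("json_out"))`, where both folder names are placeholders. I have not timed this against 7-Zip, so treat it as a starting point rather than a benchmark.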
It would appear that Fabric's 32 parallel copies could extract about 5,822 files a minute, while my laptop handled only 1,705 files a minute. So, for the second step, Microsoft Fabric was about 3.4X faster, but for the first step, my laptop was about 32X faster than Microsoft Fabric.
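If you want to check the math, the comparison can be reproduced in a few lines. All inputs are the figures quoted in this post. The one assumption is treating Fabric's "over 20 hours" for the first step as exactly 20 hours, which makes the 32X a lower bound.

```python
# Figures quoted in the post
FILES = 535_678

# Step 2: decompressing the .gz files
fabric_minutes = 1 * 60 + 32 + 17 / 60  # 1 h 32 min 17 s
laptop_minutes = 5 * 60 + 14            # 5 h 14 min

fabric_rate = FILES / fabric_minutes    # files per minute in Fabric
laptop_rate = FILES / laptop_minutes    # files per minute on the laptop
step2_speedup = fabric_rate / laptop_rate

# Step 1: extracting the 10GB tar file
fabric_step1_minutes = 20 * 60          # assumption: "over 20 hours" taken as 20 h
laptop_step1_minutes = 38
step1_speedup = fabric_step1_minutes / laptop_step1_minutes

print(f"Fabric: {fabric_rate:,.0f} files/min, laptop: {laptop_rate:,.0f} files/min")
print(f"Step 2: Fabric is {step2_speedup:.1f}X faster")
print(f"Step 1: laptop is at least {step1_speedup:.0f}X faster")
```

The per-minute rate for Fabric comes out slightly lower here than the figure quoted above, because the quoted figure rounds the runtime down to 92 minutes and this version keeps the extra 17 seconds. The ratios still land at roughly 3.4X and 32X.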
Seems a bit off if you ask me.
My ticket with Microsoft was opened on 8/27/23, and it's now 11/24/23. The ticket has changed owners three times during its lifetime, and last week, the team working on it asked for the pipelines to be executed again. This time, the first extraction took 3 hours longer.