Python Forum
Parquet file generation
#1
I hope this message finds you well. I'm encountering an intriguing issue with our data processing pipeline and would greatly appreciate your insights.
Our current process involves reading CSV files and converting them to Parquet format. When I load these CSV files into DataFrames, they appear to have nearly identical sizes. However, upon conversion to Parquet, I've noticed a significant discrepancy: one of the resulting Parquet files is approximately three times larger than the other.
For context, these files contain monthly snapshot data, and there isn't substantial variance between them. This size difference is puzzling, given the similarity of the source data.
Key points:
CSV files are of similar size when loaded into DataFrames (same number of columns, almost the same number of rows, same data types)
After conversion to Parquet, one file is roughly 3x larger
Data represents monthly snapshots with minimal variance
I'm keen to understand the underlying cause of this size disparity and would welcome any suggestions or insights you might have.
Thank you in advance for your assistance.
#2
Why are you asking? Are you worried that you are losing information during compression? If so, you can read the Parquet files back and check. Or are you just wondering why some files compress down smaller than others? Compressibility depends heavily on content, and a 3x difference doesn't surprise me at all.
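To make the point about compressibility concrete, here is a minimal sketch using only the standard library. It is not Parquet's actual encoder: zlib stands in for the column encoding and compression that Parquet applies, and the column contents and names (low_cardinality, high_cardinality, sorted_column) are made up for illustration. All three "columns" have the same row count, the same type, and the same uncompressed size in bytes, yet they compress very differently.

```python
import random
import zlib


def raw_size(values):
    """Return the uncompressed size in bytes of the values joined as text."""
    return len("\n".join(values).encode("utf-8"))


def compressed_size(values):
    """Return the zlib-compressed size in bytes of the values joined as text."""
    return len(zlib.compress("\n".join(values).encode("utf-8")))


random.seed(0)  # fixed seed so repeated runs use the same data
n_rows = 50_000

# Hypothetical column with only three distinct 6-character values.
low_cardinality = [
    random.choice(["closed", "opened", "queued"]) for _ in range(n_rows)
]

# Hypothetical column where nearly every 6-character value is different.
high_cardinality = [f"{random.getrandbits(24):06x}" for _ in range(n_rows)]

# Same values as low_cardinality, but with the rows sorted.
sorted_column = sorted(low_cardinality)

columns = {
    "low cardinality, shuffled": low_cardinality,
    "low cardinality, sorted": sorted_column,
    "high cardinality": high_cardinality,
}
for name, values in columns.items():
    print(f"{name}: raw={raw_size(values)} compressed={compressed_size(values)}")
```

If the two monthly snapshots differ in how many distinct values a column holds, or in how the rows happen to be ordered, a size gap like the one described is plausible even though the DataFrames look alike. For the real files, comparing the per-column compressed sizes stored in the Parquet metadata (pyarrow exposes these as total_compressed_size on each column chunk) should show which columns account for the difference.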