Apache Parquet is a columnar storage format targeted at analytical workloads, but it can be used to store any type of structured data, addressing a variety of use cases.
One of its most notable features is the ability to compress data efficiently by applying different techniques at two stages of the writing process. This reduces storage costs and improves read performance.
This article explains Parquet’s file compression in Java, provides usage examples, and analyzes its performance.
Unlike traditional row-based storage formats, Parquet uses a columnar approach, allowing the use of more specific and efficient compression techniques based on locality and value redundancy of the same type of data.
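Parquet writes information in binary format and applies compression at two different levels, each using a different technique: first, the values of each column are encoded with a scheme suited to the data (for example dictionary or run-length encoding); then the encoded pages are compressed with a general-purpose algorithm such as Snappy, GZIP, or ZSTD.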
Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).
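To make the encoding level more concrete, here is a minimal sketch of dictionary encoding followed by run-length encoding on a single column of strings. It is a simplified illustration only: the class and method names are hypothetical, and it does not reproduce the encoder or the on-disk layout used by parquet-java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of dictionary + run-length encoding on one column.
// This is NOT Parquet's actual encoder or file layout.
class ColumnEncodingSketch {

    // Assigns each distinct value a small integer id, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<String> column) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String value : column) {
            dictionary.putIfAbsent(value, dictionary.size());
        }
        return dictionary;
    }

    // Replaces values by their ids and collapses consecutive repeats into
    // runs, each stored as {id, runLength}.
    static List<int[]> encode(List<String> column, Map<String, Integer> dictionary) {
        List<int[]> runs = new ArrayList<>();
        for (String value : column) {
            int id = dictionary.get(value);
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last)[0] == id) {
                runs.get(last)[1]++;            // extend the current run
            } else {
                runs.add(new int[] {id, 1});    // start a new run
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> country =
            List.of("IT", "IT", "IT", "ES", "ES", "IT", "FR", "FR", "FR", "FR");
        Map<String, Integer> dictionary = buildDictionary(country);
        System.out.println("dictionary: " + dictionary);
        for (int[] run : encode(country, dictionary)) {
            System.out.println("id " + run[0] + " x " + run[1]);
        }
    }
}
```

Ten strings become three dictionary entries and four short runs, which is the kind of redundancy a column of same-typed values tends to expose before any general-purpose codec is applied.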
The performance of different compression technologies depends heavily on your data, so there is no one-size-fits-all solution that guarantees the fastest processing time and lowest storage consumption. You need to perform your own tests.
Configuration is simple and only requires explicit setting when writing. When reading a file, Parquet discovers which compression algorithm is used and applies the corresponding decompression algorithm.
With Carpet, and with Parquet's Avro and Protocol Buffers writers, you configure the compression algorithm by calling the builder's withCompressionCodec method:
Carpet
    CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build();
Avro
    ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
        .withSchema(new Organization().getSchema())
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build();
Protocol Buffers
    ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
        .withMessage(Organization.class)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build();
The value must be one of the values available in the CompressionCodecName enumeration: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated, LZ4_RAW should be used).
Some compression algorithms provide a way to fine-tune the compression level. This level is usually related to how much effort they need to put into finding repeating patterns; the higher the compression level, the more time and memory the compression process requires.
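The effect of a compression level can be observed with the JDK alone. The sketch below uses java.util.zip.Deflater (the DEFLATE algorithm that GZIP is built on) at levels 1, 6, and 9 on synthetic, repetitive data. It is an illustration under assumptions, not a Parquet benchmark: the class name and sample data are made up, and the sizes and timings you see will depend on your data and machine.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: compares DEFLATE levels using the JDK, not Parquet.
class CompressionLevelSketch {

    // Compresses the input with DEFLATE at the given level (0 to 9).
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the original bytes; the level is not needed to decompress.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // input ended before the stream was complete
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt DEFLATE stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            sb.append("row-").append(i % 1000).append(";status=ACTIVE;country=IT\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
        for (int level : new int[] {1, 6, 9}) {
            long start = System.nanoTime();
            byte[] compressed = compress(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                level, data.length, compressed.length, micros);
        }
    }
}
```

Note that decompression does not take a level argument, which matches the earlier point that the level only has to be chosen when writing.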
Each codec has a default level, which can be changed through Parquet's generic configuration mechanism, although each codec uses a different key.
Additionally, the values to choose are not standard and depend on each codec, so you must refer to the documentation for each algorithm to understand what each level offers.
ZSTD
To reference the level configuration, the ZSTD codec declares a constant: ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.
Possible values range from 1 to 22, and the default value is 3.
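    // config(key, value) mirrors ParquetWriter.Builder; assumed to be exposed by Carpet's builder.
    // The level "6" is only an illustrative value.
    CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .config(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6")
        .build();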
LZO
To reference the level configuration, the LZO codec declares a constant: LzoCodec.LZO_COMPRESSION_LEVEL_KEY.
Possible values are 1 to 9, 99, and 999, with a default of 999.
    // The level "9" is only an illustrative value.
    ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
        .withSchema(new Organization().getSchema())
        .withCompressionCodec(CompressionCodecName.LZO)
        .config(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "9")
        .build();
GZIP
GZIP does not declare any constant, so you must use the string "zlib.compress.level" directly. Possible values range from 0 to 9, and the default value is 6.
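    // The level "9" is only an illustrative value from the range above.
    ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
        .withMessage(Organization.class)
        .withCompressionCodec(CompressionCodecName.GZIP)
        .config("zlib.compress.level", "9")
        .build();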
To analyze the performance of the different compression algorithms, I will use two public datasets containing different types of data (the gov.it and New York taxi columns in the tables below).
I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.
As expected, I will be using Carpet with the default configuration provided by parquet-java and the default compression level for each algorithm.
You can find the source code on GitHub. The tests were run on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.
To understand the performance of each compression, we will use the equivalent CSV file as a reference.
Format | gov.it | New York Taxi |
---|---|---|
CSV | 1761 MB | 2983 MB |
UNCOMPRESSED | 564 MB | 760 MB |
SNAPPY | 220 MB | 542 MB |
GZIP | **146 MB** | 448 MB |
ZSTD | 148 MB | **430 MB** |
LZ4_RAW | 209 MB | 547 MB |
LZO | 215 MB | 518 MB |
In both tests, GZIP and Zstandard achieved the best compression.
Using only Parquet's encoding techniques, file size drops to 25%-32% of the original CSV size. With a compression codec applied on top, it drops further, to roughly 8%-18% of the CSV size depending on the codec (8%-15% with GZIP or ZSTD).
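The ratios can be recomputed directly from the size table. The following small sketch does the arithmetic for a few rows; the class and method names are hypothetical and the figures are copied from the table above.

```java
// Recomputes size ratios from the file-size table (values in MB).
class SizeRatioSketch {

    // Size as a percentage of a reference size, rounded to one decimal place.
    static double percentOf(double sizeMb, double referenceMb) {
        return Math.round(sizeMb / referenceMb * 1000.0) / 10.0;
    }

    public static void main(String[] args) {
        double csvGovIt = 1761, csvTaxi = 2983;
        System.out.println("UNCOMPRESSED gov.it: " + percentOf(564, csvGovIt) + "%");
        System.out.println("UNCOMPRESSED taxi:   " + percentOf(760, csvTaxi) + "%");
        System.out.println("GZIP gov.it:         " + percentOf(146, csvGovIt) + "%");
        System.out.println("ZSTD taxi:           " + percentOf(430, csvTaxi) + "%");
        System.out.println("LZ4_RAW taxi:        " + percentOf(547, csvTaxi) + "%");
    }
}
```

For example, the uncompressed Parquet file for gov.it is 564 / 1761, about 32% of the CSV, while the GZIP file is 146 / 1761, about 8.3%.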
How much overhead does compressing information bring?
Writing the same data three times and averaging the elapsed time in seconds gives:
Algorithm | gov.it (s) | New York Taxi (s) |
---|---|---|
UNCOMPRESSED | 25.0 | 57.9 |
SNAPPY | 25.2 | 56.4 |
GZIP | 39.3 | 91.1 |
ZSTD | 27.3 | 64.1 |
LZ4_RAW | **24.9** | 56.5 |
LZO | 26.0 | **56.1** |
SNAPPY, LZ4_RAW, and LZO achieve write times similar to no compression, while ZSTD adds about 10% overhead. GZIP performed worst, with writes taking about 57% longer.
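The measurement procedure for the write timings (run the same task several times and average the elapsed seconds) can be sketched as follows. This is a minimal helper under assumptions, not the benchmark code from the repository: the names are hypothetical and the task in main is a placeholder standing in for writing a Parquet file.

```java
// Minimal timing helper: runs a task N times and averages wall-clock seconds.
class WriteTimingSketch {

    static double averageSeconds(int repetitions, Runnable task) {
        if (repetitions <= 0) {
            throw new IllegalArgumentException("repetitions must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / repetitions;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would write the dataset here.
        Runnable fakeWrite = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sb.append(i);
            }
        };
        System.out.printf("average: %.4f s%n", averageSeconds(3, fakeWrite));
    }
}
```

A helper this simple ignores JIT warm-up and disk caching, so the results are only a rough comparison between configurations.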
Reading a file is faster than writing because less computation is required.
The time in seconds to read all columns in the file is:
Algorithm | gov.it (s) | New York Taxi (s) |
---|---|---|
UNCOMPRESSED | 11.4 | 37.4 |
SNAPPY | **12.5** | **39.9** |
GZIP | 13.6 | 40.9 |
ZSTD | 13.1 | 41.5 |
LZ4_RAW | 12.8 | 41.6 |
LZO | 13.1 | 41.1 |
Read times are close to those of uncompressed data: decompression adds roughly 7% to 20%.
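The decompression overhead can be derived from the read-time table by comparing each codec with the uncompressed baseline. A small sketch of that calculation, with hypothetical names and figures copied from the table:

```java
// Computes read overhead relative to the uncompressed baseline (times in seconds).
class ReadOverheadSketch {

    // Extra time as a percentage of the baseline, rounded to one decimal place.
    static double overheadPercent(double seconds, double baselineSeconds) {
        return Math.round((seconds / baselineSeconds - 1.0) * 1000.0) / 10.0;
    }

    public static void main(String[] args) {
        double baseGovIt = 11.4, baseTaxi = 37.4;
        System.out.println("SNAPPY gov.it:  " + overheadPercent(12.5, baseGovIt) + "%");
        System.out.println("GZIP gov.it:    " + overheadPercent(13.6, baseGovIt) + "%");
        System.out.println("SNAPPY taxi:    " + overheadPercent(39.9, baseTaxi) + "%");
        System.out.println("LZ4_RAW taxi:   " + overheadPercent(41.6, baseTaxi) + "%");
    }
}
```

The lowest overhead in the table is SNAPPY on the taxi dataset and the highest is GZIP on gov.it.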
Apart from GZIP's slower writes, no algorithm is significantly better than the others in terms of read and write times; all are in a similar range. In most cases, the savings in storage space (and transfer time) make up for the time lost compressing the data.
In these two use cases, the deciding factor when choosing an algorithm is probably the compression ratio achieved, where ZSTD and GZIP stand out (at the cost of slower writes).
Each algorithm has its advantages, so the best option is to test with your own data and consider which factor matters most to you: storage size, write time, or read time.
Like everything in life, it's a trade-off, and you have to decide what suits you best. If you don't configure anything, Carpet uses Snappy by default.
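One way to make the trade-off explicit is to fix a budget for one factor and optimize another. The sketch below picks the codec with the smallest file among those whose write time stays within a budget, using the gov.it figures from the tables above. The selection rule and all names are assumptions for illustration; your own criteria may weigh read time or other factors.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative selection rule: smallest file within a write-time budget.
class CodecChoiceSketch {

    static final class Measurement {
        final String codec;
        final double sizeMb;
        final double writeSeconds;

        Measurement(String codec, double sizeMb, double writeSeconds) {
            this.codec = codec;
            this.sizeMb = sizeMb;
            this.writeSeconds = writeSeconds;
        }
    }

    // gov.it figures copied from the size and write-time tables.
    static List<Measurement> govIt() {
        return List.of(
            new Measurement("UNCOMPRESSED", 564, 25.0),
            new Measurement("SNAPPY", 220, 25.2),
            new Measurement("GZIP", 146, 39.3),
            new Measurement("ZSTD", 148, 27.3),
            new Measurement("LZ4_RAW", 209, 24.9),
            new Measurement("LZO", 215, 26.0));
    }

    static Optional<Measurement> smallestWithin(List<Measurement> all, double maxWriteSeconds) {
        return all.stream()
            .filter(m -> m.writeSeconds <= maxWriteSeconds)
            .min(Comparator.comparingDouble((Measurement m) -> m.sizeMb));
    }

    public static void main(String[] args) {
        smallestWithin(govIt(), 30.0)
            .ifPresent(m -> System.out.println("Within 30 s: " + m.codec));
    }
}
```

With a 30-second write budget on gov.it, GZIP is excluded and ZSTD is the smallest remaining option; with no practical budget, GZIP wins by 2 MB.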
Each value of the CompressionCodecName enumeration is associated with the name of the class that implements the algorithm.
Parquet will use reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you'll see that it's in the Hadoop project, not Parquet. This shows how tightly Parquet is coupled to Hadoop in its Java implementation.
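The general mechanism of instantiating a class from its name can be sketched with plain Java reflection. The Codec interface and GzipLikeCodec class below are hypothetical stand-ins, not Hadoop's CompressionCodec or any real codec, and the sketch does not claim to mirror how parquet-java performs the lookup internally.

```java
// Illustrative reflection-based loading of an implementation by class name.
class ReflectiveCodecSketch {

    // Hypothetical stand-in for a codec interface.
    interface Codec {
        String extension();
    }

    // Hypothetical implementation, needs a no-argument constructor.
    static class GzipLikeCodec implements Codec {
        @Override
        public String extension() {
            return ".gz";
        }
    }

    // Loads the named class, checks it implements Codec, and instantiates it.
    static Codec instantiate(String className) {
        try {
            return Class.forName(className)
                .asSubclass(Codec.class)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load codec class " + className, e);
        }
    }

    public static void main(String[] args) {
        Codec codec = instantiate(GzipLikeCodec.class.getName());
        System.out.println("Loaded codec with extension " + codec.extension());
    }
}
```

Because the class is resolved by name at run time, a missing JAR only shows up as a failure when the codec is first requested, not at compile time.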
To use one of these codecs, you must ensure that you have added the JAR containing its implementation as a dependency.
Not all implementations are present in the transitive dependencies you have when adding parquet-java, or you may be excluding Hadoop dependencies too aggressively.
The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and transitively imports the snappy-java, zstd-jni, and aircompressor dependencies, which contain the actual implementations of these three algorithms.
The org.apache.hadoop:hadoop-common dependency contains the implementation of GzipCodec.
Where are the implementations of BrotliCodec and LzoCodec? They are not in any Parquet or Hadoop dependencies, so if you use them without adding additional dependencies, your application will not be able to use files compressed in those formats.
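Before relying on a codec, you can check whether its implementation class is actually on the classpath. The sketch below does this with plain Java. The fully qualified class names are written from memory of the parquet-java source and should be treated as assumptions; with Parquet on the classpath, the name stored in each CompressionCodecName value is the authoritative one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Checks which codec implementation classes can be found on the classpath.
class CodecAvailabilitySketch {

    // Assumed class names; verify against your parquet-java version.
    static Map<String, String> assumedCodecClasses() {
        Map<String, String> classes = new LinkedHashMap<>();
        classes.put("SNAPPY", "org.apache.parquet.hadoop.codec.SnappyCodec");
        classes.put("GZIP", "org.apache.hadoop.io.compress.GzipCodec");
        classes.put("LZO", "com.hadoop.compression.lzo.LzoCodec");
        classes.put("BROTLI", "org.apache.hadoop.io.compress.BrotliCodec");
        classes.put("ZSTD", "org.apache.parquet.hadoop.codec.ZstandardCodec");
        classes.put("LZ4_RAW", "org.apache.parquet.hadoop.codec.Lz4RawCodec");
        return classes;
    }

    // Returns true if the class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodecAvailabilitySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        assumedCodecClasses().forEach((codec, className) ->
            System.out.println(codec + ": " + (isOnClasspath(className) ? "available" : "missing")));
    }
}
```

Running a check like this at application startup turns a late failure on the first write or read into an early, readable report of which dependencies are missing.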