compress/flate: improve compression speed #75624
This PR (HEAD: 0a5dc67) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/707355.
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/707355.
Message from Jorropo: Patch Set 1: Commit-Queue+1
Message from Go LUCI: Patch Set 1: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2025-09-27T12:55:36Z","revision":"3a04d8a1390d38190a3f1f95a50627b12593dae7"}
Message from Go LUCI: Patch Set 1: LUCI-TryBot-Result-1
Message from Go LUCI: Patch Set 2: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2025-09-27T12:55:56Z","revision":"103389d07a1b57ac50b2db8839c38519e480db66"}
Message from Jorropo: Patch Set 2: -Commit-Queue
Message from Go LUCI: Patch Set 2: This CL has failed the run. Reason: Tryjob golang/try/gotip-linux-arm64 has failed.
Message from Go LUCI: Patch Set 2: LUCI-TryBot-Result-1
Message from Jorropo: Patch Set 2: (2 comments)
Change-Id: I0ac5571da9585daba9491b360c9a6b4e0cecbcee
This PR (HEAD: 374779b) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/707355.
Message from Jes Cok: Patch Set 3: Commit-Queue+1
Message from Go LUCI: Patch Set 3: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2025-09-27T14:34:26Z","revision":"a471f3c21f29fc60440fa1f4c3de8707a9c4e158"}
Message from Jes Cok: Patch Set 3: -Commit-Queue
Message from Go LUCI: Patch Set 3: This CL has failed the run. Reason: Tryjob golang/try/gotip-linux-amd64_avx512 has failed.
Message from Go LUCI: Patch Set 3: LUCI-TryBot-Result-1
…table bytes. Change-Id: Ia141c7ec888bf51ceb6351d2a1c3f1501c2c4e12
This PR (HEAD: 2c5d12a) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/707355.
Change-Id: I1cef87da8cf7a2f2b330115f8eeecb7bf825af76
This PR (HEAD: f09b893) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/707355.
Message from Jes Cok: Patch Set 4: Commit-Queue+1
Message from Go LUCI: Patch Set 4: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2025-09-27T15:01:46Z","revision":"95a298954e77a087d3b3330c776ec447cffc006e"}
Message from Klaus Post: Patch Set 4: (3 comments)
Message from Go LUCI: Patch Set 4: LUCI-TryBot-Result-1
Change-Id: Ie3630fc4b51f30108909a3d5930ffe17851f4a94
This PR (HEAD: f5d855e) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/707355.
Message from Sean Liao: Patch Set 6: (22 comments)
Message from Carlos Amedee: Patch Set 6: Commit-Queue+1
Message from Go LUCI: Patch Set 6: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2025-09-30T22:00:12Z","revision":"a6a9439ac842a968fe91377ab2c52eeefb264e87"}
Message from Carlos Amedee: Patch Set 6: -Commit-Queue
Message from Go LUCI: Patch Set 6: This CL has failed the run. Reason: Failed Tryjobs.
Message from Go LUCI: Patch Set 6: LUCI-TryBot-Result-1
Fixes #75532
This improves the compression speed of the flate package.
This is a cleaned-up version of github.com/klauspost/compress/flate.
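Overall changes:
* Compression levels 2-6 are custom implementations.
* Compression levels 7-9 are tweaked to match levels 2-6, with minor improvements.
* Tokens are encoded and indexed when added.
* Huffman encoding attempts to continue blocks instead of always starting a new one.
* Loads/stores are in separate functions and can be made to use unsafe.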
Overall, this attempts to better balance the compression levels,
which tended to have little spread in the top levels.
The intention is to place "default" at the point where performance drops off
considerably without a proportional improvement in compression ratio.
In my package I have set "5" as the default, but this keeps it at level 6.
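To see how the levels spread on a given input, something like the following can be used. It is a minimal sketch that relies only on the existing `compress/flate` API; the helper name `compressedSize` and the sample input are made up for illustration, and the numbers it prints say nothing about other data types.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"strings"
)

// compressedSize deflates data at the given level and returns the output size.
// The helper name is hypothetical; it is not part of this change.
func compressedSize(data []byte, level int) (int, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		return 0, err // levels outside [-2, 9] are rejected here
	}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// Illustrative input only; real results depend heavily on the data type.
	data := []byte(strings.Repeat("the quick brown fox jumps over the lazy dog. ", 2000))
	levels := []int{flate.HuffmanOnly, flate.BestSpeed, 5, flate.DefaultCompression, flate.BestCompression}
	for _, level := range levels {
		n, err := compressedSize(data, level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("level %2d: %d -> %d bytes\n", level, len(data), n)
	}
}
```

`flate.DefaultCompression` is -1 and selects the default level, which stays at 6 with this change.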
"Unsafe" operations have been removed for now.
They can trivially be added back.
Removing them costs approximately 10% in speed.
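As a rough illustration of what a safe load helper can look like, here is a sketch built on `encoding/binary`. The names `load32` and `load64` and their signatures are assumptions for this example and are not taken from the change itself; an unsafe variant would replace only the function bodies.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// load32 reads a little-endian uint32 from b starting at offset i.
// Hypothetical helper: the bounds check comes from the slice expression.
func load32(b []byte, i int) uint32 {
	return binary.LittleEndian.Uint32(b[i:])
}

// load64 reads a little-endian uint64 from b starting at offset i.
func load64(b []byte, i int) uint64 {
	return binary.LittleEndian.Uint64(b[i:])
}

func main() {
	buf := []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}
	fmt.Printf("%#x\n", load32(buf, 1)) // 0x5040302
	fmt.Printf("%#x\n", load64(buf, 1)) // 0x908070605040302
}
```

Keeping every load behind a small function like this means the safe and unsafe variants can be swapped in one place.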
Results from the standard library's built-in benchmarks are listed below.
I do not think these are a particularly good representation of different
data types, so I have also run benchmarks on various data types.
I have compiled the benchmarks on https://stdeflate.klauspost.com/
The main focus has been on level 1 (fastest),
level 5+6 (default) and level 9 (smallest).
It is quite rare that levels outside of this are used, but they should still
fit their role reasonably.
Level 9 will attempt more aggressive compression,
but will also typically be slightly slower than before.
I hope the graphs above show that focusing on a few data types
doesn't always give the full picture.
My own observations:
Levels 1 and 2 often "trade places" depending on data type.
Since level 1 usually compresses the least of the two -
and is mostly slightly faster, with lower memory usage -
it is placed as the lowest.
The switchover between level 6 and 7 is not always smooth,
since the search method changes significantly.
Random data is now ~100x faster on levels 2-6, and ~3x faster on levels 7-9.
You can feed pre-compressed data with no significant speed penalty.
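A quick way to check the behaviour on incompressible input is sketched below. It uses seeded pseudo-random bytes as a stand-in for pre-compressed data; the helper names `deflateSize` and `randomBytes` are made up for this example, and the timings it prints are only a rough indication, not a benchmark.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"math/rand"
	"time"
)

// deflateSize compresses data at the given level and returns the output size.
func deflateSize(data []byte, level int) (int, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		return 0, err
	}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// randomBytes returns n pseudo-random bytes from a fixed seed, standing in
// for incompressible or pre-compressed input.
func randomBytes(n int, seed int64) []byte {
	b := make([]byte, n)
	rand.New(rand.NewSource(seed)).Read(b) // always fills b, error is always nil
	return b
}

func main() {
	in := randomBytes(1<<20, 1)
	for _, level := range []int{1, 6, 9} {
		start := time.Now()
		n, err := deflateSize(in, level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("level %d: %d -> %d bytes in %v\n", level, len(in), n, time.Since(start))
	}
}
```

The output should be only marginally larger than the input, since DEFLATE can fall back to stored blocks when compression does not help.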
benchmark old ns/op new ns/op delta
BenchmarkEncode/Digits/Huffman/1e4-32 11431 8001 -30.01%
BenchmarkEncode/Digits/Huffman/1e5-32 123175 74780 -39.29%
BenchmarkEncode/Digits/Huffman/1e6-32 1260402 750022 -40.49%
BenchmarkEncode/Digits/Speed/1e4-32 35100 23758 -32.31%
BenchmarkEncode/Digits/Speed/1e5-32 675355 385954 -42.85%
BenchmarkEncode/Digits/Speed/1e6-32 6878375 4873784 -29.14%
BenchmarkEncode/Digits/Default/1e4-32 63411 40974 -35.38%
BenchmarkEncode/Digits/Default/1e5-32 1815762 801563 -55.86%
BenchmarkEncode/Digits/Default/1e6-32 18875894 8101836 -57.08%
BenchmarkEncode/Digits/Compression/1e4-32 63859 85275 +33.54%
BenchmarkEncode/Digits/Compression/1e5-32 1803745 2752174 +52.58%
BenchmarkEncode/Digits/Compression/1e6-32 18931995 30727403 +62.30%
BenchmarkEncode/Newton/Huffman/1e4-32 15770 11108 -29.56%
BenchmarkEncode/Newton/Huffman/1e5-32 134567 85103 -36.76%
BenchmarkEncode/Newton/Huffman/1e6-32 1663889 1030186 -38.09%
BenchmarkEncode/Newton/Speed/1e4-32 32749 22934 -29.97%
BenchmarkEncode/Newton/Speed/1e5-32 565609 336750 -40.46%
BenchmarkEncode/Newton/Speed/1e6-32 5996011 3815437 -36.37%
BenchmarkEncode/Newton/Default/1e4-32 70505 34148 -51.57%
BenchmarkEncode/Newton/Default/1e5-32 2374066 570673 -75.96%
BenchmarkEncode/Newton/Default/1e6-32 24562355 5975917 -75.67%
BenchmarkEncode/Newton/Compression/1e4-32 71505 77670 +8.62%
BenchmarkEncode/Newton/Compression/1e5-32 3345768 3730804 +11.51%
BenchmarkEncode/Newton/Compression/1e6-32 35770364 39768939 +11.18%
benchmark old MB/s new MB/s speedup
BenchmarkEncode/Digits/Huffman/1e4-32 874.80 1249.91 1.43x
BenchmarkEncode/Digits/Huffman/1e5-32 811.86 1337.25 1.65x
BenchmarkEncode/Digits/Huffman/1e6-32 793.40 1333.29 1.68x
BenchmarkEncode/Digits/Speed/1e4-32 284.90 420.91 1.48x
BenchmarkEncode/Digits/Speed/1e5-32 148.07 259.10 1.75x
BenchmarkEncode/Digits/Speed/1e6-32 145.38 205.18 1.41x
BenchmarkEncode/Digits/Default/1e4-32 157.70 244.06 1.55x
BenchmarkEncode/Digits/Default/1e5-32 55.07 124.76 2.27x
BenchmarkEncode/Digits/Default/1e6-32 52.98 123.43 2.33x
BenchmarkEncode/Digits/Compression/1e4-32 156.59 117.27 0.75x
BenchmarkEncode/Digits/Compression/1e5-32 55.44 36.33 0.66x
BenchmarkEncode/Digits/Compression/1e6-32 52.82 32.54 0.62x
BenchmarkEncode/Newton/Huffman/1e4-32 634.13 900.25 1.42x
BenchmarkEncode/Newton/Huffman/1e5-32 743.12 1175.04 1.58x
BenchmarkEncode/Newton/Huffman/1e6-32 601.00 970.70 1.62x
BenchmarkEncode/Newton/Speed/1e4-32 305.35 436.03 1.43x
BenchmarkEncode/Newton/Speed/1e5-32 176.80 296.96 1.68x
BenchmarkEncode/Newton/Speed/1e6-32 166.78 262.09 1.57x
BenchmarkEncode/Newton/Default/1e4-32 141.83 292.84 2.06x
BenchmarkEncode/Newton/Default/1e5-32 42.12 175.23 4.16x
BenchmarkEncode/Newton/Default/1e6-32 40.71 167.34 4.11x
BenchmarkEncode/Newton/Compression/1e4-32 139.85 128.75 0.92x
BenchmarkEncode/Newton/Compression/1e5-32 29.89 26.80 0.90x
BenchmarkEncode/Newton/Compression/1e6-32 27.96 25.15 0.90x
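For reading the two tables: the delta column appears to be the relative change in ns/op, and the speedup column the ratio of new to old throughput. The sketch below recomputes a few rows under that assumption; the function names are made up for this example.

```go
package main

import "fmt"

// deltaPercent returns the relative change in ns/op, in percent.
// Negative values mean the new code is faster.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

// speedup returns the throughput ratio new/old.
// Values above 1 mean the new code is faster.
func speedup(oldMBs, newMBs float64) float64 {
	return newMBs / oldMBs
}

func main() {
	// Digits/Huffman/1e4
	fmt.Printf("%+.2f%%\n", deltaPercent(11431, 8001)) // -30.01%
	fmt.Printf("%.2fx\n", speedup(874.80, 1249.91))    // 1.43x
	// Digits/Compression/1e4, where level 9 got slower
	fmt.Printf("%+.2f%%\n", deltaPercent(63859, 85275)) // +33.54%
	fmt.Printf("%.2fx\n", speedup(156.59, 117.27))      // 0.75x
}
```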
Static Memory Usage:
Before:
Level -2: Memory Used: 704KB, 8 allocs
Level -1: Memory Used: 776KB, 7 allocs
Level 0: Memory Used: 704KB, 7 allocs
Level 1: Memory Used: 1160KB, 13 allocs
Level 2: Memory Used: 776KB, 8 allocs
Level 3: Memory Used: 776KB, 8 allocs
Level 4: Memory Used: 776KB, 8 allocs
Level 5: Memory Used: 776KB, 8 allocs
Level 6: Memory Used: 776KB, 8 allocs
Level 7: Memory Used: 776KB, 8 allocs
Level 8: Memory Used: 776KB, 9 allocs
Level 9: Memory Used: 776KB, 8 allocs
After:
Level -2: Memory Used: 272KB, 12 allocs
Level -1: Memory Used: 1016KB, 7 allocs
Level 0: Memory Used: 304KB, 6 allocs
Level 1: Memory Used: 760KB, 13 allocs
Level 2: Memory Used: 1144KB, 8 allocs
Level 3: Memory Used: 1144KB, 8 allocs
Level 4: Memory Used: 888KB, 14 allocs
Level 5: Memory Used: 1016KB, 8 allocs
Level 6: Memory Used: 1016KB, 8 allocs
Level 7: Memory Used: 952KB, 7 allocs
Level 8: Memory Used: 952KB, 7 allocs
Level 9: Memory Used: 1080KB, 9 allocs
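One way to get numbers of this kind is to compare `runtime.MemStats` before and after creating a writer. This is only a sketch under that assumption and not necessarily how the figures above were produced; the helper name `newWriterCost` is made up, and the result can include noise from other allocations in the process.

```go
package main

import (
	"compress/flate"
	"fmt"
	"io"
	"runtime"
)

// newWriterCost reports the bytes and the number of allocations observed
// while constructing a flate.Writer at the given level.
func newWriterCost(level int) (size, allocs uint64, err error) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	w, err := flate.NewWriter(io.Discard, level)
	runtime.ReadMemStats(&after)
	if err != nil {
		return 0, 0, err
	}
	runtime.KeepAlive(w)
	// TotalAlloc and Mallocs are cumulative, so a GC in between does not matter.
	return after.TotalAlloc - before.TotalAlloc, after.Mallocs - before.Mallocs, nil
}

func main() {
	for level := flate.HuffmanOnly; level <= flate.BestCompression; level++ {
		size, allocs, err := newWriterCost(level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("Level %d: Memory Used: %dKB, %d allocs\n", level, size/1024, allocs)
	}
}
```

This only captures memory allocated at construction time; anything allocated lazily on the first write would need a write inside the measured region.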
This package has been fuzz tested for about 24 hours.
Currently, there is about 1h between new "interesting" finds.
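The core property such a fuzz run checks is a lossless round trip. A minimal sketch of that check follows, assuming only the public `compress/flate` API; the `roundTrip` helper is hypothetical and is not the fuzz target used for this change.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// roundTrip compresses data at the given level, decompresses it again, and
// returns an error if the result differs from the input.
func roundTrip(data []byte, level int) error {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	r := flate.NewReader(&buf)
	defer r.Close()
	got, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, data) {
		return fmt.Errorf("level %d: got %d bytes back, want %d", level, len(got), len(data))
	}
	return nil
}

// In a _test.go file, a fuzz target could wrap roundTrip like this:
//
//	func FuzzRoundTrip(f *testing.F) {
//		f.Add([]byte("hello, hello, hello"))
//		f.Fuzz(func(t *testing.T, data []byte) {
//			for level := flate.HuffmanOnly; level <= flate.BestCompression; level++ {
//				if err := roundTrip(data, level); err != nil {
//					t.Fatal(err)
//				}
//			}
//		})
//	}

func main() {
	for level := flate.HuffmanOnly; level <= flate.BestCompression; level++ {
		if err := roundTrip([]byte("hello, hello, hello, gophers"), level); err != nil {
			panic(err)
		}
	}
	fmt.Println("all levels round-trip")
}
```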
Change-Id: Icb4c9839dc8f1bb96fd6d548038679a7554a559b