Go parallel gzip (de)compression

pgzip

Go parallel gzip compression/decompression. This is a fully gzip-compatible drop-in replacement for "compress/gzip".

This will split compression into blocks that are compressed in parallel. This can be useful for compressing large amounts of data. The output is a standard gzip file.

The gzip decompression is modified so it decompresses ahead of the current reader. This means that reads will be non-blocking if the decompressor can keep ahead of your code reading from it. CRC calculation also takes place in a separate goroutine.

You should only use this if you are (de)compressing large amounts of data, say more than 1MB at a time; otherwise you will not see any benefit, and it will likely be faster to use the standard library gzip instead.

It is important to note that this library creates and reads standard gzip files. You do not have to match the compressor/decompressor to get the described speedups, and the gzip files are fully compatible with other gzip readers/writers.
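
For example, data written with pgzip can be read back by the standard library reader. A minimal sketch (error handling kept short):

package main

import (
	"bytes"
	stdgzip "compress/gzip"
	"fmt"
	"io/ioutil"

	gzip "github.com/klauspost/pgzip"
)

func main() {
	// Compress with pgzip...
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte("hello, world\n"))
	w.Close()

	// ...and decompress with the standard library.
	r, err := stdgzip.NewReader(&buf)
	if err != nil {
		panic(err)
	}
	data, _ := ioutil.ReadAll(r)
	r.Close()
	fmt.Printf("%s", data)
}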

A golang variant of this is bgzf, which has the same features, as well as seeking in the resulting file. The only drawback is a slightly bigger overhead compared to this and pure gzip. See the comparison below.


Installation

go get github.com/klauspost/pgzip/...

You might need to get/update the dependencies:

go get -u github.com/klauspost/compress

Usage

GoDoc Documentation

To use as a replacement for gzip, exchange

import "compress/gzip" with import gzip "github.com/klauspost/pgzip".

Changes

  • Oct 6, 2016: Fixed an issue if the destination writer returned an error.
  • Oct 6, 2016: Better buffer reuse, should now generate less garbage.
  • Oct 6, 2016: Output does not change based on write sizes.
  • Dec 8, 2015: Decoder now supports the io.WriterTo interface, giving a speedup and less GC pressure.
  • Oct 9, 2015: Reduced allocations by ~35% by using sync.Pool. ~15% overall speedup.

Changes in github.com/klauspost/compress are also carried over, so see that for more changes.

Compression

The simplest way to use this is to simply do the same as you would when using compress/gzip.

To change the block size, use the added (*pgzip.Writer).SetConcurrency(blockSize, blocks int) function. With this you can control the approximate size of your blocks, as well as how many you want to be processed in parallel. The default is SetConcurrency(1MB, runtime.GOMAXPROCS(0)), meaning blocks are split at 1 MB, and up to one block per CPU thread can be compressed at once before the writer blocks.

Example:

var b bytes.Buffer
w := gzip.NewWriter(&b)
w.SetConcurrency(100000, 10)
w.Write([]byte("hello, world\n"))
w.Close()

To get any performance gains, you should at least be compressing more than 1 megabyte of data at a time.

You should use a block size of at least 100k, and at least as many blocks as the number of cores you would like to utilize; about twice that number of blocks is ideal.

Another side effect is that it is likely to speed up your other code, since writes to the compressor only block if the compressor is already compressing the number of blocks you have specified. This also means you don't have to worry about buffering input to the compressor.
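
As a sketch of that streaming pattern (file names here are made up; adjust block size and block count to taste):

package main

import (
	"io"
	"os"
	"runtime"

	gzip "github.com/klauspost/pgzip"
)

func main() {
	in, err := os.Open("big.dat") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("big.dat.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := gzip.NewWriter(out)

	// 1 MB blocks, up to twice GOMAXPROCS blocks in flight at once.
	w.SetConcurrency(1<<20, runtime.GOMAXPROCS(0)*2)

	// io.Copy feeds the compressor; a Write only blocks once all
	// requested blocks are already being compressed.
	if _, err := io.Copy(w, in); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
}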

Decompression

Decompression works similarly to compression. That means you simply call pgzip the same way as you would call compress/gzip.

The only difference is that if you want to specify your own readahead, you have to use pgzip.NewReaderN(r io.Reader, blockSize, blocks int) to get a reader with your custom block sizes. The blockSize is the size of each block decoded, and blocks is the maximum number of blocks that are decoded ahead.

See Example on playground
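
Along the same lines, a minimal decompression sketch (the file name and read-ahead numbers are just illustrations):

package main

import (
	"io"
	"os"

	gzip "github.com/klauspost/pgzip"
)

func main() {
	f, err := os.Open("big.dat.gz") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Decode in 1 MB blocks and keep up to 8 blocks decoded ahead.
	r, err := gzip.NewReaderN(f, 1<<20, 8)
	if err != nil {
		panic(err)
	}
	defer r.Close()

	// Data is streamed to stdout while the reader keeps decoding
	// ahead in the background.
	if _, err := io.Copy(os.Stdout, r); err != nil {
		panic(err)
	}
}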

Performance

Compression

See my blog post on Benchmarks of Golang Gzip.

Compression cost is usually about 0.2% with default settings and a block size of 250k.

Example with GOMAXPROCS set to 32 (16-core CPU)

Content is Matt Mahoney's 10GB corpus. Compression level 6.

Compressor        | MB/sec                | speedup | size       | size overhead (lower=better)
------------------|-----------------------|---------|------------|-----------------------------
gzip (golang)     | 15.44MB/s (1 thread)  | 1.0x    | 4781329307 | 0%
gzip (klauspost)  | 135.04MB/s (1 thread) | 8.74x   | 4894858258 | +2.37%
pgzip (klauspost) | 1573.23MB/s           | 101.9x  | 4902285651 | +2.53%
bgzf (biogo)      | 361.40MB/s            | 23.4x   | 4869686090 | +1.85%
pargzip (builder) | 306.01MB/s            | 19.8x   | 4786890417 | +0.12%

pgzip also contains a linear-time compression mode that will allow compression at ~250MB per core per second, independent of the content.
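
If you want to try that mode, something like the following should work, assuming pgzip mirrors the compress/gzip level constants (HuffmanOnly selects the linear-time, Huffman-only mode):

package main

import (
	"io"
	"os"

	gzip "github.com/klauspost/pgzip"
)

func main() {
	out, err := os.Create("fast.gz") // hypothetical output file
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// HuffmanOnly is assumed to be exposed just like in compress/gzip.
	w, err := gzip.NewWriterLevel(out, gzip.HuffmanOnly)
	if err != nil {
		panic(err)
	}

	if _, err := io.Copy(w, os.Stdin); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
}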

See the complete sheet for different content types and compression settings.

Decompression

The decompression speedup is there because it allows you to do other work while the decompression is taking place.

In the example above, the numbers are as follows on a 4 CPU machine:

Decompressor   | Time     | Speedup
---------------|----------|--------
gzip (golang)  | 1m28.85s | 0%
pgzip (golang) | 43.48s   | 104%

But wait, since gzip decompression is inherently single-threaded (aside from CRC calculation), how can it be more than 100% faster? Because pgzip, due to its design, also acts as a buffer. When using unbuffered gzip, you are also waiting for I/O while you are decompressing. If the decoder can keep up, it will always have data ready for your reader, and you will not be waiting for input to the gzip decompressor to complete.

This is pretty much an optimal situation for pgzip, but it reflects most common use cases for CPU-intensive gzip usage.

I haven't included bgzf in this comparison, since it can only decompress files created by a compatible encoder, and therefore cannot be considered a generic gzip decompressor. But if you are able to compress your files with a bgzf-compatible program, you can expect it to scale beyond 100%.

License

This contains large portions of code from the Go repository - see GO_LICENSE for more information. The changes are released under MIT License. See LICENSE for more information.

Comments
  • Missing lines with the uncompression example

    I am trying to uncompress a large (24GB) file with the provided example (https://play.golang.org/p/uHv1B5NbDh) and the number of lines computed doesn't match the number expected.

    The compressed file has ~94M lines and the output shows ~79M ...

    What am I missing here?

    Thanks

  • gzip: fix memory allocs (buffers not returned to pool)

    Fix allocation leaks during gzip writes. Due to incorrect use of the dstPool (unmatched Get and Put), large amounts of memory were temporarily allocated during writes and not put back into the pool.

    This also removes some special handling code in compressCurrent that would recursively call itself for too large input buffers. This condition can never occur, because Write ensures that blocks are capped and there is no other public interface that extends currentBuffer. The recursive call that slices the buffer would have made returning byte slices to the Pool dangerous, as we could have been returning the same underlying buffer multiple times.

    This also adds a test to check allocations per Write to prevent regressions. There is further room for improvement, but this was by far the biggest leak.

    Closes #8

    ~~Additionally, this adds a go.mod for Go modules support.~~

    (Note that tests broke with recent commit a8ba21498dc99e88bfc7677aa9b3ef38ef0101cc).

  • Add method to determine whether a file is pgzip or not

    I use many compressors (zip, gzip, pgzip, bgzf) and need to understand what kind of file I have underneath. For example, if I download a bgzf file I need to take a code path that can seek inside the file; in the case of gzip/pgzip I need to switch to other things (like enabling more CPUs or not..). Is it possible to add such a method?

  • Unexpected, and nondeterministic, panic reading `.tar.gz`

    NOTE: Borrowing the 'bug report' template from mholt/archiver which I'm coming indirectly from.

    What version of the package or command are you using?

    v1.2.5 via mholt/archiver latest version 3.5.1.

    What are you trying to do?

    Read a .tar.gz file and check that specific files are contained within it.

    What steps did you take?

    You can find a copy of the archive file here: https://github.com/fastly/cli/blob/main/pkg/commands/compute/testdata/deploy/pkg/package.tar.gz

    Here is the code that attempts to validate the file: https://github.com/fastly/cli/blob/main/pkg/commands/compute/validate.go#L51-L105

    What did you expect to happen, and what actually happened instead?

    I noticed that my test suite would fail nondeterministically with a panic raised from this project indirectly via the mholt/archiver dependency my project uses.

    If I run my test suite with a non-default -count value of 20 (the default is 1), then I can reliably get the test suite to panic. What happens is that the archive file is read and one of the tests will eventually try to read the file and fail. The tests are not using t.Parallel(), so there is no need to synchronise access to the archive file.

    Below is the test run error stack trace (you'll notice that there are two tests that run and they pass multiple times before nondeterministically failing with a panic)...

    NOTE: From what I can see our code calls the mholt/archiver's tar.Read() method here which then triggers the panic down in klauspost/pgzip.(*Reader).Read here.

    --- PASS: TestDeploy (24.79s)
        --- PASS: TestDeploy/service_domain_error (19.51s)
        --- PASS: TestDeploy/service_backend_error (5.20s)
    
    === RUN   TestDeploy
    === RUN   TestDeploy/service_domain_error
    panic: test timed out after 30s
    
    goroutine 153 [running]:
    testing.(*M).startAlarm.func1()
            /usr/local/go/src/testing/testing.go:1788 +0xbb
    created by time.goFunc
            /usr/local/go/src/time/sleep.go:180 +0x4a
    
    goroutine 1 [chan receive]:
    testing.(*T).Run(0xc00050dba0, {0x1ffd734, 0xa}, 0x203eae0)
            /usr/local/go/src/testing/testing.go:1307 +0x752
    testing.runTests.func1(0x0)
            /usr/local/go/src/testing/testing.go:1598 +0x9a
    testing.tRunner(0xc00050dba0, 0xc0000e3bf8)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    testing.runTests(0xc00029c200, {0x28cbe60, 0x12, 0x12}, {0x0, 0xc00028c840, 0x28dc3a0})
            /usr/local/go/src/testing/testing.go:1596 +0x7cb
    testing.(*M).Run(0xc00029c200)
            /usr/local/go/src/testing/testing.go:1504 +0x9d2
    main.main()
            _testmain.go:79 +0x22c
    
    goroutine 150 [chan receive]:
    testing.(*T).Run(0xc00050dd40, {0x200777d, 0x14}, 0xc00027cf60)
            /usr/local/go/src/testing/testing.go:1307 +0x752
    github.com/fastly/cli/pkg/commands/compute_test.TestDeploy(0xc00050dd40)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy_test.go:1262 +0x9ec9
    testing.tRunner(0xc00050dd40, 0x203eae0)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    created by testing.(*T).Run
            /usr/local/go/src/testing/testing.go:1306 +0x727
    
    goroutine 151 [chan receive]:
    github.com/klauspost/pgzip.(*Reader).Read(0xc000613180, {0xc000344000, 0x28c5540, 0x2000})
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/klauspost/pgzip/gunzip.go:451 +0x134
    io.(*LimitedReader).Read(0xc00000e378, {0xc000344000, 0x2000, 0x2000})
            /usr/local/go/src/io/io.go:473 +0xc6
    io.discard.ReadFrom({}, {0x2215c40, 0xc00000e378})
            /usr/local/go/src/io/io.go:598 +0x92
    io.copyBuffer({0x2216d00, 0x290d960}, {0x2215c40, 0xc00000e378}, {0x0, 0x0, 0x0})
            /usr/local/go/src/io/io.go:409 +0x1c3
    io.Copy(...)
            /usr/local/go/src/io/io.go:382
    io.CopyN({0x2216d00, 0x290d960}, {0xc584120, 0xc000613180}, 0x22113a0)
            /usr/local/go/src/io/io.go:358 +0xcc
    archive/tar.discard({0xc584120, 0xc000613180}, 0x22113a0)
            /usr/local/go/src/archive/tar/reader.go:852 +0x150
    archive/tar.(*Reader).next(0xc0000ecd80)
            /usr/local/go/src/archive/tar/reader.go:68 +0xef
    archive/tar.(*Reader).Next(0xc0000ecd80)
            /usr/local/go/src/archive/tar/reader.go:51 +0x53
    github.com/mholt/archiver/v3.(*Tar).Read(0xc00060e940)
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/mholt/archiver/v3/tar.go:441 +0xa5
    github.com/fastly/cli/pkg/commands/compute.validate({0xc00015cba0, 0x12})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/validate.go:78 +0x54d
    github.com/fastly/cli/pkg/commands/compute.validatePackage({{{0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, 0x2, ...}, ...}, ...)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy.go:414 +0x650
    github.com/fastly/cli/pkg/commands/compute.(*DeployCommand).Exec(0xc0000b24e0, {0x2216220, 0xc00008b420}, {0x2215040, 0xc00027d050})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy.go:106 +0x5d8
    github.com/fastly/cli/pkg/app.Run({0xc0004fff80, {0xc00016e7c0, 0x4, 0x4}, {{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...}, ...})
            /Users/integralist/Code/fastly/cli-main-branch/pkg/app/run.go:171 +0x1f95
    github.com/fastly/cli/pkg/commands/compute_test.TestDeploy.func1(0xc000183520)
            /Users/integralist/Code/fastly/cli-main-branch/pkg/commands/compute/deploy_test.go:1352 +0x1165
    testing.tRunner(0xc000183520, 0xc00027cf60)
            /usr/local/go/src/testing/testing.go:1259 +0x230
    created by testing.(*T).Run
            /usr/local/go/src/testing/testing.go:1306 +0x727
    
    goroutine 152 [running]:
            goroutine running on other thread; stack unavailable
    created by github.com/klauspost/pgzip.(*Reader).doReadAhead
            /Users/integralist/Code/fastly/cli-main-branch/vendor/github.com/klauspost/pgzip/gunzip.go:379 +0x528
    FAIL    github.com/fastly/cli/pkg/commands/compute      31.614s
    ?       github.com/fastly/cli/pkg/commands/compute/setup        [no test files]
    FAIL
    make: *** [test] Error 1
    

    How do you think this should be fixed?

    I'm not sure, because mholt/archiver is already using the latest version of pgzip (1.2.5). I guess ideally pgzip shouldn't panic unexpectedly. But I suspect there's something that I (or mholt/archiver) am doing wrong.

  • SetConcurrency(?,?)

    If only one goroutine is used in the setting, how does the performance compare with plain gzip? And for SetConcurrency(?,?), what is the optimal setting for the second parameter?

  • gunzip: Reset may lose buffers

    If Reset is called on the gzip reader without reading the gzip stream until EOF, the read-ahead goroutine may lose buffers from the block pool.

    A simple way to reproduce this is to do something like:

    r, _ := pgzip.NewReader(in)
    r.Reset(in)
    io.Copy(ioutil.Discard, r)
    

    Running this will succeed most of the time, but sometimes it will deadlock. What happens is that by the time killReadAhead() is called, the read-ahead goroutine might have already taken a buffer from the block pool. This buffer is either sent into the readAhead channel and lost when that channel is reinitialized the next time doReadAhead is called, or it is lost when the goroutine exits in gunzip.go#L419. In both cases the buffer taken by the read-ahead goroutine is never returned to the block pool. If all buffers are lost, the reader will deadlock completely.

    The easiest way to fix this would probably be to reinitialize and fill the block pool in either Reset or doReadAhead.
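
    For reference, a self-contained version of the reproducer might look like this (the data setup is made up; it only has to produce a valid gzip stream):

    package main

    import (
    	"bytes"
    	"io"
    	"io/ioutil"

    	"github.com/klauspost/pgzip"
    )

    func main() {
    	// Build a small gzip stream in memory so the example is self-contained.
    	var buf bytes.Buffer
    	zw := pgzip.NewWriter(&buf)
    	zw.Write(bytes.Repeat([]byte("hello world\n"), 1<<15))
    	zw.Close()

    	in := bytes.NewReader(buf.Bytes())

    	r, err := pgzip.NewReader(in)
    	if err != nil {
    		panic(err)
    	}

    	// Reset before the stream has been read to EOF; per the report above,
    	// this can strand read-ahead buffers and occasionally deadlock.
    	in.Seek(0, io.SeekStart)
    	if err := r.Reset(in); err != nil {
    		panic(err)
    	}
    	io.Copy(ioutil.Discard, r)
    }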

  • panic: close of closed channel

    While using pgzip, I'm getting this error in some (not yet fully debugged) situations:

    panic: close of closed channel
    
    goroutine 2364 [running]:
    github.com/klauspost/pgzip.(*Writer).checkError(0xc208060100, 0x0, 0x0)
    /home/ubuntu/.go_workspace/src/github.com/klauspost/pgzip/gzip.go:254 +0xee
    github.com/klauspost/pgzip.(*Writer).Write(0xc208060100, 0xc20827a000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
    /home/ubuntu/.go_workspace/src/github.com/klauspost/pgzip/gzip.go:273 +0x6b
    [...]
    

    If it can help, the underlying writer for pgzip is an io.Pipe() writer, and the other end is copied into a socket, so it looks like the bug is related to the packetization of the data over the wire, since it's not fully reproducible.

    From my reading of the code, it looks like the panic is actually a failure to propagate an underlying error code, so I will just put a print there to see what error code was triggered in the first place. Meanwhile, the traceback might be enough to point you to the bug causing the panic.

  • Possibility to reduce memory consumption

    Hello. I'm from the Kopia project. We use a bunch of your compressors in our repo. Kopia has a benchmark command that, given an input file, runs all the compressors on it and reports metrics such as compression ratio, throughput and memory consumption. pgzip seems to have huge stats compared to, say, s2.

    The memory consumption stat is calculated by calling runtime.ReadMemStats() before and after the compression loop, then comparing the delta. Note that this is not about a memory leak, just allocation.
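
    A rough sketch of that measurement approach (not Kopia's actual code):

    package main

    import (
    	"fmt"
    	"runtime"
    )

    // measure reports the total bytes allocated while fn runs, in the same
    // spirit as the benchmark described above.
    func measure(fn func()) uint64 {
    	var before, after runtime.MemStats
    	runtime.GC()
    	runtime.ReadMemStats(&before)
    	fn()
    	runtime.ReadMemStats(&after)
    	return after.TotalAlloc - before.TotalAlloc
    }

    func main() {
    	delta := measure(func() {
    		_ = make([]byte, 1<<20) // stand-in for a compression loop
    	})
    	fmt.Printf("allocated ~%d bytes\n", delta)
    }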

    Baseline: compressing a 400MB highly compressible file just once. All compressors behave similarly
    Repeating 1 times per compression method (total 466.7 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-default                 127.1 MiB    4 GiB/s      3126   375.4 MiB
      1. s2-better                  120.1 MiB    3.4 GiB/s    2999   351.7 MiB
      2. s2-parallel-8              127.1 MiB    2.8 GiB/s    2981   362.2 MiB
      3. s2-parallel-4              127.1 MiB    2.3 GiB/s    2951   344.1 MiB
      4. pgzip-best-speed           96.7 MiB     2.1 GiB/s    4127   324.1 MiB
      5. pgzip                      86.3 MiB     1.2 GiB/s    4132   298.7 MiB
      6. lz4                        131.8 MiB    458.9 MiB/s  17     321.7 MiB
      7. zstd-fastest               79.8 MiB     356.2 MiB/s  22503  246 MiB
      8. zstd                       76.8 MiB     323.7 MiB/s  22605  237.8 MiB
      9. deflate-best-speed         96.7 MiB     220.8 MiB/s  45     310.8 MiB
     10. gzip-best-speed            94.9 MiB     165 MiB/s    40     305.2 MiB
     11. deflate-default            86.3 MiB     143.1 MiB/s  34     311 MiB
     12. zstd-better-compression    74.2 MiB     104 MiB/s    22496  251.4 MiB
     13. pgzip-best-compression     83 MiB       55.9 MiB/s   4359   299.1 MiB
     14. gzip                       83.6 MiB     40.5 MiB/s   69     304.8 MiB
     15. zstd-best-compression      68.9 MiB     19.2 MiB/s   22669  303.4 MiB
     16. deflate-best-compression   83 MiB       5.6 MiB/s    134    311 MiB
     17. gzip-best-compression      83 MiB       5.1 MiB/s    137    304.8 MiB
    
    Compressing the first 128KB of the same file but repeating 10 times, you can see the higher memory consumption of pgzip among the compressors
    Repeating 10 times per compression method (total 1.2 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-default                 43.6 KiB     625.3 MiB/s  71     2.1 MiB
      1. s2-parallel-4              43.6 KiB     625.3 MiB/s  67     2.1 MiB
      2. s2-parallel-8              43.6 KiB     624.5 MiB/s  67     2.1 MiB
      3. s2-better                  41.3 KiB     416.8 MiB/s  72     2.1 MiB
      4. deflate-best-speed         34.3 KiB     208.3 MiB/s  22     874.6 KiB
      5. zstd-fastest               28.6 KiB     178.6 MiB/s  160    9.4 MiB
      6. lz4                        44.7 KiB     178.5 MiB/s  38     88.6 MiB
      7. gzip-best-speed            33.7 KiB     138.9 MiB/s  28     1.2 MiB
      8. deflate-default            31.2 KiB     125 MiB/s    22     1.1 MiB
      9. zstd                       26.8 KiB     113.6 MiB/s  174    18.4 MiB
     10. pgzip-best-speed           34.3 KiB     113.6 MiB/s  252    27.3 MiB
     11. zstd-better-compression    26.3 KiB     96.2 MiB/s   156    37.2 MiB
     12. pgzip                      31.2 KiB     74.5 MiB/s   342    31.7 MiB
     13. gzip                       30.4 KiB     39.1 MiB/s   26     874.7 KiB
     14. deflate-best-compression   30.4 KiB     25.5 MiB/s   21     1 MiB
     15. gzip-best-compression      30.4 KiB     24 MiB/s     26     874.7 KiB
     16. pgzip-best-compression     30.4 KiB     23.2 MiB/s   285    30.2 MiB
     17. zstd-best-compression      25.1 KiB     16.9 MiB/s   155    99.2 MiB
    
    Repeating 100 times, s2 has exactly the same stats, while pgzip grows accordingly
    Repeating 100 times per compression method (total 12.5 MiB).
    
         Compression                Compressed   Throughput   Memory Usage
    ------------------------------------------------------------------------------------------------
      0. s2-parallel-4              43.6 KiB     833.4 MiB/s  533    2.1 MiB
      1. s2-parallel-8              43.6 KiB     833.3 MiB/s  555    2.1 MiB
      2. s2-default                 43.6 KiB     833.3 MiB/s  579    2.1 MiB
      3. s2-better                  41.3 KiB     500 MiB/s    610    2.1 MiB
      4. zstd-fastest               28.6 KiB     240.4 MiB/s  925    9.5 MiB
      5. deflate-best-speed         34.3 KiB     198.4 MiB/s  22     874.6 KiB
      6. zstd                       26.8 KiB     165.4 MiB/s  907    18.5 MiB
      7. zstd-better-compression    26.3 KiB     162.3 MiB/s  881    37.3 MiB
      8. gzip-best-speed            33.7 KiB     150.6 MiB/s  28     1.2 MiB
      9. pgzip-best-speed           34.3 KiB     143.7 MiB/s  1649   220.2 MiB
     10. deflate-default            31.2 KiB     126.3 MiB/s  22     1.1 MiB
     11. lz4                        44.7 KiB     112.6 MiB/s  435    816.7 MiB
     12. pgzip                      31.2 KiB     94.6 MiB/s   2634   277.5 MiB
     13. gzip                       30.4 KiB     39.5 MiB/s   26     874.7 KiB
     14. deflate-best-compression   30.4 KiB     25.4 MiB/s   21     1 MiB
     15. gzip-best-compression      30.4 KiB     24.5 MiB/s   27     874.9 KiB
     16. pgzip-best-compression     30.4 KiB     23.1 MiB/s   2646   281.8 MiB
     17. zstd-best-compression      25.1 KiB     19.3 MiB/s   882    99.3 MiB
    

    I did some experiments around SetConcurrency() and found that:

    1. The consumption grows slowly as blocks increases, and exponentially as blockSize increases, possibly due to the z.dstPool.New = func() interface{} { return make([]byte, 0, blockSize+(blockSize)>>4) } line.
    2. Even by just creating a new writer and immediately closing it, the allocation still happens, possibly due to the internal compressCurrent().

    Is there a bug here? Why allocate memory when no data is compressed? And can Reset() reuse previously allocated memory instead of allocating new memory (like s2 does)?

  • pgzip.Writer causes panics in bufio.Write

    When pgzip's Writer is used as the writer underneath bufio, calls to Write can result in panics like:

    bufio: writer returned negative count from Write
    

    which come from this bit of code in bufio:

    var errNegativeWrite = errors.New("bufio: writer returned negative count from Write")
    
    // writeBuf writes the Reader's buffer to the writer.
    func (b *Reader) writeBuf(w io.Writer) (int64, error) {
        n, err := w.Write(b.buf[b.r:b.w])
        if n < 0 {
            panic(errNegativeWrite)
        }
        b.r += n
        return int64(n), err
    }
    

    This seems to be due to the section of Write that actually writes the compressed data to the underlying buffer, definitely at least during the first iteration, and possibly others. The issue is with this return:

    if err := z.checkError(); err != nil {
    	return len(p) - len(q) - length, err
    }
    

    On the first iteration of this loop q := p, and length is a positive integer, so this will always return a negative number, causing bufio to panic rather than propagating the error back to the caller.

    I think a simple fix here is to return the max of 0 and that value.

  • Publish version tags that are compatible with Go Modules

    Go 1.12 and beyond require tags to be published in the form vN.N.N (all three numbers are required). Right now there exist v1.1 and v½.2.0, which the compiler doesn't handle as one might expect.

    For example (using Go1.11 + GO111MODULE=on),

    v1.1 does not work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.1
    )
    [p1 foobar] $ go build
    go: errors parsing go.mod:
    /tmp/foobar/go.mod:4: invalid module version "v1.1": no matching versions for query "v1.1"
    

    whereas v1.0.1 does work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.0.1
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/compress/flate latest
    go: finding github.com/klauspost/crc32 latest
    [p1 foobar] $
    

    tag v½.2.0 seems to trigger the fallback timestamp+hash pseudo version behavior:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v½.2.0
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/pgzip v½.2.0
    go: finding github.com/klauspost/compress/flate latest
    go: finding github.com/klauspost/crc32 latest
    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/compress v1.4.1 // indirect
    	github.com/klauspost/cpuid v1.2.0 // indirect
    	github.com/klauspost/crc32 v0.0.0-20170628072449-bab58d77464a // indirect
    	github.com/klauspost/pgzip v1.0.2-0.20180717084224-c4ad2ed77aec
    )
    

    Attempting to use v1.1.0 (as one might expect to be synonymous with v1.1) also doesn't work:

    [p1 foobar] $ cat go.mod
    module foobar
    
    require (
    	github.com/klauspost/pgzip v1.1.0
    )
    [p1 foobar] $ go build
    go: finding github.com/klauspost/pgzip v1.1.0
    go: github.com/klauspost/[email protected]: unknown revision v1.1.0
    go: error loading module requirements
    
  • Please document the version of Go that was used

    On Go 1.11 beta2, I'm seeing the same performance from stdlib and pgzip for decompression, so it would be useful to know when your benchmarks were done.

  • gunzip: improve EOF handling

    This fixes #38 and #39, though I'm not entirely sure if you're happy with this approach.

    To solve #39, we switch from using a channel for the block pool and instead use a sync.Pool. This does have the downside that the read-ahead goroutine can now end up allocating more blocks than the user requested. If this is not acceptable I can try to figure out a different solution for this problem. By using sync.Pool, there is no issue of blocking on a goroutine channel send when there are no other threads reading from it.

    To solve #38, some extra io.EOF special casing was needed in both WriteTo and Read. I think that these changes are reasonable -- it seems as though z.err should never store io.EOF (and there were only a few cases where it would -- which I've now fixed), but let me know what you think.

    Fixes #38 Fixes #39

    Signed-off-by: Aleksa Sarai [email protected]

  • goroutine deadlock if Read or WriteTo is called after WriteTo end of stream

    I found this when playing around with the reproducer for #38. It seems as though if you do an io.Copy of a stream (which uses z.WriteTo), followed by ReadAll (which uses z.Read) you end up with a goroutine deadlock. https://play.golang.org/p/x6u6JSoKd2t

    package main
    
    import (
    	"bytes"
    	"fmt"
    	"io"
    	"io/ioutil"
    
    	"github.com/klauspost/pgzip"
    )
    
    // echo hello | gzip -c | xxd -i
    var gzipData = []byte{
    	0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0xcb, 0x48,
    	0xcd, 0xc9, 0xc9, 0xe7, 0x02, 0x00, 0x20, 0x30, 0x3a, 0x36, 0x06, 0x00,
    	0x00, 0x00,
    }
    
    func main() {
    	buf := bytes.NewBuffer(gzipData)
    
    	rdr, err := pgzip.NewReader(buf)
    	if err != nil {
    		panic(err)
    	}
    
    	n, err := io.Copy(ioutil.Discard, rdr)
    	fmt.Printf("io.Copy at start of stream: n=%v, err=%v\n", n, err)
    
    	b, err := ioutil.ReadAll(rdr)
    	if err != nil {
    		panic(err)
    	}
    	fmt.Printf("read %q from stream\n", string(b))
    }
    
    io.Copy at start of stream: n=6, err=<nil>
    fatal error: all goroutines are asleep - deadlock!
    
    goroutine 1 [chan send]:
    github.com/klauspost/pgzip.(*Reader).Read(0xc00006ea80, 0xc000120000, 0x200, 0x200, 0xc000120000, 0x0, 0x0)
    	/tmp/gopath805285542/pkg/mod/github.com/klauspost/[email protected]/gunzip.go:473 +0xfe
    bytes.(*Buffer).ReadFrom(0xc000043e80, 0x5055a0, 0xc00006ea80, 0xc000062a90, 0xc00011e000, 0x29)
    	/usr/local/go-faketime/src/bytes/buffer.go:204 +0xb1
    io/ioutil.readAll(0x5055a0, 0xc00006ea80, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
    	/usr/local/go-faketime/src/io/ioutil/ioutil.go:36 +0xe5
    io/ioutil.ReadAll(...)
    	/usr/local/go-faketime/src/io/ioutil/ioutil.go:45
    main.main()
    	/tmp/sandbox128643491/prog.go:30 +0x1fc
    
    Program exited: status 2.
    
  • io.Copy(pgzip.Reader) returns io.EOF if stream already complete due to WriteTo implementation

    It turns out that if you have a pgzip.Reader which has read to the end of the stream, if you call io.Copy on that stream you get io.EOF -- which should never happen and causes spurious errors on callers that check error values from io.Copy. I hit this when working on opencontainers/umoci#360.

    This happens because (as an optimisation) io.Copy will use the WriteTo method of the reader (or ReadFrom method of the writer) if they support that method. And in this mode, io.Copy simply returns whatever error the reader or writer give it -- meaning it doesn't hide io.EOFs returned from those methods. In the normal Read+Write mode, io.Copy does hide the error.

    It seems as though this is at some level a stdlib bug, because this requirement of io.WriterTo and io.ReaderFrom implementations (don't return io.EOF because io.Copy can't handle it) is not spelled out anywhere in the documentation. So either io.Copy should handle this, or this requirement should be documented. So I will open a parallel issue on the Go tracker for this problem.

    But for now, it seems that the WriteTo implementation should avoid returning io.EOF. If the reader reaches an io.EOF before it is expected, the error should instead be io.ErrUnexpectedEOF.
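
    A generic sketch of that mapping (not the actual pgzip code, just the pattern a WriteTo implementation could apply before returning):

    package sketch

    import "io"

    // mapCopyErr illustrates the suggestion above: a WriteTo implementation
    // should not surface io.EOF to io.Copy; a premature end of the underlying
    // stream is reported as io.ErrUnexpectedEOF instead.
    func mapCopyErr(err error) error {
    	if err == io.EOF {
    		return io.ErrUnexpectedEOF
    	}
    	return err
    }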

  • A parallel zlib implementation?

    Hi there,

    Any chance of implementing pgzip for plain zlib? As far as I can tell, the only things that differ between the two formats are the headers and the CRC.

    Cheers, Gabriel
