Port of LZ4 lossless compression algorithm to Go

go-lz4

go-lz4 is a port of the LZ4 lossless compression algorithm to Go. The original C code is located at:

https://github.com/Cyan4973/lz4


Usage

go get github.com/bkaradzic/go-lz4

import "github.com/bkaradzic/go-lz4"

The package name is lz4.

Notes

  • go-lz4 saves a uint32 with the original uncompressed length at the beginning of the encoded buffer. This may get in the way of interoperability with other implementations.

Alternative

https://github.com/pierrec/lz4

Contributors

Damian Gryski (@dgryski)
Dustin Sallings (@dustin)

Contact

@bkaradzic
http://www.stuckingeometry.com

Project page
https://github.com/bkaradzic/go-lz4

License

Copyright 2011-2012 Branimir Karadzic. All rights reserved.
Copyright 2013 Damian Gryski. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Comments
  • Incompatibility with official LZ4

    Setup:

    • a backend which uses go-lz4
    • a client which uses the official LZ4 C library (there's no way to use Go there, unfortunately)

    The problem I have is that sometimes data from the backend can't be decoded on the client. The key word is "sometimes". I've got one sample (55K compressed, 114K uncompressed).

    I'm not sure yet which side contains the bug, but let me know if you want to check it out.

  • optimized cp(), 2x performance increase for large files

    AFAICT, cp() needs to append byte by byte because initially there may not be enough data in dst to cover the requested length. But once there's enough in dst, you can just append a whole slice.
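
The approach can be sketched in isolation (copyMatch here is an illustrative stand-in, not the actual cp() code): append byte by byte only while the source run still overlaps the end of dst, then finish with a single bulk append.

```go
package main

import "fmt"

// copyMatch appends length bytes starting at dst[pos] to dst. While the
// source range still extends past the current end of dst, bytes must be
// appended one at a time; once enough output exists, the remainder can
// be appended as a single slice.
func copyMatch(dst []byte, pos, length int) []byte {
	for length > 0 && pos+length > len(dst) {
		dst = append(dst, dst[pos])
		pos++
		length--
	}
	return append(dst, dst[pos:pos+length]...)
}

func main() {
	// A back-reference of length 6 into a 2-byte buffer: the classic
	// overlapping-copy case that forces the byte-by-byte path.
	out := copyMatch([]byte("ab"), 0, 6)
	fmt.Println(string(out)) // abababab
}
```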

  • Can't compress files larger than 2^17 bytes

    This is the size of the internal buffer, but there appear to be no checks to make sure we don't overrun it when we load data in writer.go's encoder.cache(). I haven't checked the reader, but it seems likely the same problem occurs in flush() if we try to write out a larger section than what we've read.

  • Incompatible NewWriter signature

    The typical signature for NewWriter is func NewWriter(wr io.Writer) *Writer, whereas the LZ4 code has NewWriter(r io.Reader) io.ReadCloser. This makes it unusable within normal Go programs.

  • Optimize decoder

    By preallocating our destination buffer, we can eliminate all the calls to append(). This, plus inlining two hot routines, gives a considerable speedup to Decode().

    Below are benchmarks ported from snappy-go:

    benchmark                  old ns/op    new ns/op    delta
    BenchmarkLZ4Decode           4480128      3150442  -29.68%
    BenchmarkWordsDecode1e3         6071         3506  -42.25%
    BenchmarkWordsDecode1e4        69195        45798  -33.81%
    BenchmarkWordsDecode1e5       744347       539174  -27.56%
    BenchmarkWordsDecode1e6      6616125      4841891  -26.82%
    
    benchmark                   old MB/s     new MB/s  speedup
    BenchmarkWordsDecode1e3       164.71       285.18    1.73x
    BenchmarkWordsDecode1e4       144.52       218.35    1.51x
    BenchmarkWordsDecode1e5       134.35       185.47    1.38x
    BenchmarkWordsDecode1e6       151.15       206.53    1.37x
    
  • Add some basic tests so Travis does more than just build.

    These are some sanity tests to make sure we actually encode/decode things.

    I'm using /usr/share/dict/words as a source of input. If you'd prefer (the line endings suggest to me you're on Windows), we can import a novel from Project Gutenberg and use that instead, so the data is actually committed to the repo.

    The snappy tests actually download their larger test data at benchmark time, if requested, so that's an option too. I wouldn't want to do that for the basic tests though.

  • Allow more seamless integration to other projects.

    This allows people to more easily use the library from within their code, without having to pull it down to a different location and manipulate their GOPATH.

  • hashTable allocating tons of memory

    $ go tool pprof populate /tmp/profile764028996/mem.pprof
    Entering interactive mode (type "help" for commands)
    (pprof) list lz4.Encode
    Total: 3.90GB
    ROUTINE ======================== github.com/bkaradzic/go-lz4.Encode in /home/ubuntu/go/src/github.com/bkaradzic/go-lz4/writer.go
        3.70GB     3.70GB (flat, cum) 94.89% of Total
             .          .    107:   if len(src) >= MaxInputSize {
             .          .    108:           return nil, ErrTooLarge
             .          .    109:   }
             .          .    110:
             .          .    111:   if n := CompressBound(len(src)); len(dst) < n {
        8.52MB     8.52MB    112:           dst = make([]byte, n)
             .          .    113:   }
             .          .    114:
        3.69GB     3.69GB    115:   e := encoder{src: src, dst: dst, hashTable: make([]uint32, hashTableSize)}
             .          .    116:
             .          .    117:   binary.LittleEndian.PutUint32(dst, uint32(len(src)))
             .          .    118:   e.dpos = 4
             .          .    119:
             .          .    120:   var (
    

    This line in the code is causing Badger to OOM when loading data really fast. Ideally, you want to reuse the same hashTable; this can be done via sync.Pool. Happy to send a PR if that'd help.
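
A rough sketch of the suggested fix, assuming a package-level pool (hashTableSize and the reset step here are illustrative, not the library's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

const hashTableSize = 1 << 16 // stand-in for the package's constant

// hashTablePool hands out reusable hash tables so Encode no longer
// allocates a fresh one per call.
var hashTablePool = sync.Pool{
	New: func() interface{} {
		return make([]uint32, hashTableSize)
	},
}

func encodeWithPooledTable() int {
	ht := hashTablePool.Get().([]uint32)
	defer func() {
		// A recycled table holds stale entries; zero it (or make the
		// encoder tolerant of them) before the next use.
		for i := range ht {
			ht[i] = 0
		}
		hashTablePool.Put(ht)
	}()
	// ... run the encoder with ht instead of allocating ...
	return len(ht)
}

func main() {
	fmt.Println(encodeWithPooledTable()) // 65536
}
```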

  • Use capacity to determine whether dst can be reused for encoding and decoding

    Also, modify the benchmarks to reuse the allocated slice. This improves the ns/op for both encoding and decoding by 0.4s/op (2.4s -> 2.0s for Decode, and similar for Encode).
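
The capacity check amounts to the following pattern (a sketch of the idea, not the library's exact code):

```go
package main

import "fmt"

// growBuf returns a slice of length n, reusing buf's backing array when
// its capacity allows and allocating a new one only when it does not.
func growBuf(buf []byte, n int) []byte {
	if cap(buf) >= n {
		return buf[:n]
	}
	return make([]byte, n)
}

func main() {
	buf := make([]byte, 0, 128)
	a := growBuf(buf, 64)  // capacity suffices: no allocation
	b := growBuf(buf, 256) // capacity too small: fresh allocation
	fmt.Println(len(a), cap(a), len(b), cap(b)) // 64 128 256 256
}
```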

  • Interoperability issue with other lz4 libraries

    I'm using two GitHub versions of LZ4: yours for Go, and https://github.com/jpountz/lz4-java for Java.

    I used your library in the code below:

    package main

    import (
    	"encoding/base64"
    	"fmt"
    	"os"

    	lz4 "github.com/bkaradzic/go-lz4"
    )

    func main() {
    	var data []byte
    	var err error

    	to_compress := "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

    	if data, err = lz4.Encode(nil, []byte(to_compress)); err != nil {
    		fmt.Fprintf(os.Stderr, "Failed to compress: '%s'", err)
    		return
    	}

    	fmt.Fprintf(os.Stderr, "Success! Length=%d\n", len(to_compress))

    	fmt.Fprintf(os.Stdout, "%s\n", base64.StdEncoding.EncodeToString(data))
    }
    

    and obtain the following output:

    raffi@iot-micro-raffi ~/tmp > go run ./test_lz4.go 
    Success! Length=100
    ZAAAAB94AQBLUHh4eHh4
    raffi@iot-micro-raffi ~/tmp > 
    

    Then I inject this output into a Java version to decode the result:

    import java.util.Base64;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FastDecompressor;
    
    public class test_lz4_decomp {
    
        public static void main(String[] args) {
            String compressed_base64= "ZAAAAB94AQBLUHh4eHh4";
            int original_size = 100;
            byte [] compressed = Base64.getDecoder().decode(compressed_base64);
    
            LZ4Factory _factory= LZ4Factory.fastestInstance();
            LZ4FastDecompressor decompressor = _factory.fastDecompressor();
            byte []restored = new byte[original_size];
            decompressor.decompress(compressed, 0, restored, 0,original_size);
    
            String decompressed_str=new String(restored);
    
            System.out.println(decompressed_str); 
        }
    
    }
    
    

    but obtain the following output:

    raffi@iot-micro-raffi ~/tmp > java -cp ".:lz4-1.3.0.jar" test_lz4_decomp
    Exception in thread "main" net.jpountz.lz4.LZ4Exception: Error decoding offset 58 of input buffer
    	at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
    	at test_lz4_decomp.main(test_lz4_decomp.java:16)
    raffi@iot-micro-raffi ~/tmp > 
    
    

    The expected result would be that the Java version should be able to decompress the output of the Golang version.

  • LZ4 Framing

    I'd like to see LZ4 framing supported.

    The frame descriptor includes an optional content-size field, which may help to eliminate go-lz4's unique means of storing it:

    go-lz4 saves a uint32 with the original uncompressed length at the beginning of the encoded buffer
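
For reference, the LZ4 frame format opens with the magic number 0x184D2204 (stored little-endian), so framed data can be recognized up front. A minimal check:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// lz4FrameMagic opens every LZ4 frame (stored little-endian),
// per the LZ4 frame format specification.
const lz4FrameMagic = 0x184D2204

func isLZ4Frame(b []byte) bool {
	return len(b) >= 4 && binary.LittleEndian.Uint32(b) == lz4FrameMagic
}

func main() {
	frame := []byte{0x04, 0x22, 0x4D, 0x18} // frame header start
	raw := []byte{0x64, 0x00, 0x00, 0x00}   // go-lz4 length prefix instead
	fmt.Println(isLZ4Frame(frame), isLZ4Frame(raw)) // true false
}
```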

  • encode with io.Writer / decode with io.Reader

    I'd love to use this code, but my project involves compressing files up to gigabytes in size.

    For that use case I really need a streaming version of the algorithm (i.e. using the io.Writer/io.Reader interfaces). I see there's one as part of the reference C implementation, but I really have no idea where to start with porting it to Go.
