Implementation of external merge sort with 2-way merge.

External merge sort

An implementation of external merge sort algorithm (with 2-way merge) written in Go. This particular implementation sorts strings. The implementation is located here.

As it's an implementation of an algorithm in external memory, we are interested in disk usage. If the size of the main memory is M, and we read/write B bytes at a time, the number of disk ops to sort N bytes is O(N/B * log(N/M)) [png] (for two-way merge).

The tool is not for production usage. If you want to sort huge files with high performance, probably, you want to find an implementation written in C or C++ (or any other programming language without a garbage collector).

Usage

The next example shows how to generate a huge file, sort it and validate the result:

make all

# Generate 42000000 lines. 
# Allowed line lengths: 126-128. (~5GB file will be created)
# Each line will have the same prefix of length 110.
./bin/generator -min-length 126 -max-length 128 -equal-prefix-length 110 -alphabet hex -output input.txt -count 42000000

# Sort input.txt and save the result in output.txt.
# Main memory limit: 500MB. Block size: 1MB.
# (in fact, more memory may be allocated due to the presence of a garbage collector)
./bin/sort -memory 500000000 -blocksize 1000000 -input input.txt -output output.txt

# Validate the result.
./bin/validator -input input.txt -output output.txt

Use --help for more detail.

Tools

There are three useful tools in this repository.

Run make to build them:

make all

After that, three executable files will be created in the bin directory. Use --help to list supported arguments.

Generator

Generator produces files with random tokens (strings). You can specify count of tokens, their length and a set of allowed characters.

Usage of ./bin/generator:
  -alphabet string
        You can specify one of the supported sets of characters used for generation:
        binary - 01;
        lower - abc..xyz;
        upper - ABC...XYZ;
        numbers - 012345689;
        alnum - abc...xyz0123456789;
        hex - 0123456789ABCDEF;
        non-space - ASCII code in range (32, 128).
        
        If the specified value doesn't match with any of the predefined alphabet names, this value will be used as the set of characters. (default "lower")
  -count int
        How many tokens must be generated?. (default 1000)
  -delimiter string
        A character used to separate tokens. (default "\n")
  -equal-prefix-length int
        All tokens will begin with the same prefix, and you can specify the length of this prefix.
  -max-length int
        Maximum allowed token length. (default 128)
  -min-length int
        Minimum allowed token length. (default 128)
  -output string
        Where the generated data will be saved. (default "input.txt")

Sort

Sort is an implementation of external merge sort algorithm.

Choose memory limit and block size according to your setup and limitations.

Usage of ./bin/sort:
  -blocksize int
        Size of one block (in bytes). (default 1048576)
  -delimiter string
        A character used to separate tokens. (default "\n")
  -input string
        Input file path. (default "input.txt")
  -memory int
        The algorithm will use at most O(memory) main memory. (default 536870912)
  -order string
        Sort order. Supported values: ASC, DESC. (default "ASC")
  -output string
        Output file path. (default "output.txt")
  -tempdir string
        Where temporary files can be created. If you use /tmp, make sure there is enough space for two copies of the input file. (default ".")

Validator

Validator helps to check whether the implementation is correct. It takes input and output files and makes the following checks:

  • the number of tokens matches;
  • hashes of the set of tokens are the same;
  • order of tokens in the output file is correct.
Usage of ./bin/validator:
  -delimiter string
        A character used to separate tokens. (default "\n")
  -input string
        Input file path. (default "input.txt")
  -order string
        Sort order. Supported values: ASC, DESC. (default "ASC")
  -output string
        Output file path. (default "output.txt")
Similar Resources

Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Dec 14, 2022

A Merkle Tree implementation written in Go.

A Merkle Tree implementation written in Go.

Merkle Tree in Golang An implementation of a Merkle Tree written in Go. A Merkle Tree is a hash tree that provides an efficient way to verify the cont

Jan 5, 2023

A prefix tree implementation in go

Trie (Prefix tree) This library is compatible with Go 1.11+ Please refer to CHANGELOG.md if you encounter breaking changes. Motivation Introduction Us

Nov 3, 2022

Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Nov 20, 2022

A simple Set data structure implementation in Go (Golang) using LinkedHashMap.

Set Set is a simple Set data structure implementation in Go (Golang) using LinkedHashMap. This library allow you to get a set of int64 or string witho

Sep 26, 2022

A Go library for an efficient implementation of a skip list: https://godoc.org/github.com/MauriceGit/skiplist

A Go library for an efficient implementation of a skip list: https://godoc.org/github.com/MauriceGit/skiplist

Fast Skiplist Implementation This Go-library implements a very fast and efficient Skiplist that can be used as direct substitute for a balanced tree o

Dec 30, 2022

A slice-based implementation of a stack. In Go!

Stackgo Stackgo is a slice-based implementation of a simple stack in Go. It uses a pre-alloc pagination strategy which adds little memory overhead to

Nov 3, 2022

A Left-Leaning Red-Black (LLRB) implementation of balanced binary search trees for Google Go

GoLLRB GoLLRB is a Left-Leaning Red-Black (LLRB) implementation of 2-3 balanced binary search trees in Go Language. Overview As of this writing and to

Dec 23, 2022

Golang implementation of Radix trees

go-radix Provides the radix package that implements a radix tree. The package only provides a single Tree implementation, optimized for sparse nodes.

Dec 30, 2022
Stalin sort in multiple languages!

stalin-sort Stalin sort in multiple languages, contributions are welcome! Motivation This repo is motivated by this tweet, this tweet contains a refer

Jan 14, 2022
Code in this repository solves the game Water Sort Puzzle for phones
Code in this repository solves the game Water Sort Puzzle for phones

Water Sort Puzzle Solver This code can solve a game called "Water Sort Puzzle" available on APPStore and PlayMarket. Start Guide Firstly, you have to

Nov 5, 2022
Smartsort - A smart sorting algorithm for Go to sort filename containing digits that is not zero padded

smartsort A smart sorting algorithm for Go to sort filename containing digits th

Mar 26, 2022
A small flexible merge library in go
A small flexible merge library in go

conjungo A merge utility designed for flexibility and customizability. The library has a single simple point of entry that works out of the box for mo

Dec 27, 2022
Go implementation of Count-Min-Log

Count-Min-Log Count-Min-Log sketch: Approximately counting with approximate counters - Guillaume Pitel & Geoffroy Fouquier TL;DR: Count-Min-Log Sketch

Jan 4, 2023
A Go implementation of the Elias-Fano encoding

go-ef A Go implementation of the Elias-Fano encoding Example package main import ( "fmt" "github.com/amallia/go-ef" "os" ) func main() {

Nov 23, 2022
Set is a useful collection but there is no built-in implementation in Go lang.

goset Set is a useful collection but there is no built-in implementation in Go lang. Why? The only one pkg which provides set operations now is golang

Sep 26, 2022
A skip list implementation in Go

About This is a library implementing skip lists for the Go programming language (http://golang.org/). Skip lists are a data structure that can be used

Sep 21, 2022
Go implementation of C++ STL iterators and algorithms.

iter Go implementation of C++ STL iterators and algorithms. Less hand-written loops, more expressive code. README translations: 简体中文 Motivation Althou

Dec 19, 2022