Perform external sorting on io.Reader

external-sort

This repo describes an algorithm to sort data from a stream with a fixed amount of memory necessary that can be specified.

The overall idea is:

  • Get a io.Reader using sftp.OpenFile()

  • Scan the entire file and dump to a temporary file every M rows and M must fit in memory. The M rows are sorted using binary search

  • You now have P small files on Hard drive. Each file is sorted and contains maximum M rows We know choose K = M/100

  • Load the first K rows of every chunk in memory. This is a buffer

  • It means P slices are in memory (memory used is still M)

  • Perform a merge sort on the P slices. There are comments in the function MergeSort to describe how it is done.

If one of the slice becomes empty when merge sorting then load the next K rows from the chunk file associated to the slice and carry on (very important to order correctly the final file)

If I’m correct the maximum RAM used is M + size of output buffer

The maximum hard drives used is the size P*M (size of the file) as long as you don't store the final output on drive.

Currently we hold the final sorted file in memory and we should just return a channel that can be consumed by another application.

There are many parts we could parallelise to improve the speed a lot.

Why are you using vector?

Mmh because this the way I imagined a solutionto the problem in the first place. But I don't think the name is still so accurate.

On the other hand, it brings a "nice" interface that allow to create models and decide how to compare rows depending on the shape of the file you want to sort.

For example, all the tests were done using IntVector. It reads a line form the file and convert it to integer so we can compare numbers.

Test

You can look at an intersting file testdata/100elems.tsv. It contains 100 rows with one integer per row. And the test succesfully order it for any size of chunks or buffer.

make test

Show some stuff

make run

Print on stdout how we ordered 10 integers. The original file can be find data/10elems.tsv

Similar Resources

Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. 🔍 📊

askgit askgit is a command-line tool for running SQL queries on git repositories. It's meant for ad-hoc querying of git repositories on disk through a

Jan 5, 2023

A simple debugging Go package to perform Dump and Die

dump A simple Go package to perform Dump and Die.

May 16, 2021

Simple Bank is a simple REST API that allows users to perform transferences with each other.

Simple Bank is a simple REST API that allows users to perform transferences with each other. 🔧 Technologies Golang Docker PostgreSQ

Feb 15, 2022

cross-platform, cli app to perform various operations on string

cross-platform, cli app to perform various operations on string

sttr is command line software that allows you to quickly run various transformation operations on the string.

Dec 30, 2022

Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. 🔍 📊

Query git repositories with SQL. Generate reports, perform status checks, analyze codebases. 🔍 📊

askgit is a command-line tool for running SQL queries on git repositories. It's meant for ad-hoc querying of git repositories on disk through a common interface (SQL), as an alternative to patching together various shell commands.

Jan 3, 2023

Open Source runtime scanner for OpenShift cluster and perform security audit checks based on CIS RedHat OpenShift Benchmark specification

Open Source runtime scanner for OpenShift cluster and perform security audit checks based on CIS RedHat OpenShift Benchmark specification

OpenShift-Ordeal Scan your Openshift cluster !! OpenShift-Ordeal is an open source audit scanner who perform audit check on OpenShift Cluster and outp

Sep 6, 2022

Perform a simple Vinegere cipher substitution of ASCII alphabetical characters.

Perform a simple Vinegere cipher substitution of ASCII alphabetical characters. Ignores special ASCII characters, and non ASCII characters.

Nov 3, 2021

O365 is a tool designed to perform user enumeration* and password guessing attacks on organizations that use Office365

O365 is a tool designed to perform user enumeration* and password guessing attacks on organizations that use Office365 (now/soon Microsoft365). O365 uses a unique SOAP API endpoint on login.microsoftonline.com that most other tools do not use.

Dec 2, 2022

Server and client implementation of the grpc go libraries to perform unary, client streaming, server streaming and full duplex RPCs from gRPC go introduction

Description This is an implementation of a gRPC client and server that provides route guidance from gRPC Basics: Go tutorial. It demonstrates how to u

Nov 24, 2021

A library for Go client applications that need to perform OAuth authorization against a server

A library for Go client applications that need to perform OAuth authorization against a server

oauth-0.8.0.zip oauth A library for Go client applications that need to perform OAuth authorization against a server, typically GitHub.com. Traditiona

Oct 13, 2021

Use of Advent of Code challenges to perform pyhton and learn Go language

Scripts in Python and Go language made to perform Advent of Code 2021 challenges

Jan 3, 2022

Package trn introduces a Range type with useful methods to perform complex operations over time ranges

Time Ranges Package trn introduces a Range type with useful methods to perform c

Aug 18, 2022

Cloud-Z gathers information and perform benchmarks on cloud instances in multiple cloud providers.

Cloud-Z Cloud-Z gathers information and perform benchmarks on cloud instances in multiple cloud providers. Cloud type, instance id, and type CPU infor

Jun 8, 2022

Api-product - A basic REST-ish API that allows you to perform CRUD operations for Products

Description A basic REST-ish API that allows you to perform CRUD operations for

Jan 3, 2022

A Go client library enabling programs to perform CRUD operations on the goharbor API.

goharbor-client A Go client library enabling programs to perform CRUD operations on the goharbor API. This client library utilizes types generated by

Jan 10, 2022

Assert to perform programming assertions

Cli EN README Assert para realizar las aserciones de programación. GoDoc godoc for github Menú Principal Se configura un menú con ese ejemplo como dis

Jan 13, 2022

Savoir - A tool to perform tasks during internal security assessment

Savoir Savoir is a tool to perform tasks during internal security assessment. Th

Nov 9, 2022

An operator that helps you perform benchmarks

Camunda-Benchmark-Operator 🏋️‍♀️ An operator that helps you perform benchmarks. Your first benchmark This requires that you know how to run the opera

Mar 2, 2022

Tnbassist - A CLI tool for thenewboston blockchain to perform various mundane tasks like taking daily accounts backup

TNB Assist is a CLI (Command Line Interface) tool for thenewboston blockchain to perform various mundane tasks like taking daily accounts backup, computing statistics, etc easier.

Feb 14, 2022
csv reader/writer and csv generator.

IO csv reader sample version 0.0.1-SNAPSHOT Goals: main: read huge file, hex substring, write to new file. repo has 2 version. v1 can read a file and

Nov 4, 2021
osu! database file format flexible reader

Example usage package main import ( "fmt" "github.com/l3lackShark/reader" types "github.com/l3lackShark/reader/types" ) type agent struct { Osu

Nov 24, 2021
A fast string sorting algorithm (MSD radix sort)
A fast string sorting algorithm (MSD radix sort)

Your basic radix sort A fast string sorting algorithm This is an optimized sorting algorithm equivalent to sort.Strings in the Go standard library. Fo

Dec 18, 2022
golang sorting algorithm and data construction.

data-structures-questions 算法和程序结构是我们学习编程的基础,但是很多的时候,我们很多都是只是在应用,而没有深入的去研究这些,所以自己也在不断的思考和探索,然后分析,学习,总结自己学习的过程,希望可以和大家一起学习和交流下算法! 目录 网络协议 数据结构 算法 数据库 Go

Dec 17, 2022
A radix sorting library for Go (golang)

zermelo A radix sorting library for Go. Trade memory for speed! import "github.com/shawnsmithdev/zermelo" func foo(large []uint64) zermelo.Sort(l

Jul 30, 2022
The simplest sorting algorithm that sorts in quadratic time
The simplest sorting algorithm that sorts in quadratic time

Simplest sort Showcases the simplest sorting algorithm that works in quadratic time. From here. The pseudocode for this algo can be seen below (sorts

Oct 14, 2022
May 15, 2022
micro-draft-manager is a microservice that helps you to manage unstructured data in your application with sorting and full-text search

micro-draft-manager is a microservice that helps you to manage unstructured data in your application with sorting and full-text search. For example, y

Nov 24, 2021
Repaint images by sorting pixels by colour
Repaint images by sorting pixels by colour

Every Frame A Painting Sorts pixels in an image by colour and redraws the image

Jan 5, 2022
Smartsort - A smart sorting algorithm for Go to sort filename containing digits that is not zero padded

smartsort A smart sorting algorithm for Go to sort filename containing digits th

Mar 26, 2022