A blazingly-fast simple-to-use tool to find duplicate files on your computer, portable hard drives etc.

Go Find Duplicates

Introduction

A blazingly-fast simple-to-use tool to find duplicate files (photos, videos, music, documents etc.) on your computer, portable hard drives etc.

How to install and use?

There are two ways: directly, or through Docker.

Direct

To install:

  1. Install Go 1.16 or later
    • On Ubuntu: snap install go
    • On Mac: brew install go
    • For any other OS: Go downloads page
  2. Run command:
    go get github.com/m-manu/go-find-duplicates
  3. Ensure $HOME/go/bin is part of $PATH

To use:

go-find-duplicates {dir-1} {dir-2} ... {dir-n}

For more options and help, run:

go-find-duplicates -help

Through Docker

docker run --rm -v /Volumes/PortableHD:/mnt/PortableHD manumk/go-find-duplicates:latest go-find-duplicates -output=print /mnt/PortableHD

In above command:

  • option --rm removes the container when it exits
  • option -v mounts the host directory /Volumes/PortableHD as /mnt/PortableHD inside the container

Comments

  • GNU command line option conventions

    First of all, I love the tool, thank you for creating it. Very useful and helped me discover a whole slew of duplicates in my library.

    However, my only small gripe is that the tool does not use the typical POSIX convention when it comes to command line options (single - for short options, -- for long options). This makes it stand out from most other UNIX tools.

  • Sorting output results by file size

    I love this tool! The thing I'm trying to figure out is how I could write a script to sort the resulting output file to put the results in order from largest to smallest files - so I know which ones are the biggest problems. Have you considered adding that to your code?

  • go get doesn't work

    When following the directions:

    $ go get github.com/m-manu/go-find-duplicates
    unrecognized import path "embed": import path does not begin with hostname
    unrecognized import path "io/fs": import path does not begin with hostname
    
  • Removing duplicate files

    This tool is definitely better and faster than rdfind, FSlint, etc., but I can't find a way to actually remove the duplicate files. I see the list, but it's too hard to remove them by hand.

    What would be the way to do this automatically?

    Thanks

  • Please document how files are compared for uniqueness

    The readme says nothing about how files are checked for uniqueness. I've looked through the source code and could identify neither the use of a cryptographic hash nor a byte-for-byte comparison of the actual contents (although I might have missed something).

    The only thing I found was the usage of CRC32 and the following comment in file_hash.go:

    // GetDigest generates entity.FileDigest of the file provided, in an extremely fast manner
    // without compromising the quality of file's uniqueness.
    //
    // When this function was called on approximately 172k files (mix of photos, videos, audio files, PDFs etc.), the
    // uniqueness identified by this matched uniqueness identified by SHA-256 for *all* files
    

    To me it seems that for any file the uniqueness is determined by a CRC32 based on 8 KiB of the file contents (for larger files taken from the beginning, middle and end).


    If this is the case, I personally find it very concerning... It might work for high-entropy data formats (like the audio, video and other file formats you've tested against, which employ some form of compression), but imagine using it for text files, say copies of the same source code in multiple folders: I think it would then be trivial to find "duplicates" which aren't actually duplicates.

    I would dare to say that your statement in the readme, "blazingly-fast simple-to-use tool to find duplicate files", borders on false advertising... Yes, it is blazingly fast because you don't actually read the whole file, nor do you compute any sort of cryptographic hash, and as a consequence you don't actually test for uniqueness...


    However, to be constructive: I do think that using CRC32 is a good start for finding duplicate candidates (since two files with different CRC32 values are certainly different), so after you cluster files with the same CRC32 (and, I might add, the same size) you could then compute a proper cryptographic hash. I recommend testing a few, like SHA1, SHA512/256 and Blake3, to see which is fastest on your architecture; I went with Blake3 in my own tests.

  • Added JSON reporting

    With this change duplicates can be reported in JSON.

    Usage:

    go-find-duplicates -output json /home/adrian/photos/
    

    Sample JSON file:

    [
      {
        "ext": ".jpg",
        "size": 210381,
        "hash": "s0de88c66",
        "paths": [
          "/home/adrian/photos/IMG_9765.jpg",
          "/home/adrian/photos/IMG_9753.jpg"
        ]
      },
      {
        "ext": ".mp4",
        "size": 1067501,
        "hash": "sda3d7cfb",
        "paths": [
          "/home/adrian/photos/VID_1234.mp4",
          "/home/adrian/photos/VID_48733.mp4"
        ]
      }
    ]
    
  • Writing dynamic fields

    Hi there!

    I have an incoming FlowFile that looks like this: { "TagEpoch": "1630346400", "InfluxMeasurement": "my_measurement", "field": "my_field", "my_field": 178.39694 }

    It might be quite obvious from this, but I've got a measurement called my_measurement that has multiple fields on it. I'm trying to write each field dynamically, so in this case my_field could be replaced by any field name.

    Now, I want to set up my PutInfluxDatabaseRecord so that it reads from this flow file which field it should be writing, and I currently have the following setup.

    [image: screenshot of the PutInfluxDatabaseRecord setup]

    For my incoming RecordReader, I have an Avro schema indicating it should be looking for "TagEpoch", "InfluxMeasurement", "field" and every possible field name that I'm expecting, but this does not work!

    I get the following error: [image: screenshot of the error]

    Can you help me to figure out how to dynamically write field names here?
