A tool to find all duplicates in large sets of text documents.

⊧ dupi

Dupi is an engine for identifying and exploring duplicative text in sets of documents.

Status

Dupi is in alpha/early beta development stage. Please feel free to give it a try (and file issues). We have run it on several document sets successfully, but it definitely needs more testing.

Input

Throw hundreds of thousands of textual documents at it. Or extract text from other documents and send that to dupi.

Output

Find and query for repeated chunks of text.

Tutorial

Tutorial

Design

Design Document

Library Reference

Go Reference

Owner
go-air
⊧ Analysis and verification tools for and in Go. ⊧
go-air
Similar Resources

CLI tool to update ~/.aws/config with all accounts and permission sets defined in AWS SSO

aws-sso-profiles Generate or update ~/.aws/config with a profile for each SSO account you have access to, by using an existing AWS SSO session. Bootst

Nov 3, 2022

Templating system for HTML and other text documents - go implementation

FAQ What is Kasia.go? Kasia.go is a Go implementation of the Kasia templating system. Kasia is primarily designed for HTML, but you can use it for any

Mar 15, 2022

Templating system for HTML and other text documents - go implementation

FAQ What is Kasia.go? Kasia.go is a Go implementation of the Kasia templating system. Kasia is primarily designed for HTML, but you can use it for any

Mar 15, 2022

Proto-find is a tool for researchers that lets you find client side prototype pollution vulnerability.

proto-find proto-find is a tool for researchers that lets you find client side prototype pollution vulnerability. How it works proto-find open URL in

Dec 6, 2022

red-tldr is a lightweight text search tool, which is used to help red team staff quickly find the commands and key points they want to execute, so it is more suitable for use by red team personnel with certain experience.

red-tldr is a lightweight text search tool, which is used to help red team staff quickly find the commands and key points they want to execute, so it is more suitable for use by red team personnel with certain experience.

Red Team TL;DR English | 中文简体 What is Red Team TL;DR ? red-tldr is a lightweight text search tool, which is used to help red team staff quickly find t

Jan 5, 2023

A file find utility modeled after the unix find written in Go

gofind A file find utility modeled after the unix find written in Go. Why This p

Dec 17, 2021

Cf-cli-find-app-plugin - CF CLI plugin to find applications containing a search string

Overview This cf cli plugin allows users to search for application names that co

Jan 3, 2022

Drop-in replacement for Go's stringer tool with support for bitflag sets.

stringer This program is a drop-in replacement for Go's commonly used stringer tool. In addition to generating String() string implementations for ind

Nov 26, 2021

A golang tool to list out all EKS clusters with active nodegroups in all regions in json format

eks-tool A quick and dirty tool to list out all EKS clusters with active nodegro

Dec 18, 2021

Find all words spellable from the letters in a given word

Spellable Given a dictionary of words, find all of the words that can be spelled

Jan 3, 2022

Quickly find all IPv6 and IPv4 hosts in a LAN.

invaentory Quickly find all IPv6 and IPv4 hosts in a LAN. Overview 🚧 This project is a work-in-progress! Instructions will be added as soon as it is

May 17, 2022

gomtch - find text even if it doesn't want to be found

gomtch - find text even if it doesn't want to be found Do your users have clever ways to hide some terms from you? Sometimes it is hard to find forbid

Sep 28, 2022

Commonwords - Simple cli to find words in text that are not in the 1000 most common English words

Thousand common words Find words in a text that are not in the 1000 most common

Feb 1, 2022

Golang wrapper for Exiftool : extract as much metadata as possible (EXIF, ...) from files (pictures, pdf, office documents, ...)

go-exiftool go-exiftool is a golang library that wraps ExifTool. ExifTool's purpose is to extract as much metadata as possible (EXIF, IPTC, XMP, GPS,

Dec 28, 2022

Package kml provides convenince methods for creating and writing KML documents.

go-kml Package kml provides convenience methods for creating and writing KML documents. Key Features Simple API for building arbitrarily complex KML d

Jul 29, 2022

Compare ANY markup documents.

Menu What is XML-Comp? Features Installing Running How this works? Comparing any kind of document Contributing To Do Using only the comparer package W

Oct 23, 2022

Extract data or evaluate value from HTML/XML documents using XPath

xquery NOTE: This package is deprecated. Recommends use htmlquery and xmlquery package, get latest version to fixed some issues. Overview Golang packa

Nov 9, 2022

Pure go library for creating and processing Office Word (.docx), Excel (.xlsx) and Powerpoint (.pptx) documents

Pure go library for creating and processing Office Word (.docx), Excel (.xlsx) and Powerpoint (.pptx) documents

unioffice is a library for creation of Office Open XML documents (.docx, .xlsx and .pptx). Its goal is to be the most compatible and highest performan

Jan 4, 2023
ASCII transliterations of Unicode text.

go-unidecode ASCII transliterations of Unicode text. Inspired by python-unidecode. Installation go get -u github.com/mozillazg/go-unidecode Install C

Dec 2, 2022
Go efficient text segmentation and NLP; support english, chinese, japanese and other. Go 语言高性能分词

gse Go efficient text segmentation; support english, chinese, japanese and other. 简体中文 Dictionary with double array trie (Double-Array Trie) to achiev

Jan 8, 2023
A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

Jan 4, 2023
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29

segment A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29 Features Currently only segmentation at Word

Dec 19, 2022
i18n-pseudo - Pseudolocalization is an incredibly useful tool for localizing your apps.

i18n-pseudo Pseudolocalization is an incredibly useful tool for localizing your apps. This module makes it easy to apply pseudo to any given string. I

Mar 21, 2022
Command-line tool to organize large directories of media files recursively by date, detecting duplicates.

go-media-organizer Command-line tool written in Go to organise all media files in a directory recursively by date, detecting duplicates.

Jan 6, 2022
Distributed merge sort for large sets across nodes in a network

distMergeSort An implementation of mergesort distributed across nodes used to sort large sets. Introduction Merge sort partitions sets so that they ca

Jul 8, 2022
Sep 23, 2022
word2text - a tool is to convert word documents (DocX) to text on the CLI with zero dependencies for free
word2text - a tool is to convert word documents (DocX) to text on the CLI with zero dependencies for free

This tool is to convert word documents (DocX) to text on the CLI with zero dependencies for free. This tool has been tested on: - Linux 32bit and 64 bit - Windows 32 bit and 64 bit - OpenBSD 64 bit

Apr 19, 2021