A faster file programming language detector


Programming language detector and toolbox to ignore binary or vendored files. enry started as a port to Go of the original Linguist Ruby library and has 2x improved performance.

CLI

The CLI binary is hosted in a separate repository go-enry/enry.

Library

enry is also a Go library for guessing a programming language. It exposes its API through FFI to multiple programming environments.

Use cases

enry guesses a programming language using a sequence of matching strategies that are applied progressively to narrow down the possible options. Each strategy differs in the type of input data it needs to make a decision: file name, extension, the first line of the file, the full content of the file, etc.
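The progressive narrowing described above can be illustrated with a small standard-library sketch. It is not enry's implementation: the `strategy` type, the `byExtension` and `byContent` functions, and their tiny lookup rules are hypothetical and exist only to show how each strategy may shrink the candidate list left by the previous one.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// strategy narrows a list of candidate languages using one kind of input.
// It returns an empty slice when it has nothing to contribute.
type strategy func(filename string, content []byte, candidates []string) []string

// byExtension is a toy strategy backed by a hypothetical two-entry table.
func byExtension(filename string, _ []byte, _ []string) []string {
	table := map[string][]string{
		".go": {"Go"},
		".m":  {"Matlab", "Objective-C"},
	}
	return table[strings.ToLower(filepath.Ext(filename))]
}

// byContent is a toy content heuristic: it keeps Objective-C only when the
// text contains "@interface", and keeps the other candidates otherwise.
func byContent(_ string, content []byte, candidates []string) []string {
	isObjC := strings.Contains(string(content), "@interface")
	var kept []string
	for _, c := range candidates {
		if (c == "Objective-C") == isObjC {
			kept = append(kept, c)
		}
	}
	return kept
}

// detect applies the strategies in order and stops as soon as a single
// candidate is left.
func detect(filename string, content []byte, strategies ...strategy) []string {
	var candidates []string
	for _, s := range strategies {
		if next := s(filename, content, candidates); len(next) > 0 {
			candidates = next
		}
		if len(candidates) == 1 {
			break
		}
	}
	return candidates
}

func main() {
	fmt.Println(detect("foo.go", nil, byExtension, byContent))
	fmt.Println(detect("bar.m", []byte("@interface Foo"), byExtension, byContent))
}
```

In this sketch the extension alone settles `foo.go`, while `bar.m` stays ambiguous until the content strategy runs.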

Depending on the available input data, the enry API can be roughly divided into the following categories or use cases.

By filename

The following functions require only the name of the file to make a guess:

  • GetLanguageByExtension uses only the file extension (which may be ambiguous)
  • GetLanguageByFilename is useful for cases like .gitignore, .bashrc, etc.
  • all filtering helpers

Please note that such guesses are expected not to be very accurate.

By text

To make a guess based only on the content of the file or a text snippet, use:

  • GetLanguageByShebang reads only the first line of text to identify the shebang.

  • GetLanguageByModeline for cases when a Vim/Emacs modeline, e.g. /* vim: set ft=cpp: */, may be present at the head or tail of the text.

  • GetLanguageByClassifier uses a Bayesian classifier trained on all the ./samples/ from Linguist.

    It is usually a last-resort strategy that is used to disambiguate the guess of the previous strategies, and thus it requires a list of "candidate" guesses. If more intelligent hypotheses are not available, one can provide a list of all known languages (the keys from data.LanguagesLogProbabilities) as possible candidates, at the price of possibly suboptimal accuracy.
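To make the shebang strategy more concrete, here is a minimal standard-library sketch of detection that looks only at the first line of the text. It is an illustration under stated assumptions, not enry's code: the `languageByShebang` function and the small `interpreters` table are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"strings"
)

// interpreters is a hypothetical, deliberately tiny lookup table.
var interpreters = map[string]string{
	"python3": "Python",
	"ruby":    "Ruby",
	"bash":    "Shell",
	"sh":      "Shell",
}

// languageByShebang inspects only the first line of content. The boolean is
// false when there is no shebang or the interpreter is not in the table.
func languageByShebang(content []byte) (string, bool) {
	line := content
	if i := bytes.IndexByte(content, '\n'); i >= 0 {
		line = content[:i]
	}
	if !bytes.HasPrefix(line, []byte("#!")) {
		return "", false
	}
	fields := strings.Fields(string(line[2:]))
	if len(fields) == 0 {
		return "", false
	}
	name := path.Base(fields[0])
	// In "#!/usr/bin/env python3" the interpreter is the next field.
	if name == "env" && len(fields) > 1 {
		name = fields[1]
	}
	lang, ok := interpreters[name]
	return lang, ok
}

func main() {
	fmt.Println(languageByShebang([]byte("#!/usr/bin/env python3\nprint('hi')\n")))
	fmt.Println(languageByShebang([]byte("package main\n")))
}
```

Because only the first line is read, this kind of check is cheap, but it says nothing about files that have no shebang at all.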

By file

The most accurate guess is possible when both the file name and the content are available:

  • GetLanguagesByContent only uses file extension and a set of regexp-based content heuristics.
  • GetLanguages uses the full set of matching strategies and is expected to be most accurate.

Filtering: vendoring, binaries, etc

enry exposes a set of file-level helpers Is* to simplify filtering out the files that are less interesting for the purpose of source code analysis:

  • IsBinary
  • IsVendor
  • IsConfiguration
  • IsDocumentation
  • IsDotFile
  • IsImage
  • IsTest
  • IsGenerated

Language colors and groups

enry exposes functions to get the language color, for example to use when presenting statistics in graphs, and the language group:

  • GetColor
  • GetLanguageGroup can be used to group similar languages together e.g. for Less this function will return CSS

Languages

Go

In a Go module, import enry to the module by running:

go get github.com/go-enry/go-enry/v2

The rest of the examples will assume you have either done this or fetched the library into your GOPATH.

// The examples here and below assume you have imported the library.
import "github.com/go-enry/go-enry/v2"

lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

// "<matlab-code>" and the similar strings below are placeholders for real file content.
lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together; GetLanguage returns only the language name
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++

Note that the returned boolean value safe is true if there is only one possible language detected.

A plural version of the same API allows getting a list of all possible languages for a given file.

// "<cpp-code>" is a placeholder for real file content.
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte(""), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte(""), []string{})
// result: []string{"Ruby"}

Java bindings

Generated Java bindings using a C shared library and JNI are available under java.

A library is published on Maven as tech.sourced:enry-java for macOS and Linux platforms. Windows support is planned under src-d/enry#150.

Python bindings

Generated Python bindings using a C shared library and cffi are WIP under src-d/enry#154.

A library is going to be published on PyPI as enry for macOS and Linux platforms. Windows support is planned under src-d/enry#150.

Rust bindings

Generated Rust bindings using a C static library are available at https://github.com/go-enry/rs-enry.

Divergences from Linguist

The enry library is based on the data from github/linguist version v7.14.0.

When parsing linguist/samples, the following enry results differ from Linguist:

For all the cases above that have an issue number, we plan to update enry to match Linguist behavior.

Benchmarks

Enry's language detection has been compared with Linguist's on linguist/samples.

We got these results:

[Figure: benchmark histogram]

The histogram shows the number of files (y-axis) per time interval bucket (x-axis). Most of the files were detected faster by enry.

There are several cases where enry is slower than Linguist because Go's regexp engine is slower than Ruby's, which is based on the oniguruma library, written in C.

See instructions for running enry with oniguruma.

Why Enry?

In the movie My Fair Lady, Professor Henry Higgins is a linguist who at the very beginning of the movie enjoys guessing the origin of people based on their accent.

"Enry Iggins" is how Eliza Doolittle pronounces the name of the Professor.

Development

To run the tests use:

go test ./...

Setting ENRY_TEST_REPO to the path of an existing checkout of Linguist will avoid cloning it and speed tests up. Setting ENRY_DEBUG=1 will provide insight into the Bayesian classifier building done by make code-generate.

Sync with github/linguist upstream

enry re-uses parts of the original github/linguist to generate internal data structures. In order to update to the latest release of linguist do:

$ git clone https://github.com/github/linguist.git .linguist
$ cd .linguist; git checkout <release-tag>; cd ..

# put the new release's commit sha in the generator_test.go (to re-generate .gold test fixtures)
# https://github.com/go-enry/go-enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18

$ make code-generate

To stay in sync, enry needs to be updated when a new release of the linguist includes changes to any of the following files:

There is no automation for detecting the changes in the linguist project, so the process above has to be done manually from time to time.

When submitting a pull request syncing up to a new release, please make sure it only contains the changes in the generated files (in the data subdirectory).

Separating all the necessary "manual" code changes to a different PR that includes some background description and an update to the documentation on "divergences from linguist" is very much appreciated as it simplifies the maintenance (review/release notes/etc).

Misc

Running a benchmark & faster regexp engine

Benchmark

All benchmark scripts are in benchmarks directory.

Dependencies

As the benchmarks depend on Ruby and the Github-Linguist gem, make sure you have:

  • Ruby (e.g. using rbenv) with bundler installed
  • Docker
  • native dependencies installed
  • Build the gem cd .linguist && bundle install && rake build_gem && cd -
  • Install it gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem

Quick benchmark

To run quicker benchmarks

make benchmarks

to get average times for the primary detection function and strategies for the whole samples set. If you want to see measures per sample file use:

make benchmarks-samples

Full benchmark

If you want to reproduce the same benchmarks as reported above:

  • Make sure all dependencies are installed
  • Install gnuplot (in order to plot the histogram)
  • Run ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh (takes ~15h)

It will run the benchmarks for enry and Linguist, parse the output, create csv files and plot the histogram.

Faster regexp engine (optional)

Oniguruma is CRuby's regular expression engine. It is very fast and performs better than the one built into the Go runtime. enry supports swapping between those two engines thanks to the rubex project. The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library. On macOS with Homebrew, it is:

brew install oniguruma

On Ubuntu, it is

sudo apt install libonig-dev

To build enry with Oniguruma regexps use the oniguruma build tag

go get -v -t --tags oniguruma ./...

and then rebuild the project.

License

Apache License, Version 2.0. See LICENSE

Comments
  • Code generator Win support

    Fixes #4

    On Win make code-generate produces unreasonable Bayesian classifier weights from Linguist samples silently, failing only the final classification tests.

    TestPlan:

    • passing tests on Win CI
      go test ./internal/code-generator/... \
       -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
      
  • Expose IsTest and GetLanguageType methods & Fix test cases for Java Bindings

    Purpose

    [Fixes]:

    1. Correct failing test cases for Java Bindings so that make test command under Java does not fail.
    2. Expose isTest method in java bindings.

    [Features] :

    1. Export new function GetLanguageType at enry package in go & expose the same at Java Bindings.
  • Is there a prebuilt shared library?

    I couldn't find any way in this repo on how to get the shared library so it can be used from other languages.

    So is there a prebuilt one somewhere or some instructions on how to build your own?

  • Refactoring tests

    Several cosmetic changes

    • API function declarations order follows tests order
    • Linguist lazy loading logic unified & re-used, as much as possible between tests&benchmark
    • Separate test suite extracted for running over Linguist samples/fixtures
  • Linguist update automation opens multiple PRs

    The Linguist update automation runs once a day, so if the generated PR isn't merged by then, it will open another one!

    https://github.com/go-enry/go-enry/pull/68 #70

  • Expose `LanguageInfo` with all Linguist data

    As discussed in https://github.com/go-enry/go-enry/issues/54, this provides an API for accessing a LanguageInfo struct which is populated with all the data from the Linguist YAML source file. Functions are provided to access the LanguageInfo by name or ID.

    The other top-level functions like GetLanguageExtensions, GetLanguageGroup, etc. could in principle be implemented using this structure, which would simplify the code generation. But that would be a big change so I didn't do any of that. Perhaps in the next major version something like that would make sense.

    cc @tclem

    Closes https://github.com/go-enry/go-enry/issues/54

  • Python: API to expose highest-level enry.GetLanguage

    This is a blueprint for all other methods, dealing with Go slice conversion.

    It still lacks build automation (and there is no release automation whatsoever), but this is already useful.

  • data: replace substring package with regex package

    This PR removes the old substring package from @toqueteos (sorry dude) and uses the internal regex package, so that oniguruma regexps can be used with all the regular expressions.

  • IsVendor() overmatching paths

    I discovered this through our Gitea server flagging files as vendored through its use of enry.IsVendor(). Paths like oslo_cache/_bmemcache_pool.py and playbooks/roles/create-venv/tasks/main.yaml are inappropriately marked vendored.

    My hunch is that the first path is matching https://github.com/go-enry/go-enry/blob/7168084e5e5de38b915b1874528ff73f20a86b69/data/vendor.go#L9 and the second is matching https://github.com/go-enry/go-enry/blob/7168084e5e5de38b915b1874528ff73f20a86b69/data/vendor.go#L110

    I've written a little reproducer that removes Gitea from the equation:

    package main
    
    import "fmt"
    import "regexp"
    import "github.com/go-enry/go-enry/v2"
    
    func main() {
    	input_str1 := "oslo_cache/_bmemcache_pool.py"
    
    	rawregex1, _ := regexp.MatchString(`(^|/)cache/`, input_str1)
    	fmt.Println("Raw regex:", rawregex1)
    
    	vendor1 := enry.IsVendor(input_str1)
    	fmt.Println("IsVendor:", vendor1)
    
    	input_str2 := "playbooks/roles/create-venv/tasks/main.yaml"
    
    	rawregex2, _ := regexp.MatchString(`(^|/)env/`, input_str2)
    	fmt.Println("Raw regex:", rawregex2)
    
    	vendor2 := enry.IsVendor(input_str2)
    	fmt.Println("IsVendor:", vendor2)
    }
    

    When you run this the results are:

    Raw regex: false
    IsVendor: true
    Raw regex: false
    IsVendor: true
    

    What this shows us is that the raw input regexes appear to behave as expected. Neither of our example input strings matches which is what we expect. But when we call IsVendor() the result becomes true. I suspect that the init function https://github.com/go-enry/go-enry/blob/7168084e5e5de38b915b1874528ff73f20a86b69/utils.go#L139-L246 is either adding rules that collide or introducing some bug to the expanded regex that causes this to happen.

  • Mark `go.sum` as generated?

    Most diffs to go checksum files are pure noise, I'm wondering if anyone else agrees it should be marked as generated so tools like gitea can hide diffs on it? Linguist doesn't do it but I think diverging here is fine.

  • Use a deterministic branch name for Linguist updates

    Rather than creating the branch for the update PR ahead of time using the date, this changes it to use the short hash of the Linguist commit that was found, and updates the code so that if the branch already exists, it will exit without creating a PR.

    This branch name should be the same between runs of the workflow (unless the Linguist release tag is changed, which warrants another update anyway) and should address the problem of creating one PR a day until the update is merged.
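The idea of a deterministic branch name can be sketched in a few lines of Go. This is only an illustration of the approach described above, not the workflow's actual code: the `branchName` and `shouldOpenPR` helpers, the `linguist-update-` prefix, and the sample hash are all hypothetical.

```go
package main

import "fmt"

// branchName derives the branch from the Linguist commit hash, so repeated
// runs for the same commit yield the same name. The prefix is hypothetical.
func branchName(commitSHA string) string {
	short := commitSHA
	if len(short) > 7 {
		short = short[:7]
	}
	return "linguist-update-" + short
}

// shouldOpenPR reports whether a new PR is needed, which is the case only
// when no branch for this commit exists yet.
func shouldOpenPR(commitSHA string, existingBranches []string) bool {
	want := branchName(commitSHA)
	for _, b := range existingBranches {
		if b == want {
			return false
		}
	}
	return true
}

func main() {
	sha := "0123456789abcdef0123456789abcdef01234567" // made-up sample hash
	fmt.Println(branchName(sha))
	fmt.Println(shouldOpenPR(sha, []string{"master"}))
	fmt.Println(shouldOpenPR(sha, []string{"master", branchName(sha)}))
}
```

With a date-based name, every daily run produces a new branch; with a hash-based name, a second run for the same commit finds the existing branch and exits.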

    You can see an example of a PR created by this code here: https://github.com/look/go-enry/pull/8 (note the branch name)

    closes https://github.com/go-enry/go-enry/issues/69

    cc @bzz @lafriks

  • Syntax-aware regexp generation for configurable engines

    This is an alternative to #138 where at build time we always generate unaltered regexp syntax for all the rules and make runtime checks, similar to #65.

    This, by itself, does not solve the problem of dealing with more non-RE2 syntax coming from linguist, only renders it more visible. The only solution that I can see now that would not require everyone using native library (oniguruma) or compromising on predictive accuracy (due to missing rules for unsupported syntax) is to try shipping another regexp engine in go that would support the necessary syntax.

    TODOs

    • [x] update content heuristics generation
    • [ ] update vendor generation
    • [ ] add Oniguruma-only tests, as in #65
    • [ ] add https://github.com/dlclark/regexp2 backend option
  • [tentative] Check vendor regex at build-time

    This change does several things:

    • ~refactors optimization of vendor regexp collation for IsVendor()~
    • ~moves it to build-time "code generation" phase (instead of runtime at package initialisation)~
    • introduces RE2 syntax check for vendor.go (case that fails #137), the same as we use for heuristics from content.go (and skip the rules with unsupported syntax)
    • adds new CI profiles to test code generation

    An attempt is made for the checks to be RE-lib-specific, thus make code-generate now also respects the same --tags, which should be passed when use of oniguruma is desired.
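A syntax check of the kind described above can be sketched with Go's standard regexp package, which implements RE2 syntax and rejects constructs such as lookaheads. This is a hedged illustration, not the code generator's implementation: the `compileSupported` function and both sample patterns are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileSupported compiles each pattern with Go's RE2-based regexp package
// and returns the compiled rules plus the patterns that had to be skipped.
func compileSupported(patterns []string) (compiled []*regexp.Regexp, skipped []string) {
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			// Unsupported syntax: keep track of the rule instead of failing.
			skipped = append(skipped, p)
			continue
		}
		compiled = append(compiled, re)
	}
	return compiled, skipped
}

func main() {
	patterns := []string{
		`(^|/)vendor/`,      // plain RE2 syntax
		`^(?!.*\.min\.js$)`, // negative lookahead, not supported by RE2
	}
	compiled, skipped := compileSupported(patterns)
	fmt.Println("compiled:", len(compiled), "skipped:", len(skipped))
}
```

Skipping a rule keeps the build working with the standard library, at the cost of losing whatever that rule would have matched.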

  • JNAerator throws exceptions in Docker for Apple Silicon

    I've found out that this project uses JNAerator (https://github.com/nativelibs4java/JNAerator), which does not seem to be maintained anymore: its last commit was on Sep 30, 2015.

    The JNAerator in this project uses JNA version 4.1.0, and this version does not provide support for 'linux-aarch64' (4.2.0 and above do).

    I am using go-enry in my project and it gives the following error in Docker on Apple Silicon: java.lang.UnsatisfiedLinkError: Native library (com/sun/jna/linux-aarch64/libjnidispatch.so) not found in resource path

    I hope there's a way to fix this!

  • support for incompatible heuristics using oniguruma

    This PR enables support for all the heuristics that are not supported by the Go regexp engine.

    • Exposes the original regular expressions to the regex package.
    • The regex package now handles the conversions of the Ruby regular expression to the go-ish version.
    • The heuristics rules now support nil regular expressions since some of the heuristics can't compile using the standard library.

    I added some tests using the Linguist fixtures, and all the fixtures pass (using oniguruma).

  • Python bindings are memory leaking

    All Python wrappers are leaking memory.

    This may happen in several places:

    1. In CFFI - layer (when structures are converted between Python and Go runtimes)
    2. In CGo shared library (when C structures are converted in Go instances)
    3. Somewhere in enry library (nearly impossible, because all wrappers are leaking)

    You can reproduce it using this test script:

    import os
    
    import enry
    import psutil
    
    process = psutil.Process(os.getpid())
    
    content = "import os\nprint('Hello, world!')".encode()
    path = "test.py"
    initial_usage = process.memory_info().rss
    
    for _ in range(10000):
        enry.get_language(path, content)
    
    print(round(process.memory_info().rss / initial_usage * 100, 2))
    

    Here is some info about the memory usage of the get_language function (in % of the initial RAM usage); you can see that it's leaking:

    • 1000 iterations: ~100.5
    • 10000 iterations: ~102.4
    • 100000 iterations: ~111.1
