Fast, dependency-free, small Go package to infer the binary file type based on the magic numbers signature

filetype

Small, dependency-free Go package that infers file and MIME types by checking the magic number signature.

For SVG file type checking, see go-is-svg package. Python port: filetype.py.

Features

  • Supports a wide range of file types
  • Provides file extension and proper MIME type
  • File discovery by extension or MIME type
  • File discovery by class (image, video, audio...)
  • Provides a bunch of helpers and file matching shortcuts
  • Pluggable: add custom new types and matchers
  • Simple and semantic API
  • Blazing fast, even processing large files
  • Only the first 262 bytes (the maximum file header size) are required, so you can pass just a slice
  • Dependency free (just Go code, no C compilation needed)
  • Cross-platform file recognition

Installation

go get github.com/h2non/filetype

API

See Godoc reference.

Subpackages

Examples

Simple file type checking

package main

import (
  "fmt"
  "io/ioutil"

  "github.com/h2non/filetype"
)

func main() {
  buf, _ := ioutil.ReadFile("sample.jpg")

  kind, _ := filetype.Match(buf)
  if kind == filetype.Unknown {
    fmt.Println("Unknown file type")
    return
  }

  fmt.Printf("File type: %s. MIME: %s\n", kind.Extension, kind.MIME.Value)
}

Check type class

package main

import (
  "fmt"
  "io/ioutil"

  "github.com/h2non/filetype"
)

func main() {
  buf, _ := ioutil.ReadFile("sample.jpg")

  if filetype.IsImage(buf) {
    fmt.Println("File is an image")
  } else {
    fmt.Println("Not an image")
  }
}

Supported type

package main

import (
  "fmt"

  "github.com/h2non/filetype"
)

func main() {
  // Check if file is supported by extension
  if filetype.IsSupported("jpg") {
    fmt.Println("Extension supported")
  } else {
    fmt.Println("Extension not supported")
  }

  // Check if file is supported by MIME type
  if filetype.IsMIMESupported("image/jpeg") {
    fmt.Println("MIME type supported")
  } else {
    fmt.Println("MIME type not supported")
  }
}

File header

package main

import (
  "fmt"
  "os"

  "github.com/h2non/filetype"
)

func main() {
  // Open a file descriptor
  file, _ := os.Open("movie.mp4")

  // We only have to pass the file header = first 262 bytes
  head := make([]byte, 262)
  file.Read(head)

  if filetype.IsImage(head) {
    fmt.Println("File is an image")
  } else {
    fmt.Println("Not an image")
  }
}

Add additional file type matchers

package main

import (
  "fmt"

  "github.com/h2non/filetype"
)

var fooType = filetype.NewType("foo", "foo/foo")

func fooMatcher(buf []byte) bool {
  return len(buf) > 1 && buf[0] == 0x01 && buf[1] == 0x02
}

func main() {
  // Register the new matcher and its type
  filetype.AddMatcher(fooType, fooMatcher)

  // Check if the new type is supported by extension
  if filetype.IsSupported("foo") {
    fmt.Println("New supported type: foo")
  }

  // Check if the new type is supported by MIME
  if filetype.IsMIMESupported("foo/foo") {
    fmt.Println("New supported MIME type: foo/foo")
  }

  // Try to match the file
  fooFile := []byte{0x01, 0x02}
  kind, _ := filetype.Match(fooFile)
  if kind == filetype.Unknown {
    fmt.Println("Unknown file type")
  } else {
    fmt.Printf("File type matched: %s\n", kind.Extension)
  }
}

Supported types

Image

  • jpg - image/jpeg
  • png - image/png
  • gif - image/gif
  • webp - image/webp
  • cr2 - image/x-canon-cr2
  • tif - image/tiff
  • bmp - image/bmp
  • heif - image/heif
  • jxr - image/vnd.ms-photo
  • psd - image/vnd.adobe.photoshop
  • ico - image/vnd.microsoft.icon
  • dwg - image/vnd.dwg

Video

  • mp4 - video/mp4
  • m4v - video/x-m4v
  • mkv - video/x-matroska
  • webm - video/webm
  • mov - video/quicktime
  • avi - video/x-msvideo
  • wmv - video/x-ms-wmv
  • mpg - video/mpeg
  • flv - video/x-flv
  • 3gp - video/3gpp

Audio

  • mid - audio/midi
  • mp3 - audio/mpeg
  • m4a - audio/m4a
  • ogg - audio/ogg
  • flac - audio/x-flac
  • wav - audio/x-wav
  • amr - audio/amr
  • aac - audio/aac

Archive

  • epub - application/epub+zip
  • zip - application/zip
  • tar - application/x-tar
  • rar - application/vnd.rar
  • gz - application/gzip
  • bz2 - application/x-bzip2
  • 7z - application/x-7z-compressed
  • xz - application/x-xz
  • zstd - application/zstd
  • pdf - application/pdf
  • exe - application/vnd.microsoft.portable-executable
  • swf - application/x-shockwave-flash
  • rtf - application/rtf
  • iso - application/x-iso9660-image
  • eot - application/octet-stream
  • ps - application/postscript
  • sqlite - application/vnd.sqlite3
  • nes - application/x-nintendo-nes-rom
  • crx - application/x-google-chrome-extension
  • cab - application/vnd.ms-cab-compressed
  • deb - application/vnd.debian.binary-package
  • ar - application/x-unix-archive
  • Z - application/x-compress
  • lz - application/x-lzip
  • rpm - application/x-rpm
  • elf - application/x-executable
  • dcm - application/dicom

Documents

  • doc - application/msword
  • docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • xls - application/vnd.ms-excel
  • xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • ppt - application/vnd.ms-powerpoint
  • pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation

Font

  • woff - application/font-woff
  • woff2 - application/font-woff
  • ttf - application/font-sfnt
  • otf - application/font-sfnt

Application

  • wasm - application/wasm
  • dex - application/vnd.android.dex
  • dey - application/vnd.android.dey

Benchmarks

Measured using real files.

Environment: OSX x64 i7 2.7 GHz

BenchmarkMatchTar-8    1000000        1083 ns/op
BenchmarkMatchZip-8    1000000        1162 ns/op
BenchmarkMatchJpeg-8   1000000        1280 ns/op
BenchmarkMatchGif-8    1000000        1315 ns/op
BenchmarkMatchPng-8    1000000        1121 ns/op

License

MIT - Tomas Aparicio

Comments
  • support ms ooxml

    The code comes from the msooxml magic rule of file.

    Matchers is a map, and map iteration order is random, so the zip rule may be matched before the msooxml rules. I simply added an array that keeps the same order in which the matchers are registered.

    Also, the msooxml rules share the same check code, but I don't know how to merge them into one type, because the type and the checker seem to be bound together when calling register.
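
The ordering concern in this comment can be illustrated with a minimal sketch. It does not reproduce the library's registry: `matcher`, `defaultRegistry` and `detect` are hypothetical names, and the "docx" check is a simplified stand-in for the msooxml rule. The only point is that a slice is iterated in registration order, so a specific rule can be tried before the generic zip rule.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a type name with its detection function.
// All names in this sketch are hypothetical, not the library's API.
type matcher struct {
	name  string
	match func(buf []byte) bool
}

var zipMagic = []byte{'P', 'K', 0x03, 0x04}

// defaultRegistry returns matchers in priority order. Because it is a
// slice, iteration order is stable; a map would be iterated randomly.
func defaultRegistry() []matcher {
	return []matcher{
		// Specific rule first (simplified stand-in for the msooxml rule).
		{"docx", func(b []byte) bool {
			return bytes.HasPrefix(b, zipMagic) && bytes.Contains(b, []byte("word/"))
		}},
		// Generic rule last.
		{"zip", func(b []byte) bool { return bytes.HasPrefix(b, zipMagic) }},
	}
}

// detect returns the name of the first matcher that accepts buf.
func detect(reg []matcher, buf []byte) string {
	for _, m := range reg {
		if m.match(buf) {
			return m.name
		}
	}
	return "unknown"
}

func main() {
	reg := defaultRegistry()
	fmt.Println(detect(reg, []byte("PK\x03\x04....word/document.xml"))) // docx
	fmt.Println(detect(reg, []byte("PK\x03\x04....readme.txt")))        // zip
}
```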

  • shorter way of comparing byte slices

    To check byte slices for equality, it's simpler to call bytes.Equal() (or even bytes.HasPrefix(), which also removes the explicit length check).

    This can be applied in other places too; this change does it in just one place.

    On a side note, in Go slices have a length built in, so passing in the length as an argument is not necessary. len(buf) should return the same thing.
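
As a small illustration of the suggestion, the sketch below compares a hand-written signature check with `bytes.HasPrefix` from the standard library. The function names `isPNGManual` and `isPNG` are made up for this example and are not taken from the library's matchers.

```go
package main

import (
	"bytes"
	"fmt"
)

// pngMagic is the 8-byte PNG signature.
var pngMagic = []byte{0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}

// isPNGManual compares byte by byte and needs an explicit length check.
func isPNGManual(buf []byte) bool {
	return len(buf) > 7 &&
		buf[0] == 0x89 && buf[1] == 'P' && buf[2] == 'N' && buf[3] == 'G' &&
		buf[4] == 0x0D && buf[5] == 0x0A && buf[6] == 0x1A && buf[7] == 0x0A
}

// isPNG is equivalent: HasPrefix returns false when buf is shorter
// than the signature, so no separate length check is required.
func isPNG(buf []byte) bool {
	return bytes.HasPrefix(buf, pngMagic)
}

func main() {
	sample := append([]byte{}, pngMagic...)
	sample = append(sample, 0x00, 0x00)
	fmt.Println(isPNGManual(sample), isPNG(sample)) // true true
	fmt.Println(isPNG([]byte{0x89, 'P'}))           // false
}
```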

  • switching filetype to use Ragel

    Thanks for the library.

    The benchmark was of particular interest to me. When matching the contents of a file, there are more efficient ways to detect binary patterns.

    I did a proof of concept using Ragel. It is an external dependency, but it generates the final golang code as an efficient state machine.

    At the time of writing this issue, I was able to support your benchmarks for images, zip, and tar. The documents that have XML were skipped at the moment because I cannot discern their patterns as easily as the others.

    The benchmarks were run with the same fixtures.

    These are the results. A test was used to validate that the correct file types were being returned, too.

    goos: darwin
    goarch: amd64
    BenchmarkMatchTar-4    	50000000	       183 ns/op
    BenchmarkMatchZip-4    	1000000000	         6.23 ns/op
    BenchmarkMatchJpeg-4   	2000000000	         4.98 ns/op
    BenchmarkMatchGif-4    	2000000000	         4.47 ns/op
    BenchmarkMatchPng-4    	1000000000	         6.80 ns/op
    

    This happened on a 1.7 GHz Intel Core i7 Macbook Air 2014.

    I'd like to contribute the work back. It seems that we can get this to be really fast.

    Ragel machines can be language agnostic, so the same machine could be used for C-Python.

  • MP4 file that is not H.264 isn't detected

    I have an MP4 file that contains a "mpeg-4" video (MPEG-4 Visual in MediaInfo), but it is detected as "Unknown" by the library. Shouldn't it check only the MP4 header, not the contained codec per se?

  • panic: runtime error: slice bounds out of range

    While running go-fuzz on one of our services, I discovered an input that raised the following runtime error:

    panic: runtime error: slice bounds out of range
    
    goroutine 1 [running]:
    github.com/h2non/filetype/matchers/isobmff.GetFtyp(0x7f3b1f727000, 0x1a, 0x1a, 0x489801, 0x4cf652, 0x4cf652, 0x4, 0x240bd42694a81301, 0xc000049c70, 0x40c7ff)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/isobmff/isobmff.go:27 +0x353
    github.com/h2non/filetype/matchers.Heif(0x7f3b1f727000, 0x1a, 0x1a, 0x4a2070)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/image.go:119 +0xb8
    github.com/h2non/filetype/matchers.NewMatcher.func1(0x7f3b1f727000, 0x1a, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/matchers.go:26 +0x81
    gopkg.in/h2non/filetype%2ev1.Match(0x7f3b1f727000, 0x1a, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/home/<name>/gocode/src/gopkg.in/h2non/filetype.v1/match.go:29 +0x20a
    gopkg.in/h2non/filetype%2ev1.Get(...)
    	/home/<name>/gocode/src/gopkg.in/h2non/filetype.v1/match.go:40
    github.com/h2non/filetype.Fuzz(0x7f3b1f727000, 0x1a, 0x1a, 0x4)
    	/home/<name>/gocode/src/github.com/h2non/filetype/fuzz.go:9 +0x7a
    go-fuzz-dep.Main(0xc000049f80, 0x1, 0x1)
    	/tmp/go-fuzz-build324713724/goroot/src/go-fuzz-dep/main.go:36 +0x1b6
    main.main()
    	/tmp/go-fuzz-build324713724/gopath/src/github.com/h2non/filetype/go.fuzz.main/main.go:15 +0x52
    exit status 2
    

    83f44c13f8e6579e1f5e3ec0d047160288363c99.zip
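
The trace shows a 26-byte input (0x1a) reaching `isobmff.GetFtyp`. A common cause of this kind of panic is trusting a length field read from the input. The following is a hedged sketch of a bounds-checked reader for a leading `ftyp` box, not the library's actual fix; `parseFtyp` is a hypothetical name, and the layout assumed is size (4 bytes, big-endian), `ftyp`, major brand, minor version, then 4-byte compatible brands.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseFtyp is a hypothetical, bounds-checked reader for a leading
// "ftyp" box. It never slices past len(buf).
func parseFtyp(buf []byte) (major string, compatible []string, ok bool) {
	// size(4) + "ftyp"(4) + major brand(4) + minor version(4)
	if len(buf) < 16 || string(buf[4:8]) != "ftyp" {
		return "", nil, false
	}
	// The declared box size comes from the input and cannot be trusted.
	size := uint64(binary.BigEndian.Uint32(buf[0:4]))
	if size < 16 {
		return "", nil, false
	}
	end := len(buf)
	if size < uint64(end) {
		end = int(size)
	}
	// Read only whole 4-byte brands that fit inside the clamped range.
	for i := 16; i+4 <= end; i += 4 {
		compatible = append(compatible, string(buf[i:i+4]))
	}
	return string(buf[8:12]), compatible, true
}

func main() {
	valid := []byte("\x00\x00\x00\x18ftypheic\x00\x00\x00\x00mif1heic")
	fmt.Println(parseFtyp(valid))

	// Declared size far larger than the data, as a fuzzer might produce.
	hostile := []byte("\xff\xff\xff\xffftypheic\x00\x00\x00\x00mif1he")
	fmt.Println(parseFtyp(hostile))
}
```

Clamping the declared size to `len(buf)` means a truncated or hostile header yields fewer brands instead of a panic.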

  • Fix MP4 matcher

    With information from http://www.file-recovery.com/mp4-signature-format.htm.

    I tested on many MP4 files I have here, all were detected.

    I redid the implementation in a way that makes it easier to add more 4-byte codes, as it seems there are many of them.

    Let me know if this approach is OK with you, or if you prefer it done the way it was before.
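
A table-driven matcher of the kind described might look like the sketch below. It is not the code from the pull request: `mp4Brands` and `isMP4` are hypothetical names, and the brand list is a deliberately incomplete illustration. The check looks for `ftyp` at offset 4 and then looks up the 4-byte brand at offset 8.

```go
package main

import "fmt"

// mp4Brands is an illustrative, incomplete set of ftyp major brands.
// Supporting another 4-byte code means adding one entry here.
var mp4Brands = map[string]bool{
	"isom": true,
	"iso2": true,
	"mp41": true,
	"mp42": true,
	"avc1": true,
}

// isMP4 checks the container header only, not the codec inside it.
func isMP4(buf []byte) bool {
	if len(buf) < 12 || string(buf[4:8]) != "ftyp" {
		return false
	}
	return mp4Brands[string(buf[8:12])]
}

func main() {
	fmt.Println(isMP4([]byte("\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00"))) // true
	fmt.Println(isMP4([]byte("\x00\x00\x00\x18ftypzzzz\x00\x00\x00\x00"))) // false
}
```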

  • Enhance Zstd support

    Zstandard compressed data is made of one or more frames. There are two frame formats defined by Zstandard: Zstandard frames and Skippable frames.

    See more details from https://tools.ietf.org/id/draft-kucherawy-dispatch-zstd-00.html#rfc.section.2

    The structure of a single Zstandard frame is as follows, the magic number of Zstandard frame is 0xFD2FB528

      +--------------------+------------+
      |    Magic_Number    | 4 bytes    |
      +--------------------+------------+
      |    Frame_Header    | 2-14 bytes |
      +--------------------+------------+
      |     Data_Block     | n bytes    |
      +--------------------+------------+
      | [More Data Blocks] |            |
      +--------------------+------------+
      | [Content Checksum] | 0-4 bytes  |
      +--------------------+------------+
    

    Skippable Frames

      +--------------+------------+-----------+
      | Magic_Number | Frame_Size | User_Data |
      +--------------+------------+-----------+
      |    4 bytes   |   4 bytes  |  n bytes  |
      +--------------+------------+-----------+
    
    Magic_Number: 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
    Frame_Size: This is the size `n` of the following UserData, 4 bytes, little-endian format, unsigned 32-bits.
    

    This library can't handle a zstd file that begins with a skippable frame; this PR fixes that. In such a file, a skippable frame with magic number 0x184D2A50 sits in front of the Zstandard frame magic number 0xFD2FB528, so we should parse the skippable frame, skip the user data, and then check for the magic number 0xFD2FB528.

    By the way, I can't find an elegant way to write another test for zstd, so I just wrote a test under the for loop.
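
The frame layout described in this comment can be sketched as follows. This is an illustration written from the description above, not the code of the pull request; `isZstd` and the constant names are hypothetical. Magic numbers and Frame_Size are read as little-endian 32-bit values.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	zstdMagic          = 0xFD2FB528 // Zstandard frame
	skippableMagic     = 0x184D2A50 // 0x184D2A50 through 0x184D2A5F
	skippableMagicMask = 0xFFFFFFF0
)

// isZstd reports whether buf starts with a Zstandard frame, optionally
// preceded by one or more skippable frames.
func isZstd(buf []byte) bool {
	for len(buf) >= 4 {
		magic := binary.LittleEndian.Uint32(buf)
		if magic == zstdMagic {
			return true
		}
		if magic&skippableMagicMask != skippableMagic || len(buf) < 8 {
			return false
		}
		// Frame_Size is the length of the user data that follows.
		size := uint64(binary.LittleEndian.Uint32(buf[4:8]))
		if size > uint64(len(buf)-8) {
			return false // user data extends past the bytes we were given
		}
		buf = buf[8+int(size):]
	}
	return false
}

func main() {
	plain := []byte{0x28, 0xB5, 0x2F, 0xFD, 0x00}
	skipped := []byte{
		0x50, 0x2A, 0x4D, 0x18, // skippable frame magic
		0x02, 0x00, 0x00, 0x00, // Frame_Size = 2
		0xAA, 0xBB, // user data
		0x28, 0xB5, 0x2F, 0xFD, // Zstandard frame magic
	}
	fmt.Println(isZstd(plain), isZstd(skipped)) // true true
}
```

One limitation of this approach: if only a short header is passed in and the skippable frame is larger than that header, the function returns false because the Zstandard magic number is out of reach.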

  • sample.dex file triggering antivirus engines :/

    I just had an awkward situation trying to go get a tool that used this module from my work laptop: the corporate cybersecurity solution (Fortinet Forticlient Antivirus) tripped on the sample.dex, reporting it as some kind of Android trojan.

    VirusTotal also reports positives from several other AV engines: https://www.virustotal.com/gui/file/8995adc809fd239ecd2806c6957ee98db6eb06b64dac55089644014d87e6f956/detection

    That said, I don't believe you meant harm or are trying to sneak in trojans to the world though. This looks like an unfortunate case of a suspicious file that made it into the unit tests suite; that is all.

    I saw it was added by a commit from @mikusjelly but where did they get the file from? In any case, do you think it could be possible to swap it for another .dex that is not flagged as highly suspicious? -- If you upload the new .dex to virustotal.com for a scan and if it comes out totally clean then it's good for the repo.

    What do you think?

    ps: I emailed Fortinet to report it as a possible false positive and they came back to me with:

    The sample contains suspicious codes that are related to the SMS service, purchase interface, payment, bill, China Mobile, China Unicom, and China Telecommunications Corporation. The class names and function names are all simply obfuscated, and it also involved the "android.provider.Telephony.SMS_RECEIVED" and "android.provider.Telephony.SMS_DELIVER" as part of the suspicious behaviors.

  • Office filetypes?

    I really like the no-deps, no-cgo approach, just like https://godoc.org/net/http#DetectContentType

    Would still be useful to have more filetypes, though. Have you thought about office filetypes?

    • xls, xlsx
    • doc, docx
    • ppt, pptx
    • odt
    • ods
    • odp
  • Travis-ci: added support for ppc64le

    Signed-off-by: Devendranath Thadi

    Added power support for the travis.yml file with ppc64le. This is part of the Ubuntu distribution for ppc64le. This helps us simplify testing later when distributions are re-building and re-releasing.

  • Replace some fixtures with provably free content (#46)

    It seems, as @aviau noted in #46, that at least the sample.tif file in the fixtures directory is non-free. The file contained a message right in the image: "This file is distributed with Techsoft PixEdit as a sample file, and is used with permission from the document owner." (That permission would NOT, in most countries' copyright law, implicitly extend to anyone OTHER than Techsoft.)

    Some of the other files in the directory appeared similarly suspect, or at least there was nothing to indicate that they ARE free content. And since there's an absolute wealth of free content out there, for at least certain formats, it makes no sense to use anything other than free content. So, this PR gets the ball rolling by replacing three "low-hanging fruit", including and especially the Techsoft TIF image.

    • sample.gif is a CC-BY-SA licensed image via Wikimedia Commons
    • sample.tif is a public-domain Hubble Space Telescope image (thanks NASA!)
    • sample.webm is a CC-BY licensed video via Wikimedia Commons

    This change does not come without some tradeoffs in terms of file size.

    • sample.gif grows from 3.3 KB to 390 KB, but it's a far better test file for it
    • sample.webm grows from ~ 230 KB to ~ 330 KB, fairly minor
    • sample.tif grows from 209 KB to an obscene 5 MB, and I do apologize for that, but one of the claims is that the tests are run on "real files", and it is hard to find SMALL "real" TIFF images. There is a strong bias for prioritizing quality and resolution over compactness when storing data in TIFF form. (I had to search through quite a few NASA galleries to find a file that small: others were tens or many hundreds of megabytes, some over a full gig!)

    I personally think the increased sizes are a reasonable tradeoff, and 5 MB in today's terms is really not that big a deal. But if it's unacceptably large, I can keep looking for a smaller replacement for at least the TIF file.

    Last but not least, a new file fixtures/sources.txt provides provenance details for all three files, including the applicable free content license details and any required attributions.

    Partly addresses: #46

  • Easy way to recognize new formats?

    Hi, I wonder if there is a clear explanation somewhere of how to create or add new formats. Also, would there be an easy way to start from any magic file and add its formats to this project? Maybe I missed the point...

  • docx file is recognised as zip file

    I created a simple Word document with the word "hello" in it (attached) and saved it as a .docx file.

    But filetype.Match returns zip even if I send the entire file (all 12 KB) to filetype.Match.

    I'm using version v1.1.3

    Hello.docx

    Sounds like it's related to this bug https://stackoverflow.com/a/72664761/8595398 where filetype.Match isn't looking at enough of the file to determine that it's docx rather than zip
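
When the whole file is available, one way to tell OOXML documents from plain zip archives is to list the archive entries. The sketch below uses only the standard library `archive/zip` package rather than filetype, so it is a different technique from header matching. `ooxmlKind` and `buildZip` are hypothetical helper names, and the prefixes `word/`, `xl/` and `ppt/` are the conventional OOXML part locations.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"strings"
)

// ooxmlKind inspects the entry names of a complete zip archive and
// guesses the Office Open XML flavor from the top-level directory.
func ooxmlKind(data []byte) string {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "unknown"
	}
	for _, f := range zr.File {
		switch {
		case strings.HasPrefix(f.Name, "word/"):
			return "docx"
		case strings.HasPrefix(f.Name, "xl/"):
			return "xlsx"
		case strings.HasPrefix(f.Name, "ppt/"):
			return "pptx"
		}
	}
	return "zip"
}

// buildZip creates a small in-memory archive with the given entry
// names. It exists only to make the example self-contained.
func buildZip(names ...string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, n := range names {
		w, err := zw.Create(n)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write([]byte("x")); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(ooxmlKind(buildZip("[Content_Types].xml", "word/document.xml"))) // docx
	fmt.Println(ooxmlKind(buildZip("readme.txt")))                               // zip
}
```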

  • Add MPEG-TS video format

    Hello!

    It seems filetype fails to identify .ts files as MPEG-TS (MPEG Transport Stream). As an FYI, this might be a little messy to ID as the header is 0x47 on every 188th byte.
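
The repeating sync byte can be checked with a short loop. This is a sketch of the idea only, not an existing matcher in the library; `looksLikeMPEGTS` is a hypothetical name, and the number of packets to verify is left to the caller.

```go
package main

import "fmt"

const (
	tsPacketSize = 188  // length of one transport stream packet
	tsSyncByte   = 0x47 // first byte of every packet
)

// looksLikeMPEGTS requires the sync byte at the start of the given
// number of consecutive 188-byte packets.
func looksLikeMPEGTS(buf []byte, packets int) bool {
	if packets < 1 || len(buf) < (packets-1)*tsPacketSize+1 {
		return false
	}
	for i := 0; i < packets; i++ {
		if buf[i*tsPacketSize] != tsSyncByte {
			return false
		}
	}
	return true
}

func main() {
	buf := make([]byte, 400)
	buf[0], buf[188], buf[376] = tsSyncByte, tsSyncByte, tsSyncByte
	fmt.Println(looksLikeMPEGTS(buf, 3))       // true
	fmt.Println(looksLikeMPEGTS(buf[:262], 2)) // true
}
```

A 262-byte header contains only two packet boundaries (offsets 0 and 188), so a check limited to the header is weaker than one that can read several packets.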

  • Why ppt matcher return false for a ppt file

    func Ppt(buf []byte) bool {
        if len(buf) > 513 {
            return buf[0] == 0xD0 && buf[1] == 0xCF &&
                buf[2] == 0x11 && buf[3] == 0xE0 &&
                buf[512] == 0xA0 && buf[513] == 0x46
        } else {
            return len(buf) > 3 &&
                buf[0] == 0xD0 && buf[1] == 0xCF &&
                buf[2] == 0x11 && buf[3] == 0xE0
        }
    }

    Location: filetype/matchers/document. In my ppt file, bytes 512 and 513 are not 0xA0 and 0x46; they are 0xFD and 0xFF. The file was created just now on macOS.

  • ASCII Text Files Starting With Letters "BM" Are Treated As BMP Image Files

    Bug: ASCII Text Files Starting With Letters "BM" Are Treated As BMP Image Files

    Example code:

    go get github.com/h2non/filetype
    
    package main
    
    import (
        "fmt"
        "github.com/h2non/filetype"
    )
    
    func main() {
        file_contents := []byte("BMW")
    
        kind, _ := filetype.Match(file_contents)
        fmt.Println(kind)
    }
    
    go build -o project *.go
    ./project
    

    Output:

    {{image bmp image/bmp} bmp}
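
One possible way to reduce this kind of false positive is to look past the two-byte magic. The sketch below is a heuristic and not the library's matcher: `isBMPStrict` is a hypothetical name, and it assumes the 14-byte BMP file header is followed by a DIB header whose little-endian size field holds one of the commonly documented lengths.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// dibHeaderSizes lists commonly documented DIB header lengths.
// The set is an assumption of this sketch and may not be exhaustive.
var dibHeaderSizes = map[uint32]bool{
	12: true, 40: true, 52: true, 56: true, 64: true, 108: true, 124: true,
}

// isBMPStrict requires "BM" plus a plausible DIB header size at
// offset 14, right after the 14-byte file header.
func isBMPStrict(buf []byte) bool {
	if len(buf) < 18 || buf[0] != 'B' || buf[1] != 'M' {
		return false
	}
	return dibHeaderSizes[binary.LittleEndian.Uint32(buf[14:18])]
}

func main() {
	fmt.Println(isBMPStrict([]byte("BMW"))) // false

	hdr := make([]byte, 18)
	hdr[0], hdr[1] = 'B', 'M'
	hdr[14] = 40 // header size 40 in little-endian
	fmt.Println(isBMPStrict(hdr)) // true
}
```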
    
  • tar file not being recognized

    :wave: filetype happy user here! Today someone opened an issue in my bin project (https://github.com/marcosnils/bin/issues/140) which led me here.

    filetype is unable to detect the tar archive inside this gzipped file: https://github.com/sass/dart-sass/releases/download/1.52.3/dart-sass-1.52.3-linux-x64.tar.gz. However, tar -xf works, and running file <dart-sass-1.52.3-linux-x64.tar> correctly detects the file type.

    It seems the file's MIME headers are not being set properly. Still, it's interesting that file detects it as a tar archive even when the extension is removed.

    file -i pepe 
    pepe: application/x-tar; charset=binary
    

    Any pointers here?
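
The issue does not establish why detection fails for this archive. One possibility (an assumption, not confirmed here) is that the header lacks the `ustar` magic string, in which case a header checksum test is a common fallback for recognizing tar data. The sketch below uses only the standard library; `validTarChecksum`, `isGzippedTar` and `buildTarGz` are hypothetical names, and it needs the first 512 decompressed bytes rather than a 262-byte header.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// validTarChecksum reports whether the first 512 bytes form a tar
// header whose stored checksum matches the computed one.
func validTarChecksum(block []byte) bool {
	if len(block) < 512 {
		return false
	}
	// The checksum is stored as octal text at offset 148, 8 bytes wide.
	field := strings.Trim(string(block[148:156]), " \x00")
	stored, err := strconv.ParseUint(field, 8, 32)
	if err != nil {
		return false
	}
	var sum uint64
	for i, b := range block[:512] {
		if i >= 148 && i < 156 {
			b = ' ' // the checksum field itself counts as eight spaces
		}
		sum += uint64(b)
	}
	return sum == stored
}

// isGzippedTar decompresses the first 512 bytes and checks them.
func isGzippedTar(data []byte) bool {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return false
	}
	defer zr.Close()
	block := make([]byte, 512)
	if _, err := io.ReadFull(zr, block); err != nil {
		return false
	}
	return validTarChecksum(block)
}

// buildTarGz creates a tiny .tar.gz in memory for the demonstration.
func buildTarGz() []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(zw)
	body := []byte("hello")
	hdr := &tar.Header{Name: "hello.txt", Mode: 0644, Size: int64(len(body)), Format: tar.FormatUSTAR}
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}
	if _, err := tw.Write(body); err != nil {
		panic(err)
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(isGzippedTar(buildTarGz()))       // true
	fmt.Println(isGzippedTar([]byte("not gzip"))) // false
}
```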
