Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package changed to code.sajari.com/docconv.

Installation

If you haven't set up Go before, you first need to install Go.

To fetch and build the code:

$ go get code.sajari.com/docconv/...

This will also build the command line tool docd into $GOPATH/bin. Make sure that $GOPATH/bin is in your PATH environment variable.

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example install of dependencies (apt-based systems only; package names differ elsewhere):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
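
If one of these tools is missing, the problem tends to show up only at conversion time, for example as: exec: "pdftotext": executable file not found in $PATH. The following is a minimal sketch of an up-front check using only the Go standard library. The binary names pdftotext and wvText come from error messages reported by users; unrtf and tidy are assumed to match their package names. missingTools is a hypothetical helper and not part of docconv.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names from tools that cannot be found on PATH.
// It is a hypothetical helper for illustration, not a docconv function.
func missingTools(tools []string) []string {
	var missing []string
	for _, name := range tools {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumed binary names: pdftotext (poppler-utils), wvText (wv), unrtf, tidy.
	tools := []string{"pdftotext", "wvText", "unrtf", "tidy"}
	if missing := missingTools(tools); len(missing) > 0 {
		fmt.Println("missing external tools:", missing)
		return
	}
	fmt.Println("all external tools found")
}
```

Running a check like this at start-up gives a single clear message instead of a failure on the first document of a given type.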

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

The build may fail on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs as either:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

    Optionally you can build it yourself:

    cd docd
    docker build -t docd .
    
  3. via the command line.

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf
    

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • log-level
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set

How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Alternatively, via curl (the @ prefix tells curl to upload the file's contents rather than send the file name as a string):

curl -s -F input=@your-file.pdf http://localhost:8888/convert
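
For callers that prefer not to depend on the client package, the same multipart request can be built with the Go standard library. This is a rough sketch, not the client package's implementation: the form field name "input" and the /convert path are taken from the curl example, newConvertRequest is a hypothetical helper, and the response is decoded into a generic map because the exact JSON field names are not listed here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
)

// newConvertRequest builds a multipart POST that carries data in the form
// field "input", the field name used by the curl example.
func newConvertRequest(endpoint, filename string, data []byte) (*http.Request, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("input", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(data); err != nil {
		return nil, err
	}
	// Close writes the terminating boundary; the body is invalid without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: post <file>")
		return
	}
	path := os.Args[1]
	data, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	// Assumes docd is listening on its default address.
	req, err := newConvertRequest("http://localhost:8888/convert", filepath.Base(path), data)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode generically rather than guessing at field names.
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```
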
Comments
  • Compatibility with Windows

    Docconv appears to have trouble running on Windows. In doc.go, for instance, there are hardcoded paths to a temp directory that include a forward slash (e.g. lines 17 and 60). This is definitely a problem on Windows.

  • issues with deploying to gcloud

    I tried to deploy it to App Engine, and the build fails:

    
    ---------------------------------------------------------------------------------------------------------------- REMOTE BUILD OUTPUT -----------------------------------------------------------------------------------------------------------------
    starting build "b30eafdb-e2e2-4a5d-b334-6be6535a8773"
    
    FETCHSOURCE
    Fetching storage object: gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426
    Copying gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426...
    / [1 files][  6.4 MiB/  6.4 MiB]
    Operation completed over 1 objects/6.4 MiB.
    BUILD
    Already have image (with digest): gcr.io/cloud-builders/docker
    Sending build context to Docker daemon  13.49MB
    Step 1/9 : FROM alpine
    latest: Pulling from library/alpine
    Digest: sha256:acd3ca9941a85e8ed16515bfc5328e4e2f8c128caa72959a58a127b7801ee01f
    Status: Downloaded newer image for alpine:latest
     ---> 961769676411
    Step 2/9 : MAINTAINER Hamish Ogilvy
     ---> Running in 1bf83502be90
    Removing intermediate container 1bf83502be90
     ---> 1d609d316173
    Step 3/9 : ENV CC=/usr/bin/gcc
     ---> Running in ebc84ab28e30
    Removing intermediate container ebc84ab28e30
     ---> 857461fa7c94
    Step 4/9 : ENV CXX=/usr/bin/g++
     ---> Running in 7022ffdf3aa6
    Removing intermediate container 7022ffdf3aa6
     ---> e6a37cfd4e07
    Step 5/9 : COPY dependencies/* /
    COPY failed: no source files were specified
    ERROR
    ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    ERROR: (gcloud.app.deploy) Cloud build failed. Check logs at https://console.cloud.google.com/gcr/builds/xxx?project=xxx Failure status: UNKNOWN: Error Response: [2] Build failed; check build logs for details
    rm: appengine/dependencies: No such file or directory
    
  • Fix temporary file creation

    The following line throws an error message for me: f, err := ioutil.TempFile(os.TempDir(), "/docconv")

    error message: pattern contains path separator

    full error message thrown by docconv.Convert(): error converting data: error creating local file: error creating temporary file: pattern contains path separator

    This can be easily reproduced with go playground, just add and remove the slash to see the difference: https://play.golang.org/p/VaCO9evqKzM

    I suppose it could be fixed by just removing the slash, unless I'm missing the reason it needed to be there in the first place.
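
The behaviour described in this report can be reproduced with the standard library alone. The sketch below uses os.CreateTemp, which replaced ioutil.TempFile in Go 1.16 and applies the same pattern check; tryCreateTemp is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os"
)

// tryCreateTemp creates and immediately removes a temporary file using
// pattern, returning any error reported by os.CreateTemp.
func tryCreateTemp(pattern string) error {
	f, err := os.CreateTemp(os.TempDir(), pattern)
	if err != nil {
		return err
	}
	f.Close()
	return os.Remove(f.Name())
}

func main() {
	// A pattern that contains a path separator is rejected.
	fmt.Println(`pattern "/docconv":`, tryCreateTemp("/docconv"))
	// Without the separator the file is created normally.
	fmt.Println(`pattern "docconv":`, tryCreateTemp("docconv"))
}
```

Because the directory is already passed as the first argument, the pattern only needs to be a file name prefix, which is why dropping the slash is enough.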

  • docconv appears to be out of sync with latest github.com/otiai10/gosseract

    With the current docconv at github/sajari/docconv, which imports github.com/otiai10/gosseract/v1/gosseract, the build in a macOS terminal results in:

    go get -tags ocr code.sajari.com/docconv/...
    package github.com/otiai10/gosseract/v1/gosseract: cannot find package "github.com/otiai10/gosseract/v1/gosseract" in any of:
        /usr/local/go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOROOT)
        /Users//go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOPATH)

    Changing the import to reference otiai10's current release (i.e. "github.com/otiai10/gosseract/v1/gosseract") in image_ocr.go results in undefined references as follows:

    go get -tags ocr code.sajari.com/docconv/...
    code.sajari.com/docconv
    /Users//go/src/code.sajari.com/docconv/image_ocr.go:35:11: undefined: gosseract.Must
    /Users//go/src/code.sajari.com/docconv/image_ocr.go:35:26: undefined: gosseract.Params

  • Use as internal go library

    Is there a way to use this as an internal/embedded library in a Go program? I see that I can most likely achieve this with ODT and several other types, since those return a string.

    However, with PDF it returns a BodyResult that is not exported and looks like it is interpreted directly by the command line. Am I missing something?

  • error converting data: exec: "pdftotext": executable file not found in $PATH

    I'm trying to run the simple example from this repo with my own PDF file and get this error: error converting data: exec: "pdftotext": executable file not found in $PATH.

    Platform: macOS. My PDF file is in go/src/project and in go/bin.

    My Go project file path: /User/admin/go/src/project

    .bash_profile:

    export GOPATH=$HOME/go
    export GOBIN=$GOPATH/bin
    export PATH=$PATH:/usr/local/go/bin
    

    Code:

    package main
    
    import (
    	"fmt"
    	"log"
    
    	"code.sajari.com/docconv"
    )
    
    func main() {
    	res, err := docconv.ConvertPath("gsl-mit-edu-0to1.pdf")
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(res)
    }
    

    What could be the problem?

  • Convert Images in PDFs

    This is based on the work in https://github.com/sajari/docconv/pull/19. Many thanks to @marioidival for getting that started.

    The objective is to enable this tool to perform character recognition on images within PDFs in addition to its current pdftotext capabilities.

    When the project is built with the ocr tag, ConvertPDF will detect images within the document and invoke ConvertImage on each of them.

    Note that our gosseract dependency just released a v2 with a breaking change. In order to preserve the current integration, I've updated the import statement to use gosseract/v1/gosseract as recommended in their current README.

  • go-charset/charset is no longer hosted on Google Code

    New hosting: https://github.com/rogpeppe/go-charset

    $ go get github.com/sajari/docconv
    warning: code.google.com is shutting down; import path code.google.com/p/go-charset/charset will stop working
    
  • Possible license inconsistencies

    Hello,

    We were considering using your library as part of our application and discovered one potential license inconsistency:

    Your library is licensed under MIT and has poppler-utils in the dependencies. However, poppler is licensed under GPL 2.

    License information: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/README.md https://pkgs.alpinelinux.org/package/edge/main/x86/poppler-utils

    In our understanding, this could oblige your library to be licensed under GPL 2. I'm not a license expert and might be mistaken here, but I hope you find this observation helpful. You may have already considered it, and there may be reasons why MIT is still fine. It would be great if you could clarify this and explain to us how to legally use your library and poppler in our app under MIT.

    Thank you in advance!

  • docd: refactor Dockerfile and publish to DockerHub

    • Use multi-stage build to compile docd
    • Bring debian variant up to date
    • Deprecate the alpine variant
    • Add GitHub action to publish an official image to DockerHub
    • Use published image as the AppEngine custom runtime

    TODO:

    • [x] add DockerHub credentials for publishing
    • [ ] release v1.1.1
  • enable html output when readability is set to true

    If html.go > HTMLReadabilityOptionsValues.ReadabilityUseClasses is left as an empty string as initialized, nothing will be included in the output. "good" is probably the very minimum that should be included in the output.

  • Use "good" and "neargood" as the default for html when ReadabilityUseClasses is empty

    These are the defaults for docd, but they are not applied when docconv is used as a library, which causes the problem seen in issue #78.

    This PR has one drawback: an intentionally empty list for readabilityUseClasses is no longer honoured. That case makes little sense anyway, because the resulting extraction would be an empty body.

    Fixes #78
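
The fallback described in this PR can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: useClasses is a hypothetical helper, and the only facts taken from the text are that the option is a comma separated list and that "good" and "neargood" are the intended defaults.

```go
package main

import (
	"fmt"
	"strings"
)

// useClasses splits a comma separated class list and falls back to
// "good" and "neargood" when the list contains no class names.
func useClasses(list string) []string {
	var classes []string
	for _, c := range strings.Split(list, ",") {
		if c = strings.TrimSpace(c); c != "" {
			classes = append(classes, c)
		}
	}
	if len(classes) == 0 {
		return []string{"good", "neargood"}
	}
	return classes
}

func main() {
	fmt.Println(useClasses(""))       // [good neargood]
	fmt.Println(useClasses(" good ")) // [good]
}
```
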

  • Fix mime type of .tif files

    .tif files should map to the image/tiff mime type.

    List of official mime types: https://www.iana.org/assignments/media-types/media-types.xhtml

    From MDN web docs: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types

  • system deadlock

    Lines 102 and 103 in doc.go cause a deadlock, mainly because the goroutines started above them fail to send data on the channels:

    	body := <-bc
    	meta := <-mc
    

    err:

    ConvertDoc: could not read doc: mscfb: bad signature; 43016997712
    wvText: exit status 255
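
One common way to avoid this class of hang is to make every worker goroutine send exactly one result, whether it succeeded or failed, so the receiving side can never wait forever. The sketch below shows that pattern in isolation. It is not the code in doc.go: result, run, convert, failingBody and okMeta are all hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// result carries either a value or the error that prevented producing it.
type result struct {
	val string
	err error
}

// run starts fn in a goroutine and returns a channel that always receives
// exactly one result, so a receiver cannot block forever when fn fails.
func run(fn func() (string, error)) <-chan result {
	ch := make(chan result, 1) // buffered, so the send never blocks
	go func() {
		v, err := fn()
		ch <- result{val: v, err: err}
	}()
	return ch
}

// convert is a hypothetical stand-in for a converter that extracts the
// body and the metadata concurrently.
func convert(body, meta func() (string, error)) (string, string, error) {
	bc, mc := run(body), run(meta)
	b, m := <-bc, <-mc
	if b.err != nil {
		return "", "", b.err
	}
	if m.err != nil {
		return "", "", m.err
	}
	return b.val, m.val, nil
}

var errRead = errors.New("could not read doc")

// failingBody simulates a body extractor that hits an unreadable file.
func failingBody() (string, error) { return "", errRead }

// okMeta simulates a metadata extractor that succeeds.
func okMeta() (string, error) { return "title=example", nil }

func main() {
	_, _, err := convert(failingBody, okMeta)
	fmt.Println("error:", err) // error: could not read doc
}
```

With this shape the error from the failing extractor is returned to the caller instead of leaving the two receives blocked.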
    