Extract URLs from text

xurls

Extract URLs from text using regular expressions. Requires Go 1.13 or later.

import "mvdan.cc/xurls/v2"

func main() {
	rxRelaxed := xurls.Relaxed()
	rxRelaxed.FindString("Do gophers live in golang.org?")  // "golang.org"
	rxRelaxed.FindString("This string does not have a URL") // ""

	rxStrict := xurls.Strict()
	rxStrict.FindAllString("must have scheme: http://foo.com/.", -1) // []string{"http://foo.com/"}
	rxStrict.FindAllString("no scheme, no match: foo.com", -1)       // []string{}
}

Since the API is centered around regexp.Regexp, many other methods are available, such as finding the byte indexes for all matches.
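
As a rough illustration of the index-based methods, the sketch below calls FindAllStringIndex from the standard regexp package. To stay runnable with only the standard library, it uses a simplified pattern (urlPattern, an assumption) in place of the regexp that xurls.Strict() returns; matchOffsets is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern is a deliberately simplified stand-in (an assumption) for the
// regexp returned by xurls.Strict(); it only knows http and https.
var urlPattern = regexp.MustCompile(`https?://\S+`)

// matchOffsets returns the [start, end) byte offsets of every match in s.
func matchOffsets(s string) [][]int {
	return urlPattern.FindAllStringIndex(s, -1)
}

func main() {
	s := "see http://foo.com/ and https://bar.org/x"
	for _, loc := range matchOffsets(s) {
		// loc[0] is the first byte of the match, loc[1] is one past the last.
		fmt.Printf("%d-%d %s\n", loc[0], loc[1], s[loc[0]:loc[1]])
	}
}
```

Any other *regexp.Regexp method, such as ReplaceAllString or FindAllStringSubmatch, can be used the same way.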

Note that each call to the exposed functions compiles a regular expression, so call them once and reuse the result rather than calling them repeatedly.
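
One common way to avoid recompiling is to store the regexp in a package-level variable. The sketch below assumes a hypothetical constructor, newURLRegexp, standing in for a function such as xurls.Relaxed(); the pattern and the compileCount counter exist only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileCount records how often the stand-in constructor runs.
var compileCount int

// newURLRegexp is a hypothetical stand-in for a constructor such as
// xurls.Relaxed(): every call compiles a fresh regular expression.
func newURLRegexp() *regexp.Regexp {
	compileCount++
	return regexp.MustCompile(`https?://\S+`)
}

// rxURL is compiled once, at package initialization, and reused afterwards.
// A *regexp.Regexp is safe for concurrent use by multiple goroutines.
var rxURL = newURLRegexp()

// firstURLs returns the first match in each line ("" when there is none).
func firstURLs(lines []string) []string {
	out := make([]string, 0, len(lines))
	for _, l := range lines {
		out = append(out, rxURL.FindString(l))
	}
	return out
}

func main() {
	fmt.Println(firstURLs([]string{"a http://foo.com/ b", "none", "https://bar.org"}))
	fmt.Println("compilations:", compileCount)
}
```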

cmd/xurls

To install the tool globally:

cd $(mktemp -d); go mod init tmp; GO111MODULE=on go get mvdan.cc/xurls/v2/cmd/xurls
$ echo "Do gophers live in http://golang.org?" | xurls
http://golang.org
Owner
Daniel Martí
I work on stuff in Go.
Comments
  • Support Arbitrary Protocols

    It looks like xurls only checks for https?, which matches http and https, instead of allowing arbitrary protocols (e.g., file://, ftp://, steam://, etc.).

    Any interest in adding support for arbitrary protocols?

  • Arch Linux PKGBUILDs separation

    Hi, just wanted to suggest an edit for the Arch Linux PKGBUILDs. Wouldn't it be better and cleaner to split the xurls package into xurls (which uses the already compiled Go releases from the Releases tab) and xurls-git for the latest upstream? That way other users won't have to install Go just to use your pretty cool piece of software. :)

    Thank you!

  • Invalid prefixes for URLs are matched

    This is probably due to adding support for arbitrary protocols.

    $ echo "systems.https://google.com" | xurls
    systems.https://google.com
    

    I am unsure, but is it ever actually valid for punctuation to appear in the scheme portion of a URL?

    I know that xurls wasn't really focusing on being a URL validator. But honestly, we aren't too far from accomplishing that and it would be helpful to know that matches are valid (in terms of the specification).
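
For reference, RFC 3986 section 3.1 defines scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), so a dot is syntactically allowed after the first letter and "systems.https" is a well-formed scheme. The sketch below separates the two questions, syntax versus a recognized scheme, using net/url. The classify helper and the knownSchemes allowlist are illustrative assumptions, not part of xurls.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// schemeGrammar encodes the RFC 3986 section 3.1 rule:
// scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
var schemeGrammar = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9+.-]*$`)

// knownSchemes is a tiny illustrative allowlist (an assumption).
var knownSchemes = map[string]bool{"http": true, "https": true, "ftp": true}

// classify parses raw and reports its scheme, whether that scheme is
// syntactically valid, and whether it appears in knownSchemes.
func classify(raw string) (scheme string, validSyntax, known bool) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", false, false
	}
	return u.Scheme, schemeGrammar.MatchString(u.Scheme), knownSchemes[u.Scheme]
}

func main() {
	for _, raw := range []string{"systems.https://google.com", "https://google.com"} {
		scheme, valid, known := classify(raw)
		fmt.Printf("%q: scheme=%q validSyntax=%v known=%v\n", raw, scheme, valid, known)
	}
}
```

Under these assumptions, rejecting "systems.https://google.com" needs a list of accepted schemes; the grammar alone does not rule it out.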

  • Add file support

    Useful little util.

    It would be nice if it, in addition to the stdin support, could work with file(s) as arguments.

    Currently this works:

    cat myfile.txt | xurls

    This just hangs there, waiting:

    xurls myfile.txt

    An example of a Go tool that works as expected for a *nix CLI tool would be ccat.

  • Email support

    "Hi, this is my email [email protected]"

    This extracts example.com, which isn't useful on its own. I would expect either the complete email address to be extracted or the address to be skipped entirely.

    What can be done for email addresses?

  • Improvement suggestion with multiple domains in one single URL.

    Hi, thanks for providing xurls.

    I came across the following case:

    $ echo "http://www.fakedomain.com/account/legitdomain.com" | bin/xurls -r
    http://www.fakedomain.com/account/legitdomain.com
    

    I wonder if there is an easy (and still fast) way for xurls to detect that there are 2 "URLs" inside. It could then report something like:

    $ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r
    http://www.fakedomain.com/account/legitdomain.com/folder
    legitdomain.com/folder
    $
    

    Possibly by adding an additional option to support it on demand only.

    If there is a space in the string, both are found, as expected:

    echo "http://www.fakedomain.com/        account/legitdomain.com/folder" | bin/xurls -r
    http://www.fakedomain.com/
    legitdomain.com/folder
    

    This is only a suggestion. If it hurts performance badly, it is probably better not to implement it.

  • Error with Input containing long lines

    Hi, thanks for providing xurls.

    I came across the following error when the input file contains very long lines:

    $ printf 'tototutu%.0s' {1..9000} > /tmp/a
    $ xurls -r  /tmp/a
    bufio.Scanner: token too long
    $
    $ printf 'tototutu%.0s' {1..5000} > /tmp/b
    $ xurls -r /tmp/b
    $
    

    I just wanted to report that this strange case can happen with long lines. As I'm not a good Go coder, it's better that I don't submit a PR.

  • Matching returns wrong url

    I'm playing with the library and tried it with a simple example, I'm surprised about the result: https://play.golang.org/p/4BF3UXE4x87

    Is it expected? Shouldn't "|" be treated as a wrong character?

  • Adding standard schemes with semiStrict and semiRelaxed option

    Currently the scheme regex we're using is very generic (rightfully so, according to https://tools.ietf.org/html/rfc3986#section-3.1), causing us to miss certain URLs.

    The idea is to be able to catch URLs that are hidden in obfuscated text, for example:

    "aatesthttp://www.google.com should get me google's page"

    Having a set of standard schemes in our regex allows this.
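
The difference between a generic scheme pattern and a fixed scheme list can be sketched as below. Both patterns are simplified assumptions for illustration and are not the expressions xurls uses; knownSchemes is a short made-up list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// knownSchemes is a short illustrative list, not xurls's actual scheme list.
var knownSchemes = []string{"https", "http", "ftp"}

// anyScheme accepts any scheme allowed by the generic RFC 3986 grammar.
var anyScheme = regexp.MustCompile(`[A-Za-z][A-Za-z0-9+.-]*://\S+`)

// fixedScheme only accepts schemes listed in knownSchemes.
var fixedScheme = regexp.MustCompile(`(?:` + strings.Join(knownSchemes, "|") + `)://\S+`)

func main() {
	s := "aatesthttp://www.google.com should get me google's page"

	// The generic pattern treats the leading letters as part of the scheme.
	fmt.Println(anyScheme.FindString(s))

	// The fixed list lets the match start at the known scheme instead.
	fmt.Println(fixedScheme.FindString(s))
}
```

The trade-off is that a fixed list misses schemes it does not know about, which is why the proposal keeps both variants.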

  • Add brackets to the allowed path chars

    Brackets are widely used in paths, so add them.

    For example, http://bgp.he.net/search?search[search]=vortex.data.microsoft.com&commit=Search was not matched.

  • Dangling dots, mid-string, are seen as domains

    Here I have two small edge-cases:

    • <[email protected]> yields []string{"some.gu", "domain.com"}
    • [cid:programmer-thumb-shield-32x32.v2_fe0f1423-2d7d-484b-b624-6b7545ab4311.png] yields []string{"fe0f1423-2d7d-484b-b624-6b7545ab4311.pn"}

    I'm just wondering about the dropped character before the symbol. This is email, so I can cross-reference against the filenames of inline attachments and also double-check against a known list of TLDs, but dropping that last character makes this difficult.

    Any ideas on why that last char is being dropped?

  • make a deterministic variant of "go generate" and have CI check it's up to date

    To prevent issues like https://github.com/mvdan/xurls/pull/67 in the future.

    Two changes should be done:

    1. Use clearer filenames for generated files, so they stand out in file change summaries. For example, schemes_gen.go rather than schemes.go.

    2. Split go generate into two phases; one to download the latest TLD and scheme lists from the internet and write them to files in the git repo (but outside the module zip), and another to take those files and generate the code. The default go generate would do both, but we would add a go generate -tags=noupdate to only do the second. CI would enforce the latter has an empty git diff.

  • Issue with Email Addresses

    I am using the xurls code to pull out possible urls from a message body string. The urls can be in either strict or relaxed format so I need to use the relaxed method of xurls to find the possible urls in the string. The issue is that email addresses can also be in the string and the relaxed method of xurls is pulling those out too.

    For example my string might be: "Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email [email protected] or [email protected]"

    What I would like xurls to do is just pull the http://www.google.com or www.test.com.

    Instead it pulls the 2 URLs, plus John.Sm, test.com, and test.com. Is there anything that can be done so that only URLs are pulled?
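
One possible workaround on the caller's side is to blank out anything that looks like an email address before running the relaxed match. The sketch below is an assumption-heavy illustration: relaxedURL is a simplified stand-in for the regexp from xurls.Relaxed(), emailAddr is a loose email pattern, and the addresses in the sample message are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// relaxedURL is a simplified stand-in (an assumption) for a relaxed matcher:
// an optional http(s) scheme, dotted host labels, and an optional path.
var relaxedURL = regexp.MustCompile(`(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:/\S*)?`)

// emailAddr is a loose pattern for email-like tokens, not a full validator.
var emailAddr = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+`)

// urlsWithoutEmails replaces email-like tokens with a space, then matches
// URLs in what is left.
func urlsWithoutEmails(s string) []string {
	cleaned := emailAddr.ReplaceAllString(s, " ")
	return relaxedURL.FindAllString(cleaned, -1)
}

func main() {
	// The email addresses here are hypothetical examples.
	msg := "Hello from http://www.google.com, please check www.test.com. " +
		"Questions: jane.doe@example.org or info@example.org"

	fmt.Println(relaxedURL.FindAllString(msg, -1)) // includes pieces of the emails
	fmt.Println(urlsWithoutEmails(msg))            // only the two URLs
}
```

Because the filter runs before matching, it also drops the local part of the address, which the relaxed pattern would otherwise report as a host.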

Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.

mxj - to/from maps, XML and JSON Decode/encode XML to/from map[string]interface{} (or JSON) values, and extract/modify values from maps by key or key-

Dec 29, 2022
A general purpose application and library for aligning text.

align A general purpose application that aligns text The focus of this application is to provide a fast, efficient, and useful tool for aligning text.

Sep 27, 2022
Parse placeholder and wildcard text commands

allot allot is a small Golang library to match and parse commands with pre-defined strings. For example use allot to define a list of commands your CL

Nov 24, 2022
Guess the natural language of a text in Go

guesslanguage This is a Go version of python guess-language. guesslanguage provides a simple way to detect the natural language of unicode string and

Dec 26, 2022
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.

omniparser Omniparser is a native Golang ETL parser that ingests input data of various formats (CSV, txt, fixed length/width, XML, EDI/X12/EDIFACT, JS

Jan 4, 2023
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

Dec 19, 2022
Easy AWK-style text processing in Go

awk Description awk is a package for the Go programming language that provides an AWK-style text processing capability. The package facilitates splitt

Jul 25, 2022
Change the color of console text.

go-colortext package This is a package to change the color of the text and background in the console, working both under Windows and other systems. Un

Oct 26, 2022
Templating system for HTML and other text documents - go implementation

FAQ What is Kasia.go? Kasia.go is a Go implementation of the Kasia templating system. Kasia is primarily designed for HTML, but you can use it for any

Mar 15, 2022
Package sanitize provides functions for sanitizing text in golang strings.

sanitize Package sanitize provides functions to sanitize html and paths with go (golang). FUNCTIONS sanitize.Accents(s string) string Accents replaces

Dec 5, 2022
Small and fast FTS (full text search)

Microfts A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as

Jul 30, 2022
text to speech bot for discord

text to speech bot for discord

Oct 1, 2022
A diff3 text merge implementation in Go

Diff3 A diff3 text merge implementation in Go based on the awesome paper below. "A Formal Investigation of Diff3" by Sanjeev Khanna, Keshav Kunal, and

Nov 5, 2022
gomtch - find text even if it doesn't want to be found

gomtch - find text even if it doesn't want to be found Do your users have clever ways to hide some terms from you? Sometimes it is hard to find forbid

Sep 28, 2022
Unified text diffing in Go (copy of the internal diffing packages the official Go language server uses)

gotextdiff - unified text diffing in Go This is a copy of the Go text diffing packages that the official Go language server gopls uses internally to g

Dec 26, 2022
Convert scanned image PDF file to text annotated PDF file

Jisui (自炊) This tool is PoC (Proof of Concept). Jisui is a helper tool to create e-book. Ordinary the scanned book have not text information, so you c

Dec 11, 2022
A modern text indexing library for go

bleve modern text indexing in go - blevesearch.com Features Index any go data structure (including JSON) Intelligent defaults backed up by powerful co

Jan 4, 2023
Paranoid text spacing in Go (Golang)

pangu.go Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

Oct 15, 2022
Diff, match and patch text in Go

go-diff go-diff offers algorithms to perform operations required for synchronizing plain text: Compare two texts and return their differences. Perform

Dec 25, 2022