
Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the Wikipedia article on URL normalization and on RFC 3986.


Install

go get github.com/PuerkitoBio/purell

Changelog

  • v1.1.1 : Fix failing test due to Go1.12 changes (thanks to @ianlancetaylor).
  • 2016-11-14 (v1.1.0) : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
  • 2016-07-27 (v1.0.0) : Normalize IDN to ASCII (thanks to @zenovich).
  • 2015-02-08 : Add fix for relative paths issue (PR #5) and add fix for unnecessary encoding of reserved characters (see issue #7).
  • v0.2.0 : Add benchmarks, Attempt IDN support.
  • v0.1.0 : Initial release.

Examples

From example_test.go (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its methods and constants with "purell."):

package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}

API

As seen in the examples above, purell offers three functions: NormalizeURLString(string, NormalizationFlags) (string, error), MustNormalizeURLString(string, NormalizationFlags) string and NormalizeURL(*url.URL, NormalizationFlags) string. They all normalize the provided URL based on the specified flags. Here are the available flags:

const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)

For convenience, the sets FlagsSafe, FlagsUsuallySafe[Greedy|NonGreedy], FlagsUnsafe[Greedy|NonGreedy] and FlagsAll[Greedy|NonGreedy] are provided, matching the similarly grouped normalizations on Wikipedia's URL normalization page. You can add (using the bitwise OR | operator) or remove (using the bitwise AND NOT &^ operator) individual flags from these sets to build your own custom set.
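The flag-set arithmetic can be sketched with the same iota pattern purell uses. The constants below are local re-declarations for illustration only, not imports from the package:

```go
package main

import "fmt"

// NormalizationFlags mirrors purell's iota-based bitmask pattern;
// these declarations are a local sketch, not the package's own.
type NormalizationFlags uint64

const (
	FlagLowercaseScheme NormalizationFlags = 1 << iota
	FlagLowercaseHost
	FlagUppercaseEscapes
	FlagRemoveFragment
)

func main() {
	// Build a custom set: start from a convenience set, add one flag
	// with |, remove another with the bitwise AND NOT (&^) operator.
	base := FlagLowercaseScheme | FlagLowercaseHost | FlagUppercaseEscapes
	custom := (base | FlagRemoveFragment) &^ FlagUppercaseEscapes

	fmt.Println(custom&FlagRemoveFragment != 0)   // true: flag added
	fmt.Println(custom&FlagUppercaseEscapes != 0) // false: flag removed
}
```

The same two operators work unchanged on purell's real NormalizationFlags constants.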

The full godoc reference is available on gopkgdoc.

Some things to note:

  • FlagDecodeUnnecessaryEscapes, FlagEncodeNecessaryEscapes, FlagUppercaseEscapes and FlagRemoveEmptyQuerySeparator are always implicitly set, because internally the URL string is parsed into a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary ? at the end of the URL). So these normalizations cannot be skipped. For this reason, FlagRemoveEmptyQuerySeparator (as well as the other three) has been included in the FlagsSafe convenience set, instead of FlagsUnsafe, where Wikipedia puts it.

  • FlagDecodeUnnecessaryEscapes decodes the following escapes (from -> to):
    - %24 -> $
    - %26 -> &
    - %2B-%3B -> +,-./0123456789:;
    - %3D -> =
    - %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
    - %5F -> _
    - %61-%7A -> abcdefghijklmnopqrstuvwxyz
    - %7E -> ~

  • When the NormalizeURL function is used (passing a URL object), the source URL object is modified in place (that is, after the call, the URL object reflects the normalization).

  • The "replace IP with domain name" normalization (http://208.77.188.166/ → http://www.example.com/) is obviously not possible for a library without making network requests. It is not implemented in purell.

  • The "remove unused query string parameters" and "remove default query parameters" normalizations are also not implemented, since they are very case-specific, and quite trivial to do with a URL object.
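A sketch of that case-specific cleanup using only net/url (stripParams and the parameter names are illustrative, not part of purell):

```go
package main

import (
	"fmt"
	"net/url"
)

// stripParams removes the named query parameters from a URL, the kind
// of case-specific normalization purell deliberately leaves to callers.
func stripParams(rawurl string, params ...string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	q := u.Query()
	for _, p := range params {
		q.Del(p)
	}
	// Encode sorts the remaining keys, which also normalizes ordering.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := stripParams("http://host/path?id=42&utm_source=rss&ref=tw", "utm_source", "ref")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // http://host/path?id=42
}
```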

Safe vs Usually Safe vs Unsafe

Purell lets you control the level of risk you take while normalizing a URL. You can normalize aggressively, play it totally safe, or anything in between.

Consider the following URL:

HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

Normalizing with FlagsSafe gives:

https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

With FlagsUsuallySafeGreedy:

https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid

And with FlagsUnsafeGreedy:

http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3

TODOs

  • Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes (^|/)((?:default|index)\.\w{1,4})$.

Thanks / Contributions

@rogpeppe @jehiah @opennota @pchristopher1275 @zenovich @beeker1121

License

The BSD 3-Clause license.

Comments
  • This patch allows FlagRemoveDotSegments to work with relative links --

    i.e. URLs that have the Host field set to the empty string. Just to be clear, relative links will typically be stored in URL values if the user expects to later call URL.ResolveReference to create a full URL. That seems to be a relatively common use case.

  • Add FlagRemoveQuery (strips entire query part)

    I won't be at all offended if you think this flag has no place in purell, but it's one I would use almost exclusively :-)

    context: I'm dealing with lots of news article URLs, and the unique part of the URL is almost always a slug, and the query used only for tracking (eg ?from=rssfeed ). Stripping the query is always my first step when normalizing URLs for comparison.

    eg (apologies for dailymail link): http://www.dailymail.co.uk/news/article-3656841/It-s-Democrats-end-25-hour-sit-Congress-gun-control-walk-cheering-crowd-without-vote-demanded.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490

    Just one of the many frustrations when trying to decide if two URLs are referring to the same article - there are loads more. Newspaper sites can commit some real atrocities when dealing with URLs!
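Until such a flag exists, the proposed behavior is a few lines with net/url (removeQuery is a hypothetical helper, not part of purell):

```go
package main

import (
	"fmt"
	"net/url"
)

// removeQuery emulates the proposed FlagRemoveQuery by clearing the
// query component entirely, e.g. before comparing article URLs.
func removeQuery(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false // also drop a bare trailing "?"
	return u.String(), nil
}

func main() {
	s, err := removeQuery("http://www.dailymail.co.uk/news/article.html?ITO=1490&ns_mchannel=rss")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // http://www.dailymail.co.uk/news/article.html
}
```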

  • Add flag to force the default scheme

    Hi, I'd like a flag to force the default scheme (eg. http) when no scheme was given.

    str, _ := purell.NormalizeURLString("example.com/foo.html", purell.FlagsForceDefaultHttpScheme)
    
    if str == "http://example.com/foo.html" {
      // cool, I have "http" default scheme set now
    }
    

    Would it make sense? If yes, I'd be glad to submit a patch.
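A stdlib sketch of the requested behavior (FlagsForceDefaultHttpScheme does not exist in purell; ensureScheme below is hypothetical, and its scheme check is deliberately naive, ignoring forms like mailto:):

```go
package main

import (
	"fmt"
	"strings"
)

// ensureScheme prepends a default scheme when none is present, a
// stand-in for the proposed "force default scheme" flag.
func ensureScheme(rawurl, scheme string) string {
	if strings.Contains(rawurl, "://") {
		return rawurl // already has a scheme; leave it alone
	}
	return scheme + "://" + rawurl
}

func main() {
	fmt.Println(ensureScheme("example.com/foo.html", "http")) // http://example.com/foo.html
}
```

The result can then be fed to NormalizeURLString as usual.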

  • Opaque URLs should be normalized too

    purell doesn't normalize "opaque" URLs (see documentation for url.URL).

    package main
    
    import (
        "fmt"
        "net/url"
    
        "github.com/PuerkitoBio/purell"
    )
    
    func main() {
        u := &url.URL{Scheme: "http", Opaque: "//eXAMPLe.com/%3f"}
        fmt.Println(purell.NormalizeURL(u, purell.FlagLowercaseHost|purell.FlagUppercaseEscapes))
    }
    

    Output: http://eXAMPLe.com/%3f

  • Pull code from urlesc repo

    Pulled code from the urlesc repo because that repo has been archived. I could remove QueryEscape, but simply replacing urlesc.Escape with url.String changes the current behavior, so I'm keeping urlesc for now.

    Signed-off-by: Yuki Okushi [email protected]

  • Make normalizing functions public for external consumption

    I'd like to be able to use some of the normalizing functions in external projects without needing purell flags and the NormalizeURL*() functions.

    Would you be OK with this change?

  • Reserved characters should not be percent-encoded

    purell should not normalize reserved characters, as per RFC3986:

      reserved    = gen-delims / sub-delims
    
      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    
      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
    
    package main
    
    import (
        "fmt"
    
        "github.com/PuerkitoBio/purell"
    )
    
    func main() {
        fmt.Println(purell.MustNormalizeURLString("my_(url)", purell.FlagsSafe))
    }
    

    The above code outputs my_%28url%29, whereas it should be my_(url). This is due to a bug in Go stdlib (issue 5684).

  • more tests, and some bugs

    I've been interested in URL normalization for quite some time (as i should be since i work at bitly). As a result I've written code to do this in python, tcl, and written various incomplete bits of code that dreams of doing this right. So, i know where you are coming from.

    As i've been using golang more recently, i was happy to see your project, and hope that this will be one less thing that i'll need to do. To that end, i'm contributing test cases that i've found useful in helping improve my python library.

    It's not all that many test cases, but it highlights a few things I would love to see

    • IDNA normalization
    • percent unescaping (where possible) to unicode values
    • remove trailing dots from domains

    beyond that, there are a few cases I think highlight bugs.

    Additionally, while i appreciate the configurability of some of the flags, it would be nice to see one function that just does the "right" thing up until some of the more editorial items (like dropping www, trailing slash, etc). It might just be me, but as a user that's what I'd like to see.

    It might be odd to open a pull request that just adds failing tests, but it's all i have time for tonight. when i get around to needing this in a project, i'll undoubtedly open up additional pull requests with some fixes.

    (also, i'm lazy, so i added a .travis.yml file and a badge to the readme; you should be able to go to travis-ci.org and turn on automated tests)

  • A problem with the "golang.org/x" packages in purell.go

    In file 'purell.go', some packages are imported as: "golang.org/x/net/idna", "golang.org/x/text/unicode/norm" and "golang.org/x/text/width"

    however, I find their real download URLs are: https://github.com/golang/net and https://github.com/golang/text

    so after I download them using the 'go get' command, by default, paths 'github.com/golang/net' and 'github.com/golang/text' are generated in my $GOPATH directory. I have to copy them into 'golang.org/x', which is inconvenient, I think.

    why not import them directly as: "github.com/golang/net/idna" "github.com/golang/text/unicode/norm" "github.com/golang/text/width" ?

  • go1.1 test error

    purell_test.go:680: running LowerScheme...
    purell_test.go:680: running LowerScheme2...
    purell_test.go:680: running LowerHost...
    purell_test.go:696: LowerHost - FAIL expected 'HTTP://www.src.ca/', got 'http://www.src.ca/'
    purell_test.go:680: running UpperEscapes...
    purell_test.go:680: running UnnecessaryEscapes...
    purell_test.go:680: running RemoveDefaultPort...
    purell_test.go:696: RemoveDefaultPort - FAIL expected 'HTTP://www.SRC.ca/', got 'http://www.SRC.ca/'
    purell_test.go:680: running RemoveDefaultPort2...
    purell_test.go:696: RemoveDefaultPort2 - FAIL expected 'HTTP://www.SRC.ca', got 'http://www.SRC.ca'
    purell_test.go:680: running RemoveDefaultPort3...
    purell_test.go:696: RemoveDefaultPort3 - FAIL expected 'HTTP://www.SRC.ca:8080', got 'http://www.SRC.ca:8080'
    .............

  • Looking for a new maintainer

    Hello,

    After more than 8 years (according to the git logs), I'd like to hand over maintenance of this library. I don't use it personally, and haven't dedicated much care and attention to it in a long time. It would be best for someone with an interest in it (as in, someone that relies on this library as part of their project(s)) to take over.

    I believe the Hugo project uses this? If so, that could be a good fit, but anyone interested, please reach out!

    Thanks, Martin

  • FlagRemoveUnnecessaryHostDots doesn't remove extra dots between parts

    It seems like FlagRemoveUnnecessaryHostDots only removes extra leading and trailing dots in the hostname part of the URL. For instance a URL like "http://host..com" doesn't get the extra dots removed.
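A workaround until this is handled in purell: squash interior dot runs yourself (collapseHostDots is a hypothetical helper, not purell behavior):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// collapseHostDots squashes repeated interior dots in the host, which
// FlagRemoveUnnecessaryHostDots leaves alone, and trims leading/trailing ones.
func collapseHostDots(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	for strings.Contains(u.Host, "..") {
		u.Host = strings.ReplaceAll(u.Host, "..", ".")
	}
	u.Host = strings.Trim(u.Host, ".")
	return u.String(), nil
}

func main() {
	s, err := collapseHostDots("http://host..com")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // http://host.com
}
```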

  • NormalizeURL does not perform IDNA normalization

    IDNA normalization is only performed in NormalizeURLString, but that function returns a string. If you need a url.URL, you must then parse the result of NormalizeURLString, which means you are parsing the URL yet again, which is wasteful.

    As NormalizeURLString calls NormalizeURL, the IDNA normalization in the former should be moved to the latter, so that the URL passed to NormalizeURL has its host field IDNA-normalized.
