A little like that j-thing, only in Go.

goquery - a little like that j-thing, only in Go

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.

Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.

Syntax-wise, it is as close as possible to jQuery, with the same function names when possible, and that warm and fuzzy chainable interface. jQuery being the ultra-popular library that it is, I felt that a similar HTML-manipulating library was better off following its API than starting anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (looking at you, index()...).

Installation

Please note that because of the net/html dependency, goquery requires Go1.1+.

$ go get github.com/PuerkitoBio/goquery

(optional) To run unit tests:

$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test

(optional) To run benchmarks (warning: it runs for a few minutes):

$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test -bench=".*"

Changelog

Note that goquery's API is now stable, and will not break.

  • 2021-01-11 (v1.6.1) : Fix panic when calling {Prepend,Append,Set}Html on a Selection that contains non-Element nodes.
  • 2020-10-08 (v1.6.0) : Parse html in context of the container node for all functions that deal with html strings (AfterHtml, AppendHtml, etc.). Thanks to @thiemok and @davidjwilkins for their work on this.
  • 2020-02-04 (v1.5.1) : Update module dependencies.
  • 2018-11-15 (v1.5.0) : Go module support (thanks @Zaba505).
  • 2018-06-07 (v1.4.1) : Add NewDocumentFromReader examples.
  • 2018-03-24 (v1.4.0) : Deprecate NewDocument(url) and NewDocumentFromResponse(response).
  • 2018-01-28 (v1.3.0) : Add ToEnd constant to Slice until the end of the selection (thanks to @davidjwilkins for raising the issue).
  • 2018-01-11 (v1.2.0) : Add AddBack* and deprecate AndSelf (thanks to @davidjwilkins).
  • 2017-02-12 (v1.1.0) : Add SetHtml and SetText (thanks to @glebtv).
  • 2016-12-29 (v1.0.2) : Optimize allocations for Selection.Text (thanks to @radovskyb).
  • 2016-08-28 (v1.0.1) : Optimize performance for large documents.
  • 2016-07-27 (v1.0.0) : Tag version 1.0.0.
  • 2016-06-15 : Invalid selector strings internally compile to a Matcher implementation that never matches any node (instead of a panic). So for example, doc.Find("~") returns an empty *Selection object.
  • 2016-02-02 : Add NodeName utility function similar to the DOM's nodeName property. It returns the tag name of the first element in a selection, and other relevant values of non-element nodes (see godoc for details). Add OuterHtml utility function similar to the DOM's outerHTML property (named OuterHtml in small caps for consistency with the existing Html method on the Selection).
  • 2015-04-20 : Add AttrOr helper method to return the attribute's value or a default value if absent. Thanks to piotrkowalczuk.
  • 2015-02-04 : Add more manipulation functions - Prepend* - thanks again to Andrew Stone.
  • 2014-11-28 : Add more manipulation functions - ReplaceWith*, Wrap* and Unwrap - thanks again to Andrew Stone.
  • 2014-11-07 : Add manipulation functions (thanks to Andrew Stone) and *Matcher functions, that receive compiled cascadia selectors instead of selector strings, thus avoiding potential panics thrown by goquery via cascadia.MustCompile calls. This results in better performance (selectors can be compiled once and reused) and more idiomatic error handling (you can handle cascadia's compilation errors, instead of recovering from panics, which had been bugging me for a long time). Note that the actual type expected is a Matcher interface, that cascadia.Selector implements. Other matcher implementations could be used.
  • 2014-11-06 : Change import paths of net/html to golang.org/x/net/html (see https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA). Make sure to update your code to use the new import path too when you call goquery with html.Nodes.
  • v0.3.2 : Add NewDocumentFromReader() (thanks jweir) which allows creating a goquery document from an io.Reader.
  • v0.3.1 : Add NewDocumentFromResponse() (thanks assassingj) which allows creating a goquery document from an http response.
  • v0.3.0 : Add EachWithBreak(), which allows breaking out of an Each() loop by returning false. This function was added instead of changing the existing Each() to avoid breaking compatibility.
  • v0.2.1 : Make go-getable, now that go.net/html is Go1.0-compatible (thanks to @matrixik for pointing this out).
  • v0.2.0 : Add support for negative indices in Slice(). BREAKING CHANGE Document.Root is removed, Document is now a Selection itself (a selection of one, the root element, just like Document.Root was before). Add jQuery's Closest() method.
  • v0.1.1 : Add benchmarks to use as baseline for refactorings, refactor Next...() and Prev...() methods to use the new html package's linked list features (Next/PrevSibling, FirstChild). Good performance boost (40+% in some cases).
  • v0.1.0 : Initial release.

API

goquery exposes two structs, Document and Selection, and the Matcher interface. Unlike jQuery, which is loaded as part of a DOM document, and thus acts on its containing document, goquery doesn't know which HTML document to act upon. So it needs to be told, and that's what the Document type is for. It holds the root document node as the initial Selection value to manipulate.

jQuery often has many variants for the same function (no argument, a selector string argument, a jQuery object argument, a DOM element argument, ...). Instead of exposing the same features in goquery as a single method with variadic empty interface arguments, statically-typed signatures are used following this naming convention:

  • When the jQuery equivalent can be called with no argument, it has the same name as jQuery for the no argument signature (e.g.: Prev()), and the version with a selector string argument is called XxxFiltered() (e.g.: PrevFiltered())
  • When the jQuery equivalent requires one argument, the same name as jQuery is used for the selector string version (e.g.: Is())
  • The signatures accepting a jQuery object as argument are defined in goquery as XxxSelection() and take a *Selection object as argument (e.g.: FilterSelection())
  • The signatures accepting a DOM element as argument in jQuery are defined in goquery as XxxNodes() and take a variadic argument of type *html.Node (e.g.: FilterNodes())
  • The signatures accepting a function as argument in jQuery are defined in goquery as XxxFunction() and take a function as argument (e.g.: FilterFunction())
  • The goquery methods that can be called with a selector string have a corresponding version that takes a Matcher interface and is defined as XxxMatcher() (e.g.: IsMatcher())

Utility functions that are not in jQuery but are useful in Go are implemented as functions (that take a *Selection as parameter), to avoid a potential naming clash on the *Selection's methods (reserved for jQuery-equivalent behaviour).

The complete godoc reference documentation can be found here.

Please note that Cascadia's selectors do not necessarily match all supported selectors of jQuery (Sizzle). See the cascadia project for details. Invalid selector strings compile to a Matcher that fails to match any node. Behaviour of the various functions that take a selector string as argument follows from that fact, e.g. (where ~ is an invalid selector string):

  • Find("~") returns an empty selection because the selector string doesn't match anything.
  • Add("~") returns a new selection that holds the same nodes as the original selection, because it didn't add any node (selector string didn't match anything).
  • ParentsFiltered("~") returns an empty selection because the selector string doesn't match anything.
  • ParentsUntil("~") returns all parents of the selection because the selector string didn't match any element to stop before the top element.

Examples

See some tips and tricks in the wiki.

Adapted from example_test.go:

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Request the HTML page.
  res, err := http.Get("http://metalsucks.net")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the band and title
    band := s.Find("a").Text()
    title := s.Find("i").Text()
    fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })
}

func main() {
  ExampleScrape()
}

Related Projects

  • Goq, an HTML deserialization and scraping library based on goquery and struct tags.
  • andybalholm/cascadia, the CSS selector library used by goquery.
  • suntong/cascadia, a command-line interface to the cascadia CSS selector library, useful to test selectors.
  • gocolly/colly, a lightning fast and elegant Scraping Framework
  • gnulnx/goperf, a website performance test tool that also fetches static assets.
  • MontFerret/ferret, declarative web scraping.
  • tacusci/berrycms, a modern simple to use CMS with easy to write plugins
  • Dataflow kit, Web Scraping framework for Gophers.
  • Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
  • Pagser, a simple, easy, extensible, configurable HTML parser to struct based on goquery and struct tags.
  • stitcherd, A server for doing server side includes using css selectors and DOM updates.

Support

There are a number of ways you can support the project:

  • Use it, star it, build something with it, spread the word!
    • If you do build something open-source or otherwise publicly-visible, let me know so I can add it to the Related Projects section!
  • Raise issues to improve the project (note: doc typos and clarifications are issues too!)
    • Please search existing issues before opening a new one - it may have already been addressed.
  • Pull requests: please discuss new code in an issue first, unless the fix is really trivial.
    • Make sure new code is tested.
    • Be mindful of existing code - PRs that break existing code have a high probability of being declined, unless they fix a serious issue.

If you desperately want to send money my way, I have a BuyMeACoffee.com page:

Buy Me A Coffee

License

The BSD 3-Clause license, the same as the Go language. Cascadia's license is here.

Comments
  • Creating Document from Large Response Takes Very Long

    The following code executes a form post which returns a very large HTML table (~40k rows). Do you have any suggestions for making this more performant? Building the document can take up to 10 minutes.

    timer := time.Now()
    
    resp, err := client.PostForm(runReportUrl, values)
    
    if err != nil {
        log.Println("Error posting form")
        panic(err)
    }
    
    log.Printf("Post Time: %s", time.Since(timer))
    
    contents, err := ioutil.ReadAll(resp.Body)
    
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    
    root, err := html.Parse(strings.NewReader(string(contents)))
    if err != nil {
        panic(err)
    }
    doc := goquery.NewDocumentFromNode(root)
    _ = doc // the document is ready to be queried from here on
    log.Printf("Building goquery document took: %s", time.Since(timer))
    
  • Get heading parent of a paragraph

    Hello! How can I get the heading above a paragraph?

    doc.Find("div#main-content.main-content").Each(func(i int, s *goquery.Selection) {
    		s.Find("p").Each(func(i int, s *goquery.Selection) {
    			if s.Text() == "" {
    				s.Remove()
    			}
    
    			
    		})
    	})
    

    I want to get the heading right above the p element

  • proxy

    Hi, how can i use a SOCKS5 proxy with your library? e.g:

     // Create a socks5 dialer
      dialer, err := proxy.SOCKS5("tcp", "127.0.0.1:9050", nil, proxy.Direct)
      if err != nil {
        log.Fatal(err)
      }
    
      // Setup HTTP transport
      tr := &http.Transport{
        Dial: dialer.Dial,
      }
      client := &http.Client{Transport: tr}
    
      res, err := client.Get("http://google.com")
    

    Also, how can I send a POST request to log in to a website?

    Thanks, Arnold

  • Manipulation Functions

    Go's net/html package added support for basic document manipulation. This change set adds support for jQuery's basic manipulation functions, excluding those that make no sense without a full renderer (css(), detach(), height(), etc).

    You'll notice that this includes variants of each function to accept a cascadia selector. I'm working on a way to cache compiled selectors, but I haven't yet finished it, and I wanted this pull request to stand by itself.

  • proposal: Struct Tags

    So I've been thinking for quite a long time that it would be awesome to be able to create structs with declarative tags for how an html document should be decoded into the struct. Much like one would do with the xml package, but using the syntax popularized by "that j-thing." 😄

    So, for example:

    type Page struct {
    	Resources []Resource `goquery:"#resources .resource"`
    }
    
    type Resource struct {
    	Name string `goquery:".name"`
    }
    
    func main() {
      var p Page
      var r io.Reader
      // Get an io.Reader somehow
      goquery.NewDecoder(r).Decode(&p)
    }
    

    Any thoughts? Would this be something I could create a PR for?

  • How to get text of a tag without text of children?

    Let's say I have this code:

    	<ul class="slimdemo-menu">
    		<li>
    			<a href="/upload" class="mdl-navigation__link"><i class="material-icons" role="presentation">file_upload</i>Upload</a>
    		</li>
    		<li>
    			<a href="/data" class="mdl-navigation__link"><i class="material-icons" role="presentation">check_circle</i>Data</a>
    		</li>
    		<li>
    			<a href="/info" class="mdl-navigation__link"><i class="material-icons" role="presentation">info</i>Info</a>
    		</li>
    	</ul>
    

    and I want to extract the text of each A-tag, but without the text of the I-tag. How can I do this?

    I know there is s.Text() but in the first A-tag, it gives me: file_uploadUpload and I only want Upload

  • Add implementation for unmarshaling html directly into annotated go structs

    The general principles I've tried to follow for the intuition and implementation are found in the documentation of the Unmarshal function. I figure we can move it around if necessary, or just link to it from the main package documentation.

    In general, the tag structure is built around:

    • A css selector
    • A comma-separated list of "value selectors" that determine what is used to create primitive values. These should be intuitively in-order with respect to the primitive types in a type spec.
      • e.g. type S struct { MyMap map[int]map[string]string `goquery:"#foo,[bar],[baz],html"` } should behave intuitively, specifying values from left to right. There should be an int-valued bar=... attribute, a string-valued baz=... attribute, and the terminal value would be the html contained underneath an element that matches [baz] as a selector.

    Let me know what feedback you have; this has grown a bit since my original proposal #151, but I really tried to keep the same clarity and update my naming conventions to use clearer, consistent names everywhere.

    This was fun to write. Looking forward to actually using it instead of just writing tests. 😄

  • NextAll and NextUntilSelection

    I have content like

    section(could be anything)
    h2
    p #p1(could be anything - strong,section,article,div)
    p #p2(could be anything - strong,section,article,div)
    h2
    p #p3(could be anything - strong,section,article,div)
    h2
    p #p4(could be anything - strong,section,article,div)
    

    I want to split the content so that each h2 and all the elements that follow it, up to the next h2, become one group.

    I iterated this way.

    	node.Find("h2").
    		Each(func(i int, s *goquery.Selection) {
    			//t := s.NextUntilSelection(s)
    			//t := s.NextFilteredUntilSelection(":not(h2)", s)
    			t := s.NextAll()
    			t.WrapAllHtml("<section></section>")
    		})
    

    I tried all of these approaches.

    Expectation: logically, I expected to pick up the whole set of unknown elements that follow each h2, up to the next h2. NextAll picks a single element (e.g. #p1 but not #p2). s.NextUntilSelection(s) selects only the first element after each h2, giving (h2, #p1), (h2, #p3), (h2, #p4).

     - <section>
    h2
    - <section (#p1,#p2)>
    h2
    - <section (#p3)>
    h2
    - <section (#p4)>
    
  • AppendHTML broken after v1.6.0

    I'm just trying to update a complicated codebase that uses goquery 1.5.1 to a more recent version and I've run into what is either a regression or possibly an intended breaking change that I don't quite understand.

    The test code is:

    package main
    
    import (
    	"fmt"
    	"log"
    	"strings"
    
    	"github.com/PuerkitoBio/goquery"
    )
    
    func main() {
    	// Load an empty HTML document
    	doc, err := goquery.NewDocumentFromReader(strings.NewReader("<body></body>"))
    	if err != nil {
    		log.Fatal(err)
    	}
    
    	// append a new container element
    	doc.AppendHtml(`<div id="normalized"></div>`)
    
    	// lookup the container element
    	normalized := doc.Find(`#normalized`)
    
    	// expect <div id="normalized"></div>
    	fmt.Println(goquery.OuterHtml(normalized))
    }
    

    In v1.5.1, I get the container div output correctly:

    <div id="normalized"></div>
    

    In v1.6.1 I get an empty string output.

    Any ideas?

  • Why is there no method that can find first element which match a selector?

    Sometimes we need to find an element by id. In general, only one element matches a specific id. Would a find-first method be more efficient, since it would not need to scan the whole document? I currently use doc.Find(id).First() to find the element; is that correct? Also, would it make sense to add a method that selects elements by XPath?

  • Pretty print output of .Html()

    Currently, there aren't any good pretty printers for html written in go. It would be nice if goquery could have the option of pretty printing the html.

  • Function to find the selector of a node

      // node is an *html.Node that is a descendant of doc.
      // If ok is true, sel holds a selector string such as `.sidebar-reviews article .content-block a`.
      sel, ok := doc.FindSelector(node)
    

    Could this function be possible?

  • <noscript> causes selector to fail

    Consider the following program:

    package main
    
    import (
    	"fmt"
    	"strings"
    
    	"github.com/PuerkitoBio/goquery"
    )
    
    const data = `<noscript><a href="http://example.org">click this link</a></noscript>`
    
    func main() {
    	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
    	if err != nil {
    		fmt.Println(err)
    		return
    	}
    	a, ok := d.Find("noscript a").Attr("href")
    	fmt.Printf("URL: '%s', %t\n", a, ok)
    }
    

    The expected output is:

    URL: 'http://example.org', true
    

    But instead the output is:

    URL: '', false
    

    Changing noscript to div in both the document and selector causes the expected output, so the problem seems to affect only <noscript> elements.
