A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library

goq


Example

package main

import (
	"log"
	"net/http"

	"astuart.co/goq"
)

// Structured representation of the GitHub file name table
type example struct {
	Title string `goquery:"h1"`
	Files []string `goquery:"table.files tbody tr.js-navigation-item td.content,text"`
}

func main() {
	res, err := http.Get("https://github.com/andrewstuart/goq")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var ex example
	
	err = goq.NewDecoder(res.Body).Decode(&ex)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(ex.Title, ex.Files)
}

Details

goq

import "astuart.co/goq"

Package goq was built to allow users to declaratively unmarshal HTML into Go structs using struct tags composed of CSS selectors.

I've made a best effort to behave very similarly to JSON and XML decoding, and to expose as much information as possible in the event of an error to help you debug your unmarshaling issues.

When creating struct types to be unmarshaled into, the following general rules apply:

  • Any type that implements the Unmarshaler interface will be passed a slice of *html.Node so that manual unmarshaling may be done. This takes the highest precedence.

  • Any struct fields may be annotated with goquery metadata, which takes the form of an element selector followed by arbitrary comma-separated "value selectors."

  • A value selector may be one of html, text, or [someAttrName]. html and text will result in the methods of the same name being called on the *goquery.Selection to obtain the value. [someAttrName] will result in *goquery.Selection.Attr("someAttrName") being called for the value.

  • A primitive value type will default to the text value of the resulting nodes if no value selector is given.

  • At least one value selector is required for maps, to determine the map key. The key type must follow both the rules applicable to Go map indexing and these unmarshaling rules. The value of each key will be unmarshaled in the same way the element value is unmarshaled.

  • For maps, keys will be retrieved from the same level of the DOM. The key selector may be arbitrarily nested, but the first level of children with any number of matching elements will be used.

  • For maps, any values must be nested below the level of the key selector. Parents or siblings of the element matched by the key selector will not be considered.

  • Once used, a "value selector" will be shifted off of the comma-separated list. This allows you to nest arbitrary levels of value selectors. For example, the type []map[string][]string would require one selector for the map key, and take an optional second selector for the values of the string slice.

  • Any struct type encountered in nested types (e.g. map[string]SomeStruct) will override any remaining "value selectors" that had not been used. For example, given:

    type S struct { F string `goquery:",[bang]"` }

    struct { T map[string]S `goquery:"#someId,[foo],[bar],[baz]"` }

[foo] will be used to determine the string map key, but [bar] and [baz] will be ignored, with the [bang] tag present in the S struct type taking precedence.
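The tag layout and the "shifted off" rule above can be illustrated with a small standard-library sketch. It is not goq's implementation: parsedTag, parseTag, and shift are hypothetical names invented here, and the naive split on commas is an assumption that would break for element selectors that themselves contain commas.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// parsedTag is a hypothetical helper type for this sketch; it is not part of goq.
type parsedTag struct {
	Selector string   // element selector (may be empty)
	Values   []string // remaining value selectors, in order
}

// parseTag splits a raw goquery tag into its element selector and the
// comma-separated value selectors that follow it.
func parseTag(raw string) parsedTag {
	parts := strings.Split(raw, ",")
	return parsedTag{Selector: parts[0], Values: parts[1:]}
}

// shift returns the first value selector and a tag with that selector removed.
// With no value selectors left it returns an empty string and the tag unchanged.
func (p parsedTag) shift() (string, parsedTag) {
	if len(p.Values) == 0 {
		return "", p
	}
	return p.Values[0], parsedTag{Selector: p.Selector, Values: p.Values[1:]}
}

// page uses the nested type from the rules above: one value selector for the
// map key, and an optional second one for the string slice values.
type page struct {
	Rows []map[string][]string `goquery:"table tr,[data-id],text"`
}

func main() {
	field, _ := reflect.TypeOf(page{}).FieldByName("Rows")
	tag := parseTag(field.Tag.Get("goquery"))
	fmt.Println("element selector:", tag.Selector) // element selector: table tr

	key, tag := tag.shift()
	fmt.Println("map key selector:", key) // map key selector: [data-id]

	val, tag := tag.shift()
	fmt.Println("slice value selector:", val) // slice value selector: text
	fmt.Println("remaining:", len(tag.Values)) // remaining: 0
}
```

Each nesting level consumes one value selector, which is why the order of the comma-separated list has to match the order in which the nested types are encountered.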

Usage

func NodeSelector

func NodeSelector(nodes []*html.Node) *goquery.Selection

NodeSelector is a quick utility function to get a goquery.Selection from a slice of *html.Node. It is useful when performing manual unmarshaling, since the decision was made to pass []*html.Node to Unmarshaler implementations for maximum flexibility.

func Unmarshal

func Unmarshal(bs []byte, v interface{}) error

Unmarshal takes a byte slice and a destination pointer to any interface{}, and unmarshals the document into the destination based on the rules above. Any error returned here will likely be of type CannotUnmarshalError, though an initial goquery error will pass through directly.

func UnmarshalSelection

func UnmarshalSelection(s *goquery.Selection, iface interface{}) error

UnmarshalSelection will unmarshal a goquery.Selection into an interface appropriately annotated with goquery tags.

type CannotUnmarshalError

type CannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

CannotUnmarshalError represents an error returned by the goquery Unmarshaler and helps consumers programmatically diagnose the cause of their error.

func (*CannotUnmarshalError) Error

func (e *CannotUnmarshalError) Error() string
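As a rough illustration of programmatic diagnosis, the sketch below pulls the exported fields out of an error with errors.As. To stay runnable without fetching the module, it uses a local stand-in type, cannotUnmarshalError, that only mirrors the fields documented above. The stand-in's Error format and the describe helper are invented for this example. With the real package you would presumably target *goq.CannotUnmarshalError instead, assuming the error is returned as that pointer type.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// cannotUnmarshalError is a local stand-in that mirrors the documented fields
// of goq.CannotUnmarshalError. It is not the library type.
type cannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

// Error uses a message format made up for this sketch.
func (e *cannotUnmarshalError) Error() string {
	return fmt.Sprintf("cannot unmarshal %q at %v: %v", e.Val, e.FldOrIdx, e.Err)
}

// describe is a hypothetical helper: it reports the failing field or index,
// the raw value, and the underlying cause when err is the stand-in type.
func describe(err error) string {
	var cue *cannotUnmarshalError
	if errors.As(err, &cue) {
		return fmt.Sprintf("field/index=%v value=%q cause=%v", cue.FldOrIdx, cue.Val, cue.Err)
	}
	return "other error: " + err.Error()
}

func main() {
	// Simulate a type conversion failure on an empty string value.
	_, convErr := strconv.Atoi("")
	var err error = &cannotUnmarshalError{Err: convErr, Val: "", FldOrIdx: "Score"}
	fmt.Println(describe(err))
}
```

FldOrIdx is an interface{} because, as its name suggests, it may hold either a field name or a slice index, so formatting it with %v is the safest generic choice.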

type Decoder

type Decoder struct {
}

Decoder implements the same API you will see in encoding/xml and encoding/json, except that we do not currently support proper streaming decoding, as it is not supported by goquery upstream.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder returns a new decoder given an io.Reader.

func (*Decoder) Decode

func (d *Decoder) Decode(dest interface{}) error

Decode will unmarshal the contents of the decoder when given an instance of an annotated type as its argument. It will return any errors encountered during either parsing the document or unmarshaling into the given object.

type Unmarshaler

type Unmarshaler interface {
	UnmarshalHTML([]*html.Node) error
}

Unmarshaler allows for custom implementations of unmarshaling logic.

TODO

  • Callable goquery methods with args, via reflection

Comments

  • fixed value func cache race condition

    This PR simply adds a mutex to guard the map for the caching of the value functions. I think it has negligible impact on performance, so a more sophisticated approach should not be needed.

  • race condition & crash

    Yesterday, we had a crash that seems to come down to a data race / concurrent map access in goq, so I ran our application after compiling it with the -race flag. It seems that the library regularly creates race conditions:

    WARNING: DATA RACE
    Read at 0x00c0001c30e0 by goroutine 137:
      runtime.mapaccess1_faststr()
          /usr/local/go/src/runtime/map_faststr.go:12 +0x0
      github.com/andrewstuart/goq.goqueryTag.valFunc()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:89 +0x85
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
      github.com/andrewstuart/goq.unmarshalSlice()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
      github.com/andrewstuart/goq.unmarshalStruct()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
      github.com/andrewstuart/goq.UnmarshalSelection()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc
    
    Previous write at 0x00c0001c30e0 by goroutine 44:
      runtime.mapassign_faststr()
          /usr/local/go/src/runtime/map_faststr.go:202 +0x0
      github.com/andrewstuart/goq.goqueryTag.valFunc()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:115 +0x296
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
      github.com/andrewstuart/goq.unmarshalSlice()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
      github.com/andrewstuart/goq.unmarshalStruct()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
      github.com/andrewstuart/goq.unmarshalByType()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
      github.com/andrewstuart/goq.UnmarshalSelection()
          /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc
    

    I'm currently investigating.

  • Fix bug in error where path does not fully show for non-pointers

    E.g. 'main.page.Items[0xc42019e318]' (type int): a type conversion error occurred: strconv.ParseInt: parsing "": invalid syntax

    when the real error should show the extra type info 'main.page.Items[0xc42019f048].Score' (type unknown: invalid value): a custom Unmarshaler implementation threw an error: strconv.ParseInt: parsing "": invalid syntax

  • fixed value function cache race condition

    This PR simply adds a mutex to guard the map for the caching of the value functions. I think it has negligible impact on performance, so a more sophisticated approach should not be needed.

  • Out of range panic in embedded map structures

    The panic stack trace is the following:

    panic: runtime error: index out of range
    
    goroutine 1 [running]:
    astuart.co/goq.goqueryTag.preprocess(0x7d907c, 0x13, 0xc42025be30, 0x7)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:40 +0x207
    astuart.co/goq.unmarshalStruct(0xc42025be30, 0x8569c0, 0xc420184090, 0x199, 0x8569c0, 0x8569c0)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:283 +0x1ac
    astuart.co/goq.unmarshalByType(0xc42025be30, 0x7cbd60, 0xc420184090, 0x16, 0x7c1091, 0x9, 0x0, 0x0)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
    astuart.co/goq.unmarshalMap.func1(0x0, 0xc42025be30, 0x1)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:405 +0x435
    github.com/PuerkitoBio/goquery.(*Selection).EachWithBreak(0xc42025ba70, 0xc4204e3658, 0x7c1091)
    	/n/gopath/src/github.com/PuerkitoBio/goquery/iteration.go:21 +0x10b
    astuart.co/goq.unmarshalMap(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0xc42000e028, 0x195)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:390 +0x385
    astuart.co/goq.unmarshalByType(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0x195, 0x7ffe20)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:208 +0x6ab
    astuart.co/goq.unmarshalStruct(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x80fea0, 0x80fea0)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:289 +0x245
    astuart.co/goq.unmarshalByType(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x0, 0x0, 0xc42000e028, 0x199)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
    astuart.co/goq.UnmarshalSelection(0xc42025ba40, 0x7cbda0, 0xc42000e028, 0x0, 0xc420400020)
    	/n/gopath/src/astuart.co/goq/unmarshal.go:180 +0x308
    astuart.co/goq.(*Decoder).Decode(0xc420400020, 0x7cbda0, 0xc42000e028, 0xc420784000, 0x2ad98)
    	/n/gopath/src/astuart.co/goq/decoder.go:37 +0xc4
    main.store(0xc42016e0c0)
    	/home/mester/twscrap/main.go:151 +0x1d3
    main.startCollecting(0xc42015a280)
    	/home/mester/twscrap/main.go:106 +0x458
    main.main()
    	/home/mester/twscrap/main.go:72 +0x1a4
    exit status 2
    

    The problem occurs with this structure type:

    type T struct {
        A string `goquery:",[second-id]"`
    }
    type A struct {
        B map[string]T `goquery:"div.id,[div-id]"`
    }
    
  • how does it compare with pure CSS selectors?

    Is there any reason why the goquery tags don't use pure CSS selectors? It seems some special rules are required (i.e. an element selector followed by arbitrary comma-separated "value selectors"). The documentation doesn't even mention whether the "selectors" are actually CSS(3) selectors, though from the source it looks like that's the case.

    I'm asking about this decision because just before finding your library I was thinking of developing something similar (i.e. a library that unmarshals pure CSS selectors, using cascadia, into /x/html.Node).

  • Module declares its path as: astuart.co/goq

    Command:

    GO111MODULE=on go get github.com/andrewstuart/goq@v1.0.0
    

    Outputs:

    go: finding github.com v1.0.0
    go: finding github.com/andrewstuart v1.0.0
    go: finding github.com/andrewstuart/goq v1.0.0
    go: downloading github.com/andrewstuart/goq v1.0.0
    go: extracting github.com/andrewstuart/goq v1.0.0
    go get: github.com/andrewstuart/goq@v1.0.0: parsing go.mod:
    	module declares its path as: astuart.co/goq
    	        but was required as: github.com/andrewstuart/goq
    

    I fixed it with:

    replace (
      github.com/andrewstuart/goq => astuart.co/goq v1.0.0
    )
    

    But maybe there is a way to fix it on the repo side?

  • Document selectors (issue?)

    I'm having an issue with selectors, and in general they're hard to deal with because they are documented neither here nor in goquery.

    I have this markup:

    image

    And I select:

    type Categorie struct {
    	Text string `goquery:"a,text"`
    	Link string `goquery:"a,[href]"`
    	Sub []Categorie `goquery:"ul"`
    }
    
    type Menu struct {
    	Categorie []Categorie `goquery:".menu-l1-li-hld li"`
    }
    

    I would expect text and href from links to return one of each, but I get a weird result where text appends every sub-text together, while href doesn't. Is it an issue with the lib?

    image

    Thanks
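The value function cache race reported in the comments above was addressed, according to the PR description, by guarding the cache map with a mutex. A minimal sketch of that general pattern follows. The names valueFunc, funcCache, newFuncCache, and get are hypothetical and are not taken from goq's source.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// valueFunc is a hypothetical stand-in for a cached per-selector function.
type valueFunc func(string) string

// funcCache guards a plain map with a mutex so that goroutines can share it
// without concurrent map reads and writes.
type funcCache struct {
	mu sync.Mutex
	m  map[string]valueFunc
}

func newFuncCache() *funcCache {
	return &funcCache{m: make(map[string]valueFunc)}
}

// get returns the cached function for key, building and storing it on first use.
// The lock is held for both the lookup and the insert.
func (c *funcCache) get(key string, build func() valueFunc) valueFunc {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f, ok := c.m[key]; ok {
		return f
	}
	f := build()
	c.m[key] = f
	return f
}

func main() {
	cache := newFuncCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f := cache.get("text", func() valueFunc { return strings.TrimSpace })
			_ = f("  hello  ")
		}()
	}
	wg.Wait()
	fmt.Println(len(cache.m)) // 1: the entry is built once and then reused
}
```

A sync.RWMutex or sync.Map would reduce contention on read-heavy workloads, but as the PR author notes, a plain mutex is likely sufficient when the critical section is this small.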
