htmlquery is a Go XPath package for querying HTML.

Last update: Jan 4, 2023

Comments: 5

htmlquery

Overview

htmlquery is an XPath query package for HTML that lets you extract data from HTML documents, or evaluate values against them, using XPath expressions.

htmlquery has a built-in query object cache based on LRU, which caches recently used XPath query strings. Enabling query caching avoids recompiling the XPath expression on every query.

Installation

go get github.com/antchfx/htmlquery

Getting Started

Query the document; returns the matched elements or an error.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load HTML document from URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load HTML document from a file.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load HTML document from string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have `href` attribute.

list := htmlquery.Find(doc, "//a[@href]")

Find all A elements with `href` attribute and only return `href` value.

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find the child element (img) under an A element and print its source.

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

Count all IMG elements.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

FAQ

`Find()` vs `QueryAll()`, which is better?

Find and QueryAll do the same thing: both search for all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. We offer the QuerySelector and QuerySelectorAll methods, which accept your query expression object.

Caching (or reusing) a query expression object avoids recompiling the XPath query expression and improves query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Changelogs

2019-11-19

Add built-in query object cache feature to avoid recompiling the same query string. #16
Added LoadDoc. #18

2019-10-05

Add new methods that return an error for an invalid XPath expression: QueryAll and Query.
Add QuerySelector and QuerySelectorAll methods, which support reusing your query object.

2019-02-04

#7 Removed deprecated FindEach() and FindEachWithBreak() methods.

2018-12-28

Avoid adding duplicate elements to list for Find() method. #6

Tutorial

package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue // skip list items that contain no link
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}

List of supported XPath query packages

Name	Description
htmlquery	XPath query package for the HTML document
xmlquery	XPath query package for the XML document
jsonquery	XPath query package for the JSON document

Questions

Please let me know if you have any questions.

Owner

The open source web crawler framework project

https://github.com/antchfx/htmlquery https://github.com/antchfx/xpath

Comments

`replace()` on a query doesn't seem to work

package main

import (
	"fmt"
	"strings"

	"github.com/antchfx/htmlquery"
)

func main() {
	s := `<html><a href="https://github.com/cashapp/hermit-build/releases/download/go-tools/stringer-v0.1.12-darwin-amd64.bz2">foo</a></html>`
	doc, err := htmlquery.Parse(strings.NewReader(s))
	if err != nil {
		panic(err)
	}
	nodes, err := htmlquery.QueryAll(doc, `replace((//a[contains(@href, '/stringer-')])/@href, '^.*/stringer-v([^-]*)-.*$', '$1')`)
	if err != nil {
		panic(err)
	}
	for _, node := range nodes {
		fmt.Println(htmlquery.OutputHTML(node, false))
	}
}

On playground: https://go.dev/play/p/jxU6UgH0DnK The same content+query works fine on https://www.freeformatter.com/xpath-tester.html

The above example without replace() works fine: https://go.dev/play/p/N22KULbkgRu

I am grabbing a form from an HTML page, but it is also printing a garbage string "0xc000145ea0"

&{0xc000144f50 0xc0001451f0 0xc0001452d0 3 input input [{ type email} { name email} { class form-control unicase-form-control text-input} { id exampleInputEmail1}]}
&{0xc0001453b0 0xc000145650 0xc000145730 3 input input [{ type password} { name password} { class form-control unicase-form-control text-input} { id exampleInputPassword1}]}
&{0xc000145ea0 0xc0001462a0 0xc000147ab0 0xc0001461c0 0xc000147b20 3 form form [{ class register-form outer-top-xs} { role form} { method post} { name register} { onsubmit return valid();}]}
&{0xc000146310 0xc0001465b0 0xc000146690 3 input input [{ type text} { class form-control unicase-form-control text-input} { id fullname} { name fullname} { required required}]}
&{0xc000146770 0xc000146a10 0xc000146af0 3 input input [{ type email} { class form-control unicase-form-control text-input} { id email} { onblur userAvailability()} { name emailid} { required }]}
&{0xc000146cb0 0xc000146f50 0xc000147030 3 input input [{ type text} { class form-control unicase-form-control text-input} { id contactno} { name contactno} { maxlength 10} { required }]}
&{0xc000147110 0xc0001473b0 0xc000147490 3 input input [{ type password} { class form-control unicase-form-control text-input} { id password} { name password} { required }]}
&{0xc000147570 0xc000147810 0xc0001478f0 3 input input [{ type password} { class form-control unicase-form-control text-input} { id confirmpassword} { name confirmpassword} { required }]}

XPath position function not working properly

Hi there,

First of all, thank you for the packages, they're very useful 🚀

I've been having issues with the position function and I'm not sure if it's an issue with the htmlquery package or the xpath package, here's an example:

const htmlSample = `<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="test">
<a href="/test1">Test 1</a>
</div>
<div class="test">
<a href="/test2">Test 1</a>
</div>
<div class="test">
<a href="/test3">Test 1</a>
</div>
</body>
</html>
`

func TestXPath(t *testing.T) {
	list := Find(testDoc, "//div[@class=\"test\" and position()=1]//a/@href")
	for _, n := range list {
		fmt.Println(InnerText(n))
	}
}

I would expect this to filter all the nodes that have the class test and have a position == 1, so only the first <a /> element. But instead, I get all the nodes. If I try position()=2 I get nothing back.

If I instead use this xpath, it gives me the correct element:

//div[@class=\"test\"][2]//a/@href

If I try this on the browser it works, so I'm not sure if it is expected that it works this way here 🤔.

What could be the problem? Thank you again!

Is it supposed to return the body node?

For some reason htmlquery.Find(parse, "/html/body//*") returns the body node too. I've tested that using https://codebeautify.org/Xpath-Tester as well as $x("/html/body//*") in the browser console and it doesn't seem to include body nodes. What am I missing?

substring-after() is not being executed

I tried a few expressions with substring-after() and it seems the function is not being executed at all. I tried to debug func.go: substringIndFunc is being called and returns a callable, which is never called though.

Example expression: substring-after(//span[@class="pageNumbersInfo"]//text(), "of ")
Node: Pages 1 of 25
