htmlquery is a Go XPath package for querying HTML documents.

htmlquery

Overview

htmlquery is an XPath query package for HTML that lets you extract data from, or evaluate expressions against, HTML documents using XPath expressions.

htmlquery has a built-in, LRU-based query object cache that stores recently used XPath query strings. With query caching enabled, an XPath expression does not need to be recompiled on every query.

Installation

go get github.com/antchfx/htmlquery

Getting Started

Query the document; QueryAll returns the matched elements or an error.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load HTML document from URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load HTML document from file.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load HTML document from string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have href attribute.

list := htmlquery.Find(doc, "//a[@href]")	

Find all A elements with href attribute and only return href value.

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find the child element (img) under an A element and print its source.

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

Evaluate the number of all IMG elements.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

FAQ

Find() vs QueryAll(), which is better?

Find and QueryAll do the same thing: they search for all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. The QuerySelector and QuerySelectorAll methods accept your query expression object.

Caching (or reusing) a query expression object avoids recompiling the XPath expression and improves query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Changelogs

2019-11-19

  • Added a built-in query object cache to avoid recompiling the same query string. #16
  • Added LoadDoc. #18

2019-10-05

  • Added new methods that return an error for an invalid XPath expression: QueryAll and Query.
  • Added QuerySelector and QuerySelectorAll methods, which support reusing your query object.

2019-02-04

  • #7 Removed deprecated FindEach() and FindEachWithBreak() methods.

2018-12-28

  • Avoid adding duplicate elements to list for Find() method. #6

Tutorial

package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a == nil {
			continue // skip items without a link
		}
		fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
	}
}

List of supported XPath query packages

Name       Description
htmlquery  XPath query package for the HTML document
xmlquery   XPath query package for the XML document
jsonquery  XPath query package for the JSON document

Questions

Please let me know if you have any questions.

Owner
The open source web crawler framework project
Comments
  • `replace()` on a query doesn't seem to work

    package main
    
    import (
    	"fmt"
    	"strings"
    
    	"github.com/antchfx/htmlquery"
    )
    
    func main() {
    	s := `<html><a href="https://github.com/cashapp/hermit-build/releases/download/go-tools/stringer-v0.1.12-darwin-amd64.bz2">foo</a></html>`
    	doc, err := htmlquery.Parse(strings.NewReader(s))
    	if err != nil {
    		panic(err)
    	}
    	nodes, err := htmlquery.QueryAll(doc, `replace((//a[contains(@href, '/stringer-')])/@href, '^.*/stringer-v([^-]*)-.*$', '$1')`)
    	if err != nil {
    		panic(err)
    	}
    	for _, node := range nodes {
    		fmt.Println(htmlquery.OutputHTML(node, false))
    	}
    }
    

    On playground: https://go.dev/play/p/jxU6UgH0DnK

    The same content and query work fine on https://www.freeformatter.com/xpath-tester.html

    The above example without replace() works fine: https://go.dev/play/p/N22KULbkgRu

  • I am grabbing a form from an HTML page, but it also prints a garbage string

    I am grabbing a form from an HTML page, but this also prints a garbage string: "0xc000145ea0"

    &{0xc000144f50 0xc0001451f0 0xc0001452d0 3 input input [{ type email} { name email} { class form-control unicase-form-control text-input} { id exampleInputEmail1}]}
    &{0xc0001453b0 0xc000145650 0xc000145730 3 input input [{ type password} { name password} { class form-control unicase-form-control text-input} { id exampleInputPassword1}]}
    &{0xc000145ea0 0xc0001462a0 0xc000147ab0 0xc0001461c0 0xc000147b20 3 form form [{ class register-form outer-top-xs} { role form} { method post} { name register} { onsubmit return valid();}]}
    &{0xc000146310 0xc0001465b0 0xc000146690 3 input input [{ type text} { class form-control unicase-form-control text-input} { id fullname} { name fullname} { required required}]}
    &{0xc000146770 0xc000146a10 0xc000146af0 3 input input [{ type email} { class form-control unicase-form-control text-input} { id email} { onblur userAvailability()} { name emailid} { required }]}
    &{0xc000146cb0 0xc000146f50 0xc000147030 3 input input [{ type text} { class form-control unicase-form-control text-input} { id contactno} { name contactno} { maxlength 10} { required }]}
    &{0xc000147110 0xc0001473b0 0xc000147490 3 input input [{ type password} { class form-control unicase-form-control text-input} { id password} { name password} { required }]}
    &{0xc000147570 0xc000147810 0xc0001478f0 3 input input [{ type password} { class form-control unicase-form-control text-input} { id confirmpassword} { name confirmpassword} { required }]}

  • Xpath position function not working properly

    Hi there,

    First of all, thank you for the packages, they're very useful 🚀

    I've been having issues with the position function and I'm not sure if it's an issue with the htmlquery package or the xpath package, here's an example:

    const htmlSample = `<!DOCTYPE html><html lang="en-US">
    <head>
    <title>Hello,World!</title>
    </head>
    <body>
    <div class="test">
    	<a href="/test1">Test 1</a>
    </div>
    <div class="test">
    	<a href="/test2">Test 1</a>
    </div>
    <div class="test">
    	<a href="/test3">Test 1</a>
    </div>
    </body>
    </html>
    `
    
    func TestXPath(t *testing.T) {
    	list := Find(testDoc, "//div[@class=\"test\" and position()=1]//a/@href")
    	for _, n := range list {
    		fmt.Println(InnerText(n))
    	}
    }
    

    I would expect this to filter all the nodes that have the class test and have a position == 1, so only the first <a /> element. But instead, I get all the nodes. If I try position()=2 I get nothing back.

    If I instead use this xpath, it gives me the correct element:

    //div[@class=\"test\"][2]//a/@href
    

    If I try this on the browser it works, so I'm not sure if it is expected that it works this way here 🤔.

    What could be the problem? Thank you again!

  • Is it supposed to return the body node?

    For some reason htmlquery.Find(parse, "/html/body//*") returns the body node too. I've tested that using https://codebeautify.org/Xpath-Tester as well as $x("/html/body//*") in the browser console and it doesn't seem to include body nodes. What am I missing?

  • substring-after() is not being executed

    I tried a few expressions with substring-after() and it seems the function is not being executed at all. I tried debugging func.go: substringIndFunc is called and returns a callable, but that callable is never called.

    Example expression: substring-after(//span[@class="pageNumbersInfo"]//text(), "of ")

    Node: Pages 1 of 25

An (almost) compliant XPath 1.0 library.

xsel xsel is a library that (almost) implements the XPath 1.0 specification. The non-compliant bits are: xsel does not implement the id function. The

Dec 21, 2022
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

bluemonday bluemonday is a HTML sanitizer implemented in Go. It is fast and highly configurable. bluemonday takes untrusted user generated content as

Jan 4, 2023
Frongo is a Golang package to create HTML/CSS components using only the Go language.

Frongo Frongo is a Go tool to make HTML/CSS document out of Golang code. It was designed with readability and usability in mind, so HTML objects are c

Jul 29, 2021
A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library

goq Example import ( "log" "net/http" "astuart.co/goq" ) // Structured representation for github file name table type example struct { Title str

Dec 12, 2022
Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler

Pagser Pagser inspired by page parser。 Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and str

Dec 13, 2022
Golang HTML to plaintext conversion library

html2text Converts HTML into text of the markdown-flavored variety Introduction Ensure your emails are readable by all! Turns HTML into raw text, usef

Dec 28, 2022
golang program that simpily converts html into markdown

Simpily converts html to markdown Just a simple project I wrote in golang to convert html to markdown, surprisingly works decent for a lot of websites

Oct 23, 2021
yview is a lightweight, minimalist and idiomatic template library based on golang html/template for building Go web application.

wview wview is a lightweight, minimalist and idiomatic template library based on golang html/template for building Go web application. Contents Instal

Dec 5, 2021
Golang library for converting Markdown to HTML. Good documentation is included.

md2html is a golang library for converting Markdown to HTML. Install go get github.com/wallblog/md2html Example package main import( "github.com/wa

Jan 11, 2022
⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

html-to-markdown Convert HTML into Markdown with Go. It is using an HTML Parser to avoid the use of regexp as much as possible. That should prevent so

Jan 6, 2023
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

Dec 19, 2022
Templating system for HTML and other text documents - go implementation

FAQ What is Kasia.go? Kasia.go is a Go implementation of the Kasia templating system. Kasia is primarily designed for HTML, but you can use it for any

Mar 15, 2022
Take screenshots of websites and create PDF from HTML pages using chromium and docker

gochro is a small docker image with chromium installed and a golang based webserver to interact wit it. It can be used to take screenshots of w

Nov 23, 2022
export stripTags from html/template as strip.StripTags

HTML StripTags for Go This is a Go package containing an extracted version of the unexported stripTags function in html/template/html.go. ⚠️ This pack

Dec 4, 2022
network .md into .html with plaintext files

plain network markdown files into html with plaintext files plain is a static-site generator operating on plaintext files containing a small set of co

Dec 10, 2022
Simple Markdown to Html converter in Go.

Markdown To Html Converter Simple Example package main import ( "github.com/gopherzz/MTDGo/pkg/lexer" "github.com/gopherzz/MTDGo/pkg/parser" "fm

Jan 29, 2022
This command line converts thuderbird's exported RSS .eml file to .html file

thunderbird-rss-html This command line tool converts .html to .epub with images fetching. Install > go get github.com/gonejack/thunderbird-rss-html Us

Dec 15, 2021
Develop Sites Faster with HTML-Includer!

HTML Includer Develop Sites Faster with HTML Includer! How to Install Install HTML Includer on your machine: go install github.com/GameWorkstore/html-

Jan 1, 2022
HTML, CSS and SVG static renderer in pure Go

Web render This module implements a static renderer for the HTML, CSS and SVG formats. It consists for the main part of a Golang port of the awesome W

Apr 19, 2022