
Pagser


The name Pagser is inspired by "page parser".

Pagser is a simple, extensible, and configurable library that parses and deserializes HTML pages into structs, based on goquery and struct tags, for Go crawlers.

Install

go get -u github.com/foolin/pagser

Or get a specific version:

go get github.com/foolin/pagser@{version}

Available values for {version} are listed at: https://github.com/foolin/pagser/releases

Features

  • Simple - Uses Go struct tag syntax.
  • Easy - Easy to use in your spider/crawler/colly application.
  • Extensible - Supports custom extension functions.
  • Struct tag grammar - The grammar is simple, e.g. `pagser:"a->attr(href)"`.
  • Nested structures - Supports nested structs for child nodes.
  • Configurable - Supports custom configuration.
  • Implicit type conversion - Parsed result strings are automatically converted to int, int64, float64, and other types.
  • GoQuery/Colly - Works with any goquery-based project, such as go-colly.

Docs

See the Pagser API documentation: https://pkg.go.dev/github.com/foolin/pagser

Usage

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
)

const rawPageHtml = `
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Pagser Title</title>
	<meta name="keywords" content="golang,pagser,goquery,html,page,parser,colly">
</head>

<body>
	<h1>H1 Pagser Example</h1>
	<div class="navlink">
		<div class="container">
			<ul class="clearfix">
				<li id=''><a href="/">Index</a></li>
				<li id='2'><a href="/list/web" title="web site">Web page</a></li>
				<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
				<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
			</ul>
		</div>
	</div>
</body>
</html>
`

type PageData struct {
	Title    string   `pagser:"title"`
	Keywords []string `pagser:"meta[name='keywords']->attrSplit(content)"`
	H1       string   `pagser:"h1"`
	Navs     []struct {
		ID   int    `pagser:"->attrEmpty(id, -1)"`
		Name string `pagser:"a->text()"`
		Url  string `pagser:"a->attr(href)"`
	} `pagser:".navlink li"`
}

func main() {
	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err := p.Parse(&data, rawPageHtml)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


Page data json: 
-------------
{
	"Title": "Pagser Title",
	"Keywords": [
		"golang",
		"pagser",
		"goquery",
		"html",
		"page",
		"parser",
		"colly"
	],
	"H1": "H1 Pagser Example",
	"Navs": [
		{
			"ID": -1,
			"Name": "Index",
			"Url": "/"
		},
		{
			"ID": 2,
			"Name": "Web page",
			"Url": "/list/web"
		},
		{
			"ID": 3,
			"Name": "Pc Page",
			"Url": "/list/pc"
		},
		{
			"ID": 4,
			"Name": "Mobile Page",
			"Url": "/list/mobile"
		}
	]
}
-------------

Configuration

type Config struct {
	TagName    string //struct tag name, default is `pagser`
	FuncSymbol string //function symbol, default is `->`
	Debug      bool   //debug mode, prints some logs, default is `false`
}

Struct Tag Grammar

[goquery selector]->[function]

Example:

type ExamData struct {
	Herf string `pagser:".navLink li a->attr(href)"`
}

1. Struct tag name: pagser
2. goquery selector: .navLink li a
3. Function symbol: ->
4. Function name: attr
5. Function argument: href


Functions

Builtin functions

  • text() gets the element text; returns string. This is the default function when no function is given in the struct tag.
  • eachText() gets each element's text; returns []string.
  • html() gets the element's inner HTML; returns string.
  • eachHtml() gets each element's inner HTML; returns []string.
  • outerHtml() gets the element's outer HTML; returns string.
  • eachOutHtml() gets each element's outer HTML; returns []string.
  • attr(name) gets the value of the element attribute name; returns string.
  • eachAttr(name) gets the value of the attribute name for each element; returns []string.
  • attrSplit(name, sep) gets the attribute value and splits it by the separator; returns []string.
  • attr('value') gets the element's value attribute, e.g. for `<input value='xxx' />` it returns "xxx".
  • textSplit(sep) gets the element text and splits it by the separator; returns []string.
  • eachTextJoin(sep) gets each element's text and joins them with the separator; returns string.
  • eq(index) reduces the set of matched elements to the one at the specified index; returns a Selection for nested structs.
  • ...

More builtin functions are documented at: https://pkg.go.dev/github.com/foolin/pagser?tab=doc#BuiltinFunctions

Extension functions

  • Markdown() converts HTML to Markdown format.
  • UgcHtml() sanitizes HTML.

Extension functions must be registered before use, like:

import "github.com/foolin/pagser/extensions/markdown"

p := pagser.New()

//Register Markdown
markdown.Register(p)

Custom function

Function interface

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define global function

// Global functions must be registered via pagser.RegisterFunc("MyGlob", MyGlobalFunc) before use.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Global-" + node.Text(), nil
}

type PageData struct{
  MyGlobalValue string    `pagser:"->MyGlob()"`
}

func main(){

    p := pagser.New()

    //Register global function `MyGlob`
    p.RegisterFunc("MyGlob", MyGlobalFunc)

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Define struct function

type PageData struct{
  MyFuncValue string    `pagser:"->MyFunc()"`
}

// This method is called automatically; no registration is needed.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}


func main(){

    p := pagser.New()

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Call Syntax

Note: all function arguments are strings, and single quotes around them are optional.

  1. Function call with no arguments

->fn()

  2. Function call with one argument (single quotes are optional)

->fn(one)

->fn('one')

  3. Function call with multiple arguments

->fn(one, two, three, ...)

->fn('one', 'two', 'three', ...)

  4. Function call with single quotes and escape characters

->fn('it\'s ok', 'two,xxx', 'three', ...)

Priority Order

Lookup function priority order:

struct method -> parent method -> ... -> global

More Examples

See advance example: https://github.com/foolin/pagser/tree/master/_examples/advance

Implicit type conversion

Parsed result strings are automatically converted to int, int64, float64, and other types.

Supported types:

  • bool
  • float32
  • float64
  • int
  • int32
  • int64
  • string
  • []bool
  • []float32
  • []float64
  • []int
  • []int32
  • []int64
  • []string

Examples

Crawl page example

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
	"net/http"
)

type PageData struct {
	Title    string `pagser:"title"`
	RepoList []struct {
		Names       []string `pagser:"h1->textSplit('/', true)"`
		Description string   `pagser:"h1 + p"`
		Stars       string   `pagser:"a.muted-link->eqAndText(0)"`
		Repo        string   `pagser:"h1 a->attrConcat('href', 'https://github.com', $value, '?from=pagser')"`
	} `pagser:"article.Box-row"`
}

func main() {
	resp, err := http.Get("https://github.com/trending")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err = p.ParseReader(&data, resp.Body)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


2020/04/25 12:26:04 Page data json: 
-------------
{
	"Title": "Trending  repositories on GitHub today · GitHub",
	"RepoList": [
		{
			"Names": [
				"pcottle",
				"learnGitBranching"
			],
			"Description": "An interactive git visualization to challenge and educate!",
			"Stars": "16,010",
			"Repo": "https://github.com/pcottle/learnGitBranching?from=pagser"
		},
		{
			"Names": [
				"jackfrued",
				"Python-100-Days"
			],
			"Description": "Python - 100天从新手到大师",
			"Stars": "83,484",
			"Repo": "https://github.com/jackfrued/Python-100-Days?from=pagser"
		},
		{
			"Names": [
				"brave",
				"brave-browser"
			],
			"Description": "Next generation Brave browser for macOS, Windows, Linux, Android.",
			"Stars": "5,963",
			"Repo": "https://github.com/brave/brave-browser?from=pagser"
		},
		{
			"Names": [
				"MicrosoftDocs",
				"azure-docs"
			],
			"Description": "Open source documentation of Microsoft Azure",
			"Stars": "3,798",
			"Repo": "https://github.com/MicrosoftDocs/azure-docs?from=pagser"
		},
		{
			"Names": [
				"ahmetb",
				"kubectx"
			],
			"Description": "Faster way to switch between clusters and namespaces in kubectl",
			"Stars": "6,979",
			"Repo": "https://github.com/ahmetb/kubectx?from=pagser"
		},

        //...        

		{
			"Names": [
				"serverless",
				"serverless"
			],
			"Description": "Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \u0026 more! –",
			"Stars": "35,502",
			"Repo": "https://github.com/serverless/serverless?from=pagser"
		},
		{
			"Names": [
				"vuejs",
				"vite"
			],
			"Description": "Experimental no-bundle dev server for Vue SFCs",
			"Stars": "1,573",
			"Repo": "https://github.com/vuejs/vite?from=pagser"
		}
	]
}
-------------

Colly Example

Work with colly:

p := pagser.New()

collector := colly.NewCollector()

// Parse the page body with Pagser on every response
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data
	err := p.ParseSelection(&data, e.DOM)
	if err != nil {
		log.Println(err)
		return
	}
	//...
})

Dependencies

  • github.com/PuerkitoBio/goquery

  • github.com/spf13/cast

Extensions:

  • github.com/mattn/godown

  • github.com/microcosm-cc/bluemonday
