基于文本密度的html2article实现[golang]

基于文本密度的html2article实现[golang]

Install

go get -u -v github.com/sundy-li/html2article

Performance

  • Accuracy: >= 98%

  • Qps: 2w/s , 0.06ms/op go test -bench=. BenchmarkExtract-4 20000 66341 ns/op

  • 说明(对比其他开源实现,可能是目前最快的html2article实现,我们测试的数据集约3kw来自于微信公众号,各大类中文科技媒体历史文章,目前能达到98%以上准确率)

  • 除了必要dom解析以及时间解析, 为了高效率实现, 避免了过多的正则匹配

Examples

参考examples from_url.go

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime) //using UTC timezone
	println("article content is =>", article.Content)

	//parse the article to be readability
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}

Options

	ext.SetOption(&html2article.Option{
		AccurateTitle: true,  //Get the accurate title instead of from title tag
		RemoveNoise: false,  //Remove the noise node such as some footer
	})

Algorithm

Similar Resources

iTunes and RSS 2.0 Podcast Generator in Golang

podcast Package podcast generates a fully compliant iTunes and RSS 2.0 podcast feed for GoLang using a simple API. Full documentation with detailed ex

Dec 23, 2022

TOML parser for Golang with reflection.

THIS PROJECT IS UNMAINTAINED The last commit to this repo before writing this message occurred over two years ago. While it was never my intention to

Dec 30, 2022

Your CSV pocket-knife (golang)

csvutil - Your CSV pocket-knife (golang) #WARNING I would advise against using this package. It was a language learning exercise from a time before "e

Oct 24, 2022

Golang (Go) bindings for GNU's gettext (http://www.gnu.org/software/gettext/)

gosexy/gettext Go bindings for GNU gettext, an internationalization and localization library for writing multilingual systems. Requeriments GNU gettex

Nov 16, 2022

agrep-like fuzzy matching, but made faster using Golang and precomputation.

goagrep There are situations where you want to take the user's input and match a primary key in a database. But, immediately a problem is introduced:

Oct 8, 2022

Ngram index for golang

go-ngram N-gram index for Go. Key features Unicode support. Append only. Data can't be deleted from index. GC friendly (all strings are pooled and com

Dec 29, 2022

String-matching in Golang using the Knuth–Morris–Pratt algorithm (KMP)

gokmp String-matching in Golang using the Knuth–Morris–Pratt algorithm (KMP). Disclaimer This library was written as part of my Master's Thesis and sh

Dec 8, 2022

Golang HTML to plaintext conversion library

html2text Converts HTML into text of the markdown-flavored variety Introduction Ensure your emails are readable by all! Turns HTML into raw text, usef

Dec 28, 2022

Handlebars for golang

raymond Handlebars for golang with the same features as handlebars.js 3.0. The full API documentation is available here: http://godoc.org/github.com/a

Dec 21, 2022
Comments
  • Missing error check

    Missing error check

    Thanks for using CodeLingo!

    We've got a fix for an issue:

    An error was returned but its value was not checked.

    We've got a new dashboard where you can review and apply the fix: dash.codelingo.io.

    Hope it's helpful and happy coding!

    CodeLingo Team

  • Add CodeLingo Tenets

    Add CodeLingo Tenets

    It looks like this is a Go project. This pull request will add a codelingo.yaml file with some common Go Tenets that CodeLingo will use to review Pull Requests to this repository. You can find more Tenets to import at codelingo.io/tenets

golang 在线预览word,excel,pdf,MarkDown(Online Preview Word,Excel,PPT,PDF,Image by Golang)
golang 在线预览word,excel,pdf,MarkDown(Online Preview Word,Excel,PPT,PDF,Image by Golang)

Go View File 在线体验地址 http://39.97.98.75:8082/view/upload (不会经常更新,保留最基本的预览功能。服务器配置较低,如果出现链接超时请等待几秒刷新重试,或者换Chrome) 目前已经完成 docker部署 (不用为运行环境烦恼) Wor

Dec 26, 2022
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

bluemonday bluemonday is a HTML sanitizer implemented in Go. It is fast and highly configurable. bluemonday takes untrusted user generated content as

Jan 4, 2023
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Dec 30, 2022
A golang package to work with Decentralized Identifiers (DIDs)

did did is a Go package that provides tools to work with Decentralized Identifiers (DIDs). Install go get github.com/ockam-network/did Example packag

Nov 25, 2022
wcwidth for golang

go-runewidth Provides functions to get fixed width of the character or string. Usage runewidth.StringWidth("つのだ☆HIRO") == 12 Author Yasuhiro Matsumoto

Dec 11, 2022
Parses the Graphviz DOT language in golang

Parses the Graphviz DOT language and creates an interface, in golang, with which to easily create new and manipulate existing graphs which can be writ

Dec 25, 2022
Go (Golang) GNU gettext utilities package

Gotext GNU gettext utilities for Go. Features Implements GNU gettext support in native Go. Complete support for PO files including: Support for multil

Dec 18, 2022
htmlquery is golang XPath package for HTML query.

htmlquery Overview htmlquery is an XPath query package for HTML, lets you extract data or evaluate from HTML documents by an XPath expression. htmlque

Jan 4, 2023
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.

omniparser Omniparser is a native Golang ETL parser that ingests input data of various formats (CSV, txt, fixed length/width, XML, EDI/X12/EDIFACT, JS

Jan 4, 2023
Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler
Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler

Pagser Pagser inspired by page parser。 Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and str

Dec 13, 2022