[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.

Goribot

A lightweight, distributed-friendly Golang crawler framework.

Full Documentation | Document

!! Warning !!

Goribot has moved to Gospider | github.com/zhshch2002/gospider, which fixes some scheduling issues and splits the network-request layer into a separate repo. This repository will be kept around, but new users are encouraged to use the new Gospider.


🚀 Features

Version warning

Goribot requires Go 1.13 or later.

👜 Get Goribot

go get -u github.com/zhshch2002/goribot

Goribot has an older development line; if you need that historical version, check out the v0.0.1 tag.

Build your first project

package main

import (
	"fmt"
	"github.com/zhshch2002/goribot"
)

func main() {
	s := goribot.NewSpider()

	s.AddTask(
		goribot.GetReq("https://httpbin.org/get"),
		func(ctx *goribot.Context) {
			fmt.Println(ctx.Resp.Text)
			fmt.Println(ctx.Resp.Json("headers.User-Agent"))
		},
	)

	s.Run()
}

🎉 Done

That's it, you can now use Goribot. For more, see Getting Started in the docs.

🙏 Thanks

Many thanks to the projects above for their help 🙏

Comments
  • A question about storing crawl results

    A question about storing crawl results

    The code structure is as follows:

    func main() {
    	...
    	s.AddTask(goribot.GetReq(url), f1)
    	s.Run()
    }
    
    var f1 = func(ctx *goribot.Context) {
    	...
    	url := ctx.Resp.Json("url").String()             // GetReq expects a string URL, not an Int
    	file_name := ctx.Resp.Json("file_name").String() // file names are strings as well
    	file_size := ctx.Resp.Json("file_size").Int()
    	ctx.AddTask(goribot.GetReq(url),f2)
    }
    
    var f2 = func(ctx *goribot.Context) {
    	file_download_url := ctx.Resp.Json("file_download_url")
    }
    

    How can I output the file_name and file_size obtained in f1 together with the file_download_url obtained in f2 as one whole record?
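One framework-agnostic way to do this: since the callbacks are plain Go functions, f1 can fill a record and build f2 as a closure over that record, so the complete item is emitted only once all three fields are present. A minimal sketch (the Item struct, handler type, and names are hypothetical, not Goribot API):

```go
package main

import "fmt"

// Item collects fields gathered across the two callbacks.
type Item struct {
	FileName        string
	FileSize        int64
	FileDownloadURL string
}

// handler stands in for a response callback in this sketch.
type handler func(resp map[string]string)

// makeF2 returns the second-stage handler as a closure over the partially
// filled item, so name, size, and download URL can be emitted together.
func makeF2(item *Item, emit func(Item)) handler {
	return func(resp map[string]string) {
		item.FileDownloadURL = resp["file_download_url"]
		emit(*item) // the record is complete at this point
	}
}

func main() {
	item := &Item{FileName: "report.pdf", FileSize: 1024}
	f2 := makeF2(item, func(it Item) { fmt.Printf("%+v\n", it) })
	f2(map[string]string{"file_download_url": "https://example.com/report.pdf"})
}
```

In the real spider, f1 would construct the Item from its response and pass makeF2(item, emit) as the callback for the follow-up request it adds.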

  • Confusion about the requestData parameter format in PostJsonReq(urladdr string, requestData interface{})

    Confusion about the requestData parameter format in PostJsonReq(urladdr string, requestData interface{})

    Taken literally, PostJsonReq should build a POST request whose body is JSON. But if requestData is InfoRaw the request fails, while InfoMap succeeds.

    func PostJsonReq(urladdr string, requestData interface{}) *Request {
    	body, err := json.Marshal(requestData)
    	req := PostReq(urladdr, bytes.NewReader(body))
    	if req.Err == nil {
    		req.Err = err
    	}
    	req.SetHeader("Content-Type", "application/json")
    	return req
    }
    

    The root cause is how json.Marshal() encodes the two:

    InfoRaw = `{"account":"XXX","password":"YYY"}`
    InfoMap = map[string]string{
    		"account":  "XXX",
    		"password": "YYY",
    }
    
    func xx2JSON(requestData interface{}) {
    	body, err := json.Marshal(requestData)
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(string(body))
    }
    
    	xx2JSON(InfoRaw)    //  "{\"account\":\"XXX\",\"password\":\"YYY\"}"
    	xx2JSON(InfoMap)   //  {"account":"XXX","password":"YYY"}
    
    

    So shouldn't PostJsonReq internally use something like gjson.Parse(InfoRaw).String()?

  • The spider exits before a task has finished?

    The spider exits before a task has finished?

    package main

    import (
    	"fmt"

    	"github.com/zhshch2002/goribot"

    	"crawlab/service/babyTree/parser"
    )

    func main() {
    	s := goribot.NewSpider(
    		//goribot.Limiter(
    		//	true,
    		//	&goribot.LimitRule{
    		//		Glob:        "*.babytree.com",
    		//		Allow:       goribot.Allow,
    		//		Rate:        2,
    		//		//Delay:       5 * time.Second,
    		//		RandomDelay: 5 * time.Second,
    		//		Parallelism: 3,
    		//		MaxReq:      3,
    		//		MaxDepth:    5,
    		//	},
    		//),
    		goribot.SetDepthFirst(true),
    		goribot.RefererFiller(),
    		goribot.RandomUserAgent(),
    		goribot.SpiderLogPrint(),
    	)

    s.AddTask(
    	goribot.Get("http://www.babytree.com/difang/allCities.php"),
    	parser.CityList,
    )
    
    s.OnItem(func(i interface{}) interface{} {
    	fmt.Printf("Item : %+v\r\n",i)
    	return i
    })
    
    s.Run()
    

    }

    package parser

    import (
    	"fmt"
    	"net/url"
    	"strings"

    	"github.com/PuerkitoBio/goquery"
    	"github.com/zhshch2002/goribot"

    	"crawlab/models"
    )

    func CityList(ctx *goribot.Context) {
    	if ctx.Resp.Dom != nil {
    		cityList := make([]map[string]string, 2)
    		cityList[0] = map[string]string{"city": "北京", "py": "beijing"}
    		cityList[1] = map[string]string{"city": "上海", "py": "shanghai"}

    	ctx.Resp.Dom.Find("a").Each(func(i int, sel *goquery.Selection) {
    		href, exists := sel.Attr("href")
    		index := strings.LastIndex(href, "?location=")
    		if index == -1 && exists {
    			u, _ := url.Parse(href)
    			i2 := strings.Index(u.Host, ".")
    			u2 := u.Host[:i2]
    			i3 := strings.Index(u2, "-city")
    			if i3 > -1 {
    				py := u2[:i3]
    				cityList = append(cityList, map[string]string{"city": sel.Text(), "py": py})
    			}
    		}
    	})
    	fmt.Printf("city Lists : %+v\r\n",cityList)
    	for _, c := range cityList {
    		py := c["py"]
    		uri := "http://www.babytree.com/community/" + py + "/index_1.html#topicpos"
    		ctx.AddItem(models.CityList{py,uri})
    	}
    }
    

    }

    This is the output it prints: (image)

    (image)

    Why can't all the Items in the map be printed? Why does the Run method exit before they finish?

  • Add license scan report and status

    Add license scan report and status

    Your FOSSA integration was successful! Attached in this PR is a badge and license report to track scan status in your README.

    Below are docs for integrating FOSSA license checks into your CI:

  • build(deps): bump websocket-extensions from 0.1.3 to 0.1.4 in /_docs

    build(deps): bump websocket-extensions from 0.1.3 to 0.1.4 in /_docs

    Bumps websocket-extensions from 0.1.3 to 0.1.4.

    Changelog

    Sourced from websocket-extensions's changelog.

    0.1.4 / 2020-06-02

    • Remove a ReDoS vulnerability in the header parser (CVE-2020-7662, reported by Robert McLaughlin)
    • Change license from MIT to Apache 2.0
    Commits
    • 8efd0cd Bump version to 0.1.4
    • 3dad4ad Remove ReDoS vulnerability in the Sec-WebSocket-Extensions header parser
    • 4a76c75 Add Node versions 13 and 14 on Travis
    • 44a677a Formatting change: {...} should have spaces inside the braces
    • f6c50ab Let npm reformat package.json
    • 2d211f3 Change markdown formatting of docs.
    • 0b62083 Update Travis target versions.
    • 729a465 Switch license to Apache 2.0.
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

  • A crawler question

    A crawler question

    Another question. I want to crawl this page: https://github.com/search?q=go&type=Topics, with a given start entry startTopic (e.g. golang-library) and end entry endTopic (e.g. google-cloud-storage, which is on page 3).

    The spider should start crawling at startTopic and stop crawling once it reaches endTopic.

    My implementation is roughly as follows:

    var (
    	startTopic = "XXXXX"
    	endTopic   = "YYYYY"
    )
    
    func main1() {
    	s := goribot.NewSpider(
    		goribot.Limiter(true, &goribot.LimitRule{
    			//Glob: "*.github.com",
    			Rate: 2,
    		}),
    		goribot.RefererFiller(),
    		goribot.RandomUserAgent(),
    	)
    
    	s.AddTask(goribot.Get("https://github.com/search?q=go&type=Topics"), func(ctx *goribot.Context) {
    		totalPageStr := ctx.Resp.Dom.Find("XXXXXXX").Text()
    		totalPage, _ := strconv.Atoi(totalPageStr) // Text() returns a string; convert before the loop
    
    		f := func(p int) string {
    			return fmt.Sprintf("https://github.com/search?p=%v&q=go&type=Topics", p)
    		}
    
    		for i := 1; i <= totalPage; i++ {
    			ctx.AddTask(goribot.Get(f(i)), func(ctx *goribot.Context) {
    				ctx.Resp.Dom.Find("tbody[id^=normalthread]").Each(func(i int, s *goquery.Selection) {
    
    					topic := s.Find("XXXX").Text()
                                          
                                           ..............................
    
    				})
    
    			})
    		}
    
    	})
    
    	s.Run()
    }
    

    But I keep running into trouble with the stop condition: even after endTopic has been reached, the spider keeps crawling onward.
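A likely cause: the loop enqueues all totalPage pages up front, so pages after endTopic are already scheduled (and may run concurrently) by the time the stop condition fires. One fix is to chain pages sequentially, only adding page p+1 if endTopic has not yet been seen. The range logic itself can be sketched framework-free (collectRange and the sample data are hypothetical):

```go
package main

import "fmt"

// collectRange walks pages strictly in order and gathers topics from
// startTopic up to and including endTopic. Because it returns as soon as
// endTopic is seen, no later page is ever visited; in a spider this is
// where you would simply stop enqueueing the next page.
func collectRange(pages [][]string, startTopic, endTopic string) []string {
	var out []string
	started := false
	for _, topics := range pages { // visit page p only after page p-1 is done
		for _, t := range topics {
			if t == startTopic {
				started = true
			}
			if started {
				out = append(out, t)
			}
			if t == endTopic {
				return out // stop: do not schedule further pages
			}
		}
	}
	return out
}

func main() {
	pages := [][]string{
		{"go", "golang-library", "gin"},
		{"gorm", "google-cloud-storage", "grpc"},
	}
	fmt.Println(collectRange(pages, "golang-library", "google-cloud-storage"))
	// [golang-library gin gorm google-cloud-storage]
}
```

In Goribot terms, that would mean the callback for page p adds the task for page p+1 itself (instead of a for loop adding them all), guarded by the endTopic check.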

  • Suggestion: SetParam should merge with the existing params

    Suggestion: SetParam should merge with the existing params

    With the code below, the original foo=bar gets dropped:

    req := goribot.
    	GetReq("https://example.com?foo=bar&ping=pong").
    	SetParam(map[string]string{
    		"ping": "pong",
    	})
    

    I suggest adding a method func (s *Request) AddParam(key, value string) *Request.

  • build(deps): bump lodash from 4.17.15 to 4.17.19 in /_docs

    build(deps): bump lodash from 4.17.15 to 4.17.19 in /_docs

    Bumps lodash from 4.17.15 to 4.17.19.

    Release notes

    Sourced from lodash's releases.

    4.17.16

    Commits
    Maintainer changes

    This version was pushed to npm by mathias, a new releaser for lodash since your current version.


    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.
