Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
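
A rough sketch of how a few of these features come together in Options is below; the concurrency and delay field names reflect the scraper Options documentation but may differ between versions, so treat them as assumptions to verify against your release.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },

    // Limit concurrency globally and per domain (assumed field names).
    ConcurrentRequests:          10,
    ConcurrentRequestsPerDomain: 2,

    // Add a base delay between requests, optionally randomized (assumed field names).
    RequestDelay:          time.Second,
    RequestDelayRandomize: true,

    UserAgent: "my-crawler/1.0",
}).Start()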

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. If you want to make more than 256 concurrent requests, you need to increase the limit. Read this for more.

Making Normal Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After the response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
    	g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS-rendered requests can be made using the GetRendered method. By default, Geziyor uses the locally installed Chrome binary to start the browser. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is Goquery's Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed by Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc is nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()
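
Since HTMLDoc is nil for non-HTML responses, a small defensive check avoids nil pointer panics when a crawl can hit non-HTML URLs; this guard is just a sketch, not something the library requires.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        // Responses that aren't parseable HTML have a nil HTMLDoc.
        if r.HTMLDoc == nil {
            log.Println("response is not HTML, skipping")
            return
        }
        log.Println(r.HTMLDoc.Find("title").Text())
    },
}).Start()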

Exporting Data

You can export data automatically using exporters. Just send data to the Geziyor.Exports channel. See the available exporters for the built-in formats.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
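
Swapping the exporter changes the output format without touching the parse code; for example, the export package's CSV exporter can be used instead of JSON (any configuration fields beyond the type itself are assumptions that vary by version).

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: quotesParse, // same parse function as in the Usage example
    Exporters: []export.Exporter{&export.CSV{}},
}).Start()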

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s
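
The benchmark itself lives in the test suite; a minimal sketch of how such a throughput benchmark could be structured against a local test server is shown below. The local server and iteration scheme here are assumptions for illustration, not the project's actual benchmark code.

func BenchmarkRequests(b *testing.B) {
    // Serve a trivial page locally so network latency doesn't dominate the numbers.
    ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("<html><body>hello</body></html>"))
    }))
    defer ts.Close()

    geziyor.NewGeziyor(&geziyor.Options{
        StartRequestsFunc: func(g *geziyor.Geziyor) {
            // Queue b.N requests (nil callback: only throughput matters here)
            // and let the scheduler drain them.
            for i := 0; i < b.N; i++ {
                g.Get(ts.URL, nil)
            }
        },
    }).Start()
}
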
Comments
  • google-chrome: executable file not found in $PATH

    Issue:

    I get an error when I start my service on the server. Local on my machine everything works so far.

    request getting rendered: exec: "google-chrome": executable file not found in $PATH

    Code

    main.go

    // ...
    	crawler := geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
    		},
    		Exporters: []export.Exporter{&export.JSON{}},
    	})
    	
    	crawler.Start()
    // ...
    

    Dockerfile

    # -- Stage 1 -- #
    FROM golang:1.16-alpine as builder
    WORKDIR /app
    
    COPY . .
    RUN go build -mod=readonly -o bin/service
    
    # -- Stage 2 -- #
    FROM alpine
    
    # Install any required dependencies.
    RUN apk --no-cache add ca-certificates
    
    WORKDIR /root/
    
    COPY --from=builder /app/bin/service /usr/local/bin/
    
    CMD ["service"]
    

    Question

    I assume I need additional dependencies on my server for geziyor to run smoothly? For example Headless Chrome?
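
    Yes - the binary needs a Chrome it can reach. One option is to install Chromium in the final image; another is to keep the service image as-is and run a separate headless-Chrome container, pointing Geziyor at it via the documented BrowserEndpoint option. The sketch below assumes such a container is reachable at ws://chrome:3000; the address is an example, not a given.

    	crawler := geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
    		},
    		Exporters:       []export.Exporter{&export.JSON{}},
    		BrowserEndpoint: "ws://chrome:3000", // assumed address of a headless Chrome container
    	})
    	crawler.Start()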

  • Cookie cutters and declarative scraping

    Many web sites can be scraped using standard CSS selection without defining fancy Go code to do that. For this, I still like goscrape's "structured scraper" approach. Ref:

    https://github.com/andrew-d/goscrape#goscrape

    And here is how its scraping is defined declaratively:

    https://github.com/andrew-d/goscrape/blob/d89ba4ccc7f78429613f2a71bc7703c8faf9e8c9/_examples/scrape_hn.go#L15-L26

    	config := &scrape.ScrapeConfig{
    		DividePage: scrape.DividePageBySelector("tr:nth-child(3) tr:nth-child(3n-2):not([style='height:10px'])"),
    
    		Pieces: []scrape.Piece{
    			{Name: "title", Selector: "td.title > a", Extractor: extract.Text{}},
    			{Name: "link", Selector: "td.title > a", Extractor: extract.Attr{Attr: "href"}},
    			{Name: "rank", Selector: "td.title[align='right']",
    				Extractor: extract.Regex{Regex: regexp.MustCompile(`(\d+)`)}},
    		},
    
    		Paginator: paginate.BySelector("a[rel='nofollow']:last-child", "href"),
    	}
    

    Hope geziyor can do declarative scraping using predefined cookie cutters like the above as well.
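
    Geziyor doesn't ship such a declarative layer, but one could be sketched on top of the existing ParseFunc along these lines; the Rule type and runDeclarative helper are hypothetical names invented for illustration, not part of the library.

    // Hypothetical declarative layer built on Geziyor's ParseFunc.
    type Rule struct {
    	Name     string
    	Selector string
    }

    func runDeclarative(startURL, itemSelector string, rules []Rule) {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{startURL},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			r.HTMLDoc.Find(itemSelector).Each(func(_ int, s *goquery.Selection) {
    				item := map[string]interface{}{}
    				for _, rule := range rules {
    					item[rule.Name] = s.Find(rule.Selector).Text()
    				}
    				g.Exports <- item
    			})
    		},
    		Exporters: []export.Exporter{&export.JSON{}},
    	}).Start()
    }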

  • context deadline exceeded

    I'm trying to scrape 3242 webpages, but I'm getting response: Get "https://www.typeform.com/templates/t/course-evaluation-form-template/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) for a lot of URLs.

    Any advice?
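
    A hedged suggestion: timeouts on crawls of this size often come from too many in-flight requests rather than the pages themselves, and the documented concurrency-limit and request-delay features are the usual way to slow things down. The sketch below illustrates that; urls and parsePage are placeholders, and the exact option field names should be checked against your version.

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: urls,      // placeholder: the ~3242 pages to crawl
    		ParseFunc: parsePage, // placeholder parse function
    		// Fewer concurrent requests plus a small delay keep the client
    		// from saturating itself or the remote server (assumed field names).
    		ConcurrentRequests: 5,
    		RequestDelay:       500 * time.Millisecond,
    	}).Start()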

  • Queue performance enhancements + delay middleware fix

    Enhances the queue logic to improve memory management and handle deadlock situations

    • Fixes the delay middleware to always factor in the base delay when a randomised delay is added (combined, not used instead)
    • Moves request middleware to run prior to the core g.do func - this allows middleware to cancel requests without triggering the semaphore locks (and corresponding rate limits!), and also avoids queuing items that will only be cancelled, saving memory
    • Avoids deadlocks when the queue exceeds the max queue size by discarding any new records and printing a message to the log
  • Are there any plans to add support for POST requests?

    Hi there, I was using the project for a personal crawler. After reading the source code, I've realized that the only way to send a POST request might be implementing a StartRequestsFunc (let me know if I'm wrong lol) that manipulates the HTTP client directly, e.g.:

    func postToUrl(url string, body io.Reader) {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			// Build the POST request manually and dispatch it through Geziyor.
    			req, _ := client.NewRequest("POST", url, body)
    			g.Do(req, nil)
    		},
    	}).Start()
    }
    

    I haven't tried this approach yet, but I'd like to know if that's the proper way to send requests other than GET. Or are there any plans to add other implementations or an official example of a POST request?
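
    A note on extending that sketch: if the POST needs headers (a JSON body, say), they can be set on the request before handing it to g.Do, assuming client.Request wraps the standard *http.Request as it appears to; this fragment would live inside StartRequestsFunc.

    	req, err := client.NewRequest("POST", url, strings.NewReader(`{"name":"geziyor"}`))
    	if err != nil {
    		return
    	}
    	// Works if client.Request embeds *http.Request (an assumption here).
    	req.Header.Set("Content-Type", "application/json")
    	g.Do(req, g.Opt.ParseFunc)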

  • out of control RAM usage

    I've got a script that follows every link, and then every link on the resulting pages, and it quickly gets out of hand in terms of memory usage (40+ GB) before crashing. Any suggestions as to where it's getting out of control? Storing millions of requests shouldn't take that much RAM in my mind.
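
    Without seeing the script it's hard to say, but one common cause is re-queueing the same URLs over and over; keeping a visited set before calling g.Get bounds the crawl. The sketch below is illustrative only - the sync.Map and the selector are additions, not Geziyor features.

    var visited sync.Map

    func parseLinks(g *geziyor.Geziyor, r *client.Response) {
    	if r.HTMLDoc == nil {
    		return
    	}
    	r.HTMLDoc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
    		href, _ := s.Attr("href")
    		u := r.JoinURL(href)
    		// Only queue URLs we haven't seen before.
    		if _, seen := visited.LoadOrStore(u, struct{}{}); !seen {
    			g.Get(u, parseLinks)
    		}
    	})
    }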

  • Proxy Management Not supported

    Geziyor does not provide any interface for integrating proxies. It does provide request middlewares, but the object that can be manipulated in the middleware has no proxy-related configuration. It would be great if that could be supported as well.

  • How to get response error other than HTTP errors

    Hi,

    How can I get response errors other than HTTP errors (StatusCode), such as timeouts, address not found, or an unreachable website? For example:

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{"http://www.1b4f.com/"},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(string(r.Body))
    		},
    	}).Start()
    

    Log output :

    2019/12/10 15:00:21 Scraping Started
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Response error: Get http://www.1b4f.com/: dial tcp: lookup www.1b4f.com: no such host
    2019/12/10 15:00:21 Scraping Finished

    I want to store the site URL and the error in a database ("http://www.1b4f.com/", "dial tcp: lookup www.1b4f.com: no such host").
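
    A possible direction, heavily hedged: newer Geziyor versions appear to expose an error callback in Options (the ErrorFunc name and signature below are assumptions to verify against your release), which would receive transport-level failures such as DNS errors so they can be stored alongside the URL.

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{"http://www.1b4f.com/"},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(string(r.Body))
    		},
    		// Assumed hook for non-HTTP failures (DNS errors, timeouts, ...).
    		ErrorFunc: func(g *geziyor.Geziyor, req *client.Request, err error) {
    			saveToDatabase(req.URL.String(), err.Error()) // saveToDatabase is a placeholder
    		},
    	}).Start()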

  • Recursive Exports / Native return channels

    I found it quite common to have recursive / nested scraping.

    ├── a
    │   ├── itemA
    │   └── foldA
    │       └── itemB
    └── b
        ├── itemC
        └── foldA
            └── itemA
    

    Total result being something like:

    {
      "a": [
        {
          "title": "itemA",
          "author": "Foo Bar",
          "contents": "asdjnasknd"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemB",
              "author": "Foo Baz",
              "contents": "afgdgasknd"
            }
          ]
        }
      ],
      "b": [
        {
          "title": "itemC",
          "author": "Foo Bar",
          "contents": "odjfoij"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemA",
              "author": "Foo Baz",
              "contents": "alsd"
            }
          ]
        }
      ]
    }
    

    Problem is, as soon as you pass something to g.Do(), you have no way of hearing back from the function.
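
    One workaround with the current API is to capture a channel (or the parent item) in a closure and pass that closure as the callback, so nested parses can report back to whoever assembles the tree. The function name, selector, and URL below are hypothetical additions for illustration, not Geziyor features.

    func crawlNested(g *geziyor.Geziyor, results chan<- map[string]interface{}) {
    	// A callback factory: each child page reports back with its parent attached.
    	childParse := func(parent string) func(g *geziyor.Geziyor, r *client.Response) {
    		return func(g *geziyor.Geziyor, r *client.Response) {
    			results <- map[string]interface{}{
    				"parent": parent,
    				"title":  r.HTMLDoc.Find("span.title").Text(),
    			}
    		}
    	}

    	// Queue child pages with their parent context attached; a separate
    	// goroutine reads from results and assembles the nested output.
    	g.Get("http://example.com/a/foldA", childParse("a/foldA"))
    }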

  • runtime error: invalid memory address or nil pointer dereference

    I just ran the basic example and got this error

    code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/geziyor/geziyor"
    	"github.com/geziyor/geziyor/client"
    )
    
    func main() {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    		},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(r.HTMLDoc.Find("title").Text())
    		},
    		//BrowserEndpoint: "ws://localhost:3000",
    	}).Start()
    }
    

    error:

    Scraping Started
    Crawled: (200) <GET https://httpbin.org/anything>
    runtime error: invalid memory address or nil pointer dereference goroutine 40 [running]:
    runtime/debug.Stack()
            C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x65
    github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc00016cdc0)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x45
    panic({0x111dc60, 0x17f7d60})
            C:/Program Files/Go/src/runtime/panic.go:838 +0x207
    main.main.func2(0xc00014a1c8?, 0xc000409d10?)
            C:/Users/Marshall/Desktop/gezi/main.go:16 +0x18
    github.com/geziyor/geziyor.(*Geziyor).do(0xc00016cdc0, 0xc0001524b0, 0x12350c8)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:262 +0x235
    created by github.com/geziyor/geziyor.(*Geziyor).Do
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xd2
    
    Scraping Finished
    

    Any advice?

  • Add a generic in-memory counter and expose Metrics

    Added a generic in-memory metrics counter and exposed the Metrics variable so that it can be used outside of the external metrics counters.

    (This is more of a suggestion, I'm otherwise counting manually but it doesn't make sense when it's built in!)

  • problem installing geziyor

    go get -u github.com/geziyor/geziyor
    go: go.mod file not found in current directory or any parent directory.
    	'go get' is no longer supported outside a module.
    	To build and install a command, use 'go install' with a version,
    	like 'go install example.com/cmd@latest'
    	For more information, see https://golang.org/doc/go-get-install-deprecation
    	or run 'go help get' or 'go help install'.
    ubuntu2204@ubuntu2204:~/goscrape$ go get go.mod
    go: go.mod file not found in current directory or any parent directory.
    	'go get' is no longer supported outside a module.
    	To build and install a command, use 'go install' with a version,
    	like 'go install example.com/cmd@latest'
    	For more information, see https://golang.org/doc/go-get-install-deprecation
    	or run 'go help get' or 'go help install'.
    ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor
    go: 'go install' requires a version when current directory is not in a module
    	Try 'go install github.com/geziyor/geziyor@latest' to install the latest version
    ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor@latest
    go: downloading github.com/geziyor/geziyor v0.0.0-20220429000531-738852f9321d
    go: downloading golang.org/x/time v0.0.0-20220411224347-583f2d630306
    go: downloading github.com/chromedp/chromedp v0.8.0
    go: downloading github.com/PuerkitoBio/goquery v1.8.0
    go: downloading github.com/chromedp/cdproto v0.0.0-20220428002153-285dfb42699c
    go: downloading golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4
    go: downloading golang.org/x/text v0.3.7
    go: downloading github.com/go-kit/kit v0.12.0
    go: downloading github.com/prometheus/client_golang v1.12.1
    go: downloading github.com/temoto/robotstxt v1.1.2
    go: downloading github.com/andybalholm/cascadia v1.3.1
    go: downloading github.com/beorn7/perks v1.0.1
    go: downloading github.com/cespare/xxhash/v2 v2.1.2
    go: downloading github.com/golang/protobuf v1.5.2
    go: downloading github.com/prometheus/client_model v0.2.0
    go: downloading github.com/cespare/xxhash v1.1.0
    go: downloading github.com/prometheus/common v0.34.0
    go: downloading github.com/prometheus/procfs v0.7.3
    go: downloading google.golang.org/protobuf v1.28.0
    go: downloading github.com/VividCortex/gohistogram v1.0.0
    go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1
    go: downloading golang.org/x/sys v0.0.0-20220422013727-9388b58f7150
    package github.com/geziyor/geziyor is not a main package
    
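    The error output itself points at the fix: 'go get' only works inside a module now, and geziyor is a library rather than a main package, so 'go install ...@latest' won't work either. Creating a module first and then adding Geziyor as a dependency should resolve it (the module path below is just an example):

    go mod init example.com/myscraper
    go get github.com/geziyor/geziyor
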
  • Scrape URLs, then visit them.

    I'm looking for:

    • Parse URLs
    • Visit each parsed URL
    • Parse data from the visited pages.

    For example:

    • Get book URLs from Goodreads
    • Visit those pages
    • Get the books' data from the visited pages.

    This is possible with Colly; I wonder if it's possible with Geziyor.
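
    Yes - the same callback chaining shown in the Usage example handles this: parse the listing page for links, then queue each link with a second parse function. The selectors and field names below are placeholders, not a working Goodreads scraper.

    func parseList(g *geziyor.Geziyor, r *client.Response) {
    	// Collect detail-page links from the listing page.
    	r.HTMLDoc.Find("a.bookTitle").Each(func(_ int, s *goquery.Selection) {
    		if href, ok := s.Attr("href"); ok {
    			g.Get(r.JoinURL(href), parseBook)
    		}
    	})
    }

    func parseBook(g *geziyor.Geziyor, r *client.Response) {
    	// Extract data from the visited page.
    	g.Exports <- map[string]interface{}{
    		"title":  r.HTMLDoc.Find("h1").Text(),
    		"author": r.HTMLDoc.Find(".authorName").Text(),
    	}
    }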

  • Is scraping shadow DOM an option?

    Hi, I'm trying to scrape YouTube Charts, unsuccessfully, because they use Polymer / shadow DOM. Could I do that with Geziyor? I'm using Colly, and it doesn't have support for that.
