Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
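
A rough sketch of how a few of these features come together in Options is below; the concurrency and delay field names reflect the scraper Options documentation but may differ between versions, so treat them as assumptions to verify against your release.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },

    // Limit concurrency globally and per domain (assumed field names).
    ConcurrentRequests:          10,
    ConcurrentRequestsPerDomain: 2,

    // Add a base delay between requests, optionally randomized (assumed field names).
    RequestDelay:          time.Second,
    RequestDelayRandomize: true,

    UserAgent: "my-crawler/1.0",
}).Start()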

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. If you want to make more than 256 concurrent requests, you need to increase the limit. Read this for more.

Making Normal Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After the response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
    	g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS-rendered requests can be made using the GetRendered method. By default, Geziyor uses the locally installed Chrome binary to start the browser. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is Goquery's Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed by Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc is nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()
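
Since HTMLDoc is nil for non-HTML responses, a small defensive check avoids nil pointer panics when a crawl can hit non-HTML URLs; this guard is just a sketch, not something the library requires.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        // Responses that aren't parseable HTML have a nil HTMLDoc.
        if r.HTMLDoc == nil {
            log.Println("response is not HTML, skipping")
            return
        }
        log.Println(r.HTMLDoc.Find("title").Text())
    },
}).Start()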

Exporting Data

You can export data automatically using exporters. Just send data to the Geziyor.Exports channel. See the available exporters for the built-in formats.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
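
Swapping the exporter changes the output format without touching the parse code; for example, the export package's CSV exporter can be used instead of JSON (any configuration fields beyond the type itself are assumptions that vary by version).

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: quotesParse, // same parse function as in the Usage example
    Exporters: []export.Exporter{&export.CSV{}},
}).Start()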

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s
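
The benchmark itself lives in the test suite; a minimal sketch of how such a throughput benchmark could be structured against a local test server is shown below. The local server and iteration scheme here are assumptions for illustration, not the project's actual benchmark code.

func BenchmarkRequests(b *testing.B) {
    // Serve a trivial page locally so network latency doesn't dominate the numbers.
    ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("<html><body>hello</body></html>"))
    }))
    defer ts.Close()

    geziyor.NewGeziyor(&geziyor.Options{
        StartRequestsFunc: func(g *geziyor.Geziyor) {
            // Queue b.N requests (nil callback: only throughput matters here)
            // and let the scheduler drain them.
            for i := 0; i < b.N; i++ {
                g.Get(ts.URL, nil)
            }
        },
    }).Start()
}
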
Comments
  • google-chrome: executable file not found in $PATH

    Issue:

    I get an error when I start my service on the server. Local on my machine everything works so far.

    request getting rendered: exec: "google-chrome": executable file not found in $PATH

    Code

    main.go

    // ...
    	crawler := geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
    		},
    		Exporters: []export.Exporter{&export.JSON{}},
    	})
    	
    	crawler.Start()
    // ...
    

    Dockerfile

    # -- Stage 1 -- #
    FROM golang:1.16-alpine as builder
    WORKDIR /app
    
    COPY . .
    RUN go build -mod=readonly -o bin/service
    
    # -- Stage 2 -- #
    FROM alpine
    
    # Install any required dependencies.
    RUN apk --no-cache add ca-certificates
    
    WORKDIR /root/
    
    COPY --from=builder /app/bin/service /usr/local/bin/
    
    CMD ["service"]
    

    Question

    I assume I need additional dependencies on my server for geziyor to run smoothly? For example Headless Chrome?
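
    Yes - the binary needs a Chrome it can reach. One option is to install Chromium in the final image; another is to keep the service image as-is and run a separate headless-Chrome container, pointing Geziyor at it via the documented BrowserEndpoint option. The sketch below assumes such a container is reachable at ws://chrome:3000; the address is an example, not a given.

    	crawler := geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
    		},
    		Exporters:       []export.Exporter{&export.JSON{}},
    		BrowserEndpoint: "ws://chrome:3000", // assumed address of a headless Chrome container
    	})
    	crawler.Start()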

  • Cookie cutters and declarative scraping

    Many web sites can be scraped using standard CSS selection without defining fancy Go code to do that. For this, I still like goscrape's "structured scraper" approach. Ref:

    https://github.com/andrew-d/goscrape#goscrape

    And here is how its scraping is defined declaratively:

    https://github.com/andrew-d/goscrape/blob/d89ba4ccc7f78429613f2a71bc7703c8faf9e8c9/_examples/scrape_hn.go#L15-L26

    	config := &scrape.ScrapeConfig{
    		DividePage: scrape.DividePageBySelector("tr:nth-child(3) tr:nth-child(3n-2):not([style='height:10px'])"),
    
    		Pieces: []scrape.Piece{
    			{Name: "title", Selector: "td.title > a", Extractor: extract.Text{}},
    			{Name: "link", Selector: "td.title > a", Extractor: extract.Attr{Attr: "href"}},
    			{Name: "rank", Selector: "td.title[align='right']",
    				Extractor: extract.Regex{Regex: regexp.MustCompile(`(\d+)`)}},
    		},
    
    		Paginator: paginate.BySelector("a[rel='nofollow']:last-child", "href"),
    	}
    

    Hope geziyor can do declarative scraping using predefined cookie cutters like the above as well.
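
    Geziyor doesn't ship such a declarative layer, but one could be sketched on top of the existing ParseFunc along these lines; the Rule type and runDeclarative helper are hypothetical names invented for illustration, not part of the library.

    // Hypothetical declarative layer built on Geziyor's ParseFunc.
    type Rule struct {
    	Name     string
    	Selector string
    }

    func runDeclarative(startURL, itemSelector string, rules []Rule) {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{startURL},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			r.HTMLDoc.Find(itemSelector).Each(func(_ int, s *goquery.Selection) {
    				item := map[string]interface{}{}
    				for _, rule := range rules {
    					item[rule.Name] = s.Find(rule.Selector).Text()
    				}
    				g.Exports <- item
    			})
    		},
    		Exporters: []export.Exporter{&export.JSON{}},
    	}).Start()
    }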

  • context deadline exceeded

    I'm trying to scrape 3242 webpages, but I'm getting response: Get "https://www.typeform.com/templates/t/course-evaluation-form-template/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) for a lot of URLs.

    Any advice?
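
    A hedged suggestion: timeouts on crawls of this size often come from too many in-flight requests rather than the pages themselves, and the documented concurrency-limit and request-delay features are the usual way to slow things down. The sketch below illustrates that; urls and parsePage are placeholders, and the exact option field names should be checked against your version.

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: urls,      // placeholder: the ~3242 pages to crawl
    		ParseFunc: parsePage, // placeholder parse function
    		// Fewer concurrent requests plus a small delay keep the client
    		// from saturating itself or the remote server (assumed field names).
    		ConcurrentRequests: 5,
    		RequestDelay:       500 * time.Millisecond,
    	}).Start()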

  • Queue performance enhancements + delay middleware fix

    Enhances the queue logic to improve memory management and handle deadlock situations

    • Fixes the delay middleware to always factor in the base delay when a randomised delay is added (combined, not used instead)
    • Moves request middleware to run prior to the core g.do func - this allows middleware to cancel requests without triggering the semaphore locks (and corresponding rate limits!), and also avoids queuing items that will only be cancelled, saving memory
    • Avoids deadlocks when the queue exceeds the max queue size by discarding any new records and printing a message to the log
  • Are there any plans to add support for POST requests?

    Hi there, I was using the project for a personal crawler. After reading the source code, I've realized that the only way to send a POST request might be implementing a StartRequestsFunc (let me know if I'm wrong lol) that manipulates the HTTP client directly, e.g.:

    func postToUrl(url string, body io.Reader) {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			// Build the POST request manually and dispatch it through Geziyor.
    			req, _ := client.NewRequest("POST", url, body)
    			g.Do(req, nil)
    		},
    	}).Start()
    }
    

    I haven't tried this approach yet, but I'd like to know if that's the proper way to send requests other than GET. Or are there any plans to add other implementations or an official example of a POST request?
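
    A note on extending that sketch: if the POST needs headers (a JSON body, say), they can be set on the request before handing it to g.Do, assuming client.Request wraps the standard *http.Request as it appears to; this fragment would live inside StartRequestsFunc.

    	req, err := client.NewRequest("POST", url, strings.NewReader(`{"name":"geziyor"}`))
    	if err != nil {
    		return
    	}
    	// Works if client.Request embeds *http.Request (an assumption here).
    	req.Header.Set("Content-Type", "application/json")
    	g.Do(req, g.Opt.ParseFunc)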

  • out of control RAM usage

    I've got a script that follows every link, and then every link on the resulting pages, and it quickly gets out of hand in terms of memory usage (40+ GB) before crashing. Any suggestions as to where it's getting out of control? Storing millions of requests shouldn't take that much RAM in my mind.
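
    Without seeing the script it's hard to say, but one common cause is re-queueing the same URLs over and over; keeping a visited set before calling g.Get bounds the crawl. The sketch below is illustrative only - the sync.Map and the selector are additions, not Geziyor features.

    var visited sync.Map

    func parseLinks(g *geziyor.Geziyor, r *client.Response) {
    	if r.HTMLDoc == nil {
    		return
    	}
    	r.HTMLDoc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
    		href, _ := s.Attr("href")
    		u := r.JoinURL(href)
    		// Only queue URLs we haven't seen before.
    		if _, seen := visited.LoadOrStore(u, struct{}{}); !seen {
    			g.Get(u, parseLinks)
    		}
    	})
    }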

  • Proxy Management Not supported

    Geziyor does not provide any interface for integrating proxies. It does provide request middlewares, but the object that can be manipulated in the middleware has no proxy-related configuration. It would be great if that could be supported as well.

  • How to get response error other than HTTP errors

    Hi,

    How can I get response errors other than HTTP errors (StatusCode), such as timeouts, address not found, or an unreachable website? For example:

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{"http://www.1b4f.com/"},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(string(r.Body))
    		},
    	}).Start()
    

    Log output :

    2019/12/10 15:00:21 Scraping Started
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Response error: Get http://www.1b4f.com/: dial tcp: lookup www.1b4f.com: no such host
    2019/12/10 15:00:21 Scraping Finished

    I want to store the site URL and the error in a database ("http://www.1b4f.com/", "dial tcp: lookup www.1b4f.com: no such host").
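
    A possible direction, heavily hedged: newer Geziyor versions appear to expose an error callback in Options (the ErrorFunc name and signature below are assumptions to verify against your release), which would receive transport-level failures such as DNS errors so they can be stored alongside the URL.

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{"http://www.1b4f.com/"},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(string(r.Body))
    		},
    		// Assumed hook for non-HTTP failures (DNS errors, timeouts, ...).
    		ErrorFunc: func(g *geziyor.Geziyor, req *client.Request, err error) {
    			saveToDatabase(req.URL.String(), err.Error()) // saveToDatabase is a placeholder
    		},
    	}).Start()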

  • Recursive Exports / Native return channels

    I found it quite common to have recursive / nested scraping.

    ├── a
    │   ├── itemA
    │   └── foldA
    │       └── itemB
    └── b
        ├── itemC
        └── foldA
            └── itemA
    

    Total result being something like:

    {
      "a": [
        {
          "title": "itemA",
          "author": "Foo Bar",
          "contents": "asdjnasknd"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemB",
              "author": "Foo Baz",
              "contents": "afgdgasknd"
            }
          ]
        }
      ],
      "b": [
        {
          "title": "itemC",
          "author": "Foo Bar",
          "contents": "odjfoij"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemA",
              "author": "Foo Baz",
              "contents": "alsd"
            }
          ]
        }
      ]
    }
    

    Problem is, as soon as you pass something to g.Do(), you have no way of hearing back from the function.
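
    One workaround with the current API is to capture a channel (or the parent item) in a closure and pass that closure as the callback, so nested parses can report back to whoever assembles the tree. The function name, selector, and URL below are hypothetical additions for illustration, not Geziyor features.

    func crawlNested(g *geziyor.Geziyor, results chan<- map[string]interface{}) {
    	// A callback factory: each child page reports back with its parent attached.
    	childParse := func(parent string) func(g *geziyor.Geziyor, r *client.Response) {
    		return func(g *geziyor.Geziyor, r *client.Response) {
    			results <- map[string]interface{}{
    				"parent": parent,
    				"title":  r.HTMLDoc.Find("span.title").Text(),
    			}
    		}
    	}

    	// Queue child pages with their parent context attached; a separate
    	// goroutine reads from results and assembles the nested output.
    	g.Get("http://example.com/a/foldA", childParse("a/foldA"))
    }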

  • runtime error: invalid memory address or nil pointer dereference

    I just ran the basic example and got this error

    code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/geziyor/geziyor"
    	"github.com/geziyor/geziyor/client"
    )
    
    func main() {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    		},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(r.HTMLDoc.Find("title").Text())
    		},
    		//BrowserEndpoint: "ws://localhost:3000",
    	}).Start()
    }
    

    error:

    Scraping Started
    Crawled: (200) <GET https://httpbin.org/anything>
    runtime error: invalid memory address or nil pointer dereference goroutine 40 [running]:
    runtime/debug.Stack()
            C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x65
    github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc00016cdc0)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x45
    panic({0x111dc60, 0x17f7d60})
            C:/Program Files/Go/src/runtime/panic.go:838 +0x207
    main.main.func2(0xc00014a1c8?, 0xc000409d10?)
            C:/Users/Marshall/Desktop/gezi/main.go:16 +0x18
    github.com/geziyor/geziyor.(*Geziyor).do(0xc00016cdc0, 0xc0001524b0, 0x12350c8)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:262 +0x235
    created by github.com/geziyor/geziyor.(*Geziyor).Do
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xd2
    
    Scraping Finished
    

    Any advice?

  • Add a generic in-memory counter and expose Metrics

    Added a generic in-memory metrics counter and exposed the Metrics variable so that it can be used outside of the external metrics counters.

    (This is more of a suggestion, I'm otherwise counting manually but it doesn't make sense when it's built in!)

  • problem installing geziyor

    go get -u github.com/geziyor/geziyor
    go: go.mod file not found in current directory or any parent directory.
    	'go get' is no longer supported outside a module.
    	To build and install a command, use 'go install' with a version,
    	like 'go install example.com/cmd@latest'
    	For more information, see https://golang.org/doc/go-get-install-deprecation
    	or run 'go help get' or 'go help install'.
    ubuntu2204@ubuntu2204:~/goscrape$ go get go.mod
    go: go.mod file not found in current directory or any parent directory.
    	'go get' is no longer supported outside a module.
    	To build and install a command, use 'go install' with a version,
    	like 'go install example.com/cmd@latest'
    	For more information, see https://golang.org/doc/go-get-install-deprecation
    	or run 'go help get' or 'go help install'.
    ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor
    go: 'go install' requires a version when current directory is not in a module
    	Try 'go install github.com/geziyor/geziyor@latest' to install the latest version
    ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor@latest
    go: downloading github.com/geziyor/geziyor v0.0.0-20220429000531-738852f9321d
    go: downloading golang.org/x/time v0.0.0-20220411224347-583f2d630306
    go: downloading github.com/chromedp/chromedp v0.8.0
    go: downloading github.com/PuerkitoBio/goquery v1.8.0
    go: downloading github.com/chromedp/cdproto v0.0.0-20220428002153-285dfb42699c
    go: downloading golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4
    go: downloading golang.org/x/text v0.3.7
    go: downloading github.com/go-kit/kit v0.12.0
    go: downloading github.com/prometheus/client_golang v1.12.1
    go: downloading github.com/temoto/robotstxt v1.1.2
    go: downloading github.com/andybalholm/cascadia v1.3.1
    go: downloading github.com/beorn7/perks v1.0.1
    go: downloading github.com/cespare/xxhash/v2 v2.1.2
    go: downloading github.com/golang/protobuf v1.5.2
    go: downloading github.com/prometheus/client_model v0.2.0
    go: downloading github.com/cespare/xxhash v1.1.0
    go: downloading github.com/prometheus/common v0.34.0
    go: downloading github.com/prometheus/procfs v0.7.3
    go: downloading google.golang.org/protobuf v1.28.0
    go: downloading github.com/VividCortex/gohistogram v1.0.0
    go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1
    go: downloading golang.org/x/sys v0.0.0-20220422013727-9388b58f7150
    package github.com/geziyor/geziyor is not a main package
    
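    The error output itself points at the fix: 'go get' only works inside a module now, and geziyor is a library rather than a main package, so 'go install ...@latest' won't work either. Creating a module first and then adding Geziyor as a dependency should resolve it (the module path below is just an example):

    go mod init example.com/myscraper
    go get github.com/geziyor/geziyor
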
  • Scrape URLs, then visit them.

    I'm looking for:

    • Parse URLs
    • Visit each parsed URL
    • Parse data from the visited pages.

    For example:

    • Get book URLs from Goodreads
    • Visit those pages
    • Get the books' data from the visited pages.

    This is possible with Colly; I wonder if it's possible with Geziyor.
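
    Yes - the same callback chaining shown in the Usage example handles this: parse the listing page for links, then queue each link with a second parse function. The selectors and field names below are placeholders, not a working Goodreads scraper.

    func parseList(g *geziyor.Geziyor, r *client.Response) {
    	// Collect detail-page links from the listing page.
    	r.HTMLDoc.Find("a.bookTitle").Each(func(_ int, s *goquery.Selection) {
    		if href, ok := s.Attr("href"); ok {
    			g.Get(r.JoinURL(href), parseBook)
    		}
    	})
    }

    func parseBook(g *geziyor.Geziyor, r *client.Response) {
    	// Extract data from the visited page.
    	g.Exports <- map[string]interface{}{
    		"title":  r.HTMLDoc.Find("h1").Text(),
    		"author": r.HTMLDoc.Find(".authorName").Text(),
    	}
    }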

  • Is scraping shadow DOM an option?

    Hi, I'm trying to scrape YouTube Charts, unsuccessfully, because they use Polymer / shadow DOM. Could I do that with Geziyor? I'm using Colly, and it doesn't have support for that.
