
Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See the examples folder for more detailed examples.

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 latest
)

Note that latest is a placeholder; running go get github.com/gocolly/colly/v2 will resolve and pin a concrete version.

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

FOSSA Status

Similar Resources

  • book-report: Example Book Report API written in Golang with Fiber and GORM (Nov 5, 2021)
  • csvparser: a fast, easy-to-use and dependency-free custom mapping from .csv data into Golang structs (Nov 14, 2022)
  • mxj: decode/encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards; replaces the x2j and j2x packages (Dec 29, 2022)
  • govalidator: a package of validators and sanitizers for strings, numerics, slices and structs (Dec 28, 2022)
  • gochro: take screenshots of websites and create PDFs from HTML pages using Chromium and Docker (Nov 23, 2022)
  • go-testmark: parse data and test fixtures from markdown files, and patch them programmatically too (Oct 31, 2022)
  • Tagwatch: watches container registries for new and changed tags and creates an RSS feed for detected changes (Jan 7, 2022)
  • bluemonday: a fast Golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user-generated content of XSS (Jan 4, 2023)
  • did: a Golang package to work with Decentralized Identifiers (DIDs) (Nov 25, 2022)
Comments
  • Using cookies


    
    package main

    import (
    	"fmt"

    	"github.com/gocolly/colly"
    )

    func main() {
    	c := colly.NewCollector()

    	c.OnRequest(func(r *colly.Request) {
    		r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 OPR/96.0.4640.0")
    		r.Headers.Set("cookie", "enter_cookies")
    	})

    	c.OnHTML(".sr-only", func(element *colly.HTMLElement) {
    		fmt.Println(element.Text)
    	})

    	c.Visit("https://github.com/settings/profile")
    }
    

    What I am trying to do: Use my cookies and go to the github settings/profile page and extract the ".sr-only" text and print it. For some reason, nothing prints.

    I used the sample from a similar issue here: https://github.com/gocolly/colly/issues/599

  • Option not to pass Request Context to the Next Request


    I'm using the Request Context to store information about the parsed body in various c.OnHTML callbacks.

    If I use e.Request.Visit() to follow hrefs, the request context is passed along as well. I wanted to avoid this, so instead of e.Request.Visit() I used c.Visit() directly, which gave me a fresh context for each request.

    However, I would also like to use the MaxDepth option, and that only works with e.Request.Visit().

    What would work for me is e.Request.Visit() with a new context for each request. That is currently not possible, correct?

    If so, it would be great to have this as a configuration option: whether or not the request context is passed along to the next request.

    For now, I have made the change manually for local purposes:

    index 6beef834..524bb77c 100644
    --- a/vendor/github.com/gocolly/colly/v2/request.go
    +++ b/vendor/github.com/gocolly/colly/v2/request.go
    @@ -117,7 +117,7 @@ func (r *Request) AbsoluteURL(u string) string {
     // request and preserves the Context of the previous request.
     // Visit also calls the previously provided callbacks
     func (r *Request) Visit(URL string) error {
    -	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
    +	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, nil, nil, true)
     }

     // HasVisited checks if the provided URL has been visited
    
  • TLS Error on Robots.txt is not handled in OnError


    I'm running a test project on localhost:8000, and when I access it over https it fails (which is expected):

    Get "https://localhost:8000/": tls: first record does not look like a TLS handshake

    The above is correctly caught in OnError. However, when I set ignoreRobots to false, colly first tries to fetch robots.txt, and that failure

    Get "https://localhost:8000/robots.txt": tls: first record does not look like a TLS handshake

    is not propagated to OnError, since it does not originate from the request I started. Could this also be propagated to OnError, or be caught with a known error code from Colly, such as:

    ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
    ErrRobotsTxtFetchFailed = errors.New("Unable to fetch robots.txt") // New Error Code
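Until the robots.txt fetch error is surfaced, one workaround (a sketch using only the standard library) is to probe robots.txt yourself before starting the crawl:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// robotsURL derives the robots.txt location for a site URL.
func robotsURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Path = "/robots.txt"
	u.RawQuery = ""
	return u.String(), nil
}

func main() {
	target := "https://localhost:8000/"

	ru, err := robotsURL(target)
	if err != nil {
		fmt.Println("bad URL:", err)
		return
	}

	// A failed fetch here (e.g. the TLS handshake error above) can be
	// handled directly instead of being swallowed inside the collector.
	if _, err := http.Get(ru); err != nil {
		fmt.Println("robots.txt unreachable:", err)
		return
	}
	// ... start the colly crawl here ...
}
```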
    
  • queue.AddRequest gets stuck if the queue.Run() loop has ended

    https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/queue/queue.go#L113

    AddRequest gets stuck on q.wake <- struct{}{} because q.wake is no longer being read once queue.Run() has returned.

More Similar Resources

  • Pagser: a simple, extensible, configurable library that parses and deserializes HTML pages into structs, based on goquery and struct tags, for Golang crawlers (Dec 13, 2022)
  • Geziyor: a fast web crawling & scraping framework for Go; supports JS rendering (Dec 29, 2022)
  • Go View File: online preview of Word, Excel, PPT, PDF, and image files in Golang (Dec 26, 2022)
  • omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc. (Jan 4, 2023)
  • podcast: a fully compliant iTunes and RSS 2.0 podcast feed generator for Golang (Dec 23, 2022)
  • goagrep: agrep-like fuzzy matching, made faster using Golang and precomputation (Oct 8, 2022)
  • strutil: Golang metrics for calculating string similarity and other string utility functions (Jan 3, 2023)
  • prose: a Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction (Jan 4, 2023)
  • Go Draw (Golang MX): types and helper functions for working with mxGraph diagrams in XML, as used by app.diagrams.net (the new name of draw.io) (Aug 30, 2022)
  • wview: a lightweight, minimalist and idiomatic template library based on Go's html/template for building web applications (Dec 5, 2021)