Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain (see the sketch after this list)
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-Unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions
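
Request delays, per-domain concurrency, and async collection are all configured on the collector; below is a minimal sketch using the LimitRule API. The glob, delay, and target URL are illustrative, not prescriptive:

package main

import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async(true) makes Visit non-blocking; Wait() joins in-flight requests.
	c := colly.NewCollector(colly.Async(true))

	// Throttle all domains: at most two parallel requests, plus a
	// randomized extra delay between them.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 2 * time.Second,
	}); err != nil {
		log.Fatal(err)
	}

	c.Visit("http://go-colly.org/")
	c.Wait()
}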

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See the examples folder for more detailed examples.
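
Since the headline use case is extracting structured data, here is a minimal sketch built on the child-element helpers of HTMLElement. The URL, the .item container selector, and the field selectors are hypothetical placeholders for whatever the target page actually uses:

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

// item is an illustrative record type for the extracted fields.
type item struct {
	Title string
	Link  string
}

func main() {
	c := colly.NewCollector()

	var items []item

	// For each matched container element, read child text and attributes.
	c.OnHTML(".item", func(e *colly.HTMLElement) { // hypothetical selector
		items = append(items, item{
			Title: e.ChildText("h2"),
			Link:  e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		})
	})

	c.Visit("https://example.com/list") // hypothetical URL

	fmt.Printf("%+v\n", items)
}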

Installation

Install Colly with go get, which adds it to your go.mod file:

go get github.com/gocolly/colly/v2

Your module's require block will then reference a concrete release, for example:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 v2.1.0
)

Bugs

Bugs or suggestions? Visit the issue tracker or join the #colly channel on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project, please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

Colly is licensed under the Apache License 2.0.

Similar Resources

Simple price scraper with HTTP server/exporter for use with Prometheus

priceserver v0.3: a simple price scraper with HTTP server/exporter for use with Prometheus. Currently works with the Bitrue.com exchange but is easily adaptable.

Nov 16, 2021

Scraper to download school attendance data from the DfE's statistics website

💪 Simple to use. Scrape attendance data with a single command! 🏇 Super fast.

Mar 31, 2022

A cli scraper of gocomics.com made in go

goComic is a CLI tool written in Go that scrapes your favorite childhood comic from gocomics.com. It will give you a single day's comic.

Dec 24, 2021

Best Room Price Scraper from Booking.com

This repo is a tutorial on large-scale scraping.

Nov 11, 2022

A simple scraper to export data from Buildkite to Honeycomb using the OpenTelemetry SDK

A quick scraper program that lets you export builds on BuildKite as OpenTelemetry data and then send them to honeycomb.io for slice-n-dice, high-cardinality analysis.

Jul 7, 2022

Fast Golang web crawler for gathering URLs and JavaScript file locations.

This is basically a simple implementation of the awesome Gocolly library.

Sep 24, 2022

Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus (ghost spider) is a distributed, high-concurrency crawler written in pure Go, intended solely for programming study and research. It supports three run modes (standalone, server, client) and offers three interfaces (web, GUI, command line); its rules are simple and flexible, it runs batched tasks concurrently, and it supports rich output formats (MySQL/MongoDB/Kafka/CSV/Excel, etc.).

Dec 30, 2022

New World Auction House Crawler In Golang

New-World-Auction-House-Crawler: the goal of this library is to have a process which grabs New World auction house data in the background while you play the game.

Sep 7, 2022

A PCPartPicker crawler for Golang.

gopartpicker: a scraper for pcpartpicker.com, written in Go and implemented using Colly. Features: extract data from part list URLs, search for parts.

Nov 9, 2021
Comments
  • Attempting to fetch a nonexistent /sitemap.xml.gz results in "gzip: invalid header"

    package main
    
    import (
    	"log"
    	"net/http"
    
    	"github.com/gocolly/colly/v2"
    )
    
    func main() {
    	// we have to disable Accept-Encoding: gzip,
    	// because the remote server might send compressed
    	// 404 page, avoiding the bug
    	tr := &http.Transport{
    		DisableCompression: true,
    	}
    
    	c := colly.NewCollector()
    	c.WithTransport(tr)
    
    	c.OnError(func(resp *colly.Response, err error) {
    		log.Printf("err=%v", err)
    	})
    
    	c.Visit("https://example.org/sitemap.xml.gz")
    }
    

    Results in:

    2023/01/05 22:30:27 err=gzip: invalid header
    
  • Option not to pass Request Context to the Next Request

    I'm using the Request Context to store information about the parsed body in various c.OnHTML callbacks.

    If I use e.Request.Visit() to follow hrefs, the request context is passed along as well. I wanted to avoid this, so instead of e.Request.Visit() I used c.Visit() directly, which gave me a new context for each request.

    However, I would also like to use the MaxDepth option, and that only works if I use e.Request.Visit().

    What would work for me is e.Request.Visit() with a new context for each request. This is currently not possible. Is that correct?

    If so, it would be great to have this as a configuration option that determines whether the request context is passed along or not.

    For now, I have made the change manually for local purposes:

    index 6beef834..524bb77c 100644
    --- a/vendor/github.com/gocolly/colly/v2/request.go
    +++ b/vendor/github.com/gocolly/colly/v2/request.go
    @@ -117,7 +117,7 @@ func (r *Request) AbsoluteURL(u string) string {
     // request and preserves the Context of the previous request.
     // Visit also calls the previously provided callbacks
     func (r *Request) Visit(URL string) error {
    -	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
    +	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, nil, nil, true)
     }
     
     // HasVisited checks if the provided URL has been visited
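
    One alternative that avoids patching vendored code (my own sketch, not an official workaround) is to keep per-page state in an external map keyed by URL instead of in the request context, so e.Request.Visit() and MaxDepth keep working unmodified:

    package main

    import (
    	"fmt"
    	"sync"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	// Per-page state lives outside the shared request context.
    	var pages sync.Map // request URL -> page title

    	c := colly.NewCollector(colly.MaxDepth(2))

    	c.OnHTML("title", func(e *colly.HTMLElement) {
    		// Instead of e.Request.Ctx.Put(...), key the data by URL.
    		pages.Store(e.Request.URL.String(), e.Text)
    	})

    	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    		// Depth tracking keeps working, since Request.Visit is still used.
    		e.Request.Visit(e.Attr("href"))
    	})

    	c.Visit("http://go-colly.org/")

    	pages.Range(func(k, v interface{}) bool {
    		fmt.Println(k, "->", v)
    		return true
    	})
    }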
    
  • TLS Error on Robots.txt is not handled in OnError

    I'm running a test project on localhost:8000, and when I access it over https it fails (which is expected):

    Get "https://localhost:8000/": tls: first record does not look like a TLS handshake

    The above is correctly caught in OnError. However, when I set IgnoreRobotsTxt to false, colly first tries to fetch robots.txt, and the failure below

    Get "https://localhost:8000/robots.txt": tls: first record does not look like a TLS handshake

    is not propagated to OnError, since it does not originate from the request I started; colly tries to fetch robots.txt first, and that fetch fails. Could this be propagated to OnError, or be caught with a known error code from Colly, such as:

    ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
    ErrRobotsTxtFetchFailed = errors.New("Unable to fetch robots.txt") // New Error Code
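
    For reference, a minimal sketch reproducing the report, assuming a plain-HTTP test server on localhost:8000 as described above; the robots.txt failure surfaces only as Visit's return value, never in OnError:

    package main

    import (
    	"log"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	c := colly.NewCollector()
    	// Honor robots.txt, so colly fetches it before the first request.
    	c.IgnoreRobotsTxt = false

    	c.OnError(func(resp *colly.Response, err error) {
    		// Only errors from the requested URL itself arrive here;
    		// the failed robots.txt fetch does not.
    		log.Printf("OnError: %v", err)
    	})

    	// Assumes nothing is listening for TLS on localhost:8000.
    	if err := c.Visit("https://localhost:8000/"); err != nil {
    		log.Printf("Visit returned: %v", err)
    	}
    }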
    
  • queue.AddRequest gets stuck if the queue.Run() loop has ended

    https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/queue/queue.go#L113

    It blocks on q.wake <- struct{}{}, because q.wake is no longer being read once queue.Run() has returned.
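
    A minimal sketch of the reported behavior, assuming the colly/v2 queue package with in-memory storage (the URLs are placeholders):

    package main

    import (
    	"fmt"

    	"github.com/gocolly/colly/v2"
    	"github.com/gocolly/colly/v2/queue"
    )

    func main() {
    	c := colly.NewCollector()

    	q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    	if err != nil {
    		panic(err)
    	}

    	q.AddURL("https://example.com/")

    	// Run consumes the queue and returns once it is drained.
    	if err := q.Run(c); err != nil {
    		fmt.Println(err)
    	}

    	// Reported bug: this can block forever on the internal wake channel,
    	// because no Run loop is reading from it anymore.
    	q.AddURL("https://example.com/other")
    }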

Ratemyprof scraper with Golang

ratemyprof scraper: visit https://ratemyprof-api.vercel.app/api/getProf to try it out.

Jan 18, 2022
DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes on your scraper.

Dec 14, 2022
A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper: This is a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL, which is ready to be imported.

Aug 21, 2022
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

Creeper is a next-generation crawler which fetches web pages via creeper scripts. As a cross-platform embedded crawler, you can use it in your own projects.

Dec 4, 2022
High-performance crawler framework based on fasthttp

predator: a high-performance crawler framework built on fasthttp. Below is an example that covers essentially all currently completed features; see the code comments for usage.

Dec 14, 2022
Golang based web site opengraph data scraper with caching

Snapper: a web microservice for capturing a website's OpenGraph data, built in Golang.

Oct 5, 2022
Warhammer40K faction scraper written in Golang, powered by colly.

Wascra is a tool written in Golang which lets you extract all relevant datasheet info from a Warhammer40K (9th edition) faction.

Feb 8, 2022
Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with an interface highly similar to that of BeautifulSoup.

Jan 9, 2023