Collyzar - A distributed redis-based framework for colly.

Collyzar

A distributed redis-based framework for colly.

Collyzar provides a very simple configuration and tools to implement distributed crawling/scraping.

Features

  • Simple configuration and clean API
  • Distributed crawling/scraping
  • Built-in global bloom filter
  • Built-in spider cache
  • Support redis command
  • Multi-machine load balancing
  • Support to pause or stop all crawling machines
  • Pass additional information to the crawler and get it inside the crawler and store it in the database

Installation

Add collyzar to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/Zartenc/collyzar/v2 latest
)

Example Usage

See examples folder for more detailed examples.

Crawler cluster machine

SpiderName must be unique.

After running, it will always monitor the redis crawler queue for crawling until it receives a pause or stop signal.

func main(){
    cs := &collyzar.CollyzarSettings{
    		SpiderName: "zarten",
    		Domain:     "www.amazon.com",
    		RedisIp:    "127.0.0.1",
    	}
	collyzar.Run(myResponse, cs, nil)
}

func myResponse(response *collyzar.ZarResponse){
	fmt.Println(response.StatusCode)
}

Control machine

Push url to redis queue

func main(){
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	url := "https://www.amazon.com"
	pushInfo := collyzar.PushInfo{Url:url}

	err := ts.PushToQueue(pushInfo)
	if err != nil{
		fmt.Println(err)
	}
}

Tools

Provide tools including stop crawlers and pause crawlers.

Stop all crawlers
func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	err := ts.StopSpiders()
	if err != nil{
		fmt.Println(err)
	}
}

Pause all crawlers

For all crawlers, the crawler process is idle after pausing the crawler.
Then you can use the WakeupSpiders method to wake up the crawlers.

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	err := ts.PauseSpiders()
	if err != nil{
		fmt.Println(err)
	}
}

Bugs

Bugs or suggestions? Visit the issue tracker

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").

Similar Resources

Interact with Chromium-based browsers' debug port to view open tabs, installed extensions, and cookies

Interact with Chromium-based browsers' debug port to view open tabs, installed extensions, and cookies

WhiteChocolateMacademiaNut Description Interacts with Chromium-based browsers' debug port to view open tabs, installed extensions, and cookies. Tested

Nov 2, 2022

Golang based web site opengraph data scraper with caching

Golang based web site opengraph data scraper with caching

Snapper A Web microservice for capturing a website's OpenGraph data built in Golang Building Snapper building the binary git clone https://github.com/

Oct 5, 2022

A crawler/scraper based on golang + colly, configurable via JSON

A crawler/scraper based on golang + colly, configurable via JSON

Aug 21, 2022

A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper This a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL which is ready to be impo

Aug 21, 2022

Warhammer40K faction scraper written in Golang, powered by colly.

Wascra Description Wascra is a tool written in Golang, which lets you extract all relevant Datasheet info from a Warhammer40K (9th edition) faction fr

Feb 8, 2022

Redis-shake is a tool for synchronizing data between two redis databases. Redis-shake是一个用于在两个redis之间同步数据的工具,满足用户非常灵活的同步、迁移需求。

Redis-shake is a tool for synchronizing data between two redis databases. Redis-shake是一个用于在两个redis之间同步数据的工具,满足用户非常灵活的同步、迁移需求。

RedisShake is mainly used to synchronize data from one redis to another. Thanks to the Douyu's WSD team for the support. 中文文档 English tutorial 中文使用文档

Dec 29, 2022

Go-distributed-websocket - Distributed Web Socket with Golang and Redis

Go-distributed-websocket - Distributed Web Socket with Golang and Redis

go-distributed-websocket Distributed Web Socket with Golang and Redis Dependenci

Oct 13, 2022

7 days golang programs from scratch (web framework Gee, distributed cache GeeCache, object relational mapping ORM framework GeeORM, rpc framework GeeRPC etc) 7天用Go动手写/从零实现系列

7 days golang programs from scratch README 中文版本 7天用Go从零实现系列 7天能写什么呢?类似 gin 的 web 框架?类似 groupcache 的分布式缓存?或者一个简单的 Python 解释器?希望这个仓库能给你答案

Jan 5, 2023

Distributed-Services - Distributed Systems with Golang to consequently build a fully-fletched distributed service

Distributed-Services This project is essentially a result of my attempt to under

Jun 1, 2022

A lightweight, distributed and reliable message queue based on Redis

nmq A lightweight, distributed and reliable message queue based on Redis Get Started Download go get github.com/inuggets/nmq Usage import "github.com

Nov 22, 2021

Distributed disk storage database based on Raft and Redis protocol.

Distributed disk storage database based on Raft and Redis protocol.

IceFireDB Distributed disk storage system based on Raft and RESP protocol. High performance Distributed consistency Reliable LSM disk storage Cold and

Dec 31, 2022

Distributed disk storage database based on Raft and Redis protocol.

Distributed disk storage database based on Raft and Redis protocol.

IceFireDB Distributed disk storage system based on Raft and RESP protocol. High performance Distributed consistency Reliable LSM disk storage Cold and

Dec 27, 2022

Disgo - A distributed lock based on redis developed with golang

Disgo - A distributed lock based on redis developed with golang

English | 中文 DisGo Introduce DisGo is a distributed lock based on Redis, develop

Dec 2, 2022

Golang client for redislabs' ReJSON module with support for multilple redis clients (redigo, go-redis)

Go-ReJSON - a golang client for ReJSON (a JSON data type for Redis) Go-ReJSON is a Go client for ReJSON Redis Module. ReJSON is a Redis module that im

Dec 25, 2022

Redis client Mock Provide mock test for redis query

Redis client Mock Provide mock test for redis query, Compatible with github.com/go-redis/redis/v8 Install Confirm that you are using redis.Client the

Dec 27, 2022

LFU Redis implements LFU Cache algorithm using Redis as data storage

LFU Redis cache library for Golang LFU Redis implements LFU Cache algorithm using Redis as data storage LFU Redis Package gives you control over Cache

Nov 10, 2022

GoBigdis is a persistent database that implements the Redis server protocol. Any Redis client can interface with it and start to use it right away.

GoBigdis GoBigdis is a persistent database that implements the Redis server protocol. Any Redis client can interface with it and start to use it right

Apr 27, 2022

Redis inventory is a tool to analyse Redis memory usage by key patterns and displaying it hierarchically

Redis inventory is a tool to analyse Redis memory usage by key patterns and displaying it hierarchically

Redis inventory is a tool to analyse Redis memory usage by key patterns and displaying it hierarchically. The name is inspired by "Disk Inventory X" tool doing similar analysis for disk usage.

Dec 11, 2022

The source code for workshop Scalable architecture using Redis as backend database using Golang + Redis

The source code for workshop Scalable architecture using Redis as backend database using Golang + Redis

Sep 23, 2022
A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper This a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL which is ready to be impo

Aug 21, 2022
Warhammer40K faction scraper written in Golang, powered by colly.

Wascra Description Wascra is a tool written in Golang, which lets you extract all relevant Datasheet info from a Warhammer40K (9th edition) faction fr

Feb 8, 2022
High-performance crawler framework based on fasthttp

predator / 掠食者 基于 fasthttp 开发的高性能爬虫框架 使用 下面是一个示例,基本包含了当前已完成的所有功能,使用方法可以参考注释。

May 2, 2022
High-performance crawler framework based on fasthttp.

predator / 掠食者 基于 fasthttp 开发的高性能爬虫框架 使用 下面是一个示例,基本包含了当前已完成的所有功能,使用方法可以参考注释。 1 创建一个 Crawler import "github.com/go-predator/predator" func main() {

Dec 14, 2022
Go-based search engine URL collector , support Google, Bing, can be based on Google syntax batch collection URL
Go-based search engine URL collector , support Google, Bing, can be based on Google syntax batch collection URL

Go-based search engine URL collector , support Google, Bing, can be based on Google syntax batch collection URL

Nov 9, 2022
Distributed web crawler admin platform for spiders management regardless of languages and frameworks. 分布式爬虫管理平台,支持任何语言和框架
Distributed web crawler admin platform for spiders management regardless of languages and frameworks. 分布式爬虫管理平台,支持任何语言和框架

Crawlab 中文 | English Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer Golang-

Jan 7, 2023
Pholcus is a distributed high-concurrency crawler software written in pure golang
Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus Pholcus(幽灵蛛)是一款纯 Go 语言编写的支持分布式的高并发爬虫软件,仅用于编程学习与研究。 它支持单机、服务端、客户端三种运行模式,拥有Web、GUI、命令行三种操作界面;规则简单灵活、批量任务并发、输出方式丰富(mysql/mongodb/kafka/csv/excel等

Dec 30, 2022
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Jan 9, 2023
[爬虫框架 (golang)] An awesome Go concurrent Crawler(spider) framework. The crawler is flexible and modular. It can be expanded to an Individualized crawler easily or you can use the default crawl components only.

go_spider A crawler of vertical communities achieved by GOLANG. Latest stable Release: Version 1.2 (Sep 23, 2014). QQ群号:337344607 Features Concurrent

Jan 6, 2023
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

About Creeper is a next-generation crawler which fetches web page by creeper script. As a cross-platform embedded crawler, you can use it for your new

Dec 4, 2022