[Crawler framework (Golang)] A concurrent crawler (spider) framework written in Go. The crawler is flexible and modular: you can extend it into a customized crawler, or use the default crawl components as they are.

go_spider

A crawler for vertical communities, implemented in Go.

Latest stable Release: Version 1.2 (Sep 23, 2014).

  • go_spider discussion group, QQ group number: 337344607

Features

  • Concurrent
  • Fit for vertical communities
  • Flexible, Modular
  • Native Go implementation
  • Can be expanded to an individualized crawler easily

Requirements

  • Go 1.2 or higher

Documentation

Chinese documentation and FAQ.

Installation

go get github.com/hu17889/go_spider
go get github.com/PuerkitoBio/goquery
go get github.com/bitly/go-simplejson
go get golang.org/x/net/html/charset

This project is based on simplejson and goquery.

In China, you can download the packages from http://gopm.io/.

Usage example

Here is an example that crawls GitHub content, so you can try out the crawl process.

  • go install github.com/hu17889/go_spider/example/github_repo_page_processor
  • ./bin/github_repo_page_processor

More examples here: examples.

Make your spider

    // Spider input:
    //  a PageProcesser;
    //  a task name, used by the Pipeline when recording results.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // Start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // Print results on screen
        SetThreadnum(3).                                               // Crawl requests with three goroutines
        Run()

  • Use the default modules:
      • Downloader: HttpDownloader
      • Scheduler: QueueScheduler
      • Pipeline: PipelineConsole, PipelineFile

  • Use your own modules:

Just copy the default modules and modify them.

If you write a Downloader module, plug it in with Spider.SetDownloader(your_downloader).

If you write a Pipeline module, plug it in with Spider.AddPipeline(your_pipeline).

If you write a Scheduler module, plug it in with Spider.SetScheduler(your_scheduler).

Extensions

The extensions folder contains modules and other tools shared by contributors. You are welcome to push your own code, provided it is free of known bugs.

Modules

Spider

Summary: Crawler initialization, concurrency management, default modules, module management, and configuration.

Functions:

  • Crawler startup functions: Get, GetAll, Run
  • Add request: AddUrl, AddUrls, AddRequest, AddRequests
  • Set main modules: AddPipeline (several Pipeline modules may be added), SetScheduler, SetDownloader
  • Set config: SetExitWhenComplete, SetThreadnum (number of concurrent crawls), SetSleepTime (sleep time after each crawl)
  • Monitor: OpenFileLog, OpenFileLogDefault (enable file logging, written via the mlog package), CloseFileLog, OpenStrace (print tracing info to the screen via stderr), CloseStrace
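
The concurrency settings above can be pictured as a bounded worker pool. The following is a minimal sketch using only the standard library; it is not go_spider's actual implementation, and `crawlAll`, `fetch`, and the parameter names are hypothetical. It assumes a thread number of at least 1.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawlAll is a hypothetical helper, not part of go_spider. It calls fetch
// for every URL while keeping at most threadNum calls in flight, loosely
// mirroring SetThreadnum. Each worker sleeps for the given duration before
// releasing its slot, loosely mirroring SetSleepTime.
// threadNum must be at least 1, otherwise no slot can ever be acquired.
func crawlAll(urls []string, threadNum int, sleep time.Duration, fetch func(string) string) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(urls))
		slots   = make(chan struct{}, threadNum) // counting semaphore
	)
	for _, u := range urls {
		wg.Add(1)
		slots <- struct{}{} // blocks while threadNum workers are busy
		go func(u string) {
			defer wg.Done()
			defer func() { <-slots }() // release the slot
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
			time.Sleep(sleep)
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{"a", "b", "c", "d"}
	// The fetch function here is a stand-in that performs no network I/O.
	out := crawlAll(urls, 3, 10*time.Millisecond, func(u string) string {
		return "body of " + u
	})
	fmt.Println(len(out), "pages fetched")
}
```

A buffered channel is used as the semaphore because it keeps the limit explicit and needs no extra dependency; a fixed set of long-lived workers reading from a queue would be an equally reasonable design.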

Downloader

Summary: The Spider takes a Request, which holds the URL to crawl, from the Scheduler. The Downloader then downloads the result (html, json, jsonp or text) of that Request. The result is saved in a Page for parsing in the PageProcesser. HTML parsing is based on the goquery package and JSON parsing on the simplejson package. JSONP is converted to JSON. The text type represents plain-text content and has no parser.

Functions:

  • Download: downloads the content of the crawl target. The result contains the body, header, cookies and request info.
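
To make the shape of a download result concrete, here is a minimal sketch built on `net/http` alone. It is not the code of HttpDownloader; the `result` type, `download`, and `newDemoServer` are hypothetical names introduced for illustration, and `io.ReadAll` assumes Go 1.16 or later.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// result is a hypothetical container for what a download yields;
// go_spider keeps the equivalent data in its Page type.
type result struct {
	URL     string
	Body    string
	Header  http.Header
	Cookies []*http.Cookie
	Err     error
}

// download fetches url and collects the body, header and cookies.
// On failure only URL and Err are set.
func download(client *http.Client, url string) result {
	res := result{URL: url}
	resp, err := client.Get(url)
	if err != nil {
		res.Err = err
		return res
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		res.Err = err
		return res
	}
	res.Body = string(body)
	res.Header = resp.Header
	res.Cookies = resp.Cookies()
	return res
}

// newDemoServer starts a local server that stands in for a real site.
// It sets one cookie named "session" and replies with the given body.
func newDemoServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc"})
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := newDemoServer("<html><body>hello</body></html>")
	defer srv.Close()

	res := download(srv.Client(), srv.URL)
	if res.Err != nil {
		fmt.Println("download failed:", res.Err)
		return
	}
	fmt.Println(len(res.Body), "bytes,", len(res.Cookies), "cookie(s)")
}
```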

PageProcesser

Summary: The PageProcesser module only parses results. It extracts results (key-value pairs) and the URLs to crawl in the next step. The key-value pairs are saved in PageItems and the URLs are pushed to the Scheduler.

Functions:

  • Process: parses the crawled target.
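
The division of labour described above (parse only, then hand fields and follow-up URLs to other modules) can be sketched as follows. This is a standard-library illustration, not go_spider's Page or PageProcesser: the `page` type, `repoProcessor`, and the JSON layout with `name`, `owner` and `links` are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page is a hypothetical, stripped-down stand-in for go_spider's Page.
type page struct {
	body    string
	fields  map[string]string // parsed key-value pairs
	targets []string          // URLs to crawl in the next step
}

func newPage(body string) *page {
	return &page{body: body, fields: map[string]string{}}
}

func (p *page) addField(key, value string)  { p.fields[key] = value }
func (p *page) addTargetRequest(url string) { p.targets = append(p.targets, url) }

// repoProcessor parses one JSON document describing a repository.
type repoProcessor struct{}

// process only parses: it records fields and follow-up URLs on the page
// and leaves downloading and output to other modules.
func (repoProcessor) process(p *page) error {
	var doc struct {
		Name  string   `json:"name"`
		Owner string   `json:"owner"`
		Links []string `json:"links"`
	}
	if err := json.Unmarshal([]byte(p.body), &doc); err != nil {
		return err
	}
	p.addField("name", doc.Name)
	p.addField("owner", doc.Owner)
	for _, l := range doc.Links {
		p.addTargetRequest(l)
	}
	return nil
}

func main() {
	p := newPage(`{"name":"go_spider","owner":"hu17889","links":["https://github.com/hu17889"]}`)
	if err := (repoProcessor{}).process(p); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(p.fields, p.targets)
}
```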

Page

Summary: Stores the information for a request.

Functions:

  • Get result: GetJson, GetHtmlParser, GetBodyStr (plain text)
  • Get information about the target: GetRequest, GetCookies, GetHeader
  • Get status of the crawl process: IsSucc (whether the download succeeded), Errormsg (error info from the Downloader)
  • Set config: SetSkip, GetSkip (if skip is true, the result is not output by the Pipeline), AddTargetRequest, AddTargetRequests (save URLs to crawl in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save a key-value pair after parsing)

Scheduler

Summary: The Scheduler module is a Request queue. URLs parsed in the PageProcesser are pushed onto the queue.

Functions:

  • Push
  • Poll
  • Count
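
A request queue with these three operations can be sketched as a mutex-guarded FIFO. This is an illustration in the spirit of QueueScheduler, not its source; the `request` type and the lower-case names are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// request is a hypothetical minimal request carrying only a URL.
type request struct{ URL string }

// queueScheduler is a FIFO queue of requests that is safe for
// concurrent use by several goroutines.
type queueScheduler struct {
	mu    sync.Mutex
	queue *list.List
}

func newQueueScheduler() *queueScheduler {
	return &queueScheduler{queue: list.New()}
}

// Push appends a request to the back of the queue.
func (s *queueScheduler) Push(r *request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue.PushBack(r)
}

// Poll removes and returns the oldest request, or nil if the queue is empty.
func (s *queueScheduler) Poll() *request {
	s.mu.Lock()
	defer s.mu.Unlock()
	front := s.queue.Front()
	if front == nil {
		return nil
	}
	return s.queue.Remove(front).(*request)
}

// Count reports how many requests are waiting.
func (s *queueScheduler) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.queue.Len()
}

func main() {
	s := newQueueScheduler()
	s.Push(&request{URL: "https://example.com/a"})
	s.Push(&request{URL: "https://example.com/b"})
	fmt.Println(s.Count(), s.Poll().URL)
}
```

A custom scheduler could keep the same three operations and change only the ordering, for example by deduplicating URLs on Push or polling by priority.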

Pipeline

Summary: The Pipeline module outputs the results and saves them wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to a file).

Functions:

  • Process
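
As an illustration of why console and file output can share one design, the sketch below writes parsed items to any `io.Writer`. It is not the code of PipelineConsole or PipelineFile; the `pipeline` interface, `writerPipeline`, `render`, and the `key: value` line format are assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// pipeline is a hypothetical interface: anything that can output parsed items.
type pipeline interface {
	Process(items map[string]string) error
}

// writerPipeline writes items as "key: value" lines to any io.Writer,
// so the same type can target stdout, a file, or an in-memory buffer.
type writerPipeline struct{ w io.Writer }

func (p writerPipeline) Process(items map[string]string) error {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration is unordered; sort for stable output
	for _, k := range keys {
		if _, err := fmt.Fprintf(p.w, "%s: %s\n", k, items[k]); err != nil {
			return err
		}
	}
	return nil
}

// render runs a writerPipeline into a string, which is handy for testing.
func render(items map[string]string) string {
	var sb strings.Builder
	_ = writerPipeline{w: &sb}.Process(items) // strings.Builder never fails
	return sb.String()
}

func main() {
	var p pipeline = writerPipeline{w: os.Stdout}
	if err := p.Process(map[string]string{"name": "go_spider", "owner": "hu17889"}); err != nil {
		fmt.Fprintln(os.Stderr, "pipeline error:", err)
	}
}
```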

Request

Summary: The Request module holds the configuration for an HTTP request, such as the URL, header and cookies.

Functions:

  • Process

License

go_spider is licensed under the Mozilla Public License Version 2.0

Mozilla summarizes the license scope as follows:

MPL: The copyleft applies to any files containing MPLed code.

That means:

  • You can use the unchanged source code both privately and commercially
  • You needn't publish the source code of your library as long as the files licensed under the MPL 2.0 are unchanged
  • You must publish the source code of any changed files licensed under the MPL 2.0 under a) the MPL 2.0 itself or b) a compatible license (e.g. GPL 3.0 or Apache License 2.0)

Please read the MPL 2.0 FAQ if you have further questions regarding the license.

You can read the full terms here: LICENSE.
