Pholcus is a distributed, high-concurrency crawler written in pure Go

Pholcus (Ghost Spider) is a high-concurrency crawler written in pure Go, with support for distributed deployment, intended solely for programming study and research.

It supports three run modes (standalone, server, and client) and three operation interfaces (Web, GUI, and command line). Rules are simple and flexible, batches of tasks run concurrently, and output options are rich (mysql/mongodb/kafka/csv/excel, etc.). It also supports horizontal and vertical crawling modes, along with advanced features such as simulated login and pausing or canceling tasks.


Disclaimer

This software is for academic research only. Users must comply with the laws and regulations of their jurisdiction and must not use it for illegal purposes!! In mainland China, for example, news of crawler developers being sued or charged breaks frequently.
Solemn declaration: users bear sole responsibility for all consequences of illegal or non-compliant use!!

Crawler Principles

(architecture diagrams)

Framework Features

  • A feature-complete, heavyweight crawler tool that lets users with basic Go or JS skills focus solely on writing rules;
  • Supports three run modes: standalone, server, and client;
  • Three operation interfaces: GUI (Windows), Web, and Cmd, selectable via a command-line flag;
  • Supports state control such as pause, resume, and stop;
  • The collection volume can be capped;
  • The number of concurrent goroutines can be controlled;
  • Multiple collection tasks can run concurrently;
  • Supports proxy IP lists with a configurable rotation frequency;
  • Supports random pauses during collection to simulate human behavior;
  • Provides a custom configuration input interface for rules that need one;
  • Six output modes: mysql, mongodb, kafka, csv, excel, and raw file download;
  • Supports batched output with a configurable batch size;
  • Supports both static Go and dynamic JS collection rules, both horizontal and vertical crawling modes, and ships with many demos;
  • Persists success records for automatic deduplication;
  • Serializes failed requests and reloads them automatically via deserialization;
  • Uses the surfer high-concurrency downloader, which supports GET/POST/HEAD over http/https and offers two modes (a fixed UserAgent with automatic cookie saving, or a large pool of random UserAgents with cookies disabled), closely simulating browser behavior and enabling features such as simulated login;
  • Server/client mode is built on the Teleport high-concurrency socket framework, using full-duplex persistent connections with JSON as the internal data-transfer format.

Download & Install

go get -u -v github.com/henrylee2cn/pholcus

Create a Project

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    // _ "pholcus_lib_pte" // 同样你也可以自由添加自己的规则库
)

func main() {
    // Set the default runtime operation interface and start running.
    // Before running, the -a_ui flag can be set to "web", "gui" or "cmd" to choose this run's interface.
    // "gui" is only supported on Windows.
    exec.DefaultRun("web")
}

Compile & Run

Standard build

cd {{replace your gopath}}/src/github.com/henrylee2cn/pholcus
go install or go build

Building on Windows with the cmd window hidden

cd {{replace your gopath}}/src/github.com/henrylee2cn/pholcus
go install -ldflags="-H=windowsgui -linkmode=internal" or go build -ldflags="-H=windowsgui -linkmode=internal"

View the available flags:

pholcus -h


Screenshot of the Web interface:

(screenshot)

Screenshot of the GUI mode-selection screen:

(screenshot)

Example of Cmd-interface run flags:

$ pholcus -_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 -a_dockercap=5000 -a_pause=300 -a_proxyminute=0 -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true

*Note:* To use the proxy IP feature on Mac, you must run with root privileges; otherwise the usable proxies cannot be verified via ping!

Runtime Directory Layout

├─pholcus                 the executable
│
├─pholcus_pkg             runtime files directory
│  ├─config.ini           configuration file
│  │
│  ├─proxy.lib            proxy IP list file
│  │
│  ├─spiders              dynamic rules directory
│  │  └─xxx.pholcus.html  dynamic rule file
│  │
│  ├─phantomjs            phantomjs binary
│  │
│  ├─text_out             text data output directory
│  │
│  ├─file_out             file output directory
│  │
│  ├─logs                 log directory
│  │
│  ├─history              history records directory
│  │
└─└─cache                 temporary cache directory

Dynamic Rule Example

Features: rules are loaded dynamically without recompiling the software; they are simple to write and easy to add, suiting lightweight collection projects.
xxx.pholcus.html

<Spider>
    <Name>HTML动态规则示例</Name>
    <Description>HTML动态规则示例 [Auto Page] [http://xxx.xxx.xxx]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>false</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace>
        <Script></Script>
    </Namespace>
    <SubNamespace>
        <Script></Script>
    </SubNamespace>
    <Root>
        <Script param="ctx">
        console.log("Root");
        ctx.JsAddQueue({
            Url: "http://xxx.xxx.xxx",
            Rule: "登录页"
        });
        </Script>
    </Root>
    <Rule name="登录页">
        <AidFunc>
            <Script param="ctx,aid">
            </Script>
        </AidFunc>
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.JsAddQueue({
                Url: "http://xxx.xxx.xxx",
                Rule: "登录后",
                Method: "POST",
                PostData: "[email protected]&amp;password=44444444&amp;login_btn=login_btn&amp;submit=login_btn"
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="登录后">
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.Output({
                "全部": ctx.GetText()
            });
            ctx.JsAddQueue({
                Url: "http://accounts.xxx.xxx/member",
                Rule: "个人中心",
                Header: {
                    "Referer": [ctx.GetUrl()]
                }
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="个人中心">
        <ParseFunc>
            <Script param="ctx">
            console.log("个人中心: " + ctx.GetRuleName());
            ctx.Output({
                "全部": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

Static Rule Example

Features: compiled together with the software for stronger customization and higher efficiency, suiting heavyweight collection projects.
xxx.go

// Assumed imports for this fragment, following the pholcus rule-library convention:
//
//   import (
//       "net/http"
//
//       "github.com/henrylee2cn/pholcus/app/downloader/request"
//       . "github.com/henrylee2cn/pholcus/app/spider"
//   )
func init() {
    Spider{
        Name:        "静态规则示例",
        Description: "静态规则示例 [Auto Page] [http://xxx.xxx.xxx]",
        // Pausetime: 300,
        // Limit:   LIMIT,
        // Keyin:   KEYIN,
        EnableCookie:    true,
        NotDefaultField: false,
        Namespace:       nil,
        SubNamespace:    nil,
        RuleTree: &RuleTree{
            Root: func(ctx *Context) {
                ctx.AddQueue(&request.Request{Url: "http://xxx.xxx.xxx", Rule: "登录页"})
            },
            Trunk: map[string]*Rule{
                "登录页": {
                    ParseFunc: func(ctx *Context) {
                        ctx.AddQueue(&request.Request{
                            Url:      "http://xxx.xxx.xxx",
                            Rule:     "登录后",
                            Method:   "POST",
                            PostData: "[email protected]&password=123456&login_btn=login_btn&submit=login_btn",
                        })
                    },
                },
                "登录后": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "全部": ctx.GetText(),
                        })
                        ctx.AddQueue(&request.Request{
                            Url:    "http://accounts.xxx.xxx/member",
                            Rule:   "个人中心",
                            Header: http.Header{"Referer": []string{ctx.GetUrl()}},
                        })
                    },
                },
                "个人中心": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "全部": ctx.GetText(),
                        })
                    },
                },
            },
        },
    }.Register()
}

Proxy IPs

  • Write proxy IPs into the /pholcus_pkg/proxy.lib file, one IP per line, in the following format:
http://183.141.168.95:3128
https://60.13.146.92:8088
http://59.59.4.22:8090
https://180.119.78.78:8090
https://222.178.56.73:8118
http://115.228.57.254:3128
http://49.84.106.160:9000
  • Enable them by selecting a proxy IP rotation frequency in the UI, or by setting the -a_proxyminute flag on the command line

  • *Note:* To use the proxy IP feature on Mac, you must run with root privileges; otherwise the usable proxies cannot be verified via ping!

FAQ

In the request queue, are duplicate URLs deduplicated automatically?

URLs are deduplicated by default, but setting Request.Reloadable=true bypasses the check for a given request.
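
For illustration, a minimal sketch of re-enqueuing a URL with the deduplication check bypassed inside a rule's ParseFunc (the URL and rule name are placeholders):

ctx.AddQueue(&request.Request{
    Url:        "http://xxx.xxx.xxx/list", // placeholder URL
    Rule:       "列表页",                  // hypothetical rule name
    Reloadable: true,                      // skip the duplicate-URL history check
})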

If the content of the page behind a URL is updated, does the framework have a mechanism to detect it?

The framework cannot directly detect page-content updates, but users can implement their own check in their rules.
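
One possible user-level sketch: fingerprint the page text in ParseFunc and compare it with a hash persisted from the previous run (loadLastHash/saveHash are hypothetical helpers, and crypto/md5 and encoding/hex are assumed to be imported):

ParseFunc: func(ctx *Context) {
    body := ctx.GetText()
    sum := md5.Sum([]byte(body))
    hash := hex.EncodeToString(sum[:])
    if hash == loadLastHash(ctx.GetUrl()) {
        return // content unchanged since the last crawl; skip output
    }
    saveHash(ctx.GetUrl(), hash) // hypothetical persistence helper
    ctx.Output(map[string]interface{}{"全部": body})
},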

Is request success judged by the HTTP response status code?

No. It is judged by whether the server returned a response stream at all, so a 404 page also counts as a success.

What is the retry mechanism after a request fails?

After a URL has been tried the specified number of times and still fails to download, the request is appended to a special defer-style queue.  
Once the current task finishes normally, that queue is automatically fed back into the download queue and downloaded again. Requests that still fail are saved to the failure history.  
The next time this spider rule runs, you can opt to inherit the historical failure records, and those failed requests are again placed in the defer-style queue... (and the cycle repeats)
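
The per-request attempt count can be set on the request itself; a sketch in the style of the static example above (the TryTimes and RetryPause field names follow pholcus's request package; the values are illustrative and RetryPause is assumed to be in milliseconds):

ctx.AddQueue(&request.Request{
    Url:        "http://xxx.xxx.xxx",
    Rule:       "登录页",
    TryTimes:   3,    // download attempts before the request is deferred
    RetryPause: 2000, // assumed pause between attempts, in milliseconds
})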
Comments
  • Error when running

    ./pholcus.go:44: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.HOST
    ./pholcus.go:44: cannot assign to config.MYSQL_OUTPUT.HOST
    ./pholcus.go:46: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.DB
    ./pholcus.go:46: cannot assign to config.MYSQL_OUTPUT.DB
    ./pholcus.go:48: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.USER
    ./pholcus.go:48: cannot assign to config.MYSQL_OUTPUT.USER
    ./pholcus.go:50: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.PASSWORD
    ./pholcus.go:50: cannot assign to config.MYSQL_OUTPUT.PASSWORD
    ./pholcus.go:52: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.MAX_CONNS
    ./pholcus.go:52: cannot assign to config.MYSQL_OUTPUT.MAX_CONNS
    ./pholcus.go:52: too many errors
    

    I commented out everything related to MGO_OUTPUT and filled in the password; nothing else was changed. The pholcus database has already been created.

    go1.5.1

  • Running go get -u -v github.com/henrylee2cn/pholcus fails

    github.com/henrylee2cn/pholcus/app/distribute

    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:6:2: imported and not used: "github.com/henrylee2cn/teleport" as tp
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:10:31: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:11:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:25:48: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:27:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:33:42: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:7:2: imported and not used: "github.com/henrylee2cn/teleport" as tp
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:11:30: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:12:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:23:47: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:12:9: too many errors

  • Can regexp.Compile("\\<[\\S\\s]+?\\>") be replaced with regexp.Compile("<[^>]+>")?

    In regexp.Compile("\\<[\\S\\s]+?\\>"), \S and \s together cover all non-whitespace and whitespace characters, and [] is a character class, so a class containing both matches every character. '<' and '>' do not need escaping either, so "\<[\S\s]+?\>" => "<.+?>" => "<[^>]+>".
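
    A quick runnable check of the claimed equivalence on a sample string (a sketch; both patterns stop at the first '>'):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        lazy := regexp.MustCompile("\\<[\\S\\s]+?\\>")
        class := regexp.MustCompile("<[^>]+>")
        sample := "<p>hello <b>world</b></p>"
        fmt.Println(lazy.FindAllString(sample, -1))  // [<p> <b> </b> </p>]
        fmt.Println(class.FindAllString(sample, -1)) // same matches
    }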

  • Error when running go get github.com/henrylee2cn/pholcus

    github.com/henrylee2cn/pholcus

    C:\Go\pkg\tool\windows_amd64\link.exe: running gcc failed: exit status 1
    c:/mingw/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.8.2/../../../../x86_64-w64-mingw32/bin/ld.exe: i386 architecture of input file `C:\Users\henry\AppData\Local\Temp\go-link-908393803\000000.o' is incompatible with i386:x86-64 output
    collect2.exe: error: ld returned 1 exit status

    This is the error output; what kind of error is this?

  • Build error on an ARM processor

    Running go run example_main.go:

    app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
    app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString
    

    Could you help me look at why? The environment is a Raspberry Pi ARM processor.
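
    For reference, the mismatch arises because the syscall.Utsname fields are [65]uint8 on linux/arm but [65]int8 on linux/amd64, so a charsToString written for one does not compile for the other. A sketch of a uint8 variant (charsToStringU is a hypothetical helper):

    // charsToStringU converts a NUL-terminated [65]uint8 field
    // (syscall.Utsname on linux/arm) into a Go string.
    func charsToStringU(ca [65]uint8) string {
        n := 0
        for n < len(ca) && ca[n] != 0 {
            n++
        }
        return string(ca[:n])
    }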

  • Reloadable: the condition for rejecting repeated downloads is insufficient

    When deciding whether Reloadable permits a repeated download, the following check is made:

    func (self *Matrix) Push(req *request.Request) {
    	...
    	// a request that must not be downloaded repeatedly
    	if !req.IsReloadable() {
    		// bail out if a success record already exists
    		if self.hasHistory(req.Unique()) {
    			return
    		}
    		// add to the temporary history
    		self.insertTempHistory(req.Unique())
    	}
    	...
    }
    

    In practice this relies on func (self *Request) Unique() string to decide whether two requests are the same:

    // the request's unique identifier
    func (self *Request) Unique() string {
    	if self.unique == "" {
    		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))
    		self.unique = hex.EncodeToString(block[:])
    	}
    	return self.unique
    }
    

    If a POST request sets PostData, identical requests cannot be told apart correctly:

    POST /somewhere
    
    page=1&keyword=XXX
    

    Expected behavior:

    // the request's unique identifier
    func (self *Request) Unique() string {
    	if self.unique == "" {
    		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method + self.PostData))
    		self.unique = hex.EncodeToString(block[:])
    	}
    	return self.unique
    }
    

    Adjusting this logic would, however, have a significant impact on already-stored data.

  • Kafka Error

    [E] kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
    2017/08/07 14:25:38 [E] circuit breaker is open
    2017/08/07 14:25:38 [E] circuit breaker is open
    

    Hi there, I got this error when using the kafka output.

  • i18n support

    Thank you for this fantastic project, @henrylee2cn !

    It would be great if you would introduce means of i18n, so that this fantastic software could be translated without the risk of getting a codebase that diverges from your original (your project already has too many forks which fell behind your development without having contributed any code).

    go-i18n seems to be used most for i18n in Go.

    At a glance, there seem to be ~200 strings which could be considered for translation.


    https://github.com/henrylee2cn/pholcus/issues/80 https://github.com/henrylee2cn/pholcus/issues/77
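
    For a sense of what that could look like, a minimal sketch with nicksnyder/go-i18n v2 (the message ID and strings are illustrative):

    package main

    import (
        "fmt"

        "github.com/nicksnyder/go-i18n/v2/i18n"
        "golang.org/x/text/language"
    )

    func main() {
        bundle := i18n.NewBundle(language.English)
        // Register one of the ~200 UI strings under a message ID.
        bundle.AddMessages(language.English,
            &i18n.Message{ID: "StartCrawl", Other: "Start crawling"})
        loc := i18n.NewLocalizer(bundle, "en")
        msg, _ := loc.Localize(&i18n.LocalizeConfig{MessageID: "StartCrawl"})
        fmt.Println(msg) // "Start crawling"
    }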

  • Problems managing dependencies with go mod?

    1. go mod init reports no errors
    2. go build then reports errors:
    # github.com/henrylee2cn/pholcus/app/distribute
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:6:2: imported and not used: "github.com/henrylee2cn/teleport" as tp
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:10:31: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:11:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:25:48: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:27:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/master_api.go:33:42: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:7:2: imported and not used: "github.com/henrylee2cn/teleport" as tp
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:11:30: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:12:9: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:23:47: undefined: teleport
    ../../go/pkg/mod/github.com/henrylee2cn/[email protected]/app/distribute/slave_api.go:12:9: too many errors
    
  • Dynamic rule parse error: is wrapping JS in XML a problem? I suggest using plain JS files instead

    package main

    import (
        "encoding/xml"
        "fmt"
    )

    func main() {
        type Spider struct {
            Script string `xml:"Script"`
        }
        result := Spider{Script: "none"}
        data := `
            <Spider>
                <Script>
                1 < 2
                </Script>
            </Spider>
        `
        err := xml.Unmarshal([]byte(data), &result)
        if err != nil {
            fmt.Printf("error: %v\n", err)
            return
        }
        fmt.Printf("Script: %v\n", result.Script)
    }
    

    If the JS code inside a Script element contains a '<' character, xml.Unmarshal fails to parse it; '>' is fine; I have not tested other characters.

    Personally, I think wrapping JS in XML is not very friendly; I suggest using plain JS files instead.
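
    For what it's worth, standard XML has an escape hatch here: wrapping the script body in a CDATA section makes '<' literal, so the snippet above parses. A sketch of the changed data (same struct and xml.Unmarshal call as above):

    data := `
        <Spider>
            <Script><![CDATA[
            1 < 2
            ]]></Script>
        </Spider>
    `
    // xml.Unmarshal now succeeds, and result.Script contains the raw "1 < 2" text.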

  • How do you solve the problem of dynamic JavaScript files?

    Hmm... as I see it, you use PhantomJS to solve this problem? But you do not recommend doing that, so is there any solution if I only use the default Golang client?

  • fix(sec): upgrade github.com/tidwall/match to 1.0.3

    What happened?

    There is 1 security vulnerability found in github.com/tidwall/match v1.0.1

    What did I do?

    Upgrade github.com/tidwall/match from v1.0.1 to 1.0.3 to fix the vulnerability

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS

  • Error running the example demo; the teleport package seems to have been renamed, and its old methods are gone

    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\master_api.go:10:34: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\master_api.go:11:9: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\master_api.go:25:48: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\master_api.go:27:9: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\master_api.go:33:42: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\slave_api.go:7:2: imported and not used: "github.com/henrylee2cn/teleport" as tp
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\slave_api.go:11:30: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\slave_api.go:12:9: undefined: teleport
    C:\Users\admin\go\pkg\mod\github.com\henrylee2cn\[email protected]\app\distribute\slave_api.go:23:47: undefined: teleport

  • Kafka:kafka: invalid configuration

    /data/project/bin/pholcus -_ui=cmd -a_mode=0 -c_spider=2 -a_outtype=kafka -a_thread=10 -a_dockercap=10 -a_pause=300 -a_proxyminute=0 -a_success=true -a_failure=true

    2021/11/23 15:53:17 * 读取代理IP: 7 条
    2021/11/23 15:53:17 * 正在筛选在线的代理IP……
    Pholcus幽灵蛛数据采集_v1.3.4 (by henrylee2cn)

    2021/11/23 15:53:17 [I] !!当前运行模式为:[ 单机 ] 模式!!
    2021/11/23 15:53:17 [E] Kafka:kafka: invalid configuration (Producer.Return.Successes must be true to be used in a SyncProducer)

    2021/11/23 15:53:17 [I] * 不使用代理IP

    2021/11/23 15:53:17 [I] * 执行任务总数(任务数[*自定义配置数])为 1 个

    2021/11/23 15:53:17 [I] * 采集引擎池容量为 1

    2021/11/23 15:53:17 [I] * 并发协程最多 10 个

    2021/11/23 15:53:17 [I] * 默认随机停顿 150~600 毫秒

    2021/11/23 15:53:17 [P] * —— 开始抓取,请耐心等候 ——
    2021/11/23 15:53:17 [I] ***********************************************************************************************************************************
    2021/11/23 15:53:20 [I] * Success: http://www.inderscience.com/info/inarticletoc.php?jcode=ijguc&year=2016&vol=7&issue=1
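
    For context, that error is raised by the Kafka client library sarama: a SyncProducer refuses to start unless Producer.Return.Successes is true. A minimal sketch of the required producer configuration (the broker address is a placeholder):

    package main

    import "github.com/Shopify/sarama"

    func main() {
        cfg := sarama.NewConfig()
        // Required by sarama's SyncProducer; without this, NewSyncProducer
        // returns the "invalid configuration" error seen in the log above.
        cfg.Producer.Return.Successes = true
        producer, err := sarama.NewSyncProducer([]string{"127.0.0.1:9092"}, cfg)
        if err != nil {
            panic(err)
        }
        defer producer.Close()
    }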

  • Windows build error

    E:\github_source\spider>go build

    github.com/Mirror-l/spider

    D:\applications\go\pkg\tool\windows_amd64\link.exe: running gcc failed: exit status 1
    D:/applications/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.1.0/../../../../x86_64-w64-mingw32/bin/ld.exe: i386 architecture of input file `C:\Users\TANGZU~1\AppData\Local\Temp\go-link-952364683\000000.o' is incompatible with i386:x86-64 output
    collect2.exe: error: ld returned 1 exit status
