Distributed web crawler admin platform for spider management, regardless of language or framework.

Crawlab

Chinese | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, and PHP, and various web crawler frameworks including Scrapy, Puppeteer, and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (for inspecting the internals)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Quick Start

Please open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d

Next, you can look into docker-compose.yml (which contains detailed config params) and the Documentation (Chinese) for further information.

Run

Docker

Please use docker-compose for one-click startup. By doing so, you don't even have to configure the MongoDB and Redis databases yourself. Create a file named docker-compose.yml and enter the code below.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Then execute the command below, and the Crawlab Master Node, MongoDB, and Redis will start up. Open your browser and navigate to http://localhost:8080 to see the UI.

docker-compose up

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Node Network

Spider List

Spider Overview

Spider Analytics

Spider File Edit

Task Log

Task Results

Cron Job

Language Installation

Dependency Installation

Notifications

Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly used for node communication and data storage.

The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it starts executing the crawling task and stores the results in MongoDB. The architecture is much more concise than in versions before v0.3.0: the unnecessary Flower module, which provided node monitoring, has been removed, and that work is now done by Redis.

Master Node

The Master Node is the core of the Crawlab architecture. It is the central control system of Crawlab.

The Master Node provides the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (the Master Node can also act as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
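For illustration only, here is a minimal Python sketch of master/worker messaging over Redis PubSub; the channel naming, message format, and use of redis-py are assumptions made for this example, not Crawlab's actual implementation.

# Minimal sketch of master/worker messaging over Redis PubSub (illustrative only;
# channel names and message format are assumptions, not Crawlab's implementation).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Worker side: subscribe to this node's channel and wait for task messages.
def worker_loop(node_id):
    pubsub = r.pubsub()
    pubsub.subscribe("nodes:" + node_id)  # assumed channel naming
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        task = json.loads(message["data"])
        print("received task %s for spider %s" % (task["task_id"], task["spider"]))

# Master side: publish a task assignment to a specific worker's channel.
def assign_task(node_id, task_id, spider):
    payload = json.dumps({"task_id": task_id, "spider": spider})
    r.publish("nodes:" + node_id, payload)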

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
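As a rough sketch of how a packaged spider can be stored and fetched through GridFS with pymongo (the database name and file names below are made up for illustration and are not Crawlab's actual schema):

# Store and retrieve a zipped spider via MongoDB GridFS (illustrative sketch;
# database and file names are assumptions, not Crawlab's schema).
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["crawlab_example"])

# Master side: store the zipped spider so workers can fetch it later.
with open("my_spider.zip", "rb") as f:
    file_id = fs.put(f, filename="my_spider.zip")

# Worker side: download the spider package by its file id.
data = fs.get(file_id).read()
with open("/tmp/my_spider.zip", "wb") as f:
    f.write(data)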

Redis

Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, nodes execute HSET to store their info in a hash named nodes in Redis, and the Master Node identifies online nodes according to that hash.
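A rough Python sketch of that registration pattern is shown below; the fields stored for each node and the liveness check are assumptions for illustration, not Crawlab's actual logic.

# Node registration via a Redis hash named "nodes" (illustrative sketch;
# the stored fields and liveness check are assumptions).
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

# Worker side: periodically write this node's info into the "nodes" hash.
def register_node(node_id):
    info = {"hostname": node_id, "updated_at": time.time()}
    r.hset("nodes", node_id, json.dumps(info))

# Master side: read the hash and treat recently updated entries as online.
def list_online_nodes(max_age_seconds=60):
    online = []
    for node_id, raw in r.hgetall("nodes").items():
        info = json.loads(raw)
        if time.time() - info["updated_at"] < max_age_seconds:
            online.append(node_id.decode())
    return online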

Frontend

The frontend is an SPA based on Vue-Element-Admin. It reuses many Element-UI components to support the corresponding displays.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed crawlab-sdk using pip.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict variable) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process in the form of an environment variable named CRAWLAB_TASK_ID, so that the scraped data can be related to the task. In addition, Crawlab passes another environment variable, CRAWLAB_COLLECTION, as the name of the collection in which to store the result data.
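For example, a plain Python script (without the SDK) could read these two variables and write tagged results to MongoDB roughly as follows; the connection string, database name, and the task_id field are assumptions made for this sketch.

# Relate results to the current task using the environment variables described
# above (illustrative sketch; the MongoDB connection settings, database name,
# and task_id field are assumptions).
import os
from pymongo import MongoClient

task_id = os.environ.get("CRAWLAB_TASK_ID", "")
collection_name = os.environ.get("CRAWLAB_COLLECTION", "results")

client = MongoClient("mongodb://localhost:27017")
collection = client["crawlab_example"][collection_name]

def save_result(item):
    item["task_id"] = task_id  # tag the record with the task id
    collection.insert_one(item)

save_result({"name": "crawlab", "url": "https://example.com"})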

Comparison with Other Frameworks

There are existing spider management frameworks, so why use Crawlab?

The reason is that most of the existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to run spiders written in any language and any framework. It also has a beautiful frontend interface that makes spider management much easier.

Framework comparison (framework, technology, pros, cons):

  • Crawlab (Golang + Vue). Pros: not limited to Scrapy; works with all programming languages and frameworks; beautiful UI; native support for distributed spiders; supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, and more. Cons: does not yet support spider versioning.
  • ScrapydWeb (Python Flask + Vue). Pros: beautiful UI; built-in Scrapy log parser; stats and graphs for task execution; supports node management, cron jobs, mail notifications, and mobile; a full-featured spider management platform. Cons: does not support spiders other than Scrapy; limited performance because of the Python Flask backend.
  • Gerapy (Python Django + Vue). Pros: built by web crawler guru Germey Cui; simple installation and deployment; beautiful UI; supports node management, code editing, configurable crawl rules, and more. Cons: again, does not support spiders other than Scrapy; many bugs reported by users in v1.0; improvements expected in v2.0.
  • SpiderKeeper (Python Flask). Pros: open-source Scrapyhub; concise and simple UI; supports cron jobs. Cons: perhaps too simplified; does not support pagination, node management, or spiders other than Scrapy.

Contributors

Community & Sponsorship

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account (note "Crawlab") to join the discussion group. Or scan the Alipay QR code below to give us a reward to help upgrade our teamwork software or buy us a coffee.

Comments
  • Running spiders?!

    I can't run a spider either outside the docker master container or inside it. Inside, I get this error after typing "crawlab upload spider": Not authorized / Error logging in.

    Outside, I can only upload a zip file, but it's not connected to my script and doesn't return any data.

    Any help?!

  • Admin login fails through a VS Code forwarded port

    Bug description: I deployed Crawlab on an intranet server and opened it locally through the VS Code port-forwarding feature. When I enter admin as both the username and the password, login does not work. VS Code port configuration: image

    Login error: image

    Ruling out other factors: the remote host runs the scrapyd admin UI on port 6800, and VS Code can map the remote host's port 6800 to a local port. image

    Expected result: logging in as admin should work.

  • gRPC Client Cannot Connect to the master node!

    Bug

    When I tried building Crawlab with docker, the worker node could not connect to the master node.

    YML File

    version: '3.3'
    services:
      master: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_master
        environment:
          CRAWLAB_NODE_MASTER: "Y"
          CRAWLAB_MONGO_HOST: "mongo"
          CRAWLAB_GRPC_SERVER_ADDRESS: "0.0.0.0:9666"
          CRAWLAB_SERVER_HOST: "0.0.0.0"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
        volumes:
          - "./.crawlab/master:/root/.crawlab"
        ports:    
          - "8080:8080"
          - "9666:9666"
          - "8000:8000"
        depends_on:
          - mongo
    
      worker01: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_worker01
        environment:
          CRAWLAB_NODE_MASTER: "N"
          CRAWLAB_GRPC_ADDRESS: "MY_Public_IP_Address:9666"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
          CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
        volumes:
          - "./.crawlab/worker01:/root/.crawlab"
        depends_on:
          - master
    
      mongo:
        image: mongo:latest
        container_name: crawlab_example_mongo
        restart: always
    

    image

  • Help | Create operations get no response; not sure what is misconfigured

    I can access the admin platform, but none of the create operations complete. For example, creating a project returns messages like the following:

    crawlab_master   | [GIN] 2022/07/24 - 22:47:03 | 400 |    2.186658ms |       127.0.0.1 | PUT      "/projects"
    ......
    crawlab_master   | node error: not exists
    ......
    crawlab_master   | mongo: no documents in result
    ......
    

    The docker-compose.yml configuration is as follows:

    # master node
    version: '3.3'
    services:
      mongo:
        image: mongo
        container_name: mongo
        restart: always
        environment:
          MONGO_INITDB_ROOT_USERNAME: root  # mongo username
          MONGO_INITDB_ROOT_PASSWORD: 123456  # mongo password
        volumes:
          - "/opt/crawlab/mongo/data/db:/data/db"  # persist mongo data
        ports:
          - "27017:27017"  # expose the mongo port to the host


      mongo-express:
        image: mongo-express
        container_name: mongo-express
        restart: always
        depends_on: # dependency declaration, used here instead of the links field
          - mongo
        ports: # externally mapped port; with 27016 the container host would use http://localhost:27016, or ip:port from an external network
          # - "27016:8081"
          - "8081:8081"
        environment:
          ME_CONFIG_MONGODB_SERVER: mongo # service name, i.e. the name of the mongo container
          ME_CONFIG_MONGODB_PORT: 27017
          ME_CONFIG_BASICAUTH_USERNAME: admin # username for the login page
          ME_CONFIG_BASICAUTH_PASSWORD: 123456 # password for the login page
          ME_CONFIG_MONGODB_ADMINUSERNAME: root # mongo auth username
          ME_CONFIG_MONGODB_ADMINPASSWORD: 123456 # mongo auth password


      master:
        image: crawlabteam/crawlab
        container_name: crawlab_master
        restart: always
        environment:
          CRAWLAB_NODE_MASTER: Y  # Y: master node
          CRAWLAB_MONGO_HOST: mongo  # mongo host address; within the Docker Compose network, reference the service name directly
          CRAWLAB_MONGO_PORT: 27017  # mongo port
          CRAWLAB_MONGO_DB: crawlab  # mongo database
          CRAWLAB_MONGO_USERNAME: root  # mongo username
          CRAWLAB_MONGO_PASSWORD: '123456'  # mongo password
          CRAWLAB_MONGO_AUTHSOURCE: admin  # mongo auth source
        volumes:
          - "/opt/crawlab/master:/data"  # persist crawlab data
        ports:
          - "8080:8080"  # expose the api port
          - "9666:9666"  # expose the grpc port
        depends_on:
          - mongo
    
  • /tasks/{id}/error-log still returns empty even though error logs exist

    Bug description: no error message is shown, only an "empty result" error.

    Partial log output:

    2021-01-30 23:25:03 [scrapy.core.engine] INFO: Spider opened
    2021-01-30 23:25:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2021-01-30 23:25:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2021-01-30 23:25:05 [user_basic] ERROR [githubspider.user_basic] : Could not resolve to a node with the global id of ''
    2021-01-30 23:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-01-30 23:25:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1004,
    'downloader/request_count': 2,
    'downloader/request_method_count/GET': 1,
    'downloader/request_method_count/POST': 1,
    'downloader/response_bytes': 1878,
    'downloader/response_count': 2,
    'downloader/response_status_count/200': 2,
    'elapsed_time_seconds': 1.450998,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2021, 1, 30, 15, 25, 5, 286471),
    'log_count/ERROR': 1,
    
    GET http://localhost:8080/api/tasks/7f081549-28c4-4a7e-80e6-cdb7cf399534/error-log
    {"status":"ok","message":"success","data":null,"error":""}
    

    Screenshot

    yAuuwt.jpg

  • Seaweedfs / Gocolly integration

    Hi all,

    Hope you are all well !

    I was just wondering whether it is possible to integrate these two awesome tools with Crawlab; it would be great for storing millions of static objects and for scraping with golang. A friend and I already did that with https://github.com/lucmichalski/peaks-tires, but we lack horizontal scaling and a crawl management interface. That's why, and how, we found Crawlab.

    • https://github.com/gocolly/colly Elegant Scraper and Crawler Framework for Golang

    • https://github.com/chrislusf/seaweedfs SeaweedFS is a simple and highly scalable distributed file system, to store and serve billions of files fast! SeaweedFS implements an object store with O(1) disk seek, transparent cloud integration, and an optional Filer supporting POSIX, S3 API, AES256 encryption, Rack-Aware Erasure Coding for warm storage, FUSE mount, Hadoop compatible, WebDAV.

    Thanks for your insights and feedback on the topic.

    Cheers, X

  • The latest code on master cannot be started via docker

    Bug description: the latest code on the master branch does not start properly via docker.

    Steps to reproduce:

    1. Pull the latest code
    2. docker-compose up
    3. Error: 2021/08/10 17:44:51 error grpc client connect error: grpc error: client failed to start. reattempt in 51.7 seconds
    4. Check out 517ae21e13a57e0d9c074b162793aee689f99c0d
    5. docker-compose up
    6. Runs normally

    Expected result: docker should work normally.

    Screenshot: image

  • The toscrapy_books spider was run on all nodes and the task ran twice

    Steps to reproduce:

    Select all nodes when running the toscrapy_books spider; the task runs twice. image

    Expected behavior: the site should be crawled only once.

  • Installed with docker-compose, the master dies about once a week and needs a manual restart

    This is the log file output:

    crawlab-master | 2019/11/10 06:00:00 error handle task error:open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | 2019/11/10 06:00:00 error [Worker 3] open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory

  • Scrapy directory structure issue

    After uploading a spider, if it does not strictly follow the scrapy project structure, the corresponding files are not recognized under Spiders - Spider Detail - Scrapy Settings, and the console also reports errors.

    For example, scrapy keeps settings, pipelines, and middlewares in the same folder, whereas my spider splits pipelines and middlewares into separate folders.

  • "TypeError: res is undefined" when i tried to sign in

    I did everything according to the instructions from GitHub. When I try to login, I get an error "TypeError: res is undefined". Please, help me to resolve that problem Screenshot from 2021-12-10 22-41-58 Screenshot from 2021-12-10 22-41-41 Screenshot from 2021-12-10 22-44-24

  • v0.6: the task cancel button does not actually stop the task

    Bug description: when a running task is canceled by clicking the cancel button, the page shows it as canceled, but the program keeps running in the background. Steps to reproduce:

    1. Run a scrapy task
    2. Note the xxx.py file in the executed command
    3. Click the cancel button on the task to cancel it
    4. In a linux terminal, run ps -ef | grep xxx.py; the process can still be found running

    Expected result: clicking the cancel button stops the task process.

    Screenshot: image

  • Integrate pull request preview environments

    I would like to support Crawlab by implementing Uffizzi preview environments. Disclaimer: I work on Uffizzi.

    Uffizzi is an Open Source full-stack previews engine, and our platform is available completely free for Crawlab (and all open source projects). This will provide maintainers with preview environments of every PR in the cloud, which enables faster iterations and reduces time to merge. You can see the open source repos which are currently using Uffizzi over here.

    Uffizzi is purpose-built for the task of previewing PRs and it integrates with your workflow to deploy preview environments in the background without any manual steps for maintainers or contributors.

    We can go ahead and create an Initial PoC for you right away if you think there is value in this proposal.

    TODO:

    • [ ] Initial PoC

    cc @waveywaves

  • Is there any way to import scrapy spiders into crawlab automatically?

    Hi guys, I have watched the author's video; he says you can host more than 100 spiders on Crawlab, but how can we import them automatically? It's very hard to import them one by one.

  • v0.6.0-2: modifying a scheduled task's execution time does not take effect

    Bug description: after modifying a scheduled task's execution time, the schedule is updated successfully, but the task still runs on the schedule in effect before the change.

    Steps to reproduce (https://demo-pro.crawlab.cn/#/schedules/63931b03f07dbffe09080c5a/overview):

    1. Create a scheduled task with its execution time set to * * * * *
    2. Change the schedule to 10 10 1 * * and save successfully
    3. The task still runs once per minute

    Expected result: changes to a scheduled task's execution time take effect immediately.

    Screenshots: image
