Distributed web crawler admin platform for spider management, regardless of language or framework.

Crawlab

Chinese | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, and PHP, and various web crawler frameworks including Scrapy, Puppeteer, and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (for inspecting the internals)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Quick Start

Please open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d

Next, you can look into docker-compose.yml (which contains detailed config params) and the Documentation (Chinese) for further information.

Run

Docker

Please use docker-compose for one-click startup. By doing so, you don't even have to configure the MongoDB and Redis databases yourself. Create a file named docker-compose.yml and enter the code below.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Then execute the command below, and the Crawlab Master Node, MongoDB, and Redis will start up. Open your browser and navigate to http://localhost:8080 to see the UI.

docker-compose up

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Node Network

Spider List

Spider Overview

Spider Analytics

Spider File Edit

Task Log

Task Results

Cron Job

Language Installation

Dependency Installation

Notifications

Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly used for node communication and data storage.

The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it starts executing the crawling task and stores the results in MongoDB. The architecture is much more concise than in versions before v0.3.0: the unnecessary Flower module, which provided node monitoring, has been removed, and that work is now done by Redis.

Master Node

The Master Node is the core of the Crawlab architecture. It is the central control system of Crawlab.

The Master Node provides the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (the Master Node can also act as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
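For illustration only, here is a minimal Python sketch of master/worker messaging over Redis PubSub; the channel naming, message format, and use of redis-py are assumptions made for this example, not Crawlab's actual implementation.

# Minimal sketch of master/worker messaging over Redis PubSub (illustrative only;
# channel names and message format are assumptions, not Crawlab's implementation).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Worker side: subscribe to this node's channel and wait for task messages.
def worker_loop(node_id):
    pubsub = r.pubsub()
    pubsub.subscribe("nodes:" + node_id)  # assumed channel naming
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        task = json.loads(message["data"])
        print("received task %s for spider %s" % (task["task_id"], task["spider"]))

# Master side: publish a task assignment to a specific worker's channel.
def assign_task(node_id, task_id, spider):
    payload = json.dumps({"task_id": task_id, "spider": spider})
    r.publish("nodes:" + node_id, payload)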

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
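As a rough sketch of how a packaged spider can be stored and fetched through GridFS with pymongo (the database name and file names below are made up for illustration and are not Crawlab's actual schema):

# Store and retrieve a zipped spider via MongoDB GridFS (illustrative sketch;
# database and file names are assumptions, not Crawlab's schema).
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["crawlab_example"])

# Master side: store the zipped spider so workers can fetch it later.
with open("my_spider.zip", "rb") as f:
    file_id = fs.put(f, filename="my_spider.zip")

# Worker side: download the spider package by its file id.
data = fs.get(file_id).read()
with open("/tmp/my_spider.zip", "wb") as f:
    f.write(data)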

Redis

Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, nodes execute HSET to store their info in a hash named nodes in Redis, and the Master Node identifies online nodes according to that hash.
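A rough Python sketch of that registration pattern is shown below; the fields stored for each node and the liveness check are assumptions for illustration, not Crawlab's actual logic.

# Node registration via a Redis hash named "nodes" (illustrative sketch;
# the stored fields and liveness check are assumptions).
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

# Worker side: periodically write this node's info into the "nodes" hash.
def register_node(node_id):
    info = {"hostname": node_id, "updated_at": time.time()}
    r.hset("nodes", node_id, json.dumps(info))

# Master side: read the hash and treat recently updated entries as online.
def list_online_nodes(max_age_seconds=60):
    online = []
    for node_id, raw in r.hgetall("nodes").items():
        info = json.loads(raw)
        if time.time() - info["updated_at"] < max_age_seconds:
            online.append(node_id.decode())
    return online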

Frontend

The frontend is an SPA based on Vue-Element-Admin. It reuses many Element-UI components to support the corresponding displays.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed crawlab-sdk using pip.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict variable) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process in the form of an environment variable named CRAWLAB_TASK_ID, so that the scraped data can be related to the task. In addition, Crawlab passes another environment variable, CRAWLAB_COLLECTION, as the name of the collection in which to store the result data.
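For example, a plain Python script (without the SDK) could read these two variables and write tagged results to MongoDB roughly as follows; the connection string, database name, and the task_id field are assumptions made for this sketch.

# Relate results to the current task using the environment variables described
# above (illustrative sketch; the MongoDB connection settings, database name,
# and task_id field are assumptions).
import os
from pymongo import MongoClient

task_id = os.environ.get("CRAWLAB_TASK_ID", "")
collection_name = os.environ.get("CRAWLAB_COLLECTION", "results")

client = MongoClient("mongodb://localhost:27017")
collection = client["crawlab_example"][collection_name]

def save_result(item):
    item["task_id"] = task_id  # tag the record with the task id
    collection.insert_one(item)

save_result({"name": "crawlab", "url": "https://example.com"})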

Comparison with Other Frameworks

There are existing spider management frameworks, so why use Crawlab?

The reason is that most of the existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to run spiders written in any language and any framework. It also has a beautiful frontend interface that makes spider management much easier.

Framework comparison (framework, technology, pros, cons):

  • Crawlab (Golang + Vue). Pros: not limited to Scrapy; works with all programming languages and frameworks; beautiful UI; native support for distributed spiders; supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, and more. Cons: does not yet support spider versioning.
  • ScrapydWeb (Python Flask + Vue). Pros: beautiful UI; built-in Scrapy log parser; stats and graphs for task execution; supports node management, cron jobs, mail notifications, and mobile; a full-featured spider management platform. Cons: does not support spiders other than Scrapy; limited performance because of the Python Flask backend.
  • Gerapy (Python Django + Vue). Pros: built by web crawler guru Germey Cui; simple installation and deployment; beautiful UI; supports node management, code editing, configurable crawl rules, and more. Cons: again, does not support spiders other than Scrapy; many bugs reported by users in v1.0; improvements expected in v2.0.
  • SpiderKeeper (Python Flask). Pros: open-source Scrapyhub; concise and simple UI; supports cron jobs. Cons: perhaps too simplified; does not support pagination, node management, or spiders other than Scrapy.

Contributors

Community & Sponsorship

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account (note "Crawlab") to join the discussion group. Or scan the Alipay QR code below to give us a reward to help upgrade our teamwork software or buy us a coffee.

Comments
  • Running spiders?!

    I can't run a spider either outside the docker master container or inside it. Inside, I get this error after typing "crawlab upload spider": Not authorized / Error logging in.

    Outside, I can only upload a zip file, but it's not connected to my script and doesn't return any data.

    Any help?!

  • Admin login fails through a VS Code forwarded port

    Bug description: I deployed Crawlab on an intranet server and opened it locally through the VS Code port-forwarding feature. When I enter admin as both the username and the password, login does not work. VS Code port configuration: image

    Login error: image

    Ruling out other factors: the remote host runs the scrapyd admin UI on port 6800, and VS Code can map the remote host's port 6800 to a local port. image

    Expected result: logging in as admin should work.

  • gRPC Client Cannot Connect to the master node!

    Bug

    When I tried building Crawlab with docker, the worker node could not connect to the master node.

    YML File

    version: '3.3'
    services:
      master: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_master
        environment:
          CRAWLAB_NODE_MASTER: "Y"
          CRAWLAB_MONGO_HOST: "mongo"
          CRAWLAB_GRPC_SERVER_ADDRESS: "0.0.0.0:9666"
          CRAWLAB_SERVER_HOST: "0.0.0.0"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
        volumes:
          - "./.crawlab/master:/root/.crawlab"
        ports:    
          - "8080:8080"
          - "9666:9666"
          - "8000:8000"
        depends_on:
          - mongo
    
      worker01: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_worker01
        environment:
          CRAWLAB_NODE_MASTER: "N"
          CRAWLAB_GRPC_ADDRESS: "MY_Public_IP_Address:9666"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
          CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
        volumes:
          - "./.crawlab/worker01:/root/.crawlab"
        depends_on:
          - master
    
      mongo:
        image: mongo:latest
        container_name: crawlab_example_mongo
        restart: always
    

    image

  • Help | Create operations get no response; not sure what is misconfigured

    I can access the admin platform, but none of the create operations complete. For example, creating a project returns messages like the following:

    crawlab_master   | [GIN] 2022/07/24 - 22:47:03 | 400 |    2.186658ms |       127.0.0.1 | PUT      "/projects"
    ......
    crawlab_master   | node error: not exists
    ......
    crawlab_master   | mongo: no documents in result
    ......
    

    The docker-compose.yml configuration is as follows:

    # master node
    version: '3.3'
    services:
      mongo:
        image: mongo
        container_name: mongo
        restart: always
        environment:
          MONGO_INITDB_ROOT_USERNAME: root  # mongo username
          MONGO_INITDB_ROOT_PASSWORD: 123456  # mongo password
        volumes:
          - "/opt/crawlab/mongo/data/db:/data/db"  # persist mongo data
        ports:
          - "27017:27017"  # expose the mongo port to the host


      mongo-express:
        image: mongo-express
        container_name: mongo-express
        restart: always
        depends_on: # dependency declaration, used here instead of the links field
          - mongo
        ports: # externally mapped port; with 27016 the container host would use http://localhost:27016, or ip:port from an external network
          # - "27016:8081"
          - "8081:8081"
        environment:
          ME_CONFIG_MONGODB_SERVER: mongo # service name, i.e. the name of the mongo container
          ME_CONFIG_MONGODB_PORT: 27017
          ME_CONFIG_BASICAUTH_USERNAME: admin # username for the login page
          ME_CONFIG_BASICAUTH_PASSWORD: 123456 # password for the login page
          ME_CONFIG_MONGODB_ADMINUSERNAME: root # mongo auth username
          ME_CONFIG_MONGODB_ADMINPASSWORD: 123456 # mongo auth password


      master:
        image: crawlabteam/crawlab
        container_name: crawlab_master
        restart: always
        environment:
          CRAWLAB_NODE_MASTER: Y  # Y: master node
          CRAWLAB_MONGO_HOST: mongo  # mongo host address; within the Docker Compose network, reference the service name directly
          CRAWLAB_MONGO_PORT: 27017  # mongo port
          CRAWLAB_MONGO_DB: crawlab  # mongo database
          CRAWLAB_MONGO_USERNAME: root  # mongo username
          CRAWLAB_MONGO_PASSWORD: '123456'  # mongo password
          CRAWLAB_MONGO_AUTHSOURCE: admin  # mongo auth source
        volumes:
          - "/opt/crawlab/master:/data"  # persist crawlab data
        ports:
          - "8080:8080"  # expose the api port
          - "9666:9666"  # expose the grpc port
        depends_on:
          - mongo
    
  • /tasks/{id}/error-log still returns empty even though error logs exist

    Bug description: no error message is shown, only an "empty result" error.

    Partial log output:

    2021-01-30 23:25:03 [scrapy.core.engine] INFO: Spider opened
    2021-01-30 23:25:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2021-01-30 23:25:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2021-01-30 23:25:05 [user_basic] ERROR [githubspider.user_basic] : Could not resolve to a node with the global id of ''
    2021-01-30 23:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-01-30 23:25:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1004,
    'downloader/request_count': 2,
    'downloader/request_method_count/GET': 1,
    'downloader/request_method_count/POST': 1,
    'downloader/response_bytes': 1878,
    'downloader/response_count': 2,
    'downloader/response_status_count/200': 2,
    'elapsed_time_seconds': 1.450998,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2021, 1, 30, 15, 25, 5, 286471),
    'log_count/ERROR': 1,
    
    GET http://localhost:8080/api/tasks/7f081549-28c4-4a7e-80e6-cdb7cf399534/error-log
    {"status":"ok","message":"success","data":null,"error":""}
    

    Screenshot

    yAuuwt.jpg

  • Seaweedfs / Gocolly integration

    Hi all,

    Hope you are all well !

    I was just wondering whether it is possible to integrate these two awesome tools with Crawlab; it would be great for storing millions of static objects and for scraping with golang. A friend and I already did that with https://github.com/lucmichalski/peaks-tires, but we lack horizontal scaling and a crawl management interface. That's why, and how, we found Crawlab.

    • https://github.com/gocolly/colly Elegant Scraper and Crawler Framework for Golang

    • https://github.com/chrislusf/seaweedfs SeaweedFS is a simple and highly scalable distributed file system, to store and serve billions of files fast! SeaweedFS implements an object store with O(1) disk seek, transparent cloud integration, and an optional Filer supporting POSIX, S3 API, AES256 encryption, Rack-Aware Erasure Coding for warm storage, FUSE mount, Hadoop compatible, WebDAV.

    Thanks for your insights and feedback on the topic.

    Cheers, X

  • The latest code on master cannot be started via docker

    Bug description: the latest code on the master branch does not start properly via docker.

    Steps to reproduce:

    1. Pull the latest code
    2. docker-compose up
    3. Error: 2021/08/10 17:44:51 error grpc client connect error: grpc error: client failed to start. reattempt in 51.7 seconds
    4. Check out 517ae21e13a57e0d9c074b162793aee689f99c0d
    5. docker-compose up
    6. Runs normally

    Expected result: docker should work normally.

    Screenshot: image

  • The toscrapy_books spider was run on all nodes and the task ran twice

    Steps to reproduce:

    Select all nodes when running the toscrapy_books spider; the task runs twice. image

    Expected behavior: the site should be crawled only once.

  • Installed with docker-compose, the master dies about once a week and needs a manual restart

    This is the log file output:

    crawlab-master | 2019/11/10 06:00:00 error handle task error:open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | 2019/11/10 06:00:00 error [Worker 3] open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory

  • Scrapy directory structure issue

    After uploading a spider, if it does not strictly follow the scrapy project structure, the corresponding files are not recognized under Spiders - Spider Detail - Scrapy Settings, and the console also reports errors.

    For example, scrapy keeps settings, pipelines, and middlewares in the same folder, whereas my spider splits pipelines and middlewares into separate folders.

  • "TypeError: res is undefined" when i tried to sign in

    I did everything according to the instructions from GitHub. When I try to login, I get an error "TypeError: res is undefined". Please, help me to resolve that problem Screenshot from 2021-12-10 22-41-58 Screenshot from 2021-12-10 22-41-41 Screenshot from 2021-12-10 22-44-24

  • v0.6: the task cancel button does not actually stop the task

    Bug description: when a running task is canceled by clicking the cancel button, the page shows it as canceled, but the program keeps running in the background. Steps to reproduce:

    1. Run a scrapy task
    2. Note the xxx.py file in the executed command
    3. Click the cancel button on the task to cancel it
    4. In a linux terminal, run ps -ef | grep xxx.py; the process can still be found running

    Expected result: clicking the cancel button stops the task process.

    Screenshot: image

  • Integrate pull request preview environments

    I would like to support Crawlab by implementing Uffizzi preview environments. Disclaimer: I work on Uffizzi.

    Uffizzi is an Open Source full-stack previews engine, and our platform is available completely free for Crawlab (and all open source projects). This will provide maintainers with preview environments of every PR in the cloud, which enables faster iterations and reduces time to merge. You can see the open source repos which are currently using Uffizzi over here.

    Uffizzi is purpose-built for the task of previewing PRs and it integrates with your workflow to deploy preview environments in the background without any manual steps for maintainers or contributors.

    We can go ahead and create an Initial PoC for you right away if you think there is value in this proposal.

    TODO:

    • [ ] Initial PoC

    cc @waveywaves

  • Is there any way to import scrapy spiders into crawlab automatically?

    Hi guys, I have watched the author's video; he says you can host more than 100 spiders on Crawlab, but how can we import them automatically? It's very hard to import them one by one.

  • v0.6.0-2: modifying a scheduled task's execution time does not take effect

    Bug description: after modifying a scheduled task's execution time, the schedule is updated successfully, but the task still runs on the schedule in effect before the change.

    Steps to reproduce (https://demo-pro.crawlab.cn/#/schedules/63931b03f07dbffe09080c5a/overview):

    1. Create a scheduled task with its execution time set to * * * * *
    2. Change the schedule to 10 10 1 * * and save successfully
    3. The task still runs once per minute

    Expected result: changes to a scheduled task's execution time take effect immediately.

    Screenshots: image
