:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

About

Creeper is a next-generation crawler that fetches web pages driven by creeper scripts. As a cross-platform, embeddable crawler, you can use it for your news app, subscription program, etc.

Warning: At present this project is still in early-stage development; please do not use it in production environments.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	// Parse the creeper script that describes what to crawl.
	c := creeper.Open("./hacker_news.crs")
	// Iterate over every item matched by the "news" array node.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression for storing an (im)mutable string. Most of the time, it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need a town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed here. That is because it is a special parameter.

An expression like name="something" in a town definition line means that the parameter name has the default value "something".

Incidentally, @page is a parameter that is automatically incremented when the current page has no more content.
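
For instance, a town can combine the special @page parameter with an ordinary defaulted parameter. The following is only an illustrative sketch; the URL, the query parameter, and the selectors are hypothetical:

search(@page=1, query="news") = "https://example.com/search?q={query}&p={@page}"

results[]: search(query="golang") -> $(".result")
    title: $(".result-title").text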

Node

Nodes form a tree structure that represents the data you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

Like YAML, nodes distinguish hierarchy by indentation.

Node Name

Every node has a name. title is a field name, representing a general string value. news[] is an array name, representing a parent structure with multiple sub-items.

Page

Page indicates where the field data is fetched from. It can be a town expression or a field reference.

A field reference is an advanced usage of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right: page -> fun.
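
As a rough sketch of a field reference (the selectors and field names below are hypothetical, and the semantics are assumed to follow ./eh.crs): a field that captures a URL can itself serve as the page of another field:

list[]: page -> $(".item")
    detail_url: $("a.detail").href
    body: detail_url -> $(".article").text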

Fun

A fun represents a data-processing step.

These are all of the supported funs:


Name       Parameters                         Description
$          (selector: string)                 relative CSS selector (selects from the parent node)
$root      (selector: string)                 absolute CSS selector (selects from the document body)
html                                          inner HTML
text                                          inner text
outerHTML                                     outer HTML
attr       (attr: string)                     attribute value
style                                         style attribute value
href                                          href attribute value
src                                           src attribute value
class                                         class attribute value
id                                            id attribute value
calc       (prec: int)                        calculates an arithmetic expression
match      (regexp: string)                   matches the first sub-string via a regular expression
expand     (regexp: string, target: string)   expands matched strings into the target string
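
Funs are chained after a selector with ., as in the Hello World script. Below is a hedged sketch combining attr and match; the selectors are illustrative, and it assumes longer chains behave the same way as the two-step chains shown earlier:

news[]: page -> $("tr.athing")
    id: $(".title a.storylink").attr("id")
    rank: $("span.rank").text.match("[0-9]+")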

Author

Plutonist

impl.moe · GitHub @wspl

Comments
  • note without css selector

    gallery(@page=0) = "https://e-hentai.org/g/1034547/27cc8cb432/?p={@page}"

    tags[]: gallery -> $("div#taglist table tr td:eq(1) div a")
        name: .html

    The name field will always return the HTML of the first element selected by $("div#taglist table tr td:eq(1) div a").

  • Problem since new commits?

    Hi,

    I just copy-pasted your example code (hacker_news). Yesterday it worked. Today, with the new sources, it doesn't work anymore :(

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x45cdb6]
    
    goroutine 1 [running]:
    panic(0x65db60, 0xc42000c190)
            /opt/go/src/runtime/panic.go:500 +0x1a1
    github.com/wspl/creeper.(*Node).Value(0x0, 0x0, 0xc42001acd0, 0x0, 0xc4200340b8)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/node.go:118 +0x26
    github.com/wspl/creeper.(*Creeper).Each(0xc42001acd0, 0x6cc428)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/creeper.go:74 +0x8b
    main.main()
            /home/ubuntu/workspace/main.go:12 +0x73
    exit status 2
    

    (A simple println alone works :) )

  • go get

    go get github.com/wspl/creeper

    github.com/PuerkitoBio/goquery

    fatal error: unexpected signal during runtime execution
    [signal 0xb code=0x1 addr=0x1880e6d3a41e pc=0xf0eb]

    runtime stack:
    runtime.throw(0x4971c0, 0x2a)
            /usr/local/go/src/runtime/panic.go:547 +0x90
    runtime.sigpanic()
            /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
    runtime.unlock(0x982540)
            /usr/local/go/src/runtime/lock_sema.go:107 +0x14b
    runtime.(*mheap).alloc_m(0x982540, 0x1, 0x10000000010, 0xeed928)
            /usr/local/go/src/runtime/mheap.go:492 +0x314
    runtime.(*mheap).alloc.func1()
            /usr/local/go/src/runtime/mheap.go:502 +0x41
    runtime.systemstack(0xc82047fe58)
            /usr/local/go/src/runtime/asm_amd64.s:307 +0xab
    runtime.(*mheap).alloc(0x982540, 0x1, 0x10000000010, 0xed8f)
            /usr/local/go/src/runtime/mheap.go:503 +0x63
    runtime.(*mcentral).grow(0x983f10, 0x0)
            /usr/local/go/src/runtime/mcentral.go:209 +0x93
    runtime.(*mcentral).cacheSpan(0x983f10, 0xeed928)
            /usr/local/go/src/runtime/mcentral.go:89 +0x47d
    runtime.(*mcache).refill(0xaf4000, 0x10, 0xeed928)
            /usr/local/go/src/runtime/mcache.go:119 +0xcc
    runtime.mallocgc.func2()
            /usr/local/go/src/runtime/malloc.go:642 +0x2b
    runtime.systemstack(0xc820025500)
            /usr/local/go/src/runtime/asm_amd64.s:291 +0x79
    runtime.mstart()
            /usr/local/go/src/runtime/proc.go:1051

    goroutine 1 [running]:
    runtime.systemstack_switch()
            /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc821c79140 sp=0xc821c79138
    runtime.mallocgc(0xf0, 0x438dc0, 0x0, 0x438dc0)
            /usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc821c79218 sp=0xc821c79140

  • Feature: Next Page Node - functional node for directing next page

    page = "http://example.com/info?page=1"
    demo[]: page -> $(".example")
        text: $(".title").html
        @next: $("a.next").href
    

    I am thinking about another method for page directing: simulating a user clicking the next-page link. We can add a @next node to indicate the next page's link. The page director would then switch to the next page automatically when the current page has no more content.

    New grammar feature - functional nodes: nodes for assisting crawling, whose names start with @. They are private rather than readable nodes.

  • how to parse JSON structures

    We often need both an HTML parser and a JSON parser to crawl complex pages. I found that pattern files support only the HTML parser; how could I use this framework to parse JSON structures, or extend its functionality myself? Thanks.

  • @next node function implementation

    There are some hindrances in implementing the functional part of @next; they have stumped me for a long time:

    • InitSelector's loop call
    • Waiting until the page cycle ends, and blocking the overall cycle when there is no next page