:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

About

Creeper is a next-generation crawler that fetches web pages driven by creeper scripts. As a cross-platform, embeddable crawler, you can use it for your news app, subscription program, etc.

Warning: At present this project is still in early-stage development; please do not use it in production environments.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	// Parse the creeper script that describes what to crawl.
	c := creeper.Open("./hacker_news.crs")
	// Iterate over every item matched by the "news" array node.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression for storing an (im)mutable string. Most of the time, it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need a town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed here. That is because it is a special parameter.

An expression like name="something" in a town definition line means that the parameter name has the default value "something".

Incidentally, @page is a parameter that is automatically incremented when the current page has no more content.
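
For instance, a town can combine the special @page parameter with an ordinary defaulted parameter. The following is only an illustrative sketch; the URL, the query parameter, and the selectors are hypothetical:

search(@page=1, query="news") = "https://example.com/search?q={query}&p={@page}"

results[]: search(query="golang") -> $(".result")
    title: $(".result-title").text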

Node

Nodes form a tree structure that represents the data you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

Like YAML, nodes distinguish hierarchy by indentation.

Node Name

Every node has a name. title is a field name, representing a general string value. news[] is an array name, representing a parent structure with multiple sub-items.

Page

Page indicates where the field data is fetched from. It can be a town expression or a field reference.

A field reference is an advanced usage of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right: page -> fun.
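
As a rough sketch of a field reference (the selectors and field names below are hypothetical, and the semantics are assumed to follow ./eh.crs): a field that captures a URL can itself serve as the page of another field:

list[]: page -> $(".item")
    detail_url: $("a.detail").href
    body: detail_url -> $(".article").text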

Fun

A fun represents a data-processing step.

These are all of the supported funs:


Name       Parameters                         Description
$          (selector: string)                 relative CSS selector (selects from the parent node)
$root      (selector: string)                 absolute CSS selector (selects from the document body)
html                                          inner HTML
text                                          inner text
outerHTML                                     outer HTML
attr       (attr: string)                     attribute value
style                                         style attribute value
href                                          href attribute value
src                                           src attribute value
class                                         class attribute value
id                                            id attribute value
calc       (prec: int)                        calculates an arithmetic expression
match      (regexp: string)                   matches the first sub-string via a regular expression
expand     (regexp: string, target: string)   expands matched strings into the target string
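
Funs are chained after a selector with ., as in the Hello World script. Below is a hedged sketch combining attr and match; the selectors are illustrative, and it assumes longer chains behave the same way as the two-step chains shown earlier:

news[]: page -> $("tr.athing")
    id: $(".title a.storylink").attr("id")
    rank: $("span.rank").text.match("[0-9]+")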

Author

Plutonist

impl.moe · GitHub @wspl

Comments
  • note without css selector

    gallery(@page=0) = "https://e-hentai.org/g/1034547/27cc8cb432/?p={@page}"

    tags[]: gallery -> $("div#taglist table tr td:eq(1) div a")
        name: .html

    The name field will always return the HTML of the first element selected by $("div#taglist table tr td:eq(1) div a").

  • Problem since new commits?

    Hi,

    I just copy-pasted your example code (hacker_news). Yesterday it worked. Today, with the new sources, it doesn't work anymore :(

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x45cdb6]
    
    goroutine 1 [running]:
    panic(0x65db60, 0xc42000c190)
            /opt/go/src/runtime/panic.go:500 +0x1a1
    github.com/wspl/creeper.(*Node).Value(0x0, 0x0, 0xc42001acd0, 0x0, 0xc4200340b8)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/node.go:118 +0x26
    github.com/wspl/creeper.(*Creeper).Each(0xc42001acd0, 0x6cc428)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/creeper.go:74 +0x8b
    main.main()
            /home/ubuntu/workspace/main.go:12 +0x73
    exit status 2
    

    (A simple println alone works :) )

  • go get

    go get github.com/wspl/creeper

    github.com/PuerkitoBio/goquery

    fatal error: unexpected signal during runtime execution
    [signal 0xb code=0x1 addr=0x1880e6d3a41e pc=0xf0eb]

    runtime stack:
    runtime.throw(0x4971c0, 0x2a)
            /usr/local/go/src/runtime/panic.go:547 +0x90
    runtime.sigpanic()
            /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
    runtime.unlock(0x982540)
            /usr/local/go/src/runtime/lock_sema.go:107 +0x14b
    runtime.(*mheap).alloc_m(0x982540, 0x1, 0x10000000010, 0xeed928)
            /usr/local/go/src/runtime/mheap.go:492 +0x314
    runtime.(*mheap).alloc.func1()
            /usr/local/go/src/runtime/mheap.go:502 +0x41
    runtime.systemstack(0xc82047fe58)
            /usr/local/go/src/runtime/asm_amd64.s:307 +0xab
    runtime.(*mheap).alloc(0x982540, 0x1, 0x10000000010, 0xed8f)
            /usr/local/go/src/runtime/mheap.go:503 +0x63
    runtime.(*mcentral).grow(0x983f10, 0x0)
            /usr/local/go/src/runtime/mcentral.go:209 +0x93
    runtime.(*mcentral).cacheSpan(0x983f10, 0xeed928)
            /usr/local/go/src/runtime/mcentral.go:89 +0x47d
    runtime.(*mcache).refill(0xaf4000, 0x10, 0xeed928)
            /usr/local/go/src/runtime/mcache.go:119 +0xcc
    runtime.mallocgc.func2()
            /usr/local/go/src/runtime/malloc.go:642 +0x2b
    runtime.systemstack(0xc820025500)
            /usr/local/go/src/runtime/asm_amd64.s:291 +0x79
    runtime.mstart()
            /usr/local/go/src/runtime/proc.go:1051

    goroutine 1 [running]:
    runtime.systemstack_switch()
            /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc821c79140 sp=0xc821c79138
    runtime.mallocgc(0xf0, 0x438dc0, 0x0, 0x438dc0)
            /usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc821c79218 sp=0xc821c79140

  • Feature: Next Page Node - functional node for directing next page

    page = "http://example.com/info?page=1"
    demo[]: page -> $(".example")
        text: $(".title").html
        @next: $("a.next").href
    

    I am thinking about another method for page directing: simulating a user clicking the next-page link. We can add a @next node to indicate the next page's link. The page director would then switch to the next page automatically when the current page has no more content.

    New grammar feature - functional nodes: nodes for assisting crawling, whose names start with @. They are private rather than readable nodes.

  • how to parse JSON structures

    We often need both an HTML parser and a JSON parser to crawl complex pages. I found that pattern files support only the HTML parser; how could I use this framework to parse JSON structures, or extend its functionality myself? Thanks.

  • @next node function implementation

    There are some hindrances in implementing the functional part of @next; they have stumped me for a long time:

    • InitSelector's loop call
    • Waiting until the page cycle ends, and blocking the overall cycle when there is no next page