A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper

This a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL which is ready to be imported into Typesense.

Features

  • Scrape HTML & PDF documents based on the configured selectors
  • Selectors can use CSS selectors or template-based ones which have sprig functions available.

Configuration

See the example configuration. Many of these options are directly copied to the Colly equivalents:

Running

docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper

# you're now in the docker container

cd src/app
go build
./ssscraper

Developing

Using VSCode, clone and open the repo directory with the Containers extension installed.

Future ideas

  • Webhook support - POST the output to a URL on completion
  • Different output formats
  • Custom weighting for selectors
  • Extract the selector/template logic to a common function
  • Add Word doc support

Sponsors

Built by Go Tripod, making the web as easy as one, two, three. Go Tripod build bespoke software solutions, and if you need a custom version of SS Scraper please get in touch.

Similar Resources

Web Scraper in Go, similar to BeautifulSoup

soup Web Scraper in Go, similar to BeautifulSoup soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSou

Jan 9, 2023

Simple price scraper with HTTP server/exporter for use with Prometheus

priceserver v0.3 Simple price scraper with HTTP server/exporter for use with Prometheus Currently working with Bitrue.com exchange but easily adaptabl

Nov 16, 2021

Scraper to download school attendance data from the DfE's statistics website

Scraper to download school attendance data from the DfE's statistics website

💪 Simple to use. Scrape attendance data with a single command! 🏇 Super fast. A

Mar 31, 2022

A cli scraper of gocomics.com made in go

goComic goComic is a cli tool written in go that scrapes your favorite childhood favorite comic from gocomics.com. It will give you a single days comi

Dec 24, 2021

Best Room Price Scraper from Booking.com

Best Room Price Scraper from Booking.com This repo is a tutorial of Large Scale

Nov 11, 2022

A simple scraper to export data from buildkite to honeycomb using opentelemetry SDK

A simple scraper to export data from buildkite to honeycomb using opentelemetry SDK

A quick scraper program that let you export builds on BuildKite as OpenTelemetry data and then send them to honeycomb.io for slice-n-dice high cardinality analysis.

Jul 7, 2022

Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus Pholcus(幽灵蛛)是一款纯 Go 语言编写的支持分布式的高并发爬虫软件,仅用于编程学习与研究。 它支持单机、服务端、客户端三种运行模式,拥有Web、GUI、命令行三种操作界面;规则简单灵活、批量任务并发、输出方式丰富(mysql/mongodb/kafka/csv/excel等

Dec 30, 2022

New World Auction House Crawler In Golang

New-World-Auction-House-Crawler Goal of this library is to have a process which grabs New World auction house data in the background while playing the

Sep 7, 2022

A PCPartPicker crawler for Golang.

gopartpicker A scraper for pcpartpicker.com for Go. It is implemented using Colly. Features Extract data from part list URLs Search for parts Extract

Nov 9, 2021
[爬虫框架 (golang)] An awesome Go concurrent Crawler(spider) framework. The crawler is flexible and modular. It can be expanded to an Individualized crawler easily or you can use the default crawl components only.

go_spider A crawler of vertical communities achieved by GOLANG. Latest stable Release: Version 1.2 (Sep 23, 2014). QQ群号:337344607 Features Concurrent

Jan 6, 2023
Warhammer40K faction scraper written in Golang, powered by colly.

Wascra Description Wascra is a tool written in Golang, which lets you extract all relevant Datasheet info from a Warhammer40K (9th edition) faction fr

Feb 8, 2022
Ratemyprof scraper - Ratemyprof scraper with golang

ratemyprof scraper visit https://ratemyprof-api.vercel.app/api/getProf to try ou

Jan 18, 2022
DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes on your scraper.
DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes on your scraper.

DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes on your scraper.

Dec 14, 2022
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Jan 9, 2023
Collyzar - A distributed redis-based framework for colly.

Collyzar A distributed redis-based framework for colly. Collyzar provides a very simple configuration and tools to implement distributed crawling/scra

Nov 22, 2022
Fast, highly configurable, cloud native dark web crawler.

Bathyscaphe dark web crawler Bathyscaphe is a Go written, fast, highly configurable, cloud-native dark web crawler. How to start the crawler To start

Nov 22, 2022
Golang based web site opengraph data scraper with caching
Golang based web site opengraph data scraper with caching

Snapper A Web microservice for capturing a website's OpenGraph data built in Golang Building Snapper building the binary git clone https://github.com/

Oct 5, 2022
High-performance crawler framework based on fasthttp

predator / 掠食者 基于 fasthttp 开发的高性能爬虫框架 使用 下面是一个示例,基本包含了当前已完成的所有功能,使用方法可以参考注释。

May 2, 2022
High-performance crawler framework based on fasthttp.

predator / 掠食者 基于 fasthttp 开发的高性能爬虫框架 使用 下面是一个示例,基本包含了当前已完成的所有功能,使用方法可以参考注释。 1 创建一个 Crawler import "github.com/go-predator/predator" func main() {

Dec 14, 2022