Super-Simple Scraper
This is a very thin layer on top of Colly that allows the scraper to be configured from a JSON file. The output is JSONL, ready to be imported into Typesense.
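As a rough illustration of the JSONL output format, here is a minimal sketch using only the Go standard library. The `Document` type, its field names, and the `encodeJSONL` helper are hypothetical and are not taken from the scraper's source; the real output fields depend on your configured selectors.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Document is a hypothetical scraped record. The field names are
// illustrative only and are not the scraper's actual output schema.
type Document struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// encodeJSONL serialises docs as JSON Lines: one JSON object per line.
func encodeJSONL(docs []Document) (string, error) {
	var sb strings.Builder
	enc := json.NewEncoder(&sb)
	for _, d := range docs {
		// Encode writes the value followed by a newline, which is
		// exactly the record separator JSONL needs.
		if err := enc.Encode(d); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	docs := []Document{
		{URL: "https://example.com/a", Title: "Page A", Body: "First page"},
		{URL: "https://example.com/b", Title: "Page B", Body: "Second page"},
	}
	out, err := encodeJSONL(docs)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Each line is a complete JSON object, so the file can be streamed or imported record by record without parsing it as a whole.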
Features
- Scrape HTML & PDF documents based on the configured selectors
- Selectors can be plain CSS selectors or template-based ones, which have the sprig functions available.
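To give a feel for what a template-based selector might look like, here is a minimal sketch built on Go's standard `text/template`. It does not import sprig: the `trim` and `upper` entries in the function map are standard-library stand-ins that mirror sprig function names. The `renderField` helper and the `title` and `section` keys are hypothetical; the data the scraper actually exposes to templates may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// funcs stands in for sprig so the example has no external dependencies.
// "trim" and "upper" mirror sprig function names but are backed by the
// standard library here.
var funcs = template.FuncMap{
	"trim":  strings.TrimSpace,
	"upper": strings.ToUpper,
}

// renderField evaluates a template against values that are assumed to
// have been extracted from a page already (hypothetical helper).
func renderField(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("field").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical extracted values; the keys are illustrative only.
	data := map[string]string{
		"title":   "  getting started ",
		"section": "docs",
	}
	out, err := renderField(`{{ .section | upper }}: {{ .title | trim }}`, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "DOCS: getting started"
}
```

A template is useful where a single CSS selector is not enough, for example when a field has to be combined from several values or cleaned up before indexing.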
Configuration
See the example configuration. Many of the options are copied directly to their Colly equivalents:
- http://go-colly.org/docs/introduction/configuration/
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#Collector
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#LimitRule
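The sketch below shows one way a JSON file could be decoded into settings that line up with Colly's `Collector` and `LimitRule` fields (`AllowedDomains`, `MaxDepth`, `UserAgent`, `DomainGlob`, `Parallelism`, `Delay`). The JSON key names, the `Config` and `LimitConfig` types, and `parseConfig` are assumptions made for illustration; refer to the example configuration for the keys the scraper really uses. Colly itself is not imported, so the example runs with the standard library alone.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LimitConfig mirrors a few fields of colly.LimitRule.
// The JSON key names are hypothetical.
type LimitConfig struct {
	DomainGlob  string `json:"domain_glob"`
	Parallelism int    `json:"parallelism"`
	DelayMS     int    `json:"delay_ms"`
}

// Delay converts the millisecond value into a time.Duration, the type
// that colly.LimitRule uses for its Delay field.
func (l LimitConfig) Delay() time.Duration {
	return time.Duration(l.DelayMS) * time.Millisecond
}

// Config mirrors a few fields of colly.Collector.
// The JSON key names are hypothetical.
type Config struct {
	AllowedDomains []string    `json:"allowed_domains"`
	MaxDepth       int         `json:"max_depth"`
	UserAgent      string      `json:"user_agent"`
	Limit          LimitConfig `json:"limit"`
}

// parseConfig decodes raw JSON into a Config.
func parseConfig(raw []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{
		"allowed_domains": ["example.com"],
		"max_depth": 2,
		"user_agent": "ssscraper-example",
		"limit": {"domain_glob": "*example.com*", "parallelism": 2, "delay_ms": 500}
	}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.AllowedDomains, cfg.MaxDepth, cfg.Limit.Delay())
}
```

With Colly imported, values like these could then be assigned to the matching fields on a collector and passed to its `Limit` method as a `LimitRule`.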
Running
docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper
# you're now in the docker container
cd src/app
go build
./ssscraper
Developing
Clone the repo, then open its directory in VSCode with the Containers extension installed.
Future ideas
- Webhook support - POST the output to a URL on completion
- Different output formats
- Custom weighting for selectors
- Extract the selector/template logic to a common function
- Add Word doc support
Sponsors
Built by Go Tripod, making the web as easy as one, two, three. Go Tripod build bespoke software solutions, and if you need a custom version of SS Scraper please get in touch.