Super-Simple Scraper
This is a very thin layer on top of Colly that allows configuration from a JSON file. The output is JSONL, ready to be imported into Typesense.
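As a rough illustration of the output format, JSONL is one JSON object per line. The sketch below is not the scraper's actual code: the `Document` type, its field names, and the `toJSONL` helper are hypothetical, since the real fields depend on the selectors you configure.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Document is a hypothetical record shape; the real fields
// depend on the selectors in your configuration.
type Document struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// toJSONL serialises each document as a single JSON object on its own line.
func toJSONL(docs []Document) (string, error) {
	var sb strings.Builder
	enc := json.NewEncoder(&sb) // Encode appends a newline after every value
	for _, d := range docs {
		if err := enc.Encode(d); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	docs := []Document{
		{URL: "https://example.com/a", Title: "Page A", Body: "First page"},
		{URL: "https://example.com/b", Title: "Page B", Body: "Second page"},
	}
	out, err := toJSONL(docs)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Each line can be parsed independently of the others, which is why the format suits bulk imports.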
Features
- Scrape HTML & PDF documents based on the configured selectors
- Selectors can be CSS selectors or template-based selectors, which have sprig functions available.
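To give a feel for what a template-based selector might look like, here is a minimal sketch using Go's `text/template`. It makes several assumptions: the `renderSelector` helper and the `title`/`section` keys are hypothetical, and the small `funcs` map is a standard-library stand-in for sprig (whose `trim` and `upper` functions behave the same way), so the example runs without extra dependencies. How the scraper actually passes page data to templates may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// funcs stands in for sprig's function map; only two functions are mirrored here.
var funcs = template.FuncMap{
	"trim":  strings.TrimSpace,
	"upper": strings.ToUpper,
}

// renderSelector evaluates a template against values scraped from a page.
// (Hypothetical helper, not part of the scraper's API.)
func renderSelector(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("selector").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	page := map[string]string{"title": "  Getting started  ", "section": "docs"}
	out, err := renderSelector(`{{ .section | upper }}: {{ .title | trim }}`, page)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // DOCS: Getting started
}
```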
Configuration
See the example configuration. Many of these options are copied directly to their Colly equivalents:
- http://go-colly.org/docs/introduction/configuration/
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#Collector
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#LimitRule
Running
We have an image on DockerHub, so after installing Docker and jq, something like this will work:
docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
The manual method is:
docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper
# you're now in the docker container
cd src/app
go build
./ssscraper
Developing
Clone the repo, then open its directory in VSCode with the Containers extension installed.
Future ideas
- Nested selectors; i.e. select each item from a list on each page
- Webhook support - POST the output to a URL on completion
- Different output formats
- Custom weighting for selectors
- Extract the selector/template logic to a common function
- Add Word doc support
Sponsors
Built by Go Tripod, making the web as easy as one, two, three. Go Tripod build bespoke software solutions, and if you need a custom version of SS Scraper please get in touch.