Go-site-crawler - a simple application written in Go that can fetch content from a URL endpoint

Go Site Crawler

Go Site Crawler is a simple application written in Go that can fetch content from a URL endpoint, scan the content for href links, and subsequently crawl the entire site. It is not intended for data collection or scraping; it does little more than fetch content and follow hrefs.
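As a rough illustration of the fetch-and-scan step, a minimal sketch using only the Go standard library might look like the following. The regexp-based href extraction and the function names are assumptions made for illustration, not the crawler's actual implementation.

package main

import (
    "fmt"
    "io"
    "net/http"
    "regexp"
)

// hrefPattern is a simplistic way to pull href values out of HTML;
// a real crawler would more likely use a proper HTML parser.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// fetchHrefs downloads the page at url and returns the status code
// together with every href value found in the body.
func fetchHrefs(url string) (int, []string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return 0, nil, err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return resp.StatusCode, nil, err
    }

    var hrefs []string
    for _, match := range hrefPattern.FindAllStringSubmatch(string(body), -1) {
        hrefs = append(hrefs, match[1])
    }
    return resp.StatusCode, hrefs, nil
}

func main() {
    status, hrefs, err := fetchHrefs("https://example.com")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Printf("status %d, found %d hrefs\n", status, len(hrefs))
}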

It has log output that will tell you the date and time the content was fetched, the status code it received from the endpoint it queried, the number of hrefs it found, and how many links it has parsed so far out of how many remain to be parsed.

The application uses a stack that tracks which endpoints have already been added so as to avoid adding duplicates. This should ensure that, even if the same href is present multiple times in the content, it will only fetch that endpoint once.
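A common way to implement this in Go is a visited set kept alongside the pending stack, so an endpoint is only enqueued the first time it is seen. The sketch below shows the idea; all names are illustrative and not taken from this repo.

package main

import "fmt"

// crawlStack pairs a pending stack of endpoints with a set of everything
// that has ever been pushed, so duplicates are silently skipped.
type crawlStack struct {
    pending []string
    seen    map[string]bool
}

func newCrawlStack() *crawlStack {
    return &crawlStack{seen: make(map[string]bool)}
}

// push adds an endpoint only if it has never been seen before.
func (s *crawlStack) push(endpoint string) {
    if s.seen[endpoint] {
        return
    }
    s.seen[endpoint] = true
    s.pending = append(s.pending, endpoint)
}

// pop removes and returns the most recently pushed endpoint.
func (s *crawlStack) pop() (string, bool) {
    if len(s.pending) == 0 {
        return "", false
    }
    last := len(s.pending) - 1
    endpoint := s.pending[last]
    s.pending = s.pending[:last]
    return endpoint, true
}

func main() {
    s := newCrawlStack()
    s.push("/en")
    s.push("/en") // ignored: already seen
    s.push("/en/about")
    for endpoint, ok := s.pop(); ok; endpoint, ok = s.pop() {
        fmt.Println("crawl", endpoint)
    }
}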

There are a number of reasons why you might want to do this:

  1. You want to traverse the entire site and ensure that there are no broken links (links that return a 4xx or 5xx status code; see the sketch after this list)
  2. You want to traverse the entire site so as to warm a cache (for example prerender.io, or a site that has some kind of caching mechanism)
  3. You want to generate a list of all the links that can be detected on your website.
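For the broken-link case (reason 1), the check boils down to treating any 4xx or 5xx response as broken. A minimal sketch of that check, with a made-up helper name:

package main

import (
    "fmt"
    "net/http"
)

// isBroken reports whether a status code indicates a broken link,
// i.e. any client (4xx) or server (5xx) error.
func isBroken(statusCode int) bool {
    return statusCode >= 400 && statusCode <= 599
}

func main() {
    resp, err := http.Get("https://example.com/missing-page")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    resp.Body.Close()
    if isBroken(resp.StatusCode) {
        fmt.Println("broken link, status", resp.StatusCode)
    }
}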

Building the application

You will need Go installed on your target OS to build this application. The current release has been built using Go 1.17.5, but it will probably work with older versions of Go too.

You can build the application by cloning this repo and running the following in its project folder:

go build .

The process should generate an executable called go-site-crawler.

Running the application

You can run the application after building it by using the following command:

./go-site-crawler --baseUrl=https://example.com

The full list of arguments is as follows:

--baseUrl=https://example.com - this is the base endpoint that will be used when
crawling the site

--entrypoint=/ - this is the entry point that is used in conjunction with the
baseUrl to fetch content. If omitted, it will resolve to an empty string and the
first endpoint to be fetched will be the baseUrl

--prefix=/en - This is a filter that will be used when scanning for href
content. For example, a prefix of `/` will only match relative URLs, while a
prefix of `/en` will only match URLs that start with `/en` (a short sketch of
this check appears after the argument list)

--userAgent=google - This is the user agent you want the crawler to use when
fetching content. Currently the only ones available are 'google', 'bing' and
'yahoo', and the crawler will use the respective bot user agent. This defaults
to the Google bot.
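The --prefix filter described above amounts to a string-prefix check on every discovered href. A minimal sketch of that check (the function name and surrounding code are illustrative, not taken from this repo):

package main

import (
    "fmt"
    "strings"
)

// keepHrefs returns only the hrefs that start with the given prefix,
// mirroring the behaviour described for the --prefix flag.
func keepHrefs(hrefs []string, prefix string) []string {
    var kept []string
    for _, href := range hrefs {
        if strings.HasPrefix(href, prefix) {
            kept = append(kept, href)
        }
    }
    return kept
}

func main() {
    hrefs := []string{"/en/about", "/fr/about", "https://other.example.com/"}
    fmt.Println(keepHrefs(hrefs, "/en")) // [/en/about]
}

Putting the documented flags together, a typical invocation might look like this (the values are only examples):

./go-site-crawler --baseUrl=https://example.com --entrypoint=/ --prefix=/en --userAgent=bing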

TODO

Will be adding a Dockerfile so that the application can be run via Docker, avoiding the need to install Go if you don't otherwise need it.

Similar Resources

ant (alpha) is a web crawler for Go.

The package includes functions that can scan data from the page into your structs or slices of structs, which allows you to reduce the noise and complexity in your source code.

Dec 30, 2022

Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files

Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files. Run arbitrary JavaScript on many web pages and see the returned values

Dec 29, 2022

Go IMDb Crawler

Go IMDb Crawler. Want to know which celebrities have a common birthday with yours? 👀

Aug 1, 2022

Apollo 💎 A Unix-style personal search engine and web crawler for your digital footprint.

Dec 27, 2022

High-performance crawler framework based on fasthttp

predator / Predator: a high-performance crawler framework built on fasthttp. The original README includes an example covering essentially all currently completed features; see its comments for usage.

May 2, 2022

Fast, highly configurable, cloud native dark web crawler.

Bathyscaphe dark web crawler: Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.

Nov 22, 2022

Just a web crawler

gh-dependents gh command extension to see dependents of your repository. See The GitHub Blog: GitHub CLI 2.0 includes extensions! Install gh extension

Sep 27, 2022

A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper: this is a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL, which is ready to be imported.

Aug 21, 2022

Go-based search engine URL collector; supports Google and Bing, and can batch-collect URLs based on Google search syntax.

Nov 9, 2022

crawlergo is a browser crawler that uses chrome headless mode for URL collection.

A powerful browser crawler for web vulnerability scanners

Dec 29, 2022

Multiplexer: HTTP-Server & URL Crawler

Multiplexer: HTTP-Server & URL Crawler. The application is an HTTP server with a single handler; the handler accepts a POST request containing a list of URLs.

Nov 3, 2021

Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus: Pholcus (幽灵蛛) is a distributed, high-concurrency crawler written in pure Go, intended only for programming study and research. It supports standalone, server and client run modes, offers web, GUI and command-line interfaces, and has simple, flexible rules, concurrent batch tasks, and rich output options (MySQL/MongoDB/Kafka/CSV/Excel, etc.)

Dec 30, 2022

Simple content crawler for joyreactor.cc

Reactor Crawler: a simple CLI content crawler for Joyreactor. It will find all media content on the page you've provided and save it.

May 5, 2022

A simple crawler that sends a Telegram notification when a refurbished MacBook Air / Pro is in stock.

Jan 30, 2022

Golang-based website OpenGraph data scraper with caching

Snapper: a web microservice for capturing a website's OpenGraph data, built in Golang.

Oct 5, 2022

Distributed web crawler admin platform for spider management, regardless of language and framework.

Crawlab: a Golang-based distributed web crawler management platform.

Jan 7, 2023

Elegant Scraper and Crawler Framework for Golang

Colly - Lightning Fast and Elegant Scraping Framework for Gophers. Colly provides a clean interface to write any kind of crawler/scraper/spider.

Jan 9, 2023

:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

Creeper is a next-generation crawler which fetches web pages by creeper script; it is a cross-platform embedded crawler.

Dec 4, 2022