GoSpider - Fast web spider written in Go

Painlessly integrate Gospider into your recon workflow with HunterSuite.

Enjoying this tool? Support its development and take your game to the next level by using HunterSuite.io

Installation

go get -u github.com/jaeles-project/gospider
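
Note: on newer Go toolchains, installing executables with go get is deprecated (see the "Update README.md" note in the comments below), so the module-aware form may be needed instead:

go install github.com/jaeles-project/gospider@latest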

Features

  • Fast web crawling
  • Brute force and parse sitemap.xml
  • Parse robots.txt
  • Generate and verify links from JavaScript files
  • Link Finder
  • Find AWS S3 buckets in response source
  • Find subdomains in response source
  • Get URLs from Wayback Machine, Common Crawl, VirusTotal, AlienVault
  • Grep-friendly output format
  • Support Burp input
  • Crawl multiple sites in parallel
  • Random mobile/web User-Agent

Showcases

(asciinema demo recording)

Usage

Fast web spider written in Go - v1.1.5 by @thebl4ckturtle & @j3ssiejjj

Usage:
  gospider [flags]

Flags:
  -s, --site string               Site to crawl
  -S, --sites string              Site list to crawl
  -p, --proxy string              Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string             Output folder
  -u, --user-agent string         User Agent to use
                                  	web: random web user-agent
                                  	mobi: random mobile user-agent
                                  	or you can set your special user-agent (default "web")
      --cookie string             Cookie to use (testA=a; testB=b)
  -H, --header stringArray        Header to use (Use multiple flag to set multiple header)
      --burp string               Load headers and cookie from burp raw http request
      --blacklist string          Blacklist URL Regex
      --whitelist string          Whitelist URL Regex
      --whitelist-domain string   Whitelist Domain
  -t, --threads int               Number of threads (Run sites in parallel) (default 1)
  -c, --concurrent int            The number of the maximum allowed concurrent requests of the matching domains (default 5)
  -d, --depth int                 MaxDepth limits the recursion depth of visited URLs. (Set it to 0 for infinite recursion) (default 1)
  -k, --delay int                 Delay is the duration to wait before creating a new request to the matching domains (second)
  -K, --random-delay int          RandomDelay is the extra randomized duration to wait added to Delay before creating a new request (second)
  -m, --timeout int               Request timeout (second) (default 10)
  -B, --base                      Disable all and only use HTML content
      --js                        Enable linkfinder in javascript file (default true)
      --subs                      Include subdomains
      --sitemap                   Try to crawl sitemap.xml
      --robots                    Try to crawl robots.txt (default true)
  -a, --other-source              Find URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)
  -w, --include-subs              Include subdomains crawled from 3rd party. Default is main domain
  -r, --include-other-source      Also include other-source's urls (still crawl and request)
      --debug                     Turn on debug mode
      --json                      Enable JSON output
  -v, --verbose                   Turn on verbose
  -l, --length                    Turn on length
  -L, --filter-length             Turn on length filter
  -R, --raw                       Turn on raw
  -q, --quiet                     Suppress all the output and only show URL
      --no-redirect               Disable redirect
      --version                   Check version
  -h, --help                      help for gospider

Example commands

Quiet output

gospider -q -s "https://google.com/"

Run with single site

gospider -s "https://google.com/" -o output -c 10 -d 1

Run with site list

gospider -S sites.txt -o output -c 10 -d 1

Run 20 sites at the same time, with 10 bots per site

gospider -S sites.txt -o output -c 10 -d 1 -t 20

Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com)

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source

Also get URLs from 3rd party (Archive.org, CommonCrawl.org, VirusTotal.com, AlienVault.com) and include subdomains

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source --include-subs

Use custom header/cookies

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source -H "Accept: */*" -H "Test: test" --cookie "testA=a; testB=b"

gospider -s "https://google.com/" -o output -c 10 -d 1 --other-source --burp burp_req.txt

Blacklist URLs/file extensions.

P/s: gospider blacklists .(jpg|jpeg|gif|css|tif|tiff|png|ttf|woff|woff2|ico) by default

gospider -s "https://google.com/" -o output -c 10 -d 1 --blacklist ".(woff|pdf)"
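
Whitelisting works the same way with the --whitelist (URL regex) and --whitelist-domain flags documented above; an illustrative sketch (the pattern is an assumption, not from the original README):

gospider -s "https://google.com/" -o output -c 10 -d 1 --whitelist "google"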

Show response length and filter by length.

gospider -s "https://google.com/" -o output -c 10 -d 1 --length --filter-length "6871,24432"   
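
Because the output is formatted to be grep-friendly, quiet mode can be piped into standard tools; an illustrative pipeline (not from the original README):

gospider -q -s "https://google.com/" | grep -E "\.js$"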

License

Gospider is made with ❤ by @j3ssiejjj & @thebl4ckturtle and is released under the MIT license.

Donation

PayPal

Owner

Jaeles Project - The Swiss Army knife for automated Web Application Testing

Comments
  • Help to install the script on Mac

    Hi

    Can you please explain how I can install this tool on a Mac?

    Thanks

    Pentest@tools ~ % sudo go get -u github.com/jaeles-project/gospider
    Password:
    Pentest@tools~ % gospider -s "https://google.com/" -o output -c 10 -d 1
    zsh: command not found: gospider
    
  • Too many open files

    Command: gospider -S "urls files" -o outuputfles -c 5 -t 100 -d 2 --other-source -v --robots --sitemap -u web

    Error: [0024] ERROR Failed to open file to write Output: open *********/target_folder : too many open files
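
    A common workaround for this class of error (a general note, not part of the original thread) is to raise the shell's open-file limit before running gospider, or to lower -t so fewer output files are open at once:

    ulimit -n 8192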

  • go get installation errors

    go get -u github.com/jaeles-project/gospider
    # github.com/jaeles-project/gospider/core
    go/src/github.com/jaeles-project/gospider/core/crawler.go:27:17: unknown field 'MaxConnsPerHost' in struct literal of type http.Transport
    go/src/github.com/jaeles-project/gospider/core/crawler.go:183:15: undefined: strings.ReplaceAll
    go/src/github.com/jaeles-project/gospider/core/crawler.go:296:20: undefined: strings.ReplaceAll
    go/src/github.com/jaeles-project/gospider/core/linkfinder.go:14:12: undefined: strings.ReplaceAll
    go/src/github.com/jaeles-project/gospider/core/linkfinder.go:15:12: undefined: strings.ReplaceAll
    

    This happens when downloading directly with go get -u github.com/jaeles-project/gospider

    Any advice? Wrong Go version?

    OS: Ubuntu 18.04.3 LTS x86_64 Kernel: 4.15.0-76-generic Go Version: go version go1.10.4 linux/amd64
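
    Note (not part of the original thread): strings.ReplaceAll was added in Go 1.12 and http.Transport.MaxConnsPerHost in Go 1.11, so go1.10.4 is too old to build this code; upgrading Go and rebuilding should resolve these errors.

    go version   # should report go1.12 or newer before rebuilding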

  • Add features and fixes

    Added features:

    • Show length and add length filter
    • Show raw source code
    • Crawl all subdomains

    Fixes:

    • Fix linkfinder (relative paths, JS in JS, output, ...)
    • Fix subdomains
    • Fix href output
    • Fix case sensitivity for duplicate detection
    • Delay all requests
  • GoSpider doesn't seem to honor the delay parameter

    Hi @j3ssie and team,

    From what I can tell, setting the -k/--delay parameter doesn't delay anything. GoSpider still requests URLs faster than expected.

    Can you replicate this issue?
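
    (For reference, the delay flags as documented in the Usage section above would be passed like this; the command is illustrative only, not from the thread:)

    gospider -s "https://google.com/" -k 5 -K 2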

  • Empty output specifying HTTP(S) port

    Description

    When I try to run gospider on a URL that also specifies the HTTP port, sometimes it doesn't crawl the target, and I don't know exactly why.

    Go version

    go version go1.16.2 linux/amd64

    Gospider Version

    1.1.5 (In the last commit of https://github.com/jaeles-project/gospider/blob/2e610b3fd79e1ac0945b694385edd88028f821ce/core/version.go the version is wrong btw)

    Test case 1 - Not specifying http or https port

    ./gospider -q -s https://shippingmanager.bpost.be/ --debug
    
    [0000]  INFO Start crawling: https://shippingmanager.bpost.be/
    [0000]  INFO Found robots.txt: https://shippingmanager.bpost.be//robots.txt
    https://shippingmanager.bpost.be/ShmFrontEnd/
    [0000]  INFO Done.
    

    Test case 2 - Specifying the port:

    ./gospider -q -s https://shippingmanager.bpost.be:443/ --debug
    

    (screenshot omitted)

  • Include response length in output

    It would be useful if the response length was included in the output, or even better, have a way to filter the output by response length (!=, <, >). This would allow the user to filter out expected responses during enumeration.
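
    (Note, not part of the original thread: the -l/--length and -L/--filter-length flags documented in the Usage section above appear to cover this; an illustrative command:)

    gospider -s "https://google.com/" -l -L "6871,24432"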

  • Subdomains are not shown in output

    Hey, since the last update the gospider tool is not showing subdomains in the output. I have checked this with multiple flags, but it's not working.

    (screenshot omitted)

    The line with the pointed arrow is now missing from gospider output.

    Can you please add it back?

    Best,

  • Missing License

    Hello, I didn't find license information. Can you add a LICENSE file or add the license information in the README.md?

    I would like to package it for Kali Linux: https://bugs.kali.org/view.php?id=6514

    Thanks.

  • Update README.md

    Installing executables with "go get" in module mode is deprecated. "go install pkg@version" should be used instead. For more information, see https://golang.org/doc/go-get-install-deprecation

  • removing lower case conversion of paths and parameters

    Gospider was converting case-sensitive paths and parameters to lowercase, which resulted in many valid case-sensitive paths and parameters returning 404 Not Found. For example, a path found in HTML or JavaScript source, /SearchLive.php?Param=1, was converted to /searchlive.php?param=1.

  • Output only URLs

    Hi,

    At first, congratulations on this project. I have an issue, maybe my mistake, but I want to send only URLs to stdout, without tags like [url] and [code-200]. Is that possible?

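
    (Note, not part of the original thread: the -q/--quiet flag documented above is described as "Suppress all the output and only show URL", which appears to do exactly this:)

    gospider -q -s "https://google.com/"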

  • RAM usage

    Gospider uses a lot of RAM. Its RAM usage keeps increasing over time, and it chokes my server if it keeps running for a few hours, even with low thread counts (2-3).

    Is there any solution to this? or can you kindly solve this issue if possible?

    Thanks

  • Add Dockerfile

    FROM golang:1.17.8-alpine3.14 AS build-env
    RUN apk add --no-cache build-base
    RUN go install github.com/jaeles-project/gospider@latest
    
    FROM alpine:3.15.0
    RUN apk add --no-cache bind-tools ca-certificates
    COPY --from=build-env /go/bin/gospider /usr/local/bin/gospider
    ENTRYPOINT ["gospider"]
    

    Run Docker

    docker build -t gospider .
    docker run --rm -t gospider -q -s "https://google.com/"
    
  • issue with URLs containing dashes

    URLs containing dashes in a list cannot be parsed: http://ec2-XXX-XX-XX-XXX.compute-1.amazonaws.com

    gospider -S test.txt -v
    [0000] ERROR Failed to parse domain
