page-fetch

page-fetch is a tool for researchers that lets you:

  • Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files
  • Run arbitrary JavaScript on many web pages and see the returned values

Installation

page-fetch is written in Go and can be installed with go get:

▶ go get github.com/detectify/page-fetch

Or you can clone the repository and build it manually:

▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install

Dependencies

page-fetch uses chromedp, which requires that a Chrome or Chromium browser be installed. It tries the following executable names when launching a browser (if none of them is present, see the install example after this list):

  • headless_shell
  • headless-shell
  • chromium
  • chromium-browser
  • google-chrome
  • google-chrome-stable
  • google-chrome-beta
  • google-chrome-unstable
  • /usr/bin/google-chrome
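
If none of these executables is available, page-fetch cannot start a browser (the first comment below shows the error this produces). Installing any Chrome or Chromium package is usually enough; for example, on a Debian- or Ubuntu-based system (the package name is an assumption and varies by distribution):

▶ sudo apt install chromium-browser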

Basic Usage

page-fetch takes a list of URLs as its input on stdin. You can provide the input list using IO redirection:

▶ page-fetch < urls.txt

Or using the output of another command:

▶ grep admin urls.txt | page-fetch
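
If your input is a list of bare hostnames rather than full URLs, you can prepend a scheme before piping it in (hosts.txt here is a hypothetical file with one hostname per line):

▶ sed 's|^|https://|' hosts.txt | page-fetch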

By default, responses are stored in a directory called 'out', which is created if it does not exist:

▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...

The directory structure used in the output directory mirrors the directory structure used on the target websites. A ".meta" file is stored for each request; it contains the originally requested URL (including the query string), the request and response headers, and so on.
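
Because the request and response headers end up in the ".meta" files, ordinary text tools can be used to query a crawl afterwards. As a rough sketch (this assumes the headers are stored as plain text, and the header name is just an example), you could list pages that were served with a Content-Security-Policy header:

▶ grep -ril --include='*.meta' 'content-security-policy' out/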

Options

You can get the page-fetch help output by running page-fetch -h:

▶ page-fetch -h
Request URLs using headless Chrome, storing the results

Usage:
  page-fetch [options] < urls.txt

Options:
  -c, --concurrency     Concurrency Level (default 2)
  -e, --exclude         Do not save responses matching the provided string (can be specified multiple times)
  -i, --include         Only save requests matching the provided string (can be specified multiple times)
  -j, --javascript      JavaScript to run on each page
  -o, --output          Output directory name (default 'out')
  -w, --overwrite       Overwrite output files when they already exist
      --no-third-party  Do not save responses to requests on third-party domains
      --third-party     Only save responses to requests on third-party domains

Concurrency

You can change how many headless Chrome processes are used with the -c / --concurrency option. The default value is 2.
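
For example, to work through a large list of URLs with five browser processes at a time:

▶ page-fetch --concurrency 5 < urls.txt

Higher values finish sooner but use more memory and send more simultaneous traffic to the targets.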

Excluding responses based on content-type

You can choose to not save responses that match particular content types with the -e / --exclude option. Any response with a content-type that partially matches the provided value will not be stored; so you can, for example, avoid storing image files by specifying:

▶ page-fetch --exclude image/

The option can be specified multiple times to exclude multiple different content-types.
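
For example, to skip both images and fonts (the font/ prefix is only an assumption about what the targets serve):

▶ page-fetch --exclude image/ --exclude font/ < urls.txt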

Including responses based on content-type

Rather than excluding specific content-types, you can opt to only save certain content-types with the -i / --include option:

▶ page-fetch --include text/html

The option can be specified multiple times to include multiple different content-types.
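
For example, to keep only HTML documents and JavaScript (the content-type values are illustrative; some servers use text/javascript rather than application/javascript):

▶ page-fetch --include text/html --include application/javascript < urls.txt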

Running JavaScript on each page

You can run arbitrary JavaScript on each page with the -j / --javascript option. The return value of the JavaScript is converted to a string and printed on a line prefixed with "JS":

▶ echo https://example.com | page-fetch --javascript document.domain
GET https://example.com/ 200 text/html; charset=utf-8
JS (https://example.com): example.com

This option can be used for a very wide variety of purposes. As an example, you could extract the href attribute from all links on a webpage:

▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)' | grep ^JS
JS (https://example.com): [https://www.iana.org/domains/example]
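
In the same way, you could list the scripts each page loads, which pairs well with the tool's focus on JavaScript files (example.com loads no scripts, so for that page the result would simply be an empty list):

▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("script")].map(s => s.src)' | grep ^JS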

Setting the output directory name

By default, files are stored in a directory called out. This can be changed with the -o / --output option:

▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta

The directory is created if it does not already exist.

Overwriting files

By default, when a file already exists, a new file is created with a numeric suffix, e.g. if index already exists, index.1 will be created. This behaviour can be overridden with the -w / --overwrite option. When the option is used, matching files are overwritten instead.
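
For example, to re-fetch a list of URLs and replace the previously stored copies instead of accumulating numbered duplicates:

▶ page-fetch --overwrite < urls.txt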

Excluding third-party responses

You may sometimes wish to exclude responses from third-party domains. This can be done with the --no-third-party option. Any responses to requests for domains that do not match the input URL, or one of its subdomains, will not be saved.

Including only third-party responses

On rare occasions you may wish to store only responses from third-party domains. This can be done with the --third-party option.
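
For example, assuming --include matches content-types partially in the same way --exclude does, you could collect only the JavaScript that pages load from other domains:

▶ page-fetch --third-party --include javascript < urls.txt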

Owner

Detectify

Detectify analyzes the level of security of your website. Get secure on detectify.com or catch up with us via blog.detectify.com.

Comments

  • error starting browser: exec: "google-chrome": executable file not found in $PATH

    Hi, it's throwing this error: "error starting browser: exec: "google-chrome": executable file not found in $PATH". How do I fix it?

    Update:

    Sorry, the issue is now resolved. I installed Google Chrome on the machine and everything works now.

    Just paste this

    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo apt install ./google-chrome-stable_current_amd64.deb
    

    Thank you :)

  • Adding `ignore-certificate-errors` Chrome option

    page-fetch currently validates TLS certificates, which is generally an undesired feature for security tools. This PR adds the ignore-certificate-errors option to Chrome to disable certificate validation.

    I have not extensively tested this, but it seems to work in line with Chrome against badssl.com

  • run error: context deadline exceeded

    Hi, when I run this command:

    • echo https://detectify.com | page-fetch

    I get this error:

    • run error: context deadline exceeded

    I tried it on different OSes and got the same error.

  • added flag to skip storing responses

    On some occasions I just want to have URLs logged but not saved.

    I added a flag to skip saving responses. The flag is either -s or -skip-save-response.

  • added delay between requests

    When working with a lot of URLs, concurrency and WAFs, a delay between requests is a good idea.

    I added an option to add a delay between requests. The delay is specified with either the -d or --delay flag and is given in milliseconds.
