page-fetch

page-fetch is a tool for researchers that lets you:

  • Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files
  • Run arbitrary JavaScript on many web pages and see the returned values

Installation

page-fetch is written in Go and can be installed with go get:

▶ go get github.com/detectify/page-fetch

Or you can clone the repository and build it manually:

▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install

Dependencies

page-fetch uses chromedp, which requires that a Chrome or Chromium browser be installed. It tries the following executable names when launching a browser (if none of them is present, see the install example after this list):

  • headless_shell
  • headless-shell
  • chromium
  • chromium-browser
  • google-chrome
  • google-chrome-stable
  • google-chrome-beta
  • google-chrome-unstable
  • /usr/bin/google-chrome
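
If none of these executables is available, page-fetch cannot start a browser (the first comment below shows the error this produces). Installing any Chrome or Chromium package is usually enough; for example, on a Debian- or Ubuntu-based system (the package name is an assumption and varies by distribution):

▶ sudo apt install chromium-browser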

Basic Usage

page-fetch takes a list of URLs as its input on stdin. You can provide the input list using IO redirection:

▶ page-fetch < urls.txt

Or using the output of another command:

▶ grep admin urls.txt | page-fetch
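
If your input is a list of bare hostnames rather than full URLs, you can prepend a scheme before piping it in (hosts.txt here is a hypothetical file with one hostname per line):

▶ sed 's|^|https://|' hosts.txt | page-fetch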

By default, responses are stored in a directory called 'out', which is created if it does not exist:

▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...

The directory structure used in the output directory mirrors the directory structure used on the target websites. A ".meta" file is stored for each request; it contains the originally requested URL (including the query string), the request and response headers, and so on.
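
Because the request and response headers end up in the ".meta" files, ordinary text tools can be used to query a crawl afterwards. As a rough sketch (this assumes the headers are stored as plain text, and the header name is just an example), you could list pages that were served with a Content-Security-Policy header:

▶ grep -ril --include='*.meta' 'content-security-policy' out/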

Options

You can get the page-fetch help output by running page-fetch -h:

▶ page-fetch -h
Request URLs using headless Chrome, storing the results

Usage:
  page-fetch [options] < urls.txt

Options:
  -c, --concurrency     Concurrency Level (default 2)
  -e, --exclude         Do not save responses matching the provided string (can be specified multiple times)
  -i, --include         Only save requests matching the provided string (can be specified multiple times)
  -j, --javascript      JavaScript to run on each page
  -o, --output          Output directory name (default 'out')
  -w, --overwrite       Overwrite output files when they already exist
      --no-third-party  Do not save responses to requests on third-party domains
      --third-party     Only save responses to requests on third-party domains

Concurrency

You can change how many headless Chrome processes are used with the -c / --concurrency option. The default value is 2.
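
For example, to work through a large list of URLs with five browser processes at a time:

▶ page-fetch --concurrency 5 < urls.txt

Higher values finish sooner but use more memory and send more simultaneous traffic to the targets.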

Excluding responses based on content-type

You can choose to not save responses that match particular content types with the -e / --exclude option. Any response with a content-type that partially matches the provided value will not be stored; so you can, for example, avoid storing image files by specifying:

▶ page-fetch --exclude image/

The option can be specified multiple times to exclude multiple different content-types.
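
For example, to skip both images and fonts (the font/ prefix is only an assumption about what the targets serve):

▶ page-fetch --exclude image/ --exclude font/ < urls.txt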

Including responses based on content-type

Rather than excluding specific content-types, you can opt to only save certain content-types with the -i / --include option:

▶ page-fetch --include text/html

The option can be specified multiple times to include multiple different content-types.
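
For example, to keep only HTML documents and JavaScript (the content-type values are illustrative; some servers use text/javascript rather than application/javascript):

▶ page-fetch --include text/html --include application/javascript < urls.txt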

Running JavaScript on each page

You can run arbitrary JavaScript on each page with the -j / --javascript option. The return value of the JavaScript is converted to a string and printed on a line prefixed with "JS":

▶ echo https://example.com | page-fetch --javascript document.domain
GET https://example.com/ 200 text/html; charset=utf-8
JS (https://example.com): example.com

This option can be used for a very wide variety of purposes. As an example, you could extract the href attribute from all links on a webpage:

▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)' | grep ^JS
JS (https://example.com): [https://www.iana.org/domains/example]
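
In the same way, you could list the scripts each page loads, which pairs well with the tool's focus on JavaScript files (example.com loads no scripts, so for that page the result would simply be an empty list):

▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("script")].map(s => s.src)' | grep ^JS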

Setting the output directory name

By default, files are stored in a directory called out. This can be changed with the -o / --output option:

▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta

The directory is created if it does not already exist.

Overwriting files

By default, when a file already exists, a new file is created with a numeric suffix, e.g. if index already exists, index.1 will be created. This behaviour can be overridden with the -w / --overwrite option. When the option is used, matching files are overwritten instead.
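
For example, to re-fetch a list of URLs and replace the previously stored copies instead of accumulating numbered duplicates:

▶ page-fetch --overwrite < urls.txt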

Excluding third-party responses

You may sometimes wish to exclude responses from third-party domains. This can be done with the --no-third-party option. Any responses to requests for domains that do not match the input URL, or one of its subdomains, will not be saved.

Including only third-party responses

On rare occasions you may wish to store only responses from third-party domains. This can be done with the --third-party option.
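
For example, assuming --include matches content-types partially in the same way --exclude does, you could collect only the JavaScript that pages load from other domains:

▶ page-fetch --third-party --include javascript < urls.txt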

Owner

Detectify

Detectify analyzes the level of security of your website. Get secure on detectify.com or catch up with us via blog.detectify.com.

Comments

  • error starting browser: exec: "google-chrome": executable file not found in $PATH

    Hi, it's throwing this error: "error starting browser: exec: "google-chrome": executable file not found in $PATH". How do I fix it?

    Update:

    Sorry, the issue is now resolved. I installed Google Chrome on the machine and everything works now.

    Just paste this

    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo apt install ./google-chrome-stable_current_amd64.deb
    

    Thank you :)

  • Adding `ignore-certificate-errors` Chrome option

    page-fetch currently validates TLS certificates, which is generally an undesired feature for security tools. This PR adds the ignore-certificate-errors option to Chrome to disable certificate validation.

    I have not extensively tested this, but it seems to work in line with Chrome against badssl.com

  • run error: context deadline exceeded

    Hi, when I run this command:

    • echo https://detectify.com | page-fetch

    I get this error:

    • run error: context deadline exceeded

    I tried it on different OSes and got the same error.

  • added flag to skip storing responses

    On some occasions I just want to have URLs logged but not saved.

    I added a flag to skip saving responses. The flag is either -s or -skip-save-response.

  • added delay between requests

    When working with a lot of URLs, concurrency and WAFs, a delay between requests is a good idea.

    I added an option to add a delay between requests. The delay is specified with either the -d or --delay flag and is given in milliseconds.
