Extract structured data from websites. Website scraping.

Dataflow kit



Dataflow kit ("DFK") is a Web Scraping framework for Gophers. It extracts data from web pages, following the specified CSS selectors.

You can use it in many ways for data mining, data processing or archiving.

The Web Scraping Pipeline

A web-scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch service);
  • Parsing the HTML page and retrieving the data we're interested in (Parse service);
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.

Fetch service

The fetch.d server downloads the content of HTML web pages. Depending on the fetcher type, page content is downloaded using either the Base fetcher or the Chrome fetcher.

The Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic JavaScript-driven web pages.

The Chrome fetcher is intended for rendering dynamic JavaScript-based content. It sends requests to Chrome running in headless mode.

A fetched web page is passed on to the parse.d service.
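
For a quick sanity check you can call fetch.d directly from Go. This is a minimal sketch; the /fetch endpoint path, port 8000 and the payload shape are assumptions based on the examples and build output in this README, not a documented API:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed payload: the target URL and the fetcher type ("base" or "chrome").
	payload := []byte(`{"url": "https://example.com", "type": "base"}`)
	resp, err := http.Post("http://127.0.0.1:8000/fetch", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // raw HTML of the fetched page
}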

Parse service

parse.d is the service that extracts data from a downloaded web page, following the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON or XML format.

Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher, and parsing JavaScript-generated pages may return empty results. In that case the Parse service automatically retries with the Chrome fetcher to render the same dynamic JavaScript-driven content. Have a look at https://scrape.dataflowkit.com/persons/page-0 for a sample of a JavaScript-driven web page.

Dataflow kit benefits:

  • Scraping of JavaScript-generated pages;

  • Data extraction from paginated websites;

  • Processing of infinitely scrolled pages;

  • Scraping of websites behind a login form;

  • Cookie and session handling;

  • Following links and processing detail pages;

  • Managing delays between requests per domain;

  • Following robots.txt directives;

  • Saving intermediate data in Diskv or MongoDB (the storage interface is flexible enough to add more storage types easily);

  • Encoding results in CSV, MS Excel, JSON (Lines) or XML format;

  • Speed: Dataflow kit takes about 4-6 seconds to fetch and then parse 50 pages;

  • Scale: Dataflow kit is suitable for processing quite large volumes of data; our tests show that parsing approximately 4 million pages takes about 7 hours.

Installation

go get -u github.com/slotix/dataflowkit

Usage

Docker

  1. Install Docker and Docker Compose

  2. Start services.

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command fetches the docker images automatically and starts the services.

  3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.
curl -XPOST  127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
	"name":"collection",
	"request":{
	   "url":"https://example.com"
	},
	"fields":[
	   {
		  "name":"Title",
		  "selector":".product-container a",
		  "extractor":{
			 "types":["text", "href"],
			 "filters":[
				"trim",
				"lowerCase"
			 ],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   },
	   {
		  "name":"Image",
		  "selector":"#product-container img",
		  "extractor":{
			 "types":["alt","src","width","height"],
			 "filters":[
				"trim",
				"upperCase"
			 ]
		  }
	   },
	   {
		  "name":"Buyinfo",
		  "selector":".buy-info",
		  "extractor":{
			 "types":["text"],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   }
	],
	"paginator":{
	   "selector":".next",
	   "attr":"href",
	   "maxPages":3
	},
	"format":"json",
	"fetcherType":"chrome",
	"paginateResults":false
}

Read more about scraper configuration JSON files in our GoDoc reference.

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
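
The same request can be issued from Go instead of curl. A minimal sketch, reusing the /parse endpoint and the example configuration shown above:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Load one of the sample configs from the /examples folder.
	f, err := os.Open("examples/books.toscrape.com.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// POST it to the parse daemon, exactly as the curl command above does.
	resp, err := http.Post("http://127.0.0.1:8001/parse", "application/json", f)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // extracted data in the requested format
}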

  4. To stop the services, press Ctrl+C and run
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes

[Image: Dataflow kit CLI]

Click on the image to see the CLI in action.

Manual way

  1. Start Chrome docker container
docker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN \
  yukinying/chrome-headless-browser

Headless Chrome is used for fetching web pages to feed the Dataflow kit parser.

  2. Build and run the fetch.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d && go build && ./fetch.d
  3. In a new terminal window, build and run the parse.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d && go build && ./parse.d
  4. Launch parsing. See step 3 of the previous section.

Run tests

  • docker-compose -f test-docker-compose.yml up -d
  • ./test.sh
  • To stop services just run docker-compose -f test-docker-compose.yml down

Front-End

Try the Dataflow kit front end at https://dataflowkit.com/dfk. Its point-and-click interface generates a JSON configuration file and sends a POST request to the DFK parser.

[Image: Dataflow kit web scraping framework]

Click on the image to see Dataflow kit in action.

License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.


Comments
  • UI included?

    Will the UI available for the toscrape data be published as a general service? Without that UI, the whole service is... well, useful only to some extent...

  • Content behind login

    I would like to scrape a website behind a login form (e.g. http://quotes.toscrape.com/login). Is Dataflowkit able to send forms and keep session information during scraping? If yes, then how?
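
    For reference, a minimal Go sketch of the underlying technique: an http.Client with a cookie jar submits the form once and reuses the session cookie afterwards. This only illustrates the general mechanism, not Dataflow kit's actual API; real forms may also require hidden fields such as a CSRF token:

    package main

    import (
    	"fmt"
    	"net/http"
    	"net/http/cookiejar"
    	"net/url"
    )

    func main() {
    	// A cookie jar makes the client remember session cookies between requests.
    	jar, _ := cookiejar.New(nil)
    	client := &http.Client{Jar: jar}

    	// Submit the login form once (field names assumed from the form on the page).
    	login, err := client.PostForm("http://quotes.toscrape.com/login", url.Values{
    		"username": {"user"},
    		"password": {"pass"},
    	})
    	if err != nil {
    		panic(err)
    	}
    	login.Body.Close()

    	// Subsequent requests automatically carry the session cookie.
    	resp, err := client.Get("http://quotes.toscrape.com/")
    	if err != nil {
    		panic(err)
    	}
    	defer resp.Body.Close()
    	fmt.Println("status after login:", resp.Status)
    }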

  • JSON Lines Newline Delimited JSON (.jsonl) format support.

    Is your feature request related to a problem? Please describe. Here are some use cases for JSON Lines:

    • Store multiple JSON records in a file, so any kind of (uniform) structured data can be stored, such as a list of users, products or log entries.
    • JSON Lines can be streamed easily.
    • Quick insertions.
    • Query the last item or the last (n) items quickly.

    Describe the solution you'd like

    Add a new parameter here:

    type JSONEncoder struct {
    	JSONLines bool
    }
    

    Implement encoding to JSON Lines in the function

    func (e JSONEncoder) encode(ctx context.Context, w *bufio.Writer, payloadMD5 string, keys *map[int][]int) error {}
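
    A minimal sketch of the JSON Lines idea: marshal each record onto its own line, so the output can be appended to and streamed record by record. The record type below is hypothetical:

    package main

    import (
    	"bufio"
    	"encoding/json"
    	"os"
    )

    // record is a hypothetical scraped row; any uniform struct works.
    type record struct {
    	Title string `json:"title"`
    	URL   string `json:"url"`
    }

    func main() {
    	w := bufio.NewWriter(os.Stdout)
    	defer w.Flush()

    	enc := json.NewEncoder(w) // Encode appends a newline: one JSON object per line.
    	rows := []record{
    		{Title: "first", URL: "https://example.com/1"},
    		{Title: "second", URL: "https://example.com/2"},
    	}
    	for _, r := range rows {
    		if err := enc.Encode(r); err != nil {
    			panic(err)
    		}
    	}
    }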

  • ./build_docker_images.sh errors

    Hi,

    I'm trying to deploy dataflowkit locally and I'm getting some errors. Could you point me somewhere? Thanks.

    go get -u github.com/slotix/dataflowkit
    package github.com/slotix/dataflowkit: no Go files in /opt/go/gopath/src/github.com/slotix/dataflowkit1. 
    
    ./build_docker_images.sh
    rm -f parse.d
    CGO_ENABLED=0 \
    GOOS=linux GOARCH=amd64 \
    go build \
            -ldflags "-s -w -X main.Release=1.0.0 \
            -X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:48:38" \
            -a -installsuffix cgo \
            -o parse.d
    docker build -t slotix/dfk-parse:1.0.0 .
    Sending build context to Docker daemon  13.34MB
    Step 1/5 : FROM alpine:latest
     ---> e7d92cdc71fe
    Step 2/5 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
     ---> Running in 82e93e366823
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
    v3.11.3-59-gf70c7aa335 [http://dl-cdn.alpinelinux.org/alpine/v3.11/main]
    v3.11.3-60-gb3a10d424a [http://dl-cdn.alpinelinux.org/alpine/v3.11/community]
    OK: 11262 distinct packages available
    (1/1) Installing ca-certificates (20191127-r1)
    Executing busybox-1.31.1-r9.trigger
    Executing ca-certificates-20191127-r1.trigger
    OK: 6 MiB in 15 packages
    Removing intermediate container 82e93e366823
     ---> 4d80a7286186
    Step 3/5 : COPY parse.d /
     ---> 286d8b08db13
    Step 4/5 : EXPOSE 8002
     ---> Running in a5e72b1c1871
    Removing intermediate container a5e72b1c1871
     ---> 736529c4a6f9
    Step 5/5 : ENTRYPOINT ./parse.d
     ---> Running in 1172c19b8c11
    Removing intermediate container 1172c19b8c11
     ---> 447c06723384
    Successfully built 447c06723384
    Successfully tagged slotix/dfk-parse:1.0.0
    docker tag slotix/dfk-parse:1.0.0 slotix/dfk-parse:latest
    docker push slotix/dfk-parse:1.0.0
    The push refers to repository [docker.io/slotix/dfk-parse]
    ad3827963233: Preparing 
    3b36fe6a41bd: Preparing 
    5216338b40a7: Preparing 
    denied: requested access to the resource is denied
    make: *** [Makefile:30: push] Error 1
    rm -f fetch.d
    CGO_ENABLED=0 \
    GOOS=linux GOARCH=amd64 \
    go build \
            -ldflags "-s -w -X main.Release=1.0.0 \
            -X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:48:53" \
            -a -installsuffix cgo \
            -o fetch.d
    docker build -t slotix/dfk-fetch:1.0.0 .
    Sending build context to Docker daemon  14.37MB
    Step 1/5 : FROM alpine:latest
     ---> e7d92cdc71fe
    Step 2/5 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
     ---> Using cache
     ---> 4d80a7286186
    Step 3/5 : COPY fetch.d /
     ---> 4486675434a9
    Step 4/5 : EXPOSE 8000
     ---> Running in b38c5b5be50c
    Removing intermediate container b38c5b5be50c
     ---> a8f18189e717
    Step 5/5 : ENTRYPOINT ./fetch.d
     ---> Running in 800cb67a52a1
    Removing intermediate container 800cb67a52a1
     ---> 145c6d53aa06
    Successfully built 145c6d53aa06
    Successfully tagged slotix/dfk-fetch:1.0.0
    docker tag slotix/dfk-fetch:1.0.0 slotix/dfk-fetch:latest
    docker push slotix/dfk-fetch:1.0.0
    The push refers to repository [docker.io/slotix/dfk-fetch]
    6134c7560bc9: Preparing 
    3b36fe6a41bd: Preparing 
    5216338b40a7: Preparing 
    denied: requested access to the resource is denied
    make: *** [Makefile:29: push] Error 1
    rm -f testserver
    CGO_ENABLED=0 \
    GOOS=linux GOARCH=amd64 \
    go build \
            -ldflags "-s -w -X main.Release=1.0.0 \
            -X main.Commit=096cadd -X main.BuildTime=2020-02-12_13:49:07" \
            -a -installsuffix cgo \
            -o testserver
    docker build -t slotix/dfk-testserver:1.0.0 .
    Sending build context to Docker daemon  9.451MB
    Step 1/6 : FROM alpine:latest
     ---> e7d92cdc71fe
    Step 2/6 : RUN apk update && apk add ca-certificates && rm -rf /var/cache/apk/*
     ---> Using cache
     ---> 4d80a7286186
    Step 3/6 : COPY testserver /
     ---> 11023bdec3be
    Step 4/6 : COPY web /web
     ---> 792c372decae
    Step 5/6 : EXPOSE 12345
     ---> Running in 68552375095d
    Removing intermediate container 68552375095d
     ---> 92c4a2fb6531
    Step 6/6 : ENTRYPOINT ./testserver
     ---> Running in e86fad37fd33
    Removing intermediate container e86fad37fd33
     ---> 78aaa71181d3
    Successfully built 78aaa71181d3
    Successfully tagged slotix/dfk-testserver:1.0.0
    docker tag slotix/dfk-testserver:1.0.0 slotix/dfk-testserver:latest
    docker push slotix/dfk-testserver:1.0.0
    The push refers to repository [docker.io/slotix/dfk-testserver]
    35457e462bc2: Preparing 
    3db1ec4fc644: Preparing 
    3b36fe6a41bd: Preparing 
    5216338b40a7: Preparing 
    denied: requested access to the resource is denied
    make: *** [Makefile:29: push] Error 1
    
  • Kubernetes/Swarm/Multiple instances

    Great work with this. I am wondering if you have deployed multiple instances behind a load balancer? Have you found a good way to do this, e.g. Traefik or some other Kubernetes or Docker Swarm integration?

  • Add stat information to Task

    Currently, not much information about a Parse task is returned except the output file path. Some extra information should be added, such as request counts divided by type (initial, paginator, details), response count, error count, elapsed time, etc.
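
    A hypothetical shape for such a stats block, just to make the request concrete (not part of the current Dataflow kit API):

    type TaskStats struct {
    	InitialRequests   int           // requests for start URLs
    	PaginatorRequests int           // requests triggered by the paginator
    	DetailRequests    int           // requests for detail pages
    	Responses         int           // responses received
    	Errors            int           // failed requests
    	Elapsed           time.Duration // total time taken by the task
    }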

  • Example of a along for doctors

    I read it was used for this. Is the script public? I want to get an idea of a production example and any issues that come up. Great toolkit, and really useful in Golang.

  • Multiple robots.txt files on the server. How to process them correctly?

    Let's consider the following case. Domain: http://example.com . Obviously, the robots.txt file is located at http://example.com/robots.txt . This robots.txt has no access restrictions.

    Let's assume we have a link like http://adv.example.com/click?item=1 to be scraped. It redirects to http://example.com/item1 . For security reasons, the second robots.txt file, at http://adv.example.com/robots.txt ,

    User-agent: *
    Disallow: /
    

    forbids everyone from accessing the page http://adv.example.com/click?item=1 . But the redirected page http://example.com/item1 is open for crawling according to http://example.com/robots.txt .

    To respect robots.txt, we have to parse it BEFORE downloading the corresponding page. But following the rules listed in http://adv.example.com/robots.txt keeps us from accessing the final redirected page http://example.com/item1 : fetching stops and the error "Forbidden by robots.txt" is returned.

    So... the only solution that comes to my mind is to download the page, build the robots.txt link from the final redirected page's response, and check whether processing it is allowed by that robots.txt.

    Please have a look at robotstxt.mw.go func (mw robotstxtMiddleware) Fetch(req interface{}) (output interface{}, err error) {}

    Please share your ideas about the most elegant solution.
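
    A minimal sketch of that idea using the github.com/temoto/robotstxt package; the helper below is hypothetical, not the actual middleware code:

    package main

    import (
    	"fmt"
    	"io"
    	"net/http"

    	"github.com/temoto/robotstxt"
    )

    // allowedAfterRedirects downloads the page, then checks the robots.txt of the
    // FINAL host the request was redirected to, as proposed above.
    func allowedAfterRedirects(rawURL, agent string) (bool, error) {
    	resp, err := http.Get(rawURL)
    	if err != nil {
    		return false, err
    	}
    	defer resp.Body.Close()

    	// resp.Request.URL is the URL after all redirects have been followed.
    	final := resp.Request.URL
    	rresp, err := http.Get(final.Scheme + "://" + final.Host + "/robots.txt")
    	if err != nil {
    		return false, err
    	}
    	defer rresp.Body.Close()

    	body, err := io.ReadAll(rresp.Body)
    	if err != nil {
    		return false, err
    	}
    	robots, err := robotstxt.FromBytes(body)
    	if err != nil {
    		return false, err
    	}
    	return robots.TestAgent(final.Path, agent), nil
    }

    func main() {
    	ok, err := allowedAfterRedirects("http://example.com/", "DataflowKitBot")
    	fmt.Println(ok, err)
    }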

  • Implementation of GZip encode feature.

    Allows user to get result files in GZip format.

    Solution: the payload needs to be supplemented with a 'compressor' field representing the compression method: 'gz' for GZip, etc. The result file's extension should be 'gz' if a 'compressor' is applied, and the native 'format' extension if not.

    type Payload struct {
    	Compressor string
    }
    
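    A minimal sketch of the encoding side using the standard compress/gzip package; the wrapCompressor helper and its wiring are hypothetical:

    package main

    import (
    	"bufio"
    	"compress/gzip"
    	"io"
    	"os"
    )

    // wrapCompressor wraps the output writer when a compressor is requested;
    // "gz" is the only method sketched here.
    func wrapCompressor(w io.Writer, compressor string) io.WriteCloser {
    	if compressor == "gz" {
    		return gzip.NewWriter(w)
    	}
    	return nopCloser{w} // no compression: pass the writer through
    }

    type nopCloser struct{ io.Writer }

    func (nopCloser) Close() error { return nil }

    func main() {
    	out := bufio.NewWriter(os.Stdout)
    	defer out.Flush()

    	w := wrapCompressor(out, "gz")
    	w.Write([]byte(`{"title":"example"}` + "\n"))
    	w.Close() // flushes the gzip stream; such a file would get the .gz extension
    }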