soup

Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented till now:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string,error) {} // Takes the url as an argument, returns HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the url and a custom HTTP client as arguments, returns HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the url, bodyType, and payload as arguments, returns HTML string
func PostForm(string, url.Values) (string, error) {} // Takes the url and body. bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie to be sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a Root struct holding a pointer to the constructed DOM
func Find([]string) Root {} // Takes the element tag and (optionally) an attribute key-value pair as arguments; a pointer to the first occurrence is returned
func FindAll([]string) []Root {} // Same as Find(), but pointers to all occurrences are returned
func FindStrict([]string) Root {} // Same as Find(), but attribute values must match exactly; a pointer to the first occurrence is returned
func FindAllStrict([]string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Find all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string {} // Full text inside a non-nested tag returned, first half returned in a nested one
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() string {} // Returns the HTML code for the specific element
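
For instance, the request-level helpers above can be combined before a call to Get(). A minimal sketch (the URL, header, and cookie values below are placeholders, not part of the package):

package main

import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)

func main() {
	// Set several headers at once via the exported Headers map,
	// or one at a time via Header(); Cookie() works the same way.
	soup.Headers = map[string]string{
		"User-Agent": "soup-example/1.0",
	}
	soup.Cookie("session", "placeholder-value")

	resp, err := soup.Get("https://example.com")
	if err != nil {
		os.Exit(1)
	}
	fmt.Println(len(resp), "bytes of HTML fetched")
}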

Root is a struct, containing three fields:

  • Pointer containing the pointer to the current HTML node
  • NodeValue containing the current HTML node's value, i.e. the tag name for an ElementNode, or the text in case of a TextNode
  • Error containing an error in a struct if one occurs, else nil is returned. A detailed text explanation of the error can be accessed using the Error() function. A field Type in this struct, of type ErrorType, denotes the kind of error that took place, and will be one of the following:
    • ErrUnableToParse
    • ErrElementNotFound
    • ErrNoNextSibling
    • ErrNoPreviousSibling
    • ErrNoNextElementSibling
    • ErrNoPreviousElementSibling
    • ErrCreatingGetRequest
    • ErrInGetRequest
    • ErrReadingResponse
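
For example, a Find() that matches nothing can be detected through the Error field. A short sketch (doc is a Root obtained from HTMLParse; the sketch also assumes the error struct is exported as soup.Error, which is not spelled out above):

res := doc.Find("div", "id", "does-not-exist")
if res.Error != nil {
	// Assumption: the concrete type behind Error is soup.Error and carries
	// the Type field described above; adjust if the package differs.
	if se, ok := res.Error.(soup.Error); ok && se.Type == soup.ErrElementNotFound {
		fmt.Println("element not found:", se)
	}
}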

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

Example code is given below to scrape the "Comics I Enjoy" part (text and its links) from xkcd.

More Examples

package main

import (
	"fmt"
	"github.com/anaskhan96/soup"
	"os"
)

func main() {
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}

Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome, to make it a better web scraper. If you think there should be a particular feature or function included in the package, feel free to open up a new issue or pull request.

Owner

Anas Khan
Trying to make a meaningful contribution to open source since 2016

Comments
  • Find by single class

    Currently Find("a", "class", "message") would only work if it was <a class="message"></a> but would not work on <a class="message input-message"></a> even though they are both of class message.

    Could this be added?
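
    A hedged workaround sketch until multi-class matching is supported (doc, the strings import, and the "message" class are assumptions for illustration): fetch the candidate tags with FindAll() and filter on the class attribute manually.

    	for _, a := range doc.FindAll("a") {
    		// Attrs()["class"] holds the full class string; split it on whitespace
    		// and look for the single class of interest.
    		for _, c := range strings.Fields(a.Attrs()["class"]) {
    			if c == "message" {
    				fmt.Println(a.Text())
    				break
    			}
    		}
    	}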

  • &nbsp; causes no text to be returned

    An odd issue I'm having while trying to use soup to parse Fmylife's site for FMLs is when I get an FML that has the (&)nbsp; tag

    <p class="block">
    <a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
    <span class="icon-piment"></span>&nbsp;
    [Insert FML text here] FML
    </a>
    </p>
    

    when I try to call the text, it returns blank text and nothing else.

    I usually call it using .Find("p", "class", "block").Find("a").Text() and if it doesn't have the whitespace tag, it returns fine.
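
    A possible workaround sketch based on the FullText() function from the README (not verified against this exact page): FullText() walks nested children, so it should still return the link text even when a &nbsp; text node comes first.

    	text := doc.Find("p", "class", "block").Find("a").FullText()
    	fmt.Println(strings.TrimSpace(text)) // trims a leading non-breaking space, if any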

  • Proposal: Add an "Empty" func to Root that would make it easier to tell when a query didn't return results

    Right now I suppose you would do this by checking if error was non-nil and then check the error to see if it contained "not found", which you would only know about if you read the source code of this project 😄

    I think what I am proposing is to add something that does that check for you in the library. Maybe something like:

    func (r Root) Empty() bool {
        if r.Error == nil {
            return false
        }
        // Root.Error is an error value, so check its message text via Error()
        return strings.Contains(r.Error.Error(), "not found")
    }
    

    Is this something other people would see as valuable? I would use it sorta like this:

    main := doc.Find("section", "class", "gramb")
    if main.Empty() {
      return errors.New("No results for this query")
    }
    defs := main.FindAll("span", "class", "ind")
    // Other processing here
    

    Right now I'm just checking if main.Error is non-nil and returning no results. It would just be nice (I think) to have a cleaner interface around it.

    If you think this is worth doing I'd love to take a crack at it!

    Thanks for this library, it's immensely helpful to my side project 😄

  • [BUG]: Search classes with spaces fails every time (even in the weather example you provided)

    Hi, I tried your weather example and it always throws an "invalid memory address" error. I tried to reproduce the same bug with another website, and it can actually only search for classes without any spaces in them. I don't know why, but your parser stopped understanding spaces. I added an fmt.Println() call to print the only class searched for with spaces (grid); here's the code:

    package main
    
    import (
    	"bufio"
    	"fmt"
    	"log"
    	"os"
    	"strings"
    
    	"github.com/anaskhan96/soup"
    )
    
    func main() {
    	fmt.Printf("Enter the name of the city : ")
    	city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
    	city = city[:len(city)-1]
    	cityInURL := strings.Join(strings.Split(city, " "), "+")
    	url := "https://www.bing.com/search?q=weather+" + cityInURL
    	resp, err := soup.Get(url)
    	if err != nil {
    		log.Fatal(err)
    	}
    	doc := soup.HTMLParse(resp)
    	grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
    	fmt.Println("Print grid:", grid)
    	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
    	conditions := grid.Find("div", "class", "wtr_condition")
    	primaryCondition := conditions.Find("div")
    	secondaryCondition := primaryCondition.FindNextElementSibling()
    	temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
    	others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
    	caption := secondaryCondition.Find("div").Text()
    	fmt.Println("City Name : " + heading)
    	fmt.Println("Temperature : " + temp + "˚C")
    	for _, i := range others {
    		fmt.Println(i.Text())
    	}
    	fmt.Println(caption)
    }
    

    And that's the output:

    Enter the name of the city : New York
    Print grid: {<nil>  element `div` with attributes `class b_antiTopBleed b_antiSideBleed b_antiBottomBleed` not found}
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x61d1f5]
    
    goroutine 1 [running]:
    github.com/anaskhan96/soup.findOnce(0x0, 0xc42005be68, 0x3, 0x3, 0xc420050000, 0x4aa247, 0xc420261e00)
    	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:304 +0x315
    github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x6e1e60, 0xc420242070, 0xc42005be68, 0x3, 0x3, 0x0, 0x0, ...)
    	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:120 +0x8d
    main.main()
    	/home/fef0/Code/Go/Test/Test.go:26 +0x4e3
    exit status 2
    

    If you notice, in the second line it was impossible to find the grid, but in fact that happens only because there are spaces in the class name. I hope you can fix this as soon as possible, bye for now!
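
    A possible workaround sketch (assuming a version of soup where a single class can match a multi-class attribute, as the FindStrict PR below implements): search by one of the classes instead of the full space-separated string, and check Error before chaining further.

    	grid := doc.Find("div", "class", "b_antiTopBleed")
    	if grid.Error != nil {
    		log.Fatal(grid.Error)
    	}
    	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()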

  • Add defined errors to the package

    Hello! I have finally come back for #29 and solved it basically the way you suggested. I'm looking to try to get involved with more Go OSS and this is my first PR so far.

    My biggest concern is that it's theoretically a breaking change, since anyone who was previously depending on having error details in Error will find those are now gone.

    The reason there are so many changes is that, as part of adding the ErrorDetails to Root, I also changed the usages of Root to use labelled fields rather than specifying everything positionally in each initialization.

    Let me know if I should add an example of this in use to either the Readme or the examples folder.

  • Check if element exists without triggering warnings in console?

    I'm curious if there's a way to check if an element exists, returning a boolean for whether it does or doesn't, rather than having the console just output something like

    2017/06/06 11:21:52 Error occurred in Find() : Element `div` with attributes `class title` not found
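
    A small sketch of how this can be expressed with the Error field on Root (described in the README above and added by the debug-mode PR further down); the exists helper is only an illustration, not part of the package, and doc is assumed to be a parsed Root:

    	func exists(r soup.Root) bool {
    		return r.Error == nil
    	}

    	// usage
    	if exists(doc.Find("div", "class", "title")) {
    		// element found; safe to keep chaining
    	}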
    
  • Crashed with SIGSEGV

    Trying to run the test weather.go on my machine and got this.

    Enter the name of the city : Brisbane
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x665715]

    goroutine 1 [running]:
    github.com/anaskhan96/soup.findOnce(0x0, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x70207e, 0x13)
    	/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:345 +0x315
    github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x75c820, 0xc000364040, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x0, ...)
    	/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:121 +0x82
    main.main()
    	/home/stevek/tmp/go-lang/src/weather.go:24 +0x49d
    exit status 2

  • FindStrict and FindAllStrict

    Implementation of the FindStrict and FindAllStrict functions based on the #15 discussion. New test scenarios have been added for these functions and for the old Find and FindAll functions.

    Also, I've implemented a slightly different algorithm for searching for a value among a tag's attribute values. The previous implementation only checked the first of the space-separated attribute values:

    strings.Fields(n.Attr[i].Val)[0]  == args[2]
    

    I've changed this behavior to search in all attribute values:

    func attributeContainsValue(attr html.Attribute, attribute, value string) bool {
    	if attr.Key == attribute {
    		for _, attrVal := range strings.Fields(attr.Val) {
    			if attrVal == value {
    				return true
    			}
    		}
    	}
    	return false
    }
    

    I think that in this case the order of the values doesn't matter, so there is no difference between <div class="first second"> and <div class="second first"> elements.
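
    For comparison, a short usage sketch of the resulting difference (element and class names are illustrative):

    	doc.Find("div", "class", "first")        // matches <div class="first second">: "first" is one of the space-separated values
    	doc.FindStrict("div", "class", "first")  // no match: the exact attribute value is "first second"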

  • Debug mode, check if element is found and correct comments

    According to #7, with this merge you will be able to:

    1. Check if the element is found with the Error field in the Root struct;
    2. Toggle debug mode with the SetDebug() function. The default is false; if set to true, the various panic() calls will be shown.

    Example to check if the node is found (no panic will appear in the terminal):

    source := soup.HTMLParse(resp)
    articles := source.Find("section", "class", "loop").FindAll("article")
    for _, article := range articles {
    	link := article.Find("h2").Find("a")
    	if link.Error == nil { // link is an instance of Root
    		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
    	}
    }
    

    Example to check if the node is found with debug mode (panic will appear in terminal):

    soup.SetDebug(true)
    
    source := soup.HTMLParse(resp)
    articles := source.Find("section", "class", "loop").FindAll("article")
    for _, article := range articles {
    	link := article.Find("h2").Find("a")
    	if link.Error == nil { // link is an instance of Root
    		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
    	}
    }
    

    Notes:

    • I added correct comments to each function, interface and struct. The comments for Node, Root, FindNextSibling and FindPrevSibling still need editing.
    • The example code should be updated.
  • FindNextSibling bug

    From the source code I see FindNextSibling calls r.Pointer.NextSibling.NextSibling, which wrongly assumes NextSibling has another NextSibling, and crashes when it does not.

    e.g.

    package main

    import (
    	"fmt"

    	"github.com/anaskhan96/soup"
    )

    const html = `<html>

      <head>
          <title>DOM Tutorial</title>
      </head>

      <body>
          <a>DOM Lesson one</a><p>Hello world!</p>
      </body>

    </html>`

    func main() {
    	doc := soup.HTMLParse(html)
    	link := doc.Find("a")
    	next := link.FindNextSibling()
    	fmt.Println(next.Text())
    }

    // $ panic: runtime error: invalid memory address or nil pointer dereference

    This also applies for FindPrevSibling.

    BTW, I suggest there should be FindNextSibling and FindNextSiblingElement as the spec describes. (This might be another issue, I guess what you want to implement is FindNextSiblingElement.)
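
    A minimal sketch of the kind of nil guard this implies (field names follow the Root struct from the README; the real fix may construct its error differently):

    	func (r Root) FindNextSibling() Root {
    		// Guard both hops: the node may have no sibling at all, or only a
    		// whitespace text node with nothing after it.
    		sib := r.Pointer.NextSibling
    		if sib == nil || sib.NextSibling == nil {
    			return Root{Error: errors.New("no next sibling found")}
    		}
    		next := sib.NextSibling
    		return Root{Pointer: next, NodeValue: next.Data}
    	}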

  • go.mod: no matching versions for query "v1.2"

    It seems go mod/get is unable to understand shorter version strings.

    I forked your repo, changed the tag to 1.2.1, and go get worked again. Currently I'm only able to get v1.1.1.

    Could you release a new version with 3 digits?

  • invalid memory address or nil pointer dereference when chaining methods

    package main
    
    import (
    	"fmt"
    	"log"
    	"net/http"
    	"time"
    
    	"github.com/anaskhan96/soup"
    )
    
    func main() {
    	go func() {
    		http.ListenAndServe(":12345", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    			fmt.Fprint(w, "OK")
    		}))
    	}()
    
    	time.Sleep(time.Second)
    
    	resp, err := soup.Get("http://127.0.0.1:12345/")
    	if err != nil {
    		log.Println("Error:", err.Error())
    		return
    	}
    
    	doc := soup.HTMLParse(resp)
    	r := doc.Find("Semething").Find("SomethingElse")
    	fmt.Println(r.Error)
    }
    

    Hello, if I try to chain the Find and FindAll methods on non-existent tags like in the example above, I get a panic:

    $ go run .
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x66ce1b]
    
    goroutine 1 [running]:
    github.com/anaskhan96/soup.findOnce(0x6b64c0?, {0xc00011fe50?, 0x1, 0x1}, 0x2?, 0x0)
            /home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:502 +0xfb
    github.com/anaskhan96/soup.Root.Find({0x0, {0x0, 0x0}, {0x766ee0, 0xc000238030}}, {0xc00011fe50?, 0x1, 0x1})
            /home/alex/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:268 +0xa5
    main.main()
            /home/alex/test/play3/main.go:24 +0x1ca
    exit status 2
    

    I believe that both func findOnce and func findAllofem should be checking if n *html.Node is nil before proceeding with the processing. Am I understanding this correctly?

    Thanks, Alex
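
    A sketch of the guard being proposed (the signature mirrors the one visible in the stack trace above; the real function may differ):

    	func findOnce(n *html.Node, args []string, uni, strict bool) (*html.Node, bool) {
    		if n == nil {
    			// A previous Find that failed leaves the pointer nil, so bail out
    			// instead of dereferencing it.
    			return nil, false
    		}
    		// ... existing search logic continues here ...
    		return nil, false
    	}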

  • Should Text() return all sibling text?

    For example:

    <div align="center">
    <a href="search_3.asp?action=up">up</a>
    &nbsp;
    <a href="search_3.asp?action=down">down</a>
    (2021-9-20~2021-9-26)
    </div>
    

    Currently, div.Text() only returns &nbsp;; wouldn't it be better if it returned &nbsp;(2021-9-20~2021-9-26)?
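
    As a hedged aside, the README's FullText() walks the whole subtree, so div.FullText() should already return the anchor texts, the &nbsp; and the date range together; whether Text() itself should include sibling text is a separate design question.

    	div := doc.Find("div", "align", "center")
    	fmt.Println(div.FullText()) // roughly "up down (2021-9-20~2021-9-26)"; exact whitespace may differ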

  • soup.HTMLParse() returning nil

    This method was previously working, but for some reason it returns nil every single time now.

    //example
    t, _ := soup.Get("https://google.com")
    fmt.Println(soup.HTMLParse(t)) //prints {address <nil>}
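
    Not a confirmed diagnosis, but a sketch of how to tell whether the parse actually failed, by checking the Error field rather than the struct's printed form:

    	t, err := soup.Get("https://google.com")
    	if err != nil {
    		log.Fatal(err)
    	}
    	doc := soup.HTMLParse(t)
    	if doc.Error != nil {
    		log.Fatal(doc.Error) // the parse really failed
    	}
    	fmt.Println(doc.Find("title").Text()) // otherwise the tree is usable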
    
  • findOnce break after the first child node.

    If the element is not found in the first child node, the value is returned, and the loop has no effect.

    I think this should be if q {

    for c := n.FirstChild; c != nil; c = c.NextSibling {
    	p, q := findOnce(c, args, true, strict)
    	if !q {
    		return p, q
    	}
    }
    

    https://github.com/anaskhan96/soup/blob/cb47551b378185a4504cf253c2abfbeea361cabf/soup.go#L504
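
    For reference, the corrected loop being proposed: return early only when the element was actually found, otherwise keep walking the siblings (a sketch of the suggestion above, not the committed fix).

    	for c := n.FirstChild; c != nil; c = c.NextSibling {
    		p, q := findOnce(c, args, true, strict)
    		if q {
    			return p, q
    		}
    	}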
