Multiplexer: HTTP-Server & URL Crawler

The application is an HTTP server with a single handler.
The handler accepts a POST request with a list of URLs in JSON format.
The server fetches data from every URL in the list and returns the combined result to the client in JSON format.
If an error occurs while processing any of the URLs, processing of the whole list stops and a plain-text error is returned to the client.
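
The request and response bodies shown in the examples below map onto a small set of Go types. A minimal sketch, assuming the field names visible in the sample payloads ("urls", "results", "url", "response", "code", "body"); the type names are illustrative:

package contract

import "encoding/json"

// CrawlRequest is the incoming payload: {"urls": ["...", "..."]}.
type CrawlRequest struct {
	URLs []string `json:"urls"`
}

// CrawlResponse is the outgoing payload: {"results": [...]}.
type CrawlResponse struct {
	Results []Result `json:"results"`
}

// Result pairs a requested URL with what was fetched from it.
type Result struct {
	URL      string   `json:"url"`
	Response Response `json:"response"`
}

// Response carries the upstream status code and the decoded JSON body.
type Response struct {
	Code int             `json:"code"`
	Body json.RawMessage `json:"body"`
}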

Constraints:
- the solution must be implemented with Go 1.13 or higher
- only components of the Go standard library may be used
- the server rejects a request if it contains more than 20 URLs
- the server serves no more than 100 simultaneous incoming HTTP requests
- each incoming request may have no more than 4 simultaneous outgoing requests
- the timeout for fetching a single URL is one second
- the client may cancel a request at any moment, which must stop all operations associated with that request
- the service must support 'graceful shutdown'

Run a Server

$ go run ./cmd/multiplexer/main.go

> 2021/10/28 17:21:24 app started on port: 80

Request Validation

POST-Method

$ curl http://localhost/crawler

> method not allowed: expected "POST": got "GET"

JSON Input

$ curl -X POST http://localhost/crawler

> unsupported "Content-Type" header: expected "application/json": got ""

Empty Body

$ curl -X POST http://localhost/crawler -H "Content-Type: application/json" 

> bad request: empty request body

Input Contract Compliance

$ curl -X POST http://localhost/crawler -d "some random data" \
    -H "Content-Type: application/json"

> bad request: invalid character 's' looking for beginning of value
$ curl -X POST http://localhost/crawler -d '{"some":"field"}' \
    -H "Content-Type: application/json"

> bad request: no URLs passed
$ curl -X POST http://localhost/crawler \
    -H "Content-Type: application/json" \
    -d '{"urls":[
      "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
      "11", "12", "13", "14", "15", "16", "17", "18", "19", 
      "20", "21", "22"]}'

> max number of URLs exceeded: 22 of 20
$ curl -X POST http://localhost/crawler \
    -H "Content-Type: application/json" \
    -d '{"urls":["some random text"]}'

> invalid url: "some random text"
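
A handler performing the checks above could look roughly like this; a sketch using only the standard library (the 20-URL limit comes from the constraints, the error texts mirror the responses shown above, the names are mine):

package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

const maxURLs = 20 // limit from the task constraints

func crawlerHandler(w http.ResponseWriter, r *http.Request) {
	// Only POST is accepted.
	if r.Method != http.MethodPost {
		http.Error(w, fmt.Sprintf("method not allowed: expected %q: got %q", http.MethodPost, r.Method), http.StatusMethodNotAllowed)
		return
	}
	// The body must be declared as JSON.
	if ct := r.Header.Get("Content-Type"); ct != "application/json" {
		http.Error(w, fmt.Sprintf("unsupported \"Content-Type\" header: expected \"application/json\": got %q", ct), http.StatusUnsupportedMediaType)
		return
	}

	var req struct {
		URLs []string `json:"urls"`
	}
	switch err := json.NewDecoder(r.Body).Decode(&req); {
	case errors.Is(err, io.EOF):
		http.Error(w, "bad request: empty request body", http.StatusBadRequest)
		return
	case err != nil:
		http.Error(w, "bad request: "+err.Error(), http.StatusBadRequest)
		return
	}

	// The list must be present, non-empty and within the limit.
	if len(req.URLs) == 0 {
		http.Error(w, "bad request: no URLs passed", http.StatusBadRequest)
		return
	}
	if len(req.URLs) > maxURLs {
		http.Error(w, fmt.Sprintf("max number of URLs exceeded: %d of %d", len(req.URLs), maxURLs), http.StatusBadRequest)
		return
	}
	// Every entry must look like an absolute URL (one possible check).
	for _, raw := range req.URLs {
		if u, err := url.Parse(raw); err != nil || u.Scheme == "" || u.Host == "" {
			http.Error(w, fmt.Sprintf("invalid url: %q", raw), http.StatusBadRequest)
			return
		}
	}

	// ... hand the validated URLs over to the crawler ...
}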

Error Handling

HTTP Status Code Check

$ curl -X POST http://localhost/crawler \
    -H "Content-Type: application/json" \
    -d '{"urls":["https://httpstat.us/500"]}'

> failed to crawl "https://httpstat.us/500": unexpected response status code: 500

Request Timeout

$ curl -X POST http://localhost/crawler \
    -H "Content-Type: application/json" \
    -d '{"urls":["https://httpstat.us/200?sleep=5000"]}'

> failed to crawl "https://httpstat.us/200?sleep=5000": 
  failed to send a request: Get "https://httpstat.us/200?sleep=5000": 
  context deadline exceeded (Client.Timeout exceeded while awaiting headers)
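
Both behaviours above (rejecting non-200 responses and the one-second per-URL limit) can be expressed in a single fetch function built on an http.Client with Timeout set. A sketch; the function name and error texts are illustrative:

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// client enforces the 1-second per-URL timeout from the constraints.
var client = &http.Client{Timeout: time.Second}

// fetch downloads one URL, verifies the status code and decodes the JSON body.
// The context carries the cancellation of the whole incoming request.
func fetch(ctx context.Context, rawURL string) (json.RawMessage, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("failed to send a request: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected response status code: %d", resp.StatusCode)
	}

	var body json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, fmt.Errorf("unmarshal response body to JSON: %w", err)
	}
	return body, nil
}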

Exit Fast & Context Cancel

2021/10/28 16:53:13 app started on port: 80
2021/10/28 16:53:19 crawler: received 3 tasks: validating URL format
2021/10/28 16:53:19 crawler: starting 3 workers
2021/10/28 16:53:19 crawler: sending request: http://google.com
2021/10/28 16:53:19 crawler: sending request: http://yandex.ru
2021/10/28 16:53:19 crawler: sending request: http://69.63.176.13
2021/10/28 16:53:19 crawler: worker stopped: no more tasks
2021/10/28 16:53:19 crawler: unmarshal response body to JSON: invalid character '<' looking for beginning of value
2021/10/28 16:53:19 crawler: worker stopped: no more tasks
2021/10/28 16:53:19 crawler: error occurred: stopping other goroutines
2021/10/28 16:53:19 crawler: send request: Get "http://69.63.176.13": context canceled
2021/10/28 16:53:19 crawler: worker stopped: context canceled
2021/10/28 16:53:19 crawler: send request: Get "http://www.google.com/": context canceled
2021/10/28 16:53:19 crawler: error occurred: skipping new results
2021/10/28 16:53:19 crawler: error occurred: skipping new results
2021/10/28 16:53:19 crawler: worker stopped: context canceled
2021/10/28 16:53:19 crawler: results channel closed
2021/10/28 16:53:19 crawler: exit with error: failed to crawl "http://yandex.ru": unmarshal response body to JSON: invalid character '<' looking for beginning of value
2021/10/28 16:53:19 handler: failed to crawl "http://yandex.ru": unmarshal response body to JSON: invalid character '<' looking for beginning of value

Graceful Shutdown

2021/10/28 16:53:13 app started on port: 80
...
^C2021/10/28 16:56:44 OS signal received: interrupt
2021/10/28 16:56:44 http: setting graceful timeout: 3.00s
2021/10/28 16:56:44 http: awaiting traffic to stop: 3.00s
2021/10/28 16:56:44 http: shutting down: disabling keep-alive
2021/10/28 16:56:44 closer: http: shutting down: context deadline exceeded

Process finished with exit code 0
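
Shutdown wiring consistent with the log above might look like this; a sketch built on signal.Notify and http.Server.Shutdown (the 3-second timeout matches the log, the rest is illustrative):

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// The /crawler handler is assumed to be registered on the default mux.
	srv := &http.Server{Addr: ":80"}

	go func() {
		log.Println("app started on port: 80")
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("http server: %v", err)
		}
	}()

	// Block until SIGINT or SIGTERM arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	sig := <-stop
	log.Printf("OS signal received: %v", sig)

	// Give in-flight requests up to 3 seconds to finish, then force-close.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("http: shutting down: %v", err)
	}
}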

Limited Number of Simultaneous Incoming Requests

The problem is solved with a simple buffered-channel window. Before a new connection can be established, it has to acquire a slot (a value is queued into the channel); when the connection is closed, the slot is released.
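
One way to implement that scheme with the standard library alone is a net.Listener wrapper whose Accept blocks on the buffered channel and whose connections give the slot back on Close (similar in spirit to golang.org/x/net/netutil.LimitListener, which is out of bounds here since only the standard library is allowed). A sketch with illustrative names:

package main

import (
	"net"
	"sync"
)

const maxInFlight = 100 // limit from the task constraints

// limitListener caps the number of simultaneously open connections.
type limitListener struct {
	net.Listener
	slots chan struct{} // the buffered-channel "window"
}

func newLimitListener(l net.Listener, n int) *limitListener {
	return &limitListener{Listener: l, slots: make(chan struct{}, n)}
}

// Accept blocks until a slot is free, then accepts the next connection.
func (l *limitListener) Accept() (net.Conn, error) {
	l.slots <- struct{}{} // acquire (blocks while n connections are open)
	conn, err := l.Listener.Accept()
	if err != nil {
		<-l.slots // release on failure
		return nil, err
	}
	return &limitConn{Conn: conn, release: func() { <-l.slots }}, nil
}

// limitConn releases its slot exactly once, when the connection is closed.
type limitConn struct {
	net.Conn
	once    sync.Once
	release func()
}

func (c *limitConn) Close() error {
	err := c.Conn.Close()
	c.once.Do(c.release)
	return err
}

The server would then be started with srv.Serve(newLimitListener(ln, maxInFlight)) instead of srv.ListenAndServe(), where ln comes from net.Listen("tcp", ":80").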

Limited Number of Outgoing Requests

The problem can be solved in several ways: for example, with a fixed number of workers that pull tasks from a shared queue, or with a goroutine per URL where each goroutine first acquires a spot in a buffered-channel window, capping the number of concurrently running tasks.

An unbounded number of goroutines (the size of the URLs array is not known up front) can waste resources and eventually crash the process. That is why I chose the worker-pool approach: it solves exactly this problem.
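
A worker pool along those lines, where the first failure cancels the context and stops the remaining work, could be sketched as follows (fetch is the per-URL function from the timeout sketch above; the other names are mine):

package main

import (
	"context"
	"encoding/json"
	"sync"
)

const maxWorkers = 4 // per-request limit from the constraints

type result struct {
	URL  string
	Body json.RawMessage
}

// crawl fans the URLs out to at most maxWorkers goroutines.
// The first error cancels ctx, which aborts the in-flight requests,
// and is returned to the caller instead of the partial results.
func crawl(ctx context.Context, urls []string) ([]result, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	tasks := make(chan string)
	results := make(chan result)
	errc := make(chan error, 1) // keeps only the first error

	var wg sync.WaitGroup
	for i := 0; i < maxWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range tasks {
				body, err := fetch(ctx, u) // fetch: see the timeout sketch above
				if err != nil {
					select {
					case errc <- err:
					default: // an error has already been reported
					}
					cancel() // stop the other workers as soon as possible
					return
				}
				select {
				case results <- result{URL: u, Body: body}:
				case <-ctx.Done():
					return
				}
			}
		}()
	}

	// Feed the tasks; close the results channel once all workers are done.
	go func() {
		defer close(tasks)
		for _, u := range urls {
			select {
			case tasks <- u:
			case <-ctx.Done():
				return
			}
		}
	}()
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []result
	for r := range results {
		out = append(out, r)
	}
	select {
	case err := <-errc:
		return nil, err
	default:
		return out, nil
	}
}

Because results are collected in completion order, the order in the response does not necessarily match the order of the input list.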

Happy Path

{ "results": [ { "url": "https://jsonplaceholder.typicode.com/todos/3", "response": { "code": 200, "body": { "userId": 1, "id": 3, "title": "fugiat veniam minus", "completed": false } } }, ... { "url": "https://jsonplaceholder.typicode.com/todos/20", "response": { "code": 200, "body": { "userId": 1, "id": 20, "title": "ullam nobis libero sapiente ad optio sint", "completed": true } } } ] } ">
$ curl -X POST http://localhost/crawler -H "Content-Type: application/json" -d '{"urls":[ 
    "https://jsonplaceholder.typicode.com/todos/1", 
    "https://jsonplaceholder.typicode.com/todos/2", 
    "https://jsonplaceholder.typicode.com/todos/3", 
    "https://jsonplaceholder.typicode.com/todos/4",
    "https://jsonplaceholder.typicode.com/todos/5", 
    "https://jsonplaceholder.typicode.com/todos/6", 
    "https://jsonplaceholder.typicode.com/todos/7", 
    "https://jsonplaceholder.typicode.com/todos/8", 
    "https://jsonplaceholder.typicode.com/todos/9", 
    "https://jsonplaceholder.typicode.com/todos/10", 
    "https://jsonplaceholder.typicode.com/todos/11", 
    "https://jsonplaceholder.typicode.com/todos/12", 
    "https://jsonplaceholder.typicode.com/todos/13", 
    "https://jsonplaceholder.typicode.com/todos/14", 
    "https://jsonplaceholder.typicode.com/todos/15", 
    "https://jsonplaceholder.typicode.com/todos/16", 
    "https://jsonplaceholder.typicode.com/todos/17", 
    "https://jsonplaceholder.typicode.com/todos/18", 
    "https://jsonplaceholder.typicode.com/todos/19", 
    "https://jsonplaceholder.typicode.com/todos/20" 
]}'

> {
  "results": [
    {
      "url": "https://jsonplaceholder.typicode.com/todos/3",
      "response": {
        "code": 200,
        "body": {
          "userId": 1,
          "id": 3,
          "title": "fugiat veniam minus",
          "completed": false
        }
      }
    },
    ...
    {
      "url": "https://jsonplaceholder.typicode.com/todos/20",
      "response": {
        "code": 200,
        "body": {
          "userId": 1,
          "id": 20,
          "title": "ullam nobis libero sapiente ad optio sint",
          "completed": true
        }
      }
    }
  ]
}