Scrape job listings from Boss Zhipin (Boss直聘) with Go. IP proxying, browser simulation, fast and efficient.

crawler-boss

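A Go crawler for Boss Zhipin job listings. Key features:

1. Uses proxies so that the crawler's IP does not get banned.

2. Simulates a real browser to avoid being identified as a crawler.

3. Throttles the crawl rate.

4. Crawls with multiple goroutines.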

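Shortcomings

1. Failed requests are neither retried nor re-sent through a different IP.

2. Error handling is incomplete.

3. The code structure could be improved.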


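Contact && Questions

If you find a mistake or something is unclear, feel free to ask at https://github.com/githubw2015/crawler-boss

If this project helped you, please give it a Star. Your support is my biggest motivation.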

Comments
  • Is there anything to watch out for when setting port?

    My settings are as follows:

    const (
    	seleniumPath     = `/Users/dapeng/Downloads/others/selenium-server-4.0.0-beta-4.jar`
    	chromeDriverPath = `/Applications/Google\ Chrome.app/Contents/chromedriver`
    	port             = 4444
    )
    

    It fails with:

    panic: server did not respond on port 4444
    
    goroutine 203 [running]:
    main.main.func1(0x14de540, 0xc000250fa0, 0x145b4b5, 0x4, 0x60571c4, 0x145d4fe, 0x6)
            /Users/dapeng/Documents/code/go/src/proj/crawler/main.go:89 +0x10db
    created by main.main
            /Users/dapeng/Documents/code/go/src/proj/crawler/main.go:87 +0x22c
    exit status 2
    

    I then set port to 8080:

    const (
    	seleniumPath     = `/Users/dapeng/Downloads/others/selenium-server-4.0.0-beta-4.jar`
    	chromeDriverPath = `/Applications/Google\ Chrome.app/Contents/chromedriver`
    	port             = 8080
    )
    

    It fails with a different error:

    panic: unknown error - 33: Unable to create new service: ChromeDriverService
    Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
    System info: host: 'B-K1H5JHD2-2305', ip: 'fe80:0:0:0:1c8b:e051:22cd:551a%en0', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '11.4', java.version: '16.0.1'Driver info: driver.version: unknown
    
    goroutine 36 [running]:
    main.main.func1(0x0, 0x0, 0x145cda2, 0x6, 0x60571c4, 0x145d4fe, 0x6)
            /Users/dapeng/Documents/code/go/src/proj/crawler/main.go:123 +0x10bf
    created by main.main
            /Users/dapeng/Documents/code/go/src/proj/crawler/main.go:87 +0x22c
    exit status 2
    

    Running the jar directly with java:

    ➜  ~ java -jar Downloads/others/selenium-server-4.0.0-beta-4.jar hub
    18:56:59.414 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
    18:56:59.420 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
    18:56:59.553 INFO [BoundZmqEventBus.<init>] - XPUB binding to [binding to tcp://*:4442, advertising as tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4442], XSUB binding to [binding to tcp://*:4443, advertising as tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4443]
    18:56:59.621 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4442 and tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4443
    18:56:59.650 INFO [UnboundZmqEventBus.<init>] - Sockets created
    18:57:00.655 INFO [UnboundZmqEventBus.<init>] - Event bus ready
    18:57:01.353 INFO [Hub.execute] - Started Selenium Hub 4.0.0-beta-4 (revision 29f46d02dd): http://192.168.1.126:4444
    

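    What should I do? (PS: running chromedriver this way also requires installing Java... it feels much harder to use than the Python version 😂) Could this be caused by the version of the jar, Java, or Chrome?

    Version information

    • java: openjdk version "16.0.1" 2021-04-20
    • chrome: 91.0.4472.114 (Official Build) (x86_64)
    • selenium-server jar: selenium-server-4.0.0-beta-4.jar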