crawler-boss
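A Go crawler that scrapes job listings from Boss Zhipin (Boss直聘). Features:
1. Uses proxies to avoid IP bans.
2. Simulates a browser so the crawler is harder to detect.
3. Throttles the crawl rate.
4. Crawls with multiple goroutines.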
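Features 3 and 4 can be combined in a small worker pool. The following is only a minimal sketch of that idea, not the project's actual code: `crawlAll` and the `fetch` callback are hypothetical names, and a shared `time.Ticker` stands in for whatever throttling the crawler really uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawlAll (hypothetical helper) runs fetch for every URL using `workers`
// goroutines. A shared ticker lets at most one request start per `interval`,
// no matter how many goroutines are running. interval must be positive.
func crawlAll(urls []string, workers int, interval time.Duration, fetch func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	results := make([]string, len(urls))
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				<-ticker.C // wait for the next free slot before sending a request
				results[i] = fetch(urls[i])
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"page-1", "page-2", "page-3"} // placeholder inputs, not real URLs
	fake := func(u string) string { return "fetched " + u }
	for _, r := range crawlAll(urls, 2, 50*time.Millisecond, fake) {
		fmt.Println(r)
	}
}
```

Each goroutine writes only to its own index of `results`, so no mutex is needed for the slice.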
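Limitations
1. Failed requests are not retried, and the IP is not switched after a failure.
2. Error handling is incomplete.
3. The code structure needs to be optimized.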
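The first limitation could be addressed along these lines. This is a sketch under assumptions, not code from the repository: `fetchWithRetry`, `errBlocked` and the `fetch` callback are hypothetical names, and proxies are simply rotated round-robin on each attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlocked is a placeholder error used only by the demo below.
var errBlocked = errors.New("blocked")

// fetchWithRetry (hypothetical helper) calls fetch up to maxAttempts times,
// switching to the next proxy in the list on every attempt, and returns the
// first successful body.
func fetchWithRetry(url string, proxies []string, maxAttempts int,
	fetch func(url, proxy string) (string, error)) (string, error) {
	if len(proxies) == 0 {
		return "", errors.New("no proxies configured")
	}
	if maxAttempts < 1 {
		return "", errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		proxy := proxies[attempt%len(proxies)]
		body, err := fetch(url, proxy)
		if err == nil {
			return body, nil
		}
		lastErr = fmt.Errorf("attempt %d via %s: %w", attempt+1, proxy, err)
	}
	return "", lastErr
}

func main() {
	// Fake fetch: the first proxy is treated as banned, the second one works.
	fake := func(url, proxy string) (string, error) {
		if proxy == "proxy-a" {
			return "", errBlocked
		}
		return "body of " + url + " via " + proxy, nil
	}
	body, err := fetchWithRetry("job-list", []string{"proxy-a", "proxy-b"}, 3, fake)
	fmt.Println(body, err)
}
```

A real crawler would probably also wait between attempts; that is left out here to keep the sketch short.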
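Contact && Questions
If you spot any errors or something is unclear, feel free to ask: https://github.com/githubw2015/crawler-boss
If this project helped you, please give it a Star. Your support is my biggest motivation.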
My settings are as follows:
const (
seleniumPath = `/Users/dapeng/Downloads/others/selenium-server-4.0.0-beta-4.jar`
chromeDriverPath = `/Applications/Google\ Chrome.app/Contents/chromedriver`
port = 4444
)
It fails with this error:
panic: server did not respond on port 4444
goroutine 203 [running]:
main.main.func1(0x14de540, 0xc000250fa0, 0x145b4b5, 0x4, 0x60571c4, 0x145d4fe, 0x6)
/Users/dapeng/Documents/code/go/src/proj/crawler/main.go:89 +0x10db
created by main.main
/Users/dapeng/Documents/code/go/src/proj/crawler/main.go:87 +0x22c
exit status 2
When I set port to 8080:
const (
seleniumPath = `/Users/dapeng/Downloads/others/selenium-server-4.0.0-beta-4.jar`
chromeDriverPath = `/Applications/Google\ Chrome.app/Contents/chromedriver`
port = 8080
)
it fails with a different error:
panic: unknown error - 33: Unable to create new service: ChromeDriverService
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: 'B-K1H5JHD2-2305', ip: 'fe80:0:0:0:1c8b:e051:22cd:551a%en0', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '11.4', java.version: '16.0.1'
Driver info: driver.version: unknown
goroutine 36 [running]:
main.main.func1(0x0, 0x0, 0x145cda2, 0x6, 0x60571c4, 0x145d4fe, 0x6)
/Users/dapeng/Documents/code/go/src/proj/crawler/main.go:123 +0x10bf
created by main.main
/Users/dapeng/Documents/code/go/src/proj/crawler/main.go:87 +0x22c
exit status 2
Running the jar directly with java:
➜ ~ java -jar Downloads/others/selenium-server-4.0.0-beta-4.jar hub
18:56:59.414 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
18:56:59.420 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
18:56:59.553 INFO [BoundZmqEventBus.<init>] - XPUB binding to [binding to tcp://*:4442, advertising as tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4442], XSUB binding to [binding to tcp://*:4443, advertising as tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4443]
18:56:59.621 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4442 and tcp://[fe80:0:0:0:1c8b:e051:22cd:551a%en0]:4443
18:56:59.650 INFO [UnboundZmqEventBus.<init>] - Sockets created
18:57:00.655 INFO [UnboundZmqEventBus.<init>] - Event bus ready
18:57:01.353 INFO [Hub.execute] - Started Selenium Hub 4.0.0-beta-4 (revision 29f46d02dd): http://192.168.1.126:4444
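What should I do? (PS: running chromedriver this way also requires installing Java... it feels much harder to use than the Python version 😂) Could the problem be the version of the jar, Java, or Chrome?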
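One detail that may be worth checking regardless of versions: a Go raw string literal (backticks) does not process backslash escapes, so the `\ ` in `chromeDriverPath` stays in the string as a literal backslash followed by a space. Whether that is what triggers the ChromeDriverService error is only an assumption. The sketch below just shows what the string contains; `unescapeShellSpaces` is a hypothetical helper, not part of any Selenium library.

```go
package main

import (
	"fmt"
	"strings"
)

// unescapeShellSpaces (hypothetical helper) turns shell-style "\ " sequences
// into plain spaces, which is what a file path inside a Go string needs.
func unescapeShellSpaces(p string) string {
	return strings.ReplaceAll(p, `\ `, " ")
}

func main() {
	raw := `/Applications/Google\ Chrome.app/Contents/chromedriver`

	// In a raw string the backslash is an ordinary character of the path.
	fmt.Println(strings.Contains(raw, `\`)) // prints "true"

	// The path as the file system would expect it.
	fmt.Println(unescapeShellSpaces(raw))
}
```

If this turns out to be relevant, writing the constant with a plain space instead of `\ ` would avoid the need for any helper.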