Common Technology Stack for Developing Crawlers in Python


1 Basic knowledge

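Information is published on the Internet in two main ways: as web pages viewed in a browser, and through client applications built for various operating system platforms.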


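Because the web has developed so quickly, most related communication protocols have converged on HTTP, so collecting information requires a working understanding of HTTP.
Browsers and web servers are the mainstream client and server that communicate over HTTP, so you should also know a little about them. This helps you recognize which requests count as legitimate access and how to obtain the data the user actually sees.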
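To make the point about recognizing legitimate access concrete, here is a minimal sketch of the request line and headers a server inspects. It only builds a request with the standard-library urllib.request and does not send it. The URL, the header values, and the helper name build_request are assumptions chosen for illustration.

```python
import urllib.request

# Hypothetical target URL and header values, chosen only for illustration.
URL = "https://www.example.com/news?page=1"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers):
    """Build (but do not send) a GET request carrying the given headers."""
    return urllib.request.Request(url, headers=headers, method="GET")


req = build_request(URL, BROWSER_HEADERS)

# Print an approximation of what goes on the wire: request line, Host, headers.
print(req.get_method(), req.selector, "HTTP/1.1")
print("Host:", req.host)
for name, value in req.header_items():
    print(f"{name}: {value}")
```

A server can compare headers such as User-Agent against what a real browser sends, which is one reason the same URL may respond differently to a script and to a browser.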


(Figure: HTTP 1.1)

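The other part consists of applications built on operating systems such as Android, iOS, HarmonyOS, Windows, and Linux. For these you need to understand the operating system's SDK or the TCP/IP protocols.
For private protocols, you can capture traffic with a network sniffing approach; see software products or development libraries such as Wireshark and WinPcap.
That said, most applications are still based on HTTP.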

(Figures: Wireshark, WinPcap)

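To speed up collecting, analyzing, and storing data, you need parallel computing.
So you should have a reasonable grasp of the concepts of threads and processes.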
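The thread concept mentioned here can be sketched with the standard-library concurrent.futures module. This is a minimal illustration, not a full crawler: fetch is a hypothetical stand-in that sleeps instead of downloading, and the page names and worker count are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page identifiers; a real crawler would use URLs here.
PAGES = ["page-1", "page-2", "page-3", "page-4"]


def fetch(page):
    """Stand-in for a network download: sleep briefly, then return fake content."""
    time.sleep(0.1)  # simulates waiting on I/O
    return f"<html>{page}</html>"


def crawl_parallel(pages, workers=4):
    """Run fetch() for every page on a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, pages))


results = crawl_parallel(PAGES)
print(results)
```

Threads suit I/O-bound work such as downloading. For CPU-bound analysis, ProcessPoolExecutor from the same module offers the same interface while using separate processes.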
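In addition, Python provides an asynchronous mechanism that cleanly decouples the implementation logic of each stage, so you should also understand asynchronous programming and its frameworks.
These include asyncio and the Twisted framework, for example: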

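import asyncio


def _read_text(file_path):
    # Plain blocking read; it runs in a worker thread (see read_file).
    with open(file_path, 'r') as f:
        return f.read()


async def read_file(file_path):
    # open()/read() block, so hand them to a thread to keep the event loop free.
    # asyncio.to_thread requires Python 3.9 or later.
    return await asyncio.to_thread(_read_text, file_path)


async def main():
    file_content = await read_file('example.txt')
    print(file_content)


asyncio.run(main())

Fetching data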

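For HTTP, Python 3 has the built-in urllib library as well as the third-party libraries urllib3 and requests.
Each makes it fairly easy to access an HTTP service through a URL.
Of these, requests manages HTTP headers and cookies by default, while urllib3 does not. Pay special attention to this in practice, because the same URL may return different results.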

import sys


def test_urllib(url):
    # Fetch the page with the built-in urllib.
    import urllib.request
    targetUrl = 'https://www.baidu.com'
    if url is not None and url.startswith('http'):
        targetUrl = url
    print(targetUrl)
    response = urllib.request.urlopen(targetUrl)
    html = response.read()
    print(html)


def test_urllib3(url):
    # Fetch the same page with the third-party urllib3.
    import urllib3
    targetUrl = 'https://www.baidu.com'
    if url is not None and url.startswith('http'):
        targetUrl = url
    http = urllib3.PoolManager()
    response = http.request('GET', targetUrl)
    html = response.data
    print(html)


# Try both libraries on every command-line argument that looks like a URL.
print(len(sys.argv))
for i, arg in enumerate(sys.argv):
    print(f"{i}: {arg}")
    url = arg
    if url is not None and url.startswith('http'):
        test_urllib(arg)
        test_urllib3(arg)

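Non-HTTP protocols need to be handled case by case.
I won't expand on that here, but consider Twisted, an asynchronous networking framework that helps manage concurrent tasks and improves development efficiency.
For example, Twisted makes it easy to write a client and a server program.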

# server.py
from twisted.internet import protocol, reactor
from twisted.protocols import basic


class Echo(basic.LineReceiver):
    def connectionMade(self):
        self.sendLine(b'Welcome to the Twisted Echo Server!')

    def lineReceived(self, line):
        self.sendLine(line)  # Echo back the received line


class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()


if __name__ == '__main__':
    port = 8000
    reactor.listenTCP(port, EchoFactory())
    print(f'Server running on port {port}...')
    reactor.run()

# client.py
from twisted.internet import reactor, protocol
from twisted.protocols import basic


class EchoClient(basic.LineReceiver):
    def connectionMade(self):
        self.sendLine(b'Hello, Server!')

    def lineReceived(self, line):
        # The first line to arrive is the server's welcome message.
        print(f'Received from server: {line.decode()}')
        self.transport.loseConnection()  # Close the connection after receiving data


class EchoClientFactory(protocol.ClientFactory):
    protocol = EchoClient

    def clientConnectionFailed(self, connector, reason):
        print(f'Connection failed: {reason}')
        reactor.stop()

    def clientConnectionLost(self, connector, reason):
        print(f'Connection lost: {reason}')
        reactor.stop()


if __name__ == '__main__':
    server_address = 'localhost'
    server_port = 8000
    factory = EchoClientFactory()
    reactor.connectTCP(server_address, server_port, factory)
    reactor.run()

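Analyzing data

I won't go into detail here. You can work through the keywords in the mind map one by one with the help of AIGC tools.
Pay particular attention to XML and JSON parsers, which are indispensable in day-to-day crawler work.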
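As a minimal sketch of the XML and JSON parsing mentioned above, the following uses only the standard-library json and xml.etree.ElementTree modules. The two payloads and the helper names titles_from_json and titles_from_xml are hypothetical, standing in for whatever a crawler actually downloads.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads, standing in for what a crawler might download.
JSON_TEXT = (
    '{"items": [{"title": "First post", "views": 10},'
    ' {"title": "Second post", "views": 25}]}'
)
XML_TEXT = """<feed>
  <item id="1"><title>First post</title></item>
  <item id="2"><title>Second post</title></item>
</feed>"""


def titles_from_json(text):
    """Parse a JSON document and collect the title of every item."""
    data = json.loads(text)
    return [item["title"] for item in data["items"]]


def titles_from_xml(text):
    """Parse an XML document and collect the <title> text of every <item>."""
    root = ET.fromstring(text)
    return [item.findtext("title") for item in root.iter("item")]


json_titles = titles_from_json(JSON_TEXT)
xml_titles = titles_from_xml(XML_TEXT)
print(json_titles)
print(xml_titles)
```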

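Storing data

When storing data in files, pay attention to binary files such as images, music, and video, to office formats such as Excel, Word, and PPT, and to the usual standard formats XML and JSON.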
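The difference between binary and text-format storage can be sketched as follows, assuming a temporary output directory. The file names, the record fields, and the helpers save_binary and save_json are hypothetical; the bytes are only the PNG signature, not a complete image.

```python
import json
import tempfile
from pathlib import Path


def save_binary(path, payload):
    """Write raw bytes (image, audio, video) without any text encoding."""
    Path(path).write_bytes(payload)


def save_json(path, record):
    """Write a record as UTF-8 JSON, keeping non-ASCII text readable."""
    text = json.dumps(record, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


out_dir = Path(tempfile.mkdtemp())
image_bytes = b"\x89PNG\r\n\x1a\n"  # PNG signature only; not a complete image
record = {"url": "https://www.example.com/a.png", "size": len(image_bytes)}

save_binary(out_dir / "a.png", image_bytes)
save_json(out_dir / "a.json", record)

# Read the metadata back to confirm it round-trips.
restored = json.loads((out_dir / "a.json").read_text(encoding="utf-8"))
print(restored)
```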

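For databases, focus on mastering SQLAlchemy.
Of course, you can also directly choose the libraries that match MySQL, Redis, or MongoDB.
For any of these, you can phrase a question and ask an AIGC tool.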
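As a dependency-free illustration of database storage, the sketch below uses the standard-library sqlite3 module rather than SQLAlchemy or the MySQL, Redis, or MongoDB drivers named above, so it is a stand-in for the idea and not a SQLAlchemy example. The table schema, the rows, and the helper name store_pages are assumptions.

```python
import sqlite3

# Hypothetical schema and rows, used only for illustration.
ROWS = [
    ("https://www.example.com/a", "Page A"),
    ("https://www.example.com/b", "Page B"),
]


def store_pages(conn, rows):
    """Create the table if needed and upsert one row per crawled URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR REPLACE keeps re-crawled URLs from raising a uniqueness error.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_pages(conn, ROWS)
store_pages(conn, ROWS)  # storing the same pages twice does not duplicate them
saved = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(saved)
```

Using the URL as the primary key is one simple way to make repeated crawls idempotent.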

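Advanced crawling

Crawling touches many technical areas. You need to analyze communication protocols and simulate the runtime environment, and sometimes even defeat security measures such as CAPTCHAs.
Here, focus on client-side simulation tools such as Selenium and Appium.
Relay (proxy) tools are also important: learning to use something like Fiddler helps you analyze communication protocols and clarify what data you are trying to obtain.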
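One way to put a relay tool between a crawler and the server is to route the crawler's requests through a local debugging proxy. The sketch below only builds a urllib opener and sends nothing. The proxy address is an assumption (a debugging proxy such as Fiddler listening on 127.0.0.1:8888), and build_proxied_opener is a hypothetical helper name.

```python
import urllib.request

# Assumption: a debugging proxy such as Fiddler is listening locally on port 8888.
PROXY = "http://127.0.0.1:8888"


def build_proxied_opener(proxy_url):
    """Return an opener that sends both HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener(PROXY)

# Inspect the opener to confirm that our proxy settings are the ones in effect.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(proxy_handlers[0].proxies)

# opener.open("http://www.example.com/") would now pass through the proxy,
# where the request and response can be inspected.
```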

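Using frameworks

Overall, choosing a framework is fairly simple, because Scrapy has developed very well.
But if you only want to try things out, consider a simpler framework such as Crawley, which provides an interface for managing crawlers.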

References
Wireshark https://www.wireshark.org/
WinPcap https://www.winpcap.org/
HTTP https://www.rfc-editor.org/rfc/rfc2616.pdf
Wenxiaoyan (文小言), Bito, Doubao (豆包)