1. Preface
"The great affairs of a state lie in sacrifice and war."
The great affairs of Python lie in web scraping and AI.
Python has quite a few third-party HTTP request packages. After recently playing with niquests I realized I had already used several similar ones, so below is a brief round-up of the main options.
2. Packages
On HTTP/2 support:
"When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer, and our HTTP/1.1 implementation may be considered the more robust option at this point in time. It is possible that a future version of httpx may enable HTTP/2 support by default."
That is httpx's own take on HTTP/2: HTTP/1.1 is still the bedrock, and HTTP/2 may only become the default in some future release.
HTTP/1.1 vs HTTP/2: What's the Difference? | DigitalOcean
Some sites only support HTTP/2; hitting them directly with requests raises an error.
import requests
r = requests.get('https://spa16.scrape.center/')
print(r.text)
httpx does not enable HTTP/2 by default either; the relevant optional dependency has to be installed manually.
import httpx
client = httpx.Client(http2=True)
r = client.get('https://spa16.scrape.center/')
print(r.text)
print(r.status_code)
In practice there is no need to adopt one of these packages just for the sake of it: they are all called in much the same way, essentially modeled on requests, so use whichever feels most comfortable, unless a particular scenario demands something like HTTP/2 support or async.
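Because the call shape is essentially the requests style everywhere, switching libraries is mostly a matter of swapping the import. A minimal sketch (the URL is only a placeholder):

import requests
import httpx
import niquests

url = "https://example.com"

# the same one-liner works across all three libraries
print(requests.get(url, timeout=10).status_code)
print(httpx.get(url, timeout=10).status_code)
print(niquests.get(url, timeout=10).status_code)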
The sections below mainly look at how the packages differ in async usage, with requests of course being the exception.
2.1 requests
requests remains the benchmark, but the project seems to have no bigger moves in store and the addition of new features appears to have stalled.
Day-to-day usage is not detailed here; the library is everywhere, which is very friendly to beginners, since sample code and write-ups of every pitfall are easy to find.
For HTTP/2 support with requests there is the hyper adapter (search for "python requests http2"), but that package was released a long time ago.
import requests
from hyper.contrib import HTTP20Adapter
session = requests.Session()
session.mount("https://", HTTP20Adapter())
response = session.get("https://www.example.com")
2.2 httpx
Note how httpx handles HTTP/2 support:
pip install httpx
# the HTTP/2 extra has to be installed separately
pip install httpx[http2]
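Installing the extra alone is not enough; HTTP/2 still has to be switched on per client. A minimal sketch using the async client (the sync form was shown earlier); the response's http_version attribute reports the negotiated protocol:

import asyncio
import httpx


async def check_http2():
    # http2=True only takes effect when the httpx[http2] extra (the h2 package) is installed
    async with httpx.AsyncClient(http2=True) as client:
        r = await client.get('https://spa16.scrape.center/')
        print(r.http_version)  # "HTTP/2" when the server negotiates it
        print(r.status_code)


asyncio.run(check_http2())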
import httpx
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)  # compare with aiohttp below
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return r.text
    except httpx.HTTPError as exc:
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    urls = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(urls))
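The example above launches every request at once. For longer URL lists it is usually worth capping concurrency; a minimal sketch using asyncio.Semaphore (the limit of 5 and the helper names are illustrative):

import asyncio
import httpx


async def fetch_limited(client, sem, url):
    # at most `limit` requests are in flight at any moment
    async with sem:
        r = await client.get(url)
        r.raise_for_status()
        return r.text


async def spider_limited(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_limited(client, sem, u) for u in urls))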
2.3 aiohttp
import aiohttp
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30), headers=headers) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return await r.text()  # unlike httpx, text() is a coroutine here
    except aiohttp.ClientError as exc:  # aiohttp.http_exceptions is a module, not an exception class
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))
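aiohttp has a session-level knob for the same purpose: the connector's limit caps the number of simultaneous connections. A minimal sketch (the limit of 5 is arbitrary):

import asyncio
import aiohttp


async def spider_limited(urls):
    # TCPConnector(limit=...) caps simultaneous connections for the whole session
    connector = aiohttp.TCPConnector(limit=5)
    async with aiohttp.ClientSession(connector=connector,
                                     timeout=aiohttp.ClientTimeout(total=30)) as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))


async def fetch(client, url):
    async with client.get(url) as r:
        r.raise_for_status()
        return await r.text()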
2.4 niquests
Judging by its documentation, niquests is probably the "most capable" of the HTTP request libraries above: it supports HTTP/2 natively and enables it by default. Being a comparatively young project, though, it is less widely known.
- DNS over HTTPS, DNS over QUIC, DNS over TLS, and DNS over UDP (support for the latest HTTP-related protocols)
- Automatic Content Decompression and Decoding
- OS truststore by default, no more certifi!
- OCSP Certificate Revocation Verification
- Advanced connection timings inspection
- In-memory certificates (CAs, and mTLS)
- Browser-style TLS/SSL Verification
- Sessions with Cookie Persistence
- Keep-Alive & Connection Pooling
- International Domains and URLs
- Automatic honoring of .netrc
- Basic & Digest Authentication
- Familiar dict–like Cookies
- Object-oriented headers
- Multi-part File Uploads
- Post-Quantum Security
- Chunked HTTP Requests
- Fully type-annotated!
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- HTTP/2 by default (note: this is what the library enables out of the box)
- HTTP/3 over QUIC
- Early Responses
- Happy Eyeballs
- Multiplexed!
- Thread-safe!
- WebSocket!
- Trailers!
- DNSSEC!
- Async!
import niquests

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image'
              '/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept-Encoding": "gzip, deflate, br",  # fixed: the header name had stray spaces
}

r = niquests.get('https://spa16.scrape.center/', timeout=30, headers=headers)
print(r.text)
By default the request is made over HTTP/2.
import niquests
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with niquests.AsyncSession() as client:
        tasks = [fetch_url(client, url, headers) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url, headers):
    try:
        r = await client.get(url, headers=headers)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return r.text  # like httpx, .text is a plain property here
    except niquests.exceptions.RequestException as exc:  # catch a concrete class; niquests.exceptions is a module
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))
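Besides the asyncio-style API above, the feature list also advertises multiplexing on a plain synchronous Session: requests go out concurrently and responses resolve lazily on first access. The following is a sketch based on my reading of the niquests docs; treat the multiplexed=True switch and the lazy-resolution behaviour as assumptions to verify:

import niquests

urls = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]

# multiplexed=True (per the niquests docs) sends the requests concurrently over one session
with niquests.Session(multiplexed=True) as s:
    responses = [s.get(url, timeout=30) for url in urls]
    # accessing an attribute such as status_code blocks until that response is ready
    for r in responses:
        print(r.status_code, len(r.text))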
2.5 grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
grequests is a wrapper layered on top of requests, much like seaborn in the plotting world.
import grequests

# custom error handler; it receives two arguments, the request and the exception
def err_handler(_request, exception):
    print('except', exception)

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
              'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": 'https://www.qq.com'
}

links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
    "https://news.qq.com/rain/a/20241124A03Z5600",
    'https://news.qq.com/rain/a/20241124A00VE400'
]

req_list = [grequests.get(url, headers=headers) for url in links]
res_list = grequests.map(req_list, exception_handler=err_handler)
for res in res_list:
    if res is not None:  # failed requests come back as None
        print(res.text)
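grequests can also bound concurrency: map accepts a size argument that caps the gevent pool, and imap yields responses as they complete. A minimal sketch (size=2 is arbitrary):

import grequests

links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]

# `size` caps how many requests run concurrently in the gevent pool
reqs = [grequests.get(url) for url in links]
res_list = grequests.map(reqs, size=2)
print([r.status_code for r in res_list if r is not None])

# imap yields each response as soon as it completes (completion order, not input order)
for res in grequests.imap([grequests.get(url) for url in links], size=2):
    print(res.status_code)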
Looking at the code above, grequests clearly differs from the other async-capable packages: there is no need to drive asyncio by hand.
Its style is very close to how asynchronous requests are written in JavaScript.
{
    // Async network requests are where JS shines; it is asynchronous by nature
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ];
    const spider = async (url) => await (await fetch(url)).text();
    const results = await Promise.all(links.map(spider));
    results.forEach(result => console.log(result));
}
// equivalent
{
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ];
    // thanks to arrow functions, one line of JS can pack in quite a lot
    await Promise.all(links.map(async (url) => console.log(await (await fetch(url)).text())));
}
Compared with JavaScript (where async is native), Python's async story is still slightly more cumbersome and not as intuitive or easy to grasp. In fact, in some scenarios it can be considerably easier to write the crawler in JavaScript, especially when the data volume is modest (say tens of thousands to a few hundred thousand items per day).
3. Summary
Overall, as noted above, these HTTP request packages are close in code style and feature set; the main differences lie in protocol support, async support and the like (plus further details such as memory footprint and stability).
Pick whatever fits your needs; there is no point choosing for the sake of choosing. PostgreSQL is very powerful, but it is fiddly to operate, its potential pitfalls are unknowns, and problems are not easy to resolve once they appear; MySQL looks "past its prime", yet it is highly mature, every pitfall has already been stepped on by someone, and when something goes wrong a remedy is easy to find.
Looking further ahead, the core of web scraping is not in how these packages are used; it lies in breaking through defenses (JavaScript reverse engineering at the easy end, app reverse engineering at the hard end) and getting the data (at heart, scraping is a grey industry, a "hacker" trade leaning toward social engineering). As for why Python is so strongly associated with scraping, it is largely down to aggressive promotion by training outfits, which routinely use scraping to show how simple and easy to learn Python is; at the same time Python's relatively complete ecosystem gives scraping a solid foundation.
3.1 Aside: AI-generated code
I asked Alibaba's Tongyi Qianwen, Baidu's ERNIE Bot and ChatGPT to generate async request code for niquests.
Surprisingly, for such an undemanding task the first two could not produce any code at all, while ChatGPT produced code that was simply wrong.
In other words, the training corpora these AIs draw on come from an earlier period.
Note that the code ChatGPT gave is broken: it awaits the plain get() call, which is synchronous and cannot be awaited.
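To make that last point concrete, here is a minimal sketch of the failing pattern (awaiting the synchronous top-level call, roughly what the chatbot produced) next to the supported AsyncSession usage from section 2.4:

import asyncio
import niquests


async def broken(url):
    # niquests.get() is a plain synchronous call; its Response is not awaitable,
    # so this line raises "TypeError: object Response can't be used in 'await' expression"
    return await niquests.get(url)


async def working(url):
    # the supported async path goes through AsyncSession
    async with niquests.AsyncSession() as s:
        r = await s.get(url)
        return r.status_code


print(asyncio.run(working("https://example.com")))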