1. Preface
The great affairs of a state lie in sacrifice and warfare.
The great affairs of Python lie in web scraping and AI.

Python has plenty of third-party packages for making HTTP requests. After recently playing with niquests, I realized I had already used several similar ones, so below is a brief rundown of the main options.
2. Packages
On HTTP/2 support
When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer, and our HTTP/1.1 implementation may be considered the more robust option at this point in time. It is possible that a future version of httpx may enable HTTP/2 support by default.
That is the httpx documentation's take on HTTP/2: HTTP/1.1 is still the bedrock, and HTTP/2 may become the default at some point in the future.
HTTP/1.1 vs HTTP/2: What's the Difference? | DigitalOcean
Some sites only support HTTP/2; accessing them directly with requests raises an error.
import requests
r = requests.get('https://spa16.scrape.center/')
print(r.text)
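Run against this HTTP/2-only site, the call above does not come back with a normal response; it fails with a connection/protocol error. A minimal sketch of handling that failure, catching only the broad RequestException since the exact subclass depends on how the server drops the HTTP/1.1 request:
import requests

try:
    r = requests.get('https://spa16.scrape.center/')
    print(r.status_code)
except requests.exceptions.RequestException as exc:
    # the HTTP/2-only server rejects the HTTP/1.1 request, which surfaces
    # here as a connection/SSL error subclass of RequestException
    print('requests (HTTP/1.1 only) failed:', exc)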
httpx does not enable HTTP/2 by default; the relevant optional dependency has to be installed manually.
import httpx
client = httpx.Client(http2=True)
r = client.get('https://spa16.scrape.center/')
print(r.text)
print(r.status_code)
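httpx exposes the negotiated protocol on the response, so it is easy to confirm that HTTP/2 was actually used. Continuing the snippet above:
# "HTTP/2" when the h2 extra is installed and the server supports it,
# otherwise "HTTP/1.1"
print(r.http_version)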
In practice there is no need to go out of your way to use any particular one of them: the packages above are all called in much the same way, since they are largely modelled on requests, so use whichever feels most comfortable, except where a special requirement such as HTTP/2 support or async comes into play.
Below we mainly look at how the packages differ in async usage; requests is, of course, the exception.
2.1 requests
requests remains the benchmark, but the author does not seem to have bigger plans for it and the addition of new features appears to have stalled.
Usage is not covered in detail here: examples are everywhere, which is friendly for beginners, since sample code and the usual pitfalls are all easy to find.
For HTTP/2 support with requests there is the hyper adapter ("python requests http2"), but that package was released a long time ago.
import requests
from hyper.contrib import HTTP20Adapter
session = requests.Session()
session.mount("https://", HTTP20Adapter())
response = session.get("https://www.example.com")
2.2 httpx
Note that HTTP/2 support in httpx requires an extra dependency:
pip install httpx
# HTTP/2 support needs the optional extra
pip install httpx[http2]
import httpx
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)  # compare with aiohttp below
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return r.text
    except httpx.HTTPError as exc:
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    urls = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(urls))
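The example above fires all the requests at once; for a larger URL list it is common to cap concurrency. A minimal sketch using asyncio.Semaphore (the limit of 5 is an arbitrary assumption, not part of the original example):
import asyncio
import httpx


async def fetch_limited(client, sem, url):
    # the semaphore caps how many requests are in flight at the same time
    async with sem:
        try:
            r = await client.get(url)
            r.raise_for_status()
            return r.text
        except httpx.HTTPError as exc:
            print(f"Error fetching {url}: {exc}")
            return None


async def spider(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_limited(client, sem, u) for u in urls))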
2.3 aiohttp
import aiohttp
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30), headers=headers) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return await r.text()  # unlike httpx, text() is a coroutine here
    except aiohttp.ClientError as exc:  # aiohttp.http_exceptions is a module, not an exception class
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(links))
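aiohttp's documentation favours using the response as an async context manager, which releases the connection back to the pool deterministically. A sketch of fetch_url in that style:
import aiohttp


async def fetch_url(client, url):
    try:
        # the context manager releases the connection when the block exits
        async with client.get(url) as r:
            r.raise_for_status()
            return await r.text()
    except aiohttp.ClientError as exc:
        print(f"Error fetching {url}: {exc}")
        return None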
2.4 niquests
Judging from its documentation, niquests looks like the "most capable" of the HTTP libraries above: it supports HTTP/2 natively and enables it by default. It is, however, a relatively young project, so it is less well known.
- DNS over HTTPS, DNS over QUIC, DNS over TLS, and DNS over UDP (support for the newest HTTP-related protocols)
- Automatic Content Decompression and Decoding
- OS truststore by default, no more certifi!
- OCSP Certificate Revocation Verification
- Advanced connection timings inspection
- In-memory certificates (CAs, and mTLS)
- Browser-style TLS/SSL Verification
- Sessions with Cookie Persistence
- Keep-Alive & Connection Pooling
- International Domains and URLs
- Automatic honoring of .netrc
- Basic & Digest Authentication
- Familiar dict–like Cookies
- Object-oriented headers
- Multi-part File Uploads
- Post-Quantum Security
- Chunked HTTP Requests
- Fully type-annotated!
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- HTTP/2 by default (note: this is the protocol enabled out of the box)
- HTTP/3 over QUIC
- Early Responses
- Happy Eyeballs
- Multiplexed!
- Thread-safe!
- WebSocket!
- Trailers!
- DNSSEC!
- Async!
import niquests

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image'
              '/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',  # fixed: no spaces inside the header name
}
r = niquests.get('https://spa16.scrape.center/', timeout=30, headers=headers)
print(r.text)
By default the request above is already made over HTTP/2.
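niquests also advertises multiplexing many requests over a single connection. The following is only a sketch of that feature as I understand it from the project README; the multiplexed flag and the gather() call are taken from there and should be treated as assumptions rather than verified API:
import niquests

# Assumption: Session(multiplexed=True) issues requests lazily and
# session.gather() resolves them all over the multiplexed connection.
with niquests.Session(multiplexed=True) as s:
    responses = [
        s.get("https://news.qq.com/rain/a/20241124A04Z7K00"),
        s.get("https://news.qq.com/rain/a/20241124A056FR00"),
    ]
    s.gather()  # block until all in-flight responses are resolved
    for r in responses:
        print(r.status_code)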
The async usage mirrors httpx, via AsyncSession:
import niquests
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with niquests.AsyncSession() as client:
        tasks = [fetch_url(client, url, headers) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url, headers):
    try:
        r = await client.get(url, headers=headers)
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return r.text  # like httpx/requests, .text is a plain attribute
    except niquests.exceptions.RequestException as exc:  # the module itself is not an exception class
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(links))
2.5 grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
grequests is a wrapper around requests, a bit like what seaborn is to plotting.
import grequests


# custom error handler; it receives two arguments: the request and the exception
def err_handler(_request, exception):
    print('except', exception)


headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
              'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.qq.com'
}
links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
    "https://news.qq.com/rain/a/20241124A03Z5600",
    "https://news.qq.com/rain/a/20241124A00VE400",
]
req_list = [grequests.get(url, headers=headers) for url in links]
res_list = grequests.map(req_list, exception_handler=err_handler)
for res in res_list:
    if res is not None:  # failed requests come back as None
        print(res.text)
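grequests.map also accepts a size parameter to cap the number of concurrent green threads, and grequests.imap yields responses as they complete rather than waiting for the whole batch. A brief self-contained sketch (size=2 is an arbitrary choice):
import grequests

urls = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]
reqs = [grequests.get(u) for u in urls]
# cap concurrency at 2 in-flight requests and consume responses
# in completion order rather than list order
for res in grequests.imap(reqs, size=2):
    if res is not None:
        print(res.url, res.status_code)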
Looking at the code above, you will notice that grequests differs considerably from the other async-capable packages: there is no need to drive asyncio by hand.
Its style is very close to how asynchronous requests are written in JavaScript.
{
    // asynchronous network requests are where JS naturally shines; it is its bread and butter
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400"
    ];
    const spider = async (url) => await (await fetch(url)).text();
    const results = await Promise.all(links.map(spider));
    results.forEach(result => console.log(result));
}
// equivalent
{
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400"
    ];
    // thanks to arrow functions, one line of JS can pack in a lot of behaviour
    await Promise.all(links.map(async (url) => console.log(await (await fetch(url)).text())));
}
Compared with JavaScript, where async is built into the language, Python's async usage is still a bit more cumbersome and not as intuitive or easy to grasp. In fact, in some scenarios writing the crawler in JavaScript can be considerably easier, particularly when the data volume is modest (say, tens of thousands to a few hundred thousand pages per day).
3. Summary
Overall, as noted earlier, the HTTP packages above share a similar code style and similar functionality; the main differences lie in protocol support, async support, and so on (plus many other details such as memory footprint and stability).
Pick whichever fits your needs; there is no need to choose for the sake of choosing. PostgreSQL, for example, is very powerful, but it is fiddly to run and its potential pitfalls are uncharted, so problems are not easy to resolve once they appear; MySQL may look "past its prime", but it is highly mature, every pitfall has been stepped on by someone, and when something goes wrong a fix is easy to find.
Looking further out, the core of web scraping is not mastery of the packages above; it is breaking through defences (JavaScript reverse engineering at the easy end, app reverse engineering at the hard end) and getting the data (at heart, scraping is a grey industry, a "hacker" trade that leans toward social engineering). As for why Python is so strongly associated with scraping, much of it comes down to aggressive promotion by training providers, who routinely use crawlers to show off how simple and easy to learn Python is; at the same time, Python's fairly complete ecosystem gives scraping a solid foundation.
3.1 Aside: AI-generated code
I asked Alibaba's Tongyi Qianwen (Qwen), Baidu's ERNIE Bot, and ChatGPT to each generate async request code for niquests.
Surprisingly, for such an undemanding task, the first two could not produce code at all, and ChatGPT produced code that was simply wrong.
In other words, the training corpora these models predict from date back to an earlier period.
Note that the code ChatGPT gave is broken: it awaits a plain get(), which is not async-capable (the async API goes through AsyncSession, as in 2.4 above).