1. Preface
The great affairs of a state lie in sacrifice and warfare.
The great affairs of Python lie in scraping and AI.

Python has quite a few third-party HTTP request packages. After recently playing with niquests, I realized I had already used several similar ones, so here is a brief rundown of the main options.
2. Packages
On HTTP/2 support, the httpx documentation notes:
When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer, and our HTTP/1.1 implementation may be considered the more robust option at this point in time. It is possible that a future version of httpx may enable HTTP/2 support by default.
In other words, HTTP/1.1 is still the bedrock, and httpx may switch its default to HTTP/2 in the future. For background, see HTTP/1.1 vs HTTP/2: What's the Difference? | DigitalOcean.
Some sites only accept HTTP/2, and hitting them directly with requests raises an error.
import requests
# spa16.scrape.center is served over HTTP/2 only, so this call fails
r = requests.get('https://spa16.scrape.center/')
print(r.text)
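To see the failure programmatically rather than as a raw traceback, a minimal sketch is to catch the broad requests.exceptions.RequestException base class (the exact exception type can vary with the environment):
import requests
try:
    # requests only speaks HTTP/1.1, so an HTTP/2-only site ends in a connection-level error
    r = requests.get('https://spa16.scrape.center/', timeout=10)
    print(r.status_code)
except requests.exceptions.RequestException as exc:
    print(f'request failed: {exc!r}')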
httpx does not enable HTTP/2 by default either; the relevant extra has to be installed manually and http2=True passed to the client.
import httpx
client = httpx.Client(http2=True)
r = client.get('https://spa16.scrape.center/')
print(r.text)
print(r.status_code)
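To confirm which protocol was actually negotiated, httpx exposes http_version on the response:
import httpx
with httpx.Client(http2=True) as client:
    r = client.get('https://spa16.scrape.center/')
    # 'HTTP/2' when h2 was negotiated, otherwise 'HTTP/1.1'
    print(r.http_version)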
In practice there is no need to use a particular package just for the sake of it: the packages above all have similar calling conventions, since most are modeled on requests, so use whichever feels most natural, unless a specific scenario calls for something particular such as HTTP/2 support or async.
The rest of this section mainly looks at how the packages differ in async usage, with requests being the exception, of course.
2.1 requests
requests is the benchmark, but its author seems to have no bigger plans for it, and the addition of new features appears to have stalled.
Its usage will not be covered in detail here; it is everywhere, which is friendly to beginners, since sample code and write-ups of the common pitfalls are easy to find.
For HTTP/2 support there is the hyper adapter (searchable as "python requests http2"), but that package was released a long time ago.
import requests
from hyper.contrib import HTTP20Adapter
session = requests.Session()
# route all https:// traffic through hyper's HTTP/2 adapter
session.mount("https://", HTTP20Adapter())
response = session.get("https://www.example.com")
2.2 httpx
Note how httpx handles HTTP/2 support:
pip install httpx
# the HTTP/2 extra has to be installed separately
pip install httpx[http2]
import httpx
import asyncio
async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)
async def fetch_url(client, url):
    try:
        r = await client.get(url)  # compare with the aiohttp version below
        r.raise_for_status()  # raise for 4xx/5xx responses
        return r.text
    except httpx.HTTPError as exc:
        print(f"Error fetching {url}: {exc}")
        return None
async def main(urls):
    results = await spider(urls)
    print(results)
if __name__ == "__main__":
    urls = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(urls))
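asyncio.gather fires every request at once; for larger URL lists it usually makes sense to cap the concurrency. A minimal sketch with asyncio.Semaphore (the helper names and the limit of 5 are my own, not part of httpx):
import asyncio
import httpx

async def fetch_limited(client, sem, url):
    async with sem:  # at most `limit` requests in flight at any time
        r = await client.get(url)
        r.raise_for_status()
        return r.text

async def spider_limited(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    async with httpx.AsyncClient(timeout=30) as client:
        tasks = [fetch_limited(client, sem, url) for url in urls]
        # return_exceptions=True keeps one failed URL from cancelling the rest
        return await asyncio.gather(*tasks, return_exceptions=True)

# usage: asyncio.run(spider_limited(urls))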
2.3 aiohttp
import aiohttp
import asyncio
async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30), headers=headers) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)
async def fetch_url(client, url):
    try:
        r = await client.get(url)
        r.raise_for_status()  # raise for 4xx/5xx responses
        return await r.text() # unlike httpx, .text() is a coroutine here
    except aiohttp.ClientError as exc:
        print(f"Error fetching {url}: {exc}")
        return None
async def main(urls):
    results = await spider(urls)
    print(results)
if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))
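One more difference worth noting: the more idiomatic aiohttp pattern uses the response itself as an async context manager, which guarantees the connection goes back to the pool. A sketch of fetch_url rewritten that way:
import aiohttp

async def fetch_url(client, url):
    try:
        async with client.get(url) as r:  # response as async context manager
            r.raise_for_status()
            return await r.text()
    except aiohttp.ClientError as exc:
        print(f"Error fetching {url}: {exc}")
        return None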
2.4 niquests
Judging from its documentation, niquests looks like the most "powerful" of the HTTP libraries above: HTTP/2 is supported natively and enabled by default. It is a relatively young project, though, so it is less well known.
- DNS over HTTPS, DNS over QUIC, DNS over TLS, and DNS over UDP (support for the latest HTTP-related protocols)
- Automatic Content Decompression and Decoding
- OS truststore by default, no more certifi!
- OCSP Certificate Revocation Verification
- Advanced connection timings inspection
- In-memory certificates (CAs, and mTLS)
- Browser-style TLS/SSL Verification
- Sessions with Cookie Persistence
- Keep-Alive & Connection Pooling
- International Domains and URLs
- Automatic honoring of .netrc
- Basic & Digest Authentication
- Familiar dict-like Cookies
- Object-oriented headers
- Multi-part File Uploads
- Post-Quantum Security
- Chunked HTTP Requests
- Fully type-annotated!
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- HTTP/2 by default (note this one: enabled out of the box)
- HTTP/3 over QUIC
- Early Responses
- Happy Eyeballs
- Multiplexed! (see the sketch after the first example below)
- Thread-safe!
- WebSocket!
- Trailers!
- DNSSEC!
- Async!
 
import niquests
headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                    '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image'
                '/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept - Encoding": "gzip, deflate, br",
}
r = niquests.get('https://spa16.scrape.center/', timeout=30, headers=headers)
print(r.text)
With no extra configuration, the request above is made over HTTP/2.
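Two items on the feature list deserve a closer look: multiplexing and HTTP/2 by default. According to the niquests documentation, a session created with multiplexed=True sends requests without waiting for earlier responses and resolves them lazily via Session.gather(); the sketch below follows that documented pattern but is untested here, so treat it as an assumption:
import niquests
urls = [
    'https://news.qq.com/rain/a/20241124A04Z7K00',
    'https://news.qq.com/rain/a/20241124A056FR00',
]
with niquests.Session(multiplexed=True) as s:
    # responses are lazy placeholders until gather() is called
    responses = [s.get(url, timeout=30) for url in urls]
    s.gather()
    for r in responses:
        print(r.status_code, len(r.text))
The async API follows the same shape as httpx: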
import niquests
import asyncio
async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with niquests.AsyncSession() as client:
        tasks = [fetch_url(client, url, headers) for url in urls]
        return await asyncio.gather(*tasks)
async def fetch_url(client, url, headers):
    try:
        r = await client.get(url, headers=headers)
        r.raise_for_status()  # raise for 4xx/5xx responses
        return r.text # like httpx, .text is a plain property here
    except niquests.exceptions.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
        return None
async def main(urls):
    results = await spider(urls)
    print(results)
if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))
2.5 grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
grequests is a thin wrapper around requests, much as seaborn wraps matplotlib for plotting.
import grequests
# custom error handler; it receives two arguments, the request and the exception
def err_handler(_request, exception):
    print('except', exception)
headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
              'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": 'https://www.qq.com'
}
links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
    "https://news.qq.com/rain/a/20241124A03Z5600",
    'https://news.qq.com/rain/a/20241124A00VE400'
]
req_list = [grequests.get(url, headers=headers) for url in links]
res_list = grequests.map(req_list, exception_handler=err_handler)
for res in res_list:
    if res is not None:  # failed requests come back as None
        print(res.text)
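grequests.map also accepts a size argument to cap gevent's concurrency, and grequests.imap yields responses as they complete. As an alternative to the map call above (size=5 is arbitrary):
# cap concurrency instead of firing everything at once
res_list = grequests.map(req_list, size=5, exception_handler=err_handler)
# or consume responses as they finish
for res in grequests.imap(req_list, size=5, exception_handler=err_handler):
    print(res.status_code)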
Looking at the code above, grequests clearly differs from the other async-capable packages: there is no need to drive asyncio manually, since gevent handles the concurrency.
Its style is actually very close to how asynchronous requests are written in JavaScript.
{
    // JS's advantage in async network requests is obvious; async is its native mode of operation
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    const spider = async (url) => await (await fetch(url)).text();
    const results = await Promise.all(links.map(spider));
    results.forEach(result => console.log(result));
}
// equivalent form
{
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ];
    // thanks to arrow functions, a single line of JS can pack in quite a lot
    await Promise.all(links.map(async (url) => console.log(await (await fetch(url)).text())));
}
Compared with JavaScript, where async is built in, Python's async is still slightly more cumbersome and not as intuitive or easy to follow. In fact, in some scenarios writing the scraper in JavaScript can be much easier, especially when the data volume is modest (say tens of thousands to a few hundred thousand records per day).
3. Summary
Overall, as noted above, these HTTP request packages are close in code style and similar in functionality; the main differences lie in protocol support, async support, and so on (plus further details such as memory footprint and stability).
Pick whatever fits your needs; there is no reason to choose for the sake of choosing. It is a bit like PostgreSQL versus MySQL: PG is very powerful but fiddly to run, with unknown pitfalls that are not easy to resolve once they bite, while MySQL may look "past its prime" yet is highly mature, every pitfall has been stepped on by someone before, and when a problem appears a fix is easy to find.
Taking a longer view, the heart of scraping is not mastering the packages above; it is breaking through defenses (JavaScript reverse engineering at the easy end, app reverse engineering at the hard end) to get the data (at its core, scraping is a grey industry, a "hacker" trade leaning toward social engineering). As for why Python is so strongly tied to scraping, a large part is aggressive promotion by training outfits, which routinely use scraping to show how simple and easy Python is; at the same time Python's relatively complete ecosystem gives scraping a solid foundation.
3.1 Aside: AI-generated code
I asked Alibaba's Tongyi Qianwen, Baidu's ERNIE Bot, and ChatGPT to generate async request code for niquests.
Surprisingly, for such an undemanding request, the first two could not produce code at all, and ChatGPT produced broken code.
In other words, the (token-filling) training corpora of these models come from an earlier time period.
Note that the code ChatGPT gave is problematic: the await get() it used is not actually async-capable.
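For reference, the failure mode looks roughly like this: awaiting the synchronous API raises a TypeError, while the AsyncSession shown in 2.4 is the awaitable one. A minimal sketch (reusing spa16.scrape.center from earlier):
import asyncio
import niquests

async def main():
    # what the generated code effectively did: awaiting the sync API,
    # which raises TypeError because the plain Response is not awaitable
    # r = await niquests.Session().get('https://spa16.scrape.center/')

    # the working form, as in section 2.4
    async with niquests.AsyncSession() as s:
        r = await s.get('https://spa16.scrape.center/', timeout=30)
        print(r.status_code)

asyncio.run(main())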