1. Preface
The great affairs of a state lie in sacrifice and warfare.
The great affairs of Python lie in web scraping and AI.

Python has plenty of third-party packages for making HTTP requests. After recently playing with niquests, I realized I had already used several similar ones, so below is a brief rundown of the main options.
2. Packages
On HTTP/2 support
When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer, and our HTTP/1.1 implementation may be considered the more robust option at this point in time. It is possible that a future version of httpx may enable HTTP/2 support by default.
That is the httpx documentation's take on HTTP/2: HTTP/1.1 is still the bedrock, and HTTP/2 may become the default at some point in the future.
HTTP/1.1 vs HTTP/2: What's the Difference? | DigitalOcean
Some sites only support HTTP/2; accessing them directly with requests raises an error.
import requests
r = requests.get('https://spa16.scrape.center/')
print(r.text)
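Run against this HTTP/2-only site, the call above does not come back with a normal response; it fails with a connection/protocol error. A minimal sketch of handling that failure, catching only the broad RequestException since the exact subclass depends on how the server drops the HTTP/1.1 request:
import requests

try:
    r = requests.get('https://spa16.scrape.center/')
    print(r.status_code)
except requests.exceptions.RequestException as exc:
    # the HTTP/2-only server rejects the HTTP/1.1 request, which surfaces
    # here as a connection/SSL error subclass of RequestException
    print('requests (HTTP/1.1 only) failed:', exc)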
httpx does not enable HTTP/2 by default; the relevant optional dependency has to be installed manually.
import httpx
client = httpx.Client(http2=True)
r = client.get('https://spa16.scrape.center/')
print(r.text)
print(r.status_code)
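httpx exposes the negotiated protocol on the response, so it is easy to confirm that HTTP/2 was actually used. Continuing the snippet above:
# "HTTP/2" when the h2 extra is installed and the server supports it,
# otherwise "HTTP/1.1"
print(r.http_version)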
In practice there is no need to go out of your way to use any particular one of them: the packages above are all called in much the same way, since they are largely modelled on requests, so use whichever feels most comfortable, except where a special requirement such as HTTP/2 support or async comes into play.
Below we mainly look at how the packages differ in async usage; requests is, of course, the exception.
2.1 requests
requests remains the benchmark, but the author does not seem to have bigger plans for it and the addition of new features appears to have stalled.
Usage is not covered in detail here: examples are everywhere, which is friendly for beginners, since sample code and the usual pitfalls are all easy to find.
For HTTP/2 support with requests there is the hyper adapter ("python requests http2"), but that package was released a long time ago.
import requests
from hyper.contrib import HTTP20Adapter
session = requests.Session()
session.mount("https://", HTTP20Adapter())
response = session.get("https://www.example.com")
2.2 httpx
Note that HTTP/2 support in httpx requires an extra dependency:
pip install httpx
# HTTP/2 support needs the optional extra
pip install httpx[http2]
import httpx
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)  # compare with aiohttp below
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return r.text
    except httpx.HTTPError as exc:
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    urls = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(urls))
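The example above fires all the requests at once; for a larger URL list it is common to cap concurrency. A minimal sketch using asyncio.Semaphore (the limit of 5 is an arbitrary assumption, not part of the original example):
import asyncio
import httpx


async def fetch_limited(client, sem, url):
    # the semaphore caps how many requests are in flight at the same time
    async with sem:
        try:
            r = await client.get(url)
            r.raise_for_status()
            return r.text
        except httpx.HTTPError as exc:
            print(f"Error fetching {url}: {exc}")
            return None


async def spider(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_limited(client, sem, u) for u in urls))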
2.3 aiohttp
import aiohttp
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30), headers=headers) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url):
    try:
        r = await client.get(url)
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return await r.text()  # unlike httpx, text() is a coroutine here
    except aiohttp.ClientError as exc:  # aiohttp.http_exceptions is a module, not an exception class
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(links))
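aiohttp's documentation favours using the response as an async context manager, which releases the connection back to the pool deterministically. A sketch of fetch_url in that style:
import aiohttp


async def fetch_url(client, url):
    try:
        # the context manager releases the connection when the block exits
        async with client.get(url) as r:
            r.raise_for_status()
            return await r.text()
    except aiohttp.ClientError as exc:
        print(f"Error fetching {url}: {exc}")
        return None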
2.4 niquests
Judging from its documentation, niquests looks like the "most capable" of the HTTP libraries above: it supports HTTP/2 natively and enables it by default. It is, however, a relatively young project, so it is less well known.
- DNS over HTTPS, DNS over QUIC, DNS over TLS, and DNS over UDP (support for the newest HTTP-related protocols)
- Automatic Content Decompression and Decoding
- OS truststore by default, no more certifi!
- OCSP Certificate Revocation Verification
- Advanced connection timings inspection
- In-memory certificates (CAs, and mTLS)
- Browser-style TLS/SSL Verification
- Sessions with Cookie Persistence
- Keep-Alive & Connection Pooling
- International Domains and URLs
- Automatic honoring of .netrc
- Basic & Digest Authentication
- Familiar dict–like Cookies
- Object-oriented headers
- Multi-part File Uploads
- Post-Quantum Security
- Chunked HTTP Requests
- Fully type-annotated!
- SOCKS Proxy Support
- Connection Timeouts
- Streaming Downloads
- HTTP/2 by default (note: this is the protocol enabled out of the box)
- HTTP/3 over QUIC
- Early Responses
- Happy Eyeballs
- Multiplexed!
- Thread-safe!
- WebSocket!
- Trailers!
- DNSSEC!
- Async!
import niquests

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image'
              '/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',  # fixed: no spaces inside the header name
}
r = niquests.get('https://spa16.scrape.center/', timeout=30, headers=headers)
print(r.text)
By default the request above is already made over HTTP/2.
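niquests also advertises multiplexing many requests over a single connection. The following is only a sketch of that feature as I understand it from the project README; the multiplexed flag and the gather() call are taken from there and should be treated as assumptions rather than verified API:
import niquests

# Assumption: Session(multiplexed=True) issues requests lazily and
# session.gather() resolves them all over the multiplexed connection.
with niquests.Session(multiplexed=True) as s:
    responses = [
        s.get("https://news.qq.com/rain/a/20241124A04Z7K00"),
        s.get("https://news.qq.com/rain/a/20241124A056FR00"),
    ]
    s.gather()  # block until all in-flight responses are resolved
    for r in responses:
        print(r.status_code)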
The async usage mirrors httpx, via AsyncSession:
import niquests
import asyncio


async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.qq.com'
    }
    async with niquests.AsyncSession() as client:
        tasks = [fetch_url(client, url, headers) for url in urls]
        return await asyncio.gather(*tasks)


async def fetch_url(client, url, headers):
    try:
        r = await client.get(url, headers=headers)
        r.raise_for_status()  # raise if the response status is 4xx/5xx
        return r.text  # like httpx/requests, .text is a plain attribute
    except niquests.exceptions.RequestException as exc:  # the module itself is not an exception class
        print(f"Error fetching {url}: {exc}")
        return None


async def main(urls):
    results = await spider(urls)
    print(results)


if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400",
    ]
    asyncio.run(main(links))
2.5 grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
grequests is a wrapper around requests, a bit like what seaborn is to plotting.
import grequests


# custom error handler; it receives two arguments: the request and the exception
def err_handler(_request, exception):
    print('except', exception)


headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
              'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.qq.com'
}
links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
    "https://news.qq.com/rain/a/20241124A03Z5600",
    "https://news.qq.com/rain/a/20241124A00VE400",
]
req_list = [grequests.get(url, headers=headers) for url in links]
res_list = grequests.map(req_list, exception_handler=err_handler)
for res in res_list:
    if res is not None:  # failed requests come back as None
        print(res.text)
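grequests.map also accepts a size parameter to cap the number of concurrent green threads, and grequests.imap yields responses as they complete rather than waiting for the whole batch. A brief self-contained sketch (size=2 is an arbitrary choice):
import grequests

urls = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]
reqs = [grequests.get(u) for u in urls]
# cap concurrency at 2 in-flight requests and consume responses
# in completion order rather than list order
for res in grequests.imap(reqs, size=2):
    if res is not None:
        print(res.url, res.status_code)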
Looking at the code above, you will notice that grequests differs considerably from the other async-capable packages: there is no need to drive asyncio by hand.
Its style is very close to how asynchronous requests are written in JavaScript.
{
    // asynchronous network requests are where JS naturally shines; it is its bread and butter
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400"
    ];
    const spider = async (url) => await (await fetch(url)).text();
    const results = await Promise.all(links.map(spider));
    results.forEach(result => console.log(result));
}
// equivalent
{
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        "https://news.qq.com/rain/a/20241124A00VE400"
    ];
    // thanks to arrow functions, one line of JS can pack in a lot of behaviour
    await Promise.all(links.map(async (url) => console.log(await (await fetch(url)).text())));
}
Compared with JavaScript, where async is built into the language, Python's async usage is still a bit more cumbersome and not as intuitive or easy to grasp. In fact, in some scenarios writing the crawler in JavaScript can be considerably easier, particularly when the data volume is modest (say, tens of thousands to a few hundred thousand pages per day).
3. Summary
Overall, as noted earlier, the HTTP packages above share a similar code style and similar functionality; the main differences lie in protocol support, async support, and so on (plus many other details such as memory footprint and stability).
Pick whichever fits your needs; there is no need to choose for the sake of choosing. PostgreSQL, for example, is very powerful, but it is fiddly to run and its potential pitfalls are uncharted, so problems are not easy to resolve once they appear; MySQL may look "past its prime", but it is highly mature, every pitfall has been stepped on by someone, and when something goes wrong a fix is easy to find.
Looking further out, the core of web scraping is not mastery of the packages above; it is breaking through defences (JavaScript reverse engineering at the easy end, app reverse engineering at the hard end) and getting the data (at heart, scraping is a grey industry, a "hacker" trade that leans toward social engineering). As for why Python is so strongly associated with scraping, much of it comes down to aggressive promotion by training providers, who routinely use crawlers to show off how simple and easy to learn Python is; at the same time, Python's fairly complete ecosystem gives scraping a solid foundation.
3.1 Aside: AI-generated code
I asked Alibaba's Tongyi Qianwen (Qwen), Baidu's ERNIE Bot, and ChatGPT to each generate async request code for niquests.
Surprisingly, for such an undemanding task, the first two could not produce code at all, and ChatGPT produced code that was simply wrong.
In other words, the training corpora these models predict from date back to an earlier period.
Note that the code ChatGPT gave is broken: it awaits a plain get(), which is not async-capable (the async API goes through AsyncSession, as in 2.4 above).