Python HTTP Request Library Series

1. Foreword

The great affairs of a state lie in sacrifice and war.

The great affairs of Python lie in scraping and AI.


There are quite a few third-party HTTP request packages for Python. Having recently played with niquests, I realized I had already used several similar ones, so below is a brief round-up of the main options.

2. Packages

Name | GitHub | Docs | Stars | Async | HTTP/2
---|---|---|---|---|---
requests | psf/requests ("A simple, yet elegant, HTTP library") | Requests: HTTP for Humans™ (2.32.3) | 52.2k | No | No
aiohttp | aio-libs/aiohttp ("Asynchronous HTTP client/server framework for asyncio and Python") | Welcome to AIOHTTP (3.11.7) | 15.2k | Yes | No
httpx | encode/httpx ("A next-generation HTTP client for Python") | HTTPX | ≈13k | Yes | Yes
niquests | jawah/niquests ("Safest, Fastest, Easiest, and Most advanced Python HTTP Client. Drop-in replacement for Requests. HTTP/1.1, HTTP/2, and HTTP/3 supported. With WebSocket!") | Niquests (3.11.0) | 1.1k | Yes | Yes
grequests | spyoungtech/grequests ("Requests + Gevent = ❤️") | — | 4.5k | Yes | No

On HTTP/2 support

When using the httpx client, HTTP/2 support is not enabled by default, because HTTP/1.1 is a mature, battle-hardened transport layer, and our HTTP/1.1 implementation may be considered the more robust option at this point in time. It is possible that a future version of httpx may enable HTTP/2 support by default.

That is httpx's own comment on HTTP/2: HTTP/1.1 is still the bedrock, and HTTP/2 may become the default in some future version.

(Figure: HTTP/1.1 vs HTTP/2 comparison — "HTTP/1.1 vs HTTP/2: What's the Difference?", DigitalOcean)

Some sites only support HTTP/2; hitting them directly with requests raises an error.


import requests

# spa16.scrape.center only serves HTTP/2, so plain requests fails here
r = requests.get('https://spa16.scrape.center/')
print(r.text)


httpx does not support HTTP/2 out of the box either; the extra dependency has to be installed and the feature switched on manually.

import httpx

# HTTP/2 needs the extra dependency: pip install httpx[http2]
client = httpx.Client(http2=True)

r = client.get('https://spa16.scrape.center/')
print(r.text)
print(r.status_code)
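
To confirm which protocol was actually negotiated, httpx's response object exposes http_version:

print(r.http_version)  # "HTTP/2" for this site, "HTTP/1.1" otherwise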

In day-to-day use there is no need to adopt a package just for the sake of it: the packages above all have similar call styles, since most are modeled on requests, so use whichever feels most natural — unless a particular scenario demands something specific, such as HTTP/2 or async support.

The rest of this section mainly looks at how the packages differ in async usage (requests excepted, of course).

2.1 requests

requests remains the benchmark, but the maintainers do not appear to have bigger plans, and the addition of new features seems to have stalled.

I will not go over basic usage here — it is everywhere, which is exactly what makes requests so friendly to beginners: sample code and write-ups of every pitfall are a search away.

As for HTTP/2, there is an adapter from the hyper project (search: python requests http2), but that package was released long ago and has not been maintained for years.

import requests
from hyper.contrib import HTTP20Adapter  # hyper: an old, long-unmaintained HTTP/2 implementation

session = requests.Session()
# route all https:// traffic through the HTTP/2 adapter
session.mount("https://", HTTP20Adapter())
response = session.get("https://www.example.com")

2.2 httpx

Note how httpx handles HTTP/2 support:

pip install httpx
# the HTTP/2 extra has to be installed separately
pip install httpx[http2]

import httpx
import asyncio

async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)

async def fetch_url(client, url):
    try:
        r = await client.get(url)  # compare with the aiohttp version below
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return r.text
    except httpx.HTTPError as exc:
        print(f"Error fetching {url}: {exc}")
        return None

async def main(urls):
    results = await spider(urls)
    print(results)

if __name__ == "__main__":
    urls = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(urls))
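
The async client above still negotiates HTTP/1.1; to get HTTP/2 here as well, construct it with http2=True (the httpx[http2] extra must be installed):

async with httpx.AsyncClient(http2=True, headers=headers, timeout=30) as client:
    ...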

2.3 aiohttp

import aiohttp
import asyncio

async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=30), headers=headers) as client:
        tasks = [fetch_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)

async def fetch_url(client, url):
    try:
        r = await client.get(url)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return await r.text() # unlike httpx, text() is a coroutine here and must be awaited
    except aiohttp.ClientError as exc:
        print(f"Error fetching {url}: {exc}")
        return None

async def main(urls):
    results = await spider(urls)
    print(results)

if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))
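
As an aside, aiohttp can also be told to raise for bad statuses at the session level, which makes the per-request raise_for_status() call unnecessary; a minimal sketch:

import asyncio
import aiohttp

async def fetch(url):
    # raise_for_status=True: the session raises aiohttp.ClientResponseError on 4xx/5xx
    async with aiohttp.ClientSession(raise_for_status=True) as client:
        async with client.get(url) as r:
            return await r.text()

print(asyncio.run(fetch("https://news.qq.com/rain/a/20241124A04Z7K00")))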

2.4 niquests

Judging from its documentation, niquests should be the "most powerful" of the HTTP libraries above: HTTP/2 is supported natively and enabled by default. Being a relatively young project, though, it is not yet widely known.

  • DNS over HTTPS, DNS over QUIC, DNS over TLS, and DNS over UDP — the newest HTTP-adjacent protocols are supported
  • Automatic Content Decompression and Decoding
  • OS truststore by default, no more certifi!
  • OCSP Certificate Revocation Verification
  • Advanced connection timings inspection
  • In-memory certificates (CAs, and mTLS)
  • Browser-style TLS/SSL Verification
  • Sessions with Cookie Persistence
  • Keep-Alive & Connection Pooling
  • International Domains and URLs
  • Automatic honoring of .netrc
  • Basic & Digest Authentication
  • Familiar dict–like Cookies
  • Object-oriented headers
  • Multi-part File Uploads
  • Post-Quantum Security
  • Chunked HTTP Requests
  • Fully type-annotated!
  • SOCKS Proxy Support
  • Connection Timeouts
  • Streaming Downloads
  • HTTP/2 by default — note this: it is the protocol used by default
  • HTTP/3 over QUIC
  • Early Responses
  • Happy Eyeballs
  • Multiplexed! (a sketch follows the synchronous example below)
  • Thread-safe!
  • WebSocket!
  • Trailers!
  • DNSSEC!
  • Async!
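
A minimal synchronous request looks just like requests:
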
import niquests

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                    '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image'
                '/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept - Encoding": "gzip, deflate, br",
}

r = niquests.get('https://spa16.scrape.center/', timeout=30, headers=headers)

print(r.text)

With no extra configuration, the request above is already made over HTTP/2.
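
The "Multiplexed!" item above means that several requests can be sent over one connection and resolved lazily. A minimal sketch following the niquests documentation — the multiplexed=True option and Session.gather() are taken from there, so treat it as an illustration rather than verified code:

import niquests

urls = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]

with niquests.Session(multiplexed=True) as s:
    # responses come back lazily; nothing has been read from the wire yet
    responses = [s.get(url) for url in urls]
    s.gather()  # resolve all in-flight lazy responses
    for r in responses:
        print(r.status_code, len(r.text))

For async usage, niquests provides AsyncSession: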

import niquests
import asyncio

async def spider(urls):
    headers = {
        'Connection': 'keep-alive',
        'DNT': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                      '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
                  'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Language': 'en-US,en;q=0.9',
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": 'https://www.qq.com'
    }
    async with niquests.AsyncSession() as client:
        tasks = [fetch_url(client, url, headers) for url in urls]
        return await asyncio.gather(*tasks)

async def fetch_url(client, url, headers):
    try:
        r = await client.get(url, headers=headers)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        return r.text  # like httpx: .text is a plain attribute, no await needed
    except niquests.exceptions.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
        return None

async def main(urls):
    results = await spider(urls)
    print(results)

if __name__ == "__main__":
    links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    asyncio.run(main(links))

2.5 grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

grequests is a wrapper built on top of requests, much like seaborn is for plotting.

import grequests

# custom error handler; it receives two arguments: the request and the exception
def err_handler(_request, exception):
    print('except', exception)

headers = {
    'Connection': 'keep-alive',
    'DNT': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit'
                  '/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/'
              'webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": 'https://www.qq.com'
}

links = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
    "https://news.qq.com/rain/a/20241124A03Z5600",
    'https://news.qq.com/rain/a/20241124A00VE400'
]
req_list = [grequests.get(url, headers=headers) for url in links]

res_list = grequests.map(req_list, exception_handler=err_handler)

for res in res_list:
    if res is not None:  # failed requests come back as None
        print(res.text)

Looking at the code above, grequests clearly differs from the other async-capable packages: you never have to call asyncio yourself.
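
grequests.map also takes a size argument that caps the underlying gevent pool, which is worth setting once the URL list grows, e.g. grequests.map(req_list, size=4, exception_handler=err_handler).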

Its style is very close to the way asynchronous requests are written in JavaScript.

{
    // JS's advantage in async network requests is obvious; async is in its nature
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ]
    const spider = async (url) => await (await fetch(url)).text();
    const results = await Promise.all(links.map(spider));
    results.forEach(result => console.log(result));
}
// equivalent
{
    const links = [
        "https://news.qq.com/rain/a/20241124A04Z7K00",
        "https://news.qq.com/rain/a/20241124A056FR00",
        "https://news.qq.com/rain/a/20241124A03Z5600",
        'https://news.qq.com/rain/a/20241124A00VE400'
    ];
    // thanks to arrow functions, one line of JS can do quite a lot
    await Promise.all(links.map(async (url) => console.log(await (await fetch(url)).text())));
}

Compared with JavaScript, where async is native, Python's async is still slightly more cumbersome, and not as direct or easy to follow. In fact, for some scenarios it can be much easier to write the scraper in JavaScript, especially when the data volume is modest (say, tens of thousands to a few hundred thousand pages a day).
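
For comparison, the closest Python counterpart to the JS one-liner above (using httpx here) still needs an explicit event-loop entry point and a client context:

import asyncio
import httpx

async def main(urls):
    async with httpx.AsyncClient(timeout=30) as client:
        # fire all requests concurrently and wait for them together
        results = await asyncio.gather(*(client.get(url) for url in urls))
        for r in results:
            print(r.text)

urls = [
    "https://news.qq.com/rain/a/20241124A04Z7K00",
    "https://news.qq.com/rain/a/20241124A056FR00",
]
asyncio.run(main(urls))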

3. Summary

Overall, as said at the start, these HTTP request packages are close in code style and in functionality; the main differences lie in protocol support, async support, and the like (plus further details such as memory footprint and stability).

Pick whatever suits your needs; there is no need to choose for the sake of choosing. It is a bit like PostgreSQL versus MySQL: PG is very powerful, but it is fiddly to operate and its potential pitfalls are unknown, so problems are not easy to resolve when they appear; MySQL may look "past its prime", yet it is highly mature, every pitfall has already been stepped on by someone, and when something goes wrong a fix is easy to find.

Taking a longer view, the core of web scraping is not mastery of the packages above; it is breaking through defenses (JavaScript reverse engineering in the easy cases, app reverse engineering in the hard ones) and getting the data out (at heart, scraping is a grey industry, a "hacker" trade that leans toward social engineering). As for why Python is so strongly tied to scraping, much of it comes down to aggressive promotion by training outfits, which love to use scraping to show how simple and approachable Python is; Python's fairly complete ecosystem then gives scraping a solid foundation.

3.1 Aside: AI-generated code

I asked Alibaba's Tongyi Qianwen, Baidu's ERNIE Bot, and ChatGPT to each generate async request code for niquests.

Surprisingly, for such an undemanding request, the first two could not produce code at all, while ChatGPT produced code that was wrong.


Which suggests that the training corpora behind these models come from an earlier period.

(Screenshots: the responses from Tongyi Qianwen, ERNIE Bot, and ChatGPT.)

Note that the code ChatGPT produced is broken: niquests' top-level get() is not awaitable, so await get() does not give you async behavior.
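
For reference, the working pattern (as in section 2.4) goes through AsyncSession rather than awaiting the module-level get():

import asyncio
import niquests

async def main():
    # niquests.get() is the synchronous entry point and cannot be awaited
    async with niquests.AsyncSession() as s:
        r = await s.get("https://example.com")
        print(r.status_code)

asyncio.run(main())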