Pandas Alternatives

A brief record, kept for future reference.

1. Introduction

Pandas is powerful, but its performance visibly falls short when it faces "large" datasets.

import polars as pl

import pandas as pd

import datatable as dt

import dask.dataframe as dd

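With no extra settings and everything left at its defaults, each library reads a CSV file of 160,592 lines (presumably one header line plus the 160,591 rows pandas reports below) and 16 columns.

File size: 11.5 MB (12,132,352 bytes)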

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160591 entries, 0 to 160590
Data columns (total 16 columns):
 #   Column                               Non-Null Count   Dtype
---  ------                               --------------   -----
 0   location                             160591 non-null  object
 1   iso_code                             160591 non-null  object
 2   date                                 160591 non-null  object
 3   total_vaccinations                   73234 non-null   float64
 4   people_vaccinated                    70124 non-null   float64
 5   people_fully_vaccinated              67790 non-null   float64
 6   total_boosters                       41953 non-null   float64
 7   daily_vaccinations_raw               60519 non-null   float64
 8   daily_vaccinations                   159442 non-null  float64
 9   total_vaccinations_per_hundred       73234 non-null   float64
 10  people_vaccinated_per_hundred        70124 non-null   float64
 11  people_fully_vaccinated_per_hundred  67790 non-null   float64
 12  total_boosters_per_hundred           41953 non-null   float64
 13  daily_vaccinations_per_million       159442 non-null  float64
 14  daily_people_vaccinated              159430 non-null  float64
 15  daily_people_vaccinated_per_hundred  159430 non-null  float64
dtypes: float64(13), object(3)
memory usage: 19.6+ MB

2. Reading

%time pl_df = pl.read_csv('test.csv')

CPU times: total: 93.8 ms
Wall time: 45.5 ms

%time pd_df = pd.read_csv('test.csv')

CPU times: total: 203 ms
Wall time: 219 ms

%time dt_df = dt.fread('test.csv')

CPU times: total: 109 ms
Wall time: 37 ms

%time dd_df = dd.read_csv('test.csv')

CPU times: total: 15.6 ms
Wall time: 15 ms

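Dask DataFrames address the lack of parallel computation in Pandas. Note that dd.read_csv is lazy: it samples the start of the file to infer dtypes and builds a task graph, so the 15 ms above does not include loading the full file. The actual read only happens on a call such as .compute().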

2.1 polars

Polars is very close to pandas in usage. Its user guide introduces the project as follows:

This book is an introduction to the Polars DataFrame library. Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars.

Even though Polars is completely written in Rust (no runtime overhead!) and uses Arrow -- the native arrow2 Rust implementation -- as its foundation, the examples presented in this guide will be mostly using its higher-level language bindings. Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library.

For Pandas users, our Python package will offer the easiest way to get started with Polars.

3. Installation

pip install pandas

pip install polars

# datatable
pip install datatable
# installation ran into problems on Python 3.10

# full dask installation
python -m pip install "dask[complete]"
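Because at least one of these packages may fail to install on a given interpreter, it can help to check what is actually present before benchmarking. This is a small sketch using only the standard library; the function name installed_versions is an illustrative choice.

```python
import sys
from importlib import metadata

PACKAGES = ("pandas", "polars", "datatable", "dask")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", sys.version.split()[0])
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```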

4. Summary

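The test here only records the time taken to read a CSV file.

Choosing an alternative also means weighing a range of other questions, such as downstream data processing, format conversion, and interoperability with other libraries.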
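On the format-conversion point, all four libraries used here can hand their data back to pandas, which is often the common denominator for other tools. The helper below is a hypothetical sketch (as_pandas is not a library function) that relies on duck typing: Dask frames expose .compute(), while polars.DataFrame and datatable.Frame expose .to_pandas(). For Polars, that conversion is assumed to have pyarrow available.

```python
def as_pandas(frame):
    """Return a pandas DataFrame for a pandas, Polars, datatable, or Dask frame."""
    if hasattr(frame, "compute"):
        # dask.dataframe.DataFrame: run the lazy task graph, result is pandas
        return frame.compute()
    if hasattr(frame, "to_pandas"):
        # polars.DataFrame and datatable.Frame both provide to_pandas()
        return frame.to_pandas()
    # Anything else is assumed to be a pandas DataFrame already
    return frame
```

With the variables from the reading section, as_pandas(pl_df), as_pandas(dt_df), and as_pandas(dd_df) would each yield a pandas DataFrame. A Polars LazyFrame is not covered by this sketch; it would need .collect() first.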

5. Documentation

6. Others

pyspark
koalas
duckdb
cudf (GPU-accelerated)
blazingsql
