A quick note, for future reference.
1. Introduction
Pandas is powerful, but its performance shortcomings are plainly visible when working with "large" datasets.
import polars as pl
import pandas as pd
import datatable as dt
import dask.dataframe as dd
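With no extra configuration, i.e. with every library left at its defaults, each one reads the same CSV file: 160592 lines (the header plus the 160591 rows reported below) and 16 columns.
File size: 11.5 MB (12,132,352 bytes)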
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160591 entries, 0 to 160590
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 location 160591 non-null object
1 iso_code 160591 non-null object
2 date 160591 non-null object
3 total_vaccinations 73234 non-null float64
4 people_vaccinated 70124 non-null float64
5 people_fully_vaccinated 67790 non-null float64
6 total_boosters 41953 non-null float64
7 daily_vaccinations_raw 60519 non-null float64
8 daily_vaccinations 159442 non-null float64
9 total_vaccinations_per_hundred 73234 non-null float64
10 people_vaccinated_per_hundred 70124 non-null float64
11 people_fully_vaccinated_per_hundred 67790 non-null float64
12 total_boosters_per_hundred 41953 non-null float64
13 daily_vaccinations_per_million 159442 non-null float64
14 daily_people_vaccinated 159430 non-null float64
15 daily_people_vaccinated_per_hundred 159430 non-null float64
dtypes: float64(13), object(3)
memory usage: 19.6+ MB
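The size, line count and column count quoted above can be checked without any of the DataFrame libraries. The following is a minimal sketch using only the standard library; the helper name `describe_csv` is made up for illustration, and it assumes a UTF-8 file whose first row is the header.

```python
import csv
import os


def describe_csv(path):
    """Return (size_in_bytes, row_count_including_header, column_count)."""
    size_in_bytes = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first row holds the column names
        if header is None:           # empty file: nothing to count
            return size_in_bytes, 0, 0
        row_count = 1 + sum(1 for _ in reader)  # header + data rows
    return size_in_bytes, row_count, len(header)


if __name__ == "__main__":
    path = "test.csv"  # file name taken from the timings in this note
    if os.path.exists(path):
        size, rows, columns = describe_csv(path)
        print(f"{size / 1024 ** 2:.1f} MB ({size:,} bytes), {rows} rows, {columns} columns")
```

Note that `csv.reader` counts records rather than physical lines, so the two differ if a quoted field contains a line break.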
2. Reading
%time pl_df = pl.read_csv('test.csv')
CPU times: total: 93.8 ms
Wall time: 45.5 ms
%time pd_df = pd.read_csv('test.csv')
CPU times: total: 203 ms
Wall time: 219 ms
%time dt_df = dt.fread('test.csv')
CPU times: total: 109 ms
Wall time: 37 ms
%time dd_df = dd.read_csv('test.csv')
CPU times: total: 15.6 ms
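Wall time: 15 ms
Note: dd.read_csv is lazy. It only samples the file to infer dtypes and builds a task graph, so the 15 ms above does not include loading the data; the actual read happens when .compute() is called.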
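The `%time` magic only exists inside IPython/Jupyter. For a plain script, a rough equivalent can be sketched with the standard `time` module; the helper names `time_call` and `report` are hypothetical, and a single run like this is only a coarse measurement, just like the figures above.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func once and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # CPU time of this process (user + system)
    wall_start = time.perf_counter()  # high-resolution wall clock
    result = func(*args, **kwargs)
    wall_seconds = time.perf_counter() - wall_start
    cpu_seconds = time.process_time() - cpu_start
    return result, cpu_seconds, wall_seconds


def report(label, func, *args, **kwargs):
    """Print both timings in milliseconds and return func's result."""
    result, cpu_seconds, wall_seconds = time_call(func, *args, **kwargs)
    print(f"{label}: CPU {cpu_seconds * 1000:.1f} ms, wall {wall_seconds * 1000:.1f} ms")
    return result
```

With the imports from the introduction in place, usage would look like `pd_df = report("pandas", pd.read_csv, "test.csv")`, and likewise for `pl.read_csv` and `dt.fread`.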
2.1 polars
polars is very close to pandas in usage.
This book is an introduction to the Polars DataFrame library. Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars.
Even though Polars is completely written in Rust (no runtime overhead!) and uses Arrow -- the native arrow2 Rust implementation -- as its foundation, the examples presented in this guide will be mostly using its higher-level language bindings. Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library.
For Pandas users, our Python package will offer the easiest way to get started with Polars.
3. Installation
pip install pandas
pip install polars
# datatable
pip install datatable
# the installation ran into problems on Python 3.10
# full install of dask
python -m pip install "dask[complete]"
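Since one of the installs may fail (as datatable did on Python 3.10 here), it can help to check what actually ended up in the environment. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it assumes the import name and the distribution name are identical, which holds for the four packages above.

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is missing."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"  # importable, but no distribution metadata found


if __name__ == "__main__":
    for name in ("pandas", "polars", "datatable", "dask"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```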
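4. Summary
The test here only records reading a CSV file.
When choosing an alternative, a range of other issues also needs to be considered, such as downstream data processing, format conversion, and interoperability with other libraries.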
5. Documentation
6. Others
- pyspark
- koalas
- duckdb
- cudf, GPU-accelerated.
- blazingsql