A quick note, for future reference.
1. Preface
pandas is very powerful, but its performance "shortfall" is also plain to see.
import polars as pl
import pandas as pd
import datatable as dt
import dask.dataframe as dd
With no extra settings and everything left at its defaults, read a CSV file with 160592 lines and 16 columns.
File size: 11.5 MB (12,132,352 bytes)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160591 entries, 0 to 160590
Data columns (total 16 columns):
 #   Column                               Non-Null Count   Dtype
---  ------                               --------------   -----
 0   location                             160591 non-null  object
 1   iso_code                             160591 non-null  object
 2   date                                 160591 non-null  object
 3   total_vaccinations                   73234 non-null   float64
 4   people_vaccinated                    70124 non-null   float64
 5   people_fully_vaccinated              67790 non-null   float64
 6   total_boosters                       41953 non-null   float64
 7   daily_vaccinations_raw               60519 non-null   float64
 8   daily_vaccinations                   159442 non-null  float64
 9   total_vaccinations_per_hundred       73234 non-null   float64
 10  people_vaccinated_per_hundred        70124 non-null   float64
 11  people_fully_vaccinated_per_hundred  67790 non-null   float64
 12  total_boosters_per_hundred           41953 non-null   float64
 13  daily_vaccinations_per_million       159442 non-null  float64
 14  daily_people_vaccinated              159430 non-null  float64
 15  daily_people_vaccinated_per_hundred  159430 non-null  float64
dtypes: float64(13), object(3)
memory usage: 19.6+ MB
2. Reading
%time pl_df = pl.read_csv('test.csv')
CPU times: total: 93.8 ms
Wall time: 45.5 ms
%time pd_df = pd.read_csv('test.csv')
CPU times: total: 203 ms
Wall time: 219 ms
%time dt_df = dt.fread('test.csv')
CPU times: total: 109 ms
Wall time: 37 ms
%time dd_df = dd.read_csv('test.csv')
CPU times: total: 15.6 ms
Wall time: 15 ms
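The `%time` figures above each come from a single run inside IPython/Jupyter. As a rough sketch of how the same comparison could be repeated outside a notebook, the helper below calls a reader function several times and keeps the best wall-clock result. The names `time_reader`, `read_with_stdlib` and `make_sample_csv` are made up for illustration, and the small synthetic file only stands in for `test.csv`. It uses only the standard library, so it runs even without the four packages installed.

```python
import csv
import os
import tempfile
import time


def time_reader(read_fn, path, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` calls of read_fn(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)
        best = min(best, time.perf_counter() - start)
    return best


def read_with_stdlib(path):
    """Baseline reader: load every row with the standard-library csv module."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def make_sample_csv(path, n_rows=1000):
    """Write a small synthetic CSV (hypothetical data) so the sketch runs anywhere."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["location", "date", "daily_vaccinations"])
        for i in range(n_rows):
            writer.writerow(["A", f"2021-01-{i % 28 + 1:02d}", i])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.csv")
        make_sample_csv(sample_path)
        seconds = time_reader(read_with_stdlib, sample_path)
        print(f"csv module: best of 5 runs = {seconds * 1000:.2f} ms")
```

To compare the libraries themselves, `pl.read_csv`, `pd.read_csv` or `dt.fread` could be passed as `read_fn`. Note that `dd.read_csv` is lazy: it sets up a task graph and defers loading the full data until something like `.compute()` is called, so its 15 ms is not directly comparable with the other three timings.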
2.1 polars
3. Installation
pip install pandas
pip install polars
# datatable
pip install datatable
# Installation ran into problems on Python 3.10
# Full installation of dask
python -m pip install "dask[complete]"
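Since an install can fail on some Python versions, it may help to check what actually ended up in the environment before running the comparison. The snippet below is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name, and it assumes each package's distribution name matches its import name, which holds for the four packages here.

```python
import importlib.util
import sys
from importlib import metadata


def installed_version(dist_name, module_name=None):
    """Return the installed version string, or None if the module cannot be found.

    Returns "unknown" when the module is importable but has no distribution
    metadata (for example, a standard-library module).
    """
    module_name = module_name or dist_name
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "unknown"


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name in ("pandas", "polars", "datatable", "dask"):
        version = installed_version(name)
        print(f"{name}: {version if version is not None else 'not installed'}")
```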
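4. Summary
The test here only records CSV reading.
Choosing an alternative also means weighing a range of other issues, such as subsequent data processing, format conversion, and interoperability with other libraries.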
5. Documentation
6. Others
- pyspark
- koalas
- duckdb
- cudf, GPU-accelerated.
- blazingsql