Pandas Translation Series: A Preview of Version 2.0

New features:

  • More complete support for numpy numeric dtypes in Index
  • Support for pyarrow-backed dtypes when reading data
  • Copy-on-Write performance improvements

What’s new in 2.0.0 (March XX, 2023)

These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.

1. Enhancements

1.1 Installing optional dependencies with pip extras

When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.

Note the extras specified in the install command:

pip install "pandas[performance, aws]>=2.0.0"

The available extras, found in the installation guide, are [all, performance, computation, timezone, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test] (GH39164).

1.2 [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) can now hold numpy numeric dtypes

More complete numpy dtype support for Index

| Numpy type | C type | Description |
| --- | --- | --- |
| np.int8 | int8_t | Byte (-128 to 127) |
| np.int16 | int16_t | Integer (-32768 to 32767) |
| np.int32 | int32_t | Integer (-2147483648 to 2147483647) |
| np.int64 | int64_t | Integer (-9223372036854775808 to 9223372036854775807) |
| np.uint8 | uint8_t | Unsigned integer (0 to 255) |
| np.uint16 | uint16_t | Unsigned integer (0 to 65535) |
| np.uint32 | uint32_t | Unsigned integer (0 to 4294967295) |
| np.uint64 | uint64_t | Unsigned integer (0 to 18446744073709551615) |
| np.intp | intptr_t | Integer used for indexing, typically the same as ssize_t |
| np.uintp | uintptr_t | Integer large enough to hold a pointer |
| np.float32 | float | Single-precision float |
| np.float64 / np.float_ | double | Note that this matches the precision of the built-in Python float |
| np.complex64 | float complex | Complex number, represented by two 32-bit floats (real and imaginary components) |
| np.complex128 / np.complex_ | double complex | Note that this matches the precision of the built-in Python complex |

It is now possible to use any numpy numeric dtype in an Index (GH42717).

Previously it was only possible to use int64, uint64 & float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")

Int64Index, UInt64Index & Float64Index were deprecated in pandas version 1.4 and have now been removed. Instead Index should be used directly, and it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64 dtypes:

Note the index types that have been removed.

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')

In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')

In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')

The ability for Index to hold the numpy numeric dtypes has meant some changes in Pandas functionality. In particular, operations that previously were forced to create 64-bit indexes, can now create indexes with lower bit sizes, e.g. 32-bit indexes.

Previously, indexes were always (forcibly) created as 64-bit; now indexes with a smaller footprint, e.g. 32-bit, can be created.

Below is a possibly non-exhaustive list of changes:

  1. Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems. Instantiating [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, which is the same as previously.
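
For example (a minimal sketch using an explicit dtype so the output is platform-independent):

import numpy as np
import pandas as pd

# The Index dtype now follows the numpy array's dtype instead of
# being widened to 64 bits.
arr = np.array([1, 2, 3], dtype=np.int32)
print(pd.Index(arr).dtype)        # int32 (previously forced to int64)

# Building from a plain Python list still yields a 64-bit dtype.
print(pd.Index([1, 2, 3]).dtype)  # int64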

  2. The various numeric datetime attributes of DatetimeIndex (day, month, year etc.) were previously of dtype int64, while they were int32 for arrays.DatetimeArray. They are now int32 on DatetimeIndex also:

The datetime index attributes change from int64 to int32.

In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='M')

In [5]: idx.array.year
Out[5]: array([2018, 2018, 2018], dtype=int32)

In [6]: idx.year
Out[6]: Index([2018, 2018, 2018], dtype='int32')

  3. Level dtypes on Indexes from Series.sparse.from_coo() are now of dtype int32, the same as they are on the rows/cols on a scipy sparse matrix. Previously they were of dtype int64.

Series.sparse.from_coo() now produces int32 level dtypes natively.

In [7]: from scipy import sparse

In [8]: A = sparse.coo_matrix(
   ...:     ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
   ...: )
   ...:

In [9]: ser = pd.Series.sparse.from_coo(A)

In [10]: ser.index.dtypes
Out[10]:
level_0    int32
level_1    int32
dtype: object

  4. Index cannot be instantiated using a float16 dtype. Previously instantiating an Index using dtype float16 resulted in a Float64Index with a float64 dtype. It now raises a NotImplementedError:

An Index can no longer be instantiated with the float16 dtype; previously this produced a Float64Index, now it raises an error outright.

In [11]: pd.Index([1, 2, 3], dtype=np.float16)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 pd.Index([1, 2, 3], dtype=np.float16)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:552, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
    548 arr = ensure_wrapped_if_datetimelike(arr)
    550 klass = cls._dtype_to_subclass(arr.dtype)
--> 552 arr = klass._ensure_array(arr, arr.dtype, copy=False)
    553 return klass._simple_new(arr, name)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:565, in Index._ensure_array(cls, data, dtype, copy)
    562     raise ValueError("Index data must be 1-dimensional")
    563 elif dtype == np.float16:
    564     # float16 not supported (no indexing engine)
--> 565     raise NotImplementedError("float16 indexes are not supported")
    567 if copy:
    568     # asarray_tuplesafe does not always copy underlying data,
    569     #  so need to make sure that this happens
    570     data = data.copy()

NotImplementedError: float16 indexes are not supported
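
If data arrives as float16, one workaround (a sketch, not taken from the release notes) is to upcast to a supported width before building the Index:

import numpy as np
import pandas as pd

# Upcast float16 values to float32, which Index does support.
half = np.array([1, 2, 3], dtype=np.float16)
idx = pd.Index(half.astype(np.float32))
print(idx.dtype)  # float32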

1.3 Configuration option, mode.dtype_backend, to return pyarrow-backed dtypes

pyarrow

This library provides a Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.

A configuration change adds support for pyarrow dtypes.

The use_nullable_dtypes keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (GH36712)

Several functions gain pyarrow support through this new keyword argument.

To simplify opting-in to nullable dtypes for these functions, a new option nullable_dtypes was added that allows setting the keyword argument globally to True if not specified directly. The option can be enabled through:

In [12]: pd.options.mode.nullable_dtypes = True

The option will only work for functions with the keyword use_nullable_dtypes.
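
Concretely (a sketch against the development snapshot described here; note that the exact option and keyword names were still subject to change before the final release):

import io
import pandas as pd

csv = "a,b\n1,\n2,3\n"

# Passing the keyword explicitly...
df = pd.read_csv(io.StringIO(csv), use_nullable_dtypes=True)

# ...or enabling it globally, as described above.
pd.options.mode.nullable_dtypes = True
df = pd.read_csv(io.StringIO(csv))

print(df.dtypes)  # column "b" is nullable Int64 rather than float64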

Additionally a new global configuration, mode.dtype_backend can now be used in conjunction with the parameter use_nullable_dtypes=True in the following functions to select the nullable dtypes implementation.

With this global setting, the following functions gain pyarrow support.

And the following methods will also utilize the mode.dtype_backend option.

The following methods also honor this global configuration option.

By default, mode.dtype_backend is set to "pandas" to return existing, numpy-backed nullable dtypes, but it can also be set to "pyarrow" to return pyarrow-backed, nullable ArrowDtype (GH48957, GH49997).

In [13]: import io

In [14]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....:

In [15]: with pd.option_context("mode.dtype_backend", "pandas"):
   ....:     df = pd.read_csv(data, use_nullable_dtypes=True)
   ....:

In [16]: df.dtypes
Out[16]:
a             Int64
b           Float64
c           boolean
d    string[python]
e             Int64
f           Float64
g           boolean
h    string[python]
i             Int64
dtype: object

In [17]: data.seek(0)
Out[17]: 0

In [18]: with pd.option_context("mode.dtype_backend", "pyarrow"):
   ....:     df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
   ....:

In [19]: df_pyarrow.dtypes
Out[19]:
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
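
Pyarrow-backed dtypes can also be requested directly when constructing objects (a small sketch; requires the optional pyarrow dependency):

import pandas as pd

# The "int64[pyarrow]" string alias resolves to an ArrowDtype.
ser = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(ser.dtype)  # int64[pyarrow]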

1.4 Copy-on-Write improvements

Copy-on-Write performance improvements

  • A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in Copy-on-Write optimizations. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (GH49473).
  • Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now always returns a new object every time it is constructed when Copy-on-Write is enabled (not returning multiple times an identical, cached Series object). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450)
  • The Series constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of copy=False (GH50471)
  • The DataFrame constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing DataFrame with the default of copy=False (GH51239)
  • The DataFrame constructor, when constructing a DataFrame from a dictionary of Series objects and specifying copy=False, will now use a lazy copy of those Series objects for the columns of the DataFrame (GH50777)
  • Trying to set values using chained assignment (for example, df["a"][1:3] = 0) will now always raise an exception when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative error is raised to the user instead of silently doing nothing (GH49467)
  • DataFrame.replace() will now respect the Copy-on-Write mechanism when inplace=True.
  • DataFrame.transpose() will now respect the Copy-on-Write mechanism.
  • Arithmetic operations that can be inplace, e.g. ser *= 2 will now respect the Copy-on-Write mechanism.

Copy-on-Write can be enabled through one of

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Alternatively, copy on write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True):
    ...
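
A small sketch of the combined effect (lazy copies from methods and fresh Series objects on column access, both described above):

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Methods now return a lazy copy: nothing is copied until one side
# is modified, and modifying the result leaves the original intact.
df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100
print(df.loc[0, "a"])      # 1 -- df is unchanged

# Selecting a column returns a new Series object each time under CoW.
print(df["a"] is df["a"])  # False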