New features:
- Index: more complete support for numpy dtypes
- Reading data: added support for pyarrow dtypes
- Optimized read/write performance (Copy-on-Write)
What’s new in 2.0.0 (March XX, 2023)
These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.
1. Enhancements
1.1 Installing optional dependencies with pip extras
When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.
Note the extra install specifiers:
pip install "pandas[performance, aws]>=2.0.0"
The available extras, found in the installation guide, are [all, performance, computation, timezone, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test]
(GH39164).
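After installing, one quick way to check which optional dependencies pandas actually picked up is its built-in show_versions() helper, which prints the versions of pandas and of any detected optional dependencies:

import pandas as pd

# Prints the pandas version plus the versions of detected optional
# dependencies (pyarrow, numexpr, bottleneck, s3fs, ...).
pd.show_versions()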
1.2 [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) can now hold numpy numeric dtypes
More complete numpy dtype support for Index.
NumPy type | C type | Description |
---|---|---|
np.int8 | int8_t | Byte (-128 to 127) |
np.int16 | int16_t | Integer (-32768 to 32767) |
np.int32 | int32_t | Integer (-2147483648 to 2147483647) |
np.int64 | int64_t | Integer (-9223372036854775808 to 9223372036854775807) |
np.uint8 | uint8_t | Unsigned integer (0 to 255) |
np.uint16 | uint16_t | Unsigned integer (0 to 65535) |
np.uint32 | uint32_t | Unsigned integer (0 to 4294967295) |
np.uint64 | uint64_t | Unsigned integer (0 to 18446744073709551615) |
np.intp | intptr_t | Integer used for indexing; typically the same as ssize_t |
np.uintp | uintptr_t | Integer large enough to hold a pointer |
np.float32 | float | |
np.float64 / np.float_ | double | Note: this matches the precision of the built-in Python float. |
np.complex64 | float complex | Complex number represented by two 32-bit floats (real and imaginary components) |
np.complex128 / np.complex_ | double complex | Note: this matches the precision of the built-in Python complex. |
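As a side note, these ranges need not be memorized; numpy's iinfo/finfo helpers report them programmatically:

import numpy as np

# Integer ranges for any integer dtype.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)   # -128 127
print(np.iinfo(np.uint16).max)                        # 65535

# Precision info for floating-point dtypes.
print(np.finfo(np.float32).precision)                 # ~6 decimal digits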
It is now possible to use any numpy numeric dtype in an `Index` (GH42717). Previously it was only possible to use `int64`, `uint64` & `float64` dtypes:
In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")
`Int64Index`, `UInt64Index` & `Float64Index` were deprecated in pandas version 1.4 and have now been removed. Instead `Index` should be used directly, and it can now take all numpy numeric dtypes, i.e. `int8`/`int16`/`int32`/`int64`/`uint8`/`uint16`/`uint32`/`uint64`/`float32`/`float64` dtypes:
Note the specialized index types that have been removed:
In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')
The ability for `Index` to hold the numpy numeric dtypes has meant some changes in pandas functionality. In particular, operations that previously were forced to create 64-bit indexes can now create indexes with lower bit sizes, e.g. 32-bit indexes.
Indexes were previously always (forcibly) created as 64-bit; smaller-footprint indexes, such as 32-bit ones, can now be created.
Below is a possibly non-exhaustive list of changes:
- Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, `Index(np.array([1, 2, 3]))` will be `int32` on 32-bit systems, where it previously would have been `int64` even on 32-bit systems. Instantiating [Index](https://pandas.pydata.org/docs/dev/reference/api/pandas.Index.html#pandas.Index) using a list of numbers will still return 64-bit dtypes, e.g. `Index([1, 2, 3])` will have an `int64` dtype, which is the same as previously.
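A minimal sketch of this change (illustrative only; outputs shown for a typical 64-bit platform):

import numpy as np
import pandas as pd

# An explicit numpy dtype on the input array is now preserved...
idx = pd.Index(np.array([1, 2, 3], dtype=np.int32))
print(idx.dtype)  # int32 (previously forced to int64)

# ...while a plain Python list still defaults to 64-bit.
print(pd.Index([1, 2, 3]).dtype)  # int64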
- The various numeric datetime attributes of `DatetimeIndex` (`day`, `month`, `year` etc.) were previously of dtype `int64`, while they were `int32` for `arrays.DatetimeArray`. They are now `int32` on `DatetimeIndex` also:
The datetime index attributes changed from `int64` to `int32`:
In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='M')
In [5]: idx.array.year
Out[5]: array([2018, 2018, 2018], dtype=int32)
In [6]: idx.year
Out[6]: Index([2018, 2018, 2018], dtype='int32')
- Level dtypes on Indexes from `Series.sparse.from_coo()` are now of dtype `int32`, the same as they are on the `rows`/`cols` of a scipy sparse matrix. Previously they were of dtype `int64`.
`Series.sparse.from_coo()` now uses `int32` natively:
In [7]: from scipy import sparse
In [8]: A = sparse.coo_matrix(
...: ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
...: )
...:
In [9]: ser = pd.Series.sparse.from_coo(A)
In [10]: ser.index.dtypes
Out[10]:
level_0 int32
level_1 int32
dtype: object
- `Index` cannot be instantiated using a float16 dtype. Previously, instantiating an `Index` using dtype `float16` resulted in a `Float64Index` with a `float64` dtype. It now raises a `NotImplementedError`:
An `Index` can no longer be instantiated with the `float16` dtype; previously, instantiating with `float16` produced a `Float64Index`, but now it raises an error directly.
In [11]: pd.Index([1, 2, 3], dtype=np.float16)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[11], line 1
----> 1 pd.Index([1, 2, 3], dtype=np.float16)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:552, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
548 arr = ensure_wrapped_if_datetimelike(arr)
550 klass = cls._dtype_to_subclass(arr.dtype)
--> 552 arr = klass._ensure_array(arr, arr.dtype, copy=False)
553 return klass._simple_new(arr, name)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:565, in Index._ensure_array(cls, data, dtype, copy)
562 raise ValueError("Index data must be 1-dimensional")
563 elif dtype == np.float16:
564 # float16 not supported (no indexing engine)
--> 565 raise NotImplementedError("float16 indexes are not supported")
567 if copy:
568 # asarray_tuplesafe does not always copy underlying data,
569 # so need to make sure that this happens
570 data = data.copy()
NotImplementedError: float16 indexes are not supported
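If float16 data does need to live in an index, one workaround (an illustrative sketch, not part of the release notes) is to upcast to float32 first:

import numpy as np
import pandas as pd

half = np.array([1.0, 2.0, 3.0], dtype=np.float16)

# float16 has no indexing engine, so upcast before building the Index.
idx = pd.Index(half.astype(np.float32))
print(idx.dtype)  # float32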
1.3 Configuration option, mode.dtype_backend, to return pyarrow-backed dtypes
The pyarrow library provides a Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.
A configuration change that adds support for pyarrow dtypes.
The `use_nullable_dtypes` keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (GH36712).
A number of functions gain pyarrow support through this new keyword:
- read_csv()
- read_clipboard()
- read_fwf()
- read_excel()
- read_html()
- read_xml()
- read_json()
- read_sql()
- read_sql_query()
- read_sql_table()
- read_orc()
- read_feather()
- read_spss()
- to_numeric()
To simplify opting in to nullable dtypes for these functions, a new option `nullable_dtypes` was added that allows setting the keyword argument globally to `True` if not specified directly. The option can be enabled through:
In [12]: pd.options.mode.nullable_dtypes = True
The option will only work for functions with the keyword `use_nullable_dtypes`.
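As a sketch of the global opt-in (assuming this development-version option is available in your build), a reader such as read_csv() then returns nullable dtypes without the keyword being passed explicitly:

import io
import pandas as pd

pd.options.mode.nullable_dtypes = True

# use_nullable_dtypes=True is now implied for supported readers.
df = pd.read_csv(io.StringIO("a,b\n1,2.5\n,4.5"))
print(df.dtypes)  # a: Int64, b: Float64 (instead of float64/float64)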
Additionally, a new global configuration, `mode.dtype_backend`, can now be used in conjunction with the parameter `use_nullable_dtypes=True` in the following functions to select the nullable dtypes implementation.
Through this global setting, the functions below gain pyarrow support:
- read_csv()
- read_clipboard()
- read_fwf()
- read_excel()
- read_html()
- read_xml()
- read_json()
- read_sql()
- read_sql_query()
- read_sql_table()
- read_parquet()
- read_orc()
- read_feather()
- read_spss()
- to_numeric()
And the following methods will also utilize the `mode.dtype_backend` option.
These methods likewise honor this global configuration parameter.
By default, `mode.dtype_backend` is set to `"pandas"` to return existing, numpy-backed nullable dtypes, but it can also be set to `"pyarrow"` to return pyarrow-backed, nullable `ArrowDtype` (GH48957, GH49997).
In [13]: import io
In [14]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
....: 1,2.5,True,a,,,,,
....: 3,4.5,False,b,6,7.5,True,a,
....: """)
....:
In [15]: with pd.option_context("mode.dtype_backend", "pandas"):
....: df = pd.read_csv(data, use_nullable_dtypes=True)
....:
In [16]: df.dtypes
Out[16]:
a Int64
b Float64
c boolean
d string[python]
e Int64
f Float64
g boolean
h string[python]
i Int64
dtype: object
In [17]: data.seek(0)
Out[17]: 0
In [18]: with pd.option_context("mode.dtype_backend", "pyarrow"):
....: df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
....:
In [19]: df_pyarrow.dtypes
Out[19]:
a int64[pyarrow]
b double[pyarrow]
c bool[pyarrow]
d string[pyarrow]
e int64[pyarrow]
f double[pyarrow]
g bool[pyarrow]
h string[pyarrow]
i null[pyarrow]
dtype: object
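Beyond the I/O readers, pyarrow-backed data can also be constructed directly through ArrowDtype; a minimal sketch (assumes pyarrow is installed):

import pandas as pd
import pyarrow as pa

# A pyarrow-backed Series; missing values are Arrow nulls, not numpy NaN.
ser = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
print(ser.dtype)  # int64[pyarrow]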
1.4 Copy-on-Write improvements
Performance improvements from Copy-on-Write.
- A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in Copy-on-Write optimizations. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (GH49473).
- Accessing a single column of a DataFrame as a Series (e.g. `df["col"]`) now always returns a new object every time it is constructed when Copy-on-Write is enabled (rather than returning an identical, cached Series object multiple times). This ensures that those Series objects correctly follow the Copy-on-Write rules (GH49450).
- The `Series` constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of `copy=False` (GH50471).
- The `DataFrame` constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing `DataFrame` with the default of `copy=False` (GH51239).
- The `DataFrame` constructor, when constructing a DataFrame from a dictionary of Series objects and specifying `copy=False`, will now use a lazy copy of those Series objects for the columns of the DataFrame (GH50777).
- Trying to set values using chained assignment (for example, `df["a"][1:3] = 0`) will now always raise an exception when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative error is raised to the user instead of silently doing nothing (GH49467).
- `DataFrame.replace()` will now respect the Copy-on-Write mechanism when `inplace=True`.
- `DataFrame.transpose()` will now respect the Copy-on-Write mechanism.
- Arithmetic operations that can be inplace, e.g. `ser *= 2`, will now respect the Copy-on-Write mechanism.
Copy-on-Write can be enabled through one of:
pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True
Alternatively, copy on write can be enabled locally through:
with pd.option_context("mode.copy_on_write", True):
...
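As a small sketch of the lazy-copy behavior described above (illustrative, using only the options shown in this section): a method like reset_index() returns a view-backed object under Copy-on-Write, and the actual copy happens only when one side is modified, so the original is never mutated:

import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})

    # Lazy copy: df2 is backed by a view of df's data for now.
    df2 = df.reset_index(drop=True)

    # Writing to df2 triggers the deferred copy; df stays untouched.
    df2.iloc[0, 0] = 100
    print(df.loc[0, "a"])   # 1   (original unchanged)
    print(df2.loc[0, "a"])  # 100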