Pandas Basic Usage

Updated 2022-07-01 14:58

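This section covers the basic usage of Pandas data structures. The examples assume `import numpy as np` and `import pandas as pd`. The code below creates the example objects used in the previous section (Pandas Data Structures):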

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...:

Head and Tail

head() and tail() give a quick preview of a Series or DataFrame. They show 5 rows by default; pass a number to show a different count.

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]: 
0   -1.157892
1   -1.344312
2    0.844885
3    1.075770
4   -0.109050
dtype: float64

In [6]: long_series.tail(3)
Out[6]: 
997   -0.289388
998   -1.020544
999    0.589993
dtype: float64

Attributes and Underlying Data

Pandas objects expose several attributes for accessing metadata:

shape: the axis dimensions of the object, consistent with ndarray
Axis labels:
    Series: index (the only axis)
    DataFrame: index (rows) and columns

Note: these attributes can be safely assigned to!

In [7]: df[:2]
Out[7]: 
                   A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]: 
                   a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647

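Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the data and carry out the computation. For most types, the underlying array is a numpy.ndarray. However, Pandas and third-party libraries may extend NumPy's type system to add custom arrays (see Data Types).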

Use the .array attribute to get the data inside an Index or Series.

In [10]: s.array
Out[10]: 
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64

In [11]: s.index.array
Out[11]: 
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

.array is generally an ExtensionArray. What an ExtensionArray is, and why Pandas uses it, is beyond the scope of this section; see Data Types for more.

To get a NumPy array, use to_numpy() or numpy.asarray().

In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

When the Series or Index is backed by an ExtensionArray, to_numpy() may copy the data and coerce the values. See Data Types for details.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take datetimes with time zones as an example: NumPy has no dtype for timezone-aware datetimes, so there are two possible representations:

An object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz.
A datetime64[ns] numpy.ndarray, in which the values have been converted to UTC and the time zone discarded.

Time zones can be preserved with dtype=object.

In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]: 
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or discarded with dtype='datetime64[ns]'.

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]: 
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the raw data out of a DataFrame is a bit more involved. When all columns of the DataFrame share one data type, DataFrame.to_numpy() returns the underlying data:

In [17]: df.to_numpy()
Out[17]: 
array([[-0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949],
       [ 1.0718,  0.7216, -0.7068],
       [-1.0396,  0.2719, -0.425 ],
       [ 0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 ],
       [-1.715 , -1.0393, -0.3706]])

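If a DataFrame holds homogeneously typed data, the ndarray can be modified in place and the changes are reflected in the data structure. For heterogeneous data (that is, when the DataFrame's columns do not all share a dtype), this is not the case. Unlike the axis labels, the values attribute itself cannot be assigned to.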

Note
When working with heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. If the DataFrame contains strings, the result has object dtype. If it contains only floats and integers, the result has a float dtype.
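The note above can be checked with a small sketch. The two frames below are hypothetical and exist only for this illustration; the exact float width is whatever Pandas picks on your platform (float64 is assumed here).

```python
import numpy as np
import pandas as pd

# Hypothetical frames, used only for this illustration.
ints_and_floats = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
with_strings = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})

# Integers and floats are upcast to a common float dtype.
float_result = ints_and_floats.to_numpy()

# Mixing numbers and strings falls back to object dtype.
object_result = with_strings.to_numpy()

print(float_result.dtype)
print(object_result.dtype)
```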

In the past, Pandas recommended Series.values or DataFrame.values for extracting data from a Series or DataFrame. You will still find this in older code bases and online tutorials. Going forward, prefer .array or to_numpy() and avoid .values, which has the following drawbacks:

When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or an ExtensionArray. Series.array always returns an ExtensionArray and never copies data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying and coercing values.
When a DataFrame contains a mix of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the data in the DataFrame.
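As a minimal sketch of the difference, the example below uses a categorical Series, which is backed by an extension array. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A categorical Series is backed by an extension array (pd.Categorical).
cat = pd.Series(["a", "b", "a"], dtype="category")

backing = cat.array        # the ExtensionArray itself
as_numpy = cat.to_numpy()  # a NumPy array, built by converting the values

print(type(backing).__name__)
print(type(as_numpy).__name__)
```

With .array and to_numpy() the return type is predictable from the call itself, which is the point of preferring them over .values.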

Accelerated Operations

With the numexpr and bottleneck libraries, Pandas can accelerate certain types of binary numerical and boolean operations.

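These libraries are especially useful with large data sets, where the speedups are substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast on arrays containing NaNs.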

Consider the following example (DataFrames with 100 columns x 100,000 rows):

Operation	0.11.0 (ms)	Prior version (ms)	Ratio to prior
`df1 > df2`	13.32	125.35	0.1063
`df1 * df2`	21.71	36.63	0.5928
`df1 + df2`	22.04	36.50	0.6039

Installing both libraries is strongly recommended; see Recommended Dependencies for more information.

Both are enabled by default and can be controlled with the following options:

New in version 0.20.0.

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
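If the setting should only change for a few statements, one option is pd.option_context, which restores the previous value on exit. This is a sketch of that pattern rather than a recommendation from the original text; it only toggles the option and does not measure any speedup.

```python
import pandas as pd

before = pd.get_option("compute.use_numexpr")

# Temporarily disable numexpr acceleration inside the block.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

# The previous setting is restored once the block exits.
after = pd.get_option("compute.use_numexpr")

print(before, inside, after)
```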

Binary Operations

Binary operations between Pandas data structures involve two key points:

Broadcasting behavior between higher-dimensional (DataFrame) and lower-dimensional (Series) objects;
Handling of missing data in computations.

These two issues can be handled together, but they are covered separately first.

Matching / Broadcasting Behavior

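DataFrame provides the methods add(), sub(), mul(), and div(), along with radd(), rsub(), and so on, for carrying out binary operations. For broadcasting, Series input is the case of primary interest. With these methods, the axis keyword selects whether to match on the index or on the columns.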

In [18]: df = pd.DataFrame({
   ....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....: 

In [19]: df
Out[19]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')
Out[22]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis='index')
Out[24]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

A Series can also be aligned with one level of a MultiIndexed DataFrame.

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                         (1, 'c'), (2, 'a')],
   ....:                                        names=['first', 'second'])
   ....: 

In [28]: dfmi.sub(column, axis=0, level='second')
Out[28]: 
                   one       two     three
first second                              
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

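Series and Index also support the divmod() builtin. It performs floor division and the modulo operation at the same time, returning a two-tuple whose elements have the same type as the left-hand side. For example: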

In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() also works elementwise:

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div
Out[40]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [41]: rem
Out[41]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64

Missing Data and Operations with Fill Values

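The arithmetic functions of Series and DataFrame accept a fill_value option: a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrames, you may want NaN treated as 0 unless both DataFrames are missing that value, in which case the result is still NaN. You can also replace NaN with a value of your choice afterwards using fillna.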

Note

The official Pandas documentation does not show where df2 in the code below comes from, so it is filled in here. Judging by the output in Out[43], df2 equals df except for the value in row a, column three:

```python
df2 = df.copy()
df2.loc['a', 'three'] = 1.0
```

In [42]: df
Out[42]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [43]: df2
Out[43]: 
        one       two     three
a  1.394981  1.772517  1.000000
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [44]: df + df2
Out[44]: 
        one       two     three
a  2.789963  3.545034       NaN
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343

In [45]: df.add(df2, fill_value=0)
Out[45]: 
        one       two     three
a  2.789963  3.545034  1.000000
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343

Comparison Operations

As with the arithmetic operations in the previous subsection, Series and DataFrame offer the binary comparison methods eq, ne, lt, gt, le, and ge:

No.	Method	Meaning
1	eq	equal to
2	ne	not equal to
3	lt	less than
4	gt	greater than
5	le	less than or equal to
6	ge	greater than or equal to

In [46]: df.gt(df2)
Out[46]: 
     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [47]: df2.ne(df)
Out[47]: 
     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False

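These operations produce a Pandas object of the same type as the left-hand input, with dtype bool. Such boolean objects can be used in indexing operations; see Boolean Indexing.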
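A short sketch of using a comparison result as an index. The scores frame and its labels are made up for this illustration.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
scores = pd.DataFrame({"math": [55, 90, 72], "art": [80, 60, 95]},
                      index=["ann", "bob", "cy"])

mask = scores["math"].gt(60)  # boolean Series aligned with scores' index
passed = scores[mask]         # keep only the rows where the mask is True

print(mask.dtype)
print(list(passed.index))
```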

Boolean Reductions

The reductions empty, any(), all(), and bool() summarize data down to a boolean result.

In [48]: (df > 0).all()
Out[48]: 
one      False
two       True
three    False
dtype: bool

In [49]: (df > 0).any()
Out[49]: 
one      True
two      True
three    True
dtype: bool

The result can be reduced further to a single boolean value.

In [50]: (df > 0).any().any()
Out[50]: True

The empty property tests whether a Pandas object is empty.

In [51]: df.empty
Out[51]: False

In [52]: pd.DataFrame(columns=list('ABC')).empty
Out[52]: True

Use the bool() method to evaluate a single-element Pandas object in a boolean context.

In [53]: pd.Series([True]).bool()
Out[53]: True

In [54]: pd.Series([False]).bool()
Out[54]: False

In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True

In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False

Warning

The following code:

>>> if df:
...     pass

or

>>> df and df2

Both raise an error, because each tries to evaluate the truth value of an object that holds multiple values:

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See the Gotchas section for a more detailed discussion.
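A minimal sketch of the failure and of the explicit alternatives follows. The frame is a made-up example; which reduction is right depends on what the condition is meant to express.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 0], "b": [0, 0]})

try:
    if frame:  # ambiguous: which of the four values should decide?
        pass
    raised = False
except ValueError:
    raised = True

# Explicit alternatives, each answering a different question.
has_rows = not frame.empty                     # does the frame hold any rows?
any_nonzero = bool((frame != 0).any().any())   # is at least one value non-zero?
all_nonzero = bool((frame != 0).all().all())   # are all values non-zero?

print(raised, has_rows, any_nonzero, all_nonzero)
```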

Comparing Whether Objects Are Equivalent

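Often there is more than one way to compute the same result. Take df + df and df * 2. Using the tools from the previous subsection, you might test that the two give the same result with (df + df == df * 2).all(). In fact, this expression is False for some columns: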

In [57]: df + df == df * 2
Out[57]: 
     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [58]: (df + df == df * 2).all()
Out[58]: 
one      False
two       True
three    False
dtype: bool

Note that the boolean DataFrame df + df == df * 2 contains some False values! This is because two NaN values do not compare as equal:

In [59]: np.nan == np.nan
Out[59]: False

To test for equivalence, N-dimensional frames such as Series and DataFrame provide an equals() method, which treats NaNs in corresponding locations as equal.

In [60]: (df + df).equals(df * 2)
Out[60]: True

Note that the index of the Series or DataFrame must be in the same order for equals() to return True:

In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

In [62]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False

In [64]: df1.equals(df2.sort_index())
Out[64]: True

Comparing Array-Like Objects

Comparing a Pandas data structure element-wise with a scalar value is straightforward:

In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[65]: 
0     True
1    False
2    False
dtype: bool

In [66]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[66]: array([ True, False, False])

Pandas also handles element-wise comparisons between array-like objects of the same length:

In [67]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[67]: 
0     True
1     True
2    False
dtype: bool

In [68]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[68]: 
0     True
1     True
2    False
dtype: bool

Comparing Index or Series objects of different lengths raises a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note that this differs from NumPy, where a comparison can be broadcast:

In [69]: np.array([1, 2, 3]) == np.array([2])
Out[69]: array([False,  True, False])

When NumPy cannot broadcast the comparison, it returns False (in the NumPy version used for this output):

In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False

Combining Overlapping Data Sets

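Sometimes you need to combine two similar data sets where values in one are preferred over the other. An example is two data series representing a particular economic indicator, one considered "higher quality" and the other "lower quality". The lower-quality series might extend further back in history or have broader data coverage. You would therefore like to combine the two DataFrame objects so that missing values in one DataFrame are conditionally filled with like-labeled values from the other. The function that does this is combine_first(), shown in the code below.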

In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [72]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [73]: df1
Out[73]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [74]: df2
Out[74]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [75]: df1.combine_first(df2)
Out[75]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

General DataFrame Combine

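The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the two input DataFrames, and then passes the combiner function pairs of Series (that is, columns whose names are the same).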

For instance, the following combiner reproduces combine_first() as above:

In [76]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....:
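The combiner defined above is only useful once it is handed to DataFrame.combine(). The sketch below rebuilds df1 and df2 from the previous subsection so that it runs on its own, then checks the result against combine_first().

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                    'B': [np.nan, np.nan, 3., 4., 6., 8.]})


def combiner(x, y):
    # Keep x where it is present; otherwise fall back to y.
    return np.where(pd.isna(x), y, x)


combined = df1.combine(df2, combiner)
expected = df1.combine_first(df2)

# Compare values only, treating NaN in the same position as equal.
same = np.allclose(combined.to_numpy(dtype=float),
                   expected.to_numpy(dtype=float),
                   equal_nan=True)
print(same)
```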

Descriptive Statistics

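Series and DataFrame have a large number of methods for computing descriptive statistics and related operations. Most are aggregations such as sum(), mean(), and quantile(), which produce a result smaller than the original data set; others, such as cumsum() and cumprod(), produce a result of the same size. These methods generally take an axis argument, just like ndarray.{sum,std,…}, but here the axis can be given by name or by integer: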

Series: no axis argument needed
DataFrame: "index" (axis=0, the default) or "columns" (axis=1)

For example:

In [77]: df
Out[77]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [78]: df.mean(0)
Out[78]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [79]: df.mean(1)
Out[79]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

All of these methods accept a skipna keyword that controls whether missing data is excluded (True by default).

In [80]: df.sum(0, skipna=False)
Out[80]: 
one           NaN
two      5.442353
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]: 
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64

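Combined with broadcasting and arithmetic, these methods describe various statistical procedures concisely. Standardization, for example (rescaling data to zero mean and a standard deviation of 1), is very simple: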

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]: 
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

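Note: methods such as cumsum() and cumprod() preserve the location of NaN values. This differs somewhat from expanding() and rolling(); see the documentation on those methods for details.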

In [86]: df.cumsum()
Out[86]: 
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873
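To make the difference in NaN handling concrete, here is a small sketch on a three-element Series. It assumes the default min_periods of expanding(), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# cumsum() leaves NaN where the input had NaN.
cumulative = s.cumsum()

# expanding().sum() reports the running sum of the values seen so far.
expanding = s.expanding().sum()

print(cumulative.tolist())
print(expanding.tolist())
```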

The table below summarizes common functions. Each also takes an optional level parameter, which applies only if the object has a hierarchical index.

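Function	Description
`count`	Number of non-NA observations
`sum`	Sum of values
`mean`	Mean of values
`mad`	Mean absolute deviation
`median`	Arithmetic median of values
`min`	Minimum
`max`	Maximum
`mode`	Mode
`abs`	Absolute value
`prod`	Product of values
`std`	Bessel-corrected sample standard deviation
`var`	Unbiased variance
`sem`	Standard error of the mean
`skew`	Sample skewness (3rd moment)
`kurt`	Sample kurtosis (4th moment)
`quantile`	Sample quantile (value at %)
`cumsum`	Cumulative sum
`cumprod`	Cumulative product
`cummax`	Cumulative maximum
`cummin`	Cumulative minimum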

Note that NumPy functions such as mean, std, and sum exclude NAs by default when applied to a Series:

In [87]: np.mean(df['one'])
Out[87]: 0.8110935116651192

In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan

Series.nunique() returns the number of unique non-NA values in a Series:

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20] = 5

In [92]: series.nunique()
Out[92]: 11

Summarizing Data: describe

describe() computes a variety of summary statistics about a Series or the columns of a DataFrame. Note that NAs are excluded.

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....: 

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also select specific percentiles to include in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

By default, the median is always included.

For a non-numerical Series, describe() returns the count of values, the number of unique values, the most frequent value, and its frequency.

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]: 
count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame, describe() restricts the summary to numerical columns or, if there are none, to categorical columns:

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behavior can be controlled by passing a list of types as the include/exclude arguments. The special value all can also be used:

In [104]: frame.describe(include=['object'])
Out[104]: 
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]: 
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

This feature relies on select_dtypes; refer to its documentation for the accepted inputs.
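Since describe() delegates column selection to select_dtypes, calling that method directly shows which columns a given include/exclude value would pick. A minimal sketch, reusing the frame from the example above:

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

# 'number' matches every numeric dtype.
numeric_only = frame.select_dtypes(include=['number'])
non_numeric = frame.select_dtypes(exclude=['number'])

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```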

Index of Min/Max Values

The idxmax() and idxmin() functions on Series and DataFrame compute the index labels of the maximum and minimum values.

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]: 
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()
Out[109]: (3, 0)

In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [111]: df1
Out[111]: 
          A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [112]: df1.idxmin(axis=0)
Out[112]: 
A    2
B    3
C    1
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]: 
0    C
1    A
2    C
3    A
4    C
dtype: object

When several rows (or columns) match the maximum or minimum value, idxmax() and idxmin() return the index of the first match:

In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]: 
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

Note

idxmin and idxmax correspond to argmin and argmax in NumPy.
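One practical difference is worth a sketch: idxmin() returns an index label, while the argmin-style functions return an integer position. The example reuses the values of df3['A'] above; np.nanargmin is used on the NumPy side because the data contains a NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list('edcba'))

label = s.idxmin()     # index label of the first minimum
position = s.argmin()  # integer position of the first minimum

# Plain NumPy equivalent, ignoring the NaN.
numpy_position = int(np.nanargmin(s.to_numpy()))

print(label, position, numpy_position)
```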

Value Counts (Histogramming) and Mode

The value_counts() Series method and the top-level function of the same name compute a histogram of a 1D array of values. The top-level function can also be used on regular arrays:

In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]: 
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
       2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
       6, 2, 6, 1, 5, 4])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]: 
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64

In [121]: pd.value_counts(data)
Out[121]: 
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64

Similarly, you can get the mode, that is, the most frequently occurring value(s), of a Series or DataFrame:

In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [123]: s5.mode()
Out[123]: 
0    3
1    7
dtype: int64

In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....: 

In [125]: df5.mode()
Out[125]: 
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13

Discretization and Quantiling

Continuous values can be discretized with cut() (bins based on values) and qcut() (bins based on sample quantiles):

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]: 
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                    (1.179, 1.893]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]: 
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, the code below splits normally distributed data into equal-size quartiles:

In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]: 
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                    (1.184, 2.346]]

In [134]: pd.value_counts(factor)
Out[134]: 
(1.184, 2.346]      8
(-2.278, -0.301]    8
(0.569, 1.184]      7
(-0.301, 0.569]     7
dtype: int64

Infinite values can also be passed to define the bins:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]: 
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]

Function Application

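To apply your own or another library's functions to Pandas objects, you should know the methods below. Which one to use depends on whether the function operates on an entire DataFrame or Series, on rows or columns, or on individual elements.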

Tablewise function application: pipe()
Row- or column-wise function application: apply()
Aggregation API: agg() and transform()
Elementwise function application: applymap()

Tablewise Function Application

DataFrames and Series can be passed into functions directly. However, if the function needs to be called in a chain, consider using the pipe() method. Compare the following two forms:

# f, g, and h are functions that take and return DataFrames
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

The following is equivalent:

>>> (df.pipe(h)
...    .pipe(g, arg1=1)
...    .pipe(f, arg2=2, arg3=3))

Pandas encourages the second style, known as method chaining. pipe makes it easy to use your own or another library's functions in a method chain, alongside Pandas' own methods.

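In the example above, f, g, and h each expect the DataFrame as the first positional argument. What if the function takes its data as, say, the second argument? In that case, give pipe a tuple of (callable, data_keyword). .pipe routes the DataFrame to the argument named in the tuple.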

The example below fits a regression with statsmodels. That API expects a formula first and the DataFrame as the second argument, data. So the function and keyword pair (sm.ols, 'data') is passed to pipe.

In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....: 
Out[140]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Thu, 22 Aug 2019   Prob (F-statistic):           3.48e-15
Time:                        15:48:59   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

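The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the (%>%) operator (read "pipe") for R. The implementation of pipe here is quite clean and feels right at home in Python. Reading the source code of pipe() is strongly encouraged.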
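For readers without statsmodels or the baseball data set, the (callable, data_keyword) form can be tried with a small stand-in. The scale function below is hypothetical and exists only to have its data argument in second position.

```python
import pandas as pd


def scale(factor, data):
    # Hypothetical helper: the DataFrame arrives as the second argument, "data".
    return data * factor


df = pd.DataFrame({"x": [1, 2, 3]})

# pipe passes df as the keyword argument "data"; 10 goes to "factor".
result = df.pipe((scale, "data"), 10)

print(result["x"].tolist())
```

The call is equivalent to scale(10, data=df), written so that it can sit inside a method chain.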

Row- or Column-wise Function Application

The apply() method applies an arbitrary function along an axis of a DataFrame. Like the descriptive statistics methods, it takes an optional axis argument:

In [141]: df.apply(np.mean)
Out[141]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [142]: df.apply(np.mean, axis=1)
Out[142]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())
Out[143]: 
one      1.051928
two      1.632779
three    1.840607
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]: 
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

In [145]: df.apply(np.exp)
Out[145]: 
        one       two     three
a  4.034899  5.885648       NaN
b  1.409244  6.767440  0.950858
c  2.004201  4.385785  3.412466
d       NaN  1.322262  0.541630

apply() also accepts the name of a method as a string:

In [146]: df.apply('mean')
Out[146]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [147]: df.apply('mean', axis=1)
Out[147]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

By default, the return type of the function passed to apply() affects the type of the final output of DataFrame.apply:

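If the applied function returns a Series, the final output is a DataFrame. Its columns match the index of the Series returned by the function.
If the applied function returns any other type, the final output is a Series.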

This default can be overridden with result_type, which accepts three options: reduce, broadcast, and expand. These determine whether list-like return values are expanded into a DataFrame.
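A short sketch of the expand option, using a made-up two-column frame and a hypothetical min_max helper that returns a plain list:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [3, 2]})


def min_max(row):
    # Returns a plain list, not a Series.
    return [row.min(), row.max()]


# Default behavior: each list becomes one element of a Series.
as_series = df.apply(min_max, axis=1)

# result_type="expand": each list is spread across columns of a DataFrame.
as_frame = df.apply(min_max, axis=1, result_type="expand")

print(type(as_series).__name__, type(as_frame).__name__)
```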

Used well, apply() can answer many questions about a data set. For example, to extract the date on which each column reached its maximum:

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....: 

In [149]: tsdf.apply(lambda x: x.idxmax())
Out[149]: 
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

You can also pass additional positional and keyword arguments to apply(). For instance, consider this function you would like to apply:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

It can then be applied as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)

Another useful feature is the ability to pass Series methods, carrying out a Series operation on each row or column:

In [150]: tsdf
Out[150]: 
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]: 
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

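apply() takes an argument raw, which is False by default; in that case each row or column is converted into a Series before the function is applied. When raw is True, the passed function receives an ndarray instead, which can improve performance noticeably if you do not need the indexing functionality.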
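The sketch below records what the applied function actually receives when raw=True. The frame and the value_range helper are illustrative; no timing is claimed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 4.0], "b": [3.0, 2.0]})
seen_types = []


def value_range(values):
    seen_types.append(type(values))  # record what apply() hands over
    return values.max() - values.min()


# With raw=True each column arrives as a NumPy array, not a Series.
ranges = df.apply(value_range, raw=True)

print(ranges.tolist())
```

Because the function sees a bare array, label-based access such as values['a'] is not available inside it.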

Aggregation API

New in version 0.20.0.

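The aggregation API lets you express multiple aggregation operations quickly and concisely. Pandas objects support several similar APIs, such as the groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().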

We use a DataFrame similar to the one in the example above:

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....: 

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]: 
                   A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

Using a single function is equivalent to apply(). The aggregation can also be named with a string. The aggregations below return a Series:

In [155]: tsdf.agg(np.sum)
Out[155]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

# Because a single function is applied, this is equivalent to `.sum()`
In [157]: tsdf.sum()
Out[157]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

A single aggregation on a Series returns a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 3.033606102414146

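#Aggregating with multiple functions

You can also pass multiple aggregation functions as a list. The result of each function becomes a row in the resulting DataFrame, and each row is named after its aggregation function.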

In [159]: tsdf.agg(['sum'])
Out[159]: 
            A         B        C
sum  3.033606 -1.803879  1.57551

Multiple functions yield multiple rows:

In [160]: tsdf.agg(['sum', 'mean'])
Out[160]: 
             A         B         C
sum   3.033606 -1.803879  1.575510
mean  0.505601 -0.300647  0.262585

On a Series, multiple functions return a Series indexed by the function names:

In [161]: tsdf.A.agg(['sum', 'mean'])
Out[161]: 
sum     3.033606
mean    0.505601
Name: A, dtype: float64

Passing a lambda function yields a row named <lambda>:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
Out[162]: 
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function yields a row with that function's name:

In [163]: def mymean(x):
   .....:     return x.mean()
   .....: 

In [164]: tsdf.A.agg(['sum', mymean])
Out[164]: 
sum       3.033606
mymean    0.505601
Name: A, dtype: float64

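#Aggregating with a dict

To control which functions are applied to which columns, pass DataFrame.agg a dictionary that maps each column name to a single function (or function name) or to a list of them.

Note: the order of the output is not guaranteed here. If you need the output order to match the input order, use an OrderedDict.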

In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
Out[165]: 
A    0.505601
B   -1.803879
dtype: float64

When a list is passed for a column, the output is a DataFrame holding a matrix-like result for all aggregators, with one row per unique function. Entries for functions that were not requested for a given column are NaN:

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
Out[166]: 
             A         B
mean  0.505601       NaN
min  -0.749892       NaN
sum        NaN -1.803879

#Mixed dtypes

Like .agg on a groupby, when the DataFrame contains dtypes that cannot be aggregated, .agg only takes the aggregations that are valid:

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....: 

In [168]: mdf.dtypes
Out[168]: 
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])
Out[169]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

#Custom describe

With .agg() you can build a custom describe function that resembles the built-in describe function.

In [170]: from functools import partial

In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

In [172]: q_25.__name__ = '25%'

In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

In [174]: q_75.__name__ = '75%'

In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])
Out[175]: 
               A         B         C
count   6.000000  6.000000  6.000000
mean    0.505601 -0.300647  0.262585
std     1.103362  0.887508  0.606860
min    -0.749892 -1.333363 -0.757304
25%    -0.239885 -0.979600  0.128907
median  0.303398 -0.278111  0.225365
75%     1.146791  0.151678  0.722709
max     2.169758  1.004194  0.896839

#Transform API

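New in version 0.20.0.

The transform() method returns an object that has the same index and the same size as the original. Like the .agg API, it lets you apply several operations at once instead of one at a time.

First, create a DataFrame: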

In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....: 

In [177]: tsdf.iloc[3:7] = np.nan

In [178]: tsdf
Out[178]: 
                   A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

Here the entire DataFrame is transformed. .transform() accepts NumPy functions, string function names, and user-defined functions.

In [179]: tsdf.transform(np.abs)
Out[179]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [180]: tsdf.transform('abs')
Out[180]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [181]: tsdf.transform(lambda x: x.abs())
Out[181]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Here transform() received a single function; this is equivalent to applying the ufunc directly.

In [182]: np.abs(tsdf)
Out[182]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Passing a single function to .transform() on a Series returns a single Series.

In [183]: tsdf.A.transform(np.abs)
Out[183]: 
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
Freq: D, Name: A, dtype: float64
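
The requirement that the result keeps the original index can be illustrated with a small sketch. The frame below is made up for illustration, and the example assumes that a reducing function such as 'sum' is rejected with a ValueError, which is how the pandas releases known to us behave.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a date index (3 rows, 2 columns).
frame = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]},
                     index=pd.date_range("2000-01-01", periods=3))

# An elementwise function keeps the index and the shape.
same_shape = frame.transform(np.abs)

# A reducing function does not produce a like-indexed result.
try:
    frame.transform("sum")
    reducing_error = None
except ValueError as exc:
    reducing_error = exc

print(same_shape)
print(repr(reducing_error))
```

For reductions, use agg() instead; transform() is meant for functions whose output lines up with the input row by row.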

#Transform with multiple functions

Passing multiple functions yields a DataFrame with MultiIndex columns. The first level holds the original column names; the second level holds the names of the transforming functions.

In [184]: tsdf.transform([np.abs, lambda x: x + 1])
Out[184]: 
                   A                   B                   C          
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932

Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions.

In [185]: tsdf.A.transform([np.abs, lambda x: x + 1])
Out[185]: 
            absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124

#Transforming with a dict

Passing a dict of functions applies the selected transform to each column.

In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
Out[186]: 
                   A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict of lists yields a DataFrame with MultiIndex columns named after the transforming functions.

In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})
Out[187]: 
                   A         B          
            absolute  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
2000-01-10  0.030876  1.371900  0.609836

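#Applying elementwise functions

Not all functions can be vectorized, that is, accept a NumPy array and return another array or a value. For those cases, applymap() on DataFrame and map() on Series accept any Python function that takes a single value and returns a single value.

For example: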

In [188]: df4
Out[188]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [189]: def f(x):
   .....:     return len(str(x))
   .....: 

In [190]: df4['one'].map(f)
Out[190]: 
a    18
b    19
c    18
d     3
Name: one, dtype: int64

In [191]: df4.applymap(f)
Out[191]: 
   one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19

Series.map() has an additional feature: it can be used to "link" or "map" values defined by a second Series. This is closely related to the merging / joining functionality:

In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   .....:               index=['a', 'b', 'c', 'd', 'e'])
   .....: 

In [193]: t = pd.Series({'six': 6., 'seven': 7.})

In [194]: s
Out[194]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [195]: s.map(t)
Out[195]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
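
A plain dict works as the lookup as well. The sketch below uses made-up values and shows what happens to a key that the lookup does not contain.

```python
import pandas as pd

# Hypothetical data: "eight" has no entry in the lookup.
words = pd.Series(["six", "seven", "six", "eight"], index=["a", "b", "c", "d"])
lookup = {"six": 6.0, "seven": 7.0}

mapped = words.map(lookup)        # unmatched values become NaN
patched = mapped.fillna(-1.0)     # one possible way to handle them

print(mapped)
print(patched)
```

Values without a match are not an error; they come back as missing values, much like the unmatched rows of a left join.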

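#Reindexing and altering labels

reindex() is the fundamental data alignment method in Pandas. It is used to implement nearly every other feature that relies on label alignment. To reindex means to conform the data to a given set of labels along a particular axis. This accomplishes several things:

Reorders the existing data to match a new set of labels;
Inserts missing value (NA) markers at label locations where no data existed for that label;
If specified, fills data for missing labels using some logic (highly relevant when working with time series data).

Here is a simple example: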

In [196]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [197]: s
Out[197]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
e   -1.326508
dtype: float64

In [198]: s.reindex(['e', 'b', 'f', 'd'])
Out[198]: 
e   -1.326508
b    1.328614
f         NaN
d   -0.385845
dtype: float64

Here, the label f was not contained in the original Series, so it appears as NaN in the result.
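
If NaN is not the placeholder you want, reindex() also accepts a fill_value argument. A minimal sketch with made-up data:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

with_nan = ser.reindex(["c", "f", "a"])                   # "f" becomes NaN
with_zero = ser.reindex(["c", "f", "a"], fill_value=0.0)  # "f" becomes 0.0

print(with_nan)
print(with_zero)
```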

With a DataFrame, you can reindex the index and the columns simultaneously:

In [199]: df
Out[199]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [200]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[200]: 
      three       two       one
c  1.227435  1.478369  0.695246
f       NaN       NaN       NaN
b -0.050390  1.912123  0.343054

reindex also accepts the axis keyword:

In [201]: df.reindex(['c', 'f', 'b'], axis='index')
Out[201]: 
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

Note: the Index objects that hold the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [202]: rs = s.reindex(df.index)

In [203]: rs
Out[203]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
dtype: float64

In [204]: rs.index is df.index
Out[204]: True

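This means that the index of the reindexed Series is the same Python object as the index of the DataFrame.

New in version 0.21.0.

DataFrame.reindex() also supports an "axis-style" calling convention, where you specify a single labels argument and the axis it applies to.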

In [205]: df.reindex(['c', 'f', 'b'], axis='index')
Out[205]: 
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

In [206]: df.reindex(['three', 'two', 'one'], axis='columns')
Out[206]: 
      three       two       one
a       NaN  1.772517  1.394981
b -0.050390  1.912123  0.343054
c  1.227435  1.478369  0.695246
d -0.613172  0.279344       NaN

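Note

The section on MultiIndex and advanced indexing shows an even more concise way of reindexing.

Note

When writing performance-sensitive code, it is worth taking the time to understand reindex in depth: many operations are faster on pre-aligned data. Adding two unaligned DataFrames triggers a reindexing step internally. In exploratory analysis you will hardly notice the difference, because reindex has been heavily optimized, but when CPU cycles matter, a few explicit reindex calls can have an impact.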
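
What "pre-aligned" means can be sketched with two small, made-up frames: the implicit alignment performed by `+` gives the same result as reindexing both operands to the union of their indexes first.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit: pandas aligns both operands before adding.
implicit = left + right

# Explicit: align once up front, then add the already-aligned frames.
union = left.index.union(right.index)
left_aligned = left.reindex(union)
right_aligned = right.reindex(union)
explicit = left_aligned + right_aligned

print(implicit)
print(implicit.equals(explicit))
```

When the same frames take part in many operations, aligning them once and reusing the aligned copies avoids repeating that work; whether it pays off in a given program is something to measure rather than assume.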

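#Reindexing to align with another object

You may wish to take an object and reindex its axes so that they are labeled the same as another object. The syntax for this is straightforward but verbose, so the reindex_like() method is available as a simpler way to do it: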

In [207]: df2
Out[207]: 
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369

In [208]: df3
Out[208]: 
        one       two
a  0.583888  0.051514
b -0.468040  0.191120
c -0.115848 -0.242634

In [209]: df.reindex_like(df2)
Out[209]: 
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369
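
For comparison, the verbose spelling that reindex_like() replaces looks roughly like this. The frames `full` and `template` are made up for illustration.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                    index=list("abcd"), columns=["one", "two", "three"])
template = pd.DataFrame(0.0, index=list("abc"), columns=["one", "two"])

short_form = full.reindex_like(template)
long_form = full.reindex(index=template.index, columns=template.columns)

print(short_form)
print(short_form.equals(long_form))
```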

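#Aligning objects with each other with align

The align() method is the fastest way to align two objects simultaneously. It supports a join argument (see joining and merging):

join='outer': take the union of the indexes (default)
join='left': use the calling object's index
join='right': use the passed object's index
join='inner': intersect the indexes

It returns a tuple containing both reindexed Series: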

In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [211]: s1 = s[:4]

In [212]: s2 = s[1:]

In [213]: s1.align(s2)
Out[213]: 
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e         NaN
 dtype: float64, a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e    1.114285
 dtype: float64)

In [214]: s1.align(s2, join='inner')
Out[214]: 
(b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64, b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)

In [215]: s1.align(s2, join='left')
Out[215]: 
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64, a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)

For DataFrames, the join is applied to both the index and the columns by default:

In [216]: df.align(df2, join='inner')
Out[216]: 
(        one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369,         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

You can also pass an axis option to align on the specified axis only:

In [217]: df.align(df2, join='inner', axis=0)
Out[217]: 
(        one       two     three
 a  1.394981  1.772517       NaN
 b  0.343054  1.912123 -0.050390
 c  0.695246  1.478369  1.227435,         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

If you pass a Series to DataFrame.align(), you can use the axis argument to choose whether to align both objects on the DataFrame's index or on its columns:

In [218]: df.align(df2.iloc[0], axis=1)
Out[218]: 
(        one     three       two
 a  1.394981       NaN  1.772517
 b  0.343054 -0.050390  1.912123
 c  0.695246  1.227435  1.478369
 d       NaN -0.613172  0.279344, one      1.394981
 three         NaN
 two      1.772517
 Name: a, dtype: float64)

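reindex() accepts an optional method argument, a filling method chosen from the following table:

Method	Action
pad / ffill	Fill values forward
bfill / backfill	Fill values backward
nearest	Fill from the nearest index value

We illustrate these fill methods on a simple Series: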

In [219]: rng = pd.date_range('1/3/2000', periods=8)

In [220]: ts = pd.Series(np.random.randn(8), index=rng)

In [221]: ts2 = ts[[0, 3, 6]]

In [222]: ts
Out[222]: 
2000-01-03    0.183051
2000-01-04    0.400528
2000-01-05   -0.015083
2000-01-06    2.395489
2000-01-07    1.414806
2000-01-08    0.118428
2000-01-09    0.733639
2000-01-10   -0.936077
Freq: D, dtype: float64

In [223]: ts2
Out[223]: 
2000-01-03    0.183051
2000-01-06    2.395489
2000-01-09    0.733639
dtype: float64

In [224]: ts2.reindex(ts.index)
Out[224]: 
2000-01-03    0.183051
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [225]: ts2.reindex(ts.index, method='ffill')
Out[225]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

In [226]: ts2.reindex(ts.index, method='bfill')
Out[226]: 
2000-01-03    0.183051
2000-01-04    2.395489
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    0.733639
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [227]: ts2.reindex(ts.index, method='nearest')
Out[227]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

These methods require that the index is ordered, either increasing or decreasing.

Note: except for method='nearest', the same result can be achieved with fillna or interpolate:

In [228]: ts2.reindex(ts.index).fillna(method='ffill')
Out[228]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

reindex() raises a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() do not perform any checks on the order of the index.
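
The ordering requirement can be reproduced with a tiny, made-up Series whose index is deliberately out of order. The sketch also shows two ways around the error; which one is appropriate depends on whether the original order carries meaning.

```python
import pandas as pd

# Hypothetical Series with a non-monotonic index.
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])
target = [1, 2, 3, 4]

try:
    unordered.reindex(target, method="ffill")
    error = None
except ValueError as exc:
    error = exc

# Option 1: reindex without a method, then fill afterwards.
filled_later = unordered.reindex(target).ffill()

# Option 2: sort the index first, then reindex with a method.
sorted_first = unordered.sort_index().reindex(target, method="ffill")

print(repr(error))
print(filled_later)
```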

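#Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the maximum count of consecutive matches: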

In [229]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[229]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and the indexer values:

In [230]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[230]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

Note: when used on a DatetimeIndex, TimedeltaIndex, or PeriodIndex, tolerance is coerced into a Timedelta if possible, so you can specify it with an appropriate string.
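
tolerance is expressed in the units of the index itself. The sketch below uses made-up data to show a numeric tolerance on an integer index and an explicit pd.Timedelta on a date index.

```python
import pandas as pd

# Integer index: tolerance is a plain number.
sparse = pd.Series([1.0, 2.0], index=[0, 10])
near = sparse.reindex([0, 1, 5, 9, 10], method="nearest", tolerance=1)

# Date index: tolerance as an explicit Timedelta instead of a string.
dates = pd.date_range("2000-01-03", periods=4)
sparse_ts = pd.Series([1.0], index=dates[:1])
near_ts = sparse_ts.reindex(dates, method="ffill",
                            tolerance=pd.Timedelta("1 day"))

print(near)
print(near_ts)
```

Label 5 is more than 1 away from both 0 and 10, so it stays NaN; likewise only the first day after the single observation is filled in the date example.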

#Dropping labels from an axis

A method closely related to reindex is drop(), which removes a set of labels from an axis:

In [231]: df
Out[231]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [232]: df.drop(['a', 'd'], axis=0)
Out[232]: 
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435

In [233]: df.drop(['one'], axis=1)
Out[233]: 
        two     three
a  1.772517       NaN
b  1.912123 -0.050390
c  1.478369  1.227435
d  0.279344 -0.613172

Note: the following also works, but it is less obvious and less clean:

In [234]: df.reindex(df.index.difference(['a', 'd']))
Out[234]: 
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435

#Renaming / mapping labels

The rename() method lets you relabel an axis based on a mapping (a dict or Series) or an arbitrary function.

In [235]: s
Out[235]: 
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
dtype: float64

In [236]: s.rename(str.upper)
Out[236]: 
A   -0.186646
B   -1.692424
C   -0.303893
D   -1.425662
E    1.114285
dtype: float64

If you pass a function, it must return a value when called with any of the labels, and it must produce a set of unique values. A dict or Series can also be used:

In [237]: df.rename(columns={'one': 'foo', 'two': 'bar'},
   .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
   .....: 
Out[237]: 
             foo       bar     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

If the mapping does not include a column or index label, that label is not renamed. Note that extra labels in the mapping do not raise an error.
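
Silently ignoring extra labels is convenient, but it can hide typos. The sketch below uses made-up data and relies on the errors argument of rename(), which is available in later pandas releases than the ones this section otherwise describes.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1], "two": [2]})
mapping = {"one": "foo", "missing": "bar"}   # "missing" is not a column

# Default behaviour: the unknown key is ignored.
renamed = frame.rename(columns=mapping)

# Opt-in strictness: unknown keys raise a KeyError.
try:
    frame.rename(columns=mapping, errors="raise")
    strict_error = None
except KeyError as exc:
    strict_error = exc

print(list(renamed.columns))
print(repr(strict_error))
```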

New in version 0.21.0.

DataFrame.rename() also supports an "axis-style" calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]: 
        foo       bar     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [239]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
Out[239]: 
             one       two     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

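The rename() method also provides an inplace named parameter, which defaults to False and copies the underlying data. Pass inplace=True to rename the data in place.

New in version 0.18.0.

Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.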

In [240]: s.rename("scalar-name")
Out[240]: 
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
Name: scalar-name, dtype: float64

New in version 0.24.0.

The rename_axis() method lets you change the names of the levels of a MultiIndex, as opposed to its labels.

In [241]: df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
   .....:                    'y': [10, 20, 30, 40, 50, 60]},
   .....:                   index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
   .....:                   names=['let', 'num']))
   .....: 

In [242]: df
Out[242]: 
         x   y
let num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [243]: df.rename_axis(index={'let': 'abc'})
Out[243]: 
         x   y
abc num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [244]: df.rename_axis(index=str.upper)
Out[244]: 
         x   y
LET NUM       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

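#Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the "keys" of the object.

In short, basic iteration (for i in object) produces:

Series: values
DataFrame: column labels

Thus, for example, iterating over a DataFrame gives you the column names: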

In [245]: df = pd.DataFrame({'col1': np.random.randn(3),
   .....:                    'col2': np.random.randn(3)}, index=['a', 'b', 'c'])
   .....: 

In [246]: for col in df:
   .....:     print(col)
   .....: 
col1
col2

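Pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.

To iterate over the rows of a DataFrame, you can use the following methods:

iterrows(): iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.
itertuples(): iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows(), and is in most cases preferable for iterating over the values of a DataFrame.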

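Warning

Iterating through Pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, and so on.
When you have a function that cannot work on the full DataFrame / Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.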
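
As a small illustration of the first suggestion, the sketch below computes the same total once with a row loop and once with column arithmetic. The column names price and qty are made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Row-by-row: works, but runs the loop in Python.
loop_total = 0.0
for row in frame.itertuples(index=False):
    loop_total += row.price * row.qty

# Vectorized: one multiplication over whole columns, then a sum.
vector_total = (frame["price"] * frame["qty"]).sum()

# Conditional logic can often be vectorized too, here with np.where.
bulk_flag = np.where(frame["qty"] >= 3, "yes", "no")

print(loop_total, vector_total)
print(bulk_flag.tolist())
```

On three rows the difference is irrelevant; the vectorized form is the one that tends to scale to large frames.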

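Warning

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

For example, in the following case setting the value has no effect: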

In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [248]: for index, row in df.iterrows():
   .....:     row['a'] = 10
   .....: 

In [249]: df
Out[249]: 
   a  b
0  1  a
1  2  b
2  3  c
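
To actually change the values, assign through the DataFrame itself rather than through the row object. A minimal sketch on the same kind of frame; the condition used to pick rows is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# Writing to the row yielded by iterrows() leaves the frame untouched here.
for _, row in frame.iterrows():
    row["a"] = 10
unchanged = frame["a"].tolist()

# Assigning with .loc writes to the frame; the mask selects the rows.
frame.loc[frame["b"] != "b", "a"] = 10
changed = frame["a"].tolist()

print(unchanged)
print(changed)
```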

#items

Consistent with the dict-like interface, items() iterates through key-value pairs:

Series: (index, scalar value) pairs
DataFrame: (column, Series) pairs

For example:

In [250]: for label, ser in df.items():
   .....:     print(label)
   .....:     print(ser)
   .....: 
a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    a
1    b
2    c
Name: b, dtype: object

#iterrows

iterrows() lets you iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row:

In [251]: for row_index, row in df.iterrows():
   .....:     print(row_index, row, sep='\n')
   .....: 
0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object

Note

Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).

For example:

In [252]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [253]: df_orig.dtypes
Out[253]: 
int        int64
float    float64
dtype: object

In [254]: row = next(df_orig.iterrows())[1]

In [255]: row
Out[255]: 
int      1.0
float    1.5
Name: 0, dtype: float64

All values in row, returned as a Series, are now upcast to floats, including the original integer value from column int:

In [256]: row['int'].dtype
Out[256]: dtype('float64')

In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')

To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally much faster than iterrows().
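
The dtype difference between the two iterators can be checked directly. The sketch uses a one-row frame with made-up column names n and ratio.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 1.5]], columns=["n", "ratio"])

# iterrows(): the row is a single Series, so the integer is upcast to float.
_, as_series = next(frame.iterrows())

# itertuples(): each field keeps the type of its own column.
as_tuple = next(frame.itertuples(index=False))

print(type(as_series["n"]).__name__, type(as_tuple.n).__name__)
print(as_tuple)
```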

For instance, a contrived way to transpose the DataFrame would be:

In [258]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [259]: print(df2)
   x  y
0  1  4
1  2  5
2  3  6

In [260]: print(df2.T)
   0  1  2
x  1  2  3
y  4  5  6

In [261]: df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [262]: print(df2_t)
   0  1  2
x  1  2  3
y  4  5  6

#itertuples

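The itertuples() method returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row's index value, while the remaining values are the row values.

For instance: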

In [263]: for row in df.itertuples():
   .....:     print(row)
   .....: 
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

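This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().

Note

Column names are renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.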
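
The renaming rule can be seen on a frame with deliberately awkward, made-up column names. The sketch also shows name=None, which asks for regular tuples explicitly.

```python
import pandas as pd

# "my col" is not a valid identifier; "_hidden" starts with an underscore.
frame = pd.DataFrame({"my col": [1], "_hidden": [2], "ok": [3]})

row = next(frame.itertuples())
fields = row._fields            # positional names replace the invalid ones

plain = next(frame.itertuples(name=None))   # regular tuple, no field names

print(fields)
print(plain)
```

If the original column names matter inside the loop, one option is to pair the plain tuples with `frame.columns` yourself.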

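#.dt accessor

Series has an accessor that succinctly returns datetime-like properties for the values of the Series, provided it is a datetime/period-like Series. The result is also a Series, indexed like the existing Series.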

# datetime
In [264]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [265]: s
Out[265]: 
0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [266]: s.dt.hour
Out[266]: 
0    9
1    9
2    9
3    9
dtype: int64

In [267]: s.dt.second
Out[267]: 
0    12
1    12
2    12
3    12
dtype: int64

In [268]: s.dt.day
Out[268]: 
0    1
1    2
2    3
3    4
dtype: int64

This enables nice expressions like this:

In [269]: s[s.dt.day == 2]
Out[269]: 
1   2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz-aware transformations:

In [270]: stz = s.dt.tz_localize('US/Eastern')

In [271]: stz
Out[271]: 
0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [272]: stz.dt.tz
Out[272]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

You can also chain these types of operations:

In [273]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[273]: 
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime(), which supports the same format as the standard strftime().

# DatetimeIndex
In [274]: s = pd.Series(pd.date_range('20130101', periods=4))

In [275]: s
Out[275]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: datetime64[ns]

In [276]: s.dt.strftime('%Y/%m/%d')
Out[276]: 
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

# PeriodIndex
In [277]: s = pd.Series(pd.period_range('20130101', periods=4))

In [278]: s
Out[278]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
dtype: period[D]

In [279]: s.dt.strftime('%Y/%m/%d')
Out[279]: 
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

The .dt accessor also works for period and timedelta dtypes.

# period
In [280]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [281]: s
Out[281]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
dtype: period[D]

In [282]: s.dt.year
Out[282]: 
0    2013
1    2013
2    2013
3    2013
dtype: int64

In [283]: s.dt.day
Out[283]: 
0    1
1    2
2    3
3    4
dtype: int64

# timedelta
In [284]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [285]: s
Out[285]: 
0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
dtype: timedelta64[ns]

In [286]: s.dt.days
Out[286]: 
0    1
1    1
2    1
3    1
dtype: int64

In [287]: s.dt.seconds
Out[287]: 
0    5
1    6
2    7
3    8
dtype: int64

In [288]: s.dt.components
Out[288]: 
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      0        0        5             0             0            0
1     1      0        0        6             0             0            0
2     1      0        0        7             0             0            0
3     1      0        0        8             0             0            0

Note

Series.dt raises a TypeError if you access it with values that are not datetime-like.
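
A defensive sketch of this situation follows. Recent pandas releases appear to raise AttributeError rather than TypeError at this point, so the example catches both; treat the exact exception type as an assumption about your installed version.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

try:
    numbers.dt
    dt_error = None
except (AttributeError, TypeError) as exc:   # type varies by pandas version
    dt_error = exc

# Converting to datetime first makes the accessor available.
converted = pd.to_datetime(pd.Series(["2013-01-01", "2013-01-02"]))
days = converted.dt.day.tolist()

print(type(dt_error).__name__)
print(days)
```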

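#Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the Series's str attribute and generally have names matching the equivalent built-in string methods. For example: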

In [289]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [290]: s.str.lower()
Out[290]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Powerful pattern-matching methods are provided as well, but note that pattern matching generally uses regular expressions by default.
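
The regular-expression default matters as soon as the pattern contains a metacharacter. A minimal sketch with made-up strings, using str.contains and its regex argument:

```python
import pandas as pd

names = pd.Series(["a.b", "ab"])

# As a regular expression, "." matches any character.
as_regex = names.str.contains(".")

# With regex=False, "." is matched literally.
as_literal = names.str.contains(".", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```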

See Vectorized String Methods for a complete description.

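#Sorting

Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

#By index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a Pandas object by its index levels.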

In [291]: df = pd.DataFrame({
   .....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   .....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   .....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   .....: 

In [292]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
   .....:                          columns=['three', 'two', 'one'])
   .....: 

In [293]: unsorted_df
Out[293]: 
      three       two       one
a       NaN -1.152244  0.562973
d -0.252916 -0.109597       NaN
c  1.273388 -0.167123  0.640382
b -0.098217  0.009797 -1.299504

# DataFrame
In [294]: unsorted_df.sort_index()
Out[294]: 
      three       two       one
a       NaN -1.152244  0.562973
b -0.098217  0.009797 -1.299504
c  1.273388 -0.167123  0.640382
d -0.252916 -0.109597       NaN

In [295]: unsorted_df.sort_index(ascending=False)
Out[295]: 
      three       two       one
d -0.252916 -0.109597       NaN
c  1.273388 -0.167123  0.640382
b -0.098217  0.009797 -1.299504
a       NaN -1.152244  0.562973

In [296]: unsorted_df.sort_index(axis=1)
Out[296]: 
        one     three       two
a  0.562973       NaN -1.152244
d       NaN -0.252916 -0.109597
c  0.640382  1.273388 -0.167123
b -1.299504 -0.098217  0.009797

# Series
In [297]: unsorted_df['three'].sort_index()
Out[297]: 
a         NaN
b   -0.098217
c    1.273388
d   -0.252916
Name: three, dtype: float64

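#By values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter of DataFrame.sort_values() specifies one or more columns to use to determine the sorted order.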

In [298]: df1 = pd.DataFrame({'one': [2, 1, 1, 1],
   .....:                     'two': [1, 3, 2, 4],
   .....:                     'three': [5, 4, 3, 2]})
   .....: 

In [299]: df1.sort_values(by='two')
Out[299]: 
   one  two  three
0    2    1      5
2    1    2      3
1    1    3      4
3    1    4      2

The by parameter can take a list of column names, for example:

In [300]: df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])
Out[300]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5
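
When sorting by several columns, ascending can be given as a list with one flag per column. The sketch below rebuilds the same small frame under the made-up name `frame`.

```python
import pandas as pd

frame = pd.DataFrame({"one": [2, 1, 1, 1],
                      "two": [1, 3, 2, 4],
                      "three": [5, 4, 3, 2]})

# "one" ascending, ties broken by "two" descending.
mixed = frame.sort_values(by=["one", "two"], ascending=[True, False])

print(mixed)
```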

These methods have special treatment of NA values via the na_position argument:

In [301]: s[2] = np.nan

In [302]: s.sort_values()
Out[302]: 
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2     NaN
5     NaN
dtype: object

In [303]: s.sort_values(na_position='first')
Out[303]: 
2     NaN
5     NaN
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: object

#By indexes and values

New in version 0.23.0.

Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level names.

# Build MultiIndex
In [304]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
   .....:                                 ('b', 2), ('b', 1), ('b', 1)])
   .....: 

In [305]: idx.names = ['first', 'second']

# Build DataFrame
In [306]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
   .....:                         index=idx)
   .....: 

In [307]: df_multi
Out[307]: 
              A
first second   
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1

Sort by second (index) and A (column):

In [308]: df_multi.sort_values(by=['second', 'A'])
Out[308]: 
              A
first second   
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5

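Note

If a string matches both a column name and an index level name, a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.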

#searchsorted

Series has the searchsorted() method, which works in a similar fashion to numpy.ndarray.searchsorted().

In [309]: ser = pd.Series([1, 2, 3])

In [310]: ser.searchsorted([0, 3])
Out[310]: array([0, 2])

In [311]: ser.searchsorted([0, 4])
Out[311]: array([0, 3])

In [312]: ser.searchsorted([1, 3], side='right')
Out[312]: array([1, 3])

In [313]: ser.searchsorted([1, 3], side='left')
Out[313]: array([0, 2])

In [314]: ser = pd.Series([3, 1, 2])

In [315]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[315]: array([0, 2])
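What searchsorted() returns is an insertion position: the index at which a value could be inserted while keeping the data sorted. The sketch below illustrates this, along with the difference between side='left' and side='right'; the Series and the values 20 and 25 are made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20, 20, 30])  # the data must already be sorted

# side='left' gives the first valid insertion position, side='right' the last;
# the two differ only when the value already occurs in the data
left = ser.searchsorted(20, side='left')
right = ser.searchsorted(20, side='right')

# The gap between the two positions is the number of occurrences of 20
count = right - left

# Inserting at the returned position keeps the values sorted
merged = np.insert(ser.to_numpy(), ser.searchsorted(25), 25)
```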

#Smallest and largest values

Series has the nsmallest() and nlargest() methods, which return the N smallest or largest values. For a large Series this can be much faster than sorting the entire Series and then calling head(n) on the result.

In [316]: s = pd.Series(np.random.permutation(10))

In [317]: s
Out[317]: 
0    2
1    0
2    3
3    7
4    1
5    5
6    9
7    6
8    8
9    4
dtype: int64

In [318]: s.sort_values()
Out[318]: 
1    0
4    1
0    2
2    3
9    4
5    5
7    6
3    7
8    8
6    9
dtype: int64

In [319]: s.nsmallest(3)
Out[319]: 
1    0
4    1
0    2
dtype: int64

In [320]: s.nlargest(3)
Out[320]: 
6    9
8    8
3    7
dtype: int64
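The equivalence with "sort, then head(n)" can be checked directly. This is a minimal sketch that assumes all values are distinct, so ties cannot change the order; the seed and the size of 1000 are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.permutation(1000))  # distinct values 0..999 in random order

# Partial selection of the three largest / smallest values
top_fast = s.nlargest(3)
bottom_fast = s.nsmallest(3)

# Same result obtained by sorting everything first
top_slow = s.sort_values(ascending=False).head(3)
bottom_slow = s.sort_values().head(3)
```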

DataFrame also has the nlargest and nsmallest methods.

In [321]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
   .....:                    'b': list('abdceff'),
   .....:                    'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
   .....: 

In [322]: df.nlargest(3, 'a')
Out[322]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN

In [323]: df.nlargest(5, ['a', 'c'])
Out[323]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN
2   1  d  4.0
6  -1  f  4.0

In [324]: df.nsmallest(3, 'a')
Out[324]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0

In [325]: df.nsmallest(5, ['a', 'c'])
Out[325]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0
2  1  d  4.0
4  8  e  NaN

#Sorting by a MultiIndex column

When the columns are a MultiIndex, the sort key must be explicit: pass by a tuple that specifies every level.

In [326]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),
   .....:                                          ('a', 'two'),
   .....:                                          ('b', 'three')])
   .....: 

In [327]: df1.sort_values(by=('a', 'two'))
Out[327]: 
    a         b
  one two three
0   2   1     5
2   1   2     3
1   1   3     4
3   1   4     2

#Copying

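Calling copy() on a Pandas object copies the underlying data (but not the axis indexes, since they are immutable) and returns a new object. Note that copying is seldom necessary. For example, there are only a handful of ways to alter a DataFrame in place: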

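Inserting, deleting, or modifying a column
Assigning to the index or columns attributes
For homogeneous data, modifying the values directly through the values attribute or advanced indexing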

To be clear, Pandas methods do not modify your data as a side effect. Almost every method returns a new object and leaves the original untouched. If the original data changes, it is because you changed it explicitly.
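A short sketch of these rules, using a made-up one-column DataFrame: a copy is independent of the original, a method call returns a new object, and column assignment is an explicit in-place change.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# An independent copy: changing it does not affect the original
backup = df.copy()
backup.loc[0, 'a'] = 100

# sort_values() returns a new object and leaves df as it was
ordered = df.sort_values(by='a', ascending=False)

# Assigning a column is one of the explicit in-place modifications
df['b'] = df['a'] * 2
```

After this runs, df still holds 1 in its first row and keeps its original row order, and only df has the new column b.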

#dtypes

For the most part, Pandas uses NumPy arrays and dtypes for a Series or for the individual columns of a DataFrame. NumPy supports float, int, bool, timedelta64[ns], and datetime64[ns]. Note that NumPy does not support timezone-aware datetimes.

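Pandas and third-party libraries extend NumPy's type system. This section covers only the extensions that Pandas makes internally. See Extension types for how to write an extension type that works with Pandas, and Extension data types for the extension types provided by third-party libraries.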

The following table lists the Pandas extension types. See the documentation section listed for each type for details.

Kind of data	Data type	Scalar	Array	Documentation
tz-aware datetime	`DatetimeTZDtype`	`Timestamp`	`arrays.DatetimeArray`	Time zone handling
Categorical	`CategoricalDtype`	(none)	`Categorical`	Categorical data
period (time spans)	`PeriodDtype`	`Period`	`arrays.PeriodArray`	Time span representation
sparse	`SparseDtype`	(none)	`arrays.SparseArray`	Sparse data structures
intervals	`IntervalDtype`	`Interval`	`arrays.IntervalArray`	IntervalIndex
nullable integer	`Int64Dtype`, …	(none)	`arrays.IntegerArray`	Nullable integer data type

Pandas uses the object dtype to store strings.

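Although the object dtype can hold arbitrary objects, this should be avoided as far as possible, both for performance and for interoperability with other libraries and methods. See Object conversion.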

The convenient dtypes attribute of a DataFrame returns a Series with the data type of each column.

In [328]: dft = pd.DataFrame({'A': np.random.rand(3),
   .....:                     'B': 1,
   .....:                     'C': 'foo',
   .....:                     'D': pd.Timestamp('20010102'),
   .....:                     'E': pd.Series([1.0] * 3).astype('float32'),
   .....:                     'F': False,
   .....:                     'G': pd.Series([1] * 3, dtype='int8')})
   .....: 

In [329]: dft
Out[329]: 
          A  B    C          D    E      F  G
0  0.035962  1  foo 2001-01-02  1.0  False  1
1  0.701379  1  foo 2001-01-02  1.0  False  1
2  0.281885  1  foo 2001-01-02  1.0  False  1

In [330]: dft.dtypes
Out[330]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

To see the data type of a Series, use the dtype attribute.

In [331]: dft['A'].dtype
Out[331]: dtype('float64')

If a single column of a Pandas object holds data of several types, the dtype of the column is chosen to accommodate all of them, with object being the most general.

# The integers are coerced to floats
In [332]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[332]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

# The string forces the Series to the ``object`` dtype
In [333]: pd.Series([1, 2, 3, 6., 'foo'])
Out[333]: 
0      1
1      2
2      3
3      6
4    foo
dtype: object

DataFrame.dtypes.value_counts() returns the number of columns of each dtype in a DataFrame.

In [334]: dft.dtypes.value_counts()
Out[334]: 
float32           1
object            1
bool              1
int8              1
float64           1
datetime64[ns]    1
int64             1
dtype: int64

Several numeric dtypes can coexist in a DataFrame. If a dtype is passed, whether directly through the dtype keyword or through an ndarray or Series, it is preserved in DataFrame operations. Furthermore, different numeric dtypes are not combined. For example:

In [335]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [336]: df1
Out[336]: 
          A
0  0.224364
1  1.890546
2  0.182879
3  0.787847
4 -0.188449
5  0.667715
6 -0.011736
7 -0.399073

In [337]: df1.dtypes
Out[337]: 
A    float32
dtype: object

In [338]: df2 = pd.DataFrame({'A': pd.Series(np.random.randn(8), dtype='float16'),
   .....:                     'B': pd.Series(np.random.randn(8)),
   .....:                     'C': pd.Series(np.array(np.random.randn(8),
   .....:                                             dtype='uint8'))})
   .....: 

In [339]: df2
Out[339]: 
          A         B    C
0  0.823242  0.256090    0
1  1.607422  1.426469    0
2 -0.333740 -0.416203  255
3 -0.063477  1.139976    0
4 -1.014648 -1.193477    0
5  0.678711  0.096706    0
6 -0.040863 -1.956850    1
7 -0.357422 -0.714337    0

In [340]: df2.dtypes
Out[340]: 
A    float16
B    float64
C      uint8
dtype: object

#Defaults

By default, integers are int64 and floats are float64, regardless of platform (32-bit or 64-bit). The following all result in int64:

In [341]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[341]: 
a    int64
dtype: object

In [342]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[342]: 
a    int64
dtype: object

In [343]: pd.DataFrame({'a': 1}, index=list(range(2))).dtypes
Out[343]: 
a    int64
dtype: object

Note that NumPy chooses platform-dependent types when it creates arrays. The following results in int32 on a 32-bit platform.

In [344]: frame = pd.DataFrame(np.array([1, 2]))
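One way to avoid this platform dependence, sketched here as a suggestion rather than something the original text prescribes, is to state the dtype explicitly when building the NumPy array or the DataFrame.

```python
import numpy as np
import pandas as pd

# Fix the dtype on the NumPy side ...
frame_a = pd.DataFrame(np.array([1, 2], dtype='int64'))

# ... or let the DataFrame constructor cast the platform-dependent array
frame_b = pd.DataFrame(np.array([1, 2]), dtype='int64')
```

Both frames hold an int64 column regardless of what integer type NumPy would have picked by default.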

#Upcasting

Types can be upcast when they are combined with other types, meaning they are promoted from the current type to a more general one (for example, int to float).

In [345]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [346]: df3
Out[346]: 
          A         B      C
0  1.047606  0.256090    0.0
1  3.497968  1.426469    0.0
2 -0.150862 -0.416203  255.0
3  0.724370  1.139976    0.0
4 -1.203098 -1.193477    0.0
5  1.346426  0.096706    0.0
6 -0.052599 -1.956850    1.0
7 -0.756495 -0.714337    0.0

In [347]: df3.dtypes
Out[347]: 
A    float32
B    float64
C    float64
dtype: object

DataFrame.to_numpy() returns the lowest common denominator of the dtypes, that is, the dtype that can accommodate every type in the resulting homogeneous NumPy array. This can force some upcasting.

In [348]: df3.to_numpy().dtype
Out[348]: dtype('float64')
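For purely numeric columns, the resulting dtype can be anticipated with NumPy's own promotion rules. The sketch below uses np.result_type for that on a made-up two-column DataFrame; treating np.result_type as a predictor of the to_numpy() dtype is an assumption that holds for this float32/int64 example, not a general guarantee.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': pd.Series([1.5, 2.5], dtype='float32'),
    'B': pd.Series([1, 2], dtype='int64'),
})

# NumPy's promotion rules applied to the column dtypes
expected = np.result_type(*df.dtypes)
arr = df.to_numpy()

# Once a string column is added, object is the only dtype that fits everything
df['C'] = ['x', 'y']
mixed = df.to_numpy()
```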

#astype

The astype() method explicitly converts one dtype to another. By default it returns a copy, even when the dtype is unchanged; pass copy=False to change this behavior. In addition, an invalid astype operation raises an exception.

Upcasting generally follows the NumPy rules. When an operation involves two different dtypes, the more general one is used for the result.

In [349]: df3
Out[349]: 
          A         B      C
0  1.047606  0.256090    0.0
1  3.497968  1.426469    0.0
2 -0.150862 -0.416203  255.0
3  0.724370  1.139976    0.0
4 -1.203098 -1.193477    0.0
5  1.346426  0.096706    0.0
6 -0.052599 -1.956850    1.0
7 -0.756495 -0.714337    0.0

In [350]: df3.dtypes
Out[350]: 
A    float32
B    float64
C    float64
dtype: object

# Convert the dtypes
In [351]: df3.astype('float32').dtypes
Out[351]: 
A    float32
B    float32
C    float32
dtype: object

Use astype() to convert one or more columns to a specified type.

In [352]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [353]: dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

In [354]: dft
Out[354]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9

In [355]: dft.dtypes
Out[355]: 
a    uint8
b    uint8
c    int64
dtype: object

New in version 0.19.0.

Pass a dict to astype() to convert specific columns to specific dtypes.

In [356]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [357]: dft1 = dft1.astype({'a': bool, 'c': np.float64})

In [358]: dft1
Out[358]: 
       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0

In [359]: dft1.dtypes
Out[359]: 
a       bool
b      int64
c    float64
dtype: object

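Note

Converting a subset of columns to a specified type with astype() and loc() results in upcasting.

loc() tries to fit what is being assigned into the current dtypes, while [] overwrites them, taking the dtype from the right-hand side. The following code therefore produces an unintended result: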

In [360]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [361]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[361]: 
a    uint8
b    uint8
dtype: object

In [362]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [363]: dft.dtypes
Out[363]: 
a    int64
b    int64
c    int64
dtype: object
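When the goal is to actually change the dtype of a few columns, two approaches avoid the loc() behavior. Both appear earlier in this section (dict-based astype() and assignment with []); the sketch simply applies them to the same DataFrame.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Option 1: convert through a dict; the result is a new DataFrame
converted = dft.astype({'a': np.uint8, 'b': np.uint8})

# Option 2: assign with [] so the dtype of the right-hand side is taken over
dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)
```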

#Object conversion

Pandas offers several functions that try to force conversion from the object dtype to other types. When the data is already of the correct type but is stored in an object array, DataFrame.infer_objects() and Series.infer_objects() soft-convert it to the correct type.

In [364]: import datetime

In [365]: df = pd.DataFrame([[1, 2],
   .....:                    ['a', 'b'],
   .....:                    [datetime.datetime(2016, 3, 2),
   .....:                     datetime.datetime(2016, 3, 2)]])
   .....: 

In [366]: df = df.T

In [367]: df
Out[367]: 
   0  1          2
0  1  a 2016-03-02
1  2  b 2016-03-02

In [368]: df.dtypes
Out[368]: 
0            object
1            object
2    datetime64[ns]
dtype: object

Because the data was transposed, the original columns were stored as object; infer_objects corrects this.

In [369]: df.infer_objects().dtypes
Out[369]: 
0             int64
1            object
2    datetime64[ns]
dtype: object
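The difference between a soft and a hard conversion can be seen on two made-up Series: infer_objects() only recovers a dtype that the stored values already have, while numbers written as text need to_numeric().

```python
import pandas as pd

# Integers that ended up in an object column
stored = pd.Series([1, 2, 3], dtype='object')

# Soft conversion: the values are already integers, so int64 is recovered
soft = stored.infer_objects()

# Numbers written as text are not turned into numbers by infer_objects ...
text = pd.Series(['1', '2', '3'], dtype='object')
untouched = text.infer_objects()

# ... they need a hard conversion instead
hard = pd.to_numeric(text)
```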

The following functions work on one-dimensional arrays and scalars and perform a hard conversion of objects to a specified type.

to_numeric() (conversion to numeric dtypes)

In [370]: m = ['1.1', 2, 3]

In [371]: pd.to_numeric(m)
Out[371]: array([1.1, 2. , 3. ])

to_datetime() (conversion to datetime objects)

In [372]: import datetime

In [373]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

In [374]: pd.to_datetime(m)
Out[374]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta() (conversion to timedelta objects)

In [375]: m = ['5us', pd.Timedelta('1day')]

In [376]: pd.to_timedelta(m)
Out[376]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, pass the errors argument, which specifies how Pandas should handle elements that cannot be converted to the desired dtype or object. The default is errors='raise', meaning that any error encountered during conversion is raised. With errors='coerce', Pandas ignores the errors and converts the problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This is useful when reading data that is mostly numeric or datetime but occasionally has non-conforming elements mixed in, which you want to represent as missing:

In [377]: import datetime

In [378]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [379]: pd.to_datetime(m, errors='coerce')
Out[379]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [380]: m = ['apple', 2, 3]

In [381]: pd.to_numeric(m, errors='coerce')
Out[381]: array([nan,  2.,  3.])

In [382]: m = ['apple', pd.Timedelta('1day')]

In [383]: pd.to_timedelta(m, errors='coerce')
Out[383]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
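Since errors='coerce' silently turns bad entries into missing values, it is often worth checking which inputs were affected. A minimal sketch, with a made-up `raw` Series:

```python
import pandas as pd

raw = pd.Series(['1.5', '2', 'n/a', '4'])

# Unparseable entries become NaN instead of raising an error
numbers = pd.to_numeric(raw, errors='coerce')

# Entries that were present in the input but are missing after conversion
bad = raw[numbers.isna() & raw.notna()]
```

Here `bad` contains the single entry 'n/a', which identifies the row that could not be converted.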

The errors parameter has a third option, errors='ignore', which ignores errors during conversion and returns the passed-in data unchanged:

In [384]: import datetime

In [385]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [386]: pd.to_datetime(m, errors='ignore')
Out[386]: Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [387]: m = ['apple', 2, 3]

In [388]: pd.to_numeric(m, errors='ignore')
Out[388]: array(['apple', 2, 3], dtype=object)

In [389]: m = ['apple', pd.Timedelta('1day')]

In [390]: pd.to_timedelta(m, errors='ignore')
Out[390]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() has another parameter, downcast, which converts numeric data to a smaller dtype to save memory:

In [391]: m = ['1', 2, 3]

In [392]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
Out[392]: array([1, 2, 3], dtype=int8)

In [393]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
Out[393]: array([1, 2, 3], dtype=int8)

In [394]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
Out[394]: array([1, 2, 3], dtype=uint8)

In [395]: pd.to_numeric(m, downcast='float')     # smallest float dtype
Out[395]: array([1., 2., 3.], dtype=float32)
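The memory saving can be measured with the nbytes attribute. In this sketch the range 0..999 is an arbitrary choice; it does not fit in int8, so int16 is the smallest signed integer type that can hold it.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1000), dtype='int64')

# downcast picks the smallest signed integer dtype that can hold every value
small = pd.to_numeric(values, downcast='integer')

before = values.nbytes  # 8 bytes per value
after = small.nbytes    # 2 bytes per value
```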

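These methods apply only to one-dimensional arrays, lists, or scalars; they cannot be used directly on multi-dimensional objects such as a DataFrame. However, apply() can be used to apply the function to each column: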

In [396]: import datetime

In [397]: df = pd.DataFrame([
   .....:     ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
   .....: 

In [398]: df
Out[398]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00

In [399]: df.apply(pd.to_datetime)
Out[399]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [400]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [401]: df
Out[401]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [402]: df.apply(pd.to_numeric)
Out[402]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [403]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [404]: df
Out[404]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00

In [405]: df.apply(pd.to_timedelta)
Out[405]: 
                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days

#Gotchas

Selection operations on integer data can easily upcast the data to floating point. The dtype of the input data is preserved only when no NaNs are introduced. See also Support for integer NA.

In [406]: dfi = df3.astype('int32')

In [407]: dfi['E'] = 1

In [408]: dfi
Out[408]: 
   A  B    C  E
0  1  0    0  1
1  3  1    0  1
2  0  0  255  1
3  0  1    0  1
4 -1 -1    0  1
5  1  0    0  1
6  0 -1    1  1
7  0  0    0  1

In [409]: dfi.dtypes
Out[409]: 
A    int32
B    int32
C    int32
E    int64
dtype: object

In [410]: casted = dfi[dfi > 0]

In [411]: casted
Out[411]: 
     A    B      C  E
0  1.0  NaN    NaN  1
1  3.0  1.0    NaN  1
2  NaN  NaN  255.0  1
3  NaN  1.0    NaN  1
4  NaN  NaN    NaN  1
5  1.0  NaN    NaN  1
6  NaN  NaN    1.0  1
7  NaN  NaN    NaN  1

In [412]: casted.dtypes
Out[412]: 
A    float64
B    float64
C    float64
E      int64
dtype: object

Float dtypes, by contrast, are unchanged.

In [413]: dfa = df3.copy()

In [414]: dfa['A'] = dfa['A'].astype('float32')

In [415]: dfa.dtypes
Out[415]: 
A    float32
B    float64
C    float64
dtype: object

In [416]: casted = dfa[df2 > 0]

In [417]: casted
Out[417]: 
          A         B      C
0  1.047606  0.256090    NaN
1  3.497968  1.426469    NaN
2       NaN       NaN  255.0
3       NaN  1.139976    NaN
4       NaN       NaN    NaN
5  1.346426  0.096706    NaN
6       NaN       NaN    1.0
7       NaN       NaN    NaN

In [418]: casted.dtypes
Out[418]: 
A    float32
B    float64
C    float64
dtype: object
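The nullable integer type listed in the extension type table above offers a way around the integer-to-float upcast. A minimal sketch on a made-up Series, using where() to introduce a missing value:

```python
import pandas as pd

ints = pd.Series([1, -2, 3], dtype='int64')

# Plain NumPy integers: the introduced NaN forces float64
as_float = ints.where(ints > 0)

# Nullable Int64: the missing value is stored as pd.NA and the dtype is kept
as_nullable = ints.astype('Int64').where(ints > 0)
```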

#Selecting columns based on dtype

The select_dtypes() method selects columns based on their dtype.

First, create a DataFrame with several different dtypes:

In [419]: df = pd.DataFrame({'string': list('abc'),
   .....:                    'int64': list(range(1, 4)),
   .....:                    'uint8': np.arange(3, 6).astype('u1'),
   .....:                    'float64': np.arange(4.0, 7.0),
   .....:                    'bool1': [True, False, True],
   .....:                    'bool2': [False, True, False],
   .....:                    'dates': pd.date_range('now', periods=3),
   .....:                    'category': pd.Series(list("ABC")).astype('category')})
   .....: 

In [420]: df['tdeltas'] = df.dates.diff()

In [421]: df['uint64'] = np.arange(3, 6).astype('u8')

In [422]: df['other_dates'] = pd.date_range('20130101', periods=3)

In [423]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [424]: df
Out[424]: 
  string  int64  uint8  float64  bool1  bool2                      dates category tdeltas  uint64 other_dates            tz_aware_dates
0      a      1      3      4.0   True  False 2019-08-22 15:49:01.870038        A     NaT       3  2013-01-01 2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False   True 2019-08-23 15:49:01.870038        B  1 days       4  2013-01-02 2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  False 2019-08-24 15:49:01.870038        C  1 days       5  2013-01-03 2013-01-03 00:00:00-05:00

The dtypes of this DataFrame:

In [425]: df.dtypes
Out[425]: 
string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters, include and exclude, which express "give me the columns with these dtypes" (include) and "give me the columns without these dtypes" (exclude).

For example, to select the bool columns:

In [426]: df.select_dtypes(include=[bool])
Out[426]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

The method also accepts the name of a NumPy dtype:

In [427]: df.select_dtypes(include=['bool'])
Out[427]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

select_dtypes() also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [428]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])
Out[428]: 
   int64  float64  bool1  bool2 tdeltas
0      1      4.0   True  False     NaT
1      2      5.0  False   True  1 days
2      3      6.0   True  False  1 days
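The way include and exclude combine can be checked on a smaller DataFrame. The column names below are made up for illustration; note that 'number' on its own does not cover bool columns, which is why the example above lists 'bool' separately.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [1, 2],
    'u': np.array([1, 2], dtype='uint8'),
    'f': [1.0, 2.0],
    'flag': [True, False],
})

# 'number' covers signed and unsigned integers and floats, but not bool
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# exclude removes a subset of what include selected
no_unsigned = df.select_dtypes(include=['number', 'bool'],
                               exclude=['unsignedinteger']).columns.tolist()
```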

To select string columns you must use the object dtype:

In [429]: df.select_dtypes(include=['object'])
Out[429]: 
  string
0      a
1      b
2      c

To see all the child dtypes of a generic dtype such as numpy.number, you can define a function that returns a tree of child dtypes:

In [430]: def subdtypes(dtype):
   .....:     subs = dtype.__subclasses__()
   .....:     if not subs:
   .....:         return dtype
   .....:     return [dtype, [subdtypes(dt) for dt in subs]]
   .....:

All NumPy dtypes are subclasses of numpy.generic:

In [431]: subdtypes(np.generic)
Out[431]: 
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

Note

Pandas also supports the category and datetime64[ns, tz] types. They are not integrated into the NumPy hierarchy, so they do not show up with the function above.
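These two types can still be selected with select_dtypes() by passing their names as strings. The sketch below assumes the 'category' and 'datetimetz' selectors described in the select_dtypes() documentation; the DataFrame and its column names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'cat': pd.Series(list('ABC')).astype('category'),
    'naive': pd.date_range('20130101', periods=3),
    'aware': pd.date_range('20130101', periods=3, tz='UTC'),
})

# Pandas-specific dtypes are selected by name
cats = df.select_dtypes(include=['category']).columns.tolist()
aware = df.select_dtypes(include=['datetimetz']).columns.tolist()

# The NumPy datetime64 selector matches only the timezone-naive column
naive = df.select_dtypes(include=['datetime64']).columns.tolist()
```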
