[Pandas] 02 - Tutorial on NumPy

Date: 2019-11-18

Tags: pandas, tutorial, numpy

Frequently Tested Topics

Related reference: NumPy tutorial

1. Matrix

    Initialization

mat = np.array([0, 0.5, 1.0, 1.5, 2.0])
mat = np.random.standard_normal((10, 10))
mat = np.zeros((2, 3, 4), dtype='i', order='C')
...

Initialization with a custom mixed (structured) dtype
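A minimal sketch of what initialization with a custom mixed dtype can look like, assuming a structured dtype. The field names `name` and `age`, the variable names, and the records are illustrative assumptions, not taken from the original notes:

```python
import numpy as np

# Hypothetical record layout: a string of up to 10 characters and a 32-bit integer.
person_dtype = np.dtype([("name", "U10"), ("age", "i4")])

people = np.array([("raju", 21), ("anil", 25)], dtype=person_dtype)

# Fields are accessed by name and behave like ordinary arrays.
names = people["name"]
ages = people["age"]
print(names, ages, ages.mean())
```

Each field can then be used on its own, for example as a sort key.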

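Array statistics

[Pandas] 01 - A guy based on NumPy

Basic Vectorization

When the array contains nan elements, these statistics break down (the result is nan); exclude the nan values before computing.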
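As a small illustration of the point about nan, the sketch below compares a plain mean with two ways of excluding nan. The sample values are arbitrary, and `np.nanmean` is shown as one possible alternative rather than the approach these notes prescribe:

```python
import numpy as np

values = np.array([np.nan, 0.0, 1.0, 2.0, np.nan])

plain_mean = values.mean()                      # nan: the missing values propagate
masked_mean = values[~np.isnan(values)].mean()  # drop nan with a boolean mask first
nan_aware_mean = np.nanmean(values)             # NumPy's nan-ignoring variant
print(plain_mean, masked_mean, nan_aware_mean)
```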

Rounding a matrix
        Floor (round down)
        Keep only the integer part (truncate)
        Round to nearest
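The three rounding modes listed above plausibly map onto `np.floor`, `np.trunc`, and `np.round`; that mapping is an assumption, since the notes do not name the functions. A minimal sketch with arbitrary sample values:

```python
import numpy as np

m = np.array([[-1.7, -0.2], [0.2, 1.7]])

floored = np.floor(m)    # largest integer <= each element
truncated = np.trunc(m)  # drop the fractional part (round toward zero)
rounded = np.round(m)    # round to the nearest integer
print(floored, truncated, rounded, sep="\n")
```

Floor and truncation differ only for negative values. Note also that `np.round` rounds exact halves to the nearest even integer, so `np.round(2.5)` is 2.0.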

Matrix size

import sys
sys.getsizeof(a)

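2. Iteration

    Matrix indexing
            The index expression specifies a range
            The step component specifies the "interval" within that range
    Matrix traversal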

for x in np.nditer(a, order='F'):      # Fortran order, i.e. column-major
    print(x)
for x in np.nditer(a.T, order='C'):    # C order, i.e. row-major (here applied to the transpose)
    print(x)

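3. Reshaping

    Flattening
            Full flattening: ravel
            Custom reshaping: reshape, resize
    Transpose
    Stacking
            Joining whole arrays: vstack, hstack
            Pairing one element from each: column_stack, row_stack
            Adding a dimension to each element
    Splitting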
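The stacking items in the outline above can be sketched as follows. The arrays and variable names are illustrative, and `np.newaxis` is assumed as one way to add a dimension to each element:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

v = np.vstack((a, b))        # shape (2, 3): b stacked below a
h = np.hstack((a, b))        # shape (6,): b appended after a
c = np.column_stack((a, b))  # shape (3, 2): each row is the pair (a[i], b[i])
e = a[:, np.newaxis]         # shape (3, 1): every element gains a dimension
print(v.shape, h.shape, c.shape, e.shape)
```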

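4. Copying matrices

    Reference, not a copy
    View (mapped onto the same data)
    Deep copy

5. Statistical sampling

    Normal distribution
    Other distributions

Common Difficult Points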

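1. Missing types

The missing-data types are mainly None and np.nan.

(1) np.nan is a float;

(2) None is of type NoneType.

(a) In an ndarray, np.nan is displayed as nan, and any computation involving it yields nan.

None is displayed as None and forces the array to dtype object; computing with it raises an error.

So ordinary ndarray computations cannot be applied directly to data with missing values.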
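The behaviour described in (a) can be checked with a short sketch. The sample arrays and variable names are illustrative:

```python
import numpy as np

with_nan = np.array([1.0, np.nan, 3.0])
with_none = np.array([1.0, None, 3.0])

print(type(np.nan), with_nan.dtype)  # np.nan is a float, so the array stays float64
print(with_none.dtype)               # None forces dtype object

nan_total = with_nan.sum()           # nan propagates through the sum

try:
    with_none.sum()
    none_error = None
except TypeError as exc:             # adding a float and None is not defined
    none_error = exc
print(nan_total, repr(none_error))
```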

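(b) In a numeric Series, both are displayed as NaN and both can be treated as np.nan.

np.sum() still returns a result, because the NaN entries are skipped (in effect they count as 0.0).

s1 + 10 yields NaN wherever the value is missing.

For addition, s1.add(other, fill_value=0) sets the default for missing values to 0.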
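A minimal sketch of the Series behaviour in (b), assuming pandas is installed. The Series `s1` here is a made-up example, not the one from the original post:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, None, np.nan, 4.0])  # both None and np.nan become NaN

total = np.sum(s1)                 # missing values are skipped: 1.0 + 4.0
plain = s1 + 10                    # missing values stay NaN
filled = s1.add(10, fill_value=0)  # missing values are treated as 0 before adding
print(total, plain.tolist(), filled.tolist())
```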

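2. Copies and views

ndarray.view(): shares the same underlying memory, but the descriptive metadata (for example the shape) is separate.

ndarray.copy(): a deep copy.

Views typically arise when:

- A NumPy slicing operation returns a view of the original data.
- Calling ndarray.view() creates a view.

Copies typically arise when:

- Slicing a Python sequence, or calling copy.deepcopy().
- Calling ndarray.copy() creates a copy.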
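The cases listed above can be verified with `np.shares_memory`. This is a sketch with illustrative variable names:

```python
import numpy as np

a = np.arange(6)

alias = a                  # plain assignment: the same object, nothing is copied
sliced = a[1:4]            # NumPy slice: a view onto a's buffer
viewed = a.view()          # explicit view: new array object, shared data
reshaped = a.reshape(2, 3) # a view with its own shape metadata
copied = a.copy()          # independent copy of the data

sliced[0] = 100            # writes through to a (and to every view of a)
copied[0] = -1             # leaves a untouched

py_list = [0, 1, 2]
py_slice = py_list[:]      # slicing a Python list creates a new list
py_slice[0] = 99
print(a, viewed, copied, py_list)
```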

3. Sampling space

Companion topic: the sampling space for random numbers.

import numpy as np
 
a = np.linspace(10, 20,  5, endpoint =  False)  
print(a)

# The default base is 10
a = np.logspace(1.0,  2.0, num =  10)  
print (a)

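4. Matrix library (Matrix)

NumPy includes a matrix library, numpy.matlib; the functions in this module return a matrix object rather than an ndarray.

Ref: NumPy tutorial: matrix and its operations

NumPy's matrix type is the counterpart of matrices in MATLAB.

NumPy also provides linear algebra routines. dot, vdot, inner, and matmul are top-level NumPy functions, while det, solve, and inv live in numpy.linalg. A summary:

Function	Description
`dot`	Dot product of two arrays; for 2-D arrays this is matrix multiplication
`vdot`	Dot product of two vectors
`inner`	Inner product of two arrays (a sum product over the last axes)
`matmul`	Matrix product of two arrays
`linalg.det`	Determinant of an array
`linalg.solve`	Solves a linear matrix equation
`linalg.inv`	Computes the multiplicative inverse of a matrix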
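A short sketch of the numpy.linalg routines from the table, using an arbitrary 2x2 system chosen so the results are easy to check by hand:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

det_A = np.linalg.det(A)   # 3*2 - 1*1 = 5
x = np.linalg.solve(A, b)  # solves A @ x = b, here x = [2, 3]
A_inv = np.linalg.inv(A)

print(det_A, x)
print(np.matmul(A, A_inv))  # approximately the identity matrix
print(np.vdot(b, b))        # 9*9 + 8*8 = 145
```

For solving a linear system, `solve` is generally preferable to multiplying by `inv(A)`.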

Difference between the dot product and the inner product: difference between numpy dot() and inner()

In [103]: a=np.array([[1,2],[3,4]])                                             

In [104]: b=np.array([[11,12],[13,14]])                                         

In [105]: np.dot(a,b) # matrix multiplication: 1*11+2*13 = 37
Out[105]: 
array([[37, 40],
       [85, 92]])

In [106]: np.inner(a,b)  # 1*11+2*12 = 35, 1*13+2*14 = 41
Out[106]: 
array([[35, 41],
       [81, 95]])

A closer look at inner: Ordinary inner product of vectors for 1-D arrays (without complex conjugation), in higher dimensions a sum product over the last axes.

In [118]: a=np.array([[1,2],[3,4],[5,6]])                                       

In [119]: b=np.array([[7,8],[9,10],[11,12]])                                    

In [120]: a                                                                     
Out[120]: 
array([[1, 2],
       [3, 4],
       [5, 6]])

In [121]: b                                                                     
Out[121]: 
array([[ 7,  8],
       [ 9, 10],
       [11, 12]])

In [122]: np.inner(a,b)                                                         
Out[122]: 
array([[ 23,  29,  35],
       [ 53,  67,  81],
       [ 83, 105, 127]])

5. Better indexing

Boolean indexing

Similar to filter.

import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
print('Our array is:')
print(x)
print('\n')

# Now print the elements greater than 5
print('Elements greater than 5:')
print(x[x > 5])

# Each element is replaced by True/False
print(x > 5)

Output:

Our array is:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]


Elements greater than 5:
[ 6  7  8  9 10 11]
[[False False False]
 [False False False]
 [ True  True  True]
 [ True  True  True]]

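Fancy indexing

(a) Passing an array of indices in a given order

import numpy as np

x = np.arange(32).reshape((8, 4))
print(x[[4, 2, 1, 7]])

(b) Passing multiple index arrays (a two-stage selection: filter the rows first, then the columns)

import numpy as np

x = np.arange(32).reshape((8, 4))
print(x[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])])

This is roughly equivalent to x[[1,5,7,2]][:, [0,3,1,2]], except that it happens in a single indexing step rather than two sequential ones.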
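To make the comparison concrete, the sketch below contrasts `np.ix_` with a two-step selection and with plain paired fancy indexing. The variable names are illustrative:

```python
import numpy as np

x = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

one_step = x[np.ix_(rows, cols)]  # rows and columns selected together as a grid
two_step = x[rows][:, cols]       # select the rows, then reorder the columns
paired = x[rows, cols]            # not a grid: picks (1,0), (5,3), (7,1), (2,2)
print(one_step)
print(paired)
```

Without `np.ix_`, passing two index lists pairs them element by element instead of forming a grid.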

Practical use

In [193]: array_3                                                               
Out[193]: array([nan,  0.,  1.,  2., nan])

In [194]: ix = ~np.isnan(array_3)                                               

In [195]: array_3[ix]                                                           
Out[195]: array([0., 1., 2.])

In [196]: array_3[ix].mean()                                                    
Out[196]: 1.0

##########################################
# For comparison: the version below cannot be computed because of nan
##########################################

In [197]: array_3.mean()                                                        
Out[197]: nan

6. Broadcasting

Broadcasting in computation

As a rule, avoid relying on this kind of "implicit conversion".

import numpy as np 
 
a = np.array([[ 0, 0, 0],
              [10,10,10],
              [20,20,20],
              [30,30,30]])
b = np.array([1,2,3])
print(a + b)
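One way to avoid relying on the implicit behaviour is to spell out the broadcast axis. The sketch below is a suggestion under that assumption, not something the original notes prescribe; it shows three equivalent forms:

```python
import numpy as np

a = np.array([[0, 0, 0], [10, 10, 10], [20, 20, 20], [30, 30, 30]])
b = np.array([1, 2, 3])

implicit = a + b                 # b is broadcast across the 4 rows
explicit = a + b[np.newaxis, :]  # same, with the row axis written out
tiled = a + np.tile(b, (4, 1))   # same values, but b is physically repeated first
print(implicit)
```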

Broadcasting in iteration

import numpy as np

a = np.arange(0, 60, 5)
a = a.reshape(3, 4)
print('The first array is:')
print(a)

print('\n')
print('The second array is:')
b = np.array([1, 2, 3, 4], dtype=int)
print(b)

print('\n')
print('The modified array is:')
for x, y in np.nditer([a, b]):
    print("%d:%d" % (x, y), end=", ")

Output:

The first array is:
[[ 0  5 10 15]
 [20 25 30 35]
 [40 45 50 55]]


The second array is:
[1 2 3 4]


The modified array is:
0:1, 5:2, 10:3, 15:4, 20:1, 25:2, 30:3, 35:4, 40:1, 45:2, 50:3, 55:4,

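7. Iterating over matrices

Iterator

By default the iterator picks the most efficient order for traversing the array; you can also force a specific "style" of ordering.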

import numpy as np

a = np.arange(0, 60, 5)
a = a.reshape(3, 4)
print('The original array is:')
print(a)

print('\n')
print('Sorted in C-style order:')
for x in np.nditer(a, order='C'):
    print(x, end=", ")

print('\n')
print('Sorted in F-style order:')
for x in np.nditer(a, order='F'):
    print(x, end=", ")

Output:

The original array is:
[[ 0  5 10 15]
 [20 25 30 35]
 [40 45 50 55]]


Sorted in C-style order:
0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55,

Sorted in F-style order:
0, 20, 40, 5, 25, 45, 10, 30, 50, 15, 35, 55,

Read/write flags

The nditer object has another optional parameter, op_flags. By default, nditer treats the array being iterated over as read-only.

To modify the array's element values while iterating, you must specify read-write or write-only mode.

import numpy as np

a = np.arange(0, 60, 5)
a = a.reshape(3, 4)
print('The original array is:')
print(a)

print('\n')
for x in np.nditer(a, op_flags=['readwrite']):
    x[...] = 2 * x
print('The modified array is:')
print(a)

Using an external loop

To get the elements grouped by column, consider the approach below.

Alternatively, consider ndarray.flatten(order='F').

import numpy as np

a = np.arange(0, 60, 5)
a = a.reshape(3, 4)

print('The original array is:')
print(a)
print('\n')
print('The modified array is:')

for x in np.nditer(a, flags=['external_loop'], order='C'):
    print(x, end=", ")
print()
for x in np.nditer(a, flags=['external_loop'], order='F'):
    print(x, end=", ")

Output:

The original array is:
[[ 0  5 10 15]
 [20 25 30 35]
 [40 45 50 55]]


The modified array is:
[ 0  5 10 15 20 25 30 35 40 45 50 55],
[ 0 20 40], [ 5 25 45], [10 30 50], [15 35 55],

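Reducing to one dimension

ravel(), flatten(), and squeeze() in NumPy can all be used to reduce a multi-dimensional array toward one dimension. The differences:

ravel()	Does not copy the source data unless it has to.
flatten()	Returns a copy of the source data.
squeeze()	Only removes axes of length 1.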
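The differences in the table can be checked with `np.shares_memory`. This is a sketch with illustrative arrays:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

r = a.ravel()     # contiguous input: a view, no copy needed
f = a.flatten()   # always a copy
rt = a.T.ravel()  # the transpose is not C-contiguous, so ravel has to copy

s = np.zeros((1, 3, 1)).squeeze()  # both length-1 axes are removed
unchanged = a.squeeze()            # no length-1 axes: the shape stays (2, 3)
print(r, f, rt, s.shape, unchanged.shape)
```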

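8. String operations

The following functions perform vectorized string operations on arrays whose dtype is numpy.string_ or numpy.unicode_ (named numpy.bytes_ and numpy.str_ in NumPy 2.x). They are based on the standard string functions in Python's built-in library.

These functions are defined in the character array module (numpy.char).

Function	Description
`add()`	Concatenates the string elements of two arrays element by element
`multiply()`	Returns each string element repeated the given number of times
`center()`	Centers a string
`capitalize()`	Converts the first letter of a string to uppercase
`title()`	Converts the first letter of each word in a string to uppercase
`lower()`	Converts array elements to lowercase
`upper()`	Converts array elements to uppercase
`split()`	Splits a string on the given separator and returns an array of lists
`splitlines()`	Returns the list of lines in each element, split at line breaks
`strip()`	Removes the given characters from the start or end of each element
`join()`	Joins the elements of an array using the given separator
`replace()`	Replaces all occurrences of a substring with a new string
`decode()`	Calls `bytes.decode` on each array element
`encode()`	Calls `str.encode` on each array element
`byteswap()`	Swaps the byte order (endianness) of every element in an ndarray (an ndarray method, not part of numpy.char)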
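A brief sketch of a few of the numpy.char functions from the table, using made-up sample strings:

```python
import numpy as np

first = np.array(["hello", "numpy"])
second = np.array([" world", " char"])

joined = np.char.add(first, second)    # element-wise concatenation
upper = np.char.upper(first)
titled = np.char.title(joined)
repeated = np.char.multiply(first, 2)  # each string repeated twice
parts = np.char.split(joined)          # object array holding one list per element
print(joined, upper, titled, repeated, parts)
```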

Some Obscure Difficult Points

Array dimension operations.

1. Swapping axes: numpy.swapaxes

import numpy as np 
x = np.array([[[0,1],[2,3]],[[4,5],[6,7]]])
y = np.swapaxes(x,0,2)
print(y)

Output: the result of swapping the x axis (0) and the z axis (2)

[[[0 4]
  [2 6]]

 [[1 5]
  [3 7]]]

2. Inserting an axis

Key point: a newly inserted axis always has length 1.

import numpy as np

x = np.array(([1, 2], [3, 4]))

print('Array x:')
print(x)
print('\n')
y = np.expand_dims(x, axis=0)

print('Array y:')
print(y)
print('\n')

print('Shapes of x and y:')
print(x.shape, y.shape)
print('\n')
# Insert an axis at position 1
y = np.expand_dims(x, axis=1)

print('Array y after inserting an axis at position 1:')
print(y)
print('\n')

print('x.ndim and y.ndim:')
print(x.ndim, y.ndim)
print('\n')

print('x.shape and y.shape:')
print(x.shape, y.shape)

Output:

Array x:
[[1 2]
 [3 4]]


Array y:
[[[1 2]
  [3 4]]]


Shapes of x and y:
(2, 2) (1, 2, 2)


Array y after inserting an axis at position 1:
[[[1 2]]

 [[3 4]]]


x.ndim and y.ndim:
2 3


x.shape and y.shape:
(2, 2) (2, 1, 2)

3. Broadcasting dimensions (a Cartesian product via iters)

Because x has more dimensions, y is automatically promoted to match through broadcasting.

import numpy as np

x = np.array([[1], [2], [3]])
y = np.array([4, 5, 6])

# Broadcast x against y
b = np.broadcast(x, y)
# It has an iters attribute: a tuple of iterators, one per broadcast component

print('Broadcasting x against y:')
r, c = b.iters  # advance with next(it) in Python 3.x, it.next() in Python 2.x
print(next(r), next(c))
print(next(r), next(c))
print(next(r), next(c))
print(next(r), next(c))
print('\n')

Output:

Broadcasting x against y:
1 4
1 5
1 6
2 4

4. A proper "Cartesian product"

import itertools

class cartesian(object):
    def __init__(self):
        self._data_list = []

    def add_data(self, data=None):  # add one list of values to the product
        self._data_list.append([] if data is None else data)

    def build(self):  # compute and print the Cartesian product
        for item in itertools.product(*self._data_list):
            print(item)

if __name__ == "__main__":
    car = cartesian()
    car.add_data([1, 2, 3])
    car.add_data([4, 5, 6])
    car.build()

Broadcasting in an "addition"

print('Shape of the broadcast object:')
print(b.shape)
print('\n')

b = np.broadcast(x, y)
c = np.empty(b.shape)

print('Manually adding x and y using broadcast:')
print(c.shape)
print('\n')

# Option 1: add manually
c.flat = [u + v for (u, v) in b]  # neat: c.flat assigns element by element, so c only looks flattened and keeps its (3, 3) shape
print('After assigning through flat:')
print(c)
print('\n')

# Option 2: add automatically; NumPy's built-in broadcasting gives the same result
print('Sum of x and y:')
print(x + y)

Output:

Shape of the broadcast object:
(3, 3)


Manually adding x and y using broadcast:
(3, 3)


After assigning through flat:
[[5. 6. 7.]
 [6. 7. 8.]
 [7. 8. 9.]]


Sum of x and y:
[[5 6 7]
 [6 7 8]
 [7 8 9]]

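broadcast_to as a "pseudo-copy"

Broadcasting an array to a new shape looks somewhat like copying, but the result is read-only. Is it a "pseudo-copy"? Yes: memory use does not grow.

You can use sys.getsizeof(<variable>) to get the size of a variable.

import numpy as np

a = np.arange(4).reshape(1, 4)

print('The original array:')
print(a)
print('\n')

print('After calling broadcast_to:')
print(np.broadcast_to(a, (4, 4)))

Output:

The original array:
[[0 1 2 3]]


After calling broadcast_to:
[[0 1 2 3]
 [0 1 2 3]
 [0 1 2 3]
 [0 1 2 3]]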
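One way to see that no data is duplicated is to inspect the strides and flags of the result. This is a sketch, and the variable names are illustrative:

```python
import numpy as np

a = np.arange(4).reshape(1, 4)
b = np.broadcast_to(a, (4, 4))

print(b.strides)               # the first stride is 0: every row reuses the same bytes
print(np.shares_memory(a, b))  # True: b is a view onto a's buffer
print(b.flags.writeable)       # False: the broadcast view is read-only

writable = b.copy()            # a real copy allocates its own 4x4 buffer
print(writable.nbytes, a.nbytes)
```

If a writable 4x4 array is needed, `b.copy()` materializes it at the cost of the extra memory.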

Sorting algorithms

By form

Returning the sorted result

Sort by a given field.

import numpy as np

a = np.array([[3, 7], [9, 1]])
print('Our array is:')
print(a)
print('\n')

print('Calling sort():')
print(np.sort(a))
print('\n')

print('Sorting along columns:')
print(np.sort(a, axis=0))
print('\n')

# Sort by a field using the order argument of sort
dt = np.dtype([('name', 'S10'), ('age', int)])
a = np.array([("raju", 21), ("anil", 25), ("ravi", 17), ("amar", 27)], dtype=dt)

print('Our array is:')
print(a)
print('\n')

print('Sorted by name:')
print(np.sort(a, order='name'))

Returning the sorted indices

argsort returns indices, but indexing with x[y] neatly yields the sorted result.

import numpy as np

x = np.array([3, 1, 2])
print('Our array is:')
print(x)
print('\n')

print('Calling argsort() on x:')
y = np.argsort(x)
print(y)
print('\n')

print('Rebuilding the array in sorted order:')
print(x[y])
print('\n')

print('Rebuilding the array with a loop:')
for i in y:
    print(x[i], end=" ")

Sorting by multiple sequences

Each column represents one sequence (key); when sorting, the later keys take priority.

Note that the keys are passed in as a tuple.

import numpy as np

nm = ('raju', 'anil', 'ravi', 'amar')
dv = ('f.y.', 's.y.', 's.y.', 'f.y.')
ind = np.lexsort((dv, nm))
print('Calling lexsort():')
print(ind)
print('\n')
print('Using this index to get the sorted data:')
print([nm[i] + ", " + dv[i] for i in ind])

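By detail

Sorting real and complex numbers

msort(a)	Sorts the array along the first axis and returns a sorted copy. np.msort(a) is equivalent to np.sort(a, axis=0); msort has been deprecated and removed in newer NumPy releases, so prefer np.sort(a, axis=0).
sort_complex(a)	Sorts complex numbers by real part first, then imaginary part.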
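A small sketch of both rows of the table, using `np.sort(a, axis=0)` in place of `msort`. The sample values are arbitrary:

```python
import numpy as np

z = np.array([1 + 2j, 1 + 1j, 0 + 5j])
z_sorted = np.sort_complex(z)       # real part first, then imaginary part

m = np.array([[3, 7], [9, 1]])
by_first_axis = np.sort(m, axis=0)  # sorts each column independently
print(z_sorted)
print(by_first_axis)
```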

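Partition sorting

(1) Put the element that belongs at sorted index 3 into place; everything smaller ends up before it and everything larger after it.

>>> a = np.array([3, 4, 2, 1])
>>> np.partition(a, 3)  # index 3 receives the value a full sort would put there; smaller values come before it and larger ones after it, each side in no guaranteed order
array([2, 1, 3, 4])
>>>
>>> np.partition(a, (1, 3))  # the elements at sorted indices 1 and 3 are both placed in their final positions
array([1, 2, 3, 4])

(2) The 3rd smallest value (index=2)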

>>> arr = np.array([46, 57, 23, 39, 1, 10, 0, 120])
>>> arr[np.argpartition(arr, 2)[2]]
10

(3) The 2nd largest value (index=-2)

>>> arr[np.argpartition(arr, -2)[-2]]
57

(4) Find the 3rd and 4th smallest values at the same time.

>>> arr[np.argpartition(arr, [2,3])[2]]
10
>>> arr[np.argpartition(arr, [2,3])[3]]
23

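(5) numpy.argmax() and numpy.argmin() return the indices of the maximum and minimum elements along the given axis.

Similar to boolean indexing

numpy.nonzero() returns the indices of the non-zero elements of the input array.

numpy.where() returns the indices of the elements of the input array that satisfy a given condition.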
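A minimal sketch of both functions on a small made-up array, showing how the returned index arrays relate to boolean indexing:

```python
import numpy as np

x = np.array([[0, 1, 2], [3, 0, 5]])

nz_rows, nz_cols = np.nonzero(x)    # one index array per axis
gt_rows, gt_cols = np.where(x > 2)  # indices where the condition holds
values = x[np.where(x > 2)]         # same result as x[x > 2]
print(nz_rows, nz_cols)
print(gt_rows, gt_cols, values)
```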

End.
