Data Mining: Learning NumPy

Date: 2019-11-06

Tags: data mining, numpy, learning

What is NumPy

NumPy is an open-source numerical computing extension for Python. It can store and process large matrices (data of any number of dimensions) far more efficiently than Python's own nested list structure, which can also be used to represent a matrix.

The ndarray data type

NumPy provides an N-dimensional array type, the ndarray, which describes a collection of "items" of the same type.

import numpy as np

score = np.array([
    [80, 89, 86, 67, 79],
    [78, 97, 89, 67, 81],
    [90, 94, 78, 67, 74],
    [91, 91, 90, 67, 69],
    [76, 87, 75, 67, 86],
    [70, 79, 84, 67, 84],
    [94, 92, 93, 67, 64],
    [86, 85, 83, 67, 80]])

print(score, type(score))  #<class 'numpy.ndarray'>

Performance comparison: ndarray vs. native Python list

import numpy as np
import random
import time

# Build a large list of random floats (100 million elements; this needs several GB of RAM)
python_list = []

for i in range(100000000):
    python_list.append(random.random())

ndarray_list = np.array(python_list)
len(ndarray_list)

# Sum the native Python list
t1 = time.time()
a = sum(python_list)
t2 = time.time()
d1 = t2 - t1
print(d1)   # 0.7309620380401611 (seconds)

# Sum the ndarray
t3 = time.time()
b = np.sum(ndarray_list)
t4 = time.time()
d2 = t4 - t3
print(d2)  # 0.12980318069458008 (seconds)
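
A single time.time() measurement is noisy, so the comparison above can also be repeated with timeit. The sketch below is only an illustration: the array size of 1,000,000 and the repeat count of 5 are arbitrary choices that are not from the original post, and the absolute timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

# Assumption: 1,000,000 elements keeps memory use modest.
n = 1_000_000
python_list = [random.random() for _ in range(n)]
ndarray_list = np.array(python_list)

# Take the best of 5 single runs for each approach.
list_time = min(timeit.repeat(lambda: sum(python_list), number=1, repeat=5))
array_time = min(timeit.repeat(lambda: np.sum(ndarray_list), number=1, repeat=5))

print(f"list sum:    {list_time:.4f} s")
print(f"ndarray sum: {array_time:.4f} s")
```

Taking the minimum of several runs reduces the influence of other processes on the measurement.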

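Advantages of NumPy:

        1) Storage layout
            ndarray - elements share one type - less general - data is stored contiguously
            list - elements may have different types - very general - stores references to objects scattered across the heap
        2) Parallelized operations
            ndarray supports vectorized operations
        3) Underlying language
            written in C; releases the GIL

1. Memory block layout

2. ndarray supports parallelized (vectorized) operations

3. NumPy is implemented in C and internally lifts the restriction of the GIL (the global interpreter lock, under which only one thread actually runs at a time)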
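
To make point 2 concrete, here is a small sketch contrasting a Python-level loop with a vectorized expression. The names prices and quantities are made up for illustration.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Loop version: one Python-level multiplication per element.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: a single expression, executed in compiled code.
totals_vec = prices * quantities

print(totals_vec)  # [10. 40. 90.]
```

Both versions compute the same values; the vectorized form avoids the per-element interpreter overhead.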

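Attributes of the N-dimensional array: ndarray attributes (shape + dtype)

ndarray shape

import numpy as np
# The shape is a tuple. (2, 3) has two numbers, so the array is 2-D: 2 rows and 3 columns
a = np.array([[1, 2, 3], [4, 5, 6]])

# (4,): a 1-D array has a single number, the element count; the trailing comma makes it a tuple
b = np.array([1, 2, 3, 4])

# (2, 2, 3): the outermost level holds two 2-D arrays, each 2-D array holds two 1-D arrays, and each 1-D array has 3 elements
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])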

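How should the shape of an array be understood?

A 2-D array is a 1-D array whose elements are themselves 1-D arrays.

A 3-D array is a 1-D array whose elements are themselves 2-D arrays.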
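
As a quick check of this nesting view, the following sketch prints the shape, ndim, and size attributes of the three arrays a, b, and c defined earlier (they are redefined here so the snippet runs on its own).

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3, 4])
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

for name, arr in (("a", a), ("b", b), ("c", c)):
    # shape: length along each axis; ndim: number of axes; size: total element count
    print(name, arr.shape, arr.ndim, arr.size)
# a (2, 3) 2 6
# b (4,) 1 4
# c (2, 2, 3) 3 12
```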

ndarray dtype

If no dtype is specified when an ndarray is created:
integers default to int64 (int32 on Windows before NumPy 2.0)
floats default to float64

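Specifying the dtype when creating an array

import numpy as np

# Specify the dtype at creation (1): pass a NumPy type object
t = np.array([1.1, 2.2, 3.3], dtype=np.float32)
# Specify the dtype at creation (2): pass the type name as a string
tt = np.array([1.1, 2.2, 3.3], dtype="float32")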

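Basic operations

Ways to create arrays (four kinds)

1) Arrays of zeros and ones
    np.zeros(shape)
    np.ones(shape)
2) From existing arrays
    np.array() np.copy()  deep copy
    np.asarray()  shallow copy
3) Arrays over a fixed range
    np.linspace(0, 10, 100)
        [0, 10], evenly spaced

    np.arange(a, b, c)
        like range(a, b, c)
            [a, b), c is the step
4) Random arrays
    Distribution - histogram
    1) Uniform distribution
        every bin is equally likely
    2) Normal distribution
        σ: amplitude, degree of fluctuation, concentration, stability, dispersion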

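1. Arrays of zeros and ones

import numpy as np

# 1. Create arrays of zeros and ones
t = np.zeros(shape=(3, 4), dtype="float32")
tt = np.ones(shape=[2, 3], dtype=np.int32)

2. Creating from existing arrays

import numpy as np
# Method 1: np.array()
score = np.array([[80, 89, 86, 67, 79],
                  [94, 92, 93, 67, 64],
                  [86, 85, 83, 67, 80]])

# Method 2: np.copy()
ttt = np.copy(score)

# Method 3: np.asarray()
tttt = np.asarray(ttt)

Difference:

np.array() and np.copy() make a deep copy: the new array owns its own data
np.asarray() makes a shallow copy: when the input is already an ndarray, it returns that same array without copying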
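
The difference shows up once the source array is modified afterwards. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

source = np.array([1, 2, 3])

copied = np.array(source)     # new array with its own data
same = np.asarray(source)     # no copy: returns the very same object

source[0] = 100

print(copied[0])       # 1
print(same[0])         # 100
print(same is source)  # True
```

Note that np.asarray() does copy when it has to convert, for example from a Python list or to a different dtype.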

3. Arrays over a fixed range

            np.linspace(0, 10, 100)
                [0, 10]: 100 evenly spaced numbers, both endpoints included

            np.arange(a, b, c)
                [a, b): start included, stop excluded, step c
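
A small sketch of both functions, using fewer points than above so the output is easy to read:

```python
import numpy as np

# 5 evenly spaced points from 0 to 10, both endpoints included
points = np.linspace(0, 10, 5)
print(points.tolist())   # [0.0, 2.5, 5.0, 7.5, 10.0]

# start at 0, stop before 10, step 2
steps = np.arange(0, 10, 2)
print(steps.tolist())    # [0, 2, 4, 6, 8]
```

In short, linspace takes the number of points, while arange takes the step size.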

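4. Random arrays (distribution - histogram)

            1) Uniform distribution
                every bin is equally likely
            2) Normal distribution
                σ: amplitude, degree of fluctuation, concentration, stability, dispersion

1. Uniform distribution: every value is equally likely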

import numpy as np
import matplotlib.pyplot as plt


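# Uniform distribution: 1,000,000 samples from [-1, 1)
data1 = np.random.uniform(low=-1, high=1, size=1000000)

# 1. Create the figure
plt.figure(figsize=(8, 6), dpi=100)
# 2. Draw the histogram (1000 bins)
plt.hist(data1, 1000)
# 3. Show the plot
plt.show()

2. Normal distribution

Variance is a measure of how dispersed a random variable or a data set is. In probability theory, variance measures how far a random variable deviates from its mathematical expectation (the mean). In statistics, the variance of a sample is the mean of the squared differences between each sample value and the mean of all sample values. The standard deviation is the square root of the variance: the smaller it is, the more concentrated the data.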
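
The definition can be checked by hand on a small data set. The numbers below are made up for illustration; np.var and np.std use the same "mean of squared deviations" definition by default.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
# Variance: mean of the squared deviations from the mean
variance = ((data - mean) ** 2).mean()
# Standard deviation: square root of the variance
std = np.sqrt(variance)

print(mean, variance, std)  # 5.0 4.0 2.0

# The built-in functions give the same values
print(np.isclose(np.var(data), variance), np.isclose(np.std(data), std))  # True True
```

Passing ddof=1 to np.var or np.std divides by n - 1 instead of n, which is the usual unbiased estimate for a sample.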

demo:

import numpy as np
import matplotlib.pyplot as plt

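# Normal distribution: mean 1.75, standard deviation 0.1
data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000)

# 1. Create the figure
plt.figure(figsize=(20, 8), dpi=80)

# 2. Draw the histogram (1000 bins)
plt.hist(data2, 1000)

# 3. Show the plot
plt.show()

Indexing and slicing arrays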

demo:

import numpy as np

def slice_index():
    '''
        Modifying a 1-D array:
    '''
    arr = np.array([12, 32, 31])
    arr[0] = 2
    print(arr)

    '''
        Modifying a 2-D array:
    '''
    arr2 = np.array([[12, 2], [43, 3]])
    arr2[0, 0] = 22  # changes [12, 2] to [22, 2]
    print(arr2)

    '''
        Modifying a 3-D array:
    '''
    arr3 = np.array(
        [[[1, 2, 3],
          [4, 5, 6]],

         [[12, 3, 34],
          [5, 6, 7]]]
    )   # three levels of [ mean a 3-D array: it holds two 2-D arrays, each 2-D array holds two 1-D arrays, and each 1-D array holds 3 numbers, hence shape (2, 2, 3)

    arr3[1, 0, 2] = 22  # changes [12, 3, 34] to [12, 3, 22]
    print(arr3)
    print(arr3[1, 1, :2])  # [5 6]: the first 2 elements of the last row


if __name__ == '__main__':
    # slicing and indexing
    slice_index()
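
One detail worth knowing when modifying arrays through slices: a basic slice is a view onto the same data, not a copy. The sketch below illustrates this; the names row_view and row_copy are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

row_view = arr[0, :]         # basic slice: a view onto arr's data
row_copy = arr[0, :].copy()  # explicit copy: independent data

row_view[0] = 99

print(arr[0, 0])    # 99, the change is visible through arr
print(row_copy[0])  # 1, the copy is unaffected
```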

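Changing shape

ndarray.reshape(shape) returns a new ndarray object (usually a view sharing the same data) and leaves the original array's shape untouched. Only the shape changes: the elements keep their order and rows are not swapped with columns. ndarray.reshape(-1, 2) infers the -1 dimension automatically.

ndarray.resize(shape) returns nothing and modifies the original ndarray in place; rows are not swapped with columns
ndarray.T transposes: rows become columns and columns become rows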

demo:

import numpy as np


def np_change():
    arr3 = np.array(
        [[1, 2, 3], [4, 5, 6]]
    )  # (2, 3)
    '''
    Method 1:
     reshape: returns a new ndarray and leaves the original unchanged; only the shape changes, rows and columns are not swapped
        [[1 2]
         [3 4]
         [5 6]]
    '''
    arr4 = arr3.reshape((3, 2))
    print(arr3.shape)  # (2, 3)
    print(arr4.shape)  # (3, 2)

    '''
     Method 2:
     resize: returns nothing and modifies the original ndarray in place; rows and columns are not swapped
        [[1 2 3 4 5 6]]
    '''
    arr3.resize((1, 6))
    print(arr3)  # shape is now (1, 6)

    '''
     Method 3:
     T: transposes, turning rows into columns and columns into rows
        [[1 3 5]
         [2 4 6]]
    '''
    print(arr4.T)


if __name__ == '__main__':
    # changing shape
    np_change()
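
The automatic reshape with -1 mentioned above is not exercised by the demo, so here is a short sketch of it:

```python
import numpy as np

arr = np.arange(1, 7)        # [1 2 3 4 5 6]

auto = arr.reshape(-1, 2)    # -1 lets NumPy infer the row count: 6 / 2 = 3
print(auto.shape)            # (3, 2)
print(arr.shape)             # (6,) reshape leaves the original shape untouched

print(auto.T.shape)          # (2, 3)
```

Only one dimension may be given as -1, and the total number of elements must be divisible by the product of the other dimensions.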

Changing the dtype

ndarray.astype(type)
Serializing an ndarray to bytes: ndarray.tobytes() (the original ndarray.tostring() is a deprecated alias with the same result)

import numpy as np


def type_change():
    '''
        Changing the dtype, part 1: astype('float32')
    '''
    arr3 = np.array(
        [[1, 2, 3], [4, 5, 6]]
    )  # (2, 3)
    arr4 = arr3.astype("float32")  # convert int to float
    print(arr3.dtype)    # int32 here; int64 on most platforms
    print(arr4.dtype)    # float32

    '''
        Part 2: serialize with tobytes()
    '''

    arr5 = arr3.tobytes()   # serialized raw bytes: \x01\x00\x00\x00...
    print(arr5)


if __name__ == '__main__':
    # changing the dtype
    type_change()
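
The serialized bytes carry no shape or dtype information, so both have to be supplied again when reading them back. A hedged round-trip sketch using np.frombuffer (the dtype is fixed to int32 here so that the byte count is the same on every platform):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

raw = arr.tobytes()  # raw bytes only; shape and dtype are not stored
restored = np.frombuffer(raw, dtype=arr.dtype).reshape(arr.shape)

print(len(raw))                        # 24 (6 elements x 4 bytes)
print(np.array_equal(arr, restored))   # True
```

For saving arrays to disk, np.save and np.load keep the shape and dtype alongside the data.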

Removing duplicates from an array

np.unique() or set()

import numpy as np


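def dedup_demo():
    '''
        Removing duplicates from an ndarray
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Method 1: np.unique() returns the sorted unique values as a new 1-D array
    unique_values = np.unique(temp)
    print('Deduplicated with unique:', unique_values)   # [1 2 3 4 5 6]

    temp2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Method 2: set() needs a 1-D sequence, so flatten() the array first
    unique_set = set(temp2.flatten())
    print('Deduplicated with set after flattening:', unique_set)


if __name__ == '__main__':
    # removing duplicates from an ndarray
    dedup_demo()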
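
np.unique can also report how often each value occurs, which a plain set cannot. A short sketch using its return_counts parameter:

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])

values, counts = np.unique(temp, return_counts=True)
print(values.tolist())  # [1, 2, 3, 4, 5, 6]
print(counts.tolist())  # [1, 1, 2, 2, 1, 1]
```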

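Summary:

ndarray operations (logical + statistical + array operations)

1. Logical operations

        Boolean indexing
        General test functions
            np.all(boolean array)
                returns False as soon as one element is False; returns True only if all are True
            np.any()
                returns True as soon as one element is True; returns False only if all are False

np.where (ternary operator)
np.where(condition, value where True, value where False)

Boolean indexing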

import numpy as np

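def demo():
    '''
        Logical operations
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # Mark each element of temp as True if it is greater than 5, otherwise False
    print(temp > 5)

    # Select the values greater than or equal to 5
    print(temp[temp >= 5])  # [5 6]

    # Select the values greater than or equal to 5 and set them all to 100
    temp[temp >= 5] = 100
    print(temp)

if __name__ == '__main__':
    # logical operations: boolean indexing
    demo()

General test functions

    np.all(boolean array)
        returns False as soon as one element is False; returns True only if all are True
    np.any()
        returns True as soon as one element is True; returns False only if all are False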

import numpy as np

def demo():
    '''
        General test functions
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.all(): False as soon as one element is False; True only if all are True
    print(np.all(temp > 5))   # False
    print(np.all(temp < 15))  # True

    # np.any(): True as soon as one element is True; False only if all are False
    print(np.any(temp > 5))  # True

if __name__ == '__main__':
    # logical operations: general test functions
    demo()

Ternary operator

np.where(condition, value where True, value where False)

import numpy as np

def demo():
    '''
        Ternary operator
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.where(condition, value where True, value where False)
    print(np.where(temp > 4, 100, -100))  # elements greater than 4 become 100, all others become -100

    '''
        [[-100 -100 -100 -100]
         [-100 -100  100  100]]
    '''

if __name__ == '__main__':
    # logical operations: ternary operator
    demo()

Combined with logical and, or, not:

import numpy as np

def demo():
    '''
        Ternary operator combined with logical and/or/not
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
    # np.logical_and(), np.logical_or() and np.logical_not() perform and, or, not
    print(np.logical_and(temp > 2, temp < 4))  # logical and

    print(np.logical_or(temp > 2, temp < 3))  # logical or

    print(np.where(np.logical_or(temp > 2, temp < 3), 1, 0))    # where combined with or

    print(np.where(np.logical_and(temp > 2, temp < 4), 1, 0))  # where combined with and

    '''
        Output of the last print (only the elements equal to 3 satisfy both conditions):
        [[0 0 1 0]
         [1 0 0 0]]
    '''

if __name__ == '__main__':
    # logical operations: ternary operator
    demo()

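2. Statistical operations

        Statistical functions
            min, max, mean, median, var, std
            np.function_name, e.g. np.max(arr)
            ndarray.method_name, e.g. arr.max()  # pass axis= to reduce along one axis
        Position of the maximum and minimum
            np.argmax(temp, axis=)
            np.argmin(temp, axis=)

Statistical functions: the axis must be specified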

import numpy as np

def demo():
    '''
        Statistical operations
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])
    print(temp.max(axis=0))  # [5 6 7 8], maximum of each column
    print(temp.max(axis=1))  # [4 6 8], maximum of each row

    print(np.argmax(temp, axis=1))  # [3 3 3], position of the maximum in each row
    print(np.argmin(temp, axis=1))  # [0 0 0], position of the minimum in each row

if __name__ == '__main__':
    # statistical operations
    demo()
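
The other functions listed above follow the same axis convention. A brief sketch on the same array:

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])

col_mean = temp.mean(axis=0)            # mean of each column
row_median = np.median(temp, axis=1)    # median of each row
row_var = temp.var(axis=1)              # variance of each row

print(col_mean.tolist())    # [3.0, 4.0, 5.0, 6.0]
print(row_median.tolist())  # [2.5, 4.5, 6.5]
print(row_var.tolist())     # [1.25, 1.25, 1.25]
```

median is only available as the function np.median; the others exist both as functions and as ndarray methods.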

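3. Operations between arrays

1. Array and scalar operations
2. Array and array operations
3. Broadcasting
4. Matrix operations
    1 What is a matrix
        a matrix is a 2-D array
        matrix vs. 2-D array
        two ways to store a matrix
            1) ndarray, a 2-D array
                matrix multiplication:
                    np.matmul
                    np.dot
            2) the matrix data structure
    2 Matrix multiplication
        shapes
            (m, n) * (n, l) = (m, l)
        rule
            A (2, 3) B (3, 2)
            A * B = (2, 2)

1. Array and scalar operations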

import numpy as np

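def demo():
    '''
        Array and scalar operations
    '''
    temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])
    print(temp + 10)  # adds 10 to every element
    print(temp * 10)  # multiplies every element by 10

if __name__ == '__main__':
    # array and scalar operations
    demo()

2. Array and array operations (the shapes must satisfy the broadcasting rules)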

import numpy as np

def demo():
    '''
        Array and array operations
    '''
    arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])  # 2 rows, 6 columns
    arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])  # 2 rows, 4 columns
    arr3 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])  # 2 rows, 6 columns
    arr4 = [2]
    # print(arr1 + arr2) fails: could not be broadcast together with shapes (2,6) (2,4)
    print(arr1 + arr3)  # same shapes: element-wise addition
    print(arr1 + arr4)  # shape (1,) broadcasts against (2, 6)


if __name__ == '__main__':
    # array and array operations
    demo()
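
The broadcasting rule behind these results: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The sketch below is only an illustration; it uses np.broadcast_shapes, which is available in NumPy 1.20 and later.

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])           # shape (4,)

# Shapes are aligned from the right: (3, 1) and (4,) broadcast to (3, 4).
result = column + row
print(result.shape)     # (3, 4)
print(result.tolist())  # [[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]]

# np.broadcast_shapes reports the result shape, or raises ValueError if incompatible.
print(np.broadcast_shapes((2, 6), (1,)))   # (2, 6)
try:
    np.broadcast_shapes((2, 6), (2, 4))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```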

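Matrix operations

1 What is a matrix
    a matrix is a 2-D array
    matrix vs. 2-D array: a matrix is always stored in the computer as a 2-D array, but not every 2-D array is a matrix.
    two ways to store a matrix
        1) ndarray, a 2-D array
            matrix multiplication:
                np.matmul
                np.dot
        2) the matrix data structure
2 Matrix multiplication
    shapes
        (m, n) * (n, l) = (m, l)
    rule
        A (2, 3) B (3, 2)
        A * B = (2, 2)

1. What is a matrix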

import numpy as np


def demo():
    '''
        Ways to store a matrix
    '''
    # Option 1: store the matrix as an ndarray
    data = np.array([[80, 86],
                     [82, 80],
                     [85, 78],
                     [90, 90],
                     [86, 82],
                     [82, 90],
                     [78, 80],
                     [92, 94]])
    print(type(data))  # <class 'numpy.ndarray'>

    # Option 2: store the matrix as a matrix object
    # (np.asmatrix replaces np.mat, which was removed in NumPy 2.0)
    data_mat = np.asmatrix([[80, 86],
                            [82, 80],
                            [85, 78],
                            [90, 90],
                            [86, 82],
                            [82, 90],
                            [78, 80],
                            [92, 94]])
    print(type(data_mat))  # <class 'numpy.matrix'>

if __name__ == '__main__':
    # storing a matrix
    demo()

2. Matrix multiplication

       shapes
             (m, n) * (n, l) = (m, l)
       rule
             A (2, 3) B (3, 2)
             A * B = (2, 2)

import numpy as np


def demo():
    '''
        Matrix multiplication API
    '''
    # Option 1: np.matmul()
    data = np.array([[80, 86],
                     [82, 80],
                     [78, 80],
                     [92, 94]])  # (4,2)
    weight = np.array([[0.5],
                       [0.5]])   # (2,1)

    print(np.matmul(data, weight))  # (4,1)

    # Option 2: np.dot()
    # (np.asmatrix replaces np.mat, which was removed in NumPy 2.0)
    data_mat = np.asmatrix([[80, 86],
                            [82, 80],
                            [78, 80],
                            [92, 94]])
    print(np.dot(data_mat, weight))  # (4,1)

    # Extra option: the @ operator multiplies ndarrays as matrices directly
    print(data @ weight)  # (4,1)


if __name__ == '__main__':
    # matrix multiplication API
    demo()
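
The shape rule (m, n) * (n, l) = (m, l) can be checked on the A (2, 3), B (3, 2) case from the outline. The values of A and B below are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])         # shape (3, 2)

product = A @ B                # matrix product, shape (2, 2)
print(product.shape)           # (2, 2)
print(product.tolist())        # [[4, 5], [10, 11]]

# On ndarrays, * is element-wise, not a matrix product.
elementwise = A * A
print(elementwise.tolist())    # [[1, 4, 9], [16, 25, 36]]
```

This is the main practical difference from the matrix type, where * means matrix multiplication.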

Merging and splitting

Merging

Splitting
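
The original post gives no code for these two headings. As a hedged illustration only, the sketch below shows some standard NumPy functions commonly used for the purpose (np.vstack, np.hstack, np.concatenate, np.split); the arrays are made up.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Merging
stacked_rows = np.vstack((a, b))            # stack row-wise
stacked_cols = np.hstack((a, b))            # stack column-wise
joined = np.concatenate((a, b), axis=0)     # same as vstack for 2-D input
print(stacked_rows.shape)   # (4, 2)
print(stacked_cols.shape)   # (2, 4)

# Splitting: cut a 1-D array into 3 equal parts
parts = np.split(np.arange(6), 3)
print([p.tolist() for p in parts])   # [[0, 1], [2, 3], [4, 5]]
```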

I/O and data processing

Data preparation: test.csv

id,value1,value2,value3
1,123,1.4,23
2,110,,18
3,,2.1,19

demo:

import numpy as np


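def demo():
    '''
        Reading a CSV file
    '''
    # raw string, so the backslashes in the Windows path are not treated as escapes
    data = np.genfromtxt(r"F:\linear\test.csv", delimiter=",")
    print(data)   # strings (the header row) and missing values are recorded as nan (not a number)
    '''
       [[  nan   nan   nan   nan]
        [  1.  123.    1.4  23. ]
        [  2.  110.    nan  18. ]
        [  3.    nan   2.1  19. ]]
    '''


if __name__ == '__main__':
    # reading a CSV file
    demo()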

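Handling missing values

    1. Delete the samples that contain missing values
    2. Replace / impute
            compute the mean of each column and fill the gaps with it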

import numpy as np


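def fill_nan_by_column_mean():
    '''
        Handling missing values: fill with the column mean
    '''
    # The header row is read as nan and gets filled too; skip_header=1 would drop it
    t = np.genfromtxt(r"F:\linear\test.csv", delimiter=",")
    for i in range(t.shape[1]):  # go column by column; shape[1] is the number of columns
        now_col = t[:, i]  # a view onto column i of t
        # count the nan values in this column
        nan_num = np.count_nonzero(np.isnan(now_col))
        if nan_num > 0:
            # sum of the non-nan values
            now_col_not_nan = now_col[~np.isnan(now_col)].sum()
            # mean = sum / number of non-nan values
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # fill the nan values with the mean
            now_col[np.isnan(now_col)] = now_col_mean
            # write the column back into t
            t[:, i] = now_col
    print(t)
    return t


if __name__ == '__main__':
    # handling missing values: fill with the column mean
    fill_nan_by_column_mean()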
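
Both strategies listed above can also be written without an explicit loop. The sketch below is one possible alternative, not the author's method: it builds the data inline (the values of test.csv without the header row) so that it runs without the file, and uses np.nanmean, which ignores NaN values.

```python
import numpy as np

# Same values as test.csv above, with the header row left out.
t = np.array([[1, 123, 1.4, 23],
              [2, 110, np.nan, 18],
              [3, np.nan, 2.1, 19]])

# Option 1: drop every row that contains at least one NaN.
complete_rows = t[~np.isnan(t).any(axis=1)]
print(complete_rows.tolist())   # [[1.0, 123.0, 1.4, 23.0]]

# Option 2: replace each NaN with the mean of its column, ignoring NaNs.
col_means = np.nanmean(t, axis=0)
filled = np.where(np.isnan(t), col_means, t)
print(filled[1, 2], filled[2, 1])   # 1.75 116.5
```

np.where builds a new array here, so t itself is left unchanged.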