Python Study Notes: Data Visualization (Part 1)

Date: 2019-12-13


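Basic concepts

Data: discrete, objective facts represented as numbers
Information: processed data that provides answers to real problems

  - Once data is given a relationship or an association, it becomes information; that association is established by supplying context for the data

Knowledge: data, information, and the skills acquired through experience

  - Knowledge includes the ability to make appropriate decisions and the skills needed to carry them out

Insight:

  - How to gain insight: arrive at the best or most realistic decision based on the data and information already available; we can get there through data analysis

Data analysis relies on mathematical algorithms to determine the relationships among the data that produce insight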

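Information is quantifiable, measurable, and has a form; it can be accessed, generated, stored, distributed, searched, compressed, and copied.
Information can be quantified by its amount or volume. Information can be converted into knowledge, and knowledge is more quantifiable than information. In some fields, knowledge goes through a continuous cycle of development, and this evolution takes place whenever the data changes.

Through discrete algorithms:

Data transformation: data is converted into information, processed further, and then used to solve problems

  - Data comes in different kinds, including performance data, experimental data, and benchmark data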

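The whole visualization process calls for people with different skills and areas of expertise.

Data wranglers work hard to collect the data and carry out the analysis
Mathematicians and statisticians understand the principles of visual design and use them to communicate the data
Designers, artists, and developers have the skills to turn the data into visuals
Business analysts and others look for behavioral patterns, outliers, or sudden trends
The steps of the whole process are:

Acquire or collect the data
Parse and filter the data: use programmatic methods to parse, clean, and reduce the data
Analyze and refine the data: remove noise and unnecessary dimensions, and find patterns
Present and interact: show the data in a way that is easier to access and understand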

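Data preprocessing

Data cleaning: removes noise from the data and corrects inconsistencies
Data integration: merges data from multiple sources (a warehouse)
Data reduction: shrinks the volume of data by merging, aggregating, and eliminating redundant features
Data transformation: scales the data into a smaller range, which improves the accuracy and efficiency of processing and visualization
- Extract data -> remove inconsistent data -> rebuild missing data -> standardize data -> validate data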
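
The pipeline above (extract, remove inconsistent data, rebuild missing data, standardize, validate) could be sketched with pandas roughly as follows. The sample records, the column names, and the choice of median imputation and min-max scaling are all assumptions made for illustration; the notes do not prescribe any particular method.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing age, one impossible age.
raw = pd.DataFrame({
    "name": ["a", "b", "b", "c", "d"],
    "age": [21.0, 35.0, 35.0, np.nan, -4.0],
    "income": [3000.0, 5200.0, 5200.0, 4100.0, 3900.0],
})

def preprocess(df):
    # Remove inconsistent data: exact duplicates and negative ages (missing ages are kept for now).
    out = df.drop_duplicates()
    out = out[out["age"].isna() | (out["age"] >= 0)].copy()
    # Rebuild missing data: fill gaps with the column median (one of many possible choices).
    out["age"] = out["age"].fillna(out["age"].median())
    # Standardize: min-max scale the numeric columns into [0, 1].
    for col in ["age", "income"]:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    # Validate: nothing missing, and every scaled value lies inside [0, 1].
    assert not out.isna().any().any()
    assert out[["age", "income"]].ge(0).all().all()
    assert out[["age", "income"]].le(1).all().all()
    return out.reset_index(drop=True)

clean = preprocess(raw)
print(clean)
```

Scaling into a small fixed range is what the "data transformation" step describes; other choices such as z-score standardization would fit the same slot.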


Data processing

Dataset resources

Data analysis and visualization

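Here we will use the following plotting tools

matplotlib: the most fundamental Python visualization library. Its plotting style is close to MATLAB's, hence the name matplotlib. People usually start Python data visualization with matplotlib and then branch out in depth and breadth
Seaborn: a high-level visualization library built on matplotlib, aimed mainly at feature selection in data mining and machine learning. With short code, seaborn can draw charts that describe data in more dimensions
Plotly: a plotting tool built on the open-source library plotly.js, created by a graphics company with a range of products and open-source tools, and free to use. In offline mode we can create an unlimited number of charts; in online mode we can create at most 25
Pyecharts: an open-source project based on Baidu's echarts, and the easiest tool for interactive visualization I have come across so far. Compared with bokeh and plotly, pyecharts has simpler syntax and produces more striking results (anyone who has done front-end work should know it well)
Bokeh: a library for interactive visualization in the browser, letting analysts interact with their data
pandas: a tool built on NumPy and created for data analysis tasks. Pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets.
Mapbox: a visualization toolkit with a stronger engine for handling geographic data
geoplotlib
cufflinks: a library for easy interactive Pandas charting with Plotly

There are of course many more tools beyond these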

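Data visualization is the process of conveying information. While visualizing, we should ask ourselves:

How many variables are we dealing with? What kind of picture are we trying to draw?
What do the x-axis and y-axis represent? (A 3D plot also has a z-axis)
Has the scale of the data been standardized? What does the size of a data point mean?
Have we chosen the right colors?
For time series data, are we trying to identify a trend or a correlation?

Here is a student dataset: http://www.knapdata.com/pytho...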

# -*- coding:utf-8 -*-
# usr/bin/python 3.5+
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

students = pd.read_csv("data/ucdavis.csv")

g = sns.FacetGrid(students, hue="gender", palette="Set1", height=6)  # the "size" argument was renamed to "height" in seaborn 0.9
g.map(plt.scatter, "gpa", "computer", s=250, linewidth= 0.65, edgecolor="white")

g.add_legend()
plt.show()

seaborn: http://seaborn.pydata.org/api...

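The FacetGrid class can depict three dimensions: row, column, and hue

  - It is used to visualize the distribution of one variable, or the relationship among several variables, within subsets of the data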

barchart

matplotlib.pyplot.bar

import numpy as np
import matplotlib.pyplot as plt

N = 7
winnersplot = (142.6, 125.3, 62.0, 81.0, 145.6, 319.4, 178.1 )
ind = np.arange(N)
width = 0.35
fig, ax = plt.subplots()
winners = ax.bar(ind, winnersplot, width, color='#ffad00')
print(winners)

nomineesplot = (109.4, 94.8, 60.7, 44.6, 116.9,262.5,102.0)
nominees = ax.bar(ind + width, nomineesplot, width, color='#9b3c38')

# add some text for labels, title and axes ticks

ax.set_xticks(ind + width / 2)  # centre each tick between the paired bars
ax.set_xticklabels(('Xiao Ming', 'Xiao Hong', 'Xiao Fan', 'Xiao Qian', 'Xiao Liu', 'Xiao Zhao', 'Xiao Wen'))
ax.legend((winners[0], nominees[0]), ('Oscar winners', 'Oscar nominees'))

def autolabel(rects):
    # attach a text label above each bar
    for rect in rects:
        height = rect.get_height()
        hcap = "$" + str(height) + "M"
        ax.text(rect.get_x() + rect.get_width() / 2., height, hcap, ha='center', va='bottom', rotation='horizontal')

autolabel(winners)
autolabel(nominees)

plt.show()

piechart

matplotlib.pyplot.pie

import matplotlib.pyplot as plt
labels = 'Computer Science', 'Foreign Languages','Analytical Chemistry', 'Education', 'Humanities', 'Physics', 'Biology', 'Math and Statistics', 'Engineering'
sizes = [21, 4, 7, 7, 8, 9, 10, 15, 19]
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral','red', 'purple', '#f280de', 'orange', 'green']
explode = (0,0,0,0,0,0,0,0,0.1)
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',colors=colors)
plt.axis('equal')
plt.show()

box chart

scatter

Scatter plot

A scatter plot visualizes the relationship between two variables for the same group of subjects

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
students = pd.read_csv("data/ucdavis.csv")
g = sns.FacetGrid(students, height=7)  # "size" is now "height"; the invalid palette "Set10" is dropped because no hue is used
g.map(plt.scatter, "momheight", "height", s=140, linewidth=.7, edgecolor="#ffad40", color="#ff8000")
g.set_axis_labels("Mother's Height", "Student's Height")
plt.show()

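Scatter plots are best suited to studying the relationship between different variables:

The likelihood of skin disease at different ages among men and women
The correlation between IQ test scores and GPA

We should also consider:

Adding a trend line or line of best fit (if the relationship is linear): a trend line can show how the data are related
Using informative marker types: informative markers are suited to cases where shape and color improve the visual effect and help interpret the data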
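
As a rough illustration of the trend-line idea, the sketch below fits a straight line with `numpy.polyfit` and draws it over a scatter plot. The data is synthetic (generated with a known slope of 0.6 plus noise) and only stands in for something like the height columns used earlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line and call plt.show() to view interactively
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(150, 180, 80)             # synthetic "mother's height"
y = 0.6 * x + 70 + rng.normal(0, 4, 80)   # synthetic "student's height" with noise

# Least-squares fit of a degree-1 polynomial; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)
xs = np.linspace(x.min(), x.max(), 100)

fig, ax = plt.subplots()
ax.scatter(x, y, s=40, color="#ff8000", edgecolor="white")
ax.plot(xs, slope * xs + intercept, color="black",
        label=f"y = {slope:.2f}x + {intercept:.1f}")
ax.legend()
```

A straight line is only meaningful when the relationship really is roughly linear, which is the condition stated above.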

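Bubble chart

A bubble chart shows three dimensions of the data. Each data point is a triple (a, b, c): the x and y coordinates represent two of the variables, and the size of the bubble represents a quantitative measure of the third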
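
A minimal bubble chart can be drawn with matplotlib's `scatter`, using its `s` argument for the third dimension. The (a, b, c) values below are made up, and the factor of 20 is an arbitrary scaling chosen only so the bubbles are visible.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line and call plt.show() to view interactively
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (a, b, c) triples: a and b give the position, c drives the bubble size.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.5, 1.0, 4.0, 3.0, 5.5])
c = np.array([10.0, 40.0, 25.0, 80.0, 55.0])

# `s` is the marker area in points squared, so a linear scaling keeps area proportional to c.
sizes = 20 * c

fig, ax = plt.subplots()
bubbles = ax.scatter(a, b, s=sizes, alpha=0.5, edgecolor="black")
ax.set_xlabel("a")
ax.set_ylabel("b")
```

Mapping the value to area rather than radius avoids visually exaggerating large values.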

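Histograms

A histogram, also called a quality distribution chart, is a statistical chart that shows how data is distributed using a series of vertical bars or line segments of varying height. The horizontal axis generally shows the data categories (value ranges), and the vertical axis shows how the data is distributed across them.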

import numpy as np
import pandas as pd
from scipy import stats, integrate
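import matplotlib.pyplot as plt  # plotting

import seaborn as sns
sns.set(color_codes=True)  # apply seaborn defaults and remap the colour shorthand codes

np.random.seed(sum(map(ord, "distributions")))
x = np.random.normal(size=100)
sns.distplot(x, kde=False, rug=True)  # kde=False turns off the kernel density curve; rug draws a small tick on the x-axis for each observation (a marginal rug)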
plt.show()

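When drawing a histogram, the main parameters you need to decide are the number of bars and where to place them. The bins argument makes it easy to set the number of bars, as shown below:

sns.distplot(x, bins=20, kde=False, rug=True)  # use 20 bars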
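
What the `bins` setting does to the underlying counts can be seen without plotting at all. This small sketch uses `numpy.histogram` on a synthetic normal sample; the sample and the two bin counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

for bins in (5, 20):
    counts, edges = np.histogram(x, bins=bins)
    # Every observation falls into exactly one bin, so the counts always sum to len(x);
    # there is always one more edge than there are bins.
    print(bins, counts.sum(), len(edges), round(edges[1] - edges[0], 3))
```

More bins means narrower bars and a noisier outline; fewer bins smooths the shape but can hide detail.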

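Kernel density estimation plot

Kernel density estimation (KDE) is a non-parametric method for estimating a probability density function. It achieves a smooth approximation by averaging over the observed data points.

In probability theory, kernel density estimation is used to estimate an unknown density function and is one of the non-parametric methods. Because it uses no prior knowledge about the distribution of the data and imposes no assumptions on it, it studies the characteristics of the distribution starting from the data sample itself, and it is therefore highly valued in both statistical theory and applications.

The kernel density function is closely related to the histogram, but through the concept of a kernel it can sometimes be given properties such as smoothness or continuity.
The kernel of a probability density function (PDF) is the form of the PDF in which any factors that are not functions of the variable are omitted.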
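
The "averaging over the observed data points" can be written out directly. With a Gaussian kernel K and bandwidth h, the estimate at a point x is f(x) = (1 / (n h)) * sum_i K((x - x_i) / h). The sketch below implements that formula in NumPy; the function name `gaussian_kde_manual` is invented here, and the bandwidth is a common rule-of-thumb choice assumed for the example, not something the notes specify.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, h):
    """Average one Gaussian bump of width h centred on every sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # u[j, i] is the scaled distance from grid point j to sample i.
    u = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(1)
samples = rng.normal(size=200)
# Rule-of-thumb bandwidth (an assumption for this sketch).
h = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_manual(samples, grid, h)

# A density should integrate to roughly 1; check with a simple Riemann sum.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller h gives a spikier curve that follows individual samples, while a larger h gives a smoother one, which mirrors the role of the bin width in a histogram.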

Here we use the iris dataset and the seaborn package to show a KDE plot
Demonstrating KDE plots with seaborn and matplotlib

seaborn.distplot

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

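seaborn's distplot() combines matplotlib's hist() with the kernel density estimate of kdeplot(), and adds rugplot ticks for the individual observations as well as the ability to fit a parametric distribution using scipy. Usage is as follows:

Getting started with seaborn (1): distplot and kdeplot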
seaborn.kdeplot

distplot()

By default, distplot draws both a histogram and a KDE (kernel density estimate)

from numpy.random import randn
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt

# load the iris dataset
df_iris = sns.load_dataset("iris")
fig, axes = plt.subplots(1,2)
# print(df_iris['petal_length'])
# print(axes[0])


# by default distplot draws both a histogram and a KDE; rug=True adds the rug ticks
sns.distplot(df_iris['petal_length'], ax= axes[0], rug = True)
# shade fills the area under the curve
sns.kdeplot(df_iris['petal_length'], ax = axes[1], shade = True)

plt.show()

If you do not need the kernel density curve, set the kde parameter to False.

sns.distplot(df_iris['petal_length'], ax= axes[0], kde = False, rug = True)

If you do not need the histogram, set the hist parameter to False.

sns.distplot(df_iris['petal_length'], ax= axes[0], hist = False, rug = True)

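# Fitting parametric distributions

# distplot() can also fit a parametric distribution to the data so that you can see how closely the two agree; the fit argument controls which distribution is fitted.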


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", palette="muted", color_codes=True)
rs = np.random.RandomState(10)

# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
sns.despine(left=True)

# load the iris dataset
df_iris = sns.load_dataset("iris")

# Plot a simple histogram with binsize determined automatically

sns.distplot(df_iris['petal_length'], ax= axes[0, 0], kde = False, color="b")

# Plot a kernel density estimate and rug plot

sns.distplot(df_iris['petal_length'], ax= axes[0, 1], hist = False, color="r", rug=True)

# Plot a filled kernel density estimate

sns.distplot(df_iris['petal_length'], ax= axes[1, 0], hist = False, color="g", kde_kws={"shade": True})

# Plot a histogram and kernel density estimate
sns.distplot(df_iris['petal_length'], color="m", ax=axes[1, 1])

plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()

Fitting parametric distributions
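
Fitting a parametric distribution means estimating the parameters of a chosen family from the data and drawing the resulting PDF over the histogram. The sketch below does this by hand with `scipy.stats.norm.fit` rather than through `distplot`; the sample is synthetic, and the normal family is an assumption chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line and call plt.show() to view interactively
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # synthetic sample with known parameters

# Maximum-likelihood estimates of the mean and standard deviation.
mu, sigma = norm.fit(data)
xs = np.linspace(data.min(), data.max(), 200)

fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, alpha=0.45, color="purple")  # normalised histogram
ax.plot(xs, norm.pdf(xs, loc=mu, scale=sigma), "r")               # fitted normal PDF
```

If the fitted curve departs visibly from the histogram, the chosen family is probably a poor description of the data.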

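Demonstrating KDE plots with Scipy and Numpy

We use Scipy and Numpy to show the probability density function
First, create normally distributed samples with norm() from Scipy
Then stack them horizontally with hstack() from Numpy
Finally, apply gaussian_kde() from Scipy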

from scipy.stats import gaussian_kde, norm  # import from scipy.stats; the scipy.stats.kde path is deprecated
from numpy import linspace, hstack
import matplotlib.pyplot as plt

# two normal samples with different centres and spreads, stacked into one bimodal sample
sample1 = norm.rvs(loc=-0.1,scale=1,size=320)
sample2 = norm.rvs(loc=2.0,scale=0.6,size=130)
sample = hstack([sample1,sample2])
probDensityFun = gaussian_kde(sample)
plt.title("KDE Demonstration using Scipy and Numpy",fontsize=20)
x = linspace(-5,5,200)
plt.plot(x,probDensityFun(x),'r')
# density=True replaces the normed argument, which newer matplotlib versions no longer accept
plt.hist(sample,density=True,alpha=0.45,color='purple')
plt.show()