Highly Recommended: Python Tools for Data Science Beginners

Date: 2019-11-05



Full text: 4,291 characters; estimated reading time: 8 minutes

Image source: Unsplash, Florian Klauer

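Python offers tools for every stage of a data science project. Every data science project includes the following three stages:

· Data collection

· Data modeling

· Data visualization

Python provides remarkably handy tools for all three stages.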

Data Collection

1) Beautiful Soup

https://pypi.org/project/beautifulsoup4/

Digital Ocean

Data collection often means pulling data out of web pages, and Python provides a library for this named beautifulsoup.

from bs4 import BeautifulSoup
# html_doc is a string holding the HTML of the page to parse
soup = BeautifulSoup(html_doc, 'html.parser')

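The library parses the content of a web page and stores it in an organized form. For example, the title is stored separately, as are all <a> tags, which gives a clean list of the URLs contained in the page.

As an example, consider a simple web page containing a story from Alice's Adventures in Wonderland.

Screenshot of the web page

Clearly, there are several HTML elements we can extract from it:

1. The title: The Dormouse's story

2. The page text

3. The hyperlinks: Elsie, Lacie, and Tillie.

Beautiful Soup makes it easy to extract this information.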

soup.title
# <title>The Dormouse's story</title>

soup.title.string
# "The Dormouse's story"

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

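The tool is very good at extracting data from HTML and XML files, and it has therefore become an idiomatic way of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or even days of work.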
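The snippets above assume that `html_doc` already exists. The following is a minimal self-contained sketch of the same calls; the `html_doc` string here is an assumption, a shortened stand-in for the Dormouse page rather than the exact page in the screenshot.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the page shown in the screenshot (shortened).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text of the <title> element.
page_title = soup.title.string

# The href attribute of every <a> tag, in document order.
links = [a.get("href") for a in soup.find_all("a")]

# Map each link's visible text to its URL.
name_to_url = {a.get_text(): a["href"] for a in soup.find_all("a")}

print(page_title)
print(links)
```

Building a dictionary such as `name_to_url` is one convenient way to hand the scraped links to later processing steps.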

2) Wget

https://pypi.org/project/wget/

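Source: Fossmint

Downloading data, especially from the web, is one of a data scientist's essential tasks. Wget is a free utility for non-interactive download of files from the web. Because it is non-interactive, it can keep running in the background even when the user is not logged in. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. So the next time you need to download a website or all the images on a page, consider wget. Note that this description is of the GNU Wget command-line program; the wget package on PyPI linked above is a separate pure-Python downloader with a similar purpose.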

>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532
>>> filename
'razorback.mp3'
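Where the third-party wget package is not installed, roughly the same download step can be sketched with the standard library alone. This is a swapped-in technique, not how the wget package is implemented; `filename_from_url` and `download` are hypothetical helper names chosen for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="download.bin"):
    """Guess a local file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download(url, out=None):
    """Download url to a local file and return the file name used."""
    target = out or filename_from_url(url)
    # urlretrieve needs a full URL, including the scheme (http:// or https://).
    urllib.request.urlretrieve(url, target)
    return target


# Nothing is fetched until download() is called; this only derives the name.
print(filename_from_url("http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3"))
```

Calling `download(url)` would then save the file in the current directory, much like `wget.download(url)` in the session above.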

3) Data APIs

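Besides tools for fetching or downloading data, you also need actual data. This is where data APIs help. Many APIs with Python clients let you download data for free. For example, Alpha Vantage provides real-time and historical data on global equities, foreign exchange, and cryptocurrencies. Its historical data goes back as far as 20 years.

For example, we can use the Alpha Vantage API to fetch the daily value of Bitcoin and plot it: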

from alpha_vantage.cryptocurrencies import CryptoCurrencies
import matplotlib.pyplot as plt

# Replace YOUR_API_KEY with your own Alpha Vantage key
cc = CryptoCurrencies(key='YOUR_API_KEY', output_format='pandas')
data, meta_data = cc.get_digital_currency_daily(symbol='BTC', market='USD')
data['1a. open (USD)'].plot()
plt.tight_layout()
plt.title('Alpha Vantage Example - daily value for bitcoin (BTC) in US Dollars')
plt.show()

Plotted Image

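Other API examples include:

· Open Notify API: NASA and International Space Station data

· Exchange Rates API: current and historical foreign exchange rates published by the European Central Bank

A few APIs for data collection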

Data Modeling

As the referenced article notes, cleaning or balancing the data is an important step before data modeling.

1) Imbalanced-learn

http://glemaitre.github.io/imbalanced-learn/index.html

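Imbalanced-learn is used to balance datasets. A dataset is imbalanced when some classes have far more samples than others. This can seriously challenge classification algorithms, which may end up biased toward the class with more data.

For example, the library's TomekLinks under-sampler helps balance a dataset: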

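# X is the feature matrix and y the class labels, both defined beforehand.
from imblearn.under_sampling import TomekLinks
tl = TomekLinks(sampling_strategy='majority')  # under-sample only the majority class
X_tl, y_tl = tl.fit_resample(X, y)
id_tl = tl.sample_indices_  # indices of the samples that were kept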

Balancing an imbalanced dataset
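To show what a Tomek link actually is, here is a NumPy-only sketch; it is not imbalanced-learn's implementation. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor, and removing the majority-class member of each pair cleans up the class boundary. The function name `find_tomek_links` and the toy data are assumptions made for illustration.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) with i < j that form Tomek links."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; the diagonal is set to infinity so that
    # a sample is never its own nearest neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        j = int(j)
        # Mutual nearest neighbors that carry different class labels.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links


# Toy 1-D data: class 0 is the majority (4 samples), class 1 the minority (2).
X = np.array([[0.0], [0.1], [0.6], [1.0], [1.05], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1])

links = find_tomek_links(X, y)

# Drop the majority-class member of every link.
majority = 0
to_drop = [i if y[i] == majority else j for i, j in links]
keep = np.setdiff1d(np.arange(len(y)), to_drop)
X_res, y_res = X[keep], y[keep]

print(links)
print(y_res)
```

The pairwise distance matrix keeps the sketch short but grows quadratically with the number of samples, so it is only suitable for small datasets.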

2) Scipy Ecosystem — NumPy

https://www.numpy.org/

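Image from Ty Shaikh

The actual data processing and modeling is done with Python's SciPy stack, a collection of software designed specifically for scientific computing in Python. The SciPy ecosystem contains many useful libraries, but NumPy is arguably the most powerful tool among them.

NumPy stands for Numerical Python and is the most fundamental package on which the scientific computing stack is built. It provides many useful functions for matrix operations. Anyone who has used MATLAB will immediately notice that NumPy is not only as powerful as MATLAB but also very similar in operation.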
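As a small illustration of the matrix operations mentioned above, here is a minimal sketch; the matrix `A` and vector `b` are made-up values.

```python
import numpy as np

# A small 2x2 system with made-up values.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = A @ A             # matrix product (like A * A in MATLAB)
elementwise = A * A         # element-wise product (like A .* A in MATLAB)
transposed = A.T            # transpose
x = np.linalg.solve(A, b)   # solve the linear system A x = b

print(product)
print(x)
```

Note that `*` is element-wise in NumPy, whereas in MATLAB it is the matrix product; this is one of the few places where habits from MATLAB do not carry over directly.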

3) Pandas

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

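Pandas provides data structures for handling and manipulating data. The two-dimensional structure called a DataFrame is the most popular one.

Pandas is well suited to data wrangling and is designed for quick and easy data manipulation, aggregation, and visualization.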

Example of a DataFrame — Shanelynn
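A minimal sketch of the manipulation and aggregation mentioned above; the column names and figures are toy values invented for the example.

```python
import pandas as pd

# Toy data: every value here is invented for illustration.
df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork"],
    "year": [2018, 2019, 2018, 2019],
    "sales": [100, 120, 80, 95],
})

# Manipulation: keep only the rows for one year.
recent = df[df["year"] == 2019]

# Aggregation: total sales per city.
total_by_city = df.groupby("city")["sales"].sum()

print(recent)
print(total_by_city)
```

A call such as `total_by_city.plot(kind="bar")` would then draw the aggregate through Matplotlib, which is how pandas covers the visualization part.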

Data Visualization

1) Matplotlib

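Matplotlib is another package from the SciPy ecosystem, and it makes it easy to produce simple yet powerful visualizations. It is a 2D plotting library that produces publication-quality figures in a variety of hardcopy formats.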
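As a minimal sketch of the hardcopy formats mentioned above, the same figure can be written to several file types with `savefig`. The data, the file name `squares`, and the temporary output directory are assumptions for the example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Saving a figure to disk")

# Write the same figure as a raster image (PNG) and as vector files (PDF, SVG).
out_dir = tempfile.mkdtemp()
saved = []
for ext in ("png", "pdf", "svg"):
    path = os.path.join(out_dir, "squares." + ext)
    fig.savefig(path)  # the format is inferred from the file extension
    saved.append(path)
plt.close(fig)

print(saved)
```

Vector formats such as PDF and SVG scale without loss, which is usually what a publication workflow wants.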

Below is an example that produces Matplotlib output:

import numpy as np
import matplotlib.pyplot as plt

N = 5
menMeans = (20, 35, 30, 35, 27)
womenMeans = (25, 32, 34, 20, 25)
menStd = (2, 3, 4, 1, 2)
womenStd = (3, 5, 2, 3, 3)
ind = np.arange(N)    # the x locations for the groups
width = 0.35       # the width of the bars: can also be len(x) sequence

p1 = plt.bar(ind, menMeans, width, yerr=menStd)
p2 = plt.bar(ind, womenMeans, width,
             bottom=menMeans, yerr=womenStd)

plt.ylabel('Scores')
plt.title('Scores by group and gender')
plt.xticks(ind, ('G1', 'G2', 'G3', 'G4', 'G5'))
plt.yticks(np.arange(0, 81, 10))
plt.legend((p1[0], p2[0]), ('Men', 'Women'))

plt.show()

Bar Plot

Other examples

Taken from Matplotlib Docs

2) Seaborn

https://seaborn.pydata.org/

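Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics, and it focuses mainly on visualizations such as heat maps.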

Seaborn docs
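A minimal sketch of a heat map with Seaborn. The data is synthetic random noise generated only for the example, and the column labels are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Synthetic data: 50 observations of 4 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))

# Correlation matrix between the 4 columns.
corr = np.corrcoef(data, rowvar=False)

labels = ["a", "b", "c", "d"]  # placeholder variable names
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm",
                 xticklabels=labels, yticklabels=labels)
ax.set_title("Correlation heat map (synthetic data)")
plt.tight_layout()
```

Because `sns.heatmap` returns a regular Matplotlib Axes, the result can be customized or saved with the usual Matplotlib calls.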

3) MoviePy

https://pypi.org/project/moviepy/

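MoviePy is a Python library for video editing: cutting, concatenation, title insertion, compositing, video processing, and the creation of custom effects. It can read and write all common audio and video formats, including GIF.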

https://zulko.github.io/moviepy/gallery.html

4）Bonus NLP Tool — FuzzyWuzzy

https://pypi.org/project/fuzzywuzzy/

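Despite its odd-sounding name, this tool is very useful for string matching. It performs quick operations such as computing a string comparison ratio, a token ratio, and so on.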

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    97
>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
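To give a feel for what these scores measure, here is a rough standard-library approximation of two of them using `difflib`. It is a sketch under stated assumptions, not FuzzyWuzzy's implementation: `simple_ratio` and `token_sort_ratio` below are hypothetical helpers, and FuzzyWuzzy's own scores can differ, particularly when its optional python-Levenshtein backend is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer between 0 and 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Like simple_ratio, but ignores word order by sorting the tokens first."""
    def sort_tokens(s):
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(sort_tokens(a), sort_tokens(b))


print(simple_ratio("this is a test", "this is a test!"))
print(token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

Sorting the tokens before comparing is what lets two strings with the same words in a different order score as identical.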