10 Machine Learning Tips for Python Developers

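Sometimes, as data scientists, we forget what we are here for. We are developers first, researchers second, and perhaps mathematicians third. Our first responsibility is to deliver bug-free solutions quickly.

Being able to build models does not make us gods, and it is no excuse for writing garbage code.

I have made plenty of mistakes since I started learning machine learning, so I want to share what I consider the most commonly used skills in machine learning engineering. In my view, these are also the skills the industry lacks most right now.

Here is what I have learned.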

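Learn to write abstract classes

Once you start writing abstract classes, you will see the benefit they bring. An abstract class forces its subclasses to use the same methods and method names. When many people work on the same project and each one defines different methods, the result is needless confusion.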

import os
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""
    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""


class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""


class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""


class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""

    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""
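
To see what an abstract base class enforces, here is a minimal sketch that uses a cut-down, slightly adapted copy of the Predict class above. LengthPredictor and IncompletePredictor are hypothetical names made up for illustration, and the "model" is just a length threshold. A subclass that implements every abstract method can be instantiated; one that forgets a method fails as soon as you try to create it.

```python
from abc import ABCMeta, abstractmethod


class Predict(metaclass=ABCMeta):
    """Cut-down base predictor, adapted from the class above."""

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def predict(self, text):
        """This is used for prediction."""


class LengthPredictor(Predict):
    """Hypothetical toy predictor: the 'model' is only a length threshold."""

    def load_model(self):
        self.threshold = 5

    def predict(self, text):
        return "long" if len(text) > self.threshold else "short"


class IncompletePredictor(Predict):
    """Hypothetical subclass that forgets to implement predict()."""

    def load_model(self):
        self.threshold = 5


predictor = LengthPredictor()
predictor.load_model()
print(predictor.predict("hello world"))

try:
    IncompletePredictor()  # predict() is still abstract here
except TypeError as error:
    print("cannot instantiate:", error)
```

The error shows up when the object is created, not later when the missing method is finally called, which is what keeps a shared codebase consistent.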

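Fix the random seed

Reproducibility of experiments is very important, and the random seed is our enemy. Pin the seed down, otherwise you get different train/test splits and different weight initialisation in the neural network, and in the end inconsistent results.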

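import random
import numpy as np
import torch


def set_seed(args):
    """Seed every random number generator in use; args needs .seed and .n_gpu."""
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)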
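
The effect of a fixed seed on the train/test split can be shown with the standard library alone. The following is a minimal sketch, not the article's method: split_train_test is a hypothetical helper, and it uses a private random.Random instance in place of the global seeding shown above.

```python
import random


def split_train_test(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of items with a private, seeded RNG and split it."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))
first_train, first_test = split_train_test(data, seed=42)
second_train, second_test = split_train_test(data, seed=42)

# The same seed always reproduces the same split.
print(first_test == second_test)
```

Passing the seed as an argument also makes it easy to record it next to the experiment results.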

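Load a small amount of data first

If your data is too big and you are still writing later-stage code such as data cleaning or modelling, use nrows to avoid loading the full dataset every time. Use this when you only want to test the code, not run the whole program.

This works well when your local machine is not powerful enough for the full data size but you like developing in Jupyter, VS Code or Atom.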

import pandas as pd
df_train = pd.read_csv('train.csv', nrows=1000)  # read only the first 1000 rows

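Anticipate failures (the hallmark of a mature developer)

Always check for NA (missing values) in your data, because they can cause problems later. Even if your current data has none, that does not mean they will not show up in a future training loop. So keep checking anyway.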

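print(len(df))          # rows before cleaning
print(df.isna().sum())  # missing values per column
df = df.dropna()        # dropna() returns a new DataFrame, so reassign it
print(len(df))          # rows after dropping missing values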

Show processing progress

When processing big data, it is very useful to know how much time is left and how far along you are.

Option 1: tqdm

from tqdm import tqdm
import time

tqdm.pandas()  # registers progress_apply on pandas objects

# df is assumed to be an existing DataFrame with a numeric column 'col'
df['col'] = df['col'].progress_apply(lambda x: x**2)

text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep

mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    # recent fastprogress releases name this attribute main_bar; older ones used first_bar
    mb.main_bar.comment = 'first bar stat'
    mb.write(f'Finished loop {i}.')

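Work around slow pandas

If you have worked with pandas, you know how slow it can get, especially with operations like groupby. Rather than racking your brain for a clever speedup, just switch to modin by changing one line of code.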

import modin.pandas as pd

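Time your functions

Not all functions are created equal.

Even if all of your code runs, that does not mean you wrote great code. Some soft bugs can make your code slower, and it is necessary to find them. Use this decorator to log how long functions take.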

import time
from functools import wraps


def timing(f):
    """Decorator for timing functions
    Usage:
    @timing
    def function(a):
        pass
    """

    @wraps(f)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper
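
Here is a short usage sketch of such a decorator, assuming a hypothetical function slow_square that sleeps for a moment. One substitution to note: this version measures with time.perf_counter(), a clock intended for measuring intervals, in place of time.time().

```python
import time
from functools import wraps


def timing(f):
    """Print how long each call to f takes, then return f's result."""

    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print('function:%r took: %2.2f sec' % (f.__name__, elapsed))
        return result
    return wrapper


@timing
def slow_square(x):
    """Hypothetical example function that pretends to be slow."""
    time.sleep(0.1)
    return x * x


value = slow_square(4)  # prints the timing line and still returns the result
```

Because of @wraps, slow_square.__name__ is still 'slow_square', so the printed line names the real function and not the wrapper.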

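Don't burn money in the cloud

Nobody likes an engineer who wastes cloud resources.

Some of our experiments can run for hours. It is hard to keep track of them and shut down the cloud instance when they finish. I have made this mistake myself, and I have seen people leave instances running for days.

This typically happens when we start something on Friday, leave it running, and only realise it on Monday.

Just call this function at the end of execution and you will never get burned again!

But wrap the main function in try and except, so that if an error occurs the server is not left running. I have dealt with such cases myself.

Let's be a bit more responsible and keep our carbon footprint low.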

import math
import os


def run_command(cmd):
    return os.system(cmd)


def shutdown(seconds=0, os_name='linux'):
    """Shutdown system after seconds given. Useful for shutting EC2 to save costs."""
    if os_name == 'linux':
        # Linux shutdown takes its delay in minutes (+m), so round the seconds up.
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os_name == 'windows':
        run_command('shutdown -s -t %s' % seconds)
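
The advice about wrapping the main function can be sketched as follows. This is one possible shape under stated assumptions, not the author's code: run_then_shutdown and failing_experiment are hypothetical names, and the sketch uses try/except/finally so the shutdown callback runs whether the experiment succeeds or crashes. In real use you would pass the shutdown function defined above; here a harmless stand-in records the call.

```python
import logging


def run_then_shutdown(main, shutdown):
    """Run main(), then always call shutdown(), even if main() raised."""
    try:
        main()
    except Exception:
        # Log the traceback so the failure can still be inspected later.
        logging.exception("main() failed; shutting down anyway")
    finally:
        shutdown()


def failing_experiment():
    """Hypothetical stand-in for a training script that crashes."""
    raise RuntimeError("out of memory")


calls = []
run_then_shutdown(failing_experiment, lambda: calls.append("shutdown"))
print(calls)
```

Passing the shutdown function in as an argument keeps the wrapper easy to test without actually powering off a machine.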

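Create and save reports

After a certain point in modelling, all the great insights come from analysing errors and metrics. Make sure you create and save well-formatted reports for yourself and your manager.

Anyway, management loves reports, right?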

import json
import os

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)


def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta is passed by keyword; recent scikit-learn versions require it
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }


def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # The article calls a classification_report_to_csv helper that it never defines;
    # as an assumption, the text report is written to the file directly instead.
    with open(path, 'w') as report_file:
        report_file.write(metrics['report_csv'])
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    with open(path, 'w') as metrics_file:
        json.dump(metrics, metrics_file, indent=4)

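Write good APIs

If it ends badly, it is all bad.

You can do great data cleaning and modelling and still create a huge mess at the end. My experience with people tells me that many are not clear about how to write good APIs, documentation and server setup. I will write another article on this soon, but let me briefly share part of it here.

The approach below works for classical machine learning and deep learning deployments under a load that is not too high (say 1000 requests/min).

Meet this combination: FastAPI + uvicorn + gunicorn

Fastest: write the API with fastapi, because it is the fastest; see this article for the reasons.
Documentation: writing the API in fastapi gives us free documentation and a test endpoint at http:url/docs, which fastapi generates and updates automatically whenever we change the code.
Workers: deploy the API with the gunicorn server, because gunicorn can start more than one worker, and you should keep at least 2 workers.

Run these commands to deploy with 4 workers. The number of workers can be tuned through load testing.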

pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app
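
As a starting point for that load testing, the gunicorn documentation suggests roughly (2 x number of cores) + 1 workers. Below is a minimal sketch of that rule of thumb. suggested_workers is a hypothetical helper name, not part of gunicorn, and the result is only an initial guess to refine under real load.

```python
import os


def suggested_workers(cpu_count=None):
    """Return the (2 x cores) + 1 starting point for the worker count."""
    cores = cpu_count or os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return 2 * cores + 1


# Build the deployment command with the suggested worker count.
command = "gunicorn -w %d -k uvicorn.workers.UvicornH11Worker main:app" % suggested_workers()
print(command)
```

Since the formula never returns fewer than 3, it also satisfies the advice above to keep at least 2 workers.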

Original source: http://suo.im/5MoQTN