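Understanding Machine Learning Algorithms by Implementing Them From Scratch: Book Recommendations and Overcoming Stumbling Blocks

Date: 2019-11-11

Tags: implementation, understanding, machine learning, algorithms, books, recommendations, stumbling blocks, overcoming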


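The first part is the English original; original link: http://machinelearningmastery.com/understand-machine-learning-algorithms-by-implementing-them-from-scratch/

The second part comes from the Chinese translation of the article, reproduced from: http://www.csdn.net/article/2015-09-08/2825646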

Understand Machine Learning Algorithms By Implementing Them From Scratch (and tactics to get around

Implementing machine learning algorithms from scratch seems like a great way for a programmer to understand machine learning.

And maybe it is.

But there are some downsides to this approach too.

In this post you will discover some great resources that you can use to implement machine learning algorithms from scratch.

You will also discover some of the limitations of this seemingly perfect approach.

Have you implemented a machine learning algorithm from scratch in an effort to learn about it? Leave a comment, I’d love to hear about your experience.

Implement machine learning algorithms from scratch!
Photo by Tambako The Jaguar, some rights reserved.

Benefits of Implementing Machine Learning Algorithms From Scratch

I promote the idea of implementing machine learning algorithms from scratch.

I think you can learn a lot about how algorithms work. I also think that as a developer, it provides a bridge into learning the mathematical notations, descriptions and intuitions used in machine learning.

I’ve discussed the benefits of implementing algorithms from scratch before in the post “Benefits of Implementing Machine Learning Algorithms From Scratch”.

In the post I listed the benefits as:

the understanding you gain
the starting point it provides
the ownership of the algorithm and code it forces

Also in that post I comment on how you can short-cut the process by leveraging existing tutorials and books. There is a wealth of good resources for getting started, but there are also stumbling blocks to watch out for.

In the next section I point out three books that you can follow to implement machine learning algorithms from scratch.

I’ve helped a lot of programmers get started in machine learning over the last few years. From my experience, I list 5 of the most common stumbling blocks that I see tripping up programmers and the tactics that you can use to overcome them.

Finally, you will discover 3 quick tips for getting the most from code tutorials and going from a copy-paste programmer (if you happen to be one) to truly diving down the rabbit hole of machine learning algorithms.

Great Books You Can Use To Implement Algorithms

I have implemented a lot of algorithms from scratch, directly from research papers. It can be very difficult.

It is a much gentler start to follow someone else’s tutorial.

There are many excellent resources that you can use to get started implementing machine learning algorithms from scratch.

Perhaps the most authoritative are books that guide you through tutorials.

There are many benefits to starting with a book. For example:

Someone else has figured out the algorithm and how to turn it into code.
You can use it as a known working starting point for tinkering and experimentation.

Some great books that guide you through implementing machine learning algorithms step-by-step are:

Data Science from Scratch: First Principles with Python by Joel Grus

This truly is from scratch, working through visualization, stats, probability, working with data and then 12 or so different machine learning algorithms.

This is one of my favorite beginner machine learning books from this year.

Machine Learning: An Algorithmic Perspective by Stephen Marsland

This is the long awaited second edition to this popular book. This covers a large number of diverse machine learning algorithms with implementations.

I like that it gives a mix of mathematical description, pseudo code as well as working source code.

Machine Learning in Action by Peter Harrington

This book works through the 10 most popular machine learning algorithms providing case study problems and worked code examples in Python.

I like that there is a good effort to tie the code to the descriptions using numbering and arrows.

Did I miss a good book that provides programming tutorials for implementing machine learning algorithms from scratch?

Let me know in the comments.

5 Stumbling Blocks When Implementing Algorithms From Scratch (and how to overcome them)

Implementing machine learning algorithms from scratch using tutorials is a lot of fun.

But there can be stumbling blocks, and if you’re not careful, they may trip you up and kill your motivation.

In this section I want to point out the 5 most common stumbling blocks that I see and how to roll with them and not let them hold you up. I want you to get unstuck and plow on (or move on to another tutorial).

Some good general advice for avoiding the stumbling blocks below is to carefully check the reviews of books (or the comments on blog posts) before diving into a tutorial. You want to be sure that the code works and that you’re not wasting your time.

Another general tactic is to dive in no matter what, figure out the parts that are not working, and re-implement them yourself. This is a great hack to force understanding, but it’s probably not for the beginner and you may require a good technical reference close at hand.

Anyway, let’s dive into the 5 common stumbling blocks with machine learning from scratch tutorials:

1) The Code Does Not Work

The worst and perhaps most common stumbling block is that the code in the example does not work.

In fact, if you spend some time in the book reviews on Amazon for some texts or in the comments of big blog posts, it’s clear that this problem is more prevalent than you think.

How does this happen? A few reasons come to mind that might give you clues to applying your own fixes and carrying on:

The code never worked. This means that the book was published without being carefully edited. Not much you can do here other than perhaps getting into the mind of the author and trying to figure out what they meant. Maybe even try contacting the author or the publisher.
The language has moved on. This can happen, especially if the post is old or the book has been in print for a long time. Two good examples are the version of Ruby moving from 1.x to 2.x and Python moving from 2.x to 3.x.
The third-party libraries have moved on. This is for those cases where the implementations were not totally from scratch and some utility libraries were used, such as for plotting. This is often not that bad. You can often just update the code to use the latest version of the library and modify the arguments to meet the API changes. It may even be possible to install an older version of the library (if there are few or no dependencies that you might break in your development environment).
The dataset has moved on. This can happen if the data file is a URL and is no longer available (perhaps you can find the file elsewhere). It is much worse if the example is coded against a third-party API data source like Facebook or Twitter. These APIs can change a lot and quickly. Your best bet is to understand the most recent version of the API and adapt the code example, if possible.
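As one illustration of the "language has moved on" case, here is a minimal sketch of the kind of Python 2-era code that an older tutorial might contain, shown in comments, next to a Python 3 version. The functions `mean` and `count_labels` are hypothetical examples and are not taken from any of the books discussed here.

```python
# Python 2-era lines that fail or behave differently under Python 3:
#
#     print "mean:", sum(values) / len(values)   # print was a statement, and
#                                                # "/" truncated for two ints
#     for label, n in counts.iteritems():        # dict.iteritems() no longer
#         print label, n                         # exists in Python 3
#
# A Python 3 version of the same idea:


def mean(values):
    """Arithmetic mean; "/" is true division in Python 3."""
    return sum(values) / len(values)


def count_labels(labels):
    """Count how often each class label occurs."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts


values = [1, 2, 4]
print("mean:", mean(values))  # print is a function in Python 3

for label, n in sorted(count_labels(["a", "b", "a"]).items()):
    print(label, n)
```

Most fixes of this kind are mechanical, but the division change is worth checking by hand because the old code still runs after a naive edit and silently produces different numbers.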

A good general tactic if the code does not work is to look for the associated errata if it is a book, GitHub repository, code downloads or similar. Sometimes the problems have been fixed and are available on the book or author’s website. Some simple Googling should turn it up.

Code machine learning algorithms completely from scratch. Photo by Tambako The Jaguar, some rights reserved

2) Poor Descriptions Of Code

I think the second worst stumbling block when implementing algorithms from scratch is when the descriptions provided with the code are bad.

These types of problems are particularly bad for a beginner, because you are trying your best to stay motivated and actually learn something from the exercise. All of that goes up in smoke if the code and text do not align.

I (perhaps kindly) call them “bad descriptions” because there may be many symptoms and causes. For example:

A mismatch between code and description. This may have been caused by the code and text being prepared at different times and not being correctly edited together. It may be something small like a variable name change or it may be whole function names or functions themselves.
Missing explanations. Sometimes you are given large slabs of code that you are expected to figure out. This is frustrating, especially in a book where it’s page after page of code that would be easier to understand on the screen. If this is the case, you might be better off finding the online download for the code and working with it directly.
Terse explanations. Sometimes you get explanations of the code, but they are too brief, like “uses information gain” or whatever. Frustrating! You still may have enough to research the term, but it would be much easier if the author had included an explanation in the context and relevant to the example.
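To show what unpacking a terse note such as "uses information gain" might look like, here is a small sketch of the standard definition: the entropy of the parent set of labels minus the size-weighted entropy of the subsets produced by a split. The function names and the toy labels are assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each subset is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # subsets are as mixed as the parent

print("gain of perfect split:", information_gain(parent, perfect_split))
print("gain of useless split:", information_gain(parent, useless_split))
```

Writing the definition out like this is one way to build your own description for a line of tutorial code that the author left unexplained.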

A good general tactic is to look up descriptions of the algorithm in other resources and try to map them onto the code you are working with. Essentially, try to build your own descriptions for the code.

This just might not be an option for a beginner and you may need to move on to another resource.

3) Code is not Idiomatic

We programmers can be pedantic about the “correct” use of our languages (e.g. Python code is not Pythonic). This is a good thing; it shows good attention to detail and best practices.

When sample code is not idiomatic to the language in which it is written, it can be off-putting. Sometimes it can be so distracting that the code is unreadable.

There are many reasons that this may be the case, for example:

Port from another language. The sample code may be a port from another programming language, such as FORTRAN in Java or C in Python. To a trained eye, this can be obvious.
Author is learning the language. Sometimes the author may use a book or tutorial project to learn a language. This can manifest as inconsistency throughout the code examples. It can be frustrating and even distracting when examples are verbose and make poor use of language features and APIs.
Author has not used the language professionally. This can be more subtle to spot and can be manifest by the use of esoteric language features and APIs. This can be confusing when you have to research or decode the strange code.

If idiomatic code is deeply important to you, these stumbling blocks could be an opportunity. You could port the code from the “Java-Python” hybrid (or whatever) to a pure Pythonic implementation.

In so doing, you would gain a deeper understanding of the algorithm and more ownership over the code.
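As a small sketch of such a port, the first function below is written the way a direct translation from C might look, and the second expresses the same calculation idiomatically. Both compute the Euclidean distance between two points; the function names are hypothetical.

```python
import math


def euclidean_c_style(a, b):
    """Direct port from C: manual index variable and running accumulator."""
    total = 0.0
    i = 0
    while i < len(a):
        diff = a[i] - b[i]
        total = total + diff * diff
        i = i + 1
    return math.sqrt(total)


def euclidean_pythonic(a, b):
    """Same calculation using zip and a generator expression."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
print(euclidean_c_style(point_a, point_b))
print(euclidean_pythonic(point_a, point_b))
```

Keeping the original version around and checking that both return the same values is a cheap way to confirm that the rewrite did not change the algorithm.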

4) Code is not Connected to the Math

A good code example or tutorial will provide a bridge from the mathematical description to the code.

This is important because it allows you to travel across and start to build an intuition for the notation and the concise mathematical descriptions.

The problem is, sometimes this bridge may be broken or missing completely.

Errors in the math. This is insidious for the beginner who is already straining to build connections from the math to the code. Incorrect math can mislead or, worse, consume vast amounts of time with no payoff. Knowing that it is possible is a good start.
Terse mathematical description. Equations may be littered around the sample code, leaving it to you to figure out what they are and how they relate to the code. You have few options: you could treat it as a math-free example and refer to a different, more complete reference text, or you could put in the effort to relate the math to the code yourself. This is more likely with authors who are not familiar with the mathematical description of the algorithm and seemingly drop it in as an afterthought.
Missing mathematics. Some references are math free, by design. In this case you may need to find your own reference text and build the bridge yourself. This is probably not for beginners, but it is a skill well worth investing the time into.

A beginner might want to stick with code and ignore the math, to build confidence and momentum. Later, it will pay to invest in a high-quality reference text and start relating the code to the math.

You want to get good at relating the algebra to standard code constructs and build an intuition for the process involved. It’s an applied skill. You need to put in the work and practice.
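As a minimal sketch of relating algebra to a standard code construct, take the mean squared error, MSE = (1/n) * sum over i of (y_i - yhat_i)^2. The summation sign maps onto a loop (or `sum` over a generator), the index i maps onto iterating two sequences in step, and 1/n becomes a division by the length. The function name and the toy numbers are assumptions for illustration.

```python
def mean_squared_error(actual, predicted):
    """MSE = (1/n) * sum_i (y_i - yhat_i)^2."""
    n = len(actual)                      # n: number of observations
    squared_errors = (
        (y - y_hat) ** 2                 # (y_i - yhat_i)^2
        for y, y_hat in zip(actual, predicted)   # index i walks both lists
    )
    return sum(squared_errors) / n       # (1/n) * sum


actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print("MSE:", mean_squared_error(actual, predicted))
```

Annotating each piece of an equation next to the line that implements it, as in the comments above, is one way to practice the skill on code from a tutorial that leaves the math out.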

5) Incomplete Code Listing

We saw in 2) that you can have long listings of code with no descriptions. The problem can also be inverted, where you don’t have enough code. This is the case when the code listing is incomplete.

I am a big believer in complete code listings. I think the code listing should give you everything you need for a “complete” and working implementation, even if it is the simplest possible case.

You can build on a simple case, but you can’t run an incomplete example. You have to put in work and tie it all together.

Some reasons that this stumbling block may arise are:

Elaborate descriptions. Verbose writing can be a sign of incomplete thinking. Not always, but sometimes. If something is not well understood there may be an implicit attempt to cover it up with a wash of words. If there is no code at all, you could take it as a challenge to design the algorithm from the description and corroborate it from other descriptions and resources.
Code snipp

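Summary: Many developers do not yet have a grounding in machine learning algorithms. This article collects recommended ways for developers to learn machine learning algorithms well, starting from zero.

[Editor's note] Not every developer has a grounding in machine learning algorithms, so how can a developer start from zero and learn them well? This article summarizes recommended approaches for learning machine learning algorithms from scratch, including suitable books, ways to overcome the stumbling blocks you will face, and tips for getting more out of the process quickly.

Implementing machine learning algorithms from scratch seems like an excellent way for a developer to understand machine learning. Perhaps it is, but the approach also has some downsides.

In this article you will find some good resources that you can use to implement machine learning algorithms from scratch. You will also find some of the limitations of this seemingly perfect approach. Have you implemented a machine learning algorithm from scratch in an effort to learn about it? Leave a comment; I would love to hear about your experience.

Implement machine learning algorithms from scratch! Photo by Tambako The Jaguar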

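Benefits of Implementing Machine Learning Algorithms From Scratch

I promote the idea of implementing machine learning algorithms from scratch.

I think you can learn a lot about how algorithms work. I also think that, for a developer, it provides a bridge into the mathematical notation, descriptions and intuitions used in machine learning.

I have already discussed the benefits of implementing machine learning algorithms from scratch in the post "Benefits of Implementing Machine Learning Algorithms From Scratch".

In that post, I listed the benefits as follows:

the understanding you gain;
the starting point it provides;
ownership of the algorithm and the code.

In that post I also commented on how to shorten the learning process by using existing tutorials and books. There is a wealth of resources for beginners, but there are also stumbling blocks to watch out for.

In the next section, I point out three books that you can follow to implement machine learning algorithms from scratch.

Over the past few years I have helped many programmers get started in machine learning. From my experience, I list the five most common stumbling blocks that trip programmers up, and the tactics you can use to overcome them.

Finally, you will find 3 quick tips for getting more out of code tutorials, and for going from a copy-paste programmer (if you happen to be one) to someone who truly digs deep into machine learning algorithms.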

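Great Books for Implementing Algorithms

I have implemented many algorithms from scratch, directly from research papers. It can be very difficult.

Following someone else's tutorial is a much gentler start. There are many excellent resources that you can use to begin implementing machine learning algorithms from scratch. Perhaps the most authoritative are books that guide you through complete tutorials.

Starting from a book has many benefits. For example:

Someone else has already worked out the algorithm and turned it into code;
You can use it as a known working starting point for tinkering and experimentation.

Some excellent books that guide you step by step through implementing machine learning algorithms are:

Data Science from Scratch: First Principles with Python by Joel Grus

This book truly starts from scratch, working through visualization, statistics, probability and working with data, and then around 12 different machine learning algorithms.

It is one of my favorite machine learning books for beginners this year.

Machine Learning: An Algorithmic Perspective by Stephen Marsland

This is the long-awaited second edition of this popular book. It covers a large number of diverse machine learning algorithms, with implementations.

I like that it gives mathematical descriptions and pseudocode as well as working source code.

Machine Learning in Action by Peter Harrington

This book works through the 10 most popular machine learning algorithms, providing case study problems and worked code examples in Python.

I like the way it ties the code closely to the descriptions using numbering and arrows.

Have I missed a good book that provides programming tutorials for implementing machine learning algorithms from scratch?

If so, please let me know in the comments!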

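5 Stumbling Blocks When Implementing Machine Learning Algorithms From Scratch (and how to overcome them)

Implementing machine learning algorithms from scratch by following tutorials is a lot of fun. But there can be stumbling blocks, and if you are not careful they may trip you up and kill your motivation to learn.

In this section I want to point out the five common stumbling blocks that I see, and how to roll with them rather than let them hold you up. My aim is for you to get unstuck and press on (or move on to another tutorial).

Some good general advice for avoiding the stumbling blocks below is to check the reviews of a book (or the comments on a blog post) carefully before diving into a tutorial. You want to be sure that the code works and that you are not wasting your time.

Another general tactic is to dive in regardless, find the parts that do not work, and re-implement them yourself. This is a great way to force understanding, but it may not suit beginners, and you may need a good technical reference close at hand.

In any case, let us dig into the 5 common stumbling blocks with from-scratch machine learning tutorials: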

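1) The Code Does Not Work

The worst and most common stumbling block is that the code in the example does not work.

In fact, if you spend some time browsing book reviews on Amazon or the comments on blog posts, it becomes clear that this problem is more widespread than you might think.

How does this happen? A few possible reasons may give you clues for applying your own fixes and carrying on:

The code never worked. This means the book was published without careful editing. There is not much you can do here, other than getting into the mind of the author and trying to work out what they meant. You could also try contacting the author or the publisher.
The language has moved on. This can happen, especially if the post is old or the book has been in print for a long time. Two good examples are Ruby moving from 1.x to 2.x and Python moving from 2.x to 3.x.
The third-party libraries have moved on. This applies when the implementation was not entirely from scratch and used some utility libraries, such as a plotting library. It is usually not that bad. You can often update the code to use the latest version of the library and change the arguments to match the API changes. It may even be possible to install an older version of the library (if there are few or no dependencies that you might break in your development environment).
The dataset has moved on. This can happen if the data file is a download link that no longer works (perhaps you can find the file elsewhere). It is much worse if the example is coded against a third-party API data source such as Facebook or Twitter. These APIs can change a lot, and quickly. Your best bet is to understand the most recent version of the API and adapt the code example, if possible.

A good general tactic when the code does not work is to look for the associated errata, whether for a book, a GitHub repository, a code download or similar. Sometimes the problems have already been fixed on the website of the book or the author. Some simple Google searches should turn them up.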

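2) Poor Descriptions of Code

I think the second worst stumbling block when implementing algorithms from scratch is when the descriptions provided with the code are bad.

These problems are particularly bad for a beginner, because you are working hard to stay motivated and actually learn something from the exercise. All of that goes up in smoke if the code and the text do not agree.

I (perhaps kindly) call them "bad descriptions" because there may be many symptoms and causes. For example:

A mismatch between code and description. This may be caused by the code and text being prepared at different times and not being correctly edited together. It may be something small, like a changed variable name, or it may be whole function names or the functions themselves.
Missing explanations. Sometimes you are given large slabs of code that you are expected to figure out for yourself. This is frustrating, especially in a book with page after page of code that would be easier to understand on a screen. If so, you may be better off finding the online download of the code and working with it directly.
Terse explanations. Sometimes the code is explained, but too briefly, like "uses information gain" or something similar. Frustrating! You may have enough to research the term, but it would be much easier if the author had explained it in context and in relation to the example.

A good general tactic is to look up descriptions of the algorithm in other resources and try to map them onto the code you are working with. In essence, try to build your own descriptions of the code.

This may not be a good option for a beginner, and you may need to move on to another resource.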

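3) Code is not Idiomatic

We programmers can be pedantic about the "correct" use of our languages (e.g. Python code that is not Pythonic). This is actually a good thing; it shows proper attention to detail and best practices.

When sample code is not idiomatic in the language it is written in, it can be off-putting. Sometimes it is so distracting that the code becomes hard to read.

There are many reasons this may happen, for example:

Port from another language. The sample code may be a port from another programming language, such as FORTRAN written in Java or C written in Python. To a trained eye, this is obvious.
Author is learning the language. Sometimes the author uses a book or tutorial project to learn a language. This can show up as inconsistency across the code examples. It can be frustrating and even distracting when examples are verbose and make poor use of language features and APIs.
Author has not used the language professionally. This can be more subtle, and can show up as the use of esoteric language features and APIs. It can be confusing when you have to research or decode the strange code.

If idiomatic code is very important to you, these stumbling blocks could be an opportunity. You could port the code from the "Java-Python" hybrid (or whatever it is) to a purely Pythonic implementation.

In doing so, you would gain a deeper understanding of the algorithm and more ownership of the code.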

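4) Code is not Connected to the Math

A good code example or tutorial provides a bridge from the mathematical description to the code.

This is important because it lets you cross between code and math, and start to build an intuition for the notation and the concise mathematical descriptions.

The problem is that sometimes this bridge is broken or missing completely.

Errors in the math. This is insidious for beginners, who are already straining to build connections from the math to the code. Incorrect math can mislead you or, worse, consume a great deal of time with no payoff. Knowing that this can happen is a good start.
Terse mathematical description. Equations may be scattered around the sample code, leaving you to work out what they are and how they relate to the code. You have few options: you could treat it as a math-free example and refer to a different, more complete reference text, or you could put in the effort to relate the math to the code yourself. This is more likely when the author is not familiar with the mathematical description of the algorithm and seems to have added it as an afterthought.
Missing mathematics. Some references are math-free by design. In this case you may need to find your own reference text and build the bridge yourself. This is probably not for beginners, but it is a skill well worth investing time in.

A beginner might want to stick with the code and ignore the math, in order to build confidence and momentum. Later, it will pay to invest in a high-quality reference text and start relating the code to the math.

You want to get good at relating the algebra to standard code constructs and build an intuition for the process involved. It is an applied skill. You need to put in the work and practice.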

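5) Incomplete Code Listing

We saw in 2) that you can have long listings of code with no descriptions. The problem can also be inverted, where you do not have enough code. This is the case when the code listing is incomplete.

I am a big believer in complete code listings. I think the code listing should give you everything you need for a "complete" and working implementation, even if it is the simplest possible case.

You can build on a simple case, but you cannot run an incomplete example. You have to put in the work and tie it all together.

Some reasons that this stumbling block may arise are:

Elaborate descriptions. Verbose writing can be a sign of incomplete thinking. Not always, but sometimes. If something is not well understood, there may be an unconscious attempt to cover it up with a wash of words. If there is no code at all, you could take it as a challenge to design the algorithm from the description and corroborate it against other descriptions and resources.
Code snippets. Concepts may be described elaborately and then demonstrated with a small code snippet. This helps tie the concept closely to the snippet, but it takes a lot of work on your part to combine the pieces into a working system.
No sample output. A key aspect often missing from code examples is sample output. If it is present, it gives you a clear idea of what to expect when you run the code. Without sample output, it is pure guesswork.

In some cases, tying the code together yourself may be a fun challenge. Again, this is not for beginners, but once you have a few algorithms under your belt it may be an interesting exercise.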
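To make the idea of a complete listing with sample output concrete, here is a minimal sketch: a one-nearest-neighbor classifier small enough to read in one sitting, with its data included and the expected output recorded next to the code. The dataset `TRAIN`, the labels and the function names are made up for illustration.

```python
import math

# Tiny made-up training set: ((feature_1, feature_2), class_label)
TRAIN = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.5), "b"),
]


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train, query):
    """Return the label of the training row closest to the query point."""
    nearest = min(train, key=lambda row: euclidean(row[0], query))
    return nearest[1]


for query in [(1.2, 1.1), (5.5, 5.0)]:
    print(query, "->", predict_1nn(TRAIN, query))

# Sample output:
# (1.2, 1.1) -> a
# (5.5, 5.0) -> b
```

With the data, the code and the output all in one place, a reader can run the listing unchanged, compare against the recorded output, and only then start modifying it.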

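3 Tips to Get More From Implementing Algorithms

You could implement a fair number of algorithms. Once you have done a few, you can do more, and before you know it you will have built your own small library of algorithms that you understand very well.

In this section I want to give you 3 quick tips that you can use to get the most out of implementing machine learning algorithms.

Add advanced features. Take the code you have working and build on it. If the tutorial is good, it will list ideas for extensions. If not, you can research some yourself. List a number of candidate extensions to the algorithm and implement them one by one. This will at least force you to understand the code well enough to modify it.
Adapt to another problem. Run the algorithm on a different dataset. Fix any issues that arise. Go further and adapt the implementation to a different kind of problem. If the code example is for two-class classification, modify it to handle multi-class classification or regression.
Visualize algorithm behavior. I find that plotting the performance and behavior of an algorithm in real time is an invaluable learning tool, even today. You can start by plotting accuracy on the test and training sets at the epoch level (all algorithms are iterative to some degree). From there, you can pick algorithm-specific visualizations, such as the 2D grid of a self-organizing map, the coefficients over time in regression, and the Voronoi tessellation for a k-nearest neighbor algorithm.

I think these tips will take you further than tutorials and code examples alone.

The last point in particular will give you deeper insight into algorithm behavior, which few practitioners take the time to learn.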
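As a rough sketch of the visualization tip, the code below trains a simple perceptron and records training accuracy after every epoch, then prints a text bar per epoch so it runs without a plotting library. The dataset (the logical OR function), the learning rate, the number of epochs and all function names are assumptions chosen for illustration, and only training accuracy is tracked here.

```python
def predict(weights, features):
    """Step-function perceptron: weights[0] is the bias term."""
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1 if activation >= 0.0 else 0


def train_perceptron(data, learning_rate=0.1, epochs=5):
    """Train with the perceptron update rule; return weights and per-epoch accuracy."""
    weights = [0.0] * (len(data[0][0]) + 1)
    history = []
    for _ in range(epochs):
        for features, label in data:
            error = label - predict(weights, features)
            weights[0] += learning_rate * error
            for j, x in enumerate(features):
                weights[j + 1] += learning_rate * error * x
        # Measure accuracy on the training data at the end of the epoch.
        correct = sum(predict(weights, f) == y for f, y in data)
        history.append(correct / len(data))
    return weights, history


# Made-up toy problem: the logical OR function.
OR_DATA = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]

weights, history = train_perceptron(OR_DATA)
for epoch, accuracy in enumerate(history, start=1):
    bar = "#" * round(accuracy * 20)
    print(f"epoch {epoch:2d} {bar} {accuracy:.2f}")
```

The `history` list is the part worth keeping: the same values could be passed to a line plot in whatever plotting library you use, alongside a second list computed on a held-out test set.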