A Roundup of Getting-Started Resources for Machine Learning

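Machine learning is a multidisciplinary field that has risen over the past twenty-odd years, drawing on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines. Machine learning theory is mainly concerned with designing and analyzing algorithms that let computers "learn" automatically. Machine learning algorithms are a class of algorithms that automatically extract regularities from data and use those regularities to make predictions on unseen data. Because learning algorithms rely heavily on statistical theory, machine learning is especially closely tied to statistical inference, and is also referred to as statistical learning theory. On the algorithm-design side, machine learning theory focuses on learning algorithms that are practical to implement and effective.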
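As a minimal sketch of "extract a regularity from data, then predict on unseen data", the snippet below fits a straight line by ordinary least squares and applies it to a new input. The function names (`fit_line`, `predict`) and the training data are made up for illustration; real problems involve noise, more variables, and a held-out test set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to an input that was not in the training data."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data that follows y = 2x + 1 exactly.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 3, 5, 7, 9]

model = fit_line(train_x, train_y)
print(predict(model, 10))
```

The "learning" here is just estimating two numbers from examples; the same fit-then-predict shape carries over to the richer models listed later in this roundup.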


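The resources fall roughly into three categories: beginner reflections, hands-on notes, and expert guides.

  • A Study Guide for Machine Learning Beginners @Guokr (果殼網) (2013), by 白馬 -- [beginner reflections] the first-hand experience of a graduate-student beginner

  • "Anyone here doing machine learning? Could you describe how you got started?" @ourcoders -- [beginner reflections] the first-hand experience of graduate-student beginners; reyoung's advice is especially worth reading

  • tornadomeet's machine learning notes (2013) -- [hands-on notes] study notes from a top student; see how a peer mastered machine learning step by step

  • Machine Learning Roadmap: Your Self-Study Guide to Machine Learning (2014) Jason Brownlee -- [expert guide] written in English but very easy to follow. It covers Beginner, Novice, Intermediate, and Advanced readers.

  • A Tour of Machine Learning Algorithms (2013): this article on how machine learning algorithms are categorized is also very good

  • Best Machine Learning Resources for Getting Started (2013): a Chinese translation is available under the title 機器學習的最佳入門學習資源

  • A few suggestions from the maintainer of this list

    + You need both a math foundation and programming practice

    + Don't be afraid of English-language material. Most of what you don't understand will be technical terms, and whether you write papers or read documentation later on, it will mostly be in English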

More guides

  • How should one get started with machine learning? @Zhihu (2014)

  • What's the easiest way to learn machine learning @quora (2013)

  • What is the best way to study machine learning @quora (2012)

  • Is there any roadmap for learning Machine Learning (ML) and its related courses at CMU (2014)


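Course resources

The courses by Tom Mitchell and Andrew Ng are both well suited to beginners.

Introductory courses

2011, Tom Mitchell (CMU), Machine Learning

Original English videos and lecture-slide PDFs are available. His book Machine Learning has been adopted as the textbook for many courses, and there is a Chinese edition.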

  • Decision Trees

  • Probability and Estimation

  • Naive Bayes

  • Logistic Regression

  • Linear Regression

  • Practical Issues: Feature selection, Overfitting ...

  • Graphical models: Bayes networks, EM, Mixture of Gaussians clustering ...

  • Computational Learning Theory: PAC Learning, Mistake bounds ...

  • Semi-Supervised Learning

  • Hidden Markov Models

  • Neural Networks

  • Learning Representations: PCA, Deep belief networks, ICA, CCA ...

  • Kernel Methods and SVM

  • Active Learning

  • Reinforcement Learning

The list above is a selection of the lecture titles.


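The difference between machine learning and data mining

  • Machine learning focuses on prediction, based on known properties learned from training data

  • Data mining focuses on discovering previously unknown properties in the data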


Dan Levin, What is the difference between statistics, machine learning, AI and data mining?

  • If there are up to 3 variables, it is statistics.

  • If the problem is NP-complete, it is machine learning.

  • If the problem is PSPACE-complete, it is AI.

  • If you don't know what is PSPACE-complete, it is data mining.


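A few high-level overviews of the machine learning field (see the original articles)

  • The Discipline of Machine Learning: what Tom Mitchell wrote to the university president when making the case for a machine learning department at CMU.

  • A Few Useful Things to Know about Machine Learning: big-picture lessons from Prof. Pedro Domingos. Many of the concepts may not make sense when you are just starting out, so be sure to read it again after finishing an open course.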

A few good books

  • Dr. Li Hang's 《統計學習方法》 (Statistical Learning Methods).
  1. Math foundations

    The math essential for machine learning mainly consists of multivariable calculus and linear algebra.

    1. Calculus: Single Variable | Calculus One (optional)

    2. Multivariable Calculus

    3. Linear Algebra

  2. Statistics foundations

    1. Introduction to Statistics: Descriptive Statistics

    2. Probabilistic Systems Analysis and Applied Probability | Probability (optional)

    3. Introduction to Statistics: Inference

  3. Programming foundations

    1. Programming for Everybody (Python)

    2. DataCamp: Learn R with R tutorials and coding challenges (R)

    3. Introduction to Computer Science: Build a Search Engine & a Social Network

  4. Machine learning

    1. Statistical Learning (R)

    2. Machine Learning

    3. 機器學習基石 (Machine Learning Foundations)

    4. 機器學習技法 (Machine Learning Techniques)


Below are some recent popular-math books written for lay readers, ordered from easy to more demanding. Besides conveying the beauty of mathematics, their more important role is to serve as a daily dose of motivation for studying, since all of them are fairly easy reads.

  1. 《數學之美》 (The Beauty of Mathematics), by Wu Jun (吳軍)

  2. A Mathematician's Lament, by Paul Lockhart

  3. Think Stats: Probability and Statistics for Programmers, by Allen B. Downey

  4. A History of Mathematics, by Carl B. Boyer

  5. Journey Through Genius: The Great Theorems of Mathematics, by William Dunham

  6. The Mathematical Experience, by Philip J. Davis and Reuben Hersh

  7. Proofs from THE BOOK, by Martin Aigner and Günter M. Ziegler

  8. Proofs and Refutations: The Logic of Mathematical Discovery, by Imre Lakatos


  1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries [2] are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).

First, get familiar with these four languages. Python has many open-source libraries; take a look at NumPy and SciPy, both of which fit well into web development and Hadoop. C++ makes your code run faster, and R is a very good tool for statistics. To use Hadoop well you also need to know Java and how to implement MapReduce.


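  1. Probability and Statistics: A good portion of learning algorithms are based on this theory. Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models [8], to name a few. You need to have a firm understanding of Probability and Stats to understand these models. Go nuts and study measure theory [9]. Use statistics as a model evaluation metric: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.

I recommend 《統計學習方法》 by Li Hang, who is effectively my mentor's mentor. Learn some probability theory and the standard models that book covers, such as naive Bayes, SVM, CRF, HMM, decision trees, AdaBoost, and logistic regression, and then take a brief look at how to do evaluation, for example precision, recall, and F-measure (P/R/F). It is also worth looking into hypothesis testing.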
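The P/R/F evaluation mentioned above can be sketched in a few lines. This is an illustration for a binary classifier only; the function names and the labels are made up, and returning 0.0 when a denominator is zero is one convention among several.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1); each is 0.0 when its denominator is 0."""
    c = confusion_counts(y_true, y_pred, positive)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The four counts in `confusion_counts` are exactly the cells of a 2x2 confusion matrix; precision and recall are two different ratios read off that matrix, and F1 is their harmonic mean.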


  1. Applied Math + Algorithms: For discriminative models like SVMs [10], you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], partial differential equations [15], etc. Get used to looking at summations [16].

Machine learning does, after all, demand a very strong math foundation. I suggest starting by understanding the essence of a few algorithms in depth, and SVM is a good place to begin. From there you can find out what Lagrange multipliers and convex optimization are all about.
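To make gradient descent and convexity concrete, here is a small sketch that minimizes a convex quadratic by repeatedly stepping against its gradient. The objective, the learning rate, and the step count are arbitrary choices for illustration, not values taken from any of the cited material.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


def objective(point):
    """Convex bowl with its unique minimum at (3, -1)."""
    x, y = point
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def objective_grad(point):
    """Partial derivatives of `objective` with respect to x and y."""
    x, y = point
    return [2.0 * (x - 3.0), 4.0 * (y + 1.0)]


minimum = gradient_descent(objective_grad, start=[0.0, 0.0])
print(minimum, objective(minimum))
```

Because the objective is convex, any starting point leads to the same minimum as long as the step size is small enough; with a non-convex objective the same loop may settle in a local minimum instead.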


  1. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science) [17]. You cannot process this data on a single machine, you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core, distributed computing problems, you still need to have a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].

Get familiar with distributed computing. Machine learning today generally means running large data sets across many machines; otherwise it is of limited use. Learn Hadoop, which matters a great deal when looking for a job: companies such as Baidu expect a Hadoop background.
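The map-reduce idea itself fits in a few lines. The sketch below runs the three phases (map, shuffle, reduce) in a single process to count words; it does not use Hadoop or any Hadoop API, and the function names are made up for illustration. In a real cluster the framework runs many mappers and reducers in parallel and performs the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


docs = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(docs)
print(counts)
```

The mapper never sees more than one document and the reducer never sees more than one key, which is what lets the work be split across machines.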


  1. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably just to master all of the awesome Unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely be on a Linux-based machine (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].

Get familiar with Unix tools and commands. Companies such as Baidu run on Linux, and companies that build their services on Windows are probably fairly rare by now, so you should at least know these Unix commands. I recall a Baidu interview question that asked about copying files.
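As a sketch of the scripting-language route mentioned above (Python with `re`), the snippet below normalizes records with inconsistent separators into CSV. The input format, the column names, and the helper names are all invented for illustration.

```python
import csv
import io
import re

# Made-up raw records with inconsistent separators and spacing.
RAW = """\
42 ; alice;3.5
7;bob ;  4.25

13 ;carol; 5
"""

# A ';' with any amount of whitespace on either side.
FIELD_SEP = re.compile(r"\s*;\s*")


def to_rows(text):
    """Split each non-empty line on ';', ignoring surrounding whitespace."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(FIELD_SEP.split(line))
    return rows


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "name", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(RAW)
print(to_csv(rows), end="")
```

For one-off cleanups a short pipeline of the Unix tools listed above is often quicker to write; a script like this pays off once the cleanup has to be repeated or tested.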


  1. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32], Hive [33], Mahout, etc. These projects can help you store/access your data, and they scale.

Machine learning is ultimately closely tied to big data, so keep an eye on the Hadoop sub-projects, such as HBase, Zookeeper, and Hive.


  1. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like: wavelets [42], shearlets [43], curvelets [44], contourlets [45], bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier Analysis [48] and Convolution [49], you will need to learn about this stuff too. The latter is signal processing 101 stuff though.

This item is mainly about feature extraction. Whether the problem is classification or regression, you have to deal with feature selection and extraction. He lists some basic feature-extraction tools such as wavelets, and says you also need to know Fourier analysis and convolution. I don't know this area well; the gist is that you need to understand signal processing, Fourier analysis and the like.
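For readers who have not met Fourier analysis or convolution yet, here is a bare-bones sketch of both using only the standard library. The naive O(n^2) transform is for illustration only (real code would use an FFT routine from a numerical library), and the signal is synthetic.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def convolve(signal, kernel):
    """Full discrete convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Synthetic signal: a sine wave completing 5 cycles over 64 samples.
N = 64
signal = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]

# Magnitude spectrum of the first half of the bins (the rest mirror them
# for a real-valued signal); the largest one reveals the sine's frequency.
spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum[: N // 2]]
dominant_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print("dominant frequency bin:", dominant_bin)

# Convolving with [0.5, 0.5] is a two-point moving average.
averaged = convolve([1.0, 2.0, 3.0], [0.5, 0.5])
print(averaged)
```

In a feature-extraction setting, quantities such as the dominant frequency or the energy in a few frequency bands could serve as input features in place of the raw samples.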


Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online and you should read those also [38][39][40]. Here is an awesome course I found and re-posted on github [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training, the hard part is making sense of it all.
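In the spirit of coding an SVM from scratch, the sketch below trains a linear soft-margin SVM by batch subgradient descent on the regularized hinge loss. This is a simplification chosen for brevity, not the quadratic-programming or SMO solvers that SVM libraries typically use, and it has no kernel, so there is no gamma parameter here. The hyperparameters (`reg`, `learning_rate`, `epochs`) and the toy data are assumptions made for illustration; `reg` plays roughly the inverse role of the usual cost parameter C.

```python
def train_linear_svm(points, labels, reg=0.01, learning_rate=0.1, epochs=200):
    """Batch subgradient descent on the regularized hinge loss.

    Minimizes reg/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Labels must be +1 or -1.
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point violates the margin: hinge term is active
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data.
points = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
print(w, b, predictions)
```

Only points with margin below 1 contribute to the update, which is the from-scratch view of support vectors: points far from the boundary have no influence on where the hyperplane ends up.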


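In short, getting started with machine learning has two sides. One is studying the algorithms, which takes a very strong math background (really, very strong): start from SVM and build up your understanding bit by bit. The other is learning the tools, such as distributed-computing tools and Unix.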

