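Bayes' Rule
The task of machine learning: given training data A, determine the best hypothesis B in the hypothesis space H.
Best hypothesis: one way to define it is as the most probable hypothesis, given the data A and any knowledge about the prior probabilities of the different hypotheses in H.
Bayesian theory provides a way to compute the probability of a hypothesis from three things: the prior probability of the hypothesis, the probability of observing different data given the hypothesis, and the observed data itself.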
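Prior and Posterior Probability
P(B) denotes the initial probability that hypothesis B holds before any training data has been seen. P(B) is called the prior probability of B.
The prior probability reflects background knowledge about the chance that B is a correct hypothesis.
If no such prior knowledge is available, each candidate hypothesis can simply be assigned the same prior probability.
Similarly, P(A) denotes the prior probability of the training data A, and P(A|B) denotes the probability of A given that hypothesis B holds.
In machine learning we care about P(B|A), the probability that B holds given A, which is called the posterior probability of B.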
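Bayes' Formula
Bayes' formula provides a way to compute the posterior probability P(B|A) from the prior probabilities P(A) and P(B) together with P(A|B).
Bayes' theorem is based on the following formula:
P(B|A) = P(A|B) P(B) / P(A)
P(B|A) increases as P(B) and P(A|B) increase, and decreases as P(A) increases: the more likely A is to be observed independently of B, the less support A provides for B.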
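To make the formula concrete, here is a minimal Scala sketch that computes P(B|A) for every hypothesis in a small finite hypothesis space and picks the most probable one. The hypothesis names (h1, h2, h3) and the likelihood values are made up for illustration. P(A) is obtained with the law of total probability, and the priors are uniform, as suggested above for the case where no prior knowledge is available.

```scala
object BayesRuleSketch {
  /** Posterior P(B|A) = P(A|B) * P(B) / P(A). */
  def posterior(likelihood: Double, prior: Double, evidence: Double): Double = {
    require(evidence > 0.0, "P(A) must be positive")
    likelihood * prior / evidence
  }

  /** Posteriors for every hypothesis in a finite hypothesis space.
    * P(A) is the sum over all hypotheses B of P(A|B) * P(B).
    */
  def posteriors(priors: Map[String, Double],
                 likelihoods: Map[String, Double]): Map[String, Double] = {
    val evidence = priors.map { case (h, p) => likelihoods(h) * p }.sum
    priors.map { case (h, p) => h -> posterior(likelihoods(h), p, evidence) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical hypothesis space; no prior knowledge, so equal priors.
    val hypotheses = Seq("h1", "h2", "h3")
    val priors = hypotheses.map(h => h -> 1.0 / hypotheses.size).toMap
    // Hypothetical values of P(A|B) for the observed data A.
    val likelihoods = Map("h1" -> 0.2, "h2" -> 0.5, "h3" -> 0.1)

    val post = posteriors(priors, likelihoods)
    post.toSeq.sortBy(_._1).foreach { case (h, p) => println(f"P($h|A) = $p%.4f") }
    println(s"Most probable hypothesis: ${post.maxBy(_._2)._1}")
  }
}
```

With equal priors the posteriors are simply the likelihoods rescaled to sum to 1, so the most probable hypothesis is the one under which the data is most likely.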
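Naive Bayes
The naive Bayes algorithm assumes that the features are conditionally independent of one another given the class, and uses Bayes' formula to classify. See: https://blog.csdn.net/amds123/article/details/70173402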
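The independence assumption means the score of a class is its prior multiplied by one per-feature probability for each feature. The sketch below shows that idea in plain Scala for categorical features. It is an illustration only, not how Spark implements it: the toy data, the names `TinyNaiveBayes`, `train` and `predict`, the add-one smoothing, and the choice to ignore feature values never seen in training are all assumptions made for this example.

```scala
object TinyNaiveBayes {
  /** Log priors per class and log likelihoods per (class, feature index, value). */
  final case class Model(
      logPrior: Map[String, Double],
      logLikelihood: Map[(String, Int, String), Double])

  /** Estimates the model from (label, features) rows with additive smoothing alpha. */
  def train(data: Seq[(String, Seq[String])], alpha: Double = 1.0): Model = {
    require(data.nonEmpty, "training data must not be empty")
    val numFeatures = data.head._2.size
    val byClass = data.groupBy(_._1)
    val logPrior = byClass.map { case (c, rows) =>
      c -> math.log(rows.size.toDouble / data.size)
    }
    // Distinct values observed for each feature, used in the smoothing denominator.
    val values = (0 until numFeatures).map(i => data.map(_._2(i)).distinct)
    val logLikelihood = for {
      (c, rows) <- byClass.toSeq
      i <- 0 until numFeatures
      v <- values(i)
    } yield {
      val count = rows.count(_._2(i) == v)
      (c, i, v) -> math.log((count + alpha) / (rows.size + alpha * values(i).size))
    }
    Model(logPrior, logLikelihood.toMap)
  }

  /** Unnormalized log posterior per class: log P(c) + sum over i of log P(x_i | c). */
  def logScores(model: Model, features: Seq[String]): Map[String, Double] =
    model.logPrior.map { case (c, lp) =>
      val ll = features.zipWithIndex.map { case (v, i) =>
        // Simplification: a value never seen in training contributes nothing.
        model.logLikelihood.getOrElse((c, i, v), 0.0)
      }.sum
      c -> (lp + ll)
    }

  def predict(model: Model, features: Seq[String]): String =
    logScores(model, features).maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: (label, Seq(outlook, temperature)).
    val data = Seq(
      "play" -> Seq("sunny", "mild"),
      "play" -> Seq("overcast", "hot"),
      "play" -> Seq("overcast", "mild"),
      "stay" -> Seq("rainy", "cool"),
      "stay" -> Seq("rainy", "mild"),
      "stay" -> Seq("sunny", "cool"))
    val model = train(data)
    println(predict(model, Seq("overcast", "hot")))
    println(predict(model, Seq("rainy", "cool")))
  }
}
```

Working in log space turns the product of probabilities into a sum, which avoids numeric underflow when there are many features.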
The official Spark NaiveBayes example code is as follows:
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object NaiveBayesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesDemo").master("local")
      .config("spark.sql.warehouse.dir", "C:\\study\\sparktest")
      .getOrCreate()

    // Load the data stored in LIBSVM format as a DataFrame.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 1234L)

    // Train a NaiveBayes model.
    val model = new NaiveBayes().fit(trainingData)

    // Predict on the test set and display example rows.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute the test accuracy.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test set accuracy = $accuracy")

    spark.stop()
  }
}
The output is as follows:
18/10/24 11:50:06 INFO SparkContext: Starting job: collectAsMap at MulticlassMetrics.scala:48
+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|(692,[95,96,97,12...|[-173678.60946628...|  [1.0,0.0]|       0.0|
|  0.0|(692,[98,99,100,1...|[-178107.24302988...|  [1.0,0.0]|       0.0|
|  0.0|(692,[100,101,102...|[-100020.80519087...|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|[-183521.85526462...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|[-183004.12461660...|  [1.0,0.0]|       0.0|
|  0.0|(692,[128,129,130...|[-246722.96394714...|  [1.0,0.0]|       0.0|
|  0.0|(692,[152,153,154...|[-208696.01108598...|  [1.0,0.0]|       0.0|
|  0.0|(692,[153,154,155...|[-261509.59951302...|  [1.0,0.0]|       0.0|
|  0.0|(692,[154,155,156...|[-217654.71748256...|  [1.0,0.0]|       0.0|
|  0.0|(692,[181,182,183...|[-155287.07585335...|  [1.0,0.0]|       0.0|
|  1.0|(692,[99,100,101,...|[-145981.83877498...|  [0.0,1.0]|       1.0|
|  1.0|(692,[100,101,102...|[-147685.13694275...|  [0.0,1.0]|       1.0|
|  1.0|(692,[123,124,125...|[-139521.98499849...|  [0.0,1.0]|       1.0|
|  1.0|(692,[124,125,126...|[-129375.46702012...|  [0.0,1.0]|       1.0|
|  1.0|(692,[126,127,128...|[-145809.08230799...|  [0.0,1.0]|       1.0|
|  1.0|(692,[127,128,129...|[-132670.15737290...|  [0.0,1.0]|       1.0|
|  1.0|(692,[128,129,130...|[-100206.72054749...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-129639.09694930...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-143628.65574273...|  [0.0,1.0]|       1.0|
|  1.0|(692,[129,130,131...|[-129238.74023248...|  [0.0,1.0]|       1.0|
+-----+--------------------+--------------------+-----------+----------+
only showing top 20 rows
18/10/24 11:50:06 INFO DAGScheduler: Job 6 finished: countByValue at MulticlassMetrics.scala:42, took 0.157446 s
Test set accuracy = 1.0
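The rawPrediction column holds very large negative numbers, while the probability column shows exactly 1.0 and 0.0. One plausible reading is that rawPrediction contains unnormalized per-class log scores, which are then normalized into probabilities. The sketch below shows that normalization in plain Scala under that assumption. The two input scores are hypothetical, because the real values in the table are truncated, and `LogScoreNormalization` is a name invented for this example rather than a Spark API.

```scala
object LogScoreNormalization {
  /** Converts unnormalized log scores into probabilities that sum to 1. */
  def normalize(logScores: Array[Double]): Array[Double] = {
    val maxLog = logScores.max
    // Subtracting the maximum guarantees the largest term becomes exp(0) = 1,
    // so the sum can never underflow to zero.
    val shifted = logScores.map(s => math.exp(s - maxLog))
    val total = shifted.sum
    shifted.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log scores for class 0 and class 1.
    val probs = normalize(Array(-173678.6, -174500.2))
    println(probs.mkString("[", ",", "]")) // prints [1.0,0.0]
  }
}
```

When two log scores differ by several hundred, the smaller one underflows to 0.0 after exponentiation, which would explain why every row in the table shows a probability of exactly 1.0 for the predicted class.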