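The Spark MLlib statistics module covers column summary statistics, correlation coefficients, and hypothesis testing, each shown below.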
Column statistics: compute the maximum, minimum, mean, variance, L1 norm, and L2 norm of each column.
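import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Read the tab-separated data and convert it to RDD[Vector]
val data_path = "/home/jb-huangmeiling/sample_stat.txt"
val data = sc.textFile(data_path).map(_.split("\t")).map(fields => fields.map(_.toDouble))
val data1 = data.map(values => Vectors.dense(values))

// Compute the per-column maximum, minimum, mean, variance, L1 norm, and L2 norm
val stat1 = Statistics.colStats(data1)
stat1.max
stat1.min
stat1.mean
stat1.variance
stat1.normL1
stat1.normL2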
Result:
Data:
scala> data1.collect
res19: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.0,2.0,3.0,4.0,5.0], [6.0,7.0,1.0,5.0,9.0], [3.0,5.0,6.0,3.0,1.0], [3.0,1.0,1.0,5.0,6.0])
scala> stat1.max
res20: org.apache.spark.mllib.linalg.Vector = [6.0,7.0,6.0,5.0,9.0]
scala> stat1.min
res21: org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,3.0,1.0]
scala> stat1.mean
res22: org.apache.spark.mllib.linalg.Vector = [3.25,3.75,2.75,4.25,5.25]
scala> stat1.variance
res23: org.apache.spark.mllib.linalg.Vector = [4.25,7.583333333333333,5.583333333333333,0.9166666666666666,10.916666666666666]
scala> stat1.normL1
res24: org.apache.spark.mllib.linalg.Vector = [13.0,15.0,11.0,17.0,21.0]
scala> stat1.normL2
res25: org.apache.spark.mllib.linalg.Vector = [7.416198487095663,8.888194417315589,6.855654600401044,8.660254037844387,11.958260743101398]
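The figures above can be cross-checked without Spark. The following is a minimal plain-Scala sketch, not the MLlib implementation: the helper names (`mean`, `sampleVariance`, `normL1`, `normL2`) are made up for illustration, and the rows are copied from the `data1.collect` output. It assumes the variance is the sample variance (divisor n - 1), which is what reproduces 4.25 for the first column.

```scala
// Rows copied from the data1.collect output above.
val rows: Seq[Seq[Double]] = Seq(
  Seq(1.0, 2.0, 3.0, 4.0, 5.0),
  Seq(6.0, 7.0, 1.0, 5.0, 9.0),
  Seq(3.0, 5.0, 6.0, 3.0, 1.0),
  Seq(3.0, 1.0, 1.0, 5.0, 6.0)
)

// Transpose so that each inner sequence holds one column.
val cols: Seq[Seq[Double]] = rows.transpose

// Hypothetical helpers, written for this check only.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Sample variance: the squared deviations are divided by n - 1.
def sampleVariance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

// L1 norm: sum of absolute values. L2 norm: square root of the sum of squares.
def normL1(xs: Seq[Double]): Double = xs.map(x => math.abs(x)).sum
def normL2(xs: Seq[Double]): Double = math.sqrt(xs.map(x => x * x).sum)

val colMeans = cols.map(mean)
val colVariances = cols.map(sampleVariance)
val colNormL1 = cols.map(normL1)
val colNormL2 = cols.map(normL2)
```

For the first column (1, 6, 3, 3) this gives a mean of 3.25, a variance of 12.75 / 3 = 4.25, an L1 norm of 13, and an L2 norm of sqrt(55), in line with the `stat1` output.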
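Correlation coefficients: the Pearson correlation coefficient measures the linear correlation between two numeric variables and is usually applied to normally distributed data. Its value lies in [-1, 1]: 0 means no linear correlation, a value in [-1, 0) means negative correlation, and a value in (0, 1] means positive correlation.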
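For reference, the usual textbook definition of the Pearson coefficient for n paired observations is sketched below. The symbols x_i, y_i and the bars for sample means are notation chosen here, not taken from the post.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```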
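The Spearman correlation coefficient also measures the correlation between two variables, but it places looser requirements on their distribution than the Pearson coefficient does, and it is better suited to measuring how the variables relate in rank order. It is obtained by replacing each value with its rank within its variable and computing the Pearson coefficient on those ranks.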
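The formula shown in the original post is not recoverable, so the following is the standard textbook form rather than necessarily the one the author displayed. R(x_i) denotes the rank of x_i among the x values (tied values share their average rank); the second form holds only when there are no ties.

```latex
\rho_s = \frac{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)\bigl(R(y_i) - \bar{R}_y\bigr)}
              {\sqrt{\sum_{i=1}^{n} \bigl(R(x_i) - \bar{R}_x\bigr)^2}\,
               \sqrt{\sum_{i=1}^{n} \bigl(R(y_i) - \bar{R}_y\bigr)^2}}
\qquad\text{and, without ties,}\qquad
\rho_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)},
\quad d_i = R(x_i) - R(y_i)
```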
// Compute the Pearson and Spearman correlation matrices over the columns of data1
val corr1 = Statistics.corr(data1, "pearson")
val corr2 = Statistics.corr(data1, "spearman")

// Compute the Pearson correlation between two RDD[Double] series
val x1 = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y1 = sc.parallelize(Array(5.0, 6.0, 6.0, 6.0))
val corr3 = Statistics.corr(x1, y1, "pearson")
scala> corr1
res6: org.apache.spark.mllib.linalg.Matrix =
1.0 0.7779829610026362 -0.39346431156047523 ... (5 total)
0.7779829610026362 1.0 0.14087521363240252 ...
-0.39346431156047523 0.14087521363240252 1.0 ...
0.4644203640128242 -0.09482093118615205 -0.9945577827230707 ...
0.5750122832421579 0.19233705001984078 -0.9286374704669208 ...
scala> corr2
res7: org.apache.spark.mllib.linalg.Matrix =
1.0 0.632455532033675 -0.5000000000000001 ... (5 total)
0.632455532033675 1.0 0.10540925533894883 ...
-0.5000000000000001 0.10540925533894883 1.0 ...
0.5000000000000001 -0.10540925533894883 -1.0000000000000002 ...
0.6324555320336723 0.20000000000000429 -0.9486832980505085 ...
scala> corr3
res8: Double = 0.7745966692414775
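The correlation values above can be reproduced by hand. The sketch below is plain Scala on local sequences, not Spark's distributed implementation; `pearson`, `ranks`, and `spearman` are hypothetical helper names, and averaging the ranks of tied values is an assumption that happens to match the matrices printed above.

```scala
// Pearson: sum of products of deviations divided by the product of their norms.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.size == y.size && x.nonEmpty, "inputs must be non-empty and of equal length")
  val mx = x.sum / x.size
  val my = y.sum / y.size
  val dx = x.map(_ - mx)
  val dy = y.map(_ - my)
  val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
  cov / math.sqrt(dx.map(a => a * a).sum * dy.map(b => b * b).sum)
}

// 1-based ranks; tied values share the average of the positions they occupy.
def ranks(xs: Seq[Double]): Seq[Double] = {
  val sorted = xs.sorted
  xs.map { v =>
    val first = sorted.indexOf(v) + 1
    val last = sorted.lastIndexOf(v) + 1
    (first + last) / 2.0
  }
}

// Spearman: Pearson applied to the ranks instead of the raw values.
def spearman(x: Seq[Double], y: Seq[Double]): Double = pearson(ranks(x), ranks(y))

// Same values as x1 and y1 above, held as local sequences instead of RDDs.
val xVals = Seq(1.0, 2.0, 3.0, 4.0)
val yVals = Seq(5.0, 6.0, 6.0, 6.0)

// Third and fourth columns of the sample data.
val col3 = Seq(3.0, 1.0, 6.0, 1.0)
val col4 = Seq(4.0, 5.0, 3.0, 5.0)

val pearsonXY = pearson(xVals, yVals)
val pearson34 = pearson(col3, col4)
val spearman34 = spearman(col3, col4)
```

For `xVals` and `yVals` the sum of products of deviations is 1.5 and the two sums of squares are 5 and 0.75, so r = 1.5 / sqrt(3.75), which is approximately 0.7746 and agrees with `corr3`. The ranks of the third and fourth columns are (3, 1.5, 4, 1.5) and (2, 3.5, 1, 3.5), whose deviations are exact negatives of each other, which is why the corresponding `corr2` entry is -1 up to rounding.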
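Hypothesis testing: MLlib currently supports Pearson's chi-squared (χ²) test for goodness of fit and for independence. The input type determines which test is run: a goodness-of-fit test takes a Vector as input, and an independence test takes a Matrix.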
// Chi-squared goodness-of-fit test: v1 holds the observed counts, v2 the expected counts
val v1 = Vectors.dense(43.0, 9.0)
val v2 = Vectors.dense(44.0, 4.0)
val c1 = Statistics.chiSqTest(v1, v2)
Result:
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 5.482517482517483
pValue = 0.01920757707591003
Strong presumption against null hypothesis: observed follows the same distribution as expected..
The result reports the method (pearson), the degrees of freedom (1), the test statistic (5.48), and the p-value (0.019).
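The statistic of 5.48 can be reproduced by hand once the expected counts are rescaled so that their total equals the observed total (52 instead of 48); the Spark documentation states that `expected` is rescaled when the sums differ. The sketch below is plain Scala with hypothetical helper names, computes the statistic only (no p-value), and also shows the independence variant on a made-up 2x2 contingency table that does not appear in the post.

```scala
// Goodness of fit: sum of (observed - expected)^2 / expected,
// after rescaling the expected counts to the observed total.
def goodnessOfFitStatistic(observed: Seq[Double], expected: Seq[Double]): Double = {
  require(observed.size == expected.size, "observed and expected must have the same size")
  val scale = observed.sum / expected.sum
  observed.zip(expected).map { case (o, e) =>
    val scaled = e * scale
    (o - scaled) * (o - scaled) / scaled
  }.sum
}

// Independence: the expected count of each cell is rowSum * colSum / total.
def independenceStatistic(table: Seq[Seq[Double]]): Double = {
  val rowSums = table.map(_.sum)
  val colSums = table.transpose.map(_.sum)
  val total = rowSums.sum
  (for {
    (row, i) <- table.zipWithIndex
    (o, j) <- row.zipWithIndex
  } yield {
    val e = rowSums(i) * colSums(j) / total
    (o - e) * (o - e) / e
  }).sum
}

// Observed and expected counts from the example above.
val gofStat = goodnessOfFitStatistic(Seq(43.0, 9.0), Seq(44.0, 4.0))

// Hypothetical contingency table, for illustration only.
val hypotheticalTable = Seq(Seq(10.0, 20.0), Seq(30.0, 40.0))
val indepStat = independenceStatistic(hypotheticalTable)
```

With the rescaled expected counts 47.67 and 4.33, the two terms are about 0.457 and 5.026, which sum to 5.4825. Without rescaling the result would be about 6.27, so the rescaling step is what matches the Spark output. A goodness-of-fit test over k categories has k - 1 degrees of freedom, which gives the 1 reported above.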
---------------------
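Author: sunbow0
Source: CSDN
Original post: https://blog.csdn.net/sunbow0/article/details/45644273
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.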