Ref: Combining Spark with Python: A Beginner's Guide to PySpark
Ref: Predicting House Prices with Apache Spark
Although Scala has Spark MLlib, it does not have enough libraries and tools for machine learning and NLP purposes. In addition, Scala lacks data visualization options.
# a quick RDD example from the referenced PySpark beginner's guide;
# assumes `sc`, `logFile` and `parse_interaction` are already defined
from time import time
# get data from file
raw_data = sc.textFile(logFile)
# parse into key-value pairs
key_csv_data = raw_data.map(parse_interaction)
# filter normal key interactions
normal_key_interactions = key_csv_data.filter(lambda x: x[0] == "normal.")
# collect all
t0 = time()
all_normal = normal_key_interactions.collect()
tt = time() - t0
normal_count = len(all_normal)
print("Data collected in {} seconds".format(round(tt, 3)))
print("There are {} 'normal' interactions".format(normal_count))
1. Understanding the Data Set and init.
2. Creating the Spark Session, Context
3. Load The Data From a File Into a Dataframe
4. Data Exploration
4.1 Distribution of the median age of the people living in the area
4.2 Summary Statistics
5. Data Preprocessing
/* missing value */
/* outlier */
5.1 Preprocessing The Target Values [not necessary here]
6. Feature Engineering
6.1 Feature Extraction
6.2 Standardization
/* Feature Selection */
7. Building A Machine Learning Model With Spark ML
8. Evaluating the Model
8.1 Inspect the Model Coefficients
8.2 Generating Predictions
8.3 Inspect the Metrics
In this notebook we'll make use of the California Housing data set. Note, of course, that this is actually 'small' data and that using Spark in this context might be overkill; this notebook is for educational purposes only and is meant to give us an idea of how we can use PySpark to build a machine learning model.
Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
The California Housing data set appeared in a 1997 paper titled Sparse Spatial Autoregressions, written by R. Kelley Pace and Ronald Barry and published in the journal Statistics and Probability Letters. The researchers built this data set by using the 1990 California census data.
The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample a block group on average includes 1425.5 individuals living in a geographically compact area.
These spatial data contain 20,640 observations on housing prices with 9 economic variables:
Longitude: the angular distance of a geographic place east or west of the prime meridian, for each block group
Latitude: the angular distance of a geographic place north or south of the earth's equator, for each block group
Housing Median Age: the median age of the people that belong to a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values
Total Rooms: the total number of rooms in the houses per block group
Total Bedrooms: the total number of bedrooms in the houses per block group
Population: the number of inhabitants of a block group
Households: the units of houses and their occupants per block group
Median Income: the median income of the people that belong to a block group
Median House Value: the dependent variable; the median house value per block group
What's more, we also learn that all block groups with zero entries for the independent and dependent variables have been excluded from the data.
The median house value is the dependent variable and will be assigned the role of the target variable in our ML model.
# !pip install pyspark
import os
import pandas as pd
import numpy as np
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator
import seaborn as sns
import matplotlib.pyplot as plt
# Visualization
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 400)
from matplotlib import rcParams
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (18,4)})
rcParams['figure.figsize'] = 18,4
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# setting random seed for notebook reproducibility
rnd_seed = 23
np.random.seed(rnd_seed)
spark = SparkSession.builder.master("spark://node-master:7077").appName("Linear-Regression-California-Housing").getOrCreate()
spark
sc = spark.sparkContext
sc
# sqlContext = SQLContext(spark.sparkContext)
# sqlContext
HOUSING_DATA = '/dataset/cal_housing.data'
Specifying the schema when loading data into a DataFrame will give better performance than schema inference.
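For comparison only (a minimal sketch, not part of the original pipeline), this is what schema inference would look like; it triggers an extra pass over the data, which is why we define the schema explicitly below:
# let Spark infer the column types itself (slower: requires an extra pass over the data)
inferred_df = spark.read.csv(path=HOUSING_DATA, header=False, inferSchema=True)
inferred_df.printSchema()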
# define the schema, corresponding to a line in the csv data file.
schema = StructType([
StructField("long", FloatType(), nullable=True),
StructField("lat", FloatType(), nullable=True),
StructField("medage", FloatType(), nullable=True),
StructField("totrooms", FloatType(), nullable=True),
StructField("totbdrms", FloatType(), nullable=True),
StructField("pop", FloatType(), nullable=True),
StructField("houshlds", FloatType(), nullable=True),
StructField("medinc", FloatType(), nullable=True),
StructField("medhv", FloatType(), nullable=True)]
)
# Load housing data
housing_df = spark.read.csv(path=HOUSING_DATA, schema=schema).cache()
# Inspect first five rows
housing_df.take(5)
# Show first five rows
housing_df.show(5)
# show the dataframe columns
housing_df.columns
# show the schema of the dataframe
housing_df.printSchema()
# run a sample selection
housing_df.select('pop','totbdrms').show(10)
# group by housingmedianage and see the distribution
result_df = housing_df.groupBy("medage").count().sort("medage", ascending=False)
result_df.show(10)
result_df.toPandas().plot.bar(x='medage',figsize=(14, 6))
Most of the residents are either in their youth or they settled here during their senior years. Some rows show a median age below 10, which seems out of place.
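As a quick sanity check (a small sketch, not part of the original analysis), we can count and inspect the block groups with a median age below 10:
# count and peek at block groups with a suspiciously low median age
housing_df.filter(col("medage") < 10).count()
housing_df.filter(col("medage") < 10).show(5)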
Spark DataFrames include some built-in functions for statistical processing. The describe() function performs summary statistics calculations on all numeric columns and returns them as a DataFrame.
(housing_df.describe().select(
"summary",
F.round("medage", 4).alias("medage"),
F.round("totrooms", 4).alias("totrooms"),
F.round("totbdrms", 4).alias("totbdrms"),
F.round("pop", 4).alias("pop"),
F.round("houshlds", 4).alias("houshlds"),
F.round("medinc", 4).alias("medinc"),
F.round("medhv", 4).alias("medhv"))
.show())
Look at the minimum and maximum values of all the (numerical) attributes. We see that multiple attributes have a wide range of values: we will need to normalize our data set.
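To see those ranges directly, here is a small sketch using the aggregate functions already imported as F:
# compare the scales of a few attributes
housing_df.agg(F.min("pop"), F.max("pop"),
               F.min("medinc"), F.max("medinc"),
               F.min("medhv"), F.max("medhv")).show()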
With all this information that we gathered from our small exploratory data analysis, we know enough to preprocess our data to feed it to the model.
First, let's start with medianHouseValue, our dependent variable. To facilitate working with the target values, we will express the house values in units of 100,000. That means that a target such as 452600.000000 should become 4.526:
# Adjust the values of `medianHouseValue`
housing_df = housing_df.withColumn("medhv", col("medhv")/100000)
# Show the first 2 lines of `df`
housing_df.show(2)
We can clearly see that the values have been adjusted correctly when we look at the result of the show() method:
Now that we have adjusted the values in medianHouseValue, we will add the following columns to the data set: rooms per household (rmsperhh), population per household (popperhh), and bedrooms per room (bdrmsperrm).
As we're working with DataFrames, we can best use the select() method to select the columns that we're going to be working with, namely totalRooms, households, and population. Additionally, we have to indicate that we're working with columns by adding the col() function to our code. Otherwise, we won't be able to do element-wise operations like the division that we have in mind for these three variables:
housing_df.columns
# Add the new columns to `df`
housing_df = (housing_df.withColumn("rmsperhh", F.round(col("totrooms")/col("houshlds"), 2))
.withColumn("popperhh", F.round(col("pop")/col("houshlds"), 2))
.withColumn("bdrmsperrm", F.round(col("totbdrms")/col("totrooms"), 2)))
# Inspect the result
housing_df.show(5)
We can see that, for the first row, there are about 6.98 rooms per household, the households in the block group consist of about 2.5 people, and the number of bedrooms per room is quite low at 0.14:
Since we don't necessarily want to standardize our target values, we'll want to make sure to isolate them in our data set. Note also that this is the time to leave out variables that we might not want to consider in our analysis. In this case, let's leave out variables such as longitude, latitude, housingMedianAge and totalRooms.
We will use the select() method, passing the column names in the most appropriate order: the target variable medianHouseValue is put first, so that it won't be affected by the standardization.
# Re-order and select columns
housing_df = housing_df.select("medhv",
"totbdrms",
"pop",
"houshlds",
"medinc",
"rmsperhh",
"popperhh",
"bdrmsperrm")
Now that we have re-ordered the columns, we're ready to normalize the data. First, we choose the features to be normalized.
featureCols = ["totbdrms", "pop", "houshlds", "medinc", "rmsperhh", "popperhh", "bdrmsperrm"]
Use a VectorAssembler to put features into a feature vector column:
# put features into a feature vector column
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
assembled_df = assembler.transform(housing_df)
assembled_df.show(10, truncate=False)
All the features have been combined into a single dense vector.
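To double-check the assembled column, here is a quick peek at the first feature vector:
# inspect the first assembled feature vector
assembled_df.select("features").first()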
Next, we can finally scale the data using StandardScaler. The input column is features, and the output column holding the rescaled features, which will be included in scaled_df, will be named "features_scaled":
# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
# Fit the DataFrame to the scaler
scaled_df = standardScaler.fit(assembled_df).transform(assembled_df)
# Inspect the result
scaled_df.select("features", "features_scaled").show(10, truncate=False)
With all the preprocessing done, it's finally time to start building our Linear Regression model! Just like always, we first need to split the data into training and test sets. Luckily, this is no issue with the randomSplit() method:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.8,.2], seed=rnd_seed)
We pass in a list with two numbers that represent the sizes we want our training and test sets to have, and a seed, which is needed for reproducibility.
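To verify the split (a quick check, not in the original flow), we can count the rows in each set:
# roughly 80% of the rows should land in the training set
train_data.count()
test_data.count()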
Note that the argument regParam corresponds to $\alpha$, the overall regularization strength, while elasticNetParam corresponds to $\lambda$, the mixing parameter that controls the balance between the L1 and L2 penalties (0 means a pure L2 penalty, 1 a pure L1 penalty).
train_data.columns
Create an ElasticNet model:
ElasticNet is a linear regression model trained with both L1 and L2 priors as regularizers. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of L1 and L2 using the l1_ratio parameter (scikit-learn's name; in Spark ML this is elasticNetParam).
Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.
The objective function to minimize is in this case: \begin{align} \min_w \frac{1}{2n_{samples}} \lVert Xw - y \rVert^2_2 + \alpha\lambda \lVert w \rVert_1 + \frac{\alpha(1-\lambda)}{2} \lVert w \rVert^2_2 \end{align}
http://scikit-learn.org/stable/modules/linear_model.html#elastic-net
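To make the mapping to Spark ML concrete, here is a small sketch (not fitted here) of the two extreme settings of the mixing parameter:
# elasticNetParam=0.0 gives a pure L2 (Ridge) penalty, elasticNetParam=1.0 a pure L1 (Lasso) penalty
ridge_lr = LinearRegression(featuresCol='features_scaled', labelCol="medhv", regParam=0.3, elasticNetParam=0.0)
lasso_lr = LinearRegression(featuresCol='features_scaled', labelCol="medhv", regParam=0.3, elasticNetParam=1.0)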
# Initialize `lr`
lr = (LinearRegression(featuresCol='features_scaled', labelCol="medhv", predictionCol='predmedhv',
maxIter=10, regParam=0.3, elasticNetParam=0.8, standardization=False))
# Fit the data to the model
linearModel = lr.fit(train_data)
With our model in place, we can generate predictions for our test data: we use the transform() method to predict the labels for our test_data. Then, we can extract the predictions as well as the true labels from the resulting DataFrame.
# Coefficients for the model
linearModel.coefficients
featureCols
# Intercept for the model
linearModel.intercept
coeff_df = pd.DataFrame({"Feature": ["Intercept"] + featureCols, "Co-efficients": np.insert(linearModel.coefficients.toArray(), 0, linearModel.intercept)})
coeff_df = coeff_df[["Feature", "Co-efficients"]]
coeff_df
# Generate predictions
predictions = linearModel.transform(test_data)
# Extract the predictions and the "known" correct labels
predandlabels = predictions.select("predmedhv", "medhv")
predandlabels.show()
Looking at predicted values is one thing, but a better approach is to look at some metrics to get an idea of how good our model actually is.
Using the LinearRegressionModel.summary attribute:
We can use the summary attribute to pull up the rootMeanSquaredError and the r2.
# Get the RMSE
print("RMSE: {0}".format(linearModel.summary.rootMeanSquaredError))
print("MAE: {0}".format(linearModel.summary.meanAbsoluteError))
# Get the R2
print("R2: {0}".format(linearModel.summary.r2))
The RMSE measures how much error there is between two datasets comparing a predicted value and an observed or known value. The smaller an RMSE value, the closer predicted and observed values are.
The R2 ("R squared"), or coefficient of determination, is a measure of how close the data are to the fitted regression line. This score will always be between 0 and 100% (or 0 to 1 in this case), where 0% indicates that the model explains none of the variability of the response data around its mean, and 100% indicates the opposite: it explains all of the variability. In general, the higher the R-squared, the better the model fits our data.
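To make those formulas concrete, here is a small sketch that recomputes RMSE and R2 by hand from predandlabels with DataFrame aggregations:
# RMSE: square root of the mean squared difference between label and prediction
predandlabels.select(F.sqrt(F.avg(F.pow(col("medhv") - col("predmedhv"), 2))).alias("rmse")).show()
# R2: 1 - (residual sum of squares / total sum of squares)
row = predandlabels.agg(F.sum(F.pow(col("medhv") - col("predmedhv"), 2)).alias("ss_res"),
                        F.avg(col("medhv")).alias("mean_medhv")).first()
ss_tot = predandlabels.agg(F.sum(F.pow(col("medhv") - row["mean_medhv"], 2))).first()[0]
print("R2: {0}".format(1 - row["ss_res"] / ss_tot))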
Using the RegressionEvaluator from pyspark.ml package:
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='rmse')
print("RMSE: {0}".format(evaluator.evaluate(predandlabels)))
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='mae')
print("MAE: {0}".format(evaluator.evaluate(predandlabels)))
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='r2')
print("R2: {0}".format(evaluator.evaluate(predandlabels)))
Using the RegressionMetrics from pyspark.mllib package:
# RegressionMetrics comes from the older RDD-based mllib API, so we pass it an RDD of (prediction, label) rows
metrics = RegressionMetrics(predandlabels.rdd)
print("RMSE: {0}".format(metrics.rootMeanSquaredError))
print("MAE: {0}".format(metrics.meanAbsoluteError))
print("R2: {0}".format(metrics.r2))
There are definitely some improvements needed to our model! If we want to continue with this model, we can play around with the parameters that we passed to the model, or with the variables that we included in the original DataFrame.
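Since ParamGridBuilder and CrossValidator are already imported, one natural next step, sketched here under the same column names but not run above, is a small grid search over regParam and elasticNetParam:
# sketch: tune regParam and elasticNetParam with 3-fold cross-validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.3, 0.5])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(predictionCol="predmedhv", labelCol="medhv", metricName="rmse"),
                    numFolds=3)
cvModel = cv.fit(train_data)
print("Best average RMSE across folds: {0}".format(min(cvModel.avgMetrics)))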
spark.stop()