
Commit 0e49d61

update spark, metric.md
1 parent f0b77e2 commit 0e49d61

File tree

8 files changed: +224 -29 lines changed

src/main/resources/metrics_learning.md

Lines changed: 36 additions & 24 deletions
@@ -5,40 +5,35 @@
 * False Positive (FP) - label is negative, predicted as positive
 * False Negative (FN) - label is positive, predicted as negative

-
-|metrics|definition|description|
-|-------|----------|-----------|
-|Precision (Positive Predictive Value)|\\(PPV=\frac{TP}{TP + FP}\\)|precision|
-|Recall (True Positive Rate)|\\(TPR=\frac{TP}{P}=\frac{TP}{TP + FN}\\)|recall|
-|FPR|\\(FPR = \frac{FP}{FP+TN}\\)||
-|F-measure|\\(F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR}{\beta^2 \cdot PPV + TPR}\right) = \frac{1}{\frac{1}{1+\beta^2}\frac{1}{\text{PPV}}+\frac{\beta^2}{1+\beta^2}\frac{1}{\text{TPR}}}\\)|\\(\beta\\) encodes the preference between the two metrics: \\(\beta < 1\\) weights Precision more;<br/>\\(\beta > 1\\) weights Recall more;<br/>\\(\beta = 1\\) reduces to F1.|
-|Receiver Operating Characteristic (ROC)|\\(FPR(T)=\int^\infty_{T} P_0(x)\,dx\\) <br/> \\(TPR(T)=\int^\infty_{T} P_1(x)\,dx\\)||
-|Area Under ROC Curve|\\(AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)\\)||
-|Area Under Precision-Recall Curve|\\(AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)\\)||
-|MAE|\\(MAE = \frac{1}{n}\sum_{i=1}^n \mid y_i - \hat{y_i} \mid\\)|negative-oriented (lower is better), range \\([0, \infty)\\);<br/>compared with RMSE: easy to interpret, understand and compute|
-|RMSE|\\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{y_i}\right)^2}\\)|also written RMSD (root mean square deviation); negative-oriented (lower is better), range \\([0, \infty)\\);<br/>bounds the error magnitude more tightly and flags large errors effectively|
-|DCG|\\(DCG = rel_1+\sum_{i=2}^p \frac{rel_i}{\log_2 i}\\)|DCG is maximal, iDCG (ideal DCG), when items are sorted by monotonically decreasing relevance|
-|NDCG|\\(NDCG = \frac{DCG}{iDCG}\\)| |
-
-### MAE: mean absolute error
-
-### RMSE: root mean square error
+### Confusion Matrix
+![](https://note.youdao.com/yws/api/personal/file/7456F54FE899436D863546AAF7A20F77?method=download&shareKey=a823568a6551ae56eb90cedaf2c594a9)

 ### KS TEST
 - Based on the cumulative distribution function; tests whether data follow a given distribution, or whether two distributions are the same;
 - Measures a model's ability to separate positives from negatives, i.e. the degree of separation between the positive and negative score distributions;
+- The table below shows the per-bucket and cumulative counts of positive and negative samples:
+![](https://note.youdao.com/yws/api/personal/file/1C8823E28461422B8ACB38FD8ADAEFC7?method=download&shareKey=6bf418d12724853f1c36c9fd099a534e "ks chart")
+- The chart below shows how the two cumulative distributions evolve as the bucket threshold moves:
+![](https://note.youdao.com/yws/api/personal/file/324FAED6FCE84E9788B3C575056E7293?method=download&shareKey=97ea70dfce5e51be408470a935505275 "ks chart")
+- In bucket 7 the gap between the two cumulative distributions reaches its maximum (94% - 12% = 82%), i.e. the KS statistic is 0.82.

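To make the bucket construction concrete, here is a minimal sketch (not part of this commit) of a hypothetical `ksStatistic` helper that computes the KS statistic over equal-width score buckets, assuming scores fall in [0, 1) and both classes are present:

```scala
import org.apache.spark.rdd.RDD

// KS statistic: the maximum gap between the positive and negative
// cumulative score distributions, approximated over equal-width buckets.
def ksStatistic(scoreLabel: RDD[(Double, Double)], numBuckets: Int = 10): Double = {
  val posTotal = scoreLabel.filter(_._2 == 1.0).count().toDouble
  val negTotal = scoreLabel.filter(_._2 == 0.0).count().toDouble
  // per-bucket (positive, negative) counts
  val counts = scoreLabel
    .map { case (score, label) =>
      val bucket = math.min((score * numBuckets).toInt, numBuckets - 1)
      (bucket, if (label == 1.0) (1L, 0L) else (0L, 1L))
    }
    .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
    .collect()
    .sortBy(_._1)
  // walk the buckets, tracking the largest gap between the two CDFs
  var cumPos = 0L; var cumNeg = 0L; var ks = 0.0
  for ((_, (p, n)) <- counts) {
    cumPos += p; cumNeg += n
    ks = math.max(ks, math.abs(cumPos / posTotal - cumNeg / negTotal))
  }
  ks
}
```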

-### DCG: discounted cumulative gain
-- In information retrieval, a metric for the ranking quality of a search engine (ranking system, recommender system);
-- DCG grows when high-relevance items are ranked near the top, and shrinks as they move further down;
+### ROC: Receiver operating characteristic
+- The curve traced by the points (FPR, TPR);

+![](https://note.youdao.com/yws/api/personal/file/409EEFE4D636423B9022ED9F60488D18?method=download&shareKey=6b22beb44139d97e26106d17648efa1a "roc")

-### NDCG: Normalized DCG
-
-### AUC: Area Under ROC Curve(Receiver operating characteristic)
+### AUC: Area Under ROC Curve
+- An AUC of 1 means the classifier separates positives from negatives perfectly: a perfect classifier;
+- An AUC of 0.5 means the classifier cannot distinguish positives from negatives: a purely random classifier;
+- An AUC above 0.8 is generally considered a good classifier.

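Spark's BinaryClassificationMetrics (see the links below) computes these curve areas directly; a small usage sketch, assuming an RDD of (score, 0/1 label) pairs:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// scoreAndLabels: (predicted score, true label in {0.0, 1.0})
def reportAreas(scoreAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println(s"AUROC = ${metrics.areaUnderROC()}")  // 1.0 = perfect, 0.5 = random
  println(s"AUPRC = ${metrics.areaUnderPR()}")
  metrics.roc().take(5).foreach(println)         // the first few (FPR, TPR) points
}
```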

 ### AUPRC: Area Under Precision-Recall Curve

+### DCG: discounted cumulative gain
+- In information retrieval, a metric for the ranking quality of a search engine (ranking system, recommender system);
+- DCG grows when high-relevance items are ranked near the top, and shrinks as they move further down;
+
+### NDCG: Normalized DCG

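The DCG/NDCG definitions translate directly into a few lines of Scala; an illustrative sketch (the `dcg`/`ndcg` helpers are hypothetical), with the relevance grades \\(rel_i\\) given in ranked order:

```scala
// DCG = rel_1 + sum_{i=2..p} rel_i / log2(i), with rels in ranked order
def dcg(rels: Seq[Double]): Double =
  rels.zipWithIndex.map { case (rel, idx) =>
    if (idx == 0) rel else rel / (math.log(idx + 1) / math.log(2))
  }.sum

def ndcg(rels: Seq[Double]): Double = {
  val ideal = dcg(rels.sortBy(-_))  // iDCG: relevance sorted descending
  if (ideal == 0.0) 0.0 else dcg(rels) / ideal
}

// e.g. ndcg(Seq(3.0, 2.0, 3.0, 0.0, 1.0, 2.0))
```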

 > https://en.wikipedia.org/wiki/Mean_absolute_error
 > https://en.wikipedia.org/wiki/Root-mean-square_deviation
@@ -47,4 +42,21 @@
 > http://spark.apache.org/docs/2.2.1/mllib-evaluation-metrics.html
 > https://en.wikipedia.org/wiki/Receiver_operating_characteristic

+
+### Formulas
+
+|metrics|definition|description|
+|-------|----------|-----------|
+|Precision (Positive Predictive Value)|\\(PPV=\frac{TP}{TP + FP}\\)|precision|
+|Recall (True Positive Rate)|\\(TPR=\frac{TP}{P}=\frac{TP}{TP + FN}\\)|recall|
+|FPR|\\(FPR = \frac{FP}{FP+TN}\\)||
+|F-measure|\\(F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR}{\beta^2 \cdot PPV + TPR}\right) = \frac{1}{\frac{1}{1+\beta^2}\frac{1}{\text{PPV}}+\frac{\beta^2}{1+\beta^2}\frac{1}{\text{TPR}}}\\)|\\(\beta\\) encodes the preference between the two metrics: \\(\beta < 1\\) weights Precision more;<br/>\\(\beta > 1\\) weights Recall more;<br/>\\(\beta = 1\\) reduces to F1.|
+|Receiver Operating Characteristic (ROC)|\\(FPR(T)=\int^\infty_{T} P_0(x)\,dx\\) <br/> \\(TPR(T)=\int^\infty_{T} P_1(x)\,dx\\)||
+|Area Under ROC Curve|\\(AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)\\)||
+|Area Under Precision-Recall Curve|\\(AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)\\)||
+|MAE (mean absolute error)|\\(MAE = \frac{1}{n}\sum_{i=1}^n \mid y_i - \hat{y_i} \mid\\)|negative-oriented (lower is better), range \\([0, \infty)\\);<br/>compared with RMSE: easy to interpret, understand and compute|
+|RMSE (root mean square error)|\\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{y_i}\right)^2}\\)|also written RMSD (root mean square deviation); negative-oriented (lower is better), range \\([0, \infty)\\);<br/>bounds the error magnitude more tightly and flags large errors effectively|
+|DCG|\\(DCG = rel_1+\sum_{i=2}^p \frac{rel_i}{\log_2 i}\\)|DCG is maximal, iDCG (ideal DCG), when items are sorted by monotonically decreasing relevance|
+|NDCG|\\(NDCG = \frac{DCG}{iDCG}\\)| |
+

 <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
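As a worked example of the formulas in the table (illustrative, not part of the original file; the `fMeasure` helper is hypothetical), precision, recall and \\(F(\beta)\\) from raw confusion counts:

```scala
// F(beta) from TP/FP/FN; beta < 1 favours precision, beta > 1 favours recall
def fMeasure(tp: Long, fp: Long, fn: Long, beta: Double = 1.0): Double = {
  val precision = tp.toDouble / (tp + fp)
  val recall    = tp.toDouble / (tp + fn)
  val b2 = beta * beta
  (1 + b2) * precision * recall / (b2 * precision + recall)
}

// fMeasure(80, 20, 20)       // precision = recall = 0.8, so F1 = 0.8
// fMeasure(80, 20, 20, 0.5)  // weights precision more heavily
```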

src/main/scala/com/hyzs/spark/ml/MatrixOps.scala

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@ object MatrixOps extends App{
     IndexedRow(index, vector)
   }

-
  val rowNum = matrix.count()

  val first = matrix.first()
src/main/scala/com/hyzs/spark/ml/ModelEvaluation.scala

Lines changed: 38 additions & 4 deletions

@@ -1,20 +1,54 @@
 package com.hyzs.spark.ml

+import com.hyzs.spark.ml.evaluation.{BinaryConfusionMatrix, BinaryConfusionMatrixImpl, BinaryLabelCounter}
 import org.apache.spark.rdd.RDD
 import com.hyzs.spark.utils.SparkUtils._
+import com.hyzs.spark.utils.BaseUtil._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.sql._
+import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

 /**
   * Created by xk on 2018/5/9.
   */
 object ModelEvaluation {
+  val threshold = 0.5

-  // row( index, label, score)
-  def loadDataFromTable(tableName:String): RDD[(Int,Double)] = {
-    val scoreRdd = spark.table(tableName).rdd.map(row => (row.getInt(1), row.getDouble(2)))
+  // src row(index, score, label), result row(score, label)
+  def loadDataFromTable(tableName:String): RDD[Row] = {
+    val scoreRdd = spark.table(tableName).rdd.map(row => anySeqToRow(Seq(row(1), row(2))))
     scoreRdd
   }

-  val scoresRdd:RDD[(Int, Double)] = loadDataFromTable("scores")
+  // src row(score, label), result row(score, label, pred_label)
+  def getLabeledRDD(threshold:Double, rdd:RDD[Row]): RDD[Row] = {
+    val labeledRdd = rdd.map( row => {
+      val score = toDoubleDynamic(row(0))
+      // Row.fromSeq keeps each value a separate field; Row(...) would wrap the whole Seq as one field
+      if(score >= threshold) Row.fromSeq(row.toSeq :+ 1.0)
+      else Row.fromSeq(row.toSeq :+ 0.0)
+    })
+    labeledRdd
+  }
+
+  def getConfusionMatrix(threshold:Double, labeledRdd:RDD[Row]): BinaryConfusionMatrix = {
+    val posNum = labeledRdd.filter(row => row.getDouble(1) == 1.0).count()
+    val negNum = labeledRdd.filter(row => row.getDouble(1) == 0.0).count()
+    val truePosNum = labeledRdd.filter(row => row.getDouble(1) == 1.0 && row.getDouble(2) == 1.0).count()
+    val falsePosNum = labeledRdd.filter(row => row.getDouble(1) == 0.0 && row.getDouble(2) == 1.0).count()
+    val posCount = new BinaryLabelCounter(truePosNum, falsePosNum)
+    val totalCount = new BinaryLabelCounter(posNum, negNum)
+    val confusion = BinaryConfusionMatrixImpl(posCount, totalCount)
+    confusion
+  }
+
+  val scoresRdd:RDD[Row] = loadDataFromTable("scores")
+  val predictRdd:RDD[Row] = getLabeledRDD(threshold, scoresRdd)
+
+  val metrics = new BinaryClassificationMetrics(
+    scoresRdd.map( row => (toDoubleDynamic(row(0)), toDoubleDynamic(row(1))) )
+  )
+  val precision = metrics.precisionByThreshold


 }
+
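A possible way to consume these helpers (a sketch, assuming the `scores` table exists with (index, score, label) rows and the caller sits inside `com.hyzs.spark.ml`, where the `private[ml]` matrix types are visible); precision and recall follow directly from the confusion-matrix counts:

```scala
val cm = ModelEvaluation.getConfusionMatrix(0.5, ModelEvaluation.predictRdd)
val precision = cm.numTruePositives.toDouble / (cm.numTruePositives + cm.numFalsePositives)
val recall    = cm.numTruePositives.toDouble / cm.numPositives  // numPositives = TP + FN
println(s"precision = $precision, recall = $recall")
```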
src/main/scala/com/hyzs/spark/ml/evaluation/BinaryConfusionMatrix.scala

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.hyzs.spark.ml.evaluation
+
+
+/**
+  * Created by xk on 2018/5/10.
+  */
+private[ml] trait BinaryConfusionMatrix {
+
+  /** number of true positives */
+  def numTruePositives: Long
+
+  /** number of false positives */
+  def numFalsePositives: Long
+
+  /** number of false negatives */
+  def numFalseNegatives: Long
+
+  /** number of true negatives */
+  def numTrueNegatives: Long
+
+  /** number of positives */
+  def numPositives: Long = numTruePositives + numFalseNegatives
+
+  /** number of negatives */
+  def numNegatives: Long = numFalsePositives + numTrueNegatives
+}
+
+/**
+  * Implementation of [[BinaryConfusionMatrix]], adapted from
+  * org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrixImpl.
+  *
+  * @param count label counter for labels with scores greater than or equal to the current score
+  * @param totalCount label counter for all labels
+  */
+private[ml] case class BinaryConfusionMatrixImpl( count: BinaryLabelCounter,
+                                                  totalCount: BinaryLabelCounter) extends BinaryConfusionMatrix {
+
+  /** number of true positives */
+  override def numTruePositives: Long = count.numPositives
+
+  /** number of false positives */
+  override def numFalsePositives: Long = count.numNegatives
+
+  /** number of false negatives */
+  override def numFalseNegatives: Long = totalCount.numPositives - count.numPositives
+
+  /** number of true negatives */
+  override def numTrueNegatives: Long = totalCount.numNegatives - count.numNegatives
+
+  /** number of positives */
+  override def numPositives: Long = totalCount.numPositives
+
+  /** number of negatives */
+  override def numNegatives: Long = totalCount.numNegatives
+}
src/main/scala/com/hyzs/spark/ml/evaluation/BinaryLabelCounter.scala

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
+package com.hyzs.spark.ml.evaluation
+
+/**
+  * Created by xk on 2018/5/10.
+  */
+/**
+  * A counter for positives and negatives.
+  *
+  * @param numPositives number of positive labels
+  * @param numNegatives number of negative labels
+  */
+private[ml] class BinaryLabelCounter( var numPositives: Long = 0L,
+                                      var numNegatives: Long = 0L) extends Serializable {
+
+  /** Processes a label. */
+  def +=(label: Double): BinaryLabelCounter = {
+    // Though we assume 1.0 for positive and 0.0 for negative, the following check will handle
+    // -1.0 for negative as well.
+    if (label >= 0.5) numPositives += 1L else numNegatives += 1L
+    this
+  }
+
+  /** Merges another counter. */
+  def +=(other: BinaryLabelCounter): BinaryLabelCounter = {
+    numPositives += other.numPositives
+    numNegatives += other.numNegatives
+    this
+  }
+
+  override def clone: BinaryLabelCounter = {
+    new BinaryLabelCounter(numPositives, numNegatives)
+  }
+
+  override def toString: String = s"{numPos: $numPositives, numNeg: $numNegatives}"
+}
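Because the counter is Serializable and `+=` returns `this`, it composes with `RDD.aggregate`; a usage sketch (the `countLabels` helper is hypothetical and would need to live where the `private[ml]` class is visible, i.e. under `com.hyzs.spark.ml`):

```scala
import org.apache.spark.rdd.RDD

// labels: RDD[Double] with values in {0.0, 1.0} (or -1.0 for negative)
def countLabels(labels: RDD[Double]): BinaryLabelCounter =
  labels.aggregate(new BinaryLabelCounter())(
    (counter, label) => counter += label,  // fold one label into a partition-local counter
    (c1, c2) => c1 += c2                   // merge partition counters
  )
```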

src/main/scala/com/hyzs/spark/utils/BaseUtil.scala

Lines changed: 20 additions & 0 deletions
@@ -1,6 +1,9 @@
 package com.hyzs.spark.utils

 import java.text.SimpleDateFormat
+
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.sql.Row
 import scala.util.{Failure, Success, Try}
 /**
   * Created by Administrator on 2018/2/5.
@@ -18,7 +21,24 @@ object BaseUtil {
     }
   }

+  def toDoubleDynamic(x: Any): Double = x match {
+    case s: String => s.toDouble
+    case num: java.lang.Number => num.doubleValue()
+    case _ => throw new ClassCastException("cannot cast to double")
+  }

+  def anySeqToSparkVector[T](x: Any): Vector = x match {
+    case a: Array[T] => Vectors.dense(a.map(toDoubleDynamic))
+    case s: Seq[Any] => Vectors.dense(s.toArray.map(toDoubleDynamic))
+    case v: Vector => v
+    case _ => throw new ClassCastException("unsupported class")
+  }

+  // Row.fromSeq keeps each converted value a separate field; Row(...) would nest the whole Seq as one field
+  def anySeqToRow[T](x: Any): Row = x match {
+    case a: Array[T] => Row.fromSeq(a.map(toDoubleDynamic))
+    case s: Seq[Any] => Row.fromSeq(s.map(toDoubleDynamic))
+    case r: Row => Row.fromSeq(r.toSeq.map(toDoubleDynamic))
+    case _ => throw new ClassCastException("unsupported class")
+  }

 }

src/main/scala/com/hyzs/spark/utils/SparkUtils.scala

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@ import com.fasterxml.jackson.module.scala.DefaultScalaModule
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, FileSystem, FileUtil, Path}
 import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Dataset, Row, SparkSession}
 import org.apache.spark.{SparkConf, SparkContext}
@@ -154,4 +155,6 @@ object SparkUtils {
     SizeEstimator.estimate(rdd)
   }

+
+
 }

src/test/scala/ScalaTest.scala

Lines changed: 20 additions & 0 deletions
@@ -4,6 +4,8 @@ import scala.annotation.tailrec
 import java.io._

 import com.hyzs.spark.utils.BaseUtil
+import com.hyzs.spark.utils.BaseUtil._
+import org.apache.spark.sql.Row

 import scala.io.Source
 import scala.util.Random
@@ -87,4 +89,22 @@ class ScalaTest extends FunSuite{
     writer.close()
   }

+  test("evaluation test file"){
+    val writer = new PrintWriter(new File("d:/evaluation_test.txt"))
+    for( i <- 0 until 100000){
+      writer.write(i + ",")
+      writer.write(Random.nextDouble + ",")
+      writer.write(Random.nextInt(2) + "\n")
+    }
+    writer.close()
+  }
+
+  test("type cast in spark"){
+    val row = Row(1, 3, 4.0, "5")
+    for(v <- row.toSeq){
+      println(toDoubleDynamic(v))
+    }
+    println(anySeqToSparkVector(Array(1, 2.3, 3)))
+    println(anySeqToSparkVector(row.toSeq))
+  }
 }
