Computing precision, recall, and F1 for multi-class text classification with BERT

Date: 2019-12-13


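The pretrained BERT model has achieved state-of-the-art results on many NLP tasks. For text classification, BERT can be used directly as the classifier, or the output of its last layer can be fed as word embeddings into a custom text classification model (such as text-CNN). In short, as long as you have enough compute, most problems can now be handled with BERT. Here it was applied to a real company project, a multi-class (61 classes) text classification problem, and it gave good results.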

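The task is a multi-class text classification problem with an extremely imbalanced data distribution (some classes have only a few or a dozen or so samples, while others have several thousand). Without any handling of the imbalance and without BERT, the F1 score was only 50%. Still without handling the imbalance but with BERT, the F1 score reached 65%. However, getting the F1 score out of the BERT model raised some problems.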

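TensorFlow (1.x) only provides interfaces for computing precision, recall, and F1 for binary classification, and run_classifier.py in the BERT source code trains and evaluates the model through the Estimator API. These high-level APIs greatly limit how flexibly the code can be modified. Fortunately, the TensorFlow source contains a method that computes a confusion matrix and also returns an operation. Note that it is not the same as tf.confusion_matrix(). The following segment of run_classifier.py shows where it is called: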

        elif mode == tf.estimator.ModeKeys.EVAL:

            def metric_fn(per_example_loss, label_ids, logits, num_labels):
                predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                accuracy = tf.metrics.accuracy(
                    labels=label_ids, predictions=predictions)

                # `metrics` here is a Python file we define ourselves; it is described below.

                conf_mat = metrics.get_metrics_ops(label_ids, predictions, num_labels)

                loss = tf.metrics.mean(values=per_example_loss)
                return {
                    "eval_accuracy": accuracy,
                    "eval_cm": conf_mat,
                    "eval_loss": loss,
                }

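All evaluation metrics are computed in this method, and every value in the returned dictionary must be a tuple. Take accuracy as an example: tf.metrics.accuracy returns an (accuracy, update_op) tuple, whereas tf.confusion_matrix, mentioned in the previous paragraph, returns only a confusion matrix. So here we use an internal method instead. Because its name starts with an underscore it is a private function and may change between TensorFlow versions. It is imported as follows: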

from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix

This method returns a (confusion_matrix, update_op) tuple. We create a new file, metrics.py, with the following code:

import numpy as np
import tensorflow as tf
from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
    

def get_metrics_ops(labels, predictions, num_labels):
    # Get the confusion matrix and its update_op; the confusion matrix
    # has to be converted to a tensor here.
    cm, op = _streaming_confusion_matrix(labels, predictions, num_labels)
    tf.logging.info(type(cm))
    tf.logging.info(type(op))
    
    return (tf.convert_to_tensor(cm), op)


def get_metrics(conf_mat, num_labels):
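    # conf_mat is the confusion matrix as a NumPy array (rows = true labels,
    # columns = predictions); compute macro-averaged precision, recall, and F1.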
    precisions = []
    recalls = []
    for i in range(num_labels):
        tp = conf_mat[i][i].sum()
        col_sum = conf_mat[:, i].sum()
        row_sum = conf_mat[i].sum()

        precision = tp / col_sum if col_sum > 0 else 0
        recall = tp / row_sum if row_sum > 0 else 0

        precisions.append(precision)
        recalls.append(recall)

    pre = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    # Guard against division by zero when precision and recall are both 0.
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) > 0 else 0

    return pre, rec, f1
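
To make the arithmetic concrete, here is a small NumPy-only sketch of the same macro-averaged calculation on a toy confusion matrix. The function name `macro_metrics` and the 3-class matrix `toy_cm` are made up for illustration and are not part of the original metrics.py. Rows are assumed to be true labels and columns predictions, which matches the row/column sums used above.

```python
import numpy as np


def macro_metrics(conf_mat):
    """Macro precision and recall, plus the F1 derived from those two averages."""
    conf_mat = np.asarray(conf_mat, dtype=np.float64)
    tp = np.diag(conf_mat)          # correct predictions per class
    col_sum = conf_mat.sum(axis=0)  # how often each class was predicted
    row_sum = conf_mat.sum(axis=1)  # how often each class truly occurs

    # Classes that were never predicted (or never occur) contribute 0.
    precisions = np.divide(tp, col_sum, out=np.zeros_like(tp), where=col_sum > 0)
    recalls = np.divide(tp, row_sum, out=np.zeros_like(tp), where=row_sum > 0)

    pre = precisions.mean()
    rec = recalls.mean()
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) > 0 else 0.0
    return pre, rec, f1


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = np.array([
    [5, 1, 0],
    [2, 3, 0],
    [0, 1, 8],
])
pre, rec, f1 = macro_metrics(toy_cm)
print("precision={:.4f} recall={:.4f} f1={:.4f}".format(pre, rec, f1))
```

Note that this F1 is computed from the averaged precision and recall, as in get_metrics above. Averaging the per-class F1 scores instead is another common definition of macro F1, and the two generally give slightly different numbers.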

The values of the dictionary returned in the first code segment above can be retrieved in the following part of the main function in run_classifier.py:

    if FLAGS.do_eval:
        eval_examples = processor.get_dev_examples(FLAGS.data_dir)
        num_actual_eval_examples = len(eval_examples)
        if FLAGS.use_tpu:
            # TPU requires a fixed batch size for all batches, therefore the number
            # of examples must be a multiple of the batch size, or else examples
            # will get dropped. So we pad with fake examples which are ignored
            # later on. These do NOT count towards the metric (all tf.metrics
            # support a per-instance weight, and these get a weight of 0.0).
            while len(eval_examples) % FLAGS.eval_batch_size != 0:
                eval_examples.append(PaddingInputExample())

        eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
        file_based_convert_examples_to_features(
            eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)

        tf.logging.info("***** Running evaluation *****")
        tf.logging.info("  Num examples = %d (%d actual, %d padding)",
                        len(eval_examples), num_actual_eval_examples,
                        len(eval_examples) - num_actual_eval_examples)
        tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)

        # This tells the estimator to run through the entire set.
        eval_steps = None
        # However, if running eval on the TPU, you will need to specify the
        # number of steps.
        if FLAGS.use_tpu:
            assert len(eval_examples) % FLAGS.eval_batch_size == 0
            eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)

        eval_drop_remainder = True if FLAGS.use_tpu else False
        eval_input_fn = file_based_input_fn_builder(
            input_file=eval_file,
            seq_length=FLAGS.max_seq_length,
            is_training=False,
            drop_remainder=eval_drop_remainder)

        # result is the dictionary returned by metric_fn
        result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)

        output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
        with tf.gfile.GFile(output_eval_file, "w") as writer:
            tf.logging.info("***** Eval results *****")

            # The confusion matrix is available here (now as a NumPy array);
            # call the method in metrics.py to get precision, recall, and F1.
            pre, rec, f1 = metrics.get_metrics(result["eval_cm"], len(label_list))
            tf.logging.info("eval_precision: {}".format(pre))
            tf.logging.info("eval_recall: {}".format(rec))
            tf.logging.info("eval_f1: {}".format(f1))
            tf.logging.info("eval_accuracy: {}".format(result["eval_accuracy"]))
            tf.logging.info("eval_loss: {}".format(result["eval_loss"]))

            # Needs `import numpy as np` at the top of run_classifier.py.
            np.save("conf_mat.npy", result["eval_cm"])

Once the confusion matrix has been obtained through the code above, calling get_metrics in metrics.py yields the precision, recall, and F1 values.
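
For readers unfamiliar with the (value, update_op) pattern, the sketch below mimics it in plain NumPy. It is only a conceptual stand-in, not TensorFlow code: the class `StreamingConfusionMatrix` and its method names are invented for illustration. It assumes the streaming metric keeps a running total that each batch's counts are added to, which is why the matrix returned by estimator.evaluate covers the whole evaluation set and not only the last batch.

```python
import numpy as np


class StreamingConfusionMatrix:
    """NumPy stand-in for a (value, update_op) metric pair; not TensorFlow code."""

    def __init__(self, num_labels):
        # Running total, analogous to the accumulator variable behind the metric.
        self.total = np.zeros((num_labels, num_labels), dtype=np.float64)

    def update(self, labels, predictions):
        """Plays the role of update_op: add one batch's counts to the total."""
        labels = np.asarray(labels)
        predictions = np.asarray(predictions)
        # np.add.at counts repeated (label, prediction) pairs correctly.
        np.add.at(self.total, (labels, predictions), 1)
        return self.total

    def value(self):
        """Plays the role of the value tensor: read the accumulated matrix."""
        return self.total.copy()


# Hypothetical evaluation data split into two batches.
cm = StreamingConfusionMatrix(num_labels=3)
cm.update(labels=[0, 0, 1], predictions=[0, 1, 1])
cm.update(labels=[2, 2, 1], predictions=[2, 2, 0])
print(cm.value())
```

After both updates the total holds one count per evaluated example, with rows indexed by true label and columns by prediction.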
