Data Mining: Source Code Analysis of the Apriori Association Rule Algorithm in Weka

 

  Compared with machine learning, the Apriori association rule algorithm leans more toward data mining.

 

1) Calling Weka's Apriori association rule algorithm from a test class, as follows:

import java.io.File;

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.filters.Filter;
// Assumed to be the unsupervised variant, which needs no class index:
import weka.filters.unsupervised.attribute.Discretize;

try {
    // 1. Read the data set from the ARFF file
    File file = new File("F:\\tools/lib/data/contact-lenses.arff");
    ArffLoader loader = new ArffLoader();
    loader.setFile(file);
    Instances m_instances = loader.getDataSet();

    // 2. Discretize the attributes (Apriori works on nominal attributes)
    Discretize discretize = new Discretize();
    discretize.setInputFormat(m_instances);
    m_instances = Filter.useFilter(m_instances, discretize);

    // 3. Build the Apriori association rule model
    Apriori apriori = new Apriori();
    apriori.buildAssociations(m_instances);

    // 4. Print the large itemsets and the association rules
    System.out.println(apriori.toString());
} catch (Exception e) {
    e.printStackTrace();
}

Steps:

1 Read the data set from the ARFF file and extract the instance set (Instances);

2 Discretize the attributes with the Discretize filter;

3 Build the Apriori association rule model;

4 Output the large (frequent) itemsets and the association rule set.

 

2) When the associator is constructed, the method that sets the default options is called:

  public void resetOptions() {

    m_removeMissingCols = false;   // do not remove all-missing columns
    m_verbose = false;             // no verbose output
    m_delta = 0.05;                // step for lowering the minimum support
    m_minMetric = 0.90;            // minimum metric score (confidence by default)
    m_numRules = 10;               // number of rules to find
    m_lowerBoundMinSupport = 0.1;  // lower bound for the minimum support
    m_upperBoundMinSupport = 1.0;  // upper bound for the minimum support
    m_significanceLevel = -1;      // -1 = no significance test
    m_outputItemSets = false;      // do not output the itemsets
    m_car = false;                 // mine general, not class, association rules
    m_classIndex = -1;             // -1 = last attribute is the class
  }

For a detailed explanation of these parameters, see Remark 1 at the end.
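These defaults can also be overridden before building the model. A minimal sketch, assuming Weka's standard Apriori option flags (-N for numRules, -C for minMetric, -D for delta, -M for lowerBoundMinSupport) and the corresponding setter methods:

try {
    Apriori apriori = new Apriori();
    // Option-string form (flags assumed, see above):
    apriori.setOptions(weka.core.Utils.splitOptions("-N 20 -C 0.8 -D 0.05 -M 0.1"));
    // Equivalent individual setter calls:
    apriori.setNumRules(20);
    apriori.setMinMetric(0.8);
    apriori.setDelta(0.05);
    apriori.setLowerBoundMinSupport(0.1);
} catch (Exception e) {
    e.printStackTrace();
}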

 

3) Analysis of the buildAssociations method; the source code is as follows:

public void buildAssociations(Instances instances) throws Exception {

    double[] confidences, supports;
    int[] indices;
    FastVector[] sortedRuleSet;
    int necSupport = 0;

    instances = new Instances(instances);

    if (m_removeMissingCols) {
      instances = removeMissingColumns(instances);
    }
    if (m_car && m_metricType != CONFIDENCE)
      throw new Exception("For CAR-Mining metric type has to be confidence!");

    // only set class index if CAR is requested
    if (m_car) {
      if (m_classIndex == -1) {
        instances.setClassIndex(instances.numAttributes() - 1);
      } else if (m_classIndex <= instances.numAttributes() && m_classIndex > 0) {
        instances.setClassIndex(m_classIndex - 1);
      } else {
        throw new Exception("Invalid class index.");
      }
    }

    // can associator handle the data?
    getCapabilities().testWithFail(instances);

    m_cycles = 0;

    // make sure that the lower bound is equal to at least one instance
    double lowerBoundMinSupportToUse = (m_lowerBoundMinSupport
        * instances.numInstances() < 1.0) ? 1.0 / instances.numInstances()
        : m_lowerBoundMinSupport;

    if (m_car) {
      // m_instances does not contain the class attribute
      m_instances = LabeledItemSet.divide(instances, false);

      // m_onlyClass contains only the class attribute
      m_onlyClass = LabeledItemSet.divide(instances, true);
    } else
      m_instances = instances;

    if (m_car && m_numRules == Integer.MAX_VALUE) {
      // Set desired minimum support
      m_minSupport = lowerBoundMinSupportToUse;
    } else {
      // Decrease minimum support until desired number of rules found.
      m_minSupport = m_upperBoundMinSupport - m_delta;
      m_minSupport = (m_minSupport < lowerBoundMinSupportToUse) ? lowerBoundMinSupportToUse
          : m_minSupport;
    }

    do {

      // Reserve space for variables
      m_Ls = new FastVector();
      m_hashtables = new FastVector();
      m_allTheRules = new FastVector[6];
      m_allTheRules[0] = new FastVector();
      m_allTheRules[1] = new FastVector();
      m_allTheRules[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3] = new FastVector();
        m_allTheRules[4] = new FastVector();
        m_allTheRules[5] = new FastVector();
      }
      sortedRuleSet = new FastVector[6];
      sortedRuleSet[0] = new FastVector();
      sortedRuleSet[1] = new FastVector();
      sortedRuleSet[2] = new FastVector();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        sortedRuleSet[3] = new FastVector();
        sortedRuleSet[4] = new FastVector();
        sortedRuleSet[5] = new FastVector();
      }
      if (!m_car) {
        // Find large itemsets and rules
        findLargeItemSets();
        if (m_significanceLevel != -1 || m_metricType != CONFIDENCE)
          findRulesBruteForce();
        else
          findRulesQuickly();
      } else {
        findLargeCarItemSets();
        findCarRulesQuickly();
      }

      // prune rules for upper bound min support
      if (m_upperBoundMinSupport < 1.0) {
        pruneRulesForUpperBoundSupport();
      }

      int j = m_allTheRules[2].size() - 1;
      supports = new double[m_allTheRules[2].size()];
      for (int i = 0; i < (j + 1); i++)
        supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]
            .elementAt(j - i)).support()) * (-1);
      indices = Utils.stableSort(supports);
      for (int i = 0; i < (j + 1); i++) {
        sortedRuleSet[0].addElement(m_allTheRules[0].elementAt(indices[j - i]));
        sortedRuleSet[1].addElement(m_allTheRules[1].elementAt(indices[j - i]));
        sortedRuleSet[2].addElement(m_allTheRules[2].elementAt(indices[j - i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          sortedRuleSet[3].addElement(m_allTheRules[3]
              .elementAt(indices[j - i]));
          sortedRuleSet[4].addElement(m_allTheRules[4]
              .elementAt(indices[j - i]));
          sortedRuleSet[5].addElement(m_allTheRules[5]
              .elementAt(indices[j - i]));
        }
      }

      // Sort rules according to their confidence
      m_allTheRules[0].removeAllElements();
      m_allTheRules[1].removeAllElements();
      m_allTheRules[2].removeAllElements();
      if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
        m_allTheRules[3].removeAllElements();
        m_allTheRules[4].removeAllElements();
        m_allTheRules[5].removeAllElements();
      }
      confidences = new double[sortedRuleSet[2].size()];
      int sortType = 2 + m_metricType;

      for (int i = 0; i < sortedRuleSet[2].size(); i++)
        confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i))
            .doubleValue();
      indices = Utils.stableSort(confidences);
      for (int i = sortedRuleSet[0].size() - 1; (i >= (sortedRuleSet[0].size() - m_numRules))
          && (i >= 0); i--) {
        m_allTheRules[0].addElement(sortedRuleSet[0].elementAt(indices[i]));
        m_allTheRules[1].addElement(sortedRuleSet[1].elementAt(indices[i]));
        m_allTheRules[2].addElement(sortedRuleSet[2].elementAt(indices[i]));
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
          m_allTheRules[3].addElement(sortedRuleSet[3].elementAt(indices[i]));
          m_allTheRules[4].addElement(sortedRuleSet[4].elementAt(indices[i]));
          m_allTheRules[5].addElement(sortedRuleSet[5].elementAt(indices[i]));
        }
      }

      if (m_verbose) {
        if (m_Ls.size() > 1) {
          System.out.println(toString());
        }
      }

      if (m_minSupport == lowerBoundMinSupportToUse
          || m_minSupport - m_delta > lowerBoundMinSupportToUse)
        m_minSupport -= m_delta;
      else
        m_minSupport = lowerBoundMinSupportToUse;

      necSupport = Math.round((float) ((m_minSupport * m_instances
          .numInstances()) + 0.5));

      m_cycles++;
    } while ((m_allTheRules[0].size() < m_numRules)
        && (Utils.grOrEq(m_minSupport, lowerBoundMinSupportToUse))
        /* (necSupport >= lowerBoundNumInstancesSupport) */
        /* (Utils.grOrEq(m_minSupport, m_lowerBoundMinSupport)) */&& (necSupport >= 1));
    m_minSupport += m_delta;
  }

Analysis of the main steps:

1 The removeMissingColumns method removes columns whose values are all missing;

2 If the parameter m_car is true, the data is divided: m_car = true means class association rules are mined rather than general ones, so the instances are split into two parts, m_instances without the class attribute and m_onlyClass containing only the class attribute;

3 The findLargeItemSets method finds the large (frequent) itemsets; its source code is given below;

4 The findRulesQuickly method finds all the association rules;

5 The pruneRulesForUpperBoundSupport method prunes rules whose support exceeds the upper bound on minimum support;

6 The rules are sorted by the chosen metric (confidence by default).
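To see the outer do-while loop in isolation, here is a minimal standalone sketch (not Weka code; the mining itself is stubbed out) of how m_minSupport descends from the upper bound in steps of delta until enough rules are found or the lower bound is passed:

public class SupportDescentSketch {
    public static void main(String[] args) {
        double upper = 1.0, lower = 0.1, delta = 0.05;
        int rulesWanted = 10, rulesFound = 0, cycles = 0;
        // Start just below the upper bound, clamped to the lower bound:
        double minSupport = Math.max(upper - delta, lower);
        do {
            // ... mine itemsets/rules at minSupport, update rulesFound ...
            if (minSupport == lower || minSupport - delta > lower)
                minSupport -= delta;      // regular step down
            else
                minSupport = lower;       // land exactly on the lower bound
            cycles++;
        } while (rulesFound < rulesWanted && minSupport >= lower);
        minSupport += delta;              // undo the final decrement
        System.out.println("cycles = " + cycles
            + ", final minSupport = " + minSupport);
    }
}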

 

 

4) The findLargeItemSets method, which finds the large (frequent) itemsets; source code as follows:

private void findLargeItemSets() throws Exception {

    FastVector kMinusOneSets, kSets;
    Hashtable hashtable;
    int necSupport, necMaxSupport, i = 0;

    // Find large itemsets

    // minimum support
    necSupport = (int) (m_minSupport * m_instances.numInstances() + 0.5);
    necMaxSupport = (int) (m_upperBoundMinSupport * m_instances.numInstances() + 0.5);

    kSets = AprioriItemSet.singletons(m_instances);
    AprioriItemSet.upDateCounters(kSets, m_instances);
    kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
        m_instances.numInstances());
    if (kSets.size() == 0)
      return;
    do {
      m_Ls.addElement(kSets);
      kMinusOneSets = kSets;
      kSets = AprioriItemSet.mergeAllItemSets(kMinusOneSets, i,
          m_instances.numInstances());
      hashtable = AprioriItemSet.getHashtable(kMinusOneSets,
          kMinusOneSets.size());
      m_hashtables.addElement(hashtable);
      kSets = AprioriItemSet.pruneItemSets(kSets, hashtable);
      AprioriItemSet.upDateCounters(kSets, m_instances);
      kSets = AprioriItemSet.deleteItemSets(kSets, necSupport,
          m_instances.numInstances());
      i++;
    } while (kSets.size() > 0);
  }

Main steps:

1 The AprioriItemSet.singletons method converts the header information of the given data set into a set of singleton itemsets; the values in the header are in lexicographic order.

2 The upDateCounters method scans the instances to count the support of the candidate itemsets, yielding the frequent 1-itemsets on the first pass;

3 The AprioriItemSet.deleteItemSets method deletes the itemsets that do not meet the required support;

4 The mergeAllItemSets method (source below) iteratively generates the candidate k-itemsets from the frequent (k-1)-itemsets, and deleteItemSets again removes those that fall short of the required support. A sketch of the itemset representation these methods share follows.
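Weka's AprioriItemSet stores an itemset as one int per attribute: -1 means the attribute is unconstrained, and any other value is the index of the required nominal value. A minimal standalone sketch of support counting under that representation (containedBy is an illustrative helper, not Weka's API):

public class ItemSetSketch {
    // items[a] == -1 -> attribute a is not in the itemset;
    // items[a] == v  -> attribute a must take nominal value index v.
    static boolean containedBy(int[] items, int[] instance) {
        for (int a = 0; a < items.length; a++)
            if (items[a] != -1 && items[a] != instance[a])
                return false;
        return true;
    }

    public static void main(String[] args) {
        int[][] data = { {0, 1, 0}, {0, 1, 1}, {1, 0, 0} };
        int[] itemSet = { 0, 1, -1 };  // attr0=0 AND attr1=1, attr2 free
        int support = 0;
        for (int[] inst : data)
            if (containedBy(itemSet, inst))
                support++;
        System.out.println("support = " + support);  // prints 2
    }
}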

 

5) The mergeAllItemSets method, which generates the candidate k-itemsets from the (k-1)-itemsets; source code as follows:

public static FastVector mergeAllItemSets(FastVector itemSets, int size,
      int totalTrans) {

    FastVector newVector = new FastVector();
    ItemSet result;
    int numFound, k;

    for (int i = 0; i < itemSets.size(); i++) {
      ItemSet first = (ItemSet) itemSets.elementAt(i);
      out: for (int j = i + 1; j < itemSets.size(); j++) {
        ItemSet second = (ItemSet) itemSets.elementAt(j);
        result = new AprioriItemSet(totalTrans);
        result.m_items = new int[first.m_items.length];

        // Find and copy common prefix of size 'size'
        numFound = 0;
        k = 0;
        while (numFound < size) {
          if (first.m_items[k] == second.m_items[k]) {
            if (first.m_items[k] != -1)
              numFound++;
            result.m_items[k] = first.m_items[k];
          } else
            break out;
          k++;
        }

        // Check difference
        while (k < first.m_items.length) {
          if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
            break;
          else {
            if (first.m_items[k] != -1)
              result.m_items[k] = first.m_items[k];
            else
              result.m_items[k] = second.m_items[k];
          }
          k++;
        }
        if (k == first.m_items.length) {
          result.m_counter = 0;
          newVector.addElement(result);
        }
      }
    }
    return newVector;
  }
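For intuition, here is a simplified standalone re-implementation of the prefix-merge step above for a single pair of itemsets, using the same -1 encoding (merge is an illustrative helper, not Weka's API):

public class MergeSketch {
    // Merge two itemsets sharing a common prefix of 'size' non-missing
    // items; returns null where the original code breaks out or skips.
    static int[] merge(int[] first, int[] second, int size) {
        int[] result = new int[first.length];
        int numFound = 0, k = 0;
        while (numFound < size) {            // copy the common prefix
            if (first[k] != second[k])
                return null;                 // prefixes differ: no candidate
            if (first[k] != -1)
                numFound++;
            result[k] = first[k];
            k++;
        }
        while (k < first.length) {           // remaining items must not clash
            if (first[k] != -1 && second[k] != -1)
                return null;
            result[k] = (first[k] != -1) ? first[k] : second[k];
            k++;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, -1};                // attr0=0, attr1=1
        int[] b = {0, -1, 2};                // attr0=0, attr2=2
        // Common prefix of size 1 (attr0=0), so the merge succeeds:
        System.out.println(java.util.Arrays.toString(merge(a, b, 1)));
        // prints [0, 1, 2]
    }
}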

The generateRules method is then called to generate the association rules.

 

6) The generateRules method, which generates the association rules; source code as follows:

public FastVector[] generateRules(double minConfidence,
      FastVector hashtables, int numItemsInSet) {

    FastVector premises = new FastVector(), consequences = new FastVector(), conf = new FastVector();
    FastVector[] rules = new FastVector[3], moreResults;
    AprioriItemSet premise, consequence;
    Hashtable hashtable = (Hashtable) hashtables.elementAt(numItemsInSet - 2);

    // Generate all rules with one item in the consequence.
    for (int i = 0; i < m_items.length; i++)
      if (m_items[i] != -1) {
        premise = new AprioriItemSet(m_totalTransactions);
        consequence = new AprioriItemSet(m_totalTransactions);
        premise.m_items = new int[m_items.length];
        consequence.m_items = new int[m_items.length];
        consequence.m_counter = m_counter;

        for (int j = 0; j < m_items.length; j++)
          consequence.m_items[j] = -1;
        System.arraycopy(m_items, 0, premise.m_items, 0, m_items.length);
        premise.m_items[i] = -1;

        consequence.m_items[i] = m_items[i];
        premise.m_counter = ((Integer) hashtable.get(premise)).intValue();
        premises.addElement(premise);
        consequences.addElement(consequence);
        conf.addElement(new Double(confidenceForRule(premise, consequence)));
      }
    rules[0] = premises;
    rules[1] = consequences;
    rules[2] = conf;
    pruneRules(rules, minConfidence);

    // Generate all the other rules
    moreResults = moreComplexRules(rules, numItemsInSet, 1, minConfidence,
        hashtables);
    if (moreResults != null)
      for (int i = 0; i < moreResults[0].size(); i++) {
        rules[0].addElement(moreResults[0].elementAt(i));
        rules[1].addElement(moreResults[1].elementAt(i));
        rules[2].addElement(moreResults[2].elementAt(i));
      }
    return rules;
  }
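The confidenceForRule call is not shown above; by definition the confidence of A ==> B is supp(A ∪ B) / supp(A). In generateRules, consequence.m_counter is set to the count of the full itemset and premise.m_counter is looked up from the hashtable, so (assuming that convention) the computation reduces to a ratio of two counts:

public class ConfidenceSketch {
    // conf(A ==> B) = supp(A ∪ B) / supp(A), expressed as raw counts.
    static double confidence(int fullItemSetCount, int premiseCount) {
        return (double) fullItemSetCount / premiseCount;
    }

    public static void main(String[] args) {
        // Rule 1 from the console output in Remark 2:
        // tear-prod-rate=reduced 12 ==> contact-lenses=none 12, conf:(1)
        System.out.println(confidence(12, 12));  // prints 1.0
    }
}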

 

A few remarks

1) If you do not want items with value 0 to appear in the output, you can encode them as the missing value '?': the algorithm automatically drops all-missing columns, and missing values do not take part in rule generation;

2) Sorting the rules by confidence is something an associative classifier needs; if you only want to extract association rules, the sorting is unnecessary.

 

 

 

Remarks

1) Detailed explanation of the parameters of Weka's association rule mining

1. car: If set to true, class association rules are mined instead of general association rules, i.e., only rules involving the class label are kept (the class attribute is chosen via classIndex, -1 by default).
2. classIndex: Index of the class attribute. If set to -1, the last attribute is treated as the class attribute.
3. delta: The step by which the minimum support is iteratively decreased, until the lower bound is reached or the required number of rules has been generated.
4. lowerBoundMinSupport: Lower bound for the minimum support.
5. metricType: The metric used to rank the rules. It can be: confidence (class association rules can only be mined with confidence), lift, leverage, or conviction. Weka defines several confidence-like metrics to measure how strongly a rule is associated (see the sketch after this list):
a) Lift: P(A,B)/(P(A)P(B)). Lift = 1 means A and B are independent; the larger the value (> 1), the less likely it is that A and B occurring in the same basket is a coincidence, i.e., the stronger the association.
b) Leverage: P(A,B) - P(A)P(B). Leverage = 0 means A and B are independent; the larger the leverage, the closer the relationship between A and B.
c) Conviction: P(A)P(!B)/P(A,!B), where !B means B does not occur. Conviction also measures the independence of A and B. From its relation to lift (negate B, substitute into the lift formula, and take the reciprocal), the larger the value, the stronger the association between A and B.
6. minMetric: Minimum value of the metric.
7. numRules: The number of rules to find.
8. outputItemSets: If set to true, the itemsets are also included in the output.
9. removeAllMissingCols: Remove columns whose values are all missing.
10. significanceLevel: Significance level for the significance test (confidence metric only).
11. upperBoundMinSupport: Upper bound for the minimum support; the iteration starts from this value and decreases the minimum support.
12. verbose: If set to true, the algorithm runs in verbose mode.
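The three non-confidence metrics in item 5 can be computed directly from rule counts. A minimal sketch, assuming n total transactions plus counts for A, B, and A∧B (all names here are illustrative, not Weka's API):

public class RuleMetricsSketch {
    public static void main(String[] args) {
        double n = 24;     // total transactions (contact-lenses has 24)
        double a = 12;     // count(A), e.g. tear-prod-rate=reduced
        double b = 15;     // count(B), e.g. contact-lenses=none
        double ab = 12;    // count(A and B)

        double pA = a / n, pB = b / n, pAB = ab / n;
        double lift = pAB / (pA * pB);        // 1 -> A and B independent
        double leverage = pAB - pA * pB;      // 0 -> A and B independent
        // conviction = P(A)P(!B) / P(A,!B), with P(A,!B) = P(A) - P(A,B):
        double conviction = (pA - pAB == 0)
            ? Double.POSITIVE_INFINITY        // the rule is never violated
            : pA * (1 - pB) / (pA - pAB);

        System.out.println("lift = " + lift);
        System.out.println("leverage = " + leverage);
        System.out.println("conviction = " + conviction);
    }
}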

2) Console output

Apriori
=======

Minimum support: 0.2 (5 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 11

Size of set of large itemsets L(2): 21

Size of set of large itemsets L(3): 6

Best rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
 8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)

 

 

Please credit the source when reposting: http://www.cnblogs.com/rongyux/
