Compared with machine learning, the Apriori association-rule algorithm belongs more to the realm of data mining.
1) Calling Weka's Apriori association-rule algorithm from a test program, as follows:
try {
    File file = new File("F:\\tools/lib/data/contact-lenses.arff");
    ArffLoader loader = new ArffLoader();
    loader.setFile(file);
    Instances m_instances = loader.getDataSet();
    Discretize discretize = new Discretize();
    discretize.setInputFormat(m_instances);
    m_instances = Filter.useFilter(m_instances, discretize);
    Apriori apriori = new Apriori();
    apriori.buildAssociations(m_instances);
    System.out.println(apriori.toString());
} catch (Exception e) {
    e.printStackTrace();
}
Steps:
1 Read the data file and load the sample set (Instances).
2 Discretize the attributes with the Discretize filter.
3 Build the Apriori association-rule model.
4 Print the large (frequent) itemsets and the association rules.
2) When the associator is constructed, the method that sets the default parameters is called:
public void resetOptions() {
    m_removeMissingCols = false;
    m_verbose = false;
    m_delta = 0.05;
    m_minMetric = 0.90;
    m_numRules = 10;
    m_lowerBoundMinSupport = 0.1;
    m_upperBoundMinSupport = 1.0;
    m_significanceLevel = -1;
    m_outputItemSets = false;
    m_car = false;
    m_classIndex = -1;
}
For a detailed explanation of these parameters, see Note 1 at the end.
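These defaults can be overridden through Apriori's setters before buildAssociations is called. A minimal sketch; the values here are arbitrary examples, not recommendations:

Apriori apriori = new Apriori();
apriori.setNumRules(20);                // overrides m_numRules
apriori.setLowerBoundMinSupport(0.15);  // overrides m_lowerBoundMinSupport
apriori.setUpperBoundMinSupport(1.0);   // overrides m_upperBoundMinSupport
apriori.setDelta(0.05);                 // overrides m_delta
apriori.setMinMetric(0.8);              // overrides m_minMetric
apriori.buildAssociations(m_instances);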
3) Analysis of the buildAssociations method; the source code is as follows:
public void buildAssociations(Instances instances) throws Exception {
    double[] confidences, supports;
    int[] indices;
    FastVector[] sortedRuleSet;
    int necSupport = 0;

    instances = new Instances(instances);

    if (m_removeMissingCols) {
        instances = removeMissingColumns(instances);
    }
    if (m_car && m_metricType != CONFIDENCE)
        throw new Exception("For CAR-Mining metric type has to be confidence!");

    // only set class index if CAR is requested
    if (m_car) {
        if (m_classIndex == -1) {
            instances.setClassIndex(instances.numAttributes() - 1);
        } else if (m_classIndex <= instances.numAttributes() && m_classIndex > 0) {
            instances.setClassIndex(m_classIndex - 1);
        } else {
            throw new Exception("Invalid class index.");
        }
    }

    // can associator handle the data?
    getCapabilities().testWithFail(instances);

    m_cycles = 0;

    // make sure that the lower bound is equal to at least one instance
    double lowerBoundMinSupportToUse =
        (m_lowerBoundMinSupport * instances.numInstances() < 1.0)
            ? 1.0 / instances.numInstances()
            : m_lowerBoundMinSupport;

    if (m_car) {
        // m_instances does not contain the class attribute
        m_instances = LabeledItemSet.divide(instances, false);
        // m_onlyClass contains only the class attribute
        m_onlyClass = LabeledItemSet.divide(instances, true);
    } else
        m_instances = instances;

    if (m_car && m_numRules == Integer.MAX_VALUE) {
        // Set desired minimum support
        m_minSupport = lowerBoundMinSupportToUse;
    } else {
        // Decrease minimum support until desired number of rules found.
        m_minSupport = m_upperBoundMinSupport - m_delta;
        m_minSupport = (m_minSupport < lowerBoundMinSupportToUse)
            ? lowerBoundMinSupportToUse : m_minSupport;
    }

    do {
        // Reserve space for variables
        m_Ls = new FastVector();
        m_hashtables = new FastVector();
        m_allTheRules = new FastVector[6];
        m_allTheRules[0] = new FastVector();
        m_allTheRules[1] = new FastVector();
        m_allTheRules[2] = new FastVector();
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            m_allTheRules[3] = new FastVector();
            m_allTheRules[4] = new FastVector();
            m_allTheRules[5] = new FastVector();
        }
        sortedRuleSet = new FastVector[6];
        sortedRuleSet[0] = new FastVector();
        sortedRuleSet[1] = new FastVector();
        sortedRuleSet[2] = new FastVector();
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            sortedRuleSet[3] = new FastVector();
            sortedRuleSet[4] = new FastVector();
            sortedRuleSet[5] = new FastVector();
        }
        if (!m_car) {
            // Find large itemsets and rules
            findLargeItemSets();
            if (m_significanceLevel != -1 || m_metricType != CONFIDENCE)
                findRulesBruteForce();
            else
                findRulesQuickly();
        } else {
            findLargeCarItemSets();
            findCarRulesQuickly();
        }

        // prune rules for upper bound min support
        if (m_upperBoundMinSupport < 1.0) {
            pruneRulesForUpperBoundSupport();
        }

        int j = m_allTheRules[2].size() - 1;
        supports = new double[m_allTheRules[2].size()];
        for (int i = 0; i < (j + 1); i++)
            supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]
                .elementAt(j - i)).support()) * (-1);
        indices = Utils.stableSort(supports);
        for (int i = 0; i < (j + 1); i++) {
            sortedRuleSet[0].addElement(m_allTheRules[0].elementAt(indices[j - i]));
            sortedRuleSet[1].addElement(m_allTheRules[1].elementAt(indices[j - i]));
            sortedRuleSet[2].addElement(m_allTheRules[2].elementAt(indices[j - i]));
            if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
                sortedRuleSet[3].addElement(m_allTheRules[3].elementAt(indices[j - i]));
                sortedRuleSet[4].addElement(m_allTheRules[4].elementAt(indices[j - i]));
                sortedRuleSet[5].addElement(m_allTheRules[5].elementAt(indices[j - i]));
            }
        }

        // Sort rules according to their confidence
        m_allTheRules[0].removeAllElements();
        m_allTheRules[1].removeAllElements();
        m_allTheRules[2].removeAllElements();
        if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
            m_allTheRules[3].removeAllElements();
            m_allTheRules[4].removeAllElements();
            m_allTheRules[5].removeAllElements();
        }
        confidences = new double[sortedRuleSet[2].size()];
        int sortType = 2 + m_metricType;
        for (int i = 0; i < sortedRuleSet[2].size(); i++)
            confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i)).doubleValue();
        indices = Utils.stableSort(confidences);
        for (int i = sortedRuleSet[0].size() - 1;
             (i >= (sortedRuleSet[0].size() - m_numRules)) && (i >= 0); i--) {
            m_allTheRules[0].addElement(sortedRuleSet[0].elementAt(indices[i]));
            m_allTheRules[1].addElement(sortedRuleSet[1].elementAt(indices[i]));
            m_allTheRules[2].addElement(sortedRuleSet[2].elementAt(indices[i]));
            if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {
                m_allTheRules[3].addElement(sortedRuleSet[3].elementAt(indices[i]));
                m_allTheRules[4].addElement(sortedRuleSet[4].elementAt(indices[i]));
                m_allTheRules[5].addElement(sortedRuleSet[5].elementAt(indices[i]));
            }
        }

        if (m_verbose) {
            if (m_Ls.size() > 1) {
                System.out.println(toString());
            }
        }

        if (m_minSupport == lowerBoundMinSupportToUse
            || m_minSupport - m_delta > lowerBoundMinSupportToUse)
            m_minSupport -= m_delta;
        else
            m_minSupport = lowerBoundMinSupportToUse;

        necSupport = Math.round((float) ((m_minSupport * m_instances.numInstances()) + 0.5));

        m_cycles++;
    } while ((m_allTheRules[0].size() < m_numRules)
        && (Utils.grOrEq(m_minSupport, lowerBoundMinSupportToUse))
        /* (necSupport >= lowerBoundNumInstancesSupport) */
        /* (Utils.grOrEq(m_minSupport, m_lowerBoundMinSupport)) */
        && (necSupport >= 1));
    m_minSupport += m_delta;
}
Main steps (a simplified sketch of the method's outer loop follows this list):
1 removeMissingColumns removes the columns whose values are all missing.
2 If m_car is true, class association rules (rules whose consequence is the class label) are mined instead of general ones, so the data is divided into two parts: m_instances, which holds every attribute except the class, and m_onlyClass, which holds only the class attribute.
3 findLargeItemSets finds the large (frequent) itemsets; its source is given below.
4 findRulesQuickly finds all association rules from those itemsets.
5 pruneRulesForUpperBoundSupport discards rules whose support exceeds the upper support bound m_upperBoundMinSupport.
6 The rule set is sorted, first by support and then by the chosen metric (confidence by default).
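The heart of the method is the outer do-while loop: start near the upper support bound and keep lowering the minimum support by delta until enough rules are found or the lower bound is reached. A simplified standalone paraphrase of just that control flow; findRules here is a hypothetical stand-in for steps 3-5 above, not a Weka method:

import java.util.List;

abstract class SupportLoweringSketch {
    double delta = 0.05, lowerBoundMinSupport = 0.1, upperBoundMinSupport = 1.0;
    int numRules = 10;

    abstract List<String> findRules(double minSupport); // stand-in for steps 3-5

    List<String> mine() {
        double minSupport = Math.max(upperBoundMinSupport - delta, lowerBoundMinSupport);
        List<String> rules;
        do {
            rules = findRules(minSupport);
            minSupport -= delta;          // relax support for the next pass
        } while (rules.size() < numRules && minSupport >= lowerBoundMinSupport);
        return rules;
    }
}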
4) Source of findLargeItemSets, which finds the large (frequent) itemsets:
private void findLargeItemSets() throws Exception {
    FastVector kMinusOneSets, kSets;
    Hashtable hashtable;
    int necSupport, necMaxSupport, i = 0;

    // Find large itemsets

    // minimum support
    necSupport = (int) (m_minSupport * m_instances.numInstances() + 0.5);
    necMaxSupport = (int) (m_upperBoundMinSupport * m_instances.numInstances() + 0.5);

    kSets = AprioriItemSet.singletons(m_instances);
    AprioriItemSet.upDateCounters(kSets, m_instances);
    kSets = AprioriItemSet.deleteItemSets(kSets, necSupport, m_instances.numInstances());
    if (kSets.size() == 0)
        return;
    do {
        m_Ls.addElement(kSets);
        kMinusOneSets = kSets;
        kSets = AprioriItemSet.mergeAllItemSets(kMinusOneSets, i, m_instances.numInstances());
        hashtable = AprioriItemSet.getHashtable(kMinusOneSets, kMinusOneSets.size());
        m_hashtables.addElement(hashtable);
        kSets = AprioriItemSet.pruneItemSets(kSets, hashtable);
        AprioriItemSet.upDateCounters(kSets, m_instances);
        kSets = AprioriItemSet.deleteItemSets(kSets, necSupport, m_instances.numInstances());
        i++;
    } while (kSets.size() > 0);
}
Main steps (the support-threshold arithmetic is checked in a small sketch after this list):
1 AprioriItemSet.singletons converts the header information of the dataset into a set of single-item itemsets; the values in the header are in dictionary order.
2 upDateCounters scans the instances and updates the support counter of every candidate itemset.
3 AprioriItemSet.deleteItemSets removes the itemsets whose support does not fall in the required support interval.
4 mergeAllItemSets (source below) generates the candidate k-itemsets from the (k-1)-itemsets in a loop, and deleteItemSets again removes the candidates outside the support interval.
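The necSupport arithmetic is easy to check against the console output in Note 2: the contact-lenses data has 24 instances, and the final minimum support reached is 0.2, so the integer threshold works out to 5 instances, exactly as printed there:

int numInstances = 24;      // contact-lenses.arff
double minSupport = 0.2;    // final support reached (see Note 2)
int necSupport = (int) (minSupport * numInstances + 0.5);  // = 5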
5) mergeAllItemSets, which generates the k-itemsets from the (k-1)-itemsets in a loop; the source is as follows:
public static FastVector mergeAllItemSets(FastVector itemSets, int size, int totalTrans) {
    FastVector newVector = new FastVector();
    ItemSet result;
    int numFound, k;

    for (int i = 0; i < itemSets.size(); i++) {
        ItemSet first = (ItemSet) itemSets.elementAt(i);
        out:
        for (int j = i + 1; j < itemSets.size(); j++) {
            ItemSet second = (ItemSet) itemSets.elementAt(j);
            result = new AprioriItemSet(totalTrans);
            result.m_items = new int[first.m_items.length];

            // Find and copy common prefix of size 'size'
            numFound = 0;
            k = 0;
            while (numFound < size) {
                if (first.m_items[k] == second.m_items[k]) {
                    if (first.m_items[k] != -1)
                        numFound++;
                    result.m_items[k] = first.m_items[k];
                } else
                    break out;
                k++;
            }

            // Check difference
            while (k < first.m_items.length) {
                if ((first.m_items[k] != -1) && (second.m_items[k] != -1))
                    break;
                else {
                    if (first.m_items[k] != -1)
                        result.m_items[k] = first.m_items[k];
                    else
                        result.m_items[k] = second.m_items[k];
                }
                k++;
            }
            if (k == first.m_items.length) {
                result.m_counter = 0;
                newVector.addElement(result);
            }
        }
    }
    return newVector;
}
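To make the merge concrete, here is a worked example (standalone illustration, not Weka code) using the same encoding as AprioriItemSet: one int slot per attribute, where -1 means the attribute is not in the itemset:

// Two 2-itemsets over three attributes that share a 1-item prefix.
int[] first  = { 0, 1, -1 };   // {attr0=0, attr1=1}
int[] second = { 0, -1, 2 };   // {attr0=0, attr2=2}
// Common prefix of size 1 (attr0=0); past the prefix the two sets never
// assign the same attribute, so the merge succeeds and yields a 3-itemset:
int[] merged = { 0, 1, 2 };    // {attr0=0, attr1=1, attr2=2}
// Had both sets assigned attr1 (e.g. second = {0, 2, -1}), the inner
// "Check difference" loop would break early and no candidate would be produced.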
After the large itemsets have been found, generateRules is called to generate the association rules.
6) generateRules, the method that generates the association rules; the source is as follows:
public FastVector[] generateRules(double minConfidence, FastVector hashtables, int numItemsInSet) {
    FastVector premises = new FastVector(), consequences = new FastVector(),
        conf = new FastVector();
    FastVector[] rules = new FastVector[3], moreResults;
    AprioriItemSet premise, consequence;
    Hashtable hashtable = (Hashtable) hashtables.elementAt(numItemsInSet - 2);

    // Generate all rules with one item in the consequence.
    for (int i = 0; i < m_items.length; i++)
        if (m_items[i] != -1) {
            premise = new AprioriItemSet(m_totalTransactions);
            consequence = new AprioriItemSet(m_totalTransactions);
            premise.m_items = new int[m_items.length];
            consequence.m_items = new int[m_items.length];
            consequence.m_counter = m_counter;

            for (int j = 0; j < m_items.length; j++)
                consequence.m_items[j] = -1;
            System.arraycopy(m_items, 0, premise.m_items, 0, m_items.length);
            premise.m_items[i] = -1;
            consequence.m_items[i] = m_items[i];
            premise.m_counter = ((Integer) hashtable.get(premise)).intValue();
            premises.addElement(premise);
            consequences.addElement(consequence);
            conf.addElement(new Double(confidenceForRule(premise, consequence)));
        }
    rules[0] = premises;
    rules[1] = consequences;
    rules[2] = conf;
    pruneRules(rules, minConfidence);

    // Generate all the other rules
    moreResults = moreComplexRules(rules, numItemsInSet, 1, minConfidence, hashtables);
    if (moreResults != null)
        for (int i = 0; i < moreResults[0].size(); i++) {
            rules[0].addElement(moreResults[0].elementAt(i));
            rules[1].addElement(moreResults[1].elementAt(i));
            rules[2].addElement(moreResults[2].elementAt(i));
        }
    return rules;
}
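The confidence of a rule premise ==> consequence is the support count of the whole itemset divided by the support count of the premise alone, which is why the premise's count is looked up in the hashtable of the smaller itemsets. A minimal sketch of the computation (not Weka's confidenceForRule):

// conf(premise ==> consequence) = support(premise + consequence) / support(premise)
static double confidence(int premiseCount, int ruleCount) {
    return (double) ruleCount / premiseCount;
}
// e.g. rule 1 in Note 2, tear-prod-rate=reduced 12 ==> contact-lenses=none 12,
// gives confidence(12, 12) = 1.0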
A few remarks
1) If you do not want items whose value is 0 to appear in the output, you can set those values to the missing value "?": the algorithm automatically drops missing values, so they take no part in rule generation (see the sketch after this list).
2) Sorting the rules by confidence is something an associative classifier needs; if you only want to extract the rules, the sorting is unnecessary.
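A minimal sketch of remark 1, assuming the zeros sit in a single attribute at a hypothetical index attrIndex; Instance.setMissing (weka.core.Instance) marks a value as missing:

int attrIndex = 0;  // hypothetical: the attribute whose zeros should be hidden
for (int i = 0; i < m_instances.numInstances(); i++) {
    Instance inst = m_instances.instance(i);
    if (!inst.isMissing(attrIndex) && inst.value(attrIndex) == 0.0) {
        inst.setMissing(attrIndex);   // becomes "?" and is skipped by Apriori
    }
}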
Notes
1) Detailed explanation of the parameters of Weka's association rules
1. car — If true, class association rules are mined instead of general association rules; that is, only rules involving the class label are kept.
2. classIndex — Index of the class attribute. If set to -1, the last attribute is treated as the class attribute.
3. delta — Step by which the minimum support is iteratively decreased, until the lower bound is reached or enough rules have been produced.
4. lowerBoundMinSupport — Lower bound of the minimum support.
5. metricType — Metric used to rank the rules. It can be confidence (class association rules can only be mined with confidence), lift, leverage, or conviction. Weka provides several confidence-like metrics to measure how strongly a rule is associated (a small numeric sketch of them follows this list):
   a) Lift: P(A,B)/(P(A)P(B)). Lift = 1 means A and B are independent; the larger the value (> 1), the less the co-occurrence of A and B in one basket is a coincidence, i.e. the stronger the association.
   b) Leverage: P(A,B) - P(A)P(B). Leverage = 0 means A and B are independent; the larger the value, the closer the relationship between A and B.
   c) Conviction: P(A)P(!B)/P(A,!B), where !B means B does not occur. Conviction also measures the independence of A and B; from its relation to lift (negate B, substitute into the lift formula, and take the reciprocal) one can see that the larger the value, the stronger the association.
6. minMetric — Minimum value of the metric.
7. numRules — Number of rules to find.
8. outputItemSets — If true, the itemsets are included in the output.
9. removeAllMissingCols — Remove the columns whose values are all missing.
10. significanceLevel — Significance level for the significance test (confidence metric only).
11. upperBoundMinSupport — Upper bound of the minimum support; iteration starts from this value and decreases.
12. verbose — If true, the algorithm runs in verbose mode.
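A minimal sketch computing the three metrics from co-occurrence counts; the counts below are hypothetical examples, not taken from the contact-lenses data:

public class MetricsSketch {
    public static void main(String[] args) {
        // Hypothetical counts over n transactions.
        double n = 100, countA = 40, countB = 50, countAB = 30;
        double pA = countA / n, pB = countB / n, pAB = countAB / n;

        double confidence = pAB / pA;                   // P(B|A)
        double lift = pAB / (pA * pB);                  // 1 means A, B independent
        double leverage = pAB - pA * pB;                // 0 means A, B independent
        double conviction = pA * (1 - pB) / (pA - pAB); // P(A)P(!B)/P(A,!B)

        System.out.printf("conf=%.3f lift=%.3f lev=%.3f conv=%.3f%n",
            confidence, lift, leverage, conviction);
    }
}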
2) Console output:
Apriori
=======

Minimum support: 0.2 (5 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 11
Size of set of large itemsets L(2): 21
Size of set of large itemsets L(3): 6

Best rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
 8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)
Please credit the source when reposting: http://www.cnblogs.com/rongyux/