Friday, June 15, 2012

Weka tutorial test 2

Using the Explorer's Preprocess, Classify, and Select Attributes panels in the Weka machine-learning software,
answer the following questions:

Discretization (see Weka book, Chapter 17, Exercise 4.5)
1. For the diabetes dataset, which of the following discretization methods
  best improves J48's accuracy on unseen data?
   unsupervised Discretize (makeBinary=false)
   unsupervised Discretize (makeBinary=true)
   supervised Discretize (makeBinary=false)
   supervised Discretize (makeBinary=true)
  Hint: measure accuracy with 10-fold cross-validation.
       For a fair test, supervised Discretize must be run through FilteredClassifier + J48,
       so the filter never sees the held-out folds (no double-dipping on the data).
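As a sketch of the fair setup, assuming weka.jar is on the classpath and diabetes.arff is a local copy of the dataset (both paths are placeholders), the command-line equivalent is roughly:

```shell
# Supervised discretization applied inside each cross-validation fold via
# FilteredClassifier, so the filter is rebuilt from training folds only.
java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
  -t diabetes.arff -x 10 \
  -F weka.filters.supervised.attribute.Discretize \
  -W weka.classifiers.trees.J48
```

To test the makeBinary variant, pass the filter's binary-attributes switch inside the -F specification (quote the whole filter spec); the unsupervised variants can instead be applied once in the Preprocess panel, since they ignore the class attribute.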


Eliminating redundant attributes (see Weka book, Chapter 17, Exercise 4.11)
2. For the diabetes dataset, copy the first attribute twice, giving three mutually dependent attributes.
  Which of the following three attribute-selection methods best eliminates the redundant copies,
  and which yields the highest accuracy after selection?
   A. InfoGainAttributeEval + Ranker (keep only 8 attributes)
   B. CfsSubsetEval + BestFirst
   C. WrapperSubsetEval + NaiveBayes + BestFirst
  Use AttributeSelectedClassifier + NaiveBayes to fill in the table below and answer the questions:

   Method                                          Accuracy   Selected attributes
   A. InfoGainAttributeEval + Ranker
   B. CfsSubsetEval + BestFirst
   C. WrapperSubsetEval + NaiveBayes + BestFirst
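A minimal command-line sketch for one row of the table, again assuming weka.jar and a local diabetes.arff (with the duplicated attributes already added) as placeholders:

```shell
# Variant B: CFS subset evaluation with best-first search, wrapped around
# NaiveBayes so attribute selection is redone inside every CV fold.
java -cp weka.jar weka.classifiers.meta.AttributeSelectedClassifier \
  -t diabetes.arff -x 10 \
  -E weka.attributeSelection.CfsSubsetEval \
  -S weka.attributeSelection.BestFirst \
  -W weka.classifiers.bayes.NaiveBayes
```

For variant A, swap in -E weka.attributeSelection.InfoGainAttributeEval and -S "weka.attributeSelection.Ranker -N 8" so that only the top 8 ranked attributes are kept.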

Parameter tuning (see Weka book, Chapter 17, Exercise 4.12)
3. For the diabetes dataset, what is the best number of neighbors k for the nearest-neighbor classifier IBk?
 Hint: use weka.classifiers.meta.CVParameterSelection,
  varying k from 1 to 10 in 10 steps.
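The hint above corresponds to a command line roughly like the following (weka.jar and diabetes.arff are placeholders for your local paths):

```shell
# Search k = 1..10 in 10 steps by internal cross-validation, then report
# the performance of IBk retrained with the winning k.
java -cp weka.jar weka.classifiers.meta.CVParameterSelection \
  -t diabetes.arff -x 10 \
  -P "K 1 10 10" \
  -W weka.classifiers.lazy.IBk
```

The -P string names IBk's -K option, its lower and upper bounds, and the number of steps to try.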

Text datasets (see Weka book, Chapter 17, Exercise 5.4)
4. For the news datasets below, find the classifier (including its parameter settings) that most accurately identifies grain-related articles.
 Training set: ReutersGrain-train.arff
 Test set: ReutersGrain-test.arff
  List the classifiers you evaluated, including parameter variations, sorted from highest to lowest accuracy.
  Hint: vectorize the articles with StringToWordVector, using its default parameters.
       For a fair test (no double-dipping), StringToWordVector must be run through FilteredClassifier.
       Use the accuracies obtained with the supplied-test-set option; report at least 3 figures.
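One candidate run, sketched as a command line under the same weka.jar-on-classpath assumption:

```shell
# StringToWordVector is fitted on the training set only, then applied to the
# supplied test set -- the fair setup the hint asks for.
java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
  -t ReutersGrain-train.arff -T ReutersGrain-test.arff \
  -F weka.filters.unsupervised.attribute.StringToWordVector \
  -W weka.classifiers.bayes.NaiveBayesMultinomial
```

Swap the -W classifier (e.g. weka.classifiers.trees.J48) and its options to collect the other rows of the comparison.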


References:
1. Weka software download (includes all datasets used above):
   http://www.cs.waikato.ac.nz/~ml/weka/index_downloading.html

2. Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann.
  Chapter 17: Tutorial Exercises for the Weka Explorer

  Explorer::Classify
  Discretization
 4.1. glass by unsupervised Discretize
   weka.filters.unsupervised.attribute.Discretize
     equal-width (the default) vs. equal-frequency
   Observe the per-bin instance distributions produced by equal-width discretization.
   With equal-frequency discretization, some attributes' bin distributions remain quite skewed; why?
   Which discretized attributes look suitable for prediction?
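Outside the Explorer, the same filter can be applied from the command line; as a sketch (file names are placeholders for your local copies):

```shell
# Equal-width binning into 10 intervals; add -F inside the options to switch
# to equal-frequency binning instead.
java -cp weka.jar weka.filters.unsupervised.attribute.Discretize \
  -B 10 -i glass.arff -o glass-discretized.arff
```

Opening the output file in the Preprocess panel shows the per-bin histograms the exercise asks about.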

 4.2–4.3. glass by supervised Discretize
   weka.filters.supervised.attribute.Discretize
     Does each interval keep a consistent class distribution?
   --
   Some attributes are left as a single interval, with no split at all; why?

 4.4. glass by supervised/unsupervised Discretize
   Pick either discretization filter, enable its makeBinary option to create binary attributes, and observe the result.
   What do the resulting binary attributes mean?

 4.5. ionosphere by unsupervised Discretize + J48
   Fill in the table:
                                     Cross-validation accuracy   Tree size (nodes)
   (1) raw data
   (2) unsupervised discretization
       (makeBinary=false)
   (3) unsupervised discretization
       (makeBinary=true)

 4.6–4.7. ionosphere by FilteredClassifier
   Filter: supervised Discretize + classifier: J48
                                     Cross-validation accuracy   Tree size (nodes)
   (4) supervised discretization
       (makeBinary=false)
   (5) supervised discretization
       (makeBinary=true)
   --
   Why does the decision tree predict better on the discretized data than on the raw data?

  Explorer::Attribute Selection
 Filter methods:
   weka.attributeSelection.CfsSubsetEval
     weka.attributeSelection.BestFirst
   weka.attributeSelection.InfoGainAttributeEval
     weka.attributeSelection.Ranker
 Wrapper method:
   weka.attributeSelection.WrapperSubsetEval
   weka.classifiers.meta.AttributeSelectedClassifier

 4.8. labor by InfoGainAttributeEval + Ranker
   Use information gain to find the 4 most important attributes of the labor dataset.
                                       Selected attributes
   InfoGainAttributeEval + Ranker

 4.9. labor by CfsSubsetEval + BestFirst / WrapperSubsetEval + J48 + BestFirst
                                       Selected attributes
   CfsSubsetEval + BestFirst
   WrapperSubsetEval + J48 + BestFirst
   -
   Which attributes do both methods select?
   How do the attributes selected by both methods relate to those ranked by InfoGainAttributeEval + Ranker?

 4.10. diabetes by NaiveBayes
   Run NaiveBayes with useSupervisedDiscretization=true.
   Copies of attribute 1:   0    1    2    3    4
   Accuracy:

 4.11. diabetes by AttributeSelectedClassifier
   Classifier: NaiveBayes
   Attribute-selection methods:
     InfoGainAttributeEval + Ranker (keep only 8 attributes)
     CfsSubsetEval + BestFirst
     WrapperSubsetEval + NaiveBayes + BestFirst
   Observe how the three selection methods behave when an attribute is duplicated several times.
   Can each successfully drop the duplicated copies? If not, why not?

     Method                                          Accuracy   Selected attributes
     InfoGainAttributeEval + Ranker
     CfsSubsetEval + BestFirst
     WrapperSubsetEval + NaiveBayes + BestFirst

  Parameter tuning
   weka.classifiers.meta.CVParameterSelection

    4.12. diabetes by CVParameterSelection + IBk
      Vary IBk's number of neighbors k from 1 to 10 in 10 steps.
        Cross-validation accuracy
     IBk(k=1)
     IBk(k=2)
     IBk(k=3)
     IBk(k=4)
     IBk(k=5)
     IBk(k=6)
     IBk(k=7)
     IBk(k=8)
     IBk(k=9)
     IBk(k=10)
     --
      Which k is chosen?

    4.13. diabetes by CVParameterSelection + J48
      Vary the minimum number of instances per leaf, M, from 1 to 10 in 10 steps,
      and the pruning confidence, C, from 0.1 to 0.5 in 5 steps.
      Does the accuracy change? How many nodes does the tree have, and which M and C are chosen?
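CVParameterSelection searches multiple -P specifications jointly, so the two-parameter sweep above can be sketched as (placeholder paths as before):

```shell
# Joint search: J48's -M from 1 to 10 in 10 steps and -C from 0.1 to 0.5
# in 5 steps, each candidate scored by internal cross-validation.
java -cp weka.jar weka.classifiers.meta.CVParameterSelection \
  -t diabetes.arff -x 10 \
  -P "M 1 10 10" -P "C 0.1 0.5 5" \
  -W weka.classifiers.trees.J48
```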

  Document classification
 weka.filters.unsupervised.attribute.StringToWordVector
 mini-train training documents:
   class text
   yes the price of crude oil has increased significantly
   yes demand for crude oil outstrips supply
   no some people do not like the flavor of olive oil
   no the food was very oily
   yes crude oil is in short supply
   no use a bit of cooking oil in the frying pan
 mini-test test documents:
   class text
   ? oil platforms extract crude oil
   ? canola oil is supposed to be healthy
   ? iraq has significant oil reserves
   ? there are different types of cooking oil

 5.1–5.3. mini-xx by FilteredClassifier
    Filter: StringToWordVector + classifier: J48
    mini-train + mini-test
   For the training documents mini-train:
   How many attributes does StringToWordVector produce with its default options?
   How many after changing the minTermFreq option to 2?
   --
   Build a J48 decision tree on the data produced with minTermFreq=2.
   --
   Use that decision tree to predict the mini-test documents.
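The whole pipeline fits in one command; mini-train.arff and mini-test.arff below are hypothetical file names for ARFF versions of the toy documents above:

```shell
# -M 2 inside the filter spec sets StringToWordVector's minTermFreq, keeping
# only words that occur at least twice in the training documents; -p 0 prints
# the per-instance predictions for the mini-test documents.
java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
  -t mini-train.arff -T mini-test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -M 2" \
  -W weka.classifiers.trees.J48 \
  -p 0
```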

  Real documents
 5.4. reutersxx_yy.arff by J48 and NaiveBayesMultinomial
  ReutersCorn-train.arff + ReutersCorn-test.arff
  ReutersGrain-train.arff + ReutersGrain-test.arff
  weka.classifiers.meta.FilteredClassifier
   Fill in the table:
   Test-set accuracy              J48    NaiveBayesMultinomial
   corn-train + corn-test
   grain-train + grain-test
   -
   Which classifier performs better?
