Using the Explorer's Preprocess, Classify, and Select Attributes panels of the Weka machine learning workbench, answer the following questions.

Discretization (see Weka book, Chapter 17, Section 4.5)
1. For the diabetes dataset, which of the following discretization methods helps the J48 classifier most on unseen data?
   - unsupervised Discretize (makeBinary=false)
   - unsupervised Discretize (makeBinary=true)
   - supervised Discretize (makeBinary=false)
   - supervised Discretize (makeBinary=true)
   Hint: estimate accuracy with 10-fold cross-validation.
   To be fair, supervised Discretize must be tested through FilteredClassifier + J48, so that the discretization intervals are learned from the training folds only and no test information leaks in.

Eliminating redundant attributes (see Weka book, Chapter 17, Section 4.11)
2. For the diabetes dataset, copy the first attribute twice, producing three mutually dependent attributes. Which of the following three attribute selection methods best eliminates the redundant copies, and which yields the highest accuracy after selection?
   A. InfoGainAttributeEval + Ranker (keep only 8 attributes)
   B. CfsSubsetEval + BestFirst
   C. WrapperSubsetEval + NaiveBayes + BestFirst
   Use AttributeSelectedClassifier + NaiveBayes to fill in this table and answer the question:

   Method                                         Accuracy   Selected attributes
   A. InfoGainAttributeEval + Ranker
   B. CfsSubsetEval + BestFirst
   C. WrapperSubsetEval + NaiveBayes + BestFirst

Parameter tuning (see Weka book, Chapter 17, Section 4.12)
3. For the diabetes dataset, what is the best number of neighbors k for the nearest-neighbor classifier IBk?
   Hint: use weka.classifiers.meta.CVParameterSelection, varying k from 1 to 10 in 10 steps.

Text datasets (see Weka book, Chapter 17, Section 5.4)
4. For the news collection below, find the classifier (including its parameter settings) that identifies grain-related articles most accurately.
   Training set: ReutersGrain-train.arff
   Test set: ReutersGrain-test.arff
   List the classifiers you evaluated, including parameter variations, sorted from highest to lowest accuracy.
   Hints: vectorize the articles with StringToWordVector, using its default parameters.
   To be fair, StringToWordVector must be applied through FilteredClassifier, so the dictionary is built from the training set only and no test information leaks in.
   Report the accuracies obtained with the supplied-test-set option; list at least 3 results.

References:
1. Weka download (includes all datasets used here): http://www.cs.waikato.ac.nz/~ml/weka/index_downloading.html
2. Witten, Frank, and Hall (2011), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Chapter 17, "Tutorial Exercises for the Weka Explorer".

Explorer::Classify

Discretization
4.1. glass by unsupervised Discretize
   weka.filters.unsupervised.attribute.Discretize
   equal-width (the default) vs. equal-frequency
   - Examine the per-interval case distributions produced by equal-width discretization.
   - After equal-frequency discretization, the case distributions of some attributes are still very skewed. Why?
   - Which discretized attributes look useful for prediction?
4.2-4.3. glass by supervised Discretize
   weka.filters.supervised.attribute.Discretize
   - Is the class distribution within each interval kept consistent?
   - Some attributes are kept as a single interval, with no split at all. Why?
4.4. glass by supervised/unsupervised Discretize
   Pick either discretization filter, enable its makeBinary option, and examine the result.
   - What do the generated binary attributes mean?
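The two unsupervised binning strategies compared in exercise 4.1 can be sketched in plain Python. This is an illustrative re-implementation, not Weka's code; the function names and the toy data are mine, and boundary handling is simplified.

```python
# Sketch of the two unsupervised binning strategies (equal-width vs.
# equal-frequency) that weka.filters.unsupervised.attribute.Discretize
# offers. Illustrative only; not Weka's implementation.

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-range attribute
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so that each holds (roughly) the same number of cases."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

if __name__ == "__main__":
    # A skewed sample with one outlier: equal-width packs almost every
    # case into the first interval (the observation behind exercise 4.1),
    # while equal-frequency spreads the cases evenly.
    data = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 9.0]
    print(equal_width_bins(data, 4))      # [0, 0, 0, 0, 0, 0, 0, 3]
    print(equal_frequency_bins(data, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note that equal-frequency binning cannot split ties: if one value dominates an attribute, the interval holding it stays overcrowded, which is one reason some distributions remain skewed in exercise 4.1.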
4.5. ionosphere by unsupervised Discretize + J48
   Fill in this table:

   Setting                                              Cross-validation accuracy   Tree nodes
   (1) raw data
   (2) unsupervised discretization (makeBinary=false)
   (3) unsupervised discretization (makeBinary=true)

4.6-4.7. ionosphere by FilteredClassifier
   Filter: supervised Discretize; classifier: J48

   Setting                                              Cross-validation accuracy   Tree nodes
   (4) supervised discretization (makeBinary=false)
   (5) supervised discretization (makeBinary=true)

   - Why can a decision tree built on discretized data predict better than one built on the raw data?

Explorer::Attribute Selection
Filter approach:
   weka.attributeSelection.CfsSubsetEval
   weka.attributeSelection.BestFirst
   weka.attributeSelection.InfoGainAttributeEval
   weka.attributeSelection.Ranker
Wrapper approach:
   weka.attributeSelection.WrapperSubsetEval
   weka.classifiers.meta.AttributeSelectedClassifier

4.8. labor by InfoGainAttributeEval + Ranker
   Use information gain to find the 4 most important attributes of the labor dataset.

   Method                           Selected attributes
   InfoGainAttributeEval + Ranker

4.9. labor by CfsSubsetEval + BestFirst / WrapperSubsetEval + J48 + BestFirst

   Method                                Selected attributes
   CfsSubsetEval + BestFirst
   WrapperSubsetEval + J48 + BestFirst

   - Which attributes are selected by both methods?
   - How do the attributes both methods pick relate to those picked by InfoGainAttributeEval + Ranker?

4.10. diabetes by NaiveBayes
   NaiveBayes with useSupervisedDiscretization=true

   Copies of attribute 1   0   1   2   3   4
   Accuracy

4.11. diabetes by AttributeSelectedClassifier
   Classifier: NaiveBayes
   Attribute selection methods:
      InfoGainAttributeEval + Ranker (keep only 8 attributes)
      CfsSubsetEval + BestFirst
      WrapperSubsetEval + NaiveBayes + BestFirst
   Observe how the three selection methods behave when one attribute is duplicated several times.
   Can each method successfully drop the duplicated copies? If not, why not?

   Method                                        Accuracy   Selected attributes
   InfoGainAttributeEval + Ranker
   CfsSubsetEval + BestFirst
   WrapperSubsetEval + NaiveBayes + BestFirst

Parameter tuning
weka.classifiers.meta.CVParameterSelection

4.12. diabetes by CVParameterSelection + IBk
   Vary IBk's number of neighbors K from 1 to 10 in 10 steps.

   Classifier    Cross-validation accuracy
   IBk (k=1)
   IBk (k=2)
   IBk (k=3)
   IBk (k=4)
   IBk (k=5)
   IBk (k=6)
   IBk (k=7)
   IBk (k=8)
   IBk (k=9)
   IBk (k=10)

   - Which k value is selected?

4.13. diabetes by CVParameterSelection + J48
   Vary the minimum number of cases per leaf M from 1 to 10 in 10 steps, and the pruning confidence C from 0.1 to 0.5 in 5 steps.
   Does the accuracy change? What is the tree size, and which M and C are selected?
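The information-gain score behind exercises 4.8 and 4.11 can be sketched in plain Python: InfoGain(Class, A) = H(Class) - H(Class | A). This is an illustrative re-implementation of the formula InfoGainAttributeEval computes, not Weka code; the toy data and attribute names are made up. It also shows why a per-attribute ranker cannot discard an exact duplicate: the copy receives an identical score.

```python
# Sketch of the ranking InfoGainAttributeEval + Ranker perform on
# nominal attributes. Illustrative re-implementation, not Weka code.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """InfoGain(Class, A) = H(Class) - H(Class | A) for one attribute."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def rank_attributes(rows, labels, names):
    """Return (name, gain) pairs sorted best-first, as Ranker would."""
    gains = [(name, info_gain([r[i] for r in rows], labels))
             for i, name in enumerate(names)]
    return sorted(gains, key=lambda t: -t[1])

if __name__ == "__main__":
    # "dup" is an exact copy of "a1": both get the same (maximal) gain,
    # so a ranker keeps the redundant copy -- the behavior question 2
    # and exercise 4.11 ask you to observe for method A.
    rows = [("x", "x", "p"), ("x", "x", "q"), ("y", "y", "p"), ("y", "y", "q")]
    labels = ["yes", "yes", "no", "no"]
    print(rank_attributes(rows, labels, ["a1", "dup", "a3"]))
```

By contrast, subset evaluators such as CfsSubsetEval or a wrapper score whole attribute sets, so adding a duplicate brings no extra merit and it can be dropped.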
Document classification
weka.filters.unsupervised.attribute.StringToWordVector

mini-train training documents:
   class   text
   yes     the price of crude oil has increased significantly
   yes     demand for crude oil outstrips supply
   no      some people do not like the flavor of olive oil
   no      the food was very oily
   yes     crude oil is in short supply
   no      use a bit of cooking oil in the frying pan

mini-test test documents:
   class   text
   ?       oil platforms extract crude oil
   ?       canola oil is supposed to be healthy
   ?       iraq has significant oil reserves
   ?       there are different types of cooking oil

5.1-5.3. mini-xx by FilteredClassifier
   Filter: StringToWordVector; classifier: J48
   mini-train + mini-test
   - With StringToWordVector at its default options, how many attributes does mini-train produce?
   - After changing the minTermFreq option to 2, how many attributes are produced?
   - Build a J48 decision tree from the minTermFreq=2 data.
   - Use that tree to classify the mini-test documents.

Real documents
5.4. reutersxx_yy.arff by J48 and NaiveBayesMultinomial
   ReutersCorn-train.arff + ReutersCorn-test.arff
   ReutersGrain-train.arff + ReutersGrain-test.arff
   weka.classifiers.meta.FilteredClassifier
   Fill in this table:

   Dataset                    Accuracy with J48   Accuracy with NaiveBayesMultinomial
   corn-train + corn-test
   grain-train + grain-test

   - Which classifier performs better?
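The dictionary-building step behind exercises 5.1-5.3 can be sketched in plain Python. This simplified re-implementation is not Weka's StringToWordVector (whose defaults differ in tokenization and other options, so its attribute counts need not match these); it only shows how the minTermFreq option shrinks the vocabulary. The documents are the mini-train texts above.

```python
# Sketch of the bag-of-words step StringToWordVector performs, with the
# minTermFreq option from exercises 5.1-5.3: a term must occur at least
# min_term_freq times across the corpus to become an attribute.
# Simplified, whitespace-only tokenization; not Weka's implementation.
from collections import Counter

def vocabulary(docs, min_term_freq=1):
    """Terms occurring at least min_term_freq times across all docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return sorted(t for t, c in counts.items() if c >= min_term_freq)

def to_vectors(docs, vocab):
    """Binary presence/absence vectors, one per document."""
    return [[1 if term in doc.split() else 0 for term in vocab]
            for doc in docs]

if __name__ == "__main__":
    mini_train = [
        "the price of crude oil has increased significantly",
        "demand for crude oil outstrips supply",
        "some people do not like the flavor of olive oil",
        "the food was very oily",
        "crude oil is in short supply",
        "use a bit of cooking oil in the frying pan",
    ]
    v1 = vocabulary(mini_train, min_term_freq=1)
    v2 = vocabulary(mini_train, min_term_freq=2)
    print(len(v1), "attributes at min_term_freq=1")   # 32 attributes
    print(len(v2), "attributes at min_term_freq=2:", v2)  # 6 attributes
```

Raising minTermFreq prunes rare words, leaving only terms such as "crude", "oil", and "supply" that recur across documents; that smaller, denser representation is what the J48 tree in exercise 5.2 is built on.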
Friday, June 15, 2012
weka tutorial test 2