Using the Explorer's Preprocess, Classify, and Select attributes panels of the Weka machine learning workbench,
answer the following questions:
Discretization (see the Weka book, Ch. 17, Sec. 4.5)
1. For the diabetes dataset, which of the following discretization methods
best improves the J48 classifier's accuracy on unseen data?
unsupervised Discretize (makeBinary=false)
unsupervised Discretize (makeBinary=true)
supervised Discretize (makeBinary=false)
supervised Discretize (makeBinary=true)
Hint: estimate accuracy with 10-fold cross-validation.
supervised Discretize must be tested through FilteredClassifier + J48 to be fair (no double-dipping on the test data).
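The fair setup above can also be reproduced outside the GUI. A sketch of the equivalent command-line configuration, assuming weka.jar is on the classpath and diabetes.arff is in the working directory (-t names the training file, -x 10 requests 10-fold cross-validation):

```
java weka.classifiers.meta.FilteredClassifier \
  -t diabetes.arff -x 10 \
  -F "weka.filters.supervised.attribute.Discretize -R first-last" \
  -W weka.classifiers.trees.J48
```

Because the filter sits inside FilteredClassifier, the discretization cut points are recomputed from the training folds only, so the test fold never influences them. Adding -D to the Discretize options should correspond to makeBinary=true.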
Removing redundant attributes (see the Weka book, Ch. 17, Sec. 4.11)
2. For the diabetes dataset, copy the first attribute twice, giving three mutually dependent attributes.
Of the following three attribute selection methods, which best removes the redundant copies, and which yields the highest accuracy after careful selection?
A. InfoGainAttributeEval + Ranker (keep only 8 attributes)
B. CfsSubsetEval + BestFirst
C. WrapperSubsetEval + NaiveBayes + BestFirst
Using AttributeSelectedClassifier + NaiveBayes, fill in the table below and answer the question:

Method                                          Accuracy   Selected attributes
A. InfoGainAttributeEval + Ranker
B. CfsSubsetEval + BestFirst
C. WrapperSubsetEval + NaiveBayes + BestFirst
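The three setups can be entered in the Explorer as scheme strings roughly like the following (a sketch; the Ranker's -N option limits the number of retained attributes, and WrapperSubsetEval's -B option names the wrapped base classifier):

```
weka.classifiers.meta.AttributeSelectedClassifier \
  -E "weka.attributeSelection.InfoGainAttributeEval" \
  -S "weka.attributeSelection.Ranker -N 8" \
  -W weka.classifiers.bayes.NaiveBayes

weka.classifiers.meta.AttributeSelectedClassifier \
  -E "weka.attributeSelection.CfsSubsetEval" \
  -S "weka.attributeSelection.BestFirst" \
  -W weka.classifiers.bayes.NaiveBayes

weka.classifiers.meta.AttributeSelectedClassifier \
  -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.bayes.NaiveBayes" \
  -S "weka.attributeSelection.BestFirst" \
  -W weka.classifiers.bayes.NaiveBayes
```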
Parameter tuning (see the Weka book, Ch. 17, Sec. 4.12)
3. For the diabetes dataset, what is the best number of neighbours k for the nearest-neighbour classifier IBk?
Hint: use weka.classifiers.meta.CVParameterSelection
to vary the number of neighbours k from 1 to 10 in 10 steps.
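A sketch of the corresponding scheme configuration (the string passed to -P means: vary option -K from 1 to 10 in 10 steps):

```
weka.classifiers.meta.CVParameterSelection \
  -P "K 1 10 10" \
  -W weka.classifiers.lazy.IBk
```

CVParameterSelection evaluates each candidate k by internal cross-validation on the training data and trains the final IBk with the winning value.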
Text datasets (see the Weka book, Ch. 17, Sec. 5.4)
4. For the news dataset below, find the classifier (including different parameter settings) with the highest accuracy for identifying grain-related news.
Training set: ReutersGrain-train.arff
Test set: ReutersGrain-test.arff
List the classifiers you evaluated, including parameter variations, ordered from highest to lowest accuracy.
Hint: vectorize the articles with the StringToWordVector filter using its default parameters.
StringToWordVector must be applied through FilteredClassifier to be fair (no double-dipping on the test data).
Report accuracies obtained with the supplied-test-set option; list at least 3 results.
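A sketch of one such evaluation run, assuming weka.jar is on the classpath (-t and -T name the training and supplied test sets; swapping the -W classifier, e.g. for weka.classifiers.trees.J48, builds up the comparison):

```
java weka.classifiers.meta.FilteredClassifier \
  -t ReutersGrain-train.arff -T ReutersGrain-test.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector" \
  -W weka.classifiers.bayes.NaiveBayesMultinomial
```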
References:
1. Weka software download (includes all datasets):
http://www.cs.waikato.ac.nz/~ml/weka/index_downloading.html
2. Witten, Frank & Hall (2011), Data Mining: Practical Machine Learning Tools and Techniques,
Morgan Kaufmann, Chapter 17: Tutorial Exercises for the Weka Explorer.
Explorer::Classify
Discretization
4.1.glass by unsupervised Discretize
weka.filters.unsupervised.attribute.Discretize
equal-width (default) vs. equal-frequency
Inspect the per-interval instance distributions produced by equal-width discretization.
With equal-frequency discretization, the interval distributions of some attributes are still very skewed; why?
Which discretized attributes look useful for prediction?
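The two variants differ only in one filter flag; a sketch of the scheme strings, assuming -B sets the number of bins and -F switches on equal-frequency binning:

```
weka.filters.unsupervised.attribute.Discretize -B 10 -R first-last      (equal-width, default)
weka.filters.unsupervised.attribute.Discretize -B 10 -F -R first-last   (equal-frequency)
```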
4.2-4.3.glass by supervised Discretize
weka.filters.supervised.attribute.Discretize
Is class-distribution consistency maintained across the intervals?
--
Some attributes are still kept as a single interval, with no split made; why?
4.4.glass by supervised/unsupervised Discretize
Pick either discretization filter, enable its makeBinary option to create binary attributes, and inspect the result.
What do the resulting binary attributes mean?
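For example, enabling makeBinary on the supervised filter corresponds to adding its -D flag (a sketch):

```
weka.filters.supervised.attribute.Discretize -R first-last -D
```

Each discretized attribute with more than two intervals is then replaced by binary indicator attributes, letting a learner use the cut points independently.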
4.5.ionosphere by unsupervised Discretize + J48
Fill in the table below:

                                                     CV accuracy   Tree size (nodes)
(1) raw data
(2) unsupervised discretization (makeBinary=false)
(3) unsupervised discretization (makeBinary=true)
4.6.-4.7.ionosphere by FilteredClassifier
Filter: supervised Discretize + classifier: J48

                                                     CV accuracy   Tree size (nodes)
(4) supervised discretization (makeBinary=false)
(5) supervised discretization (makeBinary=true)
--
Why does the decision tree predict better on the discretized data than on the raw data?
Explorer::Attribute Selection
filter:
weka.attributeSelection.CfsSubsetEval
weka.attributeSelection.BestFirst
weka.attributeSelection.InfoGainAttributeEval
weka.attributeSelection.Ranker
wrapper:
weka.attributeSelection.WrapperSubsetEval
weka.classifiers.meta.AttributeSelectedClassifier
4.8.labor by InfoGainAttributeEval + Ranker
Use information gain to find the 4 most important attributes of the labor dataset.

                                      Selected attributes
InfoGainAttributeEval + Ranker
4.9.labor by CfsSubsetEval+BestFirst / WrapperSubsetEval+J48+BestFirst
                                      Selected attributes
CfsSubsetEval + BestFirst
WrapperSubsetEval + J48 + BestFirst
-
Which attributes are selected by both methods?
How do the attributes chosen by both methods relate to those chosen by InfoGainAttributeEval + Ranker?
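In the Select attributes panel, the two runs of exercise 4.9 correspond roughly to the following evaluator/search settings (a sketch; WrapperSubsetEval's -B option names the wrapped classifier and -F the number of internal cross-validation folds):

```
evaluator: weka.attributeSelection.CfsSubsetEval
search:    weka.attributeSelection.BestFirst

evaluator: weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.J48 -F 5
search:    weka.attributeSelection.BestFirst
```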
4.10.diabetes by NaiveBayes
Enable useSupervisedDiscretization=true in NaiveBayes.

Copies of attribute 1     0    1    2    3    4
Accuracy
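The duplicate attributes can be created in the Preprocess panel with the Copy filter, applied once per extra copy (a sketch; -R 1 selects the first attribute):

```
weka.filters.unsupervised.attribute.Copy -R 1
```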
4.11.diabetes by AttributeSelectedClassifier
Classifier: NaiveBayes
Attribute selection methods:
InfoGainAttributeEval + Ranker (keep only 8 attributes)
CfsSubsetEval + BestFirst
WrapperSubsetEval + NaiveBayes + BestFirst
Observe how the three attribute selection methods behave when an attribute is duplicated several times.
Can they successfully remove the duplicated attributes? If not, why?
                                               Accuracy   Selected attributes
InfoGainAttributeEval + Ranker
CfsSubsetEval + BestFirst
WrapperSubsetEval + NaiveBayes + BestFirst
Parameter tuning
weka.classifiers.meta.CVParameterSelection
4.12.diabetes by CVParameterSelection+IBk
Vary IBk's number of neighbours K from 1 to 10 in 10 steps.

             CV accuracy
IBk(k=1)
IBk(k=2)
IBk(k=3)
IBk(k=4)
IBk(k=5)
IBk(k=6)
IBk(k=7)
IBk(k=8)
IBk(k=9)
IBk(k=10)
--
Which k value is selected?
4.13.diabetes by CVParameterSelection+J48
Minimum number of instances per leaf: M = 1 to 10 in 10 steps
Pruning confidence: C = 0.1 to 0.5 in 5 steps
Does the accuracy change? What is the tree size, and which M and C are selected?
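Both parameters can be searched at once by giving CVParameterSelection two -P entries, which makes it try every combination on the resulting 10 x 5 grid (a sketch):

```
weka.classifiers.meta.CVParameterSelection \
  -P "M 1 10 10" \
  -P "C 0.1 0.5 5" \
  -W weka.classifiers.trees.J48
```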
Document classification
weka.filters.unsupervised.attribute.StringToWordVector
mini-train training documents:
class text
yes the price of crude oil has increased significantly
yes demand for crude oil outstrips supply
no some people do not like the flavor of olive oil
no the food was very oily
yes crude oil is in short supply
no use a bit of cooking oil in the frying pan
mini-test test documents:
class text
? oil platforms extract crude oil
? canola oil is supposed to be healthy
? iraq has significant oil reserves
? there are different types of cooking oil
5.1.-5.3.mini-xx by FilteredClassifier
Filter: StringToWordVector + classifier: J48
mini-train + mini-test
For the training documents mini-train:
With StringToWordVector's default options, how many attributes are generated?
After changing the minTermFreq option to 2, how many attributes are generated?
--
Using the data generated with minTermFreq=2, build a J48 decision tree.
--
Use the above decision tree to classify the mini-test documents.
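The whole 5.1-5.3 pipeline can be expressed as one scheme configuration (a sketch, assuming StringToWordVector's -M option sets the minimum term frequency):

```
weka.classifiers.meta.FilteredClassifier \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -M 2" \
  -W weka.classifiers.trees.J48
```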
Real documents
5.4.reutersxx_yy.arff by J48 and NaiveBayesMultinomial
ReutersCorn-train.arff + ReutersCorn-test.arff
ReutersGrain-train.arff + ReutersGrain-test.arff
weka.classifiers.meta.FilteredClassifier
Fill in the table below:

Predictive accuracy          J48    NaiveBayesMultinomial
corn-train + corn-test
grain-train + grain-test
-
Which classifier performs better?
weka tutorial test 2