
Use of Weka for sentiment analysis in traditional Chinese

Most text-classification examples for the Weka machine-learning toolkit found online target English; the following demonstrates a workflow for Traditional Chinese.

Because labelled corpora in Traditional Chinese are scarce, this example uses the Simplified Chinese labelled corpus released with SnowNLP. The SnowNLP sentiment dataset contains 34,880 (35k) everyday-chat sentences: 18,576 (19k) negative (neg.txt) and 16,304 (16k) positive (pos.txt). Preprocessing converts the text to Traditional Chinese with OpenCC, segments it into space-delimited tokens with jieba, and saves the result as a .csv file with three columns: the raw text text, the space-delimited tokens token_text, and the sentiment label sentiment, where 0 means negative and 1 means positive.
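The preprocessing step can be sketched in Python. In this sketch `to_traditional` and `segment` are hypothetical stand-ins for OpenCC and jieba (an identity map and a character splitter, respectively); the real pipeline would swap in those libraries.

```python
import csv
import io

# Hypothetical stand-ins for the real tools: OpenCC's s2t conversion and
# jieba.cut are replaced by an identity map and a character splitter here.
def to_traditional(text):
    return text

def segment(text):
    return list(text)

def build_rows(samples):
    """samples: iterable of (raw_text, sentiment) pairs; sentiment is 0 or 1."""
    rows = [("text", "token_text", "sentiment")]
    for raw, label in samples:
        trad = to_traditional(raw)
        rows.append((trad, " ".join(segment(trad)), label))
    return rows

# write the three-column .csv described above
buf = io.StringIO()
csv.writer(buf).writerows(build_rows([("很好", 1), ("不好", 0)]))
print(buf.getvalue())
```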

The .csv file is loaded into Weka, where the StringToWordVector filter converts the token_text column from a String attribute into many nominal word attributes. With the sentiment column as the prediction target, a single train/test run is performed using the Percentage split option: 66% for training (23k) and 34% for testing (12k).
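Conceptually, StringToWordVector turns each space-delimited token string into one presence attribute per distinct word. A minimal sketch of that idea (not Weka's actual implementation, which also offers word counts, TF-IDF weighting, and vocabulary pruning):

```python
def to_word_vectors(token_texts):
    """Map space-delimited token strings to binary word-presence vectors."""
    vocab = sorted({w for t in token_texts for w in t.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in token_texts:
        row = [0] * len(vocab)
        for w in t.split():
            row[index[w]] = 1          # 1 if the word occurs in this text
        vectors.append(row)
    return vocab, vectors

vocab, vecs = to_word_vectors(["服務 很 好", "服務 不好"])
print(vocab)   # one attribute per distinct token
print(vecs)
```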

All classifiers were run with their default parameters. Accuracy, training time, and testing time are listed below; RandomForest achieves the best accuracy.

SnowNLP classifiers              accuracy (%)  training time (s)  testing time (s)
weka.classifiers.rules
  ZeroR                             52.80              0.03              0.44
  OneR                              62.10             12.23              0.38

weka.classifiers.trees
  J48                               85.00          2,694.75              0.37
  RandomForest                      93.64            459.36              8.94

weka.classifiers.bayes
  NaiveBayesSimple                   8.14              5.43              7.35
  NaiveBayesMultinomialText         52.80              0.14              0.28
  NaiveBayes                        70.60             14.10              9.77
  NaiveBayesMultinomial             78.50              0.14              0.94

weka.classifiers.functions
  MLP                               52.80        405,797.00             83.62
  SimpleLogistic                    81.90          1,101.32              1.39
  Logistic                          83.00            294.00              0.86
  SMO                               83.20          5,231.83              1.94

weka.classifiers.lazy
  IB1                               86.20              1.80         23,146.80
  IBk                               86.70              0.04             99.95
 

weka.classifiers.functions.MultilayerPerceptron

weka.classifiers.functions.MultilayerPerceptron is a multilayer perceptron learner.
It uses a neural-network structure with an input layer, hidden layers, and an output layer, learns the connection weights between layers by backpropagation, and produces a class or numeric prediction at the output layer.
Nominal attributes are binarized before learning and prediction.

Parameters:
 -L <learning rate> the fraction of the gradient used to update the weights (the learning rate), in [0,1]. Default 0.3.

 -M <momentum> the fraction of the previous weight change added to the current update (the momentum), in [0,1]. Default 0.2.

 -N <number of epochs> number of training epochs. Default 500.

 -V <percentage size of validation set> size of the validation set used to stop training once performance keeps degrading, in [0,100]. Default 0.

 -S <seed> seed for the random-number generator; must be >= 0. Default 0.

 -E <threshold for number of consecutive errors> number of consecutive validation-set errors allowed before the network terminates; must be > 0. Default 20.

 -G bring up a GUI; off by default.

 -A do not automatically create the network connections; only effective together with the GUI (-G).

 -B do not automatically apply the NominalToBinary attribute filter; applied by default.

 -H <comma separated numbers for nodes on each layer> the hidden layers to create.
       0 means no hidden layer; the default is a. Besides a plain number, each
       per-layer node count may be a wildcard: i = number of input attributes,
       o = number of output classes, t = i + o, a = the average of i and o.

 -C do not normalize a numeric class output; normalized by default.

 -I do not normalize the input attributes; normalized by default, with nominal attributes mapped into [-1,1].

 -R do not allow the network to be reset.

 -D learning rate decays over time; no decay by default.

>java -cp weka.jar;. weka.classifiers.functions.MultilayerPerceptron -t data\weather.numeric.arff
      -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a

Options: -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a


Sigmoid Node 0
    Inputs    Weights
    Threshold    -3.248835441689124
    Node 2    5.706344521860183
    Node 3    2.443270263208691
    Node 4    2.6425576499015655
    Node 5    2.5103414057156117
Sigmoid Node 1
    Inputs    Weights
    Threshold    3.247940047055843
    Node 2    -5.7047440571074866
    Node 3    -2.3959635449403223
    Node 4    -2.61941341516743
    Node 5    -2.57892674553124
Sigmoid Node 2
    Inputs    Weights
    Threshold    -1.4298110453038173
    Attrib outlook=sunny    1.2796074137730873
    Attrib outlook=overcast    2.5993304643376662
    Attrib outlook=rainy    -2.482189408449902
    Attrib temperature    -0.991784436689735
    Attrib humidity    -4.132575972523981
    Attrib windy    -0.8030823939514043
Sigmoid Node 3
    Inputs    Weights
    Threshold    -0.7740672340804496
    Attrib outlook=sunny    -1.9100370742566128
    Attrib outlook=overcast    2.3822068707682824
    Attrib outlook=rainy    0.2349921312574373
    Attrib temperature    -0.8639638424331715
    Attrib humidity    -0.8117295111072012
    Attrib windy    3.0923597946788437
Sigmoid Node 4
    Inputs    Weights
    Threshold    -0.7812523749731839
    Attrib outlook=sunny    -2.0149350612947305
    Attrib outlook=overcast    2.4850160661055654
    Attrib outlook=rainy    0.2429746779978898
    Attrib temperature    -0.9010443938018432
    Attrib humidity    -0.8326891162034927
    Attrib windy    3.255120039808521
Sigmoid Node 5
    Inputs    Weights
    Threshold    -0.7574102682219431
    Attrib outlook=sunny    -1.9605922799976891
    Attrib outlook=overcast    2.481930135373603
    Attrib outlook=rainy    0.2838381715677166
    Attrib temperature    -0.8613350411165092
    Attrib humidity    -0.775628050353589
    Attrib windy    3.169910152935346
Class yes
    Input
    Node 0
Class no
    Input
    Node 1

Note: the learned network is shown in the figure below; the two yellow nodes are, top to bottom, Node 0 and Node 1,
   and the four red nodes are, top to bottom, Node 2, Node 3, Node 4, and Node 5.
   The hidden layer has 4 nodes because of the parameter -H a, where a = (i + o)/2 = (6 + 2)/2 = 4.
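The -H wildcard codes can be resolved with a small helper; a minimal sketch, assuming integer halving for a:

```python
def hidden_nodes(code, n_inputs, n_classes):
    """Resolve one -H layer spec: i=inputs, o=classes, t=i+o, a=(i+o)//2,
    or a literal number of nodes."""
    table = {"i": n_inputs, "o": n_classes,
             "t": n_inputs + n_classes, "a": (n_inputs + n_classes) // 2}
    return table.get(code, None) if code in table else int(code)

# weather.numeric after binarizing outlook: 6 inputs, 2 classes -> a = 4
print(hidden_nodes("a", 6, 2))
```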
Time taken to build model: 0.1 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0.036
Root mean squared error                  0.0454
Relative absolute error                  7.7533 %
Root relative squared error              9.4618 %
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Stratified cross-validation ===

Correctly Classified Instances          11               78.5714 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.5116
Mean absolute error                      0.265
Root mean squared error                  0.4627
Relative absolute error                 55.6497 %
Root relative squared error             93.7923 %
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 8 1 | a = yes
 2 3 | b = no

The 14 instances of the weather.numeric.arff dataset below use 2 nominal attributes and 2 numeric attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1.weka.classifiers.functions.MultilayerPerceptron code | doc

weka.classifiers.bayes.BayesNet

weka.classifiers.bayes.BayesNet is a Bayesian-network learner.
It can cope with dependencies between attributes, learning a Bayesian-network structure and its probability tables for class prediction.
Numeric attributes are discretized before learning.

Parameters:
 -B <BIF file> a Bayesian-network description file (.bif extension) used for structure comparison. Default: none.

 -D  do not use the ADTree data structure; saves memory but runs slower. Used by default, which costs more memory but runs faster.

 -Q <weka.classifiers.bayes.net.search.searchAlgorithm> structure-learning algorithm.
    -- conditional-independence methods
     weka.classifiers.bayes.net.search.ci.CISearchAlgorithm
     weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm
    -- fixed structures
     weka.classifiers.bayes.net.search.fixed.FromFile  structure from an external file
     weka.classifiers.bayes.net.search.fixed.NaiveBayes  naive Bayes structure
    -- global methods; the default -S LOO-CV scores candidates by leave-one-out cross-validation
     weka.classifiers.bayes.net.search.global.GeneticSearch
     weka.classifiers.bayes.net.search.global.HillClimber
     weka.classifiers.bayes.net.search.global.K2
     weka.classifiers.bayes.net.search.global.SimulatedAnnealing
     weka.classifiers.bayes.net.search.global.TabuSearch
     weka.classifiers.bayes.net.search.global.TAN
    -- local methods; the default -S BAYES scores candidates by the Bayes metric
     weka.classifiers.bayes.net.search.local.GeneticSearch
     weka.classifiers.bayes.net.search.local.HillClimber
     weka.classifiers.bayes.net.search.local.K2 (-P 1 limits each node to 1 parent)
     weka.classifiers.bayes.net.search.local.SimulatedAnnealing
     weka.classifiers.bayes.net.search.local.TabuSearch
     weka.classifiers.bayes.net.search.local.TAN
     Default: weka.classifiers.bayes.net.search.local.K2.

 -E <weka.classifiers.bayes.net.estimate.estimateAlgorithm> probability-table learning algorithm.
     weka.classifiers.bayes.net.estimate.BayesNetEstimator
     weka.classifiers.bayes.net.estimate.BMAEstimator
     weka.classifiers.bayes.net.estimate.MultinomialBMAEstimator
     weka.classifiers.bayes.net.estimate.SimpleEstimator (-A 0.5 sets the prior count to 0.5)
     Default: weka.classifiers.bayes.net.estimate.SimpleEstimator.
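As a rough sketch of the role of SimpleEstimator's -A 0.5, the prior acts like a fractional count added to every cell of a conditional-probability table before normalizing (illustrative only; Weka's internals may differ in detail):

```python
def estimate(counts, alpha=0.5):
    """Estimate P(value | parent config) from raw counts with prior count alpha."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# e.g. outlook given play=yes in the weather data:
# counts for sunny/overcast/rainy are 2, 4, 3
print(estimate([2, 4, 3]))
```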


>java -cp weka.jar;. weka.classifiers.bayes.BayesNet -t data\weather.nominal.arff
    -D 
    -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES 
    -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5

Options: -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5


Bayes Network Classifier
not using ADTree
#attributes=5 #classindex=4
Network structure (nodes followed by parents)
outlook(3): play
temperature(3): play
humidity(2): play
windy(2): play
play(2):
LogScore Bayes: -69.07317135664013
LogScore BDeu: -83.46880542273107
LogScore MDL: -82.71568504897063
LogScore ENTROPY: -65.56181240647145
LogScore AIC: -78.56181240647145


Time taken to build model: 0.02 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %
Kappa statistic                          0.8372
Mean absolute error                      0.2615
Root mean squared error                  0.3242
Relative absolute error                 56.3272 %
Root relative squared error             67.6228 %
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no

=== Stratified cross-validation ===

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %
Kappa statistic                         -0.0244
Mean absolute error                      0.415
Root mean squared error                  0.4909
Relative absolute error                 87.1501 %
Root relative squared error             99.5104 %
Total Number of Instances               14

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 4 1 | b = no

The 14 instances of the weather.nominal.arff dataset below comprise 9 yes and 5 no.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
rainy cool normal TRUE no
sunny mild high FALSE no
rainy mild high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
overcast cool normal TRUE yes
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
References: 1.weka.classifiers.bayes.BayesNet code | doc 2.weka.classifiers.bayes.net.search code | doc 3.weka.classifiers.bayes.net.estimate code | doc

weka.classifiers.bayes.NaiveBayesSimple

weka.classifiers.bayes.NaiveBayesSimple is a simplified naive-Bayes learner.
It records each class's prior probability and the conditional probability of each attribute value given the class,
then, for a given instance, multiplies these to obtain the posterior probability of each class given the attribute values and predicts the most probable class.
It gives a decent baseline figure for benchmarking a dataset.

When learning, NaiveBayesSimple tallies, for every class, the class prior probability and the conditional probability of each attribute value given the class.
For numeric attributes it assumes a normally distributed population and records the mean and standard deviation, from which the conditional probabilities are estimated.
At prediction time, it multiplies these for the new instance to obtain the posterior probability (posterior probability) of each class given the attribute values and predicts the most probable class.
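The figures in the model printout below can be reproduced from the weather data. This sketch assumes a Laplace-corrected class prior (9+1)/(14+2) = 0.625, which matches the printed P(C), and sample mean/standard deviation for the Gaussian conditional:

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

# temperatures of the 9 play=yes instances in weather.numeric.arff
yes_temp = [83, 70, 68, 64, 69, 75, 75, 72, 81]

prior_yes = (9 + 1) / (14 + 2)              # Laplace-corrected prior: 0.625
mu, sigma = mean(yes_temp), stdev(yes_temp)  # Gaussian parameters

def gaussian(x, mu, sigma):
    """Normal density used as the conditional probability of a numeric value."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(prior_yes)            # 0.625, as printed for Class yes
print(mu, round(sigma, 6))  # 73  6.164414, as printed for Attribute temperature
```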

Parameters: none.

Source: R. Duda and P. Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.

>java -cp simpleEducationalLearningSchemes.jar;weka.jar;. 
   weka.classifiers.bayes.NaiveBayesSimple -t data\weather.numeric.arff


Naive Bayes (simple)

Class yes: P(C) = 0.625

Attribute outlook
sunny   overcast        rainy
0.25            0.41666667      0.33333333

Attribute temperature
Mean: 73        Standard Deviation: 6.164414

Attribute humidity
Mean: 79.11111111       Standard Deviation: 10.21572861

Attribute windy
TRUE    FALSE
0.36363636      0.63636364



Class no: P(C) = 0.375

Attribute outlook
sunny   overcast        rainy
0.5             0.125           0.375

Attribute temperature
Mean: 74.6      Standard Deviation: 7.8930349

Attribute humidity
Mean: 86.2      Standard Deviation: 9.7313925

Attribute windy
TRUE    FALSE
0.57142857      0.42857143


Time taken to build model: 0.77 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %
Kappa statistic                          0.8372
Mean absolute error                      0.3003
Root mean squared error                  0.3431
Relative absolute error                 64.6705 %
Root relative squared error             71.5605 %
Total Number of Instances               14


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.200    0.900      1.000    0.947      0.849    0.933     0.963     yes
                 0.800    0.000    1.000      0.800    0.889      0.849    0.933     0.925     no
Weighted Avg.    0.929    0.129    0.936      0.929    0.926      0.849    0.933     0.949


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %
Kappa statistic                         -0.0244
Mean absolute error                      0.4699
Root mean squared error                  0.5376
Relative absolute error                 98.6856 %
Root relative squared error            108.9683 %
Total Number of Instances               14


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.778    0.800    0.636      0.778    0.700      -0.026   0.444     0.636     yes
                 0.200    0.222    0.333      0.200    0.250      -0.026   0.444     0.398     no
Weighted Avg.    0.571    0.594    0.528      0.571    0.539      -0.026   0.444     0.551


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 4 1 | b = no


The 14 instances of the weather.numeric.arff dataset below use 2 nominal attributes and 2 numeric attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1.weka.classifiers.bayes.NaiveBayesSimple code | doc 2.Since Weka 3.7.2, the NaiveBayesSimple class has moved from the main weka.jar into the optional simpleEducationalLearningSchemes.jar package; install it via Tools/Package Manager/Search: simpleEducationalSchemes/Install. On Windows, downloaded packages are stored in the C:\Users\<username>\wekafiles\packages\ folder. simpleEducationalSchemes contains four simple classifiers: IB1, Prism, Id3, and NaiveBayesSimple.

weka.classifiers.rules.DecisionTable

 
weka.classifiers.rules.DecisionTable is a decision-table learner, applicable to class or numeric prediction.
By default it uses best-first (hill-climbing) search, scored by cross-validated accuracy or root mean squared error, to find the best attribute subset.
The training instances are then reduced to descriptions using only the retained attributes.
Each instance is treated as a rule whose antecedent is its retained attribute values and whose consequent is the majority class or the mean value.
At prediction time, if a new instance matches some rule's antecedent, the rule's consequent is the prediction.
If the decision table does not cover the new instance, the k-nearest-neighbour method or a fallback majority vote is used instead.
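Prediction from a learned table then reduces to a dictionary lookup with a fallback. A toy sketch (the table below is made up for illustration; it is not the one learned in the run that follows, which retains no attributes at all):

```python
def decision_table_predict(table, instance, majority):
    """Look up the instance's selected-attribute values; fall back to the
    majority class when the table does not cover the instance."""
    key = tuple(instance[a] for a in table["attrs"])
    return table["rules"].get(key, majority)

# hypothetical table over the single selected attribute 'outlook'
table = {"attrs": ["outlook"],
         "rules": {("overcast",): "yes", ("rainy",): "yes", ("sunny",): "no"}}
majority = "yes"

print(decision_table_predict(table, {"outlook": "sunny"}, majority))   # covered
print(decision_table_predict(table, {"outlook": "foggy"}, majority))   # fallback
```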

參數說明:
-X  crossVal: [1] 交叉驗證切割組數,1表只保留一測試案例,餘供訓練
-I  useIBk: [false] 遇未涵蓋新案例,使用k最近鄰居法,否則使用多數決法
-R  displayRules: [false] 列印決策表
-E  evaluationMeasure: [acc 或 rmse] 最佳指標遇類別採用準確率,遇數值採用均方差
                       其他指標還有  mae , auc
-S  search: [weka.attributeSelection.BestFirst] 子集合搜尋策略

  -- 以下為搜尋策略的參數 --

-D  direction: [1] 0表向後屬性變少,1表向前屬性變多,3表雙向
-S  lookupCacheSize: [1] 保留候選子集合的個數為案例集屬性個數的多少倍
-N  searchTermination: [5] 放棄搜尋前,能忍受指標無進步之試走步數
-P  startSet: [] 找尋初始點的屬性子集合,預設為空集合

Reference:
    Kohavi, ECML-95, "The Power of Decision Tables"

> java weka.classifiers.rules.DecisionTable -R -t data\weather.nominal.arff


Options: -R 

Decision Table:

Number of training instances: 14
Number of Rules : 1
Non matches covered by Majority class.
 Best first.
 Start set: no attributes
 Search direction: forward
 Stale search after 5 node expansions
 Total number of subsets evaluated: 12
 Merit of best subset found:   64.286
Evaluation (for feature selection): CV (leave one out) 
Feature set: 5

Rules:
================
play  
================
yes
================



Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0     
Mean absolute error                      0.4524
Root mean squared error                  0.4797
Relative absolute error                 97.4359 %
Root relative squared error            100.0539 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           6               42.8571 %
Incorrectly Classified Instances         8               57.1429 %
Kappa statistic                         -0.3659
Mean absolute error                      0.5318
Root mean squared error                  0.5583
Relative absolute error                111.6786 %
Root relative squared error            113.1584 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 6 3 | a = yes
 5 0 | b = no

The 14 instances of the weather.nominal.arff dataset below comprise 9 yes and 5 no.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
rainy cool normal TRUE no
sunny mild high FALSE no
rainy mild high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
overcast cool normal TRUE yes
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
References:
1.weka.classifiers.rules.DecisionTable
   code | doc

weka.classifiers.functions.Winnow

weka.classifiers.functions.Winnow is a mistake-driven learner.
It handles only nominal attributes, which it converts to binary ones, and predicts a binary class value. It can learn incrementally online.
It suits datasets with many attributes of which most are irrelevant to the prediction, since it quickly homes in on the relevant ones.

Given an instance with attributes (a0, a1, ..., ak), threshold theta, weight-promotion factor alpha, weight-demotion factor beta,
and weight vector (w0, w1, ..., wk) or (w0+ - w0-, w1+ - w1-, ..., wk+ - wk-),
where all quantities are positive and the augmented attribute a0 is always 1, there are two prediction rules:

  Unbalanced version: every weight component must be positive
     w0 * a0 + w1 * a1 + ... + wk * ak > theta predicts class 1; otherwise class 2

  Balanced version: weight components may be negative
    (w0+ - w0-) * a0 + (w1+ - w1-) * a1 + ... + (wk+ - wk-) * ak > theta predicts class 1; otherwise class 2

On a prediction mistake during learning, the weight vector is adjusted as follows:
  class 2 predicted as class 1:   w *= beta,  or w+ *= beta  and w- *= alpha, to shrink the weights
  class 1 predicted as class 2:   w *= alpha, or w+ *= alpha and w- *= beta,  to grow the weights
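The rules above, in a minimal unbalanced-Winnow sketch (the augmented attribute is treated here as an ordinary always-active input; only attributes active in the instance are promoted or demoted):

```python
def winnow_predict(w, a, theta):
    """Unbalanced Winnow: fire class 1 when the weighted sum exceeds theta."""
    return 1 if sum(wi * ai for wi, ai in zip(w, a)) > theta else 2

def winnow_update(w, a, predicted, actual, alpha=2.0, beta=0.5):
    """Promote active attributes' weights on a missed class 1,
    demote them on a false class 1; leave inactive attributes untouched."""
    if predicted == actual:
        return w
    factor = alpha if actual == 1 else beta
    return [wi * factor if ai else wi for wi, ai in zip(w, a)]

w = [2.0, 2.0, 2.0]          # -W 2.0 initial weights
a = [1, 1, 0]                # binary instance
p = winnow_predict(w, a, theta=3.0)
w = winnow_update(w, a, p, actual=2)   # false alarm: demote active weights
print(p, w)
```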

Parameters:
 -L  use the balanced version. Default false
 -I  number of passes over the training set when learning the weights. Default 1
 -A  weight-promotion factor alpha; must be > 1. Default 2.0
 -B  weight-demotion factor beta; must be < 1. Default 0.5
 -H  prediction threshold theta. Default -1, meaning the number of attributes
 -W  initial weight value; must be > 0. Default 2.0
 -S  random seed, which affects the order in which training instances are presented. Default 1


> java  weka.classifiers.functions.Winnow  -t data\weather.nominal.arff


Winnow

Attribute weights

w0 8.0
w1 1.0
w2 2.0
w3 4.0
w4 2.0
w5 2.0
w6 1.0
w7 1.0

Cumulated mistake count: 7


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.3778
Mean absolute error                      0.2857
Root mean squared error                  0.5345
Relative absolute error                 61.5385 %
Root relative squared error            111.4773 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 2 3 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.2564
Mean absolute error                      0.5   
Root mean squared error                  0.7071
Relative absolute error                105      %
Root relative squared error            143.3236 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 5 0 | b = no

The 14 instances of the weather.nominal.arff dataset below use 4 nominal attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no
References: 1.weka.classifiers.functions.Winnow code | doc

weka.classifiers.functions.VotedPerceptron

weka.classifiers.functions.VotedPerceptron is a voted perceptron, a mistake-driven learner.
It first globally replaces missing values, then converts nominal attributes to binary ones. It predicts a binary class value and can learn incrementally online.

Given an instance with attributes a = (a0, a1, ..., ak) and weight vector w = (w0, w1, ..., wk),
where the attribute values are binary (0 or 1) and the augmented attribute a0 is always 1,
the prediction rule is
  w0 * a0 + w1 * a1 + ... + wk * ak > 0 predicts class 1; otherwise class 2

On a prediction mistake during learning, the weight vector is adjusted as follows:
  class 2 predicted as class 1:   w -= a, shrinking the weights
  class 1 predicted as class 2:   w += a, growing the weights
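The same rules in a minimal sketch (the "voted" part, where every intermediate weight vector votes in proportion to how long it survived, is omitted here):

```python
def perceptron_predict(w, a):
    """Class 1 when w·a > 0, else class 2 (a[0] is the always-1 bias input)."""
    return 1 if sum(wi * ai for wi, ai in zip(w, a)) > 0 else 2

def perceptron_update(w, a, predicted, actual):
    """Mistake-driven rule: add the instance on a missed class 1,
    subtract it on a false class 1."""
    if predicted == actual:
        return w
    sign = 1 if actual == 1 else -1
    return [wi + sign * ai for wi, ai in zip(w, a)]

w = [0, 0, 0]
a = [1, 1, 0]                        # a[0] = 1 is the augmented attribute
p = perceptron_predict(w, a)         # 0 > 0 is false, so class 2
w = perceptron_update(w, a, p, actual=1)
print(p, w)
```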

Parameters:
 -I  number of passes over the training set when learning the weights. Default 1
 -E  exponent of the polynomial kernel. Default 1
 -S  random seed, which affects the order in which training instances are presented. Default 1
 -M  maximum number of weight corrections allowed. Default 10000

> java  weka.classifiers.functions.VotedPerceptron  -t data\weather.numeric.arff


VotedPerceptron: Number of perceptrons=5


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0     
Mean absolute error                      0.3623
Root mean squared error                  0.587 
Relative absolute error                 78.0299 %
Root relative squared error            122.4306 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0     
Mean absolute error                      0.3736
Root mean squared error                  0.589 
Relative absolute error                 78.4565 %
Root relative squared error            119.3809 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no


The 14 instances of the weather.numeric.arff dataset below use 2 nominal attributes and 2 numeric attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1.weka.classifiers.functions.VotedPerceptron code | doc

weka.classifiers.functions.Logistic

weka.classifiers.functions.Logistic is a logistic-regression learner.
It builds a multi-class logistic-regression model with a ridge estimator and predicts class values.
Missing values are filled in by the ReplaceMissingValues filter, and nominal attributes are converted to numeric ones by the NominalToBinary filter.

Parameters:
 -R <ridge> the ridge value in the log-likelihood. Default 1e-8
 -M <number> maximum number of iterations. Default -1, meaning run until convergence
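The Odds Ratios in the output below are simply e raised to each coefficient; checking two of them:

```python
from math import exp

# coefficients as printed in the model output below
coefficients = {"outlook=overcast": 13.5922, "windy": 3.7317,
                "humidity": -0.1556}

# each unit increase multiplies the odds of class yes by exp(coefficient)
odds_ratios = {k: exp(v) for k, v in coefficients.items()}
print(round(odds_ratios["windy"], 2))     # close to the printed 41.7508
print(round(odds_ratios["humidity"], 4))  # close to the printed 0.8559
```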


> java  weka.classifiers.functions.Logistic  -t data\weather.numeric.arff


Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                          Class
Variable                    yes
===============================
outlook=sunny           -6.4257
outlook=overcast        13.5922
outlook=rainy           -5.6562
temperature             -0.0776
humidity                -0.1556
windy                    3.7317
Intercept                22.234


Odds Ratios...
                          Class
Variable                    yes
===============================
outlook=sunny            0.0016
outlook=overcast    799848.4279
outlook=rainy            0.0035
temperature              0.9254
humidity                 0.8559
windy                   41.7508


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          11               78.5714 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.5532
Mean absolute error                      0.2066
Root mean squared error                  0.3273
Relative absolute error                 44.4963 %
Root relative squared error             68.2597 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 1 4 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %
Kappa statistic                          0.0667
Mean absolute error                      0.4548
Root mean squared error                  0.6576
Relative absolute error                 95.5132 %
Root relative squared error            133.2951 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 6 3 | a = yes
 3 2 | b = no


The 14 instances of the weather.numeric.arff dataset below use 2 nominal attributes and 2 numeric attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1.weka.classifiers.functions.Logistic code | doc

weka.classifiers.functions.LinearRegression

weka.classifiers.functions.LinearRegression is a standard linear-regression learner.
It learns a weight for each numeric attribute and builds a linear-equation model to predict a numeric class.

Parameters:
-S select_attribute_code attribute-selection method: 0 = M5', 1 = none, 2 = Greedy. Default 0.


> java  weka.classifiers.functions.LinearRegression  -t data\cpu.arff


Linear Regression Model

class =

      0.0491 * MYCT +
      0.0152 * MMIN +
      0.0056 * MMAX +
      0.6298 * CACH +
      1.4599 * CHMAX +
    -56.075 


Time taken to build model: 0.02 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correlation coefficient                  0.93
Mean absolute error                     37.9748
Root mean squared error                 58.9899
Relative absolute error                 39.592  %
Root relative squared error             36.7663 %
Total Number of Instances              209     



=== Cross-validation ===

Correlation coefficient                  0.9012
Mean absolute error                     41.0886
Root mean squared error                 69.556 
Relative absolute error                 42.6943 %
Root relative squared error             43.2421 %
Total Number of Instances              209     


The cpu.arff dataset has 209 instances; each uses 6 numeric attributes to predict 1 numeric attribute.

MYCT MMIN MMAX CACH CHMIN CHMAX class
125 256 6000 256 16 128 198
29 8000 32000 32 8 32 269
29 8000 32000 32 8 32 220
29 8000 32000 32 8 32 172
29 8000 16000 32 8 16 132
26 8000 32000 64 8 32 318
23 16000 32000 64 16 32 367
23 16000 32000 64 16 32 489
23 16000 64000 64 16 32 636
.....
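Plugging the first cpu.arff instance into the printed model gives a quick sanity check (CHMIN carries no weight because M5' attribute selection dropped it). The prediction differs considerably from the actual class 198, which the reported training error of a linear fit allows:

```python
# weights from the Linear Regression Model printed above
weights = {"MYCT": 0.0491, "MMIN": 0.0152, "MMAX": 0.0056,
           "CACH": 0.6298, "CHMAX": 1.4599}
intercept = -56.075

def predict(instance):
    """Evaluate the learned linear equation on one instance."""
    return sum(w * instance[k] for k, w in weights.items()) + intercept

first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256,
         "CHMIN": 16, "CHMAX": 128}        # actual class: 198
print(round(predict(first), 2))
```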





References: 1.weka.classifiers.functions.LinearRegression code | doc

weka.classifiers.functions.SimpleLinearRegression

weka.classifiers.functions.SimpleLinearRegression is a simple linear-regression learner;
"simple" means it picks just one attribute, the one with the smallest squared error, for the linear prediction.
It only predicts a numeric class from numeric attributes and does not accept instances with missing values.
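The attribute-picking idea can be sketched as an exhaustive least-squares search over single attributes (a simplified sketch, not Weka's code):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def simple_linear_regression(columns, ys):
    """Pick the attribute whose fitted line has the smallest squared error."""
    best = None
    for name, xs in columns.items():
        slope, intercept = fit_line(xs, ys)
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, name, slope, intercept)
    return best[1:]

cols = {"x1": [1, 2, 3, 4], "x2": [4, 1, 3, 2]}
ys = [2, 4, 6, 8]                    # perfectly explained by x1
print(simple_linear_regression(cols, ys))
```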

> java  weka.classifiers.functions.SimpleLinearRegression  -t data\cpu.arff


Linear regression on MMAX

0.01 * MMAX - 34


Time taken to build model: 0 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correlation coefficient                  0.863
Mean absolute error                     50.8658
Root mean squared error                 81.0566
Relative absolute error                 53.0319 %
Root relative squared error             50.5197 %
Total Number of Instances              209     



=== Cross-validation ===

Correlation coefficient                  0.7844
Mean absolute error                     53.8054
Root mean squared error                 99.5674
Relative absolute error                 55.908  %
Root relative squared error             61.8997 %
Total Number of Instances              209    

The cpu.arff dataset has 209 instances; each uses 6 numeric attributes to predict 1 numeric attribute.

MYCT MMIN MMAX CACH CHMIN CHMAX class
125 256 6000 256 16 128 198
29 8000 32000 32 8 32 269
29 8000 32000 32 8 32 220
29 8000 32000 32 8 32 172
29 8000 16000 32 8 16 132
26 8000 32000 64 8 32 318
23 16000 32000 64 16 32 367
23 16000 32000 64 16 32 489
23 16000 64000 64 16 32 636
.....





References: 1.weka.classifiers.functions.SimpleLinearRegression code | doc

weka.classifiers.lazy.IB1

weka.classifiers.lazy.IB1 is a simple nearest-neighbour learner.
Training just stores the raw instances; testing picks the single nearest instance and predicts its class value.
It gives a decent baseline figure for benchmarking a dataset.

When computing the distance between two instances, IB1 treats a nominal attribute's distance as 0 if the values match and 1 if they differ.
Numeric attributes are normalized into [0,1] using the value range of the original instance set;
the attribute distance is then the squared difference of the two normalized values. If either instance is missing a value for an attribute, that attribute's distance is taken as 1.
Finally, treating all attributes as equally important, the attribute distances are summed and the square root taken (Euclidean distance) to give the distance between the two instances.
The instance at the smallest distance is chosen as the reference, and its class value is returned as the prediction.
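The distance computation described above, as a small sketch:

```python
from math import sqrt

def ib1_distance(x, y, ranges):
    """Mixed-attribute Euclidean distance.

    ranges maps numeric attribute names to their (min, max) in the training
    set; other attributes are treated as nominal. Missing values are None.
    """
    total = 0.0
    for attr in x:
        a, b = x[attr], y[attr]
        if a is None or b is None:
            total += 1.0                      # missing value counts as distance 1
        elif attr in ranges:
            lo, hi = ranges[attr]             # normalize both values into [0,1]
            total += ((a - lo) / (hi - lo) - (b - lo) / (hi - lo)) ** 2
        else:
            total += 0.0 if a == b else 1.0   # nominal: match or not
    return sqrt(total)

# ranges taken from weather.numeric.arff: temperature 64..85, humidity 65..96
ranges = {"temperature": (64, 85), "humidity": (65, 96)}
d = ib1_distance(
    {"outlook": "sunny", "temperature": 85, "humidity": 85, "windy": "FALSE"},
    {"outlook": "sunny", "temperature": 64, "humidity": 65, "windy": "TRUE"},
    ranges)
print(round(d, 3))
```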

> java -cp weka.jar;. weka.classifiers.lazy.IB1  -t data\weather.numeric.arff

IB1 classifier

Time taken to build model: 0.02 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                          0.0392
Mean absolute error                      0.5   
Root mean squared error                  0.7071
Relative absolute error                105      %
Root relative squared error            143.3236 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 4 5 | a = yes
 2 3 | b = no

The 14 instances of the weather.numeric.arff dataset below use 2 nominal attributes and 2 numeric attributes to predict a nominal attribute.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1.weka.classifiers.lazy.IB1 code | doc

weka.classifiers.rules.Prism

weka.classifiers.rules.Prism is a simple rule-set learner.
Training builds comparatively high-precision rules class by class using a covering approach;
testing scans the rule set from the top and predicts with the first rule whose conditions the instance's attribute values satisfy.
It gives a decent baseline figure for benchmarking a dataset.

Prism learns a separate rule set fitting all instances of each class, as follows:
Before building the rules covering a class, all instances are placed in the pending set E;
as long as E still contains instances of that class, another rule is needed.
   To build a rule, start with a single attribute condition: enumerate every attribute-value combination and take the one with the highest precision;
       then add the next attribute condition, again enumerating every attribute-value combination and taking the highest precision;
       and so on, until the attributes are exhausted or the rule is already perfectly accurate.
       When precisions tie, take the condition with the larger coverage (denominator).
   Once a rule is built, remove from E the instances of its class that it correctly predicts,
   and learn the next rule on the instances still uncovered.

Because Prism learns each class's rules with the enemy in full view (all other classes' instances remain present),
it makes no difference at prediction time which of a class's rules is checked first; a rule never encroaches on another class's instance space.
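The condition-picking step can be sketched as follows; on weather.nominal it selects outlook = overcast for class yes, matching the first rule in the run below:

```python
def best_condition(instances, target_class, attrs):
    """Pick the attribute=value test with the highest precision for the class,
    breaking ties by coverage, as in one Prism rule-growing step.
    The class attribute is hardcoded as 'play' for this toy example."""
    best = None
    for attr in attrs:
        for value in {inst[attr] for inst in instances}:
            covered = [i for i in instances if i[attr] == value]
            correct = sum(1 for i in covered if i["play"] == target_class)
            score = (correct / len(covered), len(covered))
            if best is None or score > best[0]:
                best = (score, (attr, value))
    return best[1]

rows = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("rainy", "mild", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
]
names = ("outlook", "temperature", "humidity", "windy", "play")
instances = [dict(zip(names, r)) for r in rows]

print(best_condition(instances, "yes", names[:-1]))
```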


> java -cp weka.jar;. weka.classifiers.rules.Prism  -t data\weather.nominal.arff


Prism rules
----------
If outlook = overcast then yes
If humidity = normal
   and windy = FALSE then yes
If temperature = mild
   and humidity = normal then yes
If outlook = rainy
   and windy = FALSE then yes
If outlook = sunny
   and humidity = high then no
If outlook = rainy
   and windy = TRUE then no


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.4375
Mean absolute error                      0.25  
Root mean squared error                  0.5   
Relative absolute error                 59.2264 %
Root relative squared error            105.9121 %
UnClassified Instances                   2               14.2857 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 0 | a = yes
 3 2 | b = no


The 14 instances of the weather.nominal.arff dataset below comprise 9 yes and 5 no.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
rainy cool normal TRUE no
sunny mild high FALSE no
rainy mild high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
overcast cool normal TRUE yes
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
References: 1.weka.classifiers.rules.Prism code | doc
