2015年12月28日 星期一

How to Save and Load a Model in Weka for Training and Testing

Two Weka Command Line Examples of Using Models in Training and Testing:

(1) train and save a OneR model 
    load and test a OneR model
    both using the weather.nominal.arff dataset

(2) train and save a FilteredClassifier (StringToWordVector + J48) model
    load and test a FilteredClassifier (StringToWordVector + J48) model
    using the crude_oil_train.arff dataset for training
      and the crude_oil_test.arff dataset for testing

#-------------------------------
#ask for the classifier's options

>java -cp weka.jar weka.classifiers.rules.OneR -h -info

Help requested.

General options:

-h or -help
 Output help information.
-synopsis or -info
 Output synopsis for classifier (use in conjunction  with -h)
-t <name of training file>
 Sets training file.
-T <name of test file>
 Sets test file. If missing, a cross-validation will be performed
 on the training data.
-c <class index>
 Sets index of class attribute (default: last).
-x <number of folds>
 Sets number of folds for cross-validation (default: 10).
-no-cv
 Do not perform any cross validation.
-split-percentage <percentage>
 Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
 Preserves the order in the percentage split.
-s <random number seed>
 Sets random number seed for cross-validation or percentage split
 (default: 1).
-m <name of file with cost matrix>
 Sets file with cost matrix.
-l <name of input file>
 Sets model input file. In case the filename ends with '.xml',
 a PMML file is loaded or, if that fails, options are loaded
 from the XML file.
-d <name of output file>
 Sets model output file. In case the filename ends with '.xml',
 only the options are saved to the XML file, not the model.
-v
 Outputs no statistics for training data.
-o
 Outputs statistics only, not the classifier.
-i
 Outputs detailed information-retrieval statistics for each class.
-k
 Outputs information-theoretic statistics.
-p <attribute range>
 Only outputs predictions for test instances (or the train
 instances if no test instances provided and -no-cv is used),
 along with attributes (0 for none).
-distribution
 Outputs the distribution instead of only the prediction
 in conjunction with the '-p' option (only nominal classes).
-r
 Only outputs cumulative margin distribution.
-z <class name>
 Only outputs the source representation of the classifier,
 giving it the supplied name.
-xml filename | xml-string
 Retrieves the options from the XML-data instead of the command line.
-threshold-file <file>
 The file to save the threshold data to.
 The format is determined by the extensions, e.g., '.arff' for ARFF
 format or '.csv' for CSV.
-threshold-label <label>
 The class label to determine the threshold data for
 (default is the first label)

Options specific to weka.classifiers.rules.OneR:

-B <minimum bucket size>
 The minimum number of objects in a bucket (default: 6).

Synopsis for weka.classifiers.rules.OneR: # synopsis is shown with -info option

Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see:

R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.



#---------------------------------------------------------
# Example (1): use OneR to train and test on weather.nominal.arff

#train the classifier on the training data and save the model without evaluation

>java -cp weka.jar weka.classifiers.rules.OneR \
>   -t data/weather.nominal.arff -no-cv -v -d model.dat

outlook:
        sunny   -> no
        overcast        -> yes
        rainy   -> yes
(10/14 instances correct)

=== Error on training data ===   # this report is suppressed when -v is given

Correctly Classified Instances          10               71.4286 %
.....

=== Stratified cross-validation ===   # this report is suppressed when -no-cv is given

Correctly Classified Instances           6               42.8571 %
.....


#load the model and evaluate the classifier on the test data

>java -cp weka.jar weka.classifiers.rules.OneR \
>   -T data/weather.nominal.arff -l model.dat

outlook:
        sunny   -> no
        overcast        -> yes
        rainy   -> yes
(10/14 instances correct)

=== Error on test data ===

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.3778
Mean absolute error                      0.2857
Root mean squared error                  0.5345
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 2 3 | b = no



>java -cp weka.jar weka.classifiers.rules.OneR \
>   -T data/weather.nominal.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (outlook,temperature,humidity,windy)
     1       2:no       2:no       1 (sunny,hot,high,FALSE)
     2       2:no       2:no       1 (sunny,hot,high,TRUE)
     3      1:yes      1:yes       1 (overcast,hot,high,FALSE)
     4      1:yes      1:yes       1 (rainy,mild,high,FALSE)
     5      1:yes      1:yes       1 (rainy,cool,normal,FALSE)
     6       2:no      1:yes   +   1 (rainy,cool,normal,TRUE)
     7      1:yes      1:yes       1 (overcast,cool,normal,TRUE)
     8       2:no       2:no       1 (sunny,mild,high,FALSE)
     9      1:yes       2:no   +   1 (sunny,cool,normal,FALSE)
    10      1:yes      1:yes       1 (rainy,mild,normal,FALSE)
    11      1:yes       2:no   +   1 (sunny,mild,normal,TRUE)
    12      1:yes      1:yes       1 (overcast,mild,high,TRUE)
    13      1:yes      1:yes       1 (overcast,hot,normal,FALSE)
    14       2:no      1:yes   +   1 (rainy,mild,high,TRUE)

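The same save/load cycle can be driven from Java code instead of the command line. A minimal sketch, assuming weka.jar is on the classpath; weka.core.SerializationHelper plays the role of -d when writing and of -l when reading:

import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveLoadOneR
{
  public static void main(String args[]) throws Exception
  {
    Instances data = DataSource.read("data/weather.nominal.arff");
    data.setClassIndex(data.numAttributes() - 1);     // class = last attribute

    OneR model = new OneR();                          // train as with -t
    model.buildClassifier(data);
    SerializationHelper.write("model.dat", model);    // save as with -d

    OneR loaded = (OneR) SerializationHelper.read("model.dat");  // load as with -l
    double label = loaded.classifyInstance(data.instance(0));
    System.out.println(data.classAttribute().value((int) label)); // prints: no
  }
}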

#--------------------------------------------------------------------
#Example (2): use FilteredClassifier (StringToWordVector + J48) to
#             train on crude_oil_train.arff and test on crude_oil_test.arff

#train the classifier on the training data and save the model without evaluation

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>    -no-cv -v -t data/crude_oil_train.arff -d model.dat \
>    -F weka.filters.unsupervised.attribute.StringToWordVector \
>    -W weka.classifiers.trees.J48

Options: -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.trees.J48

FilteredClassifier using weka.classifiers.trees.J48 -C 0.25 -M 2 on data filtered through weka.filters.unsupervised.attribute.StringToWordVector -R 1 -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

Filtered Header
@relation 'crude_oil_train-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute class {yes,no}
@attribute Crude numeric
@attribute Demand numeric
@attribute The numeric
@attribute crude numeric
@attribute for numeric
@attribute has numeric
@attribute in numeric
@attribute increased numeric
@attribute is numeric
@attribute of numeric
@attribute oil numeric
@attribute outstrips numeric
@attribute price numeric
@attribute short numeric
@attribute significantly numeric
@attribute supply numeric
@attribute Some numeric
@attribute Use numeric
@attribute a numeric
@attribute bit numeric
@attribute cooking numeric
@attribute do numeric
@attribute flavor numeric
@attribute food numeric
@attribute frying numeric
@attribute like numeric
@attribute not numeric
@attribute oily numeric
@attribute olive numeric
@attribute pan numeric
@attribute people numeric
@attribute the numeric
@attribute very numeric
@attribute was numeric

@data


Classifier Model
J48 pruned tree
------------------

crude <= 0: no (4.0/1.0)
crude > 0: yes (2.0)

Number of Leaves  :     2

Size of the tree :      3



#load the model and evaluate the classifier on the test data

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>                   -T data/crude_oil_test.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (document)
     1      1:yes      1:yes       1 ('Oil platforms extract crude oil')
     2       2:no       2:no       0.75 ('Canola oil is supposed to be healthy')
     3      1:yes       2:no   +   0.75 ('Iraq has significant oil reserves')
     4       2:no       2:no       0.75 ('There are different types of cooking oil')



> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>                   -T data/crude_oil_test2.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (document)
     1        1:?      1:yes       1 ('Oil platforms extract crude oil')
     2        1:?       2:no       0.75 ('Canola oil is supposed to be healthy')
     3        1:?       2:no       0.75 ('Iraq has significant oil reserves')
     4        1:?       2:no       0.75 ('There are different types of cooking oil')


######### data/weather.nominal.arff
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no



######### data/crude_oil_train.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17 tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_train'
%
@attribute document string
@attribute class {yes,no}
%
@data
'The price of crude oil has increased significantly',yes
'Demand for crude oil outstrips supply',yes
'Some people do not like the flavor of olive oil',no
'The food was very oily',no
'Crude oil is in short supply',yes
'Use a bit of cooking oil in the frying pan',no



######### data/crude_oil_test.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17   tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',yes
'Canola oil is supposed to be healthy',no
'Iraq has significant oil reserves',yes
'There are different types of cooking oil',no



######### data/crude_oil_test2.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17   tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',?
'Canola oil is supposed to be healthy',?
'Iraq has significant oil reserves',?
'There are different types of cooking oil',?

2015年12月4日 星期五

summary of graph algorithms

goodrich-15-wiley-data structures & algorithms in java
ch14.1 Graphs

‧A graph consists of vertices and edges.
‧A vertex is also called a node; an edge is also called an arc.
‧By direction, edges divide into directed edges and undirected edges.
‧A graph made purely of directed edges is a directed graph (digraph).
‧A graph made purely of undirected edges is an undirected graph.
‧A mixed graph contains both directed and undirected edges.
‧Every edge has two end vertices, also called endpoints.
‧The endpoints of a directed edge are further distinguished as its origin and its destination.

‧Two vertices are adjacent if they are the two endpoints of some edge.
‧An edge is incident to a vertex if the vertex is one of its endpoints.
‧The outgoing edges of a vertex are all the directed edges whose origin is that vertex.
‧The incoming edges of a vertex are all the directed edges whose destination is that vertex.
‧The degree of a vertex is the number of edges incident to it,
        subdivided into out-degree and in-degree.

‧The container holding a graph's edges is a collection, not a set,
  meaning two vertices may be joined by two or more directed or undirected edges,
  called parallel edges or multiple edges.
‧An edge (directed or undirected) whose two endpoints coincide is a self-loop.
‧A graph with no parallel edges and no self-loops is a simple graph;
  a simple graph can be described by a set of edges, which allows no duplicates.

‧A path is an alternating sequence of vertices and edges that starts at some vertex and ends at some vertex,
  where each edge in the sequence starts at the vertex before it and ends at the vertex after it.
‧A cycle is a path whose start and end are the same vertex and that contains at least one edge.
‧A path is a simple path if all of its vertices are distinct.
‧A cycle is a simple cycle if all of its vertices are distinct, not counting the identical start and end vertex.
‧A directed path is a path whose edges are all directed.
‧A directed cycle is a cycle whose edges are all directed.
‧An acyclic directed graph contains no directed cycle.

‧If a path from vertex u to vertex v exists, u is said to reach v, and v is reachable from u.
‧Reachability in an undirected graph is symmetric: u reaches v exactly when v reaches u; not so in a directed graph.
‧A graph is a connected graph if a path joins every pair of its vertices.
‧A directed graph is a strongly connected graph if paths join every pair of its vertices in both directions.

‧A subgraph of graph G is a graph H whose vertices and edges are subsets of G's vertices and edges.
‧A spanning subgraph of G is a subgraph of G that contains all of G's vertices.
‧If graph G is not connected, its maximal connected subgraphs are the connected components of G.
‧A graph with no cycles is a forest.
‧A tree is a connected forest, i.e. a connected graph with no cycles.
‧A spanning tree of a graph is a spanning subgraph that is itself a tree.



ch14.2 Data Structures for Graphs

‧Four data structures for representing a graph
1. edge list
2. adjacency list
3. adjacency map    <== used by the textbook; supports finding edges from vertices and vertices from edges
        The graph holds a linked list of vertices, a linked list of edges, and a direction flag isDirected.
        Each vertex records its element, its position pos in the vertex list,
                an outgoing map of <vertex, edge> pairs for its outgoing edges,
                and an incoming map of <vertex, edge> pairs for its incoming edges.
                Note: outgoing and incoming differ only in a directed graph; in an undirected graph they are the same map.
        Each edge records its element, its position pos in the edge list, and its two endpoints.
4. adjacency matrix



14.3 Graph Traversals

‧Graph traversal is essentially graph-to-tree conversion:
  it produces a search tree that can answer reachability questions about the vertices.
  The difficulty lies in examining all vertices and edges of the graph efficiently,
  ideally in time linear in the number of vertices and edges.

‧Reachability problems on an undirected graph:
         1. Given vertices u, v of graph G, find some path from u to v, if v can be reached.
         2. Given vertex u of graph G, find a path from u to every reachable vertex v, each with the fewest edges.
         3. Given graph G, decide whether all of its vertices are connected.
         4. Given graph G, find some cycle of G, if one exists.
        *5. Given a connected graph G, find some spanning tree of G.
        *6. Given graph G, find all of its connected components (maximal connected subgraphs).

        Note: edge classification in an undirected traversal tree:
            tree edges: discovery edges
            nontree edges: back edges, cross edges

‧Reachability problems on a directed graph:
         1. Given vertices u, v of graph G, find some directed path from u to v, if v can be reached.
         2. Given vertex u of graph G, list all vertices reachable from u.
         3. Given graph G, decide whether G is strongly connected.
         4. Given graph G, decide whether it has a cycle.

        Note: edge classification in a directed traversal tree:
            tree edges: discovery edges
            nontree edges: back edges, forward edges, cross edges

‧The two most basic graph traversals:
        1. depth-first search
        2. breadth-first search

       GraphAlgorithms.DFS(g, u, known, forest)
                Starting from vertex u of graph g, builds the depth-first search tree;
                returns the visited vertices in known and their discovery edges in forest.

       GraphAlgorithms.DFSComplete(g)
                Returns the depth-first search forest of graph g, i.e. the discovery edges of the visited vertices.

       GraphAlgorithms.BFS(g, u, known, forest)
                Starting from vertex u of graph g, builds the breadth-first search tree;
                returns the visited vertices in known and their discovery edges in forest.

       GraphAlgorithms.BFSComplete(g)
                Returns the breadth-first search forest of graph g, i.e. the discovery edges of the visited vertices.

       GraphAlgorithms.constructPath(g, u, v, forest)
                From the traversal forest 'forest' of graph g,
                returns the path from vertex u to vertex v (as the edges along the way).
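For reference, below is a compact depth-first traversal in the spirit of GraphAlgorithms.DFS; this sketch substitutes a plain adjacency map over strings for the textbook's Graph/Vertex/Edge interfaces, so the types here are illustrative only: known collects the visited vertices, and forest maps each visited vertex to the vertex it was discovered from (its discovery edge).

import java.util.*;

public class DfsSketch
{
  static void dfs(Map<String, List<String>> g, String u,
                  Set<String> known, Map<String, String> forest)
  {
    known.add(u);
    for (String v : g.getOrDefault(u, List.of()))
      if (!known.contains(v))
      {
        forest.put(v, u);          // (u,v) is a discovery edge
        dfs(g, v, known, forest);  // back/cross edges are simply skipped
      }
  }

  public static void main(String args[])
  {
    Map<String, List<String>> g = Map.of(
      "A", List.of("B", "C"),
      "B", List.of("A", "D"),
      "C", List.of("A", "D"),
      "D", List.of("B", "C"));
    Set<String> known = new LinkedHashSet<>();
    Map<String, String> forest = new LinkedHashMap<>();
    dfs(g, "A", known, forest);
    System.out.println(known);   // [A, B, D, C]
    System.out.println(forest);  // {B=A, D=B, C=D}
  }
}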

2015年11月17日 星期二

auto recursive indexing of chinese articles for later query use


/*
  Indexer.java
     This program uses the mmseg4j.jar package for Chinese word segmentation. Starting from a given directory, it recursively indexes every text file in all subdirectories.
     The inverted index is saved to the file invert.dat as an object stream, so a later run can restore it and quickly look up which files a given word appears in.

> javac -cp mmseg4j.jar;. Indexer.java
> java -cp mmseg4j.jar;. Indexer \data

init: path=\data
chars loaded time=110ms, line=13060, on file=\data\chars.dic
words loaded time=125ms, line=137450, on file=file:\mmseg4j.jar!\data\words.dic
unit loaded time=0ms, line=22, on file=file:\mmseg4j.jar!\data\units.dic
\data\L1\F1.txt: [,:41,的:24,沈船:17,計畫:8,澎湖:8, :7,。:7,發掘:7,為:7,初:6,進行:6,該:6,勘:6,古:6]

\data\L1\F1.txt: [,:41,的:24,沈船:17,計畫:8,澎湖:8, :7,。:7,發掘:7,為:7,初:6,進行:6,該:6,勘:6,古:6]

\data\L1\F2.txt: [,:88,的:74,、:31, :25,。:25,在:16,海底:16,尋:11,寶:11,﹁:10,沈船:10]

\data\L1\F3.txt: [,:29,的:14,沈船:9,打撈:8,澎湖:6, :5,。:5,工作:5,進行:5,館:5,後:5,古:5]

\data\L1\F4.txt: [,:21,。:13,的:11,船:7,、:6,去年:6,工作:6,澎湖:6,探勘:6,初:5,進行:5,沉:5,包括:5,勘:5,博:5,將軍:5,史:5]

\data\L1\L2\E1.txt: [,:51,的:16,與:9,。:7,主:6,老街:6, :5,做:5,拆:5,三峽:4]

\data\L1\L2\E2.txt: [,:49,的:26,三峽:11,老街:10,。:8,與:7,古蹟:7,、:6,文化:6,而:5,祖師廟:5,保留:5]

\data\L1\L2\E3.txt: [,:36,的:14,。:13,三峽:13,「:7,」:7,主:7,老街:7,協調會:5,發展:5]

\data\L1\L2\E4.txt: [,:53,的:19,。:8,三峽:6,主:6, :5,不:5,拆除:5,在:5,而:4,老街:4,財產:4,住戶:4,改建:4,古蹟:4,保留:4,排除:4,派:4,介入:4]

\data\L1\L2\E5.txt: [,:30, :18,。:10,三峽:10,老街:7,文建會:7,立:7,的:5,派:5,面:5,騎樓:5]

\data\L1\L2\E6.txt: [,:52,的:17,。:9,民眾:8,老街:7,三峽:6,拆:6, :5,而:5,文建會:5]

\data\L1\L2\E7.txt: [,:27,老街:12,。:7,屋:6,街:6, :5,的:5,、:4,與:4,三峽:4,是:4,住戶:4,古蹟:4]

\data\L1\L2\L3\D1.txt: [,:47,「:35,」:35,的:29,、:20,布袋戲:15,。:14,宛然:13,祿:10,天:9,李:9]

\data\L1\L2\L3\D2.txt: [,:23,「:17,」:17,的:14,、:12,。:8,壇:8,藝術:7,儀式:5, :4,主:4,-:4,露天劇場:4,開荒:4,啟用:4]

\data\L1\L2\L3\D3.txt: [,:52,畫:20,的:18,作:17,館:14,。:10,「:10,」:10,資料:10,這些:10]

\data\L1\L2\L3\D4.txt: [,:28,。:12,她:11,、:6,「:6,」:6,貝:6,文:6,王:6,音樂:5,的:5]

\data\L1\L2\M3\C1.txt: [,:27,的:22,」:18,「:17,。:11,中:8,柴可夫斯基:7,盛:7,能:7,余:7]

\data\L1\L2\M3\C2.txt: [,:38,的:27,舞:20,。:19,「:18,」:17,德國:11,能:11,團:11,、:10,舞蹈:10]

save and load 'invert.dat'

Input the query word:
三峽
query '三峽' occurs in articles [\data\L1\L2\E1.txt, \data\L1\L2\E2.txt, \data\L1\L2\E3.txt, \data\L1\L2\E4.txt, \data\L1\L2\E5.txt, \data\L1\L2\E6.txt, \data\L1\L2\E7.txt]

*/
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.FileNotFoundException;
import java.io.UnsupportedEncodingException;
import java.io.IOException;

import java.util.Set;
import java.util.Map;
import java.util.HashMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

import com.chenlb.mmseg4j.example.Simple;

public class Indexer
{
  private static Simple segmenter;
  private static HashMap<String, ArrayList<String>> invertList;


  public static void statArticle(File file, HashMap<String, Integer> frequency)
    throws FileNotFoundException,UnsupportedEncodingException,IOException
  {
    FileInputStream fis = new FileInputStream(file);  // FileNotFoundException
    InputStreamReader isr = new InputStreamReader(fis, "big5"); //
    BufferedReader br = new BufferedReader(isr);

    // fill hashmap frequency

    while(br.ready()==true)  // IOException
    {
      String text = br.readLine();
      text = text.replaceAll(" ", "");  // remove interference from blanks
      String seg_text = segmenter.segWords(text, " ");
        //System.out.println(text);
       //System.out.println(seg_text);
      String words[] = seg_text.split(" ");

      for(String w : words)
      {
        //System.out.println(w.length() + ":" + w);
        if(w.length()==0) continue;

        if(frequency.containsKey(w)==true)
        {
          int count = frequency.get(w);
          count++;
          frequency.put(w,count);
        }
        else
        {
          frequency.put(w,1);
        }
      }
    }
    br.close();

    // process hashmap frequency

    int numberWords = frequency.size();
/*
    Collection<Integer> counts = frequency.values();
    List<Integer> counts_list = new ArrayList<>(counts);
    Collections.sort(counts_list);

    int count_threshold = counts_list.get(counts_list.size() - 10);

    String fullName = file.getCanonicalPath();
    System.out.printf("%s: [", fullName);
    for(String word : frequency.keySet())
    {
      int count = frequency.get(word);

      if(count >= count_threshold)
      {
        System.out.printf(",%s:%d", word, count);
      }
    }
    System.out.printf("]\n\n");
*/
    Set<Map.Entry<String,Integer>> entries = frequency.entrySet();
    List<Map.Entry<String,Integer>> entryList = new ArrayList<Map.Entry<String,Integer>>(entries);
    Comparator<Map.Entry<String,Integer>> cmp = new Comparator<Map.Entry<String,Integer>>() {
     public int compare(Map.Entry<String,Integer> e1, Map.Entry<String,Integer> e2)
     {
       return e2.getValue() - e1.getValue();  // descending order
     }
    };
    Collections.sort(entryList, cmp);

    int count_threshold = entryList.get(9).getValue();  // count of the 10th most frequent word (assumes at least 10 distinct words)

    String fullName = file.getCanonicalPath();
    System.out.printf("%s: [", fullName);
    boolean first=true;
    for(Map.Entry<String,Integer> e : entryList)
    {
      if(e.getValue() >= count_threshold)
      {
        System.out.printf("%s%s:%d", (first==false)?",":"", e.getKey(), e.getValue());
       if(first==true) first = false;
      }
    }
    System.out.printf("]\n\n");
  }


  public static void indexArticleWords(File dirFile,
    HashMap<String, ArrayList<String>> invertList)
    throws FileNotFoundException,UnsupportedEncodingException,IOException
  {

    File list[] = dirFile.listFiles();
    String fullName;

    for(File f : list)
    {
      fullName = f.getCanonicalPath(); // throws IOException

      if(f.isFile()==true)
      {
       HashMap<String, Integer> frequency
         = new HashMap<String, Integer>();

        statArticle(f, frequency);

        for(String word : frequency.keySet())
        {
          if(invertList.containsKey(word)==true)
          {
            ArrayList<String> oldList = invertList.get(word);
            oldList.add(fullName);
          }
          else
          {
            ArrayList<String> newList = new ArrayList<>();
            newList.add(fullName);
            invertList.put(word, newList);
          }
        }
      }

      else if(f.isDirectory()==true)
      {
       indexArticleWords(f, invertList);
      }
    }
  }



  @SuppressWarnings("unchecked")
  public static void main(String args[])
    throws FileNotFoundException,UnsupportedEncodingException,
           IOException,ClassNotFoundException
  {
     // set up root folder
     String rootName = "data/";
     String testName = "data/L1/F1.txt";
     File root;

     if(args.length==1)
     {
       rootName = args[0];
     }

     root = new File(rootName);

     // set up segmenter and hashmap frequency
     segmenter = new Simple();

     HashMap<String, Integer> frequency = new HashMap<>();

     statArticle(new File(testName), frequency);  // try segmentation on one test article

     if(new File("invert.dat").exists()==false)
     {
       // set up invertList
       invertList = new HashMap<String, ArrayList<String>>();

       indexArticleWords(root, invertList);

       //save invertList to file 'invert.dat' and load from file 'invert.dat'

       ObjectOutputStream oos = new ObjectOutputStream(
       new FileOutputStream("invert.dat"));
       oos.writeObject(invertList);
       oos.close();
     }


     ObjectInputStream ois = new ObjectInputStream(
      new FileInputStream("invert.dat"));
     HashMap<String, ArrayList<String>>
       invertList2 = (HashMap<String, ArrayList<String>>) ois.readObject();
     ois.close();

     // do the query test on the loaded invertList
     System.out.println("save and load 'invert.dat'\n");

     System.out.print("Input the query word: ");
     Scanner sc = new Scanner(System.in);
     String query = sc.next();

     //print the file paths which have the query word
     ArrayList<String> list = invertList2.get(query);
     System.out.printf("query '%s' occurs in articles %s\n", query, list);

  }
}

2015年11月12日 星期四

an example to use the modbus rtu protocol to query/read/write the motor drivers

The Modbus application-layer protocol is a common lingua franca between industrial devices,
while the RS485 physical-layer standard is a common choice for longer-distance, noise-tolerant links between a computer and equipment.
To control a motor from a PC, a typical setup therefore runs
Modbus over RS485 between the PC and the motor driver.

The communication stack between the PC and the motor driver looks like this:

PC                                 Motor driver
------                             --------------
client application
MODBUS application layer<--------->MODBUS application layer
COM1 serial port
RS485 physical layer<------------->RS485 physical layer

On the PC side, the COM1 serial port is the application layer's interface to the physical layer.
The application-layer conversation between PC and motor relies on the MODBUS RTU protocol.
MODBUS RTU (remote terminal unit) addresses a motor driver by its id and
issues commands over the serial port to read and write its registers, which is how the motor is driven.

The walkthrough below uses the Modbus commands supported by Oriental Motor's AZ-series drivers
to show how a driver is read, written, and diagnosed over Modbus RTU;
it can serve as a reference when writing serial-port code.
The examples come from pp. 224-228 of the manual HM-60260C(AZ).pdf.

--------- Reading a block of consecutive registers -----------

A. To read the contents of consecutive registers from the motor driver:
A1. First write these 8 bytes to COM1, in order:

id:         01h     motor driver id
code:       03h     read-consecutive-registers command
start_addr: 18h,40h register start address 1840h
addr_count: 00h,06h number of consecutive registers 0006h
crc:        c2h,bch checksum bcc2h

This asks driver id=01h to execute command 03h: return the contents of
0006h consecutive 16-bit registers starting at address 1840h.

A2. Then read the response bytes from COM1:

01h      driver id; must equal the id just sent
03h      echo of the read command code
0ch      number of data bytes to follow; twice addr_count
00h, 00h register 1840h contents 0000h
00h, 02h register 1841h contents 0002h
ffh, ffh register 1842h contents ffffh
d8h, f0h register 1843h contents d8f0h
00h, 00h register 1844h contents 0000h
27h, 10h register 1845h contents 2710h
82h, eah checksum ea82h
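The crc field in both directions is the standard Modbus CRC-16: initial value ffffh, reflected polynomial a001h, low byte transmitted first. A small Java sketch that reproduces the checksum of the request in A1:

public class ModbusCrc
{
  // Modbus CRC-16: init 0xFFFF, xor each byte into the low end, then shift
  // right 8 times, xor-ing 0xA001 whenever a 1 bit falls off.
  static int crc16(int[] bytes)
  {
    int crc = 0xFFFF;
    for (int b : bytes)
    {
      crc ^= b;
      for (int i = 0; i < 8; i++)
      {
        boolean lsb = (crc & 1) != 0;
        crc >>= 1;
        if (lsb) crc ^= 0xA001;
      }
    }
    return crc;
  }

  public static void main(String args[])
  {
    int[] frame = {0x01, 0x03, 0x18, 0x40, 0x00, 0x06};  // the request in A1
    int crc = crc16(frame);
    // prints "c2 bc": low byte first on the wire, i.e. the value bcc2h above
    System.out.printf("%02x %02x%n", crc & 0xFF, crc >> 8);
  }
}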



--------- Writing a block of consecutive registers -----------

B. To write data into a block of consecutive registers of the motor driver:
B1. First write these 21 bytes to COM1, in order:

id:         04h     motor driver id
code:       10h     write-consecutive-registers command
start_addr: 18h,c6h register start address 18c6h
addr_count: 00h,06h number of consecutive registers 0006h
byte_count: 0ch     number of data bytes to follow; twice addr_count
data:
  00h,00h           register 18c6h contents 0000h
  27h,10h           register 18c7h contents 2710h
  00h,00h           register 18c8h contents 0000h
  4eh,20h           register 18c9h contents 4e20h
  00h,00h           register 18cah contents 0000h
  01h,f4h           register 18cbh contents 01f4h
crc:        6ch,a0h checksum a06ch

This asks driver id=04h to execute command 10h: write the following 0ch bytes,
in order, into 0006h consecutive 16-bit registers starting at address 18c6h.

B2. Then read the response from COM1:

04h     driver id; must equal the id just sent
10h     echo of the write command code
18h,c6h start address 18c6h; must equal the start_addr just sent
00h,06h register count 0006h; must equal the addr_count just sent
a6h,c3h checksum c3a6h



--------- Writing a single register -----------

C. To write data into a single register of the motor driver:
C1. First write these 8 bytes to COM1, in order:

id:    02h     motor driver id
code:  06h     write-single-register command
addr:  02h,55h target address 0255h
data:  00h,50h value to write 0050h
crc:   98h,6dh checksum 6d98h

This asks driver id=02h to execute command 06h: write data=0050h
into the 16-bit register at addr=0255h.

C2. Then read the response from COM1:

02h      driver id; must equal the id just sent
06h      echo of the write command code
02h,55h  register address 0255h; must equal the addr just sent
00h,50h  written value 0050h; must equal the data just sent
98h, 6dh checksum 6d98h



--------- Diagnosing the driver -----------

D. To run a diagnostic on the motor driver:
D1. First write these 8 bytes to COM1, in order:

id:          03h     motor driver id
code:        08h     diagnostics command
subcode:     00h,00h sub-function 0000h
data:        12h,34h arbitrary test data 1234h
crc:         ech,9eh checksum 9eech

This asks driver id=03h to run diagnostics sub-function 0000h:
send out the arbitrary data=1234h and see whether the same data comes back.

D2. Then read the response from COM1; it should be exactly what was just sent:

id:          03h     motor driver id
code:        08h     diagnostics command
subcode:     00h,00h sub-function 0000h
data:        12h,34h arbitrary test data 1234h
crc:         ech,9eh checksum 9eech

-

Serial-port programming in several environments:

Linux/Cygwin C:
  https://www.cmrr.umn.edu/~strupp/serial.html
  http://www.teuniz.net/RS-232/

Windows Win32 C:
   http://cboard.cprogramming.com/windows-programming/141173-windows-serial-programming.html

Windows C#/C++/VB .NET:
   https://msdn.microsoft.com/zh-tw/library/system.io.ports.serialport(v=vs.110).aspx

How SerialPort access works:
  http://www.dotblogs.com.tw/billchung/category/5702.aspx

PS:
https://en.wikipedia.org/wiki/Modbus
http://www.modbus.org/tech.php  the standard documents plus sample code for many platforms

2015年11月10日 星期二

how to extract hand drawn figures from photos

When you photograph a hand-drawn sketch with a phone, the lighting usually leaves the image with uneven tones, as in the left image. The single global threshold adjustment offered by image editors such as Photoshop then has a hard time isolating the drawing on its own.


If you know Python, a better option is the adaptiveThreshold method in the opencv package, which adjusts the threshold locally and extracts the drawing cleanly; with suitable parameters the result is quite good, as in the right image. The call looks like:
       destin = adaptiveThreshold(src, maxValue, adaptiveMethod, thresholdType, blockSize, C)
       where blockSize is the side length (in pixels) of the reference neighborhood, and C is the constant
       subtracted from the weighted sum of the neighborhood pixels to form the local threshold.
        --
        import cv2

        input_path = 'c:/path/source.jpg'
        output_path = 'c:/path/destin.jpg'
        img = cv2.imread(input_path, 0)   # read as grayscale
        img = cv2.medianBlur(img, 5)      # suppress speckle noise first
        newimg = cv2.adaptiveThreshold(img, 255,\
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  cv2.THRESH_BINARY, \
            11, 10)
        cv2.imwrite(output_path, newimg)

2015年11月3日 星期二

weka.classifiers.functions.Winnow

weka.classifiers.functions.Winnow is a mistake-driven learner.
It handles only nominal attributes, converting them to binary ones, and predicts a binary class value. It can learn incrementally (online).
It suits datasets with many attributes of which most are irrelevant to the prediction, since it quickly homes in on the relevant ones.

Given instance attributes (a0, a1, ..., ak), threshold theta, weight promotion factor alpha, weight demotion factor beta,
and weight vector (w0, w1, ..., wk) or (w0+ - w0-, w1+ - w1-, ..., wk+ - wk-),
where all symbols are positive and the augmented attribute a0 is always 1, there are two prediction rules:

  Unbalanced version: every weight component stays positive
     w0 * a0 + w1 * a1 + ... + wk * ak > theta means class 1; otherwise class 2

  Balanced version: effective weight components may be negative
    (w0+ - w0-) * a0 + (w1+ - w1-) * a1 + ... + (wk+ - wk-) * ak > theta means class 1; otherwise class 2

When a prediction made during learning turns out wrong, the weights are adjusted as follows:
  class 2 mistaken for class 1:  w *= beta,  or w+ *= beta  and w- *= alpha, shrinking the weights
  class 1 mistaken for class 2:  w *= alpha, or w+ *= alpha and w- *= beta,  growing the weights
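A toy sketch of the unbalanced version's predict-then-update loop (illustrative code, not Weka's implementation; as in the classic algorithm, only the weights of attributes whose value is 1 are multiplied):

public class WinnowSketch
{
  public static void main(String args[])
  {
    double alpha = 2.0, beta = 0.5;                 // -A and -B defaults
    int[][] x = {{1, 0, 1}, {0, 1, 1}, {1, 1, 0}};  // binary attribute vectors
    int[]   y = {1, 2, 1};                          // class 1 or class 2
    double theta = x[0].length;                     // -H default: number of attributes
    double[] w = {2.0, 2.0, 2.0};                   // -W default initial weight

    for (int pass = 0; pass < 2; pass++)            // -I passes over the data
      for (int i = 0; i < x.length; i++)
      {
        double s = 0;
        for (int j = 0; j < w.length; j++) s += w[j] * x[i][j];
        int pred = (s > theta) ? 1 : 2;
        if (pred == y[i]) continue;                 // only mistakes change weights
        double f = (y[i] == 1) ? alpha : beta;      // promote or demote
        for (int j = 0; j < w.length; j++)
          if (x[i][j] == 1) w[j] *= f;              // active attributes only
      }
    System.out.println(java.util.Arrays.toString(w));  // [4.0, 2.0, 1.0]
  }
}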

Parameters:
 -L  use the balanced version. Default false
 -I  number of passes over the training set. Default 1
 -A  weight promotion factor alpha, must be >1. Default 2.0
 -B  weight demotion factor beta, must be <1. Default 0.5
 -H  prediction threshold theta. Default -1, meaning the number of attributes
 -W  initial weight value, must be >0. Default 2.0
 -S  random seed; affects the order in which training instances are presented. Default 1


> java  weka.classifiers.functions.Winnow  -t data\weather.nominal.arff


Winnow

Attribute weights

w0 8.0
w1 1.0
w2 2.0
w3 4.0
w4 2.0
w5 2.0
w6 1.0
w7 1.0

Cumulated mistake count: 7


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.3778
Mean absolute error                      0.2857
Root mean squared error                  0.5345
Relative absolute error                 61.5385 %
Root relative squared error            111.4773 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 2 3 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.2564
Mean absolute error                      0.5   
Root mean squared error                  0.7071
Relative absolute error                105      %
Root relative squared error            143.3236 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 5 0 | b = no

The weather.nominal.arff dataset below has 14 instances that use 4 nominal attributes to predict a nominal class.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no
References: 1. weka.classifiers.functions.Winnow code | doc

weka.classifiers.functions.VotedPerceptron

weka.classifiers.functions.VotedPerceptron is a voted perceptron, a mistake-driven learner.
It first replaces missing values globally, then converts nominal attributes to binary ones; it predicts a binary class value and can learn incrementally (online).

Given instance attributes a=(a0, a1, ..., ak) and weight vector w=(w0, w1, ..., wk),
where attribute values are binary 0 or 1 and the augmented attribute a0 is always 1,
the prediction rule is
  w0 * a0 + w1 * a1 + ... + wk * ak > 0 means class 1; otherwise class 2

When a prediction made during learning turns out wrong, the weights are adjusted as follows:
  class 2 mistaken for class 1:  w -= a, shrinking the weights
  class 1 mistaken for class 2:  w += a, growing the weights
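A toy sketch of the voting mechanism that gives the learner its name: every mistake starts a new weight vector, each vector earns one vote per instance it survives, and prediction is the vote-weighted majority over the whole history (illustrative code only; Weka's implementation additionally applies the polynomial kernel selected by -E):

import java.util.ArrayList;
import java.util.List;

public class VotedPerceptronSketch
{
  public static void main(String args[])
  {
    int[][] x = {{1, 1, 0}, {1, 0, 1}, {1, 1, 1}};  // a0 is always 1
    int[]   y = {+1, -1, +1};                       // class 1 = +1, class 2 = -1
    List<int[]> ws = new ArrayList<>();             // perceptron history
    List<Integer> votes = new ArrayList<>();
    ws.add(new int[3]);
    votes.add(0);

    for (int i = 0; i < x.length; i++)
    {
      int[] w = ws.get(ws.size() - 1);
      int s = 0;
      for (int j = 0; j < 3; j++) s += w[j] * x[i][j];
      if (((s > 0) ? +1 : -1) == y[i])              // correct: current w survives
        votes.set(votes.size() - 1, votes.get(votes.size() - 1) + 1);
      else                                          // mistake: w += y*a, new vector
      {
        int[] nw = w.clone();
        for (int j = 0; j < 3; j++) nw[j] += y[i] * x[i][j];
        ws.add(nw);
        votes.add(1);
      }
    }

    int[] q = {1, 1, 0};                            // a new instance to classify
    int tally = 0;
    for (int k = 0; k < ws.size(); k++)
    {
      int s = 0;
      for (int j = 0; j < 3; j++) s += ws.get(k)[j] * q[j];
      tally += votes.get(k) * ((s > 0) ? +1 : -1);
    }
    System.out.println((tally > 0) ? "class 1" : "class 2");
  }
}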

Parameters:
 -I  number of passes over the training set. Default 1
 -E  exponent of the polynomial kernel. Default 1
 -S  random seed; affects the order in which training instances are presented. Default 1
 -M  maximum number of weight updates allowed. Default 10000

> java  weka.classifiers.functions.VotedPerceptron  -t data\weather.numeric.arff


VotedPerceptron: Number of perceptrons=5


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0     
Mean absolute error                      0.3623
Root mean squared error                  0.587 
Relative absolute error                 78.0299 %
Root relative squared error            122.4306 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0     
Mean absolute error                      0.3736
Root mean squared error                  0.589 
Relative absolute error                 78.4565 %
Root relative squared error            119.3809 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no


The weather.numeric.arff dataset below has 14 instances that use 2 nominal and 2 numeric attributes to predict a nominal class.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1. weka.classifiers.functions.VotedPerceptron code | doc

weka.classifiers.functions.Logistic

weka.classifiers.functions.Logistic is a logistic regression learner.
It builds a multinomial logistic regression model with a ridge estimator and predicts a nominal class value.
Missing values are filled in by the ReplaceMissingValuesFilter, and nominal attributes are converted to numeric ones by the NominalToBinaryFilter.

Parameters:
 -R <ridge>  ridge estimator value used in the log-likelihood. Default 1e-8
 -M <number> maximum number of iterations. Default -1, meaning iterate until convergence


> java  weka.classifiers.functions.Logistic  -t data\weather.numeric.arff


Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                          Class
Variable                    yes
===============================
outlook=sunny           -6.4257
outlook=overcast        13.5922
outlook=rainy           -5.6562
temperature             -0.0776
humidity                -0.1556
windy                    3.7317
Intercept                22.234


Odds Ratios...
                          Class
Variable                    yes
===============================
outlook=sunny            0.0016
outlook=overcast    799848.4279
outlook=rainy            0.0035
temperature              0.9254
humidity                 0.8559
windy                   41.7508


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          11               78.5714 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.5532
Mean absolute error                      0.2066
Root mean squared error                  0.3273
Relative absolute error                 44.4963 %
Root relative squared error             68.2597 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 1 4 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %
Kappa statistic                          0.0667
Mean absolute error                      0.4548
Root mean squared error                  0.6576
Relative absolute error                 95.5132 %
Root relative squared error            133.2951 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 6 3 | a = yes
 3 2 | b = no


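Each entry in the Odds Ratios table above is simply exp of the corresponding coefficient, so the two tables can be checked against each other:

public class OddsRatio
{
  public static void main(String args[])
  {
    double[] coef = {-6.4257, 13.5922, -5.6562, -0.0776, -0.1556, 3.7317};
    String[] name = {"outlook=sunny", "outlook=overcast", "outlook=rainy",
                     "temperature", "humidity", "windy"};
    for (int i = 0; i < coef.length; i++)        // odds ratio = e^coefficient
      System.out.printf("%-18s %.4f%n", name[i], Math.exp(coef[i]));
    // e.g. exp(13.5922) = 799848.4279, matching the Odds Ratios table above
  }
}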
The weather.numeric.arff dataset below has 14 instances that use 2 nominal and 2 numeric attributes to predict a nominal class.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1. weka.classifiers.functions.Logistic code | doc

weka.classifiers.functions.LinearRegression

weka.classifiers.functions.LinearRegression is a standard linear regression learner:
it learns a weight for each numeric attribute and builds a linear equation model to predict a numeric class.

Parameters:
-S select_attribute_code  attribute selection method: 0 = M5', 1 = none, 2 = Greedy. Default 0.


> java  weka.classifiers.functions.LinearRegression  -t data\cpu.arff


Linear Regression Model

class =

      0.0491 * MYCT +
      0.0152 * MMIN +
      0.0056 * MMAX +
      0.6298 * CACH +
      1.4599 * CHMAX +
    -56.075 


Time taken to build model: 0.02 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correlation coefficient                  0.93
Mean absolute error                     37.9748
Root mean squared error                 58.9899
Relative absolute error                 39.592  %
Root relative squared error             36.7663 %
Total Number of Instances              209     



=== Cross-validation ===

Correlation coefficient                  0.9012
Mean absolute error                     41.0886
Root mean squared error                 69.556 
Relative absolute error                 42.6943 %
Root relative squared error             43.2421 %
Total Number of Instances              209     


The cpu.arff dataset has 209 instances; each uses 6 numeric attributes to predict 1 numeric attribute.

MYCT MMIN MMAX CACH CHMIN CHMAX class
125 256 6000 256 16 128 198
29 8000 32000 32 8 32 269
29 8000 32000 32 8 32 220
29 8000 32000 32 8 32 172
29 8000 16000 32 8 16 132
26 8000 32000 64 8 32 318
23 16000 32000 64 16 32 367
23 16000 32000 64 16 32 489
23 16000 64000 64 16 32 636
.....
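Plugging the first instance into the printed model shows how a prediction is formed (CHMIN does not appear in the model, presumably dropped by the default M5' attribute selection):

public class CpuPrediction
{
  public static void main(String args[])
  {
    // first cpu.arff instance: MYCT=125, MMIN=256, MMAX=6000, CACH=256, CHMAX=128
    double pred = 0.0491 * 125 + 0.0152 * 256 + 0.0056 * 6000
                + 0.6298 * 256 + 1.4599 * 128
                - 56.075;
    System.out.printf("%.2f%n", pred);  // ~335.65 versus the actual class 198:
    // this instance is one of the model's larger errors (training MAE ~ 38)
  }
}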





References: 1. weka.classifiers.functions.LinearRegression code | doc

weka.classifiers.functions.SimpleLinearRegression

weka.classifiers.functions.SimpleLinearRegression is a simple linear regression learner;
'simple' means it picks the single attribute with the smallest squared error and fits a line on it alone.
It only predicts numeric classes from numeric attributes, and it does not accept instances with missing values.

> java  weka.classifiers.functions.SimpleLinearRegression  -t data\cpu.arff


Linear regression on MMAX

0.01 * MMAX - 34


Time taken to build model: 0 seconds
Time taken to test model on training data: 0.02 seconds

=== Error on training data ===

Correlation coefficient                  0.863
Mean absolute error                     50.8658
Root mean squared error                 81.0566
Relative absolute error                 53.0319 %
Root relative squared error             50.5197 %
Total Number of Instances              209     



=== Cross-validation ===

Correlation coefficient                  0.7844
Mean absolute error                     53.8054
Root mean squared error                 99.5674
Relative absolute error                 55.908  %
Root relative squared error             61.8997 %
Total Number of Instances              209    

The cpu.arff dataset has 209 instances; each uses 6 numeric attributes to predict 1 numeric attribute.

MYCT MMIN MMAX CACH CHMIN CHMAX class
125 256 6000 256 16 128 198
29 8000 32000 32 8 32 269
29 8000 32000 32 8 32 220
29 8000 32000 32 8 32 172
29 8000 16000 32 8 16 132
26 8000 32000 64 8 32 318
23 16000 32000 64 16 32 367
23 16000 32000 64 16 32 489
23 16000 64000 64 16 32 636
.....





References: 1. weka.classifiers.functions.SimpleLinearRegression code | doc

2015年10月26日 星期一

weka.classifiers.lazy.IB1

weka.classifiers.lazy.IB1 is a simple nearest-neighbour learner:
training just stores the original instances, and testing finds the single nearest instance and predicts its class value.
It gives a respectable baseline figure for a dataset.

When IB1 computes the distance between two instances, a nominal attribute contributes distance 0 when the two values are equal and 1 otherwise.
A numeric attribute is first normalized to [0,1] using the value range of the original training set,
and the squared difference of the two normalized values is the attribute distance; if either instance is missing an attribute's value, that attribute's distance is taken as 1.
Treating all attributes as equally important, the attribute distances are summed and the square root taken (Euclidean distance) to give the distance between the two instances.
The instance at the smallest distance is chosen as the reference, and its class value is returned as the prediction.
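A sketch of that distance on the first two weather.numeric instances (illustrative code; 64..85 and 65..96 are the temperature and humidity ranges of the training set listed below):

public class Ib1Distance
{
  // IB1-style distance: nominal attributes contribute 0/1, numeric attributes
  // are range-normalized to [0,1] before the squared difference is taken.
  static double distance(Object[] a, Object[] b, double[][] range)
  {
    double sum = 0;
    for (int i = 0; i < a.length; i++)
    {
      if (range[i] == null)                     // nominal: equal or not
        sum += a[i].equals(b[i]) ? 0 : 1;
      else                                      // numeric: normalize, then square
      {
        double span = range[i][1] - range[i][0];
        double diff = ((Double) a[i] - (Double) b[i]) / span;
        sum += diff * diff;
      }
    }
    return Math.sqrt(sum);                      // Euclidean distance
  }

  public static void main(String args[])
  {
    double[][] ranges = {null, {64, 85}, {65, 96}, null};
    Object[] u = {"sunny", 85.0, 85.0, "FALSE"};
    Object[] v = {"sunny", 80.0, 90.0, "TRUE"};
    System.out.println(distance(u, v, ranges)); // sqrt(0 + 0.057 + 0.026 + 1) = 1.04
  }
}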

> java -cp weka.jar;. weka.classifiers.lazy.IB1  -t data\weather.numeric.arff

IB1 classifier

Time taken to build model: 0.02 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                          0.0392
Mean absolute error                      0.5   
Root mean squared error                  0.7071
Relative absolute error                105      %
Root relative squared error            143.3236 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 4 5 | a = yes
 2 3 | b = no

The weather.numeric.arff dataset below has 14 instances that use 2 nominal and 2 numeric attributes to predict a nominal class.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1. weka.classifiers.lazy.IB1 code | doc

2015年10月19日 星期一

weka.classifiers.rules.Prism

weka.classifiers.rules.Prism is a simple rule-set learner:
training takes each class in turn and uses a covering approach to build rules of high accuracy for it;
testing scans the rule list from the top and predicts with the first rule whose conditions match the instance's attribute values.
It gives a respectable baseline figure for a dataset.

Prism learns one rule set per class, each fitting all instances of its class, as follows:
Before building the rules that cover a class, place all instances in the pending set E.
As long as E still holds instances of the class, another rule must be added.
   To build a rule, start with one attribute test: enumerate every attribute=value combination and keep the one with the highest accuracy;
       then add the next attribute test, again enumerating every attribute=value combination and keeping the most accurate;
       and so on, until the attributes are used up or the rule is already perfectly accurate.
       When accuracies tie, prefer the combination with the larger coverage (denominator).
   Once a rule is built, remove from E the instances of the class that the rule already classifies correctly,
   and learn the next rule for the instances not yet covered.

Because Prism learns each class's rules with the enemy in full view (all instances of the other classes remain present),
at prediction time it makes no difference which of a class's rules is checked first; a rule never strays into another class's instance space.
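A toy run of the first learning step for class yes on weather.nominal (the dataset is listed further below): every single attribute=value test is scored by its accuracy, and outlook=overcast wins at 4/4, which is why "If outlook = overcast then yes" heads the rule list printed below:

public class PrismFirstTest
{
  public static void main(String args[])
  {
    String[][] data = {
      {"sunny","hot","high","FALSE","no"},      {"sunny","hot","high","TRUE","no"},
      {"overcast","hot","high","FALSE","yes"},  {"rainy","mild","high","FALSE","yes"},
      {"rainy","cool","normal","FALSE","yes"},  {"rainy","cool","normal","TRUE","no"},
      {"overcast","cool","normal","TRUE","yes"},{"sunny","mild","high","FALSE","no"},
      {"sunny","cool","normal","FALSE","yes"},  {"rainy","mild","normal","FALSE","yes"},
      {"sunny","mild","normal","TRUE","yes"},   {"overcast","mild","high","TRUE","yes"},
      {"overcast","hot","normal","FALSE","yes"},{"rainy","mild","high","TRUE","no"}};
    String[] attr = {"outlook", "temperature", "humidity", "windy"};
    String[][] vals = {{"sunny","overcast","rainy"}, {"hot","mild","cool"},
                       {"high","normal"}, {"TRUE","FALSE"}};

    for (int a = 0; a < attr.length; a++)
      for (String v : vals[a])
      {
        int cover = 0, correct = 0;              // accuracy = correct / cover
        for (String[] row : data)
          if (row[a].equals(v))
          {
            cover++;
            if (row[4].equals("yes")) correct++;
          }
        System.out.printf("%s=%s  %d/%d%n", attr[a], v, correct, cover);
      }
    // outlook=overcast scores 4/4, the best single test for class yes
  }
}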


> java -cp weka.jar;. weka.classifiers.rules.Prism  -t data\weather.nominal.arff


Prism rules
----------
If outlook = overcast then yes
If humidity = normal
   and windy = FALSE then yes
If temperature = mild
   and humidity = normal then yes
If outlook = rainy
   and windy = FALSE then yes
If outlook = sunny
   and humidity = high then no
If outlook = rainy
   and windy = TRUE then no


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.4375
Mean absolute error                      0.25  
Root mean squared error                  0.5   
Relative absolute error                 59.2264 %
Root relative squared error            105.9121 %
UnClassified Instances                   2               14.2857 %
Total Number of Instances               14     


=== Confusion Matrix ===

 a b   <-- classified as
 7 0 | a = yes
 3 2 | b = no


The weather.nominal.arff dataset below has 14 instances: 9 yes and 5 no.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
rainy cool normal TRUE no
sunny mild high FALSE no
rainy mild high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
overcast cool normal TRUE yes
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
References: 1. weka.classifiers.rules.Prism code | doc

Let death be what takes us, not lack of imagination.

What Really Matters at the End of Life by J.B. Miller on TED

In the talk above on what really matters at the end of life, palliative-care physician Miller closes with:

If we love such moments ferociously, then maybe we can learn to live well 
-- not in spite of death, but because of it. 
Let death be what takes us, not lack of imagination.

The gist: if we fiercely love moments like feeling snow melt in our hands on a freezing day, we may yet learn how to live well at the end,
because the attitude is no longer resisting death but accepting it; with death ahead, every second of felt experience becomes precious.
The speaker hopes death can be what carries us forward, the motive force for treasuring sensation and living vividly, rather than something inert and devoid of imagination.

The word "take" in the last sentence is slippery, with many senses, and its exact meaning here is hard to pin down.
Judging from the TED translation, reading it as take/carry/move/drive/lead/guide somebody somewhere fits the preceding text best.
Several ways to render the sentence:

 May it be death that leads us, not a barren imagination:
          Hope that it is death that guides us, not the scanty imagination (guiding us).

 Let death be something that can guide us, rather than something we decline to imagine (by Allen Kuo & Sharon Loh):
          Let death become something which can guide us, instead of something which we don't need to imagine.

 Let death be what carries us forward; don't let it be a thing without imagination:
          Let death be something which takes us (forward).
          Don't let death be lack of imagination (or something which allows no imagination).

2015年10月14日 星期三

weka.classifiers.trees.Id3

weka.classifiers.trees.Id3 is a simple decision-tree learner:
training builds a decision tree whose nodes test attributes; testing routes each instance down the tree by its attribute values until it reaches a leaf, where the instance is assigned the leaf's majority class.
It gives a respectable baseline figure for a dataset.

Id3 grows the tree from root toward leaves, at every level choosing a suitable attribute to test so that the class distributions after the split become purer.
Purity is measured by the entropy of the class distribution: the smaller the entropy, the purer the distribution, i.e. the more a single class dominates.
An attribute's purifying power is its gain = entropy of the class distribution before applying the attribute - combined entropy of the class distributions after applying it.
The combined entropy weights each branch's distribution by its instance count; the larger the gain, the stronger the purifying power.
To avoid overfitting by picking attributes with many branches, the actual selection criterion is the gain ratio = gain / intrinsic information of the split.
The more branches an attribute has, the larger its intrinsic information; only an attribute whose gain is large while its intrinsic information stays modest gets a large gain ratio and becomes the new node.
Tree building continues until the remaining instances belong to a single class, or every attribute's gain is negative, so that no new attribute node can raise purity.
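For instance, the gain of outlook on the weather.nominal data works out as follows (info([9,5]) = 0.940 bits before the split, gain = 0.247 bits):

public class OutlookGain
{
  // entropy of a class distribution, in bits
  static double entropy(double... counts)
  {
    double total = 0, e = 0;
    for (double c : counts) total += c;
    for (double c : counts)
      if (c > 0) e -= (c / total) * Math.log(c / total) / Math.log(2);
    return e;
  }

  public static void main(String args[])
  {
    double before = entropy(9, 5);              // all 14 instances: 9 yes, 5 no
    double after = (5.0 / 14) * entropy(2, 3)   // sunny:    2 yes, 3 no
                 + (4.0 / 14) * entropy(4, 0)   // overcast: 4 yes, 0 no
                 + (5.0 / 14) * entropy(3, 2);  // rainy:    3 yes, 2 no
    System.out.printf("info=%.3f gain=%.3f%n", before, before - after);
  }
}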

> java -cp weka.jar;. weka.classifiers.trees.Id3  -t data\weather.nominal.arff

Id3

outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = TRUE: no
|  windy = FALSE: yes


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances          12               85.7143 %
Incorrectly Classified Instances         2               14.2857 %
Kappa statistic                          0.6889
Mean absolute error                      0.1429
Root mean squared error                  0.378
Relative absolute error                 30      %
Root relative squared error             76.6097 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 8 1 | a = yes
 1 4 | b = no


The weather.nominal.arff dataset below has 14 instances that use 4 nominal attributes to predict a nominal class.
outlook temperature humidity windy play
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no
References: 1. weka.classifiers.trees.Id3 code | doc

2015年10月13日 星期二

weka.classifiers.bayes.NaiveBayes

weka.classifiers.bayes.NaiveBayes is the naive Bayes probabilistic learner:
it records the prior probability of each class and the conditional probability of each attribute value given the class,
then, for each instance, multiplies these to get the posterior probability of each class given the attribute values and predicts the most probable class.
It gives a respectable baseline figure for a dataset.

During training, NaiveBayes tallies, for each class, its prior probability and the conditional probability of each attribute value given the class.
Numeric attributes are assumed by default to come from a normal distribution, whose mean and standard deviation are recorded to estimate the conditional probabilities.
At prediction time, the posterior probability of each class given the new instance's attribute values is accumulated by multiplication, and the most probable class is predicted.

In the printed model, an attribute's per-class weight sum equals the instance count when every instance has weight 1.
An attribute's precision = sum of the gaps between adjacent numeric values (deltaSum) / number of distinct values;
values separated by less than the precision are treated as the same value when the distribution parameters are estimated.

Parameters:
-K  estimate the conditional probabilities of numeric attributes with a kernel density estimator instead of a normal distribution
-D  discretize numeric attributes with supervised discretization, treating them as several nominal interval values
-O  print the model in the old format, suitable when there are many classes

> java -cp weka.jar;. weka.classifiers.bayes.NaiveBayes  -t data\weather.numeric.arff

Naive Bayes Classifier

                 Class
Attribute          yes      no
                (0.63)  (0.38)
===============================
outlook
  sunny             3.0     4.0
  overcast          5.0     1.0
  rainy             4.0     3.0
  [total]          12.0     8.0

temperature
  mean          72.9697 74.8364
  std. dev.      5.2304   7.384
  weight sum          9       5
  precision      1.9091  1.9091

humidity
  mean          78.8395 86.1111
  std. dev.      9.8023  9.2424
  weight sum          9       5
  precision      3.4444  3.4444

windy
  TRUE              4.0     4.0
  FALSE             7.0     3.0
  [total]          11.0     7.0


Time taken to build model: 0.01 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %
Kappa statistic                          0.8372
Mean absolute error                      0.2798
Root mean squared error                  0.3315
Relative absolute error                 60.2576 %
Root relative squared error             69.1352 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0.1026
Mean absolute error                      0.4649
Root mean squared error                  0.543
Relative absolute error                 97.6254 %
Root relative squared error            110.051  %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 8 1 | a = yes
 4 1 | b = no

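A sketch of the prediction arithmetic, reusing the counts and normal-distribution parameters printed in the model above (approximate: it ignores Weka's precision rounding; the priors 10/16 and 6/16 are the Laplace-corrected (9+1)/(14+2) and (5+1)/(14+2)):

public class NaiveBayesScore
{
  // normal probability density with the given mean and standard deviation
  static double gaussian(double x, double mean, double sd)
  {
    double z = (x - mean) / sd;
    return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
  }

  public static void main(String args[])
  {
    // instance: outlook=sunny, temperature=85, humidity=85, windy=FALSE
    double yes = (10.0 / 16) * (3.0 / 12) * gaussian(85, 72.9697, 5.2304)
               * gaussian(85, 78.8395, 9.8023) * (7.0 / 11);
    double no  = (6.0 / 16) * (4.0 / 8) * gaussian(85, 74.8364, 7.3840)
               * gaussian(85, 86.1111, 9.2424) * (3.0 / 7);
    // normalize the two scores into posterior probabilities: no wins, ~0.8
    System.out.printf("P(yes)=%.3f  P(no)=%.3f%n",
        yes / (yes + no), no / (yes + no));
  }
}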

The weather.numeric.arff dataset below has 14 instances that use 2 nominal and 2 numeric attributes to predict a nominal class.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: 1. weka.classifiers.bayes.NaiveBayes code | doc 2. weka.estimators.NormalEstimator code | doc 3. weka.estimators.KernelEstimator code | doc 4. weka.filters.supervised.attribute.Discretize code | doc

2015年10月11日 星期日

weka.classifiers.misc.VFI

weka.classifiers.misc.VFI is a voting-feature-intervals learner:
it records the class distribution of every attribute value/interval, then accumulates the distributions of the intervals a new instance falls into and predicts the most probable class.
It gives a basic baseline figure for a dataset.

During training, VFI tallies a class distribution for every nominal value and numeric interval.
For a numeric attribute, the interval cut points are each class's minimum and maximum values plus plus/minus infinity,
giving m = (number of classes x 2) + 2 cut points and, when the class ranges do not coincide, up to m-1 intervals.
The class distribution of each of these m-1 intervals is then recorded; an instance whose attribute value sits exactly on a cut point contributes half an instance to the interval on each side.

At prediction time, the class distributions of the intervals the new instance falls into are accumulated over all attributes, and the most probable class is predicted.

Parameters:
-C  turn confidence weighting off. When the per-attribute class distributions are accumulated, confidence weighting is on by default; this flag disables it.
-B <bias>  with confidence weighting on, each attribute's distribution is weighted by its class-distribution entropy raised to the power -bias; default bias 0.6.
   An attribute's class-distribution entropy lies between 0 and log2(number of classes); the smaller it is, the more discriminative the attribute.
   bias lies between 0 and 1; bias = 0 keeps every distribution's original contribution (multiplicative weight 1),
   while a larger bias amplifies the contribution of highly discriminative attributes when the distributions are summed (multiplicative weight > 1).

> java -cp weka.jar;. weka.classifiers.misc.VFI  -t data\weather.numeric.arff

Voting feature intervals classifier

 outlook :
  sunny
    2.0    3.0
  overcast
    4.0    0.0
  rainy
    3.0    2.0

 temperature :
  -Infinity
    0.5    0.0
  64.0
    0.5    0.5
  65.0
    7.5    3.5
  83.0
    0.5    0.5
  85.0
    0.0    0.5
  Infinity


 humidity :
  -Infinity
    0.5    0.0
  65.0
    1.5    0.5
  70.0
    6.0    4.0
  95.0
    0.5    0.5
  96.0
    0.5    0.0
  Infinity


 windy :
  TRUE
    3.0    3.0
  FALSE
    6.0    2.0


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          12               85.7143 %
Incorrectly Classified Instances         2               14.2857 %
Kappa statistic                          0.7143
Mean absolute error                      0.3354
Root mean squared error                  0.3996
Relative absolute error                 72.2387 %
Root relative squared error             83.3373 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 0 5 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.0426
Mean absolute error                      0.4725
Root mean squared error                  0.5624
Relative absolute error                 99.2318 %
Root relative squared error            113.9897 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no

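A simplified sketch of the voting step with confidence weighting turned off (-C); the class counts are copied from the intervals printed above for the instance (sunny, 72, 90, FALSE):

public class VfiVote
{
  public static void main(String args[])
  {
    double[][] counts = {
      {2.0, 3.0},   // outlook = sunny          -> counts [yes, no]
      {7.5, 3.5},   // temperature 72, interval 65..83
      {6.0, 4.0},   // humidity 90, interval 70..95
      {6.0, 2.0},   // windy = FALSE
    };
    double yes = 0, no = 0;
    for (double[] c : counts)       // normalize each distribution, then sum
    {
      double sum = c[0] + c[1];
      yes += c[0] / sum;
      no  += c[1] / sum;
    }
    System.out.printf("votes: yes=%.2f no=%.2f -> predict %s%n",
        yes, no, (yes > no) ? "yes" : "no");  // yes=2.43 no=1.57 -> yes
  }
}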

The weather.numeric.arff dataset below has 14 instances that use 2 nominal and 2 numeric attributes to predict a nominal class.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: weka.classifiers.misc.VFI 1. source code 2. documentation

windows mklink vs unix link

In a file system there is often a need for different paths that refer to the same file object (a file or a directory).
Windows file systems traditionally offered shortcut files (.lnk), which Explorer and some applications resolve to reach a file object along another path.
Since Windows Vista, Windows has imitated Unix symbolic links and supports the following four kinds of links at the file-system level:
   SYMLINK, SYMLINKD, JUNCTION, HardLink

0. Shortcut link: a special shortcut file (.lnk) usable only by applications that support it; resolved by the client application
   DOS DIR shows the .lnk extension
   [Note] A .lnk shared over the network may not be usable on the other machine
   [Note] del .lnk deletes the shortcut; the linked object remains

1. File symbolic link: requires administrator rights under the default security policy; may cross volumes; resolved by the client file system
   Windows command: mklink   file_soft_link   file
   Unix command:    ln  -s  file  file_soft_link
   DOS DIR shows <SYMLINK>
   [Note] A file_soft_link shared over the network may not be usable on the other machine
   [Note] del file_soft_link deletes the symbolic link; the linked file remains

2. Directory symbolic link: requires administrator rights under the default security policy; may cross volumes; resolved by the client file system
   Windows command: mklink  /d  dir_soft_link   dir
   Unix command:    ln  -s  dir  dir_soft_link
   DOS DIR shows <SYMLINKD>
   [Note] A dir_soft_link shared over the network may not be usable on the other machine
   [Note] rmdir dir_soft_link deletes the symbolic link; the linked directory remains
   [Note] del dir_soft_link asks whether to delete the directory's entire contents

3. Directory junction: needs no special rights; the target must be on the local machine but may be on any volume; resolved by the server file system
   Windows command: mklink  /j  dir_hard_link   dir
   Unix command:    no comparable Unix command
   DOS DIR shows <JUNCTION>
   [Note] A dir_hard_link shared over the network is still usable on the other machine
   [Note] rmdir dir_hard_link deletes the junction; the linked directory remains
   [Note] del dir_hard_link asks whether to delete the directory's entire contents

4. File hard link: needs no special rights; limited to the same volume on the local machine; resolved by the server file system
   Windows command: mklink  /h  file_hard_link  file
   Unix command:    ln  file  file_hard_link
   DOS DIR shows it as an ordinary file, with no special marker
   [Note] A file_hard_link shared over the network is still usable on the other machine
   [Note] del file_hard_link deletes the link; once the linked file has no links left, the file itself is deleted

Note: under the default security policy, the two symbolic-link commands mklink and mklink /d require administrator rights; run the DOS window as administrator to use them.

2015年10月10日 星期六

weka.classifiers.misc.HyperPipes

weka.classifiers.misc.HyperPipes is a per-class attribute-interval learner:
it records the intervals in which each class's attribute values occur, then predicts the class matching the highest fraction of a new instance's attributes. It gives a basic baseline figure for a dataset.

During training, HyperPipes builds one hyperpipe per class, recording for every attribute the range of values in which that class's instances occur.
At prediction time, it computes how well the new instance fits each class's hyperpipe and predicts the best-fitting class.
An instance's fit to a class's hyperpipe (0~1) is the fraction (0~100%) of the instance's attributes that fall inside the attribute ranges recorded in that hyperpipe.

> java -cp weka.jar;. weka.classifiers.misc.HyperPipes  -t data\weather.numeric.arff

HyperPipes classifier
HyperPipe for class: yes
  temperature: 64.0,83.0,
  humidity: 65.0,96.0,
  outlook: true,true,true,
  windy: true,true,

HyperPipe for class: no
  temperature: 65.0,85.0,
  humidity: 70.0,95.0,
  outlook: true,false,true,
  windy: true,true,


Time taken to build model: 0 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.2432
Mean absolute error                      0.4531
Root mean squared error                  0.4597
Relative absolute error                 97.5824 %
Root relative squared error             95.8699 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 4 1 | b = no



=== Stratified cross-validation ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0
Mean absolute error                      0.483
Root mean squared error                  0.4899
Relative absolute error                101.4286 %
Root relative squared error             99.3055 %
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no

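A sketch of the scoring step, using the ranges printed in the model above, for the instance (sunny, 85, 85, FALSE): each class's fit is the fraction of the four attributes falling inside the ranges its hyperpipe recorded:

public class HyperPipeScore
{
  // 1 when the value lies inside the recorded range, else 0
  static int inRange(double v, double lo, double hi)
  {
    return (lo <= v && v <= hi) ? 1 : 0;
  }

  public static void main(String args[])
  {
    double temp = 85, humid = 85;   // outlook=sunny and windy=FALSE were seen
                                    // by both classes, so each contributes 1
    double yes = (inRange(temp, 64, 83) + inRange(humid, 65, 96) + 1 + 1) / 4.0;
    double no  = (inRange(temp, 65, 85) + inRange(humid, 70, 95) + 1 + 1) / 4.0;
    System.out.printf("fit(yes)=%.2f  fit(no)=%.2f%n", yes, no);  // 0.75 vs 1.00
  }
}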

The weather.numeric.arff dataset below has 14 instances: 9 yes and 5 no.
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 72 95 FALSE no
rainy 71 91 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
References: weka.classifiers.misc.HyperPipes 1. source code 2. documentation