Monday, December 28, 2015

How to Save and Load a Model in Weka for Training and Testing

Two Weka Command Line Examples of Using Models in Training and Testing:

(1) train and save a OneR model
    load and test a OneR model
    both using the weather.nominal.arff dataset

(2) train and save a FilteredClassifier (StringToWordVector + J48) model
    load and test a FilteredClassifier (StringToWordVector + J48) model
    using the crude_oil_train.arff dataset for training
      and the crude_oil_test.arff dataset for testing

#-------------------------------
# ask for the classifier's options (-h) and synopsis (-info)

>java -cp weka.jar weka.classifiers.rules.OneR -h -info

Help requested.

General options:

-h or -help
 Output help information.
-synopsis or -info
 Output synopsis for classifier (use in conjunction  with -h)
-t <name of training file>
 Sets training file.
-T <name of test file>
 Sets test file. If missing, a cross-validation will be performed
 on the training data.
-c <class index>
 Sets index of class attribute (default: last).
-x <number of folds>
 Sets number of folds for cross-validation (default: 10).
-no-cv
 Do not perform any cross validation.
-split-percentage <percentage>
 Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
 Preserves the order in the percentage split.
-s <random number seed>
 Sets random number seed for cross-validation or percentage split
 (default: 1).
-m <name of file with cost matrix>
 Sets file with cost matrix.
-l <name of input file>
 Sets model input file. In case the filename ends with '.xml',
 a PMML file is loaded or, if that fails, options are loaded
 from the XML file.
-d <name of output file>
 Sets model output file. In case the filename ends with '.xml',
 only the options are saved to the XML file, not the model.
-v
 Outputs no statistics for training data.
-o
 Outputs statistics only, not the classifier.
-i
 Outputs detailed information-retrieval statistics for each class.
-k
 Outputs information-theoretic statistics.
-p <attribute range>
 Only outputs predictions for test instances (or the train
 instances if no test instances provided and -no-cv is used),
 along with attributes (0 for none).
-distribution
 Outputs the distribution instead of only the prediction
 in conjunction with the '-p' option (only nominal classes).
-r
 Only outputs cumulative margin distribution.
-z <class name>
 Only outputs the source representation of the classifier,
 giving it the supplied name.
-xml filename | xml-string
 Retrieves the options from the XML-data instead of the command line.
-threshold-file <file>
 The file to save the threshold data to.
 The format is determined by the extensions, e.g., '.arff' for ARFF
 format or '.csv' for CSV.
-threshold-label <label>
 The class label to determine the threshold data for
 (default is the first label)

Options specific to weka.classifiers.rules.OneR:

-B <minimum bucket size>
 The minimum number of objects in a bucket (default: 6).

Synopsis for weka.classifiers.rules.OneR: # the synopsis is shown only with the -info option

Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see:

R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.



#---------------------------------------------------------
# Example (1): use OneR to train and test on weather.nominal.arff

# train the classifier on the training data and save the model, skipping evaluation

>java -cp weka.jar weka.classifiers.rules.OneR \
>   -t data/weather.nominal.arff -no-cv -v -d model.dat

outlook:
        sunny   -> no
        overcast        -> yes
        rainy   -> yes
(10/14 instances correct)

=== Error on training data ===   # this report is suppressed by the -v option

Correctly Classified Instances          10               71.4286 %
.....

=== Stratified cross-validation ===   # this report is suppressed by the -no-cv option

Correctly Classified Instances           6               42.8571 %
.....
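
To see where the "outlook" rule and the 10/14 figure come from, here is a minimal Python sketch of the 1R idea on the same 14 weather rows: for each attribute, predict the majority class per value and keep the attribute with the most correct predictions. This is an illustration, not Weka's implementation. In particular, outlook and humidity both get 10 of 14 right on this data, and the sketch keeps the earlier attribute on a tie, which is an assumption that happens to agree with the output above.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]  # last column is the class
DATA = """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ROWS = [line.split(",") for line in DATA.split()]


def one_r(rows, attributes):
    """Return (attribute name, {value: predicted class}, number of rows correct)."""
    best = None
    for index, name in enumerate(attributes):
        counts = defaultdict(Counter)  # attribute value -> class frequencies
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # Majority class per attribute value.
        rule = {value: freq.most_common(1)[0][0] for value, freq in counts.items()}
        correct = sum(freq[rule[value]] for value, freq in counts.items())
        # Strict '>' keeps the earliest attribute on ties (assumption of this sketch).
        if best is None or correct > best[2]:
            best = (name, rule, correct)
    return best


if __name__ == "__main__":
    name, rule, correct = one_r(ROWS, ATTRIBUTES)
    print(f"{name}:")
    for value, label in rule.items():
        print(f"        {value}\t-> {label}")
    print(f"({correct}/{len(ROWS)} instances correct)")
```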


# load the saved model and evaluate it on the test data (here the same file used for training)

>java -cp weka.jar weka.classifiers.rules.OneR \
>   -T data/weather.nominal.arff -l model.dat

outlook:
        sunny   -> no
        overcast        -> yes
        rainy   -> yes
(10/14 instances correct)

=== Error on test data ===

Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.3778
Mean absolute error                      0.2857
Root mean squared error                  0.5345
Total Number of Instances               14


=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 2 3 | b = no
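
The summary figures above can be recomputed from the confusion matrix alone. The sketch below derives the accuracy and Cohen's kappa from the four counts. It also reproduces the two error figures under the assumption that OneR outputs hard predictions (probability 1 for the predicted class), in which case the mean absolute error equals the error rate and the root mean squared error is its square root. That reading matches the numbers printed here, but it is an interpretation, not something taken from Weka's source.

```python
import math

# Rows are actual classes, columns are predicted classes (a = yes, b = no).
CONFUSION = [[7, 2],
             [2, 3]]


def accuracy(matrix):
    """Fraction of instances on the diagonal."""
    total = sum(map(sum, matrix))
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what the marginals give by chance."""
    total = sum(map(sum, matrix))
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    error_rate = 1 - accuracy(CONFUSION)
    print(f"Accuracy                 {100 * accuracy(CONFUSION):.4f} %")  # 71.4286 %
    print(f"Kappa statistic          {kappa(CONFUSION):.4f}")             # 0.3778
    # Assumes hard 0/1 predictions, see the note above.
    print(f"Mean absolute error      {error_rate:.4f}")                   # 0.2857
    print(f"Root mean squared error  {math.sqrt(error_rate):.4f}")        # 0.5345
```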



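# load the model and print per-instance predictions; -p first-last also prints every attribute value

>java -cp weka.jar weka.classifiers.rules.OneR \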
>   -T data/weather.nominal.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (outlook,temperature,humidity,windy)
     1       2:no       2:no       1 (sunny,hot,high,FALSE)
     2       2:no       2:no       1 (sunny,hot,high,TRUE)
     3      1:yes      1:yes       1 (overcast,hot,high,FALSE)
     4      1:yes      1:yes       1 (rainy,mild,high,FALSE)
     5      1:yes      1:yes       1 (rainy,cool,normal,FALSE)
     6       2:no      1:yes   +   1 (rainy,cool,normal,TRUE)
     7      1:yes      1:yes       1 (overcast,cool,normal,TRUE)
     8       2:no       2:no       1 (sunny,mild,high,FALSE)
     9      1:yes       2:no   +   1 (sunny,cool,normal,FALSE)
    10      1:yes      1:yes       1 (rainy,mild,normal,FALSE)
    11      1:yes       2:no   +   1 (sunny,mild,normal,TRUE)
    12      1:yes      1:yes       1 (overcast,mild,high,TRUE)
    13      1:yes      1:yes       1 (overcast,hot,normal,FALSE)
    14       2:no      1:yes   +   1 (rainy,mild,high,TRUE)
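
If the predictions are needed in a script, the table above can be parsed with a regular expression. The sketch below is written against the exact layout shown in this post (instance number, actual, predicted, an optional "+" error flag, the prediction confidence, and the attribute values in parentheses); other Weka versions may print a different layout, so treat the pattern as an assumption. The function name parse_predictions is hypothetical.

```python
import re

OUTPUT = """\
 inst#     actual  predicted error prediction (outlook,temperature,humidity,windy)
     1       2:no       2:no       1 (sunny,hot,high,FALSE)
     2       2:no       2:no       1 (sunny,hot,high,TRUE)
     3      1:yes      1:yes       1 (overcast,hot,high,FALSE)
     4      1:yes      1:yes       1 (rainy,mild,high,FALSE)
     5      1:yes      1:yes       1 (rainy,cool,normal,FALSE)
     6       2:no      1:yes   +   1 (rainy,cool,normal,TRUE)
     7      1:yes      1:yes       1 (overcast,cool,normal,TRUE)
     8       2:no       2:no       1 (sunny,mild,high,FALSE)
     9      1:yes       2:no   +   1 (sunny,cool,normal,FALSE)
    10      1:yes      1:yes       1 (rainy,mild,normal,FALSE)
    11      1:yes       2:no   +   1 (sunny,mild,normal,TRUE)
    12      1:yes      1:yes       1 (overcast,mild,high,TRUE)
    13      1:yes      1:yes       1 (overcast,hot,normal,FALSE)
    14       2:no      1:yes   +   1 (rainy,mild,high,TRUE)
"""

# inst#, actual, predicted, optional '+', confidence, attribute values
LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\+)?\s*([\d.]+)\s+\((.*)\)\s*$")


def parse_predictions(text):
    """Return one dict per prediction line; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue
        inst, actual, predicted, flag, confidence, attrs = match.groups()
        rows.append({
            "inst": int(inst),
            "actual": actual.split(":", 1)[1],      # "2:no" -> "no"
            "predicted": predicted.split(":", 1)[1],
            "error": flag == "+",
            "confidence": float(confidence),
            "attributes": attrs,
        })
    return rows


if __name__ == "__main__":
    wrong = [row["inst"] for row in parse_predictions(OUTPUT) if row["error"]]
    print("misclassified instances:", wrong)  # [6, 9, 11, 14]
```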


#--------------------------------------------------------------------
#Example (2): use FilteredClassifier (StringToWordVector + J48) to
#             train on crude_oil_train.arff and test on crude_oil_test.arff

# train the classifier on the training data and save the model, skipping evaluation

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>    -no-cv -v -t data/crude_oil_train.arff -d model.dat \
>    -F weka.filters.unsupervised.attribute.StringToWordVector \
>    -W weka.classifiers.trees.J48

Options: -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.trees.J48

FilteredClassifier using weka.classifiers.trees.J48 -C 0.25 -M 2 on data filtered through weka.filters.unsupervised.attribute.StringToWordVector -R 1 -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

Filtered Header
@relation 'crude_oil_train-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute class {yes,no}
@attribute Crude numeric
@attribute Demand numeric
@attribute The numeric
@attribute crude numeric
@attribute for numeric
@attribute has numeric
@attribute in numeric
@attribute increased numeric
@attribute is numeric
@attribute of numeric
@attribute oil numeric
@attribute outstrips numeric
@attribute price numeric
@attribute short numeric
@attribute significantly numeric
@attribute supply numeric
@attribute Some numeric
@attribute Use numeric
@attribute a numeric
@attribute bit numeric
@attribute cooking numeric
@attribute do numeric
@attribute flavor numeric
@attribute food numeric
@attribute frying numeric
@attribute like numeric
@attribute not numeric
@attribute oily numeric
@attribute olive numeric
@attribute pan numeric
@attribute people numeric
@attribute the numeric
@attribute very numeric
@attribute was numeric

@data


Classifier Model
J48 pruned tree
------------------

crude <= 0: no (4.0/1.0)
crude > 0: yes (2.0)

Number of Leaves  :     2

Size of the tree :      3
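
The filtered header and the two-leaf tree can be checked by hand. The sketch below splits the six training documents on the delimiter characters printed in the filter options above and keeps the tokens case-sensitive, which is why "Crude" and "crude" show up as separate attributes. It then counts the classes on each side of the "crude" test. It is an approximation of what StringToWordVector and J48 do on this tiny dataset, not a reimplementation of either.

```python
import re
from collections import Counter

TRAIN = [
    ("The price of crude oil has increased significantly", "yes"),
    ("Demand for crude oil outstrips supply", "yes"),
    ("Some people do not like the flavor of olive oil", "no"),
    ("The food was very oily", "no"),
    ("Crude oil is in short supply", "yes"),
    ("Use a bit of cooking oil in the frying pan", "no"),
]

# Delimiter characters from the WordTokenizer options shown in the output above.
DELIMITERS = " \r\n\t.,;:'\"()?!"
SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Split on the delimiter set; case is preserved."""
    return [token for token in SPLITTER.split(text) if token]


def leaf_counts(documents, word):
    """Class frequencies for documents without / with the given word."""
    absent, present = Counter(), Counter()
    for text, label in documents:
        (present if word in tokenize(text) else absent)[label] += 1
    return absent, present


if __name__ == "__main__":
    vocabulary = {token for text, _ in TRAIN for token in tokenize(text)}
    print(len(vocabulary), "word attributes")   # 34, as in the filtered header
    absent, present = leaf_counts(TRAIN, "crude")
    print("crude <= 0:", dict(absent))          # {'no': 3, 'yes': 1}
    print("crude > 0: ", dict(present))         # {'yes': 2}
```

The "crude <= 0" side holds four documents, one of which ('Crude oil is in short supply', with a capital C) is a "yes"; this corresponds to the leaf printed as no (4.0/1.0).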



# load the saved model and test it on the test data, printing per-instance predictions

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>                   -T data/crude_oil_test.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (document)
     1      1:yes      1:yes       1 ('Oil platforms extract crude oil')
     2       2:no       2:no       0.75 ('Canola oil is supposed to be healthy')
     3      1:yes       2:no   +   0.75 ('Iraq has significant oil reserves')
     4       2:no       2:no       0.75 ('There are different types of cooking oil')
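
These four predictions follow directly from the two-leaf tree. The sketch below applies the rule "document contains the token crude" to the test documents and reports the leaf's class frequency as the confidence: 2/2 = 1 for the "yes" leaf and 3/4 = 0.75 for the "no" leaf. The leaf counts are copied from the tree printed in this post, and reading the confidence as an unsmoothed leaf frequency is an assumption that matches the output here.

```python
import re

TEST = [
    ("Oil platforms extract crude oil", "yes"),
    ("Canola oil is supposed to be healthy", "no"),
    ("Iraq has significant oil reserves", "yes"),
    ("There are different types of cooking oil", "no"),
]

# Same delimiter characters as in the WordTokenizer options of the trained filter.
SPLITTER = re.compile("[" + re.escape(" \r\n\t.,;:'\"()?!") + "]+")

# Leaves copied from the J48 output: (predicted class, instances, misclassified).
LEAF_WITHOUT_CRUDE = ("no", 4.0, 1.0)   # crude <= 0: no (4.0/1.0)
LEAF_WITH_CRUDE = ("yes", 2.0, 0.0)     # crude > 0: yes (2.0)


def predict(text):
    """Return (predicted class, confidence) for one document."""
    tokens = [token for token in SPLITTER.split(text) if token]
    label, total, wrong = LEAF_WITH_CRUDE if "crude" in tokens else LEAF_WITHOUT_CRUDE
    return label, (total - wrong) / total


if __name__ == "__main__":
    for number, (text, actual) in enumerate(TEST, start=1):
        label, confidence = predict(text)
        flag = "+" if label != actual else " "
        print(f"{number:6d} {actual:>6} {label:>6} {flag} {confidence:g} ('{text}')")
```

Document 3 mentions oil reserves but not the word "crude", so it lands in the "no" leaf and is flagged as an error.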



# load the saved model and predict labels for unlabeled data (class values given as '?')

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
>                   -T data/crude_oil_test2.arff -l model.dat -p first-last

=== Predictions on test data ===

 inst#     actual  predicted error prediction (document)
     1        1:?      1:yes       1 ('Oil platforms extract crude oil')
     2        1:?       2:no       0.75 ('Canola oil is supposed to be healthy')
     3        1:?       2:no       0.75 ('Iraq has significant oil reserves')
     4        1:?       2:no       0.75 ('There are different types of cooking oil')


######### data/weather.nominal.arff
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no



######### data/crude_oil_train.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17 tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_train'
%
@attribute document string
@attribute class {yes,no}
%
@data
'The price of crude oil has increased significantly',yes
'Demand for crude oil outstrips supply',yes
'Some people do not like the flavor of olive oil',no
'The food was very oily',no
'Crude oil is in short supply',yes
'Use a bit of cooking oil in the frying pan',no



######### data/crude_oil_test.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17   tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',yes
'Canola oil is supposed to be healthy',no
'Iraq has significant oil reserves',yes
'There are different types of cooking oil',no



######### data/crude_oil_test2.arff
%
%  witten-12-mkp-data mining- practical machine learning tools and techniques
%    ch17   tutorial exercises for the weka explorer
%    ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',?
'Canola oil is supposed to be healthy',?
'Iraq has significant oil reserves',?
'There are different types of cooking oil',?
