Two Weka Command Line Examples of Using Models in Training and Testing:

(1) Train and save a OneR model, then load and test the OneR model,
    both using the weather.nominal.arff dataset.
(2) Train and save a FilteredClassifier (StringToWordVector + J48) model,
    then load and test it, using the crude_oil_train.arff dataset for
    training and the crude_oil_test.arff dataset for testing.

#-------------------------------
#ask for classifier options
>java -cp weka.jar weka.classifiers.rules.OneR -h -info

Help requested.

General options:

-h or -help
	Output help information.
-synopsis or -info
	Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
	Sets training file.
-T <name of test file>
	Sets test file. If missing, a cross-validation will be performed
	on the training data.
-c <class index>
	Sets index of class attribute (default: last).
-x <number of folds>
	Sets number of folds for cross-validation (default: 10).
-no-cv
	Do not perform any cross validation.
-split-percentage <percentage>
	Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
	Preserves the order in the percentage split.
-s <random number seed>
	Sets random number seed for cross-validation or percentage split
	(default: 1).
-m <name of file with cost matrix>
	Sets file with cost matrix.
-l <name of input file>
	Sets model input file. In case the filename ends with '.xml',
	a PMML file is loaded or, if that fails, options are loaded from
	the XML file.
-d <name of output file>
	Sets model output file. In case the filename ends with '.xml',
	only the options are saved to the XML file, not the model.
-v
	Outputs no statistics for training data.
-o
	Outputs statistics only, not the classifier.
-i
	Outputs detailed information-retrieval statistics for each class.
-k
	Outputs information-theoretic statistics.
-p <attribute range>
	Only outputs predictions for test instances (or the train instances
	if no test instances provided and -no-cv is used), along with
	attributes (0 for none).
-distribution
	Outputs the distribution instead of only the prediction in
	conjunction with the '-p' option (only nominal classes).
-r
	Only outputs cumulative margin distribution.
-z <class name>
	Only outputs the source representation of the classifier, giving it
	the supplied name.
-xml filename | xml-string
	Retrieves the options from the XML-data instead of the command line.
-threshold-file <file>
	The file to save the threshold data to. The format is determined by
	the extensions, e.g., '.arff' for ARFF format or '.csv' for CSV.
-threshold-label <label>
	The class label to determine the threshold data for (default is the
	first label)

Options specific to weka.classifiers.rules.OneR:

-B <minimum bucket size>
	The minimum number of objects in a bucket (default: 6).

Synopsis for weka.classifiers.rules.OneR:   # synopsis is shown with -info option

Class for building and using a 1R classifier; in other words, uses the
minimum-error attribute for prediction, discretizing numeric attributes.
For more information, see:
R.C. Holte (1993). Very simple classification rules perform well on most
commonly used datasets. Machine Learning. 11:63-91.

#---------------------------------------------------------
# Example (1): use OneR to train and test on weather.nominal.arff

#train classifier by train_data and output model without evaluation
>java -cp weka.jar weka.classifiers.rules.OneR \
> -t data/weather.nominal.arff -no-cv -v -d model.dat

outlook:
	sunny -> no
	overcast -> yes
	rainy -> yes
(10/14 instances correct)

=== Error on training data ===       # this report not shown with -v option
Correctly Classified Instances   10   71.4286 %
.....

=== Stratified cross-validation ===  # this report not shown with -no-cv option
Correctly Classified Instances    6   42.8571 %
.....
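To make the OneR output above easier to interpret, here is a minimal Python sketch of the 1R idea (not Weka's implementation): for each attribute, map every value to its majority class, then keep the attribute whose rule makes the fewest training errors. Run on the weather.nominal data listed at the end of this post, it reproduces the outlook rule and the 10/14 figure.

```python
from collections import Counter

# weather.nominal data: (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny","hot","high","FALSE","no"),      ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"),  ("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"),  ("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"),("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"),  ("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"),   ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"),("rainy","mild","high","TRUE","no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def one_r(data):
    """Return (attribute, value->class rule, #correct) for the best 1R rule."""
    best = None
    for i, name in enumerate(attrs):
        counts = {}                                  # value -> Counter of classes
        for row in data:
            counts.setdefault(row[i], Counter())[row[-1]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or correct > best[2]:        # keep first best on ties
            best = (name, rule, correct)
    return best

name, rule, correct = one_r(data)
print(name, rule, f"({correct}/{len(data)} instances correct)")
```

This prints the same rule Weka reports: outlook with sunny -> no, overcast -> yes, rainy -> yes, 10/14 correct. (Weka's OneR additionally discretizes numeric attributes via the -B bucket-size option; this sketch handles nominal attributes only.)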
#load model and test classifier by test_data
>java -cp weka.jar weka.classifiers.rules.OneR \
> -T data/weather.nominal.arff -l model.dat

outlook:
	sunny -> no
	overcast -> yes
	rainy -> yes
(10/14 instances correct)

=== Error on test data ===
Correctly Classified Instances          10               71.4286 %
Incorrectly Classified Instances         4               28.5714 %
Kappa statistic                          0.3778
Mean absolute error                      0.2857
Root mean squared error                  0.5345
Total Number of Instances               14

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 2 3 | b = no

>java -cp weka.jar weka.classifiers.rules.OneR \
> -T data/weather.nominal.arff -l model.dat -p first-last

=== Predictions on test data ===
inst#  actual  predicted  error  prediction  (outlook,temperature,humidity,windy)
    1  2:no    2:no              1           (sunny,hot,high,FALSE)
    2  2:no    2:no              1           (sunny,hot,high,TRUE)
    3  1:yes   1:yes             1           (overcast,hot,high,FALSE)
    4  1:yes   1:yes             1           (rainy,mild,high,FALSE)
    5  1:yes   1:yes             1           (rainy,cool,normal,FALSE)
    6  2:no    1:yes      +      1           (rainy,cool,normal,TRUE)
    7  1:yes   1:yes             1           (overcast,cool,normal,TRUE)
    8  2:no    2:no              1           (sunny,mild,high,FALSE)
    9  1:yes   2:no       +      1           (sunny,cool,normal,FALSE)
   10  1:yes   1:yes             1           (rainy,mild,normal,FALSE)
   11  1:yes   2:no       +      1           (sunny,mild,normal,TRUE)
   12  1:yes   1:yes             1           (overcast,mild,high,TRUE)
   13  1:yes   1:yes             1           (overcast,hot,normal,FALSE)
   14  2:no    1:yes      +      1           (rainy,mild,high,TRUE)

#--------------------------------------------------------------------
#Example (2): use FilteredClassifier (StringToWordVector + J48) to
#   train on crude_oil_train.arff and test on crude_oil_test.arff

#train classifier by train_data and output model without evaluation
> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
> -no-cv -v -t data/crude_oil_train.arff -d model.dat \
> -F weka.filters.unsupervised.attribute.StringToWordVector \
> -W weka.classifiers.trees.J48

Options: -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.trees.J48

FilteredClassifier using weka.classifiers.trees.J48 -C 0.25 -M 2 on data filtered through weka.filters.unsupervised.attribute.StringToWordVector -R 1 -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

Filtered Header
@relation 'crude_oil_train-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute class {yes,no}
@attribute Crude numeric
@attribute Demand numeric
@attribute The numeric
@attribute crude numeric
@attribute for numeric
@attribute has numeric
@attribute in numeric
@attribute increased numeric
@attribute is numeric
@attribute of numeric
@attribute oil numeric
@attribute outstrips numeric
@attribute price numeric
@attribute short numeric
@attribute significantly numeric
@attribute supply numeric
@attribute Some numeric
@attribute Use numeric
@attribute a numeric
@attribute bit numeric
@attribute cooking numeric
@attribute do numeric
@attribute flavor numeric
@attribute food numeric
@attribute frying numeric
@attribute like numeric
@attribute not numeric
@attribute oily numeric
@attribute olive numeric
@attribute pan numeric
@attribute people numeric
@attribute the numeric
@attribute very numeric
@attribute was numeric

@data

Classifier Model
J48 pruned tree
------------------

crude <= 0: no (4.0/1.0)
crude > 0: yes (2.0)

Number of Leaves  : 	2

Size of the tree : 	3

#load model and test classifier by test_data
> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
> -T data/crude_oil_test.arff -l model.dat -p first-last

=== Predictions on test data ===
inst#  actual  predicted  error  prediction  (document)
    1  1:yes   1:yes             1           ('Oil platforms extract crude oil')
    2  2:no    2:no              0.75        ('Canola oil is supposed to be healthy')
    3  1:yes   2:no       +      0.75        ('Iraq has significant oil reserves')
    4  2:no    2:no              0.75        ('There are different types of cooking oil')

> java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
> -T data/crude_oil_test2.arff -l model.dat -p first-last

=== Predictions on test data ===
inst#  actual  predicted  error  prediction  (document)
    1  1:?     1:yes             1           ('Oil platforms extract crude oil')
    2  1:?     2:no              0.75        ('Canola oil is supposed to be healthy')
    3  1:?     2:no              0.75        ('Iraq has significant oil reserves')
    4  1:?     2:no              0.75        ('There are different types of cooking oil')

######### data/weather.nominal.arff
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

######### data/crude_oil_train.arff
%
% witten-12-mkp-data mining- practical machine learning tools and techniques
% ch17 tutorial exercises for the weka explorer
% ch17.5 document classification
%
%
@relation 'crude_oil_train'
%
@attribute document string
@attribute class {yes,no}
%
@data
'The price of crude oil has increased significantly',yes
'Demand for crude oil outstrips supply',yes
'Some people do not like the flavor of olive oil',no
'The food was very oily',no
'Crude oil is in short supply',yes
'Use a bit of cooking oil in the frying pan',no

######### data/crude_oil_test.arff
%
% witten-12-mkp-data mining- practical machine learning tools and techniques
% ch17 tutorial exercises for the weka explorer
% ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',yes
'Canola oil is supposed to be healthy',no
'Iraq has significant oil reserves',yes
'There are different types of cooking oil',no

######### data/crude_oil_test2.arff
%
% witten-12-mkp-data mining- practical machine learning tools and techniques
% ch17 tutorial exercises for the weka explorer
% ch17.5 document classification
%
%
@relation 'crude_oil_test'
%
@attribute document string
@attribute class {yes,no}
%
@data
'Oil platforms extract crude oil',?
'Canola oil is supposed to be healthy',?
'Iraq has significant oil reserves',?
'There are different types of cooking oil',?
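The J48 tree learned in Example (2) reduces to a single test on the (case-sensitive) token "crude": crude > 0 predicts yes, otherwise no. The following Python sketch is a hand-coded stand-in for the saved FilteredClassifier (not Weka code): it tokenizes a document with the same delimiter set the WordTokenizer was given and applies that one rule.

```python
import re

# Delimiter set passed to WordTokenizer in the run above: space \r \n \t .,;:'"()?!
DELIMS = " \r\n\t.,;:'\"()?!"

def tokenize(doc):
    """Split a document on any run of the delimiter characters."""
    return [t for t in re.split("[" + re.escape(DELIMS) + "]+", doc) if t]

def classify(doc):
    # J48 tree learned above: crude <= 0 -> no, crude > 0 -> yes.
    # Tokens are case-sensitive, so 'Crude' is a different attribute than 'crude'.
    return "yes" if "crude" in tokenize(doc) else "no"

test_docs = [
    "Oil platforms extract crude oil",
    "Canola oil is supposed to be healthy",
    "Iraq has significant oil reserves",
    "There are different types of cooking oil",
]
for d in test_docs:
    print(classify(d), "-", d)
```

These hand predictions match the model's output on crude_oil_test2.arff: yes, no, no, no. Note the case-sensitivity caveat is real: the training sentence 'Crude oil is in short supply' contains no lowercase "crude", which is exactly the one training error (4.0/1.0) recorded at the "no" leaf.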
Monday, December 28, 2015
How to Save and Load a Model in Weka for Training and Testing
Friday, December 4, 2015
summary of graph algorithms
goodrich-15-wiley-data structures & algorithms in java

ch14.1 Graphs
‧A graph consists of vertices and edges.
‧A vertex is also called a node; an edge is also called an arc.
‧Edges are classified by direction into directed edges and undirected edges.
‧A graph containing only directed edges is a directed graph (digraph).
‧A graph containing only undirected edges is an undirected graph.
‧A mixed graph contains both directed and undirected edges.
‧Each edge has two end vertices, also called endpoints.
‧The two endpoints of a directed edge are its origin and its destination.
‧Two vertices are adjacent if they are the two endpoints of some edge.
‧An edge is incident to a vertex if the vertex is one of its endpoints.
‧The outgoing edges of a vertex are all directed edges whose origin is that vertex.
‧The incoming edges of a vertex are all directed edges whose destination is that vertex.
‧The degree of a vertex is the number of edges incident to it,
 subdivided into out-degree and in-degree.
‧The container holding a graph's edges is a collection, not a set,
 meaning two vertices may be joined by more than one directed or undirected edge;
 such edges are called parallel edges or multiple edges.
‧A directed or undirected edge whose two endpoints coincide is a self-loop.
‧A graph with no parallel edges and no self-loops is a simple graph;
 a simple graph can be described by an edge set that forbids duplicates.
‧A path is an alternating sequence of vertices and edges that starts at some
 vertex and ends at some vertex, where each edge's origin is the preceding
 vertex and its destination is the following vertex.
‧A cycle is a path whose start and end are the same vertex and that contains at least one edge.
‧A path is a simple path if all its vertices are distinct.
‧A cycle is a simple cycle if all its vertices are distinct,
 not counting the repeated start/end vertex.
‧A directed path consists entirely of directed edges.
‧A directed cycle consists entirely of directed edges.
‧An acyclic directed graph contains no directed cycle.
‧If a path from vertex u to vertex v exists, u reaches v, or v is reachable from u.
‧Reachability in an undirected graph is symmetric (u reaches v iff v reaches u);
 in a directed graph it need not be.
‧A graph is connected if every pair of vertices is joined by a path.
‧A directed graph is strongly connected if every pair of vertices is joined by
 directed paths in both directions.
‧A subgraph of a graph G is a graph H whose vertices and edges are subsets of
 G's vertices and edges, respectively.
‧A spanning subgraph of G is a subgraph of G that contains all of G's vertices.
‧If G is not connected, its maximal connected subgraphs are the connected components of G.
‧A graph with no cycles is a forest.
‧A tree is a connected forest, i.e., a connected graph with no cycles.
‧A spanning tree of a graph is a spanning subgraph that is itself a tree.

ch14.2 Data Structures for Graphs
‧Four data structures for representing graphs:
 1. edge list
 2. adjacency list
 3. adjacency map  <== used in the textbook; supports finding edges from
    vertices and vertices from edges.
    The graph keeps a linked list of vertices, a linked list of edges,
    and an isDirected flag.
    Each vertex records its element, its position pos in the vertex list,
    an outgoing map <vertex, edge> of outgoing edges, and
    an incoming map <vertex, edge> of incoming edges.
    Note: outgoing and incoming differ only in a directed graph;
    in an undirected graph they are identical.
    Each edge records its element, its position pos in the edge list,
    and its two endpoints.
 4. adjacency matrix

ch14.3 Graph Traversals
‧A graph traversal essentially converts a graph into a tree,
 producing a search tree that answers reachability questions among vertices.
 The difficulty is examining all of the graph's vertices and edges efficiently,
 ideally in time linear in the number of vertices and edges.
‧Reachability problems on undirected graphs:
 1. Given vertices u and v of graph G, find some path from u to v if one exists.
 2. Given vertex u of graph G, find a path from u to every reachable vertex v,
    using the fewest possible edges.
 3. Given graph G, determine whether all vertices are connected.
 4. Given graph G, find some cycle of G if one exists.
 *5. Given a connected graph G, find some spanning tree of G.
 *6. Given graph G, find all connected components (maximal connected subgraphs).
 Note: edge classification for an undirected traversal tree:
   tree edges: discovery edges
   nontree edges: back edges, cross edges
‧Reachability problems on directed graphs:
 1. Given vertices u and v of graph G, find some directed path from u to v if one exists.
 2. Given vertex u of graph G, list all vertices reachable from u.
 3. Given graph G, determine whether G is strongly connected.
 4. Given graph G, determine whether G contains a directed cycle.
 Note: edge classification for a directed traversal tree:
   tree edges: discovery edges
   nontree edges: back edges, forward edges, cross edges
‧The two most basic graph traversals:
 1. depth-first search
 2. breadth-first search

GraphAlgorithms.DFS(g, u, known, forest)
  Builds a depth-first search tree from vertex u of graph g;
  returns the visited vertices in known and their discovery edges in forest.
GraphAlgorithms.DFSComplete(g)
  Returns the depth-first search forest of graph g,
  i.e., the discovery edges of the visited vertices.
GraphAlgorithms.BFS(g, u, known, forest)
  Builds a breadth-first search tree from vertex u of graph g;
  returns the visited vertices in known and their discovery edges in forest.
GraphAlgorithms.BFSComplete(g)
  Returns the breadth-first search forest of graph g,
  i.e., the discovery edges of the visited vertices.
GraphAlgorithms.constructPath(g, u, v, forest)
  Returns the path from vertex u to v in graph g, reconstructed from the
  traversal forest (as the sequence of edges along the path).
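The DFS/constructPath routines summarized above can be sketched in a few lines. This is a minimal Python illustration under assumed conventions (a graph as a dict of adjacency lists, a forest mapping each discovered vertex to its discovery edge); it is not Goodrich's actual Java API.

```python
def dfs(g, u, known=None, forest=None):
    """Depth-first search from u. known: set of visited vertices;
    forest: maps each discovered vertex v to its discovery edge (parent, v)."""
    if known is None:
        known = {u}
    if forest is None:
        forest = {}
    for v in g[u]:
        if v not in known:
            known.add(v)
            forest[v] = (u, v)          # discovery (tree) edge
            dfs(g, v, known, forest)
    return known, forest

def construct_path(u, v, forest):
    """Rebuild the u-to-v path as a list of edges by walking back from v
    along discovery edges; returns [] if v was not reached (or v == u)."""
    if v == u or v not in forest:
        return []
    path, walk = [], v
    while walk != u:
        parent, _ = forest[walk]
        path.append((parent, walk))
        walk = parent
    path.reverse()
    return path

# undirected graph as adjacency lists (each edge listed in both directions)
g = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
     "D": ["B", "C", "E"], "E": ["D"]}
known, forest = dfs(g, "A")
print(sorted(known))                    # every vertex reachable from A
print(construct_path("A", "E", forest)) # discovery edges from A to E
```

With list-based adjacency the traversal order is deterministic: from A the search discovers B, then D, then C and E, so the reconstructed A-to-E path is [("A","B"), ("B","D"), ("D","E")]. BFS differs only in using a queue of frontier vertices instead of recursion, which is what makes its discovery paths edge-minimal (problem 2 above).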