Download : NAFLA.zip (updated 5 Apr 2007)
NAFLA is a C++ library for fast scalable multiclassification. Tested so far on Linux and Cygwin. (For now all algorithms are based on a Linear SVM... later Boosted Naive Bayes and Boosted Decision Stumps, etc will be added.) To install, just unzip and type make in the NAFLA directory.
Data is in LIBSVM format (This example has 3-class 5-dimensional data with classes 10, 20, 30.).
Training : ./learn2classify traindata modelfile
Testing : ./classify datafile modelfile resultsfile
Results can be found in resultsfile, resultsfile.tex, resultsfile.prob and resultsfile.pred.
The model files produced are modelfile
Same as LIBSVM: Each line of a train/test set is of the form below, which is the same format as LIBSVM i.e. sparse dimension:value pairs.
This example has n nonzero dimensions. (At most n, since any vi could be 0.)
class d1:v1 d2:v2 ... dn:vn
class is an integer (NOT 0!) representing the class of this example. 0 < d1 < d2 ... < dn <= D
Where D is the dimension of the data. (D is worked out automatically from the training examples.)
./learn2classify traindata modelfile [-1v1 | -1vr] [-scale | -noscale] [-inv | -noinv]
The defaults are -1v1, -noscale, and -noinv i.e. using a 1-versus-1 ensemble of C(C-1)/2 binary classifiers with no scaling and examples equally weighted.
For now, each binary classifier is a Linear SVM (Least Squares SVM using the algorithm by Keerthi and DeCoste 2005.
./classify datafile modelfile resultsfile [-raw]
This outputs the following files ; C is the number of classes.
After that, each line has 1+C entries; the n-th line has prediction details for the n-th test data point.
5-class Mandarin tone classification problem from some early work in my doctoral thesis (the primary motivation for writing NAFLA, by the way). Each data point is a syllable with 27 features based on its duration and pitch. There are 122 397 training examples and 40 798 test examples. This is a very hard problem since this is not the full set of useful features; although the baseline is 33%, any result above 50% is decent.
This zip file has the training data tone_pitchdur_train.txt and tone_pitchdur_test.txt.
./learn2classify tone_pitchdur_train.txt mymodel ./classify tone_pitchdur_test.txt mymodel tone_predictions
Training takes three minutes with NAFLA, while LIBSVM takes at least an hour (it's still running as I type this).
This produces the files tone_predictions (whose contents are shown below) tone_predictions.tex (which can be LaTeXed and dvipsed and ps2pdfed to produce this PDF file), tone_predictions.prob, and tone_predictions.pred.
23001 out of 40798 correctly classified (Classification Accuracy = 0.563778)
RMSProbErr AveProbCorr Precision Recall F score
class 1 : 0.6301 0.4157 0.5592 0.5618 0.5605
class 2 : 0.6206 0.4180 0.6522 0.5345 0.5875
class 3 : 0.7560 0.2626 0.3013 0.5261 0.3831
class 4 : 0.5552 0.5032 0.6907 0.6005 0.6425
class 5 : 0.8436 0.1649 0.1474 0.4791 0.2254
mean : 0.6811 0.3529 0.4702 0.5404 0.4798
overall : 0.6393 0.4082
Confusion Matrix (counts)
5178 1723 190 2106 62
1516 6520 428 1444 89
511 1618 1723 1756 111
1859 1523 608 9214 136
153 815 326 823 366
Confusion Matrix (probabilities; sum of each row is 1)
0.5592 0.1861 0.0205 0.2275 0.0067
0.1516 0.6522 0.0428 0.1444 0.0089
0.0894 0.2829 0.3013 0.3070 0.0194
0.1394 0.1142 0.0456 0.6907 0.0102
0.0616 0.3282 0.1313 0.3315 0.1474
10-class USPS data, 4649 training and 4649 testing 256-dimensional examples from the GPML site. : This 21Mb zip file has two text files usps_gpml_tr.txt with training examples and usps_gpml_te.txt with testing examples.
On this dataset, LIBSVM (LIBSVM prediction file, use readprobest.m) is five times faster, taking one minute instead of five. There are two main reasons for this, both to do with the basic binary classification linear SVM algorithms used by the two algorithm.
| Performance Metric | NAFLA | LIBSVM |
| Accuracy | 0.956 | 0.954 |
| RMS Error | 0.227 | 0.230 |
| Ave Prob Correct | 0.895 | 0.890 |
./learn2classify usps_gpml_tr.txt mymodel
./classify usps_gpml_te.txt mymodel uspsresult
This produces the files uspsresult (whose contents are shown below) uspsresult.tex (which can be LaTeXed and dvipsed and ps2pdfed to produce this PDF file), uspsresult.prob, and uspsresult.pred. The contents of uspsresult are also output to screen.
4443 out of 4649 correctly classified (Classification Accuracy = 0.955689)
RMSProbErr AveProbCorr Precision Recall F score
class 1 : 0.1765 0.9414 0.9720 0.9782 0.9751
class 2 : 0.1086 0.9699 0.9892 0.9938 0.9915
class 3 : 0.2704 0.8699 0.9339 0.9381 0.9360
class 4 : 0.2843 0.8525 0.9354 0.9332 0.9343
class 5 : 0.2406 0.8737 0.9503 0.9232 0.9366
class 6 : 0.2730 0.8507 0.9352 0.9379 0.9365
class 7 : 0.2231 0.8965 0.9565 0.9565 0.9565
class 8 : 0.2025 0.8898 0.9701 0.9582 0.9642
class 9 : 0.2892 0.8473 0.9215 0.9242 0.9228
class 10 : 0.2400 0.8656 0.9524 0.9694 0.9608
mean : 0.2308 0.8857 0.9517 0.9513 0.9514
overall : 0.2267 0.8954
Confusion Matrix (counts)
764 0 5 0 3 1 4 1 8 0
0 640 0 0 3 0 3 0 1 0
2 1 424 9 6 2 3 1 5 1
2 0 4 391 0 14 0 2 5 0
1 2 7 0 421 3 2 3 0 4
0 0 1 8 6 332 5 0 1 2
7 0 5 0 4 0 396 0 2 0
0 0 0 2 4 0 0 390 2 4
5 1 5 8 2 2 1 1 305 1
0 0 1 1 7 0 0 9 1 380
Confusion Matrix (probabilities; sum of each row is 1)
0.9720 0.0000 0.0064 0.0000 0.0038 0.0013 0.0051 0.0013 0.0102 0.0000
0.0000 0.9892 0.0000 0.0000 0.0046 0.0000 0.0046 0.0000 0.0015 0.0000
0.0044 0.0022 0.9339 0.0198 0.0132 0.0044 0.0066 0.0022 0.0110 0.0022
0.0048 0.0000 0.0096 0.9354 0.0000 0.0335 0.0000 0.0048 0.0120 0.0000
0.0023 0.0045 0.0158 0.0000 0.9503 0.0068 0.0045 0.0068 0.0000 0.0090
0.0000 0.0000 0.0028 0.0225 0.0169 0.9352 0.0141 0.0000 0.0028 0.0056
0.0169 0.0000 0.0121 0.0000 0.0097 0.0000 0.9565 0.0000 0.0048 0.0000
0.0000 0.0000 0.0000 0.0050 0.0100 0.0000 0.0000 0.9701 0.0050 0.0100
0.0151 0.0030 0.0151 0.0242 0.0060 0.0060 0.0030 0.0030 0.9215 0.0030
0.0000 0.0000 0.0025 0.0025 0.0175 0.0000 0.0000 0.0226 0.0025 0.9524