NAFLA

Download : NAFLA.zip (updated 5 Apr 2007)

NAFLA is a C++ library for fast, scalable multiclass classification. Tested so far on Linux and Cygwin. (For now all algorithms are based on a linear SVM; Boosted Naive Bayes, Boosted Decision Stumps, etc. will be added later.) To install, just unzip and type make in the NAFLA directory.

Quick documentation

Data is in LIBSVM format. (This example has 3-class, 5-dimensional data with classes 10, 20, and 30.)

Training : ./learn2classify traindata modelfile

Testing : ./classify datafile modelfile resultsfile

Results can be found in resultsfile, resultsfile.tex, resultsfile.prob and resultsfile.pred.

The model is written to modelfile.


Full Documentation

Data format

Same as LIBSVM: each line of a train/test file is of the form below, i.e. a class label followed by sparse dimension:value pairs.

The example below has n nonzero dimensions. (At most n, in fact, since any vi could be 0.)

    class d1:v1  d2:v2 ... dn:vn

class is an integer (NOT 0!) representing the class of this example, and 0 < d1 < d2 < ... < dn <= D, where D is the dimensionality of the data. (D is worked out automatically from the training examples.)

Training

       ./learn2classify traindata modelfile [-1v1 | -1vr]   [-scale | -noscale] [-inv | -noinv]

The defaults are -1v1, -noscale, and -noinv, i.e. a 1-versus-1 ensemble of C(C-1)/2 binary classifiers (where C is the number of classes), with no scaling and examples equally weighted.

For now, each binary classifier is a linear SVM (a Least Squares SVM trained with the algorithm of Keerthi and DeCoste, 2005).

Testing

       ./classify datafile modelfile resultsfile [-raw]

This outputs the files resultsfile, resultsfile.tex, resultsfile.prob, and resultsfile.pred (C is the number of classes).


Example 1 (NAFLA is faster than LIBSVM)

A 5-class Mandarin tone classification problem from some early work in my doctoral thesis (the primary motivation for writing NAFLA, incidentally). Each data point is a syllable with 27 features based on its duration and pitch. There are 122,397 training examples and 40,798 test examples. This is a very hard problem, since these are not the full set of useful features; although the baseline is 33%, any result above 50% is decent.

This zip file has the training data tone_pitchdur_train.txt and the test data tone_pitchdur_test.txt.

./learn2classify tone_pitchdur_train.txt mymodel       
./classify tone_pitchdur_test.txt mymodel tone_predictions

Training takes three minutes with NAFLA, while LIBSVM takes at least an hour (it's still running as I type this).

This produces the files tone_predictions (whose contents are shown below), tone_predictions.tex (which can be run through latex, dvips, and ps2pdf to produce a PDF), tone_predictions.prob, and tone_predictions.pred.

23001 out of 40798 correctly classified (Classification Accuracy = 0.563778)

            RMSProbErr  AveProbCorr  Precision  Recall    F score

class  1  : 0.6301       0.4157       0.5592     0.5618    0.5605
class  2  : 0.6206       0.4180       0.6522     0.5345    0.5875
class  3  : 0.7560       0.2626       0.3013     0.5261    0.3831
class  4  : 0.5552       0.5032       0.6907     0.6005    0.6425
class  5  : 0.8436       0.1649       0.1474     0.4791    0.2254

mean      : 0.6811       0.3529       0.4702     0.5404    0.4798

overall   : 0.6393       0.4082     

Confusion Matrix (counts) 

   5178    1723     190    2106      62 
   1516    6520     428    1444      89 
    511    1618    1723    1756     111 
   1859    1523     608    9214     136 
    153     815     326     823     366 



Confusion Matrix (probabilities; sum of each row is 1) 

 0.5592  0.1861  0.0205  0.2275  0.0067 
 0.1516  0.6522  0.0428  0.1444  0.0089 
 0.0894  0.2829  0.3013  0.3070  0.0194 
 0.1394  0.1142  0.0456  0.6907  0.0102 
 0.0616  0.3282  0.1313  0.3315  0.1474 

Example 2 (NAFLA is slower than LIBSVM)

10-class USPS data: 4649 training and 4649 testing 256-dimensional examples from the GPML site. This 21Mb zip file has two text files, usps_gpml_tr.txt with training examples and usps_gpml_te.txt with testing examples.

On this dataset, LIBSVM is five times faster, taking one minute instead of five (LIBSVM prediction file; use readprobest.m to read it). There are two main reasons for this, both to do with the basic binary linear SVM algorithms used by the two packages.

  1. LIBSVM uses an N x N Gram matrix (where N is the number of training points), which is fine as long as N is small - as it is for this dataset.
  2. NAFLA uses LSSVM instead of L2SVM (see the Keerthi-DeCoste paper). L2SVM is much faster than LSSVM when there are few support vectors (i.e. when the classification problem is easy, as is the case here). We will implement L2SVM shortly to take care of this.

Performance Metric    NAFLA    LIBSVM
Accuracy              0.956    0.954
RMS Error             0.227    0.230
Ave Prob Correct      0.895    0.890

      ./learn2classify usps_gpml_tr.txt mymodel       
      ./classify usps_gpml_te.txt mymodel uspsresult

This produces the files uspsresult (whose contents are shown below), uspsresult.tex (which can be run through latex, dvips, and ps2pdf to produce a PDF), uspsresult.prob, and uspsresult.pred. The contents of uspsresult are also printed to screen.

4443 out of 4649 correctly classified (Classification Accuracy = 0.955689)

            RMSProbErr  AveProbCorr  Precision  Recall    F score

class  1  : 0.1765       0.9414       0.9720     0.9782    0.9751
class  2  : 0.1086       0.9699       0.9892     0.9938    0.9915
class  3  : 0.2704       0.8699       0.9339     0.9381    0.9360
class  4  : 0.2843       0.8525       0.9354     0.9332    0.9343
class  5  : 0.2406       0.8737       0.9503     0.9232    0.9366
class  6  : 0.2730       0.8507       0.9352     0.9379    0.9365
class  7  : 0.2231       0.8965       0.9565     0.9565    0.9565
class  8  : 0.2025       0.8898       0.9701     0.9582    0.9642
class  9  : 0.2892       0.8473       0.9215     0.9242    0.9228
class 10  : 0.2400       0.8656       0.9524     0.9694    0.9608

mean      : 0.2308       0.8857       0.9517     0.9513    0.9514

overall   : 0.2267       0.8954     

Confusion Matrix (counts) 

    764       0       5       0       3       1       4       1       8       0 
      0     640       0       0       3       0       3       0       1       0 
      2       1     424       9       6       2       3       1       5       1 
      2       0       4     391       0      14       0       2       5       0 
      1       2       7       0     421       3       2       3       0       4 
      0       0       1       8       6     332       5       0       1       2 
      7       0       5       0       4       0     396       0       2       0 
      0       0       0       2       4       0       0     390       2       4 
      5       1       5       8       2       2       1       1     305       1 
      0       0       1       1       7       0       0       9       1     380 



Confusion Matrix (probabilities; sum of each row is 1) 

 0.9720  0.0000  0.0064  0.0000  0.0038  0.0013  0.0051  0.0013  0.0102  0.0000 
 0.0000  0.9892  0.0000  0.0000  0.0046  0.0000  0.0046  0.0000  0.0015  0.0000 
 0.0044  0.0022  0.9339  0.0198  0.0132  0.0044  0.0066  0.0022  0.0110  0.0022 
 0.0048  0.0000  0.0096  0.9354  0.0000  0.0335  0.0000  0.0048  0.0120  0.0000 
 0.0023  0.0045  0.0158  0.0000  0.9503  0.0068  0.0045  0.0068  0.0000  0.0090 
 0.0000  0.0000  0.0028  0.0225  0.0169  0.9352  0.0141  0.0000  0.0028  0.0056 
 0.0169  0.0000  0.0121  0.0000  0.0097  0.0000  0.9565  0.0000  0.0048  0.0000 
 0.0000  0.0000  0.0000  0.0050  0.0100  0.0000  0.0000  0.9701  0.0050  0.0100 
 0.0151  0.0030  0.0151  0.0242  0.0060  0.0060  0.0030  0.0030  0.9215  0.0030 
 0.0000  0.0000  0.0025  0.0025  0.0175  0.0000  0.0000  0.0226  0.0025  0.9524 

   

Dinoj Surendran
Last modified: Thu Apr 5 22:38:13 CDT 2007