DESCRIPTION
SVMlin is software package for linear SVMs. It is well-suited to classification problems involving a large number of examples and features. It is primarily written for sparse datasets (number of non-zero features in an example is typically small). It is written in C++ (mostly C). A mex wrapper is available for MATLAB users.
SVMlin can also utilize unlabeled data, in addition to labeled examples. It currently implements two extensions of standard SVMs to incorporate unlabeled examples.
For a Reuters text categorization problem with around 804414 labeled examples and 47326 features, SVMlin takes less than two minutes to train a linear SVM on an Intel machine with 3GHz processor and 2GB RAM. Given just 1000 labels, it can utilize the remaining hundreds of thousands of unlabeled examples for training a semi-supervised linear SVM in about 20 minutes. Unlabeled data can be very useful in improving classification performance when labels are relatively few.
SVMlin implements
the following algorithms (See references [1],[2],
[3]):
Fully supervised [using only labeled
examples]
* Linear Regularized Least Squares (RLS) Classification
* Modified Finite Newton Linear L2-SVMs [3]
Semi-supervised [can use unlabeled data as well]
* Multi-switch linear Transductive L2-SVMs
* Deterministic Annealing (DA) for Semi-supervised Linear L2-SVMs
DOWNLOAD SVMlin
The current release is available for download
in the following files:
Conditions for Use: This software is freely available for educational or commercial purposes under the terms of the GNU Public Licence. We expect that all publications describing work using this software quote one of the references given below. If you use supervised linear l2 svms, please cite [3]. For semi-supervised applications, cite [1] or [2].
INSTALLATION
unzip svmlin.zip
tar -xvzf svmlin.tar.gz
This will create a directory called svmlin-v1.0 containing a Makefile and 3 source files: ssl.h, ssl.cpp and svmlin.cpp.
Type
make
This will create an executable
svmlin
This single executable is used for training, testing and performance evaluation.
HOW TO USE
Data Files
Input examples (for training and testing) need to be in the SVM-Light/LIBSVM format [except that there is no first column for labels].
Each line describes an example and is a list of feature index -- value pairs for non-zero features, separated by white space. The line should end with a newline '\n' character.
<feature>:<value> <feature>:<value> ... <feature>:<value>
For example, the following data matrix with 4 examples and 5 features
0 3 0 0 1
4 1 0 0 0
0 5 9 2 0
6 0 0 5 3
is described in the input file as
2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3
The file containing labels is separate since it is routine to use the same inputs with different labels. Each line should contain a label for the corresponding line in the input file with one of the following values:
+1 (labeled positive example)
-1 (labeled negative example)
0 (unlabeled examples)
The implementation can only handle binary classification in the current version.
Training
svmlin [options] training_examples training_labels
The learnt weight vector is saved in training_examples.weights.
The outputs of the weight vector on the training examples are saved in training_examples.outputs.
Testing
Type
svmlin -f training_examples.weights test_examples_filename
Test examples should be in the same format as the training examples.
The weight vector in training_examples.weights is used for predictions.
The predicted soft outputs are saved in test_examples.outputs.
Evaluation
If the labels for the test examples are known, Typing
svmlin -f weights_filename test_examples_filename test_labels_filename
You can use the PERF software to compute other performance metrics.
Options
Typing
svmlin
Usage
Train: svmlin [options] training_examples training_labels
Test: svmlin -f weights_file test_examples
Evaluate: svmlin -f weights_file test_examples test_labels
Options:
-A algorithm : set algorithm (default 1)
0 -- Regularized Least Squares (RLS) Classification
1 -- SVM (L2-SVM-MFN) (default choice)
2 -- Multi-switch Transductive SVM (using L2-SVM-MFN)
3 -- Deterministic Annealing Semi-supervised SVM (using L2-SVM-MFN)
-W regularization parameter lambda (default 1)
-U regularization parameter lambda_u (default 1)
-S maximum number of switches in TSVM (default 10000)
-R positive class fraction of unlabeled data (default 0.5)
-f weights_filename (Test Mode: input_filename is the test file)
-Cp relative cost for positive examples (default 1; only available with -A 1)
-Cn relative cost for negative examples (default 1; only available with -A 1)
unzip example.zip
tar -xvzf example.tar.gz
training_labels specifies 50 labels among 1460 examples (unlabeled examples are labeled "0").
For training the multi-switch Transductive SVM run,
svmlin -A 2 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
svmlin -f training_examples.weights example/test_examples example/test_labelsOn this dataset (a subset of 20 newgroups), multi-switch TSVM gives an accuracy of 92.8%. To run the deterministic annealing semi-supervised SVM, run,
svmlin -A 3 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
The options -A 0 or -A 1 ignores the unlabeled examples (if any) and performs supervised learning.
MATLAB Interface
SVMlin can be run through MATLAB using the mex interface implemented in
mex svmlin_mex.cpp ssl.o
help svmlin
RESEARCH ON LARGE SCALE SEMI-SUPERVISED LEARNING
Datasets
| Name |
examples |
features |
average
sparsity* |
original source |
| aut-avn real-sim ccat gcat 33-36 pcmac full-ccat full-gcat kdd99 SecStr |
71175 72309 23119 23119 82692 1946 804414 804414 4898431 1273151 |
20707 20958 47236 47236 59072 7511 47236 47236 127 315 |
51.32 51.32 75.93 75.93 26.56 54.58 76.7 76.7 17.3 15 |
SRAA SRAA RCV1-v2 RCV1-v2 Yahoo! 20 newsgroups RCV1-v2 RCV1-v2 KDD Cup 1999 SSL Book |
[1]
Large Scale
Semi-supervised
Linear SVMs
V. Sindhwani, S. S. Keerthi
SIGIR 2006
ps
pdf
link
[2] Newton
Methods for Fast Solution of Semi-supervised Linear SVMs
V.
Sindhwani, S.S. Keerthi
Large Scale Kernel
Machines, MIT Press, 2005
ps
pdf
The following are references on supervised primal methods:
Support Vector Machines in the Primal
HISTORY
v1.0 -- August 2006 release
COMMENTS/BUG REPORTS
Please
send me an
email at vikass at cs dot uchicago dot edu.
ACKNOWLEDMENTS
I thank
Partha Niyogi and Sathiya
Keerthi for support.
Sumeet Chawla
and Chih-Jen Lin
provided helpful comments.
I was inspired from LIBSVM and SVM-Light in the organization
of the code.