Clustering Song Data

Bill Sethares gave this data to Misha Belkin, who forwarded it to me. To quote Bill:

"I've generated a bunch of data vectors... there are
14 songs
6 features
5 selections from each song (i.e., data gathered from 5 random spots within each song)

"Each of the vectors has 48 dimensions and is labelled songXfYselZ, where X, Y, and Z are the appropriate song, feature, and selection number.

"So -- hopefully we can see what kinds of clustering occur... Presumably/hopefully
(1) the various selections from each song will cluster together...
(2) most likely, some of the features will do a good job (crossing fingers) and some of the features will not... there are many other possible things to measure.
(3) hopefully also certain of the songs will cluster together with others. I have some pretty clear ideas on what *should* cluster with what.

"I won't tell you which song is which, so that preconceived notions can't pollute any results."

The raw data is here (processed with getmat.pl and loadmat.m).
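To make the labelling scheme concrete, here is a minimal Python sketch that splits a songXfYselZ label into its three numbers. The regular expression, the function name parse_label and the example labels are my own assumptions for illustration; this is not how getmat.pl or loadmat.m actually read the data.

```python
import re

# Assumed label layout, e.g. "song3f2sel5" (illustrative pattern only).
LABEL_RE = re.compile(r"^song(\d+)f(\d+)sel(\d+)$")

def parse_label(label):
    """Return (song, feature, selection) as integers."""
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognized label: {label!r}")
    return tuple(int(g) for g in m.groups())

# Generate every label implied by 14 songs x 6 features x 5 selections.
labels = [f"song{s}f{f}sel{z}"
          for s in range(1, 15)
          for f in range(1, 7)
          for z in range(1, 6)]
parsed = [parse_label(label) for label in labels]
print(len(parsed), parsed[0], parsed[-1])
```

The count of generated labels is 14 x 6 x 5 = 420, which is the number of rows in the eigenvector matrices.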
The Laplacian algorithm with heat kernel weights and 10 nearest neighbors was then applied. The smallest nonzero eigenvalues are listed below; the corresponding eigenvectors are in a 420 x 10 matrix, sfs10.dat.

0.0019 0.2277 0.2361 0.2416 0.3300 0.3361 0.3658 0.3850 0.4230 0.4281

If we normalize the data first, then the first 10 eigenvalues are as below. The corresponding eigenvectors are in sfs10norm.dat.

0.0001 0.1068 0.2366 0.2533 0.2971 0.3445 0.3646 0.3862 0.4044 0.4313
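For readers who want to try this kind of reduction themselves, here is a rough Python sketch of a Laplacian eigenmap with heat kernel weights on a 10-nearest-neighbor graph, plus the per-dimension z-score normalization. It is not the code that produced sfs10.dat: the kernel width t (set here to the median squared distance), the way the neighbor graph is symmetrized, the use of the degree-normalized problem, and the random stand-in data are all assumptions, so the eigenvalues will not match the numbers above.

```python
import numpy as np

def zscore(X):
    """Subtract each column's mean and divide by its stdev where nonzero."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # constant columns are centered but not rescaled
    return (X - mu) / sd

def laplacian_eigenmaps(X, n_neighbors=10, n_components=10, t=None):
    """Return (eigenvalues, eigenvectors) after skipping the first eigenpair."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    if t is None:
        t = np.median(d2[d2 > 0])  # assumed heuristic for the kernel width
    # Indices of the k nearest neighbors of each point, excluding itself.
    np.fill_diagonal(d2, np.inf)
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    # Heat kernel weights on the neighbor edges only.
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / t)
    W = np.maximum(W, W.T)  # keep an edge if either point lists the other
    # Solve L f = lambda D f via the symmetric form D^-1/2 L D^-1/2.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Drop the first eigenpair (eigenvalue 0 when the graph is connected).
    keep = slice(1, n_components + 1)
    return vals[keep], d_inv_sqrt[:, None] * vecs[:, keep]

# Random stand-in for the 420 x 48 song data (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(420, 48))
vals, vecs = laplacian_eigenmaps(zscore(X))
print(vecs.shape)
```

Skipping exactly one eigenpair assumes the neighbor graph is connected; a disconnected graph has several zero eigenvalues, and the leading components then only indicate which connected piece a point belongs to.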
The first three components appear to clump the data in two orthogonal and separate lines. One line has the feature=2 points on one half and the feature=3 points on the other half.

Partiview is a very useful program for visualizing data in 3D. It's written and maintained primarily by Stuart Levy at NCSA. Updates are also regularly made by various people, including folks at the American Museum of Natural History, from where you can download it for free. It has two main filetypes, which have extensions .cf and .speck. In Windows, it's helpful to associate .cf files with Partiview.

Partiview files were made with the customized matlab script makespeck_sethares.m. The files are zipped into the file lapsong.zip. You can view it with

    partiview lapsong.cf

The data is divided into groups 123, 234, 345 and 456. For example, 234 has the second, third and fourth Laplacian components of the data. Typing the following in the command window turns on the labels for group g2 = 234 (assuming the group itself is on!):

    g2 labels on

To turn them off, type

    g2 labels off

You can change the coloring of group g4 = 456 with one of

    g4 color selection
    g4 color song
    g4 color feature

to color by selection, song or feature respectively. The best clustering I can see here is that of features 2 and 3, whatever "feature" actually means.

You can change the size of the axes (red=x, green=y, blue=z) with

    censize 1000

replacing 1000 by whatever you want.

I also did the computation after normalizing each of the 48 dimensions (i.e. using z values: subtracting the mean in that dimension and then dividing by the stdev if nonzero) before applying the Laplacian reduction; Partiview files for that are in lapnormsong.zip.

You can also display only the points with particular song/feature/selection numbers. It might be helpful to first turn labels off ("gx labels off", where x is the group number), since the following commands only work on points, not their labels.
Here are some examples. (You can, as always, leave out the initial group name if it's the same group to which you applied the last such command.)

    g3 selection 2 5

This shows only g3 points that have selection number between 2 and 5 (i.e. 2, 3, 4, 5).

    g2 song <3

This shows only g2 points with song number at most (AND INCLUDING) 3, i.e. 1, 2, 3.

    g4 feature >4

This shows g4 points with feature number at least (and including) 4.

You can also use the 'only=' command.

    g1 only= selection 1-3 5

This shows points with selection value 1, 2, 3 or 5. For more commands, see section 4.5 in the BIMA manual.

Data analysis conclusions

Song: Not captured by the first three Laplacian components. Something song-related is being captured by the next three components, but there's too much noise to say more. I'm not sure if these are being analyzed in the right way for a song-classification problem. Each song should probably be represented by a single vector made of five selections with features 1, 2 and 3.

Feature: These are captured a lot better, whatever they are.
Features 1 and 2 occupy two orthogonal planes in Laplacian456 space. Component 4 captures feature 2 very well while remaining nearly zero for feature 1.
Features 1 and 2 occupy two non-intersecting planes (almost lines) in Laplacian123 space.
Feature 3 is on the same component as feature 2 in Laplacian123 space and in Laplacian456 space. However, feature 3 occupies a far larger superset of the plane it shares with feature 2 in Laplacian456 space, so further dimensions will probably separate them.
Feature 4 is quite mixed up with feature 1 in the first six dimensions, though not completely (e.g. in Laplacian234 space). Ditto feature 5.

Selection: Mostly mixed up (a good thing). There is some separation (e.g. of selections 1 and 5) in Laplacian456 space (especially components 4 and 6).

The picture below shows the 420 data points, colored by feature, in the 4th/5th/6th Laplacian component space.
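The Song conclusion above suggests representing each song by a single vector built from its five selections and features 1, 2 and 3. I have not run that experiment; the following Python sketch only shows one way such vectors could be assembled, by simple concatenation. The dictionary layout, the function name song_vectors and the random stand-in data are all hypothetical.

```python
import numpy as np

def song_vectors(data, features=(1, 2, 3)):
    """Stack each song's chosen features and selections into one long vector.

    data maps (song, feature, selection) -> 1-D array (48 numbers here).
    """
    songs = sorted({s for s, _, _ in data})
    selections = sorted({z for _, _, z in data})
    rows = []
    for s in songs:
        parts = [data[(s, f, z)] for f in features for z in selections]
        rows.append(np.concatenate(parts))
    return songs, np.vstack(rows)

# Random stand-in for the real vectors: 14 songs x 6 features x 5 selections.
rng = np.random.default_rng(0)
data = {(s, f, z): rng.normal(size=48)
        for s in range(1, 15)
        for f in range(1, 7)
        for z in range(1, 6)}
songs, S = song_vectors(data)
print(S.shape)
```

Each song then becomes one row of 3 x 5 x 48 = 720 numbers. Since the selections come from random spots within a song, their order in the concatenation is arbitrary, so averaging over selections instead might be worth trying.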