What is Linguistica?
Linguistica is a program designed to explore the unsupervised
learning of natural language, with primary focus on morphology
(word structure). It runs under Windows, Mac OS X and Linux, and is
written in C++ within the Qt development framework. Its demands on memory depend
on the size of the corpus analyzed.
Unsupervised learning refers to the computational task of making inferences (and therefore acquiring knowledge) about the structure that lies behind some set of data, without any direct access to that structure. In the case of unsupervised learning of morphology, Linguistica explores the possibilities of morpheme-combinations for a set of words, based on no internal knowledge of the language from which the words are drawn.
Segmentation is the first task of this process; the program figures out where the morpheme boundaries are in the words, and then decides what the stems are, what the suffixes are, and so forth. Most of Linguistica's functionality, at this point, goes into making these decisions.
This document presents a brief description of how to use the Linguistica program, and it also provides links to other documents that explain the ideas incorporated in Linguistica. After reading it, you should be able to use the Linguistica program and get useful results; it is not meant to be a technical document on all the inner workings of the program. For a technical description of the Minimum Description Length model that motivates this work, see the article Unsupervised Learning of the Morphology of a Natural Language, published in Computational Linguistics 27:2 (2001), pp. 153-198.
Linguistica relies on quality text files in an interesting language. If you don't have any, there are some good links here.
This document is likely incomplete. I am more than pleased to answer inquiries by email. The details of this documentation will change frequently as I modify the program. I'll do my best to keep it up-to-date, but there will undoubtedly be slippage from time to time.
The Linguistica main window is divided into three areas: on the left is the Tree area, on the upper right is the Collections area, and below that is the Text area.
Most of the user input is done
with the drop-down menus. The program's
output will be displayed in the Tree, Collections, and Text areas. The
Tree area shows general information about the given corpus and works in
tandem with the Collections area, which displays more detailed
information about the corpus. The Text area gives specific examples
from the corpus itself, relating to the highlighted type in the
Collections area.
The other major source of
information for the user
comes from the optional Log File. If you activate Log Filing,
then the program will write a detailed description of its operations to
a text file, at a location specified by you. The Log File must be
turned on for each individual operation that you want logged. To
begin logging, go to Log File : New
Log File. A dialogue box will appear, where you can save a new
log file (in text file form, .txt) or you can open up an old one.
Select or type in the name of your log file, and click OK. Then go to Log File : Logging Enabled. This
option should now be checked in the menu, and your analysis will be
saved to the text file.
The first operation is to read in a corpus from some language. The default setting for Linguistica's corpus input is 10,000 words: this is the number of words, from the beginning of the corpus, that the program will read. If you wish to change this setting, click on Words requested in the Tree area on the left. A dialogue box will appear where you can specify a different number of words to be read from the corpus. This number refers to the total number of words (tokens) read; it does not refer to word types.
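The distinction between tokens and types can be illustrated with a short Python sketch. It is not Linguistica's own reader: the function names and the whitespace-only tokenization are assumptions made for illustration, and the program's actual handling of punctuation and case may differ.

```python
from collections import Counter
from itertools import islice


def read_tokens(lines, words_requested=10_000):
    """Return the first `words_requested` whitespace-separated tokens.

    Hypothetical helper: splitting on whitespace is an assumption,
    not a description of how Linguistica tokenizes a corpus.
    """
    tokens = (tok for line in lines for tok in line.split())
    return list(islice(tokens, words_requested))


def count_types(tokens):
    """Map each distinct word (type) to its corpus count."""
    return Counter(tokens)


sample = ["the dog saw the cat", "the cat ran"]
tokens = read_tokens(sample, words_requested=6)
types = count_types(tokens)
print(len(tokens), len(types))  # 6 4
```

Here six tokens are read, but they contain only four types, because "the" occurs three times. The "Words requested" setting limits the first number, not the second.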
To open a corpus, go to File : New
Corpus (or press Ctrl-N). A
dialogue box will appear, allowing you to find the desired corpus on
your hard drive. Find it and click Open.
The corpus must be a
standard or Unicode text file (.txt); it cannot be a word processing
file from Microsoft Word (.doc). If you have already run Linguistica,
and you previously read in a corpus, the program will remember the
location of the file, and you can simply select File : Reread corpus (Ctrl-D) to reread the same corpus.
When reading is complete, Words
will appear in the Tree area on the left
side of the screen, right under Lexicon.
The Lexicon will hold a range
of collections, including Words,
Stems, Suffixes, and Signatures. When these collections
are empty (as they are now), they do
not appear in the Tree area. You may click on Words, and the unique
words of the corpus will appear as a list in the Collections area.
Clicking on a collection in the Tree area will allow you to see the
contents of the collection in the Collections area.
The width of the columns in the Collections area may be too small or too
large for your purposes. You can change the widths of the columns by
grabbing an edge at the top of the columns with the cursor and moving
it to the left or the right. You can also sort by any of the column
values by clicking on the title of that column. For example,
clicking on the "Corpus Count" column brings
the most frequent words to the top of the column. You can return to an
alphabetical display of the words by clicking on the top of the first
column, "Words". If you wish to see the words organized into a trie,
you can click on the "Forward trie" line, which is under "Lexicon"
in the Tree area.
Finding a Suffixal System: Signature Based Analysis
Let us suppose you have chosen a language, such as English, in which you wish to discover the suffixal system. On the menu, go to Suffixes : Run All (Keyboard shortcut: Ctrl-S). We will return to other such actions you may take; for now, let's look at what results you may obtain if you perform this operation. You may see something like this in the Tree Area:
In the Lexicon, then, we have 16 suffixes, and 54 signatures built up out of them, along with 571 stems. Of the 2,463 distinct words read, 711 were analyzed, meaning the rest were determined to be mono-morphemic by the program. You should click consecutively on each of these groups, and see that they are displayed in the Collections area on the right as you do so. When the collections get large, it may take a while to display a collection (as much as 10 seconds or more if there are more than 5,000 members).
To save an analysis to a file
that you can open in a text editor
or a spreadsheet, go to File : Save. A
dialogue box will appear, where you can select a folder and give a name to
your project (e.g., WarrenCommissionReport_50K). Linguistica will save
a set of about 12 different text files: a list of words, stems,
affixes, signatures, etc. Most of the information
in them is
transparent or yet to be implemented. There
is also the option to Save broken
corpus under the File
menu. This saves a version of
your corpus with the appended name "[your corpus]_Broken.txt". Inside
this text file, you will find the original corpus, but with two spaces
between each word, and a plus sign at each morpheme break that
Linguistica has determined.
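The layout of the broken corpus described above can be mimicked in a few lines of Python. This is only a sketch of the output format: the function name and the dictionary of segmentations are hypothetical, and whether Linguistica puts spaces around the plus sign is not stated here, so none are used.

```python
def break_corpus(tokens, segmentation):
    """Join tokens with two spaces, marking morpheme breaks with '+'.

    `segmentation` is a hypothetical mapping from a word to its list of
    morphemes; words missing from it are written out unanalyzed.
    """
    pieces = ["+".join(segmentation.get(tok, [tok])) for tok in tokens]
    return "  ".join(pieces)


segmentation = {"walked": ["walk", "ed"], "jumping": ["jump", "ing"]}
line = break_corpus(["she", "walked", "jumping"], segmentation)
print(line)  # she  walk+ed  jump+ing
```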
Returning to the signature-based analysis: if you select Signatures (underneath Suffixes in the Tree area), the signatures will be displayed in the Collections area. You can then click on a signature in the first column, and the stems associated with that signature will be displayed in the Text area below.
By default, signatures are ranked by their robustness, which is roughly the number of letters saved by the analysis, compared with the total number of letters in the original analyzed corpus. It is a measure of how efficiently the analysis has generalized the data. The signatures can be re-sorted by clicking on the header at the top of various columns of the display. Remarks gives an indication of which function was responsible for the identification of the signature.
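One way to read "letters saved" is to compare the letters needed to spell out every stem-plus-suffix word in full with the letters needed to list each stem and each suffix just once. The Python sketch below works through that arithmetic. It is an interpretation of the description above, not necessarily the exact formula the program uses, and the function name is hypothetical.

```python
def letters_saved(stems, suffixes):
    """Letters saved by listing stems and suffixes once each.

    Assumption: every stem combines with every suffix of the signature,
    and a null suffix is represented by the empty string.
    """
    sum_stems = sum(len(s) for s in stems)
    sum_suffixes = sum(len(s) for s in suffixes)
    # Every word written out in full: each stem appears once per suffix,
    # and each suffix appears once per stem.
    spelled_out = len(suffixes) * sum_stems + len(stems) * sum_suffixes
    # The factored analysis lists each stem and each suffix only once.
    factored = sum_stems + sum_suffixes
    return spelled_out - factored


stems = ["walk", "jump", "talk"]
suffixes = ["ed", "ing", "s"]
saved = letters_saved(stems, suffixes)
print(saved)  # 36
```

In this toy signature the nine full words take 54 letters, while the three stems and three suffixes take 18, so 36 letters are saved. A signature with many long stems and several suffixes therefore scores much higher than one attested by a single short stem.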
Now, you can
click successively on the rest of the items in the
Suffixes menu: (1) Successor frequencies, (2) Check signatures, (3)
Known stems and signatures, (4) From sigs: find stems, and (5) Find
singleton signatures. When you select Suffixes
: Run All (Ctrl-S), as
before, the Linguistica program actually goes through these five
analytic tools, some of them multiple times. This will probably provide
you with the best, most comprehensive analysis of your corpus, but
these other commands allow you to fine tune your analysis, as well as
giving a better understanding of what exactly Linguistica is doing
in the background.
Once you are
familiar with the Linguistica program, you may wish to alter the
preferences to improve the quality of the analysis or improve the
readability of the program's output. If you select File: Preferences, the Preferences
window will open, allowing you to view and change the current
preferences. There are three tabs: Lexicon,
User Interface, and Parameters. When you are done
modifying the preferences, you can click OK to apply the changes, or
Cancel to discard them. You can save
a certain set of preferences into a text file (prefs.txt, for example).
It is important to save your preferences, because Linguistica will not
remember them from session to session. If you do save your preferences,
Linguistica will load the last open preferences file. If you wish to
use an older preferences file, use the Load
When this box is checked, Linguistica will ask you if you want to save your work when you quit the program.
The character in this box will be used to delimit the different suffixes in the signatures. For instance, for the English words 'office' and 'official', the stem would be 'offic' and the signature, if delimited by a period, would appear as 'e.ial'. This does not affect the analysis; rather it is only used to improve readability.
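The 'office' / 'official' example can be reproduced with a small Python sketch. The function name is hypothetical, and sorting the suffixes alphabetically is an assumption about display order; the point is only that the delimiter changes how a signature is written, not what it contains.

```python
def signature_label(stem, words, delimiter="."):
    """Write the signature of `stem` as its suffixes joined by `delimiter`.

    Illustrative only: suffixes are whatever follows the stem in each
    word, sorted alphabetically (an assumed ordering). An empty suffix
    is not given any special label here.
    """
    suffixes = sorted(w[len(stem):] for w in words if w.startswith(stem))
    return delimiter.join(suffixes)


print(signature_label("offic", ["office", "official"]))       # e.ial
print(signature_label("offic", ["office", "official"], "-"))  # e-ial
```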
Any sets of characters you put in this field will be interpreted by Linguistica as one letter. Separate the different character sets by spaces or by putting them on different lines. The character filtering set in the screenshot is for Somali. In this example, dh, sh, kh, and aa are interpreted as single letters by Linguistica.
After you type in the characters to be filtered, click Set Filters.
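The effect of character filtering can be pictured as splitting a word into "letters", some of which are more than one character long. The Python sketch below is a hypothetical illustration using the Somali sets mentioned above, not the program's own code. Preferring the longest matching set is an assumption.

```python
def split_letters(word, multigraphs):
    """Split `word` into letters, keeping each listed set as one letter."""
    # Try longer sets first so a longer match wins over a shorter one;
    # empty entries are dropped so the scan always moves forward.
    ordered = sorted((m for m in multigraphs if m), key=len, reverse=True)
    letters = []
    i = 0
    while i < len(word):
        for m in ordered:
            if word.startswith(m, i):
                letters.append(m)
                i += len(m)
                break
        else:
            # No listed set starts here: take a single character.
            letters.append(word[i])
            i += 1
    return letters


# The field's contents, separated by spaces as described above.
somali_sets = "dh sh kh aa".split()
print(split_letters("shaah", somali_sets))  # ['sh', 'aa', 'h']
```

With the filter in place, the five-character string "shaah" is treated as three letters, so a morpheme boundary can never be proposed in the middle of "sh" or "aa".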
Each element (Affix, Signature, Stem, Word) can be assigned a different text formatting. The text can be any combination of Bold, Italic, Underline, or a particular color. This text formatting will show up in the Text area of the main Linguistica window. The preview area allows you to see what the formatting will look like.
Default Screen Font:
This setting allows you to modify the font used in all three areas of the Linguistica main window (Tree, Collections, and Text areas). There is another setting, under the View menu, which also allows you to change the font used, but that option only applies to the Collections and Text areas, not the Tree area. This way, the font determined by the preferences window can be a more permanent font, and you can change the fonts under the View menu for special cases, such as analyzing Arabic or any other language that uses a non-Roman writing system.
This tab allows
you to fine-tune your analysis, giving you specific control over a number
of variables. To modify a parameter, select it, and look to the right
of the name. There should be a number, which you can click on and,
after waiting a second, modify. The meanings behind the
parameter names are either self-explanatory or beyond the scope of this
document.
When you are
finished modifying the preferences, be sure to save them into a
text file.
If you wish to do
a quick analysis of a corpus, you can follow these general steps:
- Open Linguistica
- Press Ctrl-N (File: New Corpus) and find your corpus, or, if it was the last corpus you used, press Ctrl-D (File: Reread Corpus)
- Press Ctrl-S (Suffixes: Run All)
- Now you can see the suffixes and signatures that Linguistica has determined, under Lexicon in the Tree area.
Thanks to Mike LeBeau and Jeremy O'Brien for work on this webpage.