What is Linguistica?
Linguistica is a program designed to explore the unsupervised
learning of natural language, with primary focus on morphology
(word-structure). It runs under Windows, Mac OS X and Linux, and is
written in C++ within the Qt development framework. Its demands on memory depend
on the size of the corpus analyzed.
Unsupervised learning refers to the computational task of making
inferences (and therefore acquiring knowledge) about the structure that
lies behind some set of data, without any direct access to that
structure. In the case of unsupervised learning of morphology,
Linguistica explores the possibilities of morpheme-combinations for a
set of words, based on no internal knowledge of the language from
which the words are drawn.
Segmentation is the first task of this process; the program figures out
where the morpheme boundaries are in the words, and then decides what
the stems are, what the suffixes are, and so forth. Most of
Linguistica's functionality, at this point, goes into making these
decisions.
Understanding Linguistica
This document presents a brief description of how to use the Linguistica program, and it also provides links to other documents that explain the ideas incorporated in Linguistica. By reading this, you will be able to use the Linguistica program and get useful results; it is not meant to be a technical document on all the inner workings of the program. For a technical description of the Minimum Description Length model that motivates this work, see the article Unsupervised Learning of the Morphology of a Natural Language, published in Computational Linguistics 27:2 (2001), pp. 153-198.
Some PowerPoint slides describing the ideas behind this program can be accessed here. Slides from a talk at Microsoft on Nov 9 2001 are available here.
Linguistica relies on quality text files in an interesting language. If you don't have any, there are some good links here.
This document is likely incomplete. I am more than pleased to answer inquiries by email. The details of this documentation will change frequently as I modify the program. I'll do my best to keep it up-to-date, but there will undoubtedly be slippage from time to time.
Using Linguistica
The Linguistica main window is divided into three areas: on the left is the Tree area, on the upper right is the Collections area, and below that is the Text area.
Most of the user input is done with the drop-down menus. The program's
output will be displayed in the Tree, Collections, and Text areas. The Tree
area shows general information about the given corpus and works in
tandem with the Collections area, which displays more detailed
information about the corpus. The Text area gives specific examples
from the corpus itself, as relating to the highlighted type in the
Collections area.
The other major source of information for the user comes from the
optional Log File. If you activate Log Filing, then the program will
write a detailed description of its operations to a text file, at a
location specified by you. The Log File must be turned on for each
individual operation that you want logged. To begin logging, go to
Log File : New Log File. A dialogue box will appear, where you can save
a new log file (in text file form, .txt) or you can open up an old one.
Select or type in the name of your log file, and click OK. Then go to
Log File : Logging Enabled. This option should now be checked in the
menu, and your analysis will be saved to the text file.
How to Begin
The first operation is to read in a corpus from some language.
The default setting for Linguistica's corpus input is 10,000 words:
this is the number of words, from the beginning of the corpus, that the
program will read. If you wish to change this setting, click on Words requested in the Tree area on
the left. A dialogue box will appear
where you can specify a different number of words to be read from the
corpus. This number refers to the total number of words (tokens) read;
it does not refer to word types.
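The difference between tokens and types can be illustrated with a minimal Python sketch. It assumes plain whitespace tokenization, which is probably simpler than what Linguistica does internally; the function name read_tokens and the sample sentence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def read_tokens(text, words_requested=10_000):
    """Return at most `words_requested` tokens from the start of `text`.

    Assumption: tokens are separated by whitespace. Linguistica's own
    tokenizer may treat punctuation and case differently.
    """
    return text.split()[:words_requested]


sample = "the cat saw the dog and the dog saw the cat"
tokens = read_tokens(sample, words_requested=8)
types = Counter(tokens)  # maps each distinct word (type) to its token count

print(len(tokens))  # 8 tokens were read
print(len(types))   # 5 distinct word types among them
```

Here the limit of 8 applies to the running words read from the start of the text, while the list shown under Words would correspond to the 5 types.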
To open a corpus, go to File : New Corpus (or press Ctrl-N). A
dialogue box will appear, allowing you to find the desired corpus on
your hard drive. Find it and click Open.
The corpus must be a
standard or Unicode text file (.txt); it cannot be a word processing
file from Microsoft Word (.doc). If you have already run Linguistica,
and you previously read in a corpus, the program will remember the
location of the file, and you can simply select File : Reread corpus (Ctrl-D) to reread the same corpus.
When the reading is complete, Words will appear in the Tree area on the
left side of the screen, right under Lexicon. The Lexicon will hold a
range of collections, including Words, Stems, Suffixes, and Signatures.
Collections that are still empty (as Stems, Suffixes, and Signatures are
at this point) do not appear in the Tree area. Clicking on a collection
in the Tree area displays its contents in the Collections area; click on
Words, for example, and the unique words of the corpus will appear there
as a list.
The width of the columns in the Collections area may be too small or too
large for your purposes. You can change the widths of the columns by
grabbing an edge at the top of the columns with the cursor and moving
it to the left or the right. You can also sort by any of the column
values by clicking on the title of that column. Sorting on the "Corpus
Count" column is particularly useful, since it brings the most frequent
words to the top. You can return to an alphabetical display of the words
by clicking on the top of the first column, "Words". If you wish to see
the words organized into a trie, you can click on the "Forward trie"
line, which is under "Lexicon" in the Tree area.
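For readers unfamiliar with tries, the following Python sketch shows the general idea: words that share an initial string of letters share a path from the root. This is a generic illustration, not Linguistica's own C++ data structure; the names build_trie and END are hypothetical.

```python
END = ""  # end-of-word marker; the empty string can never be a letter


def build_trie(words):
    """Build a forward trie as nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Follow the existing branch for this letter, or create it.
            node = node.setdefault(letter, {})
        node[END] = {}  # mark that a complete word ends at this node
    return root


trie = build_trie(["walk", "walks", "walked", "wall"])

# After "wal" the trie branches into "k" and "l".
print(sorted(trie["w"]["a"]["l"]))
# "walk" is itself a word, and it continues with "e" (walked) and "s" (walks).
print(sorted(trie["w"]["a"]["l"]["k"]))
```

Points where a node has several continuations, as after "walk" here, are natural candidates for morpheme boundaries.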
Finding a Suffixal System: Signature Based Analysis
Let us suppose you have chosen a language, such as English, in which you wish to discover the suffixal system. On the menu, go to Suffixes : Run All (Keyboard shortcut: Ctrl-S). We will return to other such actions you may take; for now, let's look at what results you may obtain if you perform this operation. You may see something like this in the Tree Area:
In the Lexicon, then, we have 16 suffixes, and 54 signatures built up out of them, along with 571 stems. Of the 2,463 distinct words read, 711 were analyzed, meaning the rest were determined to be mono-morphemic by the program. You should click consecutively on each of these groups, and see that they are displayed in the Collections area on the right as you do so. When the collections get large, it may take a while to display a collection (10 seconds or more if there are more than 5,000 members).
Saving to File
To save an analysis to a text file that you can open in a text editor
or a spreadsheet, go to File : Save As.... A dialogue box will appear,
where you can select a folder and give a name to your project (e.g.,
WarrenCommissionReport_50K). Linguistica will save a set of about 12
different text files: a list of words, stems, affixes, signatures, etc.
Most of the information in them is either self-explanatory or reserved
for features yet to be implemented. There is also the option to
Save broken corpus under the File menu. This saves a version of your
corpus under the name "[your corpus]_Broken.txt". Inside this text
file, you will find the original corpus, but with two spaces between
words, and a plus sign at each morpheme break that Linguistica has
determined.
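If you want to process a broken corpus with your own scripts, a few lines of Python are enough to recover the morphemes. The sketch below assumes only what is described above (whitespace between words, a plus sign at each morpheme break); the function name and the sample line are hypothetical, and a corpus that itself contains literal plus signs would be ambiguous.

```python
def parse_broken_line(line):
    """Split one line of a broken corpus into words, each a list of morphemes.

    Assumes words are separated by whitespace and morphemes by "+".
    """
    return [word.split("+") for word in line.split()]


# A made-up line in the format described above.
line = "the  walk+ed  dog+s  jump+ing"
parsed = parse_broken_line(line)

for morphemes in parsed:
    print(morphemes)
# ['the']
# ['walk', 'ed']
# ['dog', 's']
# ['jump', 'ing']
```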
Signatures
Returning to the signature-based analysis: if you select Signatures
(underneath Suffixes in the Tree area), the signatures will be displayed
in the Collections area. You can then click on a signature in the
first column, and the stems associated with that signature will be
displayed in the Text area below.
By default, signatures are ranked by their robustness, which is roughly the number of letters saved by the analysis, compared with the total number of letters in the original analyzed corpus. It is a measure of how efficiently the analysis has generalized the data. The signatures can be re-sorted by clicking on the header at the top of various columns of the display. Remarks gives an indication of which function was responsible for the identification of the signature.
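The idea of "letters saved" can be made concrete with a small Python sketch. It is only an approximation under stated assumptions: it compares the letters needed to spell out every stem-plus-suffix word in full against the letters needed to list each stem and each suffix once. The exact formula and normalization that Linguistica uses for robustness may differ, and the function name letters_saved is hypothetical.

```python
def letters_saved(stems, suffixes):
    """Letters saved by storing a signature instead of all of its words.

    Assumption: every stem combines with every suffix in the signature.
    A null suffix can be passed as the empty string.
    """
    stem_letters = sum(len(stem) for stem in stems)
    suffix_letters = sum(len(suffix) for suffix in suffixes)
    # Unanalyzed: each stem is spelled once per suffix, and vice versa.
    unanalyzed = len(suffixes) * stem_letters + len(stems) * suffix_letters
    # Analyzed: each stem and each suffix is spelled exactly once.
    analyzed = stem_letters + suffix_letters
    return unanalyzed - analyzed


print(letters_saved(["jump", "walk", "talk"], ["ed", "ing", "s"]))  # 36
```

Under this reading, a signature scores higher the more stems and suffixes it has and the longer they are, which matches the intuition that it generalizes more of the data.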
Now, you can click successively on the rest of the items in the
Suffixes menu: (1) Successor frequencies, (2) Check signatures, (3)
Known stems and signatures, (4) From sigs: find stems, and (5) Find
singleton signatures. When you select Suffixes : Run All (Ctrl-S), as
you did before, the Linguistica program actually goes through these five
analytic tools, some of them multiple times. Run All will probably
provide you with the best, most comprehensive analysis of your corpus,
but the individual commands allow you to fine-tune your analysis and
give you a better understanding of what exactly Linguistica is doing
in the background.
Preferences
Once you are familiar with the Linguistica program, you may wish to
alter the preferences to improve the quality of the analysis or the
readability of the program's output. If you select File : Preferences,
the Preferences window will open, allowing you to view and change the
current preferences. There are three tabs: Lexicon, User Interface, and
Parameters. When you are done modifying the preferences, you can click
OK to apply the changes, or Cancel to discard them. You can also Save
a set of preferences into a text file (prefs.txt, for example). It is
important to save your preferences, because Linguistica will not
otherwise remember them from session to session. If you do save your
preferences, Linguistica will load the last open preferences file. If
you wish to use an older preferences file, use the Load button.
Lexicon
Save verification:
When this box is checked, Linguistica will ask you
if you want to save your work when you quit the program.
Signature delimiter:
The character in this box will be used to delimit
the different suffixes in the signatures. For instance, for the English
words 'office' and 'official', the stem would be 'offic' and the
signature, if delimited by a period, would appear as 'e.ial'. The
delimiter does not affect the analysis; it only improves readability.
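As an illustration of how a signature string is put together, here is a minimal Python sketch. It takes stem and suffix pairs that have already been found and joins the suffixes of each stem with the chosen delimiter. The function name and the alphabetical ordering of suffixes are assumptions made for this example, not a description of Linguistica's internals.

```python
from collections import defaultdict


def signatures(analyses, delimiter="."):
    """Map each stem to its signature string.

    `analyses` is an iterable of (stem, suffix) pairs. Suffixes are
    sorted alphabetically here, which is an assumption of this sketch.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix)
    return {
        stem: delimiter.join(sorted(suffixes))
        for stem, suffixes in suffixes_by_stem.items()
    }


analyses = [("offic", "e"), ("offic", "ial")]
print(signatures(analyses))                 # {'offic': 'e.ial'}
print(signatures(analyses, delimiter="-"))  # {'offic': 'e-ial'}
```

Changing the delimiter changes only how the signature is displayed; the grouping of suffixes under the stem stays the same.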
Character Filtering:
Any sets of characters you put in this field will be
interpreted by Linguistica as one letter. Separate the different
character sets by spaces or by putting them on different lines. A
character filtering set for Somali, for example, contains dh, sh, kh,
and aa, each of which Linguistica then interprets as a single letter.
After you type in the characters to be filtered, click Set Filters.
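The effect of character filtering can be sketched in Python as follows. The function splits a word into letters, treating each filter entry as a single unit and preferring longer entries when several match. This is a hedged illustration of the behavior described above, not Linguistica's actual code; the function name to_letters and the sample words are chosen for the example.

```python
def to_letters(word, filters):
    """Split `word` into letters, keeping each filter entry as one unit."""
    # Try longer units first; ignore empty entries, which would never advance.
    units = sorted((f for f in filters if f), key=len, reverse=True)
    letters = []
    i = 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                letters.append(unit)
                i += len(unit)
                break
        else:
            # No filter matched at this position: take a single character.
            letters.append(word[i])
            i += 1
    return letters


somali_filters = ["dh", "sh", "kh", "aa"]
print(to_letters("shaah", somali_filters))   # ['sh', 'aa', 'h']
print(to_letters("dhagax", somali_filters))  # ['dh', 'a', 'g', 'a', 'x']
```

With the filters set, a morpheme boundary can never fall between the d and the h of dh, because the pair is handled as one letter.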
User Interface
Morphology Highlighting:
Each element (Affix, Signature, Stem, Word) can be assigned a different
text formatting. The
text can be any combination of Bold, Italic, Underline, or a
particular color. This text formatting will show up in the Text area of
the main Linguistica window. The preview area allows you to see what
the formatting will look like.
Default Screen Font:
This setting allows you to modify the font used in all three areas of
the Linguistica main
window (Tree, Collections, and Text areas). There is another setting,
under the View menu, which
also allows you to change the font used, but that option only applies
to the Collections and Text areas, not the Tree area. This way, the
font determined by the preferences window can be a more permanent font,
and you can change the fonts under the View
menu for special cases, such as analyzing Arabic or any other language
that uses a non-Roman writing system.
Parameters
This tab allows you to fine-tune your analysis, giving you specific
control over a number of variables. To modify a parameter, select it,
and look to the right of the name. There should be a number; click on
it and, after a second or so, you can modify it. The meanings behind
the parameter names are either self-explanatory or beyond the scope of
this tutorial.
When you are finished modifying the preferences, be sure to save them
into a text file.
Summary
If you wish to do a quick analysis of a corpus, you can follow these general steps:
- Open Linguistica
- Press Ctrl-N (File : New Corpus) and find your corpus or, if it was the last corpus you used, press Ctrl-D (File : Reread corpus)
- Press Ctrl-S (Suffixes: Run All)
- Now you can see the suffixes and signatures that Linguistica has determined, under Lexicon in the Tree area.
Thanks to Mike LeBeau and Jeremy O'Brien for work on this webpage.