The Linguistica Project

What is Linguistica?

Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.

Unsupervised learning refers to the computational task of making inferences (and therefore acquiring knowledge) about the structure that lies behind some set of data, without any direct access to that structure. In the case of unsupervised learning of morphology, Linguistica explores the possibilities of morpheme-combinations for a set of words, based on no internal knowledge of the language from which the words are drawn.

Segmentation is the first task of this process; the program figures out where the morpheme boundaries are in the words, and then decides what the stems are, what the suffixes are, and so forth. Most of Linguistica's functionality, at this point, goes into making these decisions.

Understanding Linguistica

This document presents a brief description of how to use the Linguistica program, and it also provides links to other documents that explain the ideas incorporated in Linguistica. By reading this, you will be able to use the Linguistica program and get useful results; it is not meant to be a technical document on all the inner workings of the program. For a technical description of the Minimum Description Length model that motivates this work, see the article Unsupervised Learning of the Morphology of a Natural Language, published in Computational Linguistics 27:2 (2001), pp. 153-198.

Some PowerPoint slides describing the ideas behind this program can be accessed here. Slides from a talk at Microsoft on Nov 9 2001 are available here.

Linguistica relies on quality text files in an interesting language. If you don't have any, there are some good links here.

This document is likely incomplete. I am more than pleased to answer inquiries by email. The details of this documentation will change frequently as I modify the program. I'll do my best to keep it up-to-date, but there will undoubtedly be slippage from time to time.

Using Linguistica

The Linguistica main window is divided into three areas: on the left is the Tree area, on the upper right is the Collections area, and below that is the Text area.

Most of the user input is done with the drop-down menus. The program's output will be displayed in the Tree, Collections, and Text areas. The Tree area shows general information about the given corpus and works in tandem with the Collections area, which displays more detailed information about the corpus. The Text area gives specific examples from the corpus itself, as relating to the highlighted type in the Collections area.

The other major source of information for the user comes from the optional Log File. If you activate Log Filing, then the program will write a detailed description of its operations to a text file, at a location specified by you. The Log File must be turned on for each individual operation that you want logged. To begin logging, go to Log File : New Log File. A dialogue box will appear, where you can save a new log file (in text file form, .txt) or you can open up an old one. Select or type in the name of your log file, and click OK. Then go to Log File : Logging Enabled. This option should now be checked in the menu, and your analysis will be saved to the text file.

How to Begin

The first operation is to read in a corpus from some language. The default setting for Linguistica's corpus input is 10,000 words: this is the number of words, from the beginning of the corpus, that the program will read. If you wish to change this setting, click on Words requested in the Tree area on the left. A dialogue box will appear where you can specify a different number of words to be read from the corpus. This number refers to the total number of words (tokens) read; it does not refer to word types.
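The difference between tokens and types can be made concrete with a small sketch. It is only an illustration: the function names and the `words_requested` parameter are hypothetical, and splitting on whitespace is an assumption, since this document does not describe how Linguistica itself tokenizes a corpus.

```python
from collections import Counter


def read_tokens(text, words_requested=10_000):
    """Return the first `words_requested` whitespace-separated tokens."""
    return text.split()[:words_requested]


def count_types(tokens):
    """Map each distinct word (type) to its number of occurrences."""
    return Counter(tokens)


sample = "the cat saw the dog and the dog saw the cat"
tokens = read_tokens(sample, words_requested=8)
types = count_types(tokens)

# 8 tokens were read, but they contain only 5 distinct types.
print(len(tokens), len(types))
```

Here the limit of 8 applies to tokens, so the reading stops after the second "dog" even though only five different words have been seen.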

To open a corpus, go to File : New Corpus (or press Ctrl-N). A dialogue box will appear, allowing you to find the desired corpus on your hard drive. Find it and click Open. The corpus must be a standard or Unicode text file (.txt); it cannot be a word processing file from Microsoft Word (.doc). If you have already run Linguistica, and you previously read in a corpus, the program will remember the location of the file, and you can simply select File : Reread corpus (Ctrl-D) to reread the same corpus.

When the reading is complete, Words will appear in the Tree area on the left side of the screen, right under Lexicon. The Lexicon will hold a range of collections, including Words, Stems, Suffixes, and Signatures. When these collections are empty (as they are now), they do not appear in the Tree area. You may click on Words, and the unique words of the corpus will appear as a list in the Collections area. Clicking on a collection in the Tree area will allow you to see the contents of the collection in the Collections area.

The width of the columns in the Collections area may be too small or too large for your purposes. You can change the widths of the columns by grabbing an edge at the top of the columns with the cursor and moving it to the left or the right. You can also sort by any of the column values by clicking on the title of that column. For example, clicking on the "Corpus Count" column brings the most frequent words to the top. You can return to an alphabetical display of the words by clicking on the top of the first column, "Words". If you wish to see the words organized into a trie, you can click on the "Forward trie" line, which is under "Lexicon" in the Tree area.
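For readers unfamiliar with tries, the following is a minimal sketch of the general data structure, in which words that share a beginning share a path from the root. It is not Linguistica's implementation; the nested-dictionary layout and the "#" end-of-word marker are assumptions made for illustration.

```python
def build_trie(words):
    """Build a trie as nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        # "#" marks the end of a word (assumes "#" never occurs in a word).
        node["#"] = True
    return root


def words_with_prefix(trie, prefix):
    """Return, in sorted order, every stored word that starts with `prefix`."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return []
        node = node[letter]
    found = []
    stack = [(node, prefix)]
    while stack:
        current, spelled = stack.pop()
        for key, child in current.items():
            if key == "#":
                found.append(spelled)
            else:
                stack.append((child, spelled + key))
    return sorted(found)


trie = build_trie(["jump", "jumps", "jumped", "walk"])
print(words_with_prefix(trie, "jump"))  # ['jump', 'jumped', 'jumps']
```

A point in the trie where several branches leave a shared beginning, as after "jump" here, is a natural place to look for a morpheme boundary.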

Finding a Suffixal System: Signature Based Analysis

Let us suppose you have chosen a language, such as English, in which you wish to discover the suffixal system. On the menu, go to Suffixes : Run All (Keyboard shortcut: Ctrl-S). We will return to other such actions you may take; for now, let's look at what results you may obtain if you perform this operation. You may see something like this in the Tree area:

In the Lexicon, then, we have 16 suffixes, and 54 signatures built up out of them, along with 571 stems. Of the 2,463 distinct words read, 711 were analyzed, meaning the rest were determined to be mono-morphemic by the program. You should click consecutively on each of these groups, and see that they are displayed in the Collections area on the right as you do so. When the collections get large, it may take a while to display a collection (as much as 10 seconds or more if there are more than 5,000 members).

Saving to File

To save an analysis to a text file that you can open in a text editor or a spreadsheet, choose File : Save As... from the menu. A dialogue box will appear, where you can select a folder and give a name to your project (e.g., WarrenCommissionReport_50K). Linguistica will save a set of about 12 different text files: a list of words, stems, affixes, signatures, etc. Most of the information in them is transparent or yet to be implemented. There is also the option to Save broken corpus under the File menu. This saves a version of your corpus with the appended name "[your corpus]_Broken.txt". Inside this text file, you will find the original corpus, but with two spaces between each word, and a plus sign at each morpheme break that Linguistica has determined.
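If you want to process the broken corpus with your own scripts, a reader along the following lines may be a useful starting point. It is a sketch under assumptions: the function name is hypothetical, the sample line is invented, and the plus sign is assumed to be written with no spaces around it. A corpus that already contains literal plus signs would be ambiguous under this scheme.

```python
def parse_broken_line(line):
    """Split a line into words, and each word into its morphemes.

    Words are separated by whitespace (two spaces in the saved file);
    morphemes within a word are separated by "+".
    """
    return [word.split("+") for word in line.split()]


# Invented example line in the described format.
line = "the  dog  jump+ed  and  bark+s"
for morphemes in parse_broken_line(line):
    print(morphemes)
```

Words that Linguistica left unanalyzed come back as single-element lists, so the length of each list tells you how many morphemes were found.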

Signatures

Returning to the signature-based analysis: if you select Signatures (underneath Suffixes in the Tree area), the signatures will be displayed in the Collections area. You can then click on a signature in the first column, and the stems associated with that signature will be displayed in the Text area below.
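The relationship between stems and signatures can be sketched as a simple grouping: a signature is the set of suffixes that a stem appears with, and stems that share the same set belong to the same signature. The code below illustrates only that grouping, not the discovery of the stem and suffix splits themselves. The function name and the sample splits are hypothetical, and how Linguistica represents an empty suffix is not covered.

```python
from collections import defaultdict


def build_signatures(analyses, delimiter="."):
    """Group stems by the set of suffixes they occur with.

    `analyses` is an iterable of (stem, suffix) pairs.
    Returns a dict mapping each signature string to its sorted stems.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix)

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        signature = delimiter.join(sorted(suffixes))
        stems_by_signature[signature].append(stem)

    return {sig: sorted(stems) for sig, stems in stems_by_signature.items()}


# Hypothetical splits, as if already produced by an analysis.
analyses = [
    ("offic", "e"), ("offic", "ial"),
    ("fac", "e"), ("fac", "ial"),
    ("jump", "ed"), ("jump", "ing"),
]
signatures = build_signatures(analyses)
print(signatures)
```

In this example the stems "offic" and "fac" both end up under the signature "e.ial", which corresponds to what the Text area shows when you click on a signature.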

By default, signatures are ranked by their robustness, which is roughly the number of letters saved by the analysis, compared with the total number of letters in the original analyzed corpus. It is a measure of how efficiently the analysis has generalized the data. The signatures can be re-sorted by clicking on the header at the top of various columns of the display. Remarks gives an indication of which function was responsible for the identification of the signature.
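One way to read "letters saved" is to compare the number of letters needed to spell out every stem-plus-suffix word in full with the number needed to list each stem and each suffix only once. The sketch below computes that difference for a single signature. It is an interpretation of the description above, not necessarily the exact quantity Linguistica reports, and the function name is hypothetical.

```python
def letters_saved(stems, suffixes):
    """Letters saved by listing stems and suffixes once each
    instead of spelling out every stem + suffix combination."""
    spelled_out = sum(
        len(stem) + len(suffix) for stem in stems for suffix in suffixes
    )
    factored = sum(len(s) for s in stems) + sum(len(f) for f in suffixes)
    return spelled_out - factored


# office, official, face, facial: 24 letters spelled out,
# versus 12 letters for the stems and suffixes listed once.
print(letters_saved(["offic", "fac"], ["e", "ial"]))  # 12
```

Under this reading, a signature scores higher when it has more stems, more suffixes, or longer ones, which matches the idea that robust signatures capture broader generalizations.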

Now, you can click successively on the rest of the items in the Suffixes menu: (1) Successor frequencies, (2) Check signatures, (3) Known stems and signatures, (4) From sigs: find stems, and (5) Find singleton signatures. When you select Suffixes : Run All (Ctrl-S), as you did before, the Linguistica program actually goes through these five analytic tools, some of them multiple times. This will probably provide you with the best, most comprehensive analysis of your corpus, but the individual commands allow you to fine-tune your analysis and give you a better understanding of what exactly Linguistica is doing in the background.

Preferences

Once you are familiar with the Linguistica program, you may wish to alter the preferences to improve the quality of the analysis or improve the readability of the program's output. If you select File : Preferences, the Preferences window will open, allowing you to view and change the current preferences. There are three tabs: Lexicon, User Interface, and Parameters. When you are done modifying the preferences, you can click OK to apply the changes, or Cancel to discard them. You can also Save a certain set of preferences into a text file (prefs.txt, for example). It is important to save your preferences, because Linguistica will not remember them from session to session. If you do save your preferences, Linguistica will load the last open preferences file. If you wish to use an older preferences file, use the Load button.

Lexicon

Save verification:
    When this box is checked, Linguistica will ask you if you want to save your work when you quit the program.
Signature delimiter:
    The character in this box will be used to delimit the different suffixes in the signatures. For instance, for the English words 'office' and 'official', the stem would be 'offic' and the signature, if delimited by a period, would appear as 'e.ial'. This does not affect the analysis; rather it is only used to improve readability.
Character Filtering:
    Any sets of characters you put in this field will be interpreted by Linguistica as one letter. Separate the different character sets by spaces or by putting them on different lines. For example, a character filtering set for Somali contains dh, sh, kh, and aa, so that each of these is interpreted as a single letter by Linguistica.
    After you type in the characters to be filtered, click Set Filters.
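The effect of character filtering can be pictured as splitting each word into letters, where a listed character set counts as one letter. The following is only a sketch: the function name is hypothetical, and the greedy left-to-right, longest-match-first strategy is an assumption, since this document does not say how Linguistica resolves overlapping character sets.

```python
def split_letters(word, multigraphs):
    """Split `word` into letters, treating each listed character set
    as a single letter (longest match first, left to right)."""
    # Ignore empty entries and try longer character sets first.
    ordered = sorted((m for m in multigraphs if m), key=len, reverse=True)
    letters = []
    i = 0
    while i < len(word):
        for unit in ordered:
            if word.startswith(unit, i):
                letters.append(unit)
                i += len(unit)
                break
        else:
            # No character set matches here: take a single character.
            letters.append(word[i])
            i += 1
    return letters


somali_filters = ["dh", "sh", "kh", "aa"]
print(split_letters("shaah", somali_filters))    # ['sh', 'aa', 'h']
print(split_letters("dhakhso", somali_filters))  # ['dh', 'a', 'kh', 's', 'o']
```

With the filters in place, "shaah" counts as three letters rather than five, so a morpheme boundary can never be placed in the middle of "sh" or "aa".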

User Interface

Morphology Highlighting:
    Each element (Affix, Signature, Stem, Word) can be assigned a different text formatting. The text can be any combination of Bold, Italic, Underline, or a particular color. This text formatting will show up in the Text area of the main Linguistica window. The preview area allows you to see what the formatting will look like.
Default Screen Font:
    This setting allows you to modify the font used in all three areas of the Linguistica main window (Tree, Collections, and Text areas). There is another setting, under the View menu, which also allows you to change the font used, but that option only applies to the Collections and Text areas, not the Tree area. This way, the font determined by the preferences window can be a more permanent font, and you can change the fonts under the View menu for special cases, such as analyzing Arabic or any other language that uses a non-Roman writing system.

Parameters

This tab allows you to fine-tune your analysis by giving you specific control over a number of variables. To modify a parameter, select it and look to the right of its name, where its current value is shown as a number. Click on the number and, after waiting a second, you can edit it. The meanings behind the parameter names are either self-explanatory or beyond the scope of this tutorial.

When you are finished with modifying the preferences, be sure to save them into a text file.

Summary

If you wish to do a quick analysis of a corpus, you can follow these general steps:

  1. Open Linguistica
  2. Press Ctrl-N (File : New Corpus) and find your corpus or, if it was the last corpus you used, press Ctrl-D (File : Reread corpus)
  3. Press Ctrl-S (Suffixes : Run All)
  4. Now you can see the suffixes and signatures that Linguistica has determined, under Lexicon in the Tree area.

Thanks to Mike LeBeau and Jeremy O'Brien for work on this webpage.