US Census surnames for word origin research

This data set contains hand annotations of the language of origin of frequent US surnames. Surnames were selected from the list of Frequently Occurring First Names and Surnames From the 1990 Census made available by the US Census Bureau.

My goal in labeling was to assign each surname its most likely language of origin when it occurs in the United States in the present day. Labeling was done primarily using the Dictionary of American Family Names (Hanks 2003) and Ellis Island immigration records. Additional resources were used when these sources were inconclusive.

Note that some labels are not individual languages but rather language groups, each representing a set of languages related by geography and language family. Groups were created for British Isles, Slavic, Scandinavian, and Indic languages.


Related work

Sonjia Waxmonsky. 2011. Natural language processing for named entities with word-internal information. PhD Thesis, University of Chicago.

Sonjia Waxmonsky and Sravana Reddy. 2012. Grapheme-to-Phoneme Conversion of Proper Names Using Word Origin Information. In Proceedings of NAACL.


S. Waxmonsky, wax at cs uchicago