My goal in labeling was to assign each surname its most likely language of origin when it occurs in the present-day United States. I labeled surnames primarily using the Dictionary of American Family Names (Hanks 2003) and Ellis Island immigration records, and consulted additional resources when these two sources were inconclusive.
Note that some labels are not individual languages but language groups, each representing a set of languages related by geography and language family. Groups were created for British Isles, Slavic, Scandinavian, and Indic languages.
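One way to picture the group labels is as a lookup from an individual language to its group. The sketch below is illustrative only: the four group names come from the description above, but the member languages listed under each group and the names `LANGUAGE_GROUPS` and `origin_label` are assumptions, not the actual composition of the labeled data.

```python
# Hypothetical mapping from group label to member languages.
# The group names come from the text; the member languages are
# assumptions chosen for illustration only.
LANGUAGE_GROUPS = {
    "British Isles": {"English", "Scottish", "Irish", "Welsh"},
    "Slavic": {"Polish", "Czech", "Russian", "Ukrainian"},
    "Scandinavian": {"Swedish", "Norwegian", "Danish"},
    "Indic": {"Hindi", "Bengali", "Punjabi"},
}

# Invert the mapping once so each lookup is a single dictionary access.
_LANGUAGE_TO_GROUP = {
    language: group
    for group, members in LANGUAGE_GROUPS.items()
    for language in members
}


def origin_label(language: str) -> str:
    """Return the group label for a grouped language, else the language itself."""
    return _LANGUAGE_TO_GROUP.get(language, language)
```

Under these assumptions, `origin_label("Polish")` yields the group label `"Slavic"`, while a language outside every group, such as `"Italian"`, keeps its own name as the label.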
Sonjia Waxmonsky. 2011. Natural language processing for named entities with word-internal information. PhD Thesis, University of Chicago.
Sonjia Waxmonsky and Sravana Reddy. 2012. Grapheme-to-Phoneme Conversion of Proper Names Using Word Origin Information. In Proceedings of NAACL.