
Word Frequencies and Language Resources (different sets of corpora)

18 Jun

Recently I have been searching for non-lemmatized word frequency tables compiled for various languages such as German, French, Spanish, and Dutch. So far, it seems a better idea to construct such tables myself from different corpora. Here are some relevant links.
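As a rough illustration of what constructing such a table could look like, here is a minimal Python sketch. It assumes the corpus is available as plain text lines, and it uses a simple regular-expression tokenizer that is only a stand-in for a real one. The names `word_frequencies` and `frequency_table` and the sample sentences are made up for this example. Because nothing is lemmatized, inflected forms such as "hund" and "hunde" are counted separately.

```python
import re
from collections import Counter

# Runs of letters, including accented ones (ä, é, ñ); digits and underscores are excluded.
# This is a simplification: words with apostrophes or hyphens are split into parts.
WORD_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(lines, lowercase=True):
    """Count surface word forms (no lemmatization) in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if lowercase:
            line = line.lower()
        counts.update(WORD_RE.findall(line))
    return counts


def frequency_table(counts, top_n=None):
    """Return (word, count, relative frequency) rows, most frequent first."""
    total = sum(counts.values())
    # Sort by descending count, then alphabetically to make ties deterministic.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_n is not None:
        rows = rows[:top_n]
    return [(word, count, count / total) for word, count in rows]


# Hypothetical two-sentence "corpus" used only for demonstration.
sample = [
    "Der Hund sieht den Hund.",
    "Die Hunde sehen die Katze.",
]

counts = word_frequencies(sample)
for word, count, rel in frequency_table(counts, top_n=5):
    print(f"{word}\t{count}\t{rel:.3f}")
```

For a real corpus, the list `sample` could be replaced by an open file object, since the function only iterates over lines. Whether to lowercase depends on the language: for German, where nouns are capitalized, passing `lowercase=False` may be preferable.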

Two outstanding examples:

– Wortschatz: 57 Corpus-Based Monolingual Dictionaries: http://corpora.uni-leipzig.de/ and http://corpora.uni-leipzig.de/download.html

– negr@corpus: A syntactically annotated corpus of German newspaper texts. The corpus is available free of charge to universities and other non-profit research organizations; others should contact the maintainers for conditions. Version 2 of the corpus contains 20602 sentences (355096 tokens). http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html

Other links:

– Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources: http://nlp.stanford.edu/links/statnlp.html

– Statistical Language Modeling Toolkit: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

– The LDC Corpus Catalog. The LDC’s Catalog contains hundreds of corpora of language data. http://www.ldc.upenn.edu/Catalog/

– European Corpus Initiative Multilingual Corpus I (ECI/MCI): The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus (ECI/MCI) to be made available in digital form for scientific research at as low a cost as possible. The corpus has been available on CD-ROM since 1994 and is being distributed by ELSNET. http://www.elsnet.org/resources/eciCorpus.html

– ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. http://www.elra.info/

– ELDA (Evaluations and Language resources Distribution Agency) is ELRA’s operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT (Human Language Technology) community. In addition, ELDA is involved in HLT evaluation campaigns. http://www.elda.org/


Posted on June 18, 2010 in Linguistics, Programlama

 
