 
 

 

 

 

 
Corpora on CD-ROM

Corpora not freely distributed

Overview:
This double CD-ROM contains extensive corpora, both spoken and written, in more than 21 languages of Western, Central and Eastern Europe, for instance Lithuanian, Polish, Hungarian, and Slovene. The corpora are available in plain text and SGML encoding, and have been successfully aligned. Also available are various tools including a corpus query language, concordancer, alignment tools, software, POS taggers, lexica in 6 languages and samples of research work involving the data.

Distributed by Elsnet
European Corpus Initiative Multilingual Corpus I (ECI/MCI): over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
Official Journal of the European Community (JOC)
Annotation: parts of speech
Languages: English, French, German, Italian, Spanish
 
Corpus available from UCREL (University Centre for Computer Corpus Research on Language, Lancaster)
Official EC documents on telecommunications
Aligned (1,250,000 words per language)
Languages: English, French
Annotation: parts of speech, lemmatization

  • INTERSECT Corpus

Aligned corpus
Multilingual Corpora and Contrastive Linguistics.
Languages: English / French, English / German
 
Corpus of German and English translations. The corpus is not available
for copyright reasons. 
 
Last updated: 01-08-2008
 
 
© 2021 Olivier Kraif's Homepage