Multilingual text corpus
Languages
Slovenian
(1,359,000 Words)
Language Script: Latin
Swedish
(1,403,000 Words)
Language Script: Latin
Bulgarian
(1,070,000 Words)
Language Script: Cyrillic
English
(1,985,000 Words)
Language Script: Latin
Modern Greek (1453-)
(1,650,000 Words)
Language Script: Greek
Estonian
(987,000 Words)
Language Script: Latin
Spanish; Castilian
(1,911,000 Words)
Language Script: Latin
Czech
(1,401,000 Words)
Language Script: Latin
German
(1,565,000 Words)
Language Script: Latin
Danish
(1,256,000 Words)
Language Script: Latin
French
(2,152,000 Words)
Language Script: Latin
Finnish
(1,069,000 Words)
Language Script: Latin
Italian
(2,127,000 Words)
Language Script: Latin
Hungarian
(1,205,000 Words)
Language Script: Latin
Latvian
(1,127,000 Words)
Language Script: Latin
Lithuanian
(1,118,000 Words)
Language Script: Latin
Dutch; Flemish
(1,454,000 Words)
Language Script: Latin
Maltese
(1,134,000 Words)
Language Script: Latin
Portuguese
(1,725,000 Words)
Language Script: Latin
Polish
(1,514,000 Words)
Language Script: Latin
Slovak
(1,331,000 Words)
Language Script: Latin
Romanian; Moldavian; Moldovan
(1,269,000 Words)
Language Script: Latin
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
31,810,000 Words
60,120 Texts
Character encoding
UTF-8
(60,120 Texts)
Domains
law_politics
(60,120 Texts)
Modalities
Written Language
(60,120 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Gale & Church algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
60,120 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI P5
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
60,120 Texts
Time Coverage
2005-2012
(60,120 Texts)
Geographic coverage
European Union
(60,120 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts chained text cleanup, segmentation, alignment, and import/export into a pipeline.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Swedish
(3,518,000 Words)
Language Script: Latin
Turkish
(5,200 Words)
Language Script: Latin
Romanian; Moldavian; Moldovan
(3,196,000 Words)
Language Script: Latin
Russian
(2,000 Words)
Language Script: Cyrillic
Slovak
(3,426,000 Words)
Language Script: Latin
Slovenian
(3,463,000 Words)
Language Script: Latin
Dutch; Flemish
(4,229,000 Words)
Language Script: Latin
Norwegian
(6,400 Words)
Language Script: Latin
Polish
(4,533,000 Words)
Language Script: Latin
Portuguese
(4,311,000 Words)
Language Script: Latin
Arabic
(1,320 Words)
Language Script: Arabic
German
(4,698,000 Words)
Language Script: Latin
Danish
(3,582,000 Words)
Language Script: Latin
English
(4,958,000 Words)
Language Script: Latin
Modern Greek (1453-)
(4,388,000 Words)
Language Script: Greek
Belarusian
(311 Words)
Language Script: Cyrillic
Czech
(3,519,000 Words)
Language Script: Latin
Bulgarian
(2,951,000 Words)
Language Script: Cyrillic
Estonian
(2,794,000 Words)
Language Script: Latin
Spanish; Castilian
(5,234,000 Words)
Language Script: Latin
French
(5,627,000 Words)
Language Script: Latin
Finnish
(2,691,000 Words)
Language Script: Latin
Croatian
(3,300 Words)
Language Script: Latin
Irish
(282,000 Words)
Language Script: Latin
Icelandic
(2,900 Words)
Language Script: Latin
Hungarian
(3,533,000 Words)
Language Script: Latin
Lithuanian
(3,069,000 Words)
Language Script: Latin
Italian
(4,790,000 Words)
Language Script: Latin
Maltese
(3,193,000 Words)
Language Script: Latin
Latvian
(2,907,000 Words)
Language Script: Latin
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
84,910,000 Words
88,332 Texts
Character encoding
UTF-8
(88,332 Texts)
Domains
law_politics
(88,332 Texts)
Modalities
Written Language
(88,332 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Gale & Church algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
88,332 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI P5
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
88,332 Texts
Time Coverage
2004-2012
(88,332 Texts)
Geographic coverage
European Union
(88,332 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts chained text cleanup, segmentation, alignment, and import/export into a pipeline.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Czech
(107,000 Words)
Language Script: Latin
Finnish
(78,000 Words)
Language Script: Latin
Spanish; Castilian
(129,000 Words)
Language Script: Latin
Icelandic
(99,000 Words)
Language Script: Latin
French
(134,000 Words)
Language Script: Latin
German
(125,000 Words)
Language Script: Latin
English
(114,000 Words)
Language Script: Latin
Danish
(117,000 Words)
Language Script: Latin
Dutch; Flemish
(115,000 Words)
Language Script: Latin
Italian
(119,000 Words)
Language Script: Latin
Polish
(104,000 Words)
Language Script: Latin
Norwegian
(115,000 Words)
Language Script: Latin
Russian
(52,000 Words)
Language Script: Cyrillic
Portuguese
(136,000 Words)
Language Script: Latin
Turkish
(83,000 Words)
Language Script: Latin
Swedish
(116,000 Words)
Language Script: Latin
Ukrainian
(71,000 Words)
Language Script: Cyrillic
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
1,814,000 Words
1,728 Texts
Character encoding
UTF-8
(1,728 Texts)
Domains
Modalities
Written Language
(1,728 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Gale & Church algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
1,728 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI P5
Annotation Mode: Automatic
Start date: 01/06/2012
End date: 30/06/2012
Size:
1,728 Texts
Time Coverage
2009-2012
(1,728 Texts)
Geographic coverage
European Union
(1,728 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts chained text cleanup, segmentation, alignment, and import/export into a pipeline.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Italian
(4,247,000 Words)
Language Script: Latin
English
(3,907,000 Words)
Language Script: Latin
German
(3,788,000 Words)
Language Script: Latin
French
(4,456,000 Words)
Language Script: Latin
Spanish; Castilian
(4,558,000 Words)
Language Script: Latin
Polish
(3,581,000 Words)
Language Script: Latin
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
24,539,000 Words
67,787 Texts
Character encoding
UTF-8
(67,787 Texts)
Domains
Modalities
Written Language
(67,787 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Gale & Church algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 01/08/2011
End date: 30/06/2012
Size:
67,787 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI P5
Annotation Mode: Automatic
Start date: 01/08/2011
End date: 30/06/2012
Size:
67,787 Texts
Time Coverage
2003-2012
(67,787 Texts)
Geographic coverage
European Union
(67,787 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts chained text cleanup, segmentation, alignment, and import/export into a pipeline.
Creation mode: Mixed
Original Sources
Creation Tools
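
Note on the alignment model: every record above names the length-based sentence alignment model of Gale & Church (1993) as the theoretic model. The following is a minimal Python sketch of that idea and is not Maligna's actual code. The function names `match_cost` and `align` are hypothetical, and the match-type priors and variance constant are assumed to be the values reported in the 1993 paper. Sentence length is measured in characters, and dynamic programming selects the cheapest sequence of 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 beads.

```python
import math

# Assumed values: match-type priors and per-character variance as
# reported by Gale & Church (1993); adjust for other language pairs.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length difference per character


def match_cost(len_src, len_tgt, prior):
    """Negative log-probability of pairing segments with these lengths."""
    if len_src == 0 and len_tgt == 0:
        return -math.log(prior)
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal deviate.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src, tgt):
    """Return beads as (source indices, target indices) pairs."""
    n, m = len(src), len(tgt)
    sl = [len(s) for s in src]
    tl = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell by each bead type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + match_cost(
                    sum(sl[i:ni]), sum(tl[j:nj]), prior)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Backward pass: recover the cheapest path from the end.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["The committee adopted the report.",
               "It will be published next week."]
    german = ["Der Ausschuss hat den Bericht angenommen.",
              "Er wird nächste Woche veröffentlicht."]
    for src_ids, tgt_ids in align(english, german):
        print(src_ids, tgt_ids)
```

Because the model looks only at lengths, it needs no lexicon and works across scripts such as Latin, Cyrillic and Greek. For the same reason it is usually run within paragraph or document boundaries that are already known to correspond.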