Slovene Web Corpus

slWaC

http://www.nljubesic.net/resources/corpora/slwac/

ID: 307

Slovene Web Corpus (slWaC) is the first version of the Slovene web corpus. It was collected by crawling the whole .si internet domain in 2011-06, yielding ca. 380 million tokens. The corpus has been lemmatised and MSD-tagged automatically using the ToTaLe system (Erjavec et al. 2005). The compilation of the corpus is described in the TSD 2011 paper: Ljubešić, N., Erjavec, T. hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. The morphosyntactically annotated and lemmatised corpus is distributed under the CC-BY-SA licence. The first version is freely accessible for querying at http://faust.ffzg.hr/bonito2/run.cgi/first_form?corpname=slwac. A new crawl with an updated crawler is scheduled for 2012-09. The target size of the second version of slWaC is 1 billion words.

Distribution

Availability

Available - Restricted Use

Licence

CC-BY-SA

Restrictions: Attribution, Share Alike

Execution location: hidden

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Zagreb, Faculty of Humanities and Social Sciences

IPR Holder

University of Zagreb, Faculty of Humanities and Social Sciences

Contact Person

Nikola Ljubešić

text

Monolingual text corpus

Languages

Slovenian

Language Script: Latn

Linguality

Linguality type: Monolingual

Size

380 000 000 Tokens

Character encoding

UTF-8

Annotation

Lemmatization

Segmentation level: Word

Morphosyntactic Annotation - B Pos Tagging

Segmentation level: Word

Segmentation

Segmentation level: Word

Segmentation

Segmentation level: Paragraph

Resource Creation

Resource Creator

Univ. of Zagreb, Faculty of Humanities and Social Sciences, Depts. of Linguistics & Information Sci.

Creation started: 01/06/2011

Funding Project

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds

Funders: European Commission (50%), University of Zagreb, Faculty of Humanities and Social Sciences (50%)

Project duration: 01/02/2011 - 31/01/2013

Metadata

Created: 30/07/2012

Last Updated: 04/02/2013

Metadata Creator

Marko Tadić

Version

Version: 1.0

Last Updated: 30/07/2012

Documentation

Nikola Ljubešić and Tomaž Erjavec. hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue 2011. Lecture Notes in Computer Science, Springer.
