University of Helsinki English-language E-thesis 1999-2016, Korp version 1.1

Description

The corpus is available through Korp at Kielipankki, the Language Bank of Finland. It contains the University of Helsinki's English-language master's theses, doctoral theses and thesis summaries published at https://ethesis.helsinki.fi by September 2016. This version fixes issues with tokenization, language recognition and OCR in the master's thesis and dissertation subcorpora, 23 subcorpora in total. These subcorpora have also been parsed with the Turku Neural Parser Pipeline (TNPP). For more information, see http://turkunlp.org/Turku-neural-parser-pipeline/. The 2 subcorpora containing abstracts (ethesis_en_dissabs and ethesis_en_maabs) are the same as in the previous version. The subcorpus ethesis_en_phd_math has been renamed to ethesis_en_phd_sci. Texts with fewer than 1000 words have been left out. Most of them contain only an abstract, often in both Finnish and English, and/or just the first page and possibly a table of contents, especially in the case of dissertations. Texts of more than 1000 words are included if they contain at least 50 English words, as tested with a very simple search for common English words. Most texts that fail this test are either not in English or have been badly OCR'd.
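The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the corpus: the word list, the tokenizer, and the function names are hypothetical, the treatment of texts with exactly 1000 words is not specified in the description, and "at least 50 English words" is read here as 50 occurrences of common English words.

```python
import re

# Hypothetical stand-in list; the word list actually used for the corpus
# is not given in the description. "on" is deliberately left out because
# it is also a very frequent Finnish word.
COMMON_ENGLISH_WORDS = {
    "the", "of", "and", "that", "with",
    "this", "which", "were", "from", "have",
}

MIN_TOTAL_WORDS = 1000   # texts with fewer words are left out
MIN_ENGLISH_HITS = 50    # required occurrences of common English words


def tokenize(text):
    """Split text into lowercase alphabetic tokens (assumed tokenization)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keep_text(text):
    """Return True if the text passes both the length and the English test."""
    tokens = tokenize(text)
    # Assumption: a text with exactly 1000 words is kept.
    if len(tokens) < MIN_TOTAL_WORDS:
        return False
    english_hits = sum(1 for token in tokens if token in COMMON_ENGLISH_WORDS)
    return english_hits >= MIN_ENGLISH_HITS
```

A check this simple cannot tell a short English passage inside a Finnish thesis from a genuinely English text, which is consistent with the description calling the search "very simple".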

Publication year

2020

Dataset type

Authors

University of Helsinki - Author, Curator

Project

Other information

Fields of science

Linguistics

Language

English

Availability

Open

License

Creative Commons Attribution 4.0 International (CC BY 4.0)

Keywords

Subject headings

Temporal coverage


Related datasets