Sentence-aligned Finnish–Easy Finnish parallel corpus from Yle's Finnish-language news archive 2014-2020, source data

Description

This resource is available for download in Kielipankki – the Language Bank of Finland. This is a parallel corpus created from Yle news articles from 2014-2020 by aligning the standard Finnish versions with the easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the sentence level. It is based on the two parallel document-level datasets of Yle News articles available in Kielipankki (http://urn.fi/urn:nbn:fi:lb-2022111625 and http://urn.fi/urn:nbn:fi:lb-2024011701). The dataset spans the period from September 2014 to December 2020.

The dataset consists of the following parts:

1) Sentence alignments: parallel documents from regular and Easy Finnish Yle news articles, aligned sentence by sentence. Only the "positive" documents were taken from the 2019-2020 dataset (http://urn.fi/urn:nbn:fi:lb-2022111625). All but 50 documents were aligned automatically with Vecalign (https://github.com/thompsonb/vecalign) using LASER embeddings (https://github.com/facebookresearch/LASER). Each document has the following columns:

1.1) pair_id: an id made up of three parts separated by double underscores: the id of the regular document, the id of the Easy Finnish document (written with a single underscore), and the sentence pair number.

1.2) regular_string: a sentence from the regular Finnish article.

1.3) selko_string: the corresponding sentence from the Easy Finnish article.

1.4) score: the confidence score given by Vecalign. The lower the score, the more similar the sentences. "Good" pairs are estimated to have a score of 0.65 or below; however, the score is not definitive proof that the sentences in a pair truly match in meaning. A score of zero is assigned when a sentence has no pair. The scores for all non-zero sentence pairs in manually aligned documents are set to 0.(3).

2) Golden sentence alignments: 50 documents aligned manually by a human assessor (text). Also available in the ladder format (indexes).
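The column layout described above can be illustrated with a minimal Python sketch. It is not part of the dataset or its tooling. The sample rows and document ids (docA, docB_1) are hypothetical, the comma delimiter and the integer pair number are assumptions, and only the column names and the 0.65 threshold come from the description.

```python
import csv
import io

# Threshold quoted in the description: "good" pairs score 0.65 or below.
GOOD_SCORE_MAX = 0.65


def parse_pair_id(pair_id):
    """Split a pair_id into (regular_doc_id, easy_doc_id, pair_number).

    The three parts are separated by double underscores; the Easy Finnish
    id itself contains a single underscore, so it survives the split.
    Treating the pair number as an integer is an assumption.
    """
    regular_id, easy_id, pair_number = pair_id.split("__")
    return regular_id, easy_id, int(pair_number)


def is_good_pair(row):
    """Return True for paired sentences with 0 < score <= 0.65.

    A score of zero marks a sentence without a pair, so it is excluded.
    """
    score = float(row["score"])
    return 0.0 < score <= GOOD_SCORE_MAX


def load_good_pairs(handle):
    """Read CSV rows from an open text handle and keep the good pairs."""
    reader = csv.DictReader(handle)
    return [row for row in reader if is_good_pair(row)]


# Hypothetical rows that only mimic the documented columns.
SAMPLE_CSV = """pair_id,regular_string,selko_string,score
docA__docB_1__0,Regular sentence one.,Easy sentence one.,0.41
docA__docB_1__1,Regular sentence two.,,0.0
docA__docB_1__2,Regular sentence three.,Easy sentence three.,0.82
"""

good = load_good_pairs(io.StringIO(SAMPLE_CSV))
for row in good:
    print(parse_pair_id(row["pair_id"]), row["regular_string"], "->", row["selko_string"])
```

With this filter, manually aligned pairs (score 0.(3)) are kept, because their score is non-zero and below the threshold. As the description notes, a low score is not proof that two sentences match in meaning.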
Publication year

2024

Dataset type

Authors

University of Helsinki

Anna Dmitrieva - Curator

Yleisradio Oy - Creator

Project

Other information

Fields of science

Linguistics

Language

Finnish

Availability

Restricted access

License

CLARIN ACA+NC (Academic, Non Commercial) End User License 1.0

Keywords

Subject headings

Temporal coverage

Related datasets