Turku Children's Book Corpus
Description
A corpus of Finnish children's books intended to be read by children in Finnish basic education. The corpus is named "Turku Children's Book Corpus", or TCBC for short, because it was created by the TurkuNLP research group at the University of Turku.
Version 1.1 (released 2 October 2025):
525 books in total, 175 per age group. A total of 360 fiction books, 45 non-fiction books, and 120 textbooks. Size ~20.2 million words.
Version 1.0 (released 13 June 2025):
300 books in total, 100 per age group. A total of 210 fiction books, 45 non-fiction books, and 30 textbooks. Size ~11.6 million words.
The dataset contains the following files:
- physically-scanned-imgs : Photographs of each page of each physical book
- elibrary-book-imgs : Images of each page of each e-book
- google-doc-ai-layouts : OCR output from Google Document AI's Layout processor for the images of each book
- corrected-raw-texts : Raw texts extracted from the Google Document AI layouts, manually checked to fix most OCR mistakes
- trankit-jsons : Trankit output for each raw text file, used in creating the CoNLL-U files
- conllus : Full texts of the books, morphosyntactically parsed with Universal Dependencies annotations and stored in the CoNLL-U file format
- metadata : Various metadata files containing information on the books included in the dataset
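As a rough illustration of how the files in conllus could be consumed, the sketch below reads CoNLL-U text with the Python standard library only. It relies on the standard ten-column CoNLL-U layout; the function name `read_conllu` and the sample sentence are made up for this example and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Hypothetical sample sentence, NOT taken from the corpus.
SAMPLE_LINES = [
    "# text = Kissa nukkuu.",
    "\t".join(["1", "Kissa", "kissa", "NOUN", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "nukkuu", "nukkua", "VERB", "_", "Number=Sing|Person=3", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
]

sentences = list(read_conllu(SAMPLE_LINES))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

For a real file, an open file handle (opened with `encoding="utf-8"`) can be passed in place of `SAMPLE_LINES`, since the function only iterates over lines.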
TCBC is released in versions, and each version is designed to contain an equal number of books per age group. The age groups are 7-8 (which also includes some books for younger children), 9-12, and 13+. Three genres are distinguished (novel, textbook, and non-fiction), and each genre has the same number of books in every age group. Some additional books have been partially processed but are not part of the most recent version of the corpus, so that the corpus adheres to this rule. The list of the specific books in each version can be found in the metadata files.
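The balancing rule can be checked mechanically. The sketch below is one possible reading of it: every age group has the same number of books, and every genre has the same count in each age group. The record layout (pairs of age group and genre), the genre labels, and the function name `is_balanced` are assumptions made for this example; the actual structure of the metadata files is not described here. The example data only reproduces the version 1.1 totals (360 fiction, 45 non-fiction, 120 textbooks) under the assumption that each genre is split evenly over the three age groups.

```python
from collections import Counter
from typing import Iterable, Tuple

AGE_GROUPS = ("7-8", "9-12", "13+")


def is_balanced(books: Iterable[Tuple[str, str]]) -> bool:
    """Check the balancing rule on (age_group, genre) pairs (hypothetical layout)."""
    books = list(books)
    per_age = Counter(age for age, _ in books)
    per_cell = Counter(books)  # counts per (age_group, genre) combination
    genres = {genre for _, genre in books}
    # Every age group must hold the same number of books.
    ages_equal = len({per_age[age] for age in AGE_GROUPS}) == 1
    # Every genre must have the same count in each age group.
    genres_equal = all(
        len({per_cell[(age, genre)] for age in AGE_GROUPS}) == 1
        for genre in genres
    )
    return ages_equal and genres_equal


# Version 1.1 totals from the description, assumed to be split evenly by age group.
GENRE_TOTALS = {"fiction": 360, "non-fiction": 45, "textbook": 120}
v1_1 = [
    (age, genre)
    for genre, total in GENRE_TOTALS.items()
    for age in AGE_GROUPS
    for _ in range(total // len(AGE_GROUPS))
]

books_per_age = Counter(age for age, _ in v1_1)
print(len(v1_1), dict(books_per_age), is_balanced(v1_1))
```

Under this reading, 120 fiction books, 15 non-fiction books, and 40 textbooks per age group add up to the stated 175 books per age group and 525 books in total.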
Publication year
2025
Dataset type
Authors
Project
Other information
Fields of science
Computer and information sciences
Language
Finnish
Availability
Restricted access