Project Resources
Data
The resources used in the Mormor Karl project (‘Grandma Karl’ in English) consist of several datasets, some of which are still under development. Because most of them contain personal information, they are not openly available. After thorough legal and ethical consideration, we will create open datasets that can be used for benchmarking.
Primary data – SweLL corpus
The SweLL (Swedish Language Learner) corpus contains ≈1000 digitized essays written by second language learners of Swedish at different levels of proficiency. The essays come with rich metadata about the writers, text types and topics, and contain many instances of personal information across a variety of topical domains. All personal mentions have been manually labeled by type, which makes it possible to work on the development of automatic pseudonymization tools.
- SweLL-gold corpus, 502 essays: see corpus metadata and article
- SweLL-pilot corpus, 502 essays, consists of three subcorpora, as described in the article:
  - SpIn: see SpIn metadata
  - SW1203: see SW1203 metadata
  - TISUS: see TISUS metadata
Data from other domains (in Swedish)
- Medical data
- News data
- Social media data
- Working stories
Tools
We start from the SVALA tool, in which a rule-based automatic pseudonymizer service is available from the menu for testing; see an article on SVALA.
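To give a rough idea of what rule-based pseudonymization means in general, here is a minimal sketch in Python. It is not the SVALA implementation: the name list, the regular expressions, the replacement values and the function name `pseudonymize` are all assumptions made purely for illustration.

```python
import re

# Hypothetical toy resources; a real system would use much larger
# name lists and many more rules.
FIRST_NAMES = {"Anna", "Karl", "Maria"}
NAME_REPLACEMENT = "Kim"

# Simplified patterns (assumptions, not taken from SVALA).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b0\d{1,3}[- ]?\d{5,8}\b")
CAPITALIZED_RE = re.compile(r"\b[A-ZÅÄÖ][a-zåäö]+\b")


def pseudonymize(text: str) -> str:
    """Replace e-mail addresses, phone numbers and listed first names."""
    text = EMAIL_RE.sub("email@example.com", text)
    text = PHONE_RE.sub("000-0000000", text)

    def replace_name(match: re.Match) -> str:
        word = match.group(0)
        # Only capitalized words found in the name list are replaced.
        return NAME_REPLACEMENT if word in FIRST_NAMES else word

    return CAPITALIZED_RE.sub(replace_name, text)
```

Calling `pseudonymize("Anna bor i Lund. Mejla anna@mail.se eller ring 070-1234567.")` would replace the name, the e-mail address and the phone number, while leaving "Lund" untouched because it is not in the toy name list. A realistic rule set would also need to cover places, institutions, dates and other types of personal information.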
As new tools and algorithms are developed in the project, they will appear here.