Project Resources
Data
The resources used in the Mormor Karl project (‘Grandma Karl’ in English) consist of several datasets, some of which are still under development. Because most of them contain personal information, most are not openly available. After thorough legal and ethical consideration, we will create open datasets that can be used for benchmarking.
Primary data – SweLL corpus
The SweLL (Swedish Language Learner) corpus is a collection of ≈1000 digitized essays written by second language learners of Swedish at different levels of proficiency. The essays come with rich metadata about the writers, text types and topics, and contain many instances of personal information across a variety of topical domains. All mentions of personal information have been manually labeled by type, which makes it possible to develop automatic pseudonymization tools.
- SweLL-gold corpus, 502 essays: see corpus metadata and article
- SweLL-pilot corpus, 502 essays, consists of three subcorpora, as described in the article:
  - SpIn: see SpIn metadata
  - SW1203: see SW1203 metadata
  - TISUS: see TISUS metadata
Texts with fictive personal information
We are collecting texts (legal texts and personal stories) written by real people about fictive characters.
Data from other domains (in Swedish)
- Medical data
- News data
- Social media data
- Working stories
Tools
We start from the SVALA tool, where a rule-based automatic pseudonymizer service is available in the menu for testing; see an article on SVALA.
We have developed a SPARV plugin for personal information detection and labeling.
The models used in that plugin are available on HuggingFace.
We have collaborated with InfraVis to develop a visualization tool for comparing the personal information detection and labeling output of different systems.
The three previously mentioned tools and models are described in this article.
As more tools and algorithms are developed in the project, they will appear here.