Data

The resources used in the Mormor Karl project (‘Grandma Karl’ in English) consist of several datasets, some of which are still under development. Since most of them contain personal information, most are not openly available. After thorough legal and ethical consideration, we will create open datasets that can be used for benchmarking.


Primary data – SweLL corpus

The SweLL (Swedish Language Learner) corpus contains ≈1000 digitized essays written by second language learners of Swedish at different levels of proficiency. The essays have rich metadata about the writers, text types and topics, and contain many instances of personal information across a variety of topical domains. All mentions of personal information have been manually labeled by type, which makes it possible to develop automatic pseudonymization tools.
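To illustrate why type labels on personal mentions are useful for pseudonymization, here is a minimal Python sketch. It is not the project's pseudonymizer and does not reflect the SweLL annotation format: the `Mention` class, the label names (`firstname`, `city`), and the replacement lists are all hypothetical, and mentions are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A labeled span of personal information (hypothetical format)."""
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # hypothetical type label, e.g. "firstname" or "city"


# Hypothetical replacement pools per label; not taken from the project.
REPLACEMENTS = {
    "firstname": ["Alex", "Kim"],
    "city": ["Storstad", "Lillby"],
}


def pseudonymize(text, mentions):
    """Replace each labeled span with a pseudonym of the same type.

    The same original string with the same label always receives the
    same pseudonym, so references stay consistent within one text.
    Assumes that mentions do not overlap.
    """
    counters = {}   # label -> number of pseudonyms handed out so far
    mapping = {}    # (label, original string) -> chosen pseudonym
    pieces = []
    cursor = 0
    for m in sorted(mentions, key=lambda m: m.start):
        original = text[m.start:m.end]
        key = (m.label, original)
        if key not in mapping:
            pool = REPLACEMENTS.get(m.label)
            i = counters.get(m.label, 0)
            counters[m.label] = i + 1
            # Fall back to a bracketed placeholder for unknown labels.
            mapping[key] = pool[i % len(pool)] if pool else f"[{m.label}]"
        pieces.append(text[cursor:m.start])
        pieces.append(mapping[key])
        cursor = m.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Because the replacement is chosen by label, a first name is replaced by another first name and a city by another city, which keeps the text readable for further linguistic analysis.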


Texts with fictive personal information

We are collecting texts (legal texts and personal stories) written by real people about fictive characters.

Data from other domains (in Swedish)

  • Medical data
  • News data
  • Social media data
  • Working stories

Tools

We start from the SVALA tool, where a rule-based automatic pseudonymizer service is available in the menu for testing; see an article on SVALA.

We have developed a SPARV plugin for personal information detection and labeling.

The models provided in that plugin are also available on HuggingFace.

We have collaborated with InfraVis to develop a visualization tool for comparing personal information detection and labeling from different systems.

The three previously mentioned tools and models are described in this article.

As more tools and algorithms are developed in the project, they will appear here.