Data

The resources used in the Mormor Karl project (‘Grandma Karl’ in English) consist of several datasets, some of which are still under development. Because most of them contain personal information, most of the datasets are not openly available. After thorough legal and ethical consideration, we will create open datasets that can be used for benchmarking.


Primary data – SweLL corpus

The SweLL (Swedish Language Learner) corpus contains a collection of ≈1000 digitized essays written by second language learners of Swedish at different levels of proficiency. SweLL essays have rich metadata about the writers, text types and topics, and contain multiple examples of personal information used in a variety of topical domains. All mentions of personal information have been manually labeled by type to make it possible to develop automatic pseudonymization tools.


Data from other domains (in Swedish)

  • Medical data
  • News data
  • Social media data
  • Working stories

Tools

We start from the SVALA tool, where a rule-based automatic pseudonymizer service is available in the menu for testing; see an article on SVALA.
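To give a rough idea of what a rule-based pseudonymizer does, here is a minimal sketch in Python. It is not the SVALA implementation: the word list (GAZETTEER), the replacement lists (SURROGATES), the label names and the function pseudonymize are all hypothetical and chosen only for illustration. The sketch looks up known names in a small list and swaps each one for a replacement of the same type, using the same replacement every time the same name recurs in a text.

```python
import re

# Hypothetical replacement lists per label; not taken from SVALA or SweLL.
SURROGATES = {
    "firstname": ["Alex", "Kim", "Robin"],
    "city": ["Storstad", "Lillstad"],
}

# Hypothetical toy gazetteer mapping known surface forms to a label.
GAZETTEER = {
    "Maria": "firstname",
    "Karl": "firstname",
    "Göteborg": "city",
}

# One pattern that matches any gazetteer entry as a whole word.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b")


def pseudonymize(text: str) -> str:
    """Replace gazetteer hits with a surrogate of the same type.

    Within one text, the same original always maps to the same
    surrogate, so repeated mentions stay consistent.
    """
    mapping = {}   # original word -> chosen surrogate
    counters = {}  # label -> number of surrogates handed out so far

    def replace(match):
        word = match.group(0)
        label = GAZETTEER[word]
        if word not in mapping:
            index = counters.get(label, 0)
            options = SURROGATES[label]
            # Cycle through the list if there are more names than surrogates.
            mapping[word] = options[index % len(options)]
            counters[label] = index + 1
        return mapping[word]

    return PATTERN.sub(replace, text)


print(pseudonymize("Maria bor i Göteborg. Maria gillar Karl."))
```

A real rule-based service would need far larger lists and rules for other types of personal information, such as dates, ages and addresses; the sketch only shows the basic find-and-replace idea.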

As new tools and algorithms are developed in the project, they will appear here.