Computational Approaches to Language Data Pseudonymization

CALD-pseudo workshop at EACL 2024

A cross-disciplinary forum for advancing privacy protection of unstructured text data and data openness through pseudonymization.

Proceedings

EACL workshop proceedings are out: https://aclanthology.org/volumes/2024.caldpseudo-1/

Video and photos

Watch live recordings from the workshop.

Here are slides and pre-recorded videos provided by Underline.

Dates and venue


Venue: Hotel Corinthia, St. George’s Bay, Malta
Date: March 21, 2024
Submission deadline: December 18, 2023, Anywhere on Earth (AoE)
Submission website: https://softconf.com/eacl2024/CALD-pseudo-2024/
Registration: https://2024.eacl.org/registration

Program

Location: Hotel Corinthia, Room Gardjola 3


09:00-09:10	Opening Remarks - Elena Volodina
09:10-10:00	Invited talk 1. Chair: Elena Volodina. Anders Søgaard: NLP is Dead - Now What?
	Session 1. Chair: Maria Irena Szawerna
10:00-10:15	Handling Name Errors of a BERT-Based De-Identification System: Insights from Stratified Sampling and Markov-based Pseudonymization – Dalton Simancek and VG Vinod Vydiswaran
10:15-10:30	Automatic Detection and Labelling of Personal Data in Case Reports from the ECHR in Spanish: Evaluation of Two Different Annotation Approaches – Maria Sierro, Begoña Altuna and Itziar Gonzalez-Dios
10:30-11:00	Coffee Break
	Session 2. Chair: Hercules Dalianis
11:00-11:20	PSILENCE: A Pseudonymization Tool for International Law – Luis Adrián Cabrera-Diego and Akshita Gheewala
11:20-11:40	Extending off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case – Mario Mina, Carlos Rodríguez, Aitor Gonzalez-Agirre and Marta Villegas
11:40-12:00	Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts – Shayna Gardiner, Tania Habib, Kevin Humphreys, Frederic Mailhot, Anne Paling, Preston Thomas and Nathan Zhang
12:00-13:00	Lunch Break
13:00-13:50	Invited talk 2. Chair: Elena Volodina. Ildikó Pilán: Pseudonymisation and related techniques: a quest for determining what personal information to rewrite and how
13:50-14:00	Short Break
	Session 3. Chair: Ricardo Muñoz Sánchez
14:00-14:15	Assessing authenticity and anonymity of synthetic user-generated content in the medical domain – Tomohiro Nishiyama, Lisa Raithel, Roland Roller, Pierre Zweigenbaum and Eiji Aramaki
14:15-14:30	Deidentifying a Norwegian clinical corpus - An effort to create a privacy-preserving Norwegian large clinical language model – Phuong Ngo, Miguel Tejedor, Therese Olsen Svenning, Taridzo Chomutare, Andrius Budrionis and Hercules Dalianis
14:30-14:45	When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification – Thomas Vakili, Tyr Hullmann, Aron Henriksson and Hercules Dalianis
14:45-14:50	Short Break
	Session 4. Chair: Ildikó Pilán
14:50-15:10	Detecting Personal Identifiable Information in Swedish Learner Essays – Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann and Elena Volodina
15:10-15:30	Did the Names I Used within My Essay Affect My Score? Diagnosing Name Biases in Automated Essay Scoring – Ricardo Muñoz Sánchez, Simon Dobnik, Maria Szawerna, Therese Lindström Tiedemann and Elena Volodina
15:30-16:00	Coffee Break
16:00-17:00	Session 5. Panel discussion. Moderator: Elena Volodina. Panelists: Ildikó Pilán and Thomas Vakili
Evening	Joint post-workshop dinner for those who want to join. Place: Restaurant Gozitan

Invited speakers

Anders Søgaard, University of Copenhagen, Denmark

NLP is Dead - Now What?
For decades, the NLP community was on a mission to get computers to understand language. To the extent the goal of the mission was defined, our mission is complete. Now what? There’s still a ton of open problems, of course. Pseudonymization is one of them. Others include bias mitigation, performance parity, or getting things to run on-device. None of these problems are NLP problems, but they are all inter-dependent. Does their locus leave room for a raison d’être for the remnants of NLP?

BIO
Anders Søgaard is Full Professor in Natural Language Processing and Machine Learning at the Department of Computer Science, University of Copenhagen. He is also affiliated with the Pioneer Centre for Artificial Intelligence, the Department of Philosophy, and the Center for Social Data Science. He was previously at the University of Potsdam, Amazon and Google Research. He has won eight best paper awards and several prestigious grants.

Ildikó Pilán, the Norwegian Computing Center, Norway

Pseudonymisation and related techniques: a quest for determining what personal information to rewrite and how
In this talk, we will walk through the different steps involved in the process of concealing personal information. We will start by looking at methods for determining which pieces of personal information to detect and how. We will then discuss strategies for rewriting these and, finally, we will look at approaches proposed for evaluating the resulting redacted text in terms of privacy protection and utility preservation. We will discuss previous work inspired by Named Entity Recognition as well as more recent approaches employing Large Language Models. We will also explore the differences between pseudonymization and anonymization, highlighting the remaining challenges in performing these automatically.

BIO
Ildikó Pilán is a Senior Research Scientist at the Norwegian Computing Center, Norway. Her most impactful research comes from linguistic complexity studies within the domain of language learning, and recently from the area of anonymization and pseudonymization where she has been actively working on preparing datasets, benchmarks and models for automatic anonymization and pseudonymization of Norwegian and English data in the project Cleanup (e.g. Lison et al., 2021; Pilán et al., 2022). Her fields of expertise include Natural Language Processing, Machine Learning, privacy protection, data privacy, medical text processing and Intelligent Computer-Assisted Language Learning.

Call for papers


We invite submissions to the first edition of the CALD-pseudo workshop on Computational Approaches to Language Data Pseudonymization, to be held at EACL 2024 on March 21, 2024.

Description

Accessibility of research data is critical for advances in many research fields, but textual data often cannot be shared because of the personal and sensitive information it contains, e.g. names, political opinions and medical data. The General Data Protection Regulation, GDPR (EU Commission, 2016), suggests pseudonymization as a solution to secure open access to research data, but we need to learn more about pseudonymization as an approach before adopting it for manipulation of research data (Volodina et al., 2023). The main challenge is how to pseudonymize data effectively so that individuals cannot be identified, while at the same time keeping the data usable for the research (e.g. in computational linguistics, linguistics) and natural language processing tasks for which it was collected.

Topics of interest
This workshop invites a broad community of researchers in all concerned cross-disciplinary fields to jointly discuss challenges within pseudonymization, such as
* automatic approaches to detection and labelling of personal information in unstructured language data, including events and other context-dependent cues revealing a person;
* developing context-sensitive algorithms for replacement of personal information in unstructured data;
* studies into the effects of pseudonymization on unstructured data, e.g. applicability of pseudonymised data for the intended research questions, readability of pseudonymised data or addition of unwelcome biases through pseudonymization;
* effectiveness of pseudonymization as a way of protecting writer identity;
* reidentification studies, e.g. adversarial learning techniques that attempt to breach the privacy protections of pseudonymized data;
* constructing datasets for automatic pseudonymization, including methodological and ethical aspects of those;
* approaches to the evaluation of automatic pseudonymization both in concealing the private information and preserving the semantics of the non-personal data;
* pseudonymization tools and software: evaluating the available tools and software for pseudonymization in different languages, and their ease of use, scalability, and performance;
* and numerous other open questions.

Submission information


Authors are invited to submit original and unpublished research papers by December 18, 2023 in the following categories:

* Full papers (up to 8 pages) for substantial contributions
* Short or demo papers (up to 4 pages) for ongoing or preliminary work

All submissions must be in PDF format, follow the EACL 2024 guidelines described in the ARR CfP https://aclrollingreview.org/cfp, and use the official ACL style templates available here: https://github.com/acl-org/acl-style-files

Direct submission deadline: December 18, 2023 at https://softconf.com/eacl2024/CALD-pseudo-2024/
Deadline for registration of ARR reviewed papers: January 19, 2024.
Use ARR commitment link: https://openreview.net/group?id=eacl.org/EACL/2024/Workshop/CALD_pseudo_ARR_Commitment

We also invite authors of papers on the topics of the workshop accepted to Findings to reach out to the organizing committee of CALD-pseudo to present them at the workshop.
Every paper will be reviewed by at least 2 members of the program committee. As reviewing will be blind, please ensure that papers are anonymous. Self-references that reveal the author’s identity, e.g., “We previously showed (Smith, 1991) …”, should be avoided. Instead, use citations such as “Smith previously showed (Smith, 1991) …”. Submissions will be judged on appropriateness, clarity, originality/innovativeness, correctness/soundness, meaningful comparison, significance and impact of ideas or results.

Final camera-ready versions of accepted papers will be given an additional page to address reviewer comments.

Important dates


* December 18, 2023 (AoE): Workshop paper deadline
* January 19, 2024: Re-submission of pre-reviewed ARR papers
* January 22, 2024: Notification of acceptance
* January 30, 2024: Camera-ready papers due
* March 21, 2024: Workshop date

Program committee


* Ahrenberg Lars, Linköping University, Sweden
* Ainiala Terhi, University of Helsinki, Finland
* Aldrin Emilia, Halmstad University, Sweden
* Arhar Holdt Špela, University of Ljubljana, Slovenia
* Caines Andres, University of Cambridge, United Kingdom
* Dalianis Hercules, Stockholm University, Sweden
* Dannélls Dana, University of Gothenburg, Sweden
* Dobnik Simon, University of Gothenburg, Sweden
* Grouin Cyril, LIMSI, CNRS, Université Paris-Saclay, France
* Hämäläinen Lasse, University of Helsinki, Finland
* Henriksson Aron, Stockholm University, Sweden
* Kokkinakis Dimitrios, University of Gothenburg, Sweden
* Lassus Jannika, University of Helsinki, Finland
* Lindström Tiedemann Therese, University of Helsinki, Finland
* Lison Pierre, Norwegian Computing Center, Norway
* Lindén Krister, University of Helsinki, Finland
* Ljunglöf Peter, Chalmers University of Technology / University of Gothenburg, Sweden
* Marko Karoline, University of Graz, Austria
* Megyesi Beáta, Stockholm University, Sweden
* Nelson Boel, Aarhus University, Denmark
* Nordman Lieselott, University of Helsinki, Finland
* Ochs Sebastian, Technical University of Darmstadt, Germany
* Pilán Ildikó, Norwegian Computing Center, Norway
* Raheja Vipul, Grammarly, USA
* Sánchez Ruenes David, University of Rovira i Virgili, Spain
* Scheffler Tatjana, Ruhr University Bochum, Germany
* Torra Vicenc, Umeå University, Sweden
* Vakili Thomas, Stockholm University, Sweden
* Vydiswaran VG Vinod, University of Michigan, USA
* Volodina Elena, University of Gothenburg, Sweden
* Vu Xuan-Son, Umeå University, Sweden

Organizers


General chair
* Elena Volodina, University of Gothenburg, Sweden
General co-chairs
* Simon Dobnik, University of Gothenburg, Sweden
* Therese Lindström Tiedemann, University of Helsinki, Finland
* Xuan-Son Vu, Umeå University, Sweden
Organizing co-chairs
* David Alfter, University of Gothenburg, Sweden
* Ricardo Muñoz Sánchez, University of Gothenburg, Sweden
* Maria Irena Szawerna, University of Gothenburg, Sweden
Contact: mormor.karl@svenska.gu.se

Anti-harassment policy
CALD-pseudo workshop adheres to the ACL’s anti-harassment policy https://www.aclweb.org/adminwiki/index.php?title=Anti-Harassment_Policy.

Acknowledgments
The workshop is organized within the research environment project Grandma Karl is 27 years old and is supported by a research grant on pseudonymization from the Swedish Research Council.
