Week-1: My Research

Mitigating Bias in Native Language Identification Using Large Language Models and Open-Set Recognition

Introduction

Our writing carries traces of our linguistic background – even when we write in a second language. My research explores this connection through the task of Native Language Identification (NLI), which aims to predict an author’s first language (L1) from their writing in a second language (L2).

While early studies relied on formal learner essays such as the TOEFL11 corpus (Tetreault et al., 2013), these datasets capture only controlled, classroom-like writing. In contrast, today’s online communication is full of informal expressions, emojis, and cultural slang. My work therefore focuses on user-generated content (UGC) – specifically the Reddit-L2 dataset (Rabinovich et al., 2018) – to study NLI “in the wild,” where text is messy but authentically human.