Choosing the Right Research Method for My AI-Based NLI Study
Introduction
Research method selection constitutes a pivotal decision in academic inquiry: it structures how evidence is gathered, the standards by which results are evaluated, and the extent to which conclusions can be generalized and replicated. In the context of my MSc project, I investigate Native Language Identification (NLI) within user-generated English text by developing a bias-aware, generalizable framework that integrates Large Language Model (LLM) embeddings, topic debiasing, and open-set recognition. This post sets out the justification for adopting a quantitative experimental–comparative design and explains how this approach enables systematic assessment and evaluation of project outcomes, including model accuracy, fairness, and robustness.
Project Summary
This study designs and evaluates a hybrid LLM-based NLI system on Reddit-L2, targeting three evaluation axes: (i) performance on known languages (accuracy; macro-F1), (ii) fairness through mitigation of topic leakage (e.g., Named-Entity Masking and linear concept erasure), and (iii) generalization to unseen native languages via open-set recognition. The contribution lies in a controlled, comparative framework that produces quantitative, reproducible evidence across in-domain and cross-topic regimes, thereby aligning method choice with the project’s evaluation objectives.
Why the Study Is Quantitative (and Experimental–Comparative)
The core outputs I must evaluate are numeric: accuracy, macro-F1, bias-leakage score, and false-positive rate on unseen L1s. My proposal defines explicit hypotheses and compares multiple model variants under controlled conditions (same data splits; fixed preprocessing). This is a textbook match for a quantitative, experimental–comparative design.
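For concreteness, here is a minimal sketch of how the accuracy, macro-F1, and open-set false-positive-rate figures might be computed with scikit-learn. The function names and the exact definition of the open-set FPR are working assumptions of mine rather than fixed specifications from the proposal, and the bias-leakage score is omitted because it depends on the specific probing setup.

```python
# Minimal metric sketch, assuming predictions are already collected as label
# lists; names and the FPR definition are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

def known_language_metrics(y_true, y_pred):
    """Accuracy and macro-F1 over the known-L1 test set."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

def open_set_false_positive_rate(novelty_scores, is_unseen, threshold):
    """Fraction of unseen-L1 samples wrongly accepted as a known language.

    novelty_scores: higher = more novel (e.g. a distance in embedding space)
    is_unseen:      True where the true L1 is outside the training label set
    """
    accepted_as_known = [s < threshold for s in novelty_scores]
    total_unseen = sum(is_unseen)
    wrongly_accepted = sum(a and u for a, u in zip(accepted_as_known, is_unseen))
    return wrongly_accepted / total_unseen if total_unseen else 0.0
```

Keeping the metric code in one audited module is what makes the later cross-model comparisons directly commensurable.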
Why Quantitative Is the Most Suitable Method
To meet the requirement that the research project follow a quantitative method for assessing and evaluating outcomes, the chosen method must yield objective, reproducible evidence and allow controlled comparisons. Below is the extended logic that underpins my choice.
1) Alignment with a recognized framework
Urban & van Eeden-Moorefield (2018) characterise quantitative studies by objective measurement, hypothesis testing, low researcher–participant interaction, and generalization. My project maps to these traits one-to-one.
| Criterion | Quantitative expectation | My study |
|---|---|---|
| Epistemology | Positivist; one discoverable truth | Seeks measurable improvements in fairness/accuracy |
| Core logic | Deductive; hypothesis-driven | Predefined hypotheses on debiasing and open-set gains |
| Data form | Numeric variables & scales | Accuracy, macro-F1, bias-leakage, FPR |
| Design control | Standardized procedures | Fixed splits, seeds, preprocessing, evaluation scripts |
| Bias handling | Method controls for bias | NER masking; concept erasure; identical pipelines per condition |
| Generalization | External validity targeted | Cross-topic regime; TOEFL11 transfer (if licensed) |
| Analysis | Statistical comparison | Mean±SD over seeds; significance tests; ablations |
2) Direct support for evaluation
- Assessing outcomes: KPIs are explicit and comparable across models.
- Fairness evidence: Bias-leakage is a quantitative signal, enabling objective auditing.
- Reproducibility: Version-controlled code/config ensures reviewers can rerun analyses.
3) Why not a qualitative or mixed-methods approach
- Qualitative: Suited to human experiences/interpretations; my study is computational with no participants.
- Mixed methods: Valuable later (e.g., user interviews on fairness perceptions), but they would add scope and complexity without improving metric validity for this project's evaluation plan.
Conditions/Comparators I Will Evaluate
To evaluate the research hypotheses, the following models will be compared under identical settings:
- Traditional Baseline: character/word/POS n-grams with Logistic Regression or SVM.
- Zero-shot LLM: Prompt-based inference using GPT-like architectures.
- Hybrid (LLM Embeddings + Debias): BERT/RoBERTa embeddings combined with Named-Entity Masking and Linear Concept Erasure.
- Hybrid + Open-set: The hybrid model extended with thresholding and embedding-distance novelty detection.
Each configuration will employ consistent data partitions, preprocessing, and evaluation metrics to ensure fair comparison; a minimal sketch of the traditional baseline appears below.
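As a concrete illustration of the comparator that anchors the study, here is a minimal sketch of the traditional n-gram baseline, assuming raw texts and labels are already loaded as Python lists. It covers character n-grams only; word- and POS-level features would be added via a feature union, and the hyperparameters shown are placeholders rather than tuned values.

```python
# Minimal sketch of the traditional n-gram baseline; hyperparameters are
# placeholders, not tuned values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    # Character n-grams within word boundaries, a standard NLI baseline feature
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50_000),
    LogisticRegression(max_iter=1000),
)

# baseline.fit(train_texts, train_labels)
# preds = baseline.predict(test_texts)
```

Because the same fitted pipeline object is scored on the same held-out splits as the hybrid variants, any performance gap can be attributed to the modelling choice rather than to the evaluation setup.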
Data I Will Gather
The study primarily utilizes the Reddit-L2 corpus, a large-scale dataset of English texts written by non-native speakers. The corpus provides an opportunity to investigate topic bias and language transfer effects at scale. A secondary dataset, TOEFL11, may be used for external validation. All text data undergo anonymization, normalization, and entity masking to maintain ethical and methodological consistency.
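To make the entity-masking step concrete, below is a minimal sketch using spaCy's small English model; the bracketed placeholder format (e.g. [GPE]) is my own assumption rather than a convention fixed by the proposal.

```python
# Minimal Named-Entity Masking sketch, assuming the spaCy en_core_web_sm
# model is installed; the placeholder format is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text: str) -> str:
    """Replace named entities with their entity type, e.g. 'Paris' -> '[GPE]'."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        out.append(text[last:ent.start_char])  # keep text before the entity
        out.append(f"[{ent.label_}]")          # substitute the entity type
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

# mask_entities("I moved from Warsaw to London in 2019.")
# would typically return "I moved from [GPE] to [GPE] in [DATE]."
```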
Tools & Techniques
The project leverages modern NLP and ML toolkits – PyTorch, HuggingFace Transformers, and scikit-learn – for training and evaluation. Debiasing employs Named-Entity Masking and Linear Concept Erasure, while open-set recognition uses probabilistic thresholding and Mahalanobis distance in embedding space. Computation will be GPU-accelerated, and all runs parameterized for reproducibility.
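The two techniques named above can be illustrated compactly. The snippet below shows (a) a crude single-pass approximation of linear concept erasure, projecting embeddings onto the nullspace of a linear topic probe, and (b) Mahalanobis novelty scoring against the training embeddings. Both are simplified sketches under my own assumptions, not the exact algorithms from the cited papers: the published erasure methods add whitening or iteration, and open-set scoring is usually class-conditional rather than a single global Gaussian.

```python
# Simplified sketches of topic-direction removal and Mahalanobis novelty
# scoring; illustrative approximations, not the cited algorithms.
import numpy as np
from sklearn.linear_model import LogisticRegression

def erase_linear_concept(X, topic_labels):
    """Project embeddings onto the nullspace of a linear topic probe."""
    probe = LogisticRegression(max_iter=1000).fit(X, topic_labels)
    W = probe.coef_                      # (n_topics, dim) directions to remove
    Q, _ = np.linalg.qr(W.T)             # orthonormal basis of those directions
    P = np.eye(X.shape[1]) - Q @ Q.T     # projection removing that subspace
    return X @ P

def mahalanobis_novelty(X_train, x):
    """Mahalanobis distance of x to the training embedding distribution;
    larger values suggest an unseen native language."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])
    inv = np.linalg.inv(cov)
    d = x - mu
    return float(np.sqrt(d @ inv @ d))
```

A useful sanity check is that a topic classifier retrained on the erased embeddings should fall towards chance accuracy while the L1 classifier's macro-F1 degrades only slightly.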
Analysis & Evaluation Plan
Evaluation follows a two-regime protocol: in-domain (same-topic) and cross-topic (out-of-domain). Performance is measured by Accuracy and Macro-F1, fairness is assessed via the Bias-Leakage Score, and open-set capability via the False Positive Rate. Reliability is supported by repeated trials under different random seeds, reporting the mean and standard deviation and applying statistical tests to confirm significance. Ablation studies quantify the contribution of each debiasing component.
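The seed-repetition and significance-testing step might look like the sketch below; run_experiment, the configuration objects, and the choice of a paired t-test over the same seeds are illustrative assumptions, not commitments of the proposal.

```python
# Minimal sketch of the seed-repetition protocol; run_experiment is a
# hypothetical wrapper around training + evaluation for one configuration.
import numpy as np
from scipy.stats import ttest_rel

SEEDS = [0, 1, 2, 3, 4]

def evaluate_over_seeds(run_experiment, config):
    """Run one configuration across all seeds and summarize macro-F1."""
    scores = np.array([run_experiment(config, seed=s)["macro_f1"] for s in SEEDS])
    return scores.mean(), scores.std(ddof=1), scores

# Paired comparison of two configurations over the same seeds:
# _, _, hybrid_scores = evaluate_over_seeds(run_experiment, hybrid_cfg)
# _, _, base_scores   = evaluate_over_seeds(run_experiment, baseline_cfg)
# t_stat, p_value = ttest_rel(hybrid_scores, base_scores)
```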
Validity, Reliability, and Replicability
- Internal validity: Controlled splits and identical preprocessing across conditions.
- External validity: Cross-topic evaluation and (if licensed) TOEFL11 transfer tests.
- Reliability: Seed control (see the helper sketched after this list); repeated runs; consistent scoring pipelines.
- Replicability: Version-controlled code, configs, and evaluation scripts.
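For the seed-control point above, a typical helper (assuming PyTorch and NumPy, as listed under Tools & Techniques) would look like the sketch below; the cuDNN determinism flags trade speed for exact repeatability and are optional.

```python
# Minimal seed-fixing helper, assuming PyTorch and NumPy are in use.
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # no-op when CUDA is unavailable
    torch.backends.cudnn.deterministic = True  # exact repeatability, slower
    torch.backends.cudnn.benchmark = False
```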
Key Takeaway
The quantitative experimental–comparative method provides the structure and statistical integrity necessary to evaluate AI model outcomes objectively. It supports systematic measurement, hypothesis testing, and replicability – ensuring that performance, fairness, and generalization results are interpretable and academically defensible.
References
- Ahmad, F. (2025) Final Research Project Proposal Report (PROM04 – Assignment-2), University of Sunderland.
- Blanchard, D., Tetreault, J., Higgins, D., Cahill, A. and Chodorow, M. (2013) ‘TOEFL11: A corpus of non-native English’, ETS Research Report Series, 2013(2), pp. 1–15.
- Koppel, M., Schler, J. and Zigdon, K. (2005) ‘Determining an author’s native language by mining a text for errors’, Proceedings of KDD, pp. 624–628.
- Kumar, S., Wintner, S., Smith, N.A. and Tsvetkov, Y. (2019) ‘Topics to avoid: Demoting latent confounds in text classification’, EMNLP-IJCNLP, pp. 3931–3941.
- Rabinovich, E., Ordan, N. and Wintner, S. (2018) ‘Native language cognate effects in L2 English lexical choice’, TACL, 6, pp. 329–342.
- Urban, J.B. and van Eeden-Moorefield, B.M. (2018) Designing and Proposing Your Research Project. Washington, DC: APA.
- Yaghoobzadeh, Y., Hertel, J. and Tsvetkov, Y. (2024) ‘The medium is not the message: Deconfounding text embeddings via linear concept erasure’, arXiv:2403.05025.
- Zhan, X., Liu, L., Wang, J. and Pan, S.J. (2021) ‘A joint learning framework for open-set domain adaptation’, TPAMI, 44(9), pp. 4626–4640.
- Zhang, W. and Salle, A. (2023) ‘Native Language Identification with Large Language Models’, arXiv:2312.07819.
