AI Text Detector | Jasper Meijerink

Fine-tuned DistilBERT (66M parameters) on the GPT-wiki-intro dataset (~150k pairs of human-written and AI-generated texts) for binary text-origin classification, reaching 98% validation accuracy after 3 epochs.
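
As an illustration of how a binary classifier's two output logits could become a label and a confidence score, here is a minimal sketch in plain Python. The label order in `LABELS` and the example logits are assumptions made for illustration; the project's actual class indices are not stated here.

```python
import math

# Assumed label order for illustration only.
LABELS = ("human", "ai")


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (label, confidence) for a pair of class logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits standing in for a model's output on one text.
label, confidence = classify([-1.2, 2.3])
print(f"{label}: {confidence:.1%}")
```

The confidence is simply the larger of the two probabilities, so it always lies between 50% and 100%.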

Key features:

Full training pipeline: AdamW optimizer with linear warmup over the first 10% of training steps, gradient clipping, mixed-precision training (float16), and best-model checkpointing by validation accuracy
Apple Silicon MPS GPU acceleration support
Gradio web interface with live confidence scores, probability bar chart, and example texts
CLI prediction mode and importable Python API (from predict import TextDetectorPredictor)
Documented limitations: specific to Wikipedia-style text, unreliable on short texts (under 100 characters), trained on GPT-2-era output
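
The linear warmup mentioned in the training pipeline can be sketched as a learning-rate multiplier. This is a minimal sketch, assuming the rate decays linearly to zero after warmup (the shape produced by Hugging Face's `get_linear_schedule_with_warmup`); the feature list itself only confirms the warmup part. The function name `lr_multiplier` is hypothetical.

```python
def lr_multiplier(step, total_steps, warmup_fraction=0.1):
    """Factor applied to the base learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Assumption: linear decay from 1 down to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


# Example: 1,000 training steps, so warmup covers the first 100.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_multiplier(step, total_steps=1000))
```

Starting from a small learning rate gives AdamW's moment estimates time to settle before full-size updates are applied to the pretrained weights.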

Developed as part of AI literacy research at VU Amsterdam, with a focus on responsible deployment.
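
One way the documented short-text limitation could be surfaced at prediction time is a thin guard around the classifier. This is a sketch under stated assumptions: `guarded_predict`, `Verdict`, and the stand-in `fake_predict` are hypothetical names, not part of the project's API; only the 100-character threshold and the Wikipedia-style caveat come from the limitations listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MIN_CHARS = 100  # threshold taken from the documented limitation


@dataclass
class Verdict:
    label: Optional[str]
    confidence: Optional[float]
    note: str


def guarded_predict(
    text: str, predict_fn: Callable[[str], Tuple[str, float]]
) -> Verdict:
    """Withhold a prediction for texts too short to classify reliably."""
    cleaned = text.strip()
    if len(cleaned) < MIN_CHARS:
        return Verdict(
            None, None, f"Text is under {MIN_CHARS} characters; prediction withheld."
        )
    label, confidence = predict_fn(cleaned)
    return Verdict(
        label,
        confidence,
        "Trained on Wikipedia-style text; treat other domains with caution.",
    )


def fake_predict(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the real model call."""
    return ("human", 0.5)


print(guarded_predict("Too short to judge.", fake_predict).note)
```

Returning an explicit note instead of a low-confidence guess keeps the caveat visible to whoever reads the result.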

Tech stack: Python, PyTorch, Hugging Face Transformers, Gradio, scikit-learn