AI Text Detection Research

A research project that evaluates several approaches to detecting AI-generated text and humanized AI text, comparing held-out test accuracy with real-world performance.

🎯 Key Results

Athena Baseline: 98.82% accuracy
Athena Improved: 98.92% accuracy
Athena User Humanized: 98.90% accuracy (specialized for Undetectable.ai)

📊 Models Tested

TF-IDF (99.17% test, failed in real-world)
Structure Detector (86.83%)
Hybrid TF-IDF+Structure (98.80%, failed in real-world)
Perplexity Single Feature (90%, failed)
Enhanced Perplexity (70.30%, failed)
Transformer (99.84% test, failed in real-world)
Athena Baseline (98.82%, SUCCESS)
Athena Improved (98.92%, SUCCESS)
Athena User Humanized (98.90%, SUCCESS)

🔍 Key Findings

Test accuracy does not equal real-world performance
Dataset quality matters more than model complexity
Humanizer detection is tool-specific, not universal
Training on Undetectable.ai samples enables detection of that specific humanizer

🚀 Installation

Prerequisites

Python 3.8+
CUDA-capable GPU (recommended for training)

Setup

Clone this repository:

git clone https://github.com/yourusername/scifair.git
cd scifair

Install dependencies:

pip install -r requirements.txt

For GPU support, follow the CUDA-specific PyTorch installation instructions at pytorch.org.

📂 Project Structure

scifair/
├── analysis/              # Analysis scripts for model behavior
├── docs/                  # Documentation and research findings
├── results/               # JSON result files from experiments
├── scripts/
│   ├── training/          # Model training scripts (11 files)
│   │   ├── athena_train*.py
│   │   ├── *_detector.py
│   │   └── retrain_detectors.py
│   ├── testing/           # Model testing scripts (10 files)
│   │   └── test_*.py
│   └── analysis/          # Script analysis tools
└── util/                  # Utility functions

📖 Usage

Testing Pre-trained Models

# Test baseline model
python scripts/testing/test_athena.py

# Test with adjusted threshold (5% instead of 50%)
python scripts/testing/test_athena_threshold.py baseline

# Test Undetectable.ai specialist
python scripts/testing/test_athena_threshold.py user
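
The adjusted-threshold runs lower the cutoff at which a text is flagged as AI from 50% to 5%. The sketch below only illustrates that idea in isolation. The `classify` helper and the sample probabilities are hypothetical and are not taken from `test_athena_threshold.py`.

```python
def classify(ai_probability: float, threshold: float = 0.5) -> str:
    """Return 'ai' when the model's AI probability meets the threshold."""
    if not 0.0 <= ai_probability <= 1.0:
        raise ValueError("ai_probability must be between 0 and 1")
    return "ai" if ai_probability >= threshold else "human"


# Hypothetical model outputs (probability that each text is AI-written).
probabilities = [0.02, 0.08, 0.40, 0.97]

default_labels = [classify(p) for p in probabilities]
strict_labels = [classify(p, threshold=0.05) for p in probabilities]

print(default_labels)
print(strict_labels)
```

A lower threshold flags more borderline texts as AI, which trades extra false positives for fewer missed detections.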

Training Your Own Models

# Train Athena baseline
python scripts/training/athena_train.py

# Train improved version
python scripts/training/athena_train_improved.py

# Train specialized humanized detector
python scripts/training/athena_train_user_humanized.py

📊 Datasets

Note: Large model files and datasets are excluded from this repository due to size constraints.

Required Datasets

You'll need to prepare your own datasets with the following structure:

Training data: CSV files with text and label columns
Label 0: Human-written text
Label 1: AI-generated text
Label 2: Humanized AI text (optional, for specialized models)

Sample Dataset Format

text,label
"Human written text example",0
"AI generated text example",1
"Humanized AI text example",2
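
Before training, it may help to check that a dataset matches this format. The following is a minimal sketch using only the standard library. The `load_rows` function and `VALID_LABELS` mapping are illustrative names, not part of this repository.

```python
import csv
import io

# Label scheme described above; the names are for readability only.
VALID_LABELS = {0: "human", 1: "ai", 2: "humanized"}


def load_rows(handle):
    """Read (text, label) pairs from a CSV with 'text' and 'label' columns."""
    reader = csv.DictReader(handle)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        text = (row["text"] or "").strip()
        label = int(row["label"])  # raises ValueError on non-integer labels
        if label not in VALID_LABELS:
            raise ValueError(f"line {reader.line_num}: unknown label {label}")
        if not text:
            raise ValueError(f"line {reader.line_num}: empty text")
        rows.append((text, label))
    return rows


sample = io.StringIO(
    'text,label\n'
    '"Human written text example",0\n'
    '"AI generated text example",1\n'
    '"Humanized AI text example",2\n'
)
rows = load_rows(sample)
print(rows)
```

To validate a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.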

📚 Documentation

docs/RESULTS_SUMMARY.md - Complete results summary
docs/COMPLETE_RESULTS.md - Detailed analysis
results/ - JSON result files from all experiments
PROJECT_STRUCTURE.md - Detailed explanation of project structure

🔬 Research Methodology

This project systematically evaluated various approaches to AI text detection:

Traditional ML: TF-IDF with Logistic Regression
Structural Analysis: Sentence length, punctuation patterns
Perplexity-based: Using GPT-2 perplexity scores
Transformer-based: Fine-tuned DistilBERT models (Athena)
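
To make two of these feature types concrete, here is a rough sketch of structural features (sentence length, punctuation) and of the perplexity formula. It is not the project's implementation: the function names and the naive sentence splitter are assumptions, and `perplexity` takes per-token log-probabilities as input rather than computing them with GPT-2.

```python
import math
import re
import string


def structural_features(text: str) -> dict:
    """Compute simple sentence-length and punctuation statistics."""
    # Naive split on ., ! and ? (an assumption; real splitters handle more cases).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    words = text.split()
    punctuation = sum(ch in string.punctuation for ch in text)
    return {
        "sentence_count": len(sentences),
        "mean_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "punctuation_per_word": punctuation / len(words) if words else 0.0,
    }


def perplexity(token_log_probs: list) -> float:
    """Perplexity = exp of the negative mean natural-log token probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


features = structural_features("Hello there. How are you today?")
print(features)

# Four tokens, each assigned probability 0.5 by a hypothetical language model.
print(perplexity([math.log(0.5)] * 4))
```

In a perplexity-based detector, lower perplexity means the language model found the text more predictable, which is the signal this family of detectors relies on.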

🙏 Attribution

This project builds upon the Athena AI detector framework. The original Athena dataset and baseline model provided the foundation for our improvements and specialized variants.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

⚠️ Disclaimer

This research is for educational purposes. AI detection is an evolving field, and no detector is 100% accurate. Use these tools responsibly and in conjunction with other verification methods.
