Toaripi Small Language Model

A small language model for generating educational content in Toaripi language

Status: Research Prototype | Language: Toaripi (East Elema), Gulf Province, Papua New Guinea (ISO 639‑3: tqo)

This project develops a small language model (SLM) for the Toaripi language to support language preservation and education. The model is trained on aligned English↔Toaripi Bible text and designed to generate original educational content for primary school learners and teachers.

🎯 About the project

Empower local educators with AI tools that create culturally relevant learning materials while preserving the Toaripi language through:

📚 Original educational content generation (stories, vocabulary, Q&A, dialogues)
🌐 Online and offline deployment (including Raspberry Pi and low-resource devices)
🤝 Open-source collaboration with Toaripi speakers, linguists, and developers
🎓 Educational focus over general-purpose chatbot functionality

✨ Key Features

🧠 Smart Content Generation

Create educational stories and vocabulary exercises
Generate reading comprehension questions and dialogues
Produce culturally relevant educational materials in Toaripi
Support primary school curriculum development

💻 Flexible Deployment

Online: Web UI and REST API for connected environments
Offline: Quantized models (GGUF) for CPU-only devices
Edge: Optimized for Raspberry Pi and low-resource hardware

🔧 Developer-Friendly

Modular Python architecture with clear APIs
Comprehensive configuration management (YAML/TOML)
Docker support for easy deployment
Extensive testing and documentation

🚀 Quick Start

Prerequisites

Python 3.10+ (3.11 recommended)
16GB RAM minimum (32GB recommended for training)
50GB+ free storage space
Git for version control

Installation

# Clone the repository
git clone https://github.com/cknzraposo/toaripi-slm.git
cd toaripi-slm

# Create and activate virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1  # Windows PowerShell
# source venv/bin/activate    # macOS/Linux

# Install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Verify Installation

# Check installation
python -c "import src.toaripi_slm; print('✅ Toaripi SLM installed successfully')"

# Run basic tests
pytest tests/ -v --tb=short

# Check CLI tools
toaripi-prepare-data --help
toaripi-finetune --help
toaripi-generate --help

🏗️ Project Structure

toaripi-slm/
├── app/                    # Web application and API
│   ├── api/               # REST API endpoints
│   ├── config/            # App configuration
│   └── ui/                # Web interface
├── configs/               # Training and deployment configs
│   ├── data/              # Data processing configurations
│   ├── deployment/        # Deployment configurations
│   └── training/          # Model training configurations
├── data/                  # Data storage
│   ├── processed/         # Cleaned and aligned data
│   ├── raw/              # Source data files
│   └── samples/          # Sample datasets
├── docs/                  # Documentation
├── models/               # Trained models
│   ├── gguf/             # Quantized models for edge deployment
│   └── hf/               # Hugging Face format models
├── scripts/              # Training and utility scripts
├── src/toaripi_slm/      # Core library code
│   ├── core/             # Model and training components
│   ├── data/             # Data processing utilities
│   ├── inference/        # Model inference and generation
│   └── utils/            # Helper functions
└── tests/                # Test suites

🤖 Usage Examples

Generate Educational Content

from src.toaripi_slm.inference import ToaripiGenerator

# Initialize generator
generator = ToaripiGenerator.load("models/toaripi-mistral-lora")

# Generate a story
story = generator.generate_story(
    prompt="Children helping parents with fishing",
    age_group="primary",
    max_length=200
)

# Generate vocabulary exercise
vocab = generator.generate_vocabulary(
    topic="household items",
    count=10,
    include_examples=True
)

Web API Usage

# Start the web server
uvicorn app.server:app --host 0.0.0.0 --port 8000

# Generate content via API
curl -X POST "http://localhost:8000/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Create a simple story about family dinner",
    "content_type": "story",
    "age_group": "primary",
    "max_length": 150
  }'

🔧 Development

For detailed development setup, training procedures, and contribution guidelines, see:

Development Setup Guide - Complete environment setup
docs/contributing/ - Training procedures and contribution guidelines
docs/usage/ - Usage documentation and examples

Training Your Own Model

# Prepare training data
python scripts/prepare_data.py \
  --config configs/data/preprocessing_config.yaml \
  --output data/processed/toaripi_parallel.csv

# Fine-tune with LoRA
python scripts/finetune.py \
  --config configs/training/lora_config.yaml \
  --model_name mistralai/Mistral-7B-Instruct-v0.2 \
  --train_data data/processed/toaripi_parallel.csv \
  --output_dir checkpoints/toaripi-mistral-lora

# Export to GGUF for edge deployment
python scripts/export_to_gguf.py \
  --model_path checkpoints/toaripi-mistral-lora \
  --output_dir models/gguf \
  --quantization q4_k_m

🌐 Deployment Options

Local Development

# Development server with auto-reload
uvicorn app.server:app --reload --port 8000

Docker Deployment

# Build and run with Docker
docker build -t toaripi-slm .
docker run -p 8000:8000 toaripi-slm

# Or use Docker Compose
docker-compose up -d

Raspberry Pi / Edge Devices

# Build ARM-compatible image
docker build -f Dockerfile.pi -t toaripi-slm:pi .

# Run quantized model on CPU
python scripts/run_gguf.py \
  --model_path models/gguf/toaripi-mistral-q4_k_m.gguf \
  --host 0.0.0.0 --port 8000

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src/toaripi_slm --cov-report=html

# Run specific test categories
pytest tests/unit/ -v           # Unit tests
pytest tests/integration/ -v    # Integration tests

🤝 Contributing

We welcome contributions from developers, linguists, educators, and the Toaripi community! See our Contributing Guidelines for:

Setting up the development environment
Code style and testing requirements
How to submit issues and pull requests
Guidelines for adding training data
Community code of conduct

📊 Tech Stack

Language: Python 3.10+
ML Libraries: transformers, datasets, accelerate, peft (LoRA)
Web Framework: fastapi + uvicorn
Edge Inference: llama.cpp (GGUF quantized models)
Configuration: YAML/TOML
Testing: pytest, pre-commit
Deployment: Docker, Docker Compose

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Toaripi Community: For language knowledge and cultural guidance
Bible Society PNG: For providing aligned text resources
Open Source Contributors: For tools and libraries that make this possible
Educational Partners: For curriculum guidance and feedback

Preserving language and empowering education through technology.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github		.github
.specify		.specify
app		app
chat_sessions		chat_sessions
configs		configs
data		data
docs		docs
scripts		scripts
specs/001-offline-capable-toaripi		specs/001-offline-capable-toaripi
src/toaripi_slm		src/toaripi_slm
test_results		test_results
tests		tests
toaripi_env		toaripi_env
toaripi_test_env		toaripi_test_env
training_sessions		training_sessions
.env.template		.env.template
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CLI_IMPLEMENTATION_SUMMARY.md		CLI_IMPLEMENTATION_SUMMARY.md
ENHANCED_CLI_SUMMARY.md		ENHANCED_CLI_SUMMARY.md
GETTING_STARTED.md		GETTING_STARTED.md
INSTALLATION_IMPROVEMENTS.md		INSTALLATION_IMPROVEMENTS.md
LAUNCHER_README.md		LAUNCHER_README.md
PROGRESS_ENHANCEMENT_SUMMARY.md		PROGRESS_ENHANCEMENT_SUMMARY.md
QUICKSTART_CLI.md		QUICKSTART_CLI.md
demo_chat.py		demo_chat.py
demo_enhanced_cli.py		demo_enhanced_cli.py
demo_full_cli.py		demo_full_cli.py
devsetup.md		devsetup.md
readme.md		readme.md
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py
setup_aliases.sh		setup_aliases.sh
start_toaripi.bat		start_toaripi.bat
start_toaripi.py		start_toaripi.py
start_toaripi.sh		start_toaripi.sh
test_chat_functionality.py		test_chat_functionality.py
test_interactive_cli.py		test_interactive_cli.py
test_model_integration.py		test_model_integration.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Toaripi Small Language Model

🎯 About the project

✨ Key Features

🧠 Smart Content Generation

💻 Flexible Deployment

🔧 Developer-Friendly

🚀 Quick Start

Prerequisites

Installation

Verify Installation

🏗️ Project Structure

🤖 Usage Examples

Generate Educational Content

Web API Usage

🔧 Development

Training Your Own Model

🌐 Deployment Options

Local Development

Docker Deployment

Raspberry Pi / Edge Devices

🧪 Testing

🤝 Contributing

📊 Tech Stack

📄 License

🙏 Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Toaripi Small Language Model

🎯 About the project

✨ Key Features

🧠 Smart Content Generation

💻 Flexible Deployment

🔧 Developer-Friendly

🚀 Quick Start

Prerequisites

Installation

Verify Installation

🏗️ Project Structure

🤖 Usage Examples

Generate Educational Content

Web API Usage

🔧 Development

Training Your Own Model

🌐 Deployment Options

Local Development

Docker Deployment

Raspberry Pi / Edge Devices

🧪 Testing

🤝 Contributing

📊 Tech Stack

📄 License

🙏 Acknowledgments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages