BTRFSQA Design Document

Overview
System Architecture
Component Design
Workflow
Data Model
Infrastructure
Testing Framework
Publishing System
Security Considerations
Extensibility
Future Enhancements
Conclusion

Overview

Purpose

BTRFSQA is an automated continuous integration and quality assurance system for the BTRFS filesystem. It provides automated testing of the latest BTRFS kernel development code against comprehensive test suites and publishes results to a public dashboard.

Goals

Automation: Fully automated end-to-end testing pipeline with zero manual intervention
Coverage: Comprehensive testing across kernel, userspace tools, and filesystem operations
Transparency: Public visibility of test results through GitHub Pages
Cost-Effectiveness: Efficient use of cloud resources with automatic cleanup
Reproducibility: Consistent test environment using fresh infrastructure per run

Key Features

Automated AWS EC2 infrastructure provisioning and teardown
Latest BTRFS development kernel compilation and installation
Multi-stage test suite execution (btrfsprogs, xfstests, RAID5 scrub tests)
Terminal recording via Asciinema for visual debugging
Automated results publishing to GitHub Pages
Historical test results preservation

System Architecture

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                    BTRFSQA System                             │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Control Plane (Local Machine)             │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │         btrfsqa.py Orchestrator                  │ │  │
│  │  │  ┌────────────────┬─────────────────────────┐   │ │  │
│  │  │  │ AWS Manager    │ Remote Executor         │   │ │  │
│  │  │  │ (Boto)         │ (Fabric)                │   │ │  │
│  │  │  ├────────────────┼─────────────────────────┤   │ │  │
│  │  │  │ Config Manager │ Results Publisher       │   │ │  │
│  │  │  │                │ (Git)                   │   │ │  │
│  │  │  └────────────────┴─────────────────────────┘   │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │                                                        │  │
│  │  Configuration Sources:                               │  │
│  │  • aws_auth.json    • ec2.json      • github.json    │  │
│  │  • timeout.json     • kernel.config • local.config   │  │
│  └────────────────────────────────────────────────────────┘  │
│                           │                                   │
│                           │ Provision & Execute               │
│                           ▼                                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │           Test Execution Environment (AWS EC2)         │  │
│  │                                                        │  │
│  │  Instance: r4.large (Fedora 26)                       │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 1: BTRFS Development Kernel               │ │  │
│  │  │  • Source: btrfs-devel/misc-next                 │ │  │
│  │  │  • Build: Custom kernel.config                   │ │  │
│  │  │  • Install: Automatic bootloader update          │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 2: Test Execution Framework               │ │  │
│  │  │  ┌────────────┬──────────────┬────────────────┐ │ │  │
│  │  │  │ Script 001 │ Script 002   │ Script 003     │ │ │  │
│  │  │  │ Kernel     │ btrfsprogs   │ xfstests       │ │ │  │
│  │  │  │ Build      │ Test Suite   │ Test Suite     │ │ │  │
│  │  │  ├────────────┼──────────────┼────────────────┤ │ │  │
│  │  │  │ Script 004 │ Asciinema    │ Results        │ │ │  │
│  │  │  │ RAID5      │ Recorder     │ Collector      │ │ │  │
│  │  │  │ Scrub      │              │                │ │ │  │
│  │  │  └────────────┴──────────────┴────────────────┘ │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 3: Storage Infrastructure                │ │  │
│  │  │  6x EBS Volumes (20GB each)                     │ │  │
│  │  │  /dev/xvdb, /dev/xvdc, /dev/xvdd,              │ │  │
│  │  │  /dev/xvde, /dev/xvdf, /dev/xvdg               │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  └────────────────────────────────────────────────────────┘  │
│                           │                                   │
│                           │ Results Collection                │
│                           ▼                                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            Publishing Layer                            │  │
│  │                                                        │  │
│  │  ┌──────────────────┐  ┌──────────────────────────┐  │  │
│  │  │ Asciinema.org    │  │ GitHub Repository        │  │  │
│  │  │ Terminal         │  │ • results/               │  │  │
│  │  │ Recordings       │  │ • _layouts/default.html  │  │  │
│  │  └──────────────────┘  └──────────────────────────┘  │  │
│  │                                  │                     │  │
│  │                                  │ Triggers            │  │
│  │                                  ▼                     │  │
│  │                      ┌────────────────────────┐       │  │
│  │                      │ GitHub Pages           │       │  │
│  │                      │ Public Dashboard       │       │  │
│  │                      │ (Jekyll + Cayman)      │       │  │
│  │                      └────────────────────────┘       │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Component Layers

Layer 1: Control Plane

Location: Local machine
Language: Python 2.7
Dependencies: Boto (AWS SDK), Fabric (SSH automation)
Responsibilities: Orchestration, provisioning, configuration, results publishing

Layer 2: Test Environment

Location: AWS EC2 (ephemeral)
OS: Fedora 26
Instance Type: r4.large (spot instance)
Responsibilities: Kernel compilation, test execution, results generation

Layer 3: Publishing Infrastructure

GitHub Repository: Version control and static hosting
GitHub Pages: Jekyll-based public dashboard
Asciinema.org: Terminal recording hosting

Component Design

1. Orchestrator (btrfsqa.py)

Purpose: Main control script that coordinates all system operations

Key Functions:

main()                    # Entry point, orchestrates entire workflow
req_instance_and_tag()    # Provisions EC2 spot instance with tags
set_bdm()                 # Configures block device mapping (6x EBS volumes)
install_sw()              # Uploads configs, installs dependencies, runs tests
update_htmltable()        # Generates results HTML and publishes to GitHub
del_sys()                 # Cleanup: terminates instance and deletes volumes

Design Patterns:

Configuration-driven: All settings externalized to JSON files
Sequential execution: Scripts run in order with timeout controls
Error handling: Graceful degradation with cleanup on failure
Idempotency: Safe to re-run, cleans up previous resources
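
The "cleanup on failure" pattern listed above can be sketched as a try/finally around the pipeline stages. This is a minimal sketch under stated assumptions, not the actual btrfsqa.py: run_pipeline and the four callables it receives are hypothetical stand-ins for stage functions such as req_instance_and_tag(), install_sw(), update_htmltable(), and del_sys().

```python
def run_pipeline(provision, run_tests, publish, cleanup):
    """Run the stages in order and always clean up once provisioning succeeded.

    All four arguments are hypothetical callables standing in for the
    orchestrator's real stage functions.
    """
    instance = None
    try:
        instance = provision()          # e.g. request and tag the spot instance
        results = run_tests(instance)   # e.g. run scripts 001-004 in sequence
        publish(results)                # e.g. regenerate and push the HTML table
        return results
    finally:
        # Executed on success and on any exception, so a failed run does not
        # leave a billable instance or orphaned volumes behind.
        if instance is not None:
            cleanup(instance)
```

Keeping the cleanup call in a finally block means a failure in any later stage still reaches the teardown step, which is what makes a re-run safe.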

2. Configuration System

File Structure:

setup/config/
├── aws_auth.json     # AWS access key, secret key, region
├── ec2.json          # AMI ID, instance type, security group
├── github.json       # Repository URL, credentials
├── timeout.json      # Per-script timeout limits (minutes)
├── kernel.config     # Linux kernel build configuration
├── local.config      # xfstests environment variables
├── bashrc            # Shell environment customization
└── netrc             # Git credentials for automation

Design Principles:

Separation of concerns: Credentials separate from code
Version control: Non-sensitive configs tracked in Git
Flexibility: Easy modification without code changes
Security: Sensitive files are excluded from version control via .gitignore
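
As an illustration of the configuration-driven approach, the JSON files listed above could be loaded in one pass before any AWS call is made. This is a sketch only: load_configs and the shape of its return value are assumptions, and only the file names come from the listing.

```python
import json
import os

# File names taken from the setup/config/ listing above.
JSON_CONFIGS = ("aws_auth.json", "ec2.json", "github.json", "timeout.json")


def load_configs(config_dir):
    """Return {name-without-extension: parsed JSON} for each config file.

    A missing file raises an OSError subclass and malformed JSON raises
    ValueError, so a misconfigured run fails before any AWS resources
    are requested.
    """
    configs = {}
    for filename in JSON_CONFIGS:
        path = os.path.join(config_dir, filename)
        with open(path) as handle:
            configs[os.path.splitext(filename)[0]] = json.load(handle)
    return configs
```

A caller would then read, for example, load_configs("setup/config")["timeout"] to obtain the per-script limits.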

3. Test Scripts

Execution Model: Sequential execution with completion signaling

Script Architecture:

001_btrfsdevel     (Timeout: 120 min)
├── Clone btrfs-devel kernel source
├── Copy kernel.config
├── Compile kernel (make -j4)
├── Install kernel modules
├── Update bootloader
├── Reboot instance
└── Signal: touch /tmp/001_btrfsdevel.completed

002_btrfsprogs     (Timeout: 120 min)
├── Clone btrfsprogs repository
├── Build from source (autogen, configure, make)
├── Run test suites: fsck, cli, misc, fuzz
└── Signal: touch /tmp/002_btrfsprogs.completed

003_xfstests       (Timeout: 120 min)
├── Clone xfstests repository
├── Setup test environment (local.config)
├── Create test filesystems on /dev/xvdb-xvdg
├── Execute: make && make install && ./check -g auto
└── Signal: touch /tmp/003_xfstests.completed

004_raid5_scrub    (Timeout: 30 min)
├── Apply RAID5-specific patches
├── Run scrub tests
└── Signal: touch /tmp/004_raid5_scrub.completed

Completion Protocol:

Each script creates /tmp/SCRIPTNAME.completed on success
Orchestrator polls for completion files with timeout
Missing completion file = test failure
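
The polling side of this protocol can be sketched as a loop with a deadline. The function below is a hypothetical illustration rather than the orchestrator's code: marker_exists is an assumed zero-argument callable (for example, one that runs test -f /tmp/001_btrfsdevel.completed over SSH), and the clock and sleep parameters exist only to make the loop testable.

```python
import time


def wait_for_completion(marker_exists, timeout_minutes, poll_seconds=30,
                        clock=time.time, sleep=time.sleep):
    """Poll until marker_exists() is true or the timeout elapses.

    Returns True if the completion marker appeared and False on timeout,
    which the protocol above treats as a test failure.
    """
    deadline = clock() + timeout_minutes * 60
    while True:
        if marker_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The timeout is expressed in minutes to match the values stored in timeout.json.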

4. Recording System

Technology: Asciinema (terminal recording)

Workflow:

1. Start recording: asciinema rec -c "bash SCRIPT" OUTPUT.json
2. Execute test script within recording
3. Upload to asciinema.org: asciinema upload OUTPUT.json
4. Parse upload URL from response
5. Embed in HTML table with thumbnail

Benefits:

Visual debugging of test failures
Exact reproduction of terminal session
Lightweight (text-based format)
Publicly shareable links
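
Steps 1 and 4 of the recording workflow above can be sketched as two small helpers. Both function names are hypothetical, and the exact text printed by asciinema upload is an assumption; the sketch relies only on the output containing a link of the form https://asciinema.org/a/ID.

```python
import re

# Matches recording links such as https://asciinema.org/a/abc123
_URL_RE = re.compile(r"https://asciinema\.org/a/[A-Za-z0-9]+")


def record_command(script, output):
    """Build the asciinema rec command line from workflow step 1."""
    return 'asciinema rec -c "bash {0}" {1}'.format(script, output)


def parse_upload_url(upload_output):
    """Return the first recording URL found in the upload output, or None."""
    match = _URL_RE.search(upload_output)
    return match.group(0) if match else None
```

Returning None when no link is found lets the caller publish a row without a screencast instead of failing the whole run.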

Workflow

End-to-End Execution Flow

Start
  │
  ├─► [1] Load Configuration Files
  │    ├── aws_auth.json
  │    ├── ec2.json
  │    ├── github.json
  │    └── timeout.json
  │
  ├─► [2] AWS Infrastructure Provisioning
  │    ├── Connect to AWS (Boto)
  │    ├── Request spot instance (r4.large, Fedora 26)
  │    ├── Configure 6x EBS volumes (20GB each)
  │    ├── Tag resources (Name: btrfsqa-DATE)
  │    ├── Wait for instance state: running
  │    └── Get public IP address
  │
  ├─► [3] Remote Environment Setup
  │    ├── SSH connect (Fabric, wait for availability)
  │    ├── Upload configuration files:
  │    │    ├── kernel.config → /tmp/
  │    │    ├── local.config → /tmp/
  │    │    ├── bashrc → /tmp/
  │    │    └── netrc → /root/.netrc
  │    ├── Upload test scripts (001-004)
  │    └── Install base dependencies:
  │         ├── git
  │         ├── python3
  │         ├── asciinema
  │         └── screen
  │
  ├─► [4] Sequential Test Execution
  │    │
  │    ├─► Script 001: BTRFS Kernel Build (120 min timeout)
  │    │    ├── Record: asciinema rec -c "bash 001_btrfsdevel"
  │    │    ├── Download kernel source (misc-next branch)
  │    │    ├── Configure with kernel.config
  │    │    ├── Compile: make -j4
  │    │    ├── Install: make modules_install && make install
  │    │    ├── Update grub bootloader
  │    │    ├── Reboot instance
  │    │    ├── Wait for SSH reconnection
  │    │    ├── Verify new kernel: uname -r
  │    │    ├── Create completion marker
  │    │    └── Upload recording to asciinema.org
  │    │
  │    ├─► Script 002: btrfsprogs Tests (120 min timeout)
  │    │    ├── Record execution
  │    │    ├── Clone btrfsprogs repository
  │    │    ├── Build: ./autogen.sh && ./configure && make
  │    │    ├── Run test suites:
  │    │    │    ├── make test-fsck
  │    │    │    ├── make test-cli
  │    │    │    ├── make test-misc
  │    │    │    └── make test-fuzz
  │    │    ├── Create completion marker
  │    │    └── Upload recording
  │    │
  │    ├─► Script 003: xfstests (120 min timeout)
  │    │    ├── Record execution
  │    │    ├── Clone xfstests repository
  │    │    ├── Install dependencies
  │    │    ├── Build: make && make install
  │    │    ├── Setup test devices:
  │    │    │    ├── TEST_DEV=/dev/xvdb
  │    │    │    ├── TEST_DIR=/mnt/test
  │    │    │    ├── SCRATCH_DEV_POOL=/dev/xvdc-xvdg
  │    │    │    └── SCRATCH_MNT=/mnt/scratch
  │    │    ├── Execute: ./check -g auto (all tests)
  │    │    ├── Create completion marker
  │    │    └── Upload recording
  │    │
  │    └─► Script 004: RAID5 Scrub (30 min timeout)
  │         ├── Record execution
  │         ├── Apply specific patches
  │         ├── Run RAID5 scrub tests
  │         ├── Create completion marker
  │         └── Upload recording
  │
  ├─► [5] Results Collection
  │    ├── Download asciinema upload URLs
  │    ├── Download test logs from /tmp/
  │    ├── Determine pass/fail status:
  │    │    ├── Pass: *.completed file exists
  │    │    └── Fail: timeout or missing completion
  │    └── Collect metadata (timestamps, script names)
  │
  ├─► [6] Results Publishing
  │    ├── Clone GitHub repository (local temp directory)
  │    ├── Create results directory: results/results_YYYY-MM-DD_HH:MM/
  │    ├── Copy test logs to results directory
  │    ├── Generate HTML table with:
  │    │    ├── Script name
  │    │    ├── Status badge (pass/fail)
  │    │    ├── Asciinema embed with thumbnail
  │    │    └── Log file download links
  │    ├── Update _layouts/default.html
  │    ├── Git commit with timestamp message
  │    ├── Git push to origin/master
  │    └── GitHub Pages auto-rebuilds site
  │
  ├─► [7] Infrastructure Cleanup
  │    ├── Wait 2 minutes (allow final syncs)
  │    ├── Terminate EC2 instance
  │    ├── Delete unattached EBS volumes
  │    └── Log cleanup completion
  │
End

Data Model

Results Directory Structure

results/
├── results_2024-11-15_10:30/
│   ├── btrfsprogs_001/
│   │   ├── test.log
│   │   ├── fsck-tests.log
│   │   └── cli-tests.log
│   ├── xfstests_001/
│   │   ├── results.log
│   │   ├── failed.log
│   │   └── check.log
│   ├── logs/
│   │   ├── 001_btrfsdevel.log
│   │   ├── 002_btrfsprogs.log
│   │   ├── 003_xfstests.log
│   │   └── 004_raid5_scrub.log
│   └── screencasts/
│       ├── 001.json (asciinema recording)
│       └── 001.url (uploaded URL)
└── results_2024-11-16_09:45/
    └── ... (next test run)
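
The directory naming above follows a fixed timestamp pattern, so the paths can be derived from the run time. The helpers below are an illustrative sketch; results_dir and log_path are hypothetical names, not functions from btrfsqa.py.

```python
import os
from datetime import datetime


def results_dir(run_time, root="results"):
    """Return the path results/results_YYYY-MM-DD_HH:MM for run_time."""
    return os.path.join(root, run_time.strftime("results_%Y-%m-%d_%H:%M"))


def log_path(run_dir, script_name):
    """Return the per-script log location inside a run directory."""
    return os.path.join(run_dir, "logs", script_name + ".log")


if __name__ == "__main__":
    run = results_dir(datetime(2024, 11, 15, 10, 30))
    print(log_path(run, "003_xfstests"))
```

Note that the colon in the directory name is not valid on Windows filesystems, which matters only if the results repository is checked out there.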

HTML Table Schema

<tr>
  <td>Script Name</td>
  <td>
    <span class="status-badge pass|fail">PASS|FAIL</span>
  </td>
  <td>
    <script id="asciicast-ID" src="https://asciinema.org/a/ID.js" data-theme="monokai" async></script>
  </td>
  <td>
    <a href="results/PATH/logs/SCRIPT.log">View Log</a>
  </td>
</tr>
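
A row matching this schema could be produced by a small template function. The sketch below is an assumption about how such a generator might look (render_row is a hypothetical name); it includes the id and async attributes used in the Asciinema Integration snippet later in this document, and escapes the values it interpolates.

```python
from html import escape


def render_row(script_name, passed, recording_id, log_href):
    """Render one <tr> following the table schema above."""
    status = "pass" if passed else "fail"
    return (
        "<tr>\n"
        "  <td>{name}</td>\n"
        '  <td><span class="status-badge {status}">{label}</span></td>\n'
        '  <td><script id="asciicast-{rid}" '
        'src="https://asciinema.org/a/{rid}.js" '
        'data-theme="monokai" async></script></td>\n'
        '  <td><a href="{href}">View Log</a></td>\n'
        "</tr>"
    ).format(
        name=escape(script_name),
        status=status,
        label=status.upper(),
        rid=escape(recording_id, quote=True),
        href=escape(log_href, quote=True),
    )
```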

Infrastructure

AWS Resources

EC2 Instance Specifications:

Instance Type: r4.large (15.25 GB RAM, 2 vCPUs)
Purchasing: Spot instance (cost optimization)
AMI: Fedora 26 (ami-id from ec2.json)
Region: Configurable (us-east-1 default)
Security Group: SSH (port 22) enabled
Key Pair: btrfsqa-keypair

Storage Configuration:

Root Volume: Default AMI root (typically 8-10 GB)
Data Volumes: 6x EBS volumes (20 GB GP2 each)
- /dev/xvdb: Primary test device
- /dev/xvdc-xvdg: Scratch devices for multi-disk tests
Lifecycle: Deleted on termination
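
The six data volumes follow a regular naming pattern, so the mapping can be generated rather than written out by hand. This is a sketch under assumptions: build_block_device_mappings is a hypothetical helper, the entries use the dictionary shape shown in the set_bdm() customization example later in this document, and the root device is assumed not to occupy /dev/xvdb or later.

```python
import string


def build_block_device_mappings(count=6, size_gb=20, volume_type="gp2"):
    """Return mapping entries for /dev/xvdb onward, one per data volume."""
    if not 1 <= count <= 25:
        raise ValueError("count must be between 1 and 25 (xvdb..xvdz)")
    letters = string.ascii_lowercase[1:1 + count]  # 'b', 'c', ...
    return [
        {
            "DeviceName": "/dev/xvd" + letter,
            "Ebs": {
                "VolumeSize": size_gb,          # size in GB
                "VolumeType": volume_type,
                "DeleteOnTermination": True,    # matches the lifecycle above
            },
        }
        for letter in letters
    ]
```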

Resource Tagging:

{
  "Name": "btrfsqa-2024-11-18",
  "Project": "btrfsqa",
  "ManagedBy": "automation"
}

Network Architecture

Internet
   │
   ├─► Local Machine (Control Plane)
   │    ├── Outbound: AWS API (HTTPS)
   │    └── Outbound: SSH to EC2
   │
   ├─► AWS Region (us-east-1)
   │    │
   │    └─► EC2 Instance (Public Subnet)
   │         ├── Public IP: Dynamic (assigned at launch)
   │         ├── Inbound: SSH (port 22) from anywhere
   │         └── Outbound: Internet access (Git, package repos)
   │
   ├─► GitHub.com
   │    ├── Git clone (btrfs-devel, btrfsprogs, xfstests)
   │    └── Git push (results publishing)
   │
   └─► Asciinema.org
        └── Recording upload (HTTP POST)

Testing Framework

Test Suite Hierarchy

BTRFSQA Testing Pyramid

        ┌───────────────────────┐
        │   Integration Tests   │  Script 004: RAID5 Scrub
        │  (Specific Scenarios) │  • Targeted regression tests
        └───────────┬───────────┘  • Known bug validation
                    │
        ┌───────────┴───────────┐
        │   Functional Tests    │  Script 003: xfstests
        │   (Filesystem Ops)    │  • 400+ test cases
        └───────────┬───────────┘  • POSIX compliance
                    │              • Stress testing
        ┌───────────┴───────────┐  • Data integrity
        │      Unit Tests       │  Script 002: btrfsprogs
        │   (Userspace Tools)   │  • Tool-specific tests
        └───────────┬───────────┘  • CLI validation
                    │              • Format verification
        ┌───────────┴───────────┐
        │     Kernel Build      │  Script 001: Kernel
        │     (Base Layer)      │  • Compilation check
        └───────────────────────┘  • Module loading

Test Script Details

Script 001: Kernel Development Build

Objective: Validate latest BTRFS kernel code compiles and boots

Steps:

Clone btrfs-devel repository (misc-next branch)
Copy custom kernel configuration
Compile kernel (make -j4)
Install kernel and modules
Update bootloader configuration
Reboot instance
Verify new kernel loaded

Success Criteria:

Compilation completes without errors
Kernel boots successfully
BTRFS module loads
/tmp/001_btrfsdevel.completed created

Script 002: btrfsprogs Test Suite

Objective: Validate userspace tools functionality

Test Categories:

fsck-tests: Filesystem check and repair
cli-tests: Command-line interface
misc-tests: Miscellaneous utilities
fuzz-tests: Malformed input handling

Success Criteria:

All test categories pass
No crashes or hangs
/tmp/002_btrfsprogs.completed created

Script 003: xfstests

Objective: Comprehensive filesystem testing

Test Coverage:

File operations (create, read, write, delete)
Directory operations
Extended attributes
ACLs and permissions
Quotas
Snapshots and clones
Compression
Checksumming
RAID configurations
Error injection
Recovery scenarios

Configuration:

export TEST_DEV=/dev/xvdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV_POOL="/dev/xvdc /dev/xvdd /dev/xvde /dev/xvdf /dev/xvdg"
export SCRATCH_MNT=/mnt/scratch
export FSTYP=btrfs

Success Criteria:

Test suite completes
No kernel panics
Acceptable pass rate
/tmp/003_xfstests.completed created

Script 004: RAID5 Scrub Tests

Objective: Validate specific RAID5 functionality

Focus Areas:

RAID5/6 rebuild
Scrub operation
Data recovery
Parity verification

Success Criteria:

Specific test cases pass
No data corruption
/tmp/004_raid5_scrub.completed created

Publishing System

GitHub Pages Integration

Technology Stack:

Framework: Jekyll (static site generator)
Theme: Cayman (a GitHub Pages supported theme)
Hosting: GitHub Pages (automatic deployment)

File Structure:

btrfsqa/
├── _config.yml           # Jekyll configuration
│   ├── theme: jekyll-theme-cayman
│   └── title: BTRFSQA Dashboard
│
├── _layouts/
│   └── default.html      # Main page template
│       ├── Header: Project info
│       ├── Table: Test results (dynamically updated)
│       └── Footer: Known issues
│
├── results/              # Test execution results
│   └── (timestamped directories)
│
└── index.md              # Landing page content

Update Mechanism:

update_htmltable() performs the following steps:

1. Clone the repository to a temp directory
2. Read _layouts/default.html
3. Generate new table rows for the latest results
4. Insert the rows into the HTML template
5. Commit changes: "Update results - YYYY-MM-DD HH:MM"
6. Push to origin/master
7. GitHub Pages rebuilds (automatic, ~1 minute)
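
Steps 4 and 5 of the update mechanism can be sketched without touching Git. The code below is an assumption-laden illustration: the marker comment is a hypothetical placeholder (the real template may locate its table differently), and placing new rows first is an assumed ordering.

```python
def insert_rows(template_html, new_rows, marker="<!-- RESULTS -->"):
    """Insert new_rows directly after marker so the newest run appears first.

    marker is a hypothetical placeholder comment inside the results table.
    """
    if marker not in template_html:
        raise ValueError("marker not found in template")
    head, tail = template_html.split(marker, 1)
    return head + marker + "\n" + "\n".join(new_rows) + tail


def commit_message(run_time):
    """Return the commit message format listed in step 5."""
    return run_time.strftime("Update results - %Y-%m-%d %H:%M")
```

Raising an error when the marker is missing avoids silently pushing a page with no new results.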

Results Presentation

Table Columns:

Script Name: Test identifier (e.g., "001_btrfsdevel")
Status: Visual badge (green PASS, red FAIL)
Screencast: Embedded Asciinema player with thumbnail
Logs: Download links for detailed output

Asciinema Integration:

<script
  id="asciicast-RECORDING_ID"
  src="https://asciinema.org/a/RECORDING_ID.js"
  async
  data-theme="monokai"
  data-size="small"
  data-cols="120"
></script>

Benefits:

No server infrastructure required
Automatic HTTPS
CDN distribution
Version controlled history
Zero operational cost

Security Considerations

Credentials Management

Sensitive Files (not in version control):

setup/config/
├── aws_auth.json     # AWS access keys
├── github.json       # GitHub credentials
└── netrc             # Git authentication

Git Configuration:

# .gitignore entries
setup/config/aws_auth.json
setup/config/github.json
setup/config/netrc
*.pem
*.key

Access Control

AWS Permissions Required:

ec2:RunInstances (instance creation; ec2:RequestSpotInstances is also needed if the spot request API is used)
ec2:TerminateInstances
ec2:CreateTags
ec2:DescribeInstances
ec2:DescribeVolumes
ec2:DeleteVolume

GitHub Permissions:

Repository write access (for results publishing)
Pages deployment (automatic with write access)

Network Security

EC2 Security Group:

Inbound: SSH (port 22) from 0.0.0.0/0
Outbound: All traffic allowed

Recommendations:

Restrict SSH to known IP ranges
Use IAM roles instead of access keys
Enable CloudTrail for audit logging
Implement GitHub deploy keys (read-only clones)
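
The first recommendation can be enforced with a pre-flight check on the configured SSH source ranges. This is a hypothetical sketch: neither function exists in btrfsqa.py, and where the CIDR list comes from (for example ec2.json) is an assumption.

```python
import ipaddress


def is_open_to_world(cidr):
    """Return True if cidr covers every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def check_ssh_sources(cidrs):
    """Return the subset of cidrs that should be narrowed before use."""
    return [cidr for cidr in cidrs if is_open_to_world(cidr)]
```

An orchestrator could refuse to provision, or log a warning, when check_ssh_sources returns a non-empty list.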

Extensibility

Adding New Test Scripts

Process:

1. Create a new script file: setup/scripts/00X_testname
2. Make it executable: chmod +x 00X_testname
3. Follow the completion protocol:

# At end of script
touch /tmp/00X_testname.completed

4. Add a timeout (in minutes) to setup/config/timeout.json:

```
{
  "00X_testname": 60
}
```

5. The script is then discovered and executed automatically

Script Template:

#!/bin/bash
set -e  # Exit on error

# Test logic here
echo "Running custom test..."

# Signal completion
touch /tmp/00X_testname.completed
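
Automatic discovery of scripts named with a three-digit prefix could work along the following lines. This is a sketch, not the orchestrator's code: discover_scripts is a hypothetical name, and the 60-minute fallback for scripts missing from timeout.json is an assumption.

```python
import json
import os
import re

# Three digits, underscore, name: 001_btrfsdevel, 004_raid5_scrub, ...
_SCRIPT_RE = re.compile(r"^\d{3}_\w+$")


def discover_scripts(scripts_dir, timeout_file, default_minutes=60):
    """Return [(script_name, timeout_minutes)] sorted by numeric prefix."""
    with open(timeout_file) as handle:
        timeouts = json.load(handle)
    # Lexicographic order equals numeric order for fixed-width prefixes.
    names = sorted(n for n in os.listdir(scripts_dir) if _SCRIPT_RE.match(n))
    return [(name, timeouts.get(name, default_minutes)) for name in names]
```

Sorting by name preserves the sequential execution order that the numeric prefixes encode.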

Configuration Customization

Common Modifications:

Instance Type (ec2.json), for example a larger instance with more CPU and RAM:

{
  "InstanceType": "r5.xlarge"
}

Storage (btrfsqa.py:set_bdm()):

# Add more volumes
bdm.append({
    'DeviceName': '/dev/xvdh',
    'Ebs': {'VolumeSize': 50}
})

Timeout Adjustments (timeout.json), for example raising the xfstests limit to 4 hours (240 minutes):

{
  "003_xfstests": 240
}

Plugin Architecture Opportunities

Future Extensibility:

Notification plugins: Email, Slack, PagerDuty alerts
Storage backends: S3, NFS for results
Test schedulers: Cron integration, webhook triggers
Results analyzers: Automated failure classification
Comparison tools: Regression detection across runs

Future Enhancements

Short-Term Improvements

Error Handling:
- Retry logic for transient failures
- Partial result preservation on timeout
- Email notifications on test failures
Performance:
- Parallel test execution (where safe)
- Incremental kernel builds
- Result compression
Reporting:
- Test duration tracking
- Pass/fail rate graphs
- Historical trend analysis

Medium-Term Features

Multi-Kernel Testing:
- Test multiple kernel versions per run
- Comparison matrix
- Regression bisection
Custom Test Configurations:
- Parameterized xfstests runs
- Mount option variations
- Feature flag combinations
Integration:
- GitHub webhook triggers
- PR comment integration
- Slack notifications

Long-Term Vision

Distributed Testing:
- Multi-region execution
- Parallel instance testing
- Load balancing
Advanced Analytics:
- ML-based failure prediction
- Automatic bug categorization
- Performance regression detection
Community Features:
- Public API for results
- Custom test submission
- Comparison with community runs

Conclusion

BTRFSQA provides automated testing infrastructure for BTRFS development. Its design emphasizes automation, transparency, and cost-effectiveness, and leaves room for the extensions described above. By publishing every run to a public dashboard, it lets the BTRFS community track stability and progress over time.
