Skip to content

Latest commit

 

History

History
797 lines (681 loc) · 30.4 KB

File metadata and controls

797 lines (681 loc) · 30.4 KB

BTRFSQA Design Document

Table of Contents

  1. Overview
  2. System Architecture
  3. Component Design
  4. Workflow
  5. Data Model
  6. Infrastructure
  7. Testing Framework
  8. Publishing System
  9. Security Considerations
  10. Extensibility
  11. Future Enhancements

Overview

Purpose

BTRFSQA is an automated continuous integration and quality assurance system for the BTRFS filesystem. It provides automated testing of the latest BTRFS kernel development code against comprehensive test suites and publishes results to a public dashboard.

Goals

  • Automation: Fully automated end-to-end testing pipeline with zero manual intervention
  • Coverage: Comprehensive testing across kernel, userspace tools, and filesystem operations
  • Transparency: Public visibility of test results through GitHub Pages
  • Cost-Effectiveness: Efficient use of cloud resources with automatic cleanup
  • Reproducibility: Consistent test environment using fresh infrastructure per run

Key Features

  • Automated AWS EC2 infrastructure provisioning and teardown
  • Latest BTRFS development kernel compilation and installation
  • Multi-stage test suite execution (btrfsprogs, xfstests, RAID5 scrub tests)
  • Terminal recording via Asciinema for visual debugging
  • Automated results publishing to GitHub Pages
  • Historical test results preservation

System Architecture

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                    BTRFSQA System                             │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              Control Plane (Local Machine)             │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │         btrfsqa.py Orchestrator                  │ │  │
│  │  │  ┌────────────────┬─────────────────────────┐   │ │  │
│  │  │  │ AWS Manager    │ Remote Executor         │   │ │  │
│  │  │  │ (Boto)         │ (Fabric)                │   │ │  │
│  │  │  ├────────────────┼─────────────────────────┤   │ │  │
│  │  │  │ Config Manager │ Results Publisher       │   │ │  │
│  │  │  │                │ (Git)                   │   │ │  │
│  │  │  └────────────────┴─────────────────────────┘   │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │                                                        │  │
│  │  Configuration Sources:                               │  │
│  │  • aws_auth.json    • ec2.json      • github.json    │  │
│  │  • timeout.json     • kernel.config • local.config   │  │
│  └────────────────────────────────────────────────────────┘  │
│                           │                                   │
│                           │ Provision & Execute               │
│                           ▼                                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │           Test Execution Environment (AWS EC2)         │  │
│  │                                                        │  │
│  │  Instance: r4.large (Fedora 26)                       │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 1: BTRFS Development Kernel               │ │  │
│  │  │  • Source: btrfs-devel/misc-next                 │ │  │
│  │  │  • Build: Custom kernel.config                   │ │  │
│  │  │  • Install: Automatic bootloader update          │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 2: Test Execution Framework               │ │  │
│  │  │  ┌────────────┬──────────────┬────────────────┐ │ │  │
│  │  │  │ Script 001 │ Script 002   │ Script 003     │ │ │  │
│  │  │  │ Kernel     │ btrfsprogs   │ xfstests       │ │ │  │
│  │  │  │ Build      │ Test Suite   │ Test Suite     │ │ │  │
│  │  │  ├────────────┼──────────────┼────────────────┤ │ │  │
│  │  │  │ Script 004 │ Asciinema    │ Results        │ │ │  │
│  │  │  │ RAID5      │ Recorder     │ Collector      │ │ │  │
│  │  │  │ Scrub      │              │                │ │ │  │
│  │  │  └────────────┴──────────────┴────────────────┘ │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  │  ┌──────────────────────────────────────────────────┐ │  │
│  │  │  Layer 3: Storage Infrastructure                │ │  │
│  │  │  6x EBS Volumes (20GB each)                     │ │  │
│  │  │  /dev/xvdb, /dev/xvdc, /dev/xvdd,              │ │  │
│  │  │  /dev/xvde, /dev/xvdf, /dev/xvdg               │ │  │
│  │  └──────────────────────────────────────────────────┘ │  │
│  └────────────────────────────────────────────────────────┘  │
│                           │                                   │
│                           │ Results Collection                │
│                           ▼                                   │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            Publishing Layer                            │  │
│  │                                                        │  │
│  │  ┌──────────────────┐  ┌──────────────────────────┐  │  │
│  │  │ Asciinema.org    │  │ GitHub Repository        │  │  │
│  │  │ Terminal         │  │ • results/               │  │  │
│  │  │ Recordings       │  │ • _layouts/default.html  │  │  │
│  │  └──────────────────┘  └──────────────────────────┘  │  │
│  │                                  │                     │  │
│  │                                  │ Triggers            │  │
│  │                                  ▼                     │  │
│  │                      ┌────────────────────────┐       │  │
│  │                      │ GitHub Pages           │       │  │
│  │                      │ Public Dashboard       │       │  │
│  │                      │ (Jekyll + Cayman)      │       │  │
│  │                      └────────────────────────┘       │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Component Layers

Layer 1: Control Plane

  • Location: Local machine
  • Language: Python 2.7
  • Dependencies: Boto (AWS SDK), Fabric (SSH automation)
  • Responsibilities: Orchestration, provisioning, configuration, results publishing

Layer 2: Test Environment

  • Location: AWS EC2 (ephemeral)
  • OS: Fedora 26
  • Instance Type: r4.large (spot instance)
  • Responsibilities: Kernel compilation, test execution, results generation

Layer 3: Publishing Infrastructure

  • GitHub Repository: Version control and static hosting
  • GitHub Pages: Jekyll-based public dashboard
  • Asciinema.org: Terminal recording hosting

Component Design

1. Orchestrator (btrfsqa.py)

Purpose: Main control script that coordinates all system operations

Key Functions:

main()                    # Entry point, orchestrates entire workflow
req_instance_and_tag()    # Provisions EC2 spot instance with tags
set_bdm()                 # Configures block device mapping (6x EBS volumes)
install_sw()              # Uploads configs, installs dependencies, runs tests
update_htmltable()        # Generates results HTML and publishes to GitHub
del_sys()                 # Cleanup: terminates instance and deletes volumes

Design Patterns:

  • Configuration-driven: All settings externalized to JSON files
  • Sequential execution: Scripts run in order with timeout controls
  • Error handling: Graceful degradation with cleanup on failure
  • Idempotency: Safe to re-run, cleans up previous resources

2. Configuration System

File Structure:

setup/config/
├── aws_auth.json     # AWS access key, secret key, region
├── ec2.json          # AMI ID, instance type, security group
├── github.json       # Repository URL, credentials
├── timeout.json      # Per-script timeout limits (minutes)
├── kernel.config     # Linux kernel build configuration
├── local.config      # xfstests environment variables
├── bashrc            # Shell environment customization
└── netrc             # Git credentials for automation

Design Principles:

  • Separation of concerns: Credentials separate from code
  • Version control: Non-sensitive configs tracked in Git
  • Flexibility: Easy modification without code changes
  • Security: Sensitive files (.gitignored)

3. Test Scripts

Execution Model: Sequential execution with completion signaling

Script Architecture:

001_btrfsdevel     (Timeout: 120 min)
├── Clone btrfs-devel kernel source
├── Copy kernel.config
├── Compile kernel (make -j4)
├── Install kernel modules
├── Update bootloader
├── Reboot instance
└── Signal: touch /tmp/001_btrfsdevel.completed

002_btrfsprogs     (Timeout: 120 min)
├── Clone btrfsprogs repository
├── Build from source (autogen, configure, make)
├── Run test suites: fsck, cli, misc, fuzz
└── Signal: touch /tmp/002_btrfsprogs.completed

003_xfstests       (Timeout: 120 min)
├── Clone xfstests repository
├── Setup test environment (local.config)
├── Create test filesystems on /dev/xvdb-xvdg
├── Execute: make && make install && ./check -g auto
└── Signal: touch /tmp/003_xfstests.completed

004_raid5_scrub    (Timeout: 30 min)
├── Apply RAID5-specific patches
├── Run scrub tests
└── Signal: touch /tmp/004_raid5_scrub.completed

Completion Protocol:

  • Each script creates /tmp/SCRIPTNAME.completed on success
  • Orchestrator polls for completion files with timeout
  • Missing completion file = test failure

4. Recording System

Technology: Asciinema (terminal recording)

Workflow:

1. Start recording: asciinema rec -c "bash SCRIPT" OUTPUT.json
2. Execute test script within recording
3. Upload to asciinema.org: asciinema upload OUTPUT.json
4. Parse upload URL from response
5. Embed in HTML table with thumbnail

Benefits:

  • Visual debugging of test failures
  • Exact reproduction of terminal session
  • Lightweight (text-based format)
  • Publicly shareable links

Workflow

End-to-End Execution Flow

Start
  │
  ├─► [1] Load Configuration Files
  │    ├── aws_auth.json
  │    ├── ec2.json
  │    ├── github.json
  │    └── timeout.json
  │
  ├─► [2] AWS Infrastructure Provisioning
  │    ├── Connect to AWS (Boto)
  │    ├── Request spot instance (r4.large, Fedora 26)
  │    ├── Configure 6x EBS volumes (20GB each)
  │    ├── Tag resources (Name: btrfsqa-DATE)
  │    ├── Wait for instance state: running
  │    └── Get public IP address
  │
  ├─► [3] Remote Environment Setup
  │    ├── SSH connect (Fabric, wait for availability)
  │    ├── Upload configuration files:
  │    │    ├── kernel.config → /tmp/
  │    │    ├── local.config → /tmp/
  │    │    ├── bashrc → /tmp/
  │    │    └── netrc → /root/.netrc
  │    ├── Upload test scripts (001-004)
  │    └── Install base dependencies:
  │         ├── git
  │         ├── python3
  │         ├── asciinema
  │         └── screen
  │
  ├─► [4] Sequential Test Execution
  │    │
  │    ├─► Script 001: BTRFS Kernel Build (120 min timeout)
  │    │    ├── Record: asciinema rec -c "bash 001_btrfsdevel"
  │    │    ├── Download kernel source (misc-next branch)
  │    │    ├── Configure with kernel.config
  │    │    ├── Compile: make -j4
  │    │    ├── Install: make modules_install && make install
  │    │    ├── Update grub bootloader
  │    │    ├── Reboot instance
  │    │    ├── Wait for SSH reconnection
  │    │    ├── Verify new kernel: uname -r
  │    │    ├── Create completion marker
  │    │    └── Upload recording to asciinema.org
  │    │
  │    ├─► Script 002: btrfsprogs Tests (120 min timeout)
  │    │    ├── Record execution
  │    │    ├── Clone btrfsprogs repository
  │    │    ├── Build: ./autogen.sh && ./configure && make
  │    │    ├── Run test suites:
  │    │    │    ├── make test-fsck
  │    │    │    ├── make test-cli
  │    │    │    ├── make test-misc
  │    │    │    └── make test-fuzz
  │    │    ├── Create completion marker
  │    │    └── Upload recording
  │    │
  │    ├─► Script 003: xfstests (120 min timeout)
  │    │    ├── Record execution
  │    │    ├── Clone xfstests repository
  │    │    ├── Install dependencies
  │    │    ├── Build: make && make install
  │    │    ├── Setup test devices:
  │    │    │    ├── TEST_DEV=/dev/xvdb
  │    │    │    ├── TEST_DIR=/mnt/test
  │    │    │    ├── SCRATCH_DEV_POOL=/dev/xvdc-xvdg
  │    │    │    └── SCRATCH_MNT=/mnt/scratch
  │    │    ├── Execute: ./check -g auto (all tests)
  │    │    ├── Create completion marker
  │    │    └── Upload recording
  │    │
  │    └─► Script 004: RAID5 Scrub (30 min timeout)
  │         ├── Record execution
  │         ├── Apply specific patches
  │         ├── Run RAID5 scrub tests
  │         ├── Create completion marker
  │         └── Upload recording
  │
  ├─► [5] Results Collection
  │    ├── Download asciinema upload URLs
  │    ├── Download test logs from /tmp/
  │    ├── Determine pass/fail status:
  │    │    ├── Pass: *.completed file exists
  │    │    └── Fail: timeout or missing completion
  │    └── Collect metadata (timestamps, script names)
  │
  ├─► [6] Results Publishing
  │    ├── Clone GitHub repository (local temp directory)
  │    ├── Create results directory: results/results_YYYY-MM-DD_HH:MM/
  │    ├── Copy test logs to results directory
  │    ├── Generate HTML table with:
  │    │    ├── Script name
  │    │    ├── Status badge (pass/fail)
  │    │    ├── Asciinema embed with thumbnail
  │    │    └── Log file download links
  │    ├── Update _layouts/default.html
  │    ├── Git commit with timestamp message
  │    ├── Git push to origin/master
  │    └── GitHub Pages auto-rebuilds site
  │
  ├─► [7] Infrastructure Cleanup
  │    ├── Wait 2 minutes (allow final syncs)
  │    ├── Terminate EC2 instance
  │    ├── Delete unattached EBS volumes
  │    └── Log cleanup completion
  │
End

Data Model

Results Directory Structure

results/
├── results_2024-11-15_10:30/
│   ├── btrfsprogs_001/
│   │   ├── test.log
│   │   ├── fsck-tests.log
│   │   └── cli-tests.log
│   ├── xfstests_001/
│   │   ├── results.log
│   │   ├── failed.log
│   │   └── check.log
│   ├── logs/
│   │   ├── 001_btrfsdevel.log
│   │   ├── 002_btrfsprogs.log
│   │   ├── 003_xfstests.log
│   │   └── 004_raid5_scrub.log
│   └── screencasts/
│       ├── 001.json (asciinema recording)
│       └── 001.url (uploaded URL)
└── results_2024-11-16_09:45/
    └── ... (next test run)

HTML Table Schema

<tr>
  <td>Script Name</td>
  <td>
    <span class="status-badge pass|fail">PASS|FAIL</span>
  </td>
  <td>
    <script src="https://asciinema.org/a/ID.js" data-theme="monokai"></script>
  </td>
  <td>
    <a href="results/PATH/logs/SCRIPT.log">View Log</a>
  </td>
</tr>

Infrastructure

AWS Resources

EC2 Instance Specifications:

  • Instance Type: r4.large (15.25 GB RAM, 2 vCPUs)
  • Purchasing: Spot instance (cost optimization)
  • AMI: Fedora 26 (ami-id from ec2.json)
  • Region: Configurable (us-east-1 default)
  • Security Group: SSH (port 22) enabled
  • Key Pair: btrfsqa-keypair

Storage Configuration:

  • Root Volume: Default AMI root (typically 8-10 GB)
  • Data Volumes: 6x EBS volumes (20 GB GP2 each)
    • /dev/xvdb: Primary test device
    • /dev/xvdc-xvdg: Scratch devices for multi-disk tests
  • Lifecycle: Deleted on termination

Resource Tagging:

{
  "Name": "btrfsqa-2024-11-18",
  "Project": "btrfsqa",
  "ManagedBy": "automation"
}

Network Architecture

Internet
   │
   ├─► Local Machine (Control Plane)
   │    ├── Outbound: AWS API (HTTPS)
   │    └── Outbound: SSH to EC2
   │
   └─► AWS Region (us-east-1)
        │
        ├─► EC2 Instance (Public Subnet)
        │    ├── Public IP: Dynamic (assigned at launch)
        │    ├── Inbound: SSH (port 22) from anywhere
        │    └── Outbound: Internet access (Git, package repos)
        │
        ├─► GitHub.com
        │    ├── Git clone (btrfs-devel, btrfsprogs, xfstests)
        │    └── Git push (results publishing)
        │
        └─► Asciinema.org
             └── Recording upload (HTTP POST)

Testing Framework

Test Suite Hierarchy

BTRFSQA Testing Pyramid

        ┌───────────────────────┐
        │   Integration Tests   │  Script 004: RAID5 Scrub
        │   (Specific Scenarios)│  • Targeted regression tests
        └───────────────────────┘  • Known bug validation
                 │
        ┌────────┴─────────┐
        │  Functional Tests │       Script 003: xfstests
        │  (Filesystem Ops) │       • 400+ test cases
        └────────────────────┘      • POSIX compliance
                 │                  • Stress testing
        ┌────────┴─────────┐        • Data integrity
        │  Unit Tests       │       Script 002: btrfsprogs
        │  (Userspace Tools)│       • Tool-specific tests
        └────────────────────┘      • CLI validation
                 │                  • Format verification
        ┌────────┴─────────┐
        │  Kernel Build     │       Script 001: Kernel
        │  (Base Layer)     │       • Compilation check
        └────────────────────┘      • Module loading

Test Script Details

Script 001: Kernel Development Build

Objective: Validate latest BTRFS kernel code compiles and boots

Steps:

  1. Clone btrfs-devel repository (misc-next branch)
  2. Copy custom kernel configuration
  3. Compile kernel (make -j4)
  4. Install kernel and modules
  5. Update bootloader configuration
  6. Reboot instance
  7. Verify new kernel loaded

Success Criteria:

  • Compilation completes without errors
  • Kernel boots successfully
  • BTRFS module loads
  • /tmp/001_btrfsdevel.completed created

Script 002: btrfsprogs Test Suite

Objective: Validate userspace tools functionality

Test Categories:

  • fsck-tests: Filesystem check and repair
  • cli-tests: Command-line interface
  • misc-tests: Miscellaneous utilities
  • fuzz-tests: Malformed input handling

Success Criteria:

  • All test categories pass
  • No crashes or hangs
  • /tmp/002_btrfsprogs.completed created

Script 003: xfstests

Objective: Comprehensive filesystem testing

Test Coverage:

  • File operations (create, read, write, delete)
  • Directory operations
  • Extended attributes
  • ACLs and permissions
  • Quotas
  • Snapshots and clones
  • Compression
  • Checksumming
  • RAID configurations
  • Error injection
  • Recovery scenarios

Configuration:

export TEST_DEV=/dev/xvdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV_POOL="/dev/xvdc /dev/xvdd /dev/xvde /dev/xvdf /dev/xvdg"
export SCRATCH_MNT=/mnt/scratch
export FSTYP=btrfs

Success Criteria:

  • Test suite completes
  • No kernel panics
  • Acceptable pass rate
  • /tmp/003_xfstests.completed created

Script 004: RAID5 Scrub Tests

Objective: Validate specific RAID5 functionality

Focus Areas:

  • RAID5/6 rebuild
  • Scrub operation
  • Data recovery
  • Parity verification

Success Criteria:

  • Specific test cases pass
  • No data corruption
  • /tmp/004_raid5_scrub.completed created

Publishing System

GitHub Pages Integration

Technology Stack:

  • Framework: Jekyll (static site generator)
  • Theme: Cayman (GitHub Pages default)
  • Hosting: GitHub Pages (automatic deployment)

File Structure:

btrfsqa/
├── _config.yml           # Jekyll configuration
│   ├── theme: jekyll-theme-cayman
│   └── title: BTRFSQA Dashboard
│
├── _layouts/
│   └── default.html      # Main page template
│       ├── Header: Project info
│       ├── Table: Test results (dynamically updated)
│       └── Footer: Known issues
│
├── results/              # Test execution results
│   └── (timestamped directories)
│
└── index.md              # Landing page content

Update Mechanism:

def update_htmltable():
    1. Clone repository to temp directory
    2. Read _layouts/default.html
    3. Generate new table rows for latest results
    4. Insert rows into HTML template
    5. Commit changes: "Update results - YYYY-MM-DD HH:MM"
    6. Push to origin/master
    7. GitHub Pages rebuilds (automatic, ~1 minute)

Results Presentation

Table Columns:

  1. Script Name: Test identifier (e.g., "001_btrfsdevel")
  2. Status: Visual badge (green PASS, red FAIL)
  3. Screencast: Embedded Asciinema player with thumbnail
  4. Logs: Download links for detailed output

Asciinema Integration:

<script
  id="asciicast-RECORDING_ID"
  src="https://asciinema.org/a/RECORDING_ID.js"
  async
  data-theme="monokai"
  data-size="small"
  data-cols="120"
></script>

Benefits:

  • No server infrastructure required
  • Automatic HTTPS
  • CDN distribution
  • Version controlled history
  • Zero operational cost

Security Considerations

Credentials Management

Sensitive Files (not in version control):

setup/config/
├── aws_auth.json     # AWS access keys
├── github.json       # GitHub credentials
└── netrc            # Git authentication

Git Configuration:

# .gitignore entries
setup/config/aws_auth.json
setup/config/github.json
setup/config/netrc
*.pem
*.key

Access Control

AWS Permissions Required:

  • ec2:RunInstances (spot instance creation)
  • ec2:TerminateInstances
  • ec2:CreateTags
  • ec2:DescribeInstances
  • ec2:DescribeVolumes
  • ec2:DeleteVolume

GitHub Permissions:

  • Repository write access (for results publishing)
  • Pages deployment (automatic with write access)

Network Security

EC2 Security Group:

  • Inbound: SSH (port 22) from 0.0.0.0/0
  • Outbound: All traffic allowed

Recommendations:

  • Restrict SSH to known IP ranges
  • Use IAM roles instead of access keys
  • Enable CloudTrail for audit logging
  • Implement GitHub deploy keys (read-only clones)

Extensibility

Adding New Test Scripts

Process:

  1. Create new script file: setup/scripts/00X_testname
  2. Make executable: chmod +x 00X_testname
  3. Follow completion protocol:
    # At end of script
    touch /tmp/00X_testname.completed
  4. Add timeout to setup/config/timeout.json:
    {
      "00X_testname": 60
    }
  5. Script will be automatically discovered and executed

Script Template:

#!/bin/bash
set -e  # Exit on error

# Test logic here
echo "Running custom test..."

# Signal completion
touch /tmp/00X_testname.completed

Configuration Customization

Common Modifications:

  1. Instance Type (ec2.json):

    {
      "InstanceType": "r5.xlarge"  # More CPU/RAM
    }
  2. Storage (btrfsqa.py:set_bdm()):

    # Add more volumes
    bdm.append({
        'DeviceName': '/dev/xvdh',
        'Ebs': {'VolumeSize': 50}
    })
  3. Timeout Adjustments (timeout.json):

    {
      "003_xfstests": 240  # Increase to 4 hours
    }

Plugin Architecture Opportunities

Future Extensibility:

  • Notification plugins: Email, Slack, PagerDuty alerts
  • Storage backends: S3, NFS for results
  • Test schedulers: Cron integration, webhook triggers
  • Results analyzers: Automated failure classification
  • Comparison tools: Regression detection across runs

Future Enhancements

Short-Term Improvements

  1. Error Handling:

    • Retry logic for transient failures
    • Partial result preservation on timeout
    • Email notifications on test failures
  2. Performance:

    • Parallel test execution (where safe)
    • Incremental kernel builds
    • Result compression
  3. Reporting:

    • Test duration tracking
    • Pass/fail rate graphs
    • Historical trend analysis

Medium-Term Features

  1. Multi-Kernel Testing:

    • Test multiple kernel versions per run
    • Comparison matrix
    • Regression bisection
  2. Custom Test Configurations:

    • Parameterized xfstests runs
    • Mount option variations
    • Feature flag combinations
  3. Integration:

    • GitHub webhook triggers
    • PR comment integration
    • Slack notifications

Long-Term Vision

  1. Distributed Testing:

    • Multi-region execution
    • Parallel instance testing
    • Load balancing
  2. Advanced Analytics:

    • ML-based failure prediction
    • Automatic bug categorization
    • Performance regression detection
  3. Community Features:

    • Public API for results
    • Custom test submission
    • Comparison with community runs

Conclusion

BTRFSQA provides a robust, automated testing infrastructure for BTRFS development. Its design emphasizes automation, transparency, and cost-effectiveness while maintaining extensibility for future enhancements. The system successfully bridges kernel development with public quality assurance, enabling the BTRFS community to track stability and progress over time.