gRPC-Relay
gRPC-Relay is a cross-domain communication relay system designed to establish a secure, controllable, and high-performance gRPC channel between internal devices and external controllers.
It is intended for the following scenarios:
- Internal devices managed by a public-network or office-network Controller through a Relay
- Bidirectional streaming data transfer, including control commands and file/data uploads
- MQTT-based device online/offline notifications, status reporting, and telemetry
- gRPC-based online device discovery and streaming relay capabilities
- Background and Goals
- Core Roles
- System Architecture
- Core Workflows
- API Design
- Security and Authorization Model
- Non-Functional Requirements
- CI/CD
- Deployment and Operations
- Testing Strategy
- MVP Scope and Roadmap
- References
The core goal of gRPC-Relay is to provide cross-domain gRPC relay capability so that devices inside private networks, without public IP addresses, can still be accessed and managed securely by external controllers.
- Controllable relay: Relay sees metadata only and never decrypts business payloads
- End-to-end encryption: Business data is encrypted/decrypted only by Device and Controller
- Availability-first baseline: Deliver a single-node MVP first, then expand to multi-node
- Observable by default: Built-in health checks, metrics, logs, audit, and tracing
- Transport roadmap: Use HTTP/2 today and keep QUIC as the v2 low-latency transport target
| Role | Description | Responsibilities |
|---|---|---|
| Device | Physical device such as an IoT device or workstation | Runs a Device Agent and executes business logic |
| Device Agent | Agent process running on the device | Maintains long-lived connection with Relay, registers, heartbeats, reconnects, reports status |
| Controller | Human-operated control system | Discovers devices, initiates sessions, sends control commands, receives responses |
| Relay | Relay server | Manages long-lived connections, forwards traffic, publishes notifications, provides query APIs |
| MQTT Broker | Message broker | Transmits telemetry data and device online/offline notifications |
| Link | Protocol | Purpose |
|---|---|---|
| Device ↔ Relay | gRPC over HTTP/2 today; QUIC in v2 | Long-lived device connection |
| Controller ↔ Relay | gRPC over HTTP/2 + TLS 1.3 | Controller access and querying |
| Relay ↔ MQTT Broker | MQTT over TLS 1.3 | Device notifications and telemetry |
| Fallback | TLS/TCP | Used as the TCP transport baseline |
- Relay handles metadata, authentication, authorization, rate limiting, and stream forwarding only
- Business payloads between Device and Controller are end-to-end encrypted
- MQTT Broker is deployed independently and decoupled from Relay
- The first release uses a single Relay node, with multi-node and load balancing in later versions
- Device starts the Device Agent
- Device Agent connects to Relay
- Relay verifies device identity
- Relay assigns a `connection_id`
- Relay publishes a device online event to MQTT
- Device Agent may optionally publish its own status as backup validation
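The registration steps above can be sketched as a small in-memory registry. This is an illustration only: `DeviceRegistry`, `OnlineEvent`, and the use of a `u64` counter for `connection_id` are assumptions, and both identity verification and the actual MQTT publish are left out.

```rust
use std::collections::HashMap;

/// Event the Relay would publish to MQTT once a device is registered.
/// (Hypothetical type; the real event payload is not specified here.)
#[derive(Debug, PartialEq, Eq)]
pub struct OnlineEvent {
    pub device_id: String,
    pub connection_id: u64,
}

/// Hypothetical in-memory registry of connected devices.
#[derive(Default)]
pub struct DeviceRegistry {
    next_connection_id: u64,
    /// device_id -> connection_id
    connections: HashMap<String, u64>,
}

impl DeviceRegistry {
    /// Registers a device whose identity has already been verified,
    /// assigns a fresh connection_id, and returns the event to publish.
    pub fn register(&mut self, device_id: &str) -> OnlineEvent {
        self.next_connection_id += 1;
        let connection_id = self.next_connection_id;
        self.connections.insert(device_id.to_string(), connection_id);
        OnlineEvent {
            device_id: device_id.to_string(),
            connection_id,
        }
    }

    /// Looks up the current connection_id of a device, if it is connected.
    pub fn connection_id(&self, device_id: &str) -> Option<u64> {
        self.connections.get(device_id).copied()
    }
}
```

A caller would forward the returned `OnlineEvent` to whatever MQTT client the Relay uses.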
- Device Agent sends a heartbeat every 30 seconds
- Relay updates the device's `last_seen`
- If no heartbeat is received for 120 seconds, the device is marked as suspected offline
- If no heartbeat is received for 300 seconds, Relay closes the connection and publishes an offline event
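The heartbeat thresholds above map naturally onto a three-state classification. The sketch below uses the 30, 120, and 300 second values from the list; the `Liveness` enum, the `classify` function, and the choice to treat each threshold as inclusive are assumptions made for illustration.

```rust
use std::time::Duration;

/// Interval at which the Device Agent sends heartbeats.
pub const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(30);
/// Silence after which a device is marked as suspected offline.
pub const SUSPECT_AFTER: Duration = Duration::from_secs(120);
/// Silence after which the Relay closes the connection and publishes offline.
pub const OFFLINE_AFTER: Duration = Duration::from_secs(300);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Liveness {
    Online,
    SuspectedOffline,
    Offline,
}

/// Classifies a device by the time elapsed since its `last_seen` update.
/// Thresholds are treated as inclusive (an assumption).
pub fn classify(since_last_seen: Duration) -> Liveness {
    if since_last_seen >= OFFLINE_AFTER {
        Liveness::Offline
    } else if since_last_seen >= SUSPECT_AFTER {
        Liveness::SuspectedOffline
    } else {
        Liveness::Online
    }
}
```

With these values a device can miss three consecutive heartbeats and still be considered online; the fourth missed heartbeat coincides with the 120 second threshold.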
Three complementary discovery methods are supported:
- Relay publishes online/offline events through MQTT
- Device Agent reports status through MQTT
- Controller queries the online device list through gRPC
- Controller obtains target device information
- Controller connects to Relay and specifies `target_device_id`
- Relay verifies Controller identity and permissions
- Relay creates a stream mapping between Controller and Device
- Relay starts forwarding bidirectional stream data
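One way to picture the stream mapping in the steps above is a routing table keyed by stream ID. The sketch below is hypothetical: `StreamRouter`, its method names, and the numeric IDs are assumptions, the authorization decision is passed in as a boolean, and an unknown device is reported as offline for simplicity (the documented status codes also include `DEVICE_NOT_FOUND`).

```rust
use std::collections::HashMap;

/// Rejections named after the status codes UNAUTHORIZED and DEVICE_OFFLINE.
#[derive(Debug, PartialEq, Eq)]
pub enum RouteError {
    Unauthorized,
    DeviceOffline,
}

/// Hypothetical in-memory routing table for relayed streams.
#[derive(Default)]
pub struct StreamRouter {
    /// device_id -> connection_id of the device's long-lived connection
    online: HashMap<String, u64>,
    /// stream_id -> (controller_id, device connection_id)
    streams: HashMap<u64, (String, u64)>,
    next_stream_id: u64,
}

impl StreamRouter {
    /// Records that a device is connected under the given connection_id.
    pub fn device_online(&mut self, device_id: &str, connection_id: u64) {
        self.online.insert(device_id.to_string(), connection_id);
    }

    /// Checks authorization first, then resolves the target device and
    /// creates the Controller-to-Device mapping. Returns the new stream ID.
    pub fn open_stream(
        &mut self,
        controller_id: &str,
        target_device_id: &str,
        authorized: bool,
    ) -> Result<u64, RouteError> {
        if !authorized {
            return Err(RouteError::Unauthorized);
        }
        let connection_id = *self
            .online
            .get(target_device_id)
            .ok_or(RouteError::DeviceOffline)?;
        self.next_stream_id += 1;
        self.streams
            .insert(self.next_stream_id, (controller_id.to_string(), connection_id));
        Ok(self.next_stream_id)
    }

    /// Returns the device connection that frames on this stream go to.
    pub fn device_connection(&self, stream_id: u64) -> Option<u64> {
        self.streams.get(&stream_id).map(|(_, conn)| *conn)
    }
}
```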
- Device reconnects automatically after disconnection
- Reconnect requests include `previous_connection_id`
- Relay attempts to restore the session within the recovery window
- If recovery fails, a new session is created and a new `connection_id` is assigned
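The recovery decision described above can be sketched as a pure function. The length of the recovery window is not stated here, so it is a parameter; `SuspendedSession`, `ReconnectOutcome`, and `handle_reconnect` are hypothetical names introduced for this example.

```rust
use std::time::{Duration, Instant};

/// Session state the Relay keeps after a device disconnects (hypothetical).
pub struct SuspendedSession {
    pub connection_id: u64,
    pub disconnected_at: Instant,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReconnectOutcome {
    /// The previous session was restored under its old connection_id.
    Restored(u64),
    /// Recovery failed; a new session with a new connection_id was created.
    New(u64),
}

/// Restores the session only if the reconnect request names the suspended
/// connection and arrives within the recovery window.
pub fn handle_reconnect(
    suspended: Option<&SuspendedSession>,
    previous_connection_id: Option<u64>,
    now: Instant,
    recovery_window: Duration,
    fresh_connection_id: u64,
) -> ReconnectOutcome {
    if let (Some(session), Some(previous)) = (suspended, previous_connection_id) {
        let elapsed = now.saturating_duration_since(session.disconnected_at);
        if session.connection_id == previous && elapsed <= recovery_window {
            return ReconnectOutcome::Restored(previous);
        }
    }
    ReconnectOutcome::New(fresh_connection_id)
}
```

Passing `now` explicitly keeps the function deterministic, which makes the window boundary easy to cover in unit tests.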
- Requests carry a globally unique `sequence_number`
- Relay caches recently processed sequence numbers
- Duplicate requests return cached responses to avoid repeated execution
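A minimal sketch of the deduplication idea follows, assuming `sequence_number` is a `u64` and responses are opaque bytes. `IdempotencyCache` is a hypothetical type that evicts the oldest entry once a fixed capacity is reached; a TTL, which the configuration also mentions, is omitted to keep the example short.

```rust
use std::collections::{HashMap, VecDeque};

/// Bounded cache of responses keyed by sequence_number (hypothetical type).
pub struct IdempotencyCache {
    capacity: usize,
    /// Insertion order, oldest first, used for eviction.
    order: VecDeque<u64>,
    responses: HashMap<u64, Vec<u8>>,
}

impl IdempotencyCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            responses: HashMap::new(),
        }
    }

    /// Runs `handler` only for unseen sequence numbers. Returns the response
    /// and a flag that is true when the response came from the cache.
    pub fn execute<F>(&mut self, sequence_number: u64, handler: F) -> (Vec<u8>, bool)
    where
        F: FnOnce() -> Vec<u8>,
    {
        if let Some(cached) = self.responses.get(&sequence_number) {
            return (cached.clone(), true);
        }
        let response = handler();
        if self.capacity > 0 {
            // Evict the oldest entry when the cache is full.
            if self.order.len() == self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.responses.remove(&oldest);
                }
            }
            self.order.push_back(sequence_number);
            self.responses.insert(sequence_number, response.clone());
        }
        (response, false)
    }
}
```

Once an entry is evicted, a late duplicate is executed again, so the capacity has to cover the longest retry interval a client is expected to use.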
Core services include:
- `DeviceConnect(stream DeviceMessage) returns (stream RelayMessage)`
- `ListOnlineDevices(ListOnlineDevicesRequest) returns (ListOnlineDevicesResponse)`
- `ConnectToDevice(stream ControllerMessage) returns (stream DeviceResponse)`
- `RevokeToken(RevokeTokenRequest) returns (RevokeTokenResponse)`
Core message types include:
- `DeviceMessage`: device registration, heartbeat, data reporting
- `RelayMessage`: registration response, heartbeat response, data request
- `ControllerMessage`: request from controller to device
- `DeviceResponse`: response from device
- `ListOnlineDevicesRequest/Response`: online device query
- `RevokeTokenRequest/Response`: admin token revocation
| Topic | Purpose |
|---|---|
| `relay/device/online` | Device online notification |
| `relay/device/offline` | Device offline notification |
| `device/{device_id}/status` | Device self-reported status |
| `telemetry/{device_id}` | Device telemetry data |
| `telemetry/relay/{relay_id}` | Relay telemetry data |
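The topic templates above can be filled in with small helper functions. The sketch below is an illustration, not the project's API: the function names are assumptions, and the helpers reject IDs that are empty or contain `/`, `+`, or `#`, because those characters are the MQTT topic level separator and wildcards and would change the topic's meaning.

```rust
pub const DEVICE_ONLINE_TOPIC: &str = "relay/device/online";
pub const DEVICE_OFFLINE_TOPIC: &str = "relay/device/offline";

/// An ID is usable as a single topic level if it is non-empty and contains
/// no MQTT separator ('/') or wildcard ('+', '#') characters.
fn is_valid_segment(id: &str) -> bool {
    !id.is_empty() && !id.chars().any(|c| matches!(c, '/' | '+' | '#'))
}

/// Builds `device/{device_id}/status`.
pub fn device_status_topic(device_id: &str) -> Option<String> {
    is_valid_segment(device_id).then(|| format!("device/{device_id}/status"))
}

/// Builds `telemetry/{device_id}`.
pub fn device_telemetry_topic(device_id: &str) -> Option<String> {
    is_valid_segment(device_id).then(|| format!("telemetry/{device_id}"))
}

/// Builds `telemetry/relay/{relay_id}`.
pub fn relay_telemetry_topic(relay_id: &str) -> Option<String> {
    is_valid_segment(relay_id).then(|| format!("telemetry/relay/{relay_id}"))
}
```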
- `OK`
- `DEVICE_OFFLINE`
- `UNAUTHORIZED`
- `DEVICE_NOT_FOUND`
- `RATE_LIMITED`
- `INTERNAL_ERROR`
- Device: mTLS device certificates are recommended, with pre-provisioned tokens as an alternative
- Controller: HS256 JWT authentication with `controller_id`, `role`, allowed projects, expiry, issuer, and audience claims
The system uses RBAC + device ownership:
- `admin`: access all devices
- `operator`: access authorized devices and perform control/data transfer
- `viewer`: read-only access
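The role rules above can be expressed as a single authorization check. This sketch makes two assumptions that are not stated here: ownership is modeled as a per-controller list of authorized device IDs (the JWT claims suggest project-level grants in practice), and a `viewer` can read only devices it is authorized for. All type and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    Operator,
    Viewer,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Read-only operations such as querying device state.
    Read,
    /// Control commands and data transfer.
    Control,
}

pub struct ControllerIdentity<'a> {
    pub role: Role,
    /// Devices this controller may access (simplified ownership model).
    pub authorized_devices: &'a [&'a str],
}

/// Combines the role check with the device ownership check.
pub fn is_allowed(controller: &ControllerIdentity, device_id: &str, action: Action) -> bool {
    let owns_device = controller.authorized_devices.iter().any(|d| *d == device_id);
    match controller.role {
        Role::Admin => true,
        Role::Operator => owns_device,
        Role::Viewer => owns_device && action == Action::Read,
    }
}
```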
- All connections must use TLS 1.3
- Business payloads must be end-to-end encrypted
- Relay must not log encrypted payload contents
- Rate limiting must apply at device, Controller, and global levels
- Metadata such as `device_id`, `controller_id`, and `method_name` must be validated
- Admin Controllers can revoke Controller or Device tokens through the gRPC `RevokeToken` API; the current MVP/P1 implementation keeps revocations in Relay memory
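Keeping revocations in Relay memory can be pictured as a set lookup on every authenticated request. The sketch below is an assumption about shape only: `RevocationList` and the idea of keying by a token identifier string are hypothetical.

```rust
use std::collections::HashSet;

/// In-memory set of revoked token identifiers (hypothetical type).
#[derive(Default)]
pub struct RevocationList {
    revoked: HashSet<String>,
}

impl RevocationList {
    /// Marks a token as revoked. Returns true if it was not revoked before.
    pub fn revoke(&mut self, token_id: &str) -> bool {
        self.revoked.insert(token_id.to_string())
    }

    /// Checked during authentication, before any role or ownership check.
    pub fn is_revoked(&self, token_id: &str) -> bool {
        self.revoked.contains(token_id)
    }
}
```

A purely in-memory set like this one would not survive a process restart, which is worth keeping in mind when choosing token lifetimes.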
| Metric | Target |
|---|---|
| Single-instance long-lived connections | 10,000 |
| Concurrent active streams | 1,000 |
| Relay additional hop latency | P50 < 5ms, P99 < 20ms |
| Maximum single-stream bandwidth | 10 MB/s |
| Memory budget | < 2 GB for 10K connections |
| CPU usage | < 80% at 10K connections and 1K active streams |
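As a quick sanity check on the memory target, the budget can be divided by the connection count. The sketch below assumes decimal units (1 GB = 10^9 bytes); the constant and function names are illustrative.

```rust
/// Memory budget from the table above, assuming decimal gigabytes.
pub const MEMORY_BUDGET_BYTES: u64 = 2_000_000_000;
/// Target number of long-lived connections on a single instance.
pub const MAX_CONNECTIONS: u64 = 10_000;

/// Average memory available per connection under the stated budget.
pub const fn per_connection_budget_bytes() -> u64 {
    MEMORY_BUDGET_BYTES / MAX_CONNECTIONS
}
```

Under these assumptions each connection has about 200 KB on average for buffers, session state, and bookkeeping.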
- Service availability: 99.9%
- Device reconnect time: < 10 seconds
- Session recovery success rate: > 95%
- MTTR: < 5 minutes
The system provides:
- `/health` health check (with component-level status)
- Full Prometheus `/metrics` endpoint (connection, stream, latency, error, resource metrics)
- Structured JSON logging (via `tracing-subscriber`)
- Audit logging (auth events, connections, rate limits, errors)
- OpenTelemetry distributed tracing (OTLP exporter, configurable sampling)
- MQTT relay telemetry publishing
- Built-in alerting engine (CPU, memory, MQTT, connection thresholds)
Three GitHub Actions workflows automate quality checks, releases, and publishing.
| Workflow | Trigger | What it does |
|---|---|---|
| CI | push (master/main), PR (master/main), tag, manual | cargo fmt --check → cargo clippy → cargo check → unit tests + integration tests → coverage (80% threshold) → Docker build |
| Create Release | manual (`workflow_dispatch`) | Validates version vs `Cargo.toml`, runs full test suite, builds release binary, verifies `relay --version`, creates git tag, generates categorized release notes, creates GitHub release, triggers Release |
| Release | `release: published` | Publishes `relay-proto` to crates.io, waits for index propagation, publishes `relay-agent-sdk` and `relay-controller-sdk`, builds and pushes Docker image to GHCR |
`prepare-release.sh` (local: bumps version, opens a PR) → PR merge (CI validates on the branch) → `create-release.yml` (tag + GitHub release) → `release.yml` (auto: crates.io + GHCR Docker image)
See doc/RELEASE.md for the full release process, including SemVer guidance, rollback procedures, and troubleshooting.
See deploy/README.md for the full deployment documentation covering Docker Compose, bare-metal, Kubernetes, Prometheus, and Grafana. See deploy/BUILD.md for manual build instructions (binary and Docker image).
Pre-built Docker images are published to `ghcr.io/cokkiy/grpc-relay` on every release, so no local Rust toolchain is required to run the relay.
| Method | Directory | What's included |
|---|---|---|
| Docker | `Dockerfile`, `docker-compose.yml`, `deploy/docker/` | Pre-built GHCR image, Compose with MQTT + Prometheus + Grafana + Jaeger |
| Bare Metal | `deploy/bare-metal/` | systemd service, install/uninstall/upgrade scripts, env template |
| Kubernetes | `deploy/kubernetes/` | Deployment, Service, ConfigMap, Secret, HPA, NetworkPolicy, PDB, ServiceAccount, Namespace, Kustomization |
| Component | Path | Purpose |
|---|---|---|
| Grafana | `deploy/grafana/` | Pre-built relay-overview dashboard + Prometheus datasource |
| Prometheus | `deploy/prometheus/` | Scrape config targeting relay metrics endpoint |
| MQTT Broker | `docker-compose.yml` | Eclipse Mosquitto service for local Docker deployments |
| Port | Protocol | Purpose |
|---|---|---|
| `50051` | TCP | gRPC (HTTP/2) |
| `50052` | UDP | gRPC over QUIC (v2.0) |
| `8080` | TCP | `/health` and `/metrics` |
| `8883` | TCP | MQTT over TLS |
The relay server is configured via a single YAML file (example). Key sections:
| Section | Contents |
|---|---|
| `relay` | id, address, QUIC address, max connections, heartbeat interval |
| `relay.stream` | idle timeout, max active streams, per-controller limits |
| `relay.rate_limiting` | per-device/controller/global request + connection + bandwidth limits, CPU/memory thresholds |
| `relay.idempotency` | cache capacity + TTL |
| `relay.auth` | enable flag, token maps (device + controller), method whitelist, JWT config |
| `relay.mqtt` | enable flag, broker address, credentials, telemetry interval, reconnect config |
| `relay.tls` | enable flag, cert/key/CA paths |
| `observability` | logging level/format, health bind, audit config, OpenTelemetry tracing, alerting rules |
Coverage includes:
- Authentication and authorization
- Sequence number deduplication
- Session management
- Rate limiting
- Error handling
Coverage includes:
- Device connection and registration
- Controller session initiation
- Bidirectional data transfer
- Device reconnect and session recovery
- MQTT notifications and queries
- Authentication failure handling
- Authorization rejection handling
- Rate limit triggering
- 10K concurrent connections
- 1K concurrent active streams
- Latency target validation
- Long-running stability validation
- Unauthenticated access
- Forged tokens
- Cross-device privilege escalation
- DDoS simulation
- Large payload attacks
- Replay attacks
The first release focuses on:
- Device ↔ Relay HTTP/2 connection, with QUIC deferred to v2
- Controller ↔ Relay HTTP/2 connection
- Bidirectional stream relay
- Registration, heartbeat, reconnect, offline handling
- MQTT online/offline notifications
- Controller online device query
- RBAC authorization
- Idempotency
- End-to-end encryption
- Basic rate limiting and input validation
- Metrics, logs, and audit
- Relay telemetry
- Health checks
- Docker / Kubernetes deployment
- v1.1: Session persistence and stronger recovery
- v1.2: Multi-Relay nodes, high availability, load balancing
- v2.0: Controller QUIC, connection migration, 0-RTT, ABAC
- gRPC Official Documentation
- QUIC RFC 9000
- MQTT v5.0 Specification
- OpenTelemetry Documentation
- Prometheus Best Practices
This README was created based on the following project documents:
- `doc/requirements.md`
- `doc/architecture.md`
- `doc/action_plan.md`
- `doc/RELEASE.md`
- `doc/v1.0_release_summary.md`
It is intended as a user-facing entry document that emphasizes project overview, architecture, and implementation path.