RFC: Running Nebraska as a distributed service with PSQL logical replication #1375

## Status

We now have a design doc proposed in PR #1405. It's the most up-to-date source for the design. This issue still has the original proposal below, with a few sections updated to match the current direction.

## Problem

We want to host Nebraska as a managed service. That means running multiple instances across different regions, each backed by its own PostgreSQL database, so that nodes reach the nearest instance with low latency and the service can scale.

Nebraska today assumes a single instance with a single database. Even Omaha update checks (`POST /v1/update/`) write telemetry to the database, so read-only replicas are not an option. Each region needs its own writable database.

The challenge is keeping publisher metadata (apps, channels, packages, groups) consistent across all regional databases while letting each region write telemetry locally.

## How Nebraska's schema already helps

Nebraska's tables split almost cleanly into two main categories, plus a small set of reference tables:

Admin tables (publisher metadata, must be consistent everywhere): `application`, `package`, `channel`, `groups`, `team`, `users`, `flatcar_action`, `package_channel_blacklist`, `package_file`

Runtime tables (per-region telemetry, stays local): `instance`, `instance_application`, `instance_status_history`, `event`, `activity`, `instance_stats`

Reference / seed tables (identical on every node, populated by migrations, not replicated): `event_type`

All foreign keys point from runtime tables to admin tables, never the other direction. This makes PostgreSQL logical replication a natural fit: replicate admin tables from a primary database to regional subscriber databases, while each region writes runtime data locally.

Two tables mix admin and runtime concerns and need to be split first: `groups` (admin policy fields plus runtime-mutated `rollout_in_progress`) and `activity` (admin-originated rows for channel package updates plus runtime-originated rows for rollout lifecycle and instance update results). See changes 1 and 2 under Proposal below.
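As a rough illustration of the replication setup, the sketch below builds the `CREATE PUBLICATION` and `CREATE SUBSCRIPTION` statements for the admin tables listed above. The publication name, subscription name, and connection string are hypothetical placeholders, and the helper functions are not part of Nebraska.

```go
package main

import (
	"fmt"
	"strings"
)

// adminTables lists the admin tables named in this proposal.
var adminTables = []string{
	"application", "package", "channel", "groups", "team", "users",
	"flatcar_action", "package_channel_blacklist", "package_file",
}

// quoteIdent double-quotes a PostgreSQL identifier, doubling embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// PublicationSQL builds the statement to run on the primary database.
func PublicationSQL(pubName string, tables []string) string {
	quoted := make([]string, len(tables))
	for i, t := range tables {
		quoted[i] = quoteIdent(t)
	}
	return fmt.Sprintf("CREATE PUBLICATION %s FOR TABLE %s;",
		quoteIdent(pubName), strings.Join(quoted, ", "))
}

// SubscriptionSQL builds the statement to run on a regional subscriber database.
func SubscriptionSQL(subName, connInfo, pubName string) string {
	// SQL string literals are single-quoted; embedded single quotes are doubled.
	conn := "'" + strings.ReplaceAll(connInfo, "'", "''") + "'"
	return fmt.Sprintf("CREATE SUBSCRIPTION %s CONNECTION %s PUBLICATION %s;",
		quoteIdent(subName), conn, quoteIdent(pubName))
}

func main() {
	// All names below are hypothetical.
	fmt.Println(PublicationSQL("nebraska_admin", adminTables))
	fmt.Println(SubscriptionSQL("nebraska_admin_edge",
		"host=primary.example dbname=nebraska", "nebraska_admin"))
}
```

Runtime tables are deliberately absent from the publication, so telemetry never leaves the region that wrote it.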

## Proposal

Control node: one Nebraska instance that accepts admin API writes and replicates publisher metadata to all edges.

Edge nodes: regional Nebraska instances that serve Omaha update checks and write telemetry locally. Admin API calls return 403. The syncer is disabled.

```
Control                          Edge (per-region)
┌──────────────────┐         ┌──────────────────┐
│ Nebraska Control │         │ Nebraska Edge    │
│ admin.Service    │         │ runtime.Service  │
│ runtime.Service  │         │ (no admin writes)│
└────────┬─────────┘         └────────┬─────────┘
         │                            │
    ┌────┴─────┐                 ┌────┴────┐
    │Primary DB│ ──logical──>    │ Sub DB  │
    │ admin    │  replication    │ admin RO│
    │runtime   │  (admin only)   │runtime  │
    └──────────┘                 └─────────┘
```

The specific changes:

### 1. Split `groups` into admin and runtime (#1396)

Introduce a node-local `group_local` sidecar holding `rollout_in_progress` and a nullable override for each admin policy field on `groups`. Reads return the override if set, otherwise the admin default. After this, `groups` is a pure admin table safe to replicate.
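The override-or-default read could look roughly like the sketch below. The struct and field names are illustrative stand-ins rather than the actual Nebraska types; a nil pointer models a NULL override column in `group_local`.

```go
package main

import "fmt"

// GroupPolicy stands in for the admin policy fields on groups (illustrative names).
type GroupPolicy struct {
	UpdatesEnabled      bool
	MaxUpdatesPerPeriod int
}

// GroupLocal stands in for a group_local row. A nil pointer models a NULL override.
type GroupLocal struct {
	RolloutInProgress   bool
	UpdatesEnabled      *bool
	MaxUpdatesPerPeriod *int
}

// coalesce returns the override when it is set, otherwise the admin default.
func coalesce[T any](override *T, def T) T {
	if override != nil {
		return *override
	}
	return def
}

// EffectivePolicy layers node-local overrides over the replicated admin defaults.
func EffectivePolicy(admin GroupPolicy, local GroupLocal) GroupPolicy {
	return GroupPolicy{
		UpdatesEnabled:      coalesce(local.UpdatesEnabled, admin.UpdatesEnabled),
		MaxUpdatesPerPeriod: coalesce(local.MaxUpdatesPerPeriod, admin.MaxUpdatesPerPeriod),
	}
}

func main() {
	limit := 5
	admin := GroupPolicy{UpdatesEnabled: true, MaxUpdatesPerPeriod: 50}
	local := GroupLocal{MaxUpdatesPerPeriod: &limit} // only one field overridden
	fmt.Printf("%+v\n", EffectivePolicy(admin, local))
}
```

The same resolution could also be done in SQL with `COALESCE` over a join; the Go form is shown only to make the rule explicit.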

### 2. Split `activity` into admin and runtime (#1398)

The single `activity` table holds two kinds of rows: admin-originated (channel package updates, class=6) and runtime-originated (rollout lifecycle, instance update results, classes 1–5). Split into `admin_activity` (replicated control → edge) and `activity` (runtime, local per node), with an `all_activity` UNION ALL view so existing readers are unchanged. `activity.id` is converted from `serial` to `uuid` so each node can generate ids without coordination. This split is what makes the database-level admin/runtime enforcement (see Enforcement layers below) work for activity: the runtime role can write to `activity` but not `admin_activity`.
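To illustrate why a `uuid` id removes the need for coordination, here is a minimal sketch of generating a random (version 4) UUID on a node using only the standard library. The function name is hypothetical, and the actual change may instead rely on a database default such as `gen_random_uuid()` or on a UUID library.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// NewActivityID returns a random (version 4) UUID in canonical text form.
// No node needs to consult another node or a shared sequence to produce one.
func NewActivityID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := NewActivityID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

With a `serial` column, two nodes would hand out the same integers independently, which is what makes the conversion necessary.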

### 3. Sub-package structure for compile-time enforcement

Split `pkg/api/` into sub-packages:

- `admin/` - write operations for admin tables (control node only)
- `runtime/` - write operations for runtime tables (all instances)
- `dbreads/` - read operations shared by both (all instances)

```
pkg/api/
├── admin/          # Write operations for admin tables (control node only)
│   ├── service.go
│   ├── applications.go
│   ├── channels.go
│   ├── groups.go
│   ├── packages.go
│   ├── teams.go
│   ├── users.go
│   ├── actions.go
│   └── admin_activity.go
├── runtime/        # Write operations for runtime tables (all instances)
│   ├── service.go
│   ├── events.go
│   ├── instances.go
│   ├── updates.go
│   ├── group_local.go
│   └── activity.go
├── dbreads/        # Read operations (all instances)
│   ├── queries.go
│   ├── applications.go
│   ├── channels.go
│   ├── groups.go
│   ├── instances.go
│   ├── packages.go
│   └── activity.go
├── internal/shared/ # Constants and helpers shared across sub-packages
├── api.go          # Core types, DB connection, migrations
└── db/migrations/
```

`admin.Service` and `runtime.Service` both embed `dbreads.Queries` for reads. `internal/shared/` holds constants and helpers needed by multiple sub-packages. Go package boundaries prevent `runtime` from calling `admin` methods and vice versa.
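The embedding relationship can be sketched in a single file as below. In the proposed layout these types would live in separate packages (`dbreads`, `admin`, `runtime`); the type and method names here are simplified stand-ins, and a shared map plays the role of the database.

```go
package main

import "fmt"

// Queries stands in for dbreads.Queries: read-only operations.
type Queries struct{ apps map[string]string }

// GetApp looks up an application name by id.
func (q *Queries) GetApp(id string) (string, bool) {
	name, ok := q.apps[id]
	return name, ok
}

// AdminService stands in for admin.Service: reads plus admin writes.
type AdminService struct{ *Queries }

// AddApp is an admin write.
func (s *AdminService) AddApp(id, name string) { s.apps[id] = name }

// RuntimeService stands in for runtime.Service: reads plus runtime writes.
type RuntimeService struct {
	*Queries
	events []string
}

// RegisterEvent is a runtime write that validates against admin data via a read.
func (s *RuntimeService) RegisterEvent(appID string) error {
	if _, ok := s.GetApp(appID); !ok {
		return fmt.Errorf("unknown app %q", appID)
	}
	s.events = append(s.events, appID)
	return nil
}

func main() {
	q := &Queries{apps: map[string]string{}}
	admin := &AdminService{Queries: q}
	rt := &RuntimeService{Queries: q}

	admin.AddApp("a1", "flatcar")
	fmt.Println(rt.RegisterEvent("a1"), rt.RegisterEvent("missing"))
	// rt.AddApp("a2", "x") would not compile: RuntimeService has no such method.
}
```

Both services get the read methods through embedding, while each write method exists on exactly one type.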

### 4. Instance mode

The `NEBRASKA_INSTANCE_MODE` environment variable selects the mode. When set to `edge`, admin writes return 403 and the syncer is disabled. When unset, everything works as it does today, so the change is fully backward compatible. The accepted values are a small validated allowlist (`control`, `edge`, or unset/`single`); any other value fails the process at startup with a clear error.
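A minimal sketch of the allowlist validation follows. The type and function names are hypothetical, and the sketch assumes values are matched case-sensitively with no trimming, which the proposal does not specify.

```go
package main

import (
	"fmt"
	"os"
)

// InstanceMode is the validated value of NEBRASKA_INSTANCE_MODE.
type InstanceMode string

const (
	ModeSingle  InstanceMode = "single"
	ModeControl InstanceMode = "control"
	ModeEdge    InstanceMode = "edge"
)

// ParseInstanceMode maps the raw environment value onto the allowlist.
// An empty (unset) value is treated the same as "single".
func ParseInstanceMode(raw string) (InstanceMode, error) {
	switch InstanceMode(raw) {
	case "", ModeSingle:
		return ModeSingle, nil
	case ModeControl, ModeEdge:
		return InstanceMode(raw), nil
	default:
		return "", fmt.Errorf(
			"invalid NEBRASKA_INSTANCE_MODE %q: want control, edge, single, or unset", raw)
	}
}

func main() {
	mode, err := ParseInstanceMode(os.Getenv("NEBRASKA_INSTANCE_MODE"))
	if err != nil {
		// Fail at startup rather than guessing a mode.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("instance mode:", mode)
}
```

Failing fast on an unknown value avoids a typo such as `egde` silently starting a node that accepts admin writes.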

### 5. Operator-provisioned database roles

The operator provisions least-privilege Postgres roles for admin and runtime serving. Nebraska reads connection strings from environment variables; it does not create, drop, or rotate roles. See §4.5 of the design doc (#1405) for the full operator contract.
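For orientation only, the sketch below prints the kind of `GRANT` statements an operator might apply for a runtime serving role, using the table lists from this proposal. It is not the operator contract from the design doc: the role name is hypothetical, and sequence privileges and any tables introduced by the splits are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Table lists taken from this proposal; event_type is the seed table.
var (
	readOnlyForRuntime = []string{
		"application", "package", "channel", "groups", "team", "users",
		"flatcar_action", "package_channel_blacklist", "package_file", "event_type",
	}
	writableForRuntime = []string{
		"instance", "instance_application", "instance_status_history",
		"event", "activity", "instance_stats",
	}
)

// quoteIdent double-quotes a PostgreSQL identifier, doubling embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// GrantSQL builds one GRANT statement for the given privileges, tables, and role.
func GrantSQL(privileges, tables []string, role string) string {
	quoted := make([]string, len(tables))
	for i, t := range tables {
		quoted[i] = quoteIdent(t)
	}
	return fmt.Sprintf("GRANT %s ON TABLE %s TO %s;",
		strings.Join(privileges, ", "), strings.Join(quoted, ", "), quoteIdent(role))
}

func main() {
	role := "nebraska_runtime" // hypothetical role name
	fmt.Println(GrantSQL([]string{"SELECT"}, readOnlyForRuntime, role))
	fmt.Println(GrantSQL([]string{"SELECT", "INSERT", "UPDATE", "DELETE"},
		writableForRuntime, role))
}
```

Because the role is never granted write privileges on admin tables, a bug in runtime code surfaces as a permission error instead of a diverged subscriber.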

### Schema migrations

Schema changes follow a variant of the [expand–contract pattern](https://www.tim-wellhausen.de/papers/ExpandAndContract/ExpandAndContract.html) adapted to the replication scenario:

- Additive changes (e.g. new nullable columns, new tables, relaxed constraints) are applied to subscribers before the primary.
- Subtractive changes (e.g. drops, tightened constraints) are applied to the primary before subscribers.
- Modifications (e.g. type changes, renames, NOT NULL tightening) are decomposed into a multi-release sequence: add new -> dual-write -> backfill -> switch readers -> stop writing old -> drop old.

The invariant is that the subscriber schema is always a superset of what the primary currently writes, which is exactly what PostgreSQL's logical replication requires (["intermittent errors can be avoided by applying additive schema changes to the subscriber first"](https://www.postgresql.org/docs/current/logical-replication-restrictions.html#LOGICAL-REPLICATION-RESTRICTIONS-DDL)).
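The superset invariant can be expressed as a small check, sketched below. It compares whole column sets as an approximation of "what the primary currently writes"; the `Schema` type, the function name, and the `note` column are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Schema maps a table name to the set of its column names.
type Schema map[string]map[string]bool

// MissingOnSubscriber lists "table.column" entries present on the primary
// but absent on the subscriber. An empty result means the invariant holds.
func MissingOnSubscriber(primary, subscriber Schema) []string {
	var missing []string
	for table, cols := range primary {
		for col := range cols {
			// Indexing a missing table yields a nil map, which reads as false.
			if !subscriber[table][col] {
				missing = append(missing, table+"."+col)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	before := Schema{"channel": {"id": true, "name": true}}
	after := Schema{"channel": {"id": true, "name": true, "note": true}}

	// Additive change applied to the primary first: the invariant is violated.
	fmt.Println(MissingOnSubscriber(after, before))
	// Additive change applied to the subscriber first: the invariant holds.
	fmt.Println(MissingOnSubscriber(before, after))
}
```

A check along these lines could run before each migration step to confirm the ordering rule was followed.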

### Enforcement layers

The admin/runtime boundary is enforced at three levels:

1. Compile-time (Go packages): `runtime/` cannot call `admin/` methods.
2. HTTP-level (handler guards): edge nodes return 403 for admin API calls.
3. Database-level (PostgreSQL roles): the runtime role physically cannot write to admin tables.
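The HTTP-level guard (level 2) might look like the sketch below. It uses the standard `net/http` package to stay self-contained, which may differ from the router Nebraska actually uses; the `/api/apps` path and the `statusFor` helper are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requirePrimary wraps an admin handler and answers 403 on edge nodes.
func requirePrimary(isEdge bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isEdge {
			http.Error(w, "admin API is not available on edge nodes",
				http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one admin request through the guard and returns the status code.
func statusFor(isEdge bool) int {
	admin := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated) // pretend the admin write succeeded
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/apps", nil)
	requirePrimary(isEdge, admin).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("edge:", statusFor(true), "control:", statusFor(false))
}
```

The guard gives callers a clear 403 early, while the database roles remain the backstop if a route is ever left unguarded.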

## PoC results

I put together a simple two-region PoC with a control node and an edge node, each running on separate infrastructure with their own PostgreSQL database and logical replication between them. The implementation covers all the changes described above with all existing tests passing.

Some indicative observations (not a proper benchmark, just what I saw during testing):

- Apps, channels, packages, and groups created on the primary are visible on the subscriber within about 150 ms
- Omaha update checks on the edge return the correct version immediately after replication
- Admin API writes on edge nodes are blocked with 403
- Telemetry (instance registrations, events) stays local to each region
- In-memory caches (app IDs, group track names) use invalidate-on-miss so replicated data is served on first access
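The invalidate-on-miss behaviour from the last point can be sketched as follows. This is a simplified illustration rather than the PoC code: the `IDCache` type is hypothetical, and a plain map stands in for the database that replication writes into.

```go
package main

import (
	"fmt"
	"sync"
)

// IDCache maps names to ids and is rebuilt from the loader whenever a lookup misses.
type IDCache struct {
	mu     sync.Mutex
	data   map[string]string
	reload func() map[string]string // stands in for a database query
}

// NewIDCache loads the initial contents once.
func NewIDCache(reload func() map[string]string) *IDCache {
	return &IDCache{data: reload(), reload: reload}
}

// Get serves from memory; on a miss it reloads and retries once.
func (c *IDCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.data[key]; ok {
		return v, true
	}
	c.data = c.reload() // the row may have arrived via replication since the last load
	v, ok := c.data[key]
	return v, ok
}

func main() {
	db := map[string]string{"stable": "g1"}
	load := func() map[string]string {
		snapshot := make(map[string]string, len(db))
		for k, v := range db {
			snapshot[k] = v
		}
		return snapshot
	}

	cache := NewIDCache(load)
	db["beta"] = "g2" // simulates a replicated row landing after the cache was built
	fmt.Println(cache.Get("beta"))
}
```

One trade-off of this approach is that lookups for keys that never exist trigger a reload every time, so a production version would likely rate-limit reloads.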

The PoC code is on my fork: [Moustafa-Moustafa#1](https://github.com/Moustafa-Moustafa/nebraska/pull/1). This is of course just a PoC. I'll split my changes into smaller PRs when we are ready to move forward with this proposal.

Would love to get feedback on the approach.

## Rollout

- [ ] PR 1 — `group_local` sidecar split (#1396)
- [ ] PR 2 — `activity` row-level split + `all_activity` view (#1398)
- [ ] PR 3 — Extract `pkg/api/dbreads/`
- [ ] PR 4 — Split writes into `pkg/api/admin/` and `pkg/api/runtime/`
- [ ] PR 5 — Conditional role grants migration + two-DSN migration/serving split (`NEBRASKA_MIGRATIONS_DB_URL`)
- [ ] PR 6 — Instance mode: `NEBRASKA_INSTANCE_MODE` + `requirePrimary` middleware + validated allowlist
- [ ] PR 7 — Per-edge override management endpoint
- [ ] PR 8 — Operator documentation
