k3s-oci

A production-ready k3s Terraform module for the OCI Always Free tier.

Features

HA control plane: 3 control-plane nodes with embedded etcd; survives 1 node failure
Full stack always deployed: cert-manager, Longhorn, ArgoCD + Image Updater, and kured are always installed; they keep the cluster active and prevent idle reclamation
Separate public/private subnets: k3s nodes have no public IP; only LBs and the optional bastion are internet-facing
Envoy Gateway ingress (Gateway API): DaemonSet with system-cluster-critical priority and PodDisruptionBudget maxUnavailable: 1; standard HTTPRoute/Gateway resources; real client IP preservation via NLB transparent mode
Automatic security updates: on Ubuntu, unattended-upgrades plus a kured drain-reboot-uncordon cycle with zero manual intervention; on openSUSE, zypper patch systemd timers
Configurable OS (os_family): Ubuntu 24.04 LTS (default, OCI-native image auto-resolved) or openSUSE Leap 16.0 (custom-imported UEFI image via scripts/import-opensuse-aarch64.sh)
k3s version pinned at plan time: resolved from the GitHub API during terraform plan, not at boot time
Cluster-scoped IAM: dynamic group and policy scoped to nodes tagged with the cluster name, not every instance in the compartment
Idempotent cloud-init: all kubectl operations use apply; re-provisioning is safe
Monitoring (grafana_hostname): kube-prometheus-stack (Prometheus + Grafana + Alertmanager) always deployed; optional public Grafana UI via grafana_hostname; PrometheusRules for node disk pressure and Longhorn volume health
Direct SSH via NLB (expose_ssh = true): expose port 22 on the public NLB restricted to my_public_ip_cidr; eliminates the need for OCI Bastion sessions for day-to-day access
OCI Vault (enable_vault = true): cluster secrets in a free software-protected OCI Vault; fetched at boot via instance_principal, not embedded in user-data
Boot volume backups (enable_backup = true): weekly full backups, 1-week retention, within the 5-backup Always Free limit
Object Storage state bucket (enable_object_storage_state = true): versioned OCI Object Storage for Terraform state; S3-compatible endpoint in terraform_state_backend output
OCI Notifications + Alertmanager (enable_notifications = false): opt-in OCI Notifications topic wired to Alertmanager as a webhook receiver
MySQL HeatWave (enable_mysql = false): opt-in Always Free MySQL DB in the private subnet; credentials pre-created as a Kubernetes Secret
External DNS (enable_external_dns = false): automatic Cloudflare DNS record management from HTTPRoute hostnames
External Secrets (enable_external_secrets = false): sync OCI Vault secrets into Kubernetes Secrets via instance_principal; no credentials to rotate

Architecture

graph TD
    Internet(["🌐 Internet"])

    subgraph public["Public Subnet · 10.0.0.0/24"]
        NLB["🔀 Public NLB (Always Free)
HTTP :80 · HTTPS :443
optional: kubeapi :6443 · SSH :22"]
    end

    subgraph private["Private Subnet · 10.0.1.0/24 · no public IPs"]
        ILB["⚖️ Internal Flex LB (Always Free)
kubeapi VIP :6443"]

        subgraph cp["Control Plane × 3  ·  A1.Flex (1 OCPU / 6 GB each)
k3s-server · etcd · Envoy Gateway · Longhorn · user workloads"]
            CP0["control-plane-0"]
            CP1["control-plane-1"]
            CP2["control-plane-2"]
        end

        W["worker-0  ·  A1.Flex (1 OCPU / 6 GB)
k3s-agent · Envoy Gateway · Longhorn · user workloads"]
    end

    NAT["🌍 NAT Gateway (Always Free)"]
    Bastion["🔐 OCI Bastion Service
optional · Always Free"]

    Internet -->|HTTP / HTTPS| NLB
    NLB -->|"Envoy Gateway NodePorts :30080 / :30443"| CP0 & CP1 & CP2 & W
    NLB -. "kubeapi :6443
expose_kubeapi=true" .-> ILB
    NLB -. "SSH :22
expose_ssh=true" .-> CP0 & CP1 & CP2 & W
    ILB --> CP0 & CP1 & CP2
    W -->|joins via kubeapi| ILB
    private -->|outbound| NAT --> Internet
    Bastion -. "SSH tunnel
enable_bastion=true" .-> private

All four A1.Flex instances live in a private subnet with no public IPs. Internet traffic enters exclusively through the public NLB; the internal Flex LB only fronts the Kubernetes API. Both load balancers are Always Free.

k3s naming note: k3s calls control-plane nodes "servers" (k3s server) and workers "agents" (k3s agent). Terraform resources follow k3s conventions (server/worker); in standard Kubernetes terminology these map to control-plane and worker nodes.

Public NLB forwards HTTP/HTTPS directly to Envoy Gateway NodePorts on all four nodes. is_preserve_source = true preserves real client IPs at the hypervisor level. The NLB optionally exposes the Kubernetes API on port 6443, restricted to your IP.

Internal Flex LB provides a stable private VIP across all three control-plane nodes. Workers join via this VIP so the cluster survives any single control-plane loss.

Longhorn runs on all four nodes with defaultReplicaCount=2; each PVC is replicated across two nodes. For critical PVCs that must survive two simultaneous node losses, use the longhorn-replicated-3 StorageClass (gitops/longhorn/storageclasses/). Control-plane NoSchedule taints are removed after cluster init so user workloads schedule across all four identically-sized nodes.

HA ceiling: etcd runs on the 3 control-plane nodes (quorum = 2). The cluster tolerates 1 control-plane failure, the hard limit of a 4-node Always Free topology.

Quickstart

# 1. Clone the repo
git clone https://github.com/mbologna/k3s-oci.git
cd k3s-oci

# 2. Copy and edit the variables file
cp example/terraform.tfvars.example example/terraform.tfvars
$EDITOR example/terraform.tfvars

# 3. Init and apply (terraform or tofu both work)
cd example && tofu init && tofu apply

A Justfile is included for common operations (requires just):

just init        # tofu init in example/
just plan        # tofu plan in example/
just apply       # tofu apply in example/
just kubeconfig  # fetch kubeconfig via OCI Bastion
just ssh worker  # SSH into a node (server1/server2/server3/worker)
just fmt         # tofu fmt -recursive

kubeconfig

After terraform apply, run:

terraform output kubeconfig_hint

This prints the exact steps for your configuration. If enable_bastion = true (recommended), the fastest path is the included helper script:

cd example && ./get-kubeconfig.sh
export KUBECONFIG=~/.kube/k3s-oci.yaml
kubectl get nodes

enable_bastion defaults to true. It uses OCI Bastion Service, a managed SSH proxy with no VM, no boot volume, and no cost. Without it (and without expose_ssh = true), nodes are reachable only via the OCI serial console (terraform output kubeconfig_hint explains all options).

Direct SSH (no Bastion): set expose_ssh = true to expose port 22 on the public NLB, restricted to my_public_ip_cidr. After apply:
$(terraform output -raw ssh_command)
This is faster than Bastion sessions and avoids session TTLs. When using expose_ssh = true you can set enable_bastion = false to skip the Bastion Service resource entirely.

Deploying a web application

Why TLS is terminated at Envoy Gateway, not at the OCI load balancer

OCI provides two load balancer products with very different capabilities:

	OCI Network Load Balancer (NLB)	OCI Flexible Load Balancer
OSI layer	L4 (TCP passthrough)	L7 (HTTP/HTTPS aware)
TLS termination	❌ Not possible	✅ Yes
Always Free	1 NLB	2 × 10 Mbps
Used here	`nlb.tf`: public internet traffic	`lb.tf`: internal kubeapi HA VIP

The public-facing load balancer is the NLB. It forwards raw TCP streams with protocol = "TCP", so it has no knowledge of TLS, HTTP headers, or certificates. TLS must be terminated by something behind it.

The Flexible LB could terminate TLS, but this module dedicates its Flexible LB to the internal kubeapi HA VIP. Even if one were available for public traffic, using OCI to manage certificates would break the automatic cert-manager + Let's Encrypt renewal cycle.

The current flow is: Internet → NLB (TCP passthrough, preserves client IPs) → Envoy Gateway NodePort → TLS terminate → route to app pod.

Minimal example: HTTP-only

No domain needed. Requests to the NLB IP are served directly.

# hello-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
  namespace: hello-web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: hello-web
      containers:
        - name: hello-web
          image: httpd:alpine
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-web
  namespace: hello-web
spec:
  selector:
    app: hello-web
  ports:
    - port: 80
      targetPort: 80
---
# HTTPRoute — no hostname filter = matches all requests on the http listener
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: hello-web
  namespace: hello-web
spec:
  parentRefs:
    - name: eg
      namespace: envoy-gateway-system
      sectionName: http
  rules:
    - backendRefs:
        - name: hello-web
          port: 80

kubectl create namespace hello-web
kubectl apply -f hello-web.yaml
NLB_IP=$(cd example && tofu output -raw nlb_ip)
curl http://$NLB_IP/

Minimal example: HTTPS with sslip.io (no domain purchase required)

sslip.io is a public DNS service that resolves <anything>.<ip>.sslip.io directly to <ip>. Combined with cert-manager + Let's Encrypt HTTP-01, this gives a trusted TLS certificate with zero infrastructure cost.

Replace <NLB_IP> with the value of tofu output -raw nlb_ip.

# hello-web-tls.yaml
---
# 1. Certificate — cert-manager issues this via HTTP-01 challenge through Envoy Gateway
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: hello-web-tls
  namespace: envoy-gateway-system   # must be in the same namespace as the Gateway
spec:
  secretName: hello-web-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - hello-web.<NLB_IP>.sslip.io
---
# 2. HTTPS listener on the Gateway (add this to gitops/gateway/gateway.yaml for GitOps management)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: eg
  namespace: envoy-gateway-system
spec:
  gatewayClassName: eg
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
    - name: https-hello-web
      port: 443
      protocol: HTTPS
      hostname: hello-web.<NLB_IP>.sslip.io
      tls:
        mode: Terminate
        certificateRefs:
          - name: hello-web-tls
      allowedRoutes:
        namespaces:
          from: All
---
# 3. HTTP→HTTPS redirect (add hostname to gitops/gateway/redirect.yaml)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-to-https-redirect
  namespace: envoy-gateway-system
spec:
  parentRefs:
    - name: eg
      sectionName: http
  hostnames:
    - hello-web.<NLB_IP>.sslip.io
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301
---
# 4. HTTPRoute for the app: attaches to the HTTPS listener only (plain HTTP is handled by the redirect in step 3)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: hello-web
  namespace: hello-web
spec:
  parentRefs:
    - name: eg
      namespace: envoy-gateway-system
      sectionName: https-hello-web
  hostnames:
    - hello-web.<NLB_IP>.sslip.io
  rules:
    - backendRefs:
        - name: hello-web
          port: 80

# Apply the manifests (after replacing <NLB_IP>), then wait for certificate issuance (typically 1–2 minutes)
kubectl apply -f hello-web-tls.yaml
kubectl wait --for=condition=Ready certificate/hello-web-tls -n envoy-gateway-system --timeout=5m
curl https://hello-web.<NLB_IP>.sslip.io/
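The <NLB_IP> substitution described above can be scripted rather than edited by hand. The following is a minimal sketch; the helper names sslip_hostname and render_nlb_ip are illustrative and not part of this repo, and 203.0.113.10 is a documentation-range address standing in for a real NLB IP.

```shell
#!/usr/bin/env bash
# Build an sslip.io hostname from an app name and an IPv4 address.
# (Hypothetical helper, shown for illustration only.)
sslip_hostname() {
  local app="$1" ip="$2"
  printf '%s.%s.sslip.io\n' "$app" "$ip"
}

# Replace every <NLB_IP> placeholder read from stdin with the given address.
# (Hypothetical helper, shown for illustration only.)
render_nlb_ip() {
  local ip="$1"
  sed "s/<NLB_IP>/${ip}/g"
}

# Example with a placeholder address (RFC 5737 range), not a real NLB IP:
sslip_hostname hello-web 203.0.113.10
echo 'hostname: hello-web.<NLB_IP>.sslip.io' | render_nlb_ip 203.0.113.10
```

In practice the address would come from tofu output -raw nlb_ip (run in example/), and the rendered manifest could be piped to kubectl apply -f - instead of being saved to disk.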

With a real domain: set enable_external_dns = true and annotate the HTTPRoute with external-dns.alpha.kubernetes.io/hostname: myapp.example.com. External DNS will create the A record automatically, then cert-manager issues the certificate. Alternatively, set enable_dns01_challenge = true to use DNS-01 (supports wildcard certs and does not require inbound port 80).

Resilience: spread replicas across nodes

Use topologySpreadConstraints to ensure pod replicas land on different nodes:

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: <your-app>

With 4 identically-sized nodes, 2 replicas survive any single node failure. Envoy Gateway runs as a DaemonSet with maxUnavailable: 1, so ingress remains up on the other 3 nodes throughout any single-node drain or failure.

Monitoring (Grafana + Prometheus)

kube-prometheus-stack (Prometheus, Grafana, Alertmanager) is always deployed as part of the full stack.

Accessing Grafana

Set grafana_hostname in terraform.tfvars to expose the Grafana UI with HTTPS and a Let's Encrypt certificate:

grafana_hostname = "grafana.example.com"   # or leave null for auto sslip.io hostname

When grafana_hostname is null, Grafana is reachable at grafana.<nlb-ip>.sslip.io (no domain purchase required).

Retrieve the admin credentials after terraform apply:

terraform output -raw grafana_admin_credentials

The password is generated by Terraform and stored in OCI Vault when enable_vault = true; it is never embedded in cloud-init user-data.

Built-in alert rules

The following PrometheusRules are included out of the box (gitops/monitoring/prometheus-rules.yaml):

Alert	Condition
`NodeDiskPressure`	Node has disk pressure condition
`NodeDiskSpaceLow`	< 15% free disk on any node
`NodeDiskSpaceCritical`	< 5% free disk on any node
`LonghornVolumeDegraded`	Longhorn volume in degraded state
`LonghornVolumeFaulted`	Longhorn volume in faulted state
`LonghornNodeStorageWarning`	Longhorn node storage > 80% used

Adding custom dashboards

Create a ConfigMap in the monitoring namespace with label grafana_dashboard: "1" — the Grafana sidecar auto-discovers and loads it:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    { ... }   # Grafana dashboard JSON
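When the dashboard JSON already exists as a file exported from Grafana, the ConfigMap wrapper can be generated instead of hand-indented. This is a sketch only: dashboard_configmap is a hypothetical helper, and it assumes the monitoring namespace and grafana_dashboard label shown above.

```shell
#!/usr/bin/env bash
# Emit a ConfigMap manifest that wraps a Grafana dashboard JSON file.
# (Hypothetical helper; namespace and label are taken from the example above.)
# Usage: dashboard_configmap <name> <path-to-dashboard.json>
dashboard_configmap() {
  local name="$1" json_file="$2"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${name}
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  ${name}.json: |
$(sed 's/^/    /' "$json_file")
EOF
}
```

A possible invocation is dashboard_configmap my-dashboard my-dashboard.json | kubectl apply -f -; committing the generated file under your GitOps tree works equally well.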

GitOps — App of Apps

The gitops/ directory contains ArgoCD Application manifests managed with the App of Apps pattern.

After the cluster is running, bootstrap it:

kubectl apply -n argocd -f gitops/apps/app-of-apps.yaml

ArgoCD will then continuously reconcile every manifest under gitops/apps/.

Adding your own applications

This repo is designed to be forked. To add your own apps on top of the built-in stack:

Fork this repo on GitHub.

Update all repoURL references to point to your fork:

bash gitops/update-repo-url.sh https://github.com/your-org/your-fork.git
git add gitops/apps/ && git commit -m "chore: update gitops repoURL"
git push

Add your ArgoCD Application manifests to gitops/apps/ — ArgoCD syncs them automatically. Each app can point at any Helm chart registry or any Git repository.

Deploying for the first time? Also set gitops_repo_url in terraform.tfvars before running tofu apply, so cloud-init writes the correct fork URL at bootstrap:
gitops_repo_url = "https://github.com/your-org/your-fork.git"
Already have a running cluster? Patch the App of Apps directly:
argocd app set app-of-apps --repo https://github.com/your-org/your-fork.git
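After repointing a fork, it can be useful to confirm which repositories the Application manifests actually reference. The sketch below only reads files; list_repo_urls is a hypothetical helper and it assumes each manifest stores its source as a repoURL: key on a single line.

```shell
#!/usr/bin/env bash
# Print the distinct repoURL values found under a directory (default: gitops/apps).
# (Hypothetical helper; assumes "repoURL: <url>" appears on one line.)
list_repo_urls() {
  local dir="${1:-gitops/apps}"
  grep -rhoE 'repoURL:[[:space:]]*[^[:space:]]+' "$dir" \
    | sed -E 's/^repoURL:[[:space:]]*//' \
    | sort -u
}

# Example (run from the repo root):
#   list_repo_urls gitops/apps
```

Helm chart registries will still show up in the output; the point of the check is that the original upstream Git URL no longer appears.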

Private repos: set gitops_ssh_private_key in terraform.tfvars with your SSH private key — Terraform stores it in OCI Vault automatically and cloud-init creates the argocd-repo-gitops Secret before ArgoCD starts. No manual argocd repo add step needed. For repos with a non-standard directory layout, set gitops_path (default: gitops/apps).

Automatic updates & reboots (unattended-upgrades + kured)

unattended-upgrades applies Ubuntu security patches daily and sets /var/run/reboot-required when a kernel update needs a reboot.

kured watches every node for /var/run/reboot-required and, when found:

Acquires a cluster-wide lock (only one node reboots at a time)
Cordons + drains the node
Reboots
Waits for the node to return and uncordons it

This keeps the cluster fully patched with zero manual intervention and no concurrent downtime.
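To see by hand whether a node is waiting for kured to act, you can check for the same sentinel file over SSH. This is only an inspection sketch, not how kured itself is implemented; reboot_required is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeed if the reboot sentinel file exists.
# The path defaults to the one named above; another path can be passed for testing.
reboot_required() {
  local sentinel="${1:-/var/run/reboot-required}"
  [ -f "$sentinel" ]
}

if reboot_required; then
  echo "reboot pending"
else
  echo "no reboot pending"
fi
```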

Dependency updates (Renovate)

Renovate tracks Terraform providers, k3s, all stack component versions (via # renovate: inline comments in vars.tf and gitops/apps/*.yaml), and GitHub Actions. Enable with the Renovate GitHub App or the self-hosted workflow at .github/workflows/renovate.yml (requires a RENOVATE_TOKEN secret with repo scope).

Remote Terraform state (OCI Object Storage)

With enable_object_storage_state = true (the default), a versioned OCI Object Storage bucket is created automatically. After terraform apply, get the ready-to-use backend config:

terraform output -json terraform_state_backend

Use it in your terraform { backend "s3" {} } block (requires an OCI Customer Secret Key for S3 credentials):

terraform {
  backend "s3" {
    bucket                      = "<cluster_name>-terraform-state"
    key                         = "terraform.tfstate"
    region                      = "<your-region>"                     # e.g. eu-frankfurt-1
    endpoint                    = "https://<namespace>.compat.objectstorage.<region>.oraclecloud.com"
    skip_region_validation      = true
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    force_path_style            = true
  }
}

Generate OCI Customer Secret Keys under Identity → Users → your user → Customer Secret Keys. The bucket name and namespace endpoint are in terraform output terraform_state_backend.
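The endpoint value follows the pattern shown in the backend block, so it can be assembled from the Object Storage namespace and region. A small sketch, where s3_compat_endpoint is a hypothetical helper and mynamespace is a placeholder:

```shell
#!/usr/bin/env bash
# Build the S3-compatible endpoint URL from an Object Storage namespace and a region,
# following the pattern used in the backend block above.
# (Hypothetical helper, shown for illustration only.)
s3_compat_endpoint() {
  local namespace="$1" region="$2"
  printf 'https://%s.compat.objectstorage.%s.oraclecloud.com\n' "$namespace" "$region"
}

# Example with placeholder values:
s3_compat_endpoint mynamespace eu-frankfurt-1
```

If you do not know your namespace, oci os ns get should print it; the terraform_state_backend output remains the authoritative source for this module.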

Always Free budget

Resource	Free allowance	This module
A1.Flex compute	4 OCPUs / 24 GB / 4 instances	3 servers + 1 worker = 4 OCPUs / 24 GB
Block storage	200 GB	4 × 50 GB = 200 GB
Network Load Balancer	1 NLB	1 (public, HTTP/HTTPS)
Flexible Load Balancer	2 × 10 Mbps	1 (private, kubeapi)
E2.1.Micro instances	2	0 (bastion uses OCI Bastion Service, managed, no VM)
NAT Gateway	1 per VCN	1 (outbound-only for private nodes)
Object Storage	20 GB	2 versioned buckets: Terraform state + Longhorn PVC backups (`enable_object_storage_state`, `enable_longhorn_backup`)
Vault (shared)	Software keys + 150 secrets	3 secrets: k3s_token, longhorn_ui_password, grafana_admin_password (`enable_vault = true`)
Volume backups	5 total	4 (one per node, weekly, 1-week retention) (`enable_backup = true`)
Notifications	1M HTTPS + 3K email/month	1 topic wired to Alertmanager (`enable_notifications = false`, opt-in)
MySQL HeatWave	1 standalone DB, 50 GB	1 DB system in private subnet (`enable_mysql = false`, opt-in)
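
The compute and block storage rows leave no headroom, which is easy to verify with a little arithmetic before changing node counts or volume sizes. The sketch below hard-codes the limits from the table above (4 OCPUs, 24 GB RAM, 200 GB block storage); check_budget is a hypothetical helper, not something the module ships.

```shell
#!/usr/bin/env bash
# Compare a planned node layout against the Always Free limits listed in the table above.
# Usage: check_budget <nodes> <ocpus_per_node> <ram_gb_per_node> <boot_gb_per_node>
# Prints the totals and succeeds only if every total is within its limit.
check_budget() {
  local nodes="$1" ocpus_per_node="$2" ram_gb_per_node="$3" boot_gb_per_node="$4"
  local ocpus=$((nodes * ocpus_per_node))
  local ram=$((nodes * ram_gb_per_node))
  local disk=$((nodes * boot_gb_per_node))
  echo "ocpus=${ocpus}/4 ram_gb=${ram}/24 block_gb=${disk}/200"
  [ "$ocpus" -le 4 ] && [ "$ram" -le 24 ] && [ "$disk" -le 200 ]
}

# This module's layout: 4 nodes, 1 OCPU / 6 GB RAM / 50 GB boot volume each.
check_budget 4 1 6 50 && echo "within Always Free limits"
```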

⚠️ Idle reclamation: OCI reclaims Always Free instances whose CPU, network, and memory utilization all stay below 20% for 7 consecutive days. The full stack (Longhorn, ArgoCD, cert-manager, kured) generates enough background activity to keep the cluster alive.

Failure tolerance

Component	Tolerance	What happens on failure
Any single node (any role)	✅ 1 node	Workloads reschedule to remaining 3 nodes; Longhorn (2 replicas) keeps storage up; Envoy Gateway DaemonSet keeps ingress up on remaining nodes
2 nodes simultaneously	⚠️ Partial	Workloads and ingress continue on 2 surviving nodes; if both failed nodes are control-planes, etcd quorum is lost and the API server stops accepting writes (running pods keep running, no new scheduling)
etcd / control-plane quorum	❌ 2 control-planes	Cluster becomes read-only; recovery requires etcd snapshot restore; see Split-Brain Recovery
Worker node	✅ Full	With taints removed, workloads reschedule to control-planes; no SPOF
HTTP/HTTPS ingress	✅ 3 node losses	Envoy Gateway DaemonSet; NLB health-checks remove unhealthy backends automatically
Kubernetes API	✅ 1 control-plane	ILB routes to remaining 2 control-planes
PVC data (Longhorn)	✅ 1 node	2 replicas across 4 nodes; 1 replica lost, 1 remains serving. Use `longhorn-replicated-3` StorageClass for critical PVCs to survive 2 simultaneous losses
cert-manager	⚠️ Soft	Pod reschedules within minutes; TLS serving unaffected (certs live in Secrets); only new issuance/renewal is paused
ArgoCD	⚠️ Soft	GitOps sync pauses until rescheduled; running workloads unaffected
MySQL (if enabled)	❌ None	Always Free tier = single OCI-managed instance; no HA failover

Node roles and workload placement

Each A1.Flex instance has identical resources (1 OCPU / 6 GB RAM). The k3s role (server vs agent) affects which system processes run, not how much resource is available for workloads.

What	control-plane-0/1/2	worker-0	Scheduling mechanism
etcd	✅	❌	k3s built-in; servers only
Kubernetes API server	✅	❌	k3s built-in; servers only
Envoy Gateway (ingress)	✅	✅	DaemonSet (1 pod per node)
Longhorn (storage daemon)	✅	✅	DaemonSet (1 pod per node)
cert-manager	✅	✅	Deployment: schedules on any node
ArgoCD	✅	✅	Deployment: schedules on any node
kube-prometheus-stack	✅	✅	Deployment/StatefulSet: any node
kured	✅	✅	DaemonSet (1 pod per node)
User workloads	✅	✅	No restrictions — schedules on all 4 nodes

Why control-planes run user workloads: upstream Kubernetes (kubeadm) taints control-plane nodes with NoSchedule, whereas k3s servers are schedulable unless a taint is configured. This setup removes any control-plane NoSchedule taints at cluster init so all 4 identically-sized nodes are available. With only one worker, keeping such a taint would make it a single point of failure for all user workloads.

Recommendation: use replicas ≥ 2 with topologySpreadConstraints (see gitops/README.md) to spread pods across nodes and survive any single-node failure.

Why this topology

With a hard cap of 4 A1.Flex instances, the binding constraint is etcd quorum: HA etcd needs at minimum 3 nodes (quorum = ⌊n/2⌋+1 = 2). The result is a 3-server HA cluster plus 1 standalone worker that saturates every Always Free resource class with nothing left unused and nothing that costs money.
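
The quorum formula can be checked for each possible member count with shell integer arithmetic. The function names below are illustrative only.

```shell
#!/usr/bin/env bash
# Quorum for an n-member etcd cluster: floor(n/2) + 1 (shell division is integer division).
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable: n - quorum.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
```

Three members is the smallest count that tolerates a failure, and a fourth member raises quorum to 3 without raising the tolerance.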

Topology comparison

Topology	etcd HA	Nodes for workloads	Effective RAM for workloads†	Assessment
3 CP + 1 worker (this module)	✅ 1-node fault	4 (taints removed)	~15 GB	Optimal: HA etcd, all 4 nodes contribute to workloads
1 CP + 3 workers	❌ CP is total SPOF	4	~18 GB	More capacity but control-plane loss = complete cluster death
2 CP + 2 workers	❌ Invalid	-	-	2-node etcd needs both members for quorum (quorum = 2), so losing either one halts the cluster; worse than 1 node
4 CP + 0 workers	✅ 1-node fault	4 (taints removed)	~12 GB	Fewer resources for workloads; more etcd overhead

†etcd + kubeapi consume ~300–500 MB RAM and ~100–200m CPU per control-plane node.

4 × 1 OCPU even split prevents any single etcd node from becoming a hot-spot, creates 4 equal fault domains, and allows workloads to spread evenly.

Why not use the 2 free E2.1.Micro instances as extra workers?

Always Free also includes 2 AMD E2.1.Micro instances. They are not worth adding:

Storage budget exhausted: 4 × 50 GB boot volumes already consume the full 200 GB Always Free block storage allowance; two additional instances would require at least 100 GB more
1 GB RAM: k3s agent + Longhorn DaemonSet alone consume ~700–800 MB, leaving ~200 MB for user workloads
1/8 OCPU: negligible compute; adds operational complexity for near-zero workload benefit

Previously rejected alternatives

Alternative	Why it was rejected
nginx stream proxy in front of Envoy Gateway	Extra latency and complexity; NLB already preserves source IPs directly
OCI Bastion VM (E2.1.Micro)	OCI Bastion Service provides managed SSH proxying for free with no VM, no OS to patch, and no boot volume consuming storage budget
Boot volumes < 50 GB	OCI hard minimum is 50 GB per shape; 4 × 50 GB = 200 GB exactly exhausts the free block storage allowance
Additional NLB for kubeapi	Only 1 NLB is Always Free; the existing NLB conditionally exposes port 6443 via `expose_kubeapi = true`
openSUSE (or other non-Ubuntu Linux) as the base OS	OCI provides no native openSUSE ARM platform image. openSUSE Leap 16.0 is now supported via `os_family = "opensuse"` + a custom-imported UEFI image. See Choosing an OS below. Other distros remain unsupported.

Choosing an OS

The module supports two OS families, selected via os_family:

`os_family`	Image	Auto-resolved	SSH user	Auto-updates
`"ubuntu"` (default)	Ubuntu 24.04 LTS (Noble) aarch64	✅ Yes (latest OCI-native image)	`ubuntu`	`unattended-upgrades` + `needrestart`
`"opensuse"`	openSUSE Leap 16.0 Minimal VM aarch64	❌ No (must import and set `os_image_id`)	`sles`	`zypper patch` systemd timers

Ubuntu (default)

No extra steps needed. The latest Ubuntu 24.04 LTS image for VM.Standard.A1.Flex is resolved automatically at plan time from the tenancy.

openSUSE Leap 16.0

OCI has no native openSUSE image. Use the included script to import one before running tofu apply:

./scripts/import-opensuse-aarch64.sh

The script:

Resolves the latest openSUSE Leap 16.0 Minimal VM Cloud aarch64 QCOW2 from download.opensuse.org
Streams the image (~271 MiB) directly into a temporary OCI Object Storage bucket — no local disk required
Imports via the OCI REST API with firmware: UEFI_64 and launchMode: CUSTOM (the OCI CLI's oci compute image import always defaults to BIOS; UEFI_64 is required for VM.Standard.A1.Flex)
Adds VM.Standard.A1.Flex shape compatibility
Cleans up the temp Object Storage object
Prints the image OCID

Then set in terraform.tfvars:

os_family   = "opensuse"
os_image_id = "ocid1.image.oc1..."   # OCID printed by the script above

Script options:

--compartment-id OCID   Compartment OCID (default: tenancy root)
--region REGION         OCI region (default: from ~/.oci/config)
--leap-version VERSION  openSUSE Leap version (default: 16.0)
--bucket-name NAME      Temp bucket name (default: opensuse-image-import-tmp)
--keep-bucket           Do not delete the QCOW2 object after import
--image-name NAME       Custom display name for the imported image

Prerequisites: OCI CLI configured (~/.oci/config), curl, python3.

Known caveats (verified with Leap 16.0 + VM.Standard.A1.Flex):

Caveat	Detail
Image must be re-imported on new Leap releases	No auto-update path for the base OS image; re-run the script and update `os_image_id` when a new build is published
UEFI_64 required at import time	OCI's `oci compute image import` CLI hard-codes `firmware: BIOS`. The script works around this via a direct REST API call
Shape compatibility not auto-detected	OCI does not auto-detect the architecture of imported QCOW2 images; the script adds `VM.Standard.A1.Flex` explicitly
Oracle Cloud Agent (OCA) unavailable	No OCI-native monitoring agent on custom images

Using any other OS image

Set os_image_id to the OCID of any OCI image. Only Ubuntu and openSUSE are tested. Any other OS will need its own bootstrap logic — fork the repo and adapt files/lib/bootstrap-ubuntu.sh as a starting point.

Split-Brain Recovery

A split-brain occurs when multiple k3s server nodes each bootstrap an independent etcd cluster (--cluster-init) instead of joining a single shared one. Symptoms: kubectl get nodes shows only 1 node (not 3), or etcd member IDs differ across servers, or the cluster survives a reboot but each server has different state.

Detection

# On each server node (via SSH):
sudo k3s kubectl get nodes          # should show all 3 servers
sudo k3s etcd-snapshot ls           # should show same snapshots on all servers
sudo /usr/local/bin/k3s etcd-snapshot ls 2>&1 | grep -E "^etcd"

# Check etcd member list (run on each server; etcdctl is not bundled with k3s and must be installed separately):
sudo ETCDCTL_API=3 \
  ETCDCTL_CACERT=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  ETCDCTL_CERT=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  ETCDCTL_KEY=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  etcdctl member list
# If IDs differ between servers → split-brain confirmed
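
Comparing member lists by eye is error-prone, so one option is to save each server's etcdctl member list output to a file and compare the ID sets. This sketch assumes etcdctl's default output, where the member ID is the first comma-separated field; member_ids and same_members are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Print the sorted member IDs (first comma-separated field) from a file containing
# saved `etcdctl member list` output.
member_ids() {
  cut -d, -f1 "$1" | tr -d ' ' | sort
}

# Succeed only if both files contain the same set of member IDs.
same_members() {
  [ "$(member_ids "$1")" = "$(member_ids "$2")" ]
}

# Example, after saving each server's output as cp0.txt, cp1.txt, cp2.txt:
#   same_members cp0.txt cp1.txt && same_members cp0.txt cp2.txt \
#     || echo "member lists differ: possible split-brain"
```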

Recovery from etcd snapshot (recommended)

# 1. Identify the best snapshot. List snapshots in OCI Object Storage:
#    (if enable_etcd_snapshots = true, snapshots are uploaded every 6h)
oci os object list \
  --namespace <your-namespace> \
  --bucket-name <cluster-name>-terraform-state \
  --prefix "etcd-snapshots/<cluster-name>/" \
  --query 'sort_by(data, &"time-created")[-1]."name"' --raw-output

# 2. Download the best snapshot to the elected first server:
oci os object get \
  --namespace <your-namespace> \
  --bucket-name <cluster-name>-terraform-state \
  --name etcd-snapshots/<cluster-name>/<snapshot-file> \
  --file /tmp/etcd-restore.db

# 3. Stop k3s on ALL server nodes:
sudo systemctl stop k3s

# 4. On the first server: reset etcd and restore from snapshot.
#    WARNING: this wipes all current etcd state on this node.
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/tmp/etcd-restore.db &
# Wait for the reset to complete (watch journalctl -u k3s), then stop it:
sudo pkill -f "k3s server --cluster-reset"

# 5. On the REMAINING server nodes: wipe local etcd data and re-join.
#    WARNING: this wipes all etcd state on these nodes (they will re-sync from step 4).
sudo rm -rf /var/lib/rancher/k3s/server/db/
sudo rm -f  /var/lib/rancher/k3s/server/token

# 6. Start k3s on the first server first:
sudo systemctl start k3s
sleep 30  # wait for it to become the etcd leader

# 7. Start k3s on the remaining servers (they will join the restored cluster):
sudo systemctl start k3s  # (on each remaining server)

# 8. Verify all members rejoined:
sudo k3s kubectl get nodes
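
Step 6 waits a fixed 30 seconds; polling until the API server reports ready is one possible alternative. The retry helper below is a generic sketch, not part of this repo, and the attempt count and delay in the example are arbitrary.

```shell
#!/usr/bin/env bash
# Run a command repeatedly until it succeeds or the attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command...>
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (on the first server, in place of `sleep 30`):
#   retry 30 5 sudo k3s kubectl get --raw /readyz
```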

Recovery without snapshot (last resort)

# 1. Stop k3s on ALL server nodes.
# 2. On the intended first server ONLY, reset with no snapshot:
sudo k3s server --cluster-reset &
# Wait, then stop it.
# 3. Wipe db/ and token on remaining servers (same as step 5 above).
# 4. Start the first server, wait 30s, then start the others.
# Note: without a snapshot you lose all etcd state from the previous cluster.

Deleting a stale leader lock (after full rebuild)

# If cloud-init aborts with "leader lock held by running instance" after a
# tofu destroy + tofu apply, the old lock is still in Object Storage.
# Delete it before re-applying, or it will be cleared automatically if the
# holder instance is no longer RUNNING.
oci os object delete \
  --namespace <your-namespace> \
  --bucket-name <cluster-name>-terraform-state \
  --name cluster-init-lock \
  --force

NLB IP stability

The public NLB has prevent_destroy = true so its IP is stable across tofu apply runs. However, if the NLB is ever recreated (e.g. after tofu state rm + re-apply):

All sslip.io hostnames change (e.g. grafana.<old-ip>.sslip.io → grafana.<new-ip>.sslip.io)
Let's Encrypt certificates are invalid for the new hostnames and must be reissued
With a custom domain + enable_external_dns = true, ExternalDNS updates DNS automatically and cert-manager auto-renews

If using sslip.io defaults, run tofu apply again after NLB recreation: local.grafana_hostname and local.argocd_hostname recompute automatically from the new IP, cloud-init re-creates the Gateway listeners and certificates, and cert-manager reissues via Let's Encrypt.

The first-server TIMECREATED election is stable in practice but not contractually guaranteed when pool instances share the same creation timestamp. In the rare case of a timestamp tie, jq | first picks whichever instance the API response lists first, an ordering that is consistent in practice but not specified. The atomic leader lock (cluster-init-lock in the state bucket) provides the final safety guarantee independent of election ordering.

License

MIT. See LICENSE.

Variables

Inputs

Name	Description	Type	Default	Required
alertmanager_email	Optional email address to subscribe to the OCI Notifications topic. The subscriber must confirm via an OCI confirmation email.	`string`	`null`	no
argocd_chart_version	ArgoCD Helm chart version used for the bootstrap install. Must match gitops/apps/argocd.yaml targetRevision. Managed by Renovate.	`string`	`"9.7.0"`	no
argocd_hostname	Fully-qualified hostname for the ArgoCD UI (e.g. argocd.example.com). When set, a Gateway API HTTPRoute with a cert-manager TLS certificate is created by cloud-init. If null, an sslip.io hostname is derived from the NLB IP.	`string`	`null`	no
availability_domain	Availability domain name, e.g. 'Uocm:EU-FRANKFURT-1-AD-1'	`string`	n/a	yes
boot_volume_size_in_gbs	Boot volume size in GB for k3s nodes (servers + workers). OCI minimum is 50 GB for all shapes. With 4 k3s nodes at 50 GB each the total is 200 GB (exactly at the Always Free limit). The bastion uses OCI Bastion Service — no VM, no boot volume.	`number`	`50`	no
certmanager_chart_version	cert-manager Helm chart version used for the bootstrap install. Must match gitops/apps/cert-manager.yaml targetRevision. Managed by Renovate.	`string`	`"v1.20.2"`	no
certmanager_email_address	Email address for Let's Encrypt ACME registration. Must be a real address.	`string`	n/a	yes
cloudflare_api_token	Cloudflare API token. Required when enable_external_dns = true or enable_dns01_challenge = true. Create a scoped token at https://dash.cloudflare.com/profile/api-tokens with Zone:DNS:Edit permissions.	`string`	`null`	no
cloudflare_zone_id	Cloudflare Zone ID for the managed domain. Required when enable_external_dns = true.	`string`	`null`	no
cluster_name	Logical name for the cluster. Used in display names and freeform tags.	`string`	n/a	yes
compartment_ocid	OCID of the compartment where all resources are created	`string`	n/a	yes
compute_shape	OCI compute shape for k3s nodes	`string`	`"VM.Standard.A1.Flex"`	no
dockerhub_password	Docker Hub access token (PAT) for ArgoCD OCI Helm chart pulls. Paired with dockerhub_username.	`string`	`""`	no
dockerhub_username	Docker Hub username for ArgoCD to authenticate when pulling OCI Helm charts (e.g. Envoy Gateway from registry-1.docker.io). If empty, anonymous pulls are attempted and may be rate-limited. Create a PAT at https://app.docker.com/settings/personal-access-tokens	`string`	`""`	no
enable_backup	Enable weekly boot volume backups for all k3s nodes (Always Free: 5 total backups). With 4 nodes backed up weekly and 1-week retention, there are at most 4 active backups.	`bool`	`true`	no
enable_bastion	Provision an OCI Bastion Service resource (managed SSH proxy, Always Free, no storage). When enabled, a STANDARD bastion is created and associated with the private subnet. Use example/get-kubeconfig.sh to retrieve kubeconfig via a Bastion session. Strongly recommended; without it, nodes are reachable only via serial console.	`bool`	`true`	no
enable_dns01_challenge	Configure cert-manager ClusterIssuers to use DNS-01 ACME challenge via Cloudflare instead of HTTP-01. Enables wildcard certificates (*.example.com) and works even without inbound port 80. Requires cloudflare_api_token.	`bool`	`false`	no
enable_etcd_snapshots	Upload etcd snapshots to the OCI Object Storage state bucket every 6 hours using OCI CLI instance_principal auth (no Customer Secret Keys required). Requires enable_object_storage_state = true. Provides off-node etcd backup for split-brain recovery.	`bool`	`true`	no
enable_external_dns	Deploy external-dns (kubernetes-sigs) configured for Cloudflare. Automatically creates/updates DNS A records when Services or Ingresses are annotated. Requires cloudflare_api_token and cloudflare_zone_id.	`bool`	`false`	no
enable_external_secrets	Deploy the External Secrets Operator and create a ClusterSecretStore backed by OCI Vault (instance_principal auth). Requires enable_vault = true. Workloads can then create ExternalSecret resources to sync any OCI Vault secret into a Kubernetes Secret without hard-coding values.	`bool`	`false`	no
enable_longhorn_backup	Provision a dedicated Always Free OCI Object Storage bucket for Longhorn PVC backups. Cloud-init automatically creates the backup credentials secret and wires the Longhorn BackupTarget when enable_longhorn_backup = true AND user_ocid is set. Shares the 20 GB free allowance with the Terraform state bucket.	`bool`	`true`	no
enable_mysql	Provision an Always Free MySQL HeatWave DB system (single node, 50 GB). Creates a Kubernetes Secret 'mysql-credentials' in the default namespace.	`bool`	`false`	no
enable_notifications	Create an OCI Notifications topic and wire the endpoint to Alertmanager as a webhook receiver (Always Free: 1M HTTPS + 3K email/month). ⚠️ IMPORTANT — ONS authentication limitation: The OCI Notifications PublishMessage REST endpoint requires OCI IAM request signing. Alertmanager sends unsigned HTTP POSTs, which OCI rejects with HTTP 401. Enabling this variable creates the ONS topic and records its endpoint in the 'notification_topic_endpoint' output, but alerts will NOT be delivered to ONS without a signing proxy. Workarounds (choose one): (a) Use Alertmanager's native 'email_configs' receiver with an SMTP relay — no proxy needed. (b) Deploy a small signing proxy (e.g. an OCI Function with instance-principal auth) between Alertmanager and the ONS endpoint. (c) Use a third-party webhook receiver (PagerDuty, Slack, etc.) that does not require signing. The 'alertmanager_email' variable provides a direct ONS email subscription — this works correctly and is independent of the signing limitation (OCI delivers email subscriptions internally).	`bool`	`false`	no
enable_object_storage_state	Provision an Always Free OCI Object Storage bucket for storing Terraform/OpenTofu state (S3-compatible API). See the terraform_state_backend output for the backend configuration snippet.	`bool`	`true`	no
enable_oci_logging	Enable OCI Logging for cloud-init logs. Ships /var/log/k3s-cloud-init.log to OCI Logging Service via the Unified Monitoring Agent (Always Free: 10 GB/month).	`bool`	`true`	no
enable_tailscale	Store Tailscale Kubernetes operator OAuth credentials in OCI Vault so the tailscale-operator ExternalSecret can sync them into the cluster without committing secrets to git. Requires enable_vault = true. Pre-requisite: create an OAuth client at https://login.tailscale.com/admin/settings/oauth with scope Devices → Write (devices:core:write) and allowed tag tag:k8s-operator.	`bool`	`false`	no
enable_vault	Store cluster secrets (k3s_token, longhorn_ui_password, grafana_admin_password) in OCI Vault (Always Free: software keys + 150 secrets). Nodes fetch secrets via OCI CLI instance_principal at boot — plaintext values are removed from cloud-init user-data.	`bool`	`true`	no
environment	Deployment environment label (e.g. staging, production)	`string`	`"staging"`	no
etcd_snapshot_retention	Number of etcd snapshots to retain in OCI Object Storage per node. Older snapshots are pruned automatically by the cron job. Must be >= 1 (0 would disable pruning and grow the bucket unbounded).	`number`	`5`	no
expose_kubeapi	Expose the Kubernetes API server via the public NLB (restricted to my_public_ip_cidr)	`bool`	`false`	no
expose_ssh	Expose SSH (port 22) via the public NLB to all cluster nodes (restricted to my_public_ip_cidr). Eliminates the need for OCI Bastion sessions for day-to-day access.	`bool`	`false`	no
external_dns_domain_filter	Domain filter for external-dns — only DNS records under this domain are managed (e.g. 'k3s.example.com'). Required when enable_external_dns = true.	`string`	`null`	no
external_secrets_chart_version	External Secrets Operator Helm chart version used for the bootstrap install. Must match gitops/apps/external-secrets.yaml targetRevision. Managed by Renovate.	`string`	`"2.6.0"`	no
fault_domains	Fault domains to spread the instance pool across	`list(string)`	`["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]`	no
gateway_api_version	Kubernetes Gateway API CRDs version (experimental channel) installed at bootstrap. Experimental channel is a superset of standard and includes GRPCRoute, TCPRoute, TLSRoute, etc. required by Envoy Gateway. Must exist before ArgoCD syncs gateway-config.	`string`	`"v1.5.1"`	no
github_ssh_keys_username	GitHub username whose published SSH keys (https://github.com/<username>.keys) are added to every instance's authorized_keys at plan time, in addition to the primary public_key / public_key_path. Leave empty to skip.	`string`	`""`	no
gitops_path	Path within gitops_repo_url that ArgoCD uses as the App of Apps source. Default is 'gitops/apps' (k3s-oci native layout). Override when your GitOps repo uses a different directory structure.	`string`	`"gitops/apps"`	no
gitops_repo_url	Git repository URL for the ArgoCD App of Apps (e.g. https://github.com/your-org/k3s-oci.git). Set this to your fork so ArgoCD pulls from the right repo.	`string`	`"https://github.com/mbologna/k3s-oci.git"`	no
gitops_ssh_private_key	SSH private key (PEM/OpenSSH format) for ArgoCD to clone the gitops repo. Terraform stores it in OCI Vault; cloud-init fetches it and creates the argocd-repo-gitops Secret before ArgoCD starts. Leave empty only when gitops_repo_url is a public HTTPS repo.	`string`	`""`	no
grafana_hostname	Fully-qualified hostname for the Grafana UI (e.g. grafana.example.com). When set, a Gateway API HTTPRoute with a cert-manager TLS certificate is created in gitops/monitoring/.	`string`	`null`	no
http_lb_port	Public HTTP port on the NLB frontend (default 80).	`number`	`80`	no
https_lb_port	Public HTTPS port on the NLB frontend (default 443).	`number`	`443`	no
ingress_controller_http_nodeport	NodePort on workers that the ingress controller binds for HTTP traffic	`number`	`30080`	no
ingress_controller_https_nodeport	NodePort on workers that the ingress controller binds for HTTPS traffic	`number`	`30443`	no
k3s_server_pool_size	Number of k3s control-plane nodes in the instance pool. Use 3 for HA (etcd quorum). Must be an odd number >= 1.	`number`	`3`	no
k3s_standalone_worker	When true (default), provisions one worker node as a plain oci_core_instance resource. This is the recommended approach for OCI Always Free tenancies: instance pools route requests through OCI Capacity Management which can fail for A1.Flex shapes, whereas a direct oci_core_instance reliably claims the free allocation. Default topology: 3 control-plane nodes (pool) + 1 standalone worker = 4 OCPUs / 24 GB.	`bool`	`true`	no
k3s_subnet	Subnet name used to derive the flannel interface. Leave 'default_route_table' to let k3s auto-detect.	`string`	`"default_route_table"`	no
k3s_version	k3s version to install. Use 'stable' or 'latest' to resolve from the k3s channel API at plan-time, or pin to a specific release (e.g. 'v1.35.5+k3s1').	`string`	`"stable"`	no
k3s_worker_pool_size	Number of k3s worker nodes managed by the OCI Instance Pool. Set to 0 (default) when using k3s_standalone_worker = true, which is the recommended Always Free topology. The pool is kept to allow future scaling beyond the free tier.	`number`	`0`	no
kube_api_port	Port the k3s API server listens on	`number`	`6443`	no
longhorn_hostname	Fully-qualified hostname for the Longhorn UI (e.g. longhorn.example.com). When set, a Gateway API HTTPRoute with BasicAuth (Envoy Gateway SecurityPolicy) and a cert-manager TLS certificate is created.	`string`	`null`	no
longhorn_ui_username	Username for Longhorn UI BasicAuth (only used when longhorn_hostname is set).	`string`	`"admin"`	no
my_public_ip_cidr	Your workstation public IP in CIDR notation (e.g. 1.2.3.4/32). Restricts OCI Bastion Service session creation (enable_bastion = true), kubeapi access via the public NLB (expose_kubeapi = true), and SSH access via the public NLB (expose_ssh = true). k3s nodes are in a private subnet; unless expose_ssh = true, they are reachable only via OCI Bastion sessions.	`string`	n/a	yes
mysql_admin_username	Admin username for the MySQL HeatWave DB system.	`string`	`"admin"`	no
mysql_shape	MySQL HeatWave shape. 'MySQL.Free' is the Always Free shape.	`string`	`"MySQL.Free"`	no
oci_core_vcn_cidr	CIDR block for the VCN	`string`	`"10.0.0.0/16"`	no
oci_core_vcn_dns_label	DNS label for the VCN (≤15 alphanumeric chars, no hyphens — OCI DNS constraint).	`string`	`"k3svcn"`	no
oci_identity_dynamic_group_name	Name for the OCI dynamic group granting instances access to the OCI API. Must be unique per tenancy — the default 'k3s-cluster-dynamic-group' collides if you deploy multiple clusters in the same tenancy. Recommended: set to "<cluster_name>-dynamic-group" in your tfvars.	`string`	`"k3s-cluster-dynamic-group"`	no
oci_identity_policy_name	Name for the OCI IAM policy attached to the dynamic group. Must be unique per tenancy — the default 'k3s-cluster-policy' collides if you deploy multiple clusters in the same tenancy. Recommended: set to "<cluster_name>-policy" in your tfvars.	`string`	`"k3s-cluster-policy"`	no
os_family	OS distribution for cluster nodes. "ubuntu" (default) uses OCI-native Ubuntu 24.04 and auto-resolves the image. "opensuse" uses openSUSE Leap 16.0 — requires os_image_id (use scripts/import-opensuse-aarch64.sh to import the image and obtain its OCID).	`string`	`"ubuntu"`	no
os_image_id	OCID of the OS image for A1.Flex nodes. If null and os_family = "ubuntu", the latest Ubuntu 24.04 LTS (Noble) aarch64 image is resolved automatically. Required when os_family = "opensuse" — use scripts/import-opensuse-aarch64.sh to import and capture the OCID.	`string`	`null`	no
private_subnet_cidr	CIDR for the private subnet (k3s nodes)	`string`	`"10.0.1.0/24"`	no
private_subnet_dns_label	DNS label for the private subnet (≤15 alphanumeric chars, no hyphens — OCI DNS constraint).	`string`	`"k3sprivate"`	no
public_key	SSH public key content placed on every instance. Preferred over public_key_path — pass the key string directly for CI pipelines where ~/.ssh does not exist. When null, the key is read from public_key_path at plan time.	`string`	`null`	no
public_key_path	Path to SSH public key file. Used as fallback when public_key is null.	`string`	`"~/.ssh/id_ed25519.pub"`	no
public_subnet_cidr	CIDR for the public subnet (load balancers and optional bastion)	`string`	`"10.0.0.0/24"`	no
public_subnet_dns_label	DNS label for the public subnet (≤15 alphanumeric chars, no hyphens — OCI DNS constraint).	`string`	`"k3spublic"`	no
region	OCI region identifier (e.g. 'eu-frankfurt-1'). Required when enable_external_secrets = true for the ClusterSecretStore to locate the OCI Vault endpoint.	`string`	`null`	no
server_memory_in_gbs	RAM in GB per control-plane node. Total RAM must not exceed 24 GB (Always Free).	`number`	`6`	no
server_ocpus	OCPUs per control-plane node. Total OCPUs across all nodes must not exceed 4 (Always Free).	`number`	`1`	no
tailscale_oauth_client_id	Tailscale OAuth client ID. Required when enable_tailscale = true.	`string`	`null`	no
tailscale_oauth_client_secret	Tailscale OAuth client secret. Required when enable_tailscale = true.	`string`	`null`	no
tenancy_ocid	OCID of the tenancy	`string`	n/a	yes
trace_enabled	Enable bash trace mode (set -x) in cloud-init scripts. Produces verbose output in /var/log/k3s-cloud-init.log. Useful for debugging bootstrap failures. Do NOT enable in production.	`bool`	`false`	no
unique_tag_key	Freeform tag key applied to every resource for identification	`string`	`"k3s-provisioner"`	no
unique_tag_value	Freeform tag value applied to every resource for identification	`string`	`"https://github.com/mbologna/k3s-oci"`	no
user_ocid	OCID of the OCI user running Terraform (format: ocid1.user.oc1..xxx). Required when enable_longhorn_backup = true to automatically create a Customer Secret Key for S3-compatible access, wire the Longhorn backup credentials Kubernetes Secret, and apply the Longhorn BackupTarget in cloud-init. When null, the Longhorn backup bucket is still created but wiring is manual (follow the longhorn_backup_setup output instructions).	`string`	`null`	no
worker_memory_in_gbs	RAM in GB per worker node.	`number`	`6`	no
worker_ocpus	OCPUs per worker node.	`number`	`1`	no
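
As a starting point, the six inputs marked "yes" in the Required column can be collected in a terraform.tfvars file. The sketch below uses placeholder values throughout (the OCIDs, cluster name, email address, and CIDR are illustrative, not real); the two optional name overrides follow the recommendation given in the oci_identity_dynamic_group_name and oci_identity_policy_name descriptions.

```hcl
# terraform.tfvars: only the inputs marked "yes" in the Required column.
# Every value here is a placeholder; substitute your own.
tenancy_ocid              = "ocid1.tenancy.oc1..xxx"
compartment_ocid          = "ocid1.compartment.oc1..xxx"
availability_domain       = "Uocm:EU-FRANKFURT-1-AD-1"
cluster_name              = "k3s-demo"
certmanager_email_address = "ops@example.com"
my_public_ip_cidr         = "1.2.3.4/32"

# Optional: per-cluster names avoid collisions when several clusters
# share one tenancy (pattern "<cluster_name>-dynamic-group" / "-policy").
oci_identity_dynamic_group_name = "k3s-demo-dynamic-group"
oci_identity_policy_name        = "k3s-demo-policy"
```

With everything else left at its default, the topology is 3 control-plane nodes plus 1 standalone worker at 1 OCPU and 6 GB each, which adds up to the 4 OCPUs / 24 GB stated in the k3s_standalone_worker description. Sensitive inputs such as cloudflare_api_token are usually better supplied through TF_VAR_ environment variables than committed to a tfvars file.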

Outputs

Name	Description
argocd_initial_password_hint	Command to retrieve the ArgoCD initial admin password (run after cluster is up)
bastion_ocid	OCID of the OCI Bastion Service resource (null if enable_bastion = false). Use with example/get-kubeconfig.sh or oci bastion session create-managed-ssh.
grafana_admin_credentials	Grafana admin credentials (only available after cluster bootstrap)
internal_lb_ip	Private IP of the internal load balancer (used by agents to join the cluster)
k3s_servers_private_ips	Private IPs of k3s control-plane nodes
k3s_standalone_worker_private_ip	Private IP of the standalone worker node (oci_core_instance, not pool-managed)
k3s_token	k3s cluster join token (sensitive)
k3s_workers_private_ips	Private IPs of k3s worker nodes (instance pool)
kubeconfig_hint	How to retrieve kubeconfig after cluster is up
longhorn_backup_setup	Longhorn backup bucket info and wiring status. Null if enable_longhorn_backup = false.
longhorn_ui_credentials	Longhorn UI credentials (only set when longhorn_hostname is configured)
mysql_admin_credentials	MySQL HeatWave admin credentials (sensitive). Null if enable_mysql = false.
mysql_endpoint	MySQL HeatWave connection endpoint (hostname:port). Null if enable_mysql = false.
notification_topic_endpoint	OCI Notifications HTTPS endpoint for the Alertmanager webhook receiver (null if enable_notifications = false).
oci_log_group_id	OCI Log Group OCID for k3s cloud-init logs (null if enable_oci_logging = false)
public_nlb_ip	Public IP address of the NLB (point your DNS here)
ssh_command	SSH command to connect to a cluster node via the public NLB (null if expose_ssh = false). Routes to any available server.
ssh_host_public_key	Shared SSH host public key deployed to all nodes. Add to known_hosts with: ssh-keygen -R && terraform output -raw ssh_host_public_key | ssh-keyscan -f - >> ~/.ssh/known_hosts (or simply ssh-keyscan >> ~/.ssh/known_hosts after apply).
tailscale_vault_secret_names	OCI Vault secret names for the Tailscale operator OAuth credentials (null if enable_tailscale = false). Reference these names in the ExternalSecret (platform//tailscale-operator/oauth-secret.yaml).
terraform_state_backend	S3-compatible backend config snippet for storing Terraform state in the provisioned OCI Object Storage bucket. Replace and add S3 credentials (OCI Customer Secret Key).
vault_id	OCI Vault OCID (null if enable_vault = false)
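
When the module is called from a root configuration, its outputs can be re-exported there. The fragment below is a sketch that assumes a `module "k3s"` block is declared elsewhere in the same root module; the label `k3s` and the root output names are hypothetical. It also assumes that k3s_token is declared sensitive inside the module, as its description suggests.

```hcl
# Root-module fragment. Assumes `module "k3s" { ... }` exists elsewhere
# in this configuration; the label "k3s" is a hypothetical choice.

output "nlb_ip" {
  description = "Public NLB address to point DNS records at"
  value       = module.k3s.public_nlb_ip
}

output "cluster_join_token" {
  description = "k3s join token, passed through from the module"
  value       = module.k3s.k3s_token
  # Terraform refuses to export a value derived from a sensitive output
  # unless the root output is marked sensitive as well.
  sensitive = true
}
```

After apply, `terraform output -raw nlb_ip` (or the `tofu` equivalent) prints the address; sensitive outputs are redacted in the summary and only shown when requested by name.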
