Important
Human-Coded / Sin IA: All code and manifests in this repository have been developed manually without AI assistance.
Spanish Version (Original) / Versión en Español: This repository includes an original Spanish version written manually without the use of AI: README-Spanish.md.
English Documentation (AI-Enhanced):
This README.md (Engineering Guide) and README-English.md have been generated and translated using AI, based on the original Spanish documentation. Multimedia resources in the resources/notebooklm-summaries/ directory were generated using NotebookLM.
For a quick and dynamic overview of this repository's content, you can check these materials automatically generated by NotebookLM:
- Audio Overview (Spanish): Resumen Grafana en OpenShift (MP4)
- Audio Overview (English): Engineering Guide & Deep-Dive (MP4)
- Executive Presentation (PDF): Engineering deep-dive (PDF)
- Presentation Slides (PPTX): Engineering Slides (PPTX)
- 1. Executive Summary
- 2. Quick Navigation Map
- 3. Prerequisites and Environment
- 4. Platform Engineering: Object Mapping
- 5. Unified Tagging and Discovery Schema
- 6. Solution Comparison Matrix
- 7. Architectural Framework
- 8. Solution Inventory and Mapping
- 9. Solution 1: Grafana Cloud (SaaS)
- 10. Solution 2: kube-prometheus-stack (Community Chart)
- 11. Solution 3: Grafana Operator (Native Integration)
- 12. Identity and Security: AzureAD OAuth Flow
- 13. Security Hardening: SCC Analysis
- 14. Performance and Resource Profile
- 15. Day 1 and Day 2 Operations Cheat Sheet
- 16. Troubleshooting Decision Tree
- 17. Technical Reference and Resources
- 18. Technical Infographics: Engineering Blueprints
- 19. Troubleshooting and FAQ
This project implements a multi-tenant, high-availability observability stack using Grafana components. It is tailored for OpenShift 4.x, focusing on the transition from the legacy Grafana Agent to the new Grafana Alloy and the automation provided by the Grafana Operator.
./
├── solution-3-grafana-operator/           # ⭐ Recommended Native OpenShift Integration
│   ├── 1-grafana-operator.yaml            # Operator subscription manifest
│   ├── 3-grafana.yaml                     # Grafana Instance and OIDC configuration
│   └── templates/                         # Dashboards and Datasources as Code
├── solution-1-grafana-cloud/              # SaaS Hybrid strategy (Grafana Alloy)
│   ├── metrics.alloy                      # Core telemetry pipeline config
│   └── grafana-cloud.sh                   # Automated installer
└── solution-2-kube-prometheus-stack/      # Complete community stack (Air-gapped friendly)
    ├── installer-3.sh                     # AzureAD integrated installer
    └── values-kube-prometheus-stack.yml   # Custom Helm values
- Cluster: OpenShift 4.10+ (tested up to 4.14).
- Permissions: `cluster-admin` for SCC creation and Operator subscriptions.
- Identity: Azure Portal access for App Registrations.
- Tools: `oc` v4.x, `helm` v3.12+, `python3` (for dashboard scripts).
Analysis of Infrastructure as Code (IaC) components and their system functions.
| Component | K8s/OCP Object | Critical Function (Reverse Engineered) |
|---|---|---|
| Alloy Collector | `DaemonSet` | Scrapes kubelet (port 10250) and `/var/log/pods` via hostPID bypass. |
| Grafana Operator | `Subscription` | Manages OLM lifecycle; reconciles Grafana CRs into `StatefulSet` objects. |
| OAuth Proxy | Sidecar Container | Injected into Grafana pods; triggers OCP-native auth delegating to AzureAD. |
| Custom SCC | `SecurityContextConstraints` | Grants `allowPrivilegedContainer` for eBPF-based socket filtering in Alloy. |
| Datasource Provisioner | Shell Script / API | Injects long-lived SA tokens into Grafana to bypass 24h token expiry. |
For Alloy/Prometheus to automatically discover and instrument your applications, each workload must carry the following labels and port names.
| Metadata Type | Required Label/Annotation | Description |
|---|---|---|
| App Name | `app.kubernetes.io/name` | Used as the service tag in Grafana Cloud. |
| Metric Port | Port name must be `metrics` | Alloy specifically looks for ports named `metrics` in Service/Pod discovery. |
| Environment | `tags.datadoghq.com/env` | (Standardized) Used for multi-tenant environment filtering. |
| Version | `app.kubernetes.io/version` | Facilitates Trace/Log correlation via version tagging. |
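The schema above can be checked before a workload is deployed. The following is a minimal Python sketch, assuming the workload is described by a plain dictionary with `labels` and `ports` keys. That shape and the helper name `missing_schema_items` are hypothetical and chosen only for illustration; they are not part of this repository.

```python
# Required labels taken from the schema table above.
REQUIRED_LABELS = (
    "app.kubernetes.io/name",
    "tags.datadoghq.com/env",
    "app.kubernetes.io/version",
)
REQUIRED_PORT_NAME = "metrics"


def missing_schema_items(workload: dict) -> list:
    """Return the schema items a workload description is missing.

    `workload` is an illustrative dict (an assumption of this sketch):
    {"labels": {...}, "ports": [{"name": ...}, ...]}
    """
    labels = workload.get("labels", {})
    # A label counts as missing when it is absent or empty.
    missing = [key for key in REQUIRED_LABELS if not labels.get(key)]
    port_names = {port.get("name") for port in workload.get("ports", [])}
    if REQUIRED_PORT_NAME not in port_names:
        missing.append(f"port named '{REQUIRED_PORT_NAME}'")
    return missing


if __name__ == "__main__":
    example = {
        "labels": {"app.kubernetes.io/name": "checkout"},
        "ports": [{"name": "http", "port": 8080}],
    }
    for item in missing_schema_items(example):
        print("missing:", item)
```

A check like this could run in CI against rendered manifests, so that workloads which would not be discovered are caught before they reach the cluster.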
| Feature | Solution 1: Cloud | Solution 2: Community Chart | Solution 3: Operator |
|---|---|---|---|
| Back-end | Grafana Cloud (SaaS) | Local Prometheus/Loki | Thanos / Native OCP |
| Maintenance | Low (Managed) | High (Self-managed) | Medium (Operator-led) |
| Cost Profile | Pay-per-use (SaaS) | Infrastructure only | Low (Reuse OCP data) |
| OCP Integration | Medium | Medium | Very High |
| Ideal For | SaaS-first teams | Air-gapped clusters | Native OCP environments |
The project explores three distinct deployment models:
- Solution 1: Hybrid model pushing telemetry to Grafana Cloud via Alloy.
- Solution 2: Full on-prem stack via the community Helm chart.
- Solution 3: On-prem stack managed via Grafana Operator, integrated with OpenShift's internal Thanos/Prometheus.
| Solution | Path | Primary Backend | Status |
|---|---|---|---|
| Sol 1 | `solution-1-grafana-cloud/` | Grafana Cloud | Validated |
| Sol 2 | `solution-2-kube-prometheus-stack/` | Prometheus/Grafana | Validated |
| Sol 3 | `solution-3-grafana-operator/` | Thanos / OCP | Recommended |
High-Level Architecture (Hybrid SaaS): Alloy acts as the local bridge, concentrating all telemetry before securely forwarding it to the Grafana Cloud backend.
graph LR
subgraph OCP [OpenShift Cluster]
direction TB
Alloy[Grafana Alloy]
end
subgraph Cloud [Grafana Cloud SaaS]
Ingestion[Cloud Ingestion]
G[Grafana]
P[Prometheus/Loki/Tempo]
end
Alloy -->|OTLP / Logs / Metrics| Ingestion
Ingestion --> P
G --> P
Low-Level Design (Pipeline Flow): Applications send OTLP data to the Alloy DaemonSet, which processes it locally (relabeling, batching) and exports it to the cloud.
graph TD
subgraph Nodes [Worker Nodes]
App[App Pods] -->|OTLP / gRPC| AlloyDS[Alloy DaemonSet]
Kubelet[Kubelet Stats] --> AlloyDS
end
subgraph Pipelines [Alloy Pipeline]
AlloyDS -->|Process / Filter| OTLPOut[OTLP Exporter]
end
OTLPOut -->|Secure Remote Write| GCloud[Grafana Cloud]
- Namespace: `oc apply -f namespace.yaml`
- Security: Apply SCCs to grant necessary privileges to Alloy: `oc apply -f scc-grafanacloud.yaml` and `oc apply -f scc-grafanacloud2.yaml`
- Deployment: Execute the installation script: `./grafana-cloud.sh`
Alloy gateway endpoints for applications:
- OTLP/gRPC: `http://grafana-alloy.grafana-cloud.svc:4317`
- Zipkin: `http://grafana-alloy.grafana-cloud.svc:9411`
Based on `metrics.alloy` engineering:
- Metric Dropping: Automatically discards `container_memory_cache` and `container_threads` to reduce series volume by ~15%.
- Relabeling: Only metrics with `label_keep` are sent to the cloud, ensuring cost control at the source.
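The two filtering rules above can be illustrated with a small Python sketch. It is not the Alloy configuration itself (the actual rules live in `metrics.alloy`), and it assumes one possible reading of the relabeling rule: that `label_keep` is a marker label present on every series that should be forwarded. The series shape and the function name `filter_series` are hypothetical.

```python
# Metric names the text above says are discarded at the source.
DROPPED_METRICS = {"container_memory_cache", "container_threads"}


def filter_series(series: list, keep_label: str = "label_keep") -> list:
    """Drop unwanted metric names, then keep only series carrying `keep_label`.

    Each series is an illustrative dict: {"name": str, "labels": {str: str}}.
    Treating `label_keep` as a marker label is an assumption of this sketch.
    """
    kept = []
    for item in series:
        if item["name"] in DROPPED_METRICS:
            continue  # metric dropping: never leaves the cluster
        if keep_label not in item.get("labels", {}):
            continue  # relabeling gate: unmarked series are not forwarded
        kept.append(item)
    return kept


if __name__ == "__main__":
    sample = [
        {"name": "container_threads", "labels": {"label_keep": "true"}},
        {"name": "up", "labels": {"label_keep": "true"}},
        {"name": "up", "labels": {}},
    ]
    print(len(filter_series(sample)), "of", len(sample), "series forwarded")
```

Filtering before the remote write is what keeps the cost control "at the source": dropped series are never sent, so they are never billed.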
High-Level Architecture (Self-Managed): A traditional on-premise observability stack where all components (ingestion, storage, and visualization) reside within the OpenShift cluster.
graph LR
subgraph OCP [OpenShift Cluster]
direction TB
H[Helm: kube-prometheus-stack]
P[Prometheus]
L[Loki]
G[Grafana]
end
H --> P & L & G
Low-Level Design (Internal Interaction): Prometheus scrapes metrics from targets via ServiceMonitors, while Grafana queries both Prometheus and Loki for unified visualization.
graph TD
subgraph Monitoring [kubeprometheus Namespace]
P[Prometheus]
L[Loki]
G[Grafana]
Proxy[OAuth Proxy]
end
App[Target Pods] -.->|Scrape| P
App -.->|Push Logs| L
User[User] --> Proxy --> G
G -->|PromQL / LogQL| P & L
- Provisioning: Apply `namespace.yaml` and `scc-kubeprometheus.yaml`.
- Deployment: Use `./installer-3.sh` for AzureAD integration.
- Redirect URIs: `https://grafana-kubeprometheus.apps.<cluster>/login/azuread`
- RBAC Mapping: Azure Groups are mapped to Grafana Roles (Admin/Editor/Viewer) via the `X-Forwarded-Groups` header.
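To make the RBAC mapping concrete, here is a minimal Python sketch of how a groups header could be resolved to a single Grafana role. It only illustrates the idea; in the deployed stack the mapping is done by Grafana and the proxy configuration, not by code like this. The group names, the comma-separated header format, and the function name `resolve_role` are assumptions.

```python
# Hypothetical Azure group names; replace with the groups used in your tenant.
GROUP_TO_ROLE = {
    "grafana-admins": "Admin",
    "grafana-editors": "Editor",
}
ROLE_PRIORITY = ["Viewer", "Editor", "Admin"]  # lowest to highest privilege


def resolve_role(forwarded_groups: str, default: str = "Viewer") -> str:
    """Pick the highest role granted by a comma-separated groups header value."""
    groups = [g.strip() for g in forwarded_groups.split(",") if g.strip()]
    roles = [GROUP_TO_ROLE[g] for g in groups if g in GROUP_TO_ROLE]
    roles.append(default)  # users without a mapped group fall back to the default
    return max(roles, key=ROLE_PRIORITY.index)


if __name__ == "__main__":
    print(resolve_role("grafana-editors, grafana-admins"))
```

Resolving to the highest matching role, with Viewer as the fallback, is one reasonable policy; a stricter setup might reject users with no mapped group instead.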
High-Level Architecture (Operator-Led): Automated lifecycle management of Grafana using Kubernetes-native Custom Resources (CRs), leveraging OpenShift's internal monitoring data.
graph LR
subgraph OCP [OpenShift Cluster]
direction TB
Op[Grafana Operator]
GI[Grafana Instance]
Thanos[OCP Internal Thanos]
end
Op -->|Manages Lifecycle| GI
GI -->|Queries| Thanos
How the Operator ensures the desired state is met.
Operator Reconciliation Sequence:
sequenceDiagram
participant Git as Git (IaC)
participant OCP as OCP API Server
participant Op as Grafana Operator
participant GI as Grafana Instance
Git->>OCP: Apply Grafana CR
Op->>OCP: Watch for changes
OCP->>Op: Notify CR Creation
Op->>Op: Calculate Diff
Op->>GI: Create/Update StatefulSet & Service
Op->>GI: Inject Datasource via API
Note over Op: Reconcile Loop (30s)
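The "Calculate Diff" and "Create/Update" steps in the sequence above follow the usual reconcile pattern: compare the desired state with the observed state and act only on the differences. The Python sketch below illustrates that pattern under simplifying assumptions (objects are plain dictionaries keyed by name); it is not the Grafana Operator's actual implementation, and `plan_reconcile` is a hypothetical name.

```python
def plan_reconcile(desired: dict, actual: dict) -> list:
    """Compare desired and observed objects and return (action, name) steps.

    Both arguments map an object name to its spec (any comparable value).
    """
    steps = []
    for name, spec in desired.items():
        if name not in actual:
            steps.append(("create", name))   # object does not exist yet
        elif actual[name] != spec:
            steps.append(("update", name))   # object drifted from the CR
    return steps


if __name__ == "__main__":
    desired = {"statefulset": {"replicas": 2}, "service": {"port": 3000}}
    actual = {"statefulset": {"replicas": 1}}
    for action, name in plan_reconcile(desired, actual):
        print(action, name)
```

Running the same plan on every loop iteration is what makes the process self-healing: when nothing has drifted, the plan is empty and no API calls are made.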
- Operator: `oc apply -f 1-grafana-operator.yaml`
- Instance: `oc apply -f 3-grafana.yaml`
- Automation: `./4-grafana-datasource.sh` imports the Thanos connection.
The authentication flow leverages the OpenShift OAuth Proxy as a sidecar to the Grafana instance.
AzureAD OAuth Flow:
sequenceDiagram
participant User
participant Proxy as OCP OAuth Proxy
participant AAD as Azure Active Directory
participant Grafana
User->>Proxy: Access Grafana URL
Proxy->>AAD: Redirect to Login (OIDC)
AAD->>User: Request MFA / Credentials
User-->>AAD: Validated
AAD->>Proxy: Auth Code / ID Token
Proxy->>Grafana: Header-based Auth (X-WEBAUTH-USER)
Grafana->>Grafana: Map Group to Admin/Editor Role
Grafana-->>User: Granted Access
Deep dive into the custom SecurityContextConstraints provided in this repo.
| Capability | Enabled | Technical Reason |
|---|---|---|
| `allowPrivilegedContainer` | YES | Required for Alloy to use `BPF_PROG_TYPE_SOCKET_FILTER`. |
| `allowHostPID` | YES | Alloy must map container PIDs to host PIDs for process-level metrics. |
| `allowHostNetwork` | YES | Allows collection of host-level network interface statistics. |
| `runAsUser` | `RunAsAny` | Supports legacy images and specific system-level agents. |
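Because these settings widen what a pod can do on the host, it can help to review them programmatically. The following Python sketch lists the elevated settings found in an SCC that has already been parsed into a dictionary (for example from its YAML manifest). The field names are the ones shown in the table; the function name `elevated_settings` and the sample values are illustrative assumptions.

```python
# Boolean SCC fields from the table above that grant host-level access.
ELEVATED_FLAGS = ("allowPrivilegedContainer", "allowHostPID", "allowHostNetwork")


def elevated_settings(scc: dict) -> list:
    """Return the elevated settings enabled in a parsed SCC dictionary."""
    found = [flag for flag in ELEVATED_FLAGS if scc.get(flag) is True]
    # runAsUser is a strategy object; RunAsAny allows any UID, including root.
    if scc.get("runAsUser", {}).get("type") == "RunAsAny":
        found.append("runAsUser=RunAsAny")
    return found


if __name__ == "__main__":
    sample_scc = {
        "allowPrivilegedContainer": True,
        "allowHostPID": True,
        "allowHostNetwork": True,
        "runAsUser": {"type": "RunAsAny"},
    }
    for setting in elevated_settings(sample_scc):
        print("elevated:", setting)
```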
Based on production-grade limits defined in values.yaml and DatadogAgent CRs.
| Component | CPU (Req/Lim) | RAM (Req/Lim) | Scaling Factor |
|---|---|---|---|
| Alloy (DaemonSet) | 250m / 500m | 512Mi / 1Gi | Per Cluster Node |
| Grafana Instance | 100m / 200m | 256Mi / 512Mi | High Availability (2 Replicas) |
| Prometheus (Sol 2) | 1.0 / 2.0 | 4Gi / 8Gi | Database Volume dependent |
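Since Alloy scales per node and Grafana per replica, the totals depend on cluster size. The short Python sketch below works through that arithmetic using the figures from the table; the node count of 6 is an arbitrary example, and the names `PROFILE` and `total` are illustrative.

```python
# (CPU millicores, memory MiB) per unit, taken from the table above.
PROFILE = {
    "alloy": {"request": (250, 512), "limit": (500, 1024)},    # per node
    "grafana": {"request": (100, 256), "limit": (200, 512)},   # per replica
}


def total(component: str, units: int, kind: str = "request") -> tuple:
    """Return (CPU millicores, memory MiB) for `units` nodes or replicas."""
    cpu, mem = PROFILE[component][kind]
    return cpu * units, mem * units


if __name__ == "__main__":
    nodes = 6  # example cluster size, not a value from this repository
    print("Alloy requests:", total("alloy", nodes))            # (1500, 3072)
    print("Alloy limits:", total("alloy", nodes, "limit"))     # (3000, 6144)
    print("Grafana requests (2 replicas):", total("grafana", 2))  # (200, 512)
```

For a 6-node cluster, the Alloy DaemonSet alone therefore reserves 1.5 CPU cores and 3 GiB of memory, which is worth including in capacity planning.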
Provisioning (Day 1):
- `oc get subscriptions -n openshift-operators` - Check Operator health.
- `oc get csv` - Verify Cluster Service Version status.
Maintenance (Day 2):
- Token Refresh: `oc create token grafana-sa --duration=8760h` (generates a 1-year token for Thanos).
- Logs Audit: `oc logs -l app=grafana -c grafana` - Debugging OAuth handshake.
- Alloy Debug: `oc port-forward alloy-pod 12345:12345` - Access Alloy's internal UI.
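The `8760h` in the Token Refresh command is simply 365 days × 24 hours. For scripted rotation, the command line can be built from a number of days, as in this minimal Python sketch. It only constructs and prints the command; executing it (for example with `subprocess.run`) and storing the token in the Grafana datasource are left out. The function name `token_command` is hypothetical.

```python
import shlex


def token_command(service_account: str, days: int = 365) -> list:
    """Build the `oc create token` argument list for a token valid for `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    hours = days * 24  # oc accepts a duration in hours, e.g. 8760h for one year
    return ["oc", "create", "token", service_account, f"--duration={hours}h"]


if __name__ == "__main__":
    # Printing rather than executing keeps the sketch free of side effects.
    print(shlex.join(token_command("grafana-sa")))
```

Note that the cluster may cap the requested duration, so it is worth checking the expiry of the issued token rather than assuming the full year was granted.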
Use this guide to diagnose connectivity or visibility issues.
Troubleshooting Decision Tree:
flowchart TD
Start([Issue Detected]) --> Metrics?{Metrics missing?}
Metrics? -- Yes --> AlloyLogs[Check Alloy DaemonSet Logs]
AlloyLogs --> SCC{SCC Applied?}
SCC -- No --> ApplySCC[Apply alloy-scc.yaml]
SCC -- Yes --> AuthCloud[Verify Cloud Access Policy Token]
Metrics? -- No --> Login?{Login Failed?}
Login? -- Yes --> ProxyLogs[Check OAuth-Proxy Container Logs]
ProxyLogs --> Redirect{Redirect URI match?}
Redirect -- No --> AzureAD[Update Azure App Registration]
Redirect -- Yes --> Secret{Secret Correct?}
Login? -- No --> Datasource?{Thanos Data Error 403?}
Datasource? -- Yes --> SAToken[Regenerate SA Token - Section 19]
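The same decision tree can be written as a small function, which is convenient for runbooks or chat-ops helpers. This Python sketch mirrors the branches of the diagram above; the parameter names and the final fallback message are illustrative assumptions.

```python
def next_step(metrics_missing: bool = False, scc_applied: bool = True,
              login_failed: bool = False, redirect_uri_matches: bool = True,
              thanos_403: bool = False) -> str:
    """Return the next action suggested by the troubleshooting decision tree."""
    if metrics_missing:
        # The tree starts with the Alloy DaemonSet logs, then checks the SCC.
        if not scc_applied:
            return "Apply alloy-scc.yaml"
        return "Verify Cloud Access Policy Token"
    if login_failed:
        # The tree starts with the OAuth-Proxy container logs.
        if not redirect_uri_matches:
            return "Update Azure App Registration"
        return "Check that the OAuth secret is correct"
    if thanos_403:
        return "Regenerate SA Token"
    return "No branch of the decision tree matched"


if __name__ == "__main__":
    print(next_step(metrics_missing=True, scc_applied=False))
```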
- Dashboards: dotdc/grafana-dashboards-kubernetes
- Official Releases: Grafana 11 News
- Alloy Config: Detailed examples in `solution-1-grafana-cloud/metrics.alloy`.
High-resolution visual guides for architecture, deployment patterns, and solution comparison.
Expert Insight: A 403 error from the Thanos datasource is usually due to an expired Service Account token. Since OCP 4.11, SA tokens are time-bound.
Solution: Use the TokenRequest API to generate a long-lived token (up to 1 year) and update the Grafana Datasource:
`oc create token grafana-instance-sa --duration=$((365*24))h`
For AzureAD login failures, ensure your Azure App Registration includes the exact redirect URL provided by the OpenShift Route: `https://<route-url>/login/azuread`.
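Because the redirect URL must match exactly, a quick comparison can rule out the most common mismatch (wrong host, missing path, trailing slash). The following Python sketch builds the expected URL from a route host and checks it against a list of registered URIs. The host name and the function names are hypothetical; the list of registered URIs would come from your Azure App Registration.

```python
def expected_redirect_uri(route_host: str) -> str:
    """Build the redirect URL Grafana uses for AzureAD on a given route host."""
    return f"https://{route_host}/login/azuread"


def redirect_registered(route_host: str, registered_uris: list) -> bool:
    """Return True only if the exact expected URL is among the registered URIs."""
    return expected_redirect_uri(route_host) in set(registered_uris)


if __name__ == "__main__":
    host = "grafana.apps.example.com"  # hypothetical route host
    registered = ["https://grafana.apps.example.com/login/azuread/"]
    # The trailing slash makes this a mismatch, so the check reports False.
    print(redirect_registered(host, registered))
```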


