Observability and Metrics Guide
The Remediator Agent exposes Prometheus metrics via the controller-runtime metrics endpoint. Use this guide to:
- Understand key metrics and labels
- Collect via OpenTelemetry Collector (recommended) and export to Grafana Cloud or any Prometheus-compatible backend
- Quickly test locally via port-forward + curl
Prerequisites
- Kubernetes cluster with the agent deployed in the go-agent-remediator-system namespace (the kustomize default).
- OpenTelemetry Collector (OTel Collector) installed in-cluster (Deployment or DaemonSet). If it is not installed, see the example manifest below.
- Optional: Grafana Cloud account to visualize and store metrics.
  - Grafana Cloud stack (region), OTLP endpoint, and an API token with metrics:write (and optionally traces/logs if you use them later).
  - Example OTLP endpoint: https://otlp-gateway-<region>.grafana.net/otlp
Key metrics
- remediator_reconciles_total (counter), labels: result="success|error"
- remediator_reconcile_duration_seconds (histogram), labels: result="success|error"
- violations_active (gauge), labels: cluster, application, severity
- remediation_plans_generated_total (counter), labels: cluster, application
- actions_executed_total (counter), labels: type, status="success|error"
- pr_opened_total (counter), labels: repo, application, cluster
- pr_merged_total (counter) and pr_merge_latency_seconds (histogram) are defined but not yet emitted (emission hooks pending)
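To confirm which of the metrics listed above actually appear in a scrape, a small filter over the Prometheus text format can help. The sketch below is illustrative only: the function name missing_metrics is hypothetical, and it assumes the endpoint emits standard "# TYPE" lines. A labelled metric typically does not appear until its first observation, so a missing name is not necessarily a bug.

```shell
# missing_metrics: read Prometheus text exposition from stdin and print the
# metric names from the list above that have no "# TYPE" line in the scrape.
missing_metrics() {
  awk '
    BEGIN {
      n = split("remediator_reconciles_total remediator_reconcile_duration_seconds violations_active remediation_plans_generated_total actions_executed_total pr_opened_total", want, " ")
    }
    # Type lines look like: # TYPE <name> <counter|gauge|histogram>
    $1 == "#" && $2 == "TYPE" { seen[$3] = 1 }
    END {
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }
  '
}
```

Pipe any metrics dump into it, for example the output of the curl test described later in this guide: curl ... /metrics | missing_metrics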
OpenTelemetry Collector (recommended)
Use the OTel Collector to scrape the agent’s metrics (Prometheus receiver) and export to your destination (OTLP to Grafana Cloud shown below).
Create a Secret with your Grafana Cloud OTLP API key and region:
apiVersion: v1
kind: Secret
metadata:
  name: otel-grafana-cloud
  namespace: go-agent-remediator-system
stringData:
  GRAFANA_CLOUD_REGION: "<your-region>" # e.g., prod-us-east-0; copy it from your stack's OTLP endpoint
  GRAFANA_CLOUD_OTLP_API_KEY: "<your-otlp-api-key>" # token with metrics:write
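If you would rather not keep the token in a manifest file, the same Secret can be rendered with kubectl. This is a sketch, not the project's documented procedure: the function name create_grafana_secret is hypothetical, and the namespace and key names are copied from the manifest above. kubectl writes the values under data (base64) rather than stringData, which produces the same Secret.

```shell
# create_grafana_secret REGION API_KEY
# Renders the Secret with kubectl instead of a hand-written manifest.
# --dry-run=client -o yaml prints the manifest without touching the cluster;
# pipe the output to `kubectl apply -f -` to create it.
create_grafana_secret() {
  region=$1
  api_key=$2
  kubectl -n go-agent-remediator-system create secret generic otel-grafana-cloud \
    --from-literal=GRAFANA_CLOUD_REGION="$region" \
    --from-literal=GRAFANA_CLOUD_OTLP_API_KEY="$api_key" \
    --dry-run=client -o yaml
}
```

Example: create_grafana_secret "<your-region>" "<your-otlp-api-key>" | kubectl apply -f -
Literals passed on the command line can end up in shell history, so read the token from a file or environment variable where that matters.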
Create the OTel Collector ConfigMap with the Prometheus scrape job and the OTLP exporter:
# Environment variables the Collector container needs (set in the Deployment); not a standalone manifest
env: &env
  - name: GRAFANA_CLOUD_REGION
    valueFrom:
      secretKeyRef:
        name: otel-grafana-cloud
        key: GRAFANA_CLOUD_REGION
  - name: GRAFANA_CLOUD_OTLP_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-grafana-cloud
        key: GRAFANA_CLOUD_OTLP_API_KEY
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: go-agent-remediator-system
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: go-agent-remediator
              scheme: https
              tls_config:
                insecure_skip_verify: true # For dev; use proper certs in prod
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                - role: endpoints
                  namespaces:
                    # Namespace where the agent runs; adjust if you deployed it elsewhere
                    names: ["go-agent-remediator-system"]
              relabel_configs:
                - action: keep
                  source_labels: [__meta_kubernetes_service_label_control_plane]
                  regex: controller-manager
                - action: keep
                  source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                  regex: go-agent-remediator
                - action: keep
                  source_labels: [__meta_kubernetes_endpoint_port_name]
                  regex: https
    processors:
      batch: {}
    exporters:
      otlphttp/grafana:
        # Grafana Cloud OTLP gateway
        endpoint: "https://otlp-gateway-${GRAFANA_CLOUD_REGION}.grafana.net/otlp"
        headers:
          # Assumes the gateway accepts the token as a bearer token; check your stack's
          # OTLP connection details, since some setups expect Basic auth (instance ID and token)
          Authorization: "Bearer ${GRAFANA_CLOUD_OTLP_API_KEY}"
        tls:
          insecure_skip_verify: false
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlphttp/grafana]
Deploy the OTel Collector (single replica example):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: go-agent-remediator-system
spec:
  replicas: 1
  selector:
    matchLabels: { app: otel-collector }
  template:
    metadata:
      labels: { app: otel-collector }
    spec:
      serviceAccountName: go-agent-remediator-controller-manager
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector:0.104.0
          args: ["--config=/conf/config.yaml"]
          env:
            - name: GRAFANA_CLOUD_REGION
              valueFrom:
                secretKeyRef: { name: otel-grafana-cloud, key: GRAFANA_CLOUD_REGION }
            - name: GRAFANA_CLOUD_OTLP_API_KEY
              valueFrom:
                secretKeyRef: { name: otel-grafana-cloud, key: GRAFANA_CLOUD_OTLP_API_KEY }
          volumeMounts:
            - name: config
              mountPath: /conf
          ports:
            - name: metrics
              containerPort: 8888
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
            items:
              - key: config.yaml
                path: config.yaml
Notes:
- The scrape job uses Kubernetes service discovery and restricts targets to the controller’s Service and port via Service labels and the port name https.
- The bearer token comes from the pod’s ServiceAccount and satisfies the metrics endpoint’s authn/authz filter.
- For production, configure TLS properly and remove insecure_skip_verify.
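After applying the manifests, it helps to confirm that the Collector rolled out and to look at its logs for scrape or export errors. The following is a minimal sketch; the function name check_collector is hypothetical, and the Deployment name and namespace are taken from the example above.

```shell
# check_collector: wait for the otel-collector Deployment to become ready,
# then show recent logs so scrape or export errors are visible.
check_collector() {
  ns=go-agent-remediator-system
  kubectl -n "$ns" rollout status deploy/otel-collector --timeout=120s &&
    kubectl -n "$ns" logs deploy/otel-collector --tail=50
}
```

If the rollout does not finish within the timeout, the logs step is skipped; kubectl -n go-agent-remediator-system describe pod -l app=otel-collector usually shows why the pod is not ready.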
Alternative: Prometheus Operator
If you prefer Prometheus Operator, enable config/prometheus/monitor.yaml and ensure your Prometheus selects the Service/namespace. The OTel Collector pattern above is the recommended default because it can export to multiple backends and can later carry traces and logs.
Quick local test via port-forward + curl
Secure metrics are enabled by default. Use HTTPS, skip certificate verification, and include a valid bearer token.
- Port-forward the controller manager Deployment:
kubectl -n go-agent-remediator-system port-forward deploy/go-agent-remediator-controller-manager 8443:8443
- Fetch a token from a ServiceAccount with permissions to view metrics (e.g., the controller manager SA):
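SA=go-agent-remediator-controller-manager
NS=go-agent-remediator-system
# Kubernetes 1.24+ no longer auto-creates token Secrets for ServiceAccounts; request a short-lived token instead
TOKEN=$(kubectl -n "$NS" create token "$SA")
# On older clusters that still populate .secrets on the ServiceAccount:
# SECRET=$(kubectl -n "$NS" get sa "$SA" -o jsonpath='{.secrets[0].name}')
# TOKEN=$(kubectl -n "$NS" get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)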
- Curl the metrics endpoint (HTTPS, insecure):
curl -k -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics
If you prefer HTTP without TLS (dev only), run the manager with --metrics-secure=false and bind to :8080, then:
kubectl -n go-agent-remediator-system port-forward deploy/go-agent-remediator-controller-manager 8080:8080
curl http://localhost:8080/metrics
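The raw output of either curl command is long, so a quick way to read one metric is to sum its samples. The sketch below is an illustration with a hypothetical function name, metric_sum. It assumes samples carry no trailing timestamp (so the value is the last field) and treats the optional label filter as a literal substring match.

```shell
# metric_sum NAME [LABEL_MATCH]
# Reads Prometheus text exposition on stdin and sums every sample of NAME.
# LABEL_MATCH is an optional literal substring such as result="error".
metric_sum() {
  awk -v name="$1" -v match_str="$2" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      # The sample name is everything before the first "{" or space.
      metric = $0
      sub(/[{ ].*/, "", metric)
      if (metric != name) next
      if (match_str != "" && index($0, match_str) == 0) next
      total += $NF                      # the value is the last field
    }
    END { print total + 0 }
  '
}
```

Example: curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics | metric_sum remediator_reconciles_total 'result="error"'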
Example Grafana panels (PromQL)
- Reconcile success ratio (1h):
sum(rate(remediator_reconciles_total{result="success"}[1h]))
/
sum(rate(remediator_reconciles_total[1h]))
- Reconcile latency p95 (1h):
histogram_quantile(0.95,
sum by (le) (rate(remediator_reconcile_duration_seconds_bucket[1h]))
)
- Active violations by severity:
sum by (severity) (violations_active)
- Plans generated (rate, by application):
sum by (application) (rate(remediation_plans_generated_total[1h]))
- Actions success/failure (rate, by type):
sum by (type, status) (rate(actions_executed_total[1h]))
- PRs opened (24h, top repos):
topk(5, sum by (repo) (increase(pr_opened_total[24h])))
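The queries above can also be run outside Grafana against any backend that exposes the standard Prometheus HTTP API. This is a sketch under that assumption: the function name promql_query and the base URL in the example are hypothetical, and hosted backends usually also require authentication and may use a different path prefix.

```shell
# promql_query BASE_URL QUERY
# Runs an instant query against a Prometheus-compatible HTTP API.
# -G sends the URL-encoded query as a GET parameter.
promql_query() {
  curl -sS -G "$1/api/v1/query" --data-urlencode "query=$2"
}
```

Example against a local Prometheus: promql_query http://localhost:9090 'sum by (severity) (violations_active)'
The response is JSON; the samples are under data.result.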
Troubleshooting
- Empty metrics: confirm metrics Service exists, OTel Collector is running, and the scrape job selects the correct Service/port.
- 403/401 when curling: include a valid bearer token with access to the metrics endpoint.
- TLS errors: use -k (insecure) for quick testing, or configure proper certs for the metrics endpoint and OTel Collector.
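The checks above can be tied to the HTTP status that curl reports. The helper below is a hypothetical convenience, not part of the agent; it relies on curl printing 000 for %{http_code} when no HTTP response was received (for example a refused connection or a failed TLS handshake).

```shell
# metrics_hint HTTP_STATUS
# Maps the HTTP status returned by the metrics endpoint to the matching
# troubleshooting step above.
metrics_hint() {
  case "$1" in
    200) echo "ok: endpoint reachable and token accepted" ;;
    401|403) echo "auth: send a valid bearer token allowed to read /metrics" ;;
    000) echo "connect: check the port-forward, the scheme (http vs https), and TLS settings" ;;
    *) echo "unexpected status $1: inspect the controller manager logs" ;;
  esac
}
```

Example: metrics_hint "$(curl -sk -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics)"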