# Observability

This page covers the Prometheus metrics, status fields, and logs used to monitor the Remediator Agent in production.

## Prometheus Metrics

The Remediator Agent exposes Prometheus metrics at the controller manager’s metrics endpoint.

### Available Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `remediator_reconciles_total` | Counter | `result` (`success` or `error`) | Total number of reconciliation runs |
| `remediator_reconcile_duration_seconds` | Histogram | `result` (`success` or `error`) | Duration of each reconciliation run |

### Enable ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: go-agent-remediator-metrics
  namespace: go-agent-remediator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: https
      path: /metrics
      scheme: https
      tlsConfig:
        # Disables certificate verification for the metrics endpoint
        insecureSkipVerify: true
```

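### Access Metrics Directly

```bash
# Terminal 1: forward the metrics port (this command blocks while forwarding)
kubectl -n go-agent-remediator-system port-forward \
  deploy/go-agent-remediator-controller-manager 8443:8443

# Terminal 2: request a service account token and scrape the endpoint
SA=go-agent-remediator-controller-manager
NS=go-agent-remediator-system
TOKEN=$(kubectl -n "$NS" create token "$SA")
curl -k -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics
```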
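
For a quick spot check without a Prometheus server, the raw output of the endpoint can be summarized in the shell. The sketch below is illustrative only: `summarize_reconciles` is a hypothetical helper, not part of the agent, and it assumes the endpoint returns the standard Prometheus text format with no trailing timestamps, so that the sample value is the last field on each line.

```bash
# Hypothetical helper: sum remediator_reconciles_total by result label.
# Reads Prometheus text-format metrics on stdin.
summarize_reconciles() {
  awk '
    # Only consider labelled samples of the reconcile counter.
    index($0, "remediator_reconciles_total{") == 1 {
      value = $NF   # assumes no timestamp follows the sample value
      if ($0 ~ /result="success"/)    success += value
      else if ($0 ~ /result="error"/) error += value
    }
    END {
      printf "success=%d error=%d total=%d\n", success, error, success + error
    }
  '
}

# Usage (requires the port-forward and token from the commands above):
#   curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:8443/metrics \
#     | summarize_reconciles
```

Counters are cumulative since the controller started, so this shows lifetime totals rather than a rate over a time window.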

### Example Queries

```promql
# Success rate over the last hour
sum(rate(remediator_reconciles_total{result="success"}[1h]))
/ sum(rate(remediator_reconciles_total[1h]))

# P95 reconciliation latency
histogram_quantile(0.95,
  sum by (le) (rate(remediator_reconcile_duration_seconds_bucket[1h]))
)
```

---

## Remediator Status

The `Remediator` resource reports detailed status about each run.

```bash
# View full status
kubectl get remediator remediator-argo-hub -n nirmata -o yaml

# View just the last run summary
kubectl get remediator remediator-argo-hub -n nirmata \
  -o jsonpath='{.status.lastRunSummary}' | jq
```

### Status Fields

| Field | Description |
|-------|-------------|
| `phase` | Current operational phase: `Running`, `Idle`, or `Failed` |
| `lastScheduleTime` | When the last remediation was scheduled |
| `lastSuccessfulTime` | When the last successful run completed |
| `nextScheduledTime` | When the next run is scheduled |
| `conditions` | Step-by-step workflow tracking with collector information |
| `lastRunSummary.startTime` / `endTime` | Run duration timestamps |
| `lastRunSummary.status` | Success or failure |
| `lastRunSummary.message` | Human-readable outcome |
| `lastRunSummary.targetsProcessed` | Number of targets scanned |
| `lastRunSummary.violationsFound` | Total violations discovered |
| `lastRunSummary.remediationPlans` | Number of AI-generated plans produced |
| `lastRunSummary.actionsExecuted` | Number of actions taken (PRs created, etc.) |
| `lastRunSummary.errors` | Any errors encountered |
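
The `phase` field lends itself to simple health gates in scripts or CI jobs. The following is a minimal sketch, not an official tool: `phase_exit_code` is a hypothetical helper, and it assumes the field is exposed at `.status.phase` with exactly the three values listed above.

```bash
# Hypothetical helper: map a Remediator phase to an exit code.
#   0 = healthy (Running or Idle), 1 = Failed, 2 = unknown or empty
phase_exit_code() {
  case "$1" in
    Running|Idle) return 0 ;;
    Failed)       return 1 ;;
    *)            return 2 ;;
  esac
}

# Usage (requires cluster access; the .status.phase path is an assumption):
#   phase=$(kubectl get remediator remediator-argo-hub -n nirmata \
#     -o jsonpath='{.status.phase}')
#   phase_exit_code "$phase" || echo "remediator unhealthy: phase=${phase:-unknown}"
```

Treating an unknown or empty phase as a distinct exit code keeps a missing resource from being mistaken for a healthy one.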

### Example Status Query

```bash
kubectl get remediator remediator-argo-hub -n nirmata \
  -o jsonpath='{.status.lastRunSummary}' | jq '{
  status: .status,
  violations: .violationsFound,
  plans: .remediationPlans,
  actions: .actionsExecuted,
  errors: .errors
}'
```

---

## Logs

```bash
# Follow live logs
kubectl logs -n nirmata -l app.kubernetes.io/name=nirmata-agent -f

# Last 100 lines
kubectl logs -n nirmata -l app.kubernetes.io/name=nirmata-agent --tail=100
```

---

## Support Matrix

| Component | Supported |
|-----------|-----------|
| **Kubernetes** | All CNCF-compliant distributions v1.20+, including on-prem |
| **AI providers** | Nirmata AI (default), AWS Bedrock, Azure OpenAI |
| **GitOps** | ArgoCD |
| **VCS** | GitHub (App & PAT), GitLab (Enterprise & SaaS) |
| **Manifests** | YAML files, simple Helm charts |