Monitoring Guide

This guide covers monitoring the Frappe Operator and Frappe workloads.

Prometheus Metrics
Grafana Dashboards
Alert Rules
Logging
Health Checks

Prometheus Metrics

The Frappe Operator exposes Prometheus metrics on port 8080 at /metrics.

Available Metrics

Metric	Type	Labels	Description
`frappe_operator_reconciliation_duration_seconds`	Histogram	`controller`, `result`	Time spent reconciling resources
`frappe_operator_reconciliation_errors_total`	Counter	`controller`, `error_type`	Total number of reconciliation errors
`frappe_operator_job_status`	Gauge	`job_name`, `namespace`, `status`	Current status of operator jobs
`frappe_operator_resource_total`	Gauge	`resource_type`, `namespace`	Total count of managed resources

Enabling Metrics

Metrics are enabled by default. The ServiceMonitor is created if Prometheus Operator is detected.

# Custom ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frappe-operator-metrics
  namespace: frappe-operator-system
  labels:
    control-plane: controller-manager
spec:
  endpoints:
  - port: https
    scheme: https
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
    interval: 30s
  selector:
    matchLabels:
      control-plane: controller-manager

Scraping Without Prometheus Operator

# Prometheus scrape config
scrape_configs:
  - job_name: 'frappe-operator'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - frappe-operator-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_control_plane]
        action: keep
        regex: controller-manager
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: https

Grafana Dashboards

Operator Dashboard

A standalone dashboard JSON is available at docs/grafana-dashboard.json for one-click import. Alternatively, use the embedded JSON below.

{
  "annotations": {
    "list": []
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 80 }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "id": 1,
      "options": {
        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" },
        "tooltip": { "mode": "single" }
      },
      "targets": [
        {
          "expr": "rate(frappe_operator_reconciliation_duration_seconds_sum[5m]) / rate(frappe_operator_reconciliation_duration_seconds_count[5m])",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "title": "Reconciliation Duration (avg)",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "id": 2,
      "options": {
        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" },
        "tooltip": { "mode": "single" }
      },
      "targets": [
        {
          "expr": "rate(frappe_operator_reconciliation_errors_total[5m])",
          "legendFormat": " - ",
          "refId": "A"
        }
      ],
      "title": "Reconciliation Errors",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 },
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "sum(frappe_operator_resource_total{resource_type=\"frappebench\"})",
          "refId": "A"
        }
      ],
      "title": "Total Benches",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 8 },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "sum(frappe_operator_resource_total{resource_type=\"frappesite\"})",
          "refId": "A"
        }
      ],
      "title": "Total Sites",
      "type": "stat"
    }
  ],
  "schemaVersion": 38,
  "tags": ["frappe", "operator", "kubernetes"],
  "templating": { "list": [] },
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "Frappe Operator",
  "uid": "frappe-operator",
  "version": 1,
  "weekStart": ""
}

Frappe Workload Dashboard

Monitor the Frappe application workloads:

{
  "title": "Frappe Workloads",
  "uid": "frappe-workloads",
  "panels": [
    {
      "title": "Gunicorn Response Time",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, rate(nginx_http_request_duration_seconds_bucket{upstream=~\".*gunicorn.*\"}[5m]))",
          "legendFormat": "p95"
        }
      ],
      "type": "timeseries"
    },
    {
      "title": "Worker Queue Depth",
      "targets": [
        {
          "expr": "frappe_worker_queue_length",
          "legendFormat": ""
        }
      ],
      "type": "timeseries"
    },
    {
      "title": "Database Connections",
      "targets": [
        {
          "expr": "mysql_global_status_threads_connected",
          "legendFormat": "connections"
        }
      ],
      "type": "timeseries"
    }
  ]
}

Alert Rules

A standalone PrometheusRule is available at docs/alert-rules.yaml. Apply with kubectl apply -f docs/alert-rules.yaml. Full content below:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frappe-operator-alerts
  namespace: frappe-operator-system
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: frappe-operator
    rules:
    # Operator is not running
    - alert: FrappeOperatorDown
      expr: absent(up{job="frappe-operator"}) == 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Frappe Operator is down"
        description: "Frappe Operator has been down for more than 5 minutes"

    # High reconciliation error rate
    - alert: FrappeOperatorHighErrorRate
      expr: |
        rate(frappe_operator_reconciliation_errors_total[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High reconciliation error rate"
        description: "Controller  has error rate of  errors/sec"

    # Slow reconciliation
    - alert: FrappeOperatorSlowReconciliation
      expr: |
        histogram_quantile(0.95, rate(frappe_operator_reconciliation_duration_seconds_bucket[5m])) > 60
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Slow reconciliation detected"
        description: "Controller  p95 reconciliation time is s"

    # Site not ready
    - alert: FrappeSiteNotReady
      expr: |
        kube_customresource_frappesite_status_phase{phase!="Ready"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "FrappeSite not ready"
        description: "Site  in namespace  is not ready"

    # Bench not ready
    - alert: FrappeBenchNotReady
      expr: |
        kube_customresource_frappebench_status_phase{phase!="Ready"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "FrappeBench not ready"
        description: "Bench  in namespace  is not ready"

  - name: frappe-workloads
    rules:
    # Gunicorn pod restarts
    - alert: FrappeGunicornRestarts
      expr: |
        increase(kube_pod_container_status_restarts_total{container="gunicorn"}[1h]) > 3
      labels:
        severity: warning
      annotations:
        summary: "Gunicorn container restarting"
        description: "Gunicorn container in pod  has restarted  times in the last hour"

    # Worker queue buildup
    - alert: FrappeWorkerQueueHigh
      expr: |
        frappe_worker_queue_length > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Worker queue is backing up"
        description: "Queue  has  pending jobs"

    # Database connections high
    - alert: FrappeDatabaseConnectionsHigh
      expr: |
        mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Database connections near limit"
        description: " of max connections in use"

Logging

Operator Logs

# Stream operator logs
kubectl logs -n frappe-operator-system -l control-plane=controller-manager -f

# Filter by log level
kubectl logs -n frappe-operator-system deployment/frappe-operator-controller-manager | grep -E 'level=(error|warn)'

# JSON structured logs (if enabled)
kubectl logs -n frappe-operator-system deployment/frappe-operator-controller-manager -o json

Configuring Log Verbosity

# In Helm values
controller:
  manager:
    args:
      - "--zap-log-level=debug"
      - "--zap-encoder=json"

Log Aggregation

Forward logs to your logging stack:

# Fluent Bit ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/frappe-*.log
        Parser            docker
        Tag               frappe.*
    
    [OUTPUT]
        Name              es
        Match             frappe.*
        Host              elasticsearch
        Port              9200
        Index             frappe-logs

Health Checks

Operator Health

The operator exposes health endpoints on port 8081:

/healthz - Liveness probe
/readyz - Readiness probe

# Port-forward to check health
kubectl port-forward -n frappe-operator-system svc/frappe-operator-controller-manager-metrics-service 8081:8081

# Check liveness
curl http://localhost:8081/healthz

# Check readiness
curl http://localhost:8081/readyz

Frappe Site Health

Check site health using the Frappe API:

# Get site URL
SITE_URL=$(kubectl get frappesite mysite -o jsonpath='{.status.siteURL}')

# Check API
curl -s "$SITE_URL/api/method/ping"

Kubernetes Health Probes

The operator configures health probes for Frappe pods:

# Gunicorn container probes
livenessProbe:
  httpGet:
    path: /api/method/ping
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  
readinessProbe:
  httpGet:
    path: /api/method/ping
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5