Operations Guide
Day-2 operations, maintenance, and best practices for running Frappe Operator in production.
Table of Contents
- Production Deployment
- Monitoring and Observability
- Backup and Restore
- Scaling
- Updates and Upgrades
- Security
- Database Management
- Disaster Recovery
Production Deployment
Pre-Production Checklist
Before deploying to production, ensure you have:
- Kubernetes cluster with adequate resources
- Persistent storage configured (StorageClass)
- Ingress controller installed and configured
- cert-manager for TLS certificates (optional)
- MariaDB/MySQL database available
- Monitoring stack (Prometheus + Grafana)
- Backup solution in place
- DNS configured for your domains
- Resource limits defined
- Secrets management strategy
Resource Planning
Bench Resources
Estimate resources based on expected load:
Small (< 50 users):
- Gunicorn: 2 replicas, 500m CPU, 1Gi RAM each
- Workers: 1-2 replicas, 250m CPU, 512Mi RAM each
- Redis: 1Gi RAM
Medium (50-200 users):
- Gunicorn: 3-5 replicas, 1 CPU, 2Gi RAM each
- Workers: 2-3 replicas, 500m CPU, 1Gi RAM each
- Redis: 4Gi RAM
Large (200+ users):
- Gunicorn: 5-10+ replicas, 2 CPU, 4Gi RAM each
- Workers: 5+ replicas, 1 CPU, 2Gi RAM each
- Redis: 8Gi+ RAM
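As a sketch, the Small tier above maps onto the FrappeBench `componentResources` field (used later under Vertical Scaling) roughly like this; the `worker-short` component name is an assumption based on the autoscaling examples in this guide:

```yaml
spec:
  componentResources:
    gunicorn:
      requests: {cpu: 500m, memory: 1Gi}
      limits: {cpu: "1", memory: 2Gi}
    worker-short:
      requests: {cpu: 250m, memory: 512Mi}
      limits: {cpu: 500m, memory: 1Gi}
```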
Database Resources
Shared Database:
- 2-4 CPU cores
- 4-8Gi RAM
- 100Gi+ storage
Dedicated Database (per site):
- 1-2 CPU cores
- 2-4Gi RAM
- 50-200Gi storage
MariaDB Operator Setup
For production, use MariaDB Operator for managed databases:
```shell
# Install MariaDB Operator
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/mariadb-operator.yaml
```
Create a shared MariaDB instance:
```yaml
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: shared-mariadb
  namespace: databases
spec:
  rootPasswordSecretKeyRef:
    name: mariadb-root
    key: password
  database: frappe
  username: frappe
  passwordSecretKeyRef:
    name: mariadb-frappe
    key: password
  storage:
    size: 500Gi
    storageClassName: fast-ssd
  replicas: 3
  galera:
    enabled: true
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      cpu: 4
      memory: 16Gi
```
Namespace Strategy
Organize resources by environment:
```shell
# Create namespaces
kubectl create namespace frappe-operator-system
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace development
kubectl create namespace databases

# Apply resource quotas
kubectl create quota production-quota \
  --hard=requests.cpu=50,requests.memory=100Gi,pods=100 \
  -n production
```
Monitoring and Observability
Prometheus Integration
The operator exposes metrics for Prometheus scraping.
ServiceMonitor for Operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frappe-operator
  namespace: frappe-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
```
PodMonitor for Frappe Components:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: frappe-benches
  namespace: production
spec:
  selector:
    matchLabels:
      app: frappe
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```
Key Metrics to Monitor
Application Metrics:
- Request rate and latency
- Error rate (4xx, 5xx)
- Queue length (Redis)
- Worker job processing time
- Database connections
Resource Metrics:
- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
Business Metrics:
- Active users
- Concurrent sessions
- Background job completion rate
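The application metrics above can be tracked with PromQL queries along these lines. The metric names are assumptions that depend on the exporters you deploy (only `http_requests_total` and the cAdvisor container metrics appear in this guide's alert rules); adjust label selectors to your environment:

```promql
# Request rate per pod (assumes an exporter exposing http_requests_total)
sum(rate(http_requests_total{namespace="production"}[5m])) by (pod)

# 5xx error ratio across all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Memory usage as a fraction of the limit for gunicorn pods (cAdvisor metrics)
container_memory_usage_bytes{pod=~".*-gunicorn.*"}
  / container_spec_memory_limit_bytes{pod=~".*-gunicorn.*"}
```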
Logging
Configure centralized logging:
```yaml
# Fluent Bit configuration (mounted by a Fluent Bit DaemonSet)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/frappe-*.log
        Parser  docker
        Tag     frappe.*

    [OUTPUT]
        Name    es
        Match   frappe.*
        Host    elasticsearch
        Port    9200
        Index   frappe-logs
```
Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frappe-alerts
  namespace: production
spec:
  groups:
    - name: frappe
      interval: 30s
      rules:
        - alert: FrappeSiteDown
          expr: up{job="frappe-site"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Frappe site is down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on {{ $labels.instance }}"
        - alert: HighMemoryUsage
          expr: container_memory_usage_bytes{pod=~".*-gunicorn.*"} / container_spec_memory_limit_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
```
Backup and Restore
Automated Backups with SiteBackup
Create a SiteBackup resource:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
  name: daily-backup
  namespace: production
spec:
  siteRef:
    name: prod-site
  # Daily backup at 2 AM
  schedule: "0 2 * * *"
  retention:
    days: 30
    count: 90
  destination:
    type: s3
    config:
      bucket: frappe-backups
      region: us-east-1
      credentialsSecret: aws-s3-credentials
```
Updating Scheduled Backups
You can update a scheduled backup at any time by modifying the SiteBackup resource. The Frappe Operator will automatically detect changes and update the backup schedule and configuration.
For example, to change the schedule of the daily-backup from 2 AM to 3 AM, you can use kubectl patch:
```shell
kubectl patch sitebackup daily-backup -n production --type=merge -p '{
  "spec": {
    "schedule": "0 3 * * *"
  }
}'
```
You can also modify other parameters, such as the retention policy or backup destination, and the operator will apply them to the scheduled job.
Manual Backup
# Create manual backup
kubectl create -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
name: manual-backup-$(date +%Y%m%d-%H%M%S)
namespace: production
spec:
siteRef:
name: prod-site
jobType: backup
jobConfig:
withFiles: true
compress: true
EOF
# Check backup status
kubectl get sitejob -n production
Restore from Backup
```shell
# Restore the database
kubectl exec -it <site-pod> -- bench --site <site-name> \
  restore --mariadb-root-password <password> /path/to/backup.sql.gz

# Restore files
kubectl exec -it <site-pod> -- bench --site <site-name> \
  restore --with-private-files /path/to/files-backup.tar.gz
```
Scaling
Component Autoscaling (Unified API)
Frappe Operator provides a provider-agnostic autoscaling mechanism for all bench components (nginx, gunicorn, workers, etc.). You can choose between HPA (Horizontal Pod Autoscaler) and KEDA (Kubernetes Event-Driven Autoscaling).
Managed Scaling
Instead of manual kubectl scale commands, define your scaling policy in the FrappeBench CRD:
```yaml
spec:
  componentAutoscaling:
    gunicorn:
      enabled: true
      provider: hpa       # Use standard HPA for web traffic
      minReplicas: 3
      maxReplicas: 10
      hpa:
        metric: cpu
        targetUtilization: 70
    worker-short:
      enabled: true
      provider: keda      # Use KEDA for queue-based scaling
      minReplicas: 0      # Scale to zero when idle!
      maxReplicas: 15
      keda:
        trigger: redis
        targetValue: "5"
```
The operator will automatically manage the underlying HorizontalPodAutoscaler or ScaledObject resources for you.
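For reference, the gunicorn policy above corresponds to a HorizontalPodAutoscaler along these lines. This is a sketch of the kind of object the operator manages on your behalf; the generated object's name and target Deployment are assumptions based on the naming used elsewhere in this guide:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prod-bench-gunicorn
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prod-bench-gunicorn
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```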
Worker Autoscaling with KEDA
For background workers, KEDA is the recommended provider as it can scale based on the number of jobs waiting in the Redis queue and supports scaling to zero.
Configure Worker Autoscaling
```yaml
spec:
  componentAutoscaling:
    # Short-running tasks - scale to zero when idle
    worker-short:
      enabled: true
      provider: keda
      minReplicas: 0
      maxReplicas: 10
      keda:
        trigger: redis    # Triggered by Redis queue length
        targetValue: "2"  # Target 2 jobs per worker
    # Long-running tasks - maintain minimum workers
    worker-long:
      enabled: true
      provider: keda
      minReplicas: 1
      maxReplicas: 5
      keda:
        trigger: redis
        targetValue: "1"
```
Monitor Autoscaling Status
The operator reports the calculated scaling status in the FrappeBench status:
```shell
kubectl get frappebench prod-bench -o jsonpath='{.status.componentScaling}' | jq
```
Site Reconciliation Concurrency (100+ Sites)
When you run hundreds of Frappe sites, the operator limits how many sites it reconciles concurrently. To speed up convergence (for example after a restart, or when creating many sites at once), increase the number of concurrent site reconciles.
Operator config (recommended): set `maxConcurrentSiteReconciles` in the frappe-operator-config ConfigMap (or via the Helm value operatorConfig.maxConcurrentSiteReconciles):

```yaml
# In the frappe-operator-config ConfigMap
data:
  maxConcurrentSiteReconciles: "10"  # default; increase for 100+ sites
```

When using Helm, the value is passed to the operator via the FRAPPE_MAX_CONCURRENT_SITE_RECONCILES environment variable. Changing the ConfigMap requires an operator restart to take effect.

Per-bench override (an optional hint on FrappeBench):

```yaml
spec:
  siteReconcileConcurrency: 20  # operator uses max(operatorConfig, max across benches)
```

At startup, the operator uses max(operator config value, highest siteReconcileConcurrency across all benches). Tune the value down if you hit API-server or database rate limits.
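A complete manifest for the concurrency setting might look like the following. The ConfigMap name and key come from above; the namespace is an assumption (use the namespace your operator runs in), and remember that an operator restart is required for the change to take effect:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: frappe-operator-config
  namespace: frappe-operator-system  # assumption: adjust to your install
data:
  maxConcurrentSiteReconciles: "25"
```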
Vertical Scaling
Update resource limits:
```shell
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "componentResources": {
      "gunicorn": {
        "requests": {"cpu": "2", "memory": "4Gi"},
        "limits": {"cpu": "4", "memory": "8Gi"}
      }
    }
  }
}'
```
Updates and Upgrades
Updating Frappe Version
```shell
# Update the bench version
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "frappeVersion": "version-15",
    "imageConfig": {
      "tag": "v15.1.0"
    }
  }
}'

# Monitor the rollout
kubectl rollout status deployment/prod-bench-gunicorn -n production
```
App Updates
```shell
# Update apps
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "appsJSON": "[\"erpnext\", \"hrms\", \"custom_app@v2.0.0\"]"
  }
}'
```
Site Migration
Run migrations after updates:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
  name: migrate-prod-site
  namespace: production
spec:
  siteRef:
    name: prod-site
  jobType: migrate
```
Operator Upgrade
```shell
# Upgrade the operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/v1.1.0/config/install.yaml

# Verify the upgrade
kubectl get deployment -n frappe-operator-system
```
Security
Security Context Configuration
The operator provides flexible security context configuration with three levels of priority:
Default OpenShift Compatibility
Out of the box, the operator uses OpenShift-compatible defaults:
- `runAsUser: 1001` (OpenShift arbitrary UID standard)
- `runAsGroup: 0` (root group for OpenShift compatibility)
- `fsGroup: 0` (root group for filesystem permissions)
No configuration needed for OpenShift deployments!
Cluster-Wide Custom Defaults
Set environment variables in the operator deployment for organization-wide policies:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frappe-operator-controller-manager
  namespace: frappe-operator-system
spec:
  template:
    spec:
      containers:
        - name: manager
          image: vyogo.tech/frappe-operator:latest
          env:
            - name: FRAPPE_DEFAULT_UID
              value: "2000"  # All benches default to UID 2000
            - name: FRAPPE_DEFAULT_GID
              value: "2000"
            - name: FRAPPE_DEFAULT_FSGROUP
              value: "2000"
```
Per-Bench Security Override
Override security context for specific benches:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: FrappeBench
metadata:
  name: compliance-bench
spec:
  security:
    podSecurityContext:
      runAsUser: 5000   # Custom UID for compliance
      runAsGroup: 5000
      fsGroup: 5000
      seccompProfile:
        type: RuntimeDefault
    securityContext:
      runAsUser: 5000
      runAsGroup: 5000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: false
  apps:
    - name: erpnext
```
Priority: `spec.security` → environment variables → hardcoded defaults (1001/0/0)
See SECURITY_CONTEXT_FIX.md for detailed configuration examples.
Network Policies
Restrict network access:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frappe-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frappe
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: databases
      ports:
        - protocol: TCP
          port: 3306
```
Pod Security Standards
The operator is compatible with the restricted Pod Security Standard:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
All managed pods comply with:
- Non-root user execution
- Privilege escalation prevented
- All capabilities dropped
- Seccomp runtime default profile
- OpenShift restricted SCC compatible
Secrets Management
Use external secrets operator:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: production
spec:
  provider:
    vault:
      server: "https://vault.example.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "frappe-operator"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prod-site-admin-password
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: prod-site-admin-password
  data:
    - secretKey: password
      remoteRef:
        key: prod-site
        property: admin_password
```
Database Management
Connection Pooling
Configure ProxySQL for connection pooling:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxysql
  namespace: databases
spec:
  replicas: 2
  selector:
    matchLabels:
      app: proxysql
  template:
    metadata:
      labels:
        app: proxysql
    spec:
      containers:
        - name: proxysql
          image: proxysql/proxysql:2.5
          ports:
            - containerPort: 6033  # MySQL client traffic
            - containerPort: 6032  # admin interface
```
Database Maintenance
```shell
# Optimize tables
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "OPTIMIZE TABLE <database>.<table>;"

# Check database sizes
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "SELECT table_schema AS 'Database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
    FROM information_schema.tables GROUP BY table_schema;"
```
Disaster Recovery
Backup Strategy
- Database Backups: Daily full backups, hourly incrementals
- File Backups: Daily backups of site files
- Configuration Backups: Version control for manifests
- Off-site Replication: Store backups in different region
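Off-site replication can be expressed as a second SiteBackup targeting a bucket in another region. This sketch reuses only fields shown earlier in this guide; the bucket, region, and secret names are placeholders:

```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
  name: daily-backup-offsite
  namespace: production
spec:
  siteRef:
    name: prod-site
  schedule: "0 4 * * *"   # after the primary backup window
  retention:
    days: 90
  destination:
    type: s3
    config:
      bucket: frappe-backups-dr   # placeholder: bucket in a different region
      region: eu-west-1
      credentialsSecret: aws-s3-credentials-dr
```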
Recovery Procedures
Complete Cluster Failure
```shell
# 1. Restore the database from backup
kubectl apply -f mariadb-restore.yaml

# 2. Recreate the operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/main/config/install.yaml

# 3. Recreate the bench
kubectl apply -f bench.yaml

# 4. Recreate the sites
kubectl apply -f sites/

# 5. Restore site data
kubectl exec -it <pod> -- bench --site <site> restore <backup>
```
Site Recovery
```shell
# Create a new site from backup
kubectl apply -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: FrappeSite
metadata:
  name: recovered-site
spec:
  benchRef:
    name: prod-bench
  siteName: "site.example.com"
  restoreFrom:
    backup: "s3://bucket/backup.sql.gz"
    files: "s3://bucket/files.tar.gz"
EOF
```
Best Practices
1. Use GitOps
Store all manifests in Git and use tools like ArgoCD or Flux:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frappe-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/frappe-k8s
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
2. Resource Quotas
Set limits per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    persistentvolumeclaims: "20"
    pods: "100"
```
3. Pod Disruption Budgets
Ensure availability during maintenance:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-bench-gunicorn-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: prod-bench-gunicorn
```
4. Health Checks
Configure liveness and readiness probes (handled by operator).
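For reference, a typical probe pair for a gunicorn container looks like the following. This is an illustration only, not the operator's exact configuration; the probe path and port are assumptions (port 8000 matches the ingress port used in the Network Policies section, and `/api/method/ping` is a standard Frappe endpoint):

```yaml
livenessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/method/ping
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```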
5. Regular Testing
- Test disaster recovery procedures quarterly
- Validate backups monthly
- Performance test before major updates
- Security audits semi-annually
Maintenance Windows
Planning Maintenance
- Schedule: Off-peak hours
- Communication: Notify users in advance
- Backups: Take fresh backups before maintenance
- Rollback Plan: Prepare rollback procedures
- Monitoring: Extra vigilance during and after
Performing Maintenance
```shell
# 1. Drain traffic (if using multiple sites)
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": false}}
}'

# 2. Perform maintenance
kubectl apply -f updated-manifests.yaml

# 3. Verify
kubectl get pods -n production
kubectl logs -l app=<component>

# 4. Re-enable traffic
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": true}}
}'
```
Next Steps
- Troubleshooting - Debugging and problem resolution
- API Reference - Complete specification
- Examples - Configuration examples