frappe-operator

Operations Guide

Day-2 operations, maintenance, and best practices for running Frappe Operator in production.

Table of Contents

1. Production Deployment
2. Monitoring and Observability
3. Backup and Restore
4. Scaling
5. Updates and Upgrades
6. Security
7. Database Management
8. Disaster Recovery
9. Best Practices
10. Maintenance Windows
11. Next Steps

Production Deployment

Pre-Production Checklist

Before deploying to production, ensure you have:

Resource Planning

Bench Resources

Estimate resources based on expected load:

Small (< 50 users):

Medium (50-200 users):

Large (200+ users):

Database Resources

Shared Database:

Dedicated Database (per site):

MariaDB Operator Setup

For production, use MariaDB Operator for managed databases:

# Install MariaDB Operator
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/mariadb-operator.yaml
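
Before creating databases, confirm the operator came up cleanly. The namespace below assumes the upstream manifests' default install location; adjust `-n` if you installed elsewhere:

```shell
# The MariaDB CRD should be registered
kubectl get crd mariadbs.k8s.mariadb.com

# Wait for the operator deployment to become available
kubectl wait --for=condition=Available deployment/mariadb-operator \
  -n default --timeout=120s
```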

Create a shared MariaDB instance:

apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: shared-mariadb
  namespace: databases
spec:
  rootPasswordSecretKeyRef:
    name: mariadb-root
    key: password
  
  database: frappe
  username: frappe
  passwordSecretKeyRef:
    name: mariadb-frappe
    key: password
  
  storage:
    size: 500Gi
    storageClassName: fast-ssd
  
  replicas: 3
  galera:
    enabled: true
  
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      cpu: 4
      memory: 16Gi
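
Keep in mind that Galera keeps a full copy of the dataset on every node, so the manifest above provisions size × replicas of raw storage:

```shell
# Total raw storage consumed by the 3-node Galera cluster above
REPLICAS=3
SIZE_GI=500
echo "Total raw storage: $((REPLICAS * SIZE_GI))Gi"   # 1500Gi
```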

Namespace Strategy

Organize resources by environment:

# Create namespaces
kubectl create namespace frappe-operator-system
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace development
kubectl create namespace databases

# Apply resource quotas
kubectl create quota production-quota \
  --hard=requests.cpu=50,requests.memory=100Gi,pods=100 \
  -n production
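
A quick sanity check on the quota numbers: 50 CPUs and 100Gi spread over at most 100 pods leaves, on average, 500m CPU and roughly 1Gi of memory per pod:

```shell
# Average per-pod headroom implied by the quota
QUOTA_CPU=50        # requests.cpu
QUOTA_MEM_GI=100    # requests.memory, in Gi
MAX_PODS=100

echo "Avg CPU per pod:    $((QUOTA_CPU * 1000 / MAX_PODS))m"    # 500m
echo "Avg memory per pod: $((QUOTA_MEM_GI * 1024 / MAX_PODS))Mi" # 1024Mi
```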

Monitoring and Observability

Prometheus Integration

The operator exposes metrics for Prometheus scraping.

ServiceMonitor for Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frappe-operator
  namespace: frappe-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true

PodMonitor for Frappe Components:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: frappe-benches
  namespace: production
spec:
  selector:
    matchLabels:
      app: frappe
  podMetricsEndpoints:
  - port: metrics
    path: /metrics

Key Metrics to Monitor

Application Metrics:

Resource Metrics:

Business Metrics:

Logging

Configure centralized logging:

# Fluent Bit DaemonSet
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/frappe-*.log
        Parser            docker
        Tag               frappe.*
    
    [OUTPUT]
        Name              es
        Match             frappe.*
        Host              elasticsearch
        Port              9200
        Index             frappe-logs

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frappe-alerts
  namespace: production
spec:
  groups:
  - name: frappe
    interval: 30s
    rules:
    - alert: FrappeSiteDown
      expr: up{job="frappe-site"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Frappe site  is down"
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on "
    
    - alert: HighMemoryUsage
      expr: container_memory_usage_bytes{pod=~".*-gunicorn.*"} / container_spec_memory_limit_bytes > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on "

Backup and Restore

Automated Backups with SiteBackup

Create a SiteBackup resource:

apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
  name: daily-backup
  namespace: production
spec:
  siteRef:
    name: prod-site
  
  # Daily backup at 2 AM
  schedule: "0 2 * * *"
  
  retention:
    days: 30
    count: 90
  
  destination:
    type: s3
    config:
      bucket: frappe-backups
      region: us-east-1
      credentialsSecret: aws-s3-credentials

Updating Scheduled Backups

You can update a scheduled backup at any time by modifying the SiteBackup resource. The Frappe Operator will automatically detect changes and update the backup schedule and configuration.

For example, to change the schedule of the daily-backup from 2 AM to 3 AM, you can use kubectl patch:

kubectl patch sitebackup daily-backup -n production --type=merge -p '{
  "spec": {
    "schedule": "0 3 * * *"
  }
}'

You can also modify other parameters, such as the retention policy or backup destination, and the operator will apply them to the scheduled job.
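
For example, extending retention to 60 days uses the same merge-patch pattern as the schedule change:

```shell
kubectl patch sitebackup daily-backup -n production --type=merge -p '{
  "spec": {
    "retention": {
      "days": 60
    }
  }
}'
```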

Manual Backup

# Create manual backup
kubectl create -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
  name: manual-backup-$(date +%Y%m%d-%H%M%S)
  namespace: production
spec:
  siteRef:
    name: prod-site
  jobType: backup
  jobConfig:
    withFiles: true
    compress: true
EOF

# Check backup status
kubectl get sitejob -n production

Restore from Backup

# Restore the database and site files from backup.
# bench restore takes the SQL backup as its positional argument;
# file archives are passed via --with-private-files / --with-public-files.
kubectl exec -it <site-pod> -- bench --site <site-name> \
  restore /path/to/backup.sql.gz \
  --mariadb-root-password <password> \
  --with-private-files /path/to/private-files.tar.gz \
  --with-public-files /path/to/public-files.tar.gz

Scaling

Component Autoscaling (Unified API)

Frappe Operator provides a provider-agnostic autoscaling mechanism for all bench components (nginx, gunicorn, workers, etc.). You can choose between HPA (Horizontal Pod Autoscaler) and KEDA (Kubernetes Event-Driven Autoscaling).

Managed Scaling

Instead of manual kubectl scale commands, define your scaling policy in the FrappeBench CRD:

spec:
  componentAutoscaling:
    gunicorn:
      enabled: true
      provider: hpa      # Use standard HPA for web traffic
      minReplicas: 3
      maxReplicas: 10
      hpa:
        metric: cpu
        targetUtilization: 70
    
    worker-short:
      enabled: true
      provider: keda     # Use KEDA for queue-based scaling
      minReplicas: 0     # Scale to zero when idle!
      maxReplicas: 15
      keda:
        trigger: redis
        targetValue: "5"

The operator will automatically manage the underlying HorizontalPodAutoscaler or ScaledObject resources for you.
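
You can confirm which objects the operator generated for each component:

```shell
# HPA-backed components appear as HorizontalPodAutoscalers,
# KEDA-backed ones as ScaledObjects (requires KEDA installed)
kubectl get hpa -n production
kubectl get scaledobject -n production
```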

Worker Autoscaling with KEDA

For background workers, KEDA is the recommended provider as it can scale based on the number of jobs waiting in the Redis queue and supports scaling to zero.

Configure Worker Autoscaling

spec:
  componentAutoscaling:
    # Short-running tasks - scale to zero when idle
    worker-short:
      enabled: true
      provider: keda
      minReplicas: 0        
      maxReplicas: 10       
      keda:
        trigger: redis      # Triggered by Redis queue length
        targetValue: "2"    # Target 2 jobs per worker
    
    # Long-running tasks - maintain minimum workers
    worker-long:
      enabled: true
      provider: keda
      minReplicas: 1        
      maxReplicas: 5
      keda:
        trigger: redis
        targetValue: "1"

Monitor Autoscaling Status

The operator reports the calculated scaling status in the FrappeBench status:

kubectl get frappebench prod-bench -o jsonpath='{.status.componentScaling}' | jq

Site reconciliation concurrency (100+ sites)

When you run hundreds of Frappe sites, the operator's default reconcile concurrency can make convergence slow. To speed it up (e.g. after a restart or when creating many sites), increase the number of concurrent site reconciles.

Operator config (recommended) – set in the frappe-operator-config ConfigMap (or via Helm operatorConfig.maxConcurrentSiteReconciles):

# In frappe-operator-config ConfigMap
data:
  maxConcurrentSiteReconciles: "10"   # default; increase for 100+ sites

When using Helm, the value is passed to the operator via the FRAPPE_MAX_CONCURRENT_SITE_RECONCILES environment variable. Changing the ConfigMap requires an operator restart to take effect.
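
A restart can be triggered with a standard rollout; the deployment name below matches the default install:

```shell
# Restart the operator so it re-reads the ConfigMap
kubectl rollout restart deployment/frappe-operator-controller-manager \
  -n frappe-operator-system
kubectl rollout status deployment/frappe-operator-controller-manager \
  -n frappe-operator-system --timeout=120s
```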

Per-bench override – optional hint on FrappeBench:

spec:
  siteReconcileConcurrency: 20   # operator uses max(operatorConfig, max across benches)

The operator uses max(operator config value, max of all benches’ siteReconcileConcurrency) at startup. Tune down if you hit API or database rate limits.

Vertical Scaling

Update resource limits:

kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "componentResources": {
      "gunicorn": {
        "requests": {"cpu": "2", "memory": "4Gi"},
        "limits": {"cpu": "4", "memory": "8Gi"}
      }
    }
  }
}'
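
After the patch rolls out, confirm the pods picked up the new limits. The app=frappe label matches the PodMonitor selector earlier; adjust it if your bench pods are labeled differently:

```shell
kubectl get pods -n production -l app=frappe \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits}{"\n"}{end}'
```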

Updates and Upgrades

Updating Frappe Version

# Update bench version
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "frappeVersion": "version-15",
    "imageConfig": {
      "tag": "v15.1.0"
    }
  }
}'

# Monitor rollout
kubectl rollout status deployment/prod-bench-gunicorn -n production

App Updates

# Update apps
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "appsJSON": "[\"erpnext\", \"hrms\", \"custom_app@v2.0.0\"]"
  }
}'

Site Migration

Run migrations after updates:

apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
  name: migrate-prod-site
  namespace: production
spec:
  siteRef:
    name: prod-site
  jobType: migrate
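
Saving the manifest above as, say, migrate-prod-site.yaml, apply it and watch the job run to completion before routing traffic back:

```shell
kubectl apply -f migrate-prod-site.yaml
kubectl get sitejob migrate-prod-site -n production -w

# Inspect logs if the migration stalls; the job-name label is set by
# Kubernetes on Job pods, assuming the operator runs this as a Job
kubectl logs -n production -l job-name=migrate-prod-site --tail=100
```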

Operator Upgrade

# Upgrade operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/v1.1.0/config/install.yaml

# Verify upgrade
kubectl get deployment -n frappe-operator-system

Security

Security Context Configuration

The operator provides flexible security context configuration with three levels of priority:

Default OpenShift Compatibility

Out of the box, the operator uses OpenShift-compatible defaults:

No configuration needed for OpenShift deployments!

Cluster-Wide Custom Defaults

Set environment variables in the operator deployment for organization-wide policies:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frappe-operator-controller-manager
  namespace: frappe-operator-system
spec:
  template:
    spec:
      containers:
      - name: manager
        image: vyogo.tech/frappe-operator:latest
        env:
        - name: FRAPPE_DEFAULT_UID
          value: "2000"        # All benches default to UID 2000
        - name: FRAPPE_DEFAULT_GID
          value: "2000"
        - name: FRAPPE_DEFAULT_FSGROUP
          value: "2000"

Per-Bench Security Override

Override security context for specific benches:

apiVersion: vyogo.tech/v1alpha1
kind: FrappeBench
metadata:
  name: compliance-bench
spec:
  security:
    podSecurityContext:
      runAsUser: 5000      # Custom UID for compliance
      runAsGroup: 5000
      fsGroup: 5000
      seccompProfile:
        type: RuntimeDefault
    securityContext:
      runAsUser: 5000
      runAsGroup: 5000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: false
  apps:
    - name: erpnext

Priority: spec.security → Environment Variables → Hardcoded Defaults (1001/0/0)

See SECURITY_CONTEXT_FIX.md for detailed configuration examples.

Network Policies

Restrict network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frappe-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frappe
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: databases
    ports:
    - protocol: TCP
      port: 3306
  # Allow DNS lookups; once a policy selects a pod, all other egress
  # (including DNS) is denied by default
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Pod Security Standards

The operator is compatible with the restricted Pod Security Standard:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

All managed pods comply with:

Secrets Management

Use external secrets operator:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: production
spec:
  provider:
    vault:
      server: "https://vault.example.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "frappe-operator"

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prod-site-admin-password
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: prod-site-admin-password
  data:
  - secretKey: password
    remoteRef:
      key: prod-site
      property: admin_password
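
Verify the secret is syncing before referencing it from a FrappeSite:

```shell
# STATUS should report SecretSynced once Vault is reachable
kubectl get externalsecret prod-site-admin-password -n production

# The target Kubernetes Secret is created by the sync
kubectl get secret prod-site-admin-password -n production
```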

Database Management

Connection Pooling

Configure ProxySQL for connection pooling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxysql
  namespace: databases
spec:
  replicas: 2
  selector:
    matchLabels:
      app: proxysql
  template:
    metadata:
      labels:
        app: proxysql
    spec:
      containers:
      - name: proxysql
        image: proxysql/proxysql:2.5
        ports:
        - containerPort: 6033
        - containerPort: 6032

Database Maintenance

# Optimize tables
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "OPTIMIZE TABLE <database>.<table>;"

# Check database size
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "SELECT table_schema AS 'Database', 
  ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)' 
  FROM information_schema.tables GROUP BY table_schema;"

Disaster Recovery

Backup Strategy

  1. Database Backups: Daily full backups, hourly incrementals
  2. File Backups: Daily backups of site files
  3. Configuration Backups: Version control for manifests
  4. Off-site Replication: Store backups in different region
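
For off-site replication, a periodic cross-region sync of the backup bucket is often enough. The destination bucket and regions here are illustrative:

```shell
# Mirror the primary backup bucket to a second region (run from a CronJob)
aws s3 sync s3://frappe-backups s3://frappe-backups-dr \
  --source-region us-east-1 --region us-west-2
```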

Recovery Procedures

Complete Cluster Failure

# 1. Restore database from backup
kubectl apply -f mariadb-restore.yaml

# 2. Recreate operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/main/config/install.yaml

# 3. Recreate bench
kubectl apply -f bench.yaml

# 4. Recreate sites
kubectl apply -f sites/

# 5. Restore site data
kubectl exec -it <pod> -- bench --site <site> restore <backup>

Site Recovery

# Create new site from backup
kubectl apply -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: FrappeSite
metadata:
  name: recovered-site
spec:
  benchRef:
    name: prod-bench
  siteName: "site.example.com"
  restoreFrom:
    backup: "s3://bucket/backup.sql.gz"
    files: "s3://bucket/files.tar.gz"
EOF

Best Practices

1. Use GitOps

Store all manifests in Git and use tools like ArgoCD or Flux:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frappe-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/frappe-k8s
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

2. Resource Quotas

Set limits per namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    persistentvolumeclaims: "20"
    pods: "100"

3. Pod Disruption Budgets

Ensure availability during maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-bench-gunicorn-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: prod-bench-gunicorn
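
Check the budget's status against running pods; with three gunicorn replicas and minAvailable: 2, one voluntary disruption is allowed at a time:

```shell
# ALLOWED DISRUPTIONS should be >= 1 during normal operation
kubectl get pdb prod-bench-gunicorn-pdb -n production
```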

4. Health Checks

Configure liveness and readiness probes (handled by operator).

5. Regular Testing


Maintenance Windows

Planning Maintenance

  1. Schedule: Off-peak hours
  2. Communication: Notify users in advance
  3. Backups: Take fresh backups before maintenance
  4. Rollback Plan: Prepare rollback procedures
  5. Monitoring: Extra vigilance during and after

Performing Maintenance

# 1. Drain traffic (if using multiple sites)
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": false}}
}'

# 2. Perform maintenance
kubectl apply -f updated-manifests.yaml

# 3. Verify
kubectl get pods -n production
kubectl logs -l app=<component>

# 4. Re-enable traffic
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": true}}
}'

Next Steps