Day-2 operations, maintenance, and best practices for running Frappe Operator in production.
Before deploying to production, ensure you have:
Estimate resources based on expected load:
Small (< 50 users):
Medium (50-200 users):
Large (200+ users):
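Actual sizing depends heavily on your apps and traffic patterns. As a rough illustration only, a small deployment might pin per-component resources via the FrappeBench `componentResources` field (the same field used in the scaling examples later in this guide); all values below are assumptions, not official sizing guidance:

```yaml
# Illustrative sizing for a small (< 50 users) bench; tune for your workload
spec:
  componentResources:
    gunicorn:
      requests: {cpu: 500m, memory: 1Gi}
      limits: {cpu: "1", memory: 2Gi}
    worker-short:
      requests: {cpu: 250m, memory: 512Mi}
      limits: {cpu: 500m, memory: 1Gi}
```

Start small and let the autoscaling configuration described later absorb load spikes rather than over-provisioning fixed replicas.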
Shared Database:
Dedicated Database (per site):
For production, use the MariaDB Operator for managed databases:
# Install MariaDB Operator
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/mariadb-operator.yaml
Create a shared MariaDB instance:
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
name: shared-mariadb
namespace: databases
spec:
rootPasswordSecretKeyRef:
name: mariadb-root
key: password
database: frappe
username: frappe
passwordSecretKeyRef:
name: mariadb-frappe
key: password
storage:
size: 500Gi
storageClassName: fast-ssd
replicas: 3
galera:
enabled: true
resources:
requests:
cpu: 2
memory: 8Gi
limits:
cpu: 4
memory: 16Gi
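The manifest above references two Secrets (`mariadb-root` and `mariadb-frappe`) that must exist before the MariaDB instance is created. A minimal sketch, assuming the `password` key names from the spec above (replace the placeholder values with real credentials or manage them via your secrets tooling):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mariadb-root
  namespace: databases
stringData:
  password: "change-me-root"   # placeholder; do not commit real passwords
---
apiVersion: v1
kind: Secret
metadata:
  name: mariadb-frappe
  namespace: databases
stringData:
  password: "change-me-frappe" # placeholder
```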
Organize resources by environment:
# Create namespaces
kubectl create namespace frappe-operator-system
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace development
kubectl create namespace databases
# Apply resource quotas
kubectl create quota production-quota \
--hard=requests.cpu=50,requests.memory=100Gi,pods=100 \
-n production
The operator exposes metrics for Prometheus scraping.
ServiceMonitor for Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: frappe-operator
namespace: frappe-operator-system
spec:
selector:
matchLabels:
control-plane: controller-manager
endpoints:
- port: https
scheme: https
tlsConfig:
insecureSkipVerify: true
PodMonitor for Frappe Components:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: frappe-benches
namespace: production
spec:
selector:
matchLabels:
app: frappe
podMetricsEndpoints:
- port: metrics
path: /metrics
Application Metrics:
Resource Metrics:
Business Metrics:
Configure centralized logging:
# Fluent Bit DaemonSet
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/containers/frappe-*.log
Parser docker
Tag frappe.*
[OUTPUT]
Name es
Match frappe.*
Host elasticsearch
Port 9200
Index frappe-logs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: frappe-alerts
namespace: production
spec:
groups:
- name: frappe
interval: 30s
rules:
- alert: FrappeSiteDown
expr: up{job="frappe-site"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Frappe site is down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: container_memory_usage_bytes{pod=~".*-gunicorn.*"} / container_spec_memory_limit_bytes > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.pod }}"
Create a SiteBackup resource:
apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
name: daily-backup
namespace: production
spec:
siteRef:
name: prod-site
# Daily backup at 2 AM
schedule: "0 2 * * *"
retention:
days: 30
count: 90
destination:
type: s3
config:
bucket: frappe-backups
region: us-east-1
credentialsSecret: aws-s3-credentials
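The `credentialsSecret` must exist in the same namespace before the backup runs. A sketch of such a Secret, assuming conventional AWS key names (the exact keys expected by the operator are an assumption here; verify against the operator's reference documentation):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-s3-credentials
  namespace: production
stringData:
  AWS_ACCESS_KEY_ID: "AKIA..."          # placeholder, assumed key name
  AWS_SECRET_ACCESS_KEY: "change-me"    # placeholder, assumed key name
```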
You can update a scheduled backup at any time by modifying the SiteBackup resource. The Frappe Operator will automatically detect changes and update the backup schedule and configuration.
For example, to change the schedule of the daily-backup from 2 AM to 3 AM, you can use kubectl patch:
kubectl patch sitebackup daily-backup -n production --type=merge -p '{
"spec": {
"schedule": "0 3 * * *"
}
}'
You can also modify other parameters, such as the retention policy or backup destination, and the operator will apply them to the scheduled job.
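For instance, a merge-patch body loosening the retention policy might look like this (the values are illustrative):

```yaml
# Merge-patch body: keep 60 days / at most 180 backups (illustrative values)
spec:
  retention:
    days: 60
    count: 180
```

Apply it the same way as the schedule change above, with `kubectl patch sitebackup daily-backup -n production --type=merge -p '<patch>'`.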
# Create manual backup
kubectl create -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
name: manual-backup-$(date +%Y%m%d-%H%M%S)
namespace: production
spec:
siteRef:
name: prod-site
jobType: backup
jobConfig:
withFiles: true
compress: true
EOF
# Check backup status
kubectl get sitejob -n production
# Restore database
kubectl exec -it <site-pod> -- bench --site <site-name> \
restore --mariadb-root-password <password> /path/to/backup.sql.gz
# Restore files
kubectl exec -it <site-pod> -- bench --site <site-name> \
restore --with-private-files /path/to/files-backup.tar.gz
Frappe Operator provides a provider-agnostic autoscaling mechanism for all bench components (nginx, gunicorn, workers, etc.). You can choose between HPA (Horizontal Pod Autoscaler) and KEDA (Kubernetes Event-Driven Autoscaling).
Instead of manual kubectl scale commands, define your scaling policy in the FrappeBench CRD:
spec:
componentAutoscaling:
gunicorn:
enabled: true
provider: hpa # Use standard HPA for web traffic
minReplicas: 3
maxReplicas: 10
hpa:
metric: cpu
targetUtilization: 70
worker-short:
enabled: true
provider: keda # Use KEDA for queue-based scaling
minReplicas: 0 # Scale to zero when idle!
maxReplicas: 15
keda:
trigger: redis
targetValue: "5"
The operator will automatically manage the underlying HorizontalPodAutoscaler or ScaledObject resources for you.
For background workers, KEDA is the recommended provider as it can scale based on the number of jobs waiting in the Redis queue and supports scaling to zero.
spec:
componentAutoscaling:
# Short-running tasks - scale to zero when idle
worker-short:
enabled: true
provider: keda
minReplicas: 0
maxReplicas: 10
keda:
trigger: redis # Triggered by Redis queue length
targetValue: "2" # Target 2 jobs per worker
# Long-running tasks - maintain minimum workers
worker-long:
enabled: true
provider: keda
minReplicas: 1
maxReplicas: 5
keda:
trigger: redis
targetValue: "1"
The operator reports the calculated scaling status in the FrappeBench status:
kubectl get frappebench prod-bench -o jsonpath='{.status.componentScaling}' | jq
When you run hundreds of Frappe sites, the operator reconciles one site at a time by default. To speed up convergence (for example, after a restart or when creating many sites at once), increase the number of concurrent site reconciliations.
Operator config (recommended) – set in the frappe-operator-config ConfigMap (or via Helm operatorConfig.maxConcurrentSiteReconciles):
# In frappe-operator-config ConfigMap
data:
maxConcurrentSiteReconciles: "10" # default; increase for 100+ sites
When using Helm, the value is passed to the operator via the FRAPPE_MAX_CONCURRENT_SITE_RECONCILES env. Changing the ConfigMap requires an operator restart to take effect.
Per-bench override – optional hint on FrappeBench:
spec:
siteReconcileConcurrency: 20 # operator uses max(operatorConfig, max across benches)
The operator uses max(operator config value, max of all benches’ siteReconcileConcurrency) at startup. Tune down if you hit API or database rate limits.
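When installing via Helm, the same setting can be supplied through the chart values; a minimal sketch using the `operatorConfig.maxConcurrentSiteReconciles` value mentioned above:

```yaml
# values.yaml (Helm) -- rendered into the frappe-operator-config ConfigMap
operatorConfig:
  maxConcurrentSiteReconciles: 20
```

Remember that the value is read at startup, so a Helm upgrade that changes it takes effect only after the operator pod restarts.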
Update resource limits:
kubectl patch frappebench prod-bench --type=merge -p '{
"spec": {
"componentResources": {
"gunicorn": {
"requests": {"cpu": "2", "memory": "4Gi"},
"limits": {"cpu": "4", "memory": "8Gi"}
}
}
}
}'
# Update bench version
kubectl patch frappebench prod-bench --type=merge -p '{
"spec": {
"frappeVersion": "version-15",
"imageConfig": {
"tag": "v15.1.0"
}
}
}'
# Monitor rollout
kubectl rollout status deployment/prod-bench-gunicorn -n production
# Update apps
kubectl patch frappebench prod-bench --type=merge -p '{
"spec": {
"appsJSON": "[\"erpnext\", \"hrms\", \"custom_app@v2.0.0\"]"
}
}'
Run migrations after updates:
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
name: migrate-prod-site
namespace: production
spec:
siteRef:
name: prod-site
jobType: migrate
# Upgrade operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/v1.1.0/config/install.yaml
# Verify upgrade
kubectl get deployment -n frappe-operator-system
The operator provides flexible security context configuration with three levels of priority:
Out of the box, the operator uses OpenShift-compatible defaults:
- runAsUser: 1001 (OpenShift arbitrary UID standard)
- runAsGroup: 0 (root group for OpenShift compatibility)
- fsGroup: 0 (root group for filesystem permissions)

No configuration needed for OpenShift deployments!
Set environment variables in the operator deployment for organization-wide policies:
apiVersion: apps/v1
kind: Deployment
metadata:
name: frappe-operator-controller-manager
namespace: frappe-operator-system
spec:
template:
spec:
containers:
- name: manager
image: vyogo.tech/frappe-operator:latest
env:
- name: FRAPPE_DEFAULT_UID
value: "2000" # All benches default to UID 2000
- name: FRAPPE_DEFAULT_GID
value: "2000"
- name: FRAPPE_DEFAULT_FSGROUP
value: "2000"
Override security context for specific benches:
apiVersion: vyogo.tech/v1alpha1
kind: FrappeBench
metadata:
name: compliance-bench
spec:
security:
podSecurityContext:
runAsUser: 5000 # Custom UID for compliance
runAsGroup: 5000
fsGroup: 5000
seccompProfile:
type: RuntimeDefault
securityContext:
runAsUser: 5000
runAsGroup: 5000
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
readOnlyRootFilesystem: false
apps:
- name: erpnext
Priority: spec.security → Environment Variables → Hardcoded Defaults (1001/0/0)
See SECURITY_CONTEXT_FIX.md for detailed configuration examples.
Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: frappe-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: frappe
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8000
egress:
- to:
- namespaceSelector:
matchLabels:
name: databases
ports:
- protocol: TCP
port: 3306
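Note that with `Egress` in `policyTypes`, the policy above also blocks DNS resolution for the selected pods. A common companion rule (assuming cluster DNS runs in `kube-system`, which carries the automatic `kubernetes.io/metadata.name` label) is an additional egress entry:

```yaml
# Additional egress entry allowing DNS lookups to cluster DNS in kube-system
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
```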
The operator is compatible with the restricted Pod Security Standard:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
All managed pods comply with the restricted profile's requirements: they run as non-root, disallow privilege escalation, drop all capabilities, and use the `RuntimeDefault` seccomp profile.
Use the External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: vault-backend
namespace: production
spec:
provider:
vault:
server: "https://vault.example.com"
path: "secret"
version: "v2"
auth:
kubernetes:
mountPath: "kubernetes"
role: "frappe-operator"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: prod-site-admin-password
namespace: production
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: SecretStore
target:
name: prod-site-admin-password
data:
- secretKey: password
remoteRef:
key: prod-site
property: admin_password
Configure ProxySQL for connection pooling:
apiVersion: apps/v1
kind: Deployment
metadata:
name: proxysql
namespace: databases
spec:
replicas: 2
selector:
matchLabels:
app: proxysql
template:
metadata:
labels:
app: proxysql
spec:
containers:
- name: proxysql
image: proxysql/proxysql:2.5
ports:
- containerPort: 6033
- containerPort: 6032
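The Deployment above still needs a ProxySQL configuration pointing at your database backends. A minimal sketch, assuming the `shared-mariadb` instance from earlier in this guide and a placeholder password (mount the ConfigMap at `/etc/proxysql.cnf` in the container):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: proxysql-config
  namespace: databases
data:
  proxysql.cnf: |
    mysql_servers =
    (
      { address="shared-mariadb.databases.svc.cluster.local", port=3306, hostgroup=0 }
    )
    mysql_users =
    (
      { username="frappe", password="change-me", default_hostgroup=0 }
    )
```

Sites then connect to ProxySQL on port 6033 instead of MariaDB directly, so connections are pooled across all benches.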
# Optimize tables
kubectl exec -it mariadb-0 -n databases -- \
mysql -u root -p -e "OPTIMIZE TABLE <database>.<table>;"
# Check database size
kubectl exec -it mariadb-0 -n databases -- \
mysql -u root -p -e "SELECT table_schema AS 'Database',
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables GROUP BY table_schema;"
# 1. Restore database from backup
kubectl apply -f mariadb-restore.yaml
# 2. Recreate operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/main/config/install.yaml
# 3. Recreate bench
kubectl apply -f bench.yaml
# 4. Recreate sites
kubectl apply -f sites/
# 5. Restore site data
kubectl exec -it <pod> -- bench --site <site> restore <backup>
# Create new site from backup
kubectl apply -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: FrappeSite
metadata:
name: recovered-site
spec:
benchRef:
name: prod-bench
siteName: "site.example.com"
restoreFrom:
backup: "s3://bucket/backup.sql.gz"
files: "s3://bucket/files.tar.gz"
EOF
Store all manifests in Git and use tools like ArgoCD or Flux:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: frappe-production
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/yourorg/frappe-k8s
targetRevision: main
path: production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
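One possible layout for the `production` path referenced above, mirroring the manifests used in the disaster-recovery steps earlier (`bench.yaml`, `sites/`); the exact structure is up to you:

```text
frappe-k8s/
├── production/
│   ├── bench.yaml        # FrappeBench
│   ├── sites/            # one FrappeSite manifest per site
│   └── backups/          # SiteBackup resources
└── staging/
    └── ...
```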
Set limits per namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "50"
requests.memory: 100Gi
persistentvolumeclaims: "20"
pods: "100"
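A ResourceQuota on `requests.cpu`/`requests.memory` rejects pods that specify no requests at all, so it is usually paired with a LimitRange that injects defaults. A sketch with illustrative values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: 250m
        memory: 512Mi
      default:               # applied when a container sets no limits
        cpu: "1"
        memory: 2Gi
```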
Ensure availability during maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: prod-bench-gunicorn-pdb
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: prod-bench-gunicorn
Liveness and readiness probes are configured automatically by the operator for all managed components.
# 1. Drain traffic (if using multiple sites)
kubectl patch frappesite <site> --type=merge -p '{
"spec": {"ingress": {"enabled": false}}
}'
# 2. Perform maintenance
kubectl apply -f updated-manifests.yaml
# 3. Verify
kubectl get pods -n production
kubectl logs -l app=<component>
# 4. Re-enable traffic
kubectl patch frappesite <site> --type=merge -p '{
"spec": {"ingress": {"enabled": true}}
}'