Operations Guide
Day-2 operations, maintenance, and best practices for running Frappe Operator in production.
Table of Contents
- Production Deployment
- Monitoring and Observability
- Backup and Restore
- Scaling
- Updates and Upgrades
- Security
- Database Management
- Disaster Recovery
Production Deployment
Pre-Production Checklist
Before deploying to production, ensure you have:
- Kubernetes cluster with adequate resources
- Persistent storage configured (StorageClass)
- Ingress controller installed and configured
- cert-manager for TLS certificates (optional)
- MariaDB/MySQL database available
- Monitoring stack (Prometheus + Grafana)
- Backup solution in place
- DNS configured for your domains
- Resource limits defined
- Secrets management strategy
Resource Planning
Bench Resources
Estimate resources based on expected load:
Small (< 50 users):
- Gunicorn: 2 replicas, 500m CPU, 1Gi RAM each
- Workers: 1-2 replicas, 250m CPU, 512Mi RAM each
- Redis: 1Gi RAM
Medium (50-200 users):
- Gunicorn: 3-5 replicas, 1 CPU, 2Gi RAM each
- Workers: 2-3 replicas, 500m CPU, 1Gi RAM each
- Redis: 4Gi RAM
Large (200+ users):
- Gunicorn: 5-10+ replicas, 2 CPU, 4Gi RAM each
- Workers: 5+ replicas, 1 CPU, 2Gi RAM each
- Redis: 8Gi+ RAM
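As a sketch, the Small tier above maps onto the FrappeBench `componentResources` field (used later under Vertical Scaling) roughly like this; the `worker-short` component name is an assumption based on the autoscaling examples in this guide:

```yaml
spec:
  componentResources:
    gunicorn:
      requests: {cpu: 500m, memory: 1Gi}
      limits: {cpu: "1", memory: 2Gi}
    worker-short:
      requests: {cpu: 250m, memory: 512Mi}
      limits: {cpu: 500m, memory: 1Gi}
```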
Database Resources
Shared Database:
- 2-4 CPU cores
- 4-8Gi RAM
- 100Gi+ storage
Dedicated Database (per site):
- 1-2 CPU cores
- 2-4Gi RAM
- 50-200Gi storage
MariaDB Operator Setup
For production, use MariaDB Operator for managed databases:
```shell
# Install MariaDB Operator
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/mariadb-operator/mariadb-operator/releases/latest/download/mariadb-operator.yaml
```
Create a shared MariaDB instance:
```yaml
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: shared-mariadb
  namespace: databases
spec:
  rootPasswordSecretKeyRef:
    name: mariadb-root
    key: password
  database: frappe
  username: frappe
  passwordSecretKeyRef:
    name: mariadb-frappe
    key: password
  storage:
    size: 500Gi
    storageClassName: fast-ssd
  replicas: 3
  galera:
    enabled: true
  resources:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      cpu: 4
      memory: 16Gi
```
Namespace Strategy
Organize resources by environment:
```shell
# Create namespaces
kubectl create namespace frappe-operator-system
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace development
kubectl create namespace databases

# Apply resource quotas
kubectl create quota production-quota \
  --hard=requests.cpu=50,requests.memory=100Gi,pods=100 \
  -n production
```
Monitoring and Observability
Prometheus Integration
The operator exposes metrics for Prometheus scraping.
ServiceMonitor for Operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: frappe-operator
  namespace: frappe-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
```
PodMonitor for Frappe Components:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: frappe-benches
  namespace: production
spec:
  selector:
    matchLabels:
      app: frappe
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```
Key Metrics to Monitor
Application Metrics:
- Request rate and latency
- Error rate (4xx, 5xx)
- Queue length (Redis)
- Worker job processing time
- Database connections
Resource Metrics:
- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
Business Metrics:
- Active users
- Concurrent sessions
- Background job completion rate
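The application metrics above can be tracked with PromQL queries along these lines. The metric names are assumptions that depend on the exporters you deploy (only `http_requests_total` and the cAdvisor container metrics appear in this guide's alert rules); adjust label selectors to your environment:

```promql
# Request rate per pod (assumes an exporter exposing http_requests_total)
sum(rate(http_requests_total{namespace="production"}[5m])) by (pod)

# 5xx error ratio across all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Memory usage as a fraction of the limit for gunicorn pods (cAdvisor metrics)
container_memory_usage_bytes{pod=~".*-gunicorn.*"}
  / container_spec_memory_limit_bytes{pod=~".*-gunicorn.*"}
```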
Logging
Configure centralized logging:
```yaml
# Fluent Bit configuration (mounted by a Fluent Bit DaemonSet)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/frappe-*.log
        Parser  docker
        Tag     frappe.*

    [OUTPUT]
        Name    es
        Match   frappe.*
        Host    elasticsearch
        Port    9200
        Index   frappe-logs
```
Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frappe-alerts
  namespace: production
spec:
  groups:
    - name: frappe
      interval: 30s
      rules:
        - alert: FrappeSiteDown
          expr: up{job="frappe-site"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Frappe site is down"
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on {{ $labels.instance }}"
        - alert: HighMemoryUsage
          expr: container_memory_usage_bytes{pod=~".*-gunicorn.*"} / container_spec_memory_limit_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.pod }}"
```
Backup and Restore
Automated Backups with SiteBackup
Create a SiteBackup resource:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
  name: daily-backup
  namespace: production
spec:
  siteRef:
    name: prod-site
  # Daily backup at 2 AM
  schedule: "0 2 * * *"
  retention:
    days: 30
    count: 90
  destination:
    type: s3
    config:
      bucket: frappe-backups
      region: us-east-1
      credentialsSecret: aws-s3-credentials
```
Updating Scheduled Backups
You can update a scheduled backup at any time by modifying the SiteBackup resource. The Frappe Operator will automatically detect changes and update the backup schedule and configuration.
For example, to change the schedule of the daily-backup from 2 AM to 3 AM, you can use kubectl patch:
```shell
kubectl patch sitebackup daily-backup -n production --type=merge -p '{
  "spec": {
    "schedule": "0 3 * * *"
  }
}'
```
You can also modify other parameters, such as the retention policy or backup destination, and the operator will apply them to the scheduled job.
Manual Backup
# Create manual backup
kubectl create -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
name: manual-backup-$(date +%Y%m%d-%H%M%S)
namespace: production
spec:
siteRef:
name: prod-site
jobType: backup
jobConfig:
withFiles: true
compress: true
EOF
# Check backup status
kubectl get sitejob -n production
Restore from Backup
```shell
# Restore the database
kubectl exec -it <site-pod> -- bench --site <site-name> \
  restore --mariadb-root-password <password> /path/to/backup.sql.gz

# Restore files
kubectl exec -it <site-pod> -- bench --site <site-name> \
  restore --with-private-files /path/to/files-backup.tar.gz
```
Scaling
Component Autoscaling (Unified API)
Frappe Operator provides a provider-agnostic autoscaling mechanism for all bench components (nginx, gunicorn, workers, etc.). You can choose between HPA (Horizontal Pod Autoscaler) and KEDA (Kubernetes Event-Driven Autoscaling).
Managed Scaling
Instead of manual kubectl scale commands, define your scaling policy in the FrappeBench CRD:
```yaml
spec:
  componentAutoscaling:
    gunicorn:
      enabled: true
      provider: hpa       # Use standard HPA for web traffic
      minReplicas: 3
      maxReplicas: 10
      hpa:
        metric: cpu
        targetUtilization: 70
    worker-short:
      enabled: true
      provider: keda      # Use KEDA for queue-based scaling
      minReplicas: 0      # Scale to zero when idle!
      maxReplicas: 15
      keda:
        trigger: redis
        targetValue: "5"
```
The operator will automatically manage the underlying HorizontalPodAutoscaler or ScaledObject resources for you.
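For reference, the gunicorn policy above corresponds to a HorizontalPodAutoscaler along these lines. This is a sketch of the kind of object the operator manages on your behalf; the generated object's name and target Deployment are assumptions based on the naming used elsewhere in this guide:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prod-bench-gunicorn
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prod-bench-gunicorn
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```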
Worker Autoscaling with KEDA
For background workers, KEDA is the recommended provider as it can scale based on the number of jobs waiting in the Redis queue and supports scaling to zero.
Configure Worker Autoscaling
```yaml
spec:
  componentAutoscaling:
    # Short-running tasks - scale to zero when idle
    worker-short:
      enabled: true
      provider: keda
      minReplicas: 0
      maxReplicas: 10
      keda:
        trigger: redis    # Triggered by Redis queue length
        targetValue: "2"  # Target 2 jobs per worker
    # Long-running tasks - maintain minimum workers
    worker-long:
      enabled: true
      provider: keda
      minReplicas: 1
      maxReplicas: 5
      keda:
        trigger: redis
        targetValue: "1"
```
Monitor Autoscaling Status
The operator reports the calculated scaling status in the FrappeBench status:
```shell
kubectl get frappebench prod-bench -o jsonpath='{.status.componentScaling}' | jq
```
Site Reconciliation Concurrency (100+ Sites)
When you run hundreds of Frappe sites, the operator limits how many sites it reconciles concurrently. To speed up convergence (for example after a restart, or when creating many sites at once), increase the number of concurrent site reconciles.
Operator config (recommended): set `maxConcurrentSiteReconciles` in the frappe-operator-config ConfigMap (or via the Helm value operatorConfig.maxConcurrentSiteReconciles):

```yaml
# In the frappe-operator-config ConfigMap
data:
  maxConcurrentSiteReconciles: "10"  # default; increase for 100+ sites
```

When using Helm, the value is passed to the operator via the FRAPPE_MAX_CONCURRENT_SITE_RECONCILES environment variable. Changing the ConfigMap requires an operator restart to take effect.

Per-bench override (an optional hint on FrappeBench):

```yaml
spec:
  siteReconcileConcurrency: 20  # operator uses max(operatorConfig, max across benches)
```

At startup, the operator uses max(operator config value, highest siteReconcileConcurrency across all benches). Tune the value down if you hit API-server or database rate limits.
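A complete manifest for the concurrency setting might look like the following. The ConfigMap name and key come from above; the namespace is an assumption (use the namespace your operator runs in), and remember that an operator restart is required for the change to take effect:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: frappe-operator-config
  namespace: frappe-operator-system  # assumption: adjust to your install
data:
  maxConcurrentSiteReconciles: "25"
```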
Vertical Scaling
Update resource limits:
```shell
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "componentResources": {
      "gunicorn": {
        "requests": {"cpu": "2", "memory": "4Gi"},
        "limits": {"cpu": "4", "memory": "8Gi"}
      }
    }
  }
}'
```
Updates and Upgrades
Updating Frappe Version
```shell
# Update the bench version
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "frappeVersion": "version-15",
    "imageConfig": {
      "tag": "v15.1.0"
    }
  }
}'

# Monitor the rollout
kubectl rollout status deployment/prod-bench-gunicorn -n production
```
App Updates
```shell
# Update apps
kubectl patch frappebench prod-bench --type=merge -p '{
  "spec": {
    "appsJSON": "[\"erpnext\", \"hrms\", \"custom_app@v2.0.0\"]"
  }
}'
```
Site Migration
Run migrations after updates:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteJob
metadata:
  name: migrate-prod-site
  namespace: production
spec:
  siteRef:
    name: prod-site
  jobType: migrate
```
Operator Upgrade
```shell
# Upgrade the operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/v1.1.0/config/install.yaml

# Verify the upgrade
kubectl get deployment -n frappe-operator-system
```
Security
Security Context Configuration
The operator provides flexible security context configuration with three levels of priority:
Default OpenShift Compatibility
Out of the box, the operator uses OpenShift-compatible defaults:
- `runAsUser: 1001` (OpenShift arbitrary UID standard)
- `runAsGroup: 0` (root group for OpenShift compatibility)
- `fsGroup: 0` (root group for filesystem permissions)
No configuration needed for OpenShift deployments!
Cluster-Wide Custom Defaults
Set environment variables in the operator deployment for organization-wide policies:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frappe-operator-controller-manager
  namespace: frappe-operator-system
spec:
  template:
    spec:
      containers:
        - name: manager
          image: vyogo.tech/frappe-operator:latest
          env:
            - name: FRAPPE_DEFAULT_UID
              value: "2000"  # All benches default to UID 2000
            - name: FRAPPE_DEFAULT_GID
              value: "2000"
            - name: FRAPPE_DEFAULT_FSGROUP
              value: "2000"
```
Per-Bench Security Override
Override security context for specific benches:
```yaml
apiVersion: vyogo.tech/v1alpha1
kind: FrappeBench
metadata:
  name: compliance-bench
spec:
  security:
    podSecurityContext:
      runAsUser: 5000   # Custom UID for compliance
      runAsGroup: 5000
      fsGroup: 5000
      seccompProfile:
        type: RuntimeDefault
    securityContext:
      runAsUser: 5000
      runAsGroup: 5000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: false
  apps:
    - name: erpnext
```
Priority: `spec.security` → environment variables → hardcoded defaults (1001/0/0)
See SECURITY_CONTEXT_FIX.md for detailed configuration examples.
Network Policies
Restrict network access:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frappe-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frappe
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: databases
      ports:
        - protocol: TCP
          port: 3306
```
Pod Security Standards
The operator is compatible with the restricted Pod Security Standard:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
All managed pods comply with:
- Non-root user execution
- Privilege escalation prevented
- All capabilities dropped
- Seccomp runtime default profile
- OpenShift restricted SCC compatible
Secrets Management
Use external secrets operator:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: production
spec:
  provider:
    vault:
      server: "https://vault.example.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "frappe-operator"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prod-site-admin-password
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: prod-site-admin-password
  data:
    - secretKey: password
      remoteRef:
        key: prod-site
        property: admin_password
```
Database Management
Connection Pooling
Configure ProxySQL for connection pooling:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxysql
  namespace: databases
spec:
  replicas: 2
  selector:
    matchLabels:
      app: proxysql
  template:
    metadata:
      labels:
        app: proxysql
    spec:
      containers:
        - name: proxysql
          image: proxysql/proxysql:2.5
          ports:
            - containerPort: 6033  # MySQL client traffic
            - containerPort: 6032  # admin interface
```
Database Maintenance
```shell
# Optimize tables
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "OPTIMIZE TABLE <database>.<table>;"

# Check database sizes
kubectl exec -it mariadb-0 -n databases -- \
  mysql -u root -p -e "SELECT table_schema AS 'Database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
    FROM information_schema.tables GROUP BY table_schema;"
```
Disaster Recovery
Backup Strategy
- Database Backups: Daily full backups, hourly incrementals
- File Backups: Daily backups of site files
- Configuration Backups: Version control for manifests
- Off-site Replication: Store backups in different region
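Off-site replication can be expressed as a second SiteBackup targeting a bucket in another region. This sketch reuses only fields shown earlier in this guide; the bucket, region, and secret names are placeholders:

```yaml
apiVersion: vyogo.tech/v1alpha1
kind: SiteBackup
metadata:
  name: daily-backup-offsite
  namespace: production
spec:
  siteRef:
    name: prod-site
  schedule: "0 4 * * *"   # after the primary backup window
  retention:
    days: 90
  destination:
    type: s3
    config:
      bucket: frappe-backups-dr   # placeholder: bucket in a different region
      region: eu-west-1
      credentialsSecret: aws-s3-credentials-dr
```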
Recovery Procedures
Complete Cluster Failure
```shell
# 1. Restore the database from backup
kubectl apply -f mariadb-restore.yaml

# 2. Recreate the operator
kubectl apply -f https://raw.githubusercontent.com/vyogotech/frappe-operator/main/config/install.yaml

# 3. Recreate the bench
kubectl apply -f bench.yaml

# 4. Recreate the sites
kubectl apply -f sites/

# 5. Restore site data
kubectl exec -it <pod> -- bench --site <site> restore <backup>
```
Site Recovery
```shell
# Create a new site from backup
kubectl apply -f - <<EOF
apiVersion: vyogo.tech/v1alpha1
kind: FrappeSite
metadata:
  name: recovered-site
spec:
  benchRef:
    name: prod-bench
  siteName: "site.example.com"
  restoreFrom:
    backup: "s3://bucket/backup.sql.gz"
    files: "s3://bucket/files.tar.gz"
EOF
```
Best Practices
1. Use GitOps
Store all manifests in Git and use tools like ArgoCD or Flux:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frappe-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/frappe-k8s
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
2. Resource Quotas
Set limits per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    persistentvolumeclaims: "20"
    pods: "100"
```
3. Pod Disruption Budgets
Ensure availability during maintenance:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-bench-gunicorn-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: prod-bench-gunicorn
```
4. Health Checks
Configure liveness and readiness probes (handled by operator).
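For reference, a typical probe pair for a gunicorn container looks like the following. This is an illustration only, not the operator's exact configuration; the probe path and port are assumptions (port 8000 matches the ingress port used in the Network Policies section, and `/api/method/ping` is a standard Frappe endpoint):

```yaml
livenessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/method/ping
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```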
5. Regular Testing
- Test disaster recovery procedures quarterly
- Validate backups monthly
- Performance test before major updates
- Security audits semi-annually
Maintenance Windows
Planning Maintenance
- Schedule: Off-peak hours
- Communication: Notify users in advance
- Backups: Take fresh backups before maintenance
- Rollback Plan: Prepare rollback procedures
- Monitoring: Extra vigilance during and after
Performing Maintenance
```shell
# 1. Drain traffic (if using multiple sites)
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": false}}
}'

# 2. Perform maintenance
kubectl apply -f updated-manifests.yaml

# 3. Verify
kubectl get pods -n production
kubectl logs -l app=<component>

# 4. Re-enable traffic
kubectl patch frappesite <site> --type=merge -p '{
  "spec": {"ingress": {"enabled": true}}
}'
```
Next Steps
- Troubleshooting - Debugging and problem resolution
- API Reference - Complete specification
- Examples - Configuration examples