Troubleshooting
Common issues and solutions when running Frappe Operator.
Table of Contents
- General Debugging
- Installation Issues
- Bench Issues
- Site Issues
- Database Issues
- Networking Issues
- Performance Issues
- Storage Issues
General Debugging
Check Operator Logs
# View operator logs
kubectl logs -n frappe-operator-system \
deployment/frappe-operator-controller-manager -f
# Check for errors
kubectl logs -n frappe-operator-system \
deployment/frappe-operator-controller-manager | grep ERROR
Check Resource Status
# Check all Frappe resources
kubectl get frappebench,frappesite -A
# Describe resources for events
kubectl describe frappebench <bench-name>
kubectl describe frappesite <site-name>
# Check all pods
kubectl get pods -A | grep frappe
Common Commands
# Get events
kubectl get events --sort-by='.lastTimestamp'
# Check resource status
kubectl get <resource> <name> -o yaml
# View logs from specific component
kubectl logs -l app=<component-name> -f
# Execute commands in pod
kubectl exec -it <pod-name> -- bash
Installation Issues
CRDs Not Installing
Problem: CRDs fail to install or are not recognized.
Solution:
# Check if CRDs exist
kubectl get crd | grep vyogo.tech
# Manually install CRDs
kubectl apply -f config/crd/bases/
# Verify CRD installation
kubectl get crd vyogo.tech_frappebenchs.yaml -o yaml
Operator Pod Not Starting
Problem: Operator pod is in CrashLoopBackOff or pending.
Diagnosis:
# Check pod status
kubectl get pods -n frappe-operator-system
# Check logs
kubectl logs -n frappe-operator-system <pod-name>
# Describe pod for events
kubectl describe pod -n frappe-operator-system <pod-name>
Common Causes:
- Insufficient Resources:
# Check node resources kubectl top nodes kubectl describe nodes - Image Pull Errors:
# Check image pull status kubectl describe pod -n frappe-operator-system <pod-name> | grep -A 10 Events # Verify image exists docker pull <image-name> - RBAC Issues:
# Check service account kubectl get serviceaccount -n frappe-operator-system # Check role bindings kubectl get rolebinding,clusterrolebinding | grep frappe-operator
Webhook Configuration Issues
Problem: Validating/mutating webhook errors.
Solution:
# Delete webhooks
kubectl delete validatingwebhookconfiguration frappe-operator-validating-webhook-configuration
kubectl delete mutatingwebhookconfiguration frappe-operator-mutating-webhook-configuration
# Reinstall operator
kubectl apply -f config/install.yaml
Bench Issues
Bench Not Becoming Ready
Problem: FrappeBench status shows ready: false.
Diagnosis:
# Check bench status
kubectl get frappebench <bench-name> -o yaml | grep -A 10 status
# Check all bench pods
kubectl get pods -l bench=<bench-name>
# Check init job
kubectl get job <bench-name>-init
kubectl logs job/<bench-name>-init
Common Causes:
- Init Job Failed:
# Check init job logs kubectl logs -l job-name=<bench-name>-init # Delete and recreate if needed kubectl delete job <bench-name>-init kubectl delete frappebench <bench-name> kubectl apply -f bench.yaml - Image Pull Issues:
# Check if image exists kubectl describe pod <bench-pod> | grep Image # Create pull secret if needed kubectl create secret docker-registry regcred \ --docker-server=<registry> \ --docker-username=<username> \ --docker-password=<password> - Resource Constraints:
# Check if pods are pending kubectl get pods -l bench=<bench-name> # Check resource availability kubectl describe nodes kubectl top nodes
Redis/DragonFly Not Starting
Problem: Redis or DragonFly pod failing.
Solution:
# Check Redis logs
kubectl logs -l app=<bench-name>-redis
# Common issues:
# 1. Memory limits too low
kubectl patch frappebench <bench-name> --type=merge -p '{
"spec": {
"redisConfig": {
"resources": {
"requests": {"memory": "2Gi"},
"limits": {"memory": "4Gi"}
}
}
}
}'
# 2. Persistence issues
kubectl get pvc | grep redis
kubectl describe pvc <redis-pvc>
Site Issues
Site Stuck in Provisioning
Problem: FrappeSite phase remains “Provisioning”.
Diagnosis:
# Check site status
kubectl get frappesite <site-name> -o yaml
# Check if bench is ready
kubectl get frappebench <bench-ref-name> -o jsonpath='{.status.ready}'
# Check site init job
kubectl get job <site-name>-init
kubectl logs job/<site-name>-init -f
Common Causes:
- Bench Not Ready:
# Wait for bench to be ready first kubectl wait --for=condition=ready frappebench/<bench-name> --timeout=600s - Database Connection Issues:
# Check database secret kubectl get secret <db-secret-name> -o yaml # Test database connectivity kubectl run mysql-client --rm -it --image=mysql:8 -- \ mysql -h <db-host> -u <user> -p<password> - Insufficient Storage:
# Check PVC status kubectl get pvc | grep <site-name> kubectl describe pvc <site-pvc> # Check storage class kubectl get storageclass
Site Init Job Fails
Problem: Site initialization job fails.
Solution:
# View init job logs
kubectl logs job/<site-name>-init
# Common errors and fixes:
# 1. Database already exists
# Delete the site from database manually
kubectl exec -it mariadb-0 -- \
mysql -u root -p -e "DROP DATABASE IF EXISTS <site_db>;"
# 2. Database connection refused
# Check database service and credentials
kubectl get svc <db-service>
kubectl get secret <db-secret> -o yaml
# 3. Permission denied
# Check security context and volume permissions
kubectl exec -it <site-pod> -- ls -la /home/frappe/frappe-bench/sites
# Restart init job
kubectl delete job <site-name>-init
kubectl delete frappesite <site-name>
kubectl apply -f site.yaml
Site Not Accessible
Problem: Cannot access site via browser.
Diagnosis:
# Check all site components
kubectl get pods -l site=<site-name>
# Check services
kubectl get svc | grep <site-name>
# Check ingress
kubectl get ingress <site-name>-ingress
kubectl describe ingress <site-name>-ingress
# Test internal connectivity
kubectl run curl-test --rm -it --image=curlimages/curl -- \
curl -H "Host: <site-domain>" http://<bench-nginx-service>:8080
Common Causes:
- Ingress Not Configured:
# Check ingress controller kubectl get pods -n ingress-nginx # Check ingress resource kubectl describe ingress <site-name>-ingress # Check ingress class kubectl get ingressclass - DNS Not Configured:
# Check DNS resolution nslookup <site-domain> dig <site-domain> # For local testing, add to /etc/hosts echo "127.0.0.1 <site-domain>" | sudo tee -a /etc/hosts - TLS Certificate Issues:
# Check cert-manager kubectl get certificate -A kubectl describe certificate <cert-name> # Check certificate secret kubectl get secret <tls-secret> -o yaml
Database Issues
Cannot Connect to Database
Problem: Site cannot connect to database.
Solution:
# Check database service
kubectl get svc <db-service>
# Check database pods
kubectl get pods -l app=mariadb
# Test connectivity from site pod
kubectl exec -it <site-pod> -- \
mysql -h <db-host> -u <user> -p<password> -e "SELECT 1;"
# Check database credentials secret
kubectl get secret <db-secret> -o jsonpath='{.data.password}' | base64 -d
# Check network policies
kubectl get networkpolicy -A
Database User/Grants Issues (MariaDB Operator)
Problem: Using MariaDB Operator and permissions are incorrect.
Solution:
# Check MariaDB User resources
kubectl get user -A
kubectl describe user <site-db-user>
# Check grants
kubectl get grant -A
kubectl describe grant <site-db-grant>
# Manually fix (if needed)
kubectl exec -it mariadb-0 -- mysql -u root -p << EOF
GRANT ALL PRIVILEGES ON <database>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
EOF
Database Performance Issues
Problem: Slow database queries.
Diagnosis:
# Check database resource usage
kubectl top pod <db-pod>
# Check slow query log
kubectl exec -it mariadb-0 -- mysql -u root -p -e "
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;
"
# Check connection count
kubectl exec -it mariadb-0 -- mysql -u root -p -e "
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
"
Solutions:
- Increase resources:
kubectl patch frappebench <name> --type=merge -p '{ "spec": { "dbConfig": { "resources": { "requests": {"cpu": "2", "memory": "8Gi"} } } } }' -
Optimize queries: Check Frappe logs for slow queries and optimize.
- Add indexes: Use Frappe’s database tools to add appropriate indexes.
Networking Issues
Services Not Resolving
Problem: Cannot resolve service DNS names.
Solution:
# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check service endpoints
kubectl get endpoints <service-name>
Ingress Not Working
Problem: External traffic not reaching services.
Solution:
# Check ingress controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller
# Check ingress resource
kubectl get ingress
kubectl describe ingress <ingress-name>
# Verify ingress class
kubectl get ingressclass
kubectl describe ingressclass <class-name>
# Check ingress controller service
kubectl get svc -n ingress-nginx
Network Policies Blocking Traffic
Problem: Network policies preventing communication.
Solution:
# List network policies
kubectl get networkpolicy -A
# Describe policy
kubectl describe networkpolicy <policy-name>
# Temporarily remove policy for testing
kubectl delete networkpolicy <policy-name>
# Test connectivity
kubectl run test-pod --rm -it --image=curlimages/curl -- curl <service>
Performance Issues
High Response Times
Problem: Slow page loads and API responses.
Diagnosis:
# Check resource usage
kubectl top pods
# Check gunicorn logs
kubectl logs -l app=<site>-gunicorn --tail=100
# Check worker queue lengths
kubectl exec -it <site-pod> -- bench --site <site-name> doctor
Solutions:
- Scale up gunicorn:
kubectl patch frappebench <name> --type=merge -p '{ "spec": {"componentReplicas": {"gunicorn": 5}} }' - Increase resources:
kubectl patch frappebench <name> --type=merge -p '{ "spec": { "componentResources": { "gunicorn": { "requests": {"cpu": "2", "memory": "4Gi"} } } } }' - Check database performance: See Database Performance Issues.
Workers Not Processing Jobs
Problem: Background jobs queuing up.
Solution:
# Check worker logs
kubectl logs -l app=<site>-worker-default --tail=100
# Check Redis queue
kubectl exec -it <redis-pod> -- redis-cli LLEN <queue-name>
# Scale up workers
kubectl patch frappebench <name> --type=merge -p '{
"spec": {
"componentReplicas": {
"workerDefault": 5,
"workerLong": 3
}
}
}'
# Check for stuck jobs
kubectl exec -it <site-pod> -- bench --site <site-name> show-pending-jobs
Memory Issues
Problem: Pods being OOMKilled.
Solution:
# Check pod events
kubectl describe pod <pod-name> | grep -A 10 Events
# Increase memory limits
kubectl patch frappebench <name> --type=merge -p '{
"spec": {
"componentResources": {
"gunicorn": {
"limits": {"memory": "8Gi"}
}
}
}
}'
# Check for memory leaks
kubectl top pod <pod-name> --containers
Storage Issues
PVC Not Binding
Problem: PersistentVolumeClaim stuck in Pending.
Solution:
# Check PVC status
kubectl describe pvc <pvc-name>
# Check storage class
kubectl get storageclass
kubectl describe storageclass <class-name>
# Check available PVs
kubectl get pv
# Check provisioner logs (depends on storage provider)
kubectl logs -n kube-system -l app=<storage-provisioner>
Disk Full
Problem: Storage exhausted.
Solution:
# Check disk usage in pod
kubectl exec -it <pod-name> -- df -h
# Check PVC size
kubectl get pvc <pvc-name>
# Expand PVC (if storage class supports it)
kubectl patch pvc <pvc-name> -p '{
"spec": {"resources": {"requests": {"storage": "100Gi"}}}
}'
# Clean up old files
kubectl exec -it <site-pod> -- bench --site <site-name> clear-cache
kubectl exec -it <site-pod> -- bench --site <site-name> clear-logs
File Permission Issues
Problem: Permission denied errors.
Solution:
# Check file permissions
kubectl exec -it <site-pod> -- ls -la /home/frappe/frappe-bench/sites
# Fix permissions
kubectl exec -it <site-pod> -- \
chown -R frappe:frappe /home/frappe/frappe-bench/sites
# Check security context
kubectl get pod <pod-name> -o jsonpath='{.spec.securityContext}'
Getting Help
Collecting Debug Information
When reporting issues, collect:
# 1. Resource definitions
kubectl get frappebench <name> -o yaml > bench.yaml
kubectl get frappesite <name> -o yaml > site.yaml
# 2. Pod status
kubectl get pods -o wide > pods.txt
# 3. Events
kubectl get events --sort-by='.lastTimestamp' > events.txt
# 4. Logs
kubectl logs -l bench=<name> --tail=500 > bench-logs.txt
kubectl logs -l site=<name> --tail=500 > site-logs.txt
kubectl logs -n frappe-operator-system deployment/frappe-operator-controller-manager > operator-logs.txt
# 5. Describe resources
kubectl describe frappebench <name> > bench-describe.txt
kubectl describe frappesite <name> > site-describe.txt
Resources
- GitHub Issues: vyogotech/frappe-operator/issues
- Documentation: Frappe Operator Docs
- Frappe Forum: discuss.frappe.io
Next Steps
- Operations Guide - Production operations
- Examples - Configuration examples
- API Reference - Complete specification