Anderson Leite

Safety vs Security in Software: A Practical Guide for Engineers and Infrastructure Teams

As engineers, we often hear "safety" and "security" used interchangeably, but they represent fundamentally different concerns that require distinct approaches.

Understanding this distinction is crucial for building resilient systems that protect both your users and your organization.

The Core Difference

Security is about protecting systems from malicious actors who intentionally try to cause harm, steal data, or disrupt operations.

Safety is about protecting systems and users from unintended failures, bugs, or accidents that could cause harm, even when everyone has good intentions.

Think of it this way: Security asks "What if someone tries to break this?" while Safety asks "What if something goes wrong?"

For Software Engineers

Security Concerns

Software engineers must defend against adversaries actively trying to exploit vulnerabilities.

Key Security Concepts

1. Input Validation and Sanitization

Malicious users will try to inject harmful code or manipulate your system through user inputs.

    // ❌ INSECURE: SQL injection vulnerability (user input is interpolated into the query text)
    const getUserData = (userId) => {
      return db.query(`SELECT * FROM users WHERE id = ${userId}`);
    };

    // ✅ SECURE: Parameterized query (the driver binds userId as data, not as SQL)
    const getUserData = (userId) => {
      return db.query('SELECT * FROM users WHERE id = ?', [userId]);
    };
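
The same idea carries over to other stacks. As a minimal sketch, assuming an in-memory SQLite database and a hypothetical `users` table (neither comes from the example above), the snippet below runs the same malicious input through an interpolated query and a parameterized one:

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    return conn


def get_user_insecure(conn: sqlite3.Connection, user_id: str):
    # String interpolation: the input becomes part of the SQL text.
    return conn.execute(f"SELECT name FROM users WHERE id = {user_id}").fetchall()


def get_user_secure(conn: sqlite3.Connection, user_id: str):
    # Placeholder binding: the input is passed as a value and never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


conn = make_db()
malicious = "1 OR 1=1"
leaked = get_user_insecure(conn, malicious)  # the OR clause matches every row
safe = get_user_secure(conn, malicious)      # the whole string is compared as one value
```

With the interpolated query the attacker reads the whole table; with the bound parameter the same string matches no row.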

2. Authentication and Authorization

Ensure users are who they claim to be (authentication) and can only access what they should (authorization).

    # ❌ INSECURE: No permission check
    @app.route('/api/user/<user_id>/delete', methods=['DELETE'])
    def delete_user(user_id):
        User.delete(user_id)
        return {'status': 'deleted'}

    # ✅ SECURE: Proper authorization
    # Requires: from flask import abort
    @app.route('/api/user/<int:user_id>/delete', methods=['DELETE'])
    @require_auth
    def delete_user(user_id):
        # <int:user_id> converts the URL segment to an int so it compares correctly
        # with current_user.id (assumed to be an integer here)
        if not current_user.is_admin() and current_user.id != user_id:
            abort(403)  # Respond with 403 Forbidden instead of an unhandled exception
        User.delete(user_id)
        return {'status': 'deleted'}

3. Secrets Management

This should be 101 for both software and cloud engineers, but it doesn't hurt to repeat it: never hardcode credentials or expose sensitive data.

    // ❌ INSECURE: Hardcoded credentials
    const apiKey = "sk_live_51HxYz2KzP9876543210";

    // ✅ SECURE: Environment variables with secret management
    const apiKey = process.env.STRIPE_API_KEY; // Loaded from vault/secret manager in production
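
Reading a secret from the environment is only half of the job: the service should also refuse to start when the secret is missing, and it should never print the value. A minimal Python sketch of both ideas follows; the helper names `require_secret` and `mask` are made up for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def require_secret(name: str) -> str:
    """Return the secret stored in environment variable `name`, or fail fast."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Report only the variable name; never echo secret values into logs.
        raise MissingSecretError(f"Required secret {name} is not set")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `require_secret("STRIPE_API_KEY")` once at startup turns a missing secret into an immediate, obvious failure instead of a confusing error on the first payment.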

4. Dependency Security

Third-party libraries can introduce vulnerabilities.

    # Regular security audits
    npm audit
    pip-audit

Security Checklist for Software Engineers

  • [ ] All user inputs are validated and sanitized
  • [ ] SQL injection prevention via parameterized queries
  • [ ] XSS protection implemented (content security policy, output encoding)
  • [ ] CSRF tokens on state-changing operations
  • [ ] Secure password storage (bcrypt, Argon2)
  • [ ] Multi-factor authentication supported
  • [ ] API rate limiting implemented
  • [ ] Dependencies regularly scanned for vulnerabilities
  • [ ] Secrets stored in environment variables or secret managers
  • [ ] HTTPS enforced everywhere
  • [ ] Security headers configured (HSTS, X-Frame-Options, etc.)
  • [ ] Logging excludes sensitive data
  • [ ] Regular penetration testing or security reviews
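
To make the password storage item concrete, here is a minimal sketch. The checklist names bcrypt and Argon2, which need third-party packages in Python, so this example swaps in scrypt from the standard library (another deliberately slow, salted key derivation function). The cost parameters are illustrative assumptions, not tuned recommendations.

```python
import hashlib
import hmac
import os

# Cost parameters are illustrative assumptions; tune them for your hardware.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for storage; the plaintext is never stored."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)
```

The per-password salt means two users with the same password get different digests, and `hmac.compare_digest` avoids leaking information through comparison timing.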

Safety Concerns

Software engineers must also ensure systems fail gracefully and don't harm users through bugs or design flaws.

Key Safety Concepts

1. Error Handling and Graceful Degradation

Systems should handle failures without causing cascading problems or data loss.

    # ❌ UNSAFE: Unhandled exception crashes the service
    def process_payment(amount, user_id):
        user = get_user(user_id)
        payment = charge_card(user.card_token, amount)
        update_balance(user_id, amount)
        return payment

    # ✅ SAFE: Proper error handling with rollback
    def process_payment(amount, user_id):
        try:
            user = get_user(user_id)
            if not user:
                return {'error': 'User not found', 'status': 'failed'}

            payment = charge_card(user.card_token, amount)

            try:
                update_balance(user_id, amount)
            except Exception as e:
                # Rollback the charge if balance update fails
                refund_charge(payment.id)
                log_error(f"Payment processing failed: {e}")
                return {'error': 'Processing error', 'status': 'failed'}

            return {'status': 'success', 'payment_id': payment.id}

        except Exception as e:
            log_error(f"Payment error: {e}")
            return {'error': 'Service temporarily unavailable', 'status': 'failed'}

2. Race Conditions and Concurrency

Multiple operations happening simultaneously can lead to data corruption.

    // ❌ UNSAFE: Race condition
    var balance = 1000

    func withdraw(amount int) {
        if balance >= amount {
            // Another goroutine might modify balance here!
            balance -= amount
        }
    }

    // ✅ SAFE: Using mutex for synchronization
    import "sync"

    var (
        balance = 1000
        mu      sync.Mutex
    )

    func withdraw(amount int) bool {
        mu.Lock()
        defer mu.Unlock()

        if balance >= amount {
            balance -= amount
            return true
        }
        return false
    }

3. Data Validation for Integrity

Validate data not just for security, but to prevent logical errors.

    // ❌ UNSAFE: No bounds checking
    function calculateDiscount(price: number, discountPercent: number): number {
      return price * (discountPercent / 100);
    }

    // ✅ SAFE: Validate business logic constraints
    function calculateDiscount(price: number, discountPercent: number): number {
      if (price < 0) {
        throw new Error('Price cannot be negative');
      }
      if (discountPercent < 0 || discountPercent > 100) {
        throw new Error('Discount must be between 0 and 100');
      }
      return price * (discountPercent / 100);
    }

4. Circuit Breakers and Timeouts

Prevent cascading failures when dependencies fail.

    const CircuitBreaker = require('opossum');

    const options = {
      timeout: 3000,                 // If function takes longer than 3s, trigger a failure
      errorThresholdPercentage: 50,  // Open circuit if 50% of requests fail
      resetTimeout: 30000            // After 30s, try again
    };

    async function callExternalAPI(data) {
      const response = await fetch('https://api.example.com/data', {
        method: 'POST',
        body: JSON.stringify(data)
      });
      // fetch only rejects on network errors, so surface HTTP errors to the breaker too
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return response.json();
    }

    const breaker = new CircuitBreaker(callExternalAPI, options);

    // If the API is down, circuit opens and fails fast
    breaker.fire(requestData)
      .then(result => console.log(result))
      .catch(err => console.log('Service degraded, using fallback'));
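
If you want to see what a breaker does under the hood, the state machine is small. The following Python sketch is a simplified illustration and not how opossum is implemented: it opens after a number of consecutive failures (instead of a failure percentage), and the class name, parameters, and injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and can serve a fallback instead of piling more requests onto a struggling dependency.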

Safety Checklist for Software Engineers

  • [ ] Comprehensive error handling on all critical paths
  • [ ] Database transactions with proper rollback mechanisms
  • [ ] Timeouts configured for all external calls
  • [ ] Circuit breakers for downstream dependencies
  • [ ] Input validation for business logic (not just security)
  • [ ] Race condition prevention (locks, atomic operations)
  • [ ] Idempotency for critical operations
  • [ ] Graceful degradation when services fail
  • [ ] Dead letter queues for failed async operations
  • [ ] Comprehensive logging and monitoring
  • [ ] Feature flags for safe rollouts
  • [ ] Automated testing including edge cases
  • [ ] Health checks and readiness probes
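
Idempotency is the item on this list that people most often ask about, so here is a minimal sketch of the idea: the caller sends a unique key with each logical operation, and a retry with the same key gets the stored result instead of running the operation twice. The in-memory dictionary is an assumption for illustration; a real service would keep the keys in a durable store.

```python
import threading


class IdempotentProcessor:
    """Runs an operation at most once per idempotency key and replays the stored result."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def run(self, key, operation):
        # Holding the lock for the whole call keeps the sketch simple; it also
        # serializes every operation, which a real system would want to avoid.
        with self._lock:
            if key in self._results:
                return self._results[key]
            result = operation()
            self._results[key] = result
            return result
```

A client that times out and retries a payment with the same key therefore charges the card once, not twice.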

For Infrastructure Specialists and Cloud Engineers

Security Concerns

Infrastructure teams must protect the entire system perimeter and prevent unauthorized access.

Key Security Concepts

1. Network Segmentation and Least Privilege

Isolate resources and grant minimal necessary permissions.

    # Terraform example: Secure VPC setup
    resource "aws_security_group" "web_tier" {
      name        = "web-tier"
      description = "Web tier security group"

      # Only allow HTTPS from internet
      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      # No SSH from internet
      # SSH only from bastion host in private subnet
    }

    resource "aws_security_group" "database_tier" {
      name        = "database-tier"
      description = "Database tier security group"

      # Only allow MySQL from app tier
      # (aws_security_group.app_tier is assumed to be defined elsewhere)
      ingress {
        from_port       = 3306
        to_port         = 3306
        protocol        = "tcp"
        security_groups = [aws_security_group.app_tier.id]
      }

      # No direct internet access
    }

2. Identity and Access Management (IAM)

Principle of least privilege for cloud resources.

    # ❌ INSECURE: Overly permissive IAM policy
    AWSTemplateFormatVersion: '2010-09-09'
    Resources:
      DeveloperRole:
        Type: AWS::IAM::Role
        Properties:
          # AssumeRolePolicyDocument (required) omitted for brevity
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/AdministratorAccess  # TOO BROAD!

    # ✅ SECURE: Scoped permissions
    AWSTemplateFormatVersion: '2010-09-09'
    Resources:
      DeveloperRole:
        Type: AWS::IAM::Role
        Properties:
          # AssumeRolePolicyDocument (required) omitted for brevity
          Policies:
            - PolicyName: DeveloperPolicy
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - s3:GetObject
                      - s3:PutObject
                    Resource:
                      - arn:aws:s3:::my-app-bucket/*
                  - Effect: Allow
                    Action:
                      - logs:CreateLogGroup
                      - logs:CreateLogStream
                      - logs:PutLogEvents
                    Resource: arn:aws:logs:*:*:log-group:/aws/lambda/my-app-*
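
Spotting the worst offenders can be automated. The sketch below is a deliberately crude heuristic, not a replacement for a proper policy analysis tool: it takes an IAM policy document as a Python dictionary and flags `Allow` statements whose action or resource is a bare wildcard. The function name and the finding messages are made up for illustration.

```python
def find_overbroad_statements(policy: dict) -> list[str]:
    """Flag Allow statements that use wildcard actions or a wildcard resource."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields accept either a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {index}: wildcard action")
        if "*" in resources:
            findings.append(f"statement {index}: wildcard resource")
    return findings
```

Running a check like this in CI will not prove a policy is safe, but it catches the "just give it admin for now" changes before they reach production.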

3. Secrets and Encryption

Protect data at rest and in transit.

    # Kubernetes example: Using secrets properly
    apiVersion: v1
    kind: Secret
    metadata:
      name: database-credentials
    type: Opaque
    data:
      # base64 is an encoding, not encryption: keep real values out of version control
      username: YWRtaW4=      # base64 encoded
      password: cGFzc3dvcmQ=
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: app-pod
    spec:
      containers:
        - name: app
          image: myapp:latest
          env:
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: password
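
A word of caution about the `data` values in a Secret manifest: they are base64-encoded, not encrypted. Anyone who can read the manifest can recover the originals in one line, as this short Python check on the two values from the example shows:

```python
import base64

# The two values copied from the Secret manifest in the example.
encoded = {"username": "YWRtaW4=", "password": "cGFzc3dvcmQ="}

# Decoding needs no key at all, which is why base64 offers no confidentiality.
decoded = {key: base64.b64decode(value).decode("utf-8") for key, value in encoded.items()}
print(decoded)  # {'username': 'admin', 'password': 'password'}
```

So treat a Secret manifest like the plaintext credential it effectively is: keep it out of Git and restrict who can read Secrets in the cluster.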

4. Security Monitoring and Intrusion Detection

Detect and respond to threats in real-time.

    # AWS CloudWatch example: alarm on unauthorized API calls recorded by CloudTrail
    resource "aws_cloudwatch_log_metric_filter" "unauthorized_api_calls" {
      name           = "UnauthorizedAPICalls"
      log_group_name = "/aws/cloudtrail/organization"
      pattern        = "{ ($.errorCode = \"*UnauthorizedOperation\") || ($.errorCode = \"AccessDenied*\") }"

      metric_transformation {
        name      = "UnauthorizedAPICalls"
        namespace = "Security/Metrics"
        value     = "1"
      }
    }

    resource "aws_cloudwatch_metric_alarm" "unauthorized_api_calls_alarm" {
      alarm_name          = "UnauthorizedAPICallsAlarm"
      comparison_operator = "GreaterThanThreshold"
      evaluation_periods  = "1"
      metric_name         = "UnauthorizedAPICalls"
      namespace           = "Security/Metrics"
      period              = "300"
      statistic           = "Sum"
      threshold           = "5"
      alarm_description   = "Triggers when unauthorized API calls exceed threshold"
      alarm_actions       = [aws_sns_topic.security_alerts.arn]
    }

Security Checklist for Infrastructure Teams

  • [ ] Network segmentation implemented (VPCs, subnets, security groups)
  • [ ] Principle of least privilege for all IAM roles and policies
  • [ ] MFA enforced for privileged accounts
  • [ ] Secrets managed via vault/secrets manager (not in code)
  • [ ] Encryption at rest enabled for all data stores
  • [ ] TLS/SSL enforced for all data in transit
  • [ ] Regular security patching automated
  • [ ] Bastion hosts or VPN for administrative access
  • [ ] Audit logging enabled (CloudTrail, Cloud Audit Logs)
  • [ ] Intrusion detection system deployed
  • [ ] DDoS protection configured
  • [ ] Regular vulnerability scanning
  • [ ] Container image scanning in CI/CD
  • [ ] Web Application Firewall (WAF) configured
  • [ ] Backup encryption enabled

Safety Concerns

Infrastructure teams must ensure systems remain available and resilient to failures.

Key Safety Concepts

1. High Availability and Redundancy

Eliminate single points of failure.

    # Terraform: Multi-AZ deployment for high availability
    resource "aws_autoscaling_group" "app" {
      name = "app-asg"

      # Spread instances across availability zones
      vpc_zone_identifier = [
        aws_subnet.private_a.id,
        aws_subnet.private_b.id,
        aws_subnet.private_c.id
      ]

      min_size         = 3
      max_size         = 10
      desired_capacity = 3

      health_check_type         = "ELB"
      health_check_grace_period = 300

      launch_template {
        id      = aws_launch_template.app.id
        version = "$Latest"
      }

      target_group_arns = [aws_lb_target_group.app.arn]
    }

    resource "aws_lb" "app" {
      name               = "app-lb"
      load_balancer_type = "application"

      # Deploy across multiple AZs
      subnets = [
        aws_subnet.public_a.id,
        aws_subnet.public_b.id,
        aws_subnet.public_c.id
      ]

      enable_deletion_protection = true
    }

2. Disaster Recovery and Backups

Ensure data can be recovered and services restored.

    # Kubernetes: Automated backup with Velero
    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: daily-backup
      namespace: velero
    spec:
      schedule: "0 2 * * *"  # Daily at 2 AM
      template:
        includedNamespaces:
          - production
          - staging
        storageLocation: aws-backup
        volumeSnapshotLocations:
          - aws-snapshots
        ttl: 720h  # 30 days retention

    # Terraform: RDS automated backups
    resource "aws_db_instance" "production" {
      identifier = "production-db"

      # A retention period above 0 enables automated backups and point-in-time recovery
      backup_retention_period = 30
      backup_window           = "03:00-04:00"

      # Copy the instance's tags to its snapshots
      # (copying backups to a different region needs separate configuration, not shown)
      copy_tags_to_snapshot = true

      # Export database logs to CloudWatch Logs
      enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"]
    }

3. Resource Limits and Auto-scaling

Prevent resource exhaustion and ensure capacity.

    # Kubernetes: Resource limits and HPA
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
    spec:
      replicas: 3
      selector:  # Required by apps/v1; must match the pod template labels
        matchLabels:
          app: web-app  # Label name assumed for this example
      template:
        metadata:
          labels:
            app: web-app
        spec:
          containers:
            - name: app
              image: myapp:latest
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "250m"
                limits:
                  memory: "512Mi"
                  cpu: "500m"
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 30
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 5
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
      minReplicas: 3
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
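
It helps to know roughly how the autoscaler turns those targets into a replica count. The Kubernetes documentation describes the core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to the configured minimum and maximum. The Python sketch below reproduces only that arithmetic; it ignores the tolerance band and stabilization windows that the real controller applies, and the function name is made up for the example.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20):
    """Core HPA arithmetic: scale proportionally to the metric, then clamp."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 3 pods averaging 140% CPU against the 70% target from the example: scale out to 6
scale_out = desired_replicas(3, 140, 70)

# 10 pods averaging 35% against the same target: scale in to 5
scale_in = desired_replicas(10, 35, 70)
```

This also shows why `requests` matter: utilization is measured against the requested CPU and memory, so a container without requests gives the autoscaler nothing to compute.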

4. Chaos Engineering and Testing

Proactively test system resilience.

    # Chaos Mesh: Simulating pod failures
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: pod-failure-test
      namespace: chaos-testing
    spec:
      action: pod-failure
      mode: one
      selector:
        namespaces:
          - production
        labelSelectors:
          app: web-service
      duration: "30s"
      # scheduler is a Chaos Mesh 1.x field; 2.x uses a separate Schedule resource
      scheduler:
        cron: "@every 2h"
    ---
    # Network chaos: Simulating latency
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: network-delay-test
      namespace: chaos-testing
    spec:
      action: delay
      mode: all
      selector:
        namespaces:
          - production
        labelSelectors:
          app: api-service
      delay:
        latency: "100ms"
        correlation: "100"
        jitter: "0ms"
      duration: "5m"

Safety Checklist for Infrastructure Teams

  • [ ] Multi-AZ/region deployment for critical services
  • [ ] Automated backups with tested recovery procedures
  • [ ] Auto-scaling configured for compute resources
  • [ ] Resource quotas and limits enforced
  • [ ] Health checks and liveness probes configured
  • [ ] Load balancers with proper health checks
  • [ ] Database replication and failover tested
  • [ ] Disaster recovery runbooks documented and tested
  • [ ] Monitoring and alerting for resource exhaustion
  • [ ] Rate limiting at infrastructure level
  • [ ] Canary deployments or blue-green deployment strategy
  • [ ] Rollback procedures tested and automated
  • [ ] Chaos engineering tests run regularly
  • [ ] Capacity planning based on metrics
  • [ ] Graceful shutdown handling for pods/instances
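
Rate limiting appears on this checklist and on the security one, and a common way to implement it is a token bucket: tokens refill at a steady rate, each request spends one, and the bucket size caps how large a burst can be. Gateways and load balancers usually offer rate limiting as configuration, so the Python sketch below is only meant to show the mechanism; the class name and the injectable `clock` are assumptions for the example.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, otherwise return False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=2`, a client can send two requests back to back and then one per second after that.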

Improving Safety and Security at Your Company

Of course, this is not a comprehensive list of what you should implement or how, but it can trigger a few "oh, we forgot this thing" moments. It also covers some topics that should be handled by your internal IT or security team (depending on how big your company is, or how segregated the roles are there).

Immediate Actions

For Everyone:

  1. Enable MFA on all accounts
  2. Audit and rotate credentials
  3. Review and update dependencies
  4. Set up basic monitoring and alerting

For Software Engineers:

  1. Add input validation to critical endpoints
  2. Implement proper error handling
  3. Add health check endpoints

For Infrastructure Teams:

  1. Review IAM policies for over-privileged access
  2. Enable audit logging
  3. Verify backup processes work

Short-term Improvements

For Software Engineers:

  • Implement automated security scanning in CI/CD (and if you don't know how to do it, do not suffer in silence: ask your infra folks for help!)
  • Add comprehensive test coverage for edge cases
  • Implement circuit breakers for external dependencies
  • Set up proper secrets management
  • Add structured logging with correlation IDs
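
Structured logging with correlation IDs is easier to adopt than it sounds. As a minimal sketch using only the Python standard library (the `JsonFormatter` class, the field names, and the `new_correlation_id` helper are all made up for illustration), each request gets an ID stored in a context variable, and every log line emitted while handling that request carries it:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled; "-" when there is none.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def new_correlation_id() -> str:
    """Generate an ID for a new request and bind it to the current context."""
    value = uuid.uuid4().hex
    correlation_id.set(value)
    return value
```

Call `new_correlation_id()` at the start of each request (or reuse the ID from an incoming header), and a single search for that ID shows everything the request touched.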

For Infrastructure Teams:

  • Implement network segmentation
  • Set up automated patching
  • Configure auto-scaling
  • Implement blue-green or canary deployments
  • Set up cross-region backups

Long-term Strategic Initiatives

Organization-wide:

  • Establish security champions program
  • (If the budget allows) Implement bug bounty programs
  • Conduct regular disaster recovery drills
  • Implement chaos engineering practices
  • Create incident response playbooks
  • Regular security and safety training
  • Implement observability stack (metrics, logs, traces)
  • Conduct penetration testing
  • Establish SRE practices and SLOs

Culture Building:

  • Blameless post-mortems for incidents
  • Security and safety in code review checklists
  • Threat modeling for new features
  • Regular game days for failure scenarios
  • Share lessons learned across teams

"Real-world" examples

Infrastructure Setup

    # Kubernetes deployment with both safety and security
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: payment-service
    spec:
      replicas: 5  # Safety: Multiple replicas
      selector:    # Required by apps/v1; must match the pod template labels
        matchLabels:
          app: payment-service
      strategy:
        type: RollingUpdate  # Safety: Zero-downtime deployments
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0
      template:
        metadata:
          labels:
            app: payment-service  # Also matched by the PodDisruptionBudget below
        spec:
          # Security: Run as non-root
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            fsGroup: 1000
          containers:
            - name: payment-service
              image: payment-service:v1.2.3
              # Security: Read-only filesystem
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - ALL
              # Safety: Resource limits prevent resource exhaustion
              resources:
                requests:
                  memory: "512Mi"
                  cpu: "500m"
                limits:
                  memory: "1Gi"
                  cpu: "1000m"
              # Safety: Liveness probe ensures unhealthy containers restart
              livenessProbe:
                httpGet:
                  path: /health/live
                  port: 8080
                initialDelaySeconds: 30
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              # Safety: Readiness probe prevents traffic to unready containers
              readinessProbe:
                httpGet:
                  path: /health/ready
                  port: 8080
                initialDelaySeconds: 10
                periodSeconds: 5
                failureThreshold: 3
              # Security: Secrets injected from a Kubernetes Secret (for example, synced from a vault)
              env:
                - name: PAYMENT_GATEWAY_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: payment-secrets
                      key: gateway-api-key
                - name: ENCRYPTION_KEY
                  valueFrom:
                    secretKeyRef:
                      name: payment-secrets
                      key: encryption-key
              # Safety: Graceful shutdown
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "sleep 15"]  # Wait for connections to drain
    # Security: Network policies restrict access (not shown)
    # Safety: Pod disruption budget ensures availability during maintenance
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: payment-service-pdb
    spec:
      minAvailable: 3  # Safety: Always keep 3 pods running
      selector:
        matchLabels:
          app: payment-service

Conclusion

Safety and security are complementary but distinct disciplines. Security protects against malicious actors, while safety protects against failures and accidents. Both are essential for building trustworthy systems.

Remember:

  • Security = Protecting against adversaries
  • Safety = Protecting against failures

The best engineering teams excel at both. Start with the checklists above, implement improvements incrementally, and build a culture where both safety and security are everyone's responsibility.
