
Tom

Posted on • Originally published at bubobot.com

Monitoring System Health with Linux CLIs: Troubleshooting Disk, Memory, and CPU for Better Uptime


Let's be real, downtime sucks. Nobody wants to deal with angry users or scramble to restore service at 3 AM. But when disaster strikes, knowing which command-line tools to reach for can be the difference between a quick fix and an extended outage.

I've spent countless hours in the terminal wrestling unresponsive servers back to life. Here are the CLI tools I use most often - organized by scenario so you can quickly find what you need when things go sideways.

Scenario 1: System Down – Time to Investigate!

Your system crashed. First step? Get it back online ASAP (restart, rollback, whatever it takes). Then it's time to play detective and figure out why it happened.

Your Investigation Toolkit

```
# How long was the system up before the crash?
$ uptime
 11:23:42 up 2 days, 1:14, 3 users, load average: 15.32, 12.67, 10.21

# What does the system log say about the crash?
$ journalctl -p err..emerg -b -1
May 12 03:42:11 webserver kernel: Out of memory: Kill process 4312 (java) score 567 or sacrifice child
May 12 03:42:11 webserver kernel: Killed process 4312 (java) total-vm:18245652kB, anon-rss:11291012kB

# Which processes are consuming resources now?
$ htop   # Interactive process viewer
```

When a system crashes, the first things I check are:

  • System load with uptime - Load averages of 15+ in the example above suggest severe overload

  • System logs with journalctl - Look for OOM (Out Of Memory) killers, kernel panics, and service failures

  • Resource usage with htop - Find CPU/memory hogs that might be causing problems

  • Disk space with df -h - A full disk can wreak havoc across the system

Quick Analysis Example

```
# Check what's taking up disk space
$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1    30G   30G    12K  100%  /

# Find the culprit
$ du -h --max-depth=1 / | sort -hr | head -10
15G  /var
8G   /usr
4G   /opt
2G   /home

# Dig deeper
$ du -h --max-depth=1 /var | sort -hr | head -5
14G   /var/log
512M  /var/lib
128M  /var/cache

# Find the specific log files
$ find /var/log -type f -name "*.log" -size +100M | xargs ls -lh
-rw-r--r-- 1 root root 12G May 12 03:40 /var/log/application.log
```

This investigation shows a classic scenario: a runaway log file filled the disk, causing the system to crash. Time to implement log rotation!
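A minimal logrotate rule for a file like this might look as follows (the path, size threshold, and retention count are illustrative, adjust them for your application):

```
/var/log/application.log {
    size 100M
    rotate 5
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

Drop it into /etc/logrotate.d/ and do a dry run with logrotate -d against that file before trusting it in production.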

Scenario 2: System Slowdown – Feeling Sluggish?

Sometimes systems don't crash outright - they just start performing poorly. Pages load slowly, requests time out intermittently, and everything feels... off.

Your Slowdown Diagnostic Tools

```
# Check memory usage
$ free -h
              total   used   free  shared  buff/cache  available
Mem:           31Gi   28Gi  256Mi   1.0Gi       2.7Gi      1.2Gi
Swap:         2.0Gi  2.0Gi     0B

# Find memory hogs
$ ps aux --sort=-%mem | head -10
USER    PID  %CPU %MEM      VSZ      RSS TTY STAT START    TIME COMMAND
mysql 12345  95.7 85.2 35270716 27311600 ?   Ssl  May10 3122:41 mysqld

# Check disk I/O
$ iostat -xz 1
Device   r/s     w/s   rkB/s     wkB/s  rrqm/s  wrqm/s  %util
sda    12.00 1450.00  152.00  24512.00    0.00    0.00  95.60

# All-in-one monitoring
$ dstat -cdngy 1
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
 12   8  45  35   0|   0  24.5M|   0     0 |   0     0 | 789  1425
 15  10  40  35   0|   0  26.2M|  16k   12k|   0     0 | 812  1567
```

The output here tells a clear story:

  1. Memory is almost exhausted (28GB used out of 31GB)

  2. MySQL is consuming 85% of system memory

  3. Disk I/O is at 95% utilization with heavy writes

  4. CPU is spending 35% of its time waiting for I/O operations

This is a classic case of a database server that needs optimization - either query tuning, proper indexing, or possibly hardware upgrades.
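If you want to double-check that iowait figure without installing anything, you can sample it straight from /proc/stat. This is a rough sketch (the iowait_pct and cpu_fields helper names are made up for illustration), not a replacement for iostat or vmstat:

```shell
#!/bin/sh
# Rough iowait estimate from /proc/stat: take two samples one second
# apart and report iowait as a percentage of total CPU time.

cpu_fields() {
  # Print "total iowait" from the aggregate cpu line
  awk '/^cpu / { print $2+$3+$4+$5+$6+$7+$8, $6 }' /proc/stat
}

iowait_pct() {
  set -- $(cpu_fields); t1=$1; w1=$2
  sleep 1
  set -- $(cpu_fields); t2=$1; w2=$2
  awk -v dt="$((t2 - t1))" -v dw="$((w2 - w1))" \
      'BEGIN { printf "%.1f", (dt > 0) ? 100 * dw / dt : 0 }'
}

echo "iowait: $(iowait_pct)%"
```

A sustained value anywhere near the 35% shown above means the CPU is mostly idle-but-blocked, which points at storage rather than compute.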

Quick Action Steps

For memory issues:

```
# Clear the page cache if needed (be careful!) - run this as root
$ sync; echo 3 > /proc/sys/vm/drop_caches

# Restart the memory-hogging service
$ systemctl restart mysql
```

For disk I/O issues:

```
# Find processes causing disk I/O
$ iotop -o

# Check for large files that might be causing issues
$ ncdu /var
```

Scenario 3: Resource Overload – "Help! I'm drowning!"

Sometimes systems get overwhelmed by too many requests, background jobs, or runaway processes.

Your Resource Management Arsenal

```
# See which processes are using the most CPU
$ top -b -n 1 -o %CPU | head -20

# Find I/O bottlenecks
$ iotop -o -b -n 2

# Check network connections and listening ports
$ ss -tuln
$ netstat -tnlp

# Monitor network traffic
$ iftop -i eth0
```

Example Resource Overload Diagnosis

Let's say your web server is struggling with too many connections:

```
# Check current connection count
$ ss -s
Total: 1425
TCP:   1418 (estab 1124, closed 276, orphaned 0, timewait 267)

# See which processes have the most open connections
$ lsof -i | grep ESTABLISHED | awk '{print $1}' | sort | uniq -c | sort -rn
    987 nginx
    124 php-fpm
     13 sshd

# Check nginx process details
$ ps aux | grep nginx
www-data 12345 98.7 2.3 142796 28392 ? R 09:27 12:42 nginx: worker process
```

This shows a classic overload scenario - nginx is handling nearly 1,000 connections and consuming 98.7% CPU. Time to either scale up or implement rate limiting!
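If rate limiting is the route you take, nginx's built-in limit_req module is a reasonable starting point. The zone name and limits below are illustrative, tune them to your traffic:

```
# In the http {} block: track clients by IP, allow 10 requests/second
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts of 20 requests before rejecting
        limit_req zone=perip burst=20;
    }
}
```

Requests beyond the burst are rejected with a 503 by default; watch your error logs after enabling it to confirm you're not throttling legitimate users.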

Scenario 4: Peak Demand – Handling the Rush Hour

During peak traffic periods, you need to keep a close eye on system performance to ensure everything scales properly.

```
# Monitor load averages over time
$ watch -n 10 "uptime"

# Check system activity reports
$ sar -q 1 10   # Load average
$ sar -r 1 10   # Memory usage
$ sar -b 1 10   # I/O operations

# Track process resource usage over time
$ pidstat -r -u -d 1 10
```

Handling Peak Load

During high load periods, you might need to temporarily prioritize critical processes:

```
# Give your database higher priority
$ renice -n -5 -p $(pgrep mysql)

# Limit CPU usage of non-critical background jobs
$ cpulimit -p $(pgrep backup_script) -l 30   # Limit to 30% CPU
```

Creating Your Own Monitoring Dashboard

Want to create a simple monitoring dashboard? Here's a quick script I use:

```
#!/bin/bash
# Simple terminal dashboard
while true; do
  clear
  echo "=== SYSTEM DASHBOARD === $(date) ==="
  echo ""
  echo "=== LOAD ==="
  uptime
  echo ""
  echo "=== MEMORY ==="
  free -h
  echo ""
  echo "=== DISK ==="
  df -h | grep -v tmpfs
  echo ""
  echo "=== TOP PROCESSES ==="
  ps aux --sort=-%cpu | head -6
  echo ""
  echo "=== RECENT ERRORS ==="
  journalctl -p err..emerg -n 5 --no-pager
  sleep 5
done
```

Save this as dashboard.sh, make it executable with chmod +x dashboard.sh, and run it in a terminal window for a simple real-time system overview.

Pro Tips from the Trenches

After years of dealing with system issues, here are some best practices I've learned:

  1. Establish baselines - Know what "normal" looks like so you can quickly spot abnormal behavior

  2. Use screen or tmux for long-running diagnostics - Nothing worse than losing your SSH connection during troubleshooting

  3. Create aliases for common commands - Add these to your .bashrc:

```
alias meminfo='free -h'
alias cpuinfo='top -b -n 1 | head -20'
alias diskinfo='df -h'
alias ioinfo='iostat -xz 1 5'
```
  4. Keep a troubleshooting journal - Document issues and solutions for faster resolution next time

  5. Set up automated monitoring - Don't rely solely on manual checks

Automating Your Monitoring

While these CLI tools are invaluable for troubleshooting, you shouldn't rely on manual checks alone. Automated monitoring systems can alert you before small issues become major outages.

For critical production systems, consider setting up:

  1. Resource threshold alerts - Get notified when CPU, memory, or disk usage crosses critical thresholds

  2. Service availability checks - Ensure your key services remain responsive

  3. Log analysis - Automatically scan logs for error patterns

  4. Performance metrics - Track response times to catch slowdowns early
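As a stopgap before adopting a full monitoring stack, even a small cron-driven script covers the first item. This sketch (the check_disk name and 90% limit are made up for illustration) prints any filesystem at or over the threshold, and you can pipe its output to mail or a webhook:

```shell
#!/bin/sh
# Hypothetical disk-threshold check: print every mounted filesystem
# whose usage meets or exceeds the given percentage.
check_disk() {
  df -P | awk -v limit="$1" '
    NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= limit) print $6 " at " $5 "%" }'
}

# Report anything at 90% or above; empty output means all clear
check_disk 90
```

Run it from cron every few minutes and route non-empty output to your alerting channel of choice.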

Conclusion

Linux CLI tools provide immediate insights when things go wrong, but they're most powerful when you know which ones to use in specific scenarios. Keep this guide handy for your next firefighting session!

Remember that while manual CLI troubleshooting is essential, combining these techniques with automated monitoring gives you the best of both worlds - deep diagnostic capabilities plus proactive notification when things start to go wrong.


For more CLI-based monitoring tips and advanced Linux troubleshooting techniques, check out our comprehensive guide on the Bubobot blog.

#SystemMonitoring #LinuxCLI #BetterUptime

Read more at https://bubobot.com/blog/linux-cl-is-for-system-health-prevent-downtime-ensure-uptime?utm_source=dev.to
