Debugging GNU/Linux

Here are notes on system-level commands I find useful for diagnosing problems with a process.

See more sections on GNU/Linux, including NFS and routing problems.

gdb is well-documented, so I won't bother. Try ddd as a friendly wrapper for gdb.


Books

Check out "Advanced Linux Programming" http://www.advancedlinuxprogramming.com/ from http://www.codesourcery.com/ .

Find some great suggestions at http://people.redhat.com/alikins/system_tuning.html


Aliases for ps

Many prefer interactive tools like top for watching running programs, but the command-line tool ps is more flexible and less intrusive if you construct some suitable aliases. Here are two of my favorites. The first (psc) sorts by CPU usage, and the second (psm) by virtual memory size.
function psc {
  ps --cols=1000 --sort='-%cpu,uid,pgid,ppid,pid' -e \
     -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
     sed 's/^/ /' | less
}
function psm {
  ps --cols=1000 --sort='-vsz,uid,pgid,ppid,pid' -e \
     -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
     sed 's/^/ /' | less
}

Programs whose arguments are unavailable, such as those swapped to disk, are shown in brackets without arguments.

The STAT column shows the process status:
       D   uninterruptible sleep (usually IO)
       R   runnable (on run queue)
       S   sleeping
       T   traced or stopped
       Z   a defunct ("zombie") process
       W   has no resident pages
       <   high-priority process
       N   low-priority task
       L   has pages locked into memory (for real-time and custom IO)

The WCHAN column shows the kernel function in which a sleeping process is waiting. Addresses are mapped to names according to the file /boot/System.map-`uname -r`
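
To read the STAT and WCHAN columns together, a small filter can keep only the processes in uninterruptible sleep and show what they are waiting on. This is a minimal sketch: blocked_only is a hypothetical name, and it assumes STAT is the second column of whatever ps output you feed it.

```shell
# Keep the header line plus any process whose STAT begins with D
# (uninterruptible sleep). Assumes STAT is the second column.
blocked_only() {
  awk 'NR == 1 || $2 ~ /^D/'
}

# Usage (column list is only a suggestion):
#   ps -e -o pid,stat,wchan:20,args | blocked_only
```

A process that stays in this list across several runs is usually stuck on disk or NFS I/O.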


CPU activity

To see a history of CPU usage, install the sysstat package, which contains sar. The sar -A output distinguishes user CPU from system CPU, which is consumed for operations like I/O, swapping, and error handling. You should investigate any unusually high system CPU usage. strace will confirm whether a specific program is making too many system calls.

To see what CPUs you have, type cat /proc/cpuinfo.

You may want to install and display xosview for a quick glance at system activity. If you use the GNOME desktop, try the System Monitor applet that comes preinstalled, or upgrade to gkrellm.
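
If sysstat is not installed, a rough user-versus-system split can be read from the first line of /proc/stat, which holds cumulative clock ticks since boot. The sketch below is only an approximation: cpu_split is a hypothetical helper name, and it looks at the user, nice, system, and idle fields while ignoring iowait, irq, and anything after them.

```shell
# Print the share of CPU time spent in user, system, and idle states.
# Reads /proc/stat-style text on stdin and uses only the aggregate
# "cpu" line: fields 2-5 are user, nice, system, idle (ticks since boot).
cpu_split() {
  awk '$1 == "cpu" {
    total = $2 + $3 + $4 + $5
    printf "user %.1f%% system %.1f%% idle %.1f%%\n",
           100 * ($2 + $3) / total, 100 * $4 / total, 100 * $5 / total
  }'
}

# Usage:
#   cpu_split < /proc/stat
```

Because the counters are cumulative, this is an average since boot, not a picture of the last few minutes the way sar is.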

See the alias psc in the previous section.


The /proc filesystem

To see the status of a running process look at /proc/PID where PID is the process ID of the thread of interest. /proc/self will point to the current process. See memory and CPU usage, file descriptors, environment, working directory, mapped shared objects, the command line, and a symbolic link to the executable.
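
As a sketch of what can be pulled out of /proc/PID with ordinary tools, the function below prints a short summary for one process. The name proc_summary and the choice of fields are my assumptions for illustration; you need permission to read the target process's entries.

```shell
# Print a short summary of one process from /proc/PID.
proc_summary() {
  pid=$1
  # cmdline is NUL-separated, so turn the separators into spaces.
  echo "cmdline: $(tr '\0' ' ' < "/proc/$pid/cmdline")"
  echo "exe:     $(readlink "/proc/$pid/exe")"
  echo "cwd:     $(readlink "/proc/$pid/cwd")"
  echo "fds:     $(ls "/proc/$pid/fd" | wc -l)"
  # State and memory figures come from the status file.
  grep -E '^(State|VmSize|VmRSS):' "/proc/$pid/status"
}

# Usage: summarize the current shell.
proc_summary $$
```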

The command procinfo -a is handy for certain information from the /proc filesystem.

See http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/ref-guide/ch-proc.html for more info.


Getting Process IDs

Each thread is managed by the kernel as a separate process with shared memory. You often need the process IDs of all threads.

If you are running only one instance of a particular executable, then you can get the PIDs with pidof program.

Otherwise you may want to identify some string that appears as a command-line argument only for your particular process. Write a script like the following to return a list of the PIDs.
  if [ $# -ne 1 ] ; then echo 'Usage: greppids "unique string"' ; exit 1 ; fi
  UNIQUE_STRING="$1"
  MYUSERNAME=`whoami`
  ps --cols=1000 -e -o pid,user,args | grep " $MYUSERNAME " | sed 's/$/ /' |
     grep "$UNIQUE_STRING" | grep -v grep | awk '{print $1}' | sort -n | paste -s -d" " - 

Use kill -0 to see if a process ID or process group ID is active.
$ if kill -0 "$PID" ; then echo "Yes, process $PID is running" ; fi
$ if kill -0 "-$PGID" ; then echo "Yes, process group ID $PGID is running" ; fi
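
The same test is convenient as a small function for scripts. A minimal sketch, with is_running as a hypothetical name; note that kill -0 also fails when you lack permission to signal the process, so a failure does not strictly prove the process is gone.

```shell
# Succeed if the given PID exists and we are allowed to signal it.
# kill -0 sends no signal; it only performs the existence and
# permission checks.
is_running() {
  kill -0 "$1" 2>/dev/null
}

# Usage: check the current shell.
if is_running $$ ; then echo "shell $$ is alive" ; fi
```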


Resource limits

See your per-process resource limits with ulimit -a and system-wide kernel parameters with sysctl -a; both commands take other options to change values. Within bash, type help ulimit. Set user limits permanently inside /etc/security/limits.conf.
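
For example, the open-file limit can be inspected and lowered for a single command without touching the parent shell. This is a sketch under the assumption that your hard limit is at least 64; the value 64 is arbitrary.

```shell
# Show the soft and hard limits on open file descriptors.
ulimit -S -n
ulimit -H -n

# Lower the soft limit inside a subshell only; the parent shell
# keeps its own limit. Prints 64.
( ulimit -S -n 64 ; ulimit -S -n )
```

Running a suspect program inside such a subshell is a quick way to see how it behaves when it runs out of descriptors.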


Shared library dependencies

ldd -v filename prints the shared libraries required by a program or another shared library, and shows where these libraries are found on your file system for the current LD_LIBRARY_PATH. Your system will also search directories listed in /etc/ld.so.conf; run ldconfig after editing that file to rebuild the cache the loader reads. Print the undefined symbols a file needs with nm -C -u -g filename.
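
When a program fails to start, the usual question is which libraries could not be resolved. A small filter over plain ldd output answers that; only_missing is a hypothetical name, and the sketch assumes the "not found" wording that glibc's ldd prints.

```shell
# Print only the names of libraries that ldd could not locate.
# Reads ldd output on stdin; unresolved entries look like
#   libfoo.so.1 => not found
only_missing() {
  awk '/not found/ { print $1 }'
}

# Usage:
#   ldd ./program | only_missing
```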


System activity

The best overview of recent system activity (usually 12 hours) is from sar -A.

Use strace to monitor all communication between a thread and the kernel. Use strace on any process that is running unusually slowly or is causing high system usage. Hung processes can often be diagnosed this way. Attach an strace to each thread of an executable foo.exe with a script like the following:
  P=`pidof foo.exe`
  for n in $P ; do
   echo strace -p$n 
   ( strace -v -p$n 2>&1 | sed -e "s/^/$n\| /" )&
  done
  sleep 30
  killall strace
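
Thirty seconds of strace output is a lot to read, so it helps to count which system calls dominate. The sketch below assumes the output of the script above was saved to a file, with the optional "PID| " prefix that its sed command adds; syscall_counts is a hypothetical name. strace -c -p PID gives a similar summary directly.

```shell
# Count system calls by name in saved strace output.
# Strips an optional "PID| " prefix, keeps the name before the first
# "(", and ignores lines that are not system calls (signals, exits).
syscall_counts() {
  sed -n -e 's/^[0-9]*| *//' \
         -e 's/^\([a-z_0-9][a-z_0-9]*\)(.*/\1/p' |
    sort | uniq -c | sort -rn
}

# Usage (trace.log is a placeholder file name):
#   syscall_counts < trace.log | head
```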

addr2line is a handy utility for converting program addresses into file names and line numbers.


System failures

Look for possible system failures with tac /var/log/messages | less or dmesg.


Files and I/O activity

Use fuser or lsof to find out what processes are using a file.

Use lsof to find out what files are being used by a process, as
$ lsof -p PID
$ lsof -c program

See which process is using a tcp port (say 8080) with
  $ fuser -n tcp 8080
or
  $ fuser 8080/tcp

To investigate I/O activity, type iostat and iostat -x to see which devices are being used. iostat is also from the sysstat package.

Check all the inode times and IDs with stat filename. Check your user's IDs with id -a [username].

Check local disk speed with hdparm -Tt /dev/hda, where the device is the one you see mounted with df. You may want to change some default options (see the current settings with hdparm /dev/hda) for improved performance. For an IDE disk, look at changes like hdparm -c3 -m16 /dev/hda. Test riskier changes in single-user mode in case the system hangs.


Memory

Try the following commands to see whether you are swapping memory or not. Look at the man pages first.
  $ cat /proc/meminfo
  $ vmstat 5
  $ free -s 5
  $ procinfo -n5 -f
  $ top

Watch the si and so columns of vmstat 5 as you run various processes. The presence of swapped memory is not necessarily bad. In fact, idle process memory (like KDE daemons) SHOULD be swapped out to disk, even when large amounts of memory are free. You just don't want to see it swapped in and out very often. Usually, the swapping does not occur until another process requires the space. Then the idle process may stay swapped out indefinitely.
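
Rather than staring at the columns, you can filter vmstat so that only intervals with swap traffic are printed. A minimal sketch, assuming the standard vmstat layout in which si and so are the seventh and eighth columns; swap_events is a hypothetical name.

```shell
# Print only the vmstat samples that show swap-in or swap-out traffic.
# Header lines are skipped by requiring a numeric first column.
swap_events() {
  awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { print "si=" $7, "so=" $8 }'
}

# Usage:
#   vmstat 5 | swap_events
```

The first vmstat sample is an average since boot, so an isolated line at the start is not a cause for concern.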


Memory leaks

To find memory leaks and errors in C code (using malloc and free), try valgrind from http://developer.kde.org/~sewardj/

Try mcheck.h with mtrace that comes with glibc.

Try linking with ccmalloc from http://cs.ecs.baylor.edu/~donahoo/tools/ccmalloc/ or Electric Fence from http://perens.com/FreeSoftware/


Shared memory, message queues, semaphores

ipcs will show the status of shared memory, message queues, and semaphores.


Network activity

Check for network collisions and dropped packets with netstat -i, netstat -s, and ifconfig. If the counts are high, your network may be saturated.

usernet will help you watch for surges in network activity.

Look for permanently lost packets on the disk server with
  $ head -2 /proc/net/snmp | cut -d' ' -f17
  ReasmFails
  2

If you can see this number increasing during network activity, then you are losing packets.
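
Counting fields by hand is fragile if the kernel ever adds a column, so the counter can also be looked up by name. A sketch under the assumption that /proc/net/snmp keeps its two-line layout for Ip: (a header line of names followed by a line of values); reasm_fails is a hypothetical name.

```shell
# Print the ReasmFails counter from /proc/net/snmp-style input.
# The first "Ip:" line names the columns, the second holds the values.
reasm_fails() {
  awk '$1 == "Ip:" {
    if (!col) { for (i = 1; i <= NF; i++) if ($i == "ReasmFails") col = i }
    else      { print $col; exit }
  }'
}

# Usage: sample the counter twice and compare.
#   reasm_fails < /proc/net/snmp ; sleep 10 ; reasm_fails < /proc/net/snmp
```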

You can reduce the number of lost packets by increasing the buffer size for fragmented packets to double the default:
  $ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
  $ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh 


Sockets

See what programs are using a given socket with netstat -pta or lsof -i. See how many sockets are in each state with
  $ netstat -tan | grep "^tcp" | awk '{print $6}' | sort | uniq -c | sort -n

See how many sockets are active with cat /proc/net/sockstat.

See what process IDs are using a specific TCP socket with
  $ fuser -n tcp 5006 | sed -e 's/.*: *//'
or 
  $ lsof -i tcp:5006


NFS activity

Look for high NFS failure rates with nfsstat -o rpc. If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. If packets are getting lost on the network, then it may help to lower the rsize and wsize mount parameters (read and write block sizes) in /etc/fstab. If the server is responding too slowly, then either replace the server or increase the timeo mount parameter. See my separate section on NFS.
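
The 3% threshold is easy to check mechanically. The sketch below assumes an NFS client kernel that exposes /proc/net/rpc/nfs with an "rpc" line whose first two numbers are total calls and retransmissions; retrans_percent is a hypothetical name.

```shell
# Print retransmissions as a percentage of RPC calls.
# Expects a line of the form: rpc <calls> <retrans> <authrefresh>
retrans_percent() {
  awk '$1 == "rpc" && $2 > 0 { printf "%.2f%%\n", 100 * $3 / $2 }'
}

# Usage on an NFS client:
#   retrans_percent < /proc/net/rpc/nfs
```

The counters are cumulative since boot, so compare two samples taken during the slow period rather than trusting a single reading.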


Handy commands

Type kill -l to get a list of signals and their numbers.

Bill Harlan, 2002-2005

