Here are notes on system level commands I find useful for diagnosing problems with a process.
See more sections on GNU/Linux, including NFS and routing problems.
gdb is well-documented, so I won't bother. Try ddd as a friendly wrapper for gdb.
Check out "Advanced Linux Programming" at http://www.advancedlinuxprogramming.com/ from http://www.codesourcery.com/
Find some great suggestions at http://people.redhat.com/alikins/system_tuning.html
ps
Many prefer interactive tools like top for watching running programs, but the command-line tool ps is actually more flexible and less intrusive, if you construct some adequate aliases. Here are two of my favorites. The first sorts by CPU usage, and the second by memory.
    function psc {
      ps --cols=1000 --sort='-%cpu,uid,pgid,ppid,pid' -e \
         -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
         sed 's/^/ /' | less
    }

    function psm {
      ps --cols=1000 --sort='-vsz,uid,pgid,ppid,pid' -e \
         -o user,pid,ppid,pgid,stime,stat,wchan,time,pcpu,pmem,vsz,rss,sz,args |
         sed 's/^/ /' | less
    }
Programs swapped to disk are shown in brackets without arguments.
The STAT column shows the process status:
    D  uninterruptible sleep (usually IO)
    R  runnable (on run queue)
    S  sleeping
    T  traced or stopped
    Z  a defunct ("zombie") process
    W  has no resident pages
    <  high-priority process
    N  low-priority task
    L  has pages locked into memory (for real-time and custom IO)
The WCHAN column shows the kernel function in which the process is sleeping. The address is mapped to a name according to the file /boot/System.map-`uname -r`
To see a history of CPU usage, install the sysstat package, which contains sar.
The sar -A output distinguishes user CPU from system CPU, which is consumed for operations like I/O, swapping, and error handling. You should investigate any unusually high system CPU usage. strace will confirm whether a specific program is making too many system calls.
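If sar is not installed, the same user/system split can be read straight from the kernel counters. The following is only a minimal sketch: the function name cpu_split is my own, and it assumes the usual /proc/stat layout in which the first line starts with "cpu" followed by user, nice, system, idle, iowait, irq, softirq, and steal counters. It reports an average since boot, not a history like sar.

```shell
# cpu_split [FILE]: print user and system CPU time as a percentage of
# all time counted since boot. FILE defaults to /proc/stat.
# Assumption: fields 2-9 of the "cpu" line are user, nice, system,
# idle, iowait, irq, softirq, steal (later guest fields are skipped).
cpu_split() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= NF && i <= 9; i++) total += $i
        if (total > 0)
            printf "user %.1f%% system %.1f%%\n", 100 * ($2 + $3) / total, 100 * $4 / total
        exit
    }' "${1:-/proc/stat}"
}
```

Call cpu_split with no argument to read the live counters. Nice time is counted as user time here, which is a choice of mine rather than anything sar prescribes.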
To see what CPUs you have, type cat /proc/cpuinfo
You may want to install and display xosview for a quick glance at system activity. If you use the GNOME desktop, try the System Monitor applet that comes preinstalled, or upgrade to gkrellm.
See the alias psc in the previous section.
To see the status of a running process, look at /proc/PID where PID is the process ID of the thread of interest. /proc/self will point to the current process. There you can see memory and CPU usage, file descriptors, environment, working directory, mapped shared objects, the command line, and a symbolic link to the executable.
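As a rough illustration of what lives under /proc/PID, here is a small sketch that prints a few of those items for one process. The function name proc_summary is hypothetical, and the status field names it greps for (VmSize, VmRSS, Threads) are the ones I expect on a 2.6 or later kernel; reading another user's process may need root.

```shell
# proc_summary PID: print a few facts about one process from /proc.
proc_summary() {
    pdir="/proc/$1"
    [ -d "$pdir" ] || { echo "no such process: $1" >&2; return 1; }
    echo "pid: $1"
    # cmdline separates arguments with NUL bytes; turn them into spaces.
    echo "cmdline: $(tr '\0' ' ' < "$pdir/cmdline")"
    # cwd and exe are symbolic links to the directory and the binary.
    echo "cwd: $(readlink "$pdir/cwd")"
    echo "exe: $(readlink "$pdir/exe")"
    # Each entry in fd/ is one open file descriptor.
    echo "open fds: $(ls "$pdir/fd" | wc -l)"
    grep -E '^(VmSize|VmRSS|Threads):' "$pdir/status"
}
```

For example, proc_summary $$ describes the current shell.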
The command procinfo -a is handy for certain information from the /proc filesystem.
See http://www.redhat.com/docs/manuals/linux/RHL-8.0-Manual/ref-guide/ch-proc.html for more info.
Each thread is managed by the kernel as a separate process with shared memory. You often need the process IDs of all threads.
If you are running only one instance of a particular executable, then you can get the PIDs with pidof program.
Otherwise you may want to identify some string that appears as a command-line argument only for your particular process. Write a script like the following to return a list of the PIDs.
    if [ $# -ne 1 ] ; then
      echo 'Usage: greppids "unique string"'
      exit 1
    fi
    UNIQUE_STRING="$1"
    MYUSERNAME=`whoami`
    # List all processes, keep only mine, keep those whose arguments
    # contain the string, and print the PIDs on one line.
    ps --cols=1000 -e -o pid,user,args |
      grep " $MYUSERNAME " |
      sed 's/$/ /' |
      grep "$UNIQUE_STRING" |
      grep -v grep |
      awk '{print $1}' |
      sort -n |
      paste -s -d" " -
Use kill -0 to see if a process ID or process group ID is active.
    $ if kill -0 $PID ; then echo "Yes, process $PID is running" ; fi
    $ if kill -0 -$PGID ; then echo "Yes, process group ID $PGID is running" ; fi
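The same test can be put in a loop to wait for a process to finish. This is only a sketch: the name wait_for_exit and the default timeout of 30 seconds are my own choices. Note that kill -0 also fails when you lack permission to signal the process, so this is only reliable for your own processes.

```shell
# wait_for_exit PID [TIMEOUT_SECONDS]: poll once a second until the
# process is gone. Returns 0 if it exited, 1 if the timeout ran out.
# Caveat: a process owned by another user looks "gone" to kill -0.
wait_for_exit() {
    pid=$1
    remaining=${2:-30}
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

A typical use would be wait_for_exit $PID 60 || echo "still running after a minute".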
See your resource limits with ulimit -a and sysctl -a, and change with other options. Within bash, type help ulimit.
Set user limits permanently inside /etc/security/limits.conf.
ldd -v filename will print shared libraries required by a program or another shared library, and will show where these libraries are found on your file system for the current LD_LIBRARY_PATH. Your system will also search directories listed in /etc/ld.so.conf, which is reread after typing ldconfig. Print out symbols with nm -C -u -g filename.
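When a program refuses to start, the interesting lines of ldd output are usually the ones the loader could not resolve. A minimal sketch of a filter for those lines follows; the function name missing_libs is hypothetical, and it assumes ldd marks unresolved libraries with the text "=> not found".

```shell
# missing_libs: read ldd output on standard input and print the name of
# every library the loader could not resolve.
# Assumption: unresolved entries look like "libfoo.so.1 => not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Use it as ldd filename | missing_libs, then adjust LD_LIBRARY_PATH or /etc/ld.so.conf until nothing is printed.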
The best overview of recent system activity (usually 12 hours) is from sar -A.
Use strace to monitor all communication between a thread and the kernel. Use strace on any process that is running unusually slowly or is causing high system usage. Hung processes can often be diagnosed this way. Attach an strace to each thread of an executable foo.exe with a script like the following:
    P=`pidof foo.exe`
    for n in $P ; do
      echo strace -p$n
      ( strace -v -p$n 2>&1 | sed -e "s/^/$n\| /" )&
    done
    sleep 30
    killall strace
addr2line is a handy utility for converting program addresses into file names and line numbers.
Look for possible system failures with tac /var/log/messages | less or dmesg.
Use fuser or lsof to find out what processes are using a file.
Use lsof to find out what files are being used by a process, as
    $ lsof -p PID
    $ lsof -c program
See which process is using a tcp port (say 8080) with
    $ fuser -n tcp 8080
or
    $ fuser 8080/tcp
To investigate I/O activity, type iostat and iostat -x to see which devices are being used. iostat is also from the sysstat package.
Check all the inode times and IDs with stat filename. Check your user's IDs with id -a [username].
Check local disk speed with hdparm -Tt /dev/hda where the device is the one you see mounted with df. You may want to change some default options (hdparm /dev/hda) for improved performance. For an IDE disk, look at changes like hdparm -c3 -m16 /dev/hda. Test riskier changes in single user mode in case you hang.
Try the following commands to see whether you are swapping memory or not. Look at the man pages first.
    $ cat /proc/meminfo
    $ vmstat 5
    $ free -s 5
    $ procinfo -n5 -f
    $ top
Watch the si and so columns of vmstat 5 as you run various processes.
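Rather than staring at the columns, you can filter vmstat output for the samples that show swapping. This is a sketch under assumptions: the function name swap_activity is my own, and it finds the si and so columns by looking for the header row that begins with "r", which is how vmstat from procps lays it out on the systems I have seen.

```shell
# swap_activity: read vmstat output on standard input and print every
# sample in which pages were swapped in (si) or out (so).
swap_activity() {
    awk '
        # Header row: remember which columns hold si and so.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data row: starts with a number once the columns are known.
        si && so && $1 ~ /^[0-9]+$/ && ($si > 0 || $so > 0) {
            print "swapping: si=" $si " so=" $so
        }
    '
}
```

Pipe a run through it with vmstat 5 | swap_activity. Remember that the first sample vmstat prints is an average since boot, so a nonzero value there does not mean you are swapping now.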
The presence of swapped memory is not
necessarily bad. In fact, idle process
memory (like kde daemons) SHOULD be swapped
out to disk, even when large amounts of
memory are free. You just don't want to see
them swap very often. Usually, the swapping
does not occur until another process requires
the space. Then the idle process may stay
swapped out indefinitely.
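To see which processes are the ones sitting in swap, newer kernels report a VmSwap line in /proc/PID/status. That field is an assumption on my part for your system: it appeared in kernels later than the ones these notes were written for, so older machines will print nothing. The function name swapped_kb is hypothetical.

```shell
# swapped_kb [STATUS_FILE]: print the VmSwap value (in kB) from a file
# in the format of /proc/PID/status. Defaults to the current shell.
# Assumption: the kernel is new enough to report a VmSwap line.
swapped_kb() {
    awk '$1 == "VmSwap:" { print $2 }' "${1:-/proc/$$/status}"
}

# Possible use, listing the swapped size of every process:
#   for f in /proc/[0-9]*/status ; do
#       echo "$(swapped_kb "$f") kB $f"
#   done | sort -n
```

Idle daemons with a large value here are exactly the harmless case described above.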
To find memory leaks and errors in C code (using malloc and free), try valgrind from http://developer.kde.org/~sewardj/
Try mcheck.h with mtrace that comes with glibc.
Try linking ccmalloc from http://cs.ecs.baylor.edu/~donahoo/tools/ccmalloc/ or Electric Fence from http://perens.com/FreeSoftware/
ipcs will show the status of shared memory, message queues, and semaphores.
Check for network collisions and dropped packets with netstat -i, netstat -s, and ifconfig. Your network may be saturated.
usernet will help you watch for surges in network activity.
Look for permanently lost packets on the disk server with
    $ head -2 /proc/net/snmp | cut -d' ' -f17
    ReasmFails
    2
If you can see this number increasing during network activity, then you are losing packets.
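Picking the counter by field number breaks if a kernel adds or reorders columns, so it may be safer to look it up by name. The following is a sketch only: the function name snmp_field is hypothetical, and it assumes /proc/net/snmp keeps its usual layout of a header line and a value line per protocol, each starting with the protocol name and a colon.

```shell
# snmp_field PROTO FIELD [FILE]: print one counter from a file in the
# format of /proc/net/snmp (the default), located by header name.
snmp_field() {
    awk -v proto="$1:" -v field="$2" '
        # First line for the protocol is the header: find the column.
        $1 == proto && !seen {
            for (i = 2; i <= NF; i++) if ($i == field) col = i
            seen = 1
            next
        }
        # Second line for the protocol holds the values.
        $1 == proto && seen {
            if (col) print $col
            exit
        }
    ' "${3:-/proc/net/snmp}"
}
```

Run snmp_field Ip ReasmFails a few times during network activity and compare the numbers.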
You can reduce the number of lost packets by increasing the buffer size for fragmented packets to double the default:
    $ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
    $ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
See what programs are using a given socket with netstat -pta or lsof -i. See how many sockets are in each state with
    $ netstat -tan | grep "^tcp" | cut -c 68- | sort | uniq -c | sort -n
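The cut -c 68- in that pipeline depends on the exact column widths of your netstat. A variant that picks the state by field instead might look like the sketch below. The function name tcp_states is my own, and it assumes the state is the sixth whitespace-separated field of each tcp line, as it is in the netstat -tan output I am used to.

```shell
# tcp_states: read "netstat -tan" output on standard input and count
# the sockets in each state, least common first.
# Assumption: the state is the sixth whitespace-separated field.
tcp_states() {
    awk '/^tcp/ { print $6 }' | sort | uniq -c | sort -n
}
```

Use it as netstat -tan | tcp_states.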
See how many sockets are active with cat /proc/net/sockstat.
See what process ids are using a specific TCP socket with
    $ fuser -n tcp 5006 | sed -e 's/.*: *//'
or
    $ lsof -i tcp:5006
Look for high NFS failure rates with nfsstat -o rpc. If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. If packets are getting lost on the network then it may help to lower the rsize and wsize mount parameters (read and write block sizes) in /etc/fstab. If the server is responding too slowly, then either replace the server or increase the timeo mount parameter. See my separate section on NFS.
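The 3% check can be done with a little arithmetic on the nfsstat output. This is a sketch that assumes the client RPC section has a header row beginning "calls retrans" followed by a row of numbers, which is the layout I expect from nfsstat but which may differ between versions. The function name retrans_percent is hypothetical.

```shell
# retrans_percent: read nfsstat RPC statistics on standard input and
# print client retransmissions as a percentage of calls.
# Assumption: a header row containing "calls" and "retrans" is followed
# directly by the row of counters.
retrans_percent() {
    awk '
        $1 == "calls" {
            r = 0
            for (i = 1; i <= NF; i++) if ($i == "retrans") r = i
            # Only the client header has a retrans column; skip others.
            if (r && (getline) > 0 && $1 > 0) {
                printf "%.1f\n", 100 * $r / $1
                exit
            }
        }
    '
}
```

Run nfsstat -o rpc | retrans_percent and compare the result against 3.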
Type kill -l to get a list of signals and their numbers.
Bill Harlan, 2002-2005