Here are notes I've found useful for configuring reliable shared disks on a Linux cluster.
I've extracted most of this information from "Linux NFS and Automounter Administration" by Erez Zadok, published by Sybex.
Increasingly I see a single large RAID disk server being clobbered by 16 or 32 Linux clients at a time. Here are some parameters to check.
First make sure you have an up-to-date copy of NFS installed with
$ rpm -q nfs-utils
or
$ rpm -q -f /usr/sbin/rpc.nfsd
Check dependencies (like portmap) with rpm -q -R nfs-utils and check their versions as well. See what files the package installs with rpm -q -l nfs-utils.
See that your services are running with rpcinfo -p [hostname]. On a client machine look for portmapper, nlockmgr, and possibly amd or autofs. A server will also run mountd and nfs.
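As a quick sanity check you can grep the rpcinfo output for the services you expect. This is a minimal sketch; adjust the list for client versus server machines.

# A client needs portmapper and nlockmgr; a server also needs mountd and nfs.
for svc in portmapper nlockmgr mountd nfs; do
    rpcinfo -p localhost | grep -qw "$svc" || echo "missing: $svc"
done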
First exercise your disk with your own code or with a simple write operation like
$ time dd if=/dev/zero of=testfile bs=4k count=8182
8182+0 records in
8182+0 records out

real    0m8.829s
user    0m0.000s
sys     0m0.160s
Writing files should be enough to test network saturation.
When profiling reads instead of writes, call umount and mount to flush caches, or the read will seem instantaneous.
$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
Check for failures on a client machine with
$ nfsstat -c
or
$ nfsstat -o rpc
If more than 3% of calls are retransmitted, then there are problems with the network or NFS server.
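To get that percentage directly, something like this should work, assuming the usual nfsstat layout in which a "calls ... retrans" header line is followed by a line of values; treat it as a rough sketch rather than a robust parser.

$ nfsstat -c | awk '/calls/ && /retrans/ { getline; if ($1 > 0) printf "retransmitted: %.2f%%\n", 100*$2/$1 }'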
Look for NFS failures on a shared disk server with
$ nfsstat -s
or
$ nfsstat -o rpc
You should see very few "badcalls" out of the total number of "calls"; zero is not unreasonable.
NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
If you can see this number increasing during nfs activity, then you are losing packets.
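Rather than depending on the counter's column position, you can look it up by name and poll it. Here is one sketch that prints the value every few seconds.

$ while sleep 5; do
>   awk '/ReasmFails/ { for (i = 1; i <= NF; i++) if ($i == "ReasmFails") c = i; getline; print "ReasmFails:", $c; exit }' /proc/net/snmp
> done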
You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets.
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
This is about double the default.
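Writing to /proc does not survive a reboot. If your distribution reads /etc/sysctl.conf at boot (Red Hat-style systems do), you can make the settings permanent there and apply them with sysctl -p.

# /etc/sysctl.conf
net.ipv4.ipfrag_low_thresh = 524288
net.ipv4.ipfrag_high_thresh = 524288

$ sysctl -p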
See if your server is receiving too many overlapping requests with
$ grep th /proc/net/rpc/nfsd
th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150
The first number is the number of threads available for servicing requests, and the second is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a given fraction of the threads has been busy, starting with less than 10% of the threads and ending with more than 90%. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.
Increase the number of threads used by the server to 16 by changing RPCNFSDCOUNT=16 in /etc/rc.d/init.d/nfs.
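On a Red Hat-style system the change and restart look roughly like this; the init script path and restart command may differ on your distribution, and restarting nfsd will briefly interrupt clients.

# in /etc/rc.d/init.d/nfs
RPCNFSDCOUNT=16

$ /etc/rc.d/init.d/nfs restart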
If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.
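For that last point, a small retry loop in the application is often enough to ride out attribute-caching delays. Here is a hedged shell sketch; the file name is hypothetical.

# Wait up to about 10 seconds for a file created on another node to appear.
wait_for_file() {
    for attempt in 1 2 3 4 5; do
        [ -e "$1" ] && return 0
        sleep 2
    done
    return 1
}

wait_for_file /mnt/shared/results.dat || echo "results.dat never appeared" >&2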
The server side of NFS allows no real configuration for performance or reliability. NFS servers are stateless so you don't have to worry much about cached state, except for delayed asynchronous writes.
Default asynchronous writes are not very risky unless you expect your disk servers to crash often. sync guarantees that all writes are completed when the client thinks they are. All client machines should still see consistent states with async because all access the same server. Client caching is a much greater risk. I recommend the default async on the server side.
If you change server export properties in /etc/exports, re-export with exportfs -rav.
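For reference, a minimal export entry might look like the following; the directory and host pattern are hypothetical, not a recommendation.

# /etc/exports
/export/data   node*.cluster.example.com(rw,async)

$ exportfs -rav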
You can see what parameters you are using with cat /proc/mounts. Edit /etc/fstab to change properties.
(Hard mounts are simpler for a data processing platform, so I have little to say about auto-mounts. Do not run both amd and autofs; check with chkconfig --list. You may find it useful to add dismount_interval=1800 in the global section of /etc/amd.conf for a long 30-minute wait to keep automounted directories around.)
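If you do run amd, that setting sits in the global section of /etc/amd.conf, roughly as sketched below in the am-utils ini-style format.

# /etc/amd.conf
[ global ]
dismount_interval = 1800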
When you change mount attributes, remount with mount -a.
Here are client properties that you may want to change from their default values.
Usually you want the rw flag to allow read-write access; it is off by default.
Allow users to interrupt hung processes with the intr flag (off by default). This might sound risky, but in fact the property is consistent with the original NFS design and is well supported. Unnecessary hangs will be more destabilizing.
If your code needs file locking, then by all means leave locking turned on. But if you are certain that locking is not required (as in my current project), then turn it off with nolock. Locking can create unnecessary opportunities for timeouts.
For simple clusters, avoid the complexity of amd if you can, and use hard mounts (the hard flag).
The NFS protocol version appears as v2 or v3 in /proc/mounts. The version supposedly defaults to 2, but version 3 is faster and supports big files. I get v3 by default much of the time.
Almost everyone runs NFS over udp for performance. But udp is an unreliable protocol and can perform worse than tcp on a saturated host or network. If NFS errors occur too often, then you may want to try tcp instead of udp.
If packets are getting lost on the network, then it may help to lower the rsize and wsize mount parameters (read and write block sizes) in /etc/fstab.
For reliability, prefer smaller rsize and wsize values in /etc/fstab. I recommend rsize=1024,wsize=1024 instead of the defaults of 4096.
If the server is responding too slowly, then either replace the server or increase the timeo or retrans parameters.
For more reliability when the machine stays overloaded, set retrans=10 to retry sending RPC commands 10 times instead of the default 3 times.
The default timeout between retries is timeo=7 (seven tenths of a second). Increase it to timeo=20 (two full seconds) to avoid hammering an already overloaded server.
acregmax and acdirmax are the maximum numbers of seconds to cache attributes for files and directories, respectively. Both default to 60 seconds. A value of 0 disables caching, and noac disables all caching. cto (on by default) guarantees that files will be rechecked after closing and reopening.
Minimum numbers of seconds are set with acregmin and acdirmin. acdirmin defaults to 30 seconds and acregmin to 3 seconds.
I recommend setting acdirmin=0,acdirmax=0 to disable caching of directory information, and reducing acregmax=10, because we have had so many problems with directories and files not appearing to exist shortly after they are created.
Performance should improve if you add the noatime flag. Every time a client reads from a file, the server must update the inode's time stamp for the most recent access. Most applications don't care about the most recent access time, so you can set noatime with impunity. Nevertheless, this flag is rarely set on a general-purpose machine, and if you are more concerned about reliability, then use the default atime.
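Pulling the client-side recommendations in this section together, an /etc/fstab entry might look like the line below. The server name and mount point are hypothetical, and you can drop noatime if you prefer the default atime.

server:/export/data  /mnt/data  nfs  rw,hard,intr,nolock,nfsvers=3,rsize=1024,wsize=1024,timeo=20,retrans=10,acdirmin=0,acdirmax=0,acregmax=10,noatime  0 0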
It is surprising how often cluster nodes are allowed to run with totally inconsistent clocks. Caching should not be affected, but file properties will be a mess.
If you are on a network with a time server, add the hostnames of time servers to /etc/ntp/step-tickers or /etc/ntp.conf, and enable the ntpd service with chkconfig ntpd on.
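On a Red Hat-style system the steps are roughly these; the time server hostname is hypothetical.

# /etc/ntp/step-tickers
ntp1.example.com

$ chkconfig ntpd on
$ service ntpd start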
Bill Harlan, 2002