SystemImager® v4.0.2 Manual: Troubleshooting
Goran Pocian reported an instance of unacceptable si_updateclient performance that went away when he upgraded from kernel 2.2.17 to 2.2.18.
He also noted that if you mount an NFS filesystem after executing si_prepareclient, si_getimage will retrieve its contents. Because this can greatly increase network load, it can also degrade performance.
Brian Finley reported other possible causes:
Every once in a while, someone reports some mysterious hanging or transfer interruption issue related to rsync. I had a chance to speak with Andrew Tridgell in person to discuss these issues.
We found two known issues that could be the source of these symptoms. One is a known kernel issue, and one is an rsync issue. The kernel issue is supposedly resolved in 2.4.x series kernels (SystemImager has not yet been "officially" tested with 2.4.x kernels), and I believe it may not be present in all 2.2.x series kernels.
The rsync bug will be fixed in the rsync 2.4.7 release (to happen "Real Soon Now (TM)"). The rsync bug is caused by excessive numbers of errors filling the error queue, which causes a race condition. However, until rsync 2.4.7 has been out for some time, I will still recommend using v2.4.6 unless you specifically experience one of these issues.
Here's a hack that seems to work for Chris Black. Add "--bwlimit=10000" right after "rsync" in each rsync command in the <image>.master script. (The --bwlimit value is in kilobytes per second.)
Change:
    rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/
To:
    rsync --bwlimit=10000 -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/

Here are some tips on diagnosing the problem:
If you get an error message in /var/log/messages that looks like:
Jan 23 08:49:42 mybox rsyncd[19347]: transfer interrupted (code 30) at io.c(65)
You can look up the code number in the errcode.h file, which you can find in the rsync source code.
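As a rough illustration of that lookup, the Python sketch below pulls the code number out of a log line like the one above and maps it to a description. The table is an assumption: it is a subset of the exit values listed in the rsync man page, and the errcode.h that ships with your rsync version remains the authoritative source. The function name explain is hypothetical.

```python
import re

# Assumed subset of rsync exit codes, taken from the rsync man page.
# errcode.h in the source of your rsync version is authoritative.
RSYNC_CODES = {
    1: "syntax or usage error",
    2: "protocol incompatibility",
    3: "errors selecting input/output files, dirs",
    5: "error starting client-server protocol",
    10: "error in socket I/O",
    11: "error in file I/O",
    12: "error in rsync protocol data stream",
    20: "received SIGUSR1 or SIGINT",
    23: "partial transfer due to error",
    30: "timeout in data send/receive",
}

# Matches the "(code NN) at file.c(line)" tail of an rsync log message.
CODE_RE = re.compile(r"\(code (\d+)\) at (\S+)")


def explain(log_line):
    """Return (code, source location, meaning) for an rsync log line, or None."""
    match = CODE_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    meaning = RSYNC_CODES.get(code, "unknown; check errcode.h")
    return code, match.group(2), meaning


line = ("Jan 23 08:49:42 mybox rsyncd[19347]: "
        "transfer interrupted (code 30) at io.c(65)")
print(explain(line))
```

For the sample line this reports code 30, which the man page describes as a timeout in data send/receive.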
To diagnose the kernel bug: Run netstat -tn. Here is some sample output (from a properly working system):
$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 192.168.1.149:1094      216.62.20.226:80        CLOSE_WAIT
tcp        1      0 192.168.1.149:1090      216.62.20.226:80        CLOSE_WAIT
tcp        1      0 192.168.1.149:1089      216.62.20.226:80        CLOSE_WAIT
tcp        0      0 127.0.0.1:16001         127.0.0.1:1029          ESTABLISHED
tcp        0      0 127.0.0.1:1029          127.0.0.1:16001         ESTABLISHED
tcp        0      0 127.0.0.1:16001         127.0.0.1:1028          ESTABLISHED
tcp        0      0 127.0.0.1:1028          127.0.0.1:16001         ESTABLISHED

The symptoms are:
Machine A has data in its Send-Q
Machine B has no data in its Recv-Q
The data in machine A's Send-Q is not being reduced
What's happening is:
One or both kernels aren't honoring the other's send/receive window settings (these are dynamically calculated)
The result is the kernel(s) aren't getting data from machine A to machine B
rsync, therefore, isn't getting data on the receive side
The process appears to hang.
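One way to check for the first and third symptoms on machine A is to take two netstat -tn snapshots a few seconds apart and compare the Send-Q column. The Python sketch below does that comparison. It is a minimal sketch, not a SystemImager tool: the function names are hypothetical, the sample snapshots and the 192.168.1.1:873 peer address are invented for illustration, and it assumes the six-column output layout shown above. You would still need to confirm separately that machine B's Recv-Q is empty.

```python
def parse_netstat(text):
    """Map (local, foreign) -> (recv_q, send_q) for each tcp row of `netstat -tn`."""
    queues = {}
    for row in text.splitlines():
        fields = row.split()
        # Data rows look like: tcp Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            queues[(fields[3], fields[4])] = (int(fields[1]), int(fields[2]))
    return queues


def stuck_senders(before, after):
    """Connections whose Send-Q held data at first and has not shrunk since."""
    stuck = []
    for conn, (_, send_q) in before.items():
        if conn in after and send_q > 0 and after[conn][1] >= send_q:
            stuck.append(conn)
    return stuck


# Invented sample snapshots, taken a few seconds apart on machine A.
first = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 20272 192.168.1.149:1094 192.168.1.1:873 ESTABLISHED
tcp 0 0 127.0.0.1:16001 127.0.0.1:1029 ESTABLISHED
"""
second = first  # nothing drained between the two snapshots

print(stuck_senders(parse_netstat(first), parse_netstat(second)))
```

A connection that shows up in the result only once may simply be busy; one that keeps appearing across repeated snapshots matches the symptoms described above.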
Details about the rsync bug:
What happens:
A large number of errors clogs the error pipe between the receiver and generator
All progress stops.
Again, the process appears to hang.
I hope this information helps...
A possible solution, suggested by Robert Berkowitz, is to add --bwlimit=10000 to the rsync options in the rsync initscript.
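If an <image>.master script contains many rsync commands, the --bwlimit edit described earlier can be applied mechanically. The Python sketch below is one way to do it, under the assumption that each rsync command starts at the beginning of a line (optionally indented); commands embedded elsewhere on a line would be missed. The function name add_bwlimit is hypothetical. Keep a backup of the script before rewriting it.

```python
import re

# An rsync command at the start of a line that does not already
# have --bwlimit as its first option.
RSYNC_CMD = re.compile(r"^(\s*)rsync(?!\s+--bwlimit)(\s+)")


def add_bwlimit(script_text, kbps=10000):
    """Insert --bwlimit=<kbps> right after 'rsync' on each matching line."""
    def insert_option(match):
        return f"{match.group(1)}rsync --bwlimit={kbps}{match.group(2)}"

    lines = script_text.splitlines(keepends=True)
    return "".join(RSYNC_CMD.sub(insert_option, line, count=1) for line in lines)


original = "rsync -av --numeric-ids $IMAGESERVER::web_server_image_v1/ /a/\n"
patched = add_bwlimit(original)
print(patched, end="")
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to re-apply.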