Folding@Home Common Errors
The following is taken from the Stanford Folding Community page "Folding@Home Common Errors".
- EARLY_UNIT_END
- FILE_IO_ERROR
- CLIENT_DIED
- UNKNOWN_ERROR
- Client-Core Communications Error
- BAD_FRAME_CHECKSUM
- Server reports digital signature does not match
- SPECIAL_EXIT
EARLY_UNIT_END:
Quite possibly the most common error found today. EARLY_UNIT_END is
usually caused by one of two things: a bad WU or an unstable system.
If you get one isolated EARLY_UNIT_END, it's most likely just a WU that
is bad. It's not a problem, and you shouldn't worry about it. It's
usually caused when atoms in the WU reach impossible positions and
Gromacs can't continue.
Multiple EARLY_UNIT_END errors are a sign of a severe problem with your
machine. Machines that are clocked too high, have heat problems, or
possibly have SSE forced on (AMD only) will generate this error. You
should stop F@H if you get more than one EARLY_UNIT_END per week per
machine, and certainly if you get two in a row. Make sure your machine
is up to spec, with reasonable temperatures, reasonable clocking (CPU,
FSB, and memory must all be stable), and a good, powerful PSU.
EARLY_UNIT_END is most often caused by problems with a user's machine,
and an abnormal number of them certainly merits examining your system.
This error may be accompanied by a LINCS WARNING message that gives more specific technical details on exactly what happened.
NOTE: See the description about "-forceasm" (3.x) or "-forceSSE" (4.x)
causing SPECIAL_EXIT on certain AMD based systems. If you are running
an AMD Athlon XP with the Thoroughbred or Barton cores, you should
remove the "-forceasm" or "-forceSSE" switch, most likely fixing your
problems.
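The rule of thumb above (stop if you get more than one EARLY_UNIT_END
per week, or two in a row) can be sketched as a small check. This is
only an illustration: the function name, the "OK" status string, and
the idea of keeping a dated list of work unit results are assumptions
made for the example, not something the client provides.

```python
from datetime import date, timedelta


def should_stop_folding(results):
    """Apply the rule of thumb to one machine's work unit history.

    results: chronological list of (date, status) tuples, one per work
    unit. Any status other than "EARLY_UNIT_END" counts as a normal
    result ("OK" is just a placeholder used in this sketch).
    """
    eue_days = []
    previous_status = None
    for day, status in results:
        if status == "EARLY_UNIT_END":
            # Two in a row: stop regardless of how far apart they are.
            if previous_status == "EARLY_UNIT_END":
                return True
            # More than one inside any seven-day window: stop.
            if any(day - earlier < timedelta(days=7) for earlier in eue_days):
                return True
            eue_days.append(day)
        previous_status = status
    return False


# Hypothetical history: two errors three days apart, one good unit between.
history = [
    (date(2004, 1, 1), "EARLY_UNIT_END"),
    (date(2004, 1, 2), "OK"),
    (date(2004, 1, 4), "EARLY_UNIT_END"),
]
print(should_stop_folding(history))  # True
```

A single isolated error returns False, which matches the advice that
one bad WU is nothing to worry about.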
FILE_IO_ERROR:
An error that occurs when disk operations go bad. This is a fairly
general error, having many sub-types. It has plummeted in frequency
since the release of Gromacs Core 1.46. Now, this error usually happens
when a hardware error occurs: something like "Write 0010, read back
0011". If you experience this error, make sure your hard drives are OK:
run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make
sure you're using good IDE cables, and make sure the drive isn't dying.
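To illustrate what a "write X, read back Y" failure means, here is a
minimal write/read-back check. It is a sketch only, not how the core
does its own checking, and the function name is made up for the
example. Note that the operating system may serve the read from its
cache, so a pass here does not replace ScanDisk, CHKDSK, or fsck.

```python
import os
import tempfile


def write_read_back_ok(directory, size=1024 * 1024):
    """Write random bytes to a temp file in `directory` and compare
    them with what is read back. Returns True if they match."""
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            # Ask the OS to push the data out to the device.
            os.fsync(handle.fileno())
        with open(path, "rb") as handle:
            return handle.read() == payload
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Check the system temp directory; point this at your F@H folder instead.
    print(write_read_back_ok(tempfile.gettempdir()))
```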
FILE_IO_ERROR has also been reported when two Console clients are
started on the same work unit. This can happen if you accidentally
start the same client twice on a dual-CPU machine, instead of starting
each of the two clients once.
Thanks to sortofageek for contributing the part about two clients causing this error.
CLIENT_DIED:
This happens when, simply enough, the client dies. The core is still
running, and can't find the client, so it shuts down. This is usually
related to overclocking and/or overly aggressive memory timings. Back
down on these and this error should vanish.
UNKNOWN_ERROR:
A now rare Gromacs error that usually occurs if there's a corrupt WU
being processed. It is no longer common and any instances should
probably be reported (post a log, etc.). You may also want to check
your hardware if you've had past errors.
Client-Core Communications Error:
There are several different kinds of this error.
ERROR 0xX is basically another form of an unknown error. It can be
found on Linux if you're having Glibc version problems. See the Linux
forum for more info. Overclocking is another possible cause. ERROR 0x1
has occurred with Gromacs units. Its cause is still unknown. This error
has not been replicated by the Pande group. There are known solutions
to 0xX if it's caused by overclocking (stop!) or Glibc (see Linux
forum). Otherwise, there's no known fix. Post the relevant sections of
FAHlog.txt (including the version and type of client) along with your
OS version, and continue folding.
ERROR 0x1 has been reported to occur if the core is killed while the
client is processing, though this is a fairly rare occurrence if you
are not using scripts that kill the core.
[15:07:06] CoreStatus = 1 (1)
[15:07:06] Client-core communications error: ERROR 0x1
[15:07:06] Deleting current work unit & continuing...
[15:07:26] Trying to send all finished work units
[15:07:26] + No unsent completed units remaining.
[15:07:26] - Preparing to get new work unit...
Thanks to gnewbury for information on this form of ERROR 0x1.
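If you want to pull these errors out of a long log before posting, a
short script along these lines may help. It is a sketch that assumes
the log lines look exactly like the excerpt above; the function name is
made up for the example.

```python
import re

# Matches lines such as:
# [15:07:06] Client-core communications error: ERROR 0x1
ERROR_LINE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Client-core communications error: "
    r"ERROR (0x[0-9A-Fa-f]+)"
)


def find_core_errors(log_lines):
    """Return (timestamp, error_code) pairs for each matching line."""
    hits = []
    for line in log_lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


sample = [
    "[15:07:06] CoreStatus = 1 (1)",
    "[15:07:06] Client-core communications error: ERROR 0x1",
    "[15:07:06] Deleting current work unit & continuing...",
]
print(find_core_errors(sample))  # [('15:07:06', '0x1')]
```

To run it on a real log, pass the open file object for FAHlog.txt
instead of the sample list.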
ERROR 0xC0000005 means there was a memory access violation. This is
a standard Windows error code for any program trying to access memory
it does not control. This can be a rare hardware error and is not cause
for concern. Old versions of clients/cores can also cause this problem.
ERROR 0x________, where the blank is an eight-digit hexadecimal code,
is usually a general Windows error. Look up the specific Windows error
code (if you need help, just post a thread) and you will most likely
find the cause.
Thanks to Bruce and Guha for clarifying 0xX errors.
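As a rough aid for looking up an eight-digit code, the sketch below
splits it into the fields used by Windows NTSTATUS values (severity in
the top two bits, facility in bits 16 to 27, code in the low 16 bits).
It assumes the value really is an NTSTATUS code, which is not
guaranteed for every error the client prints. The lookup table holds
only the one code discussed above, and the function name is made up
for the example.

```python
# Only the code mentioned in this FAQ; extend it from Microsoft's
# documentation as needed.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (memory access violation)",
}


def describe_status(text):
    """Split a hex string such as '0xC0000005' into NTSTATUS fields.

    Assumption: the value follows the NTSTATUS bit layout.
    """
    value = int(text, 16)
    return {
        "severity": (value >> 30) & 0x3,     # 3 means "error"
        "facility": (value >> 16) & 0xFFF,   # 0 is the default facility
        "code": value & 0xFFFF,
        "name": KNOWN_CODES.get(value, "unknown, look it up"),
    }


info = describe_status("0xC0000005")
print(info["severity"], info["facility"], info["code"])  # 3 0 5
print(info["name"])
```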
BAD_FRAME_CHECKSUM:
You'll see a block in your log that looks something like this:
[hh:mm:ss] Header on frame 220 differs from expected header
[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220
[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219
Note that the two lines of hexadecimal numerals are the same; only the
frame numbers at the end differ. This strange error only occurs with
Tinker units. The only known cause is when two or more clients are
started at once and are working in the same directory, but there may be
other causes. This error often, bizarrely, occurs on an early frame but
is not detected until the unit's end.
BAD_FRAME_CHECKSUM, similar to one type of Gromacs FILE_IO_ERROR, can
also mean that a hardware error occurred where there was a slight
discrepancy between what was read and what was expected: something like
writing 101010 and reading back 110110. Again, this is commonly not
detected until the unit finishes.
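To check a log block like the Got/Expected example above, the sketch
below pulls out both headers and reports whether the hexadecimal part
matches and how far apart the frame numbers are. It assumes the lines
are formatted exactly as in that excerpt; the function name is made up
for the example.

```python
import re

# Matches "Got: <hex-with-dashes>: <frame>" and the "Expected:" twin.
HEADER = re.compile(r"(Got|Expected):\s+([0-9A-Fa-f-]+):\s+(\d+)")


def compare_headers(log_lines):
    """Compare the 'Got' and 'Expected' frame headers in a log block."""
    found = {}
    for line in log_lines:
        match = HEADER.search(line)
        if match:
            label, unit_id, frame = match.groups()
            found[label] = (unit_id, int(frame))
    got, expected = found["Got"], found["Expected"]
    return {
        "same_id": got[0] == expected[0],
        "frame_offset": got[1] - expected[1],
    }


block = [
    "[hh:mm:ss] Header on frame 220 differs from expected header",
    "[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220",
    "[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219",
]
print(compare_headers(block))  # {'same_id': True, 'frame_offset': 1}
```

A matching ID with a frame offset of 1 is the pattern shown in the
example; a differing ID would point to something else entirely.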
Server reports digital signature does not match:
Some of the newer servers don't seem to like the older versions of the client. Upgrade to the latest client.
SPECIAL_EXIT:
This severe error means that something unknown happened inside the
Gromacs core. The only known cause is when "-forceasm" (3.x) or
"-forceSSE" (4.x) is applied to an AMD system that is not 100% stable
with SSE. CPUs that have had problems include the Thoroughbred B,
Barton, and Opteron cored processors. In this case it should be dealt
with as an EARLY_UNIT_END error (see above). Removing "-forceasm" or
"-forceSSE" will almost certainly fix the problem. SSE related errors
are now fairly rare, compared to a few months ago.