solving_lock_up_problems_memory_testing

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



Subject: Re: Locking up
From: "Mike A. Harris" <mharris@meteng.on.ca>
Date: Sat, 8 Jul 2000 15:40:30 -0400 (EDT)

On Fri, 7 Jul 2000, Mark Shewmaker wrote:

>> I have ordered replacement memory for the machine, but I
>> am curious as to any other possible causes.
>
>I know you were asking for other suggestions, but there's a convenient,
>(except it requires a reboot) way to test your memory under linux.
>You simply take any large tarfile of text files, (such as the linux
>kernel), extract it into one directory, and then extract it again,
>multiple times, into a second directory, doing recursive diffs after
>every extract.  If there are differences between the two directories,
>then you've probably got a memory problem.

This may find a memory problem, but it is more likely to miss
one.  The reason is that the files get cached in RAM, so
subsequent accesses to the same files are mostly served from
that RAM cache (not to be confused with your L1 and L2 caches).
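
To illustrate the caching point, here is a minimal sketch.  The
function name repeat_cksum and its arguments are made up for this
example.  It checksums one file several times.  Once the file
sits in the RAM cache, every pass after the first most likely
reads the same cached pages, so agreement between passes says
little about the rest of your RAM.

```shell
# Sketch only: repeat_cksum is a hypothetical helper, not an existing tool.
# Usage: repeat_cksum FILE PASSES
repeat_cksum() {
    file=$1
    passes=$2
    # First pass: this read may come from disk and populates the RAM cache.
    first=$(cksum < "$file") || return 2
    i=2
    while [ "$i" -le "$passes" ]; do
        # Later passes are typically served from the same cached pages.
        cur=$(cksum < "$file") || return 2
        if [ "$cur" != "$first" ]; then
            echo "pass $i: checksum changed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$passes passes agree"
}
```

A clean result here only means the pages holding that one file
read back consistently.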

Also, this is far from a test for all memory problems.  If it
finds anything, it's more of a 'luck' hit than anything.  If
memory problems are suspected, good memory test software, such
as memtest86, is about as good as you can get without going to a
hardware memory tester.

Also, keep in mind that a fair bit of RAM used by the kernel
itself and other processes in memory will NOT get tested by such
a tar test.


>One annoyance is that to do the test right, you would want to disable
>all memory caches from your BIOS.  In other words, you would need to
>schedule downtime for the reboot(s).

This allows you to isolate RAM problems while memory testing, but
again, a real memory test is best if you want to be sure.  And
memtest86 is free.

>(It's even better if you can run the test with the different
>caches enabled one by one.  If the problem shows up when you
>have all caches disabled then you probably have a memory
>problem, although it could still be other hardware, but if the
>problem shows up with only one particular cache enabled, then
>you'll know you have a bad motherboard or cache or cpu.)

Exactly.  There isn't any way of knowing whether it is cache or
RAM with a buster test.  memtest86 can test cache or RAM
extensively, and it runs from a floppy disk with NO operating
system present.

>I've attached part of a linux-kernel thread from a while back
>that describes the test with sample code--Doug Ledford suggests
>and discusses the following script in the second email in the
>attached thread:
>
>  #!/bin/sh
>  cd /tmp
>  tar xzf linux-2.1.123.tar.gz
>  mv linux linux.save
>  for i in 1 2 3 4 5 6 7 8 9 10
>  do
>    tar xzf linux-2.1.123.tar.gz
>    diff -U 3 -rN linux.save linux
>  done
>
>(Note that with some kernels, you can get some ignorable errors
>associated with tar extracts and permissions.)
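
A slightly more defensive variant of the quoted script, offered
as a sketch only.  The function name tar_diff_loop, its
arguments, and the ref/new directory names are all made up for
this example.  It extracts each pass into a fresh directory, so
stale files cannot hide a difference, and it stops at the first
pass where the trees differ.

```shell
# Sketch only: tar_diff_loop is a hypothetical helper.
# Usage: tar_diff_loop TARBALL.tar.gz PASSES WORKDIR
# WARNING: WORKDIR/ref and WORKDIR/new are deleted and recreated.
tar_diff_loop() {
    tarball=$1      # path to a gzip-compressed tar archive
    count=$2        # number of extract-and-compare passes
    workdir=$3      # scratch directory with enough free space

    rm -rf "$workdir/ref" "$workdir/new"
    mkdir -p "$workdir/ref" || return 2
    # Reference copy, extracted once.
    gzip -dc "$tarball" | (cd "$workdir/ref" && tar xf -) || return 2

    i=1
    while [ "$i" -le "$count" ]; do
        # Start every pass from an empty directory.
        rm -rf "$workdir/new" && mkdir "$workdir/new" || return 2
        gzip -dc "$tarball" | (cd "$workdir/new" && tar xf -) || return 2
        # diff -r compares file contents recursively.
        if ! diff -r "$workdir/ref" "$workdir/new" >/dev/null; then
            echo "pass $i: trees differ"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $count passes identical"
}
```

For example, "tar_diff_loop linux-2.1.123.tar.gz 10 /tmp/memcheck"
would mirror the ten passes of the quoted loop, where
/tmp/memcheck is just an assumed scratch location.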

Yes, and you can also run kernel builds at the same time, along
with multiple "find /"'s in the background, updatedb, and some
benchmark or 'crashme' type programs.  All of these beat up the
system and memory well, but they aren't really designed for
testing memory specifically.  A problem during the above testing
could be caused by software incompatibility, a kernel bug,
hardware conflicts, etc.

A proper memory test does not take much more time, and is quite
accurate at diagnosing memory trouble.  You don't change your
tires and oil to solve the problem of running out of gas...

Another thing to watch is the power supply.  I just had lockups
recently, and heavily suspected my PS.  I replaced the fan in it
and dusted it out, still lockups...  I replaced the power supply
with a spare, and the lockups are gone.

CPU fan is another possibility, as are any 3d graphics cards that
get hot, etc...

Good luck with the testing..  If you can't find memtest86, head
to freshmeat.net, or metalab.

===

