balug-debugging_system_lockup_problems

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.

To: Balug-talk@balug.org
From: "Karsten M. Self" <kmself@ix.netcom.com>
Subject: Re: [Balug-talk] random lockup
Date: Sat, 2 Mar 2002 01:10:03 -0800

on Fri, Mar 01, 2002, Devon (devon@sfsu.edu) wrote:

> Hi, I am setting up RH 7.2 on an old Cyrix 686 system I pulled out of
> the back of my closet. I've gotten pretty far; networked with my
> windows machine, samba, apache, etc. OK, it all appears to be working
> fine. The problem is for no reason the system locks up after a
> seemingly random time, it can run for more than 24 hours or just a
> couple, weird. I can't find anything in any log file that consistently
> occurs before a crash and it never happens when I am using the system,
> but I really don't know what to look for. Also I've heard of the
> "coma_bug", and downloaded set6x86, but I'm not sure if I really need
> to use it.

> Has anyone heard of this kind of problem before or have any ideas
> about where to begin troubleshooting this type of problem? How do I
> get more info about what's going on when the system locks up? I don't
> have a lot of experience setting up Linux so any info would be
> helpful.  thanks

You can usually attribute lockups to a relatively small number of prime
causes:

  - Bad hardware.  Generally divided among memory, disk, and CPU.  Other
    components can fail but generally don't completely hose the system,
    contenting themselves to cause less significant, but all the more
    annoying, sorts of mishaps.

  - Bad drivers.  Kernel-level software glitches *can* and *do* lock
    Linux hard.  Probably most common are X Window System driver/card
    issues.  I've had problems at various points in time with same,
    Samba (2.2.14), and a flaky power supply interacting with a
    SpeedStep / Geyserville self-throttling PIII CPU (600 MHz on AC
    power, 500 MHz on battery).

  - Bad juju.  Your system just doesn't like you.  However this
    typically reduces to one of the above.

Diagnosis is usually a tedious matter of trying to eliminate That Which
Doesn't Cause The Problem.

Typical steps:

  - Install a memory tester and run a few cycles (I'd do at least four)
    of memory tests.  memtest86 is a tool for this job; it boots on its
    own (from a floppy or the boot loader) rather than running under
    Linux.

  - Check your disks, particularly swap partitions, for errors.  A flaky
    drive once was crashing a GNU/Linux install on a box, several months
    before Legacy MS Windows died on the same system.

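  - Run exhaustive, repetitive kernel compiles.  A compiler dying with
    signal 11 (SIGSEGV, a segmentation fault) at a different point on
    each run generally indicates a hardware problem: bad memory or
    cache, or a CPU under thermal stress.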

  - Try disabling components or removing drivers, and see if the problem
    remains or disappears.  One technique I use generally is to take
    half of anything and pull it from the system (drivers, components,
    lines of code, whatever).  If the error goes away, but reappears
    when I put that half back and remove the other, I've isolated the
    problem to the half I pulled first; repeat on that half.  Rarely is
    there an interaction between components, so assume there isn't at
    first unless the evidence _strongly_ suggests there is.
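
Two of the steps above (the repeated compile and the halving) can be
sketched as small bash helpers.  This is only a sketch, not part of the
original message: the names repeat_until_fail and find_culprit are made
up for illustration, and find_culprit assumes a single culprit plus a
probe command that you supply, which returns 0 when the problem still
shows up with only the listed items in place.

```bash
#!/bin/bash

# repeat_until_fail COUNT COMMAND [ARGS...]
# Run COMMAND up to COUNT times and stop at the first failing pass.
repeat_until_fail () {
    local count=$1; shift
    local i status
    for (( i = 1; i <= count; i++ )); do
        "$@"
        status=$?
        if (( status > 128 )); then
            # bash reports 128+N when the command itself died from signal N
            echo "pass $i: killed by signal $(( status - 128 ))" >&2
            return "$status"
        elif (( status != 0 )); then
            echo "pass $i: exit status $status" >&2
            return "$status"
        fi
    done
    echo "all $count passes completed" >&2
}

# find_culprit PROBE ITEM...
# Halve the suspect list until one item is left.  PROBE (hypothetical,
# user-supplied) gets a subset of items and returns 0 if the problem
# reproduces with just that subset.  Assumes exactly one bad item.
find_culprit () {
    local probe=$1; shift
    local -a suspects=("$@")
    local half
    (( ${#suspects[@]} > 0 )) || return 1
    while (( ${#suspects[@]} > 1 )); do
        half=$(( ${#suspects[@]} / 2 ))
        local -a first=("${suspects[@]:0:half}")
        local -a second=("${suspects[@]:half}")
        if "$probe" "${first[@]}"; then
            suspects=("${first[@]}")    # problem is in the first half
        else
            suspects=("${second[@]}")   # so it must be in the other half
        fi
    done
    printf '%s\n' "${suspects[0]}"
}
```

Something like "repeat_until_fail 10 make -C /usr/src/linux clean bzImage"
would drive the compile test.  When the compiler crashes under make, the
exit status you get back is make's own, so look for the signal 11 message
in the build output rather than in the status.  With an intermittent
lockup each probe may mean hours of waiting, so in practice the probe is
usually you, watching the box.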

On problems:  seek simple solutions first.  I've addressed four issues
in the last week in which the problem was a connector either in the
wrong place, or plugged into a dead circuit.  Two of these were my own
doing (granted, one occurred in the course of a system teardown).

I've found, in ten years of systems work (both hardware and software),
very few truly arcane bugs.  Most are a matter of one (or in some cases
a few) simple fundamental errors.


There's one additional debugging tool I can offer -- I created a script
for generating a kernel bug report automagickally.  You're welcome to
use it to generate a pretty good inventory of what's on your system.
Attached.


#!/bin/bash

# Kernel bug report generator script
# Script generated from prior bug report form by Karsten M. Self
# $Revision: 1.3 $ $Date: 2000/05/13 07:48:36 $ $Author: root $


# ------------------------------------------------------------------------
# [Some of this is taken from Frohwalt Egerer's original linux-kernel FAQ]

#      What follows is a suggested procedure for reporting
# Linux bugs. You aren't obliged to use the bug reporting
# format, it is provided as a guide to the kind of
# information that can be useful to developers - no more.

#      If the failure includes an "OOPS:" type message in
# your log or on screen please read
# "Documentation/oops-tracing.txt" before posting your bug
# report. This explains what you should do with the "Oops"
# information to make it useful to the recipient.

#       Send the output to the maintainer of the kernel area
# that seems to be involved with the problem. Don't worry
# too much about getting the wrong person. If you are unsure
# send it to the person responsible for the code relevant to
# what you were doing. If it occurs repeatably try and
# describe how to recreate it. That is worth even more than
# the oops itself.  The list of maintainers is in the
# MAINTAINERS file in this directory.

#       If you are totally stumped as to whom to send the
# report, send it to linux-kernel@vger.rutgers.edu. (For
# more information on the linux-kernel mailing list see
# http://www.tux.org/lkml/).

# This is a suggested format for a bug report sent to the
# Linux kernel mailing list. Having a standardized bug
# report form makes it easier for you not to overlook
# things, and easier for the developers to find the pieces
# of information they're really interested in. Don't feel
# you have to follow it.

#    First run the ver_linux script included as
# scripts/ver_linux or at
# <URL:ftp://ftp.sai.msu.su/pub/Linux/ver_linux> It checks
# out the version of some important subsystems.  Run it with
# the command "sh scripts/ver_linux"

# Use that information to fill in all fields of the bug
# report form, and post it to the mailing list with a
# subject of "PROBLEM: <one line summary from [1.]>" for
# easy identification by the developers

# ------------------------------------------------------------------------

# indent by one tabstop
function tabout () { sed -e '/^/s//	/'; }

kversion=$( uname -r )
dmesg=dmesg
# dmesg="cat /var/log/kern.log"	# uncomment for debugging only
oops_number=$( $dmesg | grep Oops | tail -1 | sed -e '/^.*:/s///' )
oops_module=$( $dmesg | grep EIP | tail -1 | sed -e '/^.*:/s///' )

cat <<EOF

This is a script-generated kernel bug report. 

The system administrator/developer should provide additional information
where appropriate.

kernel-bug-report: \$Revision: 1.3 $ \$Date: 2000/05/13 07:48:36 $ \$Author: root $

[1.] One line summary of the problem:   

	PROBLEM:  $1 oops $oops_number in $oops_module, $kversion kernel

[2.] Full description of the problem/report:

	n/a

[3.] Keywords (i.e., modules, networking, kernel):

	linux kernel $kversion oops $oops_number $oops_module

[4.] Kernel version (from /proc/version):

$( cat /proc/version | tabout )

[5.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/oops-tracing.txt)

$( $dmesg | ksymoops -k /proc/ksyms | tabout )

[6.] A small shell script or example program which triggers the
     problem (if possible)

	n/a

[7.] Environment

$( set | tabout )

[7.1.] Software (add the output of the ver_linux script here)

$( sh -f /usr/src/linux/scripts/ver_linux | tabout )

[7.2.] Processor information (from /proc/cpuinfo):

$( cat /proc/cpuinfo | tabout )

[7.3.] Module information (from /proc/modules):

$( cat /proc/modules | tabout )

[7.4.] SCSI information (from /proc/scsi/scsi)

$( cat /proc/scsi/scsi | tabout )

[7.5.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):

	System memory (at time of oops):
$( cat /proc/meminfo | tabout )

	System uptime:
$( uptime | tabout )

[X.] Other notes, patches, fixes, workarounds:
EOF
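
The script above assumes every file it reads is there.  /proc/scsi/scsi,
for one, may be absent on a box without SCSI support, in which case cat
only complains on stderr and that section of the report comes out empty.
A more forgiving helper might look like the sketch below.  The name
show_file is hypothetical and not part of the original script; tabout is
redefined here only so the fragment stands on its own.

```bash
#!/bin/bash

# Indent every line by one tab, like the report script's tabout.
tabout () { sed -e $'s/^/\t/'; }

# show_file PATH
# Print PATH tab-indented, or a placeholder line if it can't be read.
show_file () {
    if [ -r "$1" ]; then
        tabout < "$1"
    else
        printf '\t(%s not available)\n' "$1"
    fi
}
```

Inside the here-document, "$( show_file /proc/scsi/scsi )" would then
stand in for "$( cat /proc/scsi/scsi | tabout )".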

===

To: "Karsten M. Self" <kmself@ix.netcom.com>
From: Jeffrey Siegal <jbs@quiotix.com>
Subject: Re: [Balug-talk] random lockup
Date: Sat, 02 Mar 2002 01:14:50 -0800

Karsten M. Self wrote:
>   - Bad hardware.  Generally divided among memory, disk, and CPU.  Other
>     components can fail but generally don't completely hose the system,
>     contenting themselves to cause less significant, but all the more
>     annoying, sorts of mishaps.

Also:

Bad or overloaded power supply.

Stuck or burned out CPU fan.

