Using Q4 to analyze system dumps


Standard disclaimer: Use the information that follows at your own risk. If you screw up a system, don't blame it on me...
mailto: dkoleary@olearycomputers.com

Crash checklist
HP's full document

HP Crash procedures checklist

The following checklist is a shortened and, hopefully, better organized rendition of HP's full crash procedure doc: OZBEKBRC00000611. HP's document goes into detail on versions, what each step is for, why, etc.; this checklist assumes version 11.X and a patched version of q4.

Verify crash dumps are present; try to recreate if not.
Verify q4 is loaded
swlist -l fileset | grep -i q4
Uncompress the compressed kernel
. /usr/contrib/Q4/bin/set_env
/usr/contrib/Q4/bin/q4pxdb vmunix
Run:
/usr/contrib/bin/q4 -p .
q4> run Analyze AU > ana.out
q4> run WhatHappened -HANG > what.out
q4> exit
Examine output: Send the following:

ITRC DOCUMENT ID: OZBEKBRC00000611 (For 10.10-11.20)

When HP-UX crashes, it saves a snapshot of RAM in disk-based swap space or dedicated dump space, reboots the system, and copies the resulting "dump" into /var/adm/crash.

A utility called q4, normally loaded on the system, is available to make text files for fast analysis. A patched version of q4 must be loaded to interpret dumps resulting from a "hanging" operating system.

To preprocess the dump, follow these steps and email the resulting files to the HP Response Center for analysis. Steps vary depending on the version of the O/S and the version of q4.

Please note, all email generated from this procedure should be sent to the dump team email address hpcu@atl.hp.com using the CALL ID as the SUBJECT.

DO NOT send this information to the engineer's personal address.

After emailing the data, please log a callback against the call to let the engineer know that you have emailed your data.

  1. WHERE IS THE DUMP?
    1. Verify a current dump exists in the dump directory:

      ll /var/adm/crash/c*

      A recent core.N (10.X) or crash.N (11.X) directory should be listed. (NOTE: N is the next available dump index, which increments with each successive dump.)

      The INDEX file in /var/adm/crash/c* and /etc/shutdownlog both contain the "panic" statement.

    2. touch /etc/shutdownlog 
    3. If a current dump is not in /var/adm/crash, run:

      grep _DIR /etc/rc.config.d/save*

      The value pointed to by SAVECORE_DIR=(10.X) or SAVECRASH_DIR=(11.X) is where the system places dump files.

    4. If the system dump is not in the expected location, try to re-save the dump with:
      1. 10.X : # savecore -vr
      2. 11.00: # savecrash -vr

      A return message "invalid dump header" means the dump is non-existent.

      NOTE: If the current dump directory fills up while saving a dump, point the directory variable at a directory with more space, and create that directory so it can capture future dumps.
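
The checks in step 1 can be sketched in shell. This is only a sketch, assuming a POSIX shell such as sh-posix. The function names newest_dump_dir and configured_dump_dir are my own and not part of HP-UX, and the sed pattern assumes the directory variable is assigned at the start of an uncommented line.

```shell
# Print the most recently modified core.N or crash.N directory under a
# crash directory (default /var/adm/crash); return non-zero if none exists.
newest_dump_dir() {
    crash_root=${1:-/var/adm/crash}
    # ls -td lists the directories themselves, newest first. Errors for the
    # pattern that does not match on this release are discarded.
    newest=$(ls -td "$crash_root"/core.[0-9]* "$crash_root"/crash.[0-9]* 2>/dev/null | head -1)
    [ -n "$newest" ] && [ -d "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Print the value of SAVECORE_DIR or SAVECRASH_DIR from a config file.
# Assumption: the assignment is uncommented and starts in column 1.
configured_dump_dir() {
    sed -n 's/^SAVEC[A-Z]*_DIR=//p' "$1" | tr -d '"' | tail -1
}
```

For example, newest_dump_dir with no argument looks in /var/adm/crash and returns non-zero when there is nothing to analyze, which is the cue to try the re-save in step 1.4. configured_dump_dir takes one of the /etc/rc.config.d/save* files as its argument.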

  2. IS A VERSION OF Q4 LOADED?
    1. Determine if and which version of q4 is loaded:

      swlist -l fileset | grep -i Q4 

      The following are unpatched versions supplied with the OS:

      OS-Core.Q4 B.10.20 HP-UX Crash Dump Debugger for PA-RISC systems
      OS-Core.Q4 B.11.00 HP-UX Crash Dump Debugger for PA-RISC systems
      OS-Core.Q4 B.11.11 HP-UX Crash Dump Debugger for PA-RISC systems
    2. If one of the following patched versions is listed, proceed to STEP 3:
      OS release   Patch
      10.20        PHCO_20261
      11.00        PHCO_20262
      11.11        PHCO_25723
    3. If the system does not have q4, or the dump was the result of a hang, load the patched version. Loading the patched version will not cause a system reboot. Installation instructions accompany the patch.

      Download the appropriate version from one of these locations:

      1. For 10.[12]0 versions: ftp://us-ffs.external.hp.com/hp-ux_patches/s700_800/10.X/PHCO_20261
      2. For 11.0 versions: ftp://us-ffs.external.hp.com/hp-ux_patches/s700_800/11.X/PHCO_20262
      3. For 11.11 versions: ftp://us-ffs.external.hp.com/hp-ux_patches/s700_800/11.X/PHCO_25723

      NOTE: The patch number may be superseded over time.

      NOTE 2: Those links have zero chance of working, as 10.X and 11.0 are no longer supported. Additionally, HP has gone to a pay-to-play model for patches; you need a software service agreement to get them.

    4. If web access is unavailable and no version of q4 is on the system and the install CD is available, proceed to load the standard version of q4:

      Mount the INSTALL media and verify a matching version of Q4 is available:

      swlist -l fileset -s / | grep Q4
      OS-Core.Q4    B.10.10     HP-UX Crash Dump Debugger for PA-RISC systems 
                     ^^^^^ -matches the O/S 

      Use swinstall to install it:

      # swinstall -vs / OS-Core.Q4
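
The decision in step 2 (patched, unpatched, or no q4 at all) can be sketched as a small filter over swlist output. This is a sketch only. The function name q4_status is hypothetical, it assumes the patch number appears somewhere on the swlist line, and it knows only the three patch numbers listed above, so a superseding patch would be reported as "unpatched".

```shell
# Read `swlist -l fileset` output on stdin and print one word:
#   patched   - one of the q4 patches named in the procedure is present
#   unpatched - only the OS-Core.Q4 fileset is present
#   missing   - no q4 fileset was seen
q4_status() {
    awk '
        /PHCO_(20261|20262|25723)/ { patched = 1 }
        /OS-Core\.Q4/              { base = 1 }
        END {
            if (patched)   print "patched"
            else if (base) print "unpatched"
            else           print "missing"
        }'
}
```

Typical use would be: swlist -l fileset | q4_status. A result of "patched" means go to STEP 3 and then STEP 5; "unpatched" means STEP 4 unless the dump came from a hang.
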
  3. CD TO THE DUMPS DIRECTORY

    NOTE: csh (c-shell) will cause errors with q4. Use sh-posix.

    cd (dump directory)
    e.g.:  cd /var/adm/crash/core.0 OR cd /var/adm/crash/crash.0
  4. IF USING UNPATCHED Q4
    1. Perform this command:
      /usr/contrib/bin/gunzip vmunix.gz
       (uncompresses the kernel file)
      For 10.20 through 11.11, type this command and then skip to 4.2:
      /usr/contrib/bin/q4prep -p 
      For 11.20 and beyond, type this command and then skip to 4.2:
      /usr/contrib/Q4/bin/q4prep -p
      If at 10.10, type the following commands:
      uncompress /usr/contrib/lib/Q4Lib.tar.Z
      (ignore the error if this was done previously)
      tar -xf /usr/contrib/lib/Q4Lib.tar
      (output goes into the current directory)
      cp q4lib/sample.q4rc.pl ~/.q4rc.pl
      Note the use of a tilde and letter "l" (not digit 1)

      /usr/contrib/bin/q4pxdb vmunix
      This may complain if vmunix is already preprocessed.
    2. If the next command causes "/var: file system full", move the core. directory to a file system with adequate space (approximately 2x the sum of the core.x.y.gz files) and continue at this point.

      Type:

      q4 -p  .    
      (note the "dot" at the end of the command)

      Then:

      q4> trace event 0 > trace.out
      q4> include analyze.pl 
      NOTE letter "l" (not digit 1)
      q4> run Analyze AU >> ana.out
      NOTE: ctrl-c will interrupt q4
      q4> exit

      Skip to STEP 6
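
The "approximately 2x the sum of the .gz files" rule of thumb can be turned into a number before q4 is started. A rough sketch: dump_space_needed_kb is a name I made up, it counts every *.gz file in the directory (so it overestimates slightly if vmunix.gz is still compressed), and it assumes the size is the fifth field of ls -l output.

```shell
# Print twice the combined size, in kilobytes, of the *.gz files in a
# dump directory (default: the current directory).
dump_space_needed_kb() {
    dump_dir=${1:-.}
    # awk does the sum so that multi-gigabyte dumps do not overflow
    # shell integer arithmetic; an empty directory yields 0.
    ls -l "$dump_dir"/*.gz 2>/dev/null |
        awk '{ bytes += $5 } END { printf "%d\n", 2 * bytes / 1024 }'
}
```

Compare the result with the available kilobytes that df -k reports for the target file system before deciding whether the dump directory needs to be moved.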

  5. IF USING THE PATCHED VERSION OF Q4
    1. Type:
      . /usr/contrib/Q4/bin/set_env
      Note the 'dot' at the beginning of the command.
    2. If the next steps cause "/var: file system full", move the core. or crash. directory to a file system with adequate space (approximately 2x the sum of the core.x.y.gz files) and continue at this point.

      Type:

      /usr/contrib/Q4/bin/q4pxdb vmunix
      (Disregard "unnecessary" message)
      /usr/contrib/Q4/bin/q4 -p . 
      (note the "dot" at the end of the command)
    3. At the q4> prompt, type:
      q4> run Analyze AU > ana.out
      q4> run WhatHappened -HANG > what.out
      NOTE: ctrl-c can interrupt these two commands, which may take several minutes to process.
    4. Type:
      q4> exit
  6. REVIEW AND SEND DATA
    1. Determine if a hardware problem induced the crash. If ana.out or trace.out contains references to an HPMC occurring, the cause of the crash was very likely a hardware fault.

      Type:

      grep HPMC ana.out trace.out

      Check for:

      1. "crash event was an HPMC"
      2. "Crash Event 0 (HPMC, struct crash_event_table_struct..."

      If either of these lines appears, open a hardware repair request with the hardware support organization for this system.

      Also, send the /var/tombstones/ts* file (if that directory exists) matching the "dumptime" listed in the INDEX file. It may well have the hardware fault codes that can aid in isolating the hardware cause.

      If an HPMC did not occur, proceed to 6.2.
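
The hardware-versus-software decision in 6.1 can be sketched as a function that looks for the two HPMC strings quoted above. The name crash_was_hpmc is my own. Files that do not exist are skipped, since trace.out is only produced by the unpatched procedure.

```shell
# Return 0 if any readable file among the arguments contains one of the
# two HPMC signatures from the procedure, 1 otherwise.
crash_was_hpmc() {
    for f in "$@"; do
        [ -r "$f" ] || continue   # ana.out or trace.out may be absent
        # Basic regular expressions: the "(" is matched literally.
        if grep -e 'crash event was an HPMC' -e 'Crash Event 0 (HPMC' "$f" >/dev/null; then
            return 0
        fi
    done
    return 1
}
```

For example: if crash_was_hpmc ana.out trace.out succeeds, open a hardware call and collect the tombstone file; otherwise continue with 6.2.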

    2. Check ana.out to see if MC/ServiceGuard (if it is installed) triggered the reboot. Look for this message:

      "MC/ServiceGuard: Unable to maintain contact with cmcld daemon. Performing TOC to ensure data integrity."

      If so, type:

      cmgetconf |grep E_T /etc/cmcluster/* 
      (Check the cluster for a NODE_TIMEOUT of 2000000; the value is in microseconds, so this is 2 seconds)
      
        If NODE_TIMEOUT is set to 2 seconds, the crash is probably due to this
        extremely low setting.
      
        To correct the problem:
        Increase the value to 5-8 seconds in the cluster configuration file and
        perform a "cmapplyconf" with the cluster down.  Also, read this
        article UXSGLVKBAN00000010 in the http://ITRC.HP.COM technical database for
        more details on dealing with ServiceGuard-induced crashes.
      
        If NODE_TIMEOUT was set to 2 seconds and the value was corrected, stop
        here.
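
Since NODE_TIMEOUT is given in microseconds, it is easy to misread. The following sketch converts it to seconds and flags the 2-second case. check_node_timeout is a hypothetical helper name, and it assumes the cluster configuration has the parameter name and value as the first two whitespace-separated fields of an uncommented line.

```shell
# Read a cluster configuration on stdin, print each NODE_TIMEOUT in
# seconds, and return non-zero if any value is 2 seconds or less.
check_node_timeout() {
    awk '
        $1 == "NODE_TIMEOUT" {
            secs = $2 / 1000000            # microseconds to seconds
            printf "NODE_TIMEOUT %s us = %g s\n", $2, secs
            if (secs <= 2) low = 1
        }
        END { exit (low ? 1 : 0) }'
}
```

It can be fed from a saved configuration file or from cmgetconf output. A non-zero return corresponds to the "extremely low setting" case described above.
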
    3. Generate a patch list:
      /usr/sbin/swlist -l product | grep PH > patches.out 
    4. Send the following files to hpcu@atl.hp.com using the SOFTWARE CASE ID as the subject:
      1. ana.out
      2. patches.out
      3. trace.out
      4. what.out (if created)
      5. /etc/shutdownlog
      6. /var/tombstones/ts* (if HPMC was detected)
      NOTES:
      1. The hpcu E-Mail box has a 3MB maximum mail size!
      2. Keep this document and use it on future dumps to determine whether to open a hardware or software case.
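
Given the 3MB mailbox limit, it may be worth adding up the files before sending. A minimal sketch, with assumptions: fits_mail_limit and MAIL_LIMIT_BYTES are names I invented, 3MB is taken as 3*1024*1024 bytes, and any growth from mail encoding is not accounted for.

```shell
# Print the combined size in bytes of the files given as arguments and
# return non-zero if it exceeds the limit (default 3145728 bytes).
fits_mail_limit() {
    limit=${MAIL_LIMIT_BYTES:-3145728}
    total=0
    for f in "$@"; do
        [ -f "$f" ] || continue            # skip files that were not created
        total=$((total + $(wc -c < "$f")))
    done
    echo "$total"
    [ "$total" -le "$limit" ]
}
```

For example: fits_mail_limit ana.out patches.out trace.out what.out /etc/shutdownlog. If it returns non-zero, the files would need to be split across several messages or compressed.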
