Causes of kernel errors

In Linux, kernel errors have many possible causes. This section discusses a few of the most common ones:

  • Hardware – Machine Check Exceptions: This type of kernel error occurs when the hardware detects a component failure and reports it through an exception. Typical symptoms look like this:
System hangs or kernel panics with MCE (Machine Check Exception) in /var/log/messages file.
System was not responding. Checked the messages in netdump server. Found the following messages ..."Kernel panic - not syncing: Machine check".
System crashes under load.
System crashed and rebooted.
Machine Check Exception panic
  • Error Detection and Correction (EDAC): When the hardware detects a memory chip error or a PCI transfer error, it reports the problem as an EDAC error. These errors are reported under /sys/devices/system/edac/{mc/,pci} and typically look like this:
Northbridge Error, node 1, core: -1
K8 ECC error.
EDAC amd64 MC1: CE ERROR_ADDRESS= 0x101a793400
EDAC MC1: INTERNAL ERROR: row out of range (-22 >= 8)
EDAC MC1: CE - no information available: INTERNAL ERROR
EDAC MC1: CE - no information available: amd64_edac Error Overflow
  • Non-Maskable Interrupts (NMIs): A Non-Maskable Interrupt (NMI) is an interrupt that standard operating system mechanisms cannot ignore or mask out. It is generally used to signal critical hardware errors. A sample NMI error in /var/log/messages looks like this:
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
kernel: Uhhuh. NMI received for unknown reason 31 on CPU 0.
  • Software – The BUG() macro: Kernel code raises this kind of error when it detects an abnormal situation that indicates a programming error. It typically looks like this:
NFS client kernel crash because async task already queued hitting BUG_ON(RPC_IS_QUEUED(task)); in __rpc_execute
kernel BUG at net/sunrpc/sched.c:616!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 8
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss pcc_cpufreq sunrpc power_meter hpilo hpwdt igb mlx4_ib(U) mlx4_en(U) raid0 mlx4_core(U) sg microcode serio_raw iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext4 mbcache jbd2 raid1 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
  • Software – Pseudo-hangs: These errors are common. The system appears to be hung, and there can be several reasons for this behavior, such as:
    • Livelock: On a real-time kernel, an excessive application load can leave the system unresponsive. The system is not completely hung; it only appears to be because it is progressing so slowly.

A sample error message logged in /var/log/messages when the system hangs frequently looks like this:

INFO: task cmaperfd:5628 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cmaperfd D ffff810009025e20 0 5628 1 5655 5577 (NOTLB)
ffff81081bdc9d18 0000000000000082 0000000000000000 0000000000000000
0000000000000000 0000000000000007 ffff81082250f040 ffff81043e100040
0000d75ba65246a4 0000000001f4db40 ffff81082250f228 0000000828e5ac68
Call Trace:
[<ffffffff8803bccc>] :jbd2:start_this_handle+0x2ed/0x3b7
[<ffffffff800a3c28>] autoremove_wake_function+0x0/0x2e
[<ffffffff8002d0f4>] mntput_no_expire+0x19/0x89
[<ffffffff8803be39>] :jbd2:jbd2_journal_start+0xa3/0xda
[<ffffffff8805e7b0>] :ext4:ext4_dirty_inode+0x1a/0x46
[<ffffffff80013deb>] __mark_inode_dirty+0x29/0x16e
[<ffffffff80041bf5>] inode_setattr+0xfd/0x104
[<ffffffff8805e70c>] :ext4:ext4_setattr+0x2db/0x365
[<ffffffff88055abc>] :ext4:ext4_file_open+0x0/0xf5
[<ffffffff8002cf2b>] notify_change+0x145/0x2f5
[<ffffffff800e45fe>] sys_fchmod+0xb3/0xd7
  • Software – Out-of-Memory killer: When the system is starved of memory, the kernel's Out-of-Memory (OOM) killer frees memory by killing processes. If no process can be killed, the kernel panics. This error typically looks like this:
Kernel panic - not syncing: Out of memory and no killable processes...
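
When triaging a log, it can help to sort messages into the categories described above. The following is a minimal sketch in Python, not a definitive tool: the category names and the `classify` and `summarize` helpers are hypothetical, and the patterns are taken only from the sample messages in this section, so they will likely need adjusting for other kernel versions.

```python
import re

# Patterns are derived from the sample messages in this section only.
# Real kernels vary their wording, so treat this table as a starting point.
PATTERNS = [
    ("machine-check", re.compile(r"Machine [Cc]heck|\bMCE\b")),
    ("edac", re.compile(r"\bEDAC\b|ECC error|Northbridge Error")),
    ("nmi", re.compile(r"NMI received|Dazed and confused")),
    ("bug", re.compile(r"kernel BUG at|BUG_ON")),
    ("hung-task", re.compile(r"blocked for more than \d+ seconds")),
    ("oom", re.compile(r"Out of memory")),
]


def classify(line):
    """Return the first matching category name for a log line, or None."""
    for category, pattern in PATTERNS:
        if pattern.search(line):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each category; unmatched lines are skipped."""
    counts = {}
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts
```

To run it against a real log, pass an open file object, for example `summarize(open("/var/log/messages"))`. Reading that file usually requires root privileges.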

Whenever a kernel panic or error occurs, you may have to analyze it to diagnose and troubleshoot the underlying problem. This can be done using the Kdump utility. Kdump can be configured using the following steps:

  1. Install kexec-tools.
  2. Edit the /etc/grub.conf file and append crashkernel=<reserved-memory-setting> to the end of the kernel line.
  3. Edit /etc/kdump.conf and specify the destination for the crash dump (the vmcore file) that is captured through kexec.
  4. Configure the core collector to discard unnecessary memory pages and compress only the ones that are needed.
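
After a reboot, you may want to confirm that the memory reservation from step 2 took effect. The following is a minimal sketch in Python, assuming the running kernel exposes /proc/cmdline and /sys/kernel/kexec_crash_loaded. The helper names `crashkernel_setting` and `kdump_status` are hypothetical, and the values used in the usage note are illustrative only.

```python
import re


def crashkernel_setting(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    return match.group(1) if match else None


def kdump_status(cmdline_path="/proc/cmdline",
                 loaded_path="/sys/kernel/kexec_crash_loaded"):
    """Report the crashkernel reservation and whether a capture kernel is loaded."""
    with open(cmdline_path) as f:
        reserved = crashkernel_setting(f.read())
    try:
        # The file holds "1" when a crash (capture) kernel has been loaded.
        with open(loaded_path) as f:
            loaded = f.read().strip() == "1"
    except OSError:
        # Assumption: a missing or unreadable file is treated as "not loaded".
        loaded = False
    return {"crashkernel": reserved, "capture_kernel_loaded": loaded}
```

For example, `crashkernel_setting("ro root=/dev/sda1 crashkernel=128M quiet")` returns `"128M"`, where 128M is only an example value. A result of `None` suggests that the kernel line was not updated or that the system has not been rebooted since the change.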