One line summary: Best of practical Troubleshooting methods for everyday use
[Review: very long - 30-40 minutes]
--- digested version start ---
--- DISCLAIMER: This is a requested review by PTR, however any opinions expressed within the review are my personal ones. ---
The book Linux Troubleshooting for System Administrators and Power Users by Kirkland,
Carmichael and the Tinker brothers tackles many issues a typical system admin or level one
or two support engineer will face in their daily work.
As such it provides many solutions and method overviews on how to troubleshoot
recurring problems. It differs from other troubleshooting books in that it delivers solutions which
help you maintain sound, smooth running system configurations within your company network
and allows you to see the bigger picture rather than just offering fix "A" for problem "A" and so on.
Each chapter is self contained and first gives a general overview of and background on
the technologies used and the related OS processes. It then digs into the most common
problems and makes suggestions on how to troubleshoot those issues.
Finally each chapter closes with typically 3-4 troubleshooting scenarios, where
the reader can exercise and extend what has been learned.
The authors provide distinct scenarios and extensions to the most useful commands and
system tweaks. Basically the value of the book is that Kirkland, Carmichael and the
Tinker brothers deliver proven troubleshooting methods to cut to the chase and maintain a
coherent system. And this is where the book really comes alive.
For interested readers, a longer (more detailed) version of the review continues below the line.
--- digested version end ---
--- Long version start ---
The original book's TOC reads as below:
Table of Contents
Preface
Chapter 1 System Boot, Startup, and Shutdown Issues (48p)
Chapter 2 System Hangs and Panics (26p)
Chapter 3 Performance Tools (26p)
Chapter 4 Performance (50p)
Chapter 5 Adding New Storage via SAN with Reference to PCMCIA and USB (24p)
Chapter 6 Disk Partitions and Filesystems (43p)
Chapter 7 Device Failure and Replacement (23p)
Chapter 8 Linux Processes: Structures, Hangs, and Core Dumps (31p)
Chapter 9 Backup/Recovery (28p)
Chapter 10 cron and at (30p)
Chapter 11 Printing and Printers (37p)
Chapter 12 System Security (38p)
Chapter 13 Network Problems (70p)
Chapter 14 Login Problems (29p)
Chapter 15 X Windows Problems (22p)
To describe the book in more detail and get the most out of it, I would like to
take the liberty of rearranging the order of the book's chapters. That is no criticism of the book,
rather the approach that worked best for me. So I divided Kirkland, Carmichael and the Tinker
brothers' book into the following 5 topics / sections.
Section I - Required troubleshooting skills and tasks for disaster prevention
Section II - Troubleshooting basics
Section III - Troubleshooting Hardware devices
Section IV - Troubleshooting the OS itself
Section V - Application Service related troubleshooting
So "my" TOC becomes:
Table of Contents
Preface
Section I - Required troubleshooting skills and tasks for disaster prevention
Chapter 9 Backup/Recovery (28p)
Chapter 10 cron and at (30p)
Chapter 12 System Security (38p)
Section II - Troubleshooting basics
Chapter 3 Performance Tools (26p)
Chapter 4 Performance Hunting (50p)
Section III - Troubleshooting Hardware devices
Chapter 5 Adding New Storage via SAN with Reference to PCMCIA and USB (24p)
Chapter 7 Device Failure and Replacement (23p)
Section IV - Troubleshooting the OS itself
Chapter 1 System Boot, Startup, and Shutdown Issues (48p)
Chapter 6 Disk Partitions and Filesystems (43p)
Chapter 2 System Hangs and Panics (26p)
Chapter 8 Linux Processes: Structures, Hangs, and Core Dumps (31p)
Section V - Application Service related troubleshooting
Chapter 13 Network Problems (70p)
Chapter 14 Login Problems (29p)
Chapter 11 Printing and Printers (37p)
Chapter 15 X Windows Problems (22p)
So, having rearranged the chapters, let's "walk" our way through the virtual
"sections", shall we? (note I used the word walk not work ;-)
--- Required troubleshooting skills and tasks for disaster prevention
This is the section which I would have started the book with. It describes the tasks EVERY
good system admin should be familiar with before starting to troubleshoot any
sophisticated problems.
Chapter 9 describes the backup and recovery related tasks.
What I liked most about the chapter is that it does not just describe backup media,
backup devices, backup strategies and required utilities but also compares their
technological (dis)advantages. The troubleshooting scenarios will, for example, show you how
to read out a file header in the first blocks of the tape in order to find out what's on the tape
(imagine that the admin forgot to label it). Similar steps will help you decide quickly if you have
a hardware or a software issue.
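To give a flavour of such a header check, here is a minimal Python sketch. It is not taken
from the book; it only assumes the well-known magic numbers of a few archive formats
(gzip, bzip2, cpio and the "ustar" marker at byte offset 257 of a tar header). The function
names, the 64 KiB read size and the device path in the usage note are my own assumptions.

```python
def identify_archive(header: bytes) -> str:
    """Guess the archive format from the first bytes of a tape or file."""
    if header[:2] == b"\x1f\x8b":          # gzip magic number
        return "gzip"
    if header[:3] == b"BZh":               # bzip2 magic number
        return "bzip2"
    if header[:6] in (b"070701", b"070702", b"070707"):  # ASCII cpio variants
        return "cpio"
    if header[257:262] == b"ustar":        # POSIX/GNU tar marker in the first header block
        return "tar"
    return "unknown"


def read_first_block(path: str, size: int = 65536) -> bytes:
    """Read the first block from a device or file.

    The read size is an assumption: on tape it has to be at least as
    large as the block size the tape was written with.
    """
    with open(path, "rb") as dev:
        return dev.read(size)
```

On a Linux box one might call identify_archive(read_first_block("/dev/nst0")), where
/dev/nst0 is only the usual name of the first non-rewinding tape device and may differ
on your system.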
The next chapter (10) covers scheduling regularly occurring tasks with cron, at,
anacron and kcron. Scheduling is not necessarily one of my favourite topics, but I found
that the authors provide excellent troubleshooting scenarios (and the solutions). They show
many ways to arrive at the same result, which is certainly another one of the book's strengths.
Chapter 12 provides the reader with basic security definitions, for example for
Vulnerabilities and Exposures, and explains how to use SSH and iptables and how to verify downloaded packages.
The troubleshooting scenarios will teach you how to check open and closed ports and which
services are running on them. To that end some related commands such as netstat are briefly
mentioned. Keep in mind that the book is not primarily security focused, so it can only
cover a certain range; however the chapter is a valid security related "introduction"
that highlights often overlooked key points.
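The port check itself can be illustrated with a few lines of Python. This is only a
sketch of the general idea (try a TCP connect and see whether it succeeds), not the
netstat based approach from the book; the function name and the timeout value are
assumptions of mine.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

A call such as is_port_open("127.0.0.1", 22) would tell you whether something is
listening on the SSH port of the local machine. Note that this only covers TCP and says
nothing about which service is behind the port.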
--- Troubleshooting basics
Chapter 3 briefly describes performance related tools such as top, sar, vmstat, iostat and
free before they are put to use in Chapter 4's performance hunts.
After a general discussion of the methods used, the authors go straight into measuring
performance on raw HDD devices, single and multithreaded processes and their effects.
They discuss the influence of the block sizes within the HBA driver itself, the block
size of the filesystem, striping sizes, filesystem layouts and related command options
and even go into a multipath (load balancing) discussion. They continue with a brief
bonnie benchmark before moving on to CPU utilization issues and Oracle's Statspack.
--- Troubleshooting Hardware devices
In Chapter 5 the authors continue on the hardware side, again with storage media, this
time adding a storage area network based on Emulex LP8000 and LP9802 HBAs to two
machines. Using lspci, /proc/ioports, lshw and other tools the authors first show how to
look for relevant information on your system BEFORE actually installing the device
physically and getting it up and running.
For the unfortunate case that a device fails or needs replacement, Chapter 7 will prepare
you for the worst. The discussion starts with the usual pointers to the supported hardware
lists of the distributions before going into what to look for, and how, when tracking down problems and errors with
failed devices. This includes how error codes are encoded and what they mean at the byte level,
in order to find the real cause of any issue. Finally the impact of a failed device on the
surrounding processes is discussed (imagine, for example, removing / replacing a
partition or drive which is mounted from several servers at the same time).
--- Troubleshooting the OS itself
The authors don't lose any time and dive straight into system boot and
startup/shutdown issues in Chapter 1. That includes bootloader and init script issues.
This is one of the longer chapters. It describes in great detail many issues that might arise and prevent
you from starting the OS.
They continue down this path in Chapter 6 with partitioning and filesystem related issues.
They even go into data recovery of the MBR, including the partition table and
boot sectors, and literally walk you through the bits on the hard drive and how to read /
interpret them.
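As an illustration of what "reading the bits" means here, the following Python sketch
decodes the classic MBR layout: four 16 byte partition entries starting at byte offset 446
and the boot signature 0x55 0xAA at offset 510. This is the standard on-disk layout, not
code from the book; the function name and the dictionary keys are my own choice.

```python
import struct


def parse_mbr(sector: bytes) -> list:
    """Decode the four primary partition entries of a 512 byte MBR sector."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("boot signature 0x55AA not found")

    partitions = []
    for index in range(4):
        raw = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        # status, CHS start, type, CHS end, first LBA sector, sector count
        status, _chs_first, ptype, _chs_last, lba_start, count = struct.unpack(
            "<B3sB3sII", raw
        )
        if ptype == 0x00:          # type 0 marks an unused entry
            continue
        partitions.append({
            "index": index + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "lba_start": lba_start,
            "sectors": count,
        })
    return partitions
```

Fed with the first 512 bytes of a disk image, it returns one dictionary per used
partition entry, which makes it easy to compare against what fdisk reports.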
In Chapter 2 they continue with troubleshooting "System Hangs and Panics"
and give a nice overview of what types exist and how the troubleshooting approach differs
for them. Although I have been using Linux for more than 5 years, I found especially the definitions of
system hangs and panics very well written.
In Chapter 8 the authors basically continue where they left off in Chapter 7 with the discussion
of failed devices, but this time looking at what happens from the system processes' point
of view, for example process structures, hangs and core dumps.
--- Application Service related troubleshooting
About network problems: what is there to write? ... You plug in the cable, do a ping and off you go, right?
Ahem, ... not quite. The authors recognize that networking can involve many tricky issues.
This is why Chapter 13 "Network Problems" is the biggest chapter (70p), and for a good reason:
network problems are the most difficult to troubleshoot. You are basically fighting with issues that could be related to
applications, the OS, protocols, signal interference, timeouts, compatibility problems,
bugs and driver annoyances.
This chapter will provide you with the tools and the basic OSI layer background to solve them.
* Tools described: dmesg, lspci, /proc/ioports, mii-tool, ethtool, route, arp, ifconfig
* OSI layer background: for example the Ethernet frame types Novell 802.3, SNAP, Ethernet II
and IEEE 802.3 with LLC
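For readers who have never had to tell these four frame types apart, here is a small
Python sketch of the usual rule of thumb: a value of 0x0600 or more in bytes 12-13 is an
EtherType (Ethernet II), otherwise it is a length field and the first two payload bytes
decide between Novell raw 802.3 (0xFFFF), SNAP (0xAAAA) and plain 802.2 LLC. This is not
code from the book; the function name and the returned labels are assumptions, and values
between 1501 and 1535 are simply treated as lengths here.

```python
import struct


def classify_frame(frame: bytes) -> str:
    """Classify an Ethernet frame (starting at the destination MAC address)."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # bytes 12-13 hold either an EtherType or a length, in network byte order
    (type_or_length,) = struct.unpack("!H", frame[12:14])
    if type_or_length >= 0x0600:
        return "Ethernet II"
    first_two = frame[14:16]
    if first_two == b"\xff\xff":      # IPX checksum field directly after the header
        return "Novell raw 802.3"
    if first_two == b"\xaa\xaa":      # DSAP/SSAP 0xAA signal a SNAP header
        return "IEEE 802.3 with SNAP"
    return "IEEE 802.3 with LLC"
```

The same decision is what a packet sniffer makes before it can decode anything above
layer 2, which is why mismatched frame types lead to hosts that cannot see each other.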
It even goes into more detail, for example at the kernel level, describing modprobe, /etc/modules.conf and
enabling/disabling kernel features with sysctl. What makes this chapter so valuable are
the approaches to the troubleshooting scenarios at the different OSI model layers and their correlations.
Below are the troubleshooting scenarios that you will be exercising:
- Being unable to communicate with other hosts within a network
- The network protocol (UDP based)
- Slow DNS lookups (UDP based)
- Heavy load conditions and packet loss (UDP based)
- Slow FTP transfer scenario
In Chapter 14 the authors describe "Login problems". I have to admit that I never
experienced any of the issues described (probably because our company is too small ;-),
but I found the chapter informative. It basically describes the usage of chage, passwd and
usermod and gives a nice overview of the PAM authentication system.
One discipline which regularly gives headaches to new and seasoned network administrators
is "Printing and Printers" (Chapter 11). After giving an introduction to spooler and printer types
the book approaches this topic from the point of view of the 7 connection types:
- Local serial printing
- Local USB Printing
- Local parallel printing
- Remote printing
- Raw network Socket printing
- IPP
- Terminal servers
All necessary commands are explained and their usage shown within each section.
This is the only chapter that I found NOT to have troubleshooting scenarios; however, that
is perhaps due to the fact that the authors describe each connection type (including troubleshooting
points) in great detail in their respective sections.
The book finishes off in Chapter 15 by describing X Window problems with the two most commonly used X servers (Xorg and XFree86),
their components and the client - server model.
Again, the troubleshooting scenarios are well chosen. I was immediately reminded of the time I had those issues on some machines
a while back and did not find any info on the internet on how to troubleshoot them.
- Troubleshooting X-forwarding via SSH
- Troubleshooting the X-server not starting up on a Dual Head video card.
Finally the book concludes with a 21 page index.
Summary:
The book delivers more than what immediately meets the eye when reading the TOC. The book
over delivers, so to speak. The TOC actually doesn't describe the many facets Kirkland, Carmichael and
the Tinker brothers cover in the little side notes or hints within the paragraphs. The
book is written in a way that beginners as well as seasoned system admins can both benefit
from it, although their mileage may vary.
It keeps a good pace and is motivating to read. Junior administrators won't feel offended
in case there is a topic they haven't heard of yet, while intermediate administrators will
get a brush up on troubleshooting best practices and on top a nice overview of the
big picture.
It delivers solutions and troubleshooting skills straight to your door. This is one of
those books you want to keep within arm's reach, because you will find yourself
referring to it more often than you want to imagine.
I would rate the book's contents with full marks. The reason I gave the book "only" 4 of 5
stars was that it contains some spelling mistakes and the occasional unfortunate turn of
phrase. My personal preference would have been reordered chapters, but the contents alone
make the book well worth it.
My review is probably a bit unusually long, but I hope that many people will find the provided additional information useful.
This review is linked to from: