Bugs/grub
From openSUSE
How to report grub bugs
First of all: there are no grub bugs ;) Usually there's a broken BIOS or the disks have been partitioned improperly, or an unsuitable setup has been chosen. There are also cases where buggy video hardware or a video BIOS disturbs the operation of SuSE's graphical interface to boot loaders, which lives in a separate package.
This article exists to give you some insight into PC booting, and to enable you to narrow down your booting problem so you can fix minor configuration issues, or report a bug in a manner such that it can get fixed.
Common Phrases
"But I just accepted the proposed defaults"
The bugs and quirks manufacturers have in their machines are numerous; it's hard to generically determine "the right way" to boot Linux. The only sure thing is that an operating system on the first partition of the only disk will boot very well, if the partition is "primary", physically first on the disk, and not "too big" (depending on the age of the BIOS).
"But Windows can boot without problems"
Since most computers still have Windows pre-installed, it sits on the first partition of the only disk, and hence boots fine. That's a piece of cake. Besides, hardware vendors test whether their machine can boot Windows (sometimes one gets the impression that this is all they ever test). It only gets tricky when you have more than one operating system on your computer, and some of them must be able to boot from less favourable positions on the disk. Additionally, you surely want to choose at some point during every boot which operating system you'd like to run this time. This creates an additional challenge if the suspend-to-disk power saving feature is used on laptops.
"But it works when I use LILO"
Grub was written with the naïve assumption that BIOS vendors at least try to adhere to standards and common practice. Grub determines a lot at run-time during boot, querying the BIOS for its features. It is a surprisingly common case that a BIOS has certain features, but the corresponding query mechanism is broken. LBA support and the A20 gate were the latest occurrences. Some features may not work reliably or in the expected way. Once the BIOS misguides grub, there is no fallback; booting fails in a more or less mysterious way.
LILO was written with a fundamental distrust in the BIOS, and expects the worst performance and poorest ability. It gets as much information as possible during install time from the running Linux kernel, and never asks the BIOS about features. It uses them, and records for itself whether they work or not. Thus, there are some cases where a switch to LILO will help.
However, when you include disk parts in the booting process that are inaccessible via the BIOS, or when the Linux installer gathered incorrect information, even LILO won't help you. Also, buggy video cards harm a LILO boot as well, via the common gfxboot code. On the other hand, you can also force grub to use LBA, no matter what the BIOS answers to the query, and you can forcibly use certain disks.
"When I run grub-install ..."
grub-install is a simple script included in the grub package, and it is mentioned in many online documents, hence we cannot remove it as we would like to. It's not bad if you have nothing else, but its heuristics and actions fall far behind what Yast can do during an installation. After all findings and user input, Yast will write the device.map and /etc/grub.conf. Use these! The command line is
grub --batch < /etc/grub.conf
With grub-install, you are on your own to install the boot loader.
Starting with openSUSE 10.3, the original grub-install has therefore been renamed to grub-install.unsupported to reduce this confusion. The new grub-install command will instead take reasonable action suitable for SUSE Linux, e.g. run "grub --batch", or call yast2 to create the configuration files first, if necessary.
Disk Layout
How many disks are there, and how does the BIOS see them?
In the most trivial case, there is only one hard disk in the system. This disk will be known to the BIOS as drive number 0x80, and will be the hard disk boot drive. Similarly, you will know what the disk is called in Linux by looking at /etc/fstab or similar, and seeing all those /dev/hda* or /dev/sda* entries, respectively.
When there's more than one disk, things start to get complicated, as there is no easy relationship between Linux devices and BIOS drive numbers. If you are lucky, your BIOS supports the "Enhanced Disk Drive" services, or EDD for short. Issuing a few BIOS calls, Linux can record some information about disks known to the BIOS, before the Linux kernel really gets going. This information can later be obtained under /sys/firmware/edd/int13_dev8?/ , maybe after loading the "edd" module into the kernel. If all of your /sys/firmware/edd/int13_dev8?/mbr_signature files are different, you have resolved this issue, as these values can now easily be matched to the first blocks on your Linux devices. Otherwise you may get hints from the disks' sizes, as shown in each "sectors" file.
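The matching of signatures can be sketched in shell. This is only an illustration under two assumptions that are not spelled out above: that the MBR signature is the 32-bit little-endian value at byte offset 440 of the first block, and that the edd module prints it as 0x followed by eight hex digits. The function names are made up for this example.

```sh
# Print the 32-bit signature stored at byte offset 440 of a disk or image
# file, in the 0x%08x form the edd module is assumed to use.
disk_signature() {
    # od prints the four bytes in on-disk order; the value is little-endian,
    # so they are emitted in reverse.
    set -- $(od -An -tx1 -j440 -N4 "$1")
    printf '0x%s%s%s%s\n' "$4" "$3" "$2" "$1"
}

# Print "BIOS-drive signature" for every EDD entry below the given
# directory (default: /sys/firmware/edd).
edd_signatures() {
    edd_dir=${1:-/sys/firmware/edd}
    for f in "$edd_dir"/int13_dev8?/mbr_signature; do
        [ -r "$f" ] || continue          # no EDD data, or module not loaded
        dev=${f%/mbr_signature}
        printf '%s %s\n' "${dev##*/int13_dev}" "$(cat "$f")"
    done
}
```

Run as root, disk_signature /dev/hda (or any other disk) gives the value to look for in the output of edd_signatures.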
In any case the Linux devices corresponding to BIOS numbers must be recorded in the file /boot/grub/device.map. You'll see lines like
(hd0) /dev/hda
meaning Linux' /dev/hda is what the BIOS thinks is hard disk 0, numbered 0x80 (0x00 would be the first floppy drive). Make sure this mapping is correct, if you can. Yast will have generated this file using some good guessing, but you had better check.
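To review the mapping at a glance, a small helper can translate each device.map line into the BIOS drive number it implies. This is a sketch that assumes (hdN) stands for BIOS drive 0x80+N and (fdN) for 0x00+N, which is what the numbering above suggests; the function name is hypothetical.

```sh
# Print "BIOS-number Linux-device" for each entry of a device.map file.
device_map_bios_numbers() {
    while read -r drive dev; do
        case $drive in
            "(hd"*")") n=${drive#"(hd"}; n=${n%")"}; base=128 ;;  # hard disks start at 0x80
            "(fd"*")") n=${drive#"(fd"}; n=${n%")"}; base=0 ;;    # floppies start at 0x00
            *) continue ;;                                        # blank or unknown line
        esac
        printf '0x%02x %s\n' $((base + n)) "$dev"
    done < "$1"
}
```

Calling device_map_bios_numbers /boot/grub/device.map lists the numbers to compare against what the BIOS setup or the EDD data report.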
Currently coming up are what we call fake RAIDs, multiple (S)ATA drives in a system merged to form one single BIOS device using only (BIOS) software. Linux sees just a bunch of separate, identical disks, which is what the hardware really is like, but the BIOS knows this "RAID" only by fewer disk numbers. Yast tries hard to look at the disk contents and to set up a Linux software RAID that exactly corresponds to the BIOS device, but again this cannot be perfect because of the many "standards" out there.
Real RAID controllers that properly hide multiple disks behind a single disk controller interface supply a single boot disk both to Linux and the BIOS, and pose no problem to boot loader configuration.
How are the disks partitioned?
The first block on a disk is called the "master boot record", or MBR for short. All BIOSes reliably load disk 0x80's first block into memory and jump to it. The MBR also holds the primary part of the traditional DOS partition table, consisting of 4 entries. These are called the "primary partitions". When it turned out that 4 partitions fall a little short, "extended" partitions were invented. These are primary partitions with a special type number, which may hold further sub-partitions. This is purely a convention among operating systems. Linux extended partitions are recognised by the type number 0x85, Windows extended partitions these days carry the 0x0f type. The Linux kernel will discover both types of extended partitions. The sub-partitions within any of these are called "logical" partitions. Linux will attach numbers to the disk name when it refers to partitions. The first four will always be assigned to the primary partitions, so logical partitions will always start with number 5.
Example:
  Disk /dev/hde: 163.9 GB, 163928604672 bytes
  255 heads, 63 sectors/track, 19929 cylinders
  Units = cylinders of 16065 * 512 = 8225280 bytes

     Device Boot      Start         End      Blocks   Id  System
  /dev/hde1   *           1         997     8008371   83  Linux
  /dev/hde2             998        1994     8008402+  83  Linux
  /dev/hde3            1995        2991     8008402+  a9  NetBSD
  /dev/hde4            2992       19929   136054485   85  Linux extended
  /dev/hde5            2992       15150    97667136   83  Linux
  /dev/hde6           15151       15649     4008186   83  Linux
  /dev/hde7           15650       19929    34379068+  83  Linux
The four primary partitions here are 2x 8GB Linux (type 0x83), 8GB NetBSD (type 0xa9), and the rest of the disk is covered by the "Linux extended" container hde4 (type 0x85). That one in turn holds 3 further Linux partitions, of which hde6 and hde7 are certainly out of the question to boot anything from, at least via grub.
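The type numbers of the four primary entries can also be read straight from the first block. The sketch below assumes the traditional DOS layout, with the table starting at byte 446 and 16 bytes per entry, the type being the fifth byte of each entry. It also treats 0x05 as extended, which is the classic DOS type and not mentioned above. The function name is made up.

```sh
# Print "number type kind" for the four primary entries in the first block
# of a disk or image file.
primary_partition_types() {
    i=1
    while [ "$i" -le 4 ]; do
        # entry i starts at byte 446 + 16*(i-1); the type is its fifth byte
        off=$((446 + (i - 1) * 16 + 4))
        id=$(od -An -tx1 -j"$off" -N1 "$1" | tr -d ' \n')
        case $id in
            05|0f|85) kind=extended ;;   # containers for logical partitions
            00)       kind=unused ;;
            *)        kind=primary ;;
        esac
        echo "$i $id $kind"
        i=$((i + 1))
    done
}
```

For the example disk, primary_partition_types /dev/hde should report entry 4 as type 85, extended; the logical partitions inside it are then numbered from 5 upwards.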
Independently of this scheme, BSD-derived operating systems use their own partitioning data on PCs, and call it "disk label". Linux can be configured to detect these, too, and will integrate the BSD partitions like logical partitions. In the above example, the kernel does not recognise the NetBSD label. If it were configured to do so, the BSD partitions would be merged as hde5 and up, shifting the names of the logical partitions within hde4 upwards. Grub has a clear distinction between extended partitions and BSD disk labels, so it is wise to keep BSD partitions opaque for Linux. Report a yast bug if you suspect that your BSD disk label is not handled properly with the kernel config switch CONFIG_BSD_DISKLABEL turned on.
Intel has invented a totally new scheme to partition disks, called GPT. The real partition table resides in the blocks right after the first one, with a backup copy at the end of the disk; a compatibility table is put into the first block, and the BIOS still loads that for booting. Linux will find a GPT partition table, and will prefer that over the primary partitions defined in the MBR. Disks with a leftover GPT generate "funny" effects during boot loader configuration.
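A quick way to spot a leftover GPT is to look for its header signature. The sketch below assumes 512-byte blocks and the ASCII signature "EFI PART" at the start of the second block; the function name is hypothetical, and it does not look at the backup copy at the end of the disk.

```sh
# Succeed if the second 512-byte block of a disk or image file starts with
# the GPT header signature.
has_gpt_header() {
    # tr removes NUL bytes so an empty block compares as an empty string
    sig=$(dd if="$1" bs=1 skip=512 count=8 2>/dev/null | tr -d '\000')
    [ "$sig" = "EFI PART" ]
}
```

For example, has_gpt_header /dev/sda && echo "GPT header present" points at a disk that deserves a closer look before the boot loader is configured.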
Limitations
Depending on the age and bug set of a BIOS, it may not be able to read all disk blocks into memory. There's little that can be done about this.
- 1024 cylinders: This historical limit strikes when LBA mode is not supported, or reported not to be supported. This is rarely found these days, but still not extinct. Old BIOSes reach this limit at 504MB.
- 8GB: This is the 1024 cylinder limit when a "Xlated" C/H/S geometry (XCHS) is used. The largest geometry supported by cylinder/head/sector based calls is 1024/255/63, resulting in something below 8GB of disk space.
When LBA is fully supported, the BIOS limit is at least as high as 120GB. Disks larger than that must support an extension called LBA48 to be accessible at all; if the BIOS also knows how to utilise this (do not take this for granted), the disk blocks required for booting may reside as high as 2^32 blocks, or 2TB. This is the maximum that grub currently supports. Depending on the actual geometry used, and the potential add-on BIOS in effect, the disks in the system may have different limits on which parts can be used for booting.
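The limits above follow from simple arithmetic on 512-byte blocks, which can be reproduced in the shell. The geometries are assumptions for illustration: 1024/16/63 for the untranslated case behind the 504MB figure, 1024/255/63 for the translated one, and 28-bit block numbers for LBA without the LBA48 extension.

```sh
# Bytes addressable through C/H/S calls with the given geometry.
chs_limit_bytes() {
    echo $(( $1 * $2 * $3 * 512 ))
}

# Bytes addressable with block numbers of the given width in bits.
# Needs 64-bit shell arithmetic for a width of 32.
lba_limit_bytes() {
    echo $(( (1 << $1) * 512 ))
}

chs_limit_bytes 1024 16 63     # untranslated geometry: 504 * 2^20 bytes
chs_limit_bytes 1024 255 63    # translated geometry: a bit below 8 * 2^30
lba_limit_bytes 28             # LBA without LBA48: 128 * 2^30 bytes
lba_limit_bytes 32             # 2^32 blocks: 2 * 2^40 bytes
```

The four calls print 528482304, 8422686720, 137438953472 and 2199023255552.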
We have also seen BIOSes, mainly on IBM laptops, that report and access a somewhat smaller disk than installed, because of a reserved area used for device recovery.
When any part of the boot setup lies beyond a relevant BIOS limit, booting will fail. Should the limit be within a partition, it's a game of chance whether the booting code ends up completely before, or partly after that limit.
How does a PC boot / How can I set up a working GRUB?
As mentioned above, the only sure thing about PC disk booting is that the BIOS will load the first block into memory and likely execute its code. Very few BIOSes apply one or more sanity checks to the MBR.
We recommend keeping the MBR "neutral", that is, not considering it part of any operating system. For that purpose, a generic MBR can be used that simply determines one of the 4 primary partitions by a bit flag, and then loads the first block of that partition in turn, to continue the boot process. The Yast installation offers the option to install such generic code in the MBR; do it when in doubt. The next operating system to be installed may do something similar, so all should agree on the expected behaviour of the MBR. There might also be an older SuSE Linux installation left on the disk, with some boot loader's first stage in the MBR. With a new boot loader installed over the rest, the remaining first stage alone will not work any more. Again, laptops have been found to abuse the MBR for recovery. They work well when you assume the generic behaviour, but things break when the MBR gets overwritten. Use a generic MBR wherever you can.
Where and how does a disk boot start?
Because both the BIOS and a generic MBR load only a single 512-Byte disk block into memory, that's what all PC boot loaders have to start with. The piece of code grub uses at that point is called stage1, for obvious reasons.
The BIOS will happily load a stage1 located in the first disk's MBR, although this is discouraged as described above. Further, any generic MBR will load a stage1 from the first block of a primary partition marked as "active" or "bootable", as long as there is exactly one such primary partition. Besides that, there are many non-standard ways to get stage1 into memory, the most popular of which is probably the chainloader feature found in capable boot loaders.
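Whether exactly one primary partition is marked active can be checked from the first block of the disk. This sketch assumes the traditional layout, where each of the four 16-byte entries starting at byte 446 begins with a flag byte that is 0x80 for the active partition. The function name is made up.

```sh
# Print the numbers of all primary partitions whose boot flag is set, as
# found in the first block of a disk or image file.
active_primary_partitions() {
    i=1
    while [ "$i" -le 4 ]; do
        off=$((446 + (i - 1) * 16))      # first byte of entry i is the flag
        flag=$(od -An -tx1 -j"$off" -N1 "$1" | tr -d ' \n')
        if [ "$flag" = 80 ]; then echo "$i"; fi
        i=$((i + 1))
    done
}
```

A generic MBR can only do its job if active_primary_partitions /dev/hda (or the respective boot disk) prints a single number.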
When any grub stage1 has made it into the PC's memory and gets executed, it will immediately print out "GRUB" on the BIOS default console. When you see this message, it may not necessarily be from the right stage1; it might also be from a stage1 left over from a previous installation. That is another reason to keep stage1 tightly coupled to your current installation (in the "/" or "/boot" partition), and not abuse the MBR.
If you do not see the "GRUB"-message at all, have a look at the disk block where stage1 was expected to be written; the commands dd and xxd come in handy here, e.g.: dd if=/dev/hda1 count=1 | xxd ; if xxd is unavailable, od -c or hexdump will suffice.
Without a "GRUB"-message being printed somewhere your problem is not a grub bug, as no portion of grub ever got executed.
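Instead of reading the hex dump by eye, the two most telling properties of the block can be tested by a script. This is a sketch under two assumptions: that a bootable block ends in the bytes 0x55 0xaa (the usual PC boot signature, not discussed above), and that a grub stage1 carries the text "GRUB" it prints somewhere within its 512 bytes. The function name is hypothetical.

```sh
# Report whether the first block of a partition, disk or image file looks
# like a boot block, and whether it contains the text "GRUB".
check_boot_block() {
    sig=$(od -An -tx1 -j510 -N2 "$1" | tr -d ' \n')
    if [ "$sig" = 55aa ]; then
        echo "boot signature: present"
    else
        echo "boot signature: missing"
    fi
    # only the first 512-byte block is searched
    if dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB; then
        echo "GRUB string: found"
    else
        echo "GRUB string: not found"
    fi
}
```

Run it as root on the place where stage1 was supposed to go, e.g. check_boot_block /dev/hda1, and on the MBR of the boot disk for comparison.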
File Systems or: How to proceed after stage1
Besides the message printing, stage1 contains just enough code to load the blocks for the next stage. If that fails it is usually because of one of two reasons:
- the stage1 currently running is "orphaned", and the further parts of the boot loader are no longer at the physical disk location where the grub installation has recorded them, or
- the BIOS disk ordering is wrong, meaning stage1 is looking for the next stage on a totally wrong disk drive.
In the former case it helps to install a generic MBR or otherwise make sure the correct stage1 gets loaded, maybe after properly (re-)installing grub to disk. In the latter case it is probably necessary to force a BIOS drive number into stage1 to continue from.
This is accomplished using the "d"-flag (just the lonely lowercase letter) in the "install" command in /etc/grub.conf. Yast will do this automatically whenever stage1 and stage2 are on different BIOS disks. The "setup" command in /etc/grub.conf will likewise do it automatically. But there are also BIOSes that report a bogus drive number for the disk stage1 was loaded from. In that case stage1 will continue loading from "that same disk" and hence fail, unless "d" is specified manually.
Usually the file system used for /boot (or "/" if there is no such mount point) leaves just enough space at the beginning to store a stage1 block, which in turn loads stage2 by absolute block numbers, which then enables the use of path names. However, some file systems have peculiarities worth mentioning:
ext3: it is possible to manually increase the inode size to e.g. 256 bytes. This works with the Linux kernel, but the simple read-only file system code used inside grub will no longer recognise these files. If you want to increase the inode size on your root partition then use a /boot partition with unaltered inodes.
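To see whether an existing configuration already carries the flag, the "install" lines can be inspected mechanically. The sketch assumes the argument order of the grub "install" command, in which options starting with "--" come first, then the stage1 file, then the optional "d"; the function name and the sample paths in the comment are only illustrative.

```sh
# For every "install" line of a grub.conf-style file, report whether the
# "d" flag follows the stage1 file name. Example of a line with the flag:
#   install --stage2=/boot/grub/stage2 /boot/grub/stage1 d (hd0) /boot/grub/stage2 ...
install_uses_d_flag() {
    awk '$1 == "install" {
        i = 2
        while (i <= NF && $i ~ /^--/) i++    # skip options such as --force-lba
        # field i is the stage1 file; the optional flag comes right after it
        print (($(i + 1) == "d") ? "d-flag set" : "d-flag missing")
    }' "$1"
}
```

install_uses_d_flag /etc/grub.conf prints nothing when the file uses "setup" instead of "install".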
XFS: xfs starts at the very first block of the partition it occupies, so there is no room for a stage1. stage1 needs to be stored elsewhere; maybe the first block of an extended partition or, if there's only one single Linux installation on the machine, maybe the MBR.
ReiserFS: reiserfs leaves 64 kilobytes of space at its beginning for boot loaders. In this case, grub "setup" or Yast choose a slightly more robust way to install the boot loader binary: the stage1, preferably the first block in the reiserfs partition, will load an intermediate "stage1.5", by absolute disk block numbers, starting at the second block in the partition. This stage1.5 contains code to actually read the reiserfs directories and files, and will then load in turn e.g. "boot/grub/stage2" by its path name from that partition.
A stage1.5 can as well be placed right after the MBR, again with the appropriate caveats that other OSes or BIOS recovery might also try to use this location.
Other Booting Issues
gfxboot
Does it work better after the "gfxmenu" line is commented out? You might then see a text-mode only menu during hard disk boot, which still lets you choose what to boot. In this case, report the buggy graphics hardware against the gfxboot package; the maintainer might add another workaround.
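Commenting out the line can be done with any editor; for repeated tests a small filter is handy. This is a sketch that writes the changed menu to standard output and leaves the original file alone; the function name is made up, and the gfxmenu argument shown in the comment is only an example.

```sh
# Print a copy of a menu.lst-style file with every "gfxmenu" line turned
# into a comment, e.g. "gfxmenu (hd0,1)/boot/message" becomes
# "#gfxmenu (hd0,1)/boot/message".
comment_out_gfxmenu() {
    sed 's/^\([[:space:]]*\)gfxmenu/\1#gfxmenu/' "$1"
}
```

Review the output of comment_out_gfxmenu /boot/grub/menu.lst first, then save it to a new file and copy that over the original once it looks right.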
Booting from the SuSE Linux DVD
The installation DVDs boot using a variant of syslinux, called isolinux. This has nothing to do with grub, other than both sharing the same graphics routines from gfxboot. If grub has a graphics problem, the boot DVD likely has as well, and vice versa. See the gfxboot section above in that case.
Network Booting
Due to limited resources, we cannot help much with grub's network booting features. They are in the code, they compile, so we include them. Currently there is no one playing with them, let alone testing them thoroughly. It's good if they work for you. Too bad if they don't.
Troubleshooting
- Does it print "GRUB", i.e. does the boot process get to any grub stage1? If not:
-> check which drive is 0x80, and whether there's a loading chain from the MBR to your expected stage1.
-> is there a grub stage1 at the expected location?
- It prints "GRUB loading ...", but then hangs or reboots:
-> check whether stage1 is really the one you just installed, see above.
-> "loading stage2" is the last thing printed before it jumps to stage2. stage2 might actually be running. If grub "hangs" before or at the graphics screen, comment out the "gfxmenu" line and try again.
Report unsuitable partitioning proposals against the installation. Yast should create a Linux primary partition on the first disk to boot from, whenever possible.
Report broken configuration files against the yast2-bootloader package.
Report buggy video to the gfxboot package. GFXboot uses the VESA-"standard", but not every vendor is really following it.
Real Bugs
"(I think) I have ruled out all the problems above. Grub is buggy."
Include dumps of all disks' MBRs, as well as of the disk block where stage1 should have gone; the relevant contents of /sys/firmware/edd/ (at least cylinders, heads and sectors, both legacy and default, the total sector count and the MBR signature); fdisk -l and fdisk -lu output or equivalent; grub's config files /etc/grub.conf, /boot/grub/device.map and menu.lst; and the output of grub --batch < /etc/grub.conf if it shows anything peculiar.
Bugzilla tells you not to paste log files into comments; however, the items above, except maybe the boot block dumps, are small enough to be pasted into the report directly.
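Gathering these items by hand is tedious, so here is a sketch of a collector. The function name, the output file names and the optional second argument (an alternative root directory, handy for trying the function on a copy of the files) are all made up for this example; menu.lst is assumed to live in /boot/grub. The block where stage1 should have gone is not covered, since only you know where that is.

```sh
# Collect grub's config files, the EDD values and the first block of every
# disk named in device.map into directory $1. With an alternative root in
# $2, fdisk is skipped, as it only makes sense on the live system.
collect_grub_report() {
    out=$1
    root=${2:-}
    mkdir -p "$out" || return 1

    for f in /etc/grub.conf /boot/grub/device.map /boot/grub/menu.lst; do
        if [ -r "$root$f" ]; then cp "$root$f" "$out/${f##*/}"; fi
    done

    # every readable EDD value as one "file:value" line; /dev/null forces
    # grep to print file names even for a single match
    for d in "$root"/sys/firmware/edd/int13_dev8?; do
        if [ -d "$d" ]; then grep -s . "$d"/* /dev/null || true; fi
    done > "$out/edd.txt"

    # hex dump of the first block of each disk in device.map
    if [ -r "$root/boot/grub/device.map" ]; then
        while read -r drive dev; do
            [ -r "$dev" ] || continue
            dd if="$dev" bs=512 count=1 2>/dev/null |
                od -Ax -tx1 > "$out/mbr-${dev##*/}.txt"
        done < "$root/boot/grub/device.map"
    fi

    if [ -z "$root" ] && command -v fdisk >/dev/null 2>&1; then
        fdisk -l  > "$out/fdisk-l.txt"  2>&1 || true
        fdisk -lu > "$out/fdisk-lu.txt" 2>&1 || true
    fi
    return 0
}
```

Run as root, collect_grub_report /tmp/grub-report leaves a handful of small text files that can be reviewed and then pasted or attached.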
Alternatively, you're welcome to supply a patch that fixes your problem and is guaranteed not to break any of the other workarounds.

