Last week we had a failure on a server where two disks in an md RAID set reported bad sectors at the same time. This caused the server to lock up and hang. Normally a quick reboot would have solved the problem. In this case the reboot got stuck while mounting the partitions, with the message “Recovering journal”. The server hung there for a long time and nothing happened. The cause was a corrupt journal: the journal on the ext3 filesystem was probably damaged by the bad blocks on the disks.

It proved to be quite difficult to recover from this error. The following steps were needed to get the server to boot normally again:

  1. Remove the needs_recovery flag if it is set on the partition; otherwise the journal cannot be removed: debugfs -w -R "feature ^needs_recovery" /dev/VolGroupXX/LogVolXX
  2. Remove the journal from the partition: tune2fs -f -O ^has_journal /dev/VolGroupXX/LogVolXX
  3. Check the filesystem: fsck -y /dev/VolGroupXX/LogVolXX
  4. Enable journalling again: tune2fs -j /dev/VolGroupXX/LogVolXX
  5. Reboot
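
Steps 1 to 4 above can be collected into a small shell function. This is only a sketch: the names recover_journal, run and DRY_RUN are made up for this example and are not part of e2fsprogs. It assumes you are working from a rescue environment with the filesystem unmounted. By default it only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# Sketch of steps 1-4 from the list above.
# recover_journal, run and DRY_RUN are hypothetical names used for illustration.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
# Note: the printed form loses the original quoting of the arguments.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_journal() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: recover_journal DEVICE" >&2
        return 1
    fi

    # 1. Clear the needs_recovery flag so the journal can be removed.
    run debugfs -w -R "feature ^needs_recovery" "$dev"

    # 2. Remove the (corrupt) journal from the filesystem.
    run tune2fs -f -O "^has_journal" "$dev"

    # 3. Check and repair the filesystem. fsck exits non-zero when it
    #    fixes errors, so the next step is not chained on its exit status.
    run fsck -y "$dev"

    # 4. Create a fresh journal.
    run tune2fs -j "$dev"
}
```

Calling recover_journal /dev/VolGroupXX/LogVolXX prints the four commands. To actually run them, call it with DRY_RUN=0, and reboot afterwards as in step 5.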

Needless to say, when this happens the disks are ready for the bin. Replace them as soon as possible! 🙂
