This morning I had issues with my linickx.com cluster: the file system on both nodes went read-only and I ended up in a world of pain.
[root@georgia ~]# sudo /etc/init.d/httpd start
Starting httpd:
[root@georgia ~]# tail -f /var/log/messages
Jan 9 09:48:35 georgia kernel: [ 474.259265] (httpd,1712,0):ocfs2_reserve_clusters_with_limit:1190 ERROR: status = -22
Jan 9 09:48:35 georgia kernel: [ 474.259271] (httpd,1712,0):ocfs2_lock_allocators:2546 ERROR: status = -22
Jan 9 09:48:35 georgia kernel: [ 474.259276] (httpd,1712,0):ocfs2_write_begin_nolock:1732 ERROR: status = -22
Jan 9 09:48:35 georgia kernel: [ 474.259282] (httpd,1712,0):ocfs2_write_begin:1856 ERROR: status = -22
Jan 9 09:49:31 georgia kernel: [ 530.660071] o2net: no longer connected to node amy (num 1) at 10.176.128.7:7777
Jan 9 09:49:31 georgia kernel: [ 530.661856] ocfs2: Unmounting device (147,0) on (node 2)
Jan 9 09:59:46 georgia kernel: [ 1145.772174] o2dlm: Nodes in domain E9447DBE28154DAEA1B988CEC573EB64: 2
Jan 9 10:01:05 georgia kernel: [ 1223.911192] o2net: connected to node amy (num 1) at 10.176.128.7:7777
Jan 9 10:01:09 georgia kernel: [ 1227.933348] o2dlm: Nodes in domain E9447DBE28154DAEA1B988CEC573EB64: 1 2
Jan 9 10:01:09 georgia kernel: [ 1227.938693] ocfs2: Mounting device (147,0) on (node 2, slot 1) with ordered data mode.
Jan 9 10:02:35 georgia kernel: [ 1314.467741] OCFS2: ERROR (device drbd0): ocfs2_validate_gd_self: Group descriptor #419328 has bit count 32256 but claims that 45941 are free
Jan 9 10:02:35 georgia kernel: [ 1314.467754] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Jan 9 10:02:35 georgia kernel: [ 1314.467764] (httpd,2389,0):ocfs2_search_chain:1729 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467771] (httpd,2389,0):ocfs2_claim_suballoc_bits:1902 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467778] (httpd,2389,0):__ocfs2_claim_clusters:2185 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467783] (httpd,2389,0):ocfs2_local_alloc_new_window:1204 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467790] (httpd,2389,0):ocfs2_local_alloc_slide_window:1306 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467798] (httpd,2389,0):ocfs2_reserve_local_alloc_bits:695 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467803] (httpd,2389,0):ocfs2_reserve_clusters_with_limit:1190 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467809] (httpd,2389,0):ocfs2_lock_allocators:2546 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467814] (httpd,2389,0):ocfs2_write_begin_nolock:1732 ERROR: status = -22
Jan 9 10:02:35 georgia kernel: [ 1314.467821] (httpd,2389,0):ocfs2_write_begin:1856 ERROR: status = -22
Jan 9 10:02:36 georgia kernel: [ 1315.046965] OCFS2: ERROR (device drbd0): ocfs2_validate_gd_self: Group descriptor #419328 has bit count 32256 but claims that 45941 are free
^C
[root@georgia ~]#
What made this odd is that running fsck.ocfs2 as suggested made no difference, as the output said that the disk was clean.
[root@georgia ~]# fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.4.4
Checking OCFS2 filesystem in /dev/drbd0:
Label: linickxcluster
UUID: E9447DBE28154DAEA1B988CEC573EB64
Number of blocks: 1048535
Block size: 4096
Number of clusters: 1048535
Cluster size: 4096
Number of slots: 4
/dev/drbd0 is clean. It will be checked after 20 additional mounts.
[root@georgia ~]#
I learned that the above output was in fact a lie! For any future googlers seeing the same issue, run:
fsck.ocfs2 -fy /dev/drbd0
The -f option forces a check even when the filesystem is marked clean, and -y answers yes to every repair prompt. The forced check on my filesystem found the errors, and we appear to be back online :)
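If you want to spot this condition quickly next time, here is a rough shell sketch. The two function names are my own invention (they are not part of ocfs2-tools), and the script only matches the kernel message quoted in the log above and prints the forced-check command rather than running it. As the kernel message says, the volume should be unmounted before fsck.ocfs2 is run; I am assuming that means unmounted on every node of the cluster.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not ocfs2-tools commands.

# Succeeds (exit status 0) when the log text on stdin contains the kernel
# message OCFS2 prints when it flips the volume to read-only.
ocfs2_went_readonly() {
    grep -q 'File system is now read-only due to the potential of on-disk corruption'
}

# Prints the forced-check command for the given device instead of running it,
# so it can be reviewed first. Unmount the volume before actually running it.
forced_fsck_command() {
    printf 'fsck.ocfs2 -fy %s\n' "$1"
}
```

Usage would be something like: ocfs2_went_readonly < /var/log/messages && forced_fsck_command /dev/drbd0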