
This morning I had issues with my linickx.com cluster: the file system on both nodes went read-only and I ended up in a world of pain.

[root@georgia ~]# sudo /etc/init.d/httpd start
Starting httpd: 
[root@georgia ~]# tail -f /var/log/messages
Jan  9 09:48:35 georgia kernel: [  474.259265] (httpd,1712,0):ocfs2_reserve_clusters_with_limit:1190 ERROR: status = -22
Jan  9 09:48:35 georgia kernel: [  474.259271] (httpd,1712,0):ocfs2_lock_allocators:2546 ERROR: status = -22
Jan  9 09:48:35 georgia kernel: [  474.259276] (httpd,1712,0):ocfs2_write_begin_nolock:1732 ERROR: status = -22
Jan  9 09:48:35 georgia kernel: [  474.259282] (httpd,1712,0):ocfs2_write_begin:1856 ERROR: status = -22
Jan  9 09:49:31 georgia kernel: [  530.660071] o2net: no longer connected to node amy (num 1) at 10.176.128.7:7777
Jan  9 09:49:31 georgia kernel: [  530.661856] ocfs2: Unmounting device (147,0) on (node 2)
Jan  9 09:59:46 georgia kernel: [ 1145.772174] o2dlm: Nodes in domain E9447DBE28154DAEA1B988CEC573EB64: 2 
Jan  9 10:01:05 georgia kernel: [ 1223.911192] o2net: connected to node amy (num 1) at 10.176.128.7:7777
Jan  9 10:01:09 georgia kernel: [ 1227.933348] o2dlm: Nodes in domain E9447DBE28154DAEA1B988CEC573EB64: 1 2 
Jan  9 10:01:09 georgia kernel: [ 1227.938693] ocfs2: Mounting device (147,0) on (node 2, slot 1) with ordered data mode.
Jan  9 10:02:35 georgia kernel: [ 1314.467741] OCFS2: ERROR (device drbd0): ocfs2_validate_gd_self: Group descriptor #419328 has bit count 32256 but claims that 45941 are free
Jan  9 10:02:35 georgia kernel: [ 1314.467754] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Jan  9 10:02:35 georgia kernel: [ 1314.467764] (httpd,2389,0):ocfs2_search_chain:1729 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467771] (httpd,2389,0):ocfs2_claim_suballoc_bits:1902 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467778] (httpd,2389,0):__ocfs2_claim_clusters:2185 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467783] (httpd,2389,0):ocfs2_local_alloc_new_window:1204 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467790] (httpd,2389,0):ocfs2_local_alloc_slide_window:1306 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467798] (httpd,2389,0):ocfs2_reserve_local_alloc_bits:695 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467803] (httpd,2389,0):ocfs2_reserve_clusters_with_limit:1190 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467809] (httpd,2389,0):ocfs2_lock_allocators:2546 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467814] (httpd,2389,0):ocfs2_write_begin_nolock:1732 ERROR: status = -22
Jan  9 10:02:35 georgia kernel: [ 1314.467821] (httpd,2389,0):ocfs2_write_begin:1856 ERROR: status = -22
Jan  9 10:02:36 georgia kernel: [ 1315.046965] OCFS2: ERROR (device drbd0): ocfs2_validate_gd_self: Group descriptor #419328 has bit count 32256 but claims that 45941 are free
^C
[root@georgia ~]#
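
If you'd rather not tail the log by hand, the lines that matter can be pulled out with grep. The function below is a minimal sketch. It assumes kernel messages land in /var/log/messages in the same format as the log above. The name `ocfs2_corruption_events` is made up for illustration and is not part of ocfs2-tools.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocfs2-tools): print the OCFS2 messages
# explaining why a volume went read-only. Assumes syslog lines formatted
# like the kernel log above.
ocfs2_corruption_events() {
    log="${1:-/var/log/messages}"
    # "OCFS2: ERROR (device ...)"    -> the on-disk consistency check that failed
    # "File system is now read-only" -> the point the mount was flipped read-only
    # The "status = -22" call-chain lines are deliberately not matched.
    # grep exits non-zero when nothing matches, so this also works inside an if.
    grep -E 'OCFS2: ERROR \(device|File system is now read-only' "$log"
}
```

Run against the log above, `ocfs2_corruption_events /var/log/messages` should print the two ocfs2_validate_gd_self errors and the read-only notice, skipping the httpd call-chain noise.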

What made this odd is that running fsck.ocfs2 as suggested made no difference; its output said the disk was clean.

[root@georgia ~]# fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.4.4
Checking OCFS2 filesystem in /dev/drbd0:
  Label:              linickxcluster
  UUID:               E9447DBE28154DAEA1B988CEC573EB64
  Number of blocks:   1048535
  Block size:         4096
  Number of clusters: 1048535
  Cluster size:       4096
  Number of slots:    4

/dev/drbd0 is clean.  It will be checked after 20 additional mounts.
[root@georgia ~]#

I have since learned that the output above was a lie! For any future googlers seeing the same issue, run:

fsck.ocfs2 -fy /dev/drbd0

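The -f flag forces a check even though the file system is marked clean, and -y answers "yes" to every prompt so any problems found are fixed. Forcing the check on my file system found the errors, and we appear to be back online :)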
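
As the kernel message says, fsck.ocfs2 should only be run once the file system is unmounted, which on a cluster means stopping httpd and unmounting on every node (amy as well as georgia here). The wrapper below is a rough sketch of that last step, not a tested procedure. The function name and its arguments are hypothetical. It can only verify that the volume is unmounted on the node it runs on.

```shell
#!/bin/sh
# Hypothetical wrapper around "fsck.ocfs2 -fy" (name and arguments are placeholders).
# Stop httpd and unmount the volume on EVERY node first; this function can
# only check the local node.
ocfs2_force_fsck() {
    dev="$1"   # block device, e.g. /dev/drbd0
    mnt="$2"   # where the volume is normally mounted (no trailing slash)
    # Each /proc/mounts line is "device mountpoint type options ...", so a
    # volume still mounted here shows up as " $mnt " in that file.
    if grep -qs " $mnt " /proc/mounts; then
        echo "refusing: $mnt is still mounted on $(uname -n)" >&2
        return 1
    fi
    # -f: check even though the volume is marked clean
    # -y: answer "yes" to every prompt, i.e. apply all repairs found
    fsck.ocfs2 -fy "$dev"
}
```

For example, `ocfs2_force_fsck /dev/drbd0 /srv/cluster`, where /srv/cluster stands in for your own mount point. Because -y accepts every repair without asking, you might prefer to drop it and answer the prompts yourself the first time.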

Nick Bettison ©