Hard disks fixed…?

I just installed my new Intel X25-M. It’s an 80GB second-generation SSD, and it goes like shit off a shovel – well over twice as fast as my [also new] 1TB WD Caviar Black, averaging around 280MB/sec for reads.

I found the WDC drive had the same problem as the two 500GB WDC drives, namely these errors in dmesg, leading to inode corruption:

ata2: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { PHYRdyChg 10B8B DevExch }
ata2: hard resetting link
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete

The problem is apparently due to a broken implementation of NCQ by Western Digital. It results in the drive being turned off every now and then, especially under heavy load (like having 500GB of backups rsync’ed to it!).

The fix, which will hopefully work for me (it has seemed OK for the past hour or so), is to disable NCQ for just that drive, not the entire controller, by setting its queue depth to 1 in /etc/rc.local:

echo 1 > /sys/block/sdb/device/queue_depth
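If you want to see what the queue depth was before changing it, something like the following rough sketch might help. The function name `set_queue_depth` and the `SYSFS_ROOT` override are my own inventions for illustration (the override only exists so the function can be dry-run against a scratch directory); the sysfs path is the same one used in the line above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): set the queue depth for one
# disk and print the value before and after.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
set_queue_depth() {
    dev="$1"      # block device name, e.g. sdb
    depth="$2"    # 1 effectively disables NCQ for that drive
    file="${SYSFS_ROOT:-/sys}/block/$dev/device/queue_depth"

    # Bail out if the attribute is missing or not writable (e.g. not root).
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi

    echo "before: $(cat "$file")"
    echo "$depth" > "$file"
    echo "after: $(cat "$file")"
}
```

Run as root, `set_queue_depth sdb 1` would do the same job as the echo above, and the "before" value is what you would write back to re-enable NCQ later.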

Unfortunately, by the time I found the fix I already had two bad sectors (current pending sector count). I’ve repartitioned and reformatted, so hopefully the drive/SMART will deal with them and it won’t mean a third RMA.
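For keeping an eye on that pending sector count, here is a small sketch that assumes smartmontools is installed and that `smartctl -A` prints its usual ten-column attribute table, with the raw value in the last column. The function name `pending_sectors` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value of the Current_Pending_Sector attribute.
# Assumes the standard attribute table layout, where column 2 is the
# attribute name and column 10 is RAW_VALUE.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}
```

Usage would be along the lines of `smartctl -A /dev/sdb | pending_sectors`, run as root, with /dev/sdb swapped for whichever disk is misbehaving.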

Strangely enough, the 1.5TB/640GB WDC Greens in my fileserver and Dad’s PC don’t seem to have any problems. They run Ubuntu rather than Fedora, though, so maybe they are blacklisted in the kernel driver, or the BIOSes don’t support NCQ. In fact, as I recall, both are running the SATA controller in “IDE” mode.

Next up I have to figure out why my other PC has been switching on and off at random. I expect it’s heat issues (I need to disable overclocking) and/or a flaky PSU (which I’m about to replace).

Update: Disabling NCQ doesn’t seem to have fixed the issue so I’ve re-enabled it. I’ve reworked my PSU cabling so that only the SSD and HDD are on one cable and all the fans and the DVD+RW are on the other cable, as it could be a power issue. I’ve also swapped the SATA connector to the third port on the motherboard in case that’s the problem. Anyway I don’t seem to be seeing the issue at the moment, and oddly enough the bad sectors in SMART have disappeared too!

Update 2: Spoke too soon, I’ve got 4 unrecoverable sectors now.

These errors in dmesg are in fact a kernel USB bug, nothing to do with disk inodes:

name_count maxed, losing inode data: dev=00:05, inode=8572

# find /sys/ -inum 8572
/sys/devices/pci0000:00/0000:00:1a.1/usb4/4-0:1.0/ep_81/bmAttributes
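If there are a lot of these messages, the lookup above could be scripted. This is only a sketch: `inode_from_log` is a name I made up, and it assumes the log lines look exactly like the one quoted above and that the inodes in question live under /sys, as this one did.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print the inode number
# from each "losing inode data" message. Other lines are ignored.
inode_from_log() {
    sed -n 's/.*losing inode data: dev=[^,]*, inode=\([0-9][0-9]*\).*/\1/p'
}
```

Something like `dmesg | inode_from_log | sort -u | while read -r n; do find /sys -inum "$n"; done` would then list the file behind each message.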

Oh, and I just received a PayPal refund for the two previous disks!