16

SMART is reporting one pending sector on one of my server's HDDs. I've read lots of articles recommending hdparm to "easily" force the disk to relocate the bad sector, but I can't find the correct way to use it.

Some info from my "smartctl":

Error 95 occurred at disk power-on lifetime: 20184 hours (841 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d7 55 dd 02  Error: UNC at LBA = 0x02dd55d7 = 48059863

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d6 55 dd e2 00  18d+05:13:42.421  READ DMA
  27 00 00 00 00 00 e0 00  18d+05:13:42.392  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  18d+05:13:42.378  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  18d+05:13:42.355  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  18d+05:13:42.327  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%            20194  48059863
# 2  Short offline       Completed without error        00%            15161  -
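The hex and decimal LBA in the error entry can be cross-checked against the register dump. The following is only a sketch, assuming the usual 28-bit LBA register layout (low nibble of DH, then CH, CL, SN); the helper name `lba28_from_regs` is made up for illustration and is not part of smartctl or hdparm.

```shell
#!/bin/sh
# Rebuild a 28-bit LBA from the SN CL CH DH register bytes of a smartctl
# error-log entry. lba28_from_regs is a hypothetical helper name.
lba28_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    # Only the low nibble of DH carries address bits (LBA 27:24).
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers after command completion in the log above: SN=d7 CL=55 CH=dd DH=02
lba28_from_regs d7 55 dd 02     # prints 48059863

# Plain hex-to-decimal conversion of the LBA printed by smartctl
printf '%d\n' 0x02dd55d7        # prints 48059863
```

Both lines print the same decimal sector number that the self-test log lists under LBA_of_first_error, which is the form hdparm's sector options are normally given.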

With that "bad LBA" (48059863) in hand, how do I use hdparm? What type of address should the "--read-sector" and "--write-sector" parameters take?

If I issue the command hdparm --read-sector 48059863 /dev/sda, it reads and dumps data. If this command were right, I should expect an I/O error, right?

Instead, it dumps data:

$ ./hdparm --read-sector 48059863 /dev/sda

/dev/sda:
reading sector 48059863: succeeded
4b50 5d1b 7563 a932 618d 1f81 4514 2343
8a16 3342 5e36 2591 3b4e 762a 4dd7 037f
6a32 6996 816f 573f eee1 bc24 eed4 206e
(...)
1
  • 2
    The proper thing to do is to send the drive back to its manufacturer and get a warranty replacement. Commented Dec 27, 2012 at 16:58

4 Answers

13

If for whatever reason you prefer to try to clear those bad sectors, and you do not care about the existing contents of a drive, the below shell snippet may help. I tested this on an older Seagate Barracuda drive that is well past its warranty anyway. It might not work right with other drive models or manufacturers, but it should put you on the right path if you must script something. It will destroy any content you have on the drive.

You may prefer just running badblocks, an hdparm Secure Erase (SE) (https://wiki.archlinux.org/index.php/Securely_wipe_disk), or some other tool that is actually designed for this. Or even the manufacturer provided tools like SeaTools (there is a 32bit linux 'enterprise' version, google it).

Make sure the drive in question is completely unused/unmounted before doing this. Also, I know, while loop, no excuses. It is a hack, you can make it better...

baddrive=/dev/sdb
badsect=1
while true; do
    echo Testing from LBA $badsect
    # start a selective self-test from the last bad sector to the end of the disk
    smartctl -t select,${badsect}-max ${baddrive} >> /dev/null 2>&1
    echo "Waiting for test to stop (each dot is 5 sec)"
    while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print substr($4,1,9)}')" != "Completed" ]; do
        echo -n .
        sleep 5
    done
    echo
    # LBA of the first read failure, or "-" if the test found none
    badsect=$(smartctl -l selective ${baddrive} | awk '/# 1 Selective offline Completed: read failure/ {print $10}')
    [ "$badsect" = "-" ] && exit 0
    echo Attempting to fix sector $badsect on $baddrive
    hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $baddrive
    echo Continuing test
done

One advantage of using the 'selftest' method is the load is handled by the drive firmware, so the PC it is connected to is not loaded down like it would be with dd or badblocks.

NOTE: I made a mistake above. The correct while condition is:

while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print $4}')" = "Self_test_in_progress" ]; do

And the exit condition of the script becomes:

[ "$badsect" = "-" ] || [ "$badsect" = "" ] && exit 0
2
  • My experience shows that it's often not enough to overwrite just the faulty sector - it will still fail to read after that. You need to overwrite $count consecutive sectors from $((badsect/count*count)) where $count is a power of two (for the disk I'm currently restoring, count is as much as 16384). The cause seems to be that the sector is faulty because HDD can't find its positioning marker (aka preamble or whatever marks the start of sector data), and overwriting a large chunk of sectors also restores all positioning markers. Commented Jun 29, 2017 at 21:47
  • 1
    If you do not have too many pending sectors to repair, then you could do this manually. Use a smartctl test to identify the "LBA_of_first_error", then enter that number into something like the following: hdparm --repair-sector 48059863 --yes-i-know-what-i-am-doing /dev/sdb Then check the Current_Pending_Sector count and hopefully it will have reduced. Commented Jan 21, 2022 at 13:09
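The aligned-range arithmetic from the first comment can be sketched as a dry run. This is an illustration only: `aligned_range` is a hypothetical helper, `/dev/sdX` is a placeholder, the count of 8 is an arbitrary choice, and the loop prints the commands rather than running them, since `--write-sector` destroys the sector's contents.

```shell
#!/bin/sh
# Print the first and last sector of the aligned run that contains a bad LBA.
# aligned_range is a hypothetical helper; count should be a power of two.
aligned_range() {
    badsect=$1 count=$2
    start=$(( badsect / count * count ))   # round down to a multiple of count
    end=$(( start + count - 1 ))
    echo "$start $end"
}

# Dry run: show the commands instead of executing them.
set -- $(aligned_range 48059863 8)
lba=$1
while [ "$lba" -le "$2" ]; do
    echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing /dev/sdX"
    lba=$(( lba + 1 ))
done
```

With a count of 8 the run covers sectors 48059856 through 48059863. A larger count, such as the 16384 mentioned in the comment, only changes the second argument.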
5

I think it may have read without error because that sector is not bad, but other tools fail reading the sector because of some other behavior. (read ahead that reaches an actually unreadable sector?)

I found some bad sectors, and if I repair the only one that is unreadable with "hdparm --read-sector", the other 'bad' sectors suddenly are no longer unreadable with things like dd. And interestingly, when looking at "dmesg" output, only the hdparm-unreadable ones are ever reported.

For example, I had sectors 36589320 to 36589327 and 36589344 to 36589351 unreadable with dd, but only 36589326 and 36589345 were unreadable with hdparm --read-sector. I then used hdparm --write-sector on those two, and afterwards all 16 sectors were readable again.

Here's a small part of dmesg output:

[30152036.527940] end_request: I/O error, dev sda, sector 36589326
[30152077.363710] end_request: I/O error, dev sda, sector 36589345

And the disk info:

# smartctl -i /dev/sda
...
=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MK2002TSKB
...
Firmware Version: MT2A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
...

Apparently this disk's firmware either doesn't properly record reallocated sectors, or the sectors were never reallocated and were merely corrupt (for example an unrecoverable ECC error on a surface that still works, as if caused by bit rot rather than faulty electronics or bad media):

# smartctl -A /dev/sda | egrep "Reallocated|Pending|Uncorrectable"
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

# smartctl -l error /dev/sda
...
SMART Error Log Version: 1
No Errors Logged
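For scripting, a single counter can be pulled out of that table rather than eyeballed. A minimal sketch, assuming the usual ten-column layout of `smartctl -A` with the raw value in the last column; `smart_raw` is a hypothetical helper name and the sample rows are copied from the output above.

```shell
#!/bin/sh
# Print the raw value (10th column) of one SMART attribute from
# `smartctl -A` output read on stdin. smart_raw is a hypothetical helper.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Sample rows copied from the output above; on a live system this would be
# something like: smartctl -A /dev/sda | smart_raw Current_Pending_Sector
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0'

printf '%s\n' "$sample" | smart_raw Current_Pending_Sector   # prints 0
```

Comparing this value before and after a write is one way to see whether the pending count actually changed.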

Please note, I ran a --read-sector and a --write-sector. A read may be required to properly reallocate a sector, not just a write. If you don't read first, it might not know the sector is bad.

2
  • Yes, you're right. This is because the kernel reads "pages" not sectors. A page is 4096 bytes = 8 sectors. They are aligned to boundaries (4096-byte) as well. Commented Sep 3, 2014 at 21:38
  • The dd command's "iflag=direct" (and "oflag=direct") options can be used to bypass the kernel's cache and read/write individual sectors (similar to the way hdparm's --read-sector/-write-sector do). Commented Sep 14, 2020 at 22:16
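The page explanation in the first comment fits the numbers in the answer. As a rough check, assuming 512-byte logical sectors and 4096-byte pages as stated there (`page_range` is a hypothetical helper name):

```shell
#!/bin/sh
# Sectors per 4096-byte page with 512-byte logical sectors.
SECTORS_PER_PAGE=8

# page_range is a hypothetical helper: print the first and last sector of
# the aligned page that contains the given LBA.
page_range() {
    first=$(( $1 / SECTORS_PER_PAGE * SECTORS_PER_PAGE ))
    echo "$first $(( first + SECTORS_PER_PAGE - 1 ))"
}

# The two sectors that hdparm --read-sector could not read:
page_range 36589326   # prints 36589320 36589327
page_range 36589345   # prints 36589344 36589351
```

These are exactly the two 8-sector runs that dd reported as unreadable, which is consistent with one bad sector making its whole page fail when the read goes through the page cache.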
2

Based on @Glenn's answer, you'll find the script fixbad at

http://wiki.bitplan.com/index.php/Bad_Block_Howto

As of 2020-09-10, the content of the script is:

#!/bin/bash
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# should write fixes be performed?
fix=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"

#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'
red='\033[0;31m'
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'

#
# a colored message
#   params:
#     1: l_color - the color of the message
#     2: l_msg - the message to display
#
color_msg() {
  local l_color="$1"
  local l_msg="$2"
  echo -e "${l_color}$l_msg${endColor}"
}

#
# error
#
#   show an error message and exit
#
#   params:
#     1: l_msg - the message to display
error() {
  local l_msg="$1"
  # use ansi red for error
  color_msg $red "Error: $l_msg" 1>&2
  exit 1
}

#
# show the usage
#
usage() {
  echo "usage: $0 [disk]"
  echo " [-c|--check]"
  echo " [-d|--dry]"
  echo " [-h|--help]"
  echo " [-i|--info]"
  echo " [[-m|--mode] mode]"
  echo " [[-r|--range] range]"
  echo " [[-s|--serial [serial]]"
  echo " [-t|--test]"
  echo " [[-w|--wait [type]]"
  echo " [-v|--verbose]"
  echo
  echo " -h|--help: show this usage"
  echo " -c|--check: check the disk"
  echo " -d|--dry: dry run - show commands only"
  echo " -i|--info: show info about the given disk"
  echo " -m|--mode: set mode: default=short"
  echo " -r|--range: range of sectors to modify after bad sector"
  echo " -s|--serial: get serial number of confirm serial number"
  echo " -t|--test: run test for the given type e.g. selective selftest"
  echo " -w|--wait: wait for the result of the given testype e.g. selective selftest"
  echo " -v|--verbose: set verbose mode"
  echo ""
  echo "example:"
  echo " $0 /dev/sdb -i"
  echo ""
  echo "for any write operation you need to confirm the serial number"
  echo "to get serial number: "
  echo " $0 disk -s "
  exit 1
}

#
# get a number range from 0 to the given n-1
#
# params
#   1: n
function getRange() {
  local l_n="$1"
  range=$(python -c "for i in range($l_n): print i,")
  echo $range
}

#
# read the result of the smartctl test for the given disk
#
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_type: the type of the test e.g. selective
function readResult() {
  local l_disk="$1"
  local l_type="$2"
  $sudo smartctl -l $l_type $l_disk | egrep "^#?[[:space:]]*[0-9]"
}

#
# show the Result
#
function showResult() {
  local l_logline="$1"
  local l_logstatus="$2"
  if [ "$verbose" == "true" ]
  then
    echo $l_logstatus:$l_logline
  else
    echo $l_logline | gawk '
/#/ { print $0; exit }
{
  status=substr($4,1,9)
  progress=$5;
  gsub("\\[","",progress);
  range=$7
  printf("\r%s",progress);
}'
  fi
}

#
# wait for the result of a running selftest
#
# param 1: l_disk: the disk under test e.g. /dev/sdb
# param 2: l_type: the type of the test e.g. selective
# param 3: l_wait: number of seconds to wait
#
function waitForResult() {
  # example
  #=== START OF READ SMART DATA SECTION ===
  #SMART Selective self-test log data structure revision number 1
  #SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
  # 1 7814037167 Self_test_in_progress [90% left] (2564632-2630167)
  local l_disk="$1"
  local l_type="$2"
  local l_wait="$3"
  local l_logline=""
  local l_logstatus=""
  color_msg $blue "Waiting for $l_type test of $l_disk to stop (each dot is $l_wait sec)"
  while [ "$l_logstatus" != "Completed" ];
  do
    l_logline=$(readResult "$l_disk" "$l_type" | egrep "^#?[[:space:]]*1")
    l_logstatus=$(echo $l_logline | gawk '
/Completed/ { print "Completed"; }')
    showResult "$l_logline" "$l_logstatus"
    sleep $l_wait
  done
}

#
# get the serial number of the device
#
function getSerialNumber() {
  local l_disk="$1"
  serial=$($sudo smartctl -i $l_disk | grep "Serial Number" | cut -f 2 -d':')
  echo $serial
}

#
# get the blocksize of the given file system
#
function getBlockSize() {
  local l_fs="$1"
  blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
  echo $blocksize
}

#
# get the partition for the given disk
#
function getPartition() {
  local l_disk="$1"
  fs=$(mount | grep $l_disk | cut -f1 -d' ')
  echo $fs
}

#
# get the start sector for the given disk
#
function getStartSector() {
  local l_disk="$1"
  local l_fs="$2"
  startsector=$($sudo fdisk -l $l_disk | grep $l_fs | cut -f4 -d' ')
  echo $startsector
}

#
# get Info about the given disk
#
function getInfo() {
  local l_disk="$1"
  $sudo smartctl -i $l_disk | egrep "(Model|Serial|Rotation|Sector|Capacity)"
  $sudo hdparm -I $l_disk | egrep "(Serial Number|Model)"
  fs=$(getPartition $l_disk)
  if [ "$fs" != "" ]
  then
    color_msg $blue "Partition: $fs"
    blocksize=$(getBlockSize $fs)
    color_msg $blue "Blocksize: $blocksize"
  else
    color_msg $red "couldn't find mounted partition for $l_disk"
  fi
}

#
# geh the current pending sector for the given disk
#
function getCurrentPendingSector() {
  local l_disk="$1"
  # if msg is empty don't show message but only return the current pending sector count
  local l_msg="$2"
  psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
  psector=0
  if [ $? -eq 0 ]
  then
    if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
    psector=$(echo $psectorline | cut -f 10 -d ' ')
    if [ $psector -gt 0 ]
    then
      if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
    else
      if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
    fi
  else
    if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
    psector=-1
  fi
  if [ "$l_msg" == "" ]; then echo $psector; fi
}

#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb1
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
#
fixBad() {
  local l_disk="$1"
  local l_sector="$2"
  local l_range="$3"
  color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
  r=$(getRange $l_range)
  for i in $r ; do
    let b1=$l_sector+$i
    if [ "$dry" == "true" ]
    then
      echo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk
    else
      $sudo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $disk >> /tmp/smart_repaired.log
    fi
  done
  #tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
  #grep '#' /tmp/smart | head -5
  #hdparm -I $disk > /tmp/hdparm
}

#
# check the needed software
#
checkSoftware() {
  for sw in gawk debugfs fdisk hdparm smartctl tune2fs python $sudo
  do
    bin=$(which $sw)
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        color_msg $green "will use $bin as $sw"
      fi
    else
      error "$0 needs $sw to work please install it"
    fi
  done
}

#
# run a test for the given disk in the given mode
#
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_mode: the mode of the self test e.g. short/long
function runTest() {
  local l_disk="$1"
  local l_mode="$2"
  color_msg $blue "running $l_mode smartctl test for $l_disk ..."
  $sudo smartctl -t $l_mode $l_disk > /tmp/null
}

#
# check the given disk in the given mode
#
function checkDisk() {
  local l_disk="$1"
  local l_mode="$2"
  local l_serial="$3"
  fs=$(getPartition $l_disk)
  blocksize=$(getBlockSize $fs)
  startsector=$(getStartSector $l_disk $fs)
  color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
  getCurrentPendingSector "$l_disk" show
  psector=$(getCurrentPendingSector "$l_disk")
  if [ $psector -gt 0 ]
  then
    runTest $l_disk $l_mode
  fi
}

#
# check the lba block
#
function lbaCheck() {
  local l_disk="$1"
  fs=$(getPartition $l_disk)
  blocksize=$(getBlockSize $fs)
  startsector=$(getStartSector $l_disk $fs)
  diskserial=$(getSerialNumber $l_disk)
  readResult "$l_disk" selftest | while read line
  do
    echo $line | grep "read failure" > /dev/null
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        echo $line
      fi
      index=$(echo $line | cut -f2 -d' ')
      state=$(echo $line | cut -f3-4 -d ' ')
      progress=$(echo $line | cut -f8 -d ' ')
      lba=$(echo $line | cut -f10 -d ' ')
      if [ "$lba" == "" ]
      then
        lba=0
      fi
      if [ "$lba" -gt 0 ]
      then
        echo $index $state
        echo "progress: $progress"
        echo "lba: $lba"
        # calculate the file system block
        fsb=$(gawk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%.0f",((L-S)*512/B))}')
        echo "file system block: $fsb"
        if [ "$fix" == "true" ]
        then
          if [ "$serial" != "$diskserial" ]
          then
            color_msg $red "you need to provide the serial number of $l_disk to perform fix operations"
          else
            fixBad $l_disk $lba $range
          fi
        fi
      fi
    fi
  done
}

#
# try Fixing bad sectors
#
function tryFix() {
  local l_disk="$1"
  badsect=$($sudo smartctl -l selective ${baddrive} | gawk '/# 1 Selective offline Completed: read failure/ {print $10}')
  [ $badsect = "-" ] && exit 0
  echo Attempting to fix sector $badsect on $baddrive
  echo hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $baddrive
}

#
# start a check loop on the given drive
#
function checkLoop() {
  local baddrive="$1"
  badsect=1
  while true; do
    color_msg $blue "Testing $baddrive from LBA $badsect"
    $sudo smartctl -t select,${badsect}-max ${baddrive} 2>&1 >> /dev/null
    waitForResult $baddrive selective 5
    tryFix $baddrive
    color_msg $blue "running next test"
  done
}

# make sure the needed software is available
checkSoftware

# commandline option
while [ "$1" != "" ]
do
  option=$1
  shift
  case $option in
    -h|--help)
      usage
      ;;
    -i|--info)
      getInfo $disk
      ;;
    -m|--mode)
      if [ $# -lt 1 ]
      then
        usage
      else
        mode=$1
        shift
      fi
      ;;
    -c|--check)
      checkDisk $disk $mode $serial
      ;;
    -d|--dry)
      dry=true
      ;;
    -l|--loop)
      checkLoop $disk
      ;;
    -f|--fix)
      fix=true
      ;;
    -r|--range)
      if [ $# -lt 1 ]
      then
        usage
      else
        range=$1
        shift
      fi
      ;;
    -s|--serial)
      if [ $# -lt 1 ]
      then
        getSerialNumber $disk
        exit 1
      else
        serial=$1
        shift
      fi
      ;;
    -t|--test)
      runTest $disk $mode
      ;;
    -v|--verbose)
      verbose=true
      ;;
    -w|--wait)
      if [ $# -lt 1 ]
      then
        usage
      else
        type=$1
        shift
        waitForResult $disk $type 5
      fi
      ;;
    -x)
      lbaCheck $disk $serial;;
    *)
      disk=$option
      ;;
  esac
done
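The file system block calculation in lbaCheck, (L-S)*512/B, can be tried on its own. This is only a sketch: `fs_block` is a hypothetical helper, and the start sector 2048 and block size 4096 are assumed values for illustration, not taken from any disk in this thread.

```shell
#!/bin/sh
# fs_block is a hypothetical helper: file system block that holds an LBA.
#   $1 = LBA, $2 = partition start sector, $3 = file system block size (bytes)
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))   # integer division truncates
}

# Start sector 2048 and block size 4096 are assumed values for illustration.
fs_block 48059863 2048 4096   # prints 6007226
```

Shell integer arithmetic truncates, whereas the script's gawk `printf "%.0f"` rounds to the nearest integer, so the two can differ by one block when the LBA falls in the upper half of a block (here the exact quotient is 6007226.875).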

Personally, I wasn't able to get any meaningful results (that is, actually "repair" a disk) with this toolkit. Still, the script and its parts are helpful for analyzing problems and attempting fixes. Beware of using the script in the hope of "full automation". You might lose your data instead of fixing it.

2
  • This doesn't work on Fedora with XFS: Usage: tune2fs [-c max_mounts_count] [-e errors_behavior] [-f] [-g group] [-i interval[d|m|w]] [-j] [-J journal_options] [-l] [-m reserved_blocks_percent] [-o [^]mount_options[,...]] [-r reserved_blocks_count] [-u user] [-C mount_count] [-L volume_label] [-M last_mounted_dir] [-O [^]feature[,...]] [-Q quota_options] [-E extended-option[,...]] [-T last_check_time] [-U UUID] [-I new_inode_size] [-z undo_file] device Usage: grep [OPTION]... PATTERNS [FILE]... Try 'grep --help' for more information. Commented Sep 14, 2021 at 1:12
  • 1
    WF: It might be a good idea to add a read -p manual input on the blocksize detection as a fallback, since people have been trying non-ext4 filesystems for a while now. Commented May 9, 2023 at 11:25
1

You may be using an incorrect address, depending on which smartctl option you use. -a shows the address as a 32-bit value, which will be incorrect for large disks. Use -x instead. I believe -l also shows the truncated value.

man smartctl

-a, --all
    Prints all SMART information about the device. For ATA, this is equivalent to '-H -i -c -A -l error -l selftest -l selective'.
--> This option is no longer recommended for ATA disks because it does not enable the SMART options which require support for 48-bit ATA commands (see '-x' below).
    For SCSI, this is equivalent to '-H -i -A -l error -l selftest'. For NVMe, this is equivalent to '-H -i -c -A -l error -l selftest'.
