Monitoring Hard Drive Health on Linux with smartmontools

S.M.A.R.T. is a system in modern hard drives designed to report conditions that may indicate impending failure. smartmontools is a free software package that can monitor S.M.A.R.T. attributes and run hard drive self-tests. Although smartmontools runs on a number of platforms, I will only cover installing and configuring it on Linux.

Why Use S.M.A.R.T.?

S.M.A.R.T. may give you enough warning to safely back up all your data before your hard drive dies. There is conflicting information on the internet about how reliable the warnings are. The best research I found is a paper from Google that describes an internal study of hard drive failure. A quick summary: certain events, including reallocation events and failed self-tests, greatly increase the chance of hard drive failure, but only about 60% of the drives that failed in the study had any negative S.M.A.R.T. attributes. Obviously, nothing replaces regular backups.

A good source for more information is the S.M.A.R.T. Wikipedia page.

Installation

On Debian or Ubuntu systems:

$ sudo apt-get install smartmontools

On Fedora:

$ sudo yum install smartmontools

Capabilities and Initial Tests

smartmontools comes with two programs: smartctl, which is meant for interactive use, and smartd, which continuously monitors S.M.A.R.T. Let’s look at smartctl first:

$ sudo smartctl -i /dev/sda

Replace /dev/sda with your hard drive’s device file in this command and all subsequent commands. If there’s only one hard drive in the system, it should be /dev/sda or /dev/hda. If this command fails, you may need to let smartctl know what type of hard drive interface you’re using:

$ sudo smartctl -d TYPE -i /dev/sda

where TYPE is usually one of ata, scsi, or sat (SCSI to ATA Translation, typically used for SATA drives that the kernel presents as SCSI devices). See the smartctl man page for more information. Note that if you need -d here, you will need to add it to all smartctl commands. Once the command works, it should print information similar to:

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint T133 series
Device Model:     SAMSUNG HD300LJ
Serial Number:    S0D7J1UL303628
Firmware Version: ZT100-12
User Capacity:    300,067,970,560 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Fri Jan  2 03:08:20 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
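
The last two lines of that output are the ones that matter for the rest of this guide. As a rough sketch for scripting the check, the helper below (smart_support_state is a made-up name, not part of smartmontools) reads the output of smartctl -i on standard input and reports whether S.M.A.R.T. is unavailable, disabled, or enabled. It assumes the "SMART support is:" wording shown above, which may differ for SCSI devices or other smartctl versions.

```shell
# smart_support_state (hypothetical helper): reads `smartctl -i` output on
# stdin and prints "enabled", "disabled", or "unavailable".
# Assumes the "SMART support is:" lines look like the ATA output shown above.
smart_support_state() {
    sss_info=$(cat)
    if ! printf '%s\n' "$sss_info" | grep -q '^SMART support is:[[:space:]]*Available'; then
        echo "unavailable"
    elif printf '%s\n' "$sss_info" | grep -q '^SMART support is:[[:space:]]*Enabled'; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Usage (needs root and a real drive):
#   sudo smartctl -i /dev/sda | smart_support_state
```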

Now that smartctl can access the drive, let’s turn on some features. Run the following command:

$ sudo smartctl -s on -o on -S on /dev/sda

  • -s on: This turns on S.M.A.R.T. support or does nothing if it’s already enabled.
  • -o on: This turns on offline data collection. Offline data collection periodically updates certain S.M.A.R.T. attributes. Theoretically this could have a performance impact. However, from the smartctl man page:

    Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so  in  practice  it has little effect.

  • -S on: This enables “autosave of device vendor-specific Attributes”.

The command should return:

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

Next, let’s check the overall health:

$ sudo smartctl -H /dev/sda

This command should return:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
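
In a script, it can be easier to look at smartctl's exit status than to parse this text. The sketch below decodes the status as a bitmask. The bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page as best I recall it, so check the man page for your version before relying on them; describe_smartctl_status is a hypothetical helper name.

```shell
# describe_smartctl_status STATUS (hypothetical helper): prints one line for
# each bit set in smartctl's exit status. The messages paraphrase the
# EXIT STATUS section of the smartctl man page; verify against your version.
describe_smartctl_status() {
    dss_status=$1
    dss_bit=0
    for dss_msg in \
        "command line did not parse" \
        "device open failed or device did not identify" \
        "a SMART or ATA command failed" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number dss_bit of the status value.
        if [ $(( (dss_status >> dss_bit) & 1 )) -eq 1 ]; then
            echo "bit $dss_bit: $dss_msg"
        fi
        dss_bit=$((dss_bit + 1))
    done
}

# Usage (needs root and a real drive):
#   sudo smartctl -H /dev/sda
#   describe_smartctl_status $?
```

An exit status of 0 prints nothing, which corresponds to the PASSED result above with no other problems reported.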

If it doesn’t return PASSED, you should immediately back up all your data. Your hard drive is probably failing. Next, let’s make sure that the drive supports self-tests. I have yet to see a drive that doesn’t, but the following command also gives time estimates for each test:

$ sudo smartctl -c /dev/sda

I won’t list the complete output because it’s somewhat lengthy. Make sure “Self-test supported” appears in the “Offline data collection capabilities” section. Also, look for output similar to:

Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 127) minutes.

These are rough estimates of how long the short and long self-tests will take respectively. Let’s run the short test:

$ sudo smartctl -t short /dev/sda

On my drive, this test should take 2 minutes, but this obviously varies. You can run:

$ sudo smartctl -l selftest /dev/sda

to check results. While the test is running, sudo smartctl -c /dev/sda reports the percentage remaining under “Self-test execution status”. Once it finishes, a successful entry in the self-test log will look like:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21472         -
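
Entry # 1 in this log is always the most recent test. As a small sketch, the helper below (latest_selftest_ok is a hypothetical name) reads the self-test log on standard input and succeeds only if that newest entry says "Completed without error". It assumes the ATA log layout shown above; a test that is still running or was aborted also counts as "not ok".

```shell
# latest_selftest_ok (hypothetical helper): reads `smartctl -l selftest`
# output on stdin. Exit status 0 means entry "# 1" (the newest test)
# completed without error; anything else (failure, aborted, still running,
# or no entries) gives a non-zero status.
latest_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Usage (needs root and a real drive):
#   if sudo smartctl -l selftest /dev/sda | latest_selftest_ok; then
#       echo "last self-test was clean"
#   fi
```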

Now, do the same for the long self-test:

$ sudo smartctl -t long /dev/sda

The long test can take a significant amount of time. You might want to run it overnight and check for the results in the morning. If either test fails, you should immediately back up all your data and read the last section of this guide.
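
The capabilities output from smartctl -c also includes a “Self-test execution status” entry, which shows the percentage of the test remaining while a test runs (several commenters below mention this). The sketch below pulls that number out so a script can wait for a long test to finish. selftest_remaining is a hypothetical helper name, and it assumes the "NN% of test remaining." line appears on its own line, as in the output commenters quote.

```shell
# selftest_remaining (hypothetical helper): reads `smartctl -c` output on
# stdin and prints the percentage of the self-test remaining, or nothing
# if no test is in progress.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Usage (needs root and a real drive): poll once a minute until no test
# is reported as running, then print the self-test log.
#   while [ -n "$(sudo smartctl -c /dev/sda | selftest_remaining)" ]; do
#       sleep 60
#   done
#   sudo smartctl -l selftest /dev/sda
```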

Configuring smartd

We’ve now enabled some features and run the basic tests. Instead of repeating the previous section daily, we can set up smartd to do it all automatically. If your system has an /etc/smartd.conf file, check for a line that begins with DEVICESCAN. If you find one, comment it out by adding a ‘#’ to the beginning of the line. DEVICESCAN doesn’t work on my system, and specifying a device file is easy. Add the following line to /etc/smartd.conf:

/dev/sda -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner

Here’s what each option does:

  • /dev/sda: Replace this with the device file you’ve been using in smartctl commands.
  • -a: This enables some common options. You almost certainly want to use it.
  • -d sat: On my system, smartctl correctly guesses that I have a serial ata drive. smartd on the other hand does not. If you had to add a “-d TYPE” parameter to the smartctl commands, you’ll almost certainly have to do the same here. If you didn’t, try leaving it out initially. You can add it later if smartd fails to start.
  • -o on, -S on: These have the same meaning as the smartctl equivalents.
  • -s (S/../.././02|L/../../6/03): This schedules the short and long self-tests. In this example, the short self-test will run daily at 2:00 A.M. The long test will run on Saturdays at 3:00 A.M. For more information, see the smartd.conf man page.
  • -m root: If any errors occur, smartd will send email to root. On my system, mail for root is forwarded to my normal email account. If you don’t have a similar setup, replace root with your normal email address. This option also requires a working email setup. Most Linux distributions automatically have working outbound email.
  • -M exec /usr/share/smartmontools/smartd-runner: This last part may be specific to the Debian and Ubuntu smartmontools packages. Check if your system has /usr/share/smartmontools/smartd-runner. If it doesn’t, remove this option. Instead of sending email directly, “-M exec” makes smartd run a different command when errors occur. On Debian, smartd-runner will run each script in /etc/smartmontools/run.d/, one of which emails the user specified by the “-m” option.

If you have more than one hard drive in your system, add a line for each one replacing /dev/sda with a different device file.
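
The -s schedule is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH, where T is the test type (S for short, L for long), MM is the month, DD is the day of the month, d is the day of the week (1 for Monday through 7 for Sunday), and HH is the hour. The sketch below mimics that comparison with grep so you can try a schedule before putting it in smartd.conf. schedule_matches is a hypothetical helper, and treating the regular expression as a whole-string extended-regex match is an assumption about how smartd applies it.

```shell
# schedule_matches REGEX TYPE MM DD DOW HH (hypothetical helper): succeeds
# if the smartd-style string TYPE/MM/DD/DOW/HH matches REGEX as a whole line.
# DOW is 1 (Monday) through 7 (Sunday). Whole-line extended-regex matching
# is an assumption about smartd's behavior.
schedule_matches() {
    printf '%s/%s/%s/%s/%s\n' "$2" "$3" "$4" "$5" "$6" | grep -Eqx -e "$1"
}

SCHEDULE='(S/../.././02|L/../../6/03)'

# Friday, January 2 at 02:00: the short test is due.
schedule_matches "$SCHEDULE" S 01 02 5 02 && echo "short test due"
# Saturday, January 3 at 03:00: the long test is due.
schedule_matches "$SCHEDULE" L 01 03 6 03 && echo "long test due"
# Friday, January 2 at 03:00: no long test, because it is not Saturday.
schedule_matches "$SCHEDULE" L 01 02 5 03 || echo "no long test"
```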

Update on 2009-01-06:

Thanks to commenter robert for pointing out an omission on my part. If your system has the file /etc/default/smartmontools, uncomment the “#start_smartd=yes” line by removing the “#”.
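
If you would rather not edit that file by hand, the change can be sketched with sed. The helper below (enable_smartd_default is a hypothetical name) takes the file path as an argument, keeps a backup copy next to it, and must be run as root when pointed at /etc/default/smartmontools. It assumes the line reads exactly “#start_smartd=yes”, optionally with spaces after the “#”.

```shell
# enable_smartd_default FILE (hypothetical helper): uncomment the
# "#start_smartd=yes" line in FILE, leaving a backup copy at FILE.bak.
enable_smartd_default() {
    cp -- "$1" "$1.bak" &&
    sed 's/^#[[:space:]]*start_smartd=yes/start_smartd=yes/' "$1.bak" > "$1"
}

# Usage (as root):
#   enable_smartd_default /etc/default/smartmontools
```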

Finally, restart smartd:

$ sudo /etc/init.d/smartmontools restart

If this command fails, the end of /var/log/daemon.log should have some diagnostic information. If smartd started fine, we should still test that email notifications are working. Add “-M test” to the end of the configuration line in /etc/smartd.conf. This will make smartd send out a test notification when it’s next started. Once again, restart smartd:

$ sudo /etc/init.d/smartmontools restart

You should receive an email similar to:

This email was generated by the smartd daemon running on:

   host name: polar
  DNS domain: shadypixel.com
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

TEST EMAIL from smartd for device: /dev/sda

For details see host's SYSLOG (default: /var/log/syslog).

Afterward, you can delete “-M test”.

What To Do If smartd Detects Problems

First, immediately back up everything. Depending on the error, your drive might be close to death or it may still have a long life ahead. Consult the smartmontools FAQ. It has some recommendations for specific errors. Otherwise, ask for help on the smartmontools-support mailing list.
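
When you ask for help, it is useful to include the raw values of the attributes involved, for example the reallocated sector count mentioned at the start of this guide. A minimal sketch, assuming the usual ten-column ATA attribute table printed by smartctl -A (attr_raw_value is a hypothetical helper name):

```shell
# attr_raw_value ID (hypothetical helper): reads `smartctl -A` output on
# stdin and prints the RAW_VALUE column (10th field) of the attribute whose
# ID# matches. Assumes the standard ATA attribute table layout.
attr_raw_value() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Usage (needs root and a real drive):
#   sudo smartctl -A /dev/sda | attr_raw_value 5     # Reallocated_Sector_Ct
#   sudo smartctl -A /dev/sda | attr_raw_value 194   # Temperature_Celsius
```

Attribute numbering and meaning vary between vendors, so treat the IDs in the usage lines as common conventions rather than guarantees.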

78 replies on “Monitoring Hard Drive Health on Linux with smartmontools”

Hey, nice intro. Small addition: on my Xubuntu intrepid and jaunty (alpha) installation, I had to uncomment the line ‘#start_smartd=yes’ in the file /etc/default/smartmontools.

Cheers

Hi btmorex, nice howto.

I configured my smartd.conf like this:

/dev/sdb -I 194 -a -o on -S on -s (S/../.././03|L/../../6/04) \
-m sys@base.com \
-M exec /usr/share/smartmontools/smartd-runner

Also, by adding “-M test”, I tested email notifications and received the test email message.

As you can see, my HDD is tested each morning, but I didn’t receive any email notification about the test results.

Probably notifications are only sent when something goes wrong. Am I right on this point?

Right now my drive reports OK status with the “smartctl -H” command.

Thanks a lot again.

Agip,

It sounds like you’ve set it up correctly. You’re right that smartd will only email you if there is a problem. If you want to look at the test results, you can do:

smartctl -l selftest /dev/sdb

You can view the progress of a self test by running smartctl -Hc /dev/XXX. It will be across from the “Self Test Execution Status”. Should look something like this:

Self-test routine in progress…
70% of test remaining.

Hi btmorex

Nice work.
I had to remove the first line in the /etc/smartd.conf:
# *SMARTD*AUTOGENERATED* /etc/smartd.conf
Without doing this, all changes are lost after restarting the daemon.

Cheers

For Xubuntu users, I’ve made a little script that will work together with smartd to pop up a notification in case of any hard disk trouble. See http://ubuntuforums.org/showthread.php?t=1031244

I think it can be easily adapted to Ubuntu though.

@Agip: a mail server needs to be installed (and probably configured) if you’d like e-mail notifications. You either try my script (:)), or use the smart-notifier package if available for your distribution.

Cheers

Great tutorial!

Just one question: what is a reasonable schedule for the short, long and offline tests?
Short: every day?
Long: Once per week?
Offline: ???

Great info!
One question, will SMART tool function correctly on an un-formatted drive?
Say I found an old drive that is raw, can I run SMART on it?
Thanks,
Alex

hi Mark,
i’m still not 100% sure about this but from my initial testing of smartmontools it appears smartctl needs at least one disk partition to be mounted otherwise it just stays at the “90% completed” point for some time until smartctl eventually gives up and kills the test with a message like this:
“# 7 Short offline Interrupted (host reset) 90% 2455 -”

if you have a partition on your disks, try mounting it before the test.
if not, maybe it is possible to mount the disk itself as a raw device? (i don’t know. haven’t tested it yet.)

dunk.

That’s odd that it works when you mount something. I know that having a partition mounted is not a requirement though, as I run tests daily against drives that are almost never mounted.

Thank you. Clearly written and informative. Other explanations always left me a bit dazed and confused.

I actually have a half-written post about gsmartcontrol :)

It’s a nice program although I prefer the set-it-up-and-forget-about-it nature about smartd.

Thank you, setting up smartd went fine, however I cannot persuade the system to send mail which somehow makes the whole thing useless.
I use postfix on Ubuntu server 8.04. I can send mail from the command line; I installed logwatch which can mail as well, but when smartd tries to send out mail, it always fails with the following error message:
“Test of mail to root produced unexpected output (90 bytes) to STDOUT/STDERR: send-mail: invalid option -- i Can’t send mail: sendmail process failed with error code 1”

I spent a lot of time on Google, configuration files, forwarding etc., but the result is always the same.

Does anybody have an idea of what might go wrong? Thank you.

Zdenek

I haven’t been able to get it to send a mail either, but logwatch manages just fine.

Wondering if I have the same problem you do, however I haven’t been able to locate any kind of error message in the logs, where exactly do you find yours?

Thanks for the guide, very useful. However I got a problem: my second hard disk has some unreadable sector, every time I boot up the PC a new mail is sent.
Is there a way to get smartd to send mail only when a new short/long test is performed? I just want to monitor the situation, not receive the same mail every time i boot the PC….
Thanks for any help.

Thanks for the guide; worked perfectly and easy to follow. The hardest part was setting up postfix!

Thoughts on usefulness of doing more than regular short and long tests? For example from the sample config file:
# Monitor all attributes except normalized Temperature (usually 194),
# but track Temperature changes >= 4 Celsius, report Temperatures
# >= 45 Celsius and changes in Raw value of Reallocated_Sector_Ct (5).
# Send mail on SMART failures or when Temperature is >= 55 Celsius.
#/dev/hdc -a -I 194 -W 4,45,55 -R 5 -m admin@example.com

Best

Charles

Ok, I’m running the long test, I’ve figured out how to see the progress it’s making as it goes along, but I can’t figure out how to view the results. I have 4 drives testing simultaneously – so I want to view the results separately, and maybe several times.

THANK YOU,
David

I found it –

smartctl -Hc /dev/sdx

Shows the progress and the code with interpretation if the test has completed.

Thanks for a great tutorial!

Hi!

Is there any chance that one can estimate the remaining lifetime of a hard drive based on some of the S.M.A.R.T. attributes? (A very rough estimate is perfect for what I am doing.) I know that S.M.A.R.T. data is correct but that you cannot rely on it to catch a failure. However, if there is a formula to roughly estimate the remaining lifetime, I will be very grateful.

10x in advance.

No, you can’t really make an estimate like that. Actually, Google did their S.M.A.R.T. study to find out exactly what you’re asking. The conclusion they reached is that even though some values have predictive value, they are nowhere near good enough to actually preemptively replace hard drives (which is very similar to estimating remaining life).

Hey!

10x for your reply. However, the study says that if you combine all parameters, only 36% of all failed drives had zero values on every parameter and so could not be predicted, which is actually quite good for me. What is more, even if I take only the 4 important parameters into account, I will be successful in 44% of the cases. Combining this with the age of the hard drive will be enough for me… So are you aware of a formula or combination of these parameters that would let me estimate the health (or the remaining lifetime) of an HDD?

Thanks in advance…

To answer your question right away: I don’t have any formula.

I want to add though that I think what you’ll find is that you’ll be able to split drives into two groups: one group will have no predictive S.M.A.R.T. values, and one group will have one or more values that indicate imminent failure. There’s no doubt that that’s valuable, but I don’t think you’ll be able to estimate remaining life with any accuracy for most of the drives.

Here’s an oddity. I’m testing smartmontools vs. cciss_vol_status on an HP with external array, and getting some inconsistent results.
I know that one drive has failed.
I know that another drive is in jeopardy (which is why I’m testing on the box I’m testing on).
Running smartctl, I see in my health report that the second drive is in danger, but it makes no mention of the failed drive.
Then, running cciss_vol_status, I see that the first drive has failed, but no mention is made of the second.

I’ll post this in the cciss_vol_status forum as well, but I find it interesting that the two utilities show such different results!

What is cciss_vol_status actually checking? One possibility is that the drive is completely dead. There would be no S.M.A.R.T. status, but cciss_vol_status would know that there was supposed to be a drive there so it could determine it was dead.

As for the one that’s failing, probably cciss_vol_status isn’t checking S.M.A.R.T. status (I have no idea because I haven’t used it).

You can pass sudo smartctl -l selftest /dev/sda as argument to watch in order to follow its progress.

I have a huge problem with my disks: they are being killed by bad IO synchronisation. I ran an iotop check and I see all processes at 99.90% in the IO column. What can I do to make my disks stable again?

I run game servers on my dedicated server, and all my customers have lag and can’t play. What should I do? Please help :?

Great tut there.

There seems to be some interest in an automatic periodic “all-is-well” email notification perhaps containing the info from the last health checks. Since smartmon normally only sends an email when there is a problem, can we add a line to smartd.conf that forces an “all-is-well” email to be sent, say, monthly? Or would we have to cron a scratch built script which uses smartctl to do that?
Anybody know how to do this with smartd? Or have a “cronable” script for smartctl? Or perhaps a how-to for getting some monitoring program to provide this feature?

smartctl has the “-s on” option to turn on S.M.A.R.T. support on the hard disk. For some new hard disks, it has to be set at the beginning. However, since a new hard disk doesn’t contain any SMART information yet, the smart health check shows it as failed. After a day, the result changes to “passed”. I am wondering how to reset the values of the old-age attributes.

Thanks for the awesome tutorial! I was hoping you could help me with emailing to 2 separate email accounts. The current line I have in /etc/smartd.conf is DEVICESCAN -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m email1@gmail.com -M exec /usr/share/smartmontools/smartd-runner which works without a problem. I have tried adding email1@gmail.com,email2@gmail.com but that doesn’t work. What is the best way to accomplish sending results to 2 emails. Thanks for the help.

Thanks for your hard work. If anyone has suggestions for the frequency of the various tests, I would be very interested. Obviously, continually running the long test over and over would be enough to wear out the drive, but some advice on which tests to run and how often would be appreciated.

On a Debian system (wheezy/sid) I needed to install bsd-mailx to get smartd to send emails via sendmail:

apt-get install bsd-mailx

Thanks for the guide!

Excellent post. Thanks for your time in putting this together. One of the better walkthroughs of smartmontools configuration out there.

Hello, nice tutorial, but I have hardware RAID and need to use this to view data. How can I use the runner to execute this periodically?

smartctl -c -a /dev/cciss/c0d0p1 -d cciss,0

If you mean configuring smartd, you can just add those options to the smartd.conf line. The smartd-runner program just executes scripts in /etc/smartmontools/run.d on failures.

Very informative and precise tutorial with all the correct commands and screen shots
just got round to testing my HD as its making a buzzing noise that is worse under Linux
preliminary results are promising i will run the extended test and see what it spits out.

Thanks dude.

Thanks for the tutorial, found it helpful. Below is a short script used on an ubuntu 12.04 system run via cron once a week to email a summary of disk information and self tests completed for each disk found at boot time. The formatting of the summary output can be changed to add or remove info as needed. It assumes you have the system configured to send mail, and will email the output to the root user.

#!/bin/bash

#
# script created to provide the general disk information and smartmon test completion status for
# all disk devices found at boot time by OS
# 10/12/2013 jmm
#

export PATH=/usr/bin:/bin:/usr/sbin
Smart_Out=/tmp/smart.out
emailsubj="$(hostname) - SMART self-test summary for $(date "+%A %B %d %Y")"
SendTo=root

# start with an empty report file
: > "$Smart_Out"

#
# for each device found in /dev get the general drive info and SMART self test status
# send both to a temp file and do simple formatting
#
# NOTE: the awk line-range filter applied to "smartctl -a" was truncated when
# this comment was posted, so "smartctl -i" is used here as a stand-in to
# print the drive information section.
#

for dev in /dev/sd?
do
    # skip anything that is not a block device (for example, an unmatched glob)
    [ -b "$dev" ] || continue
    {
        echo -e "The SMART status for Hard disk $dev is: \n\n"
        smartctl -i "$dev"
        smartctl -l selftest "$dev"
        echo -e "=== END OF READ SMART DATA SECTION === \n\n"
    } >> "$Smart_Out"
done

#
# send output to the appropriate user
#

mailx -s "$emailsubj" "$SendTo" < "$Smart_Out"
logger "$emailsubj"

rm "$Smart_Out"

Hi,

I think it has been spotted in a previous post but with another command, for this part of the article:
“Unfortunately, there

End of the comment was:

You can find out the advancement of your test using the command:

smartctl --capabilities /dev/sdX

It will show the advancement for your test in percentage.

Thanks for the tutorial, really helpful!

Cheers,
Clem

Tried setting up on CentOS 6.5, tried to restart the service using “/etc/init.d/smartmontools restart” but got an error.
“smartmontools” is not located in the “/etc/init.d/” directory, but smartd is. Is getting the smartd service started enough to get smartmontools working?

Thanks for publishing this article. I had few questions:
1. Can I rely on smartctl -H to see if the device is in good health? Or do I need to run further self-tests, short or long? My aim is just to find out whether the disk is fine for reads and writes. We have a high availability solution and we want to use this utility to fail over to the standby node in case of any issue with the disk.

2. Is health check with -H option or selftest or short test – are they handled by the device driver independently ? Or they consume some CPU cycles ? Any data read or write is involved to run these tests that takes CPU times ?

I cannot get the mail notification to work. After some tries I got sendmail to work from command line using a gmail account, but having set it to test using “-M test” it now generates a mail every 20 minutes but the subject is
Cron test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp

and the message is
/usr/share/sendmail/sendmail: 899: /usr/share/sendmail/sendmail: /usr/sbin/sendmail-msp: not found

Any idea what I need to do?

This is running on Linux Mint 17

Hello…

I’m trying the short test on an SSD drive but it looks like it never ends! It freezes at 10% remaining and doesn’t get to 0% remaining! I have to abort it with the “-X” flag!

I’m using:
sudo smartctl -t short /dev/sda

then when I issue:
sudo smartctl -l selftest /dev/sda

I get this, which is an aborted previous test:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 10% 14687 –

If I try to run again the command for the short test it says:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.1.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, http://www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Can’t start self-test without aborting current test (10% remaining),
add ‘-t force’ option to override, or run ‘smartctl -X’ to abort test.

I want to use smartmontools from C++ code. Is there any exposed C/C++ API available?

Thanks
Hari Shankar

Weekly long test is generating a temp warning as the drive temp reaches 46 degrees near the end of the test. Normal operating temp is 38 degrees. Short test isn’t a problem.

Perhaps weekly long tests are doing more harm than good?

Thanks for the information. I’ve managed to set everything up as indicated. It’s a shame for me that I am doing this only after my disk died.
Much appreciated for this blog.
