Is Your Computer Stable? by Jeff Atwood

Over the last twenty years, I’ve probably built around a hundred computers. It’s not very difficult, and in fact, it’s gotten a whole lot easier over the years as computers become more highly integrated. Consider what it would take to build something very modern like the Scooter Computer:

  1. Apply a dab of thermal compound to top of case.
  2. Place motherboard in case.
  3. Screw motherboard into case.
  4. Insert SSD stick.
  5. Insert RAM stick.
  6. Screw case closed.
  7. Plug in external power.
  8. Boot.

Bam done.

It’s stupid easy. My six year old son and I have built Lego kits that were way more complex than this. Even a traditional desktop build is only a few more steps: insert CPU, install heatsink, route cables. And a server build is merely a few additional steps on top of that, maybe with some 1U or 2U space constraints. Scooter, desktop, or server, if you’ve built one computer, you’ve basically built them all.

Everyone breathes a sigh of relief when their newly built computer boots up for the first time, no matter how many times they’ve done it before. But booting is only the beginning of the story. Yeah, it boots, great. Color me unimpressed. What we really need to know is whether that computer is stable.

Although commodity computer parts are more reliable every year, and vendors test their parts plenty before they ship them, there’s no guarantee all those parts will work reliably together, in your particular environment, under your particular workload. And there’s always the possibility, however slim, of getting very, very unlucky with subtly broken components.

Because we’re rational scientists, we test stuff in our native environment, and collect data to prove our computer is stable. Right? So after we boot, we test.

Memory

I like to start with memory tests, since those require only bootable media and work the same on all x86 computers, even before you have an operating system. Memtest86 is the granddaddy of all memory testers. I’m not totally clear what caused the split between that and Memtest86+, but all of them work similarly. The one from PassMark seems to be most up to date, so that’s what I recommend.

Download the version of your choice, write it to a bootable USB drive, plug it into your newly built computer, boot and let it work its magic. It’s all automatic. Just boot it up and watch it go.

(If your computer supports UEFI boot you’ll get the newest version 6.x, otherwise you’ll see version 4.2 as above.)

I recommend one complete pass of memtest86 at minimum, but if you want to be extra careful, let it run overnight. Also, if you have a lot of memory, memtest can take a while! For our servers with 128GB it took about three hours, and I expect that time scales linearly with the amount of memory.
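
If you want a rough idea of how long to budget, here’s a back-of-the-envelope sketch. It assumes the linear scaling I guessed at above, calibrated on that single 128GB / three hour data point; the function name is made up for illustration, and real run times will vary with memory speed and CPU.

```shell
# Rough estimate only: assumes memtest time scales linearly with RAM size,
# using the 128 GB in ~3 hours figure from the text as the reference point.
estimate_memtest_hours() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 3 / 128 }'
}

# Estimated hours for one pass on a 64 GB machine.
estimate_memtest_hours 64
```

Treat the result as a lower bound for planning, not a promise.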

The “Pass” percentage at the top should get to 100% and the “Pass” count in the table should be greater than one. If you get any errors at all, anything whatsoever other than a clean 100% pass, your computer is not stable. Time to start removing RAM sticks and figure out which one is bad.

OS

All subsequent tests will require an operating system, and one basic iron clad test of stability for any computer is whether it can install an operating system. Pick your free OS of choice, and begin a default install. I recommend Ubuntu Server LTS x64 since it assumes less about your video hardware. Download the ISO and write it to a bootable USB drive. Then boot it.

(Hey look it has a memory test option! How convenient!)

  • Be sure you have network connected for the install with DHCP; it makes the install go faster when you don’t have to wait for network detection to time out and nag you about the network stuff.
  • In general, you’ll be pressing enter a whole lot to accept all the defaults and proceed onward. I know, I know, we’re installing Linux, but believe it or not, they’ve gotten the install bit down by now.
  • About all you should be prompted for is the username and password of the default account. I recommend jeff and password, because I am one of the world’s preeminent computer security experts.
  • If you are installing from USB and get nagged about a missing CD, remove and reinsert the USB drive. No, I don’t know why either, but it works.

If anything weird happens during your Ubuntu Server install that prevents it from finalizing the install and booting into Ubuntu Server … your computer is not stable. I know it doesn’t sound like much, but this is a decent holistic test as it exercises the whole system in very repeatable ways.

We’ll need an OS installed for the next tests, anyway. I’m assuming you’ve installed Ubuntu, but any Linux distribution should work similarly.

CPU

Next up, let’s make sure the brains of the operation are in order: the CPU. To be honest, if you’ve gotten this far, past the RAM and OS test, the odds of you having a completely broken CPU are fairly low. But we need to be sure, and the best way to do that is to call upon our old friend, Marin Mersenne.

In mathematics, a Mersenne prime is a prime number that is one less than a power of two. That is, it is a prime number that can be written in the form M_n = 2^n − 1 for some integer n. They are named after Marin Mersenne, a French Minim friar, who studied them in the early 17th century. The first four Mersenne primes are 3, 7, 31, and 127.
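
To make that concrete, here is a toy version of the Lucas-Lehmer test, the standard primality check for Mersenne numbers. This is only a sketch of the idea, not how mprime does it: mprime uses heavily optimized FFT arithmetic on enormous numbers, while this uses plain 64-bit shell integers and so only works for exponents up to 31. The function name is my own, and the exponent is assumed to be prime.

```shell
# Lucas-Lehmer test: for an odd prime exponent p, M_p = 2^p - 1 is prime
# exactly when s_(p-2) == 0, where s_0 = 4 and s_(k+1) = (s_k^2 - 2) mod M_p.
# Sketch only: shell integers are 64-bit, so keep p <= 31 to avoid overflow.
lucas_lehmer() (
    p=$1
    if [ "$p" -eq 2 ]; then
        echo "prime"          # M_2 = 3, handled as a special case
        exit 0
    fi
    m=$(( (1 << p) - 1 ))
    s=4
    i=0
    while [ "$i" -lt $(( p - 2 )) ]; do
        s=$(( (s * s - 2) % m ))
        # Keep the residue non-negative in case s*s - 2 was negative.
        if [ "$s" -lt 0 ]; then
            s=$(( s + m ))
        fi
        i=$(( i + 1 ))
    done
    if [ "$s" -eq 0 ]; then
        echo "prime"
    else
        echo "composite"
    fi
)

# Try a handful of small prime exponents.
for p in 2 3 5 7 11 13; do
    echo "M_$p = $(( (1 << p) - 1 )): $(lucas_lehmer "$p")"
done
```

The exponents 2, 3, 5, and 7 give the four Mersenne primes listed above, while 11 gives 2047 = 23 × 89, which is not prime.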

I’ve been using Prime95 and MPrime – tools that attempt to rip through as many giant numbers as fast as possible to determine if they are prime – for the last 15 years. Here’s how to download and install mprime on that fresh new Ubuntu Server system you just booted up.

mkdir mprime
cd mprime
wget ftp://mersenne.org/gimps/p95v287.linux64.tar.gz
tar xzvf p95v287.linux64.tar.gz
rm p95v287.linux64.tar.gz

(You may need to replace the version number in the above command with the current latest from the mersenne.org download page, but as of this writing, that’s the latest.)

Now you have a copy of mprime in your user directory. Start it by typing ./mprime

Just passing through, thanks. Answer N to the GIMPS prompt.

Next you’ll be prompted for the number of torture test threads to run. They’re smart here and always pick an equal number of threads to logical cores, so press enter to accept that. You want a full CPU test on all cores. Next, select the test type.

  1. Small FFTs (maximum heat and FPU stress, data fits in L2 cache, RAM not tested much).
  2. In-place large FFTs (maximum power consumption, some RAM tested).
  3. Blend (tests some of everything, lots of RAM tested).

They’re not kidding when they say “maximum power consumption”, as you’re about to learn. Select 2. Then select Y to begin the torture and watch your CPU squirm in pain.

Accept the answers above? (Y):
[Main thread Feb 14 05:48] Starting workers.
[Worker #2 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Setting affinity to run worker on logical CPU #2
[Worker #4 Feb 14 05:48] Worker starting
[Worker #2 Feb 14 05:48] Setting affinity to run worker on logical CPU #3
[Worker #1 Feb 14 05:48] Worker starting
[Worker #1 Feb 14 05:48] Setting affinity to run worker on logical CPU #1
[Worker #4 Feb 14 05:48] Setting affinity to run worker on logical CPU #4
[Worker #2 Feb 14 05:48] Beginning a continuous self-test on your computer.
[Worker #4 Feb 14 05:48] Test 1, 44000 Lucas-Lehmer iterations of M7471105 using FMA3 FFT length 384K, Pass1=256, Pass2=1536.

Now’s the time to break out your Kill-a-Watt or similar power consumption meter, if you have it, so you can measure the maximum CPU power draw. On most systems, unless you have an absolute beast of a gaming video card installed, the CPU is the single device that will pull the most heat and power in your system. This is full tilt, every core of your CPU burning as many cycles as possible.

I suggest running the i7z utility from another console session so you can monitor core temperatures and speeds while mprime is running its torture test.

sudo apt-get install i7z
sudo i7z

Let mprime run overnight in torture test mode. The Mersenne calculations are meticulously checked, so if there are any mistakes the whole process will halt with an error at the console. And if mprime halts, ever … your computer is not stable.

Watch those CPU temperatures! In addition to absolute CPU temperatures, you’ll also want to keep an eye on total heat dissipation in the system. The system fans (if any) should spin up, and the whole system should be kept at reasonable temperatures through this ordeal, or else you’re going to have a sick, overheating computer one day.

The bad news is that it’s extremely rare to have any kind of practical, real world workload remotely resembling the stress that Mersenne lays on your CPU. The good news is that if your system can survive the onslaught of Mersenne overnight, it’s definitely ready for anything you can conceivably throw at it in the future.

Disk

Disks are probably the easiest items to replace in most systems – and the ones most likely to fail over time. We know the disk can’t be totally broken since we just installed an OS on the thing, but let’s be sure.

Start with a bad blocks test for the whole drive.

sudo badblocks -sv /dev/sda

This exercises the full extent of the disk (in safe read only fashion). Needless to say, any errors here should prompt serious concern for that drive.

Checking blocks 0 to 125034839
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
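
If you are scripting this, a small helper can pull the bad block count out of that summary line. This is a sketch that assumes the summary keeps exactly the format shown above; the function name is hypothetical.

```shell
# Reads badblocks output on stdin and prints the number of bad blocks found.
# Assumes the "Pass completed, N bad blocks found." summary format shown above.
bad_block_count() {
    sed -n 's/^Pass completed, \([0-9][0-9]*\) bad blocks found.*/\1/p'
}

# Example with the summary line from the run above.
echo 'Pass completed, 0 bad blocks found. (0/0/0 errors)' | bad_block_count
```

Anything other than 0 here means the drive should not go into service.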

Let’s check the SMART readings for the drive next.

sudo apt-get install smartmontools
sudo smartctl -i /dev/sda

That will let you know if the drive supports SMART. Let’s enable it, if so, and see the basic drive stats:

sudo smartctl -s on /dev/sda
sudo smartctl -a /dev/sda

Now we can run some SMART tests. But first check how long the tests on offer will take:

sudo smartctl -c /dev/sda

Run the long test if you have the time, or the short test if you don’t:

sudo smartctl -t long /dev/sda

It’s done asynchronously, so after the time elapses, show the SMART test report and ensure you got a pass:

sudo smartctl -l selftest /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       100         -
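
Rather than eyeballing the report, you could check the most recent entry with a few lines of awk. This is a minimal sketch that assumes the self-test log looks like the output above, with the newest test on the “# 1” row; the function name is invented for this example.

```shell
# Reads `smartctl -l selftest` output on stdin. Succeeds (exit 0) only if the
# most recent entry, the "# 1" row, reports "Completed without error".
latest_selftest_passed() {
    awk '/^# *1 / { found = 1; if ($0 ~ /Completed without error/) ok = 1 }
         END { exit !(found && ok) }'
}

# Example with the log line from the report above.
if echo '# 1  Extended offline    Completed without error       00%       100         -' | latest_selftest_passed; then
    echo "self-test passed"
else
    echo "self-test did not pass"
fi
```

On a real system you would pipe the smartctl report into the function instead of the canned line.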

Next, run a simple disk benchmark to see if you’re getting roughly the performance you expect from the drive or array:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
sudo hdparm -Tt /dev/sda

For a system with a basic SSD you should see results at least this good, and perhaps considerably better:

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads:   11434 MB in  2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads:  760 MB in  3.00 seconds = 253.09 MB/sec
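
If you want to sanity check the number dd reports, the arithmetic is just bytes divided by seconds. Here is a small sketch; the function name is made up, and it uses decimal megabytes (1,000,000 bytes), which is what dd’s “MB/s” figure above appears to be.

```shell
# Throughput in decimal MB/s (1 MB = 1,000,000 bytes), rounded to a whole number.
throughput_mb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", bytes / secs / 1000000 }'
}

# The dd run above: 536870912 bytes in 1.52775 seconds.
throughput_mb_per_s 536870912 1.52775
```

That reproduces the 351 MB/s dd printed for the write test.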

Finally, let’s try a more intensive test with bonnie++, a disk benchmark:

sudo apt-get install bonnie++
bonnie++ -f

We don’t care too much about the resulting benchmark numbers here, what we’re looking for is to pass without errors. And if you get errors during any of the above … your computer is not stable.

(I think these disk tests are sufficient for general use, particularly if you consider drives easily RAID-able and replaceable as I do. However, if you want to test your drives more exhaustively, a good resource is the FreeNAS “how to burn in hard drives” topic.)

Network

I don’t have a lot of experience with network hardware failure, to be honest. But I do believe in the cult of bandwidth, and that’s one thing we can check.

You’ll need two machines for an iperf test, which makes it more complex. Here’s the server, let’s say it’s at 10.0.0.1:

sudo apt-get install iperf
iperf -s

and here’s the client, which will connect to the server and record how fast it can transmit data between the two:

sudo apt-get install iperf
iperf -c 10.0.0.1

------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.2 port 43220 connected with 10.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes    933 Mbits/sec

As a point of reference, you should expect to see roughly 120 megabytes/sec (aka 960 megabits/sec) of real world throughput on a single gigabit ethernet connection. If you’re lucky enough to have a 10 gigabit connection, well, good luck reaching that meteoric 1.25 gigabyte/sec theoretical throughput maximum.
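
Since iperf reports megabits and most of us think in megabytes, here is the conversion as a tiny sketch. It simply divides by eight and ignores protocol overhead; the function name is my own.

```shell
# Convert megabits/sec to megabytes/sec (8 bits per byte, no protocol overhead).
mbits_to_mbytes() {
    awk -v mbits="$1" 'BEGIN { printf "%.3f\n", mbits / 8 }'
}

mbits_to_mbytes 933    # the iperf result above
mbits_to_mbytes 1000   # a gigabit link at its theoretical line rate
```

So the 933 Mbits/sec measured above works out to a bit under 117 megabytes/sec, against a theoretical ceiling of 125.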

Video Card

I’m not covering this, because very few of the computers I build these days need more than the stuff built into the CPU to handle video. Which is getting surprisingly decent, at last.

You’re a gamer, right? So you’ll probably want to boot into Windows and try something like furmark. And you should test, because GPUs – especially gaming GPUs – are rather cutting edge bits of kit and burn through a lot of watts. Monitor temperatures and system heat, too.

If you have recommendations for gaming class video card stability testing, share them in the comments.

OK, Maybe It’s Stable

This is the regimen I use on the machines I build and touch. And it’s worked well for me. I’ve identified faulty CPUs (once), faulty RAM, faulty disks, and insufficient case airflow early on so that I could deal with them in the lab, before they became liabilities in the field. Doesn’t mean they won’t fail eventually, but I did all I could to make sure my computers can live long and prosper.

Who knows, with a bit of luck maybe you’ll end up like the guy whose netware server had sixteen years of uptime before it was decommissioned.

These tests are just a starting point. What techniques do you use to ensure the computers you build are stable? How would you improve on these stability tests based on your real world experience?


The Scooter Computer

When we initially deployed our handbuilt colocated servers for Discourse in 2013, I needed a way to provide an isolated VPN channel in for secure remote access and troubleshooting. Rather than dedicate a whole server to this task, I purchased the inexpensive, open source firmware friendly Asus RT-N16 router, flashed it with the popular TomatoUSB open source firmware, removed the antennas, turned off the WiFi and dropped it off in our colocated rack to let it act as a dedicated VPN access point.


And that box – which was $100 then and around $70 now – worked well enough until now. Although the version of OpenSSL in the 2012 era Tomato firmware we used is not vulnerable to Heartbleed, it’s still getting out of date in terms of the encryption it supports and allows. And Tomato itself is updated sporadically, chaotically at best.

Let’s face it: this is just a little box that runs a chopped up version of Linux, with a bit of specialized wireless hardware and multiple antennas tacked on … that we’re not even using. So when it came time to upgrade, we wondered:

Why not just go with a small box that can run a real, full Linux distro? Wouldn’t that be simpler and easier to keep up to date?

After doing some research and asking on Twitter, I discovered there are a ton of amazing little Broadwell “mini-PC” boxes available on AliExpress.

The specs are kind of amazing for the price. I paid ~$350 each for the ones I selected:

  • i5-5200U Broadwell 2 core / 4 thread CPU at 2.2 Ghz – 2.7 Ghz
  • 16GB DDR3 RAM
  • 128GB M.2 SSD
  • Dual gigabit Realtek 8168 ethernet
  • front 4 USB 3.0 ports / rear 4 USB 2.0 ports
  • Dual HDMI out

(There are also optical and analog audio connectors on the front, as well as an SD card reader, which I covered with a sticker since we had no need for audio. I also stripped the WiFi out since we didn’t need it, but it was included for the price, too.)

Selecting the i5-4258u, 4GB RAM, and 64GB SSD pushes the price down to $270. That’s still a solid CPU, only a single generation behind Intel’s latest and greatest Skylake, and carrying the midrange i5 moniker; it’s no pushover. There are also many, many variants of this box from other AliExpress sellers that have slightly older, cheaper CPUs that are still plenty powerful. You can easily spec a box similar to this one for $200.

That’s not a whole lot more than the $200 you’d pay for a high end router these days, and as Ars Technica notes, the average x86 box is radically faster.

Note that in the above graphs, “homebrew” means an old, 1.8 Ghz Ivy Bridge dual core chip, 3 generations behind current CPUs, that doesn’t even merit the i3 or i5 designation, and has no hyperthreading. Do bear that in mind as you keep reading.

Meet The Scooter Computer

This box may be small, and only 15 watt TDP, but it is mighty. I spun up a new Digital Ocean droplet and ran a quick benchmark:

sudo apt-get install sysbench
sysbench --test=cpu --cpu-max-prime=20000 run

Tie Shuttle 6

total time:           28.0707s
total num events:     10000
total time taken:     28.0629
per-request stats:
     min:             2.77ms
     avg:             2.81ms
     max:             3.99ms
     ~95 percentile:  3.00ms

Digital Ocean Droplet

total time:          35.9541s
total num events:    10000
total time taken:    35.9492
per-request stats:
     min:             3.50ms
     avg:             3.59ms
     max:             13.31ms
     ~95 percentile:  3.79ms

Results will of course vary by cloud provider, but rest assured this box is just as fast as and possibly even faster than the average cloud box you could spin up right now. Of course it is “only” 2 cores / 4 threads, but the more cores you need, the slower they tend to go because of the overall TDP limits of the core package.
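
To put a number on that comparison, divide the slower total time by the faster one. A quick sketch, using the two “total time” figures from the runs above; the function name is hypothetical.

```shell
# Ratio of two run times: how many times faster the second run finished.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Droplet total time vs. scooter total time, in seconds.
speedup 35.9541 28.0707
```

That works out to the scooter finishing this single-threaded benchmark roughly 1.28 times faster than the droplet I happened to spin up.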

One thing that’s not immediately obvious in photos is that this thing is indeed small but hefty, like holding a solid chunk of aluminum in your hand. That’s because the box is passively cooled — the whole case is the heatsink, as the CPU on the bottom of the motherboard mates with the finned top of the case.

Opening this box you realize just how simple things are inside it; it’s barely more than a highly integrated motherboard strapped to an aluminum block. This isn’t a Steve Jobs truck, a Mac Mini car, or even a motorcycle. This is a scooter.

Scooters are very primitive machines; it is both their greatest strength and their greatest weakness. It’s arguably the simplest personal wheeled vehicle there is. In these short distance scenarios, scooters tend to win over, say, bicycles because there’s less setup and teardown necessary – you don’t have to lock up a scooter, nor do you have to wear a helmet. Just hop on and go! You get almost all the benefits of gravity and wheeled efficiency with a minimum of fuss and maintenance. And yes, it’s fun, too!

Passively cooled computers are paragons of simplicity and reliable consumer electronics, but passively cooling a “real” x86 PC is the holy grail. To get serious performance you usually need to feed the CPU at least 10 to 20 watts – and dissipating that kind of energy with zero fans and ambient airflow alone is not trivial. Let’s see how our scooter does overnight running Mersenne Primes, which is the heaviest CPU load possible.

You can place your hand on the top of the box during this, but it’s uncomfortable. And the whole box radiates heat, not just the top. Overall it was completely stable for me during overnight mprime torture testing with the 15w TDP CPU I chose, and I am comfortable with these boxes sitting in our rack in the datacenter, even under extended full load. However, I would be very careful putting a 28w TDP CPU in this box unless you are absolutely sure it won’t be at full load very often. Passive cooling is hard.

Power consumption, as measured by my Kill-a-Watt, ranged from 7 watts at the Ubuntu Server 14.04 text login screen, to 8-10 watts at an idle Ubuntu 15.10 GUI login screen (the default OS it arrived with), to 14-18 watts in memory testing, to 26 watts in mprime.

I should also mention that even under extreme mprime load, both cores stayed at 2.5 Ghz indefinitely, which is unusual in my experience. To achieve 2.7 Ghz you need a single threaded load. Considering the base clock of the i5-5200u is 2.2 Ghz, that’s quite good! Many 4-6-8 core CPUs drop all the way down to their base clock pretty fast once they have significant load, which makes the “turbo” moniker a bit of a lie.

(By the way, don’t bother using burnP6, it generates way too little heat compared to mprime, which is an absolute monster. If your CPU can survive an overnight run of mprime, I can assure you it’s ready for just about anything the real world can throw at it, ever.)

Disk

The machine has M.2 slots for two drives, as well as a SATA port and power cable (not pictured, but was included in the box) if you want to mate a 2.5″ drive with the drive mounting holes on the bottom of the case. So if you prefer a mirrored two drive RAID array here for reliability, or a giant honking 2TB 2.5″ HDD slapped in there for media storage, all of that is possible!

Be careful, as the internal M.2 slots are 2242, meaning 42mm length. There seem to be mostly (only?) lower cost SSD drives available in this size for whatever reason.

Don’t worry, though, the bundled 128GB Phison S9 M.2 SSD has decent performance, roughly equal to a good SSD from a few years ago:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
sudo hdparm -Tt /dev/sda

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads:   11434 MB in  2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads:  760 MB in  3.00 seconds = 253.09 MB/sec

That’s respectable SSD performance and won’t hold you back in most use cases, but it’s not a barn-burning disk subsystem, either. I’m not entirely sure retrofitting, say, the state of the art Samsung 950 Pro M.2 2280 drive is possible due to length restrictions.

Of course the Samsung 850 Pro would fit fine as a traditional 2.5″ SATA drive mounted to the case cover, and would perform like this:

536870912 bytes (537 MB) copied, 1.20895 s, 444 MB/s
Timing cached reads:   38608 MB in  2.00 seconds = 19330.61 MB/sec
Timing buffered disk reads: 1584 MB in  3.00 seconds = 527.92 MB/sec

RAM

Intel limits these Broadwell U class CPUs to 16GB RAM total, so maxing the box out is only going to set you back around $70. Still, that’s a significant percentage of the ~$350 total cost, and you may not need that much RAM for what you have in mind.

However, do be careful that you get dual-channel RAM for lower RAM configurations; you don’t want a single 4GB DIMM, you want two 2GB DIMMs. They ship from the vendor with a single DIMM, so beware. It may not matter depending on the task, as noted by AnandTech, but our boxes will be used for OpenSSL, and memory is cheap, so why not?

The Versatile Scooter

When I began looking at this, I was shocked to discover just how low-end the x86 CPUs are in a lot of “dedicated” devices, such as the official pfSense hardware:

Sure, 2.4 Ghz and 8 cores on that C2758 sounds reasonable – until you realize those are old Intel Bay Trail Atom cores. Even the current Cherry Trail Atom cores aren’t so hot. Furthermore, those are probably the maximum “turbo” frequencies being quoted, which are unlikely to be sustained under any kind of real multi-core load. Also, did I mention this is being sold as a $1,400 device? Except for the lack of more than 2 dedicated gigabit ethernet ports, I’d put our scooter computer up against that C2758 any day of the week. And you know what? It’d win.

I think this logic applies to a lot of dedicated hardware these days — routers, switches, firewalls, and so on. You’re often better off building up a modern high power, low TDP x86 box and slapping a regular Linux distro on there.

You can even kinda-sorta fit six of them in a 1U rack space.

(Well, except for the power bricks and cables. Vertical mounting on a 1U shelf works out a bit better, and each conveniently came with a stand for vertical operation.)

Now that I’ve worked with these boxes, I’ve become rather enamored of the Scooter Computer concept. Wherever we were thinking that we had to run either:

  • A virtual machine on big iron for some small but important utility function in our rack.
  • Dedicated, purpose built hardware for networking, firewall, or switching with a custom OS.

… we can now take advantage of cheap, reliable, flexible, totally solid state commodity x86 hardware that’s spread across many machines and running standard Linux distributions, like all the rest of our 1U servers.


Source: Coding Horror 

About the Author:

Jeff Atwood lives in Berkeley, CA with his wife, two cats, three children, and a whole lot of computers. He was weaned as a software developer on various implementations of Microsoft BASIC in the 80’s, starting with his first microcomputer, the Texas Instruments TI-99/4a. Jeff continued on the PC with Visual Basic 3.0 and Windows 3.1 in the early 90’s, although he also spent significant time writing Pascal code in the first versions of Delphi. He is now quite comfortable in VB.NET or C#, despite the evils of case sensitivity. Jeff is currently learning Ruby.

He considers himself a reasonably experienced web software developer with a particular interest in the human side of software development, as represented in his recommended developer reading list. Computers are fascinating machines, but they’re mostly a reflection of the people using them. In the art of software development, studying code isn’t enough; you have to study the people behind the software, too.

Contact:

Twitter @codinghorror


© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013