The Computer Oracle

What happened to ECC RAM?

Music by Eric Matyas
https://www.soundimage.org
Track title: Ominous Technology Looping

--

Chapters
00:00 What Happened to ECC RAM?
00:54 Answer 1 Score 69
02:51 Accepted Answer Score 42
06:26 Answer 3 Score 39
06:56 Answer 4 Score 39
08:25 Answer 5 Score 4
09:30 Thank you

--

Full question
https://superuser.com/questions/1635090/...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#memory #ecc




ANSWER 1

Score 69


More than 15 years ago, Intel decided that ECC RAM support was not of value in consumer machines. As a result, the market does not support it outside of server hardware, and end consumers are paying the price.

This January 2021 article in ExtremeTech provides a fairly solid summary of what happened: “Linus Torvalds Blames Intel for Killing ECC RAM in Consumer Systems”:

“There was a time when you could buy ECC support on mainstream chipsets, but Intel phased out that capability on non-Xeon platforms a number of years ago. The 975X may have been the last consumer Intel platform to support it, and that family launched 15 years ago. The Xeon 3450 chipset was cross-compatible with certain high-end CPUs in the Nehalem family, but that’s still a Xeon chipset — not a mainstream part.”

“As a result, support for ECC in consumer products — and the availability of ECC RAM for consumer products — both fell off a cliff.”

Since the article quotes Linus Torvalds, here is his specific complaint:

“The memory manufacturers claim it’s because of economics and lower power. And they are lying bastards – let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an ‘attack’, when it always was ‘we’re cutting corners.’”

The issue here is that Linux gets blamed for kernel errors, but Linus Torvalds believes the root cause is hardware problems that can be traced to the prevalence of non-ECC RAM in machines nowadays.

But that is a tangent… What it comes down to is PC manufacturers cutting corners. Classic manufacturing issue.

And nowadays, when PC hardware is considered pretty disposable, there might be some rationale here: when RAM starts to get flaky, just toss the machine and buy a new one. The truth is the market is filled with non-techs and non-PC builders so hey… It stinks but it is what it is.




ACCEPTED ANSWER

Score 42


“A decade or two ago I could buy ECC (Error Correction Code) RAM for PCs I assembled. ECC RAM provided SEC-DED, I guess from bit flips caused by ionizing radiation (I don't know what else could cause transient bit errors to pop up in RAM or I/O buses).”
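For readers unfamiliar with the term, SEC-DED means single error correction, double error detection. The sketch below illustrates the idea on a toy extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. It is only an illustration; real ECC DIMMs protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d
    # Hamming parity bits sit at positions 1, 2 and 4.
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2         # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_odd = sum(word) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single error at position `syndrome`
        # (0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped, cannot repair.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                            # simulate a single bit flip
print(decode(word))                     # ('corrected', 11)
```

A single flipped bit is repaired silently, while two flipped bits are reported rather than miscorrected, which is the behaviour the question attributes to ECC RAM.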

There are 3 general causes of bit errors, the first two of which are single event upsets:

  1. Radiation (primarily free neutrons). This particular phenomenon depends on a number of things, such as the neutron cross section of the particular device. It may seem counter-intuitive, but the newer, much smaller geometries have a lower probability of an upset due to neutrons because they have been designed to be less susceptible. See the Xilinx link (from below).

  2. Lead, specifically Pb-210, which is part of the uranium decay chain and is found in older kit in the balls of BGA devices. Xilinx refers to errors from this as the alpha rate because an alpha particle is emitted during decay. Clearly not an issue for a great deal of current equipment that is lead-free (but still quite an issue in aerospace, where tin-lead processing is still common).

  3. General bit error rate issues. A memory interface is a communication channel, and all communication channels have an error rate. Admittedly, you may never see a single bit error in the life of a particular piece of equipment, as this is a statistical quantity. Errors due to electrical noise and poor device decoupling also fall into this category.

“i.e., if ECC RAM was considered a useful feature a decade ago, do the reasons it was useful no longer apply to current personal computers and servers? Or is the thinking now that ECC RAM was never actually useful?”

It was useful, but of limited value, although many side channel attacks can be mitigated by its use.

The real reason you can't find it on commercially available boards is simply cost. Boards that do have it carry a rather large premium, far higher than the added cost of the silicon to handle it and the extra 8 bits of memory width (for a 64-bit memory system). The cost-benefit analysis doesn't support its broad availability.
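As a rough check on the "extra 8 bits" figure, the number of check bits a Hamming-style SEC-DED code needs can be worked out directly: r parity bits can locate a single error among m data bits when 2^r >= m + r + 1, and one more overall parity bit adds double error detection. The helper name `secded_check_bits` is hypothetical, and the sketch assumes a plain extended Hamming construction, not any vendor's actual code.

```python
def secded_check_bits(data_bits):
    """Check bits needed by an extended Hamming (SEC-DED) code."""
    r = 1
    # Smallest r such that r parity bits can point at any single-bit error
    # in the data_bits + r code bits, or signal "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double error detection


for m in (8, 16, 32, 64, 128):
    extra = secded_check_bits(m)
    print(f"{m} data bits need {extra} check bits ({100 * extra / m:.1f}% overhead)")
```

For 64 data bits this gives 8 check bits, which matches the 72-bit width of an ECC DIMM and is a 12.5% increase in memory cells.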

I do remember a research paper from Boeing that discussed soft errors in a Denver data centre. The amount of free neutrons is (up to a certain level) proportional to altitude. The higher you go, the more there are.

“If ECC memory was helpful twenty years ago presumably it would be more helpful now that PCs are running with 1-2 orders of magnitude more memory, at lower voltages and with smaller physical features that (presumably) are more susceptible to corruption from stray radiation. Are any of these assumptions incorrect?”

The memory interfaces we have today are far more robust than you might think; for DDRx, the data strobes are differential (so they reject common mode noise) and lower transition voltages are actually better for high speed interfaces, as we proved years ago with ECL.

In avionics, and in particular flight safety critical avionics such as flight control computers, the use of ECC for L2 and beyond is mandatory as is the use of parity for L1. That is one of the reasons those cards are not from Intel or AMD.

[Update]. The specifics of just how memory cells are laid out have a rather large effect on their susceptibility to SEUs; Xilinx has taken a particular approach that effectively stacks memory cells in such a way that the probability of a high energy neutron causing a bit flip is significantly reduced.

As I am not an IC designer that is all I can really say. There is a great deal more information at the Rosetta Project.




ANSWER 3

Score 39


I agree with the answer provided by @Giacomo1968 as far as history goes. The current state, however, is changing. AMD has recently started to support ECC memory in their current desktop CPU line for the AM4 socket: "ECC is not disabled. It works, but not validated for our consumer client platform." (Source: Reddit)

That said, the motherboard also needs to support this. Some consumer boards do, some don't.




ANSWER 4

Score 39


A bit more that addresses the question:

Intel unilaterally decided that consumers did not need ECC and decided to only provide it to server and workstation customers where Intel could charge a premium.

Microsoft tried to make ECC a required feature for Vista certification but Intel refused to do it. Before the Core i7 series the memory controller was part of the motherboard and ECC support was a motherboard chipset feature.

You can get laptops with ECC. For example, there's the Dell Precision Workstation line which you can get with a Xeon-W CPU and ECC RAM.

On the desktop, you can buy any Ryzen CPU. Well, any Ryzen without integrated graphics. For a Ryzen with integrated graphics to work with ECC, you need a Pro version, which is hard to find unless you buy it in a prebuilt system.

With a Ryzen and a motherboard like the ASUS PRO line the unbuffered ECC will work great.

For the registered, buffered ECC modules you need a real Xeon or EPYC CPU because those RAM types are controlled differently.

In the near future DDR5 RAM has the option to use ECC internally, without any notification or control from the CPU. It also has the option to provide signalling and control to CPUs that support it.

Here is an example of an ECC module you can buy today. I bought four of them for a Ryzen build:

"Crucial Server Memory 16GB DDR4 DIMM 288-pin - 2666 MHz / PC4-21300 - CL19 - 1.2 V - unbuffered - ECC CT16G4WFD8266"

I would provide an Amazon link but that may be considered spam. Also note that you can get ECC modules at 3,200 MHz speed today.




ANSWER 5

Score 4


This is a reflection of consumer software quality, among other factors.

ECC RAM only really helps if the error rate from sporadic bit flips is the dominant cause of failures. If you run high reliability software which never crashes on its own, eliminating the few remaining sources of errors actually improves the MTBF of the system.

If you run software which is produced in a rush because of the "winner takes it all" economics, it will have plenty of failure sources besides RAM errors. Paying the premium to reduce your error rate by a few percent just doesn't make sense in this case.
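The argument in the two paragraphs above can be put into numbers. The sketch below assumes independent failure causes with constant rates, so the rates simply add and MTBF is the reciprocal of the total. It also assumes, optimistically, that ECC removes every memory-related failure. The function names and all rates are made up for illustration and are not measurements.

```python
def mtbf_hours(*failure_rates_per_hour):
    """MTBF of a system whose independent failure rates add up."""
    return 1.0 / sum(failure_rates_per_hour)


def mtbf_gain_from_ecc(software_rate, memory_rate):
    """Factor by which MTBF grows if ECC removes all memory failures.

    software_rate must be greater than zero.
    """
    before = mtbf_hours(software_rate, memory_rate)
    after = mtbf_hours(software_rate)
    return after / before


# Hypothetical rushed consumer software: one crash per 500 hours on its own,
# plus one memory-induced failure per 50,000 hours.
print(mtbf_gain_from_ecc(1 / 500, 1 / 50_000))

# Hypothetical high-reliability software: one crash per 100,000 hours,
# same memory failure rate.
print(mtbf_gain_from_ecc(1 / 100_000, 1 / 50_000))
```

With these illustrative numbers the first system gains about 1% in MTBF from ECC, while the second roughly triples its MTBF, because memory errors dominate only when the software itself rarely fails.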

And then comes the positive feedback typical for goods with high fixed costs: higher prices mean less demand, which in turn means even higher prices. I don't think Intel is to blame here: they didn't stop supporting ECC in consumer chips because they hate technology, they did so because they earn more money selling cheaper non-ECC chips.

Notably, in the world of industrial microcontrollers, which run software designed according to functional safety standards, ECC RAM is widely used today.