Possible Duplicate:
Should I use ECC RAM for the next computer I build?

How often does regular RAM make a mistake?

Generally speaking, what are the odds of it actually impacting anything (including silent data corruption here, but excluding changes to some data that will never be read)?

EDIT: I would also be interested to know if there are differences between DDR2 and DDR3 in this regard.

soandos
  • @techie007 I disagree. I don't care about what is faster, or which motherboards support it. That answer also lacks any numbers for how often these issues occur. All he has is anecdotal evidence that it helps. – soandos Dec 11 '11 at 07:20
  • @techie007 Also, the PDF he links to, while interesting, does not give a real number for how many errors regular RAM makes (especially today, as opposed to in 2009) or explain why there appears to be such variation (a sign of bad data?). It also only deals with errors that are correctable. What is the difference between that and the total? – soandos Dec 11 '11 at 07:24
  • What's the actual question? "What's the difference between ECC and non-ECC RAM?" (which you can find out on Wikipedia at least I'm sure), or "What are the odds I could be affected by a 1-bit RAM error that ECC could have caught"? – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:32
  • More a modification of the second. What are the odds that I will be affected by an error, minus the odds that I will be affected by an error when using ECC RAM? – soandos Dec 11 '11 at 07:33
  • Well ECC memory can only correct 1-bit errors, so at least those are the only ones you have to worry about subtracting from your total. ;) So you want to know how much more "at risk" of 1-bit errors you are by not having ECC? – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:39
  • Something like that. I think there is also a small percentage (1-bit error chance squared) that it will not catch a 1 bit error – soandos Dec 11 '11 at 07:41
  • Another related: [How frequent are DRAM errors?](http://superuser.com/questions/26493/how-frequent-are-dram-errors) – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:45
  • The paper linked in that article is the same that is quoted in the other question that you marked as related. – soandos Dec 11 '11 at 07:46
  • It doesn't mean it's not asking the same question (at least partly) that you are. – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:47
  • I think the core question there is still " Is it worth investing in ECC RAM?" and the answer reflects that. I don't really care about that at all. – soandos Dec 11 '11 at 07:49
  • Well a system doing one memory operation an hour will have a lot less chance of getting the type of memory error that ECC will prevent, than the same computer doing 10000 operations a second. :) – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:59
  • Why is that the case? – soandos Dec 11 '11 at 08:00

1 Answer

You are looking for the mean time between failures (MTBF) and the mean time to failure (MTTF).

Both depend on the quality of the RAM (defects are possible) as well as on its failure rate. The failure rate depends mainly on the total device hours and an acceleration factor at which the cells fail. Other parameters include temperature, uptime and energy...

A document that goes into detail on this is Hybrid Memory Products Ltd - SRAM Module - MTBF analysis. In that example, the memory is rated to last hundreds of years, well beyond a human lifetime. Various memory manufacturers report similar figures; here is an example from Kingston:

Our process works so well that our mean time between failure rating exceeds 500 years!

The gist of this is that ECC is there to cover up for hardware mistakes or extreme usage. This is why you often see it installed in servers, whose operators don't want to risk having faulty memory.


From the other question, there is a study on this that shows different results: 50 to 167 errors per month, rather than one error after a long lifetime. Now who speaks the truth? Did Google properly use MemTest?

Google has come out swinging on this issue. See http://blogs.zdnet.com/storage/?p=638 for how this really does affect modern-day systems.

This, however, is from 2009 and based on data from the years before that, so it might be different these days.
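To see how far apart the two figures quoted above are, here is a rough sketch that turns an MTBF into an expected failure rate. It assumes a constant failure rate of 1/MTBF, and it assumes the 500-year rating and the 50 to 167 errors per month describe the same unit of hardware, which neither source states. The vendor number most likely counts whole-module failures while the study counts individual memory errors, so treat the ratio as an illustration of the gap, not a measurement. The function names are made up for this example.

```python
HOURS_PER_YEAR = 8766      # 365.25 days * 24 hours
MONTHS_PER_YEAR = 12


def failures_per_month(mtbf_years: float) -> float:
    """Expected failures per month, assuming a constant rate of 1/MTBF."""
    return 1.0 / (mtbf_years * MONTHS_PER_YEAR)


def fit_rate(mtbf_years: float) -> float:
    """Failures per 10^9 device-hours (FIT) for the same constant rate."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


if __name__ == "__main__":
    mtbf_years = 500                 # vendor figure quoted in the answer
    study_low, study_high = 50, 167  # errors per month quoted from the study

    vendor_rate = failures_per_month(mtbf_years)
    print(f"Vendor MTBF of {mtbf_years} years: "
          f"{vendor_rate:.6f} failures/month, {fit_rate(mtbf_years):.0f} FIT")
    print(f"Study figure is {study_low / vendor_rate:,.0f}x to "
          f"{study_high / vendor_rate:,.0f}x higher (if the units were comparable)")
```

A gap of this size suggests the two sources are not measuring the same kind of event, which is one way to read the "who speaks the truth?" question.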

Tamara Wijsman
  • If there are hardware mistakes, won't they show up rather quickly using something like memtest? – soandos Dec 11 '11 at 07:44
  • @soandos: They might or might not; it depends on how long you scan. Some people can't afford to spend so much time confirming that all their hardware components are fine. While I guess it is done at a professional level, often people just go and run it when they suspect memory errors... – Tamara Wijsman Dec 11 '11 at 07:48
  • If it is a hardware defect, wouldn't it fail 100% of the time when that area of RAM gets written to or read from? – soandos Dec 11 '11 at 07:50
  • Single-bit memory errors can happen even if the RAM isn't faulty, and can happen practically spontaneously ("Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state."). ECC is for catching those one-off system errors, not for keeping the system working if the RAM is broken (that's what "ChipKill" is for). – Ƭᴇcʜιᴇ007 Dec 11 '11 at 07:53
  • @techie007: As I said, other parameters have an influence. Interference should be fine with most of today's motherboards, and that's again a hardware defect if it does occur on a regular basis. Hence, ECC covers up for those hardware mistakes... – Tamara Wijsman Dec 11 '11 at 08:05
  • @soandos: Not really, a bad contact or two cells that are too close together can result in random behavior. – Tamara Wijsman Dec 11 '11 at 08:05
  • Those happen intermittently? – soandos Dec 11 '11 at 08:07
  • @soandos: What happens intermittently? Also, I have updated my previous comment slightly... – Tamara Wijsman Dec 11 '11 at 08:08
  • Bad contacts or two cells that are close together. – soandos Dec 11 '11 at 08:08
  • @soandos: Yes, environmental parameters have an influence on this. Temperature can cause a distance change between them, more/less energy (voltage) can make it easier/harder for the bit to propagate, the magnetic interference due to a defect is dependent on whether positive/negative bits are set. That last one is one of the reasons why different patterns are used when testing... :) – Tamara Wijsman Dec 11 '11 at 08:16