Monday, June 2, 2008

3: A Fairy's Tale

Once upon a time, a long time ago, a beautiful Princess lived in a delightful castle at the top of a sumptuous green, flower-dappled hill. The soft perfumed air was filled with the restful sound of water burbling over stones. The deer and rabbits played together happily in the sun-spangled glades of the nearby forest. She sat at the window; and as she sat she combed her long beautiful hair and thought ... Thank goodness the mapping between Boolean Equations and Logic Implementation is also this perfect, or it will never yield the 32nm integrated circuits that embedded intelligent systems will depend upon!

... We have become used to an absolute truth: that the Si Fabrication Process results in an Integrated Circuit that does what we Designed it to do. But poised at the doorstep of 32nm, it is well for us to consider the Princess's warning and check that that perfection does still apply!

Well, we know that Boolean Mathematics is an absolute truth. So we can be sure that the gates and software we model exactly predict the states and transitions that occur in the circuits and modules we implement! ... Or can we?

The mathematics may be right, but we also know that even a relatively simple logical object can have so much state and so many logical combinations that its functionality cannot be fully explored by simulation in a lifetime. Simulations can only be explorations of a limited area of the functional space, and even less of the non-functional space ... And we also know that what we simulate only probably covers what the application needs! Of course real Logic Gates are not actually binary but analogue circuits, susceptible to the wide variety of interference sources in the real world, which casts a further element of doubt over even what we did simulate!
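To put an illustrative number on that impossibility (a back-of-envelope sketch of my own, not a figure from any datasheet): exhaustively simulating even a stateless 64-bit adder means exercising two 64-bit operands, i.e. 2^128 input vectors.

```python
# Back-of-envelope: exhaustive simulation of a 64-bit adder.
# Two 64-bit operands give 2**128 input combinations (no state at all).
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

vectors = 2 ** 128           # every input combination, applied just once
rate = 10 ** 9               # a generous 1 billion vectors per second
years = vectors / rate / SECONDS_PER_YEAR
print(f"{years:.1e} years")  # ~1.1e+22 years -- "in a lifetime" is generous
```

And that is with no state at all; add a few registers and the space explodes further still.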

OK; well we are confident about the Manufacturing Process. We know that what the masks describe, the fabrication process implements ... Doesn't it?

Every 18 months the number of features on a chip doubles (as dictated by that other fairy tale, Moore's Law), and each step halves the feature size and more than doubles the features on each mask, frequently introducing new masks and process steps as well. As the features get smaller and more numerous, they become more susceptible to imperfections (defects), which also get more difficult to contain. Whilst some defects will 'break' the circuit, many more are non-fatal, producing weakened circuits, many of which will not show up ... just yet.
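To see how quickly that compounding bites (the 100M starting point here is my illustrative assumption):

```python
# Moore's Law compounding: feature count doubling every 18 months.
features, year = 100_000_000, 2008.0   # assume ~100M features today
while features < 1_000_000_000:
    features *= 2
    year += 1.5
print(f"~{features/1e9:.1f}B features by about {year:.0f}")
# -> ~1.6B features by about 2014
```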

... We also know that as we move below ~60nm, the number of atoms in each transistor is getting so low that device behaviour is ceasing to be determined by bulk Si properties and is becoming atomistic (probabilistic). Your NAND2 will be a NAND2 most of the time, and its timing will be only statistically predictable ... Design with that!
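Here is a toy Monte Carlo sketch of what 'only statistically predictable' does to timing; the Gaussian threshold-voltage spread and the delay model are illustrative assumptions of mine, not device data:

```python
import random

# Toy model: random dopant fluctuation spreads the threshold voltage,
# and gate delay scales (very roughly) with 1/(Vdd - Vth).
VDD = 1.0                          # assumed supply voltage (V)
VTH_MEAN, VTH_SIGMA = 0.30, 0.03   # assumed mean/spread of Vth (V)
NOMINAL_DELAY = 10.0               # assumed nominal NAND2 delay (ps)

delays = []
for _ in range(100_000):
    vth = random.gauss(VTH_MEAN, VTH_SIGMA)
    delays.append(NOMINAL_DELAY * (VDD - VTH_MEAN) / (VDD - vth))

delays.sort()
print(f"median delay : {delays[len(delays) // 2]:.2f} ps")
print(f"99.9th pct   : {delays[int(len(delays) * 0.999)]:.2f} ps")
```

The point is not the numbers: it is that 'the delay of a NAND2' becomes a distribution with a tail, and the tail is what your critical path meets.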

Fortunately we have Manufacturing Test, to weed out all those devices that contain defects and marginalities, along with all the ones that just don't work!

Well of course we don't actually test all the 'functional vectors' we simulated, and though we use ATPG and Scan, there is not enough time to apply full combinational patterns. This is an acceptable risk as long as defects are 'large' enough that their effect will be detected in the wider circuit operation. But it is a statistical gamble, and whilst it was justifiable a few years ago when the number of features on a chip was in the millions, will it still be OK when we hit the billions of the next few years?
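One classic way to price that gamble is the Williams-Brown defect-level model, DL = 1 - Y^(1-T), relating test escapes to process yield Y and fault coverage T. A quick sketch (the yield and coverage figures are purely illustrative):

```python
# Williams-Brown model: defect level (test escapes) as a function of
# process yield Y and fault coverage T:  DL = 1 - Y**(1 - T)
def defect_level_ppm(yield_Y, coverage_T):
    return (1.0 - yield_Y ** (1.0 - coverage_T)) * 1e6

# Illustrative figures: 80% yield, 95% vs 99.9% stuck-at coverage.
for T in (0.95, 0.999):
    print(f"coverage {T:.3f}: ~{defect_level_ppm(0.80, T):.0f} ppm escapes")
```

Even 99.9% stuck-at coverage at 80% yield leaves a couple of hundred ppm of defective parts shipping — a long way from ppb.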

Well at least silicon is reliable; there is no physical movement so there is nothing to wear out! What it is, is what it stays!

Well, it is well known that electrons bounce off atoms and get stuck in gate insulation layers, altering the threshold voltage and hence the speed of the gate. And some atoms, like aluminium, are not as firmly located as you might think: some bumble down the track under the flux of electrons bombarding them; others creep slowly along potential gradients to create novel circuit configurations. Then there are the high-energy particles streaming through the galaxy and all the chips in it ... And the alpha particles arising from the package itself. And of course the smaller geometry processes are increasingly susceptible to all of this. It may be working when we ship it, but what is the probability it will stay working?
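For the 'atoms bumbling down the track' part (electromigration), Black's equation is the textbook lifetime estimate, MTTF = A * J^(-n) * exp(Ea/kT). In this sketch the exponent, activation energy, and scaling factors are typical textbook-style assumptions, and only the ratio means anything:

```python
import math

# Black's equation for electromigration lifetime:
#   MTTF = A * J**(-n) * exp(Ea / (k * T))
K_BOLTZMANN = 8.617e-5      # eV/K

def mttf(j, t_kelvin, a=1.0, n=2.0, ea=0.7):
    # n ~ 2 and Ea ~ 0.7 eV are typical textbook values (assumptions)
    return a * j ** (-n) * math.exp(ea / (K_BOLTZMANN * t_kelvin))

# Relative lifetime if a shrink raises current density 1.5x and the
# die runs 10 K hotter -- absolute numbers are meaningless here.
ratio = mttf(1.5, 368.0) / mttf(1.0, 358.0)
print(f"relative MTTF after shrink: {ratio:.2f}x")   # ~0.24x
```

A modest shrink plus a slightly hotter die, and the interconnect lifetime drops to roughly a quarter.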

Mathematics predicts that to achieve a useful yield of even 100M-transistor chips calls for a defect density of the order of 1 ppb through Design, Manufacture and Test! As even the most optimistic plans for parts of the process only target single-figure ppms (indeed "zero defects" is usually defined as less than 1 ppm), we must be a very long way above this today!
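The arithmetic behind that claim, using a simple Poisson yield model Y = exp(-N * p) (the model choice is mine, but the conclusion is robust to it):

```python
import math

# Simple Poisson yield model: Y = exp(-N * p), where N is the number
# of features and p the per-feature defect probability.
N = 100_000_000             # 100M features per chip

for p_ppb in (1, 10, 100):
    p = p_ppb * 1e-9
    print(f"defect rate {p_ppb:>3} ppb -> yield {math.exp(-N * p):.1%}")
```

At 1 ppb per feature, a 100M-feature chip yields ~90%; at anything approaching ppm rates, the yield collapses to effectively zero.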

So let's face it, we just can't: Design it right; Make it right; Test it right; or Keep it right.

... Time out here! If this were true, we should not be able to make the complex embedded systems that we so obviously do today!

It seems systems exhibit a natural robustness to defects which, when combined with the innate inequality between defects, delivers working systems ... most of the time. It is a classic Tyranny of Numbers! But it is also a wake-up call, as some cracks are showing.

We may have lived a charmed life so far; but as we progress to smaller geometries and more complex embedded systems, it does not seem sensible to rely on that charm as a strategy! We need to review our methodologies to recognise that we create copious defects along with everything else we do.

... Well, that or increase the personnel assigned to kissing frogs and polishing lamps.