Paths To Knowledge (dot Science)

What is actually real in Objective Reality? How do you know? Now, prove it's real!

Archive for July 31st, 2009

Larrabee

Posted by pwl on July 31, 2009

Larrabee is a super multi-core Pentium processor design upgraded to 64 bits x86-64 with a new SIMD set of registers maxed out on steroids with ~250 new instructions! Sorta like SSE on steroids!

“LRBni adds two sorts of registers to the x86 architectural state. There are 32 new 512-bit vector registers, v0-v31, and 8 new 16-bit vector mask registers, k0-k7. While some core resources such as caches are shared by the core threads, that is not the case for registers; each thread has a full complement of vector and vector mask registers.”

WOW!

“LRBni vector instructions are either 16-wide or 8-wide, so a vector register can be operated on by a single LRBni instruction as 16 float32s, 16 int32s, 8 float64s, or 8 int64s, with all elements operated on in parallel. LRBni vector instructions are also ternary; that is, they involve three vector registers, of which typically two are inputs and the third the output. This eliminates the need for most move instructions; such instructions are not a significant burden on out-of-order cores, which can schedule them in parallel with other work, but they would slow Larrabee’s in-order pipeline considerably.”

Some excellent must read papers.

A First Look at the Larrabee
New Instructions (LRBni), pdf paper, by Michael Abrash

Rasterization on Larrabee Adaptive rasterization helps boost efficiency, pdf paper, by Michael Abrash

Intel® Architecture Code-Named Larrabee, pdf slides, by Dr. Thomas Willhalm

GDC Session Overview: SIMD Programming on Larrabee

This video is a brief overview of Tom Forsyth’s session from GDC about Larrabee. Larrabee is Intel’s revolutionary approach to take the current evolving programmability of the GPGPU to its logical end. The Larrabee architecture features many cores and threads, as well as a new vector instruction-set extension, the Larrabee new instructions (LRBni). This talk follows Michael Abrash’s first glimpse into LRBni and examines the programming methods and hardware instructions that help programmers get the most out of LRBni’s extremely wide vector units. Starting with simple math examples that are fairly simple to vectorize, it moves through loops, conditionals, and more complex flow control, showing how to implement these algorithms in LRBni. Next, the numerous choices of data format are examined – when to use SOA or AOS (and what those terms mean!), and how to use gather/scatter most efficiently from the same data structures used in an existing engine. Finally, there is a quick look at efficient code scheduling and how to use the multiple hardware threads to help absorb instruction latencies.

Thoughts On How Larrabee Will Change Game Development

As an Intel Application Engineer, Doug McNabb enables game developers to develop for Larabee. Doug describes his background as a game developer and his thoughts on how Larrabee will change game development.

Posted in Intel Larrabee | Leave a Comment »

When otherwise good math goes wrong especially in parallel or massively parallel programs

Posted by pwl on July 31, 2009

Sometimes we forget that not all numbers are the same. This becomes very apparent in dealing with floating point numbers in parallel computing.Floating numbers are not associative or distributive. … The more cores programmers run their parallelized code on, the more ways operations can be interleaved and the more challenges programmers face.” – Tim Mattson and Ken Strandberg, Intel

As if it’s climate science is not bad enough with intentionally corrupt or incompetently done statistics it turns out that climate models may be based upon computer programs with serious math flaws: the limits of the floating point and double precision floating point data types can produce incorrect results since “Floating Point Numbers Aren’t Real Numbers!” they are data types with limited precision. It gets even worse than that, when supposedly good programs are transformed into massively parallel programs with N threads of execution the results can vary with the number of threads chosen to run the program! Of course in climate science N can be 2 or 4 threads on a single multi-core machine but it can also be 1,000+ using GPGPUs or server compute farms.

Have the climate model programs been vetted to ensure mathematical accuracy? Is there a set of test cases that validate it after new changes have been made to the climate models? Do the test cases cover all the calculations in the climate model software? How do we know the answers are even accurate mathematically? (Of course that’s not even asking how do we know the model is relevant but this inquiry is not into relevancy it’s into accuracy of the calculations, whatever they happen to be, in climate models).

The more cores programmers run their parallelized code on, the more ways operations can be interleaved and the more challenges programmers face. Parallel programmers must deal with a host of issues peculiar to parallel programs such as synchronization, protecting shared variables, and finding thread safe versions of common math routines (such as random number generation). One of the most subtle problems faced by the parallel programmer, however, arises from the properties of floating point numbers. Floating point numbers are the same in serial and parallel computations, of course, but when a program executes in parallel, special features of these numbers are more likely to impact your results.” – Tim Mattson and Ken Strandberg, Intel, in “Parallelization and Floating Point Numbers

One aspect of models in science and engineering that involve calculations using the “floating point number format” is that Floating Point numbers are NOT REAL NUMBERS they are limited precision approximations of Real Numbers and as such they have their limits often caused by rounding which results in Floating Point numbers not being associative, in other words the order matters!!! What happens in science and engineering calculations on computers using Float 32 or Double floats (64 bits) especially when scaling massive numbers of computations to multiple threads on your multiple cores or on thousands of processor nodes in super computers or on GPGPU (general purpose graphics processing units) is that you get the wrong answers due to the miss use of these floating point data types.

You can easily generate numbers that don’t fit into the floating point format, and thus you produce answers from the basic arithmetic operations that don’t fit into a floating point format. In other words, the floating point numbers when operated on by the basic arithmetic operations do not constitute a closed set.

The impact of this is significant. Floating numbers are not associative or distributive. So,

A * (C * B) ≠ (A * C) * B and

A * (B + C) ≠ A * B + A * C

[Obviously the sentence is missing something here, most likely the two equations do not produce the same answers! -pwl]

This means that as you change the order of a long sequence of arithmetic operations, you can generate different answers. Mathematically with real numbers, the answers can’t depend on the order of the operations (for commutative operations) or the way they are grouped together (associatively). But with floating point numbers, if you interleave the operations in different ways, you get different results.

Here’s a good test to demonstrate the implications of this behavior by floating point numbers:

1. Fill 2 arrays each with 10000 random values between 0.0 and 1.0.

2. Shift one up by 100 and shift the other down by 0.001.

3. Mix the arrays together, sum them, and subtract a large number (500000).

Here are the results run on 1, 2, and 4 threads.

1 thread computes 170.968750

2 threads computes 171.968750

4 threads computes 172.750000

Which one of these numbers is correct, the 1-thread, 2-thread, or 4-thread value? Are any of these the true value? Would you consider that with 4 threads, the answer is correct and the others wrong? Or with 1 thread?

This is not a trick question, nor is its goal to make programmers look silly. Developers are smart people. But, many programmers steeped in sequential programming for so many years make the assumption that there is only one right answer for their algorithm. After all, their code has always delivered the same answer every time it was run. When you consider the above example, however, all the answers are equally correct. To pick one arbitrarily and call it right and the others wrong is completely unjustified.

Wait there is more! This is quite shocking isn’t it? What you were taught in math class isn’t the way that computers do math! Yikes. Most computer scientists are not aware of this problem as most never encounter it in their careers, or don’t know that it’s a problem that is happening right under their noses. Scary.

By mixing the numbers as this example does, it creates a pathological situation designed to maximize problems due to round off error. The test mixes very large and very small numbers together. The arithmetic unit aligns the numbers before adding them, which, given the large difference in their absolute magnitudes, all but guarantees that we’ll loose bits of precision in the process.

As the number of threads changes, the combinations of numbers being added also changes. With all the roundoff errors, as the way these numbers are combined changes, the way roundoff error is accumulated also keeps changing. Thus, the answers change.

So which answer is correct? The algorithm for adding them together is unstable. If you carefully add the numbers together so large numbers add with large numbers and small numbers add with small numbers, and then add the “large_set_sum” to the “small_set_sum”, you get a numerically stable result. The answer in this case is 177.750. Note that the test answers in every case considerably vary from the stable method of obtaining the answer.

Note also that, with a serial algorithm, you’d never know there was a problem. Only as the thread count grows and the answers change, does the instability of the algorithm become obvious. It’s apparent the problem is not in the compiler or even the program. The problem is with the numerical instability of the algorithm. And it’s only revealed by going to multiple threads.

The video with Tim Mattson explains it well.

Are you getting different numbers for calculations as you vary the number of threads? Your algorithm may be incorrect. Real numbers are nice; floating numbers are not nice. Floating numbers are not a closed set, the overflow bits need to be rounded and that can change the results of a calculation. Tim discusses how to work with floating point numbers and resources available to help in the Intel Math Kernel Libraries.

In the engineering applications that I’ve worked on for civil engineering of bridges we found that Double Precision Floating Point Numbers at 64 bits was simply not enough accuracy. We were able to use 80bit Extended Double Precision Floating Point numbers supported by the 8087 math coprocessor in the Intel line of chips. Even though the extended precision covered most of the cases that Double Precision didn’t there were a few cases where we had to adjust the order of our computations to ensure that we didn’t overflow the precision limits of the computation of the extended 80bit math! As you can imagine errors in bridge calculations are rather critical to life and limb.

The same is true in the Climate Models, lives and treasure both depend on correct math. The science fails when the math is wrong. Have they been vetted for numeric accuracy? How do we know that? Have test cases been written that test these limits in the climate model programs?


The same applies to random numbers in computers, they are not real random numbers either. As Tim Mattson says “We now know that god does play dice but that computers can’t!” (paraphrased). Random numbers are important in climate models since the climate is in inherently a system with randomness being generated from within. See Wolfram’s A New Kind of Science.

Computers cannot make truly random numbers. For statistical algorithms requiring random numbers, developers need to be careful in parallel code to avoid overlapping sequences from random number generators. Tim discusses different methods to use random number generators – including using independent generators for each thread and the “leap frog method” – to produce “pseudo random numbers” for statistical algorithms that work in parallel code.


Very interesting and important topic to any system that depends upon parallel computations being correct to protect limb, life and treasure.

Thanks to Intel, Tim Mattson and Ken Strandberg for this important information.

Posted in Awesome, Bad Ideas, Bad Science, Climate Science, Complex Systems, Damn it!, Debunking Bad Environmentalism, Ethics in Science, Get some perspective people, Hard Science, Hard Science Required, Holding those who run the world responsible for their crimes against humanity, Proofs, Rational Thinking, Reality Based Environmentalism, Science Education, Science over Propaganada, Science Shows | 1 Comment »

 
%d bloggers like this: