Itanium2 was designed by HP and Intel as a new kind of computing paradigm. Instead of CISC / RISC, this time they moved to EPIC (ironically, RISC is an invention of HP Labs, and there is a lot of material from HP on how one can optimize RISC-based applications, including their Dynamo framework). In a nutshell, EPIC moves the onus of scheduling instructions from the hardware to the application (the compiler generating the code). Traditionally, the microprocessor and the associated hardware had the logic hardwired to do instruction prefetching / data prefetching / many of the techniques to reduce the number of cache misses and instruction stalls. This has some pros and cons, and I guess for HP and Intel, the cons outweighed the pros. The pro of hardwiring the (say) prefetch logic is the amount of throughput one can gain. The con, of course, is that prefetch logic tuned for one program structure may not be the best for some other application. Moving this logic to the compiler might help the backend generate better code, given that it has information about program flow. How good or bad this is as an architecture decision can be argued both ways. Having seen an enterprise-class application being moved to Itanium2, I can say that it is not a bad choice. Even though people complain about the increased size of the binaries, with a not-so-small amount of infrastructure work you can get a decent performance boost.
Itanium2 is a multi-issue (specifically, two instruction bundles) machine. This means that in one clock cycle, two instruction bundles can be issued. A bundle is 128 bits and comprises 3 instructions. So, in one clock cycle, approximately 6 instructions can be executed [in case you are wondering how, each instruction in the IA architecture is 41 bits, and each bundle contains 3*41 = 123 bits plus 5 template bits used for decoding the instructions]. The latest Itanium2 processors contain three caches: L1, L2 and L3.
The L1 cache is again split into an instruction cache – ICache, and a data cache – DCache. The size of the L1 cache (instruction and data each) is 16KB, and it is four-way associative with 64-byte lines. So the L1 has 64 sets (divisions) within it that can be used. Also, the L1 has a throughput of 2 bundles/cycle, which is the fastest among all the caches (understandably so). The interesting feature of the L1 data cache is that it stores only integer data. Any floating point data in the IA2 is served from the L2 cache. This is an important thing to remember if you are trying to tune numerically intensive applications on the IA.
The L2 cache is a 256KB unified cache for both instructions and data. As mentioned above, the L2 is where the FP data is kept, so it has a larger line size of 128 bytes. The L2 cache is eight-way associative, and so it has 256 sets (divisions) within it.
The L3 cache size varies from system to system; it is usually in the range of 1.5-3MB. This too, like the L2, is a unified cache. It is 12-way associative and has a 128-byte line size (for the higher end). That means the higher-end IA's L3 has 2048 sets (divisions) in it. The L3 can support a data transfer rate of 32 bytes/cycle to the L2.
| | L1 | L2 | L3 |
|---|---|---|---|
| Line size | 64 bytes | 128 bytes | 128 bytes |
| Size / associativity | 16KB, 4-way associative | 256KB, 8-way associative | 1.5-3MB, 12-way associative |
| Latency | 1 cycle | 5-6 cycles | 12-13 cycles |
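In case the "divisions" (sets) numbers above look arbitrary, they fall out of a simple formula: sets = cache size / (line size × associativity). A quick sketch in C, just to make the arithmetic concrete using the figures from the table:

```c
#include <stdio.h>

/* Number of sets in an n-way set-associative cache:
 * sets = total size / (line size * ways). */
static unsigned sets(unsigned size_bytes, unsigned line_bytes, unsigned ways)
{
    return size_bytes / (line_bytes * ways);
}

int main(void)
{
    printf("L1: %u sets\n", sets(16 * 1024, 64, 4));          /* 64   */
    printf("L2: %u sets\n", sets(256 * 1024, 128, 8));        /* 256  */
    printf("L3: %u sets\n", sets(3 * 1024 * 1024, 128, 12));  /* 2048 */
    return 0;
}
```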
So, if you are trying to optimize the execution of an application on IA, it's obvious that the code and data should have the fewest possible cache misses. Here is where EPIC can help a lot. Given that two instruction bundles can be issued at the same time, both bundles could be within the L1, thereby avoiding any misses. This allocation of instructions within the bundles is done by the compiler, so it is possible to do it judiciously. Another standard optimization technique is to make the inner loop fit into the L1 / L2 caches, so that stalls during instruction prefetching can be avoided. And if the inner loop is iterated a large number of times, after the first iteration no more misses will be recorded. Practically speaking, for business applications which have large loops, this might not serve well. For scientific applications, this might be possible. Also, loop unrolling (at higher optimization levels) might create a loop body which spills past the cache boundaries. This, of course, you can observe with any of the performance monitoring tools (for example, Caliper on HP-UX).
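To make the "fit the loop's working set into cache" idea concrete, here is a minimal sketch (plain C, nothing Itanium-specific) of classic loop blocking: the inner loops walk a tile small enough to stay resident in L1/L2, so after the first pass over a tile the subsequent accesses hit in cache. The TILE value below is a hypothetical number you would tune with a tool like Caliper.

```c
#include <stddef.h>

#define TILE 64  /* hypothetical tile edge; tune so a TILE x TILE block stays cache-resident */

/* Transpose src (n x n) into dst, walking the matrices one TILE x TILE
 * block at a time so the working set of the inner loops stays in cache. */
void transpose_blocked(double *dst, const double *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t imax = (ii + TILE < n) ? ii + TILE : n;
            size_t jmax = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The unblocked version of this loop touches a full row of src and a full column of dst on every outer iteration, which for large n keeps evicting lines it will need again shortly; the blocked version revisits the same small tile before moving on.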
As noted above, since the L1 data cache can only contain integers, if your data structure has a combination of integer and FP data, it is suggested that you group them separately, i.e. put the integers together and the FP data together. That way, integer data can be accessed via the L1, and if and when FP data is accessed, the L2 can be used. This again becomes difficult if you have a large application, where a data layout change means recompiling all the clients – which is not always a possible solution. But if you are developing afresh on Itanium2, this is something you can keep in mind.
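As a rough illustration (the structure and field names here are made up, not from any real application), the grouping might look like this in C:

```c
/* Before: integer and FP fields interleaved. A cache line pulled in for
 * the integer fields also carries FP data that the L1 DCache cannot serve. */
struct order_mixed {
    int    id;
    double price;      /* FP data, served from L2 on Itanium2 */
    int    quantity;
    double discount;   /* FP data */
    int    status;
};

/* After: integers grouped together and FP data grouped together, so
 * integer-only code paths work out of lines the L1 DCache can hold,
 * while FP-heavy code paths stream their lines through the L2. */
struct order_grouped {
    int    id;
    int    quantity;
    int    status;
    double price;
    double discount;
};
```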
As with any n-way associative cache structure, minimizing the data stride, i.e. the distance between two datums, is important. If too many data elements map to the same cache set, accesses start evicting one another once the associativity is exceeded, resulting in conflict misses and wasted cycles. Keeping the data tight (contiguous, small stride) ensures that accesses stay within a few cache lines instead of competing for the same set.
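A minimal sketch of the effect (the sizes are hypothetical, not tuned to any particular Itanium2 part): summing a column of a power-of-two-sized matrix walks memory with one large, constant stride, so successive elements keep landing in the same handful of cache sets; summing a row walks consecutive lines instead.

```c
#include <stddef.h>

#define N 4096  /* power-of-two dimension: column walks revisit the same sets */

/* Strided access: a[0][j], a[1][j], ... are N*sizeof(double) bytes apart,
 * so consecutive accesses tend to map to the same cache set and conflict. */
double sum_column(double a[N][N], size_t j)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a[i][j];
    return s;
}

/* Unit-stride access: consecutive elements share cache lines,
 * so most accesses after the first one in each line are hits. */
double sum_row(double a[N][N], size_t i)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        s += a[i][j];
    return s;
}
```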
While you are at it, it might be a good idea to read about hardware pipelining and superscalar execution. These two links at Ars – pipelining explained, and more pipelining and superscalar execution – are excellent places to understand this, especially if you need to get the idea of the microprocessor frontend and backend (and you thought only applications had frontends and backends :P)
Stay tuned, as I expect to write more about Itanium2 and its microarchitecture in the coming few days. There are enough formal docs available, both from HP and Intel, so I will try not to repeat what is already available.