Itanium2 and performance tuning

Itanium2’s EPIC architecture and how it helps in performance tuning

Itanium2 was designed by HP and Intel as a new kind of computing paradigm. Instead of CISC / RISC, this time they moved to EPIC (ironically, RISC is the invention of HP Labs, and there is a lot of material from HP on how one can optimize RISC-based applications, including their Dynamo framework). In a nutshell, EPIC moves the onus of scheduling instructions from the hardware to the application (the compiler generating the code). Traditionally, the microprocessor and the associated hardware had the logic hardwired to do instruction prefetching, data prefetching and many of the other techniques that reduce cache misses and instruction stalls. This has pros and cons, and I guess for HP and Intel the cons outweighed the pros. The pro of hardwiring (say) the prefetch logic is the throughput one can gain. The con, of course, is that the prefetch logic that suits one program structure may not be the best for some other application. Moving this logic to the compiler can help the compiler backend generate better code, given that it has information about program flow. How good or bad this is as an architecture decision can be argued both ways. Having seen an enterprise-class application being moved to Itanium2, I can say that it is not a bad choice. Even though people complain about the increased size of the binaries, with a not-so-low amount of infrastructure you can get a decent performance boost.

Itanium2 is a multi-issue machine (specifically, two instruction bundles). This means that in one clock cycle, two instruction bundles can be executed. A bundle is 128 bits and comprises 3 instructions. So, in one clock cycle, up to 6 instructions can be executed [in case you are wondering how, each instruction in the IA architecture is 41 bits, and each bundle contains 3*41 = 123 bits plus 5 template bits used to decode the instructions].
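If the 3*41 + 5 arithmetic sounds abstract, here is a rough C sketch of how a 128-bit bundle splits into a 5-bit template and three 41-bit slots (the bit positions follow the IA-64 bundle encoding as I understand it – take it as an illustration, not a decoder):

```c
#include <stdint.h>
#include <stdio.h>

/* One 128-bit IA-64 bundle held as two 64-bit words:
   bits 0..4 = template, bits 5..45 = slot 0,
   bits 46..86 = slot 1, bits 87..127 = slot 2. */
typedef struct {
    uint64_t lo;   /* bundle bits  0..63  */
    uint64_t hi;   /* bundle bits 64..127 */
} ia64_bundle;

#define SLOT_MASK 0x1ffffffffffULL   /* 41 bits */

static unsigned bundle_template(const ia64_bundle *b) {
    return (unsigned)(b->lo & 0x1f);
}

static uint64_t bundle_slot0(const ia64_bundle *b) {
    return (b->lo >> 5) & SLOT_MASK;
}

static uint64_t bundle_slot1(const ia64_bundle *b) {
    /* slot 1 straddles the two words: 18 bits from lo, 23 bits from hi */
    return ((b->lo >> 46) | (b->hi << 18)) & SLOT_MASK;
}

static uint64_t bundle_slot2(const ia64_bundle *b) {
    return (b->hi >> 23) & SLOT_MASK;
}

int main(void) {
    ia64_bundle b = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };  /* dummy bits */
    printf("template %u, slots %llx %llx %llx\n", bundle_template(&b),
           (unsigned long long)bundle_slot0(&b),
           (unsigned long long)bundle_slot1(&b),
           (unsigned long long)bundle_slot2(&b));
    return 0;
}
```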
The latest Itanium2 processors contain three caches: L1, L2 and L3.
The L1 cache is split into an instruction cache (ICache) and a data cache (DCache). Each of them is 16KB, four-way associative with 64-byte lines. So the L1 has 64 sets (divisions) within it which can be used. Also, the L1 has a throughput of 2 bundles/cycle, which is the fastest among all the caches (understandably so). The interesting feature of the L1 data cache is that it stores only integer data. Any floating point data in the IA2 is stored in the L2 cache. This is an important thing to remember if you are trying to tune numerically intensive applications on the IA.

The L2 cache is a 256KB unified cache for both instructions and data. As mentioned above, the L2 is where the FP data is stored, so it has a larger line size of 128 bytes. The L2 cache is eight-way associative, and so it has 256 sets within it.
The L3 cache size varies from system to system; it is usually in the range of 1.5-3MB. This too, like the L2, is a unified cache. It is 12-way associative and has a 128-byte line size (for the higher end). That means the higher-end IA's L3 has 2048 sets in it. The L3 can support a data transfer rate of 32 bytes/cycle to the L2.
In summary:

        Line size    Size and associativity          Load latency
  L1    64 bytes     16KB, 4-way associative         1 cycle
  L2    128 bytes    256KB, 8-way associative        5-6 cycles
  L3    128 bytes    1.5-3MB, 12-way associative     12-13 cycles
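In case you want to see where those set counts come from, the arithmetic is simply size / (associativity * line size). A throwaway C sketch, using the numbers quoted above:

```c
#include <stdio.h>

/* sets = cache size / (ways * line size) */
static unsigned sets(unsigned size_bytes, unsigned ways, unsigned line_bytes) {
    return size_bytes / (ways * line_bytes);
}

int main(void) {
    printf("L1: %u sets\n", sets(16 * 1024, 4, 64));           /* 64   */
    printf("L2: %u sets\n", sets(256 * 1024, 8, 128));         /* 256  */
    printf("L3: %u sets\n", sets(3 * 1024 * 1024, 12, 128));   /* 2048 */
    return 0;
}
```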

So, if you are trying to optimize the execution of an application on IA, it's obvious that the code and data should incur the least number of cache misses. Here is where EPIC can help a lot. Given that two instruction bundles can be issued at the same time, both bundles could be within L1, thereby avoiding any misses. This allocation of instructions within the bundle is done by the compiler, so it is possible to do it judiciously. Another standard optimization technique is to make the inner loop fit into the L1 / L2 caches, so that stalls during instruction prefetching can be avoided. And if the inner loop iterates a large number of times, after the first iteration no more misses will be recorded. Practically speaking, for business applications which have large loops, this might not work out; for scientific applications, it might be possible. Also, loop unrolling (at higher optimization levels) might create a loop which spills the cache boundaries. This, of course, you can observe with any of the performance monitoring tools (for example, Caliper on HP-UX).
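As a rough sketch of keeping the inner loop small (the function and array names here are made up, and whether the split pays off depends on the actual loop bodies), this is what splitting one big loop into two smaller ones looks like:

```c
/* process_fused: one loop with two unrelated chunks of work in its body.
   If the combined body (after unrolling etc.) outgrows the I-cache, the
   front end keeps refetching it every iteration. */
void process_fused(int n, double *a, double *b) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * 2.0 + 1.0;   /* stand-in for "phase 1" work */
        b[i] = b[i] / 3.0 - 1.0;   /* stand-in for "phase 2" work */
    }
}

/* process_split: the same work as two small loops (loop fission), each of
   which has a much better chance of staying resident in the I-cache. */
void process_split(int n, double *a, double *b) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.0 + 1.0;
    for (int i = 0; i < n; i++)
        b[i] = b[i] / 3.0 - 1.0;
}
```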
As noted above, since the L1 data cache can only contain integers, if your data structure has a mix of integer and FP data, it is suggested that you group them separately, i.e. put the integers together and the FP data together. That way, integer data can be accessed via the L1, and if and when FP data is accessed, the L2 can be used. This again becomes difficult if you have a large application, where a data layout change means recompilation of all the clients – which is not always a possible solution. But, if you are developing afresh on Itanium2, this is something you can keep in mind.
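Here is a minimal sketch of that layout suggestion, with made-up field names – the first struct interleaves integer and FP fields, the second groups them so the integers can come from the L1 data cache and the FP values from the L2:

```c
/* Interleaved: the int and double fields share cache lines, and on IA2 the
   doubles come from L2 anyway, so the ints pay for their company.
   (The interleaving also wastes space as padding.) */
struct record_mixed {
    int    id;
    double x;
    int    flags;
    double y;
};

/* Grouped: the integer fields sit together (candidates for the L1 DCache)
   and the FP fields sit together (served from the L2). */
struct record_grouped {
    int    id;
    int    flags;
    double x;
    double y;
};
```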
As with any n-way associative cache structure, minimizing the data stride, i.e. the distance between two datums, is important. If too many data elements map to the same cache set, they evict one another, resulting in cache misses, and more cycles are wasted. Keeping the data tight that way helps ensure that there are few cache misses.
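For illustration only (the array size and stride below are arbitrary, not measured on any particular machine), this is what a unit-stride walk looks like next to a large power-of-two stride over the same data:

```c
#define N (1 << 20)          /* 1M doubles, ~8MB: bigger than any of the caches */
static double a[N];

/* Unit stride: consecutive elements share 128-byte lines, so most
   accesses hit in cache. */
double sum_unit_stride(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Large power-of-two stride: successive accesses fall into the same few
   cache sets, so lines keep evicting each other even though the total
   amount of data touched is identical. */
double sum_big_stride(void) {
    double s = 0.0;
    for (int j = 0; j < 512; j++)
        for (int i = j; i < N; i += 512)
            s += a[i];
    return s;
}
```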
While you are at it, it might be a good idea to read about hardware pipelining and superscalar execution. These two links at Ars – pipelining explained, and more pipelining and superscalar execution – are excellent places to understand this, especially if you need to get the idea of a microprocessor frontend and backend (and you thought only applications had frontends and backends :P)
Stay tuned, as I expect to write more about Itanium2 and its microarchitecture in the coming few days. There are enough formal docs available, both from HP and Intel, so I will try not to repeat what is already available.

Is the fuel lobby in India so strong?

India, non-fossil fuels, ethanol

In the global economy scene now, the emphasis on the developing economies needn't be overstated. The press coverage by all the major media channels is a clear giveaway. The next economies to watch are the BRIC – Brazil, Russia, India, China. With each of these countries (I am not sure about Russia) showing a ~8% growth rate, the interest of the developed world is well justified. After all, a market is needed for the products being developed. And the global giants needn't be just traditional industries; IBM's interest is obvious too. What is interesting from IBM's insight is this:

"Brazil also has a history of putting its agricultural output to creative use. During the oil crisis of the 1970s, it used its massive sugarcane resources to create an alternative fuel: ethanol. By the mid-80s almost every new car sold in Brazil ran on it, and today, ethanol comprises 40 percent of the nation's fuel." (Brazil energy: Homegrown fuel supply helps Brazil breathe easy, Los Angeles Times, 4 October 2005)

Alternate fuels are a major R&D investment for most of the global oil producers. And most of the major oil producers are from the developed nations (I am not counting Gazprom in it). Coming to India, the prices of fuel have increased at a steady rate in the last 6 months. A litre of petrol now costs Rs.55 in Bangalore (it is different in different cities). And the high-octane version, like BP's Speed, costs Rs.56. This is up from ~Rs.49 last year. With the increased reliance of India on fossil fuels, the price of fuel has large economic repercussions. On the common man's expenditure bill, a rise of Rs.10 per litre is not small change. What I don't understand, though, is the disinterest in alternate fuels in India. India is also a major sugarcane producer, and ethanol can be produced in large quantities in this country too. Then why is the Indian government ignoring this? If they think that ethanol-driven automobiles are not feasible, ethanol can at least be used for rural energy needs. Relying on subsidies on kerosene and diesel might not be a good way to go for the agriculture-energy sector. They might as well produce the energy needed for themselves from their own produce. Maybe it is difficult to convert an existing diesel engine to an ethanol-driven engine, but it surely must be possible. Or is it that the government doesn't want to do it? Is the fuel lobby in India so strong?

Move to wordpress

So, it's official now. After the official announcement of the move of all MT blogs to WordPress, my blog has been converted. Yes, ladies and gentlemen, I am now a WordPress user. Of course, let me also add that I know zilch about WordPress – hmm, quite a damp squib ending for an announcement, isn't it 😉 Well, fret not, I need to look up what all one can do with WordPress, and I shalt do them! So, this is my first post via the WordPress UI. There is a lot of tweaking to be done, and it will be done. Hold on for the joy ride. And once again, thanks a ton to JD and all the crew @ Weblogs. Awesome folks they are!

Microsoft’s threat analysis and modeling tool

As part of their increased impetus on security, Microsoft has released the Threat Analysis and Modeling tool. This is now RTM, according to the blog post. The increased interest in threat modeling is understandable. Security is not only about making sure that the code doesn't have bad C calls (the usual strcpy and cousins), but also about understanding the data fiddling that can potentially happen due to bad design. A newer class of attacks like SQL injection can, in good proportion, be mitigated via a proper design. With the increased proliferation of web applications, and AJAX-style applications in particular, comes an increased rate of data transfer. So, every datum being passed can potentially help in the exploitation of the application. In cases like this, the application designer or, in general, the application maintainer should know the data flow. A DFD is generally used for this purpose, but the intent behind the DFD is not security. Having a data flow picture, and the corresponding use cases that highlight the potential problems, is a good thing for the designer to have.

Poems on the tube

The London Underground (also known as the Tube), amongst other things, seems to have a good literary sense. Poems on the Underground is an attempt to bring poetry to wide audiences. This is an interesting exercise in furthering literature en masse. Even though there are instances of street plays and theatre on trains, the idea of exposing poetry to the junta has never taken off in India. If such an exercise were to be undertaken in Bangalore, I think it would be a good boost to the already rich Kannada literature. It might also be a novel experiment to include poems from various Indian languages. Given the truly cosmopolitan nature of Bangalore, I think an idea like this might find a good following. Something to beat the traffic-jam blues in this city. I wonder, though, if English might be a chosen language. Will they be interested in one of my favorites? I guess not!