STANFORD, Calif. — As Intel pursues its multicore strategy, the company is looking for a little help in developing adequate benchmarks to test the forthcoming chips.
Both Intel and AMD are charging ahead into multicore technology, but Intel seems more eager to cram a few dozen cores onto a chip, while AMD is moving more slowly and making broader changes to its core design.
Justin Rattner, chief technology officer and senior fellow at Intel, was one of the keynote speakers at the 18th HOT CHIPS conference, an annual semiconductor conference being held here on the Stanford University campus.
His speech covered the hardware and software development process that will be required to support multicore chips as Intel moves beyond dual core into processors with four, eight, 16, 32 and 64 cores.
Intel has found that applications, including benchmarks, scale well to around four cores before leveling off. The company has known for a while that simply throwing cores at the problem won’t necessarily make computers run faster.
“[The code we have] is optimized for a small number of threads. Attention was paid to getting the most performance out of thread count. No one was concerned with scaling beyond that because there were no machines scaling beyond four processors,” he said.
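To illustrate the kind of limitation Rattner was describing (a minimal sketch of my own, not Intel’s benchmark code), a parallel loop with a hardcoded four worker threads stops improving past four cores, while sizing the worker count from the hardware lets the same code spread across however many cores the chip provides:

```cpp
// Hypothetical sketch, not Intel benchmark code: summing a large array,
// once with a hardcoded thread count and once scaled to the hardware.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `workers` threads, each handling a contiguous slice.
long long parallel_sum(const std::vector<int>& data, unsigned workers) {
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    size_t chunk = data.size() / workers;
    for (unsigned t = 0; t < workers; ++t) {
        size_t begin = t * chunk;
        size_t end = (t == workers - 1) ? data.size() : begin + chunk;
        pool.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}

int main() {
    std::vector<int> data(50'000'000, 1);

    // Tuned for the machines of the day: caps out at four threads
    // no matter how many cores the processor actually has.
    long long fixed = parallel_sum(data, 4);

    // Scales with the hardware: uses however many cores are present.
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    long long scaled = parallel_sum(data, cores);

    std::cout << fixed << " " << scaled << " (" << cores << " cores)\n";
}
```

The point of the contrast is the one Rattner made: code tuned for a fixed, small thread count leaves additional cores idle, so the benchmark numbers flatten out even as the hardware keeps growing.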
To that end, Rattner called up Dr. Kai Li, a professor of computer science at Princeton University, to discuss plans to create and host a repository for benchmark suites for multicore processors.
They readily admitted there isn’t much out there in the way of multicore benchmarks or frameworks, and that such a centralized site needs to be built. A hosting site will be announced shortly.
Intel is looking at chips with up to 64 cores and at how to scale up properly. Just adding cores does little on its own, as Rattner noted in his speech; with the addition of hardware threading, cache improvements and new instruction sets, Intel saw scalability and performance improve as cores were added.
Adding cores alone is not enough to see gains at scale. Motion capture, for example, requires more than 100GB/sec of memory bandwidth, well beyond what today’s PCs can deliver. “Memory bandwidth will be a key area for multicore design going forward,” said Rattner.
Rattner focused on three specific areas where multicore chips could be of particular use: recognition, mining and synthesis. In the area of recognition, he showed off four cameras capturing the movements of an individual and rendering a 3D image on the fly in the computer.
The mining demo involved searching for similar instances of data. Searches were run through databases of thousands of images, looking for pictures of a certain type, in this case blue skies and water. To drill down further, the search was narrowed to just a dolphin, and more images containing dolphins were found.
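As a rough idea of what such a mining workload looks like in code (a hypothetical sketch, not the demo’s implementation), each image in the database can be reduced to a color histogram and scored against the query independently, which is exactly the kind of per-item work that spreads naturally across many cores:

```cpp
// Hypothetical sketch of histogram-based image similarity; not the demo code.
#include <algorithm>
#include <array>
#include <cmath>
#include <utility>
#include <vector>

// A very coarse image signature: an 8-bin histogram per RGB channel.
using Histogram = std::array<float, 24>;

// Smaller distance means the two images have more similar color content.
float distance(const Histogram& a, const Histogram& b) {
    float d = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        d += diff * diff;
    }
    return std::sqrt(d);
}

// Return the indices of the `k` database images closest to the query.
std::vector<size_t> nearest(const std::vector<Histogram>& database,
                            const Histogram& query, size_t k) {
    std::vector<std::pair<float, size_t>> scored;
    scored.reserve(database.size());
    for (size_t i = 0; i < database.size(); ++i) {
        // Each image is scored independently, so this loop is the part
        // that parallelizes cleanly across however many cores exist.
        scored.emplace_back(distance(database[i], query), i);
    }
    size_t keep = std::min(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + keep, scored.end());
    std::vector<size_t> result;
    for (size_t i = 0; i < keep; ++i)
        result.push_back(scored[i].second);
    return result;
}
```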
The last demo was fluid-flow synthesis and ray tracing, something not yet possible in real time on today’s PCs. It showed the dynamics of water splashing around while being rendered in real time, with all 16 cores of the prototype CPU at full utilization to do the rendering.
The afternoon brought a speech from Pat Conway, a senior member of the AMD technical staff, who discussed the Opteron northbridge architecture. The traditional x86 architecture, he pointed out, has an inherent bottleneck: all memory traffic is forced through a memory controller that sits in an external chip.
No matter how many cores were added to the processor, they had to go through that external chip to reach memory. The Opteron’s northbridge, including the memory controller, is integrated onto the CPU, so the chip goes straight to memory.
“Because each CPU is a memory controller, there is inherent scaling of memory as you add nodes,” said Conway.
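To make that scaling concrete (a sketch of my own using the Linux libnuma library, not anything AMD presented), software on a multi-socket Opteron system can allocate each thread’s working set from the memory attached to its local node, so every request is served by that node’s own controller instead of crossing a HyperTransport link:

```cpp
// Hypothetical sketch using Linux libnuma (compile with -lnuma); not AMD code.
// Each node owns a memory controller, so allocating a working set on the
// local node keeps its traffic off the HyperTransport links.
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;       // number of memory nodes (sockets)
    std::printf("memory nodes: %d\n", nodes);

    const size_t size = 64 * 1024 * 1024;  // 64MB working set per node
    for (int node = 0; node < nodes; ++node) {
        // Allocate the buffer from the memory attached to this node's
        // controller, so a thread pinned to that node reads it locally.
        void* buf = numa_alloc_onnode(size, node);
        if (!buf) continue;
        std::memset(buf, 0, size);          // touch pages so they are placed
        numa_free(buf, size);
    }
    return 0;
}
```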
The previous generation of Opterons used three HyperTransport links for up to 2GB/sec of throughput from the CPU to memory and had an L2 cache of 2MB.
The next generation of Opteron will change things a bit. Future Opterons will have four HyperTransport ports, each running 2.5 times faster than the previous generation’s, for up to 10GB/sec of throughput.
The future Opteron chips will come with three levels of cache instead of two: a 64KB L1 for the most critical data, a 512KB L2 dedicated to each core, and a 2MB L3 optimized for multicore use.
The HOT CHIPS 18 conference concludes tomorrow.