
Rocket ("1 GHz" cycle accurate emulator)

Christopher Celio edited this page Feb 28, 2014 · 2 revisions

The following results were gathered by running the Rocket C++ cycle-accurate emulator coupled with the DRAMSim2 memory model. Timing results assume a 1 GHz clock, which is actually slower than what taped-out chips of Rocket have achieved.

Cache Sizes and Access Latencies

The following graph shows the output from the "caches" micro-benchmark, providing insight into Rocket's cache sizes and relative access latencies.

[Figure: Rocket cache sizes]

  • blue (unit-stride)
  • green (cache-line stride)
  • red (random stride)
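
Inner loop of the "caches" benchmark:

for (int k = 0; k < num_iterations; k++)
{
   idx = array[idx];   /* each load's address depends on the previous load's result */
}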

Each execution of the "caches" benchmark performs a pointer chase over an array of a fixed size. Each point in the graph above is one execution: the x-axis gives the size of the array and the y-axis gives the average time per iteration (ideally, the time to perform one load).

As the array size grows, the pointer chase slows down because the array no longer fits in the caches closest to the core and must be served from levels farther down the memory hierarchy (L2, then DRAM).
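
The setup around the pointer-chase loop is not shown on this page. As a minimal sketch, assuming a `size_t` index array and hypothetical helper names (`init_strided`, `init_random_cycle`, and `chase` are illustrative, not taken from the actual benchmark source), the three access patterns could be built like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link element i to element (i + stride) mod n.
 * stride = 1 gives the unit-stride pattern; stride = line_bytes / sizeof(size_t)
 * gives one access per cache line (the line size is an assumption). */
void init_strided(size_t *array, size_t n, size_t stride)
{
    assert(n > 0 && stride > 0);
    for (size_t i = 0; i < n; i++)
        array[i] = (i + stride) % n;
}

/* Hypothetical helper: build a random permutation that forms a single cycle
 * through all n elements (Sattolo's algorithm), so the chase cannot get
 * trapped in a short loop that stays cache-resident. */
void init_random_cycle(size_t *array, size_t n, unsigned seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        array[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i */
        size_t tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}

/* The pointer chase itself: every load depends on the previous one.
 * Returning idx keeps the compiler from discarding the loop. */
size_t chase(const size_t *array, size_t num_iterations)
{
    size_t idx = 0;
    for (size_t k = 0; k < num_iterations; k++)
        idx = array[idx];
    return idx;
}
```

Timing `chase` over a range of array sizes and dividing by `num_iterations` would give the per-iteration time plotted on the y-axis.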

Results

The graph above tells us a few things about the Rocket processor (which is running at 1 GHz, so one cycle is 1 ns):

  • Unit-stride is much faster than cache-line or random stride.
  • Rocket lacks prefetchers (random-stride and cache-line-stride performance are identical).
  • The L1 data cache is 32 kB.*
  • The L2 cache is 256 kB, with a load access latency of roughly 30 cycles.
  • The last level, DRAM, has a latency of roughly 70 cycles.

*When an in-order core runs out of the L1 cache, the 4 ns "access time" reflects the number of instructions in the inner loop more than the load access time itself. Analyzing the disassembly confirms that the pointer chase contains 4 instructions.
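
The footnote's arithmetic can be checked with a small helper. This is only a sketch: `iteration_ns` is a hypothetical name, and the figure of one cycle per instruction is an assumption for a single-issue in-order core that hits in the L1 without stalling.

```c
#include <assert.h>

/* Hypothetical helper: time for one loop iteration in nanoseconds.
 * A clock of f GHz completes f cycles per nanosecond, so
 * time = instructions * cycles_per_instruction / f. */
double iteration_ns(unsigned instructions, double cycles_per_instruction,
                    double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return instructions * cycles_per_instruction / clock_ghz;
}
```

With these assumptions, `iteration_ns(4, 1.0, 1.0)` evaluates to 4 ns, matching the L1 plateau. The same conversion turns the roughly 30-cycle L2 and 70-cycle DRAM latencies into roughly 30 ns and 70 ns at 1 GHz.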

Peakflops

The "peakflops" benchmark issues a long stream of floating-point operations to the pipeline, using operands that should fit within the register file. The Rocket emulator simulates a 1 GHz chip. Because the pipeline is single-issue with one floating-point unit, we expect to measure roughly 1 GFLOP/s, and we do:

          Add/Mul Mix   Adds Only
MFLOP/s   949.3         914.634
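
To put the measurements next to the expected peak, the calculation can be sketched as follows. The function names `peak_mflops` and `efficiency` are hypothetical, and the model (one floating-point operation issued per cycle per unit) is the simple one described above, not a detailed pipeline model.

```c
#include <assert.h>

/* Theoretical peak in MFLOP/s, assuming each floating-point unit
 * issues one operation per cycle. */
double peak_mflops(double clock_mhz, unsigned fp_units, unsigned flops_per_op)
{
    return clock_mhz * fp_units * flops_per_op;
}

/* Fraction of the theoretical peak that was actually measured. */
double efficiency(double measured_mflops, double peak)
{
    assert(peak > 0.0);
    return measured_mflops / peak;
}
```

For a 1000 MHz clock, one unit, and one flop per operation, the peak is 1000 MFLOP/s, so the two measurements above correspond to roughly 95% (add/mul mix) and 91% (adds only) of peak.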

Note: Rocket actually has an FMA unit, so "peakflops" would need to be modified to issue FMA operations (two flops each) to reach Rocket's full 2 GFLOP/s.
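
As a rough illustration of what an FMA-oriented kernel might look like, here is a sketch. It is not the "peakflops" source: `fma_kernel` and `fma_kernel_flops` are hypothetical names, the choice of four accumulator chains is arbitrary (the number needed depends on the FPU latency), and whether each multiply-add becomes a single FMA instruction depends on the compiler and target.

```c
#include <stddef.h>

/* Hypothetical kernel: four independent multiply-add chains, so that
 * consecutive operations do not wait on each other's results.
 * Everything lives in a handful of scalars, so operands can stay in
 * registers. */
double fma_kernel(double a, double b, size_t iterations)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    for (size_t i = 0; i < iterations; i++) {
        acc0 = acc0 * a + b;
        acc1 = acc1 * a + b;
        acc2 = acc2 * a + b;
        acc3 = acc3 * a + b;
    }
    /* Summing the chains keeps the compiler from removing them. */
    return acc0 + acc1 + acc2 + acc3;
}

/* Flops performed by fma_kernel: 4 chains x 2 flops (multiply and add)
 * per iteration. */
double fma_kernel_flops(size_t iterations)
{
    return 4.0 * 2.0 * (double)iterations;
}
```

Dividing `fma_kernel_flops(iterations)` by the measured run time would give a rate to compare against the 2 GFLOP/s figure.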