SystemVerilog implementation of RISC-V RV32IM & custom packed SIMD ISA as a 5-stage scalar M-mode core, with L1 caches and branch predictors
- RISC-V core
- Getting the project
- Microarchitecture
- Measured Performance - FPGA emulation
- Verification
- FPGA emulation
- Analysis example use-case: Dhrystone
- Overall hardware performance estimates correlation
Project relies on a few external libraries and tools. Clone recursively with
git clone --recurse-submodules git@github.com:AleksandarLilic/ama-riscv.git
Submodules are pulled automatically with --recurse-submodules
- Vivado (tested on 2023.2)
- GCC >= 10 (C++17, gnu++17)
- Make
- Python
- Prerequisites of ./sim submodule
To check that everything is available and working as expected:
- build RTL & testbench
- build cosim
- run the test and check the logs
# set up environment variables
source setup.sh
# build test
cd sim/sw/baremetal/asm_rv32i
make
cd -
# build and run RTL
run -t sim/sw/baremetal/asm_rv32i/test.hex -v VERBOSE
# check the execution log
vim testrun_<run_date>/asm_rv32i_test/test.log
Outputs:
ls testrun_<run_date>/asm_rv32i_test
asm_rv32i_test_out_cosim build.log cosim Makefile Makefile.inc make_run_default.log run.sh test.log test.status work.ama_riscv_tb.wdb xsim.dir
Cosim outputs:
ls testrun_<run_date>/asm_rv32i_test/asm_rv32i_test_out_cosim/
asm_rv32i_test.kanata.log callstack_folded_cycle_cosim.txt callstack_folded_inst_cosim.txt hw_stats.json inst_profile_clk.json trace_clk.bin uart.log
The second option is to use a testlist instead of manually specifying each test. Optionally, the testlist can be filtered, e.g. using the "simple" group filter, together with a rundir and a timeout in number of clock cycles
run --testlist ../testlist.yaml -r testrun_demo -c 1000000 -f "simple"
Add -k to keep the build and -p to keep already passed tests (together with the same --rundir), reduce expected runtime/timeout, and run official RISC-V ISA tests
run --testlist ../testlist.yaml -r testrun_demo -c 5000 -f "riscv_isa_rv32i" -kp
Running -v VERBOSE should be done with care since it will slow down execution. For short tests, the difference is insignificant, while for longer ones it can add minutes to the simulation time, and create a large log.
Similarly, running with -testplusarg enable_konata and -testplusarg prof_trace can create large kanata and profiler trace logs.
Core supports RV32IM_zicsr_zifencei_zicntr_zihpm_xsimd (rv drom) or any legal subset (gcc -march options)
Core microarchitecture follows a fairly standard 5-stage single issue RISC-style pipeline:
- FET:
- Core selects PC source.
- Icache does address generation and tag lookup.
- Branch predictor logically sits here.
- FET_DEC pipe:
- Icache responds with one instruction.
- Next PC is prepared.
- DEC:
- Core decodes the instruction, generates immediate-based next PC for conditional direct branches (b*) and unconditional direct branches (jal).
- Pipeline information is sent to the branch predictor; next PC is predicted and frontend redirected if needed.
- Source registers are either read, or bypassed from the backend. Core progresses even if not all operands are available.
- DEC_EXE pipe:
- Control signals and (all available) operands are propagated to EXE stage.
- EXE:
- Core executes arithmetic, logic, and control flow instructions in this one cycle.
- RV32 multiplier and SIMD finish first (out of two) execution cycle.
- RV32 restoring binary divider starts, takes 2 to 34 cycles to finish (data dependent).
- SIMD data formatting unit executes.
- First part of Dcache address generation is done.
- If source operand(s) are not available through register read nor in the bypass network, machine stalls.
- EXE_MEM pipe:
- Available results and relevant control signals are propagated to MEM stage.
- RV32 multiplier and SIMD unit propagate first stage results.
- Branch results are propagated to MEM stage.
- MEM:
- RV32 multiplier and SIMD unit finish second execution cycle.
- SIMD unit takes in late_c in case it's a dot product instruction, either from EXE_MEM pipe or bypass network, so no penalty is incurred on back-to-back dots to the same accumulator.
- Resolution of conditional direct branches (b*) and unconditional indirect branches (jalr) is available.
- On branch miss or jalr instruction, frontend is redirected, taking 2 cycle miss penalty (b*) or 2 cycle stall (jalr).
- Dcache does second part of the address generation, and tag lookup.
- MEM_WBK pipe:
- Available results and relevant control signals are propagated to WBK stage.
- RV32 multiplier and SIMD results are propagated to WBK stage.
- Dcache loads or stores data.
- WBK:
- Result from RV32 multiplier and SIMD unit is ready.
- Dcache returns loaded data.
- RETIRE pipe:
- Instruction retires, optionally writes to RF.
- Performance events about this cycle are collected.
Default parameters are primarily set in src/ama_riscv_defines.svh and src/ama_riscv_types.svh, with a few others set in the sources.f files, or directly set from command line, like during synthesis.
SIMD ISA can be completely disabled by setting --define CPU_SIMD_EN=0.
Multiplier (RV32 M-extension) can still use the same BW tree with --define CPU_MULT_USE_BW=1, or plain 16x16 partial products multiplier with --define CPU_MULT_USE_BW=0.
- 4KB, 32-set, 2-way, 64B lines, 128-bit bus, LRU replacement
- Parametrizable for number of sets and ways, with one bank per way
- 1 clock cycle on hit, 7 on miss
- Configuration driven by sweeps via hw_model_sweep.py and cache config; results are under examples/hw_sweeps
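The geometry above can be sanity-checked with a few lines of arithmetic. The sketch below derives the capacity, the number of bus beats per line, and a tag/index/offset split from the set, way, and line-size figures. The 32-bit address width and the conventional field ordering (offset in the low bits, index above it) are assumptions made for illustration; the RTL may slice addresses differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    sets: int
    ways: int
    line_bytes: int
    bus_bits: int = 128
    addr_bits: int = 32  # assumption: 32-bit addresses

    @property
    def size_bytes(self) -> int:
        return self.sets * self.ways * self.line_bytes

    @property
    def offset_bits(self) -> int:
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()

    @property
    def tag_bits(self) -> int:
        return self.addr_bits - self.index_bits - self.offset_bits

    @property
    def beats_per_line(self) -> int:
        # Number of bus transfers needed to move one full line
        return self.line_bytes * 8 // self.bus_bits

    def split(self, addr: int):
        """Return (tag, index, offset) under the assumed field ordering."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


# Figures taken from the Icache and Dcache descriptions in this document
icache = CacheGeometry(sets=32, ways=2, line_bytes=64)
dcache = CacheGeometry(sets=16, ways=4, line_bytes=64)

print(icache.size_bytes, icache.tag_bits, icache.index_bits, icache.offset_bits)
print(icache.split(0x400B8))
```

Both configurations come out at 4096 bytes, and a 64B line needs four transfers over the 128-bit bus.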
Follows the combined predictor from McFarling's "Combining Branch Predictors"
- Combined predictor: bimodal + global, with a meta predictor
- No penalty on hit, 2 cycle penalty on miss, predicts back-to-back branches
- Parametrizable - options: static, bimodal, global, gshare, gselect, combined
- Chosen through sweeps + config constraints: due to timing considerations, no PHT is bigger than 2^8 entries, i.e. no more than 8 bits are used for indexing
- Configuration driven by sweeps via hw_model_sweep.py and bp config (see sim sweeps); results are under examples/hw_sweeps
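To make the combined scheme concrete, here is a minimal behavioural sketch of a bimodal + global predictor with a meta chooser, in the spirit of McFarling's description. It is not a model of the RTL: the class names, the PC-based indexing of the meta table, the counter reset value, and the use of the raw global history as the index are all assumptions chosen for brevity. Only the 2^8 table size follows the constraint stated above.

```python
class CounterTable:
    """Table of 2-bit saturating counters, reset to weakly not-taken (1)."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)

    def get(self, idx: int) -> bool:
        return self.table[idx & self.mask] >= 2

    def update(self, idx: int, up: bool) -> None:
        i = idx & self.mask
        if up:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class CombinedPredictor:
    def __init__(self, index_bits: int = 8):
        self.bimodal = CounterTable(index_bits)  # indexed by PC
        self.glob = CounterTable(index_bits)     # indexed by global history
        self.meta = CounterTable(index_bits)     # >= 2 selects the global predictor
        self.ghr = 0
        self.ghr_mask = (1 << index_bits) - 1

    def predict(self, pc: int):
        pc_idx = pc >> 2  # drop the byte offset of 4-byte instructions
        p_bi = self.bimodal.get(pc_idx)
        p_gl = self.glob.get(self.ghr)
        chosen = p_gl if self.meta.get(pc_idx) else p_bi
        return chosen, p_bi, p_gl

    def update(self, pc: int, taken: bool) -> None:
        pc_idx = pc >> 2
        _, p_bi, p_gl = self.predict(pc)
        # The chooser only trains when the two components disagree
        if p_bi != p_gl:
            self.meta.update(pc_idx, up=(p_gl == taken))
        self.bimodal.update(pc_idx, taken)
        self.glob.update(self.ghr, taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.ghr_mask


def run(trace, predictor) -> float:
    """Return prediction accuracy over a list of (pc, taken) pairs."""
    hits = 0
    for pc, taken in trace:
        pred, _, _ = predictor.predict(pc)
        hits += pred == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# Loop branch: taken 7 times, then not taken, repeated
trace = [(0x400C4, i % 8 != 7) for i in range(8000)]
accuracy = run(trace, CombinedPredictor())
print(f"accuracy: {accuracy:.4f}")
```

On this loop pattern the bimodal component alone mispredicts every loop exit, while the global component learns the exit from history, so the chooser settles on the global side after a short warm-up.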
3R2W design, banked by default
- Address of a 2nd write port is rd_addr + 1
- Address of rs3 is rd (used only for dot* instructions)
- Optionally banked as odd/even such that rd and rdp paired writes always land in a different bank
2-cycle pipelined unit (EXE + MEM), shared across rv32 mul* and packed SIMD arithmetic instructions
- Single 32x32 Baugh-Wooley partial product array reduced through a CSA tree, reused across all element widths via diagonal lane masks: the full products are computed once, and the narrower (or accumulated) results are picked at the output mux
- First stage builds and reduces the partial products, second stage finishes reduction and does the final add and result select
- For dot* instructions, rs3 is passed or forwarded as late_c in SIMD unit second stage, so back-to-back dots to the same accumulator don't cause stalls
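For readers unfamiliar with the Baugh-Wooley formulation, the sketch below builds the modified Baugh-Wooley partial product array for a signed n x n multiply and sums it directly. It only illustrates why the array needs no sign extension of individual partial products; the CSA tree, the two-stage split, and the diagonal lane masks of the actual unit are not modelled, and the function names are hypothetical.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley_mul(a: int, b: int, n: int = 32) -> int:
    """Signed n x n multiply using a modified Baugh-Wooley array."""
    a &= (1 << n) - 1
    b &= (1 << n) - 1
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> j) & 1 for j in range(n)]

    acc = 0
    for i in range(n):
        for j in range(n):
            pp = abit[i] & bbit[j]
            # Partial products involving exactly one sign bit are inverted
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            acc += pp << (i + j)

    # Constant bits that compensate for the inverted terms
    acc += (1 << n) + (1 << (2 * n - 1))
    # Keep 2n bits and reinterpret as signed
    return to_signed(acc, 2 * n)


print(baugh_wooley_mul(-7, 9, n=8))
print(baugh_wooley_mul(-123456, 654321))
```

The plain Python sum stands in for the reduction tree; in hardware the same bits would be compressed with carry-save adders before the final add.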
Single cycle unit, mostly made up of wiring and muxes for moving vector elements around.
Segmented reversible shifter is primarily used for shift (slli*/srli*/srai*) instructions. The reversible shifter also doubles as a second stage for the widen* class of instructions for the first word, while a second left-only segmented shifter is used for the second word.
32-bit restoring binary divider with clz-based normalization and a one-entry result cache
- Common case: 3 + (cnt_b - cnt_a + 1) (cnt_a=clz(|dividend|), cnt_b=clz(|divisor|)) - 1 cycle start, 1 cycle setup, 1 cycle per dividend bit offset for the number of cnt_b bits, 1 cycle fixup
- Special case: 2 - 1 cycle start and 1 cycle setup
- Cache hit: 1 - combinational hit + flop the output
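The common-case latency formula can be evaluated directly. The sketch below is a small calculator for it, not a model of the divider. Which operand pairs the hardware routes to the special case is not spelled out here, so the function only rejects inputs for which the formula is clearly not meaningful (a zero divisor, or a divisor with more significant bits than the dividend); that guard is an assumption of this sketch.

```python
def clz32(x: int) -> int:
    """Count leading zeros of a 32-bit unsigned value (32 for zero)."""
    return 32 - (x & 0xFFFFFFFF).bit_length()


SPECIAL_CASE_CYCLES = 2  # 1 cycle start + 1 cycle setup
CACHE_HIT_CYCLES = 1     # combinational hit + output flop


def div_cycles_common(dividend: int, divisor: int) -> int:
    """Common-case latency: 3 + (cnt_b - cnt_a + 1)."""
    cnt_a = clz32(abs(dividend))
    cnt_b = clz32(abs(divisor))
    if divisor == 0 or cnt_b < cnt_a:
        # Assumption: such inputs are not handled by the common-case path
        raise ValueError("operands fall outside the common case assumed here")
    return 3 + (cnt_b - cnt_a + 1)


# 100 has 25 leading zeros, 7 has 29: 3 + (29 - 25 + 1) cycles
print(div_cycles_common(100, 7))
```

Operand magnitudes are used because the formula is stated in terms of |dividend| and |divisor|.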
- 4KB, 16-set, 4-way, 64B lines, 128-bit bus, LRU replacement with writeback
- Parametrizable for number of sets and ways, with one bank per way
- 1 clock cycle on hit, 7 on miss, 10 on miss with writeback; 1 cycle penalty on load to use
- Configuration driven by sweeps, same flow and config as Icache (results are under examples/hw_sweeps)
- Caches are backed by a single main memory, with parametrizable size
- default: 128KB, per the current linker script
- True dual port, 128-bit bus, 16B words
- $readmemh is used to load in the workload's .mem file
- For verification, ama_riscv_tb writes the file
- For FPGA emulation, synthesis writes the initial file, later patched as needed (check FPGA emulation for details).
ama_riscv_top # Design top
├── ama_riscv_core_top # Core & caches wrapper
│ ├── ama_riscv_core # CPU pipeline
│ │ ├── ... # All core modules
│ ├── ama_riscv_icache # L1 instruction cache
│ │ └── mem # memory banks
│ └── ama_riscv_dcache # L1 data cache
│ └── mem # memory banks
├── ama_riscv_mem # Unified backing memory
└── ama_riscv_uart # Memory-mapped UART
└── uart # TX & RX subsystem
├── ... # TX & RX PHYs
For verification, ama_riscv_top is instantiated as DUT in the ama_riscv_tb testbench.
For FPGA emulation, ama_riscv_top is instantiated in the ama_riscv_fpga FPGA wrapper.
Emulation is run at 50 MHz on the Arty A7-100T board. Since the design uses a single clock domain, scaling the clock frequency by X scales the absolute speed by the same factor, keeping the 'per MHz' results unchanged.
- Dhrystone: 81 DMIPS, 1.63 DMIPS/MHz (IPC: 0.91)
- Coremark: 145 Coremarks, 2.9 Coremarks/MHz (IPC: 0.89)
- STREAM-INT:
- Copy: 66 MB/s
- Scale: 50 MB/s
- Add: 64 MB/s
- Triad: 57 MB/s
- Embench_1.0 compiled for speed (with detailed breakdown):
- Size: 9.47 (4.32 - 20.76)
- Speed: 51.55 (34.02 - 78.11)
- Speed/MHz: 1.03 (0.68 - 1.56)
SIMD ISA improvements on MLP, measured in inferences per second
| Flavor | RV32IM [inf/s] | RV32IM_Xsimd [inf/s] | Improvement |
|---|---|---|---|
| w8a8 | 238 | 1968 | 8.3x |
| w4a8 | 228 | 1984 | 8.7x |
| w2a8 | 293 | 2083 | 7.1x |
All TDA and counter plots are available under examples/perf_runs_fpga
With CSV summaries under the same directory
All breakdowns are available under examples/perf_runs_fpga/all_stats
There are two supported mechanisms for functional verification:
- Self-checking C/ASM tests use tohost CSR LSB where testbench waits for tohost[0] == 1 to end the simulation. In case of a failing test, tohost[31:1] reports the failed ID
- Cosim - instruction set simulator ama-riscv-sim tied up over DPI and single-stepped from testbench after each retired instruction in RTL. Checkers are set up on the architectural state (32 GP registers, and PC)
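The tohost encoding used by the self-checking tests is easy to decode offline, for example when inspecting a dumped CSR value. The sketch below is a hypothetical helper, not part of the testbench. It assumes that a zero ID field together with the done bit means the test passed, which mirrors the common riscv-tests convention but is not stated explicitly above.

```python
from typing import NamedTuple, Optional


class TohostStatus(NamedTuple):
    done: bool                 # tohost[0] == 1, simulation should end
    failed_id: Optional[int]   # tohost[31:1], None when nothing failed


def decode_tohost(value: int) -> TohostStatus:
    value &= 0xFFFFFFFF
    done = bool(value & 1)
    if not done:
        return TohostStatus(False, None)
    failed_id = value >> 1
    # Assumption: an all-zero ID field means the test passed
    return TohostStatus(True, failed_id or None)


print(decode_tohost(0x1))
print(decode_tohost((5 << 1) | 1))
```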
On the RTL level, one or both methods can be used. Since cosim relies on probing internal signals, cosim is unsupported out of the box for GLS.
As shown above, running tests is done through ./run_test.py script (aliased to run in setup.sh)
Full usage available in examples/run.help
Design is synthesized with conservative 50MHz constraint for the emulation purposes, targeting xc7a100tcsg324-1 part on Arty A7-100T board.
Defines used for synthesis:
set_property verilog_define { SYNT FPGA FPGA_HEX_PATH=<workdir>/ama-riscv/sim/sw/baremetal/uart_direct_send/hello_world.mem } [current_fileset]
Main memory is later patched with workload(s) of interest via script/update_mem.sh and loaded into the FPGA with script/flash_bit.tcl.
Utilization overview:
+----------------------------+-------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs | 12756 | 0 | 0 | 63400 | 20.12 |
| LUT as Logic | 12624 | 0 | 0 | 63400 | 19.91 |
| LUT as Memory | 132 | 0 | 0 | 19000 | 0.69 |
| LUT as Distributed RAM | 132 | 0 | | | |
| LUT as Shift Register | 0 | 0 | | | |
| Slice Registers | 5013 | 0 | 0 | 126800 | 3.95 |
| Register as Flip Flop | 5013 | 0 | 0 | 126800 | 3.95 |
| Register as Latch | 0 | 0 | 0 | 126800 | 0.00 |
| F7 Muxes | 399 | 0 | 0 | 31700 | 1.26 |
| F8 Muxes | 150 | 0 | 0 | 15850 | 0.95 |
+----------------------------+-------+-------+------------+-----------+-------+
First three logic levels, with percentage contribution compared to part's total resource availability
+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+
| Instance | Module | Total LUTs | Logic LUTs | LUTRAMs | FFs | RAMB36 | RAMB18 | DSP Blocks |
+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+
| ama_riscv_fpga | (top) | 12756(20.12%) | 12624(19.91%) | 132(0.69%) | 5013(3.95%) | 44(32.59%) | 0(0.00%) | 0(0.00%) |
| (ama_riscv_fpga) | (top) | 7(0.01%) | 7(0.01%) | 0(0.00%) | 28(0.02%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_top_i | ama_riscv_top | 12748(20.11%) | 12616(19.90%) | 132(0.69%) | 4985(3.93%) | 44(32.59%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_core_top_i | ama_riscv_core_top | 12646(19.95%) | 12514(19.74%) | 132(0.69%) | 4891(3.86%) | 12(8.89%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_core_i | ama_riscv_core | 10380(16.37%) | 10248(16.16%) | 132(0.69%) | 3558(2.81%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_dcache_i | ama_riscv_dcache | 1775(2.80%) | 1775(2.80%) | 0(0.00%) | 789(0.62%) | 8(5.93%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_icache_i | ama_riscv_icache | 491(0.77%) | 491(0.77%) | 0(0.00%) | 544(0.43%) | 4(2.96%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_mem_i | ama_riscv_mem | 31(0.05%) | 31(0.05%) | 0(0.00%) | 2(0.01%) | 32(23.70%) | 0(0.00%) | 0(0.00%) |
| (ama_riscv_mem_i) | ama_riscv_mem | 8(0.01%) | 8(0.01%) | 0(0.00%) | 2(0.01%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| u_mem | xpm_memory_tdpram | 24(0.04%) | 24(0.04%) | 0(0.00%) | 0(0.00%) | 32(23.70%) | 0(0.00%) | 0(0.00%) |
| ama_riscv_uart_i | ama_riscv_uart | 74(0.12%) | 74(0.12%) | 0(0.00%) | 92(0.07%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| (ama_riscv_uart_i) | ama_riscv_uart | 20(0.03%) | 20(0.03%) | 0(0.00%) | 44(0.03%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| uart_i | uart | 54(0.09%) | 54(0.09%) | 0(0.00%) | 48(0.04%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
| fpga_clk_gen_i | ama_riscv_fpga_clk_gen | 1(0.01%) | 1(0.01%) | 0(0.00%) | 0(0.00%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+
Detailed utilization reports are available under examples/perf_runs_fpga/fpga_synt_reports
FPGA synthesis and implementation flow uses Vivado in batch mode, with fpga/synt.tcl for the underlying recipe, and fpga/run_synt.py as the multi-run driver. The run_synt.py driver is the entry point, and the only mandatory argument is the .yaml config file, e.g.
mkdir workdir_synt && cd workdir_synt # useful to work in a separate, gitignored, area
../fpga/run_synt.py --config ../fpga/configs/simd.yaml
Which will kick off the run, with otherwise default settings:
launching 1 run(s), up to 4 parallel (auto threads each):
simd_50_flat_rebuilt -> /home/alek/dev/ama-riscv/workdir_synt/synt_simd_50_flat_rebuilt
The yaml config is set up such that there's provided baseline at fpga/configs/_base.yaml, which each of the run configs (simd.yaml in the above case) first extend, and then either add new or override existing parameters. run_synt.py therefore first merges the two (or more, multiple extends can be chained) yaml config files together, and then produces config.resolved.yaml under the rundir, with params.tcl as those same parameters prepared in tcl format for the synt.tcl. After config files are ready, it kicks-off the synthesis.
Running with --dry_run first is useful to confirm all settings were as expected. Dry run will only print the rundir, and write config.resolved.yaml and params.tcl under the rundir (for each of the configs).
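The extend-then-override resolution described above can be pictured with plain dictionaries. The sketch below is not the run_synt.py implementation: the "extends" key name, the recursive merge of nested sections, the parameter names, and the `set name {value}` Tcl output format are all assumptions made for illustration.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Return base with override applied; nested dicts are merged recursively."""
    out = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = deepcopy(value)
    return out


def resolve(name: str, configs: dict) -> dict:
    """Follow the chain of parents, applying each child on top of its base."""
    cfg = configs[name]
    parent = cfg.get("extends")  # hypothetical key name
    body = {k: v for k, v in cfg.items() if k != "extends"}
    if parent is None:
        return deepcopy(body)
    return merge(resolve(parent, configs), body)


def to_tcl(params: dict) -> str:
    """Render flat parameters as Tcl `set` commands (hypothetical format)."""
    return "\n".join(f"set {k} {{{v}}}" for k, v in sorted(params.items()))


# Hypothetical parameter names; only the part number comes from this document
configs = {
    "_base": {"part": "xc7a100tcsg324-1", "clk_mhz": 50, "simd": 0},
    "simd": {"extends": "_base", "simd": 1},
}

resolved = resolve("simd", configs)
print(resolved)
print(to_tcl(resolved))
```

Because each level is resolved before the child is applied, chained extends behave the same way as a single one.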
Multiple config files can be specified together, where each config would create a separate job:
../fpga/run_synt.py --config ../fpga/configs/simd.yaml ../fpga/configs/simd_hier.yaml
When running multiple jobs in parallel, it's good to be mindful of the host resources. The synthesis phase is limited to 4 threads by Vivado (v2023) itself, while implementation can use more threads. Therefore, running multiple jobs in parallel can quickly throttle the host system. Using -j/--jobs will limit the number of parallel runs, e.g. running with -j 1 will fully serialize the jobs. The default is currently set to 4 parallel jobs. It's also possible to limit the number of threads each job will use with --threads. This directly changes Vivado's own general.maxThreads parameter. The default is currently set to 0 which means it won't be set, and will instead let Vivado use its auto-detection.
Vivado directive options file is provided as a quick reference guide for the synt/impl adjustments.
Flow will create two checkpoints along the way, as post_synth.dcp and routed.dcp, which can then be opened up for further work, either in GUI mode with vivado routed.dcp & or in TCL mode with vivado -mode tcl and then open_checkpoint routed.dcp in Vivado's TCL shell.
Note
Tests, profiling, and logging are heavily reused from ama-riscv-sim and therefore only differences introduced in the RTL environment will be covered here. Otherwise all of the functionality carries over.
Run the same Dhrystone build as in the ISA sim example
run -t sim/sw/baremetal/dhrystone/dhrystone.elf
In all places where ISA sim previously counted instructions, profilers now count cycles
On the default verbosity, execution is not logged, and only run stats are printed
...
=== UART END ===
Test ran to completion
Checker 1/2 - 'tohost': ENABLED: PASS
Checker 2/2 - 'cosim' : ENABLED: PASS
==== PASS ====
Warnings: 1
Errors: 0
DUT instruction count: 521118
Core stats:
Cycles: 609603, Inst: 521118, Stalls: 88485, CPI: 1.170 (IPC: 0.855)
$finish called at time : 6096050100 ps : File "<home>/ama-riscv/verif/direct_tb/ama_riscv_tb.sv" Line 885
Simulation cycles: 609603
Stats - Profiling Summary:
core (TDA counters)
Cycles: 609603, Inst: 519031, Stalls: 90572, CPI: 1.175 (IPC: 0.851)
TDA:
L1: Bad Spec: 11692, FE: 72235, BE: 6645, Retired: 519031
L2: FE Mem: 25775, FE Core: 46460, BE Mem: 1805, BE Core: 4840, INT: 519031, SIMD: 0
bpred
P: 58138, M: 5847, ACC: 90.86%, MPKI: 11.27
icache
Ref: 530724, H: 525552(525552/0), M: 5172(5172/0), R: 0, HR: 99.03%; CT (R/W): core 2.0/0.0 MB, mem 323.2/0.0 KB
dcache
Ref: 162719, H: 162513(86168/76345), M: 206(29/177), R: 0, WB: 142, HR: 99.87%; CT (R/W): core 297/279 KB, mem 12.9/8.9 KB
core (all counters)
Control Flow: 108436 - J: 21746, JR: 22705, BR: 63985
Memory: 166065 - Load: 87869, Store: 78196
SIMD: 0 - Arith: 0, Data Format: 0
Stall - SIMD: 0, Load: 4840
icache - A: 530723, M: 5172, SM (G/B): 2061(1027/1034), AMAT: 1.05
dcache - A: 162718, M: 206, WB: 142, AMAT: 1.01
run: Time (s): cpu = 00:00:00.18 ; elapsed = 00:02:17 . Memory (MB): peak = 1326.262 ; gain = 8.004 ; free physical = 7026 ; free virtual = 27924
## puts "Simulation runtime: [expr {[clock seconds] - $start}]s"
Simulation runtime: 138s
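The derived figures in the profiling summary follow from the raw counts. The sketch below recomputes CPI/IPC, branch predictor accuracy and MPKI, and the cache hit rates from the numbers printed above; the dictionary keys are names chosen for this example. The reported MPKI is reproduced when mispredictions are normalised by the retired instruction count from the TDA counters.

```python
# Raw counts copied from the profiling summary above
stats = {
    "cycles": 609603,
    "ret": 519031,
    "branches": 63985,
    "predicted": 58138,
    "mispredicted": 5847,
    "icache_ref": 530724,
    "icache_hits": 525552,
    "dcache_ref": 162719,
    "dcache_hits": 162513,
}

cpi = stats["cycles"] / stats["ret"]
ipc = stats["ret"] / stats["cycles"]

# Accuracy is the share of correctly predicted branches
bp_accuracy = 100.0 * stats["predicted"] / stats["branches"]
# MPKI: mispredictions per thousand retired instructions
bp_mpki = 1000.0 * stats["mispredicted"] / stats["ret"]

icache_hr = 100.0 * stats["icache_hits"] / stats["icache_ref"]
dcache_hr = 100.0 * stats["dcache_hits"] / stats["dcache_ref"]

print(f"CPI: {cpi:.3f} (IPC: {ipc:.3f})")
print(f"bpred ACC: {bp_accuracy:.2f}%, MPKI: {bp_mpki:.2f}")
print(f"icache HR: {icache_hr:.2f}%, dcache HR: {dcache_hr:.2f}%")
```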
Running with -v VERBOSE adds per cycle logs
21 ns: INFO: Reset released
30 ns: VERBOSE: Core empty cycle (stall: backend)
40 ns: VERBOSE: Core empty cycle (lost: other)
50 ns: VERBOSE: Core empty cycle (lost: other)
60 ns: VERBOSE: Core empty cycle (lost: other)
70 ns: VERBOSE: Core empty cycle (lost: other)
80 ns: VERBOSE: Core empty cycle (stall: frontend)
90 ns: VERBOSE: Core empty cycle (stall: frontend)
100 ns: VERBOSE: Core empty cycle (stall: frontend)
110 ns: VERBOSE: Core empty cycle (stall: frontend)
120 ns: VERBOSE: Core empty cycle (stall: frontend)
130 ns: VERBOSE: Core empty cycle (stall: frontend)
140 ns: VERBOSE: Core [R] 40000: 00000093
140 ns: VERBOSE: COSIM 40000: 00000093 addi x1,x0,0 x1 : 0x00000000
141 ns: VERBOSE: First write to x1. Checker activated
150 ns: VERBOSE: Core [R] 40004: 00000113
150 ns: VERBOSE: COSIM 40004: 00000113 addi x2,x0,0 x2 : 0x00000000
151 ns: VERBOSE: First write to x2. Checker activated
160 ns: VERBOSE: Core [R] 40008: 00000193
160 ns: VERBOSE: COSIM 40008: 00000193 addi x3,x0,0 x3 : 0x00000000
161 ns: VERBOSE: First write to x3. Checker activated
170 ns: VERBOSE: Core [R] 4000c: 00000213
170 ns: VERBOSE: COSIM 4000c: 00000213 addi x4,x0,0 x4 : 0x00000000
171 ns: VERBOSE: First write to x4. Checker activated
180 ns: VERBOSE: Core [R] 40010: 00000293
180 ns: VERBOSE: COSIM 40010: 00000293 addi x5,x0,0 x5 : 0x00000000
181 ns: VERBOSE: First write to x5. Checker activated
190 ns: VERBOSE: Core [R] 40014: 00000313
190 ns: VERBOSE: COSIM 40014: 00000313 addi x6,x0,0 x6 : 0x00000000
191 ns: VERBOSE: First write to x6. Checker activated
...
Folded callstacks are saved as callstack_folded_cycle_cosim.txt and callstack_folded_inst_cosim.txt, counting the number of cycles and instructions in each stack
Example snippet from the 'cycle' folded callstack
...
call_main;main;Proc_2; 9006
call_main;main;Proc_1;Proc_6;Func_3; 3006
call_main;main;Proc_1;Proc_7; 4000
call_main;main;Proc_1; 59035
call_main;main;Proc_8; 25022
call_main;main;Func_2;strcmp; 57048
call_main;main;Func_2; 32014
call_main;main;Proc_4; 9006
call_main;main;Proc_5; 4006
...
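The folded format is straightforward to post-process. As a minimal sketch (not the prof_stats.py implementation), the function below parses lines in the shape shown above and accumulates self and total counts per symbol: self counts go to the innermost frame, total counts to every frame on the stack. Since only the snippet is used as input, the totals are smaller than in a full profile.

```python
from collections import defaultdict


def parse_folded(lines):
    """Return (self_counts, total_counts) per symbol from folded stacks."""
    self_cnt = defaultdict(int)
    total_cnt = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line == "...":
            continue
        stack, count = line.rsplit(None, 1)
        # Stacks in the snippet end with ';', so drop empty frames
        frames = [f for f in stack.split(";") if f]
        n = int(count)
        self_cnt[frames[-1]] += n
        # Count each symbol once per stack, even if it recurses
        for sym in set(frames):
            total_cnt[sym] += n
    return dict(self_cnt), dict(total_cnt)


snippet = """
call_main;main;Proc_2; 9006
call_main;main;Proc_1;Proc_6;Func_3; 3006
call_main;main;Proc_1;Proc_7; 4000
call_main;main;Proc_1; 59035
call_main;main;Proc_8; 25022
call_main;main;Func_2;strcmp; 57048
call_main;main;Func_2; 32014
call_main;main;Proc_4; 9006
call_main;main;Proc_5; 4006
""".splitlines()

self_cnt, total_cnt = parse_folded(snippet)
print(self_cnt["Proc_1"], total_cnt["Proc_1"])
```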
Profiled instructions summary is saved as inst_profile_clk.json, counting the number of cycles each instruction type took to execute.
Note that cycle count reports how long it took for that instruction to retire. That means that some stalls (like on jumps or pipe flush) would be counted in the affected instructions, not jumps or branches themselves. That also means that dcache misses on loads and stores would be counted under those instructions.
{
"add": {"count": 26560},
"sub": {"count": 6516},
"sll": {"count": 22},
...
"_max_sp_usage": 368,
"_profiled_cycles": 609603
}
Execution trace, saved as trace_clk.bin, contains a trace_entry struct for each simulation cycle, and is needed as an input for the analysis scripts (below)
Stats for core, icache, dcache, and branch predictor are available as hw_stats.json
This replaces HW models present in the ISA sim
{
"core": {
"bad_spec": 11692,
"stall_be": 6645,
"stall_l1d": 1805,
"stall_l1d_r": 290,
"stall_l1d_w": 1515,
"stall_fe": 72235,
"stall_l1i": 25775,
"stall_simd": 0,
"stall_load": 4840,
"ret_ctrl_flow": 108436,
"ret_ctrl_flow_j": 21746,
"ret_ctrl_flow_jr": 22705,
"ret_ctrl_flow_br": 63985,
"ret_mem": 166065,
"ret_mem_load": 87869,
"ret_mem_store": 78196,
"ret_simd": 0,
"ret_simd_arith": 0,
"ret_simd_data_fmt": 0,
"l1i_ref": 530723,
"l1i_miss": 5172,
"l1i_spec_miss": 2061,
"l1i_spec_miss_bad": 1034,
"l1i_spec_miss_good": 1027,
"l1d_ref": 162718,
"l1d_ref_r": 86196,
"l1d_ref_w": 76522,
"l1d_miss": 206,
"l1d_miss_r": 29,
"l1d_miss_w": 177,
"l1d_writeback": 142,
"ret": 519031,
"cycles": 609603,
"stalls": 90572,
"stall_fe_core": 46460,
"stall_be_core": 4840,
"ret_int": 519031,
"cpi": 1.1745,
"ipc": 0.851425
},
"icache": {
"references": 530724,
"hits": {"reads": 525552, "writes": 0},
"misses": {"reads": 5172, "writes": 0},
"replacements": 0,
"writebacks": 0,
"ct_core": {"reads": 2122896, "writes": 0},
"ct_mem": {"reads": 331008, "writes": 0}
},
"dcache": {
"references": 162719,
"hits": {"reads": 86168, "writes": 76345},
"misses": {"reads": 29, "writes": 177},
"replacements": 0,
"writebacks": 142,
"ct_core": {"reads": 304189, "writes": 285631},
"ct_mem": {"reads": 13184, "writes": 9088}
},
"bpred": {
"type": "rtl_defines",
"branches": 63985,
"predicted": 58138,
"predicted_fwd": 0,
"predicted_bwd": 0,
"mispredicted": 5847,
"mispredicted_fwd": 0,
"mispredicted_bwd": 0,
"accuracy": 90.86,
"mpki": 11.27
},
"_done": true
}
Konata is used to visualize the pipeline and instruction execution. When recorded during execution (with -testplusarg enable_konata), kanata log is available as <test_tag>.kanata.log under test's *_out_cosim/ directory
Beginning of main loop execution

Collection of custom and open source tools are provided for profiling, analysis, and visualization
Flat profile script provides samples/time spent in all executed functions, and prints it to the stdout
./sim/script/prof_stats.py -t examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt -e cycle --plot
Profile - Cycle
%[c] %cumulative[c] self[c] total[c] self[us] total[us] symbol
16.89 16.89 86236 86236 862.4 862.4 strcpy
16.00 32.90 81690 498303 816.9 4983.0 main
11.57 44.46 59035 99065 590.4 990.6 Proc_1
11.18 55.64 57048 57048 570.5 570.5 strcmp
6.27 61.91 32014 94068 320.1 940.7 Func_2
5.28 67.19 26953 70003 269.5 700.0 npf_vpprintf
4.90 72.09 25022 25022 250.2 250.2 Proc_8
4.31 76.41 22016 25022 220.2 250.2 Proc_6
2.94 79.34 15006 15006 150.1 150.1 Func_1
2.36 81.70 12037 12037 120.4 120.4 clear_bss_w_loop
2.35 84.05 12006 12006 120.1 120.1 Proc_7
2.31 86.37 11794 11794 117.9 117.9 npf_putc_cnt
2.31 88.68 11794 11794 117.9 117.9 send_byte_uart0
2.16 90.83 11008 11008 110.1 110.1 Proc_3
1.76 92.60 9006 9006 90.1 90.1 Proc_2
1.76 94.36 9006 9006 90.1 90.1 Proc_4
1.33 95.69 6790 6790 67.9 67.9 npf_putc_uart
total_samples : 510456
clk_mhz : 100.0
total_time : 5104.56
time_unit : us
(Showing top 17 of 36 entries after filtering - Threshold: 1%)
It's also possible to combine RTL and ISA sim callstacks to get IPC breakdown
./sim/script/prof_stats.py -t examples/dhrystone_dhrystone_out_cosim/callstack_folded_inst_cosim.txt -s examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt --plot
Profile - Inst/Cycles combined
%[i] %cumulative[i] self[i] total[i] %[c] %cumulative[c] self[c] total[c] self[us] total[us] ipc cpi symbol
18.73 18.73 86172 86172 16.89 16.89 86236 86236 862.4 862.4 0.999 1.001 strcpy
13.11 31.84 60291 449319 16.00 32.90 81690 498303 816.9 4983.0 0.902 1.109 main
11.74 55.97 54000 90000 11.57 44.46 59035 99065 590.4 990.6 0.908 1.101 Proc_1
12.39 44.23 57000 57000 11.18 55.64 57048 57048 570.5 570.5 0.999 1.001 strcmp
6.09 62.06 28000 90000 6.27 61.91 32014 94068 320.1 940.7 0.957 1.045 Func_2
4.48 71.97 20588 56186 5.28 67.19 26953 70003 269.5 700.0 0.803 1.246 npf_vpprintf
5.43 67.49 25000 25000 4.90 72.09 25022 25022 250.2 250.2 0.999 1.001 Proc_8
4.35 76.32 20000 23000 4.31 76.41 22016 25022 220.2 250.2 0.919 1.088 Proc_6
3.26 79.58 15000 15000 2.94 79.34 15006 15006 150.1 150.1 1.000 1.000 Func_1
2.31 87.06 10616 10616 2.36 81.70 12037 12037 120.4 120.4 0.882 1.134 clear_bss_w_loop
2.61 82.19 12000 12000 2.35 84.05 12006 12006 120.1 120.1 1.000 1.000 Proc_7
2.56 84.75 11788 11788 2.31 86.37 11794 11794 117.9 117.9 0.999 1.001 npf_putc_cnt
2.20 89.25 10104 10104 2.31 88.68 11794 11794 117.9 117.9 0.857 1.167 send_byte_uart0
1.96 93.17 9000 9000 2.16 90.83 11008 11008 110.1 110.1 0.818 1.223 Proc_3
1.96 91.21 9000 9000 1.76 92.60 9006 9006 90.1 90.1 0.999 1.001 Proc_2
1.96 95.12 9000 9000 1.76 94.36 9006 9006 90.1 90.1 0.999 1.001 Proc_4
0.73 96.72 3368 3368 1.33 95.69 6790 6790 67.9 67.9 0.496 2.016 npf_putc_uart
total instructions : 459998
total cycles : 510456
total time : 5104.56
clk MHz : 100.0
time unit : us
CPI : 1.11
IPC : 0.901
(Showing top 17 of 36 entries after filtering - Threshold: 1%)
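The derived columns in the combined profile can be recomputed from the count columns. Checking a few rows suggests that ipc and cpi are based on the total (inclusive) instruction and cycle counts, and that the time columns are cycle counts divided by the clock in MHz. The sketch below reproduces three rows under that reading; the helper name and row layout are choices made for this example.

```python
CLK_MHZ = 100.0  # clk MHz as printed in the profile summary

# symbol: (total_inst, total_cycles, self_cycles), copied from the table above
rows = {
    "strcpy": (86172, 86236, 86236),
    "main": (449319, 498303, 81690),
    "npf_putc_uart": (3368, 6790, 6790),
}


def derive(total_inst, total_cycles, self_cycles, clk_mhz=CLK_MHZ):
    """Recompute ipc/cpi and time columns from the raw counts."""
    return {
        "ipc": total_inst / total_cycles,
        "cpi": total_cycles / total_inst,
        "self_us": self_cycles / clk_mhz,    # cycles / MHz gives microseconds
        "total_us": total_cycles / clk_mhz,
    }


derived = {sym: derive(*counts) for sym, counts in rows.items()}
for sym, d in derived.items():
    print(f"{sym:14s} ipc={d['ipc']:.3f} cpi={d['cpi']:.3f} "
          f"self={d['self_us']:.1f}us total={d['total_us']:.1f}us")

# Whole-run figures from the summary lines
overall_ipc = 459998 / 510456
overall_cpi = 510456 / 459998
```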
Top-down analysis can be run based on the collected performance counters
By default, the script opens plots in the default browser. The -r <arg> option passes its argument straight to plotly's renderer argument. Using -r notebook or -r png is useful when running from a Jupyter notebook. The -r png option simply streams png contents to stdout
./sim/script/tda.py examples/dhrystone_dhrystone_out_cosim/hw_stats.json
TDA for 'dhrystone_dhrystone_cosim'
L1 L2 cycles
0 bad_spec <NA> 11692
1 frontend icache 25775
2 frontend core 46460
3 backend dcache 1805
4 backend core 4840
5 retiring integer 519031
6 retiring simd 0
Performance Counters for 'dhrystone_dhrystone_cosim' (IPC: 0.851)
counter value class count
cycles 609603 cycles 609.6k
ret 519031 ret 519.0k
stalls 90572 stall 90.6k
bad_spec 11692 bad_spec 11.7k
ret_int 519031 ret_* 519.0k
ret_ctrl_flow 108436 ret_* 108.4k
ret_ctrl_flow_br 63985 ret_* 64.0k
ret_ctrl_flow_j 21746 ret_* 21.7k
ret_ctrl_flow_jr 22705 ret_* 22.7k
ret_mem 166065 ret_* 166.1k
ret_mem_load 87869 ret_* 87.9k
ret_mem_store 78196 ret_* 78.2k
ret_simd 0 ret_* 0
ret_simd_arith 0 ret_* 0
ret_simd_data_fmt 0 ret_* 0
stall_be 6645 stall_* 6.64k
stall_be_core 4840 stall_* 4.84k
stall_fe 72235 stall_* 72.2k
stall_fe_core 46460 stall_* 46.5k
stall_l1d 1805 stall_* 1.80k
stall_l1d_r 290 stall_* 290
stall_l1d_w 1515 stall_* 1.51k
stall_l1i 25775 stall_* 25.8k
stall_load 4840 stall_* 4.84k
stall_simd 0 stall_* 0
l1i_miss 5172 l1i_* 5.17k
l1i_ref 530723 l1i_* 530.7k
l1i_spec_miss 2061 l1i_* 2.06k
l1i_spec_miss_bad 1034 l1i_* 1.03k
l1i_spec_miss_good 1027 l1i_* 1.03k
l1d_miss 206 l1d_* 206
l1d_miss_r 29 l1d_* 29
l1d_miss_w 177 l1d_* 177
l1d_ref 162718 l1d_* 162.7k
l1d_ref_r 86196 l1d_* 86.2k
l1d_ref_w 76522 l1d_* 76.5k
l1d_writeback 142 l1d_* 142
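The top-down breakdown is an exact partition of the cycle count, which can be verified from the counters. The sketch below, using values copied from the output above, groups them into the two TDA levels, checks that each level sums to its parent, and prints the share of cycles per level-1 bucket. The grouping into dictionaries is a choice made for this example, not the tda.py data model.

```python
# Counter values copied from the TDA output above
c = {
    "cycles": 609603,
    "bad_spec": 11692,
    "stall_fe": 72235, "stall_l1i": 25775, "stall_fe_core": 46460,
    "stall_be": 6645, "stall_l1d": 1805, "stall_be_core": 4840,
    "ret": 519031, "ret_int": 519031, "ret_simd": 0,
}

level1 = {
    "bad_spec": c["bad_spec"],
    "frontend": c["stall_fe"],
    "backend": c["stall_be"],
    "retiring": c["ret"],
}

level2 = {
    "frontend": {"icache": c["stall_l1i"], "core": c["stall_fe_core"]},
    "backend": {"dcache": c["stall_l1d"], "core": c["stall_be_core"]},
    "retiring": {"integer": c["ret_int"], "simd": c["ret_simd"]},
}

# Every cycle lands in exactly one level-1 bucket
assert sum(level1.values()) == c["cycles"]
# Level-2 buckets split their level-1 parent without remainder
for parent, children in level2.items():
    assert sum(children.values()) == level1[parent]

share = {name: 100.0 * value / c["cycles"] for name, value in level1.items()}
for name, pct in share.items():
    print(f"{name:9s} {pct:6.2f}%")
```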
./sim/script/get_flamegraph.py examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt
Open the generated interactive flamegraph_clk_cycle.svg in the web browser
./sim/script/get_call_graph.py examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt
Note
Running any of the below commands with -b will open (or host with --host) interactive session in the browser
Get timeline plot
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --timeline --clk
Get stats trace (adjust window sizes as needed)
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --stats_trace --win_size_stats 512 --win_size_hw 64 --clk
Get execution breakdown
./sim/script/run_analysis.py -i examples/dhrystone_dhrystone_out_cosim/inst_profile_clk.json --clk
Get execution histograms
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --pc_hist --add_cache_lines --clk
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --dmem_hist --add_cache_lines --clk
Get execution trace
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --pc_trace --add_cache_lines --clk
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --dmem_trace --add_cache_lines --clk
Optionally, save symbols found in dasm with --save_symbols
Note
Any time ./sim/script/run_analysis.py is invoked with --dasm arg, backannotated dasm will be saved, e.g dhrystone.prof.dasm
Timeline
Stats Trace
Execution breakdown
Execution histograms
Execution trace
Backannotation of disassembly
Adding --print_symbols to any of the commands using -t will print all found symbols to stdout
Symbols found in sim/sw/baremetal/dhrystone/dhrystone.dasm in 'text' section:
0x430EC - 0x433D8: _free_r (0)
0x42FAC - 0x430E8: _malloc_trim_r (0)
0x42F08 - 0x42FA8: strcpy (86.2k)
0x42E7C - 0x42F04: __libc_init_array (54)
0x42E20 - 0x42E78: _sbrk_r (38)
0x42E1C - 0x42E1C: __malloc_unlock (2)
0x42E18 - 0x42E18: __malloc_lock (8)
0x42608 - 0x42E14: _malloc_r (418)
0x425FC - 0x42604: malloc (28)
0x425D4 - 0x425F8: _sbrk (21)
0x42298 - 0x425D0: mini_vpprintf (25.8k)
0x42224 - 0x42294: _puts (0)
0x42018 - 0x42220: mini_pad (955)
0x41EB0 - 0x42014: mini_itoa (2.81k)
0x41E88 - 0x41EAC: mini_strlen (648)
0x41E2C - 0x41E84: trap_handler (0)
0x41E00 - 0x41E28: timer_interrupt_handler (0)
0x41DEC - 0x41DFC: get_cpu_time (12)
0x41D98 - 0x41DE8: mini_printf (1.39k)
0x41D64 - 0x41D94: __puts_uart (26.1k)
0x41D14 - 0x41D60: _write (34.7k)
0x41CFC - 0x41D10: send_byte_uart0 (11.7k)
0x41CD4 - 0x41CF8: time_s (28)
0x41C14 - 0x41CD0: Proc_6 (22.0k)
0x41C08 - 0x41C10: Func_3 (3.00k)
0x41B8C - 0x41C04: Func_2 (34.0k)
0x41B6C - 0x41B88: Func_1 (15.0k)
0x41B08 - 0x41B68: Proc_8 (25.0k)
0x41AF8 - 0x41B04: Proc_7 (18.0k)
0x41478 - 0x41AF4: main (85.6k)
0x41468 - 0x41474: Proc_5 (4.01k)
0x41444 - 0x41464: Proc_4 (9.00k)
0x412F4 - 0x41440: Proc_1 (71.0k)
0x412D0 - 0x412F0: Proc_3 (13.0k)
0x412A8 - 0x412CC: Proc_2 (9.01k)
0x4125C - 0x412A4: __clzsi2 (0)
0x4122C - 0x41258: __modsi3 (0)
0x411F8 - 0x41228: __umodsi3 (354)
0x411B0 - 0x411F4: __hidden___udivsi3 (30.1k)
0x411A8 - 0x411AC: __divsi3 (2.00k)
0x41184 - 0x411A4: __mulsi3 (36)
0x41074 - 0x41180: __floatsisf (0)
0x41004 - 0x41070: __fixsfsi (0)
0x40CB8 - 0x41000: __mulsf3 (0)
0x40944 - 0x40CB4: __divsf3 (0)
0x4037C - 0x40940: __udivdi3 (308)
0x40200 - 0x40378: strcmp (65.0k)
0x400EC - 0x401FC: trap_entry (0)
0x400E8 - 0x400E8: forever (0)
0x400DC - 0x400E4: call_main (4)
0x400CC - 0x400D8: clear_bss_b_loop (0)
0x400C8 - 0x400C8: clear_bss_b_check (3)
0x400B8 - 0x400C4: clear_bss_w_loop (12.0k)
0x40000 - 0x400B4: _start (69)
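The printed listing is easy to post-process, for example to rank symbols by the value in parentheses. The sketch below assumes the line format is exactly as shown above (`0xSTART - 0xEND: name (count)`, with an optional `k` suffix for rounded thousands). The function names are hypothetical and not part of run_analysis.py.

```python
import re

# Line format assumed from the listing above: "0xSTART - 0xEND: name (count)"
SYMBOL_RE = re.compile(
    r"^0x([0-9A-Fa-f]+) - 0x([0-9A-Fa-f]+): (\S+) \(([\d.]+)(k?)\)$")


def parse_symbol_line(line):
    """Parse one line into (start, end, name, count); None if it does not match."""
    m = SYMBOL_RE.match(line.strip())
    if m is None:
        return None
    start, end, name, num, suffix = m.groups()
    # A "k" suffix means the printed value was rounded to thousands,
    # so the recovered count is approximate.
    count = round(float(num) * (1000 if suffix else 1))
    return int(start, 16), int(end, 16), name, count


def top_symbols(lines, n=5):
    """Return the n symbols with the largest count, highest first."""
    parsed = [p for p in map(parse_symbol_line, lines) if p is not None]
    return sorted(parsed, key=lambda p: p[3], reverse=True)[:n]


# A few lines copied from the listing above
sample = """\
0x42F08 - 0x42FA8: strcpy (86.2k)
0x41478 - 0x41AF4: main (85.6k)
0x412F4 - 0x41440: Proc_1 (71.0k)
0x40200 - 0x40378: strcmp (65.0k)
0x41468 - 0x41474: Proc_5 (4.01k)
0x42E7C - 0x42F04: __libc_init_array (54)
""".splitlines()

for start, end, name, count in top_symbols(sample, n=3):
    print(f"{name:10s} 0x{start:05X}-0x{end:05X} ~{count}")
```

For exact counts, the backannotated disassembly is a better source than the rounded values in this listing.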
The script also backannotates the disassembly and saves it as dhrystone.prof.dasm
00040000 <_start>:
12 40000: 00000093 addi x1,x0,0
1 40004: 00000113 addi x2,x0,0
1 40008: 00000193 addi x3,x0,0
1 4000c: 00000213 addi x4,x0,0
1 40010: 00000293 addi x5,x0,0
1 40014: 00000313 addi x6,x0,0
...
000400b8 <clear_bss_w_loop>:
4070 400b8: 00052023 sw x0,0(x10)
2654 400bc: 00450513 addi x10,x10,4
2655 400c0: fff68693 addi x13,x13,-1
2654 400c4: fe069ae3 bne x13,x0,400b8 <clear_bss_w_loop>
...
00041468 <Proc_5>:
1006 41468: 04100693 addi x13,x0,65
1000 4146c: 82d186a3 sb x13,-2003(x3) # 44145 <Ch_1_Glob>
1000 41470: 8201a823 sw x0,-2000(x3) # 44148 <Bool_Glob>
1000 41474: 00008067 jalr x0,0(x1)
...
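The per-instruction counts in the backannotated file can be summed per function, which gives exact totals to compare against the rounded values in the symbol listing. This is a sketch under assumptions: it expects function headers of the form `00040000 <name>:` and instruction lines that start with a decimal count, as in the excerpt above. How the real file lays out other lines is not shown here, so anything that does not match is skipped. `counts_per_function` is a hypothetical name.

```python
import re
from collections import defaultdict

# Formats assumed from the excerpt above
HEADER_RE = re.compile(r"^([0-9a-f]{8}) <(.+)>:$")
INST_RE = re.compile(r"^(\d+)\s+([0-9a-f]+):\s+([0-9a-f]{8})\s+(.*)$")


def counts_per_function(lines):
    """Sum the leading count of every annotated instruction, grouped by function."""
    totals = defaultdict(int)
    current = None
    for raw in lines:
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = header.group(2)
            continue
        inst = INST_RE.match(line)
        if inst and current is not None:
            totals[current] += int(inst.group(1))
        # Any other line (e.g. "...") is ignored
    return dict(totals)


# Lines copied from the excerpt above
sample = """\
000400b8 <clear_bss_w_loop>:
4070 400b8: 00052023 sw x0,0(x10)
2654 400bc: 00450513 addi x10,x10,4
2655 400c0: fff68693 addi x13,x13,-1
2654 400c4: fe069ae3 bne x13,x0,400b8 <clear_bss_w_loop>
...
00041468 <Proc_5>:
1006 41468: 04100693 addi x13,x0,65
1000 4146c: 82d186a3 sb x13,-2003(x3) # 44145 <Ch_1_Glob>
1000 41470: 8201a823 sw x0,-2000(x3) # 44148 <Bool_Glob>
1000 41474: 00008067 jalr x0,0(x1)
""".splitlines()

for name, total in counts_per_function(sample).items():
    print(f"{name}: {total}")
```

On this excerpt the totals are 12033 for clear_bss_w_loop and 4006 for Proc_5, in line with the 12.0k and 4.01k shown in the symbol listing.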
Same as with the ISA sim, except that -c/--corr can now be passed in to correlate the estimates against the RTL results
Run with positional arguments as
./sim/script/perf_est_v2.py \
sim/examples/dhrystone_dhrystone_out/inst_profile.json \
sim/examples/dhrystone_dhrystone_out/hw_stats.json \
sim/examples/dhrystone_dhrystone_out/rf_trace.bin \
-c examples/dhrystone_dhrystone_out_cosim/hw_stats.json
Performance estimate breakdown for:
sim/examples/dhrystone_dhrystone_out/inst_profile.json
sim/examples/dhrystone_dhrystone_out/hw_stats.json
<home_path>/sim/script/hw_perf_metrics_v2.yaml
sim/examples/dhrystone_dhrystone_out/rf_trace.bin
Peak Stack usage: 352 bytes
Instructions executed: 460.0k
icache (32 sets, 2 ways, 4096B data): References: 460.3k, Hits: 459.9k, Misses: 367, Hit Rate: 99.92%, MPKI: 0.80
DMEM inst: 155.3k - L/S: 84.2k/71.1k (33.77% instructions)
dcache (16 sets, 4 ways, 4096B data): References: 152.0k, Hits: 151.8k, Misses: 209, Writebacks: 145, Hit Rate: 99.86%, MPKI: 0.45
Branch inst: 50487 (10.98% instructions)
bpred (combined): Predicted: 50.1k, Mispredicted: 339, Accuracy: 99.33%, MPKI: 0.74
DIV/REM inst: 1496 (0.33% instructions)
divider (16B): Cache: 1.10k (73.53%), Special: 344 (22.99%), Common: 52 (3.48%), 489 b, 9.4 b/d
Pipeline stalls (max):
Bad spec: 678
FE bound: 39.8k - ICache: 2.20k (AMAT: 1.0), Core: 37.6k
BE bound: 10.7k - DCache: 1.69k (AMAT: 1.01), Core: 8.97k (SIMD 0, DIV 2.43k, Load 6.53k)
Estimated HW performance at 100MHz:
Best: 500.5k cycles (5.00ms), IPC: 0.919; BW (avg MB/s) - icache: 350.9, dcache (R/W): 105.7 (55.9/49.8), mem (R/W): 8.8 (7.0/1.8)
Expected: 511.1k cycles (5.11ms), IPC: 0.900; BW (avg MB/s) - icache: 343.5, dcache (R/W): 103.5 (54.7/48.8), mem (R/W): 8.6 (6.9/1.7)
Estimated Cycles range: 10.7k cycles, midpoint: 505.8k, ratio: 2.11%
Correlation:
metric est rtl diff diff%
cycles 511140 511200 -60 -0.012
empty 51159 51224 -65 -0.127
stalls 50481 50470 11 0.022
lost 678 754 -76 -11.209
lost_other 0 22 -22 -100.000
bad_spec 678 732 -54 -7.965
stall_be 10655 10781 -126 -1.183
stall_l1d 1689 1835 -146 -8.644
stall_be_core 8966 8946 20 0.223
stall_fe 39826 39689 137 0.344
stall_l1i 2202 2062 140 6.358
stall_fe_core 37624 37627 -3 -0.008
ret 459976 459976 0 0.000
ret_simd 0 0 0 0.000
ret_int 459976 459976 0 0.000
ret_ctrl_flow_br 50487 50487 0 0.000
bp_miss 339 366 -27 -7.965
l1i_ref 460315 460673 -358 -0.078
l1i_miss 367 402 -35 -9.537
l1d_ref 151974 151974 0 0.000
l1d_miss 209 209 0 0.000
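The diff and diff% columns can be reproduced from the est and rtl columns. The printed numbers are consistent with diff = est - rtl and diff% = 100 * diff / est. This was inferred from the table above rather than taken from perf_est_v2.py, so treat the sketch below as an assumption about the convention, including the special case for an estimate of zero. The function name `correlate` is hypothetical.

```python
def correlate(est, rtl):
    """Return (diff, diff_pct) for one metric, with diff = est - rtl."""
    diff = est - rtl
    if est != 0:
        # Percentage relative to the estimate, matching the table above
        diff_pct = 100.0 * diff / est
    elif rtl == 0:
        diff_pct = 0.0
    else:
        # Assumption: a zero estimate against a nonzero RTL value is
        # reported as -100%, as in the lost_other row
        diff_pct = -100.0
    return diff, diff_pct


# (est, rtl) pairs copied from the correlation table above
rows = {
    "cycles": (511140, 511200),
    "bad_spec": (678, 732),
    "lost_other": (0, 22),
    "stall_l1i": (2202, 2062),
    "l1d_miss": (209, 209),
}

for metric, (est, rtl) in rows.items():
    diff, pct = correlate(est, rtl)
    print(f"{metric:12s} {est:8d} {rtl:8d} {diff:6d} {pct:9.3f}")
```

Small absolute differences on small counters, such as bad_spec, produce large percentages, so the diff column is often the more useful one for rarely occurring events.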
The correlation is a useful check on how much confidence can be placed in the functional models. Since the estimation flow offers a much faster turnaround time, it is a tempting target for rapid exploration of either workload changes or microarchitectural tweaks.
Cycle estimates were compared against RTL cycle counts using only the timed workload regions, that is, excluding benchmark harness overhead such as setup, warmup, and UART printing.
| Metric | Value |
|---|---|
| Mean signed error | -0.33% |
| Std dev signed error | 1.32% |
| Mean absolute error | 0.61% |
| Median absolute error | 0.17% |
| Worst absolute error | 6.13% |
| ≤ 1% | 22/28 (79%) |
| ≤ 3% | 27/28 (96%) |
| ≤ 5% | 27/28 (96%) |
| ≤ 10% | 28/28 (100%) |
| > 10% | 0/28 (0%) |
| Metric | Value |
|---|---|
| Mean signed error | -0.13% |
| Std dev signed error | 0.28% |
| Mean absolute error | 0.15% |
| Median absolute error | 0.01% |
| Worst absolute error | 0.69% |
| ≤ 1% | 13/13 (100%) |
| ≤ 3% | 13/13 (100%) |
| ≤ 5% | 13/13 (100%) |
| ≤ 10% | 13/13 (100%) |
| > 10% | 0/13 (0%) |
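Summary statistics of this kind can be computed from a list of (estimated, RTL) cycle counts per workload. The sketch below is illustrative only. The sample pairs are made up and are not project results, the error is taken relative to the estimate (an assumption), and the population standard deviation is used (also an assumption). `error_summary` is a hypothetical name.

```python
import statistics


def error_summary(pairs, buckets=(1, 3, 5, 10)):
    """Summarize signed and absolute cycle-count errors in percent.

    pairs: iterable of (estimated_cycles, rtl_cycles).
    """
    # Assumption: error is relative to the estimate
    signed = [100.0 * (est - rtl) / est for est, rtl in pairs]
    absolute = [abs(e) for e in signed]
    summary = {
        "mean_signed": statistics.mean(signed),
        "std_signed": statistics.pstdev(signed),  # population std dev (assumption)
        "mean_abs": statistics.mean(absolute),
        "median_abs": statistics.median(absolute),
        "worst_abs": max(absolute),
    }
    # Cumulative buckets: number of workloads with absolute error <= limit
    for limit in buckets:
        summary[f"within_{limit}pct"] = sum(a <= limit for a in absolute)
    summary[f"over_{buckets[-1]}pct"] = sum(a > buckets[-1] for a in absolute)
    return summary


# Hypothetical (est, rtl) pairs, for illustration only
pairs = [(1000, 1010), (2000, 1980), (500, 520), (400, 400)]

n = len(pairs)
for key, value in error_summary(pairs).items():
    if key.startswith(("within", "over")):
        print(f"{key}: {value}/{n}")
    else:
        print(f"{key}: {value:.2f}%")
```

The buckets are cumulative, so a workload within 1% is also counted in the 3%, 5%, and 10% rows.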





















