AleksandarLilic/ama-riscv

RISC-V core

SystemVerilog implementation of RISC-V RV32IM & custom packed SIMD ISA as a 5-stage scalar M-mode core, with L1 caches and branch predictors

Getting the project

Project relies on a few external libraries and tools. Clone recursively with

git clone --recurse-submodules git@github.com:AleksandarLilic/ama-riscv.git

Prerequisites

Submodules are pulled automatically with --recurse-submodules

  • Vivado (tested on 2023.2)
  • GCC >= 10 (C++17, gnu++17)
  • Make
  • Python
  • Prerequisites of ./sim submodule

Quick start

To check that everything is available and working as expected:

  1. build RTL & testbench
  2. build cosim
  3. run the test and check the logs
# set up environment variables
source setup.sh
# build test
cd sim/sw/baremetal/asm_rv32i
make
cd -
# build and run RTL
run -t sim/sw/baremetal/asm_rv32i/test.hex -v VERBOSE
# check the execution log
vim testrun_<run_date>/asm_rv32i_test/test.log

Outputs:

ls testrun_<run_date>/asm_rv32i_test
asm_rv32i_test_out_cosim  build.log  cosim  Makefile  Makefile.inc  make_run_default.log  run.sh  test.log  test.status  work.ama_riscv_tb.wdb  xsim.dir

Cosim outputs:

ls testrun_<run_date>/asm_rv32i_test/asm_rv32i_test_out_cosim/
asm_rv32i_test.kanata.log  callstack_folded_cycle_cosim.txt callstack_folded_inst_cosim.txt hw_stats.json  inst_profile_clk.json  trace_clk.bin  uart.log

The second option is to use a testlist instead of manually specifying each test. Optionally, the testlist can be filtered (e.g. with the "simple" group filter), and a rundir and a timeout in number of clock cycles can be added

run --testlist ../testlist.yaml -r testrun_demo -c 1000000 -f "simple"

Add -k to keep the build and -p to keep already passed tests (together with the same --rundir). The example below also reduces the expected runtime/timeout and runs the official RISC-V ISA tests

run --testlist ../testlist.yaml -r testrun_demo -c 5000 -f "riscv_isa_rv32i" -kp

Running -v VERBOSE should be done with care since it will slow down execution. For short tests, the difference is insignificant, while for longer ones it can add minutes to the simulation time, and create a large log.
Similarly, running with -testplusarg enable_konata and -testplusarg prof_trace can create large kanata and profiler trace logs.

Microarchitecture

High-level description

Core supports RV32IM_zicsr_zifencei_zicntr_zihpm_xsimd (rv drom) or any legal subset (gcc -march options)

Core microarchitecture follows a fairly standard 5-stage single issue RISC-style pipeline:

  • FET:
    • Core selects PC source.
    • Icache does address generation and tag lookup.
    • Branch predictor logically sits here.
  • FET_DEC pipe:
    • Icache responds with one instruction.
    • Next PC is prepared.
  • DEC:
    • Core decodes the instruction, generates immediate-based next PC for conditional direct branches (b*) and unconditional direct branches (jal).
    • Pipeline information is sent to the branch predictor; Next PC is predicted and frontend redirected if needed.
    • Source registers are either read, or bypassed from the backend. Core progresses even if not all operands are available.
  • DEC_EXE pipe:
    • Control signals and (all available) operands are propagated to EXE stage.
  • EXE:
    • Core executes arithmetic, logic, and control flow instructions in this one cycle.
    • RV32 multiplier and SIMD finish first (out of two) execution cycle.
    • RV32 restoring binary divider starts, takes 2 to 34 cycles to finish (data dependent).
    • SIMD data formatting unit executes.
    • First part of Dcache address generation is done.
    • If source operand(s) are available neither through register read nor in the bypass network, the machine stalls.
  • EXE_MEM pipe:
    • Available results and relevant control signals are propagated to MEM stage.
    • RV32 multiplier and SIMD unit propagate first stage results.
    • Branch results are propagated to MEM stage.
  • MEM:
    • RV32 multiplier and SIMD unit finish second execution cycle.
      • SIMD unit takes in late_c in case it's a dot product instruction either from EXE_MEM pipe or bypass network, so no penalty is incurred on back-to-back dots to the same accumulator.
    • Resolution of conditional direct branches (b*) and unconditional indirect branches (jalr) is available.
    • On branch miss or jalr instruction, frontend is redirected, taking 2 cycle miss penalty (b*) or 2 cycle stall (jalr).
    • Dcache does second part of the address generation, and tag lookup.
  • MEM_WBK pipe:
    • Available results and relevant control signals are propagated to WBK stage.
    • RV32 multiplier and SIMD results are propagated to WBK stage.
    • Dcache loads or stores data.
  • WBK:
    • Result from RV32 multiplier and SIMD unit is ready.
    • Dcache returns loaded data.
  • RETIRE pipe:
    • Instruction retires, optionally writes to RF.
    • Performance events about this cycle are collected.

Default parameters are primarily set in src/ama_riscv_defines.svh and src/ama_riscv_types.svh, with a few others set in the sources.f files, or directly set from command line, like during synthesis.

SIMD ISA can be completely disabled by setting --define CPU_SIMD_EN=0.
Multiplier (RV32 M-extension) can still use the same Baugh-Wooley (BW) tree with --define CPU_MULT_USE_BW=1, or a plain 16x16 partial products multiplier with --define CPU_MULT_USE_BW=0.

Icache

  • 4KB, 32-set, 2-way, 64B lines, 128-bit bus, LRU replacement
  • Parametrizable for number of sets and ways, with one bank per way
  • 1 clock cycle on hit, 7 on miss
  • Configuration driven by sweeps via hw_model_sweep.py and cache config; results are under examples/hw_sweeps
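
The geometry above fixes how an address splits into tag, set index, and line offset. A minimal Python sketch, assuming 32-bit addresses and a conventional tag | index | offset layout (the RTL may slice the address differently); `cache_fields` and `cache_size_bytes` are hypothetical helpers, not part of the repository:

```python
def cache_size_bytes(sets, ways, line_bytes):
    """Total data capacity of a set-associative cache."""
    return sets * ways * line_bytes


def cache_fields(addr, sets=32, line_bytes=64, addr_bits=32):
    """Split an address into (tag, set index, line offset).

    Defaults follow the Icache numbers above (32 sets, 64B lines).
    Assumes power-of-two sizes and a tag | index | offset layout.
    """
    offset_bits = line_bytes.bit_length() - 1   # 64B line -> 6 bits
    index_bits = sets.bit_length() - 1          # 32 sets  -> 5 bits
    addr &= (1 << addr_bits) - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(cache_size_bytes(32, 2, 64))   # 4096 bytes, i.e. 4KB
print(cache_fields(0x40010))         # (128, 0, 16)
```

With 64B lines and a 128-bit (16B) bus, a line fill takes 64 / 16 = 4 bus transfers.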

Branch predictor

Follows the combined predictor from McFarling's "Combining Branch Predictors"

  • Combined predictor: bimodal + global, with a meta predictor
  • No penalty on hit, 2 cycle penalty on miss, predicts back-to-back branches
  • Parametrizable - options: static, bimodal, global, gshare, gselect, combined
  • Chosen through sweeps + config constraints: due to timing considerations, no PHT is bigger than 2^8 entries, i.e. no more than 8 bits are used for indexing
  • Configuration driven by sweeps via hw_model_sweep.py and bp config (see sim sweeps); results are under examples/hw_sweeps
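
As an illustration of the combined scheme, here is a behavioral Python sketch of a bimodal + global predictor with a meta chooser. It is not a model of the RTL: the table sizes (8 index bits for every table), the PC index slice, the initial counter values, and the update policy are assumptions chosen for the example.

```python
def bump(counter, up):
    """Saturating 2-bit counter update."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class CombinedPredictor:
    """Bimodal + global predictor with a meta chooser (2-bit counters)."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        size = 1 << index_bits
        self.bimodal = [1] * size   # indexed by PC, starts weakly not-taken
        self.glob = [1] * size      # indexed by global history
        self.meta = [1] * size      # >= 2 selects global, else bimodal
        self.ghr = 0                # global history register

    def _pc_index(self, pc):
        return (pc >> 2) & self.mask   # drop the 4-byte instruction offset

    def predict(self, pc):
        i = self._pc_index(pc)
        p_bim = self.bimodal[i] >= 2
        p_glb = self.glob[self.ghr] >= 2
        return p_glb if self.meta[i] >= 2 else p_bim

    def update(self, pc, taken):
        i, g = self._pc_index(pc), self.ghr
        p_bim = self.bimodal[i] >= 2
        p_glb = self.glob[g] >= 2
        if p_bim != p_glb:             # train the chooser only on disagreement
            self.meta[i] = bump(self.meta[i], p_glb == taken)
        self.bimodal[i] = bump(self.bimodal[i], taken)
        self.glob[g] = bump(self.glob[g], taken)
        self.ghr = ((g << 1) | int(taken)) & self.mask


# An alternating branch defeats the bimodal table but not the global one.
bp = CombinedPredictor()
hits = 0
for n in range(200):
    taken = (n % 2 == 0)               # T, N, T, N, ...
    if n >= 100:                       # score only after warm-up
        hits += bp.predict(0x40010) == taken
    bp.update(0x40010, taken)
print(f"correct predictions in the last 100 branches: {hits}")
```

The chooser is what lets the combined predictor follow whichever component is doing better for a given branch.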

Register File

3R2W design, banked by default

  • Address of a 2nd write port is rd_addr + 1
  • Address of rs3 is rd (used only for dot* instructions)
  • Optionally banked as odd/even such that rd and rdp paired writes always land in a different bank

RV32 integer multiplier & SIMD unit

2-cycle pipelined unit (EXE + MEM), shared across rv32 mul* and packed SIMD arithmetic instructions

  • Single 32x32 Baugh-Wooley partial product array reduced through a CSA tree, reused across all element widths via diagonal lane masks: the full products are computed once, and the narrower (or accumulated) results are picked at the output mux
  • First stage builds and reduces the partial products, second stage finishes reduction and does the final add and result select
  • For dot* instructions, rs3 is passed or forwarded as late_c in SIMD unit second stage, so back-to-back dots to the same accumulator don't cause stalls

SIMD data formatting unit and segmented shifter

Single cycle unit, mostly made up of wiring and muxes for moving vector elements around.
Segmented reversible shifter is primarily used for shift (slli*/srli*/srai*) instructions. The reversible shifter also doubles as a second stage for the widen* class of instructions for the first word, while a second left-only segmented shifter is used for the second word.

RV32 restoring binary divider

32-bit restoring binary divider with clz-based normalization and a one-entry result cache

  • Common case: 3 + (cnt_b - cnt_a + 1) cycles, where cnt_a = clz(|dividend|) and cnt_b = clz(|divisor|): 1 cycle start, 1 cycle setup, cnt_b - cnt_a + 1 iteration cycles (1 per bit of offset between dividend and divisor), 1 cycle fixup
  • Special case: 2 cycles (1 cycle start and 1 cycle setup)
  • Cache hit: 1 cycle (combinational hit + flop the output)
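
The common-case latency can be worked through with a short Python sketch. Which operand combinations take the special-case path is not spelled out above, so the sketch only evaluates the common-case formula and assumes a non-zero divisor with |dividend| >= |divisor|; `clz32` and `div_cycles_common` are hypothetical helper names.

```python
def clz32(x):
    """Count leading zeros of a 32-bit unsigned value."""
    assert 0 <= x < (1 << 32)
    return 32 - x.bit_length()


def div_cycles_common(dividend, divisor):
    """Common-case latency: 3 + (cnt_b - cnt_a + 1).

    Assumes divisor != 0 and |dividend| >= |divisor|; other operand
    combinations may be handled as special cases by the hardware.
    """
    cnt_a = clz32(abs(dividend))
    cnt_b = clz32(abs(divisor))
    return 3 + (cnt_b - cnt_a + 1)


print(div_cycles_common(100, 7))     # clz: 25 and 29 -> 3 + 5 = 8
print(div_cycles_common(-1000, 3))   # clz: 22 and 30 -> 3 + 9 = 12
```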

Dcache

  • 4KB, 16-set, 4-way, 64B lines, 128-bit bus, LRU replacement with writeback
  • Parametrizable for number of sets and ways, with one bank per way
  • 1 clock cycle on hit, 7 on miss, 10 on miss with writeback; 1 cycle penalty on load to use
  • Configuration driven by sweeps, same flow and config as Icache (results are under examples/hw_sweeps)

Main memory

  • Caches are backed by a single main memory, with parametrizable size
  • True dual port, 128-bit bus, 16B words
  • $readmemh is used to load in the workload's .mem file
    • For verification, ama_riscv_tb writes the file
    • For FPGA emulation, synthesis writes the initial file, later patched as needed (check FPGA emulation for details).

Hierarchy

ama_riscv_top               # Design top
├── ama_riscv_core_top      # Core & caches wrapper
│   ├── ama_riscv_core      # CPU pipeline
│   │   ├── ...             # All core modules
│   ├── ama_riscv_icache    # L1 instruction cache
│   │   └── mem             # memory banks
│   └── ama_riscv_dcache    # L1 data cache
│       └── mem             # memory banks
├── ama_riscv_mem           # Unified backing memory
└── ama_riscv_uart          # Memory-mapped UART
    └── uart                # TX & RX subsystem
        ├── ...             # TX & RX PHYs

For verification, ama_riscv_top is instantiated as DUT in the ama_riscv_tb testbench.
For FPGA emulation, ama_riscv_top is instantiated in the ama_riscv_fpga FPGA wrapper.

Measured Performance - FPGA emulation

Emulation is run at 50MHz on the Arty A7-100T board. Since the design uses a single clock domain, scaling the clock frequency by X scales the absolute results by the same X, thus keeping the 'per MHz' results the same.

  • Dhrystone: 81 DMIPS, 1.63 DMIPS/MHz (IPC: 0.91)
  • Coremark: 145 Coremarks, 2.9 Coremarks/MHz (IPC: 0.89)
  • STREAM-INT:
    • Copy: 66 MB/s
    • Scale: 50 MB/s
    • Add: 64 MB/s
    • Triad: 57 MB/s
  • Embench_1.0 compiled for speed (with detailed breakdown):
    • Size: 9.47 (4.32 - 20.76)
    • Speed: 51.55 (34.02 - 78.11)
    • Speed/MHz: 1.03 (0.68 - 1.56)

SIMD ISA improvements on MLP, measured in inferences per second

Flavor   RV32IM [inf/s]   RV32IM_Xsimd [inf/s]   Improvement
w8a8     238              1968                   8.3x
w4a8     228              1984                   8.7x
w2a8     293              2083                   7.1x

Plots and data

All TDA and counter plots are available under examples/perf_runs_fpga

With CSV summaries under the same directory

All breakdowns are available under examples/perf_runs_fpga/all_stats

Verification

There are two supported mechanisms for functional verification:

  1. Self-checking C/ASM tests use tohost CSR LSB where testbench waits for tohost[0] == 1 to end the simulation. In case of a failing test, tohost[31:1] reports the failed ID
  2. Cosim - instruction set simulator ama-riscv-sim tied up over DPI and single-stepped from testbench after each retired instruction in RTL. Checkers are set up on the architectural state (32 GP registers, and PC)
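
The tohost convention in the first mechanism can be decoded as in the Python sketch below. The bit layout follows the description above; treating a zero ID as a pass is an assumption borrowed from the usual riscv-tests convention, and `decode_tohost` and `tohost_status` are hypothetical names, not testbench code.

```python
def decode_tohost(value):
    """Return (done, failed_id) from a 32-bit tohost value."""
    value &= 0xFFFFFFFF
    done = bool(value & 1)      # tohost[0] == 1 ends the simulation
    failed_id = value >> 1      # tohost[31:1] carries the failed ID
    return done, failed_id


def tohost_status(value):
    done, failed_id = decode_tohost(value)
    if not done:
        return "running"
    # Assumption: ID 0 means no test failed.
    return "pass" if failed_id == 0 else f"fail (ID {failed_id})"


print(tohost_status(0x0))             # running
print(tohost_status(0x1))             # pass
print(tohost_status((5 << 1) | 1))    # fail (ID 5)
```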

On the RTL level, one or both methods can be used. Since cosim relies on probing internal signals, cosim is unsupported out of the box for GLS.

As shown above, running tests is done through the ./run_test.py script (aliased to run in setup.sh). Full usage is available in examples/run.help

Environment

FPGA emulation

Design is synthesized with conservative 50MHz constraint for the emulation purposes, targeting xc7a100tcsg324-1 part on Arty A7-100T board.

Defines used for synthesis:

set_property verilog_define { SYNT FPGA FPGA_HEX_PATH=<workdir>/ama-riscv/sim/sw/baremetal/uart_direct_send/hello_world.mem } [current_fileset]

Main memory is later patched with workload(s) of interest via script/update_mem.sh and loaded into the FPGA with script/flash_bit.tcl.

Utilization overview:

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs                 | 12756 |     0 |          0 |     63400 | 20.12 |
|   LUT as Logic             | 12624 |     0 |          0 |     63400 | 19.91 |
|   LUT as Memory            |   132 |     0 |          0 |     19000 |  0.69 |
|     LUT as Distributed RAM |   132 |     0 |            |           |       |
|     LUT as Shift Register  |     0 |     0 |            |           |       |
| Slice Registers            |  5013 |     0 |          0 |    126800 |  3.95 |
|   Register as Flip Flop    |  5013 |     0 |          0 |    126800 |  3.95 |
|   Register as Latch        |     0 |     0 |          0 |    126800 |  0.00 |
| F7 Muxes                   |   399 |     0 |          0 |     31700 |  1.26 |
| F8 Muxes                   |   150 |     0 |          0 |     15850 |  0.95 |
+----------------------------+-------+-------+------------+-----------+-------+

First three logic levels, with percentage contribution compared to part's total resource availability

+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+
|         Instance         |         Module         |   Total LUTs  |   Logic LUTs  |   LUTRAMs  |     FFs     |   RAMB36   |  RAMB18  | DSP Blocks |
+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+
| ama_riscv_fpga           |                  (top) | 12756(20.12%) | 12624(19.91%) | 132(0.69%) | 5013(3.95%) | 44(32.59%) | 0(0.00%) |   0(0.00%) |
|   (ama_riscv_fpga)       |                  (top) |      7(0.01%) |      7(0.01%) |   0(0.00%) |   28(0.02%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|   ama_riscv_top_i        |          ama_riscv_top | 12748(20.11%) | 12616(19.90%) | 132(0.69%) | 4985(3.93%) | 44(32.59%) | 0(0.00%) |   0(0.00%) |
|     ama_riscv_core_top_i |     ama_riscv_core_top | 12646(19.95%) | 12514(19.74%) | 132(0.69%) | 4891(3.86%) |  12(8.89%) | 0(0.00%) |   0(0.00%) |
|       ama_riscv_core_i   |         ama_riscv_core | 10380(16.37%) | 10248(16.16%) | 132(0.69%) | 3558(2.81%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|       ama_riscv_dcache_i |       ama_riscv_dcache |   1775(2.80%) |   1775(2.80%) |   0(0.00%) |  789(0.62%) |   8(5.93%) | 0(0.00%) |   0(0.00%) |
|       ama_riscv_icache_i |       ama_riscv_icache |    491(0.77%) |    491(0.77%) |   0(0.00%) |  544(0.43%) |   4(2.96%) | 0(0.00%) |   0(0.00%) |
|     ama_riscv_mem_i      |          ama_riscv_mem |     31(0.05%) |     31(0.05%) |   0(0.00%) |    2(0.01%) | 32(23.70%) | 0(0.00%) |   0(0.00%) |
|       (ama_riscv_mem_i)  |          ama_riscv_mem |      8(0.01%) |      8(0.01%) |   0(0.00%) |    2(0.01%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|       u_mem              |      xpm_memory_tdpram |     24(0.04%) |     24(0.04%) |   0(0.00%) |    0(0.00%) | 32(23.70%) | 0(0.00%) |   0(0.00%) |
|     ama_riscv_uart_i     |         ama_riscv_uart |     74(0.12%) |     74(0.12%) |   0(0.00%) |   92(0.07%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|       (ama_riscv_uart_i) |         ama_riscv_uart |     20(0.03%) |     20(0.03%) |   0(0.00%) |   44(0.03%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|       uart_i             |                   uart |     54(0.09%) |     54(0.09%) |   0(0.00%) |   48(0.04%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
|   fpga_clk_gen_i         | ama_riscv_fpga_clk_gen |      1(0.01%) |      1(0.01%) |   0(0.00%) |    0(0.00%) |   0(0.00%) | 0(0.00%) |   0(0.00%) |
+--------------------------+------------------------+---------------+---------------+------------+-------------+------------+----------+------------+

Detailed utilization reports are available under examples/perf_runs_fpga/fpga_synt_reports

Synthesis flow

FPGA synthesis and implementation flow uses Vivado in batch mode, with fpga/synt.tcl for the underlying recipe, and fpga/run_synt.py as the multi-run driver. The run_synt.py driver is the entry point, and the only mandatory argument is the .yaml config file, e.g.

mkdir workdir_synt && cd workdir_synt # useful to work in a separate, gitignored, area
../fpga/run_synt.py --config ../fpga/configs/simd.yaml

Which will kick off the run, with otherwise default settings:

launching 1 run(s), up to 4 parallel (auto threads each):
    simd_50_flat_rebuilt -> /home/alek/dev/ama-riscv/workdir_synt/synt_simd_50_flat_rebuilt

The yaml config is set up such that there is a provided baseline at fpga/configs/_base.yaml, which each of the run configs (simd.yaml in the above case) first extends, and then either adds new or overrides existing parameters. run_synt.py therefore first merges the two (or more, since multiple extends can be chained) yaml config files together, and then produces config.resolved.yaml under the rundir, along with params.tcl, which holds those same parameters in tcl format for synt.tcl. Once the config files are ready, it kicks off the synthesis.
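
The extend-then-override resolution can be pictured with the Python sketch below. It works on plain dicts instead of yaml files to stay self-contained. The `extends` key name, the parameter names, and the choice of a recursive (deep) merge are all assumptions made for illustration; the actual behavior is defined by run_synt.py.

```python
def deep_merge(base, override):
    """Return base updated with override, merging nested dicts."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out


def resolve(name, configs):
    """Follow the (hypothetical) 'extends' chain and merge parent first."""
    cfg = dict(configs[name])
    parent = cfg.pop("extends", None)
    if parent is None:
        return cfg
    return deep_merge(resolve(parent, configs), cfg)


# Hypothetical config contents, standing in for the yaml files.
configs = {
    "_base": {"clk_mhz": 50, "defines": {"CPU_SIMD_EN": 0}},
    "simd": {"extends": "_base", "defines": {"CPU_SIMD_EN": 1}},
    "simd_hier": {"extends": "simd", "flatten": "none"},
}

print(resolve("simd_hier", configs))
```

Each level only states what differs from its parent, so the resolved config is the baseline plus the accumulated overrides along the chain.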

Running with --dry_run first is useful to confirm that all settings are as expected. A dry run will only print the rundir, and write config.resolved.yaml and params.tcl under the rundir (for each of the configs).

Multiple config files can be specified together, where each config would create a separate job:

../fpga/run_synt.py --config ../fpga/configs/simd.yaml ../fpga/configs/simd_hier.yaml

When running multiple jobs in parallel, it's good to be mindful of the host resources. The synthesis phase is limited to 4 threads by Vivado (v2023) itself, while implementation can use more threads. Therefore, running multiple jobs in parallel can quickly throttle the host system. Using -j/--jobs will limit the number of parallel runs, e.g. running with -j 1 will fully serialize the jobs. The default is currently set to 4 parallel jobs. It's also possible to limit the number of threads each job will use with --threads. This directly changes Vivado's own general.maxThreads parameter. The default is currently set to 0 which means it won't be set, and will instead let Vivado use its auto-detection.

Vivado directive options file is provided as a quick reference guide for the synt/impl adjustments.

Flow will create two checkpoints along the way, as post_synth.dcp and routed.dcp, which can then be opened up for further work, either in GUI mode with vivado routed.dcp & or in TCL mode with vivado -mode tcl and then open_checkpoint routed.dcp in Vivado's TCL shell.

Analysis example use-case: Dhrystone

Note

Tests, profiling, and logging are heavily reused from ama-riscv-sim and therefore only differences introduced in the RTL environment will be covered here. Otherwise all of the functionality carries over.

Run the same Dhrystone build as in the ISA sim example

run -t sim/sw/baremetal/dhrystone/dhrystone.elf

In all places where ISA sim previously counted instructions, profilers now count cycles

Execution log

On the default verbosity, execution is not logged, and only run stats are printed

...
=== UART END === 

Test ran to completion
Checker 1/2 - 'tohost': ENABLED: PASS
Checker 2/2 - 'cosim' : ENABLED: PASS
==== PASS ====
Warnings:  1
Errors:    0   

DUT instruction count: 521118
Core stats: 
    Cycles: 609603, Inst: 521118, Stalls: 88485, CPI: 1.170 (IPC: 0.855)

$finish called at time : 6096050100 ps : File "<home>/ama-riscv/verif/direct_tb/ama_riscv_tb.sv" Line 885 
Simulation cycles: 609603
Stats - Profiling Summary:
core (TDA counters)
    Cycles: 609603, Inst: 519031, Stalls: 90572, CPI: 1.175 (IPC: 0.851)
    TDA:
        L1: Bad Spec: 11692, FE: 72235, BE: 6645, Retired: 519031
        L2: FE Mem: 25775, FE Core: 46460, BE Mem: 1805, BE Core: 4840, INT: 519031, SIMD: 0
bpred
    P: 58138, M: 5847, ACC: 90.86%, MPKI: 11.27
icache
    Ref: 530724, H: 525552(525552/0), M: 5172(5172/0), R: 0, HR: 99.03%; CT (R/W): core 2.0/0.0 MB, mem 323.2/0.0 KB
dcache
    Ref: 162719, H: 162513(86168/76345), M: 206(29/177), R: 0, WB: 142, HR: 99.87%; CT (R/W): core 297/279 KB, mem 12.9/8.9 KB
core (all counters)
    Control Flow: 108436 - J: 21746, JR: 22705, BR: 63985
    Memory: 166065 - Load: 87869, Store: 78196
    SIMD: 0 - Arith: 0, Data Format: 0
    Stall - SIMD: 0, Load: 4840
    icache - A: 530723, M: 5172, SM (G/B): 2061(1027/1034), AMAT: 1.05
    dcache - A: 162718, M: 206, WB: 142, AMAT: 1.01

run: Time (s): cpu = 00:00:00.18 ; elapsed = 00:02:17 . Memory (MB): peak = 1326.262 ; gain = 8.004 ; free physical = 7026 ; free virtual = 27924
## puts "Simulation runtime: [expr {[clock seconds] - $start}]s"
Simulation runtime: 138s

Running with -v VERBOSE adds per cycle logs

       21 ns: INFO: Reset released
       30 ns: VERBOSE: Core empty cycle (stall: backend)
       40 ns: VERBOSE: Core empty cycle (lost: other)
       50 ns: VERBOSE: Core empty cycle (lost: other)
       60 ns: VERBOSE: Core empty cycle (lost: other)
       70 ns: VERBOSE: Core empty cycle (lost: other)
       80 ns: VERBOSE: Core empty cycle (stall: frontend)
       90 ns: VERBOSE: Core empty cycle (stall: frontend)
      100 ns: VERBOSE: Core empty cycle (stall: frontend)
      110 ns: VERBOSE: Core empty cycle (stall: frontend)
      120 ns: VERBOSE: Core empty cycle (stall: frontend)
      130 ns: VERBOSE: Core empty cycle (stall: frontend)
      140 ns: VERBOSE: Core [R] 40000: 00000093
      140 ns: VERBOSE: COSIM    40000: 00000093 addi x1,x0,0                  x1 : 0x00000000  
      141 ns: VERBOSE: First write to x1. Checker activated
      150 ns: VERBOSE: Core [R] 40004: 00000113
      150 ns: VERBOSE: COSIM    40004: 00000113 addi x2,x0,0                  x2 : 0x00000000  
      151 ns: VERBOSE: First write to x2. Checker activated
      160 ns: VERBOSE: Core [R] 40008: 00000193
      160 ns: VERBOSE: COSIM    40008: 00000193 addi x3,x0,0                  x3 : 0x00000000  
      161 ns: VERBOSE: First write to x3. Checker activated
      170 ns: VERBOSE: Core [R] 4000c: 00000213
      170 ns: VERBOSE: COSIM    4000c: 00000213 addi x4,x0,0                  x4 : 0x00000000  
      171 ns: VERBOSE: First write to x4. Checker activated
      180 ns: VERBOSE: Core [R] 40010: 00000293
      180 ns: VERBOSE: COSIM    40010: 00000293 addi x5,x0,0                  x5 : 0x00000000  
      181 ns: VERBOSE: First write to x5. Checker activated
      190 ns: VERBOSE: Core [R] 40014: 00000313
      190 ns: VERBOSE: COSIM    40014: 00000313 addi x6,x0,0                  x6 : 0x00000000  
      191 ns: VERBOSE: First write to x6. Checker activated
...

Callstack

Folded callstacks are saved as callstack_folded_cycle_cosim.txt and callstack_folded_inst_cosim.txt, counting the number of cycles and instructions in each stack.
Example snippet from the 'cycle' folded callstack

...
call_main;main;Proc_2; 9006
call_main;main;Proc_1;Proc_6;Func_3; 3006
call_main;main;Proc_1;Proc_7; 4000
call_main;main;Proc_1; 59035
call_main;main;Proc_8; 25022
call_main;main;Func_2;strcmp; 57048
call_main;main;Func_2; 32014
call_main;main;Proc_4; 9006
call_main;main;Proc_5; 4006
...
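
The folded format is easy to post-process. A minimal Python sketch, assuming each line is a `;`-separated stack followed by a count as in the snippet above; `parse_folded` is a hypothetical helper and not the logic of the repository's profiling scripts. The 100 MHz clock used for the time conversion matches the clk_mhz value reported by the flat profile.

```python
from collections import defaultdict

SNIPPET = """\
call_main;main;Proc_2; 9006
call_main;main;Proc_1;Proc_6;Func_3; 3006
call_main;main;Proc_1;Proc_7; 4000
call_main;main;Proc_1; 59035
call_main;main;Proc_8; 25022
call_main;main;Func_2;strcmp; 57048
call_main;main;Func_2; 32014
call_main;main;Proc_4; 9006
call_main;main;Proc_5; 4006
"""


def parse_folded(lines):
    """Return (self, total) cycle counts per symbol."""
    self_c = defaultdict(int)
    total_c = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line == "...":
            continue
        stack, count = line.rsplit(maxsplit=1)
        frames = [f for f in stack.split(";") if f]
        self_c[frames[-1]] += int(count)       # leaf frame owns the samples
        for frame in set(frames):              # set() avoids double counting recursion
            total_c[frame] += int(count)
    return self_c, total_c


self_c, total_c = parse_folded(SNIPPET.splitlines())
clk_mhz = 100.0
print(self_c["Proc_1"], total_c["Proc_1"])     # 59035 66041
print(f"{self_c['Proc_1'] / clk_mhz:.1f} us")  # 590.4 us
```

The totals here cover only the lines of the snippet, so they are lower than the totals over the full file.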

Profiled instructions

Profiled instructions summary is saved as inst_profile_clk.json, counting the number of cycles each instruction type took to execute.
Note that cycle count reports how long it took for that instruction to retire. That means that some stalls (like on jumps or pipe flush) would be counted in the affected instructions, not jumps or branches themselves. That also means that dcache misses on loads and stores would be counted under those instructions.

{
    "add": {"count": 26560},
    "sub": {"count": 6516},
    "sll": {"count": 22},
...
    "_max_sp_usage": 368,
    "_profiled_cycles": 609603
}

Execution trace

Execution trace, saved as trace_clk.bin, contains trace_entry struct for each simulation cycle, and it's needed as an input for the analysis scripts (below)

Hardware stats

Stats for core, icache, dcache, and branch predictor are available as hw_stats.json
This replaces HW models present in the ISA sim

{
"core": {
    "bad_spec": 11692,
    "stall_be": 6645,
    "stall_l1d": 1805,
    "stall_l1d_r": 290,
    "stall_l1d_w": 1515,
    "stall_fe": 72235,
    "stall_l1i": 25775,
    "stall_simd": 0,
    "stall_load": 4840,
    "ret_ctrl_flow": 108436,
    "ret_ctrl_flow_j": 21746,
    "ret_ctrl_flow_jr": 22705,
    "ret_ctrl_flow_br": 63985,
    "ret_mem": 166065,
    "ret_mem_load": 87869,
    "ret_mem_store": 78196,
    "ret_simd": 0,
    "ret_simd_arith": 0,
    "ret_simd_data_fmt": 0,
    "l1i_ref": 530723,
    "l1i_miss": 5172,
    "l1i_spec_miss": 2061,
    "l1i_spec_miss_bad": 1034,
    "l1i_spec_miss_good": 1027,
    "l1d_ref": 162718,
    "l1d_ref_r": 86196,
    "l1d_ref_w": 76522,
    "l1d_miss": 206,
    "l1d_miss_r": 29,
    "l1d_miss_w": 177,
    "l1d_writeback": 142,
    "ret": 519031,
    "cycles": 609603,
    "stalls": 90572,
    "stall_fe_core": 46460,
    "stall_be_core": 4840,
    "ret_int": 519031,
    "cpi": 1.1745,
    "ipc": 0.851425
},
"icache": {
    "references": 530724,
    "hits": {"reads": 525552, "writes": 0}, 
    "misses": {"reads": 5172, "writes": 0}, 
    "replacements": 0,
    "writebacks": 0,
    "ct_core": {"reads": 2122896, "writes": 0}, 
    "ct_mem": {"reads": 331008, "writes": 0}
},
"dcache": {
    "references": 162719,
    "hits": {"reads": 86168, "writes": 76345}, 
    "misses": {"reads": 29, "writes": 177}, 
    "replacements": 0,
    "writebacks": 142,
    "ct_core": {"reads": 304189, "writes": 285631}, 
    "ct_mem": {"reads": 13184, "writes": 9088}
},
"bpred": {
    "type": "rtl_defines",
    "branches": 63985,
    "predicted": 58138,
    "predicted_fwd": 0,
    "predicted_bwd": 0,
    "mispredicted": 5847,
    "mispredicted_fwd": 0,
    "mispredicted_bwd": 0,
    "accuracy": 90.86,
    "mpki": 11.27
},
"_done": true
}
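
The derived values in this file can be cross-checked against the raw counters. The Python sketch below uses the numbers from the run above; the relations are inferred from those numbers (they hold for this example), not taken from the RTL or the analysis scripts.

```python
core = {
    "cycles": 609603, "ret": 519031, "stalls": 90572, "bad_spec": 11692,
    "stall_fe": 72235, "stall_l1i": 25775, "stall_fe_core": 46460,
    "stall_be": 6645, "stall_l1d": 1805, "stall_be_core": 4840,
}
bpred = {"branches": 63985, "predicted": 58138, "mispredicted": 5847}

# TDA level 1: every cycle lands in exactly one bucket.
l1_sum = core["bad_spec"] + core["stall_fe"] + core["stall_be"] + core["ret"]
# TDA level 2: frontend and backend stalls split into memory and core parts.
fe_sum = core["stall_l1i"] + core["stall_fe_core"]
be_sum = core["stall_l1d"] + core["stall_be_core"]

cpi = core["cycles"] / core["ret"]
ipc = core["ret"] / core["cycles"]
accuracy = 100.0 * bpred["predicted"] / bpred["branches"]
mpki = 1000.0 * bpred["mispredicted"] / core["ret"]

print(l1_sum == core["cycles"], fe_sum == core["stall_fe"], be_sum == core["stall_be"])
print(f"CPI {cpi:.4f}, IPC {ipc:.6f}, ACC {accuracy:.2f}%, MPKI {mpki:.2f}")
```

To run the same checks on a real run, the two dicts can be loaded from hw_stats.json with the standard json module.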

Konata

Konata is used to visualize the pipeline and instruction execution. When recorded during execution (with -testplusarg enable_konata), kanata log is available as <test_tag>.kanata.log under test's *_out_cosim/ directory

Bootup sequence

Beginning of main loop execution

Analysis scripts

A collection of custom and open source tools is provided for profiling, analysis, and visualization

Flat profile

Flat profile script provides samples/time spent in all executed functions, and prints it to the stdout

./sim/script/prof_stats.py -t examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt -e cycle --plot
Profile - Cycle
  %[c]  %cumulative[c]    self[c]   total[c]   self[us]  total[us]   symbol
 16.89           16.89      86236      86236      862.4      862.4   strcpy
 16.00           32.90      81690     498303      816.9     4983.0   main
 11.57           44.46      59035      99065      590.4      990.6   Proc_1
 11.18           55.64      57048      57048      570.5      570.5   strcmp
  6.27           61.91      32014      94068      320.1      940.7   Func_2
  5.28           67.19      26953      70003      269.5      700.0   npf_vpprintf
  4.90           72.09      25022      25022      250.2      250.2   Proc_8
  4.31           76.41      22016      25022      220.2      250.2   Proc_6
  2.94           79.34      15006      15006      150.1      150.1   Func_1
  2.36           81.70      12037      12037      120.4      120.4   clear_bss_w_loop
  2.35           84.05      12006      12006      120.1      120.1   Proc_7
  2.31           86.37      11794      11794      117.9      117.9   npf_putc_cnt
  2.31           88.68      11794      11794      117.9      117.9   send_byte_uart0
  2.16           90.83      11008      11008      110.1      110.1   Proc_3
  1.76           92.60       9006       9006       90.1       90.1   Proc_2
  1.76           94.36       9006       9006       90.1       90.1   Proc_4
  1.33           95.69       6790       6790       67.9       67.9   npf_putc_uart
total_samples : 510456
clk_mhz : 100.0
total_time : 5104.56
time_unit : us

(Showing top 17 of 36 entries after filtering - Threshold: 1%)

It's also possible to combine RTL and ISA sim callstacks to get IPC breakdown

./sim/script/prof_stats.py -t examples/dhrystone_dhrystone_out_cosim/callstack_folded_inst_cosim.txt -s examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt --plot
Profile - Inst/Cycles combined 
  %[i]  %cumulative[i]    self[i]   total[i]     %[c]  %cumulative[c]    self[c]   total[c]   self[us]  total[us]     ipc     cpi   symbol
 18.73           18.73      86172      86172    16.89           16.89      86236      86236      862.4      862.4   0.999   1.001   strcpy
 13.11           31.84      60291     449319    16.00           32.90      81690     498303      816.9     4983.0   0.902   1.109   main
 11.74           55.97      54000      90000    11.57           44.46      59035      99065      590.4      990.6   0.908   1.101   Proc_1
 12.39           44.23      57000      57000    11.18           55.64      57048      57048      570.5      570.5   0.999   1.001   strcmp
  6.09           62.06      28000      90000     6.27           61.91      32014      94068      320.1      940.7   0.957   1.045   Func_2
  4.48           71.97      20588      56186     5.28           67.19      26953      70003      269.5      700.0   0.803   1.246   npf_vpprintf
  5.43           67.49      25000      25000     4.90           72.09      25022      25022      250.2      250.2   0.999   1.001   Proc_8
  4.35           76.32      20000      23000     4.31           76.41      22016      25022      220.2      250.2   0.919   1.088   Proc_6
  3.26           79.58      15000      15000     2.94           79.34      15006      15006      150.1      150.1   1.000   1.000   Func_1
  2.31           87.06      10616      10616     2.36           81.70      12037      12037      120.4      120.4   0.882   1.134   clear_bss_w_loop
  2.61           82.19      12000      12000     2.35           84.05      12006      12006      120.1      120.1   1.000   1.000   Proc_7
  2.56           84.75      11788      11788     2.31           86.37      11794      11794      117.9      117.9   0.999   1.001   npf_putc_cnt
  2.20           89.25      10104      10104     2.31           88.68      11794      11794      117.9      117.9   0.857   1.167   send_byte_uart0
  1.96           93.17       9000       9000     2.16           90.83      11008      11008      110.1      110.1   0.818   1.223   Proc_3
  1.96           91.21       9000       9000     1.76           92.60       9006       9006       90.1       90.1   0.999   1.001   Proc_2
  1.96           95.12       9000       9000     1.76           94.36       9006       9006       90.1       90.1   0.999   1.001   Proc_4
  0.73           96.72       3368       3368     1.33           95.69       6790       6790       67.9       67.9   0.496   2.016   npf_putc_uart
total instructions : 459998
total cycles : 510456
total time : 5104.56
clk MHz : 100.0
time unit : us
CPI : 1.11
IPC : 0.901

(Showing top 17 of 36 entries after filtering - Threshold: 1%)

TDA

Top-down analysis can be run based on the collected performance counters.
By default, the script will open up plots in the default browser. The -r <arg> option passes its argument straight to plotly's renderer argument. Using -r notebook or -r png is useful when running from a jupyter notebook. The -r png simply streams png contents to stdout

./sim/script/tda.py examples/dhrystone_dhrystone_out_cosim/hw_stats.json
TDA for 'dhrystone_dhrystone_cosim'
         L1       L2  cycles
0  bad_spec     <NA>   11692
1  frontend   icache   25775
2  frontend     core   46460
3   backend   dcache    1805
4   backend     core    4840
5  retiring  integer  519031
6  retiring     simd       0

Performance Counters for 'dhrystone_dhrystone_cosim' (IPC: 0.851)
           counter  value    class  count
            cycles 609603   cycles 609.6k
               ret 519031      ret 519.0k
            stalls  90572    stall  90.6k
          bad_spec  11692 bad_spec  11.7k
           ret_int 519031    ret_* 519.0k
     ret_ctrl_flow 108436    ret_* 108.4k
  ret_ctrl_flow_br  63985    ret_*  64.0k
   ret_ctrl_flow_j  21746    ret_*  21.7k
  ret_ctrl_flow_jr  22705    ret_*  22.7k
           ret_mem 166065    ret_* 166.1k
      ret_mem_load  87869    ret_*  87.9k
     ret_mem_store  78196    ret_*  78.2k
          ret_simd      0    ret_*      0
    ret_simd_arith      0    ret_*      0
 ret_simd_data_fmt      0    ret_*      0
          stall_be   6645  stall_*  6.64k
     stall_be_core   4840  stall_*  4.84k
          stall_fe  72235  stall_*  72.2k
     stall_fe_core  46460  stall_*  46.5k
         stall_l1d   1805  stall_*  1.80k
       stall_l1d_r    290  stall_*    290
       stall_l1d_w   1515  stall_*  1.51k
         stall_l1i  25775  stall_*  25.8k
        stall_load   4840  stall_*  4.84k
        stall_simd      0  stall_*      0
          l1i_miss   5172    l1i_*  5.17k
           l1i_ref 530723    l1i_* 530.7k
     l1i_spec_miss   2061    l1i_*  2.06k
 l1i_spec_miss_bad   1034    l1i_*  1.03k
l1i_spec_miss_good   1027    l1i_*  1.03k
          l1d_miss    206    l1d_*    206
        l1d_miss_r     29    l1d_*     29
        l1d_miss_w    177    l1d_*    177
           l1d_ref 162718    l1d_* 162.7k
         l1d_ref_r  86196    l1d_*  86.2k
         l1d_ref_w  76522    l1d_*  76.5k
     l1d_writeback    142    l1d_*    142
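The top-down table appears to be a regrouping of a handful of the counters listed above: every cycle is attributed to exactly one of bad speculation, a frontend stall, a backend stall, or a retired instruction. The sketch below rebuilds the table from the printed counter values and checks that the categories add up to the cycle count. The mapping from counter names to table rows is inferred from the matching values in this output, not taken from tda.py.

```python
# Counter values copied from the dhrystone cosim output above
counters = {
    "cycles": 609603, "ret": 519031, "ret_int": 519031, "ret_simd": 0,
    "bad_spec": 11692,
    "stall_l1i": 25775, "stall_fe_core": 46460,
    "stall_l1d": 1805, "stall_be_core": 4840,
}


def top_down(c: dict) -> list:
    """Return (L1, L2, cycles) rows in the same order as the table above."""
    return [
        ("bad_spec", "<NA>", c["bad_spec"]),
        ("frontend", "icache", c["stall_l1i"]),
        ("frontend", "core", c["stall_fe_core"]),
        ("backend", "dcache", c["stall_l1d"]),
        ("backend", "core", c["stall_be_core"]),
        ("retiring", "integer", c["ret_int"]),
        ("retiring", "simd", c["ret_simd"]),
    ]


rows = top_down(counters)
accounted = sum(cycles for _, _, cycles in rows)
for l1, l2, cycles in rows:
    share = 100 * cycles / counters["cycles"]
    print(f"{l1:>8} {l2:>8} {cycles:>7} {share:6.2f}%")
print("accounted cycles:", accounted, "of", counters["cycles"])
print(f"IPC: {counters['ret'] / counters['cycles']:.3f}")
```

For this run the seven rows sum to 609603, the full cycle count, and the IPC evaluates to 0.851 as reported in the header.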

FlameGraph

./sim/script/get_flamegraph.py examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt

Open the generated interactive flamegraph_clk_cycle.svg in the web browser
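For a quick text-only view of the same data, the folded call stack file can be aggregated per function. The sketch below assumes the file uses the common folded-stack layout consumed by flame graph tools (frames separated by `;`, followed by a space and a count); this layout is an assumption and should be checked against the actual callstack_folded_cycle_cosim.txt. The sample stacks are hypothetical and not taken from the dhrystone run.

```python
from collections import Counter


def inclusive_counts(folded_lines):
    """Sum the count of every stack in which a function appears."""
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        # set() avoids double counting a recursive frame within one stack
        for frame in set(stack.split(";")):
            totals[frame] += int(count)
    return totals


# Hypothetical folded stacks, for illustration only
sample = [
    "main;Proc_1;Proc_3 120",
    "main;Proc_1 300",
    "main;strcpy 80",
]
for name, cycles in inclusive_counts(sample).most_common():
    print(f"{name:<8} {cycles}")
```

Each total is inclusive, that is, it covers the function and everything called from it, which is what the width of a frame in the flame graph represents.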

Call Graph

./sim/script/get_call_graph.py examples/dhrystone_dhrystone_out_cosim/callstack_folded_cycle_cosim.txt

Execution visualization

Note

Running any of the commands below with -b opens an interactive session in the browser (or hosts one with --host)

Get timeline plot

./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --timeline --clk

Get stats trace (adjust window sizes as needed)

./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --stats_trace --win_size_stats 512 --win_size_hw 64 --clk

Get execution breakdown

./sim/script/run_analysis.py -i examples/dhrystone_dhrystone_out_cosim/inst_profile_clk.json --clk

Get execution histograms

./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --pc_hist --add_cache_lines --clk
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --dmem_hist --add_cache_lines --clk

Get execution trace

./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --pc_trace --add_cache_lines --clk
./sim/script/run_analysis.py -t examples/dhrystone_dhrystone_out_cosim/trace_clk.bin --dasm sim/sw/baremetal/dhrystone/dhrystone.dasm --dmem_trace --add_cache_lines --clk

Optionally, save symbols found in dasm with --save_symbols

Note

Whenever ./sim/script/run_analysis.py is invoked with the --dasm argument, the backannotated dasm is saved, e.g. dhrystone.prof.dasm

Timeline

Stats Trace

Execution breakdown

Execution histograms

Execution trace

Backannotation of disassembly

Adding --print_symbols to any of the commands that use -t prints all found symbols to stdout

Symbols found in sim/sw/baremetal/dhrystone/dhrystone.dasm in 'text' section:
0x430EC - 0x433D8: _free_r (0)
0x42FAC - 0x430E8: _malloc_trim_r (0)
0x42F08 - 0x42FA8: strcpy (86.2k)
0x42E7C - 0x42F04: __libc_init_array (54)
0x42E20 - 0x42E78: _sbrk_r (38)
0x42E1C - 0x42E1C: __malloc_unlock (2)
0x42E18 - 0x42E18: __malloc_lock (8)
0x42608 - 0x42E14: _malloc_r (418)
0x425FC - 0x42604: malloc (28)
0x425D4 - 0x425F8: _sbrk (21)
0x42298 - 0x425D0: mini_vpprintf (25.8k)
0x42224 - 0x42294: _puts (0)
0x42018 - 0x42220: mini_pad (955)
0x41EB0 - 0x42014: mini_itoa (2.81k)
0x41E88 - 0x41EAC: mini_strlen (648)
0x41E2C - 0x41E84: trap_handler (0)
0x41E00 - 0x41E28: timer_interrupt_handler (0)
0x41DEC - 0x41DFC: get_cpu_time (12)
0x41D98 - 0x41DE8: mini_printf (1.39k)
0x41D64 - 0x41D94: __puts_uart (26.1k)
0x41D14 - 0x41D60: _write (34.7k)
0x41CFC - 0x41D10: send_byte_uart0 (11.7k)
0x41CD4 - 0x41CF8: time_s (28)
0x41C14 - 0x41CD0: Proc_6 (22.0k)
0x41C08 - 0x41C10: Func_3 (3.00k)
0x41B8C - 0x41C04: Func_2 (34.0k)
0x41B6C - 0x41B88: Func_1 (15.0k)
0x41B08 - 0x41B68: Proc_8 (25.0k)
0x41AF8 - 0x41B04: Proc_7 (18.0k)
0x41478 - 0x41AF4: main (85.6k)
0x41468 - 0x41474: Proc_5 (4.01k)
0x41444 - 0x41464: Proc_4 (9.00k)
0x412F4 - 0x41440: Proc_1 (71.0k)
0x412D0 - 0x412F0: Proc_3 (13.0k)
0x412A8 - 0x412CC: Proc_2 (9.01k)
0x4125C - 0x412A4: __clzsi2 (0)
0x4122C - 0x41258: __modsi3 (0)
0x411F8 - 0x41228: __umodsi3 (354)
0x411B0 - 0x411F4: __hidden___udivsi3 (30.1k)
0x411A8 - 0x411AC: __divsi3 (2.00k)
0x41184 - 0x411A4: __mulsi3 (36)
0x41074 - 0x41180: __floatsisf (0)
0x41004 - 0x41070: __fixsfsi (0)
0x40CB8 - 0x41000: __mulsf3 (0)
0x40944 - 0x40CB4: __divsf3 (0)
0x4037C - 0x40940: __udivdi3 (308)
0x40200 - 0x40378: strcmp (65.0k)
0x400EC - 0x401FC: trap_entry (0)
0x400E8 - 0x400E8: forever (0)
0x400DC - 0x400E4: call_main (4)
0x400CC - 0x400D8: clear_bss_b_loop (0)
0x400C8 - 0x400C8: clear_bss_b_check (3)
0x400B8 - 0x400C4: clear_bss_w_loop (12.0k)
0x40000 - 0x400B4: _start (69)
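The printed symbol list is regular enough to be post-processed, for example to map a program counter back to its function. The sketch below parses lines in the format shown above and looks up an address. The helper names `parse_symbols` and `symbol_at` are illustrative and not part of run_analysis.py, and counts with a `k` suffix are already rounded in the listing, so the parsed values are approximate.

```python
import re

SYMBOL_RE = re.compile(
    r"^(0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+): (\S+) \(([\d.]+)(k?)\)$"
)


def parse_symbols(text):
    """Return (start, end, name, count) tuples sorted by start address."""
    symbols = []
    for line in text.splitlines():
        m = SYMBOL_RE.match(line.strip())
        if not m:
            continue  # skip the header and blank lines
        start, end, name, value, kilo = m.groups()
        count = round(float(value) * (1000 if kilo else 1))
        symbols.append((int(start, 16), int(end, 16), name, count))
    return sorted(symbols)


def symbol_at(symbols, pc):
    """Find the symbol whose inclusive [start, end] range contains pc."""
    for start, end, name, _ in symbols:
        if start <= pc <= end:
            return name
    return None


# A few lines copied from the listing above
listing = """\
0x41478 - 0x41AF4: main (85.6k)
0x41468 - 0x41474: Proc_5 (4.01k)
0x400E8 - 0x400E8: forever (0)
0x40000 - 0x400B4: _start (69)
"""
symbols = parse_symbols(listing)
print(symbol_at(symbols, 0x4146C))  # Proc_5
```

The end address is treated as inclusive because, in the listing, it is the address of the symbol's last instruction (single-instruction symbols such as `forever` have equal start and end).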

It also backannotates the disassembly and saves it as dhrystone.prof.dasm

00040000 <_start>:
   12    40000:	00000093          	addi	x1,x0,0
    1    40004:	00000113          	addi	x2,x0,0
    1    40008:	00000193          	addi	x3,x0,0
    1    4000c:	00000213          	addi	x4,x0,0
    1    40010:	00000293          	addi	x5,x0,0
    1    40014:	00000313          	addi	x6,x0,0
...
000400b8 <clear_bss_w_loop>:
 4070    400b8:	00052023          	sw	x0,0(x10)
 2654    400bc:	00450513          	addi	x10,x10,4
 2655    400c0:	fff68693          	addi	x13,x13,-1
 2654    400c4:	fe069ae3          	bne	x13,x0,400b8 <clear_bss_w_loop>
...
00041468 <Proc_5>:
 1006    41468:	04100693          	addi	x13,x0,65
 1000    4146c:	82d186a3          	sb	x13,-2003(x3) # 44145 <Ch_1_Glob>
 1000    41470:	8201a823          	sw	x0,-2000(x3) # 44148 <Bool_Glob>
 1000    41474:	00008067          	jalr	x0,0(x1)
...
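The per-instruction counts in the backannotated disassembly can be cross-checked against the per-symbol totals in the symbol list. The sketch below sums the leading count column under each label; it assumes every annotated line has the shape shown above (count, address, encoding, mnemonic), and the function name `counts_per_symbol` is illustrative.

```python
import re
from collections import defaultdict

LABEL_RE = re.compile(r"^[0-9a-f]{8} <(\S+)>:$")
# count, address followed by ':', then the 8-digit instruction encoding
INST_RE = re.compile(r"^\s*(\d+)\s+([0-9a-f]+):\s+[0-9a-f]{8}\s")


def counts_per_symbol(dasm_text):
    """Sum the backannotated execution counts under each symbol label."""
    totals = defaultdict(int)
    current = None
    for line in dasm_text.splitlines():
        label = LABEL_RE.match(line)
        if label:
            current = label.group(1)
            continue
        inst = INST_RE.match(line)
        if inst and current is not None:
            totals[current] += int(inst.group(1))
    return dict(totals)


# Lines copied from the dhrystone.prof.dasm excerpt above
excerpt = """\
000400b8 <clear_bss_w_loop>:
 4070    400b8:  00052023            sw   x0,0(x10)
 2654    400bc:  00450513            addi x10,x10,4
 2655    400c0:  fff68693            addi x13,x13,-1
 2654    400c4:  fe069ae3            bne  x13,x0,400b8 <clear_bss_w_loop>
...
00041468 <Proc_5>:
 1006    41468:  04100693            addi x13,x0,65
 1000    4146c:  82d186a3            sb   x13,-2003(x3) # 44145 <Ch_1_Glob>
 1000    41470:  8201a823            sw   x0,-2000(x3) # 44148 <Bool_Glob>
 1000    41474:  00008067            jalr x0,0(x1)
"""
for name, total in counts_per_symbol(excerpt).items():
    print(f"{name}: {total}")
```

The sums come to 12033 for `clear_bss_w_loop` and 4006 for `Proc_5`, consistent with the rounded 12.0k and 4.01k shown in the symbol list.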

Hardware performance estimates correlation

Same as with the ISA sim, except that -c/--corr can now be passed to correlate the estimates against the RTL counters.
Run with positional arguments as

./sim/script/perf_est_v2.py \
    sim/examples/dhrystone_dhrystone_out/inst_profile.json \
    sim/examples/dhrystone_dhrystone_out/hw_stats.json \
    sim/examples/dhrystone_dhrystone_out/rf_trace.bin \
    -c examples/dhrystone_dhrystone_out_cosim/hw_stats.json
Performance estimate breakdown for: 
    sim/examples/dhrystone_dhrystone_out/inst_profile.json
    sim/examples/dhrystone_dhrystone_out/hw_stats.json
    <home_path>/sim/script/hw_perf_metrics_v2.yaml
    sim/examples/dhrystone_dhrystone_out/rf_trace.bin

Peak Stack usage: 352 bytes
Instructions executed: 460.0k
    icache (32 sets, 2 ways, 4096B data): References: 460.3k, Hits: 459.9k, Misses: 367, Hit Rate: 99.92%, MPKI: 0.80
DMEM inst: 155.3k - L/S: 84.2k/71.1k (33.77% instructions)
    dcache (16 sets, 4 ways, 4096B data): References: 152.0k, Hits: 151.8k, Misses: 209, Writebacks: 145, Hit Rate: 99.86%, MPKI: 0.45
Branch inst: 50487 (10.98% instructions)
    bpred (combined): Predicted: 50.1k, Mispredicted: 339, Accuracy: 99.33%, MPKI: 0.74
DIV/REM inst: 1496 (0.33% instructions)
    divider (16B): Cache: 1.10k (73.53%), Special: 344 (22.99%), Common: 52 (3.48%), 489 b, 9.4 b/d

Pipeline stalls (max): 
    Bad spec: 678
    FE bound: 39.8k - ICache: 2.20k (AMAT: 1.0), Core: 37.6k
    BE bound: 10.7k - DCache: 1.69k (AMAT: 1.01), Core: 8.97k (SIMD 0, DIV 2.43k, Load 6.53k)

Estimated HW performance at 100MHz:
    Best:     500.5k cycles (5.00ms), IPC: 0.919; BW (avg MB/s) - icache: 350.9, dcache (R/W): 105.7 (55.9/49.8), mem (R/W): 8.8 (7.0/1.8)
    Expected: 511.1k cycles (5.11ms), IPC: 0.900; BW (avg MB/s) - icache: 343.5, dcache (R/W): 103.5 (54.7/48.8), mem (R/W): 8.6 (6.9/1.7)
    Estimated Cycles range: 10.7k cycles, midpoint: 505.8k, ratio: 2.11%

Correlation:
          metric    est    rtl  diff    diff%
          cycles 511140 511200   -60   -0.012
           empty  51159  51224   -65   -0.127
          stalls  50481  50470    11    0.022
            lost    678    754   -76  -11.209
      lost_other      0     22   -22 -100.000
        bad_spec    678    732   -54   -7.965
        stall_be  10655  10781  -126   -1.183
       stall_l1d   1689   1835  -146   -8.644
   stall_be_core   8966   8946    20    0.223
        stall_fe  39826  39689   137    0.344
       stall_l1i   2202   2062   140    6.358
   stall_fe_core  37624  37627    -3   -0.008
             ret 459976 459976     0    0.000
        ret_simd      0      0     0    0.000
         ret_int 459976 459976     0    0.000
ret_ctrl_flow_br  50487  50487     0    0.000
         bp_miss    339    366   -27   -7.965
         l1i_ref 460315 460673  -358   -0.078
        l1i_miss    367    402   -35   -9.537
         l1d_ref 151974 151974     0    0.000
        l1d_miss    209    209     0    0.000 
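The `diff` and `diff%` columns can be reproduced from the `est` and `rtl` columns. The printed percentages are consistent with the difference being taken relative to the estimate, as in the sketch below. The handling of a zero estimate (falling back to the RTL value, which reproduces the -100.000 shown for `lost_other`) is an assumption and not confirmed from perf_est_v2.py; the function name `correlate` is illustrative.

```python
def correlate(est: int, rtl: int):
    """Return (diff, diff_percent) for one metric of the correlation table."""
    diff = est - rtl
    if diff == 0:
        return 0, 0.0
    # Assumption: use the RTL value as the base when the estimate is zero
    base = est if est != 0 else rtl
    return diff, 100.0 * diff / base


# (est, rtl) pairs copied from the correlation table above
rows = {
    "cycles": (511140, 511200),
    "lost": (678, 754),
    "lost_other": (0, 22),
    "stall_l1i": (2202, 2062),
    "l1d_miss": (209, 209),
}
for metric, (est, rtl) in rows.items():
    diff, pct = correlate(est, rtl)
    print(f"{metric:>12} {est:>7} {rtl:>7} {diff:>5} {pct:>9.3f}")
```

A negative value means the estimate is lower than the RTL count, so for the cycle total the model is 60 cycles optimistic on this run.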

Overall hardware performance estimates correlation

This is a useful check of how much confidence can be placed in the functional models. Since the estimation flow offers a much faster turnaround time, it is a tempting target for rapid exploration of both workload changes and microarchitectural tweaks.
Cycle estimates were compared against RTL cycle counts using only the timed workload regions, that is, excluding benchmark harness overhead such as setup, warmup, and UART printing.

Benchmarks

Metric                   Value
Mean signed error        -0.33%
Std dev signed error     1.32%
Mean absolute error      0.61%
Median absolute error    0.17%
Worst absolute error     6.13%
≤ 1%                     22/28 (79%)
≤ 3%                     27/28 (96%)
≤ 5%                     27/28 (96%)
≤ 10%                    28/28 (100%)
> 10%                    0/28 (0%)

Ustress

Metric                   Value
Mean signed error        -0.13%
Std dev signed error     0.28%
Mean absolute error      0.15%
Median absolute error    0.01%
Worst absolute error     0.69%
≤ 1%                     13/13 (100%)
≤ 3%                     13/13 (100%)
≤ 5%                     13/13 (100%)
≤ 10%                    13/13 (100%)
> 10%                    0/13 (0%)
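The metrics in the two tables above can be computed from per-workload cycle pairs along the lines of the sketch below. Several details are assumptions: the error is taken relative to the RTL cycle count, the standard deviation is the sample standard deviation, and the threshold buckets are cumulative. The input values are hypothetical and do not come from the benchmark runs.

```python
from statistics import mean, median, stdev


def summarize_errors(est_cycles, rtl_cycles, thresholds=(1, 3, 5, 10)):
    """Summarize signed percentage errors of estimated vs. RTL cycles."""
    # Assumption: error is measured relative to the RTL cycle count
    errors = [100.0 * (e - r) / r for e, r in zip(est_cycles, rtl_cycles)]
    abs_errors = [abs(x) for x in errors]
    summary = {
        "mean_signed": mean(errors),
        "std_signed": stdev(errors),  # sample standard deviation
        "mean_abs": mean(abs_errors),
        "median_abs": median(abs_errors),
        "worst_abs": max(abs_errors),
    }
    # Cumulative buckets: number of workloads at or below each threshold
    for t in thresholds:
        summary[f"<= {t}%"] = sum(1 for x in abs_errors if x <= t)
    return summary


# Hypothetical cycle counts for four workloads, for illustration only
est = [1000, 2040, 2970, 5000]
rtl = [1000, 2000, 3000, 5000]
for key, value in summarize_errors(est, rtl).items():
    print(f"{key:>12}: {value}")
```

Reporting both signed and absolute errors is useful because a mean signed error near zero can hide large errors of opposite sign, which the absolute metrics and the buckets expose.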

About

SystemVerilog implementation of RISC-V RV32IM & custom packed SIMD ISA as 5-stage single-issue CPU core with branch predictor and L1 caches, tied up in lockstep with ISA sim over DPI for verification
