- Reduce allocations and clones #6782
- Add BAL (EIP-7928) Prometheus instruments (`bal_blocks_total`, `bal_size_bytes`, `bal_size_bytes_histogram`, `bal_account_count`, `bal_slot_count`) and a BAL row in the `ethrex_l1_perf` Grafana dashboard #6678
- Parallelize single-account storage-trie merkleization: shard a hot account's (>=2048 slot updates) storage-root computation across 16 nibble-keyed workers, so one bloated contract no longer serializes its trie inserts on a single thread #6845
- Thread `Arc<BlockAccessList>` through the block pipeline to avoid an O(BAL-size) deep clone of the Block Access List (and its validation index) per block on the parallel execution path #6829
- In-place top-slot mutation for unary/binary opcodes and `MLOAD`: mutate the top stack slot directly instead of pop-then-push, removing the serial read-modify-write of the stack offset on offset-chain-bound ops. ~2.16x on an ISZERO loop and ~1.63x on `MLOAD` (IPC 2.41 -> 3.45 on a 33M-op loop) #6865
- Monomorphize the LEVM opcode dispatch loop over whether a struct-log tracer is active (`run_dispatch::<const TRACED: bool>`), moving the per-opcode trace capture into cold out-of-line helpers. The non-traced path (the common case) drops the per-op tracer load/branches and the capture code's influence on loop codegen: ~36% fewer instructions and ~23% fewer cycles per opcode on a minimal op, lifting all-opcode dispatch throughput #6847
- Read a block's receipts in a single bulk read in `eth_getLogs` instead of a point lookup per transaction, ~5–7x faster on mainnet log queries #6852
- Skip non-matching blocks in `eth_getLogs` using the per-block header bloom, avoiding body/receipt loads for blocks that provably cannot match #6813
- Pad `Code` bytecode with 33 trailing `STOP` bytes so the hot dispatch fetch and `pc` advance drop their bounds checks (~8% fewer instructions, ~13% fewer branches on dispatch). The logical length is tracked separately and `Code` is encapsulated so all consumers read the true length #6866
- Route the native P256VERIFY (secp256r1, RIP-7212/EIP-7951) precompile through the `aws-lc-rs` (AWS-LC) backend, replacing the pure-Rust `p256` path on `NativeCrypto`; zkVM guest builds keep the portable backend #6828
- Route the native BLS12-381 (EIP-2537) precompiles through the `blst` backend, replacing the pure-Rust path whose Fermat field inversion made `BLS12_G1ADD`/`BLS12_G2ADD` time-per-gas outliers; ~7.7x faster G1ADD, ~5.6x faster G2ADD (zkVM guest builds keep the portable backend) #6792
- Short-circuit the `KECCAK256` opcode on zero-length input by returning the precomputed `keccak256("")` constant, skipping the permutation #6775
- Prefetch all BAL storage synchronously before execution #6732
- Lazy BAL cursor for per-tx parallel execution #6669
- Move per-tx BAL validation into the rayon par_iter closure on the parallel execution path #6677
- Replace synchronous disk I/O with async operations in snap sync #6113
- Skip `vm.run_execution()` for transfers to codeless EOAs #6570
- BAL optimistic merkleization: synthesize state deltas from the input Block Access List pre-execution and merkleize in parallel with the EVM on the `engine_newPayload` validation path. Includes a parallel state-trie pre-warm and per-account hashed-key-sorted storage inserts to keep the trie node arena hot for Stage B/C #6655
- Two-CF receipts migration: copy old RLP-keyed receipts to `receipts_v2` with fixed-width keys; old CF auto-dropped on restart #6598
- Reduce peak disk usage during snap sync by moving SST files into the temp DB instead of copying #6532
- Replace per-block thread spawning with persistent thread pool for merkleization #6344
- Eliminate stack-frame spill in Stack::push for zero-upper-limb values #6390
- Use const-generic big-endian conversion in PUSH opcodes #6390
- Doubly pipelined merkleization with self-coordinating shard workers #6278
- Switch hot EVM and mempool HashMaps to FxHashMap for faster hashing #6303
- SIMD-accelerate trie nibble operations for block execution #6286
- Use FxHashMap in call frame backup #6286
- Use nested storage originals, FxHashMap call frame backup, and sstore-specific storage access helper #6265
- Refactor LEVM opcode handlers to avoid expensive matches #4791
- Speed up snap sync validation with parallelism and deduplication #6191
- Disable balance check for prewarming to avoid early reverts #6259
- Expand fast-path dispatch in LEVM interpreter loop #6245
- Check self before parent in Substate warm/cold lookups #6244
- Add precompile result cache shared between warmer and executor threads #6243
- Optimize storage layer for block execution by reducing lock contention and allocations #6207
- Defer KZG blob proof verification from P2P to mempool insertion #6150
- Cache ECDSA sender recovery in transaction structs #6153
- Optimize `debug_executionWitness` by pre-serializing RPC format at storage time #5956
- Use fastbloom as the bloom filter #5968
- Improve snap sync logging with table format and visual progress bars #5977
- Remove `ethrex-threadpool` crate and move `ThreadPool` to `ethrex-trie` #5925
- Add frame pointers setting to makefiles #5746
- Remove `Mutex<Box<_>>` from `DatabaseLogger::store` to reduce contention #5930
- Reduce state iterated when calculating partial state transitions #5864
- Remove needless allocs in CALLDATACOPY/CODECOPY/EXTCODECOPY #5810
- Inline common opcodes #5761
- Improve ecrecover precompile by removing heap allocs and conversions #5709
- Refactor `ecpairing` using ark #5792
- Remove needless allocs on store api #5709
- Avoid double parsing and extra clones in doc signature formatting #9285
- Make HashSet use fxhash in discv4 peer_table #5688
- Validate tx blobs after checking if it's already in the mempool #5686
- Parallelize storage merkleization #6079
- Avoid unnecessary hashing of init codes and already hashed codes #5397
- Change some calls from `encode_to_vec().len()` to `.length()` when wanting to get the rlp encoded length #5374
- Use our keccak implementation for receipts bloom filter calculation #5454
- Use unchecked swap for stack #5439
- Improve RLP encoding by avoiding extra loops and removing an unneeded array vec; also add an alloc-less length method to the default trait impl #5350
- Parallelize merkleization #5377
- Avoid temporary allocations when decoding and hashing trie nodes #5353
- Use specialized DUP implementation #5324
- Avoid recalculating blob base fee while preparing transactions #5328
- Use BlobDB for account_codes column family #5300
- Only mark individual values as dirty instead of the whole trie #5282
- Separate Account and storage Column families in rocksdb #5055
- Avoid copying while reading account code #5289
- Cache `BLOBBASEFEE` opcode value #5288
- Insert instead of merge for bloom rebuilds #5223
- Replace sha3 keccak with an assembly version using FFI #5247
- Fix `FlatKeyValue` generation on fullsync mode #5274
- Disable RocksDB compression #5223
- Reuse stack pool in LEVM #5179
- Merkleization backpressure and batching #5200
- Pipeline Merkleization and Execution #5084
- Add bloom filters to snapshot layers #5112
- Make trusted setup warmup non blocking #5124
- Run "engine_newPayload" block execution in a dedicated worker thread. #5051
- Reusing FindNode message per lookup loop instead of randomizing the key for each message. #5047
- Move trie updates post block execution to a background thread. #4989.
- Instead of lazy computation of blocklist, do greedy computation of allowlist and store the result, fetch it with the DB. #4961
- Remove duplicate subgroup check in ecpairing precompile #4960
- Replaces incremental iteration with a one-time precompute method that scans the entire bytecode, building a `BitVec<u8, Msb0>` where bits mark valid `JUMPDEST` positions, skipping `PUSH1..PUSH32` data bytes.
- Updates `is_blacklisted` to O(1) bit lookup.
- Improve get_closest_nodes p2p performance #4838
- Remove explicit cache-related options from RocksDB configuration and revert optimistic transactions to reduce RAM usage #4853
- Remove unnecessary mul in ecpairing #4843
- Improve block headers vec handling in syncer #4771
- Refactor current_step sync metric from a `Mutex<String>` to a simple atomic. #4772
- Change remaining_gas to i64, improving performance in gas cost calculations #4684
- Downloading all slots of big accounts during the initial leaves download step of snap sync #4689
- Downloading and inserting intelligently accounts with the same state root and few (<= slots) #4689
- Improving the performance of state trie through an ordered insertion algorithm #4689
- Remove `OpcodeResult` to improve tight loops of lightweight opcodes #4650
- Avoid dumping empty storage accounts to disk #4590
- Improve instruction fetching, dynamic opcode table based on configured fork, specialized push_zero in stack #4579
- Fix caching mechanism of the latest block's hash #4479
- Add `jemalloc` as an optional global allocator used by default #4301
- Improve time when downloading bytecodes from peers #4487
- Add `RocksDB` as an optional storage engine #4272
- Implement fast partition of `TrieIterator` and use it for quickly responding `GetAccountRanges` and `GetStorageRanges` #4404
- Refactor substrate backup mechanism to avoid expensive clones #4381
- Use x86-64-v2 cpu target on linux by default, dockerfile will use it too. #4252
- Process JUMPDEST gas and pc together with the given JUMP JUMPI opcode, improving performance. #4220
- Improve P2P mempool gossip performance #4205
- Improve precompiles further: modexp, ecrecover #4168
- Improve memory resize performance #4117
- Improve calldatacopy opcode further #4150
- Improve Memory::load_range by returning a Bytes directly, avoiding a vec allocation #4098
- Improve ecpairing (bn128) precompile #4130
- Improve BLS12 precompile #4073
- Improve blobbasefee opcode #4092
- Make precompiles use a constant table #4097
- Improve addmod and mulmod opcode performance #4072
- Improve signextend opcode performance #4071
- Improve performance of calldataload, calldatacopy, extcodecopy, codecopy, returndatacopy #4070
- Use malachite crate to handle big integers in modexp, improving performance #4045
- Cache chain config and latest canonical block header #3878
- Batching of transaction hashes sent in a single NewPooledTransactionHashes message #3912
- Make `JUMPDEST` blacklist lazily generated on-demand #3812
- Rewrite Blake2 AVX2 implementation (avoid gather instructions and better loop handling).
- Add Blake2 NEON implementation.
- Add a secondary index keyed by sender+nonce to the mempool to avoid linear lookups #3865
- Refactor current callframe to avoid handling avoidable errors, improving performance #3816
- Add shortcut to avoid callframe creation on precompile invocations #3802
- Use `rayon` to recover the sender address from transactions #3709
- Migrate EcAdd and EcMul to Arkworks #3719
- Add specialized push1 and pop1 to stack #3705
- Improve precompiles by avoiding 0 value transfers #3715
- Improve BlobHash #3704
  - Added push1 and pop1 to avoid using arrays for single variable operations.
  - Avoid checking for blob hashes length twice.
- Use a lookup table for opcode execution #3669
- Improve CodeCopy performance #3675
- Improve sstore performance further #3657
- Improve levm memory model #3564
- Add sstore bench #3552
- Add AVX256 implementation of BLAKE2 #3590
- Improve sstore opcodes #3555
- Improve blake2f #3503
- Use a stack pool #3386
- Refactor jump opcodes to use a blacklist on invalid targets.
- Improved the performance of shift instructions. #2933
- Refactor Patricia Merkle Trie to avoid rehashing the entire path on every insert #2687
- Add immutable cache to LEVM that stores in memory data read from the Database so that getting account doesn't need to consult the Database again. #2829
- Reduce account clone overhead when account data is retrieved #2684
- Reduce transaction clone and Vec grow overhead in mempool #2637
- Make TrieDb trait use NodeHash as key #2517
- Avoid calculating state transitions after every block in bulk mode #2519
- Transform the inlined variant of NodeHash to a constant sized array #2516
- Removed some unnecessary clones and made some functions const: #2438
- Asyncify some DB read APIs, as well as its users #2430
- Fix an issue where the table was locked for up to 20 sec when performing a ping: #2368
- Fix a bug where RLP encoding was being done twice: #2353, check the report under `docs/perf_reports` for more information.
- Asyncify DB write APIs, as well as its users #2336
- Faster block import, use a slice instead of copy #2097
- Don't recompute transaction senders when building blocks #2097
- Process blocks in batches when syncing and importing #2174
- Compute tx senders in parallel #2268
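
For readers unfamiliar with the one-time `JUMPDEST` precompute mentioned in the entries above, the idea can be sketched roughly as follows. This is an illustrative sketch only, not the ethrex implementation: the `JumpTargets` type and its `scan` and `is_valid` methods are hypothetical names, and a plain `Vec<u8>` bitmap (MSB-first) stands in for the `BitVec<u8, Msb0>` named in the changelog so that the example needs only the standard library.

```rust
// Hypothetical sketch: names and layout are assumptions, not the ethrex API.
const JUMPDEST: u8 = 0x5b;
const PUSH1: u8 = 0x60;
const PUSH32: u8 = 0x7f;

/// Packed bitmap with one bit per bytecode byte; a set bit marks a valid JUMPDEST.
pub struct JumpTargets {
    bits: Vec<u8>,
    len: usize,
}

impl JumpTargets {
    /// Scan the whole bytecode once, skipping PUSH1..PUSH32 immediate data
    /// so that a 0x5b byte inside push data is never marked as a target.
    pub fn scan(code: &[u8]) -> Self {
        let mut bits = vec![0u8; (code.len() + 7) / 8];
        let mut pc = 0usize;
        while pc < code.len() {
            let op = code[pc];
            if op == JUMPDEST {
                // MSB-first bit order within each byte.
                bits[pc / 8] |= 0x80u8 >> (pc % 8);
            } else if (PUSH1..=PUSH32).contains(&op) {
                // PUSHn carries n immediate bytes; skip over them.
                pc += (op - PUSH1) as usize + 1;
            }
            pc += 1;
        }
        Self { bits, len: code.len() }
    }

    /// O(1) lookup: is `target` inside the code and marked as a JUMPDEST?
    pub fn is_valid(&self, target: usize) -> bool {
        target < self.len && (self.bits[target / 8] & (0x80u8 >> (target % 8))) != 0
    }
}
```

With the bytecode `[0x60, 0x5b, 0x5b, 0x00]` (a `PUSH1` whose data byte happens to be `0x5b`, then a real `JUMPDEST`, then `STOP`), only offset 2 is reported as a valid target. Whether the stored bits mean "valid" or "blacklisted" is a design choice; the sketch marks valid positions.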