Assembling the Puzzle: High-Performance Entity Building streaming Beam pipeline using a Two-Tiered State Architecture
This repository contains the official demo for the talk "Assembling the Puzzle: High-Performance Entity Building streaming Beam pipeline using a Two-Tiered State Architecture" presented at Beam College 2026.
Note: The slides for the talk are available in the repository as Assembling_the_puzzle.pdf.
This project implements an Apache Beam pipeline to merge and reconstruct e-commerce session entities from partial events using a two-level cache (In-Memory State and Local Filesystem persistent storage).
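The two-level idea can be illustrated without Beam at all. The following is a minimal sketch, not the repository's implementation: the class `TwoTierStateSketch`, its method names, and the one-file-per-session `<sessionId>.json` layout are all assumptions made for illustration. It keeps hot session state in memory, offloads it to a file under a base directory, and reloads it on the next read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical two-tier store: a hot in-memory map backed by one file per session. */
class TwoTierStateSketch {
    private final Map<String, String> hot = new HashMap<>(); // tier 1: in-memory state
    private final Path baseDir;                              // tier 2: local filesystem

    TwoTierStateSketch(Path baseDir) {
        this.baseDir = baseDir;
    }

    /** Writes always land in the in-memory tier. */
    void write(String sessionId, String stateJson) {
        hot.put(sessionId, stateJson);
    }

    /** Reads tier 1 first, then falls back to tier 2 and re-warms tier 1 on a hit. */
    Optional<String> read(String sessionId) {
        String cached = hot.get(sessionId);
        if (cached != null) {
            return Optional.of(cached);
        }
        Path file = fileFor(sessionId);
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        try {
            String json = Files.readString(file, StandardCharsets.UTF_8);
            hot.put(sessionId, json);
            return Optional.of(json);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Persists a session to tier 2 and evicts it from tier 1 (stand-in for a timer callback). */
    void offload(String sessionId) {
        String json = hot.remove(sessionId);
        if (json == null) {
            return; // nothing in memory to offload
        }
        try {
            Files.createDirectories(baseDir);
            Files.writeString(fileFor(sessionId), json, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isHot(String sessionId) {
        return hot.containsKey(sessionId);
    }

    // Assumed layout: <baseDir>/<sessionId>.json (session IDs are trusted in this sketch).
    private Path fileFor(String sessionId) {
        return baseDir.resolve(sessionId + ".json");
    }
}
```

In this sketch, `offload` plays the role that a timer would play in a streaming pipeline: it bounds memory use by moving idle sessions to disk, while `read` hides from the caller which tier actually held the state.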
The project generates events internally in Java (EventGenerator.java), producing complex event sequences that simulate high-volume, concurrent user traffic.
Each execution generates a configurable number of sessions (based on requested event count), with each session containing exactly 30 events. While sessions overlap in time, the processing order is globally randomized to simulate real-world distributed system behavior (out-of-order arrival).
Each session follows a strict stateful progression to ensure data integrity:
- Shopping Phase (Events 1-28):
  - Consists of ADD_TO_CART and REMOVE_FROM_CART events.
  - Constraint: An item must be added before it can be removed.
  - Safety: The generator ensures at least one item remains in the cart before proceeding to checkout.
- Payment Phase (Event 29):
  - Exactly one ADD_PAYMENT event.
  - Constraint: This always occurs after the shopping phase.
- Completion Phase (Event 30):
  - Exactly one SUBMIT_ORDER event.
  - Constraint: This is always the final event in every session.
- Internal Consistency: Every event within a session has a strictly increasing timestamp (chronological order).
- External Randomization: Events from all generated sessions are shuffled before being processed. This simulates out-of-order data arrival, testing the pipeline's ability to handle event-time processing.
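The rules above can be sketched in a few dozen lines of Java. This is an illustrative sketch, not the code in EventGenerator.java: the class `SessionGeneratorSketch`, the `Event` record and its fields, the item and session naming, and the use of integer division to derive the session count from the requested event count are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical generator that follows the session rules described above. */
class SessionGeneratorSketch {
    static final int EVENTS_PER_SESSION = 30;
    static final int SHOPPING_EVENTS = 28;

    // Assumed event shape; itemId is null for payment and submit events.
    record Event(String sessionId, String type, String itemId, long timestamp) {}

    static List<Event> generateSession(String sessionId, long startMillis, Random rnd) {
        List<Event> events = new ArrayList<>();
        List<String> cart = new ArrayList<>();
        long ts = startMillis;
        int nextItem = 0;
        for (int i = 1; i <= SHOPPING_EVENTS; i++) {
            ts += 1 + rnd.nextInt(1000); // always advances, so timestamps strictly increase
            boolean lastShoppingEvent = i == SHOPPING_EVENTS;
            // Only remove what is in the cart, and never empty the cart right before checkout.
            boolean canRemove = !cart.isEmpty() && !(lastShoppingEvent && cart.size() == 1);
            if (canRemove && rnd.nextBoolean()) {
                String item = cart.remove(rnd.nextInt(cart.size()));
                events.add(new Event(sessionId, "REMOVE_FROM_CART", item, ts));
            } else {
                String item = "item-" + nextItem++;
                cart.add(item);
                events.add(new Event(sessionId, "ADD_TO_CART", item, ts));
            }
        }
        ts += 1 + rnd.nextInt(1000);
        events.add(new Event(sessionId, "ADD_PAYMENT", null, ts));  // event 29
        ts += 1 + rnd.nextInt(1000);
        events.add(new Event(sessionId, "SUBMIT_ORDER", null, ts)); // event 30
        return events;
    }

    static List<Event> generate(int numEvents, long seed) {
        Random rnd = new Random(seed);
        int sessions = numEvents / EVENTS_PER_SESSION; // assumption: integer division
        List<Event> all = new ArrayList<>();
        long base = 1_700_000_000_000L; // arbitrary epoch millis for the first session
        for (int s = 0; s < sessions; s++) {
            // Start times are 500 ms apart, so sessions overlap in time.
            all.addAll(generateSession("session-" + s, base + s * 500L, rnd));
        }
        Collections.shuffle(all, rnd); // global shuffle simulates out-of-order arrival
        return all;
    }
}
```

Calling `SessionGeneratorSketch.generate(300, 42L)` yields 10 sessions of 30 events each in shuffled order. Sorting any one session's events by timestamp recovers the shopping, payment, and completion phases.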
- Java 25
- Gradle 9.1+
Run unit tests to verify the core business logic of the pipeline.
    ./gradlew test

To run the pipeline locally, use the DirectRunner. The pipeline generates events internally, processes them, and writes the output to stdout.
Pass arguments to the pipeline to set the number of events to generate and the state base directory:

    ./gradlew run --args="--numEvents=300 --stateBaseDir=/tmp/beam-state"

Check the output in stdout to see the merged sessions (Orders). Failures are logged to stderr.
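To make "merged sessions" concrete, here is a small sketch of how a session's events could be folded into an order when they arrive out of order. It is a batch simplification under assumed names (`OrderMergeSketch`, `Event`, `Order` and their fields are hypothetical); the repository's MergeFn works with Beam state and timers and may merge differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical merge: replay one session's events in event-time order. */
class OrderMergeSketch {
    record Event(String type, String itemId, long timestamp) {}
    record Order(Set<String> items, boolean paid, boolean submitted) {}

    static Order merge(List<Event> arrived) {
        // Sort a copy by event timestamp so arrival order does not affect the result.
        List<Event> ordered = new ArrayList<>(arrived);
        ordered.sort(Comparator.comparingLong(Event::timestamp));

        Set<String> items = new LinkedHashSet<>();
        boolean paid = false;
        boolean submitted = false;
        for (Event e : ordered) {
            switch (e.type()) {
                case "ADD_TO_CART" -> items.add(e.itemId());
                case "REMOVE_FROM_CART" -> items.remove(e.itemId());
                case "ADD_PAYMENT" -> paid = true;
                case "SUBMIT_ORDER" -> submitted = true;
                default -> throw new IllegalArgumentException("Unknown event type: " + e.type());
            }
        }
        return new Order(items, paid, submitted);
    }
}
```

The sort is what matters here: if a REMOVE_FROM_CART event for an item arrives before its ADD_TO_CART event, replaying in arrival order would leave the item in the cart, while replaying in timestamp order removes it as intended.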
The tests in src/test/java/com/google/cloud/pso/BeamCollegeDemoTest.java are specifically designed to be run step-by-step with a debugger to help you understand how the pipeline works.
This is a standard Java Gradle project, so you can use your favorite IDE or debugger (e.g., VS Code, IntelliJ IDEA) without needing any cloud environment.
- Set Breakpoints:
  - Open src/main/java/com/google/cloud/pso/transform/MergeFn.java.
  - Put a breakpoint inside the processElement method (around line 89).
  - Put a breakpoint inside the onTimer method (around line 140).
- Run Tests in Debug Mode:
  - Run the tests in BeamCollegeDemoTest.java using your IDE's debugger.
  - For example, in VS Code, you can use the "Debug Test" codelens above the test methods.
- Observe Variables:
  - When the breakpoint hits, inspect the local variables like sessionId, newEvent, and currentStateJson.
  - Observe how MergeFn reads from internal state, updates it, and interacts with the external state store.
- Inspect State Files:
  - While the tests are running, MergeFn will offload state to a file-based cache when the timer fires.
  - In the unit tests, these files are written to a temporary folder (check the value of stateBaseDir in the debugger to see the path).
  - If you run the pipeline locally using the ./gradlew run command (mentioned above), the state files will be written to /tmp/beam-state (or whatever you specified in --stateBaseDir). You can inspect these files to see the JSON serialized state of the sessions.
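If you prefer to dump the state files programmatically rather than opening them one by one, a small helper like the one below may be convenient. It is a sketch that assumes nothing about the file names or directory layout MergeFn uses: it walks the given directory and prints every regular file it finds. The class name `StateFileInspectorSketch` is hypothetical, and the default path is the one from the run command above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper that prints every file found under a state base directory. */
class StateFileInspectorSketch {

    /** Returns all regular files under the directory, sorted by path; empty if it does not exist. */
    static List<Path> listStateFiles(Path stateBaseDir) {
        if (!Files.isDirectory(stateBaseDir)) {
            return List.of();
        }
        // Files.walk returns a lazy stream that must be closed, hence try-with-resources.
        try (Stream<Path> paths = Files.walk(stateBaseDir)) {
            return paths.filter(Files::isRegularFile).sorted().toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "/tmp/beam-state");
        for (Path file : listStateFiles(dir)) {
            System.out.println("== " + file);
            System.out.println(Files.readString(file)); // state is expected to be JSON text
        }
    }
}
```

With a recent JDK this can be run directly as a single source file, for example `java StateFileInspectorSketch.java /tmp/beam-state`, assuming the file is saved under that name.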