Skip to content

Latest commit

 

History

History
142 lines (104 loc) · 5.85 KB

File metadata and controls

142 lines (104 loc) · 5.85 KB

B+Tree Duplicate Value Handling and Node Management Guide

serializable-bptree supports mapping multiple keys to the same value. This guide provides technical instructions for maintaining system stability and performance when large amounts of duplicate data occur.


1. Understanding Internal Mechanics

Instead of creating new nodes when handling duplicate values, this library stores keys in an array within a single entry.

📊 Data Structure Visualization

graph TD
    Node["B+Tree Leaf Node"]
    Entry1["Value: A<br/>Keys: [K101, K102, K105...]"]
    Entry2["Value: B<br/>Keys: [K103]"]
    Entry3["Value: C<br/>Keys: [K104, K106]"]

    Node --> Entry1
    Node --> Entry2
    Node --> Entry3
Loading
  • Insertion Logic: When inserting new data, if the value already exists, the new key is added to the keys array at that index.
  • Key Characteristic: While efficient for a small number of duplicates, this approach can cause the array size to grow indefinitely if data concentrates on a specific value.

2. Issues with Data Concentration (Node Bloat)

When a specific value has thousands of duplicates, they are all stored within a single entry in one node. This does not trigger the B+Tree's node splitting mechanism, causing data to concentrate in a single node.

Warning

Data concentration creates a performance bottleneck for every insertion.

Category Description Impact
I/O Bottleneck Every time a single key is inserted into a bloated value, the entire multi-MB node file must be re-written. Disk write performance (IOPS) drops significantly as the duplication increases.
Serialization Lag The massive array must be processed when converting the entire node to JSON for storage. Sharp increase in CPU usage and latency for every update.
Structural Limits B+Tree's automatic splitting only works at the node level, not inside entries. Leads to imbalances where specific nodes become massive while others remain small.

3. Recommended Solution: Composite Value Strategy

The most effective solution is to assign uniqueness to the Value, distributing data across multiple entries. This activates the B+Tree's native splitting mechanism.

✅ Implementation Example: Composite Comparator

Combine the sorting value (v) with a key (k) that ensures data uniqueness.

// 1. Define Composite Value Interface
interface MyValue {
  k: number  // Key for uniqueness (e.g., ID, timestamp)
  v: number  // Actual data used for querying/sorting
}

// 2. Implement Composite Comparator
class MyValueComparator extends ValueComparator<MyValue> {
  // Actual sorting logic
  asc(a: MyValue, b: MyValue): number {
    const diff = (+a.v) - (+b.v)
    // If the primary sort (v) is the same, use the unique key (k) to distribute data.
    return diff === 0 ? (a.k - b.k) : diff
  }

  // Define primary comparison group (used for primaryEqual queries)
  primaryAsc(a: MyValue, b: MyValue): number {
    return (+a.v) - (+b.v)
  }

  match(v: MyValue): string {
    return v.v.toString()
  }
}

📈 After Adopting Composite Values

graph TD
    Node1["Node 1 (v:1~v:3)"]
    Node2["Node 2 (v:3~v:4)"]
    
    EntryA1["Value: {v:3, k:1}<br/>Keys: [K101]"]
    EntryA2["Value: {v:3, k:2}<br/>Keys: [K102]"]
    EntryA3["Value: {v:3, k:3}<br/>Keys: [K105]"]

    Node1 --> EntryA1
    Node1 --> EntryA2
    Node2 --> EntryA3
    
    style EntryA1 fill:#f9f,stroke:#333
    style EntryA2 fill:#f9f,stroke:#333
    style EntryA3 fill:#f9f,stroke:#333
Loading

Tip

Even if values are the same (v:3), they are treated as distinct entries due to the unique key (k), allowing normal page splits to occur when a node becomes full.


4. Efficient Query Methods

Since composite values distribute entries, queries must account for this structure.

Method 1: Use Primary Operators (Highly Recommended)

If primaryAsc is defined, you can find all related data using the primary sorting group. Unlike standard range queries, these operators are specifically optimized to search within composite value structures.

// Query all data where v is 3
const results = await tree.where({
  primaryEqual: { v: 3 } as Partial<MyValue>
})

// Query all data where v is 5 or more
const results = await tree.where({
  primaryGte: { v: 5 } as Partial<MyValue>
})

All standard comparison and logical operators have primary-based equivalents:

  • primaryGt, primaryGte, primaryLt, primaryLte, primaryEqual, primaryNotEqual, primaryOr.

Method 2: Use Range Queries

Explicitly specify a range to collect data.

Condition Start Value (gte) End Value (lte) Remarks
All for a specific value { v: 3, k: -Infinity } { v: 3, k: Infinity } Useful for total surveys of duplicate data
After a specific point { v: 3, k: 100 } { v: 3, k: Infinity } Advantageous for pagination

Caution

While range queries are simple and flexible, they may suffer from performance issues compared to primaryEqual:

  • Scan Range Issues: Unnecessary node scans may occur while searching for the exact start and end boundaries.
  • Redundant Processing: Performance may degrade due to additional verification and filtering processes within the range, which are less optimized than the internal logic of primaryEqual.

5. Additional Optimization Tips

  • Lightweight Keys: Use integers or short hash values instead of strings to reduce the size of the key array.
  • Data Normalization: For data with extremely high duplication (e.g., gender 'M/F'), consider linking to separate external storage using a Group ID rather than inserting directly into the B+Tree.