Example selectors choose k best training examples to guide LLM in Few-shot Strategy.
class BaseSelector(ABC):
@abstractmethod
def select_examples(
self,
question: str,
k: int,
db_id: Optional[str] = None
) -> List[Dict[str, Any]]Input:
question: New question for SQL generationk: Number of examples to selectdb_id: Optional - filter by database
Output:
- List of k examples, each containing:
{question, query, db_id, ...}
| Selector | Complexity | Speed | Requires | Best For |
|---|---|---|---|---|
| Random | ⚡⚡⚡ | Nothing | Baseline | |
| DICL | ⚡⚡ | Pre-computed embeddings | Semantic similarity | |
| ASTRES | ⚡ | SQL parsing | Structural matching | |
| Skill-KNN |
|
⚡⚡ | LLM calls | Task-aware |
| ViR2 | ⚡⚡ | Pre-computed embeddings | Best overall |
Randomly selects k examples from training pool.
Input: Training pool P, k examples
Output: k random examples
1. If db_id specified:
P' = filter P by db_id
Else:
P' = P
2. Return random.sample(P', k)
- ✅ Quick baseline
- ✅ Sanity check
- ✅ No preprocessing available
- ❌ Production use
python vipersql.py \
--strategy few-shot \
--example-selection-strategy random \
--few-shot-examples 3- ⚡ Fastest
- 🎯 Simple
- 📦 No preprocessing
- ❌ No relevance
- ❌ Poor accuracy
- ❌ Inconsistent results
DICL = Domain-Independent Context Learning
Selects examples based on semantic similarity (PhoBERT/BERT embeddings).
Input: Question q, Candidate pool C, k examples
Output: k most similar examples
1. Encode q → embedding e_q using PhoBERT/BERT
2. For each candidate c in C:
Compute similarity: sim(e_q, e_c)
3. Sort candidates by similarity descending
4. Return top-k candidates
Cosine similarity in embedding space.
- ✅ Semantic relevance needed
- ✅ Have pre-computed embeddings
- ✅ Fast selection required
- ❌ Need syntactic matching
- ❌ Need diversity
python vipersql.py \
--strategy few-shot \
--example-selection-strategy dicl \
--few-shot-examples 3# Build DICL candidates with pre-computed embeddings
python scripts/build_dicl_candidates.py \
--dataset-path dataset/ViText2SQL \
--level stdGenerates: dataset/ViText2SQL/std-level/dicl_candidates.json
- 🎯 Semantic relevance
- ⚡ Fast (pre-computed embeddings)
- 📊 Better than random
- ❌ Only semantic (ignores structure)
- ❌ No diversity
- ❌ May select redundant examples
ASTRES = AST-based Retrieval and Example Selection
Selects examples based on SQL structure similarity via Abstract Syntax Trees.
Input: Question q, Candidate pool C, k examples, Schema
Output: k structurally similar examples
1. Generate zero-shot SQL for q → sql_q
2. Parse sql_q to AST → ast_q
3. For each candidate c in C:
Parse c.query to AST → ast_c
Compute AST similarity: sim(ast_q, ast_c)
4. Sort by AST similarity
5. Return top-k
Based on tree edit distance:
- ✅ Structural similarity important
- ✅ Have SQL parser
- ✅ English queries (designed for English)
- ❌ Vietnamese (not optimized)
- ❌ Speed critical
python vipersql.py \
--strategy few-shot \
--example-selection-strategy astres \
--few-shot-examples 3- 🌳 Structure-aware
- 📐 Good for similar query patterns
- 🎯 Matches SQL complexity
- ⏱️ Slow (parsing overhead)
- 🇬🇧 English-centric
- 🔧 Requires zero-shot generation first
- ❌ Not optimized for Vietnamese
Extract "SQL skills" from queries using LLM, then match by skill similarity.
Input: Question q, Candidate pool C (with skills), k examples
Output: k skill-matched examples
1. Extract skills from q using LLM:
Prompt: "What SQL skills are needed for: {q}?"
→ skills_q = "JOIN, aggregation, GROUP BY"
2. Encode skills_q → embedding e_skills
3. For each candidate c in C:
Compute similarity: sim(e_skills, e_skills_c)
4. Sort by skill similarity
5. Return top-k
- "Simple SELECT"
- "JOIN two tables"
- "Aggregation (COUNT/SUM/AVG)"
- "GROUP BY"
- "Nested SELECT"
- "HAVING clause"
- "ORDER BY"
- ✅ Task-aware selection
- ✅ Good accuracy
- ✅ Interpretable (can see skills)
- ❌ Budget tight (requires LLM calls)
- ❌ Speed critical
python vipersql.py \
--strategy few-shot \
--example-selection-strategy skill_knn \
--few-shot-examples 3# Extract skills for training data using LLM
python scripts/skill_knn_preprocessing.py \
--num_samples 1000 \
--delay 1.0Generates: dataset/ViText2SQL/std-level/skill_knn_train.json
Warning: Expensive! Calls LLM for every training example.
- 🧠 Task-aware
- 📊 Good accuracy
- 🔍 Interpretable skills
- 🎯 Matches query complexity
- 💰 Expensive preprocessing
- ⏱️ Slow selection
- 🎲 Depends on skill extraction quality
- 🔧 Requires LLM for new questions
ViR2 = Vietnamese Retrieval and Re-ranking
Two-stage selection:
- Stage 1: PhoBERT semantic retrieval → top-M candidates
- Stage 2: Beam search re-ranking with POS + diversity → k examples
Input: Question q, Meaning Pool P, Parameters (M, B, k, λ)
Output: k optimal examples
Stage 1: Semantic Retrieval
├─ Encode q using PhoBERT
├─ Compute similarity with all examples in P
└─ Select top-M candidates → C
Stage 2: Beam Search Re-ranking
├─ Initialize beams = [[]]
├─ For i in [1, k]:
│ ├─ For each beam:
│ │ ├─ Try adding each candidate from C
│ │ └─ Score = POS_match + λ * diversity
│ └─ Keep top-B beams
└─ Return best beam
where:
- ✅ Best accuracy (recommended)
- ✅ Vietnamese Text-to-SQL
- ✅ Have pre-computed embeddings
- ✅ Can afford medium complexity
- ❌ Extremely simple queries (use random)
Default:
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2 \
--few-shot-examples 3Custom:
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2 \
--few-shot-examples 5 \
--vir2-candidate-pool-size 100 \
--vir2-beam-size 10 \
--vir2-diversity-weight 0.5| Parameter | Default | Range | Description |
|---|---|---|---|
M (pool size) |
50 | [10, 200] | Stage 1 candidates |
B (beam size) |
5 | [1, 20] | Beam search width |
λ (diversity) |
0.3 | [0, 1] | Diversity weight |
k (examples) |
3 | [1, 10] | Final examples |
# Use DICL candidates as meaning pool
python scripts/build_dicl_candidates.py \
--dataset-path dataset/ViText2SQL \
--level stdGenerates: dataset/ViText2SQL/std-level/dicl_candidates.json
ViR2 loads this file automatically.
- 🏆 Best accuracy
- 🇻🇳 Vietnamese-optimized (PhoBERT + underthesea)
- 🎯 Combines semantic + syntactic
- 📊 Diversity optimization
- ⚡ Fast (pre-computed embeddings)
- 🔧 More complex than DICL
- ⚙️ Hyperparameters to tune
- 📦 Requires preprocessing
Extension of ViR2 with automatic language detection and language-specific models.
- Vietnamese: PhoBERT + underthesea POS
- English: BERT + spaCy POS
Input: Question q (any language)
Output: k examples
1. Detect language: lang = detect_language(q)
→ "vi" or "en"
2. Stage 1: Semantic retrieval
- If lang == "vi": Use PhoBERT
- If lang == "en": Use BERT
→ Top-M candidates
3. Stage 2: Beam search
- If lang == "vi": Use underthesea POS
- If lang == "en": Use spaCy POS
→ k examples
- ✅ Multi-language datasets (Vietnamese + English)
- ✅ BIRD dataset (has both languages)
- ✅ Want single unified system
- ❌ Single language only (use regular ViR2)
# Auto-detect language
python vipersql.py \
--strategy few-shot \
--example-selection-strategy multilang-vir2 \
--samples 100
# Force Vietnamese
python vipersql.py \
--strategy few-shot \
--example-selection-strategy multilang-vir2 \
--language vi \
--samples 100
# Force English
python vipersql.py \
--strategy few-shot \
--example-selection-strategy multilang-vir2 \
--language en \
--samples 100- 🌍 Multi-language support
- 🔄 Unified architecture
- 🎯 Language-specific optimization
- 🔧 More complex
- 📦 Requires both language models
- 💾 More memory (2 models loaded)
To test contribution of each component in ViR2:
Remove: POS matching component
Scoring: Only semantic similarity + diversity
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2-no-posRemove: Diversity optimization
Scoring: Only POS matching
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2-no-diversityRemove: Beam search
Method: Greedy top-k selection
python vipersql.py \
--strategy few-shot \
--example-selection-strategy vir2-no-beam-searchSee Ablation Experiments for details.
| Selector | Time Complexity | Space Complexity |
|---|---|---|
| Random | ||
| DICL | ||
| ASTRES | ||
| Skill-KNN | ||
| ViR2 |
where:
-
$n$ = training pool size -
$d$ = embedding dimension (768) -
$m$ = average SQL length -
$M$ = candidate pool size (50) -
$B$ = beam size (5) -
$k$ = final examples (3)
| Selector | Selection Time |
|---|---|
| Random | ~0.001s |
| DICL | ~0.05s |
| ASTRES | ~2.0s |
| Skill-KNN | ~0.1s + LLM call |
| ViR2 | ~0.2s |
Results are illustrative:
| Selector | Exact Match | Component F1 |
|---|---|---|
| Random | Baseline | Baseline |
| DICL | +5-10% | +5-10% |
| ASTRES | +3-7% | +3-7% |
| Skill-KNN | +6-12% | +6-12% |
| ViR2 | +10-15% | +10-15% |
# Random baseline
python vipersql.py --strategy few-shot --example-selection-strategy random --samples 10# ViR2 (best accuracy)
python vipersql.py --strategy few-shot --example-selection-strategy vir2 --samples 1000# DICL (good accuracy, fast)
python vipersql.py --strategy few-shot --example-selection-strategy dicl --samples 1000# Multi-language ViR2
python vipersql.py --strategy few-shot --example-selection-strategy multilang-vir2 --samples 1000# Selector defaults
EXAMPLE_SELECTION_STRATEGY=vir2
FEW_SHOT_EXAMPLES=3
# ViR2 parameters
VIR2_CANDIDATE_POOL_SIZE=50
VIR2_BEAM_SIZE=5
VIR2_DIVERSITY_WEIGHT=0.3Solution: Use semantic selector (DICL or ViR2)
Solution: Use ViR2 (has diversity optimization)
Solutions:
- Reduce
M(candidate pool size) - Reduce
B(beam size) - Use DICL instead
Solution:
python scripts/build_dicl_candidates.py \
--dataset-path dataset/ViText2SQL \
--level std- ViR2 Method - Detailed ViR2 algorithm and math
- Strategies - Few-shot strategy usage
- Ablation Experiments - Testing framework
- Configuration - All parameters