Project structure:
.
├── data/
│ └── 2026_diavgeia.csv
├── src/
│ └── fetch_diavgeia.py
├── state/
│ └── state.json
├── logs/
│ └── fetch_runs.csv
└── fetch_diavgeia.py
- src/fetch_diavgeia.py: core fetch + enrich + persist logic
- data/2026_diavgeia.csv: dataset output
- state/state.json: incremental fetch checkpoint
- logs/fetch_runs.csv: run history (timestamp, fetched count, CSV update flag, success/error)
- fetch_diavgeia.py: root launcher for backward-compatible execution
Run:
python fetch_diavgeia.py
The live public.current_fires dataset and the homepage fire ticker are sourced from the Hellenic Fire Service live incidents page:
- URL: https://www.fireservice.gr/el/energa-symvanta/
- Scraper: src/scrape_forest_fires.py
- Scope used in the frontend: active fires only, excluding rows with status ΛΗΞΗ
One-command local fetch + git sync:
./scripts/run_fetch_and_sync.sh
Script behavior:
- auto-commits any existing local changes first
- pulls latest origin/main with rebase
- runs fetch_diavgeia.py (PDF download is disabled by default)
- runs fetch_kimdis_procurements.py, src/fetch_copernicus.py, and DB ingest when enabled
- automatic DB ingest excludes the static fund table; reload it only intentionally
- the ERD DB ingest is incremental for public.procurement; it does not make the DB an exact mirror of data/raw_procurements.csv unless the procurement stack is reset first
- the default ERD DB ingest also runs an excluded-keyword prune after payment upsert; use the manual --skip-prune-excluded-procurements mode below for restore/recovery loads where rows must be inserted without any destructive keyword deletion
- does not run locate_work; that step is separate via src/run_locate_work_updates.py
- commits changed artifacts (data/, state/, logs/)
- pushes to origin/main
Useful flags for run_fetch_and_sync.sh:
- DOWNLOAD_DIAVGEIA_PDFS=1: enable Diavgeia PDF download + parse during fetch step
- RUN_DB_INGEST=1: run DB ingestion scripts (including ingest_raw_procurements.py)
- REBUILD_ORG_MAPPINGS=1: rebuild org_to_municipality*.csv from rules (disabled by default to preserve curated mappings)
Examples:
./scripts/run_fetch_and_sync.sh
DOWNLOAD_DIAVGEIA_PDFS=1 ./scripts/run_fetch_and_sync.sh
RUN_DB_INGEST=1 ./scripts/run_fetch_and_sync.sh
REBUILD_ORG_MAPPINGS=1 ./scripts/run_fetch_and_sync.sh
DOWNLOAD_DIAVGEIA_PDFS=1 RUN_DB_INGEST=1 ./scripts/run_fetch_and_sync.sh
Use scripts/delete_procurements_by_keywords.py to remove contracts from public.procurement when selected text fields contain specific keywords.
Default behavior is dry-run only. The script prints a JSON preview with:
- matched procurement count
- affected dependent rows (payment, payment_beneficiary, cpv, diavgeia_procurement, works)
- a sample preview of matching contracts
Add --apply only after verifying the preview.
Examples:
./.fireprotection/bin/python scripts/delete_procurements_by_keywords.py \
--keyword καθαρισμός \
--keyword αποψίλωση
./.fireprotection/bin/python scripts/delete_procurements_by_keywords.py \
--keywords-file data/keywords/to_remove.txt \
--match-mode all \
--apply
Supported options:
- --keyword <value>: repeat for multiple keywords or phrases
- --keywords-file <path>: load one keyword/phrase per line, # comments ignored
- --column <name>: search specific columns; default is title and short_descriptions
- --match-mode any|all: match any keyword or require all keywords
- --preview-limit <N>: limit preview rows in JSON output
- --apply: execute the delete instead of previewing it
Text matching is normalized before comparison:
- not case-sensitive
- ignores Greek tonos/diacritics
- normalizes final sigma (ς -> σ)
- ignores spaces and special characters such as -, /, _, and punctuation
This means values like καθαρισμός, Καθαρισμος, and κα-θα/ρι_σμός are treated as equivalent during matching.
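The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the function name normalize_for_match is hypothetical, and it assumes that "ignores spaces and special characters" means dropping every non-alphanumeric character.

```python
import unicodedata


def normalize_for_match(text: str) -> str:
    """Hypothetical helper mirroring the documented matching rules."""
    # Lowercase, then decompose so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (Greek tonos and other diacritics).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat final sigma and regular sigma as the same letter.
    unified = no_marks.replace("ς", "σ")
    # Assumption: spaces, hyphens, slashes, underscores and punctuation are all removed.
    return "".join(ch for ch in unified if ch.isalnum())


variants = ["καθαρισμός", "Καθαρισμος", "κα-θα/ρι_σμός"]
normalized = {normalize_for_match(v) for v in variants}
```

Under these assumptions all three variants collapse to a single normalized form, which is what makes them equivalent during matching.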
PDFs are stored locally in pdf/ (excluded from git via .gitignore).
Each filename is derived from documentUrl as the code after /doc/, with .pdf suffix.
Example: https://diavgeia.gov.gr/doc/9ΚΠΣΩ1Ε-ΕΑ0 -> pdf/9ΚΠΣΩ1Ε-ΕΑ0.pdf.
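The filename rule can be expressed as a small helper. This is a sketch under stated assumptions: pdf_path_for is a hypothetical name, and the real pipeline may handle malformed URLs differently.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def pdf_path_for(document_url: str, pdf_dir: str = "pdf") -> Path:
    """Hypothetical helper: map a documentUrl to its local PDF path."""
    # Work on the URL path only, decoding any percent-encoded characters.
    path = unquote(urlsplit(document_url).path)
    if "/doc/" not in path:
        raise ValueError(f"no /doc/ segment in URL: {document_url}")
    # The code is whatever follows /doc/.
    code = path.split("/doc/", 1)[1].strip("/")
    return Path(pdf_dir) / f"{code}.pdf"


example = pdf_path_for("https://diavgeia.gov.gr/doc/9ΚΠΣΩ1Ε-ΕΑ0")
```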
The pipeline does two steps:
- download missing PDFs from documentUrl
- parse local PDFs and build data/pdf_pages_dataset.csv
Current parser output is one row per PDF (not one row per page):
- ada
- file_name
- page_count
- text (all pages concatenated)
- text_length
- parse_error
Base command:
./.fireprotection/bin/python src/pdf_pipeline.py
- Full pipeline (download missing PDFs + build dataset)
./.fireprotection/bin/python src/pdf_pipeline.py
What it does:
- reads source records from data/2026_diavgeia.csv
- downloads only missing PDFs into pdf/
- parses local PDFs
- writes aggregated text dataset to data/pdf_pages_dataset.csv
- appends run stats to logs/pdf_pipeline_runs.csv
- Download only (no parsing / no dataset rebuild)
./.fireprotection/bin/python src/pdf_pipeline.py --download-only
What it does:
- reads documentUrl values from the source CSV
- downloads only PDFs that do not already exist in pdf/
- skips dataset generation
- still logs the run
- Build only (parse existing local PDFs only)
./.fireprotection/bin/python src/pdf_pipeline.py --build-only
What it does:
- does not download anything
- parses PDFs already present in pdf/
- rewrites data/pdf_pages_dataset.csv
- logs parsing counters
- Test on a small subset (--limit)
./.fireprotection/bin/python src/pdf_pipeline.py --limit 100
What it does:
- limits both download scanning and PDF parsing to the first 100 records/files
- useful for smoke tests and debugging
- Faster local parsing with multiple workers (--workers)
./.fireprotection/bin/python src/pdf_pipeline.py --build-only --workers 4
What it does:
- parses PDFs in parallel (processes)
- speeds up the build step on multi-core machines
- only affects the parsing/build step (downloads are still sequential)
- Increase HTTP timeout for slow downloads (--timeout)
./.fireprotection/bin/python src/pdf_pipeline.py --download-only --timeout 120
What it does:
- increases PDF download read timeout (seconds)
- useful for slow network responses / large files
- Use custom input/output paths
./.fireprotection/bin/python src/pdf_pipeline.py \
--source-csv data/2026_diavgeia.csv \
--pdf-dir pdf \
--pages-dataset data/pdf_pages_dataset.csv
What it does:
- overrides default source CSV / PDF storage directory / output dataset path
- Common combinations
Download a subset only:
./.fireprotection/bin/python src/pdf_pipeline.py --download-only --limit 50
Build a subset with parallel parsing:
./.fireprotection/bin/python src/pdf_pipeline.py --build-only --limit 200 --workers 6
Run full pipeline with custom timeout and parallel build:
./.fireprotection/bin/python src/pdf_pipeline.py --timeout 120 --workers 4
Supported options:
- --source-csv <path>: source CSV containing at least ada and documentUrl
- --pdf-dir <path>: local folder where PDFs are stored/read
- --pages-dataset <path>: output CSV path for parsed PDF text dataset (one row per PDF)
- --limit <N>: process only the first N records/files (useful for testing)
- --workers <N>: number of worker processes for PDF parsing (--build-only or full run build step)
- --download-only: run only the download step
- --build-only: run only the parsing/dataset build step
- --timeout <seconds>: HTTP read timeout for PDF downloads (connect timeout is fixed at 10s)
Note:
- --download-only and --build-only are mutually exclusive (cannot be used together)
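One common way to enforce mutually exclusive flags in Python is an argparse mutually exclusive group. The sketch below is only an illustration: it assumes the pipeline uses argparse (the source does not say so), build_parser is a hypothetical name, and the None defaults are placeholders rather than the script's real defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the documented flags."""
    parser = argparse.ArgumentParser(description="PDF pipeline CLI sketch")
    # argparse rejects any command line that sets both flags of this group.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--download-only", action="store_true",
                      help="run only the download step")
    mode.add_argument("--build-only", action="store_true",
                      help="run only the parsing/dataset build step")
    # Placeholder defaults; the real defaults are not stated in this document.
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--workers", type=int, default=None)
    parser.add_argument("--timeout", type=float, default=None)
    return parser


args = build_parser().parse_args(["--build-only", "--workers", "4"])
```

With this setup, passing both --download-only and --build-only makes argparse print a usage error and exit with a non-zero status.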
Every run appends one row to logs/pdf_pipeline_runs.csv, including:
- download counters (records_scanned, downloaded, skipped_existing, skipped_missing_url, failed_downloads)
- parsing counters (pdf_files_seen, parsed_pdfs, parsed_pages, parse_errors)
- success
- error_message
Relevance filtering (src/filter_relevance.py) is a separate local-only post-processing step that runs after:
- fetch_diavgeia.py (raw records + decision-type enrichments, optional PDF embed)
- pdf_pipeline.py (download + parse PDFs)
It checks whether each record is relevant to forest-fire prevention/suppression using:
- the decision subject
- the parsed PDF text (looked up by ada)
If at least one keyword is found in either source, the row is marked relevant and included in the filtered dataset.
This step depends on local PDF artifacts and parsed PDF text, which are large and operationally unsuitable for GitHub Actions in this project.
Local-only components:
- PDF downloading (src/pdf_pipeline.py)
- PDF parsing (src/pdf_pipeline.py)
- Relevance filtering (src/filter_relevance.py)
src/filter_relevance.py includes a CI guard and will refuse to run in CI/GitHub Actions unless explicitly overridden with --allow-ci.
For each row in data/2026_diavgeia.csv:
- subject_match = any(keyword in normalized(subject))
- pdf_match = any(keyword in normalized(pdf_text_for_same_ada))
- is_relevant = subject_match OR pdf_match
No scoring / ranking is used.
To avoid inflating the raw dataset or doing a heavy merge:
- the script reads data/pdf_pages_dataset.csv using only columns ada and text
- builds an in-memory lookup dictionary: ada -> text
- checks PDF text per row using ada
This means:
- no large text join into data/2026_diavgeia.csv
- no duplication of PDF text inside the raw dataset
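The lookup approach can be sketched with the standard csv module. This is an assumption-laden illustration: load_pdf_text_lookup is a hypothetical name, and the actual script may read the file with a different library.

```python
import csv
import io


def load_pdf_text_lookup(handle) -> dict:
    """Hypothetical helper: build an ada -> text dictionary from the parsed-PDF CSV."""
    lookup = {}
    for row in csv.DictReader(handle):
        ada = (row.get("ada") or "").strip()
        if ada:
            # Only ada and text are kept; other columns are ignored.
            lookup[ada] = row.get("text") or ""
    return lookup


# In-memory stand-in for data/pdf_pages_dataset.csv (sample values are made up).
sample = io.StringIO(
    "ada,file_name,page_count,text\n"
    "ABC-1,ABC-1.pdf,2,hello\n"
    "XYZ-9,XYZ-9.pdf,1,\n"
)
pdf_text_by_ada = load_pdf_text_lookup(sample)
```

Each raw row can then fetch its PDF text with pdf_text_by_ada.get(ada), which avoids merging the large text column into the raw dataset.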
Both keywords and text (subject, PDF text) are normalized before matching:
- lowercase
- remove Greek tonos/diacritics
- normalize final sigma (ς -> σ)
- replace punctuation/symbols with spaces
- collapse multiple spaces
This allows matching regardless of:
- accents (e.g. δασικών vs δασικων)
- uppercase/lowercase
- punctuation differences
Inputs:
- data/2026_diavgeia.csv (raw dataset)
- data/pdf_pages_dataset.csv (parsed PDF text dataset, one row per PDF; must include ada, text)
Outputs:
- updates data/2026_diavgeia.csv by adding/updating relevance columns
- writes data/2026_diavgeia_filtered.csv (only is_relevant == True)
- appends run metrics to logs/relevance_filter_runs.csv
Database feed source:
data/2026_diavgeia_filtered.csv
- subject_match (True/False)
- pdf_match (True/False)
- pdf_available_for_filter (True/False)
  - True if parsed PDF text exists for that ada
  - False if no parsed PDF text is available (missing PDF / parse failure / no row)
- is_relevant (True/False)
- matched_keywords_subject
  - matched keyword(s) from subject
  - cleanup rule: [] -> empty, [x] -> x, [x,y] -> list
- matched_keywords_pdf
  - matched keyword(s) from PDF text
  - same cleanup rule as above
Contains:
- all columns from data/2026_diavgeia.csv
- only rows where is_relevant == True
Recommended use:
- use this file as the source for database ingestion
Base command (local):
./.fireprotection/bin/python src/filter_relevance.py
What it does:
- loads raw dataset
- loads parsed PDF text lookup by ada
- computes relevance columns in raw dataset
- writes filtered dataset
Custom paths:
./.fireprotection/bin/python src/filter_relevance.py \
--input-csv data/2026_diavgeia.csv \
--pdf-pages-dataset data/pdf_pages_dataset.csv \
--filtered-output data/2026_diavgeia_filtered.csv \
--log-csv logs/relevance_filter_runs.csv
Progress frequency:
./.fireprotection/bin/python src/filter_relevance.py --progress-every 100
CI override (not recommended):
./.fireprotection/bin/python src/filter_relevance.py --allow-ci
Each run appends one row including:
- run_started_at_local
- input_csv
- pdf_pages_dataset
- filtered_output_csv
- keywords_count
- rows_total
- rows_relevant
- rows_not_relevant
- rows_subject_match
- rows_pdf_match
- rows_pdf_available
- filtered_rows_written
- success
- error_message
The keyword list is defined in:
src/filter_relevance.py -> RELEVANCE_KEYWORDS
Update that list to refine recall/precision. After any keyword change, re-run the relevance filter locally to regenerate:
- data/2026_diavgeia.csv relevance columns
- data/2026_diavgeia_filtered.csv
Full local run order:
- ./.fireprotection/bin/python fetch_diavgeia.py
- ./.fireprotection/bin/python src/pdf_pipeline.py (or --build-only if PDFs already downloaded)
- ./.fireprotection/bin/python src/filter_relevance.py
Notes:
- If PDF text extraction failed (or the PDF is unavailable), pdf_match may be False even for a relevant record.
- This is why the filter checks both subject and PDF text.
- The raw dataset remains the audit source; the filtered dataset is the operational source for DB ingestion.
A workflow is included at .github/workflows/daily-fetch.yml and runs:
- every day at 03:00 UTC
- on manual trigger (workflow_dispatch)
To enable automation:
git add .
git commit -m "chore: setup daily Diavgeia automation"
git branch -M main
git remote add origin <your-github-repo-url>
git push -u origin main
Then in GitHub:
- Open the repository Settings -> Actions -> General
- Ensure actions are allowed and workflow permissions allow read/write
- Open the Actions tab and run Daily Diavgeia Fetch once manually
The initial relational schema is in sql/001_init_schema.sql.
It creates:
- organization (one organization to many records)
- record (main records, each linked to one organization)
- file (one-to-one with record via ada)
To create tables:
- Open your Supabase project
- Go to SQL Editor
- Paste and run the contents of sql/001_init_schema.sql
This section documents exactly how src/fetch_diavgeia.py handles data from Diavgeia and writes output to data/2026_diavgeia.csv.
- Data source endpoint: https://diavgeia.gov.gr/luminapi/api/search
- Query terms are controlled by KEYWORDS.
- Pagination is used (PAGE_SIZE=100).
- Incremental cutoff comes from state/state.json key last_fetch.
- If state/state.json does not exist:
  - The script tries to derive the latest timestamp from data/2026_diavgeia.csv.
  - If no CSV exists, it fetches all available data.
- After a successful run with new records, last_fetch is updated to the maximum fetched submissionTimestamp.
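The cutoff rules above can be sketched as two pure functions. This is a minimal illustration, not the script's code: resolve_cutoff and advance_state are hypothetical names, and the sketch assumes timestamps are stored in a format whose string ordering matches chronological ordering (for example ISO 8601).

```python
from typing import Iterable, Optional


def resolve_cutoff(state: Optional[dict],
                   csv_timestamps: Optional[Iterable[str]]) -> Optional[str]:
    """Pick the incremental cutoff: state first, then CSV, else None (fetch everything)."""
    # state is None when state/state.json does not exist.
    if state is not None and state.get("last_fetch"):
        return state["last_fetch"]
    # csv_timestamps is None when the CSV does not exist.
    if csv_timestamps is not None:
        stamps = [stamp for stamp in csv_timestamps if stamp]
        if stamps:
            return max(stamps)
    return None


def advance_state(state: Optional[dict], fetched_timestamps: Iterable[str]) -> dict:
    """Move last_fetch to the newest fetched submissionTimestamp, if any were fetched."""
    stamps = [stamp for stamp in fetched_timestamps if stamp]
    if not stamps:
        return dict(state or {})
    return {**(state or {}), "last_fetch": max(stamps)}
```

Keeping the decision separate from file I/O makes the three cases (state present, CSV only, nothing) easy to check in isolation.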
Each fetched batch is converted into a dataframe and enriched before save.
PDF download/parse during fetch is controlled by DOWNLOAD_DIAVGEIA_PDFS:
- default (0): skip PDF download/parse in fetch_diavgeia.py
- set to 1: download/parse PDFs and embed text/status columns in fetched rows
Main enrichments:
- org: extracted from organization.label
- org_type, org_name_clean: derived by organization classification
- decisionType: converted to label-only string
- thematicCategories: converted to list of label-only strings
- subject_has_anatrop_or_anaklis (True/False): derived boolean flag from subject
  - True when subject contains ανατροπ* or ανακλησ* (accent-insensitive)
- subject_has_budget_balance_report_terms (True/False): derived boolean flag from subject
  - True when subject contains προϋπολογισμ*, ισολογισμ*, or απολογισμ* (accent-insensitive)
- org_name_clean exclusion list (dataset scope cleanup)
  - rows whose normalized org_name_clean matches a configured blacklist are dropped from the dataset
  - applied both during fetch (API batch filtering) and before CSV save (safety net)
The script supports both:
- raw API dict/list values
- CSV stringified dict/list values (legacy rows)
The script uses robust parsing helpers to avoid crashes and inconsistent shapes:
- extract_org_label(value)
  - Handles dict payloads from API ({"label": ...}).
  - Handles stringified dicts from CSV.
- parse_structured_value(value)
  - Tries parsing dict/list represented as strings.
  - Supports both Python-literal style and JSON style.
- extract_label(value)
  - Normalizes single label fields (used by decisionType).
- extract_labels_list(value)
  - Normalizes list-like fields (used by thematicCategories).
  - Deduplicates while preserving original order.
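The parsing behavior described for these helpers can be approximated as follows. The functions below are independent sketches with hypothetical names (suffixed _sketch), not the implementations in src/fetch_diavgeia.py.

```python
import ast
import json


def parse_structured_sketch(value):
    """Return a dict/list from a raw value or its stringified form, else None."""
    if isinstance(value, (dict, list)):
        return value
    if not isinstance(value, str):
        return None
    text = value.strip()
    if not text or text[0] not in "[{":
        return None
    # Try JSON first, then Python-literal syntax (single quotes, None, True).
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, (dict, list)):
            return parsed
    return None


def labels_list_sketch(value):
    """Extract label strings from a list-like field, deduplicated in original order."""
    parsed = parse_structured_sketch(value)
    if isinstance(parsed, dict):
        parsed = [parsed]
    labels = []
    for item in parsed or []:
        label = item.get("label") if isinstance(item, dict) else item
        if isinstance(label, str) and label:
            labels.append(label)
    # dict.fromkeys keeps the first occurrence of each label and preserves order.
    return list(dict.fromkeys(labels))
```

ast.literal_eval is used rather than eval so that legacy stringified rows are parsed without executing arbitrary code.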
Classification is prefix-based and order-sensitive (ORG_PREFIXES).
Important safeguards:
- Prefixes are matched as whole tokens, not partial words.
  - This prevents bad truncation like ΔΗΜΟΤΙΚΟ -> ΤΙΚΟ.
- A special typo rule handles forms like ΔΗΜΟ ΑΡΓΟΥΣ without incorrectly matching ΔΗΜΟΤΙΚΟ ....
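Whole-token, order-sensitive prefix matching can be sketched like this. The prefix list is a shortened, illustrative subset (ORG_PREFIXES_SKETCH is a hypothetical name), the typo rule is not modelled, and returning the remainder as the cleaned name is an assumption.

```python
# Illustrative subset; multi-word prefixes come first because order matters.
ORG_PREFIXES_SKETCH = ["ΔΗΜΟΤΙΚΗ ΕΠΙΧΕΙΡΗΣΗ", "ΠΕΡΙΦΕΡΕΙΑ", "ΥΠΟΥΡΓΕΙΟ", "ΔΗΜΟΣ"]
FALLBACK_TYPE = "ΑΛΛΟΣ ΦΟΡΕΑΣ"


def classify_org_sketch(name: str):
    """Return (org_type, remainder) by comparing whole tokens, never substrings."""
    tokens = name.upper().split()
    for prefix in ORG_PREFIXES_SKETCH:
        prefix_tokens = prefix.split()
        # Token-wise comparison: a prefix only matches complete leading words.
        if tokens[:len(prefix_tokens)] == prefix_tokens:
            return prefix, " ".join(tokens[len(prefix_tokens):])
    return FALLBACK_TYPE, " ".join(tokens)
```

Because tokens are compared as complete words, an input such as ΔΗΜΟΤΙΚΟ ΩΔΕΙΟ cannot be split into a ΔΗΜΟ prefix plus a truncated ΤΙΚΟ remainder; in this sketch it falls through to the fallback type unchanged.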
Current explicit categories include:
- ΑΠΟΚΕΝΤΡΩΜΕΝΗ ΔΙΟΙΚΗΣΗ
- ΠΕΡΙΦΕΡΕΙΑΚΟ ΤΑΜΕΙΟ ΑΝΑΠΤΥΞΗΣ
- ΚΕΝΤΡΟ ΚΟΙΝΩΝΙΚΗΣ ΠΡΟΝΟΙΑΣ ΠΕΡΙΦΕΡΕΙΑΣ
- ΣΥΝΔΕΣΜΟΣ ΔΗΜΩΝ
- ΔΗΜΟΤΙΚΟ ΛΙΜΕΝΙΚΟ ΤΑΜΕΙΟ
- ΔΗΜΟΤΙΚΟ ΒΡΕΦΟΚΟΜΕΙΟ
- ΔΗΜΟΤΙΚΟ ΠΕΡΙΦΕΡΕΙΑΚΟ ΘΕΑΤΡΟ
- ΔΗΜΟΤΙΚΗ ΕΠΙΧΕΙΡΗΣΗ
- ΠΕΡΙΦΕΡΕΙΑ
- ΥΠΟΥΡΓΕΙΟ
- ΔΗΜΟΣ
- fallback: ΑΛΛΟΣ ΦΟΡΕΑΣ
After classification:
- text is converted to uppercase
- accents/diacritics are removed
Type-specific rules:
- For ΣΥΝΔΕΣΜΟΣ ΔΗΜΩΝ:
  - removes leading boilerplate:
    - ΓΙΑ ΤΗΝ ...
    - ΚΑΙ ΚΟΙΝΟΤΗΤΩΝ ΓΙΑ ΤΗΝ ...
- For ΥΠΟΥΡΓΕΙΟ:
  - applies conservative canonical mappings for known historical variants
  - examples:
    - ΠΕΡΙΒΑΛΛΟΝΤΟΣ, ΕΝΕΡΓΕΙΑΣ ΚΑΙ ΚΛΙΜΑΤΙΚΗΣ ΑΛΛΑΓΗΣ -> ΠΕΡΙΒΑΛΛΟΝΤΟΣ ΚΑΙ ΕΝΕΡΓΕΙΑΣ
    - ΥΠΟΔΟΜΩΝ, ΜΕΤΑΦΟΡΩΝ ΚΑΙ ΔΙΚΤΥΩΝ -> ΥΠΟΔΟΜΩΝ ΚΑΙ ΜΕΤΑΦΟΡΩΝ
    - ΕΣΩΤΕΡΙΚΩΝ ΚΑΙ ΔΙΟΙΚΗΤΙΚΗΣ ΑΝΑΣΥΓΚΡΟΤΗΣΗΣ -> ΕΣΩΤΕΡΙΚΩΝ
    - ΠΑΙΔΕΙΑΣ, ΕΡΕΥΝΑΣ ΚΑΙ ΘΡΗΣΚΕΥΜΑΤΩΝ -> ΠΑΙΔΕΙΑΣ ΚΑΙ ΘΡΗΣΚΕΥΜΑΤΩΝ
When appending new data:
- existing CSV rows are re-normalized, not just new rows
- this ensures old formatting/classification issues are corrected over time
- deduplication is then applied (drop_duplicates)
Each run appends one row to logs/fetch_runs.csv with:
- run_started_at_athens
- fetched_records
- rows_added
- csv_updated
- success
- error (boolean; False on success, True on failure)
- error_message (NONE on success)
Notes:
- GitHub Action commits updated artifacts:
  - data/2026_diavgeia.csv
  - state/state.json
  - logs/fetch_runs.csv
- Fetch logs may report API totals larger than CSV additions because excluded organizations are skipped after retrieval.
- Schedule is 03:00 UTC daily.
- If fetch fails, run log is still persisted and workflow is marked failed.
For selected decisionType values, src/fetch_diavgeia.py performs an extra API call to:
https://diavgeia.gov.gr/luminapi/api/decisions/view/{ada}
The response contains a meta field (list of one-key dictionaries). The script flattens that list and extracts type-specific fields into dedicated CSV columns.
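Flattening a list of one-key dictionaries is a short operation. The sketch below uses a hypothetical flatten_meta name and assumes that, if a key were ever repeated, the later entry would win; the sample values are made up.

```python
def flatten_meta(meta) -> dict:
    """Merge a list of one-key dicts into a single dict (later duplicates overwrite)."""
    flat = {}
    for entry in meta or []:
        # Skip anything that is not a dict instead of failing on odd payloads.
        if isinstance(entry, dict):
            flat.update(entry)
    return flat


sample_meta = [
    {"Υπογράφοντες": ["Signer A"]},
    {"Οικονομικό Έτος": 2026},
]
flat_meta = flatten_meta(sample_meta)
```

After flattening, type-specific fields can be read with plain dictionary lookups such as flat_meta.get("Υπογράφοντες").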
Important storage note:
- In memory, many extracted values are Python lists/dicts.
- In data/2026_diavgeia.csv, they are stored as stringified values (because CSV has no native nested types).
Quick summary table:
| decisionType | Column prefix | Main extracted entities |
|---|---|---|
| ΕΓΚΡΙΣΗ ΔΑΠΑΝΗΣ | spending_* | signers + contractors (AFM, name, amount, currency) |
| ΑΝΑΛΗΨΗ ΥΠΟΧΡΕΩΣΗΣ | commitment_* | signers + fiscal/budget fields + Ποσό και ΚΑΕ/ΑΛΕ lines |
| ΑΝΑΘΕΣΗ ΕΡΓΩΝ / ΠΡΟΜΗΘΕΙΩΝ / ΥΠΗΡΕΣΙΩΝ / ΜΕΛΕΤΩΝ | direct_* | signers + persons (AFM/name) + amount + references |
| ΟΡΙΣΤΙΚΟΠΟΙΗΣΗ ΠΛΗΡΩΜΗΣ | payment_* | signers + beneficiaries (AFM/name/value) + references |
Relevant meta keys used for ΕΓΚΡΙΣΗ ΔΑΠΑΝΗΣ (spending_*):
- Υπογράφοντες
- Στοιχεία αναδόχων (list)
  - each item may include:
    - ΑΦΜ / Επωνυμία -> {ΑΦΜ, Επωνυμία, ...}
    - Ποσό δαπάνης -> {Αξία, Νόμισμα}
Collected columns:
- spending_signers: list from Υπογράφοντες
- spending_contractors_afm: list of contractor AFM values
- spending_contractors_name: list of contractor names (Επωνυμία)
- spending_contractors_value: list of expense amounts (Αξία)
- spending_contractors_currency: list of currencies (Νόμισμα)
- spending_contractors_count: number of contractor rows extracted
- spending_contractors_details: list of dicts with {ΑΦΜ, Επωνυμία, Αξία, Νόμισμα}
Status / audit columns:
- spending_enrichment_status: ok, error, or skip_missing_ada
- spending_enrichment_error: error text when status is error
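The mapping from the flattened meta to the spending_* columns can be sketched as below. The function name extract_spending_columns is hypothetical, missing keys are assumed to yield None, and the sample contractor is invented for illustration.

```python
def extract_spending_columns(flat_meta: dict) -> dict:
    """Hypothetical sketch: derive spending_* columns from an already flattened meta dict."""
    details = []
    for item in flat_meta.get("Στοιχεία αναδόχων") or []:
        party = item.get("ΑΦΜ / Επωνυμία") or {}
        amount = item.get("Ποσό δαπάνης") or {}
        details.append({
            "ΑΦΜ": party.get("ΑΦΜ"),
            "Επωνυμία": party.get("Επωνυμία"),
            "Αξία": amount.get("Αξία"),
            "Νόμισμα": amount.get("Νόμισμα"),
        })
    return {
        "spending_signers": flat_meta.get("Υπογράφοντες") or [],
        "spending_contractors_afm": [d["ΑΦΜ"] for d in details],
        "spending_contractors_name": [d["Επωνυμία"] for d in details],
        "spending_contractors_value": [d["Αξία"] for d in details],
        "spending_contractors_currency": [d["Νόμισμα"] for d in details],
        "spending_contractors_count": len(details),
        "spending_contractors_details": details,
    }


sample_flat_meta = {
    "Υπογράφοντες": ["Signer A"],
    "Στοιχεία αναδόχων": [
        {
            "ΑΦΜ / Επωνυμία": {"ΑΦΜ": "000000000", "Επωνυμία": "Example Contractor"},
            "Ποσό δαπάνης": {"Αξία": 1200.5, "Νόμισμα": "EUR"},
        }
    ],
}
spending_columns = extract_spending_columns(sample_flat_meta)
```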
Relevant meta keys used for ΑΝΑΛΗΨΗ ΥΠΟΧΡΕΩΣΗΣ (commitment_*):
- Υπογράφοντες
- Οικονομικό Έτος
- Κατηγορία Προϋπολογισμού
- Συνολικό ποσό (fallback when Ποσό και ΚΑΕ/ΑΛΕ is empty)
- Ποσό και ΚΑΕ/ΑΛΕ (list)
  - each item may include:
    - ΑΦΜ / Επωνυμία
    - Αριθμός ΚΑΕ/ΑΛΕ
    - Ποσό με ΦΠΑ
    - Υπόλοιπο διαθέσιμης πίστωσης
    - Υπόλοιπο ΚΑΕ/ΑΛΕ
Collected columns:
- commitment_signers: list from Υπογράφοντες
- commitment_fiscal_year: Οικονομικό Έτος
- commitment_budget_category: Κατηγορία Προϋπολογισμού
- commitment_counterparty: list from ΑΦΜ / Επωνυμία (one per line in Ποσό και ΚΑΕ/ΑΛΕ)
- commitment_amount_with_vat: list of Ποσό με ΦΠΑ
- commitment_remaining_available_credit: list of Υπόλοιπο διαθέσιμης πίστωσης
- commitment_kae_ale_number: list of Αριθμός ΚΑΕ/ΑΛΕ
- commitment_remaining_kae_ale: list of Υπόλοιπο ΚΑΕ/ΑΛΕ
- commitment_lines_count: number of rows in Ποσό και ΚΑΕ/ΑΛΕ
- commitment_lines_details: list of dicts preserving all extracted row-level fields
Status / audit columns:
- commitment_enrichment_status: ok, error, or skip_missing_ada
- commitment_enrichment_error: error text when status is error
Relevant meta keys used for ΑΝΑΘΕΣΗ ΕΡΓΩΝ / ΠΡΟΜΗΘΕΙΩΝ / ΥΠΗΡΕΣΙΩΝ / ΜΕΛΕΤΩΝ (direct_*):
- Υπογράφοντες
- ΑΦΜ / Επωνυμία προσώπου / προσώπων (list)
  - each item may include ΑΦΜ, Επωνυμία
- Ποσό -> {Αξία, Νόμισμα} (currently only Αξία is stored)
- Σχετ. Ανάληψη υποχρέωσης
- Δείτε επίσης και ..
Collected columns (requested direct_* naming):
- direct_signers: list from Υπογράφοντες
- direct_afm: list of AFM values
- direct_name: list of names (Επωνυμία)
- direct_value: amount value from Ποσό -> Αξία
- direct_related_commitment: Σχετ. Ανάληψη υποχρέωσης
- direct_see_also: Δείτε επίσης και ..
Helper columns:
- direct_people_count: number of persons in ΑΦΜ / Επωνυμία προσώπου / προσώπων
- direct_people_details: list of dicts with {ΑΦΜ, Επωνυμία}
- direct_enrichment_status
- direct_enrichment_error
Relevant meta keys used for ΟΡΙΣΤΙΚΟΠΟΙΗΣΗ ΠΛΗΡΩΜΗΣ (payment_*):
- Υπογράφοντες
- Στοιχεία δικαιούχων (list)
  - each item may include:
    - ΑΦΜ / Επωνυμία -> {ΑΦΜ, Επωνυμία, ...}
    - Ποσό δαπάνης -> {Αξία, Νόμισμα}
- Σχετ. Ανάληψη Υποχρέωσης/Έγκριση Δαπάνης
- Δείτε επίσης και ..
Collected columns:
- payment_signers: list from Υπογράφοντες
- payment_beneficiary_afm: list of beneficiary AFM values
- payment_beneficiary_name: list of beneficiary names (Επωνυμία)
- payment_value: list of beneficiary expense amounts (Αξία)
- payment_related_commitment_or_spending: Σχετ. Ανάληψη Υποχρέωσης/Έγκριση Δαπάνης
- payment_see_also: Δείτε επίσης και ..
Helper columns:
- payment_beneficiaries_count: number of beneficiary rows
- payment_beneficiaries_details: list of dicts with {ΑΦΜ, Επωνυμία, Αξία}
- payment_enrichment_status
- payment_enrichment_error
- Decision-type enrichment runs automatically during fetch_diavgeia.py for new rows.
- PDF enrichment in fetch_diavgeia.py runs only when DOWNLOAD_DIAVGEIA_PDFS=1.
- Existing CSV rows can be backfilled from a notebook via:
  - fetch_diavgeia.backfill_spending_approval_columns(...)
- Root-level fetch_diavgeia.py is a thin wrapper (exports main only). For backfill helpers, import from src/ with sys.path.insert(0, ...) so the root wrapper is not imported first:
  python -c "import sys; sys.path.insert(0, 'src'); from fetch_diavgeia import backfill_spending_approval_columns; backfill_spending_approval_columns(...)"
- The backfill currently processes all supported types above (despite the legacy function name).
- Progress is printed during enrichment ([spending], [commitment], [direct], [payment] start/progress/done lines).
The web-app now uses data/raw_procurements.csv (KIMDIS contracts) as the main procurement dataset.
Raw dataset pipeline:
- collection script: fetch_kimdis_procurements.py (wrapper) / src/fetch_kimdis_procurements.py
- source API: https://cerpp.eprocurement.gov.gr/khmdhs-opendata/contract
- output files:
  - data/raw_items_backup.json (single raw backup: primary + secondary items in one flat list)
  - data/raw_procurements.csv (filtered, deduplicated tabular dataset)
- DB table: public.raw_procurements
Run raw collection manually:
python fetch_kimdis_procurements.py
Incremental behavior (Diavgeia-style):
- uses state/kimdis_state.json with last_fetch
- if state is missing, derives last fetch from max submissionDate in data/raw_procurements.csv
- fetches from the effective start date forward and then merges with existing CSV using dedupe
- use --full-refresh to ignore state and refetch the whole window
- CSV merge dedupe strategy is full-row dedupe after normalizing list/dict values to stable JSON strings
- contract-chain dedupe is not written back to the raw CSV
- instead, it is applied at DB ingest / reporting time using prevReferenceNo and nextRefNo
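Full-row dedupe over rows that contain lists or dicts needs a stable, hashable form for each cell. The sketch below shows one way to do this with plain dictionaries; stable_cell and dedupe_rows are hypothetical names, and the actual script works on DataFrames rather than lists of dicts.

```python
import json


def stable_cell(value):
    """Serialize list/dict cells to a deterministic JSON string; leave scalars as-is."""
    if isinstance(value, (dict, list)):
        # sort_keys makes dicts with the same content serialize identically.
        return json.dumps(value, ensure_ascii=False, sort_keys=True)
    return value


def dedupe_rows(rows):
    """Keep the first occurrence of each fully identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted((column, stable_cell(cell)) for column, cell in row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample_rows = [
    {"referenceNumber": "A", "cpv": ["x", "y"], "org": {"b": 1, "a": 2}},
    {"referenceNumber": "A", "cpv": ["x", "y"], "org": {"a": 2, "b": 1}},
    {"referenceNumber": "B", "cpv": [], "org": {}},
]
deduped = dedupe_rows(sample_rows)
```

The first two sample rows differ only in dictionary key order, so they count as one row after normalization.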
Rebuild CSV from existing backup only (no API call):
python fetch_kimdis_procurements.py --from-backup
Force a full refetch:
python fetch_kimdis_procurements.py --full-refresh --start-date 2024-01-01
KIMDIS fetch flags:
- --request-wait-seconds <float>: wait between API requests (default 1.0)
- --retry-sleep-seconds <int>: base retry sleep in seconds for backoff (default 5)
- --request-timeout <int>: HTTP timeout per request in seconds (default 60)
- --max-window-days <int>: date span per API window (default 180)
- --state-file <path>: incremental state file path (default state/kimdis_state.json)
- --backup-json <path>: primary raw API backup JSON path (default data/raw_items_backup.json)
- --output-csv <path>: output CSV path (default data/raw_procurements.csv)
- --log-csv <path>: run log CSV path
Some fire-protection contracts use general-purpose CPV codes (roads, waterworks, drainage) rather than dedicated fire-protection codes. A second API pass fetches these contracts and keeps only those whose title contains a fire-protection keyword.
Constants in src/fetch_kimdis_procurements.py:
SECONDARY_CPVS = {
"45233141-9": "Συντήρηση οδών",
"45240000-1": "Κατασκευαστικές εργασίες για υδατικά έργα",
"45232152-2": "Έργα αντιπλημμυρικής / αποχετευτικής υποδομής",
}
SECONDARY_TITLE_KEYWORDS = ["πυροπροστασ"]
Behavior:
- the secondary collector uses the same date window / incremental state as the primary
- title matching is done post-fetch (is_excluded checks normalize_string(title))
- secondary items are tagged _src="secondary" in the raw backup so they survive round-trips
- primary and secondary items are stored in a single flat list in data/raw_items_backup.json
- on --from-backup, items are split by the _src tag to restore proper per-collector filtering
- primary and secondary DataFrames are concatenated then deduplicated (full-row dedupe after normalizing list/dict cells to stable JSON) before writing data/raw_procurements.csv
Examples:
# Full refresh with default 1s per-request wait
python fetch_kimdis_procurements.py --full-refresh --start-date 2024-01-01
# Slower request pace + longer timeout
python fetch_kimdis_procurements.py --full-refresh --start-date 2024-01-01 --request-wait-seconds 2 --request-timeout 120
Ingest raw CSV into Supabase:
python ingest/ingest_raw_procurements.py
Dry-run parse check (no DB write):
python ingest/ingest_raw_procurements.py --dry-run
Run locate_work updates for newly ingested eligible procurements:
python src/run_locate_work_updates.py
Recovery-only mode for contracts that are missing rows in public.works even if they already exist in the state file:
python src/run_locate_work_updates.py --reprocess-missing-works
python src/run_locate_work_updates.py --reprocess-missing-works --limit 20
Force rerun specific reference numbers, bypassing the state file and candidate-selection query:
python src/run_locate_work_updates.py --reference-number 26SYMV018515731
python src/run_locate_work_updates.py --reference-number 26SYMV018515731 --reference-number 26SYMV018537881
python src/run_locate_work_updates.py --reference-number "26SYMV018515731,26SYMV018537881"
Diavgeia procurement tables are still kept for future extensions.
The raw KIMDIS CSV may contain contract chains where an older contract is amended, extended, or superseded by a newer contract with a new referenceNumber.
Rules used by the ingest / app layer:
- keep the raw CSV unchanged for auditability
- zero payment.amount_without_vat for superseded contracts
- exclude superseded contracts from frontend counts / lists
- excluded-keyword matching is token-prefix based: a keyword must match the start of a word/token (for example ΕΠΑΛ matches ΕΠΑΛ, but not Δ.Ε. ΠΑΛΑΙΡΟΥ or σε παλιά)
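Token-prefix matching can be sketched as follows. This is an illustration only: tokenize and matches_token_prefix are hypothetical names, tokens are assumed to be split on any non-alphanumeric character, and only single-word keywords are handled.

```python
import re
import unicodedata


def tokenize(text: str):
    """Lowercase, strip diacritics, unify final sigma, then split into word tokens."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = plain.replace("ς", "σ")
    # Assumption: every run of non-alphanumeric characters separates tokens.
    return [token for token in re.split(r"[\W_]+", plain) if token]


def matches_token_prefix(text: str, keyword: str) -> bool:
    """True when some token of the text starts with the (single-word) keyword."""
    keyword_tokens = tokenize(keyword)
    if len(keyword_tokens) != 1:
        return False  # multi-word keywords are outside the scope of this sketch
    needle = keyword_tokens[0]
    return any(token.startswith(needle) for token in tokenize(text))
```

A plain substring search over text with punctuation and spaces removed would wrongly flag Δ.Ε. ΠΑΛΑΙΡΟΥ, because the letters Ε and ΠΑΛ become adjacent; comparing against token starts avoids that false positive.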
A contract is treated as superseded when:
- its referenceNumber appears as another row's prevReferenceNo
- or the row itself has non-empty nextRefNo
Effect:
- only the terminal contract in the chain keeps monetary weight
- older links remain visible in raw source data but do not inflate totals or contract_count
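The two supersession conditions translate directly into a set computation. The sketch below uses a hypothetical superseded_reference_numbers function over plain dictionaries; the reference numbers in the sample are placeholders.

```python
def superseded_reference_numbers(rows):
    """Return the referenceNumber values that count as superseded."""
    # Every reference number that some other row names as its predecessor.
    pointed_to = {row["prevReferenceNo"] for row in rows if row.get("prevReferenceNo")}
    superseded = set()
    for row in rows:
        if row["referenceNumber"] in pointed_to or row.get("nextRefNo"):
            superseded.add(row["referenceNumber"])
    return superseded


chain = [
    {"referenceNumber": "A", "prevReferenceNo": "", "nextRefNo": ""},
    {"referenceNumber": "B", "prevReferenceNo": "A", "nextRefNo": ""},
    {"referenceNumber": "C", "prevReferenceNo": "B", "nextRefNo": ""},
    {"referenceNumber": "D", "prevReferenceNo": "", "nextRefNo": "E"},
]
superseded = superseded_reference_numbers(chain)
```

In the sample chain A -> B -> C, only C is terminal and keeps its monetary weight; D is superseded through its own nextRefNo even though its successor is not present in the data.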
The web-app procurement layer now uses a two-table model:
- public.procurement_decisions: one row per ADA (header-level metadata)
- public.procurement_decision_lines: multiple rows per ADA (amounts / counterparties / line details)
Why:
- some Diavgeia decisions contain multiple amounts and/or multiple beneficiaries/contractors
- keeping only one amount_eur/contractor per ADA loses detail
Run these in SQL Editor (once per database):
- sql/004_procurement_subject_flags.sql
  - adds subject-derived boolean flags to procurement_decisions
- sql/005_procurement_decision_lines.sql
  - creates line-level table procurement_decision_lines
- sql/006_org_municipality_coverage.sql
  - creates org_municipality_coverage (org -> all municipalities covered)
- sql/007_raw_procurements.sql
  - creates raw_procurements (main KIMDIS raw contracts table)
- sql/008_raw_procurements_views.sql
  - creates v_raw_procurements_municipality (frontend-friendly municipality-linked raw procurements view)
- sql/009_raw_procurements_hero_stats_fn.sql
  - creates RPC function get_raw_procurements_hero_stats(p_year_main, p_year_prev1, p_year_prev2, p_as_of_date)
  - returns YTD hero KPIs (total spend, top contract type, top CPV) for the homepage
  - the YTD window uses LEAST(DAY(as_of_date), last_day_of_month_in_year) to compare the same calendar period fairly across leap and non-leap years
- sql/010_raw_procurements_cumulative_curve_fn.sql
  - creates RPC function get_raw_procurements_cumulative_curve(p_as_of_date, p_year_main, p_year_start)
  - generates one data point per day per year from p_year_start to p_year_main using generate_series
  - current year (p_year_main) stops at LEAST(MAX(data_date), p_as_of_date); prior years run to 31 Dec
  - single call from the frontend returns all series; no duplicate year rows (UNION deduplication in CTE)
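The day clamp used for the YTD window can be illustrated outside SQL. The Python sketch below only demonstrates the LEAST(DAY(as_of_date), last_day_of_month_in_year) rule; ytd_cutoff is a hypothetical name and this is not the SQL function itself.

```python
import calendar
from datetime import date


def ytd_cutoff(as_of: date, year: int) -> date:
    """Project the as-of day into another year, clamping to that month's last day."""
    last_day = calendar.monthrange(year, as_of.month)[1]
    return date(year, as_of.month, min(as_of.day, last_day))


leap_day_in_2025 = ytd_cutoff(date(2024, 2, 29), 2025)
mid_march_in_2024 = ytd_cutoff(date(2026, 3, 15), 2024)
```

The clamp matters only for 29 February: projected into a non-leap year it becomes 28 February, so the comparison window never lands on a date that does not exist.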
From the project root:
python ingest/ingest_raw_procurements.py
python ingest/ingest_procurement.py
python ingest/ingest_procurement_lines.py
python ingest/ingest_org_municipality_coverage.py
ERD procurement ingest used by the web app:
python ingest/stage2_load_erd.py --tables region,municipality,organization,diavgeia_document_type,procurement,cpv,diavgeia,payment,diavgeia_procurement,beneficiary
Restore/recovery mode for KIMDIS procurement rows that should be loaded from the current data/raw_procurements.csv without running the destructive excluded-keyword prune:
python ingest/stage2_load_erd.py \
--tables procurement,cpv,payment,diavgeia_procurement,beneficiary \
--skip-beneficiary-gemi \
--skip-prune-excluded-procurements
Behavior:
- ingest_raw_procurements.py truncates + reloads public.raw_procurements from data/raw_procurements.csv
- stage2_load_erd.py loads data/raw_procurements.csv into public.procurement incrementally; existing identities are skipped by default and rows missing from the CSV are not deleted
- by default, stage2_load_erd.py prunes public.procurement rows that match excluded keywords after payment upsert; --skip-prune-excluded-procurements disables only that final delete step
- ingest_procurement.py loads header-level rows from data/2026_diavgeia_filtered.csv
- ingest_procurement_lines.py expands line-level detail from:
  - spending_contractors_details
  - payment_beneficiaries_details
  - commitment_lines_details (with fallback from commitment_* columns when line details are missing)
  - direct_people_details + direct_value
- ingest_org_municipality_coverage.py loads data/mappings/org_to_municipality_coverage.csv into public.org_municipality_coverage (truncate + reload)
The project now keeps a dedicated coverage mapping for organizations that affect multiple municipalities (e.g. regions, decentralized administrations, syndicates, development organizations, national bodies).
- Source CSV: data/mappings/org_to_municipality_coverage.csv
- DB table: public.org_municipality_coverage
This is different from data/mappings/org_to_municipality.csv:
- org_to_municipality.csv is a header-level / best single-match mapping
- org_to_municipality_coverage.csv stores full coverage (one org -> many municipalities)
Coverage rows are built from:
- deterministic hierarchy rules (ΠΕΡΙΦΕΡΕΙΑ, ΑΠΟΚΕΝΤΡΩΜΕΝΗ ΔΙΟΙΚΗΣΗ, etc.)
- local region reference (data/mappings/region_to_municipalities.csv)
- manual municipality lists for special cases
- national-level whole-country expansion (["*"])
- fallback to single municipality_id when available in org_to_municipality.csv
Mapping review workflow note:
- temporary/manual review CSVs (for example raw_unmapped_orgs_review*.csv, admin_codes_reference.csv) should be moved to data/mappings/archived/ after use
- data/mappings/archived/ is ignored by git and is intended for local helper artifacts only
Important note for direct assignments:
- if one direct assignment has multiple persons but one total amount, the amount is placed only on the first line row to avoid double-counting in aggregates
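The first-line rule for direct assignments can be sketched as below. The function expand_direct_lines and the output field names are hypothetical, and leaving the remaining lines with None (rather than 0) is an assumption.

```python
def expand_direct_lines(ada: str, people: list, total_value):
    """Create one line per person, attaching the total amount to the first line only."""
    lines = []
    for index, person in enumerate(people):
        lines.append({
            "ada": ada,
            "afm": person.get("ΑΦΜ"),
            "name": person.get("Επωνυμία"),
            # Only the first line carries the amount, so sums per ADA stay correct.
            "amount": total_value if index == 0 else None,
        })
    return lines


sample_people = [
    {"ΑΦΜ": "111111111", "Επωνυμία": "Person One"},
    {"ΑΦΜ": "222222222", "Επωνυμία": "Person Two"},
]
direct_lines = expand_direct_lines("ADA-EXAMPLE", sample_people, 5000.0)
line_total = sum(line["amount"] or 0 for line in direct_lines)
```

Summing the amounts over the generated lines returns the original total once, regardless of how many persons the assignment lists.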
The frontend (MunicipalityPanel) now aggregates procurement amounts per ADA from procurement_decision_lines and uses that total in the UI.
- header rows still come from procurement_decisions
- amounts are summed from line rows when available
- fallback to procurement_decisions.amount_eur is used only when no line rows exist
ingest_procurement.py uses UPSERT and does not delete older rows automatically.
If you shrink data/2026_diavgeia_filtered.csv (for example to issueDate >= 2024), first clear both procurement tables in Supabase, then re-ingest:
TRUNCATE TABLE
public.procurement_decision_lines,
public.procurement_decisions
RESTART IDENTITY;
Then rerun:
python ingest/ingest_procurement.py
python ingest/ingest_procurement_lines.py
python ingest/ingest_org_municipality_coverage.py
- data/2026_diavgeia.csv currently stores only records with issueDate >= 2024
- data/2026_diavgeia_filtered.csv is also kept at issueDate >= 2024
- a local archival copy of the full raw range was saved as data/2026_diavgeia_from_2009.csv (ignored in git)
A React + Vite + Supabase frontend in app/.
Years are derived at runtime rather than hardcoded in the frontend; only the first dataset year is a constant:
const YEAR_START = 2024 // first year in the dataset
const currentYear = new Date().getFullYear() // e.g. 2026, 2027, …
const chartYears = Array.from( // [2024, 2025, 2026, …]
{ length: currentYear - YEAR_START + 1 },
(_, i) => YEAR_START + i
)
The chart automatically gains a new series each January 1st without code changes.
Chart line styles are indexed from newest year (CHART_YEAR_STYLES[0] = bold black) to oldest year (faded grey). Any older years beyond the defined styles share the most-faded style.
The homepage makes two RPC calls:
| Call | Function | Purpose |
|---|---|---|
| Hero KPIs | get_raw_procurements_hero_stats | Total spend, top type, top CPV for current YTD |
| Cumulative chart | get_raw_procurements_cumulative_curve | Daily cumulative series for all years from YEAR_START |
// Hero stats — compare current year vs two prior years at same YTD window
supabase.rpc('get_raw_procurements_hero_stats', {
p_year_main: currentYear,
p_year_prev1: currentYear - 1,
p_year_prev2: currentYear - 2,
p_as_of_date: asOf,
})
// Cumulative curve — single call, server generates all year series
supabase.rpc('get_raw_procurements_cumulative_curve', {
p_as_of_date: asOf,
p_year_main: currentYear,
p_year_start: YEAR_START,
})
- / (Homepage): hero KPIs + cumulative spending chart + municipality panel
- /contracts (Contract browser): app/src/pages/ContractsPage.tsx