Details
_______________________________________________________
| ______________________ __________ |
| DataLakeHouse | | | | | |
____________ | ___________ | | | | | |
| Data | | | | | Table | Table | | Query | << BI
| Source | >> E >> | | Storage | | Format | Catalog | | Engine | << (Metabase)
| (postgres) | (Airbyte) | | (Minio) | | (Iceberg) | (Nessie) | | (Dremio) | <<
|____________| | |___________| | | | | | |
| |___________|__________| |__________| |
|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
| TL |
| Transform/Orchestration |
| (dbt / Airflow) |
|_______________________________________________________|
Client-Server model architecture:
Dremio serves as the Query and Semantic Layer, created as an infrastructure component on a dedicated container.
It connects to the MinIO data lake, tracks metadata via Nessie and provides a SQL interface for both dbt (transformation) and Metabase (visualization).
https://docs.airbyte.com/platform/using-airbyte/getting-started/oss-quickstart
Docker exposes the following ports to the host, so components UI are available at:
localhost:8000airbytelocalhost:8081airflowlocalhost:3000metabaselocalhost:9001minio (create bucket: raw-data)localhost:9047dremiolocalhost:19120nessie
dremio-metabase custom plugin installed in the custom metabase image
(https://github.com/Baoqi/metabase-dremio-driver/releases)
dremio:31010 (metabase endpoint)
uv add dbt-dremio
https://docs.getdbt.com/docs/local/connect-data-platform/dremio-setup
- Create the
.envfile in docker repo - Run containers creation
- Create the bucket in minio and the tables in source db.
- Create the Airbyte connection from Postgres source to S3 Data Lake destination.
- Create the Dremio account (used in dbt) and the Nessie connection NOTE: Airflow scheduled to run at 15' and 45' each hour and Airbyte to run every 30' (:00, :30 each hour)
docker compose \
-f docker/docker-compose-pg-source.yml \
-f docker/docker-compose-datalakehouse.yml \
-f docker/docker-compose-airflow.yml \
-f docker/docker-compose-metabase.yml \
-f docker/docker-compose-transformer.yml \
up --build
Data producer with some test data
CREATE TABLE public.users (
id VARCHAR(10) PRIMARY KEY,
name VARCHAR(20) not null,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() -- utc
);
INSERT INTO public.users (id, name) VALUES ('U10001', 'Mark');
INSERT INTO public.users (id, name) VALUES ('U10002', 'Phil');
INSERT INTO public.users (id, name) VALUES ('M10003', 'John');
INSERT INTO public.users (id, name) VALUES ('M10004', 'Jack');
CREATE TABLE public.bookings (
id SERIAL PRIMARY KEY,
booking_date DATE NOT NULL,
service_id INT NOT NULL,
quantity INT NOT NULL,
total_amount NUMERIC(10, 2) NOT null,
user_id VARCHAR references users(id),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() -- utc
);
INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-01-01', 101, 3, 450.00, 'U10001');
INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-02-01', 102, 1, 320.00, 'U10001');
INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-03-01', 103, 4, 720.00, 'U10002');
...
Create source, destination and connection
Set sync details
Processing
Load to storage and save as iceberg tables
Table catalog updated
Configure table catalog and iceberg tables
Query data from the BI layer
Transformations on source data using Dremio query engine, tracked in Nessie and stored in Minio
Enter transformer to check dbt runs ok:
docker compose -f docker/docker-compose-transformer.yml exec transformer bash
uv run dbt run
uv run dbt test
DAG orchestrating the dbt transformation and updated data
Check dag tests:
uv run pytest tests -v
###################################################################################################################################################################################################################################################
<Signin form>
create admin profile in login form
sources > add sources > Nessie
<General>
Name: nessie
Nessie Endpoint URL: http://nessie:19120/api/v2
Nessie Auth Type: none
<Storage>
AWS Root Path: raw-data
Auth Type: AWS Access Key
AWS Access Key: admin
AWS Access Secret: password
IAM Role to Assume:
Connection Properties:
* fs.s3a.endpoint: minio:9000
* fs.s3a.path.style.access: true
* fs.s3a.endpoint.region: eu-central-1 (Used by the Hadoop S3A filesystem driver)
* dremio.s3.compat: true (Compatibility mode for an S3 compatible storage)
* dremio.s3.region: eu-central-1 (Explicitly sets the region for Dremio S3 client)
- [ ] Uncheck "encrypt connection"
https://docs.getdbt.com/docs/local/connect-data-platform/dremio-setup
uv add dbt-dremio
dbt init transformer (a dremio account is needed for the setup)
Which database would you like to use?
[1] dremio
[2] software_with_username_password
software_host: 127.0.0.1
port [9047]: 9047
user (username): admin
password (password): password1
Desired storage configuration method option:
[2] sources_and_spaces
profiles.yml dbt-dremio parameters define exactly where dbt will create tables and views and where underlying data are stored.
Because Dremio separates logical metadata (Spaces) from the physical storage (Sources), these settings act as a map for dbt output:
dremio_space: this is the top level space in Dremio where dbt will create the virtual datasets.dremio_space_folder: (optional) this allows to organize models into a specific folder within the space (dremio_space/dremio_space_folder/model_xyz)object_storage_source: name of the physical source already configured in the Dremio instanceobject_storage_path: directory path within the storage where the data files will live (avoid writing everything in root repo)
WARNING: if only using view materializations, dbt relies on the Space settings. If models are materialized='table' the object_storage_source settings become mandatory because Dremio needs to know where to physically write the tables.
When using a catalog connected to storage instead of Dremio spaces, the logic for the parameters slightly differs: catalog acts as both the logical AND the physical organisation of the tables.
profiles dremio params config with catalog:
dremio_space: nessie ---> in a catalog-first workflow, you want the views/tables to live in the same catalog as the source tables
dremio_space_folder: transformer ---> dbt will create this namespace in the source to host transformations views/tables
object_storage_source: nessie ---> dbt will use catalog (backed by storage) to manage the transformations physical iceberg tables
object_storage_path: transformer ---> defines the namespace where the physical iceberg tables will be registered (usually matches dremio_space_folder)