Skip to content

rpalloni/airbyte-airflow-datalakehouse-dbt-metabase

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Architecture diagram

dlh
Details
                          _______________________________________________________
                         |                 ______________________    __________  |
                         | DataLakeHouse  |           |          |  |          | |
 ____________            |  ___________   |           |          |  |          | |
|    Data    |           | |           |  |  Table    | Table    |  |  Query   | <<    BI
|   Source   |  >> E >>  | | Storage   |  |  Format   | Catalog  |  |  Engine  | << (Metabase)
| (postgres) | (Airbyte) | | (Minio)   |  | (Iceberg) | (Nessie) |  | (Dremio) | <<
|____________|           | |___________|  |           |          |  |          | |
                         |                |___________|__________|  |__________| |
                         |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
                         |                          TL                           |
                         |               Transform/Orchestration                 |
                         |                   (dbt / Airflow)                     |
                         |_______________________________________________________|

Client-Server model architecture:
Dremio serves as the Query and Semantic Layer, created as an infrastructure component on a dedicated container.
It connects to the MinIO data lake, tracks metadata via Nessie and provides a SQL interface for both dbt (transformation) and Metabase (visualization).

Airbyte local setup

https://docs.airbyte.com/platform/using-airbyte/getting-started/oss-quickstart

Components

Docker exposes the following ports to the host, so components UI are available at:

  • localhost:8000 airbyte
  • localhost:8081 airflow
  • localhost:3000 metabase
  • localhost:9001 minio (create bucket: raw-data)
  • localhost:9047 dremio
  • localhost:19120 nessie

Plugins

dremio-metabase custom plugin installed in the custom metabase image
(https://github.com/Baoqi/metabase-dremio-driver/releases)

dremio:31010 (metabase endpoint)

image

Dependencies

uv add dbt-dremio
https://docs.getdbt.com/docs/local/connect-data-platform/dremio-setup

Run

  1. Create the .env file in docker repo
  2. Run containers creation
  3. Create the bucket in minio and the tables in source db.
  4. Create the Airbyte connection from Postgres source to S3 Data Lake destination.
  5. Create the Dremio account (used in dbt) and the Nessie connection NOTE: Airflow scheduled to run at 15' and 45' each hour and Airbyte to run every 30' (:00, :30 each hour)
docker compose \
-f docker/docker-compose-pg-source.yml \
-f docker/docker-compose-datalakehouse.yml \
-f docker/docker-compose-airflow.yml \
-f docker/docker-compose-metabase.yml \
-f docker/docker-compose-transformer.yml \
up --build

postgres-source

Data producer with some test data

CREATE TABLE public.users (
	id VARCHAR(10) PRIMARY KEY,
	name VARCHAR(20) not null,
	created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() -- utc
);

INSERT INTO public.users (id, name) VALUES ('U10001', 'Mark');
INSERT INTO public.users (id, name) VALUES ('U10002', 'Phil');
INSERT INTO public.users (id, name) VALUES ('M10003', 'John');
INSERT INTO public.users (id, name) VALUES ('M10004', 'Jack');

CREATE TABLE public.bookings (
	id SERIAL PRIMARY KEY,
	booking_date DATE NOT NULL,
	service_id INT NOT NULL,
	quantity INT NOT NULL,
	total_amount NUMERIC(10, 2) NOT null,
	user_id VARCHAR references users(id),
	created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() -- utc
);

INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-01-01', 101, 3, 450.00, 'U10001');
INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-02-01', 102, 1, 320.00, 'U10001');
INSERT INTO public.bookings (booking_date, service_id, quantity, total_amount, user_id) VALUES ('2023-03-01', 103, 4, 720.00, 'U10002');
...

airbyte

Create source, destination and connection

image

Set sync details

image

Processing

image

minio

Load to storage and save as iceberg tables

image

nessie

Table catalog updated

image

dremio

Configure table catalog and iceberg tables

image

metabase

Query data from the BI layer

image

dbt

Transformations on source data using Dremio query engine, tracked in Nessie and stored in Minio

image image image

Enter transformer to check dbt runs ok:

docker compose -f docker/docker-compose-transformer.yml exec transformer bash
uv run dbt run
uv run dbt test

airflow

DAG orchestrating the dbt transformation and updated data

image image

Check dag tests:
uv run pytest tests -v

###################################################################################################################################################################################################################################################

Dremio UI config

<Signin form>
create admin profile in login form
sources > add sources > Nessie
<General>
Name: nessie
Nessie Endpoint URL: http://nessie:19120/api/v2
Nessie Auth Type: none
<Storage> 
AWS Root Path: raw-data
Auth Type: AWS Access Key
AWS Access Key: admin
AWS Access Secret: password
IAM Role to Assume:
Connection Properties:
* fs.s3a.endpoint: minio:9000
* fs.s3a.path.style.access: true 
* fs.s3a.endpoint.region: eu-central-1 (Used by the Hadoop S3A filesystem driver)
* dremio.s3.compat: true (Compatibility mode for an S3 compatible storage)
* dremio.s3.region: eu-central-1 (Explicitly sets the region for Dremio S3 client)
- [ ] Uncheck "encrypt connection"

dbt config

https://docs.getdbt.com/docs/local/connect-data-platform/dremio-setup

uv add dbt-dremio
dbt init transformer (a dremio account is needed for the setup)

Which database would you like to use?

[1] dremio
[2] software_with_username_password
software_host: 127.0.0.1
port [9047]: 9047
user (username): admin
password (password): password1

Desired storage configuration method option:

[2] sources_and_spaces

profiles.yml dbt-dremio parameters define exactly where dbt will create tables and views and where underlying data are stored.
Because Dremio separates logical metadata (Spaces) from the physical storage (Sources), these settings act as a map for dbt output:

  • dremio_space: this is the top level space in Dremio where dbt will create the virtual datasets.
  • dremio_space_folder: (optional) this allows to organize models into a specific folder within the space (dremio_space/dremio_space_folder/model_xyz)
  • object_storage_source: name of the physical source already configured in the Dremio instance
  • object_storage_path: directory path within the storage where the data files will live (avoid writing everything in root repo)

WARNING: if only using view materializations, dbt relies on the Space settings. If models are materialized='table' the object_storage_source settings become mandatory because Dremio needs to know where to physically write the tables.

dbt config with catalog

When using a catalog connected to storage instead of Dremio spaces, the logic for the parameters slightly differs: catalog acts as both the logical AND the physical organisation of the tables.

profiles dremio params config with catalog:

dremio_space: nessie              ---> in a catalog-first workflow, you want the views/tables to live in the same catalog as the source tables
dremio_space_folder: transformer  ---> dbt will create this namespace in the source to host transformations views/tables
object_storage_source: nessie     ---> dbt will use catalog (backed by storage) to manage the transformations physical iceberg tables
object_storage_path: transformer  ---> defines the namespace where the physical iceberg tables will be registered (usually matches dremio_space_folder)

About

end-to-end data infrastructure using Airbyte, Airflow, Minio, Nessie, Dremio, Iceberg, dbt, Metabase

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages