A portfolio project combining marine ecology and spatial machine learning (K-Means + spatially blocked Random Forest) to understand fish community patterns in the Bay of Biscay.
Ecology × ML portfolio project. I built spatial–temporal–depth “cells”, derived a Hellinger-transformed community matrix, clustered the cells into five fish community types (C1–C5) with K-Means, then trained a spatially blocked Random Forest to predict community type from environmental predictors.
Performance: accuracy ≈ 0.39, macro-F1 0.357 (↑ from 0.317 baseline).
Top drivers: sst_actual, month_sin/cos, ocean_depth, coast_km, chl.
- 5 communities (C1–C5) with clear spatial/seasonal structure.
- Spatial cross-validation (GroupKFold) to reduce leakage across space.
- Interpretability: permutation importances, partial dependence, and MDS of predictor space.
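The spatial cross-validation idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the data are synthetic, the 1-degree grid used to define blocks is an assumption, and the feature columns are only stand-ins for the predictors named in this README.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in data (hypothetical): cell centroids in a rough
# Bay of Biscay bounding box, plus four environmental columns.
lon = rng.uniform(-10.0, -1.0, n)
lat = rng.uniform(43.0, 48.0, n)
X = np.column_stack([
    rng.normal(15.0, 3.0, n),      # stand-in for sst_actual
    rng.uniform(10.0, 4000.0, n),  # stand-in for ocean_depth
    rng.uniform(0.0, 300.0, n),    # stand-in for coast_km
    rng.lognormal(0.0, 0.5, n),    # stand-in for chl
])
y = rng.integers(0, 5, n)          # stand-in for community_type (C1..C5)

# Assumption: spatial blocks are 1-degree grid squares, so every cell
# inside the same square shares one group id.
block_size = 1.0
groups = (np.floor(lon / block_size).astype(int) * 1000
          + np.floor(lat / block_size).astype(int))

cv = GroupKFold(n_splits=5)
models = {
    "stratified baseline": DummyClassifier(strategy="stratified", random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(
        model, X, y, groups=groups, cv=cv, scoring="f1_macro"
    )
    print(f"{name}: macro-F1 = {scores[name].mean():.3f}")

# No spatial block should appear on both sides of any fold.
leak_free = all(
    set(groups[tr]).isdisjoint(groups[te]) for tr, te in cv.split(X, y, groups)
)
print("folds share no blocks:", leak_free)
```

Because the labels here are random, both models should score near chance; the point is only the mechanics of passing block ids as `groups` so that neighbouring cells never straddle a train/test split.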
- Cells & community matrix: built spatial–temporal–depth cells, computed a species-level community matrix, and applied the Hellinger transform.
- Unsupervised structure: K-Means → 5 communities (C1–C5) to capture dominant assemblage patterns.
- Supervised prediction: a Random Forest predicts community_type from environmental predictors, with spatial-block CV (GroupKFold) to reduce spatial leakage.
- Model tuning & baseline: grid search over Random Forest hyperparameters, compared against a stratified baseline.
- Interpretation: permutation importances (drivers: sst_actual, month_sin/cos, ocean_depth, coast_km, chl) and class-wise partial dependence plots (PDPs; not all shown here to keep the README lean).
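The first two steps (Hellinger transform, then K-Means) could look something like the sketch below. It assumes a cells × species count matrix; the synthetic Poisson counts, the matrix size, and the `hellinger` helper name are illustrative and not taken from the project code.

```python
import numpy as np
from sklearn.cluster import KMeans


def hellinger(counts):
    """Hellinger transform: square root of row-wise relative abundances."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Empty cells (row sum 0) stay all-zero instead of producing NaN.
    rel = np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0
    )
    return np.sqrt(rel)


rng = np.random.default_rng(42)
# Synthetic community matrix: 200 cells x 30 species (hypothetical sizes).
counts = rng.poisson(lam=2.0, size=(200, 30))
H = hellinger(counts)

# Five clusters, matching the C1..C5 community types in this README.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(H)
community_type = np.array([f"C{k + 1}" for k in labels])

print(H.shape, sorted(set(community_type)))
```

After the transform, plain Euclidean distance between rows corresponds to Hellinger distance between cells, which is why standard K-Means is a reasonable fit for abundance data in this form.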
| Item | Value |
|---|---|
| Communities discovered | 5 (C1–C5) |
| CV scheme | GroupKFold (spatial blocks) |
| Final model | Random Forest (tuned) |
| Accuracy | ≈ 0.39 |
| Macro-F1 | 0.357 (↑ from 0.317 baseline) |
| Top predictors | sst_actual, month_sin, month_cos, ocean_depth, coast_km, chl |
Visuals above: community map (ecological pattern) and permutation importance (drivers).
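For readers who want to reproduce a permutation-importance ranking like the one described, here is a rough sketch. It runs on synthetic data in which only the first column drives the labels; the feature names are borrowed from this README purely as labels, and the random block ids are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GroupKFold

feature_names = ["sst_actual", "month_sin", "month_cos",
                 "ocean_depth", "coast_km", "chl"]
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, len(feature_names)))
# Synthetic labels depend on the first column only, so that column
# should come out on top of the ranking.
y = np.digitize(X[:, 0], bins=[-0.8, -0.25, 0.25, 0.8])  # classes 0..4
groups = rng.integers(0, 20, n)  # hypothetical spatial block ids

# Score importances on a held-out spatial fold, not on the training data.
train_idx, test_idx = next(GroupKFold(n_splits=5).split(X, y, groups))
model = RandomForestClassifier(n_estimators=200, random_state=1)
model.fit(X[train_idx], y[train_idx])

result = permutation_importance(
    model, X[test_idx], y[test_idx],
    scoring="f1_macro", n_repeats=10, random_state=1,
)
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"{feature_names[i]:<12} "
          f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

Each importance is the drop in macro-F1 when one column is shuffled, so correlated predictors (for example depth and distance to coast) can share or mask each other's importance.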
All environmental and biological data used in this project are open-access.
- Fish occurrence data: OBIS (Ocean Biodiversity Information System), accessed via the OBIS API. Records were filtered for Actinopterygii (ray-finned fishes) within the Bay of Biscay polygon.
- Sea surface temperature (SST): HadISST1 dataset (Hadley Centre, UK Met Office). Monthly SST fields extracted for 2015–2020.
- Chlorophyll concentration (CHL): Copernicus Marine Environment Monitoring Service (CMEMS). Near-surface chlorophyll-a data, aggregated to monthly means.
- Bathymetry: GEBCO 2020 Gridded Bathymetry.
- Distance to coast: computed internally from the GEBCO coastline using Euclidean distance in projected space.
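As an illustration of the distance-to-coast step, the sketch below computes a nearest-vertex Euclidean distance. It assumes both the cell centroids and the coastline vertices are already in a projected CRS with units of metres; the `coast_km` function name, the straight-line coastline, and the sample cells are hypothetical. Distance to the nearest vertex is only an approximation of distance to the coastline itself, and gets better as the coastline is sampled more densely.

```python
import numpy as np


def coast_km(points_xy, coast_xy):
    """Distance in km from each point to the nearest coastline vertex.

    Both inputs are (n, 2) arrays of projected x/y coordinates in metres.
    """
    points_xy = np.asarray(points_xy, dtype=float)
    coast_xy = np.asarray(coast_xy, dtype=float)
    # Pairwise differences, shape (n_points, n_coast_vertices, 2).
    diff = points_xy[:, None, :] - coast_xy[None, :, :]
    dist_m = np.sqrt((diff ** 2).sum(axis=2))
    return dist_m.min(axis=1) / 1000.0


# Hypothetical coastline along the y axis (x = 0), one vertex per km.
coast = np.column_stack([np.zeros(101), np.linspace(0.0, 100_000.0, 101)])
cells = np.array([
    [3_000.0, 50_000.0],   # 3 km offshore
    [40_000.0, 20_000.0],  # 40 km offshore
    [0.0, 0.0],            # on the coast
])
d = coast_km(cells, coast)
print(d)
```

The brute-force pairwise array grows with the product of points and vertices, so for a full-resolution coastline a spatial index such as `scipy.spatial.cKDTree` would likely be a better fit.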
All raw datasets are excluded from the repository (data/ folder is git-ignored) but can be retrieved from the original sources above.
- Environment

      conda env create -f environment.yml
      conda activate github-fish-biscay
      jupyter lab
Theo Murphy

