feat: Add GooseFS support as an OpenDAL-backed I/O source #7108

Description

@XuQianJin-Stars

Is your feature request related to a problem?

Daft currently supports several object-storage backends (S3, GCS, Azure, OSS, COS, OBS, HuggingFace, etc.) through native clients or via OpenDAL. However, there is no first-class support for GooseFS — Tencent Cloud's distributed caching/acceleration filesystem that is widely used in front of COS for big-data and AI workloads.

Users running Daft on GooseFS-backed datasets currently have to:

  • Mount GooseFS as a local filesystem and use the local:// scheme, which loses cloud-native semantics (credentials, multi-node access, prefix listing, etc.); or
  • Pre-copy data to S3/COS, which defeats the purpose of using GooseFS as a cache layer.

OpenDAL 0.57.0 has shipped a services-goosefs backend, so Daft can now expose GooseFS natively through the same OpenDALSource machinery used by OSS / COS / OBS, with minimal new code.

Describe the solution you'd like

Add a new goosefs:// URL scheme to Daft I/O, mirroring the existing COS / OBS integrations:

  1. Config

    • New daft.io.GooseFSConfig (Python) / common_io_config::GooseFSConfig (Rust) with fields:
      • endpoint (GooseFS master endpoint, e.g. http://master:9200)
      • root (optional root path inside the namespace)
      • anonymous (bool, default false)
    • Wired into IOConfig next to cos, obs, oss, etc., with full __init__ / replace / __repr__ / pickling / Python bindings.
  2. I/O source

    • Register goosefs in OpenDALSource::available_schemes().
    • Build the operator via OpenDAL's Goosefs service (opendal = "0.57", feature services-goosefs).
    • Route goosefs://... URIs in daft-io's scheme dispatcher to the new source.
  3. SQL / catalog

    • Recognize goosefs:// in URL parsing so read_parquet / read_csv / read_json / write_* work out of the box.
  4. Tests & docs

    • Unit tests for config round-tripping (Rust + Python).
    • Add goosefs to the supported-schemes table in the I/O docs.
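As a rough illustration of items 1 and 2, the config shape and the scheme check could be sketched in plain Python as below. This is only a sketch under stated assumptions, not Daft's actual implementation: `GooseFSConfigSketch`, `OPENDAL_SCHEMES`, and `opendal_scheme_for` are hypothetical names invented for this example. The real config would live in `common_io_config` with Python bindings, and the real dispatch happens in Rust inside daft-io. Only the three fields and their defaults are taken from the proposal above.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class GooseFSConfigSketch:
    """Hypothetical stand-in for the proposed daft.io.GooseFSConfig."""

    endpoint: str | None = None  # GooseFS master endpoint, e.g. http://master:9200
    root: str | None = None      # optional root path inside the namespace
    anonymous: bool = False      # defaults to false, as in the proposal

    def replace(self, **changes) -> GooseFSConfigSketch:
        # Return a modified copy, in the spirit of the `replace` method
        # listed in the proposal; the original instance is left untouched.
        return dataclasses.replace(self, **changes)


# Illustrative set of schemes routed through OpenDAL (assumed for this sketch;
# the issue names oss, cos and obs as existing OpenDAL-backed schemes).
OPENDAL_SCHEMES = frozenset({"oss", "cos", "obs", "goosefs"})


def opendal_scheme_for(uri: str) -> str | None:
    """Return the URI's scheme if it would be routed to OpenDAL, else None."""
    scheme = urlparse(uri).scheme.lower()
    return scheme if scheme in OPENDAL_SCHEMES else None
```

A config round-trip test of the kind listed under item 4 could then compare a config with a copy rebuilt from its fields, for example `GooseFSConfigSketch(**dataclasses.asdict(cfg)) == cfg`. Pickling and the Rust-side round trip are not covered by this sketch.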

Describe alternatives you've considered

  • Mount GooseFS via FUSE and read through local:// — works, but requires every Daft worker/node to mount the filesystem, and it bypasses GooseFS's native client optimizations (locality hints, async prefetch, credential handling).
  • Use the COS backend pointing at the underlying bucket — loses the GooseFS cache benefits, which is the main reason users adopt GooseFS in the first place.
  • Implement a fully custom GooseFS client in Daft: far more code to maintain, and unnecessary now that OpenDAL 0.57 ships an official services-goosefs backend that is maintained upstream.

Implementing it through OpenDAL (the chosen option) is the most consistent path: it reuses the same plumbing as oss / cos / obs, keeps the surface area small, and lets Daft inherit upstream improvements for free.

Component(s)

No response

Additional Context

No response

Would you like to implement a fix?

Yes


Labels: accepted, enhancement, p2 (backlog)
