Spark Cluster

wip

4-node DGX Spark vLLM cluster on ConnectX-7 with a custom NCCL 2.28.9 build (works around the 2.29.2 Docker bug). RDMA L2-only.

infra · 2026
DGX Spark · vLLM · NCCL · Docker

A 4-node DGX Spark cluster running vLLM over ConnectX-7, using a custom NCCL 2.28.9 build because the 2.29.2 Docker image has a known bug. RDMA is L2-only, with NCCL_IB_DISABLE=1 set. The topology constraints are real and not covered in the docs.
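A minimal sketch of how the per-node NCCL environment could be assembled before launching a worker. Only NCCL_IB_DISABLE=1 comes from the setup described here. NCCL_SOCKET_IFNAME and NCCL_DEBUG are real NCCL variables, but using them on this cluster is an assumption, and the `nccl_env` helper and the interface name are hypothetical. NCCL's documentation describes NCCL_IB_DISABLE=1 as switching off the IB/RoCE transport in favor of IP sockets, so the debug log is a reasonable way to confirm which transport each node actually selects.

```python
import os
from typing import Dict, Optional


def nccl_env(socket_ifname: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the NCCL settings applied.

    `base` defaults to the current process environment; the input is not mutated.
    """
    env = dict(os.environ if base is None else base)
    env.update({
        "NCCL_IB_DISABLE": "1",               # setting named in the write-up
        "NCCL_SOCKET_IFNAME": socket_ifname,  # assumption: pin NCCL to one interface
        "NCCL_DEBUG": "INFO",                 # assumption: log transport selection at startup
    })
    return env


if __name__ == "__main__":
    # "enp1s0f0np0" is a hypothetical interface name; substitute the real one per node.
    env = nccl_env("enp1s0f0np0")
    for key in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when starting each node's worker, which keeps the settings out of the shell profile and identical across all four nodes.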

In progress: making the multi-node ring stable enough to serve as the primary local inference backend for the agent fleet.