GPU Cluster Architect

Remote (Amsterdam, Netherlands)
Why This Role
Join a fast-growing team that is redefining cloud infrastructure for the AI era. The focus is on building platforms that enable customers to tackle complex real-world problems and scale innovation, without the burden of massive infrastructure investments or large internal AI/ML teams. You'll be working at the forefront of GPU and AI infrastructure alongside highly skilled engineers and industry leaders.
About the Position
We are looking for a GPU Cluster Architect to lead the design of large-scale, next-generation GPU clusters that underpin advanced AI workloads. This is a hands-on, senior role with responsibility for shaping architecture across compute, networking, and storage, ensuring systems deliver the scale, reliability, and performance demanded by today's AI and ML applications.
You'll be responsible for defining how very large GPU deployments are networked, powered, cooled, and optimised across multiple data centre environments.
Core Responsibilities:
- Cluster Architecture: Design and define scalable cluster topologies spanning compute, interconnects (InfiniBand, Ethernet), storage, and orchestration layers.
- Workload Analysis: Model and assess AI/ML workloads (such as LLM training and inference) to guide design choices on latency, bandwidth, and GPU density.
- Networking: Collaborate with network specialists to implement and validate ultra-low-latency, high-throughput solutions (InfiniBand HDR/NDR, RoCEv2) at rack, POD, and data centre scale.
- Data & Storage: Partner with storage teams to optimise training-data access, checkpointing, and storage throughput.
- Reliability & Observability: Translate signals from monitoring and telemetry systems into design improvements and reliability gains.
- Cross-Functional Collaboration: Work closely with reliability, networking, storage, and data centre engineering teams to deliver designs that scale seamlessly.
What You'll Bring:
- 5+ years of experience architecting or designing large-scale compute clusters
- Strong knowledge of modern GPU platforms (e.g. NVIDIA, AMD)
- Hands-on expertise with HPC interconnect technologies (InfiniBand, RoCE)
- Background in systems architecture, hardware reliability, and networking fundamentals
- Experience building automation and telemetry pipelines with scripting languages (Python, Go, etc.)
What's on Offer:
- A competitive salary and full benefits package
- Clear opportunities for professional development and growth
- Flexibility with hybrid/remote work arrangements
- A dynamic environment that rewards initiative, creativity, and innovation
If you're excited by the idea of shaping the backbone of large-scale AI infrastructure, we'd love to hear from you.