Permanent

Site Reliability Engineer

London

Negotiable

Posted Yesterday

Gizmo is an AI startup on a mission to make learning so easy that anyone can learn anything. We’re building Duolingo for anything - a platform that uses gamification and social mechanics to make learning fun.

With over 1 million monthly active users and $4M in annual recurring revenue, we’re already one of the fastest-growing startups in the UK. Backed by leading investors, we recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn.

Role Overview

Reporting to the CTO, you will own capacity, performance and reliability for Gizmo’s full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You’ll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit.

Key Responsibilities

Define SLIs/SLOs for latency, availability and error rate; codify error budgets and partner with product teams on trade-offs.

Perform load-testing, capacity modelling and up-front scalability design for PostgreSQL, OpenSearch, Redis, Hasura and CF Workers; produce data-driven scaling plans.

Extend metrics, structured logging and tracing; establish alert rules that page only on user-visible impact; build actionable runbooks.

Join the on-call rotation, lead blameless post-mortems, drive remediation work to closure and track MTTR/MTBF improvements.

Automate repetitive ops on Kubernetes and CI/CD; keep “toil”

Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns by creating an SRE playbook.

Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL).

Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs.

Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts.

Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer.

Start-up bias for action: you prioritise high-leverage fixes, ship iteratively and own outcomes end-to-end.

Collaborative and feedback-driven; you welcome post-mortem culture and continuous improvement.

Driven by impact - you prioritise work that moves the needle!

Nice-to-haves: experience with Hasura internals, Cloudflare Workers edge optimisation, or operating OpenSearch at scale.

Highly competitive salary.

You’ll own a piece of what you’re building - equity included.

Hybrid working model with 4 days in our East London office, ideally located between Shoreditch High Street, Old Street, and Liverpool Street stations.

The opportunity to become one of the earliest employees in one of the UK’s fastest-growing startups.

Private health ...
