IT Operations/Site Reliability Engineer (Datadog/Dynatrace)
Job Description:
The SRE will play a pivotal role in modernizing IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

Primary Responsibilities:
Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
Propose and drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD and MTTR.
Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
Establish an AIOps roadmap for improving operational efficiency.
Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.
Drive toil automation initiatives for automated incident response and self-healing automation to achieve autonomous operations.
Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.
Drive incident .....