Site Reliability Engineer Job at Infinity Quest UK, Remote

  • Infinity Quest UK
  • Remote

Job Description

Primary Responsibilities:

  • Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
  • Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
  • Propose & drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
  • Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
  • Establish an AIOps roadmap for improving operational efficiency.
  • Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.
  • Drive toil-automation initiatives, including automated incident response and self-healing, to achieve autonomous operations.
  • Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.
  • Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
  • Partner with engineering, architecture, and product teams to enable shift-left engineering practices that ensure reliability.
  • Mentor and guide teams on adopting SRE principles and tools.
  • Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Key Skills:

  • Strong expertise in implementing Site Reliability Engineering (SRE) principles.
  • Advanced knowledge of establishing observability using Dynatrace & Datadog (primary skills).
  • Proficiency in automation & scripting using Python & Ansible (primary skills).
  • Strong experience with cloud platforms AWS & Azure (primary skills).
  • Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
  • Proficiency in cloud-native distributed systems & microservices architecture.
  • Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
  • Familiarity with CI/CD pipelines and with enabling automated release & deployment solutions.
  • Nice to have: experience with chaos engineering tools like Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
  • Ability to manage and prioritize multiple projects in a fast-paced environment.
  • Strong interpersonal and communication skills to work effectively across teams.
  • Excellent problem-solving, analytical thinking, and adaptability.
  • Strategic mindset balancing engineering excellence with business priorities.

Preferred Qualifications:

  • 12+ years of experience in IT operations, SRE, or DevOps roles.
  • Proven track record of implementing observability and automation solutions in large-scale environments.
  • Certifications in cloud platforms, observability tools & other SRE-related areas.

Job Tags

Shift work
