Your Role:
We are seeking a Staff Site Reliability Engineer (Infrastructure & Site Reliability Engineering) with extensive experience in AWS, Azure, Kubernetes, and GitOps to lead our Site Reliability Engineering (SRE) team. The successful candidate will have a deep understanding of SRE practices and a track record of implementing them to a high standard (SLAs, SLOs, proactive alert management, incident response and review, postmortems, etc.).
In this role, you will work with our SRE and cross-functional engineering teams to build and operate our development and production infrastructure.
Your Impact:
Work collaboratively with software engineering to define infrastructure and deployment requirements
Be the driving force behind our automation and observability initiatives
Build and maintain operational tools for deployment, monitoring, and analysis of cloud (AWS & Azure) infrastructure and systems
Lead the response to production incidents, conduct postmortems, drive continuous improvement, and participate in the on-call rotation
Establish and drive operations performance through SLOs
Provide project management, sprint planning, and road-mapping support to the SRE team
Apply expert-level technical skills and mentor team members
Follow the practices our team uses to maximize development velocity, including but not limited to continuous integration/deployment and code review via GitHub pull requests
Ideal Attributes:
Strong customer orientation
Excellent interpersonal and organizational skills
Attention to detail and focus on quality
Strong communication skills to effectively liaise with both technical and non-technical staff
Ability to act decisively and work well under pressure
Collaborative approach to problem solving
Strong bias for ownership and action
Your Experience:
8+ years of experience designing, building, and maintaining SaaS environments
5+ years of experience designing, building, and maintaining AWS/Azure infrastructure with Terraform
Experience building and running Kubernetes clusters
Experience with observability (monitoring – logging, tracing, metrics)
Experience with GitOps CI/CD processes
Experience scripting with Python, Go (Golang), Bash, or PowerShell, and with AWS CLI tools
Experience with security operations – security policies, infrastructure, key management, and setup of encryption at rest and in transit