Senior Site Reliability Engineer
Your Impact
Our client platform is a Kubernetes-native distributed system that requires the orchestration of many components. Efficiently serving and training large neural networks presents unique design and infrastructure challenges.
You will be critical to solving these challenges both in the cloud and in on-premises environments. Additionally, you will be responsible for our broader cloud infrastructure as well as our development tools and environments.
The Opportunity
- Ensure the smooth operation and high availability of core services
- Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
- Develop Kubernetes resources and custom tooling for seamless cloud and on-premises deployments
- Design and implement scalable, secure, and cost-effective infrastructure solutions
- Partner with teams across the organization to identify and solve engineering challenges
Requirements
- BS/BA in Computer Science or a related field
- Good knowledge of cloud providers (AWS, GCP or similar)
- Expertise with Kubernetes (EKS, GKE, self-hosted) and Infrastructure as Code using Terraform and Helm
- Solid understanding of web and networking fundamentals (HTTP, TLS, DNS, certificates, etc.)
- Experience with CI/CD pipelines using tools such as GitHub Actions, ArgoCD, and Atlantis
- Strong interpersonal skills working with teams across different time zones and regions
Great to Have
- Knowledge of basic microservice architecture principles
- Familiarity with security best practices for cloud-based systems
- Experience with relational databases, message queues, and key-value stores
- Experience writing Python, Go, or any other popular programming language
- Familiarity with any RPC framework
- Experience building custom Kubernetes operators