At Domino Data Lab, we have an ambitious vision for data science. Our platform helps data science teams accelerate research, increase collaboration, and rapidly deploy predictive models. Our customers are the most sophisticated analytical organizations in the world, including companies like Bristol Myers Squibb, Allstate, Bayer, and Red Hat. Backed by Sequoia Capital, Coatue Management, Bloomberg Beta, and Zetta Venture Partners, we are at the epicenter of the data science revolution, helping companies develop the next breakthrough in medicine, build better cars, or recommend the best song to play next.
What we are building
The Customer Reliability Engineering team is focused on making sure that our customers have a performant and reliable experience on Domino. We take SRE principles and apply them directly to customer-managed deployments of our product. As a senior engineer in our customer-facing organization, you will help clients govern their infrastructure to maximize uptime while also serving as a subject matter expert within the team.
What your impact will be
Your work directly assists our largest customers in producing the next generation of AI products. They rely heavily on our product to perform smoothly and stably, which is not currently possible without the involvement of the Customer Reliability Engineering team.
You will be responsible for production deployments of our product, which run on a variety of infrastructure and involve a growing number of technical components. Ensuring stability involves building out our observability systems and automation while also optimizing existing deployments and outage response.
What we look for in this role
- Experience with managing cloud environments (AWS, GCP, Azure)
- Strong coding ability (Python, Bash, Go)
- Systems fluency (Linux, storage, networking)
- Experience with container management (Kubernetes, Docker, EKS)
- Observability systems (New Relic, Prometheus, Grafana)
- Infrastructure and configuration automation (Terraform)
- Operating stacks based on modern software components (e.g., MongoDB, RabbitMQ, Redis, Elasticsearch, PostgreSQL)
- Customer Focus: The Customer Reliability Engineering team (CRE) improves the Domino experience for our most valuable customers. CRE, in collaboration with other customer-facing teams, leads urgent, coordinated responses to large-scale, customer-facing production issues. This involves:
  - Incident Response: Investigating unexpected loss of Domino functionality
  - Broader Deployment Challenges: We investigate when multiple deep technical issues on a deployment threaten to negatively impact a customer’s experience
  - Comprehensive Technical Health Checks: We inspect customers’ deployments to ensure they are configured properly based on their usage patterns
- CRE partners with the larger Reliability Engineering team, in addition to Domino’s Engineering pods, to act as a center of excellence for any urgent investigations
What we value
- We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success.
- We believe in individuals who seek truth and speak the truth and can be their whole selves at work.
- We value people who believe improvement is always possible. At Domino, everything is a work in progress; we can do better at everything.
- We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and the company.
- We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply