Senior Site Reliability Engineer
Corsearch has more than 1,500 team members serving over 5,000 clients on five continents, and we’re growing and changing rapidly. We are a fantastic company to work for, with great benefits, growth opportunities, and a terrific internal culture, and we truly believe that it’s people who make us thrive. Every day, we are transforming ourselves into a better partner for our customers, a better employer for our colleagues, and a better investment for our owners.
About the Position
· Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.
· Own the incident response system to alert service owners when their services need their attention, thereby further enabling teams to own their code from their desktop to production
· Problem management: populate and participate in Root Cause Analyses (RCAs) and hand them off to the appropriate team
· Ensure that work carried out by the Site Reliability team complies with the company’s internal compliance policy and directives
· Improve the observability of the platforms to measure system health as well as see historic metrics to allow for faster diagnosis of pending issues or retroactive analysis of production issues.
· Be available to discuss and resolve technical issues and escalations with other technical teams, communicating clearly
· Document, develop, and improve operational practices and procedures.
· Maintain configuration management and orchestration tooling.
· Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth
· Identify work opportunities and prepare, or assist with preparing, technical proposals as required
· Operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities
· Work to automate detection and resolution of recurring issues in the production environment
Requirements:
· Experience with monitoring, logging, and alerting technologies: Datadog, CloudWatch, Grafana, Prometheus, the ELK stack, and related tools
· Experience with software engineering and data structure principles and practices.
· Experience with object-oriented and structured programming principles and practices.
· Experience with distributed computing, storage, and networking design, monitoring and administration.
· Experience with public cloud services including AWS, GCP, and Azure.
· Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.
· Experience with CI/CD tools, configuration management, and IaC.
· Experience with application metrics, performance monitoring, and optimization.
· Experience automating, maintaining, and improving systems and applications.
· Strong ability to understand and translate technical needs into actionable solutions.
· Proactive mindset with strong attention to details, patterns, and potential bottlenecks.
· Provable success collaborating across teams and tiers within an enterprise organization.