Site Reliability Engineer
Corsearch has more than 1,500 team members serving over 5,000 clients on five continents, and we’re growing and changing rapidly. We are a fantastic company to work for, with great benefits, growth opportunities, and a terrific internal culture, and we truly believe that it’s people who make us thrive. Every day, we are transforming ourselves into a better partner for our customers, a better employer for our colleagues, and a better investment for our owners.
Position Description
When not fighting fires, the team is responsible for fire prevention through monitoring, automation, self-healing and resiliency initiatives, destructive testing, and game day exercises. The incumbent in this role would demonstrate a strong focus on tactical operations, as well as large-scale production engineering and orchestration.
· Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.
· Incident management - Act in key response roles during major incidents (e.g. Sev0, Sev1), and participate in the technical review of each incident for problem management.
· Problem management - Populate and participate in Root Cause Analyses (RCAs) and hand them off to the Global Solutions team.
· Ensure that work carried out by the Site Reliability team complies with the company’s internal compliance policy and directives.
· Be available to discuss and resolve technical issues and escalations with other technical staff as required.
· Document, develop, and improve operational practices and procedures.
· Maintain configuration management and orchestration tooling.
· Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth.
· Identify work opportunities and prepare, or assist with the preparation of, technical proposals as required.
· Operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities.
· Work to automate detection and resolution of recurring issues in the production environment.
Requirements:
· Experience with monitoring, logging, and alerting technologies: Datadog, CloudWatch, Grafana, Prometheus, the ELK stack, and related tools.
· Experience with software engineering and data structure principles and practices.
· Experience with object-oriented and structured programming principles and practices.
· Experience with the design, monitoring, and administration of distributed computing, storage, and networking.
· Experience with public cloud services including AWS, GCP, and Azure.
· Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.
· Experience with CI/CD tools, configuration management, and IaC.
· Experience with application metrics, performance monitoring, and optimization.
· Experience automating, maintaining, and improving systems and applications.
· Strong ability to understand and translate technical needs into actionable solutions.
· Proactive mindset with strong attention to details, patterns, and potential bottlenecks.
· Provable success collaborating across teams and tiers within an enterprise organization.