Corsearch has more than 1,500 team members serving over 5,000 clients on five continents, and we’re growing and changing rapidly. We are a fantastic company to work for, with great benefits, growth opportunities, and a terrific internal culture, and we truly believe that it’s our people who make us thrive. Every day, we are transforming ourselves into a better partner for our customers, a better employer for our colleagues, and a better investment for our owners.


Position Description

When not fighting fires, the team is responsible for fire prevention through monitoring, automation, self-healing and resiliency initiatives, destructive testing, and game day exercises. The incumbent in this role would demonstrate a strong focus on tactical operations, as well as large-scale production engineering and orchestration.

·      Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.

·      Incident management - Act in key response roles during major incidents (e.g. Sev0, Sev1) and participate in the technical review of each incident for problem management.

·      Problem management - Populate and participate in Root Cause Analyses (RCAs) and hand them off to the Global Solutions team.

·      Ensure that work carried out by the Site Reliability team complies with the company’s internal compliance policy and directives.

·      Be available to discuss and resolve technical issues and escalations with other technical staff as required.

·      Document, develop, and improve operational practices and procedures.

·      Maintain configuration management and orchestration tooling.

·      Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth.

·      Identify work opportunities and prepare or assist with the preparation of technical proposals as required.

·      Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities.

·      Work to automate detection and resolution of recurring issues in the production environment



·         Experience with monitoring, logging, and alerting technologies: Datadog, CloudWatch, Grafana, Prometheus, the ELK stack, and related tools.

·         Experience with software engineering and data structure principles and practices.

·         Experience with object-oriented and structured programming principles and practices.

·         Experience with distributed computing, storage, and networking design, monitoring and administration.

·         Experience with public cloud services including AWS, GCP, and Azure.

·         Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.

·         Experience with CI/CD tools, configuration management, and IaC.

·         Experience with application metrics, performance monitoring, and optimization.

·         Experience automating, maintaining, and improving systems and applications.

·         Strong ability to understand and translate technical needs into actionable solutions.

·         Proactive mindset with strong attention to details, patterns, and potential bottlenecks.

·         Provable success collaborating across teams and tiers within an enterprise organization.
