Careers

Location

Remote

Job Term

Full-Time


Corsearch has more than 1,500 team members serving over 5,000 clients on five continents, and we're growing and changing rapidly. We are a fantastic company to work for (with great benefits, growth opportunities, and a terrific internal culture), and we truly believe that it is our people who make us thrive. Every day, we are transforming ourselves into a better partner for our customers, a better employer for our colleagues, and a better investment for our owners.


About the Position

·      Keep customer-facing services available and performing at their best by continuously maintaining the health of the supporting systems.

·      Own the incident response system that alerts service owners when their services need attention, further enabling teams to own their code from their desktop to production.

·      Problem management: populate and participate in Root Cause Analyses (RCAs), and hand them off to the appropriate team.

·      Ensure that work carried out by the Site Reliability team complies with the company's internal compliance policies and directives.

·      Improve the observability of the platforms, both to measure current system health and to retain historical metrics, allowing faster diagnosis of developing issues and retrospective analysis of production issues.

·      Be available to discuss and resolve technical issues and escalations with other technical teams, communicating clearly.

·      Document, develop, and improve operational practices and procedures.

·      Maintain configuration management and orchestration tooling.

·      Work with and lead other team members in keeping up with key industry innovations and technologies, and support the team's development and growth.

·      Identify work opportunities and prepare, or assist with preparing, technical proposals as required.

·      Operate effectively in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities.

·      Work to automate the detection and resolution of recurring issues in the production environment.


Requirements:

·         Experience with monitoring, logging, and alerting technologies such as Datadog, CloudWatch, Grafana, Prometheus, the ELK stack, and related tools.

·         Experience with software engineering and data structure principles and practices.

·         Experience with object-oriented and structured programming principles and practices.

·         Experience with the design, monitoring, and administration of distributed computing, storage, and networking systems.

·         Experience with public cloud services including AWS, GCP, and Azure.

·         Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.

·         Experience with CI/CD tools, configuration management, and infrastructure as code (IaC).

·         Experience with application metrics, performance monitoring, and optimization.

·         Experience automating, maintaining, and improving systems and applications.

·         Strong ability to understand and translate technical needs into actionable solutions.

·         Proactive mindset with strong attention to details, patterns, and potential bottlenecks.

·         Proven success collaborating across teams and tiers within an enterprise organization.
