Site Reliability Engineer

‐ Incopro

Location

United States of America

Job Term

Full-Time

Company Website

www.incoproip.com

Corsearch has more than 1,500 team members serving over 5,000 clients on five continents, and we’re growing and changing rapidly. We are a fantastic company to work for, with great benefits, growth opportunities, and a terrific internal culture, and we truly believe that it’s our people who make us thrive. Every day, we are transforming ourselves into a better partner for our customers, a better employer for our colleagues, and a better investment for our owners.

Position Description

When not fighting fires, the team is responsible for fire prevention through monitoring, automation, self-healing and resiliency initiatives, destructive testing, and game day exercises. The incumbent in this role would demonstrate a strong focus on tactical operations, as well as large-scale production engineering and orchestration.

· Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.

· Incident Management - Act in key response roles during major incidents (e.g. Sev0, Sev1), and participate in the technical review of each incident for problem management.

· Problem Management - Populate and participate in Root Cause Analyses (RCAs) and hand them off to the Global Solutions team.

· Ensuring that work carried out by the Site Reliability team is executed in such a way as to comply with the company’s internal compliance policy and directives

· Being available to discuss and resolve technical issues and escalations with other technical staff as required

· Document, develop, and improve operational practices and procedures.

· Maintain configuration management and orchestration tooling.

· Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth.

· Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required

· Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities.

· Work to automate detection and resolution of recurring issues in the production environment

Requirements:

· Experience with monitoring, logging, and alerting technologies: Datadog, CloudWatch, Grafana, Prometheus, the ELK stack, and related tools.

· Experience with software engineering and data structure principles and practices.

· Experience with object-oriented and structured programming principles and practices.

· Experience with distributed computing, storage, and networking design, monitoring and administration.

· Experience with public cloud services including AWS, GCP, and Azure.

· Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.

· Experience with CI/CD tools, configuration management, and IaC.

· Experience with application metrics, performance monitoring, and optimization.

· Experience automating, maintaining, and improving systems and applications.

· Strong ability to understand and translate technical needs into actionable solutions.

· Proactive mindset with strong attention to details, patterns, and potential bottlenecks.

· Provable success collaborating across teams and tiers within an enterprise organization.

