Bulge Bracket Investment Banks

Posted 2 months ago

Lead SRE role at JPMorgan Chase's Consumer & Community Banking Infrastructure & Production Management organization, responsible for driving reliability, resiliency design reviews, and technical leadership for medium-to-large products. You will own incident command for major outages, mentor engineers, and promote SRE practices including SLIs/SLOs, postmortems, and observability. The role emphasizes leveraging AI and data-driven automation to detect, diagnose, and resolve issues, and requires hands-on experience with cloud platforms, Kubernetes, IaC, CI/CD, and observability tooling. Strong communication, documentation, and the ability to perform under pressure are essential.

Compensation: Not specified
City: Plano
Country: United States

Full Job Description

Location: Plano, TX, United States

Take on a critical role in shaping the future of a globally recognized firm, with direct and significant impact in an environment built for top achievers in site reliability.

As a Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure & Production Management sector of Consumer & Community Banking, you hold a leadership role on your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues they face. You lead and conduct resiliency design reviews, break complex problems into digestible work for other engineers, act as a technical lead for medium-to-large products, and mentor and advise other engineers.

Job responsibilities

Advocate and embody site reliability principles, fostering a culture of excellence and technical influence within your team.

Leverage AI tools to enhance operational effectiveness and automate processes, ensuring high-quality customer service.

Spearhead projects aimed at enhancing the reliability and stability of applications and platforms.

Use data-driven analytics and AI technologies to automate detection, diagnosis, and resolution processes, elevate service levels, and drive continuous improvement.

Engage stakeholders to establish realistic service level objectives and error budgets, ensuring alignment with customer expectations.

Exhibit advanced technical proficiency in one or more domains, proactively addressing technology-related bottlenecks.

Employ AI-driven solutions to streamline processes and enhance operational efficiency.

Serve as the primary contact during major incidents, demonstrating the ability to swiftly identify and resolve issues to prevent financial losses.

Act as a culture carrier by documenting and disseminating knowledge through internal forums and communities of practice.

Mentor team members, guiding them in the strategic adoption of AI technologies to enhance operational effectiveness and customer service.

Required qualifications, capabilities, and skills

Formal training or certification in site reliability engineering concepts and 5+ years of applied experience.

Proven success in an SRE or senior DevOps role, with deep knowledge of service level indicators/objectives (SLIs/SLOs), incident management, postmortem analysis, and systems reliability.

Expertise with observability stacks (e.g., Datadog/Dynatrace, Prometheus, Grafana, Splunk, ELK, OpenTelemetry), including deep experience correlating telemetry across services and time.

Hands-on skills in coding (at least one high-level programming language), cloud platforms (AWS or GCP), container orchestration (Kubernetes), infrastructure as code (Terraform), and resilient CI/CD pipelines.

Active experience with or deep curiosity about applying AI to operations, such as LLM-based copilots, anomaly detection, automated runbooks, and autonomous agents.

A track record of delivering under pressure. You finish what you start, adapt to uncertainty, and thrive in high-accountability environments.

You deconstruct complexity, organize effectively, and drive clarity into ambiguous operational environments. Documentation and design are second nature.

Outstanding communication, empathy, and professionalism—especially during incidents. You recognize that great systems serve real people.

Preferred qualifications, capabilities, and skills

Experience with operational and compliance rigor in banking, fintech, or a similar field.

Experience managing and optimizing various types of databases, including relational and NoSQL databases.

Experience with game days, chaos experiments, or failure-mode analysis to improve service robustness.

A background in mentoring engineers or leading technical knowledge-sharing, especially around AI and SRE best practices.

Ability to initiate and implement ideas to solve business problems.

Strong communicator with excellent problem-solving, critical thinking, and analytical reasoning skills, along with attention to detail and a passion for innovation.

Ability to lead and conduct resiliency design reviews, break up complex problems, and act as a technical lead for medium-to-large products.
