
**Senior Lead Site Reliability Engineer**
at J.P. Morgan
Bulge Bracket Investment Banks
Posted 5 days ago

- JPMorgan Chase, Jersey City, NJ
- Lead SRE role, driving reliability across consumer data and analytics systems
- 16+ years of software experience, 5+ in SRE, with expertise in AI/ML platforms and tools (Databricks, GPU clusters, etc.)
- Champion site reliability principles, design robust systems, and reduce toil
- Collaborate cross-functionally to define and enforce SLOs/SLIs, drive reliability at scale
- Leverage AI/Agents for intelligent incident management and automation
- Provide mentorship, contribute to community forums and improve CCB systems governance
- Preferred: Cloud architecture experience (AWS, Snowflake, Kubernetes), communication/problem-solving skills.
- Compensation: Not specified (USD)
- City: Jersey City
- Country: United States
- Currency: $ (USD)
Experienced
No visa sponsorship
Full Job Description
Location: Jersey City, NJ, United States
Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.

As a Sr Lead Site Reliability Engineer at JPMorgan Chase within the Consumer & Community Banking Data and Analytics team, you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure and reducing work through automation. You'll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment you'll take the lead on relevant projects, supported by an organization that provides the support and mentorship you need to learn and grow.

Job responsibilities
- Creates high quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
- Provides advice and mentoring to other engineers and acts as a key resource for technologists seeking advice on technical and business-related issues
- Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
- Collaborates with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
- Identifies application patterns and analytics in support of better service level objectives
- Designs self-healing and resiliency patterns
- Designs automated software and product upgrades, change management, and release management solutions
- Works toward becoming an expert on the applications and platforms in your remit while understanding their interdependencies and limitations
- Evolves and debugs critical components of applications and platforms
- Provides comprehensive and ongoing guidance, tools, and solutions to support the firm's growth
- Makes significant contributions to JPMorgan Chase's site reliability community via internal forums, communities of practice, guilds, and conferences

Required qualifications, capabilities, and skills
- 16+ years of software engineering experience with 5+ years of Site Reliability Engineering experience.
- Advanced knowledge in site reliability culture and principles with demonstrated ability to implement site reliability within an application or platform.
- At least 2 years of hands-on experience in architecting, scaling, and providing SRE support for AI/ML platforms and products, including infrastructure tech stacks such as Databricks, GPU clusters, Model Serving frameworks, Feature Stores, Vector Databases, and LLM inference pipelines.
- Demonstrated ability to apply core SRE fundamentals including reliability patterns, capacity planning, incident management, performance tuning, and toil reduction specifically to AI/ML and data-intensive, compute-heavy workloads.
- Experience in defining and enforcing SLOs/SLIs tailored to AI/ML workloads (e.g., model latency, throughput, data freshness, inference availability) to drive reliability at scale.
- Proven hands-on experience in designing and implementing Agentic AI-based solutions to deliver SRE capabilities at scale, including practical expertise with AI Agents, Skills, Context Management, Retrieval-Augmented Generation (RAG), and tool-use patterns.
- Ability to apply Agentic AI frameworks to automate and augment core SRE functions such as intelligent incident detection and remediation, automated root cause analysis, predictive alerting, self-healing infrastructure, runbook automation, and observability enrichment to reduce toil and accelerate MTTR.
- Contribute to governance and controls of AI usage with a site reliability mindset and principles of CCB systems and platforms.
- Advanced knowledge and experience in observability such as white-box and black-box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
Preferred Qualifications
- Experience with cloud-based data and analytics architecture, including AWS storage, Snowflake, Kubernetes (EKS), event-driven architectures, streaming services, batch jobs, and ETL pipelines.
- Proficiency with modern data processing frameworks such as Apache Kafka, Apache Spark, and similar tools, with a focus on ensuring scalability, reliability, and performance of data and analytics platforms.
- Strong communication skills with ability to mentor and educate others on site reliability principles and practices.
- Recognized as an active contributor to the engineering community.
- Work with stakeholders to define non-functional requirements and availability targets for the services in application and product lines.
