Bulge Bracket Investment Banks

Posted 5 days ago

**Senior Lead Software Engineer - SRE**: Drive market-leading data solutions in AI/ML platforms. Manage teams, resolve application issues, mentor staff, and collaborate cross-functionally. Proficient in Databricks, Snowflake, AWS, Kubernetes, and SRE principles. Experience required in incident management, observability (Grafana, Dynatrace), error budget management, and toil reduction. Promote resilience, data platform robustness, and strategic change. 10+ years in SRE roles plus relevant certifications preferred.

Compensation: Not specified
City: Jersey City
Country: United States

Full Job Description

Location: Jersey City, NJ, United States

As a Lead Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology AI/ML Data Platforms team, you will be instrumental in building scalable, resilient, and market-leading data solutions. You will engage in root cause analysis, production changes, budgetary considerations, and staffing challenges. Your experience will be vital in managing and mentoring team members to drive strategic change, both within your team and in partnership with colleagues across JPMorgan Chase & Co.'s global network of innovators.

Job Responsibilities:

Apply expertise in application development and support across multiple technologies such as Databricks, Snowflake, AWS, and Kubernetes.
Coordinate incident management coverage to ensure effective resolution of application issues.
Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
Mentor and guide team members to foster innovation and strategic change.
Develop and support AI/ML solutions for troubleshooting and incident resolution.

Required qualifications, skills and capabilities:

Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform.
Proficiency in running production incident calls and managing incident resolution.
Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
Strong understanding of SLI/SLO/SLA and error budgets.
Proficiency in Python or PySpark for AI/ML modeling.
Must be able to reduce toil by building new tools to automate repeated tasks.
Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery.
Understanding of network topologies, load balancing, and content delivery networks.
Awareness of risk controls and compliance with departmental and company-wide standards.
Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.

Preferred Qualifications:

10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
AWS, Snowflake or Databricks certifications.

Promote robust data platforms, lead root cause analysis and changes, and mentor engineers to deliver scalable, resilient solutions.

SIMILAR OPPORTUNITIES

Senior Lead Software Engineer- Site Reliability

Senior Engineer - Site Reliability Engineering

Senior Systems Operations Engineer - SRE

Site Reliability Engineering Lead

Senior Site Reliability Engineer
