Canary Wharfian - Online Investment Banking & Finance Community.

Lead Software Engineer - SRE (Grafana, Dynatrace, SLO/SLI)

Experienced, no visa sponsorship

at J.P. Morgan

Bulge Bracket Investment Banks

Posted 13 days ago


**Lead SRE - AI/ML Data Platforms (Grafana, Dynatrace, SLO/SLI)**

  • Drive AI/ML data solutions' scalability, resilience, and innovation in top-tier finance.
  • Key responsibilities: root cause analysis, production change management, team mentoring, and strategic collaboration across global teams.
  • Skills required: proficiency in multiple tech stacks (Databricks, Snowflake, AWS, Kubernetes), SRE culture, incident management, Python/PySpark for AI/ML, observability tools (Grafana, Dynatrace), and an understanding of error budgets.
  • 5+ years' proven experience in software engineering and SRE roles, with a strong focus on operational excellence.
  • Collaborate and build meaningful relationships to achieve shared goals in an agile team environment.

Compensation: Not specified
Currency: Not specified
City: Not specified
Country: India

Full Job Description

Location: Hyderabad, Telangana, India

Assume a critical role in defining the future of a globally recognized firm and have a direct and significant effect in a realm tailored for top achievers in site reliability.

As a Lead Site Reliability Engineer at JPMorgan Chase within the AI/ML Data Platforms team, you will be instrumental in building scalable, resilient and market-leading data solutions. You will engage in root cause analysis, production changes, budgetary considerations, and staffing challenges. Your experience will be vital in managing and mentoring team members to drive strategic change, both within your team and in partnership with colleagues across JPMorgan Chase & Co.'s global network of innovators.

Job Responsibilities:

  • Apply expertise in application development and support across multiple technologies such as Databricks, Snowflake, AWS, Kubernetes, etc.
  • Coordinate incident management coverage to ensure effective resolution of application issues.
  • Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
  • Mentor and guide team members to foster innovation and strategic change.
  • Develop and support AI/ML solutions for troubleshooting and incident resolution.

 

Required qualifications, capabilities and skills

  • Formal training or certification in software engineering concepts and 5+ years of applied experience
  • Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
  • Proficiency in running production incident calls and managing incident resolution.
  • Experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
  • Strong understanding of SLI/SLO/SLA and Error Budgets
  • Proficiency in Python or PySpark for AI/ML modeling.
  • Must be able to reduce toil by building new tools to automate repeated tasks.
  • Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
  • Understanding of network topologies, load balancing, and content delivery networks.
  • Awareness of risk controls and compliance with departmental and company-wide standards.
  • Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.

 

Preferred qualifications, capabilities and skills

  • Hands-on experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
  • AWS, Snowflake or Databricks certifications.
Carry out critical tech solutions across multiple technical areas as an integral part of an agile team
