
at J.P. Morgan
Bulge Bracket Investment Banks
Posted 13 days ago
**Lead SRE - AI/ML Data Platforms (Grafana, Dynatrace, SLO/SLI)**
- Drive AI/ML data solutions' scalability, resilience, and innovation in top-tier finance.
- Key responsibilities: root cause analysis, production change management, team mentoring, and strategic collaboration across global teams.
- Skills required: proficiency in multiple tech stacks (Databricks, Snowflake, AWS, Kubernetes), SRE culture, incident management, Python/PySpark for AI/ML, observability tools (Grafana, Dynatrace), and an understanding of error budgets.
- 5+ years of proven experience in software engineering and SRE roles, with a strong focus on operational excellence.
- Collaborate and build meaningful relationships to achieve shared goals in an agile team environment.
- Compensation: Not specified
- City: Not specified
- Country: India
- Currency: Not specified
Full Job Description
Location: Hyderabad, Telangana, India
Take on a critical role in shaping the future of a globally recognized firm, with a direct and significant impact in a field suited to top achievers in site reliability.
As a Lead Site Reliability Engineer
Job Responsibilities:
- Expertise in application development and support with multiple technologies such as Databricks, Snowflake, AWS, Kubernetes, etc.
- Coordinate incident management coverage to ensure effective resolution of application issues.
- Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
- Mentor and guide team members to foster innovation and strategic change.
- Develop and support AI/ML solutions for troubleshooting and incident resolution.
Required qualifications, capabilities and skills
- Formal training or certification on software engineering concepts and 5+ years applied experience
- Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform
- Proficiency in running production incident calls and managing incident resolution.
- Experience in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
- Strong understanding of SLI/SLO/SLA and Error Budgets
- Proficiency in Python or PySpark for AI/ML modeling.
- Must be able to reduce toil by building new tools to automate repeated tasks.
- Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
- Understanding of network topologies, load balancing, and content delivery networks.
- Awareness of risk controls and compliance with departmental and company-wide standards.
- Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.
Preferred qualifications, capabilities and skills
- Hands-on experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies.
- AWS, Snowflake or Databricks certifications.