Site Reliability Engineer

Job Description

As a Site Reliability Engineer (SRE), you'll help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Much of our support and software development focuses on optimizing existing systems, building infrastructure and reducing work through automation. You'll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment, you'll take the lead on relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you'll focus on running better production applications and systems.

SRE

Develop, test and debug automated tasks (Apps, Systems, Infrastructure)
Troubleshoot priority incidents, facilitate blameless post-mortems
Work with development teams throughout the software life cycle ensuring sustainable software releases
Perform analytics on previous incidents and usage patterns to better predict issues and take proactive actions
Build and drive adoption for greater self-healing and resiliency patterns
Lead and participate in performance tests; identify bottlenecks, opportunities for optimization, and capacity demands
Responsibilities and Qualifications:

• Incident Management

Possess excellent troubleshooting skills and the drive to help internal and external customers
Demonstrate sound analytical and diagnostic skills when dealing with issues that are not readily defined and/or conflict with available information, and the ability to reach sound decisions quickly
Conduct appropriate monitoring tasks, including maintenance and patching validation
Gather logs and necessary details to facilitate the analysis of technical issues
Create technical documentation to further increase product knowledge
Create agile stories for alerting, monitoring and self-healing
Collaborate across teams to bring appropriate visibility to critical issues
• Knowledge Management

Review historical records on closed cases to increase product and technical knowledge
Contribute to a LOB-focused environment that encourages information sharing, team-based resolution activity, cross-training and an absolute focus on updating customer incidents as quickly and effectively as possible
Attend training sessions offered and assist with peer training as needed
Strong configuration and development background combined with reporting and analytics
Experience with routing, workflow, design, development and test to support CTI
• Communication

Ensure that all internal and external customer interactions are handled professionally and with the highest level of service and follow-through, and consistently keep commitments
Demonstrate effective verbal and written communication with the team and customers
Show leadership on any production issue, coordinate the teams involved in working towards a fix, and ensure minimal customer impact
Demonstrate the ability to translate and communicate business processes into the applicable requirement types (functional, technical, etc.)
Maintain a positive attitude toward self-learning and mentoring others on new platform skills and technologies
• Innovation

Implement continuous process improvement, including but not limited to policy, procedures, and production monitoring
Identify, coordinate, and implement initiatives/projects and activities that create efficiencies and optimize technical processing
Analyze upcoming changes to production, review all necessary documents and support efficient implementation
• Required Skills

Must have hands-on experience with orchestration technologies (Kubernetes, Docker), operating systems such as Linux/Unix and Windows, and VMware virtualization
Knowledge of infrastructure, network zones, load balancing and data centers
Experience supporting Java web frameworks (Spring)
Development of automation, monitoring and deployment scripts (Ansible, PowerShell, Bash, Python, Java)
Experience supporting cloud solutions both hosted and on-prem
Knowledge of supporting and troubleshooting databases: PostgreSQL, MS-SQL, MySQL, Oracle
Working knowledge of Agile methods, preferably Scrum and/or Kanban
Experience with IT security tools such as Fortify, WebInspect and Black Duck
Experience implementing and configuring open-source monitoring tools such as Prometheus, Grafana, ELK, LogZ, Jaeger and the OpenTelemetry stack
Understanding of build and orchestration technologies: Maven, Jenkins, Docker, Kubernetes
Experience supporting a distributed messaging layer: Apache Kafka, MQ (WebSphere MQ)
Supporting APIs and services that utilize REST, SOAP and Web Services
Experience creating and evolving CI/CD pipelines with Azure DevOps, GitLab or GitHub, following GitOps principles
Experience in Disaster Recovery and Site Resiliency Engineering planning and test execution

Organization: Digital Industries

Company: Siemens Industry Software (India) Private Limited

Experience Level: Experienced Professional

Job Type: Full-time