Ready to take your career to the next level? Do you like the feeling that you are making a difference? This is your chance to be an integral part of a dynamic team of talented professionals deploying and maintaining innovative, industry-leading, cloud-based software.
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.
SRE is a key role in our growing and dynamic IBM Watson Cognitive AI business on Cloud. This technical role focuses on deploying, maintaining, and automating a wide range of operational tasks for the IBM Watson Cognitive AI services in IBM Cloud environments.
The Watson AI Site Reliability Engineer is responsible for:
Providing support and deployment for production and non-production environments across IBM Cloud public regions and dedicated environments.
Developing SLAs and SLOs for the Watson AI services by monitoring availability and taking a holistic view of system health.
Driving the incident management process and supporting a blameless postmortem culture.
Partnering with development teams to improve services via rigorous testing and release procedures.
Developing automation for deployments, upgrades, and self-remediation.
Being the primary SME for Kafka issues and projects.
This role may be based in one of the following strategic locations:
Austin, TX
Raleigh, NC
San Jose, CA
Boston, MA
Atlanta, GA