
Senior Machine Learning Operations Engineer

Job ID
493075
Posted
28-Jan-2026
Organization
Smart Infrastructure
Business Area
Research & Development
Company
Brightly Software India Private Limited
Experience Level
Experienced Professional
Job Type
Full-time
Work Arrangement
Remote
Contract Type
Permanent
Location
  • Noida, India
  • New Cairo, Egypt
About Brightly Software

Brightly Software is a leader in intelligent asset management and operational optimization, empowering organizations with data‑driven insights. As we expand our AI and ML capabilities, we are seeking a Senior MLOps Engineer to build and scale the infrastructure that powers our next generation of predictive and autonomous solutions.

Role Overview

As a Senior MLOps Engineer, you will architect, develop, and operate end‑to‑end machine learning infrastructure on AWS. You will work at the intersection of ML engineering, cloud infrastructure, and developer productivity—enabling Brightly's data science teams to move seamlessly from experimentation to reliable, secure, and cost‑efficient production systems.
Your work will ensure that ML models and data pipelines are scalable, observable, and compliant with best‑in‑class MLOps practices.

Key Responsibilities

ML Platform & Infrastructure (AWS‑focused)
• Design, build, and operate ML/AI development platforms on AWS, leveraging services such as Amazon SageMaker (Studio, Training, Real‑Time & Async Inference, Pipelines, Feature Store), S3, Glue, Lambda, ECS/EKS, and related cloud infrastructure.
• Implement infrastructure‑as‑code using Terraform or equivalent, and manage workflow orchestration using AWS Step Functions or Airflow.
Data & Model Pipelines
• Build automated data ingestion and transformation pipelines using S3, Glue, EMR/Spark, and Redshift, incorporating data quality and lineage tooling (e.g., Great Expectations, Deequ).
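As a minimal sketch of the kind of declarative batch checks that frameworks such as Great Expectations or Deequ formalize, the following pure-Python example evaluates a few expectations against a record batch (the field names `asset_id` and `temp_c` are illustrative, not part of any Brightly schema):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Expectation:
    """A named predicate evaluated against a whole batch of records."""
    name: str
    check: Callable[[list[dict]], bool]

def run_expectations(batch: list[dict], expectations: list[Expectation]) -> dict[str, bool]:
    """Evaluate each expectation against the batch and report pass/fail per check."""
    return {e.name: e.check(batch) for e in expectations}

# Illustrative checks on a hypothetical asset-telemetry batch.
expectations = [
    Expectation("non_empty", lambda b: len(b) > 0),
    Expectation("no_null_ids", lambda b: all(r.get("asset_id") is not None for r in b)),
    Expectation("temp_in_range", lambda b: all(-40 <= r["temp_c"] <= 125 for r in b)),
]

batch = [
    {"asset_id": "a1", "temp_c": 21.5},
    {"asset_id": "a2", "temp_c": 88.0},
]
report = run_expectations(batch, expectations)
```

In a real pipeline these checks would run as a gate step (e.g. in Glue or Airflow) and fail the run before bad data reaches training.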
CI/CD for Machine Learning
• Develop CI/CD pipelines for ML with CodeBuild, CodePipeline, or GitHub Actions, integrating unit tests, data contract checks, model validation, canary/shadow deployments, and automated rollback strategies.
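One building block of a canary deployment is deterministic traffic splitting; the hypothetical sketch below hashes a request ID so a fixed fraction of traffic consistently reaches the candidate model (function and parameter names are assumptions for illustration):

```python
import hashlib

def route(request_id: str, canary_fraction: float = 0.1) -> str:
    """Map the request ID to a stable point in [0, 1) and route the
    low fraction of the space to the canary model variant."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Because the split is a pure function of the ID, the same client always hits the same variant, which keeps canary metrics comparable and makes rollback a simple config change.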
Model Deployment & Operations
• Deploy real‑time inference endpoints (SageMaker endpoints or FastAPI‑based services on Lambda/ECS/EKS) and scalable batch processing jobs.
• Define SLOs, implement autoscaling, and drive cost/performance optimizations across ML workloads.
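A latency SLO check can be as simple as a percentile against a target; this sketch uses the nearest-rank method, with a 200 ms p95 target chosen purely as an example:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; samples need not be pre-sorted."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_slo(latencies_ms: list[float], p95_target_ms: float = 200.0) -> bool:
    """True when the p95 latency of the window is within the target."""
    return percentile(latencies_ms, 95) <= p95_target_ms

# 95% of requests at 50 ms, 5% at 300 ms: p95 is still within target.
window = [50.0] * 95 + [300.0] * 5
```

In production the same comparison would run against CloudWatch metrics and feed alerting or autoscaling decisions rather than a boolean in code.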
Monitoring, Observability & Governance
• Implement production monitoring for drift, bias, and performance using SageMaker Model Monitor and service telemetry tools like CloudWatch, Prometheus, and Grafana.
• Enforce security and governance best practices, including least‑privilege IAM, VPC‑isolated architectures, encryption, and secret management.
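As one concrete drift statistic of the kind SageMaker Model Monitor reports, the Population Stability Index (PSI) compares a baseline feature distribution to a production window; this sketch works over pre-binned proportions (the bin values are illustrative):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over two binned distributions.
    Both inputs are bin proportions summing to ~1; eps guards empty bins."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time distribution
drifted = [0.10, 0.20, 0.30, 0.40]    # production window
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant drift worth triggering retraining or an alert.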
Cross‑Functional Collaboration
• Partner closely with data scientists, ML engineers, and backend engineers to productionize ML models and streamline development workflows.
• Contribute to the integration of emerging GenAI workloads, including Amazon Bedrock, vector databases (e.g., OpenSearch), and RAG pipelines.

Required Qualifications

• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
• 7+ years of professional experience in ML engineering, DevOps, cloud engineering, or MLOps roles, with at least 3 years in a senior or lead capacity.
• 3+ years of proven experience designing and architecting robust, scalable ML systems and infrastructure in cloud environments, particularly on AWS.
• 5+ years of deep experience building on the AWS ML ecosystem, including SageMaker, S3, Lambda, ECR, EKS/ECS, Step Functions, IAM, VPC networking, and CI/CD tooling.
• 3+ years of hands-on experience deploying, maintaining, and scaling ML models in production environments.
• 3+ years of professional Python development experience, along with familiarity with Docker‑based workflows.
• 5+ years of hands-on experience with ML lifecycles, model evaluation, and monitoring patterns.
• 5+ years of extensive experience with infrastructure‑as‑code (Terraform, CloudFormation).
• 5+ years of expertise in designing system architecture for ML platforms, including microservices, container orchestration, and cloud networking.
• 3+ years of experience applying MLOps best practices as defined by AWS and industry standards.
• 2+ years of experience with data quality frameworks (Great Expectations, Deequ).
• 2+ years of experience optimizing distributed training workflows on AWS.
• 3+ years of experience with security and compliance requirements for ML in enterprise settings, such as IAM, encryption, and secret management.
• 2+ years of experience with monitoring tools (CloudWatch, Prometheus, Grafana) and implementing model observability solutions.
• 5+ years of effective cross-functional collaboration, working closely with data scientists, ML engineers, and software engineers to deliver production-grade ML solutions.
• 7+ years of demonstrated problem-solving and communication skills, with a focus on delivering scalable, reliable, and cost-effective ML platforms.

The Brightly culture

We’re guided by a vision of community that serves the ambitions and wellbeing of all people, and our professional communities are no exception. We model that ideal every day by being supportive, collaborative partners to one another, conscientiously making space for our colleagues to grow and thrive. Our passionate team is driven to create a future where smarter infrastructure protects the environments that shape and connect us all. That brighter future starts with us.