IJAIEM

International Journal of Application or Innovation in Engineering and Management
ISSN:2319-4847

Resilient ML Pipelines Using MLOps with SRE Principles and Chaos Testing for Fault-Tolerant AI Infrastructure

Pramod Begur Nagaraj, Winner Pulakhandam, Visrutatma Rao Vallu, Archana Chaluvadi, R Padmavathy

Abstract

Machine learning (ML) has transformed industries by enabling data-driven decisions and predictive analytics, yet deploying and maintaining ML systems with high reliability and low downtime remains a complex challenge. Although MLOps improves model deployment, monitoring, and governance, it does not treat resilience and fault tolerance as core parts of its process. This paper presents a conceptual design for a fault-tolerant ML pipeline that integrates Site Reliability Engineering (SRE) principles into MLOps, together with Chaos Testing, to address issues such as model drift, data discrepancies, and adversarial attacks. The proposed design supports automated deployments, proactive monitoring, and failure simulations to make AI systems more resilient, minimizing downtime and maintaining performance under unexpected conditions. Chaos Testing exercises the pipeline by simulating failures, while SRE handles proactive monitoring, fault prediction,
