Abstract
Pramod Begur Nagaraj, Winner Pulakhandam, Visrutatma Rao Vallu, Archana Chaluvadi, R Padmavathy
Machine learning (ML) has transformed industry by enabling data-driven decisions and predictive analysis, yet deploying and maintaining ML systems with high reliability and low downtime remains a complex challenge. Although MLOps improves model deployment, monitoring, and governance, it does not treat resilience and fault tolerance as core parts of its process. This work presents a conceptual design for a fault-tolerant ML pipeline that combines MLOps with Site Reliability Engineering (SRE) principles and Chaos Testing to address issues such as model drift, data discrepancies, and adversarial attacks. The proposed design supports automated deployments, proactive monitoring, and failure simulations, aiming for minimal downtime and sustained performance under unexpected conditions and thereby improving the resilience of AI systems. Chaos Testing exercises the pipeline by simulating failures, while SRE handles proactive monitoring, fault prediction,