With the rapid development of cloud and artificial intelligence technologies, many enterprises are adopting MLOps practices to optimize and manage machine learning projects, thereby reducing the risks associated with manual operations. Among mainstream approaches, Apache Airflow can assemble containerized tasks into an automated ML pipeline. However, Airflow's fault-tolerance mechanism recovers an interrupted task by simply restarting it; in other words, the task's "state" is not preserved. Since interruption events can hardly be ruled out on cloud platforms, long-running training tasks are left poorly protected. Once a long-running model training task is interrupted, re-executing it from scratch costs a significant amount of time, delaying subsequent pipeline tasks and squeezing cluster resources. To address these issues, this study focuses on protecting the state of PyTorch training tasks in ML pipelines running on Airflow, and on recovering that state after an interruption. We propose combining Airflow's fault-tolerance mechanism with PyTorch's checkpoint facility, and implement a checkpoint-based recovery method on a GPU cluster so that a training task's state is not lost when the task is restarted. The method maps the state of the old task onto the new computational resources allocated to the restarted task, and restores that state through a Checkpoint Hook Function. In the experiments, the training task used ResNet18 on the ImageNet-1K dataset for 31 epochs. Over seven runs on Airflow, the average total execution time was approximately 1234.33 minutes; the Checkpoint Hook Function added 15.51 minutes, an average time overhead of about 1.25%. With interruption events injected at different points in time, the average total execution time over seven runs was 1862.52 minutes, and the Checkpoint Hook Function reduced this by 592.02 minutes, shortening execution time by an average of approximately 31.79%.
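As a rough illustration of the recovery mechanism described above, the sketch below shows how a containerized PyTorch training task might save a checkpoint after each epoch and restore it when Airflow restarts the task. This is a minimal sketch, not the thesis's actual implementation of the Checkpoint Hook Function: the helper names (load_checkpoint, save_checkpoint) and the shared-volume path /mnt/checkpoints/resnet18.pt are hypothetical, assuming a volume that survives container restarts on the GPU cluster.

```python
# Hypothetical sketch: checkpoint-and-resume logic for an Airflow-restarted PyTorch task.
import os
import torch
from torchvision.models import resnet18

CKPT_PATH = "/mnt/checkpoints/resnet18.pt"  # assumed shared volume visible across restarts

def load_checkpoint(model, optimizer):
    """Restore model/optimizer state if a checkpoint survived a task restart."""
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["epoch"] + 1  # resume from the next epoch
    return 0  # fresh start: no state to recover

def save_checkpoint(model, optimizer, epoch):
    """Persist state via write-then-rename so a mid-write interruption cannot corrupt it."""
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, CKPT_PATH)  # atomic on POSIX filesystems

def train(num_epochs=31):
    model = resnet18(num_classes=1000)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    start_epoch = load_checkpoint(model, optimizer)  # hook runs before training begins
    for epoch in range(start_epoch, num_epochs):
        # ... one epoch of training on ImageNet-1K would run here ...
        save_checkpoint(model, optimizer, epoch)  # hook runs after each epoch

if __name__ == "__main__":
    train()
```

Under these assumptions, Airflow's retry simply re-launches the same container; because the checkpoint lives on storage outside the container, the restarted task skips the already-completed epochs rather than repeating all 31, which is the source of the reported 31.79% average reduction under interruptions.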