NCU Institutional Repository (中大機構典藏) — Item 987654321/89273


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/89273


    Title: Simulations of Optimal Control of COVID-19 Pandemic Using Reinforcement Learning
    Authors: 朱柏瑞;Chu, Po-Jui
    Contributors: Institute of Biomedical Engineering (生物醫學工程研究所)
    Keywords: COVID-19; Deep Learning; Reinforcement Learning; Compartmental Models in Epidemiology
    Date: 2022-07-04
    Issue Date: 2022-10-04 11:08:30 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: COVID-19 is an infectious disease caused by the SARS-CoV-2 virus. It originated in Wuhan, China, where a cluster of pneumonia cases of unknown cause was identified in December 2019, and the epidemic quickly spread worldwide through efficient human-to-human transmission. SARS-CoV-2 spreads rapidly and infection tends to produce severe symptoms, and it has had an enormous impact on the world. Before adequate vaccines are available, substantial medical resources and policies that limit human movement and contact, such as restrictions on gatherings, are needed to mitigate the epidemic. Policies to reduce the spread of SARS-CoV-2 include border controls, mandatory or voluntary lockdowns, quarantine, social distancing, mask wearing, and vaccination. These measures suppress transmission by restricting human movement and contact, but excessive restrictions seriously impact the economy.

This study uses reinforcement learning (RL), combining A3C (Asynchronous Advantage Actor-Critic) with PPO (Proximal Policy Optimization), to explore the optimal balance between policy stringency and the economy, and analyzes how the timing of interventions and differences in population density affect the scale of infection. We simulate with the compartmental SEIR (Susceptible-Exposed-Infectious-Recovered) model, adjusting the transition parameters between its states so that the model's basic reproduction number matches that of COVID-19. The experiments cover four Japanese prefectures (Hokkaido, Okinawa, Osaka, and Tokyo) using confirmed-case data from January 2020 to October 2021. The data contain five infection peaks, which a closed SEIR compartmental model can hardly reproduce as a whole picture directly, so we build five SEIR environments, one per peak, and let an optimally trained agent interact with them to reach the goal. Training ran on an i9-10980XE CPU (18 cores, 36 threads) and an RTX 3090 GPU (24 GB), with 18 A3C workers on the host's threads. The average reward rises during training and plateaus after about 500 episodes.

The results show that the trained agent can effectively suppress the growth of confirmed cases. The action policies it produces apply strict measures at three points in time: when the number of infectious cases is rising, after it has kept rising for several days, and when it remains unchanged.
Strict policies are applied more often in high-risk regions on average and are relaxed as confirmed cases decline. We also find that population-weighted density represents how densely a region is populated better than conventional population density, so it is the more accurate measure for studying the virus's infectivity within a region. Finally, we modify the SEIR model by adding a Quarantined (Q) compartment to form a SEIQR model; the experiments show that variants of the SEIR model can simulate a variety of situations and different infectious diseases. Whether our trained agent can be applied broadly across diseases, however, depends on whether the states provided by the environment are the same. If those states can be generalized, identifying information common to every infectious disease that is sufficient for the agent to decide whether to impose strict policies, then an epidemiologically suitable reward function can be built from that information and an agent applicable across infectious diseases can be trained. This is a topic for future research.
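
For reference, the SEIR dynamics the abstract relies on can be written out as a small simulation. The sketch below is illustrative rather than the thesis code: the parameter values are placeholders, with beta chosen so that the basic reproduction number R0 = beta/gamma sits near values commonly reported for early COVID-19.

```python
import numpy as np

def simulate_seir(beta, sigma, gamma, N, E0, I0, days, dt=0.1):
    """Forward-Euler integration of the closed SEIR model.

    beta  : transmission rate per day
    sigma : 1 / mean incubation period (E -> I)
    gamma : 1 / mean infectious period (I -> R), so R0 = beta / gamma
    """
    S, E, I, R = N - E0 - I0, float(E0), float(I0), 0.0
    history = []
    for _ in range(int(days / dt)):
        new_exposed    = beta * S * I / N * dt   # S -> E
        new_infectious = sigma * E * dt          # E -> I
        new_removed    = gamma * I * dt          # I -> R
        S -= new_exposed
        E += new_exposed - new_infectious
        I += new_infectious - new_removed
        R += new_removed
        history.append((S, E, I, R))
    return np.array(history)

# Illustrative run: R0 = 0.5 / 0.2 = 2.5, in the range often cited for COVID-19.
trajectory = simulate_seir(beta=0.5, sigma=1/5.2, gamma=0.2,
                           N=1_000_000, E0=10, I0=1, days=180)
```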
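The abstract frames policy selection as an RL problem in which the agent trades infection control against economic cost. The sketch below shows one plausible way such an environment could be wired up; the action set, the beta reductions, and the reward weights are hypothetical stand-ins, since the abstract does not give the thesis's actual state encoding or reward function.

```python
class PandemicEnv:
    """Toy SEIR environment whose action is a policy-stringency level
    that scales the transmission rate beta. Hypothetical sketch only."""

    REDUCTIONS = (0.0, 0.3, 0.6)   # assumed cut to beta: none/moderate/strict
    ECON_COST  = (0.0, 0.5, 1.0)   # assumed economic penalty per step

    def __init__(self, beta=0.5, sigma=1/5.2, gamma=0.2, N=1_000_000):
        self.beta, self.sigma, self.gamma, self.N = beta, sigma, gamma, N
        self.reset()

    def reset(self):
        self.S, self.E, self.I, self.R = self.N - 11, 10.0, 1.0, 0.0
        return self._state()

    def _state(self):
        return (self.S / self.N, self.E / self.N,
                self.I / self.N, self.R / self.N)

    def step(self, action):
        # Stricter policy -> smaller effective transmission rate.
        beta_eff = self.beta * (1.0 - self.REDUCTIONS[action])
        new_E = beta_eff * self.S * self.I / self.N
        new_I = self.sigma * self.E
        new_R = self.gamma * self.I
        self.S -= new_E
        self.E += new_E - new_I
        self.I += new_I - new_R
        self.R += new_R
        # Placeholder reward: penalize infections and economic cost.
        reward = -(self.I / self.N) * 100.0 - self.ECON_COST[action]
        done = self.I < 1.0
        return self._state(), reward, done
```

An A3C setup like the one described would run copies of such an environment in parallel worker threads (18 in the thesis's training run), with PPO's clipped surrogate objective stabilizing each policy update.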
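The abstract's claim about population-weighted density can be made concrete: instead of dividing total population by total area, each subregion's density is weighted by the share of people living at that density, so it reflects the crowding an average resident actually experiences. A minimal sketch with made-up subregion numbers:

```python
def plain_density(pops, areas):
    # Conventional density: total population / total area.
    return sum(pops) / sum(areas)

def weighted_density(pops, areas):
    # Population-weighted density: each subregion's density
    # weighted by that subregion's population share.
    densities = [p / a for p, a in zip(pops, areas)]
    return sum(p * d for p, d in zip(pops, densities)) / sum(pops)

# Example: one dense city plus sparse countryside (hypothetical numbers).
pops, areas = [900_000, 100_000], [100.0, 900.0]   # people, km^2
print(plain_density(pops, areas))     # 1000 people/km^2
print(weighted_density(pops, areas))  # ~8111 people/km^2, reflecting crowding
```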
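The SEIQR extension mentioned at the end of the abstract adds a quarantined compartment Q, so that detected infectious individuals stop transmitting. One way to write the modified transitions, with the quarantine and recovery rates as assumed placeholders rather than thesis values:

```python
def seiqr_step(state, N, beta, sigma, gamma, q_rate, gamma_q, dt=0.1):
    """One Euler step of a SEIQR model: infectious individuals are
    quarantined at rate q_rate, and quarantined individuals no longer
    contribute to transmission."""
    S, E, I, Q, R = state
    new_E  = beta * S * I / N * dt   # only free infectious transmit
    new_I  = sigma * E * dt          # incubation ends
    new_Q  = q_rate * I * dt         # detection and isolation
    I_to_R = gamma * I * dt          # recovery outside quarantine
    Q_to_R = gamma_q * Q * dt        # recovery inside quarantine
    return (S - new_E,
            E + new_E - new_I,
            I + new_I - new_Q - I_to_R,
            Q + new_Q - Q_to_R,
            R + I_to_R + Q_to_R)
```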
    Appears in Collections:[Institute of Biomedical Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    index.html (0 KB, HTML)



