This thesis uses the expected future reward as the basis by which a swarm searches for the optimal solution in the search space, and computes that expected reward with reinforcement learning. Based on this idea, we propose a new optimization algorithm: the Q-learning-based swarm optimization (QSO) algorithm. QSO is a population-based optimization algorithm that integrates the essential properties of Q-learning and particle swarm optimization (PSO).

QSO treats each individual's performance as a reward and selects, as the best individual, the one with the largest accumulated reward over the course of the evolution, rather than judging individuals by a single, momentary evaluation. During the search, every individual learns from the best individual in the swarm, and the swarm gradually evolves toward the optimal solution. We compared QSO's search ability with that of several existing optimization algorithms on a set of benchmark functions, and applied it to two real-world problems: the economic dispatch (ED) problem for power systems and the operating room scheduling problem. On the benchmark functions, QSO is comparable to, and in several cases outperforms, the existing algorithms; in both applications it finds better solutions than several existing methods.

For the ED problem, this thesis solves two cases: a 3-generator problem and a 40-generator problem. The objective is to minimize the total generation cost while supplying the required total power (850 MW for the 3-generator case and 10500 MW for the 40-generator case) and satisfying the output limits of every generator. To handle these constraints, we propose two methods: an inequality-constraint method and an equality-constraint method. During the search, QSO uses the inequality-constraint method to keep each generator's output within its limits, and the equality-constraint method to adjust the outputs so that the total power exactly meets the demand. With these two methods, QSO finds better solutions than the previously best known solutions for both cases.

Operating room scheduling in today's hospitals is still largely done by hand and is often inefficient, so this thesis applies QSO to operating room scheduling in order to improve room utilization. The goal is to fit the operations into a limited number of operating rooms while respecting the priority order of the operations and the constraint that certain operations must use specific rooms. For this problem, we propose a weight-based encoding in which an operation sequence assignment list is encoded into a weight vector consisting of two component vectors: 1) an operating room assignment vector (ORAV) and 2) an operation order priority vector (OOPV). An individual's ORAV determines the room to which each operation is assigned, while its OOPV provides the priority weights that decide the order in which the assigned operations use the rooms. QSO is used to update these weight vectors to search for better schedules, and the schedules it finds are better than those produced by a genetic algorithm and by manual scheduling.

In summary, the proposed QSO algorithm is a population-based algorithm for continuous search spaces that uses an individual's expected reward as the measure of its performance; through social influence and social learning (i.e., cooperation, competition, and imitation), good solutions are simultaneously explored and exploited over successive generations. The key difference between expected and momentary performance is that the expected performance, estimated by Q-learning, measures whether an individual will eventually obtain the maximum reward from the environment, whereas the momentary performance only evaluates the immediate feedback for the current action. Because an individual's learning is cumulative and its behavior is reflected in future rewards, we argue that an individual should not be judged solely on its current performance.
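As a rough illustration of the core idea, the following is a minimal sketch of a PSO-style swarm whose leader is chosen by an accumulated, Q-like reward estimate rather than by momentary fitness. The function name, parameters, and the particular update rule are our assumptions for illustration, not the exact formulation of the thesis.

```python
import random

def qso_sketch(f, dim, n=20, iters=200, gamma=0.9, lo=-5.0, hi=5.0):
    """Minimize f over [lo, hi]^dim with a PSO-style swarm whose leader is
    selected by an accumulated (Q-like) reward estimate, not momentary fitness.
    A hypothetical sketch of the QSO idea, not the thesis's exact algorithm."""
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    q = [0.0] * n                    # running expected-reward estimate per individual
    best_x, best_fx = None, float("inf")
    for _ in range(iters):
        for i in range(n):
            fx = f(pos[i])
            reward = -fx             # lower cost -> higher reward (assumption)
            q[i] += gamma * (reward - q[i])   # Q-like incremental update
            if fx < best_fx:
                best_x, best_fx = pos[i][:], fx
        # leader = individual with the best *accumulated* reward, not best current fx
        leader = pos[max(range(n), key=lambda i: q[i])]
        for i in range(n):           # every individual imitates the leader
            for d in range(dim):
                vel[i][d] = 0.7 * vel[i][d] + 1.5 * random.random() * (leader[d] - pos[i][d])
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
    return best_x, best_fx
```

For example, `qso_sketch(lambda v: sum(t * t for t in v), dim=3)` drives the sphere function toward its minimum at the origin.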
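The abstract names the two constraint-handling methods for the ED problem but does not spell out their mechanics. One plausible reading is a repair step: clamp each generator's output to its limits (inequality constraint), then shift the residual power among generators that still have slack so the total output meets the demand exactly (equality constraint). The function below is a hypothetical sketch of that reading.

```python
def repair(outputs, limits, demand):
    """Hypothetical constraint repair for an ED candidate solution.
    1) inequality: clamp each output p_i into its [pmin, pmax] limits;
    2) equality: distribute the residual over generators with remaining
       slack so that sum(outputs) == demand."""
    out = [min(pmax, max(pmin, p)) for p, (pmin, pmax) in zip(outputs, limits)]
    residual = demand - sum(out)
    for i, (pmin, pmax) in enumerate(limits):
        if residual == 0:
            break
        if residual > 0:
            step = min(pmax - out[i], residual)   # room to increase this unit
        else:
            step = max(pmin - out[i], residual)   # room to decrease this unit
        out[i] += step
        residual -= step
    return out
```

For instance, with three units limited to [100, 600] MW and a demand of 850 MW, a candidate (700, 50, 200) is first clamped to (600, 100, 200) and then reduced by 50 MW on the first unit, yielding a feasible dispatch summing to 850 MW.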
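The exact decoding of the ORAV/OOPV weight vectors into a schedule is not given in the abstract. As one hypothetical decoding consistent with the description: the ORAV assigns each operation to a room, and within each room the operations run back to back in descending order of their OOPV weights.

```python
def decode_schedule(orav, oopv, durations):
    """Hypothetical decoding of the weight-vector encoding: orav[i] is the
    room assigned to operation i, and oopv[i] is its priority weight.
    Within each room, higher-weight operations are scheduled first,
    back to back.  Returns {operation: (room, start, end)}."""
    rooms = {}
    for op, room in enumerate(orav):
        rooms.setdefault(room, []).append(op)
    schedule = {}
    for room, ops in rooms.items():
        t = 0.0
        for op in sorted(ops, key=lambda o: -oopv[o]):  # descending priority
            schedule[op] = (room, t, t + durations[op])
            t += durations[op]
    return schedule
```

Under this decoding, the optimizer only ever manipulates continuous weight vectors; feasibility (one operation per room at a time, priority ordering) is guaranteed by construction, which is a common reason for choosing such an indirect encoding.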