In this research, we design an interactive dialogue system that helps users complete a robot assembly task. The system provides solutions when the user encounters problems during the assembly process. We map each user question to the most closely related pre-defined frequently asked question (FAQ), which serves as the user intent, and the system then answers according to the detected intent. In general, user questions with the same intent can mostly be resolved with similar answers. In our assembly task, however, even the same question asked at different assembly steps should lead to different responses. Using only the user's question utterance, our intent classifier achieves an accuracy of 68.95%. To address this problem, we integrate the proposed Yolo-based Masker with CNN-LSTM (YMCL) model into the intent classifier of our dialogue system. By incorporating visual information, we observe a significant improvement in accuracy in experiments conducted on different datasets.
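To make the text-plus-vision fusion concrete, below is a minimal sketch of such a multimodal intent classifier in PyTorch. The abstract only specifies a YOLO-based masker feeding a CNN-LSTM alongside the question utterance; the class name `IntentClassifier`, the layer sizes, and the late-fusion-by-concatenation design here are illustrative assumptions, not the thesis's actual architecture.

```python
# A minimal sketch, assuming PyTorch and toy dimensions. The visual branch
# takes frames already masked by a YOLO-based detector (not shown here).
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Fuses a text encoding of the user question with a CNN-LSTM
    encoding of (YOLO-masked) assembly-step frames, then predicts
    one of the pre-defined FAQ intents. Hypothetical architecture."""

    def __init__(self, vocab_size, num_intents, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Text branch: embedding + LSTM over the question utterance.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Visual branch: a small CNN per frame, then an LSTM over frames.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch*frames, 32)
        )
        self.vis_lstm = nn.LSTM(32, hidden_dim, batch_first=True)
        # Fusion: concatenate both final hidden states, classify intent.
        self.classifier = nn.Linear(2 * hidden_dim, num_intents)

    def forward(self, tokens, frames):
        # tokens: (batch, seq_len) token ids of the user question
        # frames: (batch, num_frames, 3, H, W) masked scene images
        _, (h_text, _) = self.text_lstm(self.embed(tokens))
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_vis, _) = self.vis_lstm(feats)
        fused = torch.cat([h_text[-1], h_vis[-1]], dim=-1)
        return self.classifier(fused)  # logits over FAQ intents

# Toy usage: 2 questions of 10 tokens, 4 frames of 64x64 RGB each.
model = IntentClassifier(vocab_size=5000, num_intents=20)
logits = model(torch.randint(0, 5000, (2, 10)), torch.randn(2, 4, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 20])
```

The point of the sketch is the fusion step: because the visual branch encodes which assembly step is in view, the same question utterance can map to different FAQ intents at different steps, which is exactly the ambiguity the text-only classifier cannot resolve.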