Saturday, February 24, 2018

Fixing Grid Engine errors when running Kaldi, using Voxforge s5 training as an example

Fixes for the errors I ran into when running Kaldi on Grid Engine.

1. Installing Grid Engine


Before configuring anything, add your machine's own hostname to /etc/hosts.


For the full setup, follow this article: Sun Grid Engine installation on Ubuntu Server
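
A quick way to confirm the /etc/hosts step is to scan the file for your hostname. The sketch below is only an illustration: has_host_entry is a made-up helper name, and it assumes the usual hosts format of one address followed by one or more names per line.

```shell
# Hypothetical helper: succeed if a hosts-format file lists the given name.
has_host_entry() {
    name=$1
    file=${2:-/etc/hosts}
    # Skip comment lines, then look for the name in any column after the address.
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

if has_host_entry "$(hostname)"; then
    echo "$(hostname) is listed in /etc/hosts"
else
    echo "$(hostname) is missing from /etc/hosts"
fi
```

If the second message appears, add the entry before going on with the Grid Engine setup.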

2. Output of qsub was: Unable to run job: denied: host "vk-tensorflow" is no submit host.
Exiting.


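Use these commands:
$ sudo qconf -as [your_host_name] # add the host to the submit host list
$ sudo qconf -ah [your_host_name] # add the host to the administrative host list
For details, see this article: Unable to run job: denied: host "host_name" is no submit host.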
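
To check that the two commands took effect, you can compare your hostname against what qconf reports. This is a minimal sketch: host_is_listed is a hypothetical helper, and it assumes qconf -ss and qconf -sh print one host per line in the same form that hostname returns (a short name versus a fully qualified name would not match).

```shell
# Hypothetical helper: succeed if the given host appears as a whole line on stdin.
host_is_listed() {
    grep -Fxq -- "$1"
}

host=$(hostname)
if command -v qconf >/dev/null 2>&1; then
    # qconf -ss shows the submit host list, qconf -sh the administrative host list.
    if qconf -ss | host_is_listed "$host"; then
        echo "$host is a submit host"
    else
        echo "$host is NOT a submit host"
    fi
    if qconf -sh | host_is_listed "$host"; then
        echo "$host is an admin host"
    else
        echo "$host is NOT an admin host"
    fi
fi
```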

3. Output of qsub was: Unable to run job: warning: root your job is not allowed to run in any queue 


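This happens when no manager account was set up while configuring Grid Engine. Set up the user as follows (it is best to use the same name as your system account):

$ sudo qconf -am [your_user_name] # add your account to the manager list
$ sudo qconf -ao [your_user_name] # add your account to the operator list
# create the host group
echo -e "group_name @allhosts\nhostlist NONE" > ./grid
sudo qconf -Ahgrp ./grid
rm ./grid
Then configure all.q and @allhosts:

$ sudo qconf -aq # in the editor, name the queue all.q and set hostlist to: @allhosts [your_host_name]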
$ sudo qconf -shgrp @allhosts | grep hostlist
For details, see this thread: [GE users] SGE jobs stuck in pending state
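
The host group file can also be generated with printf, which avoids depending on how echo -e behaves in different shells. The sketch below is a variation on the steps above, not the exact procedure: write_hostgroup_file is a hypothetical helper, and it puts your hostname straight into hostlist instead of NONE.

```shell
# Hypothetical helper: write a host group definition in the two-line
# format used above (group_name, then hostlist).
write_hostgroup_file() {
    group=$1
    hosts=$2
    out=$3
    printf 'group_name %s\nhostlist %s\n' "$group" "$hosts" > "$out"
}

tmpfile=$(mktemp)
write_hostgroup_file @allhosts "$(hostname)" "$tmpfile"
# Review the file before handing it to: sudo qconf -Ahgrp <file>
cat "$tmpfile"
rm -f "$tmpfile"
```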


4. If VoxForge s5 hangs for a long time at the make_mfcc step



Run qstat -f to list the running jobs, then pass a job_id from that list to:

$ sudo qstat -j [job_id] 

If you see an error such as: queue instance "all.q@[host_name]" dropped because it is temporarily not available. then the Grid Engine sge_qmaster did not start properly. Use this command to find out why:

$ cat /var/spool/gridengine/qmaster/messages | grep [host_name]
If the message looks like this:
02/24/2018 16:28:54|listen|vk-tensorflow|C|denied: request for user "vk" does not match credentials for connection <vk-tensorflow,execd,1>
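
The messages file is easier to read if you split each entry on the | separators seen in the line above. This is a rough sketch: messages_for_host is a hypothetical helper, and the field layout (timestamp, thread, host, level, message) is inferred from that single example, with C assumed to mark the most severe level.

```shell
# Hypothetical helper: read qmaster messages on stdin and print
# "timestamp: message" for entries from one host at one level.
messages_for_host() {
    host=$1
    level=$2
    awk -F'|' -v h="$host" -v l="$level" '
        $3 == h && $4 == l {
            msg = $5
            # Re-join the message in case it contains "|" itself.
            for (i = 6; i <= NF; i++) msg = msg "|" $i
            print $1 ": " msg
        }'
}

log=/var/spool/gridengine/qmaster/messages
if [ -r "$log" ]; then
    messages_for_host "$(hostname)" C < "$log"
fi
```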

If you then run:

$ sge_execd
you get this error:
Unable to run job: unable to send message to qmaster using port 6444 on host "vk-tensorflow" ...

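This means qmaster was not started correctly. A likely cause is that the current command-line user is not the user named in the message. First kill every sge process. List them with: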

$ ps aux | grep sge
Then kill the listed processes:

$ kill -15 [id] [id] ...
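
Note that ps aux | grep sge also lists the grep process itself. As a sketch of one way around that, the hypothetical helper sge_pids below filters "PID COMMAND" pairs and keeps only commands whose name starts with sge_ (this assumes the daemons are named sge_qmaster and sge_execd, as in the commands used in this post).

```shell
# Hypothetical helper: read "PID COMMAND" pairs on stdin and print
# the PIDs of commands whose name starts with sge_.
sge_pids() {
    awk '$2 ~ /^sge_/ { print $1 }'
}

# pid= and comm= print the two columns without a header line.
ps -eo pid=,comm= | sge_pids
```

The printed PIDs are the ones to pass to kill -15.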
Then use login to log back in as the user named in the error message and start the daemons again:

$ login vk
$ sudo sge_qmaster
$ sudo sge_execd 
After a successful start, it should look like this:


For details, see this article: SGE 运行在 OpenStack 上的 HOST_NOT_RESOLVABLE 问题 (the HOST_NOT_RESOLVABLE problem when running SGE on OpenStack)


Reference:
https://peteris.rocks/blog/sun-grid-engine-installation-on-ubuntu-server/
http://vincentwanggs.blogspot.tw/2010/12/unable-to-run-job-denied-host-is-no.html
https://www.vpsee.com/2012/08/host-not-resolvable-problem-when-run-sge-on-openstack/
https://askubuntu.com/questions/945267/cannot-use-qsub-unable-to-send-message-to-qmaster-using-port-6444
https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026275.html
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2016-September/069515.html
https://www.mail-archive.com/[email protected]/msg08274.html
http://www.yinqisen.cn/blog-212.html
http://arc.liv.ac.uk/pipermail/gridengine-users/2007-January/012616.html
https://malariageninformatics.wordpress.com/2011/06/01/gridengine-the-ubuntu-debian-way/

---
http://jrmeyer.github.io/asr/2016/12/15/DNN-AM-Kaldi.html
https://askubuntu.com/questions/710379/do-i-need-to-install-awk-or-is-it-inbuilt-in-ubuntu
https://github.com/gc3-uzh-ch/elasticluster/issues/368
http://vpanayotov.blogspot.tw/2012/07/voxforge-scripts-for-kaldi.html
http://kaldi-asr.org/doc/queue.html
https://sourceforge.net/p/kaldi/discussion/1355348/thread/63447344/
https://unix.stackexchange.com/questions/148715/i-dont-know-how-to-cancel-job
http://www.baidu.com/s?wd=Unable%20to%20run%20job%3A%20warning%3A%20…%20your%20job%20is%20not%20allowed%20to%20run%20in%20any%20queue&rsv_spt=1&rsv_iqid=0xe490222900027d19&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_n=2&rsv_sug3=1
http://bbs.chinaunix.net/forum.php?mod=viewthread&tid=4175777
https://github.com/gawbul/docker-sge/issues/3
https://groups.google.com/forum/#!topic/kaldi-help/FYg9Uriw2lA
http://gridengine.org/pipermail/users/2012-April/003228.html
https://groups.google.com/forum/#!topic/kaldi-help/AW_AXK8PJuM

© ERIC RILEY. Free to repost, no notice required.