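Troubleshooting:
Step 1: Log in to the failed node and inspect the ZooKeeper log.
    ssh snode1
Note: ZooKeeper writes its log (zookeeper.out) to the directory from which the start command was run. For example, if /opt/zookeeper/bin/zkServer.sh start was run from /home/hadoop, the log is in /home/hadoop:
    cd /home/hadoop                // change to the log directory
    tail -n 200 -f zookeeper.out   // show the last 200 log lines and follow new output
Step 2: Restart the ZooKeeper service.
    /opt/zookeeper/bin/zkServer.sh start   // start the ZooKeeper service
After the start command, the log shows: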
2013-09-29 11:19:20,620 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.10:42504
2013-09-29 11:19:20,621 [myid:0] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:20,621 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.10:42504 (no session established for client)
2013-09-29 11:19:20,787 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.14:33207
2013-09-29 11:19:20,788 [myid:0] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:20,788 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.14:33207 (no session established for client)
2013-09-29 11:19:20,860 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.12:50742
2013-09-29 11:19:20,860 [myid:0] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,124 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.12:50742 (no session established for client)
2013-09-29 11:19:21,124 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.15:47952
2013-09-29 11:19:21,124 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.10:42505
2013-09-29 11:19:21,125 [myid:0] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,125 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.15:47952 (no session established for client)
2013-09-29 11:19:21,125 [myid:0] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,125 [myid:0] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.10:42505 (no session established for client)
Starting ZooKeeper on a different node shows:
/lib/log4j-1.2.15.jar:/opt/zookeeper/bin/../lib/jline-0.9.94.jar:/opt/zookeeper/bin/../zookeeper-3.4.3-cdh4.1.2.jar:/opt/zookeeper/bin/../src/java/lib/*.jar:/opt/zookeeper/bin/../conf:
2013-09-29 11:21:03,055 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.library.path=/opt/java/jre/lib/amd64/server:/opt/java/jre/lib/amd64:/opt/java/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2013-09-29 11:21:03,056 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.io.tmpdir=/tmp
2013-09-29 11:21:03,056 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.compiler=
2013-09-29 11:21:03,056 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.name=Linux
2013-09-29 11:21:03,057 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.arch=amd64
2013-09-29 11:21:03,057 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.version=2.6.32-358.el6.x86_64
2013-09-29 11:21:03,057 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.name=demo
2013-09-29 11:21:03,057 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.home=/home/demo
2013-09-29 11:21:03,058 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.dir=/home/demo
2013-09-29 11:21:03,059 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/zookeeper/version-2 snapdir /var/zookeeper/version-2
2013-09-29 11:21:03,059 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 233
2013-09-29 11:21:03,066 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@325] - Getting a snapshot from leader
2013-09-29 11:21:03,073 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@270] - Snapshotting: 0x200000001 to /var/zookeeper/version-2/snapshot.200000001
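If the steps above do not restore the service within 30 minutes, start the cluster active/standby switchover procedure from the emergency plan. For the detailed procedure, see "HDQS-AM-004 Historical Data Query System Emergency Response Manual".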
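The two log excerpts above differ in one visible way: the failing node keeps logging "ZooKeeperServer not running", while the recovered node logs "FOLLOWING - LEADER ELECTION TOOK". As a rough aid when reading a long zookeeper.out, the sketch below reports which of those two messages appeared last. The function name zk_log_state and its three result labels are hypothetical, and the check only looks at these two messages from this section; it is not a complete health check.

```shell
# Sketch: report the most recent state hinted at by a zookeeper.out file.
# Only the two messages quoted in this section are recognised (assumption:
# whichever appears last reflects the current state of the node).
zk_log_state() {
    awk '
        /FOLLOWING - LEADER ELECTION TOOK/ { state = "following" }
        /ZooKeeperServer not running/      { state = "not-running" }
        END { print (state == "" ? "unknown" : state) }
    ' "$1"
}
```

For example, zk_log_state /home/hadoop/zookeeper.out could be run on the node after Step 2; a result of not-running would suggest the node has not rejoined the quorum yet.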
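2.4.4 DataNode service failure
Symptom: the syslog service raises an alarm and uploads the alarm message in the syslog log.
The following record reports that the DataNode service on server TY103-006 was detected as dead.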
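2013-10-20 10:05:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - 发送:
CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY103-006|+|检测Datanode服务状态|+|TY103-006|+|dead|+|APP|+|HDQS|+|Datanode|+|1|+|TY103-006上Datanode服务故障|+|1382234640|+|xiaoxu|+|13810466464
Note: in this record, "发送" means "sending", "检测Datanode服务状态" means "check DataNode service status", and "TY103-006上Datanode服务故障" means "DataNode service failure on TY103-006".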
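The alarm record above is a single line of fields separated by the three-character delimiter |+|. The sketch below splits such a record and prints three of its fields. The function name parse_alarm is hypothetical, and the field positions (6 = host, 9 = status, 12 = service) are inferred from this one sample only; they are assumptions, not a documented format.

```shell
# Sketch: extract host, status and service from a "|+|"-delimited alarm record.
# Field numbers are guessed from the single sample in this section.
parse_alarm() {
    # Each character of the delimiter is bracketed so awk treats it literally.
    printf '%s\n' "$1" |
        awk -F'[|][+][|]' '{ printf "host=%s status=%s service=%s\n", $6, $9, $12 }'
}
```

Called with the record shown above as its single argument, this would print the host name, the reported status and the service name on one line, which may be convenient when filtering many alarms.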
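Confirmation: log in to the Hadoop web UI at http://10.1.242.182:50070 and check the node status. Click "Dead nodes" to list the nodes that are down (marked in red).
Then check the log on the server whose DataNode service failed: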
2013-10-20 09:42:56,741 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_2801755526513394545_957410
2013-10-20 09:44:01,341 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_8004390811280095539_189478
2013-10-20 09:45:05,940 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_-7558433326837136207_950828
2013-10-20 09:45:05,964 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_961832263082316047_963685
2013-10-20 09:45:06,141 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_3084063903778417422_751015
2013-10-20 09:45:06,159 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_-3145692262741932109_458312
2013-10-20 09:45:19,741 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168