2.4 Common System Problems and Remedies
2.4.1 Primary NameNode Service Failure Alarm
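Symptom:
The web UI at <host IP>:50070 does not open, for example http://10.1.20.23:50070.
Confirmation:
Log in to the host as the system user and run jps to list the cluster services. The service marked with the red box is missing from the output.
Troubleshooting: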
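Step 1: Log in to the node that raised the alarm and review the alarm records in the log:
cd /opt/hadoop/logs    //enter the log directory
tail -200f hadoop-hadoop-zkfc-TY101-M01.log    //show the latest 200 log lines and keep following new output
Step 2: Restart the NameNode service: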
/opt/hadoop/sbin/hadoop-daemon.sh start namenode    //start the NameNode service again
Then check the log:
tail -200f /opt/hadoop/logs/hadoop-hadoop-namenode-TY101-M01.log
If no exceptions appear, the service has started normally.
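Step 3: Confirm that the service is healthy by checking the log and running jps.
After a normal start, the log is written to hadoop-demo-namenode-master.log under /opt/hadoop/logs (the file name on the sample host, master). Output like the following indicates a healthy state: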
2013-09-29 10:27:42,892 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/192.168.3.10
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.0.0-cdh4.1.2
STARTUP_MSG: classpath =
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/commons-io-2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/opt/hadoop/share/hadoop/common/lib/jets3t-0.6.1.jar:/opt/hadoop/share/hadoop/common/lib/kfs-0.3.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-annotations-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-el-1.0.jar:/opt/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-core-1.8.jar:/opt/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-auth-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/opt/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:/opt/hadoop/share/hadoop/common/lib/jetty-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-api-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/common/lib/jackson-core-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/common/lib/avro-1.7.1.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-server-1.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-xc-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.1.jar:/opt/hadoop/share/hadoop/common/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.4.0a.jar:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.3-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/common/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/opt/hadoop/share/hadoop/common/lib/s
lf4j-log4j12-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.8.jar:/opt/hadoop/share/had
2013-09-29 10:29:46,795 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/share/current/edits_0000000000000050373-0000000000000050374' to transaction ID 50373
The reported blocks 944 has reached the threshold 0.9990 of total blocks 944. Safe mode will be turned off automatically in 9 seconds.
2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 32 secs.
2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF.
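The two checks in step 3 can be scripted. The following is a minimal sketch, assuming a POSIX shell with awk and grep available; the function names has_jps_process and safe_mode_off are illustrative and are not part of Hadoop. It looks for a process named exactly NameNode in jps output and for the "Safe mode is OFF" line shown in the sample log.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical helpers, not Hadoop tools.

# Succeeds if the jps listing on stdin contains a process whose name is
# exactly $1. jps prints "<pid> <name>" per line, so the name is field 2.
has_jps_process() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Succeeds if the log text on stdin reports that safe mode has been left.
safe_mode_off() {
    grep -q 'Safe mode is OFF'
}

# Typical use on the NameNode host (commented out, not run here):
#   jps | has_jps_process NameNode && echo "NameNode process present"
#   tail -n 200 /opt/hadoop/logs/hadoop-hadoop-namenode-TY101-M01.log \
#       | safe_mode_off && echo "safe mode is off"
```

Matching the process name exactly, rather than with a substring search, avoids counting a process such as SecondaryNameNode as a running NameNode.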
2.4.2 Standby NameNode Service Failure Alarm
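Symptom:
The web UI at <host IP>:50070 does not open, for example http://10.1.20.32:50070.
Confirmation:
Log in to the host as the system user and run jps to list the cluster services. The service marked with the red box is missing from the output.
Troubleshooting: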
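Step 1: Log in to the node that raised the alarm and review the alarm records in the log:
cd /opt/hadoop/logs    //enter the log directory
tail -200f hadoop-hadoop-zkfc-TY102-M02.log    //show the latest 200 log lines and keep following new output
Step 2: Restart the NameNode service: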
/opt/hadoop/sbin/hadoop-daemon.sh start namenode    //start the NameNode service again
Then check the log:
tail -200f /opt/hadoop/logs/hadoop-hadoop-namenode-TY102-M02.log
If no exceptions appear, the service has started normally.
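Step 3: Confirm that the service is healthy by checking the log and running jps.
After a normal start, the log is written to hadoop-demo-namenode-master.log under /opt/hadoop/logs (the file name on the sample host, master). Output like the following indicates a healthy state: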
2013-09-29 10:27:42,892 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/192.168.3.10
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.0.0-cdh4.1.2
STARTUP_MSG: classpath =
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/commons-io-2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/opt/hadoop/share/hadoop/common/lib/jets3t-0.6.1.jar:/opt/hadoop/share/hadoop/common/lib/kfs-0.3.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-annotations-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-el-1.0.jar:/opt/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-core-1.8.jar:/opt/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-auth-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/opt/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:/opt/hadoop/share/hadoop/common/lib/jetty-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-api-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/common/lib/jackson-core-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/common/lib/avro-1.7.1.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-server-1.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-xc-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.1.jar:/opt/hadoop/share/hadoop/common/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.4.0a.jar:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.3-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/common/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/opt/hadoop/share/hadoop/common/lib/s
lf4j-log4j12-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.8.jar:/opt/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/common/lib/commons-math-2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-collections-3.2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/common/lib/junit-4.8.2.jar:/opt/hadoop/share/hadoop/common/lib/jline-0.9.94.jar:/opt/hadoop/share/hadoop/common/lib/jsr305-1.3.9.jar:/opt/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop/share/hadoop/common/lib/stax-api-1.0.1.jar:/opt/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.0.0-cdh4.1.2-sources.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.0.0-cdh4.1.2-tests.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.0.0-cdh4.1.2-test-sources.jar:/opt/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar:/opt/hbase/hbase-0.92.1-cdh4.1.2-security.jar:/opt/hbase/lib/zookeeper-3.4.3-cdh4.1.2.jar:/opt/hbase/conf:/opt/hive/lib/hive-hbase-handler-0.9.0-cdh4.1.2.jar:/opt/
2013-09-29 10:29:46,795 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/share/current/edits_0000000000000050373-0000000000000050374' to transaction ID 50373
The reported blocks 944 has reached the threshold 0.9990 of total blocks 944. Safe mode will be turned off automatically in 9 seconds.
2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 32 secs.
2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF.
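The symptom described in 2.4.1 and 2.4.2, the web UI on port 50070 not opening, can also be checked from a shell instead of a browser. The following is a rough sketch, assuming curl is installed on the machine running the check; webui_status is a hypothetical helper name, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: webui_status is a hypothetical helper, not a Hadoop tool.

# Prints "up" if the NameNode web UI on host $1 answers over HTTP,
# otherwise prints "down".
webui_status() {
    # -s: silent, -f: treat HTTP errors (status >= 400) as failure,
    # --max-time: give up after 5 seconds, -o /dev/null: discard the page
    if curl -sf --max-time 5 -o /dev/null "http://$1:50070/"; then
        echo up
    else
        echo down
    fi
}

# Typical use with the two example addresses from this section
# (commented out, not run here):
#   for host in 10.1.20.23 10.1.20.32; do
#       echo "$host: $(webui_status "$host")"
#   done
```

A "down" result only says that the page did not answer; the jps and log checks described in the steps above are still needed to confirm that the NameNode process itself is missing.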
2.4.3 ZooKeeper Service Failure Alarm
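Symptom: a syslog alarm is raised.
Syslog alarm message:
For example, a ZooKeeper service failure on snode1 produces the following message:
[2013-10-20 10:00:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - 发送:
CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY101-001|+|检测Zookeeper务状态|+|TY101-001:2181|+|dead|+|APP|+|HDQS|+|Zookeeper|+|1|+|TY101-001上Zookeeper服务故障|+|1382234280|+|xiaoxu|+|13810466464
(The Chinese text is literal message output: "发送" means "send", "检测Zookeeper务状态" means "check ZooKeeper service status", and "TY101-001上Zookeeper服务故障" means "ZooKeeper service failure on TY101-001".)
Confirmation: log in to the failed node:
ssh snode1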
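Run jps to check the service status. If the service marked in red is not listed, the service has failed.
Use the BDP management interface to look up the dead ZooKeeper service on snode1:
Platform Home > Management Console > Cluster Monitoring > Cluster Service Monitoring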
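The alarm message above is a single line of fields separated by "|+|". The following is a minimal sketch of how the affected endpoint could be pulled out of such a line and then probed directly, assuming a POSIX shell with awk and nc (netcat) available. The helper names alarm_field and zk_is_ok are hypothetical, and the field positions (field 8 as host:port, field 9 as status) are inferred from the single sample message, not from a format specification.

```shell
#!/bin/sh
# Sketch only: helper names and field positions are assumptions.

# Prints the N-th "|+|"-separated field (1-based) of the alarm line on stdin.
alarm_field() {
    awk -F'[|][+][|]' -v n="$1" '{ print $n }'
}

# Sends ZooKeeper's "ruok" four-letter command to the endpoint $1, given as
# host:port. Succeeds only if the server answers "imok".
zk_is_ok() {
    host=${1%:*}     # text before the last colon
    port=${1##*:}    # text after the last colon
    reply=$(printf 'ruok' | nc -w 3 "$host" "$port" 2>/dev/null)
    [ "$reply" = "imok" ]
}

# Typical use (commented out, not run here); alarm.txt is a hypothetical
# file holding one alarm line:
#   endpoint=$(alarm_field 8 < alarm.txt)    # e.g. TY101-001:2181
#   zk_is_ok "$endpoint" || echo "ZooKeeper at $endpoint is not answering"
```

The "ruok" command is answered by the ZooKeeper 3.4 line that appears in the classpath earlier in this section; on newer ZooKeeper releases it may first need to be enabled in the server configuration.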