A node in the cluster (node 91) ran into a failure, and the following showed up in the logs:
[plain]
2013-11-08 08:32:13,908 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201311061017_18902_r_000000_0 copy failed: attempt_201311061017_18902_m_000003_0 from node-192
2013-11-08 08:32:13,921 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at sun.net.NetworkClient.doConnect(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1631)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1588)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1488)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1399)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1331)
Looking at the Hadoop code, this is how the TaskTracker decides which hostname to use for itself:
[java]
localFs = FileSystem.getLocal(fConf);
// An explicitly configured slave.host.name takes precedence.
if (fConf.get("slave.host.name") != null) {
    this.localHostname = fConf.get("slave.host.name");
}
// Otherwise fall back to a DNS lookup on the configured interface and nameserver.
if (localHostname == null) {
    this.localHostname =
        DNS.getDefaultHost(
            fConf.get("mapred.tasktracker.dns.interface", "default"),
            fConf.get("mapred.tasktracker.dns.nameserver", "default"));
}
So I pinged that hostname from the node:
[plain]
ping node-191
PING node-128-191.localhost (220.250.64.228) 56(84) bytes of data.
64 bytes from 220.250.64.228: icmp_seq=1 ttl=247 time=14.8 ms
64 bytes from 220.250.64.228: icmp_seq=2 ttl=247 time=14.3 ms
64 bytes from 220.250.64.228: icmp_seq=3 ttl=247 time=14.4 ms
The address that comes back is not 191's IP at all.
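The same check can be done from Java, which resolves names the way the JVM running the reduce task would. This is only a minimal sketch: the class name `ResolveCheck`, the method `resolvesTo`, and the address `10.0.0.191` are hypothetical and are not taken from Hadoop or from this cluster.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class ResolveCheck {
    /**
     * Returns true if any address the JVM resolves for hostname
     * equals expectedIp (compared as a textual IP address).
     */
    static boolean resolvesTo(String hostname, String expectedIp) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(hostname)) {
                if (addr.getHostAddress().equals(expectedIp)) {
                    return true;
                }
            }
        } catch (UnknownHostException e) {
            // The name cannot be resolved at all, so it cannot match.
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass the real hostname and expected IP as arguments.
        String host = args.length > 0 ? args[0] : "node-191";
        String expected = args.length > 1 ? args[1] : "10.0.0.191";
        System.out.println(host + " resolves to " + expected + ": "
                + resolvesTo(host, expected));
    }
}
```

Running it on each node with the hostname and the IP you expect gives a quick yes/no answer without having to read ping output.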
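Checking the hosts file on that node, there is no entry for 191's hostname either.
That explains the problem.
The fix: add 191's hostname to the hosts file on every node in the cluster, then restart the TaskTracker.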
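To confirm that the entry really made it into the hosts file on a given node, something like the following could be used. It is a rough sketch under a few assumptions: the class `HostsEntryCheck` and its `lookup` method are hypothetical helpers, the default path `/etc/hosts` and hostname `node-191` are placeholders, and the parser only handles the plain "IP name [alias...]" layout with `#` comments.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

final class HostsEntryCheck {
    /**
     * Scans hosts-file text and returns the IP of the first line that
     * lists hostname as a name or alias, or empty if there is none.
     */
    static Optional<String> lookup(String hostsText, String hostname) {
        for (String rawLine : hostsText.split("\\r?\\n")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            // fields[0] is the IP; the remaining fields are names and aliases.
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].equalsIgnoreCase(hostname)) {
                    return Optional.of(fields[0]);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults; override with real values on the command line.
        String path = args.length > 0 ? args[0] : "/etc/hosts";
        String host = args.length > 1 ? args[1] : "node-191";
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        Optional<String> ip = lookup(text, host);
        System.out.println(ip.isPresent()
                ? host + " -> " + ip.get()
                : host + " has no entry in " + path);
    }
}
```

Running this on every node after the change makes it easy to spot a machine that was missed before restarting the TaskTracker.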