I'm using pyhdfs to work with Hadoop HDFS. Listing directory contents with listdir works fine, but calling open fails with: Failed to establish a new connection: [Errno 11004] getaddrinfo failed.

My pyhdfs client is not deployed on a cluster node: it runs inside a Django application, and the node machines have no display attached for serving pages, so I access the cluster remotely through HdfsClient. This looks like a communication problem.
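Since listdir succeeds but open fails, the namenode itself is clearly reachable; the failure happens when the client is redirected to a datanode by hostname. A quick way to confirm from the Django host that the problem is name resolution (a hedged sketch; can_resolve is a helper name I made up, not a pyhdfs API):

```python
import socket

def can_resolve(hostname):
    """Return True if this machine can resolve hostname to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:  # getaddrinfo failure ([Errno 11004] on Windows)
        return False

# The loopback name should always resolve; a bare datanode hostname such as
# 'hadoop-slave10' (taken from the traceback below) will likely fail from the
# Django host until it is added to DNS or the local hosts file.
print(can_resolve('localhost'))
print(can_resolve('hadoop-slave10'))
```

If the second call prints False, the client machine simply cannot turn the datanode's hostname into an IP, which matches the traceback exactly.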
Here is my test code. It periodically scans a folder in HDFS, detects newly added files, and opens each new file for processing:
import pyhdfs
import time

hdfs = pyhdfs.HdfsClient(hosts='192.168.x.x:50070')
dir = '/django-test'
do_cycle = True
before_list = [file for file in hdfs.listdir(dir) if file.endswith('.txt')]
count = 0
while do_cycle:
    time.sleep(2)
    after_list = [file for file in hdfs.listdir(dir) if file.endswith('.txt')]
    add_list = list(set(after_list) - set(before_list))
    print('##################' + str(count) + '##################')
    print(add_list)
    if len(add_list) != 0:
        for file in add_list:
            data = hdfs.open(dir + '/' + file)
            # processing of the opened file omitted
            data.close()
    count += 1
    before_list = after_list
The error is raised at this line:
data = hdfs.open(dir + '/' + file)
The error message:
ConnectionError: HTTPConnectionPool(host='hadoop-slave10', port=50075): Max retries exceeded with url: /webhdfs/v1/django-test/D1044U00001729F7SD558.txt?op=OPEN&user.name=florian.fu&namenoderpcaddress=ns1&offset=0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000000000A890D68>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
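The traceback explains the symptom: a WebHDFS OPEN is a two-step operation. The namenode (192.168.x.x:50070) answers with an HTTP 307 redirect to the datanode that holds the data, identified by its hostname (here hadoop-slave10:50075), and the client then connects there directly. listdir is served by the namenode alone, which is why it works. The Django host cannot resolve hadoop-slave10, so getaddrinfo fails. One workaround, assuming you can edit the client machine's hosts file, is to map each datanode hostname to its IP (the address below is a placeholder, not taken from the post):

```
# /etc/hosts (Linux) or C:\Windows\System32\drivers\etc\hosts (Windows)
# Use the real datanode IPs from your cluster; add one line per datanode.
192.168.x.x    hadoop-slave10
```

Alternatively, make the datanode hostnames resolvable through whatever DNS the Django host uses.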
Is there a way to set a retry on hdfs = pyhdfs.HdfsClient(hosts='192.168.x.x:50070')?
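On the retry question: HdfsClient does accept retry-related keyword arguments (max_tries and retry_delay in the pyhdfs versions I have seen; verify against your installed copy), but retries will not fix this particular error, because every attempt fails the same DNS lookup. A sketch with the construction commented out so it can be read without a cluster:

```python
# Retry settings collected in a dict so they are easy to audit and reuse.
# max_tries / retry_delay are pyhdfs.HdfsClient keyword arguments in the
# versions I have seen - confirm against your installed pyhdfs.
client_kwargs = {
    'hosts': '192.168.x.x:50070',
    'max_tries': 3,    # attempts per HTTP request
    'retry_delay': 5,  # seconds between attempts
}
# hdfs = pyhdfs.HdfsClient(**client_kwargs)  # uncomment where pyhdfs is installed
print(client_kwargs['max_tries'])
```

Retries help with transient network drops; a hostname that never resolves will fail on every attempt, so fix name resolution first.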