When processing training data for a large model, you often need to traverse large folders that may contain tens or hundreds of millions of files. At this scale, general-purpose Python traversal functions such as os.walk, glob, and Path.rglob are very slow, and they give no indication of how long the whole traversal will take.
This article uses Python's os.scandir with a breadth-first search to traverse files in a controllable and efficient way. The implementation prints a traversal log, supports suffix filtering, skips hidden files, and can handle folders containing a very large number of files.
os.scandir is a directory iteration function that returns an iterator of os.DirEntry objects corresponding to the entries in the directory given by path. Entries are yielded in arbitrary order, and the special entries '.' and '..' are excluded. Used directly, os.scandir can be more efficient than os.walk, and PEP 471, which introduced os.scandir, presents it as a better and faster directory iterator.
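As a minimal sketch of the API just described, the following scans a single directory with os.scandir and separates files from subdirectories using the os.DirEntry methods. The helper name list_dir_once and the throwaway directory layout are made up for illustration.

```python
import os
import tempfile


def list_dir_once(dir_path):
    """Return (file_names, subdir_names) for a single directory, sorted by name."""
    files, subdirs = [], []
    # scandir works as a context manager; leaving the block closes the iterator.
    with os.scandir(dir_path) as entries:
        for entry in entries:
            # DirEntry.is_dir() can usually answer from information cached
            # during the directory scan, without an extra system call.
            if entry.is_dir():
                subdirs.append(entry.name)
            else:
                files.append(entry.name)
    # Entries arrive in arbitrary order, so sort for a stable result.
    return sorted(files), sorted(subdirs)


if __name__ == "__main__":
    # Build a tiny temporary tree to scan (illustrative data only).
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for name in ("a.txt", "b.pdb"):
            open(os.path.join(root, name), "w").close()
        print(list_dir_once(root))  # (['a.txt', 'b.pdb'], ['sub'])
```

Note that '.' and '..' never show up in the result, so no special-casing is needed for them.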
Source code
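import os
from collections import deque

from tqdm import tqdm


def traverse_dir_files_for_large(root_dir, ext=""):
    """
    List the files under a folder using breadth-first traversal.
    :param root_dir: root directory
    :param ext: file suffix to keep (an empty string keeps every file)
    :return: list of file paths
    """
    paths_list = []
    dir_queue = deque([root_dir])  # FIFO queue gives breadth-first order
    while dir_queue:
        dir_path = dir_queue.popleft()
        dir_name = os.path.basename(dir_path)
        # One progress bar per directory, labelled with the directory name
        for entry in tqdm(os.scandir(dir_path), desc=f"[Info] dir {dir_name}"):
            if entry.name.startswith('.'):  # skip hidden files and folders
                continue
            if entry.is_dir():  # usually answered from cached DirEntry data
                dir_queue.append(entry.path)
            elif not ext or entry.name.endswith(ext):  # filter by suffix
                paths_list.append(entry.path)
    return paths_list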
Output log:
[Info] Initialization path starts!
[Info] Data set path: /alphafoldDB/pdb_from_uniprot
[Info] dir pdb_from_uniprot: 256it [00:10, 24.47it/s]
[Info] dir 00: 240753it [00:30, 7808.36it/s]
[Info] dir 01: 241432it [00:24, 9975.56it/s]
[Info] dir 02: 240466it [00:24, 9809.68it/s]
[Info] dir 03: 241236it [00:22, 10936.76it/s]
[Info] dir 04: 241278it [00:24, 10011.14it/s]
[Info] dir 05: 241348it [00:25, 9414.16it/s]
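The function above collects every path in one list before returning. As a minimal sketch, assuming memory use matters at this scale (an assumption, not something measured in this article), the same breadth-first scan can be written as a generator that yields one path at a time. The name iter_files_bfs is hypothetical, and the progress bar is left out to keep the example dependency-free.

```python
import os
import tempfile
from collections import deque


def iter_files_bfs(root_dir, ext=""):
    """Yield file paths under root_dir breadth-first, one at a time."""
    dir_queue = deque([root_dir])  # FIFO queue gives breadth-first order
    while dir_queue:
        dir_path = dir_queue.popleft()
        with os.scandir(dir_path) as entries:
            for entry in entries:
                if entry.name.startswith("."):  # skip hidden files and folders
                    continue
                if entry.is_dir():
                    dir_queue.append(entry.path)  # visit after this level
                elif not ext or entry.name.endswith(ext):  # filter by suffix
                    yield entry.path


if __name__ == "__main__":
    # Small throwaway tree (illustrative data only).
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for rel in ("a.pdb", ".hidden", "sub/b.pdb", "sub/c.txt"):
            open(os.path.join(root, *rel.split("/")), "w").close()
        # Count matching files without keeping their paths in memory.
        print(sum(1 for _ in iter_files_bfs(root, ext=".pdb")))  # 2
```

A caller that really needs the full list can still build it with list(iter_files_bfs(root_dir)).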
Supplement
Besides the method above, here are some other ways to traverse folders in Python, for reference.
Method 1: Traverse through os.walk() and process the files directly
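import os


def traverse_dir_files(root_dir, ext=None, is_sorted=True):
    """
    List the files under a folder by walking the tree with os.walk.
    :param root_dir: root directory
    :param ext: suffix, or list/tuple of suffixes, to keep (None keeps every file)
    :param is_sorted: whether to sort the results (slow for large folders)
    :return: [list of file paths, list of file names]
    """
    if isinstance(ext, str):
        ext = (ext,)  # otherwise tuple(ext) would split the suffix into characters
    names_list = []
    paths_list = []
    for parent, _, file_names in os.walk(root_dir):
        for name in file_names:
            if name.startswith('.'):  # skip hidden files
                continue
            if ext and not name.endswith(tuple(ext)):  # filter by suffix
                continue
            names_list.append(name)
            paths_list.append(os.path.join(parent, name))
    if not names_list:  # no matching files
        return paths_list, names_list
    if is_sorted:
        # Sort both lists together by path. This is assumed to match the
        # sort_two_list helper, whose definition the article does not show.
        pairs = sorted(zip(paths_list, names_list))
        paths_list, names_list = [list(t) for t in zip(*pairs)]
    return paths_list, names_list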
Method 2: Traverse with pathlib.Path().rglob(). You need to filter the results yourself, because rglob also returns directories; this approach is reportedly faster. Note that glob() does not recurse into subfolders unless the pattern contains "**"
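To illustrate how glob and rglob differ when it comes to recursion, here is a small sketch that runs both on a throwaway directory tree. The tree layout and the helper names (names, compare_globs, make_tree) are invented for the example.

```python
import pathlib
import tempfile


def names(paths, root):
    """Return sorted POSIX-style paths relative to root."""
    return sorted(p.relative_to(root).as_posix() for p in paths)


def compare_globs(root):
    """Run three patterns on root and collect what each one matches."""
    root = pathlib.Path(root)
    return {
        "glob_star": names(root.glob("*"), root),            # top level only
        "glob_recursive": names(root.glob("**/*"), root),    # recursive via "**"
        "rglob_star": names(root.rglob("*"), root),          # recursive
    }


def make_tree(root):
    """Create root/top.txt and root/sub/deep.txt (illustrative data only)."""
    root = pathlib.Path(root)
    (root / "sub").mkdir()
    (root / "top.txt").touch()
    (root / "sub" / "deep.txt").touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_tree(tmp)
        for pattern, matched in compare_globs(tmp).items():
            print(pattern, matched)
```

Here glob("*") returns only sub and top.txt, while glob("**/*") and rglob("*") both also return sub/deep.txt. The directory sub appears in every result, which is why the results still need filtering.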
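import pathlib


def traverse_dir_files(root_dir, ext=None, is_sorted=True):
    """
    List the files under a folder recursively with pathlib.Path.rglob.
    :param root_dir: root directory
    :param ext: suffix (str or tuple of str) to keep; None keeps every file
    :param is_sorted: whether to sort the results (slow for large folders)
    :return: [list of file paths, list of file names]
    """
    names_list = []
    paths_list = []
    for path in pathlib.Path(root_dir).rglob("*"):
        name = path.name  # portable replacement for splitting on "/"
        # Skip hidden entries. Names without a dot are treated as folders and
        # skipped too, which also drops files that have no extension.
        if name.startswith('.') or "." not in name:
            continue
        if ext and not name.endswith(ext):  # filter by suffix
            continue
        names_list.append(name)
        paths_list.append(str(path))
    if not names_list:  # no matching files
        return paths_list, names_list
    if is_sorted:
        # Sort both lists together by path. This is assumed to match the
        # sort_two_list helper, whose definition the article does not show.
        pairs = sorted(zip(paths_list, names_list))
        paths_list, names_list = [list(t) for t in zip(*pairs)]
    return paths_list, names_list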