tag. For example:
<div role="rowheader">
<a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
3. Implementing the Scraper
3.1. Recursive Crawl Function
The script below recursively crawls folders and prints their structure. To limit recursion depth and avoid unnecessary load, it uses a depth parameter.
import requests
from bs4 import BeautifulSoup
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    crawl_github_folder(repo_url)
4. Explanation of Key Features

- Request headers: A User-Agent string mimics a browser to avoid being blocked.
- Recursive crawling:
  - Detects folders (/tree/) and recurses into them.
  - Lists files (/blob/) without entering them.
- Indentation: Reflects the folder hierarchy in the output.
- Depth limiting: Prevents excessive recursion via a maximum depth (max_depth).
5. Enhancements
These enhancements improve the crawler's functionality and reliability. They address common challenges such as exporting results, handling errors, and avoiding rate limits, making the tool efficient and user-friendly.
5.1. Exporting Results
Save the output to a structured JSON file for easier consumption.
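As a minimal sketch of the export step, the nested dict the crawler builds (folders as nested dicts, files as the string "file") can be written out with the standard json module. The sample `structure` value here is a hypothetical stand-in for real crawl output:

```python
import json

# Hypothetical sample of what the crawler returns: folders map to
# nested dicts, files map to the string "file".
structure = {"src": {"main.py": "file", "utils.py": "file"}, "README.md": "file"}

# Write the structure to a JSON file with readable indentation
with open("output.json", "w") as f:
    json.dump(structure, f, indent=2)

# Reload to confirm the round-trip preserved the hierarchy
with open("output.json") as f:
    loaded = json.load(f)
```

Section 7 below shows the full integration, where the crawler itself returns this dict.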
5.2. Error Handling
Add robust error handling for network failures and unexpected HTML changes:
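One way to sketch this is to wrap the network call in a try/except; the `fetch_page` helper name and the timeout value are illustrative, not part of the original script:

```python
import requests

def fetch_page(url, timeout=10):
    """Fetch a page's HTML, returning None on any request failure."""
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # Treat 4xx/5xx responses as errors too
    except requests.RequestException as e:
        # Covers connection errors, timeouts, invalid URLs, and bad status codes
        print(f"Failed to access {url}: {e}")
        return None
    return response.text
```

The crawler would then check the return value for `None` before parsing, so one bad page is skipped gracefully instead of crashing the whole run.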
5.3. Rate Limiting
To avoid being rate limited by GitHub, introduce delays between requests:
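A standard-library-only sketch of the idea is a small throttling wrapper; the one-second default interval is a hypothetical choice, not a value GitHub documents:

```python
import time

def throttled(func, min_interval=1.0):
    """Wrap func so successive calls are at least min_interval seconds apart."""
    last_call = [0.0]  # Mutable cell so the closure can update it

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)  # Pause until the interval has elapsed
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Usage sketch: wrap requests.get once, then call the wrapped version
# everywhere the crawler currently calls requests.get directly.
# fetch = throttled(requests.get)
```

Alternatively, a plain `time.sleep(1)` before each request inside the crawl loop achieves the same effect with less machinery.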
6. Ethical Considerations
Written by Shpetim Haxhiu, an expert in software automation and ethical programming, this section ensures best practices are followed when using the GitHub crawler.

- Compliance: Adhere to GitHub's Terms of Service.
- Minimize load: Respect GitHub's servers by limiting requests and adding delays.
- Permissions: Obtain permission before extensively crawling private repositories.
7. Complete Code
Here is the comprehensive script combining all features:
import json
import requests
from bs4 import BeautifulSoup
import time

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls a GitHub folder and returns its structure as a nested dict."""
    result = {}
    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as e:
        print(f"Failed to access {url}: {e}")
        return result
    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            time.sleep(1)  # Rate limiting: pause between requests
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            result[item_name] = "file"

    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    structure = crawl_to_json(repo_url)
    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)
    print("Repository structure saved to output.json")
By following this detailed guide, you can build a robust GitHub folder crawler. The tool can be adapted to a variety of needs while remaining ethically compliant.
Feel free to leave a comment! And don't forget to connect with me:

- Email: shpetim.h@gmail.com
- LinkedIn: linkedin.com/in/shpetimhaxhiu
- GitHub: github.com/shpetimhaxhiu