Detailed Tutorial: Crawling GitHub Repository Folders Without API

Barbara Streisand

Published: 2024-12-16 06:28:14



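This detailed tutorial, written by Shpetim Haxhiu, walks through crawling GitHub repository folders programmatically without relying on the GitHub API. It covers everything from understanding the page structure to a robust recursive implementation with enhancements.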

1. Setup and Installation

Before you begin, make sure you have the following:

Python: version 3.7 or later installed.
Libraries: install requests and BeautifulSoup.

   pip install requests beautifulsoup4


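Editor: any IDE with Python support, such as VS Code or PyCharm.

2. Analyzing the GitHub HTML Structure

To scrape GitHub folders, you need to understand the HTML structure of a repository page. On a GitHub repository page:

Folders are linked with paths like /tree/<branch>/<folder>.
Files are linked with paths of the same shape that use /blob/ in place of /tree/.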

Each item (folder or file) sits inside a <div> with the attribute role="rowheader" and contains an <a> tag. For example:

<div role="rowheader">
  <a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
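
To make the structure above concrete, here is a minimal sketch that pulls the links out of such markup and sorts them into folders and files. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra packages. The names RowHeaderLinkParser and classify, and the README.md row in the sample, are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class RowHeaderLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside <div role="rowheader">."""

    def __init__(self):
        super().__init__()
        self.links = []       # collected (href, text) pairs
        self._div_depth = 0   # nesting depth inside a rowheader div (0 = outside)
        self._href = None     # href of the <a> currently being read
        self._text = []       # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter a rowheader div, or track a div nested inside one
            if self._div_depth or attrs.get("role") == "rowheader":
                self._div_depth += 1
        elif tag == "a" and self._div_depth and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1


def classify(href):
    """Return 'folder', 'file', or None based on the /tree/ and /blob/ markers."""
    if "/tree/" in href:
        return "folder"
    if "/blob/" in href:
        return "file"
    return None


# Sample markup: the first row is from the tutorial, the second is hypothetical
sample_html = """
<div role="rowheader">
  <a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
<div role="rowheader">
  <a href="/owner/repo/blob/main/README.md">README.md</a>
</div>
"""

parser = RowHeaderLinkParser()
parser.feed(sample_html)
for href, name in parser.links:
    print(classify(href), name)
```

Run on the sample, this prints "folder folder-name" and then "file README.md". With BeautifulSoup, the selector div[role="rowheader"] a used later in this tutorial does the same extraction in one call.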


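3. Implementing the Scraper

3.1. Recursive Crawling Function

The script recursively scrapes folders and prints their structure. A depth parameter limits how far the recursion goes, which avoids unnecessary load.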

import requests
from bs4 import BeautifulSoup
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection

    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    crawl_github_folder(repo_url)


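4. Features Explained

Request headers: a User-Agent string mimics a browser to avoid being blocked.
Recursive crawling:
- Detects folders (/tree/) and recursively enters them.
- Lists files (/blob/) without descending any further.
Indentation: reflects the folder hierarchy in the output.
Depth limit: prevents excessive recursion by setting a maximum depth (max_depth).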
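
The recursion, indentation, and depth limit can be tried without any network access. The sketch below walks a nested dict that stands in for fetched pages and applies the same depth > max_depth check and two-space indentation as the crawler. The function render_tree and the sample data are hypothetical.

```python
def render_tree(tree, depth=0, max_depth=3):
    """Return indented lines for a nested dict, stopping once depth exceeds max_depth.

    `tree` maps names to either a nested dict (a folder) or the string "file".
    """
    if depth > max_depth:
        return []

    lines = []
    indent = "  " * depth  # two spaces per level, as in the crawler's output
    for name, value in tree.items():
        if isinstance(value, dict):
            lines.append(f"{indent}Folder: {name}")
            # Descend one level, like the recursive call for /tree/ links
            lines.extend(render_tree(value, depth + 1, max_depth))
        else:
            lines.append(f"{indent}File: {name}")
    return lines


# Hypothetical repository layout used only for this demonstration
sample = {
    "src": {"utils": {"helpers.py": "file"}, "main.py": "file"},
    "README.md": "file",
}

print("\n".join(render_tree(sample, max_depth=1)))
```

With max_depth=1 the folder utils is still listed, but its contents are not, because the call that would list them starts at depth 2 and returns immediately.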

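5. Enhancements

These enhancements are designed to improve the crawler's functionality and reliability. They address common challenges such as exporting results, handling errors, and avoiding rate limits, so that the tool stays efficient and user-friendly.

5.1. Exporting Results

For easier use, save the output to a structured JSON file. The script in section 7 does this with the crawl_to_json function.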
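
As a small illustration of what the exported file looks like, the sketch below serializes a hand-written structure in the format produced by the script in section 7, where folders are nested dicts and files are the string "file". The sample data is hypothetical.

```python
import json

# Hypothetical crawl result: folders are nested dicts, files are the string "file"
structure = {"src": {"main.py": "file"}, "README.md": "file"}

# indent=2 produces human-readable, nested output
text = json.dumps(structure, indent=2)
print(text)

# Loading the JSON text gives back the same nested structure
restored = json.loads(text)
print(restored == structure)
```

The printed JSON nests main.py under src and lists README.md at the top level, and the final line prints True.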


5.2. Error Handling

Add robust error handling for network errors and unexpected HTML changes.
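
One possible shape for this is sketched below. It is not the author's code: it uses urllib from the standard library instead of requests so that it runs without extra packages, and the names fetch_html and has_expected_markup, the opener parameter, and the 10-second timeout are assumptions made for illustration. With requests, the analogous approach would wrap requests.get in a try block and catch requests.exceptions.RequestException.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page body as text, or None if the request fails.

    `opener` is a hypothetical hook so the function can be tested offline.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 429
        print(f"HTTP error for {url}: {exc.code}")
    except OSError as exc:
        # Covers URLError (DNS failure, refused connection) and socket timeouts
        print(f"Network error for {url}: {exc}")
    return None


def has_expected_markup(html):
    """Return True if the page still contains the rowheader markup the scraper relies on."""
    return 'role="rowheader"' in html
```

A caller would skip the folder when fetch_html returns None, and log a warning when has_expected_markup returns False, since that suggests the page layout has changed and the selector needs updating.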


5.3. Rate Limiting

To avoid being rate limited by GitHub, introduce delays between requests.
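
A minimal sketch of such a delay follows. The Throttle class and the 2-second default interval are assumptions for illustration, not values taken from GitHub's documentation. The idea is to call wait() right before each request so that requests are spaced at least min_interval seconds apart.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval  # assumed default, tune as needed
        self._last = None                 # monotonic time of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage: the first call returns immediately, later calls pause as needed
throttle = Throttle(min_interval=0.05)
for page in ["folder-a", "folder-b", "folder-c"]:
    throttle.wait()
    print(f"would request {page} now")
```

A plain time.sleep call before each request achieves the same effect. The class only avoids sleeping longer than necessary when parsing has already taken some time.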


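6. Ethical Considerations

Written by Shpetim Haxhiu, an expert in software automation and ethical programming, this section covers the best practices to follow when using the GitHub crawler.

Compliance: adhere to GitHub's Terms of Service.
Minimize load: respect GitHub's servers by limiting requests and adding delays.
Permission: obtain permission before extensively crawling private repositories.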

7. Complete Code

Here is the consolidated script, which crawls a folder recursively and saves the resulting structure to output.json:

import json

import requests
from bs4 import BeautifulSoup

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls a GitHub folder and returns its structure as a nested dict."""
    result = {}

    # Stop descending once the depth limit is exceeded
    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            # Folders map to a nested dict of their contents
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            # Files are recorded with a simple marker value
            result[item_name] = "file"

    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    structure = crawl_to_json(repo_url)

    # Write the nested structure as indented JSON
    with open("output.json", "w", encoding="utf-8") as file:
        json.dump(structure, file, indent=2)

    print("Repository structure saved to output.json")


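By following this detailed guide, you can build a robust GitHub folder crawler. The tool can be adapted to a variety of needs while staying within ethical guidelines.

Feel free to leave questions in the comments section, and don't forget to connect with me.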