大規模な言語モデルを実稼働アプリケーションに統合する-Python チュートリアル-php.cn

この実践的なガイドでは、アプリケーション用に LLM が組み込まれた拡張性の高いモデル展開ソリューションを作成する方法を学びます。
例では、Hugging Face の ChatGPT2 モデルを使用しますが、ChatGPT4、Claude などの他のモデルを簡単にプラグインできます。
AI 機能を備えた新しいアプリケーションを設計している場合でも、既存の AI システムを改善している場合でも、このガイドは、強力な LLM 統合を作成するためのステップバイステップに役立ちます。

LLM 統合の基礎を理解する

コードを書き始める前に、本番環境の LLM 統合を構築するために何が必要かを理解しましょう。本番環境に対応した LLM 統合を構築するときに考慮する必要があるのは API 呼び出しだけではなく、信頼性、コスト、安定性なども考慮する必要があります。運用アプリケーションは、コストを管理しながら、サービスの停止、レート制限、応答時間の変動などの問題に対処する必要があります。
私たちが一緒に構築するものは次のとおりです:

障害を適切に処理する堅牢な API クライアント
コストと速度を最適化するスマートキャッシングシステム
適切かつ迅速な管理体制
包括的なエラー処理と監視
サンプルプロジェクトとしての完全なコンテンツモデレーションシステム

前提条件

コーディングを始める前に、次のものが揃っていることを確認してください。

マシンに Python 3.8 以降がインストールされている
Redis クラウドアカウントまたはローカルにインストールされている
Python プログラミングの基本的な知識
REST API の基本的な理解
Hugging Face API キー (またはその他の LLM プロバイダーキー)

フォローしてみませんか?完全なコードは GitHub リポジトリで入手できます。

開発環境のセットアップ

開発環境を準備することから始めましょう。クリーンなプロジェクト構造を作成し、必要なパッケージをすべてインストールします。

まず、プロジェクトディレクトリを作成し、Python 仮想環境をセットアップしましょう。ターミナルを開いて次を実行します:

mkdir llm_integration && cd llm_integration
python3 -m venv env
syource env/bin/activate

ログイン後にコピー

次に、プロジェクトの依存関係を設定しましょう。次の必須パッケージを含む新しいrequirements.txtファイルを作成します:

transformers==4.36.0
huggingface-hub==0.19.4
redis==4.6.0
pydantic==2.5.0
pydantic-settings==2.1.0
tenacity==8.2.3
python-dotenv==1.0.0
fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
numpy==1.24.3

ログイン後にコピー

これらの各パッケージが必要な理由を詳しく見てみましょう:

トランスフォーマー: これは、Qwen2.5-Coder モデルとのインターフェースに使用する Hugging Face の強力なライブラリです。
ハギングフェイスハブ: モデルの読み込みとバージョン管理を処理できるようにします。 redis: リクエストキャッシュの実装用
pydantic: データの検証と設定に使用されます。
Tenacity: 信頼性を高めるための再試行機能を担当します
python-dotenv: 環境変数のロード用
fastapi: 少量のコードで API エンドポイントを構築します
uvicorn: FastAPI アプリケーションを効率的に実行するために使用されます
torch: 変圧器モデルの実行と機械学習操作の処理用
numpy: 数値計算に使用されます。

次のコマンドを使用してすべてのパッケージをインストールします:

mkdir llm_integration && cd llm_integration
python3 -m venv env
syource env/bin/activate

ログイン後にコピー

きれいな構造でプロジェクトを整理しましょう。プロジェクトディレクトリに次のディレクトリとファイルを作成します:

transformers==4.36.0
huggingface-hub==0.19.4
redis==4.6.0
pydantic==2.5.0
pydantic-settings==2.1.0
tenacity==8.2.3
python-dotenv==1.0.0
fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
numpy==1.24.3

ログイン後にコピー

LLM クライアントの構築

アプリケーションの最も重要なコンポーネントである LLM クライアントから始めましょう。ここで、ChatGPT モデル (またはその他の任意の LLM) を操作します。次のコードスニペットを core/llm_client.py ファイルに追加します:

pip install -r requirements.txt

ログイン後にコピー

LLMClient クラスの最初の部分では、基盤をセットアップします。

モデルをロードするために、トランスフォーマーライブラリの AutoModelForCausalLM と AutoTokenizer を使用しています
device_map="auto" パラメーターは GPU/CPU 割り当てを自動的に処理します
良好なパフォーマンスを維持しながらメモリ使用量を最適化するために torch.float16 を使用しています

次に、モデルと通信するメソッドを追加しましょう。

llm_integration/
├── core/
│   ├── llm_client.py      # your main LLM interaction code
│   ├── prompt_manager.py  # Handles prompt templates
│   └── response_handler.py # Processes LLM responses
├── cache/
│   └── redis_manager.py   # Manages your caching system
├── config/
│   └── settings.py        # Configuration management
├── api/
│   └── routes.py          # API endpoints
├── utils/
│   ├── monitoring.py      # Usage tracking
│   └── rate_limiter.py    # Rate limiting logic
├── requirements.txt
└── main.py
└── usage_logs.json

ログイン後にコピー

この完了メソッドで何が起こっているのかを詳しく見てみましょう:

一時的な障害に対処するために @retry デコレータメソッドを追加しました。
torch.no_grad() コンテキストマネージャーを使用して、勾配計算を無効にしてメモリを節約しました。
入力と出力の両方でトークンの使用状況を追跡します。これはコスト計算に非常に重要です。
応答と使用統計を含む構造化辞書を返します。

LLM 応答ハンドラーの作成

次に、LLM の生の出力を解析して構造化するための応答ハンドラーを追加する必要があります。これを core/response_handler.py ファイルで次のコードスニペットを使用して実行します:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import Dict, Optional
import logging

class LLMClient:
    def __init__(self, model_name: str = "gpt2", timeout: int = 30):
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16
            )
        except Exception as e:
            logging.error(f"Error loading model: {str(e)}")
            # Fallback to a simpler model if the specified one fails
            self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
            self.model = AutoModelForCausalLM.from_pretrained("gpt2")

        self.timeout = timeout
        self.logger = logging.getLogger(__name__)

ログイン後にコピー

堅牢なキャッシュシステムの追加

次に、アプリケーションのパフォーマンスを向上させ、コストを削減するためのキャッシュシステムを作成しましょう。次のコードスニペットを、cache/redis_manager.py ファイルに追加します:

 @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True
    )
    async def complete(self, 
                      prompt: str, 
                      temperature: float = 0.7,
                      max_tokens: Optional[int] = None) -> Dict:
        """Get completion from the model with automatic retries"""
        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(
                self.model.device
            )

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_tokens or 100,
                    temperature=temperature,
                    do_sample=True
                )

            response_text = self.tokenizer.decode(
                outputs[0], 
                skip_special_tokens=True
            )

            # Calculate token usage for monitoring
            input_tokens = len(inputs.input_ids[0])
            output_tokens = len(outputs[0]) - input_tokens

            return {
                'content': response_text,
                'usage': {
                    'prompt_tokens': input_tokens,
                    'completion_tokens': output_tokens,
                    'total_tokens': input_tokens + output_tokens
                },
                'model': "gpt2"
            }

        except Exception as e:
            self.logger.error(f"Error in LLM completion: {str(e)}")
            raise

ログイン後にコピー

上記のコードスニペットでは、次のようにすべてのキャッシュ操作を処理する CacheManager クラスを作成しました。

_generate_key メソッド。プロンプトとパラメーターに基づいて一意のキャッシュキーを作成します
get_cached_response は、指定されたプロンプトに対するキャッシュされた応答があるかどうかを確認します
成功した応答を将来の使用のために保存するcache_response

スマートプロンプトマネージャーの作成

LLM モデルのプロンプトを管理するプロンプトマネージャーを作成しましょう。次のコードを core/prompt_manager.py に追加します:

mkdir llm_integration && cd llm_integration
python3 -m venv env
syource env/bin/activate

ログイン後にコピー

次に、コードスニペットを使用して、prompts/content_moderation.json ファイルにコンテンツモデレーション用のサンプルプロンプトテンプレートを作成します。

transformers==4.36.0
huggingface-hub==0.19.4
redis==4.6.0
pydantic==2.5.0
pydantic-settings==2.1.0
tenacity==8.2.3
python-dotenv==1.0.0
fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
numpy==1.24.3

ログイン後にコピー

これで、プロンプトマネージャーは JSON ファイルからプロンプトテンプレートをロードし、フォーマットされたプロンプトテンプレートも取得できるようになります。

構成マネージャーのセットアップ

すべての LLM 構成を 1 か所に保管し、アプリケーション全体で簡単に再利用するには、構成設定を作成しましょう。以下のコードを config/settings.py ファイルに追加します:

pip install -r requirements.txt

ログイン後にコピー

レート制限の実装

次に、レート制限を実装して、ユーザーがアプリケーションのリソースにアクセスする方法を制御しましょう。これを行うには、次のコードを utils/rate_limiter.py ファイルに追加します。

llm_integration/
├── core/
│   ├── llm_client.py      # your main LLM interaction code
│   ├── prompt_manager.py  # Handles prompt templates
│   └── response_handler.py # Processes LLM responses
├── cache/
│   └── redis_manager.py   # Manages your caching system
├── config/
│   └── settings.py        # Configuration management
├── api/
│   └── routes.py          # API endpoints
├── utils/
│   ├── monitoring.py      # Usage tracking
│   └── rate_limiter.py    # Rate limiting logic
├── requirements.txt
└── main.py
└── usage_logs.json

ログイン後にコピー

RateLimiter では、一定期間内に各ユーザーに許可されるリクエストの期間と数を渡すだけでレート制限を処理するために、任意のルートで使用できる再実行可能な check_rate_limit メソッドを実装しました。

API エンドポイントの作成

次に、API エンドポイントを api/routes.py ファイルに作成して、アプリケーションに LLM を統合しましょう。

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import Dict, Optional
import logging

class LLMClient:
    def __init__(self, model_name: str = "gpt2", timeout: int = 30):
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16
            )
        except Exception as e:
            logging.error(f"Error loading model: {str(e)}")
            # Fallback to a simpler model if the specified one fails
            self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
            self.model = AutoModelForCausalLM.from_pretrained("gpt2")

        self.timeout = timeout
        self.logger = logging.getLogger(__name__)

ログイン後にコピー

ここでは、API ルートの編成を担当する APIRouter クラスに /moderate エンドポイントを定義しました。 @lru_cache デコレーターは依存関係注入関数 (get_llm_client、get_response_handler、get_cache_manager、および get_prompt_manager) に適用され、LLMClient、CacheManager、および PromptManager のインスタンスがキャッシュされてパフォーマンスが向上します。 @router.post で修飾されたmoderate_content関数は、コンテンツモデレーションのためのPOSTルートを定義し、FastAPIのDependsメカニズムを利用してこれらの依存関係を注入します。関数内では、設定からレート制限設定を構成した RateLimiter クラスがリクエスト制限を強制します。

最後に、main.py を更新してすべてをまとめましょう:

 @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True
    )
    async def complete(self, 
                      prompt: str, 
                      temperature: float = 0.7,
                      max_tokens: Optional[int] = None) -> Dict:
        """Get completion from the model with automatic retries"""
        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(
                self.model.device
            )

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_tokens or 100,
                    temperature=temperature,
                    do_sample=True
                )

            response_text = self.tokenizer.decode(
                outputs[0], 
                skip_special_tokens=True
            )

            # Calculate token usage for monitoring
            input_tokens = len(inputs.input_ids[0])
            output_tokens = len(outputs[0]) - input_tokens

            return {
                'content': response_text,
                'usage': {
                    'prompt_tokens': input_tokens,
                    'completion_tokens': output_tokens,
                    'total_tokens': input_tokens + output_tokens
                },
                'model': "gpt2"
            }

        except Exception as e:
            self.logger.error(f"Error in LLM completion: {str(e)}")
            raise

ログイン後にコピー

上記のコードでは、/api/v1 プレフィックスの下に api.routes を使用して FastAPI アプリとルーターを作成しました。タイムスタンプ付きの情報メッセージを表示するためのログ記録を有効にしました。アプリは、ホットリロードを有効にして、Uvicorn を使用して localhost:8000 を実行します。

アプリケーションを実行する

これですべてのコンポーネントが整ったので、アプリケーションを起動して実行してみましょう。まず、プロジェクトのルートディレクトリに .env ファイルを作成し、HUGGGINGFACE_API_KEY と REDIS_URL:
を追加します。

mkdir llm_integration && cd llm_integration
python3 -m venv env
syource env/bin/activate

ログイン後にコピー

次に、マシン上で Redis が実行されていることを確認します。ほとんどの Unix ベースのシステムでは、次のコマンドで起動できます:

transformers==4.36.0
huggingface-hub==0.19.4
redis==4.6.0
pydantic==2.5.0
pydantic-settings==2.1.0
tenacity==8.2.3
python-dotenv==1.0.0
fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
numpy==1.24.3

ログイン後にコピー

これでアプリケーションを開始できます:

pip install -r requirements.txt

ログイン後にコピー

FastAPI サーバーは http://localhost:8000 で実行を開始します。自動 API ドキュメントは http://localhost:8000/docs で入手できます。これはエンドポイントをテストするのに非常に役立ちます!

Integrating Large Language Models in Production Applications

Content Moderation API のテスト

新しく作成した API を実際のリクエストでテストしてみましょう。新しいターミナルを開き、次のcurlコマンドを実行します:

llm_integration/
├── core/
│   ├── llm_client.py      # your main LLM interaction code
│   ├── prompt_manager.py  # Handles prompt templates
│   └── response_handler.py # Processes LLM responses
├── cache/
│   └── redis_manager.py   # Manages your caching system
├── config/
│   └── settings.py        # Configuration management
├── api/
│   └── routes.py          # API endpoints
├── utils/
│   ├── monitoring.py      # Usage tracking
│   └── rate_limiter.py    # Rate limiting logic
├── requirements.txt
└── main.py
└── usage_logs.json

ログイン後にコピー

ターミナルに次のような応答が表示されるはずです:

Integrating Large Language Models in Production Applications

モニタリングと分析の追加

次に、アプリケーションのパフォーマンスと使用されているリソースの量を追跡するために、いくつかの監視機能を追加しましょう。次のコードを utils/monitoring.py ファイルに追加します:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import Dict, Optional
import logging

class LLMClient:
    def __init__(self, model_name: str = "gpt2", timeout: int = 30):
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16
            )
        except Exception as e:
            logging.error(f"Error loading model: {str(e)}")
            # Fallback to a simpler model if the specified one fails
            self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
            self.model = AutoModelForCausalLM.from_pretrained("gpt2")

        self.timeout = timeout
        self.logger = logging.getLogger(__name__)

ログイン後にコピー

UsageMonitor クラスは次の操作を実行します:

タイムスタンプによるすべての API リクエストの追跡
コスト監視のためのトークン使用量の記録
応答時間の測定
構造化されたログファイルにすべてを保存します (アプリケーションを運用環境にデプロイする前に、これをデータベースに置き換えます)

次に、使用状況統計を計算する新しいメソッドを追加します。

 @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True
    )
    async def complete(self, 
                      prompt: str, 
                      temperature: float = 0.7,
                      max_tokens: Optional[int] = None) -> Dict:
        """Get completion from the model with automatic retries"""
        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(
                self.model.device
            )

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_tokens or 100,
                    temperature=temperature,
                    do_sample=True
                )

            response_text = self.tokenizer.decode(
                outputs[0], 
                skip_special_tokens=True
            )

            # Calculate token usage for monitoring
            input_tokens = len(inputs.input_ids[0])
            output_tokens = len(outputs[0]) - input_tokens

            return {
                'content': response_text,
                'usage': {
                    'prompt_tokens': input_tokens,
                    'completion_tokens': output_tokens,
                    'total_tokens': input_tokens + output_tokens
                },
                'model': "gpt2"
            }

        except Exception as e:
            self.logger.error(f"Error in LLM completion: {str(e)}")
            raise

ログイン後にコピー

API を更新して、UsageMonitor クラスから監視機能を追加します。

from typing import Dict
import logging

class ResponseHandler:
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def parse_moderation_response(self, raw_response: str) -> Dict:
        """Parse and structure the raw LLM response for moderation"""
        try:
            # Default response structure
            structured_response = {
                "is_appropriate": True,
                "confidence_score": 0.0,
                "reason": None
            }

            # Simple keyword-based analysis
            lower_response = raw_response.lower()

            # Check for inappropriate content signals
            if any(word in lower_response for word in ['inappropriate', 'unsafe', 'offensive', 'harmful']):
                structured_response["is_appropriate"] = False
                structured_response["confidence_score"] = 0.9
                # Extract reason if present
                if "because" in lower_response:
                    reason_start = lower_response.find("because")
                    structured_response["reason"] = raw_response[reason_start:].split('.')[0].strip()
            else:
                structured_response["confidence_score"] = 0.95

            return structured_response

        except Exception as e:
            self.logger.error(f"Error parsing response: {str(e)}")
            return {
                "is_appropriate": True,
                "confidence_score": 0.5,
                "reason": "Failed to parse response"
            }

    def format_response(self, raw_response: Dict) -> Dict:
        """Format the final response with parsed content and usage stats"""
        try:
            return {
                "content": self.parse_moderation_response(raw_response["content"]),
                "usage": raw_response["usage"],
                "model": raw_response["model"]
            }
        except Exception as e:
            self.logger.error(f"Error formatting response: {str(e)}")
            raise

ログイン後にコピー

次に、次のcurlコマンドを実行して/statsエンドポイントをテストします:

import redis
from typing import Optional, Any
import json
import hashlib

class CacheManager:
    def __init__(self, redis_url: str, ttl: int = 3600):
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl

    def _generate_key(self, prompt: str, params: dict) -> str:
        """Generate a unique cache key"""
        cache_data = {
            'prompt': prompt,
            'params': params
        }
        serialized = json.dumps(cache_data, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    async def get_cached_response(self, 
                                prompt: str, 
                                params: dict) -> Optional[dict]:
        """Retrieve cached LLM response"""
        key = self._generate_key(prompt, params)
        cached = self.redis.get(key)
        return json.loads(cached) if cached else None

    async def cache_response(self, 
                           prompt: str, 
                           params: dict, 
                           response: dict) -> None:
        """Cache LLM response"""
        key = self._generate_key(prompt, params)
        self.redis.setex(
            key,
            self.ttl,
            json.dumps(response)
        )

ログイン後にコピー

上記のコマンドは、以下のスクリーンショットに示すように、/moderate エンドポイント上のリクエストの統計を表示します。

Integrating Large Language Models in Production Applications