How to Build a Crunchbase Scraper, With a Code Demo

Barbara Streisand

Oct 18, 2024, 12:58 PM

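In an age where data is worth its weight in gold, Crunchbase is a goldmine. It hosts thousands of company profiles along with investment data, leadership positions, funding information, news, and more. Scraping Crunchbase lets you pull out the nuggets of gold (the insights you need) and leave behind the debris (everything else that is irrelevant to you).

This article walks through building a Crunchbase scraper from scratch, with all the technical details and Python code, plus a working demo you can follow along with. That said, building a Crunchbase scraper is time-consuming and comes with many challenges along the way. For that reason, we will also demo an alternative approach using Proxycurl, a paid API-based tool that does the work for you. With both options in hand, you can weigh their benefits and choose the one that best fits your needs.

Here is a sneak peek at a basic Crunchbase scraper that uses Python to extract a company's name and headquarters city from the website.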

import requests
from bs4 import BeautifulSoup

url = 'https://www.crunchbase.com/organization/apple'
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

name_section = soup.find('h1', class_='profile-name')
company_name = name_section.get_text(strip=True) if name_section else 'N/A'

headquarters_section = soup.find('span', class_='component--field-formatter field_type_text')
headquarters_city = headquarters_section.get_text(strip=True) if headquarters_section else 'N/A'

print(f"Company Name: {company_name}")
print(f"Headquarters City: {headquarters_city}")

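Next comes the alternative approach, Proxycurl. It is a comparatively efficient Crunchbase scraping tool that retrieves the same company information with just a few lines of code. An added benefit is that with Proxycurl you don't need to worry about HTML parsing or scraping roadblocks.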

import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
    }

response = requests.get(api_endpoint, params=params, headers=headers)
data = response.json()

print(f"Company Name: {data['company_name']}")
print(f"Company Headquarter: {data['hq']['city']}")

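By the end of this article, you will be familiar with both methods and able to make an informed decision. So whether you are excited to roll up your sleeves and code your own scraper or you want a one-stop solution, read on to set up your Crunchbase scraper.

Building a Crunchbase scraper from scratch

Crunchbase contains several data types, including acquisitions, people, events, hubs, and funding rounds. This article covers building a simple Crunchbase scraper that parses a company's description and retrieves it as JSON data. Let's use Apple as the example.

First, we need to define a function that extracts the company description. The get_company_description() function searches for the span HTML element that contains the company description, then extracts and returns its text.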

def get_company_description(raw_html):
    description_section = raw_html.find("span", {"class": "description"})
    return description_section.get_text(strip=True) if description_section else "Description not found"

The script then sends an HTTP GET request to the URL of the company profile you want to scrape, in this case Apple's profile. The full code looks like this:

import requests
from bs4 import BeautifulSoup

def get_company_description(raw_html):
    # Locate the description section in the HTML
    description_section = raw_html.find("span", {"class": "description"})

    # Return the text if found, else return a default message
    return description_section.get_text(strip=True) if description_section else "Description not found"

# URL of the Crunchbase profile to scrape
url = "https://www.crunchbase.com/organization/apple"
# Set the User-Agent header to simulate a browser request
headers = {"User-Agent": "Mozilla/5.0"}

# Send a GET request to the specified URL
response = requests.get(url, headers=headers)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # Call the function to get the company description
    company_description = get_company_description(soup)

    # Print the retrieved company description
    print(f"Company Description: {company_description}")
else:
    # Print an error message if the request failed
    print(f"Failed to retrieve data. Status Code: {response.status_code}")

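This script does the trick for pulling Apple's company description from Crunchbase. Depending on your experience and what you are looking for, things can get much more complicated. There are many hurdles along the way, such as handling large volumes of data, managing pagination, and bypassing authwall mechanisms. Keep in mind that you will need to: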

Repeat this process for every field you are interested in.
Keep up with changes to the web page. Even a slight change in how a field is displayed on the website can require minor or major adjustments to your scraping logic.
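
One way to avoid rewriting the lookup logic for every field is to keep the selectors in a single mapping, so that a layout change means editing one dictionary rather than several functions. The sketch below is a minimal illustration, not a tested Crunchbase scraper: FIELD_SELECTORS and extract_fields are hypothetical names, and the class names are the ones used in the snippets above, which may not match the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping of field name -> (tag, CSS class). The class names come
# from the snippets above and may not match Crunchbase's current markup.
FIELD_SELECTORS = {
    "name": ("h1", "profile-name"),
    "description": ("span", "description"),
}

def extract_fields(soup, selectors=FIELD_SELECTORS):
    """Return a dict of field -> text, with None for fields that are not found."""
    result = {}
    for field, (tag, css_class) in selectors.items():
        element = soup.find(tag, class_=css_class)
        result[field] = element.get_text(strip=True) if element else None
    return result

# A small static page stands in for a live response
sample_html = (
    '<h1 class="profile-name">Apple</h1>'
    '<span class="description">Makes devices.</span>'
)
print(extract_fields(BeautifulSoup(sample_html, "html.parser")))
```

Returning None for a missing field, rather than a placeholder string, makes it easier to tell "not found" apart from real content later on.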

Note: Check the website's terms of service and robots.txt file to make sure you are scraping responsibly and within legal limits.
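
The robots.txt part of that check can be scripted with Python's standard library. The sketch below parses rules supplied as text; the rules and URLs are made up for illustration and are not Crunchbase's actual robots.txt, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only
example_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "my-scraper", "https://example.com/private/page"))  # False
print(is_allowed(example_rules, "my-scraper", "https://example.com/public/page"))   # True
```

To check a live site, RobotFileParser can also load the file itself through set_url() followed by read(). A robots.txt check does not replace reading the terms of service.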

Why is building a Crunchbase scraper challenging?

Building your own Crunchbase scraper is a viable option, but before you go gung-ho, know what challenges await you.

Accuracy and completeness

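Your efforts are meaningless if the extracted data is wrong. Manual scraping increases the margin for error, and your code can miss important data if the page does not load completely or if some content is embedded in iframes or external resources.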
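
A simple guard against silently incomplete results is to validate each scraped record before using it. The following is a minimal sketch under stated assumptions: REQUIRED_FIELDS and find_missing_fields are hypothetical names, and the placeholder strings are the fallback values used by the snippets earlier in this article.

```python
# Hypothetical choice of fields that every record must contain
REQUIRED_FIELDS = ("name", "description")

def find_missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, empty, or placeholder values."""
    # Fallback strings used by the earlier snippets when an element is not found
    placeholders = {"", "N/A", "Description not found"}
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() in placeholders:
            missing.append(field)
    return missing

record = {"name": "Apple", "description": "Description not found"}
print(find_missing_fields(record))  # ['description']
```

A non-empty result is a signal to retry the page or inspect it by hand instead of storing the record.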

Crunchbase's structure and changes

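Parsing a web page's HTML to extract specific data fields is a basic step in scraping. Crunchbase's HTML is complex, with dynamic elements and multiple layers of containers. Identifying and targeting the right data is a task in itself. Add the changing structure of the website and the job can become ten times harder.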
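
One lightweight way to notice a structural change early is to check that the markers your scraper depends on still appear in the page before parsing it. This is only a rough sketch: missing_markers is a hypothetical helper, the class names are the ones assumed by the earlier snippets, and a plain substring check is a crude stand-in for real monitoring.

```python
def missing_markers(html, markers):
    """Return the markers (for example class names) that no longer appear in the HTML."""
    return [marker for marker in markers if marker not in html]

# Class names the earlier snippets rely on; they may not match the live site
EXPECTED_MARKERS = ["profile-name", "description"]

old_layout = '<h1 class="profile-name">Apple</h1><span class="description">...</span>'
new_layout = '<h1 class="org-title">Apple</h1><span class="description">...</span>'

print(missing_markers(old_layout, EXPECTED_MARKERS))  # []
print(missing_markers(new_layout, EXPECTED_MARKERS))  # ['profile-name']
```

Running a check like this on a schedule turns a silent failure into an explicit alert that the selectors need updating.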

Handling the authwall and anti-scraping mechanisms

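Crunchbase protects most of its data behind an authentication wall that requires login credentials or a premium account. Handling login sessions, tokens, or cookies manually in a scraper makes the task more complex, especially when those sessions must be maintained across multiple requests. Likewise, Crunchbase uses bot detection systems and rate-limits requests. You risk getting blocked, and bypassing these protections means implementing techniques such as rotating proxies or handling CAPTCHAs, which is easier said than done.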
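
For the rate-limiting part, a common courtesy is to slow down and retry with exponential backoff rather than hammering the server. The sketch below assumes the server signals rate limiting with HTTP status 429, which is the standard code but is not confirmed here for Crunchbase. fetch_with_backoff is a hypothetical helper, and the fetcher is faked so that the example runs without network access.

```python
import time

def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429, waiting longer each time.

    fetch is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    return status, body

# Fake fetcher standing in for a real HTTP call: rate-limited twice, then OK
responses = iter([(429, ""), (429, ""), (200, "<html>...</html>")])
waits = []  # record the delays instead of actually sleeping
print(fetch_with_backoff(lambda: next(responses), sleep=waits.append))
print(waits)  # [1.0, 2.0]
```

In a real scraper, fetch would wrap a requests.get call and return its status code and text. Backoff handles rate limits only; it does nothing about login walls or CAPTCHAs.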

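Building your own Crunchbase scraper gives you flexibility and a sense of accomplishment, but weigh that against the challenges involved. Getting the data you need takes deep technical expertise, constant monitoring, and effort, not to mention how time-consuming and error-prone the process can be. Consider whether the effort and maintenance are truly worth it for your needs.

The hassle-free way to set up a Crunchbase scraper

Phew! Building a Crunchbase scraper from scratch is certainly a lot of work. Not only do you spend a great deal of time and effort, you also have to keep a constant eye out for potential challenges. Thank goodness Proxycurl exists!

Use Proxycurl's endpoints to get all the data you need in JSON format. And since Crunchbase only provides the public data available on a company, no data is out of reach. Any attempt to scrape private information results in a 404. Rest assured that you are never charged for requests that return an error code.

Proxycurl provides a list of standard fields under the Company Profile Endpoint. A complete example response is in the docs, on the right-hand side under the request that generated it. On request, Proxycurl can also scrape the following fields:

Categories
Funding data
Exit data
Acquisitions
Extra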

Each of these fields costs extra credits when requested, so choose only the parameters you need. But when you do need them, Proxycurl puts them a single parameter away!
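
Because each optional field adds to the cost, it can help to build the query parameters from an explicit list of the fields you actually want. The helper below is a hypothetical sketch. The parameter names and the 'include' value are taken from the Python example later in this article, and build_params itself is not part of any Proxycurl library.

```python
# Optional fields named in the list above; each one requested costs extra credits
OPTIONAL_FIELDS = ("categories", "funding_data", "exit_data", "acquisitions", "extra")

def build_params(linkedin_url, wanted=()):
    """Build query parameters that include only the optional fields asked for."""
    unknown = set(wanted) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown optional fields: {sorted(unknown)}")
    params = {"url": linkedin_url}
    for field in wanted:
        params[field] = "include"
    return params

print(build_params("https://www.linkedin.com/company/apple/", ["funding_data"]))
```

The resulting dictionary can be passed as the params argument of requests.get, in the same way as the hand-written dictionaries elsewhere in this article.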

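Now that you know what Proxycurl is, let's look at a working demo. We'll include two examples, one with Postman and one with Python.

Crunchbase scraping with Proxycurl via Postman

Step 1: Set up your account and get your API key

Create an account with Proxycurl and you will be assigned a unique API key. Proxycurl is a paid API, and every request must be authenticated with a bearer token (your API key). You also get 100 credits if you sign up with a work email, or 10 credits if you use a personal email, so you can start experimenting right away. Your dashboard should look like this.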

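From here, you can scroll down and choose to work with either the Person Profile Endpoint or the Company Profile Endpoint. The Person Profile Endpoint is a useful tool if you want to scrape LinkedIn. For details, see "How to Build a LinkedIn Data Scraper".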

For this use case, we will work only with the Company Profile Endpoint.
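
Since every request must carry the API key as a bearer token, it is worth keeping the key out of your source code when you move from Postman to scripts. The sketch below reads it from an environment variable. The variable name PROXYCURL_API_KEY and the auth_headers helper are assumptions made for illustration, not something Proxycurl prescribes.

```python
import os

def auth_headers(env_var="PROXYCURL_API_KEY"):
    """Build the Authorization header from an API key stored in the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": "Bearer " + api_key}
```

Calling auth_headers() then gives the same headers dictionary that the Python examples in this article build by hand from 'YOUR_API_KEY'.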

Step 2: Run Postman and set your bearer token

Go to Proxycurl's collection in Postman, click the Company Profile Endpoint doc, then find and click the orange "Run in Postman" button. Next, click "Fork Collection" and log in however you like. It should look like this. There is also a full tutorial on how to set up the Proxycurl API in Postman.

Setting up the Proxycurl API in Postman

Once you are in Postman, go to Authorization, choose Bearer Token, add your token (your API key), and limit it to Proxycurl. You can do this from the Variables tab or from the pop-up that appears when you start typing in the Token field. Name the token whatever you like, or simply call it "Bearer Token".

Make sure the authorization type is set to Bearer Token and that you have typed {{Bearer Token}} in the Token field, then click Save in the top right corner. Don't forget to click Save! Your page should look like this:

Step 3: Navigate to your workspace

Under My Workspace on the left, go to the Proxycurl collection and then to the Company API. The drop-down menu shows a list of options, but here is what you need to know:

Company Profile Endpoint: Enriches company profile with Crunchbase data like funding, acquisitions, etc. You will need to use the company’s LinkedIn profile URL as input parameter to the API.
Company Lookup Endpoint: Input a company’s website and get its LinkedIn URL.
Company Search Endpoint: Input various search parameters to find a list of companies that match those criteria, and then pull Crunchbase data for these companies.

The various company-related endpoints

Step 4: Edit your params and send!

Go to Company Profile Endpoint and from there, you can uncheck some of the fields if you want or modify others. For instance, you might want to change use_cache from if-present to if-recent to get the most up-to-date info, but maybe you don't need the acquisitions information this time.

Choose the relevant fields that you need. Some cost extra credits.

Once you've modified all the fields to your liking, click the blue "Send" button in the upper left-hand corner. Your output should look something like this.

If you come across a 401 status code, you most likely forgot to hit Save after setting the Token field to {{Bearer Token}} in Step 2. A good way to troubleshoot this is to edit the Authorization tab for this specific query so that it uses the {{Bearer Token}} variable directly. If that fixes it, then the auth inheritance isn't working, which probably means you forgot to save.
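
The same status codes matter once you call the API from a script instead of Postman. As a small illustration, the hypothetical helper below maps the codes mentioned in this article to troubleshooting hints. The wording of the hints is ours and not taken from the API.

```python
def explain_status(status_code):
    """Map a response status code to a short troubleshooting hint."""
    hints = {
        200: "OK: parse the JSON body.",
        401: "Unauthorized: check that the bearer token (API key) is set and saved.",
        404: "Not found: the profile may be private or missing.",
    }
    return hints.get(status_code, f"Unexpected status {status_code}: see the API docs.")

print(explain_status(401))
print(explain_status(503))
```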

Crunchbase scraping with Proxycurl via Python

Now let’s try and do the same with Python. In the Proxycurl docs under Company Profile Endpoint, you can toggle between shell and Python. We’ll use the company endpoint to pull Crunchbase-related data, and it’s as simple as switching to Python in the API docs.

Toggle between shell and Python

Now, we can paste in our API key where it says YOUR_API_KEY. Once we have everything set up, we can extract the JSON response and print it. Here’s the code for that, and you can make changes to it as needed:

import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
    'categories': 'include',
    'funding_data': 'include',
    'exit_data': 'include',
    'acquisitions': 'include',
    'extra': 'include',
    'use_cache': 'if-present',
    'fallback_to_cache': 'on-error',
}

response = requests.get(api_endpoint, params=params, headers=headers)
print(response.json())

Now, what you get is a structured JSON response that includes all the fields that you have specified. Something like this:

{
   "linkedin_internal_id": "162479",
   "description": "We're a diverse collective of thinkers and doers, continually reimagining what's possible to help us all do what we love in new ways. And the same innovation that goes into our products also applies to our practices -- strengthening our commitment to leave the world better than we found it. This is where your work can make a difference in people's lives. Including your own.\n\nApple is an equal opportunity employer that is committed to inclusion and diversity. Visit apple.com/careers to learn more.",
   "website": "http://www.apple.com/careers",
   "industry": "Computers and Electronics Manufacturing",
   "company_size": [
       10001,
       null
   ],
   "company_size_on_linkedin": 166869,
   "hq": {
       "country": "US",
       "city": "Cupertino",
       "postal_code": "95014",
       "line_1": "1 Apple Park Way",
       "is_hq": true,
       "state": "California"
   },
   "company_type": "PUBLIC_COMPANY",
   "founded_year": 1976,
   "specialities": [
       "Innovative Product Development",
       "World-Class Operations",
       "Retail",
       "Telephone Support"
   ],
   "locations": [
       {
           "country": "US",
           "city": "Cupertino",
           "postal_code": "95014",
           "line_1": "1 Apple Park Way",
           "is_hq": true,
           "state": "California"
       }
   ],
   ... // remaining data
}
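
Once the response has been parsed into a dictionary, pulling out individual values is plain dictionary access. The sketch below works on a trimmed copy of the sample response shown above. summarize_company is a hypothetical helper, and it uses .get() so that a missing key yields None instead of raising an error.

```python
def summarize_company(data):
    """Pull a few fields out of a company profile response, tolerating gaps."""
    hq = data.get("hq") or {}
    size = data.get("company_size") or [None, None]
    return {
        "city": hq.get("city"),
        "state": hq.get("state"),
        "founded_year": data.get("founded_year"),
        "min_employees": size[0],
        "max_employees": size[1],  # None when the response has null here
    }

# Trimmed copy of the sample response above (JSON null becomes Python None)
sample = {
    "hq": {"country": "US", "city": "Cupertino", "state": "California"},
    "company_size": [10001, None],
    "founded_year": 1976,
}
print(summarize_company(sample))
```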

Great! Congratulations on your journey from zero to data!

Is any of this legal?

Yes, scraping Crunchbase is legal. The legality of scraping depends on factors such as the type of data, the website's terms of service, and data protection laws like GDPR. The idea is to scrape only publicly available data within these boundaries. Since Crunchbase only houses public data, scraping it is legal as long as you operate within the Crunchbase Terms of Service.

Final thoughts

A DIY Crunchbase scraper can be an exciting project and gives you full control over the data extraction process. But be mindful of the challenges that come with it. Facing a roadblock in each step can make scraping a time-consuming and often fragile process that requires technical expertise and constant maintenance.

Proxycurl provides a simpler and more reliable alternative. Follow the steps above and you can access structured company data through an API without worrying about roadblocks. Spend your time using the data and leave the hard work and worry to Proxycurl!

We'd love to hear from you! If you build something cool with our API, let us know at hello@nubela.co! And if you found this guide useful, there's more where it came from - sign up for our newsletter!
