In a time when data is worth its weight in gold, Crunchbase is a goldmine. It’s home to thousands of company profiles, investment data, leadership positions, funding information, news and much more. Scraping Crunchbase lets you get at the gold nuggets (the insights you need) and filter out the debris (everything irrelevant to you).
In this article, we’ll walk you through the process of building a Crunchbase scraper from scratch, including all the technical details and code using Python, with a working demo for you to follow along. That said, you should also understand that building a Crunchbase scraper is a time-consuming task with many challenges along the way. That is why we’ll also go through a demo of an alternative approach using Proxycurl, a paid API-based tool that does the work for you. With both options on the table, you can weigh their advantages and choose the one that best fits your needs.
Here’s a sneak peek at a basic Crunchbase scraper that uses Python to extract the company name and headquarters city from the website.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.crunchbase.com/organization/apple'
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

name_section = soup.find('h1', class_='profile-name')
company_name = name_section.get_text(strip=True) if name_section else 'N/A'

headquarters_section = soup.find('span', class_='component--field-formatter field_type_text')
headquarters_city = headquarters_section.get_text(strip=True) if headquarters_section else 'N/A'

print(f"Company Name: {company_name}")
print(f"Headquarters City: {headquarters_city}")
```
Now, on to our alternative approach: Proxycurl. It’s a comparably efficient Crunchbase scraping tool, and you can pull the same company information using just a few lines of code. The added benefit is that with Proxycurl you won’t have to worry about HTML parsing or any other scraping roadblocks.
```python
import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
}

response = requests.get(api_endpoint, params=params, headers=headers)
data = response.json()

print(f"Company Name: {data['company_name']}")
print(f"Company Headquarter: {data['hq']['city']}")
```
By the end of this article, you'll be familiar with both methods and able to make an informed decision. So whether you're excited to roll up your sleeves and code your own scraper or you’re after a one-stop solution, keep reading to set up your Crunchbase scraper.
Crunchbase contains several data types, including acquisitions, people, events, hubs and funding rounds. In this article, we’ll build a simple Crunchbase scraper that parses a company's description and retrieves it as JSON data. Let’s use Apple for our example.
First, we need to define a function to extract the company description. The get_company_description() function searches for the span HTML element that contains the company’s description, then extracts the text and returns it:
```python
def get_company_description(raw_html):
    description_section = raw_html.find("span", {"class": "description"})
    return description_section.get_text(strip=True) if description_section else "Description not found"
```
Next, we send an HTTP GET request to the URL of the company profile we want to scrape, in this case, Apple’s profile. Here’s what the full code looks like:
```python
import requests
from bs4 import BeautifulSoup

def get_company_description(raw_html):
    # Locate the description section in the HTML
    description_section = raw_html.find("span", {"class": "description"})
    # Return the text if found, else return a default message
    return description_section.get_text(strip=True) if description_section else "Description not found"

# URL of the Crunchbase profile to scrape
url = "https://www.crunchbase.com/organization/apple"
# Set the User-Agent header to simulate a browser request
headers = {"User-Agent": "Mozilla/5.0"}

# Send a GET request to the specified URL
response = requests.get(url, headers=headers)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    # Call the function to get the company description
    company_description = get_company_description(soup)
    # Print the retrieved company description
    print(f"Company Description: {company_description}")
else:
    # Print an error message if the request failed
    print(f"Failed to retrieve data. Status Code: {response.status_code}")
```
This script does the trick for pulling Apple’s company description from Crunchbase. But depending on your experience and what you are looking for, things can get a lot trickier: handling large volumes of data, managing pagination and bypassing authwall mechanisms are just some of the hurdles along the way.
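As a small follow-up, recall that the goal was to retrieve the description as JSON data, while the script above only prints it. Here is a minimal sketch that packages the result using the standard library (the field names here are our own choice, not a Crunchbase schema):

```python
import json

def to_json_record(company_name, description):
    """Package a scraped description as a pretty-printed JSON string."""
    record = {
        "company": company_name,
        "description": description,
    }
    return json.dumps(record, indent=2)

# In the real script, you would pass the value returned by
# get_company_description(soup) instead of this placeholder:
print(to_json_record("Apple", "Example description text"))
```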
**Note:** Check the website’s terms of service and robots.txt file to ensure you're scraping responsibly and within legal limits.
Building your own Crunchbase scraper is a viable option, but before you dive in headfirst, be aware of the challenges that await you.
Your efforts will be meaningless if the extracted data is inaccurate. A hand-rolled scraper raises the margin of error, and your code may miss important data if the page doesn't fully load or if some content is embedded in iframes or external resources.
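One way to catch inaccurate or partial extractions early is to validate each record before storing it. A minimal sketch, with illustrative field names (not Crunchbase's actual schema):

```python
REQUIRED_FIELDS = ["name", "description", "hq_city"]

def validate_record(record):
    """Return the list of required fields that are missing or empty,
    so incomplete extractions get flagged instead of stored silently."""
    return [field for field in REQUIRED_FIELDS
            if not record.get(field) or record[field] == "N/A"]

record = {"name": "Apple", "description": "", "hq_city": "Cupertino"}
missing = validate_record(record)
if missing:
    print(f"Incomplete record, check fields: {missing}")  # flags 'description'
```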
Parsing a webpage's HTML to extract specific data fields is a basic step in scraping, but Crunchbase's HTML is complex, with dynamic elements and multiple layers of containers. Identifying and targeting the right data is a task in itself, and the website’s ever-changing structure can make your job tougher still.
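A common defensive pattern is to try several candidate selectors in order, so a single markup change doesn't break the whole scraper. A minimal sketch using BeautifulSoup (the class names below are illustrative, not Crunchbase's actual markup):

```python
from bs4 import BeautifulSoup

def find_first(soup, selectors):
    """Try CSS selectors in order and return the first match's text."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element:
            return element.get_text(strip=True)
    return None

html = '<div><span class="description-v2">An example description</span></div>'
soup = BeautifulSoup(html, "html.parser")
# Try the old selector first, then the newer fallback
print(find_first(soup, ["span.description", "span.description-v2"]))
```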
Crunchbase protects most of its data behind an authwall, so you'll need login credentials or a premium account. Manually handling login sessions, tokens, or cookies in the scraper adds complexity, especially when maintaining those sessions across multiple requests. Crunchbase also uses bot detection systems and rate-limits requests. You run the risk of getting blocked, and bypassing these protections means implementing techniques such as rotating proxies or handling CAPTCHAs, which is easier said than done.
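To illustrate one piece of that puzzle, here is a minimal retry-with-backoff sketch for rate-limited requests; real deployments typically layer rotating proxies and CAPTCHA handling on top. The function accepts any callable that raises on failure:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0):
    """Call fetch(); on failure, wait exponentially longer (plus jitter)
    before retrying, to play more nicely with rate limits."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Usage (hypothetical):
# fetch_with_backoff(lambda: requests.get(url, headers=headers))
```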
Building your own Crunchbase scraper can give you flexibility and a sense of accomplishment, but weigh that against the challenges involved. It demands deep technical expertise, constant monitoring and ongoing effort to get the data you want, not to mention how time-consuming and error-prone the process is. Consider whether the effort and maintenance are truly worth it for your needs.
Phew! Building a Crunchbase scraper from scratch really is serious work. Not only do you have to invest a lot of time and effort, you also have to keep a close eye on the potential challenges. Thank goodness Proxycurl exists!
Use Proxycurl's endpoints and get all the data you want in JSON format. Since Crunchbase only provides publicly available company data, there is no data you can't get. Any attempt to scrape private information will result in a 404. Rest assured, you will never be charged for requests that return an error code.
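Because non-public profiles come back as a 404 and errored requests aren't billed, the client side can stay simple: check the status code before parsing. A minimal sketch:

```python
def parse_proxycurl_response(status_code, payload):
    """Return the parsed payload on success, or None for error responses
    (e.g. a 404 when the requested profile isn't public)."""
    if status_code == 200:
        return payload
    # Errored requests aren't billed, so skipping them is safe
    return None

print(parse_proxycurl_response(404, {}))  # None: profile not public
```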
Proxycurl gives you a list of standard fields under its Company Profile Endpoint. You can see a full example of any response in the documentation, to the right of and below the request that generates it. Proxycurl can scrape the following fields on request:
Each field you request incurs an extra credit cost, so select only the parameters you need. But when you do need them, Proxycurl delivers them with a single parameter!
Now that we're familiar with Proxycurl, let's walk through a working demo. We'll provide two examples: one in Postman and the other in Python.
Create an account with Proxycurl and you'll be assigned a unique API key. Proxycurl is a paid API, and you need to authenticate each request with a bearer token (your API key). You'll also receive 100 credits if you sign up with a work email, or 10 credits with a personal email. Then you can start experimenting right away! Your dashboard should look like this.
From here, you can scroll down and choose between the Person Profile Endpoint and the Company Profile Endpoint. The Person Profile Endpoint is a useful tool if you want to scrape LinkedIn. Check out "How to Build a LinkedIn Data Scraper" for more details.
For this use case, we'll only be using the Company Profile Endpoint.
Head to the Proxycurl collection in Postman, click the Company Profile Endpoint documentation, find the orange button that says "Run in Postman", and click it. Then click "Fork Collection" and log in however you like. It should look something like this. We have a full tutorial on how to set up the Proxycurl API in Postman.
Setting up the Proxycurl API in Postman
Once inside Postman, go to Authorization, select Bearer Token, add your token (your API key), and scope it to Proxycurl. You can do this from the Variables tab, or from the pop-up that appears when you start typing in the Token field. Name the token whatever you prefer, or simply use the name Bearer Token.
Verify that the Authorization type is set to "Bearer Token" and that you've entered {{Bearer Token}} in the Token field, then click Save in the upper right corner. Remember to click Save! Your page should look like this:
Under "My Workspace" on the left, go to your Proxycurl collection, then to the Company API. You'll find a list of options in the dropdown menu, but here's what you need to know:
The various company-related endpoints
Go to Company Profile Endpoint and from there, you can uncheck some of the fields if you want or modify others. For instance, you might want to change use_cache from if-present to if-recent to get the most up-to-date info, but maybe you don't need the acquisitions information this time.
Choose the relevant fields that you need. Some cost extra credits.
Once you've modified all the fields to your liking, click the blue "Send" button in the upper left-hand corner. Your output should look something like this.
If you come across a 401 status code, it is most likely you forgot to hit Save after setting the Authorization type to {{Bearer Token}} in Step 2. A good way to troubleshoot this is to see if you can fix it by editing the Authorization tab for this specific query to be the {{Bearer Token}} variable. If that fixes it, then the auth inheritance isn't working, which probably means you forgot to save.
Now let’s try and do the same with Python. In the Proxycurl docs under Company Profile Endpoint, you can toggle between shell and Python. We’ll use the company endpoint to pull Crunchbase-related data, and it’s as simple as switching to Python in the API docs.
Toggle between shell and Python
Now, we can paste in our API key where it says YOUR_API_KEY. Once we have everything set up, we can extract the JSON response and print it. Here’s the code for that, and you can make changes to it as needed:
```python
import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
    'categories': 'include',
    'funding_data': 'include',
    'exit_data': 'include',
    'acquisitions': 'include',
    'extra': 'include',
    'use_cache': 'if-present',
    'fallback_to_cache': 'on-error',
}

response = requests.get(api_endpoint, params=params, headers=headers)
print(response.json())
```
Now, what you get is a structured JSON response that includes all the fields that you have specified. Something like this:
```json
{
  "linkedin_internal_id": "162479",
  "description": "We're a diverse collective of thinkers and doers, continually reimagining what's possible to help us all do what we love in new ways. And the same innovation that goes into our products also applies to our practices -- strengthening our commitment to leave the world better than we found it. This is where your work can make a difference in people's lives. Including your own.\n\nApple is an equal opportunity employer that is committed to inclusion and diversity. Visit apple.com/careers to learn more.",
  "website": "http://www.apple.com/careers",
  "industry": "Computers and Electronics Manufacturing",
  "company_size": [10001, null],
  "company_size_on_linkedin": 166869,
  "hq": {
    "country": "US",
    "city": "Cupertino",
    "postal_code": "95014",
    "line_1": "1 Apple Park Way",
    "is_hq": true,
    "state": "California"
  },
  "company_type": "PUBLIC_COMPANY",
  "founded_year": 1976,
  "specialities": [
    "Innovative Product Development",
    "World-Class Operations",
    "Retail",
    "Telephone Support"
  ],
  "locations": [
    {
      "country": "US",
      "city": "Cupertino",
      "postal_code": "95014",
      "line_1": "1 Apple Park Way",
      "is_hq": true,
      "state": "California"
    }
  ]
  ... // remaining data
}
```
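Depending on which parameters you included, some fields may be absent from the response, so it's safer to read it with .get() fallbacks rather than direct indexing. A small sketch over the structure above:

```python
def summarize_company(data):
    """Pull a few common fields from a Proxycurl company response,
    tolerating missing keys."""
    hq = data.get("hq") or {}
    return {
        "industry": data.get("industry", "N/A"),
        "founded": data.get("founded_year", "N/A"),
        "hq_city": hq.get("city", "N/A"),
    }

sample = {
    "industry": "Computers and Electronics Manufacturing",
    "founded_year": 1976,
    "hq": {"city": "Cupertino"},
}
print(summarize_company(sample))
```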
Great! Congratulations on your journey from zero to data!
Yes, scraping Crunchbase is legal, provided you stay within certain boundaries. The legality of scraping depends on factors like the type of data, the website’s terms of service, and data protection laws such as GDPR. The idea is to scrape only publicly available data within these limits. Since Crunchbase houses only public data, scraping it while operating within the Crunchbase Terms of Service is legal.
A DIY Crunchbase scraper can be an exciting project and gives you full control over the data extraction process. But be mindful of the challenges that come with it. Facing a roadblock in each step can make scraping a time-consuming and often fragile process that requires technical expertise and constant maintenance.
Proxycurl provides a simpler and more reliable alternative. Follow the steps above and you can access structured company data through an API without worrying about any roadblocks. Dedicate your time to putting the data to use, and leave the hard work and worry to Proxycurl!
We'd love to hear from you! If you build something cool with our API, let us know at hello@nubela.co! And if you found this guide useful, there's more where it came from - sign up for our newsletter!