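In an era where data is as valuable as gold, Crunchbase is a gold mine. It holds thousands of company profiles, investment data, leadership positions, funding information, news, and more. Scraping Crunchbase lets you collect the gold nuggets (the insights you need) and filter out the debris (everything else that is irrelevant to you).
In this article, we'll walk you through building a Crunchbase scraper from scratch, including all the technical details and Python code, with a working demo to follow along. That said, you should also know that building a Crunchbase scraper is a time-consuming task with many challenges along the way. That's why we'll also demo an alternative approach using Proxycurl, a paid API-based tool that does the work for you. With both options in hand, you can weigh their merits and choose the one that best suits your needs.
Here's a preview of a basic Crunchbase scraper that uses Python to extract a company's name and headquarters city from the website.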
import requests
from bs4 import BeautifulSoup

url = 'https://www.crunchbase.com/organization/apple'
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

name_section = soup.find('h1', class_='profile-name')
company_name = name_section.get_text(strip=True) if name_section else 'N/A'

headquarters_section = soup.find('span', class_='component--field-formatter field_type_text')
headquarters_city = headquarters_section.get_text(strip=True) if headquarters_section else 'N/A'

print(f"Company Name: {company_name}")
print(f"Headquarters City: {headquarters_city}")
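Now for our alternative approach, Proxycurl. It's a comparatively efficient Crunchbase scraping tool, and you can extract the same company information with just a few lines of code. The added benefit is that with Proxycurl you don't have to worry about HTML parsing or any scraping roadblocks.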
import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
}

response = requests.get(api_endpoint, params=params, headers=headers)
data = response.json()

print(f"Company Name: {data['company_name']}")
print(f"Company Headquarter: {data['hq']['city']}")
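By the end of this article, you'll be familiar with both approaches and able to make an informed decision. So whether you're excited to roll up your sleeves and code your own scraper or you're looking for a one-stop solution, read on to set up your Crunchbase scraper.
Crunchbase holds several data types, including acquisitions, people, events, hubs, and funding rounds. In this article, we'll build a simple Crunchbase scraper that parses a company's description and retrieves it as JSON data. Let's use Apple as an example.
First, we need to define a function to extract the company description. The get_company_description() function searches for the span HTML element that contains the company description. It then extracts the text and returns it: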
def get_company_description(raw_html):
    description_section = raw_html.find("span", {"class": "description"})
    return description_section.get_text(strip=True) if description_section else "Description not found"
The script then sends an HTTP GET request to the URL of the company profile you want to scrape, in this case Apple's profile. The full code is as follows:
import requests
from bs4 import BeautifulSoup


def get_company_description(raw_html):
    # Locate the description section in the HTML
    description_section = raw_html.find("span", {"class": "description"})
    # Return the text if found, else return a default message
    return description_section.get_text(strip=True) if description_section else "Description not found"


# URL of the Crunchbase profile to scrape
url = "https://www.crunchbase.com/organization/apple"
# Set the User-Agent header to simulate a browser request
headers = {"User-Agent": "Mozilla/5.0"}

# Send a GET request to the specified URL
response = requests.get(url, headers=headers)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    # Call the function to get the company description
    company_description = get_company_description(soup)
    # Print the retrieved company description
    print(f"Company Description: {company_description}")
else:
    # Print an error message if the request failed
    print(f"Failed to retrieve data. Status Code: {response.status_code}")
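The script above prints the description as plain text. Since the stated goal is JSON output, one way to get there is to wrap the result in a dictionary and serialize it with Python's standard json module. This is a minimal sketch: the helper names build_record and to_json and the keys source_url and description are illustrative assumptions, not part of the original script.

```python
import json


def build_record(url, description):
    """Wrap a scraped description in a JSON-serializable dictionary.

    The key names are illustrative; rename them to fit your pipeline.
    """
    return {"source_url": url, "description": description}


def to_json(record):
    """Serialize the record as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(record, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder text standing in for the value returned by get_company_description()
    record = build_record(
        "https://www.crunchbase.com/organization/apple",
        "Example description text",
    )
    print(to_json(record))
```

In the full script, you would pass company_description to build_record() in place of the placeholder text.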
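This script extracts Apple's company description from Crunchbase. Depending on your experience and what you're looking for, things can get trickier. Handling large volumes of data, managing pagination, and bypassing authwall mechanisms are just some of the obstacles along the way. Keep in mind that you must:
Note: Check the website's terms of service and robots.txt file to make sure you scrape responsibly and within legal limits.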
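The robots.txt check mentioned in the note can be automated with Python's standard urllib.robotparser module. The sketch below parses a set of rules and asks whether a given URL may be fetched. The sample rules, the user agent name my-scraper, and the example.com URLs are hypothetical and are not Crunchbase's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules for illustration only; NOT Crunchbase's real robots.txt
    sample_rules = [
        "User-agent: *",
        "Disallow: /private/",
    ]
    print(is_allowed(sample_rules, "my-scraper", "https://example.com/private/page"))
    print(is_allowed(sample_rules, "my-scraper", "https://example.com/public/page"))
```

Against a live site, you would instead call set_url() with the address of the site's robots.txt and then read() before calling can_fetch().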
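Building your own Crunchbase scraper is a viable option, but before you go all in, be aware of the challenges that await you.
If the extracted data is inaccurate, your effort is pointless. Manual scraping increases the margin of error, and your code may miss important data if the page hasn't fully loaded or if some content is embedded in iframes or external resources.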
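One inexpensive guard against silently incomplete data is to validate every scraped record before storing it. The sketch below flags fields that are missing, empty, or still set to the fallback strings used by the scripts in this article ("N/A" and "Description not found"). The function name and the record layout are assumptions made for illustration.

```python
def find_missing_fields(record, required_fields):
    """Return the required fields that are absent, empty, or still a placeholder."""
    placeholders = {"", "N/A", "Description not found"}
    missing = []
    for field in required_fields:
        value = record.get(field)
        # Treat None, blank strings, and fallback strings as missing data
        if value is None or str(value).strip() in placeholders:
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Hypothetical scraped record in which the headquarters lookup failed
    scraped = {"company_name": "Apple", "headquarters_city": "N/A"}
    required = ["company_name", "headquarters_city", "description"]
    print(find_missing_fields(scraped, required))
```

A non-empty result is a signal to retry the request or inspect the page, for example because the content was rendered after the initial HTML was delivered.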
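Parsing a web page's HTML to extract specific data fields is a fundamental step in scraping. Crunchbase's HTML is complex, with dynamic elements and multiple layers of containers. Identifying and targeting the right data is a task in itself. Add the site's constantly changing structure, and your job gets even harder.
Crunchbase protects much of its data behind an authwall that requires login credentials or a premium account. Manually handling login sessions, tokens, or cookies in a scraper makes the task more complex, especially when you have to maintain those sessions across multiple requests. Crunchbase also uses bot detection systems and rate-limits requests. You run the risk of being blocked, and getting around these protections means implementing techniques such as rotating proxies or handling CAPTCHAs, which is easier said than done.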
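Of these obstacles, rate limiting is the one a scraper can handle politely on its own: slow down and retry instead of hammering the server. Below is a minimal sketch of retrying with exponential backoff. It assumes the server signals rate limiting with HTTP status 429, which may not match Crunchbase's actual behavior, and the function name and parameters are hypothetical.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429, waiting longer each time.

    fetch is any zero-argument callable that returns (status_code, body).
    Returns the last (status_code, body) pair received.
    """
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Wait base_delay, then twice that, then four times that, and so on
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body


if __name__ == "__main__":
    # Simulated server that rate-limits the first two calls, then succeeds
    responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
    delays = []
    result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
    print(result, delays)
```

To use this with requests, wrap the requests.get() call in a small function that returns (response.status_code, response.text) and pass that function as fetch.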
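Building your own Crunchbase scraper gives you flexibility and a sense of accomplishment, but weigh that against the challenges involved. It takes deep technical expertise, constant monitoring, and sustained effort to get the data you want, not to mention how time-consuming and error-prone the process is. Consider whether the effort and maintenance are truly worth it for your needs.
Phew! Building a Crunchbase scraper from scratch is serious work. Not only do you have to invest a lot of time and effort, you also have to keep a close eye on potential challenges. Thank goodness Proxycurl exists!
Use Proxycurl's endpoints to get all the data you want in JSON format. Since Crunchbase only provides publicly available company data, there's no data you can't get. Any attempt to scrape private information will result in a 404. Rest assured, you will never be charged for requests that return an error code.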
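In client code, it helps to treat the 404 case explicitly instead of assuming every call succeeds. The following sketch shows one way to interpret a response given its status code and decoded JSON payload. The function name and the choice to return None for a 404 are illustrative assumptions, not part of Proxycurl's client library.

```python
def extract_profile(status_code, payload):
    """Return the profile dict on success, or None when the profile is unavailable.

    Raises RuntimeError for any other status so failures are not silently ignored.
    """
    if status_code == 200:
        return payload
    if status_code == 404:
        # Per the article, attempts to fetch private information end in a 404
        return None
    raise RuntimeError(f"Unexpected status code: {status_code}")


if __name__ == "__main__":
    # Hypothetical payloads standing in for response.json()
    print(extract_profile(200, {"industry": "Computers and Electronics Manufacturing"}))
    print(extract_profile(404, None))
```

With requests, you would call it as extract_profile(response.status_code, response.json() if response.ok else None).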
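Proxycurl gives you a list of standard fields under the Company Profile Endpoint. You can see a full example of any response in the docs, on the right-hand side beneath the request that generated it. Proxycurl can scrape the following fields on request:
Each field you request incurs an additional credit cost, so choose only the parameters you need. But when you do need them, each one is just a single parameter away!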
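Because optional fields cost extra credits, it can be useful to build the query parameters from an explicit list of the fields you want. The sketch below does that, using the optional parameter names and the "include" value that appear in the Python request later in this article. The helper name build_params is hypothetical, and the set of optional fields may not be exhaustive.

```python
# Optional parameters as they appear in this article's Python request
OPTIONAL_FIELDS = ("categories", "funding_data", "exit_data", "acquisitions", "extra")


def build_params(company_url, fields=()):
    """Build query parameters that request only the optional fields listed."""
    unknown = set(fields) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown optional fields: {sorted(unknown)}")
    params = {"url": company_url}
    for field in fields:
        # Each requested field is switched on with the value "include"
        params[field] = "include"
    return params


if __name__ == "__main__":
    # Request funding data only and skip every other paid extra
    print(build_params("https://www.linkedin.com/company/apple/", ["funding_data"]))
```

The resulting dictionary can be passed directly as the params argument of requests.get().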
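Now that we're familiar with Proxycurl, let's walk through a working demo. We'll provide two examples, one with Postman and one with Python.
Create an account with Proxycurl and you'll be assigned a unique API key. Proxycurl is a paid API, and you need to authenticate every request with a bearer token (your API key). You'll get 100 credits if you sign up with a work email, or 10 credits if you sign up with a personal email. Then you can start experimenting right away! Your dashboard should look like this.
From here, you can scroll down and choose to work with the Person Profile Endpoint or the Company Profile Endpoint. The Person Profile Endpoint is a useful tool if you want to scrape LinkedIn. See How to Build a LinkedIn Data Scraper for more details.
For this use case, we'll only use the Company Profile Endpoint.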
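Go to the Proxycurl collection in Postman, click the Company Profile Endpoint doc, find the orange button that says "Run in Postman", and click it. Then click "Fork Collection" and log in however you like. It should look something like this. We have a full tutorial on how to set up the Proxycurl API in Postman.
Setting up the Proxycurl API in Postman
Once you're in Postman, go to Authorization, select Bearer Token, add your token (your API key), and limit it to Proxycurl. You can do this from the Variables tab or from the pop-up that appears when you start typing in the Token field. Name the token whatever you like, or just use the name Bearer Token.
Verify that the Authorization type is set to "Bearer Token" and that you've entered {{Bearer Token}} in the Token field, then click Save in the upper right-hand corner. Remember to click Save! Your page should look like this:
On the left, under My Workspace, go to your Proxycurl collection and then to the Company API. You'll find a list of options in the drop-down menu, but here's what you need to know: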
The various company-related endpoints
Go to Company Profile Endpoint and from there, you can uncheck some of the fields if you want or modify others. For instance, you might want to change use_cache from if-present to if-recent to get the most up-to-date info, but maybe you don't need the acquisitions information this time.
Choose the relevant fields that you need. Some cost extra credits.
Once you've modified all the fields to your liking, click the blue "Send" button to the right of the request URL. Your output should look something like this.
If you come across a 401 status code, you most likely forgot to hit Save after setting the Authorization type to {{Bearer Token}} in Step 2. To troubleshoot, edit the Authorization tab for this specific query so that it uses the {{Bearer Token}} variable. If that fixes it, the auth inheritance isn't working, which probably means you forgot to save.
Now let’s try and do the same with Python. In the Proxycurl docs under Company Profile Endpoint, you can toggle between shell and Python. We’ll use the company endpoint to pull Crunchbase-related data, and it’s as simple as switching to Python in the API docs.
Toggle between shell and Python
Now, we can paste in our API key where it says YOUR_API_KEY. Once we have everything set up, we can extract the JSON response and print it. Here’s the code for that, and you can make changes to it as needed:
import requests

api_key = 'YOUR_API_KEY'
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
    'url': 'https://www.linkedin.com/company/apple/',
    'categories': 'include',
    'funding_data': 'include',
    'exit_data': 'include',
    'acquisitions': 'include',
    'extra': 'include',
    'use_cache': 'if-present',
    'fallback_to_cache': 'on-error',
}

response = requests.get(api_endpoint, params=params, headers=headers)
print(response.json())
Now, what you get is a structured JSON response that includes all the fields that you have specified. Something like this:
{
  "linkedin_internal_id": "162479",
  "description": "We're a diverse collective of thinkers and doers, continually reimagining what's possible to help us all do what we love in new ways. And the same innovation that goes into our products also applies to our practices -- strengthening our commitment to leave the world better than we found it. This is where your work can make a difference in people's lives. Including your own.\n\nApple is an equal opportunity employer that is committed to inclusion and diversity. Visit apple.com/careers to learn more.",
  "website": "http://www.apple.com/careers",
  "industry": "Computers and Electronics Manufacturing",
  "company_size": [
    10001,
    null
  ],
  "company_size_on_linkedin": 166869,
  "hq": {
    "country": "US",
    "city": "Cupertino",
    "postal_code": "95014",
    "line_1": "1 Apple Park Way",
    "is_hq": true,
    "state": "California"
  },
  "company_type": "PUBLIC_COMPANY",
  "founded_year": 1976,
  "specialities": [
    "Innovative Product Development",
    "World-Class Operations",
    "Retail",
    "Telephone Support"
  ],
  "locations": [
    {
      "country": "US",
      "city": "Cupertino",
      "postal_code": "95014",
      "line_1": "1 Apple Park Way",
      "is_hq": true,
      "state": "California"
    }
  ]
  ...... //Remaining Data
}
Great! Congratulations on your journey from zero to data!
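From here, a typical next step is to pull just a handful of fields out of the response. The sketch below reads a few keys that appear in the sample response above and tolerates missing ones by using dict.get(). The function name and the output layout are illustrative assumptions, and the sample dictionary is a trimmed copy of that response.

```python
def summarize_company(data):
    """Pull a few fields out of a company profile response, tolerating missing keys."""
    hq = data.get("hq") or {}
    return {
        "industry": data.get("industry"),
        "founded_year": data.get("founded_year"),
        "hq_city": hq.get("city"),
        "hq_country": hq.get("country"),
        "specialities": data.get("specialities") or [],
    }


if __name__ == "__main__":
    # Trimmed copy of the sample response shown in this article
    sample = {
        "industry": "Computers and Electronics Manufacturing",
        "hq": {"country": "US", "city": "Cupertino"},
        "founded_year": 1976,
        "specialities": ["Retail", "Telephone Support"],
    }
    print(summarize_company(sample))
```

In the earlier script, you would call summarize_company(response.json()) instead of printing the whole payload.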
Yes, scraping Crunchbase is legal. The legality of scraping depends on factors such as the type of data, the website's terms of service, and data protection laws like GDPR. The idea is to scrape only publicly available data within these boundaries. Since Crunchbase only houses public data, scraping it is legal as long as you operate within the Crunchbase Terms of Service.
A DIY Crunchbase scraper can be an exciting project and gives you full control over the data extraction process. But be mindful of the challenges that come with it. Facing a roadblock in each step can make scraping a time-consuming and often fragile process that requires technical expertise and constant maintenance.
Proxycurl provides a simpler and more reliable alternative. Follow the steps above and you can access structured company data through an API without worrying about roadblocks. Spend your time using the data and leave the hard work and worry to Proxycurl!
We'd love to hear from you! If you build something cool with our API, let us know at hello@nubela.co! And if you found this guide useful, there's more where it came from - sign up for our newsletter!