The jobs that many companies post on their own websites cannot always be found on mainstream job boards. Finding a role at a remote startup, for example, can be challenging because these companies may not be listed on job sites at all. To find these jobs, you need to:

- Find companies with potential
- Locate their careers pages
- Extract the open positions that interest you
Preparation
We will use the Parsera library to automate the scraping. Parsera offers two usage options:

- Local mode: pages are processed on your machine, using the LLM of your choice;
- API mode: all processing happens on Parsera's servers.
Since we are running the local setup, we need an LLM connection. For simplicity, we will use OpenAI's gpt-4o-mini, which only requires setting a single environment variable.
Once the setup is complete, we can start scraping. First, install Parsera and the Playwright browsers it relies on:
<code>pip install parsera
playwright install</code>
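In local mode, you are not tied to the default model: Parsera accepts a LangChain chat model. As a minimal sketch of that option (the model choice and temperature here are illustrative, not from the original article):

<code>from langchain_openai import ChatOpenAI
from parsera import Parsera

# Any LangChain chat model can be passed to Parsera in local mode.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
scraper = Parsera(model=llm)</code>

We will stick with the default model below, so this step is optional.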
Step 1: Get a list of recently funded Series A startups
<code>import os

from parsera import Parsera

os.environ["OPENAI_API_KEY"] = "<your_openai_api_key_here>"

# With OPENAI_API_KEY set, Parsera defaults to gpt-4o-mini.
scraper = Parsera()</code>
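One practical note that the original code leaves implicit: `scraper.arun` is a coroutine, so the top-level `await` calls below assume an async context such as a Jupyter notebook. In a plain script, you would wrap the calls in an async function, roughly like this:

<code>import asyncio

from parsera import Parsera

async def main() -> None:
    scraper = Parsera()  # assumes OPENAI_API_KEY is set in the environment
    result = await scraper.arun(
        url="https://growthlist.co/series-a-startups/",
        elements={"Website": "Website of the company"},
    )
    print(result)

asyncio.run(main())</code>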
Let's get the websites and countries of these companies:

<code>url = "https://growthlist.co/series-a-startups/"

elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}

all_startups = await scraper.arun(url=url, elements=elements)</code>

With the country information in hand, we can filter for the countries we are interested in. Let's narrow the search down to the United States:

<code>us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]</code>

Step 2: Find the careers pages

Now we have a list of websites of US-based Series A startups. The next step is to find their careers pages, which we will extract directly from each company's homepage:
<code>from urllib.parse import urljoin

# Define our target element
careers_target = {"url": "url of the careers page"}

careers_pages = []
for website in us_websites:
    website = "https://" + website
    result = await scraper.arun(url=website, elements=careers_target)
    if len(result) > 0:
        url = result[0]["url"]
        # Resolve relative links against the homepage
        if url.startswith("/") or url.startswith("./"):
            url = urljoin(website, url)
        careers_pages.append(url)</code>

Step 3: Scrape the open positions

From each careers page we extract the job title, location, a link to the posting, and a flag for whether it is a software engineering role:

<code>jobs_target = {
    "Title": "Name of the position",
    "Location": "Location of the position",
    "Link": "Link to the job posting",
    "SE": "True if this is a software engineering position, otherwise False",
}

jobs = []
for page in careers_pages:
    result = await scraper.arun(url=page, elements=jobs_target)
    if len(result) > 0:
        for row in result:
            row["url"] = page
            row["Link"] = urljoin(row["url"], row["Link"])
        jobs.extend(result)</code>

In the end, we get a table of open positions, as shown below:
Title | Location | Link | SE | url |
---|---|---|---|---|
AI Tech Lead Manager | Bangalore | https://job-boards.greenhouse.io/enterpret/jobs/6286095003 | True | https://boards.greenhouse.io/enterpret/ |
Backend Developer | Tel Aviv | https://www.upwind.io/careers/co/tel-aviv/BA.04A/backend-developer/all#jobs | True | https://www.upwind.io/careers |
... | ... | ... | ... | ... |
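To get from the `jobs` list of dicts to a table like the one above, one option (not part of the original code; it assumes pandas is installed) is to load it into a DataFrame and save it as CSV:

<code>import pandas as pd

# Each scraped job is a plain dict, so the list loads straight into a DataFrame.
df = pd.DataFrame(jobs)
df.to_csv("startup_jobs.csv", index=False)
print(df.head())</code>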
Next, we can repeat the same process to pull more information out of the full jobs list, for example extracting the tech stack of each position or filtering for jobs at remote-friendly startups, which saves the time of reviewing every page by hand. Try it yourself: iterate over the Link field and extract the elements you are interested in, as sketched below.
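As a rough sketch of that idea (the Stack field and its description are illustrative assumptions, not from the original article):

<code>details_target = {
    # Hypothetical element: ask for the technologies mentioned in the posting
    "Stack": "Comma-separated list of technologies mentioned in the job posting",
}

job_details = []
for job in jobs:
    result = await scraper.arun(url=job["Link"], elements=details_target)
    if len(result) > 0:
        # Merge the extra details into the original job record
        job_details.append({**job, **result[0]})</code>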
I hope you found this article helpful and please let me know if you have any questions.