Google holds an immense volume of data that is valuable to businesses and researchers: it handles over 8.5 billion searches every day and commands a 91% share of the global search engine market.
Since the debut of ChatGPT, Google data has been utilized not only for traditional purposes like rank tracking, competitor monitoring, and lead generation but also for developing large language models (LLMs), training AI models, and enhancing the capabilities of Natural Language Processing (NLP) models.
Scraping Google, however, is not easy for everyone. It requires a team of professionals and a robust infrastructure to scrape at scale.
In this article, we will learn to scrape Google Search Results using Python and BeautifulSoup. This will enable you to build your own tools and models that are capable of leveraging Google’s data at scale.
Let’s get started!
Google Search Results are the listings that appear on Google based on the user query entered in the search bar. Google heavily utilizes NLP to understand these queries and present users with relevant results. These results often include featured snippets in addition to organic results, such as the latest AI overviews, People Also Ask sections, Related Searches, and Knowledge Graphs. These elements provide summarized and related information to users based on their queries.
Google Search data has various applications: rank tracking, competitor monitoring, lead generation, SEO optimization, and training AI and NLP models, among others.
Python is a versatile and robust language with mature HTTP libraries that handle sessions, headers, and cookies cleanly, giving it a higher success rate on sites where other languages struggle. As the popularity of AI models trained on web-scraped data grows, Python’s relevance in web-scraping topics continues to rise within the developer community.
Additionally, beginners looking to learn Python as a web scraping skill can understand it easily due to its simple syntax and code clarity. Plus, it has huge community support on platforms like Discord, Reddit, etc., which can help with any level of problem you are facing.
This scalable language excels in web scraping performance and provides powerful frameworks like Scrapy, Requests, and BeautifulSoup, making it a superior choice for scraping Google and other websites compared to other languages.
In this section, we will create a basic Python script to retrieve the first 10 Google search results.
To follow this tutorial, we need to install the following libraries:
Requests — To pull HTML data from the Google Search URL.
BeautifulSoup — To parse the HTML data into a structured format.
The setup is simple. Create a Python file and install the required libraries to get started.
Run the following commands in your project folder:
touch scraper.py
And then install the libraries.
pip install requests
pip install beautifulsoup4
We are done with the setup and have everything we need to move forward. We will use the Requests library to extract the raw HTML and BeautifulSoup to parse it and get the desired information.
But what is “desired information” here?
The filtered data will contain the title, link, displayed (cite) link, and description of each search result.
Let us import our installed libraries first in the scraper.py file.
from bs4 import BeautifulSoup
import requests
Then, we will make a GET request on the target URL to fetch the raw HTML data from Google.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
url = 'https://www.google.com/search?q=python+tutorials&gl=us'
response = requests.get(url, headers=headers)
print(response.status_code)
Passing headers is important to make the scraper look like a natural user who is just visiting the Google search page for some information.
The above code will help you pull the HTML data from the Google Search link. If you get a 200 status code, the request was successful. This completes the first part of creating a scraper for Google.
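As a defensive refinement, the one-off request above can be wrapped in a small helper that builds the URL and fails loudly on non-200 responses. This is a sketch, not part of the original script; the User-Agent string and function names are just examples:

```python
from urllib.parse import urlencode

import requests

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
)

def build_search_url(query, country="us"):
    """Build a Google search URL from the query and geolocation parameter."""
    return "https://www.google.com/search?" + urlencode({"q": query, "gl": country})

def fetch_serp(query, country="us"):
    """Fetch the raw SERP HTML; raise for non-2xx so blocks (e.g. 429) are visible."""
    response = requests.get(
        build_search_url(query, country),
        headers={"User-Agent": USER_AGENT},
        timeout=10,
    )
    response.raise_for_status()
    return response.text
```

Raising on non-2xx responses makes soft blocks obvious immediately, instead of letting the parser run over an error page.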
In the next part, we will use BeautifulSoup to extract the required data from the HTML.
soup = BeautifulSoup(response.text, 'html.parser')
This creates a BeautifulSoup object that parses the HTML response, letting us navigate the HTML tree and find any element of choice along with the content inside it.
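To see how this navigation works before touching live HTML, here is a tiny self-contained example on mock markup. The snippet below is invented for illustration; it only mirrors the class names used later in this tutorial, not Google's live output:

```python
from bs4 import BeautifulSoup

# Mock HTML shaped like a single organic result (illustrative only)
html = """
<div class="g">
  <a href="https://docs.python.org/3/tutorial/"><h3>The Python Tutorial</h3></a>
  <cite>docs.python.org</cite>
  <div class="VwiC3b">The official Python tutorial.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h3").text              # text inside the first <h3>
link = soup.find("a")["href"]             # href attribute of the first anchor
snippet = soup.find("div", class_="VwiC3b").text
print(title, link, snippet)
```

`find` returns the first matching element (or `None`), while `find_all` returns every match, which is what we will use to loop over all results.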
To parse this HTML, we would need to first inspect the Google Search Page to check which common pattern can be found in the DOM location of the search results.
After inspecting, we find that every search result sits inside a div container with the class g. This means we just have to loop over each div container with the g class to get the information inside it.
Before writing the code, we will find the DOM location for the title, description, and link from the HTML.
If you inspect the title, you’ll find that it is contained within an h3 tag. Inspecting further shows that the link is located in the href attribute of the anchor tag.
The displayed link or the cite link can be found inside the cite tag.
And finally, the description is stored inside a div container with the class VwiC3b.
Wrapping all these data entities into a single block of code:
organic_results = []
for result in soup.find_all("div", class_="g"):
    title = result.find("h3")
    anchor = result.find("a")
    cite = result.find("cite")
    description = result.find("div", class_="VwiC3b")
    organic_results.append({
        "title": title.text if title else None,
        "link": anchor["href"] if anchor else None,
        "displayed_link": cite.text if cite else None,
        "description": description.text if description else None,
    })
print(organic_results)
We declared an organic_results list, looped over all the elements with the g class in the HTML, and appended the collected data to the list.
Running this code will give you the desired results which you can use for various purposes including rank tracking, lead generation, and optimizing the SEO of the website.
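For rank tracking in particular, it helps to record each result's position and persist the data. Here is a sketch of a CSV export; the sample row below is hypothetical and stands in for the organic_results list built by the scraper:

```python
import csv

# Hypothetical sample, shaped like the scraper's organic_results entries
organic_results = [
    {"title": "The Python Tutorial",
     "link": "https://docs.python.org/3/tutorial/",
     "displayed_link": "docs.python.org",
     "description": "The official Python tutorial."},
]

with open("serp_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["position", "title", "link", "displayed_link", "description"]
    )
    writer.writeheader()
    # enumerate from 1 so the first organic result is position 1
    for position, row in enumerate(organic_results, start=1):
        writer.writerow({"position": position, **row})
```

Re-running the scraper on a schedule and appending to this file gives you a simple position history per keyword.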
So, that’s how a basic Google Scraping script is created.
However, there is a CATCH. We still can’t completely rely on this method, as it can result in our IP being blocked by Google. If we want to scrape search results at scale, we need a vast network of premium and non-premium proxies and advanced techniques to make this possible. That’s where SERP APIs come into play!
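To give a feel for what the proxy side involves, here is a minimal rotation sketch. The proxy URLs are placeholders (substitute your own pool), and the retry logic is deliberately simple:

```python
import random

import requests

# Placeholder proxy endpoints -- substitute your own pool
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

def pick_proxy(pool):
    """Choose a proxy at random and shape it the way requests expects."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}

def fetch_with_proxy(url, headers, pool=PROXY_POOL, retries=3):
    """Try up to `retries` proxies, rotating when one fails or is blocked."""
    for _ in range(retries):
        try:
            response = requests.get(
                url, headers=headers, proxies=pick_proxy(pool), timeout=10
            )
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            continue  # proxy dead or blocked; rotate to another
    raise RuntimeError("all proxy attempts failed")
```

A production setup adds much more on top of this (health checks, residential proxies, CAPTCHA handling, backoff), which is exactly the burden a SERP API takes off your hands.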
Another method for scraping Google is using a dedicated SERP API. They are much more reliable and don’t let you get blocked in the scraping process.
The setup for this section is the same; we just need to register on ApiForSeo to get an API key, which gives us access to its SERP API.
After activating the account, you will be redirected to the dashboard where you will get your API Key.
You can also copy the code from the dashboard itself.
Then, we will create an API request on a random query to scrape data through ApiForSeo SERP API.
import requests

# The endpoint and parameter names below are illustrative only — copy the
# exact request snippet (URL and parameters) from your ApiForSeo dashboard.
params = {
    'api_key': 'YOUR_API_KEY',
    'q': 'python tutorials',
    'gl': 'us',
}
response = requests.get('https://apiforseo.com/serp', params=params)
print(response.json())
You can try any other query as well. Don’t forget to put your API key into the code; otherwise, you will receive a 404 error.
Running this code in your terminal would immediately give you results.
The returned data contains various points, including titles, links, snippets, descriptions, and featured snippets like extended sitelinks. You will also get advanced featured snippets like People Also Ask, Knowledge Graph, Answer Boxes, etc., from this API.
The nature of business is evolving at a rapid pace. If you don’t have access to data about ongoing trends and your competitors, you risk falling behind emerging businesses that make data-driven strategic decisions at every step. Therefore, it is crucial for a business to understand what is happening in its environment, and Google can be one of the best data sources for this purpose.
In this tutorial, we learned how to scrape Google search results using Python. If you found this blog helpful, please share it on social media and other platforms.
Thank you!