Harvesting emails from Google search results

Automated gathering of email addresses is known as email harvesting. Suppose we want to gather the email addresses of certain kinds of individuals, such as influencers or content creators. This can be accomplished through some lesser-known features of Google. For example, the site: operator limits search results to a given domain. Double-quoting a phrase forces Google to return exact matches on the quoted text. Boolean operators (AND, OR) are also supported.

Google caps the number of results at about 300 per query, but we can mitigate this limitation by breaking the search space down into smaller segments across multiple search queries. In this example, we are going to search for people not only by keyword but also by location - each query will combine several quoted strings with Boolean operators so that the search engine matches on all of them.

We will be implementing email harvesting from social media profiles that can be found via Google. We won’t be scraping the social media portals themselves; instead, we will be extracting email addresses from Google SERPs. This lets us avoid the difficulties of social media scraping.

One more thing to address is Google’s rate limiting of excessive searching. If we launch too many searches in a short timeframe, Google will start asking for CAPTCHA solutions, which we want to avoid. For this purpose, we will be sending the requests through a proxy pool.
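
Conceptually, rotating requests across a proxy pool with the requests library could look like the minimal sketch below (the proxy URLs are placeholders to be replaced with ones from your provider); the script later in this post takes an even simpler route and relies on the HTTPS_PROXY environment variable instead.

import itertools

import requests

# Placeholder proxy URLs - substitute real ones from your proxy provider.
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
])

def get_via_proxy(url, **kwargs):
    # Pick the next proxy in round-robin fashion and route the request through it.
    proxy_url = next(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy_url, "https": proxy_url}, **kwargs)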

Let us prepare the search query components first. In sites.txt, we put a list of domain names we want to limit the results to:

instagram.com
tiktok.com

In q1.txt, we put a list of email domain names we want to find addresses at, prefixed by the @ character:

@gmail.com
@yahoo.com

In q2.txt we put a list of locations we want to find people in:

San Francisco
New York

In q3.txt we put a list of occupational keywords:

influencer
model
content creator

Our code will loop over all combinations of the above components, scrape the Google search result pages, and apply a regular expression to extract email addresses.
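
For instance, with the example files above, one of the queries the script will generate is:

site:instagram.com "@gmail.com" AND "San Francisco" AND "influencer"

This limits matches to Instagram pages that mention a Gmail address together with the chosen location and occupation keywords.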

Depending on your exact needs, you may want to use a simpler or more complex regular expression to match email addresses in the result snippet text. For some examples, see the Stack Overflow thread referenced in a comment in the code below.
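
For instance, a minimal sketch of a simpler pattern (assuming plain user@domain.tld addresses are enough for your use case):

import re

SIMPLE_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Prints: ['jane.doe@gmail.com']
print(SIMPLE_EMAIL_RE.findall("DM or email jane.doe@gmail.com for collabs"))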

To get an HTML page with search results, we are going to send HTTP GET requests to https://www.google.com/search with the following parameters:

  • q - the search query.
  • start - index of the first result on the page (used for pagination).
  • num - number of search results per page (we will be using the maximum value of 100).

We also set the User-Agent header to match that of a regular browser, as Google would block our requests otherwise.
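
Before diving into the full script, a minimal sketch of a single such request looks as follows (the User-Agent value is just one example of a desktop browser string):

import requests

params = {
    "q": 'site:instagram.com "@gmail.com" AND "San Francisco" AND "influencer"',
    "start": 0,
    "num": 100,
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}

resp = requests.get("https://www.google.com/search", params=params, headers=headers)
print(resp.status_code)  # 200 unless we are being rate limited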

The complete script for harvesting emails is as follows:

#!/usr/bin/python3

import csv
from pprint import pprint
import re
import time

import requests
from lxml import html

FIELDNAMES = ["result_url", "title", "text", "email"]

def scrape_result_page(query, start, num):
    url = "https://www.google.com/search"

    params = {
        "q": query,
        "start": start,
        "num": num,
    }

    while True:
        resp = requests.get(url, params=params, verify=False)
        print(resp.url)

        if resp.status_code == 200:
            break

        time.sleep(1)
        continue

    tree = html.fromstring(resp.text)

    rows = []

    # Each organic search result is wrapped in a div with this obfuscated class
    # name (correct at the time of writing; Google changes these periodically).
    result_divs = tree.xpath('//div[@class="jtfYYd"]')

    for result_div in result_divs:
        result_url = result_div.xpath(".//a/@href")
        if len(result_url) >= 1:
            result_url = result_url[0]
        else:
            result_url = ""

        title = result_div.xpath(".//h3/text()")
        if len(title) == 1:
            title = title[0]
        else:
            title = ""

        # The result snippet lives in a div that is clamped to two lines of text.
        snippet_div = result_div.xpath('.//div[@style="-webkit-line-clamp:2"]')
        if len(snippet_div) == 1:
            snippet = snippet_div[0].text_content()
        else:
            snippet = ""

        row = {
            "result_url": result_url,
            "title": title,
            "text": snippet,
        }

        # https://stackoverflow.com/questions/17681670/extract-email-sub-strings-from-large-document
        m = re.search(
            r"(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])",
            snippet + " " + title,
        )

        if m is not None:
            row["email"] = m.groups()[0]
        else:
            # Skip results that do not contain an email address.
            continue

        rows.append(row)

    # The second return value signals whether pagination is done, i.e. Google
    # returned fewer results than requested.
    return rows, len(result_divs) < num


def scrape_results(query):
    # Paginate through all result pages for a given query, yielding one row
    # per result that contains an email address.
    start = 0
    num = 100

    while True:
        rows, done = scrape_result_page(query, start, num)

        for row in rows:
            yield row

        if done:
            break

        start += num


def main():
    in_f = open("sites.txt", "r")
    sites = in_f.read().strip().split("\n")
    in_f.close()

    in_f = open("q1.txt", "r")
    queries1 = in_f.read().strip().split("\n")
    in_f.close()

    in_f = open("q2.txt", "r")
    queries2 = in_f.read().strip().split("\n")
    in_f.close()

    in_f = open("q3.txt", "r")
    queries3 = in_f.read().strip().split("\n")
    in_f.close()

    out_f = open("emails.csv", "w", encoding="utf-8")

    csv_writer = csv.DictWriter(out_f, fieldnames=FIELDNAMES, lineterminator="\n")
    csv_writer.writeheader()

    # Iterate over all combinations of site, email domain, location and
    # occupation keyword, building one Google query per combination.
    for site in sites:
        for q1 in queries1:
            for q2 in queries2:
                for q3 in queries3:
                    query = 'site:{} "{}" AND "{}" AND "{}"'.format(site, q1, q2, q3)

                    for row in scrape_results(query):
                        pprint(row)
                        csv_writer.writerow(row)

    out_f.close()


if __name__ == "__main__":
    main()

This code expects the HTTPS_PROXY environment variable to be set to a proxy URL. For this purpose, we can create a SERP proxy zone on Bright Data.
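
Since requests honors this variable automatically, it can either be exported in the shell before launching the script or set from Python before any requests are made. A minimal sketch with a placeholder proxy URL (substitute the credentials and host of your proxy zone):

import os

# Placeholder credentials and host - use the values from your SERP proxy zone.
os.environ["HTTPS_PROXY"] = "http://username:password@proxy.example.com:22225"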

Note that Google is constantly updating its frontend, which may cause this code to break. For example, you may need to update the XPath queries for the parts of the page being extracted. To sidestep this issue, you may want to check out SaaS tools that provide an API for scraping Google results.

By rl1987, 2022-02-04