How to scrape Zillow with Python and Scrapy

The following text assumes some knowledge and experience with Scrapy and is meant to provide a realistic example of scraping a moderately difficult site, the kind of work you may be doing as a freelancer in web scraping. This post is not meant for people completely unfamiliar with Scrapy. The reader is encouraged to read some of the earlier content on this blog that introduces Scrapy from the ground up. This time, we are building on knowledge of the basics of Scrapy and thus skipping some details.

Understanding TLS fingerprinting

TLS (Transport Layer Security) is a network protocol that sits between the transport layer (TCP) and application layer protocols (HTTP, IMAP, SMTP and so on). It provides security features such as encryption and authentication to TCP connections, which merely deal with reliably transferring streams of data. For example, a lot of URLs on the modern web start with https:// and you typically see a lock icon by the address bar in your web browser.

How to scrape Youtube view intensity time series

Recently Youtube has introduced a small graph on its user interface that visualises a time series of video viewing intensity. Spikes in this graph indicate what parts of the video tend to be replayed often, thus being most interesting or relevant to watch. This requires quite some data to be accumulated and is only available on sufficiently popular videos. Let us try scraping this data as an exercise in web scraper development.

Understanding Abstract Syntax Trees

The first thing that happens when program source code is parsed by a compiler or interpreter is tokenization - cutting the code into substrings that are later organised into a parse tree. However, this tree structure merely represents the textual structure of the code. The further step is syntactic analysis - the activity of converting the parse tree (also known as CST - concrete syntax tree) into another tree structure that represents the logical (not textual) structure.

Notes from The Bug Hunters Methodology - Application Analysis v1

This post will consist of notes taken from The Bug Hunter’s Methodology: Application Analysis v1 - a talk by Jason Haddix at Nahamcon 2022. These notes are mostly for my own future review, but hopefully other people will find them useful as well. Many people have been teaching how to inject an XSS payload, but not how to systematically find vulnerabilities in the first place. Jason created an AppSec edition of his methodology when it became large enough to be split into recon and AppSec parts.

Building higher-order automation workflows with n8n

Automation systems tend to have a temporal aspect to them, as some action or the entire flow may need to be executed at specific times or at certain intervals. Scrapers, vulnerability scanners and social media bots are examples of things that you may want to run on a schedule. Those using web scraping for lead generation or price intelligence need to relaunch the web scraper often enough to get up-to-date snapshots of data.

Importing Shopify product info programmatically

Sometimes one would scrape eCommerce product data for the purpose of reselling these products. For example, a retail eCommerce company might be sourcing their products from a distributor that does not provide an easy way to integrate into a Shopify store. This problem can be solved through web scraping. Once the data is scraped, it can be imported into the Shopify store. One way to do that is to wrangle the product dataset into file(s) that heed the Shopify product CSV schema and import them via the Shopify store admin dashboard.

Turning Scrapy spider into API with ScrapyRT

The Scrapy framework provides many benefits over regular Python scripts when it comes to developing web scrapers of non-trivial complexity. However, Scrapy by itself does not provide a direct way to integrate your web scraper into a larger system that may need some on-demand scraping (e.g. a price comparison website). ScrapyRT is a Scrapy extension that was developed by Zyte to address this limitation. Just like Scrapy itself, it is trivially installable through pip. ScrapyRT exposes your Scrapy project (not just spiders, but also pipelines, middlewares and extensions) through an HTTP API that you can integrate into your systems.
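For a flavor of the integration: a ScrapyRT instance by default listens on port 9080 and exposes a /crawl.json endpoint that takes spider_name and url parameters. A minimal client sketch (the spider name and start URL below are placeholders):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_crawl_url(spider_name, start_url, host="localhost", port=9080):
    """Build a ScrapyRT /crawl.json request URL for a given spider."""
    query = urlencode({"spider_name": spider_name, "url": start_url})
    return f"http://{host}:{port}/crawl.json?{query}"

def run_spider(spider_name, start_url):
    """Trigger an on-demand crawl and return the scraped items."""
    with urlopen(build_crawl_url(spider_name, start_url)) as resp:
        return json.load(resp)["items"]

# Assumes `scrapyrt` is running in the Scrapy project directory and the
# project contains a spider called "books":
# items = run_spider("books", "http://books.toscrape.com/")
```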

Scraping Instagram API with Instauto

Instagram scraping is of interest to OSINT and growth hacking communities, but can be rather challenging. If we proceed with using browser automation for this purpose, we risk triggering client-side countermeasures. Using the private API is a safer, more performant approach, but has its own challenges. Instagram implements complex API flows that involve HMACs and other cryptographic techniques for extra security. When it comes to implementing Instagram scraping code, one does not simply use mitmproxy or Chrome DevTools to intercept API requests and reproduce them programmatically.

Scapy: low level packet hacking toolkit for Python

To make HTTP requests with Python we can use the requests module. To send and receive things via a TCP connection we can use stream sockets. However, what if we want to go deeper than this? To parse and generate network packets, we can use the struct module. Depending on how deep in the protocol stack we are working, we may need to send/receive the wire format buffer through raw sockets. This can be fun in a way, but if this kind of code is being written for research purposes (e.

Strategies and patterns of gray hat social media automation

Introduction and motivation We live in the postmodern age of hyperreality. The vast majority of information most people receive comes through technological means, without them having any direct access to its source or any way to verify or deny what is written or shown on the screen. A pretty girl sitting in a private jet might have paid a lot of money to fly somewhere warm, or might have paid a smaller amount of money to a company that keeps the plane on the tarmac to do a photo shoot.

You probably don’t need AWS and are better off without it

Amazon Web Services (AWS) is the most prominent large cloud provider in the world, offering what seems to be practically unlimited scalability and flexibility. It is heavily marketed towards startups and big companies alike. There’s even a version of AWS meant for extra-secure governmental use (AWS for Government). There’s an entire branch of the DevOps industry that helps set up software systems on AWS, and a lot of people are making money by using AWS in some capacity.

Compiling Python programs with Pyinstaller

Pyinstaller is a CLI tool, installable through pip, that compiles Python scripts into executable binaries. Let us go through a couple of examples of using this tool. The SMTP enumeration script from the previous post can be trivially compiled by running the following command: $ pyinstaller This creates two directories - build/ for intermediate files and dist/ for the results of compilation. However, we find that dist/ contains multiple binary files, whereas it is generally more convenient to compile everything into a single file that statically links all the dependencies.

Notes on TBHM v4 recon edition

This post will summarize The Bug Hunter’s Methodology v4.01: Recon edition - a talk at h@cktivitycon 2020 by Jason Haddix, a prominent hacker in the bug bounty community. Taking notes is important when learning new things, and therefore notes were taken for future reference of this material. This methodology represents a breadth-first approach to bounty hunting and is meant to provide a reproducible strategy to discover as many assets related to the target as possible (but make sure to heed scope!

Writing web scrapers in Go with Colly framework

Colly is a web scraping framework for the Go programming language. The feature set of Colly largely overlaps with that of the Scrapy framework from the Python ecosystem: built-in concurrency, cookie handling, caching of HTTP response data, automatic heeding of robots.txt rules, and automatic throttling of outgoing traffic. Furthermore, Colly supports distributed scraping out-of-the-box through a Redis-based task queue and can be integrated with Google App Engine. This makes it a viable choice for large-scale web scraping projects.

Creating DC proxies on cloud providers

Although many proxy providers offer data center (DC) proxies fairly cheaply, sometimes it is desirable to make our own. In this post we will discuss how to set up a Squid proxy server on cheap Virtual Private Servers from Vultr. We will be using a Debian 11 Linux environment on virtual “Cloud Compute” servers. Let us go through the steps to install Squid through the Linux shell, with commands that will be put into a provisioning script.

SMTP user enumeration for fun and profit

Sometimes it is desirable to enumerate all (or as many as possible) email recipients at a given domain. This can be done by establishing an SMTP connection to the corresponding mail server (which can be found via the DNS MX record) and getting the server to verify your guesses for email usernames or addresses. There are three major approaches to do this: Using the VRFY request, which is meant to verify if there is a user with a corresponding mailbox on the server.
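The VRFY approach can be sketched with Python’s standard smtplib; the mail server host and candidate usernames below are placeholders, and reply-code interpretation follows the usual SMTP convention (250 = mailbox exists, 252 = cannot verify, anything else = rejected):

```python
import smtplib

def vrfy_result(code):
    """Interpret an SMTP reply code to a VRFY command."""
    if code == 250:
        return "valid"         # server confirmed the mailbox
    if code == 252:
        return "inconclusive"  # "cannot VRFY, but will accept delivery"
    return "invalid"           # e.g. 550 - no such user

def enumerate_users(mx_host, candidates):
    """Try VRFY for each candidate username against one mail server."""
    results = {}
    with smtplib.SMTP(mx_host, timeout=10) as server:
        server.helo()
        for user in candidates:
            code, _message = server.verify(user)
            results[user] = vrfy_result(code)
    return results

# mx.example.org is a placeholder; find the real host via an MX lookup:
# print(enumerate_users("mx.example.org", ["alice", "bob", "admin"]))
```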

Running GUI apps within Docker containers

Typically Docker is used to encapsulate server-side software in reproducible packages - containers. A certain degree of isolation is ensured between containers. Furthermore, containers can be used as building blocks for systems consisting of multiple software servers. For example, a web app can consist of a backend server, database server, frontend server, load balancer, Redis instance for caching and so on. However, what if we want to run desktop GUI apps within Docker containers to use them as components within larger systems?

Reproducible Linux environments with Vagrant and Terraform

When developing and operating scraping/automation solutions, we don’t exactly want to focus on the systems administration part of things. If we are writing code to achieve a particular objective, making that code run on a VPS or local system is merely a supporting side-objective that is required to make the primary thing happen. Thus it is undesirable to spend too much time on it, especially if we can use automation to avoid the repetitive, error-prone activity of installing the required tooling in disposable virtual machines or virtual private servers.

Decrypting your own HTTPS traffic with Wireshark

HTTP messages are typically not sent in plaintext in the post-Snowden world. Instead, the TLS protocol is used to provide communications security against tampering and surveillance of communications based on the HTTP protocol. TLS itself is a fairly complex protocol consisting of several sub-protocols, but let us think of it as an encrypted and authenticated layer on top of a TCP connection that also does some server (and optionally client) verification through public key cryptography.

Scrapy framework architecture

In this post we will take a deeper look into the architecture of not just Scrapy projects, but the Scrapy framework itself. We will go through some key components of Scrapy and look into how data flows through the system. Let us look into the following picture from the Scrapy documentation: We see the following components: the Engine is the central switchboard for all data that is transferred inside Scrapy when it is running.

How to scrape pages behind login with Python

Many websites do not provide proper access to information for unauthenticated users. Instead, the data is provided in some client area that is accessible only after the user goes through a login flow, possibly by signing up for a paid plan beforehand. How can such websites be scraped with Python? There are two additional things we have to do when scraping data behind a login: Set up a requests.Session object with proper cookies by reproducing the login flow programmatically.
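The post builds on requests.Session; the same idea can be sketched with only the standard library, where the cookie jar plays the role of the session. The login URL and form field names below are site-specific placeholders:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def build_login_payload(username, password, extra=None):
    """Form fields for the login POST; field names vary per site."""
    payload = {"username": username, "password": password}
    payload.update(extra or {})  # e.g. CSRF token scraped from the form
    return urllib.parse.urlencode(payload).encode()

def make_logged_in_opener(login_url, username, password):
    """Return an opener whose cookie jar holds the session cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(login_url, data=build_login_payload(username, password))
    return opener  # subsequent opener.open() calls send the cookies

# opener = make_logged_in_opener("https://example.org/login", "alice", "s3cret")
# html = opener.open("https://example.org/client-area").read()
```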

Using Scrapy pipelines to export scraped data

By default, the Scrapy framework provides a way to export scraped data into CSV, JSON, JSONL or XML files, with a possibility to store them remotely. However, we may need more flexibility in how and where the scraped data will be stored. This is the purpose of Scrapy item pipelines. A Scrapy pipeline is a component of a Scrapy project for implementing post-processing and exporting of scraped data. We are going to discuss how to implement data export code in pipelines and provide a couple of examples.
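A minimal sketch of such a pipeline (not the exact code from the post) that appends each item to a JSON Lines file; a pipeline is a plain class, so no Scrapy import is needed here, and the project path in the comment is a placeholder:

```python
import json

class JsonLinesExportPipeline:
    """Append each scraped item as one JSON line. Enabled via
    ITEM_PIPELINES in settings.py, e.g.
    {"myproject.pipelines.JsonLinesExportPipeline": 300}."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to later pipelines

    def close_spider(self, spider):
        self.file.close()
```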

How to use Scrapy Cloud

Scrapy is a prominent Python framework for web scraping that provides a certain kind of project template structure for easier development. However, once a Scrapy project is developed, it may be necessary to deploy it into the cloud for long-running scraping jobs. Scrapy Cloud is a SaaS solution for hosting your Scrapy projects, developed by the same company that created the Scrapy framework. Let us go through an example of using it with an already developed scraper.

Sending notifications programmatically: let me count the ways

You may want to get notified about certain events happening during your scraping/botting operations. Examples might be outages of external systems that your setup depends on, fatal error conditions, scraping jobs being finished, and so on. If you are implementing automations for bug bounty hunting, you certainly want to get notified about new vulnerabilities being found in the target systems being scanned. You may also want to get periodic status updates on long-running tasks.
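As one illustration, a notification can be pushed to a webhook endpoint with just the standard library; the webhook URL and the payload schema here are made up for the example:

```python
import json
import urllib.request

def build_notification(event, details):
    """Serialise a notification payload; the schema is arbitrary here."""
    return json.dumps({"event": event, "details": details}).encode()

def notify(webhook_url, event, details):
    """POST a JSON notification to a webhook endpoint."""
    req = urllib.request.Request(
        webhook_url,
        data=build_notification(event, details),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# notify("https://hooks.example.org/T000/B000", "job_finished",
#        {"spider": "books", "items": 1200})
```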

Evaluating MS Playwright for gray hat automation

MS Playwright is a framework for web testing and automation. Playwright supports programmatic control of Chromium and Firefox browsers and also integrates the WebKit engine. Headless mode is supported and enabled by default, thus making it possible to run your automations in environments that have no GUI support (lightweight virtual private servers, Docker containers and so on). Playwright is written in JavaScript, but has official bindings for Python, C#, TypeScript and Java.

Automating spreadsheet tasks with openpyxl

Spreadsheets are a mainstay tool for information processing in many domains, widely used by people in many walks of life. It is fairly common for developers working in scraping and automation to be dealing with spreadsheets for inputting data into custom code, report generation and other tasks. Openpyxl is a prominent Python module for reading and writing spreadsheet files in the Office Open XML format, which is compatible with many spreadsheet programs (MS Excel 2010 and later, Google Docs, Open Office, Libre Office, Apple Numbers, etc.

Scrapy framework tips and tricks

Use Scrapy shell for interactive experimentation Running scrapy shell gives you an interactive environment for experimenting with the site being scraped. For example, running fetch() with the URL of a page fetches the page and creates a response variable with a scrapy.Response object for that page. view(response) opens the browser to let you see the HTML that the Scrapy spider would fetch. This bypasses some of the client-side rendering and also lets us detect if the site has some countermeasures against scraping.

Email harvesting from Github profiles

You may have some reason to automatically gather (harvest) software developer emails. That might be SaaS marketing objectives, recruitment or community building. One place online that has a lot of developers is Github. Some of the developers have a public email address listed on their Github profile. It turns out this information is available through Github’s REST API and can be extracted with just a bit of scripting. First we need an API token.
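A rough sketch of the idea with the standard library; the /users/{login} endpoint and token-style Authorization header are standard GitHub REST API conventions, while the token and logins in the comments are placeholders:

```python
import json
import urllib.request

API = "https://api.github.com/users/{login}"

def fetch_profile(login, token):
    """GET one user profile; the token raises the API rate limit."""
    req = urllib.request.Request(
        API.format(login=login),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_emails(profiles):
    """Keep only profiles that expose a public email address."""
    return {p["login"]: p["email"] for p in profiles if p.get("email")}

# profiles = [fetch_profile(l, "ghp_...") for l in ["some", "logins"]]
# print(extract_emails(profiles))
```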

CAPTCHA solver services for scraping and automation

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a type of automation countermeasure. It is based on a challenge-response approach that involves asking the user to perform an action that only a human is supposed to be able to perform, such as: Writing down characters or words from a distorted image. Performing basic mathematical operations that are presented in a distorted image. Recognizing specific objects within a set of images, possibly with distortion.

Introduction to Scrapy framework

Scrapy is a Python framework for web scraping. One might ask: why would we need an entire framework for web scraping if we can simply code up some simple scripts in Python using Beautiful Soup, lxml, requests and the like? To be fair, for simple scrapers you don’t. However, when you are developing and running a web scraper of non-trivial complexity, the following problems will arise: Error handling. If you have an unhandled exception in your Python code, it can bring down the entire script, which can cause time and data to be lost.

Tools of the trade for scraping and automation

To get things done, one needs a set of tools appropriate to the task. We will discuss several open source tools that are highly valuable for developers working in scraping and automation. Chrome DevTools Let us start with the basics. Chrome has a developer tool panel that you can open by right-clicking on something in the web page and choosing “Inspect Element”, or by going to View -> Developer -> Developer Tools.

Harvesting emails from Google search results

Automated gathering of email addresses is known as email harvesting. Suppose we want to gather the email addresses of certain kinds of individuals, such as influencers or content creators. This can be accomplished through certain less-known features of Google. For example, the site: operator limits search results to a given domain. Double-quoting something forces Google to provide exact matches on the quoted text. Boolean operators (AND, OR) are also supported. Google caps the number of results to about 300 per query, but we can mitigate this limitation by breaking down the search space into smaller segments across multiple search queries.
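The segmentation idea can be sketched in a few lines of Python; the domain, phrase and niche segments below are arbitrary examples:

```python
def build_queries(domain, phrase, segments):
    """Split one broad search into narrower per-segment queries so that
    each stays under the per-query results cap."""
    base = f'site:{domain} "{phrase}"'
    return [f'{base} "{segment}"' for segment in segments]

queries = build_queries(
    "instagram.com", "business inquiries",
    ["fitness", "travel", "cooking"])
# Each query is then submitted separately, e.g.:
#   site:instagram.com "business inquiries" "fitness"
```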

The very basics of XPath for web scraping: less than 1% of XPath language to scrape more than 90% of pages

When scraping HTML pages, many developers use CSS selectors to find the right elements for data extraction. However, modern HTML is largely based on the XML format, which means there is a better, more powerful way to find the exact elements one needs: XPath. XPath is an entire language created for traversing XML (and by extension HTML) element trees. The size and complexity of the XPath 3.1 specification might seem daunting, but the good news is that you need to know very little of it as a web scraper developer.
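As a taste of that small useful subset: Python’s bundled ElementTree supports only a limited XPath subset (the post covers full XPath engines), but it is enough to show descendant and attribute predicates on a tidy, well-formed snippet standing in for a scraped page:

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a scraped page.
html = """
<html><body>
  <div class="product"><span class="price">9.99</span></div>
  <div class="product"><span class="price">19.99</span></div>
</body></html>
"""

root = ET.fromstring(html)
# .// means "descendants at any depth"; [@class='...'] filters by attribute.
prices = [el.text for el in root.findall(".//span[@class='price']")]
print(prices)  # ['9.99', '19.99']
```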

Using proxies for web scraping and automation

When it comes to scraping and automation operations, it might be important to control where remote systems see the traffic coming from, to evade rate-limiting, captchas and IP bans. This is what we need proxies for. Let us talk about individual proxy servers and proxy pools. A proxy server is a server somewhere in the network that acts as a middleman for network communications. One way this can work is connection-level proxying via the SOCKS protocol.
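For illustration, here is how a single HTTP proxy might be plugged into Python’s standard urllib; the proxy URL is a placeholder:

```python
import urllib.request

def proxy_settings(proxy_url):
    """Use the same upstream proxy for both plain and TLS traffic."""
    return {"http": proxy_url, "https": proxy_url}

def proxied_opener(proxy_url):
    """Build an opener that routes requests through the given proxy."""
    handler = urllib.request.ProxyHandler(proxy_settings(proxy_url))
    return urllib.request.build_opener(handler)

# opener = proxied_opener("http://user:pass@203.0.113.7:3128")
# print(opener.open("https://example.org/").status)
```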

Setting up mitmproxy with Android

In the previous post, instructions were provided on how to set up mitmproxy with an iOS device. In this one, we will be going through setting up mitmproxy with an Android device or emulator. If you would prefer to use an Android emulator for hacking mobile apps, I would recommend the Genymotion software, which lets you create pre-rooted virtual devices. The following steps were reproduced with an Android 10 system running in a Genymotion emulator, with Google Chrome installed through ADB from some sketchy APK file that was found online.

Setting up mitmproxy with iOS 15

mitmproxy is an open source proxy server developed for launching man-in-the-middle attacks against network communications (primarily HTTP(S)). mitmproxy enables passive sniffing, active modification and replaying of HTTP messages. It is meant to be used for troubleshooting, reverse engineering and penetration testing of networked software. We will be setting up mitmproxy with an iOS 15 device for scraping and gray hat automation purposes. One use case of the mitmproxy-iPhone setup is discussed in my previous post about scraping the private API of a mobile app.

Using Python and mitmproxy to scrape private API of mobile app

Web scraping is a widely known way to gather information from external sources. However, it is not the only way. Another way is API scraping. We define API scraping as the activity of automatically extracting data from reverse engineered private APIs. In this post, we will go through an example of reverse engineering the private API of a mobile app and developing a simple API scraping script that reproduces API calls to extract the exact information that the app is showing on the mobile device.

You don’t typically need Selenium for scraping and automation

First, let us presume that we want to develop code to extract structured information from web pages that may or may not be doing a lot of API calls from client-side JavaScript. We are not trying to develop a general crawler like Googlebot. We don’t mind writing some code that would be specific for each site we are scraping (e.g. some Scrapy spiders or Python scripts). When coming across discussions about web scraper development on various forums online, it is common to hear people saying that they need JavaScript rendering to scrape websites.

Easier JSON wrangling with jq, JSONPath and JSONPointer

In this day and age, JSON is the dominant textual format for information exchange between software systems. JSON is based on key-value pairs. A key is always a string, but values can be objects (dictionaries), arrays, numbers, booleans, strings or nulls. A common problem software developers run into is that JSON tree structures can be deeply nested, with parts that may or may not be present. This can lead to tedious, awkward code.
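As a stdlib-only illustration of avoiding that tedium (jq and JSONPath, covered in the post, solve the same problem more expressively), here is a hypothetical deep_get helper:

```python
def deep_get(data, path, default=None):
    """Follow a dotted path through nested dicts/lists, returning
    `default` whenever any step is missing."""
    current = data
    for step in path.split("."):
        if isinstance(current, list):
            try:
                current = current[int(step)]
            except (IndexError, ValueError):
                return default
        elif isinstance(current, dict):
            if step not in current:
                return default
            current = current[step]
        else:
            return default
    return current

doc = {"user": {"posts": [{"title": "hello"}]}}
print(deep_get(doc, "user.posts.0.title"))    # hello
print(deep_get(doc, "user.followers.count"))  # None
```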

Python Web Scraping: tips and tricks

Use pandas read_html() to parse HTML tables (when possible) Some pages you might be scraping might include old-school HTML <table> elements with tabular data. There’s an easy way to extract data from pages like this into Pandas dataframes (although you may need to clean it up afterwards): >>> import pandas as pd >>> pd.read_html('')[0] (the output is a dataframe of the infobox table from the Wikipedia page on Yakutsk).

Grayhat Twitch Chatbots

Twitch is a popular video streaming platform originally developed for gamers, but now expanding beyond the gaming community. It enables viewers to use chat rooms associated with video streams to communicate amongst each other and with the streamer. Chatbots for these chatrooms can be used for legitimate use cases such as moderation, polls, ranking boards, and so on. Twitch allows such usage but requires passing a verification process for chatbot software to be used in production at scale.

Automating Google Dorking

There is more to using Google than searching by keywords, phrases and natural language questions. Google also has advanced features that can empower users to be extra specific with their searches. To search for an exact phrase, wrap it in double quotes, for example: "top secret" To search for documents with a specific file extension, use the filetype: operator: filetype:pdf "top secret" To search for results in a specific domain, use the site: operator.

Using ephemeral Onion Services for quick NAT traversal

Sometimes, when developing server-side software, it is desirable to make it accessible from outside the local network, which might be shielded from incoming TCP connections by a router that performs Network Address Translation. One option to work around this is to use Ngrok - a SaaS app that lets you tunnel out a connection from your network and exposes it to external traffic through a cloud endpoint. However, it is primarily designed for web apps, and it would be nice if we didn’t need to rely on a third-party SaaS vendor to make our server software accessible outside our local network.

Sending mass DMs on Reddit through API: a small experiment

Howitzer is a SaaS tool that scrapes subreddits for users mentioning given keywords and automates mass direct message sending for growth hacking purposes. Generally speaking, there are significant difficulties when automating against major social media platforms. However, Reddit is not as hostile towards automation as other platforms and even provides a relatively unrestricted official API for building bots and integrations. Let us try automating against the Reddit API to build a poor man’s Howitzer with Python.
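The keyword-matching part is plain Python; the DM loop below is a hedged sketch against PRAW-style objects (reddit.subreddit(...).new() and author.message(...)), with all credentials, subreddit names and message text as placeholders:

```python
KEYWORDS = {"web scraping", "data extraction"}

def mentions_keywords(text, keywords=KEYWORDS):
    """Case-insensitive check for any of the target phrases."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

def dm_matching_posters(reddit, subreddit_name, body, limit=25):
    """DM authors of recent posts that mention one of the keywords.
    `reddit` is assumed to be a PRAW-style client object."""
    for post in reddit.subreddit(subreddit_name).new(limit=limit):
        if mentions_keywords(post.title + " " + post.selftext):
            post.author.message("Quick question", body)

# Usage sketch (requires the third-party praw package and app credentials):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...",
#                      username="...", password="...",
#                      user_agent="poor-mans-howitzer/0.1")
# dm_matching_posters(reddit, "startups", "Hey, saw your post about scraping...")
```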

Introducing GPT-3: Playground and API

GPT-3 is a large scale natural language processing system based on deep learning, developed by OpenAI. It is now generally available for developers and curious people. GPT-3 is the kind of AI that works exclusively on text. Text is both input and output for GPT-3. One can provide questions and instructions in plain English and receive fairly coherent responses. However, GPT-3 is not HAL from 2001: A Space Odyssey, Project 2501 from Ghost in the Shell, the T-800 from The Terminator, or AM from I Have No Mouth and I Must Scream. What GPT-3 is, is a tool AI (as opposed to an agent AI).

How does PerimeterX Bot Defender work

PerimeterX is a prominent vendor of anti-bot technology, used by portals such as Zillow, Crunchbase, StockX and many others. Many developers working on web scraping or automation scripts have run into PerimeterX Human Challenge - a proprietary CAPTCHA that involves pressing and holding an HTML element and does not seem to be solvable by any of the CAPTCHA solving services. PerimeterX has registered the following US patents: US 10,708,287B2 - ANALYZING CLIENT APPLICATION BEHAVIOR TO DETECT ANOMALIES AND PREVENT ACCESS; US 10,951,627B2 - SECURING ORDERED RESOURCE ACCESS; US 2021/064685A1 - IDENTIFYING A SCRIPT THAT ORIGINATES SYNCHRONOUS AND ASYNCHRONOUS ACTIONS (pending application). Let us take a look into these patents to discover key working principles of PerimeterX bot mitigation technology.

How to download embedded videos

When wandering across the World Wide Web, many netizens have come across pages containing Youtube or Vimeo videos embedded in them. youtube-dl is a prominent tool to download online videos from many sources (not limited to Youtube - see the complete list of supported sites), but can it download videos even if they are embedded in some third party website? It turns out it can (with a little bit of help from the user).

Trickster Dev

Code-level discussion of web scraping, gray hat automation, growth hacking and bounty hunting