Not every website wants its data to be scraped, and not every app wants to allow automation of user activity. If you work in scraping and automation in any capacity, you have certainly dealt with sites that work just fine when accessed through a normal browser but throw captchas or error pages at your bot. Multiple security mechanisms can cause this to happen. Today we will do a broad review of automation countermeasures that can be implemented at various levels.
The following techniques work at the Internet Protocol level by allowing or blocking traffic based on the source IP address it comes from.
Geofencing (or geo-restricting/geo-blocking) is blocking or allowing requests based on the source's geographic location (typically a country). It relies on GeoIP data from vendors such as MaxMind or IP2Location to perform lookups.
As a web scraper developer, you can trivially bypass this by using proxies.
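To make this concrete, here is a small sketch of routing traffic through a geo-targeted proxy with the standard library. The `user-country-us` username convention is an assumption for illustration - each residential proxy provider has its own syntax for geo-targeting, so check your provider's docs.

```python
import urllib.request

def geo_proxy_opener(host, port, user, password, country):
    # Many residential proxy providers encode geo-targeting in the
    # username, e.g. "user-country-us". This exact syntax is a
    # hypothetical example; it varies between providers.
    proxy_url = f"http://{user}-country-{country}:{password}@{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# opener = geo_proxy_opener("proxy.example.com", 8080, "user", "pass", "us")
# html = opener.open("https://example.com/").read()
```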
Websites may impose an upper bound on how many requests are allowed per timeframe from a single IP and start blocking incoming traffic once that threshold is exceeded. This is typically implemented with a leaky bucket algorithm. Suppose the site allows 60 requests per minute from a single IP (i.e. one per second). It keeps a per-IP counter of received requests, decreasing it by one every second. If the counter exceeds the threshold value (60) the site refuses further requests until the counter drops below the threshold again. This limits how aggressive we can be when scraping, but it can typically be defeated by spreading traffic across a set of IP addresses, i.e. routing it through a proxy pool.
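The server-side logic described above can be sketched in a few lines. This is a minimal illustration of the leaky bucket idea, not any particular vendor's implementation; the capacity and drain rate are the hypothetical 60-per-minute example from the text.

```python
import time

class LeakyBucket:
    """Server-side sketch: refuse requests from an IP once its bucket
    overflows; the bucket drains at a fixed rate (e.g. 1 req/second)."""

    def __init__(self, capacity=60, leak_rate=1.0):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.levels = {}            # ip -> (level, last seen timestamp)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        level, last = self.levels.get(ip, (0.0, now))
        # Drain the bucket for the time elapsed since the last request.
        level = max(0.0, level - (now - last) * self.leak_rate)
        if level >= self.capacity:
            self.levels[ip] = (level, now)
            return False  # bucket full: refuse the request
        self.levels[ip] = (level + 1, now)
        return True
```

From the scraper's side, routing through a proxy pool simply means each proxy IP gets its own bucket, multiplying the allowed request rate.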
Filtering by AS/ISP
This can be seen as a variation of geofencing: blocking can also be performed based on the source ISP or Autonomous System by performing an AS lookup. For example, sites or antibot vendors may explicitly disallow traffic that comes from data centers of major cloud vendors (AWS, DigitalOcean, Azure and so on).
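A server-side filter of this kind might look like the following sketch. The ASN numbers are real (AWS is AS16509, DigitalOcean AS14061, Microsoft AS8075), but the lookup function is injected as a plain callable here; in practice it would be backed by a GeoIP database such as MaxMind's GeoLite2-ASN.

```python
# ASNs commonly associated with datacenter/cloud traffic.
DATACENTER_ASNS = {
    16509,  # Amazon (AWS)
    14061,  # DigitalOcean
    8075,   # Microsoft (Azure)
}

def is_datacenter(ip, asn_lookup):
    """Return True if the source IP belongs to a known datacenter ASN.
    `asn_lookup` maps an IP string to an ASN number (or None)."""
    return asn_lookup(ip) in DATACENTER_ASNS
```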
Filtering by IP reputation
Proxies can be used to bypass the above ways of blocking traffic, but blocking can also be performed based on IP address reputation data from threat intelligence vendors like IPQualityScore. So if your proxy provider is a sketchy one built on the foundation of a botnet, its proxies may be listed in some blacklist, which ruins the trust score. One must take care to use a reputable proxy provider.
The following techniques work at HTTP and TLS protocol levels.
HTTP headers and cookies
The very simplest way to know that traffic is automated is to check the User-Agent header in the HTTP request. For example, curl puts something like curl/8.5.0 there, and some headless browsers actually let the server know they are running in headless mode (unless configured otherwise). But that's trivial to defeat by just reproducing the header value programmatically.
One baby step further is to check not only the User-Agent header, but also the other request headers, to make sure they look like the ones a real browser sends in aggregate. This is also easy to defeat by reproducing all the headers in a request. In fact, the Chrome DevTools Network tab has a handy feature of letting you copy any request as a ready-made curl command, headers included.
Cookies also matter when it comes to dealing with sites that are hostile to automation, as cookieless requests might get rejected. Depending on the exact specifics of the site, getting the cookies might be as easy as just loading the front page with requests.Session, or it might involve more elaborate trickery if an antibot vendor or captcha service issues them based on browser assessment and user activity (more on that later).
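For the easy case, the pattern looks like this. It is shown with the standard library's cookie jar to stay self-contained; requests.Session does exactly the same bookkeeping implicitly.

```python
import http.cookiejar
import urllib.request

# Cookies set by any response are stored in the jar and automatically
# replayed on subsequent requests through the same opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("https://example.com/")       # front page sets cookies
# opener.open("https://example.com/data")   # cookies are sent back here
```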
We covered the simple stuff; now let's get to something more complicated - HTTP/2 fingerprinting. HTTP/2 is a quite complex binary protocol that leaves a fair bit of wiggle room to the implementor. The way it is implemented in curl differs from the implementation in Google Chrome (although both are RFC-compliant). Traffic generated by each implementation can reliably betray to the server, CDN or WAF what kind of implementation it is. I have covered this in greater detail in an earlier post.
HTTP/2 is pretty much never used in plaintext form - after all, you cannot let the dirty people from the NSA just watch unencrypted traffic without any problems. However, the TLS protocol is also very complex. Client traffic carries a great deal of technical detail (protocol version, list of extensions, proposed cipher suites, etc.) that betrays what kind of client software it comes from. In fact, a single ClientHello message can be used to quite reliably predict the OS and browser that generated it. There is also an earlier post on this blog covering this topic in greater detail.
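One widely used way to turn a ClientHello into a comparable identifier is the JA3 fingerprint: the decimal values of the TLS version, cipher suites, extensions, elliptic curves and point formats are dash-joined, the five groups comma-joined, and the result MD5-hashed. The field values below are arbitrary sample numbers, not a real browser's hello.

```python
import hashlib

def ja3(version, ciphers, extensions, curves, point_formats):
    """Compute a JA3-style fingerprint from ClientHello fields:
    each field group is dash-joined, the groups are comma-joined,
    and the resulting string is MD5-hashed."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```

Because two TLS stacks virtually never propose the exact same values in the exact same order, the hash pins down the client software quite tightly - which is why merely spoofing HTTP headers is not enough against fingerprinting WAFs.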
The following techniques deal with probing the client-side environment (typically a web browser) to detect and/or prevent automation.
JS environment checking
- navigator.webdriver being true generally betrays browser automation.
- Presence of the document.$cdc_asdjflasutopfhvcZLmcfl_ object betrays Selenium.
- On Chromium-based browsers, the window.chrome object not being available means the browser is running in headless mode.
Browser and device fingerprinting
Other than the aforementioned tell-tale signs of automation, client-side JS code can gather a great deal of information on the system it is running on: installed fonts, screen resolution, supported browser features, hardware components and so on. Each of them is a trait of the hardware-software setup the code is running on. Furthermore, the exact rendering of graphics (through the HTML canvas or WebGL API) depends on that setup, i.e. Chrome on macOS/Apple Silicon will not render exactly the same image as Firefox on Linux/x86 - some pixels will be slightly, but consistently, off, which is another piece of information that can be kept track of. A collection of these traits is referred to as a fingerprint.
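Conceptually, the collected traits are canonicalised and hashed into a single identifier. This is a deliberately tiny sketch - real fingerprinting libraries combine dozens of signals, including hashes of canvas/WebGL renderings - but it shows how a stable fingerprint falls out of the traits.

```python
import hashlib
import json

def fingerprint(traits):
    """Collapse a dict of collected traits into a stable fingerprint.
    Canonical JSON (sorted keys) makes the hash order-independent."""
    canonical = json.dumps(traits, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Any change in a trait (a different screen resolution, one extra font) yields a completely different hash, while the same setup keeps producing the same one - which is exactly what makes it useful for tracking.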
Fingerprinting is part of the anti-automation toolkit. For example, social media apps are able to detect when a login session is launched from a new device vs. the same device being used for a new user session. Furthermore, it can be quite telling when a cluster of accounts shares the same fingerprint (in the case of e.g. a device farm or a fully virtualised botting operation).
One example of such data obfuscation is Cloudflare's email address protection, which encodes email addresses in the page markup; there's a Stack Overflow answer providing a small Python snippet to decode them back. However, if you are doing requests-based scraping and relying on regular expressions to extract email addresses, this sort of thing can already cause at least some difficulty.
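A decoder along the lines of that Stack Overflow snippet is short: Cloudflare's data-cfemail attribute is a hex string whose first byte is an XOR key applied to every following byte.

```python
def decode_cfemail(encoded):
    """Decode a Cloudflare data-cfemail value: the first hex byte is an
    XOR key, each subsequent hex byte is an obfuscated character."""
    key = int(encoded[:2], 16)
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )
```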
Furthermore, JS challenges can be more elaborate than this. Depending on how Cloudflare's antibot features are configured, the browser may be required to run some JS code with a cryptographic proof-of-work computation, submit the result to Cloudflare, and only then get the cookie that allows it to properly access anything on the site.
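The proof-of-work idea can be illustrated with a toy hashcash-style puzzle - find a nonce that makes a hash start with some number of zeros. This is not Cloudflare's actual challenge format, just the underlying economic principle: every client must burn a little CPU before getting a cookie, which is cheap for one visitor but expensive at bot scale.

```python
import hashlib
import itertools

def solve_pow(challenge, difficulty=2):
    """Toy proof of work: find a nonce such that
    sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
```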
The following techniques involve behavioural tracking of user activity, checking that the user is actually human, and leveraging user identities (accounts) to fight automation.
A captcha is often (but not always) an inconvenient silly puzzle that you have to solve to prove you're not a robot by picking the segments containing a certain object or something like that. That's a visible captcha. Some captchas (e.g. reCAPTCHA v3) are invisible - they work in the background, tracking user activity (mouse movement, etc.) to assess whether it looks human or robotic. The puzzle UI is displayed only when there is a suspicion that the user is a bot.
The underlying idea for this is that certain tasks are difficult to do programmatically, but easy to do for a person.
A simple way to hinder automation is to rate-limit user activity per account. This is commonly done by social media platforms. If user activity exceeds the rate limits the site/app can simply reject the requests or even ban the account.
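Per-account limiting is often done with a sliding window over recent actions. The limits below (30 actions per minute) are made-up numbers for illustration.

```python
from collections import deque

class AccountRateLimiter:
    """Sliding-window sketch: allow at most `limit` actions per `window`
    seconds per account; anything above gets rejected (or flagged)."""

    def __init__(self, limit=30, window=60.0):
        self.limit, self.window = limit, window
        self.log = {}  # account_id -> deque of action timestamps

    def allow(self, account_id, now):
        q = self.log.setdefault(account_id, deque())
        while q and now - q[0] >= self.window:
            q.popleft()              # drop actions outside the window
        if len(q) >= self.limit:
            return False             # over the per-account limit
        q.append(now)
        return True
```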
No surprise here - the ban hammer is part of the anti-automation arsenal as well. Misbehaving accounts can be banned. This is of particular importance for social media automation - an account ban can mean a setback or even outright failure of social media growth hacking efforts.
Making account creation difficult
Banning and throttling user accounts would not help much if it were possible to create a lot of them quickly. Thus, making account creation difficult to automate (by requiring phone verification, integrating third-party antibot solutions, making request flows hard to reproduce) is one more thing that sites/apps can do to fight automation.
Realistically, it is near impossible to completely prevent botting if there is sufficient incentive for people to engage in automation. However, the economic feasibility of automation can be weakened by putting up barriers that make it more expensive, such as phone or ID verification for new accounts and rate-limiting/restricting/banning existing accounts for sketchy, anomalous activity.
Mouse activity monitoring
Client-side JS code can monitor mouse movements and submit them to some neural network running on the backend to assess whether they look human. This is performed not only by some captcha solutions, but by antibot vendors as well. For example, Akamai Bot Manager phones home with a sample of the user's mouse movements as part of its clearance cookie issuance flow.
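To get an intuition for what such a model might look at, here is a toy heuristic, not anything Akamai actually uses: the ratio of straight-line distance to travelled path length. Human mouse paths wander (ratio noticeably below 1.0), while naive bots move the cursor in a perfect line or teleport it, scoring exactly 1.0.

```python
import math

def path_straightness(points):
    """Ratio of straight-line distance to total path length for a list
    of (x, y) samples. 1.0 means a perfectly straight (bot-like) path;
    human paths typically score lower."""
    if len(points) < 2:
        return 1.0
    path = sum(math.dist(points[i], points[i + 1])
               for i in range(len(points) - 1))
    direct = math.dist(points[0], points[-1])
    return direct / path if path else 1.0
```

Real vendors feed raw movement samples (timing included) into trained models rather than a single hand-written feature, but the principle is the same: robotic input is statistically distinguishable from human input.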