Posts

SMTP, IMAP, POP3 and MIME: understanding email protocols

Introduction and big picture Email is a fairly old-school technology meant to replicate the postal service digitally. To understand how email works, we must know about the network protocols that specify data formats and exchange rules for passing messages over the network. On the modern internet, the sending part of email is conceptually (and sometimes technically) separate from the receiving part. The sending side is formalised in the RFCs that describe the SMTP protocol, whereas reception is done via either the IMAP or POP3 protocol.

Wget tips and tricks

So, what is wget? Wget is a prominent command line tool to download stuff from the web. It has a fairly extensive feature set that includes recursive downloading, proxy support, site mirroring, cookie handling and so on. Let us go through some use cases of wget. To download a single file/page using wget, just pass the corresponding URL into argv[1]: $ wget http://www.textfiles.com/100/crossbow --2022-11-11 18:24:33-- http://www.textfiles.com/100/crossbow Resolving www.textfiles.com (www.textfiles.com)... 208.86.224.90 Connecting to www.

Axiom: just-in-time dynamic infra for offensive security operations

When doing the recon part of penetration testing or bug bounty hunting engagements, one may want to run various tools (such as port scanners, crawlers, vulnerability scanners, headless browsers and so on) in a VPS environment that will no longer be needed when the task at hand is complete. For large-scale scanning it is highly desirable to spread the workload across many servers. To address these needs, axiom was developed. Axiom is a dynamic infrastructure framework designed for quick setup and teardown of reproducible infrastructure.

Scraping legacy ASP.Net sites with Scrapy: a real world example

Legacy ASP.Net sites are some of the most difficult ones for web scraper developers to deal with, due to an archaic state management mechanism that entails hidden form submissions via POST requests. This is not easy to reproduce programmatically. In this post we will discuss a real world example of scraping a site based on some old version of ASP.Net technology. The site in question is publicaccess.claytoncounty.gov - a real estate record website for Clayton County, Georgia.

Making heuristics smarter with GPT-3

Sometimes the data being scraped is a little fuzzy - easy for the human mind to understand, but not strict and formal enough to be easily tractable by algorithms. This can get in the way of analysing the data or basing further automations on it. This is where heuristics come in. A heuristic is an inexact, but quick way to solve a problem. We will discuss how the GPT-3 API can be used to implement heuristic techniques to deal with somewhat fuzzy data that might otherwise require a human to look into it.

JavaScript AST manipulation with Babel: defeating string array mapping

One JavaScript obfuscation technique that can be found in the wild is string array mapping. It entails gathering all the string constants from the code into an array and modifying the code so that string literals are accessed by referencing the array at various indices instead of being used directly. Like with previously discussed obfuscation techniques, we will be exploring it in isolation, but on real websites it will most likely be used in combination with other techniques.

Active DNS recon techniques

We discussed some passive DNS recon techniques, but that’s only half of the story. There is also active DNS reconnaissance, which involves generating actual DNS requests to gather data. Active DNS recon does not rely on information footprints and secondary sources, and thus enables us to get more up-to-date information on target systems. So how do we send the DNS queries? We want to control exactly what request we send and to bypass DNS caching on the local system, so using getaddrinfo() is out of the question.
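As a minimal sketch of what controlling the exact request means, the DNS wire format can be built by hand with nothing but the standard library; the packet could then be sent over a UDP socket to a resolver of our choosing. This is an illustrative example, not code from the post:

```python
import secrets
import struct

def build_dns_query(domain, qtype=1):
    """Build a raw DNS query packet (wire format); qtype=1 is an A record."""
    txid = secrets.randbits(16)
    # Header: id, flags (recursion desired), QDCOUNT=1, other counts zero
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question section: length-prefixed labels, root terminator, QTYPE, QCLASS (IN=1)
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.split(".")
    ) + b"\x00" + struct.pack(">HH", qtype, 1)
    return header + question

packet = build_dns_query("example.com")
# Sending this via a UDP socket to e.g. port 53 of a chosen resolver
# bypasses the local resolver cache entirely.
```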

Passive DNS recon techniques

DNS is a network protocol and a distributed system that maps human-readable domain names to IP addresses. It can be thought of as a big phone book for internet hosts. Passive DNS recon is the activity of using various data sources to map out the footprint of a site or organisation without ever directly accessing the target or launching DNS requests. That’s basically doing OSINT on DNS names and records that are linked to the target.

HTTPX: modern Python module for doing HTTP(S) requests

The Python requests module is a well established open source library for doing HTTP requests that is widely used in web scraping and other fields. However, it has some limitations. At the time of writing, the requests module only supports HTTP/1.1, yet a significant fraction of sites support the more modern, faster HTTP/2 protocol, and requests has no support for asynchronous communication. HTTPX is a newer, more modern Python module that addresses some of these limitations (not to be confused with the other httpx, a CLI tool for probing HTTP servers).

Recon-ng: modular framework for OSINT automation

OSINT is the collection and analysis of information from public sources. Nowadays it can largely be automated through web scraping and API integrations. Recon-ng is an open source OSINT framework that features a modular architecture. Each module can be thought of as a pluggable piece of code that can be loaded on an as-needed basis. Most modules are API integrations and scrapers of data sources. Some deal with report generation and other auxiliary tasks. Since recon-ng was developed by and for infosec people, the user interface resembles that of Metasploit Framework, thus easing the learning curve for people working on the offensive side of cybersecurity.

cURL beyond the basics

Curl is a prominent CLI tool and library for performing client-side requests in a number of application layer protocols. Commonly it is used for HTTP requests. However, there is more to curl than that. We will go through some lesser-known features of curl that can be used in web scraping and automation systems. Debugging Sometimes things fail and require us to take a deeper look into the technical details of what is happening.

Running Stable Diffusion on Vultr

Stable Diffusion is a recently released open source text-to-image AI system that challenges DALL-E by OpenAI. Nowadays OpenAI products are open in name only: aside from client libraries and some other inconsequential things, all the new products by OpenAI (GPT-3, DALL-E 2) are not only proprietary, but also offered in SaaS form only. In contrast with the locked-down proprietary approach of DALL-E, Stable Diffusion is fully open source and can be self-hosted on a server the end user controls.

Experimenting with DALL-E

DALL-E 2 is a cutting edge AI system for creating and manipulating images based on natural language instructions. Although it is not fully available to the general public yet, it is possible to apply for access by filling out a form on the OpenAI site. This will place you on a waitlist. I was able to get access after some 20 days. It’s unclear on what basis they choose or prioritise people for access, but my guess is that having spent a few dollars on GPT-3 helped.

JavaScript AST manipulation with Babel: removing unreachable code

Unreachable parts can be injected into JavaScript source as a form of obfuscation. We will go through a couple of simple examples to show how the unreachable code paths can be removed at Abstract Syntax Tree level for reverse engineering purposes. For the first example, let us consider the following code: if (1 == 2) console.log("1 == 2"); if (1 == 1) { console.log("1 == 1"); } else if (1 == 2) { console.

Building Python script CLIs with argparse and Click

To make a Python script parametrisable with input data, we need to develop a Command Line Interface. This enables the end user to run the script with various inputs and provides a way to choose optional behaviours the script supports. For example, running the script with --silent suppresses the output that it would otherwise print to the terminal, whereas --verbose makes the output extra detailed, with fine-grained technical details included.
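A minimal sketch of such a CLI with argparse (the URL argument and flag names here are illustrative, not from any specific script in the post):

```python
import argparse

def build_parser():
    # --silent and --verbose are mutually exclusive: asking for both
    # at once makes no sense, so argparse rejects that combination.
    parser = argparse.ArgumentParser(description="Example scraper CLI")
    parser.add_argument("url", help="page to process")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--silent", action="store_true", help="suppress all output")
    group.add_argument("--verbose", action="store_true", help="print extra detail")
    return parser

# Parsing an explicit argv list instead of sys.argv, for demonstration
args = build_parser().parse_args(["https://example.com", "--verbose"])
```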

ImageMagick: programmable image processing toolkit

ImageMagick is a set of CLI tools and a C library for performing a variety of digital image handling tasks, such as converting between formats, resizing, mirroring, rotating, adjusting colors, rendering text, drawing shapes and so on. In many cases ImageMagick is not being used directly, but exists as part of the hidden substrate of C code that the modern shiny stuff is built upon. For example, it may be used to generate thumbnails of images that users uploaded to a web application.

Writing custom Scrapy middleware for proxy pool integration

One needs to use a proxy pool to scrape some of the sites that implement countermeasures against scraping. It might be because of IP-based rate limiting, geographic restrictions, AS-level traffic filtering or even something more advanced such as TLS fingerprinting. When the Scrapy framework is used to implement a scraper, there are several ways to do so: setting the HTTPS_PROXY environment variable when running a spider will make the traffic go through the proxy.

Decompiling Android apps

To get a deeper insight into how an Android app works, we may want to convert the binary form inside the APK file into some sort of textual representation that we can read and edit. This may be desirable for working out how to defeat some automation countermeasures (e.g. an HMAC being applied to API requests), finding vulnerabilities in the apps themselves or in the backend systems that service them, analysing malicious code and so on.

JavaScript AST manipulation with Babel: extracting hardcoded data

When scraping the web, one sometimes comes across data hardcoded in JS snippets. One could use regular expressions or naive string operations to extract it. However, it is also possible to parse the JavaScript code into an Abstract Syntax Tree to extract the hardcoded data in a more structured way than simple string processing allows. Let us go through a couple of examples. Suppose you were scraping some site and came across JS code similar to the following example from the Meta developer portal:

More tools of the trade for scraping and automation

Since the previous post I have realised that there are more interesting and valuable tools that were not covered, and that this warrants a new post. So let’s discuss several of them. AST Explorer Sometimes you run into obfuscated code that you need to make sense of, and have to write some code to undo the obfuscation. AST explorer is a web app that lets you put in code snippets and browse the AST for multiple programming languages and parsing libraries.

JavaScript AST manipulation with Babel: the first steps

Previously on Trickster Dev: Understanding Abstract Syntax Trees JavaScript Obfuscation Techniques by Example When it comes to reverse engineering obfuscated JavaScript code there are two major approaches: Dynamic analysis - using a debugger to step through the code and observing its behaviour over time. Static analysis - analysing the source code without running it, by parsing and examining the code itself. It is a bad idea to rely exclusively on regular expressions and naive string manipulation to do web scraping.

Javascript obfuscation techniques by example

Sometimes when working on scraping some website you look into the JavaScript code and it looks like a complete mess that is impossible to read - no matter how much you squint, you cannot make sense of it. That’s because it has been obfuscated. Code obfuscation is a transformation meant to make code difficult to read and reverse engineer. Many websites utilize JavaScript obfuscation to make things difficult for scraper/automation developers.

How to scrape Zillow with Python and Scrapy

The following text assumes some knowledge of and experience with Scrapy, and is meant to provide a realistic example of scraping a moderately difficult site of the kind you may be dealing with as a freelancer working in web scraping. This post is not meant for people completely unfamiliar with Scrapy. The reader is encouraged to read some earlier content on the blog that introduces Scrapy from the ground up. This time, we are building up from knowledge of the basics of Scrapy and thus skipping some details.

Understanding TLS fingerprinting

TLS (Transport Layer Security) is a network protocol that sits between the transport layer (TCP) and application layer protocols (HTTP, IMAP, SMTP and so on). It provides security features such as encryption and authentication to TCP connections, which merely deal with reliably transferring streams of data. For example, a lot of URLs on the modern web start with https:// and you typically see a lock icon by the address bar in your web browser.

How to scrape Youtube view intensity time series

Recently YouTube has introduced a small graph on its user interface that visualises a time series of video viewing intensity. Spikes in this graph indicate which parts of the video tend to be replayed often, thus being the most interesting or relevant to watch. This requires quite some data to be accumulated and is only available on sufficiently popular videos. Screenshot 1 Let us try scraping this data as an exercise in web scraper development.

Understanding Abstract Syntax Trees

The first thing that happens when program source code is parsed by a compiler or interpreter is tokenization - cutting the code into substrings that are later organised into a parse tree. However, this tree structure merely represents the textual structure of the code. The next step is syntactic analysis - the activity of converting the parse tree (also known as the CST - concrete syntax tree) into another tree structure that represents the logical (not textual) structure.
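Python's standard library exposes this logical tree directly, which makes it easy to poke at; a tiny example (the snippet being parsed is illustrative):

```python
import ast

# Parse a one-line assignment into an abstract syntax tree. Unlike a
# parse tree, the AST drops purely textual details (whitespace, parens).
tree = ast.parse("total = price * quantity + tax")

assign = tree.body[0]                   # top-level statement: an Assign node
print(type(assign).__name__)            # Assign
print(assign.targets[0].id)             # total
# The right-hand side is a BinOp tree: (price * quantity) + tax,
# so the outermost operator is the addition.
print(type(assign.value.op).__name__)   # Add
```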

Notes from The Bug Hunters Methodology - Application Analysis v1

This post will consist of notes taken from The Bug Hunter’s Methodology: Application Analysis v1 - a talk by Jason Haddix at Nahamcon 2022. These notes are mostly for my own future review, but hopefully other people will find them useful as well. Many people have been teaching how to inject an XSS payload, but not how to systematically find vulnerabilities in the first place. Jason created an AppSec edition of his methodology when it became large enough to be split into recon and AppSec parts.

Building higher-order automation workflows with n8n

Automation systems tend to have a temporal aspect to them, as some action or the entire flow may need to be executed at specific times or at certain intervals. Scrapers, vulnerability scanners and social media bots are examples of things that you may want to run on a schedule. Those using web scraping for lead generation or price intelligence need to relaunch the web scraper often enough to get up-to-date snapshots of the data.

Importing Shopify product info programmatically

Sometimes one would scrape eCommerce product data for the purpose of reselling those products. For example, a retail ecommerce company might be sourcing their products from a distributor that does not provide an easy way to integrate with a Shopify store. This problem can be solved through web scraping. Once the data is scraped, it can be imported into the Shopify store. One way to do that is to wrangle the product dataset into file(s) that heed the Shopify product CSV schema and import them via the Shopify store admin dashboard.

Turning Scrapy spider into API with ScrapyRT

The Scrapy framework provides many benefits over regular Python scripts when it comes to developing web scrapers of non-trivial complexity. However, Scrapy by itself does not provide a direct way to integrate your web scraper into a larger system that may need some on-demand scraping (e.g. a price comparison website). ScrapyRT is a Scrapy extension that was developed by Zyte to address this limitation. Just like Scrapy itself, it is trivially installable through PIP. ScrapyRT exposes your Scrapy project (not just spiders, but also pipelines, middlewares and extensions) through an HTTP API that you can integrate into your systems.

Scraping Instagram API with Instauto

Instagram scraping is of interest to the OSINT and growth hacking communities, but can be rather challenging. If we proceed with using browser automation for this purpose, we risk triggering client-side countermeasures. Using the private API is a safer, more performant approach, but has its own challenges. Instagram implements complex API flows that involve HMACs and other cryptographic techniques for extra security. When it comes to implementing Instagram scraping code, one does not simply use mitmproxy or Chrome DevTools to intercept API requests so that they can be reproduced programmatically.

Scapy: low level packet hacking toolkit for Python

To make HTTP requests with Python we can use the requests module. To send and receive things via a TCP connection we can use stream sockets. However, what if we want to go deeper than this? To parse and generate network packets, we can use the struct module. Depending on how deep in the protocol stack we are working, we may need to send/receive the wire format buffer through raw sockets. This can be fun in a way, but if this kind of code is being written for research purposes (e.
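To illustrate the struct-based approach mentioned above, here is a small sketch that packs and unpacks a UDP header by hand (the port numbers are arbitrary example values):

```python
import struct

# A UDP header is 8 bytes: source port, destination port, length and
# checksum, all 16-bit big-endian (network byte order) fields.
def parse_udp_header(data):
    sport, dport, length, checksum = struct.unpack(">HHHH", data[:8])
    return {"sport": sport, "dport": dport, "length": length, "checksum": checksum}

# Round-trip: build a header with struct.pack, then parse it back.
raw = struct.pack(">HHHH", 53124, 53, 8, 0)
print(parse_udp_header(raw))
```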

Strategies and patterns of gray hat social media automation

Introduction and motivation We live in the postmodern age of hyperreality. The vast majority of information most people receive comes through technological means, without any direct access to its source or any way to verify or deny what is written or shown on the screen. A pretty girl sitting in a private jet might have paid a lot of money to fly somewhere warm, or might have paid a company that keeps the plane on the tarmac a smaller amount of money to do a photo shoot.

You probably don’t need AWS and are better off without it

Amazon Web Services (AWS) is the most prominent large cloud provider in the world, offering what seems to be practically unlimited scalability and flexibility. It is heavily marketed towards startups and big companies alike. There’s even a version of AWS meant for extra-secure governmental use (AWS for Government). There’s an entire branch of the DevOps industry that helps set up software systems on AWS, and a lot of people are making money by using AWS in some capacity.

Compiling Python programs with Pyinstaller

Pyinstaller is a CLI tool, installable through PIP, that compiles Python scripts into executable binaries. Let us go through a couple of examples of using this tool. The SMTP enumeration script from the previous post can be trivially compiled by running the following command: $ pyinstaller smtp_enum.py This creates two directories - build/ for intermediate files and dist/ for the results of compilation. However, we find that dist/ contains multiple binary files, whereas it is generally more convenient to compile everything into a single file that statically links all the dependencies.

Notes on TBHM v4 recon edition

This post will summarize The Bug Hunter’s Methodology v4.01: Recon edition - a talk at h@cktivitycon 2020 by Jason Haddix, a prominent hacker in the bug bounty community. Taking notes is important when learning new things, and therefore notes were taken for future reference of this material. This methodology represents a breadth-first approach to bounty hunting and is meant to provide a reproducible strategy to discover as many assets related to the target as possible (but make sure to heed scope!

Writing web scrapers in Go with Colly framework

Colly is a web scraping framework for the Go programming language. The feature set of Colly largely overlaps with that of the Scrapy framework from the Python ecosystem: Built-in concurrency. Cookie handling. Caching of HTTP response data. Automatic heeding of robots.txt rules. Automatic throttling of outgoing traffic. Furthermore, Colly supports distributed scraping out of the box through a Redis-based task queue and can be integrated with Google App Engine. This makes it a viable choice for large-scale web scraping projects.

Creating DC proxies on cloud providers

Although many proxy providers offer data center (DC) proxies fairly cheaply, sometimes it is desirable to make our own. In this post we will discuss how to set up Squid proxy server on cheap Virtual Private Servers from Vultr. We will be using Debian 11 Linux environment on virtual “Cloud Compute” servers. Let us go through the steps to install Squid through Linux shell with commands that will be put into provisioning script.

SMTP user enumeration for fun and profit

Sometimes it is desirable to enumerate all (or as many as possible) email recipients at a given domain. This can be done by establishing an SMTP connection to the corresponding mail server, which can be found via the DNS MX record, and getting the server to verify your guesses for email usernames or addresses. There are three major approaches to do this: Using the VRFY request, which is meant to verify whether there is a user with a corresponding mailbox on the server.
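The VRFY approach can be sketched with Python's standard smtplib; the reply-code mapping below follows common SMTP semantics and the helper names are illustrative. A live check of course requires network access to the target mail server:

```python
import smtplib

# Interpret the server's reply code to a VRFY request. 252 means the
# server refuses to confirm or deny, which many servers use to defeat
# exactly this kind of enumeration.
def interpret_vrfy(code):
    if code == 250:
        return "exists"
    if code == 252:
        return "unverifiable"
    if code in (550, 551):
        return "no such user"
    return "unknown"

def vrfy_user(host, username, port=25):
    # Live check against a real server - requires network access.
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        code, _message = smtp.verify(username)
        return interpret_vrfy(code)
```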

Running GUI apps within Docker containers

Typically Docker is used to encapsulate server-side software in reproducible packages - containers. A certain degree of isolation is ensured between containers. Furthermore, containers can be used as building blocks for systems consisting of multiple software servers. For example, a web app can consist of backend server, database server, frontend server, load balancer, redis instance for caching and so on. However, what if we want to run desktop GUI apps within Docker containers to use them as components within larger systems?

Reproducible Linux environments with Vagrant and Terraform

When developing and operating scraping/automation solutions, we don’t exactly want to focus on systems administration part of things. If we are writing code to achieve a particular objective, making that code run on the VPS or local system is merely a supporting side-objective that is required to make the primary thing happen. Thus it is undesirable to spend too much time on it, especially if we can use automation to avoid repetitive, error-prone activity of installing the required tooling in disposable virtual machines or virtual private servers.

Decrypting your own HTTPS traffic with Wireshark

HTTP messages are typically not sent in plaintext in the post-Snowden world. Instead, the TLS protocol is used to provide communications security against tampering and surveillance of HTTP-based communications. TLS itself is a fairly complex protocol consisting of several sub-protocols, but let us think of it as an encrypted and authenticated layer on top of a TCP connection that also does some server (and optionally client) verification through public key cryptography.

Scrapy framework architecture

In this post we will take a deeper look into the architecture of not just Scrapy projects, but the Scrapy framework itself. We will go through some key components of Scrapy and look into how data flows through the system. Let us look at the following picture from the Scrapy documentation: https://docs.scrapy.org/en/latest/_images/scrapy_architecture_02.png We see the following components: Engine is the central switchboard for all data that is transferred inside Scrapy when it is running.

How to scrape pages behind login with Python

Many websites do not provide proper access to information for unauthenticated users. Instead, the data is provided in some client area that is accessible only after the user goes through a login flow, possibly after signing up for a paid plan. How can such websites be scraped with Python? There are two additional things we have to do when scraping data behind a login: Set up a requests.Session object with proper cookies by reproducing the login flow programmatically.

Using Scrapy pipelines to export scraped data

By default, the Scrapy framework provides a way to export scraped data into CSV, JSON, JSONL and XML files, with the possibility of storing them remotely. However, we may need more flexibility in how and where the scraped data is stored. This is the purpose of Scrapy item pipelines. A Scrapy pipeline is a component of a Scrapy project for implementing post-processing and exporting of scraped data. We are going to discuss how to implement data export code in pipelines and provide a couple of examples.
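A minimal pipeline sketch, assuming the usual Scrapy conventions: Scrapy calls open_spider, process_item and close_spider on any plain class listed in the project's ITEM_PIPELINES setting, so no base class is needed. The class and file names here are illustrative:

```python
import json

class JsonLinesExportPipeline:
    """Write each scraped item as one JSON object per line (JSONL)."""

    def open_spider(self, spider):
        # Called once when the spider starts; illustrative output path.
        self.file = open("items.jsonl", "w")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to the next pipeline, if any

    def close_spider(self, spider):
        self.file.close()
```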

How to use Scrapy Cloud

Scrapy is a prominent Python framework for web scraping that provides a certain kind of project template structure for easier development. However, once a Scrapy project is developed, it may be necessary to deploy it into the cloud for long running scraping jobs. Scrapy Cloud is a SaaS solution for hosting your Scrapy projects, developed by the same company that created the Scrapy framework. Let us go through an example of using it with an already developed scraper.

Sending notifications programmatically: let me count the ways

You may want to get notified about certain events happening during your scraping/botting operations. Examples might be outages of external systems that your setup depends on, fatal error conditions, scraping jobs being finished, and so on. If you are implementing automations for bug bounty hunting, you certainly want to get notified about new vulnerabilities being found in the target systems being scanned. You may also want to get periodic status updates on long running tasks.

Evaluating MS Playwright for gray hat automation

MS Playwright is a framework for web testing and automation. Playwright supports programmatic control of Chromium and Firefox browsers and also integrates the WebKit engine. Headless mode is supported and enabled by default, making it possible to run your automations in environments that have no GUI support (lightweight virtual private servers, Docker containers and so on). Playwright is written in JavaScript, but has official bindings for Python, C#, TypeScript and Java.

Automating spreadsheet tasks with openpyxl

Spreadsheets are a mainstay tool for information processing in many domains, widely used by people in many walks of life. It is fairly common for developers working in scraping and automation to be dealing with spreadsheets for inputting data into custom code, report generation and other tasks. Openpyxl is a prominent Python module for reading and writing spreadsheet files in the Office Open XML format, which is compatible with many spreadsheet programs (MS Excel 2010 and later, Google Docs, Open Office, Libre Office, Apple Numbers, etc.

Scrapy framework tips and tricks

Use Scrapy shell for interactive experimentation Running scrapy shell gives you an interactive environment for experimenting with the site being scraped. For example, running fetch() with the URL of a page fetches the page and creates a response variable with a scrapy.Response object for that page. view(response) opens the browser to let you see the HTML that the Scrapy spider would fetch. This bypasses some of the client-side rendering and also lets us detect whether the site has some countermeasures against scraping.

Email harvesting from Github profiles

You may have some reason to automatically gather (harvest) software developer emails - SaaS marketing, recruitment or community building. One place online that has a lot of developers is GitHub. Some developers have a public email address listed on their GitHub profile. It turns out this information is available through GitHub’s REST API and can be extracted with just a bit of scripting. First we need an API token.
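A sketch of the relevant API call with the standard library (the request is built but not sent here; GITHUB_TOKEN is a placeholder for a real personal access token):

```python
import urllib.request

def build_user_request(username, token="GITHUB_TOKEN"):
    # GET /users/{username} on the GitHub REST API; the JSON response
    # includes an "email" field that is null unless the user has made
    # an address public on their profile.
    return urllib.request.Request(
        "https://api.github.com/users/" + username,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": "Bearer " + token,
        },
    )

req = build_user_request("octocat")
# urllib.request.urlopen(req) would perform the actual (authenticated) call.
```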

CAPTCHA solver services for scraping and automation

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a type of automation countermeasure. It is based on a challenge-response approach that involves asking the user to perform an action that only a human is supposed to be able to perform, such as: Writing down characters or words from a distorted image. Performing basic mathematical operations that are presented in a distorted image. Recognizing a specific object within a set of images, possibly with distortion.

Introduction to Scrapy framework

Scrapy is a Python framework for web scraping. One might ask: why would we need an entire framework for web scraping if we can simply code up some simple scripts in Python using Beautiful Soup, lxml, requests and the like? To be fair, for simple scrapers you don’t. However, when you are developing and running a web scraper of non-trivial complexity the following problems will arise: Error handling. If you have an unhandled exception in your Python code, it can bring down the entire script, which can cause time and data to be lost.

Tools of the trade for scraping and automation

To get things done, one needs a set of tools appropriate to the task. We will discuss several open source tools that are highly valuable for developers working in scraping and automation. Chrome DevTools Let us start with the basics. Chrome has a developer tool panel that you can open by right-clicking on something in the web page and choosing “Inspect Element”, or by going to View -> Developer -> Developer Tools.

Harvesting emails from Google search results

Automated gathering of email addresses is known as email harvesting. Suppose we want to gather the email addresses of certain kinds of individuals, such as influencers or content creators. This can be accomplished through certain lesser-known features of Google. For example, the site: operator limits search results to a given domain. Double-quoting something forces Google to provide exact matches on the quoted text. Boolean operators (AND, OR) are also supported. Google caps the number of results at about 300 per query, but we can mitigate this limitation by breaking the search space down into smaller segments across multiple search queries.
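The operators above can be composed programmatically; a small sketch, where the niche keywords and the quoted Gmail suffix are illustrative assumptions about what one might search for:

```python
from urllib.parse import quote_plus

def build_query(domain, keywords):
    # site: restricts to one domain; each keyword is double-quoted for
    # exact matching and the alternatives are joined with OR.
    quoted = " OR ".join('"{}"'.format(k) for k in keywords)
    return 'site:{} ({}) "@gmail.com"'.format(domain, quoted)

q = build_query("instagram.com", ["fitness coach", "travel blogger"])
url = "https://www.google.com/search?q=" + quote_plus(q)
print(q)
```

Segmenting the search space then just means calling build_query with different keyword subsets (or extra qualifiers) and merging the results.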

The very basics of XPath for web scraping: less than 1% of XPath language to scrape more than 90% of pages

When scraping HTML pages, many developers use CSS selectors to find the right elements for data extraction. However, modern HTML is largely based on the XML format, which means there is a better, more powerful way to find the exact elements one needs: XPath. XPath is an entire language created for traversing XML (and by extension HTML) element trees. The size and complexity of the XPath 3.1 specification might seem daunting, but the good news is that you need to know very little of it as a web scraper developer.
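As a toy illustration of the path-based selection idea, Python's standard xml.etree.ElementTree supports a small XPath subset (for full XPath on real-world HTML one would normally reach for lxml or Scrapy's response.xpath()); the markup here is a made-up example:

```python
import xml.etree.ElementTree as ET

html = """<html><body>
  <div class="product"><span class="price">9.99</span></div>
  <div class="product"><span class="price">19.99</span></div>
</body></html>"""

# .//span[@class="price"] reads: any <span> anywhere below the root
# whose class attribute equals "price" - a classic XPath pattern.
tree = ET.fromstring(html)
prices = [span.text for span in tree.findall('.//span[@class="price"]')]
print(prices)  # ['9.99', '19.99']
```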

Using proxies for web scraping and automation

When it comes to scraping and automation operations, it might be important to control where remote systems see the traffic coming from, to evade rate-limiting, captchas and IP bans. This is what we need proxies for. Let us talk about individual proxy servers and proxy pools. A proxy server is a server somewhere in the network that acts as a middleman for network communications. One way this can work is connection-level proxying via the SOCKS protocol.

Setting up mitmproxy with Android

In the previous post, instructions were provided on how to set up mitmproxy with an iOS device. In this one, we will be going through setting up mitmproxy with an Android device or emulator. If you would prefer to use an Android emulator for hacking mobile apps, I would recommend the Genymotion software, which lets you create pre-rooted virtual devices. The following steps were reproduced on an Android 10 system running in the Genymotion emulator, with Google Chrome installed through ADB from some sketchy APK file that was found online.

Setting up mitmproxy with iOS 15

mitmproxy is an open source proxy server developed for launching man-in-the-middle attacks against network communications (primarily HTTP(S)). mitmproxy enables passive sniffing, active modification and replaying of HTTP messages. It is meant to be used for troubleshooting, reverse engineering and penetration testing of networked software. We will be setting up mitmproxy with an iOS 15 device for scraping and gray hat automation purposes. One use case of the mitmproxy-iPhone setup is discussed in my previous post about scraping the private API of a mobile app.

Using Python and mitmproxy to scrape private API of mobile app

Web scraping is a widely known way to gather information from external sources. However, it is not the only way. Another is API scraping. We define API scraping as the activity of automatically extracting data from reverse engineered private APIs. In this post, we will go through an example of reverse engineering the private API of a mobile app and developing a simple API scraping script that reproduces API calls to extract the exact information that the app shows on the mobile device.

You don’t typically need Selenium for scraping and automation

First, let us presume that we want to develop code to extract structured information from web pages that may or may not be doing a lot of API calls from client-side JavaScript. We are not trying to develop a general crawler like Googlebot, and we don't mind writing some code specific to each site we are scraping (e.g. some Scrapy spiders or Python scripts). When coming across discussions about web scraper development on various forums online, it is common to hear people say that they need JavaScript rendering to scrape websites.

Easier JSON wrangling with jq, JSONPath and JSONPointer

In this day and age, JSON is the dominant textual format for information exchange between software systems. JSON is based on key-value pairs: keys are always strings, but values can be objects (dictionaries), arrays, numbers, booleans, strings or nulls. A common problem software developers run into is that JSON tree structures can be deeply nested, with parts that may or may not be present. This can lead to tedious, awkward code.
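To illustrate the problem, here is a small JSON-Pointer-style helper in Python (the `pluck` name and the sample document are made up for this sketch); tools like jq, JSONPath and JSONPointer solve the same problem declaratively, without hand-rolled traversal code:

```python
import json

def pluck(obj, path, default=None):
    """Follow a /-separated, JSON-Pointer-style path into nested
    dicts and lists, returning `default` when any step is missing."""
    cur = obj
    for key in path.strip("/").split("/"):
        if isinstance(cur, dict) and key in cur:
            cur = cur[key]
        elif isinstance(cur, list) and key.isdigit() and int(key) < len(cur):
            cur = cur[int(key)]
        else:
            return default
    return cur

doc = json.loads('{"user": {"addresses": [{"city": "Vilnius"}]}}')
print(pluck(doc, "/user/addresses/0/city"))      # Vilnius
print(pluck(doc, "/user/phone", default="n/a"))  # n/a
```

A single path string replaces a chain of `.get()` calls and index checks, which is exactly the convenience that jq expressions like `.user.addresses[0].city` provide on the command line.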

Python Web Scraping: tips and tricks

Use pandas read_html() to parse HTML tables (when possible). Some pages you might be scraping may include old-school HTML <table> elements with tabular data. There’s an easy way to extract data from pages like this into Pandas dataframes (although you may need to clean them up afterwards):

>>> import pandas as pd
>>> pd.read_html('https://en.wikipedia.org/wiki/Yakutsk')[0]
                        Yakutsk Якутск                     Yakutsk Якутск.1
0  City under republic jurisdiction[1]  City under republic jurisdiction[1]
1               Other transcription(s)               Other transcription(s)
2                              • Yakut                           Дьокуускай
3         Central Yakutsk from the air         Central Yakutsk from the air
4                                    .

Grayhat Twitch Chatbots

Twitch is a popular video streaming platform originally developed for gamers, but now expanding beyond the gaming community. It enables viewers to use chat rooms associated with video streams to communicate with each other and with the streamer. Chatbots for these chat rooms can be used for legitimate use cases such as moderation, polls, ranking boards, and so on. Twitch allows such usage but requires chatbot software to pass a verification process before being used in production at scale.

Automating Google Dorking

There is more to using Google than searching by keywords, phrases and natural language questions. Google also has advanced features that let users be extra specific with their searches. To search for an exact phrase, wrap it in double quotes, for example:

"top secret"

To search for documents with a specific file extension, use the filetype: operator:

filetype:pdf "top secret"

To search for results within a specific domain, use the site: operator.
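These operators compose into a single query string, which makes dorking easy to automate. Here is a hypothetical Python helper that builds a Google search URL from dork operators (the function name and parameters are made up for illustration):

```python
from urllib.parse import urlencode

def google_dork_url(*terms, site=None, filetype=None):
    """Build a Google search URL combining plain terms with the
    site: and filetype: dork operators (illustrative helper)."""
    q = list(terms)
    if site:
        q.append("site:" + site)
    if filetype:
        q.append("filetype:" + filetype)
    return "https://www.google.com/search?" + urlencode({"q": " ".join(q)})

print(google_dork_url('"top secret"', filetype="pdf"))
# https://www.google.com/search?q=%22top+secret%22+filetype%3Apdf
```

Note that actually fetching such URLs at scale will run into Google's own anti-bot measures, which is part of what makes automating dorking non-trivial.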

Using ephemeral Onion Services for quick NAT traversal

Sometimes, when developing server-side software, it is desirable to make it accessible from outside the local network, which might be shielded from incoming TCP connections by a router performing Network Address Translation. One option to work around this is Ngrok - a SaaS app that lets you tunnel a connection out from your network and exposes it to external traffic through a cloud endpoint. However, it is primarily designed for web apps, and it would be nice if we didn’t need to rely on a third-party SaaS vendor to make our server software accessible outside our local network.
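A minimal sketch of how this could look with the stem library, assuming a local Tor daemon with ControlPort 9051 enabled (the function name and port numbers are illustrative; error handling is omitted):

```python
def expose_local_port(local_port=8000, control_port=9051):
    """Publish an ephemeral v3 Onion Service that forwards virtual
    port 80 to a local port, via a running Tor daemon's ControlPort.
    Requires `pip install stem` and a local Tor instance."""
    from stem.control import Controller  # third-party library

    with Controller.from_port(port=control_port) as ctl:
        ctl.authenticate()
        # The service lives only as long as this control connection;
        # no private key ever touches the disk.
        svc = ctl.create_ephemeral_hidden_service(
            {80: local_port}, await_publication=True)
        return svc.service_id + ".onion"
```

Because Tor builds the circuit outwards from our machine, no inbound port forwarding or NAT configuration is needed; anyone with a Tor client can reach the returned .onion address.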

Sending mass DMs on Reddit through API: a small experiment

Howitzer is a SaaS tool that scrapes subreddits for users mentioning given keywords and automates mass direct message sending for growth hacking purposes. Generally speaking, there are significant difficulties in automating against major social media platforms. However, Reddit is not as hostile towards automation as other platforms and even provides a relatively unrestricted official API for building bots and integrations. Let us try automating against the Reddit API to build a poor man's Howitzer with Python.
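A hedged sketch of what the mass DM sending part could look like on top of PRAW, Reddit's official Python API wrapper (the `send_dms` helper and the delay value are made up for illustration; the `reddit` argument is assumed to be an authenticated `praw.Reddit` instance):

```python
import time

def send_dms(reddit, usernames, subject, body, delay=2.0):
    """Send the same direct message to each username through an
    authenticated praw.Reddit instance, pausing between sends as
    crude rate limiting."""
    for name in usernames:
        reddit.redditor(name).message(subject=subject, message=body)
        time.sleep(delay)
```

Usernames would come from the scraping half of the pipeline (e.g. matching keywords in subreddit submissions); note that aggressive DM sending can still get the account suspended under Reddit's spam rules.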

Introducing GPT-3: Playground and API

GPT-3 is a large scale natural language processing system based on deep learning and developed by OpenAI. It is now generally available to developers and curious people. GPT-3 is the kind of AI that works exclusively on text: text is both its input and its output. One can provide questions and instructions in plain English and receive fairly coherent responses. However, GPT-3 is not:

HAL from 2001: A Space Odyssey
Project 2501 from Ghost in the Shell
T-800 from The Terminator
AM from I Have No Mouth, and I Must Scream

What GPT-3 is: a tool AI (as opposed to an agent AI).

How does PerimeterX Bot Defender work

PerimeterX is a prominent vendor of anti-bot technology, used by portals such as Zillow, Crunchbase, StockX and many others. Many developers working on web scraping or automation scripts have run into PerimeterX Human Challenge - a proprietary CAPTCHA that involves pressing and holding an HTML element and does not seem to be solvable by any of the CAPTCHA solving services. PerimeterX has registered the following US patents:

US 10,708,287B2 - ANALYZING CLIENT APPLICATION BEHAVIOR TO DETECT ANOMALIES AND PREVENT ACCESS
US 10,951,627B2 - SECURING ORDERED RESOURCE ACCESS
US 2021/064685A1 - IDENTIFYING A SCRIPT THAT ORIGINATES SYNCHRONOUS AND ASYNCHRONOUS ACTIONS (pending application)

Let us take a look into these patents to discover the key working principles of PerimeterX bot mitigation technology.

How to download embedded videos

When wandering across the World Wide Web, many netizens have come across pages with YouTube or Vimeo videos embedded in them. youtube-dl is a prominent tool for downloading online videos from many sources (not limited to YouTube; see the complete list of supported sites). But can it download videos even when they are embedded in some third party website? Turns out, it can (with a little bit of help from the user).

Trickster Dev

Code level discussion of web scraping, gray hat automation, growth hacking and bounty hunting