At Castle, we’ve increasingly embedded LLMs and tools like Cursor into our research workflows, whether we’re prototyping detection techniques, exploring automation fingerprints, or reviewing technical content. These tools help us move faster, focus on the right problems, and reduce overhead in our day-to-day work.
We’ve always shared our research publicly through blog posts, especially around bot detection, fraud prevention, and browser fingerprinting. But as AI tools become part of the standard developer toolkit, we wanted to make this knowledge easier to use, not just read.
So we’re experimenting with a new format: packaging part of our bot detection knowledge base as a Cursor rule. It’s a markdown file designed to work well with LLMs and AI-based IDEs like Cursor, something you can import, query, or build on top of. This is a first version, and we plan to improve it as we refine detection techniques and gather feedback.
What’s a Cursor rule?
Cursor rules are markdown files that define how the assistant should behave in a specific context. You can scope them to a folder, project, or file pattern—so, for example, when editing fingerprinting code, Cursor automatically applies the right rule.
Think of them as lightweight operating manuals. They clarify expectations, standardize structure, and save time when iterating.
One of the rules I maintain is scoped to bot detection R&D. It helps enforce consistency and avoid repetitive setup or decision-making when working on new fingerprinting signals. Here’s a trimmed version of that rule:
- Use JavaScript that runs in the browser.
- Favor `async/await` over `.then`.
- When creating a new signal:
  - Aim to distinguish bots from human users.
  - Add comments explaining why the signal works.
  - Include rationale for its discriminative value.
- Each signal’s file name should reflect its purpose or the attributes/APIs it uses.
- Each signal contains only one `collect` function that returns the result of the fingerprinting test. It can be any type of value (string, object, etc.).
- Fingerprinting signals must not have observable impact for users, i.e. they never display anything on the screen. If you need to add an HTML element or an iframe, it should always be hidden and removed after the test.
- The code should be defensive: always test whether an API exists before using it (except for core JavaScript features). For example, before using service workers, check that the API is available; otherwise return early.
- Signal functions should be wrapped in a try/catch and return 'ERR' if they fail. They should return 'NA' if they bail out because a feature is not available.
- Use consistent naming for variables and object properties across all fingerprinting signals.
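As an illustration, here is a minimal hypothetical signal written to follow these conventions. The file name, the choice of attribute, and the rationale comments are assumptions for the example, not Castle's actual code:

```javascript
// Hypothetical signal file: hardware_concurrency.js
// Rationale: server-grade bot infrastructure often reports far more
// CPU cores than typical consumer devices, so the raw value can be
// discriminative when combined with other signals.
function collect() {
  try {
    // Guard the API before using it, as the rule requires.
    if (typeof navigator === 'undefined' || !('hardwareConcurrency' in navigator)) {
      return 'NA';
    }
    return navigator.hardwareConcurrency;
  } catch (e) {
    return 'ERR';
  }
}
```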
The idea is to automate everything that doesn’t require judgment, so we can focus on what does.
Why we’re sharing this
This file wasn’t originally built for external use; it evolved from internal tooling and workflows. But the more we relied on AI-based tools, the more it made sense to structure our knowledge in a way that’s natively usable by them.
We hope this version helps other researchers, analysts, and engineers—especially those newer to bot detection—work more efficiently, or get inspired to build their own version.
You can load it into Cursor or into a custom GPT, depending on your setup.
Right now, it includes:
- Techniques for detecting Headless and automated Chrome
- Specific signals for Playwright, Puppeteer, and Selenium
- Inconsistency checks across fingerprint attributes and environments
We’ll keep expanding it as we explore new signals and edge cases.
Note: This isn’t the full picture. Castle uses more advanced internal detection logic that isn’t included here. But this file still provides real value for teams building or improving their own detection systems.
This file provides definitions and detection guidance related to bots, fingerprinting, and anti-detect technologies.
## What's a bot?
A bot is a program that automates web interactions. Some bots are legitimate, such as:
- SEO crawlers
- Monitoring agents
- Automated testing tools
Others are used for malicious purposes, including:
- **Credential stuffing** – trying large sets of stolen credentials to gain unauthorized access
- **Fake account creation** – registering accounts in bulk to exploit promotions or spam
- **Payment fraud** – abusing checkout flows, gift cards, or stolen payment methods
- **Scraping** – extracting data at scale (content, prices, inventory)
- **Spam** – posting unsolicited content via forms or comment sections
- **Scalping** – automating purchases of limited-inventory items
- **DDoS** – overwhelming infrastructure with high request volume
## Bot types
Bots fall into two main categories:
1. **HTTP clients**
These bots do not execute JavaScript. They send raw HTTP requests to load pages or hit public APIs. Typical behavior includes:
- Fetching static HTML
- Submitting forms or API requests without rendering pages
- Forging fingerprint payloads to mimic browser behavior
2. **Browser automation**
These bots use real or headless browsers (e.g. Chrome, Firefox, Safari) controlled through automation frameworks such as:
- Puppeteer
- Playwright
- Selenium
- Anti-detect wrappers like `puppeteer-extra-plugin-stealth`, `undetected-chromedriver`, `SeleniumBase`, or `NoDriver`
Bots in category (1) often reverse engineer fingerprinting or bot challenges and submit forged payloads.
Bots in category (2) try to appear as real users by suppressing signs of automation and customizing browser fingerprints.
## Detection signals
Signals commonly used to detect bots include:
- **Browser fingerprinting**
Detect inconsistencies in the JavaScript environment, such as:
- Canvas rendering artifacts
- WebGL renderer anomalies
- Audio fingerprint discrepancies
- Missing or spoofed fonts
- **Headless artifacts**
Look for automation markers introduced by headless environments or frameworks:
- `navigator.webdriver === true`
- CDP (Chrome DevTools Protocol) stack trace or side effects
- Known global variables like `__playwright__binding__` (Playwright) or `document.$cdc_asdjflasutopfhvcZLmcfl_` (Selenium)
- **Client-side behavioral anomalies**
Bots often simulate user actions too precisely or too quickly:
- Perfect or linear mouse paths
- Unrealistic interaction speed
- Lack of idle or blurred/focused events
- **Protocol misuse**
Abnormal API usage patterns:
- Missing expected pre-navigation behavior
- Use of mobile APIs with desktop UAs
- Headers that don’t match browser behavior (e.g. missing `sec-ch-ua`, malformed `accept-language`)
- **HTTP headers**
Look for missing, inconsistent, or forged values:
- Absence of `accept-language`, `referer`, `sec-ch-ua` in browsers that should send them
- Mismatches between user-agent and sec-ch headers (e.g. mobile UA with desktop brands)
- **TLS fingerprinting (JA3 / JA4)**
TLS handshake should match what the browser claims:
- If the TLS fingerprint doesn't match the claimed browser/OS in the UA, it may be spoofed
- Example: Android UA with a Windows JA3 hash
- **Proxy and IP reputation**
Weak indicators that may support other detections:
- IPs linked to data centers, suspicious ASN ranges, or known proxy services
- Low reputation residential networks or fresh subnets
- **Concurrency patterns**
- High rate of identical requests from multiple IPs
- Correlated spikes in sensitive operations (login, registration)
- **Server-side behavioral anomalies**
Detect behavior that deviates from real user flows:
- Skipping product pages before adding to cart
- Login attempts without prior session establishment
- Same device fingerprint hitting multiple accounts
- Analyze over time windows (e.g. last 5 mins vs. last 6 hrs)
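The server-side behavioral checks above can be sketched as a simple sliding-window aggregation. The window size and threshold below are illustrative assumptions, not recommended values:

```javascript
// Hypothetical sketch: track how many distinct accounts a single
// device fingerprint touches within a sliding time window.
const WINDOW_MS = 5 * 60 * 1000; // e.g. the "last 5 mins" window above
const eventsByFingerprint = new Map();

function recordLogin(fingerprintId, accountId, now = Date.now()) {
  // Drop events that have aged out of the window, then record this one.
  const events = (eventsByFingerprint.get(fingerprintId) || [])
    .filter(e => now - e.ts <= WINDOW_MS);
  events.push({ accountId, ts: now });
  eventsByFingerprint.set(fingerprintId, events);
}

function distinctAccounts(fingerprintId) {
  const events = eventsByFingerprint.get(fingerprintId) || [];
  return new Set(events.map(e => e.accountId)).size;
}

// Illustrative threshold: one fingerprint hitting many accounts.
function isSuspicious(fingerprintId, threshold = 5) {
  return distinctAccounts(fingerprintId) >= threshold;
}
```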
### Interpreting signals
- **Strong signals (high confidence)**
Direct indicators of automation:
- `navigator.webdriver === true`
- Known automation globals (e.g. `__playwright__binding__`)
- TLS fingerprint mismatch with UA
- **Consistency signals (medium confidence)**
Detect internal contradictions across fingerprint traits:
- macOS UA with an Android GPU
- Language/locale/timezone mismatch vs. IP geolocation
- **Contextual signals (low confidence)**
Weak anomalies that are meaningful only in combination:
- IP in Spain, timezone in Germany, preferred language in Chinese
- Uncommon device config + new IP + bot-like behavior
### Usage strategies
- **Rule-based detection**
Combine two or more signals to block:
```js
if (navigator.webdriver && !navigator.plugins.length) block()
```
## How to detect headless Chrome
Headless Chrome can leak its presence through several observable signals.
### User-Agent string
When not modified, Headless Chrome includes the string `HeadlessChrome` in the user agent.
**Server-side example:**
```js
if (req.headers.get('user-agent')?.includes('HeadlessChrome')) {
console.log('Headless Chrome detected');
}
```
**Client-side example:**
```js
if (navigator.userAgent.includes('HeadlessChrome')) {
console.log('Headless Chrome detected');
}
```
### `sec-ch-ua` header
The `sec-ch-ua` header may also contain `HeadlessChrome`.
```js
if (req.headers.get('sec-ch-ua')?.includes('HeadlessChrome')) {
console.log('Headless Chrome detected');
}
```
### Missing `accept-language` header
Headless environments sometimes omit the `accept-language` header.
```js
if (req.headers.get('user-agent')?.includes('Chrome') &&
!req.headers.has('accept-language')) {
console.log('Headless Chrome detected');
}
```
### `navigator.webdriver` property
In most automation contexts, `navigator.webdriver` is set to `true`.
```js
if (navigator.webdriver) {
console.log('Headless Chrome detected');
}
```
### Default screen dimensions
Headless Chrome historically defaults to an 800×600 window, so these exact dimensions can hint at an unconfigured headless browser.
```js
if (screen.width === 800 && screen.height === 600) {
console.log('Headless Chrome detected');
}
if (screen.availWidth === 800 && screen.availHeight === 600) {
console.log('Headless Chrome detected');
}
```
### Notes
* These techniques detect default headless configurations.
* All signals can be spoofed or patched by advanced bots.
* Use them as part of a broader detection strategy combining multiple signals.
## How to detect automated Chrome
Bots using real Chrome (not headless) with frameworks like Selenium, Puppeteer, or Playwright often attempt to hide automation. In these cases, the `HeadlessChrome` substring is removed from the user agent, so alternative signals must be used.
### `navigator.webdriver` property
A key signal is the presence of `navigator.webdriver === true`, which indicates browser automation.
```js
if (navigator.webdriver) {
console.log('Automated Chrome detected');
}
```
### Known evasion technique
Bot developers can bypass this check by launching Chrome with:
```
--disable-blink-features=AutomationControlled
```
This disables the assignment of `navigator.webdriver = true`.
### Notes
* This is a high-confidence signal in default configurations.
* It should be combined with other indicators when possible.
* Manual patching or anti-detect frameworks may remove this signal.
## How to detect bots based on Playwright specifically
Playwright-based bots can be identified through specific side effects and global properties.
### Global variable leak
Playwright exposes internal variables in some configurations:
```js
if ('__playwright__binding__' in window) {
console.log('Playwright detected');
}
if ('__pwInitScripts' in window) {
console.log('Playwright detected');
}
```
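A generic variant of this check is to scan for a short list of globals known to be injected by automation tooling. The list below combines the Playwright names above with a few other publicly documented markers; names vary across framework versions, so treat it as an illustrative starting point:

```javascript
// Known automation globals; names change across framework versions.
const AUTOMATION_GLOBALS = [
  '__playwright__binding__', // Playwright
  '__pwInitScripts',         // Playwright
  '__nightmare',             // Nightmare.js
  '_Selenium_IDE_Recorder',  // Selenium IDE
];

// Returns the subset of known automation globals present on `scope`.
function detectAutomationGlobals(scope = globalThis) {
  return AUTOMATION_GLOBALS.filter(name => name in scope);
}
```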
## How to detect bots based on Selenium Chrome specifically
Selenium with ChromeDriver often leaks identifiable artifacts unless explicitly hardened.
### ChromeDriver variable
In default configurations, Selenium injects variables into the global scope:
```js
if ('cdc_adoQpoasnfa76pfcZLmcfl_Array' in window || 'cdc_adoQpoasnfa76pfcZLmcfl_Window' in window) {
console.log('Selenium bot detected!')
}
```
Older Selenium versions also expose a similar variable:
```js
if (window.document.$cdc_asdjflasutopfhvcZLmcfl_) {
console.log('Selenium ChromeDriver detected');
}
```
This variable is used internally by ChromeDriver and appears in many builds unless patched. Its name can vary slightly but usually follows the `$cdc_` prefix pattern.
### Notes
* This signal is highly specific to Selenium ChromeDriver.
* Attackers can patch or rename this variable in custom setups.
* Detection can be improved by scanning for any global property starting with `$cdc_`:
```javascript
for (const key in window.document) {
if (key.startsWith('$cdc_')) {
console.log('Potential ChromeDriver artifact:', key);
}
}
```
## How to detect bots based on Puppeteer Chrome specifically
Puppeteer can leak identifiable artifacts, especially through stack traces and internal function bindings.
### Stack trace inspection
Puppeteer often appears in stack traces when errors are triggered inside instrumented functions.
Override a native function, throw an error inside it, and inspect the resulting stack:
```js
const originalQuerySelector = document.querySelector;
document.querySelector = function () {
try {
throw new Error();
} catch (err) {
if (err.stack && err.stack.includes('puppeteer')) {
console.log('Puppeteer detected in stack trace');
}
}
return originalQuerySelector.apply(this, arguments);
};
```
For it to work, the bot must call `document.querySelector` at some point, directly or through a function that calls it (e.g. Puppeteer’s `page.$`).
You can extend this logic to other native functions commonly used by bots, e.g. `document.querySelectorAll`.
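A hypothetical generalization of this trick wraps any method and inspects the stack on each call. The helper name, the regular expression, and the target methods below are assumptions to adapt to the frameworks you want to catch:

```javascript
// Hypothetical helper: wrap a method so each call inspects its own
// stack trace for automation framework frames.
function instrument(obj, method, onDetect) {
  const original = obj[method];
  obj[method] = function (...args) {
    const stack = new Error().stack || '';
    // Assumed pattern; adjust for the frameworks you target.
    if (/puppeteer|playwright/i.test(stack)) {
      onDetect(method, stack);
    }
    // Preserve the original behavior so instrumentation stays invisible.
    return original.apply(this, args);
  };
}

// Usage sketch (browser): instrument(document, 'querySelectorAll',
//   m => console.log('Automation call via', m));
```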
## How to detect Chrome and Headless Chrome bots, even if they modify their fingerprint and use anti-detect automation frameworks
When bots spoof fingerprint attributes (e.g. navigator properties, headers, WebGL) or use hardened anti-detect frameworks, detection based on visible signals becomes less effective. However, side effects of the Chrome DevTools Protocol (CDP) can still reveal automation.
### CDP stack trace leak
CDP-instrumented browsers can leak through modified error stack behavior. This technique exploits how CDP hooks into `console.log` or `Error.stack`.
```js
const e = new Error();
Object.defineProperty(e, 'stack', {
get() {
console.log('CDP-instrumented Chrome detected');
}
});
// Do not remove this line — it triggers the getter.
console.log(e);
```
### CDP mouse event leak (Chromium bug)
CDP-based bots can be detected via event properties that mismatch expected values in real Chrome. This technique leverages a bug in Chrome's event dispatch system.
Reference: @https://issues.chromium.org/issues/40280325
```js
// Works only on Chrome desktop browsers
function mouseEventHandler(e) {
if (e.screenX === e.pageX && e.screenY === e.pageY) {
console.log('Chrome bot instrumented with CDP detected');
}
}
['click', 'mousedown', 'mouseup'].forEach(type => {
window.document.addEventListener(type, mouseEventHandler, true);
});
```
### Notes
* These methods are resistant to simple spoofing or header forgery.
* Advanced bots (e.g. Nodriver, Selenium driverless) can bypass these detections.
* Always validate results against legitimate user scenarios to reduce false positives, especially for users with privacy tools or non-mainstream browsers.
## Other JavaScript-based bot detection techniques
You can check whether the device reports an implausibly high number of CPU cores; values far above typical consumer hardware may indicate a bot running on server infrastructure.
```javascript
if (navigator.hardwareConcurrency > 70) {
console.log('Probably a bot')
}
```
You can also apply various consistency checks:
```javascript
if (navigator.userAgent.includes('iPhone') && navigator.platform === 'Win32') {
console.log('Fingerprinting inconsistency, probably a bot!')
}
```
The same applies to other checks, such as comparing `eval.toString().length`, which differs across browser engines:
```javascript
const etsl = eval.toString().length;
// etsl = 37 on Firefox but 33 on Chrome
if (navigator.userAgent.includes("Firefox") && etsl == 33) {
console.log('Fingerprinting inconsistency, probably a bot!')
}
```
We can conduct similar consistency checks with all kinds of browser APIs. The example below uses the WebGL renderer to verify that the value is consistent with the identity claimed in the user agent.
You can apply the same technique to all kinds of browser APIs by testing their absence/presence or their values, and verifying that they are consistent with the user agent, or with other browser APIs whose values are correlated under normal circumstances.
```javascript
var canvas = document.createElement('canvas');
var ctx = canvas.getContext("webgl") || canvas.getContext("experimental-webgl");
// Guard: getContext can return null (e.g. WebGL disabled or unsupported).
if (ctx && ctx.getSupportedExtensions().indexOf("WEBGL_debug_renderer_info") >= 0) {
  const webGLRenderer = ctx.getParameter(ctx.getExtension('WEBGL_debug_renderer_info').UNMASKED_RENDERER_WEBGL);
  if (navigator.userAgent.includes('Win') && webGLRenderer.includes('Apple')) {
    console.log('Fingerprinting inconsistency, probably a bot!')
  }
}
```
Another popular set of techniques for detecting lies that could indicate bot activity or fingerprint modification is to collect the same fingerprinting attributes in different JS execution contexts, such as:
- Main JS execution context
- Iframes
- Workers
You should be aware that bot developers know these techniques and often try to hook the creation of iframes and workers to return consistent values.
```javascript
// Pseudo code: it assumes you have a fingerprint object that contains values collected in the main JS execution context
const workerCode = `try {
var fingerprintWorker = {};
fingerprintWorker.userAgent = navigator.userAgent;
fingerprintWorker.hardwareConcurrency = navigator.hardwareConcurrency;
fingerprintWorker.platform = navigator.platform;
// You could collect more fingerprinting signals
self.postMessage(fingerprintWorker);
} catch (e) {
self.postMessage(fingerprintWorker);
}`
fingerprint.workerData = {};
var blob = new Blob([workerCode], { type: 'application/javascript' });
var workerUrl = URL.createObjectURL(blob);
var worker = new Worker(workerUrl);
worker.onmessage = function (e) {
try {
fingerprint.workerData.userAgent = e.data.userAgent;
fingerprint.workerData.platform = e.data.platform;
fingerprint.workerData.hardwareConcurrency = e.data.hardwareConcurrency;
if (fingerprint.workerData.userAgent != fingerprint.userAgent) {
console.log('Fingerprinting inconsistency, probably a bot!')
}
if (fingerprint.workerData.platform != fingerprint.platform) {
console.log('Fingerprinting inconsistency, probably a bot!')
}
if (fingerprint.workerData.hardwareConcurrency != fingerprint.hardwareConcurrency) {
console.log('Fingerprinting inconsistency, probably a bot!')
}
} catch (_) { }
}
```
## Proxy detection and IP reputation
There is no deterministic method to identify proxies, but several techniques can help detect or score proxy usage.
### External data sources
- Use commercial or open proxy IP reputation databases.
- Example: @avastel-bot-ips-lists
- These lists track IPs known to be used by proxies, VPNs, and bots.
### Passive network signals
- Look for anomalies in network metrics:
- DNS resolution time
- TLS handshake duration
- Page load time
- Compare observed values against typical baselines for residential users.
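In the browser, a rough version of these passive measurements can be read from the Navigation Timing API. Treat the derived numbers as weak evidence only; the helper name and thresholds you would compare against are assumptions:

```javascript
// Hedged sketch: coarse DNS/TLS/TTFB timings from Navigation Timing.
// Returns null when no navigation entry is available (e.g. in workers
// or non-browser environments).
function networkTimings() {
  const entries = performance.getEntriesByType('navigation');
  if (!entries.length) return null;
  const nav = entries[0];
  return {
    dnsMs: nav.domainLookupEnd - nav.domainLookupStart,
    tlsMs: nav.secureConnectionStart > 0
      ? nav.connectEnd - nav.secureConnectionStart
      : 0,
    ttfbMs: nav.responseStart - nav.requestStart,
  };
}
```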
### TCP and TLS fingerprinting
- Compare TCP/IP stack fingerprints or JA3/JA4 TLS hashes against known device/browser combinations.
- Mismatches (e.g. Android UA with a Windows TCP signature) can indicate spoofing or proxy use.
### Risk scoring approach
- Treat proxy indicators as weak signals.
- Do not block solely based on proxy detection — especially on shared IPs (e.g. CGNAT, mobile networks).
- Combine with other signals (e.g. user behavior, fingerprint anomalies) for higher confidence.
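One way to combine weak signals like these is a simple weighted score. The weights and threshold below are made-up placeholders that you would tune against your own traffic:

```javascript
// Illustrative weights only; tune against real traffic.
const WEIGHTS = {
  datacenterIp: 0.4,
  tlsUaMismatch: 0.5,
  webdriverFlag: 0.9,
  timezoneIpMismatch: 0.2,
};

// signals: an object mapping signal names to booleans.
function riskScore(signals) {
  let score = 0;
  for (const [name, present] of Object.entries(signals)) {
    if (present && name in WEIGHTS) score += WEIGHTS[name];
  }
  return Math.min(score, 1); // cap at 1 for readability
}

// Act (challenge, step-up, etc.) only above a tuned threshold.
function shouldChallenge(signals, threshold = 0.7) {
  return riskScore(signals) >= threshold;
}
```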
### Caution
- Many legitimate users rely on VPNs or shared mobile networks.
- Aggressive proxy blocking increases the risk of false positives.
## Can I use the user agent for bot detection?
The user agent (UA) string is a self-declared identifier sent with every HTTP request. It typically includes information about the browser, operating system, and rendering engine. However, relying solely on the UA for bot detection is unreliable.
### Why the user agent is unreliable
- **Easily spoofed**: Bots and automation tools can modify the UA string to mimic legitimate browsers, making it appear as though the request originates from a real user.
- **Lack of standardization**: There's no strict format for UA strings, leading to inconsistencies and making parsing challenging.
- **Privacy measures**: Modern browsers may intentionally obfuscate or generalize UA strings to enhance user privacy, reducing their reliability for detection purposes.
### Cross-verifying the user agent
Instead of trusting the user agent string at face value, it's more effective to treat it as a claimed identity and verify its consistency with other observable attributes:
- **Feature detection**: Compare the UA's claimed browser version with the presence of specific JavaScript features. For instance, if the UA indicates Chrome 133, but a feature introduced in that version is absent, it suggests inconsistency.
```javascript
// Hypothetical check: derive the claimed Chromium version from the UA
// and verify that a feature shipped in that version is present.
const uaMatch = navigator.userAgent.match(/(?:Chrome|Edg)\/(\d+)/);
const browserVersion = uaMatch ? parseInt(uaMatch[1], 10) : 0;
const hasFeature = 'FileSystemObserver' in window;
const expectedFeature = browserVersion >= 133;
if (expectedFeature && !hasFeature) {
  console.log('Inconsistency detected: Feature missing for claimed browser version.');
}
```
* **Operating system verification**: Cross-reference the UA's claimed OS with other indicators like the WebGL renderer. For example, if the UA claims Windows, but the renderer suggests Apple hardware, this discrepancy can signal spoofing.
```javascript
// Hypothetical check: compare the claimed OS with the WebGL renderer.
const isMacLike = /Mac OS X|iPhone|iPad/.test(navigator.userAgent);
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl');
const dbgInfo = gl && gl.getExtension('WEBGL_debug_renderer_info');
const webGLRenderer = dbgInfo ? gl.getParameter(dbgInfo.UNMASKED_RENDERER_WEBGL) : '';
if (!isMacLike && webGLRenderer.includes('Apple')) {
  console.log('Inconsistency detected: GPU indicates Apple hardware, but UA claims a non-Apple OS.');
}
```
### Limitations of user agent verification
Even if the UA string appears consistent with other attributes, it doesn't guarantee legitimacy:
* **Sophisticated bots**: Advanced bots can align multiple attributes to present a coherent but false identity.
* **Legitimate variations**: Users with customized browsers or privacy tools might naturally exhibit inconsistencies, leading to false positives.
### Conclusion
While the UA string can provide initial insights, it should not be the sole factor in bot detection. A comprehensive approach that includes behavioral analysis, fingerprinting, and cross-verification of multiple attributes offers a more robust defense against bots.
For a deeper exploration of this topic, refer to the article: @How dare you trust the user agent for bot detection?