Traffic Classification

Early Draft

This specification is at an early draft stage. Ideas are open for change and debate. A lot of the content was developed with the help of Claude AI.

RequestClassifier

The RequestClassifier is the first processing stage in the SDK. It analyzes incoming requests and routes them to one of the three channels.

The primary gate is Vera header validation: a request carrying a valid Vera header set is classified as VERA_HUMAN. Any request without valid Vera headers can be blocked outright, which covers all non-Vera traffic without maintaining allowlists or blocklists. Publishers that want finer-grained handling of non-Vera traffic can optionally classify it further into AI_AGENT, STANDARD_BROWSER, or UNKNOWN_BOT.

from enum import Enum

class RequestorType(Enum):
    VERA_HUMAN       = "vera_human"       # Vera browser with valid header set
    AI_AGENT         = "ai_agent"         # Known AI crawler (optional detection)
    STANDARD_BROWSER = "standard_browser" # Chrome, Firefox, Safari, etc.
    UNKNOWN_BOT      = "unknown_bot"      # No recognizable signal


def is_vera_request(request) -> bool:
    """Returns True if the request carries the minimum Vera header set (Tier 1+).
    Higher tiers additionally validate X-Vera-Token via the SDK — see Authentication."""
    ua = request.headers.get("User-Agent", "")
    return (
        ua.startswith("Vera/")
        and "X-Vera-Client-Version" in request.headers
    )


# Optional: known AI crawler User-Agent patterns for secondary classification.
# Not required to block AI traffic — any non-Vera request can be blocked via
# is_vera_request(). These patterns are useful only if differentiated handling
# is needed (e.g. a separate licensing path for crawlers).
KNOWN_AGENT_PATTERNS = [
    # OpenAI
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    # Anthropic
    "ClaudeBot", "anthropic-ai", "Claude-Web",
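    # Google (note: Googlebot is also Google's search crawler, so blocking
    # AI_AGENT traffic on this pattern also affects search indexing)
    "Googlebot", "Google-Extended", "Gemini",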
    # Meta
    "FacebookBot", "Meta-ExternalAgent",
    # Perplexity
    "PerplexityBot",
    # Cohere
    "cohere-ai",
    # Common Crawl (LLM training data)
    "CCBot",
    # Generic crawlers
    "Diffbot", "omgili", "DataForSeoBot",
]


def classify_request(request) -> RequestorType:
    # 1. Vera browser: validated by header set
    if is_vera_request(request):
        return RequestorType.VERA_HUMAN

    ua = request.headers.get("User-Agent", "")
    ua_lower = ua.lower()

    # 2. Optional: detect known AI crawlers for differentiated handling
    if any(p.lower() in ua_lower for p in KNOWN_AGENT_PATTERNS):
        return RequestorType.AI_AGENT

    # 3. Heuristic: no Accept-Language header → likely bot
    if not request.headers.get("Accept-Language"):
        return RequestorType.UNKNOWN_BOT

    return RequestorType.STANDARD_BROWSER
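
One detail the check above leaves to the web framework is header-name case. HTTP header names are case-insensitive, and HTTP/2 transports deliver them in lowercase, so a membership test against a plain dict can miss a header that is present. The following is a minimal sketch of a case-independent presence check, assuming headers arrive as an ordinary mapping; `normalize_headers` and `has_vera_headers` are hypothetical helper names, and the `Vera/1.0` and `1.0.0` values are illustrative only.

```python
from typing import Dict, Mapping


def normalize_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with lowercased names.

    HTTP header names are case-insensitive, so "x-vera-client-version" and
    "X-Vera-Client-Version" must be treated as the same header.
    """
    return {name.lower(): value for name, value in headers.items()}


def has_vera_headers(headers: Mapping[str, str]) -> bool:
    """Presence check for the minimum Vera header set, independent of header case.

    This mirrors the Tier 1 check only; it does not validate X-Vera-Token.
    """
    h = normalize_headers(headers)
    return (
        h.get("user-agent", "").startswith("Vera/")
        and "x-vera-client-version" in h
    )


# Lowercase names, as an HTTP/2 transport would deliver them.
print(has_vera_headers({"user-agent": "Vera/1.0", "x-vera-client-version": "1.0.0"}))
# A conventional browser request without the Vera header set.
print(has_vera_headers({"User-Agent": "Mozilla/5.0", "Accept-Language": "en"}))
```

Frameworks whose header objects are already case-insensitive do not need the normalization step; it matters mainly when headers are passed around as plain dictionaries.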

Classification Matrix

Signal	Vera Human	AI Agent	Standard Browser	Unknown Bot
`User-Agent: Vera/*` + `X-Vera-Client-Version`	✅	—	—	—
Known AI crawler UA pattern	—	✅	—	—
No `Accept-Language` header	—	—	—	✅
None of the above (fallthrough; the UA string itself is not inspected)	—	—	✅	—

Access Matrix by Traffic Type

Traffic type	Vera headers	Publisher options	Notes
Vera Human	✅ Valid	Clean experience, subscription, pay-per-read	Token validated at Tier 2+
Non-Vera (any)	❌ Absent or invalid	Block entirely (403), or serve standard experience	All non-Vera traffic can be blocked
AI Agent	❌ Absent	Block (403) or license via separate agreement	Optional secondary classification
Standard Browser	❌ Absent	Standard HTML, classic paywall	—
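
The publisher options in this matrix could be expressed as a small policy table keyed by the RequestorType values. The sketch below is illustrative rather than part of the SDK: `Action`, `BLOCK_ALL_NON_VERA`, `DIFFERENTIATED`, and `resolve` are hypothetical names, and the `unknown_bot` entries are an assumption, since the matrix does not list that traffic type.

```python
from enum import Enum
from typing import Dict, Tuple


class Action(Enum):
    SERVE_CLEAN = "serve_clean"        # Vera experience (subscription, pay-per-read)
    SERVE_STANDARD = "serve_standard"  # Standard HTML, classic paywall
    BLOCK = "block"                    # Respond with 403


# Strict policy: everything that is not Vera traffic is blocked.
BLOCK_ALL_NON_VERA: Dict[str, Action] = {
    "vera_human": Action.SERVE_CLEAN,
    "ai_agent": Action.BLOCK,
    "standard_browser": Action.BLOCK,
    "unknown_bot": Action.BLOCK,  # assumption: not listed in the access matrix
}

# Differentiated policy: ordinary browsers still get the standard experience.
DIFFERENTIATED: Dict[str, Action] = {
    "vera_human": Action.SERVE_CLEAN,
    "ai_agent": Action.BLOCK,
    "standard_browser": Action.SERVE_STANDARD,
    "unknown_bot": Action.BLOCK,  # assumption: not listed in the access matrix
}


def resolve(traffic_type: str, policy: Dict[str, Action]) -> Tuple[int, Action]:
    """Map a traffic type to an HTTP status and an action.

    Traffic types missing from the policy are blocked (fail closed).
    """
    action = policy.get(traffic_type, Action.BLOCK)
    status = 403 if action is Action.BLOCK else 200
    return status, action


print(resolve("standard_browser", BLOCK_ALL_NON_VERA))
print(resolve("standard_browser", DIFFERENTIATED))
```

Failing closed on unlisted traffic types is one possible design choice; a publisher that prefers availability over strictness could default to the standard experience instead.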

Blocking non-Vera traffic

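Because Vera headers are injected by the browser at the network layer and cannot be reproduced by arbitrary clients without the signed token, a validated Vera header set provides a reliable origin signal. Note that is_vera_request() as shown checks only header presence (Tier 1), which any client can imitate by setting the same headers; the guarantee therefore depends on X-Vera-Token validation at Tier 2+ (see Authentication). Publishers do not need to maintain UA allowlists or blocklists to achieve full coverage.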
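
As an illustration of blocking at the edge of an application, the following is a minimal WSGI middleware sketch that answers 403 to any request lacking the minimum Vera header set. It assumes a WSGI deployment, which the specification does not prescribe; `vera_only` and `article_app` are hypothetical names. Only header presence is checked here, so X-Vera-Token validation would still need to be added through the SDK.

```python
def vera_only(app):
    """Wrap a WSGI app so requests without the minimum Vera header set get a 403."""

    def middleware(environ, start_response):
        # WSGI exposes request headers as HTTP_* keys, uppercased with
        # hyphens replaced by underscores.
        ua = environ.get("HTTP_USER_AGENT", "")
        has_version = "HTTP_X_VERA_CLIENT_VERSION" in environ
        if not (ua.startswith("Vera/") and has_version):
            body = b"Forbidden: Vera browser required\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Header set present: hand the request to the wrapped application.
        return app(environ, start_response)

    return middleware


def article_app(environ, start_response):
    """Stand-in for the publisher's application."""
    body = b"article content\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = vera_only(article_app)
```

Wrapping the application this way keeps the blocking decision in one place, so no per-route UA lists are needed.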