Technical Architecture and Implementation Specification
This document describes the architecture of MALV, an approach for building AI-powered applications with built-in security, intelligent cost optimization, and edge-native deployment.
Core Technical Features:
MALV handles the infrastructure layer that would typically require 12-18 months and a dedicated team to build, allowing developers to focus on application-specific logic rather than foundational systems.
Building with MALV means working with three distinct architectural layers: the client layer providing user interfaces, the application layer containing domain-specific business logic, and the infrastructure layer providing shared services for security, storage, and deployment. This separation enables independent scaling and deployment of components while maintaining system-wide consistency through well-defined interfaces.
The client layer provides user interfaces for interacting with the system. Two primary implementations exist: a web-based client built with Vite and served as a static single-page application, and a command-line interface for terminal-based interaction. Both clients communicate exclusively with the orchestrator application and load application metadata from the Apps CDN.
Applications are independent Cloudflare Workers implementing specific business capabilities. Each application exposes a set of tools (discrete functions) that can be invoked by the orchestrator. Applications are stateless; all persistent state is managed through the storage service. The orchestrator application holds special status as the coordination point for AI-powered planning and execution.
Infrastructure services provide shared capabilities required by all applications. The token service handles cryptographic signing, the storage service enforces permissions and proxies R2 operations, the event service coordinates publish/subscribe messaging, the hub service manages deployment, and the Apps CDN distributes application assets.
| Component | Technology | Purpose |
|---|---|---|
| Runtime | Cloudflare Workers (V8 Isolates) | Serverless execution environment |
| Storage | R2 Object Storage | S3-compatible object storage with zero egress |
| Cryptography | Ed25519 | High-speed public-key signatures |
| Language | TypeScript (strict mode) | Type-safe application development |
| Build System | Rollup, Vite | Module bundling and optimization |
| AI Models | Claude Sonnet 4.5, GPT-5 | Language model inference |
This section illustrates a complete request lifecycle, from user query through semantic filtering, token-aware planning, tool execution, and response streaming. The flow demonstrates how the architecture's components coordinate to deliver sub-second response initiation while processing complex multi-step operations.
| Phase | Latency | Cost Impact | Key Optimization |
|---|---|---|---|
| Semantic Filtering | ~10ms | 90% reduction | Cosine similarity on cached embeddings |
| Token Signing | <1ms | Negligible | Ed25519 performance + key caching |
| Permission Verification | <0.5ms | Negligible | 3-phase validation with early exit |
| Tool Execution | ~200ms/tool | Variable | Parallel execution when possible |
| AI Planning | ~350ms | 85% of total cost | Smaller context from filtering |
| Streaming Updates | 0ms (async) | None | Server-Sent Events (SSE) |
The security architecture employs Ed25519 public-key cryptography for token signing and verification. Applications authenticate requests using signed JWT tokens that embed permission grants. A three-phase validation system ensures that storage operations are authorized before execution. Automatic key rotation occurs every 30 days, maintaining a rolling window of valid keys for token verification.
Ed25519 was selected for its performance characteristics and security properties. Signatures are generated in under 1ms on edge hardware, and verification completes in under 0.5ms. The algorithm provides 128-bit security with 32-byte public keys and 64-byte signatures, significantly smaller than RSA equivalents.
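Using Node's built-in `crypto` module, the sign/verify flow can be sketched as follows. The payload shape and key handling are illustrative only, not MALV's actual token format:

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Illustrative sketch: sign and verify a compact token payload with Ed25519.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const payload = Buffer.from(JSON.stringify({ sub: "user123", exp: 1700000000 }));

// Ed25519 signs the message directly, so the digest argument is null.
const signature = sign(null, payload, privateKey); // 64-byte signature

const ok = verify(null, payload, publicKey, signature);
console.log(signature.length, ok); // 64 true
```

Note the signature is always 64 bytes and the public key 32 bytes, matching the size advantages over RSA described above.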
Storage access requests undergo validation in three sequential phases, with each phase providing progressively granular control.
New Ed25519 keypairs are generated automatically every 30 days. The token service maintains four active private keys and publishes five public keys (current plus four historical). This overlap window ensures that tokens signed immediately before rotation remain valid during their lifetime. Old keys are archived to R2 but removed from active use.
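The rolling window can be sketched as a small data structure. The `KeyRing` class and its method names are hypothetical; only the four-private/five-public window sizes come from the text:

```typescript
// Hypothetical sketch of the rolling key window: four active private keys,
// five published public keys (current plus four historical). Keys are
// represented by opaque ids; real entries would hold Ed25519 key material.
interface KeyRecord { id: string; createdAt: number }

class KeyRing {
  private keys: KeyRecord[] = []; // newest first

  rotate(id: string, now: number): KeyRecord[] {
    this.keys.unshift({ id, createdAt: now });
    // Everything beyond the five-key verification window is archived
    // (in MALV, archived to R2 and removed from active use).
    return this.keys.splice(5);
  }

  get signingKeys(): KeyRecord[] { return this.keys.slice(0, 4); }
  get publishedKeys(): KeyRecord[] { return this.keys.slice(0, 5); }
}

const ring = new KeyRing();
for (let month = 0; month < 7; month++) ring.rotate(`key-${month}`, month * 30);

console.log(ring.signingKeys.length, ring.publishedKeys.length); // 4 5
console.log(ring.publishedKeys[0].id); // "key-6"
```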
The event system enables asynchronous, loosely-coupled communication between applications through a publish/subscribe pattern. Applications declare events they can emit, and other applications subscribe to receive those events. The event service manages subscription lifecycle, coordinates webhook setup and teardown, and delivers events to subscribers in parallel.
Event source applications implement three handler functions for each event type; these manage the subscription lifecycle, including webhook setup and teardown.
Subscriptions are grouped by deterministic keys generated from token payloads. For example, a Gmail event subscription might generate a key from the Google user ID in the authentication token. This ensures that each user's subscription is independent, enabling per-user webhook configuration and isolated event streams.
The event service implements idempotent subscription operations. Multiple subscription requests with identical parameters result in a single stored subscription. The start handler is invoked exactly once for each unique key, even if multiple subscribers register simultaneously. This property simplifies subscription logic and prevents resource leaks.
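A minimal sketch of this idempotency property, assuming a hypothetical `SubscriptionRegistry` with an injected start handler (all names are illustrative):

```typescript
// Sketch: subscriptions grouped by deterministic keys, with the start
// handler invoked exactly once per unique key.
type StartHandler = (key: string) => void;

class SubscriptionRegistry {
  private subs = new Map<string, Set<string>>(); // key -> subscriber ids
  constructor(private startHandler: StartHandler) {}

  // Deterministic key derived from the token payload (e.g. a Google user id).
  keyFor(payload: { googleUserId: string }): string {
    return `gmail:${payload.googleUserId}`;
  }

  subscribe(key: string, subscriberId: string): void {
    if (!this.subs.has(key)) {
      this.subs.set(key, new Set());
      this.startHandler(key); // invoked exactly once per unique key
    }
    this.subs.get(key)!.add(subscriberId);
  }
}

let starts = 0;
const registry = new SubscriptionRegistry(() => { starts++; });
const key = registry.keyFor({ googleUserId: "g-123" });

// Duplicate registrations collapse into a single stored subscription.
registry.subscribe(key, "app-a");
registry.subscribe(key, "app-a");
registry.subscribe(key, "app-b");
console.log(starts); // 1
```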
The hub service provides a centralized deployment gateway that eliminates the need for application developers to possess Cloudflare credentials or understand deployment mechanics. Applications are packaged as multipart form uploads containing compiled JavaScript, configuration files, and assets. The hub service processes these uploads, generates semantic embeddings for tool discovery, and orchestrates deployment to Cloudflare's edge network.
Applications are packaged as multipart/form-data with the following components:
- `metadata`: JSON object containing app ID, version, domain, and description
- `worker_script`: Compiled JavaScript bundle for Worker execution
- `tools.json`: Tool definitions with schemas and capability requirements
- `tokens.json`: Token type definitions with permission patterns
- `events.json`: Event definitions (optional)
- `examples/*.json`: Usage examples for semantic discovery (optional)

The hub service generates 512-dimensional embeddings using OpenAI's text-embedding-3-small model for three categories of content: tool definitions, application descriptions, and usage examples.
These embeddings enable the orchestrator to perform semantic similarity search during tool discovery, filtering irrelevant capabilities and reducing context size by approximately 90%.
The hub service communicates with Cloudflare's Workers API to deploy applications. This includes uploading the compiled script, configuring environment bindings (R2 buckets, KV namespaces, secrets), and creating routes if custom domains are specified. The hub maintains Cloudflare API credentials, insulating application developers from infrastructure complexity.
The orchestrator application coordinates AI-powered planning and execution. When a user submits a query, the orchestrator performs semantic filtering to identify relevant tools, analyzes token requirements to determine tool availability, uses a two-phase decision process for efficient tool selection and input generation, executes the resulting plan, and streams updates to the client in real-time.
Tool discovery employs cosine similarity between the embedded user query and the embedded tool descriptions. Tools with similarity scores below 0.3 are excluded from the prompt. This filtering typically removes 80-90% of available tools, substantially reducing prompt size and associated API costs while maintaining high recall for relevant capabilities.
The filtering algorithm also considers application-level embeddings and usage example embeddings. If an application's description is semantically relevant, all of its tools receive a score boost. Similarly, if a usage example matches the query, the tools referenced in that example are prioritized.
Rather than asking the AI to simultaneously select tools and generate their inputs, MALV separates these concerns into two phases for improved accuracy and cost efficiency:
- Phase 1 (tool selection): the AI chooses which tools to invoke from compact tool names and descriptions, without seeing full input schemas.
- Phase 2 (input generation): tools with complex schemas (`$defs`, `$ref`, `anyOf`) receive individual AI calls executed in parallel, ensuring focused attention on intricate type structures. Simple tools are batched into a single call for efficiency. Full JSON schemas are provided only in this phase.

Schema complexity is scored based on structural features. Tools scoring above the threshold (e.g., those with recursive types or union schemas) are isolated to prevent malformed inputs:
| Schema Feature | Complexity Impact | Handling |
|---|---|---|
| Simple properties | Low | Batched with other simple tools |
| Nested objects | Medium | May be batched if total score low |
| `$defs` / `$ref` | High | Individual AI call |
| `anyOf` unions | High | Individual AI call |
| Recursive types | Very High | Individual AI call with focused context |
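A hypothetical scorer following this scheme; the features scored come from the table, but the weights and the isolation threshold are assumptions:

```typescript
// Assumed weights: $defs/$ref and anyOf/oneOf score high, nested objects
// add a medium amount. Tools above the threshold get individual AI calls.
function complexityScore(schema: unknown, depth = 0): number {
  if (schema === null || typeof schema !== "object") return 0;
  const s = schema as Record<string, unknown>;
  let score = 0;
  if ("$defs" in s || "$ref" in s) score += 10;   // high
  if ("anyOf" in s || "oneOf" in s) score += 10;  // high
  if (depth > 0 && "properties" in s) score += 3; // nested object: medium
  if ("properties" in s) {
    for (const child of Object.values(s.properties as Record<string, unknown>)) {
      score += complexityScore(child, depth + 1);
    }
  }
  return score;
}

const ISOLATION_THRESHOLD = 8; // assumed value

const simple = { type: "object", properties: { name: { type: "string" } } };
const union = { anyOf: [{ type: "string" }, { type: "number" }] };

console.log(complexityScore(simple) <= ISOLATION_THRESHOLD); // true: batched
console.log(complexityScore(union) > ISOLATION_THRESHOLD);   // true: individual call
```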
Tools declare required tokens in their definitions. The orchestrator analyzes available tokens (provided by the client) and categorizes tools as available or unavailable. Available tools are presented normally in the AI prompt. Unavailable tools are presented with instructions on how to unlock them, typically by invoking an authentication tool. This enables the AI to automatically guide users through authentication flows when necessary.
Tool execution results stream to the client using Server-Sent Events (SSE). The orchestrator emits events for: tool decisions (when the AI selects tools to invoke), tool execution start/end (with streaming logs), token creation (when tools generate new authentication tokens), and final response generation. This provides real-time visibility into system behavior and improves perceived performance.
The combination of semantic filtering and two-phase planning compounds cost savings:
At $3 per million input tokens (Claude Sonnet 4.5 pricing), these optimizations reduce per-request cost from ~$0.06 (naive approach) to ~$0.002, representing a 30x improvement.
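The arithmetic behind these figures, with assumed token counts chosen to match the stated costs:

```typescript
// Token counts are assumptions; only the $3/1M price and the
// $0.06 -> $0.002 endpoints come from the text.
const PRICE_PER_TOKEN = 3 / 1_000_000;

const naiveTokens = 20_000;  // every tool schema in one prompt (assumed)
const optimizedTokens = 667; // after ~90% filtering + two-phase planning (assumed)

const naiveCost = naiveTokens * PRICE_PER_TOKEN;
const optimizedCost = optimizedTokens * PRICE_PER_TOKEN;

console.log(naiveCost.toFixed(3));                   // "0.060"
console.log((naiveCost / optimizedCost).toFixed(0)); // "30"
```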
Unlike traditional AI approaches that limit outputs to text responses, MALV implements a sophisticated object rendering system for rich data visualization. Objects are persistent, renderable data entities created by tools and stored in R2. Each object can be visualized through custom renderers with full lifecycle management, enabling interactive dashboards, data tables, research boards, and other rich UI components.
(Diagram: a tool creates an object via `object.set()`; the object persists in an R2 bucket; renderers are built with capabilities and loaded from `objects/{type}/web.js`.)
Objects follow a reference-based architecture where metadata stores pointers rather than actual data. When a tool creates a table object, it stores a `tableId` reference in the object metadata. The renderer then uses this reference to fetch the actual data through tool calls. This separation keeps object metadata lightweight, lets renderers fetch current data on demand, and allows the underlying data to change without rewriting the object.
Each object type can define custom renderers for different environments. Web renderers are ES modules loaded dynamically from the Apps CDN. CLI renderers produce formatted terminal output. Renderers receive structured parameters including object info, capabilities, and lifecycle hooks.
```typescript
// objects/{type}/web.ts - Web Renderer Signature
export default async function web(
  info: { id: string; name: string; metadata: ObjectMetadata },
  capabilities: { callTool, storage, ai },
  lifecycle: { onDataUpdated, onUnmount }
): Promise<HTMLElement>
```
Renderers can subscribe to lifecycle events for reactive updates:
- `onDataUpdated(callback)`: invoked when the object's underlying data changes, so the renderer can refresh its view
- `onUnmount(callback)`: invoked when the renderer is removed, allowing cleanup of timers and subscriptions
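For contrast with the web signature, a hypothetical CLI renderer might follow the `tableId` reference and emit formatted terminal output. All parameter shapes and the `getTableRows` tool are assumptions:

```typescript
// Sketch of a CLI renderer: fetch row data via a tool call and return
// a formatted string instead of an HTMLElement.
interface ObjectInfo { id: string; name: string; metadata: { tableId?: string } }

async function cli(
  info: ObjectInfo,
  capabilities: { callTool: (tool: string, input: unknown) => Promise<unknown> },
): Promise<string> {
  // Follow the reference in metadata to fetch the actual rows.
  const rows = (await capabilities.callTool("getTableRows", {
    tableId: info.metadata.tableId,
  })) as string[][];
  return [`== ${info.name} ==`, ...rows.map((r) => r.join(" | "))].join("\n");
}

// Usage with a stubbed capability:
const output = await cli(
  { id: "obj_1", name: "Stock Levels", metadata: { tableId: "tbl_1" } },
  { callTool: async () => [["Widget A", "12"], ["Widget B", "40"]] },
);
console.log(output);
```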
Objects can be stored in applications other than the one that created them, enabling team-wide sharing. The storage configuration in objects.json specifies the target app and path template:
```json
{
  "storage": {
    "inApp": "@malv/auth",
    "path": "/teams/<token.team>/objects/",
    "tokenType": "account",
    "tokenFromApp": "@malv/auth"
  }
}
```
This configuration stores objects under the team's namespace in the auth app, making them accessible to all team members regardless of which app created them.
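Path templates like `/teams/<token.team>/objects/` imply a substitution step, which might look like the following sketch (the `resolvePath` helper is hypothetical):

```typescript
// Substitute <tokenType.field> placeholders from token payloads.
function resolvePath(
  template: string,
  tokens: Record<string, Record<string, string>>,
): string {
  return template.replace(/<(\w+)\.(\w+)>/g, (_, tokenType, field) => {
    const value = tokens[tokenType]?.[field];
    if (value === undefined) throw new Error(`missing ${tokenType}.${field}`);
    return value;
  });
}

const resolved = resolvePath("/teams/<token.team>/objects/", {
  token: { team: "team456" },
});
console.log(resolved); // "/teams/team456/objects/"
```

Failing loudly on a missing field (rather than leaving the placeholder in place) prevents writes from silently landing in a literal `<token.team>` directory.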
Traditional AI assistants react to explicit user requests. MALV's perception system enables proactive intelligence by defining contextual conditions that help the AI understand user state and suggest relevant actions. Apps declare "perceptions" that match token presence and storage values, surfacing suggested tasks when conditions are met.
(Diagram: token conditions (`exists` / `absent`) and storage conditions (`equals`, `isEmpty`, `exists`) are evaluated; matched perceptions are injected into the AI prompt as context.)
Perceptions are defined in perception/*.json files within each app. Each perception specifies conditions that must be met and the contextual insight to provide when matched:
```json
{
  "tokens": {
    "exists": { "@malv/inventory": "warehouse" },
    "absent": { "@malv/inventory": "product" }
  },
  "storage": {
    "@malv/inventory": {
      "/teams/<warehouse.teamId>/warehouses/<warehouse.warehouseId>/config.json": {
        "lowStockThreshold": { "operator": "equals", "value": null }
      }
    }
  },
  "perception": "User has a warehouse configured but hasn't set the low stock threshold",
  "tasks": ["Help user define low stock threshold", "Suggest reorder policies"]
}
```
Two types of conditions enable precise state matching:
- Token conditions match on the presence (`exists`) or absence (`absent`) of token types. They are fast to evaluate, requiring only a check of the client-provided token list. Use these to gate perceptions by authentication state or resource selection.
- Storage conditions compare values at storage paths, with path templates substituted from token payloads (e.g., `<warehouse.teamId>`). Storage conditions enable state-aware perceptions like "warehouse exists but threshold is not set."

Storage conditions support multiple comparison operators for flexible matching:
| Operator | Description | Example Use Case |
|---|---|---|
| `equals` | Exact value match (including null) | Check if objective is unset |
| `notEquals` | Value differs from specified | Check if status changed from draft |
| `exists` | Field is present (any value) | Check if configuration exists |
| `notExists` | Field is missing entirely | Detect unconfigured resources |
| `isEmpty` | Array is empty or string is blank | Check if no products added |
| `isNotEmpty` | Array has items or string has content | Check if inventory has items |
When perceptions match, their suggested tasks are presented to the AI as actionable next steps. This transforms the assistant from purely reactive to contextually proactive:
AI sees in prompt:
```
Current Context:
User has warehouse "West Coast Distribution Center" but the low stock threshold is not configured.
Suggested Actions: Help user define low stock threshold, Suggest reorder policies
```
Beyond semantic tool discovery, MALV extends vector-based search to storage data itself. When applications write to storage, embeddings are automatically generated in the background, enabling AI-powered discovery of relevant data across all applications. This transforms storage from a simple key-value system into an intelligent data layer.
(Diagram: writes via `storage.put(path, data)` enter a background queue that generates embeddings and indexes them with security keys; searches via `/search?q=...&types=storage` run semantic matching, filter results by security keys, and fetch full data with `storage.get(path)`.)
When apps configure storage paths for searchability, the infrastructure automatically processes writes through an embedding pipeline:
An embedding model (`bge-base-en-v1.5`) creates 768-dimensional vectors from the written content, which are indexed along with the item's security key.

This process runs asynchronously, ensuring write latency is not affected by embedding computation.
Semantic search respects the same permission model as direct storage access. Each indexed item includes security keys derived from token payloads. Search requests specify which security keys the user holds, and results are filtered to include only accessible data:
```
// Search request with security keys
GET /search?q=project%20goals&types=storage&securityKeys=["account:user123","team:team456"]

// Only returns data where indexed securityKey matches one of the provided keys
```
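The filtering step itself reduces to a set-membership check over the caller's security keys. The record shapes below are assumptions:

```typescript
// Filter semantically matched items down to those the caller may access.
interface IndexedItem { path: string; securityKey: string; similarity: number }

function filterBySecurityKeys(items: IndexedItem[], held: string[]): IndexedItem[] {
  const allowed = new Set(held);
  return items.filter((i) => allowed.has(i.securityKey));
}

const items = [
  { path: "/teams/team456/goals.json", securityKey: "team:team456", similarity: 0.91 },
  { path: "/teams/team999/goals.json", securityKey: "team:team999", similarity: 0.88 },
];
const visible = filterBySecurityKeys(items, ["account:user123", "team:team456"]);
console.log(visible.map((i) => i.path)); // ["/teams/team456/goals.json"]
```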
Semantic storage search operates across application boundaries, enabling powerful cross-app queries. An inventory app can discover relevant data from an orders app, or a reporting tool can find metrics from multiple data sources.
Search queries can specify location constraints to scope results to specific apps or paths. This enables targeted searches within a project namespace or across a specific application's data:
```
// Search within a specific warehouse
GET /search?q=stock&types=storage&locations={"@malv/inventory":"/teams/team1/warehouses/wh1"}

// Search across all inventory data
GET /search?q=products&types=storage&locations={"@malv/inventory":"*"}
```
Search results include the storage path, source application, similarity score, and a content preview. The AI or application can then fetch the full data using standard storage operations:
```json
{
  "query": "low stock alerts",
  "types": ["storage"],
  "results": [
    {
      "type": "storage",
      "appName": "@malv/inventory",
      "path": "/teams/team1/warehouses/wh1/products/prod_001.json",
      "preview": "Widget A stock level below threshold, 12 units remaining...",
      "similarity": 0.89,
      "securityKey": "f8d2c9b1a5f3e7d9"
    }
  ]
}
```
Long conversations accumulate "behavioral gravity" - directive language, persuasive framing, and emotional tone that can bias AI responses when reflecting on history. MALV's background summarization system strips this gravity by converting messages into neutral, factual summaries that compress context while preserving essential information.
When conversation history is passed to an AI, the language patterns in that history influence future responses. Phrases like "I really need," "this is critical," or "you should" create implicit pressure. Over long conversations, this pressure accumulates and biases the AI toward the tone and directives of earlier messages rather than the current request.
MALV applies different summarization strategies based on message role: user messages are rewritten as neutral third-person summaries, while assistant messages are condensed into first-person summaries.
Summarization runs as a background job, queued after each response completes. This ensures summarization latency never impacts user-facing response time:
```typescript
// Queued automatically after respond tool completes
await capabilities.tool.queue('@malv/orchestrator', 'summarize', {
  conversationId,
  assistantMessagePath
}, {
  maxBatchTimeout: 5000, // Wait up to 5s for more items
  maxBatch: 5 // Process up to 5 messages together
});
```
Summaries enable efficient context compression for long conversations. When conversation history approaches token limits, the orchestrator can substitute summaries for older messages:
| Message Type | Original Tokens | Summary Tokens | Compression |
|---|---|---|---|
| User message (avg) | ~150 | ~40 | 73% |
| Assistant message (avg) | ~400 | ~80 | 80% |
| Tool results | ~200 | ~50 | 75% |
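Applying these averages to a hypothetical 25-message history shows the aggregate effect (the message counts are assumptions):

```typescript
// Per-message averages from the compression table.
const original = { user: 150, assistant: 400, tool: 200 };
const summary = { user: 40, assistant: 80, tool: 50 };

// A hypothetical history: 10 user turns, 10 assistant turns, 5 tool results.
const counts = { user: 10, assistant: 10, tool: 5 };

let before = 0, after = 0;
for (const role of ["user", "assistant", "tool"] as const) {
  before += counts[role] * original[role];
  after += counts[role] * summary[role];
}
console.log(before, after); // 6500 1450
console.log(`${Math.round((1 - after / before) * 100)}% compression`); // "78% compression"
```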
Summaries are stored alongside original messages, enabling flexible retrieval:
```
/conversations/{id}/messages/
├── msg_001.json          # Original user message
├── msg_001-summary.json  # Third-person summary
├── msg_002.json          # Original assistant message
├── msg_002-summary.json  # First-person summary
└── ...
```
| Metric | Value | Notes |
|---|---|---|
| Token Signing Latency | < 1ms | Ed25519 signature generation |
| Token Verification Latency | < 0.5ms | Ed25519 signature verification |
| Cold Start Time | < 50ms | V8 isolate initialization |
| Global Latency (P50) | < 50ms | Edge network proximity |
| Embedding Generation | ~100ms | Per application at publish time |
| Semantic Filtering | < 10ms | Cosine similarity computation |
| Parameter | Value |
|---|---|
| Signature Algorithm | Ed25519 (128-bit security) |
| Public Key Size | 32 bytes |
| Signature Size | 64 bytes |
| Key Rotation Period | 30 days |
| Active Private Keys | 4 |
| Published Public Keys | 5 (current + 4 historical) |
| Token Format | JWT (compact serialization) |
| Resource | Limit | Notes |
|---|---|---|
| Worker CPU Time | 50ms (free), 15s (paid) | Per invocation |
| Worker Memory | 128 MB | Per isolate |
| R2 Object Size | 5 TB | Single object maximum |
| Request Rate | Unlimited | Auto-scaling |
| Storage Capacity | Unlimited | R2 bucket capacity |
| Parameter | Value |
|---|---|
| Model | text-embedding-3-small |
| Dimensions | 512 |
| Distance Metric | Cosine similarity |
| Relevance Threshold | 0.3 |
| Cost per Embedding | $0.00002 per 1K tokens |
The MALV architecture provides significant technical benefits through its design choices. By handling infrastructure concerns at the architecture level, developers can focus on application logic while benefiting from optimizations that would otherwise require substantial engineering effort to implement.
| Component | Building from Scratch | Using MALV | Difference |
|---|---|---|---|
| Infrastructure Engineering | 2-3 senior engineers (dedicated) | 0 (handled by MALV) | Eliminated |
| Time to First Deployment | 12-18 months | 2-4 weeks | ~95% reduction |
| Security Maintenance | Ongoing engineering overhead | Automated (key rotation, permissions) | Automated |
Embedding-based filtering reduces token usage by filtering irrelevant tools before they reach the LLM. This optimization requires sophisticated embedding generation, caching, and similarity scoring infrastructure.
Ed25519 signature verification with automatic key rotation provides strong security guarantees without developer overhead. This approach requires cryptography expertise and key management infrastructure to implement correctly.
Running on Cloudflare Workers eliminates traditional infrastructure concerns: no containers to manage, no orchestration complexity, no idle resources consuming budget, and no data egress fees.
Three-phase permission validation with template substitution enables secure multi-tenancy through JSON configuration rather than custom code. This eliminates a common source of security vulnerabilities.
The following table compares MALV's architectural choices against common alternatives:
| Capability | Common Approach | MALV Approach |
|---|---|---|
| Tool Discovery | Send all tools to LLM | Semantic filtering (90% reduction) |
| Authentication | API keys or JWT | Ed25519 signatures with auto-rotation |
| Permissions | Manual ACLs per tool | Declarative templates, centrally enforced |
| Deployment | Docker + Kubernetes | Edge Workers (zero config, global scale) |
| Tool Integration | Custom per integration | Standardized tool interface |
| Event System | Manual webhook management | Automatic lifecycle (start/stop handlers) |
| Resource Usage | Always-on containers | Pay-per-request execution |
Building equivalent infrastructure from scratch typically requires a dedicated infrastructure team and 12-18 months of engineering before the first deployment.
The MALV architecture handles this infrastructure, allowing developers to focus on building application-specific functionality.