Edge Agent Architecture: Full Stack Reference Guide
An edge agent architecture consists of layered subsystems — a data ingestion layer, an agent runtime, a local inference engine, a state and memory layer, an action layer, and an optional hybrid sync layer — all coordinated to enable autonomous, locally-grounded AI decisions at or near the data source.
This page provides a complete reference architecture. Each component is described independently so teams can adapt the stack to their hardware constraints and use-case requirements.
Top-Level Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ CLOUD / ON-PREM DC │
│ ┌──────────────┐ ┌────────────────┐ ┌───────────────────┐ │
│ │ Cloud Agent │ │ Model Registry │ │ Cross-Asset Store │ │
│ │ (Frontier LLM│ │ (Versions, │ │ (Event history, │ │
│ │ reasoning) │ │ Manifests) │ │ KPIs, Policies) │ │
│ └──────┬───────┘ └────────┬───────┘ └─────────┬─────────┘ │
└──────────┼───────────────────┼────────────────────┼────────────┘
│ HTTPS / MQTT over │ TLS Delta │ Sync
│ WAN (async) │ Sync │
┌──────────┼───────────────────┼────────────────────┼────────────┐
│ │ EDGE GATEWAY / AGENT HOST │ │
│ ┌──────▼───────────────────────────────────────▼──────────┐ │
│ │ AGENT ORCHESTRATOR │ │
│ │ Task queue │ Tool router │ Context builder │ Planner │ │
│ └──────┬──────────────┬────────────────────────────────────┘ │
│ │ │ │
│ ┌──────▼──────┐ ┌─────▼──────────┐ ┌──────────────────────┐ │
│ │ LOCAL LLM │ │ VECTOR DB / │ │ ACTION LAYER │ │
│ │ INFERENCE │ │ RAG CORPUS │ │ OPC UA write │ │
│ │ (llama.cpp │ │ (Qdrant, │ │ MQTT publish │ │
│ │ Ollama, │ │ ChromaDB, │ │ REST API call │ │
│ │ OpenVINO) │ │ Milvus) │ │ Dashboard update │ │
│ └─────────────┘ └────────────────┘ └──────────────────────┘ │
│ │ │
│ ┌──────▼───────────────────────────────────────────────────┐ │
│ │ DATA INGESTION LAYER │ │
│ │ OPC UA Client │ MQTT Sub │ Modbus Poller │ S7 Reader │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
Field Devices: PLCs, Sensors, Drives, Vision Systems
Component Descriptions
Data Ingestion Layer
The ingestion layer is the bridge between field data and the agent. It abstracts the industrial protocol complexity so the agent runtime works with normalized, typed data objects regardless of whether the source is OPC UA, Modbus TCP, or MQTT.
Key responsibilities:
- Subscribe to OPC UA nodes (subscriptions, not polling, where possible)
- Poll Modbus TCP holding registers on configurable intervals
- Subscribe to MQTT topics from field SCADA or broker
- Normalize all incoming data to a common schema (tag name, value, timestamp, quality code)
- Apply dead-band filtering to suppress noise before forwarding to the agent
A common implementation uses Eclipse Milo (Java/Kotlin OPC UA client), pymodbus, or the paho-mqtt Python client. For higher-performance deployments, open-source bridges like Eclipse Kura or custom C++ adapters are used.
Agent Orchestrator
The orchestrator is the core control loop. It:
- Receives events from the ingestion layer or a scheduler
- Builds a context object (relevant recent readings, agent memory, retrieved documents)
- Routes the context to the appropriate tool (LLM, classifier, rules engine, external API)
- Interprets the tool’s output and decides on actions
- Executes actions via the action layer
- Persists the decision and its outcome to the local state store
In Python-based implementations, frameworks like LangChain, LlamaIndex, or custom asyncio loops serve as the orchestrator. The orchestrator maintains a local event loop independent of cloud connectivity.
Local LLM Inference
The inference component serves model requests from the orchestrator. In production industrial deployments, the recommended setup is a locally running inference server rather than a direct library call — this allows the model to be swapped without redeploying the agent code.
Recommended serving configurations:
| Hardware | Inference Server | Recommended Model |
|---|---|---|
| Intel x86 industrial PC, 16 GB RAM | Ollama (OpenVINO backend) or llama.cpp server | Qwen3-4B Q4_K_M or Phi-4-mini |
| NVIDIA Jetson AGX Orin | Ollama (CUDA) or TensorRT-LLM | Llama 3.3 8B Q4 |
| ARM DIN-rail gateway, 8 GB RAM | llama.cpp server (CPU) | Phi-4-mini Q4_K_M or SmolLM3 |
| Industrial PC + discrete GPU (8 GB VRAM) | Ollama CUDA or llama.cpp CUDA | Llama 3.3 8B or Gemma 3 9B |
Vector DB / RAG Corpus
The local vector database stores embedded chunks of machine documentation, historical fault logs, standard operating procedures, and any domain knowledge the agent needs to retrieve at inference time.
Component choices:
- Qdrant — Rust-based, embeddable, excellent performance, good fit for industrial edge
- ChromaDB — Python-native, easy to embed, lower performance ceiling
- Milvus Lite — Embedded mode of Milvus; appropriate for single-node deployments
- SQLite + sqlite-vss — Minimal footprint option for very constrained hardware
The embedding model runs locally. Common choices: nomic-embed-text (768-dim, good quality), all-MiniLM-L6-v2 (384-dim, faster), or a domain-fine-tuned encoder.
Action Layer
The action layer executes decisions made by the orchestrator. Actions are categorized by authorization level:
| Action Class | Authorization Required | Examples |
|---|---|---|
| Read / observe | None | OPC UA read, historian query |
| Inform / notify | None | Dashboard update, email alert, MQTT publish to status topic |
| Recommend | None | Advisory text to operator UI |
| Actuate (low risk) | Operator acknowledgment | Write a setpoint within pre-approved bounds |
| Actuate (high risk) | Formal approval workflow | Shutdown sequence, mode change |
This tiered authorization model aligns with IEC 62443 access control requirements and prevents autonomous actuation in safety-relevant contexts.
State and Memory Layer
Edge agents require persistent state across restarts. A lightweight embedded database (SQLite is standard) stores:
- Short-term conversational memory (last N interactions)
- Long-term event log (structured records of agent decisions)
- Outbox queue (unsynced events for deferred upload)
- Agent configuration and tool manifests
For multi-agent setups, a shared Redis instance on the local network enables inter-agent memory sharing.
Model Registry
The model registry is a lightweight service (local or cloud-hosted) that tracks:
- Current deployed model version per device
- Available model updates and their integrity hashes
- RAG corpus version and document manifest
The edge agent polls the registry on a configurable interval. Updates are downloaded, hash-verified, and staged before being activated during a maintenance window. This prevents mid-session model swaps that could change agent behavior unpredictably.
Observability Layer
Production edge agents require observability. Key telemetry:
- Inference latency per request (P50, P95, P99)
- Token throughput (tokens/second)
- RAG retrieval latency and hit rate
- Action execution success/failure counts
- Outbox queue depth (connectivity health proxy)
- Model version in use
Lightweight options: Prometheus + Grafana (local), or structured JSON logs shipped to a central ELK stack when connectivity is available.
Related Pages
- Edge Agent Orchestration
- Industrial Edge Agents
- Local LLMs for Edge Devices
- Hybrid Edge-Cloud Agents
Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.
FAQ
How much disk space does a full edge agent stack require? A practical estimate: OS + runtime = 5–10 GB; model weights (7B Q4_K_M) = 4–5 GB; vector DB + corpus = 1–10 GB depending on documentation volume; agent code = <500 MB. Total: 15–30 GB on the edge device. A 64 GB SSD is a comfortable minimum.
Does the agent orchestrator need to be multi-threaded? Yes. The ingestion layer, LLM inference server, and action layer should run asynchronously so that slow inference does not block sensor monitoring. Python asyncio with a separate subprocess for the inference server is the common pattern.
Can the local vector database handle millions of document chunks? Qdrant and Milvus can handle millions of vectors on a single node with moderate RAM. For typical industrial deployments (machine manuals, fault histories, SOPs), 100K–500K chunks is a realistic corpus size, which all embedded vector databases handle comfortably.
How do you handle model version drift across many edge devices? The model registry pattern addresses this. Each device reports its current model version at sync time. The registry maintains a version matrix and can flag devices that are behind. A rollout policy (e.g., canary: update 5% of devices first, monitor, then roll out) is implemented in the registry service.
What is the restart behavior if the edge node loses power mid-inference? The inference server should be configured as a system service with automatic restart. The agent orchestrator maintains durable state in SQLite before invoking actions, so an interrupted inference results in a retry of the current task, not data loss.