Overview
ArgusAI is an on-premises AI inference platform designed for industrial operations environments where data cannot leave the facility. This brief covers the deployment architecture: the hardware and software components, how the Model Context Protocol (MCP) server layer connects the AI to operational data, and the configuration choices that affect performance and capability.
Architecture Overview
An ArgusAI deployment consists of four layers:
User Interface: The Ask Argus interface in ArgusIQ, or a custom application using ArgusAI’s inference API. Sends natural language queries; receives answers with source citations.
Inference Server: Receives queries, orchestrates MCP context retrieval, assembles the full prompt (query + context), calls the LLM runtime for inference, returns the response. The inference server is the orchestration layer.
LLM Runtime: The large language model running on GPU hardware. Receives the assembled prompt and generates the response. The LLM runtime has no direct access to operational data — it receives structured context prepared by the MCP server layer.
MCP Server Layer: Domain-specific servers that translate natural language intent into structured queries, retrieve context data from operational data sources (ArgusIQ, external databases, document repositories), and format the context for LLM consumption. Without the MCP servers, the LLM has only its training knowledge — no live operational state.
The LLM Runtime
Model Selection
ArgusAI supports instruction-tuned open-weight models in the 7B–70B parameter range. Model selection affects inference capability and hardware requirements.
Practical capability tiers:
7B models (Mistral 7B Instruct, Llama 3.1 8B Instruct, similar):
- Suitable for straightforward operational queries: current status, recent history, list retrieval
- Response quality degrades for complex multi-step reasoning
- Hardware requirement: single GPU with 16–24 GB VRAM
- Appropriate for: small facilities, limited query complexity, hardware-constrained environments
13B–34B models (Llama 2 13B, Yi 34B, similar):
- Good performance for operational queries with moderate complexity: pattern analysis, comparison across multiple assets, trend summaries
- Better handling of ambiguous queries and follow-up questions in a conversation
- Hardware requirement: single GPU with 40–80 GB VRAM (or dual-GPU configuration)
- Appropriate for: most production deployments
70B models (Llama 3.1 70B, similar):
- Best performance for complex analytical queries, report synthesis, multi-document reasoning
- Hardware requirement: multi-GPU (2–4 GPUs), 80–160+ GB total VRAM
- Appropriate for: large enterprise deployments, demanding analytical workloads
Quantization
Model quantization reduces memory requirements at a modest accuracy cost:
FP16 (half precision): Full model capability. Memory requirement = ~2 bytes per parameter. A 7B FP16 model requires ~14 GB VRAM.
INT8 (8-bit): ~10–15% accuracy reduction on complex tasks, negligible on most operational queries. Memory requirement = ~1 byte per parameter. A 7B INT8 model requires ~7 GB VRAM.
INT4 (4-bit): ~20–25% accuracy reduction on complex tasks. Memory requirement = ~0.5 bytes per parameter. A 7B INT4 model requires only ~3.5 GB VRAM.
For most operational query workloads (current status, maintenance history, alert summary), INT8 quantization provides acceptable quality with significantly reduced hardware requirements. INT4 is appropriate for hardware-constrained environments where the alternative is no AI capability.
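The VRAM figures above follow directly from the bytes-per-parameter ratios. A rough sizing helper (a sketch; these are weight-only lower bounds, since real deployments also need headroom for the KV cache and activations):

```python
# Rough VRAM estimate for model weights at a given quantization level.
# Treat these as lower bounds: the inference engine also reserves memory
# for the KV cache and activations on top of the weights.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Approximate VRAM (GB) needed just for model weights."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1e9

print(weight_vram_gb(7, "fp16"))   # ~14 GB, matching the figure above
print(weight_vram_gb(7, "int8"))   # ~7 GB
print(weight_vram_gb(70, "int8"))  # ~70 GB: multi-GPU territory
```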
Inference Engine
ArgusAI uses vLLM as the inference engine. vLLM’s paged attention architecture provides efficient memory management and batching for concurrent query requests. For hardware-constrained environments, llama.cpp is supported as an alternative inference engine with lower memory overhead at reduced throughput.
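As an illustration only, a vLLM OpenAI-compatible server for a dual-GPU configuration might be launched along these lines (the model name and flag values are examples, not ArgusAI defaults; verify flags against your installed vLLM version):

```shell
# Serve an instruct model across two GPUs.
# --tensor-parallel-size splits the model over both cards;
# --gpu-memory-utilization leaves headroom for the KV cache.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```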
The MCP Server Layer
The Model Context Protocol (MCP) is the bridge between the LLM and operational data. Understanding the MCP architecture is essential for deploying ArgusAI effectively.
What MCP Servers Do
An MCP server is a service that:
- Exposes a set of “tools” — named functions with defined inputs and outputs
- Implements those tools as queries against specific data sources
- Returns structured data that the inference server formats as LLM context
When the inference server processes a query, it determines which tools to call (based on the query’s apparent data needs), calls those tools via the MCP servers, collects the results, and assembles them into a structured prompt context.
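The tool contract can be illustrated without the SDK: each tool is a named function with a declared input schema and a handler that runs the query. Everything below is a hypothetical sketch of that shape (the stubbed handler and its return data are illustrative, not real Asset Hub output):

```python
# Minimal illustration of the MCP tool contract: a named function with a
# declared input schema and a handler that queries a data source.
# Names, schema fields, and data are hypothetical, not the shipped SDK.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict           # JSON-Schema-style argument description
    handler: Callable[..., Any]  # executes the query against the data source

def get_assets_by_filter(asset_type: str, location: str,
                         health_score_max: int) -> list[dict]:
    # A real server would query ArgusIQ Asset Hub here; stubbed for the sketch.
    return [{"name": "Motor-204", "health_score": 62,
             "asset_type": asset_type, "location": location}]

ASSET_TOOL = Tool(
    name="get_assets_by_filter",
    description="List assets matching type/location/health-score filters.",
    input_schema={"asset_type": "string", "location": "string",
                  "health_score_max": "integer"},
    handler=get_assets_by_filter,
)

rows = ASSET_TOOL.handler(asset_type="motor", location="Press Line 2",
                          health_score_max=70)
```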
Example query flow:
User: “Which motors on Press Line 2 have health scores below 70?”
- Inference server identifies needed data: asset health scores, filtered by asset type (motor) and location (Press Line 2)
- Calls asset_hub_mcp.get_assets_by_filter(asset_type="motor", location="Press Line 2", health_score_max=70)
- MCP server queries ArgusIQ Asset Hub for matching assets
- Returns: list of motor assets with names, current health scores, and key metrics
- Inference server includes this data as context in the LLM prompt
- LLM generates the answer with the retrieved data as the factual basis
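The flow above can be sketched as a single orchestration function. All names here are hypothetical stand-ins (in the real inference server, tool selection is itself model-driven rather than keyword-based):

```python
# Hedged sketch of the inference-server loop described above: select tools,
# retrieve context via MCP, assemble the prompt, call the LLM runtime.
# Every function is a hypothetical stand-in, not ArgusAI's actual API.

def select_tools(query: str) -> list[dict]:
    # Decide which MCP tools the query needs (stubbed keyword routing).
    if "health score" in query.lower():
        return [{"server": "asset_hub_mcp", "tool": "get_assets_by_filter",
                 "args": {"asset_type": "motor", "location": "Press Line 2",
                          "health_score_max": 70}}]
    return []

def call_mcp(call: dict) -> list[dict]:
    # Stub for the MCP round trip; a real server dispatches over the protocol.
    return [{"name": "Motor-204", "health_score": 62}]

def answer(query: str, llm=lambda prompt: prompt) -> str:
    context = [row for call in select_tools(query) for row in call_mcp(call)]
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer with citations."
    return llm(prompt)  # the LLM sees only the retrieved context + query

reply = answer("Which motors on Press Line 2 have health scores below 70?")
```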
ArgusIQ MCP Servers
ArgusAI includes pre-built MCP servers for each ArgusIQ module domain:
asset_hub_mcp: Queries Asset Hub for asset records, health scores, telemetry history, baseline statistics, and asset relationships. Supports filtering by asset type, location, health score range, and time period.
cmms_mcp: Queries CMMS for work order records, maintenance history, PM schedules, and parts records. Supports filtering by asset, date range, work order type, and status.
alarm_mcp: Queries Alarm Engine history for alert events, active alerts, and acknowledgment records. Supports filtering by asset, severity, time period, and alert condition.
space_mcp: Queries Space Hub for asset locations, zone assignments, and RTLS location history.
ticketing_mcp: Queries Ticketing for service ticket records, SLA status, and resolution history.
Custom MCP Servers
For data sources outside ArgusIQ — ERP systems, external document repositories, proprietary production management systems — custom MCP servers can be developed using the MCP server SDK. Custom MCP servers follow the same interface contract as the ArgusIQ MCP servers and integrate transparently into the ArgusAI inference pipeline.
Custom MCP server development typically requires: familiarity with the target data source’s API or query interface, Python or TypeScript development capability, and 1–4 weeks of development time per data source depending on complexity.
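As an example of the pattern (not SDK code), a custom server for an ERP spare-parts source wraps the source's query interface behind tools with the same shape as the built-in servers. Everything below, including the ERP client and its data, is hypothetical:

```python
# Hypothetical custom MCP server wrapping an external ERP parts database.
# The ERP client is a stand-in for whatever API the real source exposes;
# the point is the shape: tool name -> query -> structured rows for the LLM.

class ErpClient:
    """Stand-in for a real ERP connection (e.g. a REST or ODBC client)."""
    def query_parts(self, part_number: str) -> list[dict]:
        return [{"part_number": part_number, "on_hand": 4,
                 "lead_time_days": 21}]

class ErpPartsMcpServer:
    def __init__(self, erp: ErpClient):
        self.erp = erp
        # Tool registry: named functions with defined inputs and outputs,
        # matching the contract the ArgusIQ MCP servers follow.
        self.tools = {"get_part_availability": self.get_part_availability}

    def get_part_availability(self, part_number: str) -> list[dict]:
        """Tool: stock level and lead time for a given part number."""
        return self.erp.query_parts(part_number)

server = ErpPartsMcpServer(ErpClient())
rows = server.tools["get_part_availability"]("BRG-6204")
```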
Hardware Specifications
Minimum Production Configuration
Use case: Single facility, < 100 concurrent users, moderate query complexity
Server: Single GPU server, 2U rackmount
GPU: NVIDIA A10 (24 GB GDDR6) or RTX A5000 (24 GB)
CPU: 16-core x86-64 (AMD EPYC or Intel Xeon)
RAM: 128 GB system RAM
Storage: 2 TB NVMe SSD
Model: 7B INT8 or 13B INT4
Standard Production Configuration
Use case: Mid-size facility or multiple facilities sharing one deployment, < 500 concurrent users, standard query complexity
Server: Dual GPU server, 2U or 4U rackmount
GPU: 2× NVIDIA A100 (80 GB) or 2× A30 (24 GB)
CPU: 32-core x86-64
RAM: 256 GB system RAM
Storage: 4 TB NVMe SSD
Model: 13B–34B FP16 or 70B INT8 (dual A100)
High-Performance Configuration
Use case: Large enterprise, 1000+ concurrent users, complex analytical queries
Server: Multi-GPU cluster, 4–8 GPUs
GPU: 4× NVIDIA H100 (80 GB) or 4× A100
CPU: 64-core x86-64
RAM: 512 GB system RAM
Storage: 8 TB NVMe SSD, separate logging storage
Model: 70B FP16 (with NVLink GPU interconnect)
Network Architecture
ArgusAI runs within the facility network with no required external connectivity:
Inbound connections: The inference server accepts HTTPS connections from:
- ArgusIQ (Ask Argus interface sends queries to the ArgusAI endpoint)
- Authorized user workstations (direct API access if configured)
No outbound connections required: ArgusAI makes no outbound connections during operation. Model weights are downloaded once during deployment; operational queries are processed entirely from local resources.
Air-gapped deployment: For facilities with no internet connectivity, model weights are transferred via removable media or authorized file transfer. Once deployed, no external connectivity is needed.
Network segmentation: ArgusAI can be deployed on the same network segment as ArgusIQ, or on a separate segment with a controlled connection to ArgusIQ’s API. In OT/IT-segmented environments, ArgusAI is deployed on the OT network segment (alongside ArgusIQ), not on the corporate IT network.
Inference Latency and Throughput
Typical inference latency for operational queries (not counting MCP context retrieval):
| Model Size | GPU | Tokens/sec | Typical Query Time |
|---|---|---|---|
| 7B INT8 | A10 (24 GB) | 60–80 | 3–8 seconds |
| 13B FP16 | A100 (80 GB) | 40–60 | 5–12 seconds |
| 34B FP16 | 2× A100 | 30–50 | 8–18 seconds |
| 70B FP16 | 4× A100 | 20–35 | 12–25 seconds |
MCP context retrieval adds 0.5–3 seconds per data source queried, depending on query complexity and data volume.
For most operational status queries (< 500 output tokens), total response time including MCP retrieval is 5–20 seconds depending on configuration.
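These response-time bands are simply generation time plus MCP overhead; a back-of-envelope check using the table's own numbers (the 400-token answer and 1.5 s per-source overhead are assumed example values):

```python
# Back-of-envelope response time: generation time (output tokens divided by
# tokens per second) plus per-source MCP retrieval overhead, per the
# figures above. Inputs are illustrative, not measured.
def response_time_s(output_tokens: int, tokens_per_sec: float,
                    mcp_sources: int, mcp_overhead_s: float = 1.5) -> float:
    return output_tokens / tokens_per_sec + mcp_sources * mcp_overhead_s

# A ~400-token answer on a 7B INT8 / A10 setup (~70 tok/s), two MCP sources:
print(round(response_time_s(400, 70, 2), 1))  # ~8.7 s, inside the 5-20 s band
```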
Security Configuration
Authentication: ArgusAI inference endpoint authentication via JWT tokens. Integration with ArgusIQ’s RBAC — user permissions that govern which data ArgusIQ serves are enforced in the MCP servers, not bypassed by the AI interface.
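Enforcing permissions in the MCP layer means a tool handler filters results by the caller's verified claims before anything reaches the LLM. A minimal sketch of that idea, with hypothetical names, claim fields, and data:

```python
# Sketch of RBAC enforcement inside an MCP tool handler: the user's ArgusIQ
# permissions (carried in verified JWT claims) gate which rows are returned,
# so the AI interface cannot widen a user's data access.
# All names, claim fields, and records are hypothetical.

ASSETS = [
    {"name": "Motor-204", "location": "Press Line 2"},
    {"name": "Pump-17",   "location": "Utilities"},
]

def get_assets_for_user(claims: dict) -> list[dict]:
    # Filter at the data layer, using claims from an already-verified token.
    allowed = set(claims.get("allowed_locations", []))
    return [a for a in ASSETS if a["location"] in allowed]

visible = get_assets_for_user({"sub": "jdoe",
                               "allowed_locations": ["Press Line 2"]})
```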
Audit logging: All queries and responses are logged and stored on-premises, never transmitted externally.
Model security: Model weights are stored on encrypted storage, and access to the model files requires the server's encryption key. Weights do not change after initial deployment (no training, no fine-tuning), so the model in production remains the exact artifact that was validated at deployment.
Talk to our team about ArgusAI deployment specifications for your environment.