Methodologies and practices for building AI systems: approaches such as RAG, prompt engineering, agent design patterns and evaluation. The “how” of AI development.

Adopt

Mature, well-supported approaches ready for production use.

Classical ML

Classical machine learning approaches such as random forests, gradient boosting (XGBoost, LightGBM), linear/logistic regression and support vector machines remain the best balance of explainability and efficiency for structured data problems. These techniques routinely outperform more complex approaches on tabular data while training faster and costing less to run.

Realising these benefits requires quality training data and staff with appropriate expertise. Unlike LLM-based solutions that have democratised AI for organisations without data science teams, classical ML demands specialised knowledge in feature engineering and model selection. For organisations with the necessary capabilities, these methods work well even with smaller enterprise datasets, matching or exceeding the performance of more complex approaches while remaining more interpretable and easier to maintain.

RAG

Retrieval-Augmented Generation (RAG) combines search and text generation to produce more accurate responses, grounding them in real data and reducing confabulation. It is valuable where accuracy and traceability matter, such as customer service or compliance. Implementation requires attention to document processing and embedding strategies, but tooling has lowered the barriers.

We’re watching techniques such as Self-RAG, which prompts the model to gather more evidence or refine its responses; early results suggest it reduces confabulations further.

RAG introduces an indirect prompt injection attack surface. Retrieved documents are injected into model context, so adversarial content in any document reaches the prompt via the retrieval step. Retrieval access controls and provenance tracking for ingested documents help mitigate this risk.

See also: Cross-encoder reranking, Structured RAG, Hypothetical document embeddings (HyDE).

LLM-as-a-judge

LLM-as-a-judge has proven one of the most practical techniques for evaluating AI system outputs. Today’s strongest models provide nuanced, multidimensional critique that simpler evaluation methods cannot match, except for very constrained metrics such as exact match or BLEU scores.

The technique is widely adopted in both offline and online evaluation. Offline, it scales far better than human assessment, allowing teams to test thousands of outputs quickly. Online, an LLM judge can evaluate another LLM’s output in real-time, enabling dynamic workflow adjustments based on quality assessments.

Research demonstrates that frontier models provide judgements correlating strongly with human preferences across many evaluation dimensions. We recommend using a different LLM as the judge than the one being evaluated, and viewing this as an augmentation to human evaluation rather than a replacement. The strongest LLMs can identify nuanced issues in reasoning and factuality that would otherwise require substantial human review time.

BERT variants

Bidirectional Encoder Representations from Transformers (BERT) revolutionised NLP by processing words in relation to their entire context rather than sequentially. The family has continued to evolve, with ModernBERT the latest iteration, improving training times and accuracy through architectural updates.

BERT-style models serve a different purpose from generative models. Where GPT generates text, BERT models are optimised for understanding tasks such as classification and sentiment analysis. They are also the basis for the semantic vector embeddings that RAG systems use to retrieve relevant context for generative models.

We recommend DeBERTa for new NLP projects, as it handles word relationships more effectively using a disentangled attention mechanism. DistilBERT is smaller and faster whilst retaining most performance, valuable for production deployments with strict latency requirements. Domain-specific variants exist for biomedical (BioBERT) and financial text (FinBERT), though these require expertise to use effectively.

Few-shot prompting

Providing examples to guide model responses has proven consistently effective across Large Language Models.

This is shifting. As models become more capable, interactive multi-turn approaches are gaining favour: rather than providing examples upfront, practitioners prompt models to ask clarifying questions and iterate toward a solution. The pattern often produces better results, particularly in agentic workflows where the model can refine its approach.

Few-shot prompting retains an important role in non-interactive contexts. System prompts and automated pipelines do not afford clarifying dialogue, and well-chosen examples remain the most effective way to establish output format and domain conventions. We typically see diminishing returns beyond 3-5 examples, with token consumption as the main trade-off.

Agentic tool use

We’ve moved agentic tool use to the Adopt ring for local, sandboxed environments. AI coding assistants that can edit files, run tests, execute shell commands and perform web searches deliver considerably more value than those limited to conversation.

The ecosystem has matured to support this. Standards such as MCP and OpenAI’s Function Calling provide reliable integration patterns, while improved observability tooling lets teams monitor what agents are doing. The Development Containers specification makes it straightforward to isolate agent execution.

The risks magnify for applications accepting external user input. Prompt injection attacks remain an unsolved problem. An agent that safely edits files for a developer becomes a liability when processing untrusted input. Our recommendation: adopt for local developer tooling and internal workflows, but proceed with caution for customer-facing systems, treating each tool permission as a potential attack vector.

See also: Visual computer use agents, Model Context Protocol, Temporal.

Spec-driven development

Software engineers working agentically are finding that natural-language prompts are not enough to constrain the behaviour of an AI system. Prompts carry ambiguities that the model resolves silently, and the implementation that emerges may differ from the one the engineer thought they had asked for. Spec-driven development has gained popularity as the response: an agentic engineering practice where a written specification, rather than a conversation, becomes the contract the AI works against.

The practice has moved into the mainstream of agentic engineering over the past year. GitHub Spec Kit and Amazon’s Kiro are the most visible tools: Spec Kit is agent-agnostic, while Kiro ships a dedicated specs mode that generates EARS notation by default. Both treat structured markdown as the source of truth, with code generated or verified against it. A parallel revival of Gherkin and behaviour-driven development sits in the same middle ground between prose and formal language.

The direction we find most exciting goes further. A specification written in natural-language markdown, even one structured in EARS or Gherkin form, only moves the ambiguity problem one step back. The prose may be richer than a prompt, but it still relies on the model to disambiguate it. We are following formal specification languages instead, including our own Allium, an open-source language that captures system behaviour in a structured, machine-readable form. Allium is early in its life and external uptake remains nascent, but in our own work it gives AI agents an unambiguous reference and lets engineers detect contradictions before any code is written.

Trial

Promising approaches with growing adoption, worth exploring for teams ready to invest in emerging patterns.

Cross-encoder reranking

Cross-encoder reranking enhances AI search and chat systems by examining initial search results more carefully. While embedding search is fast and good at finding broadly relevant content, cross-encoder reranking excels at understanding subtle relevance signals by examining the query and potential results together.

Most teams use a two-step process: embedding search finds 50-100 potentially relevant items, then cross-encoder reranking sorts these candidates to surface the most relevant. The technique often reduces confabulations in downstream LLM responses by ensuring higher quality context selection. Implementation has become straightforward with libraries such as sentence-transformers providing ready-to-use models. Teams should be mindful of the additional latency and may need to tune the number of candidates based on performance requirements.

Ontologies for AI grounding

As AI systems scale beyond isolated experiments, shared meaning becomes critical infrastructure. Ontologies provide what LLMs lack: authoritative definitions of entities and relationships that don’t shift with statistical probability. They ground responses in agreed definitions, enable knowledge graph traversal that pure RAG cannot achieve and support the structured outputs that agentic systems require.

Traditional ontology development tends toward two failure modes: academic approaches aiming for formal completeness using OWL, and pragmatic approaches creating spreadsheets that grow unmaintainable. The key is to start lightweight and formalise selectively. Mark Burgess argues that traditional ontologies impose rigid hierarchies that don’t match how language models represent meaning, proposing alternative graph structures designed to work with vector embeddings. For organisations needing to ground AI in domain knowledge today, ontologies offer a practical path with mature tooling.

Graph databases such as Neo4j provide accessible implementation options, while LinkML offers YAML-based modelling without deep ontology expertise. Start with a painful, high-value domain rather than attempting to model the entire organisation.

See also: LinkML, Neurosymbolic AI, Prolog.

Model distillation & synthetic data

Model distillation involves training a smaller, more efficient model to mimic a larger one. A common pattern uses LLMs to generate synthetic training data for the smaller model: the large LLM acts as a “teacher”, creating diverse examples that help the “student” learn desired behaviour. This makes AI deployment more practical for edge devices or resource-constrained environments.

We’re keeping it in Trial because the process requires considerable expertise. Teams need to validate the quality of generated training data and ensure the distilled model maintains acceptable performance. There is ongoing debate about amplification of biases through this approach.

Check the licence of models used for distillation. Llama forbids using its output to train other models.

UMAP

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that has gained substantial traction in the AI community. While t-SNE has been the go-to choice for visualising high-dimensional data, UMAP offers better preservation of global structure and runs significantly faster, making it valuable for large-scale AI applications such as exploring embedding spaces and analysing neural network activations.

UMAP’s parameters need careful tuning to avoid misleading visualisations.

The Python UMAP library provides extensive documentation and explanation, with implementations also available for Rust, Java and R.

Claude Skills

Claude Skills are reusable prompt templates that codify workflows and domain expertise into repeatable patterns for AI coding assistants. Our teams have found them valuable for drafting proposals, structured debugging, generating commit messages and writing PR descriptions. The common thread is tasks that benefit from consistent approach and structured output.

Skills provide a simpler solution than MCP servers for many problems. Where MCP requires implementing a server and managing the protocol lifecycle, Skills are markdown files that encode expertise directly. Skills work well with data that exists as files in your project, since the AI assistant can already read those. MCP extends reach to running services and systems beyond filesystem access.

Structured RAG

Structured RAG extends basic RAG by organising retrieved knowledge as graphs, schemas or typed records rather than flat text chunks. Microsoft’s GraphRAG uses an LLM to build a knowledge graph from source documents during indexing, then queries that graph at retrieval time. This addresses a weakness in standard RAG: questions requiring synthesis across many documents rather than finding a single relevant passage.

GraphRAG has matured since our last radar. LazyGraphRAG reduced indexing costs to a fraction of the original, removing the biggest barrier to adoption. Neo4j provides dedicated examples combining graph-based retrieval with LLM generation.

The trade-off remains upfront investment. Graph-based indexing requires more compute and design than vector-based RAG, and the knowledge graph must be maintained as source documents change. If your queries are primarily about finding relevant passages, standard RAG with cross-encoder reranking may suffice. If they require reasoning across documents, structured approaches justify their cost.

Assess

Emerging or specialised approaches that warrant investigation for specific use cases, but require careful evaluation before adoption.

Neurosymbolic AI

Neurosymbolic AI combines neural networks with symbolic reasoning to address fundamental limitations of pure LLM approaches. Neural networks excel at pattern recognition and handling ambiguity, while symbolic AI provides logical reasoning and explainable inference. LLMs understand natural language well but cannot guarantee rule compliance or explain their reasoning in auditable ways.

Fundamentally, this is an architectural problem. LLMs operate through probabilistic pattern matching over language, not causal modelling. As Mark Burgess argues in his work on semantic spacetime, language models “paraphrase intentional knowledge” rather than tracing actual causal chains. Precise answers to precise questions require systems that explicitly encode what causes what.

This matters most in regulated sectors. Regulatory rules are non-negotiable constraints, not suggestions a model can approximate. Risk models need to know what entities are and how they relate, and compliance requires explainable decision trails. Similar pressures apply across financial services, healthcare and insurance.

Practical implementations range from lightweight to sophisticated. On the simpler end, teams constrain LLM outputs to valid ontology terms or use knowledge graphs to ground RAG retrieval. More advanced implementations use symbolic reasoning engines to validate LLM-generated conclusions. Renewed interest in Prolog reflects exploration of logic programming alongside LLMs.

We’ve placed this in Assess because production patterns are still emerging, but organisations in regulated sectors should be experimenting now.

See also: Prolog, Ontologies for AI grounding, Agentic tool use, World models.

World models

World models sit in the Assess ring as an emerging alternative to pure language model architectures for tasks requiring causal reasoning and planning. Where LLMs predict the next token based on statistical patterns in text, world models build internal representations of how environments behave, enabling systems to simulate outcomes before acting.

The field is developing along several paths. Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) learns by predicting missing information in an abstract embedding space rather than reconstructing raw pixels or tokens. Meta’s V-JEPA and VL-JEPA extend this to video and vision-language tasks with significantly fewer parameters than autoregressive alternatives. Karl Friston’s active inference framework, implemented by Verses AI in their AXIOM system, takes a different approach rooted in how biological systems model their environments. Rather than chasing reward signals, active inference agents build generative models and act to minimise prediction error, with Verses reporting 60% performance improvement using only 3% of comparable deep learning compute. Generative world models form a third strand, with NVIDIA Cosmos and Google DeepMind’s Genie 3 creating physically plausible simulated environments for training robots and autonomous systems.

For financial services, MarS from Microsoft Research demonstrates the pattern applied to market simulation, generating realistic interactive market scenarios for forecasting and anomaly detection without real capital at risk. The paper was accepted at ICLR 2025.

The enterprise value: these approaches offer causal modelling rather than statistical pattern matching. An LLM asked “what happens if I do X?” can only paraphrase similar scenarios from its training data. A world model can simulate the consequences. For teams wanting to experiment, Meta’s V-JEPA 2 and NVIDIA Cosmos models are available on HuggingFace under permissive licences.

See also: Neurosymbolic AI, Physical AI and robotics foundation models.

LLM reproducibility

Large language models are non-deterministic even at temperature zero. This presents a fundamental challenge for regulated industries where Model Risk Management frameworks require reproducible, auditable decision-making. Banking regulations such as OCC/SR 11-7 assume a level of model stability that generative AI does not provide.

The underlying cause extends beyond floating-point arithmetic. Research demonstrates that batch-dependent kernel operations cause outputs to vary with server load rather than input alone. Smaller open weight models on controlled infrastructure tend to achieve more reproducible outputs than larger models served via shared APIs. Where stochastic behaviour is acceptable, the variation must be well-characterised so it can be explained to regulators as a designed property rather than an infrastructure artefact. Prompts and model versions should be treated as versioned code with change control and rollback procedures.

For teams requiring determinism from larger models, SGLang now offers deterministic inference building on batch-invariant operators, with the underlying research selected for oral presentation at NeurIPS 2025. Teams subject to MRM requirements should be actively evaluating their options now.

See also: Neurosymbolic AI, LLM-as-a-judge.

Hypothetical document embeddings (HyDE)

HyDE (Hypothetical Document Embeddings) addresses a common problem in search systems: poor performance when searching content that differs from training data. HyDE asks a large language model to imagine what an ideal document answering the query might look like, bridging the gap between how users ask questions and how information is written.

The system creates several hypothetical documents, converts them into embeddings and blends them together. This averaged representation finds real documents that are mathematically similar, often leading to more relevant results than traditional methods. The approach is particularly effective within RAG systems where accurate retrieval is crucial. Teams should evaluate HyDE for cases where high-precision retrieval is needed and the additional latency is acceptable.

See also: RAG, BERT variants, Cross-encoder reranking.

Fine-tuning with LoRA

Low-Rank Adaptation (LoRA) makes model customisation more practical by adding a small set of trainable parameters while keeping the original model unchanged, reducing computing requirements by 3-4 orders of magnitude while maintaining most of the performance of full fine-tuning.

Tools such as Lightning AI’s lit-gpt and axolotl support implementation. We place it in Assess rather than Trial because successfully applying LoRA still requires significant ML expertise and careful attention to training data quality. Fine-tuning ties you to a specific model architecture, and given the pace of AI advancement, tomorrow’s general-purpose models may outperform your carefully tuned older models. Migrating fine-tuned weights between architectures is particularly challenging. LoRA should only be deployed when the immediate business value clearly outweighs the technical and opportunity costs.

Physical AI and robotics foundation models

Physical AI represents the convergence of foundation model capabilities with robotics. Where traditional robotics relied on brittle, task-specific programming, robotics foundation models enable machines to generalise across tasks and adapt to novel situations.

The technical breakthrough is Vision-Language-Action (VLA) models, which extend vision-language models to include physical action outputs. NVIDIA’s Isaac GR00T N1 represents the first open humanoid robot foundation model, using a dual-system architecture that separates deliberate planning from rapid reactive control. Google’s Gemini Robotics is advancing similar capabilities. World Foundation Models complement these by enabling simulation-based training: NVIDIA Cosmos generates physically plausible synthetic environments that can train robots on scenarios too dangerous or rare to capture in the real world.

Production deployments remain concentrated in well-resourced organisations. The gap between research demonstrations and reliable industrial deployment is substantial. Hardware costs have fallen substantially over the past few years, but perception and control challenges in unstructured environments remain formidable. Organisations with physical AI ambitions should be experimenting, while approaching production timelines with caution.

See also: Digital twin platforms, World models.

CaMeL

CaMeL (CApabilities for MachinE Learning) is a defence architecture from Google DeepMind for mitigating prompt injection in agentic systems. The paper, Defeating Prompt Injections by Design, treats prompt injection as a problem to be solved structurally rather than through prompt cleverness or red-teaming alone. The architecture splits responsibilities across two models: a privileged P-model processes only user instructions and emits a program defining execution steps, while a quarantined Q-model handles external data but cannot call tools directly. A custom interpreter tracks data provenance, enforcing capability-based security so untrusted data cannot escalate privileges. The implementation is open source.

Prompt injection is the dominant unsolved problem for agentic systems, and it appears throughout this radar: in MCP, in agentic tool use, in AI red teaming tools and in our Hold placement for OpenClaw. Most current defences are statistical: filters, classifiers, evaluation harnesses. CaMeL is one of the first credible attempts to make the architecture itself prove that untrusted data cannot reach privileged operations. On the AgentDojo benchmark it solves 77% of tasks with provable security guarantees, often reducing successful attacks to zero. Simon Willison has a useful walkthrough for readers wanting an accessible introduction.

We’ve placed CaMeL in Assess because it addresses prompt injection more rigorously than anything else we have seen, but production patterns have not yet emerged. The limitations are real: users must define security policies, which carries fatigue risk; running two models adds latency and cost; and the approach has not been battle-tested at scale. For teams building agentic systems for regulated industries, this paper is required reading.

Hold

Not recommended for new projects; better alternatives exist.

Word2Vec & GloVe

We’ve placed both GloVe (Global Vectors for Word Representation) and Word2Vec (Word to Vector) in the Hold ring of our techniques quadrant. While these word embedding techniques were groundbreaking when introduced and served as fundamental building blocks for many NLP applications, they have been largely superseded by more advanced approaches.

These older embedding techniques, though computationally efficient, lack the contextual understanding that modern transformer-based models provide. Modern large language models and contextual embeddings such as BERT produce more nuanced representations that capture word meaning based on surrounding context, rather than the static embeddings that GloVe and Word2Vec generate. For new projects, we recommend exploring more recent embedding techniques (see “BERT Variants” in our Adopt ring) unless you have very specific constraints around computational resources or model size that make these older approaches necessary.

t-SNE

We’ve placed t-SNE (t-distributed Stochastic Neighbor Embedding) in the Hold ring of our techniques quadrant. While t-SNE was groundbreaking when introduced for visualising high-dimensional data in lower dimensions, particularly for understanding the internal representations of neural networks, we’re seeing its limitations become more apparent in modern AI workflows.

The core issue is that t-SNE can be misleading when interpreting AI model behaviour, as it prioritises preserving local structure at the expense of global relationships. This can lead teams to draw incorrect conclusions about their models’ decision boundaries and feature representations. We’re increasingly recommending alternatives such as UMAP (Uniform Manifold Approximation and Projection), which better preserves both local and global structure while offering superior computational performance. For projects requiring dimensionality reduction and visualisation of AI model internals, we suggest exploring these newer techniques rather than defaulting to t-SNE.

Zero-shot prompting

Zero-shot prompting, the practice of asking Large Language Models to perform tasks without examples or training, has been a quick way to get started with AI. However, we strongly recommend against using zero-shot prompts in production without appropriate guardrails and safety measures. We’ve heard of multiple incidents where unprotected prompts led to harmful or inappropriate outputs, potentially exposing organisations to significant risks.

Our view is that zero-shot prompting should always be combined with input validation and output filtering. While it can be valuable for prototyping and exploration, moving to few-shot prompting or fine-tuning with careful guardrails is a more robust approach for production systems. The current placement in Hold reflects our concern about organisations rushing to deploy unsafe prompt patterns rather than taking the time to implement proper controls.

Chain of thought (CoT)

Chain of Thought (CoT) has moved to Hold. While useful when it emerged, research from Wharton’s Generative AI Labs demonstrates diminishing returns: gains are rarely worth the time cost, and for reasoning models such as o3 and GPT-5.2, CoT prompting can decrease performance since step-by-step reasoning is already internalised at the architecture level.

For non-reasoning models, CoT still shows modest benefits on mathematical and symbolic reasoning tasks, but these are precisely the domains where better alternatives are emerging. Dedicated reasoning models handle them natively, while neurosymbolic architectures offer more reliable solutions by coupling LLMs with explicit reasoning engines.

The frontier of prompt engineering has moved to structuring problems effectively. Frameworks such as the 5 Whys and inversion now offer more value than CoT prompting. Step-by-step reasoning is now handled by the models and architectures rather than the prompts.

AI pull request review

AI’s code review capabilities have improved substantially. Developers who work effectively with multi-turn AI conversations can now get useful feedback at every level: syntax issues, architectural patterns and subtle runtime concerns such as race conditions.

Yet we’ve kept AI Pull Request Review in Hold, for organisational rather than technical reasons. PR review isn’t just about finding errors; it’s a knowledge-sharing mechanism where senior developers mentor juniors and the team maintains awareness of how the codebase evolves. Teams who delegate review to AI often see a decline in collective code ownership.

We recommend using AI as a first-pass reviewer to catch issues before human review, but preserving the human step as deliberate practice for team alignment and knowledge transfer.

Techniques