Notes on Active Inference
FEP
Lastmod: 2025-07-01
Disentangling the properties of the Active Inference algorithm.

The Free Energy Principle (FEP) is a theory which attempts to provide a foundational framing of the nature of stable, self-organizing systems. One thing which makes the FEP attention-grabbing is its near-universal scope of applicability (as commonly represented) combined with its simple and intuitive picture of how stability, adaptivity, and prediction marry together.

On the other hand, evaluating the claims of the FEP can be made difficult by the fact that the theory seems to simultaneously make many different types of claims without fully distinguishing them.

  • Normative - Many presentations of the FEP appear to indicate that any system which maintains a stable boundary from its environment over time must engage in a kind of variational free energy minimization which can be interpreted as performing inference on the environmental state.
  • Mechanistic - Other presentations, so called “Process Theories”, of the FEP attempt to show that the dynamics of a system’s internal state at Non-equilibrium Steady State (NESS) can be interpreted as gradient flows of the variational free energy.
  • Prescriptive - Active Inference extends the core ideas of the FEP to prescribe an algorithm which enables an object to approach a NESS distribution. This development on top of the FEP is prescriptive both in that it is a positively constructed algorithm and in that its presentation tends to make claims about the optimality or general benefits of the approach, e.g. relative to comparable approaches in reinforcement learning.

Distinguishing these different types of claims is an important first step in ascertaining the general structure of the FEP as a theory as well as engaging in any criticism or scrutiny of the theory.

This post addresses the prescriptive claim represented by the Active Inference algorithm which is often attached to presentations of the FEP. In the spirit of disentangling different layers of the theory, the post aims to distinguish different aspects of the Active Inference algorithm and how each of them relates to its final properties. In particular, we approach the algorithm layer by layer, starting with a presentation centered on vanilla inference and building toward the full Active Inference algorithm, e.g. by adding variational approximations or constructing the expected free energy (EFE) objective.

The Entropy Minimization Objective

One of the compelling aspects of the FEP is that it aims to describe behaviors that any object which persists over time must exhibit.

As a basic tautology, in order for an object to persist over time, the world must remain in the set of states where the object exists. A more complex, adaptive object may make observations about the state of the world and exert control actions to keep the state from entering a region where the object ceases to exist.

An object’s portals to the world are typically formalized using the concept of a Markov Blanket, which is simply a conditional independence structure which seems to match up nicely with the structure of a Partially Observable Markov Decision Process (POMDP).

The object is usually allowed a generative model which connects the actions, observations, and hidden environmental state variables.

$$P(o_{1:T},s_{1:T},a_{1:T}) = P(a_{1:T})\prod P(o_t|s_t)\prod P(s_t|s_{t-1},a_t)$$
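For concreteness, here is a minimal sketch of such a generative model for small discrete spaces, written in Python with NumPy. The particular state, observation, and action sets (and the initial-state prior) are hypothetical placeholders rather than anything prescribed by the FEP literature, and the action sequence is simply treated as given.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs, n_actions = 3, 2, 2

# P(o_t | s_t): one categorical observation distribution per state, shape (S, O).
obs_lik = rng.dirichlet(np.ones(n_obs), size=n_states)

# P(s_t | s_{t-1}, a_t): one transition matrix per action, shape (A, S, S).
trans = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

# P(s_1): an initial-state prior (an extra assumption, needed to start the chain).
init_state = rng.dirichlet(np.ones(n_states))

def sample_trajectory(actions):
    """Ancestral sampling of (s_{1:T}, o_{1:T}) given a fixed action sequence a_{1:T}."""
    states, obs = [], []
    s = rng.choice(n_states, p=init_state)                 # s_1 ~ P(s_1)
    for t, a in enumerate(actions):
        if t > 0:
            s = rng.choice(n_states, p=trans[a, s])        # s_t ~ P(s_t | s_{t-1}, a_t)
        states.append(s)
        obs.append(rng.choice(n_obs, p=obs_lik[s]))        # o_t ~ P(o_t | s_t)
    return states, obs

states, observations = sample_trajectory(actions=[0, 1, 1, 0, 1])
```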

Given access to this generative model, in principle the agent can select a sequence of actions in order to control the distribution over environmental states. For instance, to choose the action at time step $\tau$, the agent might consider the distribution,

$$P(s_{\tau:T}|a_{1:T},o_{1:\tau-1}).$$

It is commonly supposed that the agent will wish to do something like minimize the long-term average of surprisal over time,

$$\frac{1}{T}\sum_{t=1}^T-\log(P(o_t|a_{1:T},o_{1:t-1})),$$

but this quantity is somewhat notional, and we will need to specify a more exact form. Also, at a more detailed level of consideration, we’ll see that more opinionated objectives enter the picture.

Perception/Prediction as Inference

Active Inference attempts to unify perception and planning into a single form factor of inference. This is potentially powerful, because it allows many principled tools of approximate inference to be applied to problems like planning, which otherwise tend to be driven by heuristics.

Before attempting this unification, we’ll start with a narrower walkthrough of perception and prediction as viewed through the lens of inference.

Vanilla Inference

Before moving on to more mathematically laden ideas, it’s worth taking some time to make some observations about the active inference conception of an agent as something which performs inference to control a distribution over states.

We can start by unpacking exactly what would be entailed by performing the inference in question.

Suppose we want to calculate a posterior of the form

$$P(o_{\tau+1:T}|\pi,o_{1:\tau}),$$

where $\pi = a_{1:T}$ is the full action trajectory. From an active inference standpoint, the most helpful way to consider this calculation is by breaking it into two stages. First, we perform inference on the hidden states, $s_{1:T}$:

$$P(s_{1:T}|o_{1:\tau},\pi) = \frac{P(o_{1:\tau},s_{1:T}|\pi)}{P(o_{1:\tau}|\pi)}.$$

The difficult part of this step is calculating the evidence term in the denominator on the right hand side, since it requires integrating the numerator over all possible state trajectories, $s_{1:T}$.

Once we have the hidden state distribution, we can construct the joint posterior over hidden states and observations,

$$ P(o_{\tau+1:T},s_{1:T}|o_{1:\tau},\pi) = P(o_{\tau+1:T}|s_{\tau+1:T}) P(s_{1:T}|o_{1:\tau},\pi) $$

and then marginalize out the hidden states,

$$P(o_{\tau+1:T}|\pi,o_{1:\tau}) = \int P(o_{\tau+1:T},s_{1:T}|o_{1:\tau},\pi) ds_{1:T} = \mathbb{E}_{P(s_{1:T}|o_{1:\tau},\pi)}[P(o_{\tau+1:T}|s_{\tau+1:T})]$$

another computationally impractical operation.
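To make the cost explicit, here is a brute-force sketch of both stages for the tiny discrete model above (it reuses the hypothetical `trans`, `obs_lik`, and `init_state` arrays from the previous snippet). The loop over all $|\mathcal{S}|^T$ state trajectories is exactly the integration that becomes impractical for realistically sized problems.

```python
from itertools import product

def exact_predictive(actions, past_obs):
    """P(o_{tau+1:T} | pi, o_{1:tau}) by enumerating every state trajectory."""
    T, tau = len(actions), len(past_obs)
    future_dist, evidence = {}, 0.0
    for s_traj in product(range(n_states), repeat=T):              # |S|^T terms
        # Weight of this trajectory: P(o_{1:tau}, s_{1:T} | pi).
        w = init_state[s_traj[0]]
        for t in range(1, T):
            w *= trans[actions[t], s_traj[t - 1], s_traj[t]]
        for t in range(tau):
            w *= obs_lik[s_traj[t], past_obs[t]]
        evidence += w                                              # accumulates P(o_{1:tau} | pi)
        # Spread the weight over future observation sequences via P(o_{tau+1:T} | s_{tau+1:T}).
        for o_traj in product(range(n_obs), repeat=T - tau):
            p_o = np.prod([obs_lik[s_traj[tau + k], o] for k, o in enumerate(o_traj)])
            future_dist[o_traj] = future_dist.get(o_traj, 0.0) + w * p_o
    return {o: p / evidence for o, p in future_dist.items()}

predictive = exact_predictive(actions=[0, 1, 1, 0, 1], past_obs=[1, 0])
```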

While we are here, it is worth asking what notion of learning is allowed by this conception of an agent. It appears that learning corresponds somewhat narrowly to the process of developing more certainty about the state of the environment as more actions are taken and more observations are collected.

There’s a different type of learning which might appear to be absent from this framing: learning about the world model itself, since the dynamics of the environment are already given. However, we can show how this type of learning can be implicitly captured.

Suppose that we don’t completely know the state transition function of the world. Instead, we have a model of the state which depends on a parameter $\theta$:

$$P(s_t|s_{t-1},a_t,\theta).$$

We can also have a prior over $\theta$, $P(\theta)$. Then the new generative model becomes

$$P(o_{1:T},s_{1:T},a_{1:T},\theta) = P(\theta)P(a_{1:T})\prod P(o_t|s_t)\prod P(s_t|s_{t-1},a_t,\theta)$$

But we could actually represent this same distribution in the original representation by letting $\mathcal{S}^* = \mathcal{S}\times\mathcal{\Theta}$ and requiring for any $(s_1,\theta_1) \in \mathcal{S}^*$ and $(s_2,\theta_2) \in \mathcal{S}^*$ with $\theta_1 \neq \theta_2$ that the transition probability $P(s_t=(s_1,\theta_1)|s_{t-1}=(s_2,\theta_2)) = 0$.

Thus, in principle, there is a possibility that the agent can learn something about “the kind of world it is in” as it collects more observations over time.
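Here is a sketch of this state-space augmentation for the discrete model used above (again reusing the hypothetical `trans`, `init_state`, and `rng` from the earlier snippets). A hypothetical parameter $\theta\in\{0,1\}$ indexes which of two transition tensors is in force, and the augmented transition tensor forbids any change of $\theta$.

```python
import numpy as np

def augment_with_theta(trans_by_theta, prior_theta, init_state):
    """Build transitions on S* = S x Theta in which theta never changes.

    trans_by_theta: shape (K, A, S, S), one transition tensor per theta value.
    prior_theta:    shape (K,), the prior P(theta).
    Returns the augmented transition tensor (A, S*K, S*K) and initial distribution (S*K,).
    """
    K, A, S, _ = trans_by_theta.shape
    aug = np.zeros((A, S * K, S * K))
    for k in range(K):
        block = slice(k * S, (k + 1) * S)
        aug[:, block, block] = trans_by_theta[k]   # cross-theta transition probability stays 0
    init_aug = np.concatenate([prior_theta[k] * init_state for k in range(K)])
    return aug, init_aug

# Two hypotheses about the dynamics: theta = 0 uses `trans`, theta = 1 uses a second random tensor.
alt_trans = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
aug_trans, aug_init = augment_with_theta(np.stack([trans, alt_trans]),
                                         prior_theta=np.array([0.5, 0.5]),
                                         init_state=init_state)
```

Inference over the augmented state then carries a posterior over $\theta$ along with it, which is the sense in which model learning is implicit in this framing.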

Variational Inference

Performing exact inference is generally computationally intractable. In this section, we’ll look at how Variational Inference can be used to address this problem.

  • Basic Variational Inference
  • Simplifying approximations

Variational inference is a reframing of Bayesian inference as an optimization problem. It utilizes the property of the KL-divergence, defined as

$$D_{KL}(Q(s)||P(s)) = \int \log\left(\frac{Q(s)}{P(s)}\right)d Q(s) = \mathbb{E}_{Q(s)}[ \log(Q(s)) - \log(P(s))]$$

that it is minimized when $P(s) = Q(s)$.

Supposing we want to find $P(s|o)$, we can therefore do so by performing the optimization $\min_Q D_{KL}(Q(s)||P(s|o))$. This may not look particularly helpful, as we seem to already need to have calculated $P(s|o)$. However, upon rewriting the KL divergence as

$$ D_{KL}(Q(s)||P(s|o)) = \mathbb{E}_{Q(s)}[ \log(Q(s)) - \log(P(s,o))] +\log(P(o)) $$

we notice that this functional over $Q(s)$ has the same minimizer as $\mathbb{E}_{Q(s)}[ \log(Q(s)) - \log(P(s,o))]$, which is known as the variational free energy.
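As a small numerical illustration of this point (nothing specific to Active Inference), the sketch below evaluates the variational free energy over a fine grid of categorical distributions $Q(s)$ for a randomly chosen two-state joint $P(s,o)$ and confirms that the minimizer coincides with the exact posterior $P(s|o)$.

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.dirichlet(np.ones(4)).reshape(2, 2)          # a random joint P(s, o) with |S| = |O| = 2
o = 1                                                    # the observation we condition on

qs = np.linspace(1e-6, 1 - 1e-6, 10001)                  # candidate values of Q(s = 0)
Q = np.stack([qs, 1 - qs], axis=1)                       # every candidate distribution Q(s)
vfe = np.sum(Q * (np.log(Q) - np.log(joint[:, o])), axis=1)   # E_Q[log Q(s) - log P(s, o)]

q_star = Q[np.argmin(vfe)]
exact_posterior = joint[:, o] / joint[:, o].sum()
print(q_star, exact_posterior)                           # agree up to the grid resolution
```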

Thus, by minimizing the variational free energy, we can recover $P(s|o)$ without performing the expensive integration needed to calculate the evidence term $P(o)$. Likewise, for the POMDP problem described above, we have

$$ D_{KL}(Q(s_{1:T})||P(s_{1:T}|o_{1:\tau},\pi)) = \mathbb{E}_{Q(s_{1:T})}[ \log(Q(s_{1:T})) - \log(P(s_{1:T},o_{1:\tau}|\pi))] + \log P(o_{1:\tau}|\pi) $$

and the corresponding free energy $\mathbb{E}_{Q(s_{1:T})}[ \log(Q(s_{1:T})) - \log(P(s_{1:T},o_{1:\tau}|\pi))]$.

Armed with a (possibly approximate) posterior, $Q^*(s_{1:T})$, we can now calculate predictive distributions as we did in the previous section.

$$P(o_{\tau+1:T}|\pi,o_{1:\tau}) = \mathbb{E}_{Q^*(s_{1:T})}[P(o_{\tau+1:T}|s_{\tau+1:T})]$$

This is great, but there’s a catch: calculating the variational free energy itself requires integrating over the hidden state space, as does the expectation needed to form the final posterior over observations. So it turns out that variational inference by itself has not reduced the complexity of the inference problem. To achieve a computational benefit, we will usually introduce simplifying assumptions about the form of $Q$. These simplifications are usually framed in terms of their computational benefit, but they may also move us into the domain of approximate inference and represent a kind of bias on the outcome of the inference, which we will consider in its own right.

We’ll consider the following approximations:

  • Factorization
  • Variational message passing
  • Black box VI
  • Implicit inference

Factorization (Mean Field Assumption)

One of the most common approximations that shows up in the VI and Active Inference literature is to factorize $Q$, which corresponds to an independence assumption on the variables in its domain:

$$Q(s_{1:T}) =\prod_{t=1}^TQ_t(s_t).$$

Assuming that $P$ is Markovian, the effect of this factorization is to reduce the complexity of the various integrals from $|\mathcal{S}|^2$ to $|\mathcal{S}|$ (where $\mathcal{S}$ is the state space).

Some basic questions that we can ask about this approximation:

  • Do we properly recover the marginals of $P$?
  • What is the impact of the factorization assumption on planning? I.e., do we actually need to capture the correlations in the dynamical model in order to do proper planning?

The answer to the first question is, in general, no. ChatGPT can supply a simple counter-example here, which shows that this direction of the KL-divergence tends to promote mode-seeking behavior rather than matching the true marginals.
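Here is a minimal numerical sketch of such a counter-example: a two-variable distribution with strong modes at $(0,0)$ and $(1,1)$, with the factorized $Q$ fit by brute-force minimization of the reverse KL. The recovered marginal is heavily skewed toward one mode rather than reproducing the true, symmetric marginals. (The particular numbers are arbitrary.)

```python
import numpy as np

# A two-variable target with strong modes at (0, 0) and (1, 1).
P = np.array([[0.49, 0.01],
              [0.01, 0.49]])

grid = np.linspace(1e-3, 1 - 1e-3, 401)
best = None
for a in grid:                # a = q1(s1 = 0)
    for b in grid:            # b = q2(s2 = 0)
        Q = np.outer([a, 1 - a], [b, 1 - b])          # factorized Q(s1, s2) = q1(s1) q2(s2)
        kl = np.sum(Q * (np.log(Q) - np.log(P)))      # reverse KL, D(Q || P)
        if best is None or kl < best[0]:
            best = (kl, a, b)

print("true marginal of s1:      ", P.sum(axis=1))           # [0.5, 0.5]
print("mean-field marginal of s1:", [best[1], 1 - best[1]])  # skewed heavily toward one mode
```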

The answer to the second question depends on the objective that we seek to optimize in planning. The general case that we will explore below does utilize these correlations, such that the mean field approximation will result in sub-optimal planning.

Message Passing

Message passing and its variational variants are algorithms developed for distributions whose conditional independence structure (for instance a chain or a tree) can be exploited directly.

My open questions: How does their complexity compare to the mean field approaches and what are their limitations? What happens if we try to apply these techniques to a real-world problem?

Black Box Variational Inference

There’s a variety of techniques which mix concepts from Monte Carlo sampling and amortized inference into the variational inference framework.

Suppose that we have an ansatz $Q(s_{1:T};\lambda)$, parameterized by $\lambda$ in a differentiable manner. We can construct the loss function

$$\ell(\lambda) =\mathbb{E}_{Q(s_{1:T};\lambda)}[ \log(Q(s_{1:T};\lambda)) - \log(P(s_{1:T},o_{1:\tau}|\pi))]$$

and note that its gradient also has the form of an expectation:

$$\nabla_\lambda\ell = \mathbb{E}_{Q(s_{1:T};\lambda)}[...]$$

Thus, we can approximate the gradient by sampling trajectories $s_{1:T}$ from $Q$. This is particularly appropriate if we can amortize in some manner so that $Q$ has a better chance of being a good guess from the start.
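Here is a sketch of the score-function form of that gradient estimator, stripped down to a single categorical hidden variable with logits $\lambda$; the `log_joint` array is a stand-in for $\log P(s_{1:T},o_{1:\tau}|\pi)$ at fixed observations and policy, so nothing here is specific to Active Inference.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 5
log_joint = np.log(rng.dirichlet(np.ones(S)))          # stand-in for log P(s, o | pi) at fixed o, pi

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lam = np.zeros(S)                                      # variational parameters (logits of Q)
lr, n_samples = 0.1, 256

for step in range(2000):
    q = softmax(lam)
    s = rng.choice(S, size=n_samples, p=q)             # s ~ Q(s; lambda)
    f = np.log(q[s]) - log_joint[s]                    # integrand: log Q(s) - log P(s, o)
    score = np.eye(S)[s] - q                           # grad_lambda log Q(s; lambda) for softmax logits
    grad = np.mean(f[:, None] * score, axis=0)         # Monte Carlo estimate of grad_lambda ell
    lam -= lr * grad

print(softmax(lam))                                    # approaches the exact posterior...
print(np.exp(log_joint) / np.exp(log_joint).sum())     # ...i.e. P(s | o) under the stand-in
```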

Planning as Inference

There are a few interesting aspects of how Active Inference relates to planning.

Most basically, active inference explicitly formulates its objectives in terms of probabilistic concepts such as entropy minimization, variational inference, and ultimately, expected free energy minimization. This approach has the potential to improve upon approaches in standard reinforcement learning (RL), which often lack such a principled probabilistic framing. In particular, it is often claimed that Active Inference yields a principled derivation of the heuristics for balancing epistemic and pragmatic gains which arise in RL. We’ll evaluate this idea.

Beyond this, active inference attempts to unify perception and planning into a single operation of inference. We’ll try to understand this unification and what benefits it might provide.

Exploration / Exploitation Dynamics

Exploration/Exploitation with Entropy Objectives

We saw previously that the inference process at the heart of active inference allows for learning in the form of reducing uncertainty about the state of the environment or even the type of world that the agent is in.

With respect to this dynamic of uncertainty reduction, there is a possible narrative around the contrasting behaviors of exploration and exploitation, or the pragmatic and epistemic value of different actions as it is commonly put within the active inference literature. It is instructive to carefully consider this narrative within the context of vanilla inference, before moving to consider expected free energy minimization, both 1) to provide a contrast against which we can clearly see what expected free energy adds and 2) to provide a basic starting perspective from which to view the overall set of claims made on behalf of the theory.

If we start with a basic idea that we want to minimize surprisal, there are a number of different ways to translate this into a concrete objective. A common objective looks at the one-step entropy:

$$\sum_{t=\tau}^T\mathcal{H}(P(o_t|a_{1:T},o_{1:\tau-1})) = \sum_{t=\tau}^T\mathbb{E}_{P(o_t|a_{1:T},o_{1:\tau-1})}[-\log(P(o_t|a_{1:T},o_{1:\tau-1}))].$$

But there are other possibilities. For instance, we can get a more complete sense of surprisal by looking at the entropy over the full trajectory of future observations:

$$\mathcal{H}(P(o_{\tau:T}|a_{1:T},o_{1:\tau-1})) =\mathbb{E}_{P(o_{\tau:T}|a_{1:T},o_{1:\tau-1})}[-\log(P(o_{\tau:T}|a_{1:T},o_{1:\tau-1}))].$$

Another interesting possibility would be to look at the per-step entropy, conditioned on the observations accumulated along the way:

$$\sum_{t=\tau}^T\mathbb{E}_{P(o_{\tau:t-1}|a_{1:T},o_{1:\tau-1})}[\mathcal{H}(P(o_t|a_{1:T},o_{1:t-1}))] = \sum_{t=\tau}^T\mathbb{E}_{P(o_{\tau:t-1}|a_{1:T},o_{1:\tau-1})}[\mathbb{E}_{P(o_t|a_{1:T},o_{1:t-1})}[-\log(P(o_t|a_{1:T},o_{1:t-1}))]].$$

We can try out these objectives in a simple scenario. Suppose there is a state $s_1$ which produces low entropy observations for all values of $\theta$, and a state $s_2$ which produces high entropy observations while $\theta$ is unknown but low entropy observations once $\theta$ has been surmised. Suppose also that the agent has available an action which will reliably move it to state $s_1$ and an action which will reliably move it into state $s_2$. In this scenario, there’s a potential benefit to moving to $s_2$ so that the agent can learn $\theta$ and enjoy lower entropy observations. But we need to define the entropy objective carefully in order to enable the agent to select for this path.

Here’s a concrete version of this game (Thanks to GPT):

  • Latent parameter: $\theta\in\{0,1\}$ with prior ½–½
  • State $s_1$ (“safe”): observations ∼ Bernoulli(½) regardless of $\theta$ → entropy = 1 bit
  • State $s_2$ (“informative”): if $\theta=0$ you always see A; if $\theta=1$ you always see B
  • Horizon: two future steps, $t=\tau,\tau+1$
  • Actions: stay-1 puts you in $s_1$; jump-2 puts you in $s_2$ and keeps you there

We can now calculate the value of each objective for each action sequence.

  • stay-1: one-step frozen = 1+1 = 2 bits; joint-sequence $\mathcal{H}(o_\tau,o_{\tau+1})$ = 2 bits (each step contributes 1 bit); expected dynamic = 1+1 = 2 bits
  • jump-2: one-step frozen = 1+1 = 2 bits; joint-sequence $\mathcal{H}(o_\tau,o_{\tau+1})$ = 1 bit (only two possible strings, AA/BB); expected dynamic = 1+0 = 1 bit

Notably, the one-step entropy objective fails to select the informative action. At time $\tau$ it evaluates each future time-slice with today’s posterior. It never gives credit for the fact that tomorrow’s posterior will be sharper if today’s observation is informative. Hence it drives pure exploitation (pick the state whose next observation looks easiest now).
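The numbers above are easy to reproduce. The sketch below hard-codes the hypothetical two-step game and computes all three objectives for both actions; only the joint-sequence and expected-dynamic objectives reward jump-2.

```python
import numpy as np
from itertools import product

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    p = np.array([v for v in probs if v > 0])
    return float(-np.sum(p * np.log2(p)))

theta_prior = {0: 0.5, 1: 0.5}

def obs_dist(state, theta):
    """P(o | s, theta) over the symbols A and B."""
    if state == "s1":                                   # safe state: fair coin regardless of theta
        return {"A": 0.5, "B": 0.5}
    return {"A": 1.0, "B": 0.0} if theta == 0 else {"A": 0.0, "B": 1.0}   # informative state

def seq_dist(state):
    """P(o_tau, o_{tau+1} | action), marginalizing over theta (the chosen state is absorbing)."""
    dist = {}
    for theta, p_th in theta_prior.items():
        d = obs_dist(state, theta)
        for o1, o2 in product("AB", repeat=2):
            dist[o1 + o2] = dist.get(o1 + o2, 0.0) + p_th * d[o1] * d[o2]
    return dist

for action, state in [("stay-1", "s1"), ("jump-2", "s2")]:
    joint = seq_dist(state)
    marg = {o: sum(p for k, p in joint.items() if k[0] == o) for o in "AB"}   # P(o_tau)
    frozen = 2 * H(marg.values())                       # both steps scored with today's posterior
    joint_H = H(joint.values())                         # H(o_tau, o_{tau+1})
    dynamic = H(marg.values()) + sum(                   # H(o_tau) + E[H(o_{tau+1} | o_tau)]
        marg[o] * H([joint[o + o2] / marg[o] for o2 in "AB"]) for o in "AB" if marg[o] > 0
    )
    print(f"{action}: frozen={frozen:.1f}, joint={joint_H:.1f}, dynamic={dynamic:.1f} bits")
```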

Introducing the Expected Free Energy

Now that we have an intuitive picture of a mode of uncertainty-seeking behavior that we might expect to arise in a form of planning guided by vanilla inference, it’s interesting to ask whether this is the same as the oft-touted behavior that arises under the expected free energy (EFE) objective.

The motivation for formulating something like an expected free energy is that the variational free energy implicitly captures both the observational surprisal, $- \log P(o_{\tau:T}|\pi)$, and the “goodness of fit” between the variational distribution and the hidden state posterior, $D_{KL}(Q^*(s_{\tau:T}|\pi)||P(s_{\tau:T}|o_{\tau:T},\pi))$:

$$ \mathbb{E}_{Q^*(s_{\tau:T}|\pi)}[ \log(Q^*(s_{\tau:T}|\pi)) - \log(P(s_{\tau:T},o_{\tau:T}|\pi))] = D_{KL}(Q^*(s_{\tau:T})||P(s_{\tau:T}|o_{\tau:T},\pi)) - \log P(o_{\tau:T}|\pi). $$

Since this objective seems to capture two things that we care about, it seems intuitively plausible that we should use its minimization as a basis for selecting a policy. (Here, in the planning context, we’re minimizing the free energy by selecting $\pi$ and not by selecting $Q$; we’ve already selected $Q^*(s_{\tau:T}) \approx P(s_{\tau:T}|o_{1:\tau},\pi)$ by minimizing a separate variational free energy term. To simplify interpretation going forward, we can imagine that the predictive phase of the variational inference is highly accurate, so that $Q^*(s_{\tau:T}) = P(s_{\tau:T}|o_{1:\tau},\pi)$ holds exactly.)

However, we can’t use this form of the free energy directly because it is a function of $o_{\tau:T}$. We need to take an expectation over this trajectory in order to remove the dependence. Arguably the most natural way to construct an expected free energy would be to take an expectation with respect to $P(o_{\tau:T}|\pi)$, to get:

$$\mathbb{E}_{P(o_{\tau:T}|\pi)}[D_{KL}(P(s_{\tau:T}|\pi)||P(s_{\tau:T}|o_{\tau:T},\pi))] + \mathcal{H}(P(o_{\tau:T}|\pi)).$$

But the KL term here would essentially promote trajectories where the observations tell us nothing new about the hidden state, which is generally undesirable from an epistemic standpoint.

In the FEP literature, the EFE is instead constructed (apparently somewhat arbitrarily as far as I can tell) by changing $\mathbb{E}_{Q^*(s_{\tau:T}|\pi)}$ to $\mathbb{E}_{Q^*(s_{\tau:T}, o_{\tau:T}|\pi)}$, where $Q^*(s_{\tau:T}, o_{\tau:T}|\pi) = Q^*(s_{\tau:T}|\pi)P(o_{\tau:T}|s_{\tau:T})$ to obtain

$$ \mathcal{G}(\pi) = \mathbb{E}_{Q^*(s_{\tau:T},o_{\tau:T}|\pi)}[ \log(Q^*(s_{\tau:T}|\pi)) - \log(P(s_{\tau:T},o_{\tau:T}|\pi))] $$

Fascinatingly, when we refactor this expression by breaking out $\log(P(s_{\tau:T},o_{\tau:T}|\pi))= \log(P(s_{\tau:T}|o_{\tau:T},\pi))+\log(P(o_{\tau:T}|\pi))$ to obtain

$$\mathcal{G}(\pi) = \mathcal{H}(P(o_{\tau:T}|\pi))-\mathbb{E}_{P(o_{\tau:T}|\pi)}[D_{KL}(P(s_{\tau:T}|o_{\tau:T},\pi)||P(s_{\tau:T}|\pi))]$$

we find that this version of the EFE has precisely the opposite tendency–to promote trajectories which maximize the new information about the hidden state, $s_{\tau:T}$.

In my perhaps uncharitable view, this appears to challenge the common claim that the EFE allows us to recover an algorithm for managing exploration/exploitation tradeoffs from first principles. If anything, it looks to me like the favorable exploration/exploitation behavior is serving as the basis for choosing this version of the EFE over other, equally arbitrary alternatives.

A decision-theoretic view of EFE

We can strengthen the above critique of the EFE by taking a broader decision theoretic view.

If we look at the concrete examples above, in which the basic entropy objectives successfully favor a near-term loss in order to gain information, we can remind ourselves that favoring exploration only makes sense when we explicitly model planning and action over consecutive steps (over time). Exploration trades off a short-term loss for a larger long-term gain.

If we ignore computational constraints, then it should be clear that no objective will better allow us to minimize observational entropy over time than the observational entropy itself. We can see this by refactoring the EFE again:

$$\begin{aligned} \mathcal{G}(\pi) &= \mathbb{E}_{P(s_{\tau:T},o_{\tau:T}|\pi)}[ \log(P(s_{\tau:T}|\pi)) - \log(P(o_{\tau:T}|s_{\tau:T},\pi))-\log(P(s_{\tau:T}|\pi))] \\ &= \mathbb{E}_{P(s_{\tau:T}|\pi)}[\mathcal{H}(P(o_{\tau:T}|s_{\tau:T},\pi))] \end{aligned}$$

(Note that this version of the EFE is generally presented with another term penalizing KL divergence between the generative dynamical model and the true dynamics, but as noted before, we have assumed an exact fit here for the purpose of more plainly understanding the selective function of the EFE objective).
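Since it is easy to lose track of these rearrangements, here is a quick numerical check that the defining form of $\mathcal{G}$, the entropy-minus-information-gain decomposition, and the expected-ambiguity form above all agree on a random discrete joint, with $Q^*$ taken to equal the exact predictive prior as assumed in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
S, O = 4, 3
joint = rng.dirichlet(np.ones(S * O)).reshape(S, O)      # P(s, o | pi) for one candidate policy

p_s = joint.sum(axis=1)                                  # P(s | pi), playing the role of Q*
p_o = joint.sum(axis=0)                                  # P(o | pi)
p_s_given_o = joint / p_o                                # columns are P(s | o, pi)
p_o_given_s = joint / p_s[:, None]                       # rows are P(o | s, pi)

# Defining form: expectation over Q*(s) P(o | s) of [log Q*(s) - log P(s, o | pi)].
G_def = np.sum(joint * (np.log(p_s)[:, None] - np.log(joint)))

# Decomposition: H(P(o | pi)) - E_{P(o | pi)}[ KL( P(s | o, pi) || P(s | pi) ) ].
H_o = -np.sum(p_o * np.log(p_o))
info_gain = np.sum(p_o * np.sum(p_s_given_o * (np.log(p_s_given_o) - np.log(p_s)[:, None]), axis=0))
G_decomp = H_o - info_gain

# Expected ambiguity: E_{P(s | pi)}[ H(P(o | s, pi)) ].
G_ambiguity = np.sum(p_s * -np.sum(p_o_given_s * np.log(p_o_given_s), axis=1))

print(np.allclose(G_def, G_decomp), np.allclose(G_def, G_ambiguity))   # True True
```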

Apparently, the EFE objective will select for trajectories which pass through states which correspond to low observational entropy. Note that this is not the same as minimizing the actual observational entropy, because it ignores uncertainty about the state itself. It’s quite easy to construct a POMDP which emphasizes this discrepancy: Imagine a vast collection of states, each of which maps to a different unique observable, such that the state-conditional entropy of the observation for each state is 0. Suppose also that the dynamics progress from each state to each of the other states in this collection with equal probability.

The reader can verify that both 1) the observational entropy of any trajectory which enters this collection will be quite high, due to the uniform distribution over states in each step, and 2) the EFE will give the highest possible preference to a trajectory which immediately enters this collection.
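A sketch of this construction with hypothetical numbers: a collection of N states, each deterministically emitting its own unique symbol, with uniform transitions among them. The EFE (ambiguity) contribution of a step inside the collection is 0 bits, while the per-step observational entropy is $\log_2 N$ bits.

```python
import numpy as np

N = 64                                           # size of the "vast collection" of states
obs_lik = np.eye(N)                              # each state deterministically emits its own symbol
trans = np.full((N, N), 1.0 / N)                 # uniform dynamics within the collection

def H_bits(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_s = trans[0]                                   # state distribution one step after entering: uniform

# EFE ambiguity term for this step: E_{P(s)}[ H(P(o | s)) ] -- zero, observations are deterministic.
efe_ambiguity = float(p_s @ np.array([H_bits(obs_lik[s]) for s in range(N)]))

# Observational entropy for this step: H(P(o)) with P(o) = sum_s P(s) P(o | s) -- log2(N) bits.
obs_entropy = H_bits(p_s @ obs_lik)

print(efe_ambiguity, obs_entropy)                # 0.0 vs 6.0
```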

Given these observations, any deviation from using the observational entropy as an objective for planning purposes should be motivated from a computational perspective. There is some argument to be made here. The EFE only requires us to take an expectation with respect to $Q^*(s_{\tau:T},o_{\tau:T}|\pi)$. Directly using the observational entropy would require an expectation with respect to $Q^*(s_{\tau:T})$ to obtain $P(o_{\tau:T}|\pi)$, and then a separate expectation with respect to $P(o_{\tau:T}|\pi)$ to calculate the entropy. Thus, it’s seemingly more appropriate to see the behavior of the EFE as an acceptable and practically beneficial planning bias that arises from a computational simplification.

Planning as inference

Left for the future.