Max Fierro

Poem 2. "Occidental"

Wed, 30 Apr 2025 00:00:00 +0000

The black of night is the end of infinity;
truth reaches me at the speed of light.
They are trajectories that trace between the atoms,
they are curves that lay bare parallel mysteries.
It is a hypothesis of time written in space,
it is information that seeps through matter.

The Elo Rating System through Likelihood Gradient Ascent

Wed, 30 Apr 2025 00:00:00 +0000

Abstract

Probability and optimization are strong monsters. The Elo rating system, used to estimate performance in competitive chess, online dating, and AI agents, is an under-the-hood reminder of this fact that operates within many of the systems that need to establish comparative metrics. This piece is my contribution to the endless pile of explainers on the topic. I emphasize bayesian statistics and optimization in a way that should ring a bell for anyone familiar with the basics of machine learning.

Background

Mathematical Orderings

At the risk of including a needless dependency on the topic of this piece, I introduce you to the idea of an ordering. Colloquially, we take this to mean an arrangement (i.e., a permutation) of a set of things. We will replace that with its formal meaning, which is a specific kind of binary relation.

Definition

A binary relation $R$ from a set $X$ to another $Y$ is a subset of $X \times Y$, where it is possible that $X = Y$.

This should seem odd, as a subset is in no obvious way reminiscent of a permutation. But introducing some new syntax to indicate membership in a relation,

$$ (x, y) \in R \vdash xRy, $$

we are an example away from making sense. In particular, consider $R = \, \leq$ (less-than-or-equal). When we say things like “$x \leq y$,” we are in fact using syntactic sugar for “$(x, y) \in \, \leq$.” With this in mind, we can take a look at partial orders.

Definition

A partial order $R$ is a binary relation over a set $X$ and itself which satisfies the following:

Reflexivity. This means that $\forall x \in X, \, xRx$.
Antisymmetry. This means that $\forall (x, y) \in X^2, \, xRy \wedge yRx \implies x = y$.
Transitivity. This means that $\forall (x, y, z) \in X^3, \, xRy \wedge yRz \implies xRz$.

The canonical example of a partial order is $\subseteq$ over $\mathcal P(S)$. Importantly, a partial order over a set does not imply a permutation over it, because of the possibility for two elements $x$ and $y$ to be unrelated, or in other words, for $\neg xRy$ and $\neg yRx$. In a total order, we simply do not allow this.

Definition

A total order $R$ is a partial order that is also total, which means that $\forall (x, y) \in X^2, \, xRy \vee yRx$.

The canonical example of a total order is $\leq$ over $\mathbb{R}$. With a total order over a finite set $X$, there is a single valid ordering $\bold{x}$ (i.e., arrangement or permutation) of $X$ such that $x_iRx_{i + 1}$ for all $i = 0, \ldots, |X| - 2$. One more variation we can make on the idea of an order is that of a weak order.

Definition

A weak order $R$ is a total order that is not necessarily antisymmetric.

In other words, it is possible that for distinct elements $x$ and $y$, both $xRy$ and $yRx$. This is conceptually aligned with allowing “ties” in any resulting ordering, potentially sacrificing their uniqueness.
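The three definitions above can be checked mechanically on small finite sets. The following is a minimal sketch, not part of the original argument; the helper names (`classify`, `relation`) and the three example relations are my own illustrative choices.

```python
from itertools import product

def is_reflexive(X, R):
    return all((x, x) in R for x in X)

def is_antisymmetric(X, R):
    # Whenever both xRy and yRx hold, x and y must be the same element.
    return all(x == y for x, y in product(X, repeat=2)
               if (x, y) in R and (y, x) in R)

def is_transitive(X, R):
    return all((x, z) in R for x, y, z in product(X, repeat=3)
               if (x, y) in R and (y, z) in R)

def is_total(X, R):
    return all((x, y) in R or (y, x) in R for x, y in product(X, repeat=2))

def classify(X, R):
    """Name the strongest of the definitions above that R satisfies on X."""
    partial = is_reflexive(X, R) and is_antisymmetric(X, R) and is_transitive(X, R)
    if partial and is_total(X, R):
        return "total order"
    if is_transitive(X, R) and is_total(X, R):
        return "weak order"  # total and transitive, but ties are allowed
    if partial:
        return "partial order"
    return "none of these"

def relation(X, holds):
    """Build a relation as an explicit subset of X x X."""
    return {(x, y) for x, y in product(X, repeat=2) if holds(x, y)}

numbers = [1, 2, 3]
subsets = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]
words = ["a", "b", "cc"]

leq = relation(numbers, lambda x, y: x <= y)              # usual order on numbers
subseteq = relation(subsets, lambda x, y: x <= y)         # set inclusion
shorter = relation(words, lambda x, y: len(x) <= len(y))  # compare by length

print(classify(numbers, leq))
print(classify(subsets, subseteq))
print(classify(words, shorter))
```

Set inclusion fails totality because `{a}` and `{b}` are unrelated, while comparison by length fails antisymmetry because `"a"` and `"b"` tie.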

Elo Ratings and Updates

We can take a look at the question that Arpad Elo (kind of) answered: How can you compare the skill level of two chess players?

Arpad Elo (August 25, 1903 – November 5, 1992)

His proposed procedure is straightforward. Each player $i \in N$ will have a real-valued rating $r_i$, which will be a proxy for their skill level. These ratings will be initialized at some predetermined value for all players. Then, when there is a match between player $i$ and $j$, the following updates are made:

$$ \begin{align*} &r_j \gets r_j + k(s_j - e_j), \\ &r_i \gets r_i + k(s_i - e_i), \end{align*} $$

where¹,

$$ e_p = \frac{1}{1 + e^{-(r_p - r_{\text{other}})}}, \;\;\;\; s_p = \begin{cases} 1 &\text{if $p$ wins},\\ 0.5 &\text{if draw},\\ 0 &\text{if $p$ loses},\\ \end{cases} $$

and $k$ is a constant chosen arbitrarily. So, as players accrue matches with other players, their ratings are updated according to the above rules with the hope that they will eventually stabilize. Now, the difference between players’ ratings can be used to compare their skill levels via the ordering $\leq$ on $\mathbb{R}$.
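As a minimal sketch of the update rule above (using the unscaled $e_p$ from this section, and a hypothetical step size `k=0.1` that I picked only for illustration):

```python
import math

def expected_score(r_p, r_other):
    """e_p from the text: a logistic function of the rating difference."""
    return 1.0 / (1.0 + math.exp(-(r_p - r_other)))

def elo_update(r_i, r_j, s_i, k=0.1):
    """Apply one Elo update after a match between players i and j.

    s_i is 1.0 if i wins, 0.5 for a draw, and 0.0 if i loses.
    The value of k is an arbitrary illustrative constant.
    """
    s_j = 1.0 - s_i
    e_i = expected_score(r_i, r_j)
    e_j = expected_score(r_j, r_i)
    return r_i + k * (s_i - e_i), r_j + k * (s_j - e_j)

# Two equally rated players; player i wins.
new_i, new_j = elo_update(0.0, 0.0, s_i=1.0, k=0.1)
print(new_i, new_j)
```

Since $e_i + e_j = 1$ and $s_i + s_j = 1$, the two adjustments cancel out, so the sum of the ratings is preserved by every update.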

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a method used to fit distribution parameters to samples. The setup for MLE is a random variable $Y$ of known distribution $\mathcal{D}_\theta$ (parameterized by $\theta$), with access to IID samples $\langle y_i \rangle \sim \mathcal{D}_\theta$. The objective is to estimate $\hat\theta$ such that the likelihood function $\mathcal{L}$ is maximized,

$$ \begin{equation} \hat\theta_\mathrm{MLE} = \argmax_\theta \, \mathcal{L}(\theta; \langle y_i \rangle) = \argmax_\theta \, \prod_i \mathbb{P}_\theta[Y = y_i]. \end{equation} $$

In other words, MLE is the optimization procedure associated with finding the distribution parameters that were most likely to generate observed data, provided that we know or assume its distribution.
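To make this concrete, here is a small sketch with a Bernoulli distribution, which is my choice of example and does not appear elsewhere in this piece. It maximizes the (log-)likelihood over a grid of candidate parameters and compares the result with the well-known closed form, the sample mean.

```python
import math

def log_likelihood(theta, samples):
    """Log of the product in (1) for a Bernoulli(theta) variable."""
    return sum(math.log(theta if y == 1 else 1.0 - theta) for y in samples)

samples = [1, 0, 1, 1]                       # hypothetical IID observations
grid = [i / 1000 for i in range(1, 1000)]    # candidate values of theta in (0, 1)

theta_hat = max(grid, key=lambda theta: log_likelihood(theta, samples))
closed_form = sum(samples) / len(samples)    # sample mean

print(theta_hat, closed_form)
```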

MAP Estimation

When there is access to a (known or assumed) prior $p(\theta)$ on the distribution of parameters, we can fold it into our optimization process by maximizing the posterior distribution instead of the likelihood. By Bayes’ theorem, the posterior is

$$ \begin{align*} p(\theta \mid \langle y_i\rangle) &= \frac{p(\langle y_i \rangle \mid \theta) p(\theta)}{p(\langle y_i \rangle)} \; \propto \; \underbrace{p(\langle y_i \rangle \mid \theta)}_{\displaystyle{\mathcal{L}(\theta; \langle y_i \rangle)}} p(\theta). \end{align*} $$

The resulting parameters $\hat\theta_\mathrm{MAP}$ are then a maximum a posteriori (MAP) estimate,

$$ \begin{align*} \hat\theta_{\mathrm{MAP}} &= \argmax_\theta\;p\bigl(\theta \mid \langle y_i\rangle\bigr) \\ &= \argmax_\theta\;\Bigl[\mathcal{L}\bigl(\theta;\langle y_i\rangle\bigr)\,p(\theta)\Bigr] \\ &= \argmax_\theta\;\Bigl[\prod_{i=1}^n \mathbb{P}_\theta\bigl[Y=y_i\bigr] \times p(\theta)\Bigr]. \end{align*} $$

Optimization

Sometimes, it is possible to find closed-form solutions for $\hat\theta_\mathrm{MAP}$ and $\hat\theta_\mathrm{MLE}$ through convex optimization. For example, samples with gaussian noise lead to the closed-form solution of the OLS problem through the process of MLE.

However, most of the time the resulting optimization objective of MLE (and hence also MAP estimation) is not convex. Here, gradient-based approaches (along with all other non-convex optimization techniques) are helpful for finding local maxima of the likelihood objective.

Derivation

The derivation presented here will depart from the usual in hopes of contributing some kind of novelty. We begin with the game-theory-native idea of payoff, which we will take to be a numeric value representing a player’s utility differential with respect to the start of a game $G$,

$$ \text{payoff of player } i = p_i. $$

Next, we will consider player performance. Just as Elo did, we take the performance of a player $i$ on a game $G$ to be a real-valued random variable $X_i$, independent of the performance of other players.

Observation

Perhaps Elo motivated this decision after noticing the variance of his own performance over the chess board.

Then, we will expand our setup by allowing players to outperform others, which we will present through the difference between the performance of two players during $G$ (which is another RV),

$$ \delta_{i, \, j} = X_i - X_j. $$

We will also establish a relationship between $\delta_{i , \, j}$ and $p_i$. For this purpose, we introduce a game-specific mapping $g$ with a noise term $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$, which together form the generative process of payoffs:

$$ p_i = g(\delta_{i, \, j}) + \epsilon. $$

Finally, we will assert a prior on the distribution of $X_i$, which we will refer to as $\mathcal{D}(\theta_i)$ without yet deciding on a particular distribution (just that it is parameterized by $\theta_i$). This prior $\pi(x)$ will be global for all players, and its distribution parameters will be $\theta_\pi$.

Note

While this is a global prior, notice that none of the following breaks if it were player-specific from the start.

So far, none of this has helped us answer the question that Elo answered. For that, we will introduce one last artifact on top of our setup; each player $i$ will have a “rating” $r_i$, which we will ultimately use to order players by skill in our system or organization:

$$ r_i = \mathbb{E}[X_i]. $$

MAP Estimation

Clearly, since our goal is to know players’ ratings, the only additional information we will need to get them is the distribution parameters $\theta_i$. Of course, in the absence of observations, we can assert from our prior

$$ r_i = \mathbb{E}_{X \, \sim \, \pi}[X]. $$

But what if at the end of a game $G$ between players $i$ and $j$, we observe WLOG the payoff $p_i$? Here, we will be wishing that $g$ is neatly invertible. Assuming it is, we arrive at the following MLE for their difference in performance via application of $(1)$:

$$ \begin{equation} \hat\delta_{i, \, j} = \argmax_{\delta} \exp\!\Bigl(-\frac{(p_i-g(\delta))^2}{2\sigma_\epsilon^2}\Bigr) = g^{-1}(p_i). \end{equation} $$

Note

Notice we did not use a prior when estimating $\hat\delta_{i, \, j}$. This assumption is due to Elo; we will not use players’ history when calculating their performance for a single game. This is the design decision that, by omission, accounts for sudden changes in player skill (as a result of learning, etc.).

Note

We just estimated the difference between the performance of the players from the payoff of a single player. The invertibility of $g$ has the hidden implication that it is strictly monotonic; no two differences in performance lead to the same payoff, and the greater the difference, the greater the payoff for the outperforming player.

Knowing this, we can perform a bayesian update to our prior through MAP estimation. Writing down the joint posterior of the parameters of $X_i$ and $X_j$,

$$ \begin{equation} p(\theta_i, \, \theta_j \mid \hat\delta_{i, \, j}) \propto \underbrace{ p(\hat\delta_{i, \, j} \mid \theta_i, \, \theta_j) }_{ \displaystyle{p(\hat\delta_{i, \, j} \mid X_i - X_j)} } p(\theta_i)p(\theta_j). \end{equation} $$

Notice that we already have access to priors $p(\theta_i)$ and $p(\theta_j)$; those are quite simply $\pi(\theta_i)$ and $\pi(\theta_j)$, which we assume per our initial setup.

Gaussian Performance

We proceed by considering the case where $X_i \sim \mathcal{N}(r_i, \, \sigma_i^2)$, such that $\theta_i = (r_i, \, \sigma_i^2)$. That is, player performance is gaussian-distributed,

$$ \begin{equation} p_{X_i}(x_i \mid \theta_i) = \frac{1}{\sqrt{2\pi} \, \sigma_i} \exp\!\Bigl(-\frac{(x_i - r_i)^2}{2\,\sigma_i^2}\Bigr). \end{equation} $$

Our next goal is to set up an analytic function for the likelihood $p(\hat\delta_{i, \, j} \mid \theta_i, \, \theta_j)$. We observe that we have access to the conditional density of $\hat\delta_{i, \, j}$

$$ \begin{equation} p(\hat\delta_{i, \, j} \mid x_i, \, x_j) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\Bigl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - (x_i - x_j)\bigr)^2}{2\,\sigma_\varepsilon^2}\Bigr) \;|g'(\hat\delta_{i, \, j})| \end{equation} $$

by using $(2)$ implicitly through the change of variables

$$ p(\hat\delta_{i, \, j} \mid x_i, \, x_j) = p_{p_i}\bigl(g(\hat\delta_{i, \, j}) \mid x_i, \, x_j \bigr) \;\Bigl|\frac{d}{d\hat\delta_{i, \, j}}\,g(\hat\delta_{i, \, j})\Bigr|, $$

where we take

$$ p\bigl(p_i \mid x_i, \, x_j \bigr) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\Bigl(-\frac{\bigl(p_i - g(x_i - x_j)\bigr)^2}{2\,\sigma_\varepsilon^2}\Bigr). $$

Now, we can use $(4)$ and $(5)$ to derive the desired likelihood by marginalizing,

$$ \begin{align*} p\bigl(\hat\delta_{i, \, j} \mid \theta_i,\theta_j\bigr) &= \iint p\bigl(\hat\delta_{i, \, j}\mid x_i,x_j\bigr)\; p_{X_i}(x_i \mid \theta_i)\;p_{X_j}(x_j \mid \theta_j)\, dx_i\,dx_j \\ &= \iint \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\Bigl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - (x_i - x_j)\bigr)^2} {2\,\sigma_\varepsilon^2}\Bigr)\, \bigl|g'(\hat\delta_{i, \, j})\bigr| \\[-2pt] &\quad\;\times\, \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\Bigl(-\frac{(x_i - r_i)^2}{2\,\sigma_i^2}\Bigr) \;\frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\Bigl(-\frac{(x_j - r_j)^2}{2\,\sigma_j^2}\Bigr) \,dx_i\,dx_j. \\[6pt] \end{align*} $$

After another cup of coffee, we arrive at the following version of our joint likelihood $p\bigl(\hat\delta_{i, \, j} \mid \theta_i,\theta_j\bigr)$,

$$ \begin{aligned} &= \int \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\Bigl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - d\bigr)^2}{2\,\sigma_\varepsilon^2}\Bigr) \;|g'(\hat\delta_{i, \, j})|\; \\ &\quad\;\quad\;\times \frac{1}{\sqrt{2\pi(\sigma_i^2+\sigma_j^2)}} \exp\!\Bigl(-\frac{(d - (r_i - r_j))^2}{2(\sigma_i^2+\sigma_j^2)}\Bigr) \,dd \end{aligned} $$

where the substitution $d = x_i - x_j$ (hinted at in equation $(3)$) is possible because $X_i - X_j \sim \mathcal{N}(r_i - r_j, \sigma_i^2 + \sigma_j^2)$. Finally, we obtain the following after remembering an important fact from signal processing (the convolution of two gaussians is another gaussian),

$$ \begin{align} \mathcal{J}_\mathrm{MLE}(\theta_i, \, \theta_j; \, \hat\delta_{i, \, j}) = \frac{|g'(\hat\delta_{i, \, j})|}{\sqrt{2\pi\,\bigl(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2\bigr)}}\, \exp\!\Biggl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - (r_i - r_j)\bigr)^2} {2\,\bigl(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2\bigr)}\Biggr). \end{align} $$

Wonderful. We then attend to the reflexes drilled into our brains from machine learning, and find the gradient of the log-likelihood with respect to learned… ahem, the ratings $\bold{r} = [r_i, \, r_j]^\top$:

$$ \begin{equation} \nabla_\bold{r}\log\mathcal{J}_\mathrm{MLE}(\theta_i, \, \theta_j; \, \hat\delta_{i, \, j}) = \begin{bmatrix} \displaystyle\frac{g(\hat\delta_{i, \, j}) - (r_i - r_j)}{\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2} \\[8pt] \\ \displaystyle-\frac{g(\hat\delta_{i, \, j}) - (r_i - r_j)}{\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2} \end{bmatrix}. \end{equation} $$

Note

By taking the $\log$ of the joint likelihood we achieve nothing, but we respect a very important tradition².

Using $\nabla_\bold{r}\log\mathcal{J}_\mathrm{MLE}(\theta_i, \, \theta_j; \, \hat\delta_{i, \, j})$ as it stands to adjust $\bold{r}$ would be tantamount to MLE on $\bold{r}$. To turn this into a proper MAP estimate we must also fold our prior terms, which we assume to be gaussian, into $(6)$:

$$ \begin{aligned} \mathcal{J}_\mathrm{MAP}(\theta_i, \, \theta_j; \, \hat\delta_{i, \, j}) &= \frac{\lvert g'(\hat\delta_{i,j})\rvert} {\sqrt{2\pi\,\bigl(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2\bigr)}} \exp\!\Bigl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - (r_i - r_j)\bigr)^2} {2\,\bigl(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2\bigr)}\Bigr)\\ &\quad\;\times\; \frac{1}{\sqrt{2\pi}\,\sigma_\pi} \exp\!\Bigl(-\frac{(r_i - r_\pi)^2}{2\,\sigma_\pi^2}\Bigr) \;\times\; \frac{1}{\sqrt{2\pi}\,\sigma_\pi} \exp\!\Bigl(-\frac{(r_j - r_\pi)^2}{2\,\sigma_\pi^2}\Bigr). \end{aligned} $$

Being again unable to ignore our instincts,

$$ \begin{equation} \nabla_{\mathbf r}\log \mathcal{J}_{\mathrm{MAP}}(\theta_i,\theta_j;\hat\delta_{i,j}) = \begin{bmatrix} \displaystyle \frac{g(\hat\delta_{i,j}) - (r_i - r_j)}{\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2} \;-\;\frac{r_i - r_\pi}{\sigma_\pi^2} \\\\ \displaystyle -\frac{g(\hat\delta_{i,j}) - (r_i - r_j)}{\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2} \;-\;\frac{r_j - r_\pi}{\sigma_\pi^2} \end{bmatrix}. \end{equation} $$

Checkpoint

Let us take a step back for a second, and roughly see what is on the table. Intuitively, we are:

Observing a materialized payoff $p_i$.
Inverting $g$ to recover the latent skill gap $\hat\delta_{i, \, j}$ that was most likely to produce $p_i$.
Comparing that inferred gap to our current belief of the skill gap $r_i - r_j$.
Deriving the change to $r_i$ and $r_j$ that would bring our belief closer to $\hat\delta_{i, \, j}$.

Then, the gradient-ascent update with step size $k$,

$$ \begin{equation} \bold{r}_{t + 1} \gets \bold{r}_{t} + k\nabla_{\bold{r}}\log\mathcal{J}_\mathrm{MAP}(\theta_i, \, \theta_j; \, \hat\delta_{i, \, j}), \end{equation} $$

offers a complete recovery (and generalization) of the Elo update after an observed payoff $p_i$.
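The update can be sketched in a few lines. This is an illustration under my own assumptions: `s2` stands for $\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2$, `s2_prior` for $\sigma_\pi^2$, all variances are held fixed, $g(\hat\delta_{i, \, j})$ is replaced by the observed payoff $p_i$, and every numeric value is made up. The objective function drops the additive terms of $\log\mathcal{J}_\mathrm{MAP}$ that do not depend on the ratings.

```python
def log_map_objective(r_i, r_j, p_i, s2, r_prior, s2_prior):
    """log J_MAP up to additive terms that do not depend on r_i or r_j."""
    residual = p_i - (r_i - r_j)
    return (-residual ** 2 / (2 * s2)
            - (r_i - r_prior) ** 2 / (2 * s2_prior)
            - (r_j - r_prior) ** 2 / (2 * s2_prior))

def map_gradient(r_i, r_j, p_i, s2, r_prior, s2_prior):
    """Gradient of log J_MAP with respect to (r_i, r_j), as in equation (8)."""
    residual = p_i - (r_i - r_j)
    grad_i = residual / s2 - (r_i - r_prior) / s2_prior
    grad_j = -residual / s2 - (r_j - r_prior) / s2_prior
    return grad_i, grad_j

def map_update(r_i, r_j, p_i, k, s2, r_prior, s2_prior):
    """One gradient-ascent step with step size k, as in equation (9)."""
    grad_i, grad_j = map_gradient(r_i, r_j, p_i, s2, r_prior, s2_prior)
    return r_i + k * grad_i, r_j + k * grad_j

# Both players start at the prior mean; player i observes payoff 1.
print(map_update(0.0, 0.0, p_i=1.0, k=0.1, s2=1.0, r_prior=0.0, s2_prior=4.0))
```

When both ratings sit at the prior mean, the prior terms vanish and the step reduces to the plain MLE step: the ratings move apart symmetrically in the direction of the observed payoff.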

Discussion

Procedural Discrepancy

Usually, implementations of Elo updates do not consider a prior. Instead, they simply initialize parameters at some default amount, then do MLE (as opposed to MAP estimation) to produce gradient updates. I decided to display the full MAP estimate because I think it is more principled; if you believe that ratings “start off” at some amount, that constitutes a bayesian prior in my eyes.

Distribution Discrepancy

The Elo rating system assumes a logistic distribution on player performance, not gaussian. However, the above procedure will invariantly recover Elo updates as presented in the background section with both distributions (at least in form). I thought it would be somewhat interesting to make it gaussian.

Fixed Parameters

In theory, one could estimate the variance parameters using the exact same procedure, by taking the gradient of the joint likelihood with respect to them in addition to the means (the ratings). Surprisingly, people do things similar to this – although not in this particular way. See the Glicko rating system.

Redundancy with $g$

You may have noticed that throughout our derivations (most notably in equations $(7)$ and $(8)$) there are $g(\hat\delta_{i, \, j})$ terms that can be safely replaced with $p_i$ by definition, and can therefore be seen as redundant. This is a completely accurate observation.

I decided to make $g$ explicit to make the fundamental link between payoffs and performance differentials also explicit, which is something I consider to be a lot more principled. In fact, $g$ does not need to be strictly monotonic, as we never explicitly evaluate $g^{-1}(\small\bullet)$. However, not satisfying this property may result in a lack of parameter identifiability, which is easy to forget if you discard the symbol early on.

Weak Ordering

It is important to acknowledge that mapping player skill to $\mathbb{R}$ and then using $\leq$ to order players is a fundamentally misguided approach to how the world works. In doing so, we establish a weak ordering among players, but completely ignore that some players have qualities that make them strong against some players and weak against others (in a manner that is potentially cyclic).

Example

To illustrate this, consider three players of rock-paper-scissors. One always plays rock, one always plays paper, and the other scissors. You will find that there is no way of assigning them a real number such that the player with the highest number beats both of the other players in expectation.
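A short sketch of this example, with the three fixed strategies standing in for the players: whoever holds the highest rating would need to beat both of the others, and no player does.

```python
# (winner, loser) pairs for the three fixed strategies.
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
players = ["rock", "paper", "scissors"]

def beats_everyone(player):
    """True if this player wins against every other player."""
    return all((player, other) in BEATS for other in players if other != player)

# Candidates for the top-rated spot under any real-valued rating.
dominant = [p for p in players if beats_everyone(p)]
print(dominant)
```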

Still, sometimes we are forced to make rankings which make sense in expectation. In the real world, there is sufficient variance in player attributes that there are actors that can consistently beat some others. Here, systems such as Elo’s bring real utility. But as a human, you should trust your intuition more than some potentially senseless number.

Outcome Prediction

Further expanding on the inadequacy of ranking players via weak order, consider the very plausible machine learning task of outcome prediction, say, for the game of Basketball.

It is tempting to, for example, train a network $f_\theta : \mathbb{R}^n \to \mathbb{R}$ on $n$-dimensional encodings of teams to predict a scalar value, where you then train in tandem over historic game outcomes $\langle ((a, b), \, y)_i \rangle_{i \in D}$.

Here, teams $a, \, b \in \mathbb{R}^n$ played each other and achieved outcome $y \in \{-1, 1\}$ for each match $i \in D$ (taking $y = 1$ to mean that $a$ won). One could optimize under the following loss,

$$ \mathcal{L}((a, \, b), \, y; \, \theta) = \log\bigl( 1 + \exp(-y(f_\theta(a) - f_\theta(b)))\bigr) + \lambda(f_\theta(a) + f_\theta(b)), $$

where the regularization term helps with stability. Then, $f_\theta$ would essentially become a rating estimator. Whoever does this, however, will have the same fundamental problem as the Elo system; a weak order cannot capture the potentially cyclic structure of actors’ dominance on each other.

The solution, of course, is to instead train another model $f_\theta : \mathbb{R}^{2n} \to \mathbb{R}$ that admits pairings as an input via concatenation, and implements typical binary cross-entropy loss:

$$ \mathcal{L}((a, \, b), \, y; \, \theta) = -\bold{I}_y\,\log\bigl(\sigma(f_\theta(a \Vert b))\bigr) - (1 - \bold{I}_y)\,\log\bigl(1-\sigma(f_\theta(a \Vert b))\bigr). $$

However, there is no free lunch – when training over pairs of teams in $T$, the sample space of the task grows with the size of $T \times T$, naturally increasing the amount of out-of-distribution data for your model quadratically. Of course, this problem was ignored by the first formulation too, just in a different way.
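For reference, both objectives can be written as functions of the model outputs alone. This is a hedged sketch: no network is trained here, I read the first loss as the standard logistic loss on the rating difference, I take $y = 1$ to mean that $a$ won, and I take $\bold{I}_y$ to be $1$ exactly when $y = 1$. All of these are my assumptions, as are the function names.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def rating_loss(f_a, f_b, y, lam=0.0):
    """Loss for the scalar-rating model; f_a and f_b are the two outputs."""
    return softplus(-y * (f_a - f_b)) + lam * (f_a + f_b)

def pair_loss(logit, y):
    """Binary cross-entropy for the pairwise model; logit is f(a || b)."""
    # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z).
    return softplus(-logit) if y == 1 else softplus(logit)

print(rating_loss(0.0, 0.0, y=1))   # uninformative ratings
print(pair_loss(0.0, y=1))          # uninformative pairwise logit
print(rating_loss(2.0, 0.0, y=1), rating_loss(0.0, 2.0, y=1))
```

With uninformative outputs both losses equal $\log 2$, and the rating loss shrinks as the winner's rating rises above the loser's.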

Appendix

Gaussian Convolution

Here, I justify equation $(6)$ by instantiating a proof of the fact that the convolution of two gaussians is another gaussian determined by the parameters of the original gaussians.

Proof

This was made via ChatGPT with o4-mini-high and adjusted by me, because you can probably find it in a textbook somewhere. Let

$$ f(d)=\frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\!\Bigl(-\frac{(d-\mu_1)^2}{2\,\sigma_1^2}\Bigr), \quad g(d)=\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\!\Bigl(-\frac{(d-\mu_2)^2}{2\,\sigma_2^2}\Bigr). $$

We wish to show

$$ \int_{-\infty}^{\infty} f(d)\,g(d)\,dd =\frac{1}{\sqrt{2\pi\,(\sigma_1^2+\sigma_2^2)}}\, \exp\!\Bigl(-\frac{(\mu_1-\mu_2)^2}{2\,(\sigma_1^2+\sigma_2^2)}\Bigr). $$

Set $A=\sigma_1^2$ and $B=\sigma_2^2$. Then,

$$ f(d)\,g(d) =\frac{1}{2\pi\sqrt{AB}} \exp\!\Bigl(-\tfrac12\bigl[\tfrac{(d-\mu_1)^2}{A}+\tfrac{(d-\mu_2)^2}{B}\bigr]\Bigr). $$

Combine quadratic terms:

$$ B(d-\mu_1)^2 + A(d-\mu_2)^2 =(A+B)\Bigl(d-\frac{B\mu_1 + A\mu_2}{A+B}\Bigr)^2 +\frac{AB}{A+B}(\mu_1-\mu_2)^2. $$

Define

$$ m=\frac{B\mu_1 + A\mu_2}{A+B}, \quad C=\frac{AB}{A+B}. $$

Then,

$$ \int f(d)\,g(d)\,dd =\frac{1}{2\pi\sqrt{AB}} \int \exp\!\Bigl(-\tfrac12\bigl[\tfrac{(d-m)^2}{C}+\tfrac{(\mu_1-\mu_2)^2}{A+B}\bigr]\Bigr) \,dd. $$

Factor out the constant term and use

$$ \int \exp\Bigl(-\frac{(d-m)^2}{2C}\Bigr) \, dd =\sqrt{2\pi\,C}. $$

Hence,

$$ \begin{aligned} \int f(d)\,g(d)\,dd &=\frac{\sqrt{2\pi\,C}}{2\pi\sqrt{AB}} \exp\!\Bigl(-\frac{(\mu_1-\mu_2)^2}{2\,(A+B)}\Bigr)\\ &=\frac{1}{\sqrt{2\pi\,(A+B)}} \exp\!\Bigl(-\frac{(\mu_1-\mu_2)^2}{2\,(A+B)}\Bigr). \quad \square \end{aligned} $$
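The identity is also easy to sanity-check numerically. The sketch below compares a plain Riemann sum of $f(d)\,g(d)$ against the closed form, using arbitrary parameter values that I picked for illustration.

```python
import math

def gaussian(d, mean, var):
    """Density of a gaussian with the given mean and variance, evaluated at d."""
    return math.exp(-(d - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def product_integral(mu1, var1, mu2, var2, lo=-30.0, hi=30.0, steps=60000):
    """Riemann-sum approximation of the integral of f(d) g(d) over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(gaussian(lo + k * h, mu1, var1) * gaussian(lo + k * h, mu2, var2)
                   for k in range(steps + 1))

def closed_form(mu1, var1, mu2, var2):
    """Right-hand side of the identity: a gaussian in mu1 - mu2 with variance A + B."""
    return gaussian(mu1, mu2, var1 + var2)

numeric = product_integral(1.0, 0.7, -0.5, 1.3)
exact = closed_form(1.0, 0.7, -0.5, 1.3)
print(numeric, exact)
```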

Instantiation

Consider the expression which $(6)$ was derived from,

$$ \begin{aligned} &\int \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\! \overbrace{ \Bigl(-\frac{\bigl(g(\hat\delta_{i, \, j}) - d\bigr)^2}{2\,\sigma_\varepsilon^2}\Bigr) }^{ \text{Quadratic term is symmetric.} } \;|g'(\hat\delta_{i, \, j})|\; \\ &\quad\;\quad\;\times \frac{1}{\sqrt{2\pi(\sigma_i^2+\sigma_j^2)}} \exp\!\Bigl(-\frac{(d - (r_i - r_j))^2}{2(\sigma_i^2+\sigma_j^2)}\Bigr) \,dd. \end{aligned} $$

Now, use the substitutions

$$ \mu_1 = g(\hat\delta_{i, \, j}), \quad \mu_2 = r_i - r_j, \quad A = \sigma_\varepsilon^2, \quad B = \sigma_i^2 + \sigma_j^2, $$

and re-attach the Jacobian factor $|g^\prime(\hat\delta_{i, \, j})|$ to recover

$$ \mathcal{J}_\mathrm{MLE}(\theta_i,\theta_j;\hat\delta_{i, \, j}) =\frac{|g'(\hat\delta_{i, \, j})|} {\sqrt{2\pi\,(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2)}}\, \exp\!\Bigl(-\frac{(g(\hat\delta_{i, \, j}) - (r_i - r_j))^2} {2\,(\sigma_\varepsilon^2 + \sigma_i^2 + \sigma_j^2)}\Bigr). $$

¹ I modified a term in $e_p$ to exclude scaling factors, to make it look less crazy. These scaling factors make the resulting ratings quite practical by allowing one to make comparisons like “player $i$ is 10x better than $j$ if $i$’s rating is 400 points higher.”
² Taking the $\log$ makes it easier to deal with multiple samples, as it turns the product in $(1)$ into a sum. But here, we only use one sample, so it is useless. However, tradition is important for learning.

Poem 1. "Delia"

Sun, 27 Apr 2025 00:00:00 +0000

Mother who never lets go of the banner,
mother of the jungle who covers you like ivy;
she lulls me from the half-light,
she whispers to me the name of God.
Take me away from here now, for I am dying, mother,
give me the words that belong to me,
stir my blood from within your stealth,
let your love reach me.
Eternal solitude and short life,
do not go and forget me.
Mother who, blindly, sees everything;
heart of parota, words of light.
You are the summit of this desert,
author of my method,
mother of everything.
My first love, and also the last,
my castle, my shield, my angel.
Eagle of five hundred virtues,
infinite pinwheel of colors,
stone of inexorable pride.
Earthly tree that touches the sky,
imminent calm,
solar warmth.

N-Gram Model of Optimal Policy on Interpretable Abstractions

Fri, 21 Feb 2025 00:00:00 +0000

Abstract

Generally, the choice of functional form of a policy model and of the domain that it operates on forms the basis of interpretability. Domains that are the image of class-valued abstractions of the observable state space are desirable because humans excel at visual classification tasks that map onto (largely) discrete characteristics. Hence, we provide an interpretable functional form that is valid over multiclass spaces in the form of an $n$-gram model approximation of dynamics under optimal policy.

Background

$N$-Gram Modeling

$n$-gram models were developed as a rudimentary statistical model of language. They assume an $n^{\text{th}}$-order Markov property on the probability of a word $w_{t + 1}$ at discrete time $t + 1$ given a history $\langle w_i \rangle_{i \in [1, \, t]}$, so that only the $n$ most recent words matter:

$$ \begin{equation} P(w_1, \, \ldots, w_{t + 1}) = P(w_1, \dots, w_{t}) \, P(w_{t+1} \mid w_{t-n+1}, \dots, w_t). \end{equation} $$

Straightforward maximum likelihood estimation shows that the conditional probability on the right is the proportion of times that $w_{t + 1}$ follows the sequence $\langle w_{t-n+1}, \, \ldots, w_t \rangle$ among all appearances of that sequence in observations. This can be seen as frequentist inference, making the probability measure intuitive.

When applied to a set of symbols (words) $S$, such a model implies a Markov chain over the product $S^n = S \times \cdots \times S$. It follows that the chain’s stochastic matrix $\Pi$ is an element of $\mathbb{R}^{k \times k^n}$ with $k = |S|$, so the number of learnable parameters grows exponentially with the order of the model for a fixed $S$.
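A minimal sketch of this counting estimate follows, using a toy token sequence of my own. The function name `fit_ngram` and the dictionary layout are illustrative choices, not an established API.

```python
from collections import Counter

def fit_ngram(tokens, n):
    """Estimate P(next | history of length n) by relative frequency."""
    history_counts = Counter()
    joint_counts = Counter()
    for t in range(n, len(tokens)):
        history = tuple(tokens[t - n:t])
        history_counts[history] += 1
        joint_counts[(history, tokens[t])] += 1
    return {(history, nxt): count / history_counts[history]
            for (history, nxt), count in joint_counts.items()}

tokens = "a b a b a c".split()
model = fit_ngram(tokens, n=1)
print(model[(("a",), "b")], model[(("a",), "c")], model[(("b",), "a")])

# Size of the stochastic matrix for this symbol set: k rows, k^n columns.
k, n = len(set(tokens)), 1
print(k * k ** n)
```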

As a result of upholding the Markov property, $n$-gram models are stationary¹. This flaw makes them incompatible with natural language to any useful extent, and is directly addressed by modern language models through mechanisms like attention.

Rules of Thumb

Many heuristics taught in strategic decision-making can be described as conditionals on the result of classification exercises. For example, there is a rule of thumb in Chess which calls for protecting one’s own king if it is open.

When implementing this heuristic, a player performs classification via a mapping $\phi : S \to \{\text{Yes}, \, \text{No}\}$ from the set of board states to an answer to the heuristic’s condition, where experience insists that if a player’s $\phi$ is sufficiently close to ground truth, they obtain a performance improvement in expectation.

Naturally, the complexity involved in evaluating a classification $\phi_h(s)$ for some state $s \in S$ should be minimal so that its heuristic $h$ can be implemented without computer assistance. In many cases, their simplicity to humans (i.e., how intuitive they are) directly translates to the simplicity of implementing them in other models of computation. Put simply, it is generally easy to program such functions.

However, humans can obtain an unexplainable intuitive understanding of a game. In such cases, the classification exercises they carry out for their expert heuristics are mappings onto a set of abstract characteristics (e.g., area ‘crowdedness’ in Chess). This can be seen as representation learning.

But even in these cases, it is relatively simple to train a model which replicates a human’s capacity to perform classification for their own expert-level heuristics by having them label training datasets by hand. Hence, one can generally assume access to efficient classifiers for human-interpretable features.

Markov Abstractions

Given an abstraction $\phi : S \to Z$ over a state set $S$, the lack of an injectivity constraint could produce a situation where, for a policy $\pi_S : S \to S$ with $\pi_S(s) = a$ and $\pi_S(s^\prime) = b$ on distinct $a, \, b, \, s, \, s^\prime \in S$,

$$ \begin{equation} \phi(s) = \phi(s^\prime) \;\; \text{and} \;\; \phi(a) \neq \phi(b). \end{equation} $$

Hence, learning a counterpart $\pi_Z : Z \to Z$ which preserves the information in $\pi_S$ could be impossible, as $\pi_Z(\phi(s)) = \pi_Z(\phi(s^\prime))$ would have to ‘retain’ the information of both $\pi_S(s) = a$ and $\pi_S(s^\prime) = b$. Such an abstraction $\phi$ is said to not be Markov, as its image contains insufficient information to induce a dynamics that corresponds to the behavior specified by $\pi_S$ in $S$.
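For finite state sets, the condition in $(2)$ can be checked directly. Below is a small sketch with a made-up four-state example; the policy and abstraction are plain dictionaries, and `markov_violations` is a hypothetical helper name.

```python
def markov_violations(policy, phi):
    """Pairs (s, s') with phi(s) == phi(s') but phi(policy(s)) != phi(policy(s'))."""
    states = sorted(policy)
    return [(s, t)
            for idx, s in enumerate(states)
            for t in states[idx + 1:]
            if phi[s] == phi[t] and phi[policy[s]] != phi[policy[t]]]

# States 0 and 1 move to states 2 and 3 respectively.
policy = {0: 2, 1: 3}

# This abstraction merges 0 and 1 but separates their successors.
phi_bad = {0: "X", 1: "X", 2: "Y", 3: "Z"}
# This one also merges the successors, so no information is lost.
phi_ok = {0: "X", 1: "X", 2: "Y", 3: "Y"}

print(markov_violations(policy, phi_bad))
print(markov_violations(policy, phi_ok))
```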

In the context of interpretable rules of thumb, reducing the state space $S$ to significantly smaller abstract spaces (e.g., taking decisions in $\{\text{Yes}, \, \text{No}\}$ while implementing a heuristic) nearly guarantees that the abstraction which mediated the reduction is not Markov².

Model

Let $\langle \phi^{(\alpha)} : S \to Z^{(\alpha)} \rangle_{\alpha \in \Alpha}$ be a collection of abstractions enumerated in $\Alpha$, and $\pi_S : S \to S$ a policy over $S$. We propose modeling class-conditional transition probability distributions,

$$ \begin{equation} P^{(\alpha)}_{t+1}(k) = P[\phi^{(\alpha)}(\pi^{t + 1}(s)) = k \; | \; \phi^{(\alpha)}(\pi^t(s)) = k_t, \, \ldots, \, \phi^{(\alpha)}(\pi^0(s)) = k_0], \end{equation} $$

of the elements $k_i \in Z^{(\alpha)}$ via an $n$-gram model. This effectively establishes sequences in $\phi^{(\alpha)}(S)$ via repeated application of $\pi$ within $S$ (following the dynamics of $\pi$), so that in the above equation, we allow $\pi^t(s) = \pi_t(\pi_{t-1}(\ldots\pi_1(s)))$. This yields a collection of stochastic matrices $\langle \Pi^{(\alpha)} \rangle_{\alpha \in \Alpha}$ with

$$ \Pi^{(\alpha)}_{i, j} = P[\, i \text{ is observed at time } t \; | \; j \text{ is observed immediately before}\,], $$

where $i \in Z^{(\alpha)}$ and $j \in (Z^{(\alpha)})^n$. The number of learnable parameters (i.e., the size) of such a model $M = \langle \Pi^{(\alpha)} \rangle_{\alpha \in \Alpha}$ is therefore

$$ \begin{equation} |M| = \sum_{\alpha \in \Alpha} |Z^{(\alpha)}|^n \, (|Z^{(\alpha)}| - 1). \end{equation} $$
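As a quick illustration of how this count behaves, the sketch below evaluates the sum for a hypothetical collection of abstractions given only by their class counts $|Z^{(\alpha)}|$.

```python
def model_size(class_counts, n):
    """Sum over abstractions of |Z|^n * (|Z| - 1) free parameters."""
    return sum(k ** n * (k - 1) for k in class_counts)

# Two yes/no heuristics and one three-class heuristic (made-up example).
for order in (1, 2, 3):
    print(order, model_size([2, 2, 3], order))
```

Each of the $|Z^{(\alpha)}|^n$ histories contributes a distribution over $|Z^{(\alpha)}|$ classes, one entry of which is fixed by normalization.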

The finalized abstract policy $\pi_Z$ would use this model to operate on $Z = \large{\times_{\alpha \in \Alpha}} Z^{(\alpha)}$ (see the inference section for a high-level overview of evaluation). By operating on the cross-product of multiple sufficiently independent heuristics, $\pi_Z$ could closely approximate $\pi_S$ while remaining interpretable.

Training

The parameter space for a model $M$ of order $n$ is precisely

$$ \begin{equation} \Theta = \large{\times_{\alpha \in \Alpha}} \large{\times_{k \in Z^{(\alpha)}}} \bold{S}^{|Z^{(\alpha)}|^n}, \end{equation} $$

(where $\bold{S}^d$ denotes the $d$-dimensional unit sphere). Finding optimal parameters $\theta^* \in \Theta$ follows standard procedure as in any $n$-gram model. Hence, we simply provide the generic closed-form solution written in terms of the objects at hand,

$$ \begin{equation} \Pi^{(\alpha)}_{i, j} = \frac{1}{N} \sum_{s \in S} I^{(\alpha)}_{i,j}(\pi^n(s), \langle \pi^m(s) \rangle_{m \in [0, \, n)}), \end{equation} $$

where

$$ \begin{equation*} I^{(\alpha)}_{i,j}(a, \langle b_m \rangle_{m \in [0, \, n)}) = \begin{cases} 1 & \text{if } \; \phi^{(\alpha)}(a) = i \; \text{ and } \; \langle \phi^{(\alpha)}(b_m) \rangle_{m \in [0, \, n)} = j, \\ 0 & \text{otherwise}, \end{cases} \end{equation*} $$

and $N$ is the number of length-$(n + 1)$ contiguous subsequences in the dynamics of $\pi$, which can be easily sketched while computing the sum in $(5)$.
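The counting behind this closed form can be sketched in Rust for a single abstraction. This is only a sketch under several assumptions that are not in the text: states are indexed $0, \ldots, |S| - 1$, the policy is a slice where `None` marks a state on which $\pi$ is undefined, `phi[s]` is the class of state `s`, and windows that run into an undefined transition are skipped rather than completed.

```rust
use std::collections::HashMap;

/// Relative frequency of each (next class, class history) pair, following the
/// sum over states divided by the number N of complete windows.
///
/// Assumptions (illustrative): `policy[s]` is `Some(pi(s))` or `None` where the
/// policy is undefined, and `phi[s]` is the abstract class of state `s`.
fn ngram_frequencies(
    policy: &[Option<usize>],
    phi: &[usize],
    n: usize,
) -> HashMap<(usize, Vec<usize>), f64> {
    let mut counts: HashMap<(usize, Vec<usize>), u64> = HashMap::new();
    let mut total = 0u64;
    for start in 0..policy.len() {
        // Collect the classes of s, pi(s), ..., pi^{n-1}(s).
        let mut history = Vec::with_capacity(n);
        let mut current = Some(start);
        for _ in 0..n {
            match current {
                Some(s) => {
                    history.push(phi[s]);
                    current = policy[s];
                }
                None => break,
            }
        }
        // Count the window only if the history is complete and pi^n(s) exists.
        if let (true, Some(next)) = (history.len() == n, current) {
            *counts.entry((phi[next], history)).or_insert(0) += 1;
            total += 1;
        }
    }
    counts
        .into_iter()
        .map(|(key, count)| (key, count as f64 / total as f64))
        .collect()
}
```

On a chain of four states $0 \to 1 \to 2 \to 3$ with classes $0, 1, 0, 1$ and $n = 1$, three windows are complete, so the pair "class 1 after class 0" receives frequency $2/3$ and "class 0 after class 1" receives $1/3$.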

Sources

The nature of the policy operator $\pi$ is such that there exists some $s \in S$ without an $s^\prime$ with $\pi(s^\prime) = s$. Here, $s$ is called a source within the dynamics of $\pi$. This constitutes a problem, as the start $s_0$ of the game for which $S$ is a state space is necessarily a source (which may not be unique); therefore, an attempt to find an $n$-length sequence of moves leading up to a state less than $n$ applications of $\pi$ away from a source in its dynamics may fail.

This is important because finding such a sequence is a necessary step in computing the entry $\Pi^{(\alpha)}_{i, j}$ of the model when $i$ is the abstraction of a state that is too close to a source to have a valid $n$-gram history. A solution which does not significantly alter the transition distributions of $\Pi^{(\alpha)}$ is to sample the missing elements of $n$-gram histories from a uniform distribution while computing its entries. If this measure is taken, $N$ can be set to $|S|$ in $(5)$, avoiding the need for sketching proportions.

Sinks

In many traditional definitions of a policy $\pi$, there may exist elements $s^\prime_i$ of $S$ over which $\pi$ is not defined, as they are terminal in the game under representation. These are sinks in the dynamics of $\pi$, and should never be considered as part of a history while computing model parameters.

Inference

When at a state $s \in S$, a human player can consider the set of next possible states $t(s)$ (where the transition function $t : S \to \mathcal{P}(S)$ is set-valued). Optimally, combinatorial optimization would be done across all elements $s^\prime \in t(s)$ under the MLE objective of maximizing the probability that their action is observed across all abstract state space transitions.

While this is possible to an extent due to the simplicity of the abstractions in consideration (which map onto small sets of classes, reducing maximization objectives during MLE), the true value of the model is in the subjective analysis of each $\Pi^{(\alpha)}$. Additionally, quantitative techniques (such as finding the stationary distribution and convergence rate of these matrices) may illustrate interpretable patterns in the dynamics of $\pi$, depending on $\langle \phi^{(\alpha)} \rangle$.
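One such quantitative technique can be sketched with power iteration, which approximates a stationary distribution (a fixed point $x = \Pi x$). The sketch below assumes the first-order case $n = 1$, so that the matrix is square, follows the column convention used earlier (entry $(i, j)$ is the probability of $i$ given $j$), and assumes the chain is such that the iteration converges. The function name and the iteration budget are illustrative.

```rust
/// Approximates a stationary distribution of a column-stochastic matrix by
/// repeatedly applying it to the uniform distribution.
///
/// Assumption: `pi[i][j]` = P[i observed next | j observed now], so every
/// column sums to one, and the chain converges under power iteration.
fn stationary(pi: &[Vec<f64>], iterations: usize) -> Vec<f64> {
    let k = pi.len();
    let mut x = vec![1.0 / k as f64; k];
    for _ in 0..iterations {
        let mut next = vec![0.0; k];
        for i in 0..k {
            for j in 0..k {
                next[i] += pi[i][j] * x[j];
            }
        }
        x = next;
    }
    x
}
```

For the two-class matrix with columns $(0.9, 0.1)$ and $(0.5, 0.5)$, the iteration settles at $(5/6, 1/6)$. The speed at which successive iterates stop changing gives a rough picture of the convergence rate.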

Remarks

Establishing an approximation of optimal policy in the form of a Markov process provides an interpretable functional representation that is able to work with intuitive abstractions. Thus, it is a valid representation of a praxis, and the above methods effectively ’translate’ from policies of arbitrary form.

Explorations

The following are left as potential avenues of analysis relating to the model family.

Smoothing techniques, and an analysis of their benefit in the context of optimal policy.
Non-interpretability of $n$-gram model successors; in particular transformer attention.
Skip-gram models as an extension of this family.

Credits

Thank you to my good friend Humberto Gutierrez for spending late nights discussing the concept of policy abstraction with me, and helping me organize many ideas about policies over continuous abstractions.

A stationary model’s probability assignments are invariant with respect to shifts in the time index. ↩︎
Which is a way of saying that rules of thumb are not globally applicable. ↩︎

Representation Concepts in Game-Theoretic Systems

Sat, 20 Apr 2024 00:00:00 +0000

Abstract

I gave an introductory talk about how computer systems represent, compute, and store noteworthy attributes about a particular class of games. This was part of Sprouts ‘24, an undergraduate-oriented conference primarily dedicated to combinatorial game theory.

Here, I share the materials I used during my presentation along with a longer-form (but very different) exploration of the topic I covered. Generically, it can be useful for all problems where one must run a domain-specific algorithm on a graph that is not materialized in memory, but can be traversed in linear time from a starting node and a set of functions that derive adjacent edges and nodes from existing ones (a so-called implicit graph).

As a concrete case of this abstract class of problems, I present concepts that support the process of finding a Nash Equilibrium for a specific subclass of games through cousins of the minimax algorithm. However, these concepts are also applicable to other such problems (e.g., the membership problem¹ for decidable subclasses of context-free grammars).

Materials

The slides I used during my talk can be found below. Anyone can use them without my permission.

/ [pdf]

Errata

Here are the mistakes I have found in the slides:

In slide 10, the first bullet point should also restrict the set of games under consideration to be extensive-form and non-collaborative, as implied by the subsequent definition in slides 11-15.
In slide 20, the formulation $\langle N, S, p, u \rangle$ should also include a transition function $t : S \to \mathcal{P}(S)$, where if $s$ is a state corresponding to history $h$, then the history $h'$ corresponding to $s'$ should be the same as $h$ with an additional action appended for all $s' \in t(s)$.
In slide 28, the partition should not necessarily minimize the sum of the conductance of all cuts that produce the partition. Instead, the ideal partition would be a solution to the balanced partition problem, where optimal parameters are determined from hardware-related constraints (such as the cost of inter-process communication). The goal is to balance parallelism with its own overhead.

Background

In the interest of accessibility, I will briefly cover useful basics in game theory and computer science that seldom find their way into students’ syllabi or are otherwise worth refreshing. If you think you can safely skip this, you are probably right.

Game theory

The generic setup of a game is some amount of “players” taking actions according to their own interests or preferences, potentially affecting other players in the process. From the point of view of a single player, a game is an optimization problem that seeks to find an “optimal strategy” from the information available to them. This is an assumption known as “individual rationality” that pervades most of game theory.

But from a global point of view, there is no obvious question to ask about a game. This is why games are not problems; they are situations that we can ask different questions about. But per se, games are not aching to be solved. To ask specific questions with some hope of rigor, a lot of effort has been placed into defining classes of games that possess different characteristics.

Taxonomy of games

There are a lot of classes of games. They are separated by the mathematical properties of their setting and participants, among other factors.

There is no global dictionary or atlas for game classes, as interpretations can become nuanced to the extent of opinion. Hence, anytime someone makes a statement in game theory they must specify the class of games it targets. In this article, we will target games that are:

Perfect-information. Here, all players know everything in the universe that could possibly help them make or avoid any decision, except the decisions that other players will make. Most forms of Poker are not perfect-information, as the exact location of the cards is unknown.
Deterministic. Here, if all players choose a strategy and never change it, there is only one possible outcome for the game. Chess is deterministic, because if players make the exact same moves in two different games they are guaranteed to achieve the same result.
Sequential. More intuitive superset of extensive-form games. Here, all actions that players can take are indivisible. Soccer is not sequential, because players’ movements constitute their actions and it is possible to divide any movement into a shorter one.
Extensive-form. The adjective “extensive-form” refers to games that can be written in extensive form, which is a kind of mathematical template. Here, games are defined in terms of the histories of actions that could be observed by the players, and the preferences each player has among them.

Solution concepts

There are some questions that are so broadly-applicable in terms of the classes of games they can target that they achieve the special status of a solution concept. This is a term that refers to a characteristic that can be observed in a useful set of game classes.

A very human thing to ask about broad categories of games is who will win. As it turns out, there is no real answer to this question most of the time, because it can come across obstacles like chance, incomplete information, and lack of clarity around the word “win.” An equally important yet more applicable question, however, is what strategy each player should take to achieve the best possible result for themselves.

In many cases, it is necessary to tack on additional nuances to this question to be able to answer it. One such refinement of the question (which revolutionized economics) asks which strategy each player should adopt so that no single player could change their own and benefit from it. A pairing of players to strategies is known as a strategy profile, and those that satisfy the above property are known as Nash Equilibria.

The strategies and strategy profiles that allow players to act probabilistically are called mixed. Mixed strategies are tantamount to sampling probability distributions of pure strategies, which themselves specify deterministic actions. In 1950, John Nash defined the concept of a Nash Equilibrium (NE), additionally proving that there exists such a mixed strategy profile in all finite games ².

Note

Nash Equilibrium is an overloaded term, as it refers to both a solution concept and a strategy profile that satisfies the solution concept. You will need to tell which is which from context.

Computer science

The possibility of players taking actions simultaneously (among other things) can make the existence of a pure-strategy NE impossible. But if sequential play is assumed, it is straightforward to show that there always exists a pure-strategy NE ³. As mentioned above, this article benefits from this assumption.

Because finding a NE is such a popular desire, most of the discussion here will focus on the procedure of finding a pure-strategy NE in the class of games we specified previously. This is a costly process, which is why it calls for techniques that help minimize use of computational resources. However, you will notice that the things that make this an inherently costly process for some games are actually factors that have nothing to do with game theory.

Hence, it is possible that the concepts I will discuss are applicable beyond the problem of finding a pure-strategy NE. To elaborate, a maximally generic yet snobby version of this article would perhaps be titled Techniques for Implementing Solutions to Search Problems on Implicit Graphs. The meanings of these terms are:

Implementing. Bring into the real world.
Solutions. In this context, algorithms that solve problems.
Search Problem. A problem that asks you to find something.
Implicit Graph. Graph representation in terms of an initial element and a collection of functions which allow you to perform a traversal. Useful but unconventional term.

In particular, the section titled “Representation” will explain the link between the definition of an extensive game and the representation of its structure as an implicit graph, and will introduce a trick that can be used to end up with significantly simpler traversals. This trick is also applicable to problem domains other than games, but I only present it with regard to games because its implementation depends on the underlying problem. Everything else is applicable as soon as you have an implicit graph in your hands.

While explaining these concepts, it will be useful to have access to ideas in complexity theory. Below are some domain-specific remarks and definitions introducing language that will be of relevance later.

Complexity theory

In computer science, complexity is a measure⁴ of the minimal number of elementary operations that must be composed to complete a target operation. The relevant elementary operations correspond to the kind of complexity in question:

Time complexity. Here, elementary operations are other operations whose time is assumed to be a known constant. Elementary operations should always be specified.
Space complexity. Here, the elementary operation is setting a bit. A more formal definition of space complexity depends on the model of computation.

For the sake of expressibility, complexity is usually expressed in terms of asymptotic characteristics. In particular, symbolisms like Big-O notation help compare the asymptotic complexity of different algorithms.

Computational problems can be put into complexity classes. For example, the problem of finding a mixed-strategy NE, known as $\text{N}\small{\text{ASH}}$, is in the time complexity class $\text{FNP}$, which has led many people to look for other solution concepts that are more computationally favorable⁵. The equivalent problem for the class of games we are considering, however, is in $\text{FP}$. In other words, $\text{N}\small{\text{ASH}}$ can be solved efficiently on this restricted domain of games⁶.

Representation

One can find solutions to instances of many search and decision problems over games without incurring large computational expenses. This is possible by deriving a logical analysis on a case-by-case basis, using the mathematical properties of the components of the game in question.

Definition

An extensive-form game is a 4-tuple $\langle N, H, p, (\succsim_i) \rangle$, where:

$N$ is a set of players, usually $\{1, \; \ldots, \; n\}$ for simplicity.
$H$ is a set of strings of actions where $h \in H \implies h' \in H$, where $h'$ is the string resulting from removing the last action from the string $h$.
$p : H \to N$ assigns a player to each non-terminal history.
The player $i \in N$ has a preference relation $\succsim_i$ on the set $Z \subseteq H$ of terminal histories (which is reflexive and transitive).
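The closure condition on $H$ can be made concrete with a small Rust check. It is a sketch that assumes each action is encoded as a single character and each history as a `String`, with the empty string standing for the empty history; neither encoding comes from the definition itself.

```rust
use std::collections::HashSet;

/// Checks that removing the last action from any history in `histories`
/// yields another history in the set.
///
/// Assumption: one character per action, empty string = empty history.
fn is_prefix_closed(histories: &HashSet<String>) -> bool {
    histories.iter().all(|h| {
        let mut chars = h.chars();
        match chars.next_back() {
            // The empty history has no last action to remove.
            None => true,
            // `as_str` is the history without its last action.
            Some(_) => histories.contains(chars.as_str()),
        }
    })
}
```

A set such as $\{ \epsilon, a, b, ab, ba \}$ passes this check, while $\{ \epsilon, ab \}$ does not, because the history $a$ is missing.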

For a game provided in the extensive form, its instantiation of the above abstractions can be logically leveraged to prove statements about the game. But given that it can be defined arbitrarily, sometimes it is impossible to achieve this solely through formal rewriting.

In some of these cases it is possible to simply expand the component definitions into their explicit forms in order to later compute a solution using these expansions. But of course, doing this can also be extremely impractical. For example, it is common for $H$ to be of very high cardinality.

Example

Consider a Rubik’s Cube that is initialized to specific starting colors, which can be set into the extensive form via the following instantiations:

$N = \{1\}$.
$H = \{ h \; | \; h \text{ is a sequence of 90° rotations of a plane in the cube} \}$.
$p : h \mapsto 1$ for all $h \in H$.
$h_i \succsim_1 h_j$ for all $h_i$ that leave the cube solved.

Here, the set $H$ is countably infinite, so it is impossible to expand it to its elements in order to later compute a property of this puzzle. One such property is the smallest number of actions that can solve the cube⁷.

Rulesets

Before introducing tools to deal with this, there is another representation that is common when dealing with abstract strategy games. A ruleset, as it is known in combinatorial game theory, is the most familiar kind of representation of a game.

In particular, a ruleset specifies exactly what actions are permitted to whom and when. It also explains the utility obtained by each participating player when no further actions are available. These characteristics are expressed in terms of the mutable state of a proxy (e.g., a board with pieces).

Example

The game 10-to-0-by-1-or-2 is generated by the following ruleset:

There is a collection of 10 items.
2 players take alternating turns removing either 1 or 2 items from the collection.
Player 1 starts. The player who takes the last item from the collection wins.

In this game, the collection of items that is mutated by players’ actions is a proxy that allows players to judge what they are allowed to do and to determine who wins the game.

This way, the information contained in the state of a proxy is a representation of the history of actions that produced it. These representations are called game states. This makes rulesets implicit graphs over the set of game states (denoted $S$). While they do not act directly over a set of histories, rulesets include enough information to generate an equivalent extensive-form representation. The resulting structure of generated action histories is hence intimately tied to the nature of the proxy.

Definition

Given a directed graph $G = \langle S, E \rangle$, the corresponding implicit graph $G^I$ is a 3-tuple $\langle S, t, s_0 \rangle$, where:

$S$ is a set of states.
$s_0 \in S$ is a starting element.
$t : S \to \mathcal{P}(S)$ is a transition dynamics function with $(s_i, s_j) \in E \iff s_j \in t(s_i)$.

A proof of the bijection between implicit and directed graphs is omitted.
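To give a feel for one direction of this correspondence, the Rust sketch below recovers the explicit edge set $E$ from $\langle S, t, s_0 \rangle$ by a breadth-first traversal. Note the assumption baked into it: only edges reachable from $s_0$ are recovered, and the generic signature (`materialize`, a closure for $t$) is illustrative rather than the interface developed later in this article.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Collects every edge (s_i, s_j) with s_j in t(s_i) that is reachable
/// from `start`, using a breadth-first traversal.
fn materialize<S, T, I>(start: S, transition: T) -> BTreeSet<(S, S)>
where
    S: Ord + Clone,
    T: Fn(&S) -> I,
    I: IntoIterator<Item = S>,
{
    let mut seen = BTreeSet::new();
    let mut edges = BTreeSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start.clone());
    queue.push_back(start);
    while let Some(state) = queue.pop_front() {
        for next in transition(&state) {
            edges.insert((state.clone(), next.clone()));
            // `insert` returns true only the first time a state is seen.
            if seen.insert(next.clone()) {
                queue.push_back(next);
            }
        }
    }
    edges
}
```

Materializing the graph this way costs memory proportional to the number of edges, which is exactly what an implicit representation lets an algorithm avoid when it only needs to traverse.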

Furthermore, because many actions could be globally or locally commutative with respect to proxies’ mutable state, sets of histories in the extensive form are usually of much higher cardinality than sets of possible states for rulesets’ proxies. This is of course computationally favorable, as finite rulesets (whose proxies have a finite number of possible states) can generate even infinite extensive forms.

Example

In the diagram to the right, let $S_0$, $S_1$, $S_2$, and $S_3$ be allowed states under a ruleset, and $A = \{a, b\}$ allowed actions. We have that:

The set of histories is $H = \{ \epsilon, a, b, ab, ba \}$.
The set of states is $S = \{ S_0, S_1, S_2, S_3 \}$.

As you can see, $|H| > |S|$ even though $|A| < |S|$. The difference in size between $H$ and $S$ scales rapidly in the general case.

Abstraction

In real life, games come mostly in the form of rulesets. We are usually aware of the environment they transpire in (the so-called intuitive proxy) and the laws that describe how it can change as a function of players’ actions. Because of this, much of applied theory is centered around semantics that involve mutable state. For example, a Markov Decision Process (MDP) from reinforcement learning strongly reflects the nature of a ruleset.

However, all of these constructs have a latent yet equivalent extensive form representation. This will motivate the technique of state abstraction⁸: The necessity for action histories to be directly prefixed implies that they have a directed tree structure, and because all directed graphs have an equivalent implicit graph representation, the relationship between action histories and ruleset states maps an implicit graph to another one that retains important information about the original.

Definition

An abstraction map $a : S_{\text{pre}} \to S_{\text{post}}$ maps the states in an implicit graph $G^I_{\text{pre}}$ to the states in another implicit graph $G^I_{\text{post}}$ with the structure-preserving condition

$$ s_j \in t_{\text{pre}}(s_i) \iff a(s_j) \in t_{\text{post}}(a(s_i)). $$

As shown in the last example, the set of action histories is usually of much greater cardinality than its corresponding set of ruleset states, with the difference being possibly infinite. Altogether, this is a hint that, much like the implicit jump from action histories to ruleset states, it is possible to jump to more abstract state sets of smaller cardinality but equal representational power.

Example

An abstraction mapping between the action histories (left) and states (right) from the previous example. Notice how action histories always imply a tree, and the process of abstraction folds it into a potentially cyclic graph.

Therefore, much like imposing a ruleset proxy causes many of the action histories to fold into equivalent states, we can determine for a problem $\text{P}$ which states would be $\text{P}$-equivalent under the ruleset’s laws. This way, we can create new abstractions $a_i : S_{i - 1} \to S_i$ that can be composed to obtain a reduced state space that is equivalent to the original under $\text{P}$ for a significant computational upside.

Example

Consider the ruleset underlying the game of Tic-Tac-Toe. Denote $B_i$ a board state. Then, any algorithm that computes a NE for this game through its ruleset is invariant to the abstraction $a$, where

$$ a(B_i) = a(B_j) \iff B_i \text{ is symmetrical to } B_j. $$

Further, the number of board states this algorithm will need to visit is reduced by a factor $>5$.
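One way to realize such an abstraction is to map every board to a canonical representative of its symmetry class. The Rust sketch below does this for the eight symmetries of the square (four rotations, each optionally mirrored). The board encoding (`[u8; 9]` in row-major order, with 0 for empty and 1 or 2 for the players' marks) and the choice of the lexicographically smallest board as representative are assumptions made for illustration.

```rust
/// Row-major 3x3 board; 0 = empty, 1 and 2 = the two players' marks (assumed).
type Board = [u8; 9];

/// Rotates the board by 90 degrees clockwise.
fn rotate(b: &Board) -> Board {
    let mut out = [0u8; 9];
    for r in 0..3 {
        for c in 0..3 {
            out[3 * r + c] = b[3 * (2 - c) + r];
        }
    }
    out
}

/// Mirrors the board left to right.
fn reflect(b: &Board) -> Board {
    let mut out = [0u8; 9];
    for r in 0..3 {
        for c in 0..3 {
            out[3 * r + c] = b[3 * r + (2 - c)];
        }
    }
    out
}

/// Abstraction map: the smallest board among all 8 symmetric variants.
fn canonical(b: &Board) -> Board {
    let mut best = *b;
    let mut current = *b;
    for _ in 0..4 {
        current = rotate(&current);
        best = best.min(current).min(reflect(&current));
    }
    best
}
```

Two boards are symmetrical exactly when `canonical` returns the same value for both. As a small check, the nine possible opening moves collapse into three classes (corner, edge, and center).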

Design

So far, discussion has brought us to implicit graphs and abstractions through the lens of game theory. The objective of this section will be to motivate these concepts beyond game theory, while supplying references to concrete programming ideas.

To do this, we will cover a representation of an implicit graph in a real programming language, apply it to a new problem domain, and make improvements that bring real-world utility. Examples will still be given in terms of games, as they also happen to fit under our new focus. In doing so, we will design a solution to a broad problem using our new toolset items.

Interface items

There are a number of considerations to make when encoding an implicit graph, whose importance will vary depending on the object being represented. This section will introduce only the example of graphs of subproblems in the context of dynamic programming (DP), and will iterate on the following Rust interface to eventually allow their solutions to be found in parallel through a special kind of abstraction.

trait ImplicitGraph<C>
where
    C: IntoIterator<Item = Self::State>,
{
    type State;
    fn start() -> Self::State;
    fn transition(state: Self::State) -> C;
}

The elements of this interface declaration relate to the implicit graph $G^I = \langle S, t, s_0 \rangle$ as follows:

The associated type State is the type of the elements in $S$.
The generic parameter C is the type of the elements in $\mathcal{P}(S)$.
transition is the template of $t$.
start simply returns $s_0$.

Example

Implementation of the game 10-to-0-by-1-or-2 from the section on Rulesets as an implicit graph.

/// The game `10-to-0-by-1-or-2`.
struct ZeroBy;

impl ImplicitGraph<Vec<(u32, bool)>> for ZeroBy {
    // Tuple of (items, turn).
    type State = (u32, bool);

    // Returns (10 items left, player 0's turn).
    fn start() -> (u32, bool) {
        (10, false)
    }

    // Returns states with one and two less items on the opposing player's turn.
    fn transition(state: (u32, bool)) -> Vec<(u32, bool)> {
        let (items, turn) = state;
        let mut next = Vec::new();
        if items == 1 {
            next.push((0, !turn));
        } else if items > 1 {
            next.push((items - 1, !turn));
            next.push((items - 2, !turn));
        }
        next
    }
}

Parallel DP

A popular characterization of DP is to establish dependency relations on sets of subproblems, defining a directed acyclic graph (DAG) for any properly formulated subproblem definition. A natural link to implicit graphs exists through their bijection with general graphs.

Having established that a DP problem can be characterized as an implicit graph of subproblems, it is also worth mentioning that most unorganized solution implementations (i.e., that do not organize solutions or information about subproblems in a tensor) make use of stack-like data structures to aid traversals of subproblems in postorder.

Example

DP algorithm to compute who will win a game of 10-to-0-by-1-or-2, with the below subproblem relation⁹:

$$ W(s) = \max_{s' \in \, t(s)} \min_{s'' \in \, t(s')} W(s''). $$

Here, $W : S \to \{ 0, \, 1 \} $ maps state information (including the number of items remaining and player turn) to whether the player whose turn it is at $s$ would win under optimal play, with $t$ being the transition function of the implicit graph over this game’s states. This uses the implementation from the previous example.

procedure:
    stack ← empty stack
    visited ← empty set
    solution ← empty map

    stack.push((ZeroBy::start(), false))
    while stack is not empty:
        (current, expanded) ← stack.pop()
        if expanded:
            solution[current] = W(current)
        else if current is not in visited:
            visited.add(current)
            stack.push((current, true))
            for each state in ZeroBy::transition(current):
                if state is not in visited:
                    stack.push((state, false))

    return solution[ZeroBy::start()]

Each state is pushed back with a flag before its successors are pushed, so it is solved only after all of them have been solved (a postorder). Note the use of $W$ when a flagged state is popped, which reads the solutions already stored for its successors. Also note that the subproblem relation does not necessarily generalize to other games and, for the sake of brevity, is not defined for base cases (where $t(s) = \varnothing$).

Attempts to parallelize this setup must first identify a method to partition the subproblem graph in a way that optimizes the tradeoff between parallelism and its own overhead. Of course, this depends significantly on the specific resources that will be used to execute the resulting program.

I will introduce one way of doing this that follows naturally from the use of the implicit graph interface. Concretely, a carefully chosen abstraction $\pi : S \to \mathbb{N}$ that connects the graph of subproblems to a DAG of enumerated sets of states will provide a parallelization scheme.

trait ImplicitGraph<C>
where
    C: IntoIterator<Item = Self::State>,
{
    type State;
    fn start() -> Self::State;
    fn partition(state: Self::State) -> u64; // <-- NEW
    fn transition(state: Self::State) -> C;
}

Here, partition is the template of $\pi$. The big idea is that during a traversal, we can observe a change in the value of *::partition(current), where current is the current state in the traversal. This way, we can build a graph of the outputs of this function based on their adjacency in the subproblem graph. Finally, we analyze the resulting graph to find sets of states that can be traversed simultaneously.

Example

On the left, a graph over a set of states $S = \{ s_0, \, \ldots, \, s_{11} \}$. On the right, a graph over a set $\Pi \subset \mathbb{N}$ with four elements. They are related by an abstraction $\pi : S \to \Pi$ that is special in that the resulting graph over the elements of $\Pi$ is acyclic. Hence, the fibers $\pi^{-1}(\{\pi_i\})$ of certain distinct elements $(\pi_i)$ of $\Pi$ could possibly be traversed in parallel. An example of groupings of elements in $\Pi$ whose fibers under $\pi$ could be traversed in parallel is provided in dotted boxes on the graph of $\Pi$.

Finding a suitable $\pi$ depends on the chosen state (subproblem) representation and is of course highly problem-specific. However, a general strategy is to ensure that $\pi$ outputs a different label if and only if an irreversible change is made to some form of mutable state. As a high-level criterion, this helps construct an abstraction that assuredly maps onto a DAG.
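As an illustration of how the graph over partition labels might be built during a traversal, consider the Rust sketch below. It is written against plain closures instead of the trait to stay short, and the example at the bottom assumes that the number of remaining items is a reasonable label for 10-to-0-by-1-or-2 (removing items is irreversible). Both the function names and that choice of $\pi$ are assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Builds a graph over partition labels: there is an edge from label p to
/// label q whenever some visited state labeled p has a successor labeled q,
/// with p and q distinct.
fn partition_graph<S, T, P>(
    start: S,
    transition: T,
    partition: P,
) -> BTreeMap<u64, BTreeSet<u64>>
where
    S: Ord + Clone,
    T: Fn(&S) -> Vec<S>,
    P: Fn(&S) -> u64,
{
    let mut graph: BTreeMap<u64, BTreeSet<u64>> = BTreeMap::new();
    let mut visited = BTreeSet::new();
    let mut stack = vec![start];
    while let Some(state) = stack.pop() {
        if !visited.insert(state.clone()) {
            continue;
        }
        let label = partition(&state);
        // Make sure labels without outgoing edges still appear in the graph.
        graph.entry(label).or_default();
        for next in transition(&state) {
            let next_label = partition(&next);
            if next_label != label {
                graph.entry(label).or_default().insert(next_label);
            }
            stack.push(next);
        }
    }
    graph
}

/// Example: 10-to-0-by-1-or-2 labeled by the number of items left (assumed).
fn zero_by_partition_graph() -> BTreeMap<u64, BTreeSet<u64>> {
    partition_graph(
        (10u32, false),
        |&(items, turn): &(u32, bool)| {
            (1..=2u32)
                .filter(|k| *k <= items)
                .map(|k| (items - k, !turn))
                .collect::<Vec<_>>()
        },
        |&(items, _): &(u32, bool)| u64::from(items),
    )
}
```

With this labeling, every transition changes the label, so the resulting graph has eleven labels and is acyclic. It is also nearly a chain, which means this particular game offers little parallelism under this $\pi$. Richer games tend to produce wider label graphs.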

Having found such an abstraction, the next perplexity of parallelizing a postorder traversal of subproblems is managing efficient and clear use of shared data structures. Here, a zoo of approaches with varyingly personable tradeoffs is available. I will introduce only one, which involves an additional interface item in the form of a function $t' : S \to \mathcal{P}(S)$ called retrograde.

trait ImplicitGraph<C>
where
    C: IntoIterator<Item = Self::State>,
{
    type State;
    fn start() -> Self::State;
    fn partition(state: Self::State) -> u64;
    fn transition(state: Self::State) -> C;
    fn retrograde(state: Self::State) -> C; // <-- NEW
}

In an ideal world, retrograde is the equivalent of transition for the transpose of the graph being represented. Unfortunately, it is not generally tractable to have a perfect implementation of retrograde without first materializing the entire graph (defeating the purpose of an implicit graph representation). Thus, we only make sure that retrograde returns a superset of what transition would return for the transpose of the graph, which is easy to ensure in many useful cases.
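To make the superset idea concrete, here is a possible retrograde for 10-to-0-by-1-or-2, written as free functions in Rust for brevity. It is a sketch, not the only reasonable choice: it simply undoes the removal of one or two items without asking whether the resulting state can actually occur in play.

```rust
/// (items left, whose turn it is), as in the earlier example.
type State = (u32, bool);

/// Forward moves: remove one or two items and pass the turn.
fn transition(state: State) -> Vec<State> {
    let (items, turn) = state;
    (1..=2u32)
        .filter(|k| *k <= items)
        .map(|k| (items - k, !turn))
        .collect()
}

/// Candidate predecessors: put one or two items back (up to 10) and hand the
/// turn back. This may include states that are unreachable from the start.
fn retrograde(state: State) -> Vec<State> {
    let (items, turn) = state;
    (1..=2u32)
        .filter(|k| items + k <= 10)
        .map(|k| (items + k, !turn))
        .collect()
}
```

For example, `retrograde((8, true))` returns both `(9, false)` and `(10, false)`. Only the latter is a real predecessor, since nine items are only ever left on the second player's turn. This is why a traversal built on retrograde needs an existence check against the set of states discovered in a forward pass.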

With $\pi$ and $t'$ in our hands, we can provision the following procedure for parallelizing the execution of an unorganized dynamic programming algorithm over an implicit graph $G^I$ of subproblems:

Procedure

1. Traverse $G^I$ to obtain the set of subproblems $S$ in $\mathcal{\Theta}(|G^I|)$ time and $\mathcal{\Theta}(|S|)$ space.
2. During (1), track the subset of subproblems $S_{base} = \{ s \; | \; t(s) = \varnothing \}$ at no additional cost.
3. During (1), construct a graph $G_\Pi$ over a set of partition labels $\Pi$ using $\pi$, at a $\pi$-dependent cost.
4. Use $G_\Pi$ to generate a plan¹⁰ of labeled parallel tasks in $\mathcal{\Theta}(|G_\Pi|)$ time and $\mathcal{\Theta}(|\Pi|)$ space.
5. Delegate tasks, starting exploration from popped elements of $S_{base}$ with the task’s label.
6. Use $t'$ for backward intra-partition traversal (using $S$ for existence checks) in $t'$-dependent time.
7. When a change in $\pi(s)$ is observed on the current state $s$, add $s$ to $S_{base}$, leaving it unexplored.
8. When a partition is completely explored, finish its task and free the parallel unit.
9. Repeat from (5) on new elements of $S_{base}$ until there are no tasks remaining.
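The planning step of this procedure (generating a plan from $G_\Pi$) can be approximated by grouping partition labels into layers, where every label in a layer depends only on labels in earlier layers. The Rust sketch below is a deliberately simple stand-in for the plan data structure described in the footnote: it returns all layers up front instead of dispensing tasks on demand, and it favors brevity over the linear-time bound quoted above. The map-of-sets representation of $G_\Pi$ is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Groups partition labels into layers. Labels in the same layer have all of
/// their successors in earlier layers, so their fibers can be explored in
/// parallel once those earlier layers are finished.
///
/// Assumption: `graph` maps each label to the labels it has edges to, and
/// every label appears as a key. Labels on a cycle are left out.
fn plan_layers(graph: &BTreeMap<u64, BTreeSet<u64>>) -> Vec<Vec<u64>> {
    let mut done: BTreeSet<u64> = BTreeSet::new();
    let mut layers = Vec::new();
    while done.len() < graph.len() {
        let layer: Vec<u64> = graph
            .iter()
            .filter(|(label, successors)| {
                !done.contains(*label) && successors.iter().all(|s| done.contains(s))
            })
            .map(|(label, _)| *label)
            .collect();
        if layer.is_empty() {
            // Remaining labels depend on each other (or on unknown labels).
            break;
        }
        done.extend(layer.iter().copied());
        layers.push(layer);
    }
    layers
}
```

For a diamond-shaped label graph with edges $3 \to 1$, $3 \to 2$, $1 \to 0$, and $2 \to 0$, the layers are $[0]$, $[1, 2]$, and $[3]$, so the fibers of labels 1 and 2 are candidates for simultaneous exploration.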

These general steps skip some details, but they present an arbitrarily parallel stack-free traversal that can be implemented over a single shared data structure whose size scales in the order of $\mathcal{\Theta}(|S|)$ (ignoring structures related to partitions, whose size is assumed to be negligible). This kind of map-like thread-safe functionality is available in many database implementations which automatically bring the added benefit of disk usage, making this method applicable to “bigger” problems.

Meta-content

The section titled “Representation” got us to stumble across the new concepts of implicit graphs and abstractions by looking at different forms for game representations. The following section extrapolated these ideas to the domain of dynamic programming, and showed how it is possible to incorporate them into the design of solutions to real-world problems.

Something interesting is that games made their way into the second section, despite being decidedly out of scope at that point. In a dying hope of getting this article back on track, I will point out that the particular example of parallelizing DP algorithms was conveniently chosen because it is used to solve bigger games faster than was previously possible (through DP algorithms that consume representations of them).

This way, I can say that this whole article was in fact about game-theoretic systems. But we both know that it was really about implicit graphs and abstractions. Maybe, if we squint our eyes, it can be about both topics. Either way, I hope the lack of clarity was more stimulating than it was confusing.

The problem of deciding whether or not a string is in the language of a context-free grammar. ↩︎
For his doctoral dissertation, Non-Cooperative Games. ↩︎
This fact is known as Zermelo’s theorem. ↩︎
In both the mathematical and informal sense. ↩︎
For example, Christos Papadimitriou on replicator dynamics. ↩︎
$\text{N}\small{\text{ASH}}$ asks to find a mixed-strategy NE. A pure-strategy NE is a case of a mixed-strategy NE. Zermelo’s theorem shows a pure-strategy NE always exists for the class of games in question. Then, backward induction algorithms can find one in linear time for finite representations of games. ↩︎
It is possible to obtain this number for a specific starting configuration of a Rubik’s Cube, but no one knows the length of the longest minimal sequence of moves necessary to solve it across all starting configurations. This is somewhat dramatically known as God’s Number. ↩︎
Term generalized from its use in reinforcement learning. ↩︎
This subproblem relation monomorphizes the generic formulation of the minimax algorithm. ↩︎
A parallelization plan is a data structure that can dispense information on which task must be worked on next when a parallel unit becomes available for work. It may also communicate the need to wait until another unit completes its work. ↩︎