mohitd’s Blog: I sometimes forget how things work so I write them down here
Language Modeling - Part 1: n-gram Models (2023-07-14)
<p>Over the past several months, Large Language Models (LLMs) such as ChatGPT and GPT-4, along with applications built on them like AutoGPT, have flooded the Internet with all kinds of different applications and use cases. These are regarded as language models that can remember context as well as understand their own capabilities. They’re often treated as black boxes where the majority of the implementation details are left to the researchers. However, having some understanding of how they work can help people more clearly and concisely instruct the model to get the desired output.</p>
<p>Rather than jumping straight to how LLMs work, I think it’s helpful to cover some prerequisite knowledge to help us demystify LLMs. In this post, we’ll go back in time before neural networks and talk about language, language modeling, and n-gram language models since they’re simple to understand and we can do an example by hand.</p>
<h1 id="language">Language</h1>
<p>Before we start with n-gram models, we need to understand the kind of data we’re working with. If we were going to delve into convolutional neural networks (CNNs), we’d start our discussion with images and image data. Since we’re talking about language modeling, let’s talk about language so we can better motivate why language modeling is very hard. One definition of <strong>language</strong> that’s particularly relevant to language modeling is a <em>structured system of communication with a grammar and vocabulary</em> (note this applies to spoken, written, and sign language). Given you’re reading this post in the English language, you’re probably already familiar with vocabulary and grammar, so let me present to you a sentence.</p>
<blockquote>
<p>The quick brown fox jumps over the lazy dog.</p>
</blockquote>
<p>You might recognize this sentence as being one that uses each letter of the English/Latin alphabet at least once. Immediately we see the words belonging to the vocabulary and their part-of-speech: nouns like “fox” and “dog”; adjectives like “quick”, “brown”, “lazy”; articles like “the”; verbs like “jumps”; and prepositions like “over”.</p>
<p><strong>Grammar</strong> is what dictates the ordering of the words in the vocabulary: the subject “fox” comes before the verb “jumps” and the direct object “dog”. This ordering depends on the language, however. For example, if I translated the above sentence into Japanese, it would read: 素早い茶色のキツネが怠惰な犬を飛び越えます。A literal translation would go like “Quick brown fox lazy dog jumps over”. Notice how the verb comes at the end rather than between the subject and direct object.</p>
<p>These problems help illustrate why we can’t simply have a model that performs a one-to-one mapping when we try to model languages. We might end up with more words, e.g., if the target language uses particle words, or fewer words, e.g., if the target language doesn’t have articles. Even if we did have the same number of words, the ordering might change. For example, in English, we’d say “red car” but in Spanish we’d say “carro rojo”, which literally translates to “car red”: the adjective comes after the noun it describes.</p>
<p>To summarize, language is very difficult! Even for humans! So it’s going to be a challenge for computers too.</p>
<h1 id="applications-of-language-modeling">Applications of Language Modeling</h1>
<p>With that little aside on languages, before we formally define language modeling, let’s look at a few applications that use some kind of language modeling under-the-hood.</p>
<p><b>Sentiment Analysis</b>. When reading an Amazon review, as humans, we can tell if they’re positive or negative. We want to have a language model that can do the same kind of thing. Given a sequence of text, we want to see if the sentiment is good or bad. Cases like “It’s hard not to hate this movie” are particularly challenging and need to be handled correctly. This particular application of language modeling is used in “Voice of the Customer” style analysis to gauge perceptions about a company or their products.</p>
<p><b>Automatic Speech Recognition</b>. Language modeling can be useful for speech recognition by being able to correctly model sentences, especially for words that sound the same but are written differently like “tear” and “tier”.</p>
<p><b>Neural Machine Translation</b>. Google Translate is a great example of this! If we have language models of different languages, implicitly or explicitly, we can translate between the languages that they model!</p>
<p><b>Text Generation</b>. This is what ChatGPT has grown famous for: generating text! This application of language modeling can be used for question answering, code generation, summarization, and a lot more applications.</p>
<h1 id="language-modeling">Language Modeling</h1>
<p>Now that we’ve seen a few applications, what do all of these have in common? It seems like one point of commonality is that we want to understand and analyze text against the trained corpus to ensure that we’re consistent with it. In other words, if our model was trained on a dataset of English sentences, we don’t want it generating grammatically incorrect sentences. That is, we want to ensure that the outputs “conform” to the dataset.</p>
<p>One way to measure this is to compute a probability of “belonging”. For a given input sequence, if the probability is high, then we expect that sequence to be close to what we’ve seen in the dataset. If that probability is low, then that sequence is likely something that doesn’t make sense in the dataset. For example, a good language model would score something like $p(\texttt{The quick brown fox jumps over the lazy dog})$ high and something like $p(\texttt{The fox brown jumped dog laziness over lazy})$ low because the former has proper grammar and uses known words in the vocabulary.</p>
<p>This is what a language model does: given an input sequence $x_1,\cdots,x_N$, it assigns a probability $p(x_1,\cdots,x_N)$ that represents how likely it is to appear in the dataset. That seems a little strange given we’ve just discussed the above applications. What does something like generating text have to do with assigning probabilities to sequences? Well we want the generated text to match well with the dataset, don’t we? In other words, we don’t want text with poor grammar or broken sentences. This also explains why those phenomenal LLMs are trained on billions of examples: they need diversity in order to assign high probabilities to sentences that encode facts and data of the dataset.</p>
<p>So how do we actually compute this probability? Well the most basic definition of probability is “number of events that happened” / “number of all possible events” so we can try to do the same thing with this sequence of words.</p>
\[p(w_1,\dots, w_N)=\displaystyle\frac{C(w_1,\dots, w_N)}{\displaystyle\sum_{w_1,\dots,w_N} C(w_1,\dots, w_N)}\]
<p>So for a word sequence $w_1,\dots, w_N$ in our corpus, we count how many times we find that sequence and divide by the counts of all possible word sequences of length $N$. There are several problems with this. To compute the numerator, we need to count a particular sequence in the dataset, but notice that this gets harder the longer the sequence is. For example, finding the sequence “the cat” is far easier than finding the sequence “the cat sat on the mat wearing a burgundy hat”. To compute the denominator, we need every combination of English words up to length $N$. To give a sense of scale, Merriam-Webster estimates there are about 1 million English words, so this becomes a combinatorial problem.</p>
\[\binom{1\mathrm{e}6}{N} = \displaystyle\frac{1\mathrm{e}6!}{N!(1\mathrm{e}6-N)!}\]
<p>In other words, for each word up to $N$, there are about a million possibilities we have to account for until we get up to the desired sequence length. The factorial of a million is an incredibly large number! So these reasons make it difficult to compute language model probabilities in that form so we have to try something else. If we remember some probability theory, we can try to rearrange the terms using the chain rule of probability.</p>
\[\begin{align*}
p(w_1,\dots, w_N) &= p(w_N|w_1,\dots,w_{N-1})p(w_1,\dots,w_{N-1})\\
&= p(w_N|w_1,\dots,w_{N-1})p(w_{N-1}|w_1,\dots,w_{N-2})p(w_1,\dots,w_{N-2})\\
&= \displaystyle\prod_{i=1}^N p(w_i|w_1,\dots,w_{i-1})\\
\end{align*}\]
<p>So we’ve decomposed the joint distribution of the language model into a product of conditionals $p(w_i\vert w_1,\dots,w_{i-1})$. Intuitively, this measures the probability that word $w_i$ follows the previous sequence $w_1,\dots,w_{i-1}$. Basically, each word depends on all previous words. So let’s see if these conditionals are any easier to compute by counting sequences.</p>
\[p(w_i|w_1,\dots,w_{i-1})=\displaystyle\frac{C(w_1,\dots,w_i)}{C(w_1,\dots,w_{i-1})}\]
<p>This looks a little better! Intuitively, in the numerator we count a particular sequence up to $i$: $w_1,\dots,w_i$ in the corpus. But in the denominator, we only count up to the previous word $w_1,\dots,w_{i-1}$. This is a bit better than going up to the entire sequence length $N$, but there’s still a problem. Particularly, the biggest problem is the history $w_1,\dots,w_{i-1}$. How do we deal with it?</p>
<h1 id="n-gram-model">n-gram Model</h1>
<p>Rather than dealing with the entire history up to a certain word, we can approximate it using only the past few words! This is the premise behind <strong>n-gram models</strong>: we approximate the entire past history using the past $n$ words.</p>
\[p(w_i|w_1,\dots,w_{i-1})\approx p(w_i|w_{i-(n-1)},\dots,w_{i-1})\]
<p>A <strong>unigram</strong> model looks like $p(w_i)$; a <strong>bigram</strong> model looks like $p(w_i\vert w_{i-1})$; a <strong>trigram</strong> model looks like $p(w_i\vert w_{i-1},w_{i-2})$. Intuitively, a unigram model looks at no prior words; a bigram model looks only at the previous word; a trigram model looks only at the past two words. Now let’s see if it’s easier to compute these conditional distributions using the same counting equation.</p>
\[\begin{align*}
p(w_i|w_{i-1})&=\displaystyle\frac{C(w_{i-1}, w_i)}{\displaystyle\sum_w C(w_{i-1}, w)}\\
&\to\displaystyle\frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
\end{align*}\]
<p>We go to the second line because summing the counts of all bigrams that start with $w_{i-1}$ just gives the count of $w_{i-1}$ itself; this count-ratio estimate is the maximum likelihood estimate. Computing these counts is much easier! To see this, let’s actually compute an n-gram model by hand using a very small corpus.</p>
\[\texttt{<SOS>}~\text{I am Sam}~\texttt{<EOS>}\]
\[\texttt{<SOS>}~\text{Sam I am}~\texttt{<EOS>}\]
<p>Practically, we use special tokens that denote the start of the sequence (<small><SOS></small>) and end of sequence (<small><EOS></small>). The <small><EOS></small> token is required to normalize the conditional distribution into a true probability distribution. The <small><SOS></small> token is optional but it becomes useful for sampling the language model later so we’ll add it. Treating these as two special tokens, let’s compute the bigram word counts and probabilities by hand.</p>
<table>
<thead>
<tr>
<th>$w_i$</th>
<th>$w_{i-1}$</th>
<th>$p(w_i\vert w_{i-1})$</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td><small><SOS></small></td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td>Sam</td>
<td><small><SOS></small></td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td><small><EOS></small></td>
<td>Sam</td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td>I</td>
<td>Sam</td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td>Sam</td>
<td>am</td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td><small><EOS></small></td>
<td>am</td>
<td>$\frac{1}{2}$</td>
</tr>
<tr>
<td>am</td>
<td>I</td>
<td>$1$</td>
</tr>
</tbody>
</table>
<p>Concretely, let’s see how to compute $p(\text{I}\vert\text{Sam})$. Intuitively, this is asking for the likelihood that “I” follows “Sam”. In our corpus, we have two instances of “Sam” and the words after are “<small><EOS></small>” and “I”. So overall, the likelihood is $\frac{1}{2}$. Notice how the conditionals form a valid probability distribution, e.g., $\sum_w p(w\vert\text{Sam}) = 1$.</p>
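<p>To make the counting concrete, here’s a minimal Python sketch (separate from the nltk code later in the post) that reproduces the bigram table above from the two-sentence corpus:</p>

```python
from collections import Counter

# Toy corpus with explicit start/end tokens (<SOS>, <EOS>)
corpus = [
    ["<SOS>", "I", "am", "Sam", "<EOS>"],
    ["<SOS>", "Sam", "I", "am", "<EOS>"],
]

# Count bigrams (w_{i-1}, w_i) and context words w_{i-1}
bigrams = Counter()
contexts = Counter()
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1

def p(word, prev):
    """MLE bigram probability p(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / contexts[prev]

print(p("I", "Sam"))  # 0.5: "Sam" is followed by "I" once out of two occurrences
print(p("am", "I"))   # 1.0: "I" is always followed by "am"
```

The conditionals for each context sum to $1$, just like the table.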
<p>With this model, we can approximate the full language model with a product of n-grams. Consider bigrams:</p>
\[\begin{align*}
p(w_1,\dots, w_N)&\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\\
p(\text{the cat sat on the mat}) &\approx p(\text{the}|\texttt{<SOS>})p(\text{cat}|\text{the})\cdots p(\texttt{<EOS>}|\text{mat})
\end{align*}\]
<p>This is a lot more tractable! So now we have an approximation of the language model! What other kinds of things can we do? We can sample from language models. We start with the <small><SOS></small> token and then use the conditionals to sample the next word. We can either keep sampling until we hit an <small><EOS></small> or keep sampling for a fixed number of words. This is also why we have an <small><SOS></small>: without it, we’d need to specify a start word ourselves. But since we used <small><SOS></small>, we have a uniform start token.</p>
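<p>Sampling works exactly as described: start at the start token and repeatedly draw the next word from the conditional distribution. A small sketch, reusing the toy two-sentence corpus from earlier:</p>

```python
import random
from collections import Counter, defaultdict

corpus = [
    ["<SOS>", "I", "am", "Sam", "<EOS>"],
    ["<SOS>", "Sam", "I", "am", "<EOS>"],
]

# Successor counts: for each word, how often each next word follows it
successors = defaultdict(Counter)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        successors[prev][cur] += 1

def sample(max_len=10, seed=None):
    """Sample from the bigram model: start at <SOS>, stop at <EOS> or max_len."""
    rng = random.Random(seed)
    tokens = ["<SOS>"]
    while len(tokens) < max_len and tokens[-1] != "<EOS>":
        counts = successors[tokens[-1]]
        tokens.append(rng.choices(list(counts), weights=counts.values())[0])
    return tokens

print(sample(seed=0))
```

Every word drawn this way stays inside the vocabulary, and the generated sequence always follows bigrams that actually occurred in the corpus.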
<h1 id="practical-language-modeling">Practical Language Modeling</h1>
<p>Now that we’ve covered the maths, let’s talk about some practical aspects of language modeling. The first problem we can address is what we just talked about: approximating a full language model with the product of n-grams.</p>
\[p(w_1,\dots, w_N)\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\]
<p>What’s the problem with this? Numerically, when we multiply a bunch of probabilities together, we’re multiplying together numbers that are in $[0, 1]$ which means the probability gets smaller and smaller. This has a risk of underflowing to 0. To avoid this, we use a trick called the exp-log-sum trick:</p>
\[\exp\Big[\log p(w_2|w_1)+\log p(w_3|w_2)+\cdots+\log p(w_N|w_{N-1})\Big]\]
<p>In the log-space, multiplying is adding so the number just gets increasingly negative rather than increasingly small. Then we can take the exponential to “undo” the log-space.</p>
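<p>A quick sketch of why log space matters: the direct product of many probabilities underflows to exactly zero, while the sum of logs stays representable. (In practice we often keep the result in log space rather than exponentiating at the end, since the exponential of a very negative number would underflow too.)</p>

```python
import math

# 10,000 bigram probabilities of 0.1 each: the direct product underflows
probs = [0.1] * 10_000
direct = 1.0
for p in probs:
    direct *= p
print(direct)     # 0.0 (underflow: the true value is 1e-10000)

# In log space the product becomes a sum, which stays representable
log_total = sum(math.log(p) for p in probs)
print(log_total)  # about -23025.85, i.e. the log-probability of the sequence
```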
<p>Going beyond the numerical aspects, practically, language models need to be trained on a large corpus because of sparsity. After we train, two major problems we encounter in the field are unknown words not in the training corpus and words that are known but used in an unknown context.</p>
<p>For the former, when we train language models, we often construct a vocabulary during training. This can either be an open vocabulary where we add words as we see them or a closed vocabulary where we agree on the words ahead of time (perhaps the most common $k$ words for example). In either case, during inference, we’ll encounter out-of-vocabulary (OOV) words. One solution to this is to create a special token called <small><UNK></small> that represents unknown words. For any OOV word, we map it to the <small><UNK></small> token and treat it like any other token in our vocabulary.</p>
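<p>As a tiny illustrative sketch (the closed vocabulary here is hypothetical), OOV handling is just a preprocessing map applied before counting or scoring:</p>

```python
# Hypothetical closed vocabulary with a special <UNK> token for OOV words
vocab = {"<SOS>", "<EOS>", "<UNK>", "I", "am", "Sam"}

def normalize(tokens):
    """Map any out-of-vocabulary token to <UNK> before counting or scoring."""
    return [t if t in vocab else "<UNK>" for t in tokens]

print(normalize(["<SOS>", "I", "am", "Gandalf", "<EOS>"]))
# ['<SOS>', 'I', 'am', '<UNK>', '<EOS>']
```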
<h2 id="smoothing">Smoothing</h2>
<p>What about known words in an unknown context? Let’s consider how we compute bigrams.</p>
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)}{C(w_{i-1})}\]
<p>Mathematically, the problem is that the numerator can be zero. So the simplest solution is to make it not zero by adding $1$. But we can’t simply add $1$ without correcting the denominator since we want a valid probability distribution. So we also need to add something to the denominator. Since we’re adding $1$ to each count for each word, we need to add a count for the total number of words in the vocabulary $V$.</p>
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+1}{C(w_{i-1})+V}\]
<p>With this, we’re guaranteed not to have zero counts! This is called <strong>Laplace Smoothing</strong>. The issue with this kind of smoothing is that it shifts too much probability mass onto unseen events since we’re blindly adding $1$. We can generalize this so that we add some fractional $k$ (and normalize by $kV$) to ease the probability mass less sharply toward the unseen events.</p>
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+k}{C(w_{i-1})+kV}\]
<p>This is called <strong>Add-$k$ Smoothing</strong>. With an appropriately tuned $k$, it can perform better than Laplace Smoothing in most cases.</p>
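<p>As a sketch, add-$k$ smoothing is a one-line change to the counting estimate. Using the toy corpus from earlier with $k=1$ (i.e., Laplace Smoothing):</p>

```python
from collections import Counter

corpus = [
    ["<SOS>", "I", "am", "Sam", "<EOS>"],
    ["<SOS>", "Sam", "I", "am", "<EOS>"],
]
vocab = {w for s in corpus for w in s}
V = len(vocab)  # vocabulary size (5 here)

bigrams = Counter()
contexts = Counter()
for s in corpus:
    for prev, cur in zip(s, s[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1

def p_addk(word, prev, k=1.0):
    """Add-k smoothed bigram: (C(prev, word) + k) / (C(prev) + k*V)."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * V)

# The unseen bigram "Sam Sam" now gets nonzero probability: (0+1)/(2+5) = 1/7
print(p_addk("Sam", "Sam"))
# Each conditional still sums to 1 over the vocabulary
print(sum(p_addk(w, "Sam") for w in vocab))
```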
<h2 id="backoff-and-interpolation">Backoff and Interpolation</h2>
<p>One alternative to smoothing is to use less context when the full context isn’t available. The intuition is that if we have no counts for the bigram $p(w_i\vert w_{i-1})$, we can check whether a unigram $p(w_i)$ exists and use it in its place. This technique is called <strong>backoff</strong> because we back off to a smaller n-gram.</p>
<p>Going a step further, we don’t necessarily have to back off to just the $(n-1)$-gram. We can always consider all of the lower-order n-grams and form a linear combination of them.</p>
\[\begin{align*}
p(w_i|w_{i-2},w_{i-1})&=\lambda_1 p(w_i)+\lambda_2 p(w_i|w_{i-1})+\lambda_3 p(w_i|w_{i-2},w_{i-1})\\
\displaystyle\sum_i \lambda_i &= 1
\end{align*}\]
<p>Here the $\lambda_i$s are the interpolation coefficients and they have to sum to $1$ to create a valid probability distribution. This allows us to consider all previous n-grams in the absence of data. Backoff with interpolation works pretty well in practice.</p>
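<p>The interpolation itself is just a weighted sum. A minimal sketch with hypothetical $\lambda$ values (in practice they’re tuned on held-out data):</p>

```python
# Hypothetical interpolation weights for unigram, bigram, trigram; must sum to 1
LAMBDAS = (0.1, 0.3, 0.6)

def p_interp(p_uni, p_bi, p_tri, lambdas=LAMBDAS):
    """Linear interpolation of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even if the trigram was never seen (p_tri = 0), the estimate stays nonzero:
# 0.1*0.01 + 0.3*0.2 + 0.6*0.0 = 0.061
print(p_interp(p_uni=0.01, p_bi=0.2, p_tri=0.0))
```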
<h1 id="code">Code</h1>
<p>We’ve been talking about the theory of language models and n-gram models for a while but let’s actually try training one on a dataset and use it to generate text! Fortunately since they’ve been around for a while, training them is very simple with existing libraries.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torchtext.datasets</span> <span class="kn">import</span> <span class="n">AG_NEWS</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">nltk.lm</span> <span class="kn">import</span> <span class="n">MLE</span>
<span class="kn">from</span> <span class="nn">nltk.lm.preprocessing</span> <span class="kn">import</span> <span class="n">padded_everygram_pipeline</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">6</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">AG_NEWS</span><span class="p">(</span><span class="n">root</span><span class="o">=</span><span class="s">'.'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">'train'</span><span class="p">)</span>
<span class="n">train</span><span class="p">,</span> <span class="n">vocab</span> <span class="o">=</span> <span class="n">padded_everygram_pipeline</span><span class="p">(</span><span class="n">N</span><span class="p">,</span>
<span class="p">[</span><span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^A-Za-z0-9 ]+'</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]).</span><span class="n">split</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">])</span>
<span class="n">lm</span> <span class="o">=</span> <span class="n">MLE</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
<span class="n">lm</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lm</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="mi">4</span><span class="p">)))</span>
</code></pre></div></div>
<p>We’re using the <code class="language-plaintext highlighter-rouge">AG_NEWS</code> dataset that contains 120,000 training examples of news articles across World, Sports, Business, and Science/Tech. The <code class="language-plaintext highlighter-rouge">padded_everygram_pipeline</code> adds start- and end-of-sentence padding tokens (playing the role of our <small><SOS></small> and <small><EOS></small>) and creates the n-grams and backoff n-grams; we’re using 6-grams, which tend to work well in practice. For simplicity, we strip any non-alphanumeric character besides spaces. Then we use a maximum likelihood estimator (similar to the conditional distribution tables we created above) to train our model. Finally, we generate some examples of length 20.</p>
<p>I tried a bunch of different seeds and here are a few cherry-picked examples (I’ve truncated them after the <small><EOS></small> token):</p>
<ul>
<li>Belgian cancer patient made infertile by chemotherapy has given birth following revolutionary treatment</li>
<li>Two US citizens were killed when a truck bomb exploded in downtown Kabul in the second deadly blast to strike</li>
<li>This year the White House had rejected a similar request made by 130 Republican and Democratic members of Congress</li>
<li>Greatly enlarged museum is expected to turn into a cacophony on Saturday</li>
</ul>
<p>These look pretty good for just an n-gram model! Notice they retain some information, probabilistically, across the sequence. For example, in the first one, the word “infertile” comes before “birth” since, when generating “birth”, we could see “infertile” in our previous history.</p>
<p>But I also found scenarios where the generated text didn’t really make any sense. Here are some of those lemon-picked examples:</p>
<ul>
<li>For small to medium businesses</li>
<li>Can close the gap with SAP the world 39s biggest software company after buying US rival PeopleSoft Oracle 39s Chairman</li>
<li></li>
<li>British athletics appoint psychologist for 2008 Olympics British athletics chiefs have appointed sports psychologist David Collins</li>
<li>Can close the gap with SAP the world 39s biggest software company after buying US rival PeopleSoft Oracle 39s Chairman</li>
</ul>
<p>These are sometimes short phrases or nonsensical with random digits. In one case, the language model just generated a bunch of <small><EOS></small> tokens! These examples also help show why neural language models tend to outperform simplistic n-gram models in general. Feel free to change the dataset and generate your own sentences!</p>
<h1 id="conclusion">Conclusion</h1>
<p>Large Language Models (LLMs) are gaining traction online as being able to perform complex and sequential reasoning tasks. They’re often treated as black-box models, but understanding a bit about how they work can make it easier to interact with them. Starting from the beginning, we learned a bit about language itself and why modeling it is so difficult. We introduced language modeling as the task of assigning a probability to a sequence of words based on how likely it is to appear in the dataset. Then we learned how $n$-gram models approximate the full previous history of a particular word using only the past $n$ words. We can use these models for language modeling and sampling. We finally discussed some practical considerations when training language models, including handling unknown words, smoothing, and backoff and interpolation.</p>
<p>There’s still a lot more to cover! This is just the start of our journey to the precipice of language modeling 🙂</p>
Lie Groups - Part 2 (2023-04-07)
<p>In the previous post, we motivated Lie Groups by looking at 2D rotations and their geometry. After defining a group and the group axioms, we discussed the other core aspect of Lie Groups: manifolds. We ended with state estimation for robotics, moving a robot under some kinematics entirely on the manifold to avoid reprojection error. However, we only saw how to apply Lie Groups to the pose of the robot, not the uncertainty! To do that, we need to take derivatives of the motion model, but on the manifold!</p>
<p>In this post, we’ll take our existing notion of Lie Groups and extend them to perform calculus so we can compute derivatives to compute things like the covariance, as it relates to the latter half of Dr. Joan Solà’s work: <a href="https://arxiv.org/abs/1812.01537v9">A micro Lie theory for state estimation in robotics</a>. We’ll start by defining the adjoint to relate the local and global frames since we’ll need it for later. Then we build up calculus by learning how to take derivatives on manifolds as well as covariances. Finally, we’ll take what we learned and arrive at the on-manifold state estimation equations.</p>
<h1 id="adjoint">Adjoint</h1>
<p>In the previous post, we ended with defining the global and local frames and the $\oplus$ and $\ominus$ operators. Since we have these two frames, how do we relate them? Note that they might be at different places on the manifold, so we can’t simply use the $\Exp$ or $\Log$ operators directly. We can’t get further without assuming something reasonable, so let’s equate the left and right $\oplus$ operators.</p>
\[X \oplus {}^Xv={}^Ev\oplus X\]
<p>Now let’s expand the $\oplus$ on both sides and simplify</p>
\[\begin{align*}
X \oplus {}^Xv & ={}^Ev\oplus X\\
\Exp({}^Ev)X &= X~\Exp({}^Xv)\\
\exp({}^Ev^\wedge) &= X\exp({}^Xv^\wedge)X^{-1}=\exp(X{}^Xv^\wedge X^{-1})\\
{}^Ev^\wedge &= X{}^Xv^\wedge X^{-1}
\end{align*}\]
<p>Note that in the third line we use a property of the exponential map that $X\exp({}^Xv^\wedge)X^{-1}=\exp(X{}^Xv^\wedge X^{-1})$. In the last line, notice that we relate the tangent space $T_X M$ to the tangent space $T_E M$; in other words, we can bring a vector in the local frame to a vector in the global frame. This turns out to be a useful-enough operation that we give it a name: the <strong>adjoint</strong>:</p>
\[\Ad_X : \mathfrak{m}\to\mathfrak{m}; v^\wedge\mapsto\Ad_Xv^\wedge\equiv X{}^Xv^\wedge X^{-1}\]
<p>The adjoint map sends vectors in the local frame to vectors in the global frame. Equivalently, we can say ${}^E v^\wedge=\Ad_X {}^X v^\wedge$. The adjoint at $X$ brings ${}^Xv^\wedge$ to ${}^Ev^\wedge$. Similar to the exponential map, this mapping is exact. From the definition, we can derive several properties:</p>
<ul>
<li><strong>Linearity</strong>: $\Ad_X (av^\wedge+bw^\wedge) = a\Ad_X v^\wedge+b\Ad_X w^\wedge$</li>
<li><strong>Homomorphism</strong>: $\Ad_X\Ad_Y v^\wedge=\Ad_{XY}v^\wedge$</li>
</ul>
<p>We can also define an adjoint map more directly to map between two tangent spaces.</p>
\[\Ad_X : \R^n\to\R^n; {}^Xv\mapsto{}^Ev=\Ad_X{}^Xv\]
<p>This map also has properties</p>
<ul>
<li>$X\oplus{}^Xv = (\Ad_X{}^Xv)\oplus X = {}^Ev\oplus X$</li>
<li>$\Ad_{X^{-1} }=\Ad_X^{-1}$</li>
<li>$\Ad_X\Ad_Y=\Ad_{XY}$</li>
</ul>
<p><img src="/images/lie-groups-part-2/adjoint.png" alt="Adjoint" title="Adjoint" /></p>
<p>As a simple example, we can consider the set of rotations on the plane, $SO(2)$. Since rotations on the plane commute everywhere, mapping on the left and on the right leads to the same result, so the adjoint is just the identity: $\Ad_X=I$, and $X\oplus {}^Xv={}^Xv\oplus X$.</p>
<p>As a more complex example, consider $SO(3)$. We know rotations in space don’t commute, but if we compute the adjoint, we can figure out how exactly they commute (in other words, which term is missing). To do this, let’s pick an arbitrary $[\omega]_\times\in\mathfrak{so}(3)$ and $R\in SO(3)$. We’ll remove these later since they’re arbitrary anyways. Instead of starting immediately with the final definition, it’s a bit more illustrative to start a few steps above in the adjoint derivation.</p>
\[R\exp([\omega]_\times) = \exp([\Ad_R~\omega]_\times)R\]
<p>On the left, we have a rotation matrix times another rotation matrix expressed via the Lie algebra element $[\omega]_\times$. In other words, we could have written $R’=\exp([\omega]_\times)$. But remember the adjoint operates on the Lie algebra (or corresponding vector space), so we need this extra decomposition. On the right side, we have commuted the two but applied the adjoint since it maps across the vector spaces.</p>
\[\begin{align*}
R\exp([\omega]_\times) &= \exp([\Ad_R~\omega]_\times)R\\
\exp([\Ad_R~\omega]_\times)R &= R\exp([\omega]_\times)\\
\exp([\Ad_R~\omega]_\times) &= R\exp([\omega]_\times)R^{-1}\\
\exp([\Ad_R~\omega]_\times) &= \exp(R[\omega]_\times R^{-1})\\
[\Ad_R~\omega]_\times &= R [\omega]_\times R^{-1}\\
[\Ad_R~\omega]_\times &= [R\omega]_\times\\
\Ad_R &= R\\
\end{align*}\]
<p>In the second-to-last step we use a property of the $[\cdot]_\times$ operator: $R[\omega]_\times R^{-1}=[R\omega]_\times$. Also, in the last step, we removed the $[\omega]_\times$ since $\omega$ was arbitrary in the first place. So the adjoint of $SO(3)$ is the same as the rotation matrix $R$! This tells us exactly how 3D rotations commute.</p>
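<p>We can numerically sanity-check the identity $R[\omega]_\times R^{-1}=[R\omega]_\times$, and hence $\Ad_R=R$, with a short numpy sketch (Rodrigues’ formula gives the exact exponential map for $\mathfrak{so}(3)$; the particular vectors are arbitrary):</p>

```python
import numpy as np

def skew(w):
    """[w]_x: map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Rodrigues' formula: exact exponential map from so(3) to SO(3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

w = np.array([0.3, -0.2, 0.5])            # arbitrary tangent vector
R = exp_so3(np.array([0.1, 0.7, -0.4]))   # arbitrary rotation

# Check R [w]_x R^{-1} = [R w]_x, i.e. Ad_R = R for SO(3)
lhs = R @ skew(w) @ R.T   # R^T = R^{-1} for a rotation matrix
rhs = skew(R @ w)
print(np.allclose(lhs, rhs))  # True
```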
<h1 id="calculus-on-lie-groups">Calculus on Lie Groups</h1>
<p>Now we have all of the pieces to develop calculus on Lie Groups which we need to compute derivatives for optimization or any other kind of state estimation. The principle for calculus on Lie Groups is same as the original motivation: we want to avoid working directly on the manifold but rather in the tangent space. Tying this to state estimation, if we have a nonlinear motion model using Lie Groups, we need to compute Jacobians which means we need calculus on Lie Groups.</p>
<p>Recall for a scalar function $f:\R\to\R$ the definition of a derivative is</p>
\[f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}\]
<p>For a multivariate function $f:\R^n\to\R$, we can compute a gradient vector of partial derivatives:</p>
\[\nabla f=\left[\frac{\p f}{\p x_1},\cdots,\frac{\p f}{\p x_n}\right]^T\]
<p>For a vector-valued function $f: \R^n\to\R^m$, we can compute a Jacobian matrix of partial derivatives:</p>
\[J = \frac{\p \vec{f} }{\p \vec{x} } =
\begin{bmatrix}
\frac{\p f_1}{\p x_1} & \cdots & \frac{\p f_1}{\p x_n}\\
\vdots & \ddots & \vdots\\
\frac{\p f_m}{\p x_1} & \cdots & \frac{\p f_m}{\p x_n}
\end{bmatrix}\]
<p>Note that the intermediate notation I used $\frac{\p \vec{f} }{\p \vec{x} }$ is not well-defined but intended to be illustrative. Now suppose we have a function $f:G\to G$ on a Lie Group. We want to compute $\frac{\D f}{\D X}$. In other words, we want to know how a wiggle in $X\in G$ wiggles $f(X)\in G$. But what does it mean to wiggle $X$? This was well-defined for a scalar but not for a group element.</p>
<p>The key idea is that we use some small wiggle $\vec\varepsilon$ in the <em>tangent space</em> of $X$ rather than $X$ itself and map that wiggle to the manifold using the exponential map.</p>
<p><img src="/images/lie-groups-part-2/derivative.png" alt="Derivative" title="Derivative" /></p>
<p>Notationally, we can write something like</p>
\[\begin{align*}
\frac{ {}^X\D f}{\D X}&=\lim_{\vec\varepsilon\to 0}\frac{f(X\oplus\vec\varepsilon)\ominus f(X)}{\vec\varepsilon}\\
&=\lim_{\vec\varepsilon\to 0}\frac{\Log(f(X)^{-1}\circ f(X\cdot\Exp(\vec\varepsilon)))}{\vec\varepsilon}\\
&=\frac{\p}{\p\vec\varepsilon}\left[\Log(f(X)^{-1}\circ f(X\cdot\Exp(\vec\varepsilon)))\right]_{\vec\varepsilon=0}\\
\end{align*}\]
<p>Note that we had to “upgrade” $+$ to $\oplus$ and $-$ to $\ominus$ since we’re dealing with manifolds and tangent spaces. We’re being a bit sloppy with notation since vector division isn’t well-defined. If we want to be more precise, we should use $h\vec\varepsilon_i$ with $h\in\R$, $\vert h\vert\ll 1$, where $\vec\varepsilon_i$ is a basis vector in the $i$ direction, and take the limit with respect to $h$. Then we stack all of the $i$ bases.</p>
\[\frac{ {}^X\D f}{\D X_i} =\lim_{h\to 0}\frac{f(X\oplus h\vec\varepsilon_i)\ominus f(X)}{h}\]
<p>Using that key idea, we’ve expressed variations in $X$ of $f(X)$ entirely in the tangent space. This Jacobian linearly maps tangent spaces $T_X M\cong\R^m\to T_{f(X)} M\cong\R^n$.</p>
<p>This new kind of derivative behaves similar to a normal derivative in that, for small variations:</p>
\[f(X\oplus\vec\varepsilon)\approx f(X)\oplus\frac{\D f}{\D X}\vec\varepsilon\]
<p>To make the derivative more concrete, let’s try to compute the Jacobian of $SO(2)$ under the group action $Rv$, rotating a vector $v\in\R^2$ using a rotation matrix $R\in SO(2)$. Specifically, $f(R)=Rv$.</p>
\[\begin{align*}
\frac{ {}^R\D~ ~(Rv)}{\D R}&=\lim_{\theta\to 0}\frac{(R\oplus\theta)v\ominus Rv}{\theta}\\
&=\lim_{\theta\to 0}\frac{R~\Exp(\theta) v - Rv}{\theta}\\
&=\lim_{\theta\to 0}\frac{R(I + [\theta]_\times) v - Rv}{\theta}\\
&=\lim_{\theta\to 0}\frac{R[\theta]_\times v}{\theta}\\
&=\lim_{\theta\to 0}\frac{\theta R[1]_\times v}{\theta}\\
&=R[1]_\times v\\
\end{align*}\]
<p>Note that since rotations in the plane commute, $R\ominus S=\theta_R - \theta_S$, where $\theta_R\in\R$ is the angle corresponding to the 2D rotation matrix $R\in SO(2)$. We also expand the exponential map using a Taylor series $\Exp(\theta)\approx I +[\theta]_\times$ since the higher-order terms vanish in the limit. Finally, we use the fact that $[\theta]_\times v= \theta[1]_\times v$, which lets us pull the scalar $\theta$ out and cancel it against the denominator. The other derivative is much simpler:</p>
\[\frac{ {}^R\D~ ~(Rv)}{\D v}=R\]
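We can sanity-check both of these closed forms numerically; here's a small NumPy sketch (the particular angle, step size, and test vector are arbitrary choices of mine):

```python
import numpy as np

def R(t):
    """2D rotation matrix; R(h) also serves as Exp(h) for SO(2)."""
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

E = np.array([[0.0, -1.0], [1.0, 0.0]])  # the so(2) generator [1]_x

theta, h = 0.7, 1e-7
v = np.array([0.3, -1.2])
Rt = R(theta)

# d(Rv)/dR: perturb R on the right through the exponential map and difference
num = (Rt @ R(h) @ v - Rt @ v) / h
assert np.allclose(num, Rt @ E @ v, atol=1e-5)

# d(Rv)/dv: perturb each component of v; the Jacobian is just R itself
num_v = np.column_stack([(Rt @ (v + h * np.eye(2)[:, i]) - Rt @ v) / h for i in range(2)])
assert np.allclose(num_v, Rt, atol=1e-5)
```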
<p>So far, we’ve been using the right $\oplus$ operator; this creates a mapping between local tangent spaces $T_X M\to T_{f(X)} N$. We could also define the left Jacobian $\frac{ {}^E\D f}{\D X}$ using the left $\oplus$ operator that creates a mapping between global tangent spaces $T_E M\to T_{E} N$. The maths is pretty straightforward to define, and we can relate the two using the adjoint.</p>
<p><img src="/images/lie-groups-part-2/adjoint-derivative.png" alt="Adjoint Derivative" title="Adjoint Derivative" /></p>
\[\frac{ {}^E\D f}{\D X}\Ad_X=\Ad_{f(X)}\frac{ {}^X\D f}{\D X}\]
<p>So now we’re able to do calculus on Lie Groups by taking the derivative of a function with respect to a point on the manifold. For motion models, we can apply these derivatives to compute the Jacobian of the motion model! Recall that for an on-manifold motion model, we take an initial pose $X_0$ and twists $v_i$ applied over time intervals $\Delta t_i$ and apply the exponential map iteratively:</p>
\[\begin{align*}
X_k&=X_0\oplus v_1\Delta t_1\oplus\cdots\oplus v_k\Delta t_k\\
&=X_0\Exp(v_1\Delta t_1)\cdots\Exp(v_k\Delta t_k)\\
\end{align*}\]
<p>The exponential map performs continuous integration on the manifold. However, with that motion model, we need to compute the derivative of the exponential map.</p>
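As a minimal sketch of this kind of on-manifold integration (mine, not from the post), we can chain exponential maps of sampled angular velocities on $SO(2)$; because planar rotations commute, the result should match one rotation by the accumulated angle:

```python
import numpy as np

def Exp(t):
    """Exponential map of SO(2): an angle in the tangent space -> rotation matrix."""
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(0)
omegas = rng.normal(size=50)   # sampled angular velocities
dt = 0.01                      # integration timestep

X = np.eye(2)                  # X_0: start at the identity orientation
for w in omegas:
    X = X @ Exp(w * dt)        # X_k = X_{k-1} * Exp(w * dt)

# Since SO(2) is commutative, this equals one rotation by the summed angle
assert np.allclose(X, Exp(np.sum(omegas) * dt), atol=1e-9)
```

Every intermediate `X` is an exact rotation matrix; there's no renormalization step anywhere.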
<h1 id="jacobian-blocks">Jacobian Blocks</h1>
<p>We’ll need some building blocks before computing things like the Jacobian of the exponential map and its inverse.</p>
<p>The first tool we’ll need is chain rule! This operates on Lie Groups exactly in the same way as ordinary calculus:</p>
\[\frac{\D Z}{\D X} = \frac{\D Z}{\D Y}\frac{\D Y}{\D X}\]
<p>Next, we’ll need to prove the Jacobian of the inverse $f(X)=X^{-1}$ :</p>
\[\begin{align*}
\frac{\D X^{-1} }{\D X} &=\lim_{v\to 0}\frac{\Log[(X^{-1})^{-1}(X~\Exp(v))^{-1}]}{v}\\
&=\lim_{v\to 0}\frac{\Log(X~\Exp(v)^{-1}X^{-1})}{v}\\
&=\lim_{v\to 0}\frac{\Log(X~\Exp(-v)X^{-1})}{v}\\
&=\lim_{v\to 0}\frac{X~(-v)^{\wedge}X^{-1} }{v}\\
&=\lim_{v\to 0}\frac{\Ad_X(-v)}{v}\\
&=\lim_{v\to 0}\frac{-\Ad_X(v)}{v}\\
&=-\Ad_X\\
\end{align*}\]
<p>In the last step, we dropped the arbitrary $v$ since the limit no longer depends on it. Now let’s differentiate the composition $f(X,Y)=X\circ Y$ with respect to the first argument:</p>
\[\begin{align*}
\frac{\D}{\D X}(X\circ Y) &=\lim_{v\to 0}\frac{\Log[f(X,Y)^{-1} f(X\Exp(v), Y)]}{v}\\
&=\lim_{v\to 0}\frac{\Log[(XY)^{-1} X~\Exp(v) Y]}{v}\\
&=\lim_{v\to 0}\frac{\Log[Y^{-1} X^{-1} X~\Exp(v) Y]}{v}\\
&=\lim_{v\to 0}\frac{\Log[Y^{-1} \Exp(v) Y]}{v}\\
&=\lim_{v\to 0}\frac{[Y^{-1}~\Exp(v) Y]^\vee}{v}\\
&=\lim_{v\to 0}\frac{\Ad_{Y^{-1} }v}{v}\\
&=\Ad_{Y^{-1} }\\
&=\Ad_Y^{-1}\\
\end{align*}\]
<p>and with respect to the second argument</p>
\[\begin{align*}
\frac{\D}{\D Y}(X\circ Y) &=\lim_{v\to 0}\frac{\Log[f(X,Y)^{-1}\circ f(X, Y~\Exp(v))]}{v}\\
&=\lim_{v\to 0}\frac{\Log[(X\circ Y)^{-1}\circ XY~\Exp(v)]}{v}\\
&=\lim_{v\to 0}\frac{\Log[Y^{-1}X^{-1}\circ XY~\Exp(v)]}{v}\\
&=\lim_{v\to 0}\frac{\Log[\Exp(v)]}{v}\\
&=\frac{v}{v}\\
&= I
\end{align*}\]
<p>Now that we have these blocks, we can define the <strong>right Jacobian</strong> as the derivative of the exponential map in the local frame.</p>
\[J_r(v)=\frac{ {}^X\D}{\D v}\Exp(v)\]
<p>And the <strong>left Jacobian</strong> as the derivative of the exponential map in the global frame.</p>
\[J_l(v)=\frac{ {}^E\D}{\D v}\Exp(v)\]
<p>Like other global and local frame relations, we can relate the two using the adjoint</p>
\[\Ad_{\Exp(v)}=J_l(v)J_r^{-1}(v)\]
<p>This is where things get really complicated: even for known manifolds, computing the closed forms for these Jacobians is quite difficult, so I’ll have to gloss over the details.</p>
<p>Now that we have some building blocks, we can compute Jacobians for the remaining operations like $\Log$, $\oplus$, and $\ominus$:</p>
\[\begin{align*}
\frac{\D}{\D X}\Log(X)&=J_r^{-1}(\Log(X))\\
\frac{\D}{\D X}(X\oplus v)&=\Ad_{\Exp(v)}^{-1} & \frac{\D}{\D X}(Y\ominus X)=-J_l^{-1}(Y\ominus X)\\
\frac{\D}{\D v}(X\oplus v)&=J_r(v) & \frac{\D}{\D Y}(Y\ominus X)=J_r^{-1}(Y\ominus X)\\
\end{align*}\]
<p>These can be proven using the chain rule we showed earlier.</p>
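As a sanity check on several of these blocks at once, here's a NumPy sketch verifying them on $SO(3)$ by comparing finite differences on the manifold against the closed forms. The $\Exp$, $\Log$, and right-Jacobian formulas below are the standard $SO(3)$ ones (assumed here, not derived in this post), and for $SO(3)$ the adjoint is $\Ad_R=R$:

```python
import numpy as np

def hat(w):
    """Map R^3 -> so(3): the skew-symmetric matrix [w]_x."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0.0]])

def vee(W):
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def Exp(w):
    t = np.linalg.norm(w)
    if t < 1e-12:
        return np.eye(3) + hat(w)
    A = hat(w / t)
    return np.eye(3) + np.sin(t) * A + (1 - np.cos(t)) * A @ A  # Rodrigues formula

def Log(R):
    t = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if t < 1e-12:
        return vee(R - R.T) / 2
    return t / (2 * np.sin(t)) * vee(R - R.T)

def Jr(w):
    """Standard closed-form right Jacobian of SO(3)."""
    t = np.linalg.norm(w)
    W = hat(w)
    if t < 1e-9:
        return np.eye(3) - 0.5 * W
    return np.eye(3) - (1 - np.cos(t)) / t**2 * W + (t - np.sin(t)) / t**3 * W @ W

def group_jac(f, X, h=1e-5):
    """Finite-difference right Jacobian of f: SO(3) -> SO(3) at X."""
    J = np.zeros((3, 3))
    fX = f(X)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        J[:, i] = Log(fX.T @ f(X @ Exp(e))) / h
    return J

rng = np.random.default_rng(1)
X = Exp(0.3 * rng.normal(size=3))
Y = Exp(0.3 * rng.normal(size=3))

# Inverse: D X^{-1} / D X = -Ad_X = -X for SO(3) (rotation inverse is the transpose)
assert np.allclose(group_jac(lambda A: A.T, X), -X, atol=1e-4)

# Composition: D(X o Y)/DX = Ad_{Y^{-1}} = Y^T, and D(X o Y)/DY = I
assert np.allclose(group_jac(lambda A: A @ Y, X), Y.T, atol=1e-4)
assert np.allclose(group_jac(lambda A: X @ A, Y), np.eye(3), atol=1e-4)

# Right Jacobian of Exp: column i = Log(Exp(w)^{-1} Exp(w + h e_i)) / h
w, h = 0.3 * rng.normal(size=3), 1e-5
Jnum = np.column_stack([Log(Exp(w).T @ Exp(w + h * np.eye(3)[:, i])) / h for i in range(3)])
assert np.allclose(Jnum, Jr(w), atol=1e-4)
```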
<h1 id="uncertainty-on-manifolds">Uncertainty on Manifolds</h1>
<p>The last piece we’re missing is how to compute uncertainties on manifolds. Similar to a state estimate, uncertainty is also localized to the tangent space at some point (state estimate) $X$. We can define a mean $\bar{X}\in M$ and a perturbation $\sigma\in T_{\bar{X} } M$ in the <em>tangent space</em> at $\bar{X}$!</p>
<p><img src="/images/lie-groups-part-2/uncertainty.png" alt="Uncertainty" title="Uncertainty" /></p>
<p>Then we can use $\ominus$ to compute uncertainties.</p>
\[\begin{align*}
X&=\bar{X}\oplus\sigma\\
\sigma &=X\ominus \bar{X}
\end{align*}\]
<p>We can define a covariance in the local frame using the definition of covariance too:</p>
\[{}^{X}\Sigma=\mathbb{E}[\sigma\sigma^T]=\mathbb{E}[(X\ominus \bar{X})(X\ominus \bar{X})^T]\]
<p>With this, we can define Gaussians on the manifold $\mathcal{N}(\bar{X},{}^{X}\Sigma)$. Note that the covariance is of the tangent perturbation.</p>
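Here's a small NumPy sketch of this idea on the circle group (the mean angle and standard deviation are arbitrary choices of mine): sample perturbations in the tangent space, map them onto the manifold with $\oplus$, and recover the spread with $\ominus$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_bar, sigma = 0.8, 0.1          # mean rotation (as an angle) and tangent-space std-dev

# Sample on the manifold: X_i = Xbar (+) s_i with s_i ~ N(0, sigma^2) in the tangent space
s = rng.normal(0.0, sigma, size=100_000)
zbar = np.exp(1j * theta_bar)        # mean as a unit complex number on S^1
zs = np.exp(1j * (theta_bar + s))    # perturbed samples, all still on the circle

# Recover the tangent perturbations with (-): s_i = X_i (-) Xbar = Log(Xbar^{-1} X_i)
recovered = np.angle(np.conj(zbar) * zs)
assert np.allclose(np.abs(zs), 1.0)              # every sample is a valid rotation
assert abs(np.std(recovered) - sigma) < 5e-3     # the covariance lives in the tangent space
```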
<h1 id="motion-integration-using-lie-groups">Motion Integration using Lie Groups</h1>
<p>Now we can get back to the question at hand: how do we perform motion integration on Lie Groups for things like EKFs? In the previous post, we defined the motion model</p>
\[\begin{align*}
X_{i+1}&=X_i\oplus v=X_i\Exp(v)\\
P_{i+1}&=FP_{i}F^T+GW_iG^T
\end{align*}\]
<p>where</p>
<ul>
<li>$X_i$ is the state at timestep $i$</li>
<li>$v$ is the twist (linear and angular velocities)</li>
<li>$P_i$ is the covariance at timestep $i$</li>
<li>$F$ is the Jacobian of the motion model with respect to $X$</li>
<li>$G$ is the Jacobian of the motion model with respect to $v$</li>
<li>$W_i$ is the Gaussian noise matrix at timestep $i$</li>
</ul>
<p>Now that we have the Jacobian blocks we can actually compute $F$ and $G$!</p>
\[\begin{align*}
F&=\frac{\D}{\D X}[X\oplus v] = \Ad_{\Exp(v)}^{-1}\\
G&=\frac{\D}{\D v}[X\oplus v] = J_r(v)
\end{align*}\]
<p>With this, we have the full equations for state estimation on the manifold! Lie Groups don’t only work for EKFs though; we can apply the same logic to pose graph optimization or any other kind of optimization.</p>
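Here's a minimal NumPy sketch of one covariance propagation step on $SO(3)$ (the increment, covariance, and noise values are illustrative; the Rodrigues formula, $\Ad_R=R$, and the closed-form right Jacobian are the standard $SO(3)$ results, assumed rather than derived here):

```python
import numpy as np

def hat(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0.0]])

def Exp(w):
    t = np.linalg.norm(w)
    if t < 1e-12:
        return np.eye(3) + hat(w)
    A = hat(w / t)
    return np.eye(3) + np.sin(t) * A + (1 - np.cos(t)) * A @ A  # Rodrigues formula

def Jr(w):
    """Standard closed-form right Jacobian of SO(3)."""
    t = np.linalg.norm(w)
    W = hat(w)
    if t < 1e-9:
        return np.eye(3) - 0.5 * W
    return np.eye(3) - (1 - np.cos(t)) / t**2 * W + (t - np.sin(t)) / t**3 * W @ W

# One propagation step X <- X (+) u, with increment u = v * dt
u = np.array([0.02, -0.01, 0.03])
P = np.diag([0.01, 0.01, 0.02])      # current state covariance (in the tangent space)
W_noise = 1e-4 * np.eye(3)           # process noise covariance

F = Exp(u).T                         # Ad_{Exp(u)}^{-1}: for SO(3), Ad_R = R
G = Jr(u)
P_next = F @ P @ F.T + G @ W_noise @ G.T

assert np.allclose(P_next, P_next.T)            # covariance stays symmetric
assert np.all(np.linalg.eigvalsh(P_next) > 0)   # ...and positive definite
```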
<h1 id="conclusion">Conclusion</h1>
<p>In this post, we wrapped up the discussion on Lie Groups by finishing on-manifold motion integration equations for state estimation. We started with defining the adjoint to relate the global and local frames. Then we took our familiar notion of calculus and extended it to work with Lie Groups. We also derived a few fundamental Jacobian blocks to use as a basis for more complicated derivatives. Using those blocks, we also were able to show how uncertainty propagates on a manifold. With all of that background, we were finally able to show the full equations of motion integration.</p>
<p>As I stated before, Lie Groups are pretty theoretical compared to other kinds of applied maths for engineering. Fortunately, there are libraries that abstract away the details of these implementations but it’s still important to know when Lie Groups might be useful. There’s still a lot more to Lie Groups but I’ve covered enough in these two posts for them to prove useful to you should you encounter a scenario where you’re on a manifold working with functions 🙂</p>
<p><em>I’ll continue the discussion of Lie Groups into the realm of calculus on Lie Groups; we’ll finish by applying them to robotic state estimation.</em></p>
<h1 id="lie-groups-part-1">Lie Groups - Part 1 (2022-04-02)</h1>
<p>In this post, I want to discuss Lie Groups. For engineers or beginning physicists, Lie Groups might not be as familiar as multivariable calculus or linear algebra, but, in many regards, they’re a combination of both. Like with other topics in advanced mathematics, I like to apply them to solve some kind of real problem in engineering or physics. In this case, we’ll be looking at a problem I described in one of my previous posts: robotic state estimation. Given a position and orientation of a mobile robot, if we receive some new sensor data, we want to update both to account for those new sensor measurements. The tricky part is the orientation: we often represent it as a quaternion or rotation matrix and these are constrained, i.e., not just any ordinary matrix is a rotation matrix. We want to make sure the orientation update still obeys those constraints, else we won’t have a valid orientation!</p>
<p>There are a lot of really good resources out there to learn about Lie Groups, particularly from physics. However, I think most of them lack an initial motivation: they jump right into a definition without giving any concrete examples. The closest I’ve found is Dr. Joan Solà’s work: <a href="https://arxiv.org/abs/1812.01537v9">A micro Lie theory for state estimation in robotics</a>, which I think does a really good job at explaining the topic practically. It has concrete examples along with proofs and derivations; it starts with just talking about group structure and then adds calculus later on instead of conflating the two at the beginning. But there were many things I had to look up or do by hand when I was going through it to fill knowledge gaps and really understand the proofs. Nevertheless, I still really like that work and used it as one of my references when writing this series on Lie Groups; the structure of this series and some of the examples are inspired by that work (especially when we get to calculus on Lie Groups).</p>
<p>Lie Groups are a bit more theory-oriented than other kinds of maths, especially for engineers. It could be argued that you could go your entire engineering or (undergrad-level) physics career without ever using Lie Groups. This is partly true, but, for robotic state estimation, we’ll see why we can get a better result (rather than an approximate/error-filled one) if we’re aware of the structure of our problem.</p>
<p>As a meta-point, I’m breaking this up into two parts: this is the introductory part without any (or much) calculus and the next part will intersect calculus and Lie Groups to construct Jacobians and other structures.</p>
<h1 id="2d-rotations">2D Rotations</h1>
<p>Let’s start with the simple example of a vector on a plane. This could be the position and orientation of a robot. Suppose we get some new sensor update that says our robot has rotated by some amount $\phi$ and we want to rotate the vector by that amount.</p>
<p><img src="/images/lie-groups-part-1/single-rotation.png" alt="Single Rotation" title="Single Rotation" /></p>
<p><small>We have the $x$ basis vector $v$ that we want to rotate by some $\phi$ to get $v’$.</small></p>
<p>How would we go about doing this? We need a way to transform the initial vector $v=(x,y)^T$ into the rotated vector $v’=(x’,y’)^T$. Since we’re dealing with a vector in a plane, we can do this using ordinary geometry if we draw some angles and remember some trig formulas. Without going through the trig, we end up with the following way to relate $v’$ and $v$.</p>
\[\begin{align*}
x' &= x \cos\phi - y\sin\phi\\
y' &= y \cos\phi + x\sin\phi\\
\end{align*}\]
<p>For convenience, we can write this in matrix form.</p>
\[\begin{align*}
\begin{bmatrix}x'\\y'\end{bmatrix}
&= \begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\
v' &= R(\phi) v
\end{align*}\]
<p>$R(\phi)\in\R^{2\times 2}$ is the 2D rotation matrix. Of course, we can plug in some known values and see if we get what we expect. Try plugging in $\phi=\frac{\pi}{2}$ and $(1,0)^T$, i.e., the $x$ basis vector, and the result should be $(0,1)^T$, i.e., the $y$ basis vector. We’ve basically rotated the $x$ basis vector into the $y$ basis vector!</p>
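This check is two lines of NumPy (a quick sketch of the exercise just suggested):

```python
import numpy as np

def R(phi):
    """2D rotation matrix for angle phi."""
    return np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])

v = np.array([1.0, 0.0])                  # the x basis vector
v_rot = R(np.pi / 2) @ v                  # rotate by 90 degrees
assert np.allclose(v_rot, [0.0, 1.0])     # ...and we get the y basis vector
```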
<p><img src="/images/lie-groups-part-1/double-rotation.png" alt="Double Rotation" title="Double Rotation" /></p>
<p><small>Suppose we have $v’$ that is $v$ rotated by $\phi$, and we rotate $v’$ again by $\gamma$ to get $v’’$.</small></p>
<p>If we have another rotation by angle $\gamma$ that we want to apply after the rotation to $\phi$, we can first apply $R(\phi)$ and then $R(\gamma)$.</p>
\[\begin{align*}
\begin{bmatrix}x''\\y''\end{bmatrix}
&= \begin{bmatrix}\cos\gamma & - \sin\gamma \\ \sin\gamma & \cos\gamma\end{bmatrix}\begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\
v'' &= R(\gamma) R(\phi) v
\end{align*}\]
<p>Notice the ordering we apply the rotations: right to left. We can also combine the two matrices into a single one and, with some trig, we find that the result is also a rotation matrix!</p>
\[\begin{align*}
\begin{bmatrix}x''\\y''\end{bmatrix}
&= \begin{bmatrix}\cos(\gamma+\phi) & - \sin(\gamma+\phi) \\ \sin(\gamma+\phi) & \cos(\gamma+\phi)\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\
v'' &= R(\gamma + \phi) v
\end{align*}\]
<p>Having seen how to apply a rotation, what if we wanted to reverse/undo one? For example, if we wanted to backtrack an orientation, we’d have to undo the existing rotation. To undo a rotation $R(\phi)$, we need to supply a matrix such that, when composed with $R(\phi)$, we get the identity matrix $I$ because, when we multiply any vector by the identity matrix, we get the same vector out. Naturally, this is the inverse matrix $R(\phi)^{-1}$! However, matrix inverses aren’t free! We need to prove that a rotation matrix $R(\phi)$ has an inverse. In other words, we need to show it has a nonzero determinant, i.e., it is nonsingular. Let’s take the determinant of a general 2D rotation matrix $R(\phi)$:</p>
\[\det\begin{bmatrix}\cos\phi & -\sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}=\cos^2\phi + \sin^2\phi = 1\]
<p>Using the trig identity $\cos^2\phi + \sin^2\phi = 1$, we’ve shown that every 2D rotation matrix has an inverse! This makes intuitive sense because there isn’t a value of $\phi$ that we couldn’t “undo” by rotating by the same amount in the opposite direction.</p>
\[\begin{align*}
\begin{bmatrix}x\\y\end{bmatrix}
&= \begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}^{-1}\begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\
v &= R(\phi)^{-1}R(\phi) v\\
v &= Iv\\
v &= v
\end{align*}\]
<p>We’ve also discovered an implicit rule here: multiplying any vector by the identity matrix $I=R(0)$ doesn’t change the vector at all.</p>
<p>Let’s take a second and recap what we’ve learned so far because, while it might not seem like it, we’ve learned a lot about how 2D rotations work.</p>
<ul>
<li>Rotating vectors is a linear transform because it’s just a matrix multiplication (any linear function/operation can be represented as a matrix that acts on a vector)</li>
<li>We can compose multiple rotations by multiplying their rotation matrices together, and we get a valid rotation matrix as a result</li>
<li>Since we’re using matrix multiplication, this composition is also associative, i.e., $[R(\theta)\cdot R(\phi)]\cdot R(\gamma)=R(\theta)\cdot [R(\phi)\cdot R(\gamma)]$</li>
<li>The inverse for a 2D rotation matrix always exists and can reverse/undo a rotation</li>
<li>The identity matrix doesn’t affect the vector in any way</li>
</ul>
<p>This set of properties is so useful that we actually give them a name in mathematics: a <strong>group</strong>! Remember the topic of this series is about Lie <em>Groups</em> so we have to discuss groups! Now that we’ve demonstrated some properties of groups using 2D rotations, let’s generalize that into a formal definition of a group.</p>
<h1 id="groups">Groups</h1>
<p>A <strong>group</strong> $(G, \circ)$ is a set $G$ and an operator $\circ$ such that any $X,Y,Z\in G$ obeys the following group axioms:</p>
<ul>
<li><strong>Closure</strong>: $X\circ Y\in G$. Composing any two elements of the group gives us another element in the group.</li>
<li><strong>Identity</strong>: $E\circ X = X\circ E = X$. There exists an identity element $E$ that has no effect on any element in the group.</li>
<li><strong>Inverse</strong>: $X^{-1}\circ X = X\circ X^{-1} = E$. For every element in the group, there’s an inverse element that brings it back to the identity.</li>
<li><strong>Associativity</strong>: $(X\circ Y)\circ Z = X\circ (Y\circ Z)$. We can group any two compositions in the sequence of compositions and do those first.</li>
</ul>
<p>One other thing we saw was the <strong>action</strong> of the group on a vector: we multiplied the rotation matrix by a vector to rotate the vector. The action of a group on another set $V$, e.g., 2D vectors, has to be defined for every group and set since each action can be applied differently. More formally, the group action $\cdot$ can be defined as $\cdot: G\times V\rightarrow V; (X,v)\mapsto X\cdot v$ and has the following properties:</p>
<ul>
<li><strong>Identity</strong>: $E\cdot v=v$. Applying the identity element doesn’t change the input.</li>
<li><strong>Compatibility</strong>: $(X\circ Y)\cdot v = X\cdot (Y\cdot v)$. Applying a composition is the same as applying each element of the composition in sequence.</li>
</ul>
<p>Now that we understand the axioms of a group, let’s phrase 2D rotations as a group: $G=\{\text{2D rotation matrices}\}$ and $\circ=\cdot$, i.e., matrix multiplication. To be a bit more specific, we showed earlier that all 2D rotation matrices have a determinant of exactly 1. This is why the set of all 2D rotation matrices is called $SO(2)$ for <strong>Special Orthogonal Group</strong> of 2 dimensions. What makes it <em>special</em> is the unit determinant. It’s a subgroup of the general <strong>Orthogonal Group</strong> $O(2)$, which is the set of orthogonal matrices, i.e., $R^T R=I=RR^T$. So we can more formally define $SO(2)=\{R\in\R^{2\times 2}\vert R^T R=I, \det R = 1\}$. Notice that the only time we make mention of the dimension is in how large the matrices are; more generally, we can define $SO(n)=\{R\in\R^{n\times n}\vert R^T R=I, \det R = 1\}$. We can verify that 2D rotation matrices are orthogonal with some more trig identities. We can also verify that all of the group axioms are satisfied for $SO(2)$.</p>
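We can run those verifications numerically too; here's a quick NumPy sketch checking membership in $SO(2)$ and each group axiom on a few random rotations:

```python
import numpy as np

def R(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(3)
X, Y, Z = (R(a) for a in rng.uniform(-np.pi, np.pi, 3))

# Membership: orthogonal with unit determinant
assert np.allclose(X.T @ X, np.eye(2)) and np.isclose(np.linalg.det(X), 1.0)

XY = X @ Y
assert np.allclose(XY.T @ XY, np.eye(2)) and np.isclose(np.linalg.det(XY), 1.0)  # closure
assert np.allclose((X @ Y) @ Z, X @ (Y @ Z))   # associativity
assert np.allclose(X @ X.T, np.eye(2))         # inverse exists: X^{-1} = X^T
assert np.allclose(R(0.0), np.eye(2))          # identity element is R(0) = I
```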
<p>I’ll also take this opportunity to show another representation of 2D rotations: unit-norm complex numbers $z=\cos\theta + i\sin\theta$. These are also easier to visualize than rotation matrices since we can plot them on the complex plane. In fact, if we take all possible values of $\theta$ and plot all unit-norm $z$ vectors on the complex plane, we get the unit circle $S^1$!</p>
<p><img src="/images/lie-groups-part-1/circle-group.png" alt="Circle Group" title="Circle Group" /></p>
<p><small>All of the possible rotations on a plane can be represented as the circle group $S^1$. A particular rotation $z=\cos\theta + i\sin\theta$ can be represented as a complex number that lives on that circle.</small></p>
<p>To develop this even further as a group, $G=S^1$ and $\circ=\cdot$, i.e., complex multiplication. If we represent a 2D real vector as a complex number $v=x+iy$ and a rotation as $z=\cos\theta + i\sin\theta$, then we can rotate $v$ by $\theta$ by multiplying: $v’=zv$. Notice that this is closed under multiplication, the identity element is 1, and the inverse is the complex conjugate $z^\star$.</p>
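A quick NumPy sketch confirming that unit-complex multiplication agrees with the matrix representation (the angle and vector are arbitrary):

```python
import numpy as np

theta = 0.6
z = np.cos(theta) + 1j * np.sin(theta)   # unit-norm complex number: a rotation in S^1
v = 2.0 - 1.0j                           # the vector (2, -1) as a complex number

w = z * v                                # rotate v by theta via complex multiplication
Rm = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
assert np.allclose([w.real, w.imag], Rm @ [2.0, -1.0])   # agrees with the matrix form
assert np.isclose(z * np.conj(z), 1.0)                   # the conjugate is the inverse
```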
<p><img src="/images/lie-groups-part-1/translation-group.png" alt="Translation Group" title="Translation Group" /></p>
<p><small>The translation group is an additive group that is simply $\R^n$.</small></p>
<p>As a more trivial example, consider the set of 2D translations $v=\displaystyle\begin{bmatrix}t_x & t_y\end{bmatrix}^T\in\R^2=G$ and $\circ=+$. This is an example of an additive group. It’s closed under addition, the identity is 0, and the inverse is the negative $-v$.</p>
<p><img src="/images/lie-groups-part-1/quaternion-group.png" alt="Quaternion Group" title="Quaternion Group" /></p>
<p><small>Quaternions can be represented as an axis and rotation about that axis. One way to visualize them is their effect on a basis vector or as an axis and rotation on the unit sphere.</small></p>
<p>As a less trivial example, consider the set of unit quaternions $S^3$ (a 3-sphere/hypersphere). They are a representation of 3D rotations $SO(3)$. Another way to think about quaternions is using the “axis-angle” formulation where we have an axis $\mathbf{u}=u_x i + u_y j + u_z k$ (where $i,j,k$ are the base/unit quaternions such that $i^2=j^2=k^2=ijk=-1$) that represents the vector we’re rotating around and an angle $\theta$ that we’re rotating by. We put both of them together into a single object: $\mathbf{q}=\cos\frac{\theta}{2}+\mathbf{u}\sin\frac{\theta}{2}$. (We’ll see a derivation of this later.) The reason we need $i,j,k$ is that they obey special relations that make rotating vectors actually work. The group action is quaternion/complex multiplication. A quaternion acts on a vector $\mathbf{v}= v_xi+v_yj+v_zk$ via the double product $\mathbf{q}\mathbf{v}\mathbf{q}^\star$. It’s closed under that double product, the identity is 1, and the inverse is the complex conjugate $\mathbf{q}^\star$.</p>
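Here's a small NumPy sketch of the double product (I'm assuming the Hamilton product convention with components ordered $(w, x, y, z)$), checking that a 90° rotation about the $z$ axis takes the $x$ axis to the $y$ axis:

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ])

def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q via the double product q v q*."""
    qc = q * np.array([1, -1, -1, -1])            # conjugate q*
    return qmul(qmul(q, np.array([0.0, *v])), qc)[1:]

# q = cos(t/2) + u sin(t/2) with u = k (the z axis) and t = 90 degrees
t = np.pi / 2
q = np.array([np.cos(t / 2), 0.0, 0.0, np.sin(t / 2)])
assert np.allclose(rotate(q, [1.0, 0.0, 0.0]), [0.0, 1.0, 0.0])
```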
<h1 id="manifolds">Manifolds</h1>
<p>Going back to the problem of robotic state estimation, we generally have a state that includes some orientation, for example, in 3D space. We receive sensor updates and accumulate that orientation. For example, <a href="ekf">Kalman Filters</a> do this by literally adding up increments in the state over some time interval. Other kinds of state estimation use numerical optimization to solve for the state history so it can be corrected later after we learn more information. This generally takes an objective function $C(x)$, usually a sum of squared errors to minimize, computes a derivative (Jacobian) $\frac{\d C}{\d x_i}\vert_{x_i=\hat{x_i}}$ at the current values of the parameters $\hat{x}$, and applies a tiny update $\Delta x$ to get new parameters. The cycle repeats until we’ve found the minimum of the function.</p>
<p><img src="/images/lie-groups-part-1/gimbal-lock.png" alt="Gimbal Lock" title="Gimbal Lock" /></p>
<p><small>In a normal scenario, all three of the gimbals have all three degrees of freedom. However, during Gimbal Lock, we lose a degree of freedom because motions along two degrees of freedom only correspond to one motion.</small></p>
<p>For representing orientations in 3D, one option is to use Euler angles where we define an angle for roll, pitch, and yaw. This creates a vector in 3D space with exactly the same degrees of freedom as a 3D rotation. There’s nothing wrong with using Euler angles as a way to represent 3D rotation, however, we run into problems when we try to use them for optimization or accumulation. This is because of a problem called <strong>Gimbal Lock</strong> where we lose a degree of freedom, i.e., changing two variables leads to the same rotation. (More formally, we can think of Euler angles as a mapping of $\R^3$ into the set of 3D rotations $SO(3)$, but the derivative of this mapping isn’t always full-rank.)</p>
<p>However, we can avoid the problem of gimbal lock by using quaternions. But remember we’re not using just any quaternions, we’re using <em>unit</em> quaternions to represent 3D rotations. A general quaternion is $\mathbf{q}=\cos\frac{\theta}{2}+\mathbf{u}\sin\frac{\theta}{2}$ such that $\mathbf{u}=u_x i + u_y j + u_z k$, so we have 4 degrees of freedom $(\theta, u_x, u_y, u_z)$. But unit quaternions have the additional constraint of unit norm $\vert\vert\mathbf{q}\vert\vert=1$, which removes a degree of freedom (if we knew the values of 3 degrees of freedom, we could use the unit-norm equation to solve for the remaining one). So instead of a full 4D space, we actually have a constrained 3D surface in 4D, which is partly why unit quaternions are called $S^3$: they have 3 degrees of freedom! As an analogy, think about the unit circle $S^1$ for $SO(2)$. The unit circle is a 1D curve embedded in 2D governed by $x^2+y^2=1$: given either $x$ or $y$, we can compute the other using that equation. In other words, it’s a subspace embedded in a higher-dimensional space. Every point on that surface satisfies the constraint and any point off of that surface doesn’t.</p>
<p><img src="/images/lie-groups-part-1/optimizer-bad-dof.png" alt="Bad DoF Optimization" title="Bad Dof Optimization" /></p>
<p><small>If our optimizer sees all degrees of freedom for $S^1$, then we’ll get an update for both $x$ and $y$ that can move us off the circle.</small></p>
<p>But does our optimizer know that? Unconstrained optimization, by definition, is unconstrained! (In general, unconstrained optimization is easier than constrained optimization and has had more practical success.) If we hand the full quaternion to the optimizer, it’ll see all degrees of freedom and produce a tiny update for each parameter. If we simply fold in that increment, then we’ll almost always end up off of the constrained surface. In other words, we’d end up with something that isn’t a unit quaternion and hence isn’t a 3D rotation. Before the next step of optimization, we’d have to “project” or “renormalize” it back into a unit quaternion, which induces some error!</p>
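We can see the same effect on the simpler circle group with a small NumPy sketch (the starting angle and increment are arbitrary): a naive additive update leaves the circle, while composing with a unit-norm rotation stays on it exactly:

```python
import numpy as np

z = np.exp(1j * 0.4)     # current estimate: a unit-norm complex number on S^1
delta = 0.1              # a tangent-space increment from the optimizer

# Naive additive update in the ambient 2D space moves off the circle:
z_bad = z + 1j * delta * z
assert not np.isclose(abs(z_bad), 1.0)   # |z_bad| = sqrt(1 + delta^2) != 1

# We'd have to "project"/renormalize it back, discarding part of the step:
z_proj = z_bad / abs(z_bad)
assert np.isclose(abs(z_proj), 1.0)

# Composing with a unit-norm rotation instead keeps us on the circle exactly:
z_good = z * np.exp(1j * delta)
assert np.isclose(abs(z_good), 1.0)
```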
<p><img src="/images/lie-groups-part-1/circle-group-projection.png" alt="Circle Group Projection" title="Circle Group Projection" /></p>
<p><small>If we look at a line tangent to the sphere, we can define an increment $\theta$ on that line and find a way to project that onto the circle.</small></p>
<p>Instead, what if we parameterized the constrained surface in a way that we only handed the optimizer the exact degrees of freedom it could actually optimize over? Consider 2D rotations $SO(2)$ and $S^1$. For rotations on a plane, we really only need a single variable $\theta$ instead of two numbers for the complex representation or four for the rotation matrix. We could hand the optimizer the single $\theta$ and project that angle onto the unit circle.</p>
<p><img src="/images/lie-groups-part-1/manifolds.png" alt="Manifolds" title="Manifolds" /></p>
<p><small>Examples of manifolds are $\R^n$ and $S^n$: they’re locally flat at a point. Examples of spaces that aren’t manifolds are cones and planes with lines through them because the tip of the cone and the point where the line intersects the plane aren’t locally flat.</small></p>
<p>As it turns out, there already exists a mathematical structure that encodes exactly what we’re trying to do: a <strong>manifold</strong>. Manifolds are complicated structures in their own right, and I actually have another series explaining them in detail (<a href="manifolds-part-1">Part 1</a>, <a href="manifolds-part-2">Part 2</a>, <a href="manifolds-part-3">Part 3</a>) so I won’t go over them again. Feel free to read those posts to understand their construction, but I’ll just give the more basic intuition here. A manifold is a space that is required to be flat locally but not globally. Some examples are $\R^n$: it’s both locally flat and globally flat! Another well-known example is the sphere $S^2$. At a point, a sphere is flat (in other words $\R^2$), but globally, it’s not flat; in fact, it has intrinsic curvature. A few examples of spaces that aren’t manifolds are cones or a plane with a line going through it. This is because the point of a cone and the point where the line intersects the plane are not locally flat. Similar to a circle, a sphere is another example of a constrained surface: we only need two coordinates to specify a point on a sphere, but it can be embedded in a 3D space.</p>
<h1 id="tangent-spaces">Tangent Spaces</h1>
<p>In general, the most interesting manifolds are smooth, i.e., continuous and infinitely differentiable. Going back to the example of a circle, if we took a derivative at a point, we’d get a tangent line with one degree of freedom. Specifically, if we consider the circle in the complex plane and took a derivative at $\theta=0$, we’d get the complex line $i\R$ which has one degree of freedom and is a flat space. Another name for this is the <strong>Tangent Space</strong> at a point $T_p M$. One way to intuitively construct it is by considering some curve on the manifold $\lambda(t) : \R\to M$ (in the case of the circle, it’s the circle itself!) and taking a derivative $\frac{\d}{\d t}$. (I discuss a more formal way to construct this in my other post on manifolds.) The tangent space has a few properties as a result of its construction:</p>
<ul>
<li>it exists uniquely at all points $p$</li>
<li>the degrees of freedom of the tangent space is the same as the manifold</li>
<li>the tangent space has the same structure at every point</li>
</ul>
<p>In the context of Lie Groups, another name for the tangent space is the Lie Algebra $\mathfrak{m} = T_E M$. We specifically call the Lie Algebra the tangent space at the identity $E$ only because every Lie Group, by definition, is guaranteed to have an identity element. Remember that the structure of tangent space is the same at all points on the manifold so it really doesn’t matter which point we pick, but the identity is the most convenient element that every Lie Group is guaranteed to have.</p>
<p><img src="/images/lie-groups-part-1/general-tangent-space.png" alt="General Tangent Spaces" title="General Tangent Spaces" /></p>
<p><small>More formally, we can define a tangent space $T_p M$ at a point $p$ on a manifold $M$ as the set of all directional derivatives of all scalar functions through $p$.</small></p>
<p>Ideally, we want the optimizer to only operate in the tangent space since it has exactly the same degrees of freedom as the manifold itself. Before talking about how the optimizer would do this, let’s see a few examples of tangent spaces.</p>
<p><img src="/images/lie-groups-part-1/circle-tangent-space.png" alt="Circle Tangent Space" title="Circle Tangent Space" /></p>
<p><small>For the circle group $S^1$, the tangent space $T_E M$ is a line $i\R$ and elements of that tangent space are scalars $\theta\in i\R=T_E M$.</small></p>
<p>Let’s explore the tangent space of 2D rotations $SO(2)$. To do this, we need to identify a curve on the circle that we can take the derivative of. We can use the fact that all rotation matrices have the constraint that $R^T R = I$, i.e., orthonormal columns. We can replace the $R$s with parameterized curves $R(t)$ to get $R(t)^T R(t) = I$ and take the derivative $\frac{\d}{\d t}$.</p>
\[\begin{align*}
\frac{\d}{\d t}[R(t)^T R(t)] &= \frac{\d}{\d t} I\\
R(t)^T \frac{\d}{\d t} R(t) + \frac{\d}{\d t}[R(t)^T] R(t) &= 0\\
R(t)^T \frac{\d}{\d t} R(t) + \left(\frac{\d}{\d t}R(t)\right)^T R(t) &= 0\\
R(t)^T \frac{\d}{\d t} R(t) &= -\left(\frac{\d}{\d t}R(t)\right)^T R(t)\\
R(t)^T \frac{\d}{\d t} R(t) &= -\left(R(t)^T \frac{\d}{\d t} R(t)\right)^T\\
A &= -A^T\\
\end{align*}\]
<p>Between the first and second lines, we use the product rule to expand the product. Then we use the property that derivatives can move in and out of the transpose operation. We moved the second term to the right-hand side. Finally, we transpose the right-hand side so that we end up with an equation of the form $A=-A^T$. If we removed the minus sign, this would be the constraint for a symmetric matrix $A=A^T$! But since we have the minus sign, we call matrices that obey this constraint <strong>skew-symmetric</strong> matrices. By the way, nothing we’ve done so far has been specific to $SO(2)$: as it turns out, this is the same constraint for $SO(3)$ and even $SO(n)$ as well. But going back to $SO(2)$, we’ve found that the Lie Algebra/structure of the tangent space, called $\mathfrak{so}(2)$, is the set of $2\times 2$ skew-symmetric matrices.</p>
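<p>The skew-symmetry of $R(t)^T \frac{\d}{\d t}R(t)$ is easy to check numerically. Here's a small sketch of my own (not from the derivation above) that builds a rotation curve $R(t)$, approximates its derivative with central differences, and confirms the result is skew-symmetric:</p>

```python
import numpy as np

def rot2(theta):
    """2D rotation matrix for angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Parameterized curve R(t) = rot2(omega * t) with angular rate omega
omega, t, h = 0.8, 1.3, 1e-6

# Central-difference approximation of dR/dt at t
dR = (rot2(omega * (t + h)) - rot2(omega * (t - h))) / (2 * h)

# A = R(t)^T dR/dt should be skew-symmetric: A = -A^T
A = rot2(omega * t).T @ dR
print(np.allclose(A, -A.T))
```

<p>In this case $A$ also comes out to $\omega$ times the generator $\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}$, which previews the structure of $\mathfrak{so}(2)$ discussed next.</p>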
<p>The general form for $2\times 2$ skew-symmetric matrices looks like</p>
\[\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}=\theta\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}=\theta E_\theta\in\mathfrak{so}(2)\]
<p>We call $E_\theta$ the <strong>generator</strong> of $\mathfrak{so}(2)$ because we can write every element in terms of $E_\theta$. Think of it as a “basis matrix”. From this formulation, we can take any $\theta\in\R$ and map it to $\theta E_\theta\in\mathfrak{so}(2)$ uniquely. This means that there’s a unique mapping between $\R$ and $\mathfrak{so}(2)$ so we can choose to use either space, whichever is convenient for us. For the optimizer, it would be most convenient to use the $\theta\in\R$ space. We can create a notation $[\theta]_\times$ to define this mapping as</p>
\[[\cdot]_\times : \R\to\mathfrak{so}(2);~\theta\mapsto\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}\]
<p>With $SO(3)$, we can follow the exact same procedure to end up with the set of $3\times 3$ skew-symmetric matrices for its Lie Algebra $\mathfrak{so}(3)$. The general form of those looks like</p>
\[\begin{align*}
\begin{bmatrix}0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0\end{bmatrix}&=\omega_x\begin{bmatrix}0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0\end{bmatrix}+\omega_y\begin{bmatrix}0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0\end{bmatrix}+\omega_z\begin{bmatrix}0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0\end{bmatrix}\\
&=\omega_x E_x+\omega_y E_y+\omega_z E_z
\end{align*}\]
<p>Note that we have 3 degrees of freedom $\omega_x, \omega_y, \omega_z$ and thus 3 generators $E_x, E_y, E_z$. So instead of just $\R$, the degrees of freedom can be grouped into a vector $\omega=[\omega_x, \omega_y, \omega_z]^T\in\R^3$. Just like with $\mathfrak{so}(2)$ and $\R$, the degrees of freedom match the dimension of the flat space. We reuse the same notation to denote converting a vector $\omega\in\R^3$ into a skew-symmetric matrix in $\mathfrak{so}(3)$: $[\omega]_\times$.</p>
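<p>As a concrete sketch, the $[\cdot]_\times$ maps for $\mathfrak{so}(2)$ and $\mathfrak{so}(3)$ take only a few lines of numpy (the function names here are my own, not a standard API):</p>

```python
import numpy as np

def hat_so2(theta):
    """Map a scalar theta to a 2x2 skew-symmetric matrix in so(2)."""
    return np.array([[0.0, -theta], [theta, 0.0]])

def hat_so3(w):
    """Map a 3-vector w to a 3x3 skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def vee_so3(W):
    """Inverse of hat_so3: recover the 3 degrees of freedom."""
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

w = np.array([0.1, -0.4, 0.7])
W = hat_so3(w)
print(np.allclose(W, -W.T))        # skew-symmetric
print(np.allclose(vee_so3(W), w))  # hat and vee are inverses
```

<p>The <code>vee_so3</code> function anticipates the inverse mapping we define next.</p>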
<p><img src="/images/lie-groups-part-1/tangent-space-isomorphisms.png" alt="Tangent Space Isomorphisms" title="Tangent Space Isomorphisms" /></p>
<p><small>Between the tangent space $T_E M=\mathfrak{m}$ and flat space $\R^n$, we can define isomorphisms that exactly map between the two spaces.</small></p>
<p>In general, not all Lie Algebras are skew-symmetric matrices, but we can define an <strong>isomorphism</strong>, i.e., a bijection/one-to-one correspondence, that maps between $\mathfrak{m}\leftrightarrow \R^n$.</p>
\[\begin{align*}
\mathrm{Hat} : \R^n\to\mathfrak{m} &;~v\mapsto v^\wedge\\
\mathrm{Vee} : \mathfrak{m}\to \R^n &;~v^\wedge\mapsto (v^\wedge)^\vee=v
\end{align*}\]
<p>In other words, $v$ is some element in a flat space $\R^n$ and $v^\wedge$ is some element of the Lie Algebra. As an example, for $SO(2)$ and $\mathfrak{so}(2)$, we can define these operators in terms of $[\cdot]_\times$.</p>
\[\begin{align*}
\mathrm{Hat}: \R\to\mathfrak{so}(2) &;~\theta\mapsto \theta^\wedge = [\theta]_\times\\
\mathrm{Vee}: \mathfrak{so}(2)\to \R &;~[\theta]_\times \mapsto [\theta]^\vee_\times=\theta
\end{align*}\]
<p>For $SO(3)$ and $\mathfrak{so}(3)$, we can define the same kinds of operators, except using $\R^3$ and $\mathfrak{so}(3)$.</p>
\[\begin{align*}
\mathrm{Hat}: \R^3\to\mathfrak{so}(3) &;~\omega\mapsto \omega^\wedge = [\omega]_\times\\
\mathrm{Vee}: \mathfrak{so}(3)\to \R^3 &;~[\omega]_\times \mapsto [\omega]^\vee_\times=\omega
\end{align*}\]
<p>With these functions, we now have a way to map our degree-of-freedom flat space $\R^n$ into the tangent space/Lie Algebra of the particular Lie Group we’re working with. In the case of 2D rotations, we only have a single degree of freedom $\theta$ that we can project out into the Lie Algebra of $2\times 2$ skew-symmetric matrices. However, we’re still missing a way to project the Lie Algebra onto the Lie Group manifold. Let’s figure out how (and why).</p>
<h1 id="the-exponential-map">The Exponential Map</h1>
<p>Recall that our problem with state estimation was that our representations for orientation were either overparameterized (quaternions or rotation matrices) or not suitable for optimization/integration (Euler angles). However, learning about manifolds and the tangent space, we can let our optimizer move around in the tangent space where we have the same degrees of freedom as the manifold: no more, no less. After the optimizer computes the derivatives, we get some gradient vector $\Delta x\in\R^n$ that represents the tiny update for all of our parameters. Since we’re at some point $\hat{x}$ on the manifold, this update $\Delta x$ is in the tangent space!</p>
<p><img src="/images/lie-groups-part-1/optimizer-good-dof.png" alt="Good DoF Optimization" title="Good DoF Optimization" /></p>
<p><small>At any stage of optimization, we have the current values of the parameters $\hat{x}$. Giving that to our optimizer along with the Jacobians, we’ll get some $\Delta x$ for all parameters that lives in the tangent space $T_\hat{x} M$. We can’t blindly apply the update so we want to project that onto the manifold $M$.</small></p>
<p>To get the next value of the parameters, we need to add/accumulate $\Delta x$ into $\hat{x}$. What we’d do is just add $\hat{x}+\Delta x$, which almost certainly puts it off the constrained surface, and “reproject” it back onto the manifold so that the solution obeys the constraints. Rinse and repeat until we converge. The problem is that the “reprojection” induces some error. Ideally, we want to perform this mapping from $\mathfrak{m}\to M$ exactly, without any error. Then, after we get a parameter update $\Delta x$, we can apply that mapping and get the next value of the parameters that are guaranteed to obey the constraints, i.e., they remain on the manifold.</p>
<p>In other words, given some vector $v$ or $v^\wedge\in T_p M$, we want to relate it to some $X\in M$. If we consider rotation groups and go back to the definition of the Lie Algebra: $R(t)^T \frac{\d}{\d t}R(t)=\omega^\wedge=R(t)^{-1} \frac{\d}{\d t}R(t)$ (for orthogonal matrices, $R^T=R^{-1}$), then we have an equation relating an element of the Lie Algebra $\omega^\wedge$ and an element of the Lie Group $R(t)$. Isolating $\frac{\d}{\d t}R(t)$ to one side, we get the differential equation:</p>
\[\frac{\d}{\d t}R(t) = R(t)\omega^\wedge\]
<p>This is an ordinary differential equation in $t$ whose solution is well-known (if you took a differential equations class, this was probably the first solution you saw):</p>
\[R(t) = R(0)\exp(\omega^\wedge t)\]
<p>Since $R(t)\in M$ and $R(0)\in M$, then $\exp(\omega^\wedge t)\in M$. Since the structure of the tangent space is the same at all points, we can actually set $R(0)=E=I$ to get $R(t)=\exp(\omega^\wedge t)$. So it seems the way to relate a $\omega^\wedge\in T_p M$ and $R(t)$ is via $\exp$. We call this the <strong>exponential map</strong>: a function that sends elements of $\mathfrak{m}$ to $M$ exactly, with no error or approximation (i.e., the solution to the differential equation is analytical). Naturally, we can reverse the operation by taking a $\log$ and can define the <strong>logarithmic map</strong> as a function that maps $M$ to $\mathfrak{m}$ exactly.</p>
\[\begin{align*}
\exp: \mathfrak{m}\to M &; v^\wedge\mapsto X=\exp(v^\wedge)\\
\log: M\to\mathfrak{m} &; X\mapsto v^\wedge=\log(X)
\end{align*}\]
<p>Intuitively, we can think of these maps as “wrapping” and “unwrapping” the vector along the manifold. To be more precise, this creates a geodesic at $p$ whose tangent vector is $v$. A <strong>geodesic</strong> is a generalization of a “straight line” or “shortest distance” path on a manifold. In $\R^n$, geodesics are lines. However, on other kinds of manifolds, these are generally not lines. For example, for the sphere $S^2$, geodesics are “great circles”: a circle on the sphere such that the center of the circle is the center of the sphere. This is because “straight lines” don’t generally exist on arbitrary manifolds so we have to compromise and pick the “as straight as possible” line. The formal way to derive geodesics is to use calculus of variations and solve for the function that minimizes the distance between two points on the manifold given the manifold metric. We’re not going to do that here, but look at my other series on manifolds for more intuition.</p>
<p>Now that we’ve defined the exponential and logarithmic maps, we have the full picture where we can convert between the flat space $\R^n$, the Lie Algebra/tangent space $T_p M=\mathfrak{m}$, and the manifold $M$.</p>
<p>Let’s look at a few concrete examples of the exponential map starting with $SO(2)$. Recall that all $2\times 2$ skew-symmetric matrices are of the form</p>
\[\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix} = \theta E_\theta\]
<p>Applying the exponential map:</p>
\[\exp\left(\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}\right) = \exp(\theta E_\theta)\]
<p>But what does it mean to take the exponential of a matrix? Remember that $\exp$ can be written as a Taylor series!</p>
\[\exp(x) = \sum_{k=0}^\infty\frac{x^k}{k!}=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\cdots\]
<p>We can take powers of square matrices so the matrix exponential is well-defined. Expanding it out we get:</p>
\[\exp(\theta E_\theta) = I+\theta E_\theta+\frac{\theta^2}{2!}E_\theta^2+\frac{\theta^3}{3!}E_\theta^3+\cdots\]
<p>To expand this further, we need to compute matrix products $E_\theta^k$. Let’s start by computing the first two:</p>
\[\begin{align*}
E_\theta &= \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}\\
E_\theta^2 &= \begin{bmatrix}-1 & 0 \\ 0 & -1\end{bmatrix} = -I\\
\end{align*}\]
<p>An interesting property of skew-symmetric matrices is that the powers are cyclic and we actually only need $E_\theta$ and $E_\theta^2$. Here’s the pattern:</p>
\[\begin{align*}
E_\theta^0 &= I&\\
E_\theta^1 &= E_\theta & E_\theta^2 &= E_\theta^2\\
E_\theta^3 &= -E_\theta & E_\theta^4 &= -E_\theta^2\\
E_\theta^5 &= E_\theta&\\
\cdots
\end{align*}\]
<p>Applying this cycling to the Taylor series, we get:</p>
\[\begin{align*}
\exp(\theta E_\theta) &= I+\theta E_\theta+\frac{\theta^2}{2!}E_\theta^2-\frac{\theta^3}{3!}E_\theta-\frac{\theta^4}{4!}E_\theta^2+\cdots\\
&= I+E_\theta\left(\theta-\frac{\theta^3}{3!}+\frac{\theta^5}{5!}+\cdots\right) + E_\theta^2\left(\frac{\theta^2}{2!}-\frac{\theta^4}{4!}+\cdots\right)\\
&= I + E_\theta\sin\theta + E_\theta^2(1-\cos\theta)\\
&=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} + \begin{bmatrix}0 & -\sin\theta\\ \sin\theta & 0\end{bmatrix} + \begin{bmatrix}\cos\theta-1 & 0\\ 0 & \cos\theta-1\end{bmatrix}\\
&=\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix}
\end{align*}\]
<p>In the first step, we’ve regrouped the terms by $E_\theta$ and $E_\theta^2$. Then we notice that the two series are actually convergent Taylor series for $\sin\theta$ and $1-\cos\theta$. This is the general strategy when dealing with Taylor series: expand it out, regroup the terms, and condense it using other known Taylor series. After that, we can expand $I$, $E_\theta$, and $E_\theta^2$ into matrices and solve for the end result and get a 2D rotation matrix! So the exponential map for $SO(2)$ maps a scalar $\theta\in\R$ into a 2D rotation matrix $R\in SO(2)$!</p>
<p><img src="/images/lie-groups-part-1/circle-exp-log-map.png" alt="Circle exp and log Maps" title="Circle exp and log Maps" /></p>
<p><small>For the circle group $S^1$, the exponential map exactly sends an element in the tangent space to an element in the group. The logarithmic map does the opposite: it maps an element of the group into a tangent space at a point.</small></p>
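<p>To sanity-check the closed form we just derived, here's a small numpy sketch of my own that compares a truncated Taylor series of $\exp(\theta E_\theta)$ against the $\cos$/$\sin$ rotation matrix:</p>

```python
import numpy as np

E = np.array([[0.0, -1.0], [1.0, 0.0]])  # generator of so(2)

def exp_so2_series(theta, terms=30):
    """Matrix exponential of theta*E via a truncated Taylor series."""
    X = theta * E
    result = np.eye(2)
    term = np.eye(2)
    for k in range(1, terms):
        term = term @ X / k        # accumulate X^k / k!
        result = result + term
    return result

def exp_so2_closed(theta):
    """Closed form from the derivation: a 2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 1.1
print(np.allclose(exp_so2_series(theta), exp_so2_closed(theta)))
```

<p>The truncated series converges to the exact rotation matrix, matching the claim that the exponential map is exact rather than an approximation.</p>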
<p>For $SO(3)$, the procedure is almost exactly the same, except we parameterize the input as an axis-angle representation $\theta[\omega]_\times$ with a unit axis $\omega$. Since $[\omega]_\times$ is also a skew-symmetric matrix, the same power cycling happens, and we actually end up with the same result.</p>
\[\exp(\theta[\omega]_\times)=I+[\omega]_\times\sin\theta+[\omega]_\times^2(1-\cos\theta)\]
<p>This formula is so important that it’s actually called the <strong>Rodrigues Rotation Formula</strong>. As it turns out, quaternions have the same kind of result (except with half the angle, since unit quaternions double-cover the rotations).</p>
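<p>Here's a minimal numpy sketch of the Rodrigues formula (my own helper names), checking that the output is a proper rotation and that the rotation axis is left fixed:</p>

```python
import numpy as np

def hat(w):
    """3-vector to 3x3 skew-symmetric matrix."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def rodrigues(axis, theta):
    """Rodrigues rotation formula: exp(theta * [axis]_x) for a unit axis."""
    W = hat(axis)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

axis = np.array([1.0, 2.0, 2.0]) / 3.0    # unit vector
R = rodrigues(axis, 0.9)
print(np.allclose(R.T @ R, np.eye(3)))    # orthogonal
print(np.isclose(np.linalg.det(R), 1.0))  # proper rotation
print(np.allclose(R @ axis, axis))        # the axis is unchanged
```

<p>Since $[\omega]_\times v = \omega\times v$, the axis itself satisfies $[\omega]_\times\omega = 0$, which is why the last check holds.</p>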
<p><img src="/images/lie-groups-part-1/manifold-isomorphisms.png" alt="Manifold Isomorphisms" title="Manifold Isomorphisms" /></p>
<p><small>Using the isomorphisms and the $\exp$/$\log$ maps, we can exactly map between $M$, $T_p M$, and $\R$.</small></p>
<p>Note that all of these exponential maps are exact. There’s no approximation! We’re exactly condensing the infinite series using convergent Taylor series. Now that we’ve seen some concrete examples, we can use the same formula to derive a few properties (that I won’t prove directly).</p>
\[\begin{align*}
\exp((a+b)v^\wedge)&=\exp(av^\wedge)\exp(bv^\wedge)\\
\exp(av^\wedge)&=\exp(v^\wedge)^a\\
\exp(-v^\wedge)&=\exp(v^\wedge)^{-1}\\
\exp(X v^\wedge X^{-1}) &= X\exp(v^\wedge)X^{-1}
\end{align*}\]
<p>As a shortcut, we can define $\Exp$ and $\Log$ operators that use $\exp$ and $\log$ and map directly between $\R^n$ and $M$.</p>
\[\begin{align*}
\Exp: \R^n\to M &; v\mapsto X=\Exp(v)\equiv\exp(v^\wedge)\\
\Log: M\to\R^n &; X\mapsto v=\Log(X)\equiv\log(X)^\vee
\end{align*}\]
<p><img src="/images/lie-groups-part-1/shortcut-isomorphisms.png" alt="Shortcut Isomorphisms" title="Shortcut Isomorphisms" /></p>
<p><small>We can define shortcut isomorphisms $\Exp$/$\Log$ that map directly between $M$ and $\R^n$.</small></p>
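<p>For $SO(2)$ the shortcut maps are particularly simple: $\Exp$ builds the rotation matrix from $\theta$, and $\Log$ recovers $\theta$ from the matrix entries (via the standard $\mathrm{atan2}$ identity — the function names below are my own):</p>

```python
import numpy as np

def Exp_SO2(theta):
    """Capital Exp: flat space R -> SO(2) directly."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def Log_SO2(R):
    """Capital Log: SO(2) -> R; recover the angle from the matrix."""
    return np.arctan2(R[1, 0], R[0, 0])

theta = 0.6
print(np.isclose(Log_SO2(Exp_SO2(theta)), theta))  # exact round trip
```

<p>The round trip is exact for angles in $(-\pi, \pi]$, which reflects that $\exp$/$\log$ are exact inverses on that neighborhood of the identity.</p>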
<p>As another convenience, we can define $\oplus$ and $\ominus$ that use $\Exp$ and $\Log$ as well as group composition. But since not all group operations commute, we need to define left and right operations. We can define the right ones as:</p>
\[\begin{align*}
\oplus &: Y=X\oplus {}^X v\equiv X\Exp({}^Xv)\in M\\
\ominus &: {}^Xv=Y\ominus X\equiv\Log(X^{-1}Y)\in T_X M\\
\end{align*}\]
<p>The left ones are defined as:</p>
\[\begin{align*}
\oplus &: Y={}^E v\oplus X\equiv \Exp({}^Ev)X\in M\\
\ominus &: {}^Ev=Y\ominus X\equiv\Log(YX^{-1})\in T_E M\\
\end{align*}\]
<p><img src="/images/lie-groups-part-1/oplus-ominus.png" alt="On-manifold Addition/Subtraction" title="On-manifold Addition/Subtraction" /></p>
<p><small>We can define additional shortcut notation to perform on-manifold “addition” and “subtraction”. Since not all group operations are commutative, we need two operations: one for left and one for right operations.</small></p>
<p>Note that the left and right $\oplus$ are distinguished by the order of the operations but $\ominus$ is ambiguous. Another thing to note is the left superscript: $E$ means the “global frame” while $X$ means the “local frame”. The structure of all $T_p M$ is identical so it really doesn’t matter what we call the global and local frames, but, since every Lie Group has an $E$, we decide on that for the consistent “global frame” and everything else is a “local frame”. The usefulness of this construct is that we can use the right $\oplus$ to define perturbations in the local frame: when our optimizer has a little update $\Delta x$, that happens in the local frame of the current set of parameters $\hat{x}$.</p>
<h1 id="motion-integration-using-lie-groups">Motion Integration using Lie Groups</h1>
<p>While there’s still (at least) a Part 2 to this series, we’ve covered enough to perform some motion integration or, at least, set up the problem. For robot state estimation in a 2D space, we have both a 2D translation as well as a rotation. The Lie Group corresponding to a combination of translations and rotations is called $SE(2)$, the <strong>Special Euclidean Group</strong> of 2 dimensions. This combines both translations and rotations so that all operations consider both, jointly; in other words, it’s the set of rigid motions in 2D.</p>
\[X=\begin{bmatrix}R & t \\ 0 & 1\end{bmatrix}\]
<p>where $R\in SO(2)$ and $t\in\R^2$. Just like with other Lie Groups, we can define the Lie Algebra and exponential maps for $SE(2)$ as well. In the context of state estimation, we start with some pose $X\in SE(2)$. At some fixed time step $\Delta t$, we get translational and rotational data from our sensors, e.g., the inertial measurement unit (IMU) and wheel encoders of our robot. If we integrate that, we get a small $\Delta x$ for the translation and $\Delta\theta$ for the angle. This exists in the Lie Algebra of $X$, i.e., the current pose we’re at. If we want to integrate that measurement to get a new pose, we need to use the exponential map to ensure that we have a valid rotation at each step.</p>
<p><img src="/images/lie-groups-part-1/motion-integration.png" alt="On-manifold Motion Integration" title="On-manifold Motion Integration" /></p>
<p><small>Starting at $X_0$, we receive a number of sensor measurements in that local frame and can incorporate that into our pose using the $\oplus$ operator for each sensor measurement.</small></p>
<p>Starting with $X$, we get some increment $v = \begin{bmatrix}\Delta x & \Delta\theta\end{bmatrix}^T\in T_X M$ across some time increment. To integrate it into the current pose, we use the $\oplus$ operator.</p>
\[X_{i+1}=X_i\oplus v=X_i\Exp(v)\]
<p>This is a simple equation but builds on all of the things we’ve learned so far. If we had a sequence of these, we can fold them in through the group operation.</p>
\[X_{i}=X_0\oplus v_1\oplus v_2\oplus\cdots \oplus v_i\]
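<p>Here's a minimal numpy sketch of this fold for $SE(2)$. The exponential map's translational part uses the standard $V(\theta)$ matrix, which the post hasn't derived — take it as an assumption of this sketch, along with the $v = [\Delta x, \Delta y, \Delta\theta]$ ordering:</p>

```python
import numpy as np

E = np.array([[0.0, -1.0], [1.0, 0.0]])  # generator of so(2)

def Exp_SE2(v):
    """SE(2) exponential map for v = [dx, dy, dtheta] (assumed ordering)."""
    rho, theta = v[:2], v[2]
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    if abs(theta) < 1e-10:
        V = np.eye(2)
    else:
        # V(theta) couples the rotation into the translational part
        V = (s / theta) * np.eye(2) + ((1.0 - c) / theta) * E
    T = np.eye(3)
    T[:2, :2] = R
    T[:2, 2] = V @ rho
    return T

# Fold a sequence of body-frame increments into the pose with right-oplus
X = np.eye(3)                          # start at the identity pose
for v in [np.array([0.5, 0.0, 0.1])] * 10:
    X = X @ Exp_SE2(v)                 # X_{i+1} = X_i (+) v

print(np.allclose(X[:2, :2].T @ X[:2, :2], np.eye(2)))  # rotation stays valid
```

<p>After ten increments of $\Delta\theta = 0.1$, the accumulated heading is exactly $1.0$ radian and the rotation block remains orthogonal — no drift off the manifold, no renormalization step.</p>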
<p>This allows us to take sensor measurements in the local frame and apply them exactly to the pose we’re at to get a new pose that obeys orientation constraints. The only thing we’re missing is the propagation of uncertainties. For most state estimation, in addition to the poses, we also have some estimate of uncertainty, either implicit or explicit. Using those uncertainties, however, requires us to perform calculus since we have to compute the Jacobian of the state propagation function, i.e., $\oplus$! We’ll get to that next time!</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this post, I introduced Lie Groups using rotations. We first defined 2D rotation using just plain geometry. We used that intuition to define groups and their axioms. Then I gave the intuition about the other part of Lie Groups: manifolds. As a part of manifolds, we also constructed tangent spaces and saw how to map between the tangent space and its corresponding flat space. Beyond the tangent space, we defined the exponential map to map between the tangent space and the manifold itself. Finally, we saw how to apply our new way of thinking to motion integration.</p>
<p>Lie Groups are more theoretical than most other kinds of engineering work, and they do represent a different way of thinking about rotation. However, armed with this new knowledge, we can manipulate rotations and other Lie Groups in an error-free way. The other part that we have yet to cover is how to perform calculus on Lie Groups. The optimizer computes derivatives/Jacobians, after all. Just like with the exponential map, we want to stay in the tangent space because it has the same degrees of freedom as the manifold. We want to do the same thing with derivatives: compute variations solely in the tangent space. After we figure that out, we can really perform motion integration and optimization on the manifold. We’ll get to that in the next post! 😀</p>
<h1 id="manifolds-part-3">Manifolds - Part 3</h1>
<p><small>2021-09-24</small></p>
<p>In the previous article, we constructed a manifold from just open sets and reinvented vectors, tangent spaces, dual vectors, cotangent spaces, and general tensors using the language of a manifold, i.e., without assuming flat coordinates. In this post, we’re going to discuss and derive the most important property of a manifold: curvature!</p>
<h1 id="covariant-derivatives">Covariant Derivatives</h1>
<p>To discuss curvature, we’ll need some extra constructs. Curvature in a flat space involves taking second derivatives, but we haven’t actually discussed how to do calculus on manifolds. So far, partial derivatives and gradients have only served as basis vectors, not as calculus operations. But maybe they can do both. Let’s ask an important question about the partial derivative: does it transform like a tensor? If it does, we can simply use it as the primary method of doing calculus on manifolds. If not, then we need to invent some kind of derivative operator that <em>does</em> transform like a tensor. Let’s find out the answer by applying a coordinate transform to the partial derivative $\p_{\mu’}$ acting on a vector $V$:</p>
\[\begin{align*}
\frac{\p}{\p x^{\mu'}}V^{\nu'}&=\Big(\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\Big) \Big(\frac{\p x^{\nu'}}{\p x^{\nu}}V^\nu\Big)\\
&=\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu} \Big(\frac{\p x^{\nu'}}{\p x^{\nu}}V^\nu\Big)\\
&=\frac{\p x^\mu}{\p x^{\mu'}}\Big(\frac{\p x^{\nu'}}{\p x^\nu} \frac{\p}{\p x^{\mu}}V^\nu+V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^\nu}\Big)\\
&=\underbrace{\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu} \frac{\p}{\p x^{\mu}}V^\nu}_\text{transforms like a tensor}+\underbrace{V^\nu\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^\nu}}_\text{doesn't transform like a tensor}\\
\end{align*}\]
<p>(Note: going from the second to the third line, we used the product rule since $\frac{\p}{\p x^\mu}$ is a derivative operator.)</p>
<p>It doesn’t seem like partial derivatives transform like tensors! So it’s not a good derivative operator for us to do calculus on manifolds, unfortunately. We’ll have to invent our own derivative operator such that it produces a tensor when acting on vectors, duals, and tensors. What kind of properties do we want in a “good” derivative operator?</p>
<ul>
<li>Just like $\p_\mu$, we’d like to send $(k, l)$-rank tensors to $(k,l+1)$-rank tensors.</li>
<li>Just like $\p_\mu$, we’d like to obey the Leibniz product rule (and thus linearity).</li>
</ul>
<p>Since the partial derivative <em>almost</em> transforms like a tensor except for the non-tensorial part, we can use it as the base, but add a correction to account for the non-tensorial part. Actually, if we closely inspect the non-tensorial part, it seems to be taking the derivative of the <em>basis</em>; in other words, it accounts for the changing basis from point-to-point. We need a correction for each component so that means we need a linear transform for each. Therefore, the general form of the correction is a set of $n$ matrices $(\Gamma_\mu)^\nu_\lambda$. The outer upper and lower indices mean this is a linear transform, and the inner lower index $\mu$ indicates we have $n$ of them.</p>
<p>We define the <strong>covariant derivative</strong> $\nabla$ as a generalization of the partial derivative but for arbitrary coordinates. We can think of it as the partial derivative with a correction for the changing basis. As it turns out (and as we’ll soon prove), the correction matrices $\Gamma^\nu_{\mu\lambda}$ <em>do not</em> transform like tensors so we don’t have to be so careful about the index placement because we can’t raise and lower indices on $\Gamma^\nu_{\mu\lambda}$ anyways. But in which basis does the correction happen? Well we might as well use the same basis used to define the vector we’re operating on; after all, it’s right there! With that, we can mathematically define the covariant derivative.</p>
\[\nabla_\mu V^\nu\equiv\underbrace{\p_\mu V^\nu}_\text{partial}+\underbrace{\Gamma^\nu_{\mu\lambda}V^\lambda}_\text{correction}\]
<p>The correction matrices are special enough that we call them the <strong>connection coefficients</strong> or <strong>Christoffel symbols</strong>. Another way to think about this is that the covariant derivative tells us the change in $V^\nu$ in the $\mu$ direction. The geometric picture won’t make complete sense until we discuss parallel transport and geodesics soon, but I’ll present it here with some hand-waving.</p>
<p><img src="/images/manifolds-part-3/covariant-derivative.png" alt="Covariant Derivative" title="Covariant Derivative" /></p>
<p><small>There are a few key actors to understanding the geometry of the covariant derivative. The first is having a vector $V$ at a point $p$. We have another point $q$ and a different value of $V$ at that point. Remember that vector fields are defined at each point on the manifold. The $\mu$ represents the tangent vector to some curve at $p$ that connects to $q$. If we were to take $V$ and move it along the curve in such a way to keep it “as straight as possible”, we’d end up with a different vector $V_{||}$ at $q$. The covariant derivative is just the difference between $V$ at $q$ and the “translated” vector $V_{||}$. Don’t worry if this doesn’t make perfect sense now; we’ll revisit this when we have a more rigorous definition of moving a vector “as straight as possible” along a curve.</small></p>
<p>The point to remember is that the connection coefficients are the correction matrices, i.e., the non-tensorial part.</p>
\[\begin{align*}
\Gamma^\nu_{\mu\lambda}&=\text{the }\p_\nu\text{-component of the change in }\p_\lambda\text{ along the }\p_\mu\text{ direction}\\
&=\frac{\p^2 x^\nu}{\p x^\mu\p x^\lambda}
\end{align*}\]
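<p>For a flat space viewed in curvilinear coordinates, this picture can be made computable: with an embedding chart $y^a(x)$ into Cartesian coordinates, the connection coefficients are $\Gamma^\nu_{\mu\lambda} = \frac{\p x^\nu}{\p y^a}\frac{\p^2 y^a}{\p x^\mu \p x^\lambda}$ (that flat-space formula is an assumption of this sketch, not derived in the post). The sketch below checks the well-known polar-coordinate values $\Gamma^r_{\phi\phi}=-r$ and $\Gamma^\phi_{r\phi}=1/r$ with finite differences:</p>

```python
import numpy as np

def cart(x):
    """Embedding chart: polar (r, phi) -> Cartesian (y1, y2)."""
    r, phi = x
    return np.array([r * np.cos(phi), r * np.sin(phi)])

def christoffel(x, h=1e-5):
    """Gamma^nu_{mu lam} = (dx^nu/dy^a)(d^2 y^a / dx^mu dx^lam),
    computed with central differences (flat space, curvilinear coords)."""
    n = len(x)
    # Jacobian dy/dx, and its inverse dx/dy
    J = np.zeros((n, n))
    for m in range(n):
        e = np.zeros(n); e[m] = h
        J[:, m] = (cart(x + e) - cart(x - e)) / (2 * h)
    Jinv = np.linalg.inv(J)
    # Second derivatives d^2 y / dx^mu dx^lam, contracted with Jinv
    G = np.zeros((n, n, n))  # indices [nu, mu, lam]
    for m in range(n):
        for l in range(n):
            em = np.zeros(n); em[m] = h
            el = np.zeros(n); el[l] = h
            d2 = (cart(x + em + el) - cart(x + em - el)
                  - cart(x - em + el) + cart(x - em - el)) / (4 * h * h)
            G[:, m, l] = Jinv @ d2
    return G

r, phi = 2.0, 0.7
G = christoffel(np.array([r, phi]))
print(np.isclose(G[0, 1, 1], -r, atol=1e-4))       # Gamma^r_{phi phi} = -r
print(np.isclose(G[1, 0, 1], 1.0 / r, atol=1e-4))  # Gamma^phi_{r phi} = 1/r
```

<p>Even though the plane is flat, the connection coefficients are nonzero in polar coordinates — they track the changing basis, not curvature.</p>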
<p>I’ve said multiple times now that the connection coefficients represent the non-tensorial part so are they actually tensors? It turns out they are not. Let’s see why. First, let’s start with the above definition of the covariant derivative acting on a vector $V$.</p>
\[\begin{align*}
\nabla_\mu V^\nu &= \p_\mu V^\nu + \Gamma_{\mu\lambda}^\nu V^{\lambda}\\
\nabla_{\mu'} V^{\nu'} &= \p_{\mu'} V^{\nu'} + \Gamma_{\mu'\lambda'}^{\nu'} V^{\lambda'}
\end{align*}\]
<p>Now we’re going to simply demand that the covariant derivative transform like a tensor.</p>
\[\nabla_{\mu'} V^{\nu'} = \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\nabla_\mu V^\nu\]
<p>Since we’re inventing the covariant derivative for the sole purpose of being a tensorial operator on a manifold, demanding this constraint is a reasonable thing to do. Now we need to expand this equation to write the primed connection coefficients in terms of the unprimed ones. To start, let’s just consider the left-hand side and transform what we can transform from the primed to the unprimed coordinates.</p>
\[\begin{align*}
\nabla_{\mu'} V^{\nu'} &= \p_{\mu'} V^{\nu'} + \Gamma_{\mu'\lambda'}^{\nu'} V^{\lambda'}\\
&=\frac{\p x^\mu}{\p x^{\mu'}}\p_\mu\Big(\frac{\p x^{\nu'}}{\p x^\nu} V^\nu \Big) + \Gamma_{\mu'\lambda'}^{\nu'} \frac{\p x^{\lambda'}}{\p x^{\lambda}}V^{\lambda}\\
&=\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^{\nu}}\p_\mu V^\nu + \frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda
\end{align*}\]
<p>Just like we figured out the other tensor transformation rules, let’s expand the primed coordinates in terms of the unprimed ones using coordinate transforms. For the time being, let’s leave the connection coefficients untransformed since we don’t yet know how to transform them. Taking the above equation and adding back the right-hand side:</p>
\[\require{cancel}
\begin{align*}
\nabla_{\mu'} V^{\nu'} &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\nabla_\mu V^\nu\\
\cancel{\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^{\nu}}\p_\mu V^\nu} + \frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}(\cancel{\p_\mu V^\nu} + \Gamma_{\mu\lambda}^\nu V^{\lambda})\\
\frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu V^{\lambda}
\end{align*}\]
<p>We want to remove $V$ since it was arbitrary from the start, but we can’t since the indices don’t match up. We can make them match by relabeling $\nu$ to $\lambda$; this is completely legal since $\nu$ in $V^\nu$ and $\lambda$ in $V^\lambda$ are both dummy indices that we can relabel to anything convenient so let’s relabel everything to be $\lambda$ and get rid of $V$ entirely (and move the primed connection coefficients to one side of the equation and use second-order derivatives).</p>
\[\begin{align*}
\frac{\p x^\mu}{\p x^{\mu'}}V^\lambda\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu V^{\lambda}\\
\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}} &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu\\
\frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}}\\
\frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}
\end{align*}\]
<p>We’re almost done isolating the primed coordinates in terms of the unprimed coordinates, but we need to get rid of the leading $\frac{\p x^{\lambda’}}{\p x^\lambda}$ on the left-hand side. A convenient strategy for removing terms of this form is to exploit the property of the Kronecker delta: $\frac{\p x^{\lambda}}{\p x^{\rho’}}\frac{\p x^{\lambda’}}{\p x^{\lambda}}=\delta_{\rho’}^{\lambda’}$. So we can multiply both sides by $\frac{\p x^{\lambda}}{\p x^{\rho’}}$ and get a Kronecker delta on the left-hand side that we can replace by swapping indices:</p>
\[\begin{align*}
\frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\
\delta_{\rho'}^{\lambda'} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\
\Gamma_{\mu'\rho'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\
\end{align*}\]
<p>Now we can finally relabel $\rho’$ to $\lambda’$ to be more consistent with the original notation. This is legal to do since $\rho’$ is a free index that appears on both sides of the equation, so we can rename it consistently.</p>
\[\Gamma_{\mu'\lambda'}^{\nu'} = \underbrace{\frac{\p x^{\lambda}}{\p x^{\lambda'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu}_{\text{tensorial-like}} - \underbrace{\frac{\p x^{\lambda}}{\p x^{\lambda'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}}_{\text{non-tensorial-like}}\]
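<p>To see the non-tensorial term in action, here’s a quick numeric sanity check (a sketch of my own, not part of the derivation; all function names are made up). On the flat plane, the Cartesian connection coefficients all vanish, so if we transform to polar coordinates, the primed connection comes <em>entirely</em> from the second-derivative term. Evaluating that term with finite differences should reproduce the standard polar-coordinate coefficients $\Gamma_{\phi\phi}^r=-r$ and $\Gamma_{r\phi}^\phi=1/r$.</p>

```python
import math

# The Cartesian connection on the flat plane vanishes, so under a change to polar
# coordinates the entire primed connection comes from the non-tensorial term:
#   Gamma'^{nu'}_{mu' lam'} = -(dx^lam/dx^{lam'}) (dx^mu/dx^{mu'}) d^2 x^{nu'}/(dx^mu dx^lam)

def to_polar(x, y):
    return (math.hypot(x, y), math.atan2(y, x))

def jacobian(r, phi):
    # J[mu][mup] = dx^mu/dx^{mu'}: rows are Cartesian (x, y), columns are polar (r, phi)
    return [[math.cos(phi), -r * math.sin(phi)],
            [math.sin(phi),  r * math.cos(phi)]]

def hessian(nu, x, y, h=1e-4):
    # H[m][l] = d^2 x^{nu'}/(dx^m dx^l) by central differences (nu: 0 = r, 1 = phi)
    def f(dx, dy):
        return to_polar(x + dx, y + dy)[nu]
    steps = [(h, 0.0), (0.0, h)]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for m, (mx, my) in enumerate(steps):
        for l, (lx, ly) in enumerate(steps):
            H[m][l] = (f(mx + lx, my + ly) - f(mx - lx, my - ly)
                       - f(-mx + lx, -my + ly) + f(-mx - lx, -my - ly)) / (4 * h * h)
    return H

def gamma_polar(r, phi):
    # Apply the transformation law with the tensorial-like part dropped (Gamma_cartesian = 0)
    x, y = r * math.cos(phi), r * math.sin(phi)
    J = jacobian(r, phi)
    return [[[-sum(J[l][lap] * J[m][mup] * hessian(nu, x, y)[m][l]
                   for m in range(2) for l in range(2))
              for lap in range(2)] for mup in range(2)] for nu in range(2)]

G = gamma_polar(2.0, 0.7)   # G[nu'][mu'][lambda']
# Expect the standard polar values: Gamma^r_{phi phi} = -r = -2, Gamma^phi_{r phi} = 1/r = 0.5
```

<p>The takeaway: even in a flat space, curved coordinates pick up nonzero connection coefficients purely from the inhomogeneous term in the transformation law.</p>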
<p>From this equation, we see that the first term looks like a valid tensor transformation; however, the second term is a second-derivative quantity that ruins the ability of the connection coefficients to transform like tensors. If that second term were zero, then the connection coefficients would transform like a tensor, but, since it exists, we can say that <em>the connection coefficients do not transform like tensors</em>. In fact, we can even say that the connection coefficients are <em>intentionally</em> non-tensorial: their job is to cancel the non-tensorial part of the partial derivative that we saw earlier. Because they’re non-tensorial, we can’t raise or lower indices on the connection coefficients with the metric tensor, but it also means we can be more haphazard with the index placement and leave one upper and two lower indices 😉</p>
<p>So far, we’ve shown the action of the covariant derivative on vectors, but what about its action on covectors? If we can figure out how to apply it to both vectors and covectors, we can generalize its action on arbitrary tensors. Similar to what we did with vectors, we can simply demand that the result of the covariant derivative transforms like a tensor.</p>
\[\nabla_\mu\omega_\nu = \p_\mu\omega_\nu + \Theta_{\mu\nu}^\lambda\omega_\lambda\]
<p>We’re using $\Theta$ because, at this point in time, we have no reason to believe that $\Theta$ and $\Gamma$ are related. Spoiler alert: they are! To apply the covariant derivative to covectors, we need to impose/demand two more constraints:</p>
<ol>
<li>It commutes with contractions: $\nabla_\mu (T_{\nu\lambda}^{\lambda})=(\nabla T)_{\mu\nu\lambda}^\lambda$</li>
<li>It reduces to the partial derivative on scalar (functions) $\phi$: $\nabla_\mu \phi = \p_\mu\phi$</li>
</ol>
<p>Like last time, we can apply a covector to a vector to get a scalar.</p>
\[\begin{align*}
\nabla_\mu(\omega_\lambda V^\lambda) &= (\nabla_\mu\omega_\lambda)V^\lambda + \omega_\lambda(\nabla_\mu V^\lambda)\\
&= (\p_\mu\omega_\lambda + \Theta_{\mu\lambda}^\sigma\omega_\sigma)V^\lambda + \omega_\lambda(\p_\mu V^\lambda+\Gamma_{\mu\rho}^\lambda V^\rho)\\
&= \p_\mu\omega_\lambda V^\lambda + \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \omega_\lambda\p_\mu V^\lambda+ \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho\\
\end{align*}\]
<p>From the second constraint on the covariant derivative, we know that the left-hand side of the above equation reduces to the partial derivative acting on a scalar.</p>
\[\begin{align*}
\nabla_\mu(\omega_\lambda V^\lambda) &= \p_\mu(\omega_\lambda V^\lambda)\\
&= \p_\mu\omega_\lambda V^\lambda + \omega_\lambda\p_\mu V^\lambda
\end{align*}\]
<p>Now let’s set both sides of the equation equal to each other to cancel out terms (and isolate $\Theta$).</p>
\[\begin{align*}
\cancel{\p_\mu\omega_\lambda V^\lambda} + \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \bcancel{\omega_\lambda\p_\mu V^\lambda} + \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho &= \cancel{\p_\mu\omega_\lambda V^\lambda} + \bcancel{\omega_\lambda\p_\mu V^\lambda}\\
\Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho &= 0\\
\Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda &= -\omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho\\
\end{align*}\]
<p>(I’ve used two different kinds of slashes to note which of the like terms cancel.) To relate $\Theta$ and $\Gamma$, we need to get rid of $\omega$ and $V$. We can relabel them on the right-hand side by mapping $\lambda$ to $\sigma$ and $\rho$ to $\lambda$.</p>
\[\Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda = -\omega_\sigma\Gamma_{\mu\lambda}^\sigma V^\lambda\\\]
<p>Now we can remove $\omega$ and $V$.</p>
\[\Theta_{\mu\lambda}^\sigma = -\Gamma_{\mu\lambda}^\sigma\\\]
<p>So $\Theta$ and $\Gamma$ are related by a negation! We can make that substitution in the equation that applies the covariant derivative to covectors.</p>
\[\nabla_\mu\omega_\nu \equiv \p_\mu\omega_\nu - \Gamma_{\mu\nu}^\lambda\omega_\lambda\]
<p>Take a second to compare the indices on the action of the covariant derivative on vectors versus covectors. For vectors, we have a positive connection coefficient whose second lower index becomes a dummy index across the vector’s index. For covectors, we have a negative connection coefficient whose only upper index becomes a dummy index across the covector’s index. With this observation, we can generalize to arbitrary tensors.</p>
\[\begin{align*}
\nabla_\lambda T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} &= \p_\lambda T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k}\\
&+ \Gamma_{\lambda\sigma}^{\mu_1}T_{\nu_1\cdots\nu_l}^{\sigma\mu_2\cdots\mu_k}+\Gamma_{\lambda\sigma}^{\mu_2}T_{\nu_1\cdots\nu_l}^{\mu_1\sigma\cdots\mu_k}+\cdots+\Gamma_{\lambda\sigma}^{\mu_k}T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_{k-1}\sigma}\\
&- \Gamma_{\lambda\nu_1}^{\sigma}T_{\sigma\nu_2\cdots\nu_l}^{\mu_1\cdots\mu_k}-\Gamma_{\lambda\nu_2}^{\sigma}T_{\nu_1\sigma\cdots\nu_l}^{\mu_1\cdots\mu_k}-\cdots-\Gamma_{\lambda\nu_l}^{\sigma}T_{\nu_1\cdots\nu_{l-1}\sigma}^{\mu_1\cdots\mu_{k}}\\
\end{align*}\]
<p>There’s a pattern here: one positive $\Gamma$ term for each upper index and one negative $\Gamma$ term for each lower index. Take a second to understand the pattern since it’ll be useful later.</p>
<p>To quickly recap, we’ve successfully defined the covariant derivative on arbitrary tensors. However, in each definition, we write the covariant derivative in terms of the connection coefficients which, as a consequence of their non-tensorial-ness, are coordinate-dependent. We could use many different coordinates, which means we could have many different definitions of the covariant derivative! This is a fundamental characteristic of the covariant derivative and the connection coefficients, but we can define a <em>unique</em> connection if we impose some additional constraints: <strong>torsion-free</strong> and <strong>metric compatibility</strong>.</p>
<p>For a connection to be <strong>torsion-free</strong>, it must be symmetric in its lower indices.</p>
\[\Gamma_{\mu\nu}^\lambda=\Gamma_{\nu\mu}^\lambda\]
<p>For a general connection, nothing stops us from permuting the lower indices: given a connection $\Gamma_{\mu\nu}^\lambda$, we can immediately define another connection $\Gamma_{\nu\mu}^\lambda$. We define the <strong>torsion tensor</strong> as their difference: $T_{\mu\nu}^\lambda = \Gamma_{\mu\nu}^\lambda - \Gamma_{\nu\mu}^\lambda = 2\Gamma_{[\mu\nu]}^\lambda$, and a torsion-free connection is one where this difference vanishes. Interestingly, the torsion tensor is a valid tensor, even though it is composed of non-tensorial connections. To see this, suppose we had two connections $\nabla$ and $\tilde{\nabla}$. Let’s apply both to an arbitrary vector $V^\lambda$ and take the difference.</p>
\[\begin{align*}
\nabla_\mu V^\lambda-\tilde{\nabla}_\mu V^\lambda &= \cancel{\p_\mu V^\lambda} + \Gamma_{\mu\nu}^\lambda V^\nu - \cancel{\p_\mu V^\lambda} - \tilde{\Gamma}_{\mu\nu}^\lambda V^\nu\\
&= (\Gamma_{\mu\nu}^\lambda - \tilde{\Gamma}_{\mu\nu}^\lambda) V^\nu\\
&= S_{\mu\nu}^\lambda V^\nu\\
\end{align*}\]
<p>Since the left-hand side is a tensor, the right-hand side must also be a tensor, which means $S_{\mu\nu}^\lambda$, the difference of the two connections, is also a tensor. Torsion is the special case of $S_{\mu\nu}^\lambda$ where $\tilde{\Gamma}_{\mu\nu}^\lambda=\Gamma_{\nu\mu}^\lambda$ is the same connection with its lower indices swapped.</p>
<p><img src="/images/manifolds-part-3/torsion-geometry.png" alt="Geometry of Torsion" title="Geometry of Torsion" /></p>
<p><small>Geometrically, we can think of torsion as the “twisting” of reference frames or a “corkscrew” of reference frames along a path. We’ll get a slightly better geometric interpretation after we discuss parallel transport soon.</small></p>
<p>The second constraint we enforce is <strong>metric compatibility</strong>, which says $\nabla_\rho g_{\mu\nu}=0$. In words, the metric is covariantly constant: at each individual point, we can treat the metric components as flat/Euclidean. We need this property so that the covariant derivative commutes with the metric tensor when raising and lowering indices: $g_{\mu\lambda}\nabla_\rho V^\lambda = \nabla_\rho V_\mu$. Like with the constraints on the covariant derivative’s action on covectors, there’s no way to prove these two constraints; we simply demand that they be true.</p>
<p><img src="/images/manifolds-part-3/tangent-space.png" alt="Tangent space" title="Tangent space" /></p>
<p><small>Metric compatibility means that components of the metric are constant at a point. Geometrically, this means, at a point, we can define a flat tangent space. Or, to be more precise, we can write the metric components in a way that they are constant.</small></p>
<p>Now that we have those two constraints, we can construct a unique connection from the metric. Let’s first apply the covariant derivative to the metric tensor and set it to zero (using metric compatibility). With that one equation, we can permute the indices to get two more equations.</p>
\[\begin{align*}
\nabla_\rho g_{\mu\nu} &= \p_\rho g_{\mu\nu} - \Gamma_{\rho\mu}^\lambda g_{\lambda\nu} - \Gamma_{\rho\nu}^\lambda g_{\mu\lambda} &= 0\\
\nabla_\mu g_{\nu\rho} &= \p_\mu g_{\nu\rho} - \Gamma_{\mu\nu}^\lambda g_{\lambda\rho} - \Gamma_{\mu\rho}^\lambda g_{\nu\lambda} &= 0\\
\nabla_\nu g_{\rho\mu} &= \p_\nu g_{\rho\mu} - \Gamma_{\nu\rho}^\lambda g_{\lambda\mu} - \Gamma_{\nu\mu}^\lambda g_{\rho\lambda} &= 0\\
\end{align*}\]
<p>Now we take the first equation and subtract the second and third equations. Then we can use the torsion-free property to cancel multiple terms, i.e., any connection coefficients with permuted lower indices.</p>
\[\require{cancel}
\begin{align*}
\nabla_\rho g_{\mu\nu} &= \p_\rho g_{\mu\nu} - \cancel{\Gamma_{\rho\mu}^\lambda g_{\lambda\nu}} - \bcancel{\Gamma_{\rho\nu}^\lambda g_{\mu\lambda}} &= 0\\
-\nabla_\mu g_{\nu\rho} &= -\p_\mu g_{\nu\rho} + \Gamma_{\mu\nu}^\lambda g_{\lambda\rho} + \cancel{\Gamma_{\mu\rho}^\lambda g_{\nu\lambda}} &= 0\\
-\nabla_\nu g_{\rho\mu} &= -\p_\nu g_{\rho\mu} + \bcancel{\Gamma_{\nu\rho}^\lambda g_{\lambda\mu}} + \Gamma_{\nu\mu}^\lambda g_{\rho\lambda} &= 0\\
\end{align*}\]
<p>And we’re left with an equation with a single connection coefficient after permuting the indices so they match.</p>
\[\begin{align*}
\p_\rho g_{\mu\nu} - \p_\mu g_{\nu\rho} - \p_\nu g_{\rho\mu} + 2\Gamma_{\mu\nu}^\lambda g_{\lambda\rho} &= 0\\
\Gamma_{\mu\nu}^\lambda g_{\lambda\rho} &= \frac{1}{2}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\
\end{align*}\]
<p>To get rid of the extra $g_{\lambda\rho}$, we can multiply by $g^{\sigma\rho}$ and use the Kronecker delta.</p>
\[\begin{align*}
\Gamma_{\mu\nu}^\lambda g_{\lambda\rho}g^{\sigma\rho} &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\
\Gamma_{\mu\nu}^\lambda \delta_{\lambda}^{\sigma} &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\
\Gamma_{\mu\nu}^\sigma &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\
\end{align*}\]
<p>Finally we’ve written the connection coefficients in terms of the metric! This unique connection is called the <strong>Christoffel</strong>/<strong>Levi-Civita</strong>/<strong>Riemannian connection</strong>. This is the canonical connection that’s used often in general relativity and other fields so we have a “preferred” covariant derivative. It’s not necessary to use this particular connection, especially if there is another set of connection coefficients that makes the particular problem we’re studying easier, but this connection is often used because it’s convenient.</p>
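<p>The boxed formula above is easy to turn into code. Here’s a minimal numeric sketch (my own code and naming, assuming the unit 2-sphere metric $g=\mathrm{diag}(1, \sin^2\theta)$) that differentiates the metric with finite differences and assembles $\Gamma_{\mu\nu}^\sigma = \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})$:</p>

```python
import math

def metric(theta, phi):
    # Metric of the unit 2-sphere in (theta, phi) coordinates: g = diag(1, sin^2 theta)
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

def christoffel(point, g=metric, h=1e-5):
    """Gamma^sigma_{mu nu} = (1/2) g^{sigma rho} (d_mu g_{nu rho} + d_nu g_{rho mu} - d_rho g_{mu nu}),
    with the metric derivatives taken by central differences (2D case only)."""
    n = len(point)
    # dg[rho][mu][nu] = d_rho g_{mu nu}
    dg = []
    for rho in range(n):
        plus = list(point); plus[rho] += h
        minus = list(point); minus[rho] -= h
        gp, gm = g(*plus), g(*minus)
        dg.append([[(gp[m][nu] - gm[m][nu]) / (2 * h) for nu in range(n)] for m in range(n)])
    # inverse metric (hardcoded 2x2 inverse)
    G = g(*point)
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    Ginv = [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]
    Gamma = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for s in range(n):
        for mu in range(n):
            for nu in range(n):
                Gamma[s][mu][nu] = 0.5 * sum(
                    Ginv[s][rho] * (dg[mu][nu][rho] + dg[nu][rho][mu] - dg[rho][mu][nu])
                    for rho in range(n))
    return Gamma

Gamma = christoffel([1.0, 0.0])   # Gamma[sigma][mu][nu] at theta = 1
```

<p>For the sphere, the nonzero coefficients come out to $\Gamma_{\phi\phi}^\theta=-\sin\theta\cos\theta$ and $\Gamma_{\theta\phi}^\phi=\Gamma_{\phi\theta}^\phi=\cot\theta$, which we’ll reuse below.</p>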
<h1 id="parallel-transport">Parallel Transport</h1>
<p>Now that we have a clear definition of a “preferred” covariant derivative, we can do calculus on a manifold like we could in a flat space! However, we quickly run into a problem: how do we compare vectors on a manifold? With scalars, we can compare two of them at different points on a manifold, but we can’t compare two different vectors at two different points on the manifold since they would be in different tangent spaces! The vector might actually be the same in one tangent space but look different in the other tangent space (but still related by a transform).</p>
<p><img src="/images/manifolds-part-3/parallel-transport-cartesian.png" alt="Parallel transport in Cartesian coordinates" title="Parallel transport in Cartesian coordinates" /></p>
<p><small>In a Cartesian space, if we have a vector $V$ and we move it along a path, it will forever have the same magnitude and direction. Some people say that vectors (in a Cartesian space) are just displacements that you can slide around the space because the displacement is relative: it doesn’t depend on where the arrow starts/ends. However, this is not true for curved coordinates.</small></p>
<p>In a flat space, we didn’t have to be this careful since we can arbitrarily move a vector from point to point while keeping it parallel to itself. If we took a vector and drew an arbitrary path for the vector to take, at each point along the path, the vector would point in exactly the same direction with the same magnitude! A consequence of this is that the path doesn’t matter: a long path and a short path with the same endpoints will leave the vector exactly the same.</p>
<p>Since it seems to work in flat space, let’s try this idea on a manifold: take a vector in one tangent space and “transport” it to the other tangent space so that the two vectors are in the same tangent space while keeping the “transported” vector “as straight as possible”. This notion is called <strong>parallel transport</strong>. We have to say “as straight as possible” since, in a curved space, it’s not always possible to keep a vector pointed completely in the same direction with the same magnitude at each point along the path. In fact, it’s even worse than that because <em>the path we take will change the resulting vector!</em></p>
<p><img src="/images/manifolds-part-3/parallel-transport-sphere.png" alt="Parallel transport on a sphere" title="Parallel transport on a sphere" /></p>
<p><small>On a sphere, suppose we start at the equator with a vector pointing along the equator. Then we parallel transport that vector to the North Pole. Then we parallel transport it back to the equator on a different longitude. Finally, we parallel transport it along the equator back to its original position. We’ll find that it has rotated! It’s different from the original vector.</small></p>
<p>Even keeping the vector as straight as possible, the resulting vectors point in completely different directions. Unfortunately, this is a fundamental fact about manifolds that we can’t get around with a clever trick or coordinate transform! But we can try to precisely define parallel transport and what we mean by “keeping the vector as straight as possible”. Mathematically, this means we want to keep the tensor components from changing as much as possible along the curve. Suppose we have a curve $x^\mu(\lambda)$ and an arbitrary tensor $T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k}$. Then keeping the components the same just means the derivative of the tensor along the path must vanish.</p>
\[\frac{\d}{\d\lambda}T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = \frac{\d x^\sigma}{\d\lambda}\p_\sigma T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = 0\]
<p>However this isn’t quite tensorial because we have a partial derivative. We can make this tensorial by replacing the partial derivative with a covariant derivative (this is sometimes called the “comma goes to semicolon” rule if you denote partials with commas and covariant derivatives with semicolons, but I hate that notation), and we get the <strong>equation of parallel transport</strong>.</p>
\[\frac{\d x^\sigma}{\d\lambda}\nabla_\sigma T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = 0\]
<p>For convenience, we can define a parallel transport operator/directional covariant derivative using the covariant derivative and a tangent vector.</p>
\[\frac{\D}{\d\lambda} = \frac{\d x^\sigma}{\d\lambda}\nabla_\sigma\]
<p>Going back to our original inquiry, let’s see what this equation looks like for a vector $V^\mu$.</p>
\[\begin{align*}
\frac{\d x^\sigma}{\d\lambda}\nabla_\sigma V^\mu &= 0\\
\frac{\d x^\sigma}{\d\lambda}(\p_\sigma V^\mu + \Gamma_{\sigma\rho}^\mu V^\rho) &= 0\\
\frac{\d x^\sigma}{\d\lambda}\Big(\frac{\p}{\p x^\sigma} V^\mu + \Gamma_{\sigma\rho}^\mu V^\rho\Big) &= 0\\
\frac{\d}{\d\lambda} V^\mu + \Gamma_{\sigma\rho}^\mu \frac{\d x^\sigma}{\d\lambda} V^\rho &= 0\\
\end{align*}\]
<p>Note that this is a set of 1st order differential equations, one for each $\mu$ index. Also note that since the parallel transport equation depends on coordinate-dependent things like $\Gamma$ and $\frac{\d x^\sigma}{\d\lambda}$, the equation itself also depends on coordinates.</p>
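<p>Since the parallel transport equation is just a system of 1st order ODEs, we can integrate it numerically. Here’s a sketch (my own code, assuming the unit-sphere coefficients $\Gamma_{\phi\phi}^\theta=-\sin\theta\cos\theta$ and $\Gamma_{\phi\theta}^\phi=\cot\theta$) that transports a vector once around a circle of constant $\theta$, reproducing the sphere picture from earlier: the vector comes back rotated by $2\pi\cos\theta_0$.</p>

```python
import math

def transport_around_latitude(theta0, v0, steps=20000):
    """Parallel transport v = (v^theta, v^phi) once around the circle theta = theta0
    on the unit sphere, integrating dV^mu/dphi = -Gamma^mu_{phi rho} V^rho with RK4."""
    def rhs(v):
        vth, vph = v
        # Gamma^theta_{phi phi} = -sin(t)cos(t), Gamma^phi_{phi theta} = cot(t)
        return (math.sin(theta0) * math.cos(theta0) * vph,
                -vth / math.tan(theta0))
    h = 2 * math.pi / steps
    v = tuple(v0)
    for _ in range(steps):
        k1 = rhs(v)
        k2 = rhs((v[0] + h/2*k1[0], v[1] + h/2*k1[1]))
        k3 = rhs((v[0] + h/2*k2[0], v[1] + h/2*k2[1]))
        k4 = rhs((v[0] + h*k3[0], v[1] + h*k3[1]))
        v = (v[0] + h/6*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0]),
             v[1] + h/6*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1]))
    return v

theta0 = math.pi / 3          # colatitude 60 degrees, so cos(theta0) = 1/2
v = transport_around_latitude(theta0, (1.0, 0.0))
# The loop rotates the vector by 2*pi*cos(theta0) = pi: it comes back pointing the opposite way
```

<p>As a bonus, you can check that $g_{\mu\nu}V^\mu V^\nu = (V^\theta)^2 + \sin^2\theta_0\,(V^\phi)^2$ stays constant along the way, which previews the metric-compatibility result below.</p>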
<p>One immediate practical application of the parallel transport equation is to see what happens when we parallel transport the metric $g_{\mu\nu}$.</p>
\[\require{cancel}
\frac{\D}{\d\lambda}g_{\mu\nu} = \frac{\d x^\sigma}{\d\lambda}\cancelto{0}{\nabla_\sigma g_{\mu\nu}} = 0\]
<p>We can see that the metric is always parallel transported because of metric compatibility! This means that the value of inner products is preserved as we parallel transport along a curve.</p>
<p>Now suppose the metric acts on two vectors $V^\mu$ and $W^\nu$ that are also parallel transported along the same curve.</p>
\[\require{cancel}
\begin{align*}
\frac{\D}{\d\lambda}(g_{\mu\nu}V^\mu W^\nu) &= 0\\
\cancelto{0}{(\frac{\D}{\d\lambda}g_{\mu\nu})}V^\mu W^\nu + g_{\mu\nu}\cancelto{0}{(\frac{\D}{\d\lambda}V^\mu)} W^\nu + g_{\mu\nu}V^\mu\cancelto{0}{(\frac{\D}{\d\lambda}W^\nu)} &= 0
\end{align*}\]
<p>The first term is cancelled because of metric compatibility and the second and third terms are also cancelled because we defined $V^\mu$ and $W^\nu$ to be parallel transported. This means that norms, angles, and orthogonality are also preserved!</p>
<p>Now that we’ve discussed parallel transport, let me circle back to a few points and supplement the lines and lines of equations with actual geometrical pictures. Let’s start with the geometrical picture of the covariant derivative. Recall that it generalizes the partial derivative by adding a correction for the basis that changes from point to point. But, if a vector is parallel transported, then by the parallel transport equation, its covariant derivative along the path is zero. So we can think of the covariant derivative as the difference between parallel transporting a vector along a path from one point to another and simply evaluating the vector at that endpoint on the manifold (see the first image in this post).</p>
<p>Additionally, with parallel transport, we can also get a slightly better geometric picture of torsion.</p>
<p><img src="/images/manifolds-part-3/torsion-algebra.png" alt="Algebraic picture of torsion" title="Algebraic picture of torsion" /></p>
<p><small>Suppose we have two vector fields $A^\mu$ and $B^\nu$. If we parallel transport $A^\mu$ in the direction of $B^\nu$ and $B^\nu$ in the direction of $A^\mu$, then the torsion tensor $S_{\mu\nu}^\lambda$ measures the failure of that loop to close. With a torsion-free connection, the parallel-transported vectors form a closed parallelogram.</small></p>
<h1 id="geodesics">Geodesics</h1>
<p>One last crucial topic we’ll need to discuss before getting into curvature is a <strong>geodesic</strong>. To understand the intuition, remember that parallel transport changes a vector along a particular path from point to point. But there are an infinite number of paths between any two points so there doesn’t immediately seem to be a way to have a “preferred” path between two points that multiple people could compare. One candidate is picking the “shortest possible” path between the points. In a flat space, we knew how to do this: pick a straight line! But on a curved manifold where the coordinates change as well, there isn’t always a “straight” path.</p>
<p>One way to do this is to find a path $x^\mu(\lambda)$ that minimizes the total arc length/path length between any two points. But this requires us to know and use calculus of variations, so that’s complicated! A slightly less formal, but more intuitive, way to characterize such a path is in terms of parallel transport. One observation is that, in a flat space, a straight line keeps its tangent vector pointing in the same direction along the line. In other words, a straight line parallel transports its own tangent vector. This intuition carries over to a curved space. Suppose we have a curve $x^\mu(\lambda)$ and its tangent vector $\frac{\d x^\mu}{\d\lambda}$. Let’s parallel transport the tangent vector along the curve.</p>
\[\begin{align*}
\frac{\D}{\d\lambda}\Big(\frac{\d x^\mu}{\d\lambda}\Big) &= 0\\
\frac{\d x^\sigma}{\d\lambda}\nabla_\sigma\frac{\d x^\mu}{\d\lambda} &= 0\\
\frac{\d x^\sigma}{\d\lambda}\Big(\frac{\p}{\p x^\sigma}\frac{\d x^\mu}{\d\lambda} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\rho}{\d\lambda}\Big) &= 0\\
\frac{\d}{\d\lambda}\frac{\d x^\mu}{\d\lambda} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} &= 0\\
\frac{\d^2 x^\mu}{\d\lambda^2} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} &= 0\\
\end{align*}\]
<p>The final result is the <strong>geodesic equation</strong>, a set of 2nd order differential equations, one for each coordinate/index $\mu$. Notice that in a Cartesian space, all $\Gamma=0$, so we’re left with $\frac{\d^2 x^\mu}{\d\lambda^2} = 0$. The solution to this differential equation is a line! (If you don’t know any differential equations, you can convince yourself of this since the only functions whose second derivative vanishes everywhere are lines!) Even without talking about curvature, geodesics are incredibly important: in general relativity, test particles in a gravitational field move along geodesics, so they’re critical for understanding the consequences of different gravities.</p>
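<p>To see the geodesic equation in action, here’s a small numeric sketch (my own code) in polar coordinates on the flat plane, where $\Gamma_{\phi\phi}^r=-r$ and $\Gamma_{r\phi}^\phi=\Gamma_{\phi r}^\phi=1/r$. Even though the coordinates are curved, the geodesic should trace out a straight line in the underlying Cartesian coordinates:</p>

```python
import math

def polar_geodesic(r0, phi0, dr0, dphi0, lam_max=1.0, steps=10000):
    """Integrate the geodesic equation in polar coordinates on the flat plane with RK4:
       r'' = r (phi')^2,   phi'' = -(2/r) r' phi'
    which is the geodesic equation with Gamma^r_{phi phi} = -r, Gamma^phi_{r phi} = 1/r."""
    h = lam_max / steps
    state = [r0, phi0, dr0, dphi0]
    def deriv(s):
        r, phi, dr, dphi = s
        return [dr, dphi, r * dphi**2, -2.0 * dr * dphi / r]
    for _ in range(steps):
        k1 = deriv(state)
        k2 = deriv([state[i] + h/2*k1[i] for i in range(4)])
        k3 = deriv([state[i] + h/2*k2[i] for i in range(4)])
        k4 = deriv([state[i] + h*k3[i] for i in range(4)])
        state = [state[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(4)]
    return state

# Start at (r, phi) = (1, 0), i.e. Cartesian (1, 0), with initial velocity
# dr/dlambda = 0, dphi/dlambda = 1, i.e. Cartesian velocity (0, 1).
r, phi, _, _ = polar_geodesic(1.0, 0.0, 0.0, 1.0)
x, y = r * math.cos(phi), r * math.sin(phi)
# The geodesic is the straight line x = 1, so after lambda = 1 we should be near (1, 1)
```

<p>The takeaway: solving the geodesic equation in curved coordinates on a flat space recovers plain straight lines, as it should.</p>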
<p>Solving the geodesic equation can seem a little complicated so there’s an alternative way to think about geodesics that’s a bit more practical. Imagine we’re at an arbitrary point $p$ on a manifold, and we have a tangent vector $V^\mu$ to some curve/direction we want to travel in. We can construct a unique geodesic in a small neighborhood of $p$. Suppose our geodesic is $\gamma^\mu(\lambda)$. From the above statements, we immediately have two constraints to the geodesic: $\gamma^\mu(\lambda=0)=p$ and $\frac{\d\gamma^\mu}{\d\lambda}(\lambda=0)=V^\mu$. The former says that the geodesic “starts” at $p$ and the second statement says that the tangent vector at $\lambda=0$ on the geodesic is $V^\mu$. The <strong>exponential map</strong> is the map we use to get the geodesic. It is defined as $\exp_p: T_p\to M, V^\mu\mapsto\gamma^\mu(\lambda=1)$ such that $\gamma^\mu$ solves the geodesic equation.</p>
<p><img src="/images/manifolds-part-3/exponential-map.png" alt="The exponential map" title="The exponential map" /></p>
<p><small>Given a point $p$ and a direction $V^\mu$ at $p$, it’s always possible to specify a unique geodesic $\gamma^\mu$ “in the neighborhood” of the point. If we stray too far from the point, this geodesic fails to be unique because they could cross over each other.</small></p>
<p>Since the geodesic is on the manifold, if we follow $\gamma^\mu$, then there’s some other point $q$ also on the manifold such that $\gamma^\mu(\lambda=1)=q$. After this process, we’re at another point on the manifold, reached by travelling along the geodesic. With this technique, we can travel all across the manifold, from tangent space to tangent space, along shortest paths. An important thing to note is that this geodesic is only unique and invertible in a “small enough” neighborhood around $p$. Travel too far away, and we no longer have a unique geodesic since geodesics might cross, so some other geodesic could end up at $q$ too.</p>
<h1 id="curvature">Curvature</h1>
<p>With all of those prerequisites addressed, we can finally discuss curvature. In a flat space, when we talk about curvature, we often mean the curvature of a 2D/3D curve or a parameterized surface. These are forms of <strong>extrinsic</strong> curvature since they depend on the embedding space. However, remember that a manifold is completely independent of the space it’s embedded in. As an alternative to extrinsic curvature, we also have <strong>intrinsic</strong> curvature. Intuitively, imagine you were a little bug walking on top of the manifold. Could you tell if the space was curved like the Earth or flat? As it turns out, on a manifold with arbitrary coordinates, it’s much harder to tell whether the <em>space is curved</em> or we just chose <em>curved coordinates</em>. As an example, consider a flat plane. If we use Cartesian coordinates, we know the space is flat like $\R^2$. However, we could also use polar coordinates on the plane, and then it’s more difficult to tell that the space is flat since polar coordinates are curved and have nonzero connection coefficients!</p>
<p><img src="/images/manifolds-part-3/curvature-flat-space.png" alt="Polar coordinates" title="Polar coordinates" /></p>
<p><small>In Cartesian coordinates, it’s pretty clear that the components of the basis don’t change from point to point. In polar coordinates, they do; but polar coordinates are just curved coordinates on a flat space! We need a way to distinguish an intrinsically curved space from a mere choice of curved coordinates on a flat one.</small></p>
<p>Interestingly, the inverse can also be true: manifolds that appear to have curvature can actually be intrinsically flat! Consider a torus. At first glance, it appears to be a curved space, but that’s only extrinsically. As it turns out, we can show that the torus is actually intrinsically flat, specifically, it is the same as a square with the sides identified.</p>
<p><img src="/images/manifolds-part-3/curvature-torus.png" alt="Curvature of a torus" title="Curvature of a torus" /></p>
<p><small>We can flatten a torus by cutting the torus into a cylinder and then cutting the cylinder in half and unrolling it. The sides are identified so the space “repeats”. On the other hand, there’s no way to cut a sphere into a flat space (in a way that preserves distances and angles).</small></p>
<p>So if we were a little bug on a torus, we would think our world was flat! We could construct a map of a torus on a piece of paper that perfectly preserves angles and distances. To complete the list of examples, a sphere, e.g., the surface of the Earth, is both extrinsically <em>and intrinsically</em> curved! We’ll see exactly how to prove this shortly.</p>
<p>So far, I’ve described curvature intuitively, but we need some equations to let us definitively differentiate a flat from a curved space. The key is to recall what we said about parallel transporting a vector from a start point to an end point: the final result depends on the path! Taking that same notion, what would happen if we parallel transported a vector in a little infinitesimal loop? In a flat space, either Cartesian or polar, the vector should be pointing in the same direction! But what if a space is not flat? Remember what happened for the sphere? When we parallel transported a vector in a loop, it wasn’t pointed in the same direction! Let’s take the same concept, but do it at a much smaller/infinitesimal scale so we can define a curvature at each point in space.</p>
<p>“Parallel transport around a little loop” is a bit too informal, so let’s use some equations to make this more concrete. Some texts take this too literally, but I think a better interpretation is to consider two vectors $A^\mu$ and $B^\nu$ and an arbitrary vector $V^\rho$ that we parallel transport along those two vectors. The mathematical way to represent this is with the commutator of the covariant derivative:</p>
\[[\nabla_\mu, \nabla_\nu]V^\rho = \nabla_\mu \nabla_\nu V^\rho - \nabla_\nu \nabla_\mu V^\rho\]
<p>Intuitively, this is like transporting the vector to the far side of the loop and then back to the start again. The computation itself is fairly straightforward. Let’s first start by applying the outermost covariant derivative to the first term.</p>
\[\nabla_\mu \nabla_\nu V^\rho - \nabla_\nu \nabla_\mu V^\rho = \p_\mu(\nabla_\nu V^\rho) - \Gamma_{\mu\nu}^\lambda\nabla_\lambda V^\rho + \Gamma_{\mu\sigma}^\rho\nabla_\nu V^\sigma - (\mu\leftrightarrow\nu)\\\]
<p>Recall that we’re applying $\nabla_\mu$ on the tensor $\nabla_\nu V^\rho$, which has one upper and one lower index so we need two connection coefficients. (You can think of this tensor as $(\nabla V)_\nu^\rho$ if that helps). As it turns out, the expansion of the second term is identical to the first except with the $\mu$s and $\nu$s swapped, which is denoted as $(\mu\leftrightarrow\nu)$. Don’t worry about those for now; we’ll expand them later. Now let’s expand the inner covariant derivative.</p>
\[\p_\mu(\p_\nu V^\rho + \Gamma_{\nu\sigma}^\rho V^\sigma) - \Gamma_{\mu\nu}^\lambda(\p_\lambda V^\rho + \Gamma_{\lambda\sigma}^\rho V^\sigma) + \Gamma_{\mu\sigma}^\rho(\p_\nu V^\sigma + \Gamma_{\nu\lambda}^\sigma V^\lambda) - (\mu\leftrightarrow\nu)\\\]
<p>Now let’s multiply everything out, but be careful about the partial $\p_\mu$.</p>
\[\p_\mu\p_\nu V^\rho + \p_\mu(\Gamma_{\nu\sigma}^\rho V^\sigma) - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda - (\mu\leftrightarrow\nu)\\\]
<p>For the $\p_\mu(\Gamma_{\nu\sigma}^\rho V^\sigma)$ term, we have to expand it using the product rule!</p>
\[\p_\mu\p_\nu V^\rho + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda - (\mu\leftrightarrow\nu)\\\]
<p>Even though this equation already has a lot of terms, we’re ready to add in the other terms and see what cancels!</p>
\[\begin{align*}
\p_\mu\p_\nu V^\rho + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda\\
-\p_\nu\p_\mu V^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\nu\sigma}^\rho\Gamma_{\mu\lambda}^\sigma V^\lambda\\
\end{align*}\]
<p>Since partial derivatives commute, the $\p\p V$ terms cancel, and matching up the remaining like terms lets us get rid of quite a few more!</p>
\[\require{cancel}
\begin{align*}
\cancel{\p_\mu\p_\nu V^\rho} + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \bcancel{\Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma} - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \xcancel{\Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma} + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda\\
-\cancel{\p_\nu\p_\mu V^\rho} - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \xcancel{\Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma} + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \bcancel{\Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma} - \Gamma_{\nu\sigma}^\rho\Gamma_{\mu\lambda}^\sigma V^\lambda\\
\end{align*}\]
<p>Nearly half of our terms cancel! Let’s examine the surviving terms. I’ve swapped dummy indices $\lambda\leftrightarrow\sigma$ for the last terms of each line so that the notation is more consistent.</p>
\[\begin{align*}
\p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma\\
- \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma\\
\end{align*}\]
<p>There are a few interesting things to notice, especially with the middle two terms of each line. They can each be condensed back into a covariant derivative, but with a connection coefficient as a coefficient on the front.</p>
\[\begin{align*}
\p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma - \Gamma_{\mu\nu}^\lambda(\nabla_\lambda V^\rho) + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma\\
- \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma + \Gamma_{\nu\mu}^\lambda(\nabla_\lambda V^\rho) - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma\\
\end{align*}\]
<p>Yet another condensation we can do is to look at the middle term of each line. They’re almost identical except the $\mu$ and $\nu$ are swapped! Together, they’re exactly twice the antisymmetrization over those indices!</p>
\[\p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma - 2\Gamma_{[\mu\nu]}^\lambda\nabla_\lambda V^\rho \\\]
<p>But remember that for a torsion-free connection, this term vanishes, so we’re left with only the first four terms. Since $V^\sigma$ was arbitrary, we can factor it out (and do a bit of rearranging).</p>
\[(\p_\mu\Gamma_{\nu\sigma}^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda) V^\sigma\]
<p>With some inspection, the tensor in the parentheses seems to have one upper and three lower indices. We define this as the <strong>Riemann tensor</strong>, which tells us the curvature (at a point) of a space.</p>
\[R_{\sigma\mu\nu}^\rho = \p_\mu\Gamma_{\nu\sigma}^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda\]
<p>We went through several stages of equations to get here, but remember that we were trying to see what happens if we parallel transported a vector along a little infinitesimal loop. The final result is that the parallel transported vector is linearly transformed by the Riemann tensor! To see this more clearly, let me group the indices a bit differently: $(R_\sigma^\rho)_{\mu\nu}$. The first upper and lower indices together represent a linear transform, just like a matrix linearly transforms a vector. The last two lower indices tell us in which directions we parallel transport the vector around the little loop.</p>
<p><img src="/images/manifolds-part-3/riemann-tensor.png" alt="Riemann Curvature Tensor" title="Riemann Curvature Tensor" /></p>
<p><small>Similar to torsion, suppose we have two vectors $A^\mu$ and $B^\nu$ that we parallel transport into each other to make a closed loop (we’re assuming no torsion). Then if we have a vector $V^\rho$ that we move around in that little loop, we’ll end up with $V^{\rho’}$ that’s related to the original $V^\rho$ we started with by a linear transform. That linear transform that relates the two is what we call the Riemann tensor $R_{\sigma\mu\nu}^\rho$.</small></p>
<p>There are a few more things to note about this tensor. First of all, from the derivation, we can see that it’s antisymmetric in its last two lower indices. Imagine if we went around the loop in the other way and swapped $\mu$ and $\nu$ right from the beginning. Another important property is that it really does tell us if a space is flat or not because it’s written in terms of the <em>derivatives</em> of the connection, which, canonically, is written in terms of the metric. So this is effectively looking at second derivatives of the metric, similar to how curvature in a flat space looks at second derivatives. In Cartesian coordinates, we can immediately see that $R_{\sigma\mu\nu}^\rho=0$ everywhere.</p>
<p>As it turns out, there’s a theorem that says we can find a coordinate system in which the metric components are constant if and only if the Riemann tensor vanishes everywhere. From the above examples, it’s easy to show the forward implication of that theorem, but it’s a bit more work to show the backwards implication. I think the forward implication is more commonly used so I’ll skip the backwards implication and refer you to Sean Carroll’s book on general relativity.</p>
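<p>The forward implication is also easy to poke at computationally. Below is a small sketch (I’m assuming sympy here; any computer algebra system works) that builds the Levi-Civita connection for the flat plane written in <em>polar</em> coordinates, where the metric components are not constant, and checks that every component of the Riemann tensor still vanishes using the standard coordinate formula:</p>

```python
import sympy as sp

r, t = sp.symbols('r theta', positive=True)
x = [r, t]
n = 2
# The flat plane in polar coordinates: ds^2 = dr^2 + r^2 dtheta^2.
# The metric components are not constant, but the space is still flat.
g = sp.Matrix([[1, 0], [0, r**2]])
ginv = g.inv()

# Levi-Civita connection: Gamma^l_{mu nu} = (1/2) g^{lk} (d_mu g_{nu k} + d_nu g_{k mu} - d_k g_{mu nu})
Gamma = [[[sum(sp.Rational(1, 2)*ginv[l, k]*(sp.diff(g[v, k], x[m]) + sp.diff(g[k, m], x[v]) - sp.diff(g[m, v], x[k]))
               for k in range(n)) for v in range(n)] for m in range(n)] for l in range(n)]

# R^rho_{sigma mu nu} = d_mu Gamma^rho_{nu sigma} - d_nu Gamma^rho_{mu sigma}
#                     + Gamma^rho_{mu l} Gamma^l_{nu sigma} - Gamma^rho_{nu l} Gamma^l_{mu sigma}
def riemann(rho, sig, mu, nu):
    return sp.simplify(sp.diff(Gamma[rho][nu][sig], x[mu]) - sp.diff(Gamma[rho][mu][sig], x[nu])
                       + sum(Gamma[rho][mu][l]*Gamma[l][nu][sig] - Gamma[rho][nu][l]*Gamma[l][mu][sig]
                             for l in range(n)))

# Every component vanishes, even though Gamma^r_{theta theta} = -r is non-zero.
assert all(riemann(a, b, c, d) == 0
           for a in range(n) for b in range(n) for c in range(n) for d in range(n))
```

<p>The two derivative terms and the two $\Gamma\Gamma$ terms conspire to cancel exactly, which is the computational shadow of the theorem above.</p>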
<p>In terms of components, naïvely, we might think it has $n^4$ components since there are four indices, but, with the symmetries, we actually have far fewer components. The first symmetry we already saw: antisymmetry in the last two lower indices. There are more symmetries, but they are easier to discover if we lower the single upper index.</p>
\[R_{\rho\sigma\mu\nu} = g_{\rho\lambda}R_{\sigma\mu\nu}^\lambda\]
<p>Let’s expand this out, but we’re going to use a special set of coordinates called <strong>Riemann normal coordinates</strong>. They’re a set of coordinates centered on a point such that $\partial_{\sigma}g_{\mu\nu}=0$ at that point. A consequence of this (that you can verify yourself) is that all of the connection coefficients themselves are zero there. However, this doesn’t mean the derivatives of the connection coefficients are zero so we still have to keep those.</p>
\[\require{cancel}
\begin{align*}
R_{\rho\sigma\mu\nu} &= g_{\rho\lambda}R_{\sigma\mu\nu}^\lambda\\
&= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda - \p_\nu\Gamma_{\mu\sigma}^\lambda + \cancelto{0}{\Gamma_{\mu\tau}^\lambda\Gamma_{\nu\sigma}^\tau} - \cancelto{0}{\Gamma_{\nu\tau}^\lambda\Gamma_{\mu\sigma}^\tau})\\
&= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda - \p_\nu\Gamma_{\mu\sigma}^\lambda)\\
\end{align*}\]
<p>Now we can expand the connection coefficients in terms of the metric (since we’re assuming a Levi-Civita connection):</p>
\[\begin{align*}
&= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda - \p_\nu\Gamma_{\mu\sigma}^\lambda)\\
&= g_{\rho\lambda}\Bigg(\p_\mu\Big[\frac{1}{2}g^{\lambda\tau}(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma})\Big] - \p_\nu\Big[\frac{1}{2}g^{\lambda\tau}(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Big]\Bigg)\\
&= \frac{1}{2}g_{\rho\lambda}\Bigg(\p_\mu\Big[g^{\lambda\tau}(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma})\Big] - \p_\nu\Big[g^{\lambda\tau}(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Big]\Bigg)\\
\end{align*}\]
<p>We have to expand out the inner partials $\p_\mu$ and $\p_\nu$ using the product rule, but remember that we’re in Riemann normal coordinates so the partials of the metric tensor and inverse metric tensor are zero $\p_\mu g^{\lambda\tau}=0$. So we can just apply the partial on the second term and factor out the $g^{\lambda\tau}$ to the front.</p>
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}\Bigg(\p_\mu(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma}) - \p_\nu(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Bigg)\]
<p>The partials can distribute through as well.</p>
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}(\p_\mu\p_\nu g_{\sigma\tau} + \p_\mu\p_\sigma g_{\tau\nu} - \p_\mu\p_\tau g_{\nu\sigma} - \p_\nu\p_\mu g_{\sigma\tau} - \p_\nu\p_\sigma g_{\tau\mu} + \p_\nu\p_\tau g_{\mu\sigma})\]
<p>The partials commute so we can cancel out the first and fourth terms.</p>
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}(\p_\mu\p_\sigma g_{\tau\nu} - \p_\mu\p_\tau g_{\nu\sigma} - \p_\nu\p_\sigma g_{\tau\mu} + \p_\nu\p_\tau g_{\mu\sigma})\]
<p>Finally, recall that $g_{\rho\lambda}g^{\lambda\tau}=\delta_\rho^\tau$ so we can substitute any lower $\tau$ with a $\rho$, and we’re left with the final result.</p>
\[R_{\rho\sigma\mu\nu} = \frac{1}{2}(\p_\mu\p_\sigma g_{\rho\nu} - \p_\mu\p_\rho g_{\nu\sigma} - \p_\nu\p_\sigma g_{\rho\mu} + \p_\nu\p_\rho g_{\mu\sigma})\]
<p>From these terms, there are two symmetries we can see (by the fact the metric is symmetric and the partials commute). The first is that the tensor is antisymmetric in the first two indices.</p>
\[R_{\rho\sigma\mu\nu} = -R_{\sigma\rho\mu\nu}\]
<p>Also, the tensor is invariant if we swap the first pair with the last pair of indices.</p>
\[R_{\rho\sigma\mu\nu} = R_{\mu\nu\rho\sigma}\]
<p>You can convince yourself of these by substituting (and carefully changing indices around!) to find that things cancel or match up. There really isn’t much insight or practice gained from showing you that so I’ll just skip it. The last property is that if we cycle the last three indices completely and take the sum, everything cancels!</p>
\[R_{\rho\sigma\mu\nu} + R_{\rho\mu\nu\sigma} + R_{\rho\nu\sigma\mu} = 0\]
<p>With some more index acrobatics, we can show that cyclical permutations are equivalent to taking a multi-index antisymmetry:</p>
\[R_{\rho[\sigma\mu\nu]} = 0\]
<p>(You can verify this yourself, but it’s not a very interesting calculation to do so I’ve also skipped this.) Note that we haven’t done anything non-tensorial here, even though we’ve used the connection coefficients.</p>
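<p>If you’d rather not do the index acrobatics by hand, here’s a sketch that verifies all four symmetries component-by-component for a concrete metric. I’m assuming sympy and using a round sphere of radius $r$ (the same metric we’ll meet in the example later in this post):</p>

```python
import sympy as sp

th, ph, r = sp.symbols('theta phi r', positive=True)
x = [th, ph]
n = 2
g = sp.Matrix([[r**2, 0], [0, r**2*sp.sin(th)**2]])
ginv = g.inv()

# Christoffel symbols Gamma[l][m][v] = Gamma^l_{mv} of the Levi-Civita connection
Gamma = [[[sum(sp.Rational(1, 2)*ginv[l, k]*(sp.diff(g[v, k], x[m]) + sp.diff(g[k, m], x[v]) - sp.diff(g[m, v], x[k]))
               for k in range(n)) for v in range(n)] for m in range(n)] for l in range(n)]

# Riemann tensor with the first index lowered: R_{a b c d}
def Rdown(a, b, c, d):
    up = lambda rho: (sp.diff(Gamma[rho][d][b], x[c]) - sp.diff(Gamma[rho][c][b], x[d])
                      + sum(Gamma[rho][c][l]*Gamma[l][d][b] - Gamma[rho][d][l]*Gamma[l][c][b] for l in range(n)))
    return sp.simplify(sum(g[a, l]*up(l) for l in range(n)))

R = [[[[Rdown(a, b, c, d) for d in range(n)] for c in range(n)] for b in range(n)] for a in range(n)]

for a in range(n):
    for b in range(n):
        for c in range(n):
            for d in range(n):
                assert sp.simplify(R[a][b][c][d] + R[b][a][c][d]) == 0  # antisymmetric in first pair
                assert sp.simplify(R[a][b][c][d] + R[a][b][d][c]) == 0  # antisymmetric in last pair
                assert sp.simplify(R[a][b][c][d] - R[c][d][a][b]) == 0  # pair swap
                assert sp.simplify(R[a][b][c][d] + R[a][c][d][b] + R[a][d][b][c]) == 0  # cyclic identity
```

<p>On the sphere, the only independent component is $R_{\theta\phi\theta\phi}=r^2\sin^2\theta$; every other non-zero component is related to it by these symmetries.</p>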
<p>Now we can use these symmetries to figure out the number of components. Using the first antisymmetry, each antisymmetric pair of indices can only take $\binom{n}{2}$ distinct values. To see this, consider $n=4$ (as commonly used in general relativity!). Because of the antisymmetry, the only unique values of an index pair are $01$, $02$, $03$, $12$, $13$, $23$: the diagonal values vanish and the other side of the diagonal is repeated. Hence, we have $n$ choose $2$, as a binomial coefficient.</p>
\[m = \binom{n}{2} = \frac{n(n-1)}{2}\]
<p>Now we can factor that into the second symmetry that says the first and second pair are swappable. For a symmetric matrix, we have $\frac{m(m+1)}{2}$ independent values, but that’s on top of the antisymmetry, which is why I used $m$ again. Substituting in terms of $n$, we can get the following (I’m skipping the algebra because it’s just algebra).</p>
\[\frac{m(m+1)}{2} = \frac{n^4-2n^3+3n^2-2n}{8}\]
<p>Note that this is for the entire tensor so we need to subtract out additional constraints. To account for the cyclic permutation identity, using the same binomial coefficient notation, we get $\binom{n}{4}$ constraints because the identity only says something new when all four index values are distinct; if any two indices coincide, the relation reduces to the first and second symmetries.</p>
\[\binom{n}{4} = \frac{n^4-6n^3+11n^2-6n}{24}\]
<p>This constrains the degrees of freedom from the general case of the first one so we subtract them to get the final result.</p>
\[\frac{n^4-2n^3+3n^2-2n}{8} - \frac{n^4-6n^3+11n^2-6n}{24} = \frac{n^2(n^2 - 1)}{12}\]
<p>(Yet again, I’ve skipped over the algebra because it’s not very interesting.) Finally we’re left with the number of independent components of the Riemann tensor with all of the symmetries accounted for! It’s certainly smaller than $n^4$, but it’s also not <em>that</em> small. For $n=4$, we have 20 independent components.</p>
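<p>The counting is easy to sanity-check numerically, too. A quick sketch in plain Python (the function name is mine):</p>

```python
from math import comb

# Independent components of the Riemann tensor in n dimensions:
# symmetric pairs of antisymmetric index pairs, minus the cyclic identity.
def riemann_components(n):
    m = comb(n, 2)                        # distinct antisymmetric index pairs
    return m * (m + 1) // 2 - comb(n, 4)  # symmetric in the two pairs, minus cyclic constraints

# Agrees with the closed form n^2 (n^2 - 1) / 12 for every n.
for n in range(2, 8):
    assert riemann_components(n) == n**2 * (n**2 - 1) // 12

assert riemann_components(4) == 20  # the number quoted for general relativity
```

<p>Note that $n=2$ gives a single independent component, which is why curvature of a surface can be summarized by one function (the Gaussian curvature).</p>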
<p>There’s just one last property regarding the Riemann tensor we need to discuss before we can simplify it into something easier to use. We can consider the covariant derivative of the lowered Riemann tensor (also in Riemann normal coordinates so there’s no connection coefficient term).</p>
\[\begin{align*}
\nabla_\lambda R_{\rho\sigma\mu\nu} &= \p_\lambda R_{\rho\sigma\mu\nu}\\
&= \frac{1}{2}\p_\lambda (\p_\mu\p_\sigma g_{\rho\nu} - \p_\mu\p_\rho g_{\nu\sigma} - \p_\nu\p_\sigma g_{\rho\mu} + \p_\nu\p_\rho g_{\mu\sigma})\\
&= \frac{1}{2}(\p_\lambda \p_\mu\p_\sigma g_{\rho\nu} - \p_\lambda \p_\mu\p_\rho g_{\nu\sigma} - \p_\lambda \p_\nu\p_\sigma g_{\rho\mu} + \p_\lambda \p_\nu\p_\rho g_{\mu\sigma})\\
\end{align*}\]
<p>If we consider cyclical permutations of the first three indices, everything cancels!</p>
\[\nabla_\lambda R_{\rho\sigma\mu\nu} + \nabla_\rho R_{\sigma\lambda\mu\nu} + \nabla_\sigma R_{\lambda\rho\mu\nu} = 0\]
<p>Like with the symmetry with cyclical permutations of the last three indices, we can use an equivalent antisymmetry.</p>
\[\nabla_{[\lambda} R_{\rho\sigma]\mu\nu} = 0\]
<p>The above property is called the <strong>Bianchi identity</strong> and it’s actually used to prove an important property of the Einstein Field Equations used in general relativity.</p>
<p><img src="/images/manifolds-part-3/bianchi-identity.png" alt="Geometric interpretation of the Bianchi Identity" title="Geometric interpretation of the Bianchi Identity" /></p>
<p><small>One geometric interpretation of the Bianchi Identity that I really like is the ability/inability to close a parallelepiped. Suppose we have three vectors $U$, $V$, and $W$. If we parallel transport each in the direction of each other, we’ll get a parallelepiped. The Bianchi Identity measures the ability of the ends of the vectors to close into a closed parallelepiped.</small></p>
<p>Even for small dimensionalities, the Riemann tensor has a lot of components! Practically speaking, we don’t often have to deal with this tensor directly. Instead, we can deal with a smaller tensor formed from a contraction of the Riemann tensor called the <strong>Ricci tensor</strong>.</p>
\[R_{\mu\nu} = R_{\mu\lambda\nu}^\lambda\]
<p>In fact, we can contract it even further to get a scalar called the <strong>Ricci scalar</strong>.</p>
\[R = R_\mu^\mu= g^{\mu\nu}R_{\mu\nu}\]
<p>As with the Riemann tensor, I also want to provide some illustrative intuition behind both of these quantities. (I won’t go through the exact proofs since that requires setting up some more machinery.) One interpretation I really like is John Baez’s coffee grounds. Imagine a ball of comoving coffee grounds on the manifold; “comoving” just means each individual coffee particle is at rest relative to all of the others so the whole group moves as a single coffee ground blob. In a flat space, the shape and size remain the same no matter how we move around the manifold. But, on a curved manifold, the ball might expand, collapse, rotate, or deform in all kinds of different ways. This is because the individual coffee grounds don’t all follow the same geodesic. The Ricci tensor measures only the change in volume of our coffee grounds. There is another tensor, called the Weyl tensor, that measures the deformation.</p>
<p>The Ricci scalar, sometimes called scalar curvature, measures how the volume of the coffee ground blob differs from flat space. A positive scalar curvature is like a sphere. As we’ll see, a sphere has positive curvature everywhere, and geodesics tend to “bend apart” on a sphere. On the other hand, a negative curvature is like a saddle.</p>
<p><img src="/images/manifolds-part-3/scalar-curvature.png" alt="Different types of scalar curvature" title="Different types of scalar curvature" /></p>
<p><small>With a positive scalar curvature, like a sphere, the edges of a triangle will “bow outward”. This is the reason we need to use the Haversine Formula when we look at angles and distance on the surface of the Earth. In a flat space, a triangle is simply a triangle. With a negative scalar curvature, like with a saddle, the edges of a triangle will “bow inward”.</small></p>
<p>To see a practical application of the Ricci tensor and scalar to general relativity, there’s a little computation we have to do first. Taking the Bianchi identity a step further, we can contract it twice to write it in terms of the Ricci tensor and Ricci scalar.</p>
\[\begin{align*}
g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\rho\sigma\mu\nu} + \nabla_\rho R_{\sigma\lambda\mu\nu} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\
g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\mu\nu\rho\sigma} + \nabla_\rho R_{\mu\nu\sigma\lambda} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\
g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\mu\sigma\lambda} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\
g^{\nu\sigma}(\nabla^\mu R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\mu\sigma}^\mu + \nabla_\sigma R_{\rho\mu\nu}^\mu) &= 0\\
g^{\nu\sigma}(\nabla^\mu R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\sigma} + \nabla_\sigma R_{\rho\nu}) &= 0\\
\nabla^\mu R_{\mu\nu\rho}^\nu - \nabla_\rho R_{\nu}^\nu + \nabla^\nu R_{\rho\nu} &= 0\\
\nabla^\mu R_{\mu\rho} - \nabla_\rho R + \nabla^\nu R_{\rho\nu} &= 0\\
\nabla^\mu R_{\mu\rho} - \nabla_\rho R + \nabla^\mu R_{\mu\rho} &= 0\\
2\nabla^\mu R_{\mu\rho} - \nabla_\rho R &= 0\\
\nabla^\mu R_{\mu\rho} - \frac{1}{2}\nabla_\rho R &= 0\\
\nabla^\mu R_{\mu\rho} &= \frac{1}{2}\nabla_\rho R\\
\end{align*}\]
<p>Between the first two equations, I used the second symmetry on the first and second terms. From the second and third equations, I used the first antisymmetry on the second term. The rest follow from raising the tensors and forming the Ricci tensor and Ricci scalar. Note that we can raise the index on a covariant derivative (rather than a partial) because of metric compatibility.</p>
<p>Now suppose we define the <strong>Einstein tensor</strong> in terms of the Ricci tensor and scalar as the following.</p>
\[G_{\mu\nu} \equiv R_{\mu\nu} - \frac{1}{2}R g_{\mu\nu}\]
<p>(Note that this tensor is also symmetric because the Ricci tensor and the metric are also symmetric!) Applying to the above Bianchi identity, we can see the following property is true.</p>
\[\begin{align*}
\nabla^\mu G_{\mu\nu} &= 0\\
\nabla^\mu (R_{\mu\nu} - \frac{1}{2}R g_{\mu\nu}) &= 0\\
\nabla^\mu R_{\mu\nu} - \frac{1}{2}g_{\mu\nu}\nabla^\mu R &= 0\\
\nabla^\mu R_{\mu\nu} - \frac{1}{2}\nabla_\nu R &= 0\\
\end{align*}\]
<p>Note that the final line corresponds to the second-to-last line of the Bianchi identity above. As it turns out, this property corresponds to the conservation of energy and momentum in general relativity! In fact, the Einstein tensor is actually the left half of the <strong>Einstein Field Equations (EFE)</strong> that tell us how the geometry of a space is affected by the energy-momentum of that space.</p>
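<p>As an aside, in two dimensions the Einstein tensor vanishes identically, since $R_{\mu\nu}=\frac{1}{2}Rg_{\mu\nu}$ there. We can check this mechanically with a sympy sketch (hand-rolled helpers rather than any dedicated package, borrowing the round-sphere metric from the example in the next section):</p>

```python
import sympy as sp

th, ph, r = sp.symbols('theta phi r', positive=True)
x = [th, ph]
n = 2
g = sp.Matrix([[r**2, 0], [0, r**2*sp.sin(th)**2]])
ginv = g.inv()

Gamma = [[[sum(sp.Rational(1, 2)*ginv[l, k]*(sp.diff(g[v, k], x[m]) + sp.diff(g[k, m], x[v]) - sp.diff(g[m, v], x[k]))
               for k in range(n)) for v in range(n)] for m in range(n)] for l in range(n)]

def riemann(rho, sig, mu, nu):
    return (sp.diff(Gamma[rho][nu][sig], x[mu]) - sp.diff(Gamma[rho][mu][sig], x[nu])
            + sum(Gamma[rho][mu][l]*Gamma[l][nu][sig] - Gamma[rho][nu][l]*Gamma[l][mu][sig] for l in range(n)))

ricci = sp.Matrix(n, n, lambda mu, nu: sp.simplify(sum(riemann(l, mu, l, nu) for l in range(n))))
scalar = sp.simplify(sum(ginv[m, v]*ricci[m, v] for m in range(n) for v in range(n)))

# G_{mu nu} = R_{mu nu} - (1/2) R g_{mu nu}
einstein = sp.Matrix(n, n, lambda m, v: sp.simplify(ricci[m, v] - sp.Rational(1, 2)*scalar*g[m, v]))

# In two dimensions the Einstein tensor vanishes identically.
assert einstein == sp.zeros(n, n)
```

<p>This is one computational reason general relativity only becomes interesting in higher dimensions: in 2D, the Einstein tensor carries no information at all.</p>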
<h1 id="example-the-2-sphere">Example: The 2-Sphere</h1>
<p>So far, we’ve set up a ton of machinery, so let’s put it into practice on a canonical example: the two-sphere $S^2$!</p>
<p><img src="/images/manifolds-part-3/spherical-coordinates.png" alt="Spherical coordinates" title="Spherical coordinates" /></p>
<p><small>We’ll define intrinsic spherical coordinates like a physicist such that the polar angle, i.e., the angle with respect to the $z$-axis is $\theta$ and the azimuthal angle, i.e., the angle in the $xy$-plane from the $x$-axis, is $\phi$.</small></p>
<p>The metric for a two-sphere requires only two intrinsic coordinates. Think about the Earth: we only need a latitude and longitude to specify a coordinate on the surface. To see this, let’s start with the spherical coordinate metric in a flat space.</p>
\[\d s^2 = \d r^2 + r^2 \d\theta^2 + r^2\sin^2\theta\d\phi^2\]
<p>However, if we’re on a sphere of a constant radius, note that $\d r^2$ vanishes and we’re left with an intrinsic metric on a sphere.</p>
\[\d s^2 = r^2(\d\theta^2 + \sin^2\theta\d\phi^2)\]
<p>Visually, treat $\d s^2$ as a little slice along the sphere, in terms of a $\theta$ and $\phi$. We can write the components of the metric and inverse metric tensor in matrix form.</p>
\[\begin{align*}
g_{ij} &= \begin{bmatrix}r^2 & 0\\ 0 & r^2\sin^2\theta \end{bmatrix}\\
g^{ij} &= \begin{bmatrix}\frac{1}{r^2} & 0\\ 0 & \frac{1}{r^2\sin^2\theta} \end{bmatrix}\\
\end{align*}\]
<p>(Recall that the inverse of a diagonal matrix just has the reciprocals of the diagonal components.) From these, we can compute the connection coefficients. It’s just the algebra of plugging the metric into the formula for the connection coefficients and churning them out. Remember that the bottom two indices are symmetric so we don’t have to compute them twice. Also, the off-diagonals of the metric and its inverse are zero so this should make it a bit easier. The only non-zero connection coefficients are the following.</p>
\[\begin{align*}
\Gamma^\theta_{\phi\phi} &= -\cos\theta\sin\theta\\
\Gamma^\phi_{\theta\phi} = \Gamma^\phi_{\phi\theta} &= \cot\theta\\
\end{align*}\]
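<p>As a check on the algebra, here’s a short sympy sketch that computes the connection coefficients straight from the metric and confirms the two values above (the helper name <code>christoffel</code> is mine):</p>

```python
import sympy as sp

th, ph, r = sp.symbols('theta phi r', positive=True)
x = [th, ph]   # index 0 = theta, index 1 = phi
n = 2
# Intrinsic metric on the sphere: ds^2 = r^2 (dtheta^2 + sin^2(theta) dphi^2)
g = sp.Matrix([[r**2, 0], [0, r**2*sp.sin(th)**2]])
ginv = g.inv()

# Gamma^l_{mu nu} = (1/2) g^{lk} (d_mu g_{nu k} + d_nu g_{k mu} - d_k g_{mu nu})
def christoffel(l, mu, nu):
    return sp.simplify(sum(sp.Rational(1, 2)*ginv[l, k]*(sp.diff(g[nu, k], x[mu])
                                                         + sp.diff(g[k, mu], x[nu])
                                                         - sp.diff(g[mu, nu], x[k]))
                           for k in range(n)))

assert sp.simplify(christoffel(0, 1, 1) + sp.sin(th)*sp.cos(th)) == 0  # Gamma^theta_{phi phi} = -cos sin
assert sp.simplify(christoffel(1, 0, 1) - sp.cot(th)) == 0             # Gamma^phi_{theta phi} = cot(theta)

# Every other coefficient vanishes (up to the symmetry of the lower indices).
for (l, m, v) in [(l, m, v) for l in range(n) for m in range(n) for v in range(n)]:
    if (l, m, v) not in [(0, 1, 1), (1, 0, 1), (1, 1, 0)]:
        assert christoffel(l, m, v) == 0
```

<p>Note that the overall $r^2$ drops out of the connection coefficients entirely, since it multiplies every component of the metric.</p>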
<p>While we’re at it, we can compute the Ricci tensor. (This is also just algebra.)</p>
\[\begin{align*}
R_{\theta\theta} &= 1\\
R_{\theta\phi} = R_{\phi\theta} &= 0\\
R_{\phi\phi} &= \sin^2\theta\\
\end{align*}\]
<p>And finally we can compute the Ricci scalar.</p>
\[R = \frac{2}{r^2}\]
<p>From this, we see that the Ricci scalar is constant across the sphere and positive. This makes sense since neighboring geodesics tend to “bow” outwards and “inflate”. On the other hand, if we had added some “noise” to the metric, then this wouldn’t be the case. One interesting thing to note is that the scalar curvature increases as the radius decreases. As an interesting application, we can model some kinds of black hole event horizons as spheres. The strength of tidal forces at the horizon grows with the curvature there, and that curvature shrinks as the horizon radius grows. In other words, a black hole with a very large event horizon doesn’t have strong tidal forces at its horizon. For the supermassive black hole at the center of our Milky Way galaxy, we could toss anything in without it being ripped apart by tidal forces at the horizon.</p>
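<p>The whole chain from metric to Ricci scalar can also be checked mechanically. A sketch with sympy (again hand-rolled helpers, not any particular differential-geometry package):</p>

```python
import sympy as sp

th, ph, r = sp.symbols('theta phi r', positive=True)
x = [th, ph]
n = 2
g = sp.Matrix([[r**2, 0], [0, r**2*sp.sin(th)**2]])
ginv = g.inv()

Gamma = [[[sum(sp.Rational(1, 2)*ginv[l, k]*(sp.diff(g[v, k], x[m]) + sp.diff(g[k, m], x[v]) - sp.diff(g[m, v], x[k]))
               for k in range(n)) for v in range(n)] for m in range(n)] for l in range(n)]

# R^rho_{sigma mu nu}
def riemann(rho, sig, mu, nu):
    return sp.simplify(sp.diff(Gamma[rho][nu][sig], x[mu]) - sp.diff(Gamma[rho][mu][sig], x[nu])
                       + sum(Gamma[rho][mu][l]*Gamma[l][nu][sig] - Gamma[rho][nu][l]*Gamma[l][mu][sig]
                             for l in range(n)))

# Ricci tensor R_{mu nu} = R^lambda_{mu lambda nu}, Ricci scalar R = g^{mu nu} R_{mu nu}
ricci = sp.Matrix(n, n, lambda mu, nu: sum(riemann(l, mu, l, nu) for l in range(n)))
scalar = sp.simplify(sum(ginv[m, v]*ricci[m, v] for m in range(n) for v in range(n)))

assert sp.simplify(ricci[0, 0] - 1) == 0              # R_theta_theta = 1
assert sp.simplify(ricci[1, 1] - sp.sin(th)**2) == 0  # R_phi_phi = sin^2(theta)
assert sp.simplify(scalar - 2/r**2) == 0              # R = 2/r^2, independent of theta and phi
```

<p>The final assertion confirms both claims at once: the curvature is uniform over the sphere and falls off as the radius grows.</p>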
<p>Another, more interesting, thing to consider is geodesics on the sphere. This is particularly interesting because, if we wanted to find the shortest path between two points on the Earth, the geodesic tells us exactly that! Let’s start by rewriting the geodesic equation.</p>
\[\frac{\d^2 x^\mu}{\d\lambda^2} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} = 0\]
<p>Recall that these are actually a <em>set</em> of second-order differential equations, one for each value of $\mu$. Since we have two coordinates $\theta$ and $\phi$, we’ll have two equations. We can also simplify the equations since there are only two unique, non-zero connection coefficients.</p>
\[\begin{align*}
\frac{\d^2 x^\theta}{\d\lambda^2} + \Gamma_{\phi\phi}^\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\
\frac{\d^2 x^\phi}{\d\lambda^2} + \Gamma_{\theta\phi}^\phi\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} +\Gamma_{\phi\theta}^\phi\frac{\d x^\phi}{\d\lambda}\frac{\d x^\theta}{\d\lambda}&= 0\\
\end{align*}\]
<p>But remember that the connection coefficients are symmetric so the last two terms in the second equation are the same.</p>
\[\begin{align*}
\frac{\d^2 x^\theta}{\d\lambda^2} + \Gamma_{\phi\phi}^\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\
\frac{\d^2 x^\phi}{\d\lambda^2} + 2\Gamma_{\theta\phi}^\phi\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\
\end{align*}\]
<p>Now let’s plug in the values for the connection coefficients.</p>
\[\begin{align*}
\frac{\d^2 x^\theta}{\d\lambda^2} -\cos\theta\sin\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\
\frac{\d^2 x^\phi}{\d\lambda^2} + 2\cot\theta\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\
\end{align*}\]
<p>These are a pair of coupled second-order differential equations that are too difficult to solve in general. Fortunately, the sphere has a lot of symmetries so, even if we restrict the solution, we can use those symmetries to produce general solutions. For now, let’s fix a latitude $\theta=\tilde{\theta}$ so we have the equations $x^\theta(\lambda)=\tilde{\theta}, x^\phi(\lambda)=\alpha\lambda + \beta$ where $\alpha$ and $\beta$ are just constants that parameterize the path around the latitude. (We could ignore $\beta$, but I left it in for completeness.) Now let’s compute the first and second order derivatives needed for the geodesic equation.</p>
\[\begin{align*}
\frac{\d x^\theta}{\d\lambda} = 0 &, \frac{\d x^\phi}{\d\lambda} = \alpha\\
\frac{\d^2 x^\theta}{\d\lambda^2} = 0 &, \frac{\d^2 x^\phi}{\d\lambda^2} = 0\\
\end{align*}\]
<p>Now we can plug these into the geodesic equation and substitute $\theta=\tilde{\theta}$.</p>
\[\begin{align*}
0 - \cos\tilde{\theta}\sin\tilde{\theta}\cdot\alpha^2 &= 0\\
0 + 2\cot\tilde{\theta}\cdot 0\cdot\alpha &= 0\\
\end{align*}\]
<p>The second equation is just $0=0$, so we can ignore it and focus on the first one.</p>
\[- \cos\tilde{\theta}\sin\tilde{\theta}\cdot\alpha^2 = 0\]
<p>The goal is to set $\tilde{\theta}$ and $\alpha$ such that the equation is also $0=0$. The easiest thing to do seems to be to set $\alpha=0$. But if we do that, the resulting equations become $x^\theta(\lambda)=\tilde{\theta}, x^\phi(\lambda)=\beta$, which is just a fixed point on the sphere. Let’s try to set $\sin\tilde{\theta}=0$. In this case, $\tilde{\theta}=0$ or $\tilde{\theta}=\pi$. The resulting equations become $x^\theta(\lambda)=0, x^\phi(\lambda)=\alpha\lambda+\beta$, but this is also just a point because $x^\theta(\lambda)=0$ and $x^\theta(\lambda)=\pi$ are the North and South Poles.</p>
<p>Instead, let’s try to set $\cos\tilde{\theta}=0$, which means $\tilde{\theta}=\frac{\pi}{2}$. The equations become $x^\theta(\lambda)=\frac{\pi}{2}, x^\phi(\lambda)=\alpha\lambda+\beta$. This represents a path along the equator! This kind of circle is called a <strong>great circle</strong>: a circle on a sphere where the center of the circle is the center of the sphere. Using the rotational symmetry of the sphere, all geodesics on a sphere are great circles. In other words, the shortest distance between any two points on a sphere is the great circle that contains those two points. (There are actually two directions, but we can simply pick the shortest one.) With this, we have shown that geodesics on spheres are all great circles using the geodesic equation. An alternative to finding geodesics with the geodesic equation is to use calculus of variations and the Euler-Lagrange equations, and that’s sometimes easier (maybe I’ll explain that in another post!), but this is also a valid way of finding geodesics.</p>
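<p>As a quick sanity check, we can plug the equatorial solution back into the two geodesic equations symbolically (a sympy sketch):</p>

```python
import sympy as sp

lam, alpha, beta = sp.symbols('lambda alpha beta')
theta = sp.pi/2            # fixed at the equator
phi = alpha*lam + beta     # uniform motion in the azimuthal angle

# Geodesic equations on the sphere, using the two non-zero connection coefficients:
# Gamma^theta_{phi phi} = -sin(theta)cos(theta) and Gamma^phi_{theta phi} = cot(theta)
eq_theta = sp.diff(theta, lam, 2) - sp.sin(theta)*sp.cos(theta)*sp.diff(phi, lam)**2
eq_phi = sp.diff(phi, lam, 2) + 2*(sp.cos(theta)/sp.sin(theta))*sp.diff(theta, lam)*sp.diff(phi, lam)

# Both equations reduce to 0 = 0: the equator is a geodesic.
assert sp.simplify(eq_theta) == 0
assert sp.simplify(eq_phi) == 0
```

<p>Swapping $\sp{}$... rather, swapping $\pi/2$ for any other fixed latitude makes the first equation fail, matching the argument above that only the equator (up to rotations, any great circle) is a geodesic.</p>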
<h1 id="conclusion">Conclusion</h1>
<p>We’ve covered a lot of topics in this post, eventually culminating in answering a deceptively simple question: “how do we know if a space is flat?”. We saw that this was not an easy question when manifolds and intrinsic geometry were involved! To answer that question, we had to build up to it piece-by-piece, starting with a good intrinsic derivative operator, then discussing how to compare vectors on a manifold, and ending on curvature with a peek into general relativity. Here are some of the core concepts we learned:</p>
<ul>
<li>The covariant derivative is a way to compute derivatives on a manifold that accounts for the changing basis using the connection coefficients. The connection coefficients do not transform like tensors intentionally, to cancel out the non-tensorial part of the partial derivative.</li>
<li>Parallel transport is a way to move a vector along a curve so that it stays “as straight as possible.” On a manifold, there is no way to compare vectors at different points and how the vector changes depends on the path.</li>
<li>A geodesic is the generalization of straight lines on a manifold. It is the curve that parallel transports its own tangent vector. We can construct geodesics at a point by using the exponential map to project a tangent vector into a curve on the manifold.</li>
<li>The Riemann tensor characterizes the intrinsic curvature of a manifold by parallel transporting a vector along a little loop and measuring how much it changes. If the Riemann tensor is zero everywhere, then there exists a coordinate system where the metric is flat.</li>
<li>The Ricci tensor is a useful contraction of the Riemann tensor that shows how a small group of neighboring geodesics change in volume as they move about the manifold. The Ricci scalar is a scalar way to measure that change.</li>
</ul>
<p>That’s all! In this set of posts, we’ve learned all about manifolds and how to do calculus on them 😀</p>In the last part, I'll show how we can define curvature on a manifold by extending calculus to work on manifolds with the covariant derivative!Manifolds - Part 22021-05-30T00:00:00+00:002021-05-30T00:00:00+00:00/manifolds-part-2<p>In the previous article, we reviewed vectors, duals, and tensors in a flat coordinate system, i.e., Euclidean space. Now that we have a good understanding of those generalizations in flat space, we can construct a manifold and re-invent the same machinery.</p>
<h1 id="introduction">Introduction</h1>
<p>So far, we’ve only dealt with Euclidean spaces. However, there are plenty of spaces that are only locally Euclidean, but, globally, have a more interesting topology. This is the informal definition of a <strong>manifold</strong>: a space that is locally flat but globally more interesting. This has some profound connotations for how vectors, duals, and tensors are defined, as well as how we perform any kind of calculus (differentiation and integration) on this manifold. To be more precise, we can work with manifolds that don’t allow for calculus on them, i.e., non-differentiable manifolds, but those are much less interesting, and, practically, we’ll usually be able to perform calculus on our manifolds.</p>
<h1 id="manifolds">Manifolds</h1>
<p>I gave an intuition of manifolds, but let me define a few concrete examples that you’ve likely seen or heard of:</p>
<ul>
<li>$\R^n$: $\R^n$ is globally Euclidean as well as locally Euclidean!</li>
<li>$S^n$: the $n$-sphere is a manifold ($S^1$ is a circle; $S^2$ is a sphere; $S^3$ is a glome; etc.).</li>
<li>$\mathbb{T}^n$: the $n$-dimensional torus is a manifold.</li>
<li>$\mathbb{G}^n$: an $n$-genus is a manifold: the $n$ denotes the number of “holes”. A $0$-genus is a sphere; a $1$-genus is a torus; a $2$-genus has two holes, like the number 8; etc.</li>
<li>matrix groups: the set of continuous rotations in $\R^3$ that leave the origin fixed, i.e., a <strong>Lie group</strong>. We’ll discuss these in a different article 😉, but they’re also essential to both particle physics and astrophysics.</li>
<li>spacetime, as we know it: in the previous article, I stated that my personal motivation to learn about manifolds was to understand general relativity. In that framework, spacetime is a 4D manifold, 3 space-like dimensions and 1 time-like dimension.</li>
<li>$S^1\times\R$: in other words, a cylinder!</li>
<li>A Mobius strip</li>
</ul>
<p>With all of these examples, what isn’t a manifold? Using that same definition, anything that isn’t a manifold is a space where, at some point, it locally doesn’t look like a flat, $\R^n$ space. There are a few contrived examples, but also a few practical examples:</p>
<ul>
<li>Anything with a boundary: at the boundary, the space doesn’t look Euclidean.</li>
<li>Intersections of different flat spaces, e.g., a plane with a line through it: at the intersection of the line and the plane, the space doesn’t look Euclidean.</li>
<li>A light cone: light cones are ubiquitous in general relativity, but they aren’t manifolds because the point where the past and future light cones intersect doesn’t look like a flat, Euclidean space!</li>
</ul>
<h2 id="preliminaries">Preliminaries</h2>
<p>As for the more rigorous definition, we’ll be following Wald’s textbook on general relativity; even though we rarely use the full definition of a manifold, I think it’s a really neat construction that emphasizes several important characteristics of a manifold, e.g., independence of coordinates, no global frame, and independence of embedding space. Before we do that, however, I’ll review some definitions of maps and functions since they’re essential constructs in manifolds and differential geometry.</p>
<p>Given two sets $A$ and $B$, a <strong>map</strong> $\phi : A\to B$ assigns, to each $a\in A$, exactly one $b\in B$. We can think of this as a “generalization” of a function. With this definition, there are several different kinds of maps that are more specific:</p>
<ul>
<li>
<p><strong>one-to-one/injective</strong>: $\forall b\in B$, there is at most one $a\in A$ mapped to it by $\phi$. A technique you might have heard of for identifying injective functions is the “horizontal line test”: if there is a horizontal line that intersects the function more than once, the function is not injective. For example, $f(x)=x^2$ fails, since, for $f(x)=4$, $x=\pm 2$. Also, there may be $b\in B$ such that $\nexists a\in A$ with $\phi(a)=b$. In other words, there may be an element in $B$ such that no element in $A$ is mapped to it.</p>
</li>
<li>
<p><strong>onto/surjective</strong>: $\forall b\in B$, there is at least one $a\in A$ such that $\phi(a) = b$. In other words, every $b\in B$ originates from some $a\in A$, even if several $a\in A$ map to the same $b$. An example of such a map is $f(x)=x^3$: every element of the $y$-axis is hit by some element of the $x$-axis. On the other hand, a function like $f(x)=e^x$, viewed as a map into $\R$, is not surjective since its image (the positive reals) doesn’t span the entire $y$-axis.</p>
</li>
<li>
<p><strong>one-to-one correspondence/bijective</strong>: a function that is both one-to-one and onto. In other words, each $a\in A$ is sent to exactly one $b\in B$, and every $b\in B$ is hit exactly once. For example, $x^3$ is bijective since it is both injective and surjective. As a corollary of the definition, for each bijection, there exists an inverse bijection $\phi^{-1} : B\to A$ such that $\phi^{-1}(\phi(a)) = a$. From this definition, it’s pretty easy to show that the composition of bijections is also a bijection.</p>
</li>
</ul>
<p><img src="/images/manifolds-part-2/types-of-functions.svg" alt="Types of functions" title="Types of functions" /></p>
<p><small>The top function is one-to-one; the middle function is onto; and the bottom function is bijective.</small></p>
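<p>These three properties are easy to check mechanically for finite maps. Below is a minimal Python sketch (the helper names <code>is_injective</code>, etc., are my own, not from any library) that encodes a map as a dictionary assigning each domain element exactly one codomain element:</p>

```python
# A finite map as a dict {domain element: codomain element}.
# Helper names (is_injective, is_surjective, is_bijective) are made up.
def is_injective(f, codomain):
    # one-to-one: no codomain element is hit more than once
    values = list(f.values())
    return len(values) == len(set(values))

def is_surjective(f, codomain):
    # onto: every codomain element is hit at least once
    return set(f.values()) == set(codomain)

def is_bijective(f, codomain):
    # one-to-one correspondence: both of the above
    return is_injective(f, codomain) and is_surjective(f, codomain)

# f(x) = x^2 on {-2,...,2} into {0, 1, 4}: surjective but not injective
sq = {x: x * x for x in [-2, -1, 0, 1, 2]}
print(is_injective(sq, [0, 1, 4]))   # False: -2 and 2 both map to 4
print(is_surjective(sq, [0, 1, 4]))  # True

# f(x) = x^3 on the same domain is a bijection onto its image
cube = {x: x ** 3 for x in [-2, -1, 0, 1, 2]}
print(is_bijective(cube, [-8, -1, 0, 1, 8]))  # True
```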
<p>One last thing we’ll need about maps is composition of maps: if $\phi: A\to B$ and $\psi : B\to C$, then $(\psi\circ\phi): A\to C, a\mapsto\psi(\phi(a))$.</p>
<h2 id="manifold-construction">Manifold Construction</h2>
<p>Now that we’ve reviewed the preliminaries, let’s construct a manifold! We’ll start by defining an <strong>open ball</strong> as the set of all points $x\in\R^n$ such that $\lVert x - y\rVert < r$ for a fixed center $y\in\R^n$ and radius $r>0$.</p>
<p><img src="/images/manifolds-part-2/open-ball.svg" alt="Open ball" title="Open ball" /></p>
<p><small>An open ball is a really simple construct: a set of points inside of an open circle.</small></p>
<p>(If we considered a closed ball, we’d have to worry about the boundary! As it turns out, we can completely construct a manifold with open balls rather than closed balls.) With that definition, we can define an <strong>open subset</strong> as a union of (a potentially infinite number of) open balls.</p>
<p><img src="/images/manifolds-part-2/open-set.svg" alt="Open set" title="Open set" /></p>
<p><small>An open set is just a (possibly infinite) collection of open balls.</small></p>
<p>In fact, we can say that a subset $U\subset\R^n$ is open iff $\forall u\in U, \exists$ an open ball at $u$ such that it is inside of $U$. In other words, we can say that $U$ defines the interior of an $(n-1)$-dimensional surface. As a concrete example, an open set $U$ in $\R^2$ defines the interior of a $1$-dimensional surface, i.e., the interior of a closed loop on a plane. For $\R^3$, this would define the interior of a closed surface.</p>
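<p>The “open ball at every point” characterization can be probed numerically. Here’s a hedged sketch (helper names are made up for illustration) for the open unit disk in $\R^2$: for each interior point, the obvious radius $r = 1 - \lVert u\rVert$ gives an open ball that stays inside the disk:</p>

```python
import math
import random

def in_unit_disk(p):
    """Membership test for the open unit disk in R^2."""
    return p[0] ** 2 + p[1] ** 2 < 1.0

def open_ball_fits(center):
    """Pick the obvious radius r = 1 - |center| and check that sampled
    points of the open ball around `center` stay inside the disk."""
    r = 1.0 - math.hypot(center[0], center[1])
    for _ in range(500):
        ang = random.uniform(0.0, 2.0 * math.pi)
        rad = r * random.random()  # strictly less than r
        q = (center[0] + rad * math.cos(ang), center[1] + rad * math.sin(ang))
        if not in_unit_disk(q):
            return False
    return True

# every sampled interior point admits an open ball inside the disk
pts = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.0), (-0.3, 0.8)]
print(all(open_ball_fits(p) for p in pts))  # True
```

<p>A point <em>on</em> the boundary circle would fail this test ($r=0$), which is exactly why open balls, not closed ones, are the right building block.</p>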
<p>Now that we have this arbitrary set, we can naturally and immediately define a <strong>coordinate system</strong>/<strong>chart</strong> on this open set as being a subset $U\subset M$ and a one-to-one function $\phi : U\to\R^n$ that maps the open set $U$ into the flat Euclidean space $\R^n$. For convenience, instead of applying $\phi$ to individual points, we can consider the <strong>image</strong> of $\phi$ for a set of points. This is defined to be the set of all points $\R^n$ that $U$ gets mapped to. As an example, we can consider the unit circle parameterized by $\theta$. Then we can define a chart such that $U=\{ \theta | \theta\in(0,\pi) \}$ and $\phi(\theta)=\theta$. This maps the half-circle $\theta\in(0,\pi)$ to the real line by “flattening” it. In fact, we could have mapped all but a single point of the circle to the real line this way, but a single chart can never cover the whole circle, and, as we’ll see, covering an entire manifold with one chart is usually not possible.</p>
<p><img src="/images/manifolds-part-2/coordinate-chart.svg" alt="Coordinate chart" title="Coordinate chart" /></p>
<p><small>A coordinate chart maps an arbitrary open set to an open set in a flat space.</small></p>
<p>Even though we can’t usually use a single chart to cover a manifold, we could use multiple charts if we impose some additional constraints. This is called a <strong>$C^\infty$ atlas</strong>: an indexed family of charts $\{(U_\alpha, \phi_\alpha)\}$ such that</p>
<ol>
<li>
<p>The union of all of the sets cover the manifold: $\bigcup_\alpha U_\alpha = M$. If they didn’t, then we couldn’t create a chart for some part of our manifold!</p>
</li>
<li>
<p>If two charts overlap, they are smoothly sewn together. More formally, if $U_\alpha\cap U_\beta\neq\emptyset$, then the transition map $\phi_\beta\circ\phi_\alpha^{-1} : \phi_\alpha(U_\alpha\cap U_\beta)\to\phi_\beta(U_\alpha\cap U_\beta)$ must be smooth. This is best explained in the figure below. This condition is the crux of manifold construction: we can smoothly sew together a bunch of locally flat spaces into a structure that is only locally flat, and we’ve said absolutely nothing about the global structure. The reason this is called a <em>$C^\infty$</em> atlas is because all of the maps are $C^\infty$, in other words, continuous and infinitely differentiable.</p>
</li>
</ol>
<p><img src="/images/manifolds-part-2/smooth-stitching.svg" alt="Smooth stitching" title="Smooth stitching" /></p>
<p><small>This “smooth stitching” constraint is the most important part of the manifold definition: if we’re in one open set, we can “hop” to an adjacent one using this property.</small></p>
<p>Now we can finally get to the definition we’ve been waiting for! A <strong>$C^\infty$ $n$-dimensional manifold</strong> is a set $M$ with a <strong>maximal atlas</strong>. A <strong>maximal atlas</strong> is an atlas that contains every possible chart for that manifold. The reason we need a <em>maximal</em> atlas is so we don’t consider different atlases to be different manifolds. For example, if we had an atlas of a circle and another atlas that starts at 45 degrees relative to the first one, without the condition of a maximal atlas, we would have thought we had two different circles!</p>
<p>Note that in the construction of the manifold, we never mentioned anything about the space that the manifold may be embedded in or the global structure. We simply took a bunch of flat $\R^n$ spaces and smoothly sewed them together on their overlaps. Manifolds exist completely independent of the space they are embedded in. We can take a circle, embed it in either a plane or a space and the maps into the real line would be the same. In fact, there’s a famous theorem called <strong>Whitney’s embedding theorem</strong> that states any $n$-manifold can be embedded in <em>at most</em> $\R^{2n}$. For example, a sphere $S^2$ can be embedded in at most $\R^4$, but, it turns out we can also embed it in $\R^3$. Another example is a Klein bottle, which is a $2$-manifold, but it can only be embedded in $\R^4$.</p>
<p>Now let’s look at a few concrete examples of constructing a manifold from an atlas. We’ve already seen a chart for the circle, but a single chart can’t cover the whole circle: the set it covers would have to be closed, and charts need open sets. Let’s fix that and use two overlapping charts to cover the circle:</p>
\[\begin{align*}
U_1 &=\Big\{\theta | \theta\in\Big(\frac{\pi}{4}, \frac{7\pi}{4}\Big)\Big\}, \phi_1(\theta)=\theta\\
U_2 &=\Big\{\theta | \theta\in\Big(-\frac{3\pi}{4}, \frac{3\pi}{4}\Big)\Big\}, \phi_2(\theta)=\theta\\
\end{align*}\]
<p>These two charts cover the circle with plenty of overlap, and each is an open set. This atlas isn’t maximal, of course, but exhibiting just one atlas is enough to show a structure is a manifold, since any atlas can be extended to a maximal one.</p>
<p><img src="/images/manifolds-part-2/atlas-for-a-circle.svg" alt="Atlas for a circle" title="Atlas for a circle" /></p>
<p><small>The atlas for a circle needs at least two charts to ensure openness; a single chart can never cover the whole circle.</small></p>
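<p>To make the two-chart atlas concrete, here’s a small Python sketch (function names are my own) that implements both angle charts on the unit circle and checks that, on the overlaps, the transition map $\phi_2\circ\phi_1^{-1}$ is either the identity or a shift by $2\pi$ — both perfectly smooth:</p>

```python
import math

# Represent a circle point as a unit vector in R^2, and implement the
# two angle charts from the atlas above.
def phi1(p):
    # chart on U1: angle representative taken in (pi/4, 7*pi/4)
    theta = math.atan2(p[1], p[0]) % (2 * math.pi)
    return theta  # valid only when pi/4 < theta < 7*pi/4

def phi2(p):
    # chart on U2: angle representative taken in (-3*pi/4, 3*pi/4)
    theta = math.atan2(p[1], p[0])  # atan2 already lands in (-pi, pi]
    return theta  # valid only when -3*pi/4 < theta < 3*pi/4

def point(theta):
    return (math.cos(theta), math.sin(theta))

# overlap near theta = pi/2: transition phi2 o phi1^{-1} is the identity
p = point(math.pi / 2)
print(abs(phi2(p) - phi1(p)))                   # ~0.0

# overlap near theta = 3*pi/2: transition is a shift by -2*pi
q = point(3 * math.pi / 2)
print(abs(phi2(q) - (phi1(q) - 2 * math.pi)))   # ~0.0
```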
<p>For a slightly more complicated example, let’s consider the sphere $S^2$. This is one manifold where it is impossible to have a single chart that covers the manifold. We can cover the sphere with two charts using stereographic projection, excluding the North and South Poles respectively. We can use the planes $x^3=\pm 1$ as the two copies of $\R^2$ to project into. (recall that $x^3$ is a coordinate, not an exponent!) We will project a ray starting from one of the poles, intersecting the sphere, and landing on one of the planes. The two charts for our atlas are $U_1=\{\text{all points excluding the North pole}\}$ and $U_2=\{\text{all points excluding the South pole}\}$ with the maps</p>
\[\begin{align*}
\phi_1(x^1, x^2, x^3) &= \Big(\frac{2x^1}{1-x^3}, \frac{2x^2}{1-x^3}\Big)\\
\phi_2(x^1, x^2, x^3) &= \Big(\frac{2x^1}{1+x^3}, \frac{2x^2}{1+x^3}\Big)\\
\end{align*}\]
<p>Each chart covers the whole sphere except one pole, so each domain is an open set; together, the two charts hit every point on the sphere (every point except the poles is covered twice).</p>
<p><img src="/images/manifolds-part-2/mercator-projection.svg" alt="Mercator projection" title="Mercator projection" /></p>
<p><small>Take either pole, project a beam from the inside through the surface to the outside, and record where it falls on the “catching” plane. This gives us a smooth map that projects the points on the sphere into a flat space.</small></p>
<p>So we’ve shown a sphere is indeed a manifold. Moreover, since we’re mapping the atlas into $\R^2$, we’ve shown it is specifically a 2-dimensional manifold.</p>
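<p>We can also check the stereographic charts numerically. The sketch below (a hedged illustration, not part of the original construction) samples points on the unit sphere away from both poles and verifies that the transition map works out to $\phi_2\circ\phi_1^{-1}(w) = 4w/\lVert w\rVert^2$, which is $C^\infty$ away from the origin:</p>

```python
import math
import random

# The two stereographic charts on the unit sphere, as defined above.
def phi1(x, y, z):
    return (2 * x / (1 - z), 2 * y / (1 - z))   # excludes the North pole

def phi2(x, y, z):
    return (2 * x / (1 + z), 2 * y / (1 + z))   # excludes the South pole

random.seed(0)
for _ in range(100):
    # random point on the sphere, away from both poles
    theta = random.uniform(0.2, math.pi - 0.2)   # polar angle
    lam = random.uniform(0.0, 2 * math.pi)       # azimuthal angle
    x, y, z = (math.sin(theta) * math.cos(lam),
               math.sin(theta) * math.sin(lam),
               math.cos(theta))
    u, v = phi1(x, y, z)
    s = u * u + v * v
    expected = (4 * u / s, 4 * v / s)            # transition-map formula
    got = phi2(x, y, z)
    assert math.isclose(got[0], expected[0], abs_tol=1e-9)
    assert math.isclose(got[1], expected[1], abs_tol=1e-9)
print("transition map verified")
```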
<h2 id="tensors-on-a-manifold">Tensors on a Manifold</h2>
<p>Now that we’ve constructed the manifold, we need to re-introduce tensors, starting with vectors in the tangent space. In flat space, we already defined vectors to exist only at a point (to get around vectors in a curved coordinate system) and the collection of them all pointing in each direction to be the tangent space $T_p M$ at that point. First off, let’s construct the tangent space. Unlike in flat space, we can’t simply construct it by considering all vectors pointing in every direction because we haven’t defined the tangent space! Instead, we might think of “creating” vectors by looking at all possible <em>curves</em> $\xi : \R\to M, \lambda\mapsto\xi(\lambda)$ that go through a point $p$ and their tangent vectors at $p$. That would seem to give us basically the same result, but the problem lies in the parametrization of $\xi$: it’s dependent on the coordinates of the manifold! In other words, our tangent vectors would be $\frac{\d\xi^\mu}{\d\lambda}$, which depend on the coordinates $\xi^\mu$. Recall that vectors are independent of all coordinates since they are geometric objects so we can’t use this definition. Also, we’re cheating here since we haven’t defined what “tangent to a curve” even means!</p>
<p>We’re still pretty close though. Instead, let’s flip this notion and define the set of all continuous, infinitely-differentiable functions on the manifold $\mathcal{F}=\{\text{all } C^\infty f : M\to\R\}$. Given any function, we can define a directional derivative operator $\frac{\d}{\d\lambda}$ that can act on a function $f$ to produce $\frac{\d f}{\d\lambda}$. Notice that this doesn’t depend on the coordinates since we’re using a <em>scalar</em> function $f$, not a curve under some coordinates! Now we can take a similar approach where we look at all possible directional derivative operators of functions through $p$ and define the tangent space to be that.</p>
<p><img src="/images/manifolds-part-2/directional-derivatives-on-curves.svg" alt="Directional derivatives of curves" title="Directional derivatives of curves" /></p>
<p><small>At a point p, consider all possible (scalar) functions through that point. We can always take the directional derivative of a parameterized curve with respect to the parameter.</small></p>
<p>However, in order to make that statement, we need to show the following conditions hold:</p>
<ol>
<li>The space of all directional derivative operators forms a valid vector space. After all, a tangent space is a vector space.</li>
<li>The dimensionality of this vector space is the same as the manifold, i.e., $n$. Recall an $n$-manifold has tangent spaces of dimensionality $n$. This is because we’ve constructed the manifold from charts into $\R^n$, so the dimensionality has to match.</li>
</ol>
<p>To show that the space of directional derivatives is a vector space, we need to show that two of these operators can be added and scaled and the result is also a directional derivative operator. The first part of this is pretty easy:</p>
\[a\frac{\d}{\d\lambda} + b\frac{\d}{\d\tau}\]
<p>The second part is a bit trickier. A directional derivative operator must be linear and obey the Leibniz product rule. From the equation above, we can already see that the operator is linear so we just need to show the product rule holds:</p>
\[\begin{align*}
\Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(fg) &= f\frac{\d g}{\d\lambda} + g\frac{\d f}{\d\lambda} + f\frac{\d g}{\d\tau} + g\frac{\d f}{\d\tau}\\
&= \Big(\frac{\d f}{\d\lambda}+\frac{\d f}{\d\tau}\Big)g + f\Big(\frac{\d g}{\d\lambda}+\frac{\d g}{\d\tau}\Big)\\
&= \Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(f)g + f\Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(g)
\end{align*}\]
<p>Therefore directional derivatives form a valid vector space. It sounds rather interesting that an “operator” can form a vector space, but really any kind of object can form a vector space as long as it satisfies the constraints! (Personally, I think “linear space” is maybe a better name since the properties of a vector space are really just linearity and closure.)</p>
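<p>The closure argument above can be sanity-checked with finite differences. This hedged sketch (the functions, point, and directions are arbitrary choices for illustration) verifies that the sum of two directional derivative operators in $\R^2$ still obeys the Leibniz rule on a product $fg$:</p>

```python
import math

EPS = 1e-6

def directional(fn, p, d):
    """Central-difference directional derivative of fn at p along d."""
    fwd = fn(p[0] + EPS * d[0], p[1] + EPS * d[1])
    bwd = fn(p[0] - EPS * d[0], p[1] - EPS * d[1])
    return (fwd - bwd) / (2 * EPS)

f = lambda x, y: math.sin(x) * y
g = lambda x, y: x * x + y
fg = lambda x, y: f(x, y) * g(x, y)

p = (0.7, 1.3)
d1, d2 = (1.0, 0.0), (0.3, -0.5)   # two directions (two "curves" through p)

# (d/dlam + d/dtau)(fg) vs the Leibniz expansion from the derivation above
lhs = directional(fg, p, d1) + directional(fg, p, d2)
rhs = ((directional(f, p, d1) + directional(f, p, d2)) * g(*p)
       + f(*p) * (directional(g, p, d1) + directional(g, p, d2)))
print(abs(lhs - rhs) < 1e-5)  # True
```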
<p>The last thing we have to do is show that the dimensionality of this vector space is the same as that of the manifold. In Wald’s textbook on general relativity, he shows this directly, but, in Sean Carroll’s book, he uses a clever identity: the dimensionality of a vector space is the same as the number of basis vectors. Therefore we just need to show that the number of basis vectors for the tangent space is the same as the dimensionality of the manifold. In other words, we need to construct a basis for the tangent space.</p>
<p>Let’s start by assuming some arbitrary coordinates $x^\mu$. Given that, there’s a natural choice for the basis of directional derivatives: the partial derivatives with respect to the coordinates, $\partial_\mu$! Let’s define the directional derivatives as a linear combination of the partial derivatives with respect to some arbitrary coordinates. Then we need to show that the set of partial derivatives forms a basis and the number of elements in that set is $n$, i.e., the dimensionality of our manifold. Since we’re defining the directional derivatives as partial derivatives, we need to show that any directional derivative $\frac{\d}{\d\lambda}$ can be decomposed into a linear combination of the partial derivatives $\partial_\mu$.</p>
<p><img src="/images/manifolds-part-2/partial-derivatives.svg" alt="Partial derivatives" title="Partial derivatives" /></p>
<p><small>For a set of coordinate functions on the manifold, the partial derivatives can form a basis for the directional derivatives.</small></p>
<p>Since we’re dealing with operators, it’s much less error-prone if we define some arbitrary function $f:M\to\R$ that the operators act on that we’ll remove at the end. We’ll also need a curve $\xi:\R\to M$ since $\xi$ is the function that is actually parameterized by the $\lambda$ in the directional derivative $\frac{\d}{\d\lambda}$. Since we’re at a point $p$, we’ll also get a chart $\phi:M\to\R^n$ with coordinates $x^\mu$ for free!</p>
<p>To reiterate, our goal is to show that we can write $\frac{\d}{\d\lambda}$ as a linear combination of partial derivatives $\partial_\mu$. With all of the maps and spaces, we can draw this picture.</p>
<p><img src="/images/manifolds-part-2/directional-derivatives-maps.svg" alt="Directional derivative map" title="Directional derivative map" /></p>
<p><small>The complicated set of maps can be used to show how any directional derivative can be teased apart into scalars and partial derivatives.</small></p>
<p>Conceptually, we’ll be applying $\frac{\d}{\d\lambda}$ to $f$, but realistically, we need to compose with $\xi$ since $\xi$ is the thing that is parameterized by $\lambda$.</p>
\[\begin{align*}
\frac{\d}{\d\lambda}f&\to\frac{\d}{\d\lambda}(f\circ\xi)\\
&=\frac{\d}{\d\lambda}[(f\circ\phi^{-1})\circ(\phi\circ\xi)]\\
&=\frac{\partial}{\partial x^\mu}(f\circ\phi^{-1})\frac{\d}{\d\lambda}(\phi\circ\xi)^\mu\\
&=\frac{\d}{\d\lambda}(\phi\circ\xi)^\mu\,\partial_\mu(f\circ\phi^{-1})\\
&=\frac{\d x^\mu}{\d\lambda}\partial_\mu(f\circ\phi^{-1})\\
&\to\frac{\d x^\mu}{\d\lambda}\partial_\mu f\\
\end{align*}\]
<p>In the last step, we use the fact that $\phi$ has coordinates $x^\mu$. Now we can remove $f$ since it was arbitrary:</p>
\[\frac{\d}{\d\lambda}=\frac{\d x^\mu}{\d\lambda}\partial_\mu\]
<p>Now we’ve shown that we can decompose an arbitrary directional derivative $\frac{\d}{\d\lambda}$ into a scalar $\frac{\d x^\mu}{\d\lambda}$ and a vector $\partial_\mu$. Thus, the set of $n$ partial derivatives actually do form a basis for the tangent space and we have $n$ of them! It’s a little strange to think that an operator is a vector! (Maybe this is less surprising if you’ve taken any quantum mechanics and learned that operators can be represented as matrices.) In fact, this basis is so convenient that we give it a name: the <strong>coordinate basis</strong> $\hat{e}_{(\mu)}\equiv\partial_\mu$. We don’t have to use this basis, but it’s often easy and convenient. One important thing to note is that this basis is not orthonormal everywhere like Cartesian coordinates in a flat space. In fact, if that were the case, then we would actually have a flat space!</p>
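<p>The decomposition $\frac{\d}{\d\lambda}=\frac{\d x^\mu}{\d\lambda}\partial_\mu$ can be checked numerically for a concrete curve. In this hedged sketch, the curve, function, and step size are arbitrary choices for illustration:</p>

```python
import math

EPS = 1e-6

def f(x, y):
    """An arbitrary scalar function on the 'manifold' (here just R^2)."""
    return x * x * y + math.exp(y)

def xi(lam):
    """An arbitrary curve xi(lam) through the point of interest."""
    return (math.cos(lam), math.sin(lam))

lam = 0.4
x, y = xi(lam)

# left-hand side: d(f o xi)/dlam by central differences
lhs = (f(*xi(lam + EPS)) - f(*xi(lam - EPS))) / (2 * EPS)

# right-hand side: dx^mu/dlam contracted with partial_mu f
dx_dlam = (-math.sin(lam), math.cos(lam))  # derivative of xi, analytically
df_dx = (f(x + EPS, y) - f(x - EPS, y)) / (2 * EPS)
df_dy = (f(x, y + EPS) - f(x, y - EPS)) / (2 * EPS)
rhs = dx_dlam[0] * df_dx + dx_dlam[1] * df_dy

print(abs(lhs - rhs) < 1e-6)  # True
```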
<p>Given this basis, we can write out the general vector and basis transformation laws from the index notation (this isn’t exactly rigorous, but it works for now):</p>
\[\begin{align*}
\partial_{\mu'}&=\frac{\partial x^\mu}{\partial x^{\mu'}}\partial_{\mu}\\
V^{\mu'}&=\frac{\partial x^{\mu'}}{\partial x^{\mu}}V^{\mu}\\
\end{align*}\]
<p>Since we’re using a coordinate basis, the components will change when the basis changes, and a change of coordinates means a change of basis as well.</p>
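<p>As a concrete instance of these transformation laws, here’s a hedged Python sketch (all helper names are made up) that transforms vector components between Cartesian and polar coordinates using the two Jacobians $\frac{\partial x^{\mu'}}{\partial x^\mu}$ and $\frac{\partial x^\mu}{\partial x^{\mu'}}$, and confirms they round-trip:</p>

```python
import math

def jac_polar_from_cart(x, y):
    """Jacobian d(r, theta)/d(x, y) for V^{mu'} = (dx^{mu'}/dx^mu) V^mu."""
    r2 = x * x + y * y
    r = math.sqrt(r2)
    return [[x / r, y / r],        # dr/dx,     dr/dy
            [-y / r2, x / r2]]     # dtheta/dx, dtheta/dy

def jac_cart_from_polar(r, theta):
    """Inverse Jacobian d(x, y)/d(r, theta)."""
    return [[math.cos(theta), -r * math.sin(theta)],   # dx/dr, dx/dtheta
            [math.sin(theta), r * math.cos(theta)]]    # dy/dr, dy/dtheta

def apply(J, V):
    """Contract a 2x2 Jacobian with a 2-component vector."""
    return [J[0][0] * V[0] + J[0][1] * V[1],
            J[1][0] * V[0] + J[1][1] * V[1]]

x, y = 1.0, 1.0
r, theta = math.hypot(x, y), math.atan2(y, x)

V_cart = [2.0, -1.0]
V_polar = apply(jac_polar_from_cart(x, y), V_cart)   # components change...
V_back = apply(jac_cart_from_polar(r, theta), V_polar)  # ...but round-trip

print(all(math.isclose(a, b) for a, b in zip(V_back, V_cart)))  # True
```

<p>The components change under the coordinate change, but the vector itself — the geometric object — doesn’t, which is exactly what the round-trip demonstrates.</p>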
<p>So far, we’ve constructed the tangent space using partial derivatives as the basis vectors, but what about the cotangent space $T_p^* M$? How do we construct/define the basis for this space? Analogously to how we used the partials for the basis, we can use the gradients $\d x^\mu$ as the basis for $T_p^* M$. They used to be defined $\hat{\zeta}^{(\mu)}(\hat{e}_{(\nu)})=\delta^\mu_\nu$, but we’re going to upgrade them using our calculus notation:</p>
\[\d x^\mu(\partial_\nu)\equiv\delta^\mu_\nu=\frac{\partial x^\mu}{\partial x^\nu}\]
<p>In this case, $\d x$ is not an infinitesimal, but actually a kind of object called a differential form (specifically a one-form, also known as a gradient). A <strong>differential form</strong> is a $(0, p)$ antisymmetric tensor; a $0$-form is a scalar or scalar function, and a $1$-form is a gradient. There’s more work we have to do to discuss differential forms, so, for now, it’s ok to think of these as just dual vectors. From the definition, the set of gradients also form a basis for the cotangent space. (We can go through a similar process to apply the one-forms to vectors and show this, but it looks very similar to vectors so I’m going to skip it.) Similar to vectors, we can derive the transformation laws.</p>
\[\begin{align*}
\d x^{\mu'}&=\frac{\partial x^{\mu'}}{\partial x^\mu}\d x^\mu\\
\omega_{\mu'}&=\frac{\partial x^\mu}{\partial x^{\mu'}}\omega_\mu
\end{align*}\]
<p>Now that we’ve re-invented vectors and duals using the language of manifolds, we’re ready to construct tensors. As you might think, this construction follows straightforwardly from the construction in flat space: we take the tensor product of the basis vectors (partial derivatives) and duals (gradients).</p>
\[\begin{align*}
T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l}&=T(\d x^{\mu_1}, \cdots, \d x^{\mu_k}, \partial_{\nu_1}, \cdots, \partial_{\nu_l})\\
T&=T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l} \partial_{\mu_1}\otimes\cdots\otimes\partial_{\mu_k}\otimes\d x^{\nu_1}\otimes\cdots\otimes\d x^{\nu_l}\\
T^{\mu_1'\cdots\mu_k'}_{\nu_1'\cdots\nu_l'}&=\frac{\partial x^{\mu_1'}}{\partial x^{\mu_1}}\cdots\frac{\partial x^{\mu_k'}}{\partial x^{\mu_k}}\frac{\partial x^{\nu_1}}{\partial x^{\nu_1'}}\cdots\frac{\partial x^{\nu_l}}{\partial x^{\nu_l'}}T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l}
\end{align*}\]
<p>Almost everything is the same as it was in a flat space, except we upgraded our basis vectors and duals to partial derivatives and gradients (this also technically works in a flat space but is a bit overkill in that context). Just as with flat space, we have the metric tensor $g_{\mu\nu}$.</p>
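<p>As a worked instance of the $(0,2)$ transformation law, the sketch below (an illustration, not from the text) transforms the flat Euclidean metric $\delta_{\mu\nu}$ into polar coordinates via $g_{\mu'\nu'}=\frac{\partial x^\mu}{\partial x^{\mu'}}\frac{\partial x^\nu}{\partial x^{\nu'}}\delta_{\mu\nu}$ and recovers $\mathrm{diag}(1, r^2)$:</p>

```python
import math

def polar_metric(r, theta):
    """Transform the flat metric delta_{mu nu} into polar components."""
    # Jacobian dx^mu/dx^{mu'}: columns are d(x,y)/dr and d(x,y)/dtheta
    J = [[math.cos(theta), -r * math.sin(theta)],
         [math.sin(theta), r * math.cos(theta)]]
    g = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):          # primed index mu'
        for b in range(2):      # primed index nu'
            for m in range(2):  # unprimed index, contracted with delta
                g[a][b] += J[m][a] * J[m][b]
    return g

g = polar_metric(2.0, 0.9)
# expect g_rr = 1, g_thth = r^2 = 4, off-diagonals zero
print(math.isclose(g[0][0], 1.0)
      and math.isclose(g[1][1], 4.0)
      and abs(g[0][1]) < 1e-12)  # True
```

<p>The result is the familiar polar line element $\d s^2=\d r^2+r^2\d\theta^2$, obtained purely from the tensor transformation law.</p>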
<p>The last thing I’ll point out is a small nuance with notation. Recall the polar-coordinate metric</p>
\[\d s^2=\d r^2+r^2\d\theta^2\]
<p>$\d s^2$ is just a symbol, but $\d r$ and $\d\theta$ are honest basis one-forms ($\d r^2$ is shorthand for $\d r\otimes\d r$). That being said, for this case, our use of basis one-forms is consistent with the infinitesimal philosophy for now.</p>
<p>I’ll end on some nomenclature that is popular in other sources (as well as some foreshadowing). A metric is said to be in <strong>canonical form</strong> if it is written as $g_{\mu\nu}=\mathrm{diag}(-1,\cdots,-1,+1,\cdots,+1,0,\cdots,0)$ where $\mathrm{diag}$ is a diagonal matrix with the diagonal entries as the arguments to the function. At a point, it’s always possible to put the metric in this form: for a point $p\in M$, there exist coordinates $x^{\hat{\mu}}$ such that $g_{\hat{\mu}\hat{\nu}}$ is canonical and $\partial_\hat{\sigma}g_{\hat{\mu}\hat{\nu}}=0$. In other words, at $p$ the metric looks flat and its first derivatives vanish. Coordinates that satisfy these conditions are called <strong>Riemann Normal Coordinates</strong>:</p>
\[\begin{align*}
g_{\hat{\mu}\hat{\nu}}(p)&=\delta_{\hat{\mu}\hat{\nu}}\\
\partial_\hat{\sigma}g_{\hat{\mu}\hat{\nu}}(p)&=0
\end{align*}\]
<p>This gives us a convenient set of coordinates to work in initially, then we can generalize using tensor notation. If we can show our equation is true in this coordinate system, then it must be true in all coordinate systems because a tensor equation is true in all coordinate systems. We’ll need some extra machinery to make this claim, but it stands nonetheless.</p>
<p>One last bit of terminology is the <strong>metric signature</strong>: the number of positive and negative eigenvalues of the metric. A metric is <strong>Euclidean/Riemannian/positive-definite</strong> if all eigenvalues are positive. This is the signature for most mathematical manifolds. A metric is <strong>Lorentzian/pseudo-Riemannian</strong> if it has exactly one negative eigenvalue and the rest are positive. This is the metric used in relativity as the metric of spacetime, with the negative eigenvalue acting as the time coordinate. (Alternatively, we could flip the spacetime metric to have three negative eigenvalues for the spatial components and a positive eigenvalue for the temporal component.) A metric is <strong>indefinite</strong> if it has a mixture of positive and negative eigenvalues. A metric is <strong>degenerate</strong> if it has any zero eigenvalues; note that this means an inverse metric doesn’t exist. If a metric is continuous and non-degenerate, its <em>signature</em> is the same everywhere. In other words, if we start in a Lorentzian spacetime, the metric is non-degenerate and continuous so spacetime stays Lorentzian everywhere (at least, that’s what we think now). In practice, we don’t usually deal with indefinite or degenerate metrics; in fact, in relativity, we always assume a non-degenerate metric because a degenerate one wouldn’t be terribly useful in the first place!</p>
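<p>This taxonomy is easy to express in code. A hedged sketch (the function name and classification thresholds are my own) that buckets a metric by the signs of its eigenvalues — passed directly here, e.g., the diagonal of a metric in canonical form:</p>

```python
def signature(eigenvalues, tol=1e-12):
    """Classify a metric by the signs of its eigenvalues."""
    neg = sum(1 for v in eigenvalues if v < -tol)
    zero = sum(1 for v in eigenvalues if abs(v) <= tol)
    if zero > 0:
        return "degenerate"          # no inverse metric exists
    if neg == 0:
        return "Riemannian"          # all eigenvalues positive
    if neg == 1:
        return "Lorentzian"          # exactly one negative: spacetime
    return "indefinite"              # some other mixture of signs

print(signature([1, 1, 1]))          # Riemannian
print(signature([-1, 1, 1, 1]))      # Lorentzian
print(signature([0, 1]))             # degenerate
print(signature([-1, -1, 1]))        # indefinite
```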
<h1 id="conclusion">Conclusion</h1>
<p>In this post, we learned how to construct a manifold from fundamental objects like sets and how to re-invent vectors, duals, and tensors on the manifold. Let’s take a second to review what we’ve learned in this part:</p>
<ul>
<li>Manifolds are constructed by smoothly sewing together the charts of an atlas: open sets, each with a coordinate map sending the set into a flat space $\R^n$.</li>
<li>The tangent space consists of directional derivative operators, for which the partial derivatives with respect to the coordinates form a convenient basis.</li>
<li>The cotangent space consists of gradients (one-forms) which are defined analogously to basis covectors in flat space.</li>
<li>General tensors are still the tensor product of tangent and cotangent spaces.</li>
</ul>
<p>In the next installment, we’ll discuss the most important property of a manifold: curvature 😀</p>In the second part, I'll construct a manifold from scratch and redefine vectors, dual vectors, and tensors on a manifold.Manifolds - Part 12021-02-10T00:00:00+00:002021-02-10T00:00:00+00:00/manifolds-part-1<p>Manifolds! This might be an esoteric word you’ve heard in the most arcane of contexts, but manifolds are really interesting structures that are useful in a variety of fields like mathematics (obviously!), physics, robotics, computer vision, chemistry, and computer graphics. My own motivation to study manifolds stems from general relativity. In that context, spacetime is defined as a 4D manifold with 3 spatial components and 1 temporal component. Almost every interesting structure that arises in general relativity is a result of the manifold structure (specifically the metric). The goal of this post is to introduce the machinery that is the Riemannian manifold!</p>
<p>Manifolds are a fairly large topic so here’s an overview of the big picture that we’ll be discussing:</p>
<ol>
<li>Vectors, dual vectors, and general tensors in a flat, Euclidean space.</li>
<li>Construction of manifolds and curved spaces</li>
<li>Vectors, dual vectors, and general tensors again, but on a manifold this time</li>
<li>Geodesics and Curvature.</li>
</ol>
<p>In the interest of accessibility, I’ll assume you’re comfortable with multivariable calculus (differentiation, integration, parametrization of curves) and linear algebra (vector spaces, determinants, bases, linear transforms). Although some of my examples do include physics, I won’t assume any prior knowledge.</p>
<h1 id="introduction">Introduction</h1>
<p>To motivate our discussion, consider airplane pilots charting a course on the Earth, i.e., approximately a sphere in 3 dimensions. The paths they take between two cities do not look like straight-line paths on a paper map. We know that the shortest distance between any two points in a Euclidean space is a line; so why don’t pilots chart courses that are straight lines on the map? This is because the Earth is a <em>curved surface</em>; it’s not like a flat Euclidean space. As we’ll prove later, the shortest path between any two points on a sphere is along the <strong>great circle</strong>, i.e., the circle along the surface of the Earth such that the center of the circle is the center of the Earth. So pilots are using this fact to chart courses that take the least amount of time and fuel to get from one city to another.</p>
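<p>As a quick illustration (the city coordinates below are rough approximations, chosen only for demonstration), the great-circle distance pilots minimize can be computed with the haversine formula:</p>

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine formula: distance along the great circle through two
    points on a sphere, given latitudes/longitudes in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# roughly New York (40.7N, 74.0W) to London (51.5N, 0.1W)
print(great_circle_km(40.7, -74.0, 51.5, -0.1))  # on the order of 5,500-5,600 km
```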
<p>Another way to convince yourself that a sphere is a curved surface is to consider what happens to neighboring parallel lines. Suppose we consider two adjacent longitudes at the equator. Place two people next to each other at the equator and ask them to point North. They’ll start off parallel since they’re neighboring, but, as they move towards the North pole, the directions they’re pointing in will intersect until they end up at the North pole pointing in different directions! This kind of scenario can never happen in a Euclidean space: lines that start out parallel will forever stay parallel in a Euclidean space. So clearly, a sphere, intrinsically, is not a Euclidean space! However, the fact that we could have two “close enough” people initially parallel suggests that for “close enough” distances, we can think of the sphere as being like a flat Euclidean space.</p>
<p>This is exactly the intuition behind a <strong>manifold</strong>! Informally, it’s a kind of space that is locally flat/Euclidean, but, globally, it might have a more complicated, non-Euclidean structure. Our goal is to understand these structures by re-inventing the things we know how to do in Euclidean space, namely differentiation and integration, on a manifold.</p>
<p>As we’ve learned in multivariable calculus, most of the interesting things we do in Euclidean space involve <em>vectors</em> so I want to start close to there. However, instead of just dealing with ordinary vectors, we’re going to upgrade them to <strong>tensors</strong>, which can be thought of as a generalization of vectors, i.e., a vector is a special kind of tensor. Working with tensors versus sticking with vectors won’t seem immediately useful until we discuss how to construct a manifold, but, in general, we can think of tensors as being the most basic, fundamental object in geometry.</p>
<p>Tensors might be something you’ve heard of, especially in machine learning (e.g., Tensorflow). In that context, tensors are taken to be multidimensional arrays. This definition works in that context, but it forgoes the geometric properties of a tensor, which are critical to the appeal of using them. It also conflates the components of a tensor with the abstract, geometric object that is the tensor itself. (That being said, we can sometimes interpret the multidimensional arrays in machine learning as being geometric transformations between spaces, ending at the space where our training data is linearly separable; but that’s a topic for a different time 😉.)</p>
<p>In addition to being a basic building block in geometry, another reason we like tensors is because <em>a tensor equation is true in all coordinate systems!</em> As you can imagine, this is an incredibly useful fact: if we have a tensor equation, we can work in whatever coordinate system we want and, if expressed correctly, end up with the right answer. When working with manifolds, or just curved coordinates, e.g., polar coordinates, in general, there isn’t always a canonical set of coordinates like in Euclidean space (e.g., if I write the vector $\begin{bmatrix}1 & 1 & 1\end{bmatrix}^T$, you know exactly what I mean and can visualize it in your head). Tensors allow us to write equations and work with different quantities in a coordinate-free way. (After all, I don’t want to write this post a hundred times for a hundred different coordinate systems!) For non-mathematical (and sometimes even mathematical) uses, at the end of the day, we’ll have to pick coordinates to fully understand or implement the structure we’re working with, but we can work out all of the theory independent of coordinates in case we change our mind.</p>
<p>So I’ve motivated why we’re starting with tensors and why we like them, but I haven’t actually given a useful definition or construction of one, although I’ve given an example of a not-so-good definition for our purposes. We know tensors should have the same properties, or at least more general properties, as vectors since they’re a generalization of them; however, tensors are also built from another kind of geometric object we’re going to discuss called a <em>dual vector</em> (often shortened to just “duals”) or <em>covariant vector</em>/<em>covector</em>. In a Euclidean space, we don’t need to discuss duals, but we lose that convenience in a non-Euclidean space. As we’ll see, every vector space implies the existence of a corresponding dual vector space, so it’s more powerful to build tensors from both vectors and duals since they’re separate-but-complementary objects in a non-Euclidean space. Furthermore, we know from linear algebra that vectors transform in a very specific way: with a transformation matrix. Similarly, we want tensors to also have this property since it’s essential for tensor equations to look the same in all coordinates.</p>
<p>Putting all of these pieces together, we arrive at our definition of a <strong>tensor</strong>: <em>a multilinear map constructed from vectors and dual vectors that obey the tensor transformation law</em>. This is also perhaps not the most useful definition at this point, but we’re going to dissect each piece and initially work in the Euclidean space we’re all very comfortable with.</p>
<h1 id="vectors">Vectors</h1>
<p>Let’s first review vectors in plain Euclidean space. There are a couple of definitions people refer to when thinking of a vector. Probably the most common is a displacement arrow with a magnitude and direction that can be slid around the space. This isn’t a bad definition for some uses, but it’s not a very good one for our case. As a counterexample, suppose we have curvy coordinates (like polar coordinates!) on a flat piece of paper. What does a vector look like in this space? Is the arrow straight? Or is it curved along the coordinates?</p>
<p><img src="/images/manifolds-part-1/tangent-space.svg" alt="Vectors" title="Vectors" /></p>
<p><small>In polar coordinates, does a vector look like the image on the left where it is straight in a curved coordinate system or is it curved along the curved coordinates, like the image on the right?</small></p>
<p>The vector being curved doesn’t really match with the “displacement” notion. To get around this problem, let’s define vectors only at a point $p$. In fact, let’s take all possible magnitudes and directions at the point $p$ and construct the <strong>tangent space</strong> $T_p$ of vectors. This circumvents our problem with curved coordinates by only defining vectors in the tangent space <em>at a point</em>. Gone are the days we can think of vectors as displacements or sliding around a space! The tangent space is an honest vector space, but we’ll prove this more formally later on.</p>
<p>As a reminder, a vector space is a collection of objects such that they can be linearly scaled and added to get another element in the collection. Mathematically, if $U$ and $V$ are elements of a vector space and $a,b\in\mathbb{R}$, then</p>
\[\begin{equation}
(a+b)(U+V) = aU + bU + aV + bV
\label{eq:vector_space}
\end{equation}\]
<p>There are a few other properties, like the existence of an identity element, but I think Equation $\eqref{eq:vector_space}$ is the most important. So these vectors live in an honest vector space called the tangent space (the “tangent” in the name will become apparent later). Any abstract vector can be decomposed into a set of components and basis vectors. These <strong>basis vectors</strong> must (1) have the same dimensionality as the vector space, (2) be linearly independent, and (3) span the space.</p>
<ol>
<li>If the basis vectors weren’t of the same dimensionality as the space, then we couldn’t decompose an arbitrary vector into these basis vectors because vector addition of two vectors of different dimensionality is ill-defined.</li>
<li><strong>Linearly independent</strong> means we can’t write one basis vector as a linear combination of the others. If we could, then why would we need that one in the first place? If our basis were $u, v, w$ and $w=u+v$, then everywhere we use $w$, we could just use $u+v$; $w$ is redundant.</li>
<li><strong>Spanning the space</strong> means that every vector in the space can be written as a linear combination of the basis vectors. If they didn’t span the space, then there would exist vectors in our space that we couldn’t construct! For any vector space, there are an infinite number of bases we could select. In a Euclidean space, we usually stick with the canonical basis.</li>
</ol>
\[\begin{Bmatrix} \begin{bmatrix}1\\ 0\\ 0\\ \vdots\\ 0\end{bmatrix}, \begin{bmatrix}0\\ 1\\ 0\\ \vdots\\ 0\end{bmatrix}, \cdots, \begin{bmatrix}0\\ 0\\ 0\\ \vdots\\ 1\end{bmatrix}\end{Bmatrix}\]
<p>In general, on a manifold, there usually isn’t such a convenient basis everywhere. Instead, let’s assume we have some arbitrary basis vectors
$\hat{e}_{(\mu)}$ where $\mu=0,\cdots,n$ is an index that iterates over the dimensionality of our space $n$. (I’m using Sean Carroll’s notation where the parentheses in the subscript denote a <em>set</em> of vectors.) Then we can decompose a vector into components and basis vectors.</p>
\[\begin{equation}
V = V^\mu \hat{e}_{(\mu)} = V^0\hat{e}_{(0)} + V^1\hat{e}_{(1)} + \cdots + V^n\hat{e}_{(n)}
\end{equation}\]
<p>where $V$ is the abstract vector and $V^\mu$ are the <strong>components</strong>. There’s a bit of notation to unpack here. The superscripts aren’t exponents, but indices. There’s an importance to index placement, i.e., upper versus lower, but we won’t fully see that until we discuss dual vectors. We’re using Einstein summation convention where we sum over repeated upper and lower indices. (Einstein himself claimed that this summation convention was one of his most important contributions!)</p>
<p><img src="/images/manifolds-part-1/abstract-vector.svg" alt="Abstract vector" title="Abstract vector" /></p>
<p><small>A vector is a geometric object that exists independent of coordinates. We can impose a basis and look at the components of the vector in that basis. Changing bases changes components, but the abstract, geometric object is left unchanged.</small></p>
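<p>To make the summation convention concrete, here’s a small numpy sketch (the components are arbitrary example values) that reassembles a vector from its components and basis vectors with <code>np.einsum</code>:</p>

```python
import numpy as np

# Arbitrary example: components V^mu in the canonical basis e_(mu).
components = np.array([2.0, 3.0])            # V^mu
basis = np.array([[1.0, 0.0], [0.0, 1.0]])   # e_(mu) stored as rows

# V = V^mu e_(mu): einsum sums over the repeated index mu ('m').
V = np.einsum('m,mi->i', components, basis)
assert np.allclose(V, [2.0, 3.0])
```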
<p>It’s important to distinguish the vector from its components. The vector is a geometrical object that is independent of any coordinates or basis. However, the <em>components</em> of the vector are dependent on the choice of basis. In linear algebra, we learned we can transform the components of vectors with a linear transform by multiplying by a matrix. We can express the same transformation law in our new summation convention like this.</p>
\[\begin{equation}
V^{\mu} = \Lambda_{\mu'}^{\mu} V^{\mu'}
\label{eq:vector_transform_law}
\end{equation}\]
<p>where $\Lambda_{\mu’}^{\mu}$ is the linear transformation matrix. We’re representing different coordinates with a primed index $\mu’$ rather than a primed variable to emphasize that the geometric vector is still the same but the coordinates are transformed. The other notational thing about tensor equations is that the upper and lower indices on each side of the equation must match. In the above case, notice how the summed out index $\mu’$ and free index $\mu$ match up. The right-hand-side has no $\mu’$ because it’s a dummy variable that’s being summed over; the left-hand-side has an upper $\mu$ to match the one in $\Lambda_{\mu’}^{\mu}$. This is also a really useful tool to catch mistakes in equations: the indices don’t match!</p>
<p>So Equation $\eqref{eq:vector_transform_law}$ allows us to change coordinates to get new vector components. This lets us work in a more convenient basis for computations, then convert our answer to whichever basis we need. But there’s another way to transform the vector! Remember that the components are a function of the basis so changing the basis imposes a change in components! But how does the basis transform? We can derive the transformation law by using the property that an abstract vector $V$ is invariant under a coordinate change and relate the components and basis.</p>
\[\begin{align*}
V = V^{\mu}\hat{e}_{(\mu)} &= V^{\mu'}\hat{e}_{(\mu')}\\
\Lambda_{\mu'}^{\mu} V^{\mu'} \hat{e}_{(\mu)}&= V^{\mu'}\hat{e}_{(\mu')}\tag*{Apply Equation \eqref{eq:vector_transform_law}}\\
\end{align*}\]
<p>But $V^{\mu’}$ is arbitrary so we can get rid of it.</p>
\[\Lambda_{\mu'}^{\mu} \hat{e}_{(\mu)} = \hat{e}_{(\mu')}\]
<p>Now to solve for the $\mu$ basis in terms of the $\mu’$ basis, we need to multiply by the inverse matrix $\Lambda_\mu^{\mu’}$, which is still a valid linear transform. The resulting transformation law for basis vectors can be written as the following.</p>
\[\begin{equation}
\hat{e}_{(\mu)} = \Lambda_\mu^{\mu'}\hat{e}_{(\mu')}
\end{equation}\]
<p>Notice the indices are in the right place! To transform the basis, we have to multiply by the <em>inverse</em> of the matrix used to transform the components. Here’s another way to express that these matrices are inverses.</p>
\[\Lambda_{\nu'}^{\mu}\Lambda_{\rho}^{\nu'} = \delta_\rho^\mu\]
<p>where $\delta_\mu^\nu$ is the Kronecker Delta that is equal to 1 if $\mu=\nu$ and 0 otherwise. (This is the Einstein summation convention equivalent of the linear algebra definition of inverse: $AA^{-1}=A^{-1}A=I$ where $I$ is the identity matrix.)</p>
<p>To review, we have transformation laws for the vector components and the basis vectors.</p>
\[\begin{align}
V^{\mu} &= \Lambda_{\mu'}^{\mu} V^{\mu'}\\
\hat{e}_{(\mu)} &= \Lambda_\mu^{\mu'}\hat{e}_{(\mu')}
\end{align}\]
<p><em>Vector components transform in the opposite way as the basis vectors</em>! In other words, doubling the basis vectors halves the components. Historically, since vector components transform with $\Lambda_{\mu’}^{\mu}$, they are sometimes called <strong>contravariant vectors</strong>. Nowadays, we just call them vectors with upper indices.</p>
<p>Let’s look at a numerical example of these transformation laws. Suppose we have a vector $\begin{bmatrix}1 & 1\end{bmatrix}^T$ in the canonical Cartesian basis. Now let’s double the basis and see what happens to the components; this operation corresponds to applying the following transformation matrix to the basis vectors.</p>
\[\Lambda_{\mu'}^{\mu} = \begin{bmatrix}
2 & 0\\
0 & 2
\end{bmatrix}\]
<p>Try this out for yourself. Apply this matrix to each canonical basis vector and check the result is twice the length. With some linear algebra (or MATLAB/numpy), the inverse matrix to apply to the components is the following.</p>
\[\Lambda_{\mu}^{\mu'} = \begin{bmatrix}
\frac{1}{2} & 0\\
0 & \frac{1}{2}
\end{bmatrix}\]
<p>Indeed doubling the basis vectors halves the components: the basis vectors and vector components transform in the opposite way!</p>
<p><img src="/images/manifolds-part-1/basis-vectors.svg" alt="Basis vectors" title="Basis vectors" /></p>
<p><small>When we double the basis vectors, the components are halved because they transform inversely to each other.</small></p>
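<p>We can verify this numerically. The sketch below (plain numpy, with the same vector as above) doubles the canonical basis and checks that the components halve while the abstract vector is unchanged:</p>

```python
import numpy as np

# Components of our vector in the canonical Cartesian basis.
components = np.array([1.0, 1.0])
basis = np.eye(2)                  # basis vectors as columns

# Doubling the basis vectors...
new_basis = 2.0 * basis
# ...forces the components to transform with the inverse matrix.
new_components = np.linalg.inv(2.0 * np.eye(2)) @ components

# The components are halved, and the abstract vector is unchanged.
assert np.allclose(new_components, [0.5, 0.5])
assert np.allclose(basis @ components, new_basis @ new_components)
```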
<p>A slightly more abstract example that we’ll see all the time is a vector tangent to a curve. Suppose we have a parameterized curve $x^\mu(\lambda) : \mathbb{R}\to M$ where $\lambda$ is the parameter and $M$ is the manifold. (Note that the definition of a curve in a space $V$ is a function $\gamma : \mathbb{R}\to V$.) Einstein convention used here means we have a function for each component of the space $x^0(\lambda), x^1(\lambda), \cdots, x^n(\lambda)$. Then we can take a derivative with respect to $\lambda$ to get the tangent-to-the-curve vector $\frac{\mathrm{d}x^\mu(\lambda)}{\mathrm{d}\lambda}$.</p>
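<p>As a sanity check on this tangent-vector picture, here’s a sketch that numerically differentiates a parameterized unit circle (the specific curve, point, and step size are arbitrary choices):</p>

```python
import numpy as np

def curve(lam):
    """A parameterized curve x^mu(lambda): here, the unit circle."""
    return np.array([np.cos(lam), np.sin(lam)])

# Central-difference approximation of the tangent vector dx^mu/dlambda.
lam, h = np.pi / 4, 1e-6
tangent = (curve(lam + h) - curve(lam - h)) / (2 * h)

# Analytically, the tangent is (-sin(lambda), cos(lambda)).
assert np.allclose(tangent, [-np.sin(lam), np.cos(lam)], atol=1e-6)
```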
<h1 id="dual-vectors">Dual Vectors</h1>
<p>In the original definition of tensors I gave, there was another kind of object I said tensors were comprised of: dual vectors. Pedagogically, I’ve found that duals are difficult to motivate without either starting from a bare definition or leaning on flimsy motivation, but I’ll do my best to draw on what we’ve learned so far.</p>
<p>We saw that the transformation law for vectors means that doubling the basis vectors halves the components. The natural question arises: “are there geometric objects such that doubling the basis vectors doubles their components?” It turns out there are! And, in fact, these objects are the second part of constructing tensors: dual vectors!</p>
<p>To discuss dual vectors, I’ll start by saying a good way to understand structures in mathematics is to look at maps between them. In our specific case, we can try to understand our vector space $V$ better by looking at linear maps from it to the reals ${ \omega : V\to \mathbb{R}}$. In other words, a linear map $\omega$ <em>acts on</em> a vector to produce a scalar $\omega(V)\in\mathbb{R}$. This itself creates a new kind of space <em>dual</em> to the tangent space called the <strong>cotangent space</strong> $T_p^*$ at a point $p$. It’s constructed from all possible linear maps from the corresponding vector space to the reals. As it turns out, this is also a vector space! (Remember that many things in mathematics form a vector space; “vector space” is really a misnomer since plenty of things obey the properties required of a vector space besides conventional vectors.) If we have two linear maps $\omega$ and $\eta$ in the cotangent space and $a,b\in\mathbb{R}$, then</p>
\[(a+b)(\omega+\eta)(V) = a\omega(V) + b\omega(V) + a\eta(V) + b\eta(V)\]
<p>Since these functions are linear, we can express them as a collection of numbers. In linear algebra, we learned all linear operators and functions can be expressed as a “matrix times the vector input”. Since the input here is a vector and the output a scalar, duals can be thought of as <em>row vectors</em>. We’ll circle back to this interpretation soon, but the key point is that we can represent duals in the same way we represent vectors: as components in a basis.</p>
<p>The basis for the cotangent space is defined to be $\hat{\varepsilon}^{(\nu)}$ such that the following property holds.</p>
\[\hat{\varepsilon}^{(\nu)}(\hat{e}_{(\mu)})\triangleq\delta_{\mu}^{\nu}\]
<p>Therefore, we can write a general dual vector as a combination of components and basis duals.</p>
\[\begin{equation}
\omega = \omega_\mu\hat{\varepsilon}^{(\mu)}
\end{equation}\]
<p>As we’ve discussed before, we can act a dual vector on a vector to get a scalar.</p>
\[\begin{align*}
\omega(V) &= \omega_\mu\hat{\varepsilon}^{(\mu)}(V^\nu\hat{e}_{(\nu)})\\
&= \omega_\mu V^\nu \hat{\varepsilon}^{(\mu)}(\hat{e}_{(\nu)})\\
&= \omega_\mu V^\nu \delta_\nu^\mu\\
&= \omega_\mu V^\mu\in\mathbb{R}
\end{align*}\]
<p>One good intuition to note is that applying the Kronecker Delta essentially “replaces” indices. We can think of the third line as “applying” the Kronecker Delta to $V^\nu$ to swap $\nu$ with $\mu$.</p>
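<p>In code, acting a dual on a vector is just the Einstein sum $\omega_\mu V^\mu$. A minimal numpy sketch with made-up components:</p>

```python
import numpy as np

omega = np.array([1.0, 1.0])   # dual components omega_mu
V = np.array([3.0, -2.0])      # vector components V^mu

# omega(V) = omega_mu V^mu: sum over the repeated index.
result = np.einsum('m,m->', omega, V)
assert np.isclose(result, 1.0)   # 1*3 + 1*(-2) = 1
```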
<p>Similar to vectors, we can derive the transformation laws for dual vectors; specifically, we can use the index notation to our advantage to figure out the right matrices.</p>
\[\begin{align}
\omega_{\mu} &= \Lambda_{\mu}^{\mu'}\omega_{\mu'}\\
\hat{\varepsilon}^{(\mu)} &= \Lambda_{\mu'}^{\mu}\hat{\varepsilon}^{(\mu')}
\end{align}\]
<p>Notice that the dual components transform using the same matrix as the basis vectors $\Lambda_{\mu}^{\mu’}$. For this reason, historically, they are sometimes called <strong>covariant vectors</strong> or <strong>covectors</strong> for short. Nowadays, we just call them vectors with lower indices. (I’ll use duals and covectors interchangeably.)</p>
<p>We’ve discussed the theory of dual vectors but I haven’t given you a geometric description or picture of one yet. If we visualize vectors as arrows, we can visualize a dual as a stack of oriented lines/hyperplanes!</p>
<p><img src="/images/manifolds-part-1/basis-duals.svg" alt="Basis duals" title="Basis duals" /></p>
<p><small>The top row shows the canonical basis $x$ and $y$ duals. The bottom row shows the $x$ and $y$ basis vectors with the dual basis as well.</small></p>
<p>To act a dual on a vector to produce a scalar, we simply count how many lines the vector pierces and that gets us our scalar. As with the transformation law, if we double the basis vectors, the dual’s components are also doubled. Graphically, this corresponds to the stack of lines getting more dense so the vector pierces more lines.</p>
<p><img src="/images/manifolds-part-1/dual-action.svg" alt="Dual action" title="Dual action" /></p>
<p><small>To figure out the components of a dual, act the basis vectors on it. The dual pictured above has the components $\begin{bmatrix}1 & 1\end{bmatrix}$.</small></p>
<p>Yet another way to think about duals is algebraically: we can think of dual vectors as row vectors while vectors are column vectors. Only in Cartesian coordinates can we simply transpose one to get the other. In general, a column vector and a row vector are two fundamentally different objects: they transform differently! (After we introduce the metric tensor, we can use it to convert freely between vectors and duals, but, since the metric’s components aren’t usually the identity, the conversion usually modifies the components.) Let’s look at an algebraic example: the dual with components 1 and 1 would be written as the row vector $\begin{bmatrix} 1 & 1\end{bmatrix}$. Acting a dual on a vector then becomes matrix multiplication.</p>
\[\begin{align*}
\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix} &= 1\\
\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} &= 1
\end{align*}\]
<p>A more important example of a dual vector is the gradient! In multivariable calculus, we learned the gradient of a scalar function $f$ produces a vector field $\nabla f$. However, the gradient is really a dual vector because of the way it transforms! Suppose we have a scalar function $f$, then we can define the gradient with the following notation:</p>
\[\mathrm{d}f = \frac{\partial f}{\partial x^\mu}\hat{\varepsilon}^{(\mu)}\]
<p>(Coarsely, upper indices in the denominator become lower indices.) There’s a much deeper meaning to $\mathrm{d}f$: $\mathrm{d}$ is an exterior derivative operator that promotes the function $f$ from a <em>0-form</em> to a <em>1-form</em>. We’ll discuss more about differential forms when we re-invent duals on a manifold. Getting back to why the gradient $\mathrm{d}f$ is a dual and not a vector, let’s apply a transformation to change coords from $x^{\mu’}$ to $x^\mu$. The components must transform like the following to preserve index notation:</p>
\[\begin{align*}
(\mathrm{d}f)_{\mu} &= \frac{\partial f}{\partial x^\mu} = \Lambda_{\mu}^{\mu'}\frac{\partial f}{\partial x^{\mu'}}\\
&= \partial_{\mu}f = \Lambda_{\mu}^{\mu'}\partial_{\mu'}f\\
\end{align*}\]
<p>Notice that this is exactly how the components of a dual transform, with $\Lambda_{\mu}^{\mu’}$! I’ve also introduced a new notational shorthand that we’ll frequently use: $\partial_\mu f = \frac{\partial f}{\partial x^\mu}$.</p>
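<p>We can check that the gradient’s components transform this way via the chain rule. A numpy sketch, using $f(x,y)=x^2+y^2$ and an arbitrary polar point as the example:</p>

```python
import numpy as np

r, theta = 1.5, 0.7                          # arbitrary point in polar coords
x, y = r * np.cos(theta), r * np.sin(theta)

# Cartesian gradient components of f(x, y) = x^2 + y^2.
grad_cart = np.array([2 * x, 2 * y])

# The transformation matrix here is the Jacobian d(x, y)/d(r, theta);
# dual components transform with it (chain rule).
J = np.array([[np.cos(theta), -r * np.sin(theta)],
              [np.sin(theta),  r * np.cos(theta)]])
grad_polar = J.T @ grad_cart

# In polar coordinates f = r^2, so (df/dr, df/dtheta) = (2r, 0).
assert np.allclose(grad_polar, [2 * r, 0.0])
```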
<h1 id="tensors">Tensors</h1>
<p>With vectors and duals covered, we can revisit our definition of tensors: <em>a multilinear map constructed from vectors and dual vectors that obey the tensor transformation law</em>. This makes a bit more sense now, but we’re going to fill in the gaps. With duals, we thought of them as linear functions that sent elements of our vector space to the reals; with tensors, we can think of them as multilinear functions, i.e., linear in each argument, that sends multiple vectors and duals to the reals. The <strong>rank</strong> of a tensor tells us how many of each the tensor takes: a rank $(k, l)$ tensor maps $k$ duals and $l$ vectors to the reals:</p>
\[T : \underbrace{T^*_p\times\cdots\times T^*_p}_{k}\times \underbrace{T_p\times\cdots\times T_p}_{l}\to\mathbb{R}\]
<p>To see the multilinearity property, consider a rank $(1,1)$ tensor $T(\omega, V)$ and some scalars $a,b,c,d\in\mathbb{R}$:</p>
\[\begin{align*}
T(a\omega + b\eta, V) &= aT(\omega, V) + bT(\eta, V)\\
T(\omega, aU + bV) &= aT(\omega, U) + bT(\omega, V)
\end{align*}\]
<p>which we can write more compactly as:</p>
\[T(a\omega + b\eta, cU + dV) = acT(\omega, U) + adT(\omega, V) + bcT(\eta, U) + bdT(\eta, V)\]
<p>Thus, the entire tensor itself is linear since linear combinations of already linear things like vectors and duals produce linear things like tensors. Therefore, we should be able to decompose a tensor into its components in a particular basis. But how do we construct a basis? Well, we know the bases for vector and dual spaces, and tensors are comprised of both, so we need to somehow “combine” those bases into a single one. The operation we need is the <strong>tensor product</strong> $\otimes$, which allows us to build higher-rank tensors from lower-rank ones. Regarding ranks, the tensor product has the following property: $(k, l)\otimes(m,n)\to(k+m,l+n)$. Therefore, we can construct the basis for a higher-rank tensor by taking the tensor product of the basis vectors and basis duals.</p>
\[\hat{e}_{(\mu_1)}\otimes\cdots\otimes\hat{e}_{(\mu_k)}\otimes\hat{\varepsilon}^{(\nu_1)}\otimes\cdots\otimes\hat{\varepsilon}^{(\nu_l)}\]
<p>To get a better understanding of the tensor product, let’s once again look at the algebraic interpretation of the basis vectors and basis duals. For simplicity, let’s use the canonical Cartesian basis vectors and basis duals, and, as an example, suppose we want to construct a rank $(1, 1)$-tensor, $\Lambda^\mu_{\mu’}$ for instance! We know this is a matrix so our basis for this should also be matrices. We’ll take the tensor product of the basis vectors $\hat{e}_{(\mu)}$ and basis duals $\hat{\varepsilon}^{(\nu)}$ while treating them as column and row vectors. Doing this for each combination, we get the following basis for a $(1,1)$-tensor in a canonical basis.</p>
\[\begin{align*}
\hat{e}_{(0)}\otimes\hat{\varepsilon}^{(0)} &\to \begin{bmatrix}1\\0\end{bmatrix}\begin{bmatrix}1 & 0\end{bmatrix} = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}\\
\hat{e}_{(0)}\otimes\hat{\varepsilon}^{(1)} &\to \begin{bmatrix}1\\0\end{bmatrix}\begin{bmatrix}0 & 1\end{bmatrix} = \begin{bmatrix}0 & 1\\0 & 0\end{bmatrix}\\
\hat{e}_{(1)}\otimes\hat{\varepsilon}^{(0)} &\to \begin{bmatrix}0\\1\end{bmatrix}\begin{bmatrix}1 & 0\end{bmatrix} = \begin{bmatrix}0 & 0\\1 & 0\end{bmatrix}\\
\hat{e}_{(1)}\otimes\hat{\varepsilon}^{(1)} &\to \begin{bmatrix}0\\1\end{bmatrix}\begin{bmatrix}0 & 1\end{bmatrix} = \begin{bmatrix}0 & 0\\0 & 1\end{bmatrix}\\
\end{align*}\]
<p>The resulting basis looks just like a canonical basis for $2\times 2$ matrices! Indeed we can take any matrix and write it as a scalar times one of these “basis matrices”.</p>
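<p>In numpy, the tensor product of a (column) basis vector and a (row) basis dual is just <code>np.outer</code>. A sketch, with an arbitrary matrix chosen to decompose:</p>

```python
import numpy as np

e = np.eye(2)     # canonical basis vectors e_(mu)
eps = np.eye(2)   # canonical basis duals eps^(nu)

# e_(i) (x) eps^(j) as column-times-row products: the "basis matrices".
basis_matrices = [np.outer(e[i], eps[j]) for i in range(2) for j in range(2)]
assert np.allclose(basis_matrices[1], [[0.0, 1.0], [0.0, 0.0]])

# Any 2x2 matrix decomposes into them with its entries as components.
T = np.array([[1.0, 2.0], [3.0, 4.0]])
recon = sum(T[i, j] * np.outer(e[i], eps[j]) for i in range(2) for j in range(2))
assert np.allclose(recon, T)
```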
<p>Now let’s go back to our abstract basis and write out a general tensor in terms of its components.</p>
\[T = T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l}\hat{e}_{(\mu_1)}\otimes\cdots\otimes\hat{e}_{(\mu_k)}\otimes\hat{\varepsilon}^{(\nu_1)}\otimes\cdots\otimes\hat{\varepsilon}^{(\nu_l)}\]
<p>Since tensors are comprised of vectors and duals, a $(k, l)$ tensor transforms like $k$ vectors and $l$ duals, with one transformation matrix per index, $k+l$ in total.</p>
\[T^{\mu_1'\cdots\mu_k'}_{\nu_1'\cdots\nu_l'} = \Lambda^{\mu_1'}_{\mu_1}\cdots\Lambda^{\mu_k'}_{\mu_k}\Lambda^{\nu_1}_{\nu_1'}\cdots\Lambda^{\nu_l}_{\nu_l'}T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l}\]
<p>Notice all of the indices are in the right place! Summation convention makes it really easy to verify if we’ve made a mistake or not. All of the operations on tensors we’re going to cover really amount to keeping careful track of our indices.</p>
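<p>For a rank $(1,1)$ tensor, the transformation law is a single <code>einsum</code>: one $\Lambda$ for the upper index and one inverse $\Lambda$ for the lower index. A sketch with a random (deliberately kept invertible) transformation matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 2)) + 2 * np.eye(2)   # transformation matrix, kept invertible
L_inv = np.linalg.inv(L)                      # inverse transformation
T = rng.normal(size=(2, 2))                   # components T^mu_nu of a (1,1)-tensor

# T'^a_b = L^a_m (L^{-1})^n_b T^m_n: one matrix per index.
T_prime = np.einsum('am,nb,mn->ab', L, L_inv, T)

# For a (1,1)-tensor this is just the similarity transform L T L^{-1}.
assert np.allclose(T_prime, L @ T @ L_inv)
```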
<h2 id="the-metric-tensor">The Metric Tensor</h2>
<p>Before discussing tensor operations, I want to introduce the most important tensor: the <strong>metric tensor</strong> $g$! It allows us to compute distances and angles in arbitrary coordinates. Specifically, the inner product in an arbitrary space is written in terms of the metric tensor:</p>
\[g(U, V) = g_{\mu\nu}U^\mu V^\nu\]
<p>where $g_{\mu\nu}$ are the components of the metric tensor. Notice the metric tensor is a $(0, 2)$-tensor so it takes two vectors to produce a scalar. In a Euclidean space, $g_{\mu\nu}=\delta_{\mu\nu}$ so the inner product is simply the component-wise product $U_\nu V^\nu$, which is exactly the dot product we’re familiar with. In general spaces, however, the metric tensor is often not even constant and changes based on where we are in the space. We’ll see an example of this with polar coordinates shortly.</p>
<p>In addition to representing the metric as a function, remember that we can also represent it by its components in a basis, particularly two basis duals in this case:</p>
\[g = g_{\mu\nu}\hat{\varepsilon}^{(\mu)}\otimes\hat{\varepsilon}^{(\nu)}\]
<p>For the metric tensor, there’s another way to express the components that is a bit more canonical: as a <strong>line element</strong>. For example, consider the line element of Cartesian coordinates.</p>
\[ds^2 = dx^2 + dy^2\]
<p>Notice that the nonzero components of the metric tensor in Cartesian coordinates are simply $1$ and so are the coefficients on $dx^2$ and $dy^2$. The zero components represent the $0$ coefficients on $dxdy$ and $dydx$, which is why there are no cross-terms.</p>
<p>For now, we can think of the line element as being an infinitesimal displacement, but there’s actually a deeper meaning. $ds^2$ is just a symbol, but something like $dx^2$ is secretly the bilinear differential form $\mathrm{d}x\otimes\mathrm{d}x$ which is the exterior derivative $\mathrm{d}$ applied to the 0-form $x$. Anyways, in this notation, let’s try to find a corresponding line element for polar coordinates and then write the metric tensor components.</p>
<p>We can start by writing the Cartesian coordinates $(x,y)$ in terms of the polar coordinates $(r,\theta)$:</p>
\[\begin{align*}
x &= r\cos\theta\\
y &= r\sin\theta
\end{align*}\]
<p>Now we take the total derivative $df = \frac{\partial f}{\partial x^\mu}dx^\mu$ of both sides. (Note this is also a differential form!)</p>
\[\begin{align*}
dx &= \cos\theta\,dr - r\sin\theta\,d\theta\\
dy &= \sin\theta\,dr + r\cos\theta\,d\theta
\end{align*}\]
<p>Then we can “square” both sides, being careful not to commute $dxdy$ and $dydx$ for good practice. As a side note, the metric tensor is indeed symmetric, but we haven’t defined that just yet.</p>
\[\begin{align*}
dx^2 &= \cos^2\theta\,dr^2 - r\cos\theta\sin\theta(drd\theta + d\theta dr) + r^2\sin^2\theta\,d\theta^2\\
dy^2 &= \sin^2\theta\,dr^2 + r\sin\theta\cos\theta(drd\theta + d\theta dr) + r^2\cos^2\theta\,d\theta^2
\end{align*}\]
<p>Now we can add them and cancel the cross terms with $drd\theta + d\theta dr$.</p>
\[\begin{align*}
ds^2 &= dx^2 + dy^2\\
&= \cos^2\theta\,dr^2 + r^2\sin^2\theta\,d\theta^2 + \sin^2\theta\,dr^2 + r^2\cos^2\theta\,d\theta^2\\
&= (\cos^2\theta + \sin^2\theta)dr^2 + r^2(\sin^2\theta + \cos^2\theta)d\theta^2\tag*{Group like terms}\\
&= dr^2 + r^2\,d\theta^2\tag*{$\sin^2\theta + \cos^2\theta = 1$}
\end{align*}\]
<p>Now we can read the components of the metric tensor from the coefficients on $dr^2$ and $d\theta^2$:</p>
\[g_{\mu\nu} = \begin{bmatrix}1 & 0\\ 0 & r^2\end{bmatrix}\]
<p>Notice we’re arranging the components in a matrix since the metric tensor is a $(0, 2)$-tensor. So it seems the metric in polar coordinates isn’t constant and <em>does</em> depend on where we are in the space. This makes sense because, as we move farther away from the origin in polar coordinates, the arc length between any two angles increases. In fact, if we treat $ds^2$ as an infinitesimal displacement and keep the radius fixed, i.e., $dr=0$, then we get $ds^2=r^2 d\theta^2\to s=r \theta$ which is exactly the arc length formula! (I’m being a little loose with the notation, but the point still remains.)</p>
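<p>We can double-check the polar metric numerically: its components are the pullback $J^T J$ of the Euclidean metric through the Jacobian of $(x,y)=(r\cos\theta, r\sin\theta)$, and inner products computed with it agree with the Cartesian dot product. A sketch with an arbitrary point and vector:</p>

```python
import numpy as np

r, theta = 2.0, np.pi / 3   # arbitrary point in polar coordinates

# Jacobian of (x, y) = (r cos(theta), r sin(theta)) w.r.t. (r, theta).
J = np.array([[np.cos(theta), -r * np.sin(theta)],
              [np.sin(theta),  r * np.cos(theta)]])

# The polar metric components derived above: diag(1, r^2) = J^T J.
g_polar = np.diag([1.0, r**2])
assert np.allclose(J.T @ J, g_polar)

# g(V, V) in polar coordinates equals the Euclidean dot product of the
# corresponding Cartesian components.
V_polar = np.array([0.3, -1.2])   # arbitrary components (V^r, V^theta)
V_cart = J @ V_polar
assert np.isclose(V_cart @ V_cart, V_polar @ g_polar @ V_polar)
```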
<h2 id="tensor-operations">Tensor Operations</h2>
<p>Now that we’ve discussed the metric tensor, we can start looking at different operations we can perform on tensors. I want to cover these operations now rather than when they’re needed so that we don’t need to digress too often when we start to use them. One of the most important operations is <em>raising and lowering indices</em>: in other words, we’re converting between upper and lower indices and converting between vectors and duals. We do this via the metric tensor. We’ll need a small bit of additional machinery if we want to raise indices: the <strong>inverse metric tensor</strong>. Given a metric tensor, there exists an inverse $g^{\mu\nu}$ with two upper indices defined as the following.</p>
\[g^{\mu\lambda}g_{\lambda\nu} = g_{\nu\lambda}g^{\lambda\mu}=\delta^\mu_\nu\]
<p>The combination of these two lets us raise and lower indices. For example, we can convert between vectors and duals.</p>
\[\begin{align*}
V^\mu &= g^{\mu\nu}\omega_\nu\\
\omega_\nu &= g_{\mu\nu}V^\mu\\
\end{align*}\]
<p>This operation explains why we don’t distinguish between vectors and duals in Euclidean space: since $g_{\mu\nu}=\delta_{\mu\nu}$, the components are the same, and we can simply transpose between column and row vectors. Of course, we can raise and lower arbitrary indices on arbitrary tensors for as many indices as we want.</p>
\[\begin{align*}
T^{\alpha\beta}_\mu g_{\alpha\rho} &= T^{\beta}_{\mu\rho}\\
T^{\alpha\beta\mu\nu} g_{\alpha\rho} g_{\beta\sigma} &= T^{\mu\nu}_{\rho\sigma}\\
T^{\alpha\beta}_{\mu\nu}g^{\mu\rho}g_{\alpha\sigma} &= T^{\rho\beta}_{\sigma\nu}
\end{align*}\]
<p>Notice that in the last equation, the ordering of the indices doesn’t change. Another tensor operation is called <strong>contraction</strong>, which maps $(k,l)$ tensors to $(k-1, l-1)$ tensors by summing over an upper and lower index:</p>
\[T^\mu_\nu\to T^\mu_\mu = T\in\mathbb{R}\]
<p>Contractions are <em>only</em> defined for one upper and one lower; we can’t contract two lower or two upper indices. Algebraically, if we represent $T^\mu_\nu$ as a matrix, then contraction is the same as taking the trace. As with raising and lowering, we can contract arbitrary indices from arbitrary tensors as long as we keep the ordering the same:</p>
\[T^{\alpha\mu\beta}_{\mu\rho} = S^{\alpha\beta}_\rho\]
<p>If we did want to raise or lower two upper or two lower indices, we’d have to first use the metric tensor to lower or raise one of them, then contract.</p>
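<p>Algebraically, contracting a $(1,1)$-tensor is taking the trace, and contracting two lower indices means raising one with the inverse metric first. A numpy sketch with made-up components and an arbitrary diagonal metric:</p>

```python
import numpy as np

T = np.arange(9.0).reshape(3, 3)   # components T^mu_nu of a (1,1)-tensor

# Contraction T^mu_mu sums over the repeated upper/lower pair: the trace.
assert np.isclose(np.einsum('mm->', T), np.trace(T))

# To "contract" two lower indices of S_{mu nu}, first raise one with the
# inverse metric, then contract: g^{mu nu} S_{mu nu}.
g = np.diag([1.0, 2.0, 4.0])       # an arbitrary diagonal metric
S = np.arange(9.0).reshape(3, 3)
contracted = np.einsum('mn,mn->', np.linalg.inv(g), S)
assert np.isclose(contracted, S[0, 0] + S[1, 1] / 2.0 + S[2, 2] / 4.0)
```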
<p>The final tensor operation we’re going to look at for the moment is symmetrization and anti-symmetrization. We say a tensor is <strong>symmetric in its first two indices</strong> if $T_{\mu\nu\alpha\beta}=T_{\nu\mu\alpha\beta}$. A tensor is just <strong>symmetric</strong> if all pairs of indices are symmetric. We’ve already seen a symmetric tensor: the metric tensor! The metric tensor is symmetric: $g_{\mu\nu}=g_{\nu\mu}$. This is apparent from the inner product being symmetric (although the latter is actually a corollary of the former).</p>
<p>On the other hand, we say a tensor is <strong>anti-symmetric in its first two indices</strong> if $T_{\mu\nu\alpha\beta}=-T_{\nu\mu\alpha\beta}$. An <strong>anti-symmetric</strong> tensor is defined in the same way as a symmetric tensor. A canonical example in physics is the electromagnetic field strength tensor/Faraday tensor $F_{\mu\nu}$. Electromagnetism is comprised of electric and magnetic fields. Often, they’re treated as separate entities, but they’re really two sides of the same coin. A more compact way to treat them is to put them into one tensor with components:</p>
\[F_{\mu\nu}=\begin{bmatrix}0 & -E_1 & -E_2 & -E_3\\ E_1 & 0 & B_3 & -B_2\\ E_2 & -B_3 & 0 & B_1 \\ E_3 & B_2 & -B_1 & 0\end{bmatrix} = -F_{\nu\mu}\]
<p>Note that swapping indices is the same as transposing the matrix representing the components. From these components, it’s clear that $F_{\mu\nu}=-F_{\nu\mu}$. In linear algebra, this is also called a skew-symmetric matrix. As it turns out, we can take any tensor and symmetrize or anti-symmetrize it. To symmetrize a tensor, we take the sum of all permutations of the indices scaled by $\frac{1}{n!}$.</p>
\[T^\mu_{(\nu_1\cdots\nu_l)\rho} = \frac{1}{n!}\Big(T^\mu_{\nu_1\cdots\nu_l\rho} + \text{sum of permutations of }\nu_1\cdots\nu_l\Big)\]
<p>As an example, let’s consider $T_{(\mu\nu\rho)\sigma}$:</p>
\[T_{(\mu\nu\rho)\sigma} = \frac{1}{6}\Big(T_{\mu\nu\rho\sigma}+T_{\nu\rho\mu\sigma}+T_{\rho\mu\nu\sigma}+T_{\rho\nu\mu\sigma}+T_{\nu\mu\rho\sigma}+T_{\mu\rho\nu\sigma}\Big)\]
<p>To anti-symmetrize a tensor, we take the alternating sum of all permutations of the indices scaled by the same factor. By alternating, we mean permutations reached by an even number of index exchanges get a factor of $+1$ while those reached by an odd number get a factor of $-1$.</p>
\[T^\mu_{[\nu_1\cdots\nu_l]\rho} = \frac{1}{n!}\Big(T^\mu_{\nu_1\cdots\nu_l\rho} + \text{alternating sum of permutations of }\nu_1\cdots\nu_l\Big)\]
<p>As an example, let’s consider $T_{[\mu\nu\rho]\sigma}$:</p>
\[T_{[\mu\nu\rho]\sigma} = \frac{1}{6}\Big(T_{\mu\nu\rho\sigma}-T_{\mu\rho\nu\sigma}+T_{\rho\mu\nu\sigma}-T_{\nu\mu\rho\sigma}+T_{\nu\rho\mu\sigma}-T_{\rho\nu\mu\sigma}\Big)\]
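<p>These recipes translate directly to code. The sketch below (anti-)symmetrizes over <em>all</em> of a tensor’s indices for simplicity; symmetrizing only a subset would just mean holding the remaining axes fixed:</p>

```python
import itertools
import math
import numpy as np

def parity(perm):
    """+1 for an even permutation, -1 for an odd one (count inversions)."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def symmetrize(T):
    """Average of T over all permutations of its indices."""
    perms = itertools.permutations(range(T.ndim))
    return sum(np.transpose(T, p) for p in perms) / math.factorial(T.ndim)

def antisymmetrize(T):
    """Alternating average of T over all permutations of its indices."""
    perms = itertools.permutations(range(T.ndim))
    return sum(parity(p) * np.transpose(T, p) for p in perms) / math.factorial(T.ndim)

T = np.arange(8.0).reshape(2, 2, 2)   # arbitrary rank-3 example
S, A = symmetrize(T), antisymmetrize(T)
assert np.allclose(S, np.transpose(S, (1, 0, 2)))    # symmetric in any pair
assert np.allclose(A, -np.transpose(A, (1, 0, 2)))   # anti-symmetric in any pair
```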
<p>As I’ve stated before, most of these tensor operations are really about keeping careful track of the indices. That concludes the tensor operations we’ll be using, at least for now! This is a good place to stop, so I’ll end the new content here since the next thing we’re going to do is actually construct a manifold!</p>
<h1 id="review">Review</h1>
<p>We’ve covered many topics so far, so I want to take a second to review what we’ve learned:</p>
<ul>
<li>Manifolds are structures that are locally Euclidean, but globally might have some more interesting structure.</li>
<li>Tensors are the most fundamental geometric object, constructed from vectors and dual vectors, and are invariant under changes of coordinates</li>
<li>Vectors in general coordinates are defined at a point $p$ in vector space called the tangent space $T_p$.</li>
<li>Any abstract vector can be decomposed into components in a basis.</li>
<li>Basis vectors transform inversely to how the components transform.</li>
<li>For each vector space, there is an associated dual vector space where dual vectors send elements of the vector space to the reals.</li>
<li>Duals can also be represented as components in a basis whose components transform in the same way as the basis vectors.</li>
<li>General tensors are comprised of vectors and duals and also written as components in a basis, constructed via the tensor product of the basis vectors and basis duals.</li>
<li>The metric tensor is a $(0, 2)$-tensor that is used to compute distances and angles.</li>
<li>Tensor operations like raising/lowering, contracting, and symmetrizing/antisymmetrizing all maintain the Einstein notation.</li>
</ul>
<p>With those points, we’ll conclude this first part on manifolds. In the next installment, we’ll actually construct a manifold 😀 and re-define most of this machinery on a manifold!</p>As the first in a multi-part series, I'll introduce manifolds and discuss how vectors, dual vectors, and tensors work in a flat, Euclidean space.Particle Filters for Robotic State Estimation2020-12-28T00:00:00+00:00/particle-filter<p>In the <a href="/ekf">previous post</a>, we discussed Extended Kalman Filters for robotic state estimation. These are widely used for estimating unknown variables of a Gaussian, nonlinear dynamical system; specifically, we saw how we can use them to estimate the state of a robot traveling in the world. Although they’re popular state estimators and work really well for some systems, they are not without their faults. In this post, we’re going to explore another kind of state estimator that can overcome the limitations of the EKF: the particle filter!</p>
<h1 id="limitations-of-ekfs">Limitations of EKFs</h1>
<p>To motivate particle filters, let’s take a second to recall some assumptions we’ve made about Kalman Filters (KFs) and Extended Kalman Filters (EKFs). Both work on the Gaussian assumption: the noises are fundamentally Gaussian. We always draw the process and sensor noise from Gaussian distributions. While many kinds of distributions in the real world are indeed Gaussians, it’s a sliding scale: while some distributions are completely Gaussian, others are very poorly approximated by Gaussians. As an example, consider the exponential distribution.</p>
<p><img src="/images/particle-filters/exponential-distribution.svg" alt="The exponential distribution" title="The exponential distribution" /></p>
<p><small>The exponential distribution is parameterized by $\lambda$, called the decay factor, which controls the “sharpness” or “flatness” of the graph.</small></p>
<p>From the plot, we can see that this distribution is incredibly non-Gaussian and would be very poorly approximated as such. (This distribution isn’t just some counterexample for the sake of counterexample: for example, bacteria growth is well-modeled by an exponential distribution.) Another such example is a multimodal distribution, i.e., one with multiple peaks. A Gaussian approximation would effectively average the peaks with a wide variance, depending on the distance between the peaks. These distributions are also common in robot localization: suppose two rooms share similar features, so the estimate could place the robot in either room. (With subsequent navigation, we can further narrow down where we are.)</p>
<p>Additionally, we know KFs are only good for linear systems, and EKFs can, to some degree, handle nonlinear systems. Recall that, for EKFs, we simply replace the KF matrices with Jacobians linearized from the nonlinear motion and sensor models at the state estimate. But this linearization introduces error since we’re approximating the nonlinear function with a linearized version. This causes some artifacts: the linearization point can be unstable, i.e., sit on a peak, and the farther we move from the linearization point, the worse the error gets.</p>
<p><img src="/images/particle-filters/very-nonlinear-fuction.svg" alt="A very nonlinear function" title="A very nonlinear function" /></p>
<p><small>An example of a very nonlinear function. There are many bad linearization points that could throw off the approximation.</small></p>
<p>For systems that are only <em>somewhat</em> nonlinear, we can afford to move slightly farther away from the linearization point without inducing too much error, assuming the linearization point isn’t a bad one. However, for <em>very</em> nonlinear systems, the error accumulates much faster.</p>
<h1 id="particle-filters">Particle Filters</h1>
<p>With those two caveats in mind, what can we do to support a broader case of distributions (non-Gaussians and multimodal distributions) and functions (highly-nonlinear functions and discontinuous functions)? Let’s start by writing down our EKF equations from <a href="/ekf">the previous post</a>:</p>
\[\begin{align*}
\hat{x}_k &= f(\hat{x}_{k-1}, u_k)\\
P_k &= F_k P_{k-1} F^T_k + Q_k\\
\hat{x}_k' &= \hat{x}_k + K(z_k-h(\hat{x}_k))\\
P_k' &= P_k - KH_k P_k\\
K &= P_k H^T_k(H_k P_k H^T_k + R_k)^{-1}
\end{align*}\]
<p>An EKF is an example of a <strong>parametric</strong> state estimator (sometimes called <strong>model-based</strong>): we’re directly parameterizing our result with a mean of $\hat{x}_k$ and covariance of $P_k$, and the point is to estimate those two <em>parameters</em>. If we want to represent distributions that aren’t Gaussians or just arbitrary distributions, then we need to do away with directly estimating these parameters. Those kinds of state estimators are called <strong>non-parametric</strong> or <strong>model-free</strong>.</p>
<p>So how can we move from a parametric to non-parametric state estimator? The key insight is, for any distribution, instead of representing it by its parameters, e.g., mean, covariance, or decay factor, <em>we represent it by a collection of samples of the distribution</em>.</p>
<p><img src="/images/particle-filters/sampled-gaussian.svg" alt="A Gaussian represented by its samples" title="A Gaussian represented by its samples" /></p>
<p><small>The blue plot is the true, underlying Gaussian distribution. The orange plot represents 100 samples taken from the underlying Gaussian. We can see that the resulting sampling distribution, even with just 100 samples, is giving us a good estimate of the true distribution.</small></p>
<p>Let’s build the intuition for how this works with concrete, known distributions. Suppose we’re handed a collection of samples, called <strong>particles</strong>, generated from a secret distribution. Although we don’t know the distribution itself, we can still try to compute some distribution-independent properties such as the mean and covariance. If we’re later told that these samples were indeed taken from a Gaussian distribution, then we can compute the mean and covariance of the samples, and we’re done characterizing the distribution! If we’re told it’s a Beta distribution, for example, although it’s more work, we could solve for the parameters of the Beta distribution given the set of samples. (This also exemplifies how we can convert our non-parametric model to a parametric one for visualization or logging or other purposes.)</p>
<p>One other important thing to notice is how changes in the particles correspond to changes in the distribution. For example, suppose we take the same set of generated particles and translate each one by the same amount. If we’re told the secret distribution is Gaussian and we recompute the mean and covariance with the new points, we’ll find that the mean will be approximately translated by that same amount! The transformations of these particles correspond to transformations in the distribution. This further emphasizes the fact that the particles represent the distribution: changes in one correspond to changes in the other.</p>
<p><img src="/images/particle-filters/transformed-gaussian.svg" alt="Transformations of Gaussian samples" title="Transformations of Gaussian samples" /></p>
<p><small>The blue plot is the underlying Gaussian before any transformation and the green plot is the underlying Gaussian after offsetting the mean by 5. The orange plot represents 100 samples taken from the pre-transformed Gaussian, and the red plot takes each sample from the pre-transformed Gaussian and translates it by 5 (notice that it’s the exact same histogram translated by 5). The resulting sampling distribution after transforming the particles is also translated by 5 and looks similar to the post-transformed underlying distribution.</small></p>
<p>Now that we have an intuition for this correspondence, I’m going to stop referring to any particular distribution, e.g., Gaussian, exponential, or beta, that the particles might represent, since in practice the underlying distribution is arbitrary. Remember, that was the point: we want a way to represent a distribution non-parametrically, so we’ll be working exclusively with the particles and not with the parameters of any particular distribution.</p>
<p>Now that we’ve established particles as the non-parametric way of modeling distributions, let’s be slightly more specific about particle filters for localization (as the title of this post seemed to hint at). Let’s assume a robot on a plane with a position and velocity so our state looks like the following.</p>
\[x_k = \begin{aligned}\begin{bmatrix}
p_k \\
v_k \\
\end{bmatrix}\end{aligned}\]
<p>If we were using EKFs, we’d have a mean vector and covariance matrix representing the distribution of all of the places our robot could be, but we’ve committed ourselves to a non-parametric approach so we can’t use that! Instead, we can consider samples of this distribution to be our particles so each particle gets its own $\begin{bmatrix} p_k & v_k \end{bmatrix}^T$. In other words, each particle represents <em>an estimate of the true state of the robot</em>, i.e., its position and velocity. We’ll denote the state of a particular particle $i$ at a timestep $k$ as $x_k^{(i)}$ where I’ve kept the particle index $i$ in parentheses to avoid confusing it with an exponent.</p>
<p>To initialize our filter, we need to randomly initialize a bunch of the particles with random states. How many is a “bunch”? There’s no one good answer; it’s a tuning parameter that has to be empirically determined. We’ll use $N$ as the placeholder for the number of particles. If we’re stationary at the origin, then perhaps a good way to initialize each particle is to randomly sample from a uniform distribution centered around $0$ for both the position and velocity.</p>
\[\Big\{x^{(i)}_k\sim\mathcal{U}(-\varepsilon, +\varepsilon)\Big\}_{i=1,...,N}\]
<p>where $\mathcal{U}(a, b)$ is the uniform distribution for the interval $[a, b]$ and $\varepsilon$ is some small, positive number, i.e., $\varepsilon > 0$ and $|\varepsilon| \ll 1$. (Of course, we could have sampled from a Gaussian, but I decided to use the uniform distribution to show that we can use any distribution; also, the uniform distribution isn’t a bad choice for this kind of initialization.)</p>
<p>With this, we’ve initialized our particle filter! Now, like any filtering approach, we need a motion and sensor model (see the <a href="/ekf">previous post</a> for the motivation for these). Furthermore, just like with the filter initialization, we’ll need to come up with non-parametric ways to apply these models.</p>
<h2 id="predict">Predict</h2>
<p>Our prediction model $f$ time-evolves our state from time $k$ to $k+1$. With EKFs, we had a nonlinear function transforming our mean and the Jacobian of that function transforming the covariance. Since we’re working without an explicit mean and covariance, we’re only allowed to work with the particles. The hint in figuring out the right way to implement a non-parametric motion model is something we’ve already discussed! Recall we noticed that transformations of the particles correspond to transformations in the underlying distribution. So, for the most part, all we need to do is apply a motion model $f$ to each particle independently. The additional nuance is that we need to encode the uncertainty of the motion model $q_k$ directly in this update because we no longer have an explicit covariance matrix to represent that uncertainty. (You can sample $q_k$ from your favorite distribution, Gaussian or otherwise: $q_k\sim\mathcal{N}(0, \sigma_q)$.)</p>
<p><img src="/images/particle-filters/motion-model.svg" alt="Applying motion model to particles" title="Applying motion model to particles" /></p>
<p><small>The blue particles represent the previous state and the orange particles represent the particles after the motion model has been applied to the previous state, with some added noise.</small></p>
<p>After applying this to each particle, we get a new particle distribution.</p>
\[\Big\{x^{(i)}_{k+1} = f(x^{(i)}_k, u_k) + q_k\Big\}_{i=1,...N}\]
<p>(where $u_k$ is an optional control input. In our case, I’ve chosen to ignore it for simplicity since we already encode velocity in our state. If we wanted to account for second-order effects like acceleration/angular acceleration, then we could include it.)</p>
<p>The update function $f$ can actually be taken right from the EKF; remember to add in the noise at the end too. And that’s the entirety of the motion model: we simply apply this function to each particle to get its new state estimate.</p>
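<p>As a sketch, here’s the predict step with a constant-velocity motion model (an assumed $f$; any motion model would do) applied to every particle, with the noise folded in:</p>

```python
import numpy as np

def predict(particles, dt, sigma_q, rng):
    """Time-evolve every particle with a constant-velocity motion model,
    then fold in Gaussian process noise q_k ~ N(0, sigma_q^2).
    particles has shape (N, 4): rows of [px, py, vx, vy]."""
    out = particles.copy()
    out[:, :2] += dt * out[:, 2:]                    # position += velocity * dt
    out += rng.normal(0.0, sigma_q, size=out.shape)  # additive process noise
    return out
```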
<h2 id="digression-bayes-nets-and-hidden-markov-models">Digression: Bayes Nets and Hidden Markov Models</h2>
<p>The non-parametric motion model followed rather naturally from the parametric EKF motion model. However, the sensor model is a bit trickier to motivate directly from the EKF. Instead, I want to digress for a while to discuss the generalization of EKFs and particle filters: the <strong>Bayes Net (BN)</strong>. This might seem unrelated at first glance, but EKFs and particle filters share the BN as a common ancestor. In fact, EKFs are just a special case of the BN: one where all functions are linearized and all errors are Gaussian. The particle filter, as we’ve motivated before, doesn’t have these requirements but is still a realization of a BN. From this generalization, we can easily motivate the sensor model.</p>
<p>Specifically, the BN structure that we’re interested in is called a <strong>Hidden Markov Model (HMM)</strong> and looks like this.</p>
<p><img src="/images/particle-filters/hmm.svg" alt="Hidden Markov Model" title="Hidden Markov Model" /></p>
<p><small>The HMM is a kind of Bayes Net where the variables, i.e., the blue rounded rectangles, are the unknown variables while the measurements, i.e., the green ellipses, are the known. The goal is to solve for the unknown state given just the very first prior distribution $x_0$ and the measurements $z_1, …, z_k$.</small></p>
<p>This structure makes sense because we’re interested in figuring out the state of our robot given the previous state and the sensor measurements. The “catch” is that the previous state is a random variable! We don’t know what its value is, but we do know the sensor measurements. So the goal is to solve for all of the random variables given the sensor measurements. But we can simplify this by noting that we only care about solving for the most recent state: if we’re at $x_k$, we don’t want to solve for $x_{k-4}$. It’d be simpler if we had <em>folded in</em> the previous state and sensor measurements so that we keep a sort of “running tally” state estimate. When we’re adding a new state $x_k$, we want to <em>fold in</em> the information of the previous state $x_{k-1}$ into $x_k$. Similarly, when we get a new sensor measurement $z_k$ in $x_k$, we want to <em>fold in</em> $z_k$ into $x_k$ to refine it. We’ll discuss precisely and mathematically what I mean by “<em>fold in</em>” in just a minute, but this process of “get a new state estimate by folding in the previous one; then fold in the sensor measurement; repeat” is called <strong>recursive Bayes estimation</strong>; we maintain only one “running” state, not the entire history.</p>
<p>One thing we notice about this structure is that there seem to be two kinds of arrows: one connecting states to states and one connecting states to sensor measurements. These are actually our motion and sensor models! Recall that our motion model tells us how to get a new state by time-evolving our previous state; this is exactly the dependence that arrow is capturing. In fact, we can use probability notation to generically discuss both the motion and sensor models. The motion model is written as $p(x_k|x_{k-1})$: given the distribution over $x_{k-1}$ (explicitly, a mean and covariance in the case of KFs/EKFs), the motion model tells us how to get the distribution over $x_k$. This is exactly what the notation is telling us!</p>
<p>Interestingly, we could have structured the problem differently by requiring a particular state $x_k$ be conditioned on by the <em>past two</em> states $x_{k-1}$ and $x_{k-2}$ or the past $m$ states. While this is certainly something we can do (and is done in reality, but that’s a different kind of state estimator for a different time 😉️), this makes the problem more difficult and doesn’t align with how our EKF works. This assumption that any particular state $x_k$ is only dependent on the previous state $x_{k-1}$ is so important and widely used that it has a name: the <strong>Markov assumption</strong>.</p>
<p>For our sensor model, recall that we thought of it as “mapping our state space into the observation space”; again, this is exactly what the other arrow is capturing. In probabilistic notation, the sensor model is written as $p(z_k|x_k)$: given the distribution over $x_k$, the sensor model tells us how to get the distribution over $z_k$. This also sets up another assumption: a sensor measurement $z_k$ is only dependent on the corresponding state $x_k$ we made that measurement in.</p>
<p>Hopefully, by now, I’ve convinced you that this HMM structure makes sense for the problem at hand, and, at this stage, we have all of the tools and insight we need to try to solve the HMM for the most recent state.</p>
<p>But after all of that motivation, what’s the quantity we’re actually trying to solve for? State estimation in general tries to solve for the latest state given the previous states and sensor measurements. From that intuition, we might think that $p(x_k|x_1,x_2,…,x_{k-1}, z_1, z_2, …, z_k)$ is the distribution to solve for, but that’s not quite right: the previous states appear on the right side of the conditioning bar even though they’re unknown. So the distribution we’re actually after is $p(x_k|z_1, z_2, …, z_k)$, and we can recursively “fold in” only the previous state (as per our Markov assumption). As a shorthand, we can write $p(x_k|z_{1:k})$. Finally, this is the quantity we’ve been building up to solve!</p>
<p>Since recursive Bayes estimation is recursive, we can assume we’re given $p(x_{k-1}|z_{1:k-1})$. We also need a prior $p(x_0)$ for initialization which gives us the initial state distribution. Our task is to compute the latest state estimate $p(x_k|z_{1:k})$ given the previous one $p(x_{k-1}|z_{1:k-1})$, the motion model $p(x_k|x_{k-1})$, and the sensor model $p(z_k|x_k)$.</p>
<p>The first thing we need to do is to apply the motion model. This results in transforming the previous state $p(x_{k-1}|z_{1:k-1})$ to $p(x_k|z_{1:k-1})$.</p>
\[p(x_k|z_{1:k-1}) \overset{?}{=} p(x_k|x_{k-1})p(x_{k-1}|z_{1:k-1})\]
<p>But this isn’t quite right because we have a free variable $x_{k-1}$ that’s unaccounted for! We need some heavy-duty probability theory for this, but, as it turns out, we’re not too far off from the right answer. We can directly apply the <a href="https://en.wikipedia.org/wiki/Chapman%E2%80%93Kolmogorov_equation">Chapman-Kolmogorov Equation</a>, which, in our scenario, states $p(A|C) = \int p(A|B)p(B|C)\text{d}B$. Applying this, we get the right answer:</p>
\[p(x_k|z_{1:k-1}) =\int p(x_k|x_{k-1})p(x_{k-1}|z_{1:k-1}) \mathrm{d}x_{k-1}\]
<p>We got rid of that extra variable by integrating it out. Intuitively, this “folds it into” the resulting distribution. Finally, we get to the mathematical definition of “folding in”: <strong>marginalization</strong>! To “fold in” a random variable, we <strong>marginalize</strong> it out by integrating over all possible values of that random variable. (Note that this operation is well-defined for random/unknown variables.) The information in $x_{k-1}$ is “folded into” the resulting distribution.</p>
<p>(For the case of KFs/EKFs, we can derive the predict step by plugging in Gaussians for $p(x_k|x_{k-1})$ and $p(x_{k-1}|z_{1:k-1})$ and solving for $p(x_k|z_{1:k-1})$ in terms of the Gaussians for $p(x_k|x_{k-1})$ and $p(x_{k-1}|z_{1:k-1})$. The result can be massaged into looking like a Gaussian, and we can extract the equations for the mean and covariance from there. Remember that applying a Gaussian to another Gaussian produces a Gaussian.)</p>
<p>Now that we have $p(x_k|z_{1:k-1})$, we need to apply the sensor model $p(z_k|x_k)$ to refine the estimate into the final result of $p(x_k|z_{1:k})$. For this, we can use Bayes Theorem to get the right answer:</p>
\[p(x_k|z_{1:k}) = \displaystyle\frac{p(z_k|x_k)p(x_k|z_{1:k-1})}{p(z_k|z_{1:k-1})}\propto p(z_k|x_k)p(x_k|z_{1:k-1})\]
<p>where</p>
\[p(z_k|z_{1:k-1}) = \int p(z_k|x_k)p(x_k|z_{1:k-1}) \mathrm{d}x_k\]
<p>is the normalization constant. (Notice the use of the Chapman-Kolmogorov Equation again.) This isn’t easy to compute directly, but it rarely matters since, in practice, we usually just compute $p(z_k|x_k)p(x_k|z_{1:k-1})$ (which is proportional to $p(x_k|z_{1:k})$) and manually normalize the result.</p>
<p>And that’s it! We have the predict and update steps written in a generic way with no assumptions on nonlinearity or distributions:</p>
\[\begin{align}
p(x_k|z_{1:k-1}) &=\int p(x_k|x_{k-1})p(x_{k-1}|z_{1:k-1}) \mathrm{d}x_{k-1}\\
p(x_k|z_{1:k}) &= \displaystyle\frac{p(z_k|x_k)p(x_k|z_{1:k-1})}{p(z_k|z_{1:k-1})} \propto p(z_k|x_k)p(x_k|z_{1:k-1})\\
\end{align}\]
<p>That was a long digression, but it served the purpose to show the generalized predict and update steps so we can apply them (specifically the update step) to our particle filter because the update step doesn’t follow so nicely from EKF.</p>
<h2 id="update">Update</h2>
<p>Our sensor model updates the state to “fold in” information taken from sensor measurements. In the digression, we showed that the sensor update corresponds to the following probabilistic representation:</p>
\[p(x_k|z_{1:k}) \propto p(z_k|x_k) p(x_k|z_{1:k-1})\]
<p>where $p(x_k|z_{1:k})$ is our latest state, given all of the sensor measurements, and $p(x_k|z_{1:k-1})$ is the result of applying just the motion model to the previous state $p(x_{k-1}|z_{1:k-1})$. So what’s that extra term $p(z_k|x_k)$? That’s our sensor model! It’s the distribution of our sensor measurement given the forward-predicted-by-the-motion-model state! To reiterate, remember that for EKFs, we thought of the sensor model as mapping our state space into our observation space. This is exactly what $p(z_k|x_k)$ represents: given a state $x_k$, we want to know how likely seeing a particular sensor measurement $z_k$ is.</p>
<p>Notice that the above equation has “proportional to” $\propto$ instead of “equals” $=$. This is because the quantity on the right-hand-side has yet to be normalized. Specifically, we know for a fact that $p(x_k|z_{1:k-1})$ is already normalized so we really just need to normalize our sensor model $p(z_k|x_k)$ for the latest state $p(x_k|z_{1:k})$ to be a valid probability distribution.</p>
<p>Another interpretation is that the normalized $p(z_k|x_k)$ act as a set of <strong>weights</strong> that are large if the state “agrees” with the sensor measurement and small if the state “disagrees” with the sensor measurement. In the context of our particle filter, after moving each particle according to the motion model, we <em>weight</em> each of the particles by how well they agree with the sensor measurement. The particles that are in the most agreement will have a larger weight, and we normalize the weights so that the result is a valid probability distribution.</p>
<p>In fact, after the sensor model update, we can compute the best state estimate given the particles $x^{(i)}_k$ and their associated weights $w^{(i)}_k$ by a simple weighted sum:</p>
\[\bar{x}_k = \sum_i w^{(i)}_k x^{(i)}_k\]
<p>To make the weights more concrete, let’s look at an example where the sensor is a GPS. The GPS will tell us, with some noise covariance matrix $R$, the position of our robot. We want to construct our sensor update such that the particles closest to the GPS reading receive the largest weights. One way to do this is, for each particle, to treat the particle’s predicted position as the mean of a Gaussian whose covariance is the GPS noise $R$, and then evaluate that Gaussian pdf at the GPS measurement. This works because a Gaussian pdf is largest at its mean, so the weight shrinks as the “error” between a particle’s position and what the GPS is telling us grows.</p>
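<p>Here’s a sketch of that GPS-style weight update, assuming each particle’s first two components are its position and that we renormalize at the end (the function name is my own):</p>

```python
import numpy as np

def update_weights(particles, weights, z, R):
    """Reweight each particle by the sensor likelihood p(z | x^(i)): a
    Gaussian centered at the particle's predicted position with covariance R,
    evaluated at the GPS measurement z. Returns normalized weights."""
    diff = particles[:, :2] - z            # per-particle error vs. the reading
    R_inv = np.linalg.inv(R)
    # Unnormalized Gaussian pdf exp(-0.5 * d^T R^-1 d); the constant factor
    # in front cancels when we normalize below.
    likelihood = np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, R_inv, diff))
    w = weights * likelihood
    return w / w.sum()
```

<p>Particles whose positions agree with the measurement end up with larger normalized weights, just as in the figure below.</p>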
<p>The combination of the particles and weights finally form the distribution corresponding to the most recent state given all of the sensor measurements $p(x_k|z_{1:k})$!</p>
<p><img src="/images/particle-filters/sensor-model-weights.svg" alt="Computing the weights from a sensor measurement" title="Computing the weights from a sensor measurement" /></p>
<p><small>The robot is stationed at the origin and receives a positional sensor measurement (depicted as a magenta ‘X’) that, with noise included, actually tells us we’re slightly above the origin ($y=0.1$). The left pane shows a uniform distribution of weights for all particles, i.e., each particle gets a weight of $\frac{1}{N}$. The right pane adjusts these weights by factoring in the sensor model; we see that particles closer to the measurement are weighted higher. Finally, we normalize the weights so they form a valid probability distribution.</small></p>
<p>(Note that we’re not assuming any particular distribution: the weighted mean of a set of particles is always well-defined.)</p>
<p>So now we have our sensor update to apply after our motion model! We’re almost finished with the full particle filter algorithm.</p>
<h2 id="resampling">Resampling</h2>
<p>The last thing we need to discuss is resampling. Repeatedly applying the motion model and assigning weights to each particle runs into a problem as the particle filter evolves. Suppose we direct our robot to travel straight ahead for a long time. Since the particles are initially randomly distributed, many of them often drift significantly away from the best estimated state. Naturally, these would have small weights once the sensor measurements are accounted for. What we’re left with is only a few particles accurately representing the true state of the robot. We’re wasting perfectly good particles for no benefit! This problem is often called <strong>degeneracy</strong>.</p>
<p>For the poorly-weighted particles, we’d like to swap them for higher-quality particles, and we can do this through the process of <strong>resampling</strong>. Since we know the combination of the weights and particles form a valid probability distribution, we can simply sample from this weighted distribution. Particles at the fringe with small weights will be less likely to be selected while particles near the true state of the robot will be more likely to be included in the resampled particle set. In other words, we swap out the lower-weighted particles with copies of the higher-weighted particles to get rid of bad particles while maintaining the same number of particles overall. This technique is called <strong>sequential importance resampling (SIR)</strong>. While it’s certainly not the only or most advanced sampling technique, it’s fairly popular and works well in practice.</p>
<p><img src="/images/particle-filters/resampling.svg" alt="Resampling" title="Resampling" /></p>
<p><small>Suppose our weight distribution looked like that in the left pane. The right pane has the resampled particles; notice how the higher-weighted particles were sampled more often and the lower-weighted particles at the fringe aren’t selected as part of the resampling. The new weights are reset to $\frac{1}{N}$ after resampling.</small></p>
<p>Practically, an easy way to implement resampling is to create a bin for each weight whose size is proportional to the size of the weight. Then sample from a uniform distribution of $[0, \sum_i w_k^{(i)}]$ and take the index of the bin the sample falls into. A larger weight means a larger bin and the greater the chance that particle is selected as part of the resampling process.</p>
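<p>Assuming the weights are already normalized (so the bins span $[0, 1]$), the bin trick above can be sketched with <code>np.cumsum</code> and <code>np.searchsorted</code> (the function name is mine):</p>

```python
import numpy as np

def resample(particles, weights, rng):
    """Sequential importance resampling via the bin trick: each particle owns
    a bin whose width equals its weight, so a uniform draw lands in a bin
    with probability proportional to that weight."""
    edges = np.cumsum(weights)
    edges[-1] = 1.0                       # guard against floating-point drift
    draws = rng.uniform(0.0, 1.0, size=len(weights))
    idx = np.searchsorted(edges, draws, side='right')
    # Copies of high-weight particles replace low-weight ones; weights reset.
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```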
<p>As for how often to resample, we don’t resample at each state update. Resampling too often means we lose particle diversity, i.e., many of our particles will actually be the same state. In the worst case, all of our particles would have the same estimate, effectively just being represented by one particle! This problem is called <strong>sample impoverishment</strong>. There are plenty of different techniques to remedy this, but a simple one is to compute a quantity called the <strong>effective number of particles</strong>.</p>
\[\hat{N}_{\text{eff}} = \displaystyle\frac{1}{\displaystyle\sum_i (w^{(i)})^2}\]
<p>(Note that the $2$ is actually an exponent, not a particular particle index.)</p>
<p>This is a curious metric, but, intuitively, we can think of it as measuring the “information content” of the particle set given their weights. For example, consider if each particle has the same, uniform weight $\frac{1}{N}$. This is also called a <em>high entropy</em> distribution because it doesn’t really tell us much about which particle is most representative of the state: they’re all equally representative! In that case, we’ll need all of the particles we can get, and, indeed $\hat{N}_\mathrm{eff}=N$.</p>
<p>To the other extreme, suppose all of our weights are $0$ except for one which is $1$. We call this a <em>low-entropy</em> distribution because it gives us a very good idea of which particle is representative of the best state. In that case, our information content is really just that one particle, and, indeed, $\hat{N}_\mathrm{eff}=1$.</p>
<p>With this heuristic, we can pick a threshold $N_\mathrm{thresh}$ such that we resample if $\hat{N}_ \mathrm{eff} < N_\mathrm{thresh}$. A good initial value to try is $\frac{N}{2}$ or $\frac{N}{3}$; this should be tuned later on.</p>
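<p>In code, the heuristic and the resampling decision are tiny (function names are mine):</p>

```python
import numpy as np

def n_eff(weights):
    """Effective number of particles for a set of normalized weights."""
    return 1.0 / np.sum(weights ** 2)

def should_resample(weights, n_thresh):
    """Resample when the effective particle count drops below the threshold."""
    return n_eff(weights) < n_thresh
```

<p>Uniform weights give $\hat{N}_\mathrm{eff}=N$ and a one-hot weight vector gives $\hat{N}_\mathrm{eff}=1$, matching the two extremes above.</p>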
<p>One last thing to address about resampling: what happens to the weights after resampling, and in between resampling steps? Initially, we start with the weights taking a uniform distribution $w_k^{(i)}=\frac{1}{N}$. Similarly, right after resampling, we reset the weights to that same uniform distribution. For the time steps where we don’t resample, we accumulate the weights rather than recomputing new ones from scratch, using the following simple update rule:</p>
\[w^{(i)}_{k} = w^{(i)}_{k-1}p(z_k|x^{(i)}_ k)\]
<p>Remember to normalize afterwards! This allows us to preserve some historical information about the weights across time steps.</p>
<h2 id="the-particle-filter-algorithm">The Particle Filter Algorithm</h2>
<p>We’ve seen derivations of the particle filter rules over the past few sections so it’s time to bring together the fruits of our labor!</p>
<p>The particle filter algorithm (specifically with sequential importance resampling) has the following actors:</p>
<ul>
<li>$N$: the number of particles</li>
<li>$N_\text{thresh}$: the resampling criterion threshold</li>
<li>$p(x_k|x_{k-1})$: the motion model</li>
<li>$p(z_k|x_k)$: the sensor model</li>
</ul>
<p>And the algorithm is</p>
<ol>
<li>Randomly initialize $N$ particles $x_0^{(i)}$ and uniformly initialize $N$ weights $w_0^{(i)} = \frac{1}{N}$ to get the prior distribution $p(x_0)$.</li>
<li>Apply the motion model $f$ to each particle independently $x_k^{(i)} = f(x_{k-1}^{(i)}, u_k) + q_k$, where $u_k$ is an optional control and $q_k$ is added noise, to get the post-motion-model distribution $p(x_k|z_{k-1})$.</li>
<li>From the sensor model, update the weights: $w_k^{(i)}\gets w_{k-1}^{(i)} p(z_k|x_k)$</li>
<li>Renormalize the weights: $w_k^{(i)}\gets\frac{w_k^{(i)}}{\sum_j w_k^{(j)}}$</li>
<li>Compute the resampling criterion: $\hat{N}_{\text{eff}} = \frac{1}{\sum_i (w^{(i)})^2}$</li>
<li>Resample $N$ particles if the resampling criterion falls below the threshold: $\hat{N}_ {\text{eff}} < N_{\text{thresh}}$ and reset the weights to the uniform distribution: $w_k^{(i)} = \frac{1}{N}$.</li>
<li>Compute and report the best state estimate as the weighted mean of the particles: $\bar{x}_k = \sum_i w^{(i)}_k x^{(i)}_k$ (the weights are already normalized, so no extra $\frac{1}{N}$ factor is needed).</li>
<li>Go to step 2 and repeat for as long as the filter runs.</li>
</ol>
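<p>The steps above can be sketched in a few lines of Python. This is a minimal 1D toy of my own devising (a robot moving along a line with a direct, noisy position sensor, so the sensor likelihood is a simple Gaussian); all the names and noise parameters are illustrative assumptions, not a production implementation:</p>

```python
import numpy as np

def particle_filter_step(particles, weights, u, z, motion_noise, sensor_noise,
                         rng, n_thresh):
    """One sequential-importance-resampling iteration for a 1D state.
    `u` is the commanded displacement, `z` the position measurement."""
    n = len(particles)
    # Motion model: apply the control plus noise to every particle.
    particles = particles + u + rng.normal(0.0, motion_noise, size=n)
    # Sensor model: Gaussian likelihood of the measurement per particle.
    likelihood = np.exp(-0.5 * ((z - particles) / sensor_noise) ** 2)
    weights = weights * likelihood
    weights /= weights.sum()
    # Resample if the effective sample size falls below the threshold.
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < n_thresh:
        idx = rng.choice(n, size=n, p=weights)
        particles = particles[idx]
        weights = np.full(n, 1.0 / n)
    # Best state estimate: weighted mean of the particles.
    estimate = np.sum(weights * particles)
    return particles, weights, estimate

rng = np.random.default_rng(0)
n = 500
particles = rng.uniform(-10, 10, size=n)   # prior p(x_0)
weights = np.full(n, 1.0 / n)
true_x = 0.0
for _ in range(30):
    true_x += 1.0                          # robot moves 1 unit per step
    z = true_x + rng.normal(0.0, 0.5)      # noisy position measurement
    particles, weights, est = particle_filter_step(
        particles, weights, u=1.0, z=z,
        motion_noise=0.2, sensor_noise=0.5, rng=rng, n_thresh=n / 2)
print(f"estimate={est:.2f}, truth={true_x:.2f}")
```

<p>After a few dozen steps, the weighted mean of the particles tracks the true position to well within the sensor noise.</p>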
<p>Particle filters aren’t without problems, however. We already discussed degeneracy and sample impoverishment. Another difficulty is choosing the right number of particles: $N$ depends both on the dimensionality of the state space, e.g., a full 6DOF pose requires more particles than just a 3DOF position and heading, and on the complexity of the environment. More particles mean more computation; particle filters are generally more computationally expensive since they require running the motion model and weight computation for every particle.</p>
<h1 id="conclusion">Conclusion</h1>
<p>EKFs as state estimators can work really well but do have some limitations that disqualify them from certain kinds of systems. Specifically, they’re only good for Gaussian and slightly nonlinear systems. We saw that a large part of these limitations stemmed from EKFs being parametric state estimators that required explicit updating of a mean and covariance. To get around this problem and move into the non-parametric domain, our key insight was that any arbitrary distribution can be represented by samples, or particles, of that distribution. We dubbed this non-parametric approach of state estimation a particle filter. We started by initializing $N$ randomly-generated particles with equally distributed weights, i.e., all are $\frac{1}{N}$. For our motion model, we simply apply it, with some noise, to each particle independently. Since the sensor update didn’t follow so nicely, we sojourned to Hidden Markov Models for just long enough to motivate the particle filter sensor model. We interpreted the model as computing and maintaining a weight for each particle that measured sensor agreement. If we needed to resample to avoid degeneracy, we can do that. At the end of the day, we computed a state estimate by taking the weighted sum of the particles (using the weights as weights, of course). Then we simply “rinse-and-repeated” for the duration of the particle filter and that’s our non-parametric state estimator!</p>
<p>This post ended up being much, much longer than I had originally anticipated so kudos if you’ve read everything 😀️! I wanted to make sure I properly motivated each equation and formulation so it read more like a story and less like an itemized list of facts. Besides particle filters, there are some more bespoke techniques for robotic state estimation and localization that we’ll maybe get to next time.</p>Going beyond EKFs, I'll motivate particle filters as a more advanced state estimator that can compensate for the limitations of EKFs.Extended Kalman Filtering for Robotic State Estimation2020-10-05T00:00:00+00:002020-10-05T00:00:00+00:00/ekf<p>Over the past decade or so, there have been significant advancements in the field of robotics, e.g., <a href="https://www.bostondynamics.com/spot">Boston Dynamics’ Spot</a>, <a href="https://www.davincisurgery.com">Da Vinci surgical robot</a>, <a href="https://www.tesla.com/autopilot">Tesla’s Autopilot</a> and more. I had some experience working around robots but I’ve never really built my own from scratch, components, software, and all. To remedy this, I decided to take up building a robot as a “little” side project. Arguably, one of the most important aspects is <strong>state estimation</strong>, the problem of accurately determining variables about your robot, e.g., its position with respect to some global frame, velocity, acceleration, IMU biases, and other dynamical variables, given the robot’s sensors and kinematics. Previously, I didn’t have to worry about this since there was always some other team responsible for state estimation whose output I would use. But since I was constructing my own robot, I had to do this leg work myself. (Of course, I could have used an off-the-shelf product, but where’s the fun in that?) To that goal, this post aims to describe the underpinnings of a very common approach to state estimation: the Extended Kalman Filter (EKF).</p>
<h1 id="kalman-filters">Kalman Filters</h1>
<p>You probably read the title and thought, “wait, what’s a Kalman Filter in the first place? Shouldn’t we discuss that before extending it?” You’re absolutely right! I was planning on doing a writeup of Kalman Filters, but then I found this fantastic post: <a href="https://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/">How a Kalman filter works, in pictures</a>. (I have no affiliation with the author; I just thought the post was written really well, and I wouldn’t have many additional things to say about the topic.) Skim through that first before continuing so you get the gist of Kalman Filters.</p>
<p>I have just a few additions to make on top of that post:</p>
<p>I think the motion model is pretty intuitive, but the measurement model confused me the very first time I encountered it so I’ll provide a slightly different interpretation with some concrete examples. Suppose our state is like that in the post:</p>
\[\hat{x}_k = \begin{bmatrix}p_k\\ v_k\end{bmatrix}\]
<p>The <strong>measurement/observation model</strong> is a matrix $H_k$ that maps the state space into the sensor space. Imagine driving our robot along and taking a sensor measurement: where that measurement lies depends on where we are and how fast we’re going, i.e., our state. When we multiply $H_k$ by the state $\hat{x}_k$, we’re computing the estimated sensor reading; this is why $H_k$ acts on the state vector. Think of it as a way to map our state space to our observation space.</p>
<p>To give a more concrete example, suppose our robot had GPS, which we could use to sense the position component of our state vector but not the velocity component. Furthermore, the mapping is direct: the whole purpose of the GPS sensor is to tell you the position (plus a noise error). Then our $H_k$ matrix would look like the following:</p>
\[\begin{aligned}
\begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} p_k\\v_k \end{bmatrix} &= \begin{bmatrix} p_k \end{bmatrix}\\
H_k\hat{x}_k &= \begin{bmatrix} p_k \end{bmatrix}\\
\end{aligned}\]
<p>In a slightly more complicated (and realistic) example in 2D space, we can still compute the $H_k$ matrix:</p>
\[\begin{aligned}
\begin{bmatrix} 1 & 0 & 0 & 0\\0 & 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} p^{(1)}_k\\p^{(2)}_k\\v^{(1)}_k\\v^{(2)}_k \end{bmatrix} &= \begin{bmatrix} p^{(1)}_k\\p^{(2)}_k \end{bmatrix}\\
H_k\hat{x}_k &= \begin{bmatrix} p^{(1)}_k\\p^{(2)}_k \end{bmatrix}
\end{aligned}\]
<p>Note the superscripts represent spatial dimensions, not exponents!</p>
<p>The final thing I’ll say is about the intuitive meaning of the Kalman Gain $P_k H^T_k(H_k P_k H^T_k + R_k)^{-1}$. This is probably the most complicated-looking quantity in the Kalman Filter equations, but it has a very intuitive explanation. To see this, let’s rewrite it a little:</p>
\[K = \displaystyle\frac{P_k H^T_k}{H_k P_k H^T_k + R_k}\]
<p>It has two “inputs”: $P_k$, the covariance around the state, and $R_k$, the covariance around the sensor measurement.</p>
<p>Let’s consider the case where $P_k$ has small values and $R_k$ has large values. This means we’re very certain about our state but more uncertain about our sensors. Making these adjustments, we’ll see that $K$ will have very small values, and that means we won’t use our sensor measurements as much:</p>
\[\hat{x}_k' = \hat{x}_k + K(z_k-H_k\hat{x}_k)\]
<p>If $K$ has small values, then $\hat{x}_k’$ is dominated by $\hat{x}_k$, our forward-projected state.</p>
<p>Now let’s consider the opposite case: $R_k$ has small values and $P_k$ has large values. This means that there’s less uncertainty around our sensor measurements, i.e., our sensors are very accurate, and more uncertainty around our state. In that case, we definitely want to use our sensors to update our state since it will give us a better estimate than if we had very inaccurate sensors. Notice when $R_k\rightarrow 0$, $K\rightarrow H_k^{-1}$ (I’m being a little sloppy with the notation), which is just the matrix we use to map back to the state space from the observation space.</p>
\[\hat{x}_k' = \hat{x}_k + K(z_k-H_k\hat{x}_k)\]
<p>If $K$ has larger values, then $\hat{x}_k’$ will use both $\hat{x}_k$ and the sensor measurement $z_k-H_k\hat{x}_k$.</p>
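<p>This trade-off is easy to see numerically. In the 1D case with $H_k = 1$, the gain reduces to $K = \frac{P}{P+R}$; here's a tiny sketch (the function name is my own):</p>

```python
# 1D toy case: H = 1, so the Kalman gain is K = P / (P + R).
def kalman_gain(P, R):
    return P / (P + R)

# Certain state, noisy sensor: gain is tiny, we mostly keep the prediction.
print(kalman_gain(P=0.01, R=10.0))   # approximately 0.001

# Noisy state, accurate sensor: gain approaches 1 (= H^{-1} here),
# and the update leans heavily on the measurement.
print(kalman_gain(P=10.0, R=0.01))   # approximately 0.999
```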
<h1 id="extended-kalman-filters-ekfs">Extended Kalman Filters (EKFs)</h1>
<p>Now that we have an understanding of the basics of Kalman Filters, we can extend them to work well for a wider range of problems. In the previous section, we saw how the Kalman Filter can be used to estimate our robot’s state using just a few different matrices. One significant caveat about the Kalman Filter is that it’s a <em>linear</em> filter!</p>
<p>Take another look at our predict and update steps: everything in them is linear. This is a problem because we’re relying on Gaussians: applying linear functions to a Gaussian will produce a Gaussian but applying a nonlinear function to a Gaussian might not!</p>
<p>This is the folly of Kalman Filters: they’re really good at modeling linear systems, but the world has many nonlinear systems. A more powerful and accurate representation would be a nonlinear one.</p>
<p>So how can we construct a nonlinear version of Kalman Filters? Well we know that Kalman Filters work well for linear systems; if we can come up with a way to <em>linearize</em> our nonlinear system at the current estimate, then we can simply use the exact same Kalman Filter mechanics to solve our problem!</p>
<p>Armed with a little calculus knowledge, we can come up with a way to create a linear approximation of a system at a particular value. To see this, let’s use a simple example.</p>
<p>Suppose we have a parabola, e.g., the function $f(x) = x^2$. We want to create a linear approximation of our quadratic function at the point $x=2$. In other words, we want the tangent line at $x=2$. Using a little calculus, we could just compute the derivative $\displaystyle\frac{\mathrm{d}f}{\mathrm{d}x}$ and evaluate it at $x=2$. This would give us the slope of the <em>tangent line</em> at that point, and we can use that line to estimate our function around $x=2$. Notice that the farther away we move from $x=2$, the worse our estimate gets so it’s only good local to $x=2$, which is what makes this a local approximation.</p>
<p><img src="/images/ekf/tangent.svg" alt="Tangent line" title="Tangent line" /></p>
<p><small>Plot of $x^2$ in blue while the tangent line at $x=2$ is shown in orange.</small></p>
<p>Mathematically, we can represent our new approximation using the following function, where $a=2$.</p>
\[\begin{aligned}
\tilde{f}(x) &= f(a) + f'(a)\cdot(x-a)\\
\tilde{f}(x) &= f(2) + f'(2)\cdot(x-2)\\
&= 4 + 2(2)\cdot(x-2)\\
&= 4 + 4(x-2)\\
&= 4x - 4
\end{aligned}\]
<p>What we’re actually doing is creating the <strong>Taylor series</strong> of $f$ at $x=2$, particularly just the first-order Taylor series, i.e., a line.</p>
<p>Notice the slope of this line is correctly $f’(a)$. The y-intercept is a bit more complicated, but can be worked out using some algebra.</p>
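<p>A quick numerical check of this approximation (the function names here are just for illustration):</p>

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

def tangent(a, x):
    """First-order Taylor approximation of f around the point a."""
    return f(a) + f_prime(a) * (x - a)

assert tangent(2.0, 2.0) == 4.0                  # exact at the expansion point
assert abs(tangent(2.0, 2.1) - f(2.1)) < 0.02    # good close to x = 2
assert abs(tangent(2.0, 5.0) - f(5.0)) == 9.0    # poor far from x = 2
```

<p>The approximation is exact at $x=2$, off by only $0.01$ at $x=2.1$, but off by $9$ at $x=5$; the approximation is only good locally.</p>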
<p>So where does this apply to Kalman Filters? For EKFs, we replace the motion and sensor models with nonlinear functions $f(\hat{x}_{k-1}, u_k)$ and $h(\hat{x}_k)$, respectively. One important thing to note is that these functions accept vectors as inputs <em>and produce vector outputs</em>! This means we can’t directly apply the Taylor series equation we’ve seen to linearize our input. In other words, writing $\displaystyle\frac{\mathrm{d}h}{\mathrm{d}\hat{x}_k}$ is ambiguous; which output of $h$ are we referring to? Instead, I want to know how changing a particular input will affect a particular output. For this case, we need to compute all of the partial derivatives of each output with respect to each input.</p>
<p>Let’s consider an example we’ve seen before:</p>
\[\begin{aligned}
p_k &= p_{k-1} + \Delta t v_{k-1}\\
v_k &= \hphantom{p_{k-1}} \hphantom{+} \hphantom{\Delta t~~} v_{k-1}
\end{aligned}\]
<p>Recall this is just the Kalman Filter’s motion model. Let me re-write this in a slightly different way.</p>
\[\begin{aligned}
f^{(1)} &= x^{(1)} + \Delta t x^{(2)}\\
f^{(2)} &= \hphantom{x^{(1)} + \Delta t } x^{(2)}
\end{aligned}\]
<p>where the superscripts are actually vector components. So the above is $\mathbf{f}(\mathbf{x})$, where the $(1)$ components correspond to $p$ and the $(2)$ components to $v$. I’ve also dropped the subscripts for now.</p>
<p>Now we have to be more specific when we compute partial derivatives. Consider the partial derivative $\displaystyle\frac{\partial f^{(1)}}{\partial x^{(1)}}$. This tells us how much the first component of $f$ changes with the first component of the input vector. In other words, this tells us how much the position at $k$ changes when the position at $k-1$ changes, holding the velocity constant.</p>
<p>$\displaystyle\frac{\partial f^{(2)}}{\partial x^{(1)}}$ tells us how much the second component of $f$ changes with the first component of the input vector. $\displaystyle\frac{\partial f^{(1)}}{\partial x^{(2)}}$ tells us how much the first component of $f$ changes with the second component of the input vector.</p>
<p>Let’s take all 4 partial derivatives:</p>
\[\begin{aligned}
\displaystyle\frac{\partial f^{(1)}}{\partial x^{(1)}} &= 1\\
\displaystyle\frac{\partial f^{(1)}}{\partial x^{(2)}} &= \Delta t\\
\displaystyle\frac{\partial f^{(2)}}{\partial x^{(1)}} &= 0\\
\displaystyle\frac{\partial f^{(2)}}{\partial x^{(2)}} &= 1\\
\end{aligned}\]
<p>A better format would be to arrange them into a matrix:</p>
\[\begin{aligned}
J &= \begin{bmatrix}
\displaystyle\frac{\partial f^{(1)}}{\partial x^{(1)}} & \displaystyle\frac{\partial f^{(1)}}{\partial x^{(2)}}\\
\displaystyle\frac{\partial f^{(2)}}{\partial x^{(1)}} & \displaystyle\frac{\partial f^{(2)}}{\partial x^{(2)}}
\end{bmatrix}\\
&= \begin{bmatrix}
1 & \Delta t\\
0 & 1
\end{bmatrix}
\end{aligned}\]
<p>This is called the <strong>Jacobian Matrix</strong> $J$. It’s a generalization of first derivatives for functions that accept vector inputs and produce vector outputs. When using these kinds of functions, anywhere we have first derivatives, we can put the Jacobian $J$ in its place.</p>
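<p>A handy way to sanity-check a hand-derived Jacobian is to compare it against a finite-difference approximation. Here's a sketch for the motion model above (the function names and $\Delta t = 0.1$ are arbitrary choices of mine):</p>

```python
import numpy as np

def f(x, dt=0.1):
    """Constant-velocity motion model: p' = p + dt * v, v' = v."""
    return np.array([x[0] + dt * x[1], x[1]])

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (func(x + dx) - func(x - dx)) / (2 * eps)
    return J

J = numerical_jacobian(f, np.array([1.0, 2.0]))
print(J)  # approximately [[1, 0.1], [0, 1]], matching the derivation
```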
<p>For example, let’s take a look at our Taylor Series expansion again:</p>
\[\tilde{f}(x) = f(a) + f'(a)\cdot(x-a)\\\]
<p>But since $f$ is now a vector-valued function, we replace the derivative with the Jacobian and the equation has the identical form:</p>
\[\tilde{\mathbf{f}}(\mathbf{x}) = \mathbf{f}(\mathbf{a}) + J(\mathbf{a})\cdot(\mathbf{x}-\mathbf{a})\\\]
<p>So the Jacobian here is being used to create linear approximations at a vector $\mathbf{a}$, just like derivatives could with the scalar $a$!</p>
<p>Also, notice that we’ve shown $J=F$, the motion model. This shows EKFs are more general than just Kalman Filters: using the EKF formulation on a function that’s already linear just gives us the Kalman Filter! (This is actually a far broader concept: linear functions of any variables can always be written in matrix form.)</p>
<p>Now that we know about the Jacobian, let’s see what our Extended Kalman Filter equations look like:</p>
\[\begin{align}
\hat{x}_k &= f(\hat{x}_{k-1}, u_k)\\
P_k &= F_k P_{k-1} F^T_k + Q_k\\
\hat{x}_k' &= \hat{x}_k + K(z_k-h(\hat{x}_k))\\
P_k' &= P_k - KH_k P_k\\
K &= P_k H^T_k(H_k P_k H^T_k + R_k)^{-1}
\end{align}\]
<p>They look almost identical! First, we have a motion (and control) model function $f$ and a sensor model function $h$, both of which need not be linear! (But as we’ve shown, it’s OK if they are because then the EKF reduces to just the Kalman Filter!)</p>
<p>Also, the definitions of the $F$ and $H$ matrices have to change to be the <em>Jacobians</em> of $f$ and $h$, respectively.</p>
\[\begin{aligned}
F_k &= \displaystyle\frac{\partial f}{\partial \mathbf{x}}\Bigr|_{\mathbf{x}=\hat{x}_{k-1},u_k}\\
H_k &= \displaystyle\frac{\partial h}{\partial \mathbf{x}}\Bigr|_{\mathbf{x}=\hat{x}_k}
\end{aligned}\]
<p>Keep in mind that $f$ and $h$ accept vectors as inputs and return vectors as outputs. Besides those two changes, the EKF equations look identical to the Kalman Filter ones!</p>
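<p>Here's a rough sketch of these equations in Python with NumPy. The <code>ekf_step</code> function and its argument names are my own; the demo at the bottom plugs in the linear position-velocity model from earlier with a GPS-style sensor, so the EKF reduces to a plain Kalman Filter, as discussed above:</p>

```python
import numpy as np

def ekf_step(x, P, u, z, f, F_jac, h, H_jac, Q, R):
    """One EKF predict/update cycle. f and h are the (possibly nonlinear)
    motion and sensor models; F_jac and H_jac return their Jacobians."""
    # Predict: push the state through the motion model, project the covariance.
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update: Kalman gain, then correct with the measurement innovation.
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new

# Linear demo: position-velocity state with a direct position sensor.
dt = 0.1
f = lambda x, u: np.array([x[0] + dt * x[1], x[1]])
F_jac = lambda x, u: np.array([[1.0, dt], [0.0, 1.0]])
h = lambda x: np.array([x[0]])
H_jac = lambda x: np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = ekf_step(x, P, None, np.array([0.12]), f, F_jac, h, H_jac, Q, R)
print(x)  # position is pulled from the 0.1 prediction toward the 0.12 measurement
```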
<p>Let’s finish our discussion on the EKF with an example of computing the Jacobian of the sensor model function $h$. For this example, suppose our state vector contains four quantities: 2D position and 2D velocity. In other words, we have a robot in a plane.</p>
\[x_k = \begin{bmatrix} p^{(1)}_k \\ p^{(2)}_k \\ v^{(1)}_k \\ v^{(2)}_k \end{bmatrix}\]
<p>Our robot is equipped with a range-bearing sensor that can give us an angle $\theta$ and radius $r$.</p>
<p><img src="/images/ekf/bearing.svg" alt="Bearing sensor" title="Bearing sensor" /></p>
<p><small>Example of a bearing sensor detecting an object with respect to the robot frame.</small></p>
<p>Since our sensor model maps our state space into our sensor space, $h$ is a function that computes $\theta$ and $r$ given our state information, such as the object’s location relative to our robot $(b^{(x)}, b^{(y)})$.</p>
<p>With some geometry, we can figure out what $h$ should be.</p>
\[h(\hat{x}_k)=\begin{bmatrix}r\\\theta\end{bmatrix}
=\begin{bmatrix}\sqrt{b^{(x)}\cdot b^{(x)} + b^{(y)}\cdot b^{(y)}}\\\arctan{\displaystyle\frac{b^{(y)}}{b^{(x)}}}\end{bmatrix}\]
<p>Now we can compute the Jacobian $H$ by looking at all of the partial derivatives of $h$.</p>
\[H=\begin{bmatrix}
\displaystyle\frac{\partial r}{\partial b^{(x)}} & \displaystyle\frac{\partial r}{\partial b^{(y)}} \\
\displaystyle\frac{\partial \theta}{\partial b^{(x)}} & \displaystyle\frac{\partial \theta}{\partial b^{(y)}} \\
\end{bmatrix}
=\begin{bmatrix}
\displaystyle\frac{b^{(x)}}{\sqrt{b^{(x)}\cdot b^{(x)} + b^{(y)}\cdot b^{(y)}}} & \displaystyle\frac{b^{(y)}}{\sqrt{b^{(x)}\cdot b^{(x)} + b^{(y)}\cdot b^{(y)}}}\\
-\displaystyle\frac{b^{(y)}}{b^{(x)}\cdot b^{(x)} + b^{(y)}\cdot b^{(y)}} &
\displaystyle\frac{b^{(x)}}{b^{(x)}\cdot b^{(x)} + b^{(y)}\cdot b^{(y)}}\\
\end{bmatrix}\]
<p>(I’ll leave the derivations as an exercise.)</p>
<p>With this Jacobian computed, we can simply use the equations of Kalman Filters.</p>
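<p>For completeness, here's a sketch of this range-bearing model and its Jacobian in code (function names are mine), with a finite-difference spot check, which is an easy way to catch sign errors in hand-derived entries:</p>

```python
import math

def h(bx, by):
    """Range-bearing sensor model: returns (r, theta)."""
    r = math.hypot(bx, by)
    theta = math.atan2(by, bx)
    return r, theta

def H_jacobian(bx, by):
    """Hand-derived Jacobian of h with respect to (bx, by)."""
    q = bx * bx + by * by
    r = math.sqrt(q)
    return [[bx / r, by / r],
            [-by / q, bx / q]]

# Spot check the dr/dbx entry against a central difference at (3, 4).
eps = 1e-6
bx, by = 3.0, 4.0
J = H_jacobian(bx, by)
dr_dbx = (h(bx + eps, by)[0] - h(bx - eps, by)[0]) / (2 * eps)
assert abs(J[0][0] - dr_dbx) < 1e-6
```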
<h1 id="conclusion">Conclusion</h1>
<p>A fundamental problem in robotics is state estimation, and one of the most common ways to solve this is with Kalman Filters. But Kalman Filters have the fundamental limitation that they only work for linear systems, and many systems we’re interested in and want to model are nonlinear. We can still use the Kalman Filter machinery if we take a linear approximation of our nonlinear system at its current state. The Jacobian is a general way to find a linear approximation locally; think of it like finding the slope of a tangent line at a point, but for functions with more than one input and output. Using the Jacobian, we have a linear approximation and can simply reuse most of the equations of Kalman Filters.</p>
<p>Writing this post reminded me how interesting and multidisciplinary of a field robotics is; expect more articles on this topic in the future 🙂</p>I discuss a fundamental building block for state estimation for a robot: the extended kalman filter (EKF).Deep Reinforcement Learning: Policy-based Methods2019-01-20T00:00:00+00:002019-01-20T00:00:00+00:00/deep-rl-policy-methods<p>My previous posts on <a href="/reinforcement-learning">reinforcement learning</a> and <a href="/deep-rl-value-methods">value-based methods in deep reinforcement learning</a> discussed q-learning and deep q-networks. In this post, we’re going to look at deep reinforcement methods that directly manipulate the policy. In particular, we’ll look at a few variants of <strong>policy gradients</strong> and then a state-of-the-art algorithm called <strong>proximal policy optimization (PPO)</strong>, the same algorithm that defeated expert human players at <a href="https://blog.openai.com/openai-five/">DOTA</a>.</p>
<p>For a more concrete understanding, i.e., code, I have a <a href="https://github.com/mohitd/policy-gradient">repo</a> where I’ve implemented the models and algorithms of this post in PyTorch.</p>
<h1 id="policy-gradients-and-reinforce">Policy Gradients and REINFORCE</h1>
<p>The value-based models that we’ve discussed a <a href="/deep-rl-value-methods">previous post</a> use a neural network to estimate q-values which are then used to implicitly compute the policy. Consequently, these value-based methods will produce <strong>deterministic policies</strong>. In other words, the same action will be given for the same state, i.e., $a=\pi_\theta(s)$, assuming the q-values don’t change.</p>
<p>Recall that Markov Decision Processes could be solved with value-based methods, like value iteration, and policy-based methods, such as policy iteration. Similarly, we can directly use policy-based methods in deep reinforcement learning. This approach is called <strong>policy gradients</strong>.</p>
<p>For policy gradients, we use a <strong>stochastic policy</strong> $\pi_\theta(a\vert s)$, parameterized by some $\theta$, that gives us the probability of taking action $a$ in state $s$. To compute this policy, we use a neural network!</p>
<p><img src="/images/deep-rl-policy-methods/policy-network.svg" alt="Policy Network" title="Policy Network" /></p>
<p><small>The policy network receives a game frame as input and produces a probability distribution over the actions using the softmax function. For continuous control, our policy network will produce a mean and variance value, assuming the values of our action are distributed normally.</small></p>
<p>This <strong>policy network</strong> takes in a frame from our game and produces a probability distribution over the actions (for discrete actions). Our policy network computes the stochastic policy $\pi_\theta(a\vert s)$ where $\theta$ are the neural network parameters. Anywhere you see $\pi_\theta(a\vert s)$, just think of it as the policy network.</p>
<p>Now that we’ve defined our policy and the role of the policy network, let’s talk about how we can train it. To improve our policy, we need a quality function to tell us how good our policy is. This <strong>policy score</strong> function is dependent on the particular environment we’re operating in, but, generally, the expected reward is a good measure of policy quality.</p>
\[J(\theta) = \mathbb{E}_{(s_1, a_1,\dots, s_T, a_T)\sim\pi_\theta} \bigg[\sum_{t=1}^T r(s_t, a_t)\bigg]\]
<p>(There’s a discounting factor $\gamma$ that’s supposed to be inside the expectation, but I’m going to ignore it for a while since it’ll just clutter up the derivation. I’ll add it back in later!)</p>
<p>In other words, the score of a policy $\pi_\theta$ is the expectation of the total reward for a sequence of states and actions under that policy, i.e., the states and actions sampled from the policy. We’ll want to maximize this function because we want to achieve the maximum expected total reward under some policy $\pi_\theta$.</p>
<p>For conciseness, we denote $r(\tau)=\sum_{t=1}^T r(s_t, a_t)$ and call $\tau=(s_1, a_1, \dots, s_T, a_T)$ the <strong>trajectory</strong>. In practice, we can either keep taking actions until the game itself terminates (<strong>infinite horizon</strong>) or we can set a maximum number of steps (<strong>finite horizon</strong>).</p>
<p>Substituting $r(\tau)$ into our expectation, we get the following policy quality function.</p>
\[J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)} [r(\tau)]\]
<p>Intuitively, maximizing this function increases the likelihoods of good actions and decreases the likelihood of bad actions, based on the trajectories of the episodes.</p>
<p><img src="/images/deep-rl-policy-methods/trajectory.svg" alt="Training the Policy" title="Training the Policy" /></p>
<p><small>Trajectories are lines on the state-action space where the ellipses denote different regions of rewards. (A darker shade means a higher reward.) We want to maximize the trajectories that lead to high expected rewards and decrease the likelihood of trajectories that lead to low expected rewards.</small></p>
<p>The fact that we wrote the policy quality function as an expectation is very important: we can use Monte Carlo methods to estimate the value! In this particular application of Monte Carlo estimation, we take many samples of states and actions from the policy and average the total reward to approximate the expectation. (Whenever you see “Monte Carlo methods”, just think of it as counting-and-dividing.)</p>
\[J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)} \bigg[\sum_{t=1}^T r(s_t, a_t)\bigg]\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T r(s_t^{(i)}, a_t^{(i)})\]
<p>Now that we have a policy quality function we want to maximize, it’s time to take the gradient!</p>
\[\nabla_\theta J(\theta) = \nabla_\theta\mathbb{E}_{\tau\sim\pi_\theta(\tau)} [r(\tau)]\]
<p>We can expand the expectation into an integral using the definition of expectation for a continuous variable $\tau$.</p>
\[\nabla_\theta J(\theta) = \nabla_\theta\int \underbrace{\pi_\theta(\tau)}_\text{probability} \underbrace{r(\tau)}_{\substack{\text{random} \\ \text{variable}}} \mathrm{d}\tau\]
<p>Then, we can move the gradient operator into the integral since we’re taking the gradient with respect to the parameters $\theta$, not the trajectory $\tau$.</p>
\[\nabla_\theta J(\theta) = \int \nabla_\theta\pi_\theta(\tau) r(\tau) \mathrm{d}\tau\]
<p>We need to convert this equation back into an expectation somehow so we can use Monte Carlo estimation to approximate it. Fortunately, we can use the following logarithm identity (read backwards) to accomplish this.</p>
\[f(x) \nabla \log f(x) = f(x)\frac{\nabla f(x)}{f(x)} = \nabla f(x)\]
<p>Now we just replace $f(x)$ with $\pi_\theta(\tau)$ and use the identity.</p>
\[\int \nabla_\theta\pi_\theta(\tau) r(\tau) \mathrm{d}\tau=\int \pi_\theta(\tau)\nabla_\theta\log\pi_\theta(\tau) r(\tau) \mathrm{d}\tau\]
<p>Now we can convert this back into an expectation!</p>
\[\int \underbrace{\pi_\theta(\tau)}_\text{probability} \underbrace{\nabla_\theta\log\pi_\theta(\tau) r(\tau)}_\text{random variable} \mathrm{d}\tau = \mathbb{E}_{\tau\sim\pi_\theta(\tau)} [\nabla_\theta\log\pi_\theta(\tau)r(\tau)]\]
<p>This gives us the gradient of our policy quality function!</p>
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)} [\nabla_\theta\log\pi_\theta(\tau)r(\tau)]\]
<p>The last thing that remains is applying the gradient to $\log\pi_\theta(\tau)$ so that we can write $\nabla_\theta J(\theta)$ in terms of $\pi_\theta(a\vert s)$, i.e., our policy network! Taking the gradient of this just means using backpropagation on our policy network!</p>
<p>According to our notation substitution, we can replace $\tau$ with $(s_1, a_1, \dots, s_T, a_T)$</p>
\[\pi_\theta(\tau) = \pi_\theta(s_1, a_1, \dots, s_T, a_T)\]
<p>Now we need to write $\pi_\theta(s_1, a_1, \dots, s_T, a_T)$ in terms of $\pi_\theta(a\vert s)$. Intuitively, $\pi_\theta(s_1, a_1, \dots, s_T, a_T)$ represents the likelihood of this trajectory so we can expand it using probability theory.</p>
\[\pi_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1)\prod_{t=1}^T\pi_\theta(a_t\vert s_t)p(s_{t+1}\vert s_t, a_t)\]
<p>In words, we’re expanding the probability of observing a particular trajectory $s_1, a_1, \dots, s_T, a_T$. The first term is the probability of the starting state $s_1$. The product operator computes the overall probability of all of the transitions. To transition to a new state $s_{t+1}$, we need to take an action in the current state, but, since our policy is stochastic instead of deterministic, we use the policy $\pi_\theta(a_t\vert s_t)$ to give us the <em>probability</em> of taking action $a_t$. Now that we have a representation of the action, we can use the transition function $p(s_{t+1}\vert s_t, a_t)$ to compute the probability of the next state. Combining all of these together, we get the overall probability of observing the trajectory.</p>
<p>For numerical stability, i.e., to prevent underflow, we take the logarithm of both sides, turning the product into a sum of logs.</p>
\[\log\pi_\theta(s_1, a_1, \dots, s_T, a_T) = \log p(s_1)+\sum_{t=1}^T\bigg[\log\pi_\theta(a_t\vert s_t)+\log p(s_{t+1}\vert s_t, a_t)\bigg]\]
<p>Finally, we can take the gradient of both sides. Notice that the first and last terms on the right-hand side are not a function of $\theta$ so they can be removed.</p>
\[\require{cancel}\]
\[\nabla_\theta\log\pi_\theta(s_1, a_1, \dots, s_T, a_T) = \nabla_\theta\bigg(\cancel{\log p(s_1)}+\sum_{t=1}^T\bigg[\log\pi_\theta(a_t\vert s_t)+\cancel{\log p(s_{t+1}\vert s_t, a_t)}\bigg]\bigg)\]
<p>Then we can move the gradient operator inside of the sum, and we’re left with the following.</p>
\[\nabla_\theta\log\pi_\theta(s_1, a_1, \dots, s_T, a_T) = \sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t\vert s_t)\]
<p>Notice that we’ve effectively arrived at a maximum likelihood estimate using log likelihoods! This is because $\pi_\theta(s_1, a_1, \dots, s_T, a_T)$ just computes the likelihood of the trajectory $s_1, a_1, \dots, s_T, a_T$.</p>
<p>Now we’ve represented $\nabla_\theta\log\pi_\theta(\tau)$ in terms of $\pi_\theta(a_t\vert s_t)$ and we can finish writing the full policy gradient! Recall the gradient of $J(\theta)$.</p>
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)} [\nabla_\theta\log\pi_\theta(\tau)r(\tau)]\]
<p>Fully expanded, we arrive at the following.</p>
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\bigg[ \bigg( \sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t\vert s_t) \bigg) \bigg( \sum_{t=1}^T r(s_t, a_t) \bigg) \bigg]\]
<p>Let me take a second and explain this intuitively in its two sums.</p>
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\bigg[ \underbrace{\bigg( \sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t\vert s_t) \bigg)}_{\text{maximum log likelihood}} \underbrace{\bigg( \sum_{t=1}^T r(s_t, a_t) \bigg)}_{\text{reward for this episode}} \bigg]\]
<p>By multiplying the likelihood of a trajectory with its reward, we encourage our agent to increase the probability of a good trajectory if the reward is high, and, discourage it if the reward is low.</p>
<p>One other thing we can fold in is the discount factor $\gamma$. We need to discount rewards backwards in time, similar to computing the value of a state.</p>
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\bigg[ \bigg( \sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t\vert s_t) \bigg) \bigg( \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t) \bigg) \bigg]\]
<p><br /></p>
<p>Now that we have the gradient of the quality function, we can write the algorithm, called the <strong>REINFORCE algorithm</strong>, to maximize this function and train our policy network!</p>
<ol>
<li>
<p>Sample a trajectory ${\tau^{(i)}}$ from $\pi_\theta(a\vert s)$. In other words, run the policy and record the values of the reward function $r(s_t^{(i)}, a_t^{(i)})$ and log probabilities $\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})$.</p>
</li>
<li>
<p>Compute the policy gradient by averaging across the trajectory. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i\bigg[ \bigg( \sum_t\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) \bigg) \bigg( \sum_t \gamma^{t-1} r(s_t^{(i)}, a_t^{(i)}) \bigg) \bigg]$</p>
</li>
<li>
<p>Update the parameters (where $\alpha$ is the learning rate). $\theta\gets\theta + \alpha\nabla_\theta J(\theta)$</p>
</li>
</ol>
<p>(Notice that we’re using gradient <em>ascent</em> instead of <em>descent</em> here since the goal is to <em>maximize</em> our score function!)</p>
<p><img src="/images/deep-rl-policy-methods/training-loop.svg" alt="Policy Gradient Training Loop" title="Policy Gradient Training Loop" /></p>
<p><small>To collect training data, we sample trajectories from our policy, i.e., run the policy network for a game, and collect the rewards. We use the rewards to compute our “ground truth” values and fit a model to them, now that we have data and “labels”. Finally, we update our model’s parameters using gradient descent.</small></p>
<p>We can draw some analogies between reinforcement learning and supervised learning. In a sense, reinforcement learning is just supervised learning where the training data are sampled and computed from the policy. We batch these data and train our policy network on them, just like in a supervised learning algorithm. We don’t really have “ground truth” labels, so we compute them using the discounted reward.</p>
<h3 id="practical-implementation">Practical Implementation</h3>
<p>Now that I’ve laid down the foundations of vanilla policy gradients, there are a few implementation details to actually get them working with automatic differentiation (autodiff), used in Pytorch and Tensorflow.</p>
<p>Going step-by-step from the original algorithm, the first thing we need to do is collect trajectories to compute the loss. Autodiff will handle computing the gradient and updating the model parameters. Notice that our loss function really just needs the log probabilities of the actions and the corresponding reward so those are the only two things we need to store when collecting trajectories.</p>
\[J(\theta) \approx \frac{1}{N}\sum_i\bigg[ \bigg( \sum_t\log\underbrace{\pi_\theta(a_t^{(i)}\vert s_t^{(i)})}_{\text{from network}} \bigg) \bigg( \sum_t \gamma^{t-1}\underbrace{r(s_t^{(i)}, a_t^{(i)})}_{\text{from env}} \bigg) \bigg]\]
<p>Since we’re using Monte Carlo methods, we can collect these log probabilities and rewards and aggregate them by averaging. The next step is fitting the model, which autodiff can do for us if we provide it the loss value.</p>
<p>Additionally, we normalize, i.e., subtract the mean and divide by the standard deviation, the temporally discounted rewards to help reduce variance and increase stability. There are formal proofs that show why this works, but I’ve omitted them since they don’t add that much to the discussion and normalizing is a fairly standard practice in the community.</p>
<p>The final, minor point is that many optimization algorithms in software libraries are going to perform some kind of <em>descent</em> in our parameter space, however, we want to <em>maximize</em> our policy quality function. The easy fix to turn our policy quality function into a loss function is to multiply the quality by $-1$! We’ll simply sum up the negative log probabilities, scaled by the normalized rewards, and perform gradient <em>descent</em> on the parameters using that loss value.</p>
<p>A more detailed REINFORCE algorithm looks like this.</p>
<ol>
<li>For each episode do
<ol>
<li>For each step do
<ol>
<li>Sample the action $a\sim\pi_\theta(s)$ and store $\log\pi_\theta(a\vert s)$</li>
<li>Execute the action $a$ to receive a reward $r$ and store it.</li>
<li>If the episode is not done, skip to the next iteration.</li>
<li>Compute the discounted rewards using the discount factor: $R_t = \sum_{t'=0}^t \gamma^{t'} r_{T-t'}$, i.e., $[r_T, r_T+\gamma r_{T-1}, r_T+\gamma r_{T-1}+\gamma^2 r_{T-2}, \dots]$.</li>
<li>Normalize the rewards by subtracting the mean reward $\mu_R$ and dividing by the standard deviation $\sigma_R$ (and adding in an epsilon factor $\epsilon$ to prevent dividing by zero): $R_t\gets\frac{R_t-\mu_R}{\sigma_R+\epsilon}$.</li>
<li>Multiply the negative log probabilities with their respective discounted rewards and sum them all up to get the loss: $L(\theta) = \sum_t -\log\pi_\theta(a_t\vert s_t)\cdot R_t$.</li>
<li>Backpropagate $L(\theta)$ through the policy network.</li>
<li>Update the policy network’s parameters.</li>
</ol>
</li>
<li>End for</li>
</ol>
</li>
<li>End for</li>
</ol>
<p><br /></p>
<p>One advantage of policy-based methods is that they can learn stochastic policies while value-based methods can only learn deterministic policies. There are many scenarios where learning a stochastic policy is better. For example, consider the game rock-paper-scissors. A deterministic policy, e.g., playing only rock, can be easily exploited, so a stochastic policy tends to perform much better.</p>
<p>A related issue stochastic policies solve is called <strong>perceptual aliasing</strong>.</p>
<p><img src="/images/deep-rl-policy-methods/perceptual-aliasing.svg" alt="Perceptual Aliasing" title="Perceptual Aliasing" /></p>
<p><small>Perceptual aliasing is when our agent isn’t able to differentiate the best action to take in a similar-looking state. The dark grey squares are identical states in the eyes of our agent, and a deterministic agent would take the same action in both states.</small></p>
<p>Suppose our agent’s goal is to get to the treasure while avoiding the fire pits. The two dark grey states are perceptually aliased; in other words, the agent can’t differentiate the two because they both look identical. In the case of a deterministic policy, the agent would perform the same action for both of those states and never get to the treasure. The only hope is the random action selected by the epsilon-greedy exploration technique. However, a stochastic policy could move either left or right, giving it a higher likelihood of reaching the treasure.</p>
<p>As with Markov Decision Processes, one disadvantage of policy-based methods is that they generally take longer to converge and evaluating the policy is time-consuming. Another disadvantage is that they tend to converge to local optima rather than the global optimum.</p>
<p>Regardless of these pitfalls, policy gradients tend to perform better than value-based reinforcement learning agents at complex tasks. In fact, many of the advancements in reinforcement learning beating humans at complex games such as DOTA use techniques based on policy gradients as we’ll see shortly.</p>
<h1 id="advantage-actor-critic-a2c">Advantage Actor-Critic (A2C)</h1>
<p>Advantage Actor-Critic (A2C) is a hybrid architecture that merges policy-based and value-based learning into a single approach.</p>
<p>As the name implies, there are two components: an actor (a policy-based agent) and a critic (a value-based agent). The actor gives us the action distribution, i.e., the stochastic policy $\pi_\theta(a\vert s)$, to select the action in the environment, and the critic tells us how good that action was, i.e., the value $V(s)$. As we train, the actor learns to take better actions from critic feedback, and the critic learns to provide better feedback.</p>
<p><img src="/images/deep-rl-policy-methods/a2c.svg" alt="Advantage Actor-Critic" title="Advantage Actor Critic" /></p>
<p><small>Advantage Actor-Critic (A2C) uses the same base network with two output heads: one that produces a probability distribution over the actions for the current state and the other that computes the value of that state.</small></p>
<p>A2C solves a major issue with policy gradients: the reward is only computed at the end of the horizon, so every action in the episode is judged by the same aggregate. We could have a sequence of actions where all of our actions produced small positive rewards, except one action that produced a very negative reward. Because of this averaging, the action with the awful result is effectively hidden among the actions with the better results.</p>
<p>Instead, A2C incrementally updates the parameters at some fixed interval during the episode. To get this working, however, our policy gradient needs some changes.</p>
<p>Recall our policy gradient from the previous section.</p>
\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i\bigg[ \bigg( \sum_t\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) \bigg) \bigg( \sum_t \gamma^{t-1} r(s_t^{(i)}, a_t^{(i)}) \bigg) \bigg]\]
<p>The reward function computes the reward for the <em>entire</em> episode. If we’re incrementally updating our parameters inside of an episode, we can no longer use the reward function here. Instead, we can replace the reward function with the q-value function.</p>
\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) Q(s_t^{(i)}, a_t^{(i)})\]
<p>Recall that the q-value tells us the expected reward for taking an action $a$ in a state $s$. This gives us a quality metric inside of the episode without having to wait for the end of the episode for the reward function. However, our critic only computes the value $V(s)$, not the q-value. But recall that there’s a relation between the q-value of a state and the value at the state: $Q(s, a) = r + \gamma V(s')$ where $s'$ is the state we end up in after taking action $a$ in state $s$ and $r$ is the reward for taking action $a$ in state $s$.</p>
<p>Instead of using the q-value directly, we can go a step further using the advantage function. We learned with DDQNs that the advantage function helps improve learning by reducing variance. So instead of the q-value function, let’s use the advantage function in the policy gradient.</p>
\[\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \sum_t\nabla_\theta\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) A(s_t^{(i)}, a_t^{(i)})\]
<p>where $A(s, a) = Q(s, a) - V(s)$.</p>
<p>Intuitively, $Q(s, a)$ tells us the reward for taking action $a$ in state $s$, and $V(s)$ tells us the total reward from being in state $s$. Hence, $A(s, a)$ tells us how much better taking action $a$ in state $s$ would be than the average. If $A(s, a) > 0$, then our action does better than the average value of the state; if $A(s, a) < 0$, then our action does worse than the average value of the state. By multiplying our policy network gradient by the advantage, we push our parameters to make good actions, i.e., actions such that $A(s, a) > 0$, more probable and bad actions, i.e., actions such that $A(s, a) < 0$, less probable.</p>
<p>The advantage is particularly useful when we have a state where all actions produce negative rewards. Since the advantage function computes the value <em>relative</em> to $V(s)$, we can still select the best possible action in that bad state. The advantage function allows us to make the best of a bad situation, so to speak.</p>
<p>Finally, we can write an expression for the loss of the actor head, i.e., the policy head (and notice it looks very similar to vanilla policy gradients with the exception of the advantage function).</p>
\[L^\text{A}(\theta) = \frac{1}{N}\sum_i \sum_t\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)}) A(s_t^{(i)}, a_t^{(i)})\]
<p>where</p>
\[A(s_t^{(i)}, a_t^{(i)}) = r(s_t^{(i)}, a_t^{(i)}) + \gamma V(s_{t+1}^{(i)}) - V(s_t^{(i)})\]
<p>Now let’s look at the loss of the critic head of the actor-critic architecture. If we trained it using the same loss as the DQN, we’d need two functions: the q-value function and the value function. However, we can make an improvement by only fitting the value function $V(s)$ if we notice that the q-value, in our case, is really just a function of the reward and value functions. Since we’re already using the advantage function, we can estimate it using the temporal difference error.</p>
\[A(s_t, a_t) = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)\]
<p>So our loss for the critic is simply the temporal difference error itself.</p>
\[L^\text{C}(\theta) = \frac{1}{N}\sum_i \sum_t \mathcal{L}_1\Big( A(s_t^{(i)}, a_t^{(i)})\Big)\]
<p>where $\mathcal{L}_1$ is the Huber loss. (This is the same loss we used when discussing DQNs. Read <a href="https://mohitd.github.io/2018/12/23/deep-rl-value-methods.html">this</a>, specifically the section Reward Clipping and Huber Loss, if you need a refresher as to why it’s an improvement over squared error or quadratic loss.)</p>
<p>One final component to the A2C architecture is exploration. For DQNs, we used an epsilon-greedy approach to select our action. But with a stochastic policy, we sample from the action distribution so we can’t follow the same approach. Instead, we embed the notion of exploration into our loss function by adding an entropy penalty.</p>
<p><img src="/images/deep-rl-policy-methods/entropy.svg" alt="Entropy" title="Entropy" /></p>
<p><small>Low entropy action distributions have one action that is much more likely than the others while high entropy action distributions spread the probability nearly evenly across all actions.</small></p>
<p>The idea is that we want to penalize low entropy action distributions because this means we’re very likely to select the same action over and over again. Instead, we’d prefer higher entropy action distributions (think back to perceptual aliasing). We can compute the entropy of the action distribution using the definition of entropy:</p>
\[S = -\sum_i P_i\log P_i\]
<p>When applying it to our action distribution, we get the following.</p>
\[L^\text{S}(\theta) = \sum_a \pi_\theta(a\vert s)\log\pi_\theta(a\vert s)\]
<p>We’re omitting the negative sign at the beginning since we’ll be incorporating it into the combined loss function.</p>
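<p>As a quick sanity check of the entropy term, here’s the definition in plain Python with two made-up action distributions:</p>

```python
import math

def entropy(probs):
    """S = -sum_i p_i log p_i for a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic, low entropy
uniform = [0.25, 0.25, 0.25, 0.25]  # probability spread evenly, high entropy

low = entropy(peaked)
high = entropy(uniform)  # equals log(4), the maximum for 4 actions
```

<p>Penalizing the low entropy distribution more heavily keeps the policy from collapsing onto a single action too early.</p>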
<p>Now we can put all of the pieces together into one expression. Note that, with the signs as written, this combined quantity is <em>maximized</em>: maximizing $-L^\text{C}(\theta)$ drives the critic’s error down, and maximizing $-L^\text{S}(\theta)$ drives the entropy up. In practice, we minimize its negative as the loss.</p>
\[L(\theta) = L^\text{A}(\theta) - L^\text{C}(\theta) - L^\text{S}(\theta)\]
<p>Now let’s put all of the pieces together into the online A2C algorithm.</p>
<ol>
<li>For each episode do
<ol>
<li>For each step do
<ol>
<li>Sample the action $a\sim\pi_\theta(s)$ and store $\log\pi_\theta(a\vert s)$</li>
<li>Execute the action $a$ to receive a reward $r$ and store it.</li>
<li>If the episode is not done, skip to the next iteration.</li>
<li>Compute the list of discounted rewards using the discount factor: $R_t = \sum_{t'=0}^t \gamma^{t'} r_{T-t'}$, i.e., $[r_T, r_T+\gamma r_{T-1}, r_T+\gamma r_{T-1}+\gamma^2 r_{T-2}, \dots]$.</li>
<li>Normalize the rewards by subtracting the mean $\mu_R$ and dividing by the standard deviation $\sigma_R$ (and adding in an epsilon factor $\epsilon$ to prevent dividing by zero): $R_t\gets\frac{R_t-\mu_R}{\sigma_R+\epsilon}$.</li>
<li>Compute the advantages using the target values $R_t$ and critic head $V(s_t)$: $A(s, a) = R_t - V(s_t)$.</li>
<li>Compute the actor/policy loss by multiplying the negative log probabilities with their respective advantages and average: $L^\text{A}(\theta) = \frac{1}{N}\sum_i\sum_t -\log\pi_\theta(a_t^{(i)}\vert s_t^{(i)})\cdot A(s_t^{(i)}, a_t^{(i)})$.</li>
<li>Compute the critic/value loss by taking the average of the Huber loss of the advantages: $L^\text{C}(\theta) = \frac{1}{N}\sum_i \sum_t \mathcal{L}_1(A(s_t^{(i)}, a_t^{(i)}))$</li>
<li>Compute the entropy penalty and average: $L^\text{S}(\theta) = \sum_a \pi_\theta(a\vert s)\cdot\log\pi_\theta(a\vert s)$</li>
<li>Combine the losses (the negative signs are already folded into $L^\text{A}(\theta)$ and $L^\text{S}(\theta)$ here, so we simply sum the terms and minimize): $L(\theta) = L^\text{A}(\theta) + L^\text{C}(\theta) + L^\text{S}(\theta)$</li>
<li>Backpropagate $L(\theta)$ through the policy network.</li>
<li>Update the policy network’s parameters.</li>
</ol>
</li>
<li>End for</li>
</ol>
</li>
<li>End for</li>
</ol>
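<p>Steps 1.6 through 1.10 of the algorithm above could be sketched in PyTorch like this. The negative signs are folded into each term so the total can be minimized directly; <code>beta</code> is an entropy weight I’ve introduced for illustration, and the tensors at the bottom stand in for quantities collected during an episode.</p>

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, probs, beta=0.01):
    """Combined actor, critic, and entropy losses for one episode.

    log_probs: log pi(a_t | s_t) for the actions actually taken, shape (T,)
    values:    critic outputs V(s_t), shape (T,)
    returns:   normalized discounted rewards R_t, shape (T,)
    probs:     full action distributions pi(. | s_t), shape (T, n_actions)
    """
    # Advantage A(s, a) = R_t - V(s_t); detached so the actor loss does not
    # backpropagate through the critic head
    advantages = returns - values
    actor_loss = -(log_probs * advantages.detach()).mean()
    critic_loss = F.smooth_l1_loss(values, returns)              # Huber loss
    entropy_loss = (probs * torch.log(probs)).sum(dim=1).mean()  # negative entropy
    return actor_loss + critic_loss + beta * entropy_loss

T, n_actions = 5, 2
probs = torch.full((T, n_actions), 0.5)
loss = a2c_loss(torch.randn(T), torch.randn(T), torch.randn(T), probs)
```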
<p>We can take this architecture and asynchronize and parallelize it into an Asynchronous Advantage Actor-Critic (A3C) network by running independent agents on separate worker threads and asynchronously updating the parameters of the actor and critic networks. This can run into issues because some agents might be using outdated parameters of both networks.</p>
<p>A synchronous, i.e., A2C, architecture can also be parallelized, but we would wait for each agent to finish playing before averaging each agent’s gradient and updating the networks’ parameters. This ensures that, on the next round, each agent has the latest parameters.</p>
<h1 id="proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</h1>
<p><a href="https://arxiv.org/pdf/1707.06347.pdf">Proximal Policy Optimization (PPO)</a> is a style of policy algorithm that has shown incredible results in very complicated games such as DOTA 2, even beating out expert gamers.</p>
<p>To understand PPO, let’s go back to policy gradients. Recall the loss of the policy gradient.</p>
\[L^{\text{PG}}(\theta) = \mathbb{E}_t [\log\pi_\theta(a_t\vert s_t) A_t]\]
<p>where $A_t$ is the estimator for the advantage function. (The subscript $t$ is functionally the same as the $\tau\sim\pi_\theta(\tau)$ subscript for policy gradients we used earlier.)</p>
<p>Policy gradients increase the likelihood of actions that produced higher rewards and decrease those that produced lower rewards. This loss function had its own set of issues, namely slow rates of convergence and high variance in training, causing the policy to potentially change drastically between two updates.</p>
<p>The last problem is a very undesirable property of policy gradients so <a href="https://arxiv.org/pdf/1502.05477.pdf">Trust Region Policy Optimization (TRPO)</a> aimed to mitigate this by using a surrogate loss function and adding an additional constraint to the optimization problem.</p>
<p>Instead of using the log likelihood, the surrogate function uses the ratio of the action probabilities using current and old parameters.</p>
\[r_t(\theta) = \frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)}\]
<p>(This ratio function is different from the reward function $r(\tau)$ we considered when deriving policy gradients.)</p>
<p>If $r_t(\theta) > 1$, then $a_t$ is more likely for the current parameters $\theta$ than it was for the old parameters $\theta_\text{old}$. On the other hand, $0 < r_t(\theta) \leq 1$ if $a_t$ is less likely (or equally likely) using the current parameters as opposed to the old parameters. We call this ratio function a <em>surrogate</em> function because it stands in for the log likelihood function in vanilla policy gradients.</p>
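<p>In implementations, the ratio is usually computed from stored log probabilities as $\exp(\log\pi_\theta - \log\pi_{\theta_\text{old}})$, which avoids dividing small probabilities. A minimal sketch with made-up probabilities:</p>

```python
import torch

def prob_ratio(log_probs_new, log_probs_old):
    """r_t(theta) = pi_new / pi_old, computed from log probabilities."""
    return torch.exp(log_probs_new - log_probs_old)

old = torch.log(torch.tensor([0.25, 0.50]))
new = torch.log(torch.tensor([0.50, 0.25]))
r = prob_ratio(new, old)  # first action became more likely, second less likely
```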
<p>Now we can replace the log likelihood with this new metric in the loss function.</p>
\[L^{\text{TRPO}}(\theta) = \mathbb{E}_t \bigg[\frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)} A_t\bigg] = \mathbb{E}_t \big[ r_t(\theta) A_t\big]\]
<p>However, we need to add an additional constraint to prevent the policy from drastically changing, as it could do with vanilla policy gradients. The original TRPO paper suggests using Kullback-Leibler divergence (KL divergence) as a hard constraint:</p>
\[\text{maximize}_\theta~~~~\mathbb{E}_t \bigg[\frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)} A_t\bigg]\\\\
\text{subject to}~~~~\mathbb{E}_t \big[ \text{KL}(\pi_{\theta_\text{old}}(\cdot\vert s_t), \pi_\theta(\cdot\vert s_t))\big] \leq \delta\]
<p>where $\delta$ is some maximum threshold.</p>
<p>If you’re unfamiliar, the KL divergence is a metric that measures the difference between two probability distributions $P$ and $Q$.</p>
\[\text{KL}(P,Q) = \sum_i P_i\log\frac{P_i}{Q_i}\]
<p>Intuitively, this new constraint says that the difference in the action distributions of the new and old policies can’t exceed a maximum value $\delta$.</p>
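<p>A direct translation of the KL divergence formula into plain Python, with made-up action distributions:</p>

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) = sum_i P_i log(P_i / Q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
kl_same = kl_divergence(p, p)              # identical policies
kl_diff = kl_divergence(p, [0.1, 0.2, 0.7])  # policies that drifted apart
```

<p>Identical policies give a divergence of zero; the further the new action distribution drifts from the old one, the larger the divergence, which is exactly what the constraint bounds by $\delta$.</p>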
<p>Instead of a hard constraint, we can actually include it in the objective function as a penalty with a hyperparameter $\beta$ controlling the strength.</p>
\[\mathbb{E}_t \bigg[\frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)} A_t - \beta~\text{KL}(\pi_{\theta_\text{old}}(\cdot\vert s_t),\pi_\theta(\cdot\vert s_t))\bigg]\]
<p>Incorporating KL divergence either way helps discourage large changes in the policy for each update.</p>
<p><img src="/images/deep-rl-policy-methods/trust-region.svg" alt="Trust Region" title="Trust Region" /></p>
<p><small>The trust region (blue) is centered around a point in the state-action space. We strongly discourage our policy update from moving outside of this region to prevent large changes in the policy.</small></p>
<p>However, computing the KL divergence is expensive, especially for large action distributions! Instead, we can embed the notion of smaller policy updates into the objective function itself, which is exactly what PPO does. PPO introduces this strange, but effective and efficient, objective function.</p>
\[\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \bigg[ \min\Big(\underbrace{r_t(\theta)A_t}_\text{TRPO}, \underbrace{\text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)A_t}_\text{clipped surrogate}\Big)\bigg]\]
<p>where $\varepsilon$ is a hyperparameter that controls the range of the ratio function. ($\varepsilon=0.2$ in the PPO paper)</p>
<p>In this new objective function, the minimum is taken between the original TRPO objective and this new clipped term. By clipping the surrogate function to $[1-\varepsilon, 1+\varepsilon]$, we discourage it from significantly deviating from 1. In other words, we discourage updates where there are large differences between the new and old policies. This clipping is functionally similar to TRPO’s KL divergence except it’s much easier to compute!</p>
<p>Finally, we take the minimum of the clipped and unclipped objectives, which gives us the lower bound of the objective. It’ll become apparent why we take the minimum in just a moment.</p>
<p>With this new objective function, there are two cases to consider here: $A > 0$ and $A < 0$.</p>
<p><img src="/images/deep-rl-policy-methods/clipped-objective.png" alt="Clipped Objective Function" title="Clipped Objective Function" /></p>
<p><small>Clipped objective function.</small></p>
<p>Recall that if $A > 0$, then our action was good. In this case, $r(\theta)$ will be clipped if it is too high, i.e., one action is far more probable with this set of parameters compared to the old ones, even though this action is good. This is because we want to avoid taking large jumps in policy and overshooting, so clipping the objective ensures that we don’t take a step that’s too large. On the other hand, $A < 0$ means our action was bad. In this case, we clip $r(\theta)$ so that it doesn’t make the bad action drastically less probable. In both cases, we don’t want to update our policy drastically.</p>
<p>One last bit about this objective function is that $r(\theta)$ is unbounded to the right when $A < 0$. This is the scenario where the action taken was a bad action and it became <em>more</em> probable compared to the last parameter update! This is not good! Fortunately, by leaving it unbounded, the gradient step will move in the <em>opposite direction</em> because the value of the objective function is negative. Furthermore, it will take a step <em>proportional</em> to how bad this new action is. Intuitively, this allows us to <em>correct</em> our actions when we make a big mistake.</p>
<p>Here is where we get to the min function in the objective function. The min function enables us to take that corrective step backwards. When it is invoked, the unclipped value of $r(\theta)A$ will be returned, thus allowing our agent to take that corrective step.</p>
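<p>Putting the clipping and the min together, the clipped objective could be sketched in PyTorch as follows (negated so it can be minimized with gradient descent; <code>ratio</code> and <code>advantages</code> stand in for tensors computed from a rollout):</p>

```python
import torch

def ppo_clip_loss(ratio, advantages, eps=0.2):
    """Negative of L^CLIP so it can be minimized with gradient descent."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # The min picks the lower bound of the two surrogates
    return -torch.min(unclipped, clipped).mean()

# A bad action (A < 0) that became much more likely: the min picks the
# unclipped term, so the corrective step is proportional to how bad it was
ratio = torch.tensor([3.0])
adv = torch.tensor([-1.0])
loss = ppo_clip_loss(ratio, adv)
```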
<p>If you read the paper, it’ll go on to introduce an <strong>Adaptive KL Penalty Coefficient</strong> that adjusts the value of $\beta$ in the full KL-penalized TRPO objective, as well as reference the same entropy penalty used in A2C. However, the essence of PPO is really embodied in $L^{\text{CLIP}}(\theta)$.</p>
<p>Since each PPO step makes smaller changes to the policy, we can actually train several epochs on the same minibatch of data! This consideration is factored into the new PPO algorithm:</p>
<ol>
<li>For each iteration do
<ol>
<li>For each actor $1,\dots,N$ do
<ol>
<li>Run $\pi_{\theta_\text{old}}$ for $T$ time steps</li>
<li>Average the advantage estimate</li>
</ol>
</li>
<li>Optimize surrogate $L(\theta)$ with $K$ epochs and a minibatch size of $M \leq NT$.</li>
<li>$\theta_\text{old}\gets\theta$</li>
</ol>
</li>
</ol>
<p>This is in the style of A3C so we have asynchronous actors, but we could use a synchronous implementation. Notice that we optimize the surrogate function for several epochs on the same minibatch of data. The hyperparameters suggested in the paper are $K=[3,15]$, $M=[64,4096]$, and $T=[128,2048]$.</p>
<p>Proximal Policy Optimization is the culmination of policy-based methods, at the time of this post at least. <a href="https://blog.openai.com/openai-five/">OpenAI Five</a> has won games against human players in DOTA 2 where the horizon of moves is 20,000, the action space is ~170,000, and the observation space uses 20,000 unique floating-point numbers. All of this was accompanied by 256 P100 GPUs and 128,000 CPUs.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Policy-based or hybrid value-policy techniques seem to be consistently outperforming value-based methods on complicated tasks. Policy gradients look at stochastic policies and directly update them rather than computing them from values. Advantage Actor-critic (A2C) and Asynchronous Advantage Actor-critic (A3C) take hybrid approaches by intertwining a value-based network and policy-based network. Proximal Policy Optimization (PPO) is a class of algorithms that helps mitigate large changes in the policy to stabilize learning and ultimately produce more sophisticated reinforcement learning agents.</p>
<p>I hope this post has shed some light on a few of the policy-based algorithms in use 🙂</p>I discuss state-of-the-art deep RL techniques that use policy-based methods.Deep Reinforcement Learning: Value-based Methods2018-12-23T00:00:00+00:002018-12-23T00:00:00+00:00/deep-rl-value-methods<p>In my <a href="/reinforcement-learning">previous post on reinforcement learning</a>, I explained the formulation of a game and a way to solve it called a Markov Decision Process (MDP). However, MDPs are usable only when we know which transitions lead to which rewards; in many real-world scenarios and games, we don’t have this a priori knowledge. Instead, we can repeatedly play the game over and over again to learn which actions in which states lead to the highest expected reward. This algorithm, called Q-learning, is the basis for reinforcement learning.</p>
<p>However, it’s just the tip of the iceberg: researchers have incorporated neural networks into reinforcement learning to create deep reinforcement learning architectures that are capable of winning against humans at more advanced games such as <a href="https://deepmind.com/research/alphago/">Go</a>, <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">Atari games</a>, and <a href="https://blog.openai.com/openai-five/">DOTA</a>. In this post, we’re going to look at a category of deep reinforcement learning algorithms, called value-based learning, that derive the best policy using q-values.</p>
<p>For a more concrete understanding, i.e., code, I have a <a href="https://github.com/mohitd/dqn">repo</a> where I’ve implemented the DQN models and algorithms of this post in Pytorch.</p>
<h1 id="approximate-q-learning">Approximate Q-learning</h1>
<p><img src="/images/deep-rl-value-methods/pacman.png" alt="Pacman" title="PacMan" /></p>
<p><small>Credit: Pacman Doodle. May 21, 2010. google.com</small></p>
<p>To motivate our discussion of deep reinforcement learning, let’s consider a game that is marginally more complicated than our GridWorld game: PacMan. The premise of the game is similar: maximize the score by eating pellets and avoiding ghosts. The entities are PacMan, the ghosts and their various states (normal, flashing, etc.), and the pellets. Hence, a complete state would include the location of PacMan, the locations of all of the ghosts, the states of all of the ghosts, and the locations of the remaining pellets. With some thought, we might be able to merge some elements of the state space into one entity, but, overall, we have a very large discretized state space!</p>
<p>Performing vanilla Q-learning on this state space will create a massive q-value table with many states, and it’s very unlikely that we’ll reach each state during training. At this point, we may even run into memory issues storing this massive q-value table!</p>
<p>One technique to alleviate this large state space is called <strong>Approximate Q-learning</strong>. Instead of storing state-action information in a large q-value table, we represent the q-value as a weighted sum of feature functions that convert the raw state into values important to our agent. For example, we could write a feature function that returns the number of pellets PacMan has left to eat or the position of the nearest ghost using Manhattan distance. We hand-design a number of these feature functions $f_1(s, a), \dots, f_n(s, a)$ and associate trainable weights $w_1, \dots, w_n$. Our new q-value function looks like the following.</p>
\[Q(s,a) = w_1 f_1(s, a) + \dots + w_n f_n(s, a)\]
<p>In other words, given a state $s$ and action $a$, we compute the q-value by taking the weighted sum of our feature functions.</p>
<p>Intuitively, the weight of a feature function is increased if that feature tends to produce better results and vice-versa for feature functions that lead to poor results. Given a transition $(s, a, r, s’)$, this results in the following update rule for the weights:</p>
\[w_i \gets w_i + \alpha (r + \gamma \max_{a'} Q(s', a') - Q(s, a)) f_i(s, a)\]
<p>Compare this to the update rule for Q-learning:</p>
\[Q(s, a) \gets Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a') - Q(s, a))\]
<p>The first difference is we’re learning the weights in order to compute the q-values; we’re not computing the q-values directly. We multiply the feature function $f_i(s, a)$ to learn which features tend to produce higher scores. Incorporating these feature functions into the weight update allows our agent to learn which features in which states help maximize the score.</p>
<p>Using Approximate Q-learning in place of vanilla Q-learning can help improve our agent’s generalization ability by using a weighted sum of hand-tailored feature functions to represent a state rather than the state itself. The result is we see better performance on games with larger state spaces such as PacMan.</p>
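<p>The Approximate Q-learning update above can be sketched in plain Python. The two feature functions here are made-up stand-ins for hand-designed ones like pellet counts or ghost distances:</p>

```python
def q_value(weights, features, s, a):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f(s, a) for w, f in zip(weights, features))

def update_weights(weights, features, s, a, r, s_next, actions,
                   alpha=0.1, gamma=0.9):
    """w_i <- w_i + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * f_i(s, a)."""
    max_next_q = max(q_value(weights, features, s_next, a2) for a2 in actions)
    td_error = r + gamma * max_next_q - q_value(weights, features, s, a)
    return [w + alpha * td_error * f(s, a) for w, f in zip(weights, features)]

# Toy example: two illustrative feature functions on integer states
features = [lambda s, a: 1.0,           # bias feature
            lambda s, a: float(s + a)]  # some hand-designed state feature
weights = [0.0, 0.0]
weights = update_weights(weights, features, s=1, a=0, r=1.0, s_next=2,
                         actions=[0, 1])
```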
<h1 id="deep-q-networks-dqns">Deep Q-Networks (DQNs)</h1>
<p>Approximate Q-learning has a major flaw: the quality of the agent is dependent on the user-defined feature functions. Writing good feature functions will produce an intelligent agent, but there’s a bit of trial-and-error involved in determining which feature functions those are. However, as with the vision community and its handmade feature detectors for images, we can replace the feature functions with a neural network that will compute the q-value function $Q(s, a)$.</p>
<p>This neural architecture is the premise of deep reinforcement learning and the Deep Q-Network (DQN). The basic design of DQNs is quite simple: the input is a raw image of our game, and the output is a q-value for each action.</p>
<p><img src="/images/deep-rl-value-methods/dqn.jpg" alt="DQN" title="DQN" /></p>
<p><small><em>Human-level control through deep reinforcement learning</em> by Mnih et al.</small></p>
<p>We can replace the feature functions with this neural network and devise the most fundamental DQN training algorithm. Since we’re using a neural network, we train the weights using gradient descent on the temporal difference error (also called the Bellman error).</p>
<p>Here is the fundamental DQN training algorithm.</p>
<ol>
<li>Initialize DQN $Q(x, a)$</li>
<li>For each episode do
<ol>
<li>For each step do
<ol>
<li>In frame $x_t$, with probability $\epsilon$, take a random action, else take best action $a = \arg\max_{a’} Q(x_t, a’)$</li>
<li>Execute action $a$ to receive reward $r$ and next image $x_{t+1}$</li>
<li>If end of game, compute the q-value $y = r$</li>
<li>If not end of game, compute q-value $y = r + \gamma\max_{a’}Q(x_{t+1}, a’)$</li>
<li>Perform a gradient descent step on the quadratic loss of the temporal difference error $(y - Q(x_t, a))^2$</li>
</ol>
</li>
<li>End For</li>
</ol>
</li>
<li>End For</li>
</ol>
<p>(This algorithm takes an online approach, but we’ll soon see how to batch our data when we discuss experience replay.)</p>
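<p>To make the loop concrete, here’s a sketch of a single step in PyTorch. The <code>env.step</code> interface, returning the next frame, reward, and an end-of-game flag, is an assumption in the style of a Gym environment, and the hyperparameters are placeholders:</p>

```python
import random
import torch
import torch.nn.functional as F

def dqn_step(q_net, optimizer, x_t, env, epsilon=0.1, gamma=0.99):
    """One step of the fundamental (online) DQN algorithm.

    `q_net` maps a frame to a vector of q-values; `env.step(a)` is assumed
    to return (next_frame, reward, done)."""
    q_values = q_net(x_t)
    if random.random() < epsilon:
        a = random.randrange(q_values.shape[-1])    # explore: random action
    else:
        a = q_values.argmax(dim=-1).item()          # exploit: best action
    x_next, r, done = env.step(a)
    with torch.no_grad():
        # the target is just the reward at the end of the game;
        # otherwise we bootstrap from the next frame's best q-value
        y = r if done else r + gamma * q_net(x_next).max().item()
    # gradient descent on the quadratic loss of the temporal difference error
    loss = F.mse_loss(q_values[..., a], torch.tensor([y], dtype=torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return x_next, done
```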
<p>As an example, let’s consider the classic Atari game of Breakout.</p>
<p><img src="/images/deep-rl-value-methods/breakout.jpg" alt="Breakout" title="Breakout" /></p>
<p><small>Credit: OpenAI Gym.</small></p>
<p>The goal of the game is to break all of the bricks with the ball without dropping it. Our states are now frames of the game, and our possible actions are move a little to the left, move a little to the right, and do nothing. The reward is simply the score of the game. We feed the frame into the DQN, and it produces q-values for each action. Then we simple select the action with the largest q-value, just like with Q-learning.</p>
<p>While DQNs essentially replace the weighted sum of the feature functions, vanilla DQNs are very difficult to train because they have many convergence issues. To get DQNs to work well in practice, we must make several key modifications to the learning algorithm.</p>
<h2 id="frame-stacking-skipping-and-processing">Frame Stacking, Skipping, and Processing</h2>
<p>The first improvement we can make is to process the input. In particular, we can give our DQN some notion of velocity and motion by stacking several frames together. Think of the Breakout game: if we look at several consecutive frames, there’s the motion of the ball and the paddle. Hence, we can stack several frames into a single tensor and feed that as an input to our network. In the case of Breakout, suppose we wanted to stack 4 frames of size $84\times 84$ together. Then the input to our network would be a 3-tensor of size $84\times 84\times 4$.</p>
<p><img src="/images/deep-rl-value-methods/frame-skipping.svg" alt="Frame Skipping" title="Frame Skipping" /></p>
<p><small>As an example, we take only every fourth frame and stack those together into a single observation input to our DQN. These skipped frames give our DQN a better perception of velocity and motion in our game.</small></p>
<p>However, we don’t quite stack 4 <em>consecutive</em> frames together. Instead, we skip frames. One reason is to prevent our network from latching onto the strong correlations between immediately consecutive states. Another reason is to give our DQN a more useful sense of motion: the ball and paddle barely move between two consecutive frames in a game like Breakout. To better capture the idea of motion and velocity, we skip frames before we stack them.</p>
<p>(To be thorough, there’s an additional step that Google DeepMind used in their <a href="https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf">Nature article</a> on deep reinforcement learning: they took the max between a frame of the input and the skipped frame immediately before it. This step was added because of how Atari renders sprites.)</p>
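<p>Here’s one way to sketch frame skipping and stacking in NumPy. The skip-every-4, stack-4 scheme matches the description above; the max-over-adjacent-frames step from the Nature paper is omitted, and the class name is my own:</p>

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep every `skip`-th processed frame and stack the last
    `stack` kept frames into a single observation."""
    def __init__(self, stack=4, skip=4):
        self.stack, self.skip = stack, skip
        self.kept = deque(maxlen=stack)  # old frames fall off automatically
        self.t = 0

    def push(self, frame):
        # only every `skip`-th frame survives into the stack
        if self.t % self.skip == 0:
            self.kept.append(frame)
        self.t += 1

    def observation(self):
        # early in an episode, pad with copies of the oldest frame
        frames = list(self.kept)
        while len(frames) < self.stack:
            frames.insert(0, frames[0])
        return np.stack(frames, axis=-1)  # e.g. 84x84x4
```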
<h2 id="experience-replay">Experience Replay</h2>
<p>Another improvement on vanilla DQNs is called <strong>experience replay</strong>. Instead of training our network on sequential frames of video, we store all of the state transitions $(x_t, a_t, r_t, x_{t+1})$ into a <strong>replay memory</strong>. Under the hood, this may be implemented as a ring buffer or any data structure with fast sampling and the ability to replace old transitions with newer ones.</p>
<p>After storing transitions into the replay memory, at each step, we randomly sample a mini-batch and compute the q-value for each transition in the mini-batch under the current parameters of our DQN. Then we perform a gradient descent step on the mini-batch with the target q-values and update the DQN’s parameters for the next step.</p>
<p>In practice, we don’t start training the DQN until our replay memory is full so the first few thousand or hundred thousand or million transitions aren’t actually used to train the network: they fill the replay memory by taking random actions. After we have a full replay memory, we can sample mini-batches from it to train our network. As we collect more transitions, we replace the old transitions in the replay memory with these newer ones which is why a cyclical data structure is used for the replay memory.</p>
<p>Why do we need experience replay? It can greatly improve generalization by minimizing correlations in our training data. Sequential frames of the game are highly correlated, and we do not want our network to learn these tight correlations in the sequence of gameplay frames. By storing these transitions and randomly sampling, we force our DQN to make the best decision based only on the stack of frames it receives, with no other a-priori information to help it.</p>
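<p>A replay memory really can be as simple as described: a cyclical buffer with uniform random sampling. A minimal sketch in Python:</p>

```python
import random
from collections import deque

class ReplayMemory:
    """Ring-buffer replay memory: once `capacity` is reached, storing a
    new transition evicts the oldest one."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between
        # consecutive gameplay frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```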
<h2 id="target-network">Target Network</h2>
<p>The final improvement on vanilla DQNs, at least by the original paper, is incorporating a <strong>target network</strong> $\hat{Q}$. We’ll call the original DQN $Q$ the <strong>online network</strong> since we now have two networks. The target network is a copy of the online network and is used to compute the q-values for each transition in the mini-batch. The online network is used to compute the best action, and only its weights are updated via gradient descent. The target network’s weights are set to the online network’s weights at some number of time steps $C$. In other words, the target network’s weights are copied over from the online network every $C$ steps.</p>
<p>The purpose of the target network is to add stability. Consider what happens to the online network as we train: its weights are frequently being changed, and, consequently, the resulting q-values from that network will also frequently change and so on and so on. By using the target network to compute the q-values and updating its weights less frequently, we have more stable q-values, and our online network trains quicker.</p>
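<p>In PyTorch, maintaining the target network comes down to a couple of helpers (a sketch; the function names are my own):</p>

```python
import copy
import torch
import torch.nn as nn

def make_target(online: nn.Module) -> nn.Module:
    """The target network starts as an exact copy of the online network."""
    target = copy.deepcopy(online)
    for p in target.parameters():
        p.requires_grad_(False)  # only the online network is trained
    return target

def sync_target(online: nn.Module, target: nn.Module, step: int, C: int):
    """Every C steps, copy the online weights into the target network."""
    if step % C == 0:
        target.load_state_dict(online.state_dict())
```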
<h2 id="reward-clipping-and-huber-loss">Reward Clipping and Huber Loss</h2>
<p>There are a few minor improvements that we can make to help improve the stability of training for our DQN. One improvement is to clip the rewards to $[-1, 1]$. This prevents extreme outcomes from affecting our DQN, causing the weights to jump drastically.</p>
<p>Similarly, we can clip the gradients to $[-1, 1]$ as well to prevent a weight update from being too large. However, an even better way to do this is to use the Huber Loss instead of the quadratic loss.</p>
\[\mathcal{L}_\delta(t) = \begin{cases}
\frac{1}{2}t^2 & \vert t\vert\leq \delta \\
\delta(\vert t\vert -\frac{1}{2}\delta) & \mathrm{otherwise}
\end{cases}\]
<p>A plot of this function is shown below.</p>
<p><img src="/images/deep-rl-value-methods/huber-loss.png" alt="Huber Loss" title="Huber Loss" /></p>
<p><small>Small inputs act quadratically for the Huber Loss while large inputs act linearly. In either case, the magnitude of the derivative is never greater than 1.</small></p>
<p>(We usually set $\delta=1$.) For small values, Huber Loss is quadratic, and, for large values, Huber Loss acts linearly. Intuitively, for small temporal difference errors, we use the quadratic function, but, for large errors, we use the linear function. Notice that the gradient of the Huber Loss is never greater than one!</p>
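<p>As a quick sanity check of the piecewise definition, here’s the Huber Loss in plain NumPy, with the conventional $\delta=1$ as the default:</p>

```python
import numpy as np

def huber(t, delta=1.0):
    """Huber loss: quadratic for |t| <= delta, linear beyond that."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = delta * (np.abs(t) - 0.5 * delta)
    return np.where(np.abs(t) <= delta, quadratic, linear)
```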
<h2 id="dqn-training-algorithm">DQN Training Algorithm</h2>
<p>After incorporating the frame skipping and processing, experience replay, and target network, the final DQN algorithm becomes the following.</p>
<ol>
<li>Initialize replay memory $M$</li>
<li>Initialize online network $Q$</li>
<li>Initialize target network $\hat{Q}$</li>
<li>For each episode do
<ol>
<li>For each step do
<ol>
<li>Create a sequence $s_t$ from the previous processed frames</li>
<li>With probability $\epsilon$, take a random action, else take best action $a_t = \arg\max_{a’} Q(s_t, a’)$</li>
<li>Execute action $a_t$ to receive reward $r_t$ and next image $x_{t+1}$</li>
<li>Process $x_{t+1}$ and incorporate it into a frame sequence $s_{t+1}$</li>
<li>Store the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay memory $M$</li>
<li>Sample a random mini-batch of transitions $(s_k^{(j)}, a^{(j)}, r^{(j)}, s_{k+1}^{(j)})$ from $M$, where each transition in the mini-batch is indexed by $j$.</li>
<li>Clip all rewards $r^{(j)}\in [-1,1]$</li>
<li>If end of game, set q-value to reward: $y^{(j)} = r^{(j)}$</li>
<li>If not end of game, set q-value using target network: $y^{(j)} = r^{(j)} + \gamma\max_{a’}\hat{Q}(s_{k+1}^{(j)}, a’)$</li>
<li>Perform a gradient descent update of the online network using the Huber Loss of the temporal difference error $\mathcal{L}_1(\mathbf{y} - Q(\mathbf{s_k}, \mathbf{a}))$.</li>
<li>Every $C$ steps, update the target network’s weights $\hat{Q}=Q$</li>
</ol>
</li>
<li>End For</li>
</ol>
</li>
<li>End For</li>
</ol>
<p>This is the same algorithm and approach used by DeepMind in their <a href="https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf">seminal paper</a> on deep reinforcement learning. The implementation of this training loop is in my repo <a href="https://github.com/mohitd/dqn/blob/master/learn.py">here</a>.</p>
<h1 id="double-dqn">Double DQN</h1>
<p>Even with those improvements to the vanilla DQN algorithm, DQNs still have a tendency to <em>overestimate</em> q-values because the same network selects which action to take and assigns the q-value, leading to overconfidence and subsequently overestimation.</p>
<p>Mathematically, the issue lies in the function we use to set the q-value:</p>
\[y = r + \gamma\max_{a'}\hat{Q}(s', a')\]
<p>However, in the beginning of training, our DQN doesn’t know which action is the optimal action, and this causes our DQN to assign high q-values to non-optimal actions. Then, our agent will prefer to take these non-optimal actions because of those high q-values, and training becomes more difficult because our agent needs to “unlearn” those non-optimal actions.</p>
<p>The solution is to split up this q-value estimation task between the online and target networks. The online network $Q$, with the latest set of weights, i.e., the greedy policy, will give us the best action to take given our state $s’$, and the target network $\hat{Q}$ can fairly evaluate this action to compute the q-value. Through this decoupling, we can prevent overoptimism.</p>
<p>The only change we have to make is how we compute the q-value:</p>
\[y = r + \gamma\hat{Q}(s', {\arg\max}_a Q(s', a))\]
<p>The online network computes the best action using a greedy policy while the target network fairly assesses that action when it computes the q-value. This minimal change produces more stable learning and a significant drop in training time.</p>
<p>If this explanation of separating the estimation between the online and target networks isn’t convincing enough, the <a href="https://arxiv.org/pdf/1509.06461.pdf">original paper</a> has proofs that compute overoptimism and its bounds regarding the vanilla DQN algorithm and the double DQN algorithm.</p>
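<p>The change really is minimal in code. Here’s a sketch of computing Double DQN targets for a mini-batch in PyTorch; the networks can be any callables mapping a batch of states to per-action q-values, and the tensor shapes are assumptions:</p>

```python
import torch

@torch.no_grad()
def double_dqn_targets(online, target, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network picks the action,
    the target network scores it."""
    # online network chooses the best next action (greedy policy)
    best_actions = online(next_states).argmax(dim=1, keepdim=True)
    # target network evaluates that action
    next_q = target(next_states).gather(1, best_actions).squeeze(1)
    # no bootstrapping past the end of the game
    return rewards + gamma * next_q * (1.0 - dones.float())
```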
<h1 id="dueling-dqn">Dueling DQN</h1>
<p>The <a href="https://arxiv.org/pdf/1511.06581.pdf">Dueling DQN</a> takes a different approach to the DQN model. First, we define a quantity called <strong>advantage</strong> that relates the value and q-value functions.</p>
\[A(s,a) = Q(s, a) - V(s)\]
<p>Intuitively, the value function tells us how good it is to be in state $s$, i.e., the future expected reward of being in state $s$. The q-value tells us how good it is to take action $a$ in state $s$, i.e., the expected reward of taking action $a$ in state $s$. Hence, the advantage function tells us how much better (or worse) it is to take action $a$ in state $s$ compared to the expected value of being in state $s$ in the first place.</p>
<p>Using this definition of the advantage function, we can rewrite the q-value function as the following.</p>
\[Q(s, a) = V(s) + A(s, a)\]
<p>There are a few interesting equations that follow from our advantage function definition and knowledge of value and q-value functions that will help us better understand the advantage function and the Dueling DQN model.</p>
<p>Suppose we have an optimal action $a^\ast=\arg\max_{a’\in\mathcal{A}}Q(s, a’)$. Then $Q(s, a^\ast) = V(s)$ (for a deterministic policy) because taking the best action in any state is equivalent to that state’s value, i.e., expected reward. By substituting that equivalence into the advantage function, we see that $A(s,a^\ast) = Q(s, a^\ast) - V(s) = 0$. Intuitively, the advantage of our action is 0 because we’re already taking the best possible action so any other action will produce an expected score <em>worse</em> than our value. Keep this in mind because we’ll revisit this notion soon.</p>
<p>With our q-value function written as the sum of the value and advantage function, we can devise an architecture that learns these two terms separately.</p>
<p><img src="/images/deep-rl-value-methods/dueling-dqn.png" alt="Dueling DQN" title="Dueling DQN" /></p>
<p><small><em>Dueling Network Architectures for Deep Reinforcement Learning</em> by Wang et al.</small></p>
<p>Dueling DQNs split our network into two branches: the top branch computes the value of the state, and the bottom branch computes the advantage for each action. Then, the two branches are merged again to produce the q-values.</p>
<p>The immediate question you might ask is “what’s the point of splitting the branches if we’re just going to combine them at the end?” We split the value and advantage so we can learn about good states without necessarily learning about the actions in those states.</p>
<p>The Dueling DQN paper has a good example: imagine a car driving game where the objective is to go as far as we can without crashing. If the road ahead is completely empty, the action we take is largely irrelevant. On the other hand, if there are cars in front, the action we take is critical to the score, i.e., we want to avoid driving into other cars!</p>
<p><img src="/images/deep-rl-value-methods/advantage.png" alt="Value and Advantage Functions" title="Value and Advantage Functions" /></p>
<p><small><em>Dueling Network Architectures for Deep Reinforcement Learning</em> by Wang et al.</small></p>
<p>The saliency maps of the two streams are shown as an orange overlay. Notice that the value stream learns to attend to the road itself, and the advantage stream only pays attention when there are cars immediately in front of our driver.</p>
<p>By splitting the q-value into the value and advantage, we can compute the value for states where the action doesn’t directly impact the score. In the driving game, if there are no cars on the road, our choice of action to move left or right doesn’t affect the score.</p>
<p>The next question you might ask is “how do we merge the value and advantage together?” The straightforward, and incorrect, way to do this would be to simply add them: $Q(s, a) = V(s) + A(s, a)$. But we lose identifiability in this case. In other words, given $Q(s, a)$, we cannot retrieve $V(s)$ and $A(s, a)$ uniquely. This leads to poor agent performance because backpropagation won’t uniquely train the two streams, i.e., the gradient copies evenly to both streams.</p>
<p>(<em>Note:</em> Regarding dimensionality, we copy $V(s)$ into a vector of size $\mathcal{A}$, i.e., the number of actions, so we can add it with the vector $A(s, a)$.)</p>
<p>One technique to add identifiability is to force the advantage function to zero for the given action. We can do this by subtracting the highest advantage for our state $s$ over all possible actions.</p>
\[Q(s, a) = V(s) + \Big( A(s, a) - \max_{a'\in\mathcal{A}} A(s, a') \Big)\]
<p>Now, for the optimal action according to our policy $a^\ast=\arg\max_{a'\in\mathcal{A}}Q(s, a')$, we see that $Q(s, a^\ast)=V(s)$ because recall that $A(s, a^\ast) = 0$! Also, $\max_{a'\in\mathcal{A}} A(s, a') = 0$ for similar reasons: if it didn’t equal zero, that would mean there was an action better than our $a^\ast$, which is impossible because we already said $a^\ast$ was the best action! And since we have our best action, the gradient goes to the value stream to train it!</p>
<p>However, instead of subtracting the max of the advantage over all actions, we subtract the average.</p>
\[Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{\vert\mathcal{A}\vert}\sum_{a'} A(s, a') \Big)\]
<p>In practice, this turns out to be more stable because the mean changes much more gradually than the max operation which can jump drastically if the action changes.</p>
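<p>The whole dueling merge fits in a short module. Here’s a sketch of the two-stream head in PyTorch, where single linear layers stand in for the paper’s fully-connected streams on top of a shared convolutional trunk:</p>

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling head: separate value and advantage streams, merged by
    subtracting the mean advantage for identifiability."""
    def __init__(self, in_features: int, num_actions: int):
        super().__init__()
        self.value = nn.Linear(in_features, 1)            # V(s)
        self.advantage = nn.Linear(in_features, num_actions)  # A(s, a)

    def forward(self, x):
        v = self.value(x)       # (batch, 1), broadcast over actions
        a = self.advantage(x)   # (batch, |A|)
        # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
        return v + (a - a.mean(dim=1, keepdim=True))
```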
<p>Dueling DQNs only alter the DQN model and are <em>not</em> mutually exclusive with Double DQNs, so we can have Dueling Double DQNs by combining the Dueling DQN model architecture with Double DQN training. To recap, Dueling DQNs alter our DQN model by splitting the q-value estimation into two streams, value and advantage, to better estimate q-values. These Dueling DQNs, even without Double DQN training, tend to outperform both vanilla DQNs and DQNs using Double DQN training.</p>
<h1 id="prioritized-experience-replay-per">Prioritized Experience Replay (PER)</h1>
<p>Another area of improvement we can make is to the replay memory. <a href="https://arxiv.org/pdf/1511.05952.pdf">Prioritized Experience Replay (PER)</a> helps improve our training and overall agent performance. With vanilla experience replay, we randomly sample the replay memory using a <em>uniform distribution</em>. However, we may have transitions that occur less frequently but can help our agent learn significantly.</p>
<p>Instead of sampling with a uniform distribution, we assign a priority to each transition in the replay memory and then normalize across the replay memory to convert the priority to a probability. Then, we sample based on that probability distribution instead of the uniform distribution.</p>
<p>But first, we need to assign each transition a priority. We set the priority to be directly proportional to the magnitude of the temporal difference error.</p>
\[p_i = \vert\delta_i\vert + \epsilon\]
<p>where $\delta_i$ is the temporal difference, i.e., $r + \gamma\hat{Q}(s’, {\arg\max}_a Q(s’, a)) - Q(s, a)$ for a Double DQN. $\epsilon$ is a small constant to prevent zero priority from being assigned to any transition.</p>
<p>We assign higher priority to transitions with larger temporal difference errors because a larger error means our DQN was “surprised”. In other words, our DQN gave a poor estimate of the q-value, and it could learn more from these kinds of transitions than from ones where it already estimates the q-value well.</p>
<p>The PER paper also proposed an alternative definition of priority.</p>
\[p_i = \frac{1}{\mathrm{rank}(i)}\]
<p>where $\mathrm{rank}(i)$ is the position of transition $i$ in the replay memory after it is sorted according to $\vert\delta_i\vert$. This definition was shown to be a bit more robust and immune to outliers.</p>
<p>However, we can’t use only priorities because the same high-priority transitions will be seen by the network over and over again. Instead, we convert the priorities into probabilities. After we assign each transition in the replay memory a priority, we can perform (a kind of) normalization to compute probabilities.</p>
\[P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\]
<p>where $\alpha$ is a hyperparameter used to vary prioritization. $\alpha=0$ means we revert back to a uniform distribution. Now we can sample from this probability distribution instead of using the uniform distribution!</p>
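<p>Here’s a minimal NumPy sketch of converting priorities into sampling probabilities and drawing a mini-batch. The $\alpha=0.6$ default follows the paper’s proportional variant; a real implementation would use a sum-tree so sampling doesn’t require touching every priority:</p>

```python
import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6, rng=None):
    """Sample transition indices with P(i) = p_i^alpha / sum_k p_k^alpha.
    alpha=0 recovers uniform sampling."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    return idx, probs
```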
<p>One issue remains: our replay memory is biased towards high-priority transitions. Bias wasn’t an issue with our vanilla experience replay because all of the transitions were treated as equally likely. This bias might lead to overfitting because we tend to see high-priority transitions more than the low-priority ones. To correct this bias, we need to weight the parameter updates using importance sampling.</p>
\[w_i = \bigg(\frac{1}{N}\cdot\frac{1}{P(i)}\bigg)^\beta\]
<p>where $\beta$ is an exponent that starts below 1 (the paper anneals it from around 0.4) and is gradually increased to 1 over the course of training. This $w_i$ multiplies the temporal difference error and gradient when we compute the weight update. Intuitively, we see that our weight is inversely proportional to the probability: we assign a higher weight to transitions with lower likelihoods because we will tend to see them less frequently.</p>
<p>In the paper, for added stability of the weight updates, they normalize the weights by $\frac{1}{\max_i w_i}$ so the weights scale downwards.</p>
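<p>Putting the two formulas together, here’s a sketch of the importance-sampling weights. Note that, for simplicity, this normalizes by the max over the sampled batch rather than over the whole memory, a common shortcut in implementations:</p>

```python
import numpy as np

def importance_weights(probs, idx, beta):
    """Importance-sampling weights w_i = (N * P(i))^(-beta),
    normalized by the max so updates only scale downwards."""
    N = len(probs)
    w = (N * probs[idx]) ** (-beta)
    return w / w.max()
```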
<p>We need to make some modifications to our algorithm to use PER. A-priori, we don’t know which transitions will have a high TD error, so, when we observe a transition, we give it the maximum priority. Our replay memory will be filled with maximum priority transitions. During the training phase where we use the network to compute the TD error, we update this transition’s priority using that TD error so that the next time we sample, this transition will have a non-maximal priority. We continually update priorities, even if the given transition is one we’ve seen before.</p>
<p>Here is the Double DQN algorithm that uses prioritized experience replay.</p>
<ol>
<li>Initialize replay memory $M$</li>
<li>Initialize online network $Q$</li>
<li>Initialize target network $\hat{Q}$</li>
<li>For each episode do
<ol>
<li>For each step do
<ol>
<li>Create a sequence $s_t$ from the previous processed frames</li>
<li>With probability $\epsilon$, take a random action, else take best action $a_t = \arg\max_{a’} Q(s_t, a’)$</li>
<li>Execute action $a_t$ to receive reward $r_t$ and next image $x_{t+1}$</li>
<li>Process $x_{t+1}$ and incorporate it into a frame sequence $s_{t+1}$</li>
<li>Store the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay memory $M$ with maximal priority $p_t = \max_{i < t}p_i$</li>
<li>Sample a random mini-batch of transitions $(s_k^{(j)}, a^{(j)}, r^{(j)}, s_{k+1}^{(j)})$ from $M$ according to the distribution $\frac{p_i^\alpha}{\sum_z p_z^\alpha}$, where each transition in the mini-batch is indexed by $j$.</li>
<li>Compute the importance sampling weights: $w^{(j)} = \frac{(N\cdot P(j))^{-\beta}}{\max_z w_z}$</li>
<li>Clip all rewards $r^{(j)}\in[-1,1]$</li>
<li>If end of game, set q-value to reward: $y^{(j)} = r^{(j)}$</li>
<li>If not end of game, set q-value using the Double DQN target: $y^{(j)} = r^{(j)} + \gamma\hat{Q}(s_{k+1}^{(j)}, {\arg\max}_{a’} Q(s_{k+1}^{(j)}, a’))$</li>
<li>Compute temporal difference error: $\mathbf{\delta} = \mathbf{y} - Q(\mathbf{s_k}, \mathbf{a})$</li>
<li>Update transition priorities $p^{(j)} = \vert\delta^{(j)}\vert$</li>
<li>Perform a gradient descent update of the online network using the Huber Loss of the temporal difference error multiplied, element-wise, by the weights $\mathcal{L}_1(\mathbf{\delta})\odot\mathbf{w}$.</li>
<li>Every $C$ steps, update the target network’s weights $\hat{Q}=Q$</li>
</ol>
</li>
<li>End For</li>
</ol>
</li>
<li>End For</li>
</ol>
<p>Using prioritized experience replay has been shown to drastically decrease training time and increase agent performance when compared to a uniformly-sampled replay memory.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Approximate Q-Learning helped mitigate large state spaces by representing the q-value function as a weighted sum of hand-tailored feature functions. DQNs removed the “hand-tailored” aspect of approximate q-learning and use neural networks to approximate the q-values. However, we had to apply other techniques, such as frame skipping, the target network, and experience replay, to get these DQNs working well enough to win at Atari games. Double DQNs separated the q-value estimation between the online and target networks to mitigate overestimation. Dueling DQNs split the q-value function into two independent streams, value and advantage, to better estimate q-values for states where the actions in some states don’t affect the overall reward. Finally, prioritized experience replay moved away from the uniform distribution of vanilla experience replay to bring high-impact transitions to our DQN so it can train better.</p>
<p>Although value-based methods are still used for some agent tasks today, policy-based methods tend to outperform them on a variety of tasks. (I’m already working on a policy-based methods article and code to be published soon 🙂.) However, they have issues of their own (particularly with noisy gradients!) so, as is the case with every other machine learning model, try many different approaches!</p>
<p>Hopefully this post has helped you learn a bit about DQNs and deep reinforcement learning. And don’t forget, I have a <a href="https://github.com/mohitd/dqn">repo</a> that implements these models and techniques in Pytorch!</p>