CMake is one of the most common ways to generate a set of build files for constructing a library or executable. It's a meta-build system: it does not build anything itself but instead creates the files we then use to build, e.g., generating Makefiles to run with `make`. Learning CMake is challenging since tutorials, the official CMake documentation, and public projects tend to range from the very basic "Hello World" to multi-platform, multi-compiler, submodular libraries. In other words, the complexity is often binary, jumping from "let's build this one C++ file!" to "let's build something like Boost!" In my experience, a CMake structure somewhere in between tends to be good enough for most projects.
In this post, I'll describe a good-enough C++ library project structure and CMake file: flexible enough for a client (or an automated build system) to build the library from scratch and use it. To demonstrate this concretely, I've started a catch-all miscellaneous C++ library called bagel, named after an "everything bagel" I had for breakfast that day 😄, which I'm going to use as a C++ playground going forward.
I don't intend for this to be a CMake tutorial for complete beginners; I'll assume you have enough CMake knowledge that I won't have to explain syntax or basic commands like `set` or `project`. The purpose of this post is to show how to use that CMake knowledge to create a project structure that makes building easy and flexible.
Before getting into the CMake file, let’s describe a good-enough directory structure for a mid-sized project:
.
├── CMakeLists.txt
├── LICENSE
├── Readme.md
├── cmake
│   └── Config.cmake.in
├── examples
│   ├── CMakeLists.txt
│   └── timer.cpp
├── include
│   └── bagel
│       ├── chrono
│       │   └── timer.hpp
│       └── export.hpp
├── src
│   └── chrono
│       └── timer.cpp
└── tests
    ├── CMakeLists.txt
    └── chrono
        └── test_timer.cpp

10 directories, 11 files
In this directory, we have a few "required" files like `Readme.md` and `LICENSE` that provide an overall description of the library (among many other things) as well as the legal software license it falls under. Oftentimes open-source libraries have additional files like `CONTRIBUTING.md` and `AUTHORS.md` that explain how to contribute to the library and list its core authors, respectively.
The crux of building the library is the `CMakeLists.txt`, the CMake file used by the CMake executable to write the Makefiles that actually build this library; it contains the library definition, including things like compile options and where to install the headers. When we run the CMake command in a directory like `cmake .`, it will search for a `CMakeLists.txt` in that directory and parse and execute it. A related directory we'll cover in later sections is `cmake`, which tends to store auxiliary CMake files used by the root `CMakeLists.txt`.
The next directory, `examples`, contains example usage of the library with its own `CMakeLists.txt` that just builds the examples. This lets the builder control whether or not to build the examples. In this case, `examples` is flat, but it could be more hierarchical for a larger library. We'll get to its definition later as well. The `tests` directory contains our tests for the library and its own `CMakeLists.txt` for the same reason as the `examples` directory. We use GoogleTest to validate our library, but any testing framework will do. I'd highly recommend having tests for your libraries since they give users credibility and confidence that your library actually does what it intends to do.
The next two directories, `include` and `src`, contain the actual content of our library. In the case of `include`, we have some subdirectories, the main one named after the library itself: `bagel`. Then we have subdirectories for subcomponents like `chrono`. The reason we use a subdirectory `bagel` with the same name as the project is so that, when we install the header files, e.g., to a place like `/usr/local/include` on a Linux system, our headers like `timer.hpp` are prefixed by the library folder to avoid clashing with some other file named `timer.hpp` from another library.
We'll see most of these directories play a part in the project-level `CMakeLists.txt`. The focus of this post is on the CMake required to build our library, not on what the library itself actually does, so we won't dwell on what `timer.hpp`/`timer.cpp` contain. The contents aren't as important as how we build them into a library.
Building a project starts with the `CMakeLists.txt` file that defines the project, build artifacts, and other options. I like to divide the CMake file into several larger sections: preamble, configuration, building, installing, and extras.

The preamble defines the minimum required CMake version as well as the project itself.
cmake_minimum_required(VERSION 3.14)
project(bagel
    VERSION 0.1.0
    DESCRIPTION "An everything bagel of C++"
    LANGUAGES CXX)
In general, requiring a too-recent version of CMake can make it difficult for developers to use your library, since not everyone can run the latest CMake, especially in industry where upgrades to newer build tools can be very slow. For the version number, semantic versioning is a popular choice.
After defining the root CMake project, we define some project-level configurations and check some variables. One of the first configurations we'll provide to builders is the ability to build our code into either a shared or a static library. A shared library (also called a shared object, hence the `.so` file extension) is dynamically loaded into an executable at runtime; these libraries make the overall executable smaller but, since the library is loaded dynamically at runtime, the executable requires the shared library to be in the right place in the filesystem, otherwise the executable fails when you run it. On the other hand, a static library (file extension `.a` for archive) is built into the executable itself at link time; these libraries make the executable larger but, since they're built in, the executable is self-sufficient.
CMake lets the builder specify which kind of library to build via a built-in variable called `BUILD_SHARED_LIBS`. However, since this is general to all CMake libraries and is coupled to other CMake behavior, we often provide a project-specific override usually named something like `${PROJECT_NAME}_SHARED_LIBS`. If that is defined, we use it; otherwise, we default to whatever `BUILD_SHARED_LIBS` says. The overall default is to build static libraries.
One nuance is that we want the variable to be named `BAGEL_SHARED_LIBS`, not `bagel_SHARED_LIBS`, for consistency, so we'll define an `${UPPER_PROJECT_NAME}` variable that's just `${PROJECT_NAME}` uppercased.
set(namespace ${PROJECT_NAME})
string(TOUPPER ${PROJECT_NAME} UPPER_PROJECT_NAME)
message(CHECK_START "Checking ${UPPER_PROJECT_NAME}_SHARED_LIBS")
if(DEFINED ${UPPER_PROJECT_NAME}_SHARED_LIBS)
    set(BUILD_SHARED_LIBS ${${UPPER_PROJECT_NAME}_SHARED_LIBS})
    message(CHECK_PASS "${${UPPER_PROJECT_NAME}_SHARED_LIBS}")
else()
    message(CHECK_FAIL "${BUILD_SHARED_LIBS}")
endif()
message(CHECK_START "Building shared libraries")
if(BUILD_SHARED_LIBS)
    message(CHECK_PASS "yes")
else()
    message(CHECK_FAIL "no")
endif()
We're also defining a `${namespace}` variable that we'll use later. To write things to the screen, we use the `message` command with the `CHECK_START`, `CHECK_PASS`, and `CHECK_FAIL` modes so that CMake formats our messages nicely like the following.
[cmake] -- Checking BAGEL_SHARED_LIBS
[cmake] -- Checking BAGEL_SHARED_LIBS - ON
[cmake] -- Building shared libraries
[cmake] -- Building shared libraries - yes
In CMake, as in bash, there's a difference between a variable existing and a variable having a value. We first check if the variable `${UPPER_PROJECT_NAME}_SHARED_LIBS` exists. Note that we don't use `${}` around the entire expression since we're not checking if the contents of the variable exist; we want to check if the variable itself is defined. If it is, we override the value of `BUILD_SHARED_LIBS`; otherwise, we default to whatever `BUILD_SHARED_LIBS` already is. If that also doesn't exist, we fall back to CMake's default (building a static library).
There are several ways to set these variables. One way is on the `cmake` command line, e.g., `cmake -DMY_VAR=ON` to define `MY_VAR`.
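For instance, to configure a shared debug build of bagel from a build directory, the configure command might look like this (the exact flag values are illustrative):

```shell
cmake -DBAGEL_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Debug ..
```

Variables set this way go into CMake's cache, so they persist across re-runs of `cmake` in the same build directory.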
Another common CMake configuration is the build type, which mostly controls compiler optimizations and options such as debug symbols. The most commonly used types are `Debug`, `Release`, and `RelWithDebInfo`. `Debug` has minimal optimizations but retains debug symbols; `Release` has the strongest optimizations but strips the debug symbols needed for debugging with a debugger like gdb. `RelWithDebInfo` has the optimizations of release mode but still contains debug symbols. Similar to `BUILD_SHARED_LIBS`, if `CMAKE_BUILD_TYPE` isn't defined, we'll default to `Release` since that's what builders of our library will tend to use.
if(NOT DEFINED CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
endif()
message(STATUS "Setting build type: ${CMAKE_BUILD_TYPE}")
Using the `CACHE` and `FORCE` options, we override whatever value is in the cache; this is fine since the user didn't specify a `CMAKE_BUILD_TYPE` in the first place. The `STRING "Build type"` part tells CMake that `CMAKE_BUILD_TYPE` is a string and provides its docstring.
Next we set some variables for later and define some custom other build options like building examples, tests, and documentation.
set(export_header_name "export.hpp")
set(export_file_name "${CMAKE_CURRENT_SOURCE_DIR}/include/${PROJECT_NAME}/${export_header_name}")
include(GNUInstallDirs)
set(cmake_config_dir ${CMAKE_INSTALL_LIBDIR}/cmake/${PROJECT_NAME})
set(build_tests ${UPPER_PROJECT_NAME}_BUILD_TESTS)
set(build_examples ${UPPER_PROJECT_NAME}_BUILD_EXAMPLES)
set(build_docs ${UPPER_PROJECT_NAME}_BUILD_DOCS)
option(${build_tests} "Builds tests" OFF)
option(${build_examples} "Builds examples" OFF)
option(${build_docs} "Builds docs" OFF)
We use a few CMake variables:

- `${CMAKE_CURRENT_SOURCE_DIR}`: the directory currently being processed by CMake; in our case, since our library is itself a top-level CMake project, this is the root of the project. For single-project CMake setups, this usually refers to the project root.
- `${CMAKE_INSTALL_LIBDIR}`: the install directory for libraries; on Linux systems, this is usually called `lib` (or sometimes `lib32` or `lib64`). Note that the install prefix is prepended to this folder. Since we called `include(GNUInstallDirs)` earlier, it sets this folder correctly for us.

We'll discuss the export header and the config directory later.
Now we're getting into actually building the library. The first thing we'll do is define the library itself and an alias.
add_library(${PROJECT_NAME})
add_library(${PROJECT_NAME}::${PROJECT_NAME} ALIAS ${PROJECT_NAME})
The alias is so that, if someone builds our library from source and links it as part of their project, the `target_link_libraries` call looks the same as when linking against an installed version. We're not adding any sources yet, just declaring the library's existence. After defining the library, we also set the minimum C++ version and provide some compile options.
target_compile_features(${PROJECT_NAME} PUBLIC cxx_std_17)
target_compile_options(${PROJECT_NAME} PRIVATE -Wall -Wextra)
We use `PUBLIC` for the minimum C++ version so that it's propagated to users when they link against our library. The compile options are `PRIVATE` since they only apply to our library; we don't want our decisions about warnings and errors to be propagated to all of our users!
C++ provides access specifiers like `public` and `private`, but when building a shared library, we also have a notion of "library" visibility. For shared libraries, each class and function defines a symbol in the library's symbol table. When you link the shared library into an executable (or another library), the linker resolves those symbols to actual memory addresses. Think of them as placeholders for the actual interface that your library provides (sometimes called its ABI, or Application Binary Interface). By default, all defined symbols (except inline ones) are exported by our shared library. However, we sometimes have internal classes or functions that we don't want to export as part of the shared library's interface. It would be better to explicitly mark which symbols are part of our library's interface and hide all other symbols by default. The asymmetry is that static libraries don't have this distinction: a static library is built into the executable in its entirety, and the linker doesn't apply symbol visibility to it. So we need export macros that mark symbols as visible for shared builds but compile away to nothing for static builds.
CMake handles this by generating an export header that defines a macro like `BAGEL_EXPORT` that exports symbols for shared libraries but becomes a no-op for static libraries.
if(NOT BUILD_SHARED_LIBS)
    target_compile_definitions(${PROJECT_NAME} PUBLIC ${UPPER_PROJECT_NAME}_STATIC_DEFINE)
endif()
include(GenerateExportHeader)
generate_export_header(${PROJECT_NAME}
    EXPORT_FILE_NAME ${export_file_name}
)
The first part adds a macro definition `BAGEL_STATIC_DEFINE` that turns `BAGEL_EXPORT` into a no-op. The `generate_export_header` command auto-generates a header file at `${export_file_name}` that defines macros to change the visibility of a symbol. To export certain classes or functions, we include that header and place `BAGEL_EXPORT` right before the symbol name like the following.
class BAGEL_EXPORT MyClass {
...
};
void BAGEL_EXPORT myFunc() {
...
}
If we inspect the symbol table of the shared library, we'll see only those symbols exported while the others are hidden. For a class, exporting the class exports all of its symbols, but the export header also defines a `BAGEL_NO_EXPORT` macro that hides an individual symbol again.
The last thing we need to do is to disable exporting all symbols by default.
if(NOT DEFINED CMAKE_CXX_VISIBILITY_PRESET)
    set_target_properties(${PROJECT_NAME} PROPERTIES
        CXX_VISIBILITY_PRESET hidden
    )
endif()
if(NOT DEFINED CMAKE_VISIBILITY_INLINES_HIDDEN)
    set_target_properties(${PROJECT_NAME} PROPERTIES
        VISIBILITY_INLINES_HIDDEN ON
    )
endif()
That finishes our symbol exporting. Moving on, one minor thing we'll also do is set our library's version based on what we set in the `project()` command.
set_target_properties(${PROJECT_NAME} PROPERTIES
    SOVERSION ${PROJECT_VERSION_MAJOR}
    VERSION ${PROJECT_VERSION}
)
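For reference, with these two properties a shared build on Linux typically produces a real file named with the full version plus two symlinks, roughly like:

```
libbagel.so -> libbagel.so.0
libbagel.so.0 -> libbagel.so.0.1.0
libbagel.so.0.1.0
```

Consumers record the SOVERSION name (`libbagel.so.0`), so patch and minor updates can be dropped in without relinking.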
After all of that, we’re finally ready to actually add header and source files.
target_include_directories(${PROJECT_NAME}
    PRIVATE
        "${CMAKE_CURRENT_SOURCE_DIR}/src"
    PUBLIC
        "$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>"
        "$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>"
)
target_sources(${PROJECT_NAME} PRIVATE
    src/chrono/timer.cpp)
We use `target_include_directories` to add header search paths to our library. The `PRIVATE` part means that only the source files in our `src` directory can access headers in `src`; external users can't (those headers are for internal library use only). For the `PUBLIC` part, we use CMake generator expressions to specify a build and an install interface. When building the library, we use the headers in the `include` directory directly; users instead get the headers from wherever we've installed them during the install stage. Recall that `${CMAKE_INSTALL_INCLUDEDIR}` is just like `${CMAKE_INSTALL_LIBDIR}` but for headers instead of libraries (set to `include` by `GNUInstallDirs`).
`target_sources` adds source files to our library, and `PRIVATE` is really the only visibility that makes sense here. We could also glob all source files under the `src` directory, but I like to be explicit about which source files are added to the library.
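If you do prefer globbing, a sketch looks like the following; `CONFIGURE_DEPENDS` asks CMake to re-check the glob at build time, though the documentation warns this isn't supported reliably by every generator:

```cmake
# Collect every .cpp under src/; newly added files may be missed on some generators.
file(GLOB_RECURSE bagel_sources CONFIGURE_DEPENDS
    "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")
target_sources(${PROJECT_NAME} PRIVATE ${bagel_sources})
```

Being explicit instead means a forgotten `target_sources` entry shows up as an obvious link error rather than a silently stale build.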
At this point, we have our library and header files ready; we just need to install them so that users can find the library and link against it. The ideal user experience should be as simple as possible.
find_package(bagel REQUIRED)
target_link_libraries(${PROJECT_NAME} PRIVATE bagel::bagel)
These two lines should be all that's required to link against the installed library. So how do we accomplish this? The first thing we need to do is install the headers. There's a `PUBLIC_HEADER` target property, but it doesn't work so nicely for nested directory structures. I've found it easier to just install the entire `include` directory into the right place.
install(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/include/${PROJECT_NAME}"
    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
This does go against my earlier sentiment about being explicit about which files go into the library, but we've already configured our project so that private headers live in the `src` directory, giving us a mechanism to keep some headers private. The next thing we need to install is the library itself and associate the headers with it.
install(TARGETS ${PROJECT_NAME}
    EXPORT "${PROJECT_NAME}Targets"
)
Installing the library isn't enough: we need to create an export set for our library that describes how to find the header files and library file from the target itself. We'll use the export set we just created and generate a corresponding `*Targets.cmake` file for it. We'll also give it a namespace; this is the more modern way for CMake to know that a particular name is a build target and not a folder or something else.
install(EXPORT "${PROJECT_NAME}Targets"
    FILE "${PROJECT_NAME}Targets.cmake"
    NAMESPACE ${namespace}::
    DESTINATION ${cmake_config_dir}
)
We'll get to why we're installing this into `${cmake_config_dir}` in just a second.

The last thing we need is to write a package config so that `find_package` in a client `CMakeLists.txt` can actually find our library and import the build target. The first thing we'll do is write a version config file, and there's a helper we can use.
include(CMakePackageConfigHelpers)
write_basic_package_version_file(
    "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}ConfigVersion.cmake"
    VERSION "${PROJECT_VERSION}"
    COMPATIBILITY SameMajorVersion
)
Recall that `${CMAKE_CURRENT_BINARY_DIR}` is the build directory; this is fine since we'll be installing these generated files immediately anyway. We set the compatibility to `SameMajorVersion` since, under our semantic versioning scheme, breaking changes only occur across major versions. Next we need to create a config file that imports our previously created targets file. For that, we first create a separate `Config.cmake.in`.
@PACKAGE_INIT@
include("${CMAKE_CURRENT_LIST_DIR}/@PROJECT_NAME@Targets.cmake")
check_required_components(@PROJECT_NAME@)
Some of this is a bit esoteric, but the documentation says to ensure `@PACKAGE_INIT@` is at the top and `check_required_components(@PROJECT_NAME@)` is at the bottom. In the middle, all we have to do is include our targets file. Finally, we install both of these files to the right location.
install(FILES
    "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}Config.cmake"
    "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}ConfigVersion.cmake"
    DESTINATION ${cmake_config_dir}
)
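One step worth spelling out is how `Config.cmake.in` becomes the `bagelConfig.cmake` being installed here: we run the template through `configure_package_config_file` from the `CMakePackageConfigHelpers` module, which substitutes the `@...@` placeholders and generates the `@PACKAGE_INIT@` boilerplate. A sketch:

```cmake
# Generate bagelConfig.cmake in the build directory from the template.
configure_package_config_file(
    "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Config.cmake.in"
    "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}Config.cmake"
    INSTALL_DESTINATION ${cmake_config_dir}
)
```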
Note that we install the package config files and the targets file to the `${cmake_config_dir}` we defined earlier. This effectively installs to a path like `lib/cmake/bagel` on a Linux system. This is one of the standard places `find_package(bagel)` searches: under each install prefix, it looks in directories like `lib/cmake/<name>` for a package configuration file. If we were to put it somewhere else, we'd get an error like the following.
CMake Error at CMakeLists.txt:6 (find_package):
  Could not find a package configuration file provided by "bagel" with any of
  the following names:

    bagelConfig.cmake
    bagel-config.cmake

  Add the installation prefix of "bagel" to CMAKE_PREFIX_PATH or set
  "bagel_DIR" to a directory containing one of the above files.  If "bagel"
  provides a separate development package or SDK, be sure it has been
  installed.
Alternatively, we could install this anywhere and append to the `CMAKE_PREFIX_PATH` or define a `bagel_DIR`, but it's convenient to use the right location by default so clients don't have to do that extra step. Of course, a client could set the install prefix to anywhere, but then it's on them to set either of the two variables above.
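As a sketch, a client building against a bagel installed under a hypothetical `/opt/bagel` prefix could use either flag at configure time:

```shell
cmake -DCMAKE_PREFIX_PATH=/opt/bagel ..
# or, pointing directly at the folder containing bagelConfig.cmake:
cmake -Dbagel_DIR=/opt/bagel/lib/cmake/bagel ..
```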
At this point, we technically have everything we need for our library, but let's also provide a way to build examples, tests, and documentation. In the project-level `CMakeLists.txt`, we just need to recurse into the lower-level `CMakeLists.txt` files.
message(CHECK_START "Building tests")
if(${build_tests})
    message(CHECK_PASS "yes")
    add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/tests)
else()
    message(CHECK_FAIL "no")
endif()
message(CHECK_START "Building examples")
if(${build_examples})
    message(CHECK_PASS "yes")
    add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/examples)
else()
    message(CHECK_FAIL "no")
endif()
We'll get to those in a minute. Building documentation relies on Doxygen: we set some CMake variables and then the `doxygen_add_docs` command creates a docs target. One additional thing we can do is make our project depend on the `generate_docs` target so that, whenever we rebuild the library due to a code change, the documentation is automatically regenerated too!
message(CHECK_START "Building docs")
if(${build_docs})
    message(CHECK_PASS "yes")
    find_package(Doxygen REQUIRED)
    set(README_PATH "${CMAKE_CURRENT_SOURCE_DIR}/Readme.md")
    set(DOXYGEN_PROJECT_NAME "${PROJECT_NAME}")
    set(DOXYGEN_PROJECT_BRIEF "${PROJECT_DESCRIPTION}")
    set(DOXYGEN_USE_MDFILE_AS_MAINPAGE "${README_PATH}")
    doxygen_add_docs(generate_docs include "${README_PATH}"
        COMMENT "Generating docs")
    add_dependencies(${PROJECT_NAME} generate_docs)
else()
    message(CHECK_FAIL "no")
endif()
Alternatively, we could fall back to another documentation generator rather than making Doxygen required, but that's a design choice.
The `CMakeLists.txt` in the examples folder is fairly straightforward.
cmake_minimum_required(VERSION 3.16)
project(bagel-examples)
add_executable(timer timer.cpp)
target_link_libraries(timer PRIVATE bagel::bagel)
Notice how we link our example executable against our library as `bagel::bagel` using `PRIVATE`, since an executable is a final consumer and doesn't propagate its dependencies.
Tests are slightly more complicated because of downloading and using GoogleTest, but still readable.
cmake_minimum_required(VERSION 3.16)
project(bagel-tests)
set(INSTALL_GTEST OFF)
enable_testing()
include(FetchContent)
FetchContent_Declare(
    googletest
    URL https://github.com/google/googletest/archive/refs/tags/v1.14.0.zip
)
FetchContent_MakeAvailable(googletest)
include(GoogleTest)
add_executable(test_timer chrono/test_timer.cpp)
target_link_libraries(test_timer
    PRIVATE
        bagel::bagel
        GTest::gtest_main
)
gtest_discover_tests(test_timer)
Again, notice how we link our library into the test binary (along with GoogleTest's `gtest_main`).
To evaluate whether we did everything correctly, I created a dummy C++ executable for testing purposes. Its `main.cpp` simply includes the header and does some work.
#include <chrono>
#include <iostream>
#include <thread>
#include <bagel/chrono/timer.hpp>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
    bagel::WallTimer t;
    t.start();
    std::this_thread::sleep_for(10ms);
    auto elapsed = t.stop();
    std::cout << elapsed.count() << "s\n";
    return 0;
}
We create a timer, intentionally pause the main thread for about 10ms, stop the timer, and print the elapsed time.
The client's `CMakeLists.txt` simply defines an executable and links it against our library. Since I've installed the library to a custom location for development purposes, I manually append that location to the `CMAKE_PREFIX_PATH`.
cmake_minimum_required(VERSION 3.14)
project(hungry)
list(APPEND CMAKE_PREFIX_PATH "/Users/mohit/Developer/bagel/install/")
find_package(bagel CONFIG REQUIRED)
add_executable(hungry main.cpp)
target_link_libraries(hungry PRIVATE bagel::bagel)
Now we can create a build directory, run cmake, build our executable, and run it!
mkdir build && cd build
cmake ..
make
./hungry
The output is what we expect: a value close to 10ms (a little off depending on your scheduler).
0.012527s
CMake is the most popular meta-build system for building C++ libraries and executables, but it's also one of the most challenging to learn well. In this post, we went over a project structure and `CMakeLists.txt` for a medium-sized project with multiple subcomponents. We broke the `CMakeLists.txt` down into parts: (i) preamble, (ii) configuration, (iii) building, (iv) installing, and (v) extras. In (i), we simply define the project. In (ii), we define some variables that clients can use to configure how they build our library. (iii) is where we actually build the library and set things like include directories. After building the library, (iv) is where we install it and the headers in a way and place where clients can easily link against them. Finally, (v) is where we build optional things like examples, tests, and documentation.
CMake can be pretty complicated to "get right" and there's a lot of variability in how developers use it to build libraries and executables. Hopefully this little tutorial provides some guidance on how to give your `CMakeLists.txt` more structure while abiding by some best practices. If you're working on C++ stuff, try to crystallize some of this guidance into your team or project's standards and let me know how it goes 🙂
In this post, we’ll compose our modern artificial neurons together into an actual artificial neural network (ANN) and discuss how to train an ANN using the most important and fundamental algorithm in all of deep learning: backpropagation of errors. In fact, we’re going to derive this algorithm for ANNs of any width, depth, and activation and cost function! Similar to previous posts, we’ll implement a generic ANN and the backpropagation algorithm in Python code using numpy. However, this time we’ll use a more complex dataset to highlight the expressive power of a full ANN.
Disclaimer: this part is going to have a lot of maths and equations since I want to properly motivate backpropagation and dispel any myths and misconceptions about backpropagation being this magical thing known only to machine learning library implementers and to combat those saying “ah just let X library take care of it; it’ll ‘just work’”. To make this understanding more accessible, I’ll have sections that summarize the high-level ideas as well as intuitive explanations for each of the core equations.
A two-layer network is composed by taking the output of the previous layer’s neurons and feeding them as input to each of the next layer’s neurons to create an all-to-all connection across layers. We don’t currently consider self-connections although there are network architectures, such as recurrent neural networks (RNNs), that do.
From the previous post in the series, we're already familiar with building a perceptron network with an intermediate hidden layer to solve the XOR gate problem. We saw that adding this hidden layer gave our model far more expressive power than a single layer. This structure, or architecture, is general enough that we can call it an artificial neural network: we have an input layer, any number of hidden layers, and an output layer. Layer to layer, we connect each neuron of the previous layer to each neuron of the next layer, forming a many-to-many connection.
Zooming into a single neuron, we take the weighted sum of its inputs and add a bias to form the pre-activation. One way to represent the bias is as a "weight" whose input is always $+1$. The weights and bias are the parameters of the network, learned through some algorithm such as gradient descent. The activation function is applied to the pre-activation to produce the neuron's output, which is fed into each of the next layer's neurons.
To compute a value for each neuron, we take the weighted sum of its inputs plus the bias to compute a pre-activation, and then run the pre-activation through an activation function to get the actual activation/value of the neuron. That activation then becomes an input to the next layer's neurons. Finally, we have an output layer that computes some value that's useful for evaluation, e.g., a number for linear regression or a class label.
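The single-neuron computation just described can be sketched in numpy; the sigmoid activation and the concrete weight/bias values here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b  # pre-activation: weighted sum of inputs plus bias
    return sigmoid(z)     # activation: nonlinearity applied to z

x = np.array([1.0, 0.0])   # inputs
w = np.array([0.5, -0.5])  # weights
b = 0.1                    # bias

print(neuron(x, w, b))  # sigmoid(0.6) ≈ 0.6457
```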
The intuition behind having multiple hidden layers is that they give the network more expressive power. We saw this with the XOR gate problem: the input space wasn't linearly separable but the hidden space was. One interpretation of these hidden layers is that they transform the input space nonlinearly until the output space is linearly separable. A complementary interpretation is that the hidden layers iteratively build complexity from the earlier layers to the later layers. It's easiest to see this with neural networks that operate on images, i.e., convolutional neural networks: the weights of the earlier layers activate on simple lines and edges while the weights of the later layers compose these to activate on shapes and more complex geometry.
Let’s introduce/re-introduce some notation to make talking about pre-activations, activations, weights, biases, layers, and other neural network stuff easier. I’m a fan of Michael Nielsen’s notation since I think it makes the subsequent analysis easier to understand and pattern-match against so we’ll use that going forward.
Let's consider neural networks with $L$ total layers, indexed by $l\in [1, L]$. We define a bias vector $b^l$ with components $b_j^l$ for each layer $l$. The superscripts here mean layer, not exponents! Between layers, we collect all of the individual weights into a weight matrix $W^l$ with elements $W_{jk}^l$ representing the weight from neuron $k$ in layer $(l-1)$ to neuron $j$ in layer $l$, i.e., $W_{jk}^l$ is the weight from neuron $k\to j$ going from layer $(l-1)$ to $l$. Notice the first index is the neuron in layer $l$ and the second is the neuron in layer $(l-1)$; Michael Nielsen defines the weight matrix this way since it simplifies some of the equations and makes the intuition easier to understand.
An entry in the weight matrix $W_{jk}^l$ is the value connecting the $k$-th neuron in the $(l-1)$-th layer with the $j$-th neuron in the $l$-th layer. When we compute the input for any particular neuron, we sum over all of the output activations of the previous layer, hence the $\sum_k W_{jk}^l a_k^{l-1}$ part of the pre-activation.
For the input layer, we can define the pre-activation as the weighted sum of the inputs and weights plus the bias $z_j^1=\sum_k W_{jk}^1 x_k + b_j^1$; we can write it in a vectorized form like $z^1= W^1 x + b^1$. The activation just runs the pre-activation through an activation function $\sigma(\cdot)$ like $a_j^1=\sigma(z_j^1)$ or $a^1=\sigma(z^1)$ for the vectorized version (assuming the activation function is applied element-wise). For the next layer, we use the activations of the previous layer all the way until we get to the activations for the last layer $a^L$, also called the output layer. For simplicity, we can define the zeroth set of activations as the input $a^0 = x$ so we can write the entire set of equations in a general form for each layer.
\[\begin{align*} z_j^l &= \displaystyle\sum_k W_{jk}^l a_k^{l-1} + b_j^l & z^l &= W^l a^{l-1} + b^l\\ a_j^l &= \sigma(z_j^l) & a^l &= \sigma(z^l)\\ \end{align*}\]

Performing a forward pass/inference is just computing $z^l$ and $a^l$ all the way to the final output layer $L$. During training, that final output $a^L$ goes into the cost function to determine how well the current set of weights and biases produces the desired output. For tasks like classification, we can express the cost function in terms of the output layer activations $a^L$ and the desired class $y$, like $C(a^L, y)$. Note that if we were to "unroll" $a^L$ and all activations back to the input, $a^L$ would expand into a huge expression that is a function of all of the weights and biases in the network, so plugging it into the cost function really evaluates all of the weights and biases.
This is a lot of notation but take a second to understand the placement of indices and what they represent. As an example, suppose we wanted to compute $z_1^l$; then substituting $j=1$ into the pre-activation equation, we get $\sum_k W_{1k}^l a_k^{l-1} + b_1^l$. Intuitively, this means we take the activations of the $k$ neurons in the $(l-1)$-th layer as a vector, multiply by the 1st row of the weight matrix, and add the 1st component of the bias vector to get the 1st component of the pre-activation vector. Make sure the indices match up and make sense, i.e., there should be the same number of free lower indices on both sides of any equation! Try other kinds of substitutions to make sure you understand how the index placement works.
In the previous post, we demonstrated how to train a single neuron using gradient descent by computing the partial derivatives of the cost function with respect to the weights and bias. That is actually still the exact same principle and idea that we’ll be going forward with; it’s just that in the general case, the maths gets a bit more complicated since we have multiple sets of parameters across multiple layers written as functions of each other. Rather than computing individual partial derivatives for each weight and bias, we can come up with a general set of equations that tell us how to do so for any width and depth of neural network.
Instead of jumping right into the maths, let’s go through a numerical example of backpropagation to get our feet wet first. I actually wrote a post many years ago on this that I’ll steal from and take this opportunity to update the writing and narrative. Since we already somewhat used backpropagation in the previous post, let’s analyze that in a bit more detail.
One useful visual representation for a computation is a computation graph. Each node in the graph represents an operation and each edge represents a value that is the output of the previous operation. Let’s draw a computation graph for our little artificial neuron from the previous post and substitute some random values for the weights and bias.
This computation graph represents a single neuron with two inputs and corresponding weights and a bias term. Example values have been substituted and a forward pass has been computed. The $y$ value is the target/true value fed into the cost function.
In this very simple example, we have a few operations: multiplication, addition, sigmoid activation, and cost function evaluation. We’ve done a forward pass and recorded the output of each operation above the corresponding edge. At the very last step, we have a sigmoid output of 0.73 but a desired output of 1. So the goal is to adjust our weights and biases such that the next time we perform a forward pass, the output of the model is closer to 1. What we did last time was to compute the partial derivatives of the cost function with respect to each parameter by expanding out the entire cost function and analytically computing derivatives. One of the things we saw was that all of the learnable parameters had similar terms in their derivatives, namely $\frac{\p C}{\p a}$ and $\frac{\p C}{\p z}$. Was this coincidental or a byproduct of how we compute the output of a neuron?
To answer this question, we’re going to take a slightly different, but equivalent, approach to computing the partial derivatives by using the graph as a visual guide for which derivatives compose. For each node, we’re going to take the derivative of the operation with respect to each of its inputs and accumulate the overall gradient, starting at the end, through the graph until we get all the way back to the parameters of the model at the very left of the graph. We’ll start with the first derivative $\frac{\p C}{\p a}$ and keep tacking on factors as we go backwards through the graph. For example, the next factor we’ll tack on is $\frac{\p a}{\p z}$ to get $\frac{\p C}{\p a}\frac{\p a}{\p z}=\frac{\p C}{\p z}$. By multiplying through the partial derivatives this way, propagating the gradient signal backwards through the graph is equivalent to applying the chain rule. By the time we get to the model parameters, we will have computed something like $\frac{\p C}{\p w_1}$ and we can simply read it off the graph.
Let’s start with the output layer and the cost function. We’re using the quadratic cost function that looks like this for a single output: $C(a, y) = \frac{1}{2}(y - a)^2$. There are technically two possible partial derivatives of this function $\frac{\p C}{\p a}$ and $\frac{\p C}{\p y}$ but the latter doesn’t make sense since $y$ is given and not a function of the parameters of the model so let’s compute the former. We’ve already done so in the previous post so we’ll lift the derivative from there.
\[\begin{align*} \frac{\p C}{\p a} &= -(y-a)\\ &= -(1 - 0.73)\\ &= -0.27 \end{align*}\]Computing the derivative and substituting our values, we get $-0.27$ for the start of the gradient signal.
We’ve computed the gradient of the cost function with respect to its inputs and placed it below the corresponding edge in green. Since $y$ is given, we don’t compute a gradient to it.
We’re going to write the gradient values under the edges and track them as we move backward through the graph. Now the next operation we encounter is the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$. Let’s compute the derivative of the sigmoid with respect to input $z$. Similar to the above example, we already know a closed-form of $\sigma’(z)$ from the previous post so we’ll lift the derivative from there.
\[\begin{align*} \frac{\p a}{\p z} &= \sigma(z)\big[1-\sigma(z)\big]\\ &= a(1-a)\\ &= 0.73(1-0.73)\\ &= 0.1971 \end{align*}\]Computing the derivative and substituting values, we get $0.1971$. Now do we add this number underneath the corresponding edge of the graph? Not quite. We could call this value a local gradient since we’re just computing the gradient of a single node with respect to its inputs. But remember what we said above: propagating the gradient is equivalent to applying the chain rule so we actually need to multiply this by $-0.27$ to get the total gradient $\frac{\p C}{\p a}\frac{\p a}{\p z}=\frac{\p C}{\p z}=-0.27(0.1971)=-0.053$ which we can put underneath the corresponding edge.
We’ve computed the gradient of the activation function with respect to its inputs. To get the actual gradient, we multiply it with the previous gradient from the cost function so that we have a full global gradient.
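We can verify this local gradient numerically. A sketch, reconstructing the example's values (from the figure, $w_1=2$, $x_1=-1$, $w_2=-3$, $x_2=-2$, $b=-3$, so the pre-activation works out to $z = 1$) and comparing the closed-form $\sigma'(z)$ against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# the worked example's pre-activation: z = 2*(-1) + (-3)*(-2) + (-3) = 1
z = 1.0
a = sigmoid(z)          # ~0.73, matching the forward pass in the graph
analytic = a * (1 - a)  # sigma'(z) = sigma(z)(1 - sigma(z)), ~0.197

# independent check via a central finite difference
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
```

The small discrepancy from the post's 0.1971 comes from rounding $a$ to 0.73 before multiplying.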
Now we’ve reached our first parameter, the bias $b$! Same as before, we’ll compute the local gradient and multiply by the gradient accumulated so far. To make things a bit easier, let’s define $\Omega \equiv w_1 x_1 + w_2 x_2$ so the operation can be written as $z = \Omega + b$. We have two local gradients to compute: $\frac{\p z}{\p \Omega}$ and $\frac{\p z}{\p b}$. Fortunately, this is easy since the derivative of a sum with respect to either term is 1, so $\frac{\p z}{\p \Omega}=\frac{\p z}{\p b}=1$ and we just “copy” the gradient along both input paths of the addition node. We’ve successfully computed the gradient of the cost function with respect to our bias parameter!
We’ve computed the gradient across the weighted sum and bias. Notice that the gradient is “copied” across addition nodes because the derivative of a sum with respect to the terms is always $+1$.
We have two more parameters to go. The next node we encounter on our way to the weights is another addition node. Similar to what we just did, we can “copy” the gradient along both paths.
Let’s first consider $w_1$, where we encounter a multiplication node. As before, we can define $\omega_1 = w_1 x_1$ and compute just the local gradient $\frac{\p \omega_1}{\p w_1}$; we don’t need $\frac{\p \omega_1}{\p x_1}$ since $x_1$ is a fixed input, just like the target $y$.
\[\begin{align*} \frac{\p \omega_1}{\p w_1} &= x_1\\ &= -1\\ \end{align*}\]Multiplying this with the incoming gradient, we get the total gradient of $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p\Omega}\frac{\p \Omega}{\p \omega_1}\frac{\p \omega_1}{\p w_1} = \frac{\p C}{\p w_1} = 0.053$. Collapsing the identity terms, a more meaningful application of the chain rule would be $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p w_1} = \frac{\p C}{\p w_1} = 0.053$. We can easily figure out the other derivative $\frac{\p C}{\p a}\frac{\p a}{\p z}\frac{\p z}{\p w_2} = \frac{\p C}{\p w_2} = -0.053(-2)=0.106$ by noting that for a multiplication node, the local gradient of one of the inputs is the other input, so $\frac{\p z}{\p w_2}=x_2$.
We’ve computed all of the gradients in the computation graph, including the weights. For a multiplication gate, the gradient of a particular term is the product of the other terms. For example, $\frac{\p}{\p a}abc=bc$ and the other derivatives follow. For a product like this, we multiply by the incoming gradient.
Now we’ve computed the gradient of the cost function for every parameter so we’re ready for a gradient descent update!
\[\begin{align*} w_1&\gets w_1 - \eta\frac{\p C}{\p w_1}\\ w_2&\gets w_2 - \eta\frac{\p C}{\p w_2}\\ b&\gets b - \eta\frac{\p C}{\p b} \end{align*}\]Let’s set the learning rate to $\eta=1$ for simplicity and perform a single update to get new values for our parameters.
\[\begin{align*} w_1 &\gets 2 - (0.053) &= 1.947\\ w_2 &\gets -3 - (0.106) &= -3.106\\ b &\gets -3 - (-0.053) &= -2.947 \end{align*}\]If we run another forward pass with these new parameters, we get $a=0.79$ which is closer to our target value of $y=1$! We’ve successfully performed gradient descent numerically by hand and seen that it does, in fact, adjust the model parameters to get us closer to the desired output!
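The whole worked example fits in a few lines of NumPy. This sketch reproduces the forward pass, the backward pass, and the single gradient descent step using the example's values (intermediate results in the post are rounded):

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# values from the worked example
w1, x1, w2, x2, b, y = 2., -1., -3., -2., -3., 1.

# forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)               # ~0.73

# backward pass: accumulate the gradient backwards through the graph
dC_da = -(y - a)             # ~ -0.27
dC_dz = dC_da * a * (1 - a)  # ~ -0.053
dC_dw1, dC_dw2, dC_db = dC_dz * x1, dC_dz * x2, dC_dz

# gradient descent update with learning rate eta = 1
eta = 1.0
w1, w2, b = w1 - eta * dC_dw1, w2 - eta * dC_dw2, b - eta * dC_db

# a second forward pass lands closer to the target y = 1
a_new = sigmoid(w1 * x1 + w2 * x2 + b)  # ~0.79
```

Running this confirms the updated output moves from about 0.73 toward the target.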
To summarize, a computation graph is a useful tool for visualizing a larger computation in terms of its constituent operations, represented as nodes in the graph. To perform backpropagation on this graph, we start with the final output and work our way backwards to each parameter, accumulating the global gradient as we go by successively multiplying it by the local gradient at each node. The local gradient at each node is just the derivative of the node with respect to its inputs. If we keep doing this, we’ll eventually arrive at the global gradient for each parameter, which is equivalent to the derivative of the cost function with respect to that parameter. We can directly use this gradient in a gradient descent update to get our model closer to the target value.
Now that we’ve seen backpropagation work in a few different cases, e.g., single neuron and computation graph, we’re ready to actually derive the general backpropagation equations for any ANN. This is where the maths is going to start getting a little heavy so feel free to skip to the last paragraph of this section. I’ll be loosely following Michael Nielsen’s general approach here since I like the high-level way he’s structured the derivation. We’re going to start with computing the gradient of the cost function with respect to the output of the model, then come up with an equation for propagating a convenient intermediate quantity (he calls this the “error”) from layer to layer, and finally two more equations that express the partial derivatives of the cost with respect to the weights and biases of a particular layer in terms of that layer’s intermediate quantity.
In the previous section, we started by computing the gradient of the cost function with respect to the entire model output, so that sounds like a sensible thing to compute first here: $\frac{\p C}{\p a_j^L}$ or $\nabla_{a^L}C$ in vector form. We’re making an implicit assumption that the cost function is a function of the output of the network, but that’s most often the case. There are more complex models that account for other things in the cost function, but it’s a reasonable assumption to make. Note that this gradient is entirely dependent on the cost function we use, e.g., mean absolute error, mean squared error, or something more interesting like Huber loss, so we’ll leave it written symbolically.
Going a step further, we want to compute the derivative of the cost function with respect to the weights and biases of the very last layer, i.e., $\frac{\p C}{\p W_{jk}^L}$ and $\frac{\p C}{\p b_j^L}$. To do this, we’ll have to go backwards through the activation function first $\frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}=\frac{\p C}{\p z_j^L}$. One thing to note is that, for every layer, the pre-activation is always a function of the weights and biases at the same layer. By that logic, if we could compute $\frac{\p C}{\p z_j^l}$ for each layer, the gradients of the weights and biases would just be another factor tacked on to this. For convenience purposes, it seems like a good idea to define a variable and name for this quantity so let’s directly call this the error in neuron $j$ in layer $l$.
\[\begin{equation} \delta_j^l \equiv \frac{\p C}{\p z_j^l} \end{equation}\]Note that we could have defined the error in terms of the activation rather than the pre-activation like $\frac{\p C}{\p a_j^l}$ but then there would be an extra step to go through the activation into the pre-activation anyways (for each weight matrix and bias vector) so it’s a bit simpler to define it in terms of the pre-activation. But everything we do past this point could be done using $\frac{\p C}{\p a_j^l}$ as the definition of the error without loss of generality.
A visual way to think about the error is taking the green gradient path from the cost function to the pre-activation $z_j^l$ (across its activation $a_j^l$) of a particular neuron.
Intuitively, $\delta_j^l$ represents how a change in the pre-activation in a neuron $j$ in a layer $l$ affects the entire cost function. This little wiggle in the pre-activation occurs from a change in the weights or bias but since the pre-activation is a function of both, we use it to represent both kinds of wiggles. It’s really just a helpful intermediate quantity that simplifies some of the work of propagating the gradient backwards.
Now that we have this quantity, the first step is to compute the error at the output layer $L$. Let’s substitute $l=L$ into the definition of $\delta_j^l$
\[\begin{align*} \delta_j^L &= \frac{\p C}{\p z_j^L}\\ &= \sum_k\frac{\p C}{\p a_k^L}\frac{\p a_k^L}{\p z_j^L}\\ &= \frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}\\ &= \frac{\p C}{\p a_j^L}\sigma'(z_j^L) \end{align*}\]Between the first and second steps, we have to sum over all of the output layer’s activations since the cost function depends on all of them. Between the second and third steps, we use the fact that the pre-activation $z_j^L$ only feeds its corresponding activation $a_j^L$: any other $a_k^L$ with $k\neq j$ is not a function of $z_j^L$, so all of the other terms in the sum vanish. Now we have an equation telling us the error in the last layer.
\[\begin{equation} \delta_j^L = \frac{\p C}{\p a_j^L}\sigma'(z_j^L) \end{equation}\]and its vectorized counterpart
\[\begin{equation} \delta^L = \nabla_{a^L}C \odot \sigma'(z^L) \end{equation}\]where $\odot$ is the Hadamard product or element-wise multiplication. Intuitively, this equation follows from the derivation: to get to the pre-activation at the last layer, we have to move the gradient backwards through the cost function and then again backwards through the activation of the last layer.
For the first backpropagation equation, we apply the definition of the error, but move back only to the output layer. To get to the pre-activation $z_j^L$, we start at the cost function $\frac{\p C}{\p a_j^L}$ and through the corresponding activation $\frac{\p a_j^L}{\p z_j^L}$ to get the total gradient $\frac{\p C}{\p a_j^L}\frac{\p a_j^L}{\p z_j^L}=\frac{\p C}{\p a_j^L}\sigma’(z_j^L)=\delta_j^L$.
Now we could go right into computing the weights and biases from here, but let’s first figure out a way to propagate this error from layer to layer and then come up with a way to compute the derivative of the cost function with respect to the weights and biases of any layer, including the last one. So we’re looking to propagate the error $\delta^{l+1}$ from a particular layer $(l+1)$ to the previous layer $l$. Specifically, we want to write the error in the previous layer $\delta^l$ in terms of the error of the next layer $\delta^{l+1}$. As we did before, we can start with the definition of $\delta^l$ and judiciously apply the chain rule.
\[\begin{align*} \delta_k^l &= \frac{\p C}{\p z_k^l}\\ &= \sum_j \frac{\p C}{\p z_j^{l+1}}\frac{\p z_j^{l+1}}{\p z_k^l}\\ &= \sum_j \delta_j^{l+1}\frac{\p z_j^{l+1}}{\p z_k^l} \end{align*}\]Between the second and third steps, we substituted back the definition of $\delta_j^{l+1}=\frac{\p C}{\p z_j^{l+1}}$ just using $k\to j$ and $l\to (l+1)$ from the original definition (both are free indices). Now we have $\delta^l$ in terms of $\delta^{l+1}$! The last remaining thing to expand is $\frac{\p z_j^{l+1}}{\p z_k^l}$.
\[\begin{align*} \frac{\p z_j^{l+1}}{\p z_k^l} &= \frac{\p}{\p z_k^l}z_j^{l+1}\\ &= \frac{\p}{\p z_k^l}\bigg[\sum_p W_{jp}^{l+1}a_p^l + b_j^{l+1}\bigg]\\ &= \frac{\p}{\p z_k^l}\bigg[\sum_p W_{jp}^{l+1}\sigma(z_p^l) + b_j^{l+1}\bigg]\\ &= \frac{\p}{\p z_k^l}\sum_p W_{jp}^{l+1}\sigma(z_p^l)\\ &= \frac{\p}{\p z_k^l} W_{jk}^{l+1}\sigma(z_k^l)\\ &= W_{jk}^{l+1}\frac{\p}{\p z_k^l} \sigma(z_k^l)\\ &= W_{jk}^{l+1}\sigma'(z_k^l)\\ \end{align*}\]This derivation is more involved. In the second line, we expand out $z_j^{l+1}$ using its definition; note that we use $p$ as the dummy index to avoid any confusion. In the fourth line, we drop $b_j^{l+1}$ since it’s not a function of $z_k^l$. Going to the fifth line, similar to the reasoning earlier, the only term in the sum that is a function of $z_k^l$ is the one with $p=k$, so we cancel all of the other terms. Then we differentiate as usual. We can take this result and plug it back into the original equation.
\[\begin{equation} \delta_k^l = \sum_j W_{jk}^{l+1}\delta_j^{l+1}\sigma'(z_k^l) \end{equation}\]To get the vectorized form, note that we have to transpose the weight matrix since we’re summing over the rows instead of the columns; also note that the last term is not a function of the summation index $j$, so it factors out of the sum and becomes a Hadamard product.
\[\begin{equation} \delta^l = (W^{l+1})^{T}\delta^{l+1}\odot\sigma'(z^l) \end{equation}\]This is why we intentionally ordered the terms in the multiplication this way: to better show how it translates into a matrix product and why we use the transpose of the weight matrix.
For the second backpropagation equation, we assume we’ve already computed the error at some layer $(l + 1)$ and try to propagate it back to layer $l$. We can always apply this to the last and second-to-last layer anyways. Starting from $\delta_j^{l+1}$, to get to $\delta_k^l$, we need to move backwards through the weight matrix and through the activation. In the forward pass, we compute the pre-activation of a neuron using the weighted sum of all of the previous layer’s activations, so to compute the gradient, we need the sum of the next layer’s errors, weighted by the transpose of the weight matrix (consider the dimensions), which explains the $\sum_j W_{jk}^{l+1}\delta_j^{l+1}$ part. Then we move backwards through the activation function, which explains the $\sigma'(z_k^l)$ term.
This has an incredibly intuitive explanation: since the weight matrix propagates inputs forward, the transpose of the weight matrix propagates errors backwards, specifically the error in the next layer $\delta^{l+1}$ to the current layer. Another way to think about it is in terms of the dimensions of the matrix: the weight matrix multiplies against the number of neurons in the previous layer to produce the number of neurons in the next layer, so the transpose of the weight matrix multiplies against the number of neurons in the next layer and produces the number of neurons in the previous layer. After the weight matrix multiplication, we have to Hadamard with the derivative of the activation function to move the error backward through the activation to the pre-activation.
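The dimension argument can be made concrete with a quick shape check. A sketch with hypothetical layer sizes (3 neurons in layer $l$ feeding 2 neurons in layer $l+1$; all values random since only the shapes matter here):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical layers: layer l has 3 neurons, layer l+1 has 2
W_next = rng.standard_normal((2, 3))      # W^{l+1}: maps 3 activations -> 2 pre-activations
delta_next = rng.standard_normal((2, 1))  # error at layer l+1, one entry per neuron there
sigma_prime_z = rng.random((3, 1))        # sigma'(z^l), one entry per neuron in layer l

# forward: W_next @ a has shape (2, 1)
# backward: the transpose maps the error back to layer l's shape, then we Hadamard
delta = (W_next.T @ delta_next) * sigma_prime_z  # shape (3, 1)
```

The transpose is what makes the shapes line up: $(3,2)\cdot(2,1)$ lands back on the 3 neurons of layer $l$.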
We’re almost done! The last two things we need are the actual derivatives of the cost function with respect to the weights and biases. Fortunately, they can be easily expressed in terms of the error $\delta_j^l$. Let’s start with the bias since it’s easier. This time, we can start with what we’re aiming for and then decompose it in terms of the error.
\[\begin{align*} \frac{\p C}{\p b_j^l} &= \sum_k\frac{\p C}{\p z_k^l}\frac{\p z_k^l}{\p b_j^l}\\ &= \frac{\p C}{\p z_j^l}\frac{\p z_j^l}{\p b_j^l}\\ &= \delta_j^l\frac{\p z_j^l}{\p b_j^l}\\ &= \delta_j^l\frac{\p}{\p b_j^l}\Big(\sum_k W_{jk}^l a_k^{l-1} + b_j^l\Big)\\ &= \delta_j^l \end{align*}\]In the first step, we use the chain rule to expand the left-hand side. As in the previous derivations, all but one term in the sum cancel. Then we plug in the definition of the error and differentiate.
\[\begin{equation} \frac{\p C}{\p b_j^l} = \delta_j^l \end{equation}\]The vectorized version looks almost identical!
\[\begin{equation} \nabla_{b^l}C = \delta^l \end{equation}\]Note that if we had defined the error as the gradient of the cost function with respect to the activation, we’d have to take an extra term moving it across the pre-activation.
Remember that one way to interpret the bias is being a “weight” whose input is always $+1$. Similar to the second backpropagation equation, we’ll assume we’ve computed $\delta_j^l$. To get to the bias $b_j^l$, we don’t have to do anything extra since the input term is simply $+1$.
Turns out the derivative of the cost function with respect to the bias is exactly equal to the error! Convenient that it worked out this way!
Now we just need the corresponding derivative for the weights. It’ll follow almost the same pattern.
\[\begin{align*} \frac{\p C}{\p W_{jk}^l} &= \sum_q\frac{\p C}{\p z_q^l}\frac{\p z_q^l}{\p W_{jk}^l}\\ &= \frac{\p C}{\p z_j^l}\frac{\p z_j^l}{\p W_{jk}^l}\\ &= \delta_j^l\frac{\p z_j^l}{\p W_{jk}^l}\\ &= \delta_j^l\frac{\p}{\p W_{jk}^l}\Big(\sum_p W_{jp}^l a_p^{l-1} + b_j^l\Big)\\ &= \delta_j^l\frac{\p}{\p W_{jk}^l}W_{jk}^l a_k^{l-1}\\ &= \delta_j^l a_k^{l-1} \end{align*}\]Be careful with the indices! In the first step, we use a dummy index $q$ to avoid confusing indices. The only nonzero term in the sum is the one with $z_j^l$; remember that the second index of the weight matrix is summed over, so it’s the first index that lets us cancel the other terms. Then we expand the pre-activation out using another dummy index, apply the same reasoning to cancel the other terms in that sum, and differentiate.
\[\begin{equation} \frac{\p C}{\p W_{jk}^l} = \delta_j^l a_k^{l-1} \end{equation}\]Note that all indices are balanced on both sides of the equation so we haven’t made any obvious mistake in the calculation.
Like the previous two backpropagation equations, we’ll assume we’ve computed $\delta_j^l$. To get to the weight between two arbitrary neurons $W_{jk}^l$, the two terms involved are the error $\delta_j^l$ which is the error at the $j$th neuron and the activation of the $k$th neuron that it connects to.
The intuitive explanation for this is that $a_k^{l-1}$ is the “input” to a neuron through a weight and $\delta_j^l$ is the “output” error; this says the change in the cost function as a result of a change in the weight is the product of the activation going “into” the weight and the resulting error “output”. The vectorized version uses the outer product since, for vectors $x$ and $y$, $M_{ij}=x_i y_j \leftrightarrow M=xy^T$.
\[\begin{equation} \nabla_{W^l}C = \delta^l (a^{l-1})^{T} \end{equation}\]That’s the last equation we need for a full backpropagation solution! Let’s see them all in one place here, both in element and vectorized form!
\[\begin{align*} \delta_j^l &\equiv \frac{\p C}{\p z_j^l} & \delta^l &\equiv \nabla_{z^l} C\\ \delta_j^L &= \frac{\p C}{\p a_j^L}\sigma'(z_j^L) & \delta^L &= \nabla_{a^L}C \odot \sigma'(z^L)\\ \delta_k^l &= \sum_j W_{jk}^{l+1}\delta_j^{l+1}\sigma'(z_k^l) & \delta^l &= (W^{l+1})^{T}\delta^{l+1}\odot\sigma'(z^l)\\ \frac{\p C}{\p b_j^l} &= \delta_j^l & \nabla_{b^l}C &= \delta^l\\ \frac{\p C}{\p W_{jk}^l} &= \delta_j^l a_k^{l-1} & \nabla_{W^l}C &= \delta^l (a^{l-1})^{T}\\ \end{align*}\]With this set of equations, we can train any artificial neural network on any set of data! Take a second to prod at what happens in various situations, such as when $\sigma'(\cdot)\approx 0$. This should help give some insight into how quickly or efficiently training can happen, for example. There are some other insights we can gain from analyzing these equations further but that’s a bit tangential to this current discussion and best saved for when we encounter problems (“seeing is believing”).
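As a sanity check, the four equations translate almost line-for-line into NumPy. Here's a sketch for a hypothetical 3-4-2 network with the quadratic cost (the helper names `backprop` and `cost` are mine), verified against a central finite difference on one weight:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# hypothetical network: 3 inputs -> 4 hidden -> 2 outputs
sizes = [3, 4, 2]
Ws = [rng.standard_normal((j, k)) for j, k in zip(sizes[1:], sizes[:-1])]
bs = [rng.standard_normal((j, 1)) for j in sizes[1:]]
x, y = rng.standard_normal((3, 1)), np.array([[1.], [0.]])

def backprop(Ws, bs, x, y):
    # forward pass, caching each z^l and a^l
    a, As, Zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        Zs.append(z)
        As.append(a)
    # error at the output layer (quadratic cost: dC/da^L = a^L - y)
    delta = (As[-1] - y) * sigmoid_prime(Zs[-1])
    nabla_W, nabla_b = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        nabla_b[l] = delta            # dC/db^l = delta^l
        nabla_W[l] = delta @ As[l].T  # dC/dW^l = delta^l (a^{l-1})^T
        if l > 0:                     # propagate the error back a layer
            delta = (Ws[l].T @ delta) * sigmoid_prime(Zs[l - 1])
    return nabla_W, nabla_b

nabla_W, nabla_b = backprop(Ws, bs, x, y)

# finite-difference check on a single weight
def cost(Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

eps = 1e-6
Ws[0][1, 2] += eps
c_plus = cost(Ws, bs)
Ws[0][1, 2] -= 2 * eps
c_minus = cost(Ws, bs)
Ws[0][1, 2] += eps  # restore
numeric = (c_plus - c_minus) / (2 * eps)
```

The analytic gradient `nabla_W[0][1, 2]` agrees with `numeric` to several decimal places, which is exactly the kind of check worth running whenever you implement backpropagation by hand.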
Now we can describe the entire backpropagation algorithm in the context of stochastic gradient descent (SGD).
This algorithm follows suit from the previous SGD training loop we wrote except now we’re computing an intermediate quantity (the error $\delta^l$), and have more complicated update equations.
We’ve derived the equations for backpropagation so we’re ready to implement and train a general artificial neural network in Python! But before we dive into the code, note that our dataset is going to be different from the Iris dataset. I want to highlight how general ANNs can solve more complex problems than single neurons can, so the dataset is going to be more complicated.
We’ll be training on a famous dataset called the MNIST Handwritten Digits dataset. As the name implies, it’s a dataset of handwritten digits 0-9 represented as grayscale images. Each image is $28\times 28$ pixels and the true label is a digit 0-9. It’s always a good idea to look at the raw data of an unfamiliar dataset so that we understand what the inputs correspond to in the real world.
MNIST Handwritten Digits Dataset contains tens of thousands of handwritten digits from 0-9. We can plot some example data from the training set in a grid.
Now that we’ve seen some data, we can start writing the data pre-processing step. In practice, this data pipeline is often more important than the exact model or network architecture. Running poorly-processed data through even a state-of-the-art model will produce poor results. To start, we’re going to use the Pytorch machine learning framework to load the training and testing data. There are several possible representations for a grayscale image pixel, but the most common are (i) an integer value in $[0, 255]$ or (ii) a floating-point value in $[0, 1]$. We’re going to use the latter since it plays more nicely, numerically, with the floating-point parameters of our model (and the sigmoid activation).
import numpy as np
from torchvision import datasets
from matplotlib import pyplot as plt
# load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True)
test_dataset = datasets.MNIST('./data', train=False, download=True)
X_train = train_dataset.data.numpy()
X_test = test_dataset.data.numpy()
# normalize training data to [0, 1]
X_train, X_test = X_train / 255., X_test / 255.
We can print the “shape” of this data with X_train.shape
. The first dimension represents the number of examples (either training or test) and the remaining dimensions represent the data. In this case, for the MNIST training set, we have 60,000 examples and the images are all $28\times 28$ pixels, so our training data is a multidimensional array of shape $(60000, 28, 28)$. The test set contains 10,000 examples for evaluation. But our neural network accepts a vector of input neurons, not a 2D image. An easy way to reconcile this is to flatten each $28\times 28$ image into a single vector of $28\times 28=784$ numbers. This changes the shape of the training data to $(60000, 784)$, but we’ll need to add an extra trailing dimension to make the matrix/vector maths work out, so we want the resulting shape to be $(60000, 784, 1)$ where the last dimension just means that one set of 784 numbers corresponds to 1 input example.
# flatten image into 1d array
X_train, X_test = X_train.reshape(X_train.shape[0], -1), X_test.reshape(X_test.shape[0], -1)
# add extra trailing dimension for proper matrix/vector sizes
X_train, X_test = X_train[..., np.newaxis], X_test[..., np.newaxis]
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
So that handles the input data, but what about the output data? Remember the output is a label from 0-9. We could just leave the label alone but there are problems with this numbering. For example, if we were to take an average across a set of output data, we’d end up with a value corresponding to a different output: the average of 0 and 4 is 2. This relation doesn’t really make sense and arises from treating a categorical label as an ordinal value: an integer between 0-9. We’d rather have each possible output “stretch” out into its own dimension so we can operate on a particular output or set of outputs independently without inadvertently considering all outputs. One way to do this is to literally put each output into its own dimension. This is called a one-hot encoding where we create an $n$-dimensional vector, where $n$ is the number of possible output classes, and put a 1 at the index corresponding to the class: so the digit 2 would be mapped to the vector $\begin{bmatrix}0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\end{bmatrix}^T$. As with the input data, we’ll expand the last dimension for the same reasons.
def to_onehot(y):
    """
    Convert index to one-hot representation
    """
    one_hot = np.zeros((y.shape[0], 10))
    one_hot[np.arange(y.shape[0]), y] = 1
    return one_hot
y_train, y_test = train_dataset.targets.numpy(), test_dataset.targets.numpy()
y_train, y_test = to_onehot(y_train), to_onehot(y_test)
y_train, y_test = y_train[..., np.newaxis], y_test[..., np.newaxis]
print(f"Training target size: {y_train.shape}")
print(f"Test target size: {y_test.shape}")
Now we’re ready to instantiate our neural network class with a list of neurons per layer and train it!
ann = ArtificialNeuralNetwork(layer_sizes=[784, 32, 10])
training_params = {
    'num_epochs': 30,
    'minibatch_size': 16,
    'cost': QuadraticCost,
    'learning_rate': 3.0,
}
print(f'Training params: {training_params}')
ann.train(X_train, y_train, X_test, y_test, **training_params)
There are a few parameters that haven’t been explained yet, but we’ll get to them. Even before the class definition, let’s define the activation and cost functions and their derivatives.
class Sigmoid:
    @staticmethod
    def forward(z):
        return 1. / (1. + np.exp(-z))

    @staticmethod
    def backward(z):
        return Sigmoid.forward(z) * (1 - Sigmoid.forward(z))

class QuadraticCost:
    @staticmethod
    def forward(a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2

    @staticmethod
    def backward(a, y):
        return a - y
The forward pass computes the output based on the input and the backward pass computes the gradient. Note that the forward pass of the quadratic cost computes a vector norm since the inputs are 10-dimensional vectors and the cost function generally outputs a scalar. Now we can define the class and constructor. For the most part, we’ll just copy over the input parameters as well as initialize the weights and biases.
class ArtificialNeuralNetwork:
    def __init__(self, layer_sizes: list[int], activation_fn=Sigmoid):
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.activation_fn = activation_fn
        # use a unit normal distribution to initialize weights and biases;
        # performs better in practice than initializing to zeros
        # note that weights map neuron k in layer [i-1] to neuron j in layer [i]
        self.weights = [np.random.randn(j, k)
                        for j, k in zip(layer_sizes[1:], layer_sizes[:-1])]
        # since the first layer is an input layer, we don't have biases for it
        self.biases = [np.random.randn(j, 1) for j in layer_sizes[1:]]
Notice that we’re initializing the weights and biases with a standard normal distribution rather than with zeros. This is to intentionally create asymmetry in the neurons so that they learn independently! The next function to implement is the training function. This follows from the previous ones we’ve written where we iterate over the number of epochs and then create minibatches and iterate over those.
    def train(self, X_train, y_train, X_test, y_test, **kwargs):
        num_epochs = kwargs['num_epochs']
        self.minibatch_size = kwargs['minibatch_size']
        self.cost = kwargs['cost']
        self.learning_rate = kwargs['learning_rate']
        for epoch in range(num_epochs):
            # shuffle data each epoch
            permute_idxes = np.random.permutation(X_train.shape[0])
            X_train = X_train[permute_idxes]
            y_train = y_train[permute_idxes]
            epoch_cost = 0
            for start in range(0, X_train.shape[0], self.minibatch_size):
                minibatch_cost = 0
                # partition dataset into minibatches
                Xs = X_train[start:start+self.minibatch_size]
                ys = y_train[start:start+self.minibatch_size]
                self._zero_grad()
                for x_i, y_i in zip(Xs, ys):
                    a = self.forward(x_i)
                    d_nabla_W, d_nabla_b = self._backward(y_i)
                    self._accumulate_grad(d_nabla_W, d_nabla_b)
                    minibatch_cost += self.cost.forward(a, y_i)
                self._step()
                minibatch_cost = minibatch_cost / self.minibatch_size
                epoch_cost += minibatch_cost
            test_set_num_correct = self.num_correct(X_test, y_test)
            test_set_accuracy = test_set_num_correct / X_test.shape[0]
            print(f"Epoch {epoch+1}:"
                  f"\tLoss: {epoch_cost:.2f}"
                  f"\ttest set acc: {test_set_accuracy*100:.2f}%"
                  f" ({test_set_num_correct} / {X_test.shape[0]})")
There are a lot of functions here that we haven’t defined yet. The outer loop iterates over the epochs; within it, we create minibatches and iterate over those. At the start of each minibatch, we zero out any accumulated gradient since we’ll be performing one gradient descent update per minibatch. In the innermost loop, for each individual training example, notice that we do a forward pass and a backward pass that computes the gradients of the weights and biases. We accumulate these gradients over the minibatch and then call self._step() to perform one step of gradient descent optimization that updates all of the model parameters. At the end of each epoch, we compute the accuracy on the test set. (There’s a better way to measure incremental progress using something called a validation set.)
Going from top to bottom, the first function we encounter is self._zero_grad(), called at the beginning of the minibatch loop. For stochastic gradient descent, we accumulate the gradient over the minibatch and perform a single parameter update from that accumulated gradient, so we need this function to zero out the accumulated gradient before each new minibatch.
    def _zero_grad(self):
        self.nabla_W = [np.zeros(W.shape) for W in self.weights]
        self.nabla_b = [np.zeros(b.shape) for b in self.biases]
We’re going to skip over the forward and backward passes for now and look at self._accumulate_grad(d_nabla_W, d_nabla_b), which folds the gradient for a single training example into the total accumulated gradient across the minibatch.
    def _accumulate_grad(self, d_nabla_W, d_nabla_b):
        self.nabla_W = [nw + dnw for nw, dnw in zip(self.nabla_W, d_nabla_W)]
        self.nabla_b = [nb + dnb for nb, dnb in zip(self.nabla_b, d_nabla_b)]
The last function self._step()
applies one step of gradient descent optimization and updates all of the weights and biases from the averaged accumulated gradient.
    def _step(self):
        self.weights = [w - (self.learning_rate / self.minibatch_size) * nw
                        for w, nw in zip(self.weights, self.nabla_W)]
        self.biases = [b - (self.learning_rate / self.minibatch_size) * nb
                       for b, nb in zip(self.biases, self.nabla_b)]
Those functions operate on the gradients, weights, and biases and perform the simpler calculations. The crux of this class lies in the forward and backward pass functions. For the forward pass, we define the first activation as the input and iterate through the layers, applying the corresponding weights, biases, and activation functions. Along the way, we cache the activations and pre-activations for use in the backward pass.
    def forward(self, a):
        self.activations = [a]
        self.zs = []
        for W, b in zip(self.weights, self.biases):
            z = np.dot(W, a) + b
            self.zs.append(z)
            a = self.activation_fn.forward(z)
            self.activations.append(a)
        return a
The backward pass simply implements the backpropagation equations we derived earlier. The only subtlety is that we apply the derivatives of the cost and activation functions at the very last layer first and then move backwards. We exploit Python’s negative indexing so that index -1 is the last layer, -2 is the second-to-last layer, and so on.
    def _backward(self, y):
        nabla_W = [np.zeros(W.shape) for W in self.weights]
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        z = self.zs[-1]
        a_L = self.activations[-1]
        delta = self.cost.backward(a_L, y) * self.activation_fn.backward(z)
        a = self.activations[-2]
        nabla_W[-1] = np.dot(delta, a.T)
        nabla_b[-1] = delta
        for l in range(2, self.num_layers):
            z = self.zs[-l]
            W = self.weights[-l+1]
            delta = np.dot(W.T, delta) * self.activation_fn.backward(z)
            a = self.activations[-l-1]
            nabla_W[-l] = np.dot(delta, a.T)
            nabla_b[-l] = delta
        return nabla_W, nabla_b
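Backpropagation code is notoriously easy to get subtly wrong, so a common sanity check (not part of the original code above) is to compare the analytic gradient against a finite-difference estimate on a tiny network. This is a standalone sketch using a made-up 2-3-1 architecture, quadratic cost, and the same backprop equations as the class above; biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# made-up tiny 2-3-1 network, quadratic cost, biases omitted for brevity
sizes = [2, 3, 1]
Ws = [rng.normal(size=(j, k)) for j, k in zip(sizes[1:], sizes[:-1])]
x = rng.normal(size=(2, 1))
y = np.array([[1.0]])

def total_cost(Ws):
    a = x
    for W in Ws:
        a = sigmoid(W @ a)
    return 0.5 * np.linalg.norm(a - y) ** 2

# forward pass, caching activations and pre-activations
activations, zs, a = [x], [], x
for W in Ws:
    z = W @ a
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# backward pass using the same equations as in the text
sprime = lambda z: sigmoid(z) * (1 - sigmoid(z))
delta = (a - y) * sprime(zs[-1])
nabla_W = [None, None]
nabla_W[-1] = delta @ activations[-2].T
delta = (Ws[-1].T @ delta) * sprime(zs[-2])
nabla_W[-2] = delta @ activations[-3].T

# finite-difference check of one weight entry
eps = 1e-6
Ws_pert = [W.copy() for W in Ws]
Ws_pert[0][1, 0] += eps
up = total_cost(Ws_pert)
Ws_pert[0][1, 0] -= 2 * eps
down = total_cost(Ws_pert)
numeric = (up - down) / (2 * eps)
print(abs(numeric - nabla_W[0][1, 0]))  # tiny; backprop agrees
```

If the printed difference isn’t close to zero, the backward pass has a bug; this check catches index and transpose mistakes quickly.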
Finally, we have an evaluation function that computes the number of correct examples. We run the input through the network and take the index of the largest activation of the output layer and compare it against the index of the one in the one-hot encoding of the label vectors.
    def num_correct(self, X, Y):
        results = [(np.argmax(self.forward(x)), np.argmax(y))
                   for x, y in zip(X, Y)]
        return sum(int(x == y) for (x, y) in results)
And that’s it! We can run the code and train our neural network and see the output! Even with our simple neural network we can get to >95% accuracy on the test set! Try messing around with the other input parameters!
The full code listing can be found here.
We did a lot in this article! We started off with our modern neuron model and extended it into layers to support multi-layer neural networks. We defined notation to perform a forward pass that propagates the inputs all the way to the last layer. Then we learned how to automatically compute the gradient across all weights and biases using the backpropagation algorithm. We demonstrated the concept with a computation graph, derived the equations needed to backpropagate the gradient, and finally coded a neural network in Python and numpy and trained it on the MNIST handwritten digit dataset.
We now have a functioning neural network written in numpy! We’re able to get pretty good accuracy on the MNIST data as well. However, this dataset has been around for decades; is that really the best we can do? This is a good start, but next we’re going to learn how to make our neural networks even better with some modern training techniques 🙂
In this post, we’ll learn how to automatically solve for the parameters of a simple neural model. In doing so, we’ll make a number of modifications that will evolve the perceptron into a more modern artificial neuron that we can use as a building block for wider and deeper neural networks. Similar to last time, we’ll implement this new artificial neuron using Python and numpy.
In the past, we solved for the weights and biases by inspection. This was feasible since logic gates are human-interpretable and the number of parameters was small. Now consider trying to come up with a network that detects a mug in an image. This is a much more complicated task that requires understanding what a “mug” even is. What do the weights correspond to, and how would we set them manually? In practice, we have very wide and deep networks with hundreds of thousands or even millions or billions of parameters, and we need a way to find appropriate values for these to solve our objective, e.g., emulating logic gates or detecting mugs in images.
Fortunately for us, there already exists a field of mathematics that’s been around for a long time that specializes in solving for these parameters: numerical optimization (or just optimization for short). In an optimization problem, we have a mathematical model, often represented as a function like $f(x; \theta)$ with some inputs $x$ and parameters $\theta$, and the goal is to find the values of the parameters such that some objective function $C$ satisfies some criteria. In most cases, we’re minimizing or maximizing it; in the former case, this objective function is sometimes called a cost/loss/error function (all are interchangeable), and in the latter case it is sometimes called a utility/reward function (all are interchangeable). There’s already a vast literature of numerical optimization techniques to draw from so we should try to leverage these rather than building something from scratch.
More specifically, in the case of our problem, we have a neural network represented by a function $f(x;W,b)$ that accepts input $x$ and is parameterized by its weights $W$ and biases $b$. But to use the framework of numerical optimization and its techniques, we need an objective function. In other words, how do we quantify what a “good” model is? In our classification task, we want to ensure that the output of the model is the same as the desired training label for all training examples so we can intuitively think of this as trying to minimize the mistakes of our model output from the true training label.
\[C = \displaystyle\sum_i \vert f(x_i;W,b) - y_i\vert\]Notice we used the absolute value since any difference will increase our cost; verify for yourself (for a single training example) that when the output of the model is different from $y_i$, $C > 0$, and when the output is the same as $y_i$, $C = 0$. This cost function is also sometimes called mean absolute error (MAE). (Sometimes we’ll see a $\frac{1}{N}$, where $N$ is the number of training examples, in front of the sum, but this is just a constant factor that makes the maths easier, so we can omit it without any issue.) We only get $C=0$ if, for every training example, $f(x_i;W,b) = y_i$, i.e., our model always classifies correctly. Now we have our model and our cost function, so we can try to figure out which optimization approach is well-suited for our problem.
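As a tiny concrete check of this cost function (the predictions and labels below are made-up toy values, not from any dataset):

```python
import numpy as np

# made-up toy predictions and labels for illustration
preds = np.array([1., 0., 1., 1.])
labels = np.array([1., 0., 0., 1.])

# MAE-style cost: only the one mismatched example contributes
C = np.sum(np.abs(preds - labels))
print(C)  # 1.0
```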
One bifurcation of the numerical optimization field is into gradient-free and gradient-based methods. Recall from calculus that a gradient measures the rate of change of a function with respect to all of its inputs. This is extra information beyond the objective function itself that, if we choose to use it, will have to be computed and maintained. So the former set of methods describes approaches where we don’t need this extra information and rely on just the values of the objective function itself; the latter describes methods where we do use it. In practice, gradient-based methods tend to work better for neural networks since they tend to converge to a better solution, i.e., they more quickly find the set of parameters with lower cost, but it should be noted there are techniques that optimize neural networks using gradient-free approaches as well.
The idea behind gradient-based methods is to compute a partial derivative of the cost function with respect to each parameter $\frac{\p C}{\p \theta_i}$ of the model. From calculus, we can arrange these partial derivatives in a vector called the gradient $\nabla_\theta C$. This quantity tells us how changes in a parameter $\theta_i$ correspond to changes in the cost function $C$; specifically, it tells us how to change $\theta_i$ to increase the value of $C$. Mathematically speaking, if we have a function of a single variable $C(\theta)$ and a little change in its inputs $\Delta\theta$, then $C(\theta + \Delta\theta)\approx C(\theta)+\frac{\p C}{\p\theta}\Delta\theta$; in other words, a little change in the input is mapped to a little change in the output, but proportional to how the cost function changes with respect to that little input: $\frac{\p C}{\p \theta}$. This is very useful because it can tell us in which direction to move $\theta_i$ such that the value of $C$ decreases, i.e., $-\frac{\p C}{\p \theta_i}$. Remember that in our ideal case, we want $C=0$ (we minimize cost functions and maximize reward functions), and the negative of the partial derivatives tell us exactly how to accomplish this. With this information, we can nudge the parameters $\theta$ using the gradient of the cost function.
\[\theta_i\gets\theta_i - \eta\displaystyle\frac{\p C}{\p \theta_i}\]or in vectorized form
\[\theta\gets\theta - \eta\nabla_\theta C\]Just like with perceptrons, we’ll have a learning rate $\eta$ that is a tuning parameter that tells us how much to adjust the current $\theta$ by. If we do this, we can find the values of the parameters such that the cost function is minimized! This optimization technique is called gradient descent.
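Here’s a minimal sketch of that update rule on a one-parameter toy cost $C(\theta) = (\theta - 3)^2$ (a made-up function chosen only for illustration; its gradient is $2(\theta - 3)$ and its minimum is at $\theta = 3$):

```python
# gradient descent on the toy cost C(theta) = (theta - 3)^2
theta = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)   # dC/dtheta
    theta = theta - eta * grad
print(theta)  # converges to ~3.0, the minimizer
```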
(Note that I’ll be a bit sloppy with my nomenclature and interchangeably say “partial derivative” and “gradient”, but just remember the definition of the gradient of a function: the vector of all partial derivatives of the function with respect to each parameter.)
One intuitive way to visualize gradient descent is to think of $C$ as an “elevation”, like on a topographic map, where the objective is to find the single lowest valley. Mathematically, we’re trying to find the global minima of the cost function. If we could analytically and tractably compute $C$ exactly with respect to all parameters and the entire dataset, then we could just use calculus to solve for the global minima and be done! However, the complexity of neural networks along with the size of the datasets they’re often trained on makes this approach infeasible.
Suppose we have a very well-behaved cost function $C(x, y) = x^2+y^2$ with a single global minima. The idea behind gradient descent is to start at some random point $(x, y)$, e.g., $(5, 5)$ in this example, on this cost surface and incrementally move in a way such that we ultimately arrive at the lowest-possible point. The left figure shows the 3D mesh of the cost function (z axis is the value of the cost function for its x and y axis inputs) as well as the path that gradient descent will take us from the starting point $(5, 5)$ to the global minima at the origin. The right figure shows the same, but a top-down view where the colors represent the value of the cost function.
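The setup in the figure can be reproduced in a few lines (a sketch; the step count and learning rate are arbitrary choices):

```python
import numpy as np

# gradient descent on C(x, y) = x^2 + y^2, whose gradient is (2x, 2y),
# starting at (5, 5); the global minimum is at the origin
point = np.array([5.0, 5.0])
eta = 0.1
for _ in range(50):
    point = point - eta * (2 * point)
print(point)  # very close to [0. 0.]
```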
Instead, imagine we’re at some starting point on the cost surface. Using the negative of the gradient tells us how to move parameters from where we currently are to get to a slightly lower point on the cost surface from where we were. If the cost function is well-behaved, this should decrease our overall cost. We repeatedly do this until we’re at a point on the cost surface where, no matter which direction we nudge our parameters, the cost always increases. This is a minima! Depending on the cost function, we might have multiple local minima which are locally optimal within some bounds of the cost function, but they’re not optimal across the entire cost function; that would be the global minima, which is the best solution.
Another intuitive way to think about this is suppose someone took us hiking and we got lost. All we know is that there is a town in the valley of the mountain but there’s a thick fog so we’re unable to see far out. Rather than picking a random direction to walk in, we can look around (within the visibility of the fog) to see if the elevation goes downhill from where we currently are, and then move in that direction. While we’re moving, we’re constantly evaluating which direction would bring us downhill. We repeat until, no matter which direction we look, we’re always going uphill.
Let’s apply what we’ve learned so far to the same Iris dataset example we did last time! Let’s try to train our perceptron using gradient descent. We’ll use the cost function above and analytically compute the gradients to update the weights. However, we’ll run into an immediate problem: the Heaviside step function we’re using as an activation function. Recall its definition:
\[f(\theta)=\begin{cases} 1 & \theta \geq 0 \\ 0 & \theta < 0 \\ \end{cases}\]We’ll be computing a gradient, and this step function has a discontinuity at 0. That alone isn’t a huge issue; the larger issue is that the gradient will be 0 everywhere else, since the output of the step function is piecewise constant and the derivative of a constant is always 0. We’ll get no parameter update from gradient descent, and our model won’t learn a thing! So this choice of activation function isn’t going to work; we need an activation function with a useful gradient.
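We can see this numerically with a central finite difference (a quick sketch; the sample points are chosen away from the discontinuity at 0):

```python
import numpy as np

def heaviside(z):
    return np.where(z >= 0, 1.0, 0.0)

# the step function is flat on either side of 0, so the numerical
# derivative is exactly 0 at every point away from the discontinuity
eps = 1e-6
for z in [-2.0, -0.5, 0.5, 2.0]:
    d = (heaviside(z + eps) - heaviside(z - eps)) / (2 * eps)
    print(z, d)  # d is 0.0 at each of these points
```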
Rather than the step function, we can try to pick a differentiable function that looks just like it. Fortunately for us, there exists a whole class of functions called logistic functions that closely resemble this step function. One specific logistic curve is called the sigmoid.
\[\sigma(z) \equiv \frac{1}{1+e^{-z}}\]The function itself actually even looks like a smooth version of the step function!
The sigmoid (right) can be considered a smooth version of the Heaviside step function (left), so it can be differentiated an infinite number of times. Both map their unbounded input to a bounded output, but the nuance is that the step function’s bound is inclusive, $[0, 1]$, while the sigmoid’s is exclusive, $(0, 1)$, because of the asymptotes.
Note that if the input $z$ is very large and positive, then the sigmoid function asymptotes/saturates to $1$, and if the input is very large and negative, the sigmoid function asymptotes to $0$. In other words, it maps the unbounded real number line $(-\infty, \infty)$ to the bounded interval $(0, 1)$. The sigmoid is smooth in that we can take its derivative, and little changes in the input map to little changes in the output. In fact, the derivative of the sigmoid can be expressed in terms of the sigmoid itself (thanks to the properties of $e$ in its definition!)
\[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]It’s a good exercise to verify this for yourself! (Hint: rewrite $\sigma(z) = (1+e^{-z})^{-1}$ and use the power rule.)
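Here’s a quick numerical spot check of that identity against a central finite difference (a sketch, not part of the original code):

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# compare the identity sigma'(z) = sigma(z)(1 - sigma(z))
# against a central-difference approximation
z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # tiny; the identity holds
```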
Now let’s replace our step function with the sigmoid so we do end up with nonzero derivatives. Remember we’re trying to compute the gradient of the cost function with respect to the two weights $w_1$ and $w_2$ and the bias $b$. Substituting and expanding the cost function for a single training example, we get the following.
\[C = \vert \sigma(w_1 x_1 + w_2 x_2 + b) - y\vert\]Let’s start with computing $\frac{\p C}{\p w_1}$; the other derivatives will follow. We’ll need to make liberal use of the chain rule; the way I remember it is “derivative of the outside with respect to the inside, times the derivative of the inside”: $\frac{\d}{\d x}f(g(x)) = f’(g(x))g’(x)$. We’ll also need to know that the derivative of the absolute value function $f(x)=\vert x\vert$ is the sign function $\sgn(x)$, which returns 1 if the input is positive and -1 if the input is negative, and is mathematically undefined if the input is 0; practically, in this specific example, we can let $\sgn(0) = 0$. (Similar to the Heaviside step function, we can see this by plotting the absolute value function, which looks like a ‘V’, and noting that each side of the ‘V’ has a constant slope of $\pm 1$.)
\[\begin{align*} \displaystyle\frac{\p C}{\p w_1} &= \frac{\p}{\p w_1} \vert \sigma(w_1 x_1 + w_2 x_2 + b) - y\vert \\ &= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \frac{\p}{\p w_1}\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big]\\ &= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \frac{\p}{\p w_1}\sigma(w_1 x_1 + w_2 x_2 + b)\\ &= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)\frac{\p}{\p w_1}\big[w_1 x_1 + w_2 x_2 + b\big]\\ &= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)\frac{\p}{\p w_1}\big[w_1 x_1\big]\\ &= \sgn\big[\sigma(w_1 x_1 + w_2 x_2 + b) - y\big] \sigma'(w_1 x_1 + w_2 x_2 + b)x_1\\ &= \sgn(a - y) \sigma'(z)x_1\\ \end{align*}\]In the last step, we simplified by substituting back $z=w_1 x_1 + w_2 x_2 + b$ and $a=\sigma(z)$. Similarly, the other derivatives follow from this one with only minor changes in the last few steps so we can compute them all.
\[\begin{align*} \displaystyle\frac{\p C}{\p w_1} &= \sgn(a - y) \sigma'(z)x_1 \\ \displaystyle\frac{\p C}{\p w_2} &= \sgn(a - y) \sigma'(z)x_2 \\ \displaystyle\frac{\p C}{\p b} &= \sgn(a - y) \sigma'(z) \\ \end{align*}\]Another way to think about these derivatives that will be useful for implementation in code is expanding out the partials in accordance with the chain rule.
\[\begin{align*} \displaystyle\frac{\p C}{\p w_1} &= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p w_1} \\ \displaystyle\frac{\p C}{\p w_2} &= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p w_2} \\ \displaystyle\frac{\p C}{\p b} &= \displaystyle\frac{\p C}{\p a} \displaystyle\frac{\p a}{\p z}\displaystyle\frac{\p z}{\p b} \\ \end{align*}\]So the first two terms in each of these are the same, and it’s only the last term that we have to actually change. Now that we have these gradients computed analytically, we can get around to writing code!
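Before trusting these derivatives in code, we can sanity-check one of them numerically. This sketch uses arbitrary toy values for the weights, bias, and a single training example (none of them come from the Iris data):

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# arbitrary toy values for a single training example
w1, w2, b = 0.3, -0.7, 0.1
x1, x2, y = 1.5, -2.0, 1.0

z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# analytic gradient from the derivation: sgn(a - y) * sigma'(z) * x1
dw1 = np.sign(a - y) * (a * (1 - a)) * x1

# central-difference approximation of dC/dw1 on C = |sigma(z) - y|
def C(w1_):
    return abs(sigmoid(w1_ * x1 + w2 * x2 + b) - y)

eps = 1e-6
num_dw1 = (C(w1 + eps) - C(w1 - eps)) / (2 * eps)
print(abs(dw1 - num_dw1))  # tiny; the chain-rule result checks out
```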
A sketch of the general training algorithm is going to look like this: for each epoch, run every training example through the network to get an output (a forward pass), compute the cost and its gradients with respect to the weights and bias (a backward pass), average the accumulated gradients over the examples, and apply a gradient descent update to the parameters.
We refer to passing an input through the network to get an output as a forward pass and computing gradients as a backward pass because of the nature of how we perform both computations (starting from the input, through the parameters of the model, to the output; and from the cost function back through the model parameters toward the input). We’ll see the same nomenclature in the literature and in neural network libraries such as Tensorflow and Pytorch.
Let’s first start by defining our cost and activation functions and their derivatives.
import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np

def cost(pred, true):
    return np.abs(pred - true)

def dcost(pred, true):
    return np.sign(pred - true)

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))
Now we can define an ArtificialNeuron
class that trains its weights and bias using the rough sketch of the algorithm above.
class ArtificialNeuron:
    def __init__(self, input_size, learning_rate=0.5, num_epochs=100):
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.W_ = np.zeros(input_size)
        self.b_ = 0

    def train(self, X, y):
        self.costs_ = []
        num_examples = X.shape[0]
        for _ in range(self.num_epochs):
            costs = 0
            dW = np.zeros(self.W_.shape[0])
            db = 0
            for x_i, y_i in zip(X, y):
                # forward pass
                a_i = self._forward(x_i)
                # backward pass
                dW_i, db_i = self._backward(x_i, y_i)
                # accumulate cost and gradient
                costs += cost(a_i, y_i)
                dW += dW_i
                db += db_i
            # average cost and gradients across number of examples
            dW = dW / num_examples
            db = db / num_examples
            costs = costs / num_examples
            # update weights
            self.W_ = self.W_ - self.learning_rate * dW
            self.b_ = self.b_ - self.learning_rate * db
            self.costs_.append(costs)
        return self

    def _forward(self, x):
        # compute and cache intermediate values for the backward pass
        self.z = np.dot(x, self.W_) + self.b_
        self.a = sigmoid(self.z)
        return self.a

    def _backward(self, x, y):
        # compute gradients
        dW = dcost(self.a, y) * dsigmoid(self.z) * x
        db = dcost(self.a, y) * dsigmoid(self.z)
        return dW, db
That’s it! Now we can load the data and use this new neuron model to train a classifier!
# Load the Iris dataset
iris = datasets.load_iris()
data = iris.data
target = iris.target
# Select only the Setosa and Versicolor classes (classes 0 and 1)
setosa_versicolor_mask = (target == 0) | (target == 1)
data = data[setosa_versicolor_mask]
target = target[setosa_versicolor_mask]
# Extract the sepal length and petal length features into a dataset
sepal_length = data[:, 0]
petal_length = data[:, 2]
X = np.vstack([sepal_length, petal_length]).T
# Train the artificial neuron
an = ArtificialNeuron(input_size=2)
an.train(X, target)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Create a scatter plot of values
ax1.scatter(sepal_length[target == 0], petal_length[target == 0], label="Setosa", marker='o')
ax1.scatter(sepal_length[target == 1], petal_length[target == 1], label="Versicolor", marker='x')
# Plot separating line
w1, w2 = an.W_[0], an.W_[1]
b = an.b_
x_values = np.linspace(min(sepal_length), max(sepal_length), 100)
y_values = (-w1 * x_values - b) / w2
ax1.plot(x_values, y_values, label="Separating Line", color="k")
# Set plot labels and legend
ax1.set_xlabel("Sepal Length (cm)")
ax1.set_ylabel("Petal Length (cm)")
ax1.legend(loc='upper right')
ax1.set_title('Artificial Neuron Output')
# Plot neuron cost
ax2.plot(an.costs_, label="Error", color="r")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Cost")
ax2.legend(loc='upper left')
ax2.set_title('Artificial Neuron Cost')
# Show the plot
plt.show()
Using this new activation function and gradient descent, we’re still able to create a line separating the two classes exactly (left). The cost plot on the right shows the overall cost for each iteration of gradient descent; it is monotonically decreasing until it approaches 0. Note that there are many solutions to this specific separation problem (the gap between the two classes is large), so any solution that achieves the lowest cost will work. Intuitively, we can think of this learning problem as having a non-unique global minima, or a “basin” of optimal solutions. This is often not the case with more complex problems.
One criticism is of the cost function itself. Remember that we replaced the step function with the sigmoid because the step function’s gradient was 0 everywhere it was defined. The derivative of the absolute value function is also not continuous everywhere, and, although the gradient does exist, it’s a constant $\pm 1$. Can we come up with a better cost function that provides a more well-behaved, helpful gradient? Similar to what we did with the step function, we can replace the mean absolute error cost function with a smoothed variant whose gradient not only exists everywhere, but provides a better signal on which direction to update the weights. Fortunately for us, such a function exists: the mean squared error (MSE), which just replaces the absolute value with a square.
\[C = \displaystyle\frac{1}{2}\displaystyle\sum_i \big(y_i - a_i\big)^2\]where $a_i$ is the model output for input $x_i$. (Similar to mean absolute error, we’re adding the $\frac{1}{2}$ in front purely for mathematical convenience when computing the gradient; it’s just a constant that we could omit.) This cost function is smooth and has a derivative everywhere.
\[\begin{align*} \displaystyle\frac{\p C}{\p a_i} &= -\big(y_i - a_i\big)\\ \displaystyle\frac{\p C}{\p a} &= -(y - a) \end{align*}\]Practically, smooth cost functions tend to work better since the gradient contains more information to guide gradient descent to the optimal solution. Compare this cost function gradient to the previous one, which just returned $\pm 1$. In the code, we can replace the cost function and its derivative with MSE instead of MAE.
def cost(pred, true):
    return 0.5 * (true - pred) ** 2

def dcost(pred, true):
    return -(true - pred)
Using the new MSE cost function, we can achieve the same optimal result but notice the y-axis scale on the cost plot: it has much smaller values than that of the same plot using MAE as the cost function. This is because MSE produces very small values when true and predicted values are close but very large values when they’re farther apart. In other words, they scale with the magnitude of the difference and are not constant like with MAE.
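To make the scaling concrete, we can compare the two costs and their gradients on a few made-up errors (toy numbers, not from the Iris experiment):

```python
import numpy as np

true = 1.0
for pred in [0.99, 0.5, 0.0]:
    mae = np.abs(pred - true)           # grows linearly with the error
    mse = 0.5 * (true - pred) ** 2      # grows quadratically
    mae_grad = np.sign(pred - true)     # always -1 or +1
    mse_grad = -(true - pred)           # proportional to the error
    print(pred, float(mae), float(mse), float(mae_grad), float(mse_grad))
```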
Another point we should address early on is the efficiency of gradient descent: it requires us to average gradients over all training examples. This might be fine for a few hundred or even a few thousand training examples (depending on your compute) but quickly becomes intractable for datasets larger than that. Rather than averaging over the entire set of training examples, we can perform gradient descent on a minibatch: a smaller, sampled set of data intended to be representative of the entire training set. This is called stochastic gradient descent (SGD). We take the training data, divide it up into minibatches, and run gradient descent with a parameter update per minibatch. An epoch still elapses after all minibatches have been seen; in other words, the union of all minibatches forms the entire training set. While the path to convergence in the cost plot is a bit noisier than with full gradient descent, each iteration is far cheaper since the minibatch size is much smaller. We can update the corresponding code to shuffle and partition our training data into minibatches, iterate over them, and perform a gradient descent update over the current minibatch instead of the entire training set.
class ArtificialNeuron:
    def __init__(self, input_size, learning_rate=0.5, num_epochs=50, minibatch_size=32):
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.minibatch_size = minibatch_size
        self.W_ = np.zeros(input_size)
        self.b_ = 0

    def train(self, X, y):
        self.costs_ = []
        for _ in range(self.num_epochs):
            epoch_cost = 0
            # shuffle data each epoch
            permute_idxes = np.random.permutation(X.shape[0])
            X = X[permute_idxes]
            y = y[permute_idxes]
            for start in range(0, X.shape[0], self.minibatch_size):
                minibatch_cost = 0
                dW = np.zeros(self.W_.shape[0])
                db = 0
                # partition dataset into minibatches
                Xs, ys = X[start:start+self.minibatch_size], y[start:start+self.minibatch_size]
                for x_i, y_i in zip(Xs, ys):
                    # forward pass
                    a_i = self._forward(x_i)
                    # backward pass
                    dW_i, db_i = self._backward(x_i, y_i)
                    # accumulate cost and gradient
                    minibatch_cost += cost(a_i, y_i)
                    dW += dW_i
                    db += db_i
                # average cost and gradients across minibatch size
                dW = dW / self.minibatch_size
                db = db / self.minibatch_size
                # accumulate cost over the epoch
                minibatch_cost = minibatch_cost / self.minibatch_size
                epoch_cost += minibatch_cost
                # update weights
                self.W_ = self.W_ - self.learning_rate * dW
                self.b_ = self.b_ - self.learning_rate * db
            # record cost at end of each epoch
            self.costs_.append(epoch_cost)

    # rest is the same
Note that we shuffle the training data each epoch so we have different minibatches to compute gradients with and update our parameters. In fact, if we see any cyclical patterns in the cost plot, it’s usually indicative of the same minibatches of data being seen over and over again.
Using SGD instead of a full GD also gives an optimal solution to this problem. Note the loss curve in the right plot is noisier than using full GD since we’re randomly sampling minibatches across the training input rather than evaluating the entire training set for each iteration. In fact, in some iterations, the cost actually goes up a little bit! But the overall trend goes to 0 and that long-term trend is more important.
Now when we run the code, our loss curve looks a bit noisier but each iteration by itself is faster since we’re only using a fraction of the entire training input, yet we can still converge to a similar solution. Computing gradients over minibatches rather than the entire dataset is essential for any practical training on real-world data!
The full code listing is here.
In this post, we learned about numerical optimization and how we can automatically solve for the parameters of our perceptron and artificial neuron (as well as any other mathematical model, in fact!) using gradient descent. Along the way, we discovered some issues with our perceptron model, such as its step function activation, and evolved our perceptron into something more modern using the sigmoid activation. We also covered a few improvements for gradient descent, e.g., a better choice of cost function as well as minibatching, that help it achieve better performance in terms of speed and quality of result.
In the next post, we’ll use this neuron model as a building block to construct deep neural networks and discuss how we actually train them when they do have millions of parameters. I had originally planned to cover true artificial neural networks and backpropagation in this post as well but felt like it was already big enough to stand alone. Also backpropagation takes a lot of time and explanation that I think deserves its own dedicated article. Hopefully I turn out to be correct for next time 🙂
In this post, we’ll start with the first and simplest kind of neural network: a perceptron. It models a single biological neuron and can actually be sufficient to solve certain problems, even with its simplicity! We’ll use perceptrons to learn how to separate a dataset and represent logic gates. Then we’ll extend perceptrons by feeding them into each other to create multilayer perceptrons to separate even nonlinear datasets!
In the mid-20th century, there was a lot of work on trying to further artificial intelligence, and the general idea was to create artificial intelligence by trying to model actual intelligence, i.e., the brain. Looking towards nature and how it creates intelligence makes a lot of intuitive sense (e.g., dialysis treatment was the result of studying what kidneys do in order to replicate their function). We know that the brain is composed of biological neurons like the one below.
Source. A biological neuron has a cell body that accumulates neurotransmitters from its dendrites. If the neuron builds up enough of a charge, it emits an electrical signal called an action potential through the axon; if there isn’t “enough” of a charge, no signal is produced.
(This wouldn’t be an “Intro to Neural Nets” explanation without a picture of an actual biological neuron!)
A few key conceptual notions of artificial neurons arose from this exact model and simplistic understanding. Specifically, a biological neuron has dendrites that collect neurotransmitters and sends them to the cell body (soma). If there is enough accumulated input, then the neuron fires an electrical signal through its axon, which connects to other dendrites through little gaps called synaptic terminals. There exists an All or Nothing Law in physiology where, when a nerve fiber such as a neuron fires, it produces the maximum-output response rather than any partial one (interestingly this was first shown with the electrical signals across heart muscles that keep the heart beating!); in other words, the output is effectively binary: either the neuron fires or it does not based on a threshold of accumulated neurotransmitters.
Trying to model this mathematically, it seems like we have some inputs $x_i$ (dendrites) that are accumulated and determine if a binary output $y$ fires if the combined input is above a threshold $\theta$. Since we have multiple inputs, we have to combine them somehow; the simplest thing to do would be to add them all together. This is called the pre-activation. Finally, we threshold on the pre-activation to get a binary output, i.e., apply the activation function to the pre-activation.
\[y=\begin{cases} 1, & \displaystyle\sum_i x_i \geq \theta \\ 0, & \displaystyle\sum_i x_i < \theta \\ \end{cases}\]What are the kinds of things we can model with this? For the simplest case, let’s consider binary inputs and start with binary models. For example, consider logic gates like AND and OR. If we choose the right value of $\theta$, we can recreate these gates using this neuron model. For an AND gate, the output is $1$ if and only if both inputs are $1$; $\theta = 2$ seems to be the right value. For an OR gate, the output is $1$ if either of the inputs is $1$; $\theta=1$ seems to be the right value. What about an XOR gate? This gate returns $1$ if exactly one of the inputs is $1$. What value of $\theta$ would allow us to recreate the XOR gate? We can try a bunch of different values, but it turns out that there is no value of $\theta$ that can recreate the XOR gate under this particular mathematical model. One other way to see this is visually.
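We can check the gates against this threshold model directly; here’s a tiny sketch (the function name `neuron` is just for illustration):

```python
def neuron(xs, theta):
    """Fire (output 1) iff the summed inputs reach the threshold theta."""
    return int(sum(xs) >= theta)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND gate: fires only when both inputs are 1, so theta = 2 works
assert [neuron(x, 2) for x in inputs] == [0, 0, 0, 1]
# OR gate: fires when at least one input is 1, so theta = 1 works
assert [neuron(x, 1) for x in inputs] == [0, 1, 1, 1]
# XOR gate: no threshold reproduces [0, 1, 1, 0] -- the sums 0, 1, 1, 2
# are monotone, so any theta fires on (1,1) whenever it fires on (0,1)
assert all([neuron(x, t) for x in inputs] != [0, 1, 1, 0]
           for t in [-1, 0, 1, 2, 3])
```

The last assertion is a brute-force version of the argument in the text: thresholding a sum can only carve the input space with a single line.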
We can plot the inputs along two axes representing the two inputs and color them based on what the result should be, i.e., white is an output of 1 and black is an output of 0. Note that the neuron model is a linear model, which means we can only represent gates whose outputs are separable by a line. This is true for the AND and OR gates, but not for the XOR gate. However, two lines could be used to recreate the XOR gate, so it seems like we’ll need a more expressive model.
We’ll see later what model we need to also be able to recreate the XOR gate, but it’s important to know that this simple model has limitations on its representative power so we’re going to need a more complicated model in the future.
This seems like a good start but there’s no “learning” happening here. Even before neural networks, we had learning-based approaches that sought to solve (or optimize) for some parameters given an objective/cost/loss function and a set of input data. For example, consider fitting a least-squares line to a set of points. Given some parameters of our model (specifically the slope and y-intercept) and a set of data (the points to fit a line to), we want to find the values of the parameters that “best” fit the data (according to the cost). In our example, we do have a single parameter $\theta$, but we’ve been guessing a value that works, which clearly won’t scale to more complex examples.
One thing we can do to improve the expressive power is to add more parameters to the model and figure out how to solve/optimize for them given a set of input data rather than having to guess their values by inspection. There are a number of different ways to do this, but one effective way is to introduce a set of weights $w_i$, one for each input, and a bias $b$ across all inputs. Since we have a single bias that can shift the values of the inputs, we can also simplify the activation function by fixing $\theta=0$ and letting the learned bias shift the input to the right place.
\[y=\begin{cases} 1 & \displaystyle\sum_i w_i x_i + b \geq 0 \\ 0 & \displaystyle\sum_i w_i x_i + b < 0 \\ \end{cases}\]This thresholding function is also called the Heaviside step function. A simpler notation is to collect the weights and inputs into vectors and use the dot product.
\[y=\begin{cases} 1 & w\cdot x + b \geq 0 \\ 0 & w\cdot x + b < 0 \\ \end{cases}\]Furthermore, we can absorb the bias into the weights and input by adding a dimension to the input and weight dimension and fixing the first value of every input to $1$ always. We can think of the bias as being a weight whose input is always $1$, i.e., $\sum_{i\neq 0} w_i\cdot x_i + b\cdot x_0$ where $x_0=1$.
\[y=\begin{cases} 1 & w\cdot x \geq 0 \\ 0 & w\cdot x < 0 \\ \end{cases}\]We’ll also sometimes use $y=f(w\cdot x)$ as a shorthand where $f$ represents the step function.
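A quick numerical sanity check of the bias-absorption trick (the example weights, bias, and input here are arbitrary, chosen just for illustration):

```python
import numpy as np

w = np.array([0.5, -0.3])   # arbitrary example weights
b = 0.2                     # arbitrary example bias
x = np.array([1.0, 2.0])    # arbitrary example input

# absorb the bias: prepend b to the weights and a constant 1 to the input
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))

# both forms compute the same pre-activation
assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, x_aug))
```

This is why we can write everything as a single dot product $w\cdot x$ going forward.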
This very first neural model is called the perceptron: a linear binary classifier whose weights we can learn by providing it with a dataset of training examples and using the perceptron training algorithm to update the weights. Supposing we have the already-trained values of the weights, we can take any input, dot it with the learned weights, and run it through the step function to see which of the two classes the input belongs in.
One illustrative example to see how this is more general than the binary case is to recreate our logic gates, but using this model instead. Again, let’s try to recreate the AND and OR gates. Both of these take two inputs $x_1$ and $x_2$ so we’ll have $w_1$, $w_2$ and $b$ that we need to find appropriate values for.
Similar to the previous examples, we’ll recreate the AND and OR logic gates but use the weights of the perceptron rather than the threshold. The values of the weights are on the edges while the value of the bias term is inside of the neuron. Note that the perceptron model is still a linear model so we still can’t represent the XOR gate just yet.
With some inspection and experimentation, we can figure out the values for the weights and bias. For the AND gate, if we set $w_1=1$, $w_2=1$, $b=-2$, then the pre-activation is $x_1+x_2-2$. Only when $x_1=x_2=1$ is the pre-activation $0$, and hence only then do we get $y=1$ after running it through the step function. For the OR gate, the parameters are $w_1=1$, $w_2=1$, $b=-1$ and the pre-activation is $x_1+x_2-1$. There are also other gates we can represent, e.g., NOT and NAND, but still not XOR since perceptrons are still linear models. Note that these parameter values aren’t the only ones that satisfy the criteria; this will become important much later on when we talk about regularization.
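We can check these weights and biases against the gates’ truth tables; a small sketch (the `perceptron` helper and the NAND parameters from later in the post are shown together here for convenience):

```python
import numpy as np

def perceptron(x, w, b):
    """Heaviside step applied to the pre-activation w.x + b."""
    return int(np.dot(w, x) + b >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND  = dict(w=np.array([1, 1]),   b=-2)
OR   = dict(w=np.array([1, 1]),   b=-1)
NAND = dict(w=np.array([-1, -1]), b=1)

assert [perceptron(x, **AND)  for x in inputs] == [0, 0, 0, 1]
assert [perceptron(x, **OR)   for x in inputs] == [0, 1, 1, 1]
assert [perceptron(x, **NAND) for x in inputs] == [1, 1, 1, 0]
```

Swapping in other parameter values that satisfy the same truth tables (e.g., scaling all of AND’s parameters by 2) would pass these checks too, which is the non-uniqueness noted above.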
Similar to the previous cases, we’ve manually solved for the values of the parameters since there were only three of them and our “dataset” was just a small truth table. But what if we wanted to separate a dataset like this?
These data are taken from a famous dataset called the Iris Flower dataset that measured the petal and sepal length and width of 3 species of iris flowers: iris setosa, iris versicolor, and iris virginica. Here, we plot only the sepal and petal lengths of iris setosa and iris versicolor. Notice that we can draw a line that separates these two species. Interestingly this dataset was collected by the God of statistics: Ronald Fisher.
Trying to figure out the weights and bias by inspection now becomes a bit more difficult! Imagine doing the same for a 100-dimensional dataset; it’d be nigh impossible! Most practical cases we’ll encounter in the real world look like this, so we need an algorithm for automatically solving for the weights and bias of a perceptron. Let’s set up the problem: we have a bunch of linearly separable pairs of inputs $x_i$ and binary class labels $y_i\in\{0,1\}$ that we group into a dataset $\mathcal{D} = \Big\{ (x_1, y_1), \cdots, (x_N, y_N) \Big\}$, and we want to solve for the set of weights such that the predicted class value $\hat{y} = f(w\cdot x)$ matches the correct class value $y$ for the examples in our dataset.
In other words, we want to update our weights using some rule such that we eventually correctly classify every example. The most general kind of weight update rule is of the form:
\[w_i\gets w_i + \Delta w_i\]For each example in the dataset, we can apply this rule to move the weights a little bit towards the right direction. But what should $\Delta w_i$ be? We can define a few desiderata of this rule and try to put something together. First, if the target and predicted outputs are the same, then we don’t want to update the weight, i.e., $\Delta w_i = 0$ since the model is already correct! However, if the target and predicted outputs are different, we want to move the weights towards the correct class of that misclassified example. One last important thing is to be able to scale the weight update so that we don’t make too large of an update and overshoot. Putting all of these together, we can come up with an update scheme like the following.
\[\Delta w_i = \alpha(y-\hat{y})x_i\]where $\alpha$ is the learning rate that controls the magnitude of the update. Note that when the target and predicted class are the same, $\Delta w_i = 0$ since we’re already correct. However, if they disagree, then we move the weights towards the direction of the correct class of that misclassified example.
Putting everything together, we have the Perceptron Training Algorithm!
Given a learning rate $\alpha$, a set of weights $w_i$, and a dataset $\mathcal{D} = \Big\{ (x_1, y_1), \cdots, (x_N, y_N) \Big\}$, repeat for a number of epochs: for each example $(x, y)\in\mathcal{D}$, compute the prediction $\hat{y}=f(w\cdot x)$ and update each weight $w_i\gets w_i + \alpha(y-\hat{y})x_i$.
An epoch is a full iteration where the network sees all of the training data exactly once; it’s used to bound the high-level loop in case the perceptron or network doesn’t converge perfectly. That being said, this update algorithm is actually guaranteed to converge in a finite amount of time by the Perceptron Convergence Theorem. The proof itself isn’t particularly insightful but its existence is: with a linearly separable dataset, we’re guaranteed to converge after a finite number of mistakes.
Perceptrons are really easy to code up so let’s go ahead and write one really quickly in Python using numpy.
import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, num_epochs=10):
        self.lr = lr
        self.num_epochs = num_epochs

    def train(self, X, y):
        # initialize weights; w_[0] is the bias (recall x_0 = 1)
        self.w_ = np.zeros(1 + X.shape[1])
        self.losses_ = []
        for _ in range(self.num_epochs):
            errors = 0
            for x_i, y_i in zip(X, y):
                dw = self.lr * (y_i - self.predict(x_i))
                self.w_[1:] += dw * x_i
                # bias update; recall x_0 = 1
                self.w_[0] += dw
                errors += int(dw != 0.0)
            self.losses_.append(errors)
        return self

    def _forward(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return np.where(self._forward(X) >= 0., 1, 0)
Let’s train this on the above linearly separable dataset and see the results!
import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np
# Load the Iris dataset
iris = datasets.load_iris()
data = iris.data
target = iris.target
# Select only the Setosa and Versicolor classes (classes 0 and 1)
setosa_versicolor_mask = (target == 0) | (target == 1)
data = data[setosa_versicolor_mask]
target = target[setosa_versicolor_mask]
# Extract the sepal length and petal length features into a dataset
sepal_length = data[:, 0]
petal_length = data[:, 2]
X = np.vstack([sepal_length, petal_length]).T
# Train the Perceptron
p = Perceptron()
p.train(X, target)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Create a scatter plot of values
ax1.scatter(sepal_length[target == 0], petal_length[target == 0], label="Setosa", marker='o')
ax1.scatter(sepal_length[target == 1], petal_length[target == 1], label="Versicolor", marker='x')
# Plot separating line
w1, w2 = p.w_[1], p.w_[2]
b = p.w_[0]
x_values = np.linspace(min(sepal_length), max(sepal_length), 100)
y_values = (-w1 * x_values - b) / w2
ax1.plot(x_values, y_values, label="Separating Line", color="k")
# Set plot labels and legend
ax1.set_xlabel("Sepal Length (cm)")
ax1.set_ylabel("Petal Length (cm)")
ax1.legend(loc='upper right')
ax1.set_title('Perceptron Output')
# Plot perceptron loss
ax2.plot(p.losses_, label="Error", color="r")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Error")
ax2.legend(loc='upper left')
ax2.set_title('Perceptron Errors')
# Show the plot
plt.show()
After training the perceptron on the dataset, we get a line in 2D that separates the two classes. In the general case, for a dataset with $d$-dimensional inputs, we’d get a $(d-1)$-dimensional separating hyperplane. The right plot shows the number of errors the perceptron makes as we train on the dataset; since the dataset is linearly separable, the perceptron is guaranteed to converge to some solution after a finite number of mistakes.
Since our dataset was linearly separable, we were able to converge to a solution in just a few epochs! Note that the resulting line separates the data completely, but it isn’t necessarily the best possible separator. Feel free to experiment with different kinds of weight initialization and learning rates!
Even with the improvements on the perceptron from the simpler artificial neuron model, we still can’t solve the XOR problem since perceptrons only work for linearly separable data. But recall back to when we were talking about biological neurons. After consuming input from the dendrites, if we’ve accumulated enough inputs to fire the neuron, it’ll fire along the output axon which in turn is used as the input to other neurons. So it seems, at least biologically, that neurons feed into other neurons.
We can also feed our artificial neurons into other neurons and create connections between them. There are a number of different choices for how to connect them; we could even connect neurons recurrently to themselves! But the simplest thing to try is to feed the outputs of two neurons into a third neuron that produces the final output.
A multilayer perceptron (MLP) takes the output of one perceptron and feeds it into another perceptron. The edges represent the weights and the circles represent the biases. Here is a 2-layer perceptron with a hidden layer of 2 neurons and output layer of 1 neuron.
This structure is called a multilayer perceptron (MLP) and the intermediate layer is called a hidden layer since it sits between the observable input and the observable output; the hidden layer itself might not directly have an observable result or interpretation. In this particular example, we have 9 learnable parameters $w_1$, $w_2$, $w_3$, $w_4$, $b_1$, $b_2$, $w_5$, $w_6$, and $b_3$. Solving for these parameters by inspection is still possible by making one key observation: we can redefine an XOR gate as a combination of other gates: $a \tt{~XOR~} b = (a\tt{~OR~}b) \tt{~AND~} (a\tt{~NAND~}b)$. We’ve already seen the AND and OR gates, so we just need to figure out the right weights and bias for the NAND gate. Test this yourself, but one set of values that satisfies the NAND gate is $w_1=-1$, $w_2=-1$, $b=1$. Because of this decomposition of the XOR gate, we can recreate it using those same weights and biases.
One way to interpret this solution to the XOR gate problem is that the top hidden neuron represents $h_1 = x_1\tt{~OR~}x_2$ and the bottom one represents $h_2 = x_1\tt{~NAND~}x_2$. Then the final one represents $h_1\tt{~AND~}h_2 = (x_1\tt{~OR~}x_2) \tt{~AND~} (x_1\tt{~NAND~}x_2)=x_1 \tt{~XOR~} x_2$. Now we have a solution to classify even nonlinear data!
In theory this seems to work, but let’s plug in some values and run them through this MLP to see if it produces the right outputs. We’ll call the hidden layer outputs $h_1=f(x_1+x_2-1)$ and $h_2=f(-x_1-x_2+1)$. The final output is then $y=f(h_1+h_2-2)$. Here’s a truth table showing the inputs and outputs.
| $x_1$ | $x_2$ | $h_1$ | $h_2$ | $y$ |
| --- | --- | --- | --- | --- |
| $0$ | $0$ | $0$ | $1$ | $0$ |
| $0$ | $1$ | $1$ | $1$ | $1$ |
| $1$ | $0$ | $1$ | $1$ | $1$ |
| $1$ | $1$ | $1$ | $0$ | $0$ |
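We can also verify the truth table programmatically; a quick sketch using the step function and the weights above (the helper name `xor_mlp` is mine):

```python
import numpy as np

def f(z):
    """Heaviside step function."""
    return np.where(z >= 0, 1, 0)

def xor_mlp(x1, x2):
    h1 = f(x1 + x2 - 1)          # hidden neuron acting as OR
    h2 = f(-x1 - x2 + 1)         # hidden neuron acting as NAND
    return int(f(h1 + h2 - 2))   # output neuron acting as AND

# matches the XOR truth table: 0, 1, 1, 0
assert [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```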
Seems like this MLP correctly produces the right outputs for the XOR gate! This is pretty interesting because a single perceptron couldn’t solve the XOR gate problem since the XOR gate isn’t linearly separable. But by layering perceptrons, we can correctly classify even nonlinear data! To understand why, let’s try plotting the values of the hidden layer using the truth table above.
We can plot the hidden state values in the 2D plane in the same way as plotting the logic gates. Notice that in the latent space, the XOR gate is indeed linearly separable so we only need one additional perceptron on this hidden state to complete our MLP representation of the XOR gate!
This is particularly insightful: in the input space, the XOR gate is not linearly separable, but in the hidden/latent space it is! This is a general observation about neural networks: they perform a series of transforms until the final data are linearly separable, then we just need a single perceptron to separate them. Layering perceptrons gives the MLP more expressive power to separate nonlinear datasets by passing them through multiple transforms. However, solving for the parameters by inspection has its limits as we scale up to many hundreds, thousands, millions, and billions of parameters! We’ll still need a way to automatically learn the parameters of these kinds of very large neural networks, but we’ll save that for next time!
Neural networks have gained immense traction in the past decade for their exceptional performance across a wide variety of tasks. Historically, they arose from trying to model biological neurons in an effort to create artificial intelligence. From these simple biological models, we derived a few parametrized mathematical models. We started with perceptrons, learned what their parameters were, and saw how to train them using the Perceptron Training Algorithm. We showed how they can successfully classify real-world, linearly separable data. However, we found limitations in them, particularly with nonlinear datasets, even in cases as simple as recreating an XOR gate. But we found that by layering them together into multilayer perceptrons (MLPs), we could separate even some nonlinear datasets!
We solved for the parameters of the MLP by inspection but this isn’t possible for very large neural networks so we’ll need an algorithm to automatically learn these parameters given the dataset. Furthermore, there have been a number of advancements in neural networks to improve their efficiency and robustness, and we’ll discuss the training algorithm and some of these advancements in the next post 🙂
Can we come up with a better word representation that actually models word meanings and relations? In this post, I’ll go over how to compare words and how to quantify that similarity of words. As always, we’ll start with some background in linguistics. Then, before getting into word similarity, we’ll actually talk about document similarity, since it’s a bit easier to understand, and use those concepts to finally talk about embeddings which are vector representations of words that capture meaning. Finally, we’ll see some tangible examples with code on how to load pretrained embeddings, perform analogy tasks, and visualize them.
Similar to the previous discussion on n-grams, since we’re talking about representing the meanings of words, we first have to understand what that entails linguistically. n-gram models represent words as strings, which doesn’t capture the meaning or connotation of the word in question. For example, some words are synonyms or antonyms of other words; some words have a positive or negative connotation; some words are related, but not synonymous, to other words. A good representation should capture all of that.
Let’s start with synonyms as an example: one way to say two words are synonyms is if we can substitute one for the other in a sentence and still have the “truth conditions” of the sentence hold. However, just because words are synonyms doesn’t mean they’re interchangeable in all contexts. For example, water and $H_2 O$ are synonyms, but $H_2 O$ is used in scientific contexts and sounds strange in everyday ones. Furthermore, words can be similar without being synonyms: tea and biscuits are not synonyms but are related because we often serve biscuits with tea.
One methodology linguists came up with to quantify word meaning in the 50s (Osgood et al. The Measurement of Meaning. 1957) is quantifying words along three dimensions: valence (pleasantness of stimulus, e.g., happy/unhappy), arousal (intensity of emotion of the stimulus, e.g., excited/calm), and dominance (degree of control of the stimulus, e.g., controlling/influenced). Linguists asked groups of humans to quantify different words based on those dimensions. From the results, they could numerically measure similarity and relationships of words using those three dimensions. With this representation, we could map a word to a vector in a 3D space and perform arithmetic operations to compare two words along those hand-crafted features. This was a good start but depended on surveying humans to come up with these values when large corpora of human text already exist.
Gathering large groups of people to quantify words along those dimensions isn’t a practical way of doing things, but it provides some insight: we can try to come up with an automated mechanism to map a word to a vector that we can perform mathematical operations on. The key insight lies in what linguists call the distributional hypothesis: words that occur in similar contexts have similar meanings. So the idea is to construct this vector representation for a particular word based on the context that word appears in.
Counter-intuitively, figuring this out for documents is a bit easier than for individual words, so let’s sojourn into the world of information retrieval (IR). Given a query $q$ and a set of documents $\mathcal{D}$ (also called a corpus), the problem of information retrieval is to find a document $d\in\mathcal{D}$ that “best matches” the query $q$. Based on the distributional hypothesis, one way to compare two documents would be to look at how many words co-occur across documents. For each word in each document of the corpus, we can create a word-document matrix where the rows are words, the columns are documents, and an entry in the matrix represents the number of times a particular word appeared in a particular document.
| | As you like it | Julius Caesar | Henry V |
| --- | --- | --- | --- |
| battle | 1 | 7 | 13 |
| good | 114 | 62 | 89 |
| fool | 20 | 1 | 4 |
Source: Speech and Language Processing by Dan Jurafsky and James H. Martin
In this example, we can see that Julius Caesar has more in common with Henry V than it does with As you like it because the counts are more similar to each other. Quantitatively, we can represent each document as a vector of size $\vert V\vert$ where $\vert V\vert$ represents the size of the vocabulary (all words across all documents). So we can represent Julius Caesar and Henry V as two $\vert V\vert$-dimensional vectors, but how do we compare them?
One straightforward comparison is the dot product between the two vectors:
\[d(v, w) = v_1 w_1+\cdots+v_{N}w_{N}\]However, this would disproportionately give a higher weight to vectors of greater magnitude. If we normalize by both of the vector lengths, we get a fairer notion of similarity.
\[d(v, w) = \frac{v_1 w_1+\cdots+v_{N}w_{N}}{|v||w|} = \frac{v\cdot w}{|v||w|}\]This measure of similarity is called cosine similarity, or the normalized dot product, because of the identity $a\cdot b = \vert a\vert\vert b\vert\cos\theta$. It is bounded to $[-1, 1]$, where a similarity of 1 means the vectors are maximally similar (pointing in the same direction), a similarity of 0 means the vectors are unrelated (orthogonal), and a similarity of -1 means the vectors are maximally dissimilar (pointing in opposite directions). Using this measure, we can compare two documents and quantify their similarity!
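Cosine similarity is a one-liner with numpy; here’s a sketch using the raw counts from the word-document matrix above (the vocabulary is restricted to the three words shown, so this is illustrative rather than the full comparison):

```python
import numpy as np

def cosine_similarity(v, w):
    """Normalized dot product between two count vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# count vectors over the toy vocabulary (battle, good, fool), per the table above
as_you_like_it = np.array([1, 114, 20])
julius_caesar  = np.array([7, 62, 1])
henry_v        = np.array([13, 89, 4])

# Julius Caesar is closer to Henry V than to As You Like It
assert cosine_similarity(julius_caesar, henry_v) > \
       cosine_similarity(julius_caesar, as_you_like_it)
```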
One large issue with directly using the term-document matrix is article words. Words like “the”, “a”, “an” are words that occur frequently across all documents. They don’t have any discriminative power when it comes to comparing two documents since they occur too frequently. So we need to balance words that are frequent against words that are too frequent. To quantify this, we define term frequency as the frequency of a word $t$ in a document $d$ using the raw count.
\[\text{tf}_{t,d} = \text{count}(t, d)\]Word frequencies can become very large numbers so we want to squash the raw counts since they don’t linearly equate to relevance. But what if a word doesn’t occur in a document at all? Its count would be 0, which ends up becoming a problem when we take a log. We can simply offset the raw count by 1 to avoid numerical issues and use the log-space instead of the raw counts.
\[\text{tf}_{t,d} = \log\Big(\text{count}(t, d) + 1\Big)\]Similarly, we can define document frequency as the number of documents that a word occurs in: $\text{df}_t$. Inverse document frequency (idf) is simply the inverse using $N$ as the number of documents in the corpus: $\text{idf}_t=\frac{N}{\text{df}_t}$. Similar to the above rationale, we also use the log-space.
\[\text{idf}_t = \log\frac{N}{\text{df}_t}\]The intuition is that frequent words are more important than infrequent ones, but a word that occurs in fewer documents should get a higher weight since it has more discriminative power, i.e., it more uniquely identifies a document. Combining these two ideas, we get the full tf-idf weight
\[w_{t,d} = \text{tf}_{t,d}\text{idf}_t\]Note that this ensures that really common words would have $w\approx 0$ since their idf score would be close to 0. With the earlier table, let’s replace the raw counts with the tf-idf score for each entry in the word-document matrix.
| | As you like it | Julius Caesar | Henry V |
| --- | --- | --- | --- |
| battle | 0.074 | 0.22 | 0.28 |
| good | 0 | 0 | 0 |
| fool | 0.019 | 0.0036 | 0.022 |
Source: Speech and Language Processing by Dan Jurafsky and James H. Martin
Notice that since “good” is a very common word, its tf-idf score becomes 0 since it has no discriminative power. Using tf-idf provides a better way to compare documents by more accurately representing their word contents.
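To make the computation concrete, here’s a sketch that reproduces the “battle” and “good” rows of the table, using base-10 logs and document frequencies over the full corpus of 37 Shakespeare plays (I’m assuming battle appears in 21 plays, good in all 37, and fool in 36 — numbers the table relies on but doesn’t show, so treat them as my assumption):

```python
import numpy as np

# raw counts from the word-document matrix above
# (rows: battle, good, fool; columns: As You Like It, Julius Caesar, Henry V)
counts = np.array([[1, 7, 13],
                   [114, 62, 89],
                   [20, 1, 4]], dtype=float)

N = 37                               # total plays in the assumed full corpus
df = np.array([21.0, 37.0, 36.0])    # assumed document frequencies per word

tf = np.log10(counts + 1)            # squashed term frequency
idf = np.log10(N / df)               # inverse document frequency
w = tf * idf[:, None]                # tf-idf weight for each cell

assert np.allclose(w[1], 0)          # "good" appears everywhere: zero weight
assert np.allclose(w[0], [0.074, 0.22, 0.28], atol=0.01)  # "battle" row matches
```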
We’ve seen how to represent documents as large, sparse vectors of word counts/frequencies and how to compare them against each other. Let’s see how to compare individual words using embeddings. An embedding is a short, dense vector representation of a word that captures particular semantics of that word. In practice, these dense vectors tend to work better than sparse vectors in most language tasks since they are more efficient at capturing the complexity of the semantic space.
One way to construct an embedding for a word is to go back to the distributional hypothesis: words that occur in similar contexts have similar meanings. This is the principle behind word2vec: we want to train a model that tells us if a word is likely to be near another word. Through training a word2vec model, the weights of the model become the embeddings, and we learn them for each word in the vocabulary in a self-supervised fashion with no explicit training labels.
There are two flavors of word2vec: continuous bag of words (CBOW) and skip-gram; we’ll use the skip-gram model. The intuition is that we select a target word and define a context window of a few words before and after the target word. We construct a tuple of the target word and each of the words in the context window, and these become our training examples. We learn a set of weights to maximize the likelihood that a context word appears in the context window of a target word and use the learned weights as the embedding itself.
Let’s start with constructing the training tuples. Suppose we have the following sentence and the target word was “cup” and the context window was $\pm 2$:
\[\text{[The coffee }\color{red}{\text{cup}}\text{ was half] empty.}\]The training examples would be tuples of the target word and the context words: (cup, the), (cup, coffee), (cup, was), (cup, half). We want to train a model such that, given a target word and another word, it returns the likelihood that the other word is a context word of the target word.
word2vec model architecture and training example. We map a word to its one-hot embedding and then use $E$ to map into the embedding itself. Then we remap into the vocabulary, normalize over all words, and try to maximize the likelihood that a particular context word is seen in the context window of the target.
For the input, we represent words as sparse one-hot embeddings where the size of the vector is the size of the vocabulary and we assign a unique dimension/index in the vector to each word.
\[\text{cup}\mapsto\begin{bmatrix}0\\\vdots\\ 0\\ 1\\ 0\\\vdots\\ 0\end{bmatrix} = w\\\]Then we have a weight matrix $E$ that maps this one-hot vector to its embedding vector of some dimensionality $H$, so the dimensions of the matrix must be $H\times\vert V\vert$. We can get the embedding for a word in its one-hot representation by multiplying $Ew$ to get an output embedding vector of size $H\times 1$ that corresponds to the same row in the matrix. Note that this is equivalent to “selecting” the row of the one-hot embedding. For this reason, we also call $E$ the embedding matrix itself.
Recall that to train the model, we want it to produce a high likelihood if a context word is indeed in the context of the target word. To do that, we need another matrix mapping the embedding space back into the vocab space $E’$ of dimension $\vert V\vert\times H$. Since we want a probability, we need to normalize the output so we get a probability distribution across the vocabulary words. To do this, we apply the softmax operator:
\[\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}\]Intuitively, this takes a particular element $z_i$ and divides it by the total sum of all elements in the exponential space. This gives us a valid probability distribution as the output. For the context word, we use the one-hot embedding. Another way to interpret the one-hot embedding probabilistically is that it represents a distribution with a single peak at a single index/word.
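A minimal softmax implementation (subtracting the max first is a standard numerical-stability trick that doesn’t change the result, not something the equation requires):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)   # stability: exp of large scores would overflow
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

assert np.isclose(probs.sum(), 1.0)       # a valid probability distribution
assert probs.argmax() == scores.argmax()  # ordering of scores is preserved
```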
Now that we have the normalized output distribution and a one-hot embedding (thought of as a peaked distribution), the intuition behind the loss function is that we want to push the output distribution to be peaked in the same index as the desired embedding. One loss function that has this property is called the cross-entropy (CE) loss between a target $y$ and predicted $\hat{y}$.
\[\mathcal{L}(\hat{y}, y) = -\sum_i y_i\log\hat{y}_i\]Note that because the target vector $y$ is a one-hot embedding, every term in the sum will be $0$ except the one where $i=c$, where $c$ is the index of the context word in the target vector and the element value is simply $1$. So we can simplify this into a single expression.
\[\mathcal{L}(\hat{y}, y) = -\log\hat{y}_c\]Does this loss function do the right thing? What happens if $\hat{y}_c$ is very close to $0$? Intuitively, this means the model is not doing a good job since it estimates the context word with a low probability of being in the actual context. In this case, we’re taking the log of a very small number which is a very large negative number. After we negate it, we get a very large loss. This makes sense since our model is saying that it doesn’t think the context word has a high likelihood of being in the context window even though it actually is (because that’s how we constructed the dataset). Note that since $\hat{y}_c$ is the output of a softmax, it’s bounded to be in $[0, 1]$. Since we can’t take the log of $0$, we often add a little epsilon $\varepsilon$ inside the log like $\log(\hat{y}_c+\varepsilon )$ for numerical stability.
Now what happens if $\hat{y}_c$ is close to $1$? Intuitively, this means our model is doing great because it’s very confidently estimating that the context word is in the context window. In this case, the log of $1$ is $0$ so we have a loss of $0$. This makes sense since our model is accurately predicting the high likelihood of the context word being in the context window.
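The two limiting cases above are easy to check numerically; this sketch uses the $\varepsilon$-stabilized form from the text:

```python
import numpy as np

def cross_entropy(y_hat_c, eps=1e-12):
    # CE loss with a one-hot target reduces to -log of the predicted
    # probability of the true context word; eps avoids log(0)
    return -np.log(y_hat_c + eps)

print(cross_entropy(0.99))   # small loss: confident and correct
print(cross_entropy(0.001))  # large loss: confident and wrong
```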
Overall, this loss function seems to do what we want: move the predicted distribution of the model to be peaked at the context word. Putting all of this together, the training process looks like the following.
Practically, we’d use a framework such as Pytorch or Tensorflow and their automatic differentiation (also called autograd for automatic gradient) to compute the gradients for us. After training, we have an embedding matrix $E$ such that each row is an embedding vector that we can look up for a particular word in our vocabulary.
word2vec is a good start in providing us with a word representation that holds some semantics about the word but it has one major problem: the context is always local. When we create training examples, we always use a context window around the word. While this gives us good local co-occurrences, we could more accurately represent the word if we also looked at global co-occurrences of words. Rather than trying to learn the raw probabilities like word2vec does, GloVe aims to learn a ratio of probabilities representing how much more likely it is that a particular word appears in the context of one word compared to another.
To start with some notation, we define a word-word co-occurrence matrix with $X$ and let $X_{ij}$ represent the number of times word $j$ appeared in the context of word $i$. With that definition, let $X_i = \sum_j X_{ij}$ as the number of times any word appears in the context of word $i$; we can also define $p_{ij}=p(j\vert i)=\frac{X_{ij}}{X_i}$ as the probability that word $j$ appears in the context of word $i$. As an example, consider $i=\text{ice}$ and $j=\text{steam}$. With probe words $k$, we can consider the ratio $\frac{p_{ik}}{p_{jk}}$ that tells us how much more likely is word $k$ to appear in the context of word $i$ than word $j$. For words like $k=\text{solid}$ that are more closely related to $i=\text{ice}$, the ratio will be large; for words more closely related to $j=\text{steam}$ like $k=\text{gas}$, the ratio will be small. For words that are closely related to both such as $k=\text{water}$, the ratio will be close to 1. This ratio has more discriminative power in identifying which words are relevant or irrelevant than using the raw probabilities.
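The definitions above are just counts and ratios, so we can compute them on a toy corpus; this is my own illustrative sketch (the two sentences and the window size are made up, not the paper's corpus):

```python
from collections import Counter

corpus = [["ice", "is", "cold", "solid", "water"],
          ["steam", "is", "hot", "gas", "water"]]
window = 4   # wide enough that every word in a sentence co-occurs

X = Counter()   # X[(i, j)]: count of word j in the context of word i
for toks in corpus:
    for a, wi in enumerate(toks):
        for b in range(max(0, a - window), min(len(toks), a + window + 1)):
            if b != a:
                X[(wi, toks[b])] += 1

def p(j, i):
    # p(j | i) = X_ij / X_i, with X_i the total context count for word i
    X_i = sum(c for (w, _), c in X.items() if w == i)
    return X[(i, j)] / X_i

# "water" is equally likely in either context, while "solid" only
# co-occurs with "ice" -- the ratio p(k|ice)/p(k|steam) discriminates
```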
Rather than learning raw probabilities, the authors construct a model to learn the co-occurrence ratios and train it using a novel weighted least squares regression model.
\[J = \sum_{i,j} f(X_{ij})\Big(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \Big)^2\]where $f$ is a weighting function on the co-occurrence counts:

\[f(x) = \begin{cases}\left(\frac{x}{x_\text{max}}\right)^\alpha & x < x_\text{max}\\ 1 & \text{otherwise}\end{cases}\]
There are a few nice properties about this weighting function that carry over from tf-idf: $f(x)$ is non-decreasing so that more frequent words are weighted correctly, but it has an upper bound of $1$ so that very frequent words are not overweighted. The additional numerical property required of this function is that $f(0) = 0$; otherwise a zero co-occurrence entry would make the objective ill-defined. The hyperparameters are $x_\text{max}$ and $\alpha$; the authors found that the former doesn’t impact quality as much as the latter, and $\alpha=0.75$ tended to work better empirically than a linear weighting. Solving for the weights, we get GloVe embeddings that can be used just like word2vec embeddings, but they tend to perform better since we’re also considering global word co-occurrences in addition to local context windows. We’ll see an example later where we load pretrained GloVe embeddings and use them to solve word analogies.
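A minimal sketch of this weighting function (the defaults $x_\text{max}=100$ and $\alpha=0.75$ are the values reported in the GloVe paper):

```python
def f(x, x_max=100, alpha=0.75):
    # GloVe weighting: grows like (x / x_max)^alpha, capped at 1
    return (x / x_max) ** alpha if x < x_max else 1.0

assert f(0) == 0.0            # zero co-occurrences contribute nothing
assert f(1000) == 1.0         # very frequent pairs are not overweighted
assert f(10) < f(50) < 1.0    # non-decreasing below the cap
```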
Read the GloVe paper for more details!
Both word2vec and GloVe train embeddings that can be used across a number of different language modeling tasks. However, the cost to pay for the generalization is that they may not perform as well for very specific applications. In other words, since the embeddings are taken off-the-shelf, we’ll have to fine-tune them for a specific language modeling task. We can use the pretrained embeddings to start and then consider them to be “optimizable” variables as a smaller part of our language model. This gives us a good start but also allows us to fine-tune the pretrained embeddings for our particular language modeling task.
In some cases, it may be beneficial to actually train an embedding layer from scratch end-to-end as part of whatever the language modeling task-of-interest is. The training procedure is similar to word2vec in that we start with one-hot embeddings of the words and then map them into an embedding space with an embedding matrix, but then the output directly goes into the next layer or stage in the language model. When we train the language model, the gradients automatically update the embedding matrix based on the overall loss of the language modeling task. While this method does tend to produce more accurate results for the end-to-end task, it does require a large corpus to train since we’re training the embeddings from scratch along with the rest of the language model rather than pulling the embeddings off-the-shelf.
After we’ve trained embeddings, we can see how well they model word semantics. One canonical task that demonstrates semantic analysis is completing a word analogy. For example, “man is to king as woman is to X”. The correct answer is “queen”. If our embeddings are truly learning correct semantic relationships, then they should be able to solve these kinds of analogies. We can represent these in the embedding space with vector arithmetic (since vector spaces are linear) and look at which other embeddings lie close to the result.
\[\overrightarrow{\text{king}} - \overrightarrow{\text{man}} + \overrightarrow{\text{woman}} \approx \overrightarrow{\text{queen}}\]In other words, the embedding for “king” minus the embedding for “man” plus the embedding for “woman” should be close to the embedding for “queen”. This turns out to be true for word2vec and GloVe embeddings! So it seems like they are actually capturing certain kinds of semantic relations. Let’s actually write some code to load some pre-trained GloVe embeddings and show this!
First, we’ll need to go to the official GloVe website and download the pre-trained embedding and unzip them. For this example, we’ll use the glove.6B.zip with 100-dimensional GloVe embeddings. Each line in the text file is the word followed by the values of the embeddings so we can load that into a dictionary for easy lookup. Let’s try computing the similarity of $\overrightarrow{\text{king}} - \overrightarrow{\text{man}} + \overrightarrow{\text{woman}}$ and $\overrightarrow{\text{queen}}$ and also an unrelated word like $\overrightarrow{\text{egg}}$ and see if the embeddings correctly note similarities.
import numpy as np
from numpy import dot
from numpy.linalg import norm
embedding_dim = 100
# Define the local path to save the downloaded embeddings
glove_filename = f"glove.6B/glove.6B.{embedding_dim}d.txt"
# Load the GloVe embeddings into a dictionary
e = {}
with open(glove_filename, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding = np.array(values[1:], dtype='float32')
        e[word] = embedding
# compute analogy
result = e['king'] - e['man'] + e['woman']
# cosine similarity of the result and the embedding for queen
cos_sim = dot(result, e['queen']) / (norm(result) * norm(e['queen']))
print(cos_sim)
# cosine similarity of the result and the embedding for egg
cos_sim = dot(result, e['egg']) / (norm(result) * norm(e['egg']))
print(cos_sim)
The cosine similarity for the result and queen is $0.7834413$ while the cosine similarity for the result and egg is only $0.19395089$. As expected, “queen” is a far more appropriate solution to the word analogy than “egg”!
It would be nice to visualize the embeddings of different words relative to each other. However embeddings tend to be higher-dimensional vectors so how can we meaningfully visualize them? There are two common dimensionality-reduction techniques: (i) principal component analysis (PCA) and (ii) t-distributed Stochastic Neighbor Embedding (t-SNE). The intuition behind PCA is to repeatedly project along the dimension with the highest variance (since it has higher discriminative power) using a linear algebra technique such as singular value decomposition (SVD) until we hit the target dimension. t-SNE solves an optimization problem that tries to project the data such that the distances in the higher-dimensional space are similar to distances in the lower-dimensional space, thus locally preserving the structure of the higher-dimensional space in the lower-dimensional space. Both are good techniques to lower the dimensionality of the embedding so we can visualize words as points on a plane (while still preserving their semantics).
Fortunately, Scikit provides implementations for both so we can plot them side-by-side and see the differences.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
words_to_plot = ["king", "man", "queen", "woman", "egg", "chicken", "frog", "snake"]
embeddings_to_plot = np.array([e[word] for word in words_to_plot])
pca = PCA(n_components=2)
reduced_embeddings_pca = pca.fit_transform(embeddings_to_plot)
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
reduced_embeddings_tsne = tsne.fit_transform(embeddings_to_plot)
plt.figure(figsize=(12, 6))
# PCA on the left
plt.subplot(1, 2, 1)
plt.scatter(reduced_embeddings_pca[:, 0], reduced_embeddings_pca[:, 1])
for i, word in enumerate(words_to_plot):
    plt.annotate(word, (reduced_embeddings_pca[i, 0], reduced_embeddings_pca[i, 1]))
plt.title('PCA Projection')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
# t-SNE on the right
plt.subplot(1, 2, 2)
plt.scatter(reduced_embeddings_tsne[:, 0], reduced_embeddings_tsne[:, 1])
for i, word in enumerate(words_to_plot):
    plt.annotate(word, (reduced_embeddings_tsne[i, 0], reduced_embeddings_tsne[i, 1]))
plt.title('t-SNE Projection')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.tight_layout()
plt.show()
The resulting plot shows that the vector difference between “king” and “man” is roughly the same as that of “queen” and “woman” in both plots! In the t-SNE plot, however, we see that the vectors are a bit closer in terms of magnitude and direction.
There are some other interesting observations with the other words: we see that “snake” and “frog” are closer together than, say, “man” and “egg” because while they’re not synonyms, they’re still related words (both being animals that lay eggs). Try plotting other words to see how they cluster together in the lower-dimensional space!
Embeddings are a word representation that preserves semantic properties of words, such as relations to other words and connotation, in a much better way than representing words as strings of characters. Representing documents as vectors is counter-intuitively more straightforward so we started with learning about term frequency and document frequency; that also helped illustrate some interesting concepts like how words that occur too frequently should be downweighted since they don’t have discriminative power. To transition to representing individual words as embeddings, we learned about the distributional hypothesis that stated the meaning of a word depends on the context around it. Our first word embedding model word2vec trained embeddings with that in mind: train a model to predict if a word lies in the context window of a target word. Our next embedding model did a bit better by also looking at global word-word co-occurrences in addition to the local context window approach that word2vec uses. The final embedding model we discussed was a more recent type of model where we learn the embeddings from scratch as part of the language modeling task in an end-to-end fashion. Finally, we used embeddings to show how they can model semantic relations using word analogies as an example semantic understanding task.
Now that we have a vectorized format for embeddings, we can use them for different kinds of language models, the most popular and accurate ones being neural network language models, which we’ll cover next time 🙂
Rather than jumping straight to how LLMs work, I think it’s helpful to cover some prerequisite knowledge to help us demystify LLMs. In this post, we’ll go back in time before neural networks and talk about language, language modeling, and n-gram language models since they’re simple to understand and we can do an example by hand.
Before we start with n-gram models, we need to understand the kind of data we’re working with. If we were going to delve into convolutional neural networks (CNNs), we’d start our discussion with images and image data. Since we’re talking about language modeling, let’s talk about language so we can better motivate why language modeling is very hard. One definition of language that’s particularly relevant to language modeling is a structured system of communication with a grammar and vocabulary (note this applies for spoken, written, and sign language). Given you’re reading this post in the English language, you’re probably already familiar with vocabulary and grammar so let me present to you a sentence.
The quick brown fox jumps over the lazy dog.
You might recognize this sentence as being one that uses each letter of the English/Latin alphabet at least once. Immediately we see the words belonging to the vocabulary and their part-of-speech: nouns like “fox” and “dog”; adjectives like “quick”, “brown”, “lazy”; articles like “the”; verbs like “jumps”; and prepositions like “over”.
Grammar is what dictates the ordering of the words in the vocabulary: the subject “fox” comes before the verb “jumps” and the direct object “dog”. This ordering depends on the language however. For example, if I translated the above sentence into Japanese, it would read: 素早い茶色のキツネが怠惰な犬を飛び越えます。A literal translation would go like “Quick brown fox lazy dog jumped over”. Notice how the verb came at the end rather than between the subject and direct object.
These problems help illustrate why we can’t simply have a model that performs a one-to-one mapping when we try to model languages. We might end up with more words, e.g., if the target language uses particle words, or fewer words, e.g., if the target language doesn’t have article words. Even if we did have the same number of words, the ordering might change. For example, in English, we’d say “red car” but in Spanish we’d say “carro rojo” which literally translates to “car red”: the adjective comes after the noun it describes.
To summarize, language is very difficult! Even for humans! So it’s going to be a challenge for computers too.
With that little aside on languages, before we formally define language modeling, let’s look at a few applications that use some kind of language modeling under-the-hood.
Sentiment Analysis. When reading an Amazon review, as humans, we can tell if they’re positive or negative. We want to have a language model that can do the same kind of thing. Given a sequence of text, we want to see if the sentiment is good or bad. Cases like “It’s hard not to hate this movie” are particularly challenging and need to be handled correctly. This particular application of language modeling is used in “Voice of the Customer” style analysis to gauge perceptions about a company or their products.
Automatic Speech Recognition. Language modeling can be useful for speech recognition by being able to correctly model sentences, especially for words that sound the same but are written differently like “tear” and “tier”.
Neural Machine Translation. Google Translate is a great example of this! If we have language models of different languages, implicitly or explicitly, we can translate between the languages that they model!
Text Generation. This is what ChatGPT has grown famous for: generating text! This application of language modeling can be used for question answering, code generation, summarization, and a lot more applications.
Now that we’ve seen a few applications, what do all of these have in common? It seems like one point of commonality is that we want to understand and analyze text against the trained corpus to ensure that we’re consistent with it. For example, if our model was trained on a dataset of English sentences, we don’t want it generating grammatically incorrect sentences. In other words, we want to ensure that the outputs “conform” to the dataset.
One way to measure this is to compute a probability of “belonging”. For some given input sequence, if the probability is high, then we expect that sequence to be close to what we’ve seen in the dataset. If that probability is low, then that sequence is likely something that doesn’t make sense in the dataset. For example, a good language model would score something like $p(\texttt{The quick brown fox jumps over the lazy dog})$ high and something like $p(\texttt{The fox brown jumped dog laziness over lazy})$ low because the former has proper grammar and uses known words in the vocabulary.
This is what a language model does: given an input sequence $x_1,\cdots,x_N$, it assigns a probability $p(x_1,\cdots,x_N)$ that represents how likely it is to appear in the dataset. That seems a little strange given we’ve just discussed the above applications. What does something like generating text have to do with assigning probabilities to sequences? Well we want the generated text to match well with the dataset, don’t we? In other words, we don’t want text with poor grammar or broken sentences. This also explains why those phenomenal LLMs are trained on billions of examples: they need diversity in order to assign high probabilities to sentences that encode facts and data of the dataset.
So how do we actually compute this probability? Well the most basic definition of probability is “number of events that happened” / “number of all possible events” so we can try to do the same thing with this sequence of words.
\[p(w_1,\dots, w_N)=\displaystyle\frac{C(w_1,\dots, w_N)}{\displaystyle\sum_{w_1,\dots,w_N} C(w_1,\dots, w_N)}\]So for a word sequence $w_1,\dots, w_N$, we count how many times we find that sequence in our corpus and divide by the count of all possible word sequences of length $N$. There are several problems with this. To compute the numerator, we need to count a particular sequence in the dataset but notice that this gets harder to do the longer the sequence is. For example, finding the sequence “the cat” is far easier than finding the sequence “the cat sat on the mat wearing a burgundy hat”. To compute the denominator, we need the combination of all English words up to length $N$. To give a sense of scale, Merriam-Webster estimates there are about ~1 million words, so this becomes a combinatorial problem.
\[\binom{1\mathrm{e}6}{N} = \displaystyle\frac{1\mathrm{e}6!}{N!(1\mathrm{e}6-N)!}\]In other words, for each word up to $N$, there are about a million possibilities we have to account for until we get up to the desired sequence length. The factorial of a million is an incredibly large number! So these reasons make it difficult to compute language model probabilities in that form so we have to try something else. If we remember some probability theory, we can try to rearrange the terms using the chain rule of probability.
\[\begin{align*} p(w_1,\dots, w_N) &= p(w_N|w_1,\dots,w_{N-1})p(w_1,\dots,w_{N-1})\\ &= p(w_N|w_1,\dots,w_{N-1})p(w_{N-1}|w_1,\dots,w_{N-2})p(w_1,\dots,w_{N-2})\\ &= \displaystyle\prod_{i=1}^N p(w_i|w_1,\dots,w_{i-1})\\ \end{align*}\]So we’ve decomposed the joint distribution of the language model into a product of conditionals $p(w_i\vert w_1,\dots,w_{i-1})$. Intuitively, this measures the probability that word $w_i$ follows the previous sequence $w_1,\dots,w_{i-1}$. Basically for a word, we depend on all previous words. So let’s see if this is any easier to practically count up the sequences.
\[p(w_i|w_1,\dots,w_{i-1})=\displaystyle\frac{C(w_1,\dots,w_i)}{C(w_1,\dots,w_{i-1})}\]This looks a little better! Intuitively, we count a particular sequence up to $i$: $w_1,\dots,w_i$ in the corpus. But for the denominator, we only count up to the previous word: $w_1,\dots,w_{i-1}$. This is a bit better than going up to the entire sequence length $N$ but still a problem. In particular, the biggest problem is the history $w_1,\dots,w_{i-1}$. How do we deal with it?
Rather than dealing with the entire history up to a certain word, we can approximate it using only the past few words! This is the premise behind n-gram models: we approximate the entire past history using the past $n$ words.
\[p(w_i|w_1,\dots,w_{i-1})\approx p(w_i|w_{i-(n-1)},\dots,w_{i-1})\]A unigram model looks like $p(w_i)$; a bigram model looks like $p(w_i\vert w_{i-1})$; a trigram model looks like $p(w_i\vert w_{i-1},w_{i-2})$. Intuitively, a unigram model looks at no prior words; a bigram model looks only at the previous word; a trigram model looks at only the past two words. Now let’s see if it’s easier to compute these conditional distributions using the same counting equation.
\[\begin{align*} p(w_i|w_{i-1})&=\displaystyle\frac{C(w_{i-1}, w_i)}{\displaystyle\sum_w C(w_{i-1}, w)}\\ &\to\displaystyle\frac{C(w_{i-1}, w_i)}{C(w_{i-1})} \end{align*}\]We go to the second line by using maximum likelihood estimation. Computing these counts is much easier! To see this, let’s actually compute an n-gram model by hand using a very small corpus.
\[\texttt{<SOS>}\text{I am Sam}\texttt{<EOS>}\] \[\texttt{<SOS>}\text{Sam I am}\texttt{<EOS>}\]Practically, we use special tokens that denote the start of the sequence (<SOS>) and end of sequence (<EOS>). The <EOS> token is required to normalize the conditional distribution into a true probability distribution. The <SOS> token is optional but it becomes useful for sampling the language model later so we’ll add it. Treating these as two special tokens, let’s compute the bigram word counts and probabilities by hand.
$w_i$ | $w_{i-1}$ | $p(w_i\vert w_{i-1})$ |
---|---|---|
I | <SOS> | $\frac{1}{2}$ |
Sam | <SOS> | $\frac{1}{2}$ |
<EOS> | Sam | $\frac{1}{2}$ |
I | Sam | $\frac{1}{2}$ |
Sam | am | $\frac{1}{2}$ |
<EOS> | am | $\frac{1}{2}$ |
am | I | $1$ |
Concretely, let’s see how to compute $p(\text{I}\vert\text{Sam})$. Intuitively, this is asking for the likelihood that “I” follows “Sam”. In our corpus, we have two instances of “Sam” and the words that follow are “<EOS>” and “I”. So overall, the likelihood is $\frac{1}{2}$. Notice how the conditionals form a valid probability distribution, e.g., $\sum_w p(w\vert\text{Sam}) = 1$.
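As a sanity check, a few lines of Python (my own sketch, separate from the nltk example later in the post) reproduce the hand-computed bigram table:

```python
from collections import Counter

corpus = [["<SOS>", "I", "am", "Sam", "<EOS>"],
          ["<SOS>", "Sam", "I", "am", "<EOS>"]]

bigram_counts = Counter()
prefix_counts = Counter()
for sent in corpus:
    for prev, w in zip(sent, sent[1:]):
        bigram_counts[(prev, w)] += 1
        prefix_counts[prev] += 1

def p(w, prev):
    # MLE bigram estimate: C(prev, w) / C(prev)
    return bigram_counts[(prev, w)] / prefix_counts[prev]

print(p("I", "Sam"))   # → 0.5
```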
With this model, we can approximate the full language model with a product of n-grams. Consider bigrams:
\[\begin{align*} p(w_1,\dots, w_N)&\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\\ p(\text{the cat sat on the mat}) &\approx p(\text{the}|\texttt{<SOS>})p(\text{cat}|\text{the})\cdots p(\texttt{<EOS>}|\text{mat}) \end{align*}\]This is a lot more tractable! So now we have an approximation of the language model! What other kinds of things can we do? We can sample from language models. We start with the <SOS> token and then use the conditionals to sample. We can either keep sampling until we hit an <EOS> or keep sampling for a fixed number of words. This is also why we have an <SOS> token: if we didn’t, we’d need to specify a start word ourselves. But since we used <SOS>, we have a uniform start token.
Now that we’ve covered the maths, let’s talk about some practical aspects of language modeling. The first problem we can address is what we just talked about: approximating a full language model with the product of n-grams.
\[p(w_1,\dots, w_N)\approx p(w_2|w_1)p(w_3|w_2)\cdots p(w_N|w_{N-1})\]What’s the problem with this? Numerically, when we multiply a bunch of probabilities together, we’re multiplying numbers in $[0, 1]$, which means the product gets smaller and smaller and risks underflowing to 0. To avoid this, we work in log space:
\[\exp\Big[\log p(w_2|w_1)+\log p(w_3|w_2)+\cdots+\log p(w_N|w_{N-1})\Big]\]In the log-space, multiplying is adding so the number just gets increasingly negative rather than increasingly small. Then we can take the exponential to “undo” the log-space.
Going beyond the numerical aspects, practically, language models need to be trained on a large corpus because of sparsity. After we train, two major problems we encounter in the field are unknown words not in the training corpus and words that are known but used in an unknown context.
For the former, when we train language models, we often construct a vocabulary during training. This can either be an open vocabulary where we add words as we see them or a closed vocabulary where we agree on the words ahead of time (perhaps the most common $k$ words for example). In either case, during inference, we’ll encounter out-of-vocabulary (OOV) words. One solution to this is to create a special token called <UNK> that represents unknown words. For any OOV word, we map it to the <UNK> token and treat it like any other token in our vocabulary.
What about known words in an unknown context? Let’s consider how we compute bigrams.
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)}{C(w_{i-1})}\]Mathematically, the problem is that the numerator can be zero. So the simplest solution is to make it not zero by adding $1$. But we can’t simply add $1$ without correcting the denominator since we want a valid probability distribution. So we also need to add something to the denominator. Since we’re adding $1$ to each count for each word, we need to add a count for the total number of words in the vocabulary $V$.
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+1}{C(w_{i-1})+V}\]With this, we’re guaranteed not to have zero counts! This is called Laplace Smoothing. The issue with this kind of smoothing is that it shifts too much probability mass toward unseen events since we’re just blindly adding $1$. We can generalize it so that we instead add some $k$ (and normalize by $kV$) to ease the probability mass more gently toward unseen contexts.
\[p(w_i|w_{i-1})=\displaystyle\frac{C(w_{i-1},w_i)+k}{C(w_{i-1})+kV}\]This is called Add-$k$ Smoothing. With an appropriately tuned $k$, it can perform better than Laplace Smoothing in most cases.
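A quick sketch of add-$k$ smoothing on toy counts (the corpus counts and vocabulary here are invented for illustration):

```python
from collections import Counter

bigram_counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
prefix_counts = Counter({"the": 3})
vocab = ["the", "cat", "dog", "mat"]      # toy vocabulary

def smoothed_p(w, prev, k=1.0):
    # add-k: (C(prev, w) + k) / (C(prev) + k * |V|)
    return (bigram_counts[(prev, w)] + k) / (prefix_counts[prev] + k * len(vocab))

# the unseen bigram ("the", "mat") now gets non-zero probability,
# and the distribution over the vocabulary still sums to 1
total = sum(smoothed_p(w, "the") for w in vocab)
```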
One alternative to smoothing is to use a shorter context when the full one isn’t available. The intuition is that if we can’t find a bigram to estimate $p(w_i\vert w_{i-1})$, we can see if a unigram estimate $p(w_i)$ exists that we can use in its place. This technique is called backoff because we back off to a smaller n-gram.
Going a step further, we don’t necessarily have to back off to only the $(n-1)$-gram: we can always consider all lower-order n-grams and take a linear combination of them.
\[\begin{align*} p(w_i|w_{i-2},w_{i-1})&=\lambda_1 p(w_i)+\lambda_2 p(w_i|w_{i-1})+\lambda_3 p(w_i|w_{i-2},w_{i-1})\\ \displaystyle\sum_i \lambda_i &= 1 \end{align*}\]Here the $\lambda_i$s are the interpolation coefficients and they have to sum to $1$ to create a valid probability distribution. This allows us to consider all previous n-grams in the absence of data. Backoff with interpolation works pretty well in practice.
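A minimal sketch of the interpolated estimate (the component probabilities and $\lambda$ values are illustrative; in practice the $\lambda_i$s are tuned on held-out data):

```python
def interpolated_p(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # linear interpolation of unigram, bigram, and trigram estimates;
    # the lambdas must sum to 1 so the result is still a probability
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```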
We’ve been talking about the theory of language models and n-gram models for a while but let’s actually try training one on a dataset and use it to generate text! Fortunately since they’ve been around for a while, training them is very simple with existing libraries.
from torchtext.datasets import AG_NEWS
import re
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
N = 6
data = AG_NEWS(root='.', split='train')
train, vocab = padded_everygram_pipeline(N,
    [re.sub(r'[^A-Za-z0-9 ]+', '', x[1]).split() for x in data])
lm = MLE(N)
lm.fit(train, vocab)
print(' '.join(lm.generate(20, random_seed=4)))
We’re using the AG_NEWS
dataset that contains 120,000 training examples of news articles across World, Sports, Business, and Science/Tech. The padded_everygram_pipeline
adds the <SOS> and <EOS> tokens and creates n-grams and backoff n-grams; we’re using 6-grams which tend to work well in practice. For simplicity, we ignore any non-alphanumeric character besides spaces. Then we use a maximum likelihood estimator (similar to the conditional distribution tables we created above) to train our model. Finally, we can generate some examples of length 20.
I tried a bunch of different seeds and here are a few cherry-picked examples (I’ve truncated them after the <EOS> token):
These look pretty good for just an n-gram model! Notice they retain some information, probabilistically, across the sequence. For example, in the first one, the word “infertile” comes before “birth” since, when generating “birth”, we could see “infertile” in our previous history.
But I also found scenarios where the generated text didn’t really make any sense. Here are some of those lemon-picked examples:
These are sometimes short phrases or nonsensical with random digits. In one case, the language model just generated a bunch of <EOS> tokens! These examples also help show why neural language models tend to outperform simplistic n-gram models in general. Feel free to change the dataset and generate your own sentences!
Large Language Models (LLMs) are gaining traction online as being able to perform complex and sequential reasoning tasks. They’re often treated as black-box models but understanding a bit about how they work can make it easier to interact with them. Starting from the beginning, we learned a bit about language itself, why this problem is so difficult, and why it wasn’t solved decades ago. We introduced language modeling as a task of assigning a probability to a sequence of words based on how likely it is to appear in the dataset. Then we learned about how $n$-gram models approximate the full previous history of a particular word using only the past $n$ words. We can use these models for language modeling and sampling. We finally discussed some practical considerations when training language models including handling unknown words as well as backoff and interpolation.
There’s still a lot more to cover! This is just the start of our journey to the precipice of language modeling 🙂
In this post, we’ll take our existing notion of Lie Groups and extend them to perform calculus so we can compute derivatives to compute things like the covariance, as it relates to the latter half of Dr. Joan Solà’s work: A micro Lie theory for state estimation in robotics. We’ll start by defining the adjoint to relate the local and global frames since we’ll need it for later. Then we build up calculus by learning how to take derivatives on manifolds as well as covariances. Finally, we’ll take what we learned and arrive at the on-manifold state estimation equations.
In the previous post, we ended with defining the global and local frames and the $\oplus$ and $\ominus$ operators. However, since we have these two global and local frames, how do we relate them? Note that these might be at different places on the manifold, so we can’t simply use the $\Exp$ or $\Log$ operators directly. To relate them, let’s start from a reasonable identity: equate the left and right $\oplus$ operations.
\[X \oplus {}^Xv={}^Ev\oplus X\]Now let’s expand the $\oplus$ on both sides and simplify
\[\begin{align*} X \oplus {}^Xv & ={}^Ev\oplus X\\ \Exp({}^Ev)X &= X~\Exp({}^Xv)\\ \exp({}^Ev^\wedge) &= X\exp({}^Xv^\wedge)X^{-1}=\exp(X{}^Xv^\wedge X^{-1})\\ {}^Ev^\wedge &= X{}^Xv^\wedge X^{-1} \end{align*}\]Note that in the third line we use a property of the exponential map that $X\exp({}^Xv^\wedge)X^{-1}=\exp(X{}^Xv^\wedge X^{-1})$. In the last line, notice that we relate the tangent space $T_X M$ to the tangent space $T_E M$; in other words, we can bring a vector in the local frame to a vector in the global frame. This turns out to be a useful-enough operation that we give it a name: the adjoint:
\[\Ad_X : \mathfrak{m}\to\mathfrak{m}; v^\wedge\mapsto\Ad_Xv^\wedge\equiv X{}^Xv^\wedge X^{-1}\]The adjoint map sends vectors in the local frame to vectors in the global frame. Equivalently, we can say ${}^E v^\wedge=\Ad_X {}^X v^\wedge$. The adjoint at $X$ brings ${}^Xv^\wedge$ to ${}^Ev^\wedge$. Similar to the exponential map, this mapping is exact. From the definition, we can derive several properties:
This map also has properties
As a simple example, we can consider the set of rotations on the plane $SO(2)$. Since rotations on the plane commute everywhere, the left and right mappings lead to the same result, so the adjoint is just the identity: $\Ad_X=I$, i.e., $X\oplus {}^Xv={}^Ev\oplus X$.
As a more complex example, consider $SO(3)$. We know rotations in space don’t commute, but if we compute the adjoint, we can figure out how exactly they commute (in other words, which term is missing). To do this, let’s pick an arbitrary $[\omega]_\times\in\mathfrak{so}(3)$ and $R\in SO(3)$. We’ll remove these later since they’re arbitrary anyway. Instead of starting immediately with the final definition, it’s a bit more illustrative to start a few steps earlier in the adjoint derivation.
\[R\exp([\omega]_\times) = \exp([\Ad_R~\omega]_\times)R\]On the left, we have a rotation matrix times another rotation matrix, the latter expressed through its Lie algebra element $\omega$. In other words, we could have written $R’=\exp([\omega]_\times)$. But remember the adjoint operates on the Lie algebra (or corresponding vector space) so we need this extra decomposition. On the right side, we have commuted the two but applied the adjoint since it maps across vector spaces.
\[\begin{align*} R\exp([\omega]_\times) &= \exp([\Ad_R~\omega]_\times)R\\ \exp([\Ad_R~\omega]_\times)R &= R\exp([\omega]_\times)\\ \exp([\Ad_R~\omega]_\times) &= R\exp([\omega]_\times)R^{-1}\\ \exp([\Ad_R~\omega]_\times) &= \exp(R[\omega]_\times R^{-1})\\ [\Ad_R~\omega]_\times &= R [\omega]_\times R^{-1}\\ [\Ad_R~\omega]_\times &= [R\omega]_\times\\ \Ad_R &= R\\ \end{align*}\]In the second-to-last step we use a property of the $[\cdot]_\times$ operator: $R[\omega]_\times R^{-1}=[R\omega]_\times$. Also, in the last step, we removed the $\omega$ since it was arbitrary in the first place. So the adjoint of $SO(3)$ is the same as the rotation matrix $R$! This tells us how to relate commutations for 3D rotations.
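We can sanity-check this numerically. The following numpy sketch (not part of the original derivation) builds rotations with Rodrigues’ formula and verifies $R\exp([\omega]_\times) = \exp([R\omega]_\times)R$ for an arbitrary $R$ and $\omega$:

```python
import numpy as np

def skew(w):
    """Map a 3-vector to its skew-symmetric matrix [w]x in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

rng = np.random.default_rng(0)
R = so3_exp(rng.normal(size=3))   # an arbitrary rotation
w = rng.normal(size=3)            # an arbitrary tangent vector

# Ad_R = R: commuting exp past R maps omega to R @ omega.
lhs = R @ so3_exp(w)
rhs = so3_exp(R @ w) @ R
assert np.allclose(lhs, rhs)
```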
Now we have all of the pieces to develop calculus on Lie Groups, which we need to compute derivatives for optimization or any other kind of state estimation. The principle for calculus on Lie Groups is the same as the original motivation: we want to avoid working directly on the manifold and instead work in the tangent space. Tying this to state estimation, if we have a nonlinear motion model using Lie Groups, we need to compute Jacobians, which means we need calculus on Lie Groups.
Recall for a scalar function $f:\R\to\R$ the definition of a derivative is
\[f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}\]For a multivariate function $f:\R^n\to\R$, we can compute a gradient vector of partial derivatives:
\[\nabla f=\left[\frac{\p f}{\p x_1},\cdots,\frac{\p f}{\p x_n}\right]^T\]For a multivariate in-out function $f: \R^n\to\R^m$, we can compute a Jacobian matrix of partial derivatives:
\[J = \frac{\p \vec{f} }{\p \vec{x} } = \begin{bmatrix} \frac{\p f_1}{\p x_1} & \cdots & \frac{\p f_1}{\p x_n}\\ \vdots & \ddots & \vdots\\ \frac{\p f_m}{\p x_1} & \cdots & \frac{\p f_m}{\p x_n} \end{bmatrix}\]Note that the intermediate notation I used $\frac{\p \vec{f} }{\p \vec{x} }$ is not well-defined but intended to be illustrative. Now suppose we have a function $f:G\to G$ on a Lie Group. We want to compute $\frac{\D f}{\D X}$. In other words, we want to know how a wiggle in $X\in G$ wiggles $f(X)\in G$. But what does it mean to wiggle $X$? This was well-defined for a scalar but not for a group element.
The key idea is that we use some small wiggle $\vec\varepsilon$ in the tangent space of $X$ rather than $X$ itself and map that wiggle to the manifold using the exponential map.
Notationally, we can write something like
\[\begin{align*} \frac{ {}^X\D f}{\D X}&=\lim_{\vec\varepsilon\to 0}\frac{f(X\oplus\vec\varepsilon)\ominus f(X)}{\vec\varepsilon}\\ &=\lim_{\vec\varepsilon\to 0}\frac{\Log(f(X)^{-1}\circ f(X\cdot\Exp(\vec\varepsilon)))}{\vec\varepsilon}\\ &=\frac{\p}{\p\vec\varepsilon}\left[\Log(f(X)^{-1}\circ f(X\cdot\Exp(\vec\varepsilon)))\right]_{\vec\varepsilon=0}\\ \end{align*}\]Note that we had to “upgrade” $+$ to $\oplus$ and $-$ to $\ominus$ since we’re dealing with manifolds and tangent spaces. We’re being a bit sloppy with notation since vector division isn’t well-defined. If we want to be a bit more accurate, we should use $h\vec\varepsilon_i$ such that $h\in\R, \vert h\vert \ll 1$ where $\vec\varepsilon_i$ is a basis vector in the $i$ direction, and we take the limit with respect to $h$. Then we stack all of the $i$ bases.
\[\frac{ {}^X\D f}{\D X_i} =\lim_{h\to 0}\frac{f(X\oplus h\vec\varepsilon_i)\ominus f(X)}{h}\]Using that key idea, we’ve expressed variations in $X$ of $f(X)$ entirely in the tangent space. This Jacobian linearly maps tangent spaces $T_X M\cong\R^m\to T_{f(X)} M\cong\R^n$.
This new kind of derivative behaves similar to a normal derivative in that, for small variations:
\[f(X\oplus\vec\varepsilon)\approx f(X)\oplus\frac{\D f}{\D X}\vec\varepsilon\]To make the derivative more concrete, let’s try to compute the Jacobian of $SO(2)$ under the group action $Rv$, rotating a vector $v\in\R^2$ using a rotation matrix $R\in SO(2)$. Specifically, $f(R)=Rv$.
\[\begin{align*} \frac{ {}^R\D~ ~(Rv)}{\D R}&=\lim_{\theta\to 0}\frac{(R\oplus\theta)v\ominus Rv}{\theta}\\ &=\lim_{\theta\to 0}\frac{R~\Exp(\theta) v - Rv}{\theta}\\ &=\lim_{\theta\to 0}\frac{R(I + [\theta]_\times) v - Rv}{\theta}\\ &=\lim_{\theta\to 0}\frac{R[\theta]_\times v}{\theta}\\ &=\lim_{\theta\to 0}\frac{R[1]_\times v~\theta}{\theta}\\ &=R[1]_\times v\\ \end{align*}\]Note that since rotations in the plane commute, $R\ominus S=\theta_R - \theta_S$ where $\theta\in\R$ is the corresponding angle to the 2D rotation matrix $R\in SO(2)$. We also expand the exponential map using a Taylor series $\Exp(\theta)\approx I +[\theta]_\times$ since the higher-order terms vanish in the limit. We also use the fact that $[\theta]_\times=\theta\,[1]_\times$, which lets us factor the scalar $\theta$ out of the numerator. The other derivative is much simpler:
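We can double-check this Jacobian numerically with a finite difference (a quick numpy sketch, not part of the original derivation). Since $f$ maps into $\R^2$, the $\ominus$ in the numerator is ordinary subtraction:

```python
import numpy as np

def rot2(t):
    """2D rotation matrix R(t) in SO(2)."""
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

R = rot2(0.7)
v = np.array([1.0, 2.0])
h = 1e-6

# Finite difference of f(theta) = R Exp(theta) v at theta = 0.
# (Rotations in the plane commute, so R Exp(h) = rot2(0.7 + h).)
numeric = (rot2(0.7 + h) @ v - R @ v) / h

E = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # the generator [1]_x of so(2)
analytic = R @ E @ v
assert np.allclose(numeric, analytic, atol=1e-5)
```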
\[\frac{ {}^R\D~ ~(Rv)}{\D v}=R\]So far, we’ve been using the right $\oplus$ operator; this creates a mapping between local tangent spaces $T_X M\to T_{f(X)} N$. We could also define the left Jacobian $\frac{ {}^E\D f}{\D X}$ using the left $\oplus$ operator, which creates a mapping between global tangent spaces $T_E M\to T_{E} N$. The maths to define it is pretty straightforward, and we can relate the two using the adjoint.
\[\frac{ {}^E\D f}{\D X}\Ad_X=\Ad_{f(X)}\frac{ {}^X\D f}{\D X}\]So now we’re able to do calculus on Lie Groups by taking the derivative of a function with respect to a point on the manifold. Now for motion models, we can apply derivatives to compute the Jacobian of the motion model! Recall that for an on-manifold motion model, we take an initial pose $X_0$ and twists $v_i$ at some frequency $\Delta t_i$ and apply the exponential map iteratively:
\[\begin{align*} X_k&=X_0\oplus v_1\Delta t_1\oplus\cdots\oplus v_k\Delta t_k\\ &=X_0\Exp(v_1\Delta t_1)\cdots\Exp(v_k\Delta t_k)\\ \end{align*}\]The exponential map performs continuous integration on the manifold. However, with that motion model, we need to compute the derivative of the exponential map.
We’ll need some building blocks before computing things like the Jacobian of the exponential map and its inverse.
The first tool we’ll need is chain rule! This operates on Lie Groups exactly in the same way as ordinary calculus:
\[\frac{\D Z}{\D X} = \frac{\D Z}{\D Y}\frac{\D Y}{\D X}\]Next, we’ll need to prove the Jacobian of the inverse $f(X)=X^{-1}$ :
\[\begin{align*} \frac{\D X^{-1} }{\D X} &=\lim_{v\to 0}\frac{\Log[(X^{-1})^{-1}(X~\Exp(v))^{-1}]}{v}\\ &=\lim_{v\to 0}\frac{\Log(X~\Exp(v)^{-1}X^{-1})}{v}\\ &=\lim_{v\to 0}\frac{\Log(X~\Exp(-v)X^{-1})}{v}\\ &=\lim_{v\to 0}\frac{X~(-v)^{\wedge}X^{-1} }{v}\\ &=\lim_{v\to 0}\frac{\Ad_X(-v)}{v}\\ &=\lim_{v\to 0}\frac{-\Ad_X(v)}{v}\\ &=-\Ad_X\\ \end{align*}\]In the last step we removed $v$ since it was arbitrary. Now let’s prove composition $f(X,Y)=X\circ Y$ with respect to the first argument
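We can verify the inverse Jacobian $\frac{\D X^{-1}}{\D X}=-\Ad_X$ numerically on $SO(3)$, where $\Ad_X=X$. The following numpy sketch (my own check, not from the original post) finite-differences the definition directly:

```python
import numpy as np

def skew(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Exp: R^3 -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Log: SO(3) -> R^3, valid away from theta = pi."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    W = (R - R.T) * (theta / (2.0 * np.sin(theta)))
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

rng = np.random.default_rng(1)
X = so3_exp(0.5 * rng.normal(size=3))   # an arbitrary group element
v = rng.normal(size=3)                  # an arbitrary tangent direction
h = 1e-6

# Finite-difference the definition: Log[(X^{-1})^{-1} (X Exp(h v))^{-1}] / h.
numeric = so3_log(X @ np.linalg.inv(X @ so3_exp(h * v))) / h
analytic = -(X @ v)                     # -Ad_X v, since Ad_X = X for SO(3)
assert np.allclose(numeric, analytic, atol=1e-5)
```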
\[\begin{align*} \frac{\D}{\D X}(X\circ Y) &=\lim_{v\to 0}\frac{\Log[f(X,Y)^{-1} f(X\Exp(v), Y)]}{v}\\ &=\lim_{v\to 0}\frac{\Log[(XY)^{-1} X~\Exp(v) Y]}{v}\\ &=\lim_{v\to 0}\frac{\Log[Y^{-1} X^{-1} X~\Exp(v) Y]}{v}\\ &=\lim_{v\to 0}\frac{\Log[Y^{-1} \Exp(v) Y]}{v}\\ &=\lim_{v\to 0}\frac{[Y^{-1}~\Exp(v) Y]^\vee}{v}\\ &=\lim_{v\to 0}\frac{\Ad_{Y^{-1} }v}{v}\\ &=\Ad_{Y^{-1} }\\ &=\Ad_Y^{-1}\\ \end{align*}\]and with respect to the second argument
\[\begin{align*} \frac{\D}{\D Y}(X\circ Y) &=\lim_{v\to 0}\frac{\Log[f(X,Y)^{-1}\circ f(X, Y~\Exp(v))]}{v}\\ &=\lim_{v\to 0}\frac{\Log[(X\circ Y)^{-1}\circ XY~\Exp(v)]}{v}\\ &=\lim_{v\to 0}\frac{\Log[Y^{-1}X^{-1}XY~\Exp(v)]}{v}\\ &=\lim_{v\to 0}\frac{\Log[\Exp(v)]}{v}\\ &=\frac{v}{v}\\ &= I \end{align*}\]Now that we have these blocks, we can define the right Jacobian as the derivative of the exponential map in the local frame.
\[J_r(v)=\frac{ {}^X\D}{\D v}\Exp(v)\]And the left Jacobian as the derivative of the exponential map in the global frame.
\[J_l(v)=\frac{ {}^E\D}{\D v}\Exp(v)\]Like other global and local frame relations, we can relate the two using the adjoint
\[\Ad_{\Exp(v)}=J_l(v)J_r^{-1}(v)\]This is where things get really complicated because, even for known manifolds, computing the closed forms for these Jacobians is super difficult so I’ll have to gloss over the details.
Now that we have some building blocks, we can compute Jacobians for the remaining operations like $\Log$, $\oplus$, and $\ominus$:
\[\begin{align*} \frac{\D}{\D X}\Log(X)&=J_r^{-1}(\Log(X))\\ \frac{\D}{\D X}(X\oplus v)&=\Ad_{\Exp(v)}^{-1} & \frac{\D}{\D X}(Y\ominus X)=-J_l^{-1}(Y\ominus X)\\ \frac{\D}{\D v}(X\oplus v)&=J_r(v) & \frac{\D}{\D Y}(Y\ominus X)=J_r^{-1}(Y\ominus X)\\ \end{align*}\]These can be proven using the chain rule we showed earlier.
The last piece we’re missing is how to compute uncertainties on manifolds. Similar to a state estimate, uncertainty is also localized to the tangent space at some point (state estimate) $X$. We can define a mean $\bar{X}\in M$ and a perturbation $\sigma\in T_{\bar{X} } M$ in the tangent space at $\bar{X}$!
Then we can use $\ominus$ to compute uncertainties.
\[\begin{align*} X&=\bar{X}\oplus\sigma\\ \sigma &=X\ominus \bar{X} \end{align*}\]We can define a covariance in the local frame using the definition of covariance too:
\[{}^{X}\Sigma=\mathbb{E}[\sigma\sigma^T]=\mathbb{E}[(X\ominus \bar{X})(X\ominus \bar{X})^T]\]With this, we can define Gaussians on the manifold $\mathcal{N}(\bar{X},{}^{X}\Sigma)$. Note that the covariance is of the tangent perturbation.
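To make this tangible, here's a small numpy experiment (my own illustration, not from the original post) on $SO(2)$: sample perturbations $\sigma\sim\mathcal{N}(0,\Sigma)$ in the tangent space, wrap them onto the manifold with $\oplus$, recover them with $\ominus$, and check that the empirical covariance matches $\Sigma$:

```python
import numpy as np

def rot2(t):
    """2D rotation matrix for angle t."""
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def log2(R):
    """Log map SO(2) -> R: recover the angle."""
    return np.arctan2(R[1, 0], R[0, 0])

rng = np.random.default_rng(0)
Xbar = rot2(1.0)                             # the mean rotation
sigma = rng.normal(0.0, 0.2, size=50_000)    # tangent perturbations, Sigma = 0.04

samples = [Xbar @ rot2(s) for s in sigma]             # X = Xbar (+) sigma
rec = np.array([log2(Xbar.T @ X) for X in samples])   # sigma = X (-) Xbar
emp = float(np.mean(rec ** 2))                        # empirical covariance
assert abs(emp - 0.04) < 0.01
```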
Now we can get back to the question at hand: how do we perform motion integration on Lie Groups for things like EKFs? In the previous post we defined the motion model
\[\begin{align*} X_{i+1}&=X_i\oplus v=X_i\Exp(v)\\ P_{i+1}&=FP_{i}F^T+GW_iG^T \end{align*}\]where $F$ is the Jacobian of the motion model with respect to the state, $G$ is the Jacobian with respect to the perturbation, and $W_i$ is the covariance of the perturbation.
Now that we have the Jacobian blocks we can actually compute $F$ and $G$!
\[\begin{align*} F&=\frac{\D}{\D X}[X\oplus v] = \Ad_{\Exp(v)}^{-1}\\ G&=\frac{\D}{\D v}[X\oplus v] = J_r(v) \end{align*}\]With this, we have the full equations for state estimation on the manifold! Lie Groups don’t only work for EKFs though; we can apply the same logic to pose graph optimization or any other kind of optimization.
In this post, we wrapped up the discussion on Lie Groups by finishing on-manifold motion integration equations for state estimation. We started with defining the adjoint to relate the global and local frames. Then we took our familiar notion of calculus and extended it to work with Lie Groups. We also derived a few fundamental Jacobian blocks to use as a basis for more complicated derivatives. Using those blocks, we also were able to show how uncertainty propagates on a manifold. With all of that background, we were finally able to show the full equations of motion integration.
As I stated before, Lie Groups are pretty theoretical compared to other kinds of applied maths for engineering. Fortunately, there are libraries that abstract away the details of these implementations but it’s still important to know when Lie Groups might be useful. There’s still a lot more to Lie Groups but I’ve covered enough in these two posts for them to prove useful to you should you encounter a scenario where you’re on a manifold working with functions 🙂
There are a lot of really good resources out there to learn about Lie Groups, particularly from physics. However, I think most of them lack an initial motivation: they jump right into a definition without giving any concrete examples. The closest I’ve found is Dr. Joan Solà’s work: A micro Lie theory for state estimation in robotics, which I think does a really good job at explaining the topic practically. It has concrete examples along with proofs and derivations; it starts with just talking about group structure and then adds calculus later on instead of conflating the two at the beginning. But there were many things I had to look up or do by hand when I was going through it to fill in knowledge gaps and really understand the proofs. Nevertheless, I still really like that work and used it as one of my references when writing this series on Lie Groups; the structure of this series and some of the examples are inspired by that work (especially when we get to calculus on Lie Groups).
Lie Groups are a bit more theory-oriented than other kinds of maths, especially for engineers. It could be argued that you could go your entire engineering or (undergrad-level) physics career without ever using Lie Groups. This is partly true, but, for robotic state estimation, we’ll see why we can get a better result (rather than an approximate/error-filled one) if we’re aware of the structure of our problem.
As a meta-point, I’m breaking this up into two parts: this is the introductory part without any (or much) calculus and the next part will intersect calculus and Lie Groups to construct Jacobians and other structures.
Let’s start with the simple example of a vector on a plane. This could be the position and orientation of a robot. Suppose we get some new sensor update that says our robot has rotated by some amount $\phi$ and we want to rotate the vector by that amount.
We have the $x$ basis vector $v$ that we want to rotate by some $\phi$ to get $v’$.
How would we go about doing this? We need a way to transform the initial vector $v=(x,y)^T$ into the rotated vector $v’=(x’,y’)^T$. Since we’re dealing with a vector in a plane, we can do this using ordinary geometry if we draw some angles and remember some trig formulas. Without going through the trig, we end up with the following way to relate $v’$ and $v$.
\[\begin{align*} x' &= x \cos\phi - y\sin\phi\\ y' &= y \cos\phi + x\sin\phi\\ \end{align*}\]For convenience, we can write this in matrix form.
\[\begin{align*} \begin{bmatrix}x'\\y'\end{bmatrix} &= \begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\ v' &= R(\phi) v \end{align*}\]$R(\phi)\in\R^{2\times 2}$ is the 2D rotation matrix. Of course, we can plug in some known values and see if we get what we expect. Try plugging in $\phi=\frac{\pi}{2}$ and $(1,0)^T$, i.e., the $x$ basis vector, and the result should be $(0,1)^T$, i.e., the $y$ basis vector. We’ve basically rotated the $x$ basis vector into the $y$ basis vector!
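We can run that suggested check with a couple of lines of numpy:

```python
import numpy as np

def rot2(phi):
    """2D rotation matrix R(phi)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

# Rotating the x basis vector by pi/2 gives the y basis vector.
v = np.array([1.0, 0.0])
v_rot = rot2(np.pi / 2) @ v
assert np.allclose(v_rot, [0.0, 1.0])
```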
Suppose we have $v’$ that is $v$ rotated by $\phi$, and we rotate $v’$ again by $\gamma$ to get $v’’$.
If we have another rotation by angle $\gamma$ that we want to apply after the rotation to $\phi$, we can first apply $R(\phi)$ and then $R(\gamma)$.
\[\begin{align*} \begin{bmatrix}x''\\y''\end{bmatrix} &= \begin{bmatrix}\cos\gamma & - \sin\gamma \\ \sin\gamma & \cos\gamma\end{bmatrix}\begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\ v'' &= R(\gamma) R(\phi) v \end{align*}\]Notice the order in which we apply the rotations: right to left. We can also combine the two matrices into a single one and, with some trig, we find that the result is also a rotation matrix!
\[\begin{align*} \begin{bmatrix}x''\\y''\end{bmatrix} &= \begin{bmatrix}\cos(\gamma+\phi) & - \sin(\gamma+\phi) \\ \sin(\gamma+\phi) & \cos(\gamma+\phi)\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\ v'' &= R(\gamma + \phi) v \end{align*}\]Complementary to applying a rotation: what if we wanted to reverse/undo one? For example, to backtrack an orientation, we have to undo the existing rotation. To undo a rotation $R(\phi)$, we need to supply a matrix such that, when composed with $R(\phi)$, we get the identity matrix $I$, because multiplying any vector by the identity matrix gives the same vector back. Naturally, this is the inverse matrix $R(\phi)^{-1}$! However, matrix inverses aren’t free! We need to prove that a rotation matrix $R(\phi)$ has an inverse. In other words, we need to show it has a nonzero determinant, i.e., it is nonsingular. Let’s take the determinant of a general 2D rotation matrix $R(\phi)$:
\[\det\begin{bmatrix}\cos\phi & -\sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}=\cos^2\phi + \sin^2\phi = 1\]Using the trig identity $\cos^2\phi + \sin^2\phi = 1$, we’ve shown that every 2D rotation matrix has an inverse! This makes intuitive sense because there isn’t a value of $\phi$ that we couldn’t “undo” by rotating by the same amount in the opposite direction.
\[\begin{align*} \begin{bmatrix}x\\y\end{bmatrix} &= \begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}^{-1}\begin{bmatrix}\cos\phi & - \sin\phi \\ \sin\phi & \cos\phi\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}\\ v &= R(\phi)^{-1}R(\phi) v\\ v &= Iv\\ v &= v \end{align*}\]We’ve also discovered an implicit rule here: multiplying any vector by the identity matrix $I=R(0)$ doesn’t change the vector at all.
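The composition, determinant, and inverse properties are all easy to verify numerically (a quick numpy check for arbitrary angles):

```python
import numpy as np

def rot2(phi):
    """2D rotation matrix R(phi)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

phi, gamma = 0.4, 1.1

# Composition: R(gamma) R(phi) = R(gamma + phi).
assert np.allclose(rot2(gamma) @ rot2(phi), rot2(gamma + phi))

# Unit determinant, so the inverse always exists...
assert np.isclose(np.linalg.det(rot2(phi)), 1.0)

# ...and undoing a rotation gives the identity R(0) = I.
assert np.allclose(np.linalg.inv(rot2(phi)) @ rot2(phi), np.eye(2))
```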
Let’s take a second and recap what we’ve learned so far because, while it might not seem like it, we’ve learned a lot about how 2D rotations work.
This set of properties is so useful that we actually give them a name in mathematics: a group! Remember the topic of this series is about Lie Groups so we have to discuss groups! Now that we’ve demonstrated some properties of groups using 2D rotations, let’s generalize that into a formal definition of a group.
A group $(G, \circ)$ is a set $G$ and an operator $\circ$ such that any $X,Y,Z\in G$ obeys the following group axioms:
One other thing we saw was the action of the group on a vector: we multiplied the rotation matrix by a vector to rotate the vector. The action of a group on another set $V$, e.g., 2D vectors, has to be defined for every group and set since each action can be applied differently. More formally, the group action $\cdot$ can be defined as $\cdot: G\times V\rightarrow V; (X,v)\mapsto X\cdot v$ and has the following properties:
Now that we understand the axioms of a group, let’s phrase 2D rotations as a group: $G=\{\text{2D rotation matrices}\}$ and $\circ=\cdot$, i.e., matrix multiplication. To be a bit more specific, we showed earlier that all 2D rotation matrices have a determinant of exactly 1. This is why the set of all 2D rotation matrices is called $SO(2)$ for Special Orthogonal Group of 2 dimensions. What makes it special is the unit determinant. It’s a subgroup of the general Orthogonal Group $O(2)$, which is the set of orthogonal matrices, i.e., $R^T R=I=RR^T$. So we can more formally define $SO(2)=\{R\in\R^{2\times 2}\vert R^T R=I, \det R = 1\}$. Notice that the only time we make mention of the dimension is in how large the matrices are; more generally, we can define $SO(n)=\{R\in\R^{n\times n}\vert R^T R=I, \det R = 1\}$. We can verify that 2D rotation matrices are orthogonal with some more trig identities. We can also verify that all of the group axioms are satisfied for $SO(2)$.
I’ll also take this opportunity to show another representation of 2D rotations: unit-norm complex numbers $z=\cos\theta + i\sin\theta$. These are also easier to visualize than rotation matrices since we can plot them on the complex plane. In fact, if we take all possible values of $\theta$ and plot all unit norm $z$ vectors on the complex plane, we get the unit circle $S^1$!
All of the possible rotations on a plane can be represented as the circle group $S^1$. A particular rotation $z=\cos\theta + i\sin\theta$ can be represented as a complex number that lives on that circle.
To develop this even further as a group, $G=S^1$ and $\circ=\cdot$, i.e., complex multiplication. If we represent a 2D real vector as a complex number $v=x+iy$ and a rotation as $z=\cos\theta + i\sin\theta$, then we can rotate $v$ by $\theta$ by multiplying: $v’=zv$. Notice that this is closed under multiplication, the identity element is 1, and the inverse is the complex conjugate $z^\star$.
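In Python, complex numbers are built in, so this rotation-by-multiplication takes just a few lines (using Euler's formula $e^{i\theta}=\cos\theta+i\sin\theta$ to build $z$):

```python
import cmath
import math

# A rotation as a unit complex number z = cos(theta) + i sin(theta).
theta = math.pi / 2
z = cmath.exp(1j * theta)

v = 1 + 0j                  # the x basis vector as a complex number
v_rot = z * v               # rotate by theta: should give the y basis vector, i
assert abs(v_rot - 1j) < 1e-12

# The inverse rotation is the complex conjugate.
assert abs(z.conjugate() * v_rot - v) < 1e-12
```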
The translation group is an additive group that is simply $\R^n$.
As a more trivial example, consider the set of 2D translation $v=\displaystyle\begin{bmatrix}t_x\ t_y\end{bmatrix}^T\in\R^2=G$ and $\circ=+$. This is an example of an additive group. It’s closed under addition, the identity is 0, and the inverse is the negative $-t$.
Quaternions can be represented as an axis and rotation about that axis. One way to visualize them is their effect on a basis vector or as an axis and rotation on the unit sphere.
As a less trivial example, consider the set of unit quaternions $S^3$ (a 3-sphere/hypersphere). They are a representation of 3D rotations $SO(3)$. Another way to think about quaternions is using the “axis-angle” formulation where we have an axis $\mathbf{u}=u_x i + u_y j + u_z k$ (where $i,j,k$ are the base/unit quaternions such that $i^2=j^2=k^2=ijk=-1$) that represents the vector we’re rotating around and an angle $\theta$ that we’re rotating by. We put both of them together into a single object: $\mathbf{q}=\cos\frac{\theta}{2}+\mathbf{u}\sin\frac{\theta}{2}$. (We’ll see a derivation of this later.) The reason we need $i,j,k$ is because they obey a special relation that makes rotating vectors actually work. The group action is quaternion/complex multiplication. Applying a quaternion to a vector $\mathbf{v}= v_xi+v_yj+v_zk$ uses the double product $\mathbf{q}\mathbf{v}\mathbf{q}^\star$. It’s closed under that double product, the identity is 1, and the inverse is the complex conjugate $\mathbf{q}^\star$.
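Here's a small plain-Python sketch of the double product (my own illustration, using the Hamilton product on `(w, x, y, z)` tuples): rotating the $x$ basis vector by $\frac{\pi}{2}$ about the $z$-axis should give the $y$ basis vector.

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def qconj(q):
    """Quaternion conjugate q*."""
    w, x, y, z = q
    return (w, -x, -y, -z)

# q = cos(theta/2) + u sin(theta/2) with u = k (the z-axis), theta = pi/2.
theta = math.pi / 2
q = (math.cos(theta / 2), 0.0, 0.0, math.sin(theta / 2))
v = (0.0, 1.0, 0.0, 0.0)   # pure quaternion for the vector (1, 0, 0)

# The double product q v q* rotates the vector.
w, x, y, z = qmul(qmul(q, v), qconj(q))
assert abs(x) < 1e-12 and abs(y - 1.0) < 1e-12 and abs(z) < 1e-12
```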
Going back to the problem of robotic state estimation, we generally have a state that includes some orientation, for example, in 3D space. We receive sensor updates and accumulate that orientation. For example, Kalman Filters do this by literally adding up increments in the state over some time interval. Other kinds of state estimation use numerical optimization to solve for the state history so it can be corrected later after we learn more information. This generally takes an objective function $C(x)$ that measures the sum of squared errors, computes a derivative (Jacobian) $\frac{\d C}{\d x_i}\vert_{x_i=\hat{x_i}}$ for the current values of the parameters $\hat{x}$, and applies a tiny update $\Delta x$ to get new parameters. The cycle repeats until we’ve found the minimum of the function.
In a normal scenario, all three of the gimbals have all three degrees of freedom. However, during Gimbal Lock, we lose a degree of freedom because motions along two degrees of freedom only correspond to one motion.
For representing orientations in 3D, one option is to use Euler angles where we define an angle for roll, pitch, and yaw. This creates a vector in 3D space with exactly the same degrees of freedom as a 3D rotation. There’s nothing wrong with using Euler angles as a way to represent 3D rotation, however, we run into problems when we try to use them for optimization or accumulation. This is because of a problem called Gimbal Lock where we lose a degree of freedom, i.e., changing two variables leads to the same rotation. (More formally, we can think of Euler angles as a mapping of $\R^3$ into the set of 3D rotations $SO(3)$, but the derivative of this mapping isn’t always full-rank.)
However, we can avoid the problem of gimbal lock by using quaternions. But remember we’re not using just any quaternions, we’re using unit quaternions to represent 3D rotations. A general quaternion is $\mathbf{q}=\cos\frac{\theta}{2}+\mathbf{u}\sin\frac{\theta}{2}$ such that $\mathbf{u}=u_x i + u_y j + u_z k$, so we have 4 degrees of freedom $(\theta, u_x, u_y, u_z)$. But unit quaternions have the additional constraint of unit norm $\vert\vert\mathbf{q}\vert\vert=1$, which removes a degree of freedom (if we knew the values of 3 degrees of freedom, we could use the unit-norm equation to solve for the remaining one). So instead of a full 4D space, we actually have a constrained 3D surface in 4D, which is partly why unit quaternions are called $S^3$: they have 3 degrees of freedom! As an analogy, think about the unit circle $S^1$ for $SO(2)$. The unit circle is a 1D curve embedded in 2D governed by $x^2+y^2=1$: given either $x$ or $y$, we can compute the other using that equation. In other words, it’s a subspace embedded in a higher-dimensional space. Every point on that surface satisfies the constraint and any point off of that surface doesn’t.
If our optimizer sees all degrees of freedom for $S^1$, then we’ll get an update for both $x$ and $y$ that can move us off the circle.
But does our optimizer know that? Unconstrained optimization, by definition, is unconstrained! (In general, unconstrained optimization is easier than constrained optimization and has had more practical success.) If we hand the full quaternion to the optimizer, it’ll see all four degrees of freedom and produce a tiny update for each parameter. If we simply fold in that increment, then we’ll almost always end up off of the constrained surface. In other words, we’d end up with something that isn’t a unit quaternion and hence isn’t a 3D rotation. Before the next step of optimization, we’d have to “project” or “renormalize” it back into a unit quaternion, which induces some error!
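A tiny sketch makes the problem concrete (the `step` values below are a made-up stand-in for an optimizer's per-parameter update): naively adding the increment leaves the unit sphere, so we have to re-project.

```python
import math

def norm(q):
    """Euclidean norm of a quaternion stored as a 4-tuple."""
    return math.sqrt(sum(c * c for c in q))

q = (math.cos(0.3), 0.0, 0.0, math.sin(0.3))   # a unit quaternion
assert abs(norm(q) - 1.0) < 1e-12

# Naively folding in a per-parameter update leaves the constraint surface.
step = (0.05, -0.02, 0.01, 0.03)               # hypothetical optimizer step
q_new = tuple(a + b for a, b in zip(q, step))
assert abs(norm(q_new) - 1.0) > 1e-3           # no longer a unit quaternion!

# We must re-project (renormalize) back onto S^3, inducing some error.
n = norm(q_new)
q_proj = tuple(c / n for c in q_new)
assert abs(norm(q_proj) - 1.0) < 1e-12
```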
If we look at a line tangent to the sphere, we can define an increment $\theta$ on that line and find a way to project that onto the circle.
Instead, what if we parameterized the constrained surface so that we hand the optimizer only the exact degrees of freedom it can actually optimize over? Consider 2D rotations $SO(2)$ and $S^1$. For rotations on a plane, we really only need a single variable $\theta$ instead of two numbers for the complex representation or four for the rotation matrix. We could hand the optimizer the single $\theta$ and project that angle onto the unit circle.
Examples of manifolds are $\R^n$ and $S^n$: they’re locally flat at a point. Examples of spaces that aren’t manifolds are cones and planes with lines through them because the tip of the cone and the point where the line intersects the plane aren’t locally flat.
As it turns out, there already exists a mathematical structure that encodes exactly what we’re trying to do: a manifold. Manifolds are complicated structures in their own right, and I actually have another series explaining them in detail (Part 1, Part 2, Part 3) so I won’t go over them again. Feel free to read those posts to understand their construction, but I’ll just give the more basic intuition here. A manifold is a space that is required to be flat locally but not globally. Some examples are $\R^n$: it’s both locally flat and globally flat! Another well-known example is the sphere $S^2$. At a point, a sphere is flat (in other words $\R^2$), but globally, it’s not flat; in fact, it has intrinsic curvature. A few examples of spaces that aren’t manifolds are cones or a plane with a line going through it. This is because the point of a cone and the point where the line intersects the plane are not locally flat. Similar to a circle, a sphere is another example of a constrained surface: we only need two coordinates to specify a point on a sphere, but it can be embedded in a 3D space.
In general, the most interesting manifolds are smooth, i.e., continuous and infinitely differentiable. Going back to the example of a circle, if we took a derivative at a point, we’d get a tangent line with one degree of freedom. Specifically, if we consider the circle in the complex plane and took a derivative at $\theta=0$, we’d get the complex line $i\R$ which has one degree of freedom and is a flat space. Another name for this is the Tangent Space at a point $T_p M$. One way to intuitively construct it is by considering some curve on the manifold $\lambda(t) : \R\to M$ (in the case of the circle, it’s the circle itself!) and taking a derivative $\frac{\d}{\d t}$. (I discuss a more formal way to construct this in my other post on manifolds.) The tangent space has a few properties as a result of its construction:
In the context of Lie Groups, another name for the tangent space is the Lie Algebra $\mathfrak{m} = T_E M$. We specifically call the Lie Algebra the tangent space at the identity $E$ only because every Lie Group, by definition, is guaranteed to have an identity element. Remember that the structure of tangent space is the same at all points on the manifold so it really doesn’t matter which point we pick, but the identity is the most convenient element that every Lie Group is guaranteed to have.
More formally, we can define a tangent space $T_p M$ at a point $p$ on a manifold $M$ as the set of all directional derivatives of all scalar functions through $p$.
Ideally, we want the optimizer to only operate in the tangent space since it has exactly the same degrees of freedom as the manifold itself. Before talking about how the optimizer would do this, let’s see a few examples of tangent spaces.
For the circle group $S^1$, the tangent space $T_E M$ is a line $i\R$ and elements of that tangent space are scalars $\theta\in i\R=T_E M$.
Let’s explore the tangent space of 2D rotation $SO(2)$. To do this, we need to identify a curve on the circle that we can take the derivative of. We can use the fact that all rotation matrices have the constraint that $R^T R = I$, i.e., orthogonal columns. We can replace the $R$s with parameterized curves $R(t)$ to get $R(t)^T R(t) = I$ and take the derivative $\frac{\d}{\d t}$.
\[\begin{align*} \frac{\d}{\d t}[R(t)^T R(t)] &= \frac{\d}{\d t} I\\ R(t)^T \frac{\d}{\d t} R(t) + \frac{\d}{\d t}[R(t)^T] R(t) &= 0\\ R(t)^T \frac{\d}{\d t} R(t) + \left(\frac{\d}{\d t}R(t)\right)^T R(t) &= 0\\ R(t)^T \frac{\d}{\d t} R(t) &= -\left(\frac{\d}{\d t}R(t)\right)^T R(t)\\ R(t)^T \frac{\d}{\d t} R(t) &= -\left(R(t)^T \frac{\d}{\d t} R(t)\right)^T\\ A &= -A^T\\ \end{align*}\]Between the first and second lines, we use the product rule to expand the product. Then we use the fact that the derivative commutes with the transpose. Next, we move the second term to the right-hand side. Finally, we transpose the right-hand side so that we end up with an equation of the form $A=-A^T$. Without the minus sign, this would be the constraint for a symmetric matrix $A=A^T$! Because of the minus sign, we call matrices that obey this constraint skew-symmetric matrices. By the way, nothing we’ve done so far has been specific to $SO(2)$: as it turns out, this is the same constraint for $SO(3)$ and even $SO(n)$. But going back to $SO(2)$, we’ve found that the Lie Algebra, i.e., the structure of the tangent space, called $\mathfrak{so}(2)$, is the set of $2\times 2$ skew-symmetric matrices.
The general form for $2\times 2$ skew-symmetric matrices looks like
\[\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}=\theta\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}=\theta E_\theta\in\mathfrak{so}(2)\]We call $E_\theta$ the generator of $\mathfrak{so}(2)$ because we can write every element in terms of $E_\theta$. Think of it as a “basis matrix”. From this formulation, we can take any $\theta\in\R$ and map it to $\theta E_\theta\in\mathfrak{so}(2)$ uniquely. This means that there’s a unique mapping between $\R$ and $\mathfrak{so}(2)$ so we can choose to use either space, whichever is convenient for us. For the optimizer, it would be most convenient to use the $\theta\in\R$ space. We can create a notation $[\theta]_\times$ to define this mapping as
\[[\cdot]_\times : \R\to\mathfrak{so}(2);~\theta\mapsto\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}\]With $SO(3)$, we can follow the exact same procedure to end up with the set of $3\times 3$ skew-symmetric matrices for its Lie Algebra $\mathfrak{so}(3)$. The general form of those looks like
\[\begin{align*} \begin{bmatrix}0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0\end{bmatrix}&=\omega_x\begin{bmatrix}0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0\end{bmatrix}+\omega_y\begin{bmatrix}0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0\end{bmatrix}+\omega_z\begin{bmatrix}0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0\end{bmatrix}\\ &=\omega_x E_x+\omega_y E_y+\omega_z E_z \end{align*}\]Note that we have 3 degrees of freedom $\omega_x, \omega_y, \omega_z$ and thus 3 generators $E_x, E_y, E_z$. So instead of just $\R$, the degrees of freedom can be grouped into a vector $\omega=[\omega_x, \omega_y, \omega_z]^T\in\R^3$. Just like with $\mathfrak{so}(2)$ and $\R$, the degrees of freedom match the dimension of the flat space. We reuse the same notation to denote converting a vector $\omega\in\R^3$ into a skew-symmetric matrix in $\mathfrak{so}(3)$: $[\omega]_\times$.
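To make the $[\cdot]_\times$ maps and the generators concrete, here’s a minimal numpy sketch (the helper names `hat_so2` and `hat_so3` are my own, not from any particular library):

```python
import numpy as np

def hat_so2(theta):
    """Map a scalar theta in R to a 2x2 skew-symmetric matrix in so(2)."""
    return np.array([[0.0, -theta],
                     [theta, 0.0]])

def hat_so3(omega):
    """Map a vector omega in R^3 to a 3x3 skew-symmetric matrix in so(3)."""
    wx, wy, wz = omega
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

# The generators E_x, E_y, E_z are the images of the standard basis vectors.
E = [hat_so3(e) for e in np.eye(3)]

omega = np.array([0.1, -0.2, 0.3])
# [omega]_x decomposes as omega_x E_x + omega_y E_y + omega_z E_z
assert np.allclose(hat_so3(omega), sum(w * G for w, G in zip(omega, E)))
# Skew-symmetry: A = -A^T
assert np.allclose(hat_so3(omega), -hat_so3(omega).T)
```

The corresponding vee maps would just read the three entries back out of the matrix, which is exactly why the two spaces are isomorphic.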
Between the tangent space $T_E M=\mathfrak{m}$ and flat space $\R^n$, we can define isomorphisms that exactly map between the two spaces.
In general, not all Lie Algebras are skew-symmetric matrices, but we can define an isomorphism, i.e., a bijection/one-to-one correspondence, that maps between $\mathfrak{m}\leftrightarrow \R^n$.
\[\begin{align*} \mathrm{Hat} : \R^n\to\mathfrak{m} &;~v\mapsto v^\wedge\\ \mathrm{Vee} : \mathfrak{m}\to \R^n &;~v^\wedge\mapsto (v^\wedge)^\vee=v \end{align*}\]In other words, $v$ is some element in a flat space $\R^n$ and $v^\wedge$ is some element of the Lie Algebra. As an example, for $SO(2)$ and $\mathfrak{so}(2)$, we can define these operators in terms of $[\cdot]_\times$.
\[\begin{align*} \mathrm{Hat}: \R\to\mathfrak{so}(2) &;~\theta\mapsto \theta^\wedge = [\theta]_\times\\ \mathrm{Vee}: \mathfrak{so}(2)\to \R &;~[\theta]_\times \mapsto [\theta]^\vee_\times=\theta \end{align*}\]For $SO(3)$ and $\mathfrak{so}(3)$, we can define the same kinds of operators, except using $\R^3$ and $\mathfrak{so}(3)$.
\[\begin{align*} \mathrm{Hat}: \R^3\to\mathfrak{so}(3) &;~\omega\mapsto \omega^\wedge = [\omega]_\times\\ \mathrm{Vee}: \mathfrak{so}(3)\to \R^3 &;~[\omega]_\times \mapsto [\omega]^\vee_\times=\omega \end{align*}\]With these functions, we now have a way to map our degree-of-freedom flat space $\R^n$ into the tangent space/Lie Algebra of the particular Lie Group we’re working with. In the case of 2D rotations, we only have a single degree of freedom $\theta$ that we can project out into the Lie Algebra of $2\times 2$ skew-symmetric matrices. However, we’re still missing a way to project the Lie Algebra onto the Lie Group manifold. Let’s figure out how (and why).
Recall that our problem with state estimation was that our representations for orientation were either overparameterized (quaternions or rotation matrices) or not suitable for optimization/integration (Euler angles). However, having learned about manifolds and tangent spaces, we can let our optimizer move around in the tangent space, where we have the same degrees of freedom as the manifold: no more, no less. After the optimizer computes the derivatives, we get some gradient vector $\Delta x\in\R^n$ that represents the tiny update for all of our parameters. Since we’re at some point $\hat{x}$ on the manifold, this update $\Delta x$ lives in the tangent space!
At any stage of optimization, we have the current values of the parameters $\hat{x}$. Giving that to our optimizer along with the Jacobians, we’ll get some $\Delta x$ for all parameters that lives in the tangent space $T_\hat{x} M$. We can’t blindly apply the update so we want to project that onto the manifold $M$.
To get the next value of the parameters, we need to add/accumulate $\Delta x$ into $\hat{x}$. What we’d do is just add $\hat{x}+\Delta x$, which almost certainly puts it off the constrained surface, and “reproject” it back onto the manifold so that the solution obeys the constraints. Rinse and repeat until we converge. The problem is that the “reprojection” induces some error. Ideally, we want to perform this mapping from $\mathfrak{m}\to M$ exactly, without any error. Then, after we get a parameter update $\Delta x$, we can apply that mapping and get the next value of the parameters that are guaranteed to obey the constraints, i.e., they remain on the manifold.
In other words, given some vector $v$ or $v^\wedge\in T_p M$, we want to relate it to some $X\in M$. If we consider rotation groups and go back to the definition of the Lie Algebra: $R(t)^T \frac{\d}{\d t}R(t)=\omega^\wedge=R(t)^{-1} \frac{\d}{\d t}R(t)$ (for orthogonal matrices, $R^T=R^{-1}$), then we have an equation relating an element of the Lie Algebra $\omega^\wedge$ and an element of the Lie Group $R(t)$. Isolating $\frac{\d}{\d t}R(t)$ to one side, we get the differential equation:
\[\frac{\d}{\d t}R(t) = R(t)\omega^\wedge\]This is an ordinary differential equation in $t$ whose solution is well-known (if you took a differential equations class, this was probably the first solution you saw):
\[R(t) = R(0)\exp(\omega^\wedge t)\]Since $R(t)\in M$ and $R(0)\in M$, then $\exp(\omega^\wedge t)\in M$. Since the structure of the tangent space is the same at all points, we can actually set $R(0)=E=I$ to get $R(t)=\exp(\omega^\wedge t)$. So it seems the way to relate a $\omega^\wedge\in T_p M$ and $R(t)$ is via $\exp$. We call this the exponential map: a function that sends elements of $\mathfrak{m}$ to $M$ exactly, with no error or approximation (i.e., the solution to the differential equation is analytical). Naturally, we can reverse the operation by taking a $\log$ and can define the logarithmic map as a function that maps $M$ to $\mathfrak{m}$ exactly.
\[\begin{align*} \exp: \mathfrak{m}\to M &; v^\wedge\mapsto X=\exp(v^\wedge)\\ \log: M\to\mathfrak{m} &; X\mapsto v^\wedge=\log(X) \end{align*}\]Intuitively, we can think of these maps as “wrapping” and “unwrapping” the vector along the manifold. To be more precise, this creates a geodesic at $p$ whose tangent vector is $v$. A geodesic is a generalization of a “straight line” or “shortest distance” path on a manifold. In $\R^n$, geodesics are lines. However, on other kinds of manifolds, these are generally not lines. For example, on the sphere $S^2$, geodesics are “great circles”: circles on the sphere whose center is the center of the sphere. This is because “straight lines” don’t generally exist on arbitrary manifolds, so we have to compromise and pick the “as straight as possible” line. The formal way to derive geodesics is to use calculus of variations and solve for the function that minimizes the distance between two points on the manifold given the manifold metric. We’re not going to do that here, but look at my other series on manifolds for more intuition.
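We can sanity-check the two claims above numerically: the matrix exponential of a skew-symmetric matrix solves the differential equation and lands exactly on the manifold, while naively Euler-integrating $\frac{\d}{\d t}R = R\omega^\wedge$ drifts off of it (exactly the “reprojection error” we’re trying to avoid). A quick sketch using scipy’s generic `expm`:

```python
import numpy as np
from scipy.linalg import expm

w = 0.7                                # angular rate (one DOF for SO(2))
W = np.array([[0.0, -w], [w, 0.0]])    # w^ in so(2)
t = 2.0

# Exact solution of dR/dt = R W with R(0) = I: the matrix exponential.
R_exact = expm(W * t)

# Naive Euler integration of the same ODE...
R_euler = np.eye(2)
dt = 1e-3
for _ in range(int(t / dt)):
    R_euler = R_euler + R_euler @ W * dt

# ...drifts off the manifold (R^T R != I), while expm stays exactly on it.
assert np.allclose(R_exact.T @ R_exact, np.eye(2))
assert not np.allclose(R_euler.T @ R_euler, np.eye(2))
```

The Euler result is close to a rotation but not exactly one, and the error compounds with every step; the exponential map has no such problem.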
Now that we’ve defined the exponential and logarithmic maps, we have the full picture where we can convert between the flat space $\R^n$, the Lie Algebra/tangent space $T_p M=\mathfrak{m}$, and the manifold $M$.
Let’s look at a few concrete examples of the exponential map starting with $SO(2)$. Recall that all $2\times 2$ skew-symmetric matrices are of the form
\[\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix} = \theta E_\theta\]Applying the exponential map:
\[\exp\left(\begin{bmatrix}0 & -\theta \\ \theta & 0\end{bmatrix}\right) = \exp(\theta E_\theta)\]But what does it mean to take the exponential of a matrix? Remember that $\exp$ can be written as a Taylor series!
\[\exp(x) = \sum_{k=0}^\infty\frac{x^k}{k!}=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\cdots\]We can take powers of square matrices so the matrix exponential is well-defined. Expanding it out, we get:
\[\exp(\theta E_\theta) = I+\theta E_\theta+\frac{\theta^2}{2!}E_\theta^2+\frac{\theta^3}{3!}E_\theta^3+\cdots\]To expand this further, we need to compute matrix products $E_\theta^k$. Let’s start by computing the first two:
\[\begin{align*} E_\theta &= \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}\\ E_\theta^2 &= \begin{bmatrix}-1 & 0 \\ 0 & -1\end{bmatrix} = -I\\ \end{align*}\]An interesting property of skew-symmetric matrices is that the powers are cyclic and we actually only need $E_\theta$ and $E_\theta^2$. Here’s the pattern:
\[\begin{align*} E_\theta^0 &= I&\\ E_\theta^1 &= E_\theta & E_\theta^2 &= E_\theta^2\\ E_\theta^3 &= -E_\theta & E_\theta^4 &= -E_\theta^2\\ E_\theta^5 &= E_\theta&\\ \cdots \end{align*}\]Applying this cycling to the Taylor series, we get:
\[\begin{align*} \exp(\theta E_\theta) &= I+\theta E_\theta+\frac{\theta^2}{2!}E_\theta^2-\frac{\theta^3}{3!}E_\theta-\frac{\theta^4}{4!}E_\theta^2+\cdots\\ &= I+E_\theta\left(\theta-\frac{\theta^3}{3!}+\frac{\theta^5}{5!}+\cdots\right) + E_\theta^2\left(\frac{\theta^2}{2!}-\frac{\theta^4}{4!}+\cdots\right)\\ &= I + E_\theta\sin\theta + E_\theta^2(1-\cos\theta)\\ &=\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} + \begin{bmatrix}0 & -\sin\theta\\ \sin\theta & 0\end{bmatrix} + \begin{bmatrix}\cos\theta-1 & 0\\ 0 & \cos\theta-1\end{bmatrix}\\ &=\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix} \end{align*}\]In the first step, we’ve regrouped the terms by $E_\theta$ and $E_\theta^2$. Then we notice that the two series are actually convergent Taylor series for $\sin\theta$ and $1-\cos\theta$. This is the general strategy when dealing with Taylor series: expand it out, regroup the terms, and condense it using other known Taylor series. After that, we can expand $I$, $E_\theta$, and $E_\theta^2$ into matrices and solve for the end result and get a 2D rotation matrix! So the exponential map for $SO(2)$ maps a scalar $\theta\in\R$ into a 2D rotation matrix $R\in SO(2)$!
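Here’s a quick numerical check of that derivation (a sketch with my own helper name `exp_so2`): the closed form $I + E_\theta\sin\theta + E_\theta^2(1-\cos\theta)$ matches both the explicit 2D rotation matrix and a truncated Taylor series of $\exp(\theta E_\theta)$.

```python
import math
import numpy as np

E = np.array([[0.0, -1.0],
              [1.0,  0.0]])            # the generator E_theta

def exp_so2(theta):
    """Closed-form exponential map for SO(2)."""
    return np.eye(2) + E * np.sin(theta) + E @ E * (1.0 - np.cos(theta))

theta = 0.8
R = exp_so2(theta)

# It's exactly the 2D rotation matrix...
expected = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
assert np.allclose(R, expected)

# ...and the truncated Taylor series sum_k (theta E)^k / k! converges to it.
series = sum(np.linalg.matrix_power(theta * E, k) / math.factorial(k)
             for k in range(20))
assert np.allclose(series, R)
```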
For the circle group $S^1$, the exponential map exactly sends an element of the tangent space to an element of the group. The logarithmic map does the opposite: it maps an element of the group into the tangent space at a point.
For $SO(3)$, the procedure is almost exactly the same, except we parameterize the input in axis-angle form $\theta[\omega]_\times$, where $\omega\in\R^3$ is a unit axis and $\theta$ is the rotation angle. Since $[\omega]_\times$ is also a skew-symmetric matrix, the same power cycling happens, and we end up with essentially the same result.
\[\exp(\theta[\omega]_\times)=I+[\omega]_\times\sin\theta+[\omega]_\times^2(1-\cos\theta)\]This formula is so important that it’s actually called the Rodrigues Rotation Formula. As it turns out, quaternions have the same kind of result (except with a factor of 2 to account for the double product).
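We can verify the Rodrigues formula against the generic matrix exponential; here’s a sketch (`hat` and `rodrigues` are my own helper names, and the axis must be a unit vector):

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    """[w]_x: R^3 -> so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(axis, theta):
    """Rodrigues rotation formula; `axis` must be a unit vector."""
    K = hat(axis)
    return np.eye(3) + K * np.sin(theta) + K @ K * (1.0 - np.cos(theta))

axis = np.array([1.0, 2.0, 2.0]) / 3.0   # a unit axis
theta = 1.2
R = rodrigues(axis, theta)

assert np.allclose(R, expm(hat(theta * axis)))   # matches the full series
assert np.allclose(R.T @ R, np.eye(3))           # orthogonal...
assert np.isclose(np.linalg.det(R), 1.0)         # ...with determinant +1
```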
Using the isomorphisms and the $\exp$/$\log$ maps, we can exactly map between $M$, $T_p M$, and $\R$.
Note that all of these exponential maps are exact. There’s no approximation! We’re exactly condensing the infinite series using convergent Taylor series. Now that we’ve seen some concrete examples, we can use the same formula to derive a few properties (that I won’t prove directly).
\[\begin{align*} \exp((a+b)v^\wedge)&=\exp(av^\wedge)\exp(bv^\wedge)\\ \exp(av^\wedge)&=\exp(v^\wedge)^a\\ \exp(-v^\wedge)&=\exp(v^\wedge)^{-1}\\ \exp(X v^\wedge X^{-1}) &= X\exp(v^\wedge)X^{-1} \end{align*}\]As a shortcut, we can define $\Exp$ and $\Log$ operators that use $\exp$ and $\log$ and map directly between $\R$ and $M$.
\[\begin{align*} \Exp: \R^n\to M &; v\mapsto X=\Exp(v)\equiv\exp(v^\wedge)\\ \Log: M\to\R^n &; X\mapsto v=\Log(X)\equiv\log(X)^\vee \end{align*}\]We can define shortcut isomorphisms $\Exp$/$\Log$ that map directly between $M$ and $\R$.
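Here’s a sketch of the shortcut maps for $SO(3)$, plus a numerical spot-check of a couple of the $\exp$ properties above. I’m leaning on scipy’s generic `expm`/`logm` rather than the closed forms, and `hat`/`vee` are my own helper names:

```python
import numpy as np
from scipy.linalg import expm, logm

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def vee(W):
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def Exp(v):
    """R^3 -> SO(3): Exp(v) = exp(v^)."""
    return expm(hat(v))

def Log(X):
    """SO(3) -> R^3: Log(X) = log(X)^vee (for rotation angles < pi)."""
    return vee(np.real(logm(X)))

v = np.array([0.4, -0.2, 0.1])
a, b = 0.7, 1.3

# exp((a+b)v^) = exp(a v^) exp(b v^)  (same axis, so the factors commute)
assert np.allclose(Exp((a + b) * v), Exp(a * v) @ Exp(b * v))
# exp(-v^) = exp(v^)^{-1}
assert np.allclose(Exp(-v), np.linalg.inv(Exp(v)))
# Log inverts Exp
assert np.allclose(Log(Exp(v)), v)
```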
As another convenience, we can define $\oplus$ and $\ominus$ that use $\Exp$ and $\Log$ as well as group composition. But since not all group operations commute, we need to define left and right operations. We can define the right ones as:
\[\begin{align*} \oplus &: Y=X\oplus {}^X v\equiv X\Exp({}^Xv)\in M\\ \ominus &: {}^Xv=Y\ominus X\equiv\Log(X^{-1}Y)\in T_X M\\ \end{align*}\]The left ones are defined as:
\[\begin{align*} \oplus &: Y={}^E v\oplus X\equiv \Exp({}^Ev)X\in M\\ \ominus &: {}^Ev=Y\ominus X\equiv\Log(YX^{-1})\in T_E M\\ \end{align*}\]We can define additional shortcut notation to perform on-manifold “addition” and “subtraction”. Since not all group operations are commutative, we need two operations: one for left and one for right operations.
Note that the left and right $\oplus$ are distinguished by the order of the operands, but $\ominus$ is ambiguous. Another thing to note is the left superscript: $E$ means the “global frame” while $X$ means the “local frame”. The structure of all $T_p M$ is identical so it really doesn’t matter what we call the global and local frames, but, since every Lie Group has an $E$, we choose it as the consistent “global frame” and everything else is a “local frame”. The usefulness of this construct is that we can use the right $\oplus$ to define perturbations in the local frame: when our optimizer has a little update $\Delta x$, that happens in the local frame of the current set of parameters $\hat{x}$.
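Here’s what the right $\oplus$/$\ominus$ pair looks like for $SO(3)$ as a sketch (using scipy’s `expm`/`logm` for the $\Exp$/$\Log$ maps; for rotation matrices $X^{-1}=X^T$, and `oplus`/`ominus` are my own helper names):

```python
import numpy as np
from scipy.linalg import expm, logm

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def vee(W):
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def Exp(v):
    return expm(hat(v))

def Log(X):
    return vee(np.real(logm(X)))

def oplus(X, v):
    """Right-plus: apply a perturbation v expressed in X's local frame."""
    return X @ Exp(v)

def ominus(Y, X):
    """Right-minus: the local-frame perturbation taking X to Y."""
    return Log(X.T @ Y)          # X^{-1} = X^T for rotation matrices

X = Exp(np.array([0.3, -0.1, 0.5]))       # current parameter estimate
dx = np.array([0.02, 0.01, -0.03])        # small optimizer update, local frame
Y = oplus(X, dx)

assert np.allclose(ominus(Y, X), dx)      # ominus recovers the perturbation
assert np.allclose(Y.T @ Y, np.eye(3))    # the update stayed on the manifold
```

This is exactly the shape of an on-manifold optimizer step: perturb in the local tangent space, land back on the manifold with no reprojection.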
While there’s still (at least) a Part 2 to this series, we’ve covered enough to perform some motion integration or, at least, set up the problem. For robot state estimation in a 2D space, we have both a 2D translation and a rotation. The Lie Group corresponding to that combination is called $SE(2)$, the Special Euclidean Group in 2 dimensions: the set of rigid motions in 2D, whose operations handle the translation and rotation jointly.
\[X=\begin{bmatrix}R & t \\ 0 & 1\end{bmatrix}\]where $R\in SO(2)$ and $t\in\R^2$. Just like with other Lie Groups, we can define the Lie Algebra and exponential map for $SE(2)$ as well. In the context of state estimation, we start with some pose $X\in SE(2)$. At some fixed time step $\Delta t$, we get translational and rotational data from our sensors, e.g., the inertial measurement unit (IMU) and wheel encoders of our robot. If we integrate that over the time step, we get a small translation $\Delta p$ and a small angle $\Delta\theta$. This increment lives in the tangent space at $X$, i.e., the current pose we’re at. If we want to integrate that measurement to get a new pose, we need to use the exponential map to ensure that we have a valid rotation at each step.
Starting at $X_0$, we receive a number of sensor measurements in that local frame and can incorporate each one into our pose using the $\oplus$ operator.
Starting with $X$, we get some increment $v = \begin{bmatrix}\Delta p & \Delta\theta\end{bmatrix}^T\in T_X M$ over some time step. To integrate it into the current pose, we use the $\oplus$ operator.
\[X_{i+1}=X_i\oplus v=X_i\Exp(v)\]This is a simple equation, but it builds on everything we’ve learned so far. If we have a sequence of these increments, we can fold them in through the group operation.
\[X_{i}=X_0\oplus v_1\oplus v_2\oplus\cdots \oplus v_i\]This allows us to take sensor measurements in the local frame and apply them exactly to the pose we’re at to get a new pose that obeys the orientation constraints. The only thing we’re missing is the propagation of uncertainties. For most state estimation, in addition to the poses, we also have some estimate of uncertainty, either implicit or explicit. Using those uncertainties, however, requires us to perform calculus since we have to compute the Jacobian of the state propagation function, i.e., $\oplus$! We’ll get to that next time!
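Putting the pieces together, here’s a sketch of exact $SE(2)$ motion integration. I use the generic `scipy.linalg.expm` on the $3\times 3$ Lie Algebra element rather than the closed-form $SE(2)$ exponential, and the increments are made-up stand-ins for integrated odometry:

```python
import numpy as np
from scipy.linalg import expm

def se2_hat(v):
    """v = [dx, dy, dtheta] -> 3x3 element of the Lie Algebra se(2)."""
    dx, dy, dth = v
    return np.array([[0.0, -dth, dx],
                     [dth,  0.0, dy],
                     [0.0,  0.0, 0.0]])

def oplus(X, v):
    """Exact on-manifold integration: X_{i+1} = X_i Exp(v)."""
    return X @ expm(se2_hat(v))

X = np.eye(3)                                     # start at the identity pose
increments = [np.array([0.1, 0.0, 0.05])] * 100   # body-frame odometry deltas

for v in increments:           # X_i = X_0 (+) v_1 (+) v_2 (+) ... (+) v_i
    X = oplus(X, v)

R = X[:2, :2]
# Even after 100 steps, the rotation block is still exactly a valid rotation:
assert np.allclose(R.T @ R, np.eye(2))
assert np.isclose(np.linalg.det(R), 1.0)
```

Unlike Euler-style integration of a rotation matrix, no renormalization or “reprojection” step is ever needed here.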
In this post, I introduced Lie Groups using rotations. We first defined 2D rotations using just plain geometry. We used that intuition to define groups and their axioms. Then I gave some intuition about the other part of Lie Groups: manifolds. As part of manifolds, we also constructed tangent spaces and saw how to map between the tangent space and its corresponding flat space. Beyond the tangent space, we defined the exponential map to go between the tangent space and the manifold itself. Finally, we saw how to apply our new way of thinking to motion integration.
Lie Groups are more theoretical than most other kinds of engineering work, and they do represent a different way of thinking about rotation. However, armed with this new knowledge, we can manipulate rotations and other Lie Groups in an error-free way. The other part that we have yet to cover is how to perform calculus on Lie Groups; the optimizer computes derivatives/Jacobians, after all. Just like with the exponential map, we want to stay in the tangent space because it has the same degrees of freedom as the manifold; in other words, we want to compute variations solely in the tangent space. After we figure that out, we can really perform motion integration and optimization on the manifold. We’ll get to that in the next post! 😀
To discuss curvature, we’ll need some extra constructs. Curvature in a flat space involves taking second derivatives, but we haven’t actually discussed how to do calculus on manifolds. Partial derivatives and gradients have so far only counted as basis vectors, not as calculus operations. But maybe they can do both. Let’s ask an important question about the partial derivative: does it transform like a tensor? If it does, we can simply use it as the primary method of doing calculus on manifolds. If not, then we need to invent some kind of derivative operator that does transform like a tensor. Let’s find out the answer by applying a coordinate transform to the partial derivative $\p_{\mu'}$ acting on a vector $V$:
\[\begin{align*} \frac{\p}{\p x^{\mu'}}V^{\nu'}&=\Big(\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\Big) \Big(\frac{\p x^{\nu'}}{\p x^{\nu}}V^\nu\Big)\\ &=\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu} \Big(\frac{\p x^{\nu'}}{\p x^{\nu}}V^\nu\Big)\\ &=\frac{\p x^\mu}{\p x^{\mu'}}\Big(\frac{\p x^{\nu'}}{\p x^\nu} \frac{\p}{\p x^{\mu}}V^\nu+V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^\nu}\Big)\\ &=\underbrace{\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu} \frac{\p}{\p x^{\mu}}V^\nu}_\text{transforms like a tensor}+\underbrace{V^\nu\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^\nu}}_\text{doesn't transform like a tensor}\\ \end{align*}\](Note: going from the second to the third line, we used the product rule since $\frac{\p}{\p x^\mu}$ is a derivative operator.)
It doesn’t seem like partial derivatives transform like tensors! So it’s not a good derivative operator for us to do calculus on manifolds, unfortunately. We’ll have to invent our own derivative operator such that it produces a tensor when acting on vectors, duals, and tensors. What kind of properties do we want in a “good” derivative operator?
Since the partial derivative almost transforms like a tensor, except for the non-tensorial part, we can use it as the base but add a correction to account for the non-tensorial part. Actually, if we closely inspect the non-tensorial part, it seems to be taking the derivative of the basis; in other words, it accounts for the basis changing from point to point. We need a correction for each component, so that means we need a linear transform for each one. Therefore, the general form of the correction is a set of $n$ matrices $(\Gamma_\mu)^\nu_\lambda$. The outer upper and lower indices mean this is a linear transform, and the inner lower index indicates we have $n$ of them.
We define the covariant derivative $\nabla$ as a generalization of the partial derivative to arbitrary coordinates. We can think of it as the partial derivative with a correction for the changing basis. As it turns out (and as we’ll soon prove), the correction matrices $\Gamma^\nu_{\mu\lambda}$ do not transform like tensors, so we don’t have to be so careful about the index placement: we can’t raise and lower indices on $\Gamma^\nu_{\mu\lambda}$ anyway. But in which basis does the correction happen? Well, we might as well use the same basis used to define the vector we’re operating on; after all, it’s right there! With that, we can mathematically define the covariant derivative.
\[\nabla_\mu V^\nu\equiv\underbrace{\p_\mu V^\nu}_\text{partial}+\underbrace{\Gamma^\nu_{\mu\lambda}V^\lambda}_\text{correction}\]The correction matrices are special enough that we call them the connection coefficients or Christoffel symbols. Another way to think about this is that the covariant derivative tells us the change in $V^\nu$ in the $\mu$ direction. The complete geometric picture won’t make sense until we discuss parallel transport and geodesics soon, but I’ll present it here with some hand-waving.
There are a few key actors to understanding the geometry of the covariant derivative. The first is having a vector $V$ at a point $p$. We have another point $q$ and a different value of $V$ at that point. Remember that vector fields are defined at each point on the manifold. The $\mu$ represents the tangent vector to some curve at $p$ that connects to $q$. If we were to take $V$ and move it along the curve in such a way to keep it “as straight as possible”, we’d end up with a different vector $V_{||}$ at $q$. The covariant derivative is just the difference between $V$ at $q$ and the “translated” vector $V_{||}$. Don’t worry if this doesn’t make perfect sense now; we’ll revisit this when we have a more rigorous definition of moving a vector “as straight as possible” along a curve.
The point to remember is that the connection coefficients are the correction matrices, i.e., the non-tensorial part.
\[\begin{align*} \Gamma^\nu_{\mu\lambda}&=\text{change in }\p_\mu\text{ caused by }\lambda\text{ in the }\p_\nu\text{ direction.}\\ &=\frac{\p^2 x^\nu}{\p x^\mu\p x^\lambda} \end{align*}\]I’ve said multiple times now that the connection coefficients represent the non-tensorial part so are they actually tensors? It turns out they are not. Let’s see why. First, let’s start with the above definition of the covariant derivative acting on a vector $V$.
\[\begin{align*} \nabla_\mu V^\nu &= \p_\mu V^\nu + \Gamma_{\mu\lambda}^\nu V^{\lambda}\\ \nabla_{\mu'} V^{\nu'} &= \p_{\mu'} V^{\nu'} + \Gamma_{\mu'\lambda'}^{\nu'} V^{\lambda'} \end{align*}\]Now we’re going to simply demand that the covariant derivative transform like a tensor.
\[\nabla_{\mu'} V^{\nu'} = \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\nabla_\mu V^\nu\]Since we’re inventing the covariant derivative for the sole purpose of being a tensorial operator on a manifold, demanding this constraint is a reasonable thing to do. Now we need to expand this equation to write the primed connection coefficients in terms of the unprimed ones. To start, let’s just consider the left-hand side and transform what we can from the primed to the unprimed coordinates.
\[\begin{align*} \nabla_{\mu'} V^{\nu'} &= \p_{\mu'} V^{\nu'} + \Gamma_{\mu'\lambda'}^{\nu'} V^{\lambda'}\\ &=\frac{\p x^\mu}{\p x^{\mu'}}\p_\mu\Big(\frac{\p x^{\nu'}}{\p x^\nu} V^\nu \Big) + \Gamma_{\mu'\lambda'}^{\nu'} \frac{\p x^{\lambda'}}{\p x^{\lambda}}V^{\lambda}\\ &=\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^{\nu}}\p_\mu V^\nu + \frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda \end{align*}\]Just like we figured out the other tensor transformation rules, let’s expand the primed coordinates in terms of the unprimed ones using coordinate transforms. For the time being, let’s leave the connection coefficients untransformed since we don’t yet know how to transform them. Taking the above equation and adding back the right-hand side:
\[\require{cancel} \begin{align*} \nabla_{\mu'} V^{\nu'} &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\nabla_\mu V^\nu\\ \cancel{\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^{\nu}}\p_\mu V^\nu} + \frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}(\cancel{\p_\mu V^\nu} + \Gamma_{\mu\lambda}^\nu V^{\lambda})\\ \frac{\p x^\mu}{\p x^{\mu'}}V^\nu\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\nu}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu V^{\lambda} \end{align*}\]We want to remove $V$ since it was arbitrary from the start, but we can’t since the indices don’t match up. We can make them match by relabeling $\nu$ to $\lambda$; this is completely legal since $\nu$ in $V^\nu$ and $\lambda$ in $V^\lambda$ are both dummy indices that we can relabel to anything convenient so let’s relabel everything to be $\lambda$ and get rid of $V$ entirely (and move the primed connection coefficients to one side of the equation and use second-order derivatives).
\[\begin{align*} \frac{\p x^\mu}{\p x^{\mu'}}V^\lambda\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}}V^\lambda &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu V^{\lambda}\\ \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}} + \Gamma_{\mu'\lambda'}^{\nu'}\frac{\p x^{\lambda'}}{\p x^{\lambda}} &= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu\\ \frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p}{\p x^\mu}\frac{\p x^{\nu'}}{\p x^{\lambda}}\\ \frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}} \end{align*}\]We’re almost done isolating the primed coordinates in terms of the unprimed coordinates, but we need to get rid of the leading $\frac{\p x^{\lambda’}}{\p x^\lambda}$ on the left-hand side. A convenient strategy for removing terms of this form is to exploit the property of the Kronecker delta: $\frac{\p x^{\lambda}}{\p x^{\rho’}}\frac{\p x^{\lambda’}}{\p x^{\lambda}}=\delta_{\rho’}^{\lambda’}$. So we can multiply both sides by $\frac{\p x^{\lambda}}{\p x^{\rho’}}$ and get a Kronecker delta on the left-hand side that we can replace by swapping indices:
\[\begin{align*} \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^{\lambda'}}{\p x^{\lambda}} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\ \delta_{\rho'}^{\lambda'} \Gamma_{\mu'\lambda'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\ \Gamma_{\mu'\rho'}^{\nu'}&= \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu - \frac{\p x^{\lambda}}{\p x^{\rho'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}\\ \end{align*}\]Now we can finally relabel $\rho’$ to $\lambda’$ to be more consistent with the original notation. This is also legal to do since $\rho’$ is also a dummy index that we’re free to relabel.
\[\Gamma_{\mu'\lambda'}^{\nu'} = \underbrace{\frac{\p x^{\lambda}}{\p x^{\lambda'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p x^{\nu'}}{\p x^\nu}\Gamma_{\mu\lambda}^\nu}_{\text{tensorial-like}} - \underbrace{\frac{\p x^{\lambda}}{\p x^{\lambda'}}\frac{\p x^\mu}{\p x^{\mu'}}\frac{\p^2 x^{\nu'}}{\p x^\mu\p x^{\lambda}}}_{\text{non-tensorial-like}}\]From this equation, we see that the first term seems to look like a valid transform; however, the second term is some second-order quantity that ruins the ability for the connection coefficients to transform like tensors. If that second term was zero, then we could say the connection coefficients transform like a tensor, but, from its existence, we can say that the connection coefficients do not transform like tensors. In fact, we can even say that the connection coefficients are intentionally non-tensorial to cancel the non-tensorial part of the partial derivative that we saw earlier. The consequence of non-tensorial terms means we can’t raise or lower indices on the connection coefficients with the metric tensor, but it also means we can be more haphazard with the index placement and leave one upper and two lower indices 😉
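To make the connection coefficients less abstract, consider the flat plane described in polar coordinates. For a flat space in curvilinear coordinates, the coefficients can be computed from second derivatives of the coordinate functions, matching the second-derivative intuition from earlier: $\Gamma^\nu_{\mu\lambda} = \frac{\p x^\nu}{\p y^a}\frac{\p^2 y^a}{\p x^\mu \p x^\lambda}$, where the $y^a$ are the Cartesian coordinates. A small sympy sketch (the helper name `gamma` is my own):

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
polar = [r, th]                               # x^mu: curvilinear coordinates
cart = [r * sp.cos(th), r * sp.sin(th)]       # y^a(x): Cartesian coordinates

# J[a, mu] = dy^a/dx^mu, so Jinv[nu, a] = dx^nu/dy^a
J = sp.Matrix([[sp.diff(y, x) for x in polar] for y in cart])
Jinv = J.inv()

def gamma(nu, mu, lam):
    """Gamma^nu_{mu lambda} = (dx^nu/dy^a) d^2 y^a / (dx^mu dx^lambda)."""
    return sp.simplify(sum(Jinv[nu, a] * sp.diff(cart[a], polar[mu], polar[lam])
                           for a in range(2)))

assert gamma(0, 1, 1) == -r          # Gamma^r_{theta theta} = -r
assert gamma(1, 0, 1) == 1 / r       # Gamma^theta_{r theta}  = 1/r
assert gamma(0, 0, 0) == 0           # most components vanish
```

These are the familiar polar-coordinate coefficients: they correct for the basis vectors $\p_r$ and $\p_\theta$ changing from point to point, even though the underlying space is flat.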
So far, we’ve shown the action of the covariant derivative on vectors, but what about its action on covectors? If we can figure out how to apply it to both vectors and covectors, we can generalize its action on arbitrary tensors. Similar to what we did with vectors, we can simply demand that the result of the covariant derivative transforms like a tensor.
\[\nabla_\mu\omega_\nu = \p_\mu\omega_\nu + \Theta_{\mu\nu}^\lambda\omega_\lambda\]We’re using $\Theta$ because, at this point, we have no reason to believe that $\Theta$ and $\Gamma$ are related. Spoiler alert: they are! In order to apply the covariant derivative to covectors, we need to impose/demand two more constraints:
Like last time, we can apply a covector to a vector to get a scalar.
\[\begin{align*} \nabla_\mu(\omega_\lambda V^\lambda) &= (\nabla_\mu\omega_\lambda)V^\lambda + \omega_\lambda(\nabla_\mu V^\lambda)\\ &= (\p_\mu\omega_\lambda + \Theta_{\mu\lambda}^\sigma\omega_\sigma)V^\lambda + \omega_\lambda(\p_\mu V^\lambda+\Gamma_{\mu\rho}^\lambda V^\rho)\\ &= \p_\mu\omega_\lambda V^\lambda + \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \omega_\lambda\p_\mu V^\lambda+ \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho\\ \end{align*}\]From the second constraint on the covariant derivative, we know that the left-hand side of the above equation reduces to the partial derivative acting on a scalar.
\[\begin{align*} \nabla_\mu(\omega_\lambda V^\lambda) &= \p_\mu(\omega_\lambda V^\lambda)\\ &= \p_\mu\omega_\lambda V^\lambda + \omega_\lambda\p_\mu V^\lambda \end{align*}\]Now let’s set both sides of the equation equal to each other to cancel out terms (and isolate $\Theta$).
\[\begin{align*} \cancel{\p_\mu\omega_\lambda V^\lambda} + \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \bcancel{\omega_\lambda\p_\mu V^\lambda} + \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho &= \cancel{\p_\mu\omega_\lambda V^\lambda} + \bcancel{\omega_\lambda\p_\mu V^\lambda}\\ \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda + \omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho &= 0\\ \Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda &= -\omega_\lambda\Gamma_{\mu\rho}^\lambda V^\rho\\ \end{align*}\](I’ve used two different kinds of slashes to note which of the like terms cancel.) To relate $\Theta$ and $\Gamma$, we need to get rid of $\omega$ and $V$. We can relabel them on the right-hand side by mapping $\lambda$ to $\sigma$ and $\rho$ to $\lambda$.
\[\Theta_{\mu\lambda}^\sigma\omega_\sigma V^\lambda = -\omega_\sigma\Gamma_{\mu\lambda}^\sigma V^\lambda\\\]Now we can remove $\omega$ and $V$.
\[\Theta_{\mu\lambda}^\sigma = -\Gamma_{\mu\lambda}^\sigma\\\]So $\Theta$ and $\Gamma$ are related by a negation! So we can make that substitution in the equation that applies the covariant derivative to covectors.
\[\nabla_\mu\omega_\nu \equiv \p_\mu\omega_\nu - \Gamma_{\mu\nu}^\lambda\omega_\lambda\]Take a second to compare the indices on the action of the covariant derivative on vectors versus covectors. For vectors, we have a positive connection coefficient whose second lower index contracts with the vector’s index. For covectors, we have a negative connection coefficient whose only upper index contracts with the covector’s index. With this observation, we can generalize to arbitrary tensors.
\[\begin{align*} \nabla_\lambda T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} &= \p_\lambda T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k}\\ &+ \Gamma_{\lambda\sigma}^{\mu_1}T_{\nu_1\cdots\nu_l}^{\sigma\mu_2\cdots\mu_k}+\Gamma_{\lambda\sigma}^{\mu_2}T_{\nu_1\cdots\nu_l}^{\mu_1\sigma\cdots\mu_k}+\cdots+\Gamma_{\lambda\sigma}^{\mu_k}T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_{k-1}\sigma}\\ &- \Gamma_{\lambda\nu_1}^{\sigma}T_{\sigma\nu_2\cdots\nu_l}^{\mu_1\cdots\mu_k}-\Gamma_{\lambda\nu_2}^{\sigma}T_{\nu_1\sigma\cdots\nu_l}^{\mu_1\cdots\mu_k}-\cdots-\Gamma_{\lambda\nu_l}^{\sigma}T_{\nu_1\cdots\nu_{l-1}\sigma}^{\mu_1\cdots\mu_{k}}\\ \end{align*}\]There’s a pattern here depending on how many upper and lower indices. Take a second to understand the pattern since it’ll be useful later.
To quickly recap, we’ve successfully defined the covariant derivative on arbitrary tensors. However, in each definition, we write the covariant derivative in terms of the connection coefficients, which, as a consequence of their non-tensorial-ness, are coordinate-dependent. There are many valid sets of connection coefficients, which means there are many valid covariant derivatives! This is a fundamental characteristic of the covariant derivative and the connection coefficients, but we can single out a unique connection if we impose two additional constraints: torsion-freedom and metric compatibility.
For a connection to be torsion-free, it must be symmetric in its lower indices.
\[\Gamma_{\mu\nu}^\lambda=\Gamma_{\nu\mu}^\lambda\]Given any connection $\Gamma_{\mu\nu}^\lambda$, we can immediately define another connection by permuting its lower indices: $\Gamma_{\nu\mu}^\lambda$. We define the torsion tensor as the difference between the two: $T_{\mu\nu}^\lambda = \Gamma_{\mu\nu}^\lambda - \Gamma_{\nu\mu}^\lambda = 2\Gamma_{[\mu\nu]}^\lambda$, so a torsion-free connection is exactly one whose torsion tensor vanishes. Interestingly, the torsion tensor is a valid tensor, even though it is composed of the non-tensorial connection. To see this, suppose we had two connections $\nabla$ and $\tilde{\nabla}$. Let’s apply both to an arbitrary vector $V^\lambda$ and take the difference.
\[\begin{align*} \nabla_\mu V^\lambda-\tilde{\nabla}_\mu V^\lambda &= \cancel{\p_\mu V^\lambda} + \Gamma_{\mu\nu}^\lambda V^\nu - \cancel{\p_\mu V^\lambda} - \tilde{\Gamma}_{\mu\nu}^\lambda V^\nu\\ &= (\Gamma_{\mu\nu}^\lambda - \tilde{\Gamma}_{\mu\nu}^\lambda) V^\nu\\ &= S_{\mu\nu}^\lambda V^\nu\\ \end{align*}\]Since the left-hand side is a tensor, the right-hand side must also be a tensor, which means $S_{\mu\nu}^\lambda$, the difference of the two connections, is also a tensor. Torsion is the special case of $S_{\mu\nu}^\lambda$ where $\tilde{\Gamma}_{\mu\nu}^\lambda = \Gamma_{\nu\mu}^\lambda$, i.e., the second connection is the first with its lower indices swapped.
Geometrically, we can think of torsion as the “twisting” of reference frames or a “corkscrew” of reference frames along a path. We’ll get a slightly better geometric interpretation after we discuss parallel transport soon.
The second constraint we enforce is metric compatibility, which says $\nabla_\rho g_{\mu\nu}=0$. In words, the metric is covariantly constant. We need this property so that the metric tensor commutes with the covariant derivative when raising and lowering indices: $g_{\mu\lambda}\nabla_\rho V^\lambda = \nabla_\rho V_\mu$. Like with the covariant derivative’s action on covectors, there’s no way to prove these two constraints; we simply demand that they be true.
Geometrically, metric compatibility means that, at each point, we can treat the metric as if its components were constant: the tangent space at that point looks flat, and the covariant derivative respects that local flatness.
Now that we have those two constraints, we can construct a unique connection from the metric using those two properties. Let’s first apply the covariant derivative to the metric tensor and set the result to zero (using metric compatibility). From that one equation, we can permute the indices to get two more equations.
\[\begin{align*} \nabla_\rho g_{\mu\nu} &= \p_\rho g_{\mu\nu} - \Gamma_{\rho\mu}^\lambda g_{\lambda\nu} - \Gamma_{\rho\nu}^\lambda g_{\mu\lambda} &= 0\\ \nabla_\mu g_{\nu\rho} &= \p_\mu g_{\nu\rho} - \Gamma_{\mu\nu}^\lambda g_{\lambda\rho} - \Gamma_{\mu\rho}^\lambda g_{\nu\lambda} &= 0\\ \nabla_\nu g_{\rho\mu} &= \p_\nu g_{\rho\mu} - \Gamma_{\nu\rho}^\lambda g_{\lambda\mu} - \Gamma_{\nu\mu}^\lambda g_{\rho\lambda} &= 0\\ \end{align*}\]Now we take the first equation and subtract the second and third equations. Then we can use the torsion-free property to cancel multiple terms, i.e., any connection coefficients with permuted lower indices.
\[\require{cancel} \begin{align*} \nabla_\rho g_{\mu\nu} &= \p_\rho g_{\mu\nu} - \cancel{\Gamma_{\rho\mu}^\lambda g_{\lambda\nu}} - \bcancel{\Gamma_{\rho\nu}^\lambda g_{\mu\lambda}} &= 0\\ -\nabla_\mu g_{\nu\rho} &= -\p_\mu g_{\nu\rho} + \Gamma_{\mu\nu}^\lambda g_{\lambda\rho} + \cancel{\Gamma_{\mu\rho}^\lambda g_{\nu\lambda}} &= 0\\ -\nabla_\nu g_{\rho\mu} &= -\p_\nu g_{\rho\mu} + \bcancel{\Gamma_{\nu\rho}^\lambda g_{\lambda\mu}} + \Gamma_{\nu\mu}^\lambda g_{\rho\lambda} &= 0\\ \end{align*}\]Adding the three lines together, we’re left with an equation containing a single connection coefficient after using the symmetry of its lower indices.
\[\begin{align*} \p_\rho g_{\mu\nu} - \p_\mu g_{\nu\rho} - \p_\nu g_{\rho\mu} + 2\Gamma_{\mu\nu}^\lambda g_{\lambda\rho} &= 0\\ \Gamma_{\mu\nu}^\lambda g_{\lambda\rho} &= \frac{1}{2}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\ \end{align*}\]To get rid of the extra $g_{\lambda\rho}$, we can multiply by $g^{\sigma\rho}$ and use the Kronecker delta.
\[\begin{align*} \Gamma_{\mu\nu}^\lambda g_{\lambda\rho}g^{\sigma\rho} &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\ \Gamma_{\mu\nu}^\lambda \delta_{\lambda}^{\sigma} &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\ \Gamma_{\mu\nu}^\sigma &= \frac{1}{2}g^{\sigma\rho}(\p_\mu g_{\nu\rho} + \p_\nu g_{\rho\mu} - \p_\rho g_{\mu\nu})\\ \end{align*}\]Finally we’ve written the connection coefficients in terms of the metric! This unique connection is called the Christoffel/Levi-Civita/Riemannian connection. This is the canonical connection that’s used often in general relativity and other fields so we have a “preferred” covariant derivative. It’s not necessary to use this particular connection, especially if there is another set of connection coefficients that makes the particular problem we’re studying easier, but this connection is often used because it’s convenient.
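To sanity-check this formula, here’s a small sympy sketch (the choice of sympy and of the unit 2-sphere metric $\d s^2 = \d\theta^2 + \sin^2\theta\,\d\phi^2$ is mine, purely for illustration) that computes the connection coefficients directly from the metric:

```python
import sympy as sp

# Coordinates and metric for the unit 2-sphere: ds^2 = dtheta^2 + sin^2(theta) dphi^2
theta, phi = sp.symbols('theta phi')
x = [theta, phi]
g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])
g_inv = g.inv()
n = 2

# Gamma^sigma_{mu nu} = (1/2) g^{sigma rho} (d_mu g_{nu rho} + d_nu g_{rho mu} - d_rho g_{mu nu})
def christoffel(sigma, mu, nu):
    return sp.simplify(sum(
        sp.Rational(1, 2) * g_inv[sigma, rho] * (
            sp.diff(g[nu, rho], x[mu])
            + sp.diff(g[rho, mu], x[nu])
            - sp.diff(g[mu, nu], x[rho])
        ) for rho in range(n)))

print(christoffel(0, 1, 1))  # Gamma^theta_{phi phi}: equal to -sin(theta)*cos(theta)
print(christoffel(1, 0, 1))  # Gamma^phi_{theta phi}: equal to cos(theta)/sin(theta)
```

The two nonzero coefficients come out to $\Gamma_{\phi\phi}^\theta = -\sin\theta\cos\theta$ and $\Gamma_{\theta\phi}^\phi = \cot\theta$, matching the standard result for the sphere.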
Now that we have a clear definition of a “preferred” covariant derivative, we can do calculus on a manifold like we could in a flat space! However, we quickly run into a problem: how do we compare vectors on a manifold? With scalars, we can compare two of them at different points on a manifold, but we can’t compare two different vectors at two different points on the manifold since they would be in different tangent spaces! The vector might actually be the same in one tangent space but look different in the other tangent space (but still related by a transform).
In a Cartesian space, if we have a vector $V$ and we move it along a path, it will forever have the same magnitude and direction. Some people say that vectors (in a Cartesian space) are just displacements that you can slide around the space because the displacement is relative: it doesn’t depend on where the arrow starts/ends. However, this is not true for curved coordinates.
In a flat space, we didn’t have to be this careful since we can arbitrarily move a vector from point to point while keeping it parallel with itself. If we took a vector and drew an arbitrary path for the vector to take, at each point along the path, the vector would point in exactly the same direction with the same magnitude! A consequence of this is that the path doesn’t matter: a long path and a short path with the same endpoints leave the vector exactly the same.
Since it seems to work in flat space, let’s try this idea on a manifold: take a vector in one tangent space and “transport” it to the other tangent space so that the two vectors are in the same tangent space while keeping the “transported” vector “as straight as possible”. This notion is called parallel transport. We have to say “as straight as possible” since, in a curved space, it’s not always possible to keep a vector pointed completely in the same direction with the same magnitude at each point along the path. In fact, it’s even worse than that because the path we take will change the resulting vector!
On a sphere, suppose we start at the equator with a vector pointing along the equator. Then we parallel transport that vector to the North Pole. Then we parallel transport it back to the equator on a different longitude. Finally, we parallel transport it along the equator back to its original position. We’ll find that it has rotated! It’s different from the original vector.
Even keeping the vector as straight as possible, the resulting vectors are pointed in completely different directions. Unfortunately, this is a fundamental fact about manifolds that we can’t get around with a clever trick or coordinate transform! But we can try to precisely define parallel transport and what we mean by “keeping the vector as straight as possible”. Mathematically, this means we want to keep the tensor components from changing as much as possible along the curve. Suppose we have a curve $x^\mu(\lambda)$ and an arbitrary tensor $T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k}$. Then keeping the components the same just means the derivative of the tensor along the path must vanish.
\[\frac{\d}{\d\lambda}T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = \frac{\d x^\sigma}{\d\lambda}\p_\sigma T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = 0\]However this isn’t quite tensorial because we have a partial derivative. We can make this tensorial by replacing the partial derivative with a covariant derivative (this is sometimes called the “comma goes to semicolon” rule if you denote partials with commas and covariant derivatives with semicolons, but I hate that notation), and we get the equation of parallel transport.
\[\frac{\d x^\sigma}{\d\lambda}\nabla_\sigma T_{\nu_1\cdots\nu_l}^{\mu_1\cdots\mu_k} = 0\]For convenience, we can define a parallel transport operator/directional covariant derivative using the covariant derivative and a tangent vector.
\[\frac{\D}{\d\lambda} = \frac{\d x^\sigma}{\d\lambda}\nabla_\sigma\]Going back to our original inquiry, let’s see what this equation looks like for a vector $V^\mu$.
\[\begin{align*} \frac{\d x^\sigma}{\d\lambda}\nabla_\sigma V^\mu &= 0\\ \frac{\d x^\sigma}{\d\lambda}(\p_\sigma V^\mu + \Gamma_{\sigma\rho}^\mu V^\rho) &= 0\\ \frac{\d x^\sigma}{\d\lambda}\Big(\frac{\p}{\p x^\sigma} V^\mu + \Gamma_{\sigma\rho}^\mu V^\rho\Big) &= 0\\ \frac{\d}{\d\lambda} V^\mu + \Gamma_{\sigma\rho}^\mu \frac{\d x^\sigma}{\d\lambda} V^\rho &= 0\\ \end{align*}\]Note that this is a set of 1st order differential equations, one for each $\mu$ index. Also note that since the parallel transport equation depends on coordinate-dependent things like $\Gamma$ and $\frac{\d x^\sigma}{\d\lambda}$, the equation itself also depends on coordinates.
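As a concrete numerical experiment (my own toy setup, not anything from the derivation), we can integrate this system around a closed latitude circle on the unit sphere, whose nonzero Christoffel symbols are $\Gamma_{\phi\phi}^\theta = -\sin\theta\cos\theta$ and $\Gamma_{\theta\phi}^\phi = \cot\theta$. The classic result is that the transported vector comes back rotated by $2\pi\cos\theta_0$:

```python
import math

# Parallel transport dV^mu/dlam + Gamma^mu_{sigma rho} (dx^sigma/dlam) V^rho = 0
# on the unit sphere, along the latitude circle theta = theta0, phi = 2*pi*lam.
theta0 = math.pi / 3  # 60 degrees of colatitude, so cos(theta0) = 1/2

def rhs(V):
    Vth, Vph = V
    dphi = 2 * math.pi  # dphi/dlam along the loop; dtheta/dlam = 0
    dVth = -(-math.sin(theta0) * math.cos(theta0)) * dphi * Vph
    dVph = -(math.cos(theta0) / math.sin(theta0)) * dphi * Vth
    return (dVth, dVph)

def rk4(V, steps=20000):
    h = 1.0 / steps
    for _ in range(steps):
        k1 = rhs(V)
        k2 = rhs((V[0] + h/2*k1[0], V[1] + h/2*k1[1]))
        k3 = rhs((V[0] + h/2*k2[0], V[1] + h/2*k2[1]))
        k4 = rhs((V[0] + h*k3[0], V[1] + h*k3[1]))
        V = (V[0] + h/6*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0]),
             V[1] + h/6*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1]))
    return V

V = rk4((1.0, 0.0))  # start pointing along e_theta
# After one loop, the vector is rotated by 2*pi*cos(theta0) = pi, so V ~ (-1, 0)
print(V)
```

Note that the vector’s length $g_{\mu\nu}V^\mu V^\nu = (V^\theta)^2 + \sin^2\theta_0\,(V^\phi)^2$ stays fixed during the transport, a preview of metric compatibility in action.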
One immediately practical application of the parallel transport equation is to see what happens when we parallel transport the metric $g_{\mu\nu}$.
\[\require{cancel} \frac{\D}{\d\lambda}g_{\mu\nu} = \frac{\d x^\sigma}{\d\lambda}\cancelto{0}{\nabla_\sigma g_{\mu\nu}} = 0\]We can see that the metric is always parallel transported because of metric compatibility! This means that the value of inner products is preserved as we parallel transport along a curve.
Now suppose the metric acts on two vectors $V^\mu$ and $W^\nu$ that are also parallel transported along the same curve.
\[\require{cancel} \begin{align*} \frac{\D}{\d\lambda}(g_{\mu\nu}V^\mu W^\nu) &= 0\\ \cancelto{0}{(\frac{\D}{\d\lambda}g_{\mu\nu})}V^\mu W^\nu + g_{\mu\nu}\cancelto{0}{(\frac{\D}{\d\lambda}V^\mu)} W^\nu + g_{\mu\nu}V^\mu\cancelto{0}{(\frac{\D}{\d\lambda}W^\nu)} &= 0 \end{align*}\]The first term is cancelled because of metric compatibility and the second and third terms are also cancelled because we defined $V^\mu$ and $W^\nu$ to be parallel transported. This means that norms, angles, and orthogonality are also preserved!
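We can also sketch this numerically (a toy check of my own, using the unit sphere’s Christoffel symbols and an RK4 integrator): transport two vectors around a latitude circle and watch the inner product $g_{\mu\nu}V^\mu W^\nu$ stay fixed.

```python
import math

# Transport two vectors V and W around the latitude theta = theta0 on the unit sphere
# and check that the inner product g_{mu nu} V^mu W^nu is preserved.
theta0 = math.pi / 4

def rhs(s):
    # parallel transport equations for (V^theta, V^phi, W^theta, W^phi) along
    # phi = 2*pi*lam, using Gamma^theta_{phi phi} = -sin*cos, Gamma^phi_{theta phi} = cot
    dphi = 2 * math.pi
    sc = math.sin(theta0) * math.cos(theta0)
    cot = math.cos(theta0) / math.sin(theta0)
    Vt, Vp, Wt, Wp = s
    return (sc*dphi*Vp, -cot*dphi*Vt, sc*dphi*Wp, -cot*dphi*Wt)

def inner(s):
    Vt, Vp, Wt, Wp = s
    return Vt*Wt + math.sin(theta0)**2 * Vp*Wp

s = (1.0, 0.5, -0.3, 2.0)   # arbitrary starting components
before = inner(s)
steps = 20000
h = 1.0 / steps
for _ in range(steps):       # RK4 integration of the transport equation
    k1 = rhs(s)
    k2 = rhs(tuple(s[i] + h/2*k1[i] for i in range(4)))
    k3 = rhs(tuple(s[i] + h/2*k2[i] for i in range(4)))
    k4 = rhs(tuple(s[i] + h*k3[i] for i in range(4)))
    s = tuple(s[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(4))
after = inner(s)
print(before, after)  # the inner product is unchanged by parallel transport
```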
Now that we’ve discussed parallel transport, let me circle back to a few points and supplement the lines and lines of equations with actual geometrical pictures. Let’s start with the geometrical picture of the covariant derivative. Recall that it generalizes the partial derivative by adding a correction for the changing basis that occurs from point to point. But, if a vector is parallel transported, by the parallel transport equation, its covariant derivative along the path is zero. So we can think of the covariant derivative as the vector difference between parallel transporting a vector along a path from one point to another and simply evaluating the vector field at that point on the manifold (see the first image in this post).
Additionally, with parallel transport, we can also get a slightly better geometric picture of torsion.
Suppose we have two vector fields $A^\mu$ and $B^\nu$. If we parallel transport $A^\mu$ in the direction of $B^\nu$ and $B^\nu$ in the direction of $A^\mu$, then the torsion tensor $S_{\mu\nu}^\lambda$ measures the failure of that loop to close. With a torsion-free connection, the parallel-transported vectors form a closed parallelogram.
One last crucial topic we’ll need to discuss before getting into curvature is the geodesic. To understand the intuition, remember that parallel transport changes a vector along a particular path from point to point. But there are an infinite number of paths between any two points, so there doesn’t immediately seem to be a “preferred” path that multiple people could agree on. One candidate is the “shortest possible” path between the points. In a flat space, we knew how to do this: pick a straight line! But on a curved manifold where the coordinates change as well, there isn’t always a “straight” path.
One way to do this is to find a path $x^\mu(\lambda)$ that minimizes the total arc length/path length between any two points. But that requires calculus of variations, so it’s complicated! A slightly less formal, but more intuitive, way to understand such a path is in terms of parallel transport. One observation is that, in a flat space, a straight line keeps its tangent vector pointing in the same direction along the line. In other words, the straight line parallel transports its own tangent vector. This intuition carries over to a curved space. Suppose we have a curve $x^\mu(\lambda)$ and its tangent vector $\frac{\d x^\mu}{\d\lambda}$. Let’s parallel transport the tangent vector along the curve.
\[\begin{align*} \frac{\D}{\d\lambda}\Big(\frac{\d x^\mu}{\d\lambda}\Big) &= 0\\ \frac{\d x^\sigma}{\d\lambda}\nabla_\sigma\frac{\d x^\mu}{\d\lambda} &= 0\\ \frac{\d x^\sigma}{\d\lambda}\Big(\frac{\p}{\p x^\sigma}\frac{\d x^\mu}{\d\lambda} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\rho}{\d\lambda}\Big) &= 0\\ \frac{\d}{\d\lambda}\frac{\d x^\mu}{\d\lambda} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} &= 0\\ \frac{\d^2 x^\mu}{\d\lambda^2} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} &= 0\\ \end{align*}\]The final result is the geodesic equation, a 2nd order differential equation, one for each coordinate/index $\mu$. Notice that in a Cartesian space, all $\Gamma=0$ so we’re left with $\frac{\d^2 x^\mu}{\d\lambda^2} = 0$. The solution to this differential equation is a line! (If you don’t know any differential equations, you can convince yourself of this since the only kinds of functions with no second derivative anywhere are lines!) Even without talking about curvature, geodesics are incredibly important: in general relativity, test particles in a gravitational field move along geodesics so they’re critical for understanding the consequences of different gravities.
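We can check the Cartesian claim numerically (a toy RK4 integration of my own, using the polar-coordinate Christoffel symbols $\Gamma_{\phi\phi}^r = -r$ and $\Gamma_{r\phi}^\phi = 1/r$): the geodesic equation in polar coordinates, despite its nonzero $\Gamma$ terms, should still trace out a straight line.

```python
import math

# Geodesic equation in polar coordinates (r, phi) on the flat plane.
# Nonzero Christoffels: Gamma^r_{phi phi} = -r, Gamma^phi_{r phi} = 1/r, so:
#   r''   = r * phi'^2
#   phi'' = -(2/r) * r' * phi'
def rhs(s):
    r, phi, dr, dphi = s
    return (dr, dphi, r * dphi**2, -2.0 / r * dr * dphi)

def rk4(s, lam_end, steps=10000):
    h = lam_end / steps
    for _ in range(steps):
        k1 = rhs(s)
        k2 = rhs(tuple(s[i] + h/2*k1[i] for i in range(4)))
        k3 = rhs(tuple(s[i] + h/2*k2[i] for i in range(4)))
        k4 = rhs(tuple(s[i] + h*k3[i] for i in range(4)))
        s = tuple(s[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(4))
    return s

# Start at (x, y) = (1, 0) moving straight "up": in polar terms, r=1, phi=0, r'=0, phi'=1
r, phi, _, _ = rk4((1.0, 0.0, 0.0, 1.0), lam_end=2.0)
x, y = r * math.cos(phi), r * math.sin(phi)
print(x, y)  # the geodesic traces the straight line x = 1, so (x, y) ~ (1, 2)
```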
Solving the geodesic equation can seem a little complicated so there’s an alternative way to think about geodesics that’s a bit more practical. Imagine we’re at an arbitrary point $p$ on a manifold, and we have a tangent vector $V^\mu$ to some curve/direction we want to travel in. We can construct a unique geodesic in a small neighborhood of $p$. Suppose our geodesic is $\gamma^\mu(\lambda)$. From the above statements, we immediately have two constraints to the geodesic: $\gamma^\mu(\lambda=0)=p$ and $\frac{\d\gamma^\mu}{\d\lambda}(\lambda=0)=V^\mu$. The former says that the geodesic “starts” at $p$ and the second statement says that the tangent vector at $\lambda=0$ on the geodesic is $V^\mu$. The exponential map is the map we use to get the geodesic. It is defined as $\exp_p: T_p\to M, V^\mu\mapsto\gamma^\mu(\lambda=1)$ such that $\gamma^\mu$ solves the geodesic equation.
Given a point $p$ and a direction $V^\mu$ at $p$, it’s always possible to specify a unique geodesic $\gamma^\mu$ “in the neighborhood” of the point. If we stray too far from the point, this geodesic fails to be unique because they could cross over each other.
Since the geodesic is on the manifold, if we follow $\gamma^\mu$, then there’s some other point $q$ also on the manifold such that $\gamma^\mu(\lambda=1)=q$. After this process, we’re at another point on the manifold by travelling along the geodesic. With this technique, we can travel all across the manifold, from tangent space to tangent space, along locally shortest paths. An important thing to note is that this geodesic is only unique and invertible in a “small enough” neighborhood around $p$. Travel too far away, and we no longer have a unique geodesic since different geodesics from $p$ might cross, so more than one could end up at $q$.
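Here’s a rough numerical sketch of the exponential map (my own toy construction on the unit sphere, not a standard library routine): integrate the geodesic equation from $p$ with initial velocity $V^\mu$ out to $\lambda=1$, and check that the geodesic distance from $p$ to $\exp_p(V)$ equals $|V|$, which holds while we stay in the neighborhood where the geodesic is unique.

```python
import math

# Exponential map on the unit sphere: exp_p(V) = gamma(1), where gamma solves the
# geodesic equation with gamma(0) = p and gamma'(0) = V.  Sphere geodesic equations:
#   theta'' = sin(theta) cos(theta) phi'^2
#   phi''   = -2 cot(theta) theta' phi'
def rhs(s):
    th, ph, dth, dph = s
    return (dth, dph,
            math.sin(th) * math.cos(th) * dph**2,
            -2.0 * math.cos(th) / math.sin(th) * dth * dph)

def exp_map(p, V, steps=10000):
    s = (p[0], p[1], V[0], V[1])
    h = 1.0 / steps
    for _ in range(steps):  # RK4 from lambda = 0 to lambda = 1
        k1 = rhs(s)
        k2 = rhs(tuple(s[i] + h/2*k1[i] for i in range(4)))
        k3 = rhs(tuple(s[i] + h/2*k2[i] for i in range(4)))
        k4 = rhs(tuple(s[i] + h*k3[i] for i in range(4)))
        s = tuple(s[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(4))
    return (s[0], s[1])

def embed(th, ph):  # sphere point as a 3D unit vector
    return (math.sin(th)*math.cos(ph), math.sin(th)*math.sin(ph), math.cos(th))

p = (math.pi/2, 0.0)  # a point on the equator
V = (-0.6, 0.8)       # |V|^2 = theta'^2 + sin^2(theta) phi'^2 = 1 at the equator
q = exp_map(p, V)
a, b = embed(*p), embed(*q)
dist = math.acos(sum(ai*bi for ai, bi in zip(a, b)))
print(dist)  # geodesic distance from p to exp_p(V) should equal |V| = 1
```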
With all of those prerequisites addressed, we can finally discuss curvature. In a flat space, when we talk about curvature, we often mean the curvature of a 2D/3D curve or a parameterized surface. These are forms of extrinsic curvature since they depend on the embedding space. However, remember that a manifold is completely independent of the space it’s embedded in. As an alternative to extrinsic curvature, we also have intrinsic curvature. Intuitively, imagine you were a little bug walking on top of the manifold. Could you tell if the space was curved like the Earth or flat? As it turns out, on a manifold with arbitrary coordinates, it’s much harder to tell if the space is curved or we just chose curved coordinates. As an example, consider a flat plane. We could use Cartesian coordinates and know that the space is flat like $\R^2$. However, we could also use polar coordinates on the plane, and then it’s more difficult to tell if the space is flat since polar coordinates are curved and have nonzero connection coefficients!
In Cartesian coordinates, it’s pretty clear that the components of the basis don’t change from point to point; in polar coordinates, they do. But polar coordinates are just curved coordinates on a flat space! We need a way to differentiate an intrinsically curved space from a flat space described with curved coordinates.
Interestingly, the inverse can also be true: manifolds that appear to have curvature can actually be intrinsically flat! Consider a torus. At first glance, it appears to be a curved space, but that’s only extrinsically. As it turns out, we can show that the torus is actually intrinsically flat, specifically, it is the same as a square with the sides identified.
We can flatten a torus by cutting the torus into a cylinder and then cutting the cylinder in half and unrolling it. The sides are identified so the space “repeats”. On the other hand, there’s no way to cut a sphere into a flat space (in a way that preserves distances and angles).
So if we were a little bug on a torus, we would think our world was flat! We could construct a map of a torus on a piece of paper that perfectly preserves angles and distances. To complete the list of examples, a sphere, e.g., the surface of the Earth, is both extrinsically and intrinsically curved! We’ll see exactly how to prove this shortly.
So far, I’ve described curvature intuitively, but we need some equations to let us definitively differentiate a flat from a curved space. The key is to recall what we said about parallel transporting a vector from a start point to an end point: the final result depends on the path! Taking that same notion, what would happen if we parallel transported a vector in a little infinitesimal loop? In a flat space, either Cartesian or polar, the vector should be pointing in the same direction! But what if a space is not flat? Remember what happened for the sphere? When we parallel transported a vector in a loop, it wasn’t pointed in the same direction! Let’s take the same concept, but do it at a much smaller/infinitesimal scale so we can define a curvature at each point in space.
“Parallel transport around a little loop” is a bit too informal, so let’s use some equations to make this more concrete. Some texts take this too literally, but I think a better interpretation is to consider two vectors $A^\mu$ and $B^\nu$ and an arbitrary vector $V^\rho$ that we parallel transport along those two vectors. The mathematical way to represent this is with the commutator of the covariant derivative:
\[[\nabla_\mu, \nabla_\nu]V^\rho = \nabla_\mu \nabla_\nu V^\rho - \nabla_\nu \nabla_\mu V^\rho\]Intuitively, this is like transporting the vector to the far side of the loop and then back to the start again. The computation itself is fairly straightforward. Let’s first start by applying the outermost covariant derivative to the first term.
\[\nabla_\mu \nabla_\nu V^\rho - \nabla_\nu \nabla_\mu V^\rho = \p_\mu(\nabla_\nu V^\rho) - \Gamma_{\mu\nu}^\lambda\nabla_\lambda V^\rho + \Gamma_{\mu\sigma}^\rho\nabla_\nu V^\sigma - (\mu\leftrightarrow\nu)\\\]Recall that we’re applying $\nabla_\mu$ on the tensor $\nabla_\nu V^\rho$, which has one upper and one lower index so we need two connection coefficients. (You can think of this tensor as $(\nabla V)_\nu^\rho$ if that helps). As it turns out, the expansion of the second term is identical to the first except with the $\mu$s and $\nu$s swapped, which is denoted as $(\mu\leftrightarrow\nu)$. Don’t worry about those for now; we’ll expand them later. Now let’s expand the inner covariant derivative.
\[\p_\mu(\p_\nu V^\rho + \Gamma_{\nu\sigma}^\rho V^\sigma) - \Gamma_{\mu\nu}^\lambda(\p_\lambda V^\rho + \Gamma_{\lambda\sigma}^\rho V^\sigma) + \Gamma_{\mu\sigma}^\rho(\p_\nu V^\sigma + \Gamma_{\nu\lambda}^\sigma V^\lambda) - (\mu\leftrightarrow\nu)\\\]Now let’s multiply everything out, but be careful about the partial $\p_\mu$.
\[\p_\mu\p_\nu V^\rho + \p_\mu(\Gamma_{\nu\sigma}^\rho V^\sigma) - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda - (\mu\leftrightarrow\nu)\\\]For the $\p_\mu(\Gamma_{\nu\sigma}^\rho V^\sigma)$ term, we have to expand it using the product rule!
\[\p_\mu\p_\nu V^\rho + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda - (\mu\leftrightarrow\nu)\\\]Even though this equation already has a lot of terms, we’re ready to add in the other terms and see what cancels!
\[\begin{align*} \p_\mu\p_\nu V^\rho + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda\\ -\p_\nu\p_\mu V^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma - \Gamma_{\nu\sigma}^\rho\Gamma_{\mu\lambda}^\sigma V^\lambda\\ \end{align*}\]Remembering that partial derivatives commute, we can get rid of quite a few terms!
\[\require{cancel} \begin{align*} \cancel{\p_\mu\p_\nu V^\rho} + \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \bcancel{\Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma} - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \xcancel{\Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma} + \Gamma_{\mu\sigma}^\rho\Gamma_{\nu\lambda}^\sigma V^\lambda\\ -\cancel{\p_\nu\p_\mu V^\rho} - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \xcancel{\Gamma_{\mu\sigma}^\rho\p_\nu V^\sigma} + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \bcancel{\Gamma_{\nu\sigma}^\rho\p_\mu V^\sigma} - \Gamma_{\nu\sigma}^\rho\Gamma_{\mu\lambda}^\sigma V^\lambda\\ \end{align*}\]Nearly half of our terms cancel! Let’s examine the surviving terms. I’ve swapped dummy indices $\lambda\leftrightarrow\sigma$ in the last term of each line so that the notation is more consistent.
\[\begin{align*} \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma - \Gamma_{\mu\nu}^\lambda\p_\lambda V^\rho - \Gamma_{\mu\nu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma\\ - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma + \Gamma_{\nu\mu}^\lambda\p_\lambda V^\rho + \Gamma_{\nu\mu}^\lambda\Gamma_{\lambda\sigma}^\rho V^\sigma - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma\\ \end{align*}\]There are a few interesting things to notice, especially with the middle two terms of each line. They can each be condensed back into a covariant derivative, but with a connection coefficient as a coefficient on the front.
\[\begin{align*} \p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma - \Gamma_{\mu\nu}^\lambda(\nabla_\lambda V^\rho) + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma\\ - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma + \Gamma_{\nu\mu}^\lambda(\nabla_\lambda V^\rho) - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma\\ \end{align*}\]Yet another condensation we can do is to look at the middle term of each line. They’re almost identical except the $\mu$ and $\nu$ are swapped! Their difference is exactly twice the commutator of the indices!
\[\p_\mu\Gamma_{\nu\sigma}^\rho V^\sigma + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda V^\sigma - \p_\nu\Gamma_{\mu\sigma}^\rho V^\sigma - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda V^\sigma - 2\Gamma_{[\mu\nu]}^\lambda\nabla_\lambda V^\rho \\\]But remember that for a torsion-free connection, this last term vanishes, so we’re left with only the first four terms. We can factor out the $V^\sigma$ since it was arbitrary (and do a bit of rearranging).
\[(\p_\mu\Gamma_{\nu\sigma}^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda) V^\sigma\]With some inspection, the tensor in the parentheses seems to have one upper and three lower indices. We define this as the Riemann tensor, which tells us the curvature (at a point) of a space.
\[R_{\sigma\mu\nu}^\rho = \p_\mu\Gamma_{\nu\sigma}^\rho - \p_\nu\Gamma_{\mu\sigma}^\rho + \Gamma_{\mu\lambda}^\rho\Gamma_{\nu\sigma}^\lambda - \Gamma_{\nu\lambda}^\rho\Gamma_{\mu\sigma}^\lambda\]We went through several stages of equations to get here, but remember that we were trying to see what happens if we parallel transported a vector along a little infinitesimal loop. The final result is that the parallel transported vector is linearly transformed by the Riemann tensor! To see this more clearly, let me group the indices a bit differently: $(R_\sigma^\rho)_{\mu\nu}$. The first upper and lower indices together represent a linear transform, just like a matrix linearly transforms a vector. The last two lower indices tell us in which directions we’re parallel transporting the vector around the little loop.
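To make this concrete, here’s a sympy sketch (my choice of example, not part of the derivation) that evaluates the Riemann tensor for the unit 2-sphere; the component $R_{\phi\theta\phi}^\theta = \sin^2\theta$ is nonzero, confirming that the sphere is intrinsically curved:

```python
import sympy as sp

# R^rho_{sigma mu nu} = d_mu Gamma^rho_{nu sigma} - d_nu Gamma^rho_{mu sigma}
#                     + Gamma^rho_{mu lam} Gamma^lam_{nu sigma}
#                     - Gamma^rho_{nu lam} Gamma^lam_{mu sigma}
theta, phi = sp.symbols('theta phi')
x = [theta, phi]
g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])  # unit 2-sphere metric
g_inv = g.inv()
n = 2

def christoffel(s, m, nu):
    return sum(sp.Rational(1, 2) * g_inv[s, rho] * (
        sp.diff(g[nu, rho], x[m]) + sp.diff(g[rho, m], x[nu]) - sp.diff(g[m, nu], x[rho]))
        for rho in range(n))

def riemann(rho, sigma, mu, nu):
    return sp.simplify(
        sp.diff(christoffel(rho, nu, sigma), x[mu])
        - sp.diff(christoffel(rho, mu, sigma), x[nu])
        + sum(christoffel(rho, mu, lam) * christoffel(lam, nu, sigma)
              - christoffel(rho, nu, lam) * christoffel(lam, mu, sigma)
              for lam in range(n)))

print(riemann(0, 1, 0, 1))  # R^theta_{phi theta phi}: equal to sin^2(theta)
```

Swapping the last two indices flips the sign, matching the antisymmetry we noted from the derivation.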
Similar to torsion, suppose we have two vectors $A^\mu$ and $B^\nu$ that we parallel transport into each other to make a closed loop (we’re assuming no torsion). Then if we have a vector $V^\rho$ that we move around in that little loop, we’ll end up with $V^{\rho'}$ that’s related to the original $V^\rho$ we started with by a linear transform. That linear transform that relates the two is what we call the Riemann tensor $R_{\sigma\mu\nu}^\rho$.
There are a few more things to note about this tensor. First of all, from the derivation, we can see that it’s antisymmetric in its last two lower indices. Imagine if we went around the loop the other way and swapped $\mu$ and $\nu$ right from the beginning. Another important property is that it really does tell us if a space is flat or not because it’s written in terms of the derivatives of the connection, which, canonically, is written in terms of the metric. So this is effectively looking at second derivatives of the metric, similar to how curvature in a flat space looks at second derivatives. In Cartesian coordinates, we can immediately see that $R_{\sigma\mu\nu}^\rho=0$ everywhere.
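As a companion check (again a toy sympy computation of my own), the same formula applied to polar coordinates on the plane gives nonzero connection coefficients but a vanishing Riemann tensor, so polar coordinates really are just curved coordinates on a flat space:

```python
import sympy as sp

# Polar coordinates on the flat plane: nonzero Christoffels, but zero Riemann tensor.
r, phi = sp.symbols('r phi', positive=True)
x = [r, phi]
g = sp.Matrix([[1, 0], [0, r**2]])  # ds^2 = dr^2 + r^2 dphi^2
g_inv = g.inv()
n = 2

def christoffel(s, m, nu):
    return sum(sp.Rational(1, 2) * g_inv[s, rho] * (
        sp.diff(g[nu, rho], x[m]) + sp.diff(g[rho, m], x[nu]) - sp.diff(g[m, nu], x[rho]))
        for rho in range(n))

def riemann(rho, sigma, mu, nu):
    return sp.simplify(
        sp.diff(christoffel(rho, nu, sigma), x[mu])
        - sp.diff(christoffel(rho, mu, sigma), x[nu])
        + sum(christoffel(rho, mu, lam) * christoffel(lam, nu, sigma)
              - christoffel(rho, nu, lam) * christoffel(lam, mu, sigma)
              for lam in range(n)))

print(christoffel(0, 1, 1))  # Gamma^r_{phi phi}: equal to -r (nonzero!)
assert all(riemann(a, b, c, d) == 0
           for a in range(n) for b in range(n) for c in range(n) for d in range(n))
print("Riemann tensor vanishes: polar coordinates describe a flat space")
```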
As it turns out, there’s a theorem that says we can find a coordinate system in which the metric components are constant if and only if the Riemann tensor vanishes everywhere. From the above examples, it’s easy to show the forward implication of that theorem, but it’s a bit more work to show the backwards implication. I think the forward implication is more commonly used so I’ll skip the backwards implication and refer you to Sean Carroll’s book on general relativity.
In terms of components, naïvely, we might think it has $n^4$ components since there are four indices, but, with the symmetries, we actually have far fewer components. The first symmetry we already saw: antisymmetry in the last two lower indices. There are more symmetries, but they are easier to discover if we lower the single upper index.
\[R_{\rho\sigma\mu\nu} = g_{\rho\lambda}R_{\sigma\mu\nu}^\lambda\]Let’s expand this out, but we’re going to use a special set of coordinates called Riemann normal coordinates. They’re a set of coordinates such that $\partial_{\sigma}g_{\mu\nu}=0$. A consequence of this (that you can verify yourself) is that all of the connection coefficients themselves are zero. However, this doesn’t mean the derivatives of the connection coefficients are zero so we still have to keep those.
\[\require{cancel} \begin{align*} R_{\rho\sigma\mu\nu} &= g_{\rho\lambda}R_{\sigma\mu\nu}^\lambda\\ &= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda- \p_\nu\Gamma_{\mu\sigma}^\lambda + \cancelto{0}{\Gamma_{\mu\tau}^\lambda\Gamma_{\nu\sigma}^\tau} - \cancelto{0}{\Gamma_{\nu\tau}^\lambda\Gamma_{\mu\sigma}^\tau})\\ &= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda- \p_\nu\Gamma_{\mu\sigma}^\lambda)\\ \end{align*}\]Now we can expand the connection coefficients in terms of the metric (since we’re assuming a Levi-Civita connection):
\[\begin{align*} &= g_{\rho\lambda}(\p_\mu\Gamma_{\nu\sigma}^\lambda- \p_\nu\Gamma_{\mu\sigma}^\lambda)\\ &= g_{\rho\lambda}\Bigg(\p_\mu\Big[\frac{1}{2}g^{\lambda\tau}(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma})\Big] - \p_\nu\Big[\frac{1}{2}g^{\lambda\tau}(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Big]\Bigg)\\ &= \frac{1}{2}g_{\rho\lambda}\Bigg(\p_\mu\Big[g^{\lambda\tau}(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma})\Big] - \p_\nu\Big[g^{\lambda\tau}(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Big]\Bigg)\\ \end{align*}\]We have to expand out the inner partials $\p_\mu$ and $\p_\nu$ using the product rule, but remember that we’re in Riemann normal coordinates so the partials of the metric tensor and inverse metric tensor are zero $\p_\mu g^{\lambda\tau}=0$. So we can just apply the partial on the second term and factor out the $g^{\lambda\tau}$ to the front.
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}\Bigg(\p_\mu(\p_\nu g_{\sigma\tau} + \p_\sigma g_{\tau\nu} - \p_\tau g_{\nu\sigma}) - \p_\nu(\p_\mu g_{\sigma\tau} + \p_\sigma g_{\tau\mu} - \p_\tau g_{\mu\sigma})\Bigg)\]The partials can distribute through as well.
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}(\p_\mu\p_\nu g_{\sigma\tau} + \p_\mu\p_\sigma g_{\tau\nu} - \p_\mu\p_\tau g_{\nu\sigma} - \p_\nu\p_\mu g_{\sigma\tau} - \p_\nu\p_\sigma g_{\tau\mu} + \p_\nu\p_\tau g_{\mu\sigma})\]The partials commute so we can cancel out the first and fourth terms.
\[= \frac{1}{2}g_{\rho\lambda}g^{\lambda\tau}(\p_\mu\p_\sigma g_{\tau\nu} - \p_\mu\p_\tau g_{\nu\sigma} - \p_\nu\p_\sigma g_{\tau\mu} + \p_\nu\p_\tau g_{\mu\sigma})\]Finally, recall that $g_{\rho\lambda}g^{\lambda\tau}=\delta_\rho^\tau$ so we can substitute any lower $\tau$ with a $\rho$, and we’re left with the final result.
\[R_{\rho\sigma\mu\nu} = \frac{1}{2}(\p_\mu\p_\sigma g_{\rho\nu} - \p_\mu\p_\rho g_{\nu\sigma} - \p_\nu\p_\sigma g_{\rho\mu} + \p_\nu\p_\rho g_{\mu\sigma})\]From these terms, there are two symmetries we can see (by the fact the metric is symmetric and the partials commute). The first is that the tensor is antisymmetric in the first two indices.
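If you'd rather not push the indices around by hand, here's a numerical sanity check I find handy: fill a hypothetical array of second derivatives $\p_a\p_b g_{cd}$ with random values (symmetric in the derivative pair and in the metric pair, as they must be), build the normal-coordinate expression for $R_{\rho\sigma\mu\nu}$, and test all of its symmetries directly:

```python
import random
from itertools import product

random.seed(0)
n = 4
# Hypothetical data H[a][b][c][d] = ∂_a ∂_b g_{cd}: random values, but
# symmetric in the derivative pair (a, b) and in the metric pair (c, d).
H = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(n)]
for a in range(n):
    for b in range(a, n):
        for c in range(n):
            for d in range(c, n):
                v = random.random()
                for i, j in ((a, b), (b, a)):
                    for k, l in ((c, d), (d, c)):
                        H[i][j][k][l] = v

def riemann(r, s, m, nu):
    # lowered Riemann tensor in Riemann normal coordinates
    return 0.5 * (H[m][s][r][nu] - H[m][r][nu][s]
                  - H[nu][s][r][m] + H[nu][r][m][s])

# every symmetry should hold for every choice of indices
for r, s, m, nu in product(range(n), repeat=4):
    assert abs(riemann(r, s, m, nu) + riemann(s, r, m, nu)) < 1e-12   # first pair
    assert abs(riemann(r, s, m, nu) + riemann(r, s, nu, m)) < 1e-12   # last pair
    assert abs(riemann(r, s, m, nu) - riemann(m, nu, r, s)) < 1e-12   # pair swap
    assert abs(riemann(r, s, m, nu) + riemann(r, m, nu, s)
               + riemann(r, nu, s, m)) < 1e-12                         # cyclic sum
print("all symmetries hold")
```

Since the checks hold for arbitrary random second derivatives, they hold for any metric, which is exactly what the index-shuffling arguments below establish.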
\[R_{\rho\sigma\mu\nu} = -R_{\sigma\rho\mu\nu}\]Also, the tensor is invariant if we swap the first pair with the last pair of indices.
\[R_{\rho\sigma\mu\nu} = R_{\mu\nu\rho\sigma}\]You can convince yourself of these by substituting (and carefully changing indices around!) to find that things cancel or match up. There really isn’t much insight or practice gained from showing you that so I’ll just skip it. The last property is that if we cycle the last three indices completely and take the sum, everything cancels!
\[R_{\rho\sigma\mu\nu} + R_{\rho\mu\nu\sigma} + R_{\rho\nu\sigma\mu} = 0\]With some more index acrobatics, we can show that cyclical permutations are equivalent to taking a multi-index antisymmetry:
\[R_{\rho[\sigma\mu\nu]} = 0\](You can verify this yourself, but it’s not a very interesting calculation to do so I’ve also skipped this.) Note that we haven’t done anything non-tensorial here, even though we’ve used the connection coefficients.
Now we can use these symmetries to figure out the number of independent components. Using the first antisymmetry, each antisymmetric pair of indices can only take $\binom{n}{2}$ distinct values. To see this, consider $n=4$ (as commonly used in general relativity!). Because of the antisymmetry, the only unique values of a pair are $01$, $02$, $03$, $12$, $13$, $23$: the diagonal values vanish and the other side of the diagonal is repeated. Hence, we have $n$ choose $2$, in combinatorial notation.
\[m = \binom{n}{2} = \frac{n(n-1)}{2}\]Now we can factor that into the second symmetry that says the first and second pair are swappable. For a symmetric matrix, we have $\frac{m(m+1)}{2}$ independent values, but that’s on top of the antisymmetry, which is why I used $m$ again. Substituting in terms of $n$, we can get the following (I’m skipping the algebra because it’s just algebra).
\[\frac{m(m+1)}{2} = \frac{n^4-2n^3+3n^2-2n}{8}\]Note that this counts the components before imposing the cyclic identity, so we still need to subtract those constraints. The cyclic permutation identity, in its antisymmetrized form $R_{\rho[\sigma\mu\nu]}=0$, only produces a new constraint when all four indices are distinct; if any two indices coincide, it reduces to a combination of the first two symmetries. So there are $\binom{n}{4}$ independent constraints.
\[\binom{n}{4} = \frac{n^4-6n^3+11n^2-6n}{24}\]These constraints reduce the general count above, so we subtract them to get the final result.
\[\frac{n^4-2n^3+3n^2-2n}{8} - \frac{n^4-6n^3+11n^2-6n}{24} = \frac{n^2(n^2 - 1)}{12}\](Yet again, I’ve skipped over the algebra because it’s not very interesting.) Finally we’re left with the number of independent components of the Riemann tensor with all of the symmetries accounted for! It’s certainly smaller than $n^4$, but it’s also not that small. For $n=4$, we have 20 independent components.
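The counting argument above is easy to sanity-check in a couple of lines of Python (the function name is my own):

```python
from math import comb

def riemann_components(n):
    m = comb(n, 2)                      # distinct antisymmetric index pairs
    pair_symmetric = m * (m + 1) // 2   # symmetric arrangement of the two pairs
    return pair_symmetric - comb(n, 4)  # minus the cyclic (Bianchi) constraints

for n in range(2, 7):
    # the closed form n²(n² − 1)/12 should match the counting argument
    assert riemann_components(n) == n**2 * (n**2 - 1) // 12
print(riemann_components(4))  # 20
```

Note the $n=2$ case: a surface has only one independent component of curvature, which is why a single number (the Gaussian curvature) suffices there.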
There’s just one last property regarding the Riemann tensor we need to discuss before we can simplify it into something easier to use. We can consider the derivative of the lowered Riemann (also in Riemann normal coordinates so there’s no connection coefficient term).
\[\begin{align*} \nabla_\lambda R_{\rho\sigma\mu\nu} &= \p_\lambda R_{\rho\sigma\mu\nu}\\ &= \frac{1}{2}\p_\lambda (\p_\mu\p_\sigma g_{\rho\nu} - \p_\mu\p_\rho g_{\nu\sigma} - \p_\nu\p_\sigma g_{\rho\mu} + \p_\nu\p_\rho g_{\mu\sigma})\\ &= \frac{1}{2}(\p_\lambda \p_\mu\p_\sigma g_{\rho\nu} - \p_\lambda \p_\mu\p_\rho g_{\nu\sigma} - \p_\lambda \p_\nu\p_\sigma g_{\rho\mu} + \p_\lambda \p_\nu\p_\rho g_{\mu\sigma})\\ \end{align*}\]If we consider cyclical permutations of the first three indices, everything cancels!
\[\nabla_\lambda R_{\rho\sigma\mu\nu} + \nabla_\rho R_{\sigma\lambda\mu\nu} + \nabla_\sigma R_{\lambda\rho\mu\nu} = 0\]Like with the symmetry with cyclical permutations of the last three indices, we can use an equivalent antisymmetry.
\[\nabla_{[\lambda} R_{\rho\sigma]\mu\nu} = 0\]The above property is called the Bianchi identity and it’s actually used to prove an important property of the Einstein Field Equations used in general relativity.
One geometric interpretation of the Bianchi Identity that I really like is the ability/inability to close a parallelepiped. Suppose we have three vectors $U$, $V$, and $W$. If we parallel transport each in the direction of each other, we’ll get a parallelepiped. The Bianchi Identity measures the ability of the ends of the vectors to close into a closed parallelepiped.
Even for small dimensionalities, the Riemann tensor has a lot of components! Practically speaking, we don’t often have to deal with this tensor directly. Instead, we can deal with a smaller tensor formed from a contraction of the Riemann tensor called the Ricci tensor.
\[R_{\mu\nu} = R_{\mu\lambda\nu}^\lambda\]In fact, we can contract it even further to get a scalar called the Ricci scalar.
\[R = R_\mu^\mu= g^{\mu\nu}R_{\mu\nu}\]As with the Riemann tensor, I also want to provide some illustrative intuition behind both of these quantities. (I won’t go through the exact proofs since that requires setting up some more machinery.) One interpretation I really like is John Baez’s coffee grounds. Imagine a ball of comoving coffee grounds on the manifold; “comoving” just means each individual coffee particle is at rest relative to all of the others so the whole group moves as a single coffee ground blob. In a flat space, the shape and size remain the same no matter how we move around the manifold. But, on a curved manifold, the ball might expand, collapse, rotate, or deform in all kinds of different ways. This is because each individual coffee ground doesn’t follow the same geodesic. The Ricci tensor measures only the change in volume of our coffee grounds. There is another tensor, called the Weyl tensor, that measures the deformation.
The Ricci scalar, sometimes called scalar curvature, measures how the volume of the coffee ground blob differs from flat space. A positive scalar curvature is like a sphere. As we’ll see, a sphere has positive curvature everywhere, and geodesics tend to “bend apart” on a sphere. On the other hand, a negative curvature is like a saddle.
With a positive scalar curvature, like a sphere, the edges of a triangle will “bow outward”. This is the reason we need to use the Haversine Formula when we look at angles and distance on the surface of the Earth. In a flat space, a triangle is simply a triangle. With a negative scalar curvature, like with a saddle, the edges of a triangle will “bow inward”.
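Since the Haversine Formula came up: it computes exactly this great-circle distance from latitudes and longitudes. Here's a minimal sketch (the function name and test points are my own; it idealizes the Earth as a sphere of radius 6371 km):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    # great-circle distance on a sphere of radius r (km); inputs in degrees
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))

# a quarter of a great circle: from the equator up to the North pole
print(haversine(0.0, 0.0, 90.0, 0.0))  # ≈ (π/2) · 6371 ≈ 10007.5 km
```

On a flat plane we'd just use the Pythagorean theorem; the extra trigonometry is the positive curvature showing up in the distance formula.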
To see a practical application of the Ricci tensor and scalar to general relativity, there’s a little computation we have to do first. Taking it a step further, we can contract the Bianchi identity twice to write it in terms of the Ricci tensor and Ricci scalar.
\[\begin{align*} g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\rho\sigma\mu\nu} + \nabla_\rho R_{\sigma\lambda\mu\nu} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\ g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\mu\nu\rho\sigma} + \nabla_\rho R_{\mu\nu\sigma\lambda} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\ g^{\nu\sigma}g^{\mu\lambda}(\nabla_\lambda R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\mu\sigma\lambda} + \nabla_\sigma R_{\lambda\rho\mu\nu}) &= 0\\ g^{\nu\sigma}(\nabla^\mu R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\mu\sigma}^\mu + \nabla_\sigma R_{\rho\mu\nu}^\mu) &= 0\\ g^{\nu\sigma}(\nabla^\mu R_{\mu\nu\rho\sigma} - \nabla_\rho R_{\nu\sigma} + \nabla_\sigma R_{\rho\nu}) &= 0\\ \nabla^\mu R_{\mu\nu\rho}^\nu - \nabla_\rho R_{\nu}^\nu + \nabla^\nu R_{\rho\nu} &= 0\\ \nabla^\mu R_{\mu\rho} - \nabla_\rho R + \nabla^\nu R_{\rho\nu} &= 0\\ \nabla^\mu R_{\mu\rho} - \nabla_\rho R + \nabla^\mu R_{\mu\rho} &= 0\\ 2\nabla^\mu R_{\mu\rho} - \nabla_\rho R &= 0\\ \nabla^\mu R_{\mu\rho} - \frac{1}{2}\nabla_\rho R &= 0\\ \nabla^\mu R_{\mu\rho} &= \frac{1}{2}\nabla_\rho R\\ \end{align*}\]Between the first two equations, I used the second symmetry on the first and second terms. From the second and third equations, I used the first antisymmetry on the second term. The rest follow from raising the tensors and forming the Ricci tensor and Ricci scalar. Note that we can raise the index on a covariant derivative (rather than a partial) because of metric compatibility.
Now suppose we define the Einstein tensor in terms of the Ricci tensor and scalar as the following.
\[G_{\mu\nu} \equiv R_{\mu\nu} - \frac{1}{2}R g_{\mu\nu}\](Note that this tensor is also symmetric because the Ricci tensor and the metric are also symmetric!) Applying to the above Bianchi identity, we can see the following property is true.
\[\begin{align*} \nabla^\mu G_{\mu\nu} &= 0\\ \nabla^\mu (R_{\mu\nu} - \frac{1}{2}R g_{\mu\nu}) &= 0\\ \nabla^\mu R_{\mu\nu} - \frac{1}{2}g_{\mu\nu}\nabla^\mu R &= 0\\ \nabla^\mu R_{\mu\nu} - \frac{1}{2}\nabla_\nu R &= 0\\ \end{align*}\]Note that the final line corresponds to the second-to-last line of the Bianchi identity above. As it turns out, this property corresponds to the conservation of energy and momentum in general relativity! In fact, the Einstein tensor is actually the left half of the Einstein Field Equations (EFE) that tell us how the geometry of a space is affected by the energy-momentum of that space.
So far, we’ve set up a ton of machinery, so let’s put it into practice on a canonical example: the two-sphere $S^2$!
We’ll define intrinsic spherical coordinates like a physicist: the polar angle, i.e., the angle with respect to the $z$-axis, is $\theta$, and the azimuthal angle, i.e., the angle in the $xy$-plane from the $x$-axis, is $\phi$.
The metric for a two-sphere requires only two intrinsic coordinates. Think about the Earth: we only need a latitude and longitude to specify a coordinate on the surface. To see this, let’s start with the spherical coordinate metric in a flat space.
\[\d s^2 = \d r^2 + r^2 \d\theta^2 + r^2\sin^2\theta\d\phi^2\]However, if we’re on a sphere of a constant radius, note that $\d r^2$ vanishes and we’re left with an intrinsic metric on a sphere.
\[\d s^2 = r^2(\d\theta^2 + \sin^2\theta\d\phi^2)\]Visually, treat $\d s^2$ as a little slice along the sphere, in terms of a $\theta$ and $\phi$. We can write the components of the metric and inverse metric tensor in matrix form.
\[\begin{align*} g_{ij} &= \begin{bmatrix}r^2 & 0\\ 0 & r^2\sin^2\theta \end{bmatrix}\\ g^{ij} &= \begin{bmatrix}\frac{1}{r^2} & 0\\ 0 & \frac{1}{r^2\sin^2\theta} \end{bmatrix}\\ \end{align*}\](Recall that the inverse of a diagonal metric just has the reciprocals of the components on its diagonal.) From these, we can compute the connection coefficients. It’s just the algebra of plugging the metric into the equation for the connection coefficients and churning them out. Remember that the bottom two indices are symmetric so we don’t have to compute them twice. Also, the off-diagonals of the metric and its inverse are zero, which makes it a bit easier. The only non-zero connection coefficients are the following.
\[\begin{align*} \Gamma^\theta_{\phi\phi} &= -\cos\theta\sin\theta\\ \Gamma^\phi_{\theta\phi} = \Gamma^\phi_{\phi\theta} &= \cot\theta\\ \end{align*}\]While we’re at it, we can compute the Ricci tensor. (This is also just algebra.)
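If you'd like to skip that algebra, here's a small finite-difference sketch (the helper names and step sizes are my own) that computes $\Gamma^k_{ij} = \frac{1}{2}g^{kl}(\p_i g_{jl} + \p_j g_{li} - \p_l g_{ij})$ numerically from the sphere metric and reproduces the two coefficients above:

```python
from math import sin, cos, tan

r = 2.0  # sphere radius; it drops out of the connection coefficients

def metric(theta):
    # g_ij for the sphere, coordinates ordered as (θ, φ)
    return [[r * r, 0.0], [0.0, r * r * sin(theta) ** 2]]

def inverse_metric(theta):
    g = metric(theta)
    return [[1.0 / g[0][0], 0.0], [0.0, 1.0 / g[1][1]]]

def d_metric(i, theta, h=1e-5):
    # centered difference ∂_i g_ab; the metric only depends on θ (index 0)
    if i == 1:
        return [[0.0, 0.0], [0.0, 0.0]]
    gp, gm = metric(theta + h), metric(theta - h)
    return [[(gp[a][b] - gm[a][b]) / (2 * h) for b in range(2)] for a in range(2)]

def christoffel(k, i, j, theta):
    # Γ^k_ij = ½ g^{kl} (∂_i g_jl + ∂_j g_li − ∂_l g_ij)
    ginv = inverse_metric(theta)
    return sum(0.5 * ginv[k][l] * (d_metric(i, theta)[j][l]
                                   + d_metric(j, theta)[l][i]
                                   - d_metric(l, theta)[i][j])
               for l in range(2))

theta = 1.0  # index 0 is θ, index 1 is φ
print(christoffel(0, 1, 1, theta), -sin(theta) * cos(theta))  # Γ^θ_φφ
print(christoffel(1, 0, 1, theta), 1 / tan(theta))            # Γ^φ_θφ
```

Changing `r` leaves the output untouched, which confirms that a constant rescaling of the metric doesn't affect the connection.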
\[\begin{align*} R_{\theta\theta} &= 1\\ R_{\theta\phi} = R_{\phi\theta} &= 0\\ R_{\phi\phi} &= \sin^2\theta\\ \end{align*}\]And finally we can compute the Ricci scalar.
\[R = \frac{2}{r^2}\]From this, we see that the Ricci scalar is constant across the sphere and positive. This makes sense since neighboring geodesics tend to “bow” outwards and “inflate”. On the other hand, if we had added some “noise” to the metric, then this wouldn’t be the case. One interesting thing to note is that the scalar curvature increases as the radius decreases. One interesting application is that we can model some kinds of black holes’ event horizons as spheres. And, as it turns out, the strength of tidal forces at the horizon scales with the curvature there, so it falls off as the radius grows. In other words, a black hole with a very large event horizon doesn’t have very strong tidal forces at its horizon. For the supermassive black hole at the center of our Milky Way galaxy, we could toss anything in without it being ripped apart by tidal forces at the horizon.
Another, more interesting, thing to consider is geodesics on the sphere. This is particularly interesting because, if we want to find the shortest path between two points on the Earth, the geodesic tells us exactly that! Let’s start by rewriting the geodesic equation.
\[\frac{\d^2 x^\mu}{\d\lambda^2} + \Gamma_{\sigma\rho}^\mu\frac{\d x^\sigma}{\d\lambda}\frac{\d x^\rho}{\d\lambda} = 0\]Recall that these are actually a set of 2nd order differential equations in $\mu$. Since we have two coordinates $\theta$ and $\phi$, we’ll have two equations. We can also simplify the equations since there are only two unique, non-zero connection coefficients.
\[\begin{align*} \frac{\d^2 x^\theta}{\d\lambda^2} + \Gamma_{\phi\phi}^\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\ \frac{\d^2 x^\phi}{\d\lambda^2} + \Gamma_{\theta\phi}^\phi\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} +\Gamma_{\phi\theta}^\phi\frac{\d x^\phi}{\d\lambda}\frac{\d x^\theta}{\d\lambda}&= 0\\ \end{align*}\]But remember that the connection coefficients are symmetric so the last two terms in the second equation are the same.
\[\begin{align*} \frac{\d^2 x^\theta}{\d\lambda^2} + \Gamma_{\phi\phi}^\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\ \frac{\d^2 x^\phi}{\d\lambda^2} + 2\Gamma_{\theta\phi}^\phi\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\ \end{align*}\]Now let’s plug in the values for the connection coefficients.
\[\begin{align*} \frac{\d^2 x^\theta}{\d\lambda^2} -\cos\theta\sin\theta\frac{\d x^\phi}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\ \frac{\d^2 x^\phi}{\d\lambda^2} + 2\cot\theta\frac{\d x^\theta}{\d\lambda}\frac{\d x^\phi}{\d\lambda} &= 0\\ \end{align*}\]These are a set of coupled 2nd order differential equations that are too difficult to solve in general. Fortunately, the sphere has a lot of symmetries so, even if we restrict the solution, we can use those symmetries to produce general solutions. For now, let’s fix a latitude $\theta=\tilde{\theta}$ so we have the equations $x^\theta(\lambda)=\tilde{\theta}, x^\phi(\lambda)=\alpha\lambda + \beta$ where $\alpha$ and $\beta$ are just constants that parameterize the path around the latitude circle. (We could ignore $\beta$, but I left it in for completeness.) Now let’s compute the first and second order derivatives needed for the geodesic equation.
\[\begin{align*} \frac{\d x^\theta}{\d\lambda} = 0 &, \frac{\d x^\phi}{\d\lambda} = \alpha\\ \frac{\d^2 x^\theta}{\d\lambda^2} = 0 &, \frac{\d^2 x^\phi}{\d\lambda^2} = 0\\ \end{align*}\]Now we can plug these into the geodesic equation and substitute $\theta=\tilde{\theta}$.
\[\begin{align*} 0 - \cos\tilde{\theta}\sin\tilde{\theta}\cdot\alpha^2 &= 0\\ 0 + 2\cot\tilde{\theta}\cdot 0\cdot\alpha &= 0\\ \end{align*}\]The second equation is just $0=0$, so we can ignore it and focus on the first one.
\[- \cos\tilde{\theta}\sin\tilde{\theta}\cdot\alpha^2 = 0\]The goal is to set $\tilde{\theta}$ and $\alpha$ such that the equation is also $0=0$. The easiest thing to do seems to be to set $\alpha=0$. But if we do that, the resulting equations become $x^\theta(\lambda)=\tilde{\theta}, x^\phi(\lambda)=\beta$, which is just a fixed point on the sphere. Let’s try to set $\sin\tilde{\theta}=0$. In this case, $\tilde{\theta}=0$ or $\tilde{\theta}=\pi$. The resulting equations become $x^\theta(\lambda)=0, x^\phi(\lambda)=\alpha\lambda+\beta$, but this is also just a point because $x^\theta(\lambda)=0$ and $x^\theta(\lambda)=\pi$ are the North and South Poles.
Instead, let’s try to set $\cos\tilde{\theta}=0$, which means $\tilde{\theta}=\frac{\pi}{2}$. The equations become $x^\theta(\lambda)=\frac{\pi}{2}, x^\phi(\lambda)=\alpha\lambda+\beta$. This represents a path along the equator! This kind of circle is called a great circle: a circle on a sphere where the center of the circle is the center of the sphere. Using the rotational symmetry of the sphere, all geodesics on a sphere are great circles. In other words, the shortest distance between any two points on a sphere is the great circle that contains those two points. (There are actually two directions, but we can simply pick the shortest one.) With this, we have shown that geodesics on spheres are all great circles using the geodesic equation. An alternative to finding geodesics with the geodesic equation is to use calculus of variations and the Euler-Lagrange equations, and that’s sometimes easier (maybe I’ll explain that in another post!), but this is also a valid way of finding geodesics.
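As a numerical check of this result (a rough Euler-integration sketch; the function name and step sizes are my own), we can integrate the two geodesic equations. Starting on the equator with purely azimuthal velocity, $\theta$ stays pinned at $\frac{\pi}{2}$, while starting on the $60°$ latitude circle with the same kind of velocity, the path immediately drifts away, confirming that non-equatorial latitude circles are not geodesics:

```python
from math import sin, cos, pi

def integrate_geodesic(theta0, dtheta0, dphi0, steps=1000, h=1e-3):
    """Euler-integrate the sphere's geodesic equations:
    θ'' = sinθ cosθ (φ')²  and  φ'' = −2 cotθ θ' φ'."""
    th, ph, dth, dph = theta0, 0.0, dtheta0, dphi0
    for _ in range(steps):
        ddth = sin(th) * cos(th) * dph * dph
        ddph = -2.0 * (cos(th) / sin(th)) * dth * dph
        th, ph = th + h * dth, ph + h * dph
        dth, dph = dth + h * ddth, dph + h * ddph
    return th, ph

eq_th, _ = integrate_geodesic(pi / 2, 0.0, 1.0)    # equator: a geodesic
tilt_th, _ = integrate_geodesic(pi / 3, 0.0, 1.0)  # 60° latitude circle: not one
print(abs(eq_th - pi / 2))    # stays on the equator
print(abs(tilt_th - pi / 3))  # drifts away, toward the equator
```

The drifting path is the integrator hunting for the actual geodesic: a great circle tilted relative to the latitude circle we tried to force it along.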
We’ve covered a lot of topics in this post, eventually culminating in answering a deceptively simple question: “how do we know if a space is flat?”. We saw that this was not an easy question when manifolds and intrinsic geometry were involved! To answer that question, we had to build up to it piece-by-piece, starting with a good intrinsic derivative operator, then discussing how to compare vectors on a manifold, and ending on curvature with a peek into general relativity. Here are some of the core concepts we learned:
That’s all! In this set of posts, we’ve learned all about manifolds and how to do calculus on them 😀
So far, we’ve only dealt with Euclidean spaces. However, there are plenty of spaces that are only locally Euclidean, but, globally, have a more interesting topology. This is the informal definition of a manifold: a space that is locally flat but globally more interesting. This has some profound connotations for how vectors, duals, and tensors are defined, as well as how we perform any kind of calculus (differentiation and integration) on this manifold. To be more precise, we can work with manifolds that don’t allow for calculus on them, i.e., non-differentiable manifolds, but those are much less interesting, and, practically, we’ll usually be able to perform calculus on our manifolds.
I gave an intuition of manifolds, but let me define a few concrete examples that you’ve likely seen or heard of:
With all of these examples, what isn’t a manifold? Using that same definition, anything that isn’t a manifold is a space where, at some point, it locally doesn’t look like a flat, $\R^n$ space. There are a few contrived examples, but also a few practical examples:
As for the more rigorous definition, we’ll be following Wald’s textbook on general relativity; even though we rarely use the full definition of a manifold, I think it’s a really neat construction that emphasizes several important characteristics of a manifold, e.g., independence of coordinates, no global frame, and independence of embedding space. Before we do that, however, I’ll review some definitions of maps and functions since they’re essential constructs in manifolds and differential geometry.
Given two sets $A$ and $B$, a map $\phi : A\to B$ assigns, to each $a\in A$, exactly one $b\in B$. We can think of this as a “generalization” of a function. With this definition, there are several different kinds of maps that are more specific:
one-to-one/injective: $\forall b\in B$, there is at most one $a\in A$ mapped to it by $\phi$. A technique you might have heard of for identifying injective functions is the “horizontal line test”: if there is a horizontal line that intersects the function more than once, it is not injective. For example, $f(x)=x^2$ fails since, for $f(x)=4$, $x=\pm 2$. Also, there may be a $b\in B$ such that $\nexists a\in A$ with $\phi(a)=b$. In other words, there may be an element in $B$ such that no element in $A$ is mapped to it.
onto/surjective: $\forall b\in B$, there is at least one $a\in A$ such that $\phi(a) = b$. In other words, every $b\in B$ originates from some $a\in A$, even if more than one $a$ is sent to the same $b$. An example of such a map is $f(x)=x^3$: every element on the $y$-axis is hit by some element on the $x$-axis. On the other hand, a function like $f(x)=e^x$, viewed as a map $\R\to\R$, is not surjective since its image covers only the positive part of the $y$-axis.
one-to-one correspondence/bijective: a function that is both one-to-one and onto. In other words, each $a\in A$ is sent to exactly one $b\in B$ and vice versa. For example, $x^3$ is bijective since it is both injective and surjective. As a corollary of the definition, for each bijection, there exists an inverse bijection $\phi^{-1} : B\to A$ such that $\phi^{-1}(\phi(a)) = a$. From this definition, it’s pretty easy to show that the composition of bijections is also a bijection.
The top function is one-to-one; the middle function is onto; and the bottom function is bijective.
One last thing we’ll need about maps is composition of maps: if $\phi: A\to B$ and $\psi : B\to C$, then $(\psi\circ\phi): A\to C, a\mapsto\psi(\phi(a))$.
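For finite samples of points, all of these definitions can be checked by brute force. Here's a little sketch (the helper names are my own) using discrete samples of the examples above:

```python
def is_injective(f, domain):
    # injective on a finite sample: no two domain elements share an image
    images = [f(a) for a in domain]
    return len(set(images)) == len(images)

def is_surjective(f, domain, codomain):
    # surjective onto a finite codomain: every codomain element is hit
    return set(codomain) <= {f(a) for a in domain}

xs = [-2, -1, 0, 1, 2]
square = lambda x: x * x
cube = lambda x: x ** 3

print(is_injective(square, xs))  # False: (-2)² = 2² fails the horizontal line test
print(is_injective(cube, xs))    # True
print(is_surjective(cube, xs, [x ** 3 for x in xs]))  # True: onto its image

# composition of bijections is a bijection
comp = lambda x: cube(cube(x))   # (ψ ∘ φ)(x) = x⁹
print(is_injective(comp, xs))    # True
```

Of course, passing on a finite sample doesn't prove a property over all of $\R$, but failing on one (like $x^2$ does) is a genuine counterexample.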
Now that we’ve reviewed the preliminaries, let’s construct a manifold! We’ll start by defining an open ball as the set of all points $x\in\R^n$ such that $\lVert x - y\rVert < r$ for some $y\in\R^n$ and $r\in\R$.
An open ball is a really simple construct: a set of points inside of an open circle.
(If we considered a closed ball, we’d have to worry about the boundary! As it turns out, we can completely construct a manifold with open balls rather than closed balls.) With that definition, we can define an open subset as a union of (a potentially infinite number of) open balls.
An open set is just a (possibly infinite) collection of open balls.
In fact, we can say that a subset $U\subset\R^n$ is open iff $\forall u\in U, \exists$ an open ball at $u$ such that it is inside of $U$. In other words, we can say that $U$ defines the interior of an $(n-1)$-dimensional surface. As a concrete example, an open set $U$ for $\R^2$ defines the interior of a $1$-dimensional surface, i.e., the interior of a closed loop on a plane. For $\R^3$, this would define the interior of a closed surface.
Now that we have this arbitrary set, we can naturally and immediately define a coordinate system/chart on this open set as being a subset $U\subset M$ and a one-to-one function $\phi : U\to\R^n$ that maps the open set $U$ into the flat Euclidean space $\R^n$. For convenience, instead of applying $\phi$ to individual points, we can consider the image of $\phi$ for a set of points. This is defined to be the set of all points in $\R^n$ that $U$ gets mapped to. As an example, we can consider the unit circle parameterized by $\theta$. Then we can define a chart such that $U=\{ \theta | \theta\in(0,\pi) \}$ and $\phi(\theta)=\theta$. This maps the half-circle $\theta\in(0,\pi)$ to the real line by “flattening” it. In fact, we could have mapped everything except a single point of the circle this way, but, as we’ll see, covering an entire manifold with a single chart is usually not possible.
A coordinate chart maps an arbitrary open set to an open set in a flat space.
Even though we can’t usually use a single chart to cover a manifold, we could use multiple charts if we impose some additional constraints. This is called a $C^\infty$ atlas: an indexed family of charts $\{(U_\alpha, \phi_\alpha)\}$ such that
The union of all of the sets cover the manifold: $\bigcup_\alpha U_\alpha = M$. If they didn’t, then we couldn’t create a chart for some part of our manifold!
If two charts overlap, they are smoothly sewn together. More formally, if $U_\alpha\cap U_\beta\neq\emptyset$, then the transition map $\phi_\beta\circ\phi_\alpha^{-1} : \phi_\alpha(U_\alpha\cap U_\beta)\to\phi_\beta(U_\alpha\cap U_\beta)$ is $C^\infty$. This is best explained in the figure below. This condition is the crux of manifold construction: we can smoothly sew together a bunch of locally flat spaces into a structure that is only locally flat, and we’ve said absolutely nothing about the global structure. The reason this is called a $C^\infty$ atlas is because all of the maps are $C^\infty$, in other words, continuous and infinitely differentiable.
This “smooth stitching” constraint is the most important part of the manifold definition: if we’re in one open set, we can “hop” to an adjacent one using this property.
Now we can finally get to the definition we’ve been waiting for! A $C^\infty$ $n$-dimensional manifold is a set $M$ with a maximal atlas. A maximal atlas is an atlas that contains every possible chart for that manifold. The reason we need a maximal atlas is so we don’t consider different atlases to be different manifolds. For example, if we had an atlas of a circle and another atlas that starts at 45 degrees relative to the first one, without the condition of a maximal atlas, we would have thought we had two different circles!
Note that in the construction of the manifold, we never mentioned anything about the space that the manifold may be embedded in or the global structure. We simply took a bunch of flat $\R^n$ spaces and smoothly sewed them together on their overlaps. Manifolds exist completely independent of the space they are embedded in. We can take a circle, embed it in either a plane or a space and the maps into the real line would be the same. In fact, there’s a famous theorem called Whitney’s embedding theorem that states any $n$-manifold can be embedded in at most $\R^{2n}$. For example, a sphere $S^2$ can be embedded in at most $\R^4$, but, it turns out we can also embed it in $\R^3$. Another example is a Klein bottle, which is a $2$-manifold, but it can only be embedded in $\R^4$.
Now let’s look at a few concrete examples of constructing a manifold from an atlas. We’ve seen an atlas for a circle, but we only covered it with a single chart. This doesn’t quite fit the manifold construction because a chart needs an open set, and a single chart covering the whole circle can’t map it onto an open subset of $\R$. Let’s fix that and use two overlapping charts to cover the circle:
\[\begin{align*} U_1 &=\Big\{\theta | \theta\in\Big(\frac{\pi}{4}, \frac{7\pi}{4}\Big)\Big\}, \phi_1(\theta)=\theta\\ U_2 &=\Big\{\theta | \theta\in\Big(\frac{3\pi}{4}, \frac{-3\pi}{4}\Big)\Big\}, \phi_2(\theta)=\theta\\ \end{align*}\]These two open charts overlap (the second interval wraps around through $\theta=0$) and together cover the whole circle. This atlas isn’t maximal, of course, but exhibiting just one atlas is enough to show a structure is a manifold since it extends to a maximal one.
The atlas for a circle needs at least two charts to ensure openness; a single chart can cover everything except one point.
For a slightly more complicated example, let’s consider the sphere $S^2$. This is one manifold where it is impossible to have a single chart that covers the manifold. We can cover the sphere with two charts using stereographic projection by excluding the North and South Poles. We can use the planes $x^3=\pm 1$ as the two copies of $\R^2$ to project into. (Recall that $x^3$ is a coordinate, not an exponent!) We will project a ray starting from one of the poles, intersecting the sphere, and landing on one of the planes. The two charts for our atlas are $U_1=\{\text{all points excluding the North pole}\}$ and $U_2=\{\text{all points excluding the South pole}\}$ with the maps
\[\begin{align*} \phi_1(x^1, x^2, x^3) &= \Big(\frac{2x^1}{1-x^3}, \frac{2x^2}{1-x^3}\Big)\\ \phi_2(x^1, x^2, x^3) &= \Big(\frac{2x^1}{1+x^3}, \frac{2x^2}{1+x^3}\Big)\\ \end{align*}\]This atlas hits all points on the sphere twice except for the North and South poles, which are hit only once; therefore, we still have an open set, and we can see that this hits all points on the sphere.
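As a sanity check (my own code; the inverse formula comes from solving the projection equations for $x^3$), we can invert $\phi_1$ and verify the round trip on a point of the unit sphere:

```python
def phi1(x1, x2, x3):
    # the chart above: projection away from the North pole (undefined at x³ = 1)
    return (2 * x1 / (1 - x3), 2 * x2 / (1 - x3))

def phi1_inv(X, Y):
    # solving X² + Y² = s together with (x¹)² + (x²)² + (x³)² = 1
    # gives x³ = (s − 4) / (s + 4)
    s = X * X + Y * Y
    x3 = (s - 4) / (s + 4)
    return (X * (1 - x3) / 2, Y * (1 - x3) / 2, x3)

p = (0.6, 0.0, 0.8)           # a point on the unit sphere (0.36 + 0.64 = 1)
q = phi1_inv(*phi1(*p))
print(q)                       # recovers (0.6, 0.0, 0.8) up to rounding
print(sum(c * c for c in q))   # stays on the unit sphere
```

Points near the excluded North pole get sent far out on the plane, which is exactly why we need the second chart $\phi_2$ to handle that neighborhood.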
Take either pole, project a beam from the inside through the surface to the outside, and record where it falls on the “catching” plane. This gives us a smooth map that projects the points on the sphere into a flat space.
So we’ve shown a sphere is indeed a manifold. Moreover, since we’re mapping the atlas into $\R^2$, we’ve shown it is specifically a 2-dimensional manifold.
Now that we’ve constructed the manifold, we need to re-introduce tensors, starting with vectors in the tangent space. In flat space, we already defined vectors to exist only at a point (to get around vectors in a curved coordinate system) and the collection of them all pointing in each direction to be the tangent space $T_p M$ at that point. First off, let’s construct the tangent space. Unlike in flat space, we can’t simply construct it by considering all vectors pointing in every direction because we haven’t yet defined what a vector on a manifold even is! Instead, we might think of “creating” vectors by looking at all possible curves $\xi : \R\to M, \lambda\mapsto\xi(\lambda)$ that go through a point $p$ and their tangent vectors at $p$. That would seem to give us basically the same result, but the problem lies in the parametrization of $\xi$: it’s dependent on the coordinates of the manifold! In other words, our tangent vectors would be $\frac{\d\xi^\mu}{\d\lambda}$, which depend on the coordinates $\xi^\mu$. Recall that vectors are independent of all coordinates since they are geometric objects so we can’t use this definition. Also, we’re cheating here since we haven’t defined what “tangent to a curve” even means!
We’re still pretty close though. Instead, let’s flip this notion and define the set of all infinitely-differentiable functions on the manifold $\mathcal{F}=\{\text{all } C^\infty f : M\to\R\}$. Given any curve through $p$ with parameter $\lambda$, we can define a directional derivative operator $\frac{\d}{\d\lambda}$ that acts on a function $f$ to produce $\frac{\d f}{\d\lambda}$. Notice that this doesn’t depend on the coordinates since we’re using a scalar function $f$, not a curve expressed in some coordinates! Now we can take a similar approach: look at all possible directional derivative operators at $p$ and define the tangent space to be that set.
At a point $p$, consider all possible (scalar) functions defined near that point. Along any parameterized curve through $p$, we can take the directional derivative of such a function with respect to the parameter.
However, in order to make that statement precise, we need to show that these operators actually form a vector space of the right dimension.
To show that the space of directional derivatives is a vector space, we need to show that two of these operators can be added and scaled and that the result is also a directional derivative operator. The first part is pretty easy: we can always form the linear combination
\[a\frac{\d}{\d\lambda} + b\frac{\d}{\d\tau}\]The second part is a bit trickier. A directional derivative operator must be linear and obey the Leibniz product rule. From the equation above, we can already see that the combined operator is linear, so we just need to show the product rule holds:
\[\begin{align*} \Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(fg) &= f\frac{\d g}{\d\lambda} + g\frac{\d f}{\d\lambda} + f\frac{\d g}{\d\tau} + g\frac{\d f}{\d\tau}\\ &= \Big(\frac{\d f}{\d\lambda}+\frac{\d f}{\d\tau}\Big)g + f\Big(\frac{\d g}{\d\lambda}+\frac{\d g}{\d\tau}\Big)\\ &= \Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(f)g + f\Big(\frac{\d}{\d\lambda}+\frac{\d}{\d\tau}\Big)(g) \end{align*}\]Therefore directional derivatives form a valid vector space. It sounds rather interesting that an “operator” can form a vector space, but really any kind of object can form a vector space as long as it satisfies the constraints! (Personally, I think “linear space” is maybe a better name since the properties of a vector space are really just linearity and closure.)
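We can also check the Leibniz computation numerically. The following sketch (the curves, functions, and names are all invented for illustration) approximates two directional derivative operators at a shared point with finite differences and verifies the product rule for their sum:

```python
import math

def xi1(lam):
    """First curve through p = (1.0, 2.0), with tangent (1, 3)."""
    return (1.0 + lam, 2.0 + 3.0 * lam)

def xi2(tau):
    """Second curve through the same point, with tangent (-2, 1)."""
    return (1.0 - 2.0 * tau, 2.0 + tau)

def f(x, y):
    return math.sin(x) * y

def g(x, y):
    return x * x + y

def D(curve, func, h=1e-6):
    """Directional derivative of func along curve at parameter 0 (central difference)."""
    return (func(*curve(h)) - func(*curve(-h))) / (2 * h)

p = (1.0, 2.0)
fg = lambda x, y: f(x, y) * g(x, y)

# Leibniz rule for the sum of two operators:
# (D1 + D2)(fg) == ((D1 + D2) f) g + f ((D1 + D2) g)
lhs = D(xi1, fg) + D(xi2, fg)
rhs = (D(xi1, f) + D(xi2, f)) * g(*p) + f(*p) * (D(xi1, g) + D(xi2, g))
assert abs(lhs - rhs) < 1e-5
```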
The last thing we have to do is show that the dimensionality of this vector space is the same as that of the manifold. Wald’s textbook on general relativity shows this directly, but Sean Carroll’s book uses a clever shortcut: the dimensionality of a vector space equals the number of vectors in any basis. Therefore, we just need to show that the number of basis vectors for the tangent space is the same as the dimensionality of the manifold. In other words, we need to construct a basis for the tangent space.
Let’s start by assuming some arbitrary coordinates $x^\mu$. Given those, there’s a natural choice of basis for the directional derivatives: the partial derivatives with respect to the coordinates, $\partial_\mu$! Then we need to show that this set of partial derivatives forms a basis and that the number of elements in the set is $n$, i.e., the dimensionality of our manifold. Specifically, we need to show that any directional derivative $\frac{\d}{\d\lambda}$ can be decomposed into a linear combination of the partial derivatives $\partial_\mu$.
For a set of coordinate functions on the manifold, the partial derivatives with respect to those coordinates form a basis for the directional derivatives.
Since we’re dealing with operators, it’s much less error-prone if we define some arbitrary function $f:M\to\R$ that the operators act on that we’ll remove at the end. We’ll also need a curve $\xi:\R\to M$ since $\xi$ is the function that is actually parameterized by the $\lambda$ in the directional derivative $\frac{\d}{\d\lambda}$. Since we’re at a point $p$, we’ll also get a chart $\phi:M\to\R^n$ with coordinates $x^\mu$ for free!
To reiterate, our goal is to show that we can write $\frac{\d}{\d\lambda}$ as a linear combination of partial derivatives $\partial_\mu$. With all of the maps and spaces, we can draw this picture.
The complicated set of maps can be used to show how any directional derivative can be teased apart into scalars and partial derivatives.
Conceptually, we’ll be applying $\frac{\d}{\d\lambda}$ to $f$, but realistically, we need to compose with $\xi$ since $\xi$ is the thing that is parameterized by $\lambda$.
\[\begin{align*} \frac{\d}{\d\lambda}f&\to\frac{\d}{\d\lambda}(f\circ\xi)\\ &=\frac{\d}{\d\lambda}[(f\circ\phi^{-1})\circ(\phi\circ\xi)]\\ &=\frac{\partial}{\partial x^\mu}(f\circ\phi^{-1})\frac{\d}{\d\lambda}(\phi\circ\xi)\\ &=\frac{\d}{\d\lambda}(\phi\circ\xi)\partial_\mu(f\circ\phi^{-1})\\ &=\frac{\d x^\mu}{\d\lambda}\partial_\mu(f\circ\phi^{-1})\\ &\to\frac{\d x^\mu}{\d\lambda}\partial_\mu f\\ \end{align*}\]In the last step, we use the fact that $\phi$ has coordinates $x^\mu$. Now we can remove $f$ since it was arbitrary:
\[\frac{\d}{\d\lambda}=\frac{\d x^\mu}{\d\lambda}\partial_\mu\]Now we’ve shown that we can decompose an arbitrary directional derivative $\frac{\d}{\d\lambda}$ into scalars $\frac{\d x^\mu}{\d\lambda}$ and vectors $\partial_\mu$. Thus, the $n$ partial derivatives really do form a basis for the tangent space, so the tangent space has the same dimensionality as the manifold! It’s a little strange to think that an operator is a vector! (Maybe this is less surprising if you’ve taken any quantum mechanics and learned that operators can be represented as matrices.) In fact, this basis is so convenient that we give it a name: the coordinate basis $\hat{e}_{(\mu)}\equiv\partial_\mu$. We don’t have to use this basis, but it’s often easy and convenient. One important thing to note is that this basis is not orthonormal everywhere like Cartesian coordinates are in flat space; in fact, if it were, we would actually have a flat space!
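The decomposition $\frac{\d}{\d\lambda}=\frac{\d x^\mu}{\d\lambda}\partial_\mu$ is easy to verify numerically. This sketch (the curve and function are arbitrary examples of mine) compares both sides with finite differences:

```python
import math

def xi(lam):
    """An arbitrary curve, giving coordinates (xi^1(λ), xi^2(λ))."""
    return (math.cos(lam), math.sin(lam) + lam)

def f(x1, x2):
    """An arbitrary scalar function on the manifold."""
    return x1 * x1 * x2 + math.exp(x2)

lam0, h = 0.3, 1e-6
x1, x2 = xi(lam0)

# Left side: directional derivative of f along the curve
lhs = (f(*xi(lam0 + h)) - f(*xi(lam0 - h))) / (2 * h)

# Right side: components dx^mu/dλ contracted with the partials ∂_mu f
dx1 = (xi(lam0 + h)[0] - xi(lam0 - h)[0]) / (2 * h)
dx2 = (xi(lam0 + h)[1] - xi(lam0 - h)[1]) / (2 * h)
d1f = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
d2f = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
rhs = dx1 * d1f + dx2 * d2f

assert abs(lhs - rhs) < 1e-5
```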
Given this basis, we can write out the general vector and basis transformation laws from the index notation (this isn’t exactly rigorous, but it works for now):
\[\begin{align*} \partial_{\mu'}&=\frac{\partial x^\mu}{\partial x^{\mu'}}\partial_{\mu}\\ V^{\mu'}&=\frac{\partial x^{\mu'}}{\partial x^{\mu}}V^{\mu}\\ \end{align*}\]Since we’re using a coordinate basis, the components will change when the basis changes, and a change of coordinates means a change of basis as well.
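As a concrete instance of the component transformation law $V^{\mu'}=\frac{\partial x^{\mu'}}{\partial x^{\mu}}V^{\mu}$, the following sketch (names and numbers are mine) transforms a vector from Cartesian to polar coordinates and cross-checks the $V^r$ component:

```python
import math

# Cartesian point and the corresponding radius (r = sqrt(x² + y²))
x, y = 1.0, 1.0
r = math.hypot(x, y)

# Jacobian ∂(r, θ)/∂(x, y) for r = sqrt(x²+y²), θ = atan2(y, x)
J = [[x / r, y / r],
     [-y / r**2, x / r**2]]

V = [2.0, -1.0]   # components V^x, V^y in the Cartesian coordinate basis
Vp = [J[0][0] * V[0] + J[0][1] * V[1],   # V^r
      J[1][0] * V[0] + J[1][1] * V[1]]   # V^θ

# Cross-check: V^r should equal the rate of change of r along V,
# computed directly by finite differences.
h = 1e-6
dr = (math.hypot(x + h * V[0], y + h * V[1])
      - math.hypot(x - h * V[0], y - h * V[1])) / (2 * h)
assert abs(Vp[0] - dr) < 1e-6
```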
So far, we’ve constructed the tangent space using partial derivatives as the basis vectors, but what about the cotangent space $T_p^* M$? How do we construct/define the basis for this space? Analogously to how we used the partials as the basis, we can use the gradients $\d x^\mu$ as the basis for $T_p^* M$. Previously, we defined the dual basis by $\hat{\zeta}^{(\mu)}(\hat{e}_{(\nu)})=\delta^\mu_\nu$, but now we’re going to upgrade it using our calculus notation:
\[\d x^\mu(\partial_\nu)\equiv\delta^\mu_\nu=\frac{\partial x^\mu}{\partial x^\nu}\]In this case, $\d x$ is not an infinitesimal but a kind of object called a differential form (specifically a one-form, also known as a gradient). A differential form is a $(0, p)$ antisymmetric tensor; a $0$-form is a scalar or scalar function, and a $1$-form is a gradient. There’s more work we have to do to discuss differential forms properly, so, for now, it’s fine to think of these as just dual vectors. From the definition, the set of gradients also forms a basis for the cotangent space. (We could go through a similar process of applying the one-forms to vectors to show this, but it looks very similar to the vector case, so I’m going to skip it.) Similar to vectors, we can derive the transformation laws.
\[\begin{align*} \d x^{\mu'}&=\frac{\partial x^{\mu'}}{\partial x^\mu}\d x^\mu\\ \omega_{\mu'}&=\frac{\partial x^\mu}{\partial x^{\mu'}}\omega_\mu \end{align*}\]Now that we’ve re-invented vectors and duals using the language of manifolds, we’re ready to construct tensors. As you might think, this construction follows straightforwardly from the construction in flat space: we take the tensor product of the basis vectors (partial derivatives) and duals (gradients).
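One-form components transform with the inverse Jacobian, which is exactly what makes the contraction $\omega_\mu V^\mu$ a coordinate-independent scalar. A quick sketch (example numbers and names are mine) of both transformation laws:

```python
import math

x, y = 1.0, 2.0
r = math.hypot(x, y)

# Jacobian ∂(r, θ)/∂(x, y) and its inverse ∂(x, y)/∂(r, θ)
J = [[x / r, y / r],
     [-y / r**2, x / r**2]]
Jinv = [[x / r, -y],       # ∂x/∂r = cosθ, ∂x/∂θ = -r sinθ
        [y / r, x]]        # ∂y/∂r = sinθ, ∂y/∂θ = r cosθ

V = [0.5, -1.5]            # vector components (Cartesian basis)
w = [3.0, 1.0]             # one-form components (Cartesian basis)

# V^{μ'} = (∂x'^{μ'}/∂x^μ) V^μ  and  ω_{μ'} = (∂x^μ/∂x'^{μ'}) ω_μ
Vp = [sum(J[i][m] * V[m] for m in range(2)) for i in range(2)]
wp = [sum(Jinv[m][i] * w[m] for m in range(2)) for i in range(2)]

# The scalar ω(V) is the same in both coordinate systems
s_cart = sum(w[m] * V[m] for m in range(2))
s_polar = sum(wp[i] * Vp[i] for i in range(2))
assert abs(s_cart - s_polar) < 1e-12
```

The Jacobians cancel pairwise in the contraction, $\frac{\partial x^\mu}{\partial x^{\mu'}}\frac{\partial x^{\mu'}}{\partial x^\nu}=\delta^\mu_\nu$, which is why the scalar comes out the same.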
\[\begin{align*} T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l}&=T(\d x^{\mu_1}, \cdots, \d x^{\mu_k}, \partial_{\nu_1}, \cdots, \partial_{\nu_l})\\ T&=T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l} \partial_{\mu_1}\otimes\cdots\otimes\partial_{\mu_k}\otimes\d x^{\nu_1}\otimes\cdots\otimes\d x^{\nu_l}\\ T^{\mu_1'\cdots\mu_k'}_{\nu_1'\cdots\nu_l'}&=\frac{\partial x^{\mu_1'}}{\partial x^{\mu_1}}\cdots\frac{\partial x^{\mu_k'}}{\partial x^{\mu_k}}\frac{\partial x^{\nu_1}}{\partial x^{\nu_1'}}\cdots\frac{\partial x^{\nu_l}}{\partial x^{\nu_l'}}T^{\mu_1\cdots\mu_k}_{\nu_1\cdots\nu_l} \end{align*}\]Almost everything is the same as it was in flat space, except we’ve upgraded our basis vectors and duals to partial derivatives and gradients (this also technically works in a flat space but is a bit overkill in that context). Just as with flat space, we have the metric tensor $g_{\mu\nu}$.
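As a sanity check of the $(0, 2)$ transformation law applied to the metric, this sketch (variable names mine) transforms the flat Euclidean metric into polar coordinates, recovering the familiar components $g_{rr}=1$, $g_{\theta\theta}=r^2$:

```python
import math

r, theta = 2.0, 0.7

# Jacobian ∂(x, y)/∂(r, θ) for x = r cosθ, y = r sinθ
Jinv = [[math.cos(theta), -r * math.sin(theta)],
        [math.sin(theta), r * math.cos(theta)]]

g = [[1.0, 0.0], [0.0, 1.0]]   # Euclidean metric in Cartesian coordinates

# g_{μ'ν'} = (∂x^μ/∂x^{μ'}) (∂x^ν/∂x^{ν'}) g_{μν}
gp = [[sum(Jinv[m][i] * Jinv[n][j] * g[m][n]
           for m in range(2) for n in range(2))
       for j in range(2)]
      for i in range(2)]

assert abs(gp[0][0] - 1.0) < 1e-12       # g_rr = 1
assert abs(gp[0][1]) < 1e-12             # g_rθ = 0
assert abs(gp[1][1] - r * r) < 1e-12     # g_θθ = r²
```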
The next thing I’ll point out is a small nuance with notation. Recall the polar coordinates metric
\[\d s^2=\d r^2+r^2\d\theta^2\]Here, $\d s^2$ is just a symbol, but $\d r$ and $\d\theta$ are honest basis one-forms, so $\d r^2$ and $\d\theta^2$ really mean $\d r\otimes\d r$ and $\d\theta\otimes\d\theta$. That being said, in this case, our use of basis one-forms is consistent with the infinitesimal philosophy for now.
I’ll end on some nomenclature that is popular in other sources (as well as some foreshadowing). A metric is said to be in canonical form if it is written as $g_{\mu\nu}=\mathrm{diag}(-1,\cdots,-1,+1,\cdots,+1,0,\cdots,0)$ where $\mathrm{diag}$ is a diagonal matrix with the diagonal entries as the arguments to the function. At a point, it’s always possible to put the metric in this form: for a point $p\in M$, there exist coordinates $x^{\hat{\mu}}$ such that $g_{\hat{\mu}\hat{\nu}}$ is canonical and $\partial_\hat{\sigma}g_{\hat{\mu}\hat{\nu}}=0$. In other words, at $p$, the metric looks like the flat metric and its first derivatives vanish. Coordinates that satisfy these conditions are called Riemann Normal Coordinates:
\[\begin{align*} g_{\hat{\mu}\hat{\nu}}(p)&=\delta_{\hat{\mu}\hat{\nu}}\\ \partial_\hat{\sigma}g_{\hat{\mu}\hat{\nu}}(p)&=0 \end{align*}\]This gives us a convenient set of coordinates to work in initially; we can then generalize using tensor notation. If we can show our equation is true in this coordinate system, then it must be true in all coordinate systems, since a tensor equation that holds in one coordinate system holds in every coordinate system. We’ll need some extra machinery to fully justify this claim, but it stands nonetheless.
One last bit of terminology is the metric signature: the number of positive and negative eigenvalues of the metric. A metric is Euclidean/Riemannian/positive-definite if all eigenvalues are positive. This is the signature for most mathematical manifolds. A metric is Lorentzian/pseudo-Riemannian if it has exactly one negative eigenvalue and the rest are positive. This is the metric used in relativity as the metric of spacetime, with the negative eigenvalue corresponding to the time coordinate. (Alternatively, we could flip the spacetime metric to have three negative eigenvalues for the spatial components and a positive eigenvalue for the temporal component.) A metric is indefinite if it has a mixture of positive and negative eigenvalues. A metric is degenerate if it has any zero eigenvalues; note that this means an inverse metric doesn’t exist. If a metric is continuous and non-degenerate, its signature is the same everywhere. In other words, if we start in a Lorentzian spacetime, the metric is non-degenerate and continuous, so spacetime stays Lorentzian everywhere (at least, that’s what we think now). In practice, we don’t usually deal with indefinite or degenerate metrics; in fact, in relativity, we assume a non-degenerate metric because a degenerate one wouldn’t be terribly useful in the first place!
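The classification above is easy to express in code. Here is a tiny sketch (the function name `signature` is mine) that labels a diagonal metric by the signs of its eigenvalues, which for a diagonal matrix are just the diagonal entries:

```python
def signature(diag):
    """Classify a metric given its eigenvalues (the entries of a diagonal metric)."""
    neg = sum(1 for v in diag if v < 0)
    zero = sum(1 for v in diag if v == 0)
    if zero:
        return "degenerate"        # no inverse metric exists
    if neg == 0:
        return "Riemannian"        # all eigenvalues positive
    if neg == 1:
        return "Lorentzian"        # exactly one negative eigenvalue
    return "indefinite"            # some other mix of signs

assert signature([1, 1, 1]) == "Riemannian"          # Euclidean 3-space
assert signature([-1, 1, 1, 1]) == "Lorentzian"      # Minkowski spacetime
assert signature([-1, -1, 1, 1]) == "indefinite"
assert signature([0, 1, 1]) == "degenerate"
```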
In this post, we learned how to construct a manifold from fundamental objects like sets and how to re-invent vectors, duals, and tensors on the manifold. To review: we built manifolds from charts and atlases, defined tangent vectors as directional derivative operators with the coordinate basis $\partial_\mu$, defined one-forms with the dual basis $\d x^\mu$, assembled general tensors from tensor products of these bases, and classified metrics by their signature.
In the next installment, we’ll discuss the most important property of a manifold: curvature 😀