Algorithms for Generalized Inference, Learning, and Extraction Package
By Luke de Oliveira.
AGILEPack is a versatile C++ Deep Learning library designed with the needs of particle physicists in mind. Its purpose is twofold: first, it aims to provide a set of accessible, state-of-the-art Deep Learning methods for physicists to use; second, it aims to wrap away many of the hassles of dealing with the ROOT framework when doing numerical or machine learning studies.
AGILEPack provides a generic interface to design, build, estimate, save, load, and use Deep Networks for unsupervised learning, classification, and regression purposes. In particular (and due to the nature of data from particle collisions), AGILEPack provides a comprehensive implementation of an autoencoder framework to build such deep networks.
The AGILEPack repository with all code can be found at https://github.com/lukedeo/AGILEPack.
At its most conceptual level, Deep Learning is a set of machine learning algorithms that aim to develop high-level, abstracted representations of datasets. By forming layers of understanding about relationships in the data, we can construct more accurate predictive models for quantities of interest. In the case of High Energy Physics, this means we can build classification models using low-level kinematic variables, without having to use the higher-level, more computationally intensive features that are commonly needed.
Specifically, AGILEPack is an implementation of a paradigm called Stacked Autoencoders. An autoencoder is a nonlinear dimensionality reduction technique that learns how to reconstruct an input distribution. When stacked, autoencoders create deep, hierarchical representations of data which can then be used to form very accurate predictive models. This process is divided into two steps: unsupervised pre-training of the stacked autoencoders, followed by supervised fine-tuning of the full network.
In what follows, we’ll embark on a technical explanation of autoencoders.
An autoencoder is a nonlinear transformation of a linear mapping that creates a "bottleneck" of dimensionality and information. Formally, consider a data vector \(x\in\mathbb{R}^{D}\). Suppose we want to find a representation \(r\in\mathbb{R}^{d}\), with \(d < D\). An autoencoder consists of an encoder, which is a pair \((W_{1},b_{1})\), with \(W_{1}\in\mathbb{R}^{d\times D},b_{1}\in\mathbb{R}^{d}\), and a decoder, which is a pair \((W_{2},b_{2})\), with \(W_{2}\in\mathbb{R}^{D\times d},b_{2}\in\mathbb{R}^{D}\). We also need two mappings \(f,g:\mathbb{R}\longrightarrow\mathbb{R}\) that can be applied element-wise to vectors and matrices. The encoder then generates an encoded, bottlenecked representation by the map
\[\varphi(x)=f(W_{1}x+b_{1}),\]
and the decoder reproduces the original vector by the map
\[\rho(x)=g(W_{2}\varphi(x)+b_{2}).\]
To ensure this new space represents the original data vector well, we would like to find \(W_{i},b_{i}\) such that \(x\approx\rho(x)\). Collecting the parameters as \(\theta=(W_{1},b_{1},W_{2},b_{2})\), we write \(\rho_{\theta}\) when the dependence on the parameters matters. Let \(X\in\mathbb{R}^{D\times n}\) be a data batch, where \(x_{i}\) is the \(i\)th point (column) of the dataset. We define the loss over this dataset to be
\[L(X;\theta)=\frac{1}{2n}\sum_{i=1}^{n}\Vert x_{i}-\rho_{\theta}(x_{i})\Vert_{2}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2},\]
where \(\lambda_{1}\) and \(\lambda_{2}\) are regularization parameters (usually around \(10^{-4}\)) that act as \(\ell_{2}\)-norm penalties discouraging overtraining on the presented data batch. This leads to the optimization problem
\[\min_{W_{1},W_{2},b_{1},b_{2}}\frac{1}{2n}\sum_{i=1}^{n}\Vert x_{i}-\rho(x_{i})\Vert_{2}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2}\]
which we can solve using Stochastic Gradient Descent, or an online version of BFGS or L-BFGS.
Gradients for autoencoders are rarely written out in efficient linear-algebraic form. While working on the AGILEPack implementation, I derived the gradients with respect to a whole data batch (rather than a single vector) in plain linear algebra; the final expressions are given here for reference.
In what follows, every \(\mathbf{1}\) is a vector of 1's sized appropriately to make the multiplication defined. Suppose we have our data batch \(X\in\mathbb{R}^{D\times n}\). Its encoding given \(\theta\) is
\[\varphi(X)=f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})\]
with associated decoding
\[\rho(X)=g(W_{2}\varphi(X)+b_{2}\mathbf{1}^{T}).\]
Completely written out, our loss function takes the form
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}\mathbf{1}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2}.\]
Taking derivatives,
\[\frac{\partial L(X;\theta)}{\partial W_{2}}=\frac{1}{n}\underbrace{\left(\rho(X)-X\right)\odot g'(W_{2}\varphi(X)+b_{2}\mathbf{1}^{T})}_{\delta_{dec}}\,\varphi(X)^{T}+2\lambda_{2}W_{2}.\]
Using this definition of \(\delta_{dec}\), we have that
\[\begin{aligned} \frac{\partial L(X;\theta)}{\partial W_{2}} & = \frac{1}{n}\delta_{dec}\varphi(X)^{T}+2\lambda_{2}W_{2},\\ \frac{\partial L(X;\theta)}{\partial b_{2}} & = \frac{1}{n}\delta_{dec}\mathbf{1}. \end{aligned}\]
Now, define \(\delta_{enc}=\left(W_{2}^{T}\delta_{dec}\right)\odot f'(W_{1}X+b_{1}\mathbf{1}^{T})\), which yields
\[\begin{aligned} \frac{\partial L(X;\theta)}{\partial W_{1}} & = \frac{1}{n}\delta_{enc}X^{T}+2\lambda_{1}W_{1},\\ \frac{\partial L(X;\theta)}{\partial b_{1}} & = \frac{1}{n}\delta_{enc}\mathbf{1}. \end{aligned}\]
Using this formulation of \(\frac{\partial L(X;\theta)}{\partial\theta}\), we can use any numerical optimization procedure to find our weights. Two features of this formulation are worth noting. First, no regularization is applied to the bias vectors \(b_{i}\); they contain too few parameters to seriously overtrain a representation. Second, any variant that does not use regularization is recovered by simply setting \(\lambda_{i}=0\).
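To make the batch formulas concrete, here is a minimal, self-contained sketch in C++ with Eigen of a full-batch gradient step for a single autoencoder, assuming a sigmoid encoder \(f\) and a linear (identity) decoder \(g\). This is illustrative only, not AGILEPack's internal implementation, and all names, dimensions, and hyperparameter values are placeholders.

#include <Eigen/Dense>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// element-wise logistic sigmoid, and its derivative expressed via the activation A = f(Z)
MatrixXd sigmoid(const MatrixXd& Z) { return (1.0 / (1.0 + (-Z.array()).exp())).matrix(); }
MatrixXd sigmoid_prime_from_act(const MatrixXd& A) { return (A.array() * (1.0 - A.array())).matrix(); }

int main()
{
    const int D = 10, d = 4, n = 100;       // input dim, bottleneck dim, batch size
    const double lambda = 1e-4, lr = 0.05;  // regularizer and learning rate (placeholders)
    const double invn = 1.0 / n;

    MatrixXd X  = MatrixXd::Random(D, n);   // fake data batch; columns are data points
    MatrixXd W1 = 0.1 * MatrixXd::Random(d, D);
    MatrixXd W2 = 0.1 * MatrixXd::Random(D, d);
    VectorXd b1 = VectorXd::Zero(d);
    VectorXd b2 = VectorXd::Zero(D);

    for (int epoch = 0; epoch < 100; ++epoch)
    {
        // forward pass: phi(X) = f(W1 X + b1 1^T), rho(X) = g(W2 phi(X) + b2 1^T), with g = identity
        MatrixXd Phi = sigmoid((W1 * X).colwise() + b1);
        MatrixXd Rho = (W2 * Phi).colwise() + b2;

        // backward pass, following the batch gradient formulas above (g' = 1)
        MatrixXd delta_dec = Rho - X;
        MatrixXd delta_enc = ((W2.transpose() * delta_dec).array()
                              * sigmoid_prime_from_act(Phi).array()).matrix();

        MatrixXd gW2 = invn * delta_dec * Phi.transpose() + 2.0 * lambda * W2;
        VectorXd gb2 = invn * delta_dec.rowwise().sum();
        MatrixXd gW1 = invn * delta_enc * X.transpose() + 2.0 * lambda * W1;
        VectorXd gb1 = invn * delta_enc.rowwise().sum();

        // full-batch gradient descent step
        W2 -= lr * gW2;  b2 -= lr * gb2;
        W1 -= lr * gW1;  b1 -= lr * gb1;

        if (epoch % 20 == 0)
            std::cout << "reconstruction loss: " << 0.5 * invn * (X - Rho).squaredNorm() << std::endl;
    }
    return 0;
}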
A standard autoencoder, defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2},\]
has no regularization terms and aims to build the best reconstruction possible. That is, in applications where we care about reconstructing the input as closely as possible, an unmodified autoencoder is optimized to preserve as much of the intrinsic variation as possible. However, the lack of regularization makes such a building block highly susceptible to overtraining. In cases where the user has a large training set and is sure that the samples shown to the learner are in no way biased relative to the test sample or true population, this method can be appropriate, as it attempts to create exact reconstructions of the data vectors.
The regularized autoencoder is defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2},\]
where \(\lambda_{i}\) are chosen to penalize large weights in the autoencoder. This regularization, which in essence forces weights to be nearer to zero, helps the autoencoder learn robust reconstructions. Overfit models tend to have parameters that are large in magnitude, blown up by statistical fluctuations that can be exploited when minimizing the loss function. The regularization terms make the reconstructions robust with respect to variations in the features present in the training set.
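To see concretely why this pulls weights toward zero, note that the penalty term \(\lambda_{i}\Vert W_{i}\Vert_{F}^{2}\) contributes \(2\lambda_{i}W_{i}\) to the gradient, so a gradient step with learning rate \(\eta\) takes the form

\[W_{i}\leftarrow(1-2\eta\lambda_{i})W_{i}-\eta\,\nabla_{W_{i}}L_{\mathrm{rec}}(X;\theta),\]

where \(L_{\mathrm{rec}}\) denotes the reconstruction part of the loss: each update first shrinks the weights multiplicatively, then applies the data-driven correction.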
The denoising autoencoder is defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}\tilde{X}+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2},\]
where \(\tilde{X}\sim P(\tilde{X}|X)\) is a stochastic, corrupting mapping. Common examples of corruption mappings are:

- \(P(\tilde{X}|X)=X\odot B\), with \(B_{ij}\sim\mathrm{Bernoulli}(p)\) (i.e., a binary mask), and
- \(P(\tilde{X}|X)=X+N_{0,\varepsilon}\), where \([N_{0,\varepsilon}]_{ij}\sim N(0,\varepsilon^{2})\).
For the sake of simplicity, we simply use normal perturbations. Corrupting the input vectors leads to robust representations of the input variables; that is, the reconstructed representations are robust to fluctuations, leading to a more robust understanding of the relationships between input variables.
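For concreteness, here is a small, self-contained sketch (illustrative only, not AGILEPack's implementation) of the two corruption mappings listed above, applied to an Eigen data batch; the keep-probability and noise scale used in main are placeholder values.

#include <Eigen/Dense>
#include <random>

using Eigen::MatrixXd;

// Binary-mask corruption: each entry is kept with probability p and zeroed otherwise.
MatrixXd mask_corrupt(const MatrixXd& X, double p, std::mt19937& gen)
{
    std::bernoulli_distribution keep(p);
    MatrixXd Xt = X;
    for (int j = 0; j < Xt.cols(); ++j)
        for (int i = 0; i < Xt.rows(); ++i)
            if (!keep(gen)) Xt(i, j) = 0.0;
    return Xt;
}

// Additive Gaussian corruption: X + N, with N_ij ~ N(0, eps^2).
MatrixXd gaussian_corrupt(const MatrixXd& X, double eps, std::mt19937& gen)
{
    std::normal_distribution<double> noise(0.0, eps);
    MatrixXd Xt = X;
    for (int j = 0; j < Xt.cols(); ++j)
        for (int i = 0; i < Xt.rows(); ++i)
            Xt(i, j) += noise(gen);
    return Xt;
}

int main()
{
    std::mt19937 gen(42);
    MatrixXd X = MatrixXd::Random(5, 3);            // toy 5-feature, 3-example batch
    MatrixXd X_masked = mask_corrupt(X, 0.8, gen);  // keep ~80% of entries
    MatrixXd X_noisy  = gaussian_corrupt(X, 0.1, gen);
    // a denoising autoencoder is then trained to reconstruct X from X_masked or X_noisy
    return 0;
}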
With the theory out of the way, we can discuss practical usage of AGILEPack. We first need to discuss its dependencies:

- `Eigen`, a matrix library (header only) for fast matrix operations.
- `ROOT`, the framework that CERN uses for data analysis. This is only used to read data from ROOT files.

The package can be downloaded using
git clone https://github.com/lukedeo/AGILEPack.git
AGILEPack is relatively simple to install. As long as `Eigen` is in a place such that

#include <Eigen/Dense>

won't cause your compiler to produce errors, and the `$ROOTSYS` environment variable is set appropriately (a ROOT-specific need), you should be able to build the entire package with

make [-j]

This will build a static library called `lib/libAGILEPack.a`, which you can then link against to use all elements of the package. It's recommended that you put `/path/to/AGILEPack` in your `$PATH`, but it's certainly not necessary.
AGILEPack is divided into seven distinct parts.

- The `agile::root::tree_reader` class – a wrapper around reading flat trees from ROOT files.
- The `agile::dataframe` – the `agile::root::tree_reader` class can dump into an `agile::dataframe` for easy manipulation. The `agile::dataframe` can then be transformed into an `agile::matrix` (a typedef for `Eigen::Matrix<numeric, Eigen::Dynamic, Eigen::Dynamic>`).
- The `Boost` libraries and my hacked version of the `yaml-cpp` library. The `YAML` markup language is the specification used for all serialization of network objects within AGILEPack. By including the hacked subset of Boost and yaml-cpp itself, we reduce the number of potential dependency mishaps.
- The `agile::neural_net` class – the main class a basic user will interact with.
- The `cmd-parser` library (written by me), used to implement a simple CLI around using AGILEPack for HEP purposes.

Let's say you have a program called `prog.cxx` that uses AGILEPack. Provided that your file takes the form
#include "AGILEPack"
// ^~~~~~ Includes all AGILEPack stuff!
#include <myotherheaders.h>
int main(int argc, char const *argv[])
{
// Do stuff with AGILEPack!
}
You can compile this program with:
g++ -o prog prog.cxx `/path/to/AGILEPack/agile-config build --root`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
fixes include paths and links to -lAGILEPack
which will produce an executable called `prog`.
The ordering of these arguments is very important on Linux (particularly on SLC{5,6})!
Let’s say you’re using AGILEPack as part of a larger project, and you need to produce a *.o file. Stealing the example from the last section, let’s say we want to produce prog.o
. We can do this with
g++ -c `/path/to/AGILEPack/agile-config compile --root` prog.cxx -o prog.o
then link it to other files with
g++ -o MyExecutable prog.o [other *.o ...] `/path/to/AGILEPack/agile-config link --root`
The API for AGILEPack was designed to be as versatile as possible, while still allowing a power user to tweak the package to their own desires. This tutorial will show a very simple example using a fake ROOT file (which doesn't actually exist).
Suppose we want to make a b-tagger. Say we have a `TFile`
called training.root
over which we’d like to train a neural network. Let’s also say that a flat, numeric ntuple
lives inside a TTree
called Physics
. (NOTE: the ntuple(s)
must be flat).
For the sake of this tutorial, suppose that the ntuple
has the following branches:
Branch Name | (C/C++) Numeric Type |
---|---|
`bottom` | `int` |
`nTracks` | `int` |
`nVTX` | `int` |
`ip3d_pb` | `double` |
`ip3d_pu` | `double` |
`ip3d_pc` | `double` |
`mass` | `float` |
`significance3d` | `float` |
`pt` | `float` |
`eta` | `float` |
For the estimation of the posterior probability that a given jet is a bottom jet, we need the `bottom` branch to be 1 if a jet is a bottom and 0 otherwise. AGILEPack has three different numeric types it can read from a TTree:

- `agile::root::integer`
- `agile::root::single_precision`
- `agile::root::double_precision`

We will now walk through the steps of using the `agile::root::tree_reader` class to bypass the `TTree` branch-address nonsense.
Now, let’s see how the AGILEPack API helps us load this TTree into a usable format.
agile::root::tree_reader reader; // declare a tree_reader instance
reader.add_file("training.root", "Physics"); // Load the file and TTree
// Set all the branches. Notice the passing of types.
reader.set_branch("bottom", agile::root::integer);
reader.set_branch("nTracks", agile::root::integer);
reader.set_branch("nVTX", agile::root::integer);
reader.set_branch("ip3d_pb", agile::root::double_precision);
reader.set_branch("ip3d_pu", agile::root::double_precision);
reader.set_branch("ip3d_pc", agile::root::double_precision);
reader.set_branch("mass", agile::root::single_precision);
reader.set_branch("significance3d", agile::root::single_precision);
reader.set_branch("pt", agile::root::single_precision);
reader.set_branch("eta", agile::root::single_precision);
// ^~~~^ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~^
// branch name the numeric type
What happens now? Well, suppose we want to get the value of `nTracks` at entry 4. We can call:
double nTracks = reader(4, "nTracks");
Suppose we want to get the elements from nTracks
, nVTX
, ip3d_pb
, and ip3d_pu
. We can call:
std::vector<std::string> vars( {"nTracks", "nVTX", "ip3d_pb", "ip3d_pu"} );
std::map<std::string, double> my_vars = reader(4, vars);
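Since the return value is an ordinary std::map<std::string, double>, we can, for example, loop over it to inspect the entry (a quick sketch, assuming <iostream> is included):

for (const auto& kv : my_vars)
    std::cout << kv.first << " = " << kv.second << std::endl;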
This is great, but doesn't help us train a neural network. Enter the `agile::dataframe`. Neural network training needs access to data stored in a structure that provides random pattern access, so we can use the `agile::dataframe` as a sort of temporary storage. Let's dump some entries from the `TTree` stored in `reader` into a `dataframe`.
agile::dataframe D = reader.get_dataframe(1000);
// ~~~~~~~~~~~~~~^
// Dumps 1000 entries
agile::dataframe D = reader.get_dataframe(500, 12);
// ~~~~~~~~~~~~~~^ ^~~~~~~~
// Dumps 500 entries... ...after skipping the first 12 entries
agile::dataframe D = reader.get_dataframe();
// ~~~~~~~~~~~~^
// Dump ALL the things
D.to_csv("data.csv"); // maybe we want to write it to CSV!
Now, suppose there's another ROOT file called training_2.root
with a TTree
called more_physics
that we also want in this dataframe
.
agile::root::tree_reader reader_2; // declare a tree_reader instance
reader_2.add_file("training_2.root", "more_physics"); // Load the file and TTree
// Set all the same branches as before.
reader_2.set_branch("bottom", agile::root::integer);
reader_2.set_branch("nTracks", agile::root::integer);
reader_2.set_branch("nVTX", agile::root::integer);
reader_2.set_branch("ip3d_pb", agile::root::double_precision);
reader_2.set_branch("ip3d_pu", agile::root::double_precision);
reader_2.set_branch("ip3d_pc", agile::root::double_precision);
reader_2.set_branch("mass", agile::root::single_precision);
reader_2.set_branch("significance3d", agile::root::single_precision);
reader_2.set_branch("pt", agile::root::single_precision);
reader_2.set_branch("eta", agile::root::single_precision);
// we already have our `agile::dataframe` named `D` from before.
D.append(std::move(reader_2.get_dataframe(1000)));
// ^~~~~~~~~^
// Optional, but avoids copy
So, we can call append()
to squash dataframes together (some checking of bounds and such happens in the background).
We don't know how to use the neural network training portion of the API yet, but it's important to document the formula syntax. Input and output variables and discriminants are specified using a model formula. This is a fancy way of saying that variable inclusion and exclusion are handled by a formula parser that constructs what the neural network sees. The following characters are used to specify a formula:
Character | Meaning |
---|---|
`~` | Means "is a function of". This splits the left and right hand sides of what the neural network is estimating. |
`+` | Means "Include the next variable in the neural network". |
`-` | Means "Don't include the next variable in the neural network". |
`*` | Standard glob. Means "Include everything not on the left hand side of the formula". |
In our case (with the branches mentioned in the previous section), the formula
"bottom ~ *"
would mean “predict bottom
using all other variables as inputs”. The formula
"bottom ~ * - eta"
would mean “predict bottom
using all other variables except eta
as inputs”. The formula
"bottom + pt ~ ip3d_pb + ip3d_pu + ip3d_pc + mass"
is simply “predict bottom
and pt
as functions of everything on the right hand side”. Note that formulas aren’t space sensitive – the formulas bott om ~ pt+eta + ip3d_pb
and bottom ~ pt + eta + ip3d_pb
are the exact same thing.
Suppose we've decided that we wish to train a neural network using `nTracks`, `nVTX`, `ip3d_pb`, `ip3d_pu`, `ip3d_pc`, `mass`, `significance3d`, and `pt` to predict `bottom` without using `eta`. We then must decide the network structure to use for this prediction problem. For this example, we use autoencoders with 17, 12, 5, and 3 hidden units, and then a final layer used for the prediction of `bottom`. We wish to use stacked autoencoders for the first layers, then a standard layer for the last (prediction) layer. We continue the example, using the `agile::dataframe` `D` from before.
agile::neural_net my_net;
// move semantics work here
my_net.add_data(std::move(D));
// the neural net parses this and creates "sub" dataframes
my_net.model_formula("bottom ~ * -eta");
If more data needs to be added, say from an agile::dataframe
named F
, we can call
my_net.add_data(std::move(F));
We now create the network structure that we specified…
// "stacks" the layers (stacked autoencoders), uses C++11-style syntax
my_net.emplace_back(new autoencoder(8, 17, sigmoid, linear));
my_net.emplace_back(new autoencoder(17, 12, sigmoid, sigmoid));
my_net.emplace_back(new autoencoder(12, 5, sigmoid, sigmoid));
my_net.emplace_back(new autoencoder(5, 3, sigmoid, sigmoid));
my_net.emplace_back(new layer(3, 1, sigmoid));
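// (layer sizes: 8 input variables -> 17 -> 12 -> 5 -> 3 hidden units -> 1 output for the bottom prediction)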
Set parameters…
my_net.set_learning(0.05);
my_net.set_regularizer(0.001);
my_net.set_batch_size(1);
my_net.check(); // checks the dimensions on the network
Then, we can train!
// The unsupervised training trains each layer separately
my_net.train_unsupervised(10); // 10 unsupervised epochs
my_net.set_learning(0.01); // new learning rate
my_net.train_supervised(10); // 10 supervised epochs
Finally, we can save the network, along with the variable types from the branches in the agile::root::tree_reader
.
my_net.to_yaml("btagger.yaml", reader.get_var_types());
How can we use the API to predict? We can use the agile::neural_net::predict_map()
function to read in a map of input variables and produce a map of the predicted values.
auto input_variables = my_net.get_inputs();
for (int jet = 0; jet < 10; ++jet)
{
// This pulls a map out.
auto predictions = my_net.predict_map(reader(jet, input_variables));
std::cout << "Prob[jet #" << jet << " == bottom] = ";
std::cout << predictions["bottom"] << std::endl;
}
Joining all of our example code together, we have the following program, which estimates a b-tagger.
#include "AGILEPack"
int main(int argc, char const *argv[])
{
// Loading Data
//----------------------------------------------------------------------------
// declare a tree_reader instance
agile::root::tree_reader reader;
// Load the file and TTree
reader.add_file("training.root", "Physics");
// Set all the branches. Notice the passing of types.
reader.set_branch("bottom", agile::root::integer);
reader.set_branch("nTracks", agile::root::integer);
reader.set_branch("nVTX", agile::root::integer);
reader.set_branch("ip3d_pb", agile::root::double_precision);
reader.set_branch("ip3d_pu", agile::root::double_precision);
reader.set_branch("ip3d_pc", agile::root::double_precision);
reader.set_branch("mass", agile::root::single_precision);
reader.set_branch("significance3d", agile::root::single_precision);
reader.set_branch("pt", agile::root::single_precision);
reader.set_branch("eta", agile::root::single_precision);
// ^~~~^ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~^
// branch name the numeric type
agile::dataframe D = reader.get_dataframe();
agile::root::tree_reader reader_2;
reader_2.add_file("training_2.root", "more_physics");
// Set all the same branches as before.
reader_2.set_branch("bottom", agile::root::integer);
reader_2.set_branch("nTracks", agile::root::integer);
reader_2.set_branch("nVTX", agile::root::integer);
reader_2.set_branch("ip3d_pb", agile::root::double_precision);
reader_2.set_branch("ip3d_pu", agile::root::double_precision);
reader_2.set_branch("ip3d_pc", agile::root::double_precision);
reader_2.set_branch("mass", agile::root::single_precision);
reader_2.set_branch("significance3d", agile::root::single_precision);
reader_2.set_branch("pt", agile::root::single_precision);
reader_2.set_branch("eta", agile::root::single_precision);
// we already have our `agile::dataframe` named `D` from before.
D.append(std::move(reader_2.get_dataframe(1000)));
// Estimating the neural network
//----------------------------------------------------------------------------
agile::neural_net my_net;
//move semantics work here
my_net.add_data(std::move(D));
// the neural net parses this and creates "sub" dataframes
my_net.model_formula("bottom ~ * -eta");
// "stacks" the layers (stacked autoencoders), uses C++11-style syntax
my_net.emplace_back(new autoencoder(8, 17, sigmoid, linear));
my_net.emplace_back(new autoencoder(17, 12, sigmoid, sigmoid));
my_net.emplace_back(new autoencoder(12, 5, sigmoid, sigmoid));
my_net.emplace_back(new autoencoder(5, 3, sigmoid, sigmoid));
my_net.emplace_back(new layer(3, 1, sigmoid));
// parameter setting
my_net.set_learning(0.05);
my_net.set_regularizer(0.001);
my_net.set_batch_size(1);
my_net.check(); // checks the dimensions on the network
// The unsupervised training trains each layer separately
my_net.train_unsupervised(10); // 10 unsupervised epochs
my_net.set_learning(0.01); // new learning rate
my_net.train_supervised(10); // 10 supervised epochs
my_net.to_yaml("btagger.yaml", reader.get_var_types());
// Prediction time
//----------------------------------------------------------------------------
auto input_variables = my_net.get_inputs();
for (int jet = 0; jet < 10; ++jet)
{
// This pulls a map out.
auto predictions = my_net.predict_map(reader(jet, input_variables));
std::cout << "Prob[jet #" << jet << " == bottom] = ";
std::cout << predictions["bottom"] << std::endl;
}
}
If we save this to a file called btagger.cxx
, we can compile and run everything using
g++ -o btagger btagger.cxx `/path/to/AGILEPack/agile-config build --root` && ./btagger
If you're interested in using the Python thin-client, add /path/to/AGILEPack/python
to $PYTHONPATH
, then look at the AGILEPy website.
If you, the user, have any questions, concerns, or suggestions, do not hesitate to contact me at luke.deoliveira@yale.edu. I have tried to test most things a user will try to do, and when possible, AGILEPack will throw a helpful, somewhat verbose exception referring to the class or action that isn't valid. However, an error-proof library isn't possible, so do not hesitate to contact me.