Algorithms for Generalized Inference, Learning, and Extraction Package

By Luke de Oliveira.

Overview

AGILEPack is a versatile C++ Deep Learning library designed with the needs of particle physicists in mind. Its purpose is twofold: first, to provide a set of accessible, state-of-the-art Deep Learning methods for physicists to use; second, to provide a wrapper around many of the hassles of dealing with the ROOT framework when doing numerical or machine learning studies.

AGILEPack provides a generic interface to design, build, estimate, save, load, and use Deep Networks for unsupervised learning, classification, and regression purposes. In particular (and due to the nature of data from particle collisions), AGILEPack provides a comprehensive implementation of an autoencoder framework to build such deep networks.

The AGILEPack repository with all code can be found at https://github.com/lukedeo/AGILEPack.

Deep Learning

At its most conceptual level, Deep Learning is a set of machine learning algorithms that aim to learn high-level, abstract representations of datasets. By forming layers of understanding about relationships in data, we can construct more accurate predictive models for quantities of interest. In the case of High Energy Physics, this means we can build classification models using low-level kinematic variables, without having to use the higher-level, more computationally intensive features that are commonly needed.

Specifically, AGILEPack is an implementation of a paradigm called Stacked Autoencoders. An autoencoder is a nonlinear dimensionality reduction technique that learns how to reconstruct an input distribution. When stacked, autoencoders create deep, hierarchical representations of data which can then be used to form very accurate predictive models. This process is divided into two steps.

  1. Determine a network structure in terms of autoencoders and perform so-called pre-training. Pre-training is a greedy process of learning the autoencoders sequentially, using each autoencoder's encoding as the input space for the next autoencoder.
  2. Use the weights from the encoders of the stacked structure to form the initial weights of a neural network. In this initial state, the network already has an understanding of the input distribution for the problem at hand. It can then be trained using the standard backpropagation algorithm; this is called the fine-tuning step. (A short sketch of this two-stage flow, using the API introduced later in this document, follows this list.)
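
As a preview, here is how these two stages map onto the AGILEPack calls demonstrated in the API tutorial later in this document. This is only a sketch; the data loading and model formula steps (also covered later) are omitted, so it is not a complete program:

agile::neural_net net;

// a stack of autoencoders, capped by a prediction layer
net.emplace_back(new autoencoder(8, 17, sigmoid, linear));
net.emplace_back(new autoencoder(17, 12, sigmoid, sigmoid));
net.emplace_back(new layer(12, 1, sigmoid));

net.train_unsupervised(10); // step 1: greedy, layer-wise pre-training
net.train_supervised(10);   // step 2: fine-tuning with backpropagation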

In what follows, we’ll embark on a technical explanation of autoencoders.

Autoencoders

An autoencoder is a transformed linear mapping that creates a “bottleneck” of dimensionality and information. Formally, consider a data vector \(x\in\mathbb{R}^{D}\). Suppose we want to find a representation \(r\in\mathbb{R}^{d}\), with \(d < D\). An autoencoder consists of an encoder, which is a pair \((W_{1},b_{1})\), with \(W_{1}\in\mathbb{R}^{d\times D},b_{1}\in\mathbb{R}^{d}\), and a decoder, which is a pair \((W_{2},b_{2})\), with \(W_{2}\in\mathbb{R}^{D\times d},b_{2}\in\mathbb{R}^{D}\). We also need two mappings \(f,g:\mathbb{R}\longrightarrow\mathbb{R}\) that can be applied element-wise to vectors and matrices. The encoder then generates an encoded, bottlenecked representation by the map

\[\varphi(x)=f(W_{1}x+b_{1}),\]

and the decoder reproduces the original vector by the map

\[\rho(x)=g(W_{2}\varphi(x)+b_{2}).\]

To ensure this new space represents the original data vector well, we would like to find parameters \(\theta=(W_{1},b_{1},W_{2},b_{2})\) such that \(x\approx\rho_{\theta}(x)\), where the subscript makes the dependence on the parameters explicit. Let \(X\in\mathbb{R}^{D\times n}\) be a data batch, where \(x_{i}\) is the \(i\)th point (column) of the dataset. We define the loss over this dataset to be

\[L(X;\theta)=\frac{1}{2n}\sum_{i=1}^{n}\Vert x_{i}-\rho_{\theta}(x_{i})\Vert_{2}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2},\]

where \(\lambda_{1}\) and \(\lambda_{2}\) are regularization parameters (usually around \(10^{-4}\)) that act as \(\ell_{2}\)-norm penalties guarding against overtraining on the presented data batch. We can formulate this as an optimization problem and solve
\[\min_{W_{1},W_{2},b_{1},b_{2}}\frac{1}{2n}\sum_{i=1}^{n}\Vert x_{i}-\rho(x_{i})\Vert_{2}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2}\]

using Stochastic Gradient Descent, or an online version of BFGS or L-BFGS.
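
For reference, a single stochastic gradient descent step on a data batch \(X\), with learning rate \(\alpha\) (the quantity set through set_learning in the API described below), updates the parameters as

\[\theta\leftarrow\theta-\alpha\,\frac{\partial L(X;\theta)}{\partial\theta},\qquad\theta=(W_{1},b_{1},W_{2},b_{2}).\]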

Gradients for autoencoders are rarely written out in efficient linear-algebraic terms. While working on the AGILEPack implementation, I derived the gradients with respect to a full data batch rather than a single vector, in simple linear algebra; the final expressions are shown here for reference.

In what follows, every \(\mathbf{1}\) is a vector of 1’s appropriately sized to make the multiplication defined. Suppose we have our data batch \(X\in\mathbb{R}^{D\times n}\). Its encoding given \(\theta\) is

\[\varphi(X)=f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})\]

with associated decoding

\[\rho(X)=g(W_{2}\varphi(X)+b_{2}\mathbf{1}^{T}).\]

Completely written out, our loss function takes the form

\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2}.\]

Taking derivatives,

\[\frac{\partial L(X;\theta)}{\partial W_{2}}=\frac{1}{n}\underbrace{(\rho(X)-X)\odot g'(W_{2}\varphi(X)+b_{2}\mathbf{1}^{T})}_{\delta_{dec}}\varphi(X)^{T}+2\lambda_{2}W_{2}.\]

Using this definition of \(\delta_{dec}\), we have that

\(\begin{eqnarray*} \frac{\partial L(X;\theta)}{\partial W_{2}} & = & \frac{1}{n}\delta_{dec}\varphi(X)^{T}+2\lambda_{2}W_{2}\\ \frac{\partial L(X;\theta)}{\partial b_{2}} & = & \frac{1}{n}\delta_{dec}\mathbf{1}^{T} \end{eqnarray*}\)

Now, define \(\delta_{enc}=(W_{2}^{T}\delta_{dec})\odot f'(W_{1}X+b_{1}{{\mathbf{1}}}^{T})\), which yields

\(\begin{eqnarray*} \frac{\partial L(X;\theta)}{\partial W_{1}} & = & \frac{1}{n}\delta_{enc}X^{T}+2\lambda_{1}W_{1}\\ \frac{\partial L(X;\theta)}{\partial b_{1}} & = & \frac{1}{n}\delta_{enc}\mathbf{1}^{T}. \end{eqnarray*}\)

Using this formulation of \(\frac{\partial L(X;\theta)}{\partial\theta}\), we can use any numerical optimization procedure to find our weights. Note two features of this formulation. First, no regularization is applied to the bias vectors \(b_{i}\); a bias vector has too few parameters to seriously overtrain a representation. Second, for any variant that does not use regularization, we simply set \(\lambda_{i}=0\).
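
As an illustration (a minimal sketch of the math above, not the AGILEPack internals), the batch gradients can be written almost verbatim in Eigen. The snippet assumes both \(f\) and \(g\) are logistic sigmoids, so that \(f'(z)=f(z)(1-f(z))\); the function and type names here are my own and are not part of the AGILEPack API.

#include <Eigen/Dense>

using Matrix = Eigen::MatrixXd;
using Vector = Eigen::VectorXd;

// element-wise logistic sigmoid
Matrix sigmoid(const Matrix& Z)
{
    return (1.0 / (1.0 + (-Z.array()).exp())).matrix();
}

struct gradients { Matrix dW1, dW2; Vector db1, db2; };

gradients autoencoder_gradients(const Matrix& X,
                                const Matrix& W1, const Vector& b1,
                                const Matrix& W2, const Vector& b2,
                                double lambda1, double lambda2)
{
    const double n = static_cast<double>(X.cols());

    // forward pass: phi(X) = f(W1 X + b1 1^T), rho(X) = g(W2 phi(X) + b2 1^T)
    Matrix Phi = sigmoid((W1 * X).colwise() + b1);
    Matrix Rho = sigmoid((W2 * Phi).colwise() + b2);

    // delta_dec = (rho(X) - X), elementwise product with g'(.) = g (1 - g)
    Matrix delta_dec =
        ((Rho - X).array() * (Rho.array() * (1.0 - Rho.array()))).matrix();

    // delta_enc = (W2^T delta_dec), elementwise product with f'(.) = f (1 - f)
    Matrix delta_enc =
        ((W2.transpose() * delta_dec).array()
         * (Phi.array() * (1.0 - Phi.array()))).matrix();

    gradients grad;
    grad.dW2 = delta_dec * Phi.transpose() / n + 2.0 * lambda2 * W2;
    grad.db2 = delta_dec.rowwise().sum() / n;
    grad.dW1 = delta_enc * X.transpose() / n + 2.0 * lambda1 * W1;
    grad.db1 = delta_enc.rowwise().sum() / n;
    return grad;
}

A plain gradient step then updates, for example, W2 as W2 -= alpha * grad.dW2, and similarly for the other parameters.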

Autoencoder Types

Standard Autoencoders

A standard autoencoder, defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2},\]

has no regularization terms and simply aims to build the best possible reconstruction. That is, in applications where we care about preserving as much of the intrinsic variation in the data as possible, an unmodified autoencoder is the natural choice. Note, however, that the absence of regularization makes this building block highly susceptible to overtraining. When the user has a large training set and is confident that the samples shown to the learner are in no way biased compared to the test sample or the true population, such a method can be appropriate, since it will attempt to create exact reconstructions of the data vectors.

Regularized Autoencoders

The regularized autoencoder is defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}X+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2}+\lambda_{1}\Vert W_{1}\Vert_{F}^{2}+\lambda_{2}\Vert W_{2}\Vert_{F}^{2},\]

where \(\lambda_{i}\) are chosen to penalize large weights in the autoencoder. This regularization (in essence, forcing weights toward zero) helps the autoencoder learn robust reconstructions. Overfit models tend to have parameters that are large in magnitude, blown up by statistical fluctuations that can be exploited when minimizing the loss function. The regularization terms make the reconstructions robust with respect to variations in the features present in the training set.

Denoising Autoencoders

The denoising autoencoder is defined by the loss function
\[L(X;\theta)=\frac{1}{2n}\Vert X-g(W_{2}f(W_{1}\tilde{X}+b_{1}{{\mathbf{1}}}^{T})+b_{2}\mathbf{1}^{T})\Vert_{F}^{2},\]

where \(\tilde{X}\sim P(\tilde{X}|X)\) is a stochastic, corrupting mapping. Common examples of corrupting mappings are additive Gaussian noise, \(\tilde{x}=x+\varepsilon\) with \(\varepsilon\sim\mathcal{N}(0,\sigma^{2}I)\), and masking noise, in which a random subset of the components of each input vector is set to zero.

For the sake of simplicity, we simply use normal perturbations. The corruption of the input vectors leads to robust representations of the input variables. That is, the reconstructed representations are robust to fluctuations, leading to a more robust understanding of the relationships between input variables.
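
As a sketch of the normal-perturbation corruption (again my own illustration, not the AGILEPack implementation), a corrupted copy \(\tilde{X}=X+\varepsilon\) with \(\varepsilon_{ij}\sim\mathcal{N}(0,\sigma^{2})\) can be produced as follows:

#include <Eigen/Dense>
#include <random>

// returns a corrupted copy of X with i.i.d. Gaussian noise of standard
// deviation sigma added to every entry
Eigen::MatrixXd corrupt(const Eigen::MatrixXd& X, double sigma, std::mt19937& rng)
{
    std::normal_distribution<double> noise(0.0, sigma);
    Eigen::MatrixXd X_tilde = X;
    for (Eigen::Index j = 0; j < X_tilde.cols(); ++j)
        for (Eigen::Index i = 0; i < X_tilde.rows(); ++i)
            X_tilde(i, j) += noise(rng);
    return X_tilde;
}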

AGILEPack

With theory aside, we can discuss practical usage of AGILEPack. We first need to discuss dependencies.

Dependencies

AGILEPack requires a C++11-capable compiler, the (header-only) Eigen linear algebra library, and, for the ROOT-related functionality, a working ROOT installation with the $ROOTSYS environment variable set.

Installation

The package can be downloaded using

git clone https://github.com/lukedeo/AGILEPack.git

and is relatively simple to build. As long as Eigen is installed such that

#include <Eigen/Dense>

won’t cause your compiler to produce errors, and the $ROOTSYS environment variable is set appropriately (a ROOT-specific requirement), you should be able to build the entire package with

make [-j]

This will build a static library called lib/libAGILEPack.a, which you can then link against to use all elements of the package. It’s recommended that you add /path/to/AGILEPack to your $PATH, but it’s certainly not necessary.

Package Features

AGILEPack is divided into seven distinct parts.

  1. The AGILE portion, which implements all of the algorithm-side details of autoencoders, all (un)-supervised training, and all matrix math underlying the network operations.
  2. The ROOT interface portion, which provides the agile::root::tree_reader class – a wrapper around reading flat trees from ROOT files.
  3. The dataframe portion, which is a convenient column-name-accessible dataframe that can be loaded from CSV, written to CSV, appended to, etc. In addition, the agile::root::tree_reader class can dump into an agile::dataframe for easy manipulation. The agile::dataframe can then be transformed into an agile::matrix (a typedef for Eigen::Matrix<numeric, Eigen::Dynamic, Eigen::Dynamic>).
  4. The YAML portion, which consists of my hacked (trimmed) versions of the Boost libraries and the yaml-cpp library. The YAML markup language is the specification used for all serialization of network objects within AGILEPack. By including this subset of Boost and yaml-cpp itself, we reduce the number of potential dependency mishaps.
  5. The high level API, which consists of the agile::neural_net class – the main class a basic user will interact with.
  6. The Command Line Interface section, which uses the cmd-parser library (written by me) to implement a simple CLI around using AGILEPack for HEP purposes.
  7. The python section, which implements a thin client to load the YAML files produced by AGILEPack and predict in Python using the numpy library. Its basic website is here.

Basic Usage

Let’s say you have a program called prog.cxx that uses AGILEPack. Provided that your file takes the form

#include "AGILEPack"
//            ^~~~~~ Includes all AGILEPack stuff!
#include <myotherheaders.h>

int main(int argc, char const *argv[])
{
	// Do stuff with AGILEPack!
}

You can compile this program with:

g++ -o prog prog.cxx `/path/to/AGILEPack/agile-config build --root`  
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
      fixes include paths and links to -lAGILEPack

which will produce an executable called prog.

The ordering of these arguments is very important on Linux (particularly on SLC{5,6})!

Let’s say you’re using AGILEPack as part of a larger project, and you need to produce a *.o file. Stealing the example from the last section, let’s say we want to produce prog.o. We can do this with

g++ -c `/path/to/AGILEPack/agile-config compile --root` prog.cxx -o prog.o

then link it to other files with

g++ -o MyExecutable prog.o [other *.o ...] `/path/to/AGILEPack/agile-config link --root`

Using the AGILEPack API (For High Energy Physics)

The API for AGILEPack was designed to be as versatile as possible, while still allowing a power user to tweak the package to their own desires. This tutorial will show a very simple example using a fake ROOT file (which doesn’t actually exist).

Suppose we want to make a b-tagger. Say we have a TFile called training.root over which we’d like to train a neural network. Let’s also say that a flat, numeric ntuple lives inside a TTree called Physics. (NOTE: the ntuple(s) must be flat).

Dealing with the data

For the sake of this tutorial, suppose that the ntuple has the following branches:

Branch Name        (C/C++) Numeric Type
bottom             int
nTracks            int
nVTX               int
ip3d_pb            double
ip3d_pu            double
ip3d_pc            double
mass               float
significance3d     float
pt                 float
eta                float

For the estimation of the posterior probability that a given jet is a bottom jet, we need the bottom branch to be 1 if a jet is a bottom jet and 0 otherwise. AGILEPack can read three different numeric types from a TTree: agile::root::integer, agile::root::single_precision, and agile::root::double_precision.

We will now walk through the steps of using the agile::root::tree_reader class to bypass the TTree branch address nonsense.

Now, let’s see how the AGILEPack API helps us load this TTree into a usable format.

agile::root::tree_reader reader;            // declare a tree_reader instance
reader.add_file("training.root", "Physics"); // Load the file and TTree

// Set all the branches. Notice the passing of types.
reader.set_branch("bottom", agile::root::integer);
reader.set_branch("nTracks", agile::root::integer);
reader.set_branch("nVTX", agile::root::integer);
reader.set_branch("ip3d_pb", agile::root::double_precision);
reader.set_branch("ip3d_pu", agile::root::double_precision);
reader.set_branch("ip3d_pc", agile::root::double_precision);
reader.set_branch("mass", agile::root::single_precision);
reader.set_branch("significance3d", agile::root::single_precision);
reader.set_branch("pt", agile::root::single_precision);
reader.set_branch("eta", agile::root::single_precision);
//                ^~~~^  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~^  
//            branch name       the numeric type

What happens now? Well, suppose we want to get the value of nTracks at entry 4. We can call:

double nTracks = reader(4, "nTracks");

Suppose we want the values of nTracks, nVTX, ip3d_pb, and ip3d_pu at that entry. We can call:

std::vector<std::string> vars( {"nTracks", "nVTX", "ip3d_pb", "ip3d_pu"} );
std::map<std::string, double> my_vars = reader(4, vars);

This is great, but doesn’t help us train a neural network. Enter the agile::dataframe. Neural network training needs access to data stored in a structure that provides random pattern access, so we use the agile::dataframe as a sort of temporary storage. Let’s dump some entries from the TTree stored in reader into a dataframe.

agile::dataframe D = reader.get_dataframe(1000);
//                            ~~~~~~~~~~~~~~^
//                        Dumps 1000 entries

agile::dataframe D = reader.get_dataframe(500, 12);
//                          ~~~~~~~~~~~~~~^    ^~~~~~~~
//                      Dumps 500 entries...   ...after skipping the first 12 entries

agile::dataframe D = reader.get_dataframe();
//                           ~~~~~~~~~~~~^
//                       Dump ALL the things
D.to_csv("data.csv"); // maybe we want to write it to CSV!

Now, suppose there’s another ROOT file called training_2.root with a TTree called more_physics that we also want in this dataframe.

agile::root::tree_reader reader_2;                   // declare a tree_reader instance
reader_2.add_file("training_2.root", "more_physics"); // Load the file and TTree

// Set all the same branches as before.

reader_2.set_branch("bottom", agile::root::integer);
reader_2.set_branch("nTracks", agile::root::integer);
reader_2.set_branch("nVTX", agile::root::integer);
reader_2.set_branch("ip3d_pb", agile::root::double_precision);
reader_2.set_branch("ip3d_pu", agile::root::double_precision);
reader_2.set_branch("ip3d_pc", agile::root::double_precision);
reader_2.set_branch("mass", agile::root::single_precision);
reader_2.set_branch("significance3d", agile::root::single_precision);
reader_2.set_branch("pt", agile::root::single_precision);
reader_2.set_branch("eta", agile::root::single_precision);

// we already have our `agile::dataframe` named `D` from before.
D.append(std::move(reader_2.get_dataframe(1000)));
//       ^~~~~~~~~^
//     Optional, but avoids copy

So, we can call append() to squash dataframes together (some checking of bounds and such happens in the background).

Syntax for formulae

We don’t know how to use the neural network training portion of the API yet, but it’s important to document the formula syntax first. Input and output variables and discriminants are specified using a model formula. This is a fancy way of saying that variable inclusion and exclusion are handled by a formula parser that constructs what the neural network sees. The following characters are used to specify a formula:

Character   Meaning
~           Means “is a function of”. This splits the left and right hand sides of what the neural network is estimating.
+           Means “include the next variable in the neural network”.
-           Means “don’t include the next variable in the neural network”.
*           Standard glob. Means “include everything not on the left hand side of the formula”.

In our case (with the branches mentioned in the previous section), the formula

"bottom ~ *"

would mean “predict bottom using all other variables as inputs”. The formula

"bottom ~ * - eta"

would mean “predict bottom using all other variables except eta as inputs”. The formula

"bottom + pt ~ ip3d_pb + ip3d_pu + ip3d_pc + mass"

is simply “predict bottom and pt as functions of everything on the right hand side”. Note that formulas aren’t space sensitive: the formulas “bott om ~ pt+eta + ip3d_pb” and “bottom ~ pt + eta + ip3d_pb” are exactly the same thing.

Training a neural network

Suppose we’ve decided that we wish to train a neural network using nTracks, nVTX, ip3d_pb, ip3d_pu, ip3d_pc, mass, significance3d, and pt to predict bottom, without using eta. We must then decide on the network structure to use for this prediction problem. For this example, we use autoencoders with 17, 12, 5, and 3 hidden units, followed by a final layer used for the prediction of bottom. We wish to use stacked autoencoders for the first layers, then a standard layer for the last (prediction) layer. We continue the example using the agile::dataframe D from before.

agile::neural_net my_net;

// move semantics work here
my_net.add_data(std::move(D));

// the neural net parses this and creates "sub" dataframes
my_net.model_formula("bottom ~ * -eta"); 

If more data needs to be added, say from an agile::dataframe named F, we can call

my_net.add_data(std::move(F));

We now create the network structure that we specified…

// "stacks" the layers (stacked autoencoders), uses C++11-style syntax
my_net.emplace_back(new autoencoder(8, 17, sigmoid, linear));
my_net.emplace_back(new autoencoder(17, 12, sigmoid, sigmoid)); 
my_net.emplace_back(new autoencoder(12, 5, sigmoid, sigmoid)); 
my_net.emplace_back(new autoencoder(5, 3, sigmoid, sigmoid)); 
my_net.emplace_back(new layer(3, 1, sigmoid)); 

Set parameters…

my_net.set_learning(0.05);
my_net.set_regularizer(0.001);
my_net.set_batch_size(1);
my_net.check(); // checks the dimensions on the network

Then, we can train!

// The unsupervised training trains each layer separately
my_net.train_unsupervised(10); // 10 unsupervised epochs

my_net.set_learning(0.01);     // new learning rate

my_net.train_supervised(10);   // 10 supervised epochs

Finally, we can save the network, along with the variable types from the branches in the agile::root::tree_reader.

my_net.to_yaml("btagger.yaml", reader.get_var_types());

How can we use the API to predict? We can use the agile::neural_net::predict_map() function to read in a map of input variables and produce a map of the predicted values.

auto input_variables = my_net.get_inputs();

for (int jet = 0; jet < 10; ++jet)
{
	// This pulls a map out.
	auto predictions = my_net.predict_map(reader(jet, input_variables));
	std::cout << "Prob[jet #" << jet << " == bottom] = ";
	std::cout << predictions["bottom"] << std::endl;
}

Putting it all together

Joining all of our example code together, we have the following program, which estimates a b-tagger.

#include "AGILEPack"

int main(int argc, char const *argv[])
{
// Loading Data
//----------------------------------------------------------------------------
    // declare a tree_reader instance
    agile::root::tree_reader reader;         

    // Load the file and TTree
    reader.add_file("training.root", "Physics");

    // Set all the branches. Notice the passing of types.
    reader.set_branch("bottom", agile::root::integer);
    reader.set_branch("nTracks", agile::root::integer);
    reader.set_branch("nVTX", agile::root::integer);
    reader.set_branch("ip3d_pb", agile::root::double_precision);
    reader.set_branch("ip3d_pu", agile::root::double_precision);
    reader.set_branch("ip3d_pc", agile::root::double_precision);
    reader.set_branch("mass", agile::root::single_precision);
    reader.set_branch("significance3d", agile::root::single_precision);
    reader.set_branch("pt", agile::root::single_precision);
    reader.set_branch("eta", agile::root::single_precision);
    //                ^~~~^  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~^  
    //            branch name       the numeric type


    agile::dataframe D = reader.get_dataframe();
    agile::root::tree_reader reader_2;                   
    reader_2.add_file("training_2.root", "more_physics");

    // Set all the same branches as before.
    reader_2.set_branch("bottom", agile::root::integer);
    reader_2.set_branch("nTracks", agile::root::integer);
    reader_2.set_branch("nVTX", agile::root::integer);
    reader_2.set_branch("ip3d_pb", agile::root::double_precision);
    reader_2.set_branch("ip3d_pu", agile::root::double_precision);
    reader_2.set_branch("ip3d_pc", agile::root::double_precision);
    reader_2.set_branch("mass", agile::root::single_precision);
    reader_2.set_branch("significance3d", agile::root::single_precision);
    reader_2.set_branch("pt", agile::root::single_precision);
    reader_2.set_branch("eta", agile::root::single_precision);

    // we already have our `agile::dataframe` named `D` from before.
    D.append(std::move(reader_2.get_dataframe(1000)));
    
// Estimating the neural network
//----------------------------------------------------------------------------
    agile::neural_net my_net;
    //move semantics work here
    my_net.add_data(std::move(D));

    // the neural net parses this and creates "sub" dataframes
    my_net.model_formula("bottom ~ * -eta"); 
    

    // "stacks" the layers (stacked autoencoders), uses C++11-style syntax
    my_net.emplace_back(new autoencoder(8, 17, sigmoid, linear));
    my_net.emplace_back(new autoencoder(17, 12, sigmoid, sigmoid)); 
    my_net.emplace_back(new autoencoder(12, 5, sigmoid, sigmoid)); 
    my_net.emplace_back(new autoencoder(5, 3, sigmoid, sigmoid)); 
    my_net.emplace_back(new layer(3, 1, sigmoid)); 

    // parameter setting
    my_net.set_learning(0.05);
    my_net.set_regularizer(0.001);
    my_net.set_batch_size(1);
    my_net.check(); // checks the dimensions on the network

    // The unsupervised training trains each layer separately
    my_net.train_unsupervised(10); // 10 unsupervised epochs

    my_net.set_learning(0.01);     // new learning rate

    my_net.train_supervised(10);   // 10 supervised epochs
    my_net.to_yaml("btagger.yaml", reader.get_var_types());

// Prediction time
//----------------------------------------------------------------------------
    auto input_variables = my_net.get_inputs();

    for (int jet = 0; jet < 10; ++jet)
    {
        // This pulls a map out.
        auto predictions = my_net.predict_map(reader(jet, input_variables));
        std::cout << "Prob[jet #" << jet << " == bottom] = ";
        std::cout << predictions["bottom"] << std::endl;
    }
}

If we save this to a file called btagger.cxx, we can compile and run everything using

g++ -o btagger btagger.cxx `/path/to/AGILEPack/agile-config build --root` && ./btagger

If you’re interested in using the Python thin-client, add /path/to/AGILEPack/python to $PYTHONPATH, then look at the AGILEPy website.

Going Forward

If you, the user, have any questions, concerns, or suggestions, do not hesitate to contact me at luke.deoliveira@yale.edu. I have tried to test most things a user will try to do, and, when possible, AGILEPack will throw a helpful, somewhat verbose exception referring to the class or action that wasn’t valid. However, an error-proof library isn’t possible, so do not hesitate to contact me.