GNN model
Our model "GNN" is a modification of the graph neural network architecture for Neutrino Telescope Event Reconstruction, which is described in detail in the paper Graph Neural Networks for Low-Energy Event Classification & Reconstruction in IceCube and available on github: GraphNet.
The overall structure of the architecture is as follows:

- \(x,y,z\) - coordinates;
- \(t\) - time;
- \(q\) - charge;
- \(aux\) - auxiliary;
- \(homophily(x,y,z,t)\) - similarity of features in graph nodes;
- \(mean(x,y,z,t,q,aux)\) - mean values of pulse features;
- \(pulses\) - number of pulses in an event;
Thus, for each event we get a graph in which each node corresponds to a pulse, and the node features are the pulse features combined with the aggregated features of the whole event.
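To make the feature layout concrete, here is a minimal sketch of how such node features could be assembled, assuming plain PyTorch tensors; the helper name and the use of torch_geometric.utils.homophily are our illustration, not the exact GraphNeT code.

```python
import torch
from torch_geometric.utils import homophily  # fraction of edges joining equal-valued nodes

def build_node_features(pulses: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """pulses: [n_pulses, 6] with columns (x, y, z, t, q, aux) for a single event."""
    n = pulses.size(0)
    feat_mean = pulses.mean(dim=0, keepdim=True).expand(n, -1)     # mean(x, y, z, t, q, aux)
    homo = torch.as_tensor(
        [float(homophily(edge_index, pulses[:, i])) for i in range(4)]
    ).unsqueeze(0).expand(n, -1)                                   # homophily(x, y, z, t)
    n_pulses = torch.full((n, 1), float(n))                        # number of pulses in the event
    return torch.cat([pulses, homo, feat_mean, n_pulses], dim=1)   # [n, 6 + 4 + 6 + 1]
```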
The resulting graph sequentially passes through layers, each of which modifies the feature space and also updates the graph topology. The outputs of all layers are concatenated and fed to the Layers MLP module, in which the number of features is reduced to 256. Then a pooling operation is performed, in which the features of all graph nodes are aggregated by the functions min, max and mean. Accordingly, we obtain 768 features for each event.
Since the purpose of the architecture is to predict the direction, in the final Direction MLP module the resulting embedding is converted into a 3D direction vector.
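A minimal sketch of the readout and regression head described above, for a single event; the 128-unit hidden layer follows the [128, 3] head mentioned later, and the module structure is illustrative rather than the exact GraphNeT code.

```python
import torch
import torch.nn as nn

def readout(node_feats: torch.Tensor) -> torch.Tensor:
    """node_feats: [n_pulses, 256] for one event -> [768] event embedding."""
    return torch.cat([
        node_feats.min(dim=0).values,
        node_feats.max(dim=0).values,
        node_feats.mean(dim=0),
    ])  # 3 * 256 = 768 features

direction_mlp = nn.Sequential(  # regression head: 768-d embedding -> 3D direction vector
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Linear(128, 3),
)
```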
Training and enhancing the model
Training the base model (1.018 → 1.012)
Since we were limited in computational resources, we decided to proceed with retraining the best public GNN model from the notebook GraphNeT Baseline Submission (special thanks to @rasmusrse), which gave an LB score of 1.018. Using the polars library, we accelerated batch preparation to 2 seconds. We retrained the model on all batches except the last one; 100k examples of batch 659 were used for validation. An epoch corresponded to one batch (200k events); the minibatch size was 400 events. So, in each epoch (a minimal sketch of this loop is shown after the list):
- the previous batch was unloaded from memory;
- new one loaded in 2 seconds;
- training was done for 500 (200k/400) steps, taking 1-2 minutes depending on the architecture and on which layers were frozen;
- validation was performed in 20 seconds;
- model weights were saved if the loss function or metric reached a new minimum;
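A minimal sketch of this per-epoch loop, assuming one parquet file per batch; `train_batch_ids`, `train_meta`, `make_minibatches`, `training_step`, `validate` and `val_events` are hypothetical helpers, not the actual competition code.

```python
import polars as pl
import torch

best_score = float("inf")
for batch_id in train_batch_ids:                                 # one epoch == one 200k-event batch
    pulses = pl.read_parquet(f"train/batch_{batch_id}.parquet")  # ~2 s with polars; replaces the previous batch in memory
    meta = train_meta.filter(pl.col("batch_id") == batch_id)

    for minibatch in make_minibatches(pulses, meta, size=400):   # ~500 steps per epoch
        training_step(model, minibatch)

    score = validate(model, val_events)                          # 100k events from batch 659
    if score < best_score:                                       # save on a new minimum of the metric
        best_score = score
        torch.save(model.state_dict(), "best_model.pth")
```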
Here, special thanks are due to Google Colab Pro, without which we would not have been able to train such an architecture within a reasonable timeframe.
As a result, after 956 epochs, the value of the metric dropped to 1.0127.
Adding another layer (1.012 → 1.007)
Having a trained network, we tried to add another EdgeConv layer to it. In order not to train from scratch, all layers in the new architecture were frozen except for the new one. The frozen layers were initialized with the weights of the model retrained at the previous stage, and training continued.
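A minimal sketch of the freeze-and-reload step, assuming the new architecture is already instantiated as `model`; the checkpoint path and the freeze-by-missing-keys idiom are illustrative.

```python
import torch

state = torch.load("previous_stage.pth", map_location="cpu")
missing, unexpected = model.load_state_dict(state, strict=False)  # the new EdgeConv layer is absent from the checkpoint

for name, param in model.named_parameters():
    param.requires_grad = name in missing                         # train only the parameters the checkpoint did not cover
```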

After 1077 epochs, we reached a metric of 1.007.
Increasing the number of neighbors (1.007 → 1.003)
In the original GraphNeT library, the number of neighbors used for building the graph is set to 8. We tried increasing this value to 16.

Thus, we retrained the model from the previous stage with the new number of neighbors.
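A minimal sketch of rebuilding the event graph with 16 neighbors via torch_cluster's knn_graph; the assumption that the k-NN search runs on the (x, y, z) coordinates mirrors the GraphNeT baseline, and `node_features` is the per-event pulse tensor from the earlier sketch.

```python
from torch_cluster import knn_graph

coords = node_features[:, :3]            # assumed: (x, y, z) columns
edge_index = knn_graph(coords, k=16)     # the baseline value was k=8
```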
After 1444 epochs, the metric reached a new low of 1.003.
Expanding Layers MLP (1.003 → 0.9964)
Since the number of layers had been increased, we thought it reasonable that the number of parameters of Layers MLP, which receives the concatenated outputs of all layers as input, should also be increased. The first layer of Layers MLP was widened from 336 to 2048 units. As before, all other layers were frozen, the weights of the previous-stage model were loaded, and training continued.
After 1150 epochs, the metric dropped to 0.9964.
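A minimal sketch of the widening step; the `layers_mlp` attribute name and the layer indices are placeholders, and the 256-d output matches the pooling stage described earlier.

```python
import torch.nn as nn

cat_dim = model.layers_mlp[0].in_features        # size of the concatenated EdgeConv outputs
model.layers_mlp[0] = nn.Linear(cat_dim, 2048)   # was Linear(cat_dim, 336)
model.layers_mlp[2] = nn.Linear(2048, 256)       # keep the 256-d output consumed by the pooling stage
```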
Replacing regression with classification (0.9964 → 0.9919)
Studying the best public solutions, we paid attention to the notebook Tensorflow LSTM Model Training TPU (thanks to @rsmits). From it we borrowed the idea of moving from a regression problem to a classification one. The azimuth angle was uniformly divided into 24 bins:
bin_num = 24
azimuth_edges = np.linspace(0, 2 * np.pi, bin_num + 1)
The zenith angle was also divided into 24 bins; here we worked with the cosine of the zenith, since it is the cosine that is uniformly distributed (as can be seen from the statistics of all training events):
zenith_edges = [0.0]
for bin_idx in range(1, bin_num):
    zenith_edges.append(np.arccos(np.cos(zenith_edges[-1]) - 2 / bin_num))
Accordingly, we obtained a total of 24x24 = 576 classes.
The last layers of the Direction MLP were changed from [128, 3] to [512, 576], and the loss function was changed to CrossEntropyLoss.
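A self-contained sketch of the resulting 576-class target used with CrossEntropyLoss, gathering the bin-edge code from above; the final zenith edge at pi, the helper name and the class ordering are our illustration.

```python
import numpy as np

bin_num = 24
azimuth_edges = np.linspace(0, 2 * np.pi, bin_num + 1)

zenith_edges = [0.0]
for _ in range(1, bin_num):
    zenith_edges.append(np.arccos(np.cos(zenith_edges[-1]) - 2 / bin_num))
zenith_edges.append(np.pi)                      # close the last bin at zenith = pi
zenith_edges = np.array(zenith_edges)

def direction_to_class(azimuth: float, zenith: float) -> int:
    """Map (azimuth, zenith) to one of the 24 * 24 = 576 classes."""
    az_bin = np.clip(np.digitize(azimuth, azimuth_edges) - 1, 0, bin_num - 1)
    zen_bin = np.clip(np.digitize(zenith, zenith_edges) - 1, 0, bin_num - 1)
    return int(az_bin * bin_num + zen_bin)
```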
We froze the entire model except for the last module.
After 967 epochs, the metric reached the value of 0.9919.
This was the best result that we achieved for a standalone GNN model, and it was then used for ensembling with other models.
What didn't help
During the competition we tried many other things that yielded little or no improvement of the metric. Some of the things we tried:
- separately predicting zenith and azimuth angles;
- changing the convolution type to SAGE or GAT;
- inserting a transformer after Layers MLP;
- replacing pooling with RNN (GRU);
- inserting a TopKPooling layer;
- using 4 or more features to build the event graph.
What has not been done
There are also several approaches that we thought over but did not pursue to the end, mainly because of the lack of time and hardware resources. To mention just a few:
- training the GNN from scratch, with scattering, absorption and DOM-embedding features;
- training a classifier with a larger number of bins;
- parallel training of transformer and GNN models.