Research Goal

The IceCube Neutrino Observatory is a one cubic kilometer neutrino detector located in the primordial ice at the South Pole. Its main goal is the detection of astrophysical neutrinos (neutrinos originating from extraterrestrial sources) and the identification of their sources. Because neutrinos interact extremely rarely, it is only possible to detect the tracks left by the particles (electrons, muons, and τ-leptons) that are born when neutrinos interact within the detector volume. The event type is determined by the incoming neutrino's flavor and energy, and there are three characteristic topologies: track (muon neutrinos), cascade (electron neutrinos), and double cascade (tau neutrinos). Due to the kinematics of the reaction, the product particles travel almost collinearly with the original neutrino and at a speed close to the speed of light c. Since the speed of light in ice is lower than c, these particles emit Cherenkov radiation, which is registered by the detector.

Candidate neutrino events are selected and reconstructed on-site, and since they arrive at a rate of several kHz, the reconstruction algorithms should be fast enough to handle the event flow. While several sophisticated and powerful reconstruction methods exist, they are quite computationally intensive and take minutes to reconstruct a single event, which makes them impossible to use in the field and impractical even for off-site reconstruction if one wishes to analyze all the available data rather than a selected subset.

So, we are looking for a fast and accurate reconstruction method that can handle raw data and has a short, predictable inference time.

Dataset Description

The detector is located at depths between 1450 m and 2450 m. The main advantage of the South Pole location is the extreme transparency of the ice in the detector volume, which increases the free propagation distance of the Cherenkov photons. The optical properties of the ice are not completely uniform: the transparency varies with depth, and there is a layer of volcanic dust at depths of 1950 m to 2100 m, which results in increased scattering and absorption coefficients.

The detector consists of 5160 digital optical modules (DOMs) installed on vertical strings. The origin of the IceCube coordinate system is at the detector center, and the z-axis points upwards (along the Earth's axis). Every string holds 60 DOMs, and there are 86 strings; 78 strings form an approximately hexagonal grid and extend through the whole height of the detector. The remaining 8 strings form a more densely instrumented part of the detector (DeepCore). The DeepCore DOMs are grouped in two parts: on every string, 10 of them are above the dust layer (often used for vetoing atmospheric events) and 50 are in the lower half of the cube, forming a high-sensitivity part of the detector able to detect lower-energy events with better accuracy. DeepCore has roughly half the typical spacing between DOMs, as well as more efficient sensors (by about 35%).

The dataset contains simulated neutrino events for a range of energies above 100 GeV. For each event a number of pulses are recorded, each corresponding to a sensor firing. Each pulse is characterized by several parameters: time (relative to the event start), sensor ID (that is, its coordinates x, y, z in the detector coordinate system), charge (the number of photons registered), and an additional binary parameter (aux) reflecting the pulse's reliability.
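For concreteness, here is a minimal sketch of assembling the per-pulse features with polars; the file and column names below are assumptions about the competition data layout, not a verbatim copy of our pipeline:

```python
import polars as pl

# Assumed layout: one parquet file per batch of events and a CSV with the
# (x, y, z) position of every sensor_id.
geometry = pl.read_csv("sensor_geometry.csv")       # sensor_id, x, y, z
batch = pl.read_parquet("train/batch_1.parquet")    # event_id, sensor_id, time, charge, auxiliary

# Attach coordinates to every pulse so that each row carries the six
# per-pulse features: x, y, z, time, charge, auxiliary.
pulses = batch.join(geometry, on="sensor_id", how="left")

# Pull out all pulses of a single event.
event = pulses.filter(pl.col("event_id") == pulses["event_id"][0])
features = event.select(["x", "y", "z", "time", "charge", "auxiliary"])
```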

The target values to be predicted are the zenith and azimuth angles of the direction from which the neutrino came (that is, the θ and φ angles of spherical coordinates). The metric is the mean angular error between the predicted and true event directions, which we will denote as Δψ.
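For reference, the metric can be computed by converting both (azimuth, zenith) pairs to unit vectors and taking the arccosine of their dot product; a minimal sketch (not the official scoring code):

```python
import numpy as np

def mean_angular_error(az_true, zen_true, az_pred, zen_pred):
    """Mean angular distance (radians) between true and predicted directions."""
    def to_vec(az, zen):
        # Unit vector from zenith/azimuth; zenith is measured from the +z axis.
        return np.stack([np.sin(zen) * np.cos(az),
                         np.sin(zen) * np.sin(az),
                         np.cos(zen)], axis=-1)
    v_true, v_pred = to_vec(az_true, zen_true), to_vec(az_pred, zen_pred)
    cos = np.clip(np.sum(v_true * v_pred, axis=-1), -1.0, 1.0)
    return np.mean(np.arccos(cos))
```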

Thanks

First of all, we are very happy to be a part of such an extraordinary and interesting competition. Many thanks to the organizers and to the competitors who selflessly shared their knowledge during this competition: @rasmusrse, @rsmits, @solverworld, @edguy99, @iafoss, @pellerphys …, if I've missed someone, please let me know. Many thanks to my great team QuData @synset, @alexz0 and @semenb, you are awesome!

General architecture

The solution is described in detail on our website https://qudata.com/ and has the following structure:

"GNN" is a modification of the graph neural network architecture for Neutrino Telescope Event Reconstruction GraphNet (public score: 0.992)

"Transformer" is a combination of architectures including fully connected blocks, an attention mechanism, and a recurrent layer (public score: 0.995)

"Ensemble" is a neural network that agregates outputs from "GNN" and "Transformer" models and predict direction. (public score: 0.976)

GNN

Model

During the competition, we made many changes to the architecture of the graph network (described in detail at https://qudata.com/); its final state is as follows:

Input

The model is fed a tensor of shape (B, T, F), where B is the sample index in a minibatch and T is the pulse index in the sequence. Each pulse is characterized by six (F = 6) features: the sensor coordinates x, y, z, the pulse time, the charge, and the aux flag. First, for each pulse we find its NEIGHBORS=8 nearest neighbors, using only the coordinates as the distance metric. This gives us a graph of spatial connections between the pulses. Then, having the graph and the pulse features, we form aggregated features for each event:

Thus, for each event, we get a graph in which each node corresponds to a pulse, and its features are the features of the pulse combined with the aggregated features of the event.
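A minimal sketch of the graph-building step, assuming the pulses of one event are given as a (T, 6) tensor with the coordinates in the first three columns (GraphNeT itself builds its graphs with torch_geometric utilities; the plain-torch version below is only for illustration):

```python
import torch

def knn_edges(pulse_features: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Build a kNN edge index from pulse coordinates.

    pulse_features: (T, 6) tensor, first three columns are x, y, z.
    Returns a (2, T * k) edge index (source, target); assumes T > k.
    """
    xyz = pulse_features[:, :3]
    dist = torch.cdist(xyz, xyz)                  # (T, T) pairwise distances
    dist.fill_diagonal_(float("inf"))             # exclude self-loops
    knn = dist.topk(k, largest=False).indices     # (T, k) nearest neighbours
    target = torch.arange(xyz.size(0)).repeat_interleave(k)
    source = knn.reshape(-1)
    return torch.stack([source, target], dim=0)
```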

The resulting graph sequentially passes through several layers, each of which modifies the feature space and also updates the graph topology. The outputs of all layers are concatenated and fed to the Layers MLP block, in which the number of features is reduced to 256. Then a pooling operation is performed, in which the features of all nodes of the graph are aggregated by the functions min, max, and mean. Accordingly, we obtain 768 features for each event.

Since the purpose of the architecture is to predict the direction, at the last node, Direction MLP, the resulting embedding is converted into a 3D direction vector.
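A schematic of this readout path (the hidden widths are illustrative assumptions; only the 256-feature reduction and the resulting 768-dimensional event embedding follow the description above):

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """Sketch: concatenated per-node layer outputs -> 256 features,
    min/max/mean pooling over nodes -> 768 event features -> 3D direction."""
    def __init__(self, concat_dim: int):
        super().__init__()
        self.layers_mlp = nn.Sequential(nn.Linear(concat_dim, 256), nn.ReLU())
        self.direction_mlp = nn.Sequential(
            nn.Linear(3 * 256, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (T, concat_dim) concatenated outputs of all graph layers
        h = self.layers_mlp(node_feats)                    # (T, 256)
        pooled = torch.cat([h.min(dim=0).values,           # (768,)
                            h.max(dim=0).values,
                            h.mean(dim=0)], dim=-1)
        return self.direction_mlp(pooled)                  # 3D direction vector
```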

Our progress consisted of the following stages:


Training and enhancing the model

Training base model (1.018 → 1.012)

Since we were limited in computational resources, we decided to proceed by retraining the best public GNN model from the notebook GraphNeT Baseline Submission (special thanks to @rasmusrse), which gave an LB score of 1.018. Using the polars library we reduced the batch preparation time to 2 seconds. We retrained the model on all batches except the last one; 100k examples of batch 659 were used for validation. An epoch corresponded to one batch (200k events), with a minibatch size of 400 events, i.e. 500 minibatches per epoch. During training, the learning rate was reduced in steps, at those moments when the validation metric had not improved for about 100 epochs.
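This step-wise schedule can be expressed, for example, with PyTorch's ReduceLROnPlateau; the sketch below is illustrative (the optimizer, factor, and stand-in model are assumptions), not our exact training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(6, 3)                                    # stand-in for the actual GNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=100        # factor is illustrative
)

for epoch in range(1000):                 # one "epoch" here = one 200k-event batch
    val_metric = 1.0                      # placeholder: mean angular error on the validation split
    scheduler.step(val_metric)            # reduces the LR when validation stops improving
```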

Here, special thanks are due to Google Colab Pro, without which we would not have been able to train such an architecture within a reasonable timeframe.

As a result, after 956 epochs, the value of the metric dropped to 1.0127.


Adding another layer (1.012 → 1.007)

Having a trained network, we tried to add another EdgeConv layer to it. In order not to train from scratch, all layers in the new architecture were frozen except for the new one.

Weights from the model retrained at the previous stage were loaded into the frozen layers, and training continued.

In this mode the model learns quickly, and the metric soon reaches the level of the model from which the weights were borrowed. For comparison, while our way from 1.018 → 1.012 took more than a week, training the amended model took only about a day. Then the layers were unfrozen, the learning rate was reduced, and the entire model was retrained in this mode.
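A sketch of the warm-start-and-freeze mechanics in PyTorch (the module structure, checkpoint path, and layer name below are hypothetical placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the enlarged GNN: an old block plus a newly added layer.
new_model = nn.ModuleDict({"old_block": nn.Linear(16, 16), "new_layer": nn.Linear(16, 16)})

# Load the previous-stage weights; strict=False lets the new layer keep its
# random initialisation since it has no counterpart in the old checkpoint.
state = torch.load("previous_stage.pt", map_location="cpu")   # hypothetical path
new_model.load_state_dict(state, strict=False)

# Freeze everything except the new layer and train until the metric catches up.
for name, param in new_model.named_parameters():
    param.requires_grad = name.startswith("new_layer")

# ...then unfreeze the whole model and continue training with a reduced learning rate.
for param in new_model.parameters():
    param.requires_grad = True
```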

After 1077 epochs, we reached a metric of 1.007.


Increasing the number of neighbors (1.007 → 1.003)

In the original GraphNeT library, the number of neighbors used to build the graph is set to 8. We tried increasing this value to 16. In the EdgeConv module, the same MLP is applied to every neighbor and the results are summed, so increasing the number of neighbors does not change the number of network parameters, but the model trains and runs about twice as slowly.
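The parameter count does not depend on the number of neighbours because EdgeConv applies a single shared MLP to every (node, neighbour) pair and then aggregates; a stripped-down sketch (not the GraphNeT implementation):

```python
import torch
import torch.nn as nn

class SimpleEdgeConv(nn.Module):
    """Stripped-down EdgeConv: one shared MLP over all neighbour pairs, then a sum.
    Changing k changes the compute, not the number of parameters."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
        # x: (T, in_dim) node features; knn_idx: (T, k) neighbour indices
        neighbours = x[knn_idx]                            # (T, k, in_dim)
        centre = x.unsqueeze(1).expand_as(neighbours)      # (T, k, in_dim)
        edge_feats = torch.cat([centre, neighbours - centre], dim=-1)
        return self.mlp(edge_feats).sum(dim=1)             # aggregate over the k neighbours
```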

Thus, we retrained the model from the previous stage with the new number of neighbors.

After 1444 epochs, the metric reached a new low of 1.003.


Expanding Layers MLP (1.003 → 0.9964)

Since the number of layers had been increased, we thought it reasonable that the capacity of Layers MLP, which receives the concatenated outputs of all levels, should also be increased. The first layer of the Layers MLP module was widened from 336 to 2048. As before, all levels were frozen, the weights of the previous-stage model were loaded, and training continued.
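Assuming Layers MLP is a two-layer perceptron over the concatenated layer outputs (the concatenated width below is illustrative; only 336, 2048, and the 256-feature output follow the text), the change amounts to something like:

```python
import torch.nn as nn

concat_dim = 1024   # illustrative: width of the concatenated outputs of all graph layers

# Before: first layer of width 336; after: widened to 2048, still reducing to 256 features.
layers_mlp_old = nn.Sequential(nn.Linear(concat_dim, 336), nn.ReLU(),
                               nn.Linear(336, 256), nn.ReLU())
layers_mlp_new = nn.Sequential(nn.Linear(concat_dim, 2048), nn.ReLU(),
                               nn.Linear(2048, 256), nn.ReLU())
```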

After 1150 epochs, the metric dropped to 0.9964.


Replacing regression with classification (0.9964 → 0.9919)

Studying the best solutions, we paid attention to the notebook Tensorflow LSTM Model Training TPU (thanks to @rsmits). From it we borrowed the idea of moving from a regression problem to a classification one.

The azimuth angle was uniformly divided into 24 bins.

The zenith angle was also divided into 24 bins; here we worked with the cosine of the zenith angle, since it is the cosine that is uniformly distributed (as can be seen from the statistics of all training events).

Accordingly, we obtained a total of 24×24 = 576 classes.

The last layer of the MLP was increased from [128, 3] to [512, 576], and the loss function was changed to CrossEntropyLoss.
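A sketch of the target encoding and of the reverse mapping from a predicted class back to a direction via the bin centre (the 24×24 binning follows the description above; the exact decoding we used may differ):

```python
import numpy as np

N_BINS = 24

def encode(azimuth: np.ndarray, zenith: np.ndarray) -> np.ndarray:
    """Map (azimuth, zenith) to one of 24 * 24 = 576 classes."""
    az_bin = np.clip((azimuth / (2 * np.pi) * N_BINS).astype(int), 0, N_BINS - 1)
    # Bin cos(zenith), which is approximately uniformly distributed.
    cz_bin = np.clip(((np.cos(zenith) + 1) / 2 * N_BINS).astype(int), 0, N_BINS - 1)
    return az_bin * N_BINS + cz_bin

def decode(class_id: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map a class index back to the centre of its (azimuth, cos zenith) bin."""
    az_bin, cz_bin = class_id // N_BINS, class_id % N_BINS
    azimuth = (az_bin + 0.5) / N_BINS * 2 * np.pi
    zenith = np.arccos((cz_bin + 0.5) / N_BINS * 2 - 1)
    return azimuth, zenith
```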

We froze the entire model except for the last module, loaded the weights obtained at the previous stage, and continued training.

After 967 epochs, the metric reached the value of 0.9919.

This was the best result that we achieved for a standalone GNN model, and it was then used for ensembling with other models.


What didn't help

During the competition we tried many other things that yielded little or no improvement of the metric. Some of the things we did:


What has not been done

There are also several approaches which we considered but did not pursue to the end, mainly because of a lack of time and hardware resources; to mention just a few:

Transformer

Ensemble


References