# 🧪 Data Preparation & Updated Input Reference
This page documents the new preprocessing contract for EveNet. The reference schema in
`share/event_info/pretrain.yaml` is bundled with the repository as an example, but you are free to adapt the feature
lists to match your campaign. The goal of this guide is to help you build the `.npz` files that the EveNet converter
ingests and to clarify how each tensor maps onto the model heads.
⚠️ **Ownership reminder:** Users are responsible for generating the `.npz` files. EveNet does not reorder or reshape features for you; the arrays must already follow the size and ordering implied by your event-info YAML. The converter simply validates the layout and writes the parquet outputs. Keep the YAML and the `.npz` dictionary in sync at all times.
- Config + CLI workflow
- Input tensor dictionary
- Supervision targets by head
- NPZ → Parquet conversion
- Runtime checklist
## 🛠️ Config + CLI Workflow
- **Start from an event-info YAML.** The repository ships an example at `share/event_info/pretrain.yaml`; copy or extend it to describe the objects, global variables, and heads you plan to enable. The display names inside the `INPUTS` block are just labels used for logging and plotting; what matters is the order, which must match the tensor layout you write to disk.
- **Produce an event dictionary.** For every event, assemble a Python dictionary that satisfies the shapes described below and append it to the archive you will write to disk. When you call `numpy.savez` (or `savez_compressed`), each key becomes an array with leading dimension `N`, the number of events in the file. Masks indicate which padded entries are valid and should contribute to the loss (see the sketch after this list).
- **Run the EveNet converter.** Point `preprocessing/preprocess.py` at your `.npz` bundle and pass the matching YAML so the loader can recover feature names, the number of sequential vectors, and the heads you are enabling. The converter assumes both artifacts describe the same structure; mismatches will surface as validation errors.
- **Train or evaluate.** Training configs reference the resulting parquet directory via `platform.data_parquet_dir` and reuse the same YAML in `options.Dataset.event_info`.
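As a schematic version of the second step, the sketch below assembles per-event dictionaries and stacks them so each key gains the leading `N` dimension. `build_event` is a hypothetical helper standing in for your own event builder; it is not part of EveNet:

```python
import numpy as np

num_events = 1000  # however many events you produced

# `build_event` is a hypothetical per-event builder returning a dict of arrays/scalars.
events = [build_event(i) for i in range(num_events)]

# Stack per-event values so every key gains the leading dimension N,
# then write the archive that the converter will ingest.
archive = {key: np.stack([ev[key] for ev in events]) for key in events[0]}
np.savez_compressed("events.npz", **archive)
```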
✨ **Normalization note.** The `normalize`, `log_normalize`, and `none` tags in the YAML are metadata only; EveNet derives channel-wise statistics during conversion. The sole special case is `normalize_uniform`, which reserves a transformation for circular variables (φ): the model automatically maps to and from the wrapped representation.
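You therefore store raw φ values; the wrapping happens inside EveNet. Purely as an illustration of what a wrapped representation means (this is not EveNet's internal code), the conventional `arctan2` trick folds any angle back into [-π, π]:

```python
import numpy as np

phi = np.array([3.5, -4.0, 0.7])                # raw angles, possibly outside [-pi, pi]
wrapped = np.arctan2(np.sin(phi), np.cos(phi))  # folded back into [-pi, pi]
```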
## 📦 Input Tensor Dictionary
Each event is described by the following feature tensors. Shapes are shown with a leading `N`, the number of
stored events in a given `.npz` file. Masks share the same leading dimension as the values they gate.
| Key | Shape | Description |
|---|---|---|
| `num_vectors` | `(N,)` | Total count of global + sequential objects per event. |
| `num_sequential_vectors` | `(N,)` | Number of valid sequential entries per event. Mirrors `num_vectors` behaviour. |
| `subprocess_id` | `(N,)` | Integer label identifying the subprocess, drawn from the YAML metadata. |
| `x` | `(N, 18, 7)` | Point-cloud tensor storing exactly 18 slots with 7 features each. These dimensions are fixed so datasets can leverage the released pretraining weights. Order the features exactly as your YAML lists them (energy, pT, η, φ, b-tag score, lepton flag, charge in the example). Use padding for missing particles and mask them out via `x_mask`. |
| `x_mask` | `(N, 18)` | Boolean (or 0/1) mask indicating which particle slots in `x` correspond to real objects. Only entries with mask 1 contribute to losses and metrics. |
| `conditions` | `(N, C)` | Event-level scalars. `C` is the number of global variables you define (10 in the example). You may add or remove variables as long as the order matches your YAML; if you do not supply any conditions, drop the key entirely. |
| `conditions_mask` | `(N, 1)` | Mask for `conditions`. Set to 1 when the global features are present. If you omit `conditions`, omit this mask as well. |
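To make the padding convention concrete, here is a minimal sketch (not converter code) that places one event's variable-length particle list into the fixed 18-slot layout; `particles` is a hypothetical `(n, 7)` array:

```python
import numpy as np

def pad_point_cloud(particles: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pad a hypothetical (n, 7) particle array into the fixed (18, 7) layout."""
    x = np.zeros((18, 7), dtype=np.float32)
    mask = np.zeros(18, dtype=bool)
    n = min(len(particles), 18)
    x[:n] = particles[:n]  # feature order must match the YAML
    mask[:n] = True        # only real objects contribute to losses and metrics
    return x, mask
```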
## 🎯 Supervision Targets by Head
Only provide the tensors required for the heads you enable in your training YAML. Omit unused targets or set them to empty arrays so the converter skips unnecessary storage.
### Classification Head
| Key | Shape | Meaning |
|---|---|---|
| `classification` | `(N,)` | Process label per event. Combine with `event_weight` for weighted cross-entropy when sampling imbalanced campaigns. |
| `event_weight` | `(N,)` | Optional per-event weight; defaults to 1 if omitted. Populate it alongside `classification` so the converter can broadcast the weights into the parquet shards. |
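EveNet does not prescribe how weights are chosen. As one common convention (an assumption, not part of the converter), inverse-frequency weights can balance an imbalanced campaign:

```python
import numpy as np

labels = np.random.randint(0, 3, size=1000)          # hypothetical process labels
counts = np.bincount(labels)                         # assumes every class appears at least once
class_weight = len(labels) / (len(counts) * counts)  # inverse-frequency weighting
event_weight = class_weight[labels].astype(np.float32)
```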
### TruthGeneration Head
| Key | Shape | Meaning |
|---|---|---|
| `x_invisible` | `(N, N_nu, F_nu)` | Invisible-particle (e.g., neutrino) features. `N_nu` is the maximum number of invisible objects you intend to pad to (2 in the example) and `F_nu` is the number of features per invisible object. Feature order is defined in your YAML under the TruthGeneration block. |
| `x_invisible_mask` | `(N, N_nu)` | Flags which invisible entries are valid. |
| `num_invisible_raw` | `(N,)` | Count of all invisible objects before quality cuts. |
| `num_invisible_valid` | `(N,)` | Number of invisible objects associated with reconstructed parents. |
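The invisible block follows the same padding convention as `x`. A short sketch with the example `N_nu = 2` and a hypothetical `F_nu = 4` (feature values invented for illustration):

```python
import numpy as np

N, N_nu, F_nu = 1000, 2, 4  # F_nu is hypothetical; take yours from the TruthGeneration block

x_invisible = np.zeros((N, N_nu, F_nu), dtype=np.float32)
x_invisible_mask = np.zeros((N, N_nu), dtype=bool)

# Hypothetical event 0 with one valid neutrino in slot 0.
x_invisible[0, 0] = [25.0, 1.2, -0.3, 0.0]
x_invisible_mask[0, 0] = True
```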
### ReconGeneration Head
ReconGeneration is self-supervised: it perturbs the visible point-cloud channels and learns to denoise them. The target specification (which channels to regenerate) lives directly in the YAML under the ReconGeneration configuration. No additional tensors beyond the standard inputs are required.
### Resonance Assignment Head
| Key | Shape | Meaning |
|---|---|---|
| `assignments-indices` | `(N, R, D)` | Resonance-to-child mapping. `R` is the number of resonances you describe in the YAML (56 in the example) and `D` is the maximum number of children a resonance may have (3 in the example). Each entry stores indices drawn from the sequential inputs. |
| `assignments-mask` | `(N, R)` | Indicates whether all children for a given resonance were reconstructed. |
| `assignments-indices-mask` | `(N, R, D)` | Per-child validity flags. Use 0 to pad absent daughters while keeping other children active. |
📐 **Assignment internals:** During conversion EveNet scans your assignment map to determine `R` and `D`, initialises arrays filled with `-1`, and then writes the actual child indices along with boolean masks. The snippet below mirrors the loader logic so you can generate matching tensors in your own pipeline:
```python
import numpy as np

num_events, n_targets, max_daughters = 1000, 56, 3  # N, R, D; sizes from the example YAML

full_indices = np.full((num_events, n_targets, max_daughters), -1, dtype=int)
full_mask = np.zeros((num_events, n_targets), dtype=bool)
index_mask = np.zeros((num_events, n_targets, max_daughters), dtype=bool)
# Fill with your resonance→daughter mappings; mark valid entries in the masks.
```
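Continuing the snippet, a hypothetical fill for event 0 in which resonance 0 has two reconstructed daughters at sequential indices 2 and 5 (indices invented for illustration):

```python
full_indices[0, 0, :2] = [2, 5]  # daughters point into the sequential inputs
index_mask[0, 0, :2] = True      # two valid children; the third slot stays padded at -1
full_mask[0, 0] = True           # set only when every daughter of the resonance was found
```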
### Segmentation Head
| Key | Shape | Meaning |
|---|---|---|
| `segmentation-class` | `(N, S, T)` | One-hot daughter class per slot. `S` is the maximum number of segments per event (4 in the example) and `T` is the number of resonance tags (9 in the example). |
| `segmentation-data` | `(N, S, 18)` | Assignment of each daughter slot to one of the 18 input particles used in the point cloud. |
| `segmentation-momentum` | `(N, S, 4)` | Ground-truth four-momenta for the segmented daughters. |
| `segmentation-full-class` | `(N, S, T)` | Boolean indicator: 1 if all daughters of the resonance are reconstructed. |
As with assignments, the converter computes `S` and `T` by scanning your segment-tag map. The internal helper creates
zero-initialised arrays sized `(num_events, max_event_segments, max_segment_tags)` and fills them according to the
resonance definitions. Any slots you leave unused remain associated with the catch-all background class (0).
If you disable the segmentation head, you can skip all four tensors.
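A schematic fill under the example sizes (`S = 4`, `T = 9`), with an invented event whose first segment carries tag 2 and groups input particles 0 and 3; store the arrays under the four `segmentation-*` keys listed above:

```python
import numpy as np

num_events, S, T = 1000, 4, 9  # sizes from the example YAML

seg_class = np.zeros((num_events, S, T), dtype=np.float32)     # "segmentation-class"
seg_data = np.zeros((num_events, S, 18), dtype=bool)           # "segmentation-data"
seg_momentum = np.zeros((num_events, S, 4), dtype=np.float32)  # "segmentation-momentum"
seg_full = np.zeros((num_events, S, T), dtype=bool)            # "segmentation-full-class"

seg_class[0, 0, 2] = 1.0       # invented: first segment of event 0 carries tag 2...
seg_data[0, 0, [0, 3]] = True  # ...and groups input particles 0 and 3
seg_full[0, 0, 2] = True       # mark if every daughter was reconstructed
# Untouched slots stay with the catch-all background class (0), as the converter expects.
```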
### Worked Input Example
```python
import numpy as np

# Example sizes; everything except the fixed (18, 7) point cloud is adjustable.
N = 1000      # events in this file
C = 10        # global condition scalars (example YAML)
R, D = 56, 3  # resonances and max daughters (example YAML)
S = 4         # max segments per event (example YAML)

example = {
    "x": np.zeros((N, 18, 7), dtype=np.float32),  # fixed to (18, 7) for pretraining compatibility
    "x_mask": np.zeros((N, 18), dtype=bool),
    # Optional globals; drop both keys if unused
    "conditions": np.zeros((N, C), dtype=np.float32),
    "conditions_mask": np.ones((N, 1), dtype=bool),
    # Classification head (weights default to ones if omitted)
    "classification": np.zeros((N,), dtype=np.int32),
    "event_weight": np.ones((N,), dtype=np.float32),
    # Head-specific entries sized by your resonance/segment definitions
    "assignments-indices": np.full((N, R, D), -1, dtype=int),
    "assignments-mask": np.zeros((N, R), dtype=bool),
    "segmentation-data": np.zeros((N, S, 18), dtype=bool),
    # ... add heads you enabled ...
}
```
Feel free to adjust the head-specific dimensions (`R`, `D`, `S`, `T`) and the number of condition scalars `C` to match
your physics process. The only fixed sizes are the point-cloud slots `(18, 7)` shared across datasets. Keep the YAML and
the `.npz` dictionary in sync so the converter knows how many channels to expect and how to name them.
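Once the dictionary is complete, write it to disk so the converter can pick it up (the file name is a placeholder):

```python
np.savez_compressed("events.npz", **example)
```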
## 🔄 NPZ → Parquet Conversion
- Assemble events into Python lists and save them with `numpy.savez` (or `savez_compressed`). Each key listed above becomes an array inside the archive.
- Invoke the converter:
```bash
python preprocessing/preprocess.py \
  share/event_info/pretrain.yaml \
  --in_npz /path/to/events.npz \
  --store_dir /path/to/output \
  --cpu_max 32
```
The converter reads the YAML to recover feature names, masks, and head activation flags, then emits:
- `data_*.parquet` containing flattened tensors.
- `shape_metadata.json` with the original shapes (e.g., `(18, 7)` for `x`).
- `normalization.pt` with channel-wise statistics and class weights.
- Inspect the logs. The script reports how many particles, invisible objects, and resonances were valid across the dataset—helpful when debugging mask alignment.
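For a quick sanity check after conversion, you can compare the recorded shapes against what you wrote. This assumes `shape_metadata.json` is a plain key-to-shape mapping; verify the exact layout against your own converter output:

```python
import json

with open("/path/to/output/shape_metadata.json") as f:
    shapes = json.load(f)

print(shapes.get("x"))  # expect something equivalent to (18, 7)
```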
## ✅ Runtime Checklist
- `platform.data_parquet_dir` points to the folder with the generated parquet shards and `shape_metadata.json`.
- `options.Dataset.event_info` references the same YAML (`share/event_info/pretrain.yaml` or your copy).
- `options.Dataset.normalization_file` matches the `normalization.pt` produced during conversion.
- Only the heads you activated in the training YAML have matching supervision tensors in the parquet files.
With those pieces in place, EveNet will rebuild the full event dictionary on the fly, apply the appropriate circular
normalization for `normalize_uniform` channels, and route each tensor to the corresponding head.