๐Ÿงช Data Preprocessing for EveNet

This preprocessing pipeline prepares structured event-level datasets for EveNet. It merges input .h5 files, flattens structured arrays into 2D pyarrow tables, calculates normalization statistics, and saves outputs as .parquet and .json files ready for model training.

๐Ÿ“ Folder Structure

The input directory should follow this structure:

/global/cfs/cdirs/m2616/avencast/Event_Level_Analysis/data/
โ”œโ”€โ”€ Run_2.Dec20/
โ”‚   โ”œโ”€โ”€ run_1/
โ”‚   โ”‚   โ”œโ”€โ”€ process_1.h5
โ”‚   โ”‚   โ”œโ”€โ”€ process_2.h5
โ”‚   โ”œโ”€โ”€ run_2/
โ”‚   โ”‚   โ”œโ”€โ”€ ...
โ”œโ”€โ”€ Run_2.Dec21/
โ”‚   โ”œโ”€โ”€ run_1/
โ”‚   โ”‚   โ”œโ”€โ”€ ...
...

Each Run folder (e.g., Run_2.Dec20) contains sub-run folders (run_1, run_2, โ€ฆ), which hold the .h5 files representing different processes.

๐Ÿ›  What the Script Does

For every sub-run folder:
1. Merges all .h5 process files into a single structured data dictionary.
2. Converts structured data into a 2D pyarrow Table.
3. Saves the output as:
- data_.parquet (flattened table)
- shape_metadata.json (used for reconstruction)
4. Computes normalization statistics (mean, std) across all data.
5. Merges statistics from all sub-runs and saves a single .pt normalization file.

๐Ÿš€ Run on NERSC

  • Paths

    • Input root: /global/cfs/cdirs/m2616/avencast/Event_Level_Analysis/data
    • Pretrain Run folders: Run_2.Dec20, Run_2.Dec21, Run_2.Dec22
    • Output directory: better to say on $SCRATCH folder
  • Example Command

    pretrain_dir="/global/cfs/cdirs/m2616/avencast/Event_Level_Analysis/data"
    output_dir="$PSCRATCH/Event_Level_Analysis/Pretrain_Parquet/"
    
    python3 preprocessing/preprocess.py share/preprocess_pretrain.yaml \
        --c 25 \
        --pretrain_dirs \
            ${pretrain_dir}/Run_2.Dec20 \
            ${pretrain_dir}/Run_2.Dec21 \
            ${pretrain_dir}/Run_2.Dec22 \
            ${pretrain_dir}/Run_2.Pretrain.20250505 \
            ${pretrain_dir}/Run_2.Pretrain.20250507 \
            ${pretrain_dir}/Run_3.Pretrain.20250527.Run2Extra \
        --store_dir ${output_dir}/run.20250527.654M | tee ${output_dir}/run.20250527.654M.log
    
    shifter python3 preprocessing/preprocess.py share/preprocess_pretrain.yaml \
        --c 60 \
        --pretrain_dirs \
        ${pretrain_dir}/Combined_Balanced \
        --store_dir ${output_dir}/run.20250625.2700M
    

    This command will: - Process all sub-runs in the three Run folders - Generate flattened .parquet files and normalization .pt file - Save logs to run.20250403.log

  • Global Shuffle

      python3 NERSC/global_shuffle.py --first_shuffle_percent 1.0 --second_shuffle_percent 0.5 25 \
      /pscratch/sd/a/avencast/Event_Level_Analysis/Pretrain_Parquet/run.20250527.654M/
    

๐Ÿ“ค Output Files

Under the output directory (--store_dir), the script saves:

/run.20250403/
โ”œโ”€โ”€ data_<timestamp>.parquet       # Flattened input table
โ”œโ”€โ”€ shape_metadata.json            # Shape metadata for reconstruction
โ”œโ”€โ”€ regression_norm.pt             # Normalization statistics
โ””โ”€โ”€ run.20250403.log               # Optional log if tee is used

โš ๏ธ Notes

  • You must supply a valid preprocess_config.yaml (e.g., EveNet/share/preprocess_pretrain.yaml).
  • The script will skip any sub-run folder where no valid .h5 files are found or matching fails.
  • All normalization is computed globally across all sub-runs and merged at the end.