Step #1
Preprocessing and Feature Engineering
Last update: August 14, 2025
AI Assistance: Claude.AI (Anthropic) is used for documentation, code restructuring, and performance optimization.
Copyright (C) 2025 Sukanta Basu
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Overall Strategy
Step 1: Preprocess and engineer new features.
Step 2: Use AutoGluon to generate out-of-fold (OOF) predictions for each target separately. These predictions serve as additional input features in steps 3 and 4.
Step 3: Train the RealMLP model on the processed inputs (step 1) plus the ten AutoGluon OOF predictions (step 2). The additional features are intended to help the model capture correlations among the targets.
Step 4: Same as step 3, but with the TabPFN (v2) model.
Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4).
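The out-of-fold idea in step 2 can be illustrated with a minimal sketch. This is not the AutoGluon workflow: an ordinary least-squares fit stands in for the real model, and the function name `oof_predictions`, the fold count, and the synthetic data are all assumptions made for illustration. The point is only that each row's prediction comes from a model that never saw that row, so the prediction can be appended as an input feature without leaking the target.

```python
import numpy as np

def oof_predictions(X, y, n_folds=5, seed=7):
    """Out-of-fold predictions from a least-squares stand-in model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)  # shuffled, disjoint folds
    Xb = np.column_stack([X, np.ones(n)])                # append an intercept column
    oof = np.empty(n, dtype=float)
    for k in range(n_folds):
        val_idx = folds[k]
        trn_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit on the training folds only, then predict the held-out fold
        coef, *_ = np.linalg.lstsq(Xb[trn_idx], y[trn_idx], rcond=None)
        oof[val_idx] = Xb[val_idx] @ coef
    return oof

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

oof_demo = oof_predictions(X_demo, y_demo)
# Stack the OOF prediction next to the original inputs, as steps 3 and 4 do
X_stacked = np.column_stack([X_demo, oof_demo])
```

In the pipeline described above there would be one such OOF column per target, giving ten extra input features.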
Imports
[ ]:
import numpy as np
import pandas as pd
import random
Set Random Seeds
[ ]:
random.seed(7)
np.random.seed(7)
Input & Output Directories
[ ]:
ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'
DATA_DIR = ROOT_DIR + 'DATA/'
ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'
Load Training and Testing Data Provided by the Organizers
[ ]:
df_XyTrnVal_org = pd.read_csv(DATA_DIR + 'train.csv')
df_XTst_org = pd.read_csv(DATA_DIR + 'test.csv')
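Before engineering features, it can help to confirm that the loaded files contain the columns the later cells index by name. The sketch below is an optional check, not part of the original notebook; the helper names `expected_input_columns` and `missing_columns` are hypothetical, and the demo frame is deliberately built with one column missing.

```python
import pandas as pd

def expected_input_columns(n_components=5, n_properties=10):
    """Column names the feature-engineering cell reads (hypothetical helper)."""
    cols = [f'Component{c}_fraction' for c in range(1, n_components + 1)]
    cols += [f'Component{c}_Property{p}'
             for c in range(1, n_components + 1)
             for p in range(1, n_properties + 1)]
    return cols

def missing_columns(df, required):
    """Return the required columns that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

required = expected_input_columns()
# Demo frame that lacks the last required column
df_demo = pd.DataFrame(columns=required[:-1])
missing = missing_columns(df_demo, required)
```

Applied to the real data, `missing_columns(df_XyTrnVal_org, required)` would be expected to return an empty list if the files follow the naming used in this notebook.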
Feature Engineering
[ ]:
# Create empty data frames
df_XyTrnVal_mod = pd.DataFrame()
df_XTst_mod = pd.DataFrame()

# Add component fractions
for comp in range(1, 6):
    df_XyTrnVal_mod[f'Component{comp}_fraction'] = (
        df_XyTrnVal_org[f'Component{comp}_fraction'])
    df_XTst_mod[f'Component{comp}_fraction'] = (
        df_XTst_org[f'Component{comp}_fraction'])

# Create volume fraction-weighted input features
for prop in range(1, 11):
    for comp in range(1, 6):
        fraction_col = f'Component{comp}_fraction'
        property_col = f'Component{comp}_Property{prop}'
        contribution_col = f'Component{comp}_Contribution_Property{prop}'
        df_XyTrnVal_mod[contribution_col] = (df_XyTrnVal_org[fraction_col] *
                                             df_XyTrnVal_org[property_col])
        df_XTst_mod[contribution_col] = (df_XTst_org[fraction_col] *
                                         df_XTst_org[property_col])

# Create weighted-averaged input features
for prop in range(1, 11):
    df_XyTrnVal_mod[f'WeightedAvg_Property{prop}'] = (
        sum(df_XyTrnVal_org[f'Component{comp}_fraction'] *
            df_XyTrnVal_org[f'Component{comp}_Property{prop}']
            for comp in range(1, 6)))
    df_XTst_mod[f'WeightedAvg_Property{prop}'] = (
        sum(df_XTst_org[f'Component{comp}_fraction'] *
            df_XTst_org[f'Component{comp}_Property{prop}']
            for comp in range(1, 6)))

# Add targets
for target in range(1, 11):
    df_XyTrnVal_mod[f'BlendProperty{target}'] = df_XyTrnVal_org[f'BlendProperty{target}']
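The two engineered feature families are closely related: each contribution is fraction times property for one component, and the weighted average for a property is the sum of those contributions over all components. A small worked example makes the arithmetic concrete. The toy frame below uses two components and one property with made-up values; only the column naming follows the notebook.

```python
import numpy as np
import pandas as pd

# Toy blend: 2 components, 1 property (the notebook uses 5 components, 10 properties)
df_toy = pd.DataFrame({
    'Component1_fraction': [0.25, 0.5],
    'Component2_fraction': [0.75, 0.5],
    'Component1_Property1': [2.0, 4.0],
    'Component2_Property1': [6.0, 8.0],
})

fractions = df_toy[['Component1_fraction', 'Component2_fraction']].to_numpy()
properties = df_toy[['Component1_Property1', 'Component2_Property1']].to_numpy()

# Per-component contributions: fraction * property, element by element
contributions = fractions * properties
# Weighted average: sum of the contributions across components
weighted_avg = contributions.sum(axis=1)
# Row 0: 0.25*2 + 0.75*6 = 5.0; row 1: 0.5*4 + 0.5*8 = 6.0
```

Note that the sum equals a true weighted average only when the fractions in each row add up to one, which is assumed here.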
Save Processed Data
[ ]:
df_XyTrnVal_mod.to_csv(ExtractedDATA_DIR + 'train_processed.csv', index=False)
df_XTst_mod.to_csv(ExtractedDATA_DIR + 'test_processed.csv', index=False)
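`DataFrame.to_csv` does not create missing directories, so the save step fails if the output folder does not exist yet. The sketch below shows one way to guard against that and to confirm the file reads back with the same shape. It writes to a temporary directory rather than the notebook's paths, and the demo frame and its values are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the processed training frame (illustrative values)
df_demo = pd.DataFrame({
    'Component1_fraction': [0.2, 0.8],
    'WeightedAvg_Property1': [1.5, -0.3],
})

with tempfile.TemporaryDirectory() as tmp_root:
    out_dir = os.path.join(tmp_root, 'ExtractedDATA')
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    out_path = os.path.join(out_dir, 'train_processed.csv')
    df_demo.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    df_back = pd.read_csv(out_path)

# Same column names and row count after the round trip
round_trip_ok = (df_back.columns.tolist() == df_demo.columns.tolist()
                 and len(df_back) == len(df_demo))
```

In the notebook itself, the equivalent guard would be a call to `os.makedirs(ExtractedDATA_DIR, exist_ok=True)` before the two `to_csv` calls.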