{ "cells": [ { "cell_type": "markdown", "id": "9c3c8b8b86d69aae", "metadata": {}, "source": [ "# Step #3\n", "\n", "## ML Predictions using the RealMLP Model" ] }, { "cell_type": "markdown", "id": "ba67b64fb1271959", "metadata": {}, "source": [ "**Last update: August 14, 2025**\n", "\n", "AI Assistance: Claude.AI (Anthropic) is used for documentation, code restructuring, and performance optimization" ] }, { "cell_type": "markdown", "id": "3b01b757cd3fb543", "metadata": {}, "source": [ "This program is free software: you can redistribute it and/or modify\n", "it under the terms of the GNU General Public License as published by\n", "the Free Software Foundation, either version 3 of the License, or\n", "(at your option) any later version.\n", "\n", "This program is distributed in the hope that it will be useful,\n", "but WITHOUT ANY WARRANTY; without even the implied warranty of\n", "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n", "GNU General Public License for more details.\n", "\n", "You should have received a copy of the GNU General Public License\n", "along with this program. If not, see ." ] }, { "cell_type": "markdown", "id": "ccb79ee7880034b7", "metadata": {}, "source": [ "**Overall Strategy**" ] }, { "cell_type": "markdown", "id": "534d4fc242b5178b", "metadata": {}, "source": [ "Step 1: Preprocess and engineer new features. \n", "\n", "Step 2: Use AutoGluon to generate OOF predictions for each target separately.\n", "These predictions will be used as additional input features in steps 3 and 4.\n", "\n", "**Step 3: Train the RealMLP model with processed input (step 1) + ten\n", "AutoGluon-OOFs (step 2). These additional features will capture the correlation\n", "among targets effectively.**\n", "\n", "Step 4: Similar to step 3 except use the TabPFN model.\n", "\n", "Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4)." 
] },
{ "cell_type": "markdown", "id": "c2b1da43e686256b", "metadata": {}, "source": [ "**Imports**" ] },
{ "cell_type": "code", "execution_count": null, "id": "f3fcbc51e111f66a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import os\n", "import random\n", "import pickle\n", "\n", "from scipy.stats import hmean\n", "from sklearn.metrics import mean_absolute_percentage_error as mape\n", "\n", "from pytabkit import RealMLP_TD_Regressor" ] },
{ "cell_type": "markdown", "id": "5d6dcd73f05cb1c7", "metadata": {}, "source": [ "**Set Random Seeds**" ] },
{ "cell_type": "code", "execution_count": null, "id": "710064335368dfd2", "metadata": {}, "outputs": [], "source": [ "random.seed(7)\n", "np.random.seed(7)\n", "\n", "# Explicitly set the global legacy RandomState to the state of RandomState(7)\n", "np.random.set_state(np.random.RandomState(7).get_state())" ] },
{ "cell_type": "markdown", "id": "d10f18cb64407a8b", "metadata": {}, "source": [ "**User Input**" ] },
{ "cell_type": "code", "execution_count": null, "id": "448f9074987a4d88", "metadata": {}, "outputs": [], "source": [ "# n-repetitions\n", "nTrials = 100\n", "\n", "# Number of folds in k-fold\n", "nFolds = 8\n", "\n", "# Number of input features + 10 OOFs\n", "nFeatures = 65 + 10\n", "\n", "# Number of target variables\n", "nTargets = 10" ] },
{ "cell_type": "markdown", "id": "fc6b82f7d4e3b356", "metadata": {}, "source": [ "**Input & Output Directories**" ] },
{ "cell_type": "code", "execution_count": null, "id": "f809d5068a7fd067", "metadata": {}, "outputs": [], "source": [ "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n", "DATA_DIR = ROOT_DIR + 'DATA/'\n", "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'\n", "Tuning_DIR = ROOT_DIR + 'Models/RealMLP/'\n", "\n", "# Create directory if it doesn't exist\n", "os.makedirs(Tuning_DIR, exist_ok=True)" ] },
{ "cell_type": "markdown", "id": "830ff21de9c5fbbb", "metadata": {}, "source": [ "**Load Processed Training and Testing Data**" ] },
{ "cell_type": "code", "execution_count": null, "id": "39a0bdd9fc74c89", "metadata": {}, "outputs": [], "source": [ "df_XyTrnVal_org = pd.read_csv(ExtractedDATA_DIR + 'train_processed.csv')\n", "nSamples_TrnVal = df_XyTrnVal_org.shape[0]\n", "\n", "df_XTst = pd.read_csv(ExtractedDATA_DIR + 'test_processed.csv')\n", "nSamples_Tst = df_XTst.shape[0]" ] },
{ "cell_type": "markdown", "id": "34cbb25f319e629a", "metadata": {}, "source": [ "**Load AutoGluon-generated OOF Data**" ] },
{ "cell_type": "code", "execution_count": null, "id": "63146cc9edc069ba", "metadata": {}, "outputs": [], "source": [ "df_XTrnVal_AG_OOF = pd.read_csv(ExtractedDATA_DIR + 'AutoGluon_21600_OOF.csv')\n", "df_XTst_AG_OOF = pd.read_csv(ExtractedDATA_DIR + 'AutoGluon_21600_Tst.csv')" ] },
{ "cell_type": "markdown", "id": "76cbca0890325b6", "metadata": {}, "source": [ "**Combine Dataframes**" ] },
{ "cell_type": "code", "execution_count": null, "id": "9eb5962e7a07d06b", "metadata": {}, "outputs": [], "source": [ "df_XyTrnVal = pd.concat([df_XTrnVal_AG_OOF, df_XyTrnVal_org], axis=1)\n", "df_XTst = pd.concat([df_XTst_AG_OOF, df_XTst], axis=1)" ] },
{ "cell_type": "markdown", "id": "c607dca9f8c98941", "metadata": {}, "source": [ "**Initialize Storage for Results**" ] },
{ "cell_type": "code", "execution_count": null, "id": "5212cbdf0e771e2e", "metadata": {}, "outputs": [], "source": [ "dict_yTrnVal_OOF = {}\n", "dict_yTst_pred_allFold = {}\n", "dict_CV_scores = {}\n", "dict_trained_models = {}\n", "\n", "for trial in range(nTrials):\n", "    dict_yTrnVal_OOF[trial] = {}\n", "    dict_yTst_pred_allFold[trial] = {}\n", "    dict_CV_scores[trial] = {}\n", "    dict_trained_models[trial] = {}" ] },
{ "cell_type": "markdown", "id": "5827dbccb2367bf5", "metadata": {}, "source": [ "**Iterative Single-target Training using RealMLP**" ] },
{ "cell_type": "code", "execution_count": null, "id": "5f26befb5c18cf30", "metadata": {}, "outputs": [], "source": [ "nSamples_per_fold = nSamples_TrnVal // nFolds\n", "\n", "# n-repetitions of RealMLP models (resampling)\n", "for trial in range(nTrials):\n", "\n", "    print(f\"\\n=== TRIAL {trial + 1}/{nTrials} ===\")\n", "\n", "    # Shuffle training dataset & track original index\n", "    shuffle_indx = np.random.permutation(nSamples_TrnVal)\n", "    restore_indx = np.argsort(shuffle_indx)\n", "    df_XyTrnVal_shuffled = (\n", "        df_XyTrnVal.iloc[shuffle_indx].reset_index(drop=True))\n", "\n", "    # Extract input features\n", "    XTrnVal_shuffled = df_XyTrnVal_shuffled.iloc[:, 0:nFeatures].values\n", "\n", "    # Multioutput targets\n", "    for target in range(nTargets):\n", "\n", "        print(f\"\\n--- Target {target + 1}/{nTargets} ---\")\n", "\n", "        # Extract single target from possible nTargets\n", "        yTrnVal_shuffled = (\n", "            df_XyTrnVal_shuffled.iloc[:, nFeatures + target].values)\n", "\n", "        # Initialize zero vectors for OOF & test predictions\n", "        yTrnVal_shuffled_pred = np.zeros_like(yTrnVal_shuffled)\n", "        yTst_pred = np.zeros((nSamples_Tst, nFolds))\n", "\n", "        # Store models for this target and trial\n", "        dict_trained_models[trial][target] = []\n", "\n", "        # K-folds\n", "        for Fold in range(nFolds):\n", "            # Create validation indices for this fold; the last fold\n", "            # absorbs any remainder so every sample is validated once\n", "            val_start = Fold * nSamples_per_fold\n", "            val_end = (nSamples_TrnVal if Fold == nFolds - 1\n", "                       else (Fold + 1) * nSamples_per_fold)\n", "            val_indices = list(range(val_start, val_end))\n", "\n", "            # Create training indices (all except validation fold)\n", "            trn_indices = list(range(0, val_start)) + list(\n", "                range(val_end, nSamples_TrnVal))\n", "\n", "            # Split features and targets\n", "            XTrn_shuffled_fold = XTrnVal_shuffled[trn_indices]\n", "            XVal_shuffled_fold = XTrnVal_shuffled[val_indices]\n", "\n", "            yTrn_shuffled_fold = yTrnVal_shuffled[trn_indices]\n", "            yVal_shuffled_fold = yTrnVal_shuffled[val_indices]\n", "\n", "            print(\n", "                f\" Fold {Fold + 1}/{nFolds}: \"\n", "                f\"Train={len(trn_indices)}, \"\n", "                f\"Val={len(val_indices)}\")\n", "\n", "            # Initialize RealMLP model\n", "            regressor = RealMLP_TD_Regressor()\n", "\n", "            # Fit the RealMLP model with default settings (no tuning)\n", "            regressor.fit(XTrn_shuffled_fold, yTrn_shuffled_fold)\n", "\n", "            # Store the trained model\n", "            dict_trained_models[trial][target].append(regressor)\n", "\n", "            # Make predictions on the holdout set\n", "            yVal_shuffled_fold_pred = regressor.predict(XVal_shuffled_fold)\n", "            yTrnVal_shuffled_pred[val_indices] = yVal_shuffled_fold_pred\n", "\n", "            # Make predictions on the test set\n", "            yTst_pred[:, Fold] = regressor.predict(df_XTst.iloc[:, 0:nFeatures].values)\n", "            print(f\"Test predictions generated for Fold {Fold + 1}\")\n", "\n", "        # Restore the order of the indices\n", "        yTrnVal_OOF = yTrnVal_shuffled_pred[restore_indx]\n", "\n", "        # Average yTst_pred across folds (sign-preserving harmonic mean)\n", "        yTst_pred_allFold = (hmean(np.abs(yTst_pred), axis=1) *\n", "                             np.sign(np.mean(yTst_pred, axis=1)))\n", "\n", "        # Store predictions\n", "        dict_yTrnVal_OOF[trial][target] = yTrnVal_OOF.copy()\n", "        dict_yTst_pred_allFold[trial][target] = yTst_pred_allFold.copy()\n", "\n", "        # Compute CV score\n", "        dict_CV_scores[trial][target] = mape(yTrnVal_shuffled,\n", "                                             yTrnVal_shuffled_pred)" ] },
{ "cell_type": "markdown", "id": "a7a759b8d2f61185", "metadata": {}, "source": [ "**Average Results Across Trials**" ] },
{ "cell_type": "code", "execution_count": null, "id": "38edbba9797b49e3", "metadata": {}, "outputs": [], "source": [ "print(\"\\n=== AVERAGING ACROSS TRIALS ===\")\n", "\n", "dict_yTrnVal_avg_final = {}\n", "dict_yTst_avg_final = {}\n", "dict_CV_scores_avg = {}\n", "\n", "for target in range(nTargets):\n", "    # Average training OOF predictions across trials (harmonic mean)\n", "    trial_TrnVal = [dict_yTrnVal_OOF[trial][target] for trial in range(nTrials)]\n", "    dict_yTrnVal_avg_final[target] = (hmean(np.abs(trial_TrnVal), axis=0) *\n", "                                      np.sign(np.mean(trial_TrnVal, axis=0)))\n", "\n", "    # Average test predictions across trials (harmonic mean)\n", "    trial_Tst = [dict_yTst_pred_allFold[trial][target] for trial in range(nTrials)]\n", "    dict_yTst_avg_final[target] = (hmean(np.abs(trial_Tst), axis=0) *\n", "                                   np.sign(np.mean(trial_Tst, axis=0)))\n", "\n", "    # CV scores of averaged predictions\n", "    yTrnVal = df_XyTrnVal.iloc[:, nFeatures + target].values\n", "    dict_CV_scores_avg[target] = mape(yTrnVal, dict_yTrnVal_avg_final[target])\n", "\n", "    print(f\"Target {target + 1}: Avg CV MAPE = {dict_CV_scores_avg[target]:.4f}\")" ] },
{ "cell_type": "markdown", "id": "5f4569b9155b1237", "metadata": {}, "source": [ "**Save Results**" ] },
{ "cell_type": "code", "execution_count": null, "id": "237e4d72ec9ed492", "metadata": {}, "outputs": [], "source": [ "print(\"\\n=== SAVING RESULTS ===\")\n", "\n", "df_submission = pd.DataFrame()\n", "df_submission['ID'] = range(1, nSamples_Tst + 1)\n", "\n", "for target in range(nTargets):\n", "    column_name = f'BlendProperty{target+1}'\n", "    df_submission[column_name] = dict_yTst_avg_final[target]\n", "\n", "df_submission.to_csv(ExtractedDATA_DIR + 'RealMLP_submission.csv', index=False)" ] },
{ "cell_type": "markdown", "id": "cb0a50602cd7c09a", "metadata": {}, "source": [ "**Save Trained Models**" ] },
{ "cell_type": "code", "execution_count": null, "id": "initial_id", "metadata": {}, "outputs": [], "source": [ "print(\"\\n=== SAVING TRAINED MODELS ===\")\n", "\n", "# Save all trained models\n", "with open(Tuning_DIR + 'RealMLP_trained_models.pkl', 'wb') as f:\n", "    pickle.dump(dict_trained_models, f)\n", "\n", "print(f\"All trained models saved to: {Tuning_DIR}RealMLP_trained_models.pkl\")\n", "\n", "# Also save individual models for easier access\n", "for trial in range(nTrials):\n", "    for target in range(nTargets):\n", "        for fold in range(nFolds):\n", "            model_filename = f'RealMLP_trial{trial+1}_target{target+1}_fold{fold+1}.pkl'\n", "            model_path = os.path.join(Tuning_DIR, model_filename)\n", "            with open(model_path, 'wb') as f:\n", "                pickle.dump(dict_trained_models[trial][target][fold], f)\n", "\n", "print(f\"Individual models saved to: {Tuning_DIR}\")\n", "print(f\"Total models saved: {nTrials * nTargets * nFolds}\")\n", "\n", "print(\"RealMLP training completed!\")" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }