{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9c3c8b8b86d69aae",
   "metadata": {},
   "source": [
    "# Step #3\n",
    "\n",
    "## ML Predictions using the RealMLP Model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba67b64fb1271959",
   "metadata": {},
   "source": [
    "**Last update: August 14, 2025**\n",
    "\n",
    "AI Assistance: Claude.AI (Anthropic) is used for documentation, code restructuring, and performance optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b01b757cd3fb543",
   "metadata": {},
   "source": [
    "This program is free software: you can redistribute it and/or modify\n",
    "it under the terms of the GNU General Public License as published by\n",
    "the Free Software Foundation, either version 3 of the License, or\n",
    "(at your option) any later version.\n",
    "\n",
    "This program is distributed in the hope that it will be useful,\n",
    "but WITHOUT ANY WARRANTY; without even the implied warranty of\n",
    "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n",
    "GNU General Public License for more details.\n",
    "\n",
    "You should have received a copy of the GNU General Public License\n",
    "along with this program.  If not, see <https://www.gnu.org/licenses/>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccb79ee7880034b7",
   "metadata": {},
   "source": [
    "**Overall Strategy**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "534d4fc242b5178b",
   "metadata": {},
   "source": [
    "Step 1: Preprocess and engineer new features. \n",
    "\n",
    "Step 2: Use AutoGluon to generate OOF predictions for each target separately.\n",
    "These predictions will be used as additional input features in steps 3 and 4.\n",
    "\n",
    "**Step 3: Train the RealMLP model with processed input (step 1) + ten\n",
    "AutoGluon-OOFs (step 2). These additional features will capture the correlation\n",
    "among targets effectively.**\n",
    "\n",
    "Step 4: Similar to step 3 except use the TabPFN model.\n",
    "\n",
    "Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4)."
   ]
  },
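  {
   "cell_type": "markdown",
   "id": "sketch-oof-stacking-note",
   "metadata": {},
   "source": [
    "**Sketch: OOF Stacking on Synthetic Data**\n",
    "\n",
    "The idea behind steps 2 and 3 can be illustrated with a minimal sketch. It is not part of the\n",
    "pipeline: the synthetic data, the `Ridge` stand-in model, and every `*_demo` name are assumptions\n",
    "made for illustration only. Each row's OOF value comes from a model that never saw that row, and\n",
    "the OOF columns are then placed in front of the original features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-oof-stacking-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Synthetic stand-in data (assumption): 200 samples, 5 features, 2 correlated targets\n",
    "rng_demo = np.random.default_rng(0)  # local generator; the global seeds stay untouched\n",
    "X_demo = rng_demo.normal(size=(200, 5))\n",
    "y_demo = np.column_stack([\n",
    "    X_demo[:, 0] + 0.1 * rng_demo.normal(size=200),\n",
    "    X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng_demo.normal(size=200),\n",
    "])\n",
    "\n",
    "# One OOF column per target: every row is predicted by a model that never saw it\n",
    "oof_demo = np.zeros_like(y_demo)\n",
    "kfold_demo = KFold(n_splits=4, shuffle=True, random_state=0)\n",
    "for trn_idx_demo, val_idx_demo in kfold_demo.split(X_demo):\n",
    "    for t_demo in range(y_demo.shape[1]):\n",
    "        model_demo = Ridge().fit(X_demo[trn_idx_demo], y_demo[trn_idx_demo, t_demo])\n",
    "        oof_demo[val_idx_demo, t_demo] = model_demo.predict(X_demo[val_idx_demo])\n",
    "\n",
    "# Stacked inputs for the next stage: OOF columns first, then the original features\n",
    "X_stacked_demo = np.hstack([oof_demo, X_demo])\n",
    "print(X_stacked_demo.shape)"
   ]
  },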
  {
   "cell_type": "markdown",
   "id": "c2b1da43e686256b",
   "metadata": {},
   "source": [
    "**Imports**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3fcbc51e111f66a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import os\n",
    "import random\n",
    "import pickle\n",
    "\n",
    "from scipy.stats import hmean\n",
    "from sklearn.metrics import mean_absolute_percentage_error as mape\n",
    "\n",
    "from pytabkit import RealMLP_TD_Regressor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d6dcd73f05cb1c7",
   "metadata": {},
   "source": [
    "**Set Random Seeds**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "710064335368dfd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "random.seed(7)\n",
    "np.random.seed(7)\n",
    "\n",
    "# Force numpy to use legacy RandomState instead of Generator\n",
    "np.random.set_state(np.random.RandomState(7).get_state())"
   ]
  },
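  {
   "cell_type": "markdown",
   "id": "sketch-seed-check-note",
   "metadata": {},
   "source": [
    "**Sketch: Checking the Two Seeding Calls**\n",
    "\n",
    "A quick check, offered as a sketch rather than as part of the pipeline, suggests that\n",
    "`np.random.seed(7)` and copying the state of `np.random.RandomState(7)` leave the global\n",
    "generator in the same state. The `*_demo` names are hypothetical. The global state is saved\n",
    "first and restored at the end, so running this cell should not change the shuffles used later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-seed-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Save the current global state so this check has no side effects\n",
    "saved_state_demo = np.random.get_state()\n",
    "\n",
    "# Path 1: seed the legacy global RandomState directly\n",
    "np.random.seed(7)\n",
    "perm_a_demo = np.random.permutation(10)\n",
    "\n",
    "# Path 2: copy the state of a fresh RandomState(7) into the global generator\n",
    "np.random.set_state(np.random.RandomState(7).get_state())\n",
    "perm_b_demo = np.random.permutation(10)\n",
    "\n",
    "# Restore the state that was active before this cell ran\n",
    "np.random.set_state(saved_state_demo)\n",
    "\n",
    "print(np.array_equal(perm_a_demo, perm_b_demo))"
   ]
  },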
  {
   "cell_type": "markdown",
   "id": "d10f18cb64407a8b",
   "metadata": {},
   "source": [
    "**User Input**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "448f9074987a4d88",
   "metadata": {},
   "outputs": [],
   "source": [
    "# n-repetitions\n",
    "nTrials = 100  \n",
    "\n",
    "# Number of folds in k-fold\n",
    "nFolds = 8\n",
    "\n",
    "# Number of input features + 10 OOFs\n",
    "nFeatures = 65 + 10\n",
    "\n",
    "# Number of target variables\n",
    "nTargets = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc6b82f7d4e3b356",
   "metadata": {},
   "source": [
    "**Input & Output Directories**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f809d5068a7fd067",
   "metadata": {},
   "outputs": [],
   "source": [
    "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n",
    "DATA_DIR = ROOT_DIR + 'DATA/'\n",
    "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'\n",
    "Tuning_DIR = ROOT_DIR + 'Models/RealMLP/'\n",
    "\n",
    "# Create directory if it doesn't exist\n",
    "os.makedirs(Tuning_DIR, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "830ff21de9c5fbbb",
   "metadata": {},
   "source": [
    "**Load Processed Training and Testing Data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39a0bdd9fc74c89",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XyTrnVal_org = pd.read_csv(ExtractedDATA_DIR + 'train_processed.csv')\n",
    "nSamples_TrnVal = df_XyTrnVal_org.shape[0]\n",
    "\n",
    "df_XTst = pd.read_csv(ExtractedDATA_DIR + 'test_processed.csv')\n",
    "nSamples_Tst = df_XTst.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34cbb25f319e629a",
   "metadata": {},
   "source": [
    "**Load AutoGluon-generated OOF Data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63146cc9edc069ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XTrnVal_AG_OOF = pd.read_csv(ExtractedDATA_DIR + 'AutoGluon_21600_OOF.csv')\n",
    "df_XTst_AG_OOF = pd.read_csv(ExtractedDATA_DIR + 'AutoGluon_21600_Tst.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76cbca0890325b6",
   "metadata": {},
   "source": [
    "**Combine Dataframes**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eb5962e7a07d06b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XyTrnVal = pd.concat([df_XTrnVal_AG_OOF, df_XyTrnVal_org], axis=1)\n",
    "df_XTst = pd.concat([df_XTst_AG_OOF, df_XTst], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c607dca9f8c98941",
   "metadata": {},
   "source": [
    "**Initialize Storage for Results**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5212cbdf0e771e2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_yTrnVal_OOF = {}\n",
    "dict_yTst_pred_allFold = {}\n",
    "dict_CV_scores = {}\n",
    "dict_trained_models = {}\n",
    "\n",
    "for trial in range(nTrials):\n",
    "    dict_yTrnVal_OOF[trial] = {}\n",
    "    dict_yTst_pred_allFold[trial] = {}\n",
    "    dict_CV_scores[trial] = {}\n",
    "    dict_trained_models[trial] = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5827dbccb2367bf5",
   "metadata": {},
   "source": [
    "**Iterative Single-target Training using RealMLP**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f26befb5c18cf30",
   "metadata": {},
   "outputs": [],
   "source": [
    "nSamples_per_fold = int(nSamples_TrnVal / nFolds)\n",
    "\n",
    "# n-repetitions of TabPFN models (resampling)\n",
    "for trial in range(nTrials):\n",
    "\n",
    "    print(f\"\\n=== TRIAL {trial + 1}/{nTrials} ===\")\n",
    "\n",
    "    # Shuffle training dataset & track original index\n",
    "    shuffle_indx = np.random.permutation(nSamples_TrnVal)\n",
    "    restore_indx = np.argsort(shuffle_indx)\n",
    "    df_XyTrnVal_shuffled = (\n",
    "        df_XyTrnVal.iloc[shuffle_indx].reset_index(drop=True))\n",
    "\n",
    "    # Extract input features\n",
    "    XTrnVal_shuffled = df_XyTrnVal_shuffled.iloc[:, 0:nFeatures].values\n",
    "\n",
    "    # Multioutput targets\n",
    "    for target in range(nTargets):\n",
    "\n",
    "        print(f\"\\n--- Target {target + 1}/{nTargets} ---\")\n",
    "\n",
    "        # Extract single target from possible nTargets\n",
    "        yTrnVal_shuffled = (\n",
    "            df_XyTrnVal_shuffled.iloc[:, nFeatures + target].values)\n",
    "\n",
    "        # Initialize zero vectors for OOF & test predictions\n",
    "        yTrnVal_shuffled_pred = np.zeros_like(yTrnVal_shuffled)\n",
    "        yTst_pred = np.zeros((nSamples_Tst, nFolds))\n",
    "\n",
    "        # Store models for this target and trial\n",
    "        dict_trained_models[trial][target] = []\n",
    "\n",
    "        # K-folds\n",
    "        for Fold in range(nFolds):\n",
    "            # Create validation indices for this fold\n",
    "            val_start = Fold * nSamples_per_fold\n",
    "            val_end = min((Fold + 1) * nSamples_per_fold, nSamples_TrnVal)\n",
    "            val_indices = list(range(val_start, val_end))\n",
    "\n",
    "            # Create training indices (all except validation fold)\n",
    "            trn_indices = list(range(0, val_start)) + list(\n",
    "                range(val_end, nSamples_TrnVal))\n",
    "\n",
    "            # Split features and targets\n",
    "            XTrn_shuffled_fold = XTrnVal_shuffled[trn_indices]\n",
    "            XVal_shuffled_fold = XTrnVal_shuffled[val_indices]\n",
    "\n",
    "            yTrn_shuffled_fold = yTrnVal_shuffled[trn_indices]\n",
    "            yVal_shuffled_fold = yTrnVal_shuffled[val_indices]\n",
    "\n",
    "            print(\n",
    "                f\"  Fold {Fold + 1}/{nFolds}: \"\n",
    "                f\"Train={len(trn_indices)}, \"\n",
    "                f\"Val={len(val_indices)}\")\n",
    "\n",
    "            # Initialize RealMLP model\n",
    "            regressor = RealMLP_TD_Regressor()\n",
    "\n",
    "            # Fit (no tuning) using TabPFN model\n",
    "            regressor.fit(XTrn_shuffled_fold, yTrn_shuffled_fold)\n",
    "\n",
    "            # Store the trained model\n",
    "            dict_trained_models[trial][target].append(regressor)\n",
    "\n",
    "            # Make predictions on the holdout set\n",
    "            yVal_shuffled_fold_pred = regressor.predict(XVal_shuffled_fold)\n",
    "            yTrnVal_shuffled_pred[val_indices] = yVal_shuffled_fold_pred\n",
    "\n",
    "            # Make predictions on the test set\n",
    "            yTst_pred[:, Fold] = regressor.predict(df_XTst.iloc[:, 0:nFeatures].values)\n",
    "            print(f\"Test predictions generated for Fold {Fold + 1}\")\n",
    "\n",
    "        # Restore the order of the indices\n",
    "        yTrnVal_OOF = yTrnVal_shuffled_pred[restore_indx]\n",
    "\n",
    "        # Average yTst_pred across various folds (harmonic mean)\n",
    "        yTst_pred_allFold = (hmean(np.abs(yTst_pred), axis=1) *\n",
    "                    np.sign(np.mean(yTst_pred, axis=1)))\n",
    "\n",
    "        # Store predictions\n",
    "        dict_yTrnVal_OOF[trial][target] = yTrnVal_OOF.copy()\n",
    "        dict_yTst_pred_allFold[trial][target] = yTst_pred_allFold.copy()\n",
    "\n",
    "        # Compute CV score\n",
    "        dict_CV_scores[trial][target] = mape(yTrnVal_shuffled,\n",
    "                                        yTrnVal_shuffled_pred)"
   ]
  },
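  {
   "cell_type": "markdown",
   "id": "sketch-shuffle-restore-note",
   "metadata": {},
   "source": [
    "**Sketch: Shuffle, Restore, and Fold Boundaries**\n",
    "\n",
    "The training loop above shuffles the rows, validates on contiguous blocks, and uses\n",
    "`np.argsort` to map the OOF predictions back to the original row order. The sketch below\n",
    "checks that round trip on a toy array. All `*_demo` names are hypothetical, and\n",
    "`np.array_split` is shown only as an alternative way to cut folds that spreads any remainder\n",
    "across folds; it is not what the loop above uses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-shuffle-restore-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng_idx_demo = np.random.default_rng(0)  # local generator; the global seeds stay untouched\n",
    "values_demo = np.arange(10) * 10         # stand-in for per-sample predictions\n",
    "\n",
    "# Shuffle, then build the inverse permutation with argsort\n",
    "shuffle_demo = rng_idx_demo.permutation(len(values_demo))\n",
    "restore_demo = np.argsort(shuffle_demo)\n",
    "\n",
    "shuffled_demo = values_demo[shuffle_demo]\n",
    "restored_demo = shuffled_demo[restore_demo]\n",
    "print(np.array_equal(restored_demo, values_demo))\n",
    "\n",
    "# Contiguous folds over 10 rows with 4 folds: sizes differ by at most one\n",
    "folds_demo = np.array_split(np.arange(10), 4)\n",
    "fold_sizes_demo = [len(f) for f in folds_demo]\n",
    "print(fold_sizes_demo)"
   ]
  },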
  {
   "cell_type": "markdown",
   "id": "a7a759b8d2f61185",
   "metadata": {},
   "source": [
    "**Average Results Across Trials**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38edbba9797b49e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n=== AVERAGING ACROSS TRIALS ===\")\n",
    "\n",
    "dict_yTrnVal_avg_final = {}\n",
    "dict_yTst_avg_final = {}\n",
    "dict_CV_scores_avg = {}\n",
    "\n",
    "for target in range(nTargets):\n",
    "    # Average training OOF predictions across trials\n",
    "    trial_TrnVal = [dict_yTrnVal_OOF[trial][target] for trial in range(nTrials)]\n",
    "    dict_yTrnVal_avg_final[target] = (hmean(np.abs(trial_TrnVal), axis=0) *\n",
    "                              np.sign(np.mean(trial_TrnVal, axis=0)))\n",
    "\n",
    "    # Average test OOF predictions across trials (use hmean)\n",
    "    trial_Tst = [dict_yTst_pred_allFold[trial][target] for trial in range(nTrials)]\n",
    "    dict_yTst_avg_final[target] = (hmean(np.abs(trial_Tst), axis=0) *\n",
    "                           np.sign(np.mean(trial_Tst, axis=0)))\n",
    "\n",
    "    # CV scores of averaged predictions\n",
    "    yTrnVal = (df_XyTrnVal.iloc[:, nFeatures + target].values)\n",
    "    dict_CV_scores_avg[target] = mape(yTrnVal, dict_yTrnVal_avg_final[target])\n",
    "\n",
    "    print(f\"Target {target + 1}: Avg CV MAPE = {dict_CV_scores_avg[target]:.4f}\")"
   ]
  },
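  {
   "cell_type": "markdown",
   "id": "sketch-signed-hmean-note",
   "metadata": {},
   "source": [
    "**Sketch: Signed Harmonic Mean**\n",
    "\n",
    "The fold and trial averages above take the harmonic mean of the absolute predictions and\n",
    "reattach the sign of the arithmetic mean. The sketch below wraps that expression in a\n",
    "hypothetical helper, `signed_hmean_demo`, and applies it to made-up numbers. For positive\n",
    "values the harmonic mean never exceeds the arithmetic mean, so this way of combining\n",
    "predictions leans toward the smaller magnitudes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-signed-hmean-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import hmean\n",
    "\n",
    "\n",
    "def signed_hmean_demo(preds, axis):\n",
    "    # Harmonic mean of |preds| along `axis`, signed by the arithmetic mean\n",
    "    preds = np.asarray(preds, dtype=float)\n",
    "    return hmean(np.abs(preds), axis=axis) * np.sign(np.mean(preds, axis=axis))\n",
    "\n",
    "\n",
    "# Two samples (rows), three made-up fold predictions each (columns)\n",
    "fold_preds_demo = np.array([[1.0, 2.0, 4.0],\n",
    "                            [-1.0, -2.0, -4.0]])\n",
    "\n",
    "combined_demo = signed_hmean_demo(fold_preds_demo, axis=1)\n",
    "arith_demo = np.mean(fold_preds_demo, axis=1)\n",
    "print(combined_demo)\n",
    "print(arith_demo)"
   ]
  },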
  {
   "cell_type": "markdown",
   "id": "5f4569b9155b1237",
   "metadata": {},
   "source": [
    "**Save Results**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "237e4d72ec9ed492",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n=== SAVING RESULTS ===\")\n",
    "\n",
    "df_submission = pd.DataFrame()\n",
    "df_submission['ID'] = range(1, nSamples_Tst + 1)\n",
    "\n",
    "for target in range(nTargets):\n",
    "    column_name = f'BlendProperty{target+1}'\n",
    "    df_submission[column_name] = dict_yTst_avg_final[target]\n",
    "\n",
    "df_submission.to_csv(ExtractedDATA_DIR + 'RealMLP_submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb0a50602cd7c09a",
   "metadata": {},
   "source": [
    "**Save Trained Models**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n=== SAVING TRAINED MODELS ===\")\n",
    "\n",
    "# Save all trained models\n",
    "with open(Tuning_DIR + 'RealMLP_trained_models.pkl', 'wb') as f:\n",
    "    pickle.dump(dict_trained_models, f)\n",
    "\n",
    "print(f\"All trained models saved to: {Tuning_DIR}RealMLP_trained_models.pkl\")\n",
    "\n",
    "# Also save individual models for easier access\n",
    "for trial in range(nTrials):\n",
    "    for target in range(nTargets):\n",
    "        for fold in range(nFolds):\n",
    "            model_filename = f'RealMLP_trial{trial+1}_target{target+1}_fold{fold+1}.pkl'\n",
    "            model_path = os.path.join(Tuning_DIR, model_filename)\n",
    "            with open(model_path, 'wb') as f:\n",
    "                pickle.dump(dict_trained_models[trial][target][fold], f)\n",
    "\n",
    "print(f\"Individual models saved to: {Tuning_DIR}\")\n",
    "print(f\"Total models saved: {nTrials * nTargets * nFolds}\")\n",
    "\n",
    "print(f\"RealMLP training completed!\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}