{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "246a0753d9ced63",
   "metadata": {},
   "source": [
    "# Step #2\n",
    "\n",
    "## Generating Out-of-Fold (OOF) Target Values using AutoGluon"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db7195c7ba8aaa7",
   "metadata": {},
   "source": [
    "**Last update: August 14, 2025**\n",
    "\n",
    "AI Assistance: Claude.AI (Anthropic) is used for documentation, code restructuring, and performance optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a04cdc53525eca0f",
   "metadata": {},
   "source": [
    "**Copyright (C) 2025 Sukanta Basu**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f88e04044878be57",
   "metadata": {},
   "source": [
    "This program is free software: you can redistribute it and/or modify\n",
    "it under the terms of the GNU General Public License as published by\n",
    "the Free Software Foundation, either version 3 of the License, or\n",
    "(at your option) any later version.\n",
    "\n",
    "This program is distributed in the hope that it will be useful,\n",
    "but WITHOUT ANY WARRANTY; without even the implied warranty of\n",
    "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n",
    "GNU General Public License for more details.\n",
    "\n",
    "You should have received a copy of the GNU General Public License\n",
    "along with this program.  If not, see <https://www.gnu.org/licenses/>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8f97e0bd35df0b",
   "metadata": {},
   "source": [
    "**Overall Strategy**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b56e99773f9f8f2",
   "metadata": {},
   "source": [
    "Step 1: Preprocess and engineer new features. \n",
    "\n",
    "**Step 2: Use AutoGluon to generate OOF predictions for each target separately.\n",
    "These predictions will be used as additional input features in steps 3 and 4.**\n",
    "\n",
    "Step 3: Train the RealMLP model with processed input (step 1) + ten\n",
    "AutoGluon-OOFs (step 2). These additional features will capture the correlation\n",
    "among targets effectively.\n",
    "\n",
    "Step 4: Similar to step 3 except use the TabPFN model.\n",
    "\n",
    "Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a96ba877e53bc1b",
   "metadata": {},
   "source": [
    "**Imports**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5690a0094408a5bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import os\n",
    "import random\n",
    "import warnings\n",
    "\n",
    "from autogluon.tabular import TabularPredictor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b8c651ce745d65b",
   "metadata": {},
   "source": [
    "**Set Random Seeds**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd7eebb603f39cc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "random.seed(7)\n",
    "np.random.seed(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ef87b39e71027c1",
   "metadata": {},
   "source": [
    "**User Input**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b42562ac873ae2f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# AutoGluon quality preset\n",
    "quality_preset = 'best_quality'\n",
    "\n",
    "# AutoGluon training time (in seconds)\n",
    "maxTime = 21600\n",
    "\n",
    "# Number of input features\n",
    "nFeatures = 65\n",
    "\n",
    "# Number of target variables\n",
    "nTargets = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0e955b8d1340774",
   "metadata": {},
   "source": [
    "**Input & Output Directories**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ef35bbf7fb66929",
   "metadata": {},
   "outputs": [],
   "source": [
    "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n",
    "DATA_DIR = ROOT_DIR + 'DATA/'\n",
    "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'\n",
    "Tuning_DIR = ROOT_DIR + 'Models/AutoGluon-OOF/'\n",
    "\n",
    "# Create directory if it doesn't exist\n",
    "os.makedirs(Tuning_DIR, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4afbfc41d9bc6897",
   "metadata": {},
   "source": [
    "**Load Processed Training and Testing Data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62de12294936981f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XyTrnVal_org = pd.read_csv(ExtractedDATA_DIR + 'train_processed.csv')\n",
    "nSamples_TrnVal = df_XyTrnVal_org.shape[0]\n",
    "\n",
    "df_XTst = pd.read_csv(ExtractedDATA_DIR + 'test_processed.csv')\n",
    "nSamples_Tst = df_XTst.shape[0]\n",
    "\n",
    "print(f\"Training data shape: {df_XyTrnVal_org.shape}\")\n",
    "print(f\"Test data shape: {df_XTst.shape}\")\n",
    "\n",
    "# Extract input features\n",
    "XTrnVal = df_XyTrnVal_org.iloc[:, 0:nFeatures]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f7687daaaf7d237",
   "metadata": {},
   "source": [
    "**Iterative Single-target Training using AutoGluon**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d51a4003a1b94737",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize predictions array\n",
    "yTrnVal_OOF = np.zeros((nSamples_TrnVal, nTargets))\n",
    "yTst = np.zeros((nSamples_Tst, nTargets))\n",
    "\n",
    "for target in range(nTargets):\n",
    "    print(f\"\\n--- Target {target + 1}/{nTargets} ---\")\n",
    "\n",
    "    # Extract single target from possible nTargets\n",
    "    yTrnVal = df_XyTrnVal_org.iloc[:, nFeatures + target]\n",
    "\n",
    "    # Create training dataframe with features and target\n",
    "    train_data = XTrnVal.copy()\n",
    "    train_data[f'target_{target}'] = yTrnVal\n",
    "\n",
    "    # Create unique file path for each target\n",
    "    target_path = os.path.join(Tuning_DIR, f'target_{target + 1}')\n",
    "    os.makedirs(target_path, exist_ok=True)\n",
    "\n",
    "    # Initialize TabularPredictor from AutoGluon\n",
    "    predictor = TabularPredictor(\n",
    "        label=f'target_{target}',\n",
    "        path=target_path,\n",
    "        eval_metric='mean_absolute_percentage_error',\n",
    "        problem_type='regression'\n",
    "    )\n",
    "\n",
    "    # Train the model\n",
    "    print(\"Starting AutoGluon training...\")\n",
    "    predictor.fit(\n",
    "        train_data,\n",
    "        time_limit=maxTime,\n",
    "        presets=quality_preset,\n",
    "        verbosity=2,\n",
    "        auto_stack=False,\n",
    "        dynamic_stacking=False,\n",
    "        num_bag_folds=8,\n",
    "        num_bag_sets=5,\n",
    "        num_stack_levels=2,\n",
    "        use_bag_holdout=False,\n",
    "        fit_strategy=\"sequential\",\n",
    "        ag_args_ensemble={'fold_fitting_strategy': \"parallel_local\"},\n",
    "        ds_args={'enable_ray_logging': False}\n",
    "    )\n",
    "\n",
    "    print(\"\\n Model Leaderboard:\")\n",
    "    leaderboard = predictor.leaderboard(silent=True)\n",
    "    print(leaderboard.sort_values(\"score_val\", ascending=False).head())\n",
    "\n",
    "    # OOF predictions based on training set\n",
    "    yTrnVal_OOF[:, target] = predictor.predict_oof()\n",
    "\n",
    "    # Make predictions on test set\n",
    "    yTst[:, target] = predictor.predict(df_XTst)\n",
    "    print(f\"Test predictions generated for target {target + 1}\")\n",
    "\n",
    "    # Clean up predictor to free memory\n",
    "    del predictor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0ccfa2e9369bfc5",
   "metadata": {},
   "source": [
    "**Save Results**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n=== SAVING RESULTS ===\")\n",
    "\n",
    "# Create dataframes\n",
    "df_AG_yTrnVal_OOF = pd.DataFrame()\n",
    "df_AG_yTst = pd.DataFrame()\n",
    "\n",
    "# Add prediction columns\n",
    "for i in range(nTargets):\n",
    "    df_AG_yTrnVal_OOF[f'AG-BlendProperty{i + 1}'] = yTrnVal_OOF[:, i]\n",
    "    df_AG_yTst[f'AG-BlendProperty{i+1}'] = yTst[:, i]\n",
    "\n",
    "# Save predictions\n",
    "AG_OOF_file = os.path.join(ExtractedDATA_DIR, f'AutoGluon_{maxTime}_OOF.csv')\n",
    "df_AG_yTrnVal_OOF.to_csv(AG_OOF_file, index=False)\n",
    "\n",
    "AG_Tst_file = os.path.join(ExtractedDATA_DIR, f'AutoGluon_{maxTime}_Tst.csv')\n",
    "df_AG_yTst.to_csv(AG_Tst_file, index=False)\n",
    "\n",
    "print(f\"AutoGluon training completed!\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}