{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "852ef70cb8ecc72b",
   "metadata": {},
   "source": [
    "# Step #1\n",
    "\n",
    "## Preprocessing and Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a702638d91fdbf52",
   "metadata": {},
   "source": [
    "**Last update: August 14, 2025**\n",
    "\n",
    "AI Assistance: Claude.AI (Anthropic) was used for documentation, code restructuring, and performance optimization."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bd3369630baee5f",
   "metadata": {},
   "source": [
    "**Copyright (C) 2025 Sukanta Basu**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1416a6ae4710a657",
   "metadata": {},
   "source": [
    "This program is free software: you can redistribute it and/or modify\n",
    "it under the terms of the GNU General Public License as published by\n",
    "the Free Software Foundation, either version 3 of the License, or\n",
    "(at your option) any later version.\n",
    "\n",
    "This program is distributed in the hope that it will be useful,\n",
    "but WITHOUT ANY WARRANTY; without even the implied warranty of\n",
    "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n",
    "GNU General Public License for more details.\n",
    "\n",
    "You should have received a copy of the GNU General Public License\n",
    "along with this program.  If not, see <https://www.gnu.org/licenses/>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d73a2a6544e84865",
   "metadata": {},
   "source": [
    "**Overall Strategy**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fa7227c3134490",
   "metadata": {},
   "source": [
    "**Step 1: Preprocess and engineer new features.**\n",
    "\n",
    "Step 2: Use AutoGluon to generate out-of-fold (OOF) predictions for each of the\n",
    "ten targets separately. These predictions serve as additional input features\n",
    "in steps 3 and 4.\n",
    "\n",
    "Step 3: Train the RealMLP model on the processed inputs (step 1) plus the ten\n",
    "AutoGluon OOF predictions (step 2). The OOF features let the model exploit\n",
    "correlations among the targets.\n",
    "\n",
    "Step 4: Repeat step 3 with the TabPFN model in place of RealMLP.\n",
    "\n",
    "Step 5: Combine the predictions from RealMLP (step 3) and TabPFN (step 4)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "657d4e0d4ac92187",
   "metadata": {},
   "source": [
    "**Imports**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dd86f72e702f7f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4f5ff788d12758",
   "metadata": {},
   "source": [
    "**Set Random Seeds**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdeb9ba693025ed4",
   "metadata": {},
   "outputs": [],
   "source": [
    "random.seed(7)\n",
    "np.random.seed(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6043c7f749d38155",
   "metadata": {},
   "source": [
    "**Input & Output Directories**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {},
   "outputs": [],
   "source": [
    "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n",
    "DATA_DIR = ROOT_DIR + 'DATA/'\n",
    "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "104d275ebcec063",
   "metadata": {},
   "source": [
    "**Load Training and Testing Data Provided by the Organizers**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f89bdec75fdcbb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XyTrnVal_org = pd.read_csv(DATA_DIR + 'train.csv')\n",
    "df_XTst_org = pd.read_csv(DATA_DIR + 'test.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e165a4c82d2edfe",
   "metadata": {},
   "source": [
    "**Feature Engineering**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef1de08836867b2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create empty data frames; columns are added in the order they are saved\n",
    "df_XyTrnVal_mod = pd.DataFrame()\n",
    "df_XTst_mod = pd.DataFrame()\n",
    "\n",
    "# Copy the five component fractions unchanged\n",
    "for comp in range(1, 6):\n",
    "    df_XyTrnVal_mod[f'Component{comp}_fraction'] = (\n",
    "        df_XyTrnVal_org[f'Component{comp}_fraction'])\n",
    "    df_XTst_mod[f'Component{comp}_fraction'] = (\n",
    "        df_XTst_org[f'Component{comp}_fraction'])\n",
    "\n",
    "# Create volume fraction-weighted input features:\n",
    "# contribution = fraction * property, for each of the 10 properties\n",
    "# and each of the 5 components\n",
    "for prop in range(1, 11):\n",
    "    for comp in range(1, 6):\n",
    "        fraction_col = f'Component{comp}_fraction'\n",
    "        property_col = f'Component{comp}_Property{prop}'\n",
    "        contribution_col = f'Component{comp}_Contribution_Property{prop}'\n",
    "        df_XyTrnVal_mod[contribution_col] = (df_XyTrnVal_org[fraction_col] *\n",
    "                                             df_XyTrnVal_org[property_col])\n",
    "        df_XTst_mod[contribution_col] = (df_XTst_org[fraction_col] *\n",
    "                                         df_XTst_org[property_col])\n",
    "\n",
    "# Create weighted-averaged input features: for each property, sum the\n",
    "# fraction * property products over the five components\n",
    "for prop in range(1, 11):\n",
    "    df_XyTrnVal_mod[f'WeightedAvg_Property{prop}'] = (\n",
    "        sum(df_XyTrnVal_org[f'Component{comp}_fraction'] *\n",
    "            df_XyTrnVal_org[f'Component{comp}_Property{prop}']\n",
    "            for comp in range(1, 6)))\n",
    "    df_XTst_mod[f'WeightedAvg_Property{prop}'] = (\n",
    "        sum(df_XTst_org[f'Component{comp}_fraction'] *\n",
    "            df_XTst_org[f'Component{comp}_Property{prop}']\n",
    "            for comp in range(1, 6)))\n",
    "\n",
    "# Add the ten targets (training data only)\n",
    "for target in range(1, 11):\n",
    "    target_col = f'BlendProperty{target}'\n",
    "    df_XyTrnVal_mod[target_col] = df_XyTrnVal_org[target_col]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "859652ade27b244b",
   "metadata": {},
   "source": [
    "**Save Processed Data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df0ba8a8ba8837d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_XyTrnVal_mod.to_csv(ExtractedDATA_DIR + 'train_processed.csv', index=False)\n",
    "df_XTst_mod.to_csv(ExtractedDATA_DIR + 'test_processed.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}