{ "cells": [ { "cell_type": "markdown", "id": "852ef70cb8ecc72b", "metadata": {}, "source": [ "# Step #1\n", "\n", "## Preprocessing and Feature Engineering" ] }, { "cell_type": "markdown", "id": "a702638d91fdbf52", "metadata": {}, "source": [ "**Last update: August 14, 2025**\n", "\n", "AI Assistance: Claude.AI (Anthropic) was used for documentation, code restructuring, and performance optimization." ] }, { "cell_type": "markdown", "id": "1bd3369630baee5f", "metadata": {}, "source": [ "**Copyright (C) 2025 Sukanta Basu**" ] }, { "cell_type": "markdown", "id": "1416a6ae4710a657", "metadata": {}, "source": [ "This program is free software: you can redistribute it and/or modify\n", "it under the terms of the GNU General Public License as published by\n", "the Free Software Foundation, either version 3 of the License, or\n", "(at your option) any later version.\n", "\n", "This program is distributed in the hope that it will be useful,\n", "but WITHOUT ANY WARRANTY; without even the implied warranty of\n", "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n", "GNU General Public License for more details.\n", "\n", "You should have received a copy of the GNU General Public License\n", "along with this program. If not, see ." ] }, { "cell_type": "markdown", "id": "d73a2a6544e84865", "metadata": {}, "source": [ "**Overall Strategy**" ] }, { "cell_type": "markdown", "id": "1fa7227c3134490", "metadata": {}, "source": [ "**Step 1: Preprocess and engineer new features.**\n", "\n", "Step 2: Use AutoGluon to generate out-of-fold (OOF) predictions for each target separately.\n", "These predictions are used as additional input features in steps 3 and 4.\n", "\n", "Step 3: Train the RealMLP model on the processed inputs (step 1) plus the ten\n", "AutoGluon OOF predictions (step 2). 
These additional features help the model capture correlations\n", "among the targets.\n", "\n", "Step 4: Same as step 3, but with the TabPFN model.\n", "\n", "Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4)." ] }, { "cell_type": "markdown", "id": "657d4e0d4ac92187", "metadata": {}, "source": [ "**Imports**" ] }, { "cell_type": "code", "execution_count": null, "id": "3dd86f72e702f7f1", "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "a4f5ff788d12758", "metadata": {}, "source": [ "**Set Random Seeds**" ] }, { "cell_type": "code", "execution_count": null, "id": "cdeb9ba693025ed4", "metadata": {}, "outputs": [], "source": [ "random.seed(7)\n", "np.random.seed(7)" ] }, { "cell_type": "markdown", "id": "6043c7f749d38155", "metadata": {}, "source": [ "**Input & Output Directories**" ] }, { "cell_type": "code", "execution_count": null, "id": "initial_id", "metadata": {}, "outputs": [], "source": [ "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n", "DATA_DIR = ROOT_DIR + 'DATA/'\n", "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'" ] }, { "cell_type": "markdown", "id": "104d275ebcec063", "metadata": {}, "source": [ "**Load Training and Testing Data Provided by the Organizers**" ] }, { "cell_type": "code", "execution_count": null, "id": "3f89bdec75fdcbb8", "metadata": {}, "outputs": [], "source": [ "df_XyTrnVal_org = pd.read_csv(DATA_DIR + 'train.csv')\n", "df_XTst_org = pd.read_csv(DATA_DIR + 'test.csv')" ] }, { "cell_type": "markdown", "id": "6e165a4c82d2edfe", "metadata": {}, "source": [ "**Feature Engineering**" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "ef1de08836867b2a", "metadata": {}, "outputs": [], "source": [ "# Create empty data frames\n", "df_XyTrnVal_mod = pd.DataFrame()\n", "df_XTst_mod = pd.DataFrame()\n", "\n", "# Add component fractions (5 components)\n", "for comp in range(1, 6):\n", "    fraction_col = f'Component{comp}_fraction'\n", "    df_XyTrnVal_mod[fraction_col] = df_XyTrnVal_org[fraction_col]\n", "    df_XTst_mod[fraction_col] = df_XTst_org[fraction_col]\n", "\n", "# Create volume-fraction-weighted input features:\n", "# each component's fraction times each of its 10 properties\n", "for prop in range(1, 11):\n", "    for comp in range(1, 6):\n", "        fraction_col = f'Component{comp}_fraction'\n", "        property_col = f'Component{comp}_Property{prop}'\n", "        contribution_col = f'Component{comp}_Contribution_Property{prop}'\n", "        df_XyTrnVal_mod[contribution_col] = (df_XyTrnVal_org[fraction_col] *\n", "                                             df_XyTrnVal_org[property_col])\n", "        df_XTst_mod[contribution_col] = (df_XTst_org[fraction_col] *\n", "                                         df_XTst_org[property_col])\n", "\n", "# Create weighted-average input features:\n", "# sum of fraction * property over the 5 components, per property\n", "for prop in range(1, 11):\n", "    df_XyTrnVal_mod[f'WeightedAvg_Property{prop}'] = (\n", "        sum(df_XyTrnVal_org[f'Component{comp}_fraction'] *\n", "            df_XyTrnVal_org[f'Component{comp}_Property{prop}']\n", "            for comp in range(1, 6)))\n", "    df_XTst_mod[f'WeightedAvg_Property{prop}'] = (\n", "        sum(df_XTst_org[f'Component{comp}_fraction'] *\n", "            df_XTst_org[f'Component{comp}_Property{prop}']\n", "            for comp in range(1, 6)))\n", "\n", "# Add targets (training data only)\n", "for target in range(1, 11):\n", "    target_col = f'BlendProperty{target}'\n", "    df_XyTrnVal_mod[target_col] = df_XyTrnVal_org[target_col]" ] }, { "cell_type": "markdown", "id": "859652ade27b244b", "metadata": {}, "source": [ "**Save Processed Data**" ] }, { "cell_type": "code", "execution_count": null, "id": "df0ba8a8ba8837d", "metadata": {}, "outputs": [], "source": [ "df_XyTrnVal_mod.to_csv(ExtractedDATA_DIR + 'train_processed.csv', index=False)\n", "df_XTst_mod.to_csv(ExtractedDATA_DIR + 'test_processed.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": 
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }