{ "cells": [ { "cell_type": "markdown", "id": "246a0753d9ced63", "metadata": {}, "source": [ "# Step #2\n", "\n", "## Generating Out-of-Fold (OOF) Target Values using AutoGluon" ] }, { "cell_type": "markdown", "id": "db7195c7ba8aaa7", "metadata": {}, "source": [ "**Last update: August 14, 2025**\n", "\n", "AI Assistance: Claude.AI (Anthropic) is used for documentation, code restructuring, and performance optimization" ] }, { "cell_type": "markdown", "id": "a04cdc53525eca0f", "metadata": {}, "source": [ "**Copyright (C) 2025 Sukanta Basu**" ] }, { "cell_type": "markdown", "id": "f88e04044878be57", "metadata": {}, "source": [ "This program is free software: you can redistribute it and/or modify\n", "it under the terms of the GNU General Public License as published by\n", "the Free Software Foundation, either version 3 of the License, or\n", "(at your option) any later version.\n", "\n", "This program is distributed in the hope that it will be useful,\n", "but WITHOUT ANY WARRANTY; without even the implied warranty of\n", "MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n", "GNU General Public License for more details.\n", "\n", "You should have received a copy of the GNU General Public License\n", "along with this program. If not, see <https://www.gnu.org/licenses/>." ] }, { "cell_type": "markdown", "id": "cf8f97e0bd35df0b", "metadata": {}, "source": [ "**Overall Strategy**" ] }, { "cell_type": "markdown", "id": "2b56e99773f9f8f2", "metadata": {}, "source": [ "Step 1: Preprocess and engineer new features. \n", "\n", "**Step 2: Use AutoGluon to generate OOF predictions for each target separately.\n", "These predictions will be used as additional input features in steps 3 and 4.**\n", "\n", "Step 3: Train the RealMLP model with processed input (step 1) + ten\n", "AutoGluon-OOFs (step 2). 
These additional features help the model capture the correlation\n", "among targets.\n", "\n", "Step 4: Same as step 3, but with the TabPFN model.\n", "\n", "Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4)." ] }, { "cell_type": "markdown", "id": "1a96ba877e53bc1b", "metadata": {}, "source": [ "**Imports**" ] }, { "cell_type": "code", "execution_count": null, "id": "5690a0094408a5bf", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import os\n", "import random\n", "import warnings\n", "\n", "from autogluon.tabular import TabularPredictor" ] }, { "cell_type": "markdown", "id": "9b8c651ce745d65b", "metadata": {}, "source": [ "**Set Random Seeds**" ] }, { "cell_type": "code", "execution_count": null, "id": "cd7eebb603f39cc1", "metadata": {}, "outputs": [], "source": [ "random.seed(7)\n", "np.random.seed(7)" ] }, { "cell_type": "markdown", "id": "7ef87b39e71027c1", "metadata": {}, "source": [ "**User Input**" ] }, { "cell_type": "code", "execution_count": null, "id": "b42562ac873ae2f7", "metadata": {}, "outputs": [], "source": [ "# AutoGluon quality preset\n", "quality_preset = 'best_quality'\n", "\n", "# AutoGluon training time limit per target (in seconds)\n", "maxTime = 21600\n", "\n", "# Number of input features\n", "nFeatures = 65\n", "\n", "# Number of target variables\n", "nTargets = 10" ] }, { "cell_type": "markdown", "id": "a0e955b8d1340774", "metadata": {}, "source": [ "**Input & Output Directories**" ] }, { "cell_type": "code", "execution_count": null, "id": "9ef35bbf7fb66929", "metadata": {}, "outputs": [], "source": [ "ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'\n", "DATA_DIR = ROOT_DIR + 'DATA/'\n", "ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'\n", "Tuning_DIR = ROOT_DIR + 'Models/AutoGluon-OOF/'\n", "\n", "# Create directory if it doesn't exist\n", "os.makedirs(Tuning_DIR, exist_ok=True)" ] }, { "cell_type": "markdown", "id": "4afbfc41d9bc6897", "metadata": {}, "source": [ 
"**Load Processed Training and Testing Data**" ] }, { "cell_type": "code", "execution_count": null, "id": "62de12294936981f", "metadata": {}, "outputs": [], "source": [ "df_XyTrnVal_org = pd.read_csv(ExtractedDATA_DIR + 'train_processed.csv')\n", "nSamples_TrnVal = df_XyTrnVal_org.shape[0]\n", "\n", "df_XTst = pd.read_csv(ExtractedDATA_DIR + 'test_processed.csv')\n", "nSamples_Tst = df_XTst.shape[0]\n", "\n", "print(f\"Training data shape: {df_XyTrnVal_org.shape}\")\n", "print(f\"Test data shape: {df_XTst.shape}\")\n", "\n", "# Extract input features\n", "XTrnVal = df_XyTrnVal_org.iloc[:, 0:nFeatures]" ] }, { "cell_type": "markdown", "id": "8f7687daaaf7d237", "metadata": {}, "source": [ "**Iterative Single-target Training using AutoGluon**" ] }, { "cell_type": "code", "execution_count": null, "id": "d51a4003a1b94737", "metadata": {}, "outputs": [], "source": [ "# Initialize predictions array\n", "yTrnVal_OOF = np.zeros((nSamples_TrnVal, nTargets))\n", "yTst = np.zeros((nSamples_Tst, nTargets))\n", "\n", "for target in range(nTargets):\n", "    print(f\"\\n--- Target {target + 1}/{nTargets} ---\")\n", "\n", "    # Extract single target from possible nTargets\n", "    yTrnVal = df_XyTrnVal_org.iloc[:, nFeatures + target]\n", "\n", "    # Create training dataframe with features and target\n", "    train_data = XTrnVal.copy()\n", "    train_data[f'target_{target}'] = yTrnVal\n", "\n", "    # Create unique file path for each target\n", "    target_path = os.path.join(Tuning_DIR, f'target_{target + 1}')\n", "    os.makedirs(target_path, exist_ok=True)\n", "\n", "    # Initialize TabularPredictor from AutoGluon\n", "    predictor = TabularPredictor(\n", "        label=f'target_{target}',\n", "        path=target_path,\n", "        eval_metric='mean_absolute_percentage_error',\n", "        problem_type='regression'\n", "    )\n", "\n", "    # Train the model\n", "    print(\"Starting AutoGluon training...\")\n", "    predictor.fit(\n", "        train_data,\n", "        time_limit=maxTime,\n", "        presets=quality_preset,\n", "        verbosity=2,\n", 
"        auto_stack=False,\n", "        dynamic_stacking=False,\n", "        num_bag_folds=8,\n", "        num_bag_sets=5,\n", "        num_stack_levels=2,\n", "        use_bag_holdout=False,\n", "        fit_strategy=\"sequential\",\n", "        ag_args_ensemble={'fold_fitting_strategy': \"parallel_local\"},\n", "        ds_args={'enable_ray_logging': False}\n", "    )\n", "\n", "    print(\"\\nModel Leaderboard:\")\n", "    leaderboard = predictor.leaderboard(silent=True)\n", "    print(leaderboard.sort_values(\"score_val\", ascending=False).head())\n", "\n", "    # OOF predictions based on training set\n", "    yTrnVal_OOF[:, target] = predictor.predict_oof()\n", "\n", "    # Make predictions on test set\n", "    yTst[:, target] = predictor.predict(df_XTst)\n", "    print(f\"Test predictions generated for target {target + 1}\")\n", "\n", "    # Clean up predictor to free memory\n", "    del predictor" ] }, { "cell_type": "markdown", "id": "d0ccfa2e9369bfc5", "metadata": {}, "source": [ "**Save Results**" ] }, { "cell_type": "code", "execution_count": null, "id": "initial_id", "metadata": {}, "outputs": [], "source": [ "print(\"\\n=== SAVING RESULTS ===\")\n", "\n", "# Create dataframes\n", "df_AG_yTrnVal_OOF = pd.DataFrame()\n", "df_AG_yTst = pd.DataFrame()\n", "\n", "# Add prediction columns\n", "for i in range(nTargets):\n", "    df_AG_yTrnVal_OOF[f'AG-BlendProperty{i + 1}'] = yTrnVal_OOF[:, i]\n", "    df_AG_yTst[f'AG-BlendProperty{i + 1}'] = yTst[:, i]\n", "\n", "# Save predictions\n", "AG_OOF_file = os.path.join(ExtractedDATA_DIR, f'AutoGluon_{maxTime}_OOF.csv')\n", "df_AG_yTrnVal_OOF.to_csv(AG_OOF_file, index=False)\n", "\n", "AG_Tst_file = os.path.join(ExtractedDATA_DIR, f'AutoGluon_{maxTime}_Tst.csv')\n", "df_AG_yTst.to_csv(AG_Tst_file, index=False)\n", "\n", "print(\"AutoGluon training completed!\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": 
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }