Troubleshooting

This guide provides solutions to common problems encountered when using the TSOC Data Analysis package.

Common Error Messages

File Not Found Errors

Error: FileNotFoundError: [Errno 2] No such file or directory: 'raw_data/substation_active_power.xlsx'

Cause: Excel files are missing or in the wrong location.

Solution:

  1. Check file structure:

    ls -la raw_data/
    # Should show:
    # substation_active_power.xlsx
    # substation_reactive_power.xlsx
    # wind_farm_active_power.xlsx
    # shunt_element_reactive_power.xlsx
    # generator_voltage_setpoints.xlsx
    # generator_reactive_power.xlsx
    
  2. Verify file names match configuration:

    from tsoc_data_analysis.system_configuration import FILES
    
    for data_type, filename in FILES.items():
        print(f"{data_type}: {filename}")
    
  3. Check data directory path:

    # Use absolute path or correct relative path
    success, df = execute(month='2024-01', data_dir='/full/path/to/raw_data')
    
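To run all three checks at once, the short sketch below verifies that every file named in FILES exists in the data directory before any analysis starts (it assumes the configured file names are relative to data_dir):

    import os
    from tsoc_data_analysis.system_configuration import FILES

    data_dir = 'raw_data'
    for data_type, filename in FILES.items():
        path = os.path.join(data_dir, filename)
        status = 'OK' if os.path.isfile(path) else 'MISSING'
        print(f"{status}: {data_type} -> {path}")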

Data Quality Issues

Error: ValueError: Data contains too many missing values

Cause: Excel files have excessive missing data or incorrect structure.

Solution:

  1. Check Excel file structure:

    import pandas as pd
    
    # Load Excel file and check structure
    df = pd.read_excel('raw_data/substation_active_power.xlsx')
    print(f"Shape: {df.shape}")
    print(f"Missing values: {df.isnull().sum().sum()}")
    print(f"Columns: {list(df.columns)}")
    
  2. Verify data starts at correct row:

    # With header=None, Excel row N is df.iloc[N - 1];
    # data should begin at Excel row 6 (df.iloc[5])
    df = pd.read_excel('raw_data/substation_active_power.xlsx', header=None)
    print(f"Excel row 5 (header area): {df.iloc[4, :5]}")
    print(f"Excel row 6 (first data row): {df.iloc[5, :5]}")
    
  3. Check column naming:

    # Verify column names follow the expected pattern. Run this on the
    # DataFrame you intend to analyse: step 2 rebound df using header=None,
    # whose columns are integers rather than named measurement columns.
    expected_prefix = 'ss_mw_'
    matching_cols = [col for col in df.columns if col.startswith(expected_prefix)]
    print(f"Found {len(matching_cols)} columns with prefix '{expected_prefix}'")
    
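To quantify how severe the gaps are, report the missing-value fraction per column. A minimal sketch, assuming df is the DataFrame loaded above; the 20% threshold is illustrative, not a package setting:

    # Report columns whose missing-value fraction exceeds a threshold
    threshold = 0.20  # illustrative value, not a package default
    missing_frac = df.isnull().mean().sort_values(ascending=False)
    bad_cols = missing_frac[missing_frac > threshold]
    print(f"{len(bad_cols)} columns exceed {threshold:.0%} missing values")
    print(bad_cols.head(10))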

Performance Issues

Slow Clustering for Large Datasets

Problem: Clustering takes too long for large datasets.

Solutions:

  1. Reduce dataset size:

    # Use sampling for large datasets
    from tsoc_data_analysis import extract_representative_ops
    
    # Sample data for faster clustering (cap n at the available rows)
    sample_df = df.sample(n=min(10000, len(df)), random_state=42)
    
    rep_df, diagnostics = extract_representative_ops(
        sample_df,
        max_power=850,
        MAPGL=200
    )
    
  2. Adjust clustering parameters:

    # Use fewer clusters for faster processing
    rep_df, diagnostics = extract_representative_ops(
        df,
        max_power=850,
        MAPGL=200,
        k_max=5,  # Reduce from default 10
        random_state=42
    )
    
  3. Use parallel processing:

    # Enable parallel processing if available
    from joblib import parallel_backend
    
    with parallel_backend('threading', n_jobs=4):
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )
    
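To confirm which of these changes actually helps, time the clustering call before and after adjusting the parameters. A minimal sketch using only the standard library:

    import time

    start = time.perf_counter()
    rep_df, diagnostics = extract_representative_ops(
        df,
        max_power=850,
        MAPGL=200
    )
    print(f"Clustering took {time.perf_counter() - start:.1f} s")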

Memory Issues

Problem: Out of memory errors when processing large datasets.

Solutions:

  1. Process data in chunks:

    # Process data month by month
    months = ['2024-01', '2024-02', '2024-03']
    results = {}
    
    for month in months:
        print(f"Processing {month}...")
        success, df = execute(month=month, data_dir='raw_data')
        if success:
            results[month] = df
            # Note: del only removes the local name; the DataFrame is
            # still referenced by results[month]. To truly free memory,
            # store summaries in results rather than full DataFrames.
            del df
    
  2. Reduce memory usage:

    # Use smaller data types
    import pandas as pd
    
    # Convert to smaller data types (example column names; a generic
    # variant covering every float column follows this list)
    df = df.astype({
        'ss_mw_SUBSTATION1': 'float32',
        'wind_mw_FARM1': 'float32'
    })
    
  3. Monitor memory usage:

    import psutil
    
    def check_memory():
        memory = psutil.virtual_memory()
        print(f"Memory usage: {memory.percent}%")
        return memory.percent < 90  # False once usage reaches 90%
    
    # Check before processing
    if check_memory():
        # Proceed with processing
        pass
    
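As noted in step 2, the column names there are examples. A generic variant downcasts every float column in one pass and reports the savings; this sketch assumes your measurements tolerate float32 precision:

    # Downcast all float64 columns and report memory before/after
    before = df.memory_usage(deep=True).sum() / 1e6
    float_cols = df.select_dtypes(include='float64').columns
    df[float_cols] = df[float_cols].astype('float32')
    after = df.memory_usage(deep=True).sum() / 1e6
    print(f"Memory: {before:.1f} MB -> {after:.1f} MB")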

Configuration Problems

Invalid Configuration Settings

Problem: Configuration errors or invalid parameter values.

Solutions:

  1. Validate configuration:

    from tsoc_data_analysis.system_configuration import (
        FILES, COLUMN_PREFIXES, DATA_VALIDATION, REPRESENTATIVE_OPS
    )
    
    # Check file mappings
    for data_type, filename in FILES.items():
        if not filename.endswith('.xlsx'):
            print(f"Warning: {data_type} file should end with .xlsx")
    
    # Check column prefixes
    for data_type, prefix in COLUMN_PREFIXES.items():
        if not prefix.endswith('_'):
            print(f"Warning: {data_type} prefix should end with '_'")
    
  2. Reset to defaults:

    # Reset clustering parameters to defaults
    REPRESENTATIVE_OPS['defaults']['k_max'] = 10
    REPRESENTATIVE_OPS['defaults']['random_state'] = 42
    REPRESENTATIVE_OPS['quality_thresholds']['min_silhouette'] = 0.25
    
  3. Check parameter ranges:

    # Validate parameter ranges
    if REPRESENTATIVE_OPS['defaults']['k_max'] < 2:
        print("Error: k_max must be at least 2")
    
    if DATA_VALIDATION['gap_filling']['max_gap_steps'] < 1:
        print("Error: max_gap_steps must be at least 1")
    
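The three checks above can be bundled into one helper that collects every problem instead of stopping at the first. A minimal sketch over the configuration objects imported in step 1:

    def validate_configuration():
        """Collect configuration problems rather than failing on the first."""
        problems = []
        for data_type, filename in FILES.items():
            if not filename.endswith('.xlsx'):
                problems.append(f"{data_type}: file should end with .xlsx")
        for data_type, prefix in COLUMN_PREFIXES.items():
            if not prefix.endswith('_'):
                problems.append(f"{data_type}: prefix should end with '_'")
        if REPRESENTATIVE_OPS['defaults']['k_max'] < 2:
            problems.append("k_max must be at least 2")
        if DATA_VALIDATION['gap_filling']['max_gap_steps'] < 1:
            problems.append("max_gap_steps must be at least 1")
        return problems

    for problem in validate_configuration():
        print(f"Config problem: {problem}")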

Missing Dependencies

Problem: Import errors or missing packages.

Solutions:

  1. Install missing dependencies:

    pip install pandas numpy matplotlib seaborn openpyxl scikit-learn scipy psutil joblib
    
  2. Check package versions:

    import pandas as pd
    import numpy as np
    import matplotlib
    import seaborn
    import openpyxl
    import sklearn
    
    print(f"pandas: {pd.__version__}")
    print(f"numpy: {np.__version__}")
    print(f"matplotlib: {matplotlib.__version__}")
    print(f"seaborn: {seaborn.__version__}")
    print(f"openpyxl: {openpyxl.__version__}")
    print(f"scikit-learn: {sklearn.__version__}")
    
  3. Install development dependencies:

    pip install -e ".[dev]"
    
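To check every dependency in one pass, including packages that are missing entirely, importlib.metadata avoids importing each module. A minimal sketch over the dependency list from step 1:

    from importlib.metadata import PackageNotFoundError, version

    required = ['pandas', 'numpy', 'matplotlib', 'seaborn',
                'openpyxl', 'scikit-learn', 'scipy', 'psutil', 'joblib']
    for pkg in required:
        try:
            print(f"{pkg}: {version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: MISSING (pip install {pkg})")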

Visualization Issues

Problem: Plotting errors or missing plots.

Solutions:

  1. Check matplotlib backend:

    import matplotlib
    print(f"Backend: {matplotlib.get_backend()}")
    
    # Set backend if needed
    matplotlib.use('Agg')  # For non-interactive environments
    
  2. Create output directory:

    import os
    
    # Ensure output directory exists
    output_dir = 'results'
    os.makedirs(output_dir, exist_ok=True)
    
  3. Check file permissions:

    # Check if directory is writable
    import os
    
    if os.access('results', os.W_OK):
        print("Directory is writable")
    else:
        print("Directory is not writable")
    
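A quick smoke test ties the three checks together: render a trivial figure with the non-interactive backend and confirm the file appears in the output directory. A minimal sketch:

    import os

    import matplotlib
    matplotlib.use('Agg')  # safe for headless environments
    import matplotlib.pyplot as plt

    os.makedirs('results', exist_ok=True)
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    out_path = os.path.join('results', 'smoke_test.png')
    fig.savefig(out_path)
    plt.close(fig)
    print(f"Wrote {out_path}: {os.path.isfile(out_path)}")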

Parallel Processing Issues

Problem: Parallel processing errors or performance issues.

Solutions:

  1. Disable parallel processing:

    # Use single-threaded processing
    from joblib import parallel_backend
    
    with parallel_backend('sequential'):
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )
    
  2. Adjust number of jobs:

    # Use fewer parallel jobs
    from joblib import parallel_backend
    
    with parallel_backend('threading', n_jobs=2):
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )
    
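If you are unsure how many jobs the machine can sustain, derive n_jobs from the available cores instead of hard-coding it. A minimal sketch:

    import os
    from joblib import parallel_backend

    # Leave one core free for the OS; never drop below one job
    n_jobs = max(1, (os.cpu_count() or 2) - 1)

    with parallel_backend('threading', n_jobs=n_jobs):
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )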

Data Format Issues

Excel File Structure Problems

Problem: Excel files have incorrect structure or format.

Solutions:

  1. Check Excel file format:

    import pandas as pd
    
    # Check if file can be read
    try:
        df = pd.read_excel('raw_data/substation_active_power.xlsx')
        print("File can be read successfully")
    except Exception as e:
        print(f"Error reading file: {e}")
    
  2. Verify data structure:

    # Check expected structure
    df = pd.read_excel('raw_data/substation_active_power.xlsx', header=None)
    
    # Check timestamp column (column C, row 6+)
    timestamps = df.iloc[5:, 2]  # column C is index 2
    print(f"Timestamp range: {timestamps.min()} to {timestamps.max()}")
    
    # Check substation names (row 2)
    substation_names = df.iloc[1, 6:]  # Row 2, starting from column G
    print(f"Substation names: {list(substation_names)}")
    
  3. Fix common structure issues:

    # If timestamps are in the wrong column: empty Excel cells are read
    # as NaN, so test with pd.isna rather than `is None`
    if pd.isna(df.iloc[5, 2]):  # Column C is empty
        # Check other columns for timestamps
        for col in range(df.shape[1]):
            if pd.notna(df.iloc[5, col]):
                print(f"Timestamps found in column {col}")
    

Data Type Issues

Problem: Data type conversion errors or incorrect data types.

Solutions:

  1. Check data types:

    # Check column data types
    for col in df.columns:
        if col.startswith('ss_mw_'):
            print(f"{col}: {df[col].dtype}")
            print(f"  Sample values: {df[col].head()}")
    
  2. Convert data types:

    # Convert to numeric types
    for col in df.columns:
        if col.startswith('ss_mw_'):
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
  3. Handle non-numeric values:

    # Find and handle non-numeric values
    for col in df.columns:
        if col.startswith('ss_mw_'):
            non_numeric = pd.to_numeric(df[col], errors='coerce').isna()
            if non_numeric.any():
                print(f"Non-numeric values in {col}: {df[col][non_numeric].unique()}")
    
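Steps 2 and 3 can be combined into a single pass that coerces each measurement column and reports exactly what was lost. A minimal sketch over the ss_mw_ prefix; the same pattern applies to wind_mw_ and the other prefixes:

    import pandas as pd

    for col in [c for c in df.columns if c.startswith('ss_mw_')]:
        converted = pd.to_numeric(df[col], errors='coerce')
        # Values that were readable but non-numeric become NaN here
        n_lost = converted.isna().sum() - df[col].isna().sum()
        if n_lost > 0:
            print(f"{col}: {n_lost} non-numeric values coerced to NaN")
        df[col] = converted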

Debugging Techniques

Enable Verbose Mode

Solution: Use verbose mode for detailed output.

# Enable verbose mode in CLI
tsoc-analyze 2024-01 --verbose

# Enable verbose mode in Python
success, df = execute(
    month='2024-01',
    data_dir='raw_data',
    output_dir='results',
    verbose=True
)

Log Analysis

Solution: Check log files for detailed error information.

import logging

# Set up logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('tsoc_analysis.log'),
        logging.StreamHandler()
    ]
)

# Run analysis with logging
success, df = execute(month='2024-01', data_dir='raw_data')

Step-by-Step Debugging

Solution: Debug each step individually.

# Wrap the steps in a function so the early exits via `return` are valid
def debug_pipeline():
    # Step 1: Check data loading
    try:
        df = loadallpowerdf('2024-01', data_dir='raw_data')
        print(f"Data loaded: {df.shape}")
    except Exception as e:
        print(f"Data loading error: {e}")
        return

    # Step 2: Check data validation
    try:
        validator = DataValidator(df)
        validation_results = validator.validate_data()
        print(f"Validation completed: {validation_results['valid_records']} valid records")
    except Exception as e:
        print(f"Validation error: {e}")
        return

    # Step 3: Check clustering
    try:
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )
        print(f"Clustering completed: {len(rep_df)} clusters")
    except Exception as e:
        print(f"Clustering error: {e}")

debug_pipeline()

Getting Help

Additional Resources:

  1. Check the documentation for detailed API reference and examples

  2. Review error messages carefully for specific issue details

  3. Test with sample data to isolate the problem (see the sketch after this list)

  4. Check system requirements and dependencies

  5. Contact support at info@sps-lab.org for persistent issues
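
For item 3, a small synthetic DataFrame is often enough to separate package problems from data problems. This is a sketch only: the TEST column names and the 15-minute timestamp spacing are assumptions, so adapt the prefixes and frequency to your configuration:

    import numpy as np
    import pandas as pd

    # Assumed structure: datetime index plus prefix-named numeric columns
    rng = np.random.default_rng(42)
    index = pd.date_range('2024-01-01', periods=500, freq='15min')
    sample_df = pd.DataFrame({
        'ss_mw_TEST1': rng.uniform(0, 100, size=len(index)),
        'ss_mw_TEST2': rng.uniform(0, 100, size=len(index)),
        'wind_mw_TESTFARM': rng.uniform(0, 50, size=len(index)),
    }, index=index)

    # If clustering fails on this too, the problem is not in your data
    rep_df, diagnostics = extract_representative_ops(
        sample_df,
        max_power=850,
        MAPGL=200
    )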

Common Debugging Checklist:

  • [ ] All required Excel files are present in the data directory

  • [ ] File names match the configuration in system_configuration.py

  • [ ] Excel files have the correct structure (timestamps in column C, data starting at row 6)

  • [ ] Column names follow the expected prefix patterns (ss_mw_*, wind_mw_*, etc.)

  • [ ] Data types are numeric (no text or mixed types)

  • [ ] Sufficient memory is available for the dataset size

  • [ ] All required Python packages are installed with compatible versions