Troubleshooting
This guide provides solutions to common problems that may arise when using the TSOC Data Analysis package.
Common Error Messages
File Not Found Errors
Error: FileNotFoundError: [Errno 2] No such file or directory: 'raw_data/substation_active_power.xlsx'
Cause: Excel files are missing or in the wrong location.
Solution:
Check file structure:
ls -la raw_data/
# Should show:
# substation_active_power.xlsx
# substation_reactive_power.xlsx
# wind_farm_active_power.xlsx
# shunt_element_reactive_power.xlsx
# generator_voltage_setpoints.xlsx
# generator_reactive_power.xlsx
Verify file names match configuration:
from tsoc_data_analysis.system_configuration import FILES

for data_type, filename in FILES.items():
    print(f"{data_type}: {filename}")
Check data directory path:
# Use absolute path or correct relative path
success, df = execute(month='2024-01', data_dir='/full/path/to/raw_data')
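The checks above can be combined into one helper that reports which expected files are absent. This is a minimal sketch using only the standard library. `find_missing_files` and `EXPECTED_FILES` are hypothetical names, not part of the package, and the file names are copied from the listing above rather than read from `FILES`.

```python
from pathlib import Path

# File names as listed in this guide (assumption: in a real setup they
# could be taken from tsoc_data_analysis.system_configuration.FILES).
EXPECTED_FILES = [
    "substation_active_power.xlsx",
    "substation_reactive_power.xlsx",
    "wind_farm_active_power.xlsx",
    "shunt_element_reactive_power.xlsx",
    "generator_voltage_setpoints.xlsx",
    "generator_reactive_power.xlsx",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that do not exist in data_dir."""
    base = Path(data_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_files("raw_data")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present")
```

Running this before calling `execute` separates a wrong `data_dir` (every file reported missing) from a single misnamed file.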
Data Quality Issues
Error: ValueError: Data contains too many missing values
Cause: Excel files have excessive missing data or incorrect structure.
Solution:
Check Excel file structure:
import pandas as pd

# Load Excel file and check structure
df = pd.read_excel('raw_data/substation_active_power.xlsx')
print(f"Shape: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Columns: {list(df.columns)}")
Verify data starts at correct row:
# Check if data starts at spreadsheet row 6 (pandas index 5, because rows are 0-indexed)
df = pd.read_excel('raw_data/substation_active_power.xlsx', header=None)
print(f"Row 5 (should be timestamps): {df.iloc[4, :5]}")
print(f"Row 6 (should be data): {df.iloc[5, :5]}")
Check column naming:
# Verify column names follow expected pattern
# str() guards against integer column labels (e.g. after reading with header=None)
expected_prefix = 'ss_mw_'
matching_cols = [col for col in df.columns if str(col).startswith(expected_prefix)]
print(f"Found {len(matching_cols)} columns with prefix '{expected_prefix}'")
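To see which columns drive a "too many missing values" error, it can help to compute the missing fraction per column instead of one overall count. The sketch below assumes pandas is installed. `columns_over_missing_threshold` is a hypothetical helper, and the 0.2 threshold is an illustrative assumption, not the package's actual limit.

```python
import pandas as pd


def columns_over_missing_threshold(df, threshold=0.2):
    """Return {column: missing fraction} for columns above the threshold."""
    fractions = df.isna().mean()  # per-column fraction of missing cells
    flagged = fractions[fractions > threshold]
    return {col: float(frac) for col, frac in flagged.items()}


# Tiny illustrative frame: half of ss_mw_A is missing, ss_mw_B is complete.
demo = pd.DataFrame({
    "ss_mw_A": [1.0, None, None, 4.0],
    "ss_mw_B": [1.0, 2.0, 3.0, 4.0],
})
print(columns_over_missing_threshold(demo))
```

Columns reported here are the first candidates to inspect in the Excel source.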
Performance Issues
Slow Clustering for Large Datasets
Problem: Clustering takes too long for large datasets.
Solutions:
Reduce dataset size:
# Use sampling for large datasets
from tsoc_data_analysis import extract_representative_ops

# Sample data for faster clustering (cap n at the number of available rows,
# since DataFrame.sample raises an error when n exceeds len(df))
sample_df = df.sample(n=min(10000, len(df)), random_state=42)
rep_df, diagnostics = extract_representative_ops(
    sample_df,
    max_power=850,
    MAPGL=200
)
Adjust clustering parameters:
# Use fewer clusters for faster processing
rep_df, diagnostics = extract_representative_ops(
    df,
    max_power=850,
    MAPGL=200,
    k_max=5,  # Reduce from default 10
    random_state=42
)
Use parallel processing:
# Enable parallel processing if available
from joblib import parallel_backend

with parallel_backend('threading', n_jobs=4):
    rep_df, diagnostics = extract_representative_ops(
        df,
        max_power=850,
        MAPGL=200
    )
Memory Issues
Problem: Out of memory errors when processing large datasets.
Solutions:
Process data in chunks:
# Process data month by month
months = ['2024-01', '2024-02', '2024-03']
results = {}
for month in months:
    print(f"Processing {month}...")
    success, df = execute(month=month, data_dir='raw_data')
    if success:
        results[month] = df
    # Clear memory
    del df
Reduce memory usage:
# Use smaller data types
import pandas as pd

# Convert to smaller data types
df = df.astype({
    'ss_mw_SUBSTATION1': 'float32',
    'wind_mw_FARM1': 'float32'
})
Monitor memory usage:
import psutil

def check_memory():
    memory = psutil.virtual_memory()
    print(f"Memory usage: {memory.percent}%")
    return memory.percent < 90  # Warning if > 90%

# Check before processing
if check_memory():
    # Proceed with processing
    pass
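When processing month by month over a longer period, the list of month strings can be generated rather than typed by hand. A minimal standard-library sketch follows. `month_range` is a hypothetical helper, and the `'YYYY-MM'` format is assumed from the `month='2024-01'` argument shown above.

```python
def month_range(start, end):
    """Return 'YYYY-MM' strings from start to end, inclusive."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))

    months = []
    year, month = start_year, start_month
    # Tuple comparison orders by year first, then by month.
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


print(month_range("2023-11", "2024-02"))
```

The returned list can replace the hard-coded `months` list in the chunked loop, so only one month of data is loaded at a time.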
Configuration Problems
Invalid Configuration Settings
Problem: Configuration errors or invalid parameter values.
Solutions:
Validate configuration:
from tsoc_data_analysis.system_configuration import (
    FILES,
    COLUMN_PREFIXES,
    DATA_VALIDATION,
    REPRESENTATIVE_OPS
)

# Check file mappings
for data_type, filename in FILES.items():
    if not filename.endswith('.xlsx'):
        print(f"Warning: {data_type} file should end with .xlsx")

# Check column prefixes
for data_type, prefix in COLUMN_PREFIXES.items():
    if not prefix.endswith('_'):
        print(f"Warning: {data_type} prefix should end with '_'")
Reset to defaults:
# Reset clustering parameters to defaults
REPRESENTATIVE_OPS['defaults']['k_max'] = 10
REPRESENTATIVE_OPS['defaults']['random_state'] = 42
REPRESENTATIVE_OPS['quality_thresholds']['min_silhouette'] = 0.25
Check parameter ranges:
# Validate parameter ranges
if REPRESENTATIVE_OPS['defaults']['k_max'] < 2:
    print("Error: k_max must be at least 2")
if DATA_VALIDATION['gap_filling']['max_gap_steps'] < 1:
    print("Error: max_gap_steps must be at least 1")
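The individual checks above can be collected into a single function that returns every problem at once instead of printing as it goes. This is a sketch under stated assumptions: `validate_config` is a hypothetical helper, it takes plain values rather than importing the package, and the `substation_mw` key in the example is illustrative only.

```python
def validate_config(files, column_prefixes, k_max, max_gap_steps):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for data_type, filename in files.items():
        if not filename.endswith(".xlsx"):
            problems.append(f"{data_type}: file name should end with .xlsx")
    for data_type, prefix in column_prefixes.items():
        if not prefix.endswith("_"):
            problems.append(f"{data_type}: prefix should end with '_'")
    if k_max < 2:
        problems.append("k_max must be at least 2")
    if max_gap_steps < 1:
        problems.append("max_gap_steps must be at least 1")
    return problems


# Illustrative values only; real values would come from system_configuration.
problems = validate_config(
    files={"substation_mw": "substation_active_power.xlsx"},
    column_prefixes={"substation_mw": "ss_mw_"},
    k_max=10,
    max_gap_steps=3,
)
print(problems or "Configuration looks consistent")
```

Returning a list makes it possible to fail fast at startup, or to log all problems together when several settings are wrong.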
Missing Dependencies
Problem: Import errors or missing packages.
Solutions:
Install missing dependencies:
pip install pandas numpy matplotlib seaborn openpyxl scikit-learn scipy psutil joblib
Check package versions:
import pandas as pd
import numpy as np
import matplotlib
import seaborn
import openpyxl
import sklearn

print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"seaborn: {seaborn.__version__}")
print(f"openpyxl: {openpyxl.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
Install development dependencies:
pip install -e ".[dev]"
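The version check above stops at the first package that fails to import. An alternative sketch uses the standard-library `importlib.metadata` module (Python 3.8+) to report installed and missing packages in one pass. `installed_versions` and `REQUIRED` are hypothetical names, and the package list is taken from the `pip install` command above.

```python
from importlib import metadata

# Distribution names as passed to pip (note: "scikit-learn", not "sklearn").
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "openpyxl", "scikit-learn", "scipy", "psutil", "joblib",
]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Because nothing is imported, a package that is installed but broken is still listed, which helps separate "missing" from "fails on import".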
Visualization Issues
Problem: Plotting errors or missing plots.
Solutions:
Check matplotlib backend:
import matplotlib

print(f"Backend: {matplotlib.get_backend()}")

# Set backend if needed
matplotlib.use('Agg')  # For non-interactive environments
Create output directory:
import os

# Ensure output directory exists
output_dir = 'results'
os.makedirs(output_dir, exist_ok=True)
Check file permissions:
# Check if directory is writable
import os

if os.access('results', os.W_OK):
    print("Directory is writable")
else:
    print("Directory is not writable")
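`os.access` reports permission bits, which do not always reflect whether a write will succeed (for example on some network file systems). A more direct sketch creates the directory and then tries to write a temporary file in it. `ensure_writable_dir` is a hypothetical helper built only on the standard library.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create path if needed and confirm a file can be written there."""
    try:
        os.makedirs(path, exist_ok=True)
        # The temporary file is deleted automatically when closed.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if ensure_writable_dir("results"):
        print("Output directory is ready")
    else:
        print("Cannot write to output directory")
```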
Parallel Processing Issues
Problem: Parallel processing errors or performance issues.
Solutions:
Disable parallel processing:
# Use single-threaded processing
from joblib import parallel_backend

with parallel_backend('sequential'):
    rep_df, diagnostics = extract_representative_ops(
        df,
        max_power=850,
        MAPGL=200
    )
Adjust number of jobs:
# Use fewer parallel jobs
from joblib import parallel_backend

with parallel_backend('threading', n_jobs=2):
    rep_df, diagnostics = extract_representative_ops(
        df,
        max_power=850,
        MAPGL=200
    )
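Rather than hard-coding the job count, it can be derived from the number of CPU cores on the machine. The sketch below is one possible heuristic, not a package recommendation. `pick_n_jobs` is a hypothetical helper, and reserving one core for the rest of the system is an assumption.

```python
import os


def pick_n_jobs(max_jobs=4, reserve=1):
    """Choose a job count between 1 and max_jobs, leaving `reserve` cores free."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(max_jobs, available - reserve))


n_jobs = pick_n_jobs()
print(f"Using n_jobs={n_jobs}")
```

The result can be passed as `n_jobs` to `parallel_backend` in the examples above.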
Data Format Issues
Excel File Structure Problems
Problem: Excel files have incorrect structure or format.
Solutions:
Check Excel file format:
import pandas as pd

# Check if file can be read
try:
    df = pd.read_excel('raw_data/substation_active_power.xlsx')
    print("File can be read successfully")
except Exception as e:
    print(f"Error reading file: {e}")
Verify data structure:
# Check expected structure
df = pd.read_excel('raw_data/substation_active_power.xlsx', header=None)

# Check timestamp column (column C, row 6+)
timestamps = df.iloc[5:, 2]  # Column C (0-indexed = 2)
print(f"Timestamp range: {timestamps.min()} to {timestamps.max()}")

# Check substation names (row 2)
substation_names = df.iloc[1, 6:]  # Row 2, starting from column G
print(f"Substation names: {list(substation_names)}")
Fix common structure issues:
# If timestamps are in wrong column
# (pandas reads empty cells as NaN, so test with pd.isna rather than "is None")
if pd.isna(df.iloc[5, 2]):  # Column C is empty
    # Check other columns for timestamps
    for col in range(df.shape[1]):
        if pd.notna(df.iloc[5, col]):
            print(f"Timestamps found in column {col}")
Data Type Issues
Problem: Data type conversion errors or incorrect data types.
Solutions:
Check data types:
# Check column data types
for col in df.columns:
    if col.startswith('ss_mw_'):
        print(f"{col}: {df[col].dtype}")
        print(f"  Sample values: {df[col].head()}")
Convert data types:
# Convert to numeric types
for col in df.columns:
    if col.startswith('ss_mw_'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
Handle non-numeric values:
# Find and handle non-numeric values
for col in df.columns:
    if col.startswith('ss_mw_'):
        converted = pd.to_numeric(df[col], errors='coerce')
        # Flag values that fail conversion, excluding cells that were already missing
        non_numeric = converted.isna() & df[col].notna()
        if non_numeric.any():
            print(f"Non-numeric values in {col}: {df[col][non_numeric].unique()}")
Debugging Techniques
Enable Verbose Mode
Solution: Use verbose mode for detailed output.
# Enable verbose mode in CLI
tsoc-analyze 2024-01 --verbose
# Enable verbose mode in Python
success, df = execute(
    month='2024-01',
    data_dir='raw_data',
    output_dir='results',
    verbose=True
)
Log Analysis
Solution: Check log files for detailed error information.
import logging
# Set up logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('tsoc_analysis.log'),
        logging.StreamHandler()
    ]
)
# Run analysis with logging
success, df = execute(month='2024-01', data_dir='raw_data')
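Once the log file exists, its warnings and errors can be extracted without reading the whole file by eye. The sketch below assumes the `' - '`-separated format configured above and the `tsoc_analysis.log` file name. `summarize_log` is a hypothetical helper, and lines that do not match the format (such as traceback lines) are skipped.

```python
from collections import Counter


def summarize_log(path, levels=("WARNING", "ERROR", "CRITICAL")):
    """Count log records at the given levels and collect their messages."""
    counts = Counter()
    messages = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Expected layout: asctime - name - levelname - message
            parts = line.rstrip("\n").split(" - ", 3)
            if len(parts) == 4 and parts[2] in levels:
                counts[parts[2]] += 1
                messages.append(parts[3])
    return counts, messages


if __name__ == "__main__":
    counts, messages = summarize_log("tsoc_analysis.log")
    print(dict(counts))
    for message in messages:
        print(" ", message)
```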
Step-by-Step Debugging
Solution: Debug each step individually.
# Assumes loadallpowerdf, DataValidator, and extract_representative_ops are already imported.
# The steps are wrapped in a function (debug_pipeline is an illustrative name)
# so that "return" can stop at the first failing step.
def debug_pipeline(month='2024-01', data_dir='raw_data'):
    # Step 1: Check data loading
    try:
        df = loadallpowerdf(month, data_dir=data_dir)
        print(f"Data loaded: {df.shape}")
    except Exception as e:
        print(f"Data loading error: {e}")
        return

    # Step 2: Check data validation
    try:
        validator = DataValidator(df)
        validation_results = validator.validate_data()
        print(f"Validation completed: {validation_results['valid_records']} valid records")
    except Exception as e:
        print(f"Validation error: {e}")
        return

    # Step 3: Check clustering
    try:
        rep_df, diagnostics = extract_representative_ops(
            df,
            max_power=850,
            MAPGL=200
        )
        print(f"Clustering completed: {len(rep_df)} clusters")
    except Exception as e:
        print(f"Clustering error: {e}")

debug_pipeline()
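The load, validate, cluster sequence above follows a general pattern: run named steps in order and stop at the first failure. A minimal generic sketch of that pattern is shown below. `run_steps` is a hypothetical helper, and the lambdas are stand-ins for the real package calls.

```python
def run_steps(steps):
    """Run (name, callable) pairs in order; stop at the first exception.

    Returns (results, failed_step), where failed_step is None on success.
    """
    results = {}
    for name, func in steps:
        try:
            results[name] = func()
        except Exception as exc:
            print(f"{name} failed: {exc!r}")
            return results, name
        print(f"{name} succeeded")
    return results, None


# Stand-in steps; in practice each callable would wrap a package function.
results, failed = run_steps([
    ("load", lambda: [1, 2, 3]),
    ("validate", lambda: 1 / 0),  # deliberately fails
    ("cluster", lambda: "never reached"),
])
print(f"First failing step: {failed}")
```

Keeping the results of the steps that did succeed makes it possible to inspect intermediate data when a later step fails.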
Getting Help
Additional Resources:
Check the documentation for detailed API reference and examples
Review error messages carefully for specific issue details
Test with sample data to isolate the problem
Check system requirements and dependencies
Contact support at info@sps-lab.org for persistent issues
Common Debugging Checklist:
[ ] All required Excel files are present in the data directory
[ ] File names match the configuration in system_configuration.py
[ ] Excel files have the correct structure (timestamps in column C, data starting at row 6)
[ ] Column names follow the expected prefix patterns (ss_mw_*, wind_mw_*, etc.)
[ ] Data types are numeric (no text or mixed types)
[ ] Sufficient memory is available for the dataset size
[ ] All required Python packages are installed with compatible versions