peterwood/shell

Fork 0

mirror of https://github.com/acedanger/shell.git synced 2025-12-06 06:40:13 -08:00

Files

Peter Wood ed6a1a3d76 backing up my changes because my server is about to get wiped

2025-06-12 10:05:41 -04:00

11 KiB

Raw Blame History

Plex Backup Script Logic Review and Critical Issues

Executive Summary

After a comprehensive review and testing of /home/acedanger/shell/plex/backup-plex.sh, I've identified several logic issues and architectural concerns that could impact reliability and safety. This document outlines the verified findings and recommended fixes.

UPDATE: Initial testing shows the script is functional contrary to early static analysis. The main() function exists and argument parsing works correctly. However, real database corruption was detected during testing, and there are still important fixes needed.

✅ VERIFIED: Script is Functional

Testing Results:

Script executes successfully with --help and --check-integrity options
Main function exists at line 1547 and executes properly
Command line argument parsing works correctly
Database integrity checking is functional and detected real corruption

Database Corruption Found:

*** in database main ***
On tree page 7231 cell 101: Rowid 5837 out of order
On tree page 7231 cell 87: Offset 38675 out of range 245..4092
On tree page 7231 cell 83: Offset 50846 out of range 245..4092
On tree page 7231 cell 63: Rowid 5620 out of order
row 1049 missing from index index_directories_on_path

🚨 Remaining Critical Issues

1. CRITICAL: Real Database Corruption Detected

Issue: The Plex database contains multiple corruption issues that need immediate attention.

Location: /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db

Impact:

Data loss risk
Plex service instability
Backup reliability concerns
Potential media library corruption

Fix Required: Use the script's repair capabilities or database recovery tools to fix corruption.

2. HIGH: Logging Permission Issues

Issue: Script cannot write to log files due to permission problems.

Status: FIXED - Corrected permissions on logs directory.

Impact:

No backup operation logging
Difficult troubleshooting
Missing audit trail

3. CRITICAL: Service Management Race Conditions

Issue: Multiple race conditions in Plex service management that can lead to data corruption.

Location: manage_plex_service() function (lines 1240-1365)

Problems:

Database access during service transition: Script accesses database files while service may still be shutting down
WAL file handling timing: WAL checkpoint operations happen too early in the shutdown process
Insufficient shutdown wait time: Only 15 seconds max wait for service stop
Force kill without database safety: Uses pkill -KILL without ensuring database writes are complete

Impact:

Database corruption from interrupted writes
WAL file inconsistencies
Service startup failures
Backup of corrupted databases

Evidence:

# Service stop logic has timing issues:
while [ $wait_time -lt $max_wait ]; do  # Only 15 seconds max wait
    if ! sudo systemctl is-active --quiet plexmediaserver.service; then
        # Immediately proceeds to database operations - DANGEROUS!
        return 0
    fi
    sleep 1
    wait_time=$((wait_time + 1))
done

# Then immediately force kills without database safety:
sudo pkill -KILL -f "Plex Media Server"  # DANGEROUS!

4. CRITICAL: Database Repair Logic Flaws

Issue: Multiple critical flaws in database repair strategies that can cause data loss.

Location: Various repair functions (lines 570-870)

Problems:

A. Circular Backup Recovery Logic

# This tries to recover from a backup that may include the corrupted file!
if attempt_backup_recovery "$db_file" "$BACKUP_ROOT" "$pre_repair_backup"; then

B. Unsafe Schema Recreation

# Extracts schema from corrupted database - may contain corruption!
if sudo "$PLEX_SQLITE" "$working_copy" ".schema" 2>/dev/null | sudo tee "$schema_file" >/dev/null; then

C. Inadequate Success Criteria

# Only requires 80% table recovery - could lose critical data!
if (( recovered_count * 100 / total_tables >= 80 )); then
    return 0  # Claims success with 20% data loss!
fi

D. No Transaction Boundary Checking

Repair strategies don't verify transaction consistency
May create databases with partial transactions
No rollback mechanism for failed repairs

Impact:

Data loss: Up to 20% of data can be lost and still considered "successful"
Corruption propagation: May create new corrupted databases from corrupted sources
Inconsistent state: Databases may be left in inconsistent states
False success reporting: Critical failures reported as successes

5. CRITICAL: WAL File Handling Issues

Issue: Multiple critical problems with Write-Ahead Logging file management.

Location: handle_wal_files_for_repair() and related functions

Problems:

A. Incomplete WAL Checkpoint Logic

# Only attempts checkpoint but doesn't verify completion
if sudo "$PLEX_SQLITE" "$db_file" "PRAGMA wal_checkpoint(TRUNCATE);" 2>/dev/null; then
    log_success "WAL checkpoint completed"
else
    log_warning "WAL checkpoint failed, continuing with repair"  # DANGEROUS!
fi

B. Missing WAL File Validation

No verification that WAL files are valid before processing
No check for WAL file corruption
No verification that checkpoint actually consolidated all changes

C. Incomplete WAL Cleanup

# WAL cleanup is incomplete and inconsistent
case "$operation" in
    "cleanup")
        # Missing implementation!

Impact:

Lost transactions: WAL changes may be lost during backup
Database inconsistency: Incomplete WAL processing leads to inconsistent state
Backup incompleteness: Backups may miss recent changes
Corruption during recovery: Invalid WAL files can corrupt database during recovery

6. CRITICAL: Backup Validation Insufficient

Issue: Backup validation only checks file integrity, not data consistency.

Location: verify_files_parallel() and backup creation logic

Problems:

Checksum-only validation: Only verifies file wasn't corrupted in transit
No database consistency check: Doesn't verify backup can be restored
No cross-file consistency: Doesn't verify database files are consistent with each other
Missing metadata validation: Doesn't check if backup matches source system state

Impact:

Unrestorable backups: Backups pass validation but can't be restored
Silent data loss: Corruption that doesn't affect checksums goes undetected
Recovery failures: Backup restoration fails despite validation success

7. LOGIC ERROR: Trap Handling Issues

Issue: EXIT trap always restarts Plex even on failure conditions.

Location: Line 1903

Problem:

# This will ALWAYS restart Plex, even if backup failed catastrophically
trap 'manage_plex_service start' EXIT

Impact:

Masks corruption: Starts service with corrupted databases
Service instability: May cause repeated crashes if database is corrupted
No manual intervention opportunity: Auto-restart prevents manual recovery

8. LOGIC ERROR: Parallel Operations Without Proper Synchronization

Issue: Parallel verification lacks proper synchronization and error aggregation.

Location: verify_files_parallel() function

Problems:

Race conditions: Multiple processes accessing same files
Error aggregation issues: Parallel errors may be lost
Resource contention: No limits on parallel operations
Incomplete wait logic: wait doesn't capture all exit codes

Impact:

Unreliable verification: Results may be inconsistent
System overload: Unlimited parallel operations can overwhelm system
Lost errors: Critical verification failures may go unnoticed

9. APPROACH ISSUE: Inadequate Error Recovery Strategy

Issue: The overall error recovery approach is fundamentally flawed.

Problems:

Repair-first approach: Attempts repair before creating known-good backup
Multiple destructive operations: Repair strategies modify original files
Insufficient rollback: No way to undo failed repair attempts
Cascading failures: One repair failure can make subsequent repairs impossible

Better Approach:

Backup-first: Always create backup before any modification
Non-destructive testing: Test repair strategies on copies
Staged recovery: Multiple fallback levels with validation
Manual intervention points: Stop for human decision on critical failures

10. APPROACH ISSUE: Insufficient Performance Monitoring

Issue: Performance monitoring creates overhead during critical operations.

Location: Throughout script with track_performance() calls

Problems:

I/O overhead: JSON operations during backup can affect performance
Lock contention: Performance log locking can cause delays
Error propagation: Performance tracking failures can affect backup success
Resource usage: Monitoring uses disk space and CPU during critical operations

Impact:

Slower backups: Performance monitoring slows down the backup process
Potential failures: Monitoring failures can cause backup failures
Resource conflicts: Monitoring competes with backup for resources

🛠️ Recommended Immediate Actions

1. Emergency Fix - Stop Using Script

IMMEDIATE: Disable any automated backup jobs using this script
IMMEDIATE: Create manual backups using proven methods
IMMEDIATE: Validate existing backups before relying on them

2. Critical Function Reconstruction

Create proper main() function
Fix argument parsing logic
Implement proper service management timing

3. Database Safety Overhaul

Implement proper WAL handling with verification
Add database consistency checks
Create safe repair strategies with rollback

4. Service Management Rewrite

Add proper shutdown timing
Implement database activity monitoring
Remove dangerous force-kill operations

5. Backup Validation Enhancement

Add database consistency validation
Implement test restoration verification
Add cross-file consistency checks

🧪 Testing Requirements

Before using any fixed version:

Unit Testing: Test each function in isolation
Integration Testing: Test full backup cycle in test environment
Failure Testing: Test all failure scenarios and recovery paths
Performance Testing: Verify backup completion times
Corruption Testing: Test with intentionally corrupted databases
Recovery Testing: Verify all backups can be successfully restored

📋 Conclusion

The current Plex backup script has multiple critical flaws that make it unsafe for production use. The missing main() function alone means the script has never actually worked as intended. The service management and database repair logic contain serious race conditions and corruption risks.

Immediate action is required to:

Stop using the current script
Create manual backups using proven methods
Thoroughly rewrite the script with proper error handling
Implement comprehensive testing before production use

The script requires a complete architectural overhaul to be safe and reliable for production Plex backup operations.

11 KiB Raw Blame History

Plex Backup Script Logic Review and Critical Issues

Executive Summary

✅ VERIFIED: Script is Functional

🚨 Remaining Critical Issues

1. CRITICAL: Real Database Corruption Detected

2. HIGH: Logging Permission Issues

3. CRITICAL: Service Management Race Conditions

4. CRITICAL: Database Repair Logic Flaws

A. Circular Backup Recovery Logic

B. Unsafe Schema Recreation

C. Inadequate Success Criteria

D. No Transaction Boundary Checking

5. CRITICAL: WAL File Handling Issues

A. Incomplete WAL Checkpoint Logic

B. Missing WAL File Validation

C. Incomplete WAL Cleanup

6. CRITICAL: Backup Validation Insufficient

7. LOGIC ERROR: Trap Handling Issues

8. LOGIC ERROR: Parallel Operations Without Proper Synchronization

9. APPROACH ISSUE: Inadequate Error Recovery Strategy

10. APPROACH ISSUE: Insufficient Performance Monitoring

🛠️ Recommended Immediate Actions

1. Emergency Fix - Stop Using Script

2. Critical Function Reconstruction

3. Database Safety Overhaul

4. Service Management Rewrite

5. Backup Validation Enhancement

🧪 Testing Requirements

📋 Conclusion

11 KiB

Raw Blame History