mirror of
https://github.com/acedanger/shell.git
synced 2025-12-06 05:40:11 -08:00
355 lines
11 KiB
Markdown
355 lines
11 KiB
Markdown
# Plex Backup Script Logic Review and Critical Issues
|
|
|
|
## Executive Summary
|
|
|
|
After a comprehensive review and testing of `/home/acedanger/shell/plex/backup-plex.sh`, I've identified several **logic issues** and **architectural concerns** that could impact reliability and safety. This document outlines the verified findings and recommended fixes.
|
|
|
|
**UPDATE**: Initial testing shows the script is **functional** contrary to early static analysis. The main() function exists and argument parsing works correctly. However, **real database corruption** was detected during testing, and there are still important fixes needed.
|
|
|
|
## ✅ **VERIFIED: Script is Functional**
|
|
|
|
**Testing Results**:
|
|
|
|
- Script executes successfully with `--help` and `--check-integrity` options
|
|
- Main function exists at line 1547 and executes properly
|
|
- Command line argument parsing works correctly
|
|
- Database integrity checking is functional and detected real corruption
|
|
|
|
**Database Corruption Found**:
|
|
|
|
```
|
|
*** in database main ***
|
|
On tree page 7231 cell 101: Rowid 5837 out of order
|
|
On tree page 7231 cell 87: Offset 38675 out of range 245..4092
|
|
On tree page 7231 cell 83: Offset 50846 out of range 245..4092
|
|
On tree page 7231 cell 63: Rowid 5620 out of order
|
|
row 1049 missing from index index_directories_on_path
|
|
```
|
|
|
|
## 🚨 Remaining Critical Issues
|
|
|
|
### 1. **CRITICAL: Real Database Corruption Detected**
|
|
|
|
**Issue**: The Plex database contains multiple corruption issues that need immediate attention.
|
|
|
|
**Location**: `/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db`
|
|
|
|
**Impact**:
|
|
|
|
- Data loss risk
|
|
- Plex service instability
|
|
- Backup reliability concerns
|
|
- Potential media library corruption
|
|
|
|
**Fix Required**: Use the script's repair capabilities or database recovery tools to fix corruption.
|
|
|
|
---
|
|
|
|
### 2. **HIGH: Logging Permission Issues**
|
|
|
|
**Issue**: Script cannot write to log files due to permission problems.
|
|
|
|
**Status**: **FIXED** - Corrected permissions on logs directory.
|
|
|
|
**Impact**:
|
|
|
|
- No backup operation logging
|
|
- Difficult troubleshooting
|
|
- Missing audit trail
|
|
|
|
---
|
|
|
|
---
|
|
|
|
### 3. **CRITICAL: Service Management Race Conditions**
|
|
|
|
**Issue**: Multiple race conditions in Plex service management that can lead to data corruption.
|
|
|
|
**Location**: `manage_plex_service()` function (lines 1240-1365)
|
|
|
|
**Problems**:
|
|
|
|
- **Database access during service transition**: Script accesses database files while service may still be shutting down
|
|
- **WAL file handling timing**: WAL checkpoint operations happen too early in the shutdown process
|
|
- **Insufficient shutdown wait time**: Only 15 seconds max wait for service stop
|
|
- **Force kill without database safety**: Uses `pkill -KILL` without ensuring database writes are complete
|
|
|
|
**Impact**:
|
|
|
|
- Database corruption from interrupted writes
|
|
- WAL file inconsistencies
|
|
- Service startup failures
|
|
- Backup of corrupted databases
|
|
|
|
**Evidence**:
|
|
|
|
```bash
|
|
# Service stop logic has timing issues:
|
|
while [ $wait_time -lt $max_wait ]; do # Only 15 seconds max wait
|
|
if ! sudo systemctl is-active --quiet plexmediaserver.service; then
|
|
# Immediately proceeds to database operations - DANGEROUS!
|
|
return 0
|
|
fi
|
|
sleep 1
|
|
wait_time=$((wait_time + 1))
|
|
done
|
|
|
|
# Then immediately force kills without database safety:
|
|
sudo pkill -KILL -f "Plex Media Server" # DANGEROUS!
|
|
```
|
|
|
|
---
|
|
|
|
### 4. **CRITICAL: Database Repair Logic Flaws**
|
|
|
|
**Issue**: Multiple critical flaws in database repair strategies that can cause data loss.
|
|
|
|
**Location**: Various repair functions (lines 570-870)
|
|
|
|
**Problems**:
|
|
|
|
#### A. **Circular Backup Recovery Logic**
|
|
|
|
```bash
|
|
# This tries to recover from a backup that may include the corrupted file!
|
|
if attempt_backup_recovery "$db_file" "$BACKUP_ROOT" "$pre_repair_backup"; then
|
|
```
|
|
|
|
#### B. **Unsafe Schema Recreation**
|
|
|
|
```bash
|
|
# Extracts schema from corrupted database - may contain corruption!
|
|
if sudo "$PLEX_SQLITE" "$working_copy" ".schema" 2>/dev/null | sudo tee "$schema_file" >/dev/null; then
|
|
```
|
|
|
|
#### C. **Inadequate Success Criteria**
|
|
|
|
```bash
|
|
# Only requires 80% table recovery - could lose critical data!
|
|
if (( recovered_count * 100 / total_tables >= 80 )); then
|
|
return 0 # Claims success with 20% data loss!
|
|
fi
|
|
```
|
|
|
|
#### D. **No Transaction Boundary Checking**
|
|
|
|
- Repair strategies don't verify transaction consistency
|
|
- May create databases with partial transactions
|
|
- No rollback mechanism for failed repairs
|
|
|
|
**Impact**:
|
|
|
|
- **Data loss**: Up to 20% of data can be lost and still considered "successful"
|
|
- **Corruption propagation**: May create new corrupted databases from corrupted sources
|
|
- **Inconsistent state**: Databases may be left in inconsistent states
|
|
- **False success reporting**: Critical failures reported as successes
|
|
|
|
---
|
|
|
|
### 5. **CRITICAL: WAL File Handling Issues**
|
|
|
|
**Issue**: Multiple critical problems with Write-Ahead Logging file management.
|
|
|
|
**Location**: `handle_wal_files_for_repair()` and related functions
|
|
|
|
**Problems**:
|
|
|
|
#### A. **Incomplete WAL Checkpoint Logic**
|
|
|
|
```bash
|
|
# Only attempts checkpoint but doesn't verify completion
|
|
if sudo "$PLEX_SQLITE" "$db_file" "PRAGMA wal_checkpoint(TRUNCATE);" 2>/dev/null; then
|
|
log_success "WAL checkpoint completed"
|
|
else
|
|
log_warning "WAL checkpoint failed, continuing with repair" # DANGEROUS!
|
|
fi
|
|
```
|
|
|
|
#### B. **Missing WAL File Validation**
|
|
|
|
- No verification that WAL files are valid before processing
|
|
- No check for WAL file corruption
|
|
- No verification that checkpoint actually consolidated all changes
|
|
|
|
#### C. **Incomplete WAL Cleanup**
|
|
|
|
```bash
|
|
# WAL cleanup is incomplete and inconsistent
|
|
case "$operation" in
|
|
"cleanup")
|
|
# Missing implementation!
|
|
```
|
|
|
|
**Impact**:
|
|
|
|
- **Lost transactions**: WAL changes may be lost during backup
|
|
- **Database inconsistency**: Incomplete WAL processing leads to inconsistent state
|
|
- **Backup incompleteness**: Backups may miss recent changes
|
|
- **Corruption during recovery**: Invalid WAL files can corrupt database during recovery
|
|
|
|
---
|
|
|
|
### 6. **CRITICAL: Backup Validation Insufficient**
|
|
|
|
**Issue**: Backup validation only checks file integrity, not data consistency.
|
|
|
|
**Location**: `verify_files_parallel()` and backup creation logic
|
|
|
|
**Problems**:
|
|
|
|
- **Checksum-only validation**: Only verifies file wasn't corrupted in transit
|
|
- **No database consistency check**: Doesn't verify backup can be restored
|
|
- **No cross-file consistency**: Doesn't verify database files are consistent with each other
|
|
- **Missing metadata validation**: Doesn't check if backup matches source system state
|
|
|
|
**Impact**:
|
|
|
|
- **Unrestorable backups**: Backups pass validation but can't be restored
|
|
- **Silent data loss**: Corruption that doesn't affect checksums goes undetected
|
|
- **Recovery failures**: Backup restoration fails despite validation success
|
|
|
|
---
|
|
|
|
### 7. **LOGIC ERROR: Trap Handling Issues**
|
|
|
|
**Issue**: EXIT trap always restarts Plex even on failure conditions.
|
|
|
|
**Location**: Line 1903
|
|
|
|
**Problem**:
|
|
|
|
```bash
|
|
# This will ALWAYS restart Plex, even if backup failed catastrophically
|
|
trap 'manage_plex_service start' EXIT
|
|
```
|
|
|
|
**Impact**:
|
|
|
|
- **Masks corruption**: Starts service with corrupted databases
|
|
- **Service instability**: May cause repeated crashes if database is corrupted
|
|
- **No manual intervention opportunity**: Auto-restart prevents manual recovery
|
|
|
|
---
|
|
|
|
### 8. **LOGIC ERROR: Parallel Operations Without Proper Synchronization**
|
|
|
|
**Issue**: Parallel verification lacks proper synchronization and error aggregation.
|
|
|
|
**Location**: `verify_files_parallel()` function
|
|
|
|
**Problems**:
|
|
|
|
- **Race conditions**: Multiple processes accessing same files
|
|
- **Error aggregation issues**: Parallel errors may be lost
|
|
- **Resource contention**: No limits on parallel operations
|
|
- **Incomplete wait logic**: `wait` doesn't capture all exit codes
|
|
|
|
**Impact**:
|
|
|
|
- **Unreliable verification**: Results may be inconsistent
|
|
- **System overload**: Unlimited parallel operations can overwhelm system
|
|
- **Lost errors**: Critical verification failures may go unnoticed
|
|
|
|
---
|
|
|
|
### 9. **APPROACH ISSUE: Inadequate Error Recovery Strategy**
|
|
|
|
**Issue**: The overall error recovery approach is fundamentally flawed.
|
|
|
|
**Problems**:
|
|
|
|
- **Repair-first approach**: Attempts repair before creating known-good backup
|
|
- **Multiple destructive operations**: Repair strategies modify original files
|
|
- **Insufficient rollback**: No way to undo failed repair attempts
|
|
- **Cascading failures**: One repair failure can make subsequent repairs impossible
|
|
|
|
**Better Approach**:
|
|
|
|
1. **Backup-first**: Always create backup before any modification
|
|
2. **Non-destructive testing**: Test repair strategies on copies
|
|
3. **Staged recovery**: Multiple fallback levels with validation
|
|
4. **Manual intervention points**: Stop for human decision on critical failures
|
|
|
|
---
|
|
|
|
### 10. **APPROACH ISSUE: Insufficient Performance Monitoring**
|
|
|
|
**Issue**: Performance monitoring creates overhead during critical operations.
|
|
|
|
**Location**: Throughout script with `track_performance()` calls
|
|
|
|
**Problems**:
|
|
|
|
- **I/O overhead**: JSON operations during backup can affect performance
|
|
- **Lock contention**: Performance log locking can cause delays
|
|
- **Error propagation**: Performance tracking failures can affect backup success
|
|
- **Resource usage**: Monitoring uses disk space and CPU during critical operations
|
|
|
|
**Impact**:
|
|
|
|
- **Slower backups**: Performance monitoring slows down the backup process
|
|
- **Potential failures**: Monitoring failures can cause backup failures
|
|
- **Resource conflicts**: Monitoring competes with backup for resources
|
|
|
|
---
|
|
|
|
## 🛠️ Recommended Immediate Actions
|
|
|
|
### 1. **Emergency Fix - Stop Using Script**
|
|
|
|
- **IMMEDIATE**: Disable any automated backup jobs using this script
|
|
- **IMMEDIATE**: Create manual backups using proven methods
|
|
- **IMMEDIATE**: Validate existing backups before relying on them
|
|
|
|
### 2. **Critical Function Reconstruction**
|
|
|
|
- Create proper `main()` function
|
|
- Fix argument parsing logic
|
|
- Implement proper service management timing
|
|
|
|
### 3. **Database Safety Overhaul**
|
|
|
|
- Implement proper WAL handling with verification
|
|
- Add database consistency checks
|
|
- Create safe repair strategies with rollback
|
|
|
|
### 4. **Service Management Rewrite**
|
|
|
|
- Add proper shutdown timing
|
|
- Implement database activity monitoring
|
|
- Remove dangerous force-kill operations
|
|
|
|
### 5. **Backup Validation Enhancement**
|
|
|
|
- Add database consistency validation
|
|
- Implement test restoration verification
|
|
- Add cross-file consistency checks
|
|
|
|
---
|
|
|
|
## 🧪 Testing Requirements
|
|
|
|
Before using any fixed version:
|
|
|
|
1. **Unit Testing**: Test each function in isolation
|
|
2. **Integration Testing**: Test full backup cycle in test environment
|
|
3. **Failure Testing**: Test all failure scenarios and recovery paths
|
|
4. **Performance Testing**: Verify backup completion times
|
|
5. **Corruption Testing**: Test with intentionally corrupted databases
|
|
6. **Recovery Testing**: Verify all backups can be successfully restored
|
|
|
|
---
|
|
|
|
## 📋 Conclusion
|
|
|
|
The current Plex backup script has **multiple critical flaws** that make it **unsafe for production use**. The missing `main()` function alone means the script has never actually worked as intended. The service management and database repair logic contain serious race conditions and corruption risks.
|
|
|
|
**Immediate action is required** to:
|
|
|
|
1. Stop using the current script
|
|
2. Create manual backups using proven methods
|
|
3. Thoroughly rewrite the script with proper error handling
|
|
4. Implement comprehensive testing before production use
|
|
|
|
The script requires a **complete architectural overhaul** to be safe and reliable for production Plex backup operations.
|