# Architecture High-Level Design: Surge
## Executive Summary
This Architecture High-Level Design establishes the technical foundation for Surge, a mobile application enabling users to discover and complete structured self-improvement challenges. Building upon the Feature Definition's prioritization of the daily check-in experience and streak psychology, this architecture emphasizes responsive local-first interactions, reliable data synchronization, and a foundation that supports future social features without over-engineering the MVP.
The design balances immediate delivery needs with strategic positioning for Phase 2 social capabilities, ensuring the core tracking experience remains fast and satisfying even under poor network conditions.
***
## System Architecture Overview
```mermaid
graph TB
subgraph "Client Layer"
MA[Mobile App<br/>React Native]
LS[(Local Storage<br/>SQLite/Realm)]
end
subgraph "API Layer"
AG[API Gateway<br/>AWS API Gateway]
AUTH[Auth Service<br/>Firebase Auth]
end
subgraph "Application Layer"
US[User Service]
CS[Challenge Service]
PS[Progress Service]
end
subgraph "Data Layer"
PG[(PostgreSQL<br/>Primary DB)]
RC[(Redis<br/>Cache/Sessions)]
end
subgraph "Supporting Services"
PN[Push Notifications<br/>Firebase FCM]
AN[Analytics<br/>Mixpanel/Amplitude]
end
MA <--> LS
MA <--> AG
AG <--> AUTH
AG <--> US
AG <--> CS
AG <--> PS
US <--> PG
CS <--> PG
PS <--> PG
PS <--> RC
US <--> PN
MA --> AN
```
***
## Technology Stack
### Mobile Application
| Layer | Technology | Rationale |
| ----- | ---------- | --------- |
| Framework | React Native | Cross-platform efficiency, strong ecosystem, team familiarity |
| State Management | Zustand | Lightweight, minimal boilerplate, excellent for offline-first patterns |
| Local Database | WatermelonDB | Optimized for React Native, built-in sync capabilities, lazy loading |
| Navigation | React Navigation | Industry standard, deep linking support |
| UI Components | Custom + React Native Reanimated | Bold, high-energy design requires custom animations |
### Backend Services
| Component | Technology | Rationale |
| --------- | ---------- | --------- |
| Runtime | Node.js with TypeScript | Type safety, shared models with frontend, async performance |
| Framework | Fastify | High performance, schema validation, lower overhead than Express |
| Database | PostgreSQL 15 | ACID compliance, JSON support, proven reliability for user data |
| Cache | Redis | Session management, streak calculations, leaderboard preparation |
| Authentication | Firebase Auth | Rapid implementation, social login support, secure token management |
### Infrastructure
| Component | Technology | Rationale |
| --------- | ---------- | --------- |
| Cloud Provider | AWS | Comprehensive services, reliable, cost-effective at scale |
| Container Orchestration | AWS ECS Fargate | Serverless containers, reduced operational overhead |
| API Management | AWS API Gateway | Rate limiting, request validation, easy Lambda integration if needed |
| CDN | CloudFront | Challenge asset delivery, global edge caching |
| CI/CD | GitHub Actions | Integrated with codebase, cost-effective, extensive marketplace |
***
## Core Component Design
### Challenge Service
Manages the challenge library and challenge definitions. As noted in the Feature Definition, launching with five well-documented challenges is prioritized over quantity.
```mermaid
classDiagram
class Challenge {
+uuid id
+string name
+string description
+int duration_days
+DailyRequirement[] requirements
+DifficultyLevel difficulty
+string[] tags
+boolean is_active
}
class DailyRequirement {
+uuid id
+string title
+string description
+RequirementType type
+json validation_rules
+int sort_order
}
class RequirementType {
<<enumeration>>
BOOLEAN
NUMERIC
DURATION
PHOTO_PROOF
}
Challenge "1" --> "*" DailyRequirement
DailyRequirement --> RequirementType
```
**Design Decisions:**
* Challenge definitions are admin-managed, cached aggressively on device
* Requirement types support future extensibility (photo proof for social features)
* Validation rules stored as JSON for flexible challenge-specific logic
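The flexible-rules decision above can be sketched as a dispatch on `RequirementType`. The rule shape (`min`/`max` bounds) and field names below are illustrative assumptions, not the final `validation_rules` schema:

```typescript
// Sketch: validating a task completion against a requirement's JSON rules.
// Rule and payload shapes are assumptions for illustration.
type RequirementType = "BOOLEAN" | "NUMERIC" | "DURATION" | "PHOTO_PROOF";

interface ValidationRules {
  min?: number; // lower bound for NUMERIC / DURATION values
  max?: number; // optional upper bound
}

interface CompletionData {
  done?: boolean;    // BOOLEAN requirements
  value?: number;    // NUMERIC (count) or DURATION (minutes)
  photoUrl?: string; // PHOTO_PROOF (Phase 2)
}

function isCompletionValid(
  type: RequirementType,
  rules: ValidationRules,
  data: CompletionData,
): boolean {
  switch (type) {
    case "BOOLEAN":
      return data.done === true;
    case "NUMERIC":
    case "DURATION": {
      if (typeof data.value !== "number") return false;
      if (rules.min !== undefined && data.value < rules.min) return false;
      if (rules.max !== undefined && data.value > rules.max) return false;
      return true;
    }
    case "PHOTO_PROOF":
      // Deferred to Phase 2; accept any non-empty URL for now.
      return typeof data.photoUrl === "string" && data.photoUrl.length > 0;
  }
}
```

Because rules travel as JSON alongside the challenge definition, new challenge types can tighten or loosen bounds without a client release.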
### Progress Service
The heart of the user experience. Following the Feature Definition's emphasis on making check-ins "fast, satisfying, and visually rewarding," this service prioritizes write performance and immediate feedback.
```mermaid
classDiagram
class UserChallenge {
+uuid id
+uuid user_id
+uuid challenge_id
+date start_date
+ChallengeStatus status
+int current_streak
+int longest_streak
+int attempt_number
}
class DailyProgress {
+uuid id
+uuid user_challenge_id
+date progress_date
+int day_number
+boolean is_complete
+timestamp completed_at
}
class TaskCompletion {
+uuid id
+uuid daily_progress_id
+uuid requirement_id
+json completion_data
+timestamp completed_at
}
class ChallengeStatus {
<<enumeration>>
ACTIVE
COMPLETED
FAILED
PAUSED
}
UserChallenge "1" --> "*" DailyProgress
DailyProgress "1" --> "*" TaskCompletion
UserChallenge --> ChallengeStatus
```
**Streak Calculation Strategy:**
* Current streak calculated on write (not read) for instant UI updates
* Redis maintains hot streak data for active users
* Nightly batch job reconciles any sync discrepancies
* `attempt_number` tracks restarts, supporting Feature Definition's "encouraging restart experience"
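The write-path calculation above reduces to a small pure function. A minimal sketch, assuming both dates arrive pre-normalized to `YYYY-MM-DD` in the user's local timezone (the function and field names are illustrative):

```typescript
// Sketch of the on-write streak update: given the previous completion
// date and the date being completed, return the new current streak.
function daysBetween(a: string, b: string): number {
  const MS_PER_DAY = 86_400_000;
  // YYYY-MM-DD strings parse as UTC midnight, so the difference is exact.
  return Math.round((Date.parse(b) - Date.parse(a)) / MS_PER_DAY);
}

function nextStreak(
  lastDate: string | null, // last completed day, or null on first check-in
  today: string,
  current: number,
): number {
  if (lastDate === null) return 1;   // first completion of this attempt
  const gap = daysBetween(lastDate, today);
  if (gap === 0) return current;     // already checked in today (idempotent)
  if (gap === 1) return current + 1; // consecutive day extends the streak
  return 1;                          // missed day(s): streak restarts at 1
}
```

Computing this at write time means the UI can show the new streak immediately, with the nightly reconciliation job catching any drift between client and server clocks.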
### User Service
Handles authentication, profile management, and notification preferences.
```mermaid
sequenceDiagram
participant App
participant Firebase
participant API
participant DB
App->>Firebase: Social Login (Google/Apple)
Firebase-->>App: ID Token
App->>API: POST /auth/verify
API->>Firebase: Verify Token
Firebase-->>API: User Claims
API->>DB: Upsert User
DB-->>API: User Record
API-->>App: JWT + User Profile
App->>App: Store JWT Securely
```
***
## Offline-First Architecture
Given that daily check-ins are the core interaction, the app must function reliably regardless of network conditions.
```mermaid
graph LR
subgraph "User Action"
A[Complete Task]
end
subgraph "Local First"
B[Write to Local DB]
C[Update UI Immediately]
D[Queue Sync Operation]
end
subgraph "Background Sync"
E{Network Available?}
F[Sync to Server]
G[Retry with Backoff]
H[Conflict Resolution]
end
A --> B
B --> C
B --> D
D --> E
E -->|Yes| F
E -->|No| G
F --> H
G -.->|Retry| E
```
**Sync Strategy:**
* All progress writes happen locally first, providing instant feedback
* Background sync with exponential backoff (5s, 15s, 45s, 2min max)
* Last-write-wins conflict resolution (acceptable for single-user MVP)
* Server timestamp used as source of truth for streak calculations
* Sync queue persisted to survive app termination
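The backoff schedule above (5s, 15s, 45s, 2min max) is a tripling series with a cap. A minimal sketch; the 0-based attempt numbering is an assumption for illustration:

```typescript
// Sketch of the retry schedule: 5s, 15s, 45s, then capped at 2 minutes.
// The persisted sync queue would store the attempt count per operation
// so the schedule survives app termination.
function backoffDelayMs(attempt: number): number {
  const BASE_MS = 5_000;   // first retry after 5s
  const MAX_MS = 120_000;  // 2min ceiling
  return Math.min(BASE_MS * 3 ** attempt, MAX_MS);
}
```

Capping the delay keeps a long-offline session from pushing retries out indefinitely once connectivity returns.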
***
## Data Architecture
### PostgreSQL Schema (Simplified)
```sql
-- Core tables with indexes optimized for common queries
users (id, firebase_uid, email, display_name, created_at, updated_at)
INDEX: firebase_uid (unique), email
challenges (id, name, slug, duration_days, difficulty, is_active, metadata)
INDEX: slug (unique), is_active
challenge_requirements (id, challenge_id, title, type, validation_rules, sort_order)
INDEX: challenge_id
user_challenges (id, user_id, challenge_id, start_date, status, current_streak, attempt_number)
INDEX: (user_id, status), (user_id, challenge_id)
daily_progress (id, user_challenge_id, progress_date, day_number, is_complete, completed_at)
INDEX: (user_challenge_id, progress_date) UNIQUE
task_completions (id, daily_progress_id, requirement_id, completion_data, completed_at)
INDEX: daily_progress_id
```
### Redis Data Structures
```
# Active user streaks (hot data)
streak:{user_id}:{challenge_id} -> { current: 45, longest: 45, last_date: "2024-01-15" }
TTL: 7 days (refreshed on activity)
# Session management
session:{token} -> { user_id, expires_at, device_id }
TTL: 30 days
# Future: Leaderboard preparation
leaderboard:{challenge_id}:daily -> Sorted Set (user_id -> streak)
```
***
## API Design
RESTful API with consistent patterns. Key endpoints:
| Endpoint | Method | Purpose |
| -------- | ------ | ------- |
| `/challenges` | GET | List active challenges (cached) |
| `/challenges/{id}` | GET | Challenge details with requirements |
| `/me/challenges` | GET | User's active and past challenges |
| `/me/challenges` | POST | Start a new challenge |
| `/me/challenges/{id}/progress` | GET | Full progress for a challenge |
| `/me/challenges/{id}/today` | GET | Today's tasks and completion status |
| `/me/challenges/{id}/today` | PATCH | Update task completions |
| `/sync` | POST | Batch sync for offline changes |
**Response Time Targets:**
* Challenge library: <100ms (CDN cached)
* Today's progress: <150ms (Redis + DB)
* Task completion: <200ms (write path)
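To hit the "today" target in one round trip, the endpoint can bundle tasks, completion state, and streak in a single payload. A hypothetical response shape, not a finalized contract (all field names are assumptions):

```typescript
// Hypothetical payload for GET /me/challenges/{id}/today, combining
// today's tasks with streak data to avoid extra round trips.
interface TodayTask {
  requirementId: string;
  title: string;
  isComplete: boolean;
}

interface TodayResponse {
  dayNumber: number;
  currentStreak: number;
  isComplete: boolean; // true once every task is complete
  tasks: TodayTask[];
}

function buildTodayResponse(
  dayNumber: number,
  currentStreak: number,
  tasks: TodayTask[],
): TodayResponse {
  return {
    dayNumber,
    currentStreak,
    isComplete: tasks.length > 0 && tasks.every((t) => t.isComplete),
    tasks,
  };
}
```

Deriving `isComplete` server-side keeps the client's check-in screen a pure render of one response.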
***
## Security Architecture
```mermaid
graph TB
subgraph "Client Security"
A[Secure Token Storage<br/>iOS Keychain / Android Keystore]
B[Certificate Pinning]
C[Biometric Lock Option]
end
subgraph "Transport Security"
D[TLS 1.3]
E[API Gateway Rate Limiting]
end
subgraph "Backend Security"
F[JWT Validation]
G[Row-Level Security]
H[Input Validation<br/>Fastify Schemas]
end
A --> D
B --> D
D --> E
E --> F
F --> G
F --> H
```
**Key Security Measures:**
* Firebase Auth handles credential security
* Short-lived JWTs (1 hour) with refresh token rotation
* All user-data queries are filtered by the authenticated `user_id`
* Rate limiting: 100 requests/minute per user
* Input validation at API gateway and service layers
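The per-user limit above (100 requests/minute) behaves like a fixed-window counter. A minimal in-process sketch of that semantics; in production the counting lives in API Gateway, and the map, names, and injected clock here are illustrative only:

```typescript
// Sketch: fixed-window rate limiting at 100 requests per minute per user.
const WINDOW_MS = 60_000;
const LIMIT = 100;

const windows = new Map<string, { start: number; count: number }>();

function allowRequest(userId: string, nowMs: number): boolean {
  const w = windows.get(userId);
  if (!w || nowMs - w.start >= WINDOW_MS) {
    // No window yet, or the old one expired: start a fresh window.
    windows.set(userId, { start: nowMs, count: 1 });
    return true;
  }
  if (w.count >= LIMIT) return false; // over the limit in this window
  w.count += 1;
  return true;
}
```

Passing the clock in as `nowMs` keeps the function deterministic and easy to test; a real deployment would rely on the gateway's built-in throttling rather than per-process state.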
***
## Scalability Considerations
**MVP Scale (10K users):**
* Single PostgreSQL instance (db.t3.medium)
* Single Redis instance (cache.t3.micro)
* 2 ECS tasks behind ALB
* Estimated cost: ~$150/month
**Growth Path (100K+ users):**
* PostgreSQL read replicas for challenge library queries
* Redis cluster for streak calculations
* Horizontal scaling of stateless API services
* Consider Aurora Serverless for variable load
**Social Features Preparation:**
* User ID foreign keys in place for future friend relationships
* Redis sorted sets ready for leaderboard implementation
* Event-driven architecture allows adding notification triggers
***
## Deployment Architecture
```mermaid
graph TB
subgraph "Production"
ALB[Application Load Balancer]
ECS1[ECS Task 1]
ECS2[ECS Task 2]
RDS[(RDS PostgreSQL)]
REDIS[(ElastiCache Redis)]
end
subgraph "CI/CD"
GH[GitHub Actions]
ECR[ECR Registry]
end
subgraph "Monitoring"
CW[CloudWatch]
SENTRY[Sentry]
end
GH --> ECR
ECR --> ECS1
ECR --> ECS2
ALB --> ECS1
ALB --> ECS2
ECS1 --> RDS
ECS2 --> RDS
ECS1 --> REDIS
ECS2 --> REDIS
ECS1 --> CW
ECS1 --> SENTRY
```
**Deployment Strategy:**
* Blue/green deployments via ECS
* Database migrations run as pre-deployment task
* Feature flags for gradual rollouts
* Automated rollback on health check failures
***
## Recommendations
1. **Invest in Local-First Infrastructure**: The offline-first pattern is critical for the daily check-in experience. Allocate adequate time for sync logic and conflict handling.
2. **Implement Comprehensive Analytics Early**: As noted in the Feature Definition, event tracking from day one informs Phase 2 social features. Instrument all user interactions.
3. **Design APIs for Mobile Efficiency**: Combine related data in single responses (today's tasks + streak + progress) to minimize round trips.
4. **Plan for Streak Edge Cases**: Timezone handling, daylight saving transitions, and missed-day scenarios need careful consideration in both client and server logic.
5. **Prepare Social Foundation Without Building It**: Include `user_id` relationships and Redis structures that support leaderboards, but don't implement social features until validated.
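On the streak edge cases in recommendation 4: the core trap is that "a day" means the user's local calendar day, not a UTC day. One way to derive it, using `Intl.DateTimeFormat` (available in Node and modern JS engines; the `en-CA` locale is used here only because it formats dates as `YYYY-MM-DD`):

```typescript
// Sketch: converting a UTC timestamp to the user's local calendar day
// given an IANA timezone. Handles DST transitions via the Intl database.
function localDay(utc: Date, timeZone: string): string {
  return new Intl.DateTimeFormat("en-CA", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).format(utc); // e.g. "2024-01-15"
}
```

Running streak logic on these normalized day strings means a 10pm check-in in New York counts toward the right day even though it is already tomorrow in UTC.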
***
## Technical Risks & Mitigations
| Risk | Impact | Mitigation |
| ---- | ------ | ---------- |
| Offline sync conflicts | Data loss, user frustration | Comprehensive conflict resolution, sync status UI |
| Streak calculation errors | Core feature broken | Server-side validation, reconciliation jobs, audit logs |
| Firebase Auth dependency | Authentication outage | Graceful degradation, cached sessions |
| React Native performance | Poor animation experience | Native driver animations, performance profiling |
***
## Next Steps
1. Set up infrastructure-as-code (Terraform/CDK) for reproducible environments
2. Implement authentication flow and user service
3. Build challenge service with seed data for 5 launch challenges
4. Develop progress service with offline-first client integration
5. Establish CI/CD pipeline with staging environment