Resilience
How SpendLil ensures your AI never goes down because of us.
SpendLil is built around one principle: your AI must never stop working because of us. Every architectural decision serves this goal.
The Resilience Model
| State | What Happens |
|---|---|
| Normal | Full proxy: requests logged, spend tracked, alerts fire, response returned |
| DynamoDB degraded | Request forwarded to provider, failed writes queued for retry, response returned |
| Lambda error | API Gateway returns 502; your app can retry or fall back to calling the provider directly |
| Complete outage | DNS failover: your requests go directly to your AI provider, using the key already in your request |
Why This Works
The key insight is that your provider API key is always in your request. SpendLil never strips it, never replaces it, never stores it. This means at every failure point, you still have everything you need to call your AI provider directly.
Compare this with proxies that store your keys and inject them on your behalf. If such a proxy goes down, your requests fail because the key isn't in them. With SpendLil, the worst case is a gap in your spend data, not a gap in your AI service.
Fire-and-Forget Logging
When the proxy processes a request, it returns the provider's response before confirming that the usage record was written to DynamoDB. If the write fails, you still get your response. The write failure is logged and can be retried, but it never blocks the critical path.
In short, a logging failure can leave a temporary gap in your tracking data, but it never costs you a response.
Dashboard Independence
The dashboard (app.spendlil.ai) is completely decoupled from the proxy (gateway.spendlil.ai). They share a DynamoDB table but run on independent infrastructure. If the dashboard goes down, the proxy keeps working. If the proxy has issues, the dashboard still shows your historical data.
Infrastructure
The proxy runs on AWS Lambda and API Gateway, which provide built-in redundancy across multiple Availability Zones, managed by AWS. There are no single points of failure: no single server, no single container, no Redis instance, no RDS database. DynamoDB is serverless, with point-in-time recovery enabled.
Building Fallback Into Your App
For maximum resilience, implement a simple fallback in your application code that calls your provider directly if SpendLil returns a 502 or 503. See the Error Handling guide for a complete code example.
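As a minimal sketch of such a fallback (not the example from the Error Handling guide), the function below sends the request through the gateway and repeats it against the provider if the gateway answers 502 or 503. Several details are assumptions: the URL path on `gateway.spendlil.ai`, the placeholder `PROVIDER_URL`, and the helper names `post_json` and `call_with_fallback`. The same headers are sent to both targets, because the provider key already travels in your request.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Assumed endpoints for illustration; substitute the real paths you use.
GATEWAY_URL = "https://gateway.spendlil.ai/v1/chat/completions"
PROVIDER_URL = "https://api.example-provider.com/v1/chat/completions"

# Status codes that trigger the direct call.
FALLBACK_STATUSES = {502, 503}


def post_json(url: str, headers: dict, payload: dict, timeout: float = 30.0) -> Tuple[int, bytes]:
    """POST a JSON payload and return (status_code, raw_body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={**headers, "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; surface them as a status.
        return err.code, err.read()


def call_with_fallback(
    headers: dict,
    payload: dict,
    send: Callable[..., Tuple[int, bytes]] = post_json,
) -> Tuple[int, bytes]:
    """Try the gateway first; on 502/503, call the provider directly."""
    status, body = send(GATEWAY_URL, headers, payload)
    if status in FALLBACK_STATUSES:
        # Same headers and payload: the provider key is already present.
        status, body = send(PROVIDER_URL, headers, payload)
    return status, body
```

Requests served through the fallback path bypass the proxy, so they will not appear in your spend data.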