01 · The Problem
Coaches were losing clients
to no-shows and silence
The product is a B2B SaaS platform helping businesses manage client relationships, appointments, and operations. When I joined, appointment communication was email-only — and email alone wasn't cutting it.
Coaches were reporting that clients frequently missed appointments, weren't seeing confirmation emails in time, and had no fallback when email notifications failed. The business needed a real-time, independent notification channel. SMS was the obvious answer — but nobody had built it yet.
What we were dealing with before this system existed
- No automated SMS for appointment confirmations, reminders, or cancellations
- No way to notify coaches or clients in real time without relying on email deliverability
- No billing system for SMS — no way to charge coaches per message or manage credit limits
- No consent management — couldn't send SMS at scale without TCPA violation risk
- No retry logic — a failed Twilio call meant the message was simply lost
The scope of the work was bigger than it looked. Building "SMS notifications" meant building a complete platform: Twilio integration, a credit-based billing layer, a consent management system, scheduled reminders, retry infrastructure, and an audit trail. I owned all of it end-to-end.
02 · Architecture
Four layers, one
clean separation of concerns
The first decision I made was architectural. SMS has unique concerns that don't map to email or push — it costs money per message, has strict compliance requirements, and involves an external API with its own failure modes. A generic notification system wouldn't work. I designed four dedicated layers:
ENTRY POINTS
  AppointmentService.create       → new appointment created
  AppointmentsController          → update / cancel / reschedule
  AppointmentReminder.perform     → hourly cron (48hr, 24hr, 1hr)
  Appointment.cancelled_notify    → model observer on cancellation
                │
                ▼
Sms::AppointmentSmsService        ← Orchestration Layer
  Context-aware routing · Consent validation · Template selection
  Timezone handling · Cohort vs individual logic
                │
                ▼
SmsService                        ← Integration Layer
  Validate credentials → Format phone (E.164) → Check credits
  Create SMS record → Send via Twilio → Deduct credits (atomic)
  Handle errors → Schedule retries
                │
           ┌────┴────┐
           ▼         ▼
      Twilio API   SmsRetryJob    ← exponential backoff
                │
                ▼
SmsCreditsService                 ← Financial Layer
  Credit balance · Auto-purchase · Ledger transactions
Each layer has exactly one responsibility. The orchestration layer knows about appointments and business rules. The integration layer knows about Twilio and error handling. The financial layer knows about credits and billing. None of them bleeds into the others.
03 · How It Works
The full lifecycle of
a single appointment SMS
When a coach books an appointment, here is exactly what happens before the client receives a text:
Appointment created → SMS triggered
AppointmentService.create fires after saving the appointment. SMS calls are wrapped in rescue blocks so a Twilio failure never breaks the booking itself.
Three-layer consent check
Before touching Twilio: does the client have a mobile number? Have they given SMS consent? Has the coach enabled SMS for this context (creation, reminder, cancellation)? All three must pass.
Atomic credit transaction with row-level lock
Inside a database transaction: lock the coach's record (prevents race conditions), check credit balance, create the SMS record with status 'queued', send to Twilio, deduct 1 credit. All or nothing.
Timezone-aware message with client's local time
The message is built using the client's timezone — not the coach's. A client in Tokyo and a coach in New York each receive times in their own local context.
Error classification → permanent or retryable
If Twilio fails, I classify the error. Invalid phone numbers are marked permanently failed — no retry, no wasted credits. Temporary service errors trigger exponential backoff: 30s → 60s → 120s with ±20% jitter.
Audit record persisted regardless of outcome
Every SMS — sent, failed, or retrying — creates an immutable record: recipient, message body, Twilio SID, status, retry count, timestamps. This is the ledger, not just a log.
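The error classification step above can be sketched roughly as follows. This is a minimal illustration, not the production code: the constant names, the `classify_sms_error` helper, and the retry cap of 3 are assumptions. The numeric values are Twilio error codes, but which codes the real system treats as permanent is a guess.

```ruby
# Hypothetical mapping. The values are Twilio error codes, but treating
# exactly these as permanent is an assumption made for illustration.
PERMANENT_ERROR_CODES = [
  21211, # invalid 'To' phone number
  21614, # 'To' number is not a valid mobile number
  21610  # recipient has unsubscribed (replied STOP)
].freeze

MAX_RETRIES = 3 # assumed from the 30s → 60s → 120s schedule

# Returns :permanent_failure, :retry, or :exhausted.
def classify_sms_error(error_code, retry_count)
  return :permanent_failure if PERMANENT_ERROR_CODES.include?(error_code)
  return :exhausted if retry_count >= MAX_RETRIES

  :retry
end
```

Anything not on the permanent list is treated as transient here, which errs on the side of retrying. A stricter variant would keep an explicit list of retryable codes instead.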
04 · The Hard Parts
Three engineering problems
that actually mattered
Problem 1: Race conditions in credit deduction. Without locking, two concurrent SMS sends for the same coach could both pass the credit check and both deduct, leaving the balance negative or the coach overcharged. The fix was pessimistic locking inside the transaction:
# Atomic: check credits, send SMS, deduct, all in one transaction
User.transaction do
  locked_user = User.lock.find(credit_user.id) # row-level lock
  unless SmsCreditsService.sufficient_credits?(locked_user, 1)
    raise InsufficientCreditsError
  end
  sms_record = Sms.create!(status: 'queued', ...)
  message = @client.messages.create(from: @from, to: number, body: body)
  sms_record.update!(status: 'sent', twilio_sid: message.sid)
  SmsCreditsService.deduct_credits(locked_user, 1)
end
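The same check-then-deduct pattern can be tried without a database. The sketch below is only an analogy: a `Mutex` stands in for the row-level lock, and `CreditAccount` is a hypothetical in-memory class, not part of the system described here. It shows why the balance check and the deduction have to sit inside one critical section.

```ruby
class InsufficientCreditsError < StandardError; end

# In-memory stand-in for a coach's credit balance. The Mutex plays the role
# of the database row lock; it is an analogy, not the production mechanism.
class CreditAccount
  attr_reader :balance

  def initialize(balance)
    @balance = balance
    @lock = Mutex.new
  end

  # Check and deduct inside one critical section, so no other thread can
  # read the balance between the check and the write.
  def deduct!(amount)
    @lock.synchronize do
      raise InsufficientCreditsError if @balance < amount

      @balance -= amount
    end
  end
end

# 20 concurrent "sends" compete for 5 credits.
account = CreditAccount.new(5)
threads = Array.new(20) do
  Thread.new do
    begin
      account.deduct!(1)
      :sent
    rescue InsufficientCreditsError
      :rejected
    end
  end
end
results = threads.map(&:value)
```

With the lock in place, exactly five sends go through and the balance stops at zero, whatever order the threads run in.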
Problem 2: Thundering herd on retry. If Twilio has a 5-minute outage and recovers, all queued retries would fire simultaneously. Adding jitter to the backoff distributes the load:
def calculate_retry_delay(attempt)
  base_delay = 30 * (2 ** (attempt - 1)) # 30s → 60s → 120s
  jitter = rand(0.8..1.2)                # ±20% randomness
  # Clamp to 30..300s; the 30s floor trims downward jitter on the first attempt
  (base_delay * jitter).clamp(30, 300).to_i
end
Problem 3: Duplicate reminders on recurring appointments. A coach with a 10-session recurring series would generate 10 reminder SMS at the same time without deduplication. The fix: group by recurring_id and take only the first occurrence per series.
recurring = all_appointments.where.not(recurring_id: nil)
non_recurring = all_appointments.where(recurring_id: nil)
# Only remind for first occurrence — not all 10 sessions
first_per_series = recurring
.group_by(&:recurring_id)
.values
.map(&:first)
non_recurring + first_per_series
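One detail worth making explicit is what "first" means: `map(&:first)` returns whichever record was loaded first in each group. The sketch below is a plain-Ruby variant that picks the earliest session per series by start time. `Appointment` is a stand-in Struct, and `starts_at` and `appointments_to_remind` are hypothetical names, not the real model's.

```ruby
# Stand-in for the ActiveRecord model; starts_at is a hypothetical attribute.
Appointment = Struct.new(:id, :recurring_id, :starts_at)

def appointments_to_remind(appointments)
  # Appointments with a recurring_id belong to a series; the rest stand alone.
  recurring, non_recurring = appointments.partition(&:recurring_id)

  # Earliest session per series, chosen explicitly rather than by load order.
  first_per_series = recurring
    .group_by(&:recurring_id)
    .values
    .map { |series| series.min_by(&:starts_at) }

  non_recurring + first_per_series
end

appointments = [
  Appointment.new(1, nil, Time.utc(2024, 5, 1, 9)),
  Appointment.new(2, 7,   Time.utc(2024, 5, 8, 9)),
  Appointment.new(3, 7,   Time.utc(2024, 5, 1, 9))
]
reminder_ids = appointments_to_remind(appointments).map(&:id) # => [1, 3]
```

In the Rails version, an explicit `order` on the query before `group_by` would have the same effect.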
05 · Engineering Decisions
Why I built it this way
Pessimistic locking over optimistic
Financial credit systems need guarantees. Optimistic locking detects a conflict only at write time and leaves the caller to retry, which is acceptable for version tracking but not for money. Pessimistic locking at the database level gives the certainty we needed.
Dedicated SMS service, not a generic notifier
SMS has unique concerns — cost per message, E.164 phone validation, TCPA compliance — that don't apply to email or push. A generic notification abstraction would have hidden these distinctions badly.
Graceful degradation over strong coupling
SMS is auxiliary to the core booking flow. Wrapping every SMS call in rescue blocks means Twilio outages never affect appointment creation. The user's primary action always succeeds.
Synchronous send, async retry
Sending synchronously means the message is handed to Twilio immediately. Background jobs are used only for retries, where latency doesn't matter. This avoids queue latency on the first attempt while keeping the retry infrastructure clean.
Context symbols over if/else chains
Using :scheduled_appointment, :cancellation, :reminder_48hr as context keys made the system extensible. Adding a new notification type needs only a template and a consent check — no controller changes.
Ledger model for SMS records
Every SMS is a financial transaction. Treating the Sms model as an immutable ledger — not just a queue — gave us audit capability, TCPA compliance proof, and the ability to debug any production issue.
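To make the context-symbol decision concrete, here is a small sketch of how one registry could drive both template selection and the three-layer consent check. Everything in it is illustrative: the `SMS_CONTEXTS` table, the setting names, the template wording, and the `Client` and `Coach` Structs are assumptions, not the system's actual schema.

```ruby
# Hypothetical registry: each context symbol maps to a message template and
# the coach-level setting that must be enabled for that context.
SMS_CONTEXTS = {
  scheduled_appointment: {
    setting: :sms_on_creation,
    template: "Your appointment with %{coach} is booked for %{time}."
  },
  cancellation: {
    setting: :sms_on_cancellation,
    template: "Your appointment with %{coach} on %{time} was cancelled."
  },
  reminder_48hr: {
    setting: :sms_on_reminder,
    template: "Reminder: you meet %{coach} on %{time}."
  }
}.freeze

Client = Struct.new(:mobile, :sms_consent)
Coach  = Struct.new(:name, :enabled_settings)

# Returns the message body, or nil if any of the three consent checks fails.
def build_sms(context, client:, coach:, time:)
  config = SMS_CONTEXTS.fetch(context) # unknown contexts raise KeyError

  return nil if client.mobile.to_s.empty? # 1. client has a mobile number
  return nil unless client.sms_consent    # 2. client has opted in
  # 3. coach has enabled SMS for this context
  return nil unless coach.enabled_settings.include?(config[:setting])

  format(config[:template], coach: coach.name, time: time)
end
```

Under these assumptions, adding a notification type is one new entry in the table. The consent checks and the formatting path stay untouched.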
"The hardest part wasn't integrating Twilio. It was building the financial layer correctly so that in high-concurrency scenarios, credits were never double-deducted or bypassed. That required thinking about the database as a financial system, not just a data store."
06 · What I Learned
Ten things this system
taught me as an engineer
Financial systems require atomic guarantees
Lock the resource before checking the balance. Always. This pattern applies to any system handling money, points, or limited resources under concurrency.
External API failures should never break core features
Graceful degradation is an architectural decision, not an afterthought. Wrapping auxiliary operations in rescue blocks protects the primary user workflow.
Not all errors are retryable — classify them
Retrying an invalid phone number wastes credits and cycles. Mapping error codes to permanent vs transient outcomes is one of the most impactful things you can do when integrating an external API.
Audit trails are non-negotiable in regulated systems
Every SMS creates an immutable record. This isn't logging — it's a ledger. It enabled debugging production issues, proving compliance in audits, and reconciling billing.
Compliance belongs in the service layer
TCPA statutory damages are $500 per violation, rising to $1,500 when the violation is willful. Putting consent checks in the service object (not the controller) means every caller goes through them, so a new entry point can't skip them by accident.
Jitter in retry logic prevents cascade failures
Without randomness in backoff delays, all failed messages retry simultaneously after an outage. Adding ±20% jitter distributes load and respects rate limits.
Phone number validation is a security concern
Without E.164 validation and type checking, an attacker could input premium-rate international numbers and drain Twilio credits. Validation isn't just UX — it's fraud prevention.
Deduplication logic matters for scheduled jobs
A 10-session recurring appointment would generate 10 simultaneous reminders without explicit deduplication. Always think about what happens when scheduled tasks run against collections.
Context symbols scale better than boolean flags
Parameterizing notification types with :scheduled_appointment, :cancellation, :reminder_48hr made the system extensible without touching existing code paths.
Structured logging is essential for production debugging
Prefixing every log line with [SmsService] and including record IDs means I can grep the full lifecycle of any SMS in under 10 seconds. Observability is built-in, not bolted on.
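The phone-validation lesson above can be illustrated with a short sketch. It assumes a simple E.164 shape check plus a hypothetical allowlist of country calling codes. The `ALLOWED_PREFIXES` list and both helper names are made up for the example. A prefix allowlist is coarse, since premium-rate ranges also exist inside allowed country codes, so a real deployment would likely add a line-type lookup on top.

```ruby
# E.164 shape: a leading "+", a non-zero first digit, at most 15 digits total.
E164_PATTERN = /\A\+[1-9]\d{1,14}\z/

# Hypothetical allowlist of country calling codes the product serves.
ALLOWED_PREFIXES = ["+1", "+44", "+81"].freeze

def normalize_phone(raw)
  # Drop spaces, dashes, dots and parentheses; keep "+" and digits.
  raw.to_s.gsub(/[\s\-().]/, "")
end

def sendable_number?(raw)
  number = normalize_phone(raw)
  return false unless number.match?(E164_PATTERN)

  ALLOWED_PREFIXES.any? { |prefix| number.start_with?(prefix) }
end
```

Rejecting the number before it reaches Twilio means no API call and no credit is spent on a destination the product never meant to serve.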
07 · What I'd Do Next
If I were to continue
improving this system
Every production system has a backlog of improvements. Here are the ones I would prioritize and why:
Circuit breaker for Twilio outages
When Twilio is down, every SMS attempt currently times out after 30+ seconds, blocking threads. A circuit breaker would fast-fail and schedule retries instead of hanging.
Delivery status webhooks
We currently know if Twilio accepted the message — not if it was actually delivered. A webhook endpoint would give us real delivery rates and surface carrier-level failures.
Rate limiting per coach
A bug in recurring appointment logic could send thousands of SMS in a loop. A 100/hour rate limit per coach would prevent accidental credit drainage before anyone notices.
SMS opt-out via STOP reply
TCPA compliance calls for an easy opt-out mechanism. A Twilio webhook that listens for "STOP" replies and updates the user's SMS consent automatically would strengthen compliance.
Bulk sending for cohort sessions
Group appointments currently send SMS sequentially, one Twilio API call per member. Routing cohort sends through a Twilio Messaging Service and issuing the per-member calls concurrently would reduce sending latency significantly.
Timezone-aware reminder scheduling
A 1-hour reminder could arrive at 3am if the appointment and client are in different timezones. Adding a quiet hours check (8am–10pm in recipient timezone) would prevent late-night interruptions.
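A quiet-hours check along those lines could look like the sketch below. It is a minimal illustration using only Ruby's standard library. The method name and constants are hypothetical, and the recipient's zone is passed as a fixed UTC offset string for simplicity; a real system would more likely store an IANA zone name so daylight saving time is handled.

```ruby
QUIET_HOURS_START = 22 # 10pm in the recipient's local time
QUIET_HOURS_END   = 8  # 8am in the recipient's local time

# send_at_utc is a Time; utc_offset is a string such as "+09:00".
# Returns true when the local hour falls in the 8am-10pm sending window.
def within_sending_window?(send_at_utc, utc_offset)
  local_hour = send_at_utc.getlocal(utc_offset).hour
  local_hour >= QUIET_HOURS_END && local_hour < QUIET_HOURS_START
end

# 18:00 UTC is 03:00 in Tokyo (+09:00) but 13:00 in New York (-05:00).
tokyo_ok    = within_sending_window?(Time.utc(2024, 1, 1, 18, 0), "+09:00") # => false
new_york_ok = within_sending_window?(Time.utc(2024, 1, 1, 18, 0), "-05:00") # => true
```

What happens to a blocked reminder, whether it is dropped, delayed to 8am, or sent earlier, is a product decision this sketch leaves open.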
08 · Outcome
What shipped and
what it changed
The system went live serving thousands of coaches and clients across multiple timezones. Appointment confirmation, reminder, cancellation, and reschedule notifications all run fully automated. Coaches can configure exactly which events trigger SMS, for whom, and at what intervals. The credit system generates recurring revenue per message sent.
Most importantly — when I look at the architecture today, I would change very little. The layering holds. The separation of concerns holds. The financial guarantees hold. That's what good production engineering feels like.
09 · Tech Stack