Five architectural decisions I had to make differently after twenty years of making them in an enterprise.
Twenty years building enterprise software taught me a vocabulary — Clean Architecture, CQRS, circuit breakers, retry policies, DDD, the whole catechism — and the rigour that comes from running systems where something breaking at 2 AM stops a thousand people from doing their jobs.
Then I started building my own product. VoiceMeet Pro: a real-time audio meeting platform with hundred-participant rooms, studio-grade WebRTC audio via LiveKit, per-participant cloud recording, AI transcription via Whisper large-v3, host visibility controls, guest access without registration, waiting room, admin dashboard, and Razorpay and Stripe payments. The patterns did not stop being correct. They stopped being free. What follows are the five decisions where that distinction mattered most.
1. Every background process is a tenant. Know them all.
In an enterprise, the OS is the substrate. Somebody else’s job. I would provision an instance, assume the standard image was reasonable, and move on. The slack in those environments is enormous — so much memory that whatever the OS is doing in the background never surfaces in application metrics.
Right-size a production instance for cost discipline and that assumption becomes the first casualty.
What got me was a dnf automatic update timer. It fires, memory pressure spikes, the Node process gets pushed into swap, WebRTC signalling timeouts cascade through LiveKit, meetings drop. The symptom looks like an application bug. I spent two days chasing it through the wrong layers — LiveKit connection lifecycle, Node handler crashes, PostgreSQL connection pool exhaustion — before I ran systemctl list-timers and found the actual cause sitting there in plain text.
```bash
# The fix
sudo systemctl disable --now dnf-makecache.timer
sudo systemctl disable --now dnf-automatic.timer
sudo systemctl disable --now dnf-automatic-install.timer
```
Ninety seconds to fix. Two days to diagnose. One Google Play Store rejection as the cost of the lesson.
The deeper issue was architectural. In a tightly-tuned production environment running a real-time audio application where every megabyte of resident memory is accounted for, every system service is a tenant of the application’s resource budget. A dnf timer is no different from a microservice competing for the same memory pool — except that I had forgotten it existed.
What I implemented after this:
```bash
#!/bin/bash
# deployment-audit.sh — runs at every deploy
echo "=== Active Timers ==="
systemctl list-timers --no-pager
echo "=== Top Memory Consumers ==="
ps aux --sort=-%mem | head -20
echo "=== Swap Usage ==="
free -h | grep Swap
echo "=== Non-Essential Services ==="
systemctl list-units --type=service --state=running | grep -v -E "(sshd|nginx|pm2|node)"
```
Every active timer, every resident process, every swap consumer, visible at deploy time. Thirty seconds to run. Should have written it on day one. The enterprise let me get away with not knowing what was running alongside my application. My own product did not.
2. The serverless infrastructure chosen for one reason will quietly veto another.
I needed an event bus. The product has cascading side effects on every meeting lifecycle event:
```
MeetingEndedEvent
  → ArchiveRecordingsHandler   (move files, mark DB records)
  → NotifyParticipantsHandler  (email delivery)
  → UpdateBillingUsageHandler  (decrement plan quota)
  → ScheduleRetentionHandler   (set deletion_due_at: 30 days Free/Trial/Basic, 45 days Pro)
```
Without this separation, the meeting service becomes a god object importing email templates, file system utilities, billing modules, and cron schedulers. With an event bus, each handler is a standalone module subscribed to a domain event. Standard pattern.
The default tool in Node.js for this is BullMQ. Redis-backed, persistent, retries built in, dead letter queues, a dashboard. I already had Redis via Upstash. One-evening implementation. Except it was not.
BullMQ depends on persistent TCP connections to Redis and blocking commands (BRPOPLPUSH, XREADGROUP). Upstash is serverless Redis over HTTPS. No persistent connections. No blocking commands. The failure modes:
```javascript
// Silent failure: jobs enqueue successfully, never dequeue
// BullMQ's Worker calls BRPOPLPUSH → Upstash returns protocol error
// No exception surfaces in application logs
// Jobs sit in Redis indefinitely

// Loud failure: connection errors on specific code paths
// Error: "ERR unknown command 'BRPOPLPUSH'"
// Surfaces only when Worker.process() is invoked, not at connection time
```
Both failure modes in the same hour. Neither documented in a place I would have found before choosing the tool.
The options:
| Option | Problem |
|---|---|
| Self-managed Redis on app host | Memory budget already fully allocated to the application stack |
| Upstash higher tier | Still no guarantee of BRPOPLPUSH support under load |
| HTTP-based queue (Trigger.dev, Inngest) | New external dependency, new failure domain |
| Node.js EventEmitter in-process | No persistence, no retry, events lost on process crash |
I chose EventEmitter. The reasoning is worth preserving because I will be tempted to second-guess it.
The critical insight was separating what needs persistence from what needs decoupling. The most safety-critical side effect is retention-based deletion. If a meeting ends and deletion is not scheduled, I am holding user data past its retention window — a privacy and compliance liability. But retention does not need the event bus to be persistent. It needs the database to be persistent.
```javascript
// Excerpts from the meeting lifecycle module.
// db and logger come from src/infrastructure; fs is require('fs/promises').

// Event handler sets the deletion timestamp
async handleMeetingEnded(event) {
  const retentionDays = getPlanLimits(event.plan).recordingRetentionDays;
  // Free/Trial/Basic: 30 days, Pro: 45 days
  await db.query(
    'UPDATE recordings SET deletion_due_at = NOW() + $1 * INTERVAL \'1 day\' WHERE meeting_id = $2',
    [retentionDays, event.meetingId]
  );
}

// Nightly cron handles actual deletion — self-heals if event was lost
// 0 2 * * * node scripts/cleanup-expired-recordings.js
async function cleanupExpiredRecordings() {
  const expired = await db.query(
    'SELECT id, file_path FROM recordings WHERE deletion_due_at <= NOW()'
  );
  for (const recording of expired.rows) {
    await fs.unlink(recording.file_path);
    await db.query('DELETE FROM recordings WHERE id = $1', [recording.id]);
    logger.info({ recordingId: recording.id }, 'recording-deleted-by-retention');
  }
}
```
The event bus sets deletion_due_at. The cron executes deletion. If the event is lost, the cron self-heals on the next run because the database is the source of truth, not the event log. The event bus is now a notification mechanism. The persistence layer handles its own correctness.
The other handlers — notification, archival, usage tracking — are low-frequency, bounded-cost operations. A meeting ends a few times a day during early adoption. The risk of losing one event is small. The cost of losing one event is recoverable. The math works for now.
The non-negotiable part was the abstraction layer:
```javascript
// src/infrastructure/events/EventBus.js
const { EventEmitter } = require('events');

class EventBus {
  constructor() {
    this.emitter = new EventEmitter();
  }

  async emit(eventName, payload) {
    this.emitter.emit(eventName, payload);
  }

  subscribe(eventName, handler) {
    this.emitter.on(eventName, async (payload) => {
      try {
        await handler(payload);
      } catch (err) {
        logger.error({ event: eventName, err }, 'event-handler-failed');
      }
    });
  }
}

// When Redis becomes available, this becomes:
// const { Queue, Worker } = require('bullmq');
// class EventBus {
//   constructor(redisConnection) {
//     this.redis = redisConnection;
//     this.queue = new Queue('events', { connection: redisConnection });
//   }
//   async emit(eventName, payload) {
//     await this.queue.add(eventName, payload);
//   }
//   subscribe(eventName, handler) {
//     new Worker('events', async (job) => {
//       if (job.name === eventName) await handler(job.data);
//     }, { connection: this.redis });
//   }
// }
```
Every handler registers with EventBus. No handler imports EventEmitter directly. When the infrastructure supports a real Redis, the swap is one file. Every handler stays untouched. This is not premature abstraction — it is a seam placed at the exact point where the implementation is going to change, with the trigger condition documented: the day the instance has enough headroom for a resident Redis process, or the day a side effect becomes critical enough that “cron self-heals it” is no longer acceptable. The same principle applies to retry and resilience patterns — adopt the mechanism when the failure mode demands it, not when the pattern catalogue suggests it.
3. CQRS — adopt the cheap half, reject the expensive half.
CQRS has two parts, and people conflate them constantly.
The cheap half: command/query separation at the application layer. Write operations and read operations get different handlers, different validation, different shapes. Instead of a single meetingService.getOrCreateOrUpdateOrDelete() method, there is:
```
EndMeetingCommand { meetingId, hostId }
  → EndMeetingHandler
    → validates host ownership
    → calls domain logic
    → emits MeetingEndedEvent

GetMeetingDetailsQuery { meetingId, requesterId }
  → GetMeetingDetailsHandler
    → validates access
    → returns read-optimised DTO
```
This costs almost nothing to adopt. It pays back immediately in clarity — every entry point into the system has a name, a shape, and a single responsibility. Grepping for Command finds every write operation. Grepping for Query finds every read operation. When something breaks, the name tells me which side of the system to look at.
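A concrete sketch of that shape — hypothetical and in-memory here, with a Map standing in for the repository and synchronous calls standing in for the real async PostgreSQL-backed handlers:

```javascript
// Illustrative only: names, error messages, and the Map repository are mine.
class EndMeetingCommand {
  constructor({ meetingId, hostId }) {
    this.meetingId = meetingId;
    this.hostId = hostId;
  }
}

class EndMeetingHandler {
  constructor({ meetings, emit }) {
    this.meetings = meetings; // Map<id, meeting> standing in for the repository
    this.emit = emit;         // event bus hook
  }

  execute({ meetingId, hostId }) {
    const meeting = this.meetings.get(meetingId);
    if (!meeting) throw new Error(`Meeting ${meetingId} not found`);
    if (meeting.hostId !== hostId) throw new Error('Only the host can end a meeting');
    meeting.status = 'ended';
    this.emit('MeetingEnded', { meetingId, plan: meeting.hostPlan });
    return meeting;
  }
}

const events = [];
const meetings = new Map([[79, { id: 79, hostId: 6, hostPlan: 'pro', status: 'active' }]]);
const handler = new EndMeetingHandler({
  meetings,
  emit: (name, payload) => events.push({ name, payload }),
});
const ended = handler.execute(new EndMeetingCommand({ meetingId: 79, hostId: 6 }));
// ended.status → 'ended'; one MeetingEnded event recorded
```

The validation, the state mutation, and the event emission each have exactly one home, which is the whole point of the cheap half.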
The expensive half: separate read and write data stores. Writes go to normalised Postgres. Reads come from a denormalised Redis projection or a materialised view. The two are synchronised via event-driven projections. This pays back when reads and writes are genuinely divergent — a social media feed where the read shape (flat timeline with embedded author data) looks nothing like the write shape (normalised tables for users, posts, relationships). It does not pay back when reads and writes hit the same tables in the same shape.
VoiceMeet Pro reads and writes hit the same PostgreSQL tables in roughly the same shape. A meeting is created (INSERT), fetched for display (SELECT with JOIN), updated when it ends (UPDATE). Query latency is under 50ms against a single index. Adding a denormalised Redis projection would mean:
- Maintaining two data stores with synchronisation logic
- Handling cache invalidation on every write
- Reasoning about eventual consistency for every read
- Doubling the debugging surface when something is stale
All in service of shaving milliseconds off a query that is already fast enough. That is not architecture. That is waste.
Adopted: command/query separation, domain events emitted from command handlers, side-effect handlers via event bus.
Rejected: separate read/write stores, event sourcing, projection synchronisation.
Same principle applied to event sourcing. Powerful when “what was the state of this object at 3:47 PM last Tuesday” is a core requirement. It is not a core requirement here. An audit_log table and an updated_at column satisfy the compliance need. The operational simplicity of a current-state model is worth more than the theoretical elegance of an event-sourced one — at this stage, for this product, with this team size.
4. Classes earn their existence. They are not granted it.
Early on I had a meeting object that looked like this:
```javascript
const meeting = { id: 79, title: 'Daily Standup', host_id: 6, status: 'ended' };
```
The DDD instinct said: wrap it in a class. I opened the file, looked at what the class would contain — this.id = row.id, this.title = row.title — and stopped. There were no invariants. No behaviour. No reason for the class to exist. Creating it would have been cargo-culting.
The class earned its existence when plan enforcement arrived. VoiceMeet Pro has four tiers — Free, Trial, Basic, Pro — each with hard limits:
```javascript
// src/domain/plans.js
const PLAN_LIMITS = {
  free:  { maxParticipants: 10,  maxMeetingMinutes: 30,   maxRecordings: 3,    recordingRetentionDays: 30 },
  trial: { maxParticipants: 20,  maxMeetingMinutes: 40,   maxRecordings: 5,    recordingRetentionDays: 30 },
  basic: { maxParticipants: 20,  maxMeetingMinutes: 90,   maxRecordings: 25,   recordingRetentionDays: 30 },
  pro:   { maxParticipants: 100, maxMeetingMinutes: null, maxRecordings: null, recordingRetentionDays: 45 },
};
```
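The getPlanLimits helper that the entities call is used throughout but never shown. A plausible sketch — the fallback to the free tier for an unknown plan name is my assumption (failing closed to the most restrictive limits rather than returning undefined):

```javascript
// src/domain/plans.js (sketch). PLAN_LIMITS repeated from above so this
// runs standalone; the fallback behaviour of getPlanLimits is an assumption.
const PLAN_LIMITS = {
  free:  { maxParticipants: 10,  maxMeetingMinutes: 30,   maxRecordings: 3,    recordingRetentionDays: 30 },
  trial: { maxParticipants: 20,  maxMeetingMinutes: 40,   maxRecordings: 5,    recordingRetentionDays: 30 },
  basic: { maxParticipants: 20,  maxMeetingMinutes: 90,   maxRecordings: 25,   recordingRetentionDays: 30 },
  pro:   { maxParticipants: 100, maxMeetingMinutes: null, maxRecordings: null, recordingRetentionDays: 45 },
};

function getPlanLimits(plan) {
  // Unknown or missing plan names get the most restrictive limits.
  return PLAN_LIMITS[String(plan).toLowerCase()] || PLAN_LIMITS.free;
}
```

Failing closed matters here: a typo in a plan name should shrink a room, never silently grant Pro capacity.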
These are configuration values. Configuration is not enforcement. Without domain-level enforcement, a race condition or a direct API call can bypass the controller-level check. The invariant has to live where the state mutation happens.
```javascript
// src/domain/entities/Meeting.js
const { MeetingCapacityError } = require('../errors/MeetingCapacityError');
const { MeetingDurationExceededError } = require('../errors/MeetingDurationExceededError');
const { getPlanLimits } = require('../plans');

class Meeting {
  constructor({ id, title, hostId, hostPlan, participants = [], startedAt, status }) {
    this.id = id;
    this.title = title;
    this.hostId = hostId;
    this.hostPlan = hostPlan;
    this.participants = participants;
    this.startedAt = startedAt;
    this.status = status;
  }

  addParticipant(user) {
    const limits = getPlanLimits(this.hostPlan);
    if (limits.maxParticipants && this.participants.length >= limits.maxParticipants) {
      throw new MeetingCapacityError(this.id, limits.maxParticipants, this.hostPlan);
    }
    if (this.participants.find(p => p.id === user.id)) {
      return; // idempotent — already joined
    }
    this.participants.push(user);
  }

  checkDurationLimit() {
    if (this.status !== 'active') return;
    const limits = getPlanLimits(this.hostPlan);
    if (!limits.maxMeetingMinutes) return; // Pro: unlimited
    const elapsed = (Date.now() - this.startedAt.getTime()) / 60000;
    if (elapsed >= limits.maxMeetingMinutes) {
      throw new MeetingDurationExceededError(this.id, limits.maxMeetingMinutes, this.hostPlan);
    }
  }
}
```
The errors are typed and contextual:
```javascript
// src/domain/errors/MeetingCapacityError.js
const { AppError } = require('./AppError');

class MeetingCapacityError extends AppError {
  constructor(meetingId, maxParticipants, plan) {
    super(
      `Meeting ${meetingId} has reached the ${plan} plan limit of ${maxParticipants} participants`,
      403
    );
    this.meetingId = meetingId;
    this.maxParticipants = maxParticipants;
    this.plan = plan;
  }
}
```
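The AppError base class those errors extend is referenced but never shown. A minimal sketch of what it plausibly looks like — the statusCode mechanics are inferred from the 403 passed above, and the RecordingLimitError shown with it is a hypothetical sibling in the same style:

```javascript
// src/domain/errors/AppError.js (sketch — shape inferred, not the actual file)
class AppError extends Error {
  constructor(message, statusCode = 500) {
    super(message);
    this.name = this.constructor.name;   // subclass name survives logging/serialisation
    this.statusCode = statusCode;        // consumed by the HTTP error middleware
    Error.captureStackTrace(this, this.constructor);
  }
}

// Hypothetical sibling error, same style as MeetingCapacityError
class RecordingLimitError extends AppError {
  constructor(userId, maxRecordings, plan) {
    super(`User ${userId} has reached the ${plan} plan limit of ${maxRecordings} recordings`, 403);
    this.userId = userId;
    this.maxRecordings = maxRecordings;
    this.plan = plan;
  }
}

const recordingError = new RecordingLimitError(6, 25, 'basic');
// recordingError.name → 'RecordingLimitError'; recordingError.statusCode → 403
```

Setting name from this.constructor means every subclass identifies itself correctly without repeating boilerplate.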
MeetingCapacityError. MeetingDurationExceededError. RecordingLimitError. TrialExpiredError. Each one is a domain error in domain/errors/, each one carries typed context — meeting ID, limit value, plan name — instead of a raw string. Grepping for MeetingCapacityError finds every place in the codebase that handles room-full scenarios. Tests assert against the error class, not against a message string that someone will typo on the next refactor.
```javascript
expect(() => meeting.addParticipant(user21))
  .toThrow(MeetingCapacityError);

expect(() => meeting.addParticipant(user21))
  .toThrow(expect.objectContaining({ maxParticipants: 20, plan: 'trial' }));
```
No code path bypasses the invariant because there is no other way to mutate participant state. The controller cannot bypass it. A direct service call cannot bypass it. The entity owns its rules.
What I rejected alongside this: JavaScript interface files (IUserRepository.js). In TypeScript, an interface is a compile-time contract the compiler enforces. In JavaScript, it is a comment that decays. The test suite is the actual contract. When the codebase migrates to TypeScript, interfaces become real. Until then, they are ceremony.
5. The pattern that protects itself.
The folder structure after the restructure:
```
/src
  /api
    /routes          ← HTTP verb + path → controller method. Nothing else.
    /controllers     ← Receive request → validate shape → call service → send response.
    /middleware      ← Auth, rate limiting, error handling.
  /services          ← Application layer. Orchestrates domain + infrastructure.
  /domain
    /entities        ← Meeting, Recording — classes with invariants.
    /errors          ← AppError, MeetingCapacityError, RecordingLimitError...
    plans.js         ← Plan limits configuration.
  /infrastructure
    /db              ← PostgreSQL pool + query helpers. Only place that imports 'pg'.
    /redis           ← Redis client. Only place that imports 'ioredis'.
    /livekit         ← LiveKit + Egress wrappers. Only place that imports 'livekit-server-sdk'.
    /email           ← Nodemailer wrapper. Only place that imports 'nodemailer'.
    /events          ← EventBus abstraction.
    /config          ← Environment variables. Imports nothing except Node built-ins.
/tests
  /unit
  /integration
  /architecture      ← Fitness functions. The point of this section.
```
This structure encodes rules. api never imports from infrastructure. domain never imports from api or services or infrastructure. services never import from api. infrastructure/config never imports from anything except Node built-ins. No circular dependencies anywhere.
These rules are easy to write down and extraordinarily hard to enforce through human discipline. In a team, code review catches violations. Solo, there is no reviewer except me at 2 AM, and I at 2 AM have different standards than I at 10 AM.
The solution is to put the discipline somewhere it cannot be tired.
```javascript
// .dependency-cruiser.js
module.exports = {
  forbidden: [
    {
      name: 'no-api-to-infrastructure',
      comment: 'Controllers must not import infrastructure directly. Use a service.',
      severity: 'error',
      from: { path: '^src/api' },
      to: { path: '^src/infrastructure' }
    },
    {
      name: 'no-domain-external-deps',
      comment: 'Domain must have zero external dependencies.',
      severity: 'error',
      from: { path: '^src/domain' },
      to: { path: '^src/(api|services|infrastructure)' }
    },
    {
      name: 'no-services-to-api',
      comment: 'Services must not know HTTP exists. Must be callable from CLI, cron, or test.',
      severity: 'error',
      from: { path: '^src/services' },
      to: { path: '^src/api' }
    },
    {
      name: 'no-config-internal-deps',
      comment: 'Config is loaded first. Circular deps here cause undefined at runtime.',
      severity: 'error',
      from: { path: '^src/infrastructure/config' },
      to: { path: '^src/' }
    },
    {
      name: 'no-circular',
      comment: 'Circular requires in Node.js return partially-initialised modules.',
      severity: 'error',
      from: {},
      to: { circular: true }
    }
  ]
};
```
```yaml
# .github/workflows/ci.yml
- name: Architecture fitness check
  run: npx depcruise src --config .dependency-cruiser.js --output-type err
```
When a violation occurs:
```
error src/api/controllers/recording.controller.js →
      src/infrastructure/livekit/livekit.service.js
  Controllers must not import infrastructure directly. Use a service.

BUILD FAILED
```
The pull request cannot merge. The rule is enforced by the pipeline, not by memory, not by good intentions, not by “I’ll fix it in the next PR.”
This is the most important decision in the entire restructure. The folder layout, the CQRS separation, the entity classes, the event bus — every one of them can be violated silently. A developer can decide the rule does not apply to this particular case, and the code compiles, the tests pass, and the violation lands. The fitness tests cannot be violated silently. They run on every push. They produce an error that points to the exact file and the exact import. There is no judgment call.
For a solo developer, this is a senior architect reviewing every commit. The architect never takes a day off. Never gets tired. Never lets a violation slide because the deadline is tomorrow. The architect is a CI step that runs in eight seconds and reports, with no diplomacy, that a rule has been broken.
I could have had this in the enterprise. It was always available. I never adopted it because the human review process was good enough. The problem with “good enough” is that it kills the motivation to reach for the thing that would have been better.
What stays with me
The patterns translate. But what twenty years of enterprise work actually trained me for is not the patterns themselves. It is knowing which ones to defer, with what justification, and with a clear understanding of when each deferred decision needs to be revisited.
In the enterprise, deferral was dangerous. Defer a pattern, forget about it, the system grows around the absence, and three years later there is a refactoring project that takes a quarter. Senior architects lean toward over-engineering precisely because the cost of forgetting is high.
Building my own product inverted this. Every pattern I adopted that I did not need was a week I was not shipping. I learned that the question is not “is this the right pattern” but “is this the right pattern now, for this product, at this stage, given these constraints.” The answer was almost always more conservative than my enterprise instincts told me.
Each of the five decisions above is a specific deferral with a specific trigger condition for revisiting. The event bus will become BullMQ when the infrastructure supports a resident Redis. The CQRS model will get a read store when query patterns diverge from write patterns. The domain model will grow more entities as more invariants emerge. The fitness tests will get stricter as the codebase grows.
Architecture decision records are not justifications for what I built. They are documentation of what I chose not to build, and the conditions under which that choice should be reconsidered.
I am the Founder and Software Architect at TechScriptAid, building VoiceMeet Pro and ScriptPost AI. Twenty years across semiconductor, airline, supply chain, energy, e-commerce, logistics, and SaaS — most recently as Principal Engineer at ASM Technologies consulting for Applied Materials. I write about enterprise architecture and the realities of bootstrapped engineering at techscriptaid.com and publish tutorials on YouTube @TECHScriptaid.
