Building Aeiva in Public: What We're Learning Building Our Longevity App

TL;DR: Aeiva is the longevity app we're building at Baxter Labs — an AI health twin under the hood, designed to help you actually do the things that extend healthy lifespan. We're building it in public and sharing the build as it happens. So far the biggest realization is that AI products are mostly not about the AI. The work that's mattered most is upstream: data quality, latency, memory, and surfacing uncertainty. Five things we're learning, with the caveat that we're still very much in the middle of figuring this out.

Aeiva is the longevity app we're building at Baxter Labs. Under the hood it's an AI health twin — a digital model of you, fed by your wearable, that adapts as you do. The point of the app isn't gadget-y dashboards or another step counter. It's the longer arc: helping you make the daily decisions — sleep, training load, recovery, stress — that compound into healthspan.

It pulls real-time data from your wearable — Apple Watch, Oura Ring, Whoop, Fitbit, Garmin — runs it through a recovery and performance model, and rewrites your workout plan based on how your body actually responded yesterday. The pitch is simple: stop guessing whether to push or rest, and let the small, consistent decisions stack up.

The pitch sounds easy. The build is not.

This post is the inside story of what we're getting wrong as we go, what we're starting to get right, and the five things shaping how we think about every other AI feature we're touching. Treat them as in-progress notes, not signed-off lessons.

1. Wearable data is dirtier than we expected

I assumed, naively, that the hardware vendors had solved data quality. They have not. Apple Watch will sometimes report a 38 BPM resting heart rate that is not a heart rate at all — it's the watch losing skin contact while you sleep. Oura's "deep sleep" estimate can vary by 40 minutes night to night for the same actual sleep. Whoop's HRV is excellent if you wear it correctly; about 30% of users do not.

The first version of Aeiva treated all this data as ground truth. The model produced beautiful, confident, often-wrong recommendations. Early testers told us, "I slept fine, why is the app saying I'm overtrained?" — because a single dropout in the night had skewed the average. For a longevity app — where the whole premise is consistent, trustworthy daily nudges — that's a fatal flaw if we don't fix it.

What's helping isn't a bigger model. It's a data quality layer we're iterating on that runs before the model ever sees a number — outlier rejection, gap interpolation, source-confidence weighting. The model now sees a sanitized, time-aligned stream with explicit "low confidence" flags, and recommendations got noticeably better the day we shipped the first version of that pre-processor. We're still tuning it, and we expect to keep tuning it for a long time.
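To make the idea concrete, here is a minimal sketch of that kind of pre-processor for one night of heart-rate samples. It is not Aeiva's actual code: the `Sample` type, the `clean_heart_rate` and `weighted_mean_bpm` names, the 25% deviation threshold, the 3-sample gap limit, and the 0.25 weight are all assumptions picked for illustration. Per-source weighting is reduced here to down-weighting anything flagged low confidence.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Sample:
    minute: int                   # minutes since the start of the night
    bpm: Optional[float]          # None marks a gap (no usable reading)
    low_confidence: bool = False  # set when we rejected or filled the value


def clean_heart_rate(samples: list[Sample],
                     max_deviation: float = 0.25,  # hypothetical threshold
                     max_gap: int = 3) -> list[Sample]:
    """Reject outliers, interpolate short gaps, flag everything we touched."""
    valid = [s.bpm for s in samples if s.bpm is not None]
    if not valid:
        return []
    baseline = median(valid)

    # 1. Outlier rejection: readings far from the night's median become gaps.
    cleaned = []
    for s in samples:
        bpm = s.bpm
        if bpm is not None and abs(bpm - baseline) / baseline > max_deviation:
            bpm = None
        cleaned.append(Sample(s.minute, bpm))

    # 2. Gap interpolation: fill short interior gaps linearly.
    i = 0
    while i < len(cleaned):
        if cleaned[i].bpm is not None:
            i += 1
            continue
        j = i
        while j < len(cleaned) and cleaned[j].bpm is None:
            j += 1
        fillable = i > 0 and j < len(cleaned) and (j - i) <= max_gap
        for k in range(i, j):
            cleaned[k].low_confidence = True  # flagged whether filled or not
            if fillable:
                left, right = cleaned[i - 1], cleaned[j]
                t = (cleaned[k].minute - left.minute) / (right.minute - left.minute)
                cleaned[k].bpm = left.bpm + t * (right.bpm - left.bpm)
        i = j
    return cleaned


def weighted_mean_bpm(samples: list[Sample],
                      low_conf_weight: float = 0.25) -> Optional[float]:
    """3. Confidence weighting: flagged samples count for less in the average."""
    pairs = [(s.bpm, low_conf_weight if s.low_confidence else 1.0)
             for s in samples if s.bpm is not None]
    total = sum(w for _, w in pairs)
    return sum(b * w for b, w in pairs) / total if total else None
```

With this in place, a lone 38 BPM reading in an otherwise mid-50s night is replaced by an interpolated, flagged value instead of dragging the average down.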

Working theory: in any AI product, the moat is usually upstream of the model.

2. Surface uncertainty — don't hide it

Early Aeiva recommendations were declarative: "Push hard today." Testers didn't love it. Not because the recommendation was wrong (it was usually right), but because when it was wrong, there was no graceful fallback. They felt gaslit by the app. For a product asking people to trust it with the long arc of their health, that gaslit feeling is poison.

We're rebuilding the UX to expose the model's confidence: "Based on 6 of 7 nights of clean data, your readiness is high — push hard today. (Last night had partial sensor dropout; if today feels off, trust your body.)"
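A rough sketch of how a message like that could be assembled from per-night data-quality flags. The `readiness_message` function, the `ADVICE` table, and the "fewer than half the nights are clean" cutoff are hypothetical, not how Aeiva actually words or gates its recommendations.

```python
# Hypothetical mapping from readiness level to the advice shown to the user.
ADVICE = {
    "high": "push hard today",
    "moderate": "train as planned",
    "low": "prioritize recovery today",
}


def readiness_message(nights_clean: list[bool], readiness: str) -> str:
    """Build a recommendation that states how much clean data backs it.

    nights_clean holds one flag per night, oldest first; True means the
    night's sensor data passed quality checks.
    """
    clean = sum(nights_clean)
    total = len(nights_clean)

    # Assumed cutoff: with too little clean data, decline to make a call.
    if total == 0 or clean / total < 0.5:
        return "We don't have enough clean data to make a call today. Go by how you feel."

    message = (
        f"Based on {clean} of {total} nights of clean data, "
        f"your readiness is {readiness}: {ADVICE[readiness]}."
    )
    # Call out the most recent night explicitly, since it matters most today.
    if not nights_clean[-1]:
        message += (
            " (Last night had partial sensor dropout; "
            "if today feels off, trust your body.)"
        )
    return message
```

The design choice worth noting is that the caveat is generated from the same quality flags the model sees, so the wording cannot drift away from what the data actually supports.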

Two things shifted in early data. First, trust went up: counterintuitively, telling users about the limitation seemed to make them believe the rest more. Second, churn went down. People who would have quit after one bad recommendation now had a frame to interpret it. We don't have the sample size yet to call this validated, but the signal is consistent enough that we're committing to the pattern.

Working theory: AI products that try to hide uncertainty feel arrogant. AI products that surface it feel like a smart partner. The former churns. The latter compounds — which is exactly what a longevity app needs to do.

3. Personalization is mostly memory, not fine-tuning

The most-asked engineering question we got at the start of Aeiva was, "Are you fine-tuning a model per user?" The answer was, and is, no.

What we're doing instead is maintaining a structured per-user memory: their typical resting heart rate, their stated longevity goals, their training history, their feedback ("the run plan was too aggressive on Tuesday"). At inference time, we inject the relevant slice of that memory into the prompt. The base model is the same for every user. The context is unique.

This is a 10x simpler architecture than per-user fine-tunes, and so far it's working better than we expected. Testers can tell the app "I'm coming back from injury, take it easy this week" and the next plan reflects it — instantly, without retraining anything. The longer someone uses Aeiva, the more memory it accumulates, and the more the recommendations feel like they're written for that specific person, not a population. We may eventually need actual fine-tuning for specific cohorts, but we're putting off that decision until the memory layer hits a wall it can't get past.
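Here is a minimal sketch of the memory-in-the-prompt pattern, assuming a simple "most recent items" rule for picking the relevant slice. The `UserMemory` fields mirror the examples above, but the class, the `memory_slice` and `build_prompt` helpers, and the prompt wording are illustrative assumptions, and no particular model API is implied.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    typical_resting_hr: int
    longevity_goals: list[str]
    training_history: list[str] = field(default_factory=list)  # oldest first
    feedback: list[str] = field(default_factory=list)          # oldest first


def memory_slice(memory: UserMemory, max_items: int = 2) -> list[str]:
    """Pick what goes into the prompt; here, simply the most recent items."""
    return [
        f"Typical resting heart rate: {memory.typical_resting_hr} BPM",
        "Longevity goals: " + "; ".join(memory.longevity_goals),
        "Recent training: " + "; ".join(memory.training_history[-max_items:]),
        "Recent feedback: " + "; ".join(memory.feedback[-max_items:]),
    ]


def build_prompt(memory: UserMemory, todays_readiness: str) -> str:
    """Same base instructions for every user; only the context block differs."""
    context = "\n".join(f"- {line}" for line in memory_slice(memory))
    return (
        "You are a training coach. Adapt the plan to this user.\n"
        f"What we know about the user:\n{context}\n"
        f"Today's readiness: {todays_readiness}\n"
        "Write tomorrow's workout plan."
    )
```

Appending "I'm coming back from injury, take it easy this week" to `feedback` changes the very next prompt, which is why the plan can adapt without retraining anything.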

Working theory: if your product feels personalized, it's probably context, not weights. Build the memory layer first.

4. Real-time is a feature, but only if it's actually real-time

We promised real-time updates. The first iteration polled wearable APIs every 15 minutes and called it "real-time." Users called it laggy. They were right.

What we're moving toward is real real-time: subscribing to push events from each wearable platform (where they expose them), maintaining a websocket connection to the app, and pre-computing the next likely recommendation so it can render in under 200ms when the user opens the app. None of that is glamorous, all of it is in flight, and most of the work is plumbing — but it's the difference between "real-time" and real-time.
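The pre-compute part of that plumbing can be sketched as a small cache that is refreshed whenever a push event arrives, so opening the app becomes a lookup rather than a model call. This is a simplified illustration, not our production design: `RecommendationCache`, its method names, and the 15-minute staleness limit are assumptions, and the websocket and vendor push subscriptions are left out entirely.

```python
import time
from typing import Callable, Optional


class RecommendationCache:
    """Recompute on push events so that reads are a dictionary lookup."""

    def __init__(self,
                 compute: Callable[[str, str], str],
                 max_age_s: float = 900.0,  # hypothetical staleness limit
                 clock: Callable[[], float] = time.monotonic):
        self._compute = compute      # the slow step, e.g. a model call
        self._max_age_s = max_age_s
        self._clock = clock          # injectable so tests can control time
        self._entries: dict[str, tuple[str, float]] = {}

    def on_wearable_event(self, user_id: str, event: str) -> None:
        """Push handler: do the slow work now, before the user asks."""
        recommendation = self._compute(user_id, event)
        self._entries[user_id] = (recommendation, self._clock())

    def get(self, user_id: str) -> Optional[str]:
        """Fast path for app open. None means: fall back to computing live."""
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        recommendation, computed_at = entry
        if self._clock() - computed_at > self._max_age_s:
            return None  # too old to show as if it were current
        return recommendation
```

Returning `None` for stale entries is a deliberate choice in this sketch: showing an old recommendation as if it were fresh would undercut the point of the feature.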

Working theory: latency is a product surface. The model can be brilliant; if the answer takes 8 seconds to render, users will believe the model is dumb.

5. The model isn't the product. The loop is.

The single biggest mental shift we've had building Aeiva is realizing the product isn't "AI tells you what to do." It's a loop: AI proposes → user does → wearable measures → AI updates. Every part of that loop has to work for the product to feel magical, and any one part breaking is enough to make the whole thing feel broken. For a longevity app, that loop also has to keep working for years, not weeks — which is a different bar than most AI products are designed for.

Most AI projects we've seen optimize the "AI proposes" step and ignore the rest. We're trying to instrument the whole loop:

How long after the workout did the user re-open the app? (Engagement)
Did the wearable capture the workout cleanly? (Data quality)
Did the AI's adjustment for tomorrow reflect today's data? (Model freshness)
Did the user accept the adjustment, or override it? (Trust calibration)

The dashboards we're building measure the loop, not just the model. That mental shift is starting to inform every other AI feature and tool we're touching at Baxter Labs.
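As a sketch of what measuring the loop might look like, the four questions above can be recorded once per workout and rolled up into one number each. The `LoopRecord` fields and the `loop_metrics` function are hypothetical names, and real dashboards would need more than simple rates and a median.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class LoopRecord:
    """One pass through the loop: propose, do, measure, update."""
    minutes_to_reopen: Optional[float]  # None if the user never came back
    workout_captured_cleanly: bool      # data quality
    plan_used_todays_data: bool         # model freshness
    user_accepted: bool                 # trust calibration


def loop_metrics(records: list[LoopRecord]) -> dict[str, Optional[float]]:
    """Summarize every stage of the loop, not just the model's output."""
    n = len(records)
    if n == 0:
        return {}
    reopen_times = [r.minutes_to_reopen for r in records
                    if r.minutes_to_reopen is not None]
    return {
        "median_minutes_to_reopen": median(reopen_times) if reopen_times else None,
        "reopen_rate": len(reopen_times) / n,
        "clean_capture_rate": sum(r.workout_captured_cleanly for r in records) / n,
        "model_freshness_rate": sum(r.plan_used_todays_data for r in records) / n,
        "acceptance_rate": sum(r.user_accepted for r in records) / n,
    }
```

Putting all stages in one summary makes it easier to see which node is failing when the product feels broken, for example a healthy acceptance rate sitting next to a poor clean-capture rate.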

Working theory: the model is one node in a graph. Build the graph and watch all of it.

Where Aeiva is heading

The next stretch of build, roughly in order of how far along each piece is: deeper integration with continuous glucose monitors, conversational coaching that treats your goals like a multi-year arc rather than a single workout, and a partner program for clinicians who'd want Aeiva data inside their care plans. None of those are shipped. They're on the roadmap, somewhere between "actively being designed" and "we still need to figure out the right partner."

If you've been reading and any of this sounds familiar from your own build — it probably is. The thing we're starting to internalize is that AI products are mostly not about the AI. They're about data, latency, memory, uncertainty, and the loop. The model is, increasingly, the easy part — and for a longevity app, the hard part is making sure those upstream pieces stay trustworthy for long enough that the compounding actually happens.

You can take a look at where Aeiva is right now at baxter-labs.dev/aeiva. We're early — there will be sharp edges. If you're building something adjacent and want to compare notes mid-build, my inbox is open.

— Eshwar PK, Founder, Baxter Labs