
Lessons from Building Small AI Tools
What I actually learned building small AI tools: why scope clarifies thinking, why 'it works' isn't enough, and why most of what I built was only ever for me.
I didn't expect building small AI tools to be humbling. Interesting, yes. Occasionally impressive. But the lessons that stuck weren't about the technology. They were about judgment, utility, and the uncomfortable gap between a thing that works and a thing worth using.
Small Is a Feature, Not a Constraint
Most AI discourse orbits scale: billion-parameter models, enterprise deployments, infrastructure budgets that would make a small company wince. When I started building, staying small wasn't a philosophy. It was just the reality of limited time and genuine uncertainty about whether any of this was worth investing in heavily before I understood it better.
What surprised me is how much the constraint clarified things. My first tools each did exactly one job. I had a broader idea, a kind of personal knowledge graph assistant, that I shelved almost immediately because I couldn't scope it tightly enough to test it honestly. The narrow version, the one that only did one thing, was the one I could actually evaluate. I could tell within a week whether it was earning its place.
The SDG Group makes a similar argument about why the best AI projects start small: it isn't about timidity. Limited initial scope produces higher adoption rates and faster iteration. Deliberately restricting features isn't a concession to resource constraints. It's the actual strategy. When you build AI tools at this scale, you can evaluate them honestly. Narrow building is pedagogically irreplaceable in a way I couldn't have understood before doing it.
The Gap Between "It Works" and "It's Useful"
There's a specific seductiveness to the demo moment. The prototype does something impressive (accurately, coherently, with apparent intelligence) and you feel like you've built something real. That feeling is one of the more dangerous parts of this process.
I built a note-summarisation tool that worked, technically. It produced clean, accurate summaries. And I used it maybe three times before quietly reverting to my manual method. The output quality wasn't the problem. The problem was that the summaries arrived in a format I didn't want to read, structured in a way that made sense for the model's training but didn't fit any habit or workflow I actually had. Real, repeated use is the only honest metric, and mine was damning.
The lesson, once the brief excitement about how well it worked had faded, was that the gap between "it works" and "it's useful" lives almost entirely in whether the output slots into something your actual behaviour already supports. Features that look impressive in a demo can be actively unhelpful in practice if they don't connect to a real need. The CopilotKit team's hard-earned lessons from shipping AI agents make this concrete: domain-specific grounding and human-in-the-loop design are what separate the tools that get used from the prototypes that get abandoned. I would've saved myself weeks if I'd read that earlier and taken it seriously.
Part of what makes this gap so persistent is the nature of the input data itself. I kept feeding my summarisation tool notes that were shapeless, half-formed, written for no audience, and then wondering why the outputs felt clinical. The model was doing its job. I was asking it to perform tasks on material that hadn't been prepared for the task. That's a workflow problem, not a model problem, and it took me embarrassingly long to see it.
Prompts Are Logic, Not Instructions
This took me longer to understand than I'd like to admit.
I kept writing prompts the way you'd write a Slack message to a helpful colleague: conversational, contextual, slightly optimistic about what the other party would infer. The results were inconsistent in ways I found genuinely frustrating. I blamed the model. I adjusted the temperature. I tried different phrasings that amounted to the same structural looseness.
The shift happened when I started thinking about prompts the way developers think about functions. A prompt has inputs, expected outputs, edge cases, and failure modes. The rules you encode, the conditions under which it should behave differently, the constraints that guard against privacy concerns in the inputs: these aren't decorative. They're the actual logic. A vague prompt produces variable outputs the same way a vague function specification produces buggy code. Once I started writing prompts with that discipline, the consistency improved measurably.
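To make the function analogy concrete, here is a minimal sketch of a prompt written as a small specification. It is an illustration only, not the tool described in this post: the `SummaryPrompt` class, its rules, and the `NOTHING_TO_SUMMARISE` sentinel are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryPrompt:
    """Hypothetical prompt spec: explicit inputs, output rules, and edge cases."""

    max_bullets: int = 5
    # Output format, a privacy constraint, and an explicit edge-case rule.
    rules: tuple[str, ...] = (
        "Return plain-text bullets, one per line, each starting with '- '.",
        "Do not include names, email addresses, or phone numbers.",
        "If the note has no usable content, return exactly: NOTHING_TO_SUMMARISE",
    )

    def render(self, note: str) -> str:
        """Build the prompt text; reject input the spec does not cover."""
        note = note.strip()
        if not note:
            # Failure mode handled in code rather than left to the model.
            raise ValueError("note is empty")
        rule_lines = "\n".join(
            f"{i}. {rule}" for i, rule in enumerate(self.rules, start=1)
        )
        return (
            f"Summarise the note below in at most {self.max_bullets} bullets.\n"
            f"Rules:\n{rule_lines}\n"
            f"Note:\n<<<\n{note}\n>>>"
        )


if __name__ == "__main__":
    print(SummaryPrompt(max_bullets=3).render("Call the plumber on Friday."))
```

The point of the shape is that the same note always produces the same prompt text, and each rule is a line you can change and retest on its own. The same structure works equally well written out by hand in a no-code tool.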
What Telerik's team learned designing AI dev tools maps onto this directly: the real shift in AI tool design is about judgment, not intelligence. Reliability over cleverness. That reframe, from "write a good instruction" to "encode reliable logic", is probably the single most transferable thing I've learned. It doesn't matter whether you write code around your prompts or use them in a no-code interface; the discipline of specifying inputs, outputs, and edge cases applies equally. The tools I've seen fail most visibly are the ones where the prompt design was treated as an afterthought rather than the architecture.
Iteration Looks Different When the Model Is the Variable
I thought I understood iteration. Write, test, adjust, repeat. What I didn't account for was building on top of something I couldn't fully control or stabilise.
Model versions change. Subscribing to a model API doesn't freeze the behaviour of what you're subscribing to. New versions arrive, outputs shift, and the same prompt that worked reliably last month can start producing subtly different results without warning. It's a bit like how YouTube periodically rewires its recommendation logic underneath you: the surface stays the same but the underlying behaviour drifts, and you only notice when something you relied on stops working. This is one of the structural realities of cloud development that doesn't get discussed enough in beginner write-ups: the platform you're building on is itself a moving target.
I had a prompt that handled a specific edge case cleanly for several weeks. After what I assume was a model update (I can't be certain), it started handling that same case differently. Diagnosing it felt strange. I was debugging something I couldn't read the source code for. At one point I genuinely considered whether to abandon the dependency and rebuild using a different approach entirely.
Thoughtworks' argument that developers need to think more, not less, when working with AI is the right frame here. The stochasticity doesn't reduce the cognitive demand; it changes its shape. You're not debugging deterministic logic anymore. You're developing judgment about probabilistic behaviour. That's a different skill, and it takes longer to build than I expected. The deep learning models underlying these tools weren't designed with your specific edge case in mind. They were trained on data broad enough to make them generally capable, which means their failure modes are general too: diffuse, hard to isolate, resistant to the kind of systematic debugging that works on deterministic systems.
Most of What I Built Was for Myself
Honest accounting: the overwhelming majority of the small tools I've built had exactly one intended user. Me.
That turned out to be the best possible condition for learning. Feedback is instant and unambiguous: you either open the thing again or you don't. There's no press release to write, no need to pitch it to anyone. The audience collapses to a single person, and that removes a whole category of noise from the evaluation.
There's something worth naming about the rights you have when you build for yourself. You're the author, the tester, and the critic simultaneously. You own the full loop. When a tool built for myself stuck around, it was because it genuinely earned its place in my actual day, not because I'd convinced someone else to try it. The writers who taught me the most about this process were the ones who talked about personal tools as first-class projects rather than practice runs for the real thing. They were right.
About half of what I built lasted more than two weeks. The tools that stuck shared one thing: they solved a problem I encountered repeatedly, not one I imagined I might encounter in future. There's a useful contrast here with the kind of tools built for customer service or enterprise automation, where the use case is prescribed and the user is someone else. When you're building for yourself, you can't paper over a misfit with documentation or training. You just stop using it.
What I'd Tell Myself Before Starting
Not a list. More like a letter.
Start with tools that are almost embarrassingly small. The conditions for learning aren't about complexity; they're about closing the loop between building and feedback as quickly as possible. Work through problems step by step, even when that feels slower than reaching for the most capable model available. The discipline of decomposing a problem before throwing a tool at it pays back more than any prompt trick I've encountered.
The framing of a tool (how you introduce it to yourself, how you describe what it's supposed to do before you build it) turns out to matter as much as the implementation. I wasted real time on tools I'd framed too vaguely at the start. If you can't describe the one job a tool does in a single plain sentence before you build it, you probably don't understand it well enough yet. This is true whether you're building a personal reading assistant or something more ambitious in the vein of autonomous vehicles or large-scale process automation. Scope clarity is not a luxury for when you have resources to spare; it's what makes any project evaluable at all.
Know when to seek another perspective: when to show someone what you've built and ask if it's actually useful. I stayed in solo-building mode longer than was helpful in a few cases. The gap between "it works" and "it's useful" is much easier for someone else to see than for you to see yourself.
I would have started smaller, and I would have started sooner. The prompt-as-logic shift, the honest accounting of actual use, the acceptance that the model is a variable and not a constant: none of these required sophisticated projects to encounter. They were all waiting in the simplest possible version of the problem.
Key takeaways
- Narrow scope is a strategy, not a limitation: tools scoped to a single job are faster to evaluate and produce clearer learning than broader, more ambitious prototypes.
- Utilisation is the only honest metric: a tool that works but doesn't get used has failed, regardless of output quality; fit to actual workflow matters more than demo impressiveness.
- Prompts are logic, not requests: treating prompt design as engineering (inputs, outputs, edge cases, failure modes) produces measurably more consistent results than conversational phrasing.
- Building on a stochastic variable requires a different kind of debugging: model variability demands strategic judgment, not just technical skill; the instability doesn't go away, but your relationship to it can become more productive.
- Personal tools are the real projects: building for yourself collapses the feedback loop, removes noise, and forces an honest reckoning with whether something is genuinely useful.
FAQ
Do I need to be a developer to build small AI tools?
Not in the traditional sense. Some of the most useful small tools I've encountered involve almost no code: prompt chains, simple automations, structured workflows. The thinking required is closer to logic design than software engineering, though the line blurs as you go deeper. Starting without a software background will add friction in certain places, but it won't stop you from building something genuinely useful.
What makes a small AI tool "stick" versus getting abandoned?
The tools I kept using solved a problem I ran into repeatedly, not one I anticipated. They also fit into something I was already doing, an existing habit or workflow, rather than requiring me to change my behaviour to accommodate them. If a tool asks too much of you upfront before delivering value, you'll quietly stop opening it.
How should I think about prompt design when I'm just starting out?
Treat it like writing a specification, not sending a message. Define what the input data will be, what you want the output to look like, and what should happen in edge cases. Then test it against those cases deliberately. The gap between a vague prompt and a structured one shows up almost immediately in output consistency.
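As a rough sketch of what "test it against those cases deliberately" could look like, the snippet below checks outputs against a simple format rule across a handful of edge-case inputs. Everything in it is an assumption made for illustration: the bullet format, the `NOTHING_TO_SUMMARISE` sentinel, the sample notes, and `fake_model`, which stands in for a real model call so the example runs offline.

```python
from typing import Callable

MAX_BULLETS = 5                    # assumed output limit for this example
SENTINEL = "NOTHING_TO_SUMMARISE"  # assumed reply for notes with no content


def check_output(output: str) -> list[str]:
    """Return a list of spec violations; an empty list means the output conforms."""
    text = output.strip()
    if text == SENTINEL:
        return []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    problems = []
    if not lines:
        problems.append("empty output")
    if len(lines) > MAX_BULLETS:
        problems.append(f"too many bullets: {len(lines)}")
    for ln in lines:
        if not ln.startswith("- "):
            problems.append(f"not a bullet: {ln!r}")
    return problems


# Deliberately awkward inputs, chosen before looking at any output.
EDGE_CASES = {
    "normal": "Met with Sam. Agreed to ship the draft Friday and review Monday.",
    "one_word": "Groceries",
    "no_content": "asdf ??? ...",
    "very_long": "Follow up on the invoice. " * 200,
}


def run_cases(model: Callable[[str], str]) -> dict[str, list[str]]:
    """Run every edge case through `model` and collect violations per case."""
    return {name: check_output(model(text)) for name, text in EDGE_CASES.items()}


def fake_model(note: str) -> str:
    """Stand-in for a real model call: returns the first sentence as one bullet."""
    if len(note.split()) < 3:
        return SENTINEL
    return "- " + note.split(".")[0].strip()


if __name__ == "__main__":
    for name, problems in run_cases(fake_model).items():
        print(name, "OK" if not problems else problems)
```

Swapping `fake_model` for a function that calls whichever model you use turns this into a quick consistency check you can rerun after every prompt change.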
How do I handle it when a model update breaks something that was working?
First, confirm it's actually the model and not something else that changed: context length, temperature setting, input format. If it is the model, try to isolate which part of the output changed. Document your prompts and expected outputs before any change lands, even informally. Treat model versioning the same way you'd treat a dependency update in any other project: cautiously, with tests.
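A small sketch of that habit, under stated assumptions: `fingerprint` hashes the parts you control (prompt text and settings) so you can rule out your own changes, and `run_regression` checks properties of the output rather than exact wording, since wording varies between runs. The case names, inputs, and the `model` callable are hypothetical; no particular provider's API is assumed.

```python
import hashlib
import json
from typing import Callable


def fingerprint(prompt: str, settings: dict) -> str:
    """Short hash of what you control. If it is unchanged and outputs drift,
    the change is more likely on the model side."""
    payload = json.dumps({"prompt": prompt, "settings": settings}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Each case records an input and the properties its output must keep.
CASES = [
    {"name": "date_kept", "input": "Dentist on 3 March at 10am.",
     "must_contain": ["3 March"]},
    {"name": "no_email", "input": "Mail bob@example.com the notes.",
     "must_not_contain": ["@"]},
]


def run_regression(model: Callable[[str], str], cases: list[dict]) -> list[str]:
    """Return one message per broken expectation; an empty list means no drift seen."""
    failures = []
    for case in cases:
        output = model(case["input"])
        for needle in case.get("must_contain", []):
            if needle not in output:
                failures.append(f"{case['name']}: missing {needle!r}")
        for needle in case.get("must_not_contain", []):
            if needle in output:
                failures.append(f"{case['name']}: unexpected {needle!r}")
    return failures


if __name__ == "__main__":
    # Stand-in model for the demo: echoes the note with the email address removed.
    demo_model = lambda note: note.replace("bob@example.com", "Bob")
    print(fingerprint("Summarise the note.", {"temperature": 0.2}))
    print(run_regression(demo_model, CASES) or "no regressions")
```

Saving the fingerprint next to each run's results gives you an informal log: when a case starts failing and the fingerprint hasn't moved, you have some evidence the model changed rather than your setup.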
Is it worth building tools only you will ever use?
Yes, and not just as practice. The feedback loop when you're both builder and sole user is uniquely honest. You can't paper over a usability problem with a better onboarding flow. You either use the thing or you don't. That directness makes personal tools some of the most instructive projects you can take on. More thoughts on this kind of reflective building are scattered across the blog.
Where can I read more about your approach to building and thinking in public?
The home page has an overview of how I think about the work I do, and the blog collects essays like this one: mostly exploratory, occasionally technical, never quite finished in the way a tutorial would be. I write here when I've learned something uncomfortable enough to be worth recording.
I still have that note-summarisation tool open in a tab somewhere. I don't use it the way I intended. But occasionally, on a slow afternoon, I open it and try something new with it: adjust a prompt, change what it looks for in my notes, see if it surprises me. Sometimes it does. I'm not sure what that says about building things: that the post-useful phase of a project can be weirdly generative, or just that I find it hard to let go. Probably both.