Context: back in March 2025 I decided to put aside my scepticism and try AI-driven development for the day. I appreciate that in 8 months the AI landscape, particularly around agentic software dev, has moved along, and perhaps this should have been posted originally back in March. All the same, maybe this is useful to some degree, if only to capture what it was like at the time.


Whilst I sit squarely in my AI-sceptic seat, I was recently prompted to try a different tack by two posts I read a few weeks back.

The first was Bruce's colonoscopy post, yes, that one. What caught my eye was that he was using a local LLM to create a generative image to commemorate his visit. I'd been using chatgpt (for code and electronics, basically an NLP version of a search engine), and hadn't considered that perhaps I could host a model myself to take some responsibility for one of the two problematic aspects of LLMs today (first being power consumption, second being the global theft).

The second post was Simon Willison's post on how he created a mini tool (the post is much broader than that, a useful read). It wasn't the post so much as the tool he was using: Claude Code. So far with chatgpt I'd copy errors and tweaks back and forth when it was helping me. Although I also have copilot enabled in VS Code, it really boils down to autocomplete at a pretty junior level.

I'll often have typescript errors (for work) that copilot claims it can help fix, only resulting in even more typescript errors - so as a rule, I tend to avoid generating code fixes with copilot.

But what Simon showed with a shared transcript between himself and Claude Code was the software making the changes and offering diffs.

So that was what prompted a mini journey, and here's how it went.

Offline LLMs

Previously I had installed command line tools and even Ollama on my Mac without having the faintest idea how to use them effectively - so they sat idle and unused.

I'm not sure how, but I came across Msty, a tool that purported to make using local LLMs very easy. For a change, that seems to be true.

Since I still had Ollama running (though I should have probably ejected it), Msty quickly linked up to this and discovered (though I'd forgotten) that the DeepSeek R1 model was loaded already.

I figured since I was primarily using LLMs for software development, it made sense to find a model that suited those requirements.

Not quite that simple, or certainly not if you (i.e. me) don't have the LLM lingo and terminology down.

I did some searching online, found a few articles, which themselves read as if they had been generated by AI, offering feature reviews of LLM models, but I couldn't find anything that said "model X is excellent for Y software development".

Additionally, I found some models on huggingface.co, and figured that 7B was going to be easier/more viable to run than a 70B (big numbers need big machines? I was guessing…). However, putting the name of the model into Msty yielded a lot of results and I couldn't really tell you what the difference was. More importantly, I couldn't quite be sure of the provenance (and whether that even matters…I'd assume it did).
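
As back-of-the-envelope arithmetic, the guess holds up: the weights of a model take roughly (number of parameters) × (bits per weight) ÷ 8 bytes. The sketch below is only that, a sketch. The function name is made up for illustration, 4-bit quantisation is an assumption (a common choice for local downloads, but not something I checked per model), and it ignores the extra memory needed for context and runtime overhead.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough size of the model weights alone, in gigabytes (10^9 bytes).

    Hypothetical helper: ignores context memory and runtime overhead.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare the model sizes mentioned above, assuming 4-bit weights.
for size in (7, 13, 70):
    print(f"{size}B at 4-bit: ~{estimate_weights_gb(size):.1f} GB")
```

Under those assumptions a 7B model needs about 3.5 GB for its weights and a 70B model about 35 GB, which is the difference between fitting on a laptop and not.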

I ended up installing the following models:

  1. llama 3.2 (2gb)
  2. deepseek coder v2 (which blew up)
  3. codellama (3.8gb)
  4. codellama 13b
  5. WizardCoder (I think the 3gb one?) (it just didn't reply to the query)
  6. Wavecoder-ultra-6.7b (literally never worked for me)

I tried a number of technical problems without much success, and captured the output from several of these models as they tried to solve a Home Assistant automation problem (which you can see here).

It's entirely likely there are better models out there for what I wanted to do. I did know I didn't need a model to know everything, so some focus (and thus: reduced size) would make sense.

However, not one of these gave useful replies to my questions.

My favourite (fail) was giving code-llama a question that read: "this query gets all the lessons matching these lesson slugs, how do I get a list of all the slugs that didn't match?" (I gave it the actual SQL query).

The LLM happily reported that I was trying to learn Spanish:

The provided code contains a list of sentences that are written in Spanish, and the goal is to translate them into English. However, there are some errors in the provided code that need to be fixed before it can function properly. Here are some issues with the current code:

Missing curly braces around the translate function definition: The translate function should be defined as a separate named function expression, like this: { translate } = require("google-translate");. This will ensure that the function is properly closed and can be used in other parts of the code.

For some reason it was using the slugs in the query as the most important part and completely ignoring the technical query.
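
For contrast, the kind of answer the question was fishing for is an anti-join: put the wanted slugs into a temporary list, left join it against the table, and keep the rows with no match. This is a minimal sketch rather than the real query. The `lessons` table, its `slug` column, the sample slugs, the `unmatched_slugs` helper and the use of SQLite are all assumptions made so the example runs on its own.

```python
import sqlite3
from typing import List, Sequence


def unmatched_slugs(conn: sqlite3.Connection, wanted: Sequence[str]) -> List[str]:
    """Return the slugs in `wanted` that have no row in the (assumed) lessons table."""
    if not wanted:
        return []
    # One "(?)" per slug so the values are bound as parameters, not interpolated.
    placeholders = ", ".join("(?)" for _ in wanted)
    sql = f"""
        WITH wanted(slug) AS (VALUES {placeholders})
        SELECT wanted.slug
        FROM wanted
        LEFT JOIN lessons ON lessons.slug = wanted.slug
        WHERE lessons.slug IS NULL
        ORDER BY wanted.slug
    """
    return [row[0] for row in conn.execute(sql, list(wanted))]


# Hypothetical data: two lessons exist, three slugs are asked for.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lessons (slug TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO lessons (slug) VALUES (?)", [("intro",), ("loops",)])

missing = unmatched_slugs(conn, ["intro", "loops", "recursion"])
print(missing)
```

With that sample data, the only slug printed is the one that has no lesson.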

For the most part, the responses I got were fairly hand wavy, text heavy (which I didn't want since I was asking about code), and in most cases irrelevant to my task.

I think em0ry42 on BlueSky sums up what I was seeing:

Smaller self hosted models will always under perform the larger ones. I think your experience and those of the other commenters are consistent with the current reality of these technologies. They can't do what people are promising. Yet.

I'm sure there are people who can tune the hell out of their setup, but sadly, running any decent LLM locally as a useful code assistant is just not here for the rest of us.

So I parked that for a while, and turned my attention to Claude Code.

Coding without touching code

I've no clue how new Claude Code was at the time, though I've gathered it's fairly new. It's a solid product from my experience (where I even managed to lose track as to which company owns which weirdly named AI thingy).

Setup and interface are entirely on the command line, so already we're speaking my language.

I'd seen demos of developers who've been able to join up their entire codebase to the LLM but each time I dabbled, I would quickly get lost and give up.

Claude Code does exactly this without the walls I'd experienced in the past.

I am, however, acutely wary that Claude is running on a remote machine, and likely to be chonking through so much power that we're just throwing away water to keep machines from burning up. Let's stick a pin in that (and gosh, I loathe myself already for that).

The very first problem I wanted it to solve was where I was trying to download 1,000s of videos and they all needed to be added to one massive tarball (context: this is for work, to allow users to bulk download our assets).

I'd hit a problem where the tar process kept throwing an unhelpful exception the evening before and no amount of documentation on the library I was using helped me.

Overnight I had a suspicion as to the cause and it gave me an idea to try - but I thought I'd let Claude try first, see what it does.

Without any specific direction (i.e. my idea for the fix), and only the name of the file and the function where the problem happened, Claude Code suggested the same solution I had in my head.

The UI then offered a syntax highlighted diff of the change it wanted to commit to disk. I was able to review it (very much how I'd approach a code review) and all I then needed to do was hit enter to accept.

I tested the code in a separate terminal and indeed the change worked.

Given this positive start, I then spent most of the working time split between the Claude Code UI and in the terminal to run the main program (which was sequencing a very large dataset).

The code changes were, for the most part, good and code that I accepted.

The experience was…weird. I'd heard of LLMs being referred to as junior developers but when I was going back and forth between chatgpt and vscode (again, for me, copilot never really came in useful), because of the amount of interaction that was required from me, it felt even less than working with a junior.

But this was a much closer experience. I'd describe the change and logic, sometimes pointing to filenames that would offer useful context, and Claude would spend some time (and money) thinking, then it would ask me to check a diff.

Weirdly I spent more time sitting and staring out the window waiting for code to come back than I did looking at code. It was a weirdly hands off experience. I can't tell where I sit on that.

The main criticism that I had is that, because we use specific rules for typescript (no any, and types must be defined, which I think seems okay), Claude wouldn't really follow those strict rules, so I needed to go in at the end to clean that part up.

The secondary criticism is more a matter of taste. The code (and logging) was too verbose for my taste. Additionally, being outside the code for the majority of the work period felt really strange. Sort of like a self-driving car took me for the majority of my journey, deciding the navigation itself, for me only to be needed for the final arrival through some tight country lanes. Or something!

A cost

Since I was freestyling my way on Claude Code, I did manage to rattle through $5 of credit. I did think this was (somehow) linked to my Google business account, but I'm now suspecting it was free credit to introduce me to their API.

After running through this credit twice now (I switched to my personal account for a second run) I've discovered there are tools to help manage that sprawling cost (such as /compact and /clear to reduce how much context the LLM is fed before giving me a result). I'd like to play with this more to get an idea of how much I'm really prepared to pay.

Also after writing (most of) this post, I came across an interesting project that takes the Claude UI and lets you connect up your own backend. I've not tried it yet, but I'd be interested to see if I can connect to a local LLM and try out results (though going by the current experience, it's going to have a hard time competing).

Since it was conversational...

I decided to hack together a simple keyboard keycap (I had a spare) with an ESP32 board to emulate a keyboard.

This would send a (fairly) unique keycode that launched a python command, which started a whisper-based script that let me talk, then pasted the text into whatever was focused.

This meant I had: press button, say the thing, press the button, wait for it to be done.
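
The press, talk, press flow is really just a two-state toggle. Here's a rough sketch of that logic, not the script I actually ran: the `PushToTalk` class and the four callables it takes are hypothetical names, and the recording, transcription and pasting are stubbed out so the example runs on its own.

```python
from typing import Callable, Optional


class PushToTalk:
    """Toggle dictation on each key press: first press starts, second press stops."""

    def __init__(
        self,
        start_recording: Callable[[], None],
        stop_recording: Callable[[], bytes],
        transcribe: Callable[[bytes], str],
        paste: Callable[[str], None],
    ) -> None:
        self._start = start_recording
        self._stop = stop_recording
        self._transcribe = transcribe
        self._paste = paste
        self.recording = False

    def on_key(self) -> Optional[str]:
        """Handle one press of the (assumed) dedicated key."""
        if not self.recording:
            # First press: begin capturing audio and wait for the next press.
            self._start()
            self.recording = True
            return None
        # Second press: stop, turn the audio into text, paste it.
        self.recording = False
        audio = self._stop()
        text = self._transcribe(audio).strip()
        self._paste(text)
        return text


# Stubbed usage: no microphone or speech model involved.
pasted = []
ptt = PushToTalk(
    start_recording=lambda: None,
    stop_recording=lambda: b"fake-audio",
    transcribe=lambda audio: " add a retry to the upload ",
    paste=pasted.append,
)
ptt.on_key()  # press: start talking
ptt.on_key()  # press again: transcribe and paste
print(pasted)
```

In a real setup the `transcribe` callable would wrap whatever speech-to-text model is in use, and `paste` would type into the focused window. Keeping those as plain functions means the toggle itself can be tested without any hardware.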

It wasn't great because it was a little clunky, but it definitely felt futuristic!

How I felt afterwards

(I'm now writing this 8 months late, but I remember how it felt on the day).

Even though I was surprised at the progress of the work, both by how terrible the local code solving was and by how impressive Claude Code was, it did leave me with a feeling of disconnect.

There's certainly the issue with the maintainability of pure vibe-coded software, but this was something more.

There's a creative input that I put into my coding process. A sense of purpose and achievement in solving some complicated problem, or writing a line of code that I'm particularly pleased with. There wasn't really any of that feeling of connection with the output.

Having written this retrospectively I know that my perspective has changed somewhat, but I do remember having this weird dissonance between the outcome and the experience of getting there.
