Running local models is good now

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 1 day ago

arcine@jlai.lu · 4 hours ago

“Making your own cyanide is good now”

The fact that I don’t know how to make cyanide at home isn’t what’s keeping me from consuming it. I don’t consume cyanide because it is POISON; so is AI.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 4 hours ago

Ah yes, tools are poison. You’re very intelligent.

Jayjader@jlai.lu · 1 day ago

I’ve been pleasantly surprised by Qwen3.6-27b on a Radeon 6700xt (12GB of VRAM) with 32GB of system RAM for it to offload onto (especially when pushing the context window up past 50k). Definitely more of a “compose prompt and hit send -> do something else -> check back after a while to view results” experience than an engaged back-and-forth, but at least compared to previous models I’ve tried running over the past year or two the results are palatable and sometimes even meaningfully useful.

Given the speed I get, I’ve mostly found it useful for doing overviews of a codebase with some sort of improvement plan suggested at the end. Tool calls work, but I’m still not comfortable letting it code outright (plus, I think I can still code faster than it for now).
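A rough back-of-the-envelope check of why a 27B model spills out of 12 GB of VRAM and into system RAM. This is only a sketch: the 4 bits per weight figure and the `reserve_gib` headroom are assumptions for illustration, real quantization formats carry some overhead above their nominal bit width, and the estimate ignores the KV cache, which grows with the context window.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gib: float, reserve_gib: float = 1.5) -> bool:
    """True if the weights plus some headroom fit entirely on the GPU.

    reserve_gib is a hypothetical allowance for buffers and the desktop;
    it is not a measured value.
    """
    return weight_gib(params_billion, bits_per_weight) + reserve_gib <= vram_gib


if __name__ == "__main__":
    size = weight_gib(27, 4)  # assumed nominal 4-bit quantization
    print(f"weights: {size:.1f} GiB, fits in 12 GiB: {fits_in_vram(27, 4, 12)}")
```

Under these assumptions the weights alone come to roughly 12.6 GiB, so some layers have to live in system RAM, which fits the slow "send and check back later" experience described above.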

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 1 day ago

I kind of look at the whole agentic harness setup as a genetic algorithm. Your tests and specs are the fitness function for the program you’re evolving, and the LLM is the mutator. At each step it generates some output, that output gets tested against the fitness function, and the LLM gets feedback and iterates on it. Eventually something working falls out at the end. The better you define the selection criteria, the more you box the agent in, and the better the results you get.

The trick I can recommend for getting the model to code is to ask it to come up with a phased plan composed of focused features, and then to build each feature on its own branch. That way you have a clear unit of work that does a specific thing, which makes it much easier to review the code. I can also recommend tools like https://github.com/Fission-AI/OpenSpec for writing specs that box the model in while it works.

Jayjader@jlai.lu · 1 day ago

I really dislike the idea of making the whole program a genetic algorithm - that approach is nice when you don’t have a straightforward approach to employ/enact, but otherwise it feels both overkill and horrendously inefficient.

The next step for my own harness (whenever I get back to working on it) is definitely to look at leveraging structured outputs to help these smaller models iterate towards a longer term goal.
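One way the structured-output idea could look on the harness side, as a minimal sketch: the model is asked to reply with a small JSON object, and the harness rejects anything that does not match before acting on it. The field names and the action list are hypothetical, invented for this example, and this only validates a reply after the fact; it does not constrain what the model generates.

```python
import json

# Hypothetical step format for illustration; not taken from any real harness.
ALLOWED_ACTIONS = {"read_file", "edit_file", "run_tests", "done"}
REQUIRED_FIELDS = {"action": str, "target": str, "reason": str}


def parse_step(raw: str) -> dict:
    """Parse one model reply into a validated step, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, kind in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), kind):
            raise ValueError(f"field {field!r} missing or not {kind.__name__}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action {data['action']!r}")
    return data


if __name__ == "__main__":
    reply = '{"action": "run_tests", "target": "tests/", "reason": "check the fix"}'
    print(parse_step(reply))
```

The error message from a failed parse can be sent straight back to the model as the next prompt, which gives a smaller model a concrete thing to correct instead of a vague "try again".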

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 24 hours ago

I don’t mean you turn the program itself into a genetic algorithm. I’m saying that the agentic loop for producing code acts as one. The code itself is just regular code. And the loop isn’t really any more inefficient than what you do as a developer. In practice you almost never write perfect code on the first try. You’ll write some code, run your tests, see how it did, and iterate. That’s precisely the same process the agent follows.

The difference from a typical genetic algorithm is that the LLM is not just randomly generating text that eventually fits into the shape you specified. It’s generating code that’s already close to what’s intended most of the time, and it just needs a bit of massaging to get completely right. That’s the feedback loop here.
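The loop described here can be sketched in a few lines. This is an illustration under stated assumptions, not any particular harness: `evolve`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the "model" is a stub that returns a wrong answer first and a corrected one once it sees the failure message. It also keeps a single candidate per round, so it is closer to guided iteration than to a population-based genetic algorithm.

```python
from typing import Callable, Optional


def evolve(generate: Callable[[str], str],
           evaluate: Callable[[str], list[str]],
           max_rounds: int = 5) -> Optional[str]:
    """Generate a candidate, test it, feed failures back, and repeat."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(feedback)   # the "mutator" (an LLM in practice)
        failures = evaluate(candidate)   # the "fitness function" (tests/specs)
        if not failures:
            return candidate
        feedback = "\n".join(failures)
    return None  # gave up within the round budget


def toy_generate(feedback: str) -> str:
    # Stand-in for an LLM call: wrong at first, fixed once feedback arrives.
    if "add(2, 3)" in feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def toy_evaluate(source: str) -> list[str]:
    # Stand-in for a test suite: run the candidate and collect failures.
    namespace: dict = {}
    exec(source, namespace)
    failures = []
    if namespace["add"](2, 3) != 5:
        failures.append("add(2, 3) should be 5")
    return failures


if __name__ == "__main__":
    print(evolve(toy_generate, toy_evaluate))
```

The more precise the failure messages from `evaluate`, the less the generator has to guess, which is the "boxing in" point made earlier in the thread.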

Jayjader@jlai.lu · 16 hours ago

Sorry, I misspoke (miswrote?). I meant growing the code through a genetic-algorithm-like process. Though, fundamentally, I don’t think there’s that much difference between applying a selection process on randomized bytes and having an LLM churn on a codebase.

I feel like you’re only considering the time it takes to reach a particular solution when considering what is inefficient - in which case I would agree it’s probably a wash. However, I don’t think an LLM is less energy-hungry than my own body, and I learn by doing, effectively reducing the cost of future coding iterations. I guess if I could run the LLM and surrounding hardware entirely off of solar power I wouldn’t mind nearly as much - though there’s still that part of banging my head against a problem that I believe is crucial for my own growth. I think that, over time and problems/projects, this compounds in a way that letting the LLM figure out the gritty details just won’t.

I think I agree with your last paragraph, though I do wish the LLM was capable of needing less massaging the more it runs. I hope we’ll be able to figure out how to achieve effectively infinite context length so that it doesn’t have to “forget” all of the previous tasks I’ve had it work on.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 5 hours ago

Having done development for over two decades now, I’m really not learning anything useful when I make yet another CRUD endpoint on a server, or a new widget. The reality is that most coding tasks are highly repetitive and we’re just writing the same boilerplate in slightly different contexts. Being able to offload boring and repetitive tasks to a machine is what automation is for.

I’d rather spend my brainpower on things I find interesting like the overall architecture and the problem being solved while leaving writing implementation details to the LLM. It’s not like you stop solving problems when you use an LLM for coding, you’re just focusing on different things at that point.

It’s also worth noting that this argument isn’t new. I’m old enough to remember how writing assembly by hand was what real coders did or how using GC was cheating because you shouldn’t offload memory management to the computer. In each case it turned out that using better tools let us build more interesting things in the end and freed up human thinking from boring and repetitive work.

Jayjader@jlai.lu · 3 hours ago

I want to agree, but for example GC has enabled webpages that take 3 gigs of RAM to do the same tasks we could do with 200 megs fifteen years ago. We don’t automatically build more interesting things once the gritty details and boilerplate are automated, and this stochastic automation gives even more room for “bad practices” to creep in and rob us of the gains it is supposed to bring.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 2 hours ago

GC has little to do with web page bloat though. In fact, that’s precisely where human agency comes in to design things in a sensible way. And I see little evidence to support the claim that stochastic automation leads to worse code. I use these tools every day, and that’s completely contrary to my experience. I get the impression that you’re starting from a conclusion and coming up with a narrative that fits it rather than actually trying these tools out and seeing how to work with them effectively.

pimat@feddit.org · 1 day ago

Stopped reading at 64GB of RAM in an M2 Mac… That’s not a real-world example to start with.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 1 day ago

You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water. But 64 GB of RAM is also not really out of scale with the cost of a shop tool in other trades. If you’re a professional who’s confident in a positive return on the investment, or just a hobbyist with the luxury budget for a “shop”, that cost is well within the consumer market. That’s not everybody, of course, but it’s not some inconceivable fantasy.

The key point is that local models continue to get more efficient and usable. You need high end consumer grade hardware today, but given how fast improvements are happening, it’s entirely likely that you’ll be able to get the same capability on even smaller hardware in a few months.

pimat@feddit.org · 1 day ago

I really appreciate you taking the time to reply. From your point of view this makes sense of course, and I hope you are right about the upcoming improvements. I did some experiments with an M1 Mac mini and was quickly disappointed, but maybe I’ll give it another shot. Thanks again, I’m always open to being corrected and love to learn new stuff.

lichtmetzger@discuss.tchncs.de · edited 23 hours ago

Doesn’t have to be a Mac. My GPD Win Max 2 has 64GB as well for a much lower price, and it can somehow use 55GB on the integrated GPU (AMD 780M) for running models with ollama. I can even combine that with an external GPU on the OCuLink port to increase the total memory.

It takes anywhere from 30 seconds to 5 minutes to get a reply, but it does work, and it’s mainly useful for going over my project and asking how to improve the codebase.

Quality-wise it’s good enough for boilerplate code and small improvements. Wouldn’t trust it to work on big features in larger projects, but I don’t trust LLMs in general for that. I don’t see a big difference to ChatGPT and Gemini (which is a win for local hosting and putting the freedom of computing back into our own hands). But the usual caveats always apply. All models have their problems and people tend to overhype the capabilities of LLMs in general.

setsubyou@lemmy.world · 1 day ago

Why not? I have a 2020 M1 MBP with 64 GB too. But you don’t need that much for the models in the article.