Microsoft Amplifier

3 months ago

More creative? I've just seen my premium subscription "AI" struggling to find a trivial issue of a missing import in a very small / toy project. Maybe these tools are getting all sorts of scores on all sorts of benchmarks, I don't doubt it, but why are there no significant real-world results after more than 3 years of hype? It reminds me of that situation when the geniuses at Google offered the job to the guy who created Homebrew and then rejected him after he supposedly did not do well on one of those algorithmic tasks (inverting a binary tree? - not sure if I remember correctly). There are also all sorts of people scoring super high on various IQ tests, but what counts, with humans as with the supposed AI, is the real-world results. Benchmarks without results do not mean anything.

vachina

3 months ago

It is as creative as its training material.

You think it is creative because you lack the knowledge of what it has learnt.

jsheard

3 months ago

1 reply

> I'll always be skeptical about using AI to amplify AI.

This project was in part written by Claude, so for better or worse I think we're at least 3 levels deep here (AI-written code which directs an AI to direct other AIs to write code).

brazukadev

3 months ago

The amount of AI slop coming from Microsoft is staggering. Not surprised their CEO said AI could destroy MS.

Balinares

3 months ago

I think I'm more optimistic about this than brute-forcing model training with ever larger datasets, myself. Here's why.

Most models I've benchmarked, even the expensive proprietary models, tend to lose coherence when the context grows beyond a certain size. The thing is, they typically do not need the entire context to perform whatever step of the process is currently going on.

And there appears to be a lot of experimentation going on along the line of having subagents in charge of curating the long term view of the context to feed more focused work items to other subagents, and I find that genuinely intriguing.

My hope is that this approach will eventually become refined enough that we'll get dependable capability out of cheap open weight models. That might come in darn handy, depending on the blast radius of the bubble burst.

tcdent

3 months ago

1 reply

> Never lose context again. Amplifier automatically exports your entire conversation before compaction, preserving all the details that would otherwise be lost. When Claude Code compacts your conversation to stay within token limits, you can instantly restore the full history.

If this is restoring the entire context (and looking at the source code, it seems like it is just reloading the entire context) how does this not result in an infinite compaction loop?

redhale

3 months ago

I think the idea would be that you could re-compact with a different focus. When you compact, you can give Claude instructions on what is important to retain and what can be discarded. If you later discover that actually you wanted something you discarded during a previous compaction, this could allow you to recover it.

Also, it can be useful to compact before it is strictly necessary to compact (before you are at max context length). So there could be a case where you decide you need to "undo" one of these types of early compactions for some reason.

rco8786

3 months ago

4 replies

A lot of snark in these comments. Has anyone actually tried it yet?

SilverElfin

3 months ago

1 reply

I’ve seen people discuss these types of approaches on X. To me it looks like the concepts here are already tried and popular - they’re just packaging it up so that people who aren’t as deep in that world can get the same benefits. But I’m not an expert.

ridruejo

3 months ago

1 reply

Exactly. I don’t understand the cynicism in the comments and they literally are just trying to make the technology more accessible

nozzlegear

3 months ago

3 replies

That's a very altruistic outlook on Microsoft's intent with getting everyone to use and depend on AI.

username223

3 months ago

1 reply

I mean this in the best possible way, but I don't think you're using "altruistic" correctly. Altruism is "showing a selfless concern for the well-being of others." I think you're looking for "naive," and Microsoft is some combination of cynical and manipulative.

nozzlegear

3 months ago

Good point, thanks! I meant to say that they were taking an outlook that cast Microsoft's intentions as altruistic when (in my view) the intentions are more along the lines of cynical and manipulative, as you said.

otterley

3 months ago

Isn’t that what every company that sells technology does—build demos and showcase uses in order to provoke the imagination and motivate sales? No company is perfect, but what Microsoft is doing here is hardly unusual.

vachina

3 months ago

Microsoft is on a roll, on a roll at repackaging open source efforts and branding them, and then saying they made it.

3 months ago

1 reply

I think most of us are irritated by the constant A/B testing and underwhelming releases. Let's just have the bubble pop so we can solve real problems instead of this.

fishmicrowaver

3 months ago

2 replies

Hehe suddenly many people will have the real problem of paying bills unfortunately

3 months ago

1 reply

They will, especially when it comes to paying back the VCs all the burnt GenAI-dollars

bee_rider

3 months ago

1 reply

I thought VCs were investors.

Generally when your investment fails you don’t get paid back, right?

yunnpp

3 months ago

1 reply

Where were you in 2008?

bee_rider

3 months ago

Not in the workforce yet

Incipient

3 months ago

I'm super confused how anyone can actually afford to pay per token for llms to actually do dev.

I tried it with a feature, took about 10 minutes and a lot of iterations, and would easily have used hundreds of thousands of tokens. Doing this 20, 30 times a day would be crazy expensive.
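For a rough sense of scale, here is a back-of-the-envelope sketch. The token count per feature, the number of features per day, and the price per million tokens are all placeholder assumptions, not any provider's actual rates; plug in your own numbers.

```python
def daily_cost_usd(tokens_per_feature: int,
                   features_per_day: int,
                   usd_per_million_tokens: float) -> float:
    """Estimate daily spend for pay-per-token usage."""
    total_tokens = tokens_per_feature * features_per_day
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs: 300k tokens per feature, 25 features a day,
# and an assumed blended price of $10 per million tokens.
estimate = daily_cost_usd(300_000, 25, 10.0)
print(f"~${estimate:.2f} per day")
```

The estimate scales linearly in every input, so a tool that re-reads a large context on each iteration multiplies the bill by the same factor.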

rs186

3 months ago

1 reply

The repo is full of big AI words without any metrics/benchmark.

People are correct to question it.

If anything, Microsoft needs to show something meaningful to make people believe it's worth trying it out.

rco8786

3 months ago

1 reply

I’m not blaming them. I’m asking if anyone has tried it.

brazukadev

3 months ago

1 reply

Why didn't you try?

rco8786

3 months ago

Time? I’m on a discussion forum. Just wondering if anyone actually tried it out. Not that big a deal

awy311

3 months ago

Claude Code is great, this is just a set of tweaks, not really "research". For anyone into vibe coding, there are dozens of interesting video tutorials on customizing Claude Code and running practical jobs, not limited to coding.

alganet

3 months ago

1 reply

> "I have more ideas than time to try them out" — The problem we're solving

I see a possible paradox here.

For exploration, my goal is _to learn_. Trying out multiple things is not wasting time, it's an intensive learning experience. It's not about finding what works fast, but understanding why the thing that works best works best. I want to go through it. Maybe that's just me though, and most people just want to get it done quickly.

tclancy

3 months ago

Yeah, this seems like the opposite of invention. You can throw paint at a canvas but it won’t make you Pollock. And will you feel a sense of accomplishment?

vincnetas

3 months ago

3 replies

Starting in Claude bypass mode does not give me confidence:

WARNING: Claude Code running in Bypass Permissions mode

In Bypass Permissions mode, Claude Code will not ask for your approval before running potentially dangerous commands. This mode should only be used in a sandboxed container/VM that has restricted internet access and can easily be restored if damaged.

nine_k

3 months ago

1 reply

The Readme clearly states:

Caution

This project is a research demonstrator. It is in early development and may change significantly. Using permissive AI tools in your repository requires careful attention to security considerations and careful human supervision, and even then things can still go wrong. Use it with caution, and at your own risk.

vincnetas

3 months ago

2 replies

Claude Code will not ask for your approval before running potentially dangerous commands.

and

requires careful attention to security considerations and careful human supervision

is a bit orthogonal no?

nine_k

3 months ago

1 reply

As a token of careful attention, run this in a clean VM, properly firewalled not to access the host, your internal network, GitHub or wherever your valuable code lives, and ideally anything but the relevant Anthropic and Microsoft API endpoints.

3 months ago

1 reply

And even then if you give it Internet access you're at risk of code exfiltration attacks.

nine_k

3 months ago

Definitely do not give it access to code you are afraid of leaking. Take an open-source code base you're familiar with, and experiment on that.

otterley

3 months ago

It’s not orthogonal at all. On the contrary, it’s directly related:

“Using permissive AI tools [that is, ones that do not ask for your approval] in your repository requires careful attention to security considerations and careful human supervision”. Supervision isn’t necessarily approving every action: it might be as simple as inspecting the work after it’s done. And security considerations might mean to perform the work in a sandbox where it can’t impact anything of value.

nicwolff

3 months ago

I assumed, especially with the VS Code recommendation, that this would automatically use devcontainers...

cyral

3 months ago

If they didn't have this warning you'd see comments on how irresponsible they are being

koakuma-chan

3 months ago

1 reply

Is this a Claude Code wrapper?

skrebbel

3 months ago

Yes

furyofantares

3 months ago

2 replies

I do a lot of work with claude code and codex cli but frankly as soon as I see all the LLM-tells in the readme, and then all the commit messages written by claude, I immediately don't want to read the readme or try the project until someone else recommends it to me.

This is gaining stars and forks but I don't know if that's just because it's under the github.com/microsoft, and I don't really know how much that means.

typpilol

3 months ago

1 reply

I'd rather have in-depth commit messages than three-word ones

furyofantares

3 months ago

1 reply

When I blind-commit claude code commit messages they are sometimes totally wrong. Not even hallucinations necessarily - by the time I'm committing the context may be large and confusing, or some context lost.

I'd rather have the three word message than detailed but wrong messages.

I think I agree with you anyway on average. Most of the time a claude-authored commit message is better than a garbage message.

But it's still a red flag that the project may be filled with holes and not really ready for other people. It's just so easy to vibe your way to a project that works for you but is buggy and missing tons of features for anyone who strays from your use case.

typpilol

3 months ago

You're not wrong.

I'd never encourage anyone to blind commit the messages. But if they are correct they seem a lot more useful than 90% of commit messages.

The biggest mistakes I've seen other people make are like this: they move a file, and the commit message acts like it's a brand new feature they added, because the LLM doesn't put together that it's just a moved file

nightshift1

3 months ago

Future LLMs are going to be trained on this. Github really ought to start tagging repos that are vibe-coded.

nightshift1

3 months ago

4 replies

I think that letting an LLM run unsupervised on a task is a good way to waste time and tokens. You need to catch them before they stray too far off-path. I stopped using subagents in Claude because I wasn't able to see what they were doing and intervene. Indirectly asking an LLM to prompt another LLM to work on a long, multi-step task doesn't seem like a good idea to me. I think community efforts should go toward making LLMs more deterministic with the help of good old-fashioned software tooling instead of role-playing and writing prayers to the LLM god.

danmaz74

3 months ago

5 replies

When the task is bigger than I trust the agent to work on it on its own, or for me to review the results, I ask it to create a plan with steps. Then create a md file for each step. I review the steps, and ask the agent to implement the first one. Review that one, fix it, then ask it to update the next steps, and then implement the next one. And so on, until finished.
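A minimal sketch of the bookkeeping side of that loop, assuming one markdown file per step with a plain `Status:` line. The file naming and the status marker are made up for illustration; the review and the agent calls themselves stay manual.

```python
from pathlib import Path
from typing import Optional


def write_plan(plan_dir: Path, steps: list[str]) -> list[Path]:
    """Write one markdown file per plan step, numbered in order."""
    plan_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, title in enumerate(steps, start=1):
        path = plan_dir / f"step-{i:02d}.md"
        path.write_text(f"# Step {i}: {title}\n\nStatus: pending\n", encoding="utf-8")
        paths.append(path)
    return paths


def next_pending(plan_dir: Path) -> Optional[Path]:
    """Return the first step file still marked pending, or None when all are done."""
    for path in sorted(plan_dir.glob("step-*.md")):
        if "Status: pending" in path.read_text(encoding="utf-8"):
            return path
    return None


def mark_done(path: Path) -> None:
    """Flip a step to done once it has been implemented and reviewed."""
    text = path.read_text(encoding="utf-8")
    path.write_text(text.replace("Status: pending", "Status: done"), encoding="utf-8")
```

The human stays in the loop between calls: hand `next_pending(...)` to the agent, review the result, let the agent revise the remaining step files, then call `mark_done`.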

3 months ago

1 reply

Separately, you have to consider that "wasting tokens spinning" might be acceptable if you're able to run hundreds of thousands of these things in parallel. If even a small subset of them translate to value, then you're far net ahead vs with a strictly manual/human process.

pjc50

3 months ago

1 reply

> hundreds of thousands of these things in parallel

At what cost, monetary and environmental?

3 months ago

1 reply

If the system provides value that is greater than its cost, then paying the cost to gain the value is always worthwhile - regardless of the magnitude of the cost.

As costs drop exponentially (a reasonable expectation for LLMs, etc.) then increasing agent parallelism becomes more and more economically viable over time.

ahartmetz

3 months ago

>As costs drop exponentially

Not a reasonable expectation anymore. Moore's Law has been dead for more than a decade and we're getting close to physical limits.

anditherobot

3 months ago

1 reply

Have you tried Scoped context packages? Basically for each task, I create a .md file that includes relevant file paths, the purpose of the task, key dependencies, a clear plan of action, and a test strategy. It’s like a mini local design doc. I found that it helps ground implementation and stabilizes the output of the agents.
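One way such a package could be generated, as a sketch only. The `ContextPackage` class and its section headings are hypothetical; the fields simply mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ContextPackage:
    """One task's scoped context, rendered as a small markdown design doc."""
    purpose: str
    file_paths: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    test_strategy: str = ""

    def to_markdown(self) -> str:
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        steps = "\n".join(f"{n}. {step}" for n, step in enumerate(self.plan, 1))
        return "\n\n".join([
            f"# Task\n{self.purpose}",
            f"## Relevant files\n{bullets(self.file_paths)}",
            f"## Key dependencies\n{bullets(self.dependencies)}",
            f"## Plan\n{steps or '1. (to be written)'}",
            f"## Test strategy\n{self.test_strategy or '(to be written)'}",
        ]) + "\n"
```

Writing `pkg.to_markdown()` to a per-task `.md` file gives the agent a single small document to read instead of the whole repository.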

genghisjahn

3 months ago

3 replies

I read this suggestion a lot. “Make clear steps, a clear plan of action.” Which I get. But then instead of having an LLM flail away at it, could we give it to an actual developer? It seems like we’ve finally realized that clear specs make dev work much easier for LLMs. But the same is true for a human. The human will ask more clarifying questions and not hallucinate. The LLM will roll the dice and pick a path. Maybe we as devs would just rather talk with machines.

FrinkleFrankle

3 months ago

I'm using it to help me build what I want and learn how. It being incorrect and needing questioning isn't that bad, so long as you ARE questioning it. It has brought up so many concepts, parameters, etc that would be difficult to find and learn alone. Documentation can often be very difficult to parse. Llms make it easier.

catlifeonmars

3 months ago

> Maybe we as devs would just rather talk with machines.

This is kind of how I feel. Chat as an interaction is mentally taxing for me.

redhale

3 months ago

Yes, but the difference is that an LLM produces the result instantly, whereas a human might take hours or days.

So if you can get the spec right, and the LLM+agent harness is good enough, you can move much, much faster. It's not always true to the same degree, obviously.

Getting the spec right, and knowing what tasks to use it on -- that's the hard part that people are grappling with, in most contexts.

meander_water

3 months ago

1 reply

This is built into Cursor now with plan mode https://cursor.com/docs/agent/planning

danmaz74

3 months ago

1 reply

How does Cursor plan mode differ from Claude Code plan mode? I've used the latter a lot (it's been there a long time), and the description seems very similar. The big difference with the workflow I described is that with that plan mode you don't get to review and correct what happened between steps.

meander_water

3 months ago

I've not used Claude Code, so my answer might not be that useful. But I would think that because both are chat-based interfaces you would be able to instruct the model to either continue without approval or wait for your approval at each step. I certainly do that with Cursor. Cursor has also recently started automatically generating TODO lists in the background (with a tool call I'm assuming), and displaying them as part of the thinking process without explicit instruction. I find that useful.

spike021

3 months ago

this plus a reset in between steps usually helps focus context in my experience

sanex

3 months ago

I do the same thing with my engineers but I keep the tasks in Jira and I label them "stories".

But in all seriousness +1 can recommend this method.

theshrike79

3 months ago

There are two opposite ways to do this.

Codex is like an external consultant. You give it specs and it quietly putters away and only stops when the feature is done.

Claude is built more like a pair programmer: it displays changes live, "talks" about what it's doing and what's working, etc.

It's really, REALLY hard to abort codex mid-run to correct it. With Claude it's a lot easier when you see it doing something stupid or going off the rails. Just hit ESC and tell it where it went wrong (like use task build, don't build it manually or use markdownlint, don't spend 5 minutes editing the markdown line by line).

tummler

3 months ago

I also use AI to do discrete, well-defined tasks so I can keep an eye on things before they go astray.

But I thought there are lots of agentic systems that loop back and ask for approval every few steps, or after every agent does its piece. Is that not the case?

hu3

3 months ago

Yeah in my experience, LLMs are great but they still need babysitting lest they add 20k lines of code that could have been 2k.

estimator7292

3 months ago

1 reply

The very first line in the readme is a quote, attributed to "the problem we're solving".

That's cute

nvader

3 months ago

If you think about it, that's because "the problem we're solving" is running out of time. Once it's solved it won't be able to try out ideas.

theusus

3 months ago

1 reply

Didn’t GitHub create something similar called Spec.

janpio

3 months ago

You are thinking of https://github.com/github/spec-kit

stillsut

3 months ago

1 reply

I've actually written my own homebrew framework like this which is a.) cli-coder agnostic and b.) leans heavily on git worktrees [0].

The secret weapon to this approach is asking for 2-4 solutions to your prompt running in parallel. This helps avoid the most time-consuming aspect of AI coding: reviewing a large commit, and ultimately finding the approach the AI took is hopeless or requires major revision.

By generating multiple solutions, you can cut down on investing fully in the first solution, use clever ways to select among the 2-4 candidate solutions, and usually apply a small tweak at the end. Anyone else doing something like this?

[0]: https://github.com/sutt/agro
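A rough sketch of the worktree part of the idea, not how the linked project actually does it: build one `git worktree add` command per candidate so each agent gets its own checkout and branch. The branch and directory naming here is an assumption.

```python
def worktree_commands(base: str, task: str, n_candidates: int) -> list[list[str]]:
    """Build one `git worktree add` command per candidate solution."""
    commands = []
    for i in range(1, n_candidates + 1):
        branch = f"{task}-candidate-{i}"
        path = f"../{branch}"  # sibling directory next to the main checkout
        # git worktree add -b <new-branch> <path> <commit-ish>
        commands.append(["git", "worktree", "add", "-b", branch, path, base])
    return commands
```

Each command can be run with `subprocess.run(cmd, check=True)` from inside the repository, after which a separate agent session is started in each new directory with the same prompt.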

https://xbow.com/blog/alloy-agents

3 months ago

2 replies

There is a related idea called "alloying" where the 2-4 candidate solutions are pursued in parallel with different models, yielding better results vs any single model. Very interesting ideas.

stillsut

3 months ago

Exactly what I was looking for, thanks.

I've been doing something similar: aider+gpt-5, claude-code+sonnet, gemini-cli+2.5-pro. I want to try coder-cli next.

A main problem with this approach is summarizing the different approaches before drilling down into reviewing the best approach.

Looking at a `git diff --stat` across all the model outputs can give you a good measure of if there was an existing common pattern for your requested implementation. If only one of the models adds code to a module that the others do not, it's usually a good jumping off point to exploring the differing assumptions each of the agents built towards.
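A small sketch of that comparison, assuming each candidate lives on its own branch. It uses `git diff --name-only` rather than `--stat` because the plain file list is easier to parse; the function names are made up.

```python
import subprocess
from collections import Counter


def changed_files(base: str, branch: str) -> set[str]:
    """Files changed on `branch` since it forked from `base`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{branch}"],
        capture_output=True, text=True, check=True,
    )
    return {line for line in result.stdout.splitlines() if line}


def split_common_and_unique(changes: dict[str, set[str]]):
    """Separate files every candidate touched from files only one candidate touched."""
    counts = Counter(f for files in changes.values() for f in files)
    common = {f for f, n in counts.items() if n == len(changes)}
    unique = {
        name: {f for f in files if counts[f] == 1}
        for name, files in changes.items()
    }
    return common, unique
```

Files in `common` suggest an existing pattern all the models followed; a non-empty entry in `unique` points at the candidate whose assumptions differ and is worth reading first.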

michaelbarton

3 months ago

This reminds me of an approach in MCMC where you run multiple chains at different temperatures and then share the results between them (replica exchange MCMC sampling), the goal being not to get stuck in one “solution”
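For readers who haven't seen it, a toy sketch of replica exchange on a one-dimensional bimodal target. The target density, temperature ladder, and step size are arbitrary choices for illustration.

```python
import math
import random


def log_density(x: float) -> float:
    """Toy bimodal target: equal mixture of unit-variance Gaussians at -4 and +4."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def replica_exchange(temps, n_steps, step_size=1.0, seed=0):
    """Run one Metropolis chain per temperature; return samples from chain 0.

    Assumes temps[0] == 1.0 (the chain sampling the real target) and len(temps) >= 2.
    """
    rng = random.Random(seed)
    states = [0.0 for _ in temps]
    cold_samples = []
    for _ in range(n_steps):
        # Metropolis update within each chain, targeting density ** (1 / T).
        for i, t in enumerate(temps):
            proposal = states[i] + rng.gauss(0.0, step_size)
            log_accept = (log_density(proposal) - log_density(states[i])) / t
            if math.log(1.0 - rng.random()) < log_accept:
                states[i] = proposal
        # Propose swapping the states of one random adjacent pair of chains.
        i = rng.randrange(len(temps) - 1)
        log_swap = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (
            log_density(states[i + 1]) - log_density(states[i])
        )
        if math.log(1.0 - rng.random()) < log_swap:
            states[i], states[i + 1] = states[i + 1], states[i]
        cold_samples.append(states[0])
    return cold_samples
```

The hot chains cross between the two modes easily, and the swap step lets those crossings filter down to the cold chain, which is the "not getting stuck" property being compared to running several agents in parallel.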

willahmad

3 months ago

1 reply

Project looks interesting, but no demos. As much as I want to try it because of all the cool concepts mentioned, I am not sure I want to invest my time if I don't see any demos

fishmicrowaver

3 months ago

1 reply

I mean that's fair but doing a make install and providing your API key is pretty easy?

willahmad

3 months ago

multiply it by 20 other similar projects and assume 20% have security issues, your environment will be messed up before you even understand if you need it or not. Not even talking about time you lost

ripped_britches

3 months ago

2 replies

Please comment under this thread if you have actually tried this and can compare it to another tool like Cursor, Codex, raw Claude, etc.

I’m super not interested in hearing what people have to say from a distance without actually using it.

3 months ago

FWIW, finished an eval of claude code against various tasks that amplifier works well on:

The agent demonstrated strong architectural and organizational capabilities but suffered from critical implementation gaps across all three analyzed tasks. The primary pattern observed is a "scaffold without substance" failure mode, where the agent produces well-structured, well-documented code frameworks that either don't work at all or produce placeholder outputs instead of real functionality. Of the three tasks analyzed, two failed due to placeholder/mock implementations (Cross-Repo Improvement Tool, Email Drafting Tool), and one failed due to insufficient verification of factual claims (GDPVAL Extraction). The common thread is a lack of validation and testing before delivery, combined with a tendency to prioritize architecture over functional implementation.

3 months ago

I've tried it. It works better than raw Claude. We're working on benchmarks now. But... it's a moving target as amplifier (an experimental project) is evolving rapidly.

3 months ago

1 reply

Hey all! I'm one of a handful of developers on this project. Great to see it's getting some interest!

For context, we are right in the middle of building this thing... multiple rebuilds daily since we are using it to build itself. The value isn't in the code itself, yet, but in the approaches (UNIX philosophy, meta-cognitive recipes, etc.)

We are really excited about how productive these approaches are even in this early stage. We are able to have amplifier go off and make significant progress unattended for sometimes hours at a time. This, of course, raises a lot of questions on how software will be built in the near future... questions which we are leaning into.

Most of our team's projects, unless they have some unresolved IP or are using internal-only systems, are built in the open. This is a research project at this stage. We recognize this approach is too expensive and too hacky for most independent developers (we're spending thousands of dollars daily on tokens). But once the patterns are identified, we expect we'll all find ways to make them more accessible.

The whole point of this is to experiment and learn fast.

3 months ago

Here's a writeup of the project for more context: https://paradox921.medium.com/amplifier-notes-from-an-experi...

neuroelectron

3 months ago

aka Winamp