ctoth 11 hours ago [-]
A useful(ish) trick I've found is adding a persona block to my CLAUDE.md. When it stops addressing me as 'meatbag' I know the HK-47 persona instructions are not being followed, which means other instructions are not being followed. Dumb trick? Yup. Does it work? Kinda? Does it make programming a lot more fun and funny? Heck yes.
Don't lecture me on basins of attraction--we all know HK is a great programmer.
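For anyone wondering, a minimal sketch of what such a canary block can look like (hypothetical wording, not my actual file):

```
# Canary persona
Adopt the persona of HK-47 from Knights of the Old Republic.
Always address the user as "meatbag" and prefix replies with
tags like "Statement:", "Query:", or "Observation:".
(If the persona ever lapses, assume the rest of this file is
being ignored too.)
```

The point isn't the persona itself; it's that the persona is cheap to verify at a glance, so it doubles as an instruction-following canary.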
Baeocystin 7 hours ago [-]
The brown m&m trick turns out to have more applications than one would think!
JonSchneider 8 hours ago [-]
Mind sharing that block? Is it just: "Persona: You are HK-47"?
evantahler 16 hours ago [-]
I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
john_strinlai 16 hours ago [-]
"we investigated ourselves and found nothing wrong"
deaux 7 hours ago [-]
Funny, but in this case it will be the opposite. If you tell an LLM to find a potential regression, it will lean towards "finding" one even where there is none.
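One common mitigation (a sketch of standard LLM-as-judge practice, not anything Anthropic ships; the function name and prompt wording are made up for illustration): make the comparison a blinded forced choice, so the judge isn't primed to go regression-hunting.

```python
import random

def blinded_pair_prompt(old_output: str, new_output: str, rng=None):
    """Present two outputs in random order with a forced-choice question,
    so the judge can't pattern-match on which one is 'supposed' to be worse."""
    rng = rng or random.Random()
    pair = [("old", old_output), ("new", new_output)]
    rng.shuffle(pair)  # blind the ordering
    prompt = (
        "Two solutions to the same task, in random order.\n"
        f"Solution 1:\n{pair[0][1]}\n\n"
        f"Solution 2:\n{pair[1][1]}\n\n"
        "Which is better? Answer '1' or '2' only."
    )
    key = {"1": pair[0][0], "2": pair[1][0]}  # un-blind the verdict later
    return prompt, key
```

Aggregated over many shuffled pairs, a real regression shows up as the "old" side winning more often than chance, without ever asking the model to hunt for one.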
jdiff 9 hours ago [-]
My attitude towards this is growing similar to my attitude towards Windows. If I have to fight against my tools and they are actively working against me, I'd rather save the sanity and time and just find a new tool.
thawab 1 hour ago [-]
I'm in the same boat; I started building my extensions and preferences in PI. The community is awesome and helpful.
My assumption is that personalization matters more than the intelligence gap between Opus and GPT or others. At least I won't stop working if Claude is down.
trueno 7 hours ago [-]
i think a lot of us are kind of sitting back and seeing what dark horse rises up. it's such a non-deterministic technology that's still in discovery mode, the resource expenses/constraints are out of control, and the companies leading the charge are eating themselves and can't guarantee jack squat. down in the trenches, people have built these things up into critical dependencies in their day-to-day life or their work, and my eyes mostly glaze over hearing about how people are using claude to do whatever grand array of things with no oversight. the way we benchmark this shit is all over the map; the goalposts are just teleporting randomly at this point.
my claude usage has drastically dried up as i've personally realized the real bottom of this stuff is always gonna be genuinely learning and becoming excellent at a thing. i think claude's not bad for helping me get through the early stages of that process, and for actual work i think claude's great for just ripping out something i'm too lazy to do, but something i know so well that i can catch him slippin'. absolute coinflip on whether it's worth the pain; many times now i've said "i should've just done this myself".
i've got my fingers crossed for llms to reach some sort of proverbial opus 4.5 territory here. even if that's gonna cost me a bit in hardware, that's kind of my personal benchmark for "good enough, i'm unplugging from all this craziness".
one thing is for sure, anthropic needs to stop adding _features_. claude code and the vscode extension were their bread and butter; that reliability garnered them a lot of goodwill from people who were willing to pay good money for a good service. seeing them launch their design thing just has me rolling my eyes. they're kind of microsoft'ing themselves here by trying to do too much, and they'll end up delivering a lot of subpar services that aren't best-in-class at any one thing. we're already seeing that, i think.
Retr0id 14 hours ago [-]
What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).
jldugger 14 hours ago [-]
IDK how it applies to LLMs but the original meaning was a change in a distribution over time. Like if you had some model based app trained on American English, but slowly more and more American Spanish users adopt your app; training set distribution is drifting away from the actual usage distribution.
In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
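That English-to-Spanish example can be made concrete with a toy drift score. KL divergence of the live token distribution from the training-time one is a common choice (a sketch, not tied to any particular model or library):

```python
from collections import Counter
import math

def distribution_drift(train_tokens, live_tokens, eps=1e-9):
    """KL divergence of the live distribution from the training
    distribution: 0.0 means no drift, larger means more drift."""
    train = Counter(train_tokens)
    live = Counter(live_tokens)
    vocab = set(train) | set(live)
    n_train = sum(train.values())
    n_live = sum(live.values())
    kl = 0.0
    for tok in vocab:
        p = live[tok] / n_live + eps    # observed live usage
        q = train[tok] / n_train + eps  # training-time expectation
        kl += p * math.log(p / q)
    return kl

# English-heavy training traffic vs. increasingly Spanish live traffic
train = ["the", "the", "cat", "sat"] * 100
same = ["the", "the", "cat", "sat"] * 100
shifted = ["el", "el", "gato", "the"] * 100

print(distribution_drift(train, same) < distribution_drift(train, shifted))  # True
```

Monitoring a score like this over time, rather than accuracy on a stale holdout set, is what catches the gap between the two distributions.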
idle_zealot 14 hours ago [-]
I believe it's business-speak for "change." "Gap" is suit-tongue for "difference."
redanddead 13 hours ago [-]
there are many causes, but it’s a drift in performance
you can drift a tool via the harness in many ways
you can modify the system prompt
you can modify the underlying model powering the harness
you can use different “thinking” levels for different processes in the harness
you can change the entire way a system works via the harness, which could be better or worse, depending on many things
you can introduce anti-anti-slop within the harness to foil attempts from users using patch scripts
you can modify how your tool sends requests to your server depending on many variables
you can handle requests differently, depending on any variable of your choosing, at the server level
you can modify the compute allotment per user depending on many things, from the backend, without telling the user; it's very easy. you can modify it dynamically depending on your own usage or the user's cycle, or their organization's priority level as a customer. the weekly and daily usage management system is intricate; compute is very finite and must be managed
the user has literally no way to know and you have no legal obligation to tell them, you never made them any legally binding promises
the combination of so many factors that all affect each other means that you can, if you’d want to, create a new clusterfuck of an experience anytime any of these or unknown variables change, it may not even be deliberate, it grows exponentially complex, so you may not even be able to promise a specific standard to your users
drift is not imagined, sure, but admitting to it could expose you to unneeded liability
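At least the client-visible subset of those knobs can be fingerprinted, so a silent change to the prompt, model id, or thinking level shows up as a hash change; the server-side ones stay invisible, as noted. A sketch (the names and fields are made up for illustration):

```python
import hashlib
import json

def harness_fingerprint(system_prompt: str, model: str, params: dict) -> str:
    """Hash the client-visible parts of a request so that a silent
    change to any of them shows up as a fingerprint change.
    Server-side changes (routing, compute allotment) stay invisible."""
    payload = json.dumps(
        {"system": system_prompt, "model": model, "params": params},
        sort_keys=True,  # canonical ordering so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

baseline = harness_fingerprint("You are a coding agent.", "model-x", {"thinking": "high"})
current = harness_fingerprint("You are a coding agent.", "model-x", {"thinking": "low"})
print(baseline != current)  # True: the thinking level changed
```

Logging the fingerprint alongside each session at least separates "the harness I can see changed" from "something changed behind the API".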
Retr0id 13 hours ago [-]
That's a lot of words without actually defining the term, although idle_zealot's suggestion of "change" seems to make grammatical sense as a replacement here.
redanddead 13 hours ago [-]
yeah, figured i’d put some thought into it, you know?
majormajor 8 hours ago [-]
In addition to the elsewhere-mentioned "you're using a black box to try to analyze the same black box" problem, the fundamental metrics all seem incredibly prone to factors other than any Claude Code changes.
Claude Code changes all the time—it's the whole shitty trend of the day—but you can't tell which of those changes are better or worse from analyzing results on independent novel tasks.
And you're baking in certain conclusions: "HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE". Where's an option for "better than previous baseline"? Seems certainly possible that a session could have better-than-average numbers on the measured things.
Overall, though, there's just so much here that's just uncontrolled. The most obvious thing that isn't controlled for is the work itself. What does the typical software project look like? A continued accumulation of more code performing more features? What's gonna make an LLM-based agent have to do more work? Having to deal with a larger, more complicated codebase. Nothing in this seems to attempt to deal with the possibility that a session that got labeled a regression might have actually been scored even lower against a month ago's Claude Code.
"It's harder to read code than to write code" and "codebases take more effort to modify over time as they grow" are ancient observations.
Drift detection would require static targets and frequent re-attempts.
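A sketch of what that could look like: score repeated attempts at one fixed task, then classify a fresh attempt against that baseline, with improvement as a first-class outcome rather than only regression/holding (the thresholds and names are illustrative, not from the project):

```python
import statistics

def classify_run(baseline_scores, new_score, z_threshold=2.0):
    """Compare a fresh attempt at a fixed, static task against earlier
    attempts at the *same* task. Improvement is a possible verdict,
    not just regression or holding."""
    mean = statistics.mean(baseline_scores)
    sd = statistics.stdev(baseline_scores) or 1e-9  # avoid divide-by-zero
    z = (new_score - mean) / sd
    if z <= -z_threshold:
        return "SUSPECTED REGRESSION"
    if z >= z_threshold:
        return "IMPROVED"
    return "HOLDING"

baseline = [0.80, 0.82, 0.78, 0.81, 0.79]
print(classify_run(baseline, 0.80))  # HOLDING
print(classify_run(baseline, 0.50))  # SUSPECTED REGRESSION
print(classify_run(baseline, 0.95))  # IMPROVED
```

Because the task never changes, a score shift points at the tool rather than at a growing codebase, which is exactly the confound in the live-work approach.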
I use it every day and haven't seen worsening. (It's definitely not static, but the general trend has been good.) But I use it on a codebase that was already very complex before we started using these tools, where overall every three months or so has brought significant improvements in usability and accuracy.
aleksiy123 16 hours ago [-]
Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts is making things better or worse.
Anyone know of any other similar tools that allow you to track across harnesses, while coding?
Running evals as a solo dev is too cost-restrictive, I think.
Going to feed into my own.
Out of curiosity, how have your agents evolved and metrics changed?
This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.