Skill Capsules | Not Hacker News!

Discussion (3 comments)

18 days ago

1 reply

A bit of backstory:

I got really interested in LLMs in 2020, after the GPT-3 release demonstrated in-context learning. But I had first tried running an LLM a year earlier, when I tried out AI Dungeon 2 (based on GPT-2).

Back in 2020 people were discussing how transformer-based language models are limited in all sorts of ways (operating on a tiny context, etc.). But as I learned how transformers work, I got really excited: it's possible to use raw vectors as input, not just text. So I got this idea that all kinds of modules can be implemented on top of pre-trained transformers via adapters which translate any data into the representations of a particular model. E.g. you can make a new token representing some command, etc.
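To make the "raw vectors as input" point concrete, here is a minimal sketch in plain Python. The embedding table, the token names and the `command_vector` values are all made up for illustration; a real model has a learned table with one row per vocabulary token, and an injected vector would be trained or produced by an adapter.

```python
# Toy embedding table (hypothetical values); a real model learns one row per token.
EMBEDDINGS = {
    "look": [0.1, 0.3, -0.2],
    "north": [0.7, -0.2, 0.4],
}

def embed(tokens):
    """Ordinary path: each text token is looked up in the fixed table."""
    return [EMBEDDINGS[t] for t in tokens]

# A "command" vector that matches no vocabulary entry (made-up values).
command_vector = [0.9, 0.9, -0.9]

# The transformer stack only ever sees a list of vectors,
# so looked-up and injected vectors can be mixed freely.
model_input = [command_vector] + embed(["look", "north"])
print(len(model_input), "input vectors of width", len(model_input[0]))
```

The only requirement on the injected vector is that it has the same width as the model's token embeddings.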

A lack of memory was one of the hot topics, so I did a little experiment: since the KV cache has to encode 'run-time' memory, I tried transplanting parts of the KV cache from one model forward pass into another, and apparently only a few middle layers were sufficient to make the model recall a name from the prior pass. But I didn't go further, as it was too time-consuming for a hobby project. So that's where I left it.
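The transplant idea can be sketched on a toy single-head attention layer. This is not the original experiment: the weights are random, there is one layer and no positional encoding, and every name here (`kv_cache`, `attend`, `cache_a`, ...) is hypothetical. Under those assumptions, prepending the keys and values cached in pass A to pass B gives the same attention output as if the context had been part of pass B all along.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D = 4  # toy model width
W_q, W_k, W_v = (random_matrix(rng, D, D) for _ in range(3))

def kv_cache(embeddings):
    """Keys and values a layer would cache for these input vectors."""
    return ([matvec(W_k, e) for e in embeddings],
            [matvec(W_v, e) for e in embeddings])

# Pass A: some context (e.g. the sentence that mentions a name).
context = random_matrix(rng, 3, D)
cache_a = kv_cache(context)

# Pass B: a separate prompt that never saw the context.
prompt = random_matrix(rng, 2, D)
cache_b = kv_cache(prompt)
query = matvec(W_q, prompt[-1])

out_without = attend(query, *cache_b)

# Transplant: put pass A's cached keys/values in front of pass B's.
out_transplanted = attend(query, cache_a[0] + cache_b[0], cache_a[1] + cache_b[1])

# Reference: context and prompt processed together in one pass.
out_single_pass = attend(query, *kv_cache(context + prompt))
```

In a real multi-layer model the keys and values of deeper layers depend on everything earlier in the sequence, and positions matter, so a transplant is only an approximation there.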

Over the years, academic researchers arrived at the same ideas I had and gave them names:

* arbitrary vectors injected in place of fixed token embeddings are called a "soft prompt"
* a custom KV prefix added before the normal context is called "prefix tuning"
* using a "soft prompt" to generate a KV prefix which encodes a memory is called "gisting"
* a KV prefix encoding a specific collection of documents was recently called a "cartridge"

Opus 4.5 running in Claude Code can pretty much run an experiment of this kind on its own, starting from a general idea. But it still needs some help: making sure we use prompts and formats which actually make sense, looking for the best data set, etc.

visarga

15 days ago

1 reply

The prefix tuning approach was largely abandoned in favor of LoRA. Whether you tune a prefix or some adapter layers, the training process is the same, but LoRAs are more flexible to train.
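For readers who have not seen LoRA, a minimal sketch of the idea in plain Python: the pre-trained weight matrix stays frozen and a trainable low-rank product is added to it. The sizes, `RANK` and `ALPHA` are illustrative assumptions, not values from any particular model.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

rng = random.Random(0)
D_OUT, D_IN, RANK, ALPHA = 6, 8, 2, 4.0  # illustrative sizes only

# Frozen pre-trained weight.
W = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
# Trainable low-rank factors: A gets a small random init, B starts at zero,
# so the adapted layer initially behaves exactly like the base layer.
A = [[rng.gauss(0, 0.02) for _ in range(D_IN)] for _ in range(RANK)]
B = [[0.0] * RANK for _ in range(D_OUT)]

def lora_forward(x):
    """Compute W x + (ALPHA / RANK) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (ALPHA / RANK) * u for b, u in zip(base, update)]

x = [rng.uniform(-1, 1) for _ in range(D_IN)]

full_params = D_OUT * D_IN            # 48 weights in the frozen matrix
lora_params = RANK * (D_IN + D_OUT)   # 28 trainable weights in A and B
print(full_params, lora_params)
```

The update changes what the weights compute for every input, whereas a KV prefix only adds entries for attention to look at.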

The Skills concept emerged naturally once you see how coding agents use docs, CLI tools and code. The advantage of Skills is that they can be edited on the fly to incorporate new information and can learn from any feedback source: humans, code execution, web search or LLMs.

killerstorm

12 days ago

KV-based "skill capsules" are very different from LoRAs / classic prefix tuning:

  * A "hypernetwork" (which can, in fact, be the same LLM) can build a skill capsule _from a single example_. You can't get a LoRA or a KV prefix from just one example.
  * It can be inserted at any point, as needed. I.e. if during reasoning you find that you need a particular skill, you can insert it.
  * They are composable, and far less likely to overwrite some information, as they only affect the KV cache and not the weights.
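A toy sketch of the "insertable and composable" claim, assuming a capsule is simply a block of extra key/value pairs. This is not the actual skill-capsule implementation: the capsules here are random stand-ins (`style_capsule` and `tool_capsule` are hypothetical names), there is a single attention head, and there are no positional encodings.

```python
import math
import random

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

rng = random.Random(1)
D = 4  # toy model width

def random_vectors(n):
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(n)]

def random_kv(n):
    """Stand-in for a cached context or a capsule: n key/value pairs."""
    return random_vectors(n), random_vectors(n)

context = random_kv(3)        # the ordinary KV cache
style_capsule = random_kv(2)  # hypothetical capsules
tool_capsule = random_kv(2)
query = random_vectors(1)[0]

def with_capsules(*capsules):
    """Attend over the context plus any capsules; nothing is modified in place."""
    keys = [k for cap in capsules for k in cap[0]] + context[0]
    values = [v for cap in capsules for v in cap[1]] + context[1]
    return attend(query, keys, values)

baseline = with_capsules()
one = with_capsules(style_capsule)
both = with_capsules(style_capsule, tool_capsule)
swapped = with_capsules(tool_capsule, style_capsule)
restored = with_capsules()  # capsules dropped again
```

Here `restored` equals `baseline` because neither the weights nor the original cache were touched, and `both` matches `swapped` up to rounding because this toy has no positions. In a real model, positional encodings make the insertion point matter.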

Skills as used by Anthropic & OpenAI are just textual instructions. A KV-based skill capsule can be a lot more compact (and thus would contribute less to context rot) and might encode information which is difficult to convey through instructions (e.g. style).