Omitting or summarizing low-relevance chunks vs. top-k retrieval
Hi all,
I've been considering a perhaps underexplored method for single-document/small-dataset RAG and I'd love some feedback. It doesn't seem especially novel, but I haven't found anyone describing anything quite like it.
I have a 50k-token document, a Technical Standard, which has been painstakingly cleaned up by hand into 100% clean Markdown. It's our ONE single source of truth, so this document gets all the tender love & care. Being a Standard, it already has an inherent structure (sections, clauses).
It works wonderfully with long-context LLMs. But while they're fairly cheap these days, they are still SLIGHTLY costlier than I’d like (~$0.01/query on models like 4o-mini).
My experiments with traditional vector RAG haven't produced results quite comparable to long-context LLMs, so I’m considering a different approach: instead of chunking the document and retrieving top-k based on cosine similarity, I’d manually chunk by section or clause and keep the document’s structure intact.
Of course, if you concatenated all the chunks you'd get the original document.
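Roughly the kind of chunking I have in mind (a minimal sketch, assuming each section/clause starts with a Markdown heading of a fixed level; the heading level and field names are placeholders, not my actual setup):

```python
import re

def chunk_by_heading(markdown_text: str, level: int = 2) -> list[dict]:
    """Split the Markdown Standard into ordered section chunks.

    Assumes each section/clause starts with a heading of the given
    level (e.g. "## 4.2 Some clause"). Concatenating the chunks in
    order reproduces the original document.
    """
    # Zero-width split right before each heading, so no text is lost.
    pattern = rf"(?m)^(?={'#' * level} )"
    sections = re.split(pattern, markdown_text)
    return [
        {"index": i, "text": section}
        for i, section in enumerate(sections)
        if section.strip()
    ]
```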
The idea is to omit or summarize low-relevance chunks, flagged by cosine (dis)similarity to the query or perhaps a hybrid of techniques, while maintaining the document's order. For the very lowest-relevance parts, we'd insert "[Omitted, low relevance]" and/or a brief summary, allowing the LLM to process the document sequentially while saving tokens.
This way I avoid breaking the flow while reducing token costs: I keep tokens of questionable relevance (much of which is probably still unnecessary), but I prune the tokens that are definitely irrelevant.
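The assembly pass could look something like this (again just a sketch: the thresholds are made up and would need tuning, and the "summary"/"embedding" fields are assumed to be precomputed offline with whatever embedding model you like):

```python
import numpy as np

# Made-up thresholds -- these would need tuning against real queries.
FULL_THRESHOLD = 0.45
SUMMARY_THRESHOLD = 0.30

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assemble(chunks: list[dict], query_embedding: np.ndarray) -> str:
    """Rebuild the document in its original order, swapping
    low-relevance chunks for their summary or an omission marker.

    Each chunk is assumed to carry precomputed 'text', 'summary'
    and 'embedding' fields.
    """
    parts = []
    for chunk in chunks:  # already in document order
        score = cosine(chunk["embedding"], query_embedding)
        if score >= FULL_THRESHOLD:
            parts.append(chunk["text"])
        elif score >= SUMMARY_THRESHOLD:
            parts.append(chunk["summary"])
        else:
            parts.append("[Omitted, low relevance]")
    return "\n\n".join(parts)
```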
I'm thinking each chunk could have at least two versions: the full chunk and a highly abridged one. I'm also prepared to implement manual rules, say "if this chunk is returned, then this other one MUST be returned, regardless of calculated similarity."
When "assembling" the document, one chunk at a time, we simply decide whether each chunk is worth including in full or not.
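If the similarity pass is split into a "decide" step and a "render" step, the manual rules could sit between the two. Something like this (the rule map is purely hypothetical):

```python
# Hypothetical rule map: if the key chunk is included in full, force
# these other chunk indexes to full as well, whatever their score.
MUST_INCLUDE_WITH = {
    7: [3, 8],  # e.g. a clause that's useless without its definitions and annex
}

def apply_rules(decisions: dict[int, str]) -> dict[int, str]:
    """Upgrade dependent chunks to 'full' when a rule demands it.

    'decisions' maps chunk index -> "full" | "summary" | "omit",
    as produced by the similarity pass.
    """
    for trigger, dependents in MUST_INCLUDE_WITH.items():
        if decisions.get(trigger) == "full":
            for idx in dependents:
                decisions[idx] = "full"
    return decisions
```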
Would love to know if anyone’s tried something like this or has suggestions!