In your documentation for Prefilling at https://console.groq.com/docs/prefilling I noticed the following sentence:
> Note: For some models, adding a newline after the prefill assistant message leads to better results.
You can fix this by implementing “token healing”. The cause of the decreased quality is that a sequence of smaller tokens is less likely to be generated by the LLM than the single larger token that covers the same text. If the prefill stops in the middle of such a sequence, the larger, more likely token cannot be generated anymore, which confuses the model.
For example, consider the prefill `def quicksort(values)`. Without token healing, you might get the completion `def quicksort(values) -> list` instead of the more common `def quicksort(values):`, because the larger, more likely token `):` cannot be generated anymore, since the token `)` already exists and cannot be undone.
The solution is to chop off the last few tokens of the prefill and let the model regenerate them, so it is free to pick the larger token again. Of course, make sure to zero out the probability of every token that does not match the removed text.
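To make the idea concrete, here is a minimal sketch in Python. The `tokenizer` and `model` objects and their methods are hypothetical stand-ins (nothing from your API or from llama.cpp), and it uses greedy decoding to keep the code short:

```python
import math

def heal_prefill(prefill: str, tokenizer, model, n_backup: int = 1) -> list[int]:
    """Token healing sketch: drop the last n_backup tokens of the prefill and
    regenerate them, zeroing out every token that does not match the dropped text.

    Hypothetical interfaces:
      tokenizer.encode(str) -> list[int], tokenizer.decode(list[int]) -> str
      model.next_token_logits(list[int]) -> list[float] over the vocabulary
    """
    tokens = tokenizer.encode(prefill)
    kept, dropped = tokens[:-n_backup], tokens[-n_backup:]
    constraint = tokenizer.decode(dropped)  # text the model still has to reproduce

    out = list(kept)
    while constraint:
        logits = model.next_token_logits(out)
        best_id, best_logit = None, -math.inf
        for token_id, logit in enumerate(logits):
            text = tokenizer.decode([token_id])
            # Keep only tokens that extend the remaining constraint (e.g. `):`)
            # or are a prefix of it; all other tokens get probability zero.
            if not text or not (text.startswith(constraint) or constraint.startswith(text)):
                continue
            if logit > best_logit:
                best_id, best_logit = token_id, logit
        out.append(best_id)
        chosen = tokenizer.decode([best_id])
        # Shrink the constraint by whatever the chosen token already covered.
        constraint = constraint[len(chosen):] if len(chosen) < len(constraint) else ""
    return out  # continue normal, unconstrained decoding from these tokens
```

In a real sampler you would renormalize the remaining probabilities and sample instead of taking the argmax, but the masking step is the important part.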
Here is the corresponding PR in llama.cpp for reference: https://github.com/ggml-org/llama.cpp/pull/7187
And once you have that token-masking machinery in place, you can implement full-blown GBNF grammar support (grammar-constrained decoding is the same idea with a different rule for which tokens are allowed), which allows the generation of JSON with a specific schema, XML, YAML, syntactically correct programs and everything else that can be expressed as a grammar: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
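For illustration, a tiny grammar in the GBNF notation described in that README could look like this (a sketch written for this message, not one of the grammars shipped with llama.cpp); it forces the output to be a small JSON object with a single "answer" string field:

```
# root is the start rule; the model's output must match it exactly
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
```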