Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to multimodal inputs that combine text and image data. They have recently garnered attention due to their ...