mit-han-lab/streaming-llm: [ICLR 2024] Efficient Streaming Language Models with Attention Sinks
🚀 Exciting news! Introducing StreamingLLM from @mit-han-lab on GitHub, for infinite-length inputs. With attention sinks, it speeds up streaming settings by up to 22.2x and keeps models fluent without extensive memory. Perfect for multi-round dialogues! 🤖📈 #AI #GitHub #LLM
- The project enables efficient deployment of Large Language Models (LLMs) on infinite-length inputs.
- StreamingLLM enables LLMs to generalize to infinite sequence length without fine-tuning.
- Keeping attention sinks (the initial tokens) in the cache recovers the performance of window attention, which otherwise degrades once those tokens are evicted.
- In streaming settings, it achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
- It retains only the attention sinks and the most recent tokens, discarding the intermediate tokens (see the sketch after this list).
- StreamingLLM does not expand the context window of LLMs; the effective context remains constrained by the attention window set during pre-training.
- StreamingLLM is optimized for streaming applications like multi-round dialogues.
- It allows models to run continuously without growing memory use and without depending on the full past input.
- It enables models to generate fluent text from recent tokens without a cache refresh.
- StreamingLLM is orthogonal to recent context extension methods and can be integrated with them.
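The eviction policy described in the bullets above can be sketched in a few lines: at each step, keep the first few "sink" positions and the most recent window of every layer's key/value tensors, and drop everything in between. This is a minimal illustrative sketch, not the repository's actual API; the function name `evict_middle`, the parameters `n_sink` and `n_recent`, and the [batch, heads, seq, dim] tensor layout are assumptions (the layout follows the common Hugging Face convention).

```python
import torch


def evict_middle(past_kv, n_sink=4, n_recent=2000):
    """Trim a per-layer KV cache to attention sinks + recent window.

    past_kv: list of (key, value) tensor pairs, each shaped
    [batch, heads, seq, dim]. Keeps the first n_sink and the last
    n_recent positions along the sequence dimension; the middle is
    discarded, so cache size stays bounded as the stream grows.
    """
    new_past = []
    for k, v in past_kv:
        seq_len = k.size(2)
        if seq_len <= n_sink + n_recent:
            # Cache still fits within the budget; nothing to evict.
            new_past.append((k, v))
            continue
        k_kept = torch.cat([k[:, :, :n_sink], k[:, :, -n_recent:]], dim=2)
        v_kept = torch.cat([v[:, :, :n_sink], v[:, :, -n_recent:]], dim=2)
        new_past.append((k_kept, v_kept))
    return new_past


# Toy demonstration: random tensors standing in for a 2-layer cache
# that has grown to 3000 tokens.
toy = [(torch.randn(1, 8, 3000, 64), torch.randn(1, 8, 3000, 64))
       for _ in range(2)]
trimmed = evict_middle(toy, n_sink=4, n_recent=2000)
print(trimmed[0][0].shape)  # torch.Size([1, 8, 2004, 64])
```

Per the paper, positions are assigned relative to the rolling cache rather than to absolute text position, which is why the context window itself never grows even as the stream becomes arbitrarily long.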