mit-han-lab/streaming-llm: [ICLR 2024] Efficient Streaming Language Models with Attention Sinks
🚀 Exciting news! Introducing StreamingLLM from @mit-han-lab on GitHub, for infinite-length inputs. With attention sinks, it speeds up streaming settings by up to 22.2x and keeps models fluent without extensive memory. Perfect for multi-round dialogues! 🤖📈 #AI #GitHub #LLM
- The project enables efficient deployment of Large Language Models (LLMs) on infinite-length inputs.
- StreamingLLM enables LLMs to generalize to infinite sequence length without fine-tuning.
- Keeping attention sinks (the initial tokens) in the cache recovers the performance of window attention, which otherwise degrades once those tokens are evicted.
- In streaming settings, it achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
- It retains only the attention sinks and the most recent tokens, discarding the intermediate tokens (see the sketch after this list).
- StreamingLLM does not expand the context window of LLMs; the effective context remains constrained by the attention window set during pre-training.
- StreamingLLM is optimized for streaming applications like multi-round dialogues.
- It allows models to run continuously without growing memory use and without depending on the full past input.
- It enables models to generate fluent text from recent tokens without a cache refresh.
- StreamingLLM is orthogonal to recent context extension methods and can be integrated with them.
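The eviction policy described in the bullets above can be sketched in a few lines: at each step, keep the first few "sink" positions and the most recent window of every layer's key/value tensors, and drop everything in between. This is a minimal illustrative sketch, not the repository's actual API; the function name `evict_middle`, the parameters `n_sink` and `n_recent`, and the [batch, heads, seq, dim] tensor layout are assumptions (the layout follows the common Hugging Face convention).

```python
import torch


def evict_middle(past_kv, n_sink=4, n_recent=2000):
    """Trim a per-layer KV cache to attention sinks + recent window.

    past_kv: list of (key, value) tensor pairs, each shaped
    [batch, heads, seq, dim]. Keeps the first n_sink and the last
    n_recent positions along the sequence dimension; the middle is
    discarded, so cache size stays bounded as the stream grows.
    """
    new_past = []
    for k, v in past_kv:
        seq_len = k.size(2)
        if seq_len <= n_sink + n_recent:
            # Cache still fits within the budget; nothing to evict.
            new_past.append((k, v))
            continue
        k_kept = torch.cat([k[:, :, :n_sink], k[:, :, -n_recent:]], dim=2)
        v_kept = torch.cat([v[:, :, :n_sink], v[:, :, -n_recent:]], dim=2)
        new_past.append((k_kept, v_kept))
    return new_past


# Toy demonstration: random tensors standing in for a 2-layer cache
# that has grown to 3000 tokens.
toy = [(torch.randn(1, 8, 3000, 64), torch.randn(1, 8, 3000, 64))
       for _ in range(2)]
trimmed = evict_middle(toy, n_sink=4, n_recent=2000)
print(trimmed[0][0].shape)  # torch.Size([1, 8, 2004, 64])
```

Per the paper, positions are assigned relative to the rolling cache rather than to absolute text position, which is why the context window itself never grows even as the stream becomes arbitrarily long.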