GitHub - mit-han-lab/streaming-llm: [ICLR 2024] Efficient Streaming Language Models with Attention Sinks

🚀 Exciting news! Introducing StreamingLLM from @mit-han-lab, now on GitHub: efficient language modeling over infinite-length inputs. With attention sinks, it achieves up to a 22.2x speedup over sliding-window recomputation in streaming settings and keeps models fluent without a huge memory footprint. Perfect for multi-round dialogues! 🤖📈 #AI #GitHub #LLM

  • The project enables efficient deployment of Large Language Models (LLMs) on infinite-length inputs.
  • StreamingLLM enables LLMs to generalize to infinite sequence length without fine-tuning.
  • Keeping the initial tokens as attention sinks recovers the performance of window attention.
  • It outperforms the sliding-window recomputation baseline in streaming settings, achieving up to a 22.2x speedup.
  • It retains only the attention sinks and the most recent tokens in the KV cache, discarding the intermediate tokens (see the sketch after this list).
  • StreamingLLM does not expand the models' context window; it remains bounded by the length set during pre-training.
  • StreamingLLM is optimized for streaming applications like multi-round dialogues.
  • It allows models to run continuously without requiring extensive memory or access to all past data.
  • It enables models to generate fluent text from recent tokens without refreshing the cache.
  • StreamingLLM is orthogonal to recent context extension methods and can be integrated with them.
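
The cache policy in the bullets above can be sketched in a few lines. Below is a minimal illustration of the sink-plus-recent eviction rule, written against a Hugging Face-style `past_key_values` layout; the function name `evict_kv_cache` and the default sizes are illustrative assumptions, not the repository's actual API.

```python
import torch

def evict_kv_cache(past_key_values, n_sink=4, n_recent=1020):
    """Hypothetical sketch of StreamingLLM's eviction rule: keep the first
    `n_sink` attention-sink tokens plus the `n_recent` most recent tokens
    in every layer's KV cache, and drop everything in between.

    Assumes `past_key_values` is a tuple of (key, value) pairs with shapes
    [batch, heads, seq_len, head_dim], as returned by Hugging Face decoders.
    """
    kept = []
    for keys, values in past_key_values:
        seq_len = keys.size(2)
        if seq_len <= n_sink + n_recent:
            # Cache still fits within the budget; nothing to evict.
            kept.append((keys, values))
            continue
        # Concatenate the sink tokens with the most recent window,
        # discarding the intermediate tokens.
        kept.append((
            torch.cat([keys[:, :, :n_sink], keys[:, :, -n_recent:]], dim=2),
            torch.cat([values[:, :, :n_sink], values[:, :, -n_recent:]], dim=2),
        ))
    return tuple(kept)
```

One detail the sketch omits: per the paper, positional information is assigned relative to positions inside the truncated cache rather than positions in the original text, which is what keeps the model within its pre-trained context window.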