- Chaotic Futurism; @SJTU-IPADS
- https://66ring.github.io/
- https://vx._66RING_
Highlights
- Pro
llm
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Fast inference from large language models via speculative decoding
[ICLR 2024] Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
AI Native Data App Development framework with AWEL (Agentic Workflow Expression Language) and Agents
Papers for database systems powered by artificial intelligence (machine learning for databases)
This repository contains tutorials and examples for Triton Inference Server