Chaotic Futurism; @SJTU-IPADS
- https://66ring.github.io/
- vx: _66RING_
Stars
Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs".
VPTQ: a flexible and extreme low-bit quantization algorithm
The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
A throughput-oriented high-performance serving framework for LLMs
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
An FP8 flash attention implementation for the Ada architecture, built with the cutlass repository
mimalloc is a compact general purpose allocator with excellent performance.
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
SquirrelFS: A crash-consistent Rust file system for persistent memory (OSDI 24)
Dynamic Memory Management for Serving LLMs without PagedAttention
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
A Fast and Extensible DRAM Simulator, with built-in support for modeling many different DRAM technologies including DDRx, LPDDRx, GDDRx, WIOx, HBMx, and various academic proposals. Described in the…
A lightweight library for portable low-level GPU computation using WebGPU.
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
[HotStorage'24] Can Modern LLMs Tune and Configure LSM-based Key-Value Stores?
The repo for NSDI24 paper: SIEVE is Simpler than LRU: an Efficient Turn-Key Eviction Algorithm for Web Caches
The Operating System for JudgeDuck -- Stable and Accurate Judge System
Long-short token decoding: up to 4x decoding speedup for long-context LLMs, in about a hundred lines of core code. Open-sourced for learning.
An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
GLM-4 series: Open Multilingual Multimodal Chat LMs
A large-scale simulation framework for LLM inference