QLoader reduced the BERT-Base size by approximately 6.7x (using a mix of INT4/INT8) with only a 0.4 point drop in the GLUE score, demonstrating the robustness of the sensitivity metric across different model architectures.
Currently, QLoader focuses primarily on linear layers and convolutions. The quantization of activation functions (e.g., GELU, Swish) is handled via look-up tables, which may introduce minor numerical instability on hardware with low-precision ALUs. Future work will focus on integrating learned activation quantization parameters directly into the loader.
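To make the look-up-table approach concrete, the following is a minimal sketch of an INT8 GELU table. It is an illustration under stated assumptions, not QLoader's actual implementation: the function names (`build_gelu_lut`, `gelu_int8`), the symmetric per-tensor scales, and the signed 8-bit range are all hypothetical choices made for the example.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_gelu_lut(in_scale, out_scale):
    """Precompute GELU for every signed 8-bit input code (-128..127).

    in_scale and out_scale are assumed symmetric quantization scales:
    real_value = code * scale.
    """
    lut = []
    for code in range(-128, 128):
        y = gelu(code * in_scale)          # dequantize, apply GELU
        q = round(y / out_scale)           # requantize the result
        lut.append(max(-128, min(127, q))) # clamp to the INT8 range
    return lut


def gelu_int8(code, lut):
    """Apply the activation with a single table read."""
    return lut[code + 128]  # shift the signed code to a 0..255 index


lut = build_gelu_lut(in_scale=0.05, out_scale=0.05)
```

Because the table is built once per pair of scales, inference replaces the transcendental function with one indexed load. The rounding step in `build_gelu_lut` is where quantization error enters the result.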
Once the configuration $B$ is determined, the model is compiled. The QLoader Runtime Engine manages memory allocation. Instead of storing weights in standard 32-bit aligned memory slots, it packs low-bit weights tightly. For example, four INT2 weights are packed into a single INT8 container. During inference, the loader dispatches specialized kernels that unpack the weights and perform the computation simultaneously, minimizing memory access overhead.
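The packing scheme described above can be sketched as follows. This is a hedged illustration, not QLoader's kernel code: the helper names (`pack_int2`, `unpack_int2`, `packed_dot`), the use of unsigned 2-bit codes, and the choice to place the first weight in the lowest bits are assumptions made for the example.

```python
def pack_int2(values):
    """Pack unsigned 2-bit codes (0..3) four per byte, first code in the low bits."""
    if any(v < 0 or v > 3 for v in values):
        raise ValueError("each value must fit in 2 bits (0..3)")
    # Pad with zeros so the length is a multiple of four.
    padded = list(values) + [0] * (-len(values) % 4)
    packed = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for slot, v in enumerate(padded[i:i + 4]):
            byte |= v << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_int2(packed, count):
    """Recover the first `count` 2-bit codes from a packed buffer."""
    out = []
    for byte in packed:
        for slot in range(4):
            out.append((byte >> (2 * slot)) & 0b11)
    return out[:count]


def packed_dot(packed, count, activations):
    """Dot product that extracts each weight as it is consumed,
    so no unpacked weight array is ever materialized."""
    total = 0
    for i in range(count):
        weight = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        total += weight * activations[i]
    return total


weights = [1, 2, 3, 0, 2]
packed = pack_int2(weights)  # 5 weights occupy 2 bytes instead of 5
```

Here `packed_dot` mirrors the fused unpack-and-compute idea in scalar form: each weight is extracted with a shift and mask at the point of use, so only the packed bytes are read from memory.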