ABSTRACT

On-device learning in practice faces performance bottlenecks caused by limited I/O bandwidth, low memory capacity, scarce computational primitives, and network transmission latency. Data quantization is a promising method to address these challenges: it represents data values at a relatively low precision without hampering the final quality of model training. Among quantization methods, the Post-training Quantization scheme compresses the model after training is completed; it is often used in offline inference scenarios and does not require a large amount of data. In contrast, the Quantization-aware Training scheme trains a quantized model from scratch, typically achieves higher accuracy, and is often combined with fine-tuning and transfer learning techniques. This chapter demonstrates that data quantization can be combined with other optimization methods, e.g., network pruning and knowledge distillation, to further reduce computational overhead and accelerate on-device learning applications. The chapter also provides a hands-on practice of image classification on a commodity device to guide readers in implementing typical ML applications.
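To make the distinction between the two schemes concrete, the sketch below contrasts them using PyTorch's torch.ao.quantization API; the network, layer sizes, and omitted training loop are illustrative placeholders rather than the chapter's actual setup, and only the mechanics of each scheme are shown.

    import torch
    import torch.nn as nn
    import torch.ao.quantization as tq

    class TinyNet(nn.Module):
        """Hypothetical classifier standing in for the chapter's model."""
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()      # float -> int8 at the input boundary
            self.fc1 = nn.Linear(784, 256)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(256, 10)
            self.dequant = tq.DeQuantStub()  # int8 -> float at the output boundary

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.fc1(x))
            x = self.fc2(x)
            return self.dequant(x)

    # Post-training Quantization (dynamic variant): applied after training is
    # finished; needs no calibration data, and weights are stored as int8.
    trained = TinyNet().eval()
    ptq_model = tq.quantize_dynamic(trained, {nn.Linear}, dtype=torch.qint8)

    # Quantization-aware Training: insert fake-quantize ops, train the model
    # with quantization simulated in the forward pass, then convert to int8.
    qat_model = TinyNet()
    qat_model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    tq.prepare_qat(qat_model.train(), inplace=True)
    # ... the usual training (or fine-tuning) loop runs here ...
    int8_model = tq.convert(qat_model.eval())

Note how PTQ touches only a finished model, while QAT changes the training procedure itself, which is why it pairs naturally with fine-tuning and transfer learning as described above.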