In big data analytics, large-scale machine learning applications often resort to distributed processing and parallel computing, where coordinating the collaboration between edge nodes, especially in heterogeneous environments, has become a promising research direction for both algorithm design and system implementation. This chapter elaborates on an efficient and scalable Tiny ML platform that is compatible with heterogeneous environments and fully exploits the capacity of edge devices when running machine learning applications. To achieve this goal, one critical question is how to build a high-performance architecture for large-scale edge learning systems. This chapter summarizes the existing parallelism mechanisms for Tiny ML systems. As an emerging distributed training framework, Federated Learning (FL) aims to collaboratively train ML models among different participants without sharing their raw data at any point in the training process. This chapter also presents a hands-on practice of FL implementation. By following the steps in this practice, readers can easily construct an FL training platform, which helps in understanding the concept of FL.