Building highly efficient data mining modules based on heterogeneous architectures

As shown in Fig. 1, the first part of our data-analysis-oriented work is to develop efficient data mining modules, such as EM-GMM (Expectation-Maximization algorithm for Gaussian Mixture Models), SVM, and CNN, on heterogeneous architectures, such as FPGA, GPU, MIC, and Sunway TaihuLight. There is also algorithmic work that develops a targeted mutation strategy for differential evolution.

For EM-GMM, we propose a novel design targeting reconfigurable platforms. Our major innovations include a pipeline-friendly EM-GMM with diagonal covariance matrices, a function evaluation unit for Gaussian probability density based on fixed-point arithmetic, and extended support for a wide range of dimensions and/or components. Our dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA runs over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU [FPT12, TC17].
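To illustrate why diagonal covariance matrices suit a pipelined design, the following is a minimal software sketch of one EM iteration for a diagonal-covariance GMM. It is not the hardware design from [FPT12, TC17]: it uses floating-point rather than fixed-point arithmetic, and the names `log_gauss_diag`, `em_step`, `fit_demo`, and `var_floor` are assumptions made for this example. With a diagonal covariance, the log-density reduces to a per-dimension sum with no matrix inversion, which is the property a streaming datapath can exploit.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance.

    The density factorizes over dimensions, so this is a plain sum.
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def em_step(data, weights, means, variances, var_floor=1e-6):
    """Run one EM iteration.

    Returns the updated (weights, means, variances) and the log-likelihood
    of the data under the parameters that were passed in.
    """
    n, d, k = len(data), len(data[0]), len(weights)
    resp, log_lik = [], 0.0
    # E-step: responsibilities, normalized with log-sum-exp for stability.
    for x in data:
        logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                for j in range(k)]
        top = max(logs)
        log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
        log_lik += log_norm
        resp.append([math.exp(v - log_norm) for v in logs])
    # M-step: weighted counts, means, and per-dimension variances.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty components
        mean = [sum(r[j] * x[i] for r, x in zip(resp, data)) / nj for i in range(d)]
        var = [max(sum(r[j] * (x[i] - mean[i]) ** 2 for r, x in zip(resp, data)) / nj,
                   var_floor)
               for i in range(d)]
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v, log_lik

def fit_demo(iterations=10):
    """Fit two components to a tiny, hand-made 2-D data set (illustrative only)."""
    data = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
    weights = [0.5, 0.5]
    means = [[1.0, 1.0], [4.0, 4.0]]
    variances = [[1.0, 1.0], [1.0, 1.0]]
    history = []
    for _ in range(iterations):
        weights, means, variances, ll = em_step(data, weights, means, variances)
        history.append(ll)
    return weights, means, variances, history
```

Calling `fit_demo()` returns the fitted parameters together with the log-likelihood recorded at each iteration, which should not decrease from one iteration to the next.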

For SVM, we designed and implemented MIC-SVM, a highly efficient parallel SVM library for x86-based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and the Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques that fully utilize the multilevel parallelism provided by these architectures and can serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4–84× and 18–47× speedups over the popular LIBSVM on MIC and Ivy Bridge CPUs, respectively, for several real-world data-mining datasets [IPDPS14b, JPDC15]. The work was later extended to support parallelization across multiple nodes and the solving of multi-class problems [JSTARS17].

For deep learning, we derive a performance model that guides us in identifying the most suitable approach for mapping convolutional neural networks (CNNs) onto the 260 cores within the Sunway chip. By performing a systematic optimization that explores major factors, such as the organization of convolution loops, blocking techniques, register data communication schemes, and reordering strategies for the two instruction pipelines, we achieve a double-precision performance of over 1.6 Tflops for the convolution kernel, which is 54% of the theoretical peak. Compared with a Tesla K40m with cuDNNv5, swDNN delivers a 1.91–9.75× speedup in an evaluation with over 100 parameter configurations [IPDPS17].

In the algorithmic work, we consider the possibility of achieving a better trade-off and more accurate results by reducing the randomness of the differential vector, and design a tight adaptive DE variant called TADE. In TADE, the population is divided into a major subpopulation that adopts the general strategy and a minor subpopulation that uses our proposed strategy of sharing the same base vector while reducing the randomness in the differential vector. Based on success-history parameter adaptation, TADE uses a simple information exchange scheme to avoid homogeneity of parameters. Extensive experiments on the CEC2014 suite show that TADE achieves better or equivalent performance on at least 76.7% of the functions compared with five state-of-the-art DE variants [ICTAI15, PPSN16].
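For readers unfamiliar with the terms "base vector" and "differential vector", the following is a minimal sketch of classic differential evolution (DE/rand/1 with binomial crossover), the kind of general strategy that DE variants build on. It is not TADE itself: the subpopulation split, the reduced-randomness mutation, and success-history parameter adaptation are not reproduced here. The function names, the sphere test objective, and the default values of `f` and `cr` are assumptions chosen for illustration.

```python
import random

def sphere(x):
    """Simple test objective (assumed for this example): sum of squares."""
    return sum(v * v for v in x)

def de_rand_1_bin(objective, dim, bounds, pop_size=20, f=0.5, cr=0.9,
                  generations=100, seed=0):
    """Classic DE/rand/1/bin minimizer; returns (best_vector, best_fitness)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from target i.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            # Mutation: base vector pop[r1] plus a scaled differential vector.
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d])
                      for d in range(dim)]
            # Binomial crossover; j_rand guarantees at least one mutant gene.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            # Clip to the search bounds.
            trial = [min(max(v, lo), hi) for v in trial]
            # Greedy selection: keep the trial only if it is no worse.
            t_fit = objective(trial)
            if t_fit <= fitness[i]:
                pop[i], fitness[i] = trial, t_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]
```

In this baseline, both the base vector and the two vectors forming the difference are drawn at random. The strategy described above keeps the base vector shared and makes the differential vector less random, a change that would be confined to the mutation line.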

Key Publications for Data Mining Modules

[JSTARS17] Weijia Li, Haohuan Fu*, Yang You, et al., “Parallel Multiclass Support Vector Machine for Remote Sensing Data Classification on Multicore and Many-Core Architectures”, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, June 2017, DOI: 10.1109/JSTARS.2017.2713126.

[TC17] Conghui He, Haohuan Fu*, Ce Guo, Wayne Luk, and Guangwen Yang, “A Fully-Pipelined Hardware Design for Gaussian Mixture Models”, IEEE Transactions on Computers, vol. 66, no. 11, pp. 1837-1850, 2017.

[IPDPS17] Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang, “swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight”, in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 615-624, May 2017.

[PPSN16] Weijie Zheng, Haohuan Fu, and Guangwen Yang, “TADE: Tight Adaptive Differential Evolution”, in Proceedings of the 14th International Conference on Parallel Problem Solving from Nature (PPSN), pp. 113-122, Edinburgh, Scotland, UK, 2016.

[ICTAI15] Weijie Zheng, Haohuan Fu, and Guangwen Yang, “Target Mutation: A Novel Mutation Strategy for Differential Evolution”, Best Paper Award, in Proceedings of the 27th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 286-293, Vietri Sul Mare, Italy, 2015.

[JPDC15] Yang You, Haohuan Fu*, Shuaiwen Song, et al., “Scaling Support Vector Machines on Modern HPC Platforms”, Journal of Parallel and Distributed Computing, vol. 76, pp. 16-31, February 2015.

[IPDPS14] Yang You, Shuaiwen Song, Haohuan Fu, et al., “MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures”, in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 809-818, 2014.

[FPT12] Ce Guo, Haohuan Fu, and Wayne Luk, “A Fully-Pipelined Expectation-Maximization Engine for Gaussian Mixture Models”, in Proceedings of the International Conference on Field-Programmable Technology (FPT), pp. 182-189, Seoul, 2012.