Support Vector Machines (SVM) have been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies which can make runtime training possible, but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4–84× and 18–47× speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, running on the NVIDIA k20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
There are several important design methods and optimization details for our proposed MIC-SVM on parallelizing and accelerating SVM method on Intel Ivy Bridge CPUs and Intel Xeon Phi coprocessor (MIC). Since Ivy Bridge CPUs and MIC share many similarities in architecture, we employ several similar optimization techniques for them. Figure 1 shows the general flow for parallelizing and optimizing our MIC-SVM, which can also be applied to optimize similar machine learning methods. (1) To analyse the algorithm and architecture, we calculate the Ratio of Computation to Memory Access (RCMA) and the Ratio of peak Computation performance to peak Memory Bandwidth (RCMB). (2) We provide Adaptive heuristic support for input datasets. We apply adaptive support for handling different types of datasets in our MIC-SVM. Additionally, unlike LIBSVM, we use SOA (structure of array) instead of AOS (array of structure) for continuous memory access in the data processing phase. (3) All our experimental architectures employ two-level parallelism: task and data parallelism. The efficient techniques of task parallelism contain the configuration of proper thread number and balanced utilization of cores and threads. To achieve better performance for data parallelism, we apply memory alignment so that vectorization can use aligned load instructions. (4) We remove the caching strategy and reduce shrinking frequency. (5) We effectively reduce the discrepancy between the algorithmic RCMA and architectural RCMB through data reuse. (6) In order to further improve the performance, we supplement other useful techniques such as exploring the proper granularity of parallelism by configuring the data sizes for two-level parallelism, minimizing the threads’ creation and destroy to reduce synchronization overhead, and maximizing the TLB page size to 2 MB to obtain significantly more memory coverage when the datasets require more than 16 MB memory. A skeleton parallel SMO algorithm in MIC-SVM is described in Algorithm 2.
Figure 1. General flow for parallelizing and optimizing MIC-SVM
For extremely large datasets that have more than 30 million non-zero elements, the limited processing power, memory bandwidth, and on-chip buffer of single architecture restrict the further performance improvement. Meanwhile, some of the most advanced supercomputers today have more than one co-processors on each node. To further scale up our algorithm and fully utilize heterogeneous platforms, we redesign our SVM algorithm for HPC systems with multiple co-processors (using Intel MICs as an example). Our design is a general approach and can be applied to similar multi-coprocessor based platforms.
We will provide a cross-platform performance comparison analysis between our proposed MIC-SVM and previous tools including LIBSVM (serial) and optimized GPUSVM. Another important objective is to provide insights for users on how to choose the most suitable architecture towards a specific implementation and input data pattern. Table 1 shows the speedups of different implementations on various architectures over LIBSVM (baseline). The results marked bold represent the best speedup achieved by this implementation on such architecture for this specific input dataset. We can observe that our proposed MIC-SVM achieves 4.4–84× and 18–47× speedups over the serial LIBSVM through Intel Knights Corner MIC and Ivy Bridge CPUs respectively. Fig. 2 shows the speedups of MIC-SVM and GPUSVM over LIBSVM on one single iteration (instead of the total execution time) for various input patterns. This shows the performance gaps among different scenarios more clearly without the interference of the iterations. We argue that speedup is a combination factor of implementation, architecture and input data pattern.
Table 1. Iterations, training time and speedups.
Figure 2. The speedups over LIBSVM based on the time of each iteration