Introduction

Accuracy and speed have long been central concerns for seismic migration methods. In recent years, the main trend has been a move from conventional Kirchhoff migration toward reverse time migration, driven by the demand for better image quality. An inevitable side effect of this trend, however, is longer processing time and a greater demand for computing resources. At the same time, efforts in the opposite direction should not be ignored: while maintaining acceptable image quality, practitioners are pushing migration to run extremely fast. Although Kirchhoff migration has difficulty imaging complex geologic structures (Etgen et al., 2009), it is still widely used for 3D prestack depth imaging in production because of its robustness and ease of use. It also offers the flexibility of target-oriented processing, which helps minimize processing turn-around time (Sun et al., 2000). It is therefore still worthwhile to both improve the image quality and accelerate Kirchhoff migration.

Beam-type migration methods, such as 3-D prestack Kirchhoff beam migration (Sun et al., 2000), parsimonious Kirchhoff depth migration (Hua and McMechan, 2001, 2003), and fast beam migration (Gao et al., 2006), are the major candidates on that front. One idea common to all of these algorithms is to exploit the slant (local slope) information carried by adjacent traces in common-shot or common-receiver gathers. Applying a first-order approximation (Sun et al., 2000) or a local plane-wave assumption (Hua and McMechan, 2001), adjacent traces, treated as a local gather, are slant stacked into the τ-p domain to form a beam gather. In the τ-p domain, additional information about the emitting and emergent directions becomes available, which is the essential reason for the order-of-magnitude speedup consistently achieved in the migration.
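
For reference, a minimal form of the local slant stack (the standard definition; the cited implementations differ in details such as weighting and aperture) maps a local gather of traces u(t, x) around a beam center x0 to a beam gather

    b(τ, p) = Σ_{x in local gather} u(τ + p · (x − x0), x),

so that every (τ, p) sample carries a local plane-wave direction p in addition to the intercept time τ.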

To be more specific, we can classify these algorithms into two types: Kirchhoff beam migration (Sun et al., 2000) and fast beam migration (Hua and McMechan, 2001, 2003; Gao et al., 2006). Kirchhoff beam migration, like conventional Kirchhoff migration, first builds a static, sparse travel-time table using ray tracing. The difference is that the corresponding ray emitting angles are also recorded for every image point. After that, every beam gather is mapped into the image domain at roughly the same cost as mapping a single trace in conventional Kirchhoff migration. As a result, a bundle of traces is migrated together, significantly reducing the computation time.

In fast beam migration, the strongest events are first picked in the τ-p domain. During migration, each event is migrated by dynamically tracing two rays from the source and receiver points, using the associated emitting/emergent angles, which significantly reduces the memory required for travel-time tables. The travel time of the event is then used to locate the reflection region. Compared with the Kirchhoff-style method, only the rays related to the picked events are traced, making it truly "fast".

However, while previous works have all focused on the benefits that beam forming brings to the migration itself, the cost of performing the τ-p transform and the event picking, which together we call the data pre-processing step, has not been fully discussed. While the beam migration itself takes only minutes to process a 3D volume, the pre-processing step can take hours. Moreover, from an algorithmic point of view, the strategies applied in the pre-processing step have a significant effect on the final image quality, which means several passes of pre-processing may be necessary to find the best configuration for a specific case. Thus, the pre-processing step has become the biggest obstacle on the road to an interactive beam migration method.

On the other hand, the rapid development of GPU architectures has provided a high-performance platform for geophysical processing and has brought significant speedups to Kirchhoff and RTM methods (Shi et al., 2011; Liu et al., 2013). Although beam migration is widely used for seismic imaging due to its high performance on CPU clusters with either distributed or shared memory architectures, the variation in beam sizes and the irregular memory access patterns make it difficult to derive an efficient mapping of beam migration to the GPU platform (e.g., Jang et al. (2011)), and few GPU-based beam migration designs have been proposed.

Based on the above considerations, in this work we analyze the computational cost of both the pre-processing part and the migration part of the beam migration algorithm. We target the three kernels that consume most of the compute cycles: the data pre-processing kernel, the ray tracing kernel, and the beam mapping kernel. By carefully porting these three parts onto GPU accelerators and overlapping the I/O with the computation, we derive a highly efficient CPU-GPU hybrid parallel design for beam migration. We further explore the performance potential of this design across different generations of GPU architectures, from Fermi in 2011 to the latest Kepler K80. Our experiments on the SEG/EAGE salt model and the SEAM salt model demonstrate that the proposed data pre-processing part achieves over 10 times speedup compared with a parallel CPU design using 16 CPU cores, and the ray tracing and beam mapping parts achieve 2-6 times speedup compared with the same 16-core parallel CPU design.

Accelerating the slant stack on GPUs

The slant stack operation consists of three steps performed on each trace sample: shift, interpolation, and accumulation. Since these operations are applied independently to different samples in different traces, the slant stack offers abundant parallelism that we can exploit.

[Figure: Data pre-processing flowchart]

In our slant stack design, we adopt three optimization methods to achieve good efficiency on the GPU platform. First, we employ a different threading scheme to process the traces. In CPU-based designs, different threads are generally initialized to process different traces concurrently, as shown in Figure ??, which offers a better L1 cache hit rate. On the GPU platform, which supports the execution of millions of lightweight threads, we instead assign different GPU threads within the same block to process different time samples of the same trace. In this way, as shown in Figure ??, neighboring threads access neighboring memory locations, which yields the most efficient (coalesced) memory access on GPU architectures.
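
As a rough illustration of this thread mapping (a minimal sketch with hypothetical names, not the actual production kernel), each block handles one slowness value of one local gather and its threads sweep the time samples of the output beam trace; the body of the loop is filled in by the next sketch:

// Minimal sketch: one thread per output time sample, one block row per p value.
// Neighboring threads touch neighboring addresses, so global loads/stores coalesce.
__global__ void slant_stack_kernel(const float *traces,  // [n_traces][nt] input local gather
                                   const float *offsets, // [n_traces] offsets to the beam center
                                   const float *p,       // [n_p] slowness values
                                   float *beam,          // [n_p][nt] output beam gather
                                   int n_traces, int nt, float dt)
{
    int t  = blockIdx.x * blockDim.x + threadIdx.x;  // time-sample index within the beam trace
    int ip = blockIdx.y;                             // slowness index handled by this block row
    if (t >= nt) return;
    // shift / interpolate / accumulate for output sample (ip, t) -- see the next sketch
}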

Second, we carefully use registers and shared memory to cache frequently used variables. For example, when accumulating, instead of adding the shifted trace directly onto the target beam trace, which would require Nl global memory accesses, we use a register inside the loop to hold the accumulated value and write it to the target location in global memory only once, at the end.
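
Continuing the sketch above, the accumulation for one output sample keeps its running sum in a register (the linear interpolation written inline here stands in for whatever interpolation the production code uses):

// Body of the kernel sketched earlier: the running sum stays in the register "acc"
// and is written to global memory exactly once, instead of n_traces read-modify-writes.
float acc = 0.0f;
for (int itr = 0; itr < n_traces; ++itr) {
    float tau = t * dt + p[ip] * offsets[itr];        // shifted time for this trace
    int   it  = (int)(tau / dt);
    if (it < 0 || it >= nt - 1) continue;             // outside the recorded window
    float w   = tau / dt - it;                        // linear interpolation weight
    const float *tr = traces + itr * nt;
    acc += (1.0f - w) * tr[it] + w * tr[it + 1];      // shift + interpolate
}
beam[ip * nt + t] = acc;                              // single global memory write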

Third, we apply the stream mechanism offered by CUDA to overlap computation with data transfer. When the number of traces in one local gather is small, the data transfer overhead becomes a substantial part of the total processing time. By applying streams, we can hide the data transfer time behind the computation time, thus removing the cost of transferring data between CPU and GPU.
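
A minimal sketch of a double-buffered stream scheme (buffer names, launch configuration, and sizes are ours and their declarations are omitted; the host buffers are assumed to be pinned so the asynchronous copies can actually overlap with kernel execution):

// Two streams with two device buffers: while stream s copies gather g and runs its
// kernel, the other stream can still be finishing the work for gather g-1.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

for (int g = 0; g < n_gathers; ++g) {
    int s = g % 2;
    cudaMemcpyAsync(d_traces[s], h_traces[g], trace_bytes[g],
                    cudaMemcpyHostToDevice, streams[s]);
    slant_stack_kernel<<<grid, block, 0, streams[s]>>>(d_traces[s], d_offsets[s], d_p,
                                                       d_beam[s], n_traces[g], nt, dt);
    cudaMemcpyAsync(h_beam[g], d_beam[s], beam_bytes[g],
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int s = 0; s < 2; ++s) cudaStreamSynchronize(streams[s]);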

By applying the above three optimization strategies, our final design runs 3 times faster than the naive GPU design that we used as our starting point. According to the NVIDIA profiling tool NVVP, we achieve 63% GPU efficiency for the slant stack kernel, which is close to the theoretical peak of 66%.

Accelerating the ray tracing on GPUs

Interpolation is widely used in ray tracing because the ray coordinates at a given time step do not generally fall on the discretized grid. Not all ray tracing schemes and interpolation algorithms are suitable for the GPU architecture; we choose the wavefront-oriented tracing algorithm (Coman and Gajewski, 2005) with trilinear (cubic-cell linear) interpolation, which we believe provides a good balance between performance and accuracy.

In this design, each ray at a given time step is handled by a lightweight GPU thread. When the rays advance by one time step, trilinear interpolation is invoked to obtain the velocities at the wavefront positions, and a 4th-order Runge-Kutta scheme is employed for the time stepping.
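
A stripped-down sketch of such a per-ray stepping kernel (the right-hand side rhs() below uses a constant velocity only so the fragment is self-contained; the actual code would evaluate it with trilinearly interpolated velocities and velocity gradients):

__device__ void rhs(const float y[6], float dydt[6])
{
    // Placeholder right-hand side of the kinematic ray equations
    //   dx/dt = v^2 * p,   dp/dt = -grad(v)/v
    // written for a constant velocity so the sketch compiles.
    const float v = 2000.0f;                     // placeholder velocity (m/s)
    dydt[0] = v * v * y[3];                      // dx/dt
    dydt[1] = v * v * y[4];                      // dy/dt
    dydt[2] = v * v * y[5];                      // dz/dt
    dydt[3] = dydt[4] = dydt[5] = 0.0f;          // grad(v) = 0 for constant v
}

// One lightweight thread advances one ray by one time step with classical RK4.
// The ray state (3 position + 3 slowness components) is stored structure-of-arrays,
// so thread r reads element r of each of the 6 state vectors.
__global__ void advance_rays(float *state, int n_rays, float dt)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per ray
    if (r >= n_rays) return;

    float y[6], k1[6], k2[6], k3[6], k4[6], tmp[6];
    for (int i = 0; i < 6; ++i) y[i] = state[i * n_rays + r];

    rhs(y, k1);
    for (int i = 0; i < 6; ++i) tmp[i] = y[i] + 0.5f * dt * k1[i];
    rhs(tmp, k2);
    for (int i = 0; i < 6; ++i) tmp[i] = y[i] + 0.5f * dt * k2[i];
    rhs(tmp, k3);
    for (int i = 0; i < 6; ++i) tmp[i] = y[i] + dt * k3[i];
    rhs(tmp, k4);

    for (int i = 0; i < 6; ++i)
        state[i * n_rays + r] =
            y[i] + dt / 6.0f * (k1[i] + 2.0f * k2[i] + 2.0f * k3[i] + k4[i]);
}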

In 3D scenarios, the major challenge in designing high-performance GPU-based ray tracing comes from the way GPU threads access memory. For trilinear interpolation, the 8 neighboring points of a cubic grid cell must be accessed. If each GPU thread directly reads these 8 points, which are scattered in memory, the computing performance can be very poor. We avoid this inefficient memory access pattern by preloading the points into shared memory, which then serves as a fast buffer for the subsequent interpolations.
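
A sketch of this shared-memory buffering (the tile size, the assumption that all rays handled by one block fall inside one velocity tile, and the grid-coordinate positions are our simplifications, not the original implementation):

#define TX 8
#define TY 8
#define TZ 8

// Each block stages a (TX+1)x(TY+1)x(TZ+1) velocity tile into shared memory, then
// every thread performs its 8-point trilinear interpolation from that fast buffer
// instead of issuing 8 scattered global-memory reads.
__global__ void interp_velocity(const float *vel, int nx, int ny,
                                const float3 *pos, float *v_out,
                                int n_rays, int3 tile0)
{
    __shared__ float tile[TZ + 1][TY + 1][TX + 1];

    // Cooperative load of the tile, one or more elements per thread.
    int n_tile = (TX + 1) * (TY + 1) * (TZ + 1);
    for (int idx = threadIdx.x; idx < n_tile; idx += blockDim.x) {
        int iz = idx / ((TY + 1) * (TX + 1));
        int iy = (idx / (TX + 1)) % (TY + 1);
        int ix = idx % (TX + 1);
        tile[iz][iy][ix] =
            vel[((size_t)(tile0.z + iz) * ny + (tile0.y + iy)) * nx + (tile0.x + ix)];
    }
    __syncthreads();

    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n_rays) return;

    // Position of ray r in grid coordinates, assumed to lie inside this block's tile.
    float3 g = pos[r];
    int ix = (int)g.x - tile0.x, iy = (int)g.y - tile0.y, iz = (int)g.z - tile0.z;
    float fx = g.x - floorf(g.x), fy = g.y - floorf(g.y), fz = g.z - floorf(g.z);

    float v = 0.0f;
    for (int c = 0; c < 8; ++c) {                 // 8 corners of the enclosing cell
        float w = ((c & 1) ? fx : 1.0f - fx) *
                  ((c & 2) ? fy : 1.0f - fy) *
                  ((c & 4) ? fz : 1.0f - fz);
        v += w * tile[iz + ((c & 4) >> 2)][iy + ((c & 2) >> 1)][ix + (c & 1)];
    }
    v_out[r] = v;
}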

Further optimizations in this design include storing the travel-time tables in page-locked memory, overlapping computation with memory transfers between the host and the device, and tuning the number of registers per GPU thread for maximum performance.
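
Two of these tweaks, sketched with hypothetical names (the __launch_bounds__ numbers are only illustrative, not our tuned values):

// Capping registers per thread via __launch_bounds__ can raise occupancy;
// here: 256 threads per block, at least 4 resident blocks per SM.
__global__ void __launch_bounds__(256, 4)
advance_rays_tuned(float *state, int n_rays, float dt)
{
    /* same body as advance_rays above */
}

// Travel-time tables are kept in page-locked (pinned) host memory so that
// host-to-device copies can run asynchronously and overlap with kernel execution.
void upload_ttable(float *d_ttable, size_t bytes, cudaStream_t stream)
{
    float *h_ttable = 0;
    cudaHostAlloc((void **)&h_ttable, bytes, cudaHostAllocDefault);  // pinned allocation
    /* ... fill h_ttable from disk ... */
    cudaMemcpyAsync(d_ttable, h_ttable, bytes, cudaMemcpyHostToDevice, stream);
}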

Accelerating the beam mapping on GPUs

Beam mapping (equation (5)) is the procedure in beam migration that maps the beam stacks from the τ-p domain into the image domain, generating the final images for the interpreters. To simplify this procedure, we treat the beam stacks and the travel-time tables, which are usually precomputed, as inputs to beam mapping. In this stage, several interpolations are employed to obtain the travel time and directional information from the travel-time tables and the amplitudes from the beam stacks.

We first design GPU-friendly data structures by storing the different pieces of information (e.g., travel time, angles to sources and receivers) in separate vectors, with adjacent entries stored contiguously in GPU memory. This layout not only allows GPU threads to access memory in the most efficient way, but also makes it possible to set up concurrent kernels that fully utilize the GPU resources. The shared memory technique mentioned above is also used for the 3D volume interpolations.
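
A sketch of such a structure-of-arrays layout (struct and field names are hypothetical): each field lives in its own contiguous array rather than being interleaved per point, so consecutive threads read consecutive addresses of whichever field a kernel needs.

// Structure-of-arrays layout for the precomputed tables: one contiguous vector per
// field, instead of an array of per-point structs with interleaved fields.
struct TravelTablesSoA {
    float *time;        // [n_points] total travel time
    float *src_angle;   // [n_points] take-off angle at the source
    float *rcv_angle;   // [n_points] emergence angle at the receiver
};

__global__ void read_tables(TravelTablesSoA tab, float *out, int n_points)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_points) return;
    // Thread i touches element i of each vector: fully coalesced reads.
    out[i] = tab.time[i];                       // placeholder use of one field
}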

To go one step further, we add a procedure (shown in Figure 4) that compresses the beam stack data before it is transferred to GPU memory. The most time-consuming part of beam mapping on the GPU is looking up amplitudes in the beam stack according to the interpolated travel time. This produces an essentially random memory access pattern, which can dramatically degrade GPU performance, especially when the lookup spans a large memory region. We therefore compress the beam stack data in a GPU-friendly way in advance, load it into the read-only cache on the GPU, and decompress the data on the fly when it is needed. As the data become relatively small after compression and most of them can even reside in cache, this brings a further performance gain.
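
A sketch of routing the beam-stack lookups through the read-only data cache (the on-the-fly decompression is omitted here; const __restrict__ pointers, or an explicit __ldg(), let the scattered reads be served by the read-only cache path on Kepler-class GPUs):

// One thread per image point: the interpolated travel time selects an amplitude
// sample in the beam trace (an essentially random lookup), read via the read-only cache.
__global__ void map_beam_to_image(const float * __restrict__ beam_trace, // one beam trace, nt samples
                                  const float * __restrict__ ttable,     // [n_image] interpolated times
                                  float *image, int n_image, int nt, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_image) return;

    float t  = ttable[i];
    int   it = (int)(t / dt);
    if (it < 0 || it >= nt - 1) return;
    float w  = t / dt - it;

    // Scattered reads served through the read-only cache (__ldg on Kepler and later).
    float amp = (1.0f - w) * __ldg(&beam_trace[it]) + w * __ldg(&beam_trace[it + 1]);
    image[i] += amp;
}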

A HYBRID CPU-GPU DESIGN

Compared with the slant stack operation, the event picking in the data pre-processing step is an irregular operation that relies on dynamic data structures and has many configurable parameters. As a first choice, we keep the event picking part on the CPU host and parallelize its computation with multi-threading across the CPU cores.

However, since the different operations in the data pre-processing step are sequential, in a straightforward design the CPU part and the GPU part would have to wait for each other to finish before entering the next stage. To avoid idling either the CPU or the GPU, we further apply a pipeline design (shown in Figure ??) to achieve efficient utilization of both CPU and GPU computing resources. While the GPU is performing the slant stack on the (i+1)th beam gather, most of the CPU cores work on event picking for the freshly slant-stacked ith beam gather. Meanwhile, another pipeline stage loads the trace data of the (i+2)th beam gather, avoiding a potential I/O bottleneck in the distributed file system. In this way, we achieve an efficient and balanced utilization of both the CPU and GPU resources of our hybrid platform.
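
One iteration of this pipeline can be sketched as follows (the helpers load_gather, launch_slant_stack_async, num_candidates, and pick_event are hypothetical placeholders for the corresponding stages; the GPU launch returns immediately, so the I/O thread, the kernel, and the CPU picking loop all run concurrently):

// Three concurrent stages per iteration: prefetch gather i+2 on an I/O thread,
// slant stack gather i+1 on the GPU, and pick events of gather i on the CPU cores.
for (int i = 0; i < n_gathers; ++i) {
    std::thread io_thread;
    if (i + 2 < n_gathers)
        io_thread = std::thread([&] { load_gather(i + 2); });   // stage 3: disk prefetch

    if (i + 1 < n_gathers)
        launch_slant_stack_async(i + 1, stream);                // stage 2: GPU, non-blocking

    #pragma omp parallel for                                    // stage 1: CPU event picking
    for (int e = 0; e < num_candidates(i); ++e)
        pick_event(i, e);

    cudaStreamSynchronize(stream);                              // beams of gather i+1 are ready
    if (io_thread.joinable()) io_thread.join();                 // traces of gather i+2 are loaded
}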

Furthermore, to achieve the best system resource usage, a good load balance between the CPU and GPU sides is needed. According to our tests, when the average number of traces Nl in a local gather is less than 200, the event picking part cannot be completely hidden behind the slant stack computation on the GPU side. To achieve the best efficiency in such scenarios, we provide an option to move several loop passes of the event picking, those with relatively regular computation and data access patterns, onto the GPU platform.

Combining the above three pipeline stages with the stream technique used to overlap GPU computation and CPU-GPU communication, we obtain a fully pipelined design for the data pre-processing step. Under this design, the I/O time and the CPU-GPU communication time are well hidden, and the slant stack computation on the GPU is well overlapped with the event picking on the CPU, achieving a highly efficient and balanced utilization of the hybrid system.