This special issue includes a selection of papers from the SCCG 2004 conference, chaired by Prof. Alexander Pasko (Hosei University, Tokyo). The Spring Conference on Computer Graphics is the oldest annual computer graphics event in Central Europe, and its proceedings are later published by ACM SIGGRAPH. This is possible thanks to the fruitful combination of high-quality contributions and generous sponsoring from HP Invent Slovakia. The conference celebrated its 20th anniversary in 2004. More details can be found at www.sccg.sk.
|
Two competitions are organized during the conference: SCCG Best Papers and SCCG Best Presentations. They are based on evaluation by reviewers and public voting by SCCG participants. The awarding of the winners is part of the closing ceremony, and the diplomas with the logos of the sponsors are available at www.sccg.sk as well. As proposed by Alexander Pasko and accepted by the editor-in-chief, Prof. Victor V. Pilyugin, the winning papers are published in this special issue of CGG, a prominent online journal at http://elibrary.ru/cgg. The papers are slightly extended and rewritten, based on SCCG discussions and inspirations. After completing the selection, one can see that the unifying idea of all five awarded papers can be formulated as discovering clever trade-offs between modeling and rendering speed on the one hand and quality criteria on the other.

William Van Haevre et al. dealt with ray density estimation for plant growth simulation. In particular, they evaluated the varying indoor environment illumination while growing the plants using intensity-modified rules for L-systems. The novel approach results in a flexible and accurate algorithm that achieves more realistic vegetation. The paper won the 3rd Best Presentation Award.

Mario Sormann et al. focused on the solution of a complex task: creating models from image sequences as fast and as well as possible. VR modeler is a novel interactive monocular 3D modeling system with nicely separated intelligent 2D interaction and 3D reconstruction. Besides that, both coarse and detailed precision of urban models is supported for web presentation and other purposes. The results have already contributed to the Virtual Heart of Central Europe (www.vhce.info), a recent European cultural heritage project. The paper won the 3rd Best Paper Award.
|
Rui Rodrigues and Antonio Ramires Fernandes report on a prospective use of graphics cards. A significant part of 3D reconstruction, especially the epipolar geometry computations, can be transferred to the GPU. This new idea offers a remarkable gain of up to two orders of magnitude in computation time. The paper won the 2nd Best Presentation Award.
|
Ivan Viola et al. explored frequency domain volume rendering (FVR) because of its computational speed. By moving significant parts of the computation to the GPU, they report an acceleration by a factor of 17. This allows for highly interactive framerates with varying rendering quality. The quality depends on the interpolation scheme; the authors analyzed four of them to clarify the trade-off between performance and quality. The paper won the 2nd Best Paper Award.
|
Last but not least, Diego Gutierrez et al. contributed a SIGGRAPH-quality paper on global illumination for inhomogeneous media. In total, there are 10 different known light-object interactions, and rendering models are usually simplified to achieve faster solutions. The authors observed that light rays travel along curved paths when passing through inhomogeneous media, where the index of refraction is not constant. In addition, they took into account the way human perception deals with luminances. Altogether, phenomena like the sunset, the green flash, and bleaching are mastered to complete excellent research and a brilliant presentation. This is why only five papers are here: Diego clearly won in both competitions.
|
To conclude, I have to recall the following. In 2003, one year ago, this message from Alexander Pasko arrived at Budmerice Castle:
|
"Dear participants and organizers of SCCG, your conference provides a unique opportunity for young researchers to make their efforts visible in the world, especially for those who are not hypnotized by the visual quality of modern computer graphics works in modeling, rendering, and animation. We all know that such work still requires tedious manual labor hampered by erroneous models and algorithms. Let us hope that the next spiral of development will make our work in computer graphics closer to a joyful mind game."
|
I have to thank again Alexander and all the people who contributed to SCCG 2004 in the spirit of these beautiful and clever words.
|
Andrej Ferko, Comenius University, SK-842 48 Bratislava, Slovakia, www.sccg.sk/~ferko
GPU-based Frequency Domain Volume Rendering
Ivan Viola, Armin Kanitsar, and Meister Eduard Gröller,
Institute of Computer Graphics and Algorithms, Vienna University of Technology, Austria
Frequency domain volume rendering (FVR) is a volume rendering technique with lower computational complexity than other volume rendering techniques. In this paper the original FVR algorithm is significantly accelerated by performing the rendering-stage computations on the GPU. The overall hardware-accelerated pipeline is discussed and the changes with respect to previous work are pointed out. The three-dimensional transformation into the frequency domain is done in a pre-processing step.
In the rendering step, first the projection slice is extracted. The pre-computed frequency response of the three-dimensional data is stored as a 3D texture. Four different interpolation schemes for resampling the slice out of a 3D texture are presented. The resampled slice is then transformed back into the spatial domain using the inverse Fast Fourier or Fast Hartley Transform. The rendering step is implemented as a set of shader programs and is executed on programmable graphics hardware, achieving highly interactive framerates.
Keywords: Fourier Volume Rendering, Interpolation, Fourier Transform, Hartley Transform, Hardware Acceleration
Frequency domain volume rendering (also known as Fourier volume rendering (FVR) [22]) is a volume rendering technique based on a fundamentally different idea, i.e., the projection slice theorem [20]. The goal is to compute projections of the volumetric data (complexity $O(N^3)$ for a volume of resolution $N^3$). Projections in the spatial domain correspond to slicing in the frequency domain (complexity $O(M^2 \log M)$ for a reconstructed slice of resolution $M \times M$). Therefore slicing is used in frequency domain volume rendering to reduce the computational complexity of projections in the spatial domain.
Three-dimensional data are first transformed from the spatial domain into the frequency domain. This is done by using a three-dimensional Fourier or Hartley transform. The discrete one-dimensional forward and inverse Fourier transforms (DFT) are given by equations 1 and 2:

$F(u) = \sum_{x=0}^{N-1} f(x)\, e^{-2\pi i u x / N} \qquad (1)$

$f(x) = \frac{1}{N} \sum_{u=0}^{N-1} F(u)\, e^{2\pi i u x / N} \qquad (2)$
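For illustration, the following minimal NumPy sketch (our illustration, not the GPU implementation described in this paper; volume size and viewing axis are arbitrary choices) verifies the projection slice theorem: an axis-aligned spatial projection equals the inverse 2D transform of the central slice of the 3D spectrum.

import numpy as np

vol = np.random.default_rng(0).random((32, 32, 32))   # toy N^3 volume

# Spatial-domain projection: integrate along the viewing (z) axis.
proj_spatial = vol.sum(axis=2)

# Frequency-domain route: 3D DFT, extract the plane through the origin,
# then apply the inverse 2D DFT to that slice.
central_slice = np.fft.fftn(vol)[:, :, 0]
proj_freq = np.fft.ifft2(central_slice).real

print(np.allclose(proj_spatial, proj_freq))   # True

For a general viewing direction the slice plane no longer passes through grid samples, which is exactly why the interpolation schemes discussed later are needed.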
After the pre-processing step of complexity $O(N^3 \log N)$, the slice is resampled along a plane oriented perpendicular to the viewing direction and positioned at the origin of the frequency volume. Afterwards an inverse two-dimensional transform of the resulting slice is performed. The method thus has a theoretical computational complexity of $O(M^2 \log M)$, which is lower than that of other volume rendering methods. The exact computational complexity also depends on the chosen interpolation scheme.
The method, however, has certain limitations: only parallel projection is possible and hidden surface removal is not included. The reason is the nature of the function that FVR computes, which is an order-independent linear projection. This results in X-ray-like images. Currently ray-casting is considered the method that produces the best image quality. Ray-casting gains performance from early ray termination, displaying only some surfaces from the data, whereas FVR displays the entire volumetric data set. Because of its lower computational complexity, FVR will gain importance as the resolution of data sets increases.
As described earlier, the performance of FVR does not explicitly depend on the size of the input volume. The factor that directly influences the rendering performance is the number of samples contributing to the projection slice ($M^2$ in the previous text). The resolution of the projection slice should be high enough to prevent aliasing, which shows up as overlapping copies of the rendered data in the result image. Progressive refinement strategies can be realized by adjusting the projection slice resolution to achieve a desired performance. Also the resampling area can be reduced to the low frequencies around the origin only. This results in blurry preview images, but no frequency overlapping artifacts occur. Our implementation does not include a progressive refinement mode; highly interactive framerates are achieved even when slicing the entire data set with sufficient slice resolution.
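The overlapping artifact can be reproduced in a few lines. The following sketch (again our illustration, with an arbitrary $32^3$ toy volume) shows that undersampling the projection slice folds shifted copies of the projection on top of each other:

import numpy as np

vol = np.random.default_rng(5).random((32, 32, 32))
proj = vol.sum(axis=2)                          # reference projection

slice_full = np.fft.fftn(vol)[:, :, 0]          # adequate slice resolution
img = np.fft.ifft2(slice_full[::2, ::2]).real   # undersampled slice

# The undersampled result equals four shifted copies of the projection
# folded on top of each other (wrap-around overlap).
h = 16
folded = proj[:h, :h] + proj[h:, :h] + proj[:h, h:] + proj[h:, h:]
print(np.allclose(img, folded))                 # True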
This paper presents a mapping of the FVR algorithm to the GPU in order to significantly accelerate the rendering performance. An overall pipeline of hardware-accelerated frequency domain volume rendering is presented. The data set is transformed into the frequency domain in a pre-processing step. Then the projection slice is resampled using one of the following interpolation schemes: nearest neighbor interpolation, tri-linear interpolation, tri-cubic interpolation, or interpolation using a windowed sinc with a window of width four. In addition we demonstrate that current graphics hardware provides enough precision to perform FVR at high quality. Furthermore the GPU-based multi-dimensional Fast Hartley Transform [4,14] is presented as an alternative to the widespread Fourier transform. The rendering results are compared with respect to image quality as well as performance. The performance is compared to a software implementation using the highly optimized FFTW library [10].
Section 2 describes previous work related to FVR and its mapping to the GPU. The overall rendering pipeline is discussed in section 3: first the stage performed on the CPU is presented, followed by the on-the-fly rendering stage on the GPU. Slicing in the frequency domain is discussed in sub-section 3.1, and the subsequent inverse transform to the spatial domain in sub-section 3.2. Afterwards we show the results in section 4 and discuss future work and conclusions in sections 5 and 6.
Another approach that produces images similar to FVR is based on importance sampling and Monte Carlo integration [7]; the samples are thus not aligned on a regular grid. This technique overcomes the limitation of parallel projection, and its overall computational complexity is better than that of FVR.
A straightforward implementation of the Fourier transform is not suitable for high-performance FVR. The inverse two-dimensional transform must be computed at high speed to achieve interactive framerates. Therefore fast variants of the Fourier transform are used in FVR implementations. The original idea of the Fast Fourier Transform (FFT) was introduced by Cooley and Tukey [6]. Their algorithm decomposes the Discrete Fourier Transform (DFT) into $\log_2 N$ passes, where $N$ is the size of the input array. Each of these passes consists of $N/2$ butterfly computations. Each butterfly operation takes two complex numbers $a$ and $b$ and computes two numbers, $a + wb$ and $a - wb$, where $w$ is a complex number called the principal $N$-th root of unity [6]; $w$ corresponds to the exponential term from equations 1 and 2. Butterfly operations are based on an efficient reordering of intermediate results, which are used multiple times. After $\log_2 N$ passes the butterfly operations yield the transformed data. One of the fastest implementations available is the FFTW library [10].
The Fast Hartley Transform (FHT) was proposed by Bracewell [3] as an alternative to the FFT. The transform produces real output for real input, and it is its own inverse. Therefore the FHT is more efficient for FVR in terms of memory consumption. The one-dimensional forward and inverse Hartley transform is described by equation 3:

$H(u) = \frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} f(x)\, \mathrm{cas}\!\left(\frac{2\pi u x}{N}\right), \qquad \mathrm{cas}(x) = \cos(x) + \sin(x) \qquad (3)$
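A small NumPy sketch of equation 3 (our illustration, assuming the symmetric $1/\sqrt{N}$ normalization above) confirms the two properties exploited here, namely real output for real input and self-inverseness:

import numpy as np

def dht(f):
    """Discrete Hartley transform with the symmetric 1/sqrt(N) scaling."""
    N = len(f)
    arg = 2 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(arg) + np.sin(arg)) @ f / np.sqrt(N)

f = np.random.default_rng(1).random(8)
H = dht(f)                      # real output for real input
print(np.allclose(dht(H), f))   # True: the DHT is its own inverse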
Many approaches exist to exploit the capabilities of modern graphics accelerators for volume rendering. Fast processing and a large number of flexible features are the main reasons that make current graphics hardware attractive. Texture-based volume rendering using one 3D texture or a stack of 2D textures for volumetric data representation has gained considerable interest [5,27]. These techniques perform very fast. They, however, also compute a huge number of operations that do not contribute to the final image. A new approach was presented by Roettger et al. [28] and Krüger and Westermann [17]. They propose to use front-to-back ray-casting with early ray termination. The Z-buffer is used for opacity accumulation. An early Z-test rejects fragments when the accumulated opacity reaches a certain threshold. Besides the standard volume rendering approach based on transfer function specification, other rendering modes like MIP, contour enhancement, or tone shading have also been ported to GPU-based implementations [12].
The flexibility of the latest graphics hardware is also used for various other general purpose computations [11]. Moreland and Angel [24] have implemented a two-dimensional FFT running on NVidia GeForceFX GPUs [25]. Their implementation uses the Cg high-level shading language [23] and is based on the Decimation in Time algorithm [6]. Unfortunately this implementation performs slower than the software reference [10], which runs on a standard CPU. Another FFT implementation was done by Hart [8]. His implementation performs much faster than the previously mentioned one and runs on ATI GPUs [1].
Recently, algorithms for numerical simulation have started to exploit the processing power of current GPUs. Bolz et al. [2] implemented a sparse-matrix conjugate gradient solver and a regular-grid multi-grid solver. Similar work was presented by Krüger and Westermann [18]. Hillesland et al. [15] turned the nonlinear optimization for image-based modeling into a streaming process accelerated on the GPU.
An important aspect of FVR is interpolation, since the samples in the projection slice in general do not coincide with samples of the transformed input data. Current graphics hardware natively supports nearest neighbor interpolation for all texture formats. Linear interpolation is supported only for fixed-point formats. Unfortunately higher-order interpolation schemes are not natively supported at all. A general approach for GPU-based linear filtering was presented by Hadwiger et al. [13]. Their work can be applied to arbitrary filter kernels, gaining speed-ups from kernel properties such as symmetry and separability. The filter kernel is sampled at high resolution and stored as a texture. A particular sample contributing to the new resampling point is convolved with a kernel tile. Their framework implements various higher-order reconstruction filters such as cubic B-spline filters, Catmull-Rom spline filters, or windowed sinc filters. As Malzbender [22] has shown, careful filter design for reconstruction in the frequency domain is crucial for good image quality.
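The following 1D sketch illustrates the idea of this texture-based filtering under stated assumptions (a cubic B-spline kernel of width four and a table resolution of 64 samples per unit are our choices, not values from [13]): the kernel is pre-sampled into a lookup table, the CPU analogue of a kernel texture, and each output value is a weighted sum of the contributing input samples.

import numpy as np

def bspline3(t):
    """Cubic B-spline kernel, support [-2, 2]."""
    t = abs(t)
    if t < 1:
        return (4 - 6 * t**2 + 3 * t**3) / 6
    if t < 2:
        return (2 - t)**3 / 6
    return 0.0

TABLE_RES = 64                                  # table samples per kernel unit
kernel_table = np.array([bspline3(-2 + i / TABLE_RES)
                         for i in range(4 * TABLE_RES + 1)])

def resample(signal, x):
    """Evaluate the filtered signal at the fractional position x."""
    base = int(np.floor(x))
    value = 0.0
    for k in range(base - 1, base + 3):         # 4 contributing samples
        if 0 <= k < len(signal):
            idx = int(round((k - x + 2) * TABLE_RES))   # "texture fetch"
            value += signal[k] * kernel_table[idx]
    return value

sig = np.sin(np.linspace(0, np.pi, 16))
print(resample(sig, 7.3))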
After the data upload the algorithm proceeds to the rendering stage. This stage can be divided into two parts, i.e., slicing in the frequency domain and the inverse Fourier transform. Slicing refers to resampling the projection slice from the 3D frequency data. Transforming the resampled frequency slice back to the spatial domain yields the projection of accumulated intensities. These two parts are now discussed in more detail. The FVR pipeline is sketched in figure 1.
To avoid aliasing, the resolution of the rendering target must meet the Nyquist criterion. Also the reconstruction quality of the projection slice strongly influences the final rendering quality. Low resampling quality introduces artifacts like ghosts or frequencies that are not present in the original data. To avoid this, it is necessary to use higher-order interpolation schemes, or at least tri-linear interpolation. The hardware does not natively support higher-order interpolation schemes, and floating-point textures can be fetched only using nearest neighbor interpolation. We therefore additionally support three other types of interpolation. The first custom filter type is tri-linear interpolation. The second interpolation scheme is tri-cubic interpolation using cardinal or BC-splines. The last scheme is a windowed sinc of width four. Tri-cubic interpolation as well as windowed sinc interpolation are based on the texture-based approach of Hadwiger et al. [13]. The following subsections describe in more detail the mapping of these custom filters onto the GPU. After describing the interpolation schemes, various hardware implementation issues of slicing are presented.
The problem is how to determine the blending factors for all three directions. Although they are not explicitly given, they can be retrieved from the three-dimensional texture coordinates of the resampling point. These describe the offset from a corner of the 3D texture in the range $[0,1]$ for each coordinate; the coordinates of the opposite corner are $(1,1,1)$. Multiplying the texture coordinates of a given resampling point by the original texture resolution yields coordinates in which the distance between two texels is equal to $1$. The fractional parts of these new coordinates are the blending factors we are looking for. The blending factor estimation is illustrated in figure 3.
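In code, the blending factor estimation amounts to one multiplication and taking the fractional parts; the resolution and coordinates below are arbitrary illustrative values:

import numpy as np

tex_resolution = np.array([64, 64, 64])    # volume (3D texture) size
texcoord = np.array([0.37, 0.52, 0.81])    # resampling point in [0,1]^3

scaled = texcoord * tex_resolution         # texel-space coordinates
blend = scaled - np.floor(scaled)          # fractional parts
print(blend)   # tri-linear blending factors for x, y, and z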
The filtering is divided into four passes, where each pass computes the contribution of sixteen neighboring samples. These intermediate results are summed together, resulting in the value of the resampling point. The straightforward way to sum up the intermediate results would be to set the blending operation to addition. Unfortunately current hardware does not support blending with floating-point precision. The summation is therefore done after the four intermediate passes, in a separate pass in which the four sub-results are added together. After this blending pass the filtered slice is stored in a texture, ready to be processed by the inverse FFT.
Currently only ATI GPUs support floating-point 3D textures, so our implementation runs on this type of card. Supporting available NVidia hardware would require storing the floating-point value of every sample in a four-channel fixed-point 3D texture. The slicing part would then include an additional conversion in order to reconstruct the original floating-point value. This must be done before interpolation for every sample contributing to the resampling point on the projection slice. The rest of the pipeline remains unchanged. Additionally, the implementation of texture-based higher-order interpolation is divided into four passes. This is due to limitations of the fragment program length on ATI hardware. NVidia hardware supports much longer fragment programs, i.e., the multi-pass approach could be folded into a single pass.
The multi-pass approach discussed before, as well as FVR in general, renders intermediate images that are used as input in subsequent passes. This is done by changing the rendering target to a texture and vice versa. The rendering target is any kind of buffer that can be bound as a 2D texture, which in the current OpenGL specification is an invisible rendering target called a Pbuffer.
Both transforms, i.e., forward and inverse, can exploit the separability property in the multi-dimensional case. In the case of a two-dimensional array, first the rows are transformed as a stack of independent one-dimensional arrays. These frequency arrays are then transformed column-wise, where each column is also handled separately. In the case of a three-dimensional transform, the third dimension is handled analogously.
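A short NumPy check of this separability (illustrative only, not part of the GPU pipeline):

import numpy as np

a = np.random.default_rng(2).random((8, 8))
rows = np.fft.fft(a, axis=1)               # 1D transforms of all rows
both = np.fft.fft(rows, axis=0)            # then 1D transforms of all columns
print(np.allclose(both, np.fft.fft2(a)))   # True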
The two-dimensional Fourier transform is split into two almost identical parts. Each of these parts first reorders the input data, which is also called scrambling. This pass prepares the data for the $\log_2 N$ butterfly passes, where $N$ is the number of columns or rows, respectively. Scrambling means swapping two samples, where one sample is at position $x$ and the other sample is at position $y$, and $y$ is the bit-order reverse of $x$. For example a sample at position $x=4$ (bit pattern 100) is exchanged with the sample at position $y=1$ (bit pattern 001). This reverse-order function can be efficiently realized using a pre-computed scramble lookup texture.
The scramble pass is followed by $\log_2 N$ butterfly passes. The butterfly passes are performed ping-pong-wise: one buffer is bound as the source 2D texture, another one as the rendering target. In the next pass the first buffer is bound as the rendering target and the second buffer (the rendering target from the previous pass) is bound as the source texture. Each butterfly pass first performs a fetch into a lookup texture, which determines the addresses of the two samples $a$ and $b$ that contribute to the currently processed fragment. The value of the principal $N$-th root of unity $w$ is also contained in this fetch. Afterwards, a complex multiplication and two additions are performed as described in section 2.
After performing one scramble pass and $\log_2 N$ butterfly passes in one dimension, the algorithm performs the same operations analogously in the other dimension. In the case of the inverse transform the values are finally scaled by the factor $\frac{1}{N}$ (see equation 2) in the last pass and the output is written to the framebuffer. The process of a two-dimensional FFT is illustrated in figure 5. The interested reader is referred to Hart's implementation [8] for further details.
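The pass structure can be mimicked on the CPU. The sketch below is our illustration, not the shader code: NumPy arrays stand in for the ping-pong buffers, and the scramble and butterfly lookup textures are replaced by direct computation. It performs one scramble pass followed by $\log_2 N$ butterfly passes for a 1D forward DFT.

import numpy as np

def fft_passes(f):
    N = len(f)                         # N must be a power of two
    bits = N.bit_length() - 1

    # Scramble pass: move every sample to its bit-reversed position
    # (the GPU reads this mapping from a pre-computed lookup texture).
    scramble = [int(format(i, f'0{bits}b')[::-1], 2) for i in range(N)]
    src = np.asarray(f, dtype=complex)[scramble]

    dst = np.empty_like(src)           # ping-pong partner buffer
    half = 1
    for _ in range(bits):              # log2(N) butterfly passes
        for i in range(0, N, 2 * half):
            for j in range(half):
                w = np.exp(-2j * np.pi * j / (2 * half))  # root of unity
                a, b = src[i + j], src[i + j + half]
                dst[i + j] = a + w * b                    # butterfly
                dst[i + j + half] = a - w * b
        src, dst = dst, src            # swap the ping-pong buffers
        half *= 2
    return src

x = np.random.default_rng(3).random(16)
print(np.allclose(fft_passes(x), np.fft.fft(x)))   # True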
To reduce the memory requirements by one half, the Hartley transform can be used instead of the Fourier transform. The GPU-based implementation of the Fast Hartley Transform (FHT) is quite similar to the FFT. The scrambling operation is done in exactly the same way as in the case of the FFT. The difference is that instead of butterfly passes, so-called double butterfly passes have to be performed. In the case of the FFT the butterfly texture stores the complex number $w$, the addresses of the values contributing to the butterfly, and a sign that determines which butterfly result should be written to the current fragment. The FHT needs to store five values, therefore it is necessary to have two lookup textures instead of a single one. One possibility is to store the three addresses of the samples that contribute to the final butterfly result in one three-channel (RGB) texture. Precomputed weights of the $\cos$ and $\sin$ terms are stored in another two-channel (LUMINANCE_ALPHA) texture. The $\cos$ and $\sin$ terms are then multiplied with the corresponding input values and summed together. More details on the FHT can be found in the referenced literature [3]. In the multi-dimensional case, a last pass corrects the product of one-dimensional transforms for each direction to the multi-dimensional transform. The shader performs four texture fetches, sums them together, and performs the correction according to equation 4, given here for the two-dimensional case:

$H(u,v) = \frac{1}{2}\left[\,T(u,v) + T(-u,v) + T(u,-v) - T(-u,-v)\,\right] \qquad (4)$

where $T$ is the separable product of one-dimensional transforms and negative indices wrap around modulo the respective resolution.
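The correction pass can be verified numerically. The following sketch assumes the two-dimensional form of equation 4 given above, uses an unnormalized cas transform, and lets np.roll provide the modulo wrap-around of the four fetches; the reference 2D Hartley transform is obtained from the FFT via $H = \mathrm{Re}(F) - \mathrm{Im}(F)$.

import numpy as np

def dht1(f, axis):
    """Unnormalized 1D cas transform along the given axis."""
    N = f.shape[axis]
    x = np.arange(N)
    arg = 2 * np.pi * np.outer(x, x) / N
    cas = np.cos(arg) + np.sin(arg)
    return np.moveaxis(cas @ np.moveaxis(f, axis, 0), 0, axis)

a = np.random.default_rng(4).random((8, 8))
T = dht1(dht1(a, 0), 1)                  # separable row-column product

# Correction pass: combine the four symmetric "fetches" of equation 4.
Tm0 = np.roll(T[::-1, :], 1, axis=0)     # T(-u,  v), wrap-around indices
T0m = np.roll(T[:, ::-1], 1, axis=1)     # T( u, -v)
Tmm = np.roll(np.roll(T[::-1, ::-1], 1, axis=0), 1, axis=1)   # T(-u, -v)
H = 0.5 * (T + Tm0 + T0m - Tmm)

F = np.fft.fft2(a)                       # reference via the FFT
print(np.allclose(H, F.real - F.imag))   # True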
Some images exhibit a so-called vignetting artifact, i.e., the central values of the final image are overemphasized. Here the shape of the frequency response of the used interpolation filter comes into play. This artifact can be suppressed by spatial premultiplication [22] of the original volume with the reciprocal of the interpolation filter's frequency response. Our implementation does not contain this operation, because spatial premultiplication on the other hand increases the effect of ghosting; it could, however, be directly integrated into the pre-processing stage. The vignetting artifact is noticeable in figure 7. It might be slightly disturbing in static images, but interactive manipulation like rotation overcomes this problem.
The quality of the image is also strongly influenced by the resolution of the projection slice, as shown in figure 8. The resolution of the projection slice should be at least as high as the lateral resolution of the data set; if this condition is not fulfilled, overlapping artifacts appear.
In the case of higher-order filter kernels, which in our implementation are filters of width four, the resolution of the discretized kernel also influences the final image quality. As mentioned earlier, only a one-dimensional kernel representation is stored in a texture and the three-dimensional filter kernel is computed on the fly in the rendering stage. Additionally, each kernel is divided into four tiles. Figure 9 shows the results of cubic interpolation using a cubic B-spline, where each tile of the one-dimensional kernel is stored with 1, 4, and 16 samples respectively. An increasing number of samples per tile clearly improves the image quality.
The software FVR implementation using the highly optimized FFTW library [10] was run on an AMD Athlon XP 2200+ processor with 1.0 GB of RAM (DDR, 133 MHz, CL2) and a VIA Apollo KT333 chipset. The software implementation uses tri-linear interpolation and the same projection slice resolution and test data set as the GPU version. The performance of slicing was 17 frames per second (fps); wrap-around reordering runs at 45 fps and the inverse 2D transform at 26 fps. Note that wrap-around reordering does not take additional time in the GPU implementation. Using the algorithm mapped to the GPU, a speed-up factor of approximately 17 is achieved.
Future work also includes a mapping to other platforms, e.g., NVidia GPUs, as well as a high-level shading implementation using Cg [23] or the OpenGL Shading Language [26].
The performance of frequency domain volume rendering does not explicitly depend on the data set resolution. It depends on the number of resampling points, which is given by the resolution of the projection slice. The data set resolution influences the texture cache efficiency, i.e., the higher the resolution, the higher the number of cache misses. This can lead to slight differences in rendering performance, typically around 2 fps for a fixed projection slice resolution.
The authors would like to thank Martin Artner and Markus Hadwiger from the VRVis Research Center for fruitful discussions on hardware-based filtering, Alois Dornhofer for the software implementation, and Jason Mitchell, Evan Hart, and especially Mark Segal from the ATI corporation for providing a Radeon 9800 XT and for general support.