Large distributed datasets visualization software, progress and opportunities

M. V. Iakobovski, I. A. Nesterov, P. S. Krinov
Institute for Mathematical Modelling, Russian Academy of Sciences, Russia
www.imamod.ru

This work is supported by the Russian Foundation for Basic Research (grant No. 05-01-00750)


Abstract:

This article addresses interactive scientific visualization of large-scale datasets. It gives a brief introduction to widely used visualization software packages for handling large distributed data and outlines their functionality and usability when working with datasets that cannot be handled and visualized on a single PC. It also illustrates the use of parallel algorithms for isosurface rendering and massless particle tracing in the visualization system RemoteView, installed on the computing cluster of the Institute for Mathematical Modelling, RAS.

1. Introduction

Modern scientific research is closely connected with high performance supercomputing. Numerical simulations require accurate mesh generators, appropriate mesh decomposition, complex numerical algorithms and post-processing. One topic stands somewhat apart: scientific visualization. In most cases, visualizing and interacting with simulation results is not an obvious or easy task. Moreover, each particular simulation can demand custom visualization, so the visualization package for a particular simulation can be regarded as custom software. That is why, when reviewing existing visualization software, its flexibility and general visualization abilities must be at the centre of attention. Because numerical simulations are performed on highly parallel clusters and supercomputers, the results and intermediate datasets can be distributed over many individual computing nodes. The ability to process large distributed datasets and the scalability of the software are therefore very important. This paper offers a brief introduction to popular visualization software packages that can handle large distributed data, and also covers the authors' current work: the visualization package RemoteView, which is installed on the local supercomputer cluster of the Institute for Mathematical Modelling (IMM RAS).

2. Software review

There are several multiplatform software packages (for Unix and Windows) for interactive parallel visualization and graphical analysis:

VisIt (http://www.llnl.gov/visit/) - free interactive parallel graphical tool for viewing scientific data, developed by the Department of Energy (DOE) Advanced Simulation and Computing Initiative (ASCI) to visualize and analyze the results of terascale simulations. VisIt is an open source project and is freely available under the BSD license.
ParaView (http://www.paraview.org) - open source parallel visualization application created by Kitware in conjunction with Jim Ahrens of the Advanced Computing Laboratory at Los Alamos National Laboratory (LANL). Contributors and developers of ParaView currently include: Kitware, LANL, Sandia National Laboratories, and Army Research Laboratory.
EnSight (http://www.ensight.com/product-overview.html) - commercial tools for cluster-based rendering, parallel processing and visualization (EnSight DR, EnSight Gold and others), developed by Computational Engineering International (CEI), which was founded by former employees of Cray Research and spun off in early 1994.

The general visualization functionality of the two freely available products, VisIt and ParaView, is largely the same because both rely on the open-source Visualization Toolkit (VTK). VTK is an object-oriented library for visualizing 3D data. It supports a wide variety of visualization algorithms, including scalar, vector, tensor, texture, and volumetric methods, as well as advanced modeling techniques such as implicit modeling, polygon reduction, mesh smoothing, cutting, contouring, and Delaunay triangulation. In addition, dozens of imaging algorithms have been directly integrated to allow the user to mix 2D imaging and 3D graphics algorithms and data. VisIt uses VTK for all its visualization functionality. ParaView has extra VTK extensions, which allow the software to be built from source with integrated MPI support for advanced functionality on parallel clusters. Some examples of different plots obtained with VTK visualization algorithms are shown below.


Figure panels: volume visualization (VisIt); vector plots (VisIt); unstructured mesh handling with streamlines (VisIt); multiplot visualization (VisIt)

Make data processing and visualization remote and parallel

All these software packages support distributed client-server operation. This means that all the data from a numerical simulation can be handled and processed remotely, without transferring whole datasets to the client workstation; such a transfer may in any case be impossible because of the enormous bandwidth needed to complete it in reasonable time. All the vendors offer independent GUIs that can connect to remote servers. The packages differ in how the required plots and data samples are transmitted to the client: either a remotely generated bitmap or a compressed set of graphical primitives can be sent, the latter for efficient local rendering on the client side.

Distributed client-server operation does not exhaust the opportunities for effective visualization that a high performance parallel system and modern software can offer. Geometric parallelism, which makes it possible to process large data, is only the most evident of them. The best efficiency and scalability can be achieved only when most computer graphics algorithms have parallel counterparts that are used routinely together with distributed data processing. This calls for parallel servers (as in ParaView and RemoteView) or a Server of Servers (SoS) approach (as in EnSight DR).

VisIt has the simplest approach to parallel data processing, based on host profiles. A host profile covers launching the parallel compute engine, setting the number of processors and hosts, time limits for parallel tasks, and the preferred load balancing. VisIt (after version 1.5.1) supports both hardware-accelerated and software scalable off-screen rendering for clusters whose compute nodes have graphics hardware. During scalable rendering, graphics primitives are drawn by an off-screen renderer on each parallel processor, and the intermediate images are then composited into the final image that VisIt's viewer displays. Unfortunately, this built-in functionality runs into the most difficulties on a working cluster that already has its own task pool, security handling and so on.

ParaView demonstrates a more general approach. It runs in parallel on distributed and shared memory systems as well as on a single-processor system. ParaView uses VTK as its data processing and rendering engine but adds built-in parallel processing capabilities. When compiled with MPI support, the server part of ParaView can be launched as an ordinary MPI task. This allows straightforward integration with an existing task management system used in cluster administration. ParaView supports distributed rendering (where the results are rendered on each node and composited later using the depth buffer), local rendering (where the resulting polygons are collected on one node and rendered locally) and a combination of both. For example, the level-of-detail models can be rendered locally whereas the full model is rendered in a distributed manner. This provides scalable rendering for large data without sacrificing performance when working with smaller data.
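The depth-buffer compositing step mentioned above can be sketched in a few lines. This is only an illustration of the general idea, not ParaView's implementation: the image layout (a list of rows of `(depth, colour)` pairs), the names `composite`, `node_a` and `node_b`, and the tiny 1x3 images are all assumptions made for the example.

```python
from typing import List, Tuple

Pixel = Tuple[float, Tuple[int, int, int]]   # (depth, RGB colour); assumed layout
Image = List[List[Pixel]]

FAR = float("inf")          # depth of an empty (background) pixel
BACKGROUND = (0, 0, 0)


def composite(images: List[Image]) -> Image:
    """Merge per-node images, keeping for each pixel the fragment nearest to the viewer."""
    height, width = len(images[0]), len(images[0][0])
    result = [[(FAR, BACKGROUND) for _ in range(width)] for _ in range(height)]
    for image in images:
        for y in range(height):
            for x in range(width):
                if image[y][x][0] < result[y][x][0]:   # smaller depth wins
                    result[y][x] = image[y][x]
    return result


# Two hypothetical nodes, each rendering its own part of the data into a 1x3 image.
node_a = [[(2.0, (255, 0, 0)), (FAR, BACKGROUND), (5.0, (255, 0, 0))]]
node_b = [[(3.0, (0, 0, 255)), (1.0, (0, 0, 255)), (4.0, (0, 0, 255))]]
final = composite([node_a, node_b])
```

Because the per-pixel minimum is associative, the merge can be done pairwise in a tree over the nodes, which is what makes this kind of compositing attractive on a cluster.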

ParaView handles structured (uniform rectilinear, non-uniform rectilinear, and curvilinear grids), unstructured, polygonal and image data. It supports a variety of file formats including:

legacy VTK (all types including parallel, ASCII and binary; can be read and written)
VTK XML-based file formats (all types including parallel, ASCII and binary; can be read and written)
EnSight 6 and EnSight Gold (all types including parallel, ASCII and binary; multiple parts are supported, each part is loaded separately and can be processed individually) (read only)
Plot3D (ASCII and binary, C or Fortran; multiple blocks are supported, each block is loaded separately and can be processed individually; I blanking is currently partially supported) (read only)
Various polygonal file formats including STL and BYU (by default read only; other VTK writers can be added by writing an XML description)

ParaView also supports parallel data processing. It uses the data parallel model in which the data is broken into pieces to be processed by different processes. Most of the visualization algorithms function without any change when running in parallel. ParaView also supports ghost levels used to produce piece invariant results. Ghost levels are points/cells shared between processes and are used by algorithms which require neighborhood information.
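The role of ghost levels can be illustrated with a small one-dimensional sketch. It is not ParaView code: the three-point smoothing filter, the function names and the sample data are assumptions chosen to show why an algorithm that needs neighbourhood information gives piece-invariant results only when each piece is padded with cells from its neighbours.

```python
from typing import List


def smooth(values: List[float]) -> List[float]:
    """Three-point average; end points use only the neighbours that exist."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def smooth_in_pieces(values: List[float], pieces: int, ghost: int) -> List[float]:
    """Process each piece independently, padded with `ghost` cells from its neighbours."""
    n = len(values)
    result: List[float] = []
    for p in range(pieces):
        start, stop = p * n // pieces, (p + 1) * n // pieces    # cells owned by piece p
        lo, hi = max(0, start - ghost), min(n, stop + ghost)    # owned cells plus ghosts
        local = smooth(values[lo:hi])                           # what one process computes
        result.extend(local[start - lo: stop - lo])             # keep owned cells only
    return result


data = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
serial = smooth(data)
with_ghosts = smooth_in_pieces(data, pieces=4, ghost=1)
without_ghosts = smooth_in_pieces(data, pieces=4, ghost=0)
```

With one ghost level the piecewise result equals the serial one; with none, the values at piece boundaries differ, which would show up as visible seams in a plot.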

ParaView maintains interactive frame rates even when working with large data through the use of level-of-detail (LOD) models. The user sets the threshold (number of points) beyond which a reduced version of the model is displayed during interaction. (The size of the reduced model can also be adjusted.) Once the interaction is over, the full model is rendered.
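The threshold logic described above can be sketched as follows. The strided subsampling used here is only a stand-in for whatever reduction ParaView actually applies, and the function and variable names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def points_to_draw(points: List[Point], threshold: int, interacting: bool) -> List[Point]:
    """Return the full model, or a strided subsample while the user is interacting."""
    if not interacting or len(points) <= threshold:
        return points
    stride = -(-len(points) // threshold)    # ceiling division keeps the result <= threshold
    return points[::stride]


model = [(float(i), 0.0, 0.0) for i in range(10_000)]
during_drag = points_to_draw(model, threshold=1_000, interacting=True)
after_drag = points_to_draw(model, threshold=1_000, interacting=False)
```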

3. RemoteView

3.1 Client-Server architecture

Large data puts a great strain on computer networks, and transferring the dataset under consideration to the end user is often simply impossible. Access to a supercomputer system is frequently provided through a local area network or a slow wide area network. For example, high-accuracy three-dimensional computations in gas dynamics and other problem domains demand detailed grids, and as a result the files containing the data are very large. We assume that the accuracy of calculations (and hence the size of the files) is limited by the power of the system. The computing power of systems now grows very quickly: the number of processors in multiprocessor systems increases, so appreciable growth in the volume of computed results should be expected. Our work is focused on processing hundreds of gigabytes of data, which requires new methods of information processing. Such a data volume cannot be held in the main memory of a personal computer, nor even on its hard drive, and the throughput of any network is limited. These restrictions prevent the use of traditional applications, since transmitting the data needed for analysis and visualization over a network leaves no opportunity for interactive inspection of results.
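A rough estimate makes the bandwidth argument concrete. The dataset size of 100 GB and the link speeds below are illustrative assumptions, not measurements from the paper, and the formula ignores protocol overhead and contention, so real transfers would be slower.

```python
def transfer_hours(size_gigabytes: float, link_megabits_per_s: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    bits = size_gigabytes * 8e9                      # 1 GB taken as 10^9 bytes
    seconds = bits / (link_megabits_per_s * 1e6)
    return seconds / 3600.0


for link in (10.0, 100.0, 1000.0):                   # assumed link speeds, Mbit/s
    print(f"{link:6.0f} Mbit/s: {transfer_hours(100.0, link):6.1f} h")
```

Even on an ideal 100 Mbit/s link a single 100 GB dataset takes more than two hours to move, which rules out interactive work on the raw data.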

The large data visualization system RemoteView is developed to meet scientific needs and to provide an alternative visualization tool that serves local and remote users of supercomputer centers. The visualization program is divided into two parts: a parallel server and a client.

This division allows the principal part of the visualization process to be carried out on the supercomputer, transferring to the user's workplace only the minimum of information required directly for constructing the prepared image. With this approach the image is finally formed at the user's workplace, which makes it possible to use modern multimedia hardware (helmets, stereo glasses, multi-dimensional manipulators etc.) for a clearer presentation of visual information.

A number of algorithms for visualization of large scalar data given on a three-dimensional grid are considered here. The method of isosurfaces is chosen from the existing variety of methods for visualizing such data. Isosurfaces are the surfaces on which the function under consideration (temperature, density, concentration etc.) takes a fixed value. Thus, several surfaces (with function values specified by the user) are extracted from the three-dimensional scalar field and displayed on the screen.
Within this approach, each isosurface is described by a triangular grid. The sizes of these grids are in some cases comparable to the size of the initial three-dimensional grids, so they cannot be transferred through a computer network (or processed) quickly enough for an interactive mode. Moreover, they cannot be held in the main memory of a single processor unit. Thus, the central problem under study is the compression of triangulated surfaces represented as unstructured triangular grids. Compression is necessary both for fast transfer through a global or local network and for direct display at the user's workplace.
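To show how an isosurface turns into a triangular grid, here is a minimal sketch of isosurface extraction inside a single tetrahedral cell (the marching tetrahedra idea). It is a generic textbook technique, not the algorithm used in RemoteView; the function name and the example cell are assumptions, and the triangles are not consistently oriented.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]
Triangle = Tuple[Vec, Vec, Vec]


def tetra_isosurface(verts: Sequence[Vec], vals: Sequence[float], iso: float) -> List[Triangle]:
    """Triangles approximating {f = iso} inside one tetrahedron (0, 1 or 2 triangles)."""
    below = [i for i in range(4) if vals[i] < iso]
    above = [i for i in range(4) if vals[i] >= iso]
    if not below or not above:
        return []                         # the surface does not cross this cell

    def cut(i: int, j: int) -> Vec:
        """Point on edge (i, j) where the linearly interpolated field equals iso."""
        t = (iso - vals[i]) / (vals[j] - vals[i])
        return tuple(a + t * (b - a) for a, b in zip(verts[i], verts[j]))

    if len(below) == 1:                   # one vertex separated: a single triangle
        a, b, c = (cut(below[0], j) for j in above)
        return [(a, b, c)]
    if len(above) == 1:
        a, b, c = (cut(above[0], j) for j in below)
        return [(a, b, c)]
    # Two vertices on each side: the cut is a quadrilateral, split into two triangles.
    b0, b1 = below
    a0, a1 = above
    q = [cut(b0, a0), cut(b0, a1), cut(b1, a1), cut(b1, a0)]
    return [(q[0], q[1], q[2]), (q[0], q[2], q[3])]


# Example cell: unit tetrahedron with the field f(x, y, z) = x + y + z at its vertices.
unit_tet = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
surface = tetra_isosurface(unit_tet, [0.0, 1.0, 1.0, 1.0], iso=0.5)
```

Every crossed cell contributes one or two triangles, which is why the triangle count of an isosurface can approach the size of the original mesh.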

A visualization package based on RemoteView for parallel analysis of structured and unstructured triangular and rectangular three-dimensional grids has been developed and installed on the local supercomputer cluster of the Institute for Mathematical Modelling, which took 29th place in the Top50 list of supercomputers of the Commonwealth of Independent States (CIS) (http://supercomputers.ru/) in September 2005.

3.2 Mesh decomposition

RemoteView is developed to visualize 3D scalar fields defined on regular and irregular rectangular or tetrahedral meshes. It supports efficient parallel handling of distributed data based on a mesh decomposition in which the processed mesh is composed of a set of micro domains. The micro domain decomposition holds additional information about the global mesh topology and the relative positions of micro domain boundaries. A special I/O library loads the mesh geometry of an individual micro domain together with the corresponding fragment of a given scalar field, which allows parallel distributed data processing. An example of mesh decomposition for a viscous flow field simulation on a multiprocessor system is shown below.
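The idea of a micro domain decomposition that also records neighbourhood information can be sketched for the simplest case, a regular grid cut into boxes. The `MicroDomain` record, the `decompose` and `assign` functions, and the block sizes are hypothetical; RemoteView's actual data structures and I/O library are not described in this paper.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple

Index = Tuple[int, int, int]


@dataclass
class MicroDomain:
    ident: int
    lo: Index                              # first cell along each axis (inclusive)
    hi: Index                              # last cell along each axis (exclusive)
    neighbours: List[int] = field(default_factory=list)   # face-adjacent micro domains


def decompose(shape: Index, block: Index) -> List[MicroDomain]:
    """Cut a regular grid of `shape` cells into boxes of at most `block` cells per axis."""
    counts = tuple(-(-s // b) for s, b in zip(shape, block))   # boxes per axis
    ids: Dict[Index, int] = {}
    domains: List[MicroDomain] = []
    for ijk in product(*(range(c) for c in counts)):
        lo = tuple(i * b for i, b in zip(ijk, block))
        hi = tuple(min(l + b, s) for l, b, s in zip(lo, block, shape))
        ids[ijk] = len(domains)
        domains.append(MicroDomain(len(domains), lo, hi))
    for ijk, ident in ids.items():         # record which boxes share a face
        for axis in range(3):
            for step in (-1, 1):
                other = tuple(v + step if a == axis else v for a, v in enumerate(ijk))
                if other in ids:
                    domains[ident].neighbours.append(ids[other])
    return domains


def assign(domains: List[MicroDomain], processes: int) -> Dict[int, List[int]]:
    """Give each process a contiguous run of micro domain ids."""
    n = len(domains)
    return {p: list(range(p * n // processes, (p + 1) * n // processes))
            for p in range(processes)}


domains = decompose(shape=(8, 8, 4), block=(4, 4, 4))
owners = assign(domains, processes=2)
```

Using more micro domains than processes leaves room to rebalance the load by moving whole micro domains between processes.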

micro domain mesh decomposition

Data coarsening and visual perception preservation

Since the number of geometric primitives describing the required isosurface can in the worst case be as large as the input data, subsequent isosurface compression must be specially considered. When visualizing large datasets, compression is always lossy, not only because of the limited bandwidth of communication channels but also because of the limited resolution of modern monitors and projectors. RemoteView uses special algorithms to reduce the number of primitives that must be transmitted to the client side prior to final rendering. The key features of the algorithms are controlled accuracy loss during compression and forced reduction of the data to a preset compressed output size. Special attention is paid to topology preservation during compression, so that the visual representation is the same as if the uncompressed set of polygons were rendered on the client side. Below are some examples of boundary surface compression with different compression ratios.
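Forced reduction to a preset output size can be illustrated with a deliberately simple stand-in technique, vertex clustering on a uniform grid. This is not the RemoteView algorithm: unlike the method described above, vertex clustering does not guarantee topology preservation, and the names `cluster`, `compress_to_budget` and `grid_mesh` as well as the starting cell size are assumptions. Each merged vertex moves by at most the cell diagonal, so the cell size bounds the geometric error of one pass.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float, float]
Tri = Tuple[int, int, int]


def cluster(vertices: List[Vec], triangles: List[Tri], cell: float) -> Tuple[List[Vec], List[Tri]]:
    """Merge all vertices that fall into the same cubic cell of edge length `cell`."""
    origin = tuple(min(v[a] for v in vertices) for a in range(3))
    index: Dict[Tuple[int, int, int], int] = {}
    members: List[List[Vec]] = []
    remap: List[int] = []
    for v in vertices:
        key = tuple(int((v[a] - origin[a]) / cell) for a in range(3))
        new = index.setdefault(key, len(members))
        if new == len(members):            # first vertex seen in this cell
            members.append([])
        members[new].append(v)
        remap.append(new)
    merged = [tuple(sum(m[a] for m in group) / len(group) for a in range(3))
              for group in members]        # representative = centroid of the cell's vertices
    kept = set()
    for tri in triangles:
        a, b, c = (remap[i] for i in tri)
        if a != b and b != c and a != c:   # drop triangles that collapsed
            kept.add(min((a, b, c), (b, c, a), (c, a, b)))   # same orientation, unique form
    return merged, sorted(kept)


def compress_to_budget(vertices: List[Vec], triangles: List[Tri],
                       max_triangles: int, cell: float) -> Tuple[List[Vec], List[Tri]]:
    """Double the cell size until the triangle count fits the preset budget."""
    out_v, out_t = list(vertices), list(triangles)
    while len(out_t) > max_triangles:
        out_v, out_t = cluster(vertices, triangles, cell)
        cell *= 2.0
    return out_v, out_t


def grid_mesh(n: int) -> Tuple[List[Vec], List[Tri]]:
    """Flat n x n test surface on the unit square, two triangles per grid square."""
    vertices = [(x / (n - 1), y / (n - 1), 0.0) for y in range(n) for x in range(n)]
    triangles = []
    for y in range(n - 1):
        for x in range(n - 1):
            i = y * n + x
            triangles += [(i, i + 1, i + n), (i + 1, i + n + 1, i + n)]
    return vertices, triangles


mesh_v, mesh_t = grid_mesh(17)             # 512 triangles
small_v, small_t = compress_to_budget(mesh_v, mesh_t, max_triangles=64, cell=0.05)
```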


boundary surface compression with different compression ratios


strong requirements for lossy compression are identical visual perception and topology preservation

Because the real 3D geometry (possibly coarsened) resides on the client side, all user interactions with the 3D plot, especially affine transformations, can be performed directly on the client side. This architecture reduces the need for extra client-server interactions for scene updating and re-rendering.
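As a small illustration of why no server round trip is needed, an affine transformation can be applied directly to the vertices already held by the client. The 3x4 matrix layout and the helper names below are assumptions for the sketch; a real client would normally hand such a matrix to the graphics hardware.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]
Affine = Tuple[Tuple[float, float, float, float], ...]   # 3 rows: rotation/scale | translation


def rotation_z(degrees: float) -> Affine:
    """Rotation about the z axis by the given angle, with no translation."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, 0.0, 0.0), (s, c, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))


def apply(matrix: Affine, vertices: List[Vec]) -> List[Vec]:
    """Transform every vertex locally; the mesh connectivity is unchanged."""
    return [tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in matrix)
            for x, y, z in vertices]


turned = apply(rotation_z(90.0), [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
```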

Parallel isosurface algorithm

Each general visualization algorithm needs special investigation of its correctness. This may sound surprising, but it is not unusual in the development of parallel tools, because of the various algorithmic techniques and tricks used to achieve good performance, scalability and efficiency in the parallel counterparts of an algorithm. To illustrate correct distributed data handling and correct visualization output for a complex 3D scalar field, the same dataset was visualized both with RemoteView and with well-known non-parallel reference software.

Tecplot (http://www.tecplot.com) was chosen as the non-parallel reference visualization software, and a dataset of admissible size was visualized with both Tecplot and RemoteView. Both output isosurfaces are shown below. Pictures with a blue background correspond to RemoteView.


Isosurface calculated and rendered with RemoteView


Isosurface calculated and rendered with Tecplot

This comparison illustrates identical visual perception of the same input dataset and correctness of parallel processing algorithms used by RemoteView.

4. Conclusion

The urgent need for special visualization software for high performance clusters and large data processing drives continuous research in distributed data management and scientific visualization. As a result of collaboration between software developers and scientific researchers, several open source visualization packages (e.g. ParaView, VisIt) have been developed to meet scientific and industrial demands. There are also all-in-one, highly scalable commercial tools for cluster-based rendering (e.g. EnSight). All of them have similar visualization capabilities, are widely used, and are continually being improved. All of them offer remote data processing and easy-to-install binary distributions, and each offers some facilities for distributed data handling and parallel visualization. Nevertheless, achieving distributed data handling and parallel visualization is a very intricate task. VisIt has difficulties running its host-profile-based parallel renderer on top of an existing cluster with its own file management and task pool. ParaView needs manual compilation with an MPI library, which is a difficult task even for experienced users and may require administrative privileges. All of them continually improve their GUIs and support more and more data formats, but they are still not as user-friendly for parallel and distributed processing as they are for remote execution.

It should also be mentioned that some crucial visualization algorithms are missing from ParaView and VisIt and remain only goals for the distant future. The main example is unsteady vector field visualization. The authors' current research, and the next enhancement of RemoteView functionality, is parallel vector field visualization. Vector fields given on tetrahedral or rectangular meshes are common results of scientific simulation. Each vector field dataset corresponding to a specific time step of a numerical simulation adds to the amount of data to be processed and analyzed. Correct visualization of a time-varying vector field given on a 3D mesh requires hundreds or thousands of individual vector field datasets, one for each consecutive time interval. An example of a stand-alone parallel tracing tool for calculating path lines of massless particles is shown below. It still needs to be integrated under the RemoteView GUI.
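The core of path line tracing for a massless particle is the integration of dp/dt = v(p, t) through a time-varying velocity field. The sketch below uses a classical fourth-order Runge-Kutta step and an analytic field as a stand-in; it is serial and is not the authors' tracing tool. In a real tool the velocity would be interpolated in space on the mesh and in time between stored time steps, and `rotating_flow`, the seed point and the step count are assumptions.

```python
import math
from typing import Callable, List, Tuple

Vec = Tuple[float, float, float]
Field = Callable[[Vec, float], Vec]        # velocity at position p and time t


def rk4_step(v: Field, p: Vec, t: float, dt: float) -> Vec:
    """Advance a massless particle by one classical Runge-Kutta step."""
    def shifted(k: Vec, scale: float) -> Vec:
        return tuple(pi + scale * ki for pi, ki in zip(p, k))

    k1 = v(p, t)
    k2 = v(shifted(k1, dt / 2), t + dt / 2)
    k3 = v(shifted(k2, dt / 2), t + dt / 2)
    k4 = v(shifted(k3, dt), t + dt)
    return tuple(pi + dt / 6 * (a + 2 * b + 2 * c + d)
                 for pi, a, b, c, d in zip(p, k1, k2, k3, k4))


def path_line(v: Field, seed: Vec, t0: float, t1: float, steps: int) -> List[Vec]:
    """Positions of one particle released at `seed` at time t0 and followed until t1."""
    dt = (t1 - t0) / steps
    points = [seed]
    for n in range(steps):
        points.append(rk4_step(v, points[-1], t0 + n * dt, dt))
    return points


def rotating_flow(p: Vec, t: float) -> Vec:
    """Analytic unsteady stand-in: rotation about z with a time-varying rate, plus drift in z."""
    x, y, _ = p
    omega = 1.0 + 0.5 * math.cos(t)
    return (-omega * y, omega * x, 0.1)


track = path_line(rotating_flow, (1.0, 0.0, 0.0), 0.0, 2 * math.pi, 400)
```

For this field the rotation angle is t + 0.5 sin t, so after one period the particle should return to x = 1, y = 0 while having drifted by 0.2π in z, which gives a simple check of the integrator.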
