Mostly Data Formats and Storage

These are software packages for the storage and retrieval of data in standard formats. The software usually takes the form of a library of subroutines which are used either by another package, e.g. Matlab, Ferret, etc., or in a separate (usually Fortran or C) program created by the user to retrieve or store some type of data. I've listed the most common ones used in the geosciences (at least as far as I know) below, along with a final link to even more such packages. It may seem a bit strange to have so many choices for a "standard" data format, but as you'll find out when delving into them a bit more deeply, there are the beginnings of some sort of consolidation (at least amongst the netCDF, HDF, and CDF packages listed), so eventually we might have one (or at least very few) standard formats with which to work.

Last updated and checked on Mar. 23, 2004, just a few short years since the last update on Mar. 15, 1996.


The National Space Science Data Center's (NSSDC) Common Data Format (CDF) is a self-describing data abstraction for the storage and manipulation of multidimensional data in a discipline-independent fashion. When one first hears the term "Common Data Format" one intuitively thinks of data formats in the traditional (i.e. messy/convoluted storage of data on disk or tape) sense of the word. Although CDF has its own internal self describing format, it consists of more than just a data format. CDF is a scientific data management package (known as the "CDF Library") which allows programmers and application developers to manage and manipulate scalar, vector, and multi-dimensional data arrays. The irony of the term "FORMAT" is that the actual data format which CDF utilizes is completely transparent to the user and accessible through a consistent set of interface (known as the "CDF Interface") routines. Therefore, programmers are not burdened with performing low level I/O's to physically format and unformat the data file. This is all done for them. The development of CDF arose out of the recognition by the NSSDC for a class of data models that is matched to the structure of scientific data and the applications (i.e. statistical and numerical methods, visualization, and management) they serve.



The Climate Data Management System is an object-oriented data management system, specialized for organizing multidimensional, gridded data used in climate analysis and simulation. Data can be obtained from files in any of the self-describing formats netCDF, HDF, GrADS/GRIB or PCMDI DRS.

The Climate Data Markup Language (CDML) is the markup language used to represent data in CDMS. It is based on the XML standard. CDML is an XML dialect geared toward the representation of gridded climate datasets.



The Computer Graphics Metafile is a 2D data interchange standard which allows graphical data to be stored and exchanged among graphics devices, applications and computer systems in a device-independent manner. It is a revisable, structured format that can represent vector graphics, raster graphics and text. The given URL is the NIST CGM site which contains the full CGM standard.



The CFD General Notation System (CGNS) consists of a collection of conventions, and software implementing those conventions, for the storage and retrieval of CFD (computational fluid dynamics) data. The system consists of two parts: (1) a standard format for recording the data, and (2) software that reads, writes, and modifies data in that format. The format is a conceptual entity established by the documentation; the software is a physical product supplied to enable developers to access and produce data recorded in that format.

The CGNS system is designed to facilitate the exchange of data between sites and applications, and to help stabilize the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a complete and extensible library of functions. The API (Application Program Interface) is platform independent and can be easily implemented in C, C++, Fortran and Fortran90 applications.



The Data Retrieval and Storage (DRS) library and utilities, part of the PCMDI Project, support the scientific data format used at PCMDI. This was developed to support high-volume multi-dimensional array output from general circulation models. The available source code is currently set to compile on Cray, Sun, SGI and HP platforms, although a quick perusal of it shows that it probably wouldn't be difficult to port to other platforms.



An interchange technology is an enabling technology that utilizes external metadata to allow applications to plug and play seamlessly with datasets in heterogeneous formats. An interchange technology can be utilized to solve the data/application interoperability problem.

The Earth Science Markup Language (ESML) is one such interchange technology. Based on XML it consists of the ESML Schema, ESML Files and the ESML Library. ESML Files contain descriptions of the content, structure, and semantics of a particular set of data files. The ESML Schema defines rules for creating the ESML file. Because ESML Files are external files (i.e. not contained within the data files), both data producers and consumers can create and use these descriptions at any time. A key point is that the ESML Files do not modify the application or the data file itself. The ESML Library is utilized by applications to parse the ESML file and to decode the data format. Application developers can now build data format independent applications utilizing the ESML Library. Furthermore, the applications will not require modification in order to access new formats as they become available.
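The idea of an external description driving a format-independent reader can be sketched in a few lines of Python. This is only an illustration of the principle: the DESCRIPTION dictionary and the read_records function are hypothetical stand-ins, not ESML syntax and not the ESML Library API (real ESML Files are XML documents that follow the ESML Schema).

```python
import struct

# Hypothetical external description of a binary record layout.
# It lives outside the data file, so the data itself is never modified.
DESCRIPTION = {
    "byte_order": "big",
    "fields": [
        {"name": "station_id", "type": "int32"},
        {"name": "temperature", "type": "float32"},
        {"name": "pressure", "type": "float32"},
    ],
}

# Mapping from the (assumed) type names above to struct format codes.
_TYPE_CODES = {"int16": "h", "int32": "i", "float32": "f", "float64": "d"}


def read_records(blob, description):
    """Decode fixed-size records using only the external description."""
    prefix = ">" if description["byte_order"] == "big" else "<"
    codes = "".join(_TYPE_CODES[f["type"]] for f in description["fields"])
    names = [f["name"] for f in description["fields"]]
    record = struct.Struct(prefix + codes)
    return [dict(zip(names, values)) for values in record.iter_unpack(blob)]
```

Supporting a file with a different byte order or field list then means writing a new description, while read_records stays unchanged.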



GeoTIFF represents an effort by over 160 different remote sensing, GIS, cartographic, and surveying related companies and organizations to establish a TIFF based interchange format for georeferenced raster imagery.



GeoVRML is an official Working Group of the Web3D Consortium. It was formed on 27 Feb 1998 with the goal of developing tools and recommended practice for the representation of geographical data using the Virtual Reality Modeling Language (VRML). The desire is to enable geo-referenced data, such as maps and 3-D terrain models, to be viewed over the web by a user with a standard VRML plugin for their web browser.



The World Meteorological Organization (WMO) Commission for Basic Systems (CBS) Extraordinary Meeting Number VIII (1985) approved a general purpose, bit-oriented data exchange format, designated FM 92-VIII Ext. GRIB (GRIdded Binary). It is an efficient vehicle for transmitting large volumes of gridded data to automated centers over high-speed telecommunication lines using modern protocols. By packing information into the GRIB code, messages (or records - the terms are synonymous in this context) can be made more compact than character oriented bulletins, which will produce faster computer-to-computer transmissions. GRIB can equally well serve as a data storage format, generating the same efficiencies relative to information storage and retrieval devices.
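Much of GRIB's compactness comes from storing each grid value as a small unsigned integer relative to a reference value. The following Python sketch illustrates the arithmetic of that "simple packing" idea, Y * 10^D = R + X * 2^E, where Y is the original value, R the reference value, X the packed integer, and D and E the decimal and binary scale factors. The function names are my own, and the sketch keeps everything as ordinary Python numbers; it does not write the bit stream or the sections of a real GRIB message.

```python
import math


def simple_pack(values, nbits, decimal_scale=0):
    """Return (R, E, packed integers) so that Y * 10**D = R + X * 2**E."""
    scaled = [v * 10 ** decimal_scale for v in values]
    ref = min(scaled)                      # R, the reference value
    span = max(scaled) - ref
    max_int = 2 ** nbits - 1               # largest integer that fits in nbits
    # Smallest binary scale E for which the whole span fits in nbits.
    binary_scale = 0 if span == 0 else math.ceil(math.log2(span / max_int))
    step = 2.0 ** binary_scale
    # Clamp to guard against floating-point rounding at the top of the range.
    packed = [min(max_int, round((s - ref) / step)) for s in scaled]
    return ref, binary_scale, packed


def simple_unpack(ref, binary_scale, packed, decimal_scale=0):
    """Invert simple_pack, up to the quantization error of half a step."""
    step = 2.0 ** binary_scale
    return [(ref + x * step) / 10 ** decimal_scale for x in packed]
```

With 12 bits per value instead of a 32-bit float, the field takes well under half the space, at the cost of a quantization error of at most half a step.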


  • wgrib

    WGRIB is a program to manipulate, inventory and decode GRIB files.



    The World Meteorological Organization (WMO) has developed a format for efficient transmission of gridded data, which is a set of floating-point values sampled over a two-dimensional grid. This format, GRIB, is described in FM 92-IX Ext. GRIB [FM92-IX]. GRIB is a compact data format. However, it is difficult to decode. Without a sophisticated decoder -- and a set of appropriate tables -- an application cannot even determine which quantity is being represented in a GRIB record, let alone access the values of the quantity at particular grid points. JMGRIB has been developed to let an application access at least the meta-information about the grid without resorting to a decoder. JMGRIB also includes a format that describes both meta-data and the data in XML form. When the number of grid points is small, this is the most convenient format as an application needs no GRIB decoder and no knowledge of the GRIB format.

    JMGRIB defines a collection of elements that can be combined by a user into one of four formats. Three formats -- raw, encoded, and expanded -- are intended for gridded data with a large number of values. In this case, annotating each grid point is infeasible. The fourth, a grid-point format, is meant to represent a small set of data points on a possibly irregular multi-dimensional grid. The latter format is the most convenient for an application program as this program can access any data value without the help of a GRIB decoder. The first three formats require an understanding of a GRIB record.



The Hierarchical Data Format is a multi-object file format that facilitates the transfer of various types of data between machines and operating systems. It allows self-definitions of data content and is easily extensible for future enhancements or compatibility with other standard formats. The latest version of HDF supports the complete netCDF interface.


  • HDFView

    HDFView is a Java-based tool for browsing and editing NCSA HDF4 and HDF5 files. HDFView allows users to browse through any HDF4 or HDF5 file, starting with a tree view of all top-level objects in an HDF file's hierarchy. HDFView allows a user to descend through the hierarchy and navigate among the file's data objects. The content of a data object is loaded only when the object is selected, providing interactive and efficient access to HDF4 and HDF5 files. HDFView editing features allow a user to create, delete, and modify the value of HDF objects and attributes.



NASA developed the HDF-EOS format with additional conventions and data types for HDF files. HDF-EOS supports three geospatial data types (grid, point, and swath), providing uniform access to diverse data types in a geospatial context. The HDF-EOS software library allows a user to query or subset the contents of a file by earth coordinates and time (if there is a spatial dimension in the data). Tools that process standard HDF files will also read HDF-EOS files; however, standard HDF library calls cannot access geolocation data, time data, and product metadata as easily as with HDF-EOS library calls.



MarineXML's structure allows for complete encapsulation of all possible marine environmental parameters including metadata, quality assessments and their results and a complete history of edits made to the data throughout its entire life, thus making it the ideal archiving format.

Most AODC data management systems and procedures now revolve around data in MarineXML. MarineQC semi-automatically validates incoming marine environmental data in MarineXML and writes any results within the appropriate elements. MEDI is then used to automatically extract the metadata from the MarineXML file and create a metadata record that complies with NASA's GCMD format and most fields of the national ANZLIC metadata standards.


  • MarineQC

    MarineQC is a Java-based software package, developed internally for the processing and quality control of ADF collected marine environmental data. It works with data in eXtensible Markup Language (XML), thus encapsulating both metadata and data within the one file. The application will allow the user to undertake QC of any environmental data at various levels of detail, attaching flags to features, edits and/or quality of the data. This is all possible through the use of MarineXML, our standard internal data format for oceanographic data.


  • MEDI Authoring Tool

    The MEDI authoring tool has been developed to encourage data collectors and scientists to produce metadata descriptions for their datasets.



MathML is a low-level specification for describing mathematics as a basis for machine to machine communication.



The network Common Data Form is an interface for scientific data access and a library that provides an implementation of the interface. It also defines a machine-independent format for representing data. Data stored in the netCDF format is self-describing, network transparent, direct-access, appendable, and sharable. There is a netCDF interface to HDF available.
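"Machine-independent" here largely means that the bytes on disk have a fixed layout and byte order regardless of the host. As a rough illustration, the Python sketch below builds and reads the first few bytes of a netCDF classic file using only the standard library. It reflects my reading of the classic file format (the magic bytes "CDF", a version byte, then a big-endian record count); the function names are hypothetical, and real programs should go through the netCDF library rather than parse headers by hand.

```python
import struct


def make_empty_classic_file():
    """Bytes of a classic file with no dimensions, attributes, or variables."""
    # Magic "CDF" + version byte 1, then numrecs = 0, then three absent
    # lists (dimensions, global attributes, variables), two zero words each.
    return b"CDF\x01" + struct.pack(">7I", 0, 0, 0, 0, 0, 0, 0)


def read_header_start(blob):
    """Return (version, number of records) from the start of the header."""
    if len(blob) < 8 or blob[:3] != b"CDF":
        raise ValueError("not a netCDF classic file")
    version = blob[3]
    # Always big-endian, whatever the byte order of the machine reading it.
    (numrecs,) = struct.unpack(">I", blob[4:8])
    return version, numrecs
```

Because the byte order is stated explicitly in the struct format, the same code gives the same answer on little-endian and big-endian hosts.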


  • NetCDF 4

    In version 4.0 the netCDF API will be extended and implemented on top of the HDF5 data format. NetCDF users will be able to create HDF5 files with benefits not available with the netCDF format, such as much larger files and multiple unlimited dimensions. Backward compatibility in accessing old netCDF files will be supported. The combined library will preserve the desirable common characteristics of netCDF and HDF5 while taking advantage of their separate strengths: the widespread use and simplicity of netCDF and the generality and performance of HDF5.


  • CF Metadata Convention

    The CF conventions for climate and forecast metadata are designed to promote the processing and sharing of files created with the netCDF API. The conventions define metadata that provide a definitive description of what the data in each variable represents, and of the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities. The CF conventions generalize and extend the COARDS conventions.


  • NcML

    NcML is an XML representation of netCDF metadata, (roughly) the header information one gets from a netCDF file with the "ncdump -h" command. NcML is similar to the netCDF CDL (network Common data form Description Language), except, of course, it uses XML syntax.
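Because NcML is plain XML, the header information it carries can be read with any XML parser. The Python sketch below summarizes the dimensions and variables of a small, made-up NcML document. It assumes the NcML 2.2 namespace and the element and attribute names used by later releases of the netCDF-Java library; the schema that was current when this page was written may differ, and the sample document and function name are illustrative only.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Invented sample, roughly what "ncdump -h" would show in CDL form.
SAMPLE_NCML = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <dimension name="time" length="4" isUnlimited="true"/>
  <dimension name="lat" length="2"/>
  <variable name="temp" type="float" shape="time lat">
    <attribute name="units" value="K"/>
    <attribute name="long_name" value="air temperature"/>
  </variable>
  <variable name="time" type="int" shape="time">
    <attribute name="units" value="hours since 2004-01-01"/>
  </variable>
</netcdf>
"""


def summarize_ncml(text):
    """Return (dimensions, variables) dictionaries from an NcML string."""
    root = ET.fromstring(text.strip())
    ns = {"nc": NCML_NS}
    dims = {d.get("name"): int(d.get("length"))
            for d in root.findall("nc:dimension", ns)}
    variables = {}
    for var in root.findall("nc:variable", ns):
        attrs = {a.get("name"): a.get("value")
                 for a in var.findall("nc:attribute", ns)}
        variables[var.get("name")] = {
            "type": var.get("type"),
            "shape": (var.get("shape") or "").split(),
            "attributes": attrs,
        }
    return dims, variables
```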



A format for portable graphics, as you might surmise from the name. PNG unofficially stands for "PNG's Not GIF", which originates from the decision of Unisys and CompuServe to require royalties from programs using the GIF format since Unisys has a patent on the LZW compression format used therein. Besides the fact that it's not GIF, its features include an unambiguous pronunciation, multiple CRCs so that file integrity can be checked without viewing, a magic signature that can detect the most common types of file corruption, better compression than GIF, a 2-D interlacing scheme, and a non-patented free and completely referenced implementation with full source code.
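The signature and per-chunk CRCs mentioned above are easy to check without decoding the image. The Python sketch below builds a tiny 1x1 grayscale PNG in memory and then verifies the 8-byte signature and the CRC-32 of every chunk. The function names are my own, and the checker is a minimal illustration rather than a full validator (it does not check chunk ordering or decompress the pixel data).

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # 8-byte magic signature


def make_chunk(chunk_type, data):
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def build_minimal_png():
    """Return the bytes of a 1x1, 8-bit grayscale image."""
    # Width, height, bit depth, color type, compression, filter, interlace.
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte 0 + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def check_png(blob):
    """Return [(chunk type, CRC ok?)]; raise ValueError on structural damage."""
    if blob[:8] != PNG_SIGNATURE:
        raise ValueError("bad PNG signature")
    results = []
    pos = 8
    while pos < len(blob):
        if pos + 12 > len(blob):
            raise ValueError("truncated chunk header")
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        end = pos + 12 + length
        if end > len(blob):
            raise ValueError("truncated chunk")
        chunk_type = blob[pos + 4:pos + 8]
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", blob[end - 4:end])
        actual_crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
        results.append((chunk_type.decode("ascii"), stored_crc == actual_crc))
        pos = end
    return results
```

The signature contains a CR-LF pair, so a transfer that rewrites line endings damages it and is caught before any chunk is read.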



PyTables is a hierarchical database package designed to efficiently manage very large amounts of data. It is built on top of the HDF5 library and the numarray package. It features an object-oriented interface that, combined with natural naming and C code generated from Pyrex sources, makes it a fast, yet extremely easy to use tool for interactively saving and retrieving very large amounts of data. It also provides flexible indexed access on disk to anywhere in the data you want to go.

PyTables was born because one of its authors (Francesc Alted) needed to save large amounts of data in a way that was both hierarchical and efficient, for later post-processing. After trying several approaches (ZODB, the NetCDF interface of Scientific Python, and HL-HDF5), he found that each had distinct drawbacks. For example, working with files larger than, say, 100 MB was rather painful with ZODB (it took a lot of memory). The NetCDF interface provided by Scientific Python was great, but it does not allow the data to be given a hierarchical structure; besides, NetCDF only supports homogeneous datasets, not heterogeneous datasets (i.e. tables). Finally, HL-HDF5, a high-level interface to the HDF5 library, and especially its module PyHL, was closer to what he needed, but working with tables proved cumbersome (you need to build a Python C module containing the table definition).



SEDRIS is fundamentally about two key aspects: (1) representation of environmental data, and (2) the interchange of environmental data sets.

To achieve the first one, SEDRIS offers a data representation model, augmented with its environmental data coding specification and spatial reference model, so that one can articulate one's environmental data clearly, while also using the same representation model to understand others' data unambiguously. Therefore, the data representation aspect of SEDRIS is about capturing and communicating meaning and semantics.

For the second part, we know from practice that it is not enough to be able to clearly represent or describe the data, we must also be able to share such data with others in an efficient manner. So the second aspect of SEDRIS is about interchange of data that can be described using the data representation model. For the interchange part, the SEDRIS API, its format, and all the associated tools and utilities play the primary role, while being semantically coupled to the data representation model.



Silo is a library which implements an application programming interface (API) designed for reading and writing scientific data. It is a high-level, portable interface that was developed at Lawrence Livermore National Laboratory to address difficult database issues, such as different, incompatible file formats and libraries.

Silo takes advantage of features in netCDF and PDB, a binary database file format developed at LLNL, to build a powerful data access mechanism and to provide a higher level view of the data. It assigns meaning to different types of objects and supports a hierarchical directory structure. Entities managed by the Silo library include not just arrays, but also meshes, mesh variables, material data, and curves. The Silo interface allows the development of generic tools.



SVG is a language for describing two-dimensional graphics and graphical applications in XML.



The Weather Observation Markup Format is an application of XML to describe a particular kind of document: weather observation reports.



XDF is a common scientific data format based on XML and general mathematical principles that can be used throughout the scientific disciplines. It includes these key features: hierarchical data structures, any dimensional arrays merged with coordinate information, high dimensional tables merged with field information, variable resolution, easy wrapping of existing data, user specified coordinate systems, searchable ASCII meta-data, and extensibility to new features/data formats.



The eXtensible Data Model and Format (XDMF) is an active, common data hub used to pass values and metadata in a standard fashion between application modules. XDMF views data as consisting of two basic types: Light data and Heavy data. Light data is both metadata and small amounts of values. Heavy data typically consists of large arrays of values.



XMML, the eXploration and Mining Markup Language, is an XML-based encoding for geoscience and exploration information. It is intended to support the exchange of exploration information in a wide variety of contexts. This includes exchange between software packages on the desktop and between users and organisations, and in particular the encoding is meant to be compatible with HTTP.



The Extensible Scientific Interchange Language (XSIL) is a flexible, hierarchical, extensible, transport language for scientific data objects.

The entire object may be represented in the file, or there may be metadata in the XSIL file, with a powerful, fault-tolerant linking mechanism to external data. The language is based on XML, and is designed not only for parsing and processing by machines, but also for presentation to humans through web browsers and web-database technology.

It comes with a Java object model that is designed to be extensible, so that scientific data and metadata represented in XML are available to Java code.


Scientific Data Management

Below I list common problems and opportunities in scientific data access. I then collect what are considered the parts of a data management solution. A list of references and examples of data access and scientific data collections follows.

The paper ends with more implementation oriented issues: a survey of some scientific data formats, planning for a possible implementation and a survey of the supporting technologies available.


Scientific Data Format FAQ

A list of Frequently Asked Questions about various standard data formats. It includes links to sites with the software when and where it's available. This document was last updated in Oct. 1995.


S. Baum
Dept. of Oceanography
Texas A&M University