Vector data
Vector data is by far the most common geospatial format because it is the most efficient way to store spatial information, and in general it requires fewer computer resources to store and process than raster data. The Open Geospatial Consortium (OGC) has over 16 formats directly related to vector data. Vector data stores only geometric primitives, including points, lines, and polygons, and only the points are stored for each type of shape. For example, for a simple straight line, only the two end points need to be stored, along with a definition of the shape as a line. Software displaying that data reads the shape type and then connects the end points with a line dynamically.
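As a rough illustration of the idea that only vertices are stored, the following Python sketch represents each primitive as a shape type plus a list of coordinate pairs. The dictionary layout and the `segments` helper are hypothetical and do not correspond to any real file format; they only show how software can derive the drawn segments from the stored points.

```python
# Hypothetical in-memory layout: a shape type plus the stored vertices.
point = {"type": "point", "coords": [(-89.33, 30.31)]}
line = {"type": "line", "coords": [(-89.33, 30.30), (-89.36, 30.28)]}
polygon = {
    "type": "polygon",
    # The first vertex is repeated at the end to close the ring.
    "coords": [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 0.0)],
}


def segments(shape):
    """Return the straight segments to draw between consecutive vertices."""
    coords = shape["coords"]
    if shape["type"] == "point":
        return []  # a point has no segments to connect
    return list(zip(coords[:-1], coords[1:]))


print(segments(line))
```

The line stores two vertices and yields one segment; the polygon stores four vertices and yields three.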
Geospatial vector data is similar to the concept of vector computer graphics with some notable exceptions. Geospatial vector data contains positive and negative Earth-based coordinates, while vector graphics typically store computer screen coordinates. Geospatial vector data is also usually linked to other information about the object represented by the geometry. This information may be as simple as a timestamp in the case of GPS data, or an entire database table for larger geographic information systems. Vector graphics often store styling information describing colors, shadows, and other display-related instructions, while geospatial vector data typically does not. Another important difference is shapes. Geospatial vectors typically include only very primitive geometries based on points, straight lines, and straight-line polygons, while many computer graphics vector formats have concepts of curves and circles. Geospatial vectors can still model these shapes, but they need more points to do so.
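To make the last point concrete, here is a minimal sketch of approximating a circle with straight-line polygon vertices. The function name `circle_as_polygon` and its parameters are illustrative assumptions, and the coordinates are treated as simple planar units rather than longitude and latitude.

```python
import math


def circle_as_polygon(cx, cy, radius, num_points=36):
    """Approximate a circle with a closed ring of straight-line vertices."""
    ring = []
    for i in range(num_points):
        angle = 2 * math.pi * i / num_points
        ring.append((cx + radius * math.cos(angle),
                     cy + radius * math.sin(angle)))
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring


coarse = circle_as_polygon(0.0, 0.0, 1.0, num_points=8)
smooth = circle_as_polygon(0.0, 0.0, 1.0, num_points=72)
print(len(coarse), len(smooth))
```

Raising `num_points` gives a smoother outline at the cost of storing more vertices.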
Another important point about geospatial data is that vector data is generally considered to be in binary format. Human-readable formats such as Comma-Separated Values (CSV), simple text strings, and XML-based formats are technically vector data because they store geometry, as opposed to rasters, which represent all data within the bounding box of the data set. However, these formats are typically not categorized with binary vector data formats for discussion purposes. Until the explosion of XML beginning in the late 1990s, vector data formats were nearly all binary. XML provided a hybrid approach that was both computer and human readable. The trade-off is that XML greatly increases file size compared to binary formats. These formats are discussed later in this section.
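As a small example of a human-readable format carrying vector geometry, the sketch below reads point locations from CSV text using Python's built-in csv module. The column names `name`, `lon`, and `lat` are assumptions made for this illustration; real CSV files use whatever headers their producers chose.

```python
import csv
import io

# Hypothetical CSV content; the header names are assumed for this example.
csv_text = """name,lon,lat
Mockingbird Cafe,-89.329160,30.310964
"""


def read_points(text):
    """Parse CSV text into (name, lon, lat) tuples with numeric coordinates."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["name"], float(row["lon"]), float(row["lat"]))
            for row in reader]


points = read_points(csv_text)
print(points)
```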
The number of vector formats to choose from is staggering. The open source vector library OGR (http://www.gdal.org/ogr/ogr_formats.html) lists over 70 supported vector formats. Its commercial counterpart, SAFE Software's Feature Manipulation Engine (FME), lists over 170 supported vector formats (http://www.safe.com/fme/format-search/). Granted, these lists include a few vector graphics formats as well as human-readable geospatial formats. Even so, there are still dozens of formats that you should at least be aware of in case you come across them.
Shapefiles
The most ubiquitous geospatial format is the Esri shapefile. Geospatial software company Esri released the shapefile format specification as an open format in 1998 (http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf). Esri developed it as a format for their ArcView software, designed as a lower-end GIS option to complement their high-end professional package, ArcInfo, formerly called Arc/INFO. But the open specification, efficiency, and simplicity of the format turned it into an unofficial GIS standard, still extremely popular over 15 years later. Because the format is so common, virtually every piece of software labeled as geospatial software supports shapefiles. For this reason, you can get by as an analyst by being intimately familiar with shapefiles and mostly ignoring other formats. You can convert almost any other format to shapefiles through the source format's native software or a third-party converter like the OGR library.
One of the most striking features of a shapefile is that the format consists of multiple files. At a minimum, there are three and there can even be as many as 15 different files! The following table describes the file formats. The .shp, .shx, and .dbf files are required for a valid shapefile.
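The requirement for three files can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `missing_shapefile_parts` and the sample file name `roads` are hypothetical, and the check assumes lowercase extensions.

```python
import tempfile
from pathlib import Path

REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(base_path):
    """Return the required extensions that have no file for base_path.

    base_path is the shapefile path without any extension.
    """
    base = Path(base_path)
    return [ext for ext in REQUIRED_EXTENSIONS
            if not base.with_name(base.name + ext).exists()]


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "roads.shp").touch()
    (Path(tmp) / "roads.dbf").touch()
    missing = missing_shapefile_parts(Path(tmp) / "roads")

print(missing)
```

An empty list means all three required files are present. The check says nothing about whether their contents are valid.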
You will probably never encounter all of these formats at once. But any shapefile you use will have multiple files. You will commonly see .shp, .shx, .dbf, .prj, .sbn, .sbx, and occasionally .shp.xml files. If you want to rename a shapefile, you must rename all of the associated files to the same name.
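Renaming every associated file by hand is error-prone, so a short script can help. The following is a sketch under simple assumptions: all component files sit in one directory and share the same base name. The function `rename_shapefile` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def rename_shapefile(directory, old_name, new_name):
    """Rename every file named old_name.<ext> in directory to new_name.<ext>."""
    renamed = []
    # sorted() builds the full list first, so renaming is safe while looping.
    for path in sorted(Path(directory).iterdir()):
        # Matches "roads.shp" and "roads.shp.xml" but not "roads_old.shp".
        if path.is_file() and path.name.startswith(old_name + "."):
            suffix = path.name[len(old_name):]
            target = path.with_name(new_name + suffix)
            path.rename(target)
            renamed.append(target.name)
    return renamed


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("roads.shp", "roads.shx", "roads.dbf",
                 "roads.shp.xml", "roads_old.shp"):
        (Path(tmp) / name).touch()
    renamed = rename_shapefile(tmp, "roads", "streets")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(renamed)
```

Matching on the base name followed by a dot keeps multi-part extensions such as .shp.xml intact and avoids touching unrelated files that merely start with the same letters.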
Another important feature of shapefiles is that the records are not numbered. Records include the geometry, the .shx index record, and the .dbf record. These records are stored in a fixed order. When you examine shapefile records using software, they appear to be numbered. But people are often confused when they delete a shapefile record, save the file, and reopen it, only to find that the number of the deleted record still appears. The reason is that shapefile records are numbered dynamically upon loading, and the numbers are not saved. So if you delete record number 23 and save the shapefile, record number 24 will become 23 the next time you read the shapefile. Many people expect to open the shapefile and see the records jump from 22 to 24. The only way to track shapefile records that way is to create a new attribute called ID or similar in the .dbf file and assign each record a permanent, unique identifier.
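The renumbering behavior can be simulated with plain Python lists. This sketch does not read a real shapefile; the `records` list stands in for .dbf rows, and the `ID` values are made up for the illustration.

```python
# Hypothetical attribute rows standing in for .dbf records, each with a
# permanent ID attribute (101 through 130).
records = [{"ID": 100 + i, "name": f"feature {i}"} for i in range(1, 31)]


def numbered(rows):
    """Number rows by position, starting at 1, as software does on load."""
    return list(enumerate(rows, start=1))


# Delete the record displayed as number 23 (list index 22), then "reload".
del records[22]
reloaded = numbered(records)

number, row = reloaded[22]
print(number, row["ID"])
```

After the deletion, position 23 is occupied by the row that used to be number 24, while the stored `ID` attribute still shows the gap where 123 was removed.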
Just like renaming shapefiles, care must be taken when editing shapefiles. It's best to use software that treats the shapefile as a single data set. If you edit any of the files individually and add or delete a record without editing the accompanying files, most geospatial software will treat the shapefile as corrupt.
CAD files
CAD stands for Computer-Aided Design. The primary formats for CAD data were created by Autodesk for their leading AutoCAD package. The two formats commonly seen are the Drawing Exchange Format (DXF) and the AutoCAD native Drawing format (DWG). DWG was traditionally a closed format, but it has become more open.
CAD software is used for everything engineering related, from designing bicycles, to cars, to parks, and city sewer systems. As a geospatial analyst, you don't have to worry about mechanical engineering designs; however, civil engineering designs become quite an issue. Most engineering firms use geospatial analysis to a very limited degree but store nearly all of their data in CAD format. The DWG and DXF formats can represent objects using features not found in geospatial software or weakly supported by geospatial systems. Examples of these features include:
- Curves
- Surfaces (for objects which are different from geospatial elevation surfaces)
- 3D solids
- Text (rendered as an object)
- Text styling
- Viewport configuration
These CAD and engineering-specific features make it difficult to cleanly convert CAD data to geospatial formats. If you encounter CAD data, the easiest option is to ask the data provider if they have shapefiles or some other geospatial-centric format.
Tag and markup-based formats
Tag-based markup formats are typically Extensible Markup Language (XML) formats. But they also include other structured text formats such as the Well-Known Text (WKT) format, which is used for projection information files as well as for different types of data exchange. XML formats include the Keyhole Markup Language (KML), the OpenStreetMap (OSM) format, and the Garmin GPX format for GPS data, which has become a popular exchange format. The Open Geospatial Consortium's Geography Markup Language (GML) standard is one of the oldest and most widely used XML-based geographic formats. It is also the basis for the OGC Web Feature Service standard for web applications. But GML has been largely superseded by KML and the GeoJSON format discussed next.
XML formats often contain more than just geometry. They also contain attributes and rendering instructions such as color, styling, and symbology. Google's KML format has become a fully supported Open Geospatial Consortium (OGC) standard. The following is a sample of KML showing a simple placemark:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
    <Point>
      <coordinates>-89.329160,30.310964</coordinates>
    </Point>
  </Placemark>
</kml>
XML format is attractive to geospatial analysts for several reasons:
- It is a human-readable format
- It can be edited in a text editor
- It is well-supported by programming languages (especially Python!)
- It is, by definition, easily extensible
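As an example of the Python support mentioned in the list above, the following sketch reads the KML placemark sample with the standard library's xml.etree.ElementTree module. The function name `read_placemark` is hypothetical, and the code assumes a document with a single Placemark containing a Point, as in the sample.

```python
import xml.etree.ElementTree as ET

kml_text = """<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
    <Point>
      <coordinates>-89.329160,30.310964</coordinates>
    </Point>
  </Placemark>
</kml>"""

# KML elements live in a namespace, so lookups need a prefix mapping.
NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_placemark(kml):
    """Return (name, lon, lat) for the first Placemark in a KML string."""
    root = ET.fromstring(kml.encode("utf-8"))
    placemark = root.find("kml:Placemark", NS)
    name = placemark.find("kml:name", NS).text
    coords = placemark.find("kml:Point/kml:coordinates", NS).text
    lon, lat = coords.strip().split(",")[:2]  # KML order is lon,lat[,alt]
    return name, float(lon), float(lat)


placemark = read_placemark(kml_text)
print(placemark)
```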
XML is not perfect, though. It is an inefficient storage mechanism for very large data sets and can quickly become cumbersome to edit. Errors in data sets are common, and most parsers do not handle errors robustly. Despite the downsides, XML is widely used in geospatial analysis. Scalable Vector Graphics (SVG) is a widely supported XML format for computer graphics. It is supported well by browsers and is often used for geospatial rendering. However, SVG was not designed as a geographic format.
The Well-Known Text (WKT) format is also an older OGC standard. Its most common use is to define projection information, usually stored in .prj projection files alongside a shapefile or raster. The WKT string for the WGS84 coordinate system is as follows:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.01745329251994328,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]
The parameters defining a projection can be quite long. A standards committee known as the EPSG (European Petroleum Survey Group) created a numerical coding system to reference projections. These codes are used as shorthand for strings like the preceding one. There are also short names for commonly used projections, such as Mercator, which can be used in different software packages to reference a projection. More information on these reference systems can be found at the Spatial Reference website at http://spatialreference.org/ref/.
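The relationship between the WKT string and its EPSG shorthand can be shown with a small Python sketch. It pulls the EPSG codes out of the WGS84 string above with a regular expression and assumes the code for the whole coordinate system is the outermost AUTHORITY entry, which appears last in this string. The function name `epsg_code` is hypothetical, and this is not a general WKT parser.

```python
import re

wkt = (
    'GEOGCS["WGS 84",DATUM["WGS_1984",'
    'SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],'
    'AUTHORITY["EPSG","6326"]],'
    'PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],'
    'UNIT["degree",0.01745329251994328,AUTHORITY["EPSG","9122"]],'
    'AUTHORITY["EPSG","4326"]]'
)


def epsg_code(wkt_string):
    """Return the last EPSG code in a WKT string, or None if there is none.

    Assumes the outermost AUTHORITY entry comes last, as in the sample.
    """
    codes = re.findall(r'AUTHORITY\["EPSG",\s*"(\d+)"\]', wkt_string)
    return int(codes[-1]) if codes else None


code = epsg_code(wkt)
print(code)
```

For this string the result is 4326, the code commonly written as EPSG:4326 for WGS84.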
GeoJSON
GeoJSON is a relatively new and brilliant text format based on the JavaScript Object Notation (JSON) format, which has been a commonly used data exchange format for years. Despite its short history, GeoJSON can be found embedded in all major geospatial software systems and most websites that distribute data.
GeoJSON is a completely backwards-compatible extension to the popular JSON format. The structure of JSON is very similar, and in some cases identical, to existing data structures of common programming languages. JSON is almost identical to Python's dictionary and list data types. Because of this similarity, parsing JSON in a script is simple to do from scratch, but there are also many libraries to make it even easier. Python contains a built-in library aptly named json.
GeoJSON provides a standard way to define geometry, attributes, bounding boxes, and projection information. GeoJSON has all of the advantages of XML including human readable syntax, excellent software support, and wide use in the industry. But it also surpasses XML. GeoJSON is far more compact than XML largely because it uses simple symbols to define objects rather than opening and closing text-laden tags. The compactness also helps with readability and manageability of larger data sets. The following is a sample of GeoJSON syntax, defining a geometry collection with both a point and a line:
{ "type": "GeometryCollection",
  "geometries": [
    { "type": "Point",
      "coordinates": [-89.33, 30.0]
    },
    { "type": "LineString",
      "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
    }
  ]
}
The preceding code is a valid GeoJSON, but it is also a valid Python data structure. You can copy the preceding code sample directly into the Python interpreter as a variable definition and it will evaluate without error as follows:
>>> gc = { "type": "GeometryCollection",
... "geometries": [
... { "type": "Point",
... "coordinates": [-89.33, 30.0]
... },
... { "type": "LineString",
... "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
... }
... ]
... }
>>> gc
{'type': 'GeometryCollection', 'geometries': [{'type': 'Point', 'coordinates': [-89.33, 30.0]}, {'type': 'LineString', 'coordinates': [[-89.33, 30.3], [-89.36, 30.28]]}]}
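In practice, GeoJSON usually arrives as text from a file or web service rather than as a Python literal. The following sketch parses the same geometry collection with Python's built-in json module and then serializes it back to text. The variable names are illustrative only.

```python
import json

geojson_text = """
{ "type": "GeometryCollection",
  "geometries": [
    { "type": "Point",
      "coordinates": [-89.33, 30.0]
    },
    { "type": "LineString",
      "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
    }
  ]
}
"""

# json.loads turns GeoJSON text into ordinary dictionaries and lists.
gc = json.loads(geojson_text)
point = gc["geometries"][0]
line = gc["geometries"][1]

# json.dumps turns the Python structure back into JSON text.
round_trip = json.loads(json.dumps(gc))

print(point["coordinates"], len(line["coordinates"]))
```

The parsed result is built from the same dictionary and list types as the interpreter example, so the same indexing and iteration work on both.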
Because of its compact size, its Internet-friendly syntax by virtue of its similarity to JavaScript, and its support from major programming languages, GeoJSON is a key component of leading REST geospatial web APIs. It currently offers the best compromise among the computer resource efficiency of binary formats, the human-readability of text formats, and programmatic utility.