Spatial indexing
Geospatial datasets are often very large files easily reaching hundreds of megabytes or even several gigabytes in size. Geospatial software can be quite slow in trying to repeatedly access large files when performing analysis. As discussed briefly in Chapter 1, Learning Geospatial Analysis with Python, spatial indexing creates a guide, which allows software to quickly locate query results without examining every single feature in the dataset. Spatial indexes allow software to eliminate possibilities and perform more detailed searches or comparisons on a much smaller subset of the data.
Indexing algorithms
Many spatial indexing algorithms are derivatives of well-established algorithms used for decades on nonspatial information. The two most common spatial indexing algorithms are Quadtree index and R-tree index.
Quadtree index
The Quadtree algorithm actually represents a series of different algorithms based on a common theme. Each node in a Quadtree index contains four children. These child nodes are typically square or rectangular in shape. When a node contains a specified number of features, the node splits if additional features are added. The concept of dividing a space into nested squares speeds up spatial searches. Software must only handle five points at a time and use simple greater-than/less-than comparisons to check whether a point is inside a node. Quadtree indexes are most commonly found as file-based index formats.
The following image shows a point dataset sorted by a Quadtree algorithm. The black points are the actual dataset, while the boxes are the bounding boxes of the index. Note that none of the bounding boxes overlap. The left image shows the spatial representation of the index. The right image shows the hierarchical relationship of a typical index which is how spatial software sees the index and data. This structure allows a spatial search algorithm to quickly eliminate possibilities when trying to locate one or more points in relation to some other set of features.
R-tree index
R-tree indexes are more sophisticated than Quadtrees. R-trees are designed to handle three-dimensional data and are optimized to store the index in a way that is compatible with the way databases use disk space and memory. Nearby objects are grouped together using any algorithm from a variety of spatial algorithms. All objects in a group are bounded by a minimum rectangle. These rectangles are aggregated into hierarchical nodes that are balanced at each level. Unlike a Quadtree, the bounding boxes of an R-tree may overlap across nodes. Due to the relative complexity and database-oriented structure, R-trees are most commonly found in spatial databases as opposed to file-based formats.
The following diagram from R-tree for a two-dimensional point dataset:
Grids
Spatial indexes also often employ the concept of an integer grid. Geospatial coordinates are usually floating point decimal numbers with anywhere from 2 to 16 decimal places. Performing comparisons on floating point numbers is far more computationally expensive than working with integers. Indexed searching is about first eliminating possibilities that don't require precision. Most spatial indexing algorithms, therefore, map floating point coordinates to a fixed-sized integer grid. On searching for a particular feature, the software can use more efficient integer comparisons rather than working with floating point numbers. Once the results are narrowed down, the software can access the full resolution data.
Grid sizes can be as small as 256 by 256 for simple file formats or can be as large as three million by three million in large geospatial databases designed to incorporate every known coordinate system and possible resolution. The integer mapping technique is very similar to the rendering technique that is used to plot data on a graphics canvas in mapping programs. The SimpleGIS script in Chapter 1, Learning Geospatial Analysis with Python, also uses this technique to render points and polygons using the built-in Python turtle graphics engine.