File structure
The preceding elements can be stored in a variety of ways in a single file, multiple files, or database depending on the format. Additionally, this geospatial information can be stored in a variety of formats, including embedded binary headers, XML, database tables, spreadsheets/CSV, separate text, or binary files.
Human readable formats such as XML files, spreadsheets, and structured text files require only a text editor to investigate. These files are also easily parsed and processed using Python's built-in modules, data types, and string manipulation functions. Binary-based formats are more complicated. It is thus typically easier to use a third-party library to deal with binary formats.
However, you don't have to use a third-party library, especially if you just want to investigate the data at a high level. Python's built-in struct module has everything that you need. The struct module lets you read and write binary data as strings. When using the struct module, you need to be aware of the concept of byte order. Byte order refers to how the bytes of information that make up a file are stored in memory. This order is usually platform-specific, but in some rare cases, including shapefiles, the byte order is mixed in the file. The Python struct module uses the greater than (>) and less than (<) symbols to specify byte order (big endian and little endian, respectively).
The following brief example demonstrates the usage of the Python struct module to parse the bounding box coordinates from an Esri shapefile vector dataset. You can download this shapefile as a zipped file at the following URL:
https://geospatialpython.googlecode.com/files/hancock.zip
When you unzip this, you will see three files. For this example, we'll be using hancock.shp
. The Esri shapefile format has a fixed location and data type in the file header from byte 36 to byte 37 for the minimum x, minimum y, maximum x, and maximum y bounding box values. In this example, we will execute the following steps:
- Import the
struct
module. - Open the
hancock.zip
shapefile in the binary read mode. - Navigate to byte
36
. - Read each 8-byte double specified as
d
, and unpack it using thestruct
module in little-endian order as designated by the<
sign.
The best way to execute this script is in the interactive Python interpreter. We will read the minimum longitude, minimum latitude, maximum longitude, and maximum latitude:
>>> import struct >>> f = open("hancock.shp","rb") >>> f.seek(36) >>> struct.unpack("<d", f.read(8)) (-89.6904544701547,) >>> struct.unpack("<d", f.read(8)) (30.173943486533133,) >>> struct.unpack("<d", f.read(8)) (-89.32227546981174,) >>> struct.unpack("<d", f.read(8)) (30.6483914869749,)
You'll notice that when the struct
module unpacks a value, it returns a Python tuple with one value. You can shorten the preceding unpacking code to one line by specifying all four doubles at once and increasing the byte length to 32 bytes as shown in the following code:
>>> f.seek(36) >>> struct.unpack("<dddd", f.read(32)) (-89.6904544701547, 30.173943486533133, -89.32227546981174, 30.6483914869749)