# Spatial Data

Spatial data is data that is connected to a location as part of that data.  Like all other data in Manifold, spatial data in Manifold is stored in tables.    By spatial data we usually mean that each record in a table has values in one or more fields that give the location for that record.

The location may be specified in a very tight, explicit and specific way:  For example, tables may have latitude and longitude fields so that each record includes an explicit latitude/longitude location for that record.    That is usually what we mean by spatial data, tables that have explicit location information for each record already within the record with no need to look up the actual geographic location for the record in some other table.

An example is the Carolingian Hoards table above, a list of coin hoards found in Europe that date to Carolingian times. Each record in the table has a latitude and longitude location for where the hoard was found along with other information such as the dates that characterized the coins found in the hoard. By plotting the locations of hoards of various dates on a map we can get insight into patterns of commerce and travel at different times in Carolingian Europe.

A lesser form of spatial data consists of records in tables where the connection between each record and a location is implied but not given explicitly. In such cases we do not know the location for each record until we look up the actual location using information outside of the table.

Consider, for example, the same table without latitude and longitude fields for the location of each record.   In that case the location of each record is not given explicitly but is implied by the name of the town in the field called FINDSPOT and by the name of the country in the field called NAT.   The location is only implied because by itself the name of a town does not tell us the explicit, geographic location of that town.

While it may be interesting for the residents of those towns to know coin hoards were found in Jelsum in the Netherlands or in Zetel in Germany, for those of us who cannot immediately visualize in our mind's eye where Jelsum and Zetel are located, the table is much less informative if we cannot plot the records on a map. To put a dot on a map at the location of each hoard we will need to know the explicit geographic location of places like Jelsum and Zetel.

If we have a second table, say, called Towns, that lists the names of towns in various countries and also has a latitude and longitude location for the approximate center of each town, we could do a join to look up the actual latitude and longitude location for each town from the Towns table and add that location to records in the Carolingian Hoards table. The process of adding a latitude and longitude to records in a table is called geocoding the table. Once the table has been geocoded each record can be plotted on a map.
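The lookup described above can be sketched outside of Manifold in a few lines of Python. This is only an illustration of the join idea, not how Manifold performs it: the field names `Name`, `Country`, `Latitude` and `Longitude` in the Towns table are assumptions, the sample rows are made up, and the coordinates are rough placeholder values.

```python
# Hypothetical sample rows standing in for the Carolingian Hoards table
# with its latitude and longitude fields removed.
hoards = [
    {"FINDSPOT": "Jelsum", "NAT": "Netherlands"},
    {"FINDSPOT": "Zetel", "NAT": "Germany"},
    {"FINDSPOT": "Unknown Village", "NAT": "France"},
]

# Hypothetical Towns table; field names and coordinates are illustrative only.
towns = [
    {"Name": "Jelsum", "Country": "Netherlands", "Latitude": 53.23, "Longitude": 5.78},
    {"Name": "Zetel", "Country": "Germany", "Latitude": 53.42, "Longitude": 7.97},
]


def geocode(records, lookup):
    """Return copies of records with Latitude/Longitude added from lookup.

    Records whose town and country are not found get None for both fields.
    """
    index = {
        (t["Name"], t["Country"]): (t["Latitude"], t["Longitude"])
        for t in lookup
    }
    result = []
    for rec in records:
        lat, lon = index.get((rec["FINDSPOT"], rec["NAT"]), (None, None))
        result.append({**rec, "Latitude": lat, "Longitude": lon})
    return result


geocoded = geocode(hoards, towns)
for rec in geocoded:
    print(rec)
```

Matching on both town name and country, rather than town name alone, avoids picking the wrong location when two countries have towns with the same name. Records with no match keep an empty location and cannot be plotted until a location is found for them.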

## Vector Data and Raster Data

There are primarily two types of spatial data with which we work in Manifold: vector data and raster data.

### Vector Data

Vector data - Data where a geometry field in each record specifies the location and geometry of an object, which may be a point object, a line object or an area object. Each record may also have additional fields providing information content, called attribute data, which is associated with that object.

The field specifying the location and geometry of the object will use a geometric data type, typically a geom.    Fields providing attribute data can be any one of the many data types supported by Manifold.  Vector data can be visualized as drawings with each object drawn in the location and shape as specified by the geometry data that defines the object.

Sometimes the value of vector data is entirely in the objects and their locations without any attribute data, for example, a drawing that shows the continents of the world where the information conveyed is simply the location and shape of the continents.

In other cases records may also include attribute data in each record in addition to the geometry of the object, for example, in the case of a table of real estate parcels as area objects where each parcel has many attributes, such as a code number identifying the parcel, the name of the owner, the use of the parcel, a property tax rate for the parcel and other information.

###### Vector Data:  Points

The classic example of spatial data with an explicit location is a geocoded table, that is, a table in which each record has a latitude and longitude that gives the location with which that record is associated.

Consider the table above, which lists all In-N-Out Burger restaurants in the United States by address and which provides the latitude and longitude coordinates for each.   Each record has a location explicitly associated with it.

We could use the latitude and longitude coordinates in each record to plot the location of that record on a map, to see exactly where all In-N-Out restaurants are located.  The table is a classic example of spatial data as point data, that is, tied to a specific spot and not spread out along a line or over an area.    Each latitude and longitude location refers to a specific point.

A better and more efficient way to specify the location of a point within Manifold is to encode the latitude and longitude location of that point within a geom data type for a point.   Geoms are so much faster than fields which list latitude and longitude as numbers that Manifold drawings will always use geoms as the source of the geometry they display.

We may also, of course, include latitude and longitude columns in the table if we want, to provide some easily human-readable version of the point coordinates that are encoded by the geom.   But even if those are in the table Manifold will use the geom to draw dots for the points in any drawing that visualizes the table.

Besides increased performance, geoms have many advantages that a straight enumeration of latitude and longitude numbers does not have. One big advantage is that latitude and longitude fields can specify only a single point location for a record: they do not allow specifying record locations that are lines or areas, both of which need a more sophisticated, more capable data type.

Geoms can also encode lines or areas, so if we use geoms to specify locations the same table can have a mix of records for points, records for lines and records for areas. The geom in the Geom column of each record encodes the location of a point, a line or an area as the case may be.
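To make the idea of one geometry field holding points, lines or areas more concrete, here is a minimal Python sketch. The `Geom` and `Record` classes, the record names and the coordinates are all hypothetical, and the sketch says nothing about how Manifold actually lays out a geom internally. It only shows that a single field type can carry any of the three kinds of object.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude)


@dataclass
class Geom:
    """Hypothetical stand-in for a geometry value: a kind plus coordinates."""
    kind: str                  # "point", "line" or "area"
    coords: List[Coordinate]


@dataclass
class Record:
    name: str                  # an attribute field
    geom: Geom                 # the geometry field


# One table mixing all three kinds of object; coordinates are made up.
table = [
    Record("Restaurant", Geom("point", [(-96.67, 33.10)])),
    Record("Road", Geom("line", [(-96.70, 33.10), (-96.68, 33.11), (-96.66, 33.11)])),
    Record("Parcel", Geom("area", [(-96.67, 33.10), (-96.67, 33.11),
                                   (-96.66, 33.11), (-96.66, 33.10)])),
]

kinds = [r.geom.kind for r in table]
counts = {r.name: len(r.geom.coords) for r in table}
print(kinds)
print(counts)
```

A pair of latitude and longitude fields could represent only the first record. The other two need a field that can hold a variable number of coordinates.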

###### Vector Data:  Lines

The examples above show tables where each record's data applies to a single point location.  Spatial data can also be associated with a more spread out location, like data associated with a line that represents a road,  or like data associated with a polygonal area that represents a real estate parcel.

When a record contains data for a single location, a point, understanding how a location is tied to each record in a table is easy: we can see a record in the table, note the coordinates for its location and think, "Aha! This record provides data connected to that particular spot on a map."

When a record contains data for a line, we need more than just a single coordinate location to specify the line.   If the line is a long line, like the route of Interstate Highway 5 from the Northern to Southern borders of the United States, we might need tens of thousands of coordinate locations to specify the line.   To store those coordinate locations within a single record we need a data type that can encode many coordinate locations within a single field.

In Manifold that data type is a geom.  Lines are specified within a geom by the sequence of coordinates necessary to draw a line.   Manifold can use the coordinates packed into the geom to draw a line in a drawing or a map.

Let's take a closer look at how that works.   In the examples that follow we will show a map that contains several layers.  To provide visual context the background layers will show a world map together with the names of towns in the region of interest.

Consider an example of a line, shown in magenta in the illustration above, that shows the center line of the 21 August 2017 total eclipse of the sun.  What looks like a smooth curve from far away actually is composed of straight line segments between the coordinates which define the line object.   In this particular case there are 96 coordinate pairs which define the line.   From here on we will use the more casual idiom coordinates to refer to what is a pair of coordinates, a latitude value and a longitude value.

| Longitude | Latitude |
|-----------|----------|
| -162.0300 | 41.6199 |
| -155.4699 | 42.7299 |
| -150.9300 | 43.3800 |
| -147.2400 | 43.8299 |
| -144.0600 | 44.1599 |
| -141.2299 | 44.4099 |
| -138.6500 | 44.6000 |

A few of the first coordinates beginning at the Western end of the line are listed above as longitude, latitude coordinates.  These have been formatted to only four positions past the decimal point to provide a prettier table.

If we plotted the coordinates that make up the line as small magenta squares we could see how the list of coordinates defines the position and shape of the line.

If we draw a line from coordinate to coordinate in a "connect the dots" manner we draw the line defined by those 96 coordinates.
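The "connect the dots" idea can be written down directly: a line defined by n coordinates consists of n - 1 straight segments, each joining one coordinate to the next. The following Python sketch uses only the first seven coordinates listed above. The function name `segments` is just an illustrative choice.

```python
# The first seven coordinates of the eclipse center line, as listed above,
# in (longitude, latitude) order.
coords = [
    (-162.0300, 41.6199),
    (-155.4699, 42.7299),
    (-150.9300, 43.3800),
    (-147.2400, 43.8299),
    (-144.0600, 44.1599),
    (-141.2299, 44.4099),
    (-138.6500, 44.6000),
]


def segments(points):
    """Pair each coordinate with the one that follows it."""
    return list(zip(points, points[1:]))


segs = segments(coords)
for start, end in segs:
    print(start, "->", end)
```

By the same reasoning, the full line with 96 coordinates is drawn as 95 straight segments.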

Zooming into the display to the region of Oregon where the eclipse path makes landfall, we have also added magenta labels giving the longitude, latitude position of each coordinate that defines the line.   Because the eclipse center line is a fairly smooth curve, even when the view has been zoomed in the sequence of straight line segments still looks like a smooth curve, despite the use of only a few coordinates to define it.

If we take away the symbols that show the locations of the coordinates which define the line we can see that the line looks like a very smooth and gentle curve, not at all kinky or uneven, even though the section we see above has been defined with only five coordinates.

If the line were more convoluted, for example, like the outlines of the green park areas seen in the base map, we would need many more coordinates than just five to define that part of the line seen in the view.  Very curvy or convoluted lines may require many thousands of coordinates to define.

All data in Manifold is stored in tables, so how do tables store the geometric spatial data, that is, the list of coordinates, that defines a line?  Manifold stores the coordinates that define a line within a geom data type which contains a line object.    Geoms provide fast performance and the ability to mix records for point, line and area objects within the same table.

The illustration above shows part of a table that contains spatial data for the center lines of various total eclipses of the sun through 2020, including our subject total eclipse of 21 August 2017.    Each record represents one center line with fields giving the year, month and day of the eclipse as well as a geom that contains the geometry data which defines the line.  The geom for the 2017 eclipse contains the 96 coordinates which define the line for that eclipse.

###### Vector Data:  Areas

As we might have guessed, Manifold stores areas much as it stores lines, as sequences of coordinates which define the area packed within a geom. The sequences of coordinates which define an area specify the boundary of the area which, like a line object, is defined in a connect-the-dots fashion by the coordinates.

Consider an area object in a drawing seen in the illustration above as a layer overlaying an image server layer in a map, the image server layer having no role whatsoever in the example besides providing background context. The area object is illustrated partially transparent, and represents a real estate parcel in a town in France.

As with the case of a line object the area object is defined by a series of coordinates.

We can draw a square, magenta icon at the location of each coordinate which defines the area.  The area is defined by 23 coordinates.   Following is a list of the first few coordinates that define the area.

| Longitude | Latitude |
|-----------|----------|
| 2.0598 | 47.2294 |
| 2.0599 | 47.2295 |
| 2.0598 | 47.2295 |
| 2.0599 | 47.2295 |
| 2.0599 | 47.2295 |
| 2.0602 | 47.2294 |
| 2.0602 | 47.2294 |

We can plot each coordinate labelled with its longitude and latitude coordinates together with the number of the coordinate.

If we draw a line from the first coordinate to the next and so on in "connect-the-dots" fashion we can see that we end up drawing a boundary line around the area in clockwise direction.   Note that if the coordinates are not in order we cannot know how to connect the coordinates.

Connecting the coordinates in order results in a closed boundary line which defines the external border of the area object.

Manifold knows to "fill in" the region inside of the closed boundary to create the area object.
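The importance of coordinate order can be checked numerically. The sketch below is an illustration in Python, not Manifold code: it closes a boundary by returning to the first coordinate and computes the signed area with the standard shoelace formula, which is negative for a clockwise boundary when X increases to the right and Y increases upward. The unit-square "parcel" is a made-up example, and the arithmetic is plain planar arithmetic.

```python
def close_ring(coords):
    """Return the boundary with the first coordinate repeated at the end."""
    if coords[0] != coords[-1]:
        return list(coords) + [coords[0]]
    return list(coords)


def signed_area(coords):
    """Shoelace formula: negative for clockwise, positive for counterclockwise."""
    ring = close_ring(coords)
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


# A hypothetical square parcel with coordinates listed in clockwise order.
parcel = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

# The same four coordinates in a scrambled order: connecting the dots now
# produces a self-crossing "bow tie" instead of a square.
scrambled = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

print(signed_area(parcel))     # clockwise square
print(signed_area(scrambled))  # self-crossing boundary
```

The clockwise square yields a signed area of -1.0, while the scrambled order yields 0.0 because the two halves of the bow tie cancel. The coordinates are identical in both cases. Only their order differs, which is why the order must be stored along with the coordinates.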

As with lines, the list of coordinates which make up the area is contained within a geom that in this case stores an area object. For this example we have a table with several records, seen above, where the first record contains the data for our area object. The geom contains the geometric data while other fields provide other data of interest, such as comments on the parcel and the price in Euros. That is not a bad asking price for a fine parcel with a nice house and a pool within walking distance of the high-speed TGV train to Paris!

As with lines the reason Manifold stores coordinates within a geom is for performance.  In this example there are only 23 coordinates but there could be thousands of coordinates in an object.

Consider an area object that represents France in a digital map of the world.

The area object is defined by over 8400 coordinates, which provides a reasonably smooth appearance if we do not zoom too far in.  There are so many coordinates that if we show their locations with magenta dots they merge together when zoomed out.

However, if we zoom into the drawing we can see how the individual coordinates define the boundary of the area object.

Zooming even further in we can see that even over 8400 coordinates are not enough to define the boundary of the area object with a smooth appearance when zoomed far into the drawing. For closer views we would prefer to have the area more finely defined using more coordinates.

By storing coordinates for objects within binary geom storage we can efficiently store very large numbers of coordinates per object for much faster performance than if coordinates were stored in a separate table.

So far we have discussed spatial data that is vector data.   A different class of data we will encounter in spatial data sets are rasters, used to store information about images and similar smoothly-distributed information.

### Raster Data

Raster data - Data that consists of pixels that are arranged in a regular, rectangular array that is a given number of pixels wide and a given number of pixels high. Each pixel contains one or more channels, that is, numeric values. All pixels in the same raster data set will have the same number of channels. For example, all pixels will have just one channel, one number per pixel, for grayscale images or three channels, three numbers per pixel, for RGB images.

Raster data is stored as a table where each record contains a tile full of many pixels, a more efficient approach than storing only one pixel per record.   Raster data can be visualized as images, which depending on the content of the pixels can be virtually any data that is stored as pixels.
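The idea of storing many pixels per record can be illustrated with a small Python sketch that cuts a raster into square tiles, each of which would become one record. This is a simplified illustration, not Manifold's implementation: the function name, the tiny 6 x 4 single-channel raster and the tile size of 2 are all assumptions chosen to keep the output readable.

```python
def split_into_tiles(raster, tile_size):
    """Cut a raster (a list of pixel rows) into square tiles.

    Returns a list of (tile_x, tile_y, tile) records, where tile is itself
    a small list of pixel rows.
    """
    height = len(raster)
    width = len(raster[0])
    tiles = []
    for ty in range(0, height, tile_size):
        for tx in range(0, width, tile_size):
            tile = [row[tx:tx + tile_size] for row in raster[ty:ty + tile_size]]
            tiles.append((tx // tile_size, ty // tile_size, tile))
    return tiles


# A made-up raster 6 pixels wide and 4 pixels high with one channel per pixel.
raster = [[x + 10 * y for x in range(6)] for y in range(4)]

tiles = split_into_tiles(raster, 2)
for tile_x, tile_y, tile in tiles:
    print(tile_x, tile_y, tile)
```

The 24 pixels end up in 6 records of 4 pixels each instead of 24 records of one pixel each. With realistic tile sizes the reduction in the number of records is far larger.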

For example, when pixels contain R, G, B numeric values to specify a color all the pixels visualized together might be a visual image.   If the values in the pixels represent heights as measured by RADAR or LiDAR the information in the raster data might be the heights of terrain in the region.  That could be visualized as a pseudo-image showing the terrain heights as a range of colors, or using a style to provide shading or other effects.

Raster data, such as images, can also be spatial data when the image has been connected to a specific location.  There are various technical means to do that, with Manifold understanding all the typical ways of binding raster data to a location when Manifold imports an image from some other format into Manifold.

For example, the image above shows a satellite photograph showing an overhead view of the In-N-Out restaurant in Allen, Texas, just below and to the right of the road intersection in the center of the image.

The image was acquired from an Internet image server that automatically tagged the image with location data to specify the extent, orientation and coordinate system of the image, thus providing location data for the image overall and the pixels of which it is composed.

Importing the image into Manifold preserved the location data associated with the image so we can use the image as a layer in a map with other spatial data and the image will appear in the correct location on Earth.   For example, the illustration above shows the image in a map as a layer positioned above a layer that shows a map of the region in Allen, Texas.  Note how the streets in the map line up with the streets seen in the image.
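One common way to express the extent and orientation of an image is a six-number affine transform that converts a pixel's column and row into map coordinates. The Python sketch below shows that general technique under stated assumptions: the parameter order `(a, b, c, d, e, f)`, the function name and the sample numbers are illustrative, and nothing here describes how Manifold or any particular file format stores the information.

```python
def pixel_to_map(col, row, transform):
    """Convert a pixel position to map coordinates with an affine transform.

    transform is (a, b, c, d, e, f) such that
        x = a * col + b * row + c
        y = d * col + e * row + f
    """
    a, b, c, d, e, f = transform
    x = a * col + b * row + c
    y = d * col + e * row + f
    return (x, y)


# A hypothetical north-up image: each pixel is 0.5 map units on a side, the
# top left corner sits at (1000, 2000), and Y decreases as rows go down.
north_up = (0.5, 0.0, 1000.0, 0.0, -0.5, 2000.0)

top_left = pixel_to_map(0, 0, north_up)
inside = pixel_to_map(100, 50, north_up)
print(top_left, inside)
```

Without such a transform, and without a coordinate system saying what the map units mean, the pixel at column 100 and row 50 is just a pixel. With them, it is a specific place on Earth.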

Note that if the image did not have location information embedded into it, it would not be spatial data but would just be an image.

For example, the image above shows the In-N-Out restaurant in Allen, Texas, in a street view photograph.  It is a fine photo of the restaurant but if we import it from a format that does not tag it with location information it is just an image that is no more "spatial" than a photograph of our cat.

Both the In-N-Out restaurant image above and the overhead satellite photo of the restaurant location in Allen, Texas, are stored in Manifold as tiles within tables.

The table above stores the tiles for the aerial view image.    See the Example: How Images use Tiles from Tables topic.

The connection to location for the overhead image of the restaurant in Allen, Texas, is in the table's properties, where the coordinate system for the tiles that make up the image is specified.

Manifold understands many different ways of specifying coordinate system information for data.  The image was imported from a format that used XML to encode the coordinate system so that is what Manifold used as seen in the dialog above.

In contrast, the street view image of the restaurant is just a plain JPEG image with no location information attached of any kind.  On import there is no location information to bring in so the properties of the table show no coordinate system info.

Note that if we obliterated the location information for the overhead satellite photo before importing the image into Manifold, for example, by editing the image with a graphics editor that discarded any extra TIFF tags or whatever was used to store the coordinate system and other location info, when we imported the image into Manifold it wouldn't be spatial data.  It would just be a photo of some buildings and streets as seen from above, that for all we know could be anywhere in the world, with no idea of how it should be scaled, rotated or placed on a map.  As a result the image no longer would be any more "spatial" than a photograph of our cat.

For a discussion of geocoding tables and to see the actual location of the above In-N-Out restaurant, see the Example: Street Address Geocoding topic.

## Notes

"Spatial" does not mean it is 3D -   Most "spatial" data is two dimensional data, such as points or figures that could be plotted on a flat map.  Less pretentiously it could be called "map data" or "planar data."   But that would go against human nature, where in technical matters people often like to puff up what they are doing with more technical-sounding terms.   Because the term "spatial data" sounds cooler and more technical GIS people started using it even though very little such data is genuinely "spatial," that is, really has X,Y and Z coordinates to place it within a 3D volume instead of just on a 2D surface as X,Y coordinates do.   The term is becoming more accurate in modern times as more and more spatial data now does have genuine 3D location to it, that is, instead of being just X,Y has X,Y,Z coordinates.

Geographic or Non-Geographic -   Spatial data can be about locations that are geographic locations or which may be non-geographic locations.

• Geographic spatial data - A location on Earth is a geographic location such as the exact position where a geologic sample was harvested, where a restaurant is located or where an airplane or truck is currently located.

• Non-Geographic spatial data - A location might be some non-geographic location like a position within a CAD diagram that shows the complex plumbing of a refinery or a blueprint showing the arrangement of conveyers in a shipping warehouse or the current location of a robotized parts cart as it moves through the warehouse.

Most often the spatial data we work with will be geographic spatial data, but the same tools, such as Manifold, which can handle coordinate systems that are used to specify locations on Earth can also normally handle coordinate systems that specify locations within a blueprint or other CAD diagram.

Showing rates by state - The example showing thematic coloring of states by homicide rate is indeed spatial data but it is very broad and general spatial data. Coloring in an entire state such as California or Texas is not much more than visual shorthand that provides the pro-rata homicide rate for the entire state. Such maps can easily deceive the weak-minded because they give the impression that the homicide rate might be even throughout the entire state. The slightest bit of worldliness would indicate that the homicide rate in Texas, for example, is likely very different pro-rata in a big city than it is far out in the country where no homicides might have been recorded for many years. Much more insightful would be to display homicide rates by census block group or some other region smaller than an entire state.

No attributes in raster data  - Raster data is all about the numeric values in the pixels: it does not contain separate fields for attribute data the way vector data does.   The numeric values for each pixel together with the location of each pixel provide the information content for the data.

Carolingian Hoards - Throughout history people have buried small bags of coins, for example, soldiers hiding their coins before going into battle or residents hiding their valuables in times of trouble. When such hoards were not reclaimed they became a time capsule awaiting rediscovery. Plotting the locations of discovered hoards can easily lead to historical insights that may be much more difficult to see by poring over row and column tables.

The illustration above shows hoard locations from the Carolingian Hoards table with the view zoomed into France.    Each dot shows a hoard.  The colors of the dots range from purple to blue to green to yellow to orange based on the most recent date to which the hoard can be dated.   The purple and blue dots show the earliest hoards dated to 854 AD or earlier.

If we connect those dots using a spanning tree algorithm the lines hint at relationships between those sites that may be worth investigating. For example, do the sites tend to be aligned along transportation routes such as river valleys, do they follow the course of military campaigns or might there be trading relationships along those routes?
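The text does not say which spanning tree algorithm was used. As one plausible illustration, the Python sketch below builds a minimum spanning tree with Prim's algorithm, which connects all points using the shortest total length of lines. The site coordinates are made up, and plain Euclidean distance is used for simplicity, which is only an approximation when the coordinates are longitude and latitude values.

```python
import math


def minimum_spanning_tree(points):
    """Return (i, j) index pairs forming a minimum spanning tree (Prim)."""
    if not points:
        return []
    in_tree = {0}          # start from the first point
    edges = []
    while len(in_tree) < len(points):
        best = None        # (distance, from_index, to_index)
        for i in in_tree:
            for j in range(len(points)):
                if j in in_tree:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges


# Hypothetical site locations, not taken from the Carolingian Hoards data.
sites = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (1.0, 1.0)]

tree = minimum_spanning_tree(sites)
print(tree)
```

A tree over n points always has n - 1 lines, so every site is connected without any closed loops. This simple version compares every pair of points at each step, which is fine for a few hundred hoards but would be slow for very large data sets.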

The data set used as an example is genuine data, downloaded from Harvard's Center for Geographic Analysis, an example of the seemingly endless variety of geospatial data that can be downloaded for free.  The full citation to the data set is:

S. Coupland with B. Maione-Downing, 2013. "Geodatabase of Carolingian Coin Hoards 751-987," DARMC Scholarly Data Series, Data Contribution Series # 2013-4. DARMC, Center for Geographic Analysis, Harvard University, Cambridge MA 02138.

Why Geoms and not More Normal Form?  -  Packing all of the coordinates which define a line into a geom is not the only way we might invent to represent the spatial data which represents a line.   Classic DBMS design might encourage a more normal form with the coordinates that make up a line stored in their own table with each coordinate being a single record.

Consider the above table, which is not how Manifold does it, but which illustrates a concept. Suppose that instead of the geom that contained all of the line's coordinates there was a field that gave a unique LineID number for each line.

Another table could then provide all of the coordinates for all of the lines in the database, with each line being identified by its LineID and some other value giving the order of coordinates in the sequence which defines a particular line. In this thought experiment, to draw the line with a LineID value of 13 we fetch all coordinates for that LineID and draw a line by connecting the dots between the coordinates given from the first CoordNum to the last for that LineID. Remember, this is a thought experiment to teach a concept and is not how Manifold does it.
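The thought experiment can be played out in a few lines of Python. As with the thought experiment itself, this is not how Manifold does it. The `LineID` and `CoordNum` names come from the text above, while the `X` and `Y` field names and the sample rows are assumptions made for illustration.

```python
# Hypothetical coordinates table in "more normal" form: one record per
# coordinate, deliberately stored out of order and mixed with another line.
coordinates = [
    {"LineID": 13, "CoordNum": 2, "X": -155.4699, "Y": 42.7299},
    {"LineID": 7,  "CoordNum": 1, "X": 10.0,      "Y": 20.0},
    {"LineID": 13, "CoordNum": 1, "X": -162.0300, "Y": 41.6199},
    {"LineID": 7,  "CoordNum": 2, "X": 11.0,      "Y": 21.0},
    {"LineID": 13, "CoordNum": 3, "X": -150.9300, "Y": 43.3800},
]


def coords_for_line(coord_table, line_id):
    """Fetch all coordinates for one line, ordered by CoordNum."""
    rows = [r for r in coord_table if r["LineID"] == line_id]
    rows.sort(key=lambda r: r["CoordNum"])
    return [(r["X"], r["Y"]) for r in rows]


line_13 = coords_for_line(coordinates, 13)
print(line_13)
```

Every line drawn requires a filter over the coordinates table plus a sort to restore the order, which hints at why this design becomes expensive when there are millions of lines.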

So if that is the logical way a DBMS designer might choose to ensure more normal form of spatial data used to construct lines, why does Manifold pack all the coordinates for a line into a single, binary geom value?  That is done primarily for performance.

Consider a drawing that shows the road system of North America using millions of lines.   Each one of those lines could have thousands of coordinates to define it.   When manipulating millions of objects, either for analysis or for rendering in a drawing, it is simply more efficient to fetch one record for each line from a table where the geometry for that line is stored within the line's record as an ordered blob of binary data in the form of a geom than it is to fetch millions of records, one for each line, and to then in addition for each of those millions of records fetch yet hundreds or thousands more records from a different table, one for each coordinate that defines the line.  It is usually much quicker to fetch a smaller number of "fat"  records than it is to fetch a much larger number of small records.
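The alternative of packing an ordered list of coordinates into a single binary value can be sketched with Python's standard `struct` module. The layout used here, a 4-byte count followed by pairs of 8-byte floating point numbers, is invented for this illustration and is not Manifold's geom format.

```python
import struct


def pack_coords(coords):
    """Pack coordinates into one blob: a count, then X,Y pairs as doubles."""
    blob = struct.pack("<I", len(coords))
    for x, y in coords:
        blob += struct.pack("<dd", x, y)
    return blob


def unpack_coords(blob):
    """Recover the ordered coordinate list from a blob made by pack_coords."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<dd", blob, 4 + 16 * i) for i in range(count)]


line = [(-162.0300, 41.6199), (-155.4699, 42.7299), (-150.9300, 43.3800)]

blob = pack_coords(line)
print(len(blob), "bytes in one field of one record")
print(unpack_coords(blob))
```

The order of the coordinates is implicit in their position within the blob, so no separate sequence number is needed, and fetching one record delivers everything required to draw the line.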

That is the same reason why Manifold, like so many other systems that can work with big images, stores images in tables using tiles.

Consider the table above that shows the data for an image as tiles, where each record contains a tile for a particular intermediate level with each tile placed in an X, Y location relative to other tiles.  As we can see from the table, the tiles are 128 pixels wide by 128 pixels high for a total of 16384 pixels per tile.

If we wanted to be fanatic about normal form, instead of storing our image in the form of records where each record has a blob of 16384 pixels we could store our image as a table where each record was a single pixel. That would give the table over sixteen thousand times more records and require over sixteen thousand times more fetches, which would end up being far slower and less efficient than fetching many pixels at once for each record.

Tech Note: With four eight bit integers per pixel (an integer for each of three R, G and B channels plus one more integer for an alpha transparency channel)  the example above uses 65536 bytes of memory per tile.   That is eight  times more memory required for each tile than the total system memory (8KB) with which most original Apple II computers shipped.
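The arithmetic in the paragraphs above is easy to verify. The short Python sketch below recomputes the figures from the tile size and pixel layout stated in the text. The 4096 x 4096 image used for the record counts is a hypothetical example, not a figure from this topic.

```python
import math

TILE_WIDTH = 128        # pixels, from the text
TILE_HEIGHT = 128       # pixels, from the text
CHANNELS = 4            # R, G, B and alpha
BYTES_PER_CHANNEL = 1   # eight bit integers

pixels_per_tile = TILE_WIDTH * TILE_HEIGHT
bytes_per_tile = pixels_per_tile * CHANNELS * BYTES_PER_CHANNEL
ratio_to_8kb = bytes_per_tile // (8 * 1024)

# Hypothetical image size, chosen only to compare record counts.
image_width, image_height = 4096, 4096
tile_records = (math.ceil(image_width / TILE_WIDTH)
                * math.ceil(image_height / TILE_HEIGHT))
pixel_records = image_width * image_height

print(pixels_per_tile, bytes_per_tile, ratio_to_8kb)
print(tile_records, pixel_records, pixel_records // tile_records)
```

For the hypothetical image the tiled table needs 1024 records while a one-pixel-per-record table would need 16,777,216, a factor of 16384, matching the "over sixteen thousand times" figure above.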