Spatial Data (GIS)¶
Introduction¶
This section describes the process of submitting spatial (e.g., GIS) data to a data portal.
Terminology¶
GIS data represent positional information on the Earth’s surface and can come in the form of vector polygons or raster surfaces. GIS data typically hosted in Axiom data portals include maps or distribution models of habitat, species distribution, hydrography, land parcels, and more. Species distribution data are derived from a set of directly- or remotely-sensed location observations that have been post-processed into some sort of summary or aggregate product.
Note
Example: the Benthic Biomass Relative Biomass Index consists of vector contours representing the relative biomass of benthic invertebrates across the Bering, Beaufort, and Chukchi Seas. This dataset was derived from survey locations that were post-processed into rasters by kernel density estimation techniques and subsequently summarized as density contours.
If your data have not undergone any of these or similar post-processing analyses, they are likely more accurately described as survey locations or trajectories, for which we have separate data submission guidelines we suggest you follow.
General Guidelines¶
Use community standards. Open data formats such as CSVs and Shapefiles are always preferable to closed or proprietary formats like Excel spreadsheets or ArcGIS geodatabases because they are more likely to be accessible to the broader community both now and in the future.
Be consistent. Strive to use the same file formats and variable names across files within a dataset (e.g. different distributions in a year) and, as often as is possible, across datasets. When we plan to visualize your data, we use scripts to ingest it, so any inconsistencies will require manual intervention and lead to delays.
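As a rough illustration of the consistency check described above, the following sketch compares the header row of several CSV files and reports any that differ from the first one. The function name `header_mismatches` and the sample file names are hypothetical, and this is not the portal's actual ingestion script.

```python
import csv
import io


def header_mismatches(csv_texts):
    """Return {file name: header} for files whose header differs from the first file's.

    csv_texts maps a file name to that file's text content (hypothetical input shape).
    """
    headers = {
        name: next(csv.reader(io.StringIO(text)), [])
        for name, text in csv_texts.items()
    }
    names = list(headers)
    reference = headers[names[0]] if names else []
    return {name: cols for name, cols in headers.items() if cols != reference}


# Two files from the same dataset that should share column names.
files = {
    "distribution_2010.csv": "species,lat,lon\nhopu,57.1,-170.2\n",
    "distribution_2011.csv": "species,latitude,lon\nhopu,57.3,-169.8\n",
}
print(header_mismatches(files))
```

Running a check like this before submission makes it easy to spot a renamed column (here `latitude` instead of `lat`) that would otherwise require manual intervention.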
Data Submission Guidelines¶
Describing¶
When submitting GIS data for visualization and/or archive, there are a number of descriptors that can aid in the ingestion and reuse of those data. Depending on the nature of your dataset, these descriptors might be incorporated into filenames, column headers, field values, or metadata records, or left out entirely if not applicable. These possible descriptors include:
Spatial reference: Spatial reference information should identify both the datum and projection used, if any. Depending on the data format, this might be included in the PRJ file along with a Shapefile, or specified in the filename, fieldnames, or metadata record associated with tabular data.
Time period: If this dataset represents observations taken over a period of time, the best practice is to include an indication of the season or time range covered by the dataset.
Method of summarization: Some indication of the methods used to generate the data should be embedded in the filename and/or fieldnames to help make your datasets easier to interpret.
Species identifier: If applicable, identifying species in filenames, fieldnames, or values is important, especially when a single dataset comprises multiple species. In all cases, use the species identification conventions specific to your field, where they exist.
Common Mistakes We Encounter¶
A variety of things can go wrong in the process of creating and sharing a dataset. To help you avoid setbacks during ingestion and visualization, here are some common mistakes we encounter:
Shapefiles where not all the necessary files are included
Data with no location information or invalid geometries
Data incorrectly formatted around the Antimeridian
Polygons that aren’t closed
Polygons that are self-intersecting
Missing world files
- Projection issues:
Wrong or no projection defined
Using EPSG:4326 where longitude is [0, 360]
Dataset using spherical mercator but should use traditional mercator
Defining points along or near singularities
- Defining curves
A “line” element connecting two longitude and latitude points is ambiguous without a projection or some additional metadata describing how lines should be interpolated between the points
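Several of the mistakes listed above can be caught with simple checks before submission. The sketch below uses only the Python standard library and covers three of them: missing Shapefile component files, longitudes expressed in [0, 360], and unclosed polygon rings. The function names are hypothetical, and the checks are illustrative rather than a description of how the portal validates data.

```python
from pathlib import Path

# Component files a Shapefile needs: .shp, .shx and .dbf to be valid,
# plus .prj to carry the spatial reference.
REQUIRED_EXTS = (".shp", ".shx", ".dbf", ".prj")


def missing_shapefile_parts(filenames, stem):
    """Return the required extensions that are absent for the given Shapefile stem."""
    present = {
        Path(name).suffix.lower()
        for name in filenames
        if Path(name).stem == stem
    }
    return [ext for ext in REQUIRED_EXTS if ext not in present]


def wrap_longitude(lon):
    """Convert a longitude given in [0, 360] to the [-180, 180) range."""
    return ((lon + 180.0) % 360.0) - 180.0


def ring_is_closed(ring):
    """A polygon ring is closed when it has at least 4 vertices and first == last."""
    return len(ring) >= 4 and tuple(ring[0]) == tuple(ring[-1])


print(missing_shapefile_parts(["ranges.shp", "ranges.dbf"], "ranges"))
print(wrap_longitude(190.0))
print(ring_is_closed([(0, 0), (1, 0), (1, 1), (0, 0)]))
```

Note that wrapping longitudes fixes the coordinate range only; geometries that cross the Antimeridian may still need to be split into separate parts on either side of it.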
Formatting¶
There are a wide variety of data formats capable of representing vector or raster species distributions. Below is a list of the formats we prefer to work with, including links to more detailed documentation for each format. If your data are not in one of these formats we will still likely be able to work with them, but we ask that you attempt to convert them to one of the following formats:
- Vector
GeoPackage: GeoPackages are an open, standards-based format for transferring geospatial information in one file. They are openly described and can be read by any geospatial software package. They are preferred to shapefiles since there are fewer opportunities for shared data to be corrupted. They also allow for multiple geometry types to be included.
Shapefile: ESRI Shapefiles are standard across many scientific fields. They are also openly described and can be read by any geospatial software package. The drawback to shapefiles is that they are not a single file, which can lead to corruption of a dataset if not handled carefully. To be valid, shapefiles must include a SHP, SHX, and DBF file, and to be useful they also need to include a PRJ.
CSV/TSV: CSVs and TSVs are not specifically spatial data formats, but can include spatial fields, the simplest of which might be latitude and longitude for a dataset of points. A better option is the Well-Known Text (WKT) format for spatial objects, which can accommodate points, lines, and polygons. Plain WKT geometries do not carry a spatial reference identifier, so state the spatial reference alongside them (for example in the filename, a field, or the metadata record).
GeoJSON: GeoJSON is an extension of the text-based JSON data format. GeoJSON can handle all vector types so long as they are represented by geographic coordinates in the WGS84 datum (EPSG code 4326).
- Raster
NetCDF: NetCDF files are the preferred format and the backbone of archival data storage at both Axiom and NCEI. They are machine-independent and flexible, can store data and dimensions in any combination, and store metadata as global and/or variable attributes within the file itself (i.e., they are self-describing).
GeoTiff: GeoTiff is a common format for regularly-gridded (i.e., raster) data. This format is simple to read and interpret and commonly includes metadata such as projection information, spatial resolution, pixel-size, etc., but other metadata attributes such as source, owner, publisher, history, parameter names, etc. are often not included.
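To make the CSV option above more concrete, here is a minimal sketch that writes a polygon as Well-Known Text into a CSV using only the Python standard library. The column names (`population`, `epsg`, `wkt`), the helper names, and the coordinates are all made up for illustration; recording the EPSG code in its own column is one assumed convention, not a portal requirement.

```python
import csv
import io


def polygon_wkt(ring):
    """Format a ring of (lon, lat) pairs as a WKT POLYGON, closing it if needed."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT polygon rings must end where they start
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({coords}))"


def write_wkt_csv(rows, fieldnames):
    """Serialize rows (dicts) to CSV text; the csv module quotes WKT fields containing commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical triangle in geographic coordinates (longitude, latitude).
ring = [(-170.0, 56.0), (-168.0, 56.0), (-168.0, 58.0)]
rows = [{"population": "example_a", "epsg": 4326, "wkt": polygon_wkt(ring)}]
csv_text = write_wkt_csv(rows, ["population", "epsg", "wkt"])
print(csv_text)
```

Because WKT uses commas between vertices, it is worth letting a CSV library handle the quoting rather than joining fields by hand.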
Examples¶
Sea lion MCP home range¶
As a vector example, take a dataset derived from ARGOS locations taken from multiple distinct populations of tagged sea lions across multiple seasons. This dataset was subsequently summarized using a minimum convex polygon analysis into a vector home range estimate. Ideally this dataset would be described and formatted in the following manner:
Formatted as a Shapefile with PRJ file containing spatial reference information.
Descriptive filename, for example sealion_mcp_95_summer_2009_2011, indicating the species, the method used (MCP) and the percentage of locations included based on outlier status (95%), the season and time range.
A population field with values indicating the subpopulation of each home range polygon.
Horned puffin KDE relative abundance¶
As a raster example, take a dataset derived from direct survey observations collected over a ten-year period from ocean-going vessels. This dataset was subsequently summarized as a relative abundance surface using kernel density estimation methods. Ideally this dataset would be described and formatted in the following manner:
Formatted as a GeoTiff with internal spatial reference definition describing datum, projection, extent, and cell size
Descriptive filename, for example hopu_kde_01_2001_2011, indicating the standard species code, the method of analysis (KDE) and a smoothing bandwidth of 0.1, and the date range.
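The descriptive filenames in both examples follow the same pattern: species, method, method parameter, optional season, and year range, joined by underscores. A small helper like the one below can keep that pattern consistent across files. The function `build_filename` and its argument names are hypothetical, and the field order is simply inferred from the two example filenames above.

```python
def build_filename(species, method, parameter, start_year, end_year, season=None):
    """Join dataset descriptors into a lowercase, underscore-separated file stem."""
    parts = [species, method, parameter]
    if season:
        parts.append(season)  # season is optional, as in the puffin example
    parts += [start_year, end_year]
    return "_".join(str(part).lower() for part in parts)


# Reproduces the two example filenames from this section.
print(build_filename("sealion", "MCP", "95", 2009, 2011, season="summer"))
print(build_filename("hopu", "KDE", "01", 2001, 2011))
```

Passing the parameter as a string keeps leading zeros intact, which matters for values such as the `01` bandwidth code.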
Submitting Data¶
Once your data are well-described and in an appropriate format, you may submit up to 1 TB of files using the following form, or upload your files to the Research Workspace if you already have an account: