Shapefiles: Floating point numbers can contain rounding errors because they are stored as text.

Nico9
New Contributor II
09-26-2023 03:51 AM

Hello,

I found an old article about problems with shapefiles/ArcView and would like to know whether the problem still exists after ArcView, in ArcMap and ArcGIS Pro, or whether it was a problem specific to ArcView and its handling of shapefiles.

Automatic translation from http://www.wlm.at/Arc4You/A4clean/Poly%20Error.htm


Due to the inaccuracies of floating-point numbers, there must be a tolerance below which points, line or polygon segments are recognized as identical and removed during clean-up. This is the so-called fuzzy tolerance. At first glance, ArcView lacks such a setting. In reality, it is applied internally whenever a line or polygon is processed. This accuracy cannot be set, however; it is calculated variably by ArcView as roughly 10⁻⁶ to 10⁻¹² of the extent of the shape in the X and Y directions. Countless different tolerance limits are therefore applied within a single shapefile. Fine structures between neighboring areas are removed in the larger polygon but remain in the smaller one, resulting in tiny overlaps or gaps. This occurs very often when polygons are split into halves of different sizes. Since ArcView removes all vertices whose distance to each other is below the tolerance limit, single vertices on "straight" sections are often missing. The outlines of two adjacent polygons are then only apparently identical.
The resulting tiny gaps and overlaps are usually not visible and cannot be detected by "normal" means in ArcView, since every comparison or intersection between polygons is itself subject to the internal fuzzy tolerance. However, as the polygons are further cut or reduced in size, the fuzzy tolerance also becomes finer, and overlaps or gaps that were previously not "present" may appear in places that were not processed at all. We refer to these errors as fuzzy vertices, which cause hidden overlaps and gaps.
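To make the failure mode described above concrete, here is a minimal Python sketch (no ArcGIS involved) of an extent-relative vertex tolerance. The 1e-6-of-extent factor and the thinning rule are illustrative assumptions, not the actual ArcView algorithm: a vertex on the shared edge survives in the small polygon but is dropped from the large one.

```python
import math

def extent(poly):
    """Larger of the X and Y span of a polygon's vertices."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return max(max(xs) - min(xs), max(ys) - min(ys))

def thin(poly, factor=1e-6):
    """Drop vertices closer to the previously kept vertex than factor * extent."""
    tol = factor * extent(poly)
    kept = [poly[0]]
    for p in poly[1:]:
        if math.dist(p, kept[-1]) >= tol:
            kept.append(p)
    return kept

# Two polygons share an edge that contains a vertex only 0.5 units from a corner.
shared = [(0, 0), (0.5, 0), (1000, 0)]
big = shared + [(1000, 1_000_000), (0, 1_000_000)]   # extent 1,000,000 -> tolerance 1.0
small = shared + [(1000, -10), (0, -10)]             # extent 1,000     -> tolerance 0.001

print(len(thin(big)), len(thin(small)))  # 4 5: the big polygon lost the shared vertex
```

The shared boundary is now only apparently identical in the two results, which is exactly the "fuzzy vertices" effect the article describes.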

 

DuncanHornby
MVP Notable Contributor

Hard to say, you are referencing software that was made obsolete more than 20 years ago! How ArcGIS Pro reconciles precision may be quite different from software built two decades ago, especially as data formats have matured since then, such as the file geodatabase. I would always treat shapefiles as a last resort and use file geodatabases instead, as they are superior in functionality and storage capacity.

I would imagine the article is still relevant to shapefiles, a format that is now over 20 years old.

Nico9
New Contributor II

Hello,

Unfortunately, shapefiles are still widely used, even though there are better alternatives. In ArcGIS Pro, shapefiles can still be used without restriction, so my question is still relevant: does the described problem exist in ArcGIS Pro as well, or was it a problem specific to how ArcView handled shapefiles?

DuncanHornby
MVP Notable Contributor

This page discusses how ArcPro deals with precision and tolerance:

https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/the-properties-of-a-spatial...

Without talking to the developers, that's about as much as you are going to find published.

Nico9
New Contributor II

Thank you, I will take a look at the site.

curtvprice
MVP Esteemed Contributor

The issue is not with the software but with the shapefile format - which has not changed.

Shapefiles do not contain an x,y tolerance as do geodatabase feature classes.  x,y tolerance is the minimum distance between coordinates before they are considered equal. This x,y tolerance is used when evaluating relationships between features within the same feature class or between several feature classes. It is also used extensively when editing features. When using any operation that involves the comparison of features, such as tools in the Overlay toolset, the Clip tool, the Select Layer By Location tool, or any tool that takes two or more feature classes as input, use geodatabase feature classes (which have an x,y tolerance) rather than shapefiles.

I would add that if you do overlay shapefiles, it is a good idea to set the XY Tolerance environment to avoid unnecessary issues. I believe shapefile coordinate precision is only 32 bits, and since they are floating point (not integer like geodatabases), these are approximations so the "fuzzy creep" your translated quote alludes to can be a real issue over multiple overlay operations.
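As a hedged illustration of that advice, the sketch below sets the XY Tolerance geoprocessing environment in arcpy before an overlay; the paths and the 0.001 m value are placeholders, not recommendations for any particular dataset.

```python
import arcpy

# Explicit XY Tolerance so the overlay does not fall back to an extent-derived default.
arcpy.env.XYTolerance = "0.001 Meters"

# Overlay two shapefiles; write the result to a file geodatabase rather than another shapefile.
arcpy.analysis.Intersect(
    [r"C:\data\parcels.shp", r"C:\data\flood_zones.shp"],
    r"C:\data\results.gdb\parcels_flood",
)
```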

Your post title is actually about something else that is also important to know, concerning shapefile table (.dbf format) fields:

Unlike other formats, shapefiles store numeric attributes in character format rather than binary format. For real numbers (that is, numbers containing decimal places), this may lead to rounding errors. This limitation does not apply to shape coordinates, only attributes. 

ArcGIS Pro Help: Geoprocessing considerations for shapefile output
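A quick way to see this text storage for yourself is to read a record straight out of the .dbf sidecar. The sketch below assumes a shapefile table named parcels.dbf (a placeholder) and relies only on the documented dBase header layout.

```python
import struct

with open("parcels.dbf", "rb") as f:
    header = f.read(32)
    # Bytes 8-9: header size, bytes 10-11: record size (little-endian uint16).
    header_size, record_size = struct.unpack("<HH", header[8:12])
    f.seek(header_size)            # jump past the header and field descriptors
    record = f.read(record_size)   # first record: 1 deletion-flag byte + field values

# Every attribute, including "double" fields, shows up as right-justified ASCII text.
print(record[1:].decode("ascii", errors="replace"))
```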

 

VinceAngelo
Esri Esteemed Contributor

Shapefiles use 8-byte (64-bit) IEEE storage for all coordinate values. The "32-bit" issue was in how what are now called BASIC spatial references handled coordinate processing. While geodatabases store floating-point coordinates in a compressed bytestream that uses integers, the coordinates can be expressed as 8-byte (double precision) floats at any time. "Fuzzy creep" wasn't really a thing back in ArcInfo days (you'd have to misuse the software, changing the tolerance with each processing step, to make it happen), and it is less of a thing now (because tolerances are fixed by invariant spatial references).
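For anyone who wants to verify the 8-byte storage directly, the main-file header keeps the bounding box as four little-endian IEEE doubles at byte offsets 36-67; parcels.shp below is just a placeholder path.

```python
import struct

with open("parcels.shp", "rb") as f:
    header = f.read(100)  # fixed 100-byte shapefile header

# Xmin, Ymin, Xmax, Ymax stored as 8-byte (double precision) IEEE floats.
xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])
print(xmin, ymin, xmax, ymax)
```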

- V

AyanPalit
Esri Regular Contributor (Accepted Solution)

@Nico9 It will help to know the data types for your use case and what the data maintenance process is. I concur with the comments from @DuncanHornby and @curtvprice on shapefiles; I am also unsure about the source cited: http://www.wlm.at

A best practice is to model schema objects (feature classes, tables, other classes) in a geodatabase format, which allows standardization of attribute fields and data types and stipulates precision, scale and other specifications. All robust GIS databases that act as an authoritative repository/source of truth should be modeled this way. Legacy formats like shapefiles may be used for export/import in workflows or third-party solutions that are limited to that consumption format. Increasingly, modern solutions and architectures use services or SQL ETL for data exchange. Most vendors and data/service providers have also moved to file geodatabase formats that preserve data types, field names, etc.

In summary, if the requirement is high-precision, high-accuracy data with preserved fidelity, develop a process that will not downgrade the data. The shapefile format inherently has some known limitations.
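As a sketch of that modelling approach (the names, coordinate system and field sizes are illustrative, and arcpy in ArcGIS Pro is assumed), a schema can be declared up front in a geodatabase instead of being inherited from a shapefile export:

```python
import arcpy

gdb = arcpy.management.CreateFileGDB(r"C:\data", "authoritative.gdb").getOutput(0)
fc = arcpy.management.CreateFeatureclass(
    gdb, "Parcels", "POLYGON",
    spatial_reference=arcpy.SpatialReference(25832),  # example EPSG code
).getOutput(0)

# Field types are fixed in the schema; the precision/scale arguments are accepted
# here but are only enforced in enterprise geodatabases.
arcpy.management.AddField(fc, "ParcelID", "LONG")
arcpy.management.AddField(fc, "AreaHa", "DOUBLE", field_precision=12, field_scale=4)
arcpy.management.AddField(fc, "Owner", "TEXT", field_length=100)
```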

Ayan Palit | Principal Consultant Esri
VinceAngelo
Esri Esteemed Contributor

There are a bunch of issues in the referenced article, either wrongly conflated or just plain wrong.

  • IEEE floating-point storage is not "inaccurate" -- Floating-point operations are way more accurate than geodata
  • It's not the accuracy (or precision) of floating-point representation which makes the fuzzy tolerance necessary, but the inherent accuracy of freely collected vertices
  • ArcMap (and all of ArcGIS Desktop and Server products) has always had access to tolerances in coincident coordinate calculation
  • The shapefile format was independent of ArcView, and it does not have any of the quirks of how ArcView (which is ancient, and retired) calculated coordinate values.
  • While there may be gaps between geometries in shapefiles (or in file or enterprise geodatabases), there are tools to detect and repair them if they are significant enough to bother with (see the sketch after this list)
  • The title of this thread is misleading, and refers not to coordinates, but dBase values; while dBase does have both binary and textual storage options for floating-point values, the scientific notation used in the text values is either at, or exceeds the stated precision of the field (and often of the IEEE format itself)
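A hedged sketch of one way to surface such gaps and overlaps with standard geoprocessing tools (the workspace and feature class names are placeholders; a geodatabase topology with "Must Not Have Gaps"/"Must Not Overlap" rules is the more formal route):

```python
import arcpy

arcpy.env.workspace = r"C:\data\demo.gdb"   # hypothetical workspace
fc = "parcels"

# Overlaps: intersecting a polygon class with itself returns only the areas
# where its own features overlap.
arcpy.analysis.Intersect([fc], "parcels_overlaps")

# Gaps: Union with gaps="NO_GAPS" turns enclosed gaps into features whose
# FID_<input> attribute is -1, so they can be counted or inspected.
arcpy.analysis.Union([fc], "parcels_union", "ALL", None, "NO_GAPS")
gap_count = sum(1 for _ in arcpy.da.SearchCursor("parcels_union", ["OID@"], "FID_parcels = -1"))
print(gap_count, "gap features found")
```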

So, basically, the entire article was wrong then, and hasn't gotten any better in the decades since.

Since understanding how coordinate values are maintained is important, I refer you to the Understanding Coordinate Management in the Geodatabase whitepaper.

- V

DuncanHornby
MVP Notable Contributor

That's a really useful white paper, some quality bed time reading for me! Thanks.
