
Running a script once preserves the input parameters and outputs in Python's namespace, allowing you to check their state in the Interactive Window (IW) at any time.  Often I save the contents of the Interactive Window to document a particular case or a change in the state of one of the inputs.  I was getting bored writing multiple print statements to try to remember what the inputs and outputs were.  Moreover, I had already documented all this in the script's header.

 

I decided to combine the best of both worlds:  1)  reduce the number of print statements;   2)  retrieve the header information so I could check the namespace and outputs in the IW, which I could then save and/or print.
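Stripped to its essentials, the pattern is just: print the module docstring, then pull selected names out of locals().  A minimal sketch (with a hypothetical variable or two, not the full demo script):

```python
"""Demo: document a run by printing the docstring plus selected variables."""
import numpy as np

x = [1.0, 2.0, 3.0, 5.0, 8.0]   # input values
x_m = np.mean(x)                # their mean

if __name__ == "__main__":
    print("-" * 70)
    print(__doc__)              # the header documents the namespace
    # list the names in execution order, then pull each from locals()
    names = ["x", "x_m"]
    for name in names:
        print("{!s:<6}.... {!s}".format(name, locals()[name]))
```

At module level, locals() is the same dictionary as globals(), so each name added by the script can be looked up by its string.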

 

The following are the results of a demo script's output, which includes the input namespace, the meaning of each variable, and the results for a run.  The actual script is not relevant but I have included it anyway as an attachment.  The example here is the result from PythonWin's IW.  I did take a huge chunk out of the output to facilitate reading.

----------------------------------------------------------------------

:Script:   matrix_covariance_demo.py
:Author:   Dan.Patterson@carleton.ca
:Modified: 2016-10-25
:
:Notes:
:  Namespace....
:  x, y       x,y values
:  xy_s       X,Y values zipped together forming a column list
:  x_m, y_m   means of X and Y
:  x_t, y_t   X and Y converted to arrays and translated to form row arrays
:  s_x, s_y   sample std. deviations
:  v_x, v_y   sample variances
:  cov_m      numpy covariance matrix, sample treatment, see docs re: ddof
:  Exy        sum of the X_t,Y_t products
:  cv_alt     alternate method of calculating "cov_m" in terms of var. etc
:
:  Usage....
:  Create a list of key values in execution order, print using locals()
:  Syntax follows...
:  names = ["x","y","xy_s","x_m","y_m","xy_t","x_t","y_t",
:           "s_x","s_y","v_x","v_y","cov_m","Exy","n","cv_alt"]
:  for name in names:
:      print("{!s:<8}:\n {!s: <60}".format(name, locals()[name]))
:
:References
:  http://en.wikipedia.org/wiki/Pearson_product-moment_
:       correlation_coefficient
:  http://stackoverflow.com/questions/932818/retrieving-a-variables-
:       name-in-python-at-runtime
:

Results listed in order of execution:
x     .... [1.0, 2.0, 3.0, 5.0, 8.0]
y     .... [0.11, 0.12, 0.13, 0.15, 0.18]
xy_s  .... [(1.0, 0.11), (2.0, 0.12), (3.0, 0.13), (5.0, 0.15), (8.0, 0.18)]
x_m   .... 3.8
y_m   .... 0.138
x_t   .... [-2.800 -1.800 -0.800  1.200  4.200]
y_t   .... [-0.028 -0.018 -0.008  0.012  0.042]
s_x   .... 2.7748873851
s_y   .... 0.027748873851
v_x   .... 7.7
v_y   .... 0.00077
cov_m .... [[ 7.700  0.077], [ 0.077  0.001]]
Exy   .... 0.308
n     .... 4
cv_alt.... [[ 7.700  0.077], [ 0.077  0.001]]

 

Now the code.... I have left out the script's docstring since I always like to keep a copy in the output so I don't forget what I used to produce it.  The only really important parts are the list of names in the main part of the script and the lines in the __main__ section that process the locals() dictionary.

 

.... SNIP ....

import sys
import numpy as np
from numpy.linalg import linalg as nla
from textwrap import dedent

ft = {"bool": lambda x: repr(x.astype("int32")),
      "float": "{: 0.3f}".format}
np.set_printoptions(edgeitems=10, linewidth=80, precision=2,
                    suppress=True, threshold=100,
                    formatter=ft)
script = sys.argv[0]
#
# .... Variables and calculations ....
x = [1.0, 2.0, 3.0, 5.0, 8.0]            # x values
y = [0.11, 0.12, 0.13, 0.15, 0.18]       # y values
xy_s = list(zip(x, y))                   # stack together
x_m, y_m = np.mean(xy_s, axis=0)         # get the means
xy_t = np.array(xy_s) - [x_m, y_m]       # convert to an array and translate
x_t, y_t = xy_t.T                        # x,y coordinates, transposed array
s_x, s_y = np.std(xy_t, axis=0, ddof=1)  # sample std. deviations
v_x, v_y = np.var(xy_t, axis=0, ddof=1)  # sample variances
cov_m = np.cov(x_t, y_t, ddof=1)         # covariance matrix
#
# .... alternate expressions of the covariance matrix ....
Exy = np.sum(np.prod(xy_t, axis=1))  # sum of the X_t,Y_t products
n = len(x_t) - 1
cv_alt = np.array([[v_x, Exy/n], [Exy/n, v_y]])

# create a list of key names in execution order; each is retrieved via locals()[name]
names = ["x", "y", "xy_s",
         "x_m", "y_m", "x_t", "y_t",
         "s_x", "s_y", "v_x", "v_y",
         "cov_m", "Exy", "n", "cv_alt"]

#-------------------------
if __name__ == "__main__":
    print("\n{}\n{})".format("-"*70, __doc__))
    print("\nResults listed in order of execution:")
    for name in names:
        args = [name, str(locals()[name]).replace("\n", ",")]
        print("{!s:<6}.... {!s:}".format(*args))

 

 

Hope you find something useful in this.  Send comments via email.

By any other name ... the questions are all the same. They only differ by whether you want the result or its opposite.

The generic questions can be looked at from the traditional perspectives of

  • what is the question,
  • what is the object in the question and
  • what are the object properties.

 

What is the same?
  • geometry
    • points
      • X, Y, Z, M and or ID values
    • lines
      • the above plus
        • length
        • angle/direction total or by part
        • number of points (density per unit)
        • parts
        • open/closed circuit
    • polygons
      • the above plus
        • perimeter (length)
        • number of points
        • parts
        • holes?
  • attributes
    • numbers
      • floating point (single or double precision)
      • integer (long or short)
      • boolean (True or False and other representations)
    • text/string
      • matching
      • contains
      • pattern (order, repetition etcetera)
      • case (upper, lower, proper, and other forms)
    • date-time
What to do with them?
  • find them
    • everything...exact duplicates in every regard
    • just a few attributes
    • just the geometry
    • the geometry and the attributes
  • copy them
    • to a new file of the same type
    • to append to an existing file
    • to export to a different file format
  • delete them
    • of course... after backup
    • just the ones that weren't found (aka... the switch)
  • change them
    • alter properties
      • geometric changes
      • positional changes
      • representation change

 

Let's start with a small point data file brought in from ArcMap using the FeatureClassToNumPyArray tool.

Four fields were brought in: the Shape field (as X and Y values), an integer Group field and a Text field.  The data types for each field are indicated in the dtype line.   The details of data types have been documented in other documents in this series.

 

>>> arr
array([(6.0, 0.0, 4, 'a'), (7.0, 9.0, 2, 'c'),
       (8.0, 6.0, 1, 'b'), (3.0, 2.0, 4, 'a'),
       (6.0, 0.0, 4, 'a'), (2.0, 5.0, 2, 'b'),
       (3.0, 2.0, 4, 'a'), (8.0, 6.0, 1, 'b'),
       (7.0, 9.0, 2, 'c'), (6.0, 0.0, 4, 'a')],
      dtype=[('X', '<f8'), ('Y', '<f8'), ('Group', '<i4'), ('Text', 'S5')])
>>> arr.shape
(10,)

 

In summary:

  • the X and Y fields are 64-bit floating point numbers (denoted by: <f8 or float64)
  • the Group field is a 32-bit integer field (denoted by: <i4 or int32)
  • the Text field is just that... a field of string data, 5 characters wide (S5).

 

Is any of this important?  Well yes... look at the array above.  The shape indicates it has 10 rows but no columns??  Not quite what you were expecting, and it appears all jumbled, not nicely organized like a table in ArcMap or in a spreadsheet.  The array is a structured array, a subclass of the multidimensional array class, the ndarray.  The data types in structured arrays can be mixed, whereas NumPy works best when the data are all of one data type, like those in the parent class.

 

Data in an array will be cast to a common type if even one element belongs to a higher data type.  Consider the following examples, which illustrate this behaviour.
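A minimal sketch of the kind of casting involved (the values here are made up for illustration):

```python
import numpy as np

a1 = np.array([1, 2, 3, 4, 5])        # all integers........ stays integer
a2 = np.array([1, 2, 3.0, 4, 5])      # one float........... all upcast to float
a3 = np.array([1, 2, 'three', 4, 5])  # one string.......... all become strings
a4 = np.array([True, False, 1])       # bool with integer... upcast to integer

print(a1.dtype, a2.dtype, a3.dtype, a4.dtype)
```

a1 comes out as an integer dtype, a2 as float64, a3 as a string dtype and a4 as an integer dtype (exact widths depend on platform and Python version).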

 

The arrays have been cast into a data type which accommodates all elements.  For example, the 2nd array contained a single floating point number and 4 integers, so everything was upcast to floating point.  The 3rd example cast the integers to strings, and in the 4th example, True was upcast to integer since bool has a base class of int, which is why True/False is often represented by 1/0.

 

>>> type(True).__base__
<type 'int'>

 

The following code will be used for further discussion.

 

def get_unique(arr, by_flds=[]):
    """Produce unique records in an array, controlled by a list of fields.
    Input:   An array, and a list of fields used to assess uniqueness.
        All fields:  use [] for all fields.
        Remove one:  all_flds = list(arr_0.dtype.names)
                     all_flds.remove('ID')
        Some fields: by name(s): arr[['X','Y']]  # notice the list inside the slice
                     or by slice: all_flds[:2], all_flds[2:], all_flds[:-2],
                                  i.e. any [start:stop:step] slice
    Returns: Unique array of sorted conditions.
             The indices where a unique condition is first encountered.
             The original array sliced with the sorted indices.
    Duh's:   Do not forget to exclude an index field, or fields where all values
             are unique, thereby ensuring each record will be unique... and you
             will fail miserably.
    """
    a = arr.view(np.recarray)
    if by_flds:
        a = a[by_flds].copy()
    if arr.ndim == 1:  # ergo... arr.size == arr.shape[0]
        uniq, idx = np.unique(a, return_index=True)
        uniq = uniq.view(arr.dtype).reshape(-1, 1)  # purely for print purposes
    else:
        uniq, idx = np.unique(arr.view(a.dtype.descr * arr.shape[1]),
                              return_index=True)
        uniq = uniq.view(arr.dtype).reshape(-1, arr.shape[1])
    arr_u = arr[np.sort(idx)]
    return uniq, idx, arr_u

if __name__=="__main__":
    """Sample data section and runs...see headers"""
    X = [6,7,8,3,6,8,3,2,7,9];  Y = [0,9,6,2,0,6,2,5,9,4]
    G = [4,2,1,4,3,2,2,3,4,1];  T = ['a','c','b','a','a','b','a','c','d','b']
    dt = [('X','f8'),('Y','f8'),('Group','i4'),('Text','|S5')]
    arr_0 = np.array(zip(X,Y,G,T),dtype=dt)
    uniq_0,idx_0,arr_u0 = get_unique(arr_0[['X','Y']])
    frmt = "\narr_0[['X','Y']]...\nInput:\n{}\nOutput:\n{}\nIndices\n{}\nSliced:\n{}"
    print(frmt.format(arr_0,uniq_0,idx_0,arr_u0))

 

This yields the following results:

 

arr_0[['X','Y']]...
Input:
[(6.0, 0.0, 4, 'a') (7.0, 9.0, 2, 'c') (8.0, 6.0, 1, 'b')
(3.0, 2.0, 4, 'a') (6.0, 0.0, 3, 'a') (8.0, 6.0, 2, 'b')
(3.0, 2.0, 2, 'a') (2.0, 5.0, 3, 'c') (7.0, 9.0, 4, 'd')
(9.0, 4.0, 1, 'b')]
Output:
[[(2.0, 5.0)]
[(3.0, 2.0)]
[(6.0, 0.0)]
[(7.0, 9.0)]
[(8.0, 6.0)]
[(9.0, 4.0)]]
Indices
[7 3 0 1 2 9]
Sliced:
[(6.0, 0.0) (7.0, 9.0) (8.0, 6.0) (3.0, 2.0) (2.0, 5.0)
(9.0, 4.0)]

 

The arr_0 output is your conventional recarray output, with everything wrapped around, making it hard to read.  The Output section shows the unique X,Y values in the array in sorted order, which is the default.  The Indices output gives the locations in the original array where the entries in the sorted Output are first found.  To produce the Sliced incarnation, I sorted the Indices, then used the sorted indices to slice the rows out of the original array.

 

Voila... take a table, make it an array, find all the unique entries based upon the whole array, or a column or columns, then slice and dice to get your desired output.  In any event, it is possible to terminate the process at any point and just find the unique values in a column, for instance.
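For instance, stopping at the single-column case is a one-liner.  A small sketch using data shaped like the earlier example (values hypothetical):

```python
import numpy as np

dt = [('X', 'f8'), ('Y', 'f8'), ('Group', 'i4'), ('Text', 'U5')]
arr = np.array([(6.0, 0.0, 4, 'a'), (7.0, 9.0, 2, 'c'),
                (8.0, 6.0, 1, 'b'), (3.0, 2.0, 4, 'a'),
                (6.0, 0.0, 4, 'a')], dtype=dt)

# unique values in one column, in sorted order (the default)
groups = np.unique(arr['Group'])                 # -> [1 2 4]

# first-occurrence order instead: sort the returned indices
g, idx = np.unique(arr['Group'], return_index=True)
first_seen = arr['Group'][np.sort(idx)]          # -> [4 2 1]
```

The same sort-the-indices trick used above for whole records applies to a single field.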

 

The next case shows how to deal with ndarrays that consist of a uniform data type, for which the above example will not work.

Of course there is a workaround.  To that end, consider the small def below, from a series I maintain, which shows how to recast an ndarray with a single dtype to a named structured array or a recarray.  Once you have fiddled with the parts, you can:

  • determine the unique records (aka rows)
  • get them in sorted order or
  • maintain the original order of the data

 

# ----------------------------------------------------------------------
# num_42 comment line above def
def num_42():
    """(num_42)...unique while maintaining order from the original array
    :Requires: import numpy as np
    :--------
    :Notes:  see my blog for format posts, there are several
    :-----
    : format tips
    : simple  ["f{}".format(i) for i in range(2)]
    :         ['f0', 'f1']
    : padded  ["a{:0>{}}".format(i,3) for i in range(5)]
    :         ['a000', 'a001', 'a002', 'a003', 'a004']
    """

    a = np.array([[2, 0], [1, 0], [0, 1], [1, 0], [1, 2], [1, 2]])
    shp = a.shape
    dt_name = a.dtype.name
    flds = ["f{:0>{}}".format(i,2) for i in range(shp[1])]
    dt = [(fld, dt_name) for fld in flds]
    b = a.view(dtype=dt).squeeze()  # view as a structured array (or use type=np.recarray)
    c, idx = np.unique(b, return_index=True)
    d = b[idx]
    return a, b, c, idx, d

 

The results are pretty well as expected.  

  1. Array 'a' has a uniform dtype.
  2. The shape and dtype name were used to produce a set of field names (see flds and dt construction).
  3. Once the dtype was constructed, a structured array or recarray can be created ('b' as a structured array).
  4. The unique values in array 'b' are returned in sorted order (array 'c').
  5. The indices of the first occurrence of the unique values are also returned (idx).
  6. The input structured array, 'b', was then sliced using the indices obtained.
>>> a
array([[2, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [1, 2],
       [1, 2]])
>>> b
array([(2, 0), (1, 0), (0, 1), (1, 0), (1, 2), (1, 2)],
      dtype=[('f00', '<i8'), ('f01', '<i8')])
>>> c
array([(0, 1), (1, 0), (1, 2), (2, 0)],
      dtype=[('f00', '<i8'), ('f01', '<i8')])
>>> idx
array([2, 1, 4, 0])
>>> d
array([(0, 1), (1, 0), (1, 2), (2, 0)],
      dtype=[('f00', '<i8'), ('f01', '<i8')])
>>> # or via original order.... just sort the indices
>>> idx_2 = np.sort(idx)
>>> idx_2
array([0, 1, 2, 4])
>>> b[idx_2]
array([(2, 0), (1, 0), (0, 1), (1, 2)],
      dtype=[('f00', '<i8'), ('f01', '<i8')])

 

I am sure a few of you (ok, maybe one) are saying "but the original array was an Nx2 array with a uniform dtype?"  Well, I will leave the solution for you to ponder.  Once you understand it, you will see that it isn't that difficult, and you only need a few bits of information... the original array 'a' dtype and shape, and the unique array's shape...

 

>>> e = d.view(dtype=a.dtype).reshape(d.shape[0],a.shape[1])
>>> e
array([[0, 1],
       [1, 0],
       [1, 2],
       [2, 0]])

 

These simple examples can be upscaled quite a bit in terms of the number of rows and columns, and which ones you need to participate in the uniqueness quest.

That's all for now.