Case Study: Renovating a Data Inventory

JimSpath · ‎2025 May 14

Many times you may be instructed to create a new application to replace an existing one. Your agency might have taken over a legacy group, you may be a new hire after the last group quit, or you might want to leverage prior art instead of starting from scratch.

In this case, I needed to upgrade a forest inventory in order to better work with landscapers and architects/engineers, where previous data were on paper or disorganized digital media. While this scope covers a few acres/hectares, the design concept should fit a bigger coverage area such as a metropolis, a public park, or private holdings. Basically, an operations and maintenance (O&M) guidance document to further natural resource protection.

In an earlier iteration, the solution was to use a GIS mapping layer, add points or polygons per tree, and sprinkle advice to the sawyers ("dead limb to be removed"). I quickly realized that naming each tree was a poor choice for a unique identifier. The "north cherry tree" and the "south cherry tree" seemed self-evident until the assigned worker had no compass-sense. Rather than starting from scratch, I searched for examples. And if you've used the big search engines lately, you'll understand this may not be easy or rewarding.

The basic requirements were: (1) tree (or shrubbery) species list, (2) measured or observed conditions, and (3) a maintenance or care plan per tree. In a large-scale version, there would be multiple crews collecting the first 2 parts, and maybe a different set of crews caring for the trees. To avoid data getting stale, the maintenance crew sends the feedback to the inventory so the next cycle has current state.

One interesting site I found with templates + instructions was a state university, connected to federal government services (USDA - Department of Agriculture). There was an Excel sheet (two, in fact, one blank and one populated exemplar), an instruction document in MS Word format, and links. I grabbed the files and put on the detective hat to discern the relationships and design created. Somewhat surprisingly, the files are 25 years old yet quite usable. For my renovation, the content would be stored in a database not a spreadsheet, meaning I needed to closely inspect the referential integrity and grow the design to include better data validation, and add any missing or desired elements.

Primary key

At this point in the redesign, a few keys needed to be documented. Each tree would get a unique serial number, which seemed to be the only required differentiator. While every tree belongs to a specific species, those might not be known initially so that column needed to allow nulls (or "undefined"/"unidentified").

Trees/Codes/Sizes

The examples used a 2-letter code that represented a genus + species, with a lookup table included in a separate spreadsheet tab. In the design steps, this is where I needed to decide to keep that legacy style, or try to enhance this by using the full name. For expediency (which might backfire later when scaling up), I stuck with the 2 letter code. We'd need to train the data collectors what letters to use to translate the "English" name to the Latin name to the coded index value. Presumably a later application layer would make the lookups easier with pull-down selections. But for now, stick with the inherited design.

As it turned out, under 100 species were listed. In a diverse forest, there might be more than this sometimes, but for urban or monoculture (planted forests), you might find fewer than 10 predominant. Seemed manageable, so I looked more closely at the lookup table. The first hurdle, the primary key, was unique. While you might have expected this result, you should check anyway, given the looseness of spreadsheet constraints. OK, 70 codes, including 1 for "unknown". The next check was whether these codes translated 1:1 into a Latin species name. Unfortunately, but not too unsurprisingly, I found duplicates.

Secondary keys

Using typical UNIX command line tools, find the errors:

$ sort   /tmp/species-latin.txt | uniq -c |  grep -v "^ *1 "
      2 Populus grandidentata
      2 Quercus alba

To make the lookup table functional, we'd need to drop (or deprecate) one of these codes.

$ egrep "Populus grandidentata|Quercus alba"  /tmp/code+species.txt
PD      Populus grandidentata
PI      Populus grandidentata
QA      Quercus alba
QC      Quercus alba

In a "green-field" scenario, we'd pick one and just delete the other code. But in a "brown-field" situation where the new logic needs to accommodate history data inventories, we would need to review the past and decide how to react. The legacy sample does have both "PD" and "PI", alas.

Tree inventory, auto-filtered

In the new inventory, I would eliminate the above 2 duplicates and add a unique index on the species name, to avoid repeating the mistake.

Next key

Few among us other than trained arborists have good working Latin skills so to make the inventory more friendly, I would add the translated English name, at least for those species known to be within the focus area. I will leave out the gnarly details of verifying each name other than to mention the 70-some species list was missing rather common values, and included ones unlikely to be found here.

Here is where another, unexpected, set of issues came up. Taxonomy of trees and shrubs is a well-established branch (no pun intended), and Latin spelling should probably not vary. Yes, there are changes to the lists as new forms are found (or created), or species thought to differ are found to coincide. But, like searching for an app design in the first place, name searches did not routinely results in a simple answer.

One presumed wrong:

PE      Prunus serontina

The correct spelling appears to be "Prunus serotina" [see, e.g. https://www.wildflower.org/plants/result.php?id_plant=prse2 ]. For future revisions, the USDA code for species has 5 characters, not 2. Might be handy to add those to the lookup table. For now, edit the table.

Another misspelling in the source data is "Tsuga" instead of "Thuja". Unclear how these errors crept in (I blame the squirrels).

$ egrep "Tsuga|Thuja" /tmp/code+species.txt
TD      Tsuga canadensis
TO      Tsuga occidentalis

In this case, the search algorithm showed helpful results rather than wrong or no answers ("Did you mean: Thuja occidentalis [?]").

psql:create_....sql:40: 
ERROR:  duplicate key value violates unique constraint "species_pkey"

Further

With the main keys/constraints designed, or rehabbed, the remaining O&M content fell out fairly easily.

Tree status / condition data

The "planting location" was designed to show whether a specific tree was impacted by paved areas, such as being planted in a hole in a sidewalk. The spreadsheet model used 4 columns, but for simplicity, I'd condense this to a single column with lookup values checked against a list. While an SAP focused solution would have the database in HANA, or Sybase, I used PostgreSQL. Each platform might have different rules or syntax to create constraints. In my case, the "<@" array operator didn't do what I thought the docs said, and I got errors. I chased these down for a bit, then backed off and used an expression that worked. Note that I didn't set syntax errors on the first method, but in testing found the constraints were inoperative.

[...]
  condition bpchar(1) constraint condition_code CHECK (
    condition = ' ' or
    condition = 'G' or
    condition = 'F' or
    condition = 'P' or
    condition = 'D' or
    condition = 'C' or
    condition is NULL ),

One downside to putting the constraint in the table create statement is altering the values later is more work. The other approach is to put the foreign key values into another table. You decide which approach to use based on data knowledge. For this "report card" scenario, I used the old mantra "Every Good Boy Does Fine" (E-G-B-D-F). Or was that a tune?

Other characteristics the inventory will capture are specific conditions, as opposed to general health, so a tree with overhanging wires can be marked as such. These are distinct, and multiple can be true at once, so leave those as individual True/False columns.

Fixes

The inventory crew may or may not know what fixes are needed for every situation, so the original design has a "C" for consult code, meaning ask an expert how to treat or correct a problem. Then, the initial record can be updated to show the maintenance plan. I reasoned the supplied options were adequate, though I am considering adding a "Poison ivy" flag given the propensity of this pest.

Maintenance needs + comments

Similar to overlapping hazards, the maintenance plan may show multiple fixes needed per item, so these need to be separate columns for ease of use. The legacy spreadsheet condensed the titles, making the intent hard to discern; the flags are: Clean, Raise, Reduce, Remove. On first glance, these are unrelated meaning no other constraints are needed. But looking a second time, the "Remove" flag would obviate the need for "Reduce" to be marked, so maybe more intricate triggers are useful.

Legacy markers

The website where I found these resources says it was last updated in 2004; the downloaded files are around 5 years older than that. Not unusual but at least the data can be accessed by current software rather than needing to find a conversion process.

Inventory template properties

Sheet from 1999

Tree Inventory Docs

By Category

Related Content

Activity Groups

Industry Groups

Influence and Feedback Groups

Interest Groups

Location Groups

Customer Only Groups

Forums

Related Resources

Products

Learning and Support

About

My SAP Profile

My SAP Profile