Skip to content

ForestGEO App Specification

Emma Semenuk Scott edited this page Jun 16, 2022 · 10 revisions

Emma Semenuk Scott, ForestGEO Data Intern

Last updated: 06/16/22

Summary

ForestGEO, an initiative led by the Smithsonian Institution, studies forest diversity via 70+ research sites. Each research site conducts censuses of every tree within the plot at that location. Once forest data is collected from a census, it can take over a year to prepare that data for analysis. This delay is largely due to a patchwork of highly manual processes and legacy software that ForestGEO relies on to enter data into a digital format and validate the data. Furthermore, a handful of database managers are responsible for maintaining this legacy process because errors requiring expert support frequently arise. As a result, ForestGEO would like to make their data collection system more reliable, automated, and easy to use.

A data intern is working with Microsoft volunteers to increase resiliency and timeliness in ForestGEO's database processes.

Customer Overview

We are working with Forest Global Earth Observatory (ForestGEO), an initiative led by the Smithsonian Institution in partnership with dozens of research institutions around the world to study forest function and diversity globally. The intensity and scale of ForestGEO is unprecedented in forest science. There are 73 forest research sites in 28 countries participating in ForestGEO, monitoring approximately 7 million trees and nearly 12,000 species. Data from ForestGEO has been leveraged to publish over 1,100 scientific articles, train hundreds of early-career scientists and students, and contribute to forest policy and management.

ForestGEO research sites take measurements from the ground, as opposed to some initiatives which monitor forest health via satellite. It is true that satellite data can cover a greater surface area, but this does not render ForestGEO's methods obsolete because satellite data cannot match ForestGEO's fidelity and accuracy that are so critical to forest science research.

Personas

Note: Some sites do not have all personas present (ex: Lambir and Palanan)

Director

  • Runs the program across all sites and oversees fundraising
  • Wishes for all research sites to migrate to tablet data entry in the next few years

Database Manager

  • Curate datasets and put them up on the server

  • Train PIs and their field crew in data entry, validation, and how to extract information from database

    • How to use SQL to access relational databases
    • How to use the existing data entry software, CTFSWeb, to enter, screen, and upload data
  • Perform administrative roles for CTFSWeb

  • Conflict resolution and error checking for submitted data

    • For many of the research sites, manually runs SQL scripts to generate error reports (sorted by record rather than by error type), corrects data, and repeats process until all errors resolved
    • Writes new SQL scripts that aren't already hosted on CTFSWeb
  • Example walkthrough: Laura is a database manager. When a research site has completed all data entry for a given census, the lead technician contacts her via email, so that she can complete the data screening process and generate meaningful, reports of errors organized by tag (unique identification number given to each tree) and quadrat (see appendix Fig 7 for plot diagram). Because there are so many errors, Laura feeds these error reports to the lead technician in batches, as not to overwhelm the research site, so that they are more likely to make corrections. The lead technician checks for data entry errors and makes corrections before emailing these corrections back to Laura, so that she can update the database. She is finished entering this set of corrections, but new errors arise as a result of these corrections. Laura again contacts the lead technician over email with the updated error report, and the process repeats until there are no more errors to correct. This process can sometimes take months, and Laura is likely doing it for multiple research sites at a time.

Principal Investigator (PI)

  • Head scientist(s) at a research site, sometimes assisted by lead technician
  • Usually "approves" report of post-census summary statistics

Lead Technician

  • Usually does majority of data wrangling and handling (prior to uploading into CTFSWeb for sites using tablets)
  • In the field, measuring during a census
  • Depending on the site, there are one or two lead technicians who train the field crew
  • Point person; buffer between PI and field crew
  • High turnover role, so knowledge gets lost; many times, technicians aren't permanent and may only be there for a census before moving on
  • Example walkthrough: Rodolfo is a lead technician at a research site. When his site has completed a census, he reaches out to the database manager Laura for error reports. He sends members of the field crew out to make corrections, and then sends lists of corrections to Laura to input into the database. He would like to be able to make these changes himself, but he would need to know SQL to write the scripts to make changes on the temporary tables.

Field Crew

  • Performs data collection via paper forms or tablets, depending on site
  • Enters data using CTFSWeb single or double data entry OR enters data into Excel or Access and send flat files to database managers
  • Example walkthrough: Mica is a member of the field crew. During a census, they are responsible for collecting data in the field during the day and then returning to the field station at night to enter this new data into CTFSWeb. Sometimes Mica forgets to input the right date or mistypes a number, but they're unable to make edits once they submit a data entry form. So, they make note of these mistakes and contact database manager Laura to make changes.
  • Note: At BCI and some African and Asian sites, there are one or two people dedicated to data entry only who start about 3 months after field work has started.

Goals

  • As a database manager, application users can now run their own validation and correct their own errors, so I can better use my technical skills to fulfill my original role responsibilities. I'm also able to easily maintain the application by adding new screening scripts, so that it doesn't become outdated.
  • As a principal investigator, the software is easy to install and intuitive to use, so I can easily onboard new lead technicians, field crew members, and data entry personnel. And, I have the data ready for analysis by my team and me.
  • As a lead technician, I'm able to generate printable error reports immediately after data entry, before too much time has passed, so that I can correct my mistakes in the field the next day and repeat the process.
  • As a director, previously manual data validation is automatic, reducing the time spent data entry and data use from months to seconds, so that census data is available for analysis sooner. And, for those datasets that I manage, I have them made publicly accessible in close to real-time.

Capabilities of Current Solution: CTFSWeb

CTFSWeb was built by a single developer several years ago who has since left ForestGEO. This is the application used by database managers to screen data and by many sites to perform data entry.

  1. Log in with user and site credentials
  2. Viewing and downloading data reports
    1. Creating census-specific, filtered data reports
      1. Filtering for only certain taxa (e.g., "Faramea" species)
      2. Filtering only certain areas of the plot (e.g., quadrat 01-01)
      3. Filtering only certain trees (e.g., all trees with dbh ≥ 10 cm)
    2. Downloading complete data in one file
  3. Configuring data entry forms
    1. Allows for the addition/removal/reordering of columns to accommodate specific site needs
      1. Ex: Some sites opt to hide past census' DBH measurements while others opt to include them
      2. The ability to flag certain trees that need special attention (e.g., collect sample for identification)
  4. Data entry
    1. System relies on data entry forms being configured first
    2. Either create new census or work on existing census
    3. Data entered by quadrat into the following forms (see appendix Fig 1, 2 & 3):
      1. New Plants Form
      2. Old Tree Form
      3. Multiple Stems Form
      4. Big Trees Form
      5. Problems Form
    4. Cannot make edits after submission
    5. Keyboard shortcuts for fast work
  5. Data screening
    1. Single Data Entry or Double Data Entry
    2. Data put into temporary tables
    3. Pre-uploading screening (See appendix Fig 4)
    4. "Finishing Up" (browse or download)
      1. Includes creating the ViewFullTable and ViewTaxonomy tables where all the relevant data is put back into Excel-like tables for easy query
    5. Post-uploading screening (See appendix Fig 5)
    6. PI checks results and either sends field people out to check errors in plot or waits until next census/marks as a first draft of database (more likely)
      1. CTFSWebgives separate lists of each type of error i.e. one for wrong species, wrong DBHs, or duplicate tags. But, having three or four separate lists isn't helpful to a field person who needs the errors organized by tag, so database managers use SQL to reformat/manipulate error reports to be ready for printing.
      2. CTFSWeb doesn't allow users to go back and edit temporary tables or database, so they send corrections to database managers to enter.
    7. See appendix for documentation of current data validations
    8. Uploading data into MySQL database
    9. Generating customizable field form PDFs for print use

Pain Points

# Pain Point Details
1 Data entry process has additional steps where errors can be created and is time consuming 1. Paper forms: Collecting census data on paper then digitizing introduces two points of potential error rather than one if forms were electronic input. There’s an opportunity to mis-record a record on paper in the field and/or to mistype when transferring data to digital format; double data entry minimizes these transcription errors.
2. Tablets: There is no way for people entering data to convert from spreadsheet to format used in legacy application, so they email files to database managers who manually run SQL scripts.
3. The current “save” process is convoluted. Data that is submitted but hasn’t undergone screening remains in temporary tables. However, the people entering data cannot edit data after submitting. Therefore, they are incentivized to enter all census data in a batch run rather than screening individual quadrats worth of data while it’s still reasonable to correct errors.
2 Database manager role is overloaded with manual data validation It’s ideal/expected that PIs and field crews enter data, screen for errors, and generate a post-validation report. However, much of this responsibility has fallen to the database managers because the entire validation process isn’t accessible through the existing software’s user interface and instead requires technical skills to manually run SQL scripts. This impacts the existing flow in a few ways:
1. Single point of failure. If database managers were to retire or be replaced, the entire process breaks down because they hold all of the understanding of how the current system functions.
2. Scalability. The database managers are already handling many (3-6 annually) of research sites’ error reports. Additional sites are added from time to time, and each new site is additional work for the database managers.
3. Overcapacity. Time spent acting as a helpdesk for research sites is time not spent fulfilling database managers’ original roles.
3 Technical concerns
1. Not maintainable. Database managers don’t know how to edit the existing PHP code where screening scripts are hosted. Every time they want to make a change to the validation process, they must write new SQL scripts which can only be run manually.
2. Installation. The installation process is cumbersome and confusing. For a site to get on-boarded, it requires reading a 40+ page instruction manual, attending training workshops, and/or getting individual help from database managers.
3. Existing validation is not always robust. Current screening scripts match by tag number or by order of data entry to locate errors. However, if the field crew inputs tag numbers incorrectly, scripts struggle to detect errors. Additionally, the database mangers have a wish list of validations they’d like the software to perform that they don’t yet have scripts written for.
4 Externalities outside ForestGEO’s control 1. Lack of internet connectivity. Some research sites are remote or located in countries facing civil/political conflicts that result in reduced access to internet. In fact, some sites can only reach internet by traveling for hours or may only have internet once per week in an internet café.
2. Lack of funding. ForestGEO has limited resources for building and maintaining a new solution themselves and has volunteers and a temporary intern managing the project today
3. High turnover rate at research sites. A PI or field worker might know enough SQL from database workshops to run validations on their own, but people come and go and there are always new staff in need of training.

Requirements

*P0 = Essential product feature, P1 = Important, but the product can work without these, P2 = Nice to have

MVR (Minimum Viable Replacement)

Client agnostic drag and drop data validation platform

# Requirement Notes Priority*
1a User can upload census data via CSV file File should meet certain formatting requirements.
P0
1b User can generate an error report following CSV upload but prior to Smithsonian Digital Repository upload Specifies error type and location.
User can download and/or print error report for field use.
P0
1c User can view previously uploaded file list P0
1d After user corrects errors and re-uploads CSV, older version of data is replaced with revised version Requires user confirmation.
Version control: Soft delete (mark as deleted but retain data until a certain point).
P0
1e User can generate report of post-census summary statistics, which relies on historical data P0
1f User credentials required before form upload P0
1g Application supports direct integration with the Smithsonian Digital Repository (on-premises SQL server) • User can click a button to get historical data or click a button to send data.
• Need backend API to store “temporary” data (cloud rather than local); can’t connect directly to Smithsonian Digital Repository.
• When user uploads a CSV, what’s the scope of it? Is it one quadrat or the whole census?
• Data redundancy: Is this process serving as a backup?
• Can run validations on backend data storage for speed.
• Should some/all validation be done client side -- esp. for offline users? Challenging b/c there are multiple devices per site.
P0
1h Administrators can add and edit screening/validation scripts Application is maintainable long term.
Configure driven design.
P0
1i Application supports streamlined process for conversion of post-census datasets to standard R tables used for analysis P1
1j User can create census-specific, filterable data reports for viewing/downloading P1
1k User can reconfigure error report structure User can organize reports either by type of error or by tag. P2

MVR+

Excel-like data entry interface with real time, in-line error validations

# Requirement Notes Priority*
2a User can enter data offline Internet access is limited at some sites. P0
2b Application automatically saves a copy of form in its current state Version control: Database manager wants research sites to be able to revert back to previous version in the event that something goes wrong. P0
2c Data entered has local persistence (application stores the state as data in user’s device storage) The data entry person would like to be able to make changes to forms after saving/submitting. P0

2d
User can configure columns on each form (addition, removal, reordering of columns) Not all sites collect data in the same order or on the same variables.
P0
2e User(s) can practice double data entry (entering the same data twice, comparing versions, and selecting the most accurate record) to minimize transcription errors Read more about double data entry as a means for quality control here. P0
2f Users can complete manual data entry quickly using keyboard shortcuts P0
2g User can generate customizable field form PDFs for print use This step is done once at the beginning of a census for each form.
Would like to be able to fit two tables of data side by side on a single sheet to conserve paper.
P0
2h User can enter location data and specify conversion parameters in order to standardize location data Location data collected on different scales (utm, global, local, 20x20m, 10x10m, or 5x5m coordinates) depending on site.
Makes sure that there’s a location for every new tag and informs of errors.
P1

Architecture Proposal

What's happening in the front end and the back end, and why.

The ForestGEO App will be a web application developed using React, an open-source, front-end JavaScript library. This web application will communicate with Azure cloud services, which will in turn communicate with the Smithsonian Digital Repository, a SQL database that is hosted on-premises.

Cloud services other than Azure may fit the needs of the Smithsonian. We recommend Azure in this proposal for two reasons. For one, the Smithsonian is already running on the Microsoft tech stack, affording the ForestGEO App a synergy with the software the Smithsonian already uses. For example, the Smithsonian logs into their online resources using the Microsoft identity provider, for which authentication is already built-in by Azure Function Apps. Secondly, Azure specializes in the hybrid cloud model, in which on-premises resources connect to the cloud, which is precisely the scenario the ForestGEO App must work within since data must be written to and read from the Smithsonian Digital Repository.

There are two Azure cloud services that the ForestGEO App will leverage: Functions and Serverless Cosmos DB. The Function App orchestrates all backend processing. When a user enters the URL for the ForestGEO App in their browser, the function app will return all content necessary to render the page. The function app will handle authentication/authorization, computationally expensive queries such as data validations, and syncing with the digital repository and Cosmos DB.

For example: suppose that a user drops in a CSV file with fresh data from their second census. At an earlier time, the function app would have queried the on-premises repository for published data from the first census and cached said data in the Serverless Cosmos DB. Data from the dropped CSV file is sent to the Function App, which has authenticated the user and has authorized the user to read and write data to their site's database on the Serverless Cosmos DB. The Function App runs a suite of validations against the second census data, either through code running directly within the Function App or by invoking stored procedures on the Serverless Cosmos DB. For instance, one validation may check to see if a tree in the second census has shrunk compared to the previous census. Validated data are stored alongside pertinent error codes within the Serverless Cosmos DB. Finally, the validated data are returned to the client app, which interprets error codes and displays inline error messages to the user.

Picture1

Front End

  • Web Application
  • Typescript, CSS, HTML
  • React – Open-source JavaScript library

Back End

  • Microsoft Azure Cloud Services
  • Azure Function App
  • Azure Serverless Cosmos DB
  • The Smithsonian Digital Repository: an on-premises SQL server

Lo-Fi UI Mockups

Note: Not all features of the new ForestGEO App user experience are included in these mockups.

MVR

Figure 1.1 Drop box for CSV files containing census data that is either new or revised. Desktop - 1

Figure 1.2 CSV-uploaded census data displayed with inline error messages, for print or download.

Desktop - 3

Figure 1.3 Previously uploaded file list, with indication of validation status and ability to edit or delete files.

Desktop - 4

MVR+

Note: This iteration includes a stepper bar which contextualizes the user's location in the entire census data management process.

Figure 2.1 Site-specific form configuration with column reordering, addition, and/or removal.

Desktop - 1

Figure 2.2 Previously entered data list, separated by form and quadrat, and options for user to upload CSV file or manually enter data.

Desktop - 2

Figure 2.3 Excel-like manual data entry with real-time, in-line error validations and double data entry option accessed via "Compare Versions" button

Desktop - 3

Figure 2.4 Comparing two versions of the same data, entered using double data entry method, and selecting the most correct record before merging into one file

Desktop - 4

User Flow

  1. Collect data in the field via tablet (ArcGIS Pro or Fast Field Forms, etc.) OR via paper field form and double data entry
  2. Export CSV(s) of data
  3. Wrangle data to fit designated format (established by database manager and already in use)
  4. Upload file(s) to ForestGEO App
  5. Get immediate error report which both specifies the location of the error and its type
  6. If necessary, check data in the field, and make corrections to your spreadsheet(s) using your chosen data entry method
  7. Re-upload corrected spreadsheet(s)
  8. If there are no more errors, submit data for upload to ForestGEO database
  9. Rinse and repeat as new data are collected

Solutions Previously Explored

  • Updating legacy CTFSWeb application rather than rebuilding. CTFSweb was designed to run on a local machine for one research site. In order to leverage the cloud, support multiple sites, and make other much-needed enhancements to the user-experience, we would need to change nearly every piece of the CTFSWeb source code. Therefore, our assessment is that creating a new application will overall be less friction than updating CTFSWeb to meet the needs of users. We've also consulted with the original developers of CTFSWeb, and they agree with our assessment.
  • Automated data entry using Azure Forms Recognizer. In Summer 2020, Microsoft Global Hackathon volunteers spent a couple days exploring using machine learning to convert paper field forms to digital data. This investigation did not yield a viable prototype, since ForestGEO paper forms were found to have inconsistent formatting and sometimes multiple languages.
  • Progressive Web Application. In Fall 2021, Microsoft Global Hackathon volunteers including software engineers, data scientists, and a program manager worked to improve upon the progress made in the previous hackathon on the single data entry webpage. The team successfully built a progressive web application (PWA) that functions both online and offline. They also documented existing pre- and post-uploading screenings and determined whether they require internet to run (Appendix Fig 4 & 5).

Out of Scope

  1. Modify schema of the final data in the Smithsonian Digital Repository.
  2. Managing supplementary data processes (wood density, phenology, rainfall, etc.) not directly related to five-year, full census.

Appendix

Figure A.1 Function of data entry forms

Explains the functions and use cases for each data entry form used during a census.

Data Entry Form Description
New Plants Form • Enter data for recruits ≥ 1cm
• Only form needed for first time census
Old Tree Form • Enter new data for existing trees and stems
Multiple Stems Form • Enter data for new stems ≥ 1cm growing from existing tree
• Tree tags should already exist in Old Tree Form
Big Trees Form • Enter data if big trees are measured at a different time than the rest of the trees
• Rarely used, according to Suzanne
Problems Form • Rarely, if ever, used except at BCI b/c field forms for old trees have 2 columns and no place to enter comments
• Paper form used by field supervisors to correct problems in the field
• Data entry technicians have to refer to this form and when necessary, input observations into the field comments
• Not many notes relative to number of trees

Figure A.2 Data entry form

Example configuration of an Old Tree Form in CTFSWeb. Someone entering data can use specific keycodes to move from one field to the next and/or add new fields.

Picture2

Figure A.3 Column names for data entry forms

List of column names available for use during configuration of data entry forms. Artifact from CTFSWeb that should be streamlined into new ForestGEO App.

COLUMNS COLUMN NAME (USE THIS NAME WHILE ADDING COLUMNS
TreeID treeID
Tag, Tree Tag tag
StemID stemID
Stem Tag, Multi Stem stemTag
QuadratID quadratID
Quadrat Name quadratName
Subquadrat subquadrat
SpeciesID speciesID
Mnemonic, Species species
SubSpeciesID subSpeciesID
PlotID plotID
CensusID censusID
PlotCensusNumber plotCensusNumber
Primary Stem primaryStem
X x
Y y
DBH, Diameter dbh
OldDBH, Old Diameter olddbh
HOM, Height hom
OldHOM, Old Height oldhom
Codes codes
Old Codes oldcodes
Comments comments
Exact Date exactDate
Errors errors
Old Trees oldTrees
Big Trees bigTrees

Figure A.4 Pre-upload screening

Explanation of each validation function performed during pre-upload screening and whether it can be run locally or requires internet connection to the cloud. Written by data science volunteers from Microsoft during internal hackathon.

Validation function Notes Location required run validation
 makeErrorBlank Sets Errors as empty cell when errors are null or none. Local
 checkDuplTagout Check for duplicate Tag+StemTag (unique combination) in another database.
Uses ViewFullTable
Cloud
 checkDuplTagin Check for duplicate Tag +StemTag (unique combination) within temporary database. Local
 screenXY Checks if the x and y-coordinates are within plot dimensions
"SELECT TempID FROM ".$file." as a,Site as b WHERE a.PlotID=b.PlotID and a.X NOT BETWEEN 0 AND b.QDimX AND Errors<>'NONE'
Local
checkDuplStemin Check for duplicate Tree and Stem tags within temporary database. Does not include dead stems that couldn't be tagged (StemTag=NULL)
"SELECT Tag, StemTag, COUNT(Tag) as cnt FROM ".$fileName." GROUP BY Tag, StemTag HAVING cnt > 1"
Local
screenDiam Check for diameter range (d <0 and d> maxpossible raises an error).
"SELECT TempID FROM ".$fileName." WHERE DBH<>0 AND (DBH>".$maxDBH." OR DBH<".$minDBH.") AND Errors<>'NONE'
Minimum DBH for a stem is often 1.0, but this depends on whether site uses mm or cm.
Local
screenCodes Check for invalid codes and for Trees with both a D code and diameter measurement
Get codes from TSMAttributes table and put in 1-D array
Check if there are duplicate codes
Flag if code does not appear in TSMAttributes table
Flag if tree is dead and there is a dbh
Local
 screenMainStems Assumes there are main stem codes. Check if there is more than one main stem, if there is none, if main stem died, etc. for multiple-stemmed trees
Get all the codes
Get all the multiple-stemmed trees
Screen for more than one main stem
Screen if no main stem
Screen if main stem died and there are other live stems
Screen if main stem does not have largest dbh
Local
checkSpecies Check that species codes in filein are found in fileout
"SELECT TempID,Mnemonic FROM ".$filein." WHERE Mnemonic IS NULL or Mnemonic NOT IN (SELECT DISTINCT Mnemonic FROM ".$fileout.") AND Errors<>'NONE'
Cloud
checkQuadrat Check for quadrats not matching another file, i.e. tempNewPlants. Is there an easier way??? Remember that there may be more than one record with the same tag in tempNewPlants
"SELECT DISTINCT Tag, QuadratName FROM ".$fileout." WHERE PlotID=".$plotID
Cloud
checkTagnotout Check for Tree tag NOT in another database
"SELECT TempID,Tag FROM ".$filein." WHERE Tag NOT IN (SELECT DISTINCT TAG FROM ".$fileout." WHERE PlotID=".$plotID.") AND Errors<>'NONE'
Cloud
checkQuadName Check that QuadratName in filein are found in fileout Cloud
screenDuplQuad This public function checks to see if all the stems of a tree have the same quadrat Cloud
screenDuplSpec This public function checks to see if all the stems of a tree have the same species code Local
createTempSpecies Creates a table if species doesn't exist Local
screenSpecies Check whether genus name not found in Genus table
Needs Genus cache
Local
cntSpeciesrecords Counts the number of species record Local
startCensusDate Screens for dates entered, should not be earlier than the user defined start date.
Sets the class variables PlotID, PlotCensusNumber and returns StartDate
Local
screenDuplQuadratInfo Checks for duplicated quadrats Local
screenQuadratInfo Check for date ranges
SELECT TempID FROM TempPersonnelInfo WHERE DateWorked NOT BETWEEN '".$start."' AND '".$end."' AND Errors<>'NONE'
Local
checkGrowth Check for extreme growth - for Old Trees form
Flags stems where absolute annual growth > $annualGrowth mm
Check for shrinkage in recensuses - only for Old Trees form
Flags records with annual relative growth <= -0.05 in stems with previous dbh>=50mm and no current stem lost code
Local
checkTagInOldTrees   Check that all Tree tags in Multiple Stem file appears in tempOldTrees
SELECT TempID FROM ".$tempMultiName." WHERE Tag NOT IN (SELECT DISTINCT TAG FROM ".$tempOldName." UNION SELECT DISTINCT TAG FROM ".$tempOldName."Error) AND Errors<>'NONE'
Local
checkDuplStemout   Check for duplicate Tree and Stem tags in another database
SELECT TempID,Tag,StemTag FROM ".$filein." WHERE (Tag,StemTag) IN (SELECT DISTINCT TAG,StemTag FROM ".$fileout." WHERE PlotID = ".$plotID." UNION SELECT DISTINCT TAG,StemTag FROM ".$fileout."Error where PlotID = ".$plotID.") AND Errors<>'NONE'
Cloud
screenMainStems2 Check main stems for multiple-stemmed trees in 2 files: Old Trees and Multiple Stems
Get all the codes
Get all the multiple-stemmed trees in first file
Get all the tree tags from multiple-stemmed file
Screen for more than one main stem
Screen if no main stem
Screen if main stem died and there are other live stems
Screen if main stem does not have largest dbh
Cloud
  ifTableExist Check if a table exists
  censusStartDate Check for valid census using plotID and CensusNumber
Helper Functions
startTransaction Starts db transaction
endTransaction
_getCodes Gets codes

Figure A.5 Report of post-census summary statistics

Explanation of each summary statistic function performed after a census and whether it requires current data or previous censuses to run. Written by data science volunteers from Microsoft during internal hackathon.

Validation Notes Location required to run validation
-- A. COUNT THE NUMBER OF RECORDS BY QUADRAT --
-- This is useful for finding quadrats with no stems (not listed) or quadrats that have too many stems

#1 -- Counting all stem records, dead or alive
SELECT PlotID,CensusID,QuadratName,COUNT() FROM ViewFullTable GROUP BY PlotID,CensusID,QuadratName;

#2 -- Counting all live stem records only
SELECT PlotID,CensusID,QuadratName,count(
) FROM ViewFullTable WHERE status<>'dead' and status<>'stem dead' and status<>'missing' GROUP BY PlotID,CensusID,QuadratName;

#3 -- Counting number of trees
SELECT PlotID,CensusID,QuadratName,count(DISTINCT TreeID) FROM ViewFullTable GROUP BY PlotID,CensusID,QuadratName;
Number of alive trees per quadrat.
These queries could be simplified.
Would #1 and #3 be the same?
This seems to be gathering stats not performing a validation.



Batch validation on current data.
No validation criteria specified here
-- B. COUNT THE NUMBER OF DEAD STEMS AND MISSING STEMS BY CENSUS --
SELECT CensusID,status,COUNT(*) FROM ViewFullTable WHERE status IN ('dead','stem dead','missing') GROUP BY CensusID,status;
Counts number of dead tree.
This should add up to total count above - # alive from above. No need to recompute this.
Batch validation on current data.
No validation criteria specified here
-- C. CHECK WHETHER THERE ARE ANY TREES LOCATED OUTSIDE THE LIMITS OF THE PLOT OR ANY TREES WITH MISSING COORDINATES --
SELECT * FROM ViewFullTable WHERE X IS NULL OR Y IS NULL;
-- For a 1000 by 500m plot
SELECT * FROM ViewFullTable WHERE PX>=@site_xdim OR PY>=@site_ydim;
Check coordinates are valid. Simple type validation, could be in javascript.
-- D. CHECK WHETHER THERE ARE ANY DUPLICATE TAGS --
SELECT Tag,COUNT() FROM Tree GROUP BY Tag HAVING COUNT()>1;
-- If StemTags are unique --
SELECT StemTag,COUNT() FROM Stem GROUP BY StemTag HAVING COUNT()>1;
-- If StemTags are not unique, i.e. starts with 1 for every tree --
SELECT TreeID,StemTag,COUNT() FROM Stem GROUP BY TreeID,StemTag HAVING COUNT()>1;
Checks for duplicates within the current census. Batch validation on current data.
-- E. DISPLAY THE RANGE OF THE DATES OF THE CENSUSES --
SELECT CensusID,MIN(ExactDate) AS DateMin, MAX(ExactDate) AS DateMax
FROM ViewFullTable
GROUP BY CensusID;
Checks the date range is valid. Simple type validation, could be in javascript.
F. DISPLAY THE LARGEST DBH AND HEIGHT OF MEASURE FOR EACH SPECIES --
-- Can check whether any shrub species have too large of a DBH, for example
SELECT Mnemonic, MAX(DBH), MAX(HOM)
FROM ViewFullTable
GROUP BY Mnemonic;
Check the max height and diam are valid. Simple type validation, could be in javascript.
G. CHECK WHETHER ANY TREES HAVE BOTH A DEAD AND ALIVE STATUS IN ONE CENSUS --
SELECT * FROM ViewFullTable
WHERE status='dead' AND (CensusID,TreeID) IN (SELECT CensusID,TreeID FROM ViewFullTable WHERE status<>'dead' AND status<>'stem dead');
Checking for both dead and alive in one census --> are we expecting there to be duplicate tree ids – or would finding duplicate tree ids be an error already? Maybe not needed.
H. CHECK WHETHER ALL TREES FROM ONE CENSUS ARE ACCOUNTED FOR IN THE NEXT CENSUS --
SELECT * FROM ViewFullTable
WHERE CensusID=@census1 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census2);
Check for missing trees from the previous census. Validation requiring previous census.
I. CHECK WHETHER ANY TREES THAT WERE DEAD IN ONE CENSUS TURNS OUT ALIVE IN NEXT CENSUS --
SELECT * FROM ViewFullTable
WHERE CensusID=@census1 AND
Status IN ('dead','stem dead') AND
StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census2 AND status<>'dead' AND status<>'stem dead');
Check for invalid alive trees (that were dead in the previous census. Validation requiring previous census.
-- J. CHECK WHETHER THERE ARE ANY TREES WITH EXTREME GROWTH RATES --
-- There are several ways of calculating growth rates, the following are some examples --
-- The following uses the temporary table TempOldTrees and the view table ViewFullTable and compares the dbh from the 2nd and 3rd census

-- To find stems where the growth rate is too high, listing stems where absolute annual growth > 65 mm.
-- NOTE: Growth rates should be calculated only when the HOM is the same in both censuses and both dbhs >= 10 (if in mm).

SELECT A.Tag,A.StemTag,StemID,A.mnemonic,A.QuadratName,B.DBH AS DBH2, A.DBH AS DBH3
FROM TempOldTrees A JOIN ViewFullTable B USING (StemID)
WHERE B.CensusID=@census1 AND A.DBH>=10 AND B.DBH>=10 AND A.HOM=B.HOM
AND (A.DBH-B.DBH)/((to_days(A.ExactDate)-to_days(B.ExactDate))/365.25) > 65;

-- To find stems where the growth rate is too negative:
-- List stems where annual relative growth <= -0.05 in stems where previous dbh>=50mm AND there is no broken stem codes in 3rd census

SELECT A.Tag,A.StemTag,StemID,A.mnemonic,A.QuadratName,B.DBH AS DBH2, A.DBH AS DBH3,A.codes
FROM TempOldTrees A JOIN ViewFullTable B USING (StemID)
WHERE B.CensusID=@census1 AND A.DBH>=10 AND B.DBH>=50
AND ((A.DBH-B.DBH)/B.DBH)/((to_days(A.ExactDate)-to_days(B.ExactDate))/365.25) <=-0.05;
Check for extreme growth rates. Validation requiring previous census.
-- COUNT NUMBER OF RECRUITS PER QUADRAT PER CENSUS --
-- Check what quadrats do not have recruits or too few recruits --

SELECT QuadratName,COUNT() FROM ViewFullTable
WHERE DBH>=10 AND CensusID=@census2 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1)
GROUP BY QuadratName
ORDER BY QuadratName;

-- Check number of recruits where dbh is between 10 and 20. A site was found to have too few small recruits in one census --

SELECT COUNT(
) FROM ViewFullTable
WHERE DBH>=10 AND DBH<20 AND CensusID=@census2 AND StemID NOT IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1);
Checking # of recruits (trees that are in the new census but not in the previous census?).
Is the dbh check required?
Could this check be done by just comparing the total counts in the new vs old+alive?
Validation requiring previous census.
Possible simplification, to run as batch validation only required current census.
-- COUNT NUMBER OF DEAD STEMS PER QUADRAT OR PER SPECIES --

-- Count number of dead stems per quadrat --
SELECT QuadratName,COUNT() FROM ViewFullTable
WHERE Status IN ('dead','stem dead') AND CensusID=@census2 AND StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1 AND Status NOT IN ('dead','stem dead','missing'))
GROUP BY QuadratName
ORDER BY QuadratName;
-- Count number of dead stems per species in descending order of number of stems --
SELECT Mnemonic,COUNT(
) FROM ViewFullTable
WHERE Status IN ('dead','stem dead') AND CensusID=@census2 AND StemID IN (SELECT StemID FROM ViewFullTable WHERE CensusID=@census1 AND Status NOT IN ('dead','stem dead','missing'))
GROUP BY Mnemonic
ORDER BY COUNT(*) DESC;
Seems like summary statistics – not a validation criteria.
Checks number of newly dead trees in this census vs the previous one.
Analysis, requires the previous census.

Figure A.6 Asks

High level summary of requests made by ForestGEO about usage capabilities of the new application.

Usage scenario Details Concern
Data Entry can be performed offline. Ideally, screening can also run offline. Given the number of remote sites without internet connectivity, a method to pre-validate using offline scripts hosted locally would be helpful. Post-validation scripts should ideally also run offline, but that would require constant backups to the previous censuses. Concern that post-validation scripts might not be able to run offline and may take too much compute power or technical know-how to run
More insightful validation scripts Shameema and Suzanne have to write additional scripts to do error checking because their existing validation scripts don’t give accessible results. We know where the scripts are located on the site, but have not investigated whether we can rework them to run more quickly.
The validation process allows for editing and addition/removal of new validations. Shameema and Suzanne know SQL well, but they don’t know PHP well enough to edit currently-hosted scripts. They must run manual scripts.
Faster Data entry for paper sites For paper sites especially, having to re-type paper entries into a database leaves the site more prone to mistakes and slows down the process Their current system functions today, and there’s limited room for improvement given the offline requirements.
PIs and field people take on the validation role Multiple challenges here—not sure whether field people would use a new solution even if it were more accessible because they don’t see screening as part of the job description.
Faster and easier installation This is a one-time operation, and might be improved upon incidentally if we improve the validation process.
Faster data entry for online sites Customer has an existing solution that’s highly manual, but does function.

Figure A.7 Plot diagram

Simple illustration of plot set up.

Picture3

Clone this wiki locally