Methodology
How The Forms Analysis Was Built
Technical summary for reviewers, stakeholders, and presentation follow-up

From raw form files to a cleaned, decision-support field inventory.

This project was built to analyze FSU form fields at scale, identify inconsistent naming patterns, reduce noise from non-form documents, and produce a more defensible view of where field standardization opportunities exist.

  • Document Scope: 619 source documents reviewed across student and faculty/staff collections
  • Forms Kept: 355 documents retained after separating likely forms from guides and manuals
  • Trusted Core: 2,493 high-confidence rows currently strong enough for decision-support use
  • Low Confidence: 62 residual rows that still need cleanup or careful manual review

What The Project Does

The dashboard is not just a file inventory. It is a field-level analysis workflow that surfaces repeated concepts across forms, shows where labels are fragmented, and helps distinguish reliable patterns from cleanup debt.

  • Inventories form collections across student and faculty/staff sources.
  • Extracts field labels from PDFs, DOCX files, and discovered web forms (a minimal extraction sketch follows this list).
  • Groups naming variants under standardized field categories.
  • Separates likely forms from non-forms such as manuals, policies, and guides.
  • Adds confidence signals so reviewers can distinguish stronger evidence from weaker mappings.
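
For the fillable-PDF case, the extraction step can be illustrated with a short sketch. This is a minimal example rather than the project's actual pipeline: it assumes the pypdf library and a hypothetical example_form.pdf, and it only covers AcroForm fields; flat PDFs, DOCX files, and web forms required separate text-based heuristics that are not shown here.

```python
# Minimal sketch: pull candidate field labels from a fillable (AcroForm) PDF.
# Assumes pypdf is installed; flat PDFs and DOCX files need different handling.
from pypdf import PdfReader

def extract_acroform_labels(path: str) -> list[str]:
    """Return the raw widget labels found in a fillable PDF."""
    reader = PdfReader(path)
    fields = reader.get_fields() or {}  # None when the PDF has no AcroForm
    labels = []
    for name, field in fields.items():
        # Prefer the human-readable alternate text (/TU) when present,
        # otherwise fall back to the internal field name.
        labels.append(str(field.get("/TU") or name))
    return labels

if __name__ == "__main__":
    for label in extract_acroform_labels("example_form.pdf"):  # hypothetical file
        print(label)
```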

How It Works

  1. Document Collection: Departmental form pages were crawled and source files were collected into student and faculty/staff inventories.
  2. Field Extraction: Fillable PDFs, flat PDFs, DOCX files, and discovered web forms were parsed to extract potential field labels and associated metadata.
  3. Document-Type Cleanup: Rule-based and AI-assisted classification was used to split true forms from guides, manuals, policies, and other non-form documents.
  4. Variant Normalization: Noisy labels, formatting artifacts, duplicated suffixes, and over-broad buckets were cleaned or split into more functional categories.
  5. Dual-AI Category Review: Some field categories were still too broad or ambiguous after rule-based cleanup. To handle those safely, Gemini and Claude were used as two independent reviewers. Each model was given the same candidates and asked to suggest a more specific category, and the two answers were compared: only cases where both models agreed were accepted into the dataset, while disagreements stayed flagged for review. This agreement-only approach kept confident-but-wrong AI decisions out of the final results (a minimal sketch of the agreement check follows these steps).
  6. Decision-Support Layer: Rows were labeled with confidence tiers so the strongest core could be separated from medium-risk and low-confidence items.
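
The agreement-only rule from step 5 can be expressed as a small helper. This is a sketch, not the production code: suggest_category_gemini and suggest_category_claude are hypothetical stand-ins for the two model calls, and the real workflow carried prompts, retries, and extra metadata that are not shown.

```python
# Minimal sketch of the agreement-only rule from the dual-AI review step.
# The two suggest_* callables are hypothetical stand-ins for the Gemini and
# Claude calls; real prompts, clients, and retry logic are not shown.
from typing import Callable, Optional, Tuple

def _normalize(category: str) -> str:
    """Compare category names case- and whitespace-insensitively."""
    return " ".join(category.lower().split())

def review_candidate(
    label: str,
    suggest_category_gemini: Callable[[str], str],
    suggest_category_claude: Callable[[str], str],
) -> Tuple[Optional[str], bool]:
    """Return (accepted_category, needs_review) for one candidate label."""
    a = suggest_category_gemini(label)
    b = suggest_category_claude(label)
    if _normalize(a) == _normalize(b):
        return a, False   # both reviewers agree: accept the more specific category
    return None, True     # disagreement: the row stays flagged for manual review
```

Requiring agreement trades coverage for precision: fewer rows get remapped automatically, but the remappings that do land are much less likely to be confidently wrong.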

What Could Come Next

This project establishes a foundation. The natural next steps would be to act on what the analysis found.

  • Prioritize standardization starting with Universal fields. Fields that appear across 10 or more departments with high inconsistency are the highest-value targets — fixing those labels would have the broadest impact across the university.
  • Use the Trusted Core / High Confidence layer as a starting point for a shared field library. In this project, Trusted Core and High Confidence refer to the same strongest subset. The 2,493 high-confidence rows represent field patterns that are already reliable enough to propose as official standardized definitions.
  • Expand department coverage. Not every FSU department was included in this version. Adding more sources would increase both the inventory and the confidence in cross-department patterns.
  • Connect findings to a modernization or migration project. If FSU moves toward a unified forms platform or digital workflow system, this field inventory could directly inform what data fields that system needs to support.
  • Address the remaining low-confidence rows. The 62 low-confidence field concepts still need targeted review before they can be trusted for planning decisions.

What Was Improved

  • Forms-only filtering reduced contamination from non-form documents.
  • Signature cleanup split true signatures from approvals, notary fields, initials, and signature dates (a sketch of this kind of rule follows this list).
  • Administrative catch-all cleanup replaced a junk-drawer category with smaller functional buckets.
  • Comment and description cleanup split broad narrative buckets into more specific types.
  • Department-level confidence replaced misleading perfect readiness scores with the actual share of high-confidence rows per department.
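
To make that cleanup concrete, the sketch below shows the kind of rules involved, assuming simple regex cleanup and keyword routing. The patterns are illustrative only; the project's actual rule set was larger and applied during the normalization and review steps described earlier.

```python
# Illustrative cleanup rules: strip artifacts and duplicated words from raw
# labels, then route signature-like labels into more specific buckets.
# These patterns are examples, not the project's full rule set.
import re

def clean_label(raw: str) -> str:
    label = raw.strip().strip(":_*. ")
    label = re.sub(r"\s+", " ", label)                              # collapse whitespace
    label = re.sub(r"\b(\w+)(?: \1)+\b", r"\1", label, flags=re.I)  # drop duplicated words
    return label

def split_signature_bucket(label: str) -> str:
    text = label.lower()
    if "signature" in text and "date" in text:
        return "Signature Date"
    if "initial" in text:
        return "Initials"
    if "notary" in text:
        return "Notary Field"
    if "approv" in text:
        return "Approval"
    if "signature" in text:
        return "Signature"
    return "Other"
```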

How To Read The Dashboard

  • Coverage shows how many departments use a field pattern.
  • Variants show how many different labels map to the same standardized category.
  • Confidence shows how defensible a category currently is for planning use.
  • Trusted Core is the same thing as the High Confidence subset. The dashboard uses both names for the strongest rows currently suitable for decision-support discussion.

What Coverage and Inconsistency Mean

These two columns are calculated automatically from the data, not assigned by hand; a minimal sketch of the tiering rules follows the definitions below.

  • Universal — the field appears across 10 or more of the 27 tracked departments. It is essentially a cross-university pattern within the current project scope.
  • Common — the field appears in 5 to 9 of the 27 tracked departments. Widely shared but not universal.
  • Specialized — the field appears in 2 to 4 departments. Present in multiple places but not broadly adopted.
  • Limited — the field appears in only 1 department. May be department-specific or a one-off.

Inconsistency is based on how many different raw label variants map to the same standardized field.

  • High inconsistency — 15 or more different labels were found for the same concept across forms. The field is widely used but departments are not naming it the same way.
  • Medium inconsistency — 5 to 14 variants. Some fragmentation but not extreme.
  • Low inconsistency — fewer than 5 variants. Departments are already using relatively consistent naming for this field.
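
Because coverage and inconsistency are pure threshold rules, they can be restated as two small functions. The sketch below simply encodes the thresholds listed above; it is not the project's actual code.

```python
# Coverage and inconsistency tiers, restated from the thresholds above.
# department_count is how many of the 27 tracked departments use the field;
# variant_count is how many raw labels map to the same standardized field.
def coverage_tier(department_count: int) -> str:
    if department_count >= 10:
        return "Universal"
    if department_count >= 5:
        return "Common"
    if department_count >= 2:
        return "Specialized"
    return "Limited"

def inconsistency_tier(variant_count: int) -> str:
    if variant_count >= 15:
        return "High"
    if variant_count >= 5:
        return "Medium"
    return "Low"
```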

How Confidence Was Assigned

Confidence was assigned by the analysis workflow after cleanup and category remapping. It is not a manual score on every row, and it is not a claim that a row is permanently final. It is a practical decision-support signal based on how reliable a category currently is.

  • High confidence was assigned to stronger identity, contact, department, course, signature, and core administrative fields that were repeatedly observed and cleaned into stable categories.
  • Medium confidence was assigned to categories that are useful but still broader, more context-dependent, or more likely to contain mixed variants.
  • Low confidence was assigned to residual noise, unmatched labels, artifact-like fields, and office-use buckets that still need more review.
Document classification used a mix of rule-based filtering and targeted AI review for ambiguous files. The confidence labels themselves were produced by the analysis workflow after those cleanup steps, not by a person hand-tagging each row.
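
For readers who want the tiering in code form, the logic can be approximated as a category-family lookup. This is only an illustrative sketch: the family names and marker strings below are assumptions that paraphrase the description above, and the real workflow also drew on the cleanup and remapping history behind each row, not just the category name.

```python
# Illustrative sketch of confidence tiering; the family names and markers are
# assumptions that paraphrase the description above, not the real rule set.
HIGH_CONFIDENCE_FAMILIES = {
    "identity", "contact", "department", "course", "signature", "core administrative",
}
LOW_CONFIDENCE_MARKERS = ("unmatched", "artifact", "office use", "noise")

def confidence_tier(category_family: str, category_name: str) -> str:
    name = category_name.lower()
    if any(marker in name for marker in LOW_CONFIDENCE_MARKERS):
        return "Low"
    if category_family.lower() in HIGH_CONFIDENCE_FAMILIES:
        return "High"
    return "Medium"
```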

Why This Is Useful

Even without building or owning implementation, this analysis provides a structured view of field inconsistency and data quality that can help future modernization work start from evidence instead of guesswork.

The strongest value of this work is not claiming that every field is already perfect. It is making the current forms landscape measurable, explainable, and easier to prioritize.

Current Limits

  • Residual low-confidence rows remain and still need targeted review.
  • Some extracted labels depend on source quality, especially older PDFs and OCR-heavy documents.
  • Not every standardized category is final; some are still practical analysis buckets rather than final enterprise definitions.