1.303 Civic Entity Resolution#

Entity resolution, deduplication, and matching for government data. Covers agency name normalization, cross-jurisdiction entity matching, vendor/contractor deduplication, and geographic boundary reconciliation.



Research Status: Complete Topic: Entity resolution, deduplication, and matching for government data Last Updated: 2026-02-05

Overview#

Civic entity resolution addresses the challenge of identifying when different records refer to the same real-world entity—whether that’s a government agency, contractor, jurisdiction, or geographic boundary—across fragmented datasets. Unlike general entity resolution, civic applications must handle unique complexities: agencies reorganize, jurisdictions merge, vendor relationships span multiple subsidiaries, and identifiers are inconsistent across federal/state/local systems.

Key capabilities:

  • Agency name normalization: Standardize variations like “DOT” vs “Department of Transportation”
  • Cross-jurisdiction entity matching: Identify “Police Department” entities across cities/states
  • Vendor/contractor deduplication: Merge procurement records for the same entity despite name variations
  • Geographic boundary reconciliation: Align census tracts, electoral districts, and service areas
  • Temporal tracking: Handle entity evolution (mergers, reorganizations, boundary changes)

Why this matters: Government transparency, spending oversight, and data integration all depend on accurately linking records across siloed systems. When procurement data uses one vendor name and tax records use another, civic technologists need entity resolution to follow the money. When agencies reorganize or jurisdictions redistrict, researchers need tools to maintain historical continuity.

Existing Libraries#

Splink#

Links: GitHub | Documentation

Language: Python License: MIT Maintenance: Active (maintained by the UK Ministry of Justice) Python Support: 3.7+

Description#

Splink is a probabilistic record linkage library developed by the UK Ministry of Justice and now used across UK government. It implements the Fellegi-Sunter model for statistical matching and is explicitly designed for large-scale government data integration. Splink provides interpretable match probabilities (critical for auditing government decisions) and scales to millions of records.
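The Fellegi-Sunter model that Splink implements scores each candidate pair by summing per-field log-likelihood weights. A minimal sketch of that arithmetic (the m/u parameters below are invented for illustration, not Splink output):

```python
import math

def match_weight(m: float, u: float) -> float:
    """Fellegi-Sunter log2 weight for a field that AGREES between two records.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are NOT a match)
    Positive weight = agreement is evidence for a match.
    """
    return math.log2(m / u)

def non_match_weight(m: float, u: float) -> float:
    """Weight for a field that DISAGREES (usually negative)."""
    return math.log2((1 - m) / (1 - u))

# Illustrative parameters: agency names agree for 95% of true matches,
# but only 1% of random non-matching pairs
name_agree = match_weight(m=0.95, u=0.01)
id_agree = match_weight(m=0.99, u=0.001)

# Total evidence for a candidate pair = sum of per-field weights,
# converted to a match probability via the prior odds in practice
total = name_agree + id_agree
print(f"name weight: {name_agree:.2f}, id weight: {id_agree:.2f}, total: {total:.2f}")
```

These interpretable per-field weights are what makes the approach auditable: Splink can report exactly how much each field contributed to a link decision.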

Key Features#

  • Probabilistic matching: Calculates likelihood that two records refer to same entity based on field agreement/disagreement
  • Explainability: Generates match weights showing why records were linked (essential for government accountability)
  • Scale: Tested on UK census data (millions of records)
  • Multiple backends: DuckDB, Spark, Athena, SQLite support
  • Training data generation: Active learning to calibrate match thresholds
  • Government-proven: Used for criminal justice, census, NHS, and veteran systems

Installation#

pip install splink

Example Usage#

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

# Define matching rules for government agencies
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("agency_id"),  # Exact ID match (high confidence)
        cl.jaro_winkler_at_thresholds("agency_name", [0.9, 0.8]),  # Fuzzy name
        cl.exact_match("jurisdiction_code"),  # Same state/county
        cl.levenshtein_at_thresholds("address", [1, 2])  # Allow minor typos
    ],
    "blocking_rules_to_generate_predictions": [
        "l.jurisdiction_code = r.jurisdiction_code",  # Only compare within jurisdiction
        "substr(l.agency_name, 1, 3) = substr(r.agency_name, 1, 3)"  # First 3 chars match
    ]
}

# Create linker
linker = DuckDBLinker(agency_data, settings)

# Train model (using labeled examples if available)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Find matches
results = linker.predict(threshold_match_probability=0.8)
clusters = linker.cluster_pairwise_predictions_at_threshold(results, 0.8)

# clusters is a SplinkDataFrame: each input record gets a cluster_id,
# and records sharing a cluster_id refer to the same agency
clusters_df = clusters.as_pandas_dataframe()
for cluster_id, group in clusters_df.groupby("cluster_id"):
    print(f"Cluster {cluster_id}:")
    print(f"  Variants: {sorted(group['agency_name'].unique())}")

Notable Users#

  • UK Ministry of Justice: Core Person Record system linking criminal justice data
  • ONS: UK Census 2021 deduplication and linking
  • NHS England: Healthcare data linkage
  • UK Ministry of Defence: Veterans Card verification

Limitations#

  • Requires structured tabular data (not suitable for free-text documents)
  • Probabilistic approach needs calibration (training data or parameter tuning)
  • Blocking strategy critical for performance (poorly chosen blocks = missed matches)
  • Documentation focuses on UK government use cases (US civic tech examples limited)

Sources: Splink GitHub, UK Government Blog


Dedupe#

Links: GitHub | Documentation

Language: Python License: MIT Maintenance: Active (maintained by Dedupe.io) Python Support: 3.6+

Description#

Dedupe is a machine learning library for fuzzy matching, deduplication, and entity resolution, originally built for civic tech projects. It uses active learning with human-in-the-loop training, making it practical for civic organizations that have domain experts but lack large labeled datasets. Dedupe has been applied to campaign finance, elected officials databases, procurement records, and government budgets.

Key Features#

  • Active learning: Train with minimal labeled examples (human reviews uncertain pairs)
  • Blocking strategies: Efficient pre-filtering to avoid all-to-all comparisons
  • Customizable: Define field types (name, address, price) with appropriate similarity metrics
  • Scalable: Handles millions of records with proper blocking
  • Civic tech origins: Built specifically for messy government data

Installation#

pip install dedupe

Example Usage#

import dedupe

# Define fields for vendor matching
fields = [
    {'field': 'vendor_name', 'type': 'String'},
    {'field': 'address', 'type': 'Address'},
    {'field': 'ein', 'type': 'Exact', 'has missing': True},  # Tax ID
    {'field': 'duns', 'type': 'Exact', 'has missing': True}  # DUNS number
]

# Create deduplicator
deduper = dedupe.Dedupe(fields)

# Active learning: human labels uncertain pairs
# (dedupe expects records as a dict keyed by a unique record ID,
#  e.g. vendor_data = {0: {'vendor_name': ..., 'address': ...}, ...})
deduper.prepare_training(vendor_data)

print("Label these pairs (y/n/u for unsure):")
dedupe.console_label(deduper)

# Train model
deduper.train()

# Find duplicates; partition returns (record_ids, scores) per cluster
clustered_dupes = deduper.partition(vendor_data, threshold=0.5)

# Output: Each cluster = same vendor
for cluster_id, (record_ids, scores) in enumerate(clustered_dupes):
    print(f"Cluster {cluster_id}: {len(record_ids)} records")
    for record_id, score in zip(record_ids, scores):
        print(f"  {vendor_data[record_id]['vendor_name']} (confidence: {score:.2f})")

Notable Users#

  • Code for America: Criminal justice data deduplication for Clean Slate law
  • Sunlight Foundation: Campaign finance and lobbying data matching
  • DataMade: Chicago government transparency projects
  • Civic tech organizations analyzing procurement, budgets, and spending

Limitations#

  • Requires training phase (human labeling or pre-labeled dataset)
  • Active learning works best with domain expert available
  • Blocking strategy must be manually designed (performance-critical)
  • Not suitable for unstructured text (needs tabular data with fields)

Sources: Dedupe Documentation, Code for America Blog


Nomenklatura#

Links: GitHub | PyPI

Language: Python License: MIT Maintenance: Active (maintained by OpenSanctions) Python Support: 3.10+

Description#

Nomenklatura is an entity deduplication and data integration tool built for the OpenSanctions project (international sanctions, politically exposed persons data). It uses graph-based algorithms (connected components) to identify clusters of duplicate entities and provides an OpenRefine-compatible reconciliation API. While focused on sanctions data, its approach is applicable to civic entity resolution.

Key Features#

  • Graph-based deduplication: Uses connected components algorithm to find transitive matches
  • Blocking with inverted index: Fast candidate generation using search engine techniques
  • Transliteration handling: Normalizes names across scripts (Latin, Cyrillic, etc.)
  • Reconciliation API: OpenRefine-compatible for interactive entity matching
  • Character-level matching: Ngram comparison tolerates spelling variations

Installation#

pip install nomenklatura

Example Usage#

# Simplified sketch; class and method names vary across nomenklatura versions
from nomenklatura import Dataset, Resolver

# Create dataset of government entities
dataset = Dataset.make({"name": "agencies", "title": "Government Agencies"})

# Add entities
entity1 = dataset.make_entity("agency-1")
entity1.add("name", "Department of Transportation")
entity1.add("jurisdiction", "Federal")

entity2 = dataset.make_entity("agency-2")
entity2.add("name", "DOT")
entity2.add("jurisdiction", "Federal")

# Create resolver and record a merge decision
from nomenklatura.judgement import Judgement

resolver = Resolver()

# Mark entities as the same (can be automated or human-reviewed)
resolver.decide("agency-1", "agency-2", judgement=Judgement.POSITIVE)

# Apply resolution
canonical = resolver.get_canonical("agency-2")
print(f"Canonical ID: {canonical}")  # canonical cluster ID covering both records

Notable Users#

  • OpenSanctions: International sanctions and PEP data deduplication
  • FollowTheMoney: Financial crime investigation data integration

Limitations#

  • Designed for OpenSanctions data model (may need adaptation)
  • Requires explicit merge decisions (not fully automated)
  • Documentation focuses on sanctions use cases
  • Less civic-specific than Splink/Dedupe

Sources: Nomenklatura GitHub, OpenSanctions Deduplication Article


NYC Agency Name Project#

Links: GitHub

Language: Python License: Not specified (NYC open data) Maintenance: Completed project (Phase 2.6) Python Support: 3.x

Description#

The NYC Agency Name Project is a city-specific effort to create an authoritative dataset of NYC government organization names. It demonstrates practical patterns for agency name normalization at the municipal level, including handling of “Department of X” vs “X Department” variations, abbreviations, and fuzzy matching. While not a reusable library, it provides valuable reference implementation for civic entity normalization.

Key Features#

  • Fuzzy matching: Handles spelling variations and abbreviations
  • Pattern recognition: Normalizes “Department of X” variations
  • Unique ID assignment: Creates canonical IDs (FINAL_REC_000001 format)
  • Multi-source integration: Merges data from various NYC systems
  • Deduplication phases: Iterative refinement with manual review

Approach#

# Conceptual approach (based on project documentation)

def normalize_agency_name(name):
    """
    NYC-specific normalization patterns
    """
    name = name.upper().strip()

    # Handle "Department of" variations (prefix-only rewrites)
    if name.startswith("DEPT OF "):
        name = "DEPARTMENT OF " + name[len("DEPT OF "):]
    if name.startswith("NYC "):
        name = name[len("NYC "):]

    # Expand abbreviations
    abbreviations = {
        "DOT": "DEPARTMENT OF TRANSPORTATION",
        "NYPD": "NEW YORK POLICE DEPARTMENT",
        "FDNY": "FIRE DEPARTMENT OF NEW YORK"
    }
    if name in abbreviations:
        name = abbreviations[name]

    return name

# Fuzzy matching to find similar agencies
# (fuzzywuzzy is now maintained as "thefuzz"; rapidfuzz is a faster drop-in)
from fuzzywuzzy import fuzz

def find_matches(agency_name, agency_list, threshold=85):
    """
    Find potential duplicates using fuzzy matching
    """
    matches = []
    normalized = normalize_agency_name(agency_name)

    for candidate in agency_list:
        score = fuzz.token_sort_ratio(normalized, normalize_agency_name(candidate))
        if score >= threshold:
            matches.append((candidate, score))

    return sorted(matches, key=lambda x: x[1], reverse=True)

Limitations#

  • NYC-specific (not generalizable to other jurisdictions)
  • Manual review required for deduplication decisions
  • Not packaged as reusable library
  • Completed project (not actively maintained)

Sources: GitHub Repository


SAM.gov (System for Award Management)#

Links: SAM.gov Entity Information | API Documentation

Type: Government Registry (not a library, but essential infrastructure) Maintained by: U.S. General Services Administration

Description#

SAM.gov is the authoritative registry for federal government contractors and grant recipients. It provides Unique Entity IDs (UEI) that replaced DUNS numbers in 2022. While not a deduplication library, SAM.gov serves as the “ground truth” for federal vendor identity and provides API access for entity verification.

Key Features#

  • Unique Entity ID (UEI): Official government contractor identifier
  • Entity hierarchy: Links parent companies and subsidiaries (recipient_parent_uei field)
  • CAGE codes: Additional identifier for defense contractors
  • Public API: Search and verification endpoints
  • Historical data: Legacy DUNS number mapping

API Usage#

import requests

def lookup_vendor(uei):
    """
    Look up vendor in SAM.gov registry
    """
    url = "https://api.sam.gov/entity-information/v3/entities"
    params = {
        "ueiSAM": uei,
        "api_key": "YOUR_API_KEY"  # Register at sam.gov
    }

    response = requests.get(url, params=params)
    if response.status_code == 200:
        data = response.json()
        if not data.get("entityData"):
            return None
        # v3 nests registration details under entityData[0].entityRegistration
        reg = data["entityData"][0]["entityRegistration"]
        return {
            "name": reg["legalBusinessName"],
            "uei": reg["ueiSAM"],
            "parent_uei": reg.get("parentUeiSAM"),
            "cage_code": reg.get("cageCode"),
            "status": reg.get("registrationStatus")
        }
    return None

# Example: Link procurement records to SAM.gov
vendor_data = {
    "vendor_name": "ABC Corporation",
    "uei": "ABC123456789"
}

sam_record = lookup_vendor(vendor_data["uei"])
if sam_record:
    print(f"Official name: {sam_record['name']}")
    if sam_record["parent_uei"]:
        parent = lookup_vendor(sam_record["parent_uei"])
        print(f"Parent company: {parent['name']}")

Use Cases for Civic Tech#

  • Vendor deduplication: Use UEI as canonical identifier
  • Parent-subsidiary linking: Follow corporate hierarchies
  • Procurement analysis: Link federal contracts to entities
  • Data validation: Verify vendor names against official registry

Limitations#

  • Federal only: State/local governments use different registries
  • Registration required: Only active contractors are in SAM.gov
  • Historical gaps: UEI adoption in 2022 (legacy DUNS data incomplete)
  • Not a matching library: Provides lookups, not fuzzy matching

Sources: SAM.gov, GSA API Documentation


Census Bureau Crosswalks#

Links: NHGIS Crosswalks | Census Relationship Files

Type: Government Data Products (not libraries, but foundational resources) Maintained by: U.S. Census Bureau, IPUMS NHGIS

Description#

Census Bureau geographic crosswalks enable linking of data across different boundary systems and time periods. These crosswalks use area-weighted interpolation to translate data from one geography (e.g., 2010 census tracts) to another (e.g., 2020 census tracts). They’re essential for civic analysis involving geographic entities.

Key Features#

  • Temporal concordance: Map geographies across decennial censuses
  • Block-level precision: Use census blocks as atomic units (nest within all other geographies)
  • Residential ratios: Weight by population/housing units for accuracy
  • Multiple geography types: Tracts, block groups, counties, ZIP codes, legislative districts

Example Usage#

import pandas as pd

# Load Census Bureau tract crosswalk (2010 -> 2020)
crosswalk = pd.read_csv("tract_crosswalk_2010_2020.csv")

# Crosswalk structure:
# - GEOID_TRACT_10: 2010 tract ID
# - GEOID_TRACT_20: 2020 tract ID
# - WEIGHT: Proportion of 2010 tract in 2020 tract

# Example: Convert 2010 data to 2020 boundaries
data_2010 = pd.DataFrame({
    "GEOID_TRACT_10": ["17031010100", "17031010200"],
    "population": [5000, 8000]
})

# Apply crosswalk weights
merged = data_2010.merge(crosswalk, on="GEOID_TRACT_10")
merged["population_weighted"] = merged["population"] * merged["WEIGHT"]

# Aggregate to 2020 tracts
data_2020 = merged.groupby("GEOID_TRACT_20")["population_weighted"].sum().reset_index()
print(data_2020)

Use Cases for Civic Tech#

  • Longitudinal analysis: Track neighborhoods across redistricting
  • Data integration: Combine datasets with different geographic bases
  • Electoral analysis: Map precincts to census tracts
  • Equity studies: Analyze service delivery by consistent boundaries

Limitations#

  • Interpolation assumptions: Assumes even distribution within source geography (not always accurate)
  • Complexity for non-standard geographies: School districts, service areas require custom crosswalks
  • Temporal gaps: Only available for census years (not for mid-decade boundary changes)
  • Not automated: Requires manual download and integration

Sources: IPUMS NHGIS, Census Bureau Relationship Files


Identified Gaps#

1. Cross-Jurisdiction Entity Matching#

Problem: No dedicated library exists for matching government organizations across jurisdictions. While NYC has its Agency Name Project for NYC-specific entities, and federal data has agency codes, there’s no tool to match “Police Department” across 50 states or “Department of Transportation” across federal/state/local levels.

Why General Tools Fall Short:

  • General entity resolution (Splink, Dedupe) lacks government-specific knowledge
  • Agency types vary by jurisdiction (“Sheriff” vs “Police” vs “Public Safety”)
  • Hierarchical relationships (federal DOT → state DOTs → local transit authorities)
  • Name patterns differ (state-level: “Illinois DOT”, local: “Chicago Dept of Transportation”)
  • No canonical ID system across jurisdictions (unlike UEI for federal contractors)

What’s Needed:

# Desired API for cross-jurisdiction matching
from civic_entities import JurisdictionMatcher

matcher = JurisdictionMatcher()

# Match agencies across jurisdictions
results = matcher.match_agencies(
    agency_name="Police Department",
    jurisdiction_type="municipal",
    state="IL"
)

# Returns:
# [
#   {"name": "Chicago Police Department", "jurisdiction": "Chicago, IL", "type": "police"},
#   {"name": "Springfield Police Dept", "jurisdiction": "Springfield, IL", "type": "police"},
#   {"name": "Naperville Police Department", "jurisdiction": "Naperville, IL", "type": "police"}
# ]

# Hierarchical matching
hierarchy = matcher.get_hierarchy(
    agency_id="federal-dot",
    include_levels=["federal", "state", "local"]
)

# Returns:
# {
#   "federal": "U.S. Department of Transportation",
#   "state": ["Illinois DOT", "Indiana DOT", ...],
#   "local": ["Chicago Transit Authority", "METRA", ...]
# }

Potential Approach:

  1. Build reference dataset:

    • Federal agencies: Use USA.gov agency directory
    • State agencies: Scrape state government websites
    • Local agencies: Use municipal open data portals (top 200 cities)
  2. Develop matching rules:

    • Standardize types: “police”, “fire”, “public_works”, “transit”, etc.
    • Handle abbreviations: “DOT”, “PD”, “DPW”
    • Geographic normalization: State/county/city codes
  3. Create hierarchy mapping:

    • Federal → State relationships (grants, oversight)
    • State → Local relationships (funding, regulation)
    • Link via FIPS codes and OMB metro areas
  4. Probabilistic matching:

    • Use Splink for fuzzy name matching
    • Boost score if jurisdiction types align (both municipal)
    • Penalize cross-level false matches (federal vs local)
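The boost/penalize logic in step 4 might look like the following sketch, using stdlib difflib as a stand-in for Splink's calibrated weights. The `agency_match_score` function, field names, and adjustment values are all hypothetical:

```python
from difflib import SequenceMatcher

def agency_match_score(a, b):
    """Toy cross-jurisdiction match score; a real system would replace
    this with probabilistic weights learned from labeled data."""
    # Base: fuzzy name similarity in [0, 1]
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    if a["level"] == b["level"]:
        score += 0.15  # boost: jurisdiction types align (both municipal, etc.)
    else:
        score -= 0.25  # penalize: cross-level pairs are usually false matches
    return max(0.0, min(1.0, score))

chicago = {"name": "Chicago Police Department", "level": "municipal"}
springfield = {"name": "Springfield Police Dept", "level": "municipal"}
fbi = {"name": "Federal Bureau of Investigation", "level": "federal"}

print(agency_match_score(chicago, springfield))  # high: similar names, same level
print(agency_match_score(chicago, fbi))          # low: dissimilar names, cross-level
```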

Complexity: High

  • Data collection across thousands of jurisdictions
  • Handling agency reorganizations over time
  • No single authoritative source
  • Requires ongoing maintenance

Potential Users:

  • Government transparency organizations (following money across levels)
  • Researchers studying intergovernmental relationships
  • News organizations investigating cross-jurisdiction patterns
  • Federal agencies tracking grants/funds to state/local entities

2. Comprehensive Agency Name Normalization Library#

Problem: While NYC Agency Name Project demonstrates the approach for one city, no reusable library exists for standardizing agency names across jurisdictions. Each civic tech project rebuilds similar normalization logic.

Why General Tools Fall Short:

  • Abbreviations are domain-specific (“DOT” = Transportation, not “Department of Technology”)
  • Name patterns vary by government level and region
  • No shared “dictionary” of government entity types and abbreviations
  • Legal vs. common names (Federal Highway Administration vs FHWA)

What’s Needed:

# Desired API
from civic_entities import AgencyNormalizer

normalizer = AgencyNormalizer()

# Normalize with context
result = normalizer.normalize(
    agency_name="NYC DOT",
    jurisdiction="New York City, NY",
    level="municipal"
)

print(result.canonical_name)  # "Department of Transportation"
print(result.full_name)        # "New York City Department of Transportation"
print(result.abbreviation)     # "DOT"
print(result.entity_type)      # "transportation"
print(result.confidence)       # 0.95

# Batch normalization
agencies = [
    "Illinois Dept of Revenue",
    "IL DOR",
    "Illinois Department of Revenue",
    "State of Illinois - Revenue Dept"
]

clusters = normalizer.deduplicate(agencies, jurisdiction="Illinois", level="state")
# Returns: [
#   {"canonical": "Illinois Department of Revenue", "variants": [all 4 names]}
# ]

Potential Approach:

  1. Build government-specific abbreviation dictionary:

    • Transportation: DOT, TXDOT (Texas), MoDOT (Missouri)
    • Law enforcement: PD, Police Dept, Sheriff, SO
    • Public works: DPW, PWD, Public Works Dept
    • Health: DOH, DPH, Health Dept, Public Health
  2. Pattern recognition:

    • “Department of X” ↔ “X Department” ↔ “Dept of X”
    • Jurisdiction prefix: “NYC X”, “Chicago X”, “State of Illinois X”
    • Acronym handling: “NYPD” → “New York Police Department”
  3. Contextual normalization:

    • Use jurisdiction to resolve ambiguous abbreviations (DOT in NYC ≠ US DOT)
    • Level-aware: Federal vs state vs local naming conventions
  4. Training data:

    • NYC Agency Name Project as starting point
    • Scrape from USA.gov, state/local websites
    • Crowdsource corrections
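Steps 1–3 could combine a context-keyed abbreviation dictionary with pattern rewriting. A minimal sketch — the dictionary entries, context keys, and `normalize` helper are hypothetical:

```python
import re

# Hypothetical context-keyed abbreviation dictionary (entries invented):
# the same abbreviation resolves differently depending on jurisdiction
ABBREVIATIONS = {
    ("DOT", "federal"): "U.S. Department of Transportation",
    ("DOT", "municipal:nyc"): "New York City Department of Transportation",
    ("DPW", None): "Department of Public Works",
}

def normalize(name, context=None):
    """Resolve abbreviations using jurisdiction context, then rewrite patterns."""
    key = name.strip().upper()
    # Context-specific entry wins; fall back to the generic (None) entry
    for ctx in (context, None):
        if (key, ctx) in ABBREVIATIONS:
            return ABBREVIATIONS[(key, ctx)]
    # "Transportation Dept" / "Transportation Department" -> "Department of Transportation"
    m = re.match(r"(?i)^(.*?)\s+dep(?:t\.?|artment)$", name.strip())
    if m:
        return f"Department of {m.group(1).title()}"
    # "Dept of X" -> "Department of X"
    return re.sub(r"(?i)^dept\.?\s+of\s+", "Department of ", name.strip())

print(normalize("DOT", context="municipal:nyc"))  # New York City Department of Transportation
print(normalize("Transportation Dept"))           # Department of Transportation
print(normalize("Dept of Revenue"))               # Department of Revenue
```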

Complexity: Moderate to High

  • Initial data collection labor-intensive
  • Must support 50+ state naming conventions
  • Thousands of local governments with unique patterns
  • Maintenance as agencies reorganize

Related Work:

  • Cleanco for business entity suffixes (similar approach)
  • OpenRefine reconciliation services (general entity matching)

Potential Users:

  • Civic tech organizations building transparency tools
  • Journalists linking agencies across datasets
  • Government data portals implementing search
  • Researchers integrating multi-jurisdiction data

3. Vendor Parent-Subsidiary Graph Library#

Problem: SAM.gov provides parent_uei links, but no library exists to build and query the full corporate hierarchy graph of government contractors. Understanding vendor relationships is critical for procurement oversight (detecting shell companies, tracking subcontracting patterns).

Why General Tools Fall Short:

  • General graph libraries (NetworkX) lack domain-specific queries
  • Corporate structures are time-variant (mergers, spin-offs)
  • Need to integrate multiple identifiers (UEI, DUNS, EIN, CAGE codes)
  • Procurement context requires specific relationship types (prime/sub, JV partners)

What’s Needed:

# Desired API
from civic_entities import VendorGraph

graph = VendorGraph()

# Load data from SAM.gov and procurement records
graph.load_sam_data(sam_api_key="...")
graph.load_usaspending_data(year=2024)

# Query corporate hierarchy
hierarchy = graph.get_hierarchy(vendor_uei="ABC123456789")
print(hierarchy)
# {
#   "name": "ABC Corporation",
#   "parent": "Global Holdings Inc",
#   "subsidiaries": ["ABC East", "ABC West", "ABC Services"],
#   "depth": 2  # Levels from ultimate parent
# }

# Find related entities
related = graph.find_related(
    vendor_uei="ABC123456789",
    relationship_types=["subsidiary", "joint_venture", "subcontractor"],
    max_hops=2
)

# Contract network analysis
contracts = graph.get_contract_network(
    agency="Department of Defense",
    year=2024,
    min_contract_value=1000000
)
# Returns graph: nodes = vendors, edges = subcontracting relationships

# Detect patterns
patterns = graph.detect_anomalies(
    pattern_types=["circular_subcontracting", "shell_company_indicators"],
    threshold=0.8
)

Potential Approach:

  1. Build vendor entity graph:

    • Nodes: Vendors (keyed by UEI)
    • Edges: parent/subsidiary, joint_venture, subcontractor relationships
    • Time-variant: Track effective dates of relationships
  2. Data sources:

    • SAM.gov: parent_uei links
    • USAspending.gov: Prime/sub contractor fields
    • Corporate registries: Secretary of State filings (for ownership)
    • CAGE code relationships: DOD contractor hierarchies
  3. Graph algorithms:

    • Connected components: Find related vendor clusters
    • Shortest path: Trace money flows through subcontractors
    • Centrality: Identify key vendors in networks
    • Temporal analysis: Track relationship changes over time
  4. Anomaly detection:

    • Circular subcontracting: A → B → C → A
    • Concentration: Single entity winning via multiple subsidiaries
    • Geographic mismatches: Prime in state X, all subs in state Y
    • Shell company indicators: Minimal employees, no web presence
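The circular-subcontracting check (A → B → C → A) reduces to cycle detection on the subcontract graph. A self-contained DFS sketch, with made-up vendor IDs; a production version would more likely run networkx.simple_cycles over UEI-keyed nodes:

```python
def find_cycles(subcontracts):
    """subcontracts: {prime: [subs, ...]}. Returns unique cycles as node lists."""
    cycles = []

    def dfs(node, path, visiting):
        for nxt in subcontracts.get(node, []):
            if nxt in visiting:
                # Back-edge: the cycle is the path from the repeated node onward
                cycles.append(path[path.index(nxt):] + [nxt])
            else:
                dfs(nxt, path + [nxt], visiting | {nxt})

    for start in subcontracts:
        dfs(start, [start], {start})

    # Deduplicate rotations of the same cycle
    unique, seen = [], set()
    for cyc in cycles:
        key = frozenset(cyc)
        if key not in seen:
            seen.add(key)
            unique.append(cyc)
    return unique

# A -> B -> C -> A is circular subcontracting; D -> E is a normal prime/sub link
edges = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["E"]}
print(find_cycles(edges))  # [['A', 'B', 'C', 'A']]
```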

Complexity: High

  • Requires integrating multiple data sources
  • Temporal tracking adds complexity
  • Graph scalability (millions of vendors, billions of contracts)
  • Validation requires procurement domain expertise

Potential Users:

  • Oversight organizations (GAO, Inspectors General)
  • Investigative journalists following procurement
  • Contracting officers checking vendor relationships
  • Academic researchers studying government contracting

4. Temporal Geographic Entity Reconciliation#

Problem: While Census crosswalks handle tract-to-tract mapping, no comprehensive library exists for tracking all types of civic geographic entities over time: school districts that split, city boundaries that annex, special districts that dissolve, and voting precincts that redistrict.

Why General Tools Fall Short:

  • Census tools are census-specific (not for special districts, precincts)
  • GIS tools require spatial expertise and manual processing
  • No standardized temporal tracking across entity types
  • Different agencies maintain different boundaries (no single source of truth)

What’s Needed:

# Desired API
from civic_entities import TemporalGeography

geo = TemporalGeography()

# Track school district over time
district_history = geo.get_history(
    entity_type="school_district",
    entity_id="IL-Cook-District299",  # Chicago Public Schools
    start_year=2000,
    end_year=2024
)

print(district_history)
# [
#   {"year": 2000, "name": "Chicago School District 299", "boundary": <geojson>},
#   {"year": 2013, "event": "merged", "absorbed": ["District 151"], "boundary": <geojson>},
#   {"year": 2024, "name": "Chicago Public Schools", "boundary": <geojson>}
# ]

# Crosswalk between different geographic systems
crosswalk = geo.create_crosswalk(
    source="voting_precinct",
    source_year=2020,
    source_jurisdiction="Cook County, IL",
    target="census_tract",
    target_year=2020
)

# Returns: [(precinct_id, tract_id, overlap_percentage), ...]

# Align time series data across boundary changes
aligned_data = geo.align_time_series(
    data=enrollment_by_district,  # DataFrame with district_id and year
    entity_type="school_district",
    normalize_to_year=2024  # Restate all historical data using 2024 boundaries
)

Potential Approach:

  1. Entity types to support:

    • School districts (NCES data)
    • Electoral precincts (state election authorities)
    • Special districts (Census special district data)
    • City boundaries (Census annexation surveys)
    • Service areas (water, fire, transit districts)
  2. Data sources:

    • Census BAS (Boundary and Annexation Survey)
    • NCES edge files (school districts)
    • State/local GIS portals
    • Redistricting Data Hub (electoral boundaries)
  3. Tracking methodology:

    • Event log: split, merge, boundary_change, dissolved, created
    • Effective dates for each change
    • Spatial relationships (parent/child, overlapping)
    • Interpolation weights for data conversion
  4. Crosswalk generation:

    • Use block-level census geography as atomic unit
    • Calculate overlap percentages for source → target
    • Provide residential/land area weighting options
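Step 4's block-based crosswalk generation can be illustrated with a toy example: each census block carries its source unit, target unit, and population, and the crosswalk weight is each (source, target) pair's share of the source population. Block IDs, units, and figures below are invented:

```python
blocks = {
    # block_id: (source_unit, target_unit, population)
    "b1": ("precinct-7", "tract-0101", 400),
    "b2": ("precinct-7", "tract-0101", 100),
    "b3": ("precinct-7", "tract-0102", 500),
    "b4": ("precinct-8", "tract-0102", 300),
}

def build_crosswalk(blocks):
    """Return {(source, target): weight}; weights sum to 1 per source unit."""
    overlap, totals = {}, {}
    for src, tgt, pop in blocks.values():
        # Accumulate population shared by each (source, target) pair
        overlap[(src, tgt)] = overlap.get((src, tgt), 0) + pop
        totals[src] = totals.get(src, 0) + pop
    return {pair: pop / totals[pair[0]] for pair, pop in overlap.items()}

crosswalk = build_crosswalk(blocks)
print(crosswalk)
# precinct-7 splits 50/50 between the two tracts; precinct-8 maps wholly to tract-0102
```

Swapping population for land area gives the area-weighting option; real implementations read these atoms from Census block relationship files.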

Complexity: Very High

  • Dozens of entity types, each with different update schedules
  • Spatial data processing requirements (GIS expertise)
  • No central repository (must aggregate from many sources)
  • Quality varies dramatically by jurisdiction
  • Maintenance burden (boundaries change constantly)

Potential Users:

  • Education researchers (longitudinal school district analysis)
  • Election analysts (redistricting impact studies)
  • Urban planners (service area analysis)
  • Policy researchers (tracking program boundaries over time)

Related Libraries#

These libraries are commonly used alongside civic entity resolution:

Entity Resolution Foundations#

  • recordlinkage (Python): General-purpose record linkage with probabilistic and deterministic methods
  • fuzzywuzzy (Python): String matching using Levenshtein distance (now maintained as “thefuzz”)
  • python-Levenshtein (Python): Fast edit distance calculation
  • jellyfish (Python): Phonetic encoding and string comparison
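A quick feel for how edit-distance-style similarity behaves on agency names, using only stdlib difflib (jellyfish adds Jaro-Winkler and phonetic encodings on top of this). Note that a harmless abbreviation and a substantive difference can score similarly, which is why token- and phonetic-aware metrics matter:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Ratcliff/Obershelp similarity in [0, 1], case-insensitive
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Department of Transportation", "Dept of Transportation"),  # same agency
    ("Department of Transportation", "Department of Taxation"),  # different agency
]
for a, b in pairs:
    print(f"{ratio(a, b):.2f}  {a!r} vs {b!r}")
```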

Graph Analysis (for vendor networks, hierarchies)#

  • NetworkX (Python): Graph algorithms (connected components, centrality)
  • igraph (Python/R): High-performance graph analysis
  • Neo4j (Graph Database): Query and visualize entity relationships
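Pairwise match decisions from any of these tools become entity clusters via connected components (what NetworkX's connected_components computes over a match graph). A minimal union-find sketch, with invented vendor names:

```python
def cluster(pairs):
    """Group pairwise 'same entity' links into transitive clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Pairwise matches from any linkage tool; transitivity merges A-B and B-C
matches = [("ABC Corp", "ABC Corporation"), ("ABC Corporation", "A.B.C. Corp"),
           ("XYZ LLC", "XYZ L.L.C.")]
print(cluster(matches))
```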

Geocoding and Spatial Matching#

  • GeoPandas (Python): Spatial joins for geographic entity matching
  • Shapely (Python): Geometric operations (boundary intersections)
  • censusgeocode (Python): Census Geocoding API wrapper

Data Cleaning#

  • OpenRefine: Interactive data cleaning with reconciliation services
  • ftfy (Python): Fix text encodings (common in government data)
  • usaddress (Python): Parse US addresses into components

Integration Patterns#

Example 1: Vendor Deduplication Pipeline#

from dedupe import Dedupe
import pandas as pd
import requests

# Step 1: Load procurement data
vendors = pd.read_csv("procurement_vendors.csv")

# Step 2: Validate against SAM.gov (when UEI present)
def validate_sam(uei):
    url = "https://api.sam.gov/entity-information/v3/entities"
    params = {"ueiSAM": uei, "api_key": SAM_API_KEY}
    response = requests.get(url, params=params)
    if response.status_code == 200:
        data = response.json()
        if data.get("entityData"):
            # v3 nests the legal name under entityData[0].entityRegistration
            return data["entityData"][0]["entityRegistration"]["legalBusinessName"]
    return None

vendors["sam_name"] = vendors["uei"].apply(lambda x: validate_sam(x) if pd.notna(x) else None)

# Step 3: Deduplicate using Dedupe for records without UEI
fields = [
    {"field": "vendor_name", "type": "String"},
    {"field": "address", "type": "Address"},
    {"field": "ein", "type": "Exact", "has missing": True}
]

deduper = Dedupe(fields)

# dedupe expects a dict of records keyed by a unique ID
no_uei_vendors = {
    row["vendor_id"]: row
    for row in vendors[vendors["uei"].isna()].to_dict("records")
}
deduper.prepare_training(no_uei_vendors)

# Active learning (human labels examples)
dedupe.console_label(deduper)
deduper.train()

# Find duplicates; partition returns (record_ids, scores) per cluster
clustered = deduper.partition(no_uei_vendors, threshold=0.7)

# Step 4: Assign canonical IDs
canonical_map = {}
for cluster_id, (record_ids, scores) in enumerate(clustered):
    canonical_id = f"VENDOR_{cluster_id:06d}"
    for record_id in record_ids:
        canonical_map[record_id] = canonical_id

vendors["canonical_id"] = vendors["vendor_id"].map(canonical_map).fillna(vendors["uei"])

# Result: vendors DataFrame with canonical_id for all records
print(vendors[["vendor_name", "uei", "canonical_id"]].head(10))

Example 2: Cross-Jurisdiction Agency Matching#

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import pandas as pd

# Load agency data from multiple jurisdictions
federal_agencies = pd.read_csv("federal_agencies.csv")
state_agencies = pd.read_csv("state_agencies.csv")
local_agencies = pd.read_csv("local_agencies.csv")

# Combine with jurisdiction metadata
federal_agencies["level"] = "federal"
state_agencies["level"] = "state"
local_agencies["level"] = "local"

agencies = pd.concat([federal_agencies, state_agencies, local_agencies], ignore_index=True)
agencies["unique_id"] = agencies.index  # Splink requires a unique_id column

# Define Splink settings for cross-jurisdiction matching.
# Comparison-level SQL references columns as <col>_l / <col>_r;
# blocking rules reference them as l.<col> / r.<col>.
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.jaro_winkler_at_thresholds("agency_name", [0.9, 0.8, 0.7]),
        cl.exact_match("agency_type"),  # e.g., "police", "transportation"
        {
            "output_column_name": "level_comparison",
            "comparison_levels": [
                {
                    "sql_condition": "level_l IS NULL OR level_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True
                },
                {"sql_condition": "level_l = level_r", "label_for_charts": "Same level"},
                {"sql_condition": "ELSE", "label_for_charts": "Different level"}
            ]
        },
        cl.exact_match("state_code")  # state/local matches must share a state
    ],
    "blocking_rules_to_generate_predictions": [
        "l.state_code = r.state_code",   # only compare within the same state
        "l.agency_type = r.agency_type"  # only compare the same agency types
    ]
}

linker = DuckDBLinker(agencies, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
# Estimate m probabilities; without this, predict() falls back to defaults
linker.estimate_parameters_using_expectation_maximisation("l.agency_type = r.agency_type")

# Find matches (lower threshold for exploratory analysis)
results = linker.predict(threshold_match_probability=0.6)
clusters = linker.cluster_pairwise_predictions_at_threshold(results, 0.6)

# Identify hierarchical relationships within each cluster
# (Federal DOT → State DOTs → Local Transit Authorities)
clustered_df = clusters.as_pandas_dataframe()

hierarchy = {}
for cluster_id, members in clustered_df.groupby("cluster_id"):
    transport = members[members["agency_type"] == "transportation"]
    federal = transport[transport["level"] == "federal"]["agency_name"].tolist()
    state = transport[transport["level"] == "state"]["agency_name"].tolist()
    local = transport[transport["level"] == "local"]["agency_name"].tolist()
    if federal:
        hierarchy[federal[0]] = {"state": state, "local": local}

print(hierarchy)

Example 3: Geographic Boundary Reconciliation#

import geopandas as gpd
import pandas as pd

# Load different geographic layers
census_tracts = gpd.read_file("census_tracts_2020.shp")
school_districts = gpd.read_file("school_districts_2024.shp")

# Reproject to a projected, equal-area CRS before computing areas;
# areas computed in geographic (lat/lon) coordinates are meaningless.
# EPSG:5070 (CONUS Albers) is a reasonable US-wide choice.
census_tracts = census_tracts.to_crs(epsg=5070)
school_districts = school_districts.to_crs(epsg=5070)

# Create a spatial crosswalk: compute the geometric intersection of
# every school district with every overlapping census tract
pieces = gpd.overlay(school_districts, census_tracts, how="intersection")

# Overlap as a share of each *tract's* area — the right weight for
# apportioning tract-level totals (population, etc.) to districts
tract_area = census_tracts.set_index("census_tract_id").geometry.area
pieces["overlap_pct"] = pieces.geometry.area / pieces["census_tract_id"].map(tract_area)

# Filter out sliver intersections (<5% of a tract)
significant_overlaps = pieces[pieces["overlap_pct"] > 0.05]

# Export crosswalk table
crosswalk_table = significant_overlaps[[
    "school_district_id",
    "census_tract_id",
    "overlap_pct"
]].copy()

crosswalk_table.to_csv("school_district_to_tract_crosswalk.csv", index=False)

# Use the crosswalk to aggregate tract-level data to school districts.
# Note: weighting by area share assumes population is uniformly
# distributed within each tract (simple areal interpolation).
tract_data = pd.read_csv("tract_demographics.csv")  # census_tract_id, population, etc.

merged = crosswalk_table.merge(tract_data, on="census_tract_id")
merged["population_weighted"] = merged["population"] * merged["overlap_pct"]

district_aggregated = merged.groupby("school_district_id").agg({
    "population_weighted": "sum"
}).reset_index()

print(district_aggregated)

Research Methodology#

This research was conducted through:

  1. Primary source review:

    • Official documentation for Splink (UK Government), Dedupe (Dedupe.io), Nomenklatura (OpenSanctions)
    • GitHub repository analysis (commit history, issue discussions, maintenance status)
    • SAM.gov API documentation and data dictionaries
    • Census Bureau technical documentation
  2. Civic tech community:

    • Code for America blog posts and project documentation
    • Sunlight Foundation archived resources on government data matching
    • DataMade and civic tech project case studies
    • Open Civic Data standards documentation
  3. Academic literature:

    • Entity resolution surveys (VLDB, Science Advances)
    • Census Bureau research on record linkage
    • Government Accountability Office reports on data integration challenges
  4. Government sources:

    • UK Government blog posts on Splink adoption
    • US Census Bureau crosswalk methodology documentation
    • GAO reports on DATA Act implementation and data standards
    • USAspending.gov entity relationship documentation
  5. Practitioner perspectives (via published documentation):

    • Code for America Clean Slate law implementation
    • OpenSanctions deduplication engineering articles
    • UK Ministry of Justice Splink use cases

Cross-References#

Related Survey Topics:

  • 1.010-019: Graph & Network - For vendor relationship graphs and entity hierarchies
  • 1.033: NLP Libraries - For entity extraction from documents
  • 1.101: PDF Processing - Foundation for extracting entities from government PDFs
  • 1.300: Public Finance Modeling - Context for why vendor/agency matching matters
  • 1.301: Government Data Access - APIs providing raw entity data
  • 1.302: Budget Document Parsing - Extracting agency names from documents
  • 1.304: Procurement & Contracts - Use case for vendor entity resolution

Standards (future 2.xxx):

  • Open Civic Data (OCD) identifiers
  • Frictionless Data for entity reference datasets
  • Schema.org GovernmentOrganization vocabulary
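OCD division identifiers have a predictable path structure (e.g. `ocd-division/country:us/state:il/place:chicago`), which makes them easy to validate and decompose. A minimal sketch — `parse_ocd_id` is an illustrative helper, not part of any OCD library:

```python
import re

# Each path segment is <type>:<value>, e.g. "state:il" or "sldl:42"
_SEGMENT = re.compile(r"^[a-z_]+:[a-z0-9~._-]+$")

def parse_ocd_id(ocd_id: str) -> dict:
    """Decompose an OCD division ID into {type: value} segments."""
    prefix, _, rest = ocd_id.partition("/")
    if prefix != "ocd-division" or not rest:
        raise ValueError(f"not an OCD division ID: {ocd_id!r}")
    segments = {}
    for seg in rest.split("/"):
        if not _SEGMENT.match(seg):
            raise ValueError(f"malformed segment: {seg!r}")
        key, _, value = seg.partition(":")
        segments[key] = value
    return segments

print(parse_ocd_id("ocd-division/country:us/state:il/place:chicago"))
# {'country': 'us', 'state': 'il', 'place': 'chicago'}
```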

Platforms (future 3.xxx):

  • SAM.gov (federal contractor registry)
  • USAspending.gov (federal spending with entity links)
  • Socrata open data portals (municipal entity data)

Recommendations for Library Builders#

If you’re considering building civic entity resolution tools:

For Cross-Jurisdiction Matching#

  1. Start with reference dataset:

    • Federal: USA.gov agency directory (complete, maintained)
    • State: Top 20 states by population (covers ~75% of the US population)
    • Local: Top 200 cities (sufficient for most transparency use cases)
  2. Focus on entity types with high impact:

    • Law enforcement (police, sheriff, corrections)
    • Transportation (DOT, transit authorities, airports)
    • Education (school districts, state boards of education)
    • Health (departments of health, hospital districts)
  3. Build for extensibility:

    • User-contributed entity mappings (OpenRefine-style)
    • Confidence scores (low confidence = flag for review)
    • API for programmatic matching (not just standalone tool)

For Agency Name Normalization#

  1. Learn from NYC Agency Name Project:

    • Pattern-based normalization (Department of X variations)
    • Abbreviation dictionaries by domain
    • Fuzzy matching with jurisdiction context
  2. Crowdsource corrections:

    • Build validation UI (users confirm/correct matches)
    • Track provenance (who validated each match?)
    • Feedback loop improves model over time
  3. Integrate with Splink/Dedupe:

    • Normalization as preprocessing step
    • Reduces false negatives from name variations
    • Improves blocking efficiency
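The normalization recipe above can be sketched as a small preprocessing function. The abbreviation table is a tiny invented sample, not the NYC project's actual dictionary:

```python
import re

# Illustrative abbreviation dictionary; real deployments maintain
# per-domain tables with hundreds of entries.
ABBREVIATIONS = {
    "dot": "department of transportation",
    "dep": "department of environmental protection",
    "hpd": "department of housing preservation and development",
}

def normalize_agency_name(name: str) -> str:
    """Lowercase, strip punctuation, expand known abbreviations, and
    canonicalize 'X Department' -> 'department of X' word order."""
    s = re.sub(r"[^\w\s]", "", name.lower())  # "D.O.T." -> "dot"
    s = re.sub(r"\s+", " ", s).strip()
    if s in ABBREVIATIONS:
        return ABBREVIATIONS[s]
    m = re.match(r"^(.*\S)\s+department$", s)
    if m:
        return f"department of {m.group(1)}"
    return s

print(normalize_agency_name("D.O.T."))                    # department of transportation
print(normalize_agency_name("Transportation Department")) # department of transportation
```

Applied before Splink or Dedupe, this collapses spelling variants into one blocking key, which both cuts false negatives and shrinks the candidate-pair space.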

For Vendor Graph Library#

  1. Start with federal data:

    • SAM.gov has best data quality
    • USAspending.gov has relationship fields
    • Validate against FPDS (Federal Procurement Data System)
  2. Temporal tracking is critical:

    • Mergers/acquisitions change hierarchies
    • Use effective_date for all relationships
    • Support “as of date” queries
  3. Anomaly detection patterns:

    • Circular subcontracting (graph cycles)
    • Unrealistic small business qualifications (revenue, employees)
    • Geographic mismatches (HQ vs work location)
    • Timing patterns (award date vs incorporation date)
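Circular subcontracting, for example, reduces to cycle detection in the directed prime-to-subcontractor graph. A minimal sketch over hypothetical award pairs:

```python
def find_cycle(edges):
    """Detect a cycle in a directed graph given as (prime, sub) pairs.

    Returns one cycle as a list of nodes (first node repeated at the
    end), or None. Recursive DFS with a white/gray/black color scheme.
    """
    graph = {}
    for prime, sub in edges:
        graph.setdefault(prime, []).append(sub)
        graph.setdefault(sub, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def dfs(v, path):
        color[v] = GRAY
        path.append(v)
        for w in graph[v]:
            if color[w] == GRAY:  # back edge: w is on the current path
                return path[path.index(w):] + [w]
            if color[w] == WHITE:
                cycle = dfs(w, path)
                if cycle:
                    return cycle
        color[v] = BLACK
        path.pop()
        return None

    for v in list(graph):
        if color[v] == WHITE:
            cycle = dfs(v, [])
            if cycle:
                return cycle
    return None

# Hypothetical subcontract awards: A -> B -> C -> A is circular
awards = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(find_cycle(awards))  # ['A', 'B', 'C', 'A']
```

At real procurement scale you would use a graph library (e.g. networkx's cycle functions) rather than hand-rolled DFS, but the detection logic is the same.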

For Geographic Reconciliation#

  1. Don’t reinvent spatial analysis:

    • Use GeoPandas, Shapely (mature, well-tested)
    • Focus on entity type coverage, not spatial algorithms
  2. Prioritize common use cases:

    • School district to census tract (education research)
    • Voting precinct to tract (electoral analysis)
    • City boundaries over time (urban studies)
  3. Document interpolation assumptions:

    • Users must understand weighting methods
    • Provide multiple weight options (population, land area, residential)
    • Warn when interpolation may be inaccurate (non-residential areas)
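The impact of the weighting choice is easy to demonstrate with invented numbers: if a tract's overlap with a district contains no housing, simple area weighting misallocates its residents while a residential-area weight does not:

```python
# Hypothetical tract pieces overlapping one school district.
pieces = [
    # area_share: fraction of the tract's area inside the district
    # res_share:  fraction of the tract's residential area inside it
    {"tract": "T1", "population": 4000, "area_share": 0.50, "res_share": 0.90},
    {"tract": "T2", "population": 9000, "area_share": 0.10, "res_share": 0.00},
]

# Simple areal interpolation: assumes uniform population density
area_weighted = sum(p["population"] * p["area_share"] for p in pieces)

# Dasymetric-style refinement: weight by residential land only
residential_weighted = sum(p["population"] * p["res_share"] for p in pieces)

print(area_weighted)         # 2900.0 — credits T2's (non-residential) overlap with 900 people
print(residential_weighted)  # 3600.0 — T2 correctly contributes nobody
```

Exposing both weights, as Census Bureau crosswalk files do, lets users pick the assumption that fits their data.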

General Advice#

  • Open-source licensing: Use MIT or Apache 2.0 (broadest adoption)
  • Validation datasets: Include test cases with ground truth
  • Civic tech engagement: Partner with Code for America brigades for feedback
  • Government collaboration: Work with data.gov, SAM.gov teams for data access
  • Explainability: Civic applications require auditable decisions (provide match reasoning)

Acknowledgments#

This research builds on the foundational work of:

  • UK Ministry of Justice Analytical Services for developing Splink
  • Dedupe.io for creating entity resolution tools specifically for civic tech
  • OpenSanctions for open-source entity deduplication approaches
  • NYC Mayor’s Office of Data Analytics for demonstrating city-scale entity normalization
  • Code for America for applying entity resolution to criminal justice reform
  • U.S. Census Bureau for decades of work on geographic crosswalks and record linkage

The civic tech community’s commitment to open data and transparent government makes this research possible.


Updates#

2026-02-05: Initial research completed covering 6 major tools/resources and 4 key gaps. Emphasis on UK Government Splink adoption as proof of production-readiness for civic applications.

Last verified: 2026-02-05