1.3xx Series Vision: Domain-Specific Data & Analysis#

Status: Planning document for future taxonomy expansion
Created: 2025-02-05
Context: Emerging from civic budget analysis work (see task 271842)

Overview#

The 1.3xx series will cover domain-specific data infrastructure and analysis libraries - specialized tools for working with real-world data in specific industries and sectors.

Thematic Structure#

1.0xx-1.1xx: Core computer science (algorithms, data structures, ML, NLP, UI)
1.2xx: Modern infrastructure (LLMs, calendaring, social protocols, messaging)
1.3xx: Domain-specific data & analysis ← NEW SERIES
2.xxx: Standards and protocols
3.xxx: Commercial platforms and applications

Rationale#

Domain-specific data work has unique characteristics:

Specialized data sources - Often messy, unstructured, or semi-structured
Entity resolution challenges - Domain-specific entities need normalization
Regulatory requirements - Compliance, standardization, reporting mandates
Fragmented tooling - Libraries exist but aren’t cataloged
High-stakes domains - Finance, healthcare, legal, government

These domains all need the same kinds of infrastructure (parsing, entity resolution, data access, analysis), but general-purpose tools fall short on the domain-specific sources, entities, and requirements listed above.

Proposed Structure#

1.300-329: Financial Data & Analysis#

1.300-309: Public Finance & Civic Data

Public finance modeling (OpenFisca, PolicyEngine, Tax-Calculator)
Government data access (Census, USAspending APIs)
Budget document parsing (CAFR/ACFR extraction)
Civic entity resolution (agency names, jurisdictions)
Procurement & contract analysis
Fiscal health metrics

1.310-319: Corporate Finance & Securities

SEC filing parsers (EDGAR access, 10-K/10-Q extraction)
Financial statement analysis
Company entity resolution
Credit analysis libraries
Market data access

1.320-329: Financial Infrastructure

Financial entity resolution (LEI, CUSIP, ISIN)
Financial document parsing (earnings calls, prospectuses)
Time-series financial data
Financial data standards (XBRL, FIX)
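As an illustration of the kind of small, reusable component this range could catalog, here is a minimal sketch of an ISIN check. It assumes the usual ISIN shape (two letters, nine alphanumerics, one check digit) and the Luhn check over the letter-expanded digit string; the function name `isin_is_valid` is hypothetical and not taken from any cataloged library.

```python
def isin_is_valid(isin: str) -> bool:
    """Check ISIN shape (2 letters, 9 alphanumerics, 1 digit) and its check digit."""
    isin = isin.strip().upper()
    if len(isin) != 12 or not isin.isascii() or not isin.isalnum():
        return False
    if not isin[:2].isalpha() or not isin[-1].isdigit():
        return False
    # Expand letters to two-digit numbers (A=10 ... Z=35); digits stay as-is.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Luhn: walking from the right, double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(isin_is_valid("US0378331005"))  # True
print(isin_is_valid("US0378331004"))  # False (wrong check digit)
```

A check like this only confirms that an identifier is well formed. Resolving it to an actual security or issuer still needs a reference data source, which is where the entity resolution entries in this range come in.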

1.330-339: Legal & Regulatory Data#

Contract analysis (NDA parsing, clause extraction, risk assessment)
Case law research (legal search engines, citation analysis)
Legal entity identification (parties, jurisdictions, citations)
Regulatory compliance checking
Court data access (PACER, state court systems)
Legal document generation

1.340-349: Healthcare & Medical Data#

HL7/FHIR libraries (healthcare interoperability standards)
Medical NLP (clinical text analysis, entity extraction)
Medical coding (ICD-10, CPT, SNOMED CT)
Drug/pharmaceutical databases
EHR integration tools
Public health datasets (CDC, WHO)

1.350-359: Real Estate & Property Data#

Property records access (assessor data, deeds, titles)
Parcel/GIS data integration
Zoning analysis
Real estate market data
Title search & deed parsing
Property valuation models

1.360-369: Supply Chain & Logistics#

Shipping carrier APIs (FedEx, UPS, USPS, DHL)
Inventory management libraries
Trade/customs data (harmonized tariff codes)
Supply network analysis
Warehouse optimization
Freight tracking

1.370-379: Energy & Climate Data#

Energy market APIs (CAISO, PJM, ERCOT)
Smart grid/IoT data
Carbon accounting & emissions tracking
Climate datasets (NOAA, NASA, IPCC)
Renewable energy modeling
Building energy analysis

1.380-389: Scientific Research Data#

Lab equipment APIs/drivers
Scientific file formats (NetCDF, HDF5, FITS)
Research data repositories (Zenodo, Dryad)
Unit conversion & measurement libraries
Data provenance tracking
Scientific workflow tools

1.390-399: Government Operations (non-financial)#

Legislative data (bills, votes, sponsors, amendments)
Regulatory actions & Federal Register
FOIA/public records access
Open data portal APIs
Election data & results
Government entity resolution (agencies, departments, programs)
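The numbering above can be treated as a simple lookup table. The sketch below is one possible encoding, assuming IDs are written as `1.NNN`; the names `RANGES_1_3XX` and `range_label` are hypothetical helpers, not part of any existing survey tooling.

```python
from typing import Optional

# Ranges copied from the proposed structure: (first, last, label).
RANGES_1_3XX = [
    (300, 309, "Public Finance & Civic Data"),
    (310, 319, "Corporate Finance & Securities"),
    (320, 329, "Financial Infrastructure"),
    (330, 339, "Legal & Regulatory Data"),
    (340, 349, "Healthcare & Medical Data"),
    (350, 359, "Real Estate & Property Data"),
    (360, 369, "Supply Chain & Logistics"),
    (370, 379, "Energy & Climate Data"),
    (380, 389, "Scientific Research Data"),
    (390, 399, "Government Operations (non-financial)"),
]


def range_label(entry_id: str) -> Optional[str]:
    """Return the 1.3xx range label for an ID such as '1.305', else None."""
    tier, _, number = entry_id.partition(".")
    if tier != "1" or len(number) != 3 or not number.isdecimal():
        return None
    n = int(number)
    for first, last, label in RANGES_1_3XX:
        if first <= n <= last:
            return label
    return None


print(range_label("1.305"))  # Public Finance & Civic Data
print(range_label("1.205"))  # None (outside the 1.3xx series)
```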

Scope & Boundaries#

What Belongs in 1.3xx#

✅ Domain-specific data infrastructure

Parsers for domain-specific documents
Entity resolution for domain entities
Data access libraries for domain data sources
Analysis frameworks tailored to domain needs

✅ Open source libraries and free APIs

Python/JavaScript/R libraries
Government/public data APIs
Open data standards implementations

✅ Reusable components

Can be used across multiple projects
Solve general problems within the domain
Have documentation and examples

What Does NOT Belong#

❌ General-purpose tools (already covered in 1.0xx-1.1xx)

Generic parsers → 1.100-109 Text & Document Processing
Generic graph analysis → 1.010-019 Graph & Network
Generic NLP → 1.030-039 String & Text Algorithms

❌ End-user applications (go in 3.xxx)

Complete SaaS platforms
Turnkey solutions
Commercial applications

❌ Pure standards (go in 2.xxx)

Protocol specifications
Data format standards
API specifications (without implementations)

The Blurred Lines Problem#

Challenge: Domain-specific work increasingly involves:

Commercial APIs with free tiers
Government data portals (free but not “open source”)
Freemium services
Proprietary data with open source access libraries

Initial approach:

Focus on libraries - Wrapper libraries that access commercial APIs are in scope
Document the ecosystem - Note when data/APIs are commercial vs. free
Prioritize open - When multiple options exist, lead with open source
Be transparent - Mark commercial dependencies clearly

Examples:

✅ python-edgar (open library accessing free SEC EDGAR)
✅ census (open library accessing free Census Bureau API)
✅ openaq (open library accessing free air quality data)
⚠️ alpaca-trade-api (open library accessing freemium brokerage API)
❌ Pure Bloomberg Terminal access (commercial, no free option)

We’ll document this boundary as we encounter cases.
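One possible reading of the initial approach, written as a decision rule. This is a sketch under assumptions: the `Candidate` fields and the `"free"` / `"freemium"` / `"commercial"` labels are hypothetical, and the real boundary is expected to shift as cases come up.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    IN_SCOPE = "in scope"
    IN_SCOPE_FLAGGED = "in scope, commercial dependency flagged"
    OUT_OF_SCOPE = "out of scope"


@dataclass(frozen=True)
class Candidate:
    name: str
    open_source_library: bool  # is there an open source library to catalog?
    data_access: str           # assumed labels: "free", "freemium", "commercial"


def classify(candidate: Candidate) -> Scope:
    """Apply the 'focus on libraries, mark commercial dependencies' rule."""
    if not candidate.open_source_library:
        return Scope.OUT_OF_SCOPE
    if candidate.data_access == "free":
        return Scope.IN_SCOPE
    # Open library over a freemium or commercial API: in scope, but marked.
    return Scope.IN_SCOPE_FLAGGED


for c in [
    Candidate("python-edgar", True, "free"),
    Candidate("alpaca-trade-api", True, "freemium"),
    Candidate("Bloomberg Terminal access", False, "commercial"),
]:
    print(f"{c.name}: {classify(c).value}")
```

The rule keys on whether an open library exists, not on who owns the data, which matches the examples listed: the library is what gets cataloged, and the data source is documented alongside it.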

Implementation Strategy#

Phase 1: Anchor with Public Finance (1.300-309)#

Start with the best-defined domain (the civic budget analysis work that motivated this series)
Document existing libraries (OpenFisca, PolicyEngine)
Identify gaps clearly
Establish pattern for gap notation

Phase 2: Extend to Adjacent Domains#

Corporate finance (1.310-319) - natural extension
Legal data (1.330-339) - similar parsing/entity resolution challenges
Healthcare (1.340-349) - high-value, well-defined standards

Phase 3: Fill Out Vision#

Add remaining domains as research capacity allows
Prioritize by: (a) existing tooling, (b) community need, (c) research interest
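If the three criteria are read as a strict order of importance (an assumption; they could equally be weighted), prioritization reduces to a lexicographic sort. The scores below are placeholders invented purely to exercise the function and are not assessments of any domain.

```python
from typing import Dict, List, Tuple

# (existing tooling, community need, research interest); higher is better.
Scores = Tuple[int, int, int]


def prioritize(scores: Dict[str, Scores]) -> List[str]:
    """Order domains by criterion (a), breaking ties with (b), then (c)."""
    return sorted(scores, key=lambda domain: scores[domain], reverse=True)


# Placeholder values only.
placeholder_scores = {
    "Real Estate & Property Data": (2, 1, 1),
    "Energy & Climate Data": (3, 1, 2),
    "Supply Chain & Logistics": (2, 3, 1),
}
print(prioritize(placeholder_scores))
```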

Documentation Standards#

Each 1.3xx entry should include:

For existing libraries:

Name & link (GitHub, PyPI, npm)
Primary use case (1-2 sentences)
Language/ecosystem
Maintenance status (active, stable, archived)
Notable users (if applicable)
Key dependencies
Commercial vs. free

For identified gaps:

Problem description
Why general-purpose tools fall short
Potential approach (if known)
Related existing tools
Estimated complexity

Cross-references:

Link to related 1.0xx-1.1xx general tools
Link to 2.xxx standards (if applicable)
Link to 3.xxx commercial alternatives (if applicable)
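The fields for existing libraries could be captured in a small record type so entries can be checked for completeness before publishing. This is a sketch only: the class name, field names, and the `problems` helper are hypothetical, and the example entry is a placeholder, not a real library.

```python
from dataclasses import dataclass, field
from typing import List

MAINTENANCE_STATES = {"active", "stable", "archived"}


@dataclass
class LibraryEntry:
    name: str
    link: str                 # GitHub, PyPI, or npm URL
    use_case: str             # 1-2 sentences
    ecosystem: str            # language/ecosystem
    maintenance: str          # one of MAINTENANCE_STATES
    commercial: bool          # True if it depends on a commercial service
    notable_users: List[str] = field(default_factory=list)
    key_dependencies: List[str] = field(default_factory=list)
    cross_references: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means the entry passes."""
        found = []
        for attr in ("name", "link", "use_case", "ecosystem"):
            if not getattr(self, attr).strip():
                found.append(f"missing {attr}")
        if self.maintenance not in MAINTENANCE_STATES:
            found.append(f"unknown maintenance status: {self.maintenance!r}")
        return found


entry = LibraryEntry(
    name="example-lib",
    link="https://example.org/example-lib",
    use_case="Placeholder entry used only to exercise the schema.",
    ecosystem="Python",
    maintenance="active",
    commercial=False,
    cross_references=["1.100-109 Text & Document Processing"],
)
print(entry.problems())  # []
```

Gap entries would need a different record (problem description, potential approach, estimated complexity), since most of the fields above do not apply to a library that does not exist yet.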

Open Questions#

Granularity: When does a subdomain deserve its own 10-slot range vs. being a single entry?
Overlap: How to handle libraries that span multiple domains (e.g., graph analysis for both finance and supply chain)?
Currency: How frequently to update as new libraries emerge?
Community input: Should we solicit domain expert review before publishing each range?
Commercial boundaries: How to refine the “open library accessing commercial API” boundary?

Success Criteria#

The 1.3xx series succeeds if:

Discovery: Domain practitioners can find tools faster here than through Google or GitHub search
Gap identification: White space becomes visible, inspiring new library development
Legitimacy: Domain-specific infrastructure is recognized as serious software engineering
Composability: Clear documentation enables combining libraries (parser + analyzer + visualizer)
Community: The series becomes a resource that domain communities reference and contribute to

Next Steps#

Draft 1.300-309 structure (Public Finance & Civic Data)
Research existing libraries in each subsection
Document gaps with problem descriptions
Add to survey catalog with proper formatting
Update homepage/sidebar
Get community feedback (Code for America, civic tech community)
Iterate on documentation standards based on feedback

References:

Original proposal: Vikunja task 271842
Related: Survey Methodology
Related: Vision