Add Form 8-K parsing and event storage infrastructure #68

Open
sroussey wants to merge 3 commits into main from claude/add-8k-support-0jEUc

sroussey commented Mar 7, 2026

Summary

This PR adds comprehensive support for parsing SEC Form 8-K filings and storing extracted events in a dedicated repository. It includes the schema definitions, parsing logic, storage layer, and extensive test coverage with real SEC EDGAR filing samples.

Key Changes

  • Form 8-K Schema (Form_8_K.schema.ts): Defined TypeBox schemas for Form 8-K submissions, signatures, and related metadata
  • Form 8-K Parser (Form_8_K.ts): Implemented parsing logic to extract structured data from 8-K HTML/XML documents
  • Form 8-K Event Storage (Form8KEventSchema.ts, Form8KEventRepo.ts): Created dedicated storage layer for Form 8-K events with repository pattern
  • Storage Integration (Form_8_K.storage.ts): Added processForm8K function to extract and persist 8-K events, handling item codes, signatures, and company relationships
  • Dependency Injection: Registered Form 8-K event repository in both DefaultDI.ts and TestingDI.ts configurations
  • Task Integration (ProcessAccessionDocFormTask.ts): Integrated Form 8-K processing into the document form processing pipeline
  • Test Coverage (Form_8_K.test.ts, Form8KEventRepo.test.ts): Added comprehensive unit tests with 14 real SEC EDGAR filing samples covering various 8-K item types (2.02, 5.02, 5.03, 5.07, 7.01, 8.01, 9.01)
  • Mock Data: Added 14 real Form 8-K filing documents from companies including Apple, Microsoft, Amazon, Tesla, Meta, and Alphabet

Implementation Details

  • The parser extracts key information including CIK, accession number, filing date, report date, and item codes from 8-K documents
  • The storage layer normalizes company names and creates relationships between events and signatories
  • Item codes are parsed from filing metadata and stored separately for efficient querying
  • The implementation handles both standard 8-K and 8-K/A (amended) filings
  • Test suite validates parsing of diverse 8-K structures from different filers and time periods
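
The item-code handling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `FilingMeta` and `Form8KEventRow` shapes and the `parseItemCodes` and `toEventRows` helpers are hypothetical names, and accepting both commas and semicolons as separators is an assumption based on the edge cases listed in the commit message.

```typescript
// Hypothetical shapes for illustration; the PR's actual schemas may differ.
interface FilingMeta {
  cik: number;
  accession_number: string;
  filing_date: string;
  report_date?: string | null;
  items?: string | null;
}

interface Form8KEventRow {
  cik: number;
  accession_number: string;
  filing_date: string;
  report_date: string | null;
  item_code: string;
}

// Split a raw items string such as "2.02,9.01" into unique, trimmed codes.
// Treating ";" as a separator alongside "," is an assumption.
function parseItemCodes(items: string | null | undefined): string[] {
  if (!items) return [];
  const seen = new Set<string>();
  for (const part of items.split(/[,;]/)) {
    const code = part.trim();
    if (code.length > 0) seen.add(code); // Set keeps first-seen order and drops repeats
  }
  return Array.from(seen);
}

// Expand one filing into one row per item code, matching the
// "one row per item per filing" layout described in the commit message.
function toEventRows(meta: FilingMeta): Form8KEventRow[] {
  return parseItemCodes(meta.items).map((item_code) => ({
    cik: meta.cik,
    accession_number: meta.accession_number,
    filing_date: meta.filing_date,
    report_date: meta.report_date ?? null,
    item_code,
  }));
}
```

Storing one row per item keeps a query such as "all filings with item 5.02" a plain equality lookup on `item_code` rather than a substring match against a delimited list.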

https://claude.ai/code/session_01SKG4qTyjPAtmuSipiEiAio

claude added 2 commits March 6, 2026 23:13
Implement full 8-K current report support:
- Form_8_K.schema.ts: TypeBox schema for structured XML 8-K submissions
- Form_8_K.ts: parse() handles both XML and HTML primary documents
- Form_8_K.storage.ts: processForm8K extracts and stores event items,
  merging data from filing metadata and XML form data, plus signature processing
- Form8KEventSchema/Repo: new storage layer for normalized 8-K event items
  (one row per item per filing) with queries by CIK, accession, and item code
- ProcessAccessionDocFormTask: routes 8-K/8-K/A to processForm8K
- DI registration in DefaultDI and TestingDI
- 17 tests covering parsing, storage, amendments, signatures, and edge cases

https://claude.ai/code/session_01SKG4qTyjPAtmuSipiEiAio
- Download 15 real 8-K filings from Apple, Microsoft, Amazon, Tesla,
  Meta, and Alphabet covering diverse item types (1.01, 2.02, 5.02,
  5.03, 5.07, 7.01, 8.01, 9.01)
- Replace synthetic XML mock data with real SEC EDGAR HTML/XHTML filings
- Fix parser detection: use regex for edgarSubmission root element
  instead of <?xml prefix (XHTML inline XBRL files also start with <?xml)
- Add 31 comprehensive tests: parsing all files, storage with filing
  metadata, item type coverage, cross-entity querying, amendment
  handling, edge cases (null/empty items, semicolons, deduplication,
  unknown items), XML signature processing, Form_8_K_ITEMS validation
- Total: 470 tests pass across 45 files

https://claude.ai/code/session_01SKG4qTyjPAtmuSipiEiAio
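
The detection fix in the second commit could look something like the sketch below. The PR's actual regex is not shown in this thread, so the pattern and the `isStructuredSubmission` name are assumptions; the point is only that the check keys on the `edgarSubmission` root element rather than on a leading `<?xml` declaration, which inline XBRL XHTML files also carry.

```typescript
// Assumed pattern: optional XML declaration, optional comments, then an
// <edgarSubmission ...> root element. Not the regex used in the PR.
const EDGAR_SUBMISSION_ROOT =
  /^\s*(?:<\?xml[^>]*\?>\s*)?(?:<!--[\s\S]*?-->\s*)*<edgarSubmission[\s>]/;

// True when the document's root element is edgarSubmission, i.e. a
// structured XML submission rather than an HTML/XHTML primary document.
function isStructuredSubmission(doc: string): boolean {
  return EDGAR_SUBMISSION_ROOT.test(doc);
}
```

An XHTML filing that begins with `<?xml ...?>` followed by `<html ...>` fails this test and falls through to the HTML path, which is the case the old prefix check misclassified.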

Copilot AI left a comment

Pull request overview

This PR introduces initial infrastructure to support SEC Form 8‑K processing by adding a minimal 8‑K parser, an event-item storage schema/repo, DI registrations, and extensive test fixtures/coverage using real 8‑K primary documents.

Changes:

  • Added Form_8_K parsing entrypoint (structured XML via edgarSubmission; HTML returns minimal {}) and integrated 8‑K processing into the accession document processing task.
  • Introduced a form_8k_events storage table (schema + repository) and wired it into DefaultDI/TestingDI.
  • Added storage logic (processForm8K) plus tests and mock filing samples to validate item-code extraction and persistence.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/task/forms/ProcessAccessionDocFormTask.ts | Routes 8‑K/8‑K/A filings into processForm8K and plumbs filing metadata fields (items, report_date). |
| src/sec/forms/miscellaneous-filings/Form_8_K.ts | Adds Form_8_K.parse() supporting edgarSubmission XML; HTML/XHTML returns {}. |
| src/sec/forms/miscellaneous-filings/Form_8_K.schema.ts | Defines TypeBox schemas for structured 8‑K XML submissions/signatures. |
| src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts | Extracts item codes from filing metadata and/or XML, stores per-item events, and stores signature relationships (XML only). |
| src/storage/form-8k-event/Form8KEventSchema.ts | Defines the Form8KEvent table schema and DI token. |
| src/storage/form-8k-event/Form8KEventRepo.ts | Provides repository methods for saving/querying 8‑K events. |
| src/config/DefaultDI.ts | Registers form_8k_events storage in production DI. |
| src/config/TestingDI.ts | Registers in-memory form_8k_events storage for tests. |
| src/storage/form-8k-event/Form8KEventRepo.test.ts | Unit tests for event repository save/query behavior. |
| src/sec/forms/miscellaneous-filings/Form_8_K.test.ts | End-to-end-ish tests for parsing and storing events using mock filings + metadata. |
| src/sec/forms/miscellaneous-filings/mock_data/form-8k/*.htm | Adds real-world 8‑K primary document samples used by tests. |

Comment on lines 88 to 100
@@ -91,6 +94,8 @@ export class ProcessAccessionDocFormTask extends Task<
form = filing.form ?? undefined;
filing_date = filing.filing_date;
file_number = filing.file_number;
items = filing.items;
report_date = filing.report_date;
fileName = fileName ?? filing.primary_doc;
}

Copilot AI Mar 7, 2026

filing_date, items, and report_date are only populated when cik, form, or fileName are missing. In the main pipeline (FetchAndStoreFormsTask / UpdateAllFormsTask) those three fields are provided, so items/report_date stay undefined and processForm8K will store zero events for HTML 8-Ks (and filing_date becomes an empty string). Consider always loading the filing record (or at least when any of filing_date/items/report_date/file_number are missing) so 8-K event storage has the necessary metadata.
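
One way to read this suggestion is sketched below. The `TaskInput` and `FilingRecord` shapes, the `needsFilingRecord` and `resolveMetadata` helpers, and the `loadFiling` callback are all hypothetical stand-ins, since the task's real types and repository API are not shown in this thread.

```typescript
// Hypothetical stand-ins for the task input and the stored filing record.
interface TaskInput {
  accession_number: string;
  cik?: number;
  form?: string;
  fileName?: string;
  filing_date?: string;
  file_number?: string;
  items?: string;
  report_date?: string;
}

interface FilingRecord {
  form: string | null;
  filing_date: string;
  file_number: string | null;
  items: string | null;
  report_date: string | null;
  primary_doc: string | null;
}

// Load the filing record whenever ANY field the 8-K storage step needs is
// missing, not only when cik, form, or fileName are missing.
function needsFilingRecord(input: TaskInput): boolean {
  const required = [
    input.cik, input.form, input.fileName, input.filing_date,
    input.file_number, input.items, input.report_date,
  ];
  return required.some((value) => value === undefined);
}

// Fill gaps from the filing record; values supplied by the caller win.
function resolveMetadata(
  input: TaskInput,
  loadFiling: (accession: string) => FilingRecord | undefined,
): TaskInput {
  if (!needsFilingRecord(input)) return input;
  const filing = loadFiling(input.accession_number);
  if (!filing) return input;
  return {
    ...input,
    form: input.form ?? filing.form ?? undefined,
    filing_date: input.filing_date ?? filing.filing_date,
    file_number: input.file_number ?? filing.file_number ?? undefined,
    items: input.items ?? filing.items ?? undefined,
    report_date: input.report_date ?? filing.report_date ?? undefined,
    fileName: input.fileName ?? filing.primary_doc ?? undefined,
  };
}
```

With this shape, a caller that already supplies cik, form, and fileName still gets items and report_date filled in, which is the case the comment identifies as broken.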

FORM_8K_EVENT_REPOSITORY_TOKEN,
createStorage("form_8k_events", Form8KEventSchema, Form8KEventPrimaryKeyNames, [
["cik", "filing_date"],
["item_code"],

Copilot AI Mar 7, 2026

The form_8k_events storage is indexed on ["cik","filing_date"] and ["item_code"], but Form8KEventRepo.getEventsByAccession() queries by { cik, accession_number }. Without an index that includes accession_number this query will likely degrade to a full scan. Add an index such as ["cik","accession_number"] (and optionally ["accession_number"] / ["cik"] depending on expected query patterns).

Suggested change
- ["item_code"],
+ ["item_code"],
+ ["cik", "accession_number"],
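
The reasoning behind this suggestion can be checked with a small sketch. How the storage layer actually selects an index is not shown in this thread, so the rule used here (an index is usable only when every one of its columns is constrained by the query) is a deliberately strict simplification, and `hasUsableIndex` is a hypothetical helper rather than part of the storage API.

```typescript
type IndexColumns = readonly string[];

// Simplified rule (an assumption): an index can serve an equality query
// only if all of its columns appear among the query's keys.
function hasUsableIndex(
  indexes: readonly IndexColumns[],
  queryKeys: readonly string[],
): boolean {
  const keys = new Set(queryKeys);
  return indexes.some(
    (index) => index.length > 0 && index.every((column) => keys.has(column)),
  );
}

// Index lists before and after the suggested change.
const indexesBefore: IndexColumns[] = [["cik", "filing_date"], ["item_code"]];
const indexesAfter: IndexColumns[] = [...indexesBefore, ["cik", "accession_number"]];

// Keys used by a lookup of the form { cik, accession_number }.
const accessionQuery = ["cik", "accession_number"];

console.log(hasUsableIndex(indexesBefore, accessionQuery));
console.log(hasUsableIndex(indexesAfter, accessionQuery));
```

Under this rule, neither original index serves the `{ cik, accession_number }` lookup, while the added `["cik", "accession_number"]` index does.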

FORM_8K_EVENT_REPOSITORY_TOKEN,
new InMemoryTabularStorage(Form8KEventSchema, Form8KEventPrimaryKeyNames, [
["cik", "filing_date"],
["item_code"],

Copilot AI Mar 7, 2026

The in-memory Form 8-K event storage is indexed on ["cik","filing_date"] and ["item_code"], but tests/repo APIs query by { cik, accession_number }. Add an index including accession_number (e.g. ["cik","accession_number"]) so getEventsByAccession() doesn't require scanning all rows.

Suggested change
- ["item_code"],
+ ["item_code"],
+ ["cik", "accession_number"],

sroussey commented Mar 7, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Copilot AI commented Mar 7, 2026

@sroussey I've opened a new pull request, #69, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: sroussey <127349+sroussey@users.noreply.github.com>