Combined CIC Financial Dataset Report

Building a Combined CIC Financial Dataset

February 2026

Merging XBRL and PDF-extracted accounts for 34,244 Community Interest Companies

Background

Community Interest Companies (CICs) file annual accounts at Companies House, just like other limited companies. These accounts include standard financial statements — balance sheets and profit-and-loss accounts — plus a CIC34 community interest report detailing the company's activities, stakeholder consultations, director remuneration, and any asset transfers.

These filings are available in two forms: XBRL (structured data embedded directly in the filing) and PDF (the original document image). The two sources differ in coverage and strengths. XBRL data is precise and standardised, but not all CICs file in XBRL format. PDF extraction can capture filings that lack XBRL data, but it requires mapping unstructured labels to a common schema. Combining both sources produces the most complete picture of CIC finances available from public data.

34,244 Unique CICs
101,971 Combined Filing-Years
29 Financial Fields
4 CIC34 Narrative Fields

Data Sources

XBRL Data

XBRL filings were collected from Companies House monthly data releases (2006–2025), parsed with stream-read-xbrl, and filtered to CICs using the CSO Spine. This provides 49,471 current-year filing-years from 15,956 unique CICs. XBRL data offers strong financial fill rates for balance sheet and profit-and-loss items. CIC34 narrative coverage is around 79%, as not all XBRL taxonomies include these fields.
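The filtering step can be illustrated with a minimal sketch. It assumes the parsed filings are available as dictionaries carrying a coyno field and that the CSO Spine supplies a list of CIC company numbers. The function names and sample values are hypothetical, and the parsing with stream-read-xbrl is not shown.

```python
def normalise_company_number(raw: str) -> str:
    """Zero-pad a Companies House number to 8 characters ('7654321' -> '07654321')."""
    return raw.strip().upper().zfill(8)


def filter_to_cics(filings: list, spine_numbers: list) -> list:
    """Keep only filings whose company number appears in the spine list."""
    wanted = {normalise_company_number(n) for n in spine_numbers}
    return [f for f in filings if normalise_company_number(f["coyno"]) in wanted]


# Hypothetical sample data: two CICs in the spine, three parsed filings.
spine_numbers = ["12345678", "7654321"]
filings = [
    {"coyno": "12345678", "balance_sheet_date": "2023-03-31"},
    {"coyno": "07654321", "balance_sheet_date": "2023-12-31"},
    {"coyno": "00000001", "balance_sheet_date": "2023-12-31"},
]

cic_filings = filter_to_cics(filings, spine_numbers)
print([f["coyno"] for f in cic_filings])
```

Normalising both sides to the 8-character form avoids missing a match when one source drops leading zeros.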

PDF Extraction

84,084 CIC accounts PDFs were processed via the OpenAI gpt-4.1-mini Batch API with a structured JSON schema. This covers 33,157 unique CICs (2014–2025). PDF extraction achieves a higher CIC34 narrative fill rate (~89%), since the narrative sections are visible in the document even when not tagged in XBRL. However, unstructured source labels must be mapped to a common schema before the data can be combined.

Coverage Comparison

Figure: Source overlap chart. Filing-year coverage across both data sources.

The Conversion Pipeline

To combine both sources, PDF extractions must be mapped into the same 58-column schema as XBRL data. The 23,223 filing-years present in both sources serve as ground truth. A 4-step matching process was used: exact value matching, sign-adjusted matching for expenses and liabilities, label-to-column text similarity, and artifact filtering. This produced 1,217 approved mapping rules.
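One way to picture the matching cascade is the sketch below, which tries to assign a single PDF label to an XBRL column for one overlapping filing-year. It is an illustration under stated assumptions, not the pipeline's actual code. The function names, the 0.6 similarity threshold, the three-column sign-aware subset, and the reading of "artifact" as a missing or zero value are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Illustrative subset only; the report normalises nine sign-aware columns.
SIGN_AWARE = {"creditors_due_within_one_year", "cost_sales", "administrative_expenses"}


def label_similarity(label: str, column: str) -> float:
    """Text similarity between a PDF label and a schema column name (0.0 to 1.0)."""
    a = label.lower().replace("_", " ")
    b = column.lower().replace("_", " ")
    return SequenceMatcher(None, a, b).ratio()


def match_label(label: str, pdf_value, xbrl_row: dict,
                min_similarity: float = 0.6) -> Optional[Tuple[str, str]]:
    """Return (column, rule) for one PDF label, or None if nothing matches."""
    # Artifact filtering, applied here as an up-front guard (assumption:
    # missing and zero values are artifacts because they match spuriously).
    if pdf_value is None or pdf_value == 0:
        return None
    # Exact value matches, then sign-flipped matches on sign-aware columns.
    exact = [c for c, v in xbrl_row.items() if v == pdf_value]
    flipped = [c for c, v in xbrl_row.items() if c in SIGN_AWARE and v == -pdf_value]
    for rule, candidates in (("exact", exact), ("sign_adjusted", flipped)):
        if candidates:
            # If several columns share the value, break the tie on label text.
            best = max(candidates, key=lambda c: label_similarity(label, c))
            return best, rule
    # Fall back to label-to-column text similarity alone.
    best = max(xbrl_row, key=lambda c: label_similarity(label, c))
    if label_similarity(label, best) >= min_similarity:
        return best, "label_similarity"
    return None


# Hypothetical XBRL values for one overlapping filing-year.
xbrl_row = {"cash_bank_in_hand": 5000, "cost_sales": 1200, "debtors": 300}
print(match_label("Cash at bank and in hand", 5000, xbrl_row))
print(match_label("Cost of sales", -1200, xbrl_row))
print(match_label("Trade debtors", 310, xbrl_row))
```

In a full pipeline, matches like these would presumably be aggregated across all overlapping filing-years and reviewed before a label-to-column rule is approved.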

Pipeline diagram:

  • XBRL Source: 49,471 CIC filing-years
  • PDF Source: 84,084 CIC filing-years
  • Value + Label Matching: 4-step cascade

  1. Find Overlaps. Ground truth: 23,223 filing-years in both sources
  2. Build Label Mapping. Result: 1,217 label-to-column mapping rules
  3. Convert All PDF Extractions. Scale: 84,084 files → 145,491 rows
  4. Combined Dataset. Identical 58-column schema, ready to merge with sign conventions normalised across sources

Data Quality

Figure: Fill rate comparison chart. Fill rates for 29 financial columns across both data sources (current-year CIC filings only).

For the 23,223 filing-years present in both sources, values typically agree for well-populated fields such as cash at bank and creditors. Roughly 20% of values disagree, owing to label ambiguity and differences in how multi-year filings are handled.
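A per-field agreement rate of this kind could be computed along the following lines. This is a minimal sketch that assumes overlapping filing-years have already been paired up as (XBRL row, PDF row) tuples. The function name, the tolerance parameter, and the sample values are hypothetical.

```python
def agreement_rate(pairs, field, tolerance=0):
    """Share of paired filing-years where both sources report `field` and agree.

    Pairs where either source lacks the field are skipped, not counted as
    disagreements. Returns None if nothing can be compared.
    """
    compared = 0
    agreed = 0
    for xbrl_row, pdf_row in pairs:
        x = xbrl_row.get(field)
        p = pdf_row.get(field)
        if x is None or p is None:
            continue
        compared += 1
        if abs(x - p) <= tolerance:
            agreed += 1
    return agreed / compared if compared else None


# Hypothetical pairs: four agree, one disagrees, one is missing in the PDF row.
pairs = [
    ({"cash_bank_in_hand": 5000}, {"cash_bank_in_hand": 5000}),
    ({"cash_bank_in_hand": 120}, {"cash_bank_in_hand": 120}),
    ({"cash_bank_in_hand": 0}, {"cash_bank_in_hand": 0}),
    ({"cash_bank_in_hand": 875}, {"cash_bank_in_hand": 875}),
    ({"cash_bank_in_hand": 4300}, {"cash_bank_in_hand": 3400}),
    ({"cash_bank_in_hand": 990}, {}),
]

rate = agreement_rate(pairs, "cash_bank_in_hand")
print(rate)  # 0.8
```

Skipping pairs with a missing value keeps fill-rate differences from being counted as value disagreements.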

Expenses and liabilities (such as creditors, administrative expenses, and cost of sales) are stored as positive magnitudes in both datasets, following XBRL convention. Nine sign-aware columns are normalised automatically during conversion.
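The sign normalisation can be sketched as follows. The column list is an assumption: it contains only the five columns described as "positive magnitude" in the field table, whereas the report states that nine columns are sign-aware. The function name is hypothetical.

```python
# Assumed subset: the five columns marked "(positive magnitude)" in the field table.
SIGN_AWARE_COLUMNS = (
    "creditors_due_within_one_year",
    "creditors_due_after_one_year",
    "cost_sales",
    "administrative_expenses",
    "tax_on_profit_or_loss_on_ordinary_activities",
)


def normalise_signs(row: dict) -> dict:
    """Return a copy of the row with sign-aware columns stored as positive magnitudes."""
    out = dict(row)
    for column in SIGN_AWARE_COLUMNS:
        value = out.get(column)
        if value is not None:
            out[column] = abs(value)
    return out


# A PDF extraction may carry expenses as negatives (e.g. figures shown in brackets).
pdf_row = {"cost_sales": -1200, "administrative_expenses": 450, "profit_loss_for_period": -50}
normalised = normalise_signs(pdf_row)
print(normalised)
```

Columns that can legitimately be negative, such as profit_loss_for_period, are left untouched so that a loss remains distinguishable from a profit.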

The Combined Dataset

The combined dataset uses a 58-column schema covering company identity, balance sheet items, profit and loss items, CIC34 narrative fields, and filing metadata. Each row represents one company-year, with a from_prior_year flag distinguishing current and comparative figures.

Category | Field | Description
Identity | uid | Unique identifier from the CSO Spine (e.g. GB-COH-12345678)
Identity | coyno | Companies House company number (8-digit, zero-padded)
Identity | entity_current_legal_name | Current registered name of the company
Identity | balance_sheet_date | Balance sheet date (YYYY-MM-DD) defining the filing year
Identity | fy | Financial year derived from the balance sheet date
Balance Sheet | tangible_fixed_assets | Net book value of tangible fixed assets
Balance Sheet | debtors | Amounts owed to the company (trade debtors, prepayments)
Balance Sheet | cash_bank_in_hand | Cash and bank balances at the balance sheet date
Balance Sheet | current_assets | Total current assets
Balance Sheet | creditors_due_within_one_year | Amounts owed falling due within one year (positive magnitude)
Balance Sheet | creditors_due_after_one_year | Amounts owed falling due after more than one year (positive magnitude)
Balance Sheet | net_current_assets_liabilities | Current assets minus current liabilities
Balance Sheet | total_assets_less_current_liabilities | Total assets minus current liabilities
Balance Sheet | net_assets_liabilities_including_pension_asset_liability | Net assets or liabilities including pension adjustments
Balance Sheet | called_up_share_capital | Called-up share capital
Balance Sheet | shareholder_funds | Total shareholder funds (equity)
Profit & Loss | turnover_gross_operating_revenue | Total turnover or gross operating revenue
Profit & Loss | other_operating_income | Other operating income not included in turnover
Profit & Loss | cost_sales | Cost of sales (positive magnitude)
Profit & Loss | gross_profit_loss | Gross profit or loss
Profit & Loss | administrative_expenses | Administrative expenses (positive magnitude)
Profit & Loss | staff_costs | Total staff costs including wages, NI, and pensions
Profit & Loss | wages_salaries | Wages and salaries component of staff costs
Profit & Loss | operating_profit_loss | Operating profit or loss
Profit & Loss | profit_loss_on_ordinary_activities_before_tax | Profit or loss on ordinary activities before taxation
Profit & Loss | tax_on_profit_or_loss_on_ordinary_activities | Tax charge on ordinary activities (positive magnitude)
Profit & Loss | profit_loss_for_period | Net profit or loss for the financial period
Profit & Loss | government_grant_income | Income from government grants
CIC34 | cic34_activities_impact | Description of the company's activities and their community impact
CIC34 | cic34_stakeholder_consultation | How the company consulted stakeholders
CIC34 | cic34_directors_remuneration | Details of directors' remuneration and benefits
CIC34 | cic34_asset_transfer | Record of any asset transfers during the period
Metadata | source_file | Original filing filename (XBRL or PDF)
Metadata | file_type | Filing format: html (XBRL) or pdf
Metadata | from_prior_year | True if this row contains comparative (prior year) figures
Metadata | overlap_with_xbrl | True if this filing-year also exists in the XBRL dataset
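Because both sources share one schema, combining them reduces to a keyed union. The sketch below is one possible approach rather than the documented merge rule. It assumes a filing-year is identified by (coyno, balance_sheet_date), that only current-year rows are kept, and that the XBRL row is preferred where both sources cover the same filing-year. The report does not state which source takes precedence. The function names are hypothetical.

```python
def filing_year_key(row: dict) -> tuple:
    """Assumed identity of a filing-year: company number plus balance sheet date."""
    return (row["coyno"], row["balance_sheet_date"])


def combine_sources(xbrl_rows: list, pdf_rows: list) -> list:
    """Union of current-year rows, keeping the XBRL row when both sources overlap."""
    combined = [r for r in xbrl_rows if not r["from_prior_year"]]
    xbrl_keys = {filing_year_key(r) for r in combined}
    for row in pdf_rows:
        if row["from_prior_year"]:
            continue  # comparative figures are not counted as separate filing-years
        if filing_year_key(row) in xbrl_keys:
            continue  # overlap: the XBRL row is already present
        combined.append(row)
    return combined


# Hypothetical rows using the schema's identity and metadata fields.
xbrl_rows = [
    {"coyno": "12345678", "balance_sheet_date": "2023-03-31", "file_type": "html", "from_prior_year": False},
    {"coyno": "12345678", "balance_sheet_date": "2022-03-31", "file_type": "html", "from_prior_year": True},
]
pdf_rows = [
    {"coyno": "12345678", "balance_sheet_date": "2023-03-31", "file_type": "pdf", "from_prior_year": False},
    {"coyno": "07654321", "balance_sheet_date": "2023-12-31", "file_type": "pdf", "from_prior_year": False},
    {"coyno": "07654321", "balance_sheet_date": "2022-12-31", "file_type": "pdf", "from_prior_year": True},
]

combined = combine_sources(xbrl_rows, pdf_rows)
print([(r["coyno"], r["file_type"]) for r in combined])
```

A different precedence rule, such as filling gaps in the XBRL row from the PDF row field by field, would need only a change to the overlap branch.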

CIC34 Narrative Coverage

Activities & Impact: 91.7%
Stakeholder Consultation: 90.4%
Director Remuneration: 89.2%
Asset Transfer: 86.9%

These narrative fields enable research into CIC activities, governance practices, and community impact. Combined with the financial data, the dataset supports analysis of workforce patterns, director remuneration, and how CICs describe what they do.
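Coverage figures like those above are fill rates, meaning the share of rows where a field holds usable content. A minimal sketch of that calculation follows. It assumes an empty or whitespace-only string counts as unfilled, and the function name and sample rows are hypothetical.

```python
def narrative_fill_rate(rows: list, field: str):
    """Fraction of rows where a narrative field holds non-blank text (None if no rows)."""
    if not rows:
        return None
    filled = sum(
        1 for r in rows
        if isinstance(r.get(field), str) and r[field].strip()
    )
    return filled / len(rows)


# Hypothetical rows: three filled, one blank.
rows = [
    {"cic34_activities_impact": "Ran weekly youth workshops."},
    {"cic34_activities_impact": "Provided community transport."},
    {"cic34_activities_impact": "   "},
    {"cic34_activities_impact": "Operated a community cafe."},
]

rate = narrative_fill_rate(rows, "cic34_activities_impact")
print(f"{rate:.1%}")  # 75.0%
```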

Next Steps

  • Expand PDF extraction coverage — 72,881 CICs exist in the CSO Spine; 33,157 currently have PDF extractions
  • Improve extraction accuracy — ~20% baseline disagreement between PDF and XBRL values suggests room to refine extraction prompts
  • Recover unmapped labels — ~99,000 unique labels remain unmapped, many representing niche financial items beyond the 29 core columns
  • Merge datasets for downstream analysis — combine XBRL and PDF sources into a single longitudinal panel for CIC research