Building a Combined CIC Financial Dataset
Merging XBRL and PDF-extracted accounts for 34,244 Community Interest Companies
Background
Community Interest Companies (CICs) file annual accounts at Companies House, just like other limited companies. These accounts include standard financial statements — balance sheets and profit-and-loss accounts — plus a CIC34 community interest report detailing the company's activities, stakeholder consultations, director remuneration, and any asset transfers.
These filings exist in two machine-readable forms: XBRL (structured data embedded directly in the filing) and PDF (the original document image). Each source has different coverage and strengths. XBRL data is precise and standardised but not all CICs file in XBRL format. PDF extractions can capture filings that lack XBRL data but require mapping unstructured labels to a common schema. Combining both sources produces the most complete picture of CIC finances available from public data.
Data Sources
XBRL Data
Collected from Companies House monthly data releases (2006–2025), parsed with stream-read-xbrl, and filtered to CICs using the CSO Spine. This provides 49,471 current-year filing-years from 15,956 unique CICs. XBRL data offers strong financial fill rates for balance sheet and profit-and-loss items. CIC34 narrative coverage is around 79%, as not all XBRL taxonomies include these fields.
PDF Extraction
84,084 CIC accounts PDFs processed via the OpenAI gpt-4.1-mini Batch API with a structured JSON schema. This covers 33,157 unique CICs (2014–2025). PDF extraction achieves a higher CIC34 narrative fill rate (~89%), since the narrative sections are visible in the document even when not tagged in XBRL. However, unstructured source labels must be mapped to a common schema before the data can be combined.
Coverage Comparison
The Conversion Pipeline
To combine both sources, PDF extractions must be mapped into the same 58-column schema as XBRL data. The 23,223 filing-years present in both sources serve as ground truth. A 4-step matching process was used: exact value matching, sign-adjusted matching for expenses and liabilities, label-to-column text similarity, and artifact filtering. This produced 1,217 approved mapping rules.
Data Quality
For the 23,223 filing-years present in both sources, values typically agree for well-populated fields such as cash at bank and creditors. Some disagreement (~20%) exists due to label ambiguity and differences in how multi-year filings are handled.
Expenses and liabilities (such as creditors, administrative expenses, and cost of sales) are stored as positive magnitudes in both datasets, following XBRL convention. Nine sign-aware columns are normalised automatically during conversion.
The Combined Dataset
The combined dataset uses a 58-column schema covering company identity, balance
sheet items, profit and loss items, CIC34 narrative fields, and filing metadata. Each
row represents one company-year, with a from_prior_year flag
distinguishing current and comparative figures.
| Category | Field | Description |
|---|---|---|
| Identity | ||
| Identity | uid | Unique identifier from the CSO Spine (e.g. GB-COH-12345678) |
| Identity | coyno | Companies House company number (8-digit, zero-padded) |
| Identity | entity_current_legal_name | Current registered name of the company |
| Identity | balance_sheet_date | Balance sheet date (YYYY-MM-DD) defining the filing year |
| Identity | fy | Financial year derived from the balance sheet date |
| Balance Sheet | ||
| Balance Sheet | tangible_fixed_assets | Net book value of tangible fixed assets |
| Balance Sheet | debtors | Amounts owed to the company (trade debtors, prepayments) |
| Balance Sheet | cash_bank_in_hand | Cash and bank balances at the balance sheet date |
| Balance Sheet | current_assets | Total current assets |
| Balance Sheet | creditors_due_within_one_year | Amounts owed falling due within one year (positive magnitude) |
| Balance Sheet | creditors_due_after_one_year | Amounts owed falling due after more than one year (positive magnitude) |
| Balance Sheet | net_current_assets_liabilities | Current assets minus current liabilities |
| Balance Sheet | total_assets_less_current_liabilities | Total assets minus current liabilities |
| Balance Sheet | net_assets_liabilities_including_pension_asset_liability | Net assets or liabilities including pension adjustments |
| Balance Sheet | called_up_share_capital | Called-up share capital |
| Balance Sheet | shareholder_funds | Total shareholder funds (equity) |
| Profit & Loss | ||
| Profit & Loss | turnover_gross_operating_revenue | Total turnover or gross operating revenue |
| Profit & Loss | other_operating_income | Other operating income not included in turnover |
| Profit & Loss | cost_sales | Cost of sales (positive magnitude) |
| Profit & Loss | gross_profit_loss | Gross profit or loss |
| Profit & Loss | administrative_expenses | Administrative expenses (positive magnitude) |
| Profit & Loss | staff_costs | Total staff costs including wages, NI, and pensions |
| Profit & Loss | wages_salaries | Wages and salaries component of staff costs |
| Profit & Loss | operating_profit_loss | Operating profit or loss |
| Profit & Loss | profit_loss_on_ordinary_activities_before_tax | Profit or loss on ordinary activities before taxation |
| Profit & Loss | tax_on_profit_or_loss_on_ordinary_activities | Tax charge on ordinary activities (positive magnitude) |
| Profit & Loss | profit_loss_for_period | Net profit or loss for the financial period |
| Profit & Loss | government_grant_income | Income from government grants |
| CIC34 Narratives | ||
| CIC34 | cic34_activities_impact | Description of the company's activities and their community impact |
| CIC34 | cic34_stakeholder_consultation | How the company consulted stakeholders |
| CIC34 | cic34_directors_remuneration | Details of directors' remuneration and benefits |
| CIC34 | cic34_asset_transfer | Record of any asset transfers during the period |
| Metadata | ||
| Metadata | source_file | Original filing filename (XBRL or PDF) |
| Metadata | file_type | Filing format: html (XBRL) or pdf |
| Metadata | from_prior_year | True if this row contains comparative (prior year) figures |
| Metadata | overlap_with_xbrl | True if this filing-year also exists in the XBRL dataset |
CIC34 Narrative Coverage
These narrative fields enable research into CIC activities, governance practices, and community impact. Combined with the financial data, the dataset supports analysis of workforce patterns, director remuneration, and how CICs describe what they do.