Problem/Motivation

PDFa11y currently implements 7 accessibility checks covering the structural foundation of PDF accessibility: tagging, structure tree, document title, language, PDF version, heading hierarchy, and filename-style titles. These catch the most common blockers but leave significant gaps that the Sa11y and Editoria11y JavaScript libraries identify for HTML content.

Both libraries share the same underlying ruleset. Reviewing their full inventory of HTML checks against the PDF structure tree (tags, attributes, metadata) that smalot/pdfparser exposes identified 10 checks that translate directly and 9 that translate partially, with caveats, for a total of 19 new checks.

Proposed resolution

Implement all 19 checks as plugins in src/Plugin/AccessibilityCheck/, each extending AccessibilityCheckBase with an #[AccessibilityCheck] attribute and implementing check(Document $document, string $fileUri): AccessibilityCheckResult. Directly translating checks are straightforward; partially translating checks are noted with caveats that inform their implementation.

Directly translating checks (10)

These have exact structural equivalents in the PDF tag tree and are implementable with the current smalot/pdfparser API without approximation.

Heading completeness (4 checks)

All four walk the same H1–H6 tag structure that HeadingStructureCheck already traverses. They are natural companions to the existing check and can share a heading-collection helper.

  • heading_missing_h1 — Fails if no H1 element exists anywhere in the structure tree. PDF/UA-1 requires H1 to be the first heading when heading tags are used, so a document without an H1 has no valid top-level heading. Implementation: walk the structure tree collecting heading tags; if none are H1, fail.
  • heading_first — Fails if the first heading encountered in reading order is not H1 or H2. A document that opens with an H3 disorients screen reader users who navigate by heading. Implementation: find the first heading tag in the structure tree traversal and check its level.
  • heading_empty — Fails if any H1–H6 tag has no text content (no ActualText attribute and no text in its child content items). Empty headings are announced as "heading level N" with no description. Implementation: for each heading tag, collect text content recursively; fail if blank.
  • heading_long — Warns if any heading exceeds 160 characters. Long headings are usually paragraphs incorrectly tagged as headings. Implementation: same text extraction as heading_empty, with a string length check.
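
The four heading checks can be sketched over a flat list of headings in reading order. This is a minimal illustration, not the module's implementation: the function name and the array shape (level plus text) are hypothetical, and collecting that list from the structure tree is assumed to happen in the shared heading-collection helper. It also assumes the mbstring extension is available.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: evaluates the four heading checks.
 *
 * @param array<int, array{level: int, text: string}> $headings
 *   Headings in reading order (assumed to be collected elsewhere).
 * @param int $maxLength
 *   Character limit for heading_long (160 in this proposal).
 *
 * @return string[]
 *   IDs of the checks that fired, without duplicates.
 */
function pdfa11y_sketch_heading_issues(array $headings, int $maxLength = 160): array {
  $issues = [];
  $levels = array_column($headings, 'level');

  // heading_missing_h1: no H1 anywhere in the document.
  if (!in_array(1, $levels, TRUE)) {
    $issues[] = 'heading_missing_h1';
  }

  // heading_first: the first heading in reading order must be H1 or H2.
  if ($levels !== [] && $levels[0] > 2) {
    $issues[] = 'heading_first';
  }

  foreach ($headings as $heading) {
    // trim() only strips ASCII whitespace; enough for a sketch.
    $text = trim($heading['text']);
    if ($text === '') {
      $issues[] = 'heading_empty';
    }
    elseif (mb_strlen($text) > $maxLength) {
      $issues[] = 'heading_long';
    }
  }

  return array_values(array_unique($issues));
}
```

A real plugin would return one AccessibilityCheckResult per check rather than a list of IDs; the list form is only there to keep the example short.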

Figure alt text (4 checks)

PDF Figure structure tags carry an /Alt entry in their structure element dictionary, the direct equivalent of the HTML alt attribute. This is a PDF/UA requirement. All four checks read the same /Alt string from Figure tags.

  • figure_missing_alt — Fails if any Figure tag has no /Alt attribute and is not marked as decorative (that is, it does not carry an empty /ActualText). This is one of the most common PDF accessibility failures and a direct PDF/UA requirement. Implementation: iterate structure tags of type Figure; fail for any that lacks /Alt and is not marked decorative.
  • figure_alt_filename — Fails if a Figure tag's /Alt text looks like a filename or file path (matches a pattern like /\.(png|jpg|gif|bmp|svg|tiff?|webp)$/i or contains path separators). Adapts the same heuristic already used by DocumentTitleFilenameCheck.
  • figure_alt_placeholder — Fails if /Alt consists solely of generic stop words: "image", "photo", "picture", "graphic", "icon", "logo", "screenshot", or similar. These provide no useful information to screen reader users. Implementation: case-insensitive comparison against a stop-word list.
  • figure_alt_too_long — Warns if /Alt exceeds 250 characters. Overly long alt text belongs in a caption or body text, not the alt attribute. Implementation: string length check on the /Alt value.
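
As a rough sketch of how the four /Alt checks might share one classifier: the function below takes the /Alt string (NULL when the entry is absent) and returns the ID of the first check that fires. The function name, the decorative flag, and the choice to treat a whitespace-only /Alt like a missing one are assumptions made for illustration. The extension pattern and the stop-word list are copied from the descriptions above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: classifies one Figure's /Alt string.
 *
 * @param string|null $alt
 *   The /Alt value, or NULL when the Figure has no /Alt entry.
 * @param bool $decorative
 *   Whether the Figure was identified as decorative (detected elsewhere).
 * @param int $maxLength
 *   Character limit for figure_alt_too_long (250 in this proposal).
 *
 * @return string|null
 *   A check ID from this proposal, or NULL when nothing fires.
 */
function pdfa11y_sketch_figure_alt_issue(?string $alt, bool $decorative = FALSE, int $maxLength = 250): ?string {
  // Assumption: a whitespace-only /Alt is treated like a missing one.
  $alt = $alt === NULL ? '' : trim($alt);
  if ($alt === '') {
    return $decorative ? NULL : 'figure_missing_alt';
  }

  // figure_alt_filename: image extension at the end, or a path separator.
  if (preg_match('/\.(png|jpg|gif|bmp|svg|tiff?|webp)$/i', $alt) === 1
    || strpbrk($alt, '/\\') !== FALSE) {
    return 'figure_alt_filename';
  }

  // figure_alt_placeholder: every word is a generic stop word.
  $placeholders = ['image', 'photo', 'picture', 'graphic', 'icon', 'logo', 'screenshot'];
  $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($alt), -1, PREG_SPLIT_NO_EMPTY) ?: [];
  if ($words !== [] && array_diff($words, $placeholders) === []) {
    return 'figure_alt_placeholder';
  }

  // figure_alt_too_long: a warning rather than a failure in this proposal.
  if (mb_strlen($alt) > $maxLength) {
    return 'figure_alt_too_long';
  }

  return NULL;
}
```

Note that the path-separator test is blunt: alt text such as "Input/output diagram" would be flagged as a filename, so a real check may want a stricter path pattern.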

Table structure (2 checks)

PDF tagged tables use Table > TR > TH / TD structure elements, optionally grouped by THead, TBody, and TFoot, closely mirroring HTML table semantics.

  • table_missing_headers — Fails if a Table structure element contains no TH descendants. A table with only TD cells cannot be navigated meaningfully by screen readers. Implementation: for each Table tag, check for at least one TH descendant at any depth.
  • table_empty_header — Fails if any TH tag has no text content. An empty header cell is announced as a column or row label with no information. Implementation: for each TH tag, recursively collect text content and fail if blank.
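
Both table checks reduce to a recursive walk. The sketch below runs on a plain nested array standing in for the structure tree; the node shape (tag, text, children) and all function names are hypothetical and do not reflect the smalot/pdfparser API.

```php
<?php

declare(strict_types=1);

/**
 * Yields every descendant of $node with the given tag, depth first.
 *
 * Hypothetical node shape:
 * ['tag' => string, 'text' => string, 'children' => array].
 */
function pdfa11y_sketch_descendants(array $node, string $tag): \Generator {
  foreach ($node['children'] ?? [] as $child) {
    if (($child['tag'] ?? '') === $tag) {
      yield $child;
    }
    yield from pdfa11y_sketch_descendants($child, $tag);
  }
}

/**
 * Recursively collects the text of a node and its children.
 */
function pdfa11y_sketch_node_text(array $node): string {
  $text = $node['text'] ?? '';
  foreach ($node['children'] ?? [] as $child) {
    $text .= ' ' . pdfa11y_sketch_node_text($child);
  }
  return trim($text);
}

/**
 * Returns the check IDs that fire for one Table node.
 */
function pdfa11y_sketch_table_issues(array $table): array {
  // Pass FALSE so keys from nested generators do not overwrite each other.
  $headers = iterator_to_array(pdfa11y_sketch_descendants($table, 'TH'), FALSE);

  // table_missing_headers: no TH at any depth.
  if ($headers === []) {
    return ['table_missing_headers'];
  }

  // table_empty_header: at least one TH with no text content.
  foreach ($headers as $th) {
    if (pdfa11y_sketch_node_text($th) === '') {
      return ['table_empty_header'];
    }
  }

  return [];
}
```

One limitation of this sketch: a TH inside a nested table also counts as a header of the outer table, which a real check would probably want to exclude.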

Partially translating checks (9)

These checks address concepts that exist in PDFs but require approximation or produce higher false-positive rates than their HTML equivalents. Caveats are noted to inform implementation decisions.

Link quality (3 checks)

PDF Link structure tags contain text content (the visible link text) and optionally an /Alt attribute. Text extraction from Link tags varies in reliability across authoring tools.

  • link_empty — Fails if a Link tag has no text content and no /Alt. Caveat: some authoring tools produce Link tags wrapping non-text annotations; filter by checking for annotation children before flagging.
  • link_stopword — Fails if link text consists entirely of generic phrases: "click here", "read more", "here", "more", "link", "this", etc. The Sa11y stop-word list applies directly once link text is extracted. Caveat: apply a minimum-length threshold to avoid flagging legitimately short links.
  • link_url — Warns if link text is a bare URL (starts with http://, https://, or www.). Screen readers often read URLs character by character. Caveat: some PDFs legitimately show URLs as visible text; a warning rather than a failure is appropriate.
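
Once link text has been extracted, the three link checks are plain string tests. The sketch below assumes extraction and the annotation filter have already run; the function name is hypothetical, falling back to /Alt when there is no visible text is an assumption, and the stop-phrase list contains only the phrases named above rather than the full Sa11y list. The minimum-length threshold from the link_stopword caveat is not modelled.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: classifies one Link from its text and /Alt.
 *
 * @return string|null
 *   A check ID from this proposal, or NULL when nothing fires.
 */
function pdfa11y_sketch_link_issue(string $text, ?string $alt = NULL): ?string {
  // Assumption: prefer visible text and fall back to /Alt.
  $label = trim($text) !== '' ? trim($text) : trim((string) $alt);

  // link_empty: neither text content nor /Alt.
  if ($label === '') {
    return 'link_empty';
  }

  // link_url: the label is a bare URL.
  if (preg_match('#^(?:https?://|www\.)#i', $label) === 1) {
    return 'link_url';
  }

  // link_stopword: lowercase, replace punctuation with spaces, then
  // compare the whole label against the generic phrases.
  $stopPhrases = ['click here', 'read more', 'here', 'more', 'link', 'this'];
  $normalized = trim((string) preg_replace('/[^\p{L}\p{N}]+/u', ' ', mb_strtolower($label)));
  if (in_array($normalized, $stopPhrases, TRUE)) {
    return 'link_stopword';
  }

  return NULL;
}
```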

Content quality (4 checks)

String-analysis checks on the text content of structure tags. PDF text extraction can be unreliable for complex multi-column layouts.

  • text_uppercase — Warns if a P tag contains a run of 4 or more consecutive ALL-CAPS words. Sustained all-caps text is harder to read and is often used as a substitute for proper heading structure. Caveat: exclude known abbreviations and acronyms; a minimum run-length threshold reduces false positives.
  • text_fake_list — Warns if 3 or more consecutive P tags begin with manual list-like prefixes ("1.", "a.", "•", "–", "-") rather than using L / LI structure tags. Untagged lists are not navigable by screen readers. Caveat: requires examining runs of sibling P elements, not individual tags.
  • blockquote_as_heading — Warns if a BlockQuote structure tag contains fewer than 25 characters of text. Short blockquotes are almost always headings or callouts that have been mis-tagged. Caveat: some legitimate pull quotes are short; warning rather than failing is appropriate.
  • readability — Reports a Flesch-Kincaid readability score computed from the extracted text of body P tags. Does not fail the document; surfaces the score as informational metadata. Caveat: PDF text extraction order can be unreliable for multi-column layouts. Should be disabled by default and opt-in via the enabled checks settings.
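
The run-based heuristics behind text_uppercase and text_fake_list can be sketched as two small counters. Function names and parameters are hypothetical; the abbreviation allow-list is passed in by the caller, and the list-prefix pattern covers only the prefixes named above ("1.", "a.", "•", "–", "-").

```php
<?php

declare(strict_types=1);

/**
 * Returns TRUE if $text contains $minRun or more consecutive ALL-CAPS words.
 *
 * @param string[] $allowed
 *   Known abbreviations to ignore (hypothetical allow-list).
 */
function pdfa11y_sketch_has_uppercase_run(string $text, int $minRun = 4, array $allowed = []): bool {
  $run = 0;
  $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
  foreach ($words as $word) {
    // Keep letters only, so "NOTICE:" counts as "NOTICE".
    $letters = (string) preg_replace('/[^\p{L}]+/u', '', $word);
    $isCaps = mb_strlen($letters) >= 2
      // Unchanged by uppercasing, but changed by lowercasing: this
      // excludes scripts that have no letter case.
      && mb_strtoupper($letters) === $letters
      && mb_strtolower($letters) !== $letters
      && !in_array($letters, $allowed, TRUE);
    $run = $isCaps ? $run + 1 : 0;
    if ($run >= $minRun) {
      return TRUE;
    }
  }
  return FALSE;
}

/**
 * Returns TRUE if $minRun or more consecutive paragraphs look like list items.
 *
 * @param string[] $paragraphs
 *   Text of sibling P elements, in order.
 */
function pdfa11y_sketch_has_fake_list(array $paragraphs, int $minRun = 3): bool {
  // Prefixes: digits plus ".", one letter plus ".", bullet (U+2022),
  // en dash (U+2013), or hyphen, followed by whitespace and content.
  $pattern = '/^\s*(?:\d+\.|[A-Za-z]\.|[\x{2022}\x{2013}\-])\s+\S/u';
  $run = 0;
  foreach ($paragraphs as $paragraph) {
    $run = preg_match($pattern, $paragraph) === 1 ? $run + 1 : 0;
    if ($run >= $minRun) {
      return TRUE;
    }
  }
  return FALSE;
}
```

In this sketch an allow-listed abbreviation resets the run rather than being skipped, and a paragraph that opens with an initial such as "A. Smith" matches the letter prefix. Both are reasons to keep these checks as warnings.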

Figure caption duplication (1 check)

  • figure_duplicate_alt — Warns if a Figure tag's /Alt text duplicates the text content of an adjacent Caption tag. Screen readers announce both, so duplication is redundant. Caveat: Caption association with Figure is not always explicit; check both sibling and parent–child relationships.
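
One possible reading of "duplicates" is equality after normalizing case, punctuation, and whitespace. The sketch below assumes the Caption has already been associated with the Figure; both function names are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Lowercases, replaces punctuation with spaces, and collapses whitespace.
 */
function pdfa11y_sketch_normalize_text(string $text): string {
  $text = (string) preg_replace('/[^\p{L}\p{N}]+/u', ' ', mb_strtolower($text));
  return trim($text);
}

/**
 * Returns TRUE when a Figure's /Alt repeats its Caption text.
 */
function pdfa11y_sketch_alt_duplicates_caption(string $alt, string $caption): bool {
  $normalizedAlt = pdfa11y_sketch_normalize_text($alt);
  // Two empty strings are not treated as a duplicate.
  return $normalizedAlt !== ''
    && $normalizedAlt === pdfa11y_sketch_normalize_text($caption);
}
```

Exact equality is deliberately conservative. A looser rule, such as the caption containing the alt text, would catch more cases at the cost of more false positives.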

Table header association (1 check)

  • table_headers_ref — Fails if TD tags carry a /Headers attribute referencing TH IDs that do not exist in the table. This is the PDF equivalent of an HTML headers attribute pointing to a nonexistent id. Caveat: authoring tools rarely populate the /Headers attribute; this check mainly benefits highly structured documents.
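
The reference check itself is a set difference per cell. In the sketch below the TH IDs and the /Headers lists are assumed to have been read from one table already; the function name and the cell labels used as array keys are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Finds /Headers references that do not resolve to a TH ID.
 *
 * @param string[] $thIds
 *   IDs declared on the TH elements of one table.
 * @param array<string, string[]> $tdHeaders
 *   For each TD (keyed by any label), the IDs listed in its /Headers.
 *
 * @return array<string, string[]>
 *   Unresolved IDs per TD; empty when every reference resolves.
 */
function pdfa11y_sketch_dangling_header_refs(array $thIds, array $tdHeaders): array {
  $dangling = [];
  foreach ($tdHeaders as $cell => $refs) {
    $missing = array_values(array_diff($refs, $thIds));
    if ($missing !== []) {
      $dangling[$cell] = $missing;
    }
  }
  return $dangling;
}
```

A TD with no /Headers attribute simply contributes an empty list, so tables that never use the attribute pass without noise.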

Remaining tasks

  • Implement all 10 directly translating checks (heading completeness, figure alt text, table structure)
  • Implement all 9 partially translating checks (link quality, content quality, figure caption, table header association)
  • Add kernel tests for each new check using appropriate PDF fixtures (new fixtures required for figures-with-alt, tables, links, etc.)
  • Add each new check ID to config/install/pdfa11y.settings.yml and config/schema/pdfa11y.schema.yml; readability disabled by default, all others enabled
  • Review
  • Merge

User interface changes

19 new checks appear in the Enabled checks list on the settings form (18 enabled by default; readability opt-in). Each check produces a new failure or warning message in the inline upload widget and the per-media report.

One thing giving me pause is that PDFa11y checks are currently pass or fail, with a single universal option to "block" upload of PDFs that have any failing test. Many of these new tests require review and judgement to decide whether they truly represent an accessibility risk. To add all of them, I think we need an option similar to Editoria11y's that lets trained editors dismiss false positives or, at the very least, a way to configure each PDFa11y check as warn, block, or disabled instead of just enabled or not. This could be a select list per test instead of a checkbox.
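
To make the warn / block / disabled idea concrete, here is a rough sketch of how an upload decision could be derived from per-check modes. Nothing here exists in the module: the function name, the mode strings, and the shape of the settings array are all hypothetical, and unlisted checks defaulting to "warn" is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Decides whether an upload should be blocked.
 *
 * @param array<string, string> $modes
 *   Hypothetical per-check setting: check ID => 'disabled', 'warn' or 'block'.
 * @param string[] $failedCheckIds
 *   IDs of the checks that fired for this PDF.
 */
function pdfa11y_sketch_should_block_upload(array $modes, array $failedCheckIds): bool {
  foreach ($failedCheckIds as $id) {
    // Assumption: checks missing from the settings default to "warn".
    if (($modes[$id] ?? 'warn') === 'block') {
      return TRUE;
    }
  }
  return FALSE;
}
```

With settings such as figure_missing_alt set to block and link_url set to warn, a PDF that only trips link_url would upload with a warning, while one missing alt text would be rejected.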

API changes

None. New checks are self-contained plugins.

Data model changes

None. New check IDs are stored in the existing pdfa11y_results table using the existing schema.

Comments

joshuami created an issue. See original summary.

Title: Add accessibility checks adapted similar to Editoria11y test library » Add accessibility checks similar to Editoria11y test library