Problem/Motivation
PDFa11y currently implements 7 accessibility checks covering the structural foundation of PDF accessibility: tagging, structure tree, document title, language, PDF version, heading hierarchy, and filename-style titles. These catch the most common blockers but leave significant gaps that the Sa11y and Editoria11y JavaScript libraries identify for HTML content.
Both libraries share the same underlying ruleset. Reviewing their full inventory of HTML checks against the PDF structure tree (tags, attributes, metadata) that smalot/pdfparser exposes identified 10 checks that translate directly and 9 that translate partially, with caveats, for a total of 19 new checks.
Proposed resolution
Implement all 19 checks as plugins in src/Plugin/AccessibilityCheck/, each extending AccessibilityCheckBase with an #[AccessibilityCheck] attribute and implementing check(Document $document, string $fileUri): AccessibilityCheckResult. Directly translating checks are straightforward; partially translating checks are noted with caveats that inform their implementation.
Directly translating checks (10)
These have exact structural equivalents in the PDF tag tree and are implementable with the current smalot/pdfparser API without approximation.
Heading completeness (4 checks)
All four walk the same H1–H6 tag structure that HeadingStructureCheck already traverses. They are natural companions to the existing check and can share a heading-collection helper.
- heading_missing_h1: Fails if no H1 element exists anywhere in the structure tree. PDF/UA requires a document to have at least one top-level heading. Implementation: walk the structure tree collecting heading tags; if none are H1, fail.
- heading_first: Fails if the first heading encountered in reading order is not H1 or H2. A document that opens with an H3 disorients screen reader users who navigate by heading. Implementation: find the first heading tag in the structure tree traversal and check its level.
- heading_empty: Fails if any H1–H6 tag has no text content (no ActualText attribute and no text in its child content items). Empty headings are announced as "heading level N" with no description. Implementation: for each heading tag, collect text content recursively; fail if blank.
- heading_long: Warns if any heading exceeds 160 characters. Long headings are usually paragraphs incorrectly tagged as headings. Implementation: same text extraction as heading_empty, with a string length check.
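The four heading checks above can be sketched against a flat list of headings. This is only an illustration: the function name `pdfa11y_sketch_heading_issues()` and the `['level' => ..., 'text' => ...]` array shape are hypothetical, and how the list is collected from the parsed PDF is intentionally left out.

```php
<?php

/**
 * Hypothetical helper: runs the four heading checks over headings that were
 * already collected in reading order.
 *
 * Each heading is ['level' => int 1..6, 'text' => string]. This shape is an
 * assumption made for the sketch, not part of PDFa11y or smalot/pdfparser.
 *
 * @return string[]
 *   IDs of the checks that were triggered, without duplicates.
 */
function pdfa11y_sketch_heading_issues(array $headings, int $maxLength = 160): array {
  $issues = [];
  $levels = array_column($headings, 'level');

  // heading_missing_h1: no H1 anywhere in the document.
  if (!in_array(1, $levels, TRUE)) {
    $issues[] = 'heading_missing_h1';
  }

  // heading_first: the first heading in reading order must be H1 or H2.
  if ($levels !== [] && $levels[0] > 2) {
    $issues[] = 'heading_first';
  }

  foreach ($headings as $heading) {
    $text = trim($heading['text']);
    if ($text === '') {
      // heading_empty: nothing left after trimming whitespace.
      $issues[] = 'heading_empty';
    }
    elseif (mb_strlen($text) > $maxLength) {
      // heading_long: likely a paragraph tagged as a heading.
      $issues[] = 'heading_long';
    }
  }

  return array_values(array_unique($issues));
}
```

In a real plugin each check would return its own AccessibilityCheckResult; combining them here only shows that all four can share one collected heading list.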
Figure alt text (4 checks)
PDF Figure structure tags carry an /Alt entry in their structure element dictionary, the direct equivalent of the HTML alt attribute. This is a PDF/UA requirement. All four checks read the same /Alt string from Figure tags.
- figure_missing_alt: Fails if any Figure tag has no /Alt attribute and is not marked as decorative (no empty /ActualText). This is one of the most common PDF accessibility failures and a direct PDF/UA requirement. Implementation: iterate structure tags of type Figure; fail for any missing /Alt.
- figure_alt_filename: Fails if a Figure tag's /Alt text looks like a filename or file path (matches a pattern like /\.(png|jpg|gif|bmp|svg|tiff?|webp)$/i or contains path separators). Adapts the same heuristic already used by DocumentTitleFilenameCheck.
- figure_alt_placeholder: Fails if /Alt consists solely of generic stop words: "image", "photo", "picture", "graphic", "icon", "logo", "screenshot", or similar. These provide no useful information to screen reader users. Implementation: case-insensitive comparison against a stop-word list.
- figure_alt_too_long: Warns if /Alt exceeds 250 characters. Overly long alt text belongs in a caption or body text, not the alt attribute. Implementation: string length check on the /Alt value.
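Because all four figure checks look at the same /Alt string, they can be illustrated as one classifier over that string. The sketch below is hypothetical: `pdfa11y_sketch_figure_alt_issue()` and its parameters are invented for illustration, the stop-word list contains only the words named above, and treating a blank /Alt as missing is an assumption.

```php
<?php

/**
 * Hypothetical helper: classifies one Figure's /Alt string.
 *
 * $alt is NULL when the Figure has no /Alt entry. $decorative stands in for
 * whatever signal marks a figure as decorative. Reading either value from
 * the parsed PDF is outside the scope of this sketch.
 *
 * @return string|null
 *   The first triggered check ID, or NULL if the alt text passes.
 */
function pdfa11y_sketch_figure_alt_issue(?string $alt, bool $decorative = FALSE, int $maxLength = 250): ?string {
  if ($decorative) {
    return NULL;
  }
  // Assumption: a blank /Alt is treated the same as a missing one.
  if ($alt === NULL || trim($alt) === '') {
    return 'figure_missing_alt';
  }
  $alt = trim($alt);

  // figure_alt_filename: image extension at the end, or a path separator.
  if (preg_match('/\.(png|jpg|gif|bmp|svg|tiff?|webp)$/i', $alt) || preg_match('#[/\\\\]#', $alt)) {
    return 'figure_alt_filename';
  }

  // figure_alt_placeholder: every word is a generic stop word.
  $stopWords = ['image', 'photo', 'picture', 'graphic', 'icon', 'logo', 'screenshot'];
  $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($alt), -1, PREG_SPLIT_NO_EMPTY) ?: [];
  if ($words !== [] && array_diff($words, $stopWords) === []) {
    return 'figure_alt_placeholder';
  }

  // figure_alt_too_long: warn above the configured character count.
  if (mb_strlen($alt) > $maxLength) {
    return 'figure_alt_too_long';
  }

  return NULL;
}
```

Note that the path-separator test also flags legitimate alt text containing a slash, such as "income/expenses chart", so the real check may want a narrower pattern.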
Table structure (2 checks)
PDF tagged tables use Table > TR > TH / TD structure elements, mirroring HTML table semantics exactly.
- table_missing_headers: Fails if a Table structure element contains no TH descendants. A table with only TD cells cannot be navigated meaningfully by screen readers. Implementation: for each Table tag, check for at least one TH child at any depth.
- table_empty_header: Fails if any TH tag has no text content. An empty header cell is announced as a column or row label with no information. Implementation: for each TH tag, recursively collect text content and fail if blank.
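Both table checks reduce to a recursive walk over one Table element. A minimal sketch follows, assuming each structure element has already been converted into a nested array of the form `['tag' => ..., 'text' => ..., 'children' => [...]]`. That shape and all function names are hypothetical and do not reflect the smalot/pdfparser API.

```php
<?php

/**
 * Recursively concatenates the text of a node and all of its descendants.
 *
 * Node shape (an assumption for this sketch):
 * ['tag' => string, 'text' => string, 'children' => array of nodes].
 */
function pdfa11y_sketch_node_text(array $node): string {
  $text = $node['text'] ?? '';
  foreach ($node['children'] ?? [] as $child) {
    $text .= ' ' . pdfa11y_sketch_node_text($child);
  }
  return trim($text);
}

/**
 * Collects every descendant of $node, at any depth, whose tag is $tag.
 */
function pdfa11y_sketch_descendants(array $node, string $tag): array {
  $found = [];
  foreach ($node['children'] ?? [] as $child) {
    if (($child['tag'] ?? '') === $tag) {
      $found[] = $child;
    }
    $found = array_merge($found, pdfa11y_sketch_descendants($child, $tag));
  }
  return $found;
}

/**
 * Returns the table check IDs triggered by a single Table node.
 */
function pdfa11y_sketch_table_issues(array $table): array {
  $headers = pdfa11y_sketch_descendants($table, 'TH');

  // table_missing_headers: no TH at any depth.
  if ($headers === []) {
    return ['table_missing_headers'];
  }

  // table_empty_header: at least one TH with no text content.
  foreach ($headers as $header) {
    if (pdfa11y_sketch_node_text($header) === '') {
      return ['table_empty_header'];
    }
  }

  return [];
}
```

One design point to settle in the real implementation: this walk also counts TH cells inside nested tables, which may or may not be the desired behavior.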
Partially translating checks (9)
These checks address concepts that exist in PDFs but require approximation or produce higher false-positive rates than their HTML equivalents. Caveats are noted to inform implementation decisions.
Link quality (3 checks)
PDF Link structure tags contain text content (the visible link text) and optionally an /Alt attribute. Text extraction from Link tags varies in reliability across authoring tools.
- link_empty: Fails if a Link tag has no text content and no /Alt. Caveat: some authoring tools produce Link tags wrapping non-text annotations; filter by checking for annotation children before flagging.
- link_stopword: Fails if link text consists entirely of generic phrases: "click here", "read more", "here", "more", "link", "this", etc. The Sa11y stop-word list applies directly once link text is extracted. Caveat: apply a minimum-length threshold to avoid flagging legitimately short links.
- link_url: Warns if link text is a bare URL (starts with http://, https://, or www.). Screen readers read URLs character by character. Caveat: some PDFs legitimately show URLs as visible text; a warning rather than a failure is appropriate.
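Once link text has been extracted, the three link checks are plain string tests. The sketch below assumes extraction has already happened. The function name is hypothetical, the phrase list holds only the examples quoted above rather than the full Sa11y list, and exact-phrase matching is a simplification that stands in for the minimum-length threshold.

```php
<?php

/**
 * Hypothetical helper: classifies one Link by its visible text and /Alt.
 *
 * @return string|null
 *   The triggered check ID, or NULL if the link passes.
 */
function pdfa11y_sketch_link_issue(string $text, ?string $alt = NULL): ?string {
  $text = trim($text);

  // link_empty: neither visible text nor an /Alt fallback.
  if ($text === '' && trim((string) $alt) === '') {
    return 'link_empty';
  }

  // link_url: the visible text is a bare URL.
  if (preg_match('#^(https?://|www\.)#i', $text)) {
    return 'link_url';
  }

  // link_stopword: after lowercasing and stripping punctuation, the whole
  // text equals one generic phrase. Only the phrases named in this issue
  // are listed here.
  $stopPhrases = ['click here', 'read more', 'here', 'more', 'link', 'this'];
  $normalized = trim(preg_replace('/[^\p{L}\p{N}]+/u', ' ', mb_strtolower($text)) ?? '');
  if (in_array($normalized, $stopPhrases, TRUE)) {
    return 'link_stopword';
  }

  return NULL;
}
```

Filtering out Link tags that only wrap non-text annotations, as the link_empty caveat suggests, would have to happen before this function is called.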
Content quality (4 checks)
String-analysis checks on the text content of structure tags. PDF text extraction can be unreliable for complex multi-column layouts.
- text_uppercase: Warns if a P tag contains a run of 4 or more consecutive ALL-CAPS words. Sustained all-caps text is harder to read and is often used as a substitute for proper heading structure. Caveat: exclude known abbreviations and acronyms; a minimum run-length threshold reduces false positives.
- text_fake_list: Warns if 3 or more consecutive P tags begin with manual list-like prefixes ("1.", "a.", "•", "–", "-") rather than using L/LI structure tags. Untagged lists are not navigable by screen readers. Caveat: requires examining runs of sibling P elements, not individual tags.
- blockquote_as_heading: Warns if a BlockQuote structure tag contains fewer than 25 characters of text. Short blockquotes are almost always headings or callouts that have been mis-tagged. Caveat: some legitimate pull quotes are short; warning rather than failing is appropriate.
- readability: Reports a Flesch-Kincaid readability score computed from the extracted text of body P tags. Does not fail the document; surfaces the score as informational metadata. Caveat: PDF text extraction order can be unreliable for multi-column layouts. Should be disabled by default and opt-in via the enabled checks settings.
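The two run-based checks, text_uppercase and text_fake_list, both count consecutive matches and reset on a miss. A rough sketch of that pattern follows; the function names, the `$allowed` acronym list, and the exact prefix pattern are assumptions for illustration, and the input is assumed to be already-extracted paragraph text.

```php
<?php

/**
 * text_uppercase sketch: TRUE if $text contains $minRun or more consecutive
 * ALL-CAPS words. Words listed in $allowed (known acronyms) do not count
 * and reset the run.
 */
function pdfa11y_sketch_has_uppercase_run(string $text, int $minRun = 4, array $allowed = []): bool {
  $run = 0;
  $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
  foreach ($words as $word) {
    // Compare letters only, so trailing punctuation does not matter.
    $letters = preg_replace('/[^\p{L}]/u', '', $word) ?? '';
    $isCaps = mb_strlen($letters) >= 2
      && $letters === mb_strtoupper($letters)
      // Skip scripts without letter case, where upper and lower are equal.
      && $letters !== mb_strtolower($letters)
      && !in_array($letters, $allowed, TRUE);
    $run = $isCaps ? $run + 1 : 0;
    if ($run >= $minRun) {
      return TRUE;
    }
  }
  return FALSE;
}

/**
 * text_fake_list sketch: TRUE if $minRun or more consecutive sibling
 * paragraphs start with a manual list prefix.
 *
 * Prefixes covered: "1.", "a.", bullet (U+2022), en dash (U+2013), hyphen.
 */
function pdfa11y_sketch_has_fake_list(array $paragraphs, int $minRun = 3): bool {
  $run = 0;
  foreach ($paragraphs as $paragraph) {
    $isItem = preg_match('/^\s*(?:\d+\.|[a-z]\.|[\x{2022}\x{2013}-])\s+/iu', $paragraph) === 1;
    $run = $isItem ? $run + 1 : 0;
    if ($run >= $minRun) {
      return TRUE;
    }
  }
  return FALSE;
}
```

Passing the paragraphs of one parent element at a time keeps the fake-list run limited to true siblings, which is the constraint the caveat describes.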
Figure caption duplication (1 check)
- figure_duplicate_alt: Warns if a Figure tag's /Alt text duplicates the text content of an adjacent Caption tag. Screen readers announce both, so duplication is redundant. Caveat: Caption association with Figure is not always explicit; check both sibling and parent–child relationships.
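The comparison itself is simple once candidate captions are known. The sketch below assumes the caller has already gathered caption text from both sibling and child Caption tags. The function names are hypothetical, and normalizing case and punctuation before comparing is an assumption about what should count as a duplicate.

```php
<?php

/**
 * Lowercases $text and collapses punctuation and whitespace to single
 * spaces, so that trivially different strings compare as equal.
 */
function pdfa11y_sketch_normalize(string $text): string {
  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', mb_strtolower($text)) ?? '';
  return trim($text);
}

/**
 * figure_duplicate_alt sketch: TRUE if $alt matches any candidate caption
 * after normalization. $captionTexts holds the text of every Caption the
 * caller considers associated with the figure.
 */
function pdfa11y_sketch_alt_duplicates_caption(string $alt, array $captionTexts): bool {
  $alt = pdfa11y_sketch_normalize($alt);
  if ($alt === '') {
    // A blank alt is a different problem and is not reported here.
    return FALSE;
  }
  foreach ($captionTexts as $caption) {
    if (pdfa11y_sketch_normalize($caption) === $alt) {
      return TRUE;
    }
  }
  return FALSE;
}
```

This only catches exact duplicates after normalization; alt text that merely contains or paraphrases the caption would need a looser comparison.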
Table header association (1 check)
- table_headers_ref: Fails if TD tags carry a /Headers attribute referencing TH IDs that do not exist in the table. This is the PDF/UA equivalent of HTML headers="" pointing to a nonexistent id. Caveat: the /Headers attribute is rarely populated by most authoring tools; this check mainly benefits highly structured documents.
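This check amounts to a set difference between the IDs that TD cells reference and the IDs that TH cells declare. A minimal sketch, assuming each table has been converted to a nested array where a node may carry an `'id'` string and a `'headers'` list; that shape and the function names are hypothetical.

```php
<?php

/**
 * Walks a node depth-first, calling $visit on every descendant.
 *
 * Node shape (an assumption for this sketch):
 * ['tag' => string, 'id' => ?string, 'headers' => string[], 'children' => array].
 */
function pdfa11y_sketch_walk(array $node, callable $visit): void {
  foreach ($node['children'] ?? [] as $child) {
    $visit($child);
    pdfa11y_sketch_walk($child, $visit);
  }
}

/**
 * table_headers_ref sketch: returns the header IDs that TD cells reference
 * but that no TH in the same table declares.
 *
 * @return string[]
 *   Dangling IDs; an empty array means the check passes.
 */
function pdfa11y_sketch_dangling_header_refs(array $table): array {
  $declared = [];
  $referenced = [];

  pdfa11y_sketch_walk($table, function (array $node) use (&$declared, &$referenced): void {
    $tag = $node['tag'] ?? '';
    if ($tag === 'TH' && isset($node['id'])) {
      $declared[] = $node['id'];
    }
    if ($tag === 'TD') {
      foreach ($node['headers'] ?? [] as $ref) {
        $referenced[] = $ref;
      }
    }
  });

  return array_values(array_unique(array_diff($referenced, $declared)));
}
```

Returning the dangling IDs rather than a boolean would let the failure message name the broken references.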
Remaining tasks
- Implement all 10 directly translating checks (heading completeness, figure alt text, table structure)
- Implement all 9 partially translating checks (link quality, content quality, figure caption, table header association)
- Add kernel tests for each new check using appropriate PDF fixtures (new fixtures required for figures-with-alt, tables, links, etc.)
- Add each new check ID to config/install/pdfa11y.settings.yml and config/schema/pdfa11y.schema.yml; readability disabled by default, all others enabled
- Review
- Merge
User interface changes
19 new checks appear in the Enabled checks list on the settings form (18 enabled by default; readability opt-in). Each check produces a new failure or warning message in the inline upload widget and the per-media report.
One thing giving me pause: PDFa11y checks are currently pass or fail, with a universal option to "block" upload of PDFs that have any failing tests. Many of these new tests require review and judgment to decide whether they truly represent an accessibility risk. To add all of them, I think we need an option similar to Editoria11y's that lets trained editors dismiss false positives or, at the very least, the ability to configure each PDFa11y check as warn, block, or disabled instead of just enabled or not. This could be a select list per test instead of a checkbox.
API changes
None. New checks are self-contained plugins.
Data model changes
None. New check IDs are stored in the existing pdfa11y_results table using the existing schema.
Comments
Comment #2
joshuami