Eliminating PDFs in Local Government

We all know that PDFs aren't very accessible, but when a council has thousands of them, converting them all to HTML by hand is a big, daunting and almost impossible task.

We've got the tools to make engaging online content

Using LocalGov Drupal, content designers and editors can build accessible content in different ways depending on the aims of the content. There are timelines, step-by-step guides, subsites, services pages, etc.

But what's the best way for councils to move the content from PDFs onto their website? We built the LocalGov Publications module to do just that. The content types this module provides let editors build online publications with previous/next page buttons, a menu of the whole publication, a menu of the content in the current page and a cover page to link to other versions of the document.

Our client, the London Borough of Hammersmith & Fulham, used the module to publish their corporate plan. As the module's open source, and part of the LocalGov Drupal ecosystem, it's also been used by other councils, including Bracknell Forest and West Lindsey. You can see how it works on the LocalGov Drupal YouTube channel.

We have a plan and a prototype!

This video was shortened for the internet: it took 33 seconds to convert this document.

So that's the problem of creating new publications in LocalGov Drupal solved, but what about all those files that are already out there?

Last year, we built a prototype module which reads a PDF and pops all the text into an HTML publications node. It's a decent starting point, but it needs more work before it could be a really good solution for people to use.

We've got some ideas for features we could build, and we've done some research into what other options are out there, but we'd like to:

Understand how users would expect a publications importer to work.
Use all the features that HTML publications provide: create and link up a cover page, have multiple pages in the output publication, etc.
Reliably extract all of the text, and as much of the document's original structure (e.g. headings, lists, tables) as possible.
Investigate the use of AI to re-introduce structure to the document when it can't be preserved (we've tested this already!).
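
To give a feel for what re-introducing structure without AI could look like, here's a minimal Python sketch. It isn't the prototype module's code. It assumes the text has already been pulled out of the PDF as a list of plain lines, and the rules it uses (a short line with no closing punctuation is probably a heading, a line starting with a bullet character is a list item) are guesses for illustration only. The names `looks_like_heading` and `lines_to_html` are made up for this example.

```python
import html
import re

# Assumption: list items in the extracted text start with one of these characters.
BULLET = re.compile(r"^[•\-*]\s+(.*)$")


def looks_like_heading(text: str) -> bool:
    """Guess that a short line with no closing punctuation is a heading."""
    return (
        0 < len(text) <= 60
        and text[-1] not in ".,;:?!"
        and (text[0].isupper() or text[0].isdigit())
    )


def lines_to_html(lines):
    """Turn extracted plain-text lines into simple, escaped HTML."""
    out = []
    in_list = False
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # blank lines carry no content
        bullet = BULLET.match(text)
        if bullet:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{html.escape(bullet.group(1))}</li>")
            continue
        if in_list:
            # A non-bullet line ends the current list.
            out.append("</ul>")
            in_list = False
        tag = "h2" if looks_like_heading(text) else "p"
        out.append(f"<{tag}>{html.escape(text)}</{tag}>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        "Corporate plan",
        "Our priorities for the next four years are set out below.",
        "• Housing",
        "• Clean streets",
    ]
    print(lines_to_html(sample))
```

Rules like these will get plenty of real documents wrong (tables and multi-column layouts especially), which is why a manual review step and optional AI help are both on the list.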

First steps

Make this work manually. After a PDF is imported, it needs to be reviewed with some kind of manual process to tidy up the content, add page breaks etc.
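
As a rough illustration of the page break part, here's a small Python sketch that splits imported HTML wherever an editor has dropped in a marker, then works out the previous/next links for each page. The `<!-- pagebreak -->` marker, the `Page` class and `split_into_pages` are all hypothetical; this is not how the LocalGov Publications module stores its pages.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical marker an editor would insert while tidying up the import.
PAGE_BREAK = "<!-- pagebreak -->"


@dataclass
class Page:
    number: int
    body: str
    previous: Optional[int] = None  # page number of the previous page, if any
    next: Optional[int] = None      # page number of the next page, if any


def split_into_pages(html_text: str, marker: str = PAGE_BREAK) -> List[Page]:
    """Split imported HTML at each marker and link neighbouring pages."""
    chunks = [chunk.strip() for chunk in html_text.split(marker)]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty pages
    pages = [Page(number=i, body=chunk) for i, chunk in enumerate(chunks, start=1)]
    for index, page in enumerate(pages):
        if index > 0:
            page.previous = pages[index - 1].number
        if index < len(pages) - 1:
            page.next = pages[index + 1].number
    return pages


if __name__ == "__main__":
    imported = "<h2>Foreword</h2><!-- pagebreak --><h2>Our priorities</h2>"
    for page in split_into_pages(imported):
        print(page.number, page.previous, page.next, page.body)
```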

After that

After that, we want to look at automating the process and building better editor tools.

What we know already

Every council has thousands of PDFs in a wide range of styles, from large constitution documents to colourful, graphic-heavy marketing pamphlets.
Some people don't like AI, so ChatGPT should be optional.
PDF content doesn't always need to go into an HTML publication. Sometimes a simple page will do. We'd like to create some kind of preview tool to test out different content types before saving.