Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare

Bruno Lowagie

If there’s one major challenge to single out in healthcare IT today, it is leveraging the growth and usage of big data. While consumer IT has made big advances in the past decade in getting a handle on data by marking up content, indexing it, and annotating it for use, enterprise IT, and healthcare IT in particular, still needs to catch up on making data actionable.

A typical healthcare office handles tens of thousands of documents for patient records and for legal, finance, and billing processes. In pharma and biotech, a typical FDA drug review process involves multiple stages of trials, testing, applications, marketing and manufacturing for the new drug, all requiring a mind-blowing amount of paperwork. In all these cases, either the collected data is not timely or relevant, or it is not easy to access, to archive for the future, or to bring into compliance with legal standards.

This article provides insights into how using the Portable Document Format (PDF) and accompanying tools within healthcare organizations can be a powerful way to help solve the unstructured data challenge, speed up processes, and reduce the costs for document handling.

We will explain why PDF, with its ability to contain data structure and interactivity, is the perfect document format for meeting the archiving, accessibility and compliance requirements of the healthcare industry. We will also examine the building blocks of a solution that helps create such compliant PDF documents, and dive deep into the ways to organize and structure PDFs.

But first: some culture!

The Treachery of PDFs

Developed by Adobe Systems, PDF is an electronic document format that may contain text, images, graphics, and other multimedia content. PDF works well for us humans: it helps us perceive media in an unambiguous, one-dimensional way, with the convenience of a paper document. This is where the treachery of PDF lies.

I was born and raised in Belgium. When people ask me about my country, they expect me to talk about beer and chocolates. But the first thought that comes to mind when I think about Belgium is that I live in the country of the absurd. One of the most famous Belgian artists is René Magritte. You probably know him from his painting "The Treachery of Images".

The Treachery of Images, by René Magritte

People are often confused when they read "Ceci n'est pas une pipe" ("This is not a pipe"). To quote René Magritte: "The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it's just a representation, is it not? So if I had written on my picture 'This is a pipe', I'd have been lying!"

What does this have to do with PDF?

Most of the world's data is in the form of media (images, sounds, videos) that work directly with the human perceptual senses. To organize and index these media, we give our machines corresponding perception capabilities: we let our computers look at the images and video, and build descriptions of their perceptions. Therefore, for a machine to understand a PDF document, it needs additional layers of information to make sense of its contents. Luckily, the PDF format is perfectly suited for machine consumption, but only when it’s done right.

PDF: More than just an image

To a human being, the PDF below looks like any typical paper document. It has a title (This is not the title), and text (Lorem ipsum dolor sit amet,...). This is what we see on the surface, and we are happy with it. But what would a machine think if it examined this page?

Using a tool called iText RUPS®, we can take a peek at the internal structure of a PDF document, to see what this document means to the machine, beyond what is obvious to humans.

According to the tool, there's no real indication that one part of the text in this document is a title, and the other part is the start of a paragraph. From the point of view of a machine, "This is not the title" is ordinary text, just like all the rest of the text in the document.

Looking inside an unstructured PDF

Let’s look at another document that, to us, looks just like the previous one. This second document has a title that says “This is the title” and has a paragraph of text underneath.

This time, inspecting the document using iText RUPS tells us something different: this PDF contains a structure called the Structure Tree. The Adobe Acrobat Pro tags panel on the left is already giving us an indication of such structure. Whereas the panel in our first example showed no available tags, we now see that there is a <Document> tag with two children: an <H1> tag for a first-level header, and a <P> paragraph tag.

Looking inside a structured PDF


Tagged PDF: Are my PDFs structured?

Having a logical structure, expressed as a Structure Tree like the one above, is one of the requirements for a PDF to be accepted as a Tagged PDF. All content inside a tagged PDF needs to be marked using standard structure types and attributes. There is a multitude of technical requirements for these attributes, but at the very minimum, you should be able to check whether a PDF is a Tagged PDF or not. You can do that very easily in Adobe Reader’s “Properties” dialog. At the bottom of the Advanced section of the Document Properties window shown below, you see "Tagged PDF: Yes".

How to find out if a PDF is a Tagged PDF
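
The same question can be asked in code. Inside the file, a Tagged PDF declares itself through a /MarkInfo dictionary whose /Marked entry is true, and it points to its Structure Tree through a /StructTreeRoot entry. The Python sketch below is only a rough heuristic built on that fact: the function name `looks_tagged` is made up for this illustration, it scans raw bytes instead of parsing the file, and it will miss the markers when they are stored in a compressed object stream. A real PDF library should be used for anything beyond a quick first check.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic check for the two markers of a Tagged PDF.

    Assumption: the document catalog is stored uncompressed, so the
    markers are visible in the raw bytes of the file.
    """
    # /Marked true lives in the /MarkInfo dictionary of the catalog.
    marked = re.search(rb"/Marked\s+true\b", pdf_bytes) is not None
    # /StructTreeRoot points to the root of the Structure Tree.
    has_tree = b"/StructTreeRoot" in pdf_bytes
    return marked and has_tree


# A hand-written catalog fragment, not a complete PDF file.
sample = (
    b"1 0 obj\n"
    b"<< /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 2 0 R >>\n"
    b"endobj\n"
)
print(looks_tagged(sample))
```

For a file on disk, the bytes could be read with `pathlib.Path(name).read_bytes()` and passed to the same function.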

Accessible PDF: How does that benefit Healthcare?

The use of tagged PDF may be required to fully comply with mandates like Section 508 of the Rehabilitation Act, the Americans with Disabilities Act (ADA), or the W3C Web Content Accessibility Guidelines (WCAG). Two ISO standards that build on PDF are relevant here:

  • The UA in PDF/UA stands for Universal Accessibility. A PDF that is compliant with the PDF/UA ISO standard (ISO 14289) is accessible because a machine can interpret the content of the document and present it to the blind and the visually impaired in ways that don't require visual characteristics, such as font sizes and colors.
  • The A in PDF/A stands for Archiving. This format is described in ISO 19005-1, ISO 19005-2 and ISO 19005-3. Conformance level A makes sure that the structure of the document is preserved in an unambiguous way in the decades and centuries to come.

PDF has features that facilitate access for visually and hearing impaired users, e.g., captioning of multimedia. Tagged PDF in particular provides access to the content and logical structure of a document for use by assistive technologies such as screen readers and Braille printers. Some PDF viewers also include a built-in capability to read documents aloud.

In addition to support for assistive technology, the use of Tagged PDF also provides some benefits for all users. Tagged PDF can be re-flowed to provide easier reading for small screen devices such as PDAs or cell phones, and can be used to temporarily re-format multi-column documents into a single continuous format.

The implications of using inaccessible content should be considered very carefully in the case of the healthcare industry. Whenever possible, consider using PDF software that supports creating tagged PDF from electronic originals for all public-facing PDFs.

A Healthcare Use Case

Finding structure in a document, and grouping content using styles, links and bookmarks, is crucial for creating an orderly and functional document management system in healthcare. In this sense, improving the quality of any individual document is the first step to making it “actionable”, “FDA-ready” or legally compliant. For that purpose, a useful tool for PDF creation and manipulation in a healthcare setting should be able to parse the contents of an unstructured PDF document, detect a structure, and, to give just one example, allow you to create a list of bookmarks based on titles or a table of contents.
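
Once titles have been detected, turning them into bookmarks is mostly bookkeeping. The Python sketch below illustrates the idea under stated assumptions: the detected titles arrive as a flat list of (level, title, page) tuples, and the names `Bookmark` and `build_outline` are invented for this example. It is not the iText or LINK API, and it stops short of writing anything back into a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    page: int
    children: list = field(default_factory=list)


def build_outline(headings):
    """Nest a flat list of (level, title, page) tuples into a tree.

    Level 1 is a chapter, level 2 a section, and so on (assumption).
    """
    roots = []
    stack = []  # (level, bookmark) pairs on the path to the latest entry
    for level, title, page in headings:
        node = Bookmark(title, page)
        # Leave every branch that is at the same depth or deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots


# Made-up headings, as a title detector might report them.
detected = [
    (1, "Study Overview", 1),
    (2, "Patient Population", 2),
    (2, "Test Protocol", 4),
    (1, "Results", 7),
]
for chapter in build_outline(detected):
    print(chapter.title, [section.title for section in chapter.children])
```

Keeping the open branch on a stack means the headings only need to be visited once, in the order in which they appear in the document.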

The applications can be numerous:

  • Bundle a large number of PDFs, each corresponding to one patient test result, into a single PDF package.
  • Add a cover note that links to each separate document, highlighting those documents that are of special interest because they contain a specific word sequence.
  • Stamp information on documents, such as dates or watermarks (e.g. "Confidential").
  • Add or update metadata to help find and search documents.

One such tool, created specifically for the healthcare industry, is GlobalSubmit LINK™. At the core of LINK, we find iText, the PDF code library that is responsible for crunching the original PDFs, creating and updating links, adding bookmarks, and so on.

GlobalSubmit LINK is built to quickly and accurately generate regulatory-compliant PDF documents layered with external links, internal links, and bookmarks, without the need for Adobe Acrobat. With LINK, you can sync disconnected source documents with speed and accuracy, meeting the challenge of building submission-quality documents from a variety of disparate source materials and file formats, such as patient data and case reports.

iText is the world’s most comprehensive PDF code library. It empowers millions of developers to deliver advanced PDF functionality to web, cloud, mobile and desktop applications. It’s available for the Java, .NET, Android and GAE platforms via open source (AGPL) and commercial licenses.

Where to start: Extracting the data

Extracting useful information that can be read by a machine is a challenging task. iText, when built into other applications, allows users to parse PDFs, that is, to extract images and text from a PDF document.

A page from a PDF document as seen by a human being and by a machine

In the next example, we will take a look at a PDF page taken from my book "iText in Action - Second Edition."

We humans see different structure elements on this page: a header, a title, some paragraphs, and an image. To a machine, this same page consists of many different items that can appear in a random order. Their place on the page is defined by coordinates, not by the logical reading order. In the example below, we've asked iText to highlight images using a red rectangle and text snippets using a blue rectangle. This is how a PDF is composed from the point of view of a machine reading the PDF.

For iText to detect structure, it needs to know which characteristics to look for when interpreting the different snippets. A good start is to define each kind of element by its fonts and sizes. For example, a Header consists of content in the top margin, which measures 48 points; Section Titles use a 10-point FranklinGothic font; Captions of Images use an 8-point FranklinGothic font; and so on.
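
Rules like these are easy to write down as code. The Python sketch below applies the three rules from the example to text snippets that have already been extracted. It does not use iText: the `Snippet` record, the `classify` function and the page height of 792 points (US Letter) are assumptions made for this illustration, and a real document would need its own measurements.

```python
from dataclasses import dataclass

PAGE_HEIGHT = 792.0  # assumption: US Letter page, measured in points
TOP_MARGIN = 48.0    # height of the header area, as in the example above


@dataclass(frozen=True)
class Snippet:
    text: str
    font: str    # font name as reported by the PDF parser
    size: float  # font size in points
    y: float     # distance of the baseline from the bottom of the page


def classify(snippet: Snippet) -> str:
    """Label a snippet as header, title, caption or regular text."""
    # Anything positioned inside the top margin is treated as a header.
    if snippet.y >= PAGE_HEIGHT - TOP_MARGIN:
        return "header"
    is_franklin = "FranklinGothic" in snippet.font
    # Compare sizes with a tolerance: parsers may report 9.96 for 10.
    if is_franklin and abs(snippet.size - 10) < 0.1:
        return "title"
    if is_franklin and abs(snippet.size - 8) < 0.1:
        return "caption"
    return "text"


# Made-up snippets; the font names are illustrative only.
page = [
    Snippet("CHAPTER 1", "NewBaskerville", 9, 760),
    Snippet("1.1 Things you can do with PDF", "FranklinGothic-Demi", 10, 640),
    Snippet("Figure 1.1 A simple page", "FranklinGothic-Book", 8, 310),
    Snippet("Lorem ipsum dolor sit amet", "NewBaskerville", 10, 600),
]
for snippet in page:
    print(classify(snippet), "|", snippet.text)
```

The position test runs first on purpose, so that a header set in the same font as a title is still recognized as a header.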

A PDF page as seen by a machine with a little help from a human being

Now that iText knows which fonts to look for, it can highlight the different text snippets using different colors: red for the header, yellow for titles, green for captions and blue for regular text.

Taking Action: Improving Your Document

Improving PDF Navigation

We've already discovered that iText can examine the content of a PDF and highlight specific areas. This is extremely useful for readers who are confronted with a pile of documents, for instance thousands of test results from different patients where one specific word sequence is very important. One could for instance highlight every occurrence of "HIV-resistant" as that could be an important quality to look for in a pile of test results.
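
Finding those occurrences is the step that comes before highlighting them. The Python sketch below assumes that the text of each document has already been extracted into a string. The function `find_phrase` and the file names are hypothetical, and drawing the actual highlight in the PDF is left to the PDF library.

```python
import re


def find_phrase(documents: dict, phrase: str) -> dict:
    """Return the character offsets of a phrase, per document.

    Matching ignores case and tolerates any run of whitespace between
    the words of the phrase, because extracted text often contains
    line breaks in unexpected places.
    """
    words = (re.escape(word) for word in phrase.split())
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    hits = {}
    for name, text in documents.items():
        offsets = [match.start() for match in pattern.finditer(text)]
        if offsets:
            hits[name] = offsets
    return hits


# Made-up extracted text for three test results.
results = {
    "patient-001.pdf": "Sample found to be HIV-resistant after 12 weeks.",
    "patient-002.pdf": "No relevant findings.",
    "patient-003.pdf": "Marked as hiv-resistant by the second lab.",
}
for name, offsets in find_phrase(results, "HIV-resistant").items():
    print(name, offsets)
```

Only documents with at least one match are reported, which is what a reader facing thousands of test results would want to see first.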

Changing the color of links

In the same way, you can use GlobalSubmit LINK to add a variety of interactive features and content, stored in separate places inside the PDF, in order to “improve” the quality of the PDF document.

The following screenshot shows the ability of LINK to change the colors of links present in the document, so that a person reading it can identify the clickable areas.

Numerous other interactive elements can be added using LINK, such as bookmarks for chapter and section titles, detected either from styles or from a table of contents.

Archiving PDF to pass the test of time

Detecting chapter and section titles based on styles

Introducing structure and defining all elements in the document helps you preserve its exact look and feel, today or 30 years from today. Preserving the content, context and structure of records also enhances the authenticity and integrity of the documents with regard to internal and external regulations or requirements.

In summary, the PDF format and the GlobalSubmit LINK solution based on the iText technology are ideally suited to bringing efficiency to the healthcare industry, through automated generation of bookmarks and hyperlinks, interoperability among systems, and compliance through future-proof archiving.


Detecting chapter and section titles based on TOC

Although the world is constantly buzzing with words such as "Content Management" and "Big Data", the truth is that plenty of knowledge and data are still locked inside huge piles of documents. The advent of new standards such as PDF/UA, PDF/A, and support for these standards by libraries such as iText, will significantly improve the quality of PDF documents. PDF will no longer be merely the digital equivalent of the paper document. We'll see more and more intelligent documents that present the data for consumption by humans as well as machines.

iText can help you unlock the information secreted away in unstructured documents. In the context of the Healthcare industry, GlobalSubmit's LINK product allows companies to enhance documents related to new drugs or patient records with the goal of accelerating business processes.

Author Bio

Bruno Lowagie is the original developer of iText, an open source PDF library first released in 2000. He's also the author of the “iText in Action” books, published by Manning Publications. Together with his wife, he founded iText Group (2008), a company with subsidiaries in the US (2009), Belgium (2011), and Singapore (2015). The couple grew the business from start-up to exit. He wrote a book about his journey as an open source developer and entrepreneur: “Entreprenerd: Building a Multi-Million-Dollar Business with Open Source Software.” Today, Bruno and his wife invest in Belgian technology start-up companies.