Friday, April 13, 2018

Long-term Survival of PDF/A Files

PDF/A is widely marketed and regarded as a preservation file format. However, a recently published article, “PDF/A Considered Harmful for Digital Preservation” by Marco Klindt, serves as a prudent reminder that the PDF/A file format is not, in itself, a comprehensive solution for preservation.

For digital information to exist in the long term, the data that comprise the information content need to remain discoverable, machine readable, and renderable for human consumption. If preserving digital content means planning for the potential reuse of data, a computer needs to be able to read and extract this information in the future. However, PDF files pose a significant preservation challenge: what the human eye can read on the page is not necessarily what a machine can interpret and extract.

PDF/A is intended to serve as the long-term archival version of PDF files. However, as the author notes, while PDF/A is marketed and widely adopted as a preservation format, “comprehensive policies regarding the use of PDF in archives seem to be rare” and “using PDF/A as a container for files complicates preservation workflows and might be considered an additional risk.” PDF documents preserve the visual appearance, structure, and format of the original document, but this comes at a potential cost to the reusability of data. A PDF/A document created at Level A (accessible) conformance is designed to improve a document’s accessibility by using tags to mark up the structure and content of the document, which in turn should help support both visibility and reuse. However, the current version, PDF/A-3, still presents multiple challenges.

Klindt discusses these risks and shortcomings by describing inadequacies and challenges observed in the creation and reuse of PDF/A documents. The risks he identifies undermine confidence in the suitability of PDF/A for long-term preservation. The challenges discussed include impediments to text and content extraction, as well as information loss during creation and conversion. It is worth noting that while the author acknowledges that PDF/A validation issues have been largely addressed by the creation of veraPDF, an open-source PDF/A validator, he argues that validation is a “necessary condition” that does not, by itself, mitigate risks to future reuse of content.
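To make the distinction between declaring conformance and actually being reusable more concrete, here is a minimal sketch, not a validator, that reads the PDF/A part and conformance level a file claims in its XMP metadata (the `pdfaid:part` and `pdfaid:conformance` properties). The function name `declared_pdfa_level` is hypothetical, and the sketch assumes the XMP packet is stored uncompressed in the file. A returned value such as "3a" is only a claim made by the file; confirming it requires a real validator such as veraPDF, and even a valid file may still resist text extraction.

```python
import re

# XMP properties may be serialized as attributes (pdfaid:part="1")
# or as elements (<pdfaid:part>1</pdfaid:part>); accept both forms.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def declared_pdfa_level(data: bytes):
    """Return the PDF/A level the file claims (e.g. '1b', '3a'), or None.

    Hypothetical helper: it only reports what the metadata declares and
    assumes the XMP packet is readable as plain bytes. It performs no
    validation of the file against the PDF/A standard.
    """
    part = _PART.search(data)
    if part is None:
        return None  # no PDF/A identification found
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level


def declared_pdfa_level_of_file(path: str):
    """Convenience wrapper that reads the whole file from disk."""
    with open(path, "rb") as handle:
        return declared_pdfa_level(handle.read())
```

An archive could use a check like this only as a first triage step, for example to sort incoming files by their declared level before passing them to a validator and to a separate text-extraction test.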

An understanding of these risks and shortcomings of PDF/A for preservation purposes underscores the need for comprehensive strategies and policies at the institutional level to safeguard digital content within a flawed archival solution. The article cites a number of useful, previously published resources on PDF/A, including the National Digital Stewardship Alliance (NDSA) report on The Benefits and Risks of the PDF/A-3 File Format for Archival Institutions and a National Information Standards Organization (NISO) Information Standards Quarterly article, “Preserving the Grey Literature Explosion: PDF/A and the Digital Archive”. The PDF/A-4 standard is expected to be published sometime in 2018.
