pgPDF: `pdf` type for Postgres

This extension for PostgreSQL provides a pdf data type and assorted functions.

You can create a pdf type, by casting either a text filepath or bytea column.

SELECT '/tmp/pgintro.pdf'::pdf;

                                       pdf                                        
----------------------------------------------------------------------------------
 PostgreSQL Introduction                                                         +
 Digoal.Zhou                                                                     +
 7/20/2011Catalog                                                                +
  PostgreSQL Origin

If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea column, you can just cast it to pdf.

SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;

Why?: This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.

The actual PDF parsing is done by poppler.

Also check blog:

Full Text Search on PDFs With Postgres
pgpdf: pdf type for Postgres

Usage

Download some PDFs.

wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf

Create a table with a pdf column:

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

Parsing and validation should happen automatically. The files will be read from the disk only once!

Note

The filepath should be accessible by the postgres process / user! That's different than the user running psql. If you don't understand what this means, as your DBA!

String Functions and Operators

Standard Postgres String Functions and Operators should work as usual:

SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;

SELECT upper('/tmp/pgintro.pdf'::pdf::text);

SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';

Full-Text Search (FTS)

You can also perform full-text search (FTS), since you can work on a pdf file like normal text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');

 ?column? 
----------
 t
(1 row)

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');

 ?column? 
----------
 f
(1 row)

Document similarity with `pg_trgm`

You can use pg_trgm to get the similarity between two documents:

CREATE EXTENSION pg_trgm;

SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);

Metadata

The following functions are available:

pdf_title(pdf) → text
pdf_author(pdf) → text
pdf_num_pages(pdf) → integer

Total number of pages in the document
pdf_page(pdf, integer) → text

Get the i-th page as text
pdf_creator(pdf) → text
pdf_keywords(pdf) → text
pdf_metadata(pdf) → text
pdf_version(pdf) → text
pdf_subject(pdf) → text
pdf_creation(pdf) → timestamp
pdf_modification(pdf) → timestamp

SELECT pdf_title('/tmp/pgintro.pdf');

        pdf_title        
-------------------------
 PostgreSQL Introduction
(1 row)

SELECT pdf_author('/tmp/pgintro.pdf');

 pdf_author 
------------
 周正中
(1 row)

Getting a subset of pages

SELECT pdf_num_pages('/tmp/pgintro.pdf');

 pdf_num_pages 
---------------
            24
(1 row)

SELECT pdf_page('/tmp/pgintro.pdf', 1);

           pdf_page           
------------------------------
 Catalog                     +
  PostgreSQL Origin         +
  Layout                    +
  Features                  +
  Enterprise Class Attribute+
  Case
(1 row)

SELECT pdf_subject('/tmp/pgintro.pdf');

 pdf_subject 
-------------
 
(1 row)

SELECT pdf_creation('/tmp/pgintro.pdf');

       pdf_creation       
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_modification('/tmp/pgintro.pdf');

     pdf_modification     
--------------------------
 Wed Jul 20 11:13:37 2011
(1 row)

SELECT pdf_creator('/tmp/pgintro.pdf');

            pdf_creator             
------------------------------------
 Microsoft® Office PowerPoint® 2007
(1 row)

SELECT pdf_metadata('/tmp/pgintro.pdf');

 pdf_metadata 
--------------
 
(1 row)

SELECT pdf_version('/tmp/pgintro.pdf');

 pdf_version 
-------------
 PDF-1.5
(1 row)

Installation

Install poppler dependencies

Linux

sudo apt install -y libpoppler-glib-dev pkg-config

Homebrew/MacOS

brew install poppler pkgconf

cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo

After the installation, in a session:

CREATE EXTENSION pgpdf;

Docker

Get the Docker image with:

docker pull florents/pgpdf:pg17

This adds pgpdf to the Postgres image (replace 17 with your Postgres server version, and run it the same way).

Run the image in a container.

docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17

Through another terminal, connect to the running server (container).

PGPASSWORD=pass psql -h localhost -p 5432 -U postgres

Warning

Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

pgPDF: `pdf` type for Postgres

Usage

String Functions and Operators

Full-Text Search (FTS)

Document similarity with `pg_trgm`

Metadata

Installation

Docker

Files

README.md

Latest commit

History

README.md

File metadata and controls

pgPDF: pdf type for Postgres

Usage

String Functions and Operators

Full-Text Search (FTS)

Document similarity with pg_trgm

Metadata

Installation

Docker

pgPDF: `pdf` type for Postgres

Document similarity with `pg_trgm`