pdf-link-checker – find broken hyperlinks in PDF documents

Tired of seeing your documents out of date? Don’t want to manually review them?

pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
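
To give an idea of what "sending a simple HTTP request to a link" can look like, here is a minimal Python 3 sketch using only the standard library. It is an illustration, not the code pdf-link-checker actually runs: the function name check_url, the use of a HEAD request and the 10 second default timeout are all assumptions made for this example.

```python
import http.client
import urllib.error
import urllib.request


def check_url(url, timeout=10.0):
    """Return (ok, detail); ok is False when the link looks broken.

    Hypothetical helper for illustration, not part of pdf-link-checker.
    """
    try:
        # HEAD asks for headers only, which keeps each check cheap.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return False, "HTTP %d" % error.code
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError) as error:
        # DNS failure, refused connection, timeout, malformed URL...
        return False, str(error)
```

Calling check_url("https://bootlin.com/") returns a pair such as (True, "HTTP 200") when the page is reachable. Some servers refuse HEAD requests, so a real checker may want to retry with GET before declaring a link broken.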

External references can be a very valuable part of your documents. Broken links reduce their usefulness as well as the impression they make. They also give the feeling that your documents are outdated and older than they are.
Web sites evolve frequently. Having an automated way of detecting obsolete links is essential to keeping your documents up to date.

Of course, pdf-link-checker is free software (GNU GPLv2 license).

Why?

We are using pdf-link-checker to make sure that our Android, embedded Linux, and kernel training materials are always up to date. They contain references to useful resources on the Internet, but such resources can disappear or be moved to other places. Our training materials are created from LaTeX source code, but instead of implementing a broken link checker for LaTeX, we preferred to develop a checker for the exported PDF file. This is a much more generic solution, which could interest billions of users!

pdf-link-checker can be used to check hyperlinks in most document formats. All you need is a utility that converts your document format to PDF while preserving hyperlinks. We recommend opening your documents with the excellent, free LibreOffice office suite (available for GNU/Linux, Mac OS X and Windows), which makes exporting to PDF very easy. This way, you can use pdf-link-checker to find broken links in any text or presentation document, such as LibreOffice documents and slides, Microsoft Word (doc / docx) and PowerPoint (ppt / pptx) files, RTF documents and HTML pages.

Installing

On GNU/Linux

Installing pdf-link-checker is very easy. First, we recommend installing pip, the Python package installer, if you don’t have it yet:

sudo apt-get install python-pip (on Debian-based systems such as Ubuntu)
sudo yum install python-pip (on RPM-based systems: Red Hat, Fedora, Suse, Mandriva…)

Then, installing pdf-link-checker along with its dependencies is easy:

$ pip install pdf-link-checker

On Windows

First, see instructions for installing Python and pip on Windows.

Then, you can install pdf-link-checker as follows:

$ pip install pdf-link-checker

Running pdf-link-checker

Using pdf-link-checker is even easier:

$ pdf-link-checker my-awesome-doc.pdf

Usage

./pdf-link-checker --help
Usage: pdf-link-checker [options] [PDF document files]

Reports broken hyperlinks in PDF documents

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         display progress information
  -s, --status          store check status information in a .checked file
  -d, --debug           display debug information
  -t MAX_THREADS, --max-threads=MAX_THREADS
                        set the maximum number of parallel threads to create
  -r MAX_REQUESTS_PER_HOST, --max-requests-per-host=MAX_REQUESTS_PER_HOST
                        set the maximum number of parallel requests per host
  -x EXCLUDE_HOSTS, --exclude-hosts=EXCLUDE_HOSTS
                        ignore urls which host name belongs to the given list
  -m TIMEOUT, --timeout=TIMEOUT
                        set the timeout for the requests
  --check-url=CHECK_URL
                        checks given url instead of checking PDF (debug)

Option details

--max-threads
Specifies the maximum number of threads allowed (default: 100). To speed up the run, pdf-link-checker launches several threads so that several links are checked in parallel. This option caps the number of threads.
--max-requests-per-host
Specifies the maximum number of allowed requests per host. Some URLs may belong to the same host, and since pdf-link-checker can check many URLs at the same time, we may want to set a limit to the number of requests per host. Otherwise, some hosts may confuse the check with a DoS attack.
--status
Creates a .input-file.checked file when no broken hyperlink is found. Scripts can use this file to skip running pdf-link-checker on documents that have already been validated.
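
The interplay between --max-threads and --max-requests-per-host can be sketched as follows. This is only one possible way to combine a global thread limit with a per-host limit, not necessarily how pdf-link-checker implements it. The names check_all and check are hypothetical, and the default of 5 requests per host is an arbitrary value chosen for the example (only the default of 100 threads comes from the option description above).

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def check_all(urls, check, max_threads=100, max_requests_per_host=5):
    """Run check(url) on every URL, limiting total and per-host concurrency.

    Hypothetical helper for illustration; returns a dict mapping each
    URL to the value returned by check(url).
    """
    urls = list(urls)

    # One semaphore per host name, created up front so that worker
    # threads only ever read the dictionary.
    host_slots = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host not in host_slots:
            host_slots[host] = threading.Semaphore(max_requests_per_host)

    def worker(url):
        host = urlsplit(url).hostname or ""
        # Wait until this host has a free slot, then run the check.
        with host_slots[host]:
            return url, check(url)

    # The pool size is the global limit on parallel threads.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(worker, urls))
```

In this sketch, a thread waiting for a busy host still occupies one of the pool's threads. That keeps the code short, at the price of some idle threads when many links point to the same host.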

Limitations

pdf-link-checker won’t detect and check URLs which are not properly declared as hyperlinks.
It doesn’t support checking internal links yet. This feature is on our todo list though.
It doesn’t support checking links that require authentication yet. The plan is to ignore such URLs.
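
As an illustration of the plan to ignore URLs that require authentication, a checker could sort HTTP status codes into three groups instead of two. The sketch below is an assumption about how this might look, not existing pdf-link-checker behavior. The function name classify_status and the three labels are invented for the example.

```python
def classify_status(status_code):
    """Map an HTTP status code to 'ok', 'auth-required' or 'broken'.

    Hypothetical helper for illustration, not part of pdf-link-checker.
    """
    # 401 (Unauthorized) and 407 (Proxy Authentication Required) mean
    # credentials are missing, not that the link is dead.
    if status_code in (401, 407):
        return "auth-required"
    # Success and redirect codes count as a working link.
    if 200 <= status_code < 400:
        return "ok"
    # Everything else (404, 410, 500...) is reported as broken.
    return "broken"
```

Links classified as "auth-required" could then be left out of the report rather than flagged as broken. Whether 403 (Forbidden) belongs in the same group is a judgment call, since servers use it both for missing credentials and for content that is simply not available.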

Getting help and helping out

Please use GitHub’s resources for reporting issues, asking questions, etc.

Patches and pull requests are welcome of course! Browse our Git repository and feel free to contribute!

Author: Ezequiel Garcia

Ezequiel Garcia is a kernel developer from Rosario, Argentina, who has worked for Bootlin in the past. He has touched many different parts of the Linux kernel. Ezequiel is great at learning and mastering new topics. He sometimes even ventures into an unknown territory called userspace...

20 thoughts on “pdf-link-checker – find broken hyperlinks in PDF documents”

Reinier Post says:

August 20, 2013 at 11:17 am

The installation procedure above doesn’t quite work on current Python (2.7.3) on current Cygwin (1.7.18-1): no suitable version of pdfminer can be found. After installing it anyway by typing “easy_install pdfminer”, I can install and run the link checker. Thanks!

Now the next thing to find out is how to make it work on internal links … yes, PDF definitely has those.

Reinier Post says:

August 22, 2013 at 10:05 am

I have extended this tool to support file:// links.
Where do I put the patch?

1. Michael Opdenacker says:
  
  August 23, 2013 at 4:47 am
  
  Hi Reinier,
  
  Thank you very much for your contribution! The best is to send the patch to our mailing list: https://lists.bootlin.com/mailman/listinfo/pdf-link-checker
  
  Michael
  
  1. Reinier Post says:
    
    August 23, 2013 at 1:43 pm
    
    I’ve subscribed and sent the same question there, but it seems to have gone to /dev/null.
    
    1. Michael Opdenacker says:
      
      August 28, 2013 at 6:08 am
      
      That’s fixed now. Thanks for the patch!
      
      Michael.
      
Antonio says:

July 9, 2014 at 8:41 am

Hi Ezequiel

I’ve tried to install the pdf link checker on Windows 7, Python 2.7.6. When I type

pdf-link-checker my-awesome-slides.pdf

nothing happens. After copying pdf-link-checker to pdf-link-checker.py, I get the following error message:

Traceback (most recent call last):
  File "C:\Python27\Scripts\pdf-link-checker.py", line 54, in <module>
    from pdfminer.pdfparser import PDFDocument, PDFParser
ImportError: cannot import name PDFDocument

Any help would be really appreciated.

Thanks & best Regards
Antonio

1. Tim says:
  
  July 9, 2014 at 7:12 pm
  
  I just ran into this same problem today when trying to install pdf-link-checker.
  
  I fixed it by:
  
  1) uninstalling pdfminer (had to run pip uninstall pdfminer twice to remove it all)
  2) Downloading https://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
  3) Installing the pdfminer-20110515 from source (python setup.py install)
  
  After doing the above pdf-link-checker works as expected
  
  Good luck!
  
  1. swapan says:
    
    November 11, 2014 at 12:24 pm
    
    It also works for me.
    But with the pdfminer-20110515 download from your link, it is necessary to rename the folder from pdfminer-20110515 to pdfminer.
    Also copy all files from C:\Python27\pdfminer\pdfminer into C:\Python27\pdfminer, so that setup.py runs without any further errors.
    To run pdf-link-checker, the file C:\Python27\Scripts\pdf-link-checker must also be moved to C:\Python27\pdf-link-checker.
    Now run this command:
    python pdf-link-checker my-awesome-doc.pdf
    It will surely work.
    
dabel says:

January 28, 2015 at 5:43 pm

Hi there,

I’ve tried all the hints above, but it does not work on my PC. Even when I use “pdf-link-checker.py --help” I obtain:
—
Traceback (most recent call last):
  File "C:\Python27\Scripts\pdf-link-checker.py", line 54, in <module>
    from pdfminer.pdfparser import PDFDocument, PDFParser
ImportError: cannot import name PDFDocument
—

I don’t know what is wrong.
d.

PS: I am not a python user at all, so I have no experience with it.

1. Michael Opdenacker says:
  
  January 28, 2015 at 9:28 pm
  
  Pfooh, I need people with a Windows box to test this, because I don’t have the issue on my Linux box.
  
  Hoping someone can help….
  
  1. B says:
    
    April 16, 2015 at 1:43 am
    
    I am using VMware on a Win 8.1 host with a Debian guest.
    I got the same error message (“PDFParser ImportError: cannot import name PDFDocument”) under Debian.
    So it doesn’t seem to be a Windows problem?
    Regards, B
    
    1. Michael Opdenacker says:
      
      April 16, 2015 at 4:49 am
      
      Hi Bernard,
      
      What about asking your questions by filing an issue on GitHub (https://github.com/bootlin/pdf-link-checker/issues)?
      It could help to find solutions together.
      
      Michael.
      
      1. Sherwood says:
        
        November 14, 2019 at 4:33 pm
        
        Ironically, this link is broken 🙂
        
        Michael Opdenacker says:
        
        November 15, 2019 at 1:13 pm
        
        Oops 🙂
        Fixed now…
        Thanks
        
Bernhard Esslinger says:

April 16, 2015 at 12:04 am

Hi Ezequiel,

I tried to unpack pdf-link-checker under Win 8.1 with Python 3.4. Is Python 3 not supported or what does the following output mean?
————————————————————
C:\>pip install pdf-link-checker
Downloading/unpacking pdf-link-checker
Downloading pdf-link-checker-1.1.1.tar.gz
Running setup.py (path:C:\Temp\pip_build_ben\pdf-link-checker\setup.py) egg_info for package pdf-link-checker

Downloading/unpacking pdfminer (from pdf-link-checker)
Could not find a version that satisfies the requirement pdfminer (from pdf-link-checker) (from versions: 20091024, 20091129, 20091219, 20100104, 20100131, 20100213, 20100322, 20100327, 20100424, 20100619p1, 20100829, 20101017, 20101226, 20110227, 20110515, 20131113, 20140324, 20140327, 20140328)
Some externally hosted files were ignored (use --allow-external to allow).
Cleaning up…
No distributions matching the version for pdfminer (from pdf-link-checker)
Storing debug log for failure in C:\pip\pip.log

C:\>python
Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:25:23) [MSC v.1600 64 bit (AMD64)] on win32
————————————————————

It would be great, if you could help.

Thanks a lot, Bernhard

Gerd says:

January 23, 2016 at 5:51 pm

Dear all,

after installing pdfminer-20110515 (see Tim’s post above), I got the tool running under Win7 64 bit – great thing! Just a small flaw: checking documents that contain links to destinations within SharePoint sites produces errors with the reason “HTTP Error 401: Unauthorized”, even though the executing user has all the necessary credentials. Any idea how to supply a user name and password before starting the check? Would be great! Thanks in advance. Gerd

Michael Opdenacker says:

May 18, 2016 at 1:54 pm

There’s a new version that repairs breakage against recent versions of pdfminer:
See https://github.com/bootlin/pdf-link-checker

Lucky Singh says:

June 23, 2016 at 5:31 pm

Can we check links within the same PDF document, or links to external documents stored in the same folder or on the same machine?

Or do you know of any other utility?

Thanks for any help.

1. Michael Opdenacker says:
  
  June 24, 2016 at 9:40 am
  
  pdf-link-checker essentially is there for checking external URLs. I agree it would be useful too to check internal links.
  Michael.
  
Hans says:

June 7, 2018 at 10:08 am

Hi
I would like to run pdf-link-checker on Windows. I have installed it via pip; this works and pdf-link-checker is installed.
But if I try to run it (in Windows cmd), the message “pdf-link-checker is not found” is shown. I can’t run it.

What is wrong? What am I doing wrong?

Thanks
Kind regards
Hans
