Checks OpenOffice.org documents for broken links
cOOol is a simple Python script that looks for broken hyperlinks in OpenOffice.org documents.
Here is why an automatic link checker for your documents is useful:
- External references can be a very valuable part of your documents. Broken links make them less useful and leave a poor impression. They also make your documents look more outdated than they really are.
- Web sites evolve frequently. Having an automated way of detecting obsolete links is essential to keeping your documents up to date.
- You may be much more familiar with your target websites than your readers are. They may not be able to find a new location by themselves. You’d better be aware of the change and update the link for them!
- When you rename a page (for example), OpenOffice.org doesn’t update all the references to it.
    Usage: coool [options] [OpenOffice.org document files]

    Checks OpenOffice.org documents for broken Links

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -v, --verbose         display progress information
      -d, --debug           display debug information
      -t MAX_THREADS, --max-threads=MAX_THREADS
                            set the maximum number of parallel threads to create
      -r MAX_REQUESTS_PER_HOST, --max-requests-per-host=MAX_REQUESTS_PER_HOST
                            set the maximum number of parallel requests per host
      -x EXCLUDE_HOSTS, --exclude-hosts=EXCLUDE_HOSTS
                            ignore urls which host name belongs to the given list
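To illustrate how the --max-threads and --max-requests-per-host options could interact, here is a minimal sketch of a checker that caps both the total number of worker threads and the number of simultaneous requests per host. It is not taken from cOOol's source: the function name check_all, the check callback and the default values are all hypothetical.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def check_all(urls, check, max_threads=100, max_requests_per_host=5):
    """Run check(url) for every URL and return a {url: result} dict.

    max_threads and max_requests_per_host mirror the command line options;
    the default values used here are arbitrary assumptions.
    """
    # One semaphore per host name, created on first use.
    host_slots = defaultdict(
        lambda: threading.BoundedSemaphore(max_requests_per_host))
    slots_lock = threading.Lock()

    def limited_check(url):
        host = urlsplit(url).hostname or ""
        with slots_lock:  # protect the dictionary while a semaphore is created
            slot = host_slots[host]
        with slot:  # blocks while the host already has enough requests running
            return url, check(url)

    # The pool size caps the total number of parallel threads.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(limited_check, urls))
```

One consequence of this simple design is that a thread waiting for a busy host still occupies a slot in the pool, so a document with many links to the same host can leave some threads idle.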
When a broken link is found, open the document in OpenOffice.org and use the search facility to look for the link text.
Rather than configuring cOOol from the command line, it is possible to
define the same settings in a configuration file:
    # Configuration file for cOOol
    verbose = True
    exclude_hosts = "lxr.free-electrons.com www.example.com"
    max_threads = 200
You can see that configuration file settings have the same names as the long
options, except that dash (-) characters are replaced by underscores (_).
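The naming rule can be expressed in one line of Python. This is only an illustration of the convention seen in the example settings (max_threads for --max-threads, exclude_hosts for --exclude-hosts); the helper name option_to_setting is hypothetical and not part of cOOol.

```python
def option_to_setting(long_option):
    """Map a long option such as '--max-threads' to its setting name."""
    # Drop the leading dashes, then turn the remaining dashes into underscores.
    return long_option.lstrip("-").replace("-", "_")


for option in ("--verbose", "--max-threads", "--exclude-hosts"):
    print(option, "->", option_to_setting(option))
```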
Usage through a proxy
cOOol can be used through a proxy. The Python classes it uses rely on the standard Unix environment variables for proxy definition, as in the bash example below:
    export http_proxy="proxy.server.com:8080"
    export ftp_proxy="proxy.server.com:8080"
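To check which proxy settings Python picks up from the environment, a short sketch like the following can help. It assumes the standard library's urllib.request module, which may not be the exact module cOOol relies on, and it sets the variables itself only so that it can run on its own.

```python
import os
import urllib.request

# Same values as in the bash example; set here only to keep the sketch
# self-contained. Normally they come from the shell environment.
os.environ["http_proxy"] = "proxy.server.com:8080"
os.environ["ftp_proxy"] = "proxy.server.com:8080"

# getproxies() returns a {scheme: proxy} dictionary built from the
# <scheme>_proxy environment variables.
proxies = urllib.request.getproxies()
for scheme in ("http", "ftp"):
    print(scheme, "->", proxies.get(scheme))
```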
You first need to install the configparse module.
cOOol can be found in our training scripts git tree.
cOOol parses the XML components of each document file, looking for hyperlinks.
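The approach can be sketched as follows. This is not cOOol's actual code: it assumes an OpenDocument file, which is a zip archive whose text lives in content.xml, and it only looks at text hyperlinks (text:a elements carrying an xlink:href attribute). The function name extract_links is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def extract_links(document):
    """Return a list of (link text, URL) pairs found in an OpenDocument file.

    'document' is a file name or a file-like object.
    """
    with zipfile.ZipFile(document) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    links = []
    for anchor in root.iter("{%s}a" % TEXT_NS):
        url = anchor.get("{%s}href" % XLINK_NS)
        if url:
            # itertext() also collects text nested in child elements.
            links.append(("".join(anchor.itertext()), url))
    return links
```

Each returned URL can then be requested to see whether it still answers, and the link text is what you would search for in the document when a link turns out to be broken.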
It would have been cleaner and safer to use the OpenOffice.org API to explore the documents. However, there are also benefits in a standalone Python implementation:
- No need to start OpenOffice.org and load documents in memory. This saves a lot of time and RAM!
- No need to have an OpenOffice.org install. Nice if you need to implement a validation server using cOOol.
- Last but not least, no need to understand OpenOffice.org’s API and the internal structure of documents! By the way, that’s what makes exchange formats like XML attractive. However, we would be delighted if somebody could come up with a simpler and safer implementation based on the API that could be run within the OpenOffice.org user interface!
We are using 2 documents to make sure that cOOol finds all the kinds of broken links it is supposed to support:
Limitations and possible improvements
- cOOol doesn’t check for e-mail links. It could at least check that the corresponding domain is valid.
- cOOol doesn’t give you page numbers for broken links. You have to open the document and use the search facilities to locate each link.
- cOOol still crashes on some documents with Unicode strings (for example with Chinese text).
- cOOol has trouble with link text containing quotes, as in "what’s new". The text it outputs is truncated.
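As a starting point for the e-mail improvement mentioned above, the domain of a mailto: link could be extracted and looked up. The sketch below is only a rough approximation and not part of cOOol: it checks that the domain name resolves at all, whereas a real mail check would query MX records, which the Python standard library cannot do. The function names and the injectable resolve parameter are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def mail_domain(link):
    """Return the lower-cased domain of a mailto: link, or None."""
    parts = urlsplit(link)
    if parts.scheme != "mailto" or "@" not in parts.path:
        return None
    # With several recipients, only the last address is considered.
    return parts.path.rsplit("@", 1)[1].lower()


def domain_resolves(domain, resolve=socket.getaddrinfo):
    """Return True if the domain name can be resolved.

    'resolve' can be replaced by a stub, which is handy for testing
    without network access.
    """
    try:
        resolve(domain, None)
        return True
    except socket.gaierror:
        return False
```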