First off, I’d like to thank DocRaptor for sponsoring this article. I’m pretty well off, but this money will be able to help my siblings out, who aren’t so lucky. It also got me to discover their service, which I may find myself using in the future.
What is DocRaptor?
DocRaptor is an online service that can be used to transform HTML documents into PDFs or even Excel documents. This is a paid service, but there’s a 7-day free trial so you can have a chance to try it out first. They have 8 different plans, ranging from 125 documents per month for $15 per month to 100,000 documents for $2250, plus a level for an unlimited number of documents, for which you need to contact them to set up a price.
All of the plans allow you to go over the number of documents each month, paying a per-document overage cost equal to the rate the plan already works out to ($15 for 125 documents = 12¢ per document; $2250 for 100,000 documents = 2.25¢ per document). You can do test documents for free, but those are watermarked, and it seems they still need an API key.
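To make the overage math concrete, here is a small sketch of it. The function names are my own, and it assumes overage really is billed at exactly the plan's per-document rate, as described above.

```python
def overage_cents_per_document(monthly_price_dollars, documents_included):
    """Per-document rate in cents, which is also the overage rate."""
    return monthly_price_dollars * 100 / documents_included


def monthly_bill_dollars(monthly_price_dollars, documents_included, documents_used):
    """Plan price plus overage for any documents beyond the included amount."""
    extra_documents = max(0, documents_used - documents_included)
    return monthly_price_dollars + extra_documents * monthly_price_dollars / documents_included


print(overage_cents_per_document(15, 125))      # 12.0
print(overage_cents_per_document(2250, 100000)) # 2.25
print(monthly_bill_dollars(15, 125, 150))       # 18.0
```

So on the smallest plan, 25 extra documents would add $3 to the month under that assumption.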
In order to use the service, you submit a POST request, supplying certain options and the document you want to convert, giving the content directly or providing a URL. You also need the API Key supplied with your paid account.
These requests can be simply done with a command line call to curl (using their API reference documentation to know exactly how), but the organization also supplies libraries that you can use in Ruby, Python, Node, PHP, Java, and C#. Later on, we’ll be going over their Python APIs.
HTML as a PDF? Or as XLS? What?
Before we get into the APIs, though, we need to address the elephant in the room of how the heck they turn HTML into the other document types. It’s pretty cool, but getting into it is a little out of the scope of this post, so I’ll point you to DocRaptor’s documentation about such things. I’ve read over it, and it’s quite good.
To ease your hunger a little, though, I’ll tell you a few tidbits. When making an Excel document, the HTML contains only tables. When it comes to making PDFs, there are several metadata attributes that can be used to set the “paper” size and to deal with headings, such as auto-generating a table of contents and telling it whether a heading starts on a new page or should avoid being at the bottom of a page. It’s really interesting, so seriously check it out.
First, I’d like to point out that, even at this stage in the game, they support both Python 2 and Python 3. The library is easily installable with `pip` with the name `docraptor` (who would have guessed?).
The API is actually quite small, especially if you don’t need to use any of the special features. In the basic example on their site, you `import docraptor`, configure your API key with `docraptor.configuration.username = "your api key"`, call `create_doc()` on a newly created instance of `docraptor.DocApi` (no arguments needed for the constructor), then write the returned data into a file. This is the synchronous way of doing it, and they do provide an asynchronous way (not using Python’s async system), which we’ll get to.
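Here is a minimal sketch of that synchronous flow. The helper name `save_document` is mine, and the option keys in the comment (`test`, `document_content`, `document_type`) are what I believe DocRaptor's own example uses, so treat them as assumptions and check their docs. The helper takes the API object as a parameter so it can be tried without an API key.

```python
def save_document(doc_api, options, path):
    """Request a document synchronously and write the returned bytes to `path`."""
    document_bytes = doc_api.create_doc(options)  # blocks until the service responds
    with open(path, "wb") as output_file:
        output_file.write(document_bytes)
    return path


# With the real library (needs an API key and network access), roughly:
#   import docraptor
#   docraptor.configuration.username = "your api key"
#   options = {
#       "test": True,                      # assumed key: free, watermarked output
#       "document_content": "<html><body>Hello</body></html>",
#       "document_type": "pdf",
#   }
#   save_document(docraptor.DocApi(), options, "hello.pdf")
```

Passing `doc_api` in rather than creating it inside the function is just a choice to keep the sketch testable with a stand-in object.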
So that’s pretty much just 3 points of contact with their API. This doesn’t cover error handling, which adds one more for their exception type.
The hardest part of working with the API seems to be passing in arguments to `create_doc()`. First, the documentation for it is pretty bad; the best thing you have is the example code. I dug into the code, and it’s a pretty difficult read because, as the documentation for the class says, “This class is auto generated…” Whatever process they used for it wasn’t the best. On top of that, the in-Python documentation seems to conflict with the examples and with itself.
For example, the methods claim that the main argument they need is an object of type `Doc`, but all 3 examples simply pass in a dictionary. Digging more deeply into the code, it actually looks like the code prefers the dictionary, but you have to dig deep to find any place where it is actually used instead of just passed along. This is fine, but it makes using the code’s documentation harder. To make things worse, the `Doc` type’s documentation lies again. The constructor docs claim to need two dictionaries as arguments, but the actual parameter list is empty, other than `self`. Instead, you’re supposed to set all the properties individually after creation.
Luckily, people will most likely follow the example code more than trying to read the code’s documentation (unlike me), so these don’t really represent a real problem.
The next discussion is on asynchronicity. There seem to be two different kinds of asynchronous actions here: an asynchronous request and asynchronous creation of the document.
They give no example of the first kind, asynchronous requests, and they don’t mention it anywhere other than the in-code docs, but it kind of makes sense. It’s the type of asynchronicity you expect, and it’s triggered by passing a callback function using the `callback` named argument; the callback is called when the response returns. Instead of returning the document, the call returns the thread that the request was actually made on.
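Since there’s no official example, here is a sketch of the calling pattern only. `StandInDocApi` is a hypothetical class I wrote to mimic the behavior described above so it can run offline; it is not DocRaptor’s code, and the real library’s behavior should be checked against its in-code docs.

```python
import threading


class StandInDocApi:
    """Hypothetical stand-in, NOT DocRaptor's code: it only mimics the
    callback behavior described above so the pattern can run offline."""

    def create_doc(self, doc, callback=None):
        def fetch():
            return b"%PDF stand-in bytes"  # pretend this came from the service

        if callback is None:
            return fetch()  # plain call: returns the document itself

        # Callback given: do the work on a thread and return that thread.
        thread = threading.Thread(target=lambda: callback(fetch()))
        thread.start()
        return thread


received = []  # the callback stores the response here
api = StandInDocApi()
request_thread = api.create_doc({"document_type": "pdf"}, callback=received.append)
request_thread.join()  # wait until the callback has run
print(received[0])
```

The part that carries over to real use is the shape of the call: pass `callback=...`, keep the returned thread, and join it (or do other work) until the callback has received the response.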
The other kind is something special, but at least they give you an example. To do it, you call `create_async_doc()` instead of `create_doc()`, and the response comes back quickly, returning some sort of id for your document instead of the document itself (unless you also do an asynchronous request, in which case it will return the request thread, and the callback will receive the id). With this id, you poll the service using the `get_async_doc_status()` method, which returns the status. This status is either `"completed"`, `"failed"`, or… nothing? In the code example, it simply uses `else` for any other possible statuses, and the code doesn’t set the statuses, so I have no way to look it up. When it’s not completed or failed, you wait a little bit, then try again. When it’s completed, you make one more call to the service with the `get_async_doc()` method. This requires an id from the response that gave you a `"completed"` status. Finally, the return from that is the same as from the synchronous call, `create_doc()`, which can be written to a file. It should also be noted that these last two methods can also be given a callback to do the request asynchronously.
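The polling procedure above can be sketched roughly like this. The three method names come from the discussion above; the attribute names `status_id`, `status`, and `download_id` are what I recall from DocRaptor’s example and should be treated as assumptions. The function name, the poll interval, and the timeout default are my own choices.

```python
import time


def fetch_async_document(doc_api, options, poll_seconds=1.0, timeout_seconds=600.0):
    """Start an asynchronous document job, poll until it finishes, return the bytes."""
    create_response = doc_api.create_async_doc(options)
    deadline = time.monotonic() + timeout_seconds

    while time.monotonic() < deadline:
        # Assumed attribute names: status_id, status, download_id.
        status_response = doc_api.get_async_doc_status(create_response.status_id)
        if status_response.status == "completed":
            return doc_api.get_async_doc(status_response.download_id)
        if status_response.status == "failed":
            raise RuntimeError("document generation failed")
        time.sleep(poll_seconds)  # any other status: wait a little, then try again

    raise TimeoutError("gave up waiting for the document")
```

The returned bytes can be written to a file exactly like the result of the synchronous call. Raising on `"failed"` instead of returning `None` is a design choice of mine so the caller can’t mistake a failure for an empty document.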
But why do they provide that second type of asynchronous call? The problem stems from the possibility of the document taking a long time to convert, which makes it take a long time to receive a response, potentially causing a timeout. When done synchronously, DocRaptor limits that time to 60 seconds, so if you have a document that takes longer than that, you need to do it asynchronously. But even that isn’t limitless; DocRaptor limits that time to 10 minutes. After that, it can probably be assumed that you’re being a jerk by sending a gigantic file or there’s a problem and it needs to abort.
Finally coming back to it, the most complicated thing is passing in the right arguments to `create_<async_>doc()`. At a bare minimum, you need to specify the type of file you’re generating (pdf, xls, or xlsx) and either a string of HTML or a URL to an HTML document.
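As a rough illustration of that bare minimum, here is a small helper that builds the dictionary. The helper itself is hypothetical, and the key names (`document_type`, `document_content`, `document_url`, `test`) are my understanding of DocRaptor’s options, so verify them against the API documentation before relying on them.

```python
def build_doc_options(document_type, html=None, url=None, test=True):
    """Build a minimal options dictionary for create_doc() or create_async_doc()."""
    if document_type not in ("pdf", "xls", "xlsx"):
        raise ValueError("document_type must be 'pdf', 'xls', or 'xlsx'")
    if (html is None) == (url is None):
        raise ValueError("provide exactly one of html or url")

    options = {"document_type": document_type, "test": test}  # assumed key names
    if html is not None:
        options["document_content"] = html  # the HTML given directly
    else:
        options["document_url"] = url       # the service fetches the HTML itself
    return options


print(build_doc_options("pdf", html="<html><body>Hello</body></html>"))
```

Defaulting `test` to `True` is my choice, so that experimenting produces free, watermarked documents rather than counting against the plan.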
For the rest, you can look over the API documentation linked to earlier, and you can look at the examples to get an idea of what’s typical.
So, I got a little harsh on them in the middle there, but overall it looks like a great service that’s pretty easy to use. The hardest part is getting good documentation from the Python code itself, but if you just stick to the examples for the most part, you’ll be fine. The examples won’t help you do asynchronous requests, but hopefully this article, mixed with double-checking the code docs, will get you far enough with that if you need it.
Most of what you need to know is on their site, though (link to all the documentation pages is at the bottom of the site’s pages, down by their “contact” and “privacy” links), so I highly recommend using that.
Thanks for reading! Soon, I’ll be putting out a similar post on using the Java API, including using it in Kotlin.