- Published
- 10 December 2021
You ever get that feeling where you think, "I wonder if I can do x?" And then you spend a good day working out how to do it, and document it all in a blog post?
No? Just me? OK.
I have a little bit of writing that I keep on this site. I vacillate between being proud of it and being ashamed of it. Perhaps over time it'll slowly get bigger. In the meantime, wouldn't it be nice if you could download the writing as an epub? It would, wouldn't it?
What's in an .epub?
So according to Wikipedia, an epub consists of:
- XHTML or DTBook files to represent the text and structure of the document,
- a subset of CSS to provide layout and formatting,
- XML to create the document manifest, table of contents, and metadata.
And these files are all zipped up into one package. Sounds pretty logical. Let's take one apart!
First, we grab an epub:
Let's rename it to a .zip
and see if we can open it:
So it seems like OS X's default archive utility doesn't like this zipped epub. However, when I unzip it with The Unarchiver, everything runs smoothly. No idea what that's about.
OK, what do we see when we unzip everything?
It looks like we have files in the following structure:
epub.zip
+ META-INF
| + container.xml
+ mimetype
+ OEBPS
+ @export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png
+ @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html
+ @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html
+ @public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html
+ ...<69 more documents like this skipped>
+ 0.css
+ 1.css
+ content.opf
+ pgepub.css
+ toc.ncx
+ wrap0000.html
OK. Let's go through these files and work out what's in each of them. Combined with the Wikipedia article above, that should give us an idea of what goes into the epub.
container.xml
This is a basic file that just tells the epub reader where to find the juicy stuff. Check it out:
<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
That's hardly anything! In fact, this is a stock file and varies very little from file to file.
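To make the "where to find the juicy stuff" part concrete, here's a minimal sketch of what a reader might do with this file. It assumes Ruby's bundled REXML library is available, and `rootfile_path` is a made-up helper name for illustration, not part of any epub library.

```ruby
require "rexml/document"

CONTAINER_XML = <<~XML
  <?xml version='1.0' encoding='utf-8'?>
  <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
    <rootfiles>
      <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
    </rootfiles>
  </container>
XML

# Hypothetical helper: return the path of the package file that
# container.xml points at, or nil if there is no <rootfile> element.
def rootfile_path(container_xml)
  doc = REXML::Document.new(container_xml)
  rootfile = doc.root.find_first_recursive { |node| node.name == "rootfile" }
  rootfile && rootfile.attributes["full-path"]
end

puts rootfile_path(CONTAINER_XML)
```

Everything else in the book is then found by reading the file at that path.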
mimetype
If you thought container.xml was bare, you're in for a surprise here. This is literally just our mimetype:
application/epub+zip
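One detail worth knowing: as far as the EPUB container spec goes, readers expect mimetype to be the very first entry in the zip, stored without compression. Here's a rough sketch of a check for that, reading only the first local file header of the archive. The helper names are invented for this example, and it's a simplification: archives written with a trailing "data descriptor" record sizes of zero in the header and would fail this check.

```ruby
EPUB_MIMETYPE = "application/epub+zip"
LOCAL_HEADER_FORMAT = "VvvvvvVVVvv" # the 30-byte zip local file header
LOCAL_HEADER_SIZE = 30

# Hypothetical helper: true if the first zip entry is an uncompressed file
# called "mimetype" whose content is exactly the epub mimetype.
def epub_mimetype_first?(bytes)
  bytes = bytes.b
  return false if bytes.bytesize < LOCAL_HEADER_SIZE

  signature, _version, _flags, method, _time, _date, _crc, _csize, size, name_len, extra_len =
    bytes.byteslice(0, LOCAL_HEADER_SIZE).unpack(LOCAL_HEADER_FORMAT)
  return false unless signature == 0x04034b50

  name = bytes.byteslice(LOCAL_HEADER_SIZE, name_len)
  data = bytes.byteslice(LOCAL_HEADER_SIZE + name_len + extra_len, size)
  # Compression method 0 means "stored", i.e. not compressed
  name == "mimetype" && method.zero? && data == EPUB_MIMETYPE
end

# Build a fake first entry by hand so the example runs without a real epub.
def fake_zip_entry(name, data, method)
  header = [0x04034b50, 20, 0, method, 0, 0, 0, data.bytesize, data.bytesize, name.bytesize, 0]
  header.pack(LOCAL_HEADER_FORMAT) + name + data
end

puts epub_mimetype_first?(fake_zip_entry("mimetype", EPUB_MIMETYPE, 0))
puts epub_mimetype_first?(fake_zip_entry("mimetype", EPUB_MIMETYPE, 8))
```

On a real file you'd pass in the result of File.binread on the epub instead of the hand-built entry.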
OEBPS
This is where the action happens. Those files starting with @export or @public definitely look like they've been exported by some kind of automated publishing tool. The cover is just the PNG cover of the book, while the html files are the individual chapters of the book. For example, here's the start of the preface, taken from the file ending in 1257-h-2.htm.html:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.6.0"/>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<meta http-equiv="Content-Style-Type" content="text/css"/>
<title>The Project Gutenberg eBook of The Three Musketeers, by Alexandre Dumas, Père</title>
<link href="0.css" rel="stylesheet" type="text/css"/>
<link href="1.css" rel="stylesheet" type="text/css"/>
<link href="pgepub.css" rel="stylesheet" type="text/css"/>
<meta name="generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
</head>
<body class="x-ebookmaker"><div class="chapter" id="pgepubid00002">
<h2><a id="pref01"/>AUTHOR’S PREFACE</h2>
<p class="pfirst"><span class="dropcap c6">I</span><span class="dropspan">n</span> which it is proved that, notwithstanding their names’ ending in <i>os</i> and <i>is</i>, the heroes of the story which we are about to have the honor to relate to our readers have nothing mythological about them.</p>
<p class="p2">A short time ago, while making researches in the Royal Library for my History of Louis XIV...
What about those files at the end? We have three CSS files, plus an opf, an ncx, and one weird html file. The weird html file is just a wrapper page for the cover, and the css files are just styling info for the book. The opf and ncx are more interesting though.
content.opf
This is an Open Packaging Format file, and it's here that we define all the interesting stuff about the book. Once again it's XML:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
<metadata>
<dc:rights>Public domain in the USA.</dc:rights>
<dc:identifier opf:scheme="URI" id="id">http://www.gutenberg.org/1257</dc:identifier>
<dc:creator opf:file-as="Dumas, Alexandre">Alexandre Dumas</dc:creator>
<dc:title>The Three Musketeers</dc:title>
<dc:language xsi:type="dcterms:RFC4646">en</dc:language>
<dc:subject>Historical fiction</dc:subject>
<dc:subject>France -- History -- Louis XIII, 1610-1643 -- Fiction</dc:subject>
<dc:subject>Adventure and adventurers -- Fiction</dc:subject>
<dc:subject>Swordsmen -- Fiction</dc:subject>
<dc:date opf:event="publication">1998-03-01</dc:date>
<dc:date opf:event="conversion">2021-09-07T17:00:12.890029+00:00</dc:date>
<dc:source>https://www.gutenberg.org/files/1257/1257-h/1257-h.htm</dc:source>
<meta name="cover" content="item1"/>
</metadata>
<manifest>
<!--Image: 1200 x 1800 size=41509 -->
<item href="@export@sunsite@users@gutenbackend@cache@epub@1257@1257-cover.png" id="item1" media-type="image/png"/>
<item href="pgepub.css" id="item2" media-type="text/css"/>
<item href="0.css" id="item3" media-type="text/css"/>
<item href="1.css" id="item4" media-type="text/css"/>
<!--Chunk: size=2934 Split on div.chapter-->
<item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html" id="item5" media-type="application/xhtml+xml"/>
<!--Chunk: size=11517 Split on div.chapter-->
<item href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html" id="item6" media-type="application/xhtml+xml"/>
<!--...plus a bunch more...-->
<item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
<item href="wrap0000.html" id="coverpage-wrapper" media-type="application/xhtml+xml"/>
</manifest>
<spine toc="ncx">
<itemref idref="coverpage-wrapper" linear="yes"/>
<itemref idref="item5" linear="yes"/>
<itemref idref="item6" linear="yes"/>
<!--...plus a bunch more...-->
</spine>
<guide>
<reference type="toc" title="CONTENTS" href="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
<reference type="cover" title="Cover" href="wrap0000.html"/>
</guide>
</package>
So it looks like we have:
- A bunch of metadata (spoiler alert: we're not going to define most of this for any automatic epub bundling we do).
- A manifest of all of the files, which are assigned IDs and media types.
- A "spine", which lists "all the XHTML content documents in their linear reading order". This also has a reference to the table of contents, which we'll get to in a bit.
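The manifest and the spine are really two views of the same list: every file gets an id in the manifest, and the spine replays those ids in reading order. Here's a small sketch of generating both from a list of chapter files. The helper name and the item1, item2, ... id scheme are assumptions for this example (they just mimic the Gutenberg file above).

```ruby
# Hypothetical helper: build <manifest> and <spine> fragments for a list of
# chapter files, assigning ids item1, item2, ... in reading order.
def manifest_and_spine(chapter_hrefs)
  items = chapter_hrefs.each_with_index.map do |href, index|
    { id: "item#{index + 1}", href: href }
  end

  manifest = items.map do |item|
    # encode(xml: :attr) escapes the value and wraps it in double quotes
    %(<item href=#{item[:href].encode(xml: :attr)} id="#{item[:id]}" media-type="application/xhtml+xml"/>)
  end
  spine = items.map { |item| %(<itemref idref="#{item[:id]}" linear="yes"/>) }

  <<~XML
    <manifest>
      #{manifest.join("\n  ")}
      <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
    </manifest>
    <spine toc="ncx">
      #{spine.join("\n  ")}
    </spine>
  XML
end

puts manifest_and_spine(["preface.html", "chapter-1.html"])
```

Note that toc.ncx is listed in the manifest but not in the spine: it's referenced through the spine's toc attribute instead.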
toc.ncx
Wikipedia informs me that the table of contents file (which is a Navigation Control file for XML, or ncx file) is "traditionally named toc.ncx", and it looks like the following:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
<head>
<meta name="dtb:uid" content="http://www.gutenberg.org/1257"/>
<meta name="dtb:depth" content="1"/>
<meta name="dtb:generator" content="Ebookmaker 0.11.9 by Project Gutenberg"/>
<meta name="dtb:totalPageCount" content="0"/>
<meta name="dtb:maxPageNumber" content="0"/>
</head>
<docTitle>
<text>The Three Musketeers</text>
</docTitle>
<navMap>
<navPoint id="np-1" playOrder="1">
<navLabel>
<text>The Three Musketeers</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-0.htm.html#pgepubid00000"/>
</navPoint>
<navPoint id="np-2" playOrder="2">
<navLabel>
<text>CONTENTS</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-1.htm.html#pgepubid00001"/>
</navPoint>
<navPoint id="np-3" playOrder="3">
<navLabel>
<text>AUTHOR’S PREFACE</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-2.htm.html#pgepubid00002"/>
</navPoint>
<navPoint id="np-4" playOrder="4">
<navLabel>
<text>The Three Musketeers</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-3.htm.html#pgepubid00003"/>
</navPoint>
<navPoint id="np-5" playOrder="5">
<navLabel>
<text>1 THE THREE PRESENTS OF D’ARTAGNAN THE ELDER</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-4.htm.html#pgepubid00004"/>
</navPoint>
<!-- ... -->
<navPoint id="np-72" playOrder="72">
<navLabel>
<text>EPILOGUE</text>
</navLabel>
<content src="@public@vhost@g@gutenberg@html@files@1257@1257-h@1257-h-71.htm.html#pgepubid00071"/>
</navPoint>
</navMap>
</ncx>
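The navMap is very regular: one navPoint per entry, numbered from 1, each with a label and a link. A sketch of generating it from a list of title and href pairs might look like this. The `nav_map` helper is made up for illustration; the np-N ids simply copy the convention in the file above.

```ruby
# Hypothetical helper: turn [title, href] pairs into the <navMap> section of
# a toc.ncx file, numbering entries from 1 in the order given.
def nav_map(entries)
  nav_points = entries.each_with_index.map do |(title, href), index|
    n = index + 1
    <<~XML
      <navPoint id="np-#{n}" playOrder="#{n}">
        <navLabel>
          <text>#{title.encode(xml: :text)}</text>
        </navLabel>
        <content src=#{href.encode(xml: :attr)}/>
      </navPoint>
    XML
  end
  "<navMap>\n#{nav_points.join}</navMap>\n"
end

puts nav_map([
  ["CONTENTS", "contents.html"],
  ["Tom & Jerry", "chapter-1.html#start"]
])
```

The escaping matters more than it looks: a title containing an ampersand or angle bracket would otherwise produce an ncx file that isn't well-formed XML.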
Building our own epub
OK! So, we've gone through what makes an epub. We should therefore be able to make our own.
I'm going to integrate this with my current static site build, which uses nanoc - at this stage all I have on the site are short stories, so we can get away with a single file/chapter per epub. This should make it pretty easy to build these.
Our process will need to look like the following:
- Identify all the pieces of writing we have on the site.
- Make an epub for each:
  a. Parse the story (which could be either markdown or haml) into html.
  b. Create a basic TOC.
  c. Create a basic content.opf file.
  d. Package and zip it all up.
Let's see how easy this will be to do!
Step 1: Add epub items
Nanoc builds pages and files in the output site by linking each input file to an output file through a set of rules. So this post I'm writing right now is a markdown file which will get converted to an html file (with the appropriate wrapper) when I render it. I still want each piece of writing to turn up on the site, so I need to create a new, duplicate item to produce the relevant epub.
Thankfully, we can do this with @items.create. I'm going to do this in the preprocess step:
preprocess do
  # ...a bunch of other stuff that my site needs...
  #---------------------------------------
  # Create epubs for each piece of writing
  @items
    .find_all("/writing/**/*.{md,haml}")
    .each do |i|
      # Don't create epubs of index files - just of others
      next if i.identifier =~ /index\.(haml|md)$/
      # Slot "epub" in before the extension: a-story.md becomes a-story.epub.md
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")
      @items.create(i.raw_content, {}, new_identifier)
    end
end
So this will take every piece of writing (at least every markdown and haml file in the writing folder) and create an epub equivalent. I'm also going to add a compile rule for these new items, which looks like this:
compile '/writing/**/*.epub.{haml,md}' do
write ext: "epub"
end
Right now all this will do is take the base text and output it to a file that has the same name as the original, but with the extension ".epub". So for example, this will output the item /writing/a-story.epub.md to the file /writing/a-story.epub.
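If the identifier juggling is hard to picture, here's the same logic on plain strings. This is only a sketch: it assumes Nanoc identifiers behave like ordinary path strings as far as these regexes are concerned, and `epub_identifier` is a name invented for the example.

```ruby
# Matches index files, which don't get an epub of their own
INDEX_FILE = /index\.(haml|md)$/

# Hypothetical helper: the identifier of the duplicate "epub" item for a
# story, or nil if the story is an index file.
def epub_identifier(identifier)
  return nil if identifier =~ INDEX_FILE

  # Slot "epub" in before the final extension
  identifier.gsub(/(haml|md)$/, "epub.\\1")
end

["/writing/a-story.md", "/writing/poems/ode.haml", "/writing/index.md"].each do |id|
  puts "#{id} -> #{epub_identifier(id).inspect}"
end
```

Keeping the original extension at the end means the epub item can still be told apart as markdown or haml later on.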
Actually, all of the files on my site tend to be output as index.html files within folders - so this story would actually be located at /writing/a-story/index.html. So I'm going to change the output code so that the epub sits alongside the file itself:
compile '/writing/**/*.epub.{haml,md}' do
# Write to a path which means the file goes alongside the index.html of the
# story itself
# Remove the haml or md extension
output_filename = File.basename(item.identifier.without_ext)
# Remove all the extensions - haml/md and epub
output_path = item.identifier.without_exts
write "#{output_path}/#{output_filename}"
end
OK! We've now got an epub step in our processing, and it's outputting to the right place. It's a bit more complex to build the epubs themselves.
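To double-check the path arithmetic in that rule, here's a sketch of it on plain strings. The two small helpers are stand-ins for Nanoc's identifier methods, written under the assumption that without_ext strips the last extension and without_exts strips all of them.

```ruby
# Stand-in for identifier.without_ext: drop only the last extension
def without_ext(path)
  path.sub(/\.[^.\/]+\z/, "")
end

# Stand-in for identifier.without_exts: drop every extension
def without_exts(path)
  path.sub(/(\.[^.\/]+)+\z/, "")
end

# Hypothetical helper mirroring the compile rule: where should the epub
# for a given epub item identifier be written?
def epub_output_path(epub_identifier)
  output_filename = File.basename(without_ext(epub_identifier)) # e.g. "a-story.epub"
  output_path = without_exts(epub_identifier)                   # e.g. "/writing/a-story"
  "#{output_path}/#{output_filename}"
end

puts epub_output_path("/writing/a-story.epub.md")
```

So the epub lands in the same folder as the story's index.html, named after the story itself.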
Step 2: Build an epub
The usual way we convert something from one format to another in nanoc is through filters. It's really easy to build a filter. Here's an example that appears on Nanoc's site:
class CensorFilter < Nanoc::Filter
identifier :censor
def run(content, params = {})
content.gsub('Nanoc sucks', 'Nanoc rocks')
end
end
So this would then allow us to use the filter in our Rules:
compile '/some/glob' do
filter :censor
# ...
end
This is a text-to-text filter - we take in text, and we return text. But you can also create a text-to-binary filter, which looks more like the following:
class EpubFilter < Nanoc::Filter
  identifier :epub
  type :text => :binary
  def run(content, params = {})
    # This method receives the text content as its first argument, and should
    # write its result to the file at the path given by `output_filename`.
  end
end
So what do we need to do here? We need to:
- Create a temporary folder (with the relevant META-INF and OEBPS folders)
- Create a bunch of relevant files:
  - container.xml
  - mimetype
  - content.opf
  - toc.ncx
- Add the rendered story content
- And then zip the whole thing up!
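Taken literally, everything in that list except the zipping can be sketched with nothing but the standard library. The helper name and the placeholder file bodies below are assumptions for illustration; the real XML would come from templates like the ones we pulled apart earlier.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: lay out an unzipped epub under `root` and return the
# relative paths of the files written. File bodies are passed in as strings.
def write_epub_tree(root, container_xml:, content_opf:, toc_ncx:, content_html:)
  FileUtils.mkdir_p(File.join(root, "META-INF"))
  FileUtils.mkdir_p(File.join(root, "OEBPS"))

  File.write(File.join(root, "mimetype"), "application/epub+zip")
  File.write(File.join(root, "META-INF", "container.xml"), container_xml)
  File.write(File.join(root, "OEBPS", "content.opf"), content_opf)
  File.write(File.join(root, "OEBPS", "toc.ncx"), toc_ncx)
  File.write(File.join(root, "OEBPS", "content.html"), content_html)

  Dir.glob("**/*", base: root).select { |path| File.file?(File.join(root, path)) }.sort
end

# The temporary folder is removed again when the block finishes
Dir.mktmpdir("epub") do |root|
  files = write_epub_tree(
    root,
    container_xml: "<container/>", # placeholder bodies, not valid epub content
    content_opf: "<package/>",
    toc_ncx: "<ncx/>",
    content_html: "<html/>"
  )
  puts files
end
```

This leaves a folder that still has to be zipped as a separate step.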
Step 2d: Zipping it all up
If you've been following along at home, right now you'll be going "Now hold on Jan, this is the last step! We should start at 2a!" And you're right. But it turns out that when you search for ruby zip gems, you quickly find yourself checking out rubyzip, and rubyzip lets us assemble our zipped folder in situ. For example, here's some code that produces our mimetype file in a zipped folder:
require "zip"
class EpubFilter < Nanoc::Filter
  identifier :epub
  type :text => :binary
  # `content` is the item's text; the zip is written to `output_filename`
  def run(content, params = {})
    Zip::File.open(output_filename, create: true) do |zipfile|
      zipfile.get_output_stream("mimetype") { |f| f.write "application/epub+zip" }
    end
  end
end
So that means we can fold steps 2a through 2c into 2d.
It's nice and easy to add container.xml to this file as well:
class EpubFilter < Nanoc::Filter
  identifier :epub
  type :text => :binary
  # `content` is the item's text; the zip is written to `output_filename`
  def run(content, params = {})
    Zip::File.open(output_filename, create: true) do |zipfile|
      # mimetype
      zipfile.get_output_stream("mimetype") { |f| f.write "application/epub+zip" }
      # META-INF/container.xml
      zipfile.get_output_stream("META-INF/container.xml") do |f|
        f.write <<~XML
          <?xml version='1.0' encoding='utf-8'?>
          <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
            <rootfiles>
              <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
            </rootfiles>
          </container>
        XML
      end
    end
  end
end
We still need to add the table of contents, and the file itself - let's split them out into their own functions. Here's the final look:
require "zip"
# Needed by generate_content_html below
require "redcarpet"
require "haml"
class EpubFilter < Nanoc::Filter
identifier :epub
type :text => :binary
def run(content, params = {})
Zip::File.open(output_filename, create: true) do |zipfile|
# mimetype
zipfile.get_output_stream("mimetype"){ |f| f.write generate_mimetype }
# META-INF/container.xml
zipfile.get_output_stream("META-INF/container.xml"){ |f| f.write generate_container }
# OEBPS/toc.ncx
zipfile.get_output_stream("OEBPS/toc.ncx"){ |f| f.write generate_toc }
# OEBPS/content.html
zipfile.get_output_stream("OEBPS/content.html"){ |f| f.write generate_content_html(content) }
# OEBPS/content.opf
zipfile.get_output_stream("OEBPS/content.opf"){ |f| f.write generate_content_opf }
end
end
# Generator functions --------------------------------------------------------
def generate_mimetype
"application/epub+zip"
end
def generate_container
return <<-end
<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
end
end
def generate_toc
return <<-end
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">
<head>
</head>
<docTitle>
<text>#{item[:title]}</text>
</docTitle>
<navMap>
<navPoint id="np-1" playOrder="1">
<navLabel>
<text>#{item[:title]}</text>
</navLabel>
<content src="content.html"/>
</navPoint>
</navMap>
</ncx>
end
end
def generate_content_html(content)
rendered_content = case item.identifier.ext
when "md"
markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
markdown.render(content)
when "haml"
Haml::Engine.new(content).render
else
raise "Don't know how to convert #{item.identifier.ext}"
end
return <<-end
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<meta http-equiv="Content-Style-Type" content="text/css"/>
<title>#{item[:title]}</title>
</head>
<body>#{rendered_content}</body>
</html>
end
end
def generate_content_opf
return <<-end
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
<metadata>
<dc:creator opf:file-as="Ruzicka, Jan-Yves">Jan-Yves Ruzicka</dc:creator>
<dc:title>#{item[:title]}</dc:title>
<dc:language xsi:type="dcterms:RFC4646">en</dc:language>
<dc:date opf:event="publication">#{item[:date]}</dc:date>
</metadata>
<manifest>
<item href="content.html" id="content" media-type="application/xhtml+xml"/>
<item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
</manifest>
<spine toc="ncx">
<itemref idref="content" linear="yes"/>
</spine>
</package>
end
end
end
And now we're creating epubs as we go! Super-easy! OK, final step: adding a download link into our site.
Linking to the epub
If you visit a story page right now, it looks like the following:
How about we put a little button just under the title, which allows the viewer to download the epub version of the story?
Nanoc uses layouts to wrap content in boilerplate html. Right now, our writing uses a very basic "default" layout, that provides the header, sidebar, footer, all the rest. Let's quickly make a custom layout for our writing:
# layouts/writing.haml
=render "/default.*" do
  %p
    =link_to "Download epub", "#", class: "button"
  =yield
This layout will render the default layout, as normal, but rather than just spitting out the content of the story, it'll put a little "download" button at the top. The button won't do anything yet, but it'll look pretty (thanks to some default rendering).
Inside our rules, we need to set things up to ensure the stories use this layout[1]:
compile '/writing/**/*.{haml,md}' do
# Filter
case item.identifier.ext
when "haml"
filter :haml
when "md"
filter :redcarpet, renderer: MarkdownOptions::renderer, options: MarkdownOptions::options
else
raise RuntimeError, "Don't know how to render #{item.identifier}"
end
layout '/writing.*'
end
Now, let's hook things up! Actually, you know what? I keep on referring to the epub's location - why don't we define it in the preprocessing step and make it an attribute of both the story, and its epub equivalent:
preprocess do
  # All the other stuff...
  #---------------------------------------
  # Create epubs for each piece of writing
  @items
    .find_all("/writing/**/*.{md,haml}")
    .each do |i|
      # Don't create epubs of index files - just of others
      next if i.identifier =~ /index\.(haml|md)$/
      # Identifier for the epub element
      new_identifier = i.identifier.to_s.gsub(/(haml|md)$/, "epub.\\1")
      # Final URL for the epub
      i[:epub_location] =
        i.identifier.without_exts +
        "/" +
        File.basename(i.identifier.without_ext) +
        ".epub"
      @items.create(i.raw_content, i.attributes, new_identifier)
    end
end
Now we're defining the final URL where the epub should end up, and setting it to the :epub_location attribute of both items. And that means we can simplify the epub rendering step in our Rules:
compile '/writing/**/*.epub.{haml,md}' do
filter :epub
write item[:epub_location]
end
And finally, we can link up that button in our layout:
# layouts/writing.haml
=render "/default.*" do
  %p
    =link_to "Download epub", item[:epub_location], class: "button"
  =yield
And now, look what we have!
Success all around.
[1] I'm using some custom Redcarpet options, which is why I have that interesting set of parameters going to the redcarpet filter.