<p>
<h1>epub2txt -- Extract text from EPUB documents</h1>
Version 0.1.5, September 2017
<h2>What is this?</h2>
<code>epub2txt</code> is a simple command-line utility for
extracting text from
EPUB documents and, optionally, re-flowing it to fit a text display
of a particular number of columns. It is written entirely in ANSI-standard
C, and should run on any Unix-like system with a C compiler. It is
intended for reading EPUB e-books on embedded systems that can't host a
graphical EPUB viewer, or converting such e-books to read on those systems.
However, it should be robust enough for other purposes, such as batch
indexing of EPUB document collections.
</p>
<code>epub2txt</code> favours speed and low memory usage over
accuracy of rendering. Most of the formatting of the source document
will be lost but, with a text-only display, this is likely to be of
little consequence.
</p>
This utility is specifically written to have no dependencies on external
libraries, except the standard C library, and even on this it makes
few demands. It does expect to be able to run an "unzip" command,
however. The purpose of minimizing dependencies is to allow the
utility to build on embedded systems without needing to build a bunch
of dependencies.
</p>
<code>epub2txt</code> will output UTF8-encoded text by default, but can
be told to output ASCII, in which case it will try to convert non-ASCII
characters into something displayable if possible.
<h2>Prerequisites</h2>
<code>epub2txt</code> is intended to run on Linux and other Unix-like
systems. It makes use of the common Unix <code>unzip</code> utility
but has no other dependencies.
It builds and runs on Windows under Cygwin,
but not as a native Windows console application.
The system must be set up such that there is a temporary
directory at <code>/tmp</code> that users can write to, unless the
environment variable <code>TMP</code> is set, in which case the utility
will use that instead.
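<p/>
The directory choice described above can be sketched as follows. This is
only an illustration, not code from the <code>epub2txt</code> source: the
function name <code>choose_temp_dir</code> is hypothetical, and treating an
empty <code>TMP</code> the same as an unset one is my assumption.

```c
#include <stddef.h>

/* Hypothetical helper (not from the epub2txt source). Given the value of
   the TMP environment variable, or NULL if it is unset, return the
   directory to use for temporary files. Assumption: an empty TMP is
   treated as unset. */
const char *choose_temp_dir(const char *tmp_env)
{
    if (tmp_env != NULL && tmp_env[0] != '\0')
        return tmp_env;
    return "/tmp";
}
```

A caller would typically pass the result of <code>getenv("TMP")</code>
from <code>stdlib.h</code>.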
<h2>Building and installing</h2>
<code>epub2txt</code> builds and installs from a simple Makefile.
On most systems, all you should need to do is
<pre class="codeblock">
$ make
# make install
</pre>
<h2>Bugs and limitations</h2>
There is no support for any form of DRM or encryption, and such support
is unlikely to be added in the future.
<p/>
<code>epub2txt</code> only handles documents that use
UTF8 (or ASCII) encoding, which I believe is more-or-less
universal in EPUB,
and writes output only in UTF8 encoding (or ASCII, with <code>--ascii</code>),
regardless of the platform's locale. This limitation exists because
<code>epub2txt</code> does all its own multibyte to fixed-size
character encoding conversions
to avoid creating a dependency on an external library. Doing this for UTF8
is enough work on its own; doing it for arbitrary encodings would be
overwhelming.
<p/>
The program can't correct errors in encoding, and there are a large number
of EPUB documents in public repositories that contain encoding errors.
A common problem is spurious use of non-UTF8 8-bit characters, often
in documents that have been converted from Microsoft Office applications.
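<p/>
To give an idea of what the multibyte to fixed-size conversion involves,
here is a minimal sketch of a UTF8 decoder in plain C. It is not the
decoder used by <code>epub2txt</code>; the name <code>utf8_decode</code>
and its signature are made up for this example. It turns one UTF8 sequence
into a single code point, and returns 0 for malformed input, such as the
stray 8-bit characters mentioned above.

```c
#include <stddef.h>

/* Illustrative sketch, not taken from the epub2txt source.
   Decode one UTF-8 sequence starting at s, with len bytes available.
   On success, store the code point in *out and return the number of
   bytes consumed (1 to 4). Return 0 if the input is truncated or is
   not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *out)
{
    unsigned long cp;
    size_t need, i;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {
        cp = s[0];
        need = 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F;
        need = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F;
        need = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        cp = s[0] & 0x07;
        need = 4;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }

    if (need > len)
        return 0; /* sequence is cut short */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0; /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and out-of-range values */
    if ((need == 2 && cp < 0x80) ||
        (need == 3 && cp < 0x800) ||
        (need == 4 && cp < 0x10000) ||
        cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return need;
}
```

A lone byte 0xE9 (an e-acute in ISO-8859-1, a typical leftover from an
office application) is rejected by this function, because in UTF8 it
announces a three-byte sequence that never arrives.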
<p/>
<code>epub2txt</code> does not justify text to both margins, as there are already many
good utilities to do this. A simple approach is to pipe the output
into <code>nroff</code>, without specifying a width (<code>-w</code>).
Not specifying a width turns off line-breaking in <code>epub2txt</code>,
allowing <code>nroff</code> to justify the paragraphs.
It will probably
also be necessary to use the <code>--ascii</code> option,
as <code>nroff</code> does not
handle UTF8 text very well. For example:
<pre class="codeblock">
epub2txt -a mydoc.epub | nroff
</pre>
<p/>
<code>epub2txt</code> extracts text aggressively, and will include things that
cannot possibly be rendered properly in plain text. This includes constructs
like indices and tables of contents, which will be of little use. The captions
of pictures will also likely be included, even though the pictures themselves
cannot. It seemed
better to err on the side of extracting too much text than too little;
unfortunately there is little in the EPUB format to distinguish content that
is meaningful in a text-only representation from that which is not.
<p/>
It is unlikely that any kind of fixed-layout structure of the
source document will be rendered accurately in plain text, so
<code>epub2txt</code> does not try. Tabs and other layout elements are
collapsed
into spaces, and text re-flowed according to the set line length, if any.
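<p/>
The collapse-and-reflow behaviour described above might look something
like the following sketch. It is not the <code>epub2txt</code>
implementation: the function <code>reflow</code> is hypothetical, it
counts columns in bytes (so it is only right for ASCII input), and it
takes a width of 0 to mean "no line-breaking", which is my own convention
for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch, not taken from the epub2txt source.
   Copy `in` to `out`, collapsing every run of whitespace (spaces, tabs,
   newlines) to a single separator. When width > 0, the separator becomes
   a newline wherever the next word would not fit in `width` columns.
   Columns are counted in bytes, so this is only correct for ASCII.
   `cap` is the size of `out`, including the terminating NUL.
   Returns 0 on success, -1 if `out` is too small. */
int reflow(const char *in, size_t width, char *out, size_t cap)
{
    size_t o = 0;   /* bytes written so far */
    size_t col = 0; /* current output column */

    if (cap == 0)
        return -1;

    while (*in != '\0') {
        const char *word;
        size_t wlen;

        /* Skip any run of whitespace */
        while (*in != '\0' && isspace((unsigned char)*in))
            in++;
        if (*in == '\0')
            break;

        /* Find the end of the next word */
        word = in;
        while (*in != '\0' && !isspace((unsigned char)*in))
            in++;
        wlen = (size_t)(in - word);

        /* Emit one separator before every word except the first on a line */
        if (col > 0) {
            char sep = (width > 0 && col + 1 + wlen > width) ? '\n' : ' ';
            if (o + 1 >= cap)
                return -1;
            out[o++] = sep;
            col = (sep == '\n') ? 0 : col + 1;
        }

        if (o + wlen >= cap)
            return -1;
        memcpy(out + o, word, wlen);
        o += wlen;
        col += wlen;
    }

    out[o] = '\0';
    return 0;
}
```

A word longer than the width is left on a line of its own rather than
split. Handling multi-byte characters properly means counting characters
rather than bytes, which this sketch does not attempt.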
<p/>
Conversion of Unicode to ASCII is, in the general case, impossible. The
<code>--ascii</code> switch tells <code>epub2txt</code> to perform some
common conversions, such as straight quotes for angled quotes.
It will also attempt to replace accented Latin characters with non-accented
equivalents, at least for commonly-used characters. However, there are
a huge number of characters in the Unicode set that cannot be rendered,
even approximately, in ASCII.
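<p/>
As a rough illustration of this kind of conversion, the sketch below maps
a handful of code points to ASCII stand-ins. The table is deliberately
tiny and is not the one <code>epub2txt</code> uses; the function name
<code>ascii_fallback</code> is hypothetical. It is only meant for code
points outside the ASCII range, since ASCII characters can be copied
through unchanged.

```c
#include <stddef.h>

/* Illustrative sketch, not taken from the epub2txt source.
   Return an ASCII approximation of a non-ASCII Unicode code point,
   or NULL if this small table has nothing suitable. */
const char *ascii_fallback(unsigned long cp)
{
    switch (cp) {
    /* Curly single and double quotes become straight quotes */
    case 0x2018: case 0x2019:
        return "'";
    case 0x201C: case 0x201D:
        return "\"";

    /* A few accented Latin letters lose their accents */
    case 0x00C0: case 0x00C1: case 0x00C2: case 0x00C4:
        return "A";
    case 0x00C8: case 0x00C9: case 0x00CA: case 0x00CB:
        return "E";
    case 0x00E0: case 0x00E1: case 0x00E2: case 0x00E4:
        return "a";
    case 0x00E7:
        return "c";
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB:
        return "e";
    case 0x00F1:
        return "n";

    default:
        return NULL; /* no reasonable ASCII equivalent in this table */
    }
}
```

What to do when the function returns NULL (drop the character,
substitute a placeholder, and so on) is a policy decision that the
sketch leaves to the caller.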
<h2>Revision history</h2>
<table cellpadding="5" cellspacing="5">
<tr>
<td valign="top">
0.1.5, September 2017
</td>
<td valign="top">
Some fixes related to line-wrapping with multi-byte characters; support
(after a fashion) for manifest files with namespaces.
</td>
</tr>
<tr>
<td valign="top">
0.1.4, May 2017
</td>
<td valign="top">
Remove unnecessary KBOX support kludges
</td>
</tr>
<tr>
<td valign="top">
0.1.3, March 2016
</td>
<td valign="top">
Fixed a bug that caused epub2txt to fail when XML files contained a
UTF-8 BOM
</td>
</tr>
<tr>
<td valign="top">
0.1.2, September 2015
</td>
<td valign="top">
Fixed a bug that caused strings like
"%222022020," which might legitimately appear in URLs, to be treated as
text length specifiers.
</td>
</tr>
<tr>
<td valign="top">
0.1.1, April 2015
</td>
<td valign="top">
Fixed some bugs with integer sizes that caused problems on 64-bit systems
</td>
</tr>
<tr>
<td valign="top">
0.0.1
</td>
<td valign="top">
First functional release
</td>
</tr>
</table>
<h2>Downloads</h2>
Please read the installation instructions before downloading. Note also
that only the source bundle is sure to be up-to-date; the binaries depend
on the availability of specific build platforms, and always lag the
source by a minor version or two.
<p/>
<a href="epub2txt-0.1.4.tar.gz">Source code bundle for all platforms</a><br/>
The latest source can also be checked out from <a href="https://github.com/kevinboone/epub2txt">github</a>.
<h2>Further information</h2>
<a href="epub2txt.man.html">epub2txt man page</a><br/>
<h2>Author and legal</h2>
<i>epub2txt</i> is maintained by Kevin Boone, and distributed under the terms
of the GNU General Public Licence, v2.0. Essentially, this means that you may
use this software as you wish, at your own risk, provided that the
original author continues to be acknowledged.
<p/>
Please report bugs, etc., using the details on the
<a href="contact.html">contact page</a>.