Unicode in RTF documents
03 Mar 2016. Well, while looking for a way to turn the \uhhhh-style escapes that cfv prints back into UTF-8 strings, I found this article, so I am reposting it here. Along the way I noticed it was a Jekyll-style site, searched for the source, and found it. By the way, his use of { {page.description} } is rather distinctive. My own case, of course, I still solved with Perl once I had the right keywords.
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename;
use File::Path 2.08;
use Unicode::Escape;

#my $Prefix = 't/';
while (<>) {
    # Only handle cfv "file size does not match" lines; capture both sizes.
    next unless s/ : file size does not match \((\d+)\!=(\d+)\)$//;
    chomp;
    my ($torrSize, $fileSize) = ($1, $2);
    # cfv prints non-ASCII paths as \uXXXX escapes; turn them back into UTF-8.
    my $tName = Unicode::Escape::unescape($_);
    my $dirname = dirname($tName);
    #File::Path::make_path("$dirname", {verbose => 0, mode => 0755});
    #system('mv', $Prefix.$tName, $tName);
    my $diskSize = -s $tName;
    print STDERR "$tName\n\t$torrSize <= $fileSize,$diskSize\t";
    if ($diskSize != $fileSize) {
        warn "Skip for Reported != Real\n";
    } elsif ($torrSize == $fileSize) {
        warn "Skip for Already done.\n";
    } else {
        warn "truncate -s $torrSize\n";
        system('truncate', '-s', $torrSize, $tName);
    }
}
__END__
[philosophy-raws][Sword Art Online]/CDs/\u6625\u5948\u308b\u306a OVERSKY BD\u4ed8\u521d\u56de\u751f\u7522\u9650\u5b9a\u76e4/BK/01.png : file size does not match (19327223!=19337438)
./torrtruncate.pl SwordArtOnLineA.err
How to encode unicode codepoints in RTF documents using PyRTF.
Some time ago I had to output some nicely formatted reports from a web application, to be usable offline by Windows users. Naturally, I used the aging but still reliable PyRTF module to generate RTF documents with headers, tables, and a consistent style.
As my application users are mostly Norwegians, however, I quickly discovered that the PyRTF module does not handle international characters (i.e. anything outside the ASCII codepoints) at all. There is no unicode support whatsoever (it has been on the TODO list since forever), let alone conversion of unicode codepoints to whatever RTF uses to represent international characters.
Recently, a Stack Overflow question reminded me of how I solved this problem at the time, and clearly this question has come up before. Because some approaches I’ve seen can actually produce incorrect or overly verbose output (including pyrtf-ng), I wanted to explain and expand on my solution to provide a definitive answer to the problem, and also see how my original method fared in terms of speed.
So does RTF handle unicode?
Since PyRTF doesn’t filter the text you add to a document at all, we can just encode unicode strings ourselves. Luckily for me, the Wikipedia entry on RTF has a fairly detailed section on how RTF handles characters outside the ASCII range. Together with the published RTF 1.9.1 specification (PDF) there is plenty of information on how to encode unicode codepoints to RTF control sequences.
There basically are two choices:
- The \'hh control sequence: a backslash and single quote, followed by an 8-bit hexadecimal value. The value is interpreted as a codepoint in a Windows codepage, limiting its use. You can assign different codepages to different fonts, but you still cannot use the full range of unicode in a paragraph.
- The \uN? control sequence: backslash ‘u’ followed by a signed 16-bit integer value in decimal and a placeholder character (represented here by a question mark). The signed 16-bit integer is consistent with how the RTF standard treats numeric parameters, a value between -32768 and 32767. This control sequence can properly represent unicode, at least for the U+0000 through U+FFFF codepoints. It was introduced in revision 1.5 of the RTF spec, in 1997, so it should be widely supported. The placeholder character is meant to be used by readers that do not yet support this escape sequence, and should be the ASCII character closest to the unicode codepoint.
The \uN? format is the easiest to produce, especially if you ignore the replacement character: just set it to ‘?’ at all times; surely most RTF readers support the 1.5 RTF standard by now, as it has been out there for 15 years.
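As a worked example of my own (not taken from the spec): é is U+00E9, decimal 233, and so becomes \u233?, while a codepoint above U+7FFF wraps around to a negative parameter, so U+FF01 (decimal 65281) is written as \u-255?, because 65281 - 65536 = -255.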
Encoding the slow (and incorrect) way
A quick search with Google showed me how pyrtf-ng encodes unicode codepoints:
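(The pyrtf-ng snippet itself did not survive this repost; the sketch below is only an illustrative reconstruction of the approach as described, not the library's actual source.)

def rtf_encode_pyrtfng_style(unistr):
    # every character becomes a \uN? control word, even plain ASCII, and the
    # codepoint is never wrapped into the signed 16-bit range
    return ''.join('\\u%s?' % str(ord(ch)) for ch in unistr)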
Unfortunately, the above snippet does a few things wrong: it uses a control code for every character in the unicode string, producing output that is at least five times as long as the input, and it doesn’t produce negative numbers for codepoints over \u7fff (U+FF01, for example, should come out as \u-255? rather than \u65281?).
A recent Stack Overflow answer improved on this by only encoding characters over \u007f (decimal 127), but it still iterates over every character in the string to do so:
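(Again the quoted snippet is missing here; this sketch merely illustrates the shape of that answer.)

def rtf_encode_so_style(unistr):
    # ASCII passes through untouched, everything else is escaped, but still
    # one character at a time, and without converting to a signed short
    return ''.join(ch if ord(ch) < 128 else '\\u%s?' % str(ord(ch)) for ch in unistr)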
This still outputs a unicode string, because codepoints below 128 are left untouched, and the numbers are not properly converted to signed shorts either. Here is my variation that remedies these things and dispenses with the str() call:
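(The author's variation is also missing from this repost; the following reconstruction wraps values into the signed range and encodes the result to bytes, which is what the surrounding description implies.)

def rtf_encode_chars(unistr):
    # character-by-character baseline: ASCII passes through, everything else
    # becomes a \uN? control word with the value wrapped into a signed short
    return ''.join(
        ch if ord(ch) < 128 else
        '\\u%d?' % (ord(ch) if ord(ch) < 32768 else ord(ch) - 65536)
        for ch in unistr
    ).encode('ascii')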
This feels like rather a waste to me, and must be slow as well. I wanted to see how my own solution stacks up against the naive character-by-character implementation.
Encoding the lazy way
While casting around for my own solution, I also looked into the Python codecs module to come up with ideas on how to do this more efficiently. Of course, the codecs provided by that module are all implemented in C, but the unicode_escape codec did produce output quite close to what I needed for RTF; codepoints between \u0020 and \u007f are left alone, and the rest are encoded to one of the \xhh, \uhhhh or \Uhhhhhhhh 8, 16 or 32-bit escapes (with the exception of \t, \n and \r). Would there be any way to reuse this output?
Well, if you combine this with a bit of re.sub magic, you can in fact produce convincing RTF command sequences:
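(The original code block is gone from this repost; the following is a sketch of the technique as described, written for Python 3, with helper names of my own choosing.)

import re
import struct

# match the \xhh and \uhhhh escapes in unicode_escape output
_escape_re = re.compile(r'\\x([0-9a-f]{2})|\\u([0-9a-f]{4})')

def _to_signed_rtf(match):
    hexval = match.group(1) or match.group(2)
    # pad to two bytes, then reinterpret the value as a big-endian signed short
    signed = struct.unpack('>h', bytes.fromhex(hexval.rjust(4, '0')))[0]
    return '\\u%d?' % signed

def rtf_encode_escape(unistr):
    escaped = unistr.encode('unicode_escape').decode('ascii')
    return _escape_re.sub(_to_signed_rtf, escaped)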
Using the struct module gave me a quick means to re-interpret the hexadecimal notation produced by the unicode_escape codec as a signed short, but I did have to make sure there were two bytes at all times.
Of course, the above trick does not handle newlines, returns or tabs (\n, \r and \t respectively) correctly, nor does it escape existing backslashes yet, but I hoped at the time that this proof of concept would operate several orders of magnitude faster than the naive character-by-character method when dealing with mostly-ASCII input; most of the work is done in C by the codecs and re modules, after all.
So this time around I decided to time these:
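(The timing code and the numbers from the original post are not reproduced here; a minimal harness along these lines, using the sketch functions above and an arbitrary Norwegian sample string of my own, would give comparable measurements.)

import timeit

# hypothetical sample text; the original post used its own snippet
sample = u'Blåbærsyltetøy, rømmegrøt og kjøttkaker på loff.' * 20

for name in ('rtf_encode_chars', 'rtf_encode_escape'):
    elapsed = timeit.timeit('%s(sample)' % name, globals=globals(), number=10000)
    print(name, elapsed)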
Cool, so my hybrid encode plus regular-expression based solution looks to be around 4 times as fast, at least when it comes to simple Norwegian text with a handful of latin-1 characters, my most common case. Note however that I am not handling the RTF escape characters properly, nor are the \n, \r and \t characters handled correctly.
Can I do better?
But I am actually being too clever by half (read: pretty dumb, really); why did I encode to unicode_escape in the first place? I was still in the process of fully understanding the issues and saw a shortcut. My regular expression isn’t particularly clever, I dabbled with the struct module to get my signed short values, and with all this hocus-pocus I lost sight of the goal: to escape certain classes of characters to RTF command codes.
But aren’t regular expressions quite good at finding those classes all by themselves? I may as well use a decent expression that selects what needs to be encoded directly:
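(This is the snippet that eventually became the rtfunicode package; the original block is missing from this repost, so the following sketch of the direct regular-expression approach is mine, and the published code may differ in detail.)

import re

# everything that needs escaping: control characters (including \t, \n and \r),
# the RTF specials \ { }, and anything from U+0080 through U+FFFF
_escape_direct = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')

def _sub_direct(match):
    codepoint = ord(match.group(1))
    # wrap values above 0x7fff into the signed 16-bit range RTF expects
    return '\\u%d?' % (codepoint if codepoint < 32768 else codepoint - 65536)

def rtf_encode(unistr):
    return _escape_direct.sub(_sub_direct, unistr).encode('ascii')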
Suddenly we have a 10 times speed increase! Not only that, I am now also properly escaping the three whitespace characters \n, \r and \t, and as an added bonus the RTF special characters \, { and } are now also being escaped! I call this a result, and a lesson learned.
Perhaps we can translate instead
We could also use a translation table to do my escaping for me. This is simply a dict that maps unicode codepoints to a replacement value. Creating a static dict for all unicode values could be somewhat tricky, requiring either a custom __missing__ method or loading a generated structure on import. Before digging into clever solutions to that, I should perhaps first test the speed of a simple translation table, one that only covers codepoints up to ‘\u00ff’, or latin-1:
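(Another missing snippet; a latin-1-only table along these lines matches the description, though the original may have been built differently.)

# control characters, the RTF specials and U+0080 through U+00FF all map to \uN?
_latin1_table = {cp: u'\\u%d?' % cp for cp in range(0x20)}
_latin1_table.update({cp: u'\\u%d?' % cp for cp in range(0x80, 0x100)})
_latin1_table.update({ord(ch): u'\\u%d?' % ord(ch) for ch in u'\\{}'})

def rtf_encode_translate(unistr):
    return unistr.translate(_latin1_table).encode('ascii')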
Unfortunately, using .translate turns out to be slowing us down considerably. Reducing the table to just a few codepoints doesn’t help either:
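(A guess at what the reduced table looked like: only the handful of characters that always need escaping, kept deliberately incomplete so that it purely measures the overhead of the .translate call itself.)

_small_table = {ord(ch): u'\\u%d?' % ord(ch) for ch in u'\\{}\n\r\t'}

def rtf_encode_translate_small(unistr):
    # non-ASCII characters are left alone on purpose; this is a timing probe only
    return unistr.translate(_small_table)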
So it looks like I might want to avoid using .translate if at all possible.
Worst-case scenario
So far, I’ve compared methods by testing them against some Norwegian text, typical of many European languages with a generous helping of ASCII characters.
To get a more complete picture, I need to test these methods against a worst-case scenario, a UTF-8 encoded test set from a great set of UTF-8 test documents:
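(The original post read a specific UTF-8 demo document and showed its timings, neither of which survives here; a sketch with a placeholder filename would look like this.)

import io
import timeit

# placeholder path: any large, almost entirely non-ASCII UTF-8 test document
with io.open('utf8_test_document.txt', encoding='utf8') as f:
    worst_case = f.read()

for name in ('rtf_encode_chars', 'rtf_encode'):
    elapsed = timeit.timeit('%s(worst_case)' % name, globals=globals(), number=100)
    print(name, elapsed)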
Interesting! So in the worst-case scenario, where the vast majority (99.8%) of the text requires encoding, the character-by-character method is actually a little faster again! But this also means that for most cases, where you insert shorter text snippets into an RTF document, and where a far larger percentage of characters do not need escaping, the regular expression method will beat the character-by-character method hands down.
So what about non-BMP Unicode?
So far I’ve focused only on characters within the BMP. You can apparently use a UTF-16 surrogate pair for codepoints beyond the BMP, at least according to Wikipedia. However, the RTF specification itself is silent on this, and no endianness is documented anywhere that I can find. The Microsoft platform uses UTF-16-LE throughout, so perhaps RTF readers support little-endian surrogate pairs too.
However, I cannot at this time be bothered to extend my encoder to support such codepoints. On a UCS-2-compiled Python there is a happy coincidence that codepoints beyond the BMP are treated mostly like UTF-16 surrogate pairs anyway, so they are sort-of supported by this method:
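(The demonstration is missing from this repost; given the value quoted below, the example must have used a codepoint in the U+10000 to U+103FF range, whose high surrogate is 0xD800. Feeding the surrogate pair for U+10000 through the regex sketch above reproduces the effect.)

# on a narrow (UCS-2) Python 2 build, u'\U00010000' is stored as this pair
print(rtf_encode(u'\ud800\udc00'))
# -> b'\\u-10240?\\u-9216?'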
Note, however, that the first value is -10240, or 0xd800 in unsigned hexadecimal, making this a big-endian encoded surrogate pair. Presumably on Windows that’ll encode the other way around.
On a UCS-4 platform the codepoint will be ignored by the regular expression and the .encode('ascii') call will raise a UnicodeEncodeError instead.
I am calling this ‘unsupported’ and calling it a day. Suggestions for implementing this in a neat and performant way are welcome!
Off to PyPI we go
I am quite happy with the simple regular expression method, and prefer it over the character-by-character loop.
So I packaged up my regular expression method as a handy module on PyPI, complete with Python 2 and 3 support and a minuscule test suite; the source code is available on GitHub.
The module in fact registers a new codec, called rtfunicode, so after you import the package all you need do is use the new codec in the .encode() method:
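(The usage snippet is missing here; a minimal example, with the expected output as a comment, would look roughly like this; the package's own README may differ.)

import rtfunicode  # importing the package registers the codec

encoded = u'Blåbær'.encode('rtfunicode')
print(encoded)
# -> b'Bl\\u229?b\\u230?r'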
Hopefully it comes in handy for others. Feedback is most welcome, as are patches!