July 25, 2005

Converting LaTeX documents to Word (RTF) format

One of the slowest parts of collaborating on a manuscript is the tedious task of converting between file types when the other party uses a different word processor or even a different operating system. Working with my mentor (who lives only in the MS Windows 95 world of yesteryear) was quite a task in itself, to say the least, especially since I'm using LaTeX and BibTeX (pybliographer) on Linux for my manuscript. In an ideal world, you and your mentor would both use LaTeX and a version control system such as CVS to deal with the "on-the-fly" changes and revisions. But in the real world, you have a mentor who likes it "just the way it is." Fortunately for me, I was able to quickly hack up some scripts that handle the conversion from LaTeX and BibTeX to MS Word with only minor manual editing.

The keys to managing and creating these scripts are: 1) the Makefile and 2) sed scripts. The Makefile lets you assemble a smattering of shell commands into one methodical unit, and the sed scripts do the actual editing of the resulting text file (the RTF file) to fix a few annoyances or formatting flaws that latex2rtf either misses or just isn't built to handle.

Let's get down to the nitty-gritty. The following is my Makefile:

# My makefile to automate compiling all necessary docs
# (each command line under a target must start with a tab character)

TARGET = paper

all: ps pdf html php

ps: $(TARGET).tex
	latex $(TARGET).tex
	bibtex $(TARGET)
	latex $(TARGET).tex
	bibtex $(TARGET)
	latex $(TARGET).tex
	dvips $(TARGET).dvi -o $(TARGET).ps

pdf: $(TARGET).tex
	pdflatex $(TARGET).tex
	bibtex $(TARGET)
	pdflatex $(TARGET).tex
	bibtex $(TARGET)
	pdflatex $(TARGET).tex

html: $(TARGET).tex
	tth -e2 -n1 -w2 -V $(TARGET).tex
	bibtex $(TARGET)
	tth -e2 -n1 -w2 -V $(TARGET).tex
	bibtex $(TARGET)
	tth -e2 -n1 -w2 -V $(TARGET).tex
	./tthpostproc.sed $(TARGET).html > index.html

quickhtml: $(TARGET).tex
	tth -e2 -n1 -w2 -V $(TARGET).tex
	./tthpostproc.sed $(TARGET).html > index.html

php: header.php footer.php $(TARGET).tex
	tth -r -e2 -n1 -w2 -V $(TARGET).tex
	bibtex $(TARGET)
	tth -r -e2 -n1 -w2 -V $(TARGET).tex
	bibtex $(TARGET)
	tth -r -e2 -n1 -w2 -V $(TARGET).tex
	cat header.php > index.php
	./tthpostprocraw.sed $(TARGET).html >> index.php
	cat footer.php >> index.php

rtf: $(TARGET).tex
	latex2rtf -o temp.rtf $(TARGET).tex
	./rtfpostproc.sed temp.rtf > temp2.rtf
	./rtfpostproc2.sed temp2.rtf > $(TARGET).rtf
	rm temp.rtf temp2.rtf

tidy:
	rm -f *~

clean:
	rm -f *~
	rm -f *.aux *.log *.toc *.lof *.lot *.bbl *.blg *.out *.brf temp.rtf

Just follow the above template and you can command your Makefile to do whatever you wish. As you can see, I have different make targets to produce a PostScript file (using the traditional latex and dvips sequence), a PDF file (using pdflatex), HTML (using tth, my favorite LaTeX-to-HTML converter), or RTF (using latex2rtf). There are a few other targets to clean up unneeded files or to create files for PHP to serve up on a dynamic webpage. If you haven't figured it out already, put the Makefile in the directory where you keep the LaTeX files you wish to typeset. Creating the ps, pdf or rtf file is then a simple "make ps", "make pdf" or "make rtf" at the shell prompt, respectively.
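The rtf target above chains its two sed passes through temporary files. Since sed accepts more than one -f option and applies the scripts in the order given, the same result can probably be had in a single pass. Here is a minimal sketch of that idea, not what my Makefile actually does: the two stand-in script files each hold one rule borrowed from my real scripts, and the sample line is a hypothetical stand-in for what latex2rtf might write.

```shell
#!/bin/sh
# Sketch: run two sed scripts in one pass instead of chaining temp files.
# File names and the sample line are illustrative assumptions.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Stand-in for rtfpostproc.sed: turn a raised citation into a superscript,
# dropping the space in front of it.
printf '%s\n' 's/ {\\up7\\fs18/{\\super/g' > "$workdir/first.sed"

# Stand-in for rtfpostproc2.sed: put the space back before a superscript 125.
printf '%s\n' 's/{\\super 125/ {\\super 125/g' > "$workdir/second.sed"

# Stand-in for the temp.rtf that latex2rtf would produce.
printf '%s\n' 'labelled with {\up7\fs18 125}I' > "$workdir/temp.rtf"

# Both scripts are applied to every line, first.sed before second.sed.
result=$(sed -f "$workdir/first.sed" -f "$workdir/second.sed" "$workdir/temp.rtf")
printf '%s\n' "$result"
```

Because every rule in my scripts is a plain line-by-line substitution, running them back to back on each line should be equivalent to running them in two separate passes. In the Makefile that would mean replacing the two sed lines with something like "sed -f rtfpostproc.sed -f rtfpostproc2.sed temp.rtf > $(TARGET).rtf", which I have not tested against a full manuscript.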

If you are trying to compile a LaTeX document to send to someone using MS Word, you have two options: you can create an HTML file using tth, or you can create an RTF file using latex2rtf. Tth worked well for me in the initial stages of preparing the manuscript, but certain problems came up that were insurmountable and I had to switch to latex2rtf. The biggest obstacle was that, due to incomplete natbib support, I could not change the citation formatting from regular citations in parentheses or brackets to small superscript type. Latex2rtf does not support natbib fully either, but at least it typeset the citations closely enough for me to finish the job with a few minor changes via the sed scripts.

The biggest hurdle to overcome was typesetting the citations to fit the requirements of the journal to which we are submitting the manuscript. Since RTF is human-readable in a text editor, it is also amenable to changes via sed. The trick is to use regular expressions in sed. Do your best to learn regex well in sed and it will take you far in automating this process. Here are the two sed scripts that I used in the "make rtf" process:

First script
#!/bin/sed -f
s/{\\field{\\\*\\fldinst{\\lang1024 REF BM\([a-zA-Z]*\) \\\\\* MERGEFORMAT }}{\\fldrslt{\([0-9]\)}}}/\2/g
s/ {\\up7\\fs18\ \([0-9-\,]*\)}\([\.\,]\)/\2{\\super \1}/g
s/ {\\up7\\fs18/{\\super/g
s/{\\up7\\fs18/{\\super/g
s/{\\dn7\\fs18/{\\sub/g
s/\\-{/{\\sub /g
s/fi-450 \[/fi-450/g
s/]\\tab/.\\tab/g
s/Times;/Times New Roman;/g

Second script
#!/bin/sed -f
s/{\\super 125/\ {\\super 125/g

Study these scripts closely and you will see that all they really do is remove a space here, add a space there, or transpose a period and a citation where latex2rtf didn't get things right the first time. The only major change I made is in the first line of the first script. That line strips out the cross-reference fields that MS Word did not understand well; left in place, they produced "Error! Reference source not found." for each reference I made to a Figure or Table in the LaTeX document.
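To make the two most important rules easier to follow, here is a small sketch that runs them on sample input. The sample strings (including the label "figone") are hypothetical stand-ins for what latex2rtf might emit, not excerpts from my manuscript. The first rule is the one from my script; the second is the same citation rule with the bracket expressions written more simply as [0-9,-] and [.,], which is my assumption about the most portable spelling.

```shell
#!/bin/sh
# Illustrative only: the sample strings are hypothetical latex2rtf output.

# 1) Replace a Word REF field with the plain number it would display.
field='see Figure {\field{\*\fldinst{\lang1024 REF BMfigone \\* MERGEFORMAT }}{\fldrslt{2}}} for details'
plain=$(printf '%s\n' "$field" |
  sed 's/{\\field{\\\*\\fldinst{\\lang1024 REF BM\([a-zA-Z]*\) \\\\\* MERGEFORMAT }}{\\fldrslt{\([0-9]\)}}}/\2/g')
printf '%s\n' "$plain"

# 2) Move a raised citation behind the punctuation mark that follows it
#    and turn it into a true superscript.
cited='as shown previously {\up7\fs18 3,5-7}. Next sentence.'
moved=$(printf '%s\n' "$cited" |
  sed 's/ {\\up7\\fs18 \([0-9,-]*\)}\([.,]\)/\2{\\super \1}/g')
printf '%s\n' "$moved"
```

The first command should print "see Figure 2 for details" and the second "as shown previously.{\super 3,5-7} Next sentence." Note that the field rule as written captures only a single digit, so a reference to Figure 10 or higher would be left untouched; [0-9][0-9]* in place of [0-9] would cover those cases.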

Once you've changed your RTF file to the way you like it, you can import it into MS Word (or, even better, OpenOffice Writer) and do the manual edits as you please, such as changing the margins and line spacing.

Just a side note about latex2rtf. Apparently you can define your own substitutions that latex2rtf will use while generating the RTF document. They're defined in files that end with the ".cfg" extension. I have two such files in the same directory as my LaTeX files: direct.cfg and natbib.cfg. They define the substitutions I need for my particular manuscript format. For some reason not all of them worked, and I still cannot for the life of me figure out why simple substitutions like superscript or subscript failed. Even so, sed saved the day in the end, automating changes that would have had me cursing the air I breathed had I made them manually in MS Word. The files direct.cfg and natbib.cfg looked like the following:

direct.cfg

\superscript{++},{\super ++}.
\raisebox{-1ex}{\scriptsize{50}},{\sub 50}.
\subscript{50},{\sub 50}.
\subscript{BaL},{\sub BaL}.
\subscript{2044},{\sub 2044}.

natbib.cfg
\newcommand{\bibstyle@science}{\bibpunct[, ]{(}{)}{,}{n}{}{,}%
\gdef\NAT@biblabelnum##1{##1.}}
\newcommand{\bibstyle@jvirol}{\bibpunct[, ]{(}{)}{,}{n}{}{,}%
\gdef\NAT@biblabelnum##1{##1.}}
\newcommand{\bibstyle@arhr}{\bibpunct[, ]{}{}{,}{s}{}{,}%
\gdef\NAT@biblabelnum##1{##1.}}
\newcommand{\bibstyle@unsrtabnat}{\bibpunct[, ]{(}{)}{,}{n}{}{,}%
\gdef\NAT@biblabelnum##1{##1.}}
\newcommand{\bibstyle@ecology}{\bibpunct[, ]{(}{)}{,}{a}{}{,}}

\renewcommand\@biblabel[1]{#1.}
\renewcommand{\bibnumfmt}[1]{{#1}.}

Just because your mentor uses an ancient version of MS Word doesn't mean that you have to. As long as you're comfortable hacking up a few scripts to save some time along the way, you'll quickly find that you can have your cake and eat it too. That means you can keep your docs in LaTeX format, ready to import into your dissertation or thesis, and still convert them quickly into MS Word format so that others can read and edit to their heart's content.

If you've found some way to make my life even easier (i.e. an even better way to convert LaTeX docs to MS Word without much manual editing), then by all means, please contact me and let me know about it. I will then be able to share it with the rest of the world in this space. In the meantime, I hope the information I've provided here will help those of you who are in the middle of, or just starting, your PhD dissertation and manuscript writing, using tools that are open source and freely available, the way it should be.

Posted by johnvu at July 25, 2005 11:04 PM