October 14, 2004

More Google Desktop goodies

Xpdf: Download. If Google didn't implement it, there's always a kludge. As I mentioned in my previous post, I wish Google Desktop indexed the content of my PDF files instead of just their file names. Since it doesn't do that automatically, you can convert each PDF to text and keep the text file in the same directory as the PDF. Give the text file the same base name as the PDF so that you'll know which PDF a query hit belongs to.

The tool I use to convert PDF to text is a command-line tool, pdftotext.exe, found in the Xpdf package for Win32 (link located above). I've written a very simple batch file that converts all the PDFs in a directory to text and skips the ones that have already been converted:

@echo off
for %%a in (*.pdf) do if not exist "%%a.txt" pdftotext "%%a" "%%a.txt"

I haven't figured out how to strip the file extension so that the file is named *.txt instead of *.pdf.txt, but that's a minor annoyance and doesn't detract from the usefulness of the conversion. Normally I'd ask you to send me a comment on how to get it done, but I've turned off comments on my blog indefinitely because I just didn't have the time to delete all the comment spam. Eventually, I may write MySQL query scripts to delete comment spam, but I digress.
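Back to the extension problem: the for-variable modifiers in cmd.exe look like the way to handle it. %%~na expands to the file name without its extension, so report.pdf would become report.txt. A minimal sketch, assuming a cmd.exe shell (Windows NT/2000/XP rather than command.com on Windows 9x) and pdftotext.exe somewhere on the PATH:

```batch
@echo off
rem Convert every PDF in the current directory to text.
rem %%~na is the file name without its extension (report.pdf -> report).
rem The quotes keep file names with spaces in one piece.
for %%a in (*.pdf) do if not exist "%%~na.txt" pdftotext "%%a" "%%~na.txt"
```

The one thing to watch for is a directory that already holds an unrelated text file with the same base name as a PDF; this version would treat that PDF as already converted and skip it.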

If you deal with a lot of PDFs and want their content indexed by Google Desktop, this is a quick workaround until Google builds PDF indexing into the tool itself (if it's on their to-do list; I haven't looked). Once that happens, you can always delete the text files.
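When that day comes, the cleanup could be one more loop. This is only a sketch, and it assumes the text files still carry the *.pdf.txt names produced by the batch file above; it deletes a text file only when a PDF of the same name sits next to it, so other .txt files in the directory are left alone:

```batch
@echo off
rem Delete the companion text file of each PDF in the current directory,
rem e.g. report.pdf.txt next to report.pdf. Nothing else is touched.
for %%a in (*.pdf) do if exist "%%a.txt" del "%%a.txt"
```

Run it from the directory that holds the PDFs, the same way as the conversion batch file.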

Posted by johnvu at October 14, 2004 11:20 PM