How to convert fixed-position PDF documents to free-flowing text (TXT, RTF, DOC, etc.) with pdftotext on Linux
Have you ever received a document saved in PDF format that you wanted to edit? You convert it to text with a utility, but every line ends with a newline, so you have to remove each one by hand. Really annoying, right?
I thought I would record how I did it even though it's unlikely that anyone will find this file through a search engine. Imagine if everyone who found solutions like this published their work!
First, take the PDF document and convert it to formatted text with a single-byte encoding. The text document must be in a format where each paragraph ends with two newline characters in a row, or else this won't work. Then convert all "\n" characters to = characters, or to some other character that isn't used in the document. Use sed to convert every single instance of = into a space, but leave runs of two or more, like ==, alone. Then convert = back into "\n". Now the document is free-flowing text. Open it in a word processor and save it as any file type you like. You'll have to manually recreate bold headings and any other text styles that were lost in the conversion to text. Here are the commands for a GNU/Linux system:
$ pdftotext -enc Latin1 -eol unix -layout file.pdf file.converted.pdf.txt
$ cat file.converted.pdf.txt | tr "\n" "=" \
    | sed 's/=\{1,\}/\x0&\x1/g; s/\x0=\x1/ /g; s/[\x0\x1]//g;' \
    | tr "=" "\n" > file_flowing_text.txt
$ Ted file_flowing_text.txt   # Open in a word processor and save as anything you like.
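To see what the tr/sed/tr step does without needing a PDF, here is a small sketch that wraps the same pipeline in a shell function and feeds it a short sample. The function name reflow and the sample text are made up for illustration. It assumes GNU sed (for the \x0 and \x1 escapes) and that = does not appear anywhere in the input.

```shell
#!/bin/sh
# reflow: hypothetical helper name, not part of any tool.
# Reads hard-wrapped text on stdin and writes free-flowing text to stdout.
# Assumptions: GNU sed, and "=" never occurs in the input.
reflow() {
    # 1. Turn every newline into "=" so the whole text is one long line.
    tr '\n' '=' \
        | sed '
            # 2. Wrap each run of one or more "=" in NUL ... \x01 markers.
            s/=\{1,\}/\x0&\x1/g
            # 3. A marked run of exactly one "=" was a line break inside
            #    a paragraph: replace it with a space.
            s/\x0=\x1/ /g
            # 4. Drop the remaining markers; runs such as "==" (paragraph
            #    breaks) are left as they were.
            s/[\x0\x1]//g
        ' \
        | tr '=' '\n'   # 5. Turn the surviving "=" back into newlines.
}

# Sample: one paragraph wrapped over two lines, a blank line, a second paragraph.
printf 'first line\nsecond line\n\nnext paragraph\n' | reflow
```

The two wrapped lines should come out joined by a space, with the blank line between the paragraphs kept. On a real file, the same function could be used as: reflow < file.converted.pdf.txt > file_flowing_text.txt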
You can now edit the file and read it on any screen width! I hope you enjoyed this and found it useful.