Thursday, December 13, 2007

Stripping line breaks with Python

I've been snatching lots of files from Project Gutenberg lately. Gutenberg is a great resource!

The problem is that the files hard-code every line break, not just the paragraph breaks. That messes up the display on a PDA e-book reader (my Nintendo DS, actually) and the output from a printer. I did a quick search for a small script to strip them but couldn't find one, so I wrote my own.

The program takes an input filename and an output filename. This adds a bit more to the code, but this way, I can use it in a batch program that will loop across several files for conversion.

Code follows.



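#!/usr/bin/python

import sys

if len(sys.argv) < 2:
    print("Oops, need a filename to open.")
    sys.exit(1)
elif len(sys.argv) < 3:
    print("Oops, need a filename to write to.")
    sys.exit(1)

filename1 = sys.argv[1]
filename2 = sys.argv[2]

fp1 = open(filename1, "r")
fp2 = open(filename2, "w")

while 1:
    line = fp1.readline()
    if line == "":
        # readline() returns an empty string only at end of file
        break
    # Drop the line ending, whether the file uses "\r\n" or "\n"
    text = line.rstrip("\r\n")
    if text == "":
        # A blank line marks a paragraph break
        fp2.write("\n\n")
    else:
        # Join this line to the rest of its paragraph
        fp2.write(text + " ")

fp1.close()
fp2.close()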
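
The batch use mentioned earlier could look something like the sketch below. It follows the same idea as the script above but wraps it in a function so it can be called once per file. The names unwrap_file and unwrap_directory, the "*.txt" pattern, and the directory layout are all assumptions made for illustration. It also assumes the files can be read with the platform's default text encoding, which may not hold for every Gutenberg file.

```python
import glob
import os


def unwrap_file(src_path, dst_path):
    """Join hard-wrapped lines into paragraphs, keeping blank-line breaks."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # Drop the line ending, whether it is "\r\n" or "\n"
            text = line.rstrip("\r\n")
            if text == "":
                # A blank line marks a paragraph break
                dst.write("\n\n")
            else:
                # Join this line to the rest of its paragraph
                dst.write(text + " ")


def unwrap_directory(in_dir, out_dir):
    """Convert every .txt file in in_dir and write the results to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    converted = []
    for src_path in sorted(glob.glob(os.path.join(in_dir, "*.txt"))):
        dst_path = os.path.join(out_dir, os.path.basename(src_path))
        unwrap_file(src_path, dst_path)
        converted.append(dst_path)
    return converted
```

Calling unwrap_directory("gutenberg", "unwrapped") would then convert a whole folder in one go (both folder names are placeholders). Writing to a separate output directory leaves the original downloads untouched.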



Of course, if anyone has something shorter out there, I wouldn't mind using that instead.

3 comments:

  1. You can do this with sed, I have a hacky two line implementation:

    cat input.txt | sed 'N;s/\n$/PARDELIMITER/;P;D' | sed -e :a -e '$!N;s/\n/ /;ta' -e 'P;D' | sed 's/PARDELIMITER/\
    /g' > output.txt

    Basically, it takes the file, replaces two consecutive newlines with PARDELIMITER, pipes that into a second sed script that removes all newline characters, then pipes *that* into a third sed script that replaces PARDELIMITER with a newline character.

    Ok there's probably a cleaner way to do it.

  2. Thanks, Roy. Made my head spin, though.

  3. Well, it's an illustration of how "shorter" does not necessarily mean "better" :-)

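
The three steps described in the first comment (mark the paragraph breaks, flatten the remaining newlines, restore the breaks) can also be sketched in Python. This is only an illustration of that idea, not a translation of the sed pipeline. The function name unwrap_text is made up, it restores each break as a blank line ("\n\n") to match the script in the post rather than the single newline the sed version emits, and it assumes the text never contains the literal word PARDELIMITER.

```python
import re

PLACEHOLDER = "PARDELIMITER"  # same marker name the sed pipeline uses


def unwrap_text(text):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: mark each run of two or more newlines as a paragraph break
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: every remaining newline is a hard wrap, so turn it into a space
    text = text.replace("\n", " ")
    # Step 3: restore the paragraph breaks as blank lines
    return text.replace(PLACEHOLDER, "\n\n")
```

Because it works on a whole string, the file has to fit in memory, which should not be an issue for a typical plain-text book.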