java: Why is BufferedReader readLine reading past EOF

Tuesday, June 14, 2016

I have a very large file (~6GB) of fixed-width text records separated by \r\n, so I'm using a BufferedReader to read it line by line. The process can be interrupted or stopped; when that happens, it uses a checkpoint, "lastProcessedLineNbr", to fast-forward to the correct place and resume reading. This is how the reader is initialized:

private void initializeBufferedReader(Integer lastProcessedLineNbr) throws IOException {
    reader = new BufferedReader(new InputStreamReader(getInputStream(), "UTF-8"));
    // No checkpoint yet: start from the first line.
    if (lastProcessedLineNbr == null) {
        lastProcessedLineNbr = 0;
    }

    // Fast-forward past the lines that were already processed.
    for (int i = 0; i < lastProcessedLineNbr; i++) {
        reader.readLine();
    }
    currentLineNumber = lastProcessedLineNbr;
}

This seems to work fine, and I read and process the data in this method:

public Object readItem() throws Exception {
    // readLine() returns null once the end of the input is reached.
    if ((currentLine = reader.readLine()) == null) {
        return null;
    }
    currentLineNumber++;
    return parse(currentLine);
}

Again, everything works until I reach the last line of the file, where the readLine() call in readItem() throws an error:

17:06:49,980 ERROR [org.jberet] (Batch Thread - 1) JBERET000007: Failed to run job ProdFileRead, parse, org.jberet.job.model.Chunk@3965dcc8: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
    at java.lang.StringBuffer.append(StringBuffer.java:369)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.rational.batch.reader.TextLineReader.readItem(TextLineReader.java:55)

Curiously, it seems to be reading past the end of the file and allocating so much memory that it runs out. I looked at the contents of the file using Cygwin: "tail file.txt" printed the expected 10 lines in the console, but "tail file.txt > output.txt" produced an output.txt of about 1.8GB, far larger than the 10 lines I expected. So Cygwin seems to be doing the same thing. As far as I can tell there is no special EOF character; the data just ends abruptly at the last byte.
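
One way to check what actually sits at the end of the file is to dump its last few bytes in hex, since bytes such as NUL would be invisible in a console but still counted when redirected to a file. The following is only a diagnostic sketch: the class name TailInspector, its helper methods, and the 64-byte sample size are all hypothetical, and it assumes (without confirmation from the output above) that the file may end in a long run of bytes with no line terminator.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical diagnostic helper, not part of the original reader class.
class TailInspector {

    /** Reads at most maxBytes from the end of the file at the given path. */
    static byte[] readTail(String path, int maxBytes) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int count = (int) Math.min(length, (long) maxBytes);
            byte[] tail = new byte[count];
            file.seek(length - count);
            file.readFully(tail);
            return tail;
        }
    }

    /** Number of bytes after the last LF in data, or data.length if there is none. */
    static int bytesAfterLastNewline(byte[] data) {
        for (int i = data.length - 1; i >= 0; i--) {
            if (data[i] == '\n') {
                return data.length - 1 - i;
            }
        }
        return data.length;
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the file; 64 is an arbitrary sample size.
        byte[] tail = readTail(args[0], 64);
        System.out.println("bytes after last LF (within sample): " + bytesAfterLastNewline(tail));
        System.out.println(toHex(tail));
    }
}
```

If the sample contains no "0d 0a" pair at all, the last "line" is longer than the sample, which would be consistent with readLine() trying to build one enormous string.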

Does anyone have an idea how I can get this working? I'm thinking I could resort to counting the number of bytes read until I reach the full size of the file, but I was hoping there was a better way.
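
Since the records are fixed-width, a different guard from byte counting would be to cap how long a single line may grow, so that a runaway final "line" fails fast instead of exhausting memory. This is only a sketch under assumptions: BoundedLineReader and its maxChars parameter are hypothetical names, and unlike BufferedReader.readLine() it treats only \n and \r\n as line terminators, not a lone \r.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: a line reader with an upper bound on line length.
class BoundedLineReader {

    /**
     * Reads one line ended by LF, CR LF, or end of input, and throws if the
     * line content is longer than maxChars. Returns null at end of input.
     * Reads one char at a time, so the given reader should be buffered.
     */
    static String readLine(Reader reader, int maxChars) throws IOException {
        int c = reader.read();
        if (c == -1) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        while (c != -1 && c != '\n') {
            // One extra char is tolerated here so a trailing CR is not counted.
            if (line.length() > maxChars) {
                throw new IOException("Line exceeds " + maxChars + " characters");
            }
            line.append((char) c);
            c = reader.read();
        }
        int len = line.length();
        if (len > 0 && line.charAt(len - 1) == '\r') {
            line.setLength(len - 1); // drop the CR of a CR LF pair
        }
        if (line.length() > maxChars) {
            throw new IOException("Line exceeds " + maxChars + " characters");
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two well-formed 4-char records followed by unterminated NUL bytes.
        Reader in = new StringReader("AAAA\r\nBBBB\r\n\0\0\0\0\0\0");
        System.out.println(readLine(in, 4));
        System.out.println(readLine(in, 4));
        try {
            readLine(in, 4);
        } catch (IOException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

In readItem(), the call reader.readLine() could then become BoundedLineReader.readLine(reader, recordWidth), where recordWidth stands for the known record length; whether to treat the resulting exception as end of data or as a corrupt file is a separate decision.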

java
