RangeReader should not require knowing file length #3239

Open

echeipesh opened this issue May 4, 2020 · 1 comment

@echeipesh (Contributor)

Currently, RangeReader implementations clip read requests to the known total length of the file:

def readRange(start: Long, length: Int): Array[Byte] =
  readClippedRange(start, clipToSize(start, length))
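
The clipping helper is what forces the length requirement; it presumably looks something like this (a sketch, not the exact GeoTrellis source):

// Sketch of the clipping logic: the requested length is trimmed so that
// start + length never runs past totalLength. This is why every
// RangeReader implementation must be able to answer totalLength.
protected def clipToSize(start: Long, length: Int): Int =
  if (start + length <= totalLength) length
  else (totalLength - start).toInt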

Obtaining that total length typically requires a HEAD request. This is sometimes problematic on S3 because public requester-pays buckets often only allow GET and not HEAD requests.

We tried to work around this restriction in #3025:

val responseStream = client.getObject(request)
val length = responseStream.response.contentLength
responseStream.abort()
responseStream.close()
length

by opening a GetObject stream and reading the content length from the response metadata, then calling abort/close to avoid actually pulling down any data.

However, some combinations of the AWS SDK and the Apache HTTP client can produce the following exception:

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 2,196,257,163; received: 0)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
        at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
        at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.close(ChecksumValidatingInputStream.java:162)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.core.io.SdkFilterInputStream.close(SdkFilterInputStream.java:83)
        at geotrellis.store.s3.util.S3RangeReader.totalLength$lzycompute(S3RangeReader.scala:59)
        at geotrellis.store.s3.util.S3RangeReader.totalLength(S3RangeReader.scala:44)
        at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:109)
        at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:130)
        at geotrellis.raster.io.geotiff.reader.GeoTiffInfo$.read(GeoTiffInfo.scala:127)

Issue

This issue is to remove all usage of the RangeReader.totalLength method and deprecate it for removal in the next major version.

Ultimately we do not need to know the total length of the file. The RangeReader interface is used exclusively to read GeoTiffs (and possibly similar formats). The first few KB of such a file form a header that contains, or points to, a table of segment offsets. Consequently, when we read GeoTiff segments, the read ranges are always derived from that offset table, so we never have a chance to walk off the end of the file. The one exception is the initial header read: in theory the file should never be smaller than that request, but we should double-check how the GeoTiff reader behaves if it is.
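
For illustration, a minimal sketch of an unclipped S3 read, assuming the AWS SDK v2 S3Client and commons-io; readRangeUnclipped, bucket and key are hypothetical names, not existing GeoTrellis API. A ranged GET whose end runs past the object is truncated by S3 (provided the start is within the object), so the caller simply gets back fewer bytes than it asked for, with no totalLength and no HEAD request involved:

import org.apache.commons.io.IOUtils
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetObjectRequest

// Hypothetical unclipped range read: no totalLength, no HEAD request.
// If the range end runs past the object, S3 returns only the bytes that exist.
def readRangeUnclipped(client: S3Client, bucket: String, key: String,
                       start: Long, length: Int): Array[Byte] = {
  val request = GetObjectRequest.builder()
    .bucket(bucket)
    .key(key)
    .range(s"bytes=${start}-${start + length - 1}")
    .build()
  val is = client.getObject(request)
  try IOUtils.toByteArray(is) finally is.close()
}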

As a side benefit, we remove one extra HEAD/GET request per file and avoid paying the AWS request costs for it.

@pomadchin (Member) commented May 4, 2020

I'm wondering, would it throw on line 75?
In other words, is it the combination of close and abort, or will it still throw even without the abort?

def readClippedRange(start: Long, length: Int): Array[Byte] = {
  val getRequest =
    GetObjectRequest
      .builder()
      .bucket(request.bucket)
      .key(request.key)
      .range(s"bytes=${start}-${start + length}")
      .build()

  val is = client.getObject(getRequest)
  val bytes = IOUtils.toByteArray(is)
  is.close()

  bytes
}

Earlier we had no problems with it, since everyone had a dependency on the same SDK and the same org.apache.http library.
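
One rough way to check that question (a sketch only, reusing the same hypothetical client and request values as the snippets above) would be to fetch the metadata through both paths and see which one raises the ConnectionClosedException:

// close() only: lets the Apache connection manager try to drain the body.
def lengthCloseOnly(): Long = {
  val rs = client.getObject(request)          // ResponseInputStream[GetObjectResponse]
  val length: Long = rs.response.contentLength
  rs.close()
  length
}

// abort() followed by close(), as in #3025.
def lengthAbortThenClose(): Long = {
  val rs = client.getObject(request)
  val length: Long = rs.response.contentLength
  rs.abort()
  rs.close()
  length
}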
