RangeReader should not require knowing file length #3239

Open

echeipesh opened this issue May 4, 2020 · 1 comment

@echeipesh (Contributor)

Currently, RangeReader implementations clip read requests to the known total length of the file:

def readRange(start: Long, length: Int): Array[Byte] =
  readClippedRange(start, clipToSize(start, length))
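
The clipping helper is what forces the length requirement; it presumably looks something like this (a sketch, not the exact GeoTrellis source):

// Sketch of the clipping logic: the requested length is trimmed so that
// start + length never runs past totalLength. This is why every
// RangeReader implementation must be able to answer totalLength.
protected def clipToSize(start: Long, length: Int): Int =
  if (start + length <= totalLength) length
  else (totalLength - start).toInt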

Obtaining that total length typically requires a HEAD request. This is sometimes problematic on S3 because public requester-pays buckets often only allow GET and not HEAD requests.

We tried to work around this restriction in #3025:

val responseStream = client.getObject(request)
val length = responseStream.response.contentLength
responseStream.abort()
responseStream.close()
length

by opening a GetObject stream and reading the content length from the response metadata, then calling abort/close to avoid actually pulling down any data.

However, some combinations of the AWS SDK and the Apache HTTP client can produce the following exception:

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 2,196,257,163; received: 0)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
        at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
        at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
        at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream.close(ChecksumValidatingInputStream.java:162)
        at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
        at software.amazon.awssdk.core.io.SdkFilterInputStream.close(SdkFilterInputStream.java:83)
        at geotrellis.store.s3.util.S3RangeReader.totalLength$lzycompute(S3RangeReader.scala:59)
        at geotrellis.store.s3.util.S3RangeReader.totalLength(S3RangeReader.scala:44)
        at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:109)
        at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:130)
        at geotrellis.raster.io.geotiff.reader.GeoTiffInfo$.read(GeoTiffInfo.scala:127)

Issue

This issue is to remove all usage of the RangeReader.totalLength method and deprecate it for removal in the next major version.

Ultimately we do not need to know the total length of the file. The RangeReader interface is used exclusively to read GeoTiffs (and possibly similar formats). The first few KB of such a file form a header that contains, or points to, a table of segment offsets. Consequently, when we read GeoTiff segments, the read ranges are always derived from that offset table, so we never have a chance to walk off the end of the file. The one exception is the initial header read: in theory the file should never be smaller than that request, but we should double-check how the GeoTiff reader behaves if it is.
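
For illustration, a minimal sketch of an unclipped S3 read, assuming the AWS SDK v2 S3Client and commons-io; readRangeUnclipped, bucket and key are hypothetical names, not existing GeoTrellis API. A ranged GET whose end runs past the object is truncated by S3 (provided the start is within the object), so the caller simply gets back fewer bytes than it asked for, with no totalLength and no HEAD request involved:

import org.apache.commons.io.IOUtils
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetObjectRequest

// Hypothetical unclipped range read: no totalLength, no HEAD request.
// If the range end runs past the object, S3 returns only the bytes that exist.
def readRangeUnclipped(client: S3Client, bucket: String, key: String,
                       start: Long, length: Int): Array[Byte] = {
  val request = GetObjectRequest.builder()
    .bucket(bucket)
    .key(key)
    .range(s"bytes=${start}-${start + length - 1}")
    .build()
  val is = client.getObject(request)
  try IOUtils.toByteArray(is) finally is.close()
}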

As a side benefit, we remove one extra HEAD/GET request per file and avoid paying the AWS request costs for it.

@pomadchin (Member) commented May 4, 2020

I'm wondering, would it throw on line 75?
In other words, is it the combination of close and abort, or will it still throw even without the abort?

def readClippedRange(start: Long, length: Int): Array[Byte] = {
  val getRequest =
    GetObjectRequest
      .builder()
      .bucket(request.bucket)
      .key(request.key)
      .range(s"bytes=${start}-${start + length}")
      .build()

  val is = client.getObject(getRequest)
  val bytes = IOUtils.toByteArray(is)
  is.close()

  bytes
}

Earlier we had no problems with it, since everyone had a dependency on the same SDK and the same org.apache.http library.
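
One rough way to check that question (a sketch only, reusing the same hypothetical client and request values as the snippets above) would be to fetch the metadata through both paths and see which one raises the ConnectionClosedException:

// close() only: lets the Apache connection manager try to drain the body.
def lengthCloseOnly(): Long = {
  val rs = client.getObject(request)          // ResponseInputStream[GetObjectResponse]
  val length: Long = rs.response.contentLength
  rs.close()
  length
}

// abort() followed by close(), as in #3025.
def lengthAbortThenClose(): Long = {
  val rs = client.getObject(request)
  val length: Long = rs.response.contentLength
  rs.abort()
  rs.close()
  length
}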
