AmazonS3 client: putObject execution error when using SequenceInputStream (and a warning message is retrieved) #1265
Description
I'm running into a bug with the Java SDK AmazonS3 client, and maybe you can provide a fix. This is what happens:
Every minute I generate several CSV files on an S3 bucket (in order to send data to a reporting data warehouse). After several minutes, I need to concatenate these files into a bigger one and place it in a shared folder in the bucket, where it then gets loaded by the data warehouse. To perform the concatenation I don't want to download the files, concatenate them externally, and upload them again, because I'd be wasting a lot of resources and bandwidth. So I use S3ObjectInputStreams instead.
On my first approach I tried multipart upload (using each single file as a part), but that doesn't work when the parts are too small: the AWS documentation states that every part except the last one has to be at least 5 MB. So this approach is discarded in this case (a pity, because it could have been as easy as that). Instead I use a SequenceInputStream to combine the files and try to provide the contentLength, using the following code (with Java SDK 1.11.164 to 1.11.172):
#1 public boolean concatenateFiles(String bucketName, String preffix, String destKey, boolean deleteOrig) {
#2
#3     ObjectListing objectListing = s3Client.listObjects(bucketName, preffix);
#4
#5     Vector<InputStream> inputs = new Vector<InputStream>();
#6     ArrayList<String> objects = new ArrayList<String>();
#7
#8     long contentLength = 0;
#9
#10    while (true) {
#11        for (Iterator<S3ObjectSummary> iterator =
#12                 objectListing.getObjectSummaries().iterator();
#13             iterator.hasNext();) {
#14            S3ObjectSummary summary = iterator.next();
#15            inputs.add(getObjectInputStream(bucketName, summary.getKey()));
#16            objects.add(summary.getKey());
#17            contentLength += summary.getSize();
#18            System.out.println("adding " + summary.getKey());
#19        }
#20        // more object listings to retrieve?
#21        if (objectListing.isTruncated()) {
#22            objectListing = s3Client.listNextBatchOfObjects(objectListing);
#23        } else {
#24            break;
#25        }
#26    }
#27
#28    Enumeration<InputStream> enu = inputs.elements();
#29    SequenceInputStream sis = new SequenceInputStream(enu);
#30
#31    ObjectMetadata metadata = new ObjectMetadata();
#32    metadata.setContentLength(contentLength);
#33    s3Client.putObject(new PutObjectRequest(bucketName, destKey, sis, metadata));
#34
#35    if (deleteOrig) {
#36        for (String object : objects) {
#37            s3Client.deleteObject(bucketName, object);
#38        }
#39    }
#40
#41    return true;
#42 }
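For completeness, getObjectInputStream (line #15) just wraps getObject and hands back the object's content stream; a simplified sketch of what it does (error handling left aside):

private S3ObjectInputStream getObjectInputStream(String bucketName, String key) {
    // returns the content stream of the S3 object, without downloading it to disk
    S3Object object = s3Client.getObject(bucketName, key);
    return object.getObjectContent();
}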
The fact is this code doesn't work properly... because of the contentLength! When I execute the code above, it only copies the first file (the first InputStream in the SequenceInputStream) and then raises an error saying that the request may not have been executed successfully and that not all of the data may have been read.
If I remove lines #8, #17 and #32 and don't provide the contentLength, the code works fine (a new S3 object is correctly created with the concatenated content of all the others), but it logs a warning similar to "No content length specified for stream data. Stream contents will be buffered in memory and could result in out of memory errors.". Of course I want to avoid memory errors in case the files are too big. Still, without the contentLength the S3 client seems prepared to do the job efficiently and reliably: this approach is currently working fine in our production environment, with the only issue that I cannot provide a contentLength, which forces the client to buffer all file contents in memory... and that could result in out-of-memory errors as the files grow in size.
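Isolated, the working variant boils down to something like this (only a sketch; the class wrapper and parameters are just there so it compiles on its own):

import java.io.SequenceInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

class PutWithoutContentLength {
    // Same put as line #33 above, but the metadata carries no content length,
    // so the SDK logs the warning and buffers the whole stream in memory.
    static void put(AmazonS3 s3Client, String bucketName, String destKey, SequenceInputStream sis) {
        ObjectMetadata metadata = new ObjectMetadata();
        s3Client.putObject(new PutObjectRequest(bucketName, destKey, sis, metadata));
    }
}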
To sum it up, even when I'm able to provide the contentLength (which is the desired behaviour with the AmazonS3 client), if I then use a SequenceInputStream the client won't work correctly, reading only the first InputStream and ignoring all the rest. I believe the issue arises because, every time a single InputStream inside the SequenceInputStream has been read, the AmazonS3 client assumes the operation is complete and checks the metadata contentLength against the data read so far. Obviously, when only the first InputStream has been read, the two won't match, since there are still other InputStreams left in the sequence.
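What makes me lean towards that explanation is how SequenceInputStream hands out its data: a single read() returns at most the bytes left in the current underlying stream, and -1 only comes once every stream in the sequence is exhausted. A minimal standalone example with plain in-memory streams (nothing S3-specific, just to illustrate the behaviour):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Vector;

class SequenceReadDemo {
    public static void main(String[] args) throws IOException {
        Vector<InputStream> parts = new Vector<InputStream>();
        parts.add(new ByteArrayInputStream("part-1,".getBytes("UTF-8")));
        parts.add(new ByteArrayInputStream("part-2".getBytes("UTF-8")));
        SequenceInputStream sis = new SequenceInputStream(parts.elements());

        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        // Each read() returns at most what is left in the current underlying
        // stream; -1 is only returned after ALL streams have been consumed.
        while ((n = sis.read(buffer)) != -1) {
            System.out.println("read " + n + " bytes");
            total += n;
        }
        System.out.println("total = " + total); // 13 = combined length of both parts
        sis.close();
    }
}

So a read that returns fewer bytes than the total contentLength doesn't mean the sequence is finished; only the accumulated count at end-of-stream should be compared against the metadata.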
You could probably fix this by just checking whether the InputStream in the PutObjectRequest (line #33) is an instance of SequenceInputStream, and in that case checking the contentLength only once all the InputStreams in the sequence have been read.
In any case, if you want to suggest a better way to concatenate files on S3 that would be fine too.
Thank you very much for your time.