Question "sed"-like editing support #37
Hi. Well, read mode is for reading only and write mode is for writing only; for reading and writing at the same time, use the default non-stream mode. Under the hood, in any type of stream mode the library operates on only one row at a time. Regarding read mode: any updates will be discarded and the original sheet data will be written to the xlsx. The purpose of the stream modes is to load/write large data, not to update it, and these modes are very limited. Long story short: it's not possible to update sheets in stream mode, only to read or write. About your suggested approach: how would you merge? It's not just a copy; you need to somehow merge two sheets. P.S.: Did you try the standard (read/write) mode?
@plandem -- Thanks for your reply. I appreciate that streaming mode operates one row at a time. My hope would be to stream in a row, modify it, and write it out. If it can't be done in place, then streaming to a new destination is OK (and perhaps safer). Could you comment more on "how to merge"? I could work on a PR for this but need to wrap my head around the underlying data model. I guess the problem is that there are "sheet-global" resources that are referred to from the cells? Or are they "workbook-global"? In the latter case, a raw copy of the cells should work, perhaps? If "sheet-global", then perhaps I would need to make multiple passes:

1. collect all the sheet-global stuff before opening the second sheet for writing,
2. write the header of the output stream with the global stuff from the first sheet,
3. stream the input sheet, copying over to the output sheet while changing references appropriately.

EDIT: I guess some things are obviously "sheet-global" -- column definitions, for instance. Perhaps there could be a few things you could do to an output streamed sheet before writing any rows, and one of these things would be to copy the columns (and other stuff?) from another sheet. Sort of like HTTP libraries allow fiddling with headers, but only before you start streaming the body. Would you be willing to accept a PR on this, potentially? A sketch in a branch would be welcome. :)
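To make that multi-pass idea concrete, here is a minimal Go sketch. Everything in it is hypothetical -- `RowData`, `ColumnInfo`, `SheetReader`, `SheetWriter`, and `copySheet` are illustrative names only, not the library's API:

```go
package sheetcopy

// RowData stands in for whatever per-row representation the streams use.
type RowData struct{ Cells []string }

// ColumnInfo stands in for sheet-global column definitions.
type ColumnInfo struct {
	Index int
	Width float64
}

// SheetReader is a read stream over an existing sheet.
type SheetReader interface {
	Columns() []ColumnInfo // pass 1: collect sheet-global info
	Next() (RowData, bool) // pass 3: iterate rows
}

// SheetWriter is a write stream for a new sheet; sheet-global state must be
// set before the first WriteRow, like HTTP headers before the body.
type SheetWriter interface {
	SetColumns([]ColumnInfo) // pass 2: write the "header" info first
	WriteRow(RowData) error
	Close() error
}

// copySheet streams src into dst, applying edit to each row.
func copySheet(src SheetReader, dst SheetWriter, edit func(*RowData)) error {
	dst.SetColumns(src.Columns()) // global stuff before any rows
	for row, ok := src.Next(); ok; row, ok = src.Next() {
		edit(&row)
		if err := dst.WriteRow(row); err != nil {
			return err
		}
	}
	return dst.Close()
}
```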
No -- there seem to be no facilities for copying between sheets. EDIT (answering the P.S.): above I was still thinking about copying between sheets (stream the original, build the new one in non-stream mode). I just tried a simple update in place in non-streaming mode, but that also didn't seem to work. However, the code around the update is complex, so I'll submit another issue. The spreadsheets are several hundred thousand rows by ~100 columns -- not impossible to update without streams, but I want to run in AWS Lambda, where memory is a precious resource.
I will try to think on the weekend about possible ways, and will get back to your suggestions after reading them again. Can't dive into it right now, sorry. P.S.: Regarding AWS and memory, I don't know how AWS and GC (garbage collection) will work together. There is a chance that a C/C++-based library would do better, because of the GC in Go/Python/PHP/Java/JavaScript etc. Update: What I mean is, I don't have much experience with AWS Lambda, so I don't know how some things, e.g. GC, will behave there.
Thanks! I have been using Go in Lambda for a while -- it seems to work well for me, and I haven't noticed too much memory bloat due to GC. Of course, this particular task might stress it, but (even if I help with a PR) Go plus tuning will probably be quicker for me than C/C++, which I haven't programmed in a decade. (I've been meaning to learn Rust... :)) Apropos the update problem (when not in stream mode) -- I had forgotten to save! In fact, I hadn't noticed the save function and had only been using the package for reading before. I had assumed that I would suggest putting ...
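For reference, the read-modify-save round trip in the default (non-stream) mode looks roughly like this -- `Open`, `Sheet`, `CellByRef`, `SetText`, and `Save` are the calls mentioned in this thread, but the exact signatures should be checked against the current docs:

```go
package main

import (
	"log"

	"github.com/plandem/xlsx"
)

func main() {
	// Default (non-stream) mode supports both reading and writing.
	xl, err := xlsx.Open("./report.xlsx")
	if err != nil {
		log.Fatal(err)
	}
	defer xl.Close()

	sheet := xl.Sheet(0)
	sheet.CellByRef("A1").SetText("updated")

	// The easy step to miss: changes stay in memory until saved.
	if err := xl.Save(); err != nil {
		log.Fatal(err)
	}
}
```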
I looked through the AWS Lambda prices, and it looks like they don't charge for memory, only for CPU. That's good, and suggests GC is not an issue.
Actually, they do charge for memory (see table below) ... but if I can stream, it will be fine. :) EDIT: In the long term I plan to move to Kubernetes anyway (maybe using Fission), which will lift the per-call restrictions, but I will still have to pay for memory one way or another. And anyway, streaming support is what makes your library so great! :)
I remember there being a guide, but the link in the readme is now broken. I was under the impression that you might have taken the guide down because it was obsolete. (So maybe fix the link if it is still useful. :)) Thanks!
Ouch... I didn't know the guide was broken. Sorry, will try to fix it asap.
The guide is fixed. Thanks for pointing it out!
I thought a bit about updates in stream mode. There is one catch: the sheet file has a few sections, and some header information sometimes has to be tuned. Of course, it's possible to copy everything except the cells from the original sheet and allow updates only within the existing bounds -- in that case it can work. It would mean updating only the cells in that row... literally. Nothing more. Update: Theoretically it's possible. We'd need to somehow combine the read-stream and write-stream functionality, but I'm not sure I will have time in November. In general it could be something like this: read a row as in the read stream and write it as in the write stream. But before that, you have to manually copy everything from the original file up to the sheet data, and everything after it, and also copy any rows that were not iterated.
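A minimal sketch of that "one stream, two streams under the hood" idea -- every name here (`UpdateStream`, `copySection`, `nextRow`, `writeRow`) is hypothetical, and the helpers are stubbed because only the overall flow matters:

```go
package sheetstream

import "io"

// UpdateStream reads the original sheet part and writes a patched copy.
type UpdateStream struct {
	in  io.Reader // read side: the original sheet XML
	out io.Writer // write side: the patched sheet XML
}

// copySection copies bytes verbatim from in to out up to some marker
// (e.g. the <sheetData> open or close tag); stubbed here.
func (s *UpdateStream) copySection() error { return nil }

// nextRow reads one <row> element, as the read stream does today.
func (s *UpdateStream) nextRow() ([]byte, error) { return nil, io.EOF }

// writeRow appends one <row> element, as the write stream does today.
func (s *UpdateStream) writeRow(raw []byte) error { return nil }

// Update copies everything before <sheetData> verbatim, lets the caller
// rewrite each row, then copies everything after </sheetData> verbatim.
func (s *UpdateStream) Update(edit func([]byte) []byte) error {
	if err := s.copySection(); err != nil { // header sections
		return err
	}
	for {
		raw, err := s.nextRow()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if err := s.writeRow(edit(raw)); err != nil {
			return err
		}
	}
	return s.copySection() // trailer sections
}
```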
Thanks! I'll get the whole thing up and running without stream mode; then hopefully I'll have some time to work on a PR for this.
A snippet or a pointer for this would be helpful. I guess ... Presumably this can be done in ... General notes: ...
`afterCreate` will be useless, because that information is already in the sheet, so `afterOpen` only. But... not only that: the stream reader reads only the supported blocks, while you would need to copy everything from the original file, quite literally, because the original file can contain much more information in ...
Actually, no. Even in stream mode you can fast-rewind to any row, e.g. `CellByRef("A100")` -- you skipped 99 rows, so those rows should be copied to the write stream during that rewind. In `Close()` you should copy only the remaining rows.
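Continuing the hypothetical `UpdateStream` sketch from the earlier comment, the rewind-and-close behaviour described here could look like:

```go
// seekRow fast-rewinds to a target row, copying the skipped rows through
// to the write stream unmodified.
func (s *UpdateStream) seekRow(target int, current *int) error {
	for *current < target {
		raw, err := s.nextRow()
		if err != nil {
			return err
		}
		if err := s.writeRow(raw); err != nil { // pass through untouched
			return err
		}
		*current++
	}
	return nil
}

// Close copies the rows that were never iterated, then the trailer section.
func (s *UpdateStream) Close() error {
	for {
		raw, err := s.nextRow()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if err := s.writeRow(raw); err != nil {
			return err
		}
	}
	return s.copySection()
}
```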
It's easier to document that in stream mode you can only go forward, and only once, than to clutter the code with errors for useless one-off cases. Because ...
Well, it's quite idiomatic Go: if it's not fatal, return an error; if it's fatal or a disallowed operation, panic. My point is that your code should never hit any fatals in production -- you should use the library properly and eliminate all fatals before shipping. E.g. if writing to a read stream is not allowed, then you should not do it; why would you even try to call it?! If you look at the other parts of the library, all methods that can fail for other reasons never panic -- panics happen only for behaviour the library disallows or restricts, because that kind of issue is created by the developer, not by ...
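The convention being described, as a generic Go illustration (not the library's actual code): failures the caller cannot prevent return errors, while invoking an operation the current mode forbids is treated as a bug in the calling code and panics:

```go
package stream

import "errors"

type Mode int

const (
	ModeRead Mode = iota
	ModeWrite
)

type Sheet struct{ mode Mode }

// Load can fail for reasons outside the developer's control (missing file,
// truncated archive), so it returns an error the caller must check.
func Load(name string) (*Sheet, error) {
	if name == "" {
		return nil, errors.New("empty file name")
	}
	return &Sheet{mode: ModeRead}, nil
}

// SetText on a read stream is a contract violation by the calling code,
// not a runtime condition, so it panics instead of returning an error.
func (s *Sheet) SetText(text string) {
	if s.mode == ModeRead {
		panic("xlsx: SetText called on a read-only stream")
	}
	// ...write the value...
}
```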
That is a valid point -- and I'm a little abashed to have raised it here (perhaps it would be better discussed over a beer :)). However (see this interesting article), IMHO on balance it would be better to panic less and return more errors. Imagine that your package is used not by a single developer but nested three or four packages deep, and usually works, but sometimes, due to sloppiness in the intervening layers, it brings down an application. The error convention focuses developers on the fact that "something could have gone wrong here" and that they should actually check -- indeed, if they want good code coverage, they should write a failing test case as well. The point is that, all in all, the resulting application will likely be more robust. I'm sure this doesn't apply to all your panics, but I did see some places where, at least at first glance, I think it would be good to return errors. I'm not sure if you are interested, though, and it won't hurt my feelings if you aren't. :)
Well, maybe it can be changed in v2.0, but right now there are other, more challenging tasks -- e.g. finishing DrawingML (charts, shapes, etc.), filters, and data validation. P.S.: I thought a bit more about update streaming and think it can be achieved: refactor the streams and create a single stream with two streams under the hood, input and output. If there is no input, it's write-only; if there is no output, read-only. But I'm not sure when I will have time. Maybe after New Year, in January.
I would like to update very large xlsx files in stream mode, setting information in given rows of a sheet as I iterate over it.
I tried updating a stream using the current version. `Cell.SetText()` panics with: ... In fact, `Cell.SetInlineText()` succeeds; however, the resulting text is not present in the output. Would this be hard to add?
Update: One possibility would be to add an additional sheet in streaming write mode, copy everything over while making modifications, then close and delete the original sheet, and rename the new one. (Sigh.) But looking around, I don't see any facilities for copying data from one sheet to another. Would that be a promising approach?
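Reusing the hypothetical `SheetReader`/`SheetWriter`/`copySheet` sketch from a comment earlier in this thread, the swap workflow might be outlined as follows -- `Workbook` and all of its methods are likewise illustrative only, and index bookkeeping after the deletion is glossed over:

```go
// Workbook is a hypothetical handle exposing stream access plus the
// sheet-management calls this workaround would need.
type Workbook interface {
	ReadStream(index int) SheetReader
	AddWriteStream(name string) SheetWriter
	DeleteSheet(index int)
	RenameSheet(index int, name string)
	SheetName(index int) string
	Save() error
}

// patchViaNewSheet streams the original sheet into a fresh one with edits
// applied, deletes the original, and gives the copy the original's name.
func patchViaNewSheet(wb Workbook, src, dst int, edit func(*RowData)) error {
	name := wb.SheetName(src)
	if err := copySheet(wb.ReadStream(src), wb.AddWriteStream("tmp"), edit); err != nil {
		return err
	}
	wb.DeleteSheet(src)
	wb.RenameSheet(dst, name)
	return wb.Save()
}
```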