Question "sed"-like editing support #37

Open
shaunc opened this issue Nov 1, 2019 · 14 comments

shaunc commented Nov 1, 2019

I would like to update very large xlsx files in stream mode, setting information in given rows of a sheet as I iterate over it.

I tried updating a stream using the current version. Cell.SetText() panics with:

    not supported in read-only mode
    /Users/shauncutts/src/other/pkg/mod/github.com/plandem/xlsx@v1.0.3/cell.go:130

In fact, Cell.SetInlineText() succeeds; however, the resulting text is not present in the saved output.
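
For reference, a stripped-down version of what I attempted (a sketch, not my exact code; the stream option name is from memory and may differ):

    package main

    import "github.com/plandem/xlsx"

    func main() {
        xl, err := xlsx.Open("large.xlsx")
        if err != nil {
            panic(err)
        }
        defer xl.Close()

        // Open sheet 0 in read-stream mode (option name assumed).
        sheet := xl.Sheet(0, xlsx.SheetModeStream)

        // Panics with "not supported in read-only mode" (cell.go:130).
        sheet.Cell(0, 0).SetText("updated")
    }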

Would this be hard to add?

Update One possibility would be to add an additional sheet in streaming write mode, copy everything over while making modifications, then close and delete the original sheet and rename the new one. (sigh) But looking around, I don't see any facilities for copying data from one sheet to another. Would that be a promising approach?

plandem commented Nov 1, 2019

Hi. Well, read mode is for reading only, and write mode is for writing only. For reading and writing at the same time, there is the default non-stream mode. Under the hood, in any stream mode the library operates on only one row at a time. As for read mode: any updates will be discarded, and the original sheet data will be written to the xlsx.

The purpose of the stream modes is to load/write large data, not to update it. These modes are very limited.

Long story short: it's not possible to update sheets in stream mode - only read or write.
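
For context, the distinction in code (a rough sketch from memory - the mode option name may differ from the actual API):

    xl, _ := xlsx.Open("book.xlsx")
    defer xl.Close()

    // Default mode: whole sheet in memory, read and write freely.
    sheet := xl.Sheet(0)
    sheet.Cell(0, 0).SetText("ok") // allowed here

    // Read-stream mode: forward-only, one row in memory at a time.
    streamed := xl.Sheet(0, xlsx.SheetModeStream)
    _ = streamed // reading only; any write here panics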

About your suggested approach: how would you merge? It's not just a copy - you somehow need to merge the two sheets.

P.S.: Did you try standard mode (read/write)?

shaunc commented Nov 1, 2019

@plandem -- Thanks for your reply. I appreciate that streaming mode operates one row at a time. My hope would be to stream in a row, modify it, and write it out. If it can't be done in place, then streaming to a new destination is ok (and perhaps safer).

Could you comment more about "how to merge"? I could work on a PR for this but need to wrap my head around the underlying data model. I guess the problem is that there are "sheet global resources" that are referred to in the cells? Or are they "workbook global"?

If the latter, then a raw copy of cells should work, perhaps?

If "sheet global", then perhaps I would need to make multiple passes: 1) collect all the sheet-global stuff before opening the 2nd sheet for writing, 2) write the header of the output stream with the global stuff from the first sheet, 3) stream the input sheet, copying over to the output sheet while changing references appropriately.

EDIT I guess some things are obviously "sheet global" -- column definitions, for instance. Perhaps there could be a few things you could do to an output streamed sheet before writing any rows, and one of these would be to copy columns (and other stuff?) from another sheet -- sort of like http libraries allow fiddling with headers, but only before you start streaming the body. A hypothetical sketch follows.
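
(None of these copy methods exist today -- the names are invented purely to illustrate the shape:)

    in := xl.Sheet(0, xlsx.SheetModeStream) // read stream over the original
    out := xl.AddSheet("copy")              // imagine this opened in write-stream mode

    // "Fiddle with the headers" before streaming the body,
    // as with an http.ResponseWriter:
    out.CopyColumnsFrom(in) // hypothetical: column defs and other sheet-global data

    // ...only then start streaming rows across, modifying as needed.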

Would you be willing to accept a PR on this, potentially? A sketch in a branch would be welcome. :)

P.S.: Did you try standard mode (read/write)?

No -- there seem to be no facilities for copying between sheets (e.g. range.CopyToSheet(sheet Sheet, cIdx, rIdx int) -- possibly another enhancement).

Apropos CopyTo: the current code skips over empty cells. It would seem to me that this introduces a bug: destination cells are not emptied but simply skipped, so old data might persist in the middle of the destination range.
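
To illustrate the suspected failure mode (assuming CopyTo skips source cells that are empty, as the current code appears to):

    // Copy A1:A3 onto C1:C3, where A2 is empty:
    //
    //   source         dest before     dest after (suspected)
    //   A1 "x"         C1 "old1"       C1 "x"
    //   A2 (empty)     C2 "old2"       C2 "old2"   <- stale value survives
    //   A3 "y"         C3 "old3"       C3 "y"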

EDIT (answering the P.S.): I was still thinking about copying between sheets (stream the original, build the new sheet in non-stream mode). I just tried a simple in-place update in non-streaming mode, but that also didn't seem to work. However, the code around the update is complex, so I'll submit another issue with a code sample and spreadsheet if the problem persists in isolation.

The spreadsheets are several hundred thousand rows by ~100 columns. Not impossible to update without streams, but I want to run in AWS Lambda, where memory is a precious resource.

plandem commented Nov 1, 2019

I will try to think about possible approaches over the weekend, and will come back to your suggestions and read them again. Can't dive into it right now, sorry.

P.S.: Regarding AWS and memory: I don't know how AWS and the GC (garbage collector) will work together. There is a chance that a C/C++-based library would be better, because of the GC in Go/Python/PHP/Java/JavaScript etc.

Update: What I mean is, I don't have much experience with AWS Lambda, so I don't know how some things, e.g. the GC, will behave.

shaunc commented Nov 1, 2019

Thanks! I have been using Go in Lambda for a while -- it seems to work well for me; I haven't noticed too much memory bloat due to GC. Of course, this particular task might stress it, but (even if I help with a PR) it will probably be quicker for me in Go plus tuning than in C/C++, which I haven't programmed in for a decade. (I've been meaning to learn Rust... :))


Apropos the update problem (when not in stream mode) -- I had forgotten to save! In fact, I didn't notice the save function, as I had only been using the package for reading before. I had assumed that Close() would save if the sheet was dirty.

I would suggest putting ...Save() in the update examples. It is there in the Readme, but not commented on; also, since Spreadsheet inherits Save from ooxml.Package, it doesn't show up in the package godoc.
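
Something this small would have saved me (a sketch; it assumes Save() returns an error, as an ooxml.Package method, and elides the log import):

    xl, err := xlsx.Open("report.xlsx")
    if err != nil {
        log.Fatal(err)
    }
    defer xl.Close()

    xl.Sheet(0).Cell(0, 0).SetText("updated")

    // Easy to miss: Close() does not persist changes.
    if err := xl.Save(); err != nil {
        log.Fatal(err)
    }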

plandem commented Nov 1, 2019

I would suggest putting ...Save() in the update examples. It is there in the Readme, but not commented on; also, since Spreadsheet inherits Save from ooxml.Package, it doesn't show up in the package godoc.

Have you had a chance to look through the guide? I plan to keep the examples only as a way to test features, and to focus more on the guide in the future.

plandem commented Nov 1, 2019

I looked through the AWS Lambda prices, and it looks like they don't charge for memory, only for CPU. That's good, and it looks like the GC is not an issue.

shaunc commented Nov 1, 2019

Actually, they do charge for memory (see table below)... but if I can stream, it will be fine. :)

EDIT In the long term I plan to move to Kubernetes anyway (maybe using Fission), which will remove the per-call restrictions, but I will still have to pay for memory one way or another. And anyway, streaming support is what makes your library so great! :)

Have you had a chance to look through the guide?

I remember there being a guide, but the link in the readme is now broken. I was under the impression that you might have taken the guide down because it was obsolete. (So maybe fix the link if it is still useful. :))

Thanks!

plandem commented Nov 1, 2019

Ouch... I didn't know the guide link was broken. Sorry. Will try to fix it ASAP.

plandem commented Nov 1, 2019

The guide is fixed. Thanks for pointing it out!

plandem commented Nov 2, 2019

I thought a bit about updates for stream mode. There is one thing: the sheet file has a few sections, and some header information sometimes needs to be tuned. Of course, it's possible to copy everything except the cells from the original sheet and allow updates only inside the existing bounds - in that case it can work. That would mean updating only the cells in the current row... literally. Nothing more.

Update: Theoretically it's possible. The read and write stream functionalities need to be combined somehow, but I'm not sure I will have time in November. In general it could be something like this: read a row as in the read stream and write it as in the write stream. But before that, you need to manually copy everything from the original file up to the sheet data (and after it), and also copy the rows that were not iterated.
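
A rough skeleton of that idea (pure illustration: every type and method here is invented, not library API):

    // row, rowReader and rowWriter are stand-ins for the real internals.
    type row = []string

    type rowReader interface {
        hasNext() bool
        readRow() row
    }

    type rowWriter interface {
        writeRow(r row)
    }

    // updateStream combines a read stream over the original sheet with a
    // write stream for its replacement. Copying everything before and
    // after the sheet-data section of the original file is omitted here.
    type updateStream struct {
        in  rowReader
        out rowWriter
    }

    // next hands the caller one row, as in the read stream; the caller may
    // modify it and pass it back via write, as in the write stream.
    func (s *updateStream) next() row   { return s.in.readRow() }
    func (s *updateStream) write(r row) { s.out.writeRow(r) }

    // close copies the rows that were never iterated, so the replacement
    // sheet is complete.
    func (s *updateStream) close() {
        for s.in.hasNext() {
            s.out.writeRow(s.in.readRow())
        }
    }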

shaunc commented Nov 2, 2019

Thanks! I'll get the whole thing up and running without stream mode, then hopefully will have some time to work on a PR for this.

But before that, you need to manually copy everything from the original file up to the sheet data

A snippet or pointer for this would be helpful. I guess sheetWriteStream.afterCreate merged with sheetReadStream.afterOpen? :)

also copy rows that were not iterated.

Presumably can do in Close().


General notes:

  • a streaming sheet can be iterated no more than once -- so sheetReadStream.Rows should probably error rather than allow a duplicate call (a sketch of such a guard follows this list).
  • I do wish this package was more idiomatic Go -- returning errors rather than panicking. I guess that even if you wanted to change it, that would be a "2.0" feature, as it would require an API change.

plandem commented Nov 2, 2019

A snippet for this or pointer would be helpful. I guess sheetWriteStream.afterCreate merged with sheetReadStream.afterOpen? :)

afterCreate would be useless, because that information is already in the sheet, so afterOpen only. But... not only that: the stream reader reads only the supported blocks, whereas you need to copy everything from the original file, literally, because the header section of the original file can contain much more information.

also copy rows that were not iterated.

Presumably can do in Close().

Actually, no. Even in stream mode you can fast-forward to any row: e.g. with CellByRef("A100") you skipped 99 rows, so those rows have to be copied to the write stream during that rewind process.

In Close() you should copy only the remaining rows.
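
Continuing the invented updateStream sketch from earlier, the copy-through-rewind rule might look like this (cur is an added current-row counter, also invented):

    // seekRow fast-forwards to row n, passing every skipped row straight
    // through to the write stream so no data is lost.
    func (s *updateStream) seekRow(n int) {
        for s.cur < n && s.in.hasNext() {
            s.out.writeRow(s.in.readRow()) // skipped rows are copied unchanged
            s.cur++
        }
    }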

a streaming sheet can be iterated no more than once -- so sheetReadStream.Rows should probably error rather than allow a duplicate call.

It's easier to document that in stream mode you can only go forward, and only once, than to clutter the code with errors for a single misuse case. Read/write-once is the nature of streams.

I do wish this package was more idiomatic Go -- returning errors rather than panicking. I guess that even if you wanted to change it, that would be a "2.0" feature, as it would require an API change.

Well, it's quite idiomatic Go: if it's not fatal, return an error; if it's fatal or a disallowed operation, panic.

My point is: your code should not hit any fatal cases in production; you should use the library properly and eliminate all fatals before shipping. E.g. if writing in a read stream is not allowed, then you should not do it. Why would you even call it?! If you look at other parts of the library, all methods that can fail for other reasons never panic; panics happen only for behaviour the library disallows, because that kind of issue is created by the developer, not by the user, the input file, or the settings. P.S.: By "you", I mean an abstract developer.

shaunc commented Nov 2, 2019

My point is: your code should not hit any fatal cases in production; you should use the library properly and eliminate all fatals before shipping. E.g. if writing in a read stream is not allowed, then you should not do it.

That is a valid point -- and I'm a little abashed to have raised it here (perhaps it would be better discussed over a beer :)). However (see interesting article), IMHO on balance it would be better to panic less and return more errors. Imagine that your package is used not by a single developer but is nested three or four packages deep, and usually works -- but sometimes, due to sloppiness in the intervening layers, it brings down an application. The error convention focuses developers on the fact that "something wrong could have happened here" and that they should actually check; indeed, if they want good code coverage, they should write a failing test case as well. The point is that, all in all, the resulting application will likely be more robust.

I'm sure this point doesn't apply to all your panics. I did see some places where, at least at first glance, I'd think it would be good to return errors. I'm not sure if you are interested, though, and it won't hurt my feelings if you aren't. :)
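
For what it's worth, this is the defensive pattern an intervening layer ends up writing today (plain Go, nothing library-specific; needs the "fmt" import):

    // safeCall converts a panic from a library call into an error at the
    // package boundary, so one disallowed call can't take down the app.
    func safeCall(fn func()) (err error) {
        defer func() {
            if r := recover(); r != nil {
                err = fmt.Errorf("xlsx call failed: %v", r)
            }
        }()
        fn()
        return nil
    }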

plandem commented Nov 4, 2019

Well, maybe it can be changed in v2.0, but right now there are other, more challenging tasks - e.g. finishing DrawingML (charts, shapes, etc.), filters, and data validation.

P.S.: I thought a bit about update streaming and think it can be achieved: refactor the streams and create a single one, but with two streams under the hood - input and output. If there is no input, it's write-only; if there is no output, read-only. But I'm not sure when I will have time. Maybe after New Year, in January.
