Parse comma-delimited ~ASCII sections #265

Open
kinverarity1 opened this issue Feb 9, 2019 · 8 comments
Labels
bug · data-section-parser (A bug or enhancement relating to the data section parser) · las3 (stuff relating to LAS 3.0)

@kinverarity1
Owner

This may be causing an issue for some: https://stackoverflow.com/questions/53874152/how-to-perform-the-same-edit-on-a-folder-of-csv-files

@kinverarity1 kinverarity1 added this to the v1 milestone Jul 4, 2019
@kinverarity1 kinverarity1 added las3 stuff relating to LAS 3.0 and removed enhancement infected-LAS labels May 3, 2020
@kinverarity1 kinverarity1 added bug data-section-parser A bug or enhancement relating to the data section parser labels Apr 9, 2021
@dcslagel
Collaborator

Is there a real-world example LAS file with a comma-delimited ~ASCII section?

@kinverarity1
Owner Author

Try las_sample_3.0.las in the repo; I think all the data sections in there are comma-delimited.

@dcslagel
Collaborator

@kinverarity1, how should lasio decide when to parse the ~ASCII section with a comma delimiter? I think these are the options. Is one of these (or something else) preferred?

  • Check for the delimiter in the first few lines and have some logic to make the decision. Then parse the data section with this 'found' delimiter.
  • A command line parameter is passed in. Does this already exist?
  • Check if the file is LAS-3 and if it is then parse the ~ASCII section assuming the delimiter is a comma?
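For illustration, the first option (checking the first few data lines) might look roughly like the sketch below. It is not lasio code: `sniff_delimiter`, `CANDIDATES` and `sample_size` are hypothetical names, and the rule "every sampled line contains the same non-zero number of delimiter characters" is just one plausible heuristic.

```python
# Hypothetical sketch, not part of lasio: guess the data-section delimiter
# by looking at the first few non-empty data lines.
CANDIDATES = {"COMMA": ",", "TAB": "\t"}


def sniff_delimiter(lines, sample_size=5):
    """Return 'COMMA', 'TAB' or 'SPACE' for a sequence of data lines."""
    sample = [ln.strip() for ln in lines if ln.strip()][:sample_size]
    if not sample:
        return "SPACE"  # nothing to inspect, fall back to whitespace
    for name, char in CANDIDATES.items():
        counts = {ln.count(char) for ln in sample}
        # Accept a candidate only if every sampled line contains it
        # the same (non-zero) number of times.
        if len(counts) == 1 and counts.pop() > 0:
            return name
    return "SPACE"
```

A heuristic like this can be fooled by files that use a comma as the decimal mark (e.g. `1,5 2,3`), so it would probably still need to be combined with an explicit setting.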

Thanks,
DC

@kinverarity1
Owner Author

I'd prefer the second approach mixed with the "provisional" approach we use for WRAP and NULL (see the DLM item in the version section):

DLM . COMMA : DELIMITING CHARACTER BETWEEN DATA COLUMNS

Then we could just provide a keyword argument to force a choice? Not sure what to default to - probably whitespace.

My preference is mainly driven by laziness 🙃 First option would be great and we can do it if you want!

@dcslagel dcslagel self-assigned this Jul 18, 2021
@dcslagel
Collaborator

dcslagel commented Jul 18, 2021

Reading through the 1.2/2.0 and 3.0 specs, only 3.0 has DLM. The 3.0 spec says that if DLM is not stated then the default delimiter is space. So we could set the default to space and, if DLM is stated, adopt its value. That would roughly be the first option and I think it is pretty doable. I'll attempt a draft branch of it.
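A minimal sketch of that logic, assuming the ~Version section is available as a list of raw header lines. The names `resolve_delimiter`, `DLM_CHARS`, `DLM_LINE` and the `override` keyword are made up for this example and are not lasio's API; `override` stands in for the "keyword argument to force a choice" idea mentioned earlier in the thread.

```python
import re

# DLM values mapped to the actual separator characters (assumed mapping).
DLM_CHARS = {"SPACE": " ", "COMMA": ",", "TAB": "\t"}

# Loosely matches a header line such as
# "DLM . COMMA : DELIMITING CHARACTER BETWEEN DATA COLUMNS"
DLM_LINE = re.compile(r"^\s*DLM\s*\.\s*(?P<value>[^:]*?)\s*:", re.IGNORECASE)


def resolve_delimiter(version_lines, override=None):
    """Pick the delimiter: explicit override, else DLM, else space."""
    if override is not None:
        return DLM_CHARS[override.upper()]
    for line in version_lines:
        match = DLM_LINE.match(line)
        if match:
            # Unknown or empty DLM values fall back to the default (space).
            return DLM_CHARS.get(match.group("value").upper(), " ")
    return " "  # DLM not stated: default to space


version = [
    "VERS. 3.0 : CWLS LOG ASCII STANDARD -VERSION 3.0",
    "WRAP. NO : ONE LINE PER DEPTH STEP",
    "DLM . COMMA : DELIMITING CHARACTER BETWEEN DATA COLUMNS",
]
print(repr(resolve_delimiter(version)))  # ','
```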

@dcslagel
Collaborator

dcslagel commented Jul 18, 2021

Dev Notes (may change over time...) Edited 2021-08-10:

  • Reread the 3.0 spec

  • Review Kent's manual method to read LAS3 data https://gist.github.com/kinverarity1/92f00b781472512349a9312d75fd4c33

  • Create a central las.delimiter property and make it available to both parsing engines: normal and Numpy.

  • In the ~Version parsing section, check for the DLM keyword; if it exists, override the default delimiter with the DLM setting. The default delimiter will be a space.

  • Focus on parsing just the LAS3 '~ASCII' data section. Skip the other data sections for now because the code and internal data structure aren't there yet to handle multiple data sections. In LAS3 the '~ASCII' section can also be called ~Log_Data, so add the ability to find either one. Optionally implement comma parsing with the file standards/examples/3.0/sample_las3.0_spec.las, which uses the LAS2 standard section names. Using it as the test file would allow the addition of the alternate 'Log' names to be split into a separate pull request or a subsequent commit.

  • The curve names may be in ~Log_Definition instead of ~Curves. Add the ability to check both, and to add the Data to las.curves

  • Add a default.py::READ_POLICIES key-value for when the delimiter is a comma (this will exclude the regex_sub for converting commas to decimal). Add provisional_delimiter. Move the call in las.py to reader.py::get_substitutions() till after the provisional_delimiter has been processed so we can decide which default.py::READ_POLICIES to use.

  • reader.py::read_data_section_iterative_normal_engine()'s sub-routine items() needs to be changed so it can parse a line with different delimiters: not just SPACE but also COMMA and TAB. The same applies to reader.py::inspect_data_section().
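As a rough illustration of the last note, splitting a single data line on a configurable delimiter could look like the sketch below. `split_data_line` is a hypothetical helper, not the actual items() routine in reader.py, and it does not attempt to handle quoted text fields that contain the delimiter.

```python
def split_data_line(line, delimiter=" "):
    """Split one data line into string fields (hypothetical helper)."""
    line = line.rstrip("\r\n")
    if delimiter == " ":
        # Whitespace-delimited: any run of spaces/tabs is one separator.
        return line.split()
    # Comma or tab: keep empty fields so column positions are preserved,
    # and trim the padding around each value.
    return [field.strip() for field in line.split(delimiter)]
```

Keeping empty fields in the comma and tab cases matters because a missing value between two delimiters still occupies a column; the whitespace case cannot represent that, which is why it simply collapses runs of blanks.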

dcslagel added a commit that referenced this issue Sep 20, 2021
…-data

Parse comma-delimited ~ASCII sections #265
@dcslagel dcslagel removed their assignment Feb 18, 2022
@Charles-HL

What is the current status of this issue? I see a pull request has been merged. Is it done?

@kinverarity1
Owner Author

I believe the PR only implemented it for the normal (slow) parser, not the numpy (fast) parser. It also didn't allow the user to specify the delimiter.
