Reading shimadzu .lcd files #29

Closed
kco-hereon opened this issue Mar 13, 2024 · 30 comments
Labels
enhancement New feature or request

Comments

@kco-hereon

Dear Ethan,
I tried to use your code for reading the raw data of our Shimadzu HPLC, thanks for that code!

I am not a programmer, and I work mainly in Python rather than R. Here are some results from the last few days that my colleague (who uses R) and I spent working on this. I wonder whether you would like to address the issues we found in your R code.

  1. We both use a new Mac with an M1 chip. With the standard installation, some dependencies in Miniconda are not provided for this processor architecture! So we could not use your code directly, but we also did not spend time figuring out which dependency is failing. My colleague used an older PC instead.
  2. Instead, I am trying to translate the code for reading Shimadzu .lcd files into Python. This code now (most likely) works for my data, and I can also extract the data of our fluorometer in addition to that of the PDA.

I needed to change mainly two things:

  1. your line 147 in read_shimadzu_lcd.R, mat <- matrix(NA, nrow = fsize/(n_lambdas*1.5), ncol = n_lambdas)
    This is about the size of the data stream, which depends on the number of wavelengths from the PDA and the total time of the HPLC run. A simple factor of 1.5 does not work for my data. Instead, I first scanned the PDA raw data stream for the start bytes of each dataset header and counted them. Later I found the entry in a stream that contains the number of datasets, which can simply be read out.

  2. your line 249 in function decode_shimadzu_block: buffer[[2]] <- twos_complement(substr(bin, 5, nchar(bin))),
    This line cuts off the first 4 bits of the bit string that contains the difference to the previous value. It worked this way for my PDA data, but it could not reproduce the fluorometer results at some positions and distorted the signal. It took me some time to understand this, but in the end the function simply fails when the difference is a large number and more bytes are needed to encode it. I simply reduced the cut and now use the bits from position 3 onward. This seems to work!
    My question here is: did you find the number '5' simply by trial and error, or was there a reason?

If there is interest from your side, I can spend some time describing more details, e.g. where to find the fluorescence data and how to read it, or where the stream size is stored in the .lcd file.
Best
Rüdiger

@ethanbass
Owner

ethanbass commented Mar 13, 2024

Hi Rüdiger,

Thanks for your feedback. I would definitely be interested in hearing more about your innovations and potentially incorporating them into the package. It would be nice to add support for the fluorescence data!

Re: Issue 1, This was just a workaround because I couldn't figure out where the length of the stream was encoded. It worked with all the data files I had access to, but I'm not surprised that it didn't generalize well to every file. I would definitely be interested in fixing this if you can tell me how the length of the stream is encoded. It sounds like maybe this discrepancy is due to a difference in the way the fluorometer stream is encoded compared to the PDA stream.

Re: Issue 2.
To be honest, I can't remember off the top of my head where the 5 came from here. I am a little short on time right now, but I will try to figure it out and get back to you. In general though I can tell you that pretty much everything in the Shimadzu parser was worked out by trial and error because there is no publicly available information (as far as I'm aware) on how these files are encoded. It seems likely to me that this discrepancy may be due to a difference in how the fluorometry stream and the PDA stream are encoded.

Regarding your comment about the M1 processor, I'm not sure what issue you're running into, but I can tell you that the package is definitely functional on M1 macs, because I am actually doing most of the development of the package on an M1 mac. I'm guessing there is some other issue with your miniconda installation that is causing the installation to fail. To be honest, the python dependencies have been quite a headache and it really makes me wish that reticulate worked more smoothly in the context of a package. Unfortunately, for the Shimadzu LCD parser, the python bindings are pretty necessary since as far as I know there is no equivalent to olefile in R for handling the OLE files.

Best,
Ethan

@kco-hereon
Author

kco-hereon commented Mar 14, 2024

Hi Ethan,
here are the two Python functions I use to get the number of time sections in the PDA raw data!

The file is read in with olefile.OleFileIO!

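def get_nodataset_pda_old(file):
    # Count how often the first 3 bytes of the raw stream recur;
    # they are repeated before each data block.
    stream = file.openstream("PDA 3D Raw Data/3D Raw Data").read()
    s = stream[0:3]
    count = 0
    for i in range(len(stream)):
        if s == stream[i:i+3]:
            count += 1
    return count

def get_nodataset_pda(file):
    # Read the number of datasets from the entry between '<CN' and '</CN'
    # near the start of the XML-like stream.
    stream = file.openstream("PDA 3D Raw Data/3D Data Item").read().decode('utf-8')[0:1000]
    num = stream[stream.find('<CN')+4:stream.find('</CN')]
    return int(num)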

In the first case, I just take the first bytes of the PDA raw data stream (3 bytes in the code above), which are repeated before each data block, and simply count how often they appear. In my case it was 3564 times.

Then I screened my data file (mainly by eye) and found the number '3564' in the stream "PDA 3D Raw Data/3D Data Item", which makes perfect sense. This is an XML-like stream similar to the one from which your code extracts the start and end times. Unfortunately, my XML parser cannot read it without an error, although it easily handles many other XML streams in the same data file. I simply made a workaround and use a string operation to get the number, but this may fail if the number has a different length. But from this you know where to find it.
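The two ways of getting the dataset count described above can be sketched a bit more defensively as follows. This is only a sketch under assumptions: the helper names `count_block_headers` and `parse_cn_count` are hypothetical, the marker length of 3 bytes follows the posted code, and the tag is assumed to be written exactly as `<CN>...</CN>`. Note that `bytes.count` counts non-overlapping matches, which could differ from the sliding comparison in the posted loop if the marker can overlap itself.

```python
import re


def count_block_headers(stream: bytes, marker_len: int = 3) -> int:
    """Count how often the first marker_len bytes of the stream recur."""
    marker = stream[:marker_len]
    return stream.count(marker)  # non-overlapping matches


def parse_cn_count(xml_bytes: bytes):
    """Return the integer between <CN> and </CN>, or None if it is absent.

    Assumes the tag is spelled exactly <CN>; undecodable bytes are skipped
    instead of raising, since a strict XML parser reportedly rejects the stream.
    """
    text = xml_bytes.decode("utf-8", errors="ignore")
    match = re.search(r"<CN>\s*(\d+)\s*</CN>", text)
    return int(match.group(1)) if match else None
```

With an `olefile.OleFileIO` object, the inputs would be the bytes returned by `openstream(...).read()` for the two stream paths named above.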

For the fluorometer data: these instruments are connected to the main Shimadzu instrument via an analog connection. The first thing to know is which channel it is connected to. Most likely it is an early one; in my case it is Channel 1. When screening the streams of the data file, there are streams for a large number of channels, but by looking at the size of each stream (many are empty) I found the data in "LSS Raw Data/Chromatogram Ch1"!
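The search for non-empty channel streams could be automated roughly like this. It is a sketch, not the author's code: `nonempty_streams` is a hypothetical helper, and it only relies on the `listdir()` and `get_size()` methods that `olefile.OleFileIO` provides.

```python
def nonempty_streams(ole, prefix=None):
    """Return (path, size) pairs for non-empty streams, largest first.

    ole is expected to behave like olefile.OleFileIO: listdir() yields each
    stream as a list of path components, get_size() takes a '/'-joined path.
    """
    found = []
    for entry in ole.listdir():
        if prefix is not None and entry[0] != prefix:
            continue  # skip streams outside the requested storage
        path = "/".join(entry)
        size = ole.get_size(path)
        if size > 0:
            found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Usage (requires the olefile package; the file name is a placeholder):
# import olefile
# ole = olefile.OleFileIO("run.lcd")
# print(nonempty_streams(ole, prefix="LSS Raw Data"))
```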

I could not find any additional information in other Channel 1 streams, e.g. the length of the dataset, which differs from the PDA dataset because the instrument works at a different frequency. Luckily, the data format of the stream and its decoding are the same as for the PDA data, so I can use your block decoding scheme. The differences from the PDA are logical: it is only a single dataset (the time series of the fluorescence of one excitation/emission channel). While each PDA dataset (one per PDA spectrum) consists of two data blocks, the fluorescence data have many more data blocks (in my case 18). But here we do not need a fixed order when reading the data in. I simply loop over all the blocks until the end of the data stream.

For the time axis I am assuming that the start and end times are the same as for the PDA.
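Under that assumption, a time axis for the fluorescence signal could be built as below. This is a minimal sketch: `time_axis` is a hypothetical helper, and it additionally assumes evenly spaced samples with both end points included, which the thread does not confirm.

```python
def time_axis(start, end, n_points):
    """Evenly spaced time values from start to end (inclusive)."""
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    if n_points == 1:
        return [start]
    step = (end - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]
```

Here `n_points` would be the length of the decoded signal, and `start` and `end` the run times taken from the PDA metadata.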

I have not found out how the scale of the values needs to be adjusted. I am getting very large values of up to 10^6 in the peaks, so I divided by 10^6.

I can directly compare the results with the data in the Shimadzu software: I am in the same range but about a factor of 4 too low. The instrument is set to Gain 4, but it is not exactly a factor of 4.

However here are my python functions for this:

import struct

def read_shimadzu_fluor_raw(file, n_lambdas=None):
    # file: olefile.OleFileIO object; n_lambdas is not used here
    pos = 0
    stream = file.openstream("LSS Raw Data/Chromatogram Ch1").read()
    mat, no_data = decode_shimadzu_fluor_block(stream, pos)
    return mat, no_data

def decode_shimadzu_fluor_block(fid, pos):
    # fid: the whole stream as bytes
    # twos(): two's-complement helper for a bit string (not included in this snippet)
    pos = pos + 8
    n_lambda = struct.unpack('<h', fid[pos:pos+2])[0]      # used as length of the output signal
    pos = pos + 4
    block_length = struct.unpack('<h', fid[pos:pos+2])[0]  # read but not used below
    pos = pos + 12
    signal = [0] * n_lambda
    count = 0
    # bufer[0]: running value, bufer[1]: current difference,
    # bufer[2]: first byte as hex string, bufer[3]: following bytes as hex string
    bufer = [0, 0, 0, 0]
    while pos < len(fid):  # loop over all blocks until the end of the stream
        n_bytes = struct.unpack('<h', fid[pos:pos+2])[0]   # size of this block in bytes
        pos = pos + 2
        start = pos
        while pos < start + n_bytes:
            bufer[2] = format(struct.unpack('B', fid[pos:pos+1])[0], '02x')
            hex1 = int(bufer[2][0], 16)  # first hex digit of the byte (0-f)
            pos = pos + 1
            if hex1 == 0:
                bufer[1] = int(bufer[2], 16)
            elif hex1 == 1:
                bin1 = format(int(bufer[2], 16), '08b')
                bufer[1] = twos(bin1[4:8])
            elif hex1 > 1:
                no = hex1 // 2  # number of following bytes that belong to this value
                if hex1 > 3:
                    q1 = []
                    for i in range(no):
                        q1.append(format(struct.unpack('B', fid[pos+i:pos+1+i])[0], '02x'))
                    bufer[3] = ''.join(q1)
                else:
                    bufer[3] = format(struct.unpack('B', fid[pos:pos+no])[0], '02x')
                pos = pos + no
                bin1 = bufer[2] + bufer[3]
                bin1 = format(int(bin1, 16), '08b')
                # use the bits from position 3 onward (positive or negative difference)
                if hex1 % 2 == 0:
                    bufer[1] = int(bin1[2:len(bin1)], 2)
                else:
                    bufer[1] = twos(bin1[2:len(bin1)])

            bufer[0] += bufer[1]               # delta decoding: add the difference to the running value
            signal[count] = bufer[0] / 1000000
            count += 1
        end = struct.unpack('<h', fid[pos:pos+2])[0]  # 2 bytes closing the block
        pos = pos + 2
        bufer[0] = 0
    return signal, len(signal)
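The decoder above calls a `twos` helper that is not shown in the thread. A minimal sketch of what such a helper might look like, assuming it performs a standard two's-complement interpretation of a bit string (presumably mirroring `twos_complement` in the R code, which is not confirmed here):

```python
def twos(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a two's-complement integer.

    Assumption: the leading character is the sign bit and the width is
    simply the length of the string.
    """
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)  # subtract 2**width for negative numbers
    return value
```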

Regarding cutting the binary string at position 5 when reading the data:
Looking at the maximum byte length of each data value in the delta-encoded string: for the PDA data this is only 3, for the fluorometer it is 7! The resulting bit strings for the PDA always have zeros between positions 3 and 5, i.e. cutting at position 5 does not change the integer value of the bit string. This changes when the bit length gets longer. So with your code most fluorometer values are decoded correctly, just not when the bit length is >4 to 5.
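The effect of the cut position can be illustrated with two made-up bit strings. They are purely illustrative and not taken from a real .lcd file, and `value_after_cut` is a hypothetical helper that uses 1-indexed positions as in R's `substr`.

```python
def value_after_cut(bits: str, first_kept: int) -> int:
    """Unsigned value of bits after dropping everything before the
    1-indexed position first_kept."""
    return int(bits[first_kept - 1:], 2)


# Positions 3 and 4 are zero: both cuts give the same value.
small = "10" + "00" + "01100100"
print(value_after_cut(small, 5), value_after_cut(small, 3))  # 100 100

# Positions 3 and 4 carry data: cutting at position 5 loses the high bits.
large = "10" + "11" + "01100100"
print(value_after_cut(large, 5), value_after_cut(large, 3))  # 100 868
```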

Hope this is helpful.
Let me know when there are further questions.
I am now stopping for some vacation. My next step is to get information about the instrument's calibration factors and spectral libraries. The spectral library is an SPC file; no clue yet how to deal with that!

Rüdiger

@ethanbass
Owner

Thanks Rüdiger -- this looks great! I wonder if you'd be willing to share one or two test files from your instrument? I'm not sure I have an analog stream in any of the files I currently have access to. I'd definitely be interested to hear about what you find if you make any headway with the spectral libraries.
Thanks again and I hope you have a nice vacation!
Ethan

@kco-hereon
Author

Archiv.zip

Hi Ethan,
here is one original .lcd file from our pigment measurements and a .mat file with the data my code produces from it (very simplistic!)

Rüdiger

@ethanbass
Owner

Thanks!

@ethanbass ethanbass assigned ethanbass and unassigned ethanbass Mar 18, 2024
@ethanbass ethanbass added the enhancement New feature or request label Mar 18, 2024
@charumeenah

We are also working on parsing the .lcd file from Shimadzu LC-40. Using the above python code, we have extracted LSS Raw Data - Chromatogram ch1 data. Thanks for it!
We are now trying to read and parse the peak table and display parameter streams ['Chromatogram Parameters', 'Display Parameter-1-1'], ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1'].
We have no clue how to proceed with decoding these streams. Any lead would be appreciated!

-Charu

@ethanbass
Owner

Personally I haven't really looked into these streams too much -- I was mostly interested in being able to extract the data from the DAD detector -- but I would be curious to hear what you figure out.

@charumeenah

charumeenah commented May 20, 2024

Here’s the file with all the streams that I got from the olefile module in Python. If anyone has any idea how to decode it (the peak table stream - ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1']), I’d appreciate the help.
sample (1).txt

@actolonen

Dear Ethan,
Thank you for your pioneering work enabling analysis of chromatography data. I would like to use chromConverter to parse .lcd files from our Shimadzu HPLC such as this file as follows:

data = read_shimadzu_lcd(path, format_out = "data.frame", data_format = "long", read_metadata = TRUE)

However, I get this error:
Error in seq_len(n_lambda) :
argument must be coercible to non-negative integer

Could you please advise? Thanks, Andy

@ethanbass
Owner

Hi Andy,
Thanks for reporting this. I was able to reproduce the error. I should have time to look into this more later in the week and hopefully track down where the problem is. Will keep you posted.
Best,
Ethan

@actolonen

actolonen commented Jul 30, 2024 via email

@ethanbass
Owner

ethanbass commented Aug 3, 2024

Hi Andy,
I had a look at your file and the PDA stream seems to be empty? What kind of detector does your instrument have? Also do you have a screenshot (or better yet, a text file) you could share showing what the chromatogram is supposed to look like?
Ethan

@actolonen

actolonen commented Aug 4, 2024 via email

@ethanbass
Owner

ethanbass commented Aug 4, 2024

Ahh ok. that makes sense. It's not unexpected if you don't have a PDA detector, it's just that the only parser I've written so far is for the PDA stream. Luckily I think those streams use the same encoding. Does the shape of this chromatogram look right to you? I think there is a scaling factor encoded somewhere in the file -- I'm not yet sure where.

image

Do you perhaps have a screenshot of how the two streams (the refractive index and UV) look for the file you shared with me? Or are you expecting two streams? So far I've only been able to find one stream in your file.

@actolonen

actolonen commented Aug 4, 2024 via email

@ethanbass
Owner

Yes, that would be great if it's not too much trouble. Also are you expecting there to be more than one data stream in this file?

@ethanbass
Owner

ethanbass commented Aug 6, 2024

Hi Andy,
I pushed an update to the master branch that should be able to read the 2D chromatograms from your files. Please let me know if you find any issues. I believe there is a scaling factor which I have not yet been able to locate in the files, so the scale of the chromatograms may not be correct.
Ethan

@actolonen

actolonen commented Aug 7, 2024 via email

@ethanbass
Owner

Wonderful! The scaling factor is encoded somewhere in the file, but I haven't yet been able to figure out where it is. I hope with some more digging I can find where this value is encoded and scale the chromatograms accordingly. In another file I have from another instrument it is 0.1% so the 0.3% scaling factor is not consistent between instruments.

Regarding the two detectors, I suspect that the function should be able to provide the data from both streams, but it would be great if you could update me on that. Also, if you could provide me with another example file with both data streams, that would be great!

Ethan

@ethanbass
Owner

@actolonen
There isn't any chance that the signal could actually be scaled by a factor of 1/1000 (.001), is there? I found a field that I think would make sense as the scaling factor, but it would imply that the chromatogram should be scaled by .001 rather than .003.
Ethan

@ethanbass
Owner

Hi Andy,
I just pushed a version with support for reading more of the metadata from LCD files and it also scales chromatograms by what I think is the scaling factor (.001 in your case). You should be able to toggle the scaling off by specifying scale = FALSE.
Ethan

@actolonen

actolonen commented Aug 14, 2024 via email

@actolonen

Hi Ethan,
Following our success reading .lcd files with data from a single detector, I got a set of multi-channel HPLC files that contain chromatogram data from the two detectors on our HPLC. Detector A is a UV/VIS SPD-20A and Detector B is a refractive index RID-10A. Detector A has two channels: channel 1 is at 260 nm and channel 2 is at 210 nm.

Here are the .lcd files:
https://github.com/actolonen/Analysis_Lab/tree/main/HPLC/ChromConverter/Files_LabSolutions/Files_aug24

I ran read_shimadzu_lcd() as follows:

data = read_shimadzu_lcd(
path = inputfile,
format_out = "data.frame",
what = "chromatogram",
data_format = "long",
read_metadata = TRUE);

This gives the following error.

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 3359, 3360

This looks like a simple error that nrows doesn't equal ncols, but I don't know how to troubleshoot this with .lcd files. Could you please advise?

best,
andy

@ethanbass
Owner

ethanbass commented Aug 26, 2024

Hi Andy,

I'm having trouble reproducing this error with the files you provided. Can you double check what version of chromConverter you're currently running? Also maybe you can run traceback() after the error to see what is precipitating it. The first two chromatograms in all of your files are 3359 rows while the third is 3360 rows, but I am not receiving the error.

Thanks!
Ethan

By the way, the new version I'm working on in the dev branch should be faster for reading the shimadzu LCD files and also has better behavior for handling multiple chromatograms. For example, it will return them as a single data.frame instead of as a list of data.frames when data_format == long.

@ethanbass
Owner

Also would it be alright with you if I include one of your multi-channel shimadzu files as a test file in my chromConverterExtraTests repository?

@actolonen

Hi Ethan, I confirm that chromConverter works great on our multi-detector .lcd files:
https://github.com/actolonen/Analysis_Lab/blob/main/HPLC/ChromConverter/2024.08_test_chromConverter.html

My error was just due to the chromatograms from the different detectors having different numbers of rows.
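Why a long format sidesteps this kind of error can be sketched in a language-neutral way. The Python below is an illustration only, not chromConverter's implementation, and `to_long` is a hypothetical helper: chromatograms of different lengths cannot share columns of one wide table, but they stack without trouble as (detector, time, intensity) rows.

```python
def to_long(chromatograms):
    """Stack chromatograms of possibly different lengths into long-format rows.

    chromatograms maps a detector name to a (times, intensities) pair.
    Returns a list of (detector, time, intensity) tuples.
    """
    rows = []
    for name, (times, values) in chromatograms.items():
        if len(times) != len(values):
            raise ValueError(f"{name}: time and intensity lengths differ")
        rows.extend((name, t, v) for t, v in zip(times, values))
    return rows
```

For example, one chromatogram with 3359 points and another with 3360 would simply yield 6719 rows.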

I would be delighted if you include one of our multi-channel .lcd files in your chromConverterExtraTests repo.

thanks!
andy

@ethanbass
Owner

Excellent. Thanks Andy!

@ethanbass
Owner

ethanbass commented Aug 27, 2024

I still don't understand why the intensities are off. I think the values exported in Shimadzu are being rounded or smoothed somehow but I can't figure out how. It's strange, because the other Shimadzu files I have access to are exact.

@actolonen

Hi Ethan, Just as a quick update: we are routinely using chromConverter to extract chromatograms from .lcd files using our three detectors (RID, UV-210 nm, UV-260 nm). Thanks so much for your great work! The issue of the Lab Solutions peak scaling factor is still obscure. However, we include a set of standard solutions at different concentrations in each plate that we use to quantify compound concentrations. So, my impression is that the scaling factor doesn't matter. Do you agree? best, andy

@ethanbass
Owner

ethanbass commented Oct 16, 2024 via email

4 participants