ihm fails on PDBDEV_00000088 #53

aozalevsky · 2022-05-20T11:03:47Z

ihm fails at several points while parsing entry 88:

Traceback (most recent call last):
  File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 74, in __init__
    self.system, = ihm.reader.read(fh, model_class=self.model)
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3298, in read
    more_data = r.read_file()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 594, in read_file
    return self._read_file_c()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

is the result of the sentence:

Typically, 14<B7>106 to 20<B7>106 photons were recorded at TAC channel-width of 14.1\xa0ps (IBH-5000U) or 8\xa0ps (EasyTau300).

The other error:

  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 471889: invalid start byte

originates from:

Sample conditions for the EPR experiments were 100 <B5>M protein in 100 mM NaCl, 50 mM Tris-HCl, 5 mM MgCl2, pH 7.4 dissolved in D2O with 12.5 % (v/v) glycerol-d8.

And finally, after deleting the symbols that caused the previous errors:

Traceback (most recent call last):
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
_format.FileFormatError: Wrong number of data values in loop (should be an exact multiple of the number of keys) at line 1940098

@benmwebb @brindakv I need your help on that.


benmwebb · 2022-05-20T15:21:38Z

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.
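The "try UTF-8, fall back to latin-1" approach described here can be sketched roughly as follows. This is only an illustration, not the code used in python-ihm or IHMValidation: `parse_with_fallback` and its `parse` callback are hypothetical names, and the demo uses a throwaway file rather than a real mmCIF entry.

```python
import os
import tempfile


def parse_with_fallback(path, parse, encodings=("utf-8", "latin-1")):
    """Call parse(fh) on the file, retrying with the next encoding on decode errors."""
    last_error = None
    for encoding in encodings:
        try:
            # The file is reopened for every attempt, so parsing restarts from the top.
            with open(path, encoding=encoding) as fh:
                return parse(fh)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("no encodings given")
    raise last_error


# Demo on a throwaway file holding the raw byte 0xb5 from the traceback above.
with tempfile.NamedTemporaryFile(suffix=".cif", delete=False) as tmp:
    tmp.write(b"100 \xb5M protein\n")
try:
    decoded = parse_with_fallback(tmp.name, lambda fh: fh.read())
finally:
    os.remove(tmp.name)
```

With python-ihm, the callback would presumably be something like `lambda fh: ihm.reader.read(fh, model_class=...)`. Because latin-1 maps every byte to a character, the second attempt cannot raise UnicodeDecodeError, though it may decode text wrongly if the file actually uses some other encoding.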

Wrong number of data values in loop

I don't see that if I read the file as latin-1. My guess is you broke something during your edits.

That being said, there is an issue with this file:

loop_
_flr_fret_calibration_parameters.id
_flr_fret_calibration_parameters.phi_acceptor
_flr_fret_calibration_parameters.alpha
_flr_fret_calibration_parameters.alpha_sd
_flr_fret_calibration_parameters.gG_gR_ratio
_flr_fret_calibration_parameters.beta
_flr_fret_calibration_parameters.gamma
_flr_fret_calibration_parameters.delta
_flr_fret_calibration_parameters.a_b
1 '.' '.' '.' '.' '.' '.' '.' '.'

Those '.' entries should all be just plain . of course. @brindakv can fix that upstream.
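To illustrate why the quoting matters: in mmCIF an unquoted `.` (or `?`) marks a value as omitted (or unknown), while a quoted `'.'` is an ordinary string consisting of a dot. The sketch below is an assumption-laden toy, not the python-ihm implementation; `cif_float` is a hypothetical helper that works on a single raw token.

```python
def cif_float(raw_token):
    """Convert one raw mmCIF token to a float, or None for an omitted/unknown value."""
    # Unquoted . and ? are the special "omitted" and "unknown" markers.
    if raw_token in (".", "?"):
        return None
    # A quoted token is a literal string, even if its content is just a dot.
    if len(raw_token) >= 2 and raw_token[0] == raw_token[-1] and raw_token[0] in "'\"":
        raw_token = raw_token[1:-1]
    return float(raw_token)  # raises ValueError for a literal "."


plain_dot = cif_float(".")
try:
    cif_float("'.'")
    quoted_dot_error = None
except ValueError as err:
    quoted_dot_error = str(err)
```

Here the plain `.` comes back as `None`, while the quoted `'.'` reaches `float()` and raises a ValueError.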

benmwebb · 2022-05-20T16:13:22Z

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.

See e.g. c992574

aozalevsky · 2022-05-20T18:11:13Z

Thank you for pointing out the docs. The first time I saw this boilerplate in the code, I was curious where it came from. The boilerplate is intact.

IHMValidation/master/pyext/src/validation/__init__.py

Lines 72 to 77 in c3b01ca

    
    try:
        with open(self.mmcif_file, encoding='utf8') as fh:
            self.system, = ihm.reader.read(fh, model_class=self.model)
    except UnicodeDecodeError:
        with open(self.mmcif_file, encoding='ascii', errors='ignore') as fh:
            self.system, = ihm.reader.read(fh, model_class=self.model)

I'll add some description of this part to the code later.

It looks like I missed a part of the traceback, though:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

During handling of the above exception, another exception occurred:

<...>
ValueError: could not convert string to float: '.'

So the UnicodeDecodeError was actually caught by the except clause and the code switched to ASCII, but parsing then failed again on the data itself.
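As a side note on what the ASCII fallback does to the text: with `errors='ignore'`, bytes that cannot be decoded are silently dropped, whereas latin-1 keeps them as characters. A small sketch using a shortened version of the sentence from the first traceback (the byte string is typed in by hand here, not read from the entry):

```python
# Fragment of the offending sentence, with the raw 0xb7 bytes in place.
raw = b"Typically, 14\xb7106 to 20\xb7106 photons"

# errors="ignore" discards every byte outside the ASCII range.
as_ascii = raw.decode("ascii", errors="ignore")

# latin-1 maps 0xb7 to U+00B7 (middle dot), so nothing is lost.
as_latin1 = raw.decode("latin-1")
```

So the ASCII fallback lets the file parse, at the cost of altering free-text fields such as this one.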

brindakv · 2022-05-20T21:48:38Z

The data in the mmCIF file is now updated: changed '.' to .

aozalevsky · 2022-05-20T22:48:24Z

Thanks, @brindakv, parsing now works OK.

Another issue popped up, though. It is related to the software section:

#
loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
1 FPS 'Model building' . . Program .
2 NMSim 'Model building' . . Program http://www.nmsim.de
3 'Amber 14' 'Model building' . . Program .
4 'DeerAnalysis2006' 'Data analysis' . . Program .

The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?

benmwebb · 2022-05-20T23:10:36Z

The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?

We need to provide links for each piece of software for the validation pages; see https://github.com/salilab/IHMValidation/blob/main/templates/references.csv for the file we maintain locally to fill in any missing links.

You can look up whether particular data items are required in the dictionary itself. e.g. https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_software.version.html (location and version are not required there).

aozalevsky · 2022-05-20T23:33:15Z

Thanks for the clarification! Yeah, I saw the references.csv. However, at the moment the code explicitly uses links from a cif file:

IHMValidation/master/pyext/src/validation/__init__.py

Line 355 in c3b01ca

ref_loc = '<a href="'+software.location+'">'+software.location+"</a>"

software here is an instance of the ihm.Software class. So if the idea is to complement the cif file with the data from references.csv, I'll modify this block.
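One possible shape for such a modification, as a sketch only: prefer the location from the cif file, fall back to a locally maintained lookup, and degrade to plain text when neither is available. `software_link` and the `fallback_links` dict are hypothetical; the actual layout of references.csv and the behaviour of read_all_references are not shown in this thread, and the example URL for FPS is a placeholder.

```python
from html import escape


def software_link(name, location, fallback_links):
    """Return an HTML link for a piece of software, or its escaped name if no URL is known."""
    # Use the cif file's location when present, else the local lookup (assumed name -> URL).
    url = location or fallback_links.get(name)
    if url is None:
        return escape(name)  # no link available: show the name instead of failing
    safe_url = escape(url, quote=True)
    return '<a href="' + safe_url + '">' + safe_url + "</a>"


from_cif = software_link("NMSim", "http://www.nmsim.de", {})
from_lookup = software_link("FPS", None, {"FPS": "https://example.org/fps"})
no_link = software_link("Amber 14", None, {})
```

This avoids the failure that occurs when a missing location is concatenated into the string, and it escapes the URL before embedding it in HTML.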

benmwebb · 2022-05-20T23:43:12Z

See read_all_references in the same file for the function that reads the CSV.

aozalevsky · 2024-10-24T15:53:13Z

The problem was addressed in #53 (comment)

benmwebb assigned brindakv May 20, 2022

aozalevsky mentioned this issue May 24, 2022

SAS processing is unaware of measuring units #55

Open

aozalevsky mentioned this issue Sep 27, 2022

New refs for software. Allow missing refs for software. Debug mode. #61

Open

aozalevsky closed this as completed Oct 24, 2024
