ihm fails on PDBDEV_00000088 #53

Closed
aozalevsky opened this issue May 20, 2022 · 9 comments · May be fixed by #61

@aozalevsky
Contributor

aozalevsky commented May 20, 2022

There are multiple points in entry 88 where ihm fails during parsing:

Traceback (most recent call last):                                                                    
  File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 74, in __init__
    self.system, = ihm.reader.read(fh, model_class=self.model)                                        
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3298, in read       
    more_data = r.read_file()                                                                         
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 594, in read_file
    return self._read_file_c()                                                                        
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c  
    eof, more_data = _format.ihm_read_file(self._c_format)                                            
  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)                                
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

This is caused by the sentence:

Typically, 14<B7>106 to 20<B7>106 photons were recorded at TAC channel-width of 14.1\xa0ps (IBH-5000U) or 8\xa0ps (EasyTau300).

The other error:

  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode                                                                                                                                       
    (result, consumed) = self._buffer_decode(data, self.errors, final)                                
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 471889: invalid start byte  

It originates from:

Sample conditions for the EPR experiments were 100 <B5>M protein in 100 mM NaCl, 50 mM Tris-HCl, 5 mM MgCl2, pH 7.4 dissolved in D2O with 12.5 % (v/v) glycerol-d8.
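One way to track down sentences like the two above is to decode the raw bytes and read the offset from the exception. The following is only a sketch: `find_undecodable` is a hypothetical helper, not part of python-ihm, and it assumes the whole file has been read into memory as bytes.

```python
def find_undecodable(data, encoding="utf-8", context=20):
    """Return (offset, surrounding bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte the codec rejected
        lo = max(0, err.start - context)
        return err.start, data[lo:err.start + context]
    return None
```

For example, `find_undecodable(open(path, "rb").read())` returns the offset and a short window of bytes around it, which is usually enough to locate the sentence in a text editor.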

And finally, after deleting the symbols that caused the previous errors:

Traceback (most recent call last):
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
_format.FileFormatError: Wrong number of data values in loop (should be an exact multiple of the number of keys) at line 1940098

@benmwebb @brindakv I need your help with this.

@benmwebb
Member

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.

Wrong number of data values in loop

I don't see that if I read the file as latin-1. My guess is you broke something during your edits.

That being said, there is an issue with this file:

loop_
_flr_fret_calibration_parameters.id
_flr_fret_calibration_parameters.phi_acceptor
_flr_fret_calibration_parameters.alpha
_flr_fret_calibration_parameters.alpha_sd
_flr_fret_calibration_parameters.gG_gR_ratio
_flr_fret_calibration_parameters.beta
_flr_fret_calibration_parameters.gamma
_flr_fret_calibration_parameters.delta
_flr_fret_calibration_parameters.a_b
1 '.' '.' '.' '.' '.' '.' '.' '.'

Those '.' entries should all be just plain . of course. @brindakv can fix that upstream.
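To illustrate why the quoting matters: in mmCIF an unquoted . marks an omitted value, while a quoted '.' is a literal one-character string, so a reader that expects a number tries to convert it. The sketch below uses a hypothetical mini-tokenizer (`parse_float_row`) that handles only single quotes and, as a simplification, treats ? the same as .; it is not how python-ihm is implemented.

```python
import re

# Hypothetical tokenizer for one whitespace-separated mmCIF data row.
# Real mmCIF quoting rules (double quotes, multiline text) are richer.
_TOKEN = re.compile(r"'([^']*)'|(\S+)")

def parse_float_row(line):
    values = []
    for m in _TOKEN.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if bare is not None:
            # Unquoted . (omitted) or ? (unknown) carries no value
            values.append(None if bare in (".", "?") else float(bare))
        else:
            # A quoted token is literal text; float('.') raises ValueError
            values.append(float(quoted))
    return values
```

With this sketch, `parse_float_row("1 . .")` yields `[1.0, None, None]`, whereas `parse_float_row("1 '.' '.'")` raises `ValueError`.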

@benmwebb
Member

You are assuming the file is UTF-8 encoded. Not all mmCIF files are. Usually I try reading as UTF-8 and fall back to latin-1 if that fails. See https://python-ihm.readthedocs.io/en/latest/reader.html. IIRC Sai was doing that for at least part of the validation pipeline.

See e.g. c992574
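A minimal sketch of the "try UTF-8, fall back to latin-1" approach described here. The helper name `read_with_fallback` and its `parse` callback are assumptions made for illustration; they are not part of python-ihm or of the validation code.

```python
def read_with_fallback(path, parse, encodings=("utf-8", "latin-1")):
    """Call parse(fh) with each encoding in turn until decoding succeeds."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                # parse must finish reading before the file is closed
                return parse(fh)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error
```

With python-ihm, `parse` could be something like `lambda fh: ihm.reader.read(fh)`. Because latin-1 maps every byte to a character, the final attempt cannot raise `UnicodeDecodeError`.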

@aozalevsky
Contributor Author

Thank you for pointing out the docs. The first time I saw this boilerplate in the code, I was curious where it came from. The boilerplate is intact.

try:
    with open(self.mmcif_file, encoding='utf8') as fh:
        self.system, = ihm.reader.read(fh, model_class=self.model)
except UnicodeDecodeError:
    with open(self.mmcif_file, encoding='ascii', errors='ignore') as fh:
        self.system, = ihm.reader.read(fh, model_class=self.model)

I'll add some description of this part to the code later.

It looks like I missed a part of the traceback, though:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

During handling of the above exception, another exception occurred:

<...>
ValueError: could not convert string to float: '.'

So the UnicodeDecodeError was actually caught by the except clause, and the code switched to ASCII, but parsing then failed again on the data itself.
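As a side note on the fallback branch quoted above: opening the file as ASCII with errors='ignore' silently drops the bytes that UTF-8 rejected, whereas latin-1 decodes them. A small illustration, using a made-up byte string modeled on the two sentences from the entry:

```python
# Made-up sample containing the two bytes reported in the tracebacks
raw = b"100 \xb5M protein; 14\xb7106 photons"

# errors='ignore' discards every non-ASCII byte
as_ascii = raw.decode("ascii", errors="ignore")   # '100 M protein; 14106 photons'

# latin-1 maps 0xb5 to the micro sign and 0xb7 to the middle dot
as_latin1 = raw.decode("latin-1")                 # '100 µM protein; 14·106 photons'
```

Both variants let parsing continue, but only the latin-1 variant keeps the text of the affected descriptions intact.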

@brindakv
Collaborator

The data in the mmCIF file is now updated: changed '.' to .

@aozalevsky
Contributor Author

Thanks, @brindakv, parsing works now.

Another issue popped up, though. It is related to the software section:

#
loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
1 FPS 'Model building' . . Program .
2 NMSim 'Model building' . . Program http://www.nmsim.de
3 'Amber 14' 'Model building' . . Program .
4 'DeerAnalysis2006' 'Data analysis' . . Program .

The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?

@benmwebb
Member

The code fails on missing links for the software. I can easily fix it, but wonder if it's ok to have missing versions and links?

We need to provide links for each piece of software for the validation pages; see https://github.com/salilab/IHMValidation/blob/main/templates/references.csv for the file we maintain locally to fill in any missing links.

You can look up whether particular data items are required in the dictionary itself. e.g. https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_software.version.html (location and version are not required there).

@aozalevsky
Contributor Author

Thanks for the clarification! Yeah, I saw the references.csv. However, at the moment the code explicitly uses links from a cif file:

ref_loc = '<a href="'+software.location+'">'+software.location+"</a>"

software here is an instance of the ihm.Software class. So if the idea is to complement the cif file with data from references.csv, I'll modify this block.
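A rough sketch of what such a modification might look like. The function `software_link` and the `fallback_links` dictionary (software name mapped to URL) are hypothetical; the actual layout of references.csv and the existing helper that reads it are not assumed here.

```python
import html

def software_link(name, location, fallback_links):
    """Build the HTML link for one software entry.

    fallback_links is a hypothetical dict {software name: URL}, e.g. filled
    from a locally maintained file; it is not an existing API.
    """
    # Prefer the location from the cif file, then the local fallback
    url = location or fallback_links.get(name)
    if not url:
        # No link known anywhere: show the plain name instead of failing
        return html.escape(name)
    safe = html.escape(url, quote=True)
    return '<a href="{0}">{0}</a>'.format(safe)
```

Called as `software_link(software.name, software.location, links)`, this avoids the string concatenation failing when `software.location` is `None`.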

@benmwebb
Member

See read_all_references in the same file for the function that reads the CSV.

@aozalevsky
Contributor Author

The problem was addressed in #53 (comment)
