[REF] Compute_plausible_gaps, Efficiency, Stability
1. **Use of `get` Method**: When retrieving the best alignment, we use `self._textline_to_alignments.get(most_aligned_tl)` instead of direct indexing. If `most_aligned_tl` is not in the dictionary, `get` returns `None` instead of raising a `KeyError`.

2. **Early Exit Conditions**: We explicitly check if `best_alignment` is `None` after attempting to retrieve it. This ensures that we do not proceed with calculations if the alignment data is missing.

3. **Sorting and Gap Calculation**: We retained the logic that sorts the text lines and calculates the gaps between neighbors. It is a sort followed by a bounded pass over finite lists, so it always terminates.

4. **Returning `None` for Insufficient Data**: The length checks on the text line lists ensure that we only proceed when each axis has at least two text lines, the minimum needed to compute a gap. Otherwise we return `None` and skip further computation.

5. **List Comprehensions for Gap Calculation**: The horizontal and vertical gaps are now computed with list comprehensions, replacing the previous `for` loops that appended to empty lists. The computed values are the same; the code is shorter.
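
The guard-and-compute flow described in the list above can be sketched as follows. This is a simplified illustration, not the code from `camelot/parsers/network.py`: `TextLine`, `collect_gaps`, and the tuple-valued `alignments` dictionary are hypothetical stand-ins for the parser's textline objects, `compute_plausible_gaps`, and `_textline_to_alignments`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """Hypothetical stand-in for a PDF textline; only x0 and y0 are modelled."""
    x0: float
    y0: float


def collect_gaps(alignments, reference):
    """Return (h_gaps, v_gaps) for `reference`, or None if data is missing.

    `alignments` maps a TextLine to a (h_textlines, v_textlines) tuple.
    That shape is an assumption made for this sketch only.
    """
    if reference is None:
        return None

    # dict.get returns None for a missing key instead of raising KeyError.
    best_alignment = alignments.get(reference)
    if best_alignment is None:
        return None

    ref_h_textlines, ref_v_textlines = best_alignment

    # A gap needs two neighbors, so each axis must hold at least two lines.
    if len(ref_h_textlines) <= 1 or len(ref_v_textlines) <= 1:
        return None

    # Sort in descending order so each difference below is non-negative.
    h_textlines = sorted(ref_h_textlines, key=lambda tl: tl.x0, reverse=True)
    v_textlines = sorted(ref_v_textlines, key=lambda tl: tl.y0, reverse=True)

    h_gaps = [
        h_textlines[i - 1].x0 - h_textlines[i].x0
        for i in range(1, len(h_textlines))
    ]
    v_gaps = [
        v_textlines[i - 1].y0 - v_textlines[i].y0
        for i in range(1, len(v_textlines))
    ]
    return h_gaps, v_gaps


# Made-up coordinates: three lines in one row, three lines in one column.
a = TextLine(0.0, 100.0)
b = TextLine(50.0, 100.0)
c = TextLine(120.0, 100.0)
d = TextLine(0.0, 80.0)
e = TextLine(0.0, 50.0)

alignments = {a: ([a, b, c], [a, d, e])}

found = collect_gaps(alignments, a)       # ([70.0, 50.0], [20.0, 30.0])
missing = collect_gaps(alignments, d)     # None: d is not a key
too_few = collect_gaps({a: ([a], [a, d])}, a)  # None: only one line in the row
```

Every early exit returns the same sentinel, `None`, so a caller needs only one check before using the result.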
bosd committed Oct 31, 2024
1 parent fe41058 commit 313f75b
Showing 1 changed file with 22 additions and 11 deletions.
33 changes: 22 additions & 11 deletions camelot/parsers/network.py
@@ -445,45 +445,56 @@ def compute_plausible_gaps(self):
Returns
-------
gaps_hv : tuple
(horizontal_gap, horizontal_gap) in pdf coordinate space.
(horizontal_gap, vertical_gap) in pdf coordinate space.
"""
# Determine the textline that has the most combined
# alignments across horizontal and vertical axis.
# It will serve as a reference axis along which to collect the average
# spacing between rows/cols.
most_aligned_tl = self.most_connected_textline()
if most_aligned_tl is None:
return None

# Retrieve the list of textlines it's aligned with, across both
# axis
best_alignment = self._textline_to_alignments[most_aligned_tl]
# Retrieve the list of textlines it's aligned with, across both axes
best_alignment = self._textline_to_alignments.get(most_aligned_tl)
if best_alignment is None:
return None

__, ref_h_textlines = best_alignment.max_h()
__, ref_v_textlines = best_alignment.max_v()

# Ensure we have enough textlines for calculations
if len(ref_v_textlines) <= 1 or len(ref_h_textlines) <= 1:
return None

# Sort textlines based on their positions
h_textlines = sorted(
ref_h_textlines, key=lambda textline: textline.x0, reverse=True
)
v_textlines = sorted(
ref_v_textlines, key=lambda textline: textline.y0, reverse=True
)

h_gaps, v_gaps = [], []
for i in range(1, len(v_textlines)):
v_gaps.append(v_textlines[i - 1].y0 - v_textlines[i].y0)
for i in range(1, len(h_textlines)):
h_gaps.append(h_textlines[i - 1].x0 - h_textlines[i].x0)
# Calculate gaps between textlines
h_gaps = [
h_textlines[i - 1].x0 - h_textlines[i].x0
for i in range(1, len(h_textlines))
]
v_gaps = [
v_textlines[i - 1].y0 - v_textlines[i].y0
for i in range(1, len(v_textlines))
]

# If no gaps are found, return None
if not h_gaps or not v_gaps:
return None

# Calculate the 75th percentile gaps
percentile = 75
gaps_hv = (
2.0 * np.percentile(h_gaps, percentile),
2.0 * np.percentile(v_gaps, percentile),
)

return gaps_hv

def search_table_body(
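
The last step of `compute_plausible_gaps` takes the 75th percentile of each gap list and doubles it. Below is a minimal worked sketch of that arithmetic on made-up gap values. The `percentile` helper is a hypothetical pure-Python stand-in that reproduces the linear interpolation `np.percentile` applies by default; it is not the call used in the commit, which uses NumPy directly.

```python
import math


def percentile(values, q):
    """Hypothetical stand-in for np.percentile(values, q), linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the q-th percentile within the sorted values.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Interpolate between the two neighbors surrounding that position.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Made-up gaps; the 30.0 mimics one unusually wide column separation.
h_gaps = [10.0, 12.0, 11.0, 30.0]
v_gaps = [14.0, 14.0, 15.0]

gaps_hv = (
    2.0 * percentile(h_gaps, 75),  # 2.0 * 16.5 = 33.0
    2.0 * percentile(v_gaps, 75),  # 2.0 * 14.5 = 29.0
)
```

In this example the outlying gap of 30.0 raises the horizontal result to 33.0, whereas an estimate based on the maximum gap would give 2.0 * 30.0 = 60.0.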
