Small improvements to improve debugging and flexibility #97

Merged
merged 5 commits into from
Jul 28, 2024

tuzz
Contributor

@tuzz tuzz commented Jul 25, 2024

Hello, this PR includes a few small improvements that should help users with debugging and give them more flexibility over which attributes are preserved and which nodes are removed by clean_conditionally.

  1. Allow whitelisting all attributes by setting attributes: ["*"]
  2. Allow setting options[:debug] to a function, e.g. so that you can send messages to the Rails logger
  3. Fix sibling content not being stripped when checking its length
  4. Allow setting options[:clean_conditionally] to a function so that you can override the default decision

None of the above changes alters Readability's existing behaviour, apart from the small bug fix in 3).
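
As a rough illustration of items 1 and 2, here is a minimal sketch of an options hash that whitelists all attributes and routes debug output to a logger. It assumes the :debug callable is invoked with a single message string, which this thread does not confirm; the `logger` built on `StringIO` is only a stand-in for `Rails.logger` so that the snippet runs outside a Rails app, and the sample message is made up.

```ruby
require "logger"
require "stringio"

# Stand-in for Rails.logger so the sketch runs outside a Rails app.
log_output = StringIO.new
logger = Logger.new(log_output)

# Assumption: readability calls the :debug callable with one message string.
debug_to_logger = ->(message) { logger.debug("[readability] #{message}") }

# Illustrative options hash combining items 1 and 2 from the list above.
options = {
  attributes: ["*"],      # item 1: whitelist all attributes
  debug: debug_to_logger  # item 2: send debug messages to the logger
}

# Simulate a single (made-up) debug message to show where it ends up.
options[:debug].call("Conditionally cleaned div.advert")
puts log_output.string
```

In a Rails app the lambda body would presumably be `Rails.logger.debug(message)`, with the hash passed as the options argument to `Readability::Document.new`.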

Thanks for your consideration.

@cantino cantino merged commit d3f562c into cantino:master Jul 28, 2024
1 check passed
Owner

cantino commented Jul 28, 2024

Thanks @tuzz, these changes look reasonable!

@tuzz tuzz deleted the small-improvements branch July 29, 2024 11:17
Owner

cantino commented Aug 29, 2024

Released in 0.7.2

avk commented Aug 29, 2024

I appreciate the test case for options[:clean_conditionally], but I still don't quite understand how to use it. Do you have any other examples @tuzz?

Would it be worth elaborating in the README?

Contributor Author

tuzz commented Aug 29, 2024

@avk I can try to explain. I can add something to the README if it would be helpful.

Basically, readability tries to extract the "useful" content from the page. It does this by scoring elements and then extracting the one with the highest score. Within that element, there might be sub-elements that aren't particularly useful. For example, a news article might have an advert banner in the middle of it. Because of this, readability does a second pass called "clean conditionally" where it tries to remove those sorts of elements based on some hardcoded rules. If you switch on debug mode it makes it easier to understand which elements have been "cleaned conditionally".

In some cases, however, it might remove elements that you don't want it to (or keep elements that it shouldn't). The change I introduced allows you to intervene and override readability's decision using your own lambda. The lambda is provided with some context that includes the HTML element, the element's score, the decision that readability made about whether to remove it, etc. The return value of the lambda should be whether to remove the element or not. For example, if you set clean_conditionally to the following lambda, you'd invert every decision readability made about whether to remove an element:

clean_conditionally: lambda do |context|
  !context[:remove]
end

Perhaps a more useful lambda would be one where you force readability to always remove a specific piece of content:

clean_conditionally: lambda do |context|
  if context[:el].text.include?("Visit our blog")
    true # Always remove elements that contain 'Visit our blog'
  else
    context[:remove] # Otherwise, remove the element according to readability's default rules.
  end
end
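
To try a rule like this without fetching a real page, the lambda can be exercised against hand-built contexts. The sketch below is only an illustration: `fake_element` is a hypothetical stand-in for the element readability passes as context[:el] (it models nothing but `text`), the phrases are invented, and only the :el and :remove keys shown above are used. It also adds a "force keep" branch, which returns false to stop readability removing an element.

```ruby
# Hypothetical stand-in for the element passed as context[:el];
# only the #text method used by the rules is modelled.
fake_element = Struct.new(:text)

# Invented phrases, chosen purely for illustration.
always_remove = ["Visit our blog"]
always_keep   = ["Correction:"]

clean_conditionally = lambda do |context|
  text = context[:el].text
  if always_remove.any? { |phrase| text.include?(phrase) }
    true   # force removal
  elsif always_keep.any? { |phrase| text.include?(phrase) }
    false  # force keeping, even if readability wanted to remove it
  else
    context[:remove] # defer to readability's default decision
  end
end

# Hand-built contexts: :remove holds the decision readability would have made.
promo = { el: fake_element.new("Visit our blog for more"), remove: false }
note  = { el: fake_element.new("Correction: the date was wrong"), remove: true }
other = { el: fake_element.new("Plain paragraph"), remove: true }

p clean_conditionally.call(promo) # => true
p clean_conditionally.call(note)  # => false
p clean_conditionally.call(other) # => true
```

The same lambda would then be supplied as the clean_conditionally option, for instance `Readability::Document.new(html, clean_conditionally: clean_conditionally)`, where `html` is whatever page source you are processing.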

Hopefully that helps!

avk commented Aug 29, 2024

@tuzz thank you; fantastic and thorough! Excited to experiment with this.
