Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue where "CDATA" was incorrectly used as content type in atom entries #123

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rpsh
Copy link

@rpsh rpsh commented Aug 10, 2023

According to the Atom specification, the acceptable values for the type of <content> are text, html, and xhtml.
This PR fixes the previous code that incorrectly used CDATA as the type value.

@lkiesow
Copy link
Owner

lkiesow commented Dec 22, 2023

If you carefully read the specification, that actually not quite true:

In the most common case, the type attribute is either text, html, xhtml,

But just that these are the most common cases does not mean that these are the only acceptable values. In fact, the specification goes on describing other values right after that.

That being said, you are right that if you want the content to be enclosed by <![CDATA[...]]>, that doesn't say anything about the media type and that should probably be separated? Just stating that it is HTML is also not correct. I'm open for suggestions.

@taesungh
Copy link

taesungh commented Feb 3, 2024

While the specification does describe other acceptable values for the type attribute, the <content type="CDATA"> elements created by feedgen fail the W3C Feed Validation Service:

Not a valid MIME type: CDATA (Invalid MIME Type)

I found <content:encoded> is often used for RSS feeds, but I'm not sure what the best approach is for Atom.

@michaelnordmeyer
Copy link

Content type html quoted by <!CDATA[...]> is the proper way to do it, IMHO, if the content should be parsed as HTML.

From my understanding the type is a hint for clients to know the content's format, which only applies after parsing the XML.

<!CDATA[...]> is for parsers to ignore any content within.

The different content types are text, html, and xhtml. If the content has characters which break XML parsing, this has to be mitigated.

True xhtml content won't break XML parsing, because it's a SGML subset. This type cannot be used anymore, because most people use HTML5, which is not a subset of SGML.

html content can break XML parsing, because it's not a subset of SGML, and has to be XML-escaped or put within <!CDATA[...]>.

text is the same like html, because any text can have XML-breaking characters.

After the XML parser has parsed the feed, apps can parse the content according to the content type, which is only necessary for html and xhtml, in order to display it to the user.

@Alkarex
Copy link

Alkarex commented Mar 8, 2024

The spec is clear: <content type="CDATA"> needs to be decoded as Base64
https://www.rfc-editor.org/rfc/rfc4287#section-4.1.3.3
For the record, we (at FreshRSS) are receiving reports of bad/wrong feeds, which I am guessing are likely produced by this library
FreshRSS/FreshRSS#6180
See also another example of wrong decoding with SimplePie, the library used e.g. by WordPress
https://simplepie.org/demo/?feed=https%3A%2F%2Fpeterwunder.de%2Fprojects%2Fachievements%2Fatom.xml

@Alkarex
Copy link

Alkarex commented Mar 19, 2024

@rpsh or anyone with write access to fix the conflicts here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants