Fix RSS 1.0 data modeling #47

matthew-carroll · 2024-01-14T07:42:23Z

While working on RSS 1.0 serialization, I discovered that the data model seems to be wrong and incomplete. We should fix the data model so that it captures all information from an RSS 1.0 document.

Here's a copy of what I found while working on serialization:

I think the existing RSS 1.0 data model is incorrect. Here's a basic RSS 1.0 example from the test directory:

<?xml version="1.0"?>

<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://purl.org/rss/1.0/"
>

    <channel rdf:about="http://www.xml.com/xml/news.rss">
        <title>XML.com</title>
        <link>http://xml.com/pub</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>

        <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif"/>

        <items>
            <rdf:Seq>
                <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html"/>
                <rdf:li resource="http://xml.com/pub/2000/08/09/rdfdb/index.html"/>
            </rdf:Seq>
        </items>

        <textinput rdf:resource="http://search.xml.com"/>

    </channel>

    <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
        <title>XML.com</title>
        <link>http://www.xml.com</link>
        <url>http://xml.com/universal/images/xml_tiny.gif</url>
    </image>

    <item rdf:about="http://xml.com/pub/2000/08/09/xslt/xslt.html">
        <title>Processing Inclusions with XSLT</title>
        <link>http://xml.com/pub/2000/08/09/xslt/xslt.html</link>
        <description>Processing document inclusions with general XML tools can be problematic. This article proposes a way of preserving inclusion information through SAX-based processing.</description>
    </item>

    <item rdf:about="http://xml.com/pub/2000/08/09/rdfdb/index.html">
        <title>Putting RDF to Work</title>
        <link>http://xml.com/pub/2000/08/09/rdfdb/index.html</link>
        <description>
            Tool and API support for the Resource Description Framework
            is slowly coming of age. Edd Dumbill takes a look at RDFDB,
            one of the most exciting new RDF toolkits.
        </description>
    </item>

    <textinput rdf:about="http://search.xml.com">
        <title>Search XML.com</title>
        <description>Search XML.com's XML collection</description>
        <name>s</name>
        <link>http://search.xml.com</link>
    </textinput>

</rdf:RDF>

Here's the spec for RSS 1.0: https://validator.w3.org/feed/docs/rss1.html#s5.5

Yet, here's the property list from rss1_feed.dart:

  final String? title;
  final String? description;
  final String? link;
  final String? image;
  final List<Rss1Item> items;
  final UpdatePeriod? updatePeriod;
  final int? updateFrequency;
  final DateTime? updateBase;
  final DublinCore? dc;

The parsing behavior is as follows:

    final document = XmlDocument.parse(xmlString);
    XmlElement rdfElement;
    try {
      rdfElement = document.findAllElements('rdf:RDF').first;
    } on StateError {
      throw ArgumentError('channel not found');
    }

    final channel = rdfElement.findElements('channel');
    return Rss1Feed(
      title: findElementOrNull(rdfElement, 'title')?.innerText,
      link: findElementOrNull(rdfElement, 'link')?.innerText,
      description: findElementOrNull(rdfElement, 'description')?.innerText,
      items: rdfElement.findElements('item').map((element) => Rss1Item.parse(element)).toList(),
      image: findElementOrNull(rdfElement, 'image')?.getAttribute('rdf:resource'),
      updatePeriod: _parseUpdatePeriod(
        findElementOrNull(rdfElement, 'sy:updatePeriod')?.innerText,
      ),
      updateFrequency: parseInt(
        findElementOrNull(rdfElement, 'sy:updateFrequency')?.innerText,
      ),
      updateBase: parseDateTime(
        findElementOrNull(rdfElement, 'sy:updateBase')?.innerText,
      ),
      dc: channel.isEmpty ? null : DublinCore.parse(rdfElement.findElements('channel').first),
    );

We can see that this object parses the whole document, so it should capture enough information to recover the document, but it doesn't.
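To make the gap concrete, here is a small sketch that lists the elements sitting directly under rdf:RDF, each of which the data model would need to represent in order to recover the document. It assumes package:xml, which the quoted parser already uses; topLevelElementNames is a hypothetical helper, not something in the package.

```dart
import 'package:xml/xml.dart';

/// Hypothetical helper: qualified names of the elements directly under <rdf:RDF>.
List<String> topLevelElementNames(String xmlString) {
  final document = XmlDocument.parse(xmlString);
  // Same lookup as the quoted parser; throws StateError if there is no rdf:RDF.
  final rdfElement = document.findAllElements('rdf:RDF').first;
  // childElements skips text and comment nodes and keeps document order.
  return rdfElement.childElements.map((e) => e.name.qualified).toList();
}
```

For the example document above, this should give channel, image, item, item, and textinput. Of these, the quoted property list has no slot for textinput and only a single string for image.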

We can see that the parser pulls the title, description and link from the top-level RDF element, as it should.

We can see that the parser collects and parses all the top-level items within the RDF element, as it should.

However, the top-level image is reduced to a single attribute, despite the fact that the image can contain a title, link, and url. So we seem to be losing information. Based on a quick check of the spec, it looks like this parser might be confusing two different images. There's an image element directly under the RDF element, which is the one we want. Then there's an image element under the channel element, which only references the top-level one through its rdf:resource attribute. This parser is treating the image like the channel version, but it should be treating it like the top-level RDF version.
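A rough sketch of how the top-level image could be modeled, not the package's actual API. Rss1Image and parseTopLevelImage are hypothetical names, and the code assumes package:xml as used by the quoted parser. The field list follows the image element in the example document (rdf:about, title, link, url).

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the <image> element directly under <rdf:RDF>.
class Rss1Image {
  const Rss1Image({this.about, this.title, this.link, this.url});

  /// Value of rdf:about, which the channel's image reference points to.
  final String? about;
  final String? title;
  final String? link;
  final String? url;
}

/// Reads the <image> that is a direct child of <rdf:RDF>, not the
/// reference-only <image rdf:resource="..."/> nested inside <channel>.
Rss1Image? parseTopLevelImage(XmlElement rdfElement) {
  // findElements only looks at direct children, so the channel's image is skipped.
  final images = rdfElement.findElements('image');
  if (images.isEmpty) return null;

  final image = images.first;
  return Rss1Image(
    about: image.getAttribute('rdf:about'),
    title: image.getElement('title')?.innerText,
    link: image.getElement('link')?.innerText,
    url: image.getElement('url')?.innerText,
  );
}
```

The channel's own image could then stay a plain string holding the rdf:resource value, which should match the about value of the parsed Rss1Image.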

Also, the textinput top-level element isn't parsed at all, despite being a part of the specification.
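As a starting point, the textinput element could get its own type as well. This is only a sketch under the same assumptions as the quoted parser (package:xml); Rss1TextInput and parseTextInput are hypothetical names, and the fields mirror the textinput element in the example document (rdf:about, title, description, name, link).

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the <textinput> element directly under <rdf:RDF>.
class Rss1TextInput {
  const Rss1TextInput({
    this.about,
    this.title,
    this.description,
    this.name,
    this.link,
  });

  final String? about;
  final String? title;
  final String? description;
  final String? name;
  final String? link;
}

/// Returns null when the document has no top-level <textinput>.
Rss1TextInput? parseTextInput(XmlElement rdfElement) {
  // Direct children only: the <textinput rdf:resource="..."/> inside
  // <channel> is a reference and is not matched here.
  final inputs = rdfElement.findElements('textinput');
  if (inputs.isEmpty) return null;

  final input = inputs.first;
  return Rss1TextInput(
    about: input.getAttribute('rdf:about'),
    title: input.getElement('title')?.innerText,
    description: input.getElement('description')?.innerText,
    name: input.getElement('name')?.innerText,
    link: input.getElement('link')?.innerText,
  );
}
```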

toseefkhan403 · 2024-04-01T11:56:18Z

Hi @matthew-carroll, can I take up this issue?

matthew-carroll · 2024-04-02T02:50:32Z

@toseefkhan403 I'm already working on it. If you also need this work to be done, please be sure to describe the situation you're facing and why this change would be useful for you.

matthew-carroll added the type_bug (Something isn't working), area_rss1, bounty_donation (Non-compensated work), and p1 (Critical to solve but not immediate) labels Jan 14, 2024

matthew-carroll self-assigned this Jan 14, 2024

matthew-carroll mentioned this issue Jan 14, 2024

Serialize RSS 1.0 documents #46

Open
