
Backend model changes on COSMOS to hold new incoming URLs #1051

Closed
bishwaspraveen opened this issue Oct 2, 2024 · 1 comment · Fixed by #1069
bishwaspraveen commented Oct 2, 2024

Description

When collections are reindexed, the content that is being brought in may change. To support this, we'll need to make the necessary backend model changes on COSMOS to identify and store newly scraped content.

Implementation Considerations

The initial design doc has been saved here for reference.

New Models and Migration

  • New, cleaner base model: Url
    • It should omit the legacy fields we never use, such as:
      • inferenced_by
      • present_on_test / present_on_prod
      • is_pdf
      • etc.
  • Add three new models: DumpUrl, DeltaUrl, CuratedUrl
    • DeltaUrl should have a boolean field, delete
  • Migrate all existing data for curated collections into CuratedUrl
  • Migrate all existing data for uncurated collections into DeltaUrl
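The issue does not spell out how DumpUrl and CuratedUrl rows get reconciled into DeltaUrl rows, so the following is only an illustrative sketch of one plausible reconciliation rule, written as a plain function rather than Django ORM code: URLs that are new or changed in the dump become deltas, and curated URLs missing from the dump become deltas flagged with `delete=True` (the boolean field called out above). The function name and dict shapes are hypothetical.

```python
def compute_deltas(dump_urls: dict[str, str], curated_urls: dict[str, str]) -> dict[str, dict]:
    """Sketch of DumpUrl-vs-CuratedUrl reconciliation into DeltaUrl rows.

    Both inputs map url -> scraped_title. Returns url -> delta row, where
    each delta row carries the new title and a `delete` flag.
    """
    deltas: dict[str, dict] = {}

    # URLs that are new in the dump, or whose title changed, become deltas.
    for url, title in dump_urls.items():
        if url not in curated_urls or curated_urls[url] != title:
            deltas[url] = {"scraped_title": title, "delete": False}

    # Curated URLs missing from the new dump are flagged for deletion.
    for url in curated_urls:
        if url not in dump_urls:
            deltas[url] = {"scraped_title": "", "delete": True}

    return deltas
```

Under this assumed rule, a delta set is empty exactly when the dump matches the curated state, which is what makes reindexing idempotent.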

Deliverable

Necessary backend model changes on COSMOS to identify and save newly scraped URLs during reindexing.

@bishwaspraveen bishwaspraveen changed the title Create the delta model on COSMOS model to hold new incoming URLs Backend model changes on COSMOS to hold new incoming URLs Oct 2, 2024
CarsonDavis (Collaborator) commented:
Use this as the starting place for the Url model:

# Assumed imports; Collection, CandidateURLManager, DocumentTypes, and
# Divisions are defined elsewhere in the COSMOS codebase.
import os
from urllib.parse import urlparse

from django.db import models


class Url(models.Model):
    """A candidate URL scraped for a given collection."""

    collection = models.ForeignKey(Collection, on_delete=models.CASCADE, related_name="candidate_urls")
    url = models.CharField("URL")
    # hash = models.CharField("Hash", max_length=32, blank=True, default="1")
    scraped_title = models.CharField(
        "Scraped Title",
        default="",
        blank=True,
        help_text="This is the original title scraped by Sinequa",
    )
    generated_title = models.CharField(
        "Generated Title",
        default="",
        blank=True,
        help_text="This is the title generated based on a Title Pattern",
    )
    visited = models.BooleanField(default=False)
    objects = CandidateURLManager()
    document_type = models.IntegerField(choices=DocumentTypes.choices, null=True)
    division = models.IntegerField(choices=Divisions.choices, null=True)
    delete = models.BooleanField(default=False)

    class Meta:
        """Meta definition for Candidate URL."""

        verbose_name = "Candidate URL"
        verbose_name_plural = "Candidate URLs"
        ordering = ["url"]

    @property
    def fileext(self) -> str:
        # Parse the URL to get the path
        parsed_url = urlparse(self.url)
        path = parsed_url.path

        # Check for cases where the path ends with a slash or is empty, implying a directory or default file
        if path.endswith("/") or not path:
            return "html"

        # Extract the extension from the path
        extension = os.path.splitext(path)[1]

        # Default to .html if no extension is found
        if not extension:
            return "html"

        if extension.startswith("."):
            return extension[1:]
        return extension

    def splits(self) -> list[tuple[str, str]]:
        """Split the path into multiple collections."""
        parts = []
        part_string = ""
        for part in self.path.split("/"):
            if part:
                part_string += f"/{part}"
                parts.append((part_string, part))
        return parts

    @property
    def path(self) -> str:
        parsed = urlparse(self.url)
        path = f"{parsed.path}"
        if parsed.query:
            path += f"?{parsed.query}"
        return path

    def __str__(self) -> str:
        return self.url

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)

# Fields to remove from Url model
    # test_title = models.CharField(
    #     "Title on Test Server",
    #     default="",
    #     blank=True,
    #     help_text="This is the title present on Test Server",
    # )
    # production_title = models.CharField(
    #     "Title on Production Server",
    #     default="",
    #     blank=True,
    #     help_text="This is the title present on Production Server",
    # )
    # level = models.IntegerField("Level", default=0, blank=True, help_text="Level in the tree. Based on /.")
    # inferenced_by = models.CharField(
    #     "Inferenced By",
    #     default="",
    #     blank=True,
    #     help_text="This keeps track of who inferenced document type",
    # )
    # is_pdf = models.BooleanField(
    #     "Is PDF",
    #     default=False,
    #     help_text="This keeps track of whether the given url is pdf or not",
    # )
    # present_on_test = models.BooleanField(
    #     "URL Present In Test Environment?",
    #     default=False,
    #     help_text="Helps keep track if the Current URL is present in test environment or not",
    # )
    # present_on_prod = models.BooleanField(
    #     "URL Present In Production?",
    #     default=False,
    #     help_text="Helps keep track if the Current URL is present in production or not",
    # )
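The `fileext`, `path`, and `splits` members of the model above are pure URL-string logic, so they can be exercised outside Django. The sketch below mirrors them as standalone functions (hypothetical names, same behavior) to make the parsing rules concrete: trailing-slash or empty paths default to `html`, and `splits` yields cumulative path prefixes paired with each segment.

```python
import os
from urllib.parse import urlparse


def fileext(url: str) -> str:
    """Mirror of Url.fileext: extension of the URL path, defaulting to html."""
    path = urlparse(url).path
    # A trailing slash or empty path implies a directory or default file.
    if path.endswith("/") or not path:
        return "html"
    extension = os.path.splitext(path)[1]
    if not extension:
        return "html"
    if extension.startswith("."):
        return extension[1:]
    return extension


def url_path(url: str) -> str:
    """Mirror of Url.path: the path plus the query string, if any."""
    parsed = urlparse(url)
    path = parsed.path
    if parsed.query:
        path += f"?{parsed.query}"
    return path


def splits(path: str) -> list[tuple[str, str]]:
    """Mirror of Url.splits: cumulative prefixes paired with each segment."""
    parts = []
    part_string = ""
    for part in path.split("/"):
        if part:
            part_string += f"/{part}"
            parts.append((part_string, part))
    return parts
```

For example, `fileext("https://example.gov/docs/report.pdf")` returns `"pdf"`, while a bare domain or a directory-style path falls back to `"html"`.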
