
Backend model changes on COSMOS to hold new incoming URLs #1051

Closed
bishwaspraveen opened this issue Oct 2, 2024 · 1 comment · Fixed by #1069
bishwaspraveen commented Oct 2, 2024

Description

When collections are reindexed, the content that is being brought in may change. To support this, we'll need to make the necessary backend model changes on COSMOS to identify and store newly scraped content.

Implementation Considerations

The initial design doc has been saved here for reference.

New Models and Migration

  • New, cleaner base model: Url
    • It should omit the legacy fields we never use, such as:
      • inferenced_by
      • present_on_test / present_on_prod
      • is_pdf
      • etc.
  • Add three new models: DumpUrl, DeltaUrl, CuratedUrl
    • DeltaUrl should have a boolean field, delete
  • Migrate all existing data for curated collections into CuratedUrl
  • Migrate all existing data for uncurated collections into DeltaUrl
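The issue does not spell out how DumpUrl and CuratedUrl rows get reconciled into DeltaUrl rows, so the following is only an illustrative sketch of one plausible reconciliation rule, written as a plain function rather than Django ORM code: URLs that are new or changed in the dump become deltas, and curated URLs missing from the dump become deltas flagged with `delete=True` (the boolean field called out above). The function name and dict shapes are hypothetical.

```python
def compute_deltas(dump_urls: dict[str, str], curated_urls: dict[str, str]) -> dict[str, dict]:
    """Sketch of DumpUrl-vs-CuratedUrl reconciliation into DeltaUrl rows.

    Both inputs map url -> scraped_title. Returns url -> delta row, where
    each delta row carries the new title and a `delete` flag.
    """
    deltas: dict[str, dict] = {}

    # URLs that are new in the dump, or whose title changed, become deltas.
    for url, title in dump_urls.items():
        if url not in curated_urls or curated_urls[url] != title:
            deltas[url] = {"scraped_title": title, "delete": False}

    # Curated URLs missing from the new dump are flagged for deletion.
    for url in curated_urls:
        if url not in dump_urls:
            deltas[url] = {"scraped_title": "", "delete": True}

    return deltas
```

Under this assumed rule, a delta set is empty exactly when the dump matches the curated state, which is what makes reindexing idempotent.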

Deliverable

Necessary backend model changes on COSMOS to identify and save newly scraped URLs during reindexing.

@bishwaspraveen bishwaspraveen changed the title Create the delta model on COSMOS model to hold new incoming URLs Backend model changes on COSMOS to hold new incoming URLs Oct 2, 2024
CarsonDavis (Collaborator) commented:
Use this as the starting place for the Url model:

# Assumed imports; Collection, CandidateURLManager, DocumentTypes, and
# Divisions are defined elsewhere in the COSMOS codebase.
import os
from urllib.parse import urlparse

from django.db import models


class Url(models.Model):
    """A candidate URL scraped for a given collection."""

    collection = models.ForeignKey(Collection, on_delete=models.CASCADE, related_name="candidate_urls")
    url = models.CharField("URL")
    # hash = models.CharField("Hash", max_length=32, blank=True, default="1")
    scraped_title = models.CharField(
        "Scraped Title",
        default="",
        blank=True,
        help_text="This is the original title scraped by Sinequa",
    )
    generated_title = models.CharField(
        "Generated Title",
        default="",
        blank=True,
        help_text="This is the title generated based on a Title Pattern",
    )
    visited = models.BooleanField(default=False)
    objects = CandidateURLManager()
    document_type = models.IntegerField(choices=DocumentTypes.choices, null=True)
    division = models.IntegerField(choices=Divisions.choices, null=True)
    delete = models.BooleanField(default=False)

    class Meta:
        """Meta definition for Candidate URL."""

        verbose_name = "Candidate URL"
        verbose_name_plural = "Candidate URLs"
        ordering = ["url"]

    @property
    def fileext(self) -> str:
        # Parse the URL to get the path
        parsed_url = urlparse(self.url)
        path = parsed_url.path

        # Check for cases where the path ends with a slash or is empty, implying a directory or default file
        if path.endswith("/") or not path:
            return "html"

        # Extract the extension from the path
        extension = os.path.splitext(path)[1]

        # Default to .html if no extension is found
        if not extension:
            return "html"

        if extension.startswith("."):
            return extension[1:]
        return extension

    def splits(self) -> list[tuple[str, str]]:
        """Split the path into multiple collections."""
        parts = []
        part_string = ""
        for part in self.path.split("/"):
            if part:
                part_string += f"/{part}"
                parts.append((part_string, part))
        return parts

    @property
    def path(self) -> str:
        parsed = urlparse(self.url)
        path = f"{parsed.path}"
        if parsed.query:
            path += f"?{parsed.query}"
        return path

    def __str__(self) -> str:
        return self.url

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)

# Fields to remove from Url model
    # test_title = models.CharField(
    #     "Title on Test Server",
    #     default="",
    #     blank=True,
    #     help_text="This is the title present on Test Server",
    # )
    # production_title = models.CharField(
    #     "Title on Production Server",
    #     default="",
    #     blank=True,
    #     help_text="This is the title present on Production Server",
    # )
    # level = models.IntegerField("Level", default=0, blank=True, help_text="Level in the tree. Based on /.")
    # inferenced_by = models.CharField(
    #     "Inferenced By",
    #     default="",
    #     blank=True,
    #     help_text="This keeps track of who inferenced document type",
    # )
    # is_pdf = models.BooleanField(
    #     "Is PDF",
    #     default=False,
    #     help_text="This keeps track of whether the given url is pdf or not",
    # )
    # present_on_test = models.BooleanField(
    #     "URL Present In Test Environment?",
    #     default=False,
    #     help_text="Helps keep track if the Current URL is present in test environment or not",
    # )
    # present_on_prod = models.BooleanField(
    #     "URL Present In Production?",
    #     default=False,
    #     help_text="Helps keep track if the Current URL is present in production or not",
    # )
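The `fileext`, `path`, and `splits` members of the model above are pure URL-string logic, so they can be exercised outside Django. The sketch below mirrors them as standalone functions (hypothetical names, same behavior) to make the parsing rules concrete: trailing-slash or empty paths default to `html`, and `splits` yields cumulative path prefixes paired with each segment.

```python
import os
from urllib.parse import urlparse


def fileext(url: str) -> str:
    """Mirror of Url.fileext: extension of the URL path, defaulting to html."""
    path = urlparse(url).path
    # A trailing slash or empty path implies a directory or default file.
    if path.endswith("/") or not path:
        return "html"
    extension = os.path.splitext(path)[1]
    if not extension:
        return "html"
    if extension.startswith("."):
        return extension[1:]
    return extension


def url_path(url: str) -> str:
    """Mirror of Url.path: the path plus the query string, if any."""
    parsed = urlparse(url)
    path = parsed.path
    if parsed.query:
        path += f"?{parsed.query}"
    return path


def splits(path: str) -> list[tuple[str, str]]:
    """Mirror of Url.splits: cumulative prefixes paired with each segment."""
    parts = []
    part_string = ""
    for part in path.split("/"):
        if part:
            part_string += f"/{part}"
            parts.append((part_string, part))
    return parts
```

For example, `fileext("https://example.gov/docs/report.pdf")` returns `"pdf"`, while a bare domain or a directory-style path falls back to `"html"`.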
