1051 backend model changes on cosmos to hold new incoming urls #1069

Merged
57 commits
9c7b25b
adding the new base URL model
bishwaspraveen Oct 10, 2024
115481d
adding the new dump url model
bishwaspraveen Oct 10, 2024
8af6102
adding the new delta url model
bishwaspraveen Oct 10, 2024
3f9c885
adding the new curated url model
bishwaspraveen Oct 10, 2024
3c9627f
adding the necessary migration file
bishwaspraveen Oct 10, 2024
2fcd346
adding a command file to migrate urls into delta and curated URL models
bishwaspraveen Oct 10, 2024
d691af3
added the new models into admin console
bishwaspraveen Oct 10, 2024
a17029f
removed url and dumpurl models from admin
bishwaspraveen Oct 14, 2024
8606581
edited the curated url api serializer used for indexing
bishwaspraveen Oct 14, 2024
0f8578c
changed the api endpoint to have an appropriate name
bishwaspraveen Oct 14, 2024
717eb53
changed the api view to point to the right curated url model
bishwaspraveen Oct 14, 2024
83cb35a
migration file with changes
bishwaspraveen Oct 14, 2024
bdce7bb
Merge branch 'dev' into 1051-backend-model-changes-on-cosmos-to-hold-…
bishwaspraveen Oct 23, 2024
6bf48ff
adding admin views for DumpURL and URL models
bishwaspraveen Nov 4, 2024
4836851
migration for the dump URL file
bishwaspraveen Nov 4, 2024
19feff8
adding tasks to compare and add URLs to the new models
bishwaspraveen Nov 4, 2024
7e24495
adding a save method for dump URL
bishwaspraveen Nov 4, 2024
e5e64f4
move all url models into the same file
CarsonDavis Nov 4, 2024
7a906b7
update admin url imports
CarsonDavis Nov 4, 2024
728a5b4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 4, 2024
df88c6b
squashed migrations
bishwaspraveen Nov 4, 2024
48592cb
updated import references
bishwaspraveen Nov 4, 2024
266082c
updated import references
bishwaspraveen Nov 4, 2024
c3e2aee
update import references
bishwaspraveen Nov 4, 2024
08e145c
add basic models file
CarsonDavis Nov 7, 2024
82ce8e0
Merge branch '1051-backend-model-changes-on-cosmos-to-hold-new-incomi…
CarsonDavis Nov 7, 2024
2ac12c9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2024
aee857c
update imports and add DumpUrl model
CarsonDavis Nov 7, 2024
44628cb
update admin to pull from the delta_urls file
CarsonDavis Nov 7, 2024
25fd9c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2024
dd2076f
remove deprecated url model
CarsonDavis Nov 7, 2024
9a0ed0c
Merge branch '1051-backend-model-changes-on-cosmos-to-hold-new-incomi…
CarsonDavis Nov 7, 2024
7c1c255
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2024
8bb7fc4
add helper functions to url models
CarsonDavis Nov 7, 2024
b775a98
update migrations to move to new deltaurls models
CarsonDavis Nov 7, 2024
5088094
Merge branch '1051-backend-model-changes-on-cosmos-to-hold-new-incomi…
CarsonDavis Nov 7, 2024
ae8c6cd
add promotion code and promotion tests
CarsonDavis Nov 7, 2024
f252957
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2024
7091246
add instruction for running the promotion test
CarsonDavis Nov 11, 2024
9839859
add testing libraries to base requirements
CarsonDavis Nov 11, 2024
91acc1d
begin writing delta_patterns
CarsonDavis Nov 11, 2024
84c8ed6
update the DeltaExcludePattern to use both new urls
CarsonDavis Nov 11, 2024
86bdf47
finalize the delta patterns
CarsonDavis Nov 11, 2024
c94f1d1
fix pointers to document pattern
CarsonDavis Nov 11, 2024
4c07dc5
add querysets and managers to the deltaurl model
CarsonDavis Nov 11, 2024
b563f1e
rewrite apply logic
CarsonDavis Nov 11, 2024
43adec0
remove generated title from dumpurl admin
CarsonDavis Nov 11, 2024
49960ac
remove circular migration with lazy loads
CarsonDavis Nov 11, 2024
da12327
add migrations for the new DeltaPattern models
CarsonDavis Nov 11, 2024
6d3fc58
add deltas to the admin and improve verbose names
CarsonDavis Nov 11, 2024
94800c7
Merge branch 'dev' into 1051-backend-model-changes-on-cosmos-to-hold-…
CarsonDavis Nov 12, 2024
2235811
add scraped_text to delta urls and merge migrations
CarsonDavis Nov 12, 2024
77a2496
change full_text task to point to DumpUrl
CarsonDavis Nov 12, 2024
8530f35
refactor process response to use raw payload
CarsonDavis Nov 12, 2024
f6455f1
add code to migrate dump to delta
CarsonDavis Nov 13, 2024
8193d8d
add tests for dump migration
CarsonDavis Nov 13, 2024
6f14664
add tests and refactor patterns
CarsonDavis Nov 14, 2024
3 changes: 3 additions & 0 deletions requirements/base.txt
@@ -25,8 +25,11 @@ django-cors-headers==4.4.0
django-filter==24.3
djangorestframework-datatables==0.7.2
djangorestframework==3.15.2
factory-boy==3.3.0
lxml==4.9.2
PyGithub==2.2.0
pytest-django==4.8.0
pytest==8.0.0
tqdm==4.66.3
unidecode==1.3.8
xmltodict==0.13.0
60 changes: 60 additions & 0 deletions sde_collections/admin.py
@@ -3,8 +3,14 @@
from django.contrib import admin, messages
from django.http import HttpResponse

from sde_collections.models.delta_patterns import (
    DeltaDivisionPattern,
    DeltaTitlePattern,
)

from .models.candidate_url import CandidateURL, ResolvedTitle
from .models.collection import Collection, WorkflowHistory
from .models.delta_url import CuratedUrl, DeltaResolvedTitle, DeltaUrl, DumpUrl
from .models.pattern import DivisionPattern, IncludePattern, TitlePattern
from .tasks import fetch_and_update_full_text, import_candidate_urls_from_api

@@ -317,9 +323,63 @@ class DivisionPatternAdmin(admin.ModelAdmin):
    search_fields = ("match_pattern", "division")


# deltas below
class DeltaTitlePatternAdmin(admin.ModelAdmin):
    """Admin View for DeltaTitlePattern"""

    list_display = (
        "match_pattern",
        "title_pattern",
        "collection",
        "match_pattern_type",
    )
    list_filter = (
        "match_pattern_type",
        "collection",
    )


class DeltaResolvedTitleAdmin(admin.ModelAdmin):
    list_display = ["title_pattern", "delta_url", "resolved_title", "created_at"]


class DeltaDivisionPatternAdmin(admin.ModelAdmin):
    list_display = ("collection", "match_pattern", "division")
    search_fields = ("match_pattern", "division")


class DumpUrlAdmin(admin.ModelAdmin):
    """Admin View for DumpUrl"""

    list_display = ("url", "scraped_title", "collection")
    list_filter = ("collection",)


class DeltaUrlAdmin(admin.ModelAdmin):
    """Admin View for DeltaUrl"""

    list_display = ("url", "scraped_title", "generated_title", "collection")
    list_filter = ("collection",)


class CuratedUrlAdmin(admin.ModelAdmin):
    """Admin View for CuratedUrl"""

    list_display = ("url", "scraped_title", "generated_title", "collection")
    list_filter = ("collection",)


admin.site.register(WorkflowHistory, WorkflowHistoryAdmin)
admin.site.register(CandidateURL, CandidateURLAdmin)
admin.site.register(TitlePattern, TitlePatternAdmin)
admin.site.register(IncludePattern)
admin.site.register(ResolvedTitle, ResolvedTitleAdmin)
admin.site.register(DivisionPattern, DivisionPatternAdmin)


admin.site.register(DeltaTitlePattern, DeltaTitlePatternAdmin)
admin.site.register(DeltaResolvedTitle, DeltaResolvedTitleAdmin)
admin.site.register(DeltaDivisionPattern, DeltaDivisionPatternAdmin)
admin.site.register(DumpUrl, DumpUrlAdmin)
admin.site.register(DeltaUrl, DeltaUrlAdmin)
admin.site.register(CuratedUrl, CuratedUrlAdmin)
59 changes: 59 additions & 0 deletions sde_collections/management/commands/migrate_urls.py
@@ -0,0 +1,59 @@
from django.core.management.base import BaseCommand

from sde_collections.models.candidate_url import CandidateURL
from sde_collections.models.collection import Collection
from sde_collections.models.collection_choice_fields import WorkflowStatusChoices
from sde_collections.models.curated_url import CuratedUrl
from sde_collections.models.delta_url import DeltaUrl


class Command(BaseCommand):
    help = "Migrate CandidateURLs to CuratedUrl or DeltaUrl based on collection workflow status"

    def handle(self, *args, **kwargs):
        # Migrate CandidateURLs for collections with CURATED or higher workflow status to CuratedUrl
        collections_for_curated = Collection.objects.filter(workflow_status__gte=WorkflowStatusChoices.CURATED)
        self.stdout.write(
            f"Migrating URLs for {collections_for_curated.count()} collections with CURATED or higher status..."
        )

        for collection in collections_for_curated:
            candidate_urls = CandidateURL.objects.filter(collection=collection)
            for candidate_url in candidate_urls:
                CuratedUrl.objects.create(
                    collection=candidate_url.collection,
                    url=candidate_url.url,
                    scraped_title=candidate_url.scraped_title,
                    generated_title=candidate_url.generated_title,
                    visited=candidate_url.visited,
                    document_type=candidate_url.document_type,
                    division=candidate_url.division,
                )
            self.stdout.write(
                f"Migrated {candidate_urls.count()} URLs from collection '{collection.name}' to CuratedUrl."
            )

        # Migrate CandidateURLs for collections with a status lower than CURATED to DeltaUrl
        collections_for_delta = Collection.objects.filter(workflow_status__lt=WorkflowStatusChoices.CURATED)
        self.stdout.write(
            f"Migrating URLs for {collections_for_delta.count()} collections with status lower than CURATED..."
        )

        for collection in collections_for_delta:
            candidate_urls = CandidateURL.objects.filter(collection=collection)
            for candidate_url in candidate_urls:
                DeltaUrl.objects.create(
                    collection=candidate_url.collection,
                    url=candidate_url.url,
                    scraped_title=candidate_url.scraped_title,
                    generated_title=candidate_url.generated_title,
                    visited=candidate_url.visited,
                    document_type=candidate_url.document_type,
                    division=candidate_url.division,
                    delete=False,
                )
            self.stdout.write(
                f"Migrated {candidate_urls.count()} URLs from collection '{collection.name}' to DeltaUrl."
            )

        self.stdout.write(self.style.SUCCESS("Migration complete."))
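The routing rule in the command above (CURATED or higher goes to CuratedUrl, anything lower goes to DeltaUrl) can be sketched without Django. This is a minimal illustration, not the project's code: the `Collection` dataclass, the `route_candidate_urls` helper, and the numeric value chosen for `CURATED` are all hypothetical stand-ins, and the sketch assumes, as the command does, that workflow status integers are ordered so that a larger value means a later stage.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for WorkflowStatusChoices.CURATED; the real integer
# value is not shown in this diff.
CURATED = 5


@dataclass
class Collection:
    """Hypothetical plain-Python stand-in for the Django Collection model."""

    name: str
    workflow_status: int
    candidate_urls: list = field(default_factory=list)


def route_candidate_urls(collections):
    """Split candidate URLs into curated and delta buckets by workflow status."""
    curated, delta = [], []
    for collection in collections:
        # CURATED or higher goes to the curated bucket, everything else to delta.
        target = curated if collection.workflow_status >= CURATED else delta
        target.extend((collection.name, url) for url in collection.candidate_urls)
    return curated, delta


collections = [
    Collection("alpha", workflow_status=6, candidate_urls=["https://a.example/1"]),
    Collection(
        "beta",
        workflow_status=2,
        candidate_urls=["https://b.example/1", "https://b.example/2"],
    ),
]
curated, delta = route_candidate_urls(collections)
```

Every collection lands in exactly one bucket because the two conditions (`>=` and `<`) are complementary, which mirrors the `__gte` and `__lt` filters in the command.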
146 changes: 146 additions & 0 deletions sde_collections/migrations/0059_url_curatedurl_deltaurl_dumpurl.py
@@ -0,0 +1,146 @@
# Generated by Django 4.2.9 on 2024-11-04 22:22

from django.db import migrations, models
import django.db.models.deletion


class Migration(migrations.Migration):

    dependencies = [
        ("sde_collections", "0058_candidateurl_division_collection_is_multi_division_and_more"),
    ]

    operations = [
        migrations.CreateModel(
            name="Url",
            fields=[
                ("id", models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID")),
                ("url", models.CharField(max_length=4096, verbose_name="URL")),
                (
                    "scraped_title",
                    models.CharField(
                        blank=True,
                        default="",
                        help_text="This is the original title scraped by Sinequa",
                        max_length=1024,
                        verbose_name="Scraped Title",
                    ),
                ),
                (
                    "generated_title",
                    models.CharField(
                        blank=True,
                        default="",
                        help_text="This is the title generated based on a Title Pattern",
                        max_length=1024,
                        verbose_name="Generated Title",
                    ),
                ),
                ("visited", models.BooleanField(default=False)),
                (
                    "document_type",
                    models.IntegerField(
                        choices=[
                            (1, "Images"),
                            (2, "Data"),
                            (3, "Documentation"),
                            (4, "Software and Tools"),
                            (5, "Missions and Instruments"),
                        ],
                        null=True,
                    ),
                ),
                (
                    "division",
                    models.IntegerField(
                        choices=[
                            (1, "Astrophysics"),
                            (2, "Biological and Physical Sciences"),
                            (3, "Earth Science"),
                            (4, "Heliophysics"),
                            (5, "Planetary Science"),
                            (6, "General"),
                        ],
                        null=True,
                    ),
                ),
                (
                    "collection",
                    models.ForeignKey(
                        on_delete=django.db.models.deletion.CASCADE,
                        related_name="urls",
                        to="sde_collections.collection",
                    ),
                ),
            ],
            options={
                "verbose_name": "URL",
                "verbose_name_plural": "URLs",
                "ordering": ["url"],
            },
        ),
        migrations.CreateModel(
            name="CuratedUrl",
            fields=[
                (
                    "url_ptr",
                    models.OneToOneField(
                        auto_created=True,
                        on_delete=django.db.models.deletion.CASCADE,
                        parent_link=True,
                        primary_key=True,
                        serialize=False,
                        to="sde_collections.url",
                    ),
                ),
            ],
            options={
                "verbose_name": "Curated URL",
                "verbose_name_plural": "Curated URLs",
            },
            bases=("sde_collections.url",),
        ),
        migrations.CreateModel(
            name="DeltaUrl",
            fields=[
                (
                    "url_ptr",
                    models.OneToOneField(
                        auto_created=True,
                        on_delete=django.db.models.deletion.CASCADE,
                        parent_link=True,
                        primary_key=True,
                        serialize=False,
                        to="sde_collections.url",
                    ),
                ),
                ("delete", models.BooleanField(default=False)),
            ],
            options={
                "verbose_name": "Delta URL",
                "verbose_name_plural": "Delta URLs",
            },
            bases=("sde_collections.url",),
        ),
        migrations.CreateModel(
            name="DumpUrl",
            fields=[
                (
                    "url_ptr",
                    models.OneToOneField(
                        auto_created=True,
                        on_delete=django.db.models.deletion.CASCADE,
                        parent_link=True,
                        primary_key=True,
                        serialize=False,
                        to="sde_collections.url",
                    ),
                ),
            ],
            options={
                "verbose_name": "Dump URL",
                "verbose_name_plural": "Dump URLs",
            },
            bases=("sde_collections.url",),
        ),
    ]
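The `url_ptr` one-to-one fields with `parent_link=True` in the migration above indicate Django multi-table inheritance: shared columns live in the `Url` table, and each child table holds only its own fields plus a primary key that points at the parent row. The following is a rough sketch of that layout using the standard-library `sqlite3` module rather than Django. The table and column names follow Django's default naming (`<app>_<model>` and `<field>_id`), but the column types are simplified and only two of the parent's columns are included, so treat it as an approximation of the schema, not the SQL Django actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified approximation of the parent and one child table.
conn.executescript(
    """
    CREATE TABLE sde_collections_url (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url VARCHAR(4096) NOT NULL,
        scraped_title VARCHAR(1024) NOT NULL DEFAULT ''
    );
    CREATE TABLE sde_collections_deltaurl (
        url_ptr_id INTEGER PRIMARY KEY REFERENCES sde_collections_url (id),
        "delete" BOOLEAN NOT NULL DEFAULT 0
    );
    """
)

# Creating one DeltaUrl means writing one row in each table.
cur = conn.execute(
    "INSERT INTO sde_collections_url (url, scraped_title) VALUES (?, ?)",
    ("https://example.com/page", "Example"),
)
parent_id = cur.lastrowid
conn.execute(
    'INSERT INTO sde_collections_deltaurl (url_ptr_id, "delete") VALUES (?, ?)',
    (parent_id, 0),
)

# Reading a DeltaUrl back requires a join against the parent table.
row = conn.execute(
    """
    SELECT u.url, u.scraped_title, d."delete"
    FROM sde_collections_deltaurl AS d
    JOIN sde_collections_url AS u ON u.id = d.url_ptr_id
    """
).fetchone()
```

The join on every read is the usual cost of this layout, which may be one reason a later migration in this pull request removes these inherited models again.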
@@ -0,0 +1,37 @@
# Generated by Django 4.2.9 on 2024-11-07 17:40

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ("sde_collections", "0059_url_curatedurl_deltaurl_dumpurl"),
    ]

    operations = [
        migrations.RemoveField(
            model_name="deltaurl",
            name="url_ptr",
        ),
        migrations.RemoveField(
            model_name="dumpurl",
            name="url_ptr",
        ),
        migrations.RemoveField(
            model_name="url",
            name="collection",
        ),
        migrations.DeleteModel(
            name="CuratedUrl",
        ),
        migrations.DeleteModel(
            name="DeltaUrl",
        ),
        migrations.DeleteModel(
            name="DumpUrl",
        ),
        migrations.DeleteModel(
            name="Url",
        ),
    ]