From 8ec449cdc7fb5a2b0996e7c4c0d5dc6501c641d7 Mon Sep 17 00:00:00 2001 From: Bishwas Praveen Date: Wed, 20 Sep 2023 12:53:18 -0500 Subject: [PATCH 01/28] Added instructions on setting up the project with a db backup and also instructions on pre-commit hooks --- README.md | 109 ++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 98 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 2b3efe6e..a262026b 100644 --- a/README.md +++ b/README.md @@ -11,21 +11,33 @@ Moved to [settings](http://cookiecutter-django.readthedocs.io/en/latest/settings ## Basic Commands +### Building The Project + ```bash + $ docker-compose -f local.yml build + ``` + +### Running The Necessary Containers + ```bash + $ docker-compose -f local.yml up + ``` + ### Setting Up Your Users - To create a **normal user account**, just go to Sign Up and fill out the form. Once you submit it, you'll see a "Verify Your E-mail Address" page. Go to your console to see a simulated email verification message. Copy the link into your browser. Now the user's email should be verified and ready to go. - To create a **superuser account**, use this command: - - $ python manage.py createsuperuser + ```bash + $ docker-compose -f local.yml run --rm django python manage.py createsuperuser + ``` For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users. 
### Loading fixtures - To load collections - - docker-compose -f local.yml run --rm django python manage.py loaddata sde_collections/fixtures/collections.json + ```bash + $ docker-compose -f local.yml run --rm django python manage.py loaddata sde_collections/fixtures/collections.json + ``` ### Loading scraped URLs into CandidateURLs @@ -36,26 +48,74 @@ For convenience, you can keep your normal user logged in on Chrome and your supe - Run the crawler with `scrapy crawl -o scraped_urls//urls.jsonl - Then run this: + ```bash + $ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls + ``` + +### Loading The DB From A Backup + +- If a database backup is available, you no longer need to load the fixtures or the scraped URLs. This changes a few of the steps needed to get the project running. + +- Step 1 : Build the project (Documented Above) + +- Step 2 : Run the necessary containers (Documented Above) + +- Step 3 : Clear Out Content Types Using Django Shell + + -- Enter the Django shell in your Docker container. + ```bash + $ docker-compose -f local.yml run --rm django python manage.py shell + ``` + + -- In the Django shell, you can now delete the content types. + ```python + from django.contrib.contenttypes.models import ContentType + ContentType.objects.all().delete() + ``` + + -- Exit the shell. + +- Step 4 : Load Your Backup Database + + Assuming your backup is a `.json` file from `dumpdata`, you'd use the `loaddata` command to populate your database. + + -- If the backup file is on the local machine, make sure it's accessible to the Docker container. If the backup is outside the container, you will need to copy it inside first. + ```bash + $ docker cp /path/to/your/backup.json container_name:/path/inside/container/backup.json + ``` + + -- Load the data from your backup. 
+ ```bash + $ docker-compose -f local.yml run --rm django python manage.py loaddata /path/inside/container/backup.json + ``` + + -- Once loaded, you may want to run migrations to ensure everything is aligned. + ```bash + $ docker-compose -f local.yml run --rm django python manage.py migrate + ``` - $ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls ### Type checks Running type checks with mypy: - + ```bash $ mypy sde_indexing_helper + ``` ### Test coverage To run the tests, check your test coverage, and generate an HTML coverage report: - + ```bash $ coverage run -m pytest $ coverage html $ open htmlcov/index.html + ``` #### Running tests with pytest + ```bash $ pytest + ``` ### Live reloading and Sass CSS compilation @@ -63,15 +123,17 @@ Moved to [Live reloading and SASS compilation](https://cookiecutter-django.readt ### Install Celery -Make sure Celery is installed in your environment. -To install, -pip install celery +Make sure Celery is installed in your environment. To install : + ```bash + $ pip install celery + ``` ### Install all requirements Install all packages listed in a 'requirements' file - + ```bash pip install -r requirements/*.txt + ``` ### Celery @@ -100,6 +162,31 @@ cd sde_indexing_helper celery -A config.celery_app worker -B -l info ``` +### Pre-Commit Hook Instructions + +Hooks have to be run on every commit to automatically take care of linting and structuring. + +To install the pre-commit package manager : + + ```bash + $ pip install pre-commit + ``` + +Install the git hook scripts : + + ```bash + $ pre-commit install + ``` + +Run against all the files : + + ```bash + $ pre-commit run --all-files + ``` + + It's usually a good idea to run the hooks against all of the files when adding new hooks (usually `pre-commit` will only run on the changed files during git hooks). + + ### Sentry Sentry is an error logging aggregator service. You can sign up for a free account at or download and host it yourself. 
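Note on PATCH 01/28: the backup-restore steps documented in the README could be collected into one small script. The following is only a minimal sketch under stated assumptions, not part of the patch: `restore_commands` and `COMPOSE_FILE` are hypothetical names, the backup is assumed to already be readable at the given path inside the `django` service, and the `up` step is left out because it runs in the foreground. It relies on Django's `manage.py shell -c` option to run the content-type cleanup non-interactively.

```python
import shlex

# Assumption: the same compose file the README commands use.
COMPOSE_FILE = "local.yml"


def restore_commands(backup_in_container, compose_file=COMPOSE_FILE):
    """Return the README's restore steps as argv lists, in the order they should run."""
    # Common prefix for one-off management commands in the django service.
    run_django = [
        "docker-compose", "-f", compose_file,
        "run", "--rm", "django", "python", "manage.py",
    ]
    # Step 3 of the README, expressed as a one-liner for `shell -c`.
    clear_content_types = (
        "from django.contrib.contenttypes.models import ContentType; "
        "ContentType.objects.all().delete()"
    )
    return [
        ["docker-compose", "-f", compose_file, "build"],    # Step 1: build
        run_django + ["shell", "-c", clear_content_types],  # Step 3: clear content types
        run_django + ["loaddata", backup_in_container],     # Step 4: load the backup
        run_django + ["migrate"],                           # Step 4: align migrations
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in restore_commands("/path/inside/container/backup.json"):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try; passing each list to `subprocess.run(argv, check=True)` would be one way to execute them for real.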
From d217656c062fc118c0cebd47d2bd067c9b961a90 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 22 Sep 2023 22:36:47 +0000 Subject: [PATCH 02/28] Bump gitpython from 3.1.31 to 3.1.37 Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.37. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.31...3.1.37) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 8a651796..e43ffc6b 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -54,7 +54,7 @@ fonttools==4.39.4 fsspec==2023.9.1 futures==3.0.5 gitdb==4.0.10 -GitPython==3.1.31 +GitPython==3.1.37 greenlet==2.0.2 h11==0.14.0 huggingface-hub==0.15.1 From e35a322a1853d78328edf30d39ea562a87c31470 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 22 Sep 2023 22:36:57 +0000 Subject: [PATCH 03/28] Bump matplotlib from 3.7.1 to 3.8.0 Bumps [matplotlib](https://github.com/matplotlib/matplotlib) from 3.7.1 to 3.8.0. - [Release notes](https://github.com/matplotlib/matplotlib/releases) - [Commits](https://github.com/matplotlib/matplotlib/compare/v3.7.1...v3.8.0) --- updated-dependencies: - dependency-name: matplotlib dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 8a651796..5689aeb3 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -64,7 +64,7 @@ Jinja2==3.1.2 joblib==1.2.0 kiwisolver==1.4.4 MarkupSafe==2.1.3 -matplotlib==3.7.1 +matplotlib==3.8.0 mpmath==1.3.0 mypy-extensions==1.0.0 networkx==3.1 From c4fb2dcc75cc1cef222a2f668e115b30569c81ce Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 22 Sep 2023 22:37:01 +0000 Subject: [PATCH 04/28] Bump sphinx from 7.2.5 to 7.2.6 Bumps [sphinx](https://github.com/sphinx-doc/sphinx) from 7.2.5 to 7.2.6. - [Release notes](https://github.com/sphinx-doc/sphinx/releases) - [Changelog](https://github.com/sphinx-doc/sphinx/blob/master/CHANGES.rst) - [Commits](https://github.com/sphinx-doc/sphinx/compare/v7.2.5...v7.2.6) --- updated-dependencies: - dependency-name: sphinx dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] --- requirements/local.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/local.txt b/requirements/local.txt index 3cac04b9..38093dba 100644 --- a/requirements/local.txt +++ b/requirements/local.txt @@ -16,7 +16,7 @@ types-xmltodict # Documentation # ------------------------------------------------------------------------------ -sphinx==7.2.5 # https://github.com/sphinx-doc/sphinx +sphinx==7.2.6 # https://github.com/sphinx-doc/sphinx sphinx-autobuild==2021.3.14 # https://github.com/GaretJax/sphinx-autobuild # Code quality From 96da296406c25f353f750598c44042b2ddf4eccf Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 22 Sep 2023 22:37:04 +0000 Subject: [PATCH 05/28] Bump typing-extensions from 4.6.3 to 4.8.0 Bumps [typing-extensions](https://github.com/python/typing_extensions) from 4.6.3 to 4.8.0. - [Release notes](https://github.com/python/typing_extensions/releases) - [Changelog](https://github.com/python/typing_extensions/blob/main/CHANGELOG.md) - [Commits](https://github.com/python/typing_extensions/compare/4.6.3...4.8.0) --- updated-dependencies: - dependency-name: typing-extensions dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 8a651796..bbd933f2 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -106,7 +106,7 @@ tomli==2.0.1 torch==2.0.1 tqdm==4.65.0 transformers==4.30.0 -typing_extensions==4.6.3 +typing_extensions==4.8.0 tzdata==2023.3 urllib3==1.26.16 uvicorn==0.23.2 From 00a765d0ded194203f5c4fb687637608d4cfa740 Mon Sep 17 00:00:00 2001 From: Bishwas Praveen Date: Tue, 26 Sep 2023 10:38:07 -0500 Subject: [PATCH 06/28] Fixed the issue with the push collections to github button --- sde_collections/utils/github_helper.py | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/sde_collections/utils/github_helper.py b/sde_collections/utils/github_helper.py index b6234803..60688447 100644 --- a/sde_collections/utils/github_helper.py +++ b/sde_collections/utils/github_helper.py @@ -52,6 +52,19 @@ def _update_file_contents(self, collection): branch=self.github_branch, ) + def branch_exists(self, branch_name: str) -> bool: + try: + self.repo.get_branch(branch=branch_name) + return True + except GithubException: + return False + + def create_branch(self, branch_name: str): + # Get the SHA of the commit you want to branch from (basically the Dev branch) + base_sha = self.repo.get_branch(self.dev_branch).commit.sha + # Create the new branch + self.repo.create_git_ref(ref=f"refs/heads/{branch_name}", sha=base_sha) + def create_pull_request(self) -> None: title = "Webapp: Update config files" body = "\n".join(self.collections.values_list("name", flat=True)) @@ -66,6 +79,8 @@ def create_pull_request(self) -> None: print("PR exists") def push_to_github(self) -> None: + if not self.branch_exists(self.github_branch): + self.create_branch(self.github_branch) for collection in self.collections: print(f"Pushing {collection.name} to GitHub.") self._update_file_contents(collection) From 
78ea04f498c3bc96416a27fd758e81bdee3fa10c Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 10:16:06 -0500 Subject: [PATCH 07/28] add note about fixtures --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a262026b..01efe654 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ Moved to [settings](http://cookiecutter-django.readthedocs.io/en/latest/settings For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users. ### Loading fixtures - +Please note that currently loading fixtures will not create a fully working database. If you are starting the project from scratch, it is probably preferable to skip to the Loading the DB from a Backup section. - To load collections ```bash $ docker-compose -f local.yml run --rm django python manage.py loaddata sde_collections/fixtures/collections.json From 8adca9396e27be1b008e950d159be1ee8c3df22d Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 10:31:03 -0500 Subject: [PATCH 08/28] add simple documentation to collection.update_config_xml --- sde_collections/models/collection.py | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/sde_collections/models/collection.py b/sde_collections/models/collection.py index bfe4f8cd..084dbee9 100644 --- a/sde_collections/models/collection.py +++ b/sde_collections/models/collection.py @@ -126,6 +126,13 @@ def _process_document_type_list(self): return document_type_rules def update_config_xml(self, original_config_string): + """ + reads from the model data and creates a config that mirrors the + - excludes + - title rules + - doc types + - tree root + """ editor = XmlEditor(original_config_string) URL_EXCLUDES = self._process_exclude_list() From 8bb2e01e8bdeefeac9a1d045f9b66c79bd59a108 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 10:33:49 -0500 Subject: 
[PATCH 09/28] add code to bulk push collections to github --- sde_collections/utils/bulk_github_push.py | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) create mode 100644 sde_collections/utils/bulk_github_push.py diff --git a/sde_collections/utils/bulk_github_push.py b/sde_collections/utils/bulk_github_push.py new file mode 100644 index 00000000..868e900e --- /dev/null +++ b/sde_collections/utils/bulk_github_push.py @@ -0,0 +1,19 @@ +""" +Sometimes it is necessary to programmatically push many collections at once to GitHub. +This code will search for collections matching certain criteria (curated, pr created), +and push their changes to GitHub +""" + +from sde_collections.models.collection import Collection +from sde_collections.models.collection_choice_fields import CurationStatusChoices +from sde_collections.utils.github_helper import GitHubHandler + +finished_statuses = [ + CurationStatusChoices.CURATED, + CurationStatusChoices.GITHUB_PR_CREATED, +] + +collections = Collection.objects.filter(curation_status__in=finished_statuses) + +gh = GitHubHandler(collections) +gh.push_to_github() From 35f8bb1c15b0482c97c03bd83837d325e8c1bb0a Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 10:39:12 -0500 Subject: [PATCH 10/28] turn bulk_push into a runnable function --- sde_collections/utils/bulk_github_push.py | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/sde_collections/utils/bulk_github_push.py b/sde_collections/utils/bulk_github_push.py index 868e900e..ed351c27 100644 --- a/sde_collections/utils/bulk_github_push.py +++ b/sde_collections/utils/bulk_github_push.py @@ -8,12 +8,14 @@ from sde_collections.models.collection_choice_fields import CurationStatusChoices from sde_collections.utils.github_helper import GitHubHandler -finished_statuses = [ +FINISHED_STATUSES = [ CurationStatusChoices.CURATED, CurationStatusChoices.GITHUB_PR_CREATED, ] -collections = 
Collection.objects.filter(curation_status__in=finished_statuses) -gh = GitHubHandler(collections) -gh.push_to_github() +def bulk_push(statuses_to_push=FINISHED_STATUSES): + collections = Collection.objects.filter(curation_status__in=statuses_to_push) + + gh = GitHubHandler(collections) + gh.push_to_github() From ff672939c9fef7363d3ec6f8534e4d6a25b19918 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 11:08:31 -0500 Subject: [PATCH 11/28] update github helper to search dev and update branches for content --- sde_collections/utils/github_helper.py | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/sde_collections/utils/github_helper.py b/sde_collections/utils/github_helper.py index 60688447..a6a3f685 100644 --- a/sde_collections/utils/github_helper.py +++ b/sde_collections/utils/github_helper.py @@ -11,7 +11,7 @@ class GitHubHandler: def __init__(self, collections, *args, **kwargs): self.github_token = settings.GITHUB_ACCESS_TOKEN self.github_repo = settings.SINEQUA_CONFIGS_GITHUB_REPO - self.github_branch = settings.GITHUB_BRANCH_FOR_WEBAPP + self.github_update_branch = settings.GITHUB_BRANCH_FOR_WEBAPP self.g = Github(self.github_token) self.repo = self.g.get_repo(f"{self.github_repo}") self.dev_branch = self.repo.default_branch @@ -23,14 +23,19 @@ def _get_config_file_path(self, collection) -> str: def _get_file_contents(self, collection): """ - Get file contents from GitHub + Get file contents from GitHub dev or update branch """ FILE_PATH = self._get_config_file_path(collection) try: - contents = self.repo.get_contents(FILE_PATH, ref=self.github_branch) + contents = self.repo.get_contents(FILE_PATH, ref=self.dev_branch) except UnknownObjectException: - return None + try: + contents = self.repo.get_contents( + FILE_PATH, ref=self.github_update_branch + ) + except UnknownObjectException: + return None return contents @@ -49,7 +54,7 @@ def _update_file_contents(self, collection): COMMIT_MESSAGE, updated_xml, 
contents.sha, - branch=self.github_branch, + branch=self.github_update_branch, ) def branch_exists(self, branch_name: str) -> bool: @@ -73,14 +78,14 @@ def create_pull_request(self) -> None: title=title, body=body, base=self.dev_branch, - head=self.github_branch, + head=self.github_update_branch, ) except GithubException: # PR exists print("PR exists") def push_to_github(self) -> None: - if not self.branch_exists(self.github_branch): - self.create_branch(self.github_branch) + if not self.branch_exists(self.github_update_branch): + self.create_branch(self.github_update_branch) for collection in self.collections: print(f"Pushing {collection.name} to GitHub.") self._update_file_contents(collection) From 54e058eee25f5718604de1c669a20e747585d307 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 11:30:44 -0500 Subject: [PATCH 12/28] add exception for fake collections to bulk github push --- sde_collections/utils/bulk_github_push.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/sde_collections/utils/bulk_github_push.py b/sde_collections/utils/bulk_github_push.py index ed351c27..3a42965d 100644 --- a/sde_collections/utils/bulk_github_push.py +++ b/sde_collections/utils/bulk_github_push.py @@ -15,7 +15,9 @@ def bulk_push(statuses_to_push=FINISHED_STATUSES): - collections = Collection.objects.filter(curation_status__in=statuses_to_push) + collections = Collection.objects.filter( + curation_status__in=statuses_to_push + ).exclude(name__icontains="fake") gh = GitHubHandler(collections) gh.push_to_github() From 10d26d9550c8b506e7e88fe75d676423a1b915ba Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 11:31:36 -0500 Subject: [PATCH 13/28] add note about automated branch deletion --- sde_collections/utils/bulk_github_push.py | 1 + 1 file changed, 1 insertion(+) diff --git a/sde_collections/utils/bulk_github_push.py b/sde_collections/utils/bulk_github_push.py index 3a42965d..fdd9cb62 100644 --- 
a/sde_collections/utils/bulk_github_push.py +++ b/sde_collections/utils/bulk_github_push.py @@ -14,6 +14,7 @@ ] +# currently, the existing automated github branch needs to be deleted def bulk_push(statuses_to_push=FINISHED_STATUSES): collections = Collection.objects.filter( curation_status__in=statuses_to_push From bf6ad50b3e07aff4f5ca71f3c67b3eaede437b2f Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 28 Sep 2023 16:47:31 -0500 Subject: [PATCH 14/28] temporarily disable treeroot writing --- sde_collections/models/collection.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/sde_collections/models/collection.py b/sde_collections/models/collection.py index 084dbee9..32582fde 100644 --- a/sde_collections/models/collection.py +++ b/sde_collections/models/collection.py @@ -139,8 +139,9 @@ def update_config_xml(self, original_config_string): TITLE_RULES = self._process_title_list() DOCUMENT_TYPE_RULES = self._process_document_type_list() - if self.tree_root: - editor.update_or_add_element_value("TreeRoot", self.tree_root) + # TODO: this was creating duplicates so it was temporarily disabled + # if self.tree_root: + # editor.update_or_add_element_value("TreeRoot", self.tree_root) for url in URL_EXCLUDES: editor.add_url_exclude(url) From 3b161221adb3e33ec1c9b08fc87bc2b417ae7002 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Fri, 29 Sep 2023 15:25:20 -0500 Subject: [PATCH 15/28] remove references to original XmlEditor --- config_generation/db_to_xml_file_based.py | 47 +++++++++++++++++++++-- 1 file changed, 44 insertions(+), 3 deletions(-) diff --git a/config_generation/db_to_xml_file_based.py b/config_generation/db_to_xml_file_based.py index ec066ce4..28ff4eaf 100644 --- a/config_generation/db_to_xml_file_based.py +++ b/config_generation/db_to_xml_file_based.py @@ -4,10 +4,8 @@ import os import xml.etree.ElementTree as ET -from db_to_xml import XmlEditor as XmlEditorStringBased - -class XmlEditor(XmlEditorStringBased): +class XmlEditor: 
""" Class is instantiated with a path to an xml. An internal etree is generated, and changes are made in place. @@ -79,3 +77,46 @@ def create_config_folder_and_default(self, source_name, collection_name): # self._write_xml(xml_path) self._update_config_xml(xml_path) + + def update_or_add_element_value( + self, + element_name: str, + element_value: str, + parent_element_name: str = "", + ) -> None: + """can update the value of either a top level or secondary level value in the sinequa config + + Args: + element_name (str): name of the sinequa element, such as "Simulate" + element_value (str): value to be stored to element, such as "false" + parent_element_name (str, optional): parent of the element, such as "IndexerClient" + Defaults to None. + """ + + xml_root = self.xml_tree.getroot() + parent_element = ( + xml_root if not parent_element_name else xml_root.find(parent_element_name) + ) + + if parent_element is None: + raise ValueError( + f"Parent element '{parent_element_name}' not found in XML." 
+ ) + + existing_element = parent_element.find(element_name) + if existing_element is not None: + existing_element.text = element_value + else: + ET.SubElement(parent_element, element_name).text = element_value + + def add_job_list_item(self, job_name): + """ + this is specifically for editing joblist templates by adding a new collection to a joblist + config_generation/xmls/joblist_template.xml + """ + xml_root = self.xml_tree.getroot() + + mapping = ET.Element("JobListItem") + ET.SubElement(mapping, "Name").text = job_name + ET.SubElement(mapping, "StopOnError").text = "false" + xml_root.append(mapping) From b2c81d5daf9333a9d7f525b2261b7cf04bf50d31 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Fri, 29 Sep 2023 15:25:47 -0500 Subject: [PATCH 16/28] refactor job generation into single class --- config_generation/generate_jobs.py | 135 +++++++++++++++++++---------- 1 file changed, 90 insertions(+), 45 deletions(-) diff --git a/config_generation/generate_jobs.py b/config_generation/generate_jobs.py index 98c621d2..13be15ab 100644 --- a/config_generation/generate_jobs.py +++ b/config_generation/generate_jobs.py @@ -9,49 +9,94 @@ from config import collection_list, date_of_batch, n -template_root_path = "xmls/" -joblist_template_path = f"{template_root_path}joblist_template.xml" -job_path_root = "../sinequa_configs/jobs/" - - -def create_job_name(collection_name): - return f"collection.indexer.{collection_name}.xml" - - -def create_joblist_name(index): - return f"parallel_indexing_list-{date_of_batch}-{index}.xml" - - -# create single jobs to run each collection -for collection in collection_list: - job = XmlEditor(f"{template_root_path}job_template.xml") - job.update_or_add_element_value("Collection", f"/SMD/{collection}/") - job._update_config_xml(f"{job_path_root}{create_job_name(collection)}") - - -# Create an empty list of lists -sublists = [[] for _ in range(n)] - -# Distribute elements of the big list into sublists -for i in range(len(collection_list)): - # Use 
modulus 
to decide which sublist to put the item in - sublist_index = i % n - sublists[sublist_index].append(collection_list[i]) - -# create the n joblists (which will execute their contents serially in parellel) -job_names = [] -for index, sublist in enumerate(sublists): - joblist = XmlEditor(joblist_template_path) - for collection in sublist: - joblist.add_job_list_item(create_job_name(collection).replace(".xml", "")) - - joblist._update_config_xml(f"{job_path_root}{create_joblist_name(index)}") - job_names.append(create_joblist_name(index).replace(".xml", "")) - -master = XmlEditor(joblist_template_path) -master.update_or_add_element_value("RunJobsInParallel", "true") -[master.add_job_list_item(job_name) for job_name in job_names] -master._update_config_xml( - f"{job_path_root}parallel_indexing_list-{date_of_batch}-master.xml" -) +class ParallelJobCreator: + def __init__( + self, + collection_list, + template_root_path="xmls/", + job_path_root="../sinequa_configs/jobs/", + ): + """ + these default values rely on the old file structure, where the sinequa_configs were a + sub-repo of sde-indexing-helper. 
so when running this, you will need the sde-backend + code to be inside a folder called sinequa_configs + """ + + self.collection_list = collection_list + self.template_root_path = template_root_path + self.joblist_template_path = f"{template_root_path}joblist_template.xml" + self.job_path_root = job_path_root + + def _create_job_name(self, collection_name): + """ + each job that runs an individual collection needs a name based on the collection name + this code generates that file name as a string, and it will be passed to the function that + creates the actual job file + """ + return f"collection.indexer.{collection_name}.xml" + + def _create_joblist_name(self, index): + """ + each job that runs a list of collections needs a name based on: + - the date the batch was created + - the index out of n total batches + this code generates that file name as a string, and it will be passed to the function that + creates the actual job file + """ + return f"parallel_indexing_list-{date_of_batch}-{index}.xml" + + def _create_collection_jobs(self): + """ + in order to run a collection, a job must exist that runs it + this code: + - creates a job based on the job template + - adds the exact collection name + - saves it with a name that will reference the collection name + """ + # create single jobs to run each collection + for collection in self.collection_list: + job = XmlEditor(f"{self.template_root_path}job_template.xml") + job.update_or_add_element_value("Collection", f"/SMD/{collection}/") + job._update_config_xml( + f"{self.job_path_root}{self._create_job_name(collection)}" + ) + + def make_all_parallel_jobs(self): + # create initial single jobs that will be referenced by the parallel job lists + self._create_collection_jobs() + + # Create an empty list of lists + sublists = [[] for _ in range(n)] + + # Distribute elements of the big list into sublists + for i in range(len(self.collection_list)): + # Use modulus to decide which sublist to put the item in + 
sublist_index = i 
% n + sublists[sublist_index].append(self.collection_list[i]) + + # create the n joblists (each list runs its jobs serially; the lists run in parallel) + job_names = [] + for index, sublist in enumerate(sublists): + joblist = XmlEditor(self.joblist_template_path) + for collection in sublist: + joblist.add_job_list_item( + self._create_job_name(collection).replace(".xml", "") + ) + + joblist._update_config_xml( + f"{self.job_path_root}{self._create_joblist_name(index)}" + ) + job_names.append(self._create_joblist_name(index).replace(".xml", "")) + + master = XmlEditor(self.joblist_template_path) + master.update_or_add_element_value("RunJobsInParallel", "true") + [master.add_job_list_item(job_name) for job_name in job_names] + master._update_config_xml( + f"{self.job_path_root}parallel_indexing_list-{date_of_batch}-master.xml" + ) + + +if __name__ == "__main__": + job_creator = ParallelJobCreator(collection_list=collection_list) + job_creator.make_all_parallel_jobs() From 5e27f9fffffa93020943abe8fd1d15272526cf7d Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Fri, 29 Sep 2023 15:26:06 -0500 Subject: [PATCH 17/28] add sources_to_index_20230929 --- config_generation/sources_to_scrape.py | 286 ++++++++++++++++++++++--- 1 file changed, 262 insertions(+), 24 deletions(-) diff --git a/config_generation/sources_to_scrape.py b/config_generation/sources_to_scrape.py index c02b4d4f..2b9b95b6 100644 --- a/config_generation/sources_to_scrape.py +++ b/config_generation/sources_to_scrape.py @@ -665,10 +665,6 @@ "venus_data_archive", "virtual_observatory_information", "z_mast_search", -] - -modified = [ - "ASTRO_Astrophysics_Documents_Website", "ASTRO_HEASARC_Tools_Website", "ASTRO_HIRES_PRV_Website", "ASTRO_Image_Cutouts_Website", @@ -730,100 +726,342 @@ "z_mast_search", ] -sources_to_scrape_20230614 = list(set(created_from_scratch + modified)) -sources_to_scrape_20230616 = [ +# these are all the ones changed on 2023.08.17 +# update this note after running them 
+sources_with_new_source56_mappings = [ + "CASEI_Deployment", + "CASEI_Instrument", + "CASEI_Platform", + "CEOS_API_I", + "earth_observing_dashboard", + "f_prime", + "nasa_carbon_monitoring_system", + "nasa_global_climate_change", + "nasa_power", + "nasa_science_missions_earth", + "nasa_sea_level_change", + "nasa_wavelength", +] + +# these are all the ones changed on 2023.08.17 +# update this note after running them +sources_with_new_selection_rule = [ + "PDS_Mars_Exploration_Program_Website", + "goddard_institute_for_space_studies", + "my_nasa_data", +] + +# these are all the ones changed on 2023.08.17 +# update this note after running them +sources_with_new_title_rule = [ + "ASTRO_Calibration_Documentation_Website", + "ASTRO_Missions_and_Data_Website", + "PDS_Astromaterials_Acquisition_and_Curation_Office_Website", + "PDS_Cassini_Mission_Saturn_Small_Satellites_Website", + "gcn_missions_instruments_and_facilities", + "general_coordinates_network_gcn", + "nasa_power", +] + +# this is every single source on the dev branch +# we are burning the test server and rescraping all of these... 
+sources_to_scrape_20230929 = [ + "ARSETAppliedSciences", "ASTRO_API_Search_Website", + "ASTRO_Astrophysics_Documents_Website", "ASTRO_Calibration_Documentation_Website", "ASTRO_Contributed_Datasets_Website", + "ASTRO_Data_Hosted_on_LAMBDA_Website", "ASTRO_Data_Reduction_Tools_Website", - "ASTRO_exoMAST_API_Website", + "ASTRO_Exoplanet_Program_Documents_Website", + "ASTRO_Finder_Chart_Website", "ASTRO_HEASARC_Software_Website", + "ASTRO_HEASARC_Tools_Website", + "ASTRO_HIRES_PRV_Website", + "ASTRO_High-Energy_Missions_Website", "ASTRO_Hubble_Source_Catalog_Search_API_Website", + "ASTRO_Image_Cutouts_Website", "ASTRO_James_Webb_Space_Telescope_Website", "ASTRO_MAST_Documentation_Website", "ASTRO_Missions_and_Data_Website", + "ASTRO_Multi_Website", + "ASTRO_NASA_Exoplanet_Archive_Documents_Website", + "ASTRO_NAVO_HEASARC", + "ASTRO_NED_User_Guides_Website", "ASTRO_Planck_Cutout_Visualization_Website", + "ASTRO_Spitzer_Tools_Website", "ASTRO_TAP_Search_Website", "ASTRO_WISE_Image_Service_Website", "ASTRO_ZMAST_API_Website", + "ASTRO_exoMAST_API_Website", + "Algorithm_Publication_Tool", "Autoplot_Website", - "earth_science_decadal_surveys", - "exoplanet_opacities_database", + "CASEI", + "CASEI_Campaign", + "CASEI_Deployment", + "CASEI_Instrument", + "CASEI_Platform", + "CCMC_Website", + "CEOS_API_I", + "CEOS_API_M", + "CMR_API", + "CODE_NASA_API", + "Commercial_Smallsat_Data_Acquisition_CSDA_Program", + "DataPathFinder", + "ESSCOR_API", + "GCIS_ARTICLE_API", + "GCIS_BOOKS_API", + "GCIS_JOURNAL_API", + "GCIS_REPORTS_API", + "GENELAB_Github_DataProcessing", "GENELAB_Github_SampleProcessing", "GENELAB_Github_Training", + "GENELAB_METADATA_Website", "GENELAB_Publications_Website", + "HAPI_API", "HAPI_Website", + "Helio_Events_Knowledgebase_Website", + "Heliophysics_Project_Data_Management_Plans", "Helioviewer_Documentation_Website", - "igwn_public_alerts_user_guide", - "nasa_carbon_monitoring_system", - "nasa_global_climate_change", - "nasa_power", - 
"nasa_sea_level_change", - "nasa_wavelength", + "Helioviewer_Website", + "LSDA_Website", + "LSDA_Website_Trial", + "LSDA_Website_Trial2", + "NASA_Black_Marble", + "NASA_Climate_Change", + "NASA_Earth_Observations", + "NASA_Earth_Observatory", + "NASA_Earthdata", + "NASA_Heliophysics_Digital_Resource_Library_HDRL", + "NASA_SPoRT", + "NASA_Techport", + "NASA_Worldview", + "NAVO_HEASARC", + "NTRS", + "NasaEarthObservationWebsite", + "Optical_Constants_Database", + "PDS_API_Legacy_All", + "PDS_All_Data_Holdings_Website", "PDS_Analyst_Notebook_Website", + "PDS_Annex_Data_Holdings_Website", + "PDS_Archive_Navigator_Website", "PDS_Astrogeology_Website", + "PDS_Astromat_Astromaterials_Data_System_Website", "PDS_Astromaterials_Acquisition_and_Curation_Office_Website", "PDS_Astropedia_Lunar_and_Planetary_Cartographic_Catalog_Website", "PDS_Atmospheric_Escape_Chemistry_Page_Website", + "PDS_CRISM_Analysis_Toolkit_(CAT)_Website", + "PDS_CRISM_Spectral_Library_Website", + "PDS_Cassini_Mission_Dione_(Saturn_IV)_Website", + "PDS_Cassini_Mission_Enceladus_(Saturn_II)_Website", + "PDS_Cassini_Mission_Iapetus_(Saturn_VIII)_Website", + "PDS_Cassini_Mission_Mimas_(Saturn_I)_Website", + "PDS_Cassini_Mission_Rhea_(Saturn_V)_Website", + "PDS_Cassini_Mission_Saturn_Small_Satellites_Website", + "PDS_Cassini_Mission_Tethys_(Saturn_III)_Website", "PDS_Cassini_Resource_Page_Website", "PDS_Collision_Induced_Absorption_Model_Website", - "PDS_CRISM_Spectral_Library_Website", + "PDS_Current_Missions_Website", + "PDS_DIVINER_RDR_Query_V20_Website", + "PDS_DIVINER_RDR_Query_Website", + "PDS_Data_Archive_Website", + "PDS_Data_Dictionary_Search", "PDS_Data_Pilot_Website", "PDS_Data_Portal_Website", + "PDS_Data_Volumes_Index_Website", + "PDS_Data_Volumes_Website", "PDS_Dawn_Mission_to_Ceres_Website", "PDS_Dawn_Mission_to_Vesta_Website", - "PDS_DIVINER_RDR_Query_V20_Website", - "PDS_DIVINER_RDR_Query_Website", "PDS_EPIC_Model_Website", + "PDS_Errata_Website", + "PDS_Generic_Kernels_Website", 
"PDS_Geosciences_Data_Holdings_Website", "PDS_Geosciences_Node_Spectral_Library_Website", "PDS_Gravity_Models_Website", + "PDS_High-Resolution_Transmission_Molecular_Absorption_Database_(HITRAN)_Website", + "PDS_ISIS_Astro_Website", + "PDS_ISIS_Website", "PDS_Image_Atlas_Website", "PDS_Imaging_Software_Website", - "PDS_ISIS_Astro_Website", "PDS_Java_Mission-planning_and_Analysis_for_Remote_Sensing_(JMARS)_Website", "PDS_Juno_Archive_Page_Website", "PDS_Jupiter_Data_Archive_Website", + "PDS_LADEE_NMS_Calibrated_Data_Search", + "PDS_LADEE_NMS_Derived_Data_Search", + "PDS_LADEE_UVS_Calibrated_Data_Search", "PDS_LOLA_RDR_Query_V20_Website", "PDS_LOLA_RDR_Query_Website", "PDS_Lunar_Atmospheres_Data_Archive_Website", "PDS_Lunar_Orbiter_Data_Explorer_Website", - "PDS_Mars_Lander_Data_Website", - "PDS_Mars_Orbiter_Data_Website", "PDS_MAVEN_ACC_Data_Search", "PDS_MAVEN_NGIMS_Data_Search", + "PDS_MRO_Coordinated_Observation_Website", + "PDS_Map-a-Planet_(MAP)_Website", + "PDS_Mars_Exploration_Program_Website", + "PDS_Mars_GCM_Website", + "PDS_Mars_Lander_Data_Website", + "PDS_Mars_Orbital_Data_Explorer_Website", + "PDS_Mars_Orbiter_Data_Website", "PDS_Mercury_Data_Archive_Website", "PDS_Mercury_Orbital_Data_Explorer_Website", "PDS_Messenger_MASCS_UVVS_Archive_Page_Website", "PDS_Metadata_Injector_for_PDS_Labels_Website", + "PDS_Mission_Data_Archive_Website", + "PDS_Missions_Archive_Page_Website", + "PDS_Missions_Website", + "PDS_Models_and_Simulations_Website", "PDS_NASA_Science_Earths_Moon_Website", "PDS_NASA_Science_Solar_System_Exploration_Website", + "PDS_NASA_Solar_System_Treks_Website", + "PDS_NASA_Space_Science_Data_Coordinated_Archive_(NSSDCA)_Website", "PDS_NASAs_Eyes_Website", + "PDS_NEAR_Shoemaker_Mission_to_433_Eros_Website", + "PDS_Near_Earth_Asteroid_Rendezvous_(NEAR)_Data_Archive_Website", "PDS_Neptune_Archive_Page_Website", "PDS_New_Horizons_Encounter_with_Pluto_Website", + "PDS_Niels_Bohr_Institute_Website", "PDS_Notebook_Website", + 
"PDS_ODE_REST_Service_Website", + "PDS_OMEGA_Analysis_Toolkit_(OAT)_Website", "PDS_OPUS_Website", + "PDS_OSIRIS-REx_Mission_to_Bennu_Website", + "PDS_Object_Access_Library_Website", + "PDS_Odyssey_GRS_Data_Node_Website", + "PDS_Operational_Flight_Other_Project_Kernels_Website", "PDS_Outer_Planets_Icy_Satellites_Archive_Page_Website", - "PDS_PDS_Small_Bodies_Node_Asteroid_Dust_Subnode_Website", "PDS_PDS3_Standards_Reference_Website", + "PDS_PDS4_Documents_Website", "PDS_PDS4_JParser_Website", "PDS_PDS4_Local_Data_Dictionary_Tool_Website", "PDS_PDS4_Training_Documents_Website", + "PDS_PDS_Annex_Products_Website", + "PDS_PDS_Atmospheres_Data_Set_Catalog_Website", + "PDS_PDS_Documentation_Website", + "PDS_PDS_Small_Bodies_Node_Asteroid_Dust_Subnode_Website", + "PDS_PDS_Software_Tools_Tutorial_and_Viewers_Website", + "PDS_PDS_Tool_Registry_Website", + "PDS_PPI_Documents_Website", + "PDS_PPI_Software_Website", "PDS_Phoebe_Saturn_IX_Website", "PDS_Photojournal_Website", + "PDS_Planetary_Data_System_(PDS)_Website", "PDS_Planetary_Science_Tools_Website", "PDS_Pluto_and_Arrokoth_Data_Archive_Website", - "PDS_PPI_Software_Website", + "PDS_Projection_on_the_Web_(POW)_Service_Website", + "PDS_Recently_Archived_Volumes_Website", "PDS_Ring-Moon_Systems_Node_On-line_Tools_Website", - "PDS_Saturn_Data_Archive_Website", + "PDS_Rings_Website", + "PDS_SBIB_3D_Website", "PDS_SBN_Tools_Utilities_and_Interfaces_Website", + "PDS_SPICE-enhanced_Cosmographia_Website", + "PDS_SPICE_Archives_Website", + "PDS_SPICE_Programming_Lessons_Website", + "PDS_SPICE_Self-Training_Website", + "PDS_SPICE_Toolkit_Documentation_Website", + "PDS_SPICE_Toolkit_Website", + "PDS_SPICE_Tutorials_Website", + "PDS_SPICE_Utility_and_Application_Programs_Website", + "PDS_Saturn_Data_Archive_Website", + "PDS_Small_Bodies_Data_Ferret_Website", + "PDS_Small_Bodies_Image_Browser_Website", + "PDS_Solar_System_Exploration_Research_Virtual_Institute_(SSERVI)_Website", + "PDS_Subscription_Service_Website", + 
"PDS_Subset_Tool_Website", "PDS_TES_Data_Node_Website", "PDS_Titan_Data_Archive_Website", "PDS_Toolkits_Website", + "PDS_USGS_Pilot_Website", "PDS_Uranus_Data_Archive_Website", "PDS_Users_Guides_Website", "PDS_Venus_Archive_Page_Website", "PDS_Venus_Orbital_Data_Explorer_Website", "PDS_Virtual_Astronaut_Website", + "PDS_Web_Chronos_Website", + "PDS_Wind_Tunnel_Particle_Threshold_Speed_Data_Website", + "PyHC_Website", + "SERVIR_Global", + "SPASE_JSON", + "SPASE_Website", + "SPEDAS_Website", + "Sea_Level_Change", + "Small_Bodies_Node", + "Socioeconomic_Data_and_Applications_Center_SEDAC", + "Solar_Data_Analysis_Center_SDAC", + "Solar_Physics_Group", + "Space_Biology_Science_Digest", + "Space_Physics_Data_Facility", + "Space_Place", + "TASKBOOK_Website", + "VEDA_Dashboard", + "VEDA_STAC_Catalog", + "algorithm_theoretical_basis_documents", + "archived_synthetic_data", + "astrogeology_analysis_ready_data", + "astroquery_api_search_mast_queries", + "co_plotter", + "contributed_datasets_in_the_exoplanet_archive", + "coordinate_calculator", + "corot_exoplanet_archive_etss_data_sets", + "earth_observer_publications", + "earth_observing_dashboard", + "earth_science_decadal_surveys", + "emac_exoplanet_modeling_and_analysis_center", + "eos_mission_page", + "exo_mast", + "exoplanet_atmosphere_observability_table", + "exoplanet_opacities_database", + "extinction_calculator", + "f_prime", + "fire_information_for_resource_management_system_firms", + "gcn_circulars", + "gcn_missions_instruments_and_facilities", + "general_coordinates_network_gcn", + "giss_datasets_and_derived_materials", + "giss_publication_list", + "giss_software_tools", + "goddard_institute_for_space_studies", + "heasarc_browse_batch_interface", + "heasarc_download_scripts", + "high_level_science_products", + "hubble_source_catalog_search", + "igwn_public_alerts_user_guide", + "interactive_multiinstrument_database_of_solar_flares", + "ipac_table_validator", + "koa_program_friendly_image_access_service", + 
"lbti_archive", + "mars_target_encyclopedia_mte", + "mast", + "mast_api_search", + "mast_hubble_search", + "mast_portal", + "mast_query_casjobs", + "mast_web_services", + "montage_mosaic_engine", + "my_nasa_data", + "naif", + "nasa_2023_climate_strategy", + "nasa_carbon_monitoring_system", + "nasa_global_climate_change", + "nasa_power", + "nasa_science_missions_earth", + "nasa_wavelength", + "neid_archive", + "neid_solar_radial_velocity_archive", + "nexsci", + "ntrs_api", + "our_changing_planet_the_view_from_space_images", + "pan_starrs_catalog", + "pan_starrs_catalog_api", + "pds_rings", + "planetary_image_galleries", + "pykoa", + "skiff_spectral_catalog_search", + "space_telescope_bibliographic_search", + "spectral_classes_of_like_stars", + "vao_datascope", + "velocity_calculator", + "venus_data_archive", + "virtual_observatory_information", + "z_mast_search", ] From aa3ebad76c0f20239ffc250a5fce05407d2477ff Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Fri, 29 Sep 2023 16:10:55 -0500 Subject: [PATCH 18/28] update readme with job creation details --- config_generation/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/config_generation/README.md b/config_generation/README.md index 8ec33626..1f7685fa 100644 --- a/config_generation/README.md +++ b/config_generation/README.md @@ -12,3 +12,6 @@ Don't be fooled by the page on indexing.... https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-indexing.html#indexing-collection You want the page on jobs https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-operation.html#operationcollectionStart. + +## Creating Job Lists +Update config.py to contain the latest collections you want to index. Then run generate_jobs.py and it will create the parallel batches. 
From 6e33a8dc97c40250271569261b6fdf99644b750f Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 2 Oct 2023 22:14:41 +0000 Subject: [PATCH 19/28] Bump werkzeug from 2.3.6 to 3.0.0 Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.3.6 to 3.0.0. - [Release notes](https://github.com/pallets/werkzeug/releases) - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst) - [Commits](https://github.com/pallets/werkzeug/compare/2.3.6...3.0.0) --- updated-dependencies: - dependency-name: werkzeug dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] --- requirements/local.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/local.txt b/requirements/local.txt index 3cac04b9..7993043b 100644 --- a/requirements/local.txt +++ b/requirements/local.txt @@ -1,6 +1,6 @@ -r base.txt -Werkzeug==2.3.6 # https://github.com/pallets/werkzeug +Werkzeug==3.0.0 # https://github.com/pallets/werkzeug ipdb==0.13.13 # https://github.com/gotcha/ipdb psycopg2==2.9.6 # https://github.com/psycopg/psycopg2 watchfiles==0.19.0 # https://github.com/samuelcolvin/watchfiles From 3b512191678d7969664781f4bea977dae6a9f8f7 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 4 Oct 2023 22:07:32 +0000 Subject: [PATCH 20/28] Bump websockets from 10.4 to 11.0.3 Bumps [websockets](https://github.com/aaugustin/websockets) from 10.4 to 11.0.3. - [Release notes](https://github.com/aaugustin/websockets/releases) - [Commits](https://github.com/aaugustin/websockets/compare/10.4...11.0.3) --- updated-dependencies: - dependency-name: websockets dependency-type: direct:production update-type: version-update:semver-major ... 
Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 762253fc..801be323 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -111,6 +111,6 @@ tzdata==2023.3 urllib3==1.26.16 uvicorn==0.23.2 wandb==0.15.4 -websockets==10.4 +websockets==11.0.3 zipp==3.15.0 SQLAlchemy==1.4.25 From 40dbbc4e882e2cbd478927f7f143e549b2acdda3 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 4 Oct 2023 22:07:36 +0000 Subject: [PATCH 21/28] Bump soupsieve from 2.4.1 to 2.5 Bumps [soupsieve](https://github.com/facelessuser/soupsieve) from 2.4.1 to 2.5. - [Release notes](https://github.com/facelessuser/soupsieve/releases) - [Commits](https://github.com/facelessuser/soupsieve/compare/2.4.1...2.5) --- updated-dependencies: - dependency-name: soupsieve dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 762253fc..2a994078 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -97,7 +97,7 @@ setproctitle==1.3.2 six==1.16.0 scikit-learn sniffio==1.3.0 -soupsieve==2.4.1 +soupsieve==2.5 starlette==0.27.0 sympy==1.12 threadpoolctl==3.1.0 From fd80833ef6eb7e74db2717cf3e9ae04fcf09d397 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 4 Oct 2023 22:07:44 +0000 Subject: [PATCH 22/28] Bump transformers from 4.30.0 to 4.34.0 Bumps [transformers](https://github.com/huggingface/transformers) from 4.30.0 to 4.34.0. 
- [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.30.0...v4.34.0) --- updated-dependencies: - dependency-name: transformers dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 762253fc..1081dbea 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -105,7 +105,7 @@ tokenizers==0.13.3 tomli==2.0.1 torch==2.0.1 tqdm==4.65.0 -transformers==4.30.0 +transformers==4.34.0 typing_extensions==4.8.0 tzdata==2023.3 urllib3==1.26.16 From a35a39d46c3295692648eb5c33b315b3def4b45e Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 4 Oct 2023 22:07:50 +0000 Subject: [PATCH 23/28] Bump celery from 5.3.1 to 5.3.4 Bumps [celery](https://github.com/celery/celery) from 5.3.1 to 5.3.4. - [Release notes](https://github.com/celery/celery/releases) - [Changelog](https://github.com/celery/celery/blob/main/Changelog.rst) - [Commits](https://github.com/celery/celery/compare/v5.3.1...v5.3.4) --- updated-dependencies: - dependency-name: celery dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 762253fc..001bcaa1 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -4,7 +4,7 @@ Pillow==10.0.0 # https://github.com/python-pillow/Pillow argon2-cffi==23.1.0 # https://github.com/hynek/argon2_cffi redis==4.6.0 # https://github.com/redis/redis-py hiredis==2.2.3 # https://github.com/redis/hiredis-py -celery==5.3.1 # pyup: < 6.0 # https://github.com/celery/celery +celery==5.3.4 # pyup: < 6.0 # https://github.com/celery/celery django-celery-beat==2.5.0 # https://github.com/celery/django-celery-beat flower==2.0.0 # https://github.com/mher/flower ipython==8.14.0 From 9149ede190fda8e4f799dec51674b1efab0a0343 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 4 Oct 2023 22:07:56 +0000 Subject: [PATCH 24/28] Bump django from 4.2.3 to 4.2.6 Bumps [django](https://github.com/django/django) from 4.2.3 to 4.2.6. - [Commits](https://github.com/django/django/compare/4.2.3...4.2.6) --- updated-dependencies: - dependency-name: django dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] --- requirements/base.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements/base.txt b/requirements/base.txt index 762253fc..bc93e9b8 100644 --- a/requirements/base.txt +++ b/requirements/base.txt @@ -11,7 +11,7 @@ ipython==8.14.0 # Django # ------------------------------------------------------------------------------ -django==4.2.3 # pyup: < 4.1 # https://www.djangoproject.com/ +django==4.2.6 # pyup: < 4.1 # https://www.djangoproject.com/ django-environ==0.10.0 # https://github.com/joke2k/django-environ django-extensions==3.2.3 # https://github.com/django-extensions/django-extensions django-model-utils==4.3.1 # https://github.com/jazzband/django-model-utils From 9946d90d9310dc4e6b0cbec261914e5992b03c87 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 5 Oct 2023 11:00:28 -0500 Subject: [PATCH 25/28] add sources scraped on 20231003 --- config_generation/sources_to_scrape.py | 154 +++++++++++++++++++++++++ 1 file changed, 154 insertions(+) diff --git a/config_generation/sources_to_scrape.py b/config_generation/sources_to_scrape.py index 2b9b95b6..e71e2c1e 100644 --- a/config_generation/sources_to_scrape.py +++ b/config_generation/sources_to_scrape.py @@ -1065,3 +1065,157 @@ "virtual_observatory_information", "z_mast_search", ] + +# after indexing for a few days, we discovered some possible optimizations and needed +# to restart the jobs +updated_list_20231003 = [ + "heasarc_download_scripts", + "ipac_table_validator", + "mast_api_search", + "montage_mosaic_engine", + "nasa_global_climate_change", + "neid_solar_radial_velocity_archive", + "pan_starrs_catalog_api", + "space_telescope_bibliographic_search", + "virtual_observatory_information", + "high_level_science_products", + "koa_program_friendly_image_access_service", + "mast_hubble_search", + "my_nasa_data", + "nasa_power", + "nexsci", + "pds_rings", + "spectral_classes_of_like_stars", + "z_mast_search", + 
"ASTRO_Planck_Cutout_Visualization_Website", + "ASTRO_exoMAST_API_Website", + "CASEI_Deployment", + "CEOS_API_M", + "ESSCOR_API", + "GENELAB_Github_DataProcessing", + "HAPI_API", + "Helioviewer_Website", + "NASA_Climate_Change", + "NASA_SPoRT", + "NasaEarthObservationWebsite", + "PDS_Annex_Data_Holdings_Website", + "PDS_Astropedia_Lunar_and_Planetary_Cartographic_Catalog_Website", + "PDS_Cassini_Mission_Enceladus_(Saturn_II)_Website", + "PDS_Cassini_Mission_Tethys_(Saturn_III)_Website", + "PDS_DIVINER_RDR_Query_Website", + "PDS_Data_Volumes_Index_Website", + "PDS_Errata_Website", + "PDS_High-Resolution_Transmission_Molecular_Absorption_Database_(HITRAN)_Website", + "PDS_Java_Mission-planning_and_Analysis_for_Remote_Sensing_(JMARS)_Website", + "PDS_LADEE_UVS_Calibrated_Data_Search", + "PDS_MAVEN_ACC_Data_Search", + "PDS_Mars_GCM_Website", + "PDS_Mercury_Orbital_Data_Explorer_Website", + "PDS_Missions_Website", + "PDS_NASA_Space_Science_Data_Coordinated_Archive_(NSSDCA)_Website", + "PDS_New_Horizons_Encounter_with_Pluto_Website", + "PDS_OPUS_Website", + "PDS_Outer_Planets_Icy_Satellites_Archive_Page_Website", + "PDS_PDS4_Training_Documents_Website", + "PDS_PDS_Software_Tools_Tutorial_and_Viewers_Website", + "PDS_Photojournal_Website", + "PDS_Recently_Archived_Volumes_Website", + "PDS_SPICE-enhanced_Cosmographia_Website", + "PDS_SPICE_Toolkit_Website", + "PDS_Small_Bodies_Image_Browser_Website", + "PDS_Titan_Data_Archive_Website", + "PDS_Venus_Archive_Page_Website", + "PyHC_Website", + "Sea_Level_Change", + "Space_Biology_Science_Digest", + "VEDA_STAC_Catalog", + "co_plotter", + "earth_observing_dashboard", + "exoplanet_atmosphere_observability_table", + "gcn_circulars", + "giss_software_tools", + "hubble_source_catalog_search", + "lbti_archive", + "mast_portal", + "naif", + "nasa_science_missions_earth", + "ntrs_api", + "planetary_image_galleries", + "vao_datascope", + "Space_Physics_Data_Facility", + "algorithm_theoretical_basis_documents", + 
"contributed_datasets_in_the_exoplanet_archive", + "earth_science_decadal_surveys", + "exoplanet_opacities_database", + "gcn_missions_instruments_and_facilities", + "goddard_institute_for_space_studies", + "igwn_public_alerts_user_guide", + "mars_target_encyclopedia_mte", + "mast_query_casjobs", + "nasa_2023_climate_strategy", + "nasa_wavelength", + "our_changing_planet_the_view_from_space_images", + "pykoa", + "velocity_calculator", + "ASTRO_NASA_Exoplanet_Archive_Documents_Website", + "ASTRO_TAP_Search_Website", + "Autoplot_Website", + "CASEI_Platform", + "CODE_NASA_API", + "GCIS_BOOKS_API", + "GENELAB_Github_Training", + "Helio_Events_Knowledgebase_Website", + "LSDA_Website_Trial", + "NASA_Earth_Observatory", + "NASA_Worldview", + "PDS_API_Legacy_All", + "PDS_Astrogeology_Website", + "PDS_CRISM_Analysis_Toolkit_(CAT)_Website", + "PDS_Cassini_Mission_Mimas_(Saturn_I)_Website", + "PDS_Collision_Induced_Absorption_Model_Website", + "PDS_Data_Dictionary_Search", + "PDS_Dawn_Mission_to_Ceres_Website", + "PDS_Geosciences_Data_Holdings_Website", + "PDS_ISIS_Website", + "PDS_Jupiter_Data_Archive_Website", + "PDS_LOLA_RDR_Query_Website", + "PDS_MRO_Coordinated_Observation_Website", + "PDS_Mars_Orbital_Data_Explorer_Website", + "PDS_Metadata_Injector_for_PDS_Labels_Website", + "PDS_NASA_Science_Earths_Moon_Website", + "PDS_NEAR_Shoemaker_Mission_to_433_Eros_Website", + "PDS_Notebook_Website", + "PDS_Object_Access_Library_Website", + "PDS_PDS4_Documents_Website", + "PDS_PDS_Atmospheres_Data_Set_Catalog_Website", + "PDS_PPI_Documents_Website", + "PDS_Planetary_Science_Tools_Website", + "PDS_Rings_Website", + "PDS_SPICE_Programming_Lessons_Website", + "PDS_SPICE_Utility_and_Application_Programs_Website", + "PDS_Subscription_Service_Website", + "PDS_USGS_Pilot_Website", + "PDS_Virtual_Astronaut_Website", + "SPASE_JSON", + "Socioeconomic_Data_and_Applications_Center_SEDAC", + "Space_Place", + "archived_synthetic_data", + "coordinate_calculator", + 
"emac_exoplanet_modeling_and_analysis_center", + "extinction_calculator", + "general_coordinates_network_gcn", + "heasarc_browse_batch_interface", + "interactive_multiinstrument_database_of_solar_flares", + "mast", + "mast_web_services", + "nasa_carbon_monitoring_system", + "neid_archive", + "pan_starrs_catalog", + "skiff_spectral_catalog_search", + "venus_data_archive", + # long jobs... + "giss_datasets_and_derived_materials", + "ASTRO_Missions_and_Data_Website", + "Small_Bodies_Node", + "ASTRO_Image_Cutouts_Website", +] From 4febd9cf6b4bd9da3c7045e32bf87c5a4c70e9ff Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 5 Oct 2023 11:01:01 -0500 Subject: [PATCH 26/28] remove default collection from job template --- config_generation/xmls/job_template.xml | 1 - 1 file changed, 1 deletion(-) diff --git a/config_generation/xmls/job_template.xml b/config_generation/xmls/job_template.xml index 1630fe50..c5406ea9 100644 --- a/config_generation/xmls/job_template.xml +++ b/config_generation/xmls/job_template.xml @@ -1,7 +1,6 @@ collection - /SMD/ASTRO_High-Energy_Missions_Website/ _ForceReindexation From 6d651e6ab40553f9184217548684baf97a287f33 Mon Sep 17 00:00:00 2001 From: Carson Davis Date: Thu, 5 Oct 2023 11:01:48 -0500 Subject: [PATCH 27/28] change defaultidentity to true in joblist template --- config_generation/xmls/joblist_template.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/config_generation/xmls/joblist_template.xml b/config_generation/xmls/joblist_template.xml index 6bf1b2ff..c02eb396 100644 --- a/config_generation/xmls/joblist_template.xml +++ b/config_generation/xmls/joblist_template.xml @@ -32,7 +32,7 @@ - false + true false