Merge branch 'dev' into 410-adding-additional-status-on-the-curation-app
CarsonDavis authored Oct 9, 2023
2 parents 5328ec1 + 9fdd4b6 commit cce127f
Showing 12 changed files with 709 additions and 104 deletions.
111 changes: 99 additions & 12 deletions README.md
@@ -11,21 +11,33 @@
Moved to [settings](http://cookiecutter-django.readthedocs.io/en/latest/settings

## Basic Commands

### Building The Project
```bash
$ docker-compose -f local.yml build
```

### Running The Necessary Containers
```bash
$ docker-compose -f local.yml up
```

### Setting Up Your Users

- To create a **normal user account**, just go to Sign Up and fill out the form. Once you submit it, you'll see a "Verify Your E-mail Address" page. Go to your console to see a simulated email verification message. Copy the link into your browser. Now the user's email should be verified and ready to go.

- To create a **superuser account**, use this command:

```bash
$ docker-compose -f local.yml run --rm django python manage.py createsuperuser
```

For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users.

### Loading fixtures

Please note that loading fixtures currently will not create a fully working database. If you are starting the project from scratch, it is probably preferable to skip ahead to the Loading The DB From A Backup section.
- To load collections:

```bash
$ docker-compose -f local.yml run --rm django python manage.py loaddata sde_collections/fixtures/collections.json
```

### Loading scraped URLs into CandidateURLs

@@ -36,42 +48,92 @@
- Run the crawler with `scrapy crawl <name of your spider> -o scraped_urls/<config_folder>/urls.jsonl`
- Then run this (see the example below):
```bash
$ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls <config_folder_name>
```
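For example, assuming a hypothetical spider named `example_spider` and a config folder named `example_config` (substitute your own names), the two commands look like this:
```bash
$ scrapy crawl example_spider -o scraped_urls/example_config/urls.jsonl
$ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls example_config
```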
### Loading The DB From A Backup
- If a database backup is available, you won't need to load the fixtures or the scraped URLs. This changes a few of the steps needed to get the project running.
- Step 1: Build the project (documented above).
- Step 2: Run the necessary containers (documented above).
- Step 3: Clear out content types using the Django shell.
  - Enter the Django shell in your Docker container:
```bash
$ docker-compose -f local.yml run --rm django python manage.py shell
```
  - In the Django shell, delete the content types:
```bash
from django.contrib.contenttypes.models import ContentType
ContentType.objects.all().delete()
```
  - Exit the shell.
- Step 4: Load your backup database.
  Assuming your backup is a `.json` file produced by `dumpdata` (see the sketch after these steps), use the `loaddata` command to populate your database.
  - If the backup file is on the local machine, make sure it's accessible to the Docker container. If the backup is outside the container, copy it inside first:
```bash
$ docker cp /path/to/your/backup.json container_name:/path/inside/container/backup.json
```
  - Load the data from your backup:
```bash
$ docker-compose -f local.yml run --rm django python manage.py loaddata /path/inside/container/backup.json
```
  - Once loaded, run migrations to ensure the schema and data are aligned:
```bash
$ docker-compose -f local.yml run --rm django python manage.py migrate
```
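If you need to produce such a backup yourself, a `dumpdata` call along these lines should work (a minimal sketch; the exact flags and output path are your choice, not a project requirement):
```bash
$ docker-compose -f local.yml run --rm django python manage.py dumpdata --natural-foreign --natural-primary --indent 2 -o backup.json
```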
### Type checks
Running type checks with mypy:

```bash
$ mypy sde_indexing_helper
```
### Test coverage
To run the tests, check your test coverage, and generate an HTML coverage report:

```bash
$ coverage run -m pytest
$ coverage html
$ open htmlcov/index.html
```
#### Running tests with pytest
```bash
$ pytest
```
### Live reloading and Sass CSS compilation
Moved to [Live reloading and SASS compilation](https://cookiecutter-django.readthedocs.io/en/latest/developing-locally.html#sass-compilation-live-reloading).
### Install Celery
Make sure Celery is installed in your environment. To install:
```bash
$ pip install celery
```
### Install all requirements
Install all packages listed in the requirements files:

```bash
pip install -r requirements/*.txt
```
### Celery
@@ -100,6 +162,31 @@
```bash
cd sde_indexing_helper
celery -A config.celery_app worker -B -l info
```
### Pre-Commit Hook Instructions
Hooks have to run on every commit to automatically take care of linting and code formatting.
To install the pre-commit package manager:
```bash
$ pip install pre-commit
```
Install the git hook scripts:
```bash
$ pre-commit install
```
Run the hooks against all of the files:
```bash
$ pre-commit run --all-files
```
It's usually a good idea to run the hooks against all of the files when adding new hooks (usually `pre-commit` will only run on the changed files during git hooks).
### Sentry
Sentry is an error logging aggregator service. You can sign up for a free account at <https://sentry.io/signup/?code=cookiecutter> or download and host it yourself.
4 changes: 4 additions & 0 deletions config_generation/README.md
@@ -12,3 +12,7 @@
Don't be fooled by the page on indexing....
https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-indexing.html#indexing-collection

You want the page on jobs https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-operation.html#operationcollectionStart.

## Creating Job Lists
Update `config.py` to contain the latest collections you want to index, then run `generate_jobs.py` to create the parallel batches.
If you want the jobs to run on multiple nodes, you will need to add that in two places in the file; note that you then won't be able to run the lists from the masterlist, because of a Sinequa bug.
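`generate_jobs.py` imports `collection_list`, `date_of_batch`, and `n` from `config.py`. As a rough sketch (the values below are placeholders, not real collections), the file is expected to look something like this:

```python
# config.py -- placeholder values for illustration only
collection_list = [
    "example_collection_one",
    "example_collection_two",
]
date_of_batch = "20231009"  # appears in the generated joblist file names
n = 2  # number of parallel joblists to split the collections into
```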
47 changes: 44 additions & 3 deletions config_generation/db_to_xml_file_based.py
@@ -4,10 +4,8 @@
import os
import xml.etree.ElementTree as ET

from db_to_xml import XmlEditor as XmlEditorStringBased


class XmlEditor(XmlEditorStringBased):
class XmlEditor:
"""
Class is instantiated with a path to an xml.
An internal etree is generated, and changes are made in place.
@@ -79,3 +77,46 @@
def create_config_folder_and_default(self, source_name, collection_name):

# self._write_xml(xml_path)
self._update_config_xml(xml_path)

def update_or_add_element_value(
self,
element_name: str,
element_value: str,
parent_element_name: str = "",
) -> None:
"""can update the value of either a top level or secondary level value in the sinequa config
Args:
element_name (str): name of the sinequa element, such as "Simulate"
element_value (str): value to be stored to element, such as "false"
parent_element_name (str, optional): parent of the element, such as "IndexerClient"
                Defaults to "".
"""

xml_root = self.xml_tree.getroot()
parent_element = (
xml_root if not parent_element_name else xml_root.find(parent_element_name)
)

if parent_element is None:
raise ValueError(
f"Parent element '{parent_element_name}' not found in XML."
)

        existing_element = parent_element.find(element_name)
        # compare against None: ElementTree elements with no children are falsy
        if existing_element is not None:
existing_element.text = element_value
else:
ET.SubElement(parent_element, element_name).text = element_value

def add_job_list_item(self, job_name):
"""
this is specifically for editing joblist templates by adding a new collection to a joblist
config_generation/xmls/joblist_template.xml
"""
xml_root = self.xml_tree.getroot()

mapping = ET.Element("JobListItem")
ET.SubElement(mapping, "Name").text = job_name
ET.SubElement(mapping, "StopOnError").text = "false"
xml_root.append(mapping)
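As a usage sketch of the two methods added above (paths and names are illustrative; the real call sites are in `generate_jobs.py`):

```python
# Hypothetical usage; template and output paths are placeholders.
job = XmlEditor("xmls/job_template.xml")
job.update_or_add_element_value("Collection", "/SMD/example_collection/")
job._update_config_xml("../sinequa_configs/jobs/collection.indexer.example_collection.xml")

joblist = XmlEditor("xmls/joblist_template.xml")
joblist.add_job_list_item("collection.indexer.example_collection")
joblist._update_config_xml("../sinequa_configs/jobs/parallel_indexing_list-example.xml")
```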
135 changes: 90 additions & 45 deletions config_generation/generate_jobs.py
@@ -9,49 +9,94 @@

from config import collection_list, date_of_batch, n

template_root_path = "xmls/"
joblist_template_path = f"{template_root_path}joblist_template.xml"

job_path_root = "../sinequa_configs/jobs/"


def create_job_name(collection_name):
return f"collection.indexer.{collection_name}.xml"


def create_joblist_name(index):
return f"parallel_indexing_list-{date_of_batch}-{index}.xml"


# create single jobs to run each collection
for collection in collection_list:
job = XmlEditor(f"{template_root_path}job_template.xml")
job.update_or_add_element_value("Collection", f"/SMD/{collection}/")
job._update_config_xml(f"{job_path_root}{create_job_name(collection)}")


# Create an empty list of lists
sublists = [[] for _ in range(n)]

# Distribute elements of the big list into sublists
for i in range(len(collection_list)):
# Use modulus to decide which sublist to put the item in
sublist_index = i % n
sublists[sublist_index].append(collection_list[i])

# create the n joblists (which will execute their contents serially, in parallel with each other)
job_names = []
for index, sublist in enumerate(sublists):
joblist = XmlEditor(joblist_template_path)
for collection in sublist:
joblist.add_job_list_item(create_job_name(collection).replace(".xml", ""))

joblist._update_config_xml(f"{job_path_root}{create_joblist_name(index)}")
job_names.append(create_joblist_name(index).replace(".xml", ""))

master = XmlEditor(joblist_template_path)
master.update_or_add_element_value("RunJobsInParallel", "true")
[master.add_job_list_item(job_name) for job_name in job_names]
master._update_config_xml(
f"{job_path_root}parallel_indexing_list-{date_of_batch}-master.xml"
)
class ParallelJobCreator:
def __init__(
self,
collection_list,
template_root_path="xmls/",
job_path_root="../sinequa_configs/jobs/",
):
"""
        these default values rely on the old file structure, where sinequa_configs was a
sub-repo of sde-indexing-helper. so when running this, you will need the sde-backend
code to be inside a folder called sinequa_configs
"""

self.collection_list = collection_list
self.template_root_path = template_root_path
self.joblist_template_path = f"{template_root_path}joblist_template.xml"
self.job_path_root = job_path_root

def _create_job_name(self, collection_name):
"""
each job that runs an individual collection needs a name based on the collection name
this code generates that file name as a string, and it will be passed to the function that
creates the actual job file
"""
return f"collection.indexer.{collection_name}.xml"

def _create_joblist_name(self, index):
"""
        each job that runs a list of collections needs a name based on:
- the date the batch was created
- the index out of n total batches
this code generates that file name as a string, and it will be passed to the function that
creates the actual job file
"""
return f"parallel_indexing_list-{date_of_batch}-{index}.xml"

def _create_collection_jobs(self):
"""
in order to run a collection, a job must exist that runs it
this code:
- creates a job based on the job template
- adds the exact collection name
- saves it with a name that will reference the collection name
"""
# create single jobs to run each collection
for collection in self.collection_list:
job = XmlEditor(f"{self.template_root_path}job_template.xml")
job.update_or_add_element_value("Collection", f"/SMD/{collection}/")
job._update_config_xml(
f"{self.job_path_root}{self._create_job_name(collection)}"
)

def make_all_parallel_jobs(self):
# create initial single jobs that will be referenced by the parallel job lists
self._create_collection_jobs()

# Create an empty list of lists
sublists = [[] for _ in range(n)]

# Distribute elements of the big list into sublists
for i in range(len(self.collection_list)):
# Use modulus to decide which sublist to put the item in
sublist_index = i % n
sublists[sublist_index].append(self.collection_list[i])

        # create the n joblists (which will execute their contents serially, in parallel with each other)
job_names = []
for index, sublist in enumerate(sublists):
joblist = XmlEditor(self.joblist_template_path)
for collection in sublist:
joblist.add_job_list_item(
self._create_job_name(collection).replace(".xml", "")
)

joblist._update_config_xml(
f"{self.job_path_root}{self._create_joblist_name(index)}"
)
job_names.append(self._create_joblist_name(index).replace(".xml", ""))

master = XmlEditor(self.joblist_template_path)
master.update_or_add_element_value("RunJobsInParallel", "true")
[master.add_job_list_item(job_name) for job_name in job_names]
master._update_config_xml(
f"{self.job_path_root}parallel_indexing_list-{date_of_batch}-master.xml"
)


if __name__ == "__main__":
job_creator = ParallelJobCreator(collection_list=collection_list)
job_creator.make_all_parallel_jobs()
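As a minimal illustration of the round-robin split performed inside `make_all_parallel_jobs` (collection names are made up):

```python
# Hypothetical example: distribute 5 collections across n = 2 sublists.
collection_list = ["coll_a", "coll_b", "coll_c", "coll_d", "coll_e"]
n = 2
sublists = [[] for _ in range(n)]
for i, collection in enumerate(collection_list):
    sublists[i % n].append(collection)
# sublists == [["coll_a", "coll_c", "coll_e"], ["coll_b", "coll_d"]]
```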