Merge branch 'dev' into 410-adding-additional-status-on-the-curation-app
CarsonDavis authored Oct 9, 2023
2 parents 5328ec1 + 9fdd4b6 commit cce127f
Showing 12 changed files with 709 additions and 104 deletions.
111 changes: 99 additions & 12 deletions README.md
@@ -11,21 +11,33 @@
Moved to [settings](http://cookiecutter-django.readthedocs.io/en/latest/settings

## Basic Commands

### Building The Project
```bash
$ docker-compose -f local.yml build
```

### Running The Necessary Containers
```bash
$ docker-compose -f local.yml up
```

### Setting Up Your Users

- To create a **normal user account**, just go to Sign Up and fill out the form. Once you submit it, you'll see a "Verify Your E-mail Address" page. Go to your console to see a simulated email verification message. Copy the link into your browser. Now the user's email should be verified and ready to go.

- To create a **superuser account**, use this command:

```bash
$ docker-compose -f local.yml run --rm django python manage.py createsuperuser
```

For convenience, you can keep your normal user logged in on Chrome and your superuser logged in on Firefox (or similar), so that you can see how the site behaves for both kinds of users.

### Loading fixtures

Please note that loading fixtures currently will not create a fully working database. If you are starting the project from scratch, it is probably preferable to skip ahead to the Loading The DB From A Backup section.
- To load collections:

```bash
$ docker-compose -f local.yml run --rm django python manage.py loaddata sde_collections/fixtures/collections.json
```

### Loading scraped URLs into CandidateURLs

@@ -36,42 +48,92 @@
- Run the crawler with `scrapy crawl <name of your spider> -o scraped_urls/<config_folder>/urls.jsonl`
- Then run this (see the example below):
```bash
$ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls <config_folder_name>
```
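For example, assuming a hypothetical spider named `example_spider` and a config folder named `example_config` (substitute your own names), the two commands look like this:
```bash
$ scrapy crawl example_spider -o scraped_urls/example_config/urls.jsonl
$ docker-compose -f local.yml run --rm django python manage.py load_scraped_urls example_config
```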
### Loading The DB From A Backup
- If a database backup is available, you won't need to load the fixtures or the scraped URLs. This changes a few of the steps needed to get the project running.
- Step 1: Build the project (documented above).
- Step 2: Run the necessary containers (documented above).
- Step 3: Clear out content types using the Django shell.
  - Enter the Django shell in your Docker container:
```bash
$ docker-compose -f local.yml run --rm django python manage.py shell
```
  - In the Django shell, delete the content types:
```bash
from django.contrib.contenttypes.models import ContentType
ContentType.objects.all().delete()
```
  - Exit the shell.
- Step 4: Load your backup database.
  Assuming your backup is a `.json` file produced by `dumpdata` (see the sketch after these steps), use the `loaddata` command to populate your database.
  - If the backup file is on the local machine, make sure it's accessible to the Docker container. If the backup is outside the container, copy it inside first:
```bash
$ docker cp /path/to/your/backup.json container_name:/path/inside/container/backup.json
```
  - Load the data from your backup:
```bash
$ docker-compose -f local.yml run --rm django python manage.py loaddata /path/inside/container/backup.json
```
  - Once loaded, run migrations to ensure the schema and data are aligned:
```bash
$ docker-compose -f local.yml run --rm django python manage.py migrate
```
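If you need to produce such a backup yourself, a `dumpdata` call along these lines should work (a minimal sketch; the exact flags and output path are your choice, not a project requirement):
```bash
$ docker-compose -f local.yml run --rm django python manage.py dumpdata --natural-foreign --natural-primary --indent 2 -o backup.json
```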
### Type checks
Running type checks with mypy:

```bash
$ mypy sde_indexing_helper
```
### Test coverage
To run the tests, check your test coverage, and generate an HTML coverage report:

```bash
$ coverage run -m pytest
$ coverage html
$ open htmlcov/index.html
```
#### Running tests with pytest
```bash
$ pytest
```
### Live reloading and Sass CSS compilation
Moved to [Live reloading and SASS compilation](https://cookiecutter-django.readthedocs.io/en/latest/developing-locally.html#sass-compilation-live-reloading).
### Install Celery
Make sure Celery is installed in your environment. To install:
```bash
$ pip install celery
```
### Install all requirements
Install all packages listed in the requirements files:

```bash
pip install -r requirements/*.txt
```
### Celery
@@ -100,6 +162,31 @@
```bash
cd sde_indexing_helper
celery -A config.celery_app worker -B -l info
```
### Pre-Commit Hook Instructions
Hooks have to run on every commit to automatically take care of linting and code formatting.
To install the pre-commit package manager:
```bash
$ pip install pre-commit
```
Install the git hook scripts:
```bash
$ pre-commit install
```
Run the hooks against all of the files:
```bash
$ pre-commit run --all-files
```
It's usually a good idea to run the hooks against all of the files when adding new hooks (usually `pre-commit` will only run on the changed files during git hooks).
### Sentry
Sentry is an error logging aggregator service. You can sign up for a free account at <https://sentry.io/signup/?code=cookiecutter> or download and host it yourself.
4 changes: 4 additions & 0 deletions config_generation/README.md
@@ -12,3 +12,7 @@
Don't be fooled by the page on indexing....
https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-indexing.html#indexing-collection

You want the page on jobs https://doc.sinequa.com/en.sinequa-es.v11/Content/en.sinequa-es.devDoc.webservice.rest-operation.html#operationcollectionStart.

## Creating Job Lists
Update `config.py` to contain the latest collections you want to index, then run `generate_jobs.py` to create the parallel batches.
If you want the jobs to run on multiple nodes, you will need to add that in two places in the file; note that you then won't be able to run the lists from the masterlist, because of a Sinequa bug.
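`generate_jobs.py` imports `collection_list`, `date_of_batch`, and `n` from `config.py`. As a rough sketch (the values below are placeholders, not real collections), the file is expected to look something like this:

```python
# config.py -- placeholder values for illustration only
collection_list = [
    "example_collection_one",
    "example_collection_two",
]
date_of_batch = "20231009"  # appears in the generated joblist file names
n = 2  # number of parallel joblists to split the collections into
```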
47 changes: 44 additions & 3 deletions config_generation/db_to_xml_file_based.py
@@ -4,10 +4,8 @@
import os
import xml.etree.ElementTree as ET

from db_to_xml import XmlEditor as XmlEditorStringBased


class XmlEditor(XmlEditorStringBased):
class XmlEditor:
"""
Class is instantiated with a path to an xml.
An internal etree is generated, and changes are made in place.
@@ -79,3 +77,46 @@
def create_config_folder_and_default(self, source_name, collection_name):

# self._write_xml(xml_path)
self._update_config_xml(xml_path)

def update_or_add_element_value(
self,
element_name: str,
element_value: str,
parent_element_name: str = "",
) -> None:
"""can update the value of either a top level or secondary level value in the sinequa config
Args:
element_name (str): name of the sinequa element, such as "Simulate"
element_value (str): value to be stored to element, such as "false"
parent_element_name (str, optional): parent of the element, such as "IndexerClient"
                Defaults to "".
"""

xml_root = self.xml_tree.getroot()
parent_element = (
xml_root if not parent_element_name else xml_root.find(parent_element_name)
)

if parent_element is None:
raise ValueError(
f"Parent element '{parent_element_name}' not found in XML."
)

        existing_element = parent_element.find(element_name)
        # compare against None: ElementTree elements with no children are falsy
        if existing_element is not None:
existing_element.text = element_value
else:
ET.SubElement(parent_element, element_name).text = element_value

def add_job_list_item(self, job_name):
"""
this is specifically for editing joblist templates by adding a new collection to a joblist
config_generation/xmls/joblist_template.xml
"""
xml_root = self.xml_tree.getroot()

mapping = ET.Element("JobListItem")
ET.SubElement(mapping, "Name").text = job_name
ET.SubElement(mapping, "StopOnError").text = "false"
xml_root.append(mapping)
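As a usage sketch of the two methods added above (paths and names are illustrative; the real call sites are in `generate_jobs.py`):

```python
# Hypothetical usage; template and output paths are placeholders.
job = XmlEditor("xmls/job_template.xml")
job.update_or_add_element_value("Collection", "/SMD/example_collection/")
job._update_config_xml("../sinequa_configs/jobs/collection.indexer.example_collection.xml")

joblist = XmlEditor("xmls/joblist_template.xml")
joblist.add_job_list_item("collection.indexer.example_collection")
joblist._update_config_xml("../sinequa_configs/jobs/parallel_indexing_list-example.xml")
```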
135 changes: 90 additions & 45 deletions config_generation/generate_jobs.py
@@ -9,49 +9,94 @@

from config import collection_list, date_of_batch, n

template_root_path = "xmls/"
joblist_template_path = f"{template_root_path}joblist_template.xml"

job_path_root = "../sinequa_configs/jobs/"


def create_job_name(collection_name):
return f"collection.indexer.{collection_name}.xml"


def create_joblist_name(index):
return f"parallel_indexing_list-{date_of_batch}-{index}.xml"


# create single jobs to run each collection
for collection in collection_list:
job = XmlEditor(f"{template_root_path}job_template.xml")
job.update_or_add_element_value("Collection", f"/SMD/{collection}/")
job._update_config_xml(f"{job_path_root}{create_job_name(collection)}")


# Create an empty list of lists
sublists = [[] for _ in range(n)]

# Distribute elements of the big list into sublists
for i in range(len(collection_list)):
# Use modulus to decide which sublist to put the item in
sublist_index = i % n
sublists[sublist_index].append(collection_list[i])

# create the n joblists (which will execute their contents serially, in parallel with each other)
job_names = []
for index, sublist in enumerate(sublists):
joblist = XmlEditor(joblist_template_path)
for collection in sublist:
joblist.add_job_list_item(create_job_name(collection).replace(".xml", ""))

joblist._update_config_xml(f"{job_path_root}{create_joblist_name(index)}")
job_names.append(create_joblist_name(index).replace(".xml", ""))

master = XmlEditor(joblist_template_path)
master.update_or_add_element_value("RunJobsInParallel", "true")
[master.add_job_list_item(job_name) for job_name in job_names]
master._update_config_xml(
f"{job_path_root}parallel_indexing_list-{date_of_batch}-master.xml"
)
class ParallelJobCreator:
def __init__(
self,
collection_list,
template_root_path="xmls/",
job_path_root="../sinequa_configs/jobs/",
):
"""
        these default values rely on the old file structure, where sinequa_configs was a
sub-repo of sde-indexing-helper. so when running this, you will need the sde-backend
code to be inside a folder called sinequa_configs
"""

self.collection_list = collection_list
self.template_root_path = template_root_path
self.joblist_template_path = f"{template_root_path}joblist_template.xml"
self.job_path_root = job_path_root

def _create_job_name(self, collection_name):
"""
each job that runs an individual collection needs a name based on the collection name
this code generates that file name as a string, and it will be passed to the function that
creates the actual job file
"""
return f"collection.indexer.{collection_name}.xml"

def _create_joblist_name(self, index):
"""
        each job that runs a list of collections needs a name based on:
- the date the batch was created
- the index out of n total batches
this code generates that file name as a string, and it will be passed to the function that
creates the actual job file
"""
return f"parallel_indexing_list-{date_of_batch}-{index}.xml"

def _create_collection_jobs(self):
"""
in order to run a collection, a job must exist that runs it
this code:
- creates a job based on the job template
- adds the exact collection name
- saves it with a name that will reference the collection name
"""
# create single jobs to run each collection
for collection in self.collection_list:
job = XmlEditor(f"{self.template_root_path}job_template.xml")
job.update_or_add_element_value("Collection", f"/SMD/{collection}/")
job._update_config_xml(
f"{self.job_path_root}{self._create_job_name(collection)}"
)

def make_all_parallel_jobs(self):
# create initial single jobs that will be referenced by the parallel job lists
self._create_collection_jobs()

# Create an empty list of lists
sublists = [[] for _ in range(n)]

# Distribute elements of the big list into sublists
for i in range(len(self.collection_list)):
# Use modulus to decide which sublist to put the item in
sublist_index = i % n
sublists[sublist_index].append(self.collection_list[i])

        # create the n joblists (which will execute their contents serially, in parallel with each other)
job_names = []
for index, sublist in enumerate(sublists):
joblist = XmlEditor(self.joblist_template_path)
for collection in sublist:
joblist.add_job_list_item(
self._create_job_name(collection).replace(".xml", "")
)

joblist._update_config_xml(
f"{self.job_path_root}{self._create_joblist_name(index)}"
)
job_names.append(self._create_joblist_name(index).replace(".xml", ""))

master = XmlEditor(self.joblist_template_path)
master.update_or_add_element_value("RunJobsInParallel", "true")
[master.add_job_list_item(job_name) for job_name in job_names]
master._update_config_xml(
f"{self.job_path_root}parallel_indexing_list-{date_of_batch}-master.xml"
)


if __name__ == "__main__":
job_creator = ParallelJobCreator(collection_list=collection_list)
job_creator.make_all_parallel_jobs()
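As a minimal illustration of the round-robin split performed inside `make_all_parallel_jobs` (collection names are made up):

```python
# Hypothetical example: distribute 5 collections across n = 2 sublists.
collection_list = ["coll_a", "coll_b", "coll_c", "coll_d", "coll_e"]
n = 2
sublists = [[] for _ in range(n)]
for i, collection in enumerate(collection_list):
    sublists[i % n].append(collection)
# sublists == [["coll_a", "coll_c", "coll_e"], ["coll_b", "coll_d"]]
```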