
Indexer service use Solr authentication to index documents & unites c… #24

Merged (1 commit) on Sep 26, 2024
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -20,4 +20,4 @@ jobs:
docker compose build

- name: Run tests
run: docker compose run all_tests
30 changes: 25 additions & 5 deletions README.md
@@ -71,12 +71,26 @@ systems involved in the flow to index documents in Full-text search index. The q
In your workdir:

1. Clone the repository:

```git clone git@github.com:hathitrust/ht_indexer.git```
2. Go to the folder ``cd ht_indexer``
3. Create the image

`docker build -t document_generator .`

4. Run the services

1. Retriever service

`docker compose up document_retriever -d`
Member commented:

Could we define dependencies or profiles in docker compose to avoid needing to manually start all 3?

Contributor Author replied:

I created profiles for each service, and I also created different services in the docker-compose file to group and independently run each service's tests, but this solution started to fail when running the tests in GitHub Actions. For that reason, I abandoned it. I can add a GitHub issue to try it again in the future with K'Ron's support.
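For reference, a minimal sketch of what compose profiles could look like for this setup; the profile names here are my assumption, not values from the repo:

```yaml
# Hypothetical: tagging each service with a profile so it can be started
# individually or as a group, e.g. `docker compose --profile all up -d`
services:
  document_retriever:
    profiles: [retriever, all]
  document_generator:
    profiles: [generator, all]
  document_indexer:
    profiles: [indexer, all]
```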


2. Generator service

`docker compose up document_generator -d`

3. Indexer service

`docker compose up document_indexer -d`

If you want to run the application in your local environment and outside the docker container, you should
follow the steps mentioned in the section [How to set up your python environment](#project-set-up-local-environment)
@@ -172,7 +186,7 @@ A message can fail for different reasons:

We use a **dead-letter-exchange** to handle messages that are not processed successfully. The dead-letter-exchange is
an exchange to which messages will be re-routed if they are rejected by the queue. In the current logic, every service
using the queue system has a dead-letter-exchange associated with it. One of our future steps is to figure out what we
will do with the messages in the dead-letter-exchange.
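To make the wiring concrete, here is a small sketch of how a queue is attached to a dead-letter-exchange in RabbitMQ; the exchange and routing-key names below are illustrative assumptions, not values from this repo:

```python
# Declare-time arguments that attach a dead-letter-exchange to a queue.
# Messages rejected by the queue are re-routed to "indexer_dlx" (assumed name).
dlx_arguments = {
    "x-dead-letter-exchange": "indexer_dlx",
    "x-dead-letter-routing-key": "indexer_queue_dead",
}

# With pika, the declaration would look roughly like:
# channel.queue_declare(queue="indexer_queue", durable=True, arguments=dlx_arguments)
```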

@@ -308,9 +322,15 @@ In the working directory,

* Run document_indexer_service container and test it

The Solr server requires authentication, so you should set the environment variables SOLR_USER and SOLR_PASSWORD before
starting the container. All the users (solr, admin, and fulltext) use the same Solr password (solrRocks):

export SOLR_USER=admin
export SOLR_PASSWORD=solrRocks

```docker compose up document_indexer -d```

```docker compose exec document_indexer pytest document_indexer_service ht_indexer_api ht_queue_service```
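These credentials travel as an HTTP Basic auth header; as a quick illustration of what the client actually transmits (values taken from the exports above):

```python
import base64

# HTTP Basic auth: "Basic " + base64("user:password")
user, password = "admin", "solrRocks"
token = base64.b64encode(f"{user}:{password}".encode()).decode()
auth_header = f"Basic {token}"
print(auth_header)  # → Basic YWRtaW46c29sclJvY2tz
```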

## Hosting

@@ -363,7 +383,7 @@ In the image below, you can see the main kubernetes parts running in this workfl
* Document indexer
```
python document_indexer_service/document_indexer_service.py
--solr_indexing_api http://fulltext-workshop-solrcloud-headless:8983/solr/core-x/
```

* In Kubernetes, you can also use the script `run_retriever_processor_kubernetes.sh` to run the services to retrieve
121 changes: 75 additions & 46 deletions docker-compose.yml
@@ -1,5 +1,3 @@
services:
document_retriever:
container_name: document_retriever
@@ -11,12 +9,12 @@ services:
ports:
- "8081:8081"
environment:
SOLR_URL: http://solr-sdr-catalog:9033/solr/#/catalog/
SDR_DIR: /sdr1/obj
QUEUE_HOST: rabbitmq
QUEUE_NAME: retriever_queue
QUEUE_PASS: guest
QUEUE_USER: guest
depends_on:
solr-sdr-catalog:
condition: service_healthy
@@ -32,11 +30,11 @@ services:
ports:
- "3306:3306"
environment:
MYSQL_HOST: mysql-sdr
MYSQL_USER: mdp-lib
MYSQL_PASS: mdp-lib
MYSQL_DATABASE: ht
MYSQL_RANDOM_ROOT_PASSWORD: 1
healthcheck:
interval: 30s
retries: 3
@@ -49,18 +47,46 @@ services:
]
timeout: 30s
solr-lss-dev:
image: ghcr.io/hathitrust/full-text-search-cloud:shards-docker
container_name: solr-lss-dev
ports:
- "8983:8983"
environment:
ZK_HOST: zoo1:2181
SOLR_OPTS: -XX:-UseLargePages
SOLR_USER: solr
SOLR_PASSWORD: 'solrRocks'
depends_on:
zoo1:
condition: service_healthy
volumes:
- solr1_data:/var/solr/data
# start solr in the background, wait for it to start, then create the collection
command: [ "sh", "-c", 'solr-foreground -c & sleep 150 && export SOLR_AUTHENTICATION_OPTS=-Dbasicauth="$SOLR_USER":"$SOLR_PASSWORD" && solr create_collection -d /opt/solr/core-x -c core-x -shards 1 -replicationFactor 1 -p 8983 && wait' ]
healthcheck:
test: [ "CMD-SHELL", "solr healthcheck -c core-x || echo 'Healthcheck failed'" ]
interval: 30s
timeout: 10s
retries: 5
zoo1:
image: zookeeper:3.8.0
container_name: zoo1
restart: always
hostname: zoo1
ports:
- 2181:2181
- 7001:7000
environment:
ZOO_MY_ID: 1
ZOO_SERVERS: server.1=zoo1:2888:3888;2181
ZOO_4LW_COMMANDS_WHITELIST: mntr, conf, ruok
ZOO_CFG_EXTRA: "metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true"
volumes:
- zoo1_data:/data
healthcheck:
test: [ "CMD-SHELL", "echo ruok | nc localhost 2181 | grep imok" ] # exec-form CMD cannot pipe; CMD-SHELL runs through a shell
interval: 30s
timeout: 10s
start_period: 30s
retries: 5
solr-sdr-catalog:
container_name: solr-sdr-catalog
@@ -75,20 +101,6 @@ services:
- "9033:9033"
expose:
- 9033
document_generator:
container_name: document_generator
image: document_generator
@@ -103,14 +115,14 @@ services:
tty: true
stdin_open: true
environment:
SRC_QUEUE_HOST: rabbitmq
SRC_QUEUE_NAME: retriever_queue
SRC_QUEUE_PASS: guest
SRC_QUEUE_USER: guest
TGT_QUEUE_HOST: rabbitmq
TGT_QUEUE_NAME: indexer_queue
TGT_QUEUE_PASS: guest
TGT_QUEUE_USER: guest
command: [ "python", "document_generator/document_generator_service.py" ]
document_indexer:
container_name: document_indexer
@@ -127,10 +139,12 @@ services:
tty: true
stdin_open: true
environment:
QUEUE_HOST: rabbitmq
QUEUE_NAME: indexer_queue
QUEUE_PASS: guest
QUEUE_USER: guest
SOLR_USER: solr
SOLR_PASSWORD: 'solrRocks'
command: [ "python", "document_indexer_service/document_indexer_service.py", "--solr_indexing_api", "http://solr-lss-dev:8983/solr/#/core-x/" ]
rabbitmq:
container_name: rabbitmq
@@ -147,7 +161,22 @@ services:
timeout: 10s
start_period: 30s
retries: 5
all_tests:
container_name: all_tests
build: .
volumes:
- .:/app
- ../tmp:/tmp
command: [ "pytest" ]
depends_on:
solr-sdr-catalog:
condition: service_healthy
solr-lss-dev:
condition: service_healthy
rabbitmq:
condition: service_healthy
volumes:
mysql_sdr_data:
solr1_data: null
zoo1_data: null

7 changes: 6 additions & 1 deletion document_indexer_service/indexer_arguments.py
@@ -29,7 +29,12 @@ def __init__(self, parser):

self.args = parser.parse_args()

solr_user = os.getenv("SOLR_USER")
solr_password = os.getenv("SOLR_PASSWORD")

self.solr_api_full_text = HTSolrAPI(url=self.args.solr_indexing_api,
user=solr_user,
password=solr_password)

self.document_local_path = self.args.document_local_path

2 changes: 2 additions & 0 deletions env.example
@@ -0,0 +1,2 @@
SOLR_USER=solr
SOLR_PASSWORD=solrRocks
8 changes: 6 additions & 2 deletions ht_indexer_api/ht_indexer_api.py
@@ -2,29 +2,32 @@
from typing import Text

import requests
from requests.auth import HTTPBasicAuth

from ht_utils.ht_logger import get_ht_logger

logger = get_ht_logger(name=__name__)


class HTSolrAPI:
def __init__(self, url, user=None, password=None):
self.url = url
self.auth = HTTPBasicAuth(user, password) if user and password else None

def get_solr_status(self):
response = requests.get(self.url)
return response

def index_document(self, xml_data: dict, content_type: Text = "application/json"):
"""Feed a JSON object, create an XML string to index the document into SOLR
"Content-Type": "application/json"
"""
try:
response = requests.post(
f"{self.url.replace('#/', '')}update/json/docs",
headers={"Content-Type": content_type},
json=xml_data,
auth=self.auth,
params={
"commit": "true",
}, )
Expand All @@ -47,6 +50,7 @@ def index_documents(self, path: Path, list_documents: list = None, solr_url_json
response = requests.post(
f"{self.url.replace('#/', '')}{solr_url_json}?commit=true",
headers=headers,
auth=self.auth,
data=data_dict,
params={
"commit": "true",
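The conditional in the constructor above (credentials attached only when both user and password are present) can be exercised on its own; a small sketch, with a helper name of my own choosing rather than the repo's:

```python
from requests.auth import HTTPBasicAuth

def make_auth(user=None, password=None):
    # Mirrors HTSolrAPI.__init__: only build credentials when both are set
    return HTTPBasicAuth(user, password) if user and password else None
```

requests then attaches the header per request via the `auth=` keyword, which is why the diff passes `auth=self.auth` into each `requests.post` call.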