Indexer service uses Solr authentication to index documents & unit tests changed to avoid Solr authentication

Profiles were added to docker-compose to start and test each of the services (retriever, generator, indexer) independently, but I removed them because they did not work well with GitHub Actions. The Solr service has been updated to accept authentication and to create the collection.
Update README.md
liseli committed Sep 25, 2024
1 parent a5b9086 commit f0b72b2
Showing 9 changed files with 181 additions and 65 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -20,4 +20,4 @@ jobs:
docker compose build
- name: Run tests
run: docker compose run test
run: docker compose run all_tests
30 changes: 25 additions & 5 deletions README.md
@@ -71,12 +71,26 @@ systems involved in the flow to index documents in Full-text search index. The q
In your workdir:

1. Clone the repository:

```git clone git@github.com:hathitrust/ht_indexer.git```
2. Go to the folder ``cd ht_indexer``
3. Create the image

`docker build -t document_generator .`
4. Run the container
`docker compose up document_retriever -d`

4. Run the services

1. Retriever service

`docker compose up document_retriever -d`

2. Generator service

`docker compose up document_generator -d`

3. Indexer service

`docker compose up document_indexer -d`

If you want to run the application in your local environment and outside the docker container, you should
follow the steps mentioned in the section [How to set up your python environment](#project-set-up-local-environment)
@@ -172,7 +186,7 @@ A message can fail for different reasons:

We use a **dead-letter-exchange** to handle messages that are not processed successfully. The dead-letter-exchange is
an exchange to which messages will be re-routed if they are rejected by the queue. In the current logic, all the service
using the queue system has a dead-letter-exchange associated with it. One of our future steps is to figure out what we
using the queue system has a dead-letter-exchange associated with it. One of our future steps is to figure out what we
will do
with the messages in the dead-letter-exchange.
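
As a minimal sketch of the dead-letter setup described above, the queue arguments could be built as shown below. The `x-dead-letter-exchange` and `x-dead-letter-routing-key` argument names are RabbitMQ's own; the helper names, the `_dead_letter` queue suffix, and the assumption that `channel` is a `pika` blocking channel are illustrative only and may not match what `ht_queue_service` actually does.

```python
from typing import Optional


def dead_letter_arguments(dlx_name: str, routing_key: Optional[str] = None) -> dict:
    """Build the queue arguments that make RabbitMQ re-route rejected messages."""
    arguments = {"x-dead-letter-exchange": dlx_name}
    if routing_key is not None:
        arguments["x-dead-letter-routing-key"] = routing_key
    return arguments


def declare_queue_with_dlx(channel, queue_name: str, dlx_name: str) -> None:
    """Declare a dead-letter exchange, its queue, and the main queue.

    `channel` is assumed to behave like a pika BlockingChannel.
    """
    dead_letter_queue = f"{queue_name}_dead_letter"  # hypothetical naming scheme
    # Exchange and queue that receive the rejected messages.
    channel.exchange_declare(exchange=dlx_name, exchange_type="direct", durable=True)
    channel.queue_declare(queue=dead_letter_queue, durable=True)
    channel.queue_bind(queue=dead_letter_queue, exchange=dlx_name, routing_key=queue_name)
    # Main queue: rejected messages are re-published to the exchange above.
    channel.queue_declare(
        queue=queue_name,
        durable=True,
        arguments=dead_letter_arguments(dlx_name, routing_key=queue_name),
    )
```

With this layout, a message rejected without requeueing on `queue_name` ends up in the dead-letter queue, where it can be inspected or replayed later.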

@@ -308,9 +322,15 @@ In the working directory,

* Run document_indexer_service container and test it

The Solr server requires authentication, so you should set the environment variables SOLR_USER and SOLR_PASSWORD before
starting the container. All the users (solr, admin and fulltext) use the same Solr password (solrRocks).

export SOLR_USER=admin
export SOLR_PASSWORD=solrRocks

```docker compose up document_indexer -d```

```docker compose exec document_indexer pytest ht_indexer_api ht_queue_service```
```docker compose exec document_indexer pytest document_indexer_service ht_indexer_api ht_queue_service```

## Hosting

@@ -363,7 +383,7 @@ In the image below, you can see the main kubernetes parts running in this workfl
* Document indexer
```
python document_indexer_service/document_indexer_service.py
--solr_indexing_api http://solr8-embedded-zookeeper:8983/solr/#/core-x/
--solr_indexing_api http://fulltext-workshop-solrcloud-headless:8983/solr/core-x/
```
* In Kubernetes, you can also use the script `run_retriever_processor_kubernetes.sh` to run the services to retrieve
121 changes: 75 additions & 46 deletions docker-compose.yml
@@ -1,5 +1,3 @@
version: "3"

services:
document_retriever:
container_name: document_retriever
@@ -11,12 +9,12 @@ services:
ports:
- "8081:8081"
environment:
- SOLR_URL=http://solr-sdr-catalog:9033/solr/#/catalog/
- SDR_DIR=/sdr1/obj
- QUEUE_HOST=rabbitmq
- QUEUE_NAME=retriever_queue
- QUEUE_PASS=guest
- QUEUE_USER=guest
SOLR_URL: http://solr-sdr-catalog:9033/solr/#/catalog/
SDR_DIR: /sdr1/obj
QUEUE_HOST: rabbitmq
QUEUE_NAME: retriever_queue
QUEUE_PASS: guest
QUEUE_USER: guest
depends_on:
solr-sdr-catalog:
condition: service_healthy
@@ -32,11 +30,11 @@ services:
ports:
- "3306:3306"
environment:
- MYSQL_HOST=mysql-sdr
- MYSQL_USER=mdp-lib
- MYSQL_PASS=mdp-lib
- MYSQL_DATABASE=ht
- MYSQL_RANDOM_ROOT_PASSWORD=1
MYSQL_HOST: mysql-sdr
MYSQL_USER: mdp-lib
MYSQL_PASS: mdp-lib
MYSQL_DATABASE: ht
MYSQL_RANDOM_ROOT_PASSWORD: 1
healthcheck:
interval: 30s
retries: 3
@@ -49,18 +47,46 @@ services:
]
timeout: 30s
solr-lss-dev:
image: ghcr.io/hathitrust/full-text-search-embedded_zoo:example-8.11
image: ghcr.io/hathitrust/full-text-search-cloud:shards-docker
container_name: solr-lss-dev
ports:
- "8983:8983"
environment:
ZK_HOST: zoo1:2181
SOLR_OPTS: -XX:-UseLargePages
SOLR_USER: solr
SOLR_PASSWORD: 'solrRocks'
depends_on:
zoo1:
condition: service_healthy
volumes:
- solr_data:/var/solr/data
command: solr-foreground -c
- solr1_data:/var/solr/data
# start solr in the background, wait for it to start, then create the collection
command: [ "sh", "-c", 'solr-foreground -c & sleep 150 && export SOLR_AUTHENTICATION_OPTS=-Dbasicauth="$SOLR_USER":"$SOLR_PASSWORD" && solr create_collection -d /opt/solr/core-x -c core-x -shards 1 -replicationFactor 1 -p 8983 && wait' ]
healthcheck:
test: [ "CMD-SHELL", "solr healthcheck -c core-x" ]
interval: 5s
test: [ "CMD-SHELL", "solr healthcheck -c core-x || echo 'Healthcheck failed'" ]
interval: 30s
timeout: 10s
retries: 5
zoo1:
image: zookeeper:3.8.0
container_name: zoo1
restart: always
hostname: zoo1
ports:
- 2181:2181
- 7001:7000
environment:
ZOO_MY_ID: 1
ZOO_SERVERS: server.1=zoo1:2888:3888;2181
ZOO_4LW_COMMANDS_WHITELIST: mntr, conf, ruok
ZOO_CFG_EXTRA: "metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true"
volumes:
- zoo1_data:/data
healthcheck:
test: [ "CMD-SHELL", "echo ruok | nc localhost 2181 | grep imok" ]
interval: 30s
timeout: 10s
start_period: 30s
retries: 5
solr-sdr-catalog:
container_name: solr-sdr-catalog
@@ -75,20 +101,6 @@ services:
- "9033:9033"
expose:
- 9033
test:
container_name: indexing_test
build: .
volumes:
- .:/app
- ../tmp:/tmp
command: [ "pytest" ]
depends_on:
solr-sdr-catalog:
condition: service_healthy
solr-lss-dev:
condition: service_healthy
rabbitmq:
condition: service_healthy
document_generator:
container_name: document_generator
image: document_generator
@@ -103,14 +115,14 @@ services:
tty: true
stdin_open: true
environment:
- SRC_QUEUE_HOST=rabbitmq
- SRC_QUEUE_NAME=retriever_queue
- SRC_QUEUE_PASS=guest
- SRC_QUEUE_USER=guest
- TGT_QUEUE_HOST=rabbitmq
- TGT_QUEUE_NAME=indexer_queue
- TGT_QUEUE_PASS=guest
- TGT_QUEUE_USER=guest
SRC_QUEUE_HOST: rabbitmq
SRC_QUEUE_NAME: retriever_queue
SRC_QUEUE_PASS: guest
SRC_QUEUE_USER: guest
TGT_QUEUE_HOST: rabbitmq
TGT_QUEUE_NAME: indexer_queue
TGT_QUEUE_PASS: guest
TGT_QUEUE_USER: guest
command: [ "python", "document_generator/document_generator_service.py" ]
document_indexer:
container_name: document_indexer
@@ -127,10 +139,12 @@ services:
tty: true
stdin_open: true
environment:
- QUEUE_HOST=rabbitmq
- QUEUE_NAME=indexer_queue
- QUEUE_PASS=guest
- QUEUE_USER=guest
QUEUE_HOST: rabbitmq
QUEUE_NAME: indexer_queue
QUEUE_PASS: guest
QUEUE_USER: guest
SOLR_USER: solr
SOLR_PASSWORD: 'solrRocks'
command: [ "python", "document_indexer_service/document_indexer_service.py", "--solr_indexing_api", "http://solr-lss-dev:8983/solr/#/core-x/" ]
rabbitmq:
container_name: rabbitmq
@@ -147,7 +161,22 @@ services:
timeout: 10s
start_period: 30s
retries: 5
all_tests:
container_name: all_tests
build: .
volumes:
- .:/app
- ../tmp:/tmp
command: [ "pytest" ]
depends_on:
solr-sdr-catalog:
condition: service_healthy
solr-lss-dev:
condition: service_healthy
rabbitmq:
condition: service_healthy
volumes:
solr_data:
mysql_sdr_data:
solr1_data: null
zoo1_data: null
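
The `solr-lss-dev` command above waits a fixed `sleep 150` before creating the collection. One alternative, sketched here under assumptions, is to poll until Solr answers and only then proceed. The helper names `wait_until` and `solr_responds`, the default timeouts, and the probed URL are illustrative and not part of this commit.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 150.0, interval: float = 5.0) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def solr_responds(url: str = "http://localhost:8983/solr/") -> bool:
    """Return True once the Solr HTTP port answers at all (assumed probe URL)."""
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except urllib.error.HTTPError:
        # An HTTP error such as 401 still means the server is up.
        return True
    except (urllib.error.URLError, OSError):
        return False
```

A caller could run `wait_until(solr_responds)` and create the collection only when it returns `True`, which avoids waiting the full 150 seconds when Solr starts quickly.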

7 changes: 6 additions & 1 deletion document_indexer_service/indexer_arguments.py
@@ -29,7 +29,12 @@ def __init__(self, parser):

self.args = parser.parse_args()

self.solr_api_full_text = HTSolrAPI(url=self.args.solr_indexing_api)
solr_user = os.getenv("SOLR_USER")
solr_password = os.getenv("SOLR_PASSWORD")

self.solr_api_full_text = HTSolrAPI(url=self.args.solr_indexing_api,
user=solr_user,
password=solr_password)

self.document_local_path = self.args.document_local_path
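
`os.getenv` returns `None` when a variable is unset, and `HTSolrAPI` only enables basic authentication when both values are present. The sketch below shows a stricter check that flags the case where only one of the two variables is set. The function name `solr_credentials_from_env` is hypothetical and does not exist in the repository.

```python
import os
from typing import Mapping, Optional, Tuple


def solr_credentials_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[Tuple[str, str]]:
    """Return (user, password), or None when authentication is not configured."""
    env = os.environ if env is None else env
    user = env.get("SOLR_USER")
    password = env.get("SOLR_PASSWORD")
    if not user and not password:
        # Neither variable is set: run without authentication.
        return None
    if not user or not password:
        # Only one is set: most likely a configuration mistake.
        raise ValueError("Set both SOLR_USER and SOLR_PASSWORD, or neither")
    return user, password
```

Failing early here would surface a half-configured environment at start-up instead of as a 401 response from Solr during indexing.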

2 changes: 2 additions & 0 deletions env.example
@@ -0,0 +1,2 @@
SOLR_USER=solr
SOLR_PASSWORD=solrRocks
8 changes: 6 additions & 2 deletions ht_indexer_api/ht_indexer_api.py
@@ -2,29 +2,32 @@
from typing import Text

import requests
from requests.auth import HTTPBasicAuth

from ht_utils.ht_logger import get_ht_logger

logger = get_ht_logger(name=__name__)


class HTSolrAPI:
def __init__(self, url):
def __init__(self, url, user=None, password=None):
self.url = url
self.auth = HTTPBasicAuth(user, password) if user and password else None

def get_solr_status(self):
response = requests.get(self.url)
return response

def index_document(self, xml_data: dict, content_type: Text = "application/json"):
"""Feed a JSON object, create an XML string to index the document into SOLR
"""Feed a JSON object, create an XML string to index the document into SOLR
"Content-Type": "application/json"
"""
try:
response = requests.post(
f"{self.url.replace('#/', '')}update/json/docs",
headers={"Content-Type": content_type},
json=xml_data,
auth=self.auth,
params={
"commit": "true",
}, )
@@ -47,6 +50,7 @@ def index_documents(self, path: Path, list_documents: list = None, solr_url_json
response = requests.post(
f"{self.url.replace('#/', '')}{solr_url_json}?commit=true",
headers=headers,
auth=self.auth,
data=data_dict,
params={
"commit": "true",
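
To make the request construction above concrete, this sketch shows two pieces in isolation: stripping the admin-UI fragment `#/` from the configured URL, and the `Authorization` header that HTTP basic authentication sends. It uses only the standard library, and both helper names are hypothetical. In the actual code, `requests`' `HTTPBasicAuth` builds the header itself.

```python
import base64


def build_update_url(base_url: str, endpoint: str = "update/json/docs") -> str:
    """Drop the '#/' admin-UI fragment and append the Solr update endpoint."""
    return f"{base_url.replace('#/', '')}{endpoint}"


def basic_auth_header(user: str, password: str) -> dict:
    """Build the header sent for HTTP basic authentication (sketch, UTF-8 encoded)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(build_update_url("http://solr-lss-dev:8983/solr/#/core-x/"))
# prints: http://solr-lss-dev:8983/solr/core-x/update/json/docs
```

This also shows why the `docker-compose.yml` URL can keep the `#/` form while the Kubernetes example in the README uses the plain `/solr/core-x/` path: both resolve to the same update endpoint.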