Replies: 3 comments 13 replies
-
Hi @sridhar, what happens if you try running this code? It's possible that the code exits with an error, which is why you would see no process running on the GPU. Looking at your GPU memory usage, it might have to do with your large batch size (the default size is …
-
The code is running (for a day now). I can see all the tqdm status bars and all the batches being processed. It just runs on a single CPU core :(
-
Thanks for such a detailed report @sridhar - the results you get are expected. If you look at the code for negative mining, it really shouldn't use much GPU. You could increase the batch size (say, 16 or 32), but I doubt you can load the GPU much more than you currently do. I am not sure why your adaptation runs for more than a day when the example adaptation above, on 10k documents, takes roughly 15 minutes end-to-end. What happens if you shrink your corpus to 10k documents?
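The comparison suggested above (shrinking the corpus to 10k documents to see whether the runtime matches the example) can be sketched like this. `subsample_documents` is a hypothetical helper written for illustration, not part of Haystack:

```python
import random


def subsample_documents(documents, k=10_000, seed=42):
    """Return a reproducible random subset of at most k documents.

    Using a fixed seed makes the timing comparison repeatable
    across runs with the same corpus.
    """
    if len(documents) <= k:
        return list(documents)
    rng = random.Random(seed)
    return rng.sample(documents, k)


# Example: shrink a 25k-document corpus to 10k before running GPL.
docs = [f"doc-{i}" for i in range(25_000)]
subset = subsample_documents(docs)
print(len(subset))  # 10000
```

If adaptation on the 10k subset finishes in roughly the expected 15 minutes, the slowdown likely scales with corpus size rather than being a GPU-scheduling issue.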
-
Hi all,
I'm trying to train an EmbeddingRetriever using GPL on a Google Cloud VM with an NVIDIA Tesla A100. This is on a 12-core Xeon box.
I can see that the rest of the code is being scheduled on the GPU (tokenizer, QA generation, etc.); however, PseudoLabelGenerator is not. Here's the code that uses it. Am I missing something here?
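For reference, a minimal sketch of the GPL flow being discussed, based on Haystack v1's `PseudoLabelGenerator` API. The model names, document store choice, and `batch_size` value below are assumptions for illustration, not taken from this thread:

```python
# Sketch: GPL domain adaptation of an EmbeddingRetriever (Haystack v1 API).
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, PseudoLabelGenerator

document_store = InMemoryDocumentStore()
# ... write your corpus into the document store first ...

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",  # assumed model
    model_format="sentence_transformers",
    use_gpu=True,
)

# PseudoLabelGenerator generates pseudo queries, mines hard negatives,
# and scores (query, positive, negative) triples with a cross-encoder.
# As noted in the replies, negative mining itself is not GPU-heavy.
psg = PseudoLabelGenerator(
    question_producer="doc2query/msmarco-t5-base-v1",  # assumed query-generation model
    retriever=retriever,
    batch_size=32,  # the reply above suggests trying 16/32
    use_gpu=True,
)
output, _ = psg.run(documents=document_store.get_all_documents())

# Fine-tune the retriever on the mined GPL labels.
retriever.train(output["gpl_labels"])
```

Even with `use_gpu=True`, only the model forward passes (query generation, encoding, cross-encoder scoring) run on the GPU; the surrounding mining bookkeeping is CPU-bound, which matches the behavior described above.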