- Introduction
- Setup Instructions
- Getting Started
- Abstract Idea and Problem
- Tradeoff between RAM size and DB access time with the solution
- Implementation
- Future Development
- Results
- License
- Acknowledgements
If you want to set up & run this repo, then check this README file.
First & foremost, the main process works in batches that follow a JSON-like schema (an example is given in the Implementation section) and contain Wikipedia page links. These pages are scraped, pre-processed & knowledge is extracted from them. Once a batch is complete, the extracted knowledge stored in it is transferred to the MongoDB Atlas server for temporary persistence (the permanent storage is the Neo4j DB). MongoDB is only used as a helper DB here because the complete process runs on Google Colab & I didn't want to host the Neo4j DB, so I made use of the free tier of the MongoDB Atlas server 😅.
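For reference, talking to the Atlas cluster from Colab is straightforward with `pymongo`; a minimal sketch is below (the connection string, the database name `wiki_kg`, and the collection name `batches` are placeholders, not the names used in the actual code).

```python
from pymongo import MongoClient

# Placeholder connection string; the real one comes from the MongoDB Atlas
# dashboard and should be kept out of version control.
MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>.mongodb.net/"

client = MongoClient(MONGO_URI)
db = client["wiki_kg"]       # hypothetical database name
batches = db["batches"]      # hypothetical collection holding processed batches
```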
You can try the larger model (`python -m spacy download en_core_web_lg`), but the code doesn't work even with 12 GB of RAM on Google Colab because some pages have massive content.
(For example, when I ran the code on Google Colab, it killed the process when it reached this Wikipedia page.)
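As a rough illustration, switching models is a one-line change wherever spaCy is loaded (the small model is presumably the default here; whether the large one fits in memory depends on the pages in the batch):

```python
import spacy

# The small English model keeps memory usage manageable on Colab.
nlp = spacy.load("en_core_web_sm")

# The larger model may give better extractions, but together with big
# Wikipedia pages held in RAM it tends to exhaust a 12 GB Colab session.
# nlp = spacy.load("en_core_web_lg")
```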
Why does the code require 12+ GB of RAM? (A possible reason:)
When the larger model is loaded, it already takes up a large amount of RAM on its own. At the same time, the code is running the batch process, i.e. it takes 5 Wikipedia links, finds the knowledge representation for each, and keeps all of this data in RAM; only after the batch completes is the data transferred to the MongoDB Atlas server.
As you saw in the previous section, a batch is collected first & then stored in MongoDB. It is also possible to extract the knowledge from a link & store the data directly in MongoDB, but this is time-consuming because we would be accessing the DB for each & every link, so I used the batch-insert approach. The problem with the batched approach is the limited amount of RAM: since I am keeping all of the extracted data in RAM, it is possible that the knowledge-extraction sub-process does not get enough memory (for some pages the extracted knowledge is massive & for others it is tiny).
Since the amount of knowledge found on a page can be massive or tiny, one way to solve this problem is to keep track of how many entity pairs (knowledge) are stored in the batch & if the count exceeds a certain threshold, insert them into the DB & delete them from RAM.
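A minimal sketch of that idea, assuming the hypothetical `batches` collection from above and a made-up `extract_knowledge` helper (neither name comes from the actual scripts):

```python
MAX_PAIRS_IN_RAM = 5000  # hypothetical threshold on entity pairs kept in memory

def process_links(links, extract_knowledge, batches):
    """Extract knowledge per link, flushing to MongoDB whenever the
    number of entity pairs held in RAM crosses the threshold."""
    batch, pair_count = [], 0
    for url in links:
        doc = extract_knowledge(url)        # returns one dict in the batch schema
        batch.append(doc)
        pair_count += len(doc["entity_list"])
        if pair_count >= MAX_PAIRS_IN_RAM:
            batches.insert_many(batch)      # move the extracted knowledge to the DB...
            batch, pair_count = [], 0       # ...and free the RAM it was using
    if batch:                               # flush whatever is left at the end
        batches.insert_many(batch)
```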
You will find that the code says mini-batch, but batch & mini-batch are the same thing; I will update the code soon to avoid confusion.
```python
batch = [
    {
        "doc_name": "Albert Einstein",
        "wiki_url": "https://en.wikipedia.org/wiki/Albert_Einstein",
        "done": True,
        "entity_list": [
            {
                "subject": "Albert einstein",
                "relation": "produced",
                "object": "E = mc2",
                "subj_type": "PERSON",
                "obj_type": "NOUN_CHUNK"
            },
        ]
    },
]
```
---- For Insertion ----
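The insertion itself is just a bulk write against the helper DB; a hedged sketch, reusing the hypothetical `batches` collection from above:

```python
def insert_batch(batches, batch):
    """Push a completed batch to MongoDB Atlas in a single round trip."""
    result = batches.insert_many(batch)
    print(f"Inserted {len(result.inserted_ids)} documents")
```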
---- For fetching ----
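Fetching is a query against the same collection; here is a sketch that pulls only the documents whose extraction is finished (the `done` flag comes from the batch schema above):

```python
def fetch_done_documents(batches):
    """Fetch every document that has already been through knowledge extraction."""
    return list(batches.find({"done": True}))
```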
Processing: Links from the batch are sent to the Text Preprocessing sub-process, which returns a cleaned version of the HTML. This textual data is then sent to the Knowledge Extraction sub-process, which returns a list of entities. After all the links in the batch are done, the data stored in the batch is passed to the MongoDB Connector.
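Roughly, the per-batch flow looks like the sketch below; `preprocess_text`, `extract_entities`, and `store_batch` are illustrative stand-ins for the actual sub-process scripts and are passed in as callables:

```python
import requests

def run_batch(batch, preprocess_text, extract_entities, store_batch):
    """Scrape, clean, and extract knowledge for every link in the batch,
    then hand the finished batch over to the MongoDB connector."""
    for doc in batch:
        html = requests.get(doc["wiki_url"]).text      # scrape the Wikipedia page
        text = preprocess_text(html)                   # Text Preprocessing sub-process
        doc["entity_list"] = extract_entities(text)    # Knowledge Extraction sub-process
        doc["done"] = True
    store_batch(batch)                                 # MongoDB Connector
```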
- Improving the script to incorporate the solution mentioned above for the RAM vs. DB-access-time trade-off.
- Optimizing the knowledge extraction script so that it keeps only useful information.
- Currently, statements having multiple subjects or objects are ignored, so an efficient way to extract knowledge from statements that contain more than one subject or object still needs to be implemented.
The above image shows the extracted knowledge for "Albert Einstein".