-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprogress_log.txt
113 lines (72 loc) · 9.92 KB
/
progress_log.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
So, I am going to try and make a web app that will be deployed to Heroku and will do this:
- A user can upload a list of pdfs that have weird non-self-explanatory names (usually happens in the case of research paper)
- As an output, they will get a zip file with all those pdf files renamed with the title of the pdf file
- In the case of research papers, the new filename will be the year and the title and this will be straightforward to extract (or so I hope)
- As for other pdf documents, will need to figure out a different strategy such as finding the 'topic keyword' through ML maybe (that might be a good exercise for later)
For now, let's limit the focus to just uploading research papers.
So the front-end will have:
- A page where you can drag and drop or upload a range of files
- These files are then sent to the backend for processing
- Once we receive the zipped file containing all renamed pdf files, display a download option that lets users download each file
At the backend:
- Take a pdf file, read it, find the title and the year through Grobid (or some other way), and rename the file
- Take all the renamed files, package them inside a zip file, and send it back to be downloaded
Let's break this down into smaller challenges:
1- (DONE)Take a downloaded pdf and figure out how to use Grobid to extract its title and year of publication
2- (DONE)Figure out how to rename a file in python once you have the new filename decided (should be fairly easy)
3- (DONE)Figure out how to handle a file at the backend (will I need to use a database?)
4- (DONE)Figure out how to zip a file and send it back to the front end
5- (DONE)Make an HTML template with CSS styling for the page where a user can upload pdfs
6- (NOT NEEDED)Make an HTML template with CSS styling for the page where they will download
7- (DONE)Figure out how to allow users to download something from the front-end
8- (DONE)Use the grobid client inside the rename script (i.e. have an end-to-end script that takes pdfs and renames them)
9- (DONE)Figure out how to receive and handle multiple files at the backend
10-(DONE) Merge the grobid processing functionality with the uploads functionality
11- Add a little puzzle game that the user can play while the processing takes place at the backend
Once this app is running properly on my local computer, I will then have to figure out how to deploy this to heroku.
That sounds good. Let's start tackling each of these challenges.
(NOTE: Here are some problems I need to find solutions for)
- How large a file do I want to allow users to upload?
- How do I deal with the slow start-up? (Let's try keeping n=1 for the first try and switching it to a higher value for the next ones) (or just work with n=1) (Ask on the grobid issues section)
- How can I upload this to heroku or aws
- Consider the difference in time between processing the full text and processing just the header. Also, here's a trick, instead of giving in the full pdf, just take the first page. How about that?
Ideas:
1) Process just the header and rename with the title
2) Check out the other services and see what info you get via them
3) Truncate the pdf to just one page and process that only
<-Update on Jun 7 at 23:07->
So after a lot of troubleshooting and endless searching around, I have managed to containerize my application completely. I managed to make an image out of just the front end of my app, and then also managed to use docker-compose to run the 2 containers simultaneously, i.e. the grobid server and my own image. Should definitely document all my learnings somewhere and brainstorm about that. But now we face a new frontier: heroku does not support docker-compose style deployment. With heroku, you can either deploy a single container or deploy a single container with other "add-ons" from Heroku. So I guess no more heroku for me.
Now I've been able to identify and think of some solutions. Here are these:
- Can I package the entire thing into just one docker image? I highly doubt this because the servicess need different servers to run on.
- I just saw an example that uses AWS Elastic Beanstalk to deploy a docker-compose style app with multiple containers. I could go in that direction but will need a lot of research for that.
- I could use Kubernetes. From what I've read that maybe a solution for running multi-container apps so I should now also check that out.
In short, I need to start digging into "deploying multi-container apps to Elastic beanstalk" and "How to use kubernetes to deploy multi-container apps".
So while I read up on this, I should also work on fixing some of the problems listed above about the slow processing, and a new problem I saw: The pdf folder is not being emptied. Something ain't going right there and I need to figure that out.
Also, I need to compromise the date information in favor of speed cause ain't nobody gonna wait for that long.
<- Update on Jun 13 at 14:24 ->
So here is the progress I've made:
- Succeeded in mounting a modified config file into docker-compose to enable model preLoading. Some tests have shown that model preloading is infact a good strategy. Now currently I have this file in my local repo but it doesn't matter because if anyone else uses this they will have to clone the repo anyway.
- Removed the year part and now the default method is to rename with just the title and not the year of publication. There is a checkbox in the frontend which someone can use if they also want the years, with a warning that it might take longer.
- n=1 works best on my computer because of the severely limited processing capabilities.
- There is a processDate API for grobid as well however the client only gives access to 3 services. I'm thinking of editing the code and trialling the processDate api. If it works out and works quickly, I can replace the current methodology where I use the processFullText to get the date and just combine the processHeader with process date. This is my next step right now.
After testing this thing about the processDate API, I will move on to deployment using Kubernetes or Elastic Beanstalk (If I get the time today to do that).
I also just realized another issue. The names of the files are getting mixed up. So two papers had the names of each other instead of their own. I need to figure out how this is happening and fix it. Probably by using a better way to locate the files.
<- Update on Jun 13 at 16:29 ->
So I checked the processDate API and it's just a simple service that takes in a 'raw string' in any format and gives it back in a specific and consistent format. So no I can't use this as the alternative. It will have to be "processHeaderDocument" for the name only and "processFulltextDocument" for the date as well.
Now I'm going to go ahead and fix the problem about name mismatches. I've found a simple strategy for that. Just take each file from the pdfs folder, take away the .pdf at the end and replace it with .tei.xml (or is it the other way round?) and change the folder path from "pdfs" to "processed_pdfs". This will get exactly the right file. So yeah I'm gonna go play some cricket now and work on this at night. After that will probably think about kubernetes or elastic beanstalk.
<- Update on Jun 22 at 09:25 ->
I never really got the time to work on the mismatch problem after the cricket because I had my TOEFL on this last weekend and a full-day picnic on Sunday. Because of all the fatigue, I was a complete wasteman on Monday and so here I am on a Tuesday morning, just putting in this update.
So one thing I did actually do is make a docker image out of my app and push it to docker hub. I believe this is the container that will run the flask app with the grobid python client dependency. Now, along the way, I had this excellent idea:
Why not just upload the lfoppiano/grobid image to Heroku, and then upload my Flask app to a separate Heroku URL and then have it communicate with the grobid-heroku app? That will at the very least make my app work and be accessible to most people. I could then figure out later how to do this with beanstalk.
Another important point: I am going to write the entire backend again using Node.js, just for some practice. Initially I was going to write the Node.js client as well but turns out it's already there. So I'm just gonna rewrite the backend that includes getting files, processing them with grobid, extracting the title from the .tei.xml files, renaming the files, and sending them back in a zipped folder.
<- Update on Jun 29 at 22:32 ->
Soo, I keep on changing my plans every now and then. So I've decided to scrap two things: no longer deploying to heroku or elastic beanstalk, and no longer rewriting the backend using Node.js. As of now, my plan is to:
- Fix the problems with the file renaming (solution is already brainstormed above)
- Modify the script to make it a "script", i.e. that can be run on the command line without the frontend
- Make a video of this happening for LinkedIn
- Write a well thought-out LinkedIn post and POST THIS AND GET ON TO NEW THINGS.
Okay let's start working and come back with the update at the end.
<- Update on Jun 30 at 00:10 ->
Alright so I've done the first 3 items from the list I made above. I've fixed the file renaming problem, I've made a script that runs blazingly fast on the command line, and I've taken a few before and after screenshots to use.
As for the speed, I'm amazed. I guess going with a "web app" structure was misguided from the start. I took 10 pdfs with a total of >5MB and it completed the processing (without the year) in 2 minutes. What is more surprising is that it completed the processing for the same 10 pdfs WITH THE YEAR in 1 minute. So my note about it "taking more time" is actually false.
Anyhow, I need to get some sleep now. I'm going to run a few more tests tomorrow night on this, update the README with the usage instructions and finally publish this. The script method is going to be the default way of running this. I might just also request some ML influencer to share this. Let's see what kind of reception (if any, lol) I get.