Merge branch 'main' into cadej/uploadfile
Ubuntu committed Nov 18, 2023
2 parents 91d2c31 + befe48d commit d36f95f
Showing 185 changed files with 26,347 additions and 6 deletions.
30 changes: 29 additions & 1 deletion .gitignore
@@ -1,3 +1,31 @@

.env
.DS_Store
__pycache__/
__pycache__/

ml-model/formula_images/
ml-model/output/
ml-model/crop_formula_images/


output.zip
*.zip

.ipynb_checkpoints/
data/
.DS_Store
*.gz
*.venv
simese_data/
venv/
ml-model/model.pt
training_data/
*.png
im2latex/
ml-model/model.pt
simese_data/
venv/
*.pem
ml-model/paths_output.csv

ml-model/web/__pycache__/
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Cornell Data Science

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
# MathSearch

A next generation search engine for researchers that supports searching with LaTeX math script.

Refer to our [wiki](https://github.com/CornellDataScience/MathSearch/wiki) to get started :-)
12 changes: 7 additions & 5 deletions front-end/README.md
@@ -6,7 +6,7 @@ located ./mathsearch README.md
## Commands 101

1. update react: `npm run build` and `sudo systemctl restart nginx`
2. update flask: `pkill gunicorn`
2. update flask: `pkill gunicorn`, or `tmux` and restart gunicorn on the correct port (dangerous, read details below before you do it)


# Frontend Public IP
@@ -74,11 +74,13 @@ then

```
cd ~/MathSearch/front-end/web
tmux
gunicorn -b 127.0.0.1:8000 api:app
```

You must be in `/web` for gunicorn to start the app.
`-b` stands for bind, which specifies the IP address and port to listen on.
Please run it within tmux so we can always reopen the terminal to see the gunicorn debug log.
The above command starts gunicorn, which runs Flask; do NOT end this process.
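
For context, the `api:app` argument tells gunicorn to import the module `api` and serve the object named `app` from it, which is why the command has to be run from `/web`. The sketch below is a stand-in WSGI callable using only the standard library. The real `api.py` is a Flask app, and the response body here is made up for illustration.

```python
# api.py (hypothetical stand-in; the real file defines a Flask app)

def app(environ, start_response):
    """Smallest possible WSGI callable that gunicorn could serve as `api:app`."""
    body = b"ok"  # placeholder response body, not what MathSearch returns
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

A Flask instance exposes the same callable interface, so gunicorn treats both the same way.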

# Other Things - You probably do not need to know
@@ -108,10 +110,10 @@ Option 2:
curl localhost:8000/test
```

Option 3:

Option 3:
use any unused port
```
gunicorn -b 127.0.0.1:8000 api:app
gunicorn -b 127.0.0.1:8001 api:app
```

If you get `[ERROR] Connection in use: ('127.0.0.1', 8000)`, you need to stop the public deployment before deploying locally; kill the gunicorn process by its PID.
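
Before picking a port for a local run, it can help to check whether something is already listening on it. A minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, not part of this repo:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8000)` should be true while the public gunicorn process is running, in which case a different port such as 8001 can be tried.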
@@ -141,7 +143,7 @@ There's no references anymore. Too many sites had been used during debug. Just s
### 8. location of important files:

1. `/etc/nginx/nginx.conf`
2. `/lib/systemd/system/nginx.service`
2. (unused) `/lib/systemd/system/nginx.service`
3. (unused) `/etc/nginx/sites-available/MathSearch`

## Access S3
82 changes: 82 additions & 0 deletions ml-model/README.md
@@ -1 +1,83 @@
# ML Model Branch

## Design choice

The design choices are documented in the following Google Doc:
https://docs.google.com/document/d/1vf85MGTYNEiqmlLApEDHS9gEdpwMC6Jg1svBZvlxrbo/edit?usp=sharing

## Folders

- preprocessing/ - code for dataset augmentation
- files/ - code for VGG
- model/ - code for VGG and Siamese neural networks
- yolov5/ - all the code for yolov5
- archive/ - deprecated folders

# Setup

### Requirements
Install requirements via `pip install -r requirements.txt` in the files/ directory.
### Finetuning

- #### Siamese w/ VGG
The dataset used to finetune VGG can be found here, in `training.zip`: https://huggingface.co/dyk34/Training-Data-MathSearch/tree/main.

Change the `data_dir, training_dir, training_csv, testing_csv, testing_dir` variables in the `siamese.py` file.

Run `python siamese.py`.

- #### Yolov5
The dataset to finetune Yolov5 can also be found here: https://huggingface.co/dyk34/Training-Data-MathSearch/tree/main.


### Inference
- #### Siamese w/ VGG
To run VGG in Pytorch, load a Siamese Network with
```
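import torch
from torchvision import models
# SiameseNetwork comes from this repo's Siamese code; import it before running
vgg19_model = models.vgg19()
model = SiameseNetwork(vgg19_model)
model.load_state_dict(torch.load('model.pt'))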
```

and run `model(image1, image2)` to get the latent space distance between image1 and image2.
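
Since a smaller distance means two formula images are more alike, candidate matches can be ranked by it. The sketch below assumes embeddings are plain lists of floats and that the distance is Euclidean; neither is confirmed by this README, and `rank_by_distance` is a hypothetical helper.

```python
import math

def rank_by_distance(query, candidates):
    """Sort candidate embeddings by Euclidean distance to the query embedding.

    `candidates` maps a name to an embedding (list of floats).
    Returns (name, distance) pairs, closest first.
    """
    scored = [(name, math.dist(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Toy 2-D embeddings, purely illustrative
ranked = rank_by_distance([0.0, 0.0], {"far": [3.0, 4.0], "near": [1.0, 0.0]})
```

Here `ranked` lists `"near"` before `"far"`, since it is closer to the query.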


- #### Yolov5
To segment an image, run

`python segment/predict.py --weights {weights} --data {img}`

# SWE
<img width="700" alt="full pipeline" src="https://github.com/CornellDataScience/MathSearch/assets/44758321/bcb8dff2-0e21-474f-9f19-915cda76262c">


## Frontend Public IP
### As of 5/29, 3:12 PM
Every time the EC2 instance is restarted, a new public IP and a new SSH hostname are generated, and both need to be updated in the config and the domain redirection.

Public IP: `http://18.206.12.64`
SSH: `ec2-18-206-12-64.compute-1.amazonaws.com`
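
The SSH hostname follows from the public IP: the dots become dashes, wrapped in `ec2-` and `.compute-1.amazonaws.com`. A small sketch for regenerating it after a restart; `ec2_hostname` is a hypothetical helper, and the `compute-1` suffix is taken from the hostname above, so other regions may differ:

```python
import ipaddress

def ec2_hostname(public_ip: str, suffix: str = "compute-1.amazonaws.com") -> str:
    """Build the EC2 public DNS name for an IPv4 address."""
    # Validate first so a typo fails loudly instead of producing a bad hostname
    ip = ipaddress.IPv4Address(public_ip)
    return f"ec2-{str(ip).replace('.', '-')}.{suffix}"
```

Calling `ec2_hostname("18.206.12.64")` reproduces the SSH hostname listed above.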

## Nginx
Location of the nginx conf: `/etc/nginx/nginx.conf`
Start gunicorn from `/home/ubuntu/MathSearch/ml-model/web`:
```
gunicorn -b 127.0.0.1:8080 api:app
```

## Backend Environment:
- Option 1: `/opt/conda/bin`
- Option 2: `/home/ubuntu/MathSearch/ml-model/venv/bin`
- Option 3 (applies to SWE): all packages are installed in the default Python, so there is no need to activate a venv

## Access S3
To test the connection, run one of the commands below (note the trailing "-" in the second option). The first lists the contents of the S3 bucket; the second prints the file to stdout.

option 1
```
aws s3 ls s3://mathsearch-intermediary
```
option 2
```
aws s3 cp s3://mathsearch-intermediary/test.txt -
```
206 changes: 206 additions & 0 deletions ml-model/archive/app_sample/app_checkpoint.py
@@ -0,0 +1,206 @@
import os
import urllib.request

import requests
import wget
from flask import Flask, flash, request, redirect, render_template
from werkzeug.utils import secure_filename

UPLOAD_FOLDER = '/home/ubuntu/yolov5/input_data'

app = Flask(__name__)
app.secret_key = "secret key"
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024


@app.route('/')
def hello_world():
    return 'Warning: go to http://18.207.249.45/upload instead'

# https://www.cs.cornell.edu/~kozen/Papers/daa.pdf

@app.route('/pdf', methods=['GET', 'POST'])
def download():

    # url = request.args.get('c')
    # wget.download(url,app.config['UPLOAD_FOLDER']+"/file.pdf")


    # url = request.args.get('c')
    # print(url)
    # r = requests.get(url,allow_redirects=True)
    # print(r)
    # open(UPLOAD_FOLDER+"/file1.pdf","wb").write(r.content)


    # with open(app.config['UPLOAD_FOLDER']+"/file1.pdf", "wb") as file:
    #     response = requests.get(url)
    #     file.write(response.content)

    return "success"
    # return send_to_directory(app.config['UPLOAD_FOLDER'], link)
    # return send_file(link, as_attachment=True)


ALLOWED_EXTENSIONS = {'pdf'}

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/upload')
def upload_form():
    return render_template('upload.html')

@app.route('/upload', methods=['POST'])
def upload_file():
    if request.method == 'POST':
        # check if the post request has the file part
        if 'file' not in request.files:
            flash('No file part')
            return redirect(request.url)
        file = request.files['file']
        if file.filename == '':
            flash('No file selected for uploading')
            return redirect(request.url)
        if file and allowed_file(file.filename):
            filename = secure_filename(file.filename)
            file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
            flash('File successfully uploaded')
            return redirect('/')
        else:
            # only PDF is accepted (see ALLOWED_EXTENSIONS)
            flash('Allowed file type is pdf')
            return redirect(request.url)

if __name__ == "__main__":
    app.run()



# # from flask import Flask, render_template, request
# # # from werkzeug import secure_filename
# # from werkzeug.utils import secure_filename
# # from werkzeug.datastructures import FileStorage
# # app = Flask(__name__)

# # @app.route('/')
# # def hello_world():
# # return 'Hello World! - emerald@mathsearch port:3000 temp:4'

# # @app.route('/upload')
# # def upload_file():
# # return render_template('upload.html')

# # @app.route('/uploader', methods = ['GET', 'POST'])
# # def uploadfile():
# # if request.method == 'POST':
# # f = request.files['file']
# # f.save(secure_filename(f.filename))
# # return 'file uploaded successfully'

# # if __name__ == '__main__':
# # app.debug = True
# # app.run(host='0.0.0.0', port=8100)


# from flask import Flask, flash, redirect, url_for, request, render_template
# from werkzeug.utils import secure_filename
# import os

# """
# @Author: Emerald Liu
# Does not support concurrency currently
# """

# # constant variables
# UPLOAD_FOLDER = '/home/ubuntu/yolov5/input_data'
# ALLOWED_EXTENSIONS = {'pdf'}

# # helper functions
# def allowed_file(filename):
# return '.' in filename and \
# filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


# # initalize flask app config
# app = Flask(__name__)
# app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER


# # @app.route('/upload')
# # def upload_file():
# # return render_template('upload.html')


# # @app.route('/upload', methods=['GET', 'POST'])
# # def upload_file():
# # if request.method == 'POST':
# # # check if the post request has the file part
# # if 'file' not in request.files:
# # flash('No file part')
# # return redirect(request.url)
# # file = request.files['file']
# # # If the user does not select a file, the browser submits an
# # # empty file without a filename.
# # if file.filename == '':
# # flash('No selected file')
# # return redirect(request.url)
# # if file and allowed_file(file.filename):
# # filename = secure_filename(file.filename)
# # # file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
# # file.save(app.config['UPLOAD_FOLDER'], filename)
# # return redirect(url_for('download_file', name=filename))
# # return '''
# # <!doctype html>
# # <title>Upload new File</title>
# # <h1>Upload new File</h1>
# # <form method=post enctype=multipart/form-data>
# # <input type=file name=file>
# # <input type=submit value=Upload>
# # </form>
# # '''

# if __name__ == '__main__':
# app.debug = True
# app.run(host='0.0.0.0', port=8100)


# import os
# import urllib.request
# # from app import app
# from flask import Flask, flash, request, redirect, render_template
# from werkzeug.utils import secure_filename

# ALLOWED_EXTENSIONS = set(['txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif'])
# app = Flask(__name__)

# def allowed_file(filename):
# return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# @app.route('/')
# def upload_form():
# return render_template('upload.html')

# @app.route('/', methods=['POST'])
# def upload_file():
# if request.method == 'POST':
# # check if the post request has the file part
# if 'file' not in request.files:
# flash('No file part')
# return redirect(request.url)
# file = request.files['file']
# if file.filename == '':
# flash('No file selected for uploading')
# return redirect(request.url)
# if file and allowed_file(file.filename):
# filename = secure_filename(file.filename)
# file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
# flash('File successfully uploaded')
# return redirect('/')
# else:
# flash('Allowed file types are txt, pdf, png, jpg, jpeg, gif')
# return redirect(request.url)

# if __name__ == '__main__':
# app.debug = True
# app.run(host='0.0.0.0', port=8100)