Switch to GSON #12

Merged 4 commits on Dec 6, 2024
35 changes: 35 additions & 0 deletions README.md
@@ -71,6 +71,7 @@ By replacing `HOST` and `$PORT` with their values (could be the main API instanc
To profile a route, simply use:

```bash
export PYTHONPATH=$PYTHONPATH:$PWD
./profile_route.sh --route apps/routes/<route>
```

@@ -96,6 +97,40 @@ Line # Hits Time Per Hit % Time Line Contents
0.00 seconds - /home/peloton/codes/fink-object-api/apps/routes/template/utils.py:19 - my_function
```

### Main route performance

Performance of the main route for a medium-sized object (14 alerts, about 130 columns):

| Request | Time (seconds) |
|---------|----------------|
| Lightcurve data (3 cols) | 0.1 |
| Lightcurve data (130 cols) | 0.3 |
| Lightcurve & 1 cutout data | 3.4 |
| Lightcurve & 3 cutout data | 5.4 |

Requesting cutouts is costly! We have 14 alerts, which is about 0.25 seconds per cutout. Note that requesting 3 cutouts is faster than 3 times 1 cutout (5.4 s versus 3 × 3.4 = 10.2 s), because the cost is driven by loading the full HDFS block into memory (see this [discussion](https://github.com/astrolabsoftware/fink-broker/issues/921) about the underlying strategy).
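As a quick sanity check of the figures quoted above, the arithmetic can be sketched as follows. The timings are copied from the table; the variable names are only illustrative and are not part of the API.

```python
# Timings (seconds) copied from the table above, for an object with 14 alerts
n_alerts = 14
t_one_cutout = 3.4     # lightcurve & 1 cutout
t_three_cutouts = 5.4  # lightcurve & 3 cutouts

# Average cost per cutout when a single cutout is requested for each alert
per_cutout = t_one_cutout / n_alerts

# What 3 cutouts would cost if each request paid the full price again
naive_three = 3 * t_one_cutout

print(f"per cutout: {per_cutout:.2f} s")
print(f"3 cutouts, measured: {t_three_cutouts:.1f} s, naive estimate: {naive_three:.1f} s")
```

The gap between the measured and the naive figure is consistent with the block loading cost being paid once per request rather than once per cutout.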

Fortunately, for lightcurve data the time grows much more slowly than linearly with the number of alerts per object (about 48 times more alerts cost only 6 times more time):

| Request | Time (seconds) |
|---------|----------------|
| Lightcurve data (33 alerts, 130 cols) | 0.3 |
| Lightcurve data (1575 alerts, 130 cols) | 1.8 |


### The power of the Gateway

Initially, we loaded the client JAR using jpype at the application's start, sharing the client among all users. This approach caused several issues due to the client's lack of thread safety. To resolve this, we switched to an isolation mode, where a new client is created for each query instead of reusing a global client (see astrolabsoftware/fink-science-portal#516).

While this strategy effectively prevents conflicts between sessions, it significantly slows down individual queries. For instance, when using the route `/api/v1/objects`, the overall query time is dominated by the time taken to load the client.

Instead of loading the client from scratch in the Python application for each query, we now spawn a JVM once (outside the Python application) and access Java objects dynamically from the Python application using py4j. This gives a speed-up of roughly a factor 10 for most queries that do not need cutouts, e.g. for the `/api/v1/objects` route:

| Mode | Time (seconds) |
|------|----------------|
| Isolation mode | 3.4 |
| Gateway | 0.3 |
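
The gain can be put in numbers with the short calculation below. It only uses the two timings from the table; the estimate of the client loading overhead assumes that the rest of the query costs the same in both modes, which is a simplification rather than a measurement.

```python
# Timings (seconds) from the table above for the /api/v1/objects route
isolation_s = 3.4
gateway_s = 0.3

# Overall speed-up of the gateway over the isolation mode
speedup = isolation_s / gateway_s

# Rough per-query overhead attributed to loading the client,
# assuming everything else costs the same in both modes
client_load_s = isolation_s - gateway_s

print(f"speed-up: {speedup:.1f}x")
print(f"estimated client loading overhead: {client_load_s:.1f} s per query")
```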

## Adding a new route

You will find a [template](apps/routes/template) route to use as a starting point. Copy this folder and adapt it to your new route. Alternatively, look at how other routes are structured for inspiration. Do not forget to add tests in the [test folder](tests/)!
2 changes: 1 addition & 1 deletion apps/routes/cutouts/utils.py
@@ -91,7 +91,7 @@ def format_and_send_cutout(payload: dict):
group_alerts=False,
truncated=True,
extract_color=False,
-escape_slash=True,
+escape_slash=False,
)

json_payload = {}
4 changes: 2 additions & 2 deletions apps/utils/decoding.py
@@ -150,13 +150,13 @@ def format_hbase_output(
 def hbase_to_dict(hbase_output, escape_slash=False):
     """Optimize hbase output TreeMap for faster conversion to DataFrame"""
     gateway = JavaGateway(auto_convert=True)
-    JSONObject = gateway.jvm.org.json.JSONObject
+    GSONObject = gateway.jvm.com.google.gson.Gson
 
     # We do bulk export to JSON on Java side to avoid overheads of iterative access
     # and then parse it back to Dict in Python
     if escape_slash:
         hbase_output = str(hbase_output)
-    optimized = json.loads(JSONObject(str(hbase_output)).toString())
+    optimized = json.loads(GSONObject().toJson(hbase_output))
 
     return optimized
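
The bulk-export pattern used in `hbase_to_dict` (serialize once on the Java side, parse once in Python) can be tried without a running JVM using the sketch below. It is an illustration only: `hbase_to_dict_sketch`, `FakeGson` and `fake_gateway` are hypothetical names that are not part of the project, and the fake gateway merely mimics the attribute path `gateway.jvm.com.google.gson.Gson` used in the diff. The sample payload is made up.

```python
import json
from types import SimpleNamespace


def hbase_to_dict_sketch(gateway, hbase_output):
    """Serialize an object to JSON on the gateway side, then parse it once in Python."""
    gson_class = gateway.jvm.com.google.gson.Gson
    return json.loads(gson_class().toJson(hbase_output))


class FakeGson:
    """Hypothetical stand-in for com.google.gson.Gson, so the sketch runs without Java."""

    def toJson(self, obj):
        return json.dumps(obj)


# Hypothetical stand-in for a py4j gateway exposing the same attribute path
fake_gateway = SimpleNamespace(
    jvm=SimpleNamespace(
        com=SimpleNamespace(
            google=SimpleNamespace(gson=SimpleNamespace(Gson=FakeGson))
        )
    )
)

# Made-up payload: one row key mapped to its columns
result = hbase_to_dict_sketch(fake_gateway, {"row_0": {"col_a": "1.0", "col_b": "x"}})
print(result)
```

With a real py4j `JavaGateway`, the same function would receive the gateway object instead of the stand-in, provided the Gson JAR is on the classpath of the gateway server.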

Binary file added bin/gson-2.11.0.jar
Binary file not shown.
2 changes: 1 addition & 1 deletion install/README.md
@@ -35,7 +35,7 @@ User=almalinux
Group=almalinux
WorkingDirectory=/home/almalinux/fink-object-api/bin

-ExecStart=/bin/sh -c 'source /home/almalinux/.bashrc; exec java -cp "Lomikel-03.04.00x-HBase.exe.jar:py4j0.10.9.7.jar" com.Lomikel.Py4J.LomikelGatewayServer 2>&1 >> /tmp/fink_gateway.out'
+ExecStart=/bin/sh -c 'source /home/almalinux/.bashrc; exec java -cp "Lomikel-03.04.00x-HBase.exe.jar:py4j0.10.9.7.jar:gson-2.11.0.jar" com.Lomikel.Py4J.LomikelGatewayServer 2>&1 >> /tmp/fink_gateway.out'

[Install]
WantedBy=multi-user.target