Switch to GSON (#12)
* Do not escape by default. Closes #11.

* Update README

* Add gson JAR

* Switch to GSON to decode TreeMap
JulienPeloton authored Dec 6, 2024
1 parent 26503d8 commit 9a20e21
Showing 5 changed files with 39 additions and 4 deletions.
35 changes: 35 additions & 0 deletions README.md
@@ -71,6 +71,7 @@ By replacing `HOST` and `$PORT` with their values (could be the main API instanc
To profile a route, simply use:

```bash
export PYTHONPATH=$PYTHONPATH:$PWD
./profile_route.sh --route apps/routes/<route>
```

@@ -96,6 +97,40 @@ Line # Hits Time Per Hit % Time Line Contents
0.00 seconds - /home/peloton/codes/fink-object-api/apps/routes/template/utils.py:19 - my_function
```

### Main route performance

Performance of the main route for a medium-size object (14 alerts, about 130 columns):

| Request | Time (seconds) |
|---------|----------------|
| Lightcurve data (3 cols) | 0.1 |
| Lightcurve data (130 cols) | 0.3 |
| Lightcurve & 1 cutout data | 3.4 |
| Lightcurve & 3 cutout data | 5.4 |

Requesting cutouts is costly! With 14 alerts, this amounts to about 0.25 second per cutout. Note that requesting 3 cutouts is faster than 3 times 1 cutout, because the cost is driven by loading the full HDFS block into memory (see this [discussion](https://github.com/astrolabsoftware/fink-broker/issues/921) about the underlying strategy).

Note that for lightcurve data, the time is fortunately not linear with the number of alerts per object:

| Request | Time (seconds) |
|---------|----------------|
| Lightcurve data (33 alerts, 130 cols) | 0.3 |
| Lightcurve data (1575 alerts, 130 cols) | 1.8 |
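
As a rough back-of-the-envelope check, the per-cutout and per-alert costs can be derived from the timings quoted in the tables above. This is only a sketch using those numbers; the variable names are illustrative and not part of the code base.

```python
# Timings in seconds, copied from the tables above
lightcurve_and_1_cutout = 3.4   # 14 alerts, 1 cutout per alert
lightcurve_and_3_cutouts = 5.4  # 14 alerts, 3 cutouts per alert
n_alerts = 14

# Average cost per alert when a single cutout is requested
per_cutout = lightcurve_and_1_cutout / n_alerts

# One request for 3 cutouts versus 3 separate single-cutout requests
three_separate_requests = 3 * lightcurve_and_1_cutout

# Per-alert cost of lightcurve data (130 columns), small versus large object
per_alert_small = 0.3 / 33
per_alert_large = 1.8 / 1575

print(f"per cutout: {per_cutout:.2f} s")
print(f"3 cutouts at once: {lightcurve_and_3_cutouts} s, "
      f"3 separate requests: {three_separate_requests:.1f} s")
print(f"per alert: {1e3 * per_alert_small:.1f} ms (33 alerts), "
      f"{1e3 * per_alert_large:.1f} ms (1575 alerts)")
```

The per-alert cost drops as the number of alerts grows, which is what "not linear" means here: a fixed overhead dominates small requests.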


### The power of the Gateway

Initially, we loaded the client JAR using jpype at the application's start, sharing the client among all users. This approach caused several issues due to the client's lack of thread safety. To resolve this, we switched to an isolation mode, where a new client is created for each query instead of reusing a global client (see astrolabsoftware/fink-science-portal#516).

While this strategy effectively prevents conflicts between sessions, it significantly slows down individual queries. For instance, when using the route `api/v1/objects`, the overall query time is primarily determined by the time taken to load the client.

Instead of loading the client from scratch in the Python application for each query, we now spawn a JVM once (from outside the Python application), and access Java objects dynamically from the Python application using py4j. This has led to a large speed-up for most queries that do not need cutouts, e.g. for the `/api/v1/objects` route:

| Mode | Time (seconds) |
|------|----------------|
| Isolation mode | 3.4 |
| Gateway | 0.3 |
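
The difference between the two modes can be illustrated with a toy model. This is a minimal sketch, not the actual client code: `ExpensiveRuntime` is a hypothetical stand-in for whatever is costly to load (the JVM plus the client JAR), and a simple counter replaces real timing measurements. In the real setup the long-lived part is a separate JVM process reached through py4j, not a Python global.

```python
class ExpensiveRuntime:
    """Hypothetical stand-in for a runtime that is costly to start."""

    starts = 0  # how many times the costly start-up has been paid

    def __init__(self):
        ExpensiveRuntime.starts += 1

    def query(self, object_id):
        # Dummy payload; a real client would return data from the database
        return {"objectId": object_id}


def run_isolated(object_ids):
    """Isolation mode: the start-up cost is paid again for each query."""
    return [ExpensiveRuntime().query(oid) for oid in object_ids]


_SHARED_RUNTIME = None


def get_shared_runtime():
    """Gateway-like mode: start the runtime once, then reuse it."""
    global _SHARED_RUNTIME
    if _SHARED_RUNTIME is None:
        _SHARED_RUNTIME = ExpensiveRuntime()
    return _SHARED_RUNTIME


def run_shared(object_ids):
    """Run every query against the already started runtime."""
    return [get_shared_runtime().query(oid) for oid in object_ids]
```

Calling `run_isolated` with three identifiers increments `ExpensiveRuntime.starts` three times, whereas `run_shared` increments it only once no matter how many queries follow.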

## Adding a new route

You can find a [template](apps/routes/template) route to use as a starting point. Just copy this folder and adapt it to your new route. Alternatively, look at how other routes are structured for inspiration. Do not forget to add tests in the [test folder](tests/)!
2 changes: 1 addition & 1 deletion apps/routes/cutouts/utils.py
@@ -91,7 +91,7 @@ def format_and_send_cutout(payload: dict):
group_alerts=False,
truncated=True,
extract_color=False,
- escape_slash=True,
+ escape_slash=False,
)

json_payload = {}
4 changes: 2 additions & 2 deletions apps/utils/decoding.py
@@ -150,13 +150,13 @@ def format_hbase_output(
def hbase_to_dict(hbase_output, escape_slash=False):
"""Optimize hbase output TreeMap for faster conversion to DataFrame"""
gateway = JavaGateway(auto_convert=True)
- JSONObject = gateway.jvm.org.json.JSONObject
+ GSONObject = gateway.jvm.com.google.gson.Gson

# We do bulk export to JSON on Java side to avoid overheads of iterative access
# and then parse it back to Dict in Python
if escape_slash:
hbase_output = str(hbase_output)
- optimized = json.loads(JSONObject(str(hbase_output)).toString())
+ optimized = json.loads(GSONObject().toJson(hbase_output))

return optimized
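
To illustrate the bulk-export idea in isolation, here is a minimal Python-only sketch: a JSON string stands in for what the Java side would return, and a single `json.loads` call turns it into nested dictionaries. The row keys, column names, and the `json_to_dict` helper are made up for illustration, and no JVM or py4j is involved.

```python
import json

# Hypothetical JSON text, standing in for the string produced on the Java side.
# One row contains an escaped slash (\/), which is legal JSON.
java_side_json = (
    r'{"row1": {"i:objectId": "obj\/1", "i:magpsf": "18.2"}, '
    r'"row2": {"i:objectId": "obj2", "i:magpsf": "17.9"}}'
)


def json_to_dict(json_text):
    """Parse the whole payload in one call instead of iterating over entries."""
    return json.loads(json_text)


optimized = json_to_dict(java_side_json)

# Nested dictionaries keyed by row, ready to be turned into a table
for row_key, columns in optimized.items():
    print(row_key, columns["i:objectId"], columns["i:magpsf"])
```

Python's JSON parser decodes `\/` back to `/`, so an escaped slash in the exported text does not survive into the resulting dictionary.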

Binary file added bin/gson-2.11.0.jar
Binary file not shown.
2 changes: 1 addition & 1 deletion install/README.md
@@ -35,7 +35,7 @@ User=almalinux
Group=almalinux
WorkingDirectory=/home/almalinux/fink-object-api/bin

- ExecStart=/bin/sh -c 'source /home/almalinux/.bashrc; exec java -cp "Lomikel-03.04.00x-HBase.exe.jar:py4j0.10.9.7.jar" com.Lomikel.Py4J.LomikelGatewayServer 2>&1 >> /tmp/fink_gateway.out'
+ ExecStart=/bin/sh -c 'source /home/almalinux/.bashrc; exec java -cp "Lomikel-03.04.00x-HBase.exe.jar:py4j0.10.9.7.jar:gson-2.11.0.jar" com.Lomikel.Py4J.LomikelGatewayServer 2>&1 >> /tmp/fink_gateway.out'

[Install]
WantedBy=multi-user.target