The object API is a Flask application used to access object data stored in Apache HBase tables. Internally, the application relies on two components: the Java Gateway and the Fink cutout API.
The Java Gateway enables the Flask application to communicate with a JVM using py4j, where the Fink HBase client based on Lomikel is available. This client simplifies the interaction with HBase tables, where Fink aggregated alert data is stored.
The Fink cutout API is a Flask application to access cutouts from the Fink datalake. We only store cutout metadata in HBase, and this API retrieves the data from the raw parquet files stored on HDFS.
From 2019 to 2024, the development of this API was done in fink-science-portal. Check this repository for older issues and PRs.
There are several forms of documentation, depending on what you are looking for:
- Tutorials/How-to guides: Fink user manual
- API Reference guide: https://api.fink-portal.org
- Notes for developers and maintainers (auth required): GitLab
You will need Python (>=3.9) installed, with the requirements listed in requirements.txt. You will also need fink-cutout-api fully installed (which implies Hadoop installed on the machine, and Java 11/17). For the full installation and deployment, refer to the procedure.
The input parameters can be found in config.yml. Make sure that SCHEMAVER matches the schema version used by your tables in HBase.
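For reference, here is a hypothetical sketch of what such a configuration could look like. The key names other than `SCHEMAVER` are illustrative assumptions, not the actual contents of config.yml, so check the file in the repository for the real keys:

```yaml
# Illustrative sketch only -- see config.yml in the repository for the real keys.
HBASEIP: localhost              # hypothetical: HBase host
ZOOPORT: 2183                   # hypothetical: ZooKeeper port
SCHEMAVER: schema_3.1_5.21.14   # must match the schema version of your HBase tables
```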
After starting the Fink Java Gateway and fink-cutout-api services, you can simply launch the API in debug mode using:
```bash
python app.py
```
The application is simply managed by gunicorn and systemd (see install), and you can manage it using:
```bash
# start the application
systemctl start fink_object_api

# reload the application if the code changed
systemctl restart fink_object_api

# stop the application
systemctl stop fink_object_api
```
All the routes are extensively tested. To trigger a test on a route, simply run:
```bash
python apps/routes/objects/test.py $HOST:$PORT
```
Replace `$HOST` and `$PORT` with their values (this could be the main API instance). If the program exits with no error or message, the test was successful. Alternatively, you can launch all tests using:
```bash
./run_tests.sh --url $HOST:$PORT
```
To profile a route, simply use:
```bash
export PYTHONPATH=$PYTHONPATH:$PWD
./profile_route.sh --route apps/routes/<route>
```
Depending on the route, you will see the details of the timings and a summary similar to:
```
Wrote profile results to profiling.py.lprof
Inspect results with:
python -m line_profiler -rmt "profiling.py.lprof"
Timer unit: 1e-06 s

Total time: 0.000241599 s
File: /home/peloton/codes/fink-object-api/apps/routes/template/utils.py
Function: my_function at line 19

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           @profile
    20                                           def my_function(payload):
    21         1        241.6    241.6    100.0      return pd.DataFrame({payload["arg1"]: [1, 2, 3]})

  0.00 seconds - /home/peloton/codes/fink-object-api/apps/routes/template/utils.py:19 - my_function
```
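If line_profiler is not available, the standard library's cProfile gives a coarser, function-level view. This is an alternative sketch, not part of the project's tooling, and the `my_function` body below is a toy stand-in for a route helper:

```python
import cProfile
import io
import pstats

def my_function(payload):
    # Toy stand-in for a route helper (the real one returns a pandas DataFrame).
    return {payload["arg1"]: [1, 2, 3]}

profiler = cProfile.Profile()
profiler.enable()
result = my_function({"arg1": "x"})
profiler.disable()

# Print the 5 most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Unlike line_profiler, cProfile does not show per-line timings, so it is mostly useful for spotting which function dominates a route.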
The main route performance for a medium-size object (14 alerts, about 130 columns):

| request | time (second) |
|---|---|
| Lightcurve data (3 cols) | 0.1 |
| Lightcurve data (130 cols) | 0.3 |
| Lightcurve & 1 cutout data | 3.4 |
| Lightcurve & 3 cutout data | 5.4 |
Requesting cutouts is costly! We have 14 alerts, which is about 0.25 second per cutout. Note that requesting 3 cutouts is faster than 3 times 1 cutout, as what drives the cost is loading the full HDFS block into memory (see this discussion about the strategy behind it).
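A back-of-the-envelope model reproduces the numbers above: a fixed one-off cost to load the HDFS block, plus a small marginal cost per cutout image. The constants below are rough fits to the benchmark table, not measured values:

```python
# Rough cost model fitted to the benchmark table (illustrative numbers only).
BASE = 0.3          # lightcurve query, 130 columns (seconds)
BLOCK_LOAD = 2.1    # one-off cost to load the HDFS block into memory (fitted)
PER_IMAGE = 0.071   # marginal cost per cutout image (fitted)
N_ALERTS = 14       # alerts for the medium-size object

def estimated_time(n_cutout_types: int) -> float:
    """Estimated query time when requesting n cutout types for every alert."""
    if n_cutout_types == 0:
        return BASE
    # The block load is paid once, regardless of how many cutouts are requested.
    return BASE + BLOCK_LOAD + PER_IMAGE * N_ALERTS * n_cutout_types

print(round(estimated_time(1), 1))  # ~3.4, matches the table
print(round(estimated_time(3), 1))  # ~5.4, matches the table
```

This makes the batching advantage explicit: three cutout types in one request pay the block load once, while three separate requests would pay it three times.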
Note that for lightcurve data, the time is fortunately not linear with the number of alerts per object:
| request | time (second) |
|---|---|
| Lightcurve data (33 alerts, 130 cols) | 0.3 |
| Lightcurve data (1575 alerts, 130 cols) | 1.8 |
Initially, we loaded the client JAR using jpype at the application's start, sharing the client among all users. This approach caused several issues due to the client's lack of thread safety. To resolve this, we switched to an isolation mode, where a new client is created for each query instead of reusing a global client (see astrolabsoftware/fink-science-portal#516).
While this strategy effectively prevents conflicts between sessions, it significantly slows down individual queries. For instance, when using the api/v1/objects route, the overall query time is primarily determined by the time taken to load the client.

Instead of loading the client from scratch in the Python application for each query, we now spawn a JVM once (from outside the Python application), and access Java objects dynamically from the Python application using py4j. This has led to a huge speed-up for most queries that do not need cutouts, e.g. for the /api/v1/objects route:
| mode | time (second) |
|---|---|
| Isolation mode | 3.4 |
| Gateway | 0.3 |
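For illustration, here is a minimal py4j connection sketch. The entry-point method and client call below (`getHBaseClient`, `scan`) are placeholder names, not the actual Lomikel API; only the `JavaGateway` connection pattern is real py4j usage:

```python
def query_via_gateway(object_id: str):
    """Connect to an already-running JVM and call a Java-side client.

    Hypothetical sketch: the JVM must have been started beforehand with a
    py4j GatewayServer; `getHBaseClient` and `scan` are placeholder names.
    """
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()  # connects to localhost:25333 by default
    client = gateway.entry_point.getHBaseClient()  # placeholder entry-point method
    return client.scan(object_id)                  # placeholder client method
```

The key point is that the JVM (and the HBase client it holds) lives outside the Python process, so each query only pays the cost of a local socket round-trip instead of a full client initialization.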
You will find a template route to help you start a new route. Just copy this folder, and modify it for your new route. Alternatively, you can look at how other routes are structured to get inspiration. Do not forget to add tests in the test folder!
- configuration: find a way to automatically sync the schema with the tables.
- Add nginx management.
- Add bash scripts under `bin/` to manage both nginx and gunicorn.
- Make tests more verbose, even when successful.