Extract, in a structured manner, the ML community's general guidelines on dataset documentation practices from scientific documentation. Study and analyze scientific data published in peer-reviewed journals such as Nature's Scientific Data and Data-in-Brief.
📼 Take a look at our short video presenting the tool! 📼 Here is also an example of a study that uses DataDocAnalyzer to extract data from data papers.
Here is a complete list of data journals suitable for analysis with this tool. Test the tool's web UI in the following HuggingFace Space, and the API using our Docker image.
The tool comes with two interfaces: a web app built with Gradio, intended for testing the tool's capabilities and analyzing a single document (you can try it in the HuggingFace Space), and an API built with FastAPI, suited for integration into any ML pipeline:
To use this tool, you need python3.10, git, and pip installed on your system. Then just:
git clone https://github.com/SOM-Research/DataDoc-Analyzer.git datadoc
## Enter the created folder
cd datadoc
## Install dependencies (preferably in a virtual environment)
pip install -r requirements.txt
## Launch the web UI (Gradio)
python3 app.py
## Or launch the API (FastAPI)
uvicorn api:app
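Before running the commands above, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the tool itself: the function name `check_prerequisites` is hypothetical, and treating "python3.10" as "Python 3.10 or newer" is an assumption.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite name to True or False.

    Assumption: any Python at or above `min_python` is acceptable.
    """
    return {
        # Compare only (major, minor) of the running interpreter.
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as missing should be installed before cloning the repository and installing the dependencies.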
First, you need to install Docker on your system. Then:
docker pull joangi/datadoc_analyzer
docker run --name apidataset -p 80:80 joangi/datadoc_analyzer
docker exec apidataset apt -y install default-jre
The API will be available on your localhost at port 80. You can change the host port by editing the first number of the -p 80:80 option in the command above.
To use this tool, you need to provide your own API key from OpenAI.
Once set, you can upload a PDF from one of the scientific journals suited for this tool [1]. Keep in mind that the tool analyzes “data papers.” Other journal publications, such as meta-analyses or full papers, may not work adequately.
Finally, click “get insights” on any tab, and you will get the results together with the completeness report.
The API mirrors the behavior of the web UI's tabs and, in addition, offers an endpoint that retrieves all the dimensions at once. The API's Swagger documentation, which can be tested in situ, is served alongside the API. When launched with uvicorn, the server starts on port 8000 by default (provided the port is not in use by another application on your system), and the documentation is available at http://127.0.0.1:8000/docs
The tool was presented at the 32nd ACM International Conference on Information and Knowledge Management in October 2023 (tool's publication). You can also check this short video presenting the tool.
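Because the API is built with FastAPI, its endpoints can also be discovered programmatically. The sketch below assumes the app keeps FastAPI's default schema location, `/openapi.json`, and that the server is running locally on port 8000; the helper names `fetch_schema` and `list_endpoints` are hypothetical. It does not assume any particular endpoint names, since it only lists whatever the server reports.

```python
import json
from urllib.request import urlopen

# HTTP verbs that can appear as operation keys in an OpenAPI "paths" item.
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def fetch_schema(base_url="http://127.0.0.1:8000"):
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_endpoints(schema):
    """Return (METHOD, path, summary) tuples, sorted by path."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method, details in operations.items():
            # Skip non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                endpoints.append((method.upper(), path, details.get("summary", "")))
    return endpoints


if __name__ == "__main__":
    for method, path, summary in list_endpoints(fetch_schema()):
        print(f"{method:6} {path}  {summary}")
```

When using the Docker image instead, pass the matching base URL (for example `fetch_schema("http://127.0.0.1:80")`).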
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The CC BY-SA license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
Footnotes
1. Some journals that publish data papers: Nature's Scientific Data, Data-in-Brief, Geoscience Data Journal, etc. Here is a complete list of data journals suitable for analysis with this tool.