Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic search for data aggregations and queries #12

Open
evasciacca opened this issue Nov 30, 2016 · 26 comments
Open

Elastic search for data aggregations and queries #12

evasciacca opened this issue Nov 30, 2016 · 26 comments

Comments

@evasciacca
Copy link

Dear developers,

We would like to understand how to perform elastic search for data aggregations and queries. We have done some tests on the previous OneData release 3.0.0 beta 7 using the direct connection to CouchBase on the bucket 'xattr'. In release 3.0.0 rc9 we did not find where and how metadata are stored on the CouchBase. It would be very useful for us to include n1ql in the containers so that we could access the n1ql GUI on port 8094 and test the queries. Any news on this topic?

Thank you,

Eva.

@groundnuty
Copy link
Contributor

groundnuty commented Dec 2, 2016 via email

@groundnuty
Copy link
Contributor

groundnuty commented Dec 4, 2016 via email

@evasciacca
Copy link
Author

Dear Michal,

thank you very much for your replies. Regarding n1ql it would be needed in oneprovider docker image (where metadata are stored) but if you think it could be useful for elastic search on metadata also on onezone image.

I will now investigate the new bucket custom_metadata.

Furthermore, I was also experimenting the advanced metadata queries (https://onedata.org/docs/doc/using_onedata/metadata.html#advanced-metadata-queries) using REST API. From which onedata release does it work?

Regards,

Eva.

@groundnuty
Copy link
Contributor

groundnuty commented Dec 5, 2016 via email

@evasciacca
Copy link
Author

Dear Michal,

are there any news regarding the inclusion of the n1ql for next releases?

We have updated OneData to release rc11 and we are testing the queries using the indexing functions over the metadata and we have successfully done some queries using them.

Now, we would like to create more complex queries including more than one index combination, e.g. to perform this kind of sql query:

SELECT FILEPATH/FILEID from FILE-BUCKET-ONEDATA where ORIGIN_ID='41' and
PROGRAM_ID='001' and OBSERV_ID='00001' and TELESCOPE_ID='000' and DATE
BETWEEN '%2014-08-20%' and '%2014-08-26%';

Do you know if it is possible and how to do this?

Thank you in advance,

Eva.

@xorver
Copy link
Contributor

xorver commented Feb 13, 2017

We've had a discussion on out hipchat support channel and we've come up with examples of such queries, which were successfully used by Eva. To summarize the topic, here are the examples provided:

environment

SPACE_ID=4dfVO2NTmZJJzrzIYOzUJkXFA8ndZRveThPgSscPbtA
TOKEN=MDAxNWxvY2F00aW9uIG9uZXpvbmUKMDAzYmlkZW500aWZpZXIgYXJSbF9HRWZ6WWpzNlFsR001PTG1QVl9Fb009yN1kzSVlaZXVONUtWRjhrNAowMDFhY2lkIHRpbWUgPCAxNTE4MDE3MjI4CjAwMmZzaWduYXR1cmUgNLd02VRyADxrt12Sk8d01oSYMzFIcsG2IuxjdzNaLky5AK

create file

curl -k -v --tlsv1.2 -H "X-Auth-Token: $TOKEN" -H "X-CDMI-Specification-Version: 1.1.1" -H "Accept: application/cdmi-object" -H "Content-Type: application/cdmi-object" \
-d '{"metadata": {"ORIGIN_ID": "41", "PROGRAM_ID": "001", "OBSERV_ID": "00001", "RUNS_ID": "000004", "MODES_ID": "R" , "SEQUENCE_NUM": "000", "PACKET_TYPE": "1002", "PROP_ID": "0000000000000001", "DATATYPE": "0000", "TSTART": "430580855", "TSTOP": "430580965" }}' \
-X PUT "https://localhost:8443/cdmi/test/file?metadata"

complexindex.js

function(meta) {
        if(meta['ORIGIN_ID'] && meta['PROGRAM_ID']) {
                return [meta['ORIGIN_ID'], meta['PROGRAM_ID']];
        }
        return null;
}

create index

curl -v -k --tlsv1.2 -X POST -H "X-Auth-Token: $TOKEN" -H 'Content-type: application/javascript' -d @complexindex.js "https://localhost:8443/api/v3/oneprovider/index?space_id=$SPACE_ID&name=index1"

query index

INDEX_ID=t2zLyYaR0HPHO8d6wICNYiRDqzc8lGyVVcGUlbepBWo
curl -v -k --tlsv1.2 -Ss -H "X-Auth-Token: $TOKEN" -H 'Content-type: application/javascript'  -X GET "https://localhost:8443/api/v3/oneprovider/query-index/$INDEX_ID?key=\[\"41\",\"001\"\]&stale=false"

Regards,
Tomasz

@groundnuty
Copy link
Contributor

@xorver thank you for providing this short guide.
@evasciacca is information Tomasz supplied sufficient?

@evasciacca
Copy link
Author

Dear Tomasz and Michal,

thank you very much, your support was very helpful. I was able to perform some test queries. Now we are waiting for further improvements such as the possibility to delete unused indexes that is foreseen for next OneData releases (maybe rc13).

Cheers,

Eva.

@evasciacca
Copy link
Author

Dear Tomasz,

we would need to include in the query some information on the creation time of the dataset. Therefore I added to the complexindex.js the meta['ctime'] attribute but it is not working. Is this operation maybe working only on extended attributes and not on internal ones?

Thank you,

Eva.

@xorver
Copy link
Contributor

xorver commented Mar 1, 2017

Dear Eva,

Those indices work only on user created attributes, not the internal ones (as you suspected).

Regards,
Tomasz Lichoń

@evasciacca
Copy link
Author

Dear Tomasz,

thank you for your prompt replay. Would it be possible to add this possibility as enhancement for next OneData releases. Maybe the fastest way would be to add the internal attributes related to time creation/modification as extended ones (not user modifiable) on data ingestion / modification. Or maybe modify rest API to work with internal attributes...

Please let us know if you will make progresses in this direction!

Kind regards,

Eva.

@xorver
Copy link
Contributor

xorver commented Mar 1, 2017

Dear Eva,

Times are stored in a separate document, as they're changed frequently. We may possibly consider merging them with metadata document, but we've decided not to because the change of access time would provoke a lot of index updates. Moreover, there are other internal attributes that we cannot possibly add (such as size) because they are generated dynamically. Adding only a part of them would be confusing. There are also some compatibility issues that may be hard to overcome during an upgrade.

If you're only interested in creation time I would recommend adding it manually after creation. By the way ctime corresponds to attribute change time, not creation, so it may be newer if the file got modified after creation. I'll discuss with my colleagues whether to add automatically to each file xattr called creation_time, maybe it will not be a big deal.

Regards,
Tomasz Lichoń

@evasciacca
Copy link
Author

Thank you very much.

@evasciacca
Copy link
Author

Dear Tomasz,

also we wold like to understand if it would be possible to use operators such as "not equal", "like" or "order by" and if these are planned to be implemented in next releases.

Thank you in advance,

Eva.

@xorver
Copy link
Contributor

xorver commented Mar 3, 2017

Dear Eva,

We're limited to the API provided by couchbase (http://docs.couchbase.com/admin/admin/Views/views-querying.html). We create an index directly in couchbase, wrapping your function with ours. Our 'map' function uses your function to get the key (based on metadata) and emits (your_key, file_id). Then every query is simply passed directly into couchbase (you define keys, flags etc, and obtain file_ids in return).

Couchbase queries do not support operations: "not equal", "like", "order by". But you may create an index with a function that emits only files that satisfy your requirements, e. g.

For emitting only files authored by 'Eva*' that are of version different than '3.0.0':

function(meta) {
        if(meta['VERSION'] && meta['AUTHOR'])
                if(meta['VERSION'] != '3.0.0' && meta['AUTHOR'].match(/Eva.*/)
                        return 1;
        return null;
}

Then query it with key=1 or without key at all (it will return all entries). However, for each query pattern you'd need to create a new index, and it does not cover "order by".

Regards,
Tomasz Lichoń

@groundnuty
Copy link
Contributor

@evasciacca can we consider this issue as closed or there are still things to be resolved/answered here?

@evasciacca
Copy link
Author

Dear @groundnuty,

it is still not yet very clear how to use the new custom_metadata bucket from n1ql. I am not even able to see it from the couchbase console (see attached screenshot). Could you please provide more information on how to query metadata from the couchbase db to retrive the datasets using more complex queries directly on the metadata bucket?

Thank you in advance,

Eva.
cb_console

@xorver
Copy link
Contributor

xorver commented May 19, 2017

Dear Eva,

As discussed in Catania, for now, we will try to add FileId to each custom_metadata document, so you may query Couchbase directly. It should be available in rc15, I'll write info here when it's done.

Tomasz

@evasciacca
Copy link
Author

Dear @xorver ,
thank you very much, also could you please provide us a sample CouchBase n1ql query to understand how to reach the correct custom_metadata document?
Cheers,
Eva.

@xorver
Copy link
Contributor

xorver commented May 19, 2017

I have an example for you, I've entered the machine with running couchbase, then executed:

root@couchbase0-p1:~# /opt/couchbase/bin/cbq
Couchbase query shell connected to http://localhost:8093/ . Type Ctrl-D to exit.
cbq> CREATE PRIMARY INDEX sync ON sync;
{
    "requestID": "126f23f8-c507-425f-8932-d672d7ad8209",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "3.763397768s",
        "executionTime": "3.763321304s",
        "resultCount": 0,
        "resultSize": 0
    }
}
cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata";
{
    "requestID": "97a0abc4-ecad-43b4-8e69-40173e75f926",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null
                    ],
                    "parents": [
                        -1
                    ],
                    "revs": [
                        "1-64a3bd48eccc930028116168fb118ae0"
                    ]
                },
                "recent_sequences": [
                    142
                ],
                "rev": "1-64a3bd48eccc930028116168fb118ae0",
                "sequence": 142,
                "time_saved": "2017-05-19T10:15:36.656732653Z"
            },
            "space_id": "s1",
            "value": {
                "onedata_json": {
                    "key": []
                },
                "onedata_rdf": "",
                "xattr_name": "xattr_value"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "28.257026ms",
        "executionTime": "28.126285ms",
        "resultCount": 2,
        "resultSize": 1933
    }
}

Metadata is in "value" field of fetched json. In rc15 there will be a file_id in that document also.

I guess you may do it using libs provided by couchbase or REST API as well.

@evasciacca
Copy link
Author

Dear @xorver ,
thank you very much. I was able to access the custom_metadata. Now I would like to understand how to query a metadata included in the "value" field to get all documents having a specific metadata value. Here a sample output of my results:

cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata";
{
    "requestID": "dfda8e15-b136-43f6-b6cf-6881a429f3f7",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": true,
            "_sync": {
                "flags": 1,
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "deleted": [
                        1
                    ],
                    "parents": [
                        -1,
                        0
                    ],
                    "revs": [
                        "1-be3de124785f5d522b751fc21c95c53e",
                        "2-7a9e86db4a997718f9fc75e22db130a8"
                    ]
                },
                "recent_sequences": [
                    100,
                    141
                ],
                "rev": "2-7a9e86db4a997718f9fc75e22db130a8",
                "sequence": 141,
                "time_saved": "2017-04-13T10:03:07.635264976Z"
            },
            "space_id": "Y37xOXwev7zfOP7Gie0zs-ZNQLYHngluF5Xx_Ar3kWI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000005",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "parents": [
                        -1,
                        0
                    ],
                    "revs": [
                        "1-c20a06ee14f149db59e6c7b7973f3f5a",
                        "2-116eb453742dc83ee8731356b0163934"
                    ]
                },
                "recent_sequences": [
                    221,
                    227
                ],
                "rev": "2-116eb453742dc83ee8731356b0163934",
                "sequence": 227,
                "time_saved": "2017-04-14T09:19:30.18638156Z"
            },
            "space_id": "cZLEIUmL5Wi6kw53qpUVXWGA5a_2P8nY4tlvE8U3IgI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000004",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null
                    ],
                    "parents": [
                        -1
                    ],
                    "revs": [
                        "1-c20a06ee14f149db59e6c7b7973f3f5a"
                    ]
                },
                "recent_sequences": [
                    226
                ],
                "rev": "1-c20a06ee14f149db59e6c7b7973f3f5a",
                "sequence": 226,
                "time_saved": "2017-04-14T09:19:02.874373927Z"
            },
            "space_id": "cZLEIUmL5Wi6kw53qpUVXWGA5a_2P8nY4tlvE8U3IgI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000005",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": true,
            "_sync": {
                "flags": 1,
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "deleted": [
                        0
                    ],
                    "parents": [
                        1,
                        -1
                    ],
                    "revs": [
                        "2-42f8698fad6d55be0a100aa39155ce0e",
                        "1-db14a5de948411595295684b189a8931"
                    ]
                },
                "recent_sequences": [
                    66,
                    134
                ],
                "rev": "2-42f8698fad6d55be0a100aa39155ce0e",
                "sequence": 134,
                "time_saved": "2017-04-13T10:02:58.147177312Z"
            },
            "space_id": "Y37xOXwev7zfOP7Gie0zs-ZNQLYHngluF5Xx_Ar3kWI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000004",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "55.515451ms",
        "executionTime": "55.481407ms",
        "resultCount": 4,
        "resultSize": 5756
    }
}

@xorver
Copy link
Contributor

xorver commented May 24, 2017

You need to extend the "where" clause just as in typical SQL, e. g.

SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata" AND sync.value.DATATYPE = "0000";

The syntax: http://www.dofactory.com/sql/where-and-or-not

@evasciacca
Copy link
Author

Yes I have tried this before but with no success:

cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata" AND sync.value.ORIGIN_ID = "41";
{
    "requestID": "2e6be4ab-a994-4678-9e04-27b946a5f2c5",
    "errors": [
        {
            "code": 3000,
            "msg": "syntax error - at value"
        }
    ],
    "status": "fatal",
    "metrics": {
        "elapsedTime": "2.41933ms",
        "executionTime": "2.327075ms",
        "resultCount": 0,
        "resultSize": 0,
        "errorCount": 1
    }
}

@xorver
Copy link
Contributor

xorver commented May 25, 2017

Try it like this:

SELECT sync.* FROM sync WHERE sync.`<record_type>` = "custom_metadata" AND sync.`value`.`name` = "something";

It worked with my setup.

@evasciacca
Copy link
Author

Great, thank you very much! It works now... so from next release r15 we will automatically find a file_id from which we can retrieve the data file, is it correct?

Thank you,

Eva.

@xorver
Copy link
Contributor

xorver commented May 25, 2017

Yes, it's correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants