Introduce a codeList property to the field descriptor #1027
Replies: 10 comments
-
I like this proposal. I have, however, two suggestions. Both are inspired by Swagger's $ref notation.
External codeLists
A reference to an external codeList may be made using JSON Reference syntax (RFC3986) and JSON Pointer syntax (RFC6901). Example:
Non-basic codeLists
Jan's example assumes a "code" and a "label" in the first and second column in the file, respectively. This will work in many cases and it is easy and clean to specify this. However, we become more flexible by allowing an optional
Tabular data
The Example:
Non-tabular data (json, xml)
The Example:
...
Complex data
What about other data sources, like the more complicated ICD-10? In cases like this, a (custom) mimetype/mediatype (e.g.
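The examples from this comment did not survive in this copy of the thread, so the following is only a rough sketch of how a same-document reference could be resolved with JSON Pointer (RFC6901). The `codeLists` key, the `{"$ref": ...}` shape on the field and the helper names are assumptions made for illustration; they are not taken from the comment or from any specification. External (URL) references are deliberately left out.

```python
import json
from urllib.parse import unquote


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against an already parsed document."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    node = document
    for token in pointer[1:].split("/"):
        # RFC 6901 unescaping order: '~1' becomes '/', then '~0' becomes '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def resolve_local_ref(document, ref):
    """Resolve a same-document reference such as '#/codeLists/edu_level'."""
    if not ref.startswith("#"):
        raise NotImplementedError("only same-document references are sketched here")
    # The fragment part of a URI may be percent-encoded.
    return resolve_pointer(document, unquote(ref[1:]))


# Hypothetical descriptor: 'codeLists' and the '$ref' shape are assumptions.
descriptor = json.loads("""
{
  "codeLists": {
    "edu_level": [
      {"code": 1, "label": "Low education"},
      {"code": 2, "label": "Medium education"},
      {"code": 3, "label": "High education"}
    ]
  },
  "fields": [
    {"name": "edu_level", "type": "integer",
     "codeList": {"$ref": "#/codeLists/edu_level"}}
  ]
}
""")

field = descriptor["fields"][0]
code_list = resolve_local_ref(descriptor, field["codeList"]["$ref"])
labels = {entry["code"]: entry["label"] for entry in code_list}
print(labels)
```

A real implementation would probably also need to fetch and cache remote documents before applying the pointer, which is where most of the complexity would sit.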
-
Thanks for jumping into the conversation here! Coded categorical data is something I'm very interested in for frictionless as well. I think the functionality you're after here might be more generally modeled / solved by indicating For example, the example you include could be modeled like this: {
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
}, {
"name": "place_of_residence",
"type": "string"
}, {
"name": "edu_level",
"type": "integer"
}
],
"primaryKey": ["id"],
"foreignKeys": [
{
"fields": "place_of_residence",
"reference": {
"resource": "codelist-regions",
"fields": "code"
}
}, {
"fields": "edu_level",
"reference": {
"resource": "codelist-edu_level",
"fields": "code"
}
}
]
}
}, {
"name": "codelist-regions",
"schema": {
"fields": [
{
"name": "code",
"type": "string"
}, {
"name": "name",
"type": "string"
}, {
"name": "parent",
"type": "string"
}
],
"primaryKey": ["code"],
"foreignKeys": [
{
"fields": "parent",
"reference": {
"resource": "",
"fields": "code"
}
}
]
},
"path": "codelist-regions.csv"
}, {
"name": "codelist-edu_level",
"schema": {
"fields": [
{
"name": "code",
"type": "integer"
}, {
"name": "name",
"type": "string"
}
],
"primaryKey": ["code"]
},
"data": [
{"code": 1, "name": "Low education"},
{"code": 2, "name": "Medium education"},
{"code": 3, "name": "High education"}
]
}
]
} I think this solves almost all the requirements you listed above: ✔️ Possibility to indicate that a given field should use values/codes from a given list. It additionally has the benefits of allowing for more flexible naming of fields, being SQL compatible, as mentioned above, and leveraging all of the existing validation built around primary & foreign keys. For the item that it doesn't address (The simple case of using values / codes inline when they are not long & do not have other complex relationships), I think the current categorical proposal #875 is still a good solution: {
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
}, {
"name": "place_of_residence",
"type": "string"
}, {
"name": "edu_level",
"type": "categorical",
"categories": [
{"value": 1, "label": "Low education"},
{"value": 2, "label": "Medium education"},
{"value": 3, "label": "High education"}
]
}
]
}
}
]
} The advantage of using a specific categorical type here is its usability in a wide range of statistical software, e.g. R, SAS, SPSS, Stata, etc. (Whereas more complex & interrelated code lists are better modeled in the context of a database, with the Are there any situations I'm missing here that a combination of categorical field types and resource references would not solve?
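To make the foreign-key modelling from this comment a little more concrete, here is a minimal sketch of how an implementation might check that every coded value appears in the referenced code list. It assumes single-field foreign keys and inline `data` on the referenced resource; the function name `find_unknown_codes` and the sample rows are made up for illustration and are not part of any frictionless library.

```python
package = {
    "resources": [
        {
            "name": "edulevel",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "edu_level", "type": "integer"},
                ],
                "foreignKeys": [
                    {
                        "fields": "edu_level",
                        "reference": {
                            "resource": "codelist-edu_level",
                            "fields": "code",
                        },
                    }
                ],
            },
        },
        {
            "name": "codelist-edu_level",
            "data": [
                {"code": 1, "name": "Low education"},
                {"code": 2, "name": "Medium education"},
                {"code": 3, "name": "High education"},
            ],
        },
    ]
}


def find_unknown_codes(package, resource_name, rows):
    """Return (row_index, field, value) for each value missing from its code list.

    Assumption: every foreign key is a single field and the referenced
    resource carries inline 'data' as a list of objects.
    """
    resources = {r["name"]: r for r in package["resources"]}
    schema = resources[resource_name]["schema"]
    violations = []
    for fk in schema.get("foreignKeys", []):
        field = fk["fields"]
        reference = fk["reference"]
        target = resources[reference["resource"]]
        allowed = {row[reference["fields"]] for row in target["data"]}
        for index, row in enumerate(rows):
            if row[field] not in allowed:
                violations.append((index, field, row[field]))
    return violations


# Hypothetical data rows; the second one uses a code that is not in the list.
rows = [{"id": 1, "edu_level": 2}, {"id": 2, "edu_level": 7}]
violations = find_unknown_codes(package, "edulevel", rows)
print(violations)
```

Self-references (an empty `resource` name) and code lists stored in files would need extra handling that is omitted here.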
-
(Tagging @pschumm and @peterdesmet for their thoughts as well)
-
Thanks for the comments. We have also thought about using the Also in the current implementation/description it is not really described what the foreign key should be used for. As far as I understand, it just indicates another dataset that shares a key with the current dataset. It could be a code list, but also something else completely. When working with code lists from R or Python you will probably want to convert to a factor, but then the code needs to know that it is a code list. This could be solved by adding something like a Using both
-
Thanks for your clarifications. I think my biggest concern regarding the
Right – and I think that's a good thing in this case, because when you're referencing codes in this way, there are many different ways an implementation might want to follow these relationships. For a dropdown selection widget for codes, for example, you might want to list code abbreviations, but when you're populating a larger table, you might want to grab their full descriptions. Which field an implementation will want to use to represent the level depends on the application, the current usage context, and the available properties of the levels…
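As a small illustration of that point, the sketch below decodes the same coded values with different label fields depending on the context. The `decode` helper and its parameters are hypothetical; the level properties (`code`, `name`, `description`) are assumed for the example.

```python
levels = [
    {"code": 1, "name": "Low education", "description": "Primary education"},
    {"code": 2, "name": "Medium education", "description": "Secondary education"},
    {"code": 3, "name": "High education", "description": "Tertiary education"},
]


def decode(values, levels, key="code", label_field="name"):
    """Replace each coded value by the chosen property of its level.

    Unknown codes are returned as None rather than raising, so the caller
    can decide how to treat them.
    """
    lookup = {level[key]: level[label_field] for level in levels}
    return [lookup.get(value) for value in values]


values = [3, 1, 2]

# A compact widget might show the short names...
short = decode(values, levels, label_field="name")
# ...while a detailed table might prefer the longer descriptions.
detailed = decode(values, levels, label_field="description")

print(short)
print(detailed)
```

Leaving the choice of label field to the caller keeps the descriptor neutral about presentation, which seems to be the trade-off being argued for here.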
I disagree on this point – I think there's a strong precedent for having both flat categorical value types and richer categorical level entities via table relationships. For example, DuckDB has categorical types as well as the ability to define Perhaps we could get the desired functionality here by extending the {
"name": "highest_education",
"resources": [
{
"name": "edulevel",
"format": "csv",
"mediatype": "text/csv",
"path": "edulevel.csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "id",
"type": "integer"
},
{
"name": "place_of_residence",
"type": "string"
},
{
"name": "edu_level",
"type": "categorical",
"categories": [
{ "value": 1, "label": "LOW_EDU" },
{ "value": 2, "label": "MED_EDU" },
{ "value": 3, "label": "HIGH_EDU" }
]
}
],
"foreignKeys": [
{
"fields": "edu_level",
"reference": {
"resource": "codelist-edu_level",
"fields": "code"
}
}
]
}
},
{
"name": "codelist-edu_level",
"schema": {
"fields": [
{
"name": "code",
"type": "integer"
},
{
"name": "name",
"type": "string"
},
{
"name": "description",
"type": "string"
},
{
"name": "field_color",
"type": "categorical",
"categories": ["red", "green", "blue"]
}
],
"primaryKey": ["code"]
},
"data": [
{
"code": 1,
"name": "Low education",
"description": "Primary education",
"field_color": "red"
},
{
"code": 2,
"name": "Medium education",
"description": "Secondary education",
"field_color": "green"
},
{
"code": 3,
"name": "High education",
"description": "Tertiary education",
"field_color": "blue"
}
]
}
]
} By using the categorical field type AND It's still rough for fields with 100s of levels, or when many variables share the same categorical scales -- for those cases what if we just let the
-
Having spent a fair amount of time myself on #875 (and its precursor pattern) and believing that the extended discussion that led to it improved it considerably, I was initially not very keen to see this proposed as an alternative. However, I read this proposal and thought about it carefully, and I must admit that it has grown on me. I too use large code lists, as well as those with hierarchical structure (e.g., ICD9/10 codes, Multum drug codes, etc.), so I can appreciate those use cases. And I agree with @djvanderlaan that these are conceptually field-level properties as opposed to properties of the data resource, so using the That said, IIUC, this proposal addresses just two features that #875 does not provide:
Item (1) can be addressed using a JSON reference (as noted above by @fomcl and previously by @peterdesmet). Item (2) is a different matter, and I rather like the syntax proposed by @djvanderlaan above. I can see how that could be used to accomplish everything that #875 does. But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this. Thus, while I can appreciate the elegance in a more general solution that can accomplish everything that #875 does as well as Item (2) above, I do not believe that it is justified in this case. Specifically, IMO the syntax proposed in #875 (and especially the syntactic sugar described by @khusmann at the bottom of that discussion) would be considerably simpler and more intuitive for the most common use cases of categorical variables. In sum, I agree with @khusmann above that addressing Item (2) with a separate strategy is warranted in this case, and I would support @djvanderlaan's proposal above for that purpose.
-
Thanks for the reply. Being able to store code lists externally (in a separate file or using a URL) is not just beneficial for large code lists. It also makes it easier for different datasets to use the same code lists. Many organisations have coordinated code lists for various subjects. Using a data resource allows one to link to such lists. I can see the amount of effort that went into #875. It is perhaps also interesting to tell a bit about the history of the current proposal: we initially had an implementation that is similar to the one proposed in issue #875 (see #156 (comment)). In order to allow for large code lists and reuse of code lists, we wanted to allow that field to also refer to files. After discussion we concluded that it then made more sense to refer to a data resource instead of a file directly. And since a data resource allows for inline data anyway, we could then always refer to a data resource. Edit: I removed a bit here as I didn't completely manage to formulate that the way I wanted. The gist is that I do not see a fundamental difference between a variable with a codelist and a categorical variable. From a user perspective the codelist proposal doesn't seem to be more complex (e.g. see demo: https://gist.github.com/djvanderlaan/f898bd8b4416dfe6157a7c45c616eecb). I do like the 'syntactic sugar' proposed in the other proposal. This is difficult to do with the codelist proposal.
Would it be possible to merge both proposals? The categories field can be
-
Agreed. #875 allows frictionless to be a drop-in replacement for the proprietary formats currently dominating a bunch of scientific fields. I think it should stand as it is, and we should try to figure out how to work these features in as a somewhat separate concern.
And similarly allows the same dataset to use the same codelist many times -- that's something I could really make use of in my data, along with the ability to store more metadata about categorical levels.
I like this direction. I would prefer though if we use an object instead of just the data resource name, so we can explicitly assign which fields the labels and values should come from. Something like this:
I still wish we could connect this to foreign keys somehow, because it's an existing practice in data warehousing for automatically recognizing and traversing properties of hierarchical categorical structures – see zillion for a good example. But I agree, it's nice to have it specified in the field itself rather than be a table-level prop. I suppose we can leave it up to implementations to recognize this as a foreign key situation.
-
Very late to this party. While I understand that it can be useful to represent a code list as a Data Resource, I consider code lists more similar to Table Schemas. A
Similarly,
Personally, I'm not a fan of "
-
An advantage of using a data resource instead of a separate JSON file is that a data resource also has functionality for storing additional metadata for the code list. For example, in my file I could have one variable that uses NACE to code companies. In the data resource for the categories of that field, I can then indicate, using the title, description, author, license and source fields, which specific version of NACE I am using, the license (which might be different from that of the dataset itself), and also refer to the original author of that NACE classification. Without that, I would just have a list of codes and labels and would not even know I was looking at NACE codes (unless you put all that information in the description field of the field, which then becomes quite overloaded). Also, #48 (comment) mentions that it would be useful to be able to have one JSON file with a collection of definitions for categories. We already have an object to store a list of 'resources', namely a data package. If we can refer to a specific data resource in a data package, we could store all category definitions (inline) in one data package.
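A rough sketch of what this could look like from the consuming side: given a field that names its code list resource, a reader can pull the descriptive metadata from that resource. The `codeList` property is the one proposed in this discussion, not part of the released specification, and the helper name, package content, licence and source values are placeholders for illustration only. The two NACE section labels are an illustrative subset.

```python
package = {
    "name": "company-data",
    "resources": [
        {
            "name": "companies",
            "schema": {
                "fields": [
                    # 'codeList' is the property proposed in this thread.
                    {"name": "activity", "type": "string", "codeList": "nace-sections"}
                ]
            },
        },
        {
            "name": "nace-sections",
            "title": "NACE sections (illustrative subset)",
            "description": "Placeholder: state the exact NACE version used here.",
            "licenses": [{"name": "placeholder-license"}],
            "sources": [{"title": "Placeholder: original publisher of the classification"}],
            "data": [
                {"code": "A", "label": "Agriculture, forestry and fishing"},
                {"code": "C", "label": "Manufacturing"},
            ],
        },
    ],
}


def code_list_metadata(package, resource_name, field_name):
    """Collect descriptive metadata of the code list a field refers to."""
    resources = {r["name"]: r for r in package["resources"]}
    fields = resources[resource_name]["schema"]["fields"]
    field = next(f for f in fields if f["name"] == field_name)
    code_list = resources[field["codeList"]]
    keys = ("title", "description", "licenses", "sources")
    info = {key: code_list[key] for key in keys if key in code_list}
    info["n_codes"] = len(code_list.get("data", []))
    return info


info = code_list_metadata(package, "companies", "activity")
print(info["title"], "-", info["n_codes"], "codes")
```

With a bare list of codes and labels on the field itself, none of this provenance information would have an obvious place to live.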
-
We work a lot with survey data and administrative data. In both cases files often contain fields where the values in the field should come from a limited list of possible values. These values also have a specific meaning. Some examples:
Properties of these codes:
We are aware of the suggestion in issue #875 for supporting categories which is the same issue/problem. However, there are a few 'wishes' that are not covered by the suggestion in that issue and we believe the suggestion below is also easier to implement.
What we would like/need:
datapackage.json) itself or have the codes in a file, as large lists of codes make the metadata too bloated, which also makes maintenance more difficult. This file could be part of the datapackage or could be hosted externally.
What we suggest:
Add a property codeList to the FieldDescriptor. This MUST be a string with the name of a DataResource in the DataPackage (if there is a syntax for referencing a DataResource in an external DataPackage, this would also be valid).
This has a number of advantages:
Code List Resource
We don't yet have a concrete suggestion as to what should be in the dataset containing the code list and what format this dataset should have. We currently have an implementation that assumes that the first column in the dataset contains the codes and the second the labels of the codes. This is, however, minimal functionality. Some thoughts:
code (or id) and name; with optional columns description, locale and parent (indicating missing values seems to be missing; I can check with SDMX experts how this is handled; probably using custom annotations).
Example
Possible example with both a codelist in a file and inline data:
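The original example is not reproduced in this copy of the thread. What follows is not that example but a hedged sketch of the idea, assuming the `codeList` property names a resource in the same package and, as in the implementation described above, that the first column holds the codes and the second the labels. The resource names, region codes and the in-memory `FILES` stand-in for files on disk are all hypothetical.

```python
import csv
import io

# Stand-in for files on disk so the sketch is self-contained (hypothetical content).
FILES = {"regions.csv": "code,label\nR1,North\nR2,South\n"}

package = {
    "resources": [
        {
            "name": "survey",
            "schema": {
                "fields": [
                    {"name": "region", "type": "string", "codeList": "regions"},
                    {"name": "edu_level", "type": "integer", "codeList": "edu-levels"},
                ]
            },
        },
        # Code list stored in a file.
        {"name": "regions", "path": "regions.csv"},
        # Code list stored inline.
        {
            "name": "edu-levels",
            "data": [
                {"code": 1, "label": "Low education"},
                {"code": 2, "label": "Medium education"},
                {"code": 3, "label": "High education"},
            ],
        },
    ]
}


def code_labels(package, field, files=FILES):
    """Map codes to labels, taking column 1 as the code and column 2 as the label."""
    resource = next(r for r in package["resources"] if r["name"] == field["codeList"])
    if "data" in resource:
        # Python dicts keep insertion order, so values() follows the column order.
        rows = [list(row.values()) for row in resource["data"]]
    else:
        reader = csv.reader(io.StringIO(files[resource["path"]]))
        next(reader)  # skip the header row
        rows = list(reader)
    return {row[0]: row[1] for row in rows}


fields = {f["name"]: f for f in package["resources"][0]["schema"]["fields"]}
region_labels = code_labels(package, fields["region"])
edu_labels = code_labels(package, fields["edu_level"])
decoded = [edu_labels[value] for value in [3, 1, 2]]
print(region_labels)
print(decoded)
```

Codes read from CSV arrive as strings, so a fuller implementation would cast them according to the field type before matching; that step is omitted here.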
@fomcl