figure out how to make this stuff machine readable #13
Comments
Happy to help out on this, but is YAML the best format? D3, for example, provides data loading functions for CSV/TSV, XML, and JSON blobs natively, but not YAML.
@jebeck I have no strong feelings about this - whatever you think would be best! Could you have a glance at the existing CSV/XML stuff I linked to and see if we could just work with that directly?
I glanced at the CSV spec, and it looks pretty terrible (information is distributed across rows instead of columns, which is generally not friendly data design). The XML spec may be better, but I also wonder how often companies choose to submit in this form? In any case, happy to take on the task of writing some tools to translate between the official specs and a simplified format (I'd argue for JSON). We should chat about tools - might be able to keep it all client-side and do JavaScript, or could do Python.
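To make the "translate to a simplified format" idea concrete, here is a minimal Python sketch of pivoting row-oriented data into one nested JSON record per company and year. The column names (`company`, `year`, `job_category`, `gender`, `race`, `count`), the sample rows, and the nested output shape are all assumptions made up for illustration; they are not the official EEOC layout and not a format anyone in this thread has agreed on.

```python
import csv
import io
import json

# Hypothetical long-form input: one row per (job category, gender, race) count.
# Column names and values are illustrative only, not the official EEOC layout.
SAMPLE_CSV = """company,year,job_category,gender,race,count
ExampleCo,2013,Professionals,female,Asian,12
ExampleCo,2013,Professionals,male,Asian,30
ExampleCo,2013,Technicians,female,White,4
"""


def rows_to_reports(csv_text):
    """Pivot long-form rows into one nested dict per (company, year)."""
    reports = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["company"], int(row["year"]))
        report = reports.setdefault(
            key, {"company": key[0], "year": key[1], "counts": {}}
        )
        # counts[job_category][gender][race] = headcount
        by_gender = report["counts"].setdefault(row["job_category"], {})
        by_race = by_gender.setdefault(row["gender"], {})
        by_race[row["race"]] = int(row["count"])
    return list(reports.values())


reports = rows_to_reports(SAMPLE_CSV)
print(json.dumps(reports, indent=2))
```

Nesting by job category first is just one possible choice; a flat list of rows would also be valid JSON and may be easier to load into something like D3.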
Do we have access to the EEO-1 CSV or XML files? I think setting up automated tooling to transform these files may be quite a bit of effort in order to parse a small amount of data that is getting updated and added at a relatively slow pace. I don't mean to suggest that we shouldn't do this, but I would like to propose a few alternatives that I think may get us human & machine readable data faster. I think the following strategies may get us a win (that admittedly isn't particularly flexible) in a short amount of time:
Also, +1 on aiming for JSON as the data format we keep in this repo. It is machine readable, and human readable & editable.
I saw on the Double Union mailing list that another goal is to advocate for a standard data format to release diversity data in. This seems related to, but not necessarily dependent on, making the currently released data machine readable. Perhaps we could make another ticket for it?
AFAIK, @jhlch, we don't have access to the CSV and/or XML data. I think each corporation gets to decide how they want to submit the data (see the links @hypatia pasted opening this issue), and I doubt we're going to have very many, if any, of them releasing the data in these formats. Given the very small size of these datasets (at least compared to some of the data I'm used to working with...), I think transcription won't be a completely heinous task, unless we start getting 1000s of companies to release data(!!!). (Take note that many of the "PDFs" submitted so far are actually images of a PDF, so something like Tabula isn't going to help much.) Another possibility I'd like to try is setting up a client-side GUI app for transcription; we should be able to leverage the download attribute in browsers that support it to let a transcriber download the results of the form and send it in. Does that sound like a good idea to anyone else or just me? ;) All in all, my proposal is the following path (and yeah, these should be split out into separate tickets if there's consensus):
I've got some vacation coming up this week, and I've got some other projects to work on as well, but I could definitely do the JSON Schema proposal, maybe get a start on a simple transcription form.
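As a rough idea of what a schema would buy us for hand-transcribed data, here is a small Python sketch that checks a record before it gets committed. It is a hand-rolled check using only the standard library, not JSON Schema itself, and the field names (`company`, `year`, `counts`) and the nested `counts[job_category][gender][race]` shape are assumptions for illustration only.

```python
import json

# Hypothetical required top-level fields and their expected Python types.
REQUIRED_FIELDS = {"company": str, "year": int, "counts": dict}


def validate_report(report):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in report:
            errors.append(f"missing field: {field}")
        elif not isinstance(report[field], expected):
            errors.append(f"{field} should be {expected.__name__}")

    counts = report.get("counts")
    if isinstance(counts, dict):
        for category, by_gender in counts.items():
            if not isinstance(by_gender, dict):
                errors.append(f"counts[{category}] should be an object")
                continue
            for gender, by_race in by_gender.items():
                if not isinstance(by_race, dict):
                    errors.append(f"counts[{category}][{gender}] should be an object")
                    continue
                for race, n in by_race.items():
                    # bool is a subclass of int in Python, so reject it explicitly.
                    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
                        errors.append(
                            f"counts[{category}][{gender}][{race}] "
                            "must be a non-negative integer"
                        )
    return errors


good = json.loads(
    '{"company": "ExampleCo", "year": 2013,'
    ' "counts": {"Professionals": {"female": {"Asian": 12}}}}'
)
bad = json.loads(
    '{"company": "ExampleCo",'
    ' "counts": {"Professionals": {"female": {"Asian": -1}}}}'
)

print(validate_report(good))
print(validate_report(bad))
```

A real JSON Schema document would express the same constraints declaratively, which is probably what we want long term; this just shows the kind of transcription mistakes (missing year, negative counts) it could catch.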
I remembered that we have a Gmail account for open diversity data. I bet we could make a Google Form and get a spreadsheet auto-populated in Google Docs. This may be an option to consider for a client-side GUI for crowdsourcing the parsing of the PDF data. Just a thought.
here's the CSV/XML format: http://www.eeoc.gov/employers/eeo1survey/eeo1_cvs_specifications.cfm
ASCII/text format: http://www.eeoc.gov/employers/eeo1survey/ee1_datafile_2013.cfm