NGEC

New Generation Event Coder Support Files

This repo holds a few files related to the classifiers discussed in the paper

Halterman, Andrew, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi and Grace I. Scarborough. 2023. “Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks.” Working paper presented at the International Studies Association, Montreal, March 2023.

Additional code related to this paper, specifically on the entity-resolution side, can be found at https://github.com/ahalterman/NGEC/

At present, and likely well into the future, this site will mostly contain some utility files useful for reducing typing and typos, and a few code fragments, largely from the Huggingface and sklearn documentation with slight modifications, which clarify exactly what we mean by "default" parameters and techniques. The complete data-generation pipeline is at Leidos, and probably the U.S. government (which remains disinclined to share[1]) holds the intellectual property rights to it, an issue yet to be decided by legions of Gucci-shod lawyers, so it isn't here.

Which is to say, there's not a turn-key system here. But the remaining code in the operational pipeline is the routine stuff of, well, operational pipelines, so if you know enough to be creating an event data pipeline, you know enough to write that sort of code, and will almost certainly be better off writing it using whatever idiomatic style you are comfortable with rather than adapting to ours. Or rather, mine.[2]

A couple more useful links:

A final and, alas, I-wish-it-were-unnecessary note: In the last few years there has developed in some parts of the event data community a most unfortunate, and certainly thoroughly unscientific, pathology of rigorously suppressing any criticism of data sets, variously using contractual non-disparagement clauses, threats of legal action, or ruthless pursuit of critics through various all-too-available institutional mechanisms originally designed to protect scientific integrity but now used against it. I believe I speak for our entire team when I say that we welcome criticism, and if you sincerely believe this data and/or the entire exercise is the dumbest thing you've ever seen, you are welcome, even encouraged, to express that opinion. Really.[3]

CAMEO_codefile.txt

The golden-oldie file that lists all of the CAMEO categories and translates the textual descriptions used in ICEWS to the numerical codes used everywhere else. Also see https://github.com/openeventdata/text_to_CAMEO, which also uses this file.

CAMEO2PLOVER-2021-09-17.txt

CAMEO to PLOVER conversion file: per the embedded date (2021-09-17), it may need a bit of updating.

code_fragment_for_CAMEO_conversion.txt

Code fragments for reading and using these files to get the PLOVER equivalent of a CAMEO code.
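For orientation only, here is a minimal sketch of what such a lookup might look like. It is not the fragment in the file. The two-column tab-separated layout, the three sample rows, the function names `load_mapping` and `cameo_to_plover`, and the prefix back-off rule are all assumptions made for illustration; check them against the actual conversion file before relying on any of it.

```python
# Illustrative sketch only: the two-column layout and the sample rows are
# assumptions, not the actual contents of CAMEO2PLOVER-2021-09-17.txt.
import csv
import io

SAMPLE_TABLE = """\
04\tCONSULT
13\tTHREATEN
14\tPROTEST
"""


def load_mapping(handle):
    """Read tab-separated 'CAMEO code <TAB> PLOVER category' rows into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def cameo_to_plover(code, mapping):
    """Return the category for the longest matching prefix of `code`, or None.

    Backing off to shorter prefixes (1411 -> 141 -> 14) is an assumption;
    a real conversion table may need exact matches on the subcodes.
    """
    code = code.strip()
    while len(code) >= 2:
        if code in mapping:
            return mapping[code]
        code = code[:-1]
    return None


mapping = load_mapping(io.StringIO(SAMPLE_TABLE))
print(cameo_to_plover("141", mapping))
print(cameo_to_plover("99", mapping))
```

To use it with a real file, one would pass an open file handle to `load_mapping` in place of the `io.StringIO` sample, after adjusting the parsing to whatever layout the file actually has.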

code_fragments_for_NGEC_classification.txt

Code fragments for estimating the models for determining event categories, modes, and contexts, mostly just showing the appropriate libraries to import and then call.
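As a rough idea of what "import the libraries and call them with defaults" can mean on the sklearn side, here is a generic text-classification sketch. It is not taken from the fragment file: the choice of `TfidfVectorizer` plus `LogisticRegression`, the toy sentences, and the labels are all assumptions for illustration, and the paper's actual models may differ.

```python
# Generic sklearn sketch with default parameters; the estimator choice and
# the made-up training sentences are assumptions, not the NGEC models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, invented examples standing in for labelled event sentences.
train_texts = [
    "Thousands of demonstrators marched through the capital.",
    "Workers rallied outside parliament demanding higher wages.",
    "The two foreign ministers met to discuss the border issue.",
    "Delegations held talks in Geneva on Tuesday.",
]
train_labels = ["PROTEST", "PROTEST", "CONSULT", "CONSULT"]

# No arguments passed: every hyperparameter keeps its library default.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["Students marched and rallied in the square."])
print(predictions)

# Inspect what "default" actually resolved to for the classifier step.
print(model.named_steps["logisticregression"].get_params())
```

Printing `get_params()` is a cheap way to record exactly which defaults were in force, since these can change between library versions.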

PLOVER_lists.txt

Assorted Python-formatted lists including PLOVER categories, 4-character category abbreviations, modes, contexts, and intensity scores.

Footnotes

  1. Why utterly mundane code funded entirely by U.S. taxpayers remains proprietary while billions of dollars of pathbreaking and exceedingly high quality state-of-the-art software generated by corporations such as Alphabet/Google, Meta/Facebook, Amazon, and Microsoft has been made open source is, well, a great mystery. Though as the periodic discourses in War on the Rocks on the utterly dysfunctional character of US defense procurement note repeatedly, the simple combination of Soviet-style central planning and US-style corporate incentives gets you most of the way: nothing personal, just business.

  2. Which includes (I'm not making this up) occasional integer loop indices labelled ka, kb... because variables beginning with "k" were integers in FORTRAN IV and using the suffixes 'a', 'b'... rather than '1', '2'... saved hitting the numerical shift key on a card punch, said shift requiring moving approximately as much finely machined, but not necessarily balanced, brass and steel as a Honda Civic. But nowadays my code mostly just uses iterators. Honest.

  3. Or in the words of the late Michael Nicholson, one of the early pioneers in quantitative conflict analysis, "I'd rather have people saying 'that bastard Michael Nicholson' than 'Who is Michael Nicholson?'" Our sentiment exactly for PLOVER.