[cwapi] adding named entities and noun chunks counts django stuff #26

Open
wants to merge 1 commit into
base: master
from

@jeiranj jeiranj commented Nov 3, 2017

class SpeakerWordCounts(models.Model):
    def __str__(self):
        return ",".join([self.crec_id, self.bioguide_id])
    bioguide_id = models.CharField(max_length=7, primary_key=True)

Do I understand correctly that these are named entities and noun chunks within a given document (crec_id) attributed to a particular speaker (bioguide_id)? In that case, should the primary key be a compound key of ('bioguide_id', 'crec_id')? Also, does the data currently come from the segments?

@will-horning will-horning Nov 5, 2017

Yes, a single row in this table (or a single instance of this class) contains the noun chunk and named entity counts attributed to a given speaker within a single snippet, or a single document if it is a single-speaker document.

The primary key question is a little tricky. It would need to be a compound of bioguide_id, crec_id and a sequence number for the attributed segments (a person can speak in separate segments within a single document, so crec_id + bioguide_id alone may not be unique). We'll also want to make bioguide_id a foreign key (see the legislators models for an example of how to do that in Django's ORM) so we can easily retrieve all the segments/documents for a legislator object via the ORM.
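
To make the uniqueness point concrete, here is a minimal sketch that uses the standard library's `sqlite3` instead of Django's ORM so it runs on its own. The table names, the `segment_index` column, and the sample IDs are all hypothetical and are not taken from this PR. In Django the same constraint is commonly expressed with a `ForeignKey` plus `unique_together` in the model's `Meta`.

```python
import sqlite3

# In-memory database; SQLite only enforces foreign keys when this pragma is on.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE legislator (
    bioguide_id TEXT PRIMARY KEY
);
CREATE TABLE speaker_word_counts (
    crec_id       TEXT    NOT NULL,
    bioguide_id   TEXT    NOT NULL REFERENCES legislator(bioguide_id),
    segment_index INTEGER NOT NULL,  -- hypothetical per-document sequence number
    PRIMARY KEY (crec_id, bioguide_id, segment_index)
);
""")


def add_counts(conn, crec_id, bioguide_id, segment_index):
    """Insert one row; return False if a key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO speaker_word_counts "
                "(crec_id, bioguide_id, segment_index) VALUES (?, ?, ?)",
                (crec_id, bioguide_id, segment_index),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    conn.execute("INSERT INTO legislator VALUES (?)", ("X000001",))

# Same speaker, same document, two separate segments, then a true duplicate.
results = [
    add_counts(conn, "CREC-EXAMPLE-1", "X000001", 0),
    add_counts(conn, "CREC-EXAMPLE-1", "X000001", 1),
    add_counts(conn, "CREC-EXAMPLE-1", "X000001", 1),
]

# All rows for one legislator, which is the lookup the foreign key is meant to support.
row_count = conn.execute(
    "SELECT COUNT(*) FROM speaker_word_counts WHERE bioguide_id = ?",
    ("X000001",),
).fetchone()[0]
```

With only (crec_id, bioguide_id) as the key, the second insert would be rejected even though it describes a different segment.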

@@ -253,4 +254,19 @@ def parse_mods_file(self, mods_file):
noun_chunks = text_utils.named_entity_dedupe(noun_chunks, named_entity_freqs.keys())
record['noun_chunks'] = str(Counter(noun_chunks).most_common())

if bool(record['speaker_ids']):

Why the bool cast?
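
For context, an `if` statement already applies Python's truth test to its condition, so `if bool(x):` and `if x:` take the same branch. A small sketch, assuming `record['speaker_ids']` holds an ordinary value such as `None`, a string, or a list (the diff does not show its type, and the sample values below are hypothetical):

```python
def branch_taken(value, use_cast):
    """Return True if the body of the `if` would run for `value`."""
    if use_cast:
        if bool(value):  # the form used in the diff
            return True
        return False
    if value:  # the plain form
        return True
    return False


# Hypothetical values that `speaker_ids` might hold.
samples = [None, "", [], {}, 0, "X000001", ["X000001"], {"id": "X000001"}]

# Both forms agree on every sample.
same = all(branch_taken(v, True) == branch_taken(v, False) for v in samples)
```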
