Apply a "collocation correction" to synt. coll. results #9

Open
tomachalek opened this issue Oct 16, 2023 · 2 comments
tomachalek commented Oct 16, 2023

For each result item, Scollex should test whether the "child-parent" pair is also a "traditional" collocation, because in that case the syntactic relationship may be incorrect. Since it may or may not be incorrect, Scollex should neither remove such items nor mark them with flags like "incorrect". It should only add a flag indicating that the value is also a "traditional" collocate.

It would probably be best to build this functionality directly into Scollex rather than moving the responsibility elsewhere, e.g. to WaG (imagine a tile loading data from both Scollex and KonText (or MQuery) and combining them).

How it should work:

  1. the import function will have an option -colloc-flags-with-span (int value)
  2. if enabled, the vertical file processing will have two passes:
     1. find all "traditional" collocations and store them in memory
     2. run the current import to find syntactic collocations and, for each word pair, add the new attributes coOccurrence bool and coOccurrenceScore float64 (we use the term "co-occurrence" instead of "collocation" to distinguish the collocations we are interested in here, i.e. the syntactic ones, from the "traditional" ones).
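
A minimal sketch of how the option and the new attributes could look in Go. The names `SyntacticColl`, `parseSpan` and `passes` are hypothetical and not taken from the Scollex code base; treating a span of 0 as "disabled" is also an assumption.

```go
package main

import (
	"flag"
	"fmt"
)

// SyntacticColl is a hypothetical, simplified result item; the real
// Scollex types may look different.
type SyntacticColl struct {
	Child             string
	Parent            string
	CoOccurrence      bool    // true if the pair is also a "traditional" collocation
	CoOccurrenceScore float64 // strength of that "traditional" collocation
}

// parseSpan reads the -colloc-flags-with-span option from args.
// A value of 0 (the default assumed here) means the feature is disabled.
func parseSpan(args []string) (int, error) {
	fs := flag.NewFlagSet("import", flag.ContinueOnError)
	span := fs.Int("colloc-flags-with-span", 0, "span for co-occurrence flags (0 = disabled)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *span < 0 {
		return 0, fmt.Errorf("span must not be negative, got %d", *span)
	}
	return *span, nil
}

// passes lists the vertical file passes needed for a given span.
func passes(span int) []string {
	if span > 0 {
		return []string{"traditional collocations", "syntactic collocations"}
	}
	return []string{"syntactic collocations"}
}

func main() {
	span, err := parseSpan([]string{"-colloc-flags-with-span", "3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(span, passes(span))
}
```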

Implementation notes:

  1. to store the frequency information (Fxy, Fy, Fx), use a map (see FyTable and CounterTable for inspiration; it may even be possible to reuse them)
  2. there will be no need to keep parentSumTable and childSumTable, as the relationship in traditional collocations is simpler (a word either is or is not within a defined span/window of the other word).
  3. the co-occurrence will be defined for two words iff the "other" word is within the span (-colloc-flags-with-span) of the "main" word (e.g. for a span of 3 we will look 3 words backwards and 3 words forwards)
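
The window-based counting described in these notes could be sketched as follows. This is only an illustration: `pair`, `coocCounts` and `countCooccurrences` are hypothetical names (the existing FyTable and CounterTable are not used here), and the sketch ignores sentence boundaries, which a real import would probably have to respect.

```go
package main

import "fmt"

// pair is an ordered (main word, other word) key.
type pair struct{ X, Y string }

// coocCounts holds the frequency information needed for a collocation score.
// Fy is not stored separately because it can be read from Fx for the other word.
type coocCounts struct {
	Fx  map[string]int // frequency of each word
	Fxy map[pair]int   // how often Y occurs within the span of X
}

// countCooccurrences looks span tokens backwards and span tokens forwards
// from each "main" word and counts every "other" word found there.
func countCooccurrences(tokens []string, span int) coocCounts {
	c := coocCounts{Fx: make(map[string]int), Fxy: make(map[pair]int)}
	for i, x := range tokens {
		c.Fx[x]++
		lo := i - span
		if lo < 0 {
			lo = 0
		}
		hi := i + span
		if hi > len(tokens)-1 {
			hi = len(tokens) - 1
		}
		for j := lo; j <= hi; j++ {
			if j != i {
				c.Fxy[pair{x, tokens[j]}]++
			}
		}
	}
	return c
}

func main() {
	tokens := []string{"a", "b", "c", "d", "a"}
	c := countCooccurrences(tokens, 3)
	fmt.Println(c.Fx["a"], c.Fxy[pair{"a", "b"}])
}
```

Because the window is symmetric, the map contains both (x, y) and (y, x) with equal counts; a real implementation might store only one direction to save memory.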
@tomachalek

The resulting flag information should be part of the *_fcolls table. Or better: we should store not just binary information (is vs. is not a collocation) but a collocation score (logDice).

@tomachalek

The final decision on whether a syntactic word pair should be marked as "faulty" or "suspicious" will be left to the API consumer.
