Kf/eschatology processing #88

Draft · wants to merge 29 commits into base: main
Commits
3561092
Caching embeddings during clustering + fewer clusters by default
KasperFyhn Nov 21, 2024
4763c00
visualizer: prettier and switching double-click and hold ops
KasperFyhn Nov 21, 2024
0f862e7
More caching during clustering
KasperFyhn Nov 21, 2024
6dd1d95
Removing the long tail of entities upfront and reducing cluster sizes
KasperFyhn Nov 21, 2024
f0578d8
Without embedding-based clustering, only label overlap
KasperFyhn Nov 25, 2024
0a916da
Better frequency slider for nodes and edges on visualizer
KasperFyhn Nov 26, 2024
7a85b49
Also filtering edges on dates in visualizer
KasperFyhn Nov 26, 2024
7914226
Trying multiprocess offloading to GPU
KasperFyhn Nov 26, 2024
9588527
First steps towards storing binary files of SpaCy docs
KasperFyhn Nov 26, 2024
1a0f4e7
Trying fastcoref's LingMessCoref instead of AllenNLP which is sluggis…
KasperFyhn Nov 26, 2024
43c4786
Adding fastcoref to dependencies
KasperFyhn Nov 26, 2024
3d9e4b4
Adding a error recovering wrapper of the fastcoref component
KasperFyhn Nov 26, 2024
0f2a0d6
Configurable DocBin size and specified cuda for fastcoref component
KasperFyhn Nov 26, 2024
65b1d9a
Fixing stream collected to list from debugging
KasperFyhn Nov 26, 2024
babfc4b
Output path for pipeline instead of project name
KasperFyhn Nov 27, 2024
47b4f47
Resolving text in safe fastcoref component
KasperFyhn Nov 27, 2024
28782f7
More proper continuation of already processed docs
KasperFyhn Nov 27, 2024
244e18f
fixing bug in SafeFastCoref which led to empty texts
KasperFyhn Nov 27, 2024
1f12382
Union notation instead of | for compatibility with Python 3.9
KasperFyhn Nov 27, 2024
c3804e3
Fixing reading/writing DocBins, fixing safe fastcoref pipe and introd…
KasperFyhn Dec 4, 2024
004a8ed
Deduplicating docs in doc bins
KasperFyhn Dec 4, 2024
c655792
Big hackathon for the visualizer commenced
KasperFyhn Dec 4, 2024
7bb4b7c
Hacking away for visualizer: now showing docs in a very ugly way
KasperFyhn Dec 5, 2024
0ea45b5
Only targeting visualizer files with prettier
KasperFyhn Dec 6, 2024
4a86104
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 6, 2024
a07c2cc
Showing documents with triplets marked up
KasperFyhn Dec 6, 2024
5854373
Introducing database for entities, relations, triplets and documents
KasperFyhn Dec 9, 2024
5c44abf
Fixing failing test after last commit
KasperFyhn Dec 9, 2024
3156970
Changing DB models slightly and optimizing DB population
KasperFyhn Dec 9, 2024
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -21,3 +21,9 @@ repos:
     rev: v0.5.7
     hooks:
       - id: ruff
+
+  - repo: https://github.com/pre-commit/mirrors-prettier
+    rev: v3.1.0
+    hooks:
+      - id: prettier
+        files: "visualizer/.*"
15 changes: 10 additions & 5 deletions config/eschatology.toml
@@ -2,7 +2,7 @@
 language = "en"

 [preprocessing]
-enabled = true
+enabled = false
 doc_type = "csv"

 [preprocessing.extra]
@@ -11,9 +11,14 @@ text_column = "body"
 timestamp_column = "timestamp"

 [docprocessing]
-enabled = true
-batch_size = 5
-prefer_gpu_for_coref = false
+enabled = false
+batch_size = 50
+prefer_gpu_for_coref = true
+n_process = 1

 [corpusprocessing]
-enabled = true
+enabled = false
+
+[databasepopulation]
+enabled = true
+clear_and_write = true
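
As a reading aid, here is a hypothetical snippet (not part of this PR) showing how the new [databasepopulation] section could be read with the `toml` dependency this PR adds to pyproject.toml:

```python
# Hypothetical snippet, not part of this PR: reads the pipeline config shown
# above and checks the new [databasepopulation] section.
import toml

config = toml.load("config/eschatology.toml")
db_cfg = config.get("databasepopulation", {})
if db_cfg.get("enabled", False):
    # clear_and_write = true presumably wipes existing rows before repopulating
    print("Populating database, clear_and_write =", db_cfg.get("clear_and_write", False))
```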
9 changes: 5 additions & 4 deletions config/template.toml
@@ -12,13 +12,14 @@ metadata_fields = ["*"]

 [preprocessing.extra]
 # specific extra arguments for your preprocessor, e.g. context length for tweets or
-# or field specification for CSVs
+# field specification for CSVs

 [docprocessing]
 enabled = true
 batch_size = 25
 continue_from_last = true
 triplet_extraction_method = "multi2oie/prompting"
+n_process = 1 # can be set to 2 or more for multiprocess ofloading to GPU; otherwise might not make sense

 [corpusprocessing]
 enabled = true
@@ -27,6 +28,6 @@ dimensions = 100 # leave out to skip dimensionality reduction
 n_neighbors = 15 # used for dimensionality reduction

 [corpusprocessing.thresholds] # leave out for automatic estimation
-min_cluster_size = 3 # unused if auto_thresholds is true
-min_samples = 3 # unused if auto_thresholds is true
-min_topic_size = 5 # unused if auto_thresholds is true
+min_label_occurrence = 3
+min_cluster_size = 3
+min_samples = 3
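
For context, the batch_size and n_process options in [docprocessing] correspond naturally to spaCy's nlp.pipe() arguments; the wiring below is an assumption for illustration, not code from this PR:

```python
# Illustration only (assumed wiring, not code from this PR): batch_size and
# n_process in [docprocessing] map onto spaCy's nlp.pipe() parameters.
import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document."]
# n_process=1 keeps everything in one process; 2 or more enables multiprocessing,
# which the template comment above says is mainly useful for GPU offloading.
for doc in nlp.pipe(texts, batch_size=25, n_process=1):
    print(doc.text)
```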
2 changes: 1 addition & 1 deletion docs/tutorials/overview.ipynb
@@ -71,7 +71,7 @@
 " assert isinstance(sent._.coref_clusters[0], tuple)\n",
 " assert isinstance(sent._.coref_clusters[0][0], int)\n",
 " assert isinstance(sent._.coref_clusters[0][1], Span)\n",
-" sent._.resolve_coref # get resolved coref"
+" sent._.resolved_text # get resolved coref"
 ]
 },
 {
2 changes: 1 addition & 1 deletion paper/extract_triplets_newspapers.py
@@ -88,7 +88,7 @@ def process_file(

 # Resolve coreference
 coref_docs = nlp_coref.pipe(normalized_article)
-resolved_docs = (d._.resolve_coref for d in coref_docs)
+resolved_docs = (d._.resolved_text for d in coref_docs)

 # Extract relations
 docs = nlp.pipe(resolved_docs)
4 changes: 2 additions & 2 deletions paper/extract_triplets_tweets.py
@@ -99,7 +99,7 @@ def concat_resolve_unconcat_contexts(file_path: str):

 coref_nlp = build_coref_pipeline()
 coref_docs = coref_nlp.pipe(context_tweets)
-resolved_docs = (d._.resolve_coref for d in coref_docs)
+resolved_docs = (d._.resolved_text for d in coref_docs)

 resolved_tweets = (tweet_from_context_text(tweet) for tweet in resolved_docs)
 return resolved_tweets
@@ -240,7 +240,7 @@ def prompt_gpt3(
 for i, batch in enumerate(batch_generator(concatenated_tweets, batch_size)):
 start = time.time()
 coref_docs = coref_nlp.pipe(batch)
-resolved_docs = (d._.resolve_coref for d in coref_docs)
+resolved_docs = (d._.resolved_text for d in coref_docs)
 resolved_target_tweets = (
 tweet_from_context_text(tweet) for tweet in resolved_docs
 )
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -45,7 +45,9 @@ dependencies = [
 "sentence-transformers",
 "stop-words",
 "bs4",
-"toml"
+"toml",
+"fastcoref",
+"sqlalchemy"
 ]

 [project.license]
@@ -94,6 +96,7 @@ content-type = "text/markdown"
 "prompt_relation_extraction" = "conspiracies.docprocessing.relationextraction.gptprompting:create_prompt_relation_extraction_component"
 "relation_extractor" = "conspiracies.docprocessing.relationextraction.multi2oie:make_relation_extractor"
 "allennlp_coref" = "conspiracies.docprocessing.coref:create_coref_component"
+"safe_fastcoref" = "conspiracies.docprocessing.coref.safefastcoref:create_safe_fastcoref"
 "heads_extraction" = "conspiracies.docprocessing.headwordextraction:create_headwords_component"


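Assuming the "safe_fastcoref" entry point above is registered as a spaCy factory like its neighbours, usage would presumably look like the following hypothetical sketch (not code from this PR):

```python
# Hypothetical usage sketch, not from this PR: the entry point above appears to
# register "safe_fastcoref" as a spaCy factory, so once the package is installed
# the component should be addable by name like any other spaCy pipe.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("safe_fastcoref")  # error-recovering wrapper around fastcoref
doc = nlp("Alice met Bob. She greeted him.")
# downstream scripts in this PR then read the resolved text from doc._.resolved_text
```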
181 changes: 152 additions & 29 deletions src/conspiracies/corpusprocessing/clustering.py
@@ -1,12 +1,15 @@
from collections import defaultdict
from typing import List, Callable, Any, Hashable, Dict
import math
import os
from collections import defaultdict, Counter
from pathlib import Path
from typing import List, Callable, Any, Hashable, Dict, Union

import networkx
import numpy as np
from hdbscan import HDBSCAN
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from umap import UMAP

from conspiracies.common.modelchoice import ModelChoice
@@ -45,13 +48,17 @@ def __init__(
min_cluster_size: int = 5,
min_samples: int = 3,
embedding_model: str = None,
cache_location: Path = None,
):
self.language = language
self.n_dimensions = n_dimensions
self.n_neighbors = n_neighbors
self.min_cluster_size = min_cluster_size
self.min_samples = min_samples
self._embedding_model = embedding_model
self.cache_location = cache_location
if self.cache_location is not None:
os.makedirs(self.cache_location, exist_ok=True)

def _get_embedding_model(self):
# figure out embedding model if not given explicitly
@@ -97,53 +104,88 @@ def _combine_clusters(

return merged_clusters

def _cluster(
def _cluster_via_embeddings(
self,
fields: List[TripletField],
labels: List[str],
cache_name: str = None,
show_progress: bool = True,
):
model = self._get_embedding_model()
print("Creating embeddings:")
embeddings = model.encode(
[field.text for field in fields],
show_progress_bar=True,
emb_cache = (
Path(self.cache_location, f"embeddings-{cache_name}.npy")
if self.cache_location and cache_name
else None
)
embeddings = StandardScaler().fit_transform(embeddings)
if emb_cache and emb_cache.exists():
print(
"Reusing cached embeddings! Delete cache if this is not supposed to happen.",
)
embeddings = np.load(emb_cache)
else:
model = self._get_embedding_model()

counter = Counter((field for field in labels))
condensed = [
field
for field, count in counter.items()
for _ in range(math.ceil(count / 1000))
]
embeddings = model.encode(
condensed,
normalize_embeddings=True,
show_progress_bar=show_progress,
)
if emb_cache:
np.save(emb_cache, embeddings)

if self.n_dimensions is not None:
print("Reducing embedding space")
reducer = UMAP(n_components=self.n_dimensions, n_neighbors=self.n_neighbors)
embeddings = reducer.fit_transform(embeddings)
reduced_emb_cache = (
Path(
self.cache_location,
f"embeddings-{cache_name}-red{self.n_dimensions}.npy",
)
if self.cache_location and cache_name
else None
)
if reduced_emb_cache and reduced_emb_cache.exists():
print(
"Reusing cached reduced embeddings! Delete cache if this is not supposed to happen.",
)
embeddings = np.load(reduced_emb_cache)
else:
print("Reducing embedding space ...")
reducer = UMAP(
n_components=self.n_dimensions,
n_neighbors=self.n_neighbors,
)
embeddings = reducer.fit_transform(embeddings)
if self.cache_location:
np.save(reduced_emb_cache, embeddings)

print("Clustering ...")
hdbscan_model = HDBSCAN(
min_cluster_size=self.min_cluster_size,
max_cluster_size=self.min_cluster_size
* 10, # somewhat arbitrary, mostly to avoid mega clusters that suck up everything
min_samples=self.min_samples,
)
hdbscan_model.fit(embeddings)

clusters = defaultdict(list)
for field, embedding, label, probability in zip(
fields,
labels,
embeddings,
hdbscan_model.labels_,
hdbscan_model.probabilities_,
):
# skip noise and low confidence
if label == -1 or probability < 0.1:
if label == -1 or probability < 0.5:
continue
clusters[label].append((field, embedding))

merged = self._combine_clusters(
list(clusters.values()),
get_combine_key=lambda t: t[0].text,
get_combine_key=lambda t: t[0],
)

# too risky with false positives from this
# merged = self._combine_clusters(
# merged,
# get_combine_key=lambda t: t[0].head,
# )

# sort by how "prototypical" a member is in the cluster
for cluster in merged:
mean = np.mean(np.stack([t[1] for t in cluster]), axis=0)
@@ -153,11 +195,69 @@ def _cluster(
return [[t[0] for t in cluster] for cluster in merged]
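
(Editor's aside: the cache-or-compute pattern used for embeddings above boils down to the following standalone sketch; the cache path and model name are illustrative assumptions, not values from this PR.)

```python
# Standalone sketch of the embedding cache-or-compute pattern; cache path and
# model name are illustrative assumptions.
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer


def cached_encode(texts, cache_file: Path, model_name: str = "all-MiniLM-L6-v2"):
    if cache_file.exists():
        # reuse previously computed embeddings; delete the file to recompute
        return np.load(cache_file)
    embeddings = SentenceTransformer(model_name).encode(
        texts,
        normalize_embeddings=True,
        show_progress_bar=True,
    )
    np.save(cache_file, embeddings)
    return embeddings
```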

@staticmethod
def _mapping_to_first_member(clusters: List[List[TripletField]]) -> Dict[str, str]:
def _cluster_via_normalization(
labels: List[str],
top: Union[int, float] = 1.0,
restrictive_labels=True,
) -> List[List[str]]:
counter = Counter((label for label in labels))
if isinstance(top, float):
top = int(top * len(counter))

norm_map = {
label: " "
+ label.lower()
+ " " # surrounding spaces avoids matches like evil <-> devil
for label in counter.keys()
}
cluster_map = {
label: []
for label, count in counter.most_common(top)
# FIXME: hack due to lack of NER and lemmas at the time of writing
if not restrictive_labels
or len(label) >= 4
and label[0].isupper()
or len(label.split()) > 1
}

for label in counter.keys():
norm_label = norm_map[label]
matches = [
substring
for substring in cluster_map.keys()
if norm_map[substring] in norm_label
]
if not matches:
continue

best_match = min(
matches,
key=lambda substring: len(norm_map[substring]),
)
if best_match != label:
cluster_map[best_match].append(label)

clusters = [
[main_label] + alt_labels
for main_label, alt_labels in cluster_map.items()
if alt_labels
]
return clusters
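
(Editor's aside: to make the label-overlap clustering above concrete, here is a toy illustration with made-up labels and simplified matching, not the PR's exact code.)

```python
# Toy illustration of label-overlap clustering (simplified, made-up labels):
# frequent labels act as cluster heads, and any label that contains a head as
# a whole-word substring is folded into that head's cluster.
from collections import Counter

labels = ["the Beast", "Beast", "the second Beast", "Babylon", "Babylon the Great"]
counter = Counter(labels)
heads = [label for label, _ in counter.most_common(3)]

clusters = {head: [] for head in heads}
for label in counter:
    matches = [
        head
        for head in heads
        if head != label and f" {head.lower()} " in f" {label.lower()} "
    ]
    if matches:
        # prefer the shortest matching head, mirroring the best_match choice above
        clusters[min(matches, key=len)].append(label)

print({head: alts for head, alts in clusters.items() if alts})
# {'Beast': ['the Beast', 'the second Beast']}
```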

@staticmethod
def _mapping_to_first_member(
clusters: List[List[Union[TripletField, str]]],
) -> Dict[str, str]:
def get_text(member: Union[TripletField, str]):
if isinstance(member, TripletField):
return member.text
else:
return member

return {
member: cluster[0].text
member: get_text(cluster[0])
for cluster in clusters
for member in set(member.text for member in cluster)
for member in set(get_text(member) for member in cluster)
}

def create_mappings(self, triplets: List[Triplet]) -> Mappings:
@@ -166,10 +266,33 @@ def create_mappings(self, triplets: List[Triplet]) -> Mappings:
entities = subjects + objects
predicates = [triplet.predicate for triplet in triplets]

# FIXME: clustering gets way to aggressive for many triplets
# print("Creating mappings for entities")
# entity_clusters = self._cluster(entities, "entities")
# print("Creating mappings for predicates")
# predicate_clusters = self._cluster(predicates, "predicates")

print("Creating mappings for entities")
entity_clusters = self._cluster(entities)
entity_clusters = self._cluster_via_normalization(
[e.text for e in entities],
0.2,
)
entity_clusters = [
sub_cluster
for cluster in tqdm(entity_clusters, desc="Creating sub-clusters")
for sub_cluster in (
self._cluster_via_embeddings(cluster, show_progress=False)
if len(cluster) > 10
else [cluster]
)
]

print("Creating mappings for predicates")
predicate_clusters = self._cluster(predicates)
predicate_clusters = self._cluster_via_normalization(
[p.text for p in predicates],
top=0.2,
restrictive_labels=False,
)

mappings = Mappings(
entities=self._mapping_to_first_member(entity_clusters),