add an overview readme to introduce the system
CarsonDavis committed Nov 26, 2024
1 parent 0fcf2ea commit 4c7834f
Showing 2 changed files with 105 additions and 57 deletions.
78 changes: 78 additions & 0 deletions sde_collections/models/README.md
@@ -0,0 +1,78 @@
# URL Pattern Management System

## Overview
This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.

## Core Concepts

### URL States
Content progresses through three states:
- **Dump URLs**: Raw content from initial scraping/indexing
- **Delta URLs**: Work-in-progress changes and modifications
- **Curated URLs**: Production-ready, approved content
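The three states can be pictured as a one-way progression. The following is only an illustrative sketch: the `UrlState` enum, the `NEXT_STATE` table, and the `advance` helper are hypothetical names and are not taken from the codebase.

```python
from enum import Enum


class UrlState(Enum):
    """Hypothetical labels for the three URL states described above."""
    DUMP = "dump"        # raw content from scraping/indexing
    DELTA = "delta"      # work-in-progress changes
    CURATED = "curated"  # production-ready content


# Assumed one-way progression: Dump -> Delta -> Curated.
NEXT_STATE = {
    UrlState.DUMP: UrlState.DELTA,
    UrlState.DELTA: UrlState.CURATED,
}


def advance(state):
    """Return the next state, or the same state if it is already Curated."""
    return NEXT_STATE.get(state, state)


after_dump = advance(UrlState.DUMP)
after_delta = advance(UrlState.DELTA)
after_curated = advance(UrlState.CURATED)
```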

### Pattern Types
- **Include/Exclude Patterns**: Control which URLs are included in collections
  - Include patterns always override exclude patterns
  - Use wildcards for matching multiple URLs

- **Modification Patterns**: Change URL properties
  - Title patterns modify final titles shown in search results
  - Document type patterns affect which tab the URL appears under
  - Division patterns assign URLs to a division within the Science Knowledge Sources
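
The include-over-exclude rule can be sketched with Python's standard `fnmatch` module. This is a minimal illustration, not the system's implementation: the `is_included` helper is hypothetical, and keeping a URL that matches no pattern at all is an assumption made here for the example.

```python
from fnmatch import fnmatchcase


def is_included(url, include_patterns, exclude_patterns):
    """Return True if the URL should stay in the collection."""
    # An include match wins outright, even if an exclude pattern also matches.
    if any(fnmatchcase(url, p) for p in include_patterns):
        return True
    # Assumption: a URL is kept unless some exclude pattern matches it.
    return not any(fnmatchcase(url, p) for p in exclude_patterns)


include = ["*/docs/api/*"]
exclude = ["*/docs/*"]

kept = is_included("https://example.com/docs/api/users", include, exclude)
dropped = is_included("https://example.com/docs/guide", include, exclude)
untouched = is_included("https://example.com/blog/post", include, exclude)
```

In this sketch the first URL matches both patterns and is kept because the include pattern overrides the exclude pattern. The second matches only the exclude pattern and is dropped.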

### Pattern Resolution
The system uses a "smallest set priority" strategy, which resolves conflicts by applying the most specific pattern that matches a URL:
- Multiple patterns can match the same URL
- The pattern that matches the fewest URLs takes precedence
- This applies to title, division, and document type patterns
- More specific patterns therefore override more general ones

## Getting Started

To effectively understand this system, we recommend reading through the documentation in the following order:

1. Begin with the Pattern System Overview to learn the fundamental concepts of how patterns work and interact with URLs
2. Next, explore the URL Lifecycle documentation to understand how content moves through different states
3. The Pattern Resolution documentation will show you how the system handles overlapping patterns
4. Learn how to control which URLs appear in your collection with the Include/Exclude patterns guide
5. Finally, review the Pattern Unapplication Logic to understand how pattern removal affects your URLs

Each section builds upon knowledge from previous sections, providing a comprehensive understanding of the system.

## Documentation

[Pattern System Overview](./README_PATTERN_SYSTEM.md)
- Core concepts and pattern types
- Pattern lifecycle and effects
- Delta URL generation rules
- Working principles (idempotency, separation of concerns)
- Pattern interaction examples

[URL Lifecycle Management](./README_LIFECYCLE.md)
- Migration process (Dump → Delta)
- Promotion process (Delta → Curated)
- Field handling during transitions
- Pattern application timing
- Data integrity considerations

[Pattern Resolution](./README_PATTERN_RESOLUTION.md)
- Smallest set priority mechanism
- URL counting and precedence
- Performance considerations
- Edge case handling
- Implementation details

[URL Inclusion/Exclusion](./README_INCLUSION.md)
- Wildcard pattern matching
- Include/exclude precedence
- Example pattern configurations
- Best practices
- Common pitfalls and solutions

[Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
- Pattern removal handling
- Delta management during unapplication
- Manual change preservation
- Cleanup procedures
- Edge case handling
84 changes: 27 additions & 57 deletions sde_collections/models/README_PATTERN_RESOLUTION.md
@@ -1,32 +1,16 @@
# URL Pattern Application Strategies
# Pattern Resolution System

## Strategy 1: Exclusive Patterns
## Overview
The pattern system uses a "smallest set priority" strategy for resolving conflicts between overlapping patterns. This applies to title patterns, division patterns, and document type patterns. The pattern that matches the smallest set of URLs takes precedence.

Patterns have exclusive ownership of the URLs they match. The system prevents the creation of overlapping patterns.
## How It Works

Example:
```
Pattern A: */docs/* # Matches 100 URLs
Pattern B: */docs/api/* # Rejected - overlaps with Pattern A
Pattern C: */blog/* # Accepted - no overlap
```

Benefits:
- Clear ownership
- Predictable effects
- Simple conflict resolution
- Easy to debug

Drawbacks:
- Less flexible
- May require many specific patterns
- May need pattern deletion/recreation to modify rules

## Strategy 2: Smallest Set Priority
When multiple patterns match a URL, the system:
1. Counts how many total URLs each pattern matches
2. Compares the counts
3. Applies the pattern that matches the fewest URLs
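
The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Django implementation: `resolve` is a hypothetical helper, patterns are modelled as `(glob, value)` pairs, wildcard matching uses the standard `fnmatch` module, and ties go to whichever pattern is listed first (the source does not say how ties are handled).

```python
from fnmatch import fnmatchcase


def resolve(url, patterns, all_urls):
    """Return the value of the matching pattern that covers the fewest URLs."""
    candidates = []
    for glob, value in patterns:
        if fnmatchcase(url, glob):
            # Step 1: count every URL in the collection this pattern matches.
            count = sum(fnmatchcase(u, glob) for u in all_urls)
            candidates.append((count, value))
    if not candidates:
        return None  # no pattern matches this URL
    # Steps 2 and 3: compare the counts and apply the smallest set.
    return min(candidates, key=lambda c: c[0])[1]


# Hypothetical collection and title patterns.
urls = ["/docs/intro", "/docs/guide", "/docs/api/users", "/docs/api/teams"]
title_patterns = [
    ("*/docs/*", "Documentation"),      # matches all 4 URLs
    ("*/docs/api/*", "API Reference"),  # matches 2 URLs
]

api_title = resolve("/docs/api/users", title_patterns, urls)
intro_title = resolve("/docs/intro", title_patterns, urls)
blog_title = resolve("/blog/post", title_patterns, urls)
```

Because the counts depend on the current contents of the collection, the winning pattern for a URL can change as URLs are added or removed.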

Multiple patterns can match the same URLs. The pattern affecting the smallest URL set takes precedence.

Example:
### Example
```
Pattern A: */docs/* # Matches 100 URLs
Pattern B: */docs/api/* # Matches 20 URLs
@@ -37,42 +21,28 @@ For URL "/docs/api/v2/users":
- Pattern C wins (5 URLs < 20 URLs < 100 URLs)
```

Benefits:
- More flexible rule creation
- Natural handling of specificity

Drawbacks:
- Complex precedence rules
- Pattern effects can change as URL sets grow
- Harder to predict/debug
- Performance impact from URL set size calculations

## Implementation Notes
## Pattern Types and Resolution

Strategy 1:
### Title Patterns
```python
def save(self, *args, **kwargs):
    # Check for overlapping patterns
    overlapping = self.get_matching_delta_urls().filter(
        deltapatterns__isnull=False
    ).exists()
    if overlapping:
        raise ValidationError("Pattern would overlap existing pattern")
    super().save(*args, **kwargs)
# More specific title pattern takes precedence
Pattern A: */docs/* → title="Documentation" # 100 URLs
Pattern B: */docs/api/* → title="API Reference" # 20 URLs
Result: URL gets title "API Reference"
```

Strategy 2:
### Division Patterns
```python
def apply(self):
    matching_urls = self.get_matching_delta_urls()
    my_url_count = matching_urls.count()

    # Only apply if this pattern matches fewer URLs than other matching patterns
    for url in matching_urls:
        other_patterns_min_count = url.deltapatterns.annotate(
            url_count=Count('delta_urls')
        ).aggregate(Min('url_count'))['url_count__min'] or float('inf')
# More specific division assignment wins
Pattern A: *.pdf → division="GENERAL" # 500 URLs
Pattern B: */specs/*.pdf → division="ENGINEERING" # 50 URLs
Result: URL gets division "ENGINEERING"
```

        if my_url_count <= other_patterns_min_count:
            self.apply_to_url(url)
### Document Type Patterns
```python
# Most specific document type classification applies
Pattern A: */docs/* → type="DOCUMENTATION" # 200 URLs
Pattern B: */docs/data/* → type="DATA" # 30 URLs
Result: URL gets type "DATA"
```
