add an overview readme to introduce the system

NASA-IMPACT · Nov 26, 2024 · 4c7834f
1 parent 0fcf2ea
commit 4c7834f

Showing 2 changed files with 105 additions and 57 deletions.
diff --git a/sde_collections/models/README.md b/sde_collections/models/README.md
@@ -0,0 +1,78 @@
+# URL Pattern Management System
+
+## Overview
+This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.
+
+## Core Concepts
+
+### URL States
+Content progresses through three states:
+- **Dump URLs**: Raw content from initial scraping/indexing
+- **Delta URLs**: Work-in-progress changes and modifications
+- **Curated URLs**: Production-ready, approved content
+
+### Pattern Types
+- **Include/Exclude Patterns**: Control which URLs are included in collections
+  - Include patterns always override exclude patterns
+  - Use wildcards for matching multiple URLs
+
+- **Modification Patterns**: Change URL properties
+  - Title patterns modify final titles shown in search results
+  - Document type patterns affect which tab the URL appears under
+  - Division patterns assign URLs to a division within the Science Knowledge Sources
+
+### Pattern Resolution
+The system uses a "smallest set priority" strategy: when several patterns match the same URL, it applies the most specific one, meaning the pattern that matches the fewest URLs overall:
+- Multiple patterns can match the same URL
+- The pattern matching the fewest URLs takes precedence
+- Applies to title, division, and document type patterns
+- More specific patterns naturally override general ones
+
+## Getting Started
+
+To effectively understand this system, we recommend reading through the documentation in the following order:
+
+1. Begin with the Pattern System Overview to learn the fundamental concepts of how patterns work and interact with URLs
+2. Next, explore the URL Lifecycle documentation to understand how content moves through different states
+3. The Pattern Resolution documentation will show you how the system handles overlapping patterns
+4. Learn how to control which URLs appear in your collection with the Include/Exclude patterns guide
+5. Finally, review the Pattern Unapplication Logic to understand how pattern removal affects your URLs
+
+Each section builds upon knowledge from previous sections, providing a comprehensive understanding of the system.
+
+## Documentation
+
+[Pattern System Overview](./README_PATTERN_SYSTEM.md)
+- Core concepts and pattern types
+- Pattern lifecycle and effects
+- Delta URL generation rules
+- Working principles (idempotency, separation of concerns)
+- Pattern interaction examples
+
+[URL Lifecycle Management](./README_LIFECYCLE.md)
+- Migration process (Dump → Delta)
+- Promotion process (Delta → Curated)
+- Field handling during transitions
+- Pattern application timing
+- Data integrity considerations
+
+[Pattern Resolution](./README_PATTERN_RESOLUTION.md)
+- Smallest set priority mechanism
+- URL counting and precedence
+- Performance considerations
+- Edge case handling
+- Implementation details
+
+[URL Inclusion/Exclusion](./README_INCLUSION.md)
+- Wildcard pattern matching
+- Include/exclude precedence
+- Example pattern configurations
+- Best practices
+- Common pitfalls and solutions
+
+[Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
+- Pattern removal handling
+- Delta management during unapplication
+- Manual change preservation
+- Cleanup procedures
+- Edge case handling
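
Note on the Include/Exclude rule in the README above: the following is a minimal sketch of "include patterns always override exclude patterns" with wildcard matching. It is not taken from the repository. The function name `is_included`, the sample URLs, and the use of `fnmatch`-style wildcards are assumptions made for illustration, and the project's own matching may differ.

```python
from fnmatch import fnmatchcase


def is_included(url, include_patterns, exclude_patterns):
    """Return True if `url` stays in the collection (hypothetical helper).

    Assumed rule, paraphrasing the README: a URL is kept unless an exclude
    pattern matches it, and a matching include pattern overrides any exclusion.
    """
    # An include match wins outright, regardless of exclude patterns.
    if any(fnmatchcase(url, pattern) for pattern in include_patterns):
        return True
    # Otherwise the URL is kept only if no exclude pattern matches it.
    return not any(fnmatchcase(url, pattern) for pattern in exclude_patterns)


# Illustrative data only; these URLs and patterns are made up.
urls = [
    "https://example.com/docs/index.html",
    "https://example.com/docs/drafts/wip.html",
    "https://example.com/docs/drafts/release-notes.html",
]
exclude = ["*/docs/drafts/*"]
include = ["*/release-notes.html"]

kept = [u for u in urls if is_included(u, include, exclude)]
```

Here `kept` retains the index page (no exclude pattern matches it) and the release notes (excluded by the drafts pattern, then re-included), while the work-in-progress draft is dropped.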
diff --git a/sde_collections/models/README_PATTERN_RESOLUTION.md b/sde_collections/models/README_PATTERN_RESOLUTION.md
@@ -1,32 +1,16 @@
-# URL Pattern Application Strategies
+# Pattern Resolution System
 
-## Strategy 1: Exclusive Patterns
+## Overview
+The pattern system uses a "smallest set priority" strategy for resolving conflicts between overlapping patterns. This applies to title patterns, division patterns, and document type patterns. The pattern that matches the smallest set of URLs takes precedence.
 
-Patterns have exclusive ownership of URLs they match. System prevents creation of overlapping patterns.
+## How It Works
 
-Example:
-```
-Pattern A: */docs/*          # Matches 100 URLs
-Pattern B: */docs/api/*      # Rejected - overlaps with Pattern A
-Pattern C: */blog/*          # Accepted - no overlap
-```
-
-Benefits:
-- Clear ownership
-- Predictable effects
-- Simple conflict resolution
-- Easy to debug
-
-Drawbacks:
-- Less flexible
-- May require many specific patterns
-- May need pattern deletion/recreation to modify rules
-
-## Strategy 2: Smallest Set Priority
+When multiple patterns match a URL, the system:
+1. Counts how many total URLs each pattern matches
+2. Compares the counts
+3. Applies the pattern that matches the fewest URLs
 
-Multiple patterns can match same URLs. Pattern affecting smallest URL set takes precedence.
-
-Example:
+### Example
 ```
 Pattern A: */docs/*          # Matches 100 URLs
 Pattern B: */docs/api/*      # Matches 20 URLs
@@ -37,42 +21,28 @@ For URL "/docs/api/v2/users":
 - Pattern C wins (5 URLs < 20 URLs < 100 URLs)
 ```
 
-Benefits:
-- More flexible rule creation
-- Natural handling of specificity
-
-Drawbacks:
-- Complex precedence rules
-- Pattern effects can change as URL sets grow
-- Harder to predict/debug
-- Performance impact from URL set size calculations
-
-## Implementation Notes
+## Pattern Types and Resolution
 
-Strategy 1:
+### Title Patterns
 ```python
-def save(self, *args, **kwargs):
-    # Check for overlapping patterns
-    overlapping = self.get_matching_delta_urls().filter(
-        deltapatterns__isnull=False
-    ).exists()
-    if overlapping:
-        raise ValidationError("Pattern would overlap existing pattern")
-    super().save(*args, **kwargs)
+# More specific title pattern takes precedence
+Pattern A: */docs/* → title="Documentation"           # 100 URLs
+Pattern B: */docs/api/* → title="API Reference"       # 20 URLs
+Result: URL gets title "API Reference"
 ```
 
-Strategy 2:
+### Division Patterns
 ```python
-def apply(self):
-    matching_urls = self.get_matching_delta_urls()
-    my_url_count = matching_urls.count()
-
-    # Only apply if this pattern matches fewer URLs than other matching patterns
-    for url in matching_urls:
-        other_patterns_min_count = url.deltapatterns.annotate(
-            url_count=Count('delta_urls')
-        ).aggregate(Min('url_count'))['url_count__min'] or float('inf')
+# More specific division assignment wins
+Pattern A: *.pdf → division="GENERAL"                 # 500 URLs
+Pattern B: */specs/*.pdf → division="ENGINEERING"     # 50 URLs
+Result: URL gets division "ENGINEERING"
+```
 
-        if my_url_count <= other_patterns_min_count:
-            self.apply_to_url(url)
+### Document Type Patterns
+```python
+# Most specific document type classification applies
+Pattern A: */docs/* → type="DOCUMENTATION"            # 200 URLs
+Pattern B: */docs/data/* → type="DATA"                # 30 URLs
+Result: URL gets type "DATA"
 ```
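
Note on the resolution steps in this file: the count, compare, and apply sequence can be sketched in plain Python as below. This is an illustration under stated assumptions, not the project's Django implementation. The `resolve` function, the in-memory URL list, and the `fnmatch`-style wildcards are hypothetical, and tie handling (the first pattern seen is kept) is a choice made here because the README does not specify one.

```python
from fnmatch import fnmatchcase


def resolve(url, patterns, all_urls):
    """Return the value of the matching pattern that matches the fewest URLs.

    `patterns` maps a wildcard string to the value it assigns (for example a
    title). Returns None when no pattern matches `url`. Hypothetical helper.
    """
    best = None  # (match_count, value) of the most specific pattern so far
    for pattern, value in patterns.items():
        if not fnmatchcase(url, pattern):
            continue
        # Step 1: count how many URLs in total this pattern matches.
        count = sum(fnmatchcase(candidate, pattern) for candidate in all_urls)
        # Steps 2 and 3: keep the pattern with the strictly smallest count.
        if best is None or count < best[0]:
            best = (count, value)
    return None if best is None else best[1]


# Illustrative data only, scaled down from the README's title example.
all_urls = [
    "/docs/intro",
    "/docs/guide",
    "/docs/api/users",
    "/docs/api/auth",
    "/blog/news",
]
title_patterns = {
    "*/docs/*": "Documentation",      # matches 4 of the URLs above
    "*/docs/api/*": "API Reference",  # matches 2 of the URLs above
}

api_title = resolve("/docs/api/users", title_patterns, all_urls)
intro_title = resolve("/docs/intro", title_patterns, all_urls)
blog_title = resolve("/blog/news", title_patterns, all_urls)
```

With this data `api_title` is `"API Reference"`, `intro_title` is `"Documentation"`, and `blog_title` is `None`. Because counts are computed against the current URL set, the winning pattern can change as that set grows.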