CLDR-17566 conversion process scripts (#3826)

unicode-org · Jul 11, 2024 · 79a2c4f
1 parent 9af283d
commit 79a2c4f
Showing 3 changed files with 372 additions and 0 deletions.
diff --git a/tools/scripts/web/conversion_scripts/README.md b/tools/scripts/web/conversion_scripts/README.md
@@ -0,0 +1,84 @@
+# Scripts to help with CLDR → Markdown Conversion
+
+Part of the [CLDR to Markdown Conversion Process](https://docs.google.com/document/d/1NoQX0zqSYqU4CUuNijTWKQaphE4SCuHl6Bej2C4mb58/edit?usp=sharing), aiming to automate steps 1-3.
+
+NOTE: This does not eliminate all manual work; images, tables, and general review are still required.
+
+## File 1: cleanup.py
+
+Objective: this file corrects some of the common mistakes that appear when running an HTML-to-Markdown converter on the Google Sites CLDR site. The list is not comprehensive and mistakes can remain, but it fixes the errors seen most consistently, particularly with the specific Markdown converter used in pullFromCLDR.py. Most of the adjustments use regular expressions to find and replace specific text. The functions are as follows:
+
+### Link Correction
+
+- Removing redundant links, e.g. \(https://www.example.com)[https://www.example.com] → https://www.example.com
+- Correcting relative links, e.g. \(index)[/index] → \(index)[https://cldr.unicode.org/index]
+- Correcting google redirect links, e.g. \(people)[http://www.google.com/url?q=http%3A%2F%2Fcldr-smoke.unicode.org%2Fsmoketest%2Fv%23%2FUSER%2FPeople%2F20a49c6ad428d880&sa=D&sntz=1&usg=AOvVaw38fQLnn3h6kmmWDHk9xNEm] → \(people)[https://cldr-smoke.unicode.org/cldr-apps/v#/fr/People/20a49c6ad428d880]
+- Correcting regular redirect links
+
+### Common Formatting Issues
+
+- Bullet points and numbered lists have extra spaces after them
+- Bullet points and numbered lists have extra lines between them
+- Link headings get put in with headings and need to be removed
+
+### Project-specific additions
+
+- Every page has --- title: PAGE TITLE --- at the top of the markdown file
+- Every page has the Unicode copyright "\!\[Unicode copyright](https://www.unicode.org/img/hb_notice.gif)" at the bottom of the markdown file
+
+## File 2: pullFromCLDR.py
+
+Objective: this file is used alongside cleanup.py to automate pulling HTML and text from a given CLDR page. It uses libraries to retrieve the HTML and the plain text of the page, convert the HTML into Markdown, clean the Markdown using cleanup.py, and create the .md file and the temporary .txt file in the CLDR site location. There are a few things to note:
+
+- The nav bar and header are not relevant to each page for this conversion process, so only the HTML within \<div role="main" ... > is pulled from the page
+- To convert the HTML into raw text, the script parses the HTML, then separates relevant tags with newlines so the text appears as it does when copy/pasted from the page.
+- This will only work with "https://cldr.unicode.org" pages unless line 13 of the file (the `url.replace` call) is modified
+
+## Usage
+
+### Installation
+
+To run this code, you must have Python 3 installed. You also need to install the following Python libraries:
+
+- BeautifulSoup (from `bs4`)
+- markdownify
+- requests
+
+You can install them using pip:
+
+```bash
+pip install beautifulsoup4 markdownify requests
+```
+
+### Constants
+
+Line 8 of cleanup.py should contain the URL that will be prepended to all relative links (always https://cldr.unicode.org):
+```
+#head to place at start of all relative links
+RELATIVE_LINK_HEAD = "https://cldr.unicode.org"
+```
+
+Line 8 of pullFromCLDR.py should contain the local path of your cloned CLDR site; this is where the files will be stored:
+```
+#LOCAL LOCATION OF CLDR
+CLDR_SITE_LOCATION = "DIRECTORY TO CLDR LOCATION/docs/site"
+```
+
+### Running
+
+Before running, ensure that the folders matching the path of the page you are converting exist within your CLDR site directory, and that the site directory contains a folder named TEMP-TEXT-FILES.
+
+Run with:
+```
+python3 pullFromCLDR.py
+```
+
+You will then be prompted to enter the URL of the page you are trying to convert, after which the script will run.
+
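+If you would like to run the unit tests for cleanup.py, or use any of its functions individually, run the following (the script first calls fullCleanup on a local testing.md, so that file must exist):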
+```
+python3 cleanup.py
+```
+
+
+
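
Review note on the README's Google redirect example: the step can be checked offline with the standard library alone. The sketch below is an illustration, not part of the commit; the helper name `unwrap_google_redirect` is hypothetical, and it only mirrors the `q`-parameter lookup that `convertLink` performs. The decoded value is the `smoketest` URL, which differs from the target shown in the README, so that target may include a further correction beyond decoding.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_redirect(url):
    """Return the target of a google.com/url redirect, or url unchanged.

    Hypothetical helper for illustration; parse_qs percent-decodes the value.
    """
    query = parse_qs(urlparse(url).query)
    return query["q"][0] if "q" in query else url


# The wrapped link from the README example, split for readability.
wrapped = ("http://www.google.com/url?q=http%3A%2F%2Fcldr-smoke.unicode.org"
           "%2Fsmoketest%2Fv%23%2FUSER%2FPeople%2F20a49c6ad428d880"
           "&sa=D&sntz=1&usg=AOvVaw38fQLnn3h6kmmWDHk9xNEm")

print(unwrap_google_redirect(wrapped))
# prints http://cldr-smoke.unicode.org/smoketest/v#/USER/People/20a49c6ad428d880
```
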
diff --git a/tools/scripts/web/conversion_scripts/cleanup.py b/tools/scripts/web/conversion_scripts/cleanup.py
@@ -0,0 +1,237 @@
+import re
+import requests
+import urllib.parse
+import unittest
+from unittest.mock import patch
+
+#head to place at start of all relative links
+RELATIVE_LINK_HEAD = "https://cldr.unicode.org"
+
+#sometimes the html --> md conversion puts extra spaces between bullets
+def fixBullets(content):
+    #remove extra spaces after dash in bullet points
+    content = re.sub(r'-\s{3}', '- ', content)
+    #remove extra space after numbered bullet points
+    content = re.sub(r'(\d+\.)\s{2}', r'\1 ', content)
+    #process lines for list handling
+    processed_lines = []
+    in_list = False
+    for line in content.splitlines():
+        if re.match(r'^\s*[-\d]', line):
+            #check if the current line is part of a list
+            in_list = True
+        elif in_list and not line.strip():
+            #skip empty lines within lists
+            continue
+        else:
+            in_list = False
+        processed_lines.append(line)
+    processed_content = '\n'.join(processed_lines)
+
+    return processed_content
+
+#html-->md conversion puts link headings into md and messes up titles
+def fixTitles(content):
+    #link headings regex
+    pattern = re.compile(r'(#+)\s*\n*\[\n*\]\(#.*\)\n(.*)\n*')
+
+    #replace matched groups
+    def replaceUnwanted(match):
+        heading_level = match.group(1)  #heading level (ex. ##)
+        title_text = match.group(2).strip()  #capture and strip the title text
+        return f"{heading_level} {title_text}"  #return the formatted heading and title on the same line
+
+    # Replace the unwanted text using the defined pattern and function
+    processed_content = re.sub(pattern, replaceUnwanted, content)
+    return processed_content
+
+# add title at top and unicode copyright at bottom
+def addHeaderAndFooter(content):
+    #get title from top of md file
+    title_match = re.search(r'(?<=#\s).*', content)
+    if title_match:
+        title = title_match.group(0).strip()
+    else:
+        title = "Default Title"  #default if no title could be found
+
+    #header
+    header = f"---\ntitle: {title}\n---\n"
+    #footer
+    footer = "\n![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)\n"
+
+    #look for an existing title in the YAML front matter and an existing copyright footer
+    title_exists = re.search(r'^---\n.*title:.*\n---', content, re.MULTILINE)
+    footer_exists = footer.strip() in content
+
+    #add header
+    if not title_exists:
+        content = header + content
+
+    #add footer
+    if not footer_exists:
+        content = content + footer
+
+    return content
+
+#html-->md sometimes produces double bullets on indented lists
+def fixIndentedBullets(content):
+    #regex pattern to match the double hyphen bullets
+    pattern = re.compile(r'^-\s-\s(.*)', re.MULTILINE)
+
+    #split into lines
+    lines = content.split('\n')
+
+    #normalize bullets
+    normalized_lines = []
+    in_list = False
+
+    for line in lines:
+        #lines with double hyphens
+        match = pattern.match(line)
+        if match:
+            #normalize the double hyphen bullet
+            bullet_point = match.group(1)
+            normalized_lines.append(f'- {bullet_point.strip()}')
+            in_list = True
+        elif in_list and re.match(r'^\s*-\s', line):
+            #remove indentation from following bullets in the same list
+            normalized_lines.append(line.strip())
+        else:
+            normalized_lines.append(line)
+            in_list = False
+
+    #join back into a single string
+    processed_content = '\n'.join(normalized_lines)
+    return processed_content
+
+#links on text that is already a link
+def removeRedundantLinks(content):
+    #(link)[link] regex pattern
+    link_pattern = re.compile(r'\((https?:\/\/[^\s\)]+)\)\[\1\]')
+
+    #function to process unwanted links
+    def replace_link(match):
+        return match.group(1)  #return only the first URL
+
+    #replace the links
+    processed_content = re.sub(link_pattern, replace_link, content)
+    return processed_content
+
+#process links, google redirects, normal redirects, and relative links (takes in a url)
+def convertLink(url):
+    #relative links
+    if url.startswith("/"):
+        return RELATIVE_LINK_HEAD + url
+    #google redirect links
+    elif "www.google.com/url" in url:
+        parsed_url = urllib.parse.urlparse(url)
+        query_params = urllib.parse.parse_qs(parsed_url.query)
+        if 'q' in query_params:
+            return query_params['q'][0]
+        return url
+    #redirects
+    else:
+        try:
+            response = requests.get(url)
+            return response.url
+        except requests.RequestException as e:
+            print(f"Error following redirects for {url}: {e}")
+            return url
+
+#finds all links and runs them through convertLink
+def process_links(content):
+    #regex pattern for md links
+    pattern = re.compile(r'\[(.*?)\]\((.*?)\)')
+
+    #replace each link
+    def replace_link(match):
+        text = match.group(1)
+        url = match.group(2)
+        new_url = convertLink(url)
+        return f'[{text}]({new_url})'
+
+    return pattern.sub(replace_link, content)
+
+#given a file path to an md file, run it through every cleanup function and write into sample.md
+def fullCleanup(file_path):
+    with open(file_path, 'r') as file:
+        content = file.read()  # Read entire file as a string
+    content = addHeaderAndFooter(content)
+    content = fixTitles(content)
+    content = fixBullets(content)
+    content = removeRedundantLinks(content)
+    content = fixIndentedBullets(content)
+    content = process_links(content)
+    with open("sample.md", 'w') as file:
+        file.write(content)
+
+#given a md string, run through every cleanup function and return result
+def fullCleanupString(markdown):
+    content = addHeaderAndFooter(markdown)
+    content = fixTitles(content)
+    content = fixBullets(content)
+    content = removeRedundantLinks(content)
+    content = fixIndentedBullets(content)
+    content = process_links(content)
+    return content
+
+
+#TESTS
+class TestMarkdownLinkProcessing(unittest.TestCase):
+    def test_remove_redundant_links(self):
+        #standard use cases
+        markdown_content1 = '''
+        redundant link (https://mail.google.com/mail/u/1/#inbox)[https://mail.google.com/mail/u/1/#inbox].
+        not redundant link [example](https://www.example.com).
+        '''
+        expected_output1 = '''
+        redundant link https://mail.google.com/mail/u/1/#inbox.
+        not redundant link [example](https://www.example.com).
+        '''
+        self.assertEqual(removeRedundantLinks(markdown_content1), expected_output1)
+
+        #edge cases:
+        #If the link does not start with http:// or https:// it will not be picked up as a link
+        #if the two links are different, it does not get corrected
+        markdown_content2 = '''
+        not link [www.example.com](www.example.com).
+        Different links (https://mail.google.com/mail/u/1/#inbox)[https://emojipedia.org/japanese-symbol-for-beginner].
+        '''
+        expected_output2 = '''
+        not link [www.example.com](www.example.com).
+        Different links (https://mail.google.com/mail/u/1/#inbox)[https://emojipedia.org/japanese-symbol-for-beginner].
+        '''
+        self.assertEqual(removeRedundantLinks(markdown_content2), expected_output2)
+
+    @patch('requests.get')
+    def test_replace_links(self, mock_get):
+        #mock responses for the requests.get call made in convertLink
+        def mock_get_response(url):
+            class MockResponse:
+                def __init__(self, url):
+                    self.url = url
+            if url == 'http://www.google.com/url?q=http%3A%2F%2Fwww.typolexikon.de%2F&sa=D&sntz=1&usg=AOvVaw3SSbqyjrSIq8enzBt6Gltw':
+                return MockResponse('http://www.typolexikon.de/')
+            elif url == 'http://www.example.com/':
+                return MockResponse('http://www.example.com/')
+            return MockResponse(url)
+
+        mock_get.side_effect = mock_get_response
+
+        #standard use cases
+        markdown_content1 = '''
+        relative link [page](/relative-page).
+        Google redirect link [typolexikon.de](http://www.google.com/url?q=http%3A%2F%2Fwww.typolexikon.de%2F&sa=D&sntz=1&usg=AOvVaw3SSbqyjrSIq8enzBt6Gltw).
+        normal link [example.com](http://www.example.com/).
+        '''
+        expected_output1 = '''
+        relative link [page](https://cldr.unicode.org/relative-page).
+        Google redirect link [typolexikon.de](http://www.typolexikon.de/).
+        normal link [example.com](http://www.example.com/).
+        '''
+        cleaned_content = removeRedundantLinks(markdown_content1)
+        self.assertEqual(process_links(cleaned_content), expected_output1)
+
+if __name__ == '__main__':
+    fullCleanup("testing.md")
+    unittest.main()
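
Review note on `fixBullets`: as written, the loop appears to drop every blank line that follows a list line, including the one separating the last item from the next paragraph, which Markdown renderers may then treat as a continuation of that item. The sketch below is one possible variant under that reading, not the commit's code; the names `tighten_lists` and `LIST_ITEM` are hypothetical. It removes a blank line only when the nearest lines on both sides are list items.

```python
import re

# A list item: optional indent, then "-" or "N.", then whitespace.
LIST_ITEM = re.compile(r'^\s*(?:-|\d+\.)\s')


def tighten_lists(content):
    """Drop blank lines that sit between two list items; keep all others."""
    lines = content.splitlines()
    kept = []
    for i, line in enumerate(lines):
        if not line.strip():
            # Last line we kept (blank lines between items were skipped).
            prev_is_item = bool(kept) and LIST_ITEM.match(kept[-1]) is not None
            # Next non-blank line after this one, if any.
            nxt = next((l for l in lines[i + 1:] if l.strip()), None)
            next_is_item = nxt is not None and LIST_ITEM.match(nxt) is not None
            if prev_is_item and next_is_item:
                continue
        kept.append(line)
    return '\n'.join(kept)


print(tighten_lists("- a\n\n- b\n\ntext"))
```

With the input above, the blank line between the two items is removed and the one before `text` is kept. Multi-line list items are not handled; this is only meant to show the lookahead idea.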
diff --git a/tools/scripts/web/conversion_scripts/pullFromCLDR.py b/tools/scripts/web/conversion_scripts/pullFromCLDR.py
@@ -0,0 +1,51 @@
+import requests
+from bs4 import BeautifulSoup
+import markdownify 
+from cleanup import fullCleanupString
+
+
+#LOCAL LOCATION OF CLDR
+CLDR_SITE_LOCATION = "/Users/chrispyle/Documents/GitHub/cldr/docs/site"
+
+#fetch HTML from the website
+url = input("Enter link to convert: ")
+#compute path in cldr using url
+directoryPath = url.replace("https://cldr.unicode.org", "")
+outputMDFile = CLDR_SITE_LOCATION + directoryPath + ".md"
+#compute path for text file using name of page
+outputTextFile = CLDR_SITE_LOCATION + "/TEMP-TEXT-FILES/" + url.rsplit('/', 1)[-1] + ".txt"
+
+#get html content of page
+response = requests.get(url)
+html_content = response.text
+
+#extract html inside <div role="main" ... >
+soup = BeautifulSoup(html_content, 'html.parser')
+main_div = soup.find('div', {'role': 'main'})
+html_inside_main = main_div.encode_contents().decode('utf-8')
+
+#convert html to md with markdownify and settings from conversion doc
+markdown_content = markdownify.markdownify(html_inside_main, heading_style="ATX", bullets="-") 
+#clean md file using cleanup.py
+cleaned_markdown = fullCleanupString(markdown_content)
+
+#parse raw text from site
+textParser = BeautifulSoup(html_inside_main, 'html.parser')
+
+#add newlines to text content for all newline tags
+for block in textParser.find_all(['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'br']):
+    block.append('\n')
+
+#get text content from the parsed HTML
+rawText = textParser.get_text()
+
+#remove unnecessary newlines
+rawText = '\n'.join(line.strip() for line in rawText.splitlines() if line.strip())
+
+#write files to cldr in proper locations
+with open(outputMDFile, 'w', encoding='utf-8') as f:
+    f.write(cleaned_markdown)
+
+with open(outputTextFile, 'w', encoding='utf-8') as f:
+    f.write(rawText)
+
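
Review note on the path computation in pullFromCLDR.py: the script builds both output paths with string replacement, so a URL with a trailing slash would yield a file named `.md` and an empty text-file name. A minimal sketch of a stricter derivation follows, assuming only the standard library; `output_paths` and `CLDR_HOST` are hypothetical names and are not part of the commit.

```python
from pathlib import Path
from urllib.parse import urlparse

CLDR_HOST = "cldr.unicode.org"  # assumed to be the only supported host


def output_paths(url, site_root):
    """Return (markdown_path, text_path) for a CLDR page URL."""
    parsed = urlparse(url)
    if parsed.netloc != CLDR_HOST:
        raise ValueError(f"expected a {CLDR_HOST} URL, got {url!r}")
    # Strip leading and trailing slashes so "/index/" and "/index" agree.
    page_path = parsed.path.strip("/")
    if not page_path:
        raise ValueError("URL has no page path")
    root = Path(site_root)
    md_file = root / (page_path + ".md")
    page_name = page_path.rsplit("/", 1)[-1]
    txt_file = root / "TEMP-TEXT-FILES" / (page_name + ".txt")
    return md_file, txt_file


md_file, txt_file = output_paths(
    "https://cldr.unicode.org/index/downloads/", "/tmp/site")
print(md_file)
print(txt_file)
```

A caller could then run `md_file.parent.mkdir(parents=True, exist_ok=True)` and the same for `txt_file.parent` before writing, which would cover the folder check the README asks for under Running.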