μnging

Yet another small collection of μnging tool shortcuts in Python. It is a small work in progress, composed of the following parts:

Text μnging

A list of very simple text munging steps:

  • ExtraWhite( string )
    • When web scraping, the text can contain extra spaces at the start and end, newline characters, and so on. This function removes them and also handles characters outside the first 127 ASCII codes, with a nice spanishdict workaround.
  • WebCleanText( string )
    • When web scraping, the text can contain extra spaces in the middle, commas, and so on. This function removes them.
  • KeyAndValue( string, symbol string )
    • When web scraping, the data commonly comes as a title followed by the value after ':' or another symbol. This function splits the string into key and value.
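
The three steps above can be sketched roughly as follows. This is not the repository's code: the snake_case names, the regular expressions, and the choice to delete commas outright are all assumptions made for illustration, and the non-ASCII handling mentioned for ExtraWhite is left out.

```python
import re


def extra_white(text: str) -> str:
    """Hypothetical stand-in for ExtraWhite: replace newline/tab runs, trim the ends."""
    return re.sub(r"[\r\n\t]+", " ", text).strip()


def web_clean_text(text: str) -> str:
    """Hypothetical stand-in for WebCleanText: drop commas, collapse inner whitespace."""
    without_commas = text.replace(",", "")  # assumption: commas are deleted, not replaced
    return re.sub(r"\s+", " ", without_commas).strip()


def key_and_value(text: str, symbol: str = ":") -> tuple:
    """Hypothetical stand-in for KeyAndValue: split once at the first `symbol`."""
    key, _, value = text.partition(symbol)
    return key.strip(), value.strip()


if __name__ == "__main__":
    print(extra_white("\n  hola mundo \t\n"))
    print(web_clean_text("uno,   dos,\n tres"))
    print(key_and_value("Precio: 1200", ":"))
```

Splitting with `str.partition` keeps everything after the first separator together, so a value that itself contains ':' (a time such as 10:30, for instance) is not cut a second time.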

Pandas Shortcuts

A list of very simple shortcuts for pandas work:

  • Vectors2Df( list of vectors, list of column names )

    • Converts a list of vectors into a pandas DataFrame. It takes a list of lists and a list of column names, as in Vectors2Df([y,u], c), and returns a pandas DataFrame. Example of use:
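      from munging import Vectors2Df

      y = ['a', 'b', 'c']
      u = ['d', 'e', 'f']
      c = ['g', 2]   # column names

      # Two vectors, two column names
      Vectors2Df([y, u], c)

      # One vector, one column name
      Vectors2Df([u], str(c[0]))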
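
For readers who want to see what such a shortcut might do internally, here is a minimal sketch. It assumes each vector becomes one column (inferred from the example, where two vectors are paired with two names) and that a bare string counts as a single column name. The name `vectors_to_df` is hypothetical and this is not the repository's implementation.

```python
import pandas as pd


def vectors_to_df(vectors, columns):
    """Hypothetical stand-in for Vectors2Df: one vector per column."""
    # Assumption: a single string is treated as one column name.
    if isinstance(columns, str):
        columns = [columns]
    if len(vectors) != len(columns):
        raise ValueError("need exactly one column name per vector")
    # Pair each name with its vector; pandas builds one column per pair.
    return pd.DataFrame(dict(zip(columns, vectors)))


if __name__ == "__main__":
    y = ['a', 'b', 'c']
    u = ['d', 'e', 'f']
    c = ['g', 2]
    print(vectors_to_df([y, u], c))
    print(vectors_to_df([u], str(c[0])))
```

Column names do not have to be strings, which is why the mixed list `['g', 2]` from the example works as-is.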
    

Web Scraping

  • GetMozService()
    • When web scraping, we sometimes need to start a webdriver. This function starts a Mozilla (Firefox) driver, installing it first if it is not present. Example of use:
       from bs4 import BeautifulSoup
       from munging import GetMozService

       url_Base = "https://pagename.com/busqueda/en1/use/"

       # Start the Mozilla driver (installed first if it is not present)
       driver = GetMozService()

       # Load the page and parse its HTML
       driver.get(url_Base)
       pageDTA = driver.page_source
       soup = BeautifulSoup(pageDTA, 'lxml')
       print(soup)