Those who cannot give anything up can change nothing.
You are given coins of different denominations and a total amount of money. Write a function to compute the number of combinations that make up that amount. You may assume that you have an infinite number of each kind of coin.
Example 1:
Input: amount = 5, coins = [1, 2, 5]
Output: 4
Explanation: there are four ways to make up the amount:
5 = 5
5 = 2 + 2 + 1
5 = 2 + 1 + 1 + 1
5 = 1 + 1 + 1 + 1 + 1
Example 2:
Input: amount = 3, coins = [2]
Output: 0
Explanation: the amount 3 cannot be made up using only coins of denomination 2.
Example 3:
Input: amount = 10, coins = [10]
Output: 1
Note:
You can assume that:
- 0 <= amount (the total) <= 5000
- 1 <= coin (denomination) <= 5000
- there are at most 500 kinds of coins
- the answer fits in a signed 32-bit integer
class Solution:
    def change(self, amount: int, coins: list[int]) -> int:
        # dp[x] = number of combinations that sum to x
        dp = [0] * (amount + 1)
        dp[0] = 1  # one way to make 0: use no coins
        # Iterate coins in the outer loop so each combination is counted only once
        # (coin order does not matter) -- the unbounded-knapsack combination count.
        for coin in coins:
            for x in range(coin, amount + 1):
                dp[x] += dp[x - coin]
        return dp[amount]
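A quick check of the solution against the three examples above:

s = Solution()
print(s.change(5, [1, 2, 5]))   # 4
print(s.change(3, [2]))         # 0
print(s.change(10, [10]))       # 1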
Text classification is one of the important tasks in supervised machine learning (ML). It is the process of assigning tags/categories to documents, helping us automatically and quickly structure and analyze text in a cost-effective manner. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labeling, and intent detection.
Related code: NLP-text-classification-model
import pandas as pd
import numpy as np
# for text pre-processing
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
# for model-building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
# bag of words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# for word embedding
import gensim
from gensim.models import Word2Vec
Natural Language Processing with Disaster Tweets
Loading the data set in Kaggle Notebook:
import pandas as pd
df_train = pd.read_csv('../input/nlp-getting-started/train.csv')
df_test = pd.read_csv('../input/nlp-getting-started/test.csv')
- Simple text cleaning processes
  - Removing punctuation, special characters, URLs & hashtags
  - Removing leading, trailing & extra white spaces/tabs
  - Correcting typos and slang; writing abbreviations in their long forms
- Stop-word removal
- Stemming: the process of slicing the end or the beginning of words with the intention of removing affixes (prefix/suffix) -- see the sketch below
- Lemmatization: the process of reducing a word to its base form
import re
import nltk
import string
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# convert to lowercase, strip and remove punctuations
def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.sub(r'<.*?>', '', text)                                  # drop HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)   # drop punctuation
    text = re.sub(r'\[[0-9]*\]', ' ', text)                            # drop bracketed references like [1]
    text = re.sub(r'[^\w\s]', '', text)                                # drop remaining special characters
    text = re.sub(r'\d', ' ', text)                                    # drop digits
    text = re.sub(r'\s+', ' ', text)                                   # collapse whitespace
    return text
# STOPWORD REMOVAL
def stopword(text):
    # keep only tokens that are not English stop words
    a = [i for i in text.split() if i not in stopwords.words('english')]
    return ' '.join(a)
# LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()
# This is a helper function to map NLTK position tags to WordNet tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
# Tokenize the sentence, tag each token, and lemmatize it with the mapped tag
def lemmatizer(text):
    word_pos_tags = nltk.pos_tag(word_tokenize(text))  # get position tags
    a = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    return " ".join(a)
Bag-of-Words (BoW) and word embeddings (here, Word2Vec) are two well-known methods for converting text data into numerical data.
Count vectors: build a vocabulary from a corpus of documents and count how many times each word appears in each document.
Term Frequency-Inverse Document Frequency (TF-IDF): weight each word count by how rare the word is across the corpus, so common words contribute less.
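The splitting code below uses a clean_text column that has to be produced first. A minimal sketch, assuming the raw tweets live in df_train['text'] as in the Kaggle data set, chaining the cleaning functions defined above; it also shows the plain count-vector (BoW) counterpart of the TF-IDF vectorizer used later:

# build the clean_text column by chaining the cleaning steps defined above
df_train['clean_text'] = df_train['text'].apply(
    lambda x: lemmatizer(stopword(preprocess(x))))
# count vectors: the BoW counterpart of TF-IDF
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(df_train['clean_text'])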
import nltk
# SPLITTING THE TRAINING DATASET INTO TRAIN AND TEST
X_train, X_test, y_train, y_test = train_test_split(df_train["clean_text"], df_train["target"],
                                                     test_size=0.2, shuffle=True)
# Word2Vec runs on tokenized sentences
X_train_tok = [nltk.word_tokenize(i) for i in X_train]
X_test_tok = [nltk.word_tokenize(i) for i in X_test]
Bag-of-Words & Word2Vec
# Tf-Idf
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
# building a mean-embedding vectorizer on top of Word2Vec
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # average the vectors of all in-vocabulary words in each document
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# train the Word2Vec model on the tokenized training corpus
df_train['clean_text_tok'] = [nltk.word_tokenize(i) for i in df_train['clean_text']]
model = Word2Vec(df_train['clean_text_tok'], min_count=1)
# map each vocabulary word to its learned vector (gensim 4.x API)
w2v = dict(zip(model.wv.index_to_key, model.wv.vectors))
modelw = MeanEmbeddingVectorizer(w2v)
# converting text to numerical data using Word2Vec
X_train_vectors_w2v = modelw.transform(X_train_tok)
X_test_vectors_w2v = modelw.transform(X_test_tok)
Logistic Regression: we will start with the simplest model, logistic regression. You can easily build a LogisticRegression in scikit-learn with the lines of code below.
#FITTING THE CLASSIFICATION MODEL using Logistic Regression(tf-idf)
lr_tfidf = LogisticRegression(solver='liblinear', C=10, penalty='l2')
lr_tfidf.fit(X_train_vectors_tfidf, y_train)
#Predict y value for test dataset
y_predict = lr_tfidf.predict(X_test_vectors_tfidf)
y_prob = lr_tfidf.predict_proba(X_test_vectors_tfidf)[:, 1]
print(classification_report(y_test, y_predict))
print('Confusion Matrix:', confusion_matrix(y_test, y_predict))
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print('AUC:', roc_auc)
Naive Bayes: a probabilistic classifier that applies Bayes' Theorem, using prior knowledge of conditions that might be related to the outcome to make predictions.
#FITTING THE CLASSIFICATION MODEL using Naive Bayes(tf-idf)
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_vectors_tfidf, y_train)
#Predict y value for test dataset
y_predict = nb_tfidf.predict(X_test_vectors_tfidf)
y_prob = nb_tfidf.predict_proba(X_test_vectors_tfidf)[:, 1]
print(classification_report(y_test, y_predict))
print('Confusion Matrix:', confusion_matrix(y_test, y_predict))
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print('AUC:', roc_auc)
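The Word2Vec features computed earlier (X_train_vectors_w2v / X_test_vectors_w2v) are not used above; a minimal sketch fitting the same logistic-regression classifier on them (reusing the hyperparameters from the TF-IDF model is my choice, not a tuned setting):

# FITTING THE CLASSIFICATION MODEL using Logistic Regression (Word2Vec features)
lr_w2v = LogisticRegression(solver='liblinear', C=10, penalty='l2')
lr_w2v.fit(X_train_vectors_w2v, y_train)
y_predict = lr_w2v.predict(X_test_vectors_w2v)
y_prob = lr_w2v.predict_proba(X_test_vectors_w2v)[:, 1]
print(classification_report(y_test, y_predict))
print('AUC:', roc_auc_score(y_test, y_prob))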
In the client/server architecture, buffers exist to temporarily hold input/output commands and results, and they must not grow too large.
- Input buffer: commands sent by the client are staged here first; the Redis main thread then reads commands from the input buffer and processes them.
- Output buffer: after the Redis main thread finishes processing, it writes the result to the output buffer, from which it is returned to the client.
Causes of input-buffer overflow:
- Writing bigkeys, e.g. writing several collection-type values with millions of elements at once;
- The server processes requests too slowly, e.g. the Redis main thread blocks intermittently and cannot handle normally sent requests in time, so the client's requests pile up in the buffer.
How to check:
CLIENT LIST
id=5 addr=127.0.0.1:50487 fd=9 name= age=4 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=26 qbuf-free=32742 obl=0 oll=0 omem=0 events=r cmd=client
Among the fields returned, three relate to the input buffer:
- cmd: the most recent command executed by the client; in this example it is the CLIENT command.
- qbuf: how much of the input buffer is already in use; here the CLIENT command has used 26 bytes.
- qbuf-free: how much of the input buffer is still unused; here the CLIENT command can still use 32742 bytes.
- The sum of qbuf and qbuf-free is the total buffer size the Redis server has currently allocated for this connected client.
In this example a total of 26 + 32742 = 32768 bytes, i.e. a 32KB buffer, has been allocated.
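A small sketch of running the same check from code with redis-py (the 100 KB alert threshold is an arbitrary example value):

import redis

r = redis.Redis(host='localhost', port=6379)
for client in r.client_list():
    qbuf = int(client.get('qbuf', 0))
    # flag clients whose input-buffer usage looks unusually large
    if qbuf > 100 * 1024:
        print(client['id'], client['addr'], client['cmd'], qbuf)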
The Redis server also has a maxmemory setting; if overall memory usage exceeds it, the memory-eviction mechanism is triggered.
There are two ways to address input-buffer overflow:
- Enlarge the buffer;
- Work on the speed at which data commands are sent and processed.
Option 1 is not feasible for now, because Redis does not expose a parameter for the input buffer; its size is fixed at 1GB by default.
The output buffer consists of two parts:
- a fixed 16KB buffer, used for simple replies such as OK and error messages;
- a dynamically growing buffer, used to stage results of variable size.
Causes of output-buffer overflow:
- returning data for a bigkey;
- running the MONITOR command;
- an unreasonable buffer-size configuration.
We can set the buffer limits via the client-output-buffer-limit configuration item:
- set an upper threshold on the buffer size;
- set an upper threshold on the amount of data continuously written into the output buffer, and on how long the continuous writing may last.
client-output-buffer-limit normal 0 0 0
normal means the setting applies to normal clients;
- the 1st 0 sets the buffer-size limit;
- the 2nd 0 and the 3rd 0 set the continuous-write volume limit and the continuous-write time limit respectively (0 means no limit).
Settings differ by client type:
- for normal client requests, the limits can be left at 0 0 0 (no limit);
- for other client types (e.g. pub/sub or replication clients), set the buffer limits manually according to the workload, as in the sketch below.
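For reference, the limits shipped in the default redis.conf look roughly like this (replica was called slave in older versions); treat the numbers as an example rather than a recommendation:

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60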
General framework for linked-list problems:
# Keep a reference to the original head so the list can be returned later.
# Everything in Python is a reference, so result keeps pointing to the same node.
result = head
while head:
    # Node-level processing goes here, e.g. deleting a node in the middle;
    # extra pointers or parameters can be added as needed.
    if head.val == num:
        # delete the current node by copying the next node's value
        # and skipping over it (assumes head.next is not None here)
        head.val = head.next.val
        head.next = head.next.next
    head = head.next
# Fast & slow pointers: fast advances two steps per loop, slow advances one,
# so slow ends up at the middle of the list when fast reaches the end.
slow = head
fast = head
while fast and fast.next:
    fast = fast.next.next
    slow = slow.next
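A minimal, self-contained run of the delete-by-value framework above (the ListNode class and the build/show helpers are mine, added only to make the sketch runnable):

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def build(values):
    # build a linked list from a Python list
    head = ListNode(values[0])
    cur = head
    for v in values[1:]:
        cur.next = ListNode(v)
        cur = cur.next
    return head

def show(head):
    out = []
    while head:
        out.append(head.val)
        head = head.next
    return out

num = 3
head = build([1, 2, 3, 4, 5])
result = head
while head:
    if head.val == num and head.next:
        head.val = head.next.val
        head.next = head.next.next
    head = head.next
print(show(result))  # [1, 2, 4, 5]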