From 54096cdc0def24256653a4f4a82bf997809c2ec8 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:03:39 +0700 Subject: [PATCH 01/34] Create folder id for Indonesia translation --- id | 1 + 1 file changed, 1 insertion(+) create mode 100644 id diff --git a/id b/id new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/id @@ -0,0 +1 @@ + From f96d857fc22e8b6d0b6e8cf81c74759356570d35 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:04:05 +0700 Subject: [PATCH 02/34] Delete id --- id | 1 - 1 file changed, 1 deletion(-) delete mode 100644 id diff --git a/id b/id deleted file mode 100644 index 8b1378917..000000000 --- a/id +++ /dev/null @@ -1 +0,0 @@ - From 1c1c7ca6ea293d8102058223027161de5c2accd0 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:12:58 +0700 Subject: [PATCH 03/34] Create README.md --- id/README.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 id/README.md diff --git a/id/README.md b/id/README.md new file mode 100644 index 000000000..f719f96ac --- /dev/null +++ b/id/README.md @@ -0,0 +1,5 @@ +# Terjemahan Bahasa Indonesia +Ini adalah transliterasi dari cheatsheet materi pembelajaran [Machine learning](https://stanford.edu/~shervine/teaching/cs-229/) +dan [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) yang dikerjakan oleh [Shervine Amidi]. + +## Semoga bermanfaat. From 75220158f035991db8bf6643f478081e15cc5a4f Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:13:24 +0700 Subject: [PATCH 04/34] Update README.md --- id/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/id/README.md b/id/README.md index f719f96ac..f0bc8ae86 100644 --- a/id/README.md +++ b/id/README.md @@ -1,5 +1,5 @@ # Terjemahan Bahasa Indonesia Ini adalah transliterasi dari cheatsheet materi pembelajaran [Machine learning](https://stanford.edu/~shervine/teaching/cs-229/) -dan [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) yang dikerjakan oleh [Shervine Amidi]. +dan [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) yang dikerjakan oleh [Shervine Amidi](https://stanford.edu/~shervine/). ## Semoga bermanfaat. From 0bdc9d76483aa325f0d2b7d9d05cd946840f1bbb Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:14:27 +0700 Subject: [PATCH 05/34] Update README.md --- id/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/id/README.md b/id/README.md index f0bc8ae86..03a8f670f 100644 --- a/id/README.md +++ b/id/README.md @@ -1,5 +1,4 @@ # Terjemahan Bahasa Indonesia -Ini adalah transliterasi dari cheatsheet materi pembelajaran [Machine learning](https://stanford.edu/~shervine/teaching/cs-229/) -dan [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) yang dikerjakan oleh [Shervine Amidi](https://stanford.edu/~shervine/). +Ini adalah transliterasi catatan ringkas materi pembelajaran [Machine learning](https://stanford.edu/~shervine/teaching/cs-229/) dan [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) dari [Shervine Amidi](https://stanford.edu/~shervine/). ## Semoga bermanfaat. 
From 1f08af9938c0ad842f19031f7ac19e7d54082a0e Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:18:10 +0700 Subject: [PATCH 06/34] Add cheatsheet-deep-learning --- id/cheatsheet-deep-learning.md | 321 +++++++++++++++++++++++++++++++++ 1 file changed, 321 insertions(+) create mode 100644 id/cheatsheet-deep-learning.md diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md new file mode 100644 index 000000000..a5aa3756c --- /dev/null +++ b/id/cheatsheet-deep-learning.md @@ -0,0 +1,321 @@ +**1. Deep Learning cheatsheet** + +⟶ + +
+ +**2. Neural Networks** + +⟶ + +
+ +**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** + +⟶ + +
+ +**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** + +⟶ + +
+ +**5. [Input layer, hidden layer, output layer]** + +⟶ + +
+ +**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** + +⟶ + +
+ +**7. where we note w, b, z the weight, bias and output respectively.** + +⟶ + +
+ +**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** + +⟶ + +
+ +**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** + +⟶ + +
+ +**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ + +
+ +**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** + +⟶ + +
+ +**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** + +⟶ + +
+ +**13. As a result, the weight is updated as follows:** + +⟶ + +
+ +**14. Updating weights ― In a neural network, weights are updated as follows:** + +⟶ + +
+ +**15. Step 1: Take a batch of training data.** + +⟶ + +
+ +**16. Step 2: Perform forward propagation to obtain the corresponding loss.** + +⟶ + +
+ +**17. Step 3: Backpropagate the loss to get the gradients.** + +⟶ + +
+ +**18. Step 4: Use the gradients to update the weights of the network.** + +⟶ + +
+ +**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** + +⟶ + +
+ +**20. Convolutional Neural Networks** + +⟶ + +
+ +**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** + +⟶ + +
+ +**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** + +⟶ + +
+ +**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ + +
+ +**24. Recurrent Neural Networks** + +⟶ + +
+ +**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** + +⟶ + +
+ +**26. [Input gate, forget gate, gate, output gate]** + +⟶ + +
+ +**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** + +⟶ + +
+ +**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** + +⟶ + +
+ +**29. Reinforcement Learning and Control** + +⟶ + +
+ +**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** + +⟶ + +
+ +**31. Definitions** + +⟶ + +
+ +**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** + +⟶ + +
+ +**33. S is the set of states** + +⟶ + +
+ +**34. A is the set of actions** + +⟶ + +
+ +**35. {Psa} are the state transition probabilities for s∈S and a∈A** + +⟶ + +
+ +**36. γ∈[0,1[ is the discount factor** + +⟶ + +
+ +**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** + +⟶ + +
+ +**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** + +⟶ + +
+ +**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** + +⟶ + +
+ +**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** + +⟶ + +
+ +**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** + +⟶ + +
+ +**42. Remark: we note that the optimal policy π∗ for a given state s is such that:** + +⟶ + +
+ +**43. Value iteration algorithm ― The value iteration algorithm is in two steps:** + +⟶ + +
+ +**44. 1) We initialize the value:** + +⟶ + +
+ +**45. 2) We iterate the value based on the values before:** + +⟶ + +
+ +**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** + +⟶ + +
+ +**47. times took action a in state s and got to s′** + +⟶ + +
+ +**48. times took action a in state s** + +⟶ + +
+ +**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** + +⟶ + +
+ +**50. View PDF version on GitHub** + +⟶ + +
+ +**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** + +⟶ + +
+ +**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** + +⟶ + +
+ +**53. [Recurrent Neural Networks, Gates, LSTM]** + +⟶ + +
+ +**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** + +⟶ From e1e4953d816b8121c67645623d2a775ef26fc32e Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:23:04 +0700 Subject: [PATCH 07/34] Rename template/cheatsheet-machine-learning-tips-and-tricks.md to id/cheatsheet-machine-learning-tips-and-tricks.md --- {template => id}/cheatsheet-machine-learning-tips-and-tricks.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {template => id}/cheatsheet-machine-learning-tips-and-tricks.md (100%) diff --git a/template/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md similarity index 100% rename from template/cheatsheet-machine-learning-tips-and-tricks.md rename to id/cheatsheet-machine-learning-tips-and-tricks.md From b828f9622221286d78e9bd5146e2bba1d7557eda Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:23:54 +0700 Subject: [PATCH 08/34] Rename id/cheatsheet-machine-learning-tips-and-tricks.md to template/cheatsheet-machine-learning-tips-and-tricks.md --- {id => template}/cheatsheet-machine-learning-tips-and-tricks.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {id => template}/cheatsheet-machine-learning-tips-and-tricks.md (100%) diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/template/cheatsheet-machine-learning-tips-and-tricks.md similarity index 100% rename from id/cheatsheet-machine-learning-tips-and-tricks.md rename to template/cheatsheet-machine-learning-tips-and-tricks.md From e8141a66155abffd03671ec51256c3d966b977ba Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Fri, 5 Apr 2019 22:50:43 +0700 Subject: [PATCH 09/34] Translating into Bahasa Indonesia --- id/cheatsheet-deep-learning.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index a5aa3756c..1a8852b6b 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -1,78 +1,78 @@ **1. Deep Learning cheatsheet** -⟶ +⟶ **1. Catatan ringkas Deep Learning**
**2. Neural Networks** -⟶ +⟶ **2. Neural Networks**
**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** ⟶ **3. Neural networks merupakan sebuah kelas model yang disusun atas beberapa layer. Jenis neural networks yang umum digunakan antara lain convolutional neural networks (CNN) dan recurrent neural networks (RNN).** <br>
**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:** -⟶ +⟶ **4. Arsitektur - Beberapa istilah yang umum digunakan dalam arsitektur neural network dijelaskan pada gambar di bawah ini**
**5. [Input layer, hidden layer, output layer]** -⟶ +⟶ **5. [Input layer, hidden layer, output layer]**
**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** -⟶ +⟶ **6. Dengan i adalah layer ke-i dari network dan j adalah unit hidden layer ke-j, maka:**
**7. where we note w, b, z the weight, bias and output respectively.** -⟶ +⟶ **7. Catatan: w, b, z adalah weight, bias, dan output.**
**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** ⟶ **8. Fungsi aktivasi - Fungsi aktivasi digunakan di akhir sebuah unit hidden untuk memperkenalkan kompleksitas non-linear pada model. Berikut beberapa yang paling umum digunakan:** <br>
**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]** -⟶ +⟶ **9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
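For illustration, a minimal NumPy sketch of the four activation functions listed above; the 0.01 slope used for Leaky ReLU is a common but arbitrary choice, not something fixed by the cheatsheet.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # like ReLU, but keeps a small slope alpha for negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z))
```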
**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** ⟶ **10. Cross-entropy loss - Dalam konteks neural networks, cross-entropy loss L(z,y) umum digunakan dan didefinisikan sebagai berikut:** <br>
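A minimal sketch of the binary cross-entropy loss, assuming the usual form L(z,y) = -[y log(z) + (1-y) log(1-z)] with z the predicted probability and y the true label; the clipping constant is only there for numerical safety.

```python
import numpy as np

def cross_entropy(z, y, eps=1e-12):
    # z: predicted probability in (0, 1), y: true label in {0, 1}
    z = np.clip(z, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))

print(cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))
```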
**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** -⟶ +⟶**11. Learning rate - Learning rate (Tingkat pembelajaran), sering dinotasikan sebagai α atau η, merupakan fase pembaruan pembobotan. Tingkat pembelajaran dapat diperbaiki atau diubah secara adaptif. Metode yang paling populer saat ini disebut Adam, yang merupakan metode yang dapat menyesuaikan tingkat pembelajaran.
**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** ⟶ **12. Backpropagation - Backpropagation adalah metode untuk memperbarui bobot dalam neural networks dengan memperhitungkan output aktual dan output yang diinginkan. Turunan terhadap bobot w dihitung menggunakan aturan rantai (chain rule) dan memiliki bentuk berikut:** <br>
**13. As a result, the weight is updated as follows:** -⟶ +⟶ **13. Sebagai hasilnya, nilai bobot diperbaharui sebagai berikut:
From 2d6b7eaf47ce73677f347a5acd77cf9c71ef40a6 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Sat, 6 Apr 2019 13:48:31 +0700 Subject: [PATCH 10/34] Update cheatsheet-deep-learning.md --- id/cheatsheet-deep-learning.md | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index 1a8852b6b..1844d7ccd 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -60,7 +60,7 @@ **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** -⟶**11. Learning rate - Learning rate (Tingkat pembelajaran), sering dinotasikan sebagai α atau η, merupakan fase pembaruan pembobotan. Tingkat pembelajaran dapat diperbaiki atau diubah secara adaptif. Metode yang paling populer saat ini disebut Adam, yang merupakan metode yang dapat menyesuaikan tingkat pembelajaran. +⟶**11. Learning rate - Learning rate (Tingkat pembelajaran), sering dinotasikan sebagai α atau η, merupakan fase pembaruan pembobotan. Tingkat pembelajaran dapat diperbaiki atau diubah secara adaptif. Metode yang paling populer saat ini disebut Adam, yang merupakan metode yang dapat menyesuaikan tingkat pembelajaran.**
@@ -78,61 +78,62 @@ **14. Updating weights ― In a neural network, weights are updated as follows:** -⟶ +⟶**14. Memperbaharui bobot w - Dalam neural network, bobot w diperbarui nilainya dengan cara berikut:**
**15. Step 1: Take a batch of training data.** ⟶ **15. Langkah 1: Mengambil satu batch dari data latih.** <br>
**16. Step 2: Perform forward propagation to obtain the corresponding loss.** ⟶ **16. Langkah 2: Melakukan forward propagation untuk mendapatkan nilai loss yang sesuai.** <br>
**17. Step 3: Backpropagate the loss to get the gradients.** -⟶ +⟶ **17. Langkah 3: Melakukan backpropagate terhadap loss untuk mendapatkan gradient.**
**18. Step 4: Use the gradients to update the weights of the network.** ⟶ **18. Langkah 4: Menggunakan gradient untuk memperbarui bobot dari network.** <br>
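For illustration, a minimal NumPy sketch of steps 15-18 above for a single linear layer trained with a squared-error loss; the toy data, batch size and learning rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=(100, 1))  # toy data
w, b, alpha = np.zeros((3, 1)), 0.0, 0.1                     # weights, bias, learning rate

for epoch in range(20):
    idx = rng.choice(len(X), size=32, replace=False)  # Step 1: take a batch of training data
    Xb, yb = X[idx], y[idx]
    z = Xb @ w + b                                    # Step 2: forward propagation
    loss = np.mean((z - yb) ** 2)                     #         and the corresponding loss
    grad_z = 2 * (z - yb) / len(Xb)                   # Step 3: backpropagate the loss to get gradients
    grad_w, grad_b = Xb.T @ grad_z, grad_z.sum()
    w, b = w - alpha * grad_w, b - alpha * grad_b     # Step 4: use the gradients to update the weights

print(float(loss))
```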
**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** ⟶ **19. Dropout - Dropout adalah teknik untuk mencegah overfitting pada data latih dengan menghilangkan (drop) unit-unit dalam neural network. Pada praktiknya, neuron di-drop dengan probabilitas p atau dipertahankan dengan probabilitas 1−p** <br>
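A minimal sketch of (inverted) dropout as described in entry 19; rescaling by 1-p is a common implementation convention that keeps the expected activation unchanged and is not stated in the entry itself.

```python
import numpy as np

def dropout(a, p, training=True, rng=np.random.default_rng()):
    # a: activations, p: probability of dropping a unit
    if not training:
        return a                      # at test time, all units are kept
    mask = rng.random(a.shape) >= p   # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)       # rescale so the expected value is unchanged

a = np.ones((2, 4))
print(dropout(a, p=0.5))
```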
**20. Convolutional Neural Networks** -⟶ +⟶ **20. Convolutional Neural Networks**
**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** ⟶ **21. Kebutuhan layer convolutional - Dengan W adalah ukuran volume input, F adalah ukuran neuron layer convolutional, dan P adalah jumlah zero padding, maka jumlah neuron N yang muat dalam volume yang diberikan adalah:** <br>
**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** -⟶ +⟶ **22. Batch normalization - Adalah salah satu step hyperparameter γ,β yang menormalisasikan batch {xi}. Dengan notasi μB,σ2B adalah rata-rata dan variansi nilai yang digunakan untuk perbaikan dalam batch, dapat diselesaikan sebagai berikut:**
**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** -⟶ +⟶ **23. Biasanya dilakukan setelah layer sepenuhnya terhubung / konvolusional dan sebelum layer non-linearitas, yang bertujuan untuk peningkatan tingkat pembelajaran yang lebih tinggi dan mengurangi ketergantungan yang kuat pada inisialisasi.** +
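A minimal sketch of the batch normalization step from entries 22-23, with learnable gamma and beta; the small epsilon is an implementation detail added for numerical stability.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: batch of activations, shape (batch_size, features)
    mu = x.mean(axis=0)                    # batch mean (mu_B)
    var = x.var(axis=0)                    # batch variance (sigma_B^2)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize the batch
    return gamma * x_hat + beta            # scale and shift with learnable parameters

x = np.random.randn(32, 8)
print(batch_norm(x, gamma=np.ones(8), beta=np.zeros(8)).std(axis=0))
```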
From dac448f2ad81ba1871a6f10810f84e5a0f7bcc64 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Sat, 6 Apr 2019 13:49:03 +0700 Subject: [PATCH 11/34] Update cheatsheet-deep-learning.md --- id/cheatsheet-deep-learning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index 1844d7ccd..fc3c25cc5 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -72,7 +72,7 @@ **13. As a result, the weight is updated as follows:** -⟶ **13. Sebagai hasilnya, nilai bobot diperbaharui sebagai berikut: +⟶ **13. Sebagai hasilnya, nilai bobot diperbaharui sebagai berikut: **
From 1e5a359775bcf383177bb389416f63484b64fbea Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Mon, 8 Apr 2019 14:09:34 +0800 Subject: [PATCH 12/34] Update cheatsheet-deep-learning.md --- id/cheatsheet-deep-learning.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index fc3c25cc5..d41ade019 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -139,43 +139,43 @@ **24. Recurrent Neural Networks** -⟶ +⟶ **24. Recurrent Neural Networks (RNN)**
**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** ⟶ **25. Jenis-jenis gates - Terdapat beberapa jenis gates dalam recurrent neural network pada umumnya:** <br>
**26. [Input gate, forget gate, gate, output gate]** ⟶ **26. [Input gate (gerbang masuk), forget gate (gerbang lupa), gate, output gate (gerbang keluar)]** <br>
**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** -⟶ +⟶ **27, [] **
**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** ⟶ **28. LSTM (Long short-term memory) - Network LSTM adalah salah satu jenis model RNN yang menghindari masalah hilangnya gradien (vanishing gradient) dengan menambahkan gerbang 'lupa'.** <br>
**29. Reinforcement Learning and Control** ⟶ **29. Reinforcement Learning dan Kontrol** <br>
**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** -⟶ +⟶ **30. Tujuan dari reinforcement learning adalah agar agen bisa membaur dan beradaptasi dengan lingkungannya.**
@@ -183,41 +183,41 @@ ⟶ -
+
**31. Definisi** **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** -⟶ +⟶ **32. Markov decision processes (MDP) - Proses pengambilan keputusan Markov (MDP) adalah sebuah 5-tuple (S,A,{Psa},γ,R) dimana: **
-**33. S is the set of states** +**33. S is the set of states** -⟶ +⟶ **33. S adalah himpunan dari kejadian (states) **
**34. A is the set of actions** -⟶ +⟶ **34. A adalah himpunan dari aksi/tindakan**
**35. {Psa} are the state transition probabilities for s∈S and a∈A** -⟶ +⟶ **35. {Psa} merupakan probabilitas perubahan kejadian untuk s∈S dan a∈A**
**36. γ∈[0,1[ is the discount factor** ⟶ **36. γ∈[0,1[ merupakan faktor diskon (discount factor)** <br>
**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** -⟶ +⟶ **37. R:S×A⟶R atau R:S⟶R adalah fungsi penghargaan (reward) yang akan ditingkatkan nilainya oleh si algoritma**
From 68b90cf8dcbb55e45ff4b69a6436b0a726372299 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Mon, 8 Apr 2019 17:35:44 +0800 Subject: [PATCH 13/34] Update cheatsheet-deep-learning.md --- id/cheatsheet-deep-learning.md | 40 +++++++++++++++++----------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index d41ade019..26f2c5a4c 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -157,7 +157,7 @@ **27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** -⟶ **27, [] ** +⟶ **27, [Dituliskan ke dalm sel atau tidak?, Hapus sel atau tidak?, Berapa banyak yang harus ditulis ke dalam sel?, Berapa banyak yang dibutuhkan untuk mengungkap sel?] **
@@ -193,7 +193,7 @@ **33. S is the set of states** -⟶ **33. S adalah himpunan dari kejadian (states) ** +⟶ **33. S adalah himpunan dari keadaan (states) **
@@ -217,106 +217,106 @@ **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** -⟶ **37. R:S×A⟶R atau R:S⟶R adalah fungsi penghargaan (reward) yang akan ditingkatkan nilainya oleh si algoritma** +⟶ **37. R:S×A⟶R atau R:S⟶R adalah fungsi penghargaan (reward) yang akan ditingkatkan nilainya oleh algoritma**
**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.** -⟶ +⟶ **38. Policy - Policy π adalah sebuah fungsi π:S⟶A yang memetakan keadaan (S) ke tindakan (A).**
**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** ⟶ **39. Catatan: Kita dikatakan menjalankan sebuah policy π jika, untuk keadaan s yang diberikan, kita mengambil tindakan a=π(s).** <br>
**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** ⟶ **40. Fungsi nilai - Untuk sebuah policy π dan sebuah keadaan s yang diberikan, kita mendefinisikan fungsi nilai Vπ sebagai berikut:** <br>
**41. Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** -⟶ +⟶ **41. Persamaan Bellman - Persamaan Optimal Bellman mengkarakterisasi fungsi nilai Vπ∗ dari sebuah optimal policy π∗:**
**42. Remark: we note that the optimal policy π∗ for a given state s is such that:** -⟶ +⟶ **42. Catatan: Nilai optimal policy dari π∗ untuk sebuah keadaan S adalah sebagai berikut:**
**43. Value iteration algorithm ― The value iteration algorithm is in two steps:** ⟶ **43. Algoritma value iteration - Algoritma value iteration terdiri atas dua tahap:** <br>
**44. 1) We initialize the value:** ⟶ **44. 1) Menginisialisasi nilai:** <br>
**45. 2) We iterate the value based on the values before:** ⟶ **45. 2) Melakukan iterasi nilai berdasarkan nilai sebelumnya:** <br>
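For illustration, a minimal NumPy sketch of the two value iteration steps above on a tiny MDP; the transition probabilities and rewards are random placeholders.

```python
import numpy as np

nS, nA, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = transition distribution over next states
R = rng.normal(size=(nS, nA))                  # R[s, a] = reward

V = np.zeros(nS)                               # 1) initialize the value
for _ in range(100):                           # 2) iterate based on the previous values:
    V = np.max(R + gamma * P @ V, axis=1)      #    V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]

policy = np.argmax(R + gamma * P @ V, axis=1)  # greedy policy with respect to the final values
print(V, policy)
```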
**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** ⟶ **46. Estimasi kemungkinan maksimum ― Estimasi kemungkinan maksimum untuk probabilitas transisi keadaan adalah sebagai berikut:** <br>
**47. times took action a in state s and got to s′** ⟶ **47. berapa kali tindakan a diambil di keadaan s dan sampai ke keadaan s′** <br>
**48. times took action a in state s** ⟶ **48. berapa kali tindakan a diambil di keadaan s** <br>
**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** ⟶ **49. Q-learning ― Q-learning adalah estimasi Q yang bersifat model-free (bebas model), yang dilakukan sebagai berikut:** <br>
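A minimal sketch of the tabular Q-learning update, assuming a learning rate alpha; the single transition at the bottom is made up purely to show the call.

```python
import numpy as np

nS, nA, alpha, gamma = 5, 2, 0.1, 0.9
Q = np.zeros((nS, nA))

def q_update(Q, s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# one made-up transition: state 0, action 1, reward 1.0, next state 2
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```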
**50. View PDF version on GitHub** -⟶ +⟶ **50. Lihat versi PDF di GitHub**
**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** ⟶ **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** <br>
**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** ⟶ **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** <br>
**53. [Recurrent Neural Networks, Gates, LSTM]** -⟶ +⟶ **53. [Recurrent Neural Networks, Gates, LSTM]**
**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** -⟶ +⟶ **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** From 1bf5e3f4ed95b9cf0ecd88403cb1e17a3e2168f4 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 9 Apr 2019 10:00:52 +0800 Subject: [PATCH 14/34] Fixing typo --- ...tsheet-machine-learning-tips-and-tricks.md | 285 +++++++ id/cheatsheet-supervised-learning.md | 567 ++++++++++++++ id/cheatsheet-unsupervised-learning.md | 340 +++++++++ id/convolutional-neural-networks.md | 716 ++++++++++++++++++ id/deep-learning-tips-and-tricks.md | 457 +++++++++++ id/recurrent-neural-networks.md | 677 +++++++++++++++++ id/refresher-linear-algebra.md | 339 +++++++++ id/refresher-probability.md | 381 ++++++++++ 8 files changed, 3762 insertions(+) create mode 100644 id/cheatsheet-machine-learning-tips-and-tricks.md create mode 100644 id/cheatsheet-supervised-learning.md create mode 100644 id/cheatsheet-unsupervised-learning.md create mode 100644 id/convolutional-neural-networks.md create mode 100644 id/deep-learning-tips-and-tricks.md create mode 100644 id/recurrent-neural-networks.md create mode 100644 id/refresher-linear-algebra.md create mode 100644 id/refresher-probability.md diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md new file mode 100644 index 000000000..9712297b8 --- /dev/null +++ b/id/cheatsheet-machine-learning-tips-and-tricks.md @@ -0,0 +1,285 @@ +**1. Machine Learning tips and tricks cheatsheet** + +⟶ + +
+ +**2. Classification metrics** + +⟶ + +
+ +**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** + +⟶ + +
+ +**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** + +⟶ + +
+ +**5. [Predicted class, Actual class]** + +⟶ + +
+ +**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** + +⟶ + +
+ +**7. [Metric, Formula, Interpretation]** + +⟶ + +
+ +**8. Overall performance of model** + +⟶ + +
+ +**9. How accurate the positive predictions are** + +⟶ + +
+ +**10. Coverage of actual positive sample** + +⟶ + +
+ +**11. Coverage of actual negative sample** + +⟶ + +
+ +**12. Hybrid metric useful for unbalanced classes** + +⟶ + +
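For illustration, a minimal NumPy sketch of the metrics above computed from a confusion matrix; the example label arrays are made up.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))  # true positives
TN = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
FP = np.sum((y_pred == 1) & (y_true == 0))  # false positives
FN = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy    = (TP + TN) / (TP + TN + FP + FN)       # overall performance of the model
precision   = TP / (TP + FP)                        # how accurate the positive predictions are
recall      = TP / (TP + FN)                        # coverage of actual positive samples
specificity = TN / (TN + FP)                        # coverage of actual negative samples
f1          = 2 * precision * recall / (precision + recall)  # hybrid metric for unbalanced classes
print(accuracy, precision, recall, specificity, f1)
```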
+ +**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** + +⟶ + +<br>
+ +**14. [Metric, Formula, Equivalent]** + +⟶ + +
+ +**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** + +⟶ + +
+ +**16. [Actual, Predicted]** + +⟶ + +
+ +**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** + +⟶ + +
+ +**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** + +⟶ + +
+ +**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** + +⟶ + +
+ +**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** + +⟶ + +
+ +**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** + +⟶ + +
+ +**22. Model selection** + +⟶ + +
+ +**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** + +⟶ + +
+ +**24. [Training set, Validation set, Testing set]** + +⟶ + +
+ +**25. [Model is trained, Model is assessed, Model gives predictions]** + +⟶ + +
+ +**26. [Usually 80% of the dataset, Usually 20% of the dataset]** + +⟶ + +
+ +**27. [Also called hold-out or development set, Unseen data]** + +⟶ + +
+ +**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** + +⟶ + +
+ +**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** + +⟶ + +
+ +**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]** + +⟶ + +
+ +**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** + +⟶ + +
+ +**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.** + +⟶ + +
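A minimal sketch of k-fold cross-validation as described above; `fit` and `error` are caller-supplied placeholder functions, not a specific library API.

```python
import numpy as np

def k_fold_cv_error(X, y, fit, error, k=5, seed=0):
    # split indices into k folds, train on k-1 folds, validate on the held-out fold
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])            # placeholder training routine
        errors.append(error(model, X[val], y[val]))  # placeholder validation error
    return np.mean(errors)  # cross-validation error, averaged over the k folds
```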
+ +**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** + +⟶ + +
+ +**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** + +⟶ + +
+ +**35. Diagnostics** + +⟶ + +
+ +**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** + +⟶ + +
+ +**37. Variance ― The variance of a model is the variability of the model prediction for given data points.** + +⟶ + +
+ +**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** + +⟶ + +
+ +**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]** + +⟶ + +
+ +**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]** + +⟶ + +
+ +**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]** + +⟶ + +
+ +**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** + +⟶ + +
+ +**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** + +⟶ + +
+ +**44. Regression metrics** + +⟶ + +
+ +**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]** + +⟶ + +
+ +**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]** + +⟶ + +
+ +**47. [Model selection, cross-validation, regularization]** + +⟶ + +
+ +**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]** + +⟶ diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md new file mode 100644 index 000000000..a6b19ea1c --- /dev/null +++ b/id/cheatsheet-supervised-learning.md @@ -0,0 +1,567 @@ +**1. Supervised Learning cheatsheet** + +⟶ + +
+ +**2. Introduction to Supervised Learning** + +⟶ + +
+ +**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ + +
+ +**4. Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ + +
+ +**5. [Regression, Classifier, Outcome, Examples]** + +⟶ + +
+ +**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ + +
+ +**7. Type of model ― The different models are summed up in the table below:** + +⟶ + +
+ +**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ + +
+ +**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ + +
+ +**10. Notations and general concepts** + +⟶ + +
+ +**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ + +
+ +**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ + +
+ +**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ + +
+ +**14. [Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ + +
+ +**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ + +
+ +**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ + +
+ +**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ + +
+ +**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** + +⟶ + +
+ +**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ + +
+ +**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ + +
+ +**21. Linear models** + +⟶ + +
+ +**22. Linear regression** + +⟶ + +
+ +**23. We assume here that y|x;θ∼N(μ,σ2)** + +⟶ + +
+ +**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ + +
+ +**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ + +
+ +**26. Remark: the update rule is a particular case of the gradient ascent.** + +⟶ + +
+ +**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ + +
+ +**28. Classification and logistic regression** + +⟶ + +
+ +**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ + +
+ +**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ + +
+ +**31. Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ + +
+ +**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ + +
+ +**33. Generalized Linear Models** + +⟶ + +
+ +**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ + +
+ +**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ + +
+ +**36. Here are the most common exponential distributions summed up in the following table:** + +⟶ + +
+ +**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ + +
+ +**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ + +
+ +**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ + +
+ +**40. Support Vector Machines** + +⟶ + +
+ +**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.** + +⟶ + +
+ +**42: Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ + +
+ +**43: where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ + +
+ +**44. such that** + +⟶ + +
+ +**45. support vectors** + +⟶ + +
+ +**46. Remark: the line is defined as wTx−b=0.** + +⟶ + +
+ +**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ + +
+ +**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ + +
+ +**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** + +⟶ + +
+ +**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ + +
+ +**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ + +
+ +**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ + +
+ +**53. Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ + +
+ +**54. Generative Learning** + +⟶ + +
+ +**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ + +
+ +**56. Gaussian Discriminant Analysis** + +⟶ + +
+ +**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ + +
+ +**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ + +
+ +**59. Naive Bayes** + +⟶ + +
+ +**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ + +
+ +**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ + +
+ +**62. Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ + +
+ +**63. Tree-based and ensemble methods** + +⟶ + +
+ +**64. These methods can be used for both regression and classification problems.** + +⟶ + +
+ +**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** + +⟶ + +
+ +**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ + +
+ +**67. Remark: random forests are a type of ensemble methods.** + +⟶ + +
+ +**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ + +
+ +**69. [Adaptive boosting, Gradient boosting]** + +⟶ + +
+ +**70. High weights are put on errors to improve at the next boosting step** + +⟶ + +
+ +**71. Weak learners trained on remaining errors** + +⟶ + +
+ +**72. Other non-parametric approaches** + +⟶ + +
+ +**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ + +
+ +**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ + +
+ +**75. Learning Theory** + +⟶ + +
+ +**76. Union bound ― Let A1,...,Ak be k events. We have:** + +⟶ + +
+ +**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ + +
+ +**78. Remark: this inequality is also known as the Chernoff bound.** + +⟶ + +
+ +**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ + +
+ +**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: ** + +⟶ + +
+ +**81: the training and testing sets follow the same distribution ** + +⟶ + +
+ +**82. the training examples are drawn independently** + +⟶ + +
+ +**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ + +
+ +**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ + +
+ +**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** + +⟶ + +
+ +**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ + +
+ +**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** + +⟶ + +
+ +**88. [Introduction, Type of prediction, Type of model]** + +⟶ + +
+ +**89. [Notations and general concepts, loss function, gradient descent, likelihood]** + +⟶ + +
+ +**90. [Linear models, linear regression, logistic regression, generalized linear models]** + +⟶ + +
+ +**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** + +⟶ + +
+ +**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** + +⟶ + +
+ +**93. [Trees and ensemble methods, CART, Random forest, Boosting]** + +⟶ + +
+ +**94. [Other methods, k-NN]** + +⟶ + +
+ +**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]** + +⟶ diff --git a/id/cheatsheet-unsupervised-learning.md b/id/cheatsheet-unsupervised-learning.md new file mode 100644 index 000000000..6daab3b21 --- /dev/null +++ b/id/cheatsheet-unsupervised-learning.md @@ -0,0 +1,340 @@ +**1. Unsupervised Learning cheatsheet** + +⟶ + +
+ +**2. Introduction to Unsupervised Learning** + +⟶ + +
+ +**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** + +⟶ + +
+ +**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** + +⟶ + +
+ +**5. Clustering** + +⟶ + +
+ +**6. Expectation-Maximization** + +⟶ + +
+ +**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** + +⟶ + +
+ +**8. [Setting, Latent variable z, Comments]** + +⟶ + +
+ +**9. [Mixture of k Gaussians, Factor analysis]** + +⟶ + +
+ +**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:** + +⟶ + +
+ +**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** + +⟶ + +
+ +**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** + +⟶ + +
+ +**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** + +⟶ + +
+ +**14. k-means clustering** + +⟶ + +
+ +**15. We note c(i) the cluster of data point i and μj the center of cluster j.** + +⟶ + +
+ +**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** + +⟶ + +
+ +**17. [Means initialization, Cluster assignment, Means update, Convergence]** + +⟶ + +
+ +**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** + +⟶ + +
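For illustration, a minimal NumPy sketch of the k-means steps and the distortion function above; initializing the centroids from random data points is an implementation choice, and empty clusters are not handled.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # means initialization
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(d, axis=1)                       # cluster assignment
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])  # means update
    distortion = np.sum((X - mu[c]) ** 2)              # distortion function J(c, mu)
    return c, mu, distortion

X = np.random.randn(200, 2)
c, mu, J = k_means(X, k=3)
print(J)
```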
+ +**19. Hierarchical clustering** + +⟶ + +
+ +**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** + +⟶ + +
+ +**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** + +⟶ + +
+ +**22. [Ward linkage, Average linkage, Complete linkage]** + +⟶ + +
+ +**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** + +⟶ + +
+ +**24. Clustering assessment metrics** + +⟶ + +
+ +**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** + +⟶ + +
+ +**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** + +⟶ + +
+ +**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** + +⟶ + +
+ +**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** + +⟶ + +
+ +**29. Dimension reduction** + +⟶ + +
+ +**30. Principal component analysis** + +⟶ + +
+ +**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** + +⟶ + +
+ +**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ + +
+ +**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ + +
+ +**34. diagonal** + +⟶ + +
+ +**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** + +⟶ + +
+ +**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k +dimensions by maximizing the variance of the data as follows:** + +⟶ + +
+ +**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** + +⟶ + +
+ +**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** + +⟶ + +
+ +**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** + +⟶ + +
+ +**40. Step 4: Project the data on spanR(u1,...,uk).** + +⟶ + +
+ +**41. This procedure maximizes the variance among all k-dimensional spaces.** + +⟶ + +
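A minimal NumPy sketch of the four PCA steps above, using `numpy.linalg.eigh` on the symmetric matrix Σ; it assumes no feature has zero variance.

```python
import numpy as np

def pca(X, k):
    # Step 1: normalize the data to zero mean and unit standard deviation
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) * sum_i x(i) x(i)^T
    sigma = (Xn.T @ Xn) / len(Xn)
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues (eigh returns them ascending)
    vals, vecs = np.linalg.eigh(sigma)
    U = vecs[:, np.argsort(vals)[::-1][:k]]
    # Step 4: project the data on span(u1, ..., uk)
    return Xn @ U

Z = pca(np.random.randn(100, 5), k=2)
print(Z.shape)
```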
+ +**42. [Data in feature space, Find principal components, Data in principal components space]** + +⟶ + +
+ +**43. Independent component analysis** + +⟶ + +
+ +**44. It is a technique meant to find the underlying generating sources.** + +⟶ + +
+ +**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** + +⟶ + +
+ +**46. The goal is to find the unmixing matrix W=A−1.** + +⟶ + +
+ +**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** + +⟶ + +
+ +**48. Write the probability of x=As=W−1s as:** + +⟶ + +
+ +**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** + +⟶ + +
+ +**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** + +⟶ + +
+ +**51. The Machine Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**52. Original authors** + +⟶ + +
+ +**53. Translated by X, Y and Z** + +⟶ + +
+ +**54. Reviewed by X, Y and Z** + +⟶ + +
+ +**55. [Introduction, Motivation, Jensen's inequality]** + +⟶ + +
+ +**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** + +⟶ + +
+ +**57. [Dimension reduction, PCA, ICA]** + +⟶ diff --git a/id/convolutional-neural-networks.md b/id/convolutional-neural-networks.md new file mode 100644 index 000000000..1b1283628 --- /dev/null +++ b/id/convolutional-neural-networks.md @@ -0,0 +1,716 @@ +**Convolutional Neural Networks translation** + +
+ +**1. Convolutional Neural Networks cheatsheet** + +⟶ + +
+ + +**2. CS 230 - Deep Learning** + +⟶ + +
+ + +**3. [Overview, Architecture structure]** + +⟶ + +
+ + +**4. [Types of layer, Convolution, Pooling, Fully connected]** + +⟶ + +
+ + +**5. [Filter hyperparameters, Dimensions, Stride, Padding]** + +⟶ + +
+ + +**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]** + +⟶ + +
+ + +**7. [Activation functions, Rectified Linear Unit, Softmax]** + +⟶ + +
+ + +**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]** + +⟶ + +
+ + +**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]** + +⟶ + +
+ + +**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]** + +⟶ + +
+ + +**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]** + +⟶ + +
+ + +**12. Overview** + +⟶ + +
+ + +**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:** + +⟶ + +
+ + +**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.** + +⟶ + +
+ + +**15. Types of layer** + +⟶ + +
+ + +**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.** + +⟶ + +
+ + +**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.** + +⟶ + +
+ + +**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.** + +⟶ + +
+ + +**19. [Type, Purpose, Illustration, Comments]** + +⟶ + +
+ + +**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]** + +⟶ + +
+ + +**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]** + +⟶ + +
+ + +**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.** + +⟶ + +
+ + +**23. Filter hyperparameters** + +⟶ + +
+ + +**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.** + +⟶ + +
+ + +**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.** + +⟶ + +
+ + +**26. Filter** + +⟶ + +
+ + +**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.** + +⟶ + +
+ + +**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.** + +⟶ + +
+ + +**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:** + +⟶ + +
+ + +**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]** + +⟶ + +
+ + +**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈IS⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]** + +⟶ + +
+ + +**32. Tuning hyperparameters** + +⟶ + +
+ + +**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:** + +⟶ + +
+ + +**34. [Input, Filter, Output]** + +⟶ + +
+ + +**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.** + +⟶ + +
+ + +**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:** + +⟶ + +
+ + +**37. [Illustration, Input size, Output size, Number of parameters, Remarks]** + +⟶ + +
+ +**38. [One bias parameter per filter, In most cases, S<F, A common choice for K is 2C]** + +⟶ + +<br> + +**39. [Pooling operation done channel-wise, In most cases, S=F]** + +⟶ + +<br>
+ + +**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]** + +⟶ + +
+ + +**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:** + +⟶ + +
+ + +**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.** + +⟶ + +
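The receptive field formula can be checked numerically. Below is a small Python sketch that reproduces the example above (R2=5 for F1=F2=3 and S1=S2=1); the helper name is illustrative:

```python
# Receptive field at layer k, with the convention S0 = 1:
# Rk = 1 + sum_{j=1..k} (Fj - 1) * prod_{i=0..j-1} Si
def receptive_field(filter_sizes, strides):
    r, jump = 1, 1            # jump accumulates the product of previous strides
    for f, s in zip(filter_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r

print(receptive_field([3, 3], [1, 1]))  # 5, as in the example above
```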
+ + +**43. Commonly used activation functions** + +⟶ + +
+ + +**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:** + +⟶ + +
+ + +**45. [ReLU, Leaky ReLU, ELU, with]** + +⟶ + +
+ + +**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]** + +⟶ + +
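For reference, a minimal NumPy sketch of these variants is given below; the values of alpha are common illustrative defaults rather than prescribed choices:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)                 # small slope for negative values

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))   # smooth for negative values

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), elu(z), sep="\n")
```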
+ + +**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:** + +⟶ + +
+ + +**48. where** + +⟶ + +
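A minimal NumPy sketch of the softmax step is shown below; subtracting the maximum score is a standard numerical safeguard and does not change the output:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())                  # a probability vector summing to 1
```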
+ + +**49. Object detection** + +⟶ + +
+ + +**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:** + +⟶ + +
+ + +**51. [Image classification, Classification w. localization, Detection]** + +⟶ + +
+ + +**52. [Teddy bear, Book]** + +⟶ + +
+ + +**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]** + +⟶ + +
+ + +**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]** + +⟶ + +
+ + +**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:** + +⟶ + +
+ + +**56. [Bounding box detection, Landmark detection]** + +⟶ + +
+ + +**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]** + +⟶ + +
+ + +**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]** + +⟶ + +
+ + +**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:** + +⟶ + +
+ + +**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.** + +⟶ + +
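For illustration, a minimal Python sketch of IoU for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (the box format is an assumption of this sketch):

```python
def iou(box_p, box_a):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_p[0], box_a[0]), max(box_p[1], box_a[1])
    x2, y2 = min(box_p[2], box_a[2]), min(box_p[3], box_a[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / (area_p + area_a - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.14, below the 0.5 'reasonably good' bar
```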
+ + +**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.** + +⟶ + +
+ + +**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:** + +⟶ + +
+ + +**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]** + +⟶ + +
+ + +**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]** + +⟶ + +
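A minimal, self-contained Python sketch of this procedure for a single class is given below; the 0.6 and 0.5 thresholds follow the text but are tunable in practice, and the box format (x1, y1, x2, y2) is an assumption of the sketch:

```python
def _iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, prob_thresh=0.6, iou_thresh=0.5):
    # keep boxes above the probability threshold, most confident first
    candidates = sorted((s, b) for s, b in zip(scores, boxes) if s >= prob_thresh)
    candidates.reverse()
    kept = []
    while candidates:
        score, best = candidates.pop(0)            # Step 1: largest prediction probability
        kept.append((score, best))
        candidates = [(s, b) for s, b in candidates
                      if _iou(b, best) < iou_thresh]   # Step 2: discard overlapping boxes
    return kept

boxes = [(0, 0, 2, 2), (0, 0, 2.1, 2.1), (5, 5, 7, 7)]
print(non_max_suppression(boxes, [0.9, 0.8, 0.7]))     # the duplicate box is removed
```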
+ + +**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:** + +⟶ + +
+ + +**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]** + +⟶ + +
+ + +**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.** + +⟶ + +<br>
+ + +**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.** + +⟶ + +
+ + +**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]** + +⟶ + +
+ + +**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.** + +⟶ + +
+ + +**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes.** + +⟶ + +
+ + +**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]** + +⟶ + +
+ + +**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.** + +⟶ + +
+ + +**74. Face verification and recognition** + +⟶ + +
+ + +**75. Types of models ― Two main types of model are summed up in table below:** + +⟶ + +
+ + +**76. [Face verification, Face recognition, Query, Reference, Database]** + +⟶ + +
+ + +**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]** + +⟶ + +
+ + +**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).** + +⟶ + +
+ + +**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).** + +⟶ + +
+ + +**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:** + +⟶ + +
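As an illustration, here is a minimal NumPy sketch of the triplet loss on already-computed embeddings f(A), f(P), f(N); the squared Euclidean distance and the margin value are common choices, not requirements:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    d_ap = np.sum((f_a - f_p) ** 2)        # anchor-positive distance
    d_an = np.sum((f_a - f_n) ** 2)        # anchor-negative distance
    return max(d_ap - d_an + alpha, 0.0)   # hinge on the margin alpha

f_a = np.array([0.10, 0.90])
f_p = np.array([0.12, 0.88])
f_n = np.array([0.90, 0.10])
print(triplet_loss(f_a, f_p, f_n))         # 0.0: anchor already closer to the positive
```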
+ + +**81. Neural style transfer** + +⟶ + +
+ + +**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.** + +⟶ + +
+ + +**83. [Content C, Style S, Generated image G]** + +⟶ + +
+ + +**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc** + +⟶ + +
+ + +**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:** + +⟶ + +
+ + +**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:** + +⟶ + +
+ + +**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.** + +⟶ + +
+ + +**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:** + +⟶ + +
+ + +**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:** + +⟶ + +
+ + +**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.** + +⟶ + +
+ + +**91. Architectures using computational tricks** + +⟶ + +
+ + +**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.** + +⟶ + +
+ + +**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]** + +⟶ + +
+ + +**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.** + +⟶ + +
+ + +**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:** + +⟶ + +
+ + +**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.** + +⟶ + +
+ + +**97. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ + +**98. Original authors** + +⟶ + +
+ + +**99. Translated by X, Y and Z** + +⟶ + +
+ + +**100. Reviewed by X, Y and Z** + +⟶ + +
+ + +**101. View PDF version on GitHub** + +⟶ + +
+ + +**102. By X and Y** + +⟶ + +
diff --git a/id/deep-learning-tips-and-tricks.md b/id/deep-learning-tips-and-tricks.md new file mode 100644 index 000000000..347234ec2 --- /dev/null +++ b/id/deep-learning-tips-and-tricks.md @@ -0,0 +1,457 @@ +**Deep Learning Tips and Tricks translation** + +
+ +**1. Deep Learning Tips and Tricks cheatsheet** + +⟶ + +
+ + +**2. CS 230 - Deep Learning** + +⟶ + +
+ + +**3. Tips and tricks** + +⟶ + +
+ + +**4. [Data processing, Data augmentation, Batch normalization]** + +⟶ + +
+ + +**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]** + +⟶ + +
+ + +**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]** + +⟶ + +
+ + +**7. [Regularization, Dropout, Weight regularization, Early stopping]** + +⟶ + +
+ + +**8. [Good practices, Overfitting small batch, Gradient checking]** + +⟶ + +
+ + +**9. View PDF version on GitHub** + +⟶ + +
+ + +**10. Data processing** + +⟶ + +
+ + +**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:** + +⟶ + +
+ + +**12. [Original, Flip, Rotation, Random crop]** + +⟶ + +
+ + +**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]** + +⟶ + +
+ + +**14. [Color shift, Noise addition, Information loss, Contrast change]** + +⟶ + +
+ + +**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]** + +⟶ + +
+ + +**16. Remark: data is usually augmented on the fly during training.** + +⟶ + +
+ + +**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** + +⟶ + +
+ + +**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ + +
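A minimal NumPy sketch of the training-time computation is shown below; at test time, running averages of the batch mean and variance are typically used instead of the batch statistics:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), then rescale."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # epsilon avoids division by zero
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 3.0 + 5.0      # a mini-batch of 32 examples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```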
+ + +**19. Training a neural network** + +⟶ + +
+ + +**20. Definitions** + +⟶ + +
+ + +**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.** + +⟶ + +
+ + +**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.** + +⟶ + +
+ + +**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.** + +⟶ + +
+ + +**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ + +
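For reference, a minimal NumPy sketch of this loss, where z is the predicted probability and y is 0 or 1; the clipping is only a numerical safeguard and not part of the definition:

```python
import numpy as np

def cross_entropy(z, y, eps=1e-12):
    z = np.clip(z, eps, 1 - eps)                       # avoid log(0)
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

print(cross_entropy(0.9, 1))   # small loss: confident and correct
print(cross_entropy(0.9, 0))   # large loss: confident and wrong
```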
+ + +**25. Finding optimal weights** + +⟶ + +
+ + +**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** + +⟶ + +
+ + +**27. Using this method, each weight is updated with the rule:** + +⟶ + +
+ + +**28. Updating weights ― In a neural network, weights are updated as follows:** + +⟶ + +
+ + +**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]** + +⟶ + +
+ + +**30. [Forward propagation, Backpropagation, Weights update]** + +⟶ + +
+ + +**31. Parameter tuning** + +⟶ + +
+ + +**32. Weights initialization** + +⟶ + +
+ + +**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.** + +⟶ + +
+ + +**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:** + +⟶ + +
+ + +**35. [Training size, Illustration, Explanation]** + +⟶ + +
+ + +**36. [Small, Medium, Large]** + +⟶ + +
+ + +**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]** + +⟶ + +
+ + +**38. Optimizing convergence** + +⟶ + +
+ + +**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate. +** + +⟶ + +
+ + +**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:** + +⟶ + +
+ + +**41. [Method, Explanation, Update of w, Update of b]** + +⟶ + +
+ + +**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]** + +⟶ + +
+ + +**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]** + +⟶ + +
+ + +**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]** + +⟶ + +
+ + +**45. Remark: other methods include Adadelta, Adagrad and SGD.** + +⟶ + +
+ + +**46. Regularization** + +⟶ + +
+ + +**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.** + +⟶ + +
+ + +**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.** + +⟶ + +
+ + +**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:** + +⟶ + +
+ + +**50. [LASSO, Ridge, Elastic Net]** + +⟶ + +
+ +**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** + +⟶ + +<br>
+ +**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.** + +⟶ + +
+ + +**52. [Error, Validation, Training, early stopping, Epochs]** + +⟶ + +
+ + +**53. Good practices** + +⟶ + +
+ + +**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.** + +⟶ + +
+ + +**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.** + +⟶ + +
+ + +**56. [Type, Numerical gradient, Analytical gradient]** + +⟶ + +
+ + +**57. [Formula, Comments]** + +⟶ + +
+ + +**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]** + +⟶ + +
+ + +**59. ['Exact' result, Direct computation, Used in the final implementation]** + +⟶ + +
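A minimal NumPy sketch of such a check on a toy function is given below; the function and its gradient are placeholders chosen so the analytical answer is known:

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)          # toy loss

def analytic_grad(w):
    return 2 * w                   # known gradient of the toy loss

def numerical_grad(fun, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):        # two loss evaluations per dimension
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (fun(w + e) - fun(w - e)) / (2 * h)
    return grad

w = np.array([1.0, -2.0, 0.5])
print(np.max(np.abs(numerical_grad(f, w) - analytic_grad(w))))  # close to 0
```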
+ +**60. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +<br>
+ + +**61. Original authors** + +⟶ + +<br>
+ +**62.Translated by X, Y and Z** + +⟶ + +
+ +**63.Reviewed by X, Y and Z** + +⟶ + +
+ +**64.View PDF version on GitHub** + +⟶ + +
+ +**65.By X and Y** + +⟶ + +
diff --git a/id/recurrent-neural-networks.md b/id/recurrent-neural-networks.md new file mode 100644 index 000000000..191e400a1 --- /dev/null +++ b/id/recurrent-neural-networks.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
+ + +**2. CS 230 - Deep Learning** + +⟶ + +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
+ + +**10. Overview** + +⟶ + +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
+ + +**13. and** + +⟶ + +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
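A minimal NumPy sketch of a single timestep is shown below, taking g1=tanh and g2=softmax as common (but not mandatory) choices; all shapes and names are illustrative:

```python
import numpy as np

def rnn_step(x_t, a_prev, W_ax, W_aa, W_ya, b_a, b_y):
    """One timestep: a<t> = g1(Waa a<t-1> + Wax x<t> + ba), y<t> = g2(Wya a<t> + by)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    z = W_ya @ a_t + b_y
    y_t = np.exp(z - z.max())
    y_t /= y_t.sum()
    return a_t, y_t

n_x, n_a, n_y = 3, 5, 2
rng = np.random.default_rng(0)
a_t, y_t = rnn_step(rng.normal(size=n_x), np.zeros(n_a),
                    rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)),
                    rng.normal(size=(n_y, n_a)), np.zeros(n_a), np.zeros(n_y))
print(a_t.shape, float(y_t.sum()))   # (5,) 1.0
```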
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
+ + +**24. Handling long term dependencies** + +⟶ + +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
+ + +**29. clipped** + +⟶ + +
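One common implementation rescales the gradient so its norm never exceeds a chosen cap, as in the minimal sketch below; the cap value is a hyperparameter and 1.0 is only an example (element-wise clamping is another variant):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale, keeping the direction
    return grad

g = np.array([3.0, 4.0])                  # norm 5
print(clip_gradient(g, max_norm=1.0))     # [0.6 0.8], norm 1
```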
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ + +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
+ + +**35. [LSTM, GRU]** + +⟶ + +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +<br>
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
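For illustration, a minimal NumPy sketch of the cosine similarity; the embedding values below are made up purely for the example:

```python
import numpy as np

def cosine_similarity(e1, e2):
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

e_teddy = np.array([0.90, 0.10, 0.80])
e_bear  = np.array([0.85, 0.20, 0.75])
e_tax   = np.array([-0.40, 0.90, -0.10])
print(cosine_similarity(e_teddy, e_bear))  # close to 1: similar words
print(cosine_similarity(e_teddy, e_tax))   # much lower
```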
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/id/refresher-linear-algebra.md b/id/refresher-linear-algebra.md new file mode 100644 index 000000000..a6b440d1e --- /dev/null +++ b/id/refresher-linear-algebra.md @@ -0,0 +1,339 @@ +**1. Linear Algebra and Calculus refresher** + +⟶ + +
+ +**2. General notations** + +⟶ + +
+ +**3. Definitions** + +⟶ + +
+ +**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** + +⟶ + +
+ +**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** + +⟶ + +
+ +**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** + +⟶ + +
+ +**7. Main matrices** + +⟶ + +
+ +**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** + +⟶ + +
+ +**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** + +⟶ + +
+ +**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** + +⟶ + +
+ +**11. Remark: we also note D as diag(d1,...,dn).** + +⟶ + +
+ +**12. Matrix operations** + +⟶ + +
+ +**13. Multiplication** + +⟶ + +
+ +**14. Vector-vector ― There are two types of vector-vector products:** + +⟶ + +
+ +**15. inner product: for x,y∈Rn, we have:** + +⟶ + +
+ +**16. outer product: for x∈Rm,y∈Rn, we have:** + +⟶ + +
+ +**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:** + +⟶ + +
+ +**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** + +⟶ + +
+ +**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:** + +⟶ + +
+ +**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** + +⟶ + +
+ +**21. Other operations** + +⟶ + +
+ +**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** + +⟶ + +
+ +**23. Remark: for matrices A,B, we have (AB)T=BTAT** + +⟶ + +
+ +**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** + +⟶ + +
+ +**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** + +⟶ + +
+ +**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** + +⟶ + +
+ +**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** + +⟶ + +
+ +**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** + +⟶ + +
+ +**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** + +⟶ + +
+ +**30. Matrix properties** + +⟶ + +
+ +**31. Definitions** + +⟶ + +
+ +**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** + +⟶ + +
+ +**33. [Symmetric, Antisymmetric]** + +⟶ + +
+ +**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** + +⟶ + +
+ +**35. N(ax)=|a|N(x) for a scalar** + +⟶ + +
+ +**36. if N(x)=0, then x=0** + +⟶ + +
+ +**37. For x∈V, the most commonly used norms are summed up in the table below:** + +⟶ + +
+ +**38. [Norm, Notation, Definition, Use case]** + +⟶ + +
+ +**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** + +⟶ + +<br>
+ +**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** + +⟶ + +
+ +**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** + +⟶ + +
+ +**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** + +⟶ + +
+ +**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** + +⟶ + +
+ +**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ + +
+ +**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ + +
+ +**46. diagonal** + +⟶ + +
+ +**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** + +⟶ + +
+ +**48. Matrix calculus** + +⟶ + +
+ +**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** + +⟶ + +
+ +**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** + +⟶ + +
+ +**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** + +⟶ + +
+ +**52. Remark: the hessian of f is only defined when f is a function that returns a scalar** + +⟶ + +
+ +**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** + +⟶ + +
+ +**54. [General notations, Definitions, Main matrices]** + +⟶ + +
+ +**55. [Matrix operations, Multiplication, Other operations]** + +⟶ + +
+ +**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** + +⟶ + +
+ +**57. [Matrix calculus, Gradient, Hessian, Operations]** + +⟶ diff --git a/id/refresher-probability.md b/id/refresher-probability.md new file mode 100644 index 000000000..5c9b34656 --- /dev/null +++ b/id/refresher-probability.md @@ -0,0 +1,381 @@ +**1. Probabilities and Statistics refresher** + +⟶ + +
+ +**2. Introduction to Probability and Combinatorics** + +⟶ + +
+ +**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** + +⟶ + +
+ +**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** + +⟶ + +
+ +**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** + +⟶ + +
+ +**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:** + +⟶ + +
+ +**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** + +⟶ + +
+ +**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** + +⟶ + +
+ +**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** + +⟶ + +
+ +**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** + +⟶ + +
+ +**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** + +⟶ + +
+ +**12. Conditional Probability** + +⟶ + +
+ +**13. Bayes' rule ― For events A and B such that P(B)>0, we have:** + +⟶ + +
+ +**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** + +⟶ + +
+ +**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** + +⟶ + +
+ +**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** + +⟶ + +
+ +**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** + +⟶ + +
+ +**18. Independence ― Two events A and B are independent if and only if we have:** + +⟶ + +
+ +**19. Random Variables** + +⟶ + +
+ +**20. Definitions** + +⟶ + +
+ +**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** + +⟶ + +
+ +**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** + +⟶ + +
+ + +**23. Remark: we have P(a<X⩽b)=F(b)−F(a)** + +⟶ + +<br>
+ + +**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** + +⟶ + +<br>
+ +**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** + +⟶ + +
+ +**26. [Case, CDF F, PDF f, Properties of PDF]** + +⟶ + +
+ +**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** + +⟶ + +
+ +**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** + +⟶ + +
+ +**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** + +⟶ + +
+ +**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** + +⟶ + +
+ +**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** + +⟶ + +
+ +**32. Probability Distributions** + +⟶ + +
+ +**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** + +⟶ + +
+ +**34. Main distributions ― Here are the main distributions to have in mind:** + +⟶ + +
+ +**35. [Type, Distribution]** + +⟶ + +
+ +**36. Jointly Distributed Random Variables** + +⟶ + +
+ +**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** + +⟶ + +
+ +**38. [Case, Marginal density, Cumulative function]** + +⟶ + +
+ +**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** + +⟶ + +
+ +**40. Independence ― Two random variables X and Y are said to be independent if we have:** + +⟶ + +
+ +**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** + +⟶ + +
+ +**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** + +⟶ + +
+ +**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** + +⟶ + +
+ +**44. Remark 2: If X and Y are independent, then ρXY=0.** + +⟶ + +
+ +**45. Parameter estimation** + +⟶ + +
+ +**46. Definitions** + +⟶ + +
+ +**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** + +⟶ + +
+ +**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** + +⟶ + +
+ +**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** + +⟶ + +
+ +**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** + +⟶ + +
+ +**51. Estimating the mean** + +⟶ + +
+ +**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:** + +⟶ + +
+ +**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.** + +⟶ + +
+ +**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** + +⟶ + +
+ +**55. Estimating the variance** + +⟶ + +
+ +**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** + +⟶ + +
+ +**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.** + +⟶ + +
+ +**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** + +⟶ + +
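The sample mean and the sample variance above are easy to check empirically, as in the minimal NumPy sketch below (note the n−1 denominator, i.e. ddof=1, for the unbiased sample variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # true mu = 2, sigma^2 = 9

x_bar = x.mean()        # sample mean, estimates mu
s2 = x.var(ddof=1)      # unbiased sample variance, estimates sigma^2
print(round(float(x_bar), 2), round(float(s2), 2))  # close to 2 and 9
```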
+ +**59. [Introduction, Sample space, Event, Permutation]** + +⟶ + +
+ +**60. [Conditional probability, Bayes' rule, Independence]** + +⟶ + +
+ +**61. [Random variables, Definitions, Expectation, Variance]** + +⟶ + +
+ +**62. [Probability distributions, Chebyshev's inequality, Main distributions]** + +⟶ + +
+ +**63. [Jointly distributed random variables, Density, Covariance, Correlation]** + +⟶ + +
+ +**64. [Parameter estimation, Mean, Variance]** + +⟶ From 597295d70cbf9ed12b03e589c6fbd214c55928b2 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 9 Apr 2019 10:01:12 +0800 Subject: [PATCH 15/34] Adding some files and fixing typo --- id/cheatsheet-deep-learning.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index 26f2c5a4c..6634a14de 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -72,7 +72,7 @@ **13. As a result, the weight is updated as follows:** -⟶ **13. Sebagai hasilnya, nilai bobot diperbaharui sebagai berikut: ** +⟶ **13. Sebagai hasilnya, nilai bobot diperbaharui sebagai berikut:**
@@ -90,7 +90,7 @@ **16. Step 2: Perform forward propagation to obtain the corresponding loss.** -⟶**16. Langkah 2: Melakukan forward propagation untuk mendapatkan nilai loss yang sesuai. ** +⟶**16. Langkah 2: Melakukan forward propagation untuk mendapatkan nilai loss yang sesuai.**
@@ -120,7 +120,7 @@ **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** -⟶ **21. Kebutuhan layer convolutional - W adalah ukuran volume input, F adalah ukuran dari layer neuron convolutional, P adalah jumlah zero padding, maka jumlah neurons N yang dapat dibentuk dari volume yang diberikan adalah: ** +⟶ **21. Kebutuhan layer convolutional - W adalah ukuran volume input, F adalah ukuran dari layer neuron convolutional, P adalah jumlah zero padding, maka jumlah neurons N yang dapat dibentuk dari volume yang diberikan adalah:**
@@ -145,19 +145,19 @@ **25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** -⟶ **25. Jenis-jenis gates - Terdapat beberapa jenis gates dalam Recurrent Neural Network: ** +⟶ **25. Jenis-jenis gates - Terdapat beberapa jenis gates dalam Recurrent Neural Network:**
**26. [Input gate, forget gate, gate, output gate]** -⟶ **26. [Input gate (gerbang masuk), forget gate (gerbang lupa), gate, output gate (gerbang keluar)] +⟶ **26. [Input gate (gerbang masuk), forget gate (gerbang lupa), gate, output gate (gerbang keluar)]**
**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** -⟶ **27, [Dituliskan ke dalm sel atau tidak?, Hapus sel atau tidak?, Berapa banyak yang harus ditulis ke dalam sel?, Berapa banyak yang dibutuhkan untuk mengungkap sel?] ** +⟶ **27, [Dituliskan ke dalm sel atau tidak?, Hapus sel atau tidak?, Berapa banyak yang harus ditulis ke dalam sel?, Berapa banyak yang dibutuhkan untuk mengungkap sel?]**
@@ -187,13 +187,13 @@ **32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** -⟶ **32. Markov decision processes (MDP) - Proses pengambilan keputusan Markov (MDP) adalah sebuah 5-tuple (S,A,{Psa},γ,R) dimana: ** +⟶ **32. Markov decision processes (MDP) - Proses pengambilan keputusan Markov (MDP) adalah sebuah 5-tuple (S,A,{Psa},γ,R) dimana:**
**33. S is the set of states** -⟶ **33. S adalah himpunan dari keadaan (states) ** +⟶ **33. S adalah himpunan dari keadaan (states)**
@@ -229,13 +229,13 @@ **39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).** -⟶ **39. Catatan: Kita menjalankan sebuah policy π jika diberikan keadaan S, maka tindakan a = π (s) ** +⟶ **39. Catatan: Kita menjalankan sebuah policy π jika diberikan keadaan S, maka tindakan a = π (s)**
**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** -⟶ **40. Fungsi nilai - Diberikan sebuah policy π dan sebuah keadaan S, maka kita mendefinisikan nilai fungsi Vπ dengan sebagai berikut: ** +⟶ **40. Fungsi nilai - Diberikan sebuah policy π dan sebuah keadaan S, maka kita mendefinisikan nilai fungsi Vπ dengan sebagai berikut:**
@@ -253,25 +253,25 @@ **43. Value iteration algorithm ― The value iteration algorithm is in two steps:** -⟶ **43. Nilai perulangan algoritma - Nilai perulangan algoritma dibagai atas dua tahap: ** +⟶ **43. Nilai perulangan algoritma - Nilai perulangan algoritma dibagai atas dua tahap:**
**44. 1) We initialize the value:** -⟶ **44. 1) Menginisialisasi nilai: ** +⟶ **44. 1) Menginisialisasi nilai:**
**45. 2) We iterate the value based on the values before:** -⟶ **45. 2) Melakukan iterasi berdasarkan nilai sebelumnya: ** +⟶ **45. 2) Melakukan iterasi berdasarkan nilai sebelumnya:**
**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** -⟶ **46. Estimasi kemungkinan maksimum ― Estimasi kemungkinan maksimum untuk probabilitas transisi keadaan S adalah sebagai berikut: ** +⟶ **46. Estimasi kemungkinan maksimum ― Estimasi kemungkinan maksimum untuk probabilitas transisi keadaan S adalah sebagai berikut:**
@@ -307,7 +307,7 @@ **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]** -⟶ **52, [Convolutional Neural Networks, Convolutional layer, Batch normalization] +⟶ **52, [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
From 344a4241bd3685c3c8aa3818e20af5efda397293 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 9 Apr 2019 10:12:53 +0800 Subject: [PATCH 16/34] Translating machine learning tips and tricks into Bahasa Indonesia --- ...tsheet-machine-learning-tips-and-tricks.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md index 9712297b8..a9190b16e 100644 --- a/id/cheatsheet-machine-learning-tips-and-tricks.md +++ b/id/cheatsheet-machine-learning-tips-and-tricks.md @@ -1,60 +1,60 @@ **1. Machine Learning tips and tricks cheatsheet** -⟶ +⟶ **1. Catatan ringkas tips dan trik machine learning**
**2. Classification metrics** -⟶ +⟶ **2. Matriks klasifikasi**
**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** -⟶ +⟶ **3. Dalam konteks klasifikasi biner, terdapat beberapa matriks utama yang penting untuk menilai performa dari sebuah model.**
**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:** -⟶ +⟶ **4. Confusion Matrix - Confusion matrix digunakan untuk membantu mengetahui bagaimana gambaran secara umum performa dari model. Didefinisikan sebagai berikut:**
**5. [Predicted class, Actual class]** -⟶ +⟶ **5. [Kelas prediksi, Kelas aktual]**
**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:** -⟶ +⟶ **6. Matriks utama - Berikut merupakan matriks yang umum digunakan untuk mengukur performa model klasifikasi:**
**7. [Metric, Formula, Interpretation]** -⟶ +⟶ **7. [Metrik, Formula, Interpretasi]** <br>
**8. Overall performance of model** -⟶ +⟶ **8. Performa model secara umum**
**9. How accurate the positive predictions are** -⟶ +⟶ **9. Bagaimana positif akurasinya**
**10. Coverage of actual positive sample** -⟶ +⟶ **10. Cakupan sample positif aktual**
From 9b4f85dd456983a0631923f98d6bdf51ec3e7d6e Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 9 Apr 2019 14:39:16 +0800 Subject: [PATCH 17/34] Add translation to Bahasa Indonesia --- ...tsheet-machine-learning-tips-and-tricks.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md index a9190b16e..8c6235c7b 100644 --- a/id/cheatsheet-machine-learning-tips-and-tricks.md +++ b/id/cheatsheet-machine-learning-tips-and-tricks.md @@ -54,61 +54,61 @@ **10. Coverage of actual positive sample** -⟶ **10. Cakupan sample positif aktual** +⟶ **10. Cakupan aktual sampel positif**
**11. Coverage of actual negative sample** -⟶ +⟶**11. Cakupan aktual sampel negatif**
**12. Hybrid metric useful for unbalanced classes** -⟶ +⟶**12. Matriks gabungan yang berguna untuk kelas data yang tidak seimbang**
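A minimal Python sketch computing these metrics from the confusion-matrix counts is given below; the counts in the example are made up:

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)   # overall performance
    precision = tp / (tp + fp)                    # accuracy of positive predictions
    recall    = tp / (tp + fn)                    # coverage of actual positives (TPR)
    f1        = 2 * precision * recall / (precision + recall)  # hybrid metric
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```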
**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:** -⟶ +⟶ **13. ROC ― Receiver operating curve (ROC) adalah plot TPR yang disandingkan dengan FPR dengan memvariasikan nilai ambang batasnya. Metrik ini dijelaskan pada tabel di bawah berikut:** <br>
**14. [Metric, Formula, Equivalent]** -⟶ +⟶ **14. [Matriks, Formula, Persamaan]**
**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:** -⟶ +⟶ **15. AUC (Area Under the Curve) ― Area bawah kurva, adalah area dibawah ROC yang digambarkan sebagai berikut:**
**16. [Actual, Predicted]** -⟶ +⟶ **16. [Aktual, Prediksi]**
**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:** -⟶ +⟶ **17. Matriks sederhana - Diberikan sebuah model regresi f, matriks berikut secara umum digunakan untuk mengukur performa dari model:**
**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]** -⟶ +⟶ **18. [Jumlah total persegi, Penjelasan jumlah persegi, Sisa jumlah persegi]**
**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:** -⟶ +⟶ **19. Koefisisen determinasi ― Koefesien determinasi dinotasikan dengan R2 atau r2, digunakan untuk mengukur bagaimana baik tidaknya hasil observasi yang ditiru oleh model, sebagai berikut:**
From f1b014de29f67a64b5e9497be39106ec3652a8b3 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 9 Apr 2019 18:36:50 +0800 Subject: [PATCH 18/34] Translating some numbers into Bahasa Indonesia --- ...tsheet-machine-learning-tips-and-tricks.md | 42 +++++++++---------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md index 8c6235c7b..4f60574a4 100644 --- a/id/cheatsheet-machine-learning-tips-and-tricks.md +++ b/id/cheatsheet-machine-learning-tips-and-tricks.md @@ -12,7 +12,7 @@ **3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.** -⟶ **3. Dalam konteks klasifikasi biner, terdapat beberapa matriks utama yang penting untuk menilai performa dari sebuah model.** +⟶ **3. Dalam konteks klasifikasi biner, terdapat beberapa matriks utama yang penting untuk digunakan dalam mengukur performa dari sebuah model.**
@@ -48,19 +48,19 @@ **9. How accurate the positive predictions are** -⟶ **9. Bagaimana positif akurasinya** +⟶ **9. Bagaimana akurasi prediksi positifnya**
**10. Coverage of actual positive sample** -⟶ **10. Cakupan aktual sampel positif** +⟶ **10. Cakupan sampel aktual positif**
**11. Coverage of actual negative sample** -⟶**11. Cakupan aktual sampel negatif** +⟶**11. Cakupan sampel aktual negatif**
@@ -114,97 +114,97 @@ **20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:** -⟶ +⟶ **20. Matriks utama ― Matriks berikut adalah yang umum digunakan untuk mengukur performa dari model regresi, dengan mengambil sejumlah variabel n sebagai pertimbangan:**
**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.** -⟶ +⟶ **21. Dimana L adalah kecenderungan dan ˆσ2 adalah estimasi dari asosiasi varian dan responnya.**
**22. Model selection** -⟶ +⟶ **22. Seleksi model**
**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:** -⟶ +⟶ **23. Kamus - Dalam memilih sebuah model, kita membagai tiga bagian data sebagai berikut:** -
+
**24. [Training set, Validation set, Testing set]** -⟶ +⟶ **24. [Data latih, data validasi, data uji]**
**25. [Model is trained, Model is assessed, Model gives predictions]** -⟶ +⟶ **25. [Model terlatih, model tervalidasi, prediksi model]**
**26. [Usually 80% of the dataset, Usually 20% of the dataset]** -⟶ +⟶ **26. [Biasanya data latih terdiri atas 80% dataset, sedangkan 20% merupakan data uji/tes]**
**27. [Also called hold-out or development set, Unseen data]** -⟶ +⟶ **27. [Disebut juga hold-out atau development set, Data yang belum pernah dilihat oleh model]**
**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:** -⟶ +⟶ **28. Setelah model dipilih, model dilatih dengan seluruh dataset dan diuji coba pada data yang belum pernah dijumpai. Hal ini direpresentasikan sebagai berikut:**
**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:** -⟶ +⟶ **29. Cross-validation ― Cross-validation, disebut juga sebagai CV, adalah metode yang digunakan untuk memilih model yang tidak bergantung pada dataset latih pertama. Perbedaannya diringkas dalam tabel berikut:**
**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**

-⟶
+⟶ **30. [Melatih pada k−1 fold dan melakukan penilaian pada satu fold sisanya, Melatih pada n−p observasi dan melakukan penilaian pada p observasi sisanya]**

<br>
**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]** -⟶ +⟶ **31. [Umumnya k=5 atau 10, kasus p=1 disebut sebagai leave-one-out]**
**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**

-⟶
+⟶ **32. Metode yang umum digunakan adalah k-fold cross-validation yang membagi data latih menjadi k fold untuk memvalidasi model pada salah satu fold sementara melatih model pada k−1 fold lainnya, diulang sebanyak k kali. Error kemudian dirata-ratakan pada k fold tersebut dan disebut sebagai cross-validation error.**

<br>
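As a rough illustration of k-fold cross-validation, a minimal sketch assuming scikit-learn is available; the classifier choice and the randomly generated dataset are arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy binary-classification data (random, for illustration only)
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

# k=5 folds: train on 4 folds, validate on the remaining one, repeat 5 times
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())  # the averaged score is the cross-validation estimate
```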
**33. Regularization ― The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:** -⟶ +⟶ **33. Regularisasi ― Prosedur regularisasi bertujuan untuk mencegah model overfit data dan isu variansi yang tinggi. Tabel berikut meringkas beberapa jenis teknik regularisasi:**
**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** -⟶ +⟶ **34. [Memangkas koefisien menjadi 0, bagus untuk pemilihan variabel, membuat koefisien menjadi lebih kecil. Adanya tradeoff antara pemilihan variabel dan koefisien yang kecil]**
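A short sketch of these regularization variants, assuming scikit-learn (Lasso for LASSO, Ridge for ridge, ElasticNet for the tradeoff); the data is synthetic and the penalty strengths are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = X[:, 0] * 3.0 + rng.randn(50)  # only the first feature actually matters

# L1 shrinks some coefficients exactly to 0, L2 makes them smaller,
# elastic net trades off between the two behaviours
for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```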
**35. Diagnostics** -⟶ +⟶ **35. Diagnostik**
From c819837b2bef40ca631eb63e3fb66f96363e06c4 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Wed, 10 Apr 2019 10:13:44 +0800 Subject: [PATCH 19/34] Translating into Bahasa Indonesia --- ...tsheet-machine-learning-tips-and-tricks.md | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/id/cheatsheet-machine-learning-tips-and-tricks.md b/id/cheatsheet-machine-learning-tips-and-tricks.md index 4f60574a4..2abebccd0 100644 --- a/id/cheatsheet-machine-learning-tips-and-tricks.md +++ b/id/cheatsheet-machine-learning-tips-and-tricks.md @@ -210,76 +210,76 @@ **36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.** -⟶ +⟶ **36. Bias ― Bias pada model adalah perbedaan antara hasil prediksi dengan hasil dari model yang seharusnya diprediksi pada data point yang diberikan (Underfitting).**
**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**

-⟶
+⟶ **37. Varian ― Varian sebuah model adalah variabilitas hasil prediksi model atas data poin yang diberikan (Overfitting).**

<br>
**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.** -⟶ +⟶ **38. Bias/variance tradeoff ― Semakin sederhana model, semakin tinggi bias, dan semakin kompleks model, semakin tinggi variannya.**
**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**

-⟶
+⟶ **39. [Gejala, Ilustrasi regresi, Ilustrasi klasifikasi, Ilustrasi deep learning, Kemungkinan solusi]**

<br>
**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**

-⟶
+⟶ **40. [Training error tinggi, Training error hampir sama dengan test error, Bias tinggi, Training error sedikit lebih rendah daripada test error, Training error sangat rendah, Training error jauh lebih rendah daripada test error, Varian yang tinggi]**

<br>
**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**

-⟶
+⟶ **41. [Membuat model lebih kompleks, Menambah fitur, Melatih lebih lama, Melakukan regularisasi, Menambah data]**

<br>
**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.** -⟶ +⟶ **42. Analisis error ― Analisis eror adalah analisa sebab utama terjadinya perbedaan performa antara model sekarang dengan model yang sempurna**
**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.** -⟶ +⟶ **43. Analisis ablatif ― Analisis ablatif adalah analisa sebab utama terjadinya perbedaan performa antara model yang sekarang dengan model baseline.**
**44. Regression metrics**

-⟶
+⟶ **44. Metrik regresi**

<br>
**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**

-⟶
+⟶ **45. [Metrik klasifikasi, Confusion matrix, Akurasi, Presisi, Recall, F1 score, ROC]**

<br>
**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**

-⟶
+⟶ **46. [Metrik regresi, R squared, Mallow's CP, AIC, BIC]**

<br>
**47. [Model selection, cross-validation, regularization]** -⟶ +⟶ **47. [Model selection, cross-validation, regularization]**
**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**

-⟶
+⟶ **48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**

From f3f225fdae5d09d90da9f340996422b3c81b754e Mon Sep 17 00:00:00 2001
From: Sony Wicaksono
Date: Thu, 11 Apr 2019 10:04:22 +0800
Subject: [PATCH 20/34] Translating cheatsheet-supervised-learning to Indonesia
 language.

---
 id/cheatsheet-supervised-learning.md | 38 ++++++++++++++--------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md
index a6b19ea1c..8a74dd41b 100644
--- a/id/cheatsheet-supervised-learning.md
+++ b/id/cheatsheet-supervised-learning.md
@@ -1,109 +1,109 @@
**1. Supervised Learning cheatsheet**

-⟶
+⟶ **1. Supervised Learning cheatsheet**

<br>
**2. Introduction to Supervised Learning** -⟶ +⟶ **2. Pengenalan Supervised Learning**
**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** -⟶ +⟶ **3. Diberikan sebuah kumpulan data poin {x(1),....,x(m)} yang berasosiasi dengan hasil {y(1),....,y(m)}, kita ingin membuat klasifikasi yang mempelajari bagaimana memprediksi nilai y dari x.**
**4. Type of prediction ― The different types of predictive models are summed up in the table below:** -⟶ +⟶ **4. Jenis prediksi ― Perbedaan jenis model prediksi diringkas dalam tabel berikut:**
**5. [Regression, Classifier, Outcome, Examples]** -⟶ +⟶ **5. [Regresi, klasifikasi, hasil, contoh]**
**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** -⟶ +⟶ **6. [Continues, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
**7. Type of model ― The different models are summed up in the table below:** -⟶ +⟶ **7. Jenis model ― Perbedaan antar model diringkas dalam tabel berikut:**
**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** -⟶ +⟶ **8. [Discriminative model, Generative model, Tujuan, Apa yang telah dipelajari, Ilustrasi, Contoh]**
**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** -⟶ +⟶ **9. [Estimasi langsung P(y|x), Estimasi P(x|y) untuk mendeduksi P(y|x), Decision boundary, Probabilitas distribusi data, Regresi, SVM, GDA, Naive Bayes]**
**10. Notations and general concepts** -⟶ +⟶ **10. Notasi dan konsep umum**
**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** -⟶ +⟶ **11. Hipotesis ― Hipotesis dinotasikan dengan hθ dan model yang kita pilih. Untuk input data x(i), hasil prediksi model adalah hθ(x(i)).**
**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** -⟶ +⟶ **12. Loss function ― Fungsi loss adalah sebuah fungsi L:(z,y)∈R×Y⟼L(z,y)∈R yang mengambil input sebagai prediksi nilai z yang berkorespondensi dengan nilai real y dan memberikan output perbedaan antara keduanya. Fungsi loss yang umum adalah sebagai berikut:**
**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**

-⟶
+⟶ **13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**

<br>
**14. [Linear regression, Logistic regression, SVM, Neural Network]** -⟶ +⟶ **14. [Linear regression, Logistic regression, SVM, Neural network]**
**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**

-⟶
+⟶ **15. Cost function ― Fungsi cost J umum digunakan untuk mengukur performa sebuah model, dan didefinisikan dengan fungsi loss L sebagai berikut:**

<br>
**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** -⟶ +⟶ **16. Gradient descent ― α∈R adalah tingkat pembelajaran (learning rate), aturan untuk memperbarui gradient descent diekspresikan dengan hubungan antara learning rate dan fungsi cost J sebagai berikut:**
**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**

-⟶
+⟶ **17. Catatan: Stochastic gradient descent (SGD) memperbarui parameter berdasarkan setiap contoh data latih, sedangkan batch gradient descent memperbaruinya berdasarkan satu batch contoh data latih.**

<br>
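A minimal NumPy sketch of the batch gradient-descent update rule applied to linear least squares (switching the gradient computation to one randomly drawn example per step would give SGD); all names and values here are illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.randn(100, 2)]   # design matrix with intercept column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.randn(100)

theta = np.zeros(3)
alpha = 0.1                                  # learning rate
for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)    # gradient of the MSE cost J
    theta -= alpha * grad                    # update rule: theta <- theta - alpha * grad J(theta)
print(theta)                                 # close to [1, 2, -3]
```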
**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** -⟶ - +⟶ **q8. Likelihoot ― Likelihood dari model L(θ) diberikan parameter θ digunakan untuk mencari parameter optimal θ dengan memaksimalkan nilai likelihood. Dalam praktiknya, kita menggunakan log-likehood +ℓ(θ)=log(L(θ)) yang memudahkan untuk optimalisasi.**
**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**

⟶

From 658ca3fb3ab7fc6b01770798f666a6a5d0031746 Mon Sep 17 00:00:00 2001
From: Sony Wicaksono
Date: Fri, 12 Apr 2019 10:13:04 +0800
Subject: [PATCH 21/34] Translating cheatsheet-supervised-learning

---
 id/cheatsheet-supervised-learning.md | 32 ++++++++++++++--------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md
index 8a74dd41b..ec253b517 100644
--- a/id/cheatsheet-supervised-learning.md
+++ b/id/cheatsheet-supervised-learning.md
@@ -102,97 +102,97 @@
**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**

-⟶ **q8. Likelihoot ― Likelihood dari model L(θ) diberikan parameter θ digunakan untuk mencari parameter optimal θ dengan memaksimalkan nilai likelihood. Dalam praktiknya, kita menggunakan log-likehood
+⟶ **18. Likelihood ― Likelihood dari model L(θ) diberikan parameter θ digunakan untuk mencari parameter optimal θ dengan memaksimalkan nilai likelihood. Dalam praktiknya, kita menggunakan log-likelihood
 ℓ(θ)=log(L(θ)) yang memudahkan untuk optimalisasi.**

<br>
**19. Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**

-⟶
+⟶ **19. Algoritma Newton ― Algoritma Newton adalah metode numerik yang mencari θ sedemikian sehingga ℓ′(θ)=0. Algoritma ini memperbarui dengan cara berikut:**

<br>
**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** -⟶ +⟶ **20. Catatan: generalisasi multidimensional, juga disebut sebagai Metode Newton-Raphson, cara kerjanya sebagai berikut:**
**21. Linear models** -⟶ +⟶ **21. Model linear**
**22. Linear regression** -⟶ +⟶ **22. Regresi linear**
**23. We assume here that y|x;θ∼N(μ,σ2)** -⟶ +⟶ **23. Asumsinya sebagai berikut: y|x;θ∼N(μ,σ2)**
**24. Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:**

-⟶
+⟶ **24. Persamaan normal ― Dengan X sebagai desain matriks, nilai θ yang meminimalkan fungsi cost merupakan solusi closed-form sebagai berikut:**

<br>
**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** -⟶ +⟶ **25. Algoritma LMS ― α adalah learning rate, perbaruan algoritma LMS untuk data training m, yang disebut juga Widrow-Hoff learning:**
**26. Remark: the update rule is a particular case of the gradient ascent.** -⟶ +⟶ **26. Catatan: perbaruan rule adalah contoh dari gradient ascent.**
**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**

-⟶
+⟶ **27. LWR ― Locally Weighted Regression, disebut juga LWR, adalah varian dari regresi linear yang memberi bobot pada setiap contoh data latih dalam fungsi cost-nya dengan w(i)(x), yang didefinisikan dengan parameter τ∈R sebagai:**

<br>
**28. Classification and logistic regression** -⟶ +⟶ **28. Klasifikasi dan logistic regression**
**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** -⟶ +⟶ **29. Fungsi sigmoid ― fungsi sigmoid g, disebut juga fungsi logistic, didefinisikan sebagai berikut:**
**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** -⟶ +⟶ **30. Logistic regression ― kita asumsikan bahwa y|x;θ∼Bernoulli(ϕ). Dengan bentuk sebagai berikut:**
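A small sketch of the sigmoid function and the resulting logistic-regression probability; the parameter vector and input are made-up values:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
x = np.array([1.0, 0.3, 0.8])        # one input (first entry acts as intercept term)
p = sigmoid(theta @ x)               # P(y=1 | x; theta)
print(p)
```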
**31. Remark: there is no closed form solution for the case of logistic regressions.** -⟶ +⟶ **31. Catatan: tidak ada bentuk solusi tertutup untuk kasus logistic regression.**
**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** -⟶ +⟶ **32. Softmax regression ― Softmax regression disebut juga sebagai multiclass logistic regression, digunakan untuk membuat logistic regression ketika terdapat lebih dari dua kelas output. Secara umum, kita men-set θK=0, yang membuat Bernoulli parameter ϕi pada setiap kelas i sama dengan:**
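A numerically stable softmax sketch for the multiclass case; the class scores are made-up values:

```python
import numpy as np

def softmax(scores):
    """Map class scores to probabilities that sum to one."""
    shifted = scores - scores.max()      # shift for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])       # hypothetical scores for 3 classes
print(softmax(scores))                   # roughly [0.66, 0.24, 0.10]
```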
**33. Generalized Linear Models** -⟶ +⟶ **33. Generalized Linear Models**
From b0fa838a61603c8342852996608bc807ca93d55b Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Mon, 15 Apr 2019 10:12:15 +0800 Subject: [PATCH 22/34] Translating cheatsheet supervised learning --- id/cheatsheet-supervised-learning.md | 44 ++++++++++++++-------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md index ec253b517..b3c2cb1e9 100644 --- a/id/cheatsheet-supervised-learning.md +++ b/id/cheatsheet-supervised-learning.md @@ -198,133 +198,133 @@ **34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** -⟶ +⟶ **34. Keluarga eksponensial ― Sebuah kelas distribusi disebut keluarga eksponensial jika ditulis dalam sebuah parameter natural, disebut juga sebagai parameter canonical atau link function, η, statistik yang memadai T(y) dan fungsi log-partition a(η) adalah sebagai berikut:**
**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** -⟶ +⟶ **35. Catatan: kita akan sering memiliki T(y)=y. Juga, exp(−a(η)) dapat dilihat sebagai parameter normalisasi yang memastikan bahwa jumlah dari nilai probabilitasnya adalah satu.**
**36. Here are the most common exponential distributions summed up in the following table:** -⟶ +⟶ **36. Distribusi eksponensial yang paling umum digunakan terdapat dalam tabel berikut:**
**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]** -⟶ +⟶ **37. [Distribusi, Bernoulli, Gaussian, Poisson, Geometric]**
**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** -⟶ +⟶ **38. Asumsi GLM ― Generalized Linear Models (GLM) bertujuan untuk memprediksi sebuah random variabel y sebagai fungsi fo x∈Rn+1 dan bergantung pada 3 asumsi berikut ini:**
**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**

-⟶
+⟶ **39. Catatan: ordinary least squares dan logistic regression adalah kasus spesial dari generalized linear models.**

<br>
**40. Support Vector Machines** -⟶ +⟶ **40. Support Vector Machines**
**41: The goal of support vector machines is to find the line that maximizes the minimum distance to the line.**

-⟶
+⟶ **41. Tujuan dari support vector machine adalah untuk menemukan garis yang memaksimalkan jarak minimum ke garis tersebut.**

<br>
**42: Optimal margin classifier ― The optimal margin classifier h is such that:** -⟶ +⟶ **42. Optimal margin classifier ― Optimal margin classifier h adalah sebagai berikut:**
**43: where (w,b)∈Rn×R is the solution of the following optimization problem:** -⟶ +⟶ **43: dimana (w, b)∈Rn×R adalah solusi dari masalah optimalisasi sebagai berikut:**
**44. such that**

-⟶
+⟶ **44. sedemikian sehingga**

<br>
**45. support vectors** -⟶ +⟶ **45. Support vectors**
**46. Remark: the line is defined as wTx−b=0.** -⟶ +⟶ **46. Catatan: garis didefinisikan sebagai wTx−b=0.**
**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**

-⟶
+⟶ **47. Hinge loss ― Hinge loss digunakan dalam setting SVM dan didefinisikan sebagai berikut:**

<br>
**48. Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** -⟶ +⟶ **48. Kernel ― Diberikan sebuah fitur mapping ϕ, kita mendefinisikan kernel K sebagai berikut:**
**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.**

-⟶
+⟶ **49. Pada praktiknya, kernel K yang didefinisikan dengan K(x,z)=exp(−||x−z||22σ2) disebut sebagai Gaussian kernel dan merupakan yang paling umum digunakan.**

<br>
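Minimal NumPy sketches of the hinge loss and the Gaussian (RBF) kernel; the inputs and the value of sigma are arbitrary:

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss L(z, y) = max(0, 1 - y*z), with labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = x - z
    return np.exp(-diff @ diff / (2 * sigma ** 2))

print(hinge_loss(z=0.3, y=1))                                     # 0.7
print(gaussian_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```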
**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** -⟶ +⟶ **50. [Non-linear separability, Penggunan kernel mapping, Decision boundary di original space]**
**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**

-⟶
+⟶ **51. Catatan: kita katakan bahwa kita menggunakan "trik kernel" untuk menghitung fungsi cost menggunakan kernel karena kita tidak perlu mengetahui pemetaan eksplisit ϕ, yang seringkali sangat rumit. Cukup nilai K(x,z) saja yang dibutuhkan.**

<br>
**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**

-⟶
+⟶ **52. Lagrangian ― Kita mendefinisikan Lagrangian L(w,b) sebagai berikut:**

<br>
**53. Remark: the coefficients βi are called the Lagrange multipliers.**

-⟶
+⟶ **53. Catatan: Koefisien βi disebut sebagai Lagrange multipliers.**

<br>
**54. Generative Learning** -⟶ +⟶ **54. Generative Learning**
**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** -⟶ +⟶ **55. Sebuah generative model pertama kali digunakan untuk mempelajari bagaimana data dihasilkan dengan mengestimasi P(x|y), yang kemudian digunakan untuk mengestimasi P(y|x) dengan aturan Bayes.**
From 29166931ad59d09b280032560b20ef85df76b95f Mon Sep 17 00:00:00 2001
From: Sony Wicaksono
Date: Mon, 15 Apr 2019 18:30:23 +0800
Subject: [PATCH 23/34] Translating into Bahasa Indonesia

---
 id/cheatsheet-supervised-learning.md | 30 ++++++++++++++--------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md
index b3c2cb1e9..405396556 100644
--- a/id/cheatsheet-supervised-learning.md
+++ b/id/cheatsheet-supervised-learning.md
@@ -330,91 +330,91 @@
**56. Gaussian Discriminant Analysis**

-⟶
+⟶ **56. Gaussian Discriminant Analysis**

<br>
**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** -⟶ +⟶ **57. Pengaturan ― Asumsi dari gaussian discriminant Analysis adalah y dan x|y=0 dan x|y=1 adalah sebagai berikut:**
**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** -⟶ +⟶ **58. Estimasi ― Tabel berikut meringkas estimasi ketika melakukan maksimalisasi kemungkinan:**
**59. Naive Bayes** -⟶ +⟶ **59. Naive Bayes**
**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** -⟶ +⟶ **60. Asumsi ― Model Naive Bayes menduga bahwa fitur dari setiap data point adalah independent:**
**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** -⟶ +⟶ **61. Solusi ― Memaksimalkan log-likelihood memberikan solusi berikut, dengan k∈{0,1},l∈[[1,L]]**
**62. Remark: Naive Bayes is widely used for text classification and spam detection.** -⟶ +⟶ **62. Catatan: Naive Bayes umum digunakan untuk klasifikasi teks dan deteksi spam.**
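As a quick illustration of Naive Bayes applied to text classification, a sketch assuming scikit-learn; the tiny spam/ham corpus and its labels are invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical spam/ham mini-corpus
texts = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]                     # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)              # word-count features
clf = MultinomialNB().fit(X, labels)      # features assumed independent given the class
print(clf.predict(vec.transform(["free money offer"])))
```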
**63. Tree-based and ensemble methods** -⟶ +⟶ **63. Tree-based dan ensemble methods**
**64. These methods can be used for both regression and classification problems.** -⟶ +⟶ **64. Metode ini digunakan untuk permasalahan regresi dan klasifikasi.**
**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.**

-⟶
+⟶ **65. CART ― Classification and Regression Trees (CART), umumnya disebut sebagai decision trees, dapat direpresentasikan sebagai pohon biner. Keuntungannya adalah sangat mudah diinterpretasi.**

<br>
**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**

-⟶
+⟶ **66. Random forest ― Merupakan teknik tree-based yang menggunakan sejumlah besar decision tree yang dibangun dari kumpulan fitur yang dipilih secara acak. Berbeda dengan simple decision tree, teknik ini sangat sulit diinterpretasi namun performanya yang secara umum bagus menjadikannya algoritma yang populer.**

<br>
**67. Remark: random forests are a type of ensemble methods.**

-⟶
+⟶ **67. Catatan: random forest adalah salah satu jenis metode ensemble.**

<br>
**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**

-⟶
+⟶ **68. Boosting ― Ide dari metode boosting adalah menggabungkan beberapa weak learner untuk membentuk learner yang lebih kuat. Jenis-jenis utamanya diringkas dalam tabel berikut:**

<br>
**69. [Adaptive boosting, Gradient boosting]** -⟶ +⟶ **69. [Adaptive boosting, Gradient boosting]
**70. High weights are put on errors to improve at the next boosting step** -⟶ +⟶ **70. Bobot yang tinggi diletakkan di tempat yang memiliki error untuk meningkatkan tahapan boosting berikutnya.**
From 941c9313a4a700570c6a7a06f175c74ee98a034b Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Tue, 16 Apr 2019 10:07:54 +0800 Subject: [PATCH 24/34] Translating into Indonesia Language --- id/cheatsheet-supervised-learning.md | 47 ++++++++++++++-------------- 1 file changed, 23 insertions(+), 24 deletions(-) diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md index 405396556..f97174dc0 100644 --- a/id/cheatsheet-supervised-learning.md +++ b/id/cheatsheet-supervised-learning.md @@ -408,7 +408,7 @@ **69. [Adaptive boosting, Gradient boosting]** -⟶ **69. [Adaptive boosting, Gradient boosting] +⟶ **69. [Adaptive boosting, Gradient boosting]**
@@ -420,121 +420,120 @@

**71. Weak learners trained on remaining errors**

-⟶
+⟶ **71. Weak learner dilatih pada error yang tersisa**

<br>
**72. Other non-parametric approaches** -⟶ +⟶ **72. Pendekatan non-parametrik lain**
**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** -⟶ +⟶ **73. K-nearest neighbors ― Algoritma k-nearest neighbors, biasa disebut k-NN, adalah pendekatan non-parametrik dimana respon terhadap data point ditentukan oleh apa yang terjadi di sekitar k dalam data latih. Algoritma ini digunakan pada klasifikasi dan regresi.**
**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** -⟶ +⟶ **74. Catatan: Semakin tinggi parameter k, semakin tinggi bias, dan semakin rendah parameter k, semakin tinggi variansinya.**
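A bare-bones k-nearest-neighbors classifier in NumPy (Euclidean distance, majority vote); the training points and the query point are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))  # -> 1
```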
**75. Learning Theory** -⟶ +⟶ **75. Teori pembelajaran**
**76. Union bound ― Let A1,...,Ak be k events. We have:** -⟶ +⟶ **76. Union bound ― Dimana A1,...,Ak dan event k. Kita memiliki:**
**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**

-⟶
-
+⟶ **77. Ketidaksamaan Hoeffding ― Misalkan Z1,...,Zm adalah m variabel iid yang diambil dari distribusi Bernoulli dengan parameter ϕ. Misalkan ˆϕ adalah rerata sampelnya dan γ>0 tetap. Kita memiliki:**

<br>
**78. Remark: this inequality is also known as the Chernoff bound.** -⟶ +⟶ **78. Catatan: ketidaksamaan ini juga dikenal sebagai Chernoff bound.**
**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** -⟶ +⟶ **79. Error latih ― diberikan klasifier h, kita mendefinisikan training error sebagai ˆϵ(h), disebut juga empirical risk atau empirical error, dikenal sebagai berikut:**
**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**

-⟶
+⟶ **80. Probably Approximately Correct (PAC) ― PAC adalah sebuah framework di mana banyak hasil dalam learning theory dibuktikan, dan memiliki sekumpulan asumsi sebagai berikut:**

<br>
-**81: the training and testing sets follow the same distribution ** +**81. the training and testing sets follow the same distribution.** -⟶ +⟶ **81. Data training dan testing mengikuti distribusi yang sama.**
**82. the training examples are drawn independently** -⟶ +⟶ **82. Contoh data training dihasilkan secara independen**
**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** -⟶ +⟶ **83. Shattering ― Diberikan sebuah set S={x(1),...,x(d)}, dan sebuah set klasifier H, kita dapat katakan bahwa H shatter S apabila setiap set dari label {y(1),...,y(d)}, sehingga:**
**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** -⟶ +⟶ **84. Teorem Upper bound ― Diberikan H merupakan kelas hipotesis dimana |H|=k, δ, dan sampel m adalah tetap. Sehingga probabilitas 1-δ adalah:**
**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**

-⟶
+⟶ **85. Dimensi VC ― Dimensi Vapnik-Chervonenkis (VC) dari sebuah kelas hipotesis H tak berhingga, dinotasikan VC(H), adalah ukuran set terbesar yang di-shatter oleh H.**

<br>
**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** -⟶ +⟶ **86. Catatan: dimensi VC dari H={set dari klasifier linear dalam 2 dimensi} adalah 3.**
**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** -⟶ +⟶ **87. Teorem (Vapnik) ― Diberikan H, dimana VC(H)=d dan m adalah jumlah data training. Dengan probabilitas paling minimal 1−δ, sehingga:**
**88. [Introduction, Type of prediction, Type of model]** -⟶ +⟶ **88. [Introduction, Jenis prediksi, jenis model]**
**89. [Notations and general concepts, loss function, gradient descent, likelihood]** -⟶ +⟶ **89. [Notasi dan konsep umum, fungsi loss, gradient descent, likelihood]**
**90. [Linear models, linear regression, logistic regression, generalized linear models]**

-⟶
+⟶ **90. [Model linear, regresi linear, regresi logistik, generalized linear models]**
From dbffb1a5d5b0e83c648ba56fa18b543db5ef588b Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Thu, 18 Apr 2019 10:08:32 +0800 Subject: [PATCH 25/34] Translating supervised learning into Indonesia language --- id/cheatsheet-supervised-learning.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/id/cheatsheet-supervised-learning.md b/id/cheatsheet-supervised-learning.md index f97174dc0..164168cc4 100644 --- a/id/cheatsheet-supervised-learning.md +++ b/id/cheatsheet-supervised-learning.md @@ -539,28 +539,28 @@ **91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** -⟶ +⟶ **91. [Support vector machines. Optimal margin classifier, Hinge loss, Kernel]**
**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** -⟶ +⟶ **92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
**93. [Trees and ensemble methods, CART, Random forest, Boosting]**

-⟶
+⟶ **93. [Trees and ensemble methods, CART, Random forest, Boosting]**

<br>
**94. [Other methods, k-NN]** -⟶ +⟶ **94. [Other methods, k-NN]**
**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]** -⟶ +⟶ **95. [Learning theory, Hoeffding inequality, PAC, VC dimension]** From 12a47ab59868d59478ff954ad357f7039ee74e7e Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Mon, 6 May 2019 22:00:56 +0700 Subject: [PATCH 26/34] Translate cheatsheet unspervised learning into Indonesia language --- id/cheatsheet-unsupervised-learning.md | 119 ++++++++++++------------- 1 file changed, 58 insertions(+), 61 deletions(-) mode change 100644 => 100755 id/cheatsheet-unsupervised-learning.md diff --git a/id/cheatsheet-unsupervised-learning.md b/id/cheatsheet-unsupervised-learning.md old mode 100644 new mode 100755 index 6daab3b21..1ae3a67c9 --- a/id/cheatsheet-unsupervised-learning.md +++ b/id/cheatsheet-unsupervised-learning.md @@ -1,340 +1,337 @@ **1. Unsupervised Learning cheatsheet** -⟶ +⟶ **1. Ringkasan Unsupervised Learning**
**2. Introduction to Unsupervised Learning** -⟶ +⟶ **2. Pengenalan Unsupervised Learning**
**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** -⟶ +⟶ **3. Motivasi ― Tujuan dari unsupervised learning adalah untuk menemukan pola dari data yang tidak memiliki label {x(1),..., x(m)}**
**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** -⟶ +⟶ **4. Ketidaksamaan Jensen ― Diberikan f yang merupakan fungsi yang konveks dan X adalah random variabel. Kita memiliki ketidaksamaan:**
**5. Clustering** -⟶ +⟶ **5. Clustering**
**6. Expectation-Maximization**

-⟶
+⟶ **6. Expectation-Maximization**

<br>
**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** -⟶ +⟶ **7. Latent variables ― Variabel laten adalah variabel yang tersembunyi atau belum dilakukan observasi sehingga membuat estimasi masalah menjadi lebih sulit, dan seringkali didenotasikan dengan z. Berikut merupakan setting dimana latent variables umum digunakan:** -
+
**8. [Setting, Latent variable z, Comments]**

-⟶
-
+⟶ **8. [Setting, Variabel laten z, Komentar]**

<br>
**9. [Mixture of k Gaussians, Factor analysis]**

-⟶
-
+⟶ **9. [Campuran dari k Gaussian, Analisis faktor]**

<br>
**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**

-⟶
+⟶ **10. Algoritma ― Algoritma Expectation-Maximization (EM) memberikan metode yang efisien dalam mengestimasi parameter θ melalui estimasi maximum likelihood dengan mengonstruksi secara berulang lower-bound dari likelihood (E-step) dan mengoptimalkan lower-bound tersebut (M-step) sebagai berikut:**

<br>
**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** -⟶ +⟶ **11. E-step: Mengevaluasi probabilitas posterior Qi(z(i)) yang setiap data point x(i) dari kluster khusus z(i) sebagai berikut:**
**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** -⟶ +⟶ **12. M-step: Menggunakan probabilitas posterior Qi(z(i)) sebagai kluster khusus pada bobot data poin x(i) untuk memisahkan perhitungan estimasi model setiap kluster sebagai berikut:**
**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** -⟶ +⟶ **13. [Gaussian initialization, Expectation step, Maximization step, Convergence]**
**14. k-means clustering** -⟶ +⟶ **14. k-means clustering**
**15. We note c(i) the cluster of data point i and μj the center of cluster j.** -⟶ +⟶ **15. Catatan: c(i) merupakan cluster data poin i dan μj merupakan pusat kluster j.**
**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**

-⟶
+⟶ **16. Algoritma ― Setelah secara random menginisialisasi centroid kluster μ1,μ2,...,μk∈Rn, algoritma k-means mengulangi langkah-langkah berikut sampai konvergen:**

<br>
**17. [Means initialization, Cluster assignment, Means update, Convergence]** -⟶ +⟶ **17. [Means initialization, Cluster assignment, Means update, Convergence]**
**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**

-⟶
+⟶ **18. Fungsi distorsi ― Untuk melihat apakah algoritma telah konvergen, kita melihat pada fungsi distorsi yang didefinisikan sebagai berikut:**

<br>
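A compact NumPy sketch of the k-means loop just described (random initialization, assignment step, update step) together with the distortion function; the two-blob dataset is synthetic:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])   # two synthetic blobs
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]       # random initialization

for _ in range(20):
    # assignment step: label each point with its closest centroid
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    # update step: move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

distortion = ((X - centroids[labels]) ** 2).sum()          # distortion function J
print(centroids, distortion)
```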
**19. Hierarchical clustering** -⟶ +⟶ **19. Hierarchical clustering**
**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**

-⟶
+⟶ **20. Algoritma ― Merupakan sebuah algoritma clustering dengan pendekatan hierarki aglomeratif yang membangun kluster bersarang secara berturut-turut.**

<br>
**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** -⟶ +⟶ **21. Jenis ― Terdapat beberapa perbedaan dari algoritma klustering yang bertujuan untuk mengoptimalisasi perbedaan fungsi-fungsi tertentu, yang diringkas dalam tabel berikut:**
**22. [Ward linkage, Average linkage, Complete linkage]**

-⟶
+⟶ **22. [Ward linkage, Average linkage, Complete linkage]**

<br>
**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**

-⟶
+⟶ **23. [Minimalisasi jarak di dalam kluster, Minimalisasi rata-rata jarak antar pasangan kluster, Minimalisasi jarak maksimum antar pasangan kluster]**

<br>
**24. Clustering assessment metrics**

-⟶
+⟶ **24. Metrik penilaian kluster**

<br>
**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**

-⟶
+⟶ **25. Dalam setting unsupervised learning, seringkali sulit untuk menilai performa sebuah model karena kita tidak memiliki label ground truth sebagaimana pada setting supervised learning.**

<br>
**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**

-⟶
+⟶ **26. Koefisien Silhouette ― Dengan notasi a dan b yang merupakan rerata jarak antara sebuah sampel dan semua poin lain dalam kelas yang sama, dan antara sampel tersebut dan semua poin dalam kluster terdekat berikutnya, koefisien silhouette s untuk sebuah sampel tunggal didefinisikan sebagai berikut:**

<br>
**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**

-⟶
+⟶ **27. Indeks Calinski-Harabaz ― Dengan notasi k yang merupakan jumlah kluster, Bk dan Wk masing-masing merupakan matriks dispersi antar-kluster dan dalam-kluster yang didefinisikan sebagai berikut:**

<br>
**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**

-⟶
+⟶ **28. Indeks Calinski-Harabaz s(k) mengindikasikan seberapa baik sebuah model clustering mendefinisikan kluster-klusternya, sehingga semakin tinggi nilainya, semakin padat dan semakin terpisah dengan baik klusternya. Didefinisikan sebagai berikut:**

<br>
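A short sketch computing both clustering assessment metrics with scikit-learn on synthetic data; the clustering itself is obtained with k-means for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])   # two synthetic blobs

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))           # closer to 1 = better separated clusters
print(calinski_harabasz_score(X, labels))    # higher = denser, better separated clusters
```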
**29. Dimension reduction** -⟶ +⟶ **29. Pengurangan Dimensi (Dimension reduction)**
**30. Principal component analysis** -⟶ +⟶ **30. Principal component analysis**
**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** -⟶ +⟶ **31. Merupakan teknik pengurangan dimensi dengan menemukan varian maksimal arah yang memproyeksikan data.**
**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** -⟶ - +⟶ **32. Eigenvalue, eigenvector ― Diberikan sebuah matriks A∈Rn×n, λ merupakan sebuah eigenvalue dari A dimana terdapat sebuah vektor z∈Rn∖{0}, disebut eigenvector, sehingga kita memiliki:**
**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** -⟶ +⟶ **33. Theorema Spectral ― Diberikan A∈Rn×n. Jika A adalah simetris, maka A bersifat diagonal dengan sebuah orthogonal matriks U∈Rn×n. Dengan notasi Λ=diag(λ1,...,λn), kita memiliki:**
**34. diagonal** -⟶ +⟶ **34. Diagonal**
**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** -⟶ +⟶ **35. Catatan: Eigenvector diasosiasikan dengan eigenvalue terbesar disebut prinsipal eigenvector dari matriks A.**
**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**

-⟶
+⟶ **36. Algoritma ― Prosedur Principal Component Analysis (PCA) adalah teknik pengurangan dimensi yang memproyeksikan data pada k dimensi dengan memaksimalkan variansi dari data sebagai berikut:**

<br>
**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** -⟶ +⟶ **37. Langkah 1: Normalisasi data yang memiliki rerata 0 dan standar deviasi 1.**
**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** -⟶ +⟶ **38. Langkah 2: Hitung Σ=1mm∑i=1x(i)x(i)T∈Rn×n, yang merupakan simetris dengan nilai real eigenvalues.**
**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** -⟶ +⟶ **39. Langkah 3: Hitung u1,...,uk∈Rn dimana k merupakan prinsipal ortoghonal eigenvector dari Σ, sebagai contohnya nilai eigenvector dari eigenvalues k terbesar.**
**40. Step 4: Project the data on spanR(u1,...,uk).** -⟶ +⟶ **40. Langkah 4: Proyeksikan data pada spanR(u1,...,uk).**
**41. This procedure maximizes the variance among all k-dimensional spaces.** -⟶ +⟶ **41. Prosedur tersebut memaksimalkan variansi nilai diantara semua ruang k-dimensional.**
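The four PCA steps above as a minimal NumPy sketch; the data is synthetic and k is the number of components kept:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ np.diag([3.0, 1.0, 0.1])   # synthetic data with unequal variances

# Step 1: normalize to mean 0 and standard deviation 1
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: empirical covariance matrix
sigma = Xn.T @ Xn / len(Xn)

# Step 3: k orthogonal principal eigenvectors (largest eigenvalues)
eigvals, eigvecs = np.linalg.eigh(sigma)           # eigh: symmetric matrix, ascending order
k = 2
U = eigvecs[:, -k:]

# Step 4: project the data onto span(u1, ..., uk)
Z = Xn @ U
print(Z.shape)                                     # (200, 2)
```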
**42. [Data in feature space, Find principal components, Data in principal components space]** -⟶ +⟶ **42. [Data dalam ruang fitur, Temukan prinsipal komponen, Data di ruang prinsipal komponen]**
**43. Independent component analysis** -⟶ +⟶ **43. Analisis komponen independen**
**44. It is a technique meant to find the underlying generating sources.** -⟶ +⟶ **44. Merupakan teknik yang bermaksud untuk menemukan sumber paling dasar.**
**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** -⟶ +⟶ **45. Asumsi ― Kita mengasumsikan bahwa data kita x telah dibuat dengan sumber n-dimensional vector s=(s1,...,sn), dimana si adalah variabel random independen, melalui mixing dan matriks non-singular sebagai berikut:**
**46. The goal is to find the unmixing matrix W=A−1.** -⟶ +⟶ **46. Tujuannya adalah untuk menemukan unmixing matrix dari W=A-1**
**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**

-⟶
+⟶ **47. Algoritma ICA Bell dan Sejnowski ― Algoritma ini menemukan unmixing matrix W dengan mengikuti langkah-langkah berikut:**

<br>
**48. Write the probability of x=As=W−1s as:** -⟶ +⟶ **48. Tulis probabilitas dari x=As=W−1s sebagai berikut:**
**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** -⟶ +⟶ **49. Tulis kecenderungan log dari data latih {x(i),i∈[[1,m]]} dan dengan notasi g yang merupakan fungsi sigmoid sebagai berikut:**
**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** -⟶ +⟶ **50. Sehingga, aturan learning dari stochastic gradient ascent adalah bahwa setiap contoh data latih x(i), kita memperbarui W sebagai berikut:**
**51. The Machine Learning cheatsheets are now available in [target language].** -⟶ +⟶ **51. Catatan ringkas machine learning ini terdapat dalam versi bahasa Indonesia.**
**52. Original authors** -⟶ +⟶ **52. Penulis Asli: Shervine Amidi**
**53. Translated by X, Y and Z** -⟶ +⟶ **53. Diterjemahkan oleh Sony Wicaksono**
**54. Reviewed by X, Y and Z** -⟶ +⟶ **54. Disunting oleh X, Y, dan Z**
**55. [Introduction, Motivation, Jensen's inequality]** -⟶ +⟶ **55. [Pengenalan, motivasi, Pertidaksamaan Jensen]**
**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**

-⟶
+⟶ **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**

<br>
**57. [Dimension reduction, PCA, ICA]** -⟶ +⟶ **57. [Pengurangan Dimensi, PCA, ICA]** From c07c746fae6cb8c315a282ecba25af6d722fbdca Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Mon, 6 May 2019 22:09:30 +0700 Subject: [PATCH 27/34] Translating cheatsheet unspervised learning into Indonesia --- id/cheatsheet-unsupervised-learning.md | 119 ++++++++++++------------- 1 file changed, 58 insertions(+), 61 deletions(-) diff --git a/id/cheatsheet-unsupervised-learning.md b/id/cheatsheet-unsupervised-learning.md index 6daab3b21..1ae3a67c9 100644 --- a/id/cheatsheet-unsupervised-learning.md +++ b/id/cheatsheet-unsupervised-learning.md @@ -1,340 +1,337 @@ **1. Unsupervised Learning cheatsheet** -⟶ +⟶ **1. Ringkasan Unsupervised Learning**
**2. Introduction to Unsupervised Learning** -⟶ +⟶ **2. Pengenalan Unsupervised Learning**
**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** -⟶ +⟶ **3. Motivasi ― Tujuan dari unsupervised learning adalah untuk menemukan pola dari data yang tidak memiliki label {x(1),..., x(m)}**
**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** -⟶ +⟶ **4. Ketidaksamaan Jensen ― Diberikan f yang merupakan fungsi yang konveks dan X adalah random variabel. Kita memiliki ketidaksamaan:**
**5. Clustering** -⟶ +⟶ **5. Clustering**
**6. Expectation-Maximization**

-⟶
+⟶ **6. Expectation-Maximization**

<br>
**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** -⟶ +⟶ **7. Latent variables ― Variabel laten adalah variabel yang tersembunyi atau belum dilakukan observasi sehingga membuat estimasi masalah menjadi lebih sulit, dan seringkali didenotasikan dengan z. Berikut merupakan setting dimana latent variables umum digunakan:** -
+
**8. [Setting, Latent variable z, Comments]**

-⟶
-
+⟶ **8. [Setting, Variabel laten z, Komentar]**

<br>
**9. [Mixture of k Gaussians, Factor analysis]**

-⟶
-
+⟶ **9. [Campuran dari k Gaussian, Analisis faktor]**

<br>
**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**

-⟶
+⟶ **10. Algoritma ― Algoritma Expectation-Maximization (EM) memberikan metode yang efisien dalam mengestimasi parameter θ melalui estimasi maximum likelihood dengan mengonstruksi secara berulang lower-bound dari likelihood (E-step) dan mengoptimalkan lower-bound tersebut (M-step) sebagai berikut:**

<br>
**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** -⟶ +⟶ **11. E-step: Mengevaluasi probabilitas posterior Qi(z(i)) yang setiap data point x(i) dari kluster khusus z(i) sebagai berikut:**
**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** -⟶ +⟶ **12. M-step: Menggunakan probabilitas posterior Qi(z(i)) sebagai kluster khusus pada bobot data poin x(i) untuk memisahkan perhitungan estimasi model setiap kluster sebagai berikut:**
**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]** -⟶ +⟶ **13. [Gaussian initialization, Expectation step, Maximization step, Convergence]**
**14. k-means clustering** -⟶ +⟶ **14. k-means clustering**
**15. We note c(i) the cluster of data point i and μj the center of cluster j.** -⟶ +⟶ **15. Catatan: c(i) merupakan cluster data poin i dan μj merupakan pusat kluster j.**
**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**

-⟶
+⟶ **16. Algoritma ― Setelah secara random menginisialisasi centroid kluster μ1,μ2,...,μk∈Rn, algoritma k-means mengulangi langkah-langkah berikut sampai konvergen:**

<br>
**17. [Means initialization, Cluster assignment, Means update, Convergence]** -⟶ +⟶ **17. [Means initialization, Cluster assignment, Means update, Convergence]**
**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**

-⟶
+⟶ **18. Fungsi distorsi ― Untuk melihat apakah algoritma telah konvergen, kita melihat pada fungsi distorsi yang didefinisikan sebagai berikut:**

<br>
**19. Hierarchical clustering** -⟶ +⟶ **19. Hierarchical clustering**
**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.**

-⟶
+⟶ **20. Algoritma ― Merupakan sebuah algoritma clustering dengan pendekatan hierarki aglomeratif yang membangun kluster bersarang secara berturut-turut.**

<br>
**21. Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** -⟶ +⟶ **21. Jenis ― Terdapat beberapa perbedaan dari algoritma klustering yang bertujuan untuk mengoptimalisasi perbedaan fungsi-fungsi tertentu, yang diringkas dalam tabel berikut:**
**22. [Ward linkage, Average linkage, Complete linkage]**

-⟶
+⟶ **22. [Ward linkage, Average linkage, Complete linkage]**

<br>
**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]**

-⟶
+⟶ **23. [Minimalisasi jarak di dalam kluster, Minimalisasi rata-rata jarak antar pasangan kluster, Minimalisasi jarak maksimum antar pasangan kluster]**

<br>
**24. Clustering assessment metrics**

-⟶
+⟶ **24. Metrik penilaian kluster**

<br>
**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**

-⟶
+⟶ **25. Dalam setting unsupervised learning, seringkali sulit untuk menilai performa sebuah model karena kita tidak memiliki label ground truth sebagaimana pada setting supervised learning.**

<br>
**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**

-⟶
+⟶ **26. Koefisien Silhouette ― Dengan notasi a dan b yang merupakan rerata jarak antara sebuah sampel dan semua poin lain dalam kelas yang sama, dan antara sampel tersebut dan semua poin dalam kluster terdekat berikutnya, koefisien silhouette s untuk sebuah sampel tunggal didefinisikan sebagai berikut:**

<br>
**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**

-⟶
+⟶ **27. Indeks Calinski-Harabaz ― Dengan notasi k yang merupakan jumlah kluster, Bk dan Wk masing-masing merupakan matriks dispersi antar-kluster dan dalam-kluster yang didefinisikan sebagai berikut:**

<br>
**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**

-⟶
+⟶ **28. Indeks Calinski-Harabaz s(k) mengindikasikan seberapa baik sebuah model clustering mendefinisikan kluster-klusternya, sehingga semakin tinggi nilainya, semakin padat dan semakin terpisah dengan baik klusternya. Didefinisikan sebagai berikut:**

<br>
**29. Dimension reduction** -⟶ +⟶ **29. Pengurangan Dimensi (Dimension reduction)**
**30. Principal component analysis** -⟶ +⟶ **30. Principal component analysis**
**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** -⟶ +⟶ **31. Merupakan teknik pengurangan dimensi dengan menemukan varian maksimal arah yang memproyeksikan data.**
**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** -⟶ - +⟶ **32. Eigenvalue, eigenvector ― Diberikan sebuah matriks A∈Rn×n, λ merupakan sebuah eigenvalue dari A dimana terdapat sebuah vektor z∈Rn∖{0}, disebut eigenvector, sehingga kita memiliki:**
**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** -⟶ +⟶ **33. Theorema Spectral ― Diberikan A∈Rn×n. Jika A adalah simetris, maka A bersifat diagonal dengan sebuah orthogonal matriks U∈Rn×n. Dengan notasi Λ=diag(λ1,...,λn), kita memiliki:**
**34. diagonal** -⟶ +⟶ **34. Diagonal**
**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.** -⟶ +⟶ **35. Catatan: Eigenvector diasosiasikan dengan eigenvalue terbesar disebut prinsipal eigenvector dari matriks A.**
**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:**

-⟶
+⟶ **36. Algoritma ― Prosedur Principal Component Analysis (PCA) adalah teknik pengurangan dimensi yang memproyeksikan data pada k dimensi dengan memaksimalkan variansi dari data sebagai berikut:**

<br>
**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** -⟶ +⟶ **37. Langkah 1: Normalisasi data yang memiliki rerata 0 dan standar deviasi 1.**
**38. Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** -⟶ +⟶ **38. Langkah 2: Hitung Σ=1mm∑i=1x(i)x(i)T∈Rn×n, yang merupakan simetris dengan nilai real eigenvalues.**
**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** -⟶ +⟶ **39. Langkah 3: Hitung u1,...,uk∈Rn dimana k merupakan prinsipal ortoghonal eigenvector dari Σ, sebagai contohnya nilai eigenvector dari eigenvalues k terbesar.**
**40. Step 4: Project the data on spanR(u1,...,uk).** -⟶ +⟶ **40. Langkah 4: Proyeksikan data pada spanR(u1,...,uk).**
**41. This procedure maximizes the variance among all k-dimensional spaces.** -⟶ +⟶ **41. Prosedur tersebut memaksimalkan variansi nilai diantara semua ruang k-dimensional.**
**42. [Data in feature space, Find principal components, Data in principal components space]** -⟶ +⟶ **42. [Data dalam ruang fitur, Temukan prinsipal komponen, Data di ruang prinsipal komponen]**
**43. Independent component analysis** -⟶ +⟶ **43. Analisis komponen independen**
**44. It is a technique meant to find the underlying generating sources.** -⟶ +⟶ **44. Merupakan teknik yang bermaksud untuk menemukan sumber paling dasar.**
**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** -⟶ +⟶ **45. Asumsi ― Kita mengasumsikan bahwa data kita x telah dibuat dengan sumber n-dimensional vector s=(s1,...,sn), dimana si adalah variabel random independen, melalui mixing dan matriks non-singular sebagai berikut:**
**46. The goal is to find the unmixing matrix W=A−1.** -⟶ +⟶ **46. Tujuannya adalah untuk menemukan unmixing matrix dari W=A-1**
**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**

-⟶
+⟶ **47. Algoritma ICA Bell dan Sejnowski ― Algoritma ini menemukan unmixing matrix W dengan mengikuti langkah-langkah berikut:**

<br>
**48. Write the probability of x=As=W−1s as:** -⟶ +⟶ **48. Tulis probabilitas dari x=As=W−1s sebagai berikut:**
**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** -⟶ +⟶ **49. Tulis kecenderungan log dari data latih {x(i),i∈[[1,m]]} dan dengan notasi g yang merupakan fungsi sigmoid sebagai berikut:**
**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** -⟶ +⟶ **50. Sehingga, aturan learning dari stochastic gradient ascent adalah bahwa setiap contoh data latih x(i), kita memperbarui W sebagai berikut:**
**51. The Machine Learning cheatsheets are now available in [target language].** -⟶ +⟶ **51. Catatan ringkas machine learning ini terdapat dalam versi bahasa Indonesia.**
**52. Original authors** -⟶ +⟶ **52. Penulis Asli: Shervine Amidi**
**53. Translated by X, Y and Z** -⟶ +⟶ **53. Diterjemahkan oleh Sony Wicaksono**
**54. Reviewed by X, Y and Z** -⟶ +⟶ **54. Disunting oleh X, Y, dan Z**
**55. [Introduction, Motivation, Jensen's inequality]** -⟶ +⟶ **55. [Pengenalan, Motivasi, Pertidaksamaan Jensen]**
**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]** -⟶ +⟶ **56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
**57. [Dimension reduction, PCA, ICA]** -⟶ +⟶ **57. [Pengurangan Dimensi, PCA, ICA]** From 625b6b6046c874fb9429323bfc24c4c389188c92 Mon Sep 17 00:00:00 2001 From: Shervine Amidi Date: Wed, 29 May 2019 00:31:05 -0700 Subject: [PATCH 28/34] Delete deep-learning-tips-and-tricks.md --- id/deep-learning-tips-and-tricks.md | 457 ---------------------------- 1 file changed, 457 deletions(-) delete mode 100644 id/deep-learning-tips-and-tricks.md diff --git a/id/deep-learning-tips-and-tricks.md b/id/deep-learning-tips-and-tricks.md deleted file mode 100644 index 347234ec2..000000000 --- a/id/deep-learning-tips-and-tricks.md +++ /dev/null @@ -1,457 +0,0 @@ -**Deep Learning Tips and Tricks translation** - -
- -**1. Deep Learning Tips and Tricks cheatsheet** - -⟶ - -
- - -**2. CS 230 - Deep Learning** - -⟶ - -
- - -**3. Tips and tricks** - -⟶ - -
- - -**4. [Data processing, Data augmentation, Batch normalization]** - -⟶ - -
- - -**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]** - -⟶ - -
- - -**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]** - -⟶ - -
- - -**7. [Regularization, Dropout, Weight regularization, Early stopping]** - -⟶ - -
- - -**8. [Good practices, Overfitting small batch, Gradient checking]** - -⟶ - -
- - -**9. View PDF version on GitHub** - -⟶ - -
- - -**10. Data processing** - -⟶ - -
- - -**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:** - -⟶ - -
- - -**12. [Original, Flip, Rotation, Random crop]** - -⟶ - -
- - -**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]** - -⟶ - -
- - -**14. [Color shift, Noise addition, Information loss, Contrast change]** - -⟶ - -
- - -**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]** - -⟶ - -
- - -**16. Remark: data is usually augmented on the fly during training.** - -⟶ - -
- - -**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** - -⟶ - -
- - -**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** - -⟶ - -
- - -**19. Training a neural network** - -⟶ - -
- - -**20. Definitions** - -⟶ - -
- - -**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.** - -⟶ - -
- - -**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.** - -⟶ - -
- - -**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.** - -⟶ - -
- - -**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** - -⟶ - -
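Entry 24 refers to the standard binary cross-entropy L(z,y)=−[ylog(z)+(1−y)log(1−z)]; a small sketch, with the clipping constant and function name as assumptions, could look like this:

```python
import numpy as np

def cross_entropy_loss(z, y, eps=1e-12):
    """Binary cross-entropy L(z, y) = -[y*log(z) + (1 - y)*log(1 - z)], averaged over a batch."""
    z = np.clip(z, eps, 1 - eps)          # keep log() away from 0
    return float(np.mean(-(y * np.log(z) + (1 - y) * np.log(1 - z))))

y = np.array([1.0, 0.0, 1.0])             # true labels
z = np.array([0.9, 0.2, 0.6])             # model outputs
print(cross_entropy_loss(z, y))
```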
- - -**25. Finding optimal weights** - -⟶ - -
- - -**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** - -⟶ - -
- - -**27. Using this method, each weight is updated with the rule:** - -⟶ - -
- - -**28. Updating weights ― In a neural network, weights are updated as follows:** - -⟶ - -
- - -**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]** - -⟶ - -
- - -**30. [Forward propagation, Backpropagation, Weights update]** - -⟶ - -
- - -**31. Parameter tuning** - -⟶ - -
- - -**32. Weights initialization** - -⟶ - -
- - -**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.** - -⟶ - -
- - -**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:** - -⟶ - -
- - -**35. [Training size, Illustration, Explanation]** - -⟶ - -
- - -**36. [Small, Medium, Large]** - -⟶ - -
- - -**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]** - -⟶ - -
- - -**38. Optimizing convergence** - -⟶ - -
- - -**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate. -** - -⟶ - -
- - -**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:** - -⟶ - -
- - -**41. [Method, Explanation, Update of w, Update of b]** - -⟶ - -
- - -**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]** - -⟶ - -
- - -**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]** - -⟶ - -
- - -**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]** - -⟶ - -
- - -**45. Remark: other methods include Adadelta, Adagrad and SGD.** - -⟶ - -
- - -**46. Regularization** - -⟶ - -
- - -**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.** - -⟶ - -
- - -**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.** - -⟶ - -
- - -**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:** - -⟶ - -
- - -**50. [LASSO, Ridge, Elastic Net]** - -⟶ - -
- -**50 bis. Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]** - -⟶ - -
- -**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.** - -⟶ - -
- - -**52. [Error, Validation, Training, early stopping, Epochs]** - -⟶ - -
- - -**53. Good practices** - -⟶ - -
- - -**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.** - -⟶ - -
- - -**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.** - -⟶ - -
- - -**56. [Type, Numerical gradient, Analytical gradient]** - -⟶ - -
- - -**57. [Formula, Comments]** - -⟶ - -
- - -**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]** - -⟶ - -
- - -**59. ['Exact' result, Direct computation, Used in the final implementation]** - -⟶ - -
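Entries 55 to 59 can be illustrated with a short centered-difference gradient check; the helper name and the quadratic test function are assumptions, and the code shows why the numerical gradient is expensive (the loss is evaluated twice per dimension):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference approximation of the gradient of f at x (two evaluations per dimension)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# sanity-check an analytical gradient, e.g. for f(x) = ||x||^2 whose gradient is 2x
x = np.random.randn(4)
assert np.allclose(numerical_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-4)
```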
- - -**60. The Deep Learning cheatsheets are now available in [target language]. - -⟶ - - -**61. Original authors** - -⟶ - -
- -**62.Translated by X, Y and Z** - -⟶ - -
- -**63.Reviewed by X, Y and Z** - -⟶ - -
- -**64.View PDF version on GitHub** - -⟶ - -
- -**65.By X and Y** - -⟶ - -
From 9b578a9b12227278d639284db6347d9faf40cd54 Mon Sep 17 00:00:00 2001 From: Shervine Amidi Date: Wed, 29 May 2019 00:31:20 -0700 Subject: [PATCH 29/34] Delete convolutional-neural-networks.md --- id/convolutional-neural-networks.md | 716 ---------------------------- 1 file changed, 716 deletions(-) delete mode 100644 id/convolutional-neural-networks.md diff --git a/id/convolutional-neural-networks.md b/id/convolutional-neural-networks.md deleted file mode 100644 index 1b1283628..000000000 --- a/id/convolutional-neural-networks.md +++ /dev/null @@ -1,716 +0,0 @@ -**Convolutional Neural Networks translation** - -
- -**1. Convolutional Neural Networks cheatsheet** - -⟶ - -
- - -**2. CS 230 - Deep Learning** - -⟶ - -
- - -**3. [Overview, Architecture structure]** - -⟶ - -
- - -**4. [Types of layer, Convolution, Pooling, Fully connected]** - -⟶ - -
- - -**5. [Filter hyperparameters, Dimensions, Stride, Padding]** - -⟶ - -
- - -**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]** - -⟶ - -
- - -**7. [Activation functions, Rectified Linear Unit, Softmax]** - -⟶ - -
- - -**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]** - -⟶ - -
- - -**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]** - -⟶ - -
- - -**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]** - -⟶ - -
- - -**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]** - -⟶ - -
- - -**12. Overview** - -⟶ - -
- - -**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:** - -⟶ - -
- - -**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.** - -⟶ - -
- - -**15. Types of layer** - -⟶ - -
- - -**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.** - -⟶ - -
- - -**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.** - -⟶ - -
- - -**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.** - -⟶ - -
- - -**19. [Type, Purpose, Illustration, Comments]** - -⟶ - -
- - -**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]** - -⟶ - -
- - -**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]** - -⟶ - -
- - -**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.** - -⟶ - -
- - -**23. Filter hyperparameters** - -⟶ - -
- - -**24. The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.** - -⟶ - -
- - -**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is a F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.** - -⟶ - -
- - -**26. Filter** - -⟶ - -
- - -**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.** - -⟶ - -
- - -**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.** - -⟶ - -
- - -**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:** - -⟶ - -
- - -**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]** - -⟶ - -
- - -**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈IS⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]** - -⟶ - -
- - -**32. Tuning hyperparameters** - -⟶ - -
- - -**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:** - -⟶ - -
- - -**34. [Input, Filter, Output]** - -⟶ - -
- - -**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.** - -⟶ - -
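Assuming the omitted formula takes its usual form O=(I−F+Pstart+Pend)/S+1, entry 33 can be sketched in a few lines (the function name is illustrative):

```python
def conv_output_size(i, f, p_start, p_end, s):
    """One spatial dimension: O = (I - F + Pstart + Pend) / S + 1."""
    return (i - f + p_start + p_end) // s + 1

# a 32x32 input, 5x5 filter, padding of 2 on each side, stride 1 keeps the size at 32
assert conv_output_size(32, 5, 2, 2, 1) == 32
```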
- - -**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:** - -⟶ - -
- - -**37. [Illustration, Input size, Output size, Number of parameters, Remarks]** - -⟶ - -
- - -**38. [One bias parameter per filter, In most cases, S<F]** - -⟶ - - -**39. [Pooling operation done channel-wise, In most cases, S=F]** - -⟶ - -
- - -**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]** - -⟶ - -
- - -**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:** - -⟶ - -
- - -**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.** - -⟶ - -
- - -**43. Commonly used activation functions** - -⟶ - -
- - -**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:** - -⟶ - -
- - -**45. [ReLU, Leaky ReLU, ELU, with]** - -⟶ - -
- - -**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]** - -⟶ - -
- - -**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:** - -⟶ - -
- - -**48. where** - -⟶ - -
- - -**49. Object detection** - -⟶ - -
- - -**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:** - -⟶ - -
- - -**51. [Image classification, Classification w. localization, Detection]** - -⟶ - -
- - -**52. [Teddy bear, Book]** - -⟶ - -
- - -**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]** - -⟶ - -
- - -**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]** - -⟶ - -
- - -**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:** - -⟶ - -
- - -**56. [Bounding box detection, Landmark detection]** - -⟶ - -
- - -**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]** - -⟶ - -
- - -**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]** - -⟶ - -
- - -**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:** - -⟶ - -
- - -**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.** - -⟶ - -
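Entries 59 and 60 can be illustrated with a small IoU helper; for brevity it assumes boxes given as (x1,y1,x2,y2) corners rather than the center form (bx,by,bh,bw) used in entry 58:

```python
def iou(box_p, box_a):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1 = max(box_p[0], box_a[0]), max(box_p[1], box_a[1])
    x2, y2 = min(box_p[2], box_a[2]), min(box_p[3], box_a[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, well below the 0.5 threshold
```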
- - -**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.** - -⟶ - -
- - -**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:** - -⟶ - -
- - -**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]** - -⟶ - -
- - -**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]** - -⟶ - -
- - -**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:** - -⟶ - -
- - -**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]** - -⟶ - -
- - -**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bouding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.** - -⟶ - -
- - -**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.** - -⟶ - -
- - -**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]** - -⟶ - -
- - -**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.** - -⟶ - -
- - -**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes.** - -⟶ - -
- - -**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]** - -⟶ - -
- - -**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.** - -⟶ - -
- - -**74. Face verification and recognition** - -⟶ - -
- - -**75. Types of models ― Two main types of model are summed up in table below:** - -⟶ - -
- - -**76. [Face verification, Face recognition, Query, Reference, Database]** - -⟶ - -
- - -**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]** - -⟶ - -
- - -**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).** - -⟶ - -
- - -**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).** - -⟶ - -
- - -**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:** - -⟶ - -
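A minimal sketch of the triplet loss of entry 80, assuming squared Euclidean distance between the embeddings of A, P and N and treating the function name as illustrative:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(d(A, P) - d(A, N) + alpha, 0) with squared Euclidean distance between embeddings."""
    d_ap = np.sum((f_a - f_p) ** 2)   # anchor vs positive (same class)
    d_an = np.sum((f_a - f_n) ** 2)   # anchor vs negative (other class)
    return max(d_ap - d_an + alpha, 0.0)
```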
- - -**81. Neural style transfer** - -⟶ - -
- - -**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.** - -⟶ - -
- - -**83. [Content C, Style S, Generated image G]** - -⟶ - -
- - -**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc** - -⟶ - -
- - -**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:** - -⟶ - -
- - -**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:** - -⟶ - -
- - -**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.** - -⟶ - -
- - -**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:** - -⟶ - -
- - -**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:** - -⟶ - -
- - -**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.** - -⟶ - -
- - -**91. Architectures using computational tricks** - -⟶ - -
- - -**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image.** - -⟶ - -
- - -**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]** - -⟶ - -
- - -**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.** - -⟶ - -
- - -**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:** - -⟶ - -
- - -**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.** - -⟶ - -
- - -**97. The Deep Learning cheatsheets are now available in [target language].** - -⟶ - -
- - -**98. Original authors** - -⟶ - -
- - -**99. Translated by X, Y and Z** - -⟶ - -
- - -**100. Reviewed by X, Y and Z** - -⟶ - -
- - -**101. View PDF version on GitHub** - -⟶ - -
- - -**102. By X and Y** - -⟶ - -
From 981c088a152d54a8fee9300a9f7327fc23dbb229 Mon Sep 17 00:00:00 2001 From: Shervine Amidi Date: Wed, 29 May 2019 00:31:35 -0700 Subject: [PATCH 30/34] Delete recurrent-neural-networks.md --- id/recurrent-neural-networks.md | 677 -------------------------------- 1 file changed, 677 deletions(-) delete mode 100644 id/recurrent-neural-networks.md diff --git a/id/recurrent-neural-networks.md b/id/recurrent-neural-networks.md deleted file mode 100644 index 191e400a1..000000000 --- a/id/recurrent-neural-networks.md +++ /dev/null @@ -1,677 +0,0 @@ -**Recurrent Neural Networks translation** - -
- -**1. Recurrent Neural Networks cheatsheet** - -⟶ - -
- - -**2. CS 230 - Deep Learning** - -⟶ - -
- - -**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** - -⟶ - -
- - -**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** - -⟶ - -
- - -**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** - -⟶ - -
- - -**6. [Comparing words, Cosine similarity, t-SNE]** - -⟶ - -
- - -**7. [Language model, n-gram, Perplexity]** - -⟶ - -
- - -**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** - -⟶ - -
- - -**9. [Attention, Attention model, Attention weights]** - -⟶ - -
- - -**10. Overview** - -⟶ - -
- - -**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** - -⟶ - -
- - -**12. For each timestep t, the activation a and the output y are expressed as follows:** - -⟶ - -
- - -**13. and** - -⟶ - -
- - -**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** - -⟶ - -
- - -**15. The pros and cons of a typical RNN architecture are summed up in the table below:** - -⟶ - -
- - -**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** - -⟶ - -
- - -**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** - -⟶ - -
- - -**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** - -⟶ - -
- - -**19. [Type of RNN, Illustration, Example]** - -⟶ - -
- - -**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** - -⟶ - -
- - -**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** - -⟶ - -
- - -**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** - -⟶ - -
- - -**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** - -⟶ - -
- - -**24. Handling long term dependencies** - -⟶ - -
- - -**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** - -⟶ - -
- - -**26. [Sigmoid, Tanh, RELU]** - -⟶ - -
- - -**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** - -⟶ - -
- - -**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** - -⟶ - -
- - -**29. clipped** - -⟶ - -
- - -**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** - -⟶ - -
- - -**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** - -⟶ - -
- - -**32. [Type of gate, Role, Used in]** - -⟶ - -
- - -**33. [Update gate, Relevance gate, Forget gate, Output gate]** - -⟶ - -
- - -**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** - -⟶ - -
- - -**35. [LSTM, GRU]** - -⟶ - -
- - -**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** - -⟶ - -
- - -**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** - -⟶ - -
- - -**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** - -⟶ - -
- - -**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** - -⟶ - -
- - -**40. [Bidirectional (BRNN), Deep (DRNN)]** - -⟶ - -
- - -**41. Learning word representation** - -⟶ - -
- - -**42. In this section, we note V the vocabulary and |V| its size.** - -⟶ - -
- - -**43. Motivation and notations** - -⟶ - -
- - -**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** - -⟶ - -
- - -**45. [1-hot representation, Word embedding]** - -⟶ - -
- - -**46. [teddy bear, book, soft]** - -⟶ - -
- - -**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** - -⟶ - -
- - -**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** - -⟶ - -
- - -**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** - -⟶ - -
- - -**50. Word embeddings** - -⟶ - -
- - -**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** - -⟶ - -
- - -**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** - -⟶ - -
- - -**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** - -⟶ - -
- - -**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** - -⟶ - -
- - -**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** - -⟶ - -
- - -**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** - -⟶ - -
- - -**57. Remark: this method is less computationally expensive than the skip-gram model.** - -⟶ - -
- - -**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** - -⟶ - -
- - -**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. -Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** - -⟶ - -
- - -**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** - -⟶ - -
- - -**60. Comparing words** - -⟶ - -
- - -**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** - -⟶ - -
- - -**62. Remark: θ is the angle between words w1 and w2.** - -⟶ - -
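Entries 61 and 62 amount to a one-line computation; a small sketch (NumPy usage assumed for illustration):

```python
import numpy as np

def cosine_similarity(w1, w2):
    """cos(theta) = (w1 . w2) / (||w1|| * ||w2||) for two word vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))   # 1.0, same direction
```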
- - -**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** - -⟶ - -
- - -**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** - -⟶ - -
- - -**65. Language model** - -⟶ - -
- - -**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** - -⟶ - -
- - -**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** - -⟶ - -
- - -**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** - -⟶ - -
- - -**69. Remark: PP is commonly used in t-SNE.** - -⟶ - -
- - -**70. Machine translation** - -⟶ - -
- - -**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** - -⟶ - -
- - -**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** - -⟶ - -
- - -**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** - -⟶ - -
- - -**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** - -⟶ - -
- - -**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** - -⟶ - -
- - -**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** - -⟶ - -
- - -**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** - -⟶ - -
- - -**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** - -⟶ - -
- - -**79. [Case, Root cause, Remedies]** - -⟶ - -
- - -**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** - -⟶ - -
- - -**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** - -⟶ - -
- - -**82. where pn is the bleu score on n-gram only defined as follows:** - -⟶ - -
- - -**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** - -⟶ - -
- - -**84. Attention** - -⟶ - -
- - -**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** - -⟶ - -
- - -**86. with** - -⟶ - -
- - -**87. Remark: the attention scores are commonly used in image captioning and machine translation.** - -⟶ - -
- - -**88. A cute teddy bear is reading Persian literature.** - -⟶ - -
- - -**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** - -⟶ - -
- - -**90. Remark: computation complexity is quadratic with respect to Tx.** - -⟶ - -
- - -**91. The Deep Learning cheatsheets are now available in [target language].** - -⟶ - -
- -**92. Original authors** - -⟶ - -
- -**93. Translated by X, Y and Z** - -⟶ - -
- -**94. Reviewed by X, Y and Z** - -⟶ - -
- -**95. View PDF version on GitHub** - -⟶ - -
- -**96. By X and Y** - -⟶ - -
From ebe595254173ac2b16120c5d7c717e9af78f16ac Mon Sep 17 00:00:00 2001 From: Shervine Amidi Date: Wed, 29 May 2019 00:31:41 -0700 Subject: [PATCH 31/34] Delete refresher-linear-algebra.md --- id/refresher-linear-algebra.md | 339 --------------------------------- 1 file changed, 339 deletions(-) delete mode 100644 id/refresher-linear-algebra.md diff --git a/id/refresher-linear-algebra.md b/id/refresher-linear-algebra.md deleted file mode 100644 index a6b440d1e..000000000 --- a/id/refresher-linear-algebra.md +++ /dev/null @@ -1,339 +0,0 @@ -**1. Linear Algebra and Calculus refresher** - -⟶ - -
- -**2. General notations** - -⟶ - -
- -**3. Definitions** - -⟶ - -
- -**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** - -⟶ - -
- -**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** - -⟶ - -
- -**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** - -⟶ - -
- -**7. Main matrices** - -⟶ - -
- -**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** - -⟶ - -
- -**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** - -⟶ - -
- -**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** - -⟶ - -
- -**11. Remark: we also note D as diag(d1,...,dn).** - -⟶ - -
- -**12. Matrix operations** - -⟶ - -
- -**13. Multiplication** - -⟶ - -
- -**14. Vector-vector ― There are two types of vector-vector products:** - -⟶ - -
- -**15. inner product: for x,y∈Rn, we have:** - -⟶ - -
- -**16. outer product: for x∈Rm,y∈Rn, we have:** - -⟶ - -
- -**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:** - -⟶ - -
- -**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** - -⟶ - -
- -**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:** - -⟶ - -
- -**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** - -⟶ - -
- -**21. Other operations** - -⟶ - -
- -**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** - -⟶ - -
- -**23. Remark: for matrices A,B, we have (AB)T=BTAT** - -⟶ - -
- -**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** - -⟶ - -
- -**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** - -⟶ - -
- -**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** - -⟶ - -
- -**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** - -⟶ - -
- -**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** - -⟶ - -
- -**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** - -⟶ - -
- -**30. Matrix properties** - -⟶ - -
- -**31. Definitions** - -⟶ - -
- -**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** - -⟶ - -
- -**33. [Symmetric, Antisymmetric]** - -⟶ - -
- -**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** - -⟶ - -
- -**35. N(ax)=|a|N(x) for a scalar** - -⟶ - -
- -**36. if N(x)=0, then x=0** - -⟶ - -
- -**37. For x∈V, the most commonly used norms are summed up in the table below:** - -⟶ - -
- -**38. [Norm, Notation, Definition, Use case]** - -⟶ - -
- -**39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** - -⟶ - -
- -**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent** - -⟶ - -
- -**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** - -⟶ - -
- -**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** - -⟶ - -
- -**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** - -⟶ - -
- -**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** - -⟶ - -
- -**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** - -⟶ - -
- -**46. diagonal** - -⟶ - -
- -**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** - -⟶ - -
- -**48. Matrix calculus** - -⟶ - -
- -**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** - -⟶ - -
- -**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.** - -⟶ - -
- -**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** - -⟶ - -
- -**52. Remark: the hessian of f is only defined when f is a function that returns a scalar** - -⟶ - -
- -**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** - -⟶ - -
- -**54. [General notations, Definitions, Main matrices]** - -⟶ - -
- -**55. [Matrix operations, Multiplication, Other operations]** - -⟶ - -
- -**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** - -⟶ - -
- -**57. [Matrix calculus, Gradient, Hessian, Operations]** - -⟶ From 1666ef92eba25a74077be2e13d106f2f3ab5cca9 Mon Sep 17 00:00:00 2001 From: Shervine Amidi Date: Wed, 29 May 2019 00:31:48 -0700 Subject: [PATCH 32/34] Delete refresher-probability.md --- id/refresher-probability.md | 381 ------------------------------------ 1 file changed, 381 deletions(-) delete mode 100644 id/refresher-probability.md diff --git a/id/refresher-probability.md b/id/refresher-probability.md deleted file mode 100644 index 5c9b34656..000000000 --- a/id/refresher-probability.md +++ /dev/null @@ -1,381 +0,0 @@ -**1. Probabilities and Statistics refresher** - -⟶ - -
- -**2. Introduction to Probability and Combinatorics** - -⟶ - -
- -**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** - -⟶ - -
- -**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** - -⟶ - -
- -**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** - -⟶ - -
- -**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:** - -⟶ - -
- -**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** - -⟶ - -
- -**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** - -⟶ - -
- -**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** - -⟶ - -
- -**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** - -⟶ - -
- -**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** - -⟶ - -
- -**12. Conditional Probability** - -⟶ - -
- -**13. Bayes' rule ― For events A and B such that P(B)>0, we have:** - -⟶ - -
- -**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** - -⟶ - -
- -**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** - -⟶ - -
- -**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** - -⟶ - -
- -**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** - -⟶ - -
- -**18. Independence ― Two events A and B are independent if and only if we have:** - -⟶ - -
- -**19. Random Variables** - -⟶ - -
- -**20. Definitions** - -⟶ - -
- -**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** - -⟶ - -
- -**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** - -⟶ - -
-**23. Remark: we have P(a<X⩽b)=F(b)−F(a)** - -⟶ - - -**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** - -⟶ - -
- -**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** - -⟶ - -
- -**26. [Case, CDF F, PDF f, Properties of PDF]** - -⟶ - -
- -**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** - -⟶ - -
- -**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** - -⟶ - -
- -**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** - -⟶ - -
- -**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** - -⟶ - -
- -**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** - -⟶ - -
- -**32. Probability Distributions** - -⟶ - -
- -**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** - -⟶ - -
- -**34. Main distributions ― Here are the main distributions to have in mind:** - -⟶ - -
- -**35. [Type, Distribution]** - -⟶ - -
- -**36. Jointly Distributed Random Variables** - -⟶ - -
- -**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** - -⟶ - -
- -**38. [Case, Marginal density, Cumulative function]** - -⟶ - -
- -**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** - -⟶ - -
- -**40. Independence ― Two random variables X and Y are said to be independent if we have:** - -⟶ - -
- -**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** - -⟶ - -
- -**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** - -⟶ - -
- -**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** - -⟶ - -
- -**44. Remark 2: If X and Y are independent, then ρXY=0.** - -⟶ - -
- -**45. Parameter estimation** - -⟶ - -
- -**46. Definitions** - -⟶ - -
- -**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** - -⟶ - -
- -**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** - -⟶ - -
- -**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** - -⟶ - -
- -**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** - -⟶ - -
- -**51. Estimating the mean** - -⟶ - -
- -**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:** - -⟶ - -
- -**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.** - -⟶ - -
- -**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** - -⟶ - -
- -**55. Estimating the variance** - -⟶ - -
- -**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** - -⟶ - -
- -**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.** - -⟶ - -
- -**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** - -⟶ - -
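Entries 52 to 58 describe the usual estimators; a short sketch of the sample mean and the unbiased sample variance, with simulated data as an assumption, is:

```python
import numpy as np

x = np.random.randn(1000)                     # a random sample X1, ..., Xn
x_bar = x.mean()                              # sample mean, unbiased estimator of the true mean
s2 = np.sum((x - x_bar) ** 2) / (len(x) - 1)  # sample variance with the 1/(n-1) factor, unbiased for sigma^2
# the same quantity via NumPy directly: np.var(x, ddof=1)
```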
- -**59. [Introduction, Sample space, Event, Permutation]** - -⟶ - -
- -**60. [Conditional probability, Bayes' rule, Independence]** - -⟶ - -
- -**61. [Random variables, Definitions, Expectation, Variance]** - -⟶ - -
- -**62. [Probability distributions, Chebyshev's inequality, Main distributions]** - -⟶ - -
- -**63. [Jointly distributed random variables, Density, Covariance, Correlation]** - -⟶ - -
- -**64. [Parameter estimation, Mean, Variance]** - -⟶ From 80e9009b0e31762367724e8b6d82d2e4817cb8e6 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Sat, 19 Oct 2019 21:25:57 +0700 Subject: [PATCH 33/34] review unspervised learning --- id/cheatsheet-deep-learning.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index 6634a14de..c5db5877c 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -60,7 +60,7 @@ **11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** -⟶**11. Learning rate - Learning rate (Tingkat pembelajaran), sering dinotasikan sebagai α atau η, merupakan fase pembaruan pembobotan. Tingkat pembelajaran dapat diperbaiki atau diubah secara adaptif. Metode yang paling populer saat ini disebut Adam, yang merupakan metode yang dapat menyesuaikan tingkat pembelajaran.** +⟶**11. Learning rate - Learning rate, sering dinotasikan sebagai α atau η, mendefinisikan seberapa cepat nilai weight diperbaharui. Learning rate bisa diset dengan nilai fix atau dirubah secara adaptif. Metode yang paling terkenal saat ini adalah Adam. Sebuah method yang merubah learning rate secara adaptif.**
@@ -78,19 +78,19 @@ **14. Updating weights ― In a neural network, weights are updated as follows:** -⟶**14. Memperbaharui bobot w - Dalam neural network, bobot w diperbarui nilainya dengan cara berikut:** +⟶**14. Memperbaharui nilai weights - Dalam neural network, nilai weights diperbarui nilainya dengan cara berikut:**
**15. Step 1: Take a batch of training data.** -⟶**15. Langkah 1: Mengambil jumlah batch dari data latih.** +⟶**15. Langkah 1: Mengambil batch (sampel data) dari keseluruhan training data.**
**16. Step 2: Perform forward propagation to obtain the corresponding loss.** -⟶**16. Langkah 2: Melakukan forward propagation untuk mendapatkan nilai loss yang sesuai.** +⟶**16. Langkah 2: Melakukan forward propagation untuk mendapatkan nilai loss berdasarkan nilai masukan (input).**
**18. Step 4: Use the gradients to update the weights of the network.** -⟶**18. Langkah 4: Menggunakan gradient untuk untuk memperbarui nilai dari network.** +⟶**18. Langkah 4: Menggunakan gradient untuk memperbarui nilai weights dari network.**
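Steps 1 to 4 above can be made concrete with a very small model; the sketch below trains a single sigmoid neuron with mini-batch gradient descent, and the data, model and names are assumptions used only to illustrate the four steps:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                                # toy training set
y = rng.integers(0, 2, size=(64, 1)).astype(float)
w, b, alpha = np.zeros((3, 1)), 0.0, 0.1

for _ in range(100):
    idx = rng.choice(len(X), size=16, replace=False)        # step 1: take a batch of training data
    xb, yb = X[idx], y[idx]
    z = 1.0 / (1.0 + np.exp(-(xb @ w + b)))                 # step 2: forward propagation and loss
    loss = -np.mean(yb * np.log(z) + (1 - yb) * np.log(1 - z))
    grad_w = xb.T @ (z - yb) / len(xb)                      # step 3: backpropagation (chain rule)
    grad_b = np.mean(z - yb)
    w, b = w - alpha * grad_w, b - alpha * grad_b           # step 4: gradient step on the weights
```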
**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** -⟶**19. Dropout - Dropout adalah teknik untuk mencegah overfitting data latih dengan menghilangkan satu atau lebih unit layer dalam neural network. Pada praktiknya, neurons melakukan drop dengan probabilitas p atau tidak melakukannya dengan probabilitas 1-p** +⟶**19. Dropout - Dropout adalah teknik untuk mencegah overfit terhadap data training dengan menghilangkan satu atau lebih unit layer dalam neural network. Pada praktiknya, neurons di-drop dengan probabilitas p atau dipertahankan dengan probabilitas 1-p**
From 303057c0834322986aad01b2029a9b59b398b5d1 Mon Sep 17 00:00:00 2001 From: Sony Wicaksono Date: Sat, 19 Oct 2019 21:28:08 +0700 Subject: [PATCH 34/34] review cheatsheet-unsupervised-learning --- id/cheatsheet-deep-learning.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/id/cheatsheet-deep-learning.md b/id/cheatsheet-deep-learning.md index c5db5877c..6eadeb893 100644 --- a/id/cheatsheet-deep-learning.md +++ b/id/cheatsheet-deep-learning.md @@ -120,19 +120,19 @@ **21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** -⟶ **21. Kebutuhan layer convolutional - W adalah ukuran volume input, F adalah ukuran dari layer neuron convolutional, P adalah jumlah zero padding, maka jumlah neurons N yang dapat dibentuk dari volume yang diberikan adalah:** +⟶ **21. Kebutuhan layer convolutional - W adalah ukuran volume input, F adalah ukuran dari layer neuron convolutional, P adalah jumlah zero padding, maka jumlah neurons N yang sesuai dengan ukuran dimensi masukan adalah:**
**22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** -⟶ **22. Batch normalization - Adalah salah satu step hyperparameter γ,β yang menormalisasikan batch {xi}. Dengan notasi μB,σ2B adalah rata-rata dan variansi nilai yang digunakan untuk perbaikan dalam batch, dapat diselesaikan sebagai berikut:** +⟶ **22. Batch normalization - Adalah salah satu step hyperparameter γ,β yang menormalisasikan batch {xi}. Dengan mendefiniskan μB,σ2B sebagai nilai rata-rata dan variansi dari batch yang ingin kita normalisasi, hal tersebut dapat dilakukan dengan cara:**
**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** -⟶ **23. Biasanya dilakukan setelah layer sepenuhnya terhubung / konvolusional dan sebelum layer non-linearitas, yang bertujuan untuk peningkatan tingkat pembelajaran yang lebih tinggi dan mengurangi ketergantungan yang kuat pada inisialisasi.** +⟶ **23. Biasanya diaplikasikan setelah layer fully connected dan sebelum layer non-linear, yang bertujuan agar memungkinkannya penggunaan nilai learning rates yang lebih tinggi dan mengurangi ketergantungan pada nilai inisialisasi parameter neural network.**
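A minimal sketch of the batch normalization step of entries 22 and 23, assuming the usual normalization (x−μB)/√(σ2B+ε) followed by the learnable scale and shift γ, β (the ε value and names are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch with its mean and variance, then apply the learnable scale/shift gamma, beta."""
    mu = x.mean(axis=0)                      # mu_B
    var = x.var(axis=0)                      # sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized batch
    return gamma * x_hat + beta

x = np.random.randn(32, 8)                   # a batch of 32 activations of size 8
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```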