Notes on the Udacity Artificial Intelligence Course
AI Terminology: Fully Observable vs. Partially Observable
- An environment is fully observable when the AI can see all the context it needs to make the optimal decision; nothing relevant is hidden from it
- I think the real world is mostly not fully observable. But we humans can use our memory to reconstruct the unobservable parts and combine them with what is observable; that is how we reach a decision
- During the class the teacher briefly mentioned the Hidden Markov Model. This makes me wonder: does AI use an HMM to construct/predict unobservable data in order to make the optimal decision?
Working On Deep Learning Assignment 5 (Udacity)
Python
- The `list` class can accept an iterator to construct a list object
- Met `Counter` and `deque`. Why is the `Counter` class name capitalized?
- `Counter` is a subclass of `dict`
- `deque` is a list-like container optimized for fast appends and pops at both ends
- `zipfile` is a standard library module in Python
- The `six` package is a compatibility library for writing code that runs on both Python 2 and Python 3
- `six.moves` gives a consistent way to load modules that moved between Python 2 and Python 3
- Basic Python data types accept an iterator as a constructor parameter. (Like C++, so this is a pattern!)
- Sometimes Python is really confusing about when it returns iterators and when it returns actual lists! (See the sketch after this list.)
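A quick sketch of these pieces, standard library only (the word list and values are made up for illustration):

```python
from collections import Counter, deque

words = ['the', 'quick', 'brown', 'fox', 'the', 'the', 'fox']

# Counter is a dict subclass: keys are items, values are their counts.
counts = Counter(words)
print(isinstance(counts, dict))   # True
print(counts.most_common(2))      # [('the', 3), ('fox', 2)]

# deque is a list-like container optimized for fast appends/pops at both ends.
window = deque(maxlen=3)
for w in words:
    window.append(w)              # older items fall off the left end
print(list(window))               # ['the', 'the', 'fox']

# Basic types accept an iterator in their constructor...
squares = list(x * x for x in range(5))    # [0, 1, 4, 9, 16]
# ...while some built-ins return lazy iterators, not lists, in Python 3.
print(map(str, squares))                   # <map object ...>, not a list
print(list(map(str, squares)))             # ['0', '1', '4', '9', '16']
```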
Text Processing
- Use `Counter(...)` to easily get statistics over all words
- Use `Counter.most_common` to easily get the most popular words
- Use a `dict()` to map each word to its ranking (its order in popularity)
- Add a list of rankings in the order of the original word list
- Add a `dict()` so we can map from a ranking back to the actual word
- Generate batches for TensorFlow (a sketch of the vocabulary steps above follows this list)
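A minimal sketch of those steps (the variable names and the tiny word list are mine, for illustration only):

```python
import collections

words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
vocabulary_size = 4  # keep only the most popular words (an assumption for this sketch)

# Counter gives statistics of all words; most_common lists the popular ones first.
counts = collections.Counter(words)
vocab = ['UNK'] + [word for word, _ in counts.most_common(vocabulary_size - 1)]

# dict mapping a word to its ranking (order in popularity)...
word_to_rank = {word: rank for rank, word in enumerate(vocab)}
# ...and a dict mapping a ranking back to the actual word.
rank_to_word = dict(enumerate(vocab))

# The list of rankings in the order of the original word list; words outside
# the vocabulary fall into the 'UNK' bucket at rank 0. This encoded list is
# what the TensorFlow batches are generated from.
data = [word_to_rank.get(word, 0) for word in words]
```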
Word2Vec Using Skip-Gram with TensorFlow (Course-Provided Example)
- The `generate_batch` function
  - A vector of context words is built for each prominent word
  - Each vector has a fixed length, denoted by `num_skips`
  - For example: word w is represented by the vector `[d1, d2, d3, d4]`, where d1, d2, d3 and d4 are randomly chosen from the nearest 4 words surrounding w
  - The Udacity example sets the vocabulary size to 50,000 words. This expands to 50,000 x 4 int vectors if `num_skips` is set to 4
- The training model
  - Goal: given a word, predict which surrounding context words are most probable (a minimal sketch of this model follows this list)
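A minimal sketch of that training model in TF1-style TensorFlow. The hyperparameter values here are my assumptions in the spirit of the course example, not its exact settings:

```python
import tensorflow as tf

vocabulary_size = 50000
embedding_size = 128   # dimensionality of the learned word vectors
batch_size = 128
num_sampled = 64       # negative samples for the sampled softmax

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])      # center words
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])   # context words

    # One embedding row per vocabulary word; looking up a word id gives its vector.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Sampled softmax: score the true context word for each center word against
    # a handful of randomly drawn negative words instead of the full vocabulary.
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size], stddev=0.1))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=softmax_weights, biases=softmax_biases,
        labels=train_labels, inputs=embed,
        num_sampled=num_sampled, num_classes=vocabulary_size))

    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
```

Training then just feeds the (batch, labels) pairs produced by `generate_batch` into `train_inputs` and `train_labels`.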
Word2Vec Using CBOW
Since all the training procedures are similar, I just reversed the batch and label variable references in the original generate_batch function and called it ‘generate_cbow_batch’. A sketch of the idea follows.
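Roughly, as a sketch (this assumes the course's `generate_batch` returns a 1-D array of center words and a (batch_size, 1) array of context labels, which is what the skip-gram example produces):

```python
def generate_cbow_batch(batch_size, num_skips, skip_window):
    # Reuse the course-provided skip-gram generator, then swap the roles:
    # skip-gram predicts context words from the center word, while CBOW
    # predicts the center word from its context.
    batch, labels = generate_batch(batch_size, num_skips, skip_window)
    cbow_batch = labels.reshape(batch.shape)
    cbow_labels = batch.reshape(labels.shape)
    return cbow_batch, cbow_labels
```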
About tf.truncated_normal and tf.random_normal
- In truncated_normal, we provide a standard deviation
- All the values generated by truncated_normal fall within two standard deviations of the mean; samples outside that range are dropped and re-drawn
- tf.random_normal does not have this bound
- Using truncated_normal for initialization helps keep the model parameters from blowing up during training (a small comparison follows this list)
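A small comparison, as a sketch (the shape and stddev are arbitrary choices for illustration; TF1-style session API):

```python
import tensorflow as tf

shape = [1000, 128]
# Every sample is guaranteed to lie within 2 * stddev of the mean.
bounded = tf.truncated_normal(shape, mean=0.0, stddev=0.1)
# No such guarantee: a few samples can land far out in the tails.
unbounded = tf.random_normal(shape, mean=0.0, stddev=0.1)

with tf.Session() as sess:
    b, u = sess.run([bounded, unbounded])
    print(abs(b).max())  # never larger than 0.2
    print(abs(u).max())  # usually larger than 0.2 with this many samples
```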
- Andrej Karpathy’s blog about backpropagation
Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)
- In a CNN the parameters are shared across nearby positions in space, which is a form of weight tying. In a recurrent NN we use this tying again, this time across time steps.
- To optimize the sequence parameters we need to backpropagate the derivatives through time, or in practice through as many steps as we can afford.
- This produces a lot of correlated updates, which is bad for stochastic gradient descent; the math becomes very unstable.
- Exploding gradient and vanishing gradient problems
  - The gradients grow exponentially, or they diminish to zero
  - Gradient clipping is a simple hack that only helps the exploding case
  - The vanishing gradient problem is a kind of memory loss (this is where the LSTM comes in!)
- There is a `memory cell` in the center, with `write`, `read`, and `forget` operations
- There are gates to control each of these actions, and each gate is controlled by a continuous parameter
- That means we can take derivatives and backpropagate through the gates
- The gating value for each gate is produced by a tiny logistic regression on the parameters
- A typical LSTM cell is:

[Figure: A typical LSTM cell]
- The corresponding implementation of an LSTM cell in TensorFlow:
```python
def lstm_cell(i, o, state):
    """Create an LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    # i: current input, o: previous output, state: previous cell state.
    # ix/im/ib, fx/fm/fb, cx/cm/cb, ox/om/ob are gate weights and biases
    # defined elsewhere in the notebook (see the sketch below).
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    # Forget part of the old state, then write the gated new candidate into it.
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    # Read: the output is the gated, squashed cell state.
    return (output_gate * tf.tanh(state), state)
```
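The `ix`, `im`, `ib`, ... names above are gate parameters defined elsewhere in the notebook. A minimal sketch of what those definitions look like; the sizes are my assumptions rather than the notebook's exact values, but the shapes follow from the matmuls in `lstm_cell`:

```python
input_size = 27   # dimensionality of each input vector i (an assumption)
num_nodes = 64    # number of LSTM units, i.e. the size of o and state (an assumption)

# Input gate: input-to-gate weights, output-to-gate weights, bias.
ix = tf.Variable(tf.truncated_normal([input_size, num_nodes], stddev=0.1))
im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.1))
ib = tf.Variable(tf.zeros([1, num_nodes]))
# Forget gate.
fx = tf.Variable(tf.truncated_normal([input_size, num_nodes], stddev=0.1))
fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.1))
fb = tf.Variable(tf.zeros([1, num_nodes]))
# Memory cell update ("write").
cx = tf.Variable(tf.truncated_normal([input_size, num_nodes], stddev=0.1))
cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.1))
cb = tf.Variable(tf.zeros([1, num_nodes]))
# Output gate ("read").
ox = tf.Variable(tf.truncated_normal([input_size, num_nodes], stddev=0.1))
om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.1))
ob = tf.Variable(tf.zeros([1, num_nodes]))
```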