Preparation

  • Data source
    • Data is text8.zip downloaded from Mattmahoney.net
    • The data is 100M characters; the first 1000 characters are held out for validation and the rest is used for training
    • Since characters cannot be used directly in numerical computation, each character is mapped to an integer id
    • A character’s id is its ASCII code minus the ASCII code of the first character (‘a’); a minimal sketch of this mapping appears after this list
    • Batch generation
      • A batch is a group of characters fed into the TensorFlow graph at the same time
      • The batch size is a fixed constant (64), whether the batch comes from the training text or the validation text
      • The tutorial extracts each batch by sampling characters uniformly across the text, e.g.:
      • In the following text, the text length is 130 and the batch size is 5, so the segment length is 130/5 => 26
      • The cursor list at the beginning is [0, 26, 52, 78, 104]
      • The cursor list for the next batch is [1, 27, 53, 79, 105]
        abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
        a<-cursor                 a<-cursor                 a<-cursor                 a<-cursor                 a<-cursor
        
      • The uniformity is achieved by splitting the whole text into batch_size segments
      • On each batch-generation request, we read the character at every cursor position to build the batch, then advance each cursor by one (see the generator sketch after this list)
      • In my example, the first batch is [aaaaa], the next batch is [bbbbb], etc.
      • Batch generation can go on forever, even when a cursor moves past its segment or the text boundary
      • This is because the index is taken modulo the text length, so once the boundary is passed the cursor wraps back to the beginning
      • But the real batch is a 2D tensor of one-hot rows, like this (illustrative example; the real rows have vocabulary_size columns):
        batch index 0 [0, 0, 0, 1, 0, 0, 0, 0, 0]
        batch index 1 [0, 0, 0, 0, 1, 0, 0, 0, 0]
        batch index 2 [0, 0, 1, 0, 0, 0, 0, 0, 0]
        batch index 3 [0, 0, 0, 0, 1, 0, 0, 0, 0]
        batch index 4 [0, 0, 0, 1, 0, 0, 0, 0, 0]
        
      • The batch is a tensor of shape (batch_size, vocabulary_size), i.e. (64, 26+1)
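
  • A minimal sketch (plain Python) of the character/id mapping described above; treating the space character as the extra 27th id is my assumption, based on vocabulary_size = 26+1:

    import string

    first_letter = ord('a')                              # "the first ascii char" in the notes
    vocabulary_size = len(string.ascii_lowercase) + 1    # 26 letters + 1 slot for space

    def char2id(char):
        """Map 'a'..'z' to 0..25; everything else (e.g. space) to 26."""
        if char in string.ascii_lowercase:
            return ord(char) - first_letter
        return vocabulary_size - 1

    def id2char(i):
        """Inverse mapping, used to decode predictions back into text."""
        if i < vocabulary_size - 1:
            return chr(i + first_letter)
        return ' '

    print(char2id('a'), char2id('z'), char2id(' '))   # 0 25 26
    print(id2char(0), id2char(25), id2char(26))       # a z ' '  (last prints a space)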

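  • The cursor-based batch generation can then be sketched like this (a hypothetical reconstruction reusing char2id from the sketch above; the real tutorial code may differ in detail):

    import numpy as np

    class BatchGenerator:
        """Split the text into batch_size segments and keep one cursor per
        segment, so every batch samples the text uniformly."""

        def __init__(self, text, batch_size, vocabulary_size):
            self._text = text
            self._text_size = len(text)
            self._batch_size = batch_size
            self._vocabulary_size = vocabulary_size
            segment = self._text_size // batch_size          # e.g. 130 // 5 = 26
            self._cursor = [offset * segment for offset in range(batch_size)]

        def next(self):
            """Return one (batch_size, vocabulary_size) one-hot batch and
            advance every cursor by one character."""
            batch = np.zeros((self._batch_size, self._vocabulary_size), dtype=np.float32)
            for b in range(self._batch_size):
                batch[b, char2id(self._text[self._cursor[b]])] = 1.0
                # The modulo lets generation run forever: past the text boundary
                # the cursor wraps back to the beginning.
                self._cursor[b] = (self._cursor[b] + 1) % self._text_size
            return batch

    # With the 130-character example above and batch_size 5, the first call
    # returns one-hot rows for 'aaaaa', the next call for 'bbbbb', and so on.
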
Training Procedure

  • The number of training steps is arbitrarily set to 7001 by the course instructor
  • At each step, the training code feeds a number of unrolled batches into the graph. This is due to the LSTM’s recurrent nature: it needs the previous characters to compute the current state
  • For example, in one training step we feed the graph a list of length num_unrollings+1
    • [batches[0], batches[1], batches[2], batches[3] .......... batches[num_unrollings]]
      • Each batches[i] contains 64 one-hot vectors converted from characters drawn from positions spread uniformly across the source text
      • a b c d   <- first step
      • d e f g   <- second step (the window slides forward, reusing the last batch of the previous step)
  • Feeding the graph
    • feed_dict[train_data[i]] = batches[i] happens in the session training loop
    • train_data.append(tf.placeholder(...blah...)) happens when the graph is built
    • I still find this code a bit magical; the high-level interpretation is simply train_data[i] = batches[i] (see the sketch below)
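
  • Putting the two fragments together, the overall structure looks roughly like this (a sketch assuming the TensorFlow 1.x API; num_unrollings = 10, the stand-in train_text, and the optimizer placement are assumptions, and BatchGenerator/char2id come from the sketches in the Preparation section):

    import tensorflow as tf   # assumes the 1.x API (tf.Graph, tf.placeholder, tf.Session)

    batch_size = 64
    vocabulary_size = 27
    num_unrollings = 10       # assumed value; the notes only name the constant

    # In the graph: one placeholder per unrolled position, plus one extra,
    # because batches[1:] act as the labels for the inputs batches[:-1].
    graph = tf.Graph()
    with graph.as_default():
        train_data = []
        for _ in range(num_unrollings + 1):
            train_data.append(
                tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
        train_inputs = train_data[:num_unrollings]   # batches[0] .. batches[n-1]
        train_labels = train_data[1:]                # batches[1] .. batches[n]
        # ... the LSTM cell, unrolled loss, and optimizer would be defined here ...

    # In the session training loop: fetch num_unrollings + 1 batches and bind
    # each one to its placeholder ("train_data[i] = batches[i]").
    train_text = 'the quick brown fox jumps over the lazy dog ' * 200   # stand-in for text8
    train_batches = BatchGenerator(train_text, batch_size, vocabulary_size)
    with tf.Session(graph=graph) as session:
        for step in range(7001):
            batches = [train_batches.next() for _ in range(num_unrollings + 1)]
            feed_dict = {}
            for i in range(num_unrollings + 1):
                feed_dict[train_data[i]] = batches[i]
            # session.run([optimizer, loss], feed_dict=feed_dict) would go here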