What happens to the perplexity? Why do you think that is?
For this part, the perplexity steadily went down by approximately 0.2 each time we doubled the number of hidden units. This can be attributed to the fact that doubling the number of neurons in each layer increases the number of operations performed on each sentence in each batch. It also results in a much larger hidden/embedding dimensionality, giving the model more features to compare against one another when generating results. With more information on which to base each prediction, the model is less uncertain about the next word, and the perplexity is therefore lower.
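As a rough sketch of this relationship (assuming a PyTorch-style LSTM language model; the class and parameter names below are illustrative, not taken from the assignment code), perplexity is just the exponential of the average per-token cross-entropy, so a wider hidden state that lowers that loss lowers the perplexity directly:

```python
import math
import torch
import torch.nn as nn

# Illustrative sketch (not the assignment's actual model): an LSTM language
# model whose capacity is controlled by `hidden_size`. Doubling hidden_size
# doubles the width of the recurrent layer.
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.proj(hidden)

def perplexity(model, tokens):
    """Perplexity = exp(mean per-token cross-entropy)."""
    logits = model(tokens[:, :-1])            # predict the next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * steps, vocab)
        tokens[:, 1:].reshape(-1),            # shifted targets
    )
    return math.exp(loss.item())
```

Under this formulation, the drop of roughly 0.2 in perplexity per doubling corresponds to a small but consistent reduction in the average cross-entropy as the hidden state grows.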