Sequence classification predicts a category label for an entire input sequence. A common example is sentiment analysis: predicting the attitude a user expresses in a piece of text. This can be used, for instance, to predict election results or product and movie ratings.
We use the IMDB (Internet Movie Database) movie review data set. The target value is binary: positive or negative. The language contains a lot of negation, irony, and ambiguity, so it is not enough to look at whether a word appears. We build a recurrent network over word vectors, feed in each review word by word, and finally train a classifier that predicts the sentiment of the whole review from the last activation of the network.
The IMDB movie review data set comes from the Stanford University Artificial Intelligence Laboratory: http://ai.stanford.edu/~amaas/data/sentiment/. It is a compressed tar file in which positive and negative reviews are stored as text files in two separate folders. We use a regular expression to extract plain-text tokens and convert all letters to lowercase.
Word vector embeddings are a semantically richer representation than one-hot encoded words. The vocabulary maps each word to its index, which is then used to look up the corresponding word vector. Sequences are padded to the same length so that multiple movie reviews can be fed to the network in batches.
The sequence classification model takes two placeholders: one for the input data (the sequence) and one for the target value (the sentiment). It also receives a params object carrying configuration such as the optimizer.
The sequence length of the current batch is computed dynamically. The data comes as a single tensor in which each sequence is padded with zeros up to the length of the longest movie review. We reduce the word vectors with the absolute maximum: a padding vector yields the scalar 0, a real word vector yields a real number greater than 0. tf.sign() then discretizes this to 0 or 1, and summing these values along the time-step dimension gives the sequence length. The resulting tensor has one entry per sequence in the batch, and each scalar holds that sequence's length.
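To see the trick in isolation, here is a small standalone sketch with made-up numbers (not part of the model code): a batch of two sequences, three time steps, and two-dimensional word vectors.

import tensorflow as tf

# Toy batch: 2 sequences, 3 time steps, 2-dimensional "word vectors".
# Zero rows at the end of each sequence are padding.
data = tf.constant([
    [[0.5, -0.2], [0.1, 0.9], [0.0, 0.0]],   # real length 2
    [[-0.3, 0.4], [0.0, 0.0], [0.0, 0.0]],   # real length 1
])

# Absolute maximum over the word-vector dimension, then sign: 1 for real
# words, 0 for padding. Summing over time steps gives the lengths.
used = tf.sign(tf.reduce_max(tf.abs(data), reduction_indices=2))
length = tf.cast(tf.reduce_sum(used, reduction_indices=1), tf.int32)

with tf.Session() as sess:
    print(sess.run(length))  # -> [2 1]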
The params object defines the cell type and the number of units. The length property limits how many time steps of each sequence the RNN actually unrolls over. We then take the last activation of each sequence and feed it into the softmax layer. Because every movie review has a different length, the last relevant output activation of each sequence in the batch sits at a different index. The RNN output has the shape sequences x time_steps x output_size; we flatten its first two dimensions so that tf.gather() can index along the first dimension, add each sequence's offset, and use length - 1 to select its last valid time step.
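The index arithmetic is easier to see with concrete numbers. The following toy sketch (values purely illustrative, not the model's real shapes) uses two sequences of lengths 2 and 1, three time steps, and a hidden size of 2:

import tensorflow as tf

# Toy RNN output: 2 sequences, 3 time steps, hidden size 2.
output = tf.constant([
    [[1., 1.], [2., 2.], [0., 0.]],   # valid up to step 2
    [[3., 3.], [0., 0.], [0., 0.]],   # valid up to step 1
])
length = tf.constant([2, 1])

batch_size = tf.shape(output)[0]          # 2
max_length = int(output.get_shape()[1])   # 3
output_size = int(output.get_shape()[2])  # 2

# Flattening merges the batch and time axes, so row i * max_length + t
# holds sequence i at time step t; length - 1 picks the last valid step.
index = tf.range(0, batch_size) * max_length + (length - 1)  # [1, 3]
flat = tf.reshape(output, [-1, output_size])
relevant = tf.gather(flat, index)

with tf.Session() as sess:
    print(sess.run(relevant))  # -> [[2. 2.], [3. 3.]]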
RNNs are hard to train: if the hyper-parameters are not matched well, the weights easily diverge. We therefore add gradient clipping, which limits gradient values to a reasonable range and caps the maximum weight update, improving the learning results. Any meaningful cost function can be used here, and the model output may be any probability distribution over the classes.
TensorFlow supports this through the optimizer instance's compute_gradients function, manual modification of the gradients, and the apply_gradients function to apply the weight changes. A gradient component smaller than -limit is set to -limit, and a component greater than limit is set to limit. A TensorFlow gradient can also be None, which means the corresponding variable has no relationship to the cost function; mathematically it should be a zero vector, but returning None enables internal performance optimizations, so we simply pass the None value through.
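In isolation, the pattern looks like the following minimal sketch; the single weight, the toy cost, and the clipping limit of 5.0 are illustrative choices, not part of the model:

import tensorflow as tf

# Minimal sketch of manual gradient clipping on a toy problem.
weight = tf.Variable(3.0)
cost = tf.square(weight)                 # gradient is 2 * weight = 6.0
optimizer = tf.train.GradientDescentOptimizer(0.1)
limit = 5.0                              # illustrative clipping threshold

# compute_gradients returns (gradient, variable) pairs; a gradient may be
# None when the variable has no influence on the cost.
gradients = optimizer.compute_gradients(cost)
clipped = [
    (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
    for g, v in gradients]
optimize = optimizer.apply_gradients(clipped)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(optimize)
    print(sess.run(weight))  # 3.0 - 0.1 * 5.0 = 2.5 (clipped), not 3.0 - 0.1 * 6.0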
The movie reviews are fed into the recurrent neural network word by word, so each time step is a batch of word vectors. The batched function looks up the word vectors and pads all sequences to the same length (a sketch of this helper appears after the full code listing below). To train the model, we define the hyperparameters, load the data set and word vectors, and run the model on the preprocessed training batches. Successful training depends on the network structure, the hyperparameters, and the quality of the word vectors. Pre-trained word vectors can be loaded from the skip-gram-based word2vec project and from the Stanford NLP group's GloVe model (https://nlp.stanford.edu/projects/glove).
There is also an open Kaggle learning competition on the IMDB movie review data, so you can compare your prediction results with those of others.
# ImdbMovieReviews.py
import tarfile
import re

from helpers import download


class ImdbMovieReviews:

    DEFAULT_URL = \
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, cache_dir, url=None):
        self._cache_dir = cache_dir
        self._url = url or type(self).DEFAULT_URL

    def __iter__(self):
        filepath = download(self._url, self._cache_dir)
        with tarfile.open(filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False

    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
        data = type(self).TOKEN_REGEX.findall(data)
        data = [x.lower() for x in data]
        return data


# Embedding.py
import bz2
import numpy as np


class Embedding:

    def __init__(self, vocabulary_path, embedding_path, length):
        self._embedding = np.load(embedding_path)
        with bz2.open(vocabulary_path, 'rt') as file_:
            self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
        self._length = length

    def __call__(self, sequence):
        data = np.zeros((self._length, self._embedding.shape[1]))
        indices = [self._vocabulary.get(x, 0) for x in sequence]
        embedded = self._embedding[indices]
        data[:len(sequence)] = embedded
        return data

    @property
    def dimensions(self):
        return self._embedding.shape[1]


# SequenceClassificationModel.py
import tensorflow as tf

from helpers import lazy_property


class SequenceClassificationModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        self.prediction
        self.cost
        self.error
        self.optimize

    @lazy_property
    def length(self):
        used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
        length = tf.reduce_sum(used, reduction_indices=1)
        length = tf.cast(length, tf.int32)
        return length

    @lazy_property
    def prediction(self):
        # Recurrent network.
        output, _ = tf.nn.dynamic_rnn(
            self.params.rnn_cell(self.params.rnn_hidden),
            self.data,
            dtype=tf.float32,
            sequence_length=self.length,
        )
        last = self._last_relevant(output, self.length)
        # Softmax layer.
        num_classes = int(self.target.get_shape()[1])
        weight = tf.Variable(tf.truncated_normal(
            [self.params.rnn_hidden, num_classes], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        try:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        except AttributeError:
            print('No gradient clipping parameter specified.')
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    @staticmethod
    def _last_relevant(output, length):
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        relevant = tf.gather(flat, index)
        return relevant


# Training script
import tensorflow as tf

from helpers import AttrDict
from Embedding import Embedding
from ImdbMovieReviews import ImdbMovieReviews
from preprocess_batched import preprocess_batched
from SequenceClassificationModel import SequenceClassificationModel

IMDB_DOWNLOAD_DIR = './imdb'
WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
)

reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
length = max(len(x[0]) for x in reviews)
embedding = Embedding(
    WIKI_VOCAB_DIR + '/vocabulary.bz2',
    WIKI_EMBED_DIR + '/embeddings.npy', length)
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
target = tf.placeholder(tf.float32, [None, 2])
model = SequenceClassificationModel(data, target, params)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for index, batch in enumerate(batches):
    feed = {data: batch[0], target: batch[1]}
    error, _ = sess.run([model.error, model.optimize], feed)
    print('{}: {:3.1f}%'.format(index + 1, 100 * error))
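The preprocess_batched helper imported above is not listed in this article. Based on how it is called (an iterable of (tokens, label) pairs, the maximum length, the Embedding object, and a batch size), a hypothetical reconstruction could look like this; the one-hot label convention is an assumption:

import numpy as np

def preprocess_batched(iterator, length, embedding, batch_size):
    # Hypothetical reconstruction: turns (tokens, label) pairs into batches
    # of embedded sequences and one-hot targets for the feed dictionary.
    batch_data = []
    batch_target = []
    for tokens, label in iterator:
        batch_data.append(embedding(tokens))               # length x dimensions, zero padded
        batch_target.append([1, 0] if label else [0, 1])   # assumed one-hot convention
        if len(batch_data) == batch_size:
            yield np.array(batch_data), np.array(batch_target)
            batch_data, batch_target = [], []
    # Any incomplete final batch is simply dropped in this sketch.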
Reference material:
"TensorFlow Practice for Machine Intelligence"
You are welcome to add me on WeChat: qingxingfengzi
My wife Zhang Xingqing's WeChat public account: qingxingfengzigz