dtw algorithm refers to the dynamic time warping algorithm, which is based on the idea of dynamic programming DP. It is a dynamic programming algorithm that calculates the similarity of two time series, especially sequences of different lengths; it solves the problem of pronunciation length The different template matching problem is an earlier and more classic algorithm in speech recognition. The dtw algorithm is mainly used in time series data, such as isolated word speech recognition, gesture recognition, data mining and information retrieval.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
There are many similarity or distance functions for time series data, the most prominent of which is DTW. In isolated word speech recognition, the simplest and most effective method is to use the DTW (Dynamic Time Warping) algorithm. This algorithm is based on the idea of dynamic programming (DP) and solves the problem of template matching of different pronunciations. The problem is an earlier and more classic algorithm in speech recognition, used for isolated word recognition.
DTW (Dynamic Time Warping) dynamic time warping algorithm is a dynamic programming algorithm that calculates the similarity of two time series, especially sequences of different lengths. It is mainly used in time series data, such as isolated word speech recognition, gesture recognition, data mining and information retrieval.
The HMM algorithm needs to provide a large amount of speech data during the training phase, and the model parameters can be obtained through repeated calculations, while the training of the DTW algorithm requires almost no additional calculations. Therefore, the DTW algorithm is still widely used in isolated word speech recognition.
Whether in the training and template building stage or in the recognition stage, the endpoint algorithm is first used to determine the starting point and end point of the speech. Each entry stored in the template library is called a reference template. A reference template can be expressed as R={R(1), R(2),..., R(m),..., R(M)}, m is the timing number of the training speech frame, m=1 is the starting speech frame, m=M is the ending speech frame, so M is the total number of speech frames included in the template, and R(m) is the speech feature vector of the m-th frame. An input entry speech to be recognized is called a test template, which can be expressed as T = {T (1), T (2), ..., T (n), ..., T (N)}, n is the test speech The timing label of the frame, n=1 is the starting speech frame, n=N is the ending speech frame, so N is the total number of speech frames included in the template, and T(n) is the speech feature vector of the nth frame. The reference template and the test template generally use the same type of feature vector (such as MFCC, LPC coefficients), the same frame length, the same window function and the same frame shift.
Assume that the test and reference templates are represented by T and R respectively. In order to compare the similarity between them, the distance between them can be calculated D[T, R]. The smaller the distance, the higher the similarity. To calculate this distortion distance, start from the distance between corresponding frames in T and R. Let n and m be arbitrarily selected frame numbers in T and R respectively, and d[T(n),R(m)] represents the distance between the feature vectors of these two frames. The distance function depends on the actual distance metric used, and Euclidean distance is usually used in the DTW algorithm.
If N=M, it can be calculated directly, otherwise consider aligning T(n) and R(m). Alignment can use the linear expansion method. If N If the frame numbers n=1~N of the test template are marked on the horizontal axis in a two-dimensional rectangular coordinate system, and the frame numbers m=1~M of the reference template are marked on the vertical axis Mark out, and draw some vertical and horizontal lines through these integer coordinates representing the frame number to form a network. Each intersection point (n, m) in the network represents the intersection point of a certain frame in the test mode. The DP algorithm can be summarized as finding a path through several grid points in this network. The grid points passed by the path are the frame numbers calculated in the test and reference templates. The path is not chosen arbitrarily. First of all, the pronunciation speed of any speech sound may change, but the order of its parts cannot be changed. Therefore, the selected path must start from the lower left corner and end at the upper right corner In order to describe this path, assume that all the grid points passed by the path are (n1, m1),..., (ni, mj),..., (nN, mm), where (n1, m1) = (1 ,1), (nN,mM)=(N,M). The path can be described by the function m = Oslash;(n), where n =i, i=1, 2,...,N, Ø(1)=1, Ø(N)=M. In order to prevent the path from being too inclined, the slope can be constrained to be in the range of 0.5~2. If the path has already passed the grid point (n, m), then the next grid point (n, m) passed can only be the following three Case one: Use r to represent the above three constraints. The problem of finding the best path can be reduced to finding the best path function m =Ø(n) when the constraint condition r is satisfied, so that the accumulated distance along the path reaches the minimum value, that is: 搜索该路径的方法如下:搜索从(n, m)点出发,可以展开若干条满足ŋ的路径,假设可计算每条路径达到(n, m)点时的总的积累距离,具有最小累积距离者即为最佳路径。易于证明,限定范围的任一格点(n, m)只可能有一条搜索路径通过。对于(n, m),其可达到该格点的前一个格点只可能是(n-1, m)、(n-1, m -1)和(n, m-1),那么(n, m)一定选择这3个距离之路径延伸而通过(n, m),这时此路径的积累距离为: 这样可以从(n ,m )=(1,1)出发搜索(n ,m ),对每一个(n ,m )都存储相应的距离,这个距离是当前格点的匹配距离与前一个累计距离最小的格点(按照设定的斜率在三个格点中进行比较)。搜索到(n ,m )时,只保留一条最佳路径。如果有必要的话,通过逐点向前寻找就可以求得整条路径。这套DP算法便是DTW算法。 DTW算法可以直接按上面描述来实现,即分配两个N×M的矩阵,分别为积累距离矩阵D和帧匹配距离矩阵d,其中帧匹配距离矩阵d(i,j)的值为测试模板的第i帧与参考模板的第j帧间的距离。D(N,M)即为最佳匹配路径所对应的匹配距离 更多相关知识,请访问常见问题栏目! The above is the detailed content of What is the DTW algorithm?. For more information, please follow other related articles on the PHP Chinese website!(n ,m )=(n +1,m)
(n ,m )=(n +1,m +1)
(n ,m )=(n ,m+1 )
D[(n, m)]=d[T(n),R(m)]+min{D(n-1,m),D(n-1,m-1),D(n,m-1)}