I'm working on a new project in language identification. Specifically, we are looking into using speech attribute detectors to enhance phonetic transcriptions. From there, supervectors are built by forming phone document vectors for each language. Moreover, we are using TempoRAl Patterns (TRAPs) as features. These have been shown to be superior to MFCC + velocity + acceleration vectors. I would be interested to see how these perform on music, especially since incorporating dynamic features has had only limited effect there. I think part of the problem is that music is (generally) slower than speech, so incorporating longer windows might be better. TRAPs are also different from texture windows: texture windows are simply first- and second-order statistics of the frames within the window, whereas TRAPs concatenate the original frames into a temporal trajectory. However, since I'm limited to using USPop's feature set (MFCC), I'm not sure I'll get to see the effect any time soon.
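To make the texture-window vs. TRAP distinction concrete, here is a minimal NumPy sketch. All names, shapes, and window sizes are my own assumptions for illustration (a random 100×13 "MFCC" matrix, a 31-frame window); classical TRAPs typically use much longer windows (~1 s) over critical-band energies per band, but the structural contrast is the same: statistics over the window vs. the raw trajectory itself.

```python
import numpy as np

# Hypothetical data: 100 frames of 13 MFCC coefficients (assumed shapes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))

def texture_window(frames, center, half_width):
    """Texture window: first- and second-order statistics (mean, std)
    of all frames in the window -> fixed-length 2 * n_coeffs vector.
    Temporal ordering inside the window is discarded."""
    win = frames[center - half_width : center + half_width + 1]
    return np.concatenate([win.mean(axis=0), win.std(axis=0)])

def trap_feature(frames, center, half_width, coeff=0):
    """TRAP-style feature: the temporal trajectory of a single
    coefficient across the window, kept as-is (frames concatenated),
    so ordering and dynamics within the window are preserved."""
    return frames[center - half_width : center + half_width + 1, coeff].copy()

tw = texture_window(frames, center=50, half_width=15)  # 2 * 13 = 26 values
tr = trap_feature(frames, center=50, half_width=15)    # 31-frame trajectory
print(tw.shape, tr.shape)
```

The key design difference: the texture window's dimensionality depends only on the number of coefficients, while the TRAP's dimensionality grows with the window length, which is exactly what lets a longer window capture slower musical dynamics.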