Emotion recognition in the wild is a challenging task. In this paper, we propose a multi-model fusion method that automatically recognizes the expression in a video clip, developed for the third Emotion Recognition in the Wild Challenge (EmotiW 2015). We first extract dense SIFT, LBP-TOP, and audio features from each video clip. The dense SIFT features are further represented with the bag-of-features (BoF) model using two different encoding methods: locality-constrained linear coding and group saliency based coding. For classification, we use partial least squares regression to compute the regression values of each model, and we fuse the models by learning an optimal weight for each one from these regression values. We conduct experiments on the provided validation and test sets. The best recognition accuracy of our fusion method on the test set is 52.50%, which is 13.17 percentage points higher than the challenge baseline accuracy of 39.33%.
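
The fusion step can be illustrated with a minimal sketch. The weight-learning procedure is not specified here, so the code assumes a simple grid search over non-negative weights that sum to one, scored by accuracy on held-out data. The function names (`fuse`, `accuracy`, `search_weights`), the step size, and the synthetic score matrices are all hypothetical; the random scores merely stand in for the per-class regression values that each of the four models (two SIFT encodings, LBP-TOP, audio) would produce.

```python
import itertools
import numpy as np


def fuse(scores, weights):
    """Weighted sum of per-model score matrices, each (n_samples, n_classes)."""
    return sum(w * s for w, s in zip(weights, scores))


def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))


def search_weights(scores, labels, step=0.1):
    """Grid search over weights on the simplex (assumed strategy, not the paper's)."""
    n_steps = int(round(1 / step))
    best_w, best_acc = None, -1.0
    for combo in itertools.product(range(n_steps + 1), repeat=len(scores)):
        if sum(combo) != n_steps:  # keep only weight vectors that sum to 1
            continue
        w = np.array(combo) / n_steps
        acc = accuracy(fuse(scores, w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy data: 60 clips, 7 classes, 4 models with increasing noise levels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 7, size=60)
one_hot = np.eye(7)[labels]
scores = [one_hot + rng.normal(scale=s, size=one_hot.shape)
          for s in (0.8, 1.0, 1.2, 1.5)]

weights, fused_acc = search_weights(scores, labels)
print("weights:", weights, "fused accuracy:", fused_acc)
```

Because the grid includes the weight vectors that select a single model, the fused accuracy on the data used for the search can never fall below that of the best individual model. In practice the weights would be chosen on the validation set and then applied unchanged to the test set.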