Ex parte Senior et al., No. 13/955,483 (P.T.A.B. Feb. 27, 2019)

UNITED STATES PATENT AND TRADEMARK OFFICE

APPLICATION NO.: 13/955,483
FILING DATE: 07/31/2013
FIRST NAMED INVENTOR: Andrew W. Senior
ATTORNEY DOCKET NO.: 16113-4867001
CONFIRMATION NO.: 3634
EXAMINER: OGUNBIYI, OLUWADAMILOL M
ART UNIT: 2657
PAPER NUMBER: -
NOTIFICATION DATE: 03/01/2019
DELIVERY MODE: ELECTRONIC

26192 7590 03/01/2019
FISH & RICHARDSON P.C.
PO BOX 1022
MINNEAPOLIS, MN 55440-1022

UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office
Address: COMMISSIONER FOR PATENTS, P.O. Box 1450, Alexandria, Virginia 22313-1450, www.uspto.gov

Please find below and/or attached an Office communication concerning this application or proceeding. The time period for reply, if any, is set in the attached communication. Notice of the Office communication was sent electronically on the above-indicated "Notification Date" to the following e-mail address(es): PATDOCTC@fr.com

PTOL-90A (Rev. 04/07)

UNITED STATES PATENT AND TRADEMARK OFFICE
BEFORE THE PATENT TRIAL AND APPEAL BOARD

Ex parte ANDREW W. SENIOR and IGNACIO L. MORENO

Appeal 2017-004870
Application 13/955,483 [1]
Technology Center 2600

Before JUSTIN BUSCH, JAMES W. DEJMEK, and KARA L. SZPONDOWSKI, Administrative Patent Judges.

DEJMEK, Administrative Patent Judge.

DECISION ON APPEAL

Appellants appeal under 35 U.S.C. § 134(a) from a Final Rejection of claims 1-20. Oral arguments were heard on February 13, 2019. A transcript of the hearing will be placed in the record in due course. We have jurisdiction over the pending claims under 35 U.S.C. § 6(b). We reverse.

[1] Appellants identify Google Inc. as the real party in interest. App. Br. 1.

STATEMENT OF THE CASE

Introduction

Appellants' disclosed and claimed invention generally relates to speech recognition techniques using neural networks. Spec. ¶¶ 1, 3. Figure 2 is illustrative of an exemplary embodiment for performing speech recognition using a neural network and is reproduced below:

[Figure 2 (not reproduced): a neural network (270) whose output layer (273) produces probabilities such as P(s0_1 | X), P(s0_2 | X), P(s0_3 | X), P(s1_1 | X), P(s1_2 | X), and P(s1_3 | X).]

Figure 2 illustrates "an example of processing for speech recognition using neural networks." Spec. ¶ 12. As shown, an input audio signal (210) is analyzed in a plurality of windows (220, identified as w1, w2, w3, w4, ... wn). Spec. ¶ 28. Each window may represent a particular time period of the input audio signal (210). Spec. ¶ 28. A Fast Fourier Transform (FFT) is performed on each window (220), resulting in a time-frequency representation of the audio in each window (230). Spec. ¶ 29. Acoustic features are extracted from the FFT results (230) to produce acoustic feature vectors (240, identified as v1, v2, v3, v4, ... vn). Spec. ¶¶ 29-30. The acoustic feature vectors (240) are presented to the input of neural network (270). See Spec. ¶ 25. In addition to the acoustic feature vectors, Appellants describe providing i-vector (250) as an input to the neural network (270). See Spec. ¶¶ 22-23, 31. According to the Specification: "Many typical neural networks used for speech recognition include input connections for receiving only acoustic feature information. By contrast, the neural network 270 receives acoustic feature information augmented with additional information such as an i-vector." Spec. ¶ 32.
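For orientation only (none of this is from the record), the processing path just described (windowing, FFT, feature extraction, and augmentation with an i-vector) can be sketched in Python as follows; the window length, hop, log-spectrum "features," and random i-vector are invented placeholders:

```python
# Illustrative sketch of the Figure 2 front end described above. Nothing here
# comes from the record: window length, hop, the log-spectrum "features," and
# the random i-vector are all invented placeholders.
import numpy as np

def frontend(audio, window_size=400, hop=160, num_features=40):
    """Windows (220) -> FFT (230) -> acoustic feature vectors (240)."""
    feature_vectors = []
    for start in range(0, len(audio) - window_size + 1, hop):
        window = audio[start:start + window_size]   # w1, w2, ..., wn
        spectrum = np.abs(np.fft.rfft(window))      # time-frequency representation
        # Stand-in for a real feature extractor (e.g., filterbank energies):
        feature_vectors.append(np.log1p(spectrum[:num_features]))
    return feature_vectors                          # v1, v2, ..., vn

audio = np.random.randn(16000)     # one second of synthetic 16 kHz audio
i_vector = np.random.randn(100)    # placeholder for i-vector (250)

# Per the quoted passage, each input to neural network (270) is an acoustic
# feature vector augmented with additional information (here, the i-vector):
nn_inputs = [np.concatenate([v, i_vector]) for v in frontend(audio)]
```

Each element of nn_inputs corresponds to one augmented input of the kind the quoted passage describes: acoustic features concatenated with the same utterance-level i-vector.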
Appellants describe that the disclosed speech recognition system may receive acoustic feature vectors and "additional information." Spec. ¶ 22. The additional information may be indicative of audio characteristics independent of the words uttered by a speaker. Spec. ¶ 22. Such characteristics may correspond to background noise, recording channel properties, the speaking style of the speaker, the speaker's accent, and/or the speaker's gender. Spec. ¶ 22. This additional information of the input audio signal, as well as other data (such as training data), may include latent variables of multivariate factor analysis (MFA) of the audio signal or other audio signals. Spec. ¶ 23. As shown in Figure 2, i-vector (250) indicates latent variables of multivariate factor analysis and is provided as an input to the input layer of neural network (270). Spec. ¶¶ 31, 34. The output layer (273) of neural network (270) indicates the likelihood (i.e., probability) that the audio signal under analysis corresponds to particular phonetic units (s0, ..., sm). Spec. ¶ 35.

Claim 1 is illustrative of the subject matter on appeal and is reproduced below with the disputed limitations emphasized in italics:

1. A method performed by data processing apparatus, the method comprising:
receiving a feature vector that models audio characteristics of a portion of an utterance;
receiving data indicative of latent variables of multivariate factor analysis;
providing the feature vector and the data indicative of the latent variables as input to an input layer of a neural network comprising the input layer, multiple hidden layers, and an output layer; and
determining a candidate transcription for the utterance based on at least an output of the neural network that the neural network provides at the output layer in response to the feature vector and the data indicative of the latent variables being provided at the input layer.
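Structurally, the disputed limitation can be pictured as concatenating the two recited inputs at the network's input layer and running a feed-forward pass. A minimal sketch under invented assumptions follows (layer sizes, untrained random weights, and a 50-unit phonetic inventory are all made up); it is offered only to fix ideas, not as the claimed method:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_vector = rng.standard_normal(40)   # models audio characteristics
latent_data = rng.standard_normal(100)     # data indicative of latent variables

# Both inputs are provided together at the input layer:
x = np.concatenate([feature_vector, latent_data])

sizes = [x.size, 256, 256, 50]             # multiple hidden layers; 50 phonetic units
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    W = rng.standard_normal((n_out, n_in)) * 0.01   # untrained toy weights
    x = W @ x
    if i < len(sizes) - 2:
        x = np.maximum(x, 0.0)             # ReLU on the hidden layers only

posteriors = np.exp(x - x.max())
posteriors /= posteriors.sum()             # P(s_i | X) at the output layer
candidate_unit = int(np.argmax(posteriors))  # basis for a candidate transcription
```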
The Examiner's Rejections

1. Claims 1, 2, 6, 9, 10, 14, 17, and 18 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke (US 8,484,022 B1; July 9, 2013); Geoffrey Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Magazine, 82-97 (2012) ("Hinton"); and Daniel Garcia-Romero & Carol Y. Espy-Wilson, Analysis of I-vector Length Normalization in Speaker Recognition Systems, Twelfth Annual Conference of the International Speech Communication Association (2011) ("Garcia-Romero"). Final Act. 3-9.

2. Claims 3, 11, and 19 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke; Hinton; Garcia-Romero; Strom et al. (US 8,386,251 B2; Feb. 26, 2013) ("Strom"); and Garudadri et al. (US 2003/0204394 A1; Oct. 30, 2003) ("Garudadri"). Final Act. 9-11.

3. Claims 4, 12, and 20 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke, Hinton, Garcia-Romero, and Huo et al. (WO 2014/029099 A1; Feb. 27, 2014) ("Huo"). Final Act. 11-13.

4. Claims 5 and 13 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke; Hinton; Garcia-Romero; Willett (US 2012/0259632 A1; Oct. 11, 2012); and Jiang et al. (US 6,542,866 B1; Apr. 1, 2003) ("Jiang"). Final Act. 13-16.

5. Claims 7 and 15 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke, Hinton, Garcia-Romero, and Abrash et al. (US 7,610,199 B2; Oct. 27, 2009) ("Abrash"). Final Act. 16-19.

6. Claims 8 and 16 stand rejected under 35 U.S.C. § 103 as being unpatentable over Vanhoucke, Hinton, Garcia-Romero, and Jiang. Final Act. 19-20.

ANALYSIS [2]

Appellants assert the Examiner erred in finding the cited references teach or suggest providing data indicative of latent variables of a multivariate factor analysis as input to an input layer of a neural network. App. Br. 4-6; Reply Br. 1. In particular, Appellants argue Hinton discloses "latent variables" but does not teach inputting the latent variables to an input of a neural network. App. Br. 4. Rather, Appellants assert Hinton's latent variables are merely "internal portions of 'generative models' that are not provided as input to any of Hinton's DNN's [sic] [(Deep Neural Networks)]." App. Br. 4. Appellants argue the Examiner's reliance on Garcia-Romero is also misplaced at least because Garcia-Romero also fails to teach providing data indicative of the latent variables (i.e., Garcia-Romero's i-vectors) as input to an input layer of a neural network. App. Br. 4-5. Moreover, Appellants assert Garcia-Romero does not even mention a neural network in its disclosed approach for speaker recognition. App. Br. 5.

In response, the Examiner finds Vanhoucke in combination with Garcia-Romero teaches the claimed invention and explains that Hinton is relied on for a more explicit use of latent variables. Ans. 22. We begin our analysis with a brief overview of Vanhoucke and Garcia-Romero.

[2] Throughout this Decision, we have considered the Appeal Brief, filed July 6, 2016 ("App. Br."); the Reply Brief, filed January 31, 2017 ("Reply Br."); the Examiner's Answer, mailed January 30, 2017 ("Ans."); and the Final Office Action, mailed November 20, 2015 ("Final Act."), from which this Appeal is taken.

Vanhoucke teaches automatic speech recognition (ASR) systems may be configured to recognize a spoken utterance and translate the recognized utterance into text. Vanhoucke, col. 3, ll. 22-26, 36-39; see also Vanhoucke, col. 16, ll. 41-43 ("the output 511 could be one or more text string transcriptions of utterance 501"). Vanhoucke further describes the architecture of an ASR system may comprise components such as a signal processing component, a pattern classification component, an acoustic model component, and a language component. Vanhoucke, col. 4, ll. 15-19, Fig. 5. The signal processing component receives an audio input, samples the signal within a sequence of time frames, and processes the frame samples to generate a set of feature vectors. Vanhoucke, col. 4, ll. 19-25. The feature vectors "include[] a set of measured and/or derived elements that characterize the acoustic content of the corresponding time frame." Vanhoucke, col. 4, ll. 23-25. As part of the pattern classification component, Vanhoucke teaches the feature vectors may be applied to an acoustic model. Vanhoucke, col. 4, ll. 31-34. Vanhoucke teaches the acoustic model may be implemented using a hybrid neural network/hidden Markov model approach. Vanhoucke, col. 4, ll. 51-53.
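For context (and not drawn from Vanhoucke itself), the hybrid neural network/hidden Markov model idea can be sketched as follows: a network scores each frame's feature vector to produce per-state emission probabilities, and an HMM decoder (here, Viterbi with flat transitions) combines them into a state sequence. The one-layer "network," the dimensions, and the transition model are invented placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
num_states, num_frames, feat_dim = 3, 5, 40
frames = rng.standard_normal((num_frames, feat_dim))  # one feature vector per frame

# Toy one-layer stand-in for the neural network acoustic model:
W = rng.standard_normal((num_states, feat_dim)) * 0.1
logits = frames @ W.T
emissions = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

trans = np.full((num_states, num_states), 1.0 / num_states)  # flat HMM transitions

# Viterbi decode: most likely HMM state sequence given the emissions.
delta = np.log(emissions[0])
back = np.zeros((num_frames, num_states), dtype=int)
for t in range(1, num_frames):
    scores = delta[:, None] + np.log(trans)      # prev-state x next-state
    back[t] = np.argmax(scores, axis=0)
    delta = scores[back[t], np.arange(num_states)] + np.log(emissions[t])

states = [int(np.argmax(delta))]
for t in range(num_frames - 1, 0, -1):
    states.append(int(back[t, states[-1]]))
states.reverse()                                 # decoded state sequence
```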
Figure 8, as identified by the Examiner (see Ans. 21), is illustrative of a neural network for use in Vanhoucke's disclosed approach and is reproduced below:

[Figure 8 of Vanhoucke (not reproduced): input feature vectors are presented to a neural network (800), which produces output emission probabilities.]

Figure 8 of Vanhoucke illustrates using a neural network (800) for determining emission probabilities (803) for hidden Markov models from input feature vectors (801). Vanhoucke, col. 26, ll. 17-20. As shown, the neural network has an input layer (L1), an output layer (L4), and two hidden layers (L2 and L3). Vanhoucke, col. 26, ll. 29-39.

Garcia-Romero is generally directed to speaker recognition. Garcia-Romero at 1 (Introduction). In particular, Garcia-Romero seeks to "boost the performance of probabilistic generative models that work with i-vector representations." Garcia-Romero at 1 (Abstract). Garcia-Romero teaches i-vector extraction is a known technique "to obtain a low dimensional fixed-length representation of a speech utterance that preserves the speaker-specific information." Garcia-Romero at 1 (Introduction). More specifically, Garcia-Romero teaches an i-vector extractor system that maps a sequence of vectors to a fixed-length vector. Garcia-Romero at 1 (I-vector extraction). Moreover, Garcia-Romero describes that each extracted i-vector is obtained as a MAP (maximum a posteriori) point estimate of a standard normally distributed latent variable. Garcia-Romero at 1 (I-vector extraction). Further describing the i-vector, Garcia-Romero explains each i-vector may be decomposed into two parts: (i) a speaker-specific part, "which describes the between-speaker variability and does not depend on the particular utterance," and (ii) a channel component, which is utterance dependent and describes the within-speaker variability. Garcia-Romero at 2 (Generative i-vector models).
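To fix ideas (this is not Garcia-Romero's actual extractor, which works from sufficient statistics of an utterance rather than a single noisy observation), the MAP point estimate of a standard normally distributed latent variable in a Gaussian linear model y = m + Tw + noise has a simple closed form; the dimensions, the matrix T, and the isotropic noise assumption below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 1000, 100                 # supervector and i-vector dimensions (invented)
m = rng.standard_normal(D)       # speaker- and channel-independent mean
T = rng.standard_normal((D, R)) * 0.1   # toy total-variability matrix

w_true = rng.standard_normal(R)  # standard normally distributed latent variable
y = m + T @ w_true + 0.01 * rng.standard_normal(D)   # observed supervector

# MAP point estimate of w given y, with prior w ~ N(0, I) and
# observation noise N(0, sigma^2 I):
sigma2 = 0.01 ** 2
precision = np.eye(R) + (T.T @ T) / sigma2
i_vector = np.linalg.solve(precision, T.T @ (y - m) / sigma2)
```

The resulting fixed-length i_vector summarizes the utterance independently of its length, which is the property Garcia-Romero highlights.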
Based on these teachings, the Examiner finds Vanhoucke contemplates additional feature vectors may be input to the neural network illustrated in Figure 8 of Vanhoucke, as indicated by the ellipsis between feature vectors presented to input layer (L1) of neural network (800). Ans. 21. The Examiner further interprets Figure 8 to suggest "that other vectors can be input into the input layer of the neural network" and that one of ordinary skill in the art would include the i-vector, as taught by Garcia-Romero (i.e., data indicative of latent variables of multivariate factor analysis), as an input to the neural network of Vanhoucke. Ans. 22. The Examiner reasons "[t]he incorporation of the i-vectors in this scenario provides for having distinguishable factors for speech recognition." Ans. 22.

Contrary to the Examiner's proposed combination, we agree with Appellants that the feature vectors of Vanhoucke and the i-vectors of Garcia-Romero are different. See Reply Br. 1; see also App. Br. 6. The ellipsis at the input layer of the neural network illustrated in Figure 8 of Vanhoucke suggests additional feature vectors may be input, not all other vectors, such as the i-vectors of Garcia-Romero. Additionally, we note, as do Appellants (see, e.g., App. Br. 5), Garcia-Romero does not describe providing its i-vectors as an input to a neural network.

We find it dispositive that Vanhoucke and Garcia-Romero, as relied on by the Examiner, do not teach "providing the feature vector and the data indicative of the latent variables as input to an input layer of a neural network," as required by independent claims 1, 9, and 17. Further, neither Hinton nor the other prior art references of record are relied on by the Examiner to provide this teaching.

Accordingly, we need not address other issues raised by Appellants' arguments. For the reasons discussed supra, we are persuaded of Examiner error. Thus, we do not sustain the Examiner's rejection of independent claims 1, 9, and 17. Additionally, we do not sustain the Examiner's rejections of claims 2-8, 10-16, and 18-20, which depend therefrom.

DECISION

We reverse the Examiner's decision rejecting claims 1-20.

REVERSED