Ex parte Kallio et al., Appeal 2010-006945, Application 10/341,332 (P.T.A.B. Apr. 11, 2013)

UNITED STATES PATENT AND TRADEMARK OFFICE
UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office
Address: COMMISSIONER FOR PATENTS, P.O. Box 1450, Alexandria, Virginia 22313-1450, www.uspto.gov

APPLICATION NO.: 10/341,332
FILING DATE: 01/10/2003
FIRST NAMED INVENTOR: Loura Kallio
ATTORNEY DOCKET NO.: 944-001.096
CONFIRMATION NO.: 7106

4955 7590 04/12/2013
WARE, FRESSOLA, MAGUIRE & BARBER LLP
BRADFORD GREEN, BUILDING 5
755 MAIN STREET, P.O. BOX 224
MONROE, CT 06468

EXAMINER: HAN, QI
ART UNIT: 2659
MAIL DATE: 04/12/2013
DELIVERY MODE: PAPER

Please find below and/or attached an Office communication concerning this application or proceeding. The time period for reply, if any, is set in the attached communication. PTOL-90A (Rev. 04/07)

UNITED STATES PATENT AND TRADEMARK OFFICE
________________
BEFORE THE PATENT TRIAL AND APPEAL BOARD
________________
Ex parte LOURA KALLIO, PAAVO ALKU, KIMMO KAYHKO, MATTI KAJALA, and PAIVI VALVE
________________
Appeal 2010-006945
Application 10/341,332
Technology Center 2600
________________
Before JEAN R. HOMERE, MICHAEL J. STRAUSS, and JOHN G. NEW, Administrative Patent Judges.

NEW, Administrative Patent Judge.

DECISION ON APPEAL

SUMMARY

Appellants file this appeal under 35 U.S.C. § 134(a) from the Examiner’s Final Rejection of claims 1-24 and 33-36.1

Specifically, claims 1, 11, 12, 14, 15, 19, and 22-24 stand rejected as unpatentable under 35 U.S.C. § 103(a) as being obvious over the combination of Gustafsson et al. (US 2001/0044722 A1, November 22, 2001) (“Gustafsson”) and Michael E. Deisher and Andreas S. Spanias, Speech enhancement using state-based estimation and sinusoidal modeling, 102(2) J. ACOUST. SOC. AM. 1141 (1997) (“Deisher”).

Claims 2 and 3 stand rejected as unpatentable under 35 U.S.C.
§ 103(a) as being obvious over the combination of Gustafsson and Malah et al. (US 2003/0093279 A1, May 15, 2003) (“Malah”).

Claims 4-6, 16, 34, and 36 stand rejected as unpatentable under 35 U.S.C. § 103(a) as being obvious over the combination of Gustafsson and Jax et al. (US 2003/0050786 A1, March 13, 2003) (“Jax”).

Claims 7 and 17 stand rejected as unpatentable under 35 U.S.C. § 103(a) as being obvious over the combination of Gustafsson, Asghar et al. (US 6,418,412 B1, July 9, 2002) (“Asghar”), and Wilson et al. (US 5,323,337, June 21, 1994) (“Wilson”).

Claims 8, 9, 13, and 20-21 stand rejected as unpatentable under 35 U.S.C. § 103(a) as being obvious over the combination of Gustafsson, Jax, and Gibson et al. (US 6,366,029 B1, January 1, 2003) (“Gibson”).

Claim 18 stands rejected as unpatentable under 35 U.S.C. § 103(a) as being obvious over the combination of Gustafsson, Asghar, Jax, and Wilson.

1 Claims 25-32 are canceled.

We have jurisdiction under 35 U.S.C. § 6(b).

We affirm.

NATURE OF THE CLAIMED INVENTION

Appellants’ invention is directed to a method and device for improving the quality of speech signals transmitted using an audio bandwidth between 300 Hz and 3.4 kHz. After the received speech signal is divided into frames, zeros are inserted between samples to double the sampling frequency, which produces aliased (image) frequency components in the higher band. The level of these aliased frequency components is adjusted using an adaptive algorithm based on the classification of the speech frame. Sound can be classified into sibilants and non-sibilants, and a non-sibilant sound can be further classified into a voiced sound and a stop consonant. The adjustment is based on parameters, such as the number of zero-crossings and the energy distribution, computed from the spectrum of the up-sampled speech signal between 300 Hz and 3.4 kHz. A new sound with a bandwidth between 300 Hz and 7.7 kHz is obtained by inverse Fourier transforming the spectrum of the adjusted, up-sampled sound. Abstract.
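The zero-insertion step summarized above can be illustrated numerically. The following sketch (not part of the record; the sampling rate and test-tone frequency are assumed illustrative values, not figures from the application) shows that interleaving a zero after every sample doubles the sampling rate and places a spectral image of a narrowband tone in the upper band:

```python
import numpy as np

fs = 8000          # narrowband sampling rate (Hz), assumed for illustration
f0 = 1000          # test tone inside the 300 Hz - 3.4 kHz telephone band
n = np.arange(256)
x = np.cos(2 * np.pi * f0 * n / fs)

# Zero insertion: interleave a zero after every sample, doubling
# the sampling rate to 16 kHz without any interpolation filtering.
y = np.zeros(2 * len(x))
y[::2] = x

# The spectrum of y repeats the narrowband spectrum with period fs:
# the original tone at 1 kHz plus an aliased image at 8 kHz - 1 kHz = 7 kHz.
spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / (2 * fs))
peaks = freqs[spec > 0.25 * spec.max()]
print(peaks)   # components at 1000 Hz and 7000 Hz
```

Adjusting the level of that 7 kHz image, rather than filtering it out, is the mechanism the claimed invention uses to synthesize wideband content.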
GROUPING OF CLAIMS

Because Appellants argue that the Examiner erred for substantially the same reasons with respect to claims 1-24 and 33-36, we select claim 1 as representative of this group. App. Br. 37-38.

Claim 1 recites:

1. A method comprising:
classifying speech signals in signal segments of a speech into a plurality of classes based on at least one signal characteristic of the speech signals;
converting the speech signals into a plurality of transformed segments having a speech spectrum in a lower frequency range and a speech spectrum in a higher frequency range in a frequency domain, the speech spectrum in the higher frequency range having a spectral image of the speech spectrum of the lower frequency range;
modifying the speech spectrum in the high frequency range in the frequency domain based on the classes for providing modified transformed segments in said high frequency range; and
converting the modified transformed segments into speech data in the time domain.

App. Br. 40.

ISSUES AND ANALYSES

We address each of Appellants’ arguments seriatim, as presented in Appellants’ Brief.

Issue 1

Appellants argue that the Examiner erred in finding that Gustafsson teaches or suggests the limitation of claim 1 reciting “classifying speech signals in signal segments of a speech into a plurality of classes based on at least one signal characteristic of the speech signals.” App. Br. 21-22. We therefore address the issue of whether the Examiner so erred.

Analysis

Appellants admit that Gustafsson teaches or suggests how a speech signal is sampled into frames and what voiced sound and unvoiced sound are, but argue that Gustafsson neither teaches nor suggests classifying speech signals in signal segments of a speech into a plurality of classes based on at least one signal characteristic of the speech signals. App. Br. 22 (citing Gustafsson, ¶¶ [0062], [0036]).
According to Appellants, Gustafsson teaches or suggests that whether the speech signal is voiced or unvoiced is determined by a pitch decision unit 430 based on the error signal 424. App. Br. 22. Appellants contend that Gustafsson teaches that the pitch decision unit 430 determines the pitch based on a distance between transients in the error signal and that, therefore, Gustafsson fails to disclose classifying speech signals in signal segments of a speech into a plurality of classes based on at least one signal characteristic of the speech signals. Id. at 22-23.

The Examiner responds that Gustafsson teaches or suggests that speech sound may be classified into two main categories (i.e., classes): voiced sound and unvoiced sound (i.e., a signal characteristic). Ans. 5 (citing Gustafsson, ¶ [0062]). The Examiner finds that Gustafsson teaches or suggests that the processes ensuing between blocks 420-460 of Fig. 4 (see also Gustafsson, Fig. 6, 620-650; Fig. 8, 830-870) are based upon Gustafsson’s disclosed “block [frame] of [speech] samples” (i.e., “signal segments of a speech”). Ans. 20-21 (citing Gustafsson, ¶ [0062]). The Examiner finds that this teaching includes the process for classifying and determining voiced/unvoiced sound in blocks 430, 630, or 910, which is an inherent and necessary feature of Gustafsson’s invention because its speech-model-based processes require frame-by-frame (i.e., signal segment by signal segment) operations. Ans. 21. Otherwise, finds the Examiner, the related parameters (e.g., formant, pitch, etc.) cannot be obtained at all. Id. The Examiner concludes that a person of ordinary skill in the contemporaneous art would have readily recognized this inherent feature. Id.
The Examiner additionally finds that Gustafsson’s teachings with respect to a vocoder (implying frame-by-frame processing) (citing Gustafsson, ¶¶ [0053], [0058]), “segments … represent[ing] an unvoiced sound” (citing Gustafsson, ¶ [0033]), “segmentation module 820” (citing Gustafsson, Fig. 8; ¶ [0054]), and “short time stationary” for “the number of blocks” (citing Gustafsson, ¶¶ [0064]-[0065]) also read on the disputed limitation. Ans. 21.

We agree with the Examiner. Gustafsson teaches that “the parametric spectral analysis unit 420 outputs parameters, (i.e., values associated with the particular model employed therein) descriptive of the received voice signal” and that the:

[p]itch decision module 430 also determines whether the speech content of the received signal represents a voiced sound or an unvoiced sound, and generates a signal indicative thereof. The decision made by the pitch decision unit 430 regarding the characteristic of the received signal as being a voiced sound or an unvoiced sound may be a binary decision or a soft decision indicating a relative probability of a voiced signal or an unvoiced signal.2 The pitch information and a signal indicative of whether the received signal is a voiced sound or an unvoiced sound are output from the pitch decision unit 430 to a residual extender and copy unit 440.

Gustafsson, ¶¶ [0046]-[0048] (emphasis added); see Ans. 21. We agree with the Examiner that “voiced sound or unvoiced sound” corresponds to claim 1’s “at least one signal characteristic of the speech signals.” Ans. 20-21.

2 With respect to voiced and unvoiced sounds, Gustafsson teaches that:

Voiced sounds are produced when quasi-periodic bursts of air are released by the glottis, which is the opening between the vocal cords. These bursts of air excite the vocal tract, creating a voiced sound (i.e., a short “a” (ä) in “car”). By contrast, unvoiced sounds are created when a steady flow of air is forced through a constraint in the vocal tract. This constraint is often near the mouth, causing the air to become turbulent and generating a noise-like sound (i.e., as “sh” in “she”).

Gustafsson, ¶ [0036].

We therefore conclude that the Examiner did not err in finding that Gustafsson teaches or suggests the limitation of claim 1 reciting “classifying speech signals in signal segments of a speech into a plurality of classes based on at least one signal characteristic of the speech signals.”

Issue 2

Appellants argue that the Examiner erred in finding that the combination of Gustafsson and Deisher teaches or suggests the limitation of claim 1 reciting “converting the speech signals into a plurality of transformed segments having a speech spectrum in a lower frequency range and a speech spectrum in a higher frequency range in a frequency domain, the speech spectrum in the higher frequency range having a spectral image of the speech spectrum of the lower frequency range.” App. Br. 23. We therefore address whether the Examiner so erred.

Analysis

Appellants admit that Gustafsson discloses that the input signal is upsampled by a factor of 2, and the upsampled signal is analyzed by a parametric analysis module to determine the formant structure of the received speech signal either by an autoregressive (“AR”) model in the z-domain or by a sinusoidal model as described in Deisher. App. Br. 25. Appellants contend that the results of this analysis are the parameters and an error signal. Id. Nevertheless, Appellants contend that these teachings and suggestions of the prior art references do not disclose the disputed limitation of claim 1. App. Br. 29-32.
With respect to the AR model in the z-domain taught or suggested by Gustafsson, Appellants argue that the Examiner erroneously found that the AR model in the z-domain (transform) corresponds to the equivalent frequency domain, because the Fourier transform can be viewed as the z-transform of the sequence evaluated on the unit circle, owing to the two transforms inherently having the relationship X(z)|z=e^{jω} = X(ω). App. Br. 29.

Appellants admit that there is a relationship between the Fourier transform and the z-transform, but argue that the essential difference between Gustafsson and the present invention is that Gustafsson applies the z-transform only to the envelope of the spectrum, and so the result is not a “speech spectrum.” Id. Appellants assert that the Fourier transformation provides the frequency domain representation of the entire signal being transformed, whereas the AR model taught by Gustafsson, when applied to the same signal, yields two separate items: the residual signal and the envelope. Id. Of the two items, only the parameters can be represented in the z-domain. Id. (citing Gustafsson, Eq. 2). The residual signal is a time-domain entity which is used to provide information on the peaks and/or peak locations in the upsampled input signal in the narrowband. App. Br. 29. Appellants therefore contend that an AR model in the z-domain cannot be used to convert an upsampled input speech signal into transformed segments in the frequency domain having a spectrum in the lower frequency range and a spectrum in the higher frequency range. Id.

With respect to the sinusoidal model taught by Deisher and explicitly incorporated by reference in Gustafsson,3 Appellants argue that a sinusoidal model is one-half of a Fourier transform, because it lacks the cosine function for the same harmonic frequency. App. Br. 30.
Appellants contend that, since a sinusoidal model is only one-half of a Fourier transform, it cannot be used to convert an upsampled input speech signal into transformed segments in the frequency domain having a spectrum in the lower frequency range and a spectrum in a higher frequency range. Id.

In summary, Appellants argue that the AR model in the z-domain is different from a fast Fourier transform (“FFT”) in that an FFT provides the frequency domain representation of the entire signal being transformed, whereas an autoregressive (AR) model applied to the same signal yields two separate items: the residual signal and the envelope, of which only the parameters can be represented in the z-domain. App. Br. 34. Furthermore, because the sinusoidal model lacks the corresponding cosine terms, the sinusoidal model does not convert the upsampled speech signal into transformed segments in the frequency domain as an FFT does. Id. Consequently, Appellants argue, the combination of the two models still cannot be used to convert an upsampled speech signal into transformed segments in the frequency domain. App. Br. 34-35.

Finally, Appellants argue that, even if the upsampled speech signal is converted into transformed segments in the frequency domain, this output of the conversion would completely change the principles of operation taught or suggested by Gustafsson. App. Br. 35.

3 See Gustafsson, ¶ [0046].
According to Appellants, if the output of Gustafsson’s parametric spectral analysis unit 420 is transformed segments in the frequency domain, most, if not all, of the remaining part of the speech synthesis systems as disclosed in Gustafsson will be inoperable because (a) the pitch decision unit 430, the residual extender/copy unit 440, and the residual modifier can only be used to process a time-domain entity such as an error signal 424; (b) the synthesis filter 450 and the artificial formant determining unit 470 rely on the parameters, which are values associated with a certain AR or sinusoidal model; and (c) the new formant determining unit 850 relies on the parameters as calculated by the linear prediction module 840. Id. Appellants therefore assert that Gustafsson does not teach or suggest converting the input signal into transformed segments in the frequency domain having a spectrum in a lower frequency range and a spectrum in the higher frequency range. App. Br. 36.

The Examiner responds that Gustafsson teaches the spectral analysis module using the alternative sinusoidal model, which inherently processes the signal in the DFT (discrete Fourier transform) (or FFT) domain (i.e., the frequency domain). The Examiner finds that an inherent feature of the sinusoidal model is that “the short speech segments” are analyzed/synthesized (including enhanced or modified) with the DFT/FFT (i.e., transformed segments) in the frequency domain. Ans. 21 (citing Deisher, §§ I, III; Fig. 1). The Examiner finds, contra Appellants’ argument, that Equation (1) of Deisher, which describes the ith sample of speech within a 20 to 30 ms block, is not just half of the Fourier transform, because the existence of the variable phases implies the presence of sines, which would explicitly appear by applying the trigonometric identity for the cosine of the sum of two angles, and which reads on the disputed limitation. Ans. 22 (citing Deisher, § I).
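The Examiner’s point about the angle-sum identity can be checked numerically. The sketch below (an illustration only; the amplitude, frequency, and phase are arbitrary assumed values, not values from Deisher’s Equation (1)) confirms that a single phase-shifted cosine term of a sinusoidal model equals a cosine component plus an explicit sine component:

```python
import numpy as np

# Arbitrary illustrative amplitude, angular frequency, and phase.
A, omega, phi = 1.3, 0.7, 0.4
t = np.linspace(0, 10, 200)

# One term of a sinusoidal model: A*cos(omega*t + phi).
model_term = A * np.cos(omega * t + phi)

# Expanding with cos(a + b) = cos(a)cos(b) - sin(a)sin(b) exposes
# an explicit sine component whenever the phase phi is nonzero.
expanded = (A * np.cos(phi) * np.cos(omega * t)
            - A * np.sin(phi) * np.sin(omega * t))

print(np.allclose(model_term, expanded))   # True
```

With a nonzero phase, the sine coefficient A*sin(phi) is nonzero, which is the substance of the Examiner’s finding that the model is not “only one half of a Fourier transform.”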
The Examiner further finds that, in another embodiment, Gustafsson teaches a spectral analysis module using the AR model, wherein the speech model parameters include envelope spectrum parameters (e.g., ak, representing the formant) processed in the z-domain and excitation spectrum parameters (such as pitch and peak/harmonic numbers or locations) processed via the Fourier transform (i.e., the frequency domain). Ans. 22. The Examiner finds that, since Gustafsson’s speech model is processed based on frame-by-frame sections (i.e., segments), the processed frames/blocks combining the speech model signal and the residual/error signal in or after blocks 420, 620, or 830 (Figs. 4, 6, or 8) read on the claimed “transformed segments.” The Examiner finds that, since the determined parameters, such as the envelope spectrum parameters and excitation spectrum parameters, represent each of the processed frames of the speech signal based on the speech model, the combination of them as a whole corresponds to the claimed transformed segments (i.e., transformed to speech-model-based parametric segments). Ans. 22.

The Examiner also finds that the claimed “speech spectrum” can be reasonably interpreted as the entire speech spectrum, the amplitude or power spectrum, the phase spectrum, the envelope spectrum, the excitation spectrum, or possible combinations thereof, because the Specification and claim do not clearly or specifically define the term “speech spectrum.” Ans. 22-23. Moreover, finds the Examiner, since the Fourier transform (X(ω)) and the z-transform (X(z)) are related by X(z)|z=e^{jω} = X(ω), the z-domain processes taught or suggested by Gustafsson are reasonably interpreted as being equivalently in the frequency domain, and one of ordinary skill in the art would have readily recognized that the processes in the two transforms are interchangeable and would produce a predictable result. Id.

We are not persuaded by Appellants’ arguments.
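The relationship X(z)|z=e^{jω} = X(ω) relied on by both parties can also be verified numerically. The following sketch (illustrative only; the sequence is random data standing in for a frame of speech samples) evaluates the z-transform of a finite sequence on the unit circle at the DFT bin frequencies and compares the result with the DFT computed directly:

```python
import numpy as np

# A short test sequence standing in for one frame of speech samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
N = len(x)

# Evaluate X(z) = sum_n x[n] * z^{-n} on the unit circle z = e^{j*omega}
# at the DFT bin frequencies omega_k = 2*pi*k/N.
omegas = 2 * np.pi * np.arange(N) / N
z_vals = np.exp(1j * omegas)
X_z = np.array([np.sum(x * zv ** -np.arange(N)) for zv in z_vals])

# The DFT of the same sequence, computed directly.
X_dft = np.fft.fft(x)
print(np.allclose(X_z, X_dft))   # True
```

The two evaluations coincide term by term, which is the sense in which the z-domain and frequency-domain representations of a finite frame are interchangeable.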
As an initial matter, we observe that the language of claim 1 does not explicitly require the use of a fast Fourier transform to produce transformed segments in the frequency domain, as Appellants appear to imply. App. Br. 34. Although the use of an FFT is described in Appellants’ Specification (see, e.g., Spec., Fig. 1), limitations of an embodiment in the Specification may not be read into the claims. See Phillips v. AWH Corp., 415 F.3d 1303, 1323 (Fed. Cir. 2005).

We agree with the Examiner that Gustafsson teaches or suggests “converting the speech signals into a plurality of transformed segments having a speech spectrum in a lower frequency range and a speech spectrum in a higher frequency range in a frequency domain, the speech spectrum in the higher frequency range having a spectral image of the speech spectrum of the lower frequency range.” Specifically, we agree with the Examiner that the AR model and the sinusoidal model taught or suggested by Gustafsson (the latter being incorporated by reference from Deisher) correspond to the “transformed segments having a speech spectrum … in a frequency domain.” Ans. 21. Deisher teaches or suggests that the sinusoidal model represents speech over short time intervals by a finite set of sinusoids that capture most of the signal energy; the model’s amplitudes Ak and phases ϕk are typically measured from a high-resolution DFT, which is in the frequency domain. Deisher, § I; see Ans. 22.
Moreover, although we agree with Appellants that the AR model provides two results, an envelope of the spectrum in the frequency domain and a residual, we find that this result corresponds to the limitation reciting “transformed segments having a speech spectrum … in a frequency domain,” because, although an FFT is taught by Appellants’ Specification, the language of the claims requires only that the segment be “transformed … having a speech spectrum … in the frequency domain.” Gustafsson teaches that the AR model meets this criterion. See Gustafsson, Figs. 13-15; Ans. 7.

Nor are we persuaded by Appellants’ argument that, when the upsampled speech signal is converted into transformed segments in the frequency domain, the output of the conversion would completely change the principles of operation taught or suggested by Gustafsson. App. Br. 35. “The test for obviousness is not whether the features of a secondary reference may be bodily incorporated into the structure of the primary reference.” In re Keller, 642 F.2d 413, 425 (CCPA 1981). Rather, the obviousness inquiry requires a finding that the combination of known elements was obvious to a person with ordinary skill in the art. See In re Chevalier, No. 2012-1254, 2013 WL 57873, at *3 (Fed. Cir. Jan. 7, 2013); see also In re Sneed, 710 F.2d 1544, 1550 (Fed. Cir. 1983) (“[I]t is not necessary that the inventions of the references be physically combinable to render obvious the invention under review”); In re Nievelt, 482 F.2d 965 (CCPA 1973) (“Combining the teachings of references does not involve an ability to combine their specific structures”). We conclude that these elements of the disputed limitation would, through the teachings and suggestions of Gustafsson and Deisher, have been known to a person of ordinary skill in the contemporary art.
We consequently conclude that the Examiner did not err in finding that the combined prior art references teach or suggest the limitation of claim 1 reciting “converting the speech signals into a plurality of transformed segments having a speech spectrum in a lower frequency range and a speech spectrum in a higher frequency range in a frequency domain, the speech spectrum in the higher frequency range having a spectral image of the speech spectrum of the lower frequency range.”

Issue 3

Appellants argue that the Examiner erred in finding that the prior art references teach or suggest the limitation of claim 1 reciting “modifying the speech spectrum in the high frequency range in the frequency domain.” App. Br. 36. We therefore address the issue of whether the Examiner so erred.

Analysis

Appellants argue that Gustafsson teaches or suggests how a synthetic formant can be added to the synthesized speech signal in the higher frequency range. App. Br. 31. However, argue Appellants, when an AR model in the z-domain is used, the error signal and the parameters are conveyed to an inverse filter for speech synthesis (synthesis filter 450), and the synthesized speech will have the same formants if the inverse filter uses the same filter coefficients. Id. Appellants argue that the inverse or synthesis filter can be an AR model in the z-domain or a sinusoidal model taught or suggested by Gustafsson. Id. Appellants contend that, because the parameters and error signal are not equivalent to transformed segments in the frequency domain having a speech spectrum in the lower frequency range and a speech spectrum in the higher frequency range, the inverse or synthesis filter does not modify the transformed segments in the frequency domain. Id.
The Examiner responds that Gustafsson teaches or suggests that, when employing the sinusoidal model, the short speech segments after the DFT/FFT (becoming transformed segments) are analyzed and/or synthesized, including enhanced (i.e., becoming modified transformed segments), such as by “selection of harmonic components” and “harmonic reproduction” in the “frequency-domain.” Ans. 24 (citing Deisher, Abstract; §§ I, III; see also Fig. 1 (“enhancement” processes between the “FFT” and “inverse FFT” blocks)). The Examiner also finds that the parameters taught or suggested by Gustafsson include modified/added envelope spectrum parameters (e.g., the formant in Fig. 8 and ¶¶ [0078]-[0084]) and/or excitation spectrum parameters (e.g., peaks or harmonics for the residual or error signal in Figs. 5, 7, and 9; ¶¶ [0050]-[0051]) based on voiced/unvoiced information (Figs. 4, 6, 9). Ans. 24-25. The Examiner finds that these teachings and suggestions of Gustafsson reasonably read on the claim language of “modifying the speech spectrum.” Ans. 25.

Furthermore, finds the Examiner, in the embodiment of Gustafsson teaching use of the AR model in the z-domain, the modification of the envelope spectrum (e.g., the formant in Fig. 8; ¶¶ [0078]-[0084]) and/or the excitation spectrum (e.g., peaks or harmonics for the residual/error signal in Figs. 5, 7, and 9 and ¶¶ [0050]-[0051]) is based on voiced/unvoiced information (Figs. 4, 6, 9) for synthesizing wideband speech. Ans. 25 (citing Gustafsson, ¶¶ [0046]-[0048]). The Examiner finds that one of ordinary skill in the contemporaneous art would have recognized that the AR process in the z-domain (transform) would be interpreted as being equivalently in the frequency domain in a broader sense under the inherent relationship, as related supra. Id.
The Examiner therefore finds that the parameters taught or suggested by Gustafsson, including modified/added envelope spectrum parameters (such as the formant) and excitation spectrum parameters (such as peaks or harmonics for the residual or error signal), represent an extended speech frame and can be reasonably interpreted as reading on modified transformed segments. Ans. 25.

We are not persuaded by Appellants’ arguments. We have related supra why we agree with the Examiner’s findings that Gustafsson teaches or suggests transformed segments in the frequency domain. Moreover, we find that Gustafsson teaches supplementing the envelope of the speech power spectrum (transfer function) with synthetic formants and/or high-pass filtered noise. Gustafsson, Figs. 13-15; ¶¶ [0084]-[0086]; see Ans. 7. Further, we agree with the Examiner that Gustafsson teaches or suggests that “modifier 930 creates harmonic peaks in the upper frequency band (in frequency domain) by copying parts of the lower band residual signal to the higher band” and “create[s] the peaks corresponding to the high order harmonics in the upper frequency band.” Ans. 7 (citing Gustafsson, ¶¶ [0059]-[0061], Figs. 9, 13).

We find that these teachings and suggestions of the prior art citations correspond to the language of claim 1 reciting “modifying the speech spectrum in the high frequency range in the frequency domain,” and we therefore conclude that the Examiner did not err in concluding that that limitation would have been obvious over the prior art references to a person of ordinary skill in the contemporaneous art.

DECISION

The Examiner’s rejection of claims 1-24 and 33-36 as unpatentable under 35 U.S.C. § 103(a) is affirmed.

TIME PERIOD FOR RESPONSE

No time period for taking any subsequent action in connection with this appeal may be extended under 37 C.F.R. § 1.136(a)(1)(iv).

AFFIRMED

msc