Julian TibbleDownload PDFPatent Trials and Appeals BoardOct 15, 201913411234 - (D) (P.T.A.B. Oct. 15, 2019) Copy Citation UNITED STATES PATENT AND TRADEMARK OFFICE UNITED STATES DEPARTMENT OF COMMERCE United States Patent and Trademark Office Address: COMMISSIONER FOR PATENTS P.O. Box 1450 Alexandria, Virginia 22313-1450 www.uspto.gov APPLICATION NO. FILING DATE FIRST NAMED INVENTOR ATTORNEY DOCKET NO. CONFIRMATION NO. 13/411,234 03/02/2012 Julian TIBBLE 39348-0018001 5381 26181 7590 10/15/2019 FISH & RICHARDSON P.C. (SV) PO BOX 1022 MINNEAPOLIS, MN 55440-1022 EXAMINER ALMANI, MOHSEN ART UNIT PAPER NUMBER 2159 NOTIFICATION DATE DELIVERY MODE 10/15/2019 ELECTRONIC Please find below and/or attached an Office communication concerning this application or proceeding. The time period for reply, if any, is set in the attached communication. Notice of the Office communication was sent electronically on above-indicated "Notification Date" to the following e-mail address(es): PATDOCTC@fr.com PTOL-90A (Rev. 04/07) UNITED STATES PATENT AND TRADEMARK OFFICE ____________ BEFORE THE PATENT TRIAL AND APPEAL BOARD ____________ Ex parte JULIAN TIBBLE Appeal 2018-002395 Application 13/411,234 Technology Center 2100 ____________ Before JEREMY J. CURCURI, KARA L. SZPONDOWSKI, and MATTHEW J. McNEILL, Administrative Patent Judges. SZPONDOWSKI, Administrative Patent Judge. DECISION ON APPEAL Pursuant to 35 U.S.C. § 134(a), Appellant1 appeals from the Examiner’s decision to reject claims 1, 2, 5–8, 11–14, and 17–25, constituting all claims pending in the application. Oral hearing was held on August 1, 2019. A transcript will be placed in the record in due course. We have jurisdiction under 35 U.S.C. § 6(b). We REVERSE. 1 We use the word “Appellant” to refer to “applicant” as defined in 37 C.F.R. § 1.42. Appellant identifies the real party in interest as Semmle Limited. Appeal Br. 1. Appeal 2018-002395 Application 13/411,234 2 STATEMENT OF THE CASE Appellant’s invention is directed to finding duplicate passages of text in a collection of text. Spec. ¶ 2. Claim 1, reproduced below with the disputed limitation in italics, is illustrative of the claimed subject matter: 1. A computer-implemented method for duplicate text detection, the method comprising: producing a respective tokenized version of at least a portion of each file in a plurality of files of a corpus of text, and generating from each tokenized version of each file a respective sequence of overlapping segments for the file, wherein each segment for each file is a specified number of adjacent tokens in the file, and wherein the order of the segments in the sequence of overlapping segments matches the order in the file of the tokens that make up the segments; calculating one hash value for each overlapping segment of each file of the plurality of files of the corpus of text to produce respective sequences of hash values, wherein each sequence of hash values corresponds to a file of the plurality of files of the corpus of text, and wherein each sequence of hash values for a file corresponds in order to the segments in the sequence of overlapping segments for the file; determining duplicate hash values that each occur in more than one of the sequences of hash values for the plurality of files in the corpus of text; creating partitions that each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions; determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values HI and H2 are adjacent hash values if and only if every occurrence of HI in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of Appeal 2018-002395 Application 13/411,234 3 hash values for the plurality of files in the corpus of text is preceded by HI; until all pairs of adjacent hash values in the partitions have been found: finding pairs of adjacent hash values that were determined as adjacent in the determining duplicate hash values step and that are each in different respective partitions, and for each such pair of adjacent hash values, merging respective partitions having the adjacent hash values to form a respective merged partition; and after the merging of partitions having adjacent hash values is completed and using the partitions that remain after the merging, producing a report of pieces of duplicate text found in the corpus of text based on each remaining partition representing a single instance of text duplication. REJECTIONS Claims 1, 2, 5, 7, 8, 11, 13, 14, 17, and 19–25 stand rejected under 35 U.S.C. § 103(a) as being unpatentable over Raphel et al., (US 9,342,621 B1; issued May 17, 2016) (“Raphel”) and Saul Schleimer et al., “Winnowing: Local Algorithms for Document Fingerprinting,” SIGMOD International Conference on Management of Data, June 9–12, 2003 at 76–85 (“Schleimer”). Claims 6, 12, and 18 stand rejected under 35 U.S.C. § 103(a) as being unpatentable over the combination of Raphel, Schleimer, and Maxime Crochemore & Christophe Hancart, “Pattern Matching in Strings,” Algorithms and Theory of Computation Handbook, 13-1–13-29 (Chapman & Hall 2010) (“Crochemore”). Appeal 2018-002395 Application 13/411,234 4 ANALYSIS The Examiner finds the combination of Raphel and Schleimer teaches or suggests the limitations in independent claims 1, 7, and 13. Final Act. 2– 11. Appellant argues, inter alia, that Raphel does not teach or suggest the limitation “determining duplicate hash values in the partitions that are adjacent hash values, wherein any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequence of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequence of hash values for the plurality of files in the corpus of text is preceded by H1,” as recited in claim 1 and commensurately recited in claims 7 and 13. App. Br. 8–14. Raphel is generally directed to matching phrases in portions of received content, specifically to prevent data leakage of sensitive or confidential information. Raphel 1:5–21, 2:54–3:2. A user inputs phrases indicative of sensitive or confidential content, such as “privileged and confidential,” (hereafter “P1”), “private and billing confidential,” (hereafter “P2”), and “private and confidential” (hereafter “P3”). Raphel 4:6–8, 25– 28. These phrase terms are hashed by a phrase detector and the resulting hashes are used to create concatenated hashes. Raphel 4:14–16. Such hashes are exemplified in Table 1, reproduced below: Appeal 2018-002395 Application 13/411,234 5 Table 1 depicts three columns: (1) a unigram column, listing hashes h1–h5 of each phrase term included in phrases P1, P2, and P3; (2) a bigram column, listing hashes h6–h10 of concatenations of proper subsets of the set phrase terms, each having a cardinality of two, for example, “privileged and” or “and confidential,” where the concatenation is in the order of the ordinal position of the phrase of which it is a subset; and (3) a K-gram column, listing hashes h11–h13 for each concatenation of the set of hashes according to the ordinal positions of its corresponding phrase, for example “privileged and confidential” or “privileged and billing confidential.” Raphel, 4:42–63. The Examiner primarily relies on Table 1 and its corresponding description to teach the disputed limitation. Final Act. 4, 14–19; Ans. 3–14. Specifically, the Examiner finds unigrams h1–h5 are equivalent to the claimed tokens, bigrams h6–h10 are hashes of the sequence of tokens/unigrams and are equivalent to the claimed segments, and K-grams h11–h13 are longer sequences of bigrams and are equivalent to the adjacent hashes recited in the claim. Final Act. 15–17; see also Ans. 7–9, 13–14. The Examiner finds “any appearance of two consecutive [hash values] such as h6 and h7 is found as h11, the partition h11 contains the adjacent hash values h6 and h7; every time that h6 appears after h7 it is referenced by h11 and only the appearance of h6 and h7 with the same order is identified as h11.” Final Act. 4; see also Final Act. 16; Ans. 4, 10–11, 13 (“a value such as a bigram or k-gram can represent a partition”). Appellant argues the unigrams, bigrams, and K-grams in Table 1 “are simply different order-preserving permutations of the five different terms appearing in the three input phrases.” Reply Br. 4. According to Appellant, Appeal 2018-002395 Application 13/411,234 6 “h11 is the hash of h1.h2.h3 and not a hash of h6.h7.” App. Br. 9. Appellant further argues “h6 is a hash of h1.h2” and “h7 is a hash of h2.h3” so “[c]oncatenating these hashes would also not result in h11 . . . but would result in a concatenation of h(h1.h2) and h(h2.h3),” which “is not the same as h11, which is a hash of h1.h2.h3.” App. Br. 9. According to Appellant, “h11 is a hash rather than a ‘partition.’” App. Br. 13. Appellant contends “Table 1 of Raphel merely shows how certain hashes can be generated from other hashes and imposes no restriction at all on when they can be generated.” App. Br. 10. Appellant also argues Table 1 contradicts that h11 is generated every time h7 is preceded by h6 because h7 is preceded by h1 in h11 and preceded by h4 in h13. App. Br. 10–12; see Reply Br. 8. We are persuaded by Appellant’s arguments. As Appellant notes, Raphel is concerned with searching a single document for a matching phrase. See Reply Br. 3, 6–7; e.g., Raphel Abstr., 10:48–51, Figs. 4, 6, Abstr. In contrast, Appellant’s invention is directed to identifying duplicate text across a plurality of files. See Spec. ¶ 2; App. Br. 17 (claim 1). Table 1 of Raphel identifies hashes that can be stored as phrase detection data, which are used to detect the subject phrases in the received content. Raphel 4:28– 31, Table 1, Abstr. The Examiner has not sufficiently explained how Table 1 and its associated description teaches or suggests the disputed limitation. Specifically, the Examiner has not sufficiently explained how Raphel’s bigrams (e.g., h6) or K-grams (e.g., h11) teach or suggest a partition, as claimed. The claim recites that the partitions “each include one of the duplicate hash values, wherein each duplicate hash value is in exactly one of the partitions.” The bigrams are described as “hashes of concatenations of proper subsets of the set phrase terms” and have a Appeal 2018-002395 Application 13/411,234 7 cardinality of two; the K-grams are described as “a hash for each concatenation of the set of hashes according to the ordinal positions of its corresponding phrase.” Raphel 4:46–48, 57–59. We agree with Appellant that K-gram h11 is the hash of unigrams h1.h2.h3, and not a hash of bi- grams h6.h7. See App. Br. 9, 13; see Raphel 4:59–63. Likewise, the bigrams are also a hash of concatenated unigrams. Raphel 4:50–56. In other words, the unigrams, bigrams, and K-grams listed in Table 1 are merely hashes. The Examiner has not sufficiently explained how these hashes include a duplicate hash value “in exactly one of the partitions.” It follows that the Examiner has not sufficiently shown that Raphel does not teach or suggest “determining duplicate hash values in the partitions that are adjacent hash values.” Moreover, we agree with Appellant that Table 1 of Raphel does not teach or suggest “any two hash values H1 and H2 are adjacent hash values if and only if every occurrence of H1 in the sequences of hash values for the plurality of files in the corpus of text is followed by H2, and every occurrence of H2 in the sequences of hash values for the plurality of files in the corpus of text is preceded by H1.” See App. Br. 10–12 (underlining omitted); see Reply Br. 8. As discussed above, Table 1 merely identifies a list of hashes that can be used as phrase detection data in order to match phrases in received content. As also discussed above, Raphel is not trying to determine if two hash values are adjacent across a plurality of files; it is looking for a match within a single document. The Examiner finds bigrams in Table 1 of Raphel are adjacent because they overlap for each respective phrase (e.g., h6 (“privileged and”), h7 (“and confidential”)), and that further overlaps with a trigram (e.g., h11 (“privileged and confidential”)), which the Appeal 2018-002395 Application 13/411,234 8 Examiner finds indicates that h11 is the merged form of h6 and h7, and means that h6 and h7 must be adjacent. See Ans. 10–11. However, we agree with Appellant that the Examiner’s findings do not sufficiently show that every occurrence of H1 (e.g., h6) is followed by H2 (e.g., h7), and every occurrence of H2 (e.g., h7) is preceded by H1 (e.g., h6). See App. Br. 10– 12; Reply Br. 8. For example, h6 is the concatenation of h1 and h2 and represents “privileged and”; h7 is the concatenation of h2 and h3 and represents “and confidential”; and h8 is the concatenation of h4 and h2 and represents “private and.” Raphel 4:50–55. K-gram h11 is the concatenation of h1, h2, and h3 and represents “privileged and confidential” and K-gram h13 is the concatenation of h4, h2, and h3 and represents “private and confidential.” Raphel 4:59–63. Under the Examiner’s interpretation, h7 (e.g., h2 and h3) should always be preceded by h6 (e.g., h1 and h2). However, this is not the case in K-gram h13, where h7 (e.g., h2 and h3) is preceded by h8 (e.g., h4 and h2). Because we agree with at least one of Appellant’s arguments, we need not reach Appellant’s remaining arguments. We are persuaded the Examiner erred in rejecting independent claims 1, 7, and 13. For the same reasons, we are persuaded the Examiner erred in rejecting dependent claims 2, 4, 6, 8, 11, 12, 14, and 17–25. For the foregoing reasons, we do not sustain the Examiner’s rejection of claims 1, 2, 5–8, 11–14, and 17–25. CONCLUSION In summary: We reverse the Examiner’s 35 U.S.C. § 103(a) rejection of claims 1, 2, 5–8, 11–14, and 17–25. Appeal 2018-002395 Application 13/411,234 9 Claims Rejected 35 U.S.C. § Basis Affirmed Reversed 1, 2, 5, 7, 8, 11, 13, 14, 17, and 19–25 103 Raphel and Schleimer 1, 2, 5, 7, 8, 11, 13, 14, 17, and 19–25 6, 12 and 18 103 Raphel, Schleimer, and Crochemore 6, 12 and 18 Overall Outcome 1, 2, 5–8, 11– 14, and 17–25 REVERSED Copy with citationCopy as parenthetical citation