Ex parte Saladi et al., No. 14/319,231 (P.T.A.B. Mar. 8, 2018)

UNITED STATES PATENT AND TRADEMARK OFFICE
UNITED STATES DEPARTMENT OF COMMERCE
Address: COMMISSIONER FOR PATENTS, P.O. Box 1450, Alexandria, Virginia 22313-1450
www.uspto.gov

APPLICATION NO.: 14/319,231
FILING DATE: 06/30/2014
FIRST NAMED INVENTOR: Kalyan Saladi
ATTORNEY DOCKET NO.: C102
CONFIRMATION NO.: 3962

152606 7590 03/08/2018
VMware - OPW
P.O. Box 4277
Seattle, WA 98194

EXAMINER: UNG, LANNY N
ART UNIT: 2191
MAIL DATE: 03/08/2018
DELIVERY MODE: PAPER

Please find below and/or attached an Office communication concerning this application or proceeding. The time period for reply, if any, is set in the attached communication.

PTOL-90A (Rev. 04/07)

UNITED STATES PATENT AND TRADEMARK OFFICE
BEFORE THE PATENT TRIAL AND APPEAL BOARD

Ex parte KALYAN SALADI, REZA TAHERI, DANIEL MICHAEL HECHT, JIN HEO, and JEFFREY BUELL

Appeal 2017-007019
Application 14/319,231
Technology Center 2100

Before JUSTIN BUSCH, STACEY G. WHITE, and JASON M. REPKO, Administrative Patent Judges.

BUSCH, Administrative Patent Judge.

DECISION ON APPEAL

Pursuant to 35 U.S.C. § 134(a), Appellants appeal from the Examiner's decision to reject claims 21–42, which constitute all the claims pending in this application. We have jurisdiction under 35 U.S.C. § 6(b). We affirm.

CLAIMED SUBJECT MATTER

Claims 21, 32, and 38 are independent claims. Appellants' invention "is directed to virtualization of computer hardware and hardware-based performance monitoring and, in particular, to methods and systems for monitoring the performance of memory management in virtual machines." Spec. ¶ 1. Claim 21 is representative and reproduced below.

21. A virtualization layer comprising computer instructions, stored in a physical data-storage device within a virtualized computer system that includes the one or more processors that each provides a set of hardware-level performance monitoring registers, one or more memories, and one or more physical data storage devices, that, when executed by one or more of the one or more processors, control the virtualized computer system to:

    provide, by the virtualization layer, a virtual hardware interface to one or more virtual machines that each includes a guest operating system and one or more application programs that execute within an execution environment provided by the guest operating system;

    provide, by the virtualization layer, as a component of the virtual hardware interface, a set of virtual performance monitoring registers that are accessed by the one or more of the guest operating systems and that differ in number and/or function from the hardware-level performance monitoring registers provided by any one of the one or more physical processors;

    monitor, by the virtualization layer, execution characteristics of memory management within of one or more virtual machines using one or more of the virtual performance monitoring registers; and

    when the monitored execution characteristics of one of the one or more virtual machines exceeds a first threshold value, reconfiguring, by the virtualization layer, the virtual machine to use a different type of memory management.

REJECTIONS

Claims 21–42 stand provisionally rejected for non-statutory double patenting over claims 1–23 of U.S. Patent Application No. 14/263,640 in view of
Chang S. Bae et al., Enhancing Virtualized Application Performance Through Dynamic Adaptive Paging Mode Selection, in Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC 2011) and Co-located Workshops 255–64 (July 15, 2011) ("Bae"). Final Act. 3.

Claims 21–31 and 38–42 stand rejected under 35 U.S.C. § 101 as being directed to non-statutory subject matter. Final Act. 7.

Claims 21–25, 27–30, 32–36, and 38–42 stand rejected under 35 U.S.C. § 103(a) as obvious in view of Benjamin Serebrin & Daniel Hecht, Virtualizing Performance Counters, in Euro-Par 2011 Workshops, Part I, LNCS 7155, 223–33 (M. Alexander et al. eds., 2012) ("Serebrin") and Bae. Final Act. 8–15.

Claims 26 and 37 stand rejected under 35 U.S.C. § 103(a) as obvious in view of Serebrin, Bae, and Levit-Gurevich (US 2008/0155536 A1; June 26, 2008). Final Act. 15–18.

Claim 31 stands rejected under 35 U.S.C. § 103(a) as obvious in view of Serebrin, Bae, and Santos (US 8,230,059 B1; July 24, 2012). Final Act. 18–19.

ANALYSIS

Double Patenting Rejection

The Examiner provisionally rejects claims 21–42 on the ground of non-statutory double patenting because the pending claims are not patentably distinct from claims 1–23 of co-pending U.S. Patent Application Number 14/263,640 in view of Bae. Final Act. 3. Appellants state that they "defer commenting on the double-patenting rejection until it matures into a non-provisional rejection." App. Br. 5.[1] Accordingly, Appellants have waived appeal as to this provisional rejection, and we summarily sustain the provisional non-statutory double patenting rejection of claims 21–42.

[Footnote 1: We note that the Examiner's Answer appears to cite to pages from the Appeal Brief filed on October 24, 2016. However, in response to a Notice of Non-Compliant Appeal Brief, mailed on November 23, 2016, Appellants submitted a replacement Appeal Brief on December 15, 2016. We cite to the page numbers from Appellants' replacement Appeal Brief.]

The Rejection Under 35 U.S.C. § 101

The Examiner rejects claims 21–31 and 38–42 under 35 U.S.C. § 101 as being directed to non-statutory subject matter. Final Act. 7. In particular, the Examiner finds claims 21–31 and 38–42 encompass signals, which do not fall in one of the categories of patentable subject matter. Id. The Examiner discusses the fact that simply because a thing is physical and tangible does not mean that it excludes carrier waves and, furthermore, the examples in Appellants' Specification are not limiting so as to preclude an interpretation of claims 21–31 and 38–42 as encompassing a signal or carrier waves. Ans. 23–24 (citing Spec. ¶ 47); Final Act. 7 (citing Spec. ¶ 47).

Appellants point to portions of their Specification explaining that the recited virtual machines are physical and tangible and argue "the Examiner has cited nothing in case law, rule, or statute that would suggest that a carrier wave is non-statutory." App. Br. 6–7 (citing Spec. ¶¶ 35, 47). Appellants also contend that "signals [or carrier waves] do not store data and are not tangible." Id. at 8. Appellants further assert claims 21–31 and 38–42 are directed to statutory subject matter because they recite computer systems that include processors, memories, and physical data-storage devices and "[c]arrier waves are not included within virtualized computer systems." Id. at 9.

These arguments are not persuasive. Contrary to Appellants' assertions, both the Federal Circuit and this Board have held that simply
limiting the subject matter to being "physical" and "tangible" does not necessarily exclude carrier waves or signals. In re Nuijten, 500 F.3d 1346, 1356–57 (Fed. Cir. 2007); Ex parte Mewherter, 107 USPQ2d 1857, 1862 (PTAB 2013); see also U.S. Patent & Trademark Office, Evaluating Subject Matter Eligibility Under 35 USC § 101 (August 2012 Update) (pp. 11–14), available at http://www.uspto.gov/patents/law/exam/101_training_aug2012.pdf (noting that while the recitation "non-transitory" is a viable option for overcoming the presumption that those media encompass signals or carrier waves, merely indicating that such media are "physical" or "tangible" will not overcome such presumption). Similarly, simply reciting that a medium is a storage medium does not necessarily exclude transitory signals from the claimed subject matter. Mewherter, 107 USPQ2d at 1859–62 (finding the recited storage medium encompassed signals and, thus, was directed to non-statutory subject matter).

Appellants' claims, however, do not recite a physical data storage medium. Rather, Appellants' claims recite a physical data storage device. Although Appellants' Specification does not explicitly state that data storage devices cannot encompass signals and characterizes the data storage devices as "physical" and "tangible," Appellants' Specification describes non-transitory exemplary devices variously as "optical and electromagnetic disks, electronic memories, and other physical data-storage devices," "electronic memories, mass-storage devices, optical disks, magnetic disks, and other such devices," and "larger, slower electronic memories, disk drives, and other such data-storage devices." Spec. ¶¶ 35, 47, 99; see also Spec. ¶ 34 (describing computer instructions as simply part of devices). Any ambiguity regarding whether Appellants are attempting to claim either the virtualization layer recited in claim 21 or the computer instructions recited in claim 38 apart from the physical data storage device on which they are stored has been clarified through Appellants' argument in the prosecution history.

For the reasons given above, we agree with Appellants that the physical data storage devices recited in claims 21–31 and 38–42 do not encompass signals and, therefore, are directed to statutory subject matter. Accordingly, we do not sustain the Examiner's rejection of claims 21–31 and 38–42 as unpatentable under 35 U.S.C. § 101.

The Rejections Under 35 U.S.C. § 103

Based on Appellants' arguments, we decide the appeal of the rejection of claims 21–42 under 35 U.S.C. § 103 on the basis of representative claim 21. In reaching this decision, we consider all evidence presented and all arguments actually made by Appellants. We do not consider arguments Appellants could have made but chose not to make in the Briefs, and we deem any such arguments waived. 37 C.F.R. § 41.37(c)(1)(iv).

The Examiner rejects all pending claims as obvious in view of Serebrin and Bae, or Serebrin and Bae in combination with either Levit-Gurevich (claims 26 and 37) or Santos (claim 31). Final Act. 8–19. Appellants argue for the patentability of claims 21–25, 27–30, 32–36, and 38–42 as a group based on the limitations recited in claim 21. See App. Br. 9–15. Appellants do not substantively argue the rejection of claims 26, 31, and 37 separately, relying on the arguments presented with respect to claim 21. See App. Br. 15.
Of particular note, the Examiner finds Serebrin teaches or suggests "monitor, by the virtualization layer, execution characteristics of memory management within of one or more virtual machines using one or more of the virtual performance monitoring registers," as recited in claim 21. Final Act. 9 (citing Serebrin 224). The Examiner explains that Serebrin's hypervisor, which the Examiner finds teaches or suggests the recited virtualization layer, provides virtualized performance counters that allow profiling of the virtual machines. Ans. 27. The Examiner further explains that Serebrin discloses using its hypervisor to provide virtualized performance counters to measure various metrics of the virtual machines. Id. The Examiner finds Bae discloses comparing metrics, similar to those Serebrin measures, to a threshold to determine whether to switch memory management modes. Ans. 27–28. Thus, the Examiner concludes it would have been obvious to an ordinarily skilled artisan to use Bae's memory management switching technique or approach, which is done using hardware performance counters in Bae, with Serebrin's virtualized memory management because Bae discloses that dynamically switching memory management techniques is beneficial depending on the particular workload. Ans. 28 (citing Bae Abstract).

Monitoring By the Virtualization Layer

Appellants' primary contention is that the claims require the virtualization layer to monitor the "execution characteristics of memory management within of one or more virtual machines using one or more of the virtual performance monitoring registers," whereas Appellants argue Serebrin teaches the guest operating systems perform the monitoring. App. Br. 12–13; Reply Br. 7–11.[2] Appellants argue "Serebrin discloses a system in which a hypervisor provides virtual hardware performance counters to guest operating systems." Reply Br. 9.

[Footnote 2: Appellants also argue "[t]he cited portion [block quoted on pages 12 and 13 of the Appeal Brief] of Serebrin is discussing hardware performance monitoring registers." App. Br. 13. Appellants, however, are referring to a portion of Serebrin the Examiner did not cite. The paragraph Appellants subsequently quote and discuss (appearing entirely on page 13 of the Appeal Brief), however, is the portion of Serebrin the Examiner cites.]

Claim 21 recites, among other things, that the virtualization layer provides "a set of virtual performance monitoring registers that are accessed by the one or more of the guest operating systems" and monitors "execution characteristics of memory management within [one] of [the] one or more virtual machines using one or more of the virtual performance monitoring registers." Thus, Appellants' claims are directed to systems in which guest operating systems access the virtual performance monitoring registers. The Examiner cites to various portions of Serebrin as teaching or suggesting these two recited steps. Final Act. 9 (citing Serebrin 224, 226–27).

Serebrin describes the two types of performance events generally monitored (i.e., speculative and non-speculative events) using profilers, time-sharing of physical CPUs by virtual CPUs, and trapping and emulation techniques. Serebrin 225–28. In describing trapping and emulation (a process whereby the hypervisor traps privileged guest instructions to emulate those instructions), Serebrin explains that simply pausing virtual counters does not properly account for emulated instructions, whereas allowing all counters to run would result in an instructions-per-cycle count that appears too high. Serebrin 227.
After noting the potential problems when monitoring performance in virtual environments using these two methods (pausing when emulating or running all counters), Serebrin identifies "a third approach [that] emulates the hardware counters to attempt to represent the microarchitecture's counts for a small subset of events." Serebrin 228 (citing Jiaqing Du et al., Performance Profiling of Virtual Machines, in 46 ACM SIGPLAN Notices iss. 7, pp. 3–14 (Andy Gill ed., July 2011) ("Du")). Serebrin "proposes a hybrid approach: for non-speculative events, the emulation code will ensure correctness, while speculative counters will present a view of the hypervisor's effect on hardware even for emulated guest instructions and events." Serebrin 228. Serebrin explains the hypervisor pauses the virtual counters when switching context to a different virtual CPU and resumes the virtual counters when the hypervisor switches context back to the current virtual CPU. Serebrin 228.

Thus, similar to Appellants' claims, Serebrin discloses that the hypervisor provides access to virtual performance monitoring registers. Serebrin 224; see also id. at 223 ("VMware is investigating the virtualization of hardware performance counters"). Serebrin also discloses that "the hypervisor must context switch the active performance counter state, in addition to the context switching of general purpose registers and control state." Serebrin 226–27. Serebrin's disclosures discussed herein teach that the hypervisor actively manages and monitors the virtual counters, which the Examiner finds teach the virtual performance monitoring registers, in order to "appropriately and efficiently emulate performance events." Serebrin 227.

Moreover, as explained above, Serebrin discloses using a hybrid approach to monitoring event performance, referencing Du. Du explains that the virtual machine monitor (VMM) virtualizes the performance monitoring unit hardware, which includes the performance counters and event selectors, in a manner transparent to the guest. Du 3–4. Du further explains that "[s]ystem-wide profiling reveals the runtime behavior of both the VMM and any number of guests [and] requires a profiler running in the VMM and in the profiled guests, and provides a full-scale view of the system." Du 4 (second emphasis added). Thus, a person of ordinary skill in the art would have understood that certain system-wide profiling techniques, including Du's techniques referenced by Serebrin, require running a profiler in the virtualization layer.

For the above reasons, we agree with the Examiner that Serebrin at least suggests to one of ordinary skill in the art that the hypervisor manages and monitors events "using one or more of the virtual performance monitoring registers," as recited in the claims.

Rationale for Combining Serebrin and Bae

Appellants also assert the Examiner's motivation to combine Serebrin and Bae "is essentially unreasoned boilerplate." App. Br. 14. In particular, Appellants argue that simply because Serebrin and Bae both are directed towards virtual machines does not provide sufficient basis for concluding the combination of their respective teachings would have been obvious. Id.
Appellants assert there is no obvious way or reason to combine Bae and Serebrin "because Bae successfully uses hardware performance-monitoring registers and Serebrin is not directed to, or interested in, monitoring memory-management performance by a virtualization layer." Id.[3]

[Footnote 3: Appellants also argue that, even if Serebrin and Bae were "combined, they would, at best, teach use of virtual performance-monitoring registers by guest operating systems in order to dynamically change their memory-management policies" because neither reference "even remotely suggests use of virtual performance-monitoring registers by a virtualization layer." App. Br. 14–15; Reply Br. 10–11. This argument, however, depends on Appellants' assertion that Serebrin does not teach or suggest the virtualization layer monitoring the virtual performance counters. We addressed this alleged deficiency above.]

Appellants seem to be arguing that, simply because "Serebrin is not directed towards optimizing VM performance" and "Bae successfully uses hardware performance-monitoring registers [whereas] Serebrin is not directed to, or interested in, monitoring memory-management performance by a virtualization layer," the cited teachings of Serebrin and Bae cannot be combined. App. Br. 14. Appellants offer no further explanation or evidence supporting the contention. App. Br. 14; see In re Geisler, 116 F.3d 1465, 1470 (Fed. Cir. 1997) (mere attorney arguments and conclusory statements, which are unsupported by factual evidence, are entitled to little probative value). However, an obviousness analysis "need not seek out precise teachings directed to the specific subject matter of the challenged claim, for a court can take account of the inferences and creative steps that a person of ordinary skill in the art would employ." KSR Int'l Co. v. Teleflex Inc., 550 U.S. 398, 418 (2007).

As the Examiner explained, both Serebrin and Bae relate to monitoring performance counters in virtual machine environments. That was not, however, the extent of the rationale provided by the Examiner for concluding that combining the cited teachings of Serebrin and Bae would have been obvious. Rather, the Examiner explained that a person of ordinary skill in the art would have understood that Bae's disclosure of switching memory management when using hardware counters would have been similarly beneficial to Serebrin's system that uses virtual counters. See Final Act. 9; Ans. 28. This is simply the application of a known technique (switching memory management techniques when monitoring memory management performance using hardware counters) to a similar method (monitoring using virtualized counters) to generate predictable results (improved memory management depending on the particular dynamic workload). KSR, 550 U.S. at 417.

The Examiner's rationale articulates a reason with a rational underpinning for combining the identified disclosures in Serebrin and Bae, and relies on an ordinarily skilled artisan's knowledge and skill to incorporate Bae's memory management techniques using hardware performance counters into Serebrin's system using virtual performance counters. Appellants' conclusory statements that the Examiner's statement is unreasoned boilerplate and that Serebrin and Bae have different purposes, without more, are insufficient to rebut the Examiner's proffered rationale.

For the reasons discussed above, we sustain the Examiner's rejection of claims 21, 32, and 38 under 35 U.S.C.
§ 103(a) as obvious in view of Serebrin and Bae. Appellants do not separately argue the patentability of dependent claims 22–31, 33–37, or 39–42 with particularity. Thus, we also sustain the Examiner's rejection of those claims as obvious.

CONCLUSION

In reaching this decision, we consider all evidence presented and all arguments Appellants actually made. For the reasons discussed above, we sustain the Examiner's rejection of claims 21–42 under 35 U.S.C. § 103(a), but we do not sustain the Examiner's rejection of claims 21–31 and 38–42 under 35 U.S.C. § 101. We also summarily sustain the Examiner's provisional rejection of claims 21–42 for non-statutory double patenting.

DECISION

We affirm the Examiner's decision to reject claims 21–42. No time period for taking any subsequent action in connection with this appeal may be extended under 37 C.F.R. § 1.136(a)(1)(iv). Because we have affirmed at least one ground of rejection with respect to each claim on appeal, the Examiner's decision is affirmed. See 37 C.F.R. § 41.50(a)(1).

AFFIRMED

Notice of References Cited
Application/Control No. 14/319,231; Appeal No. 2017-007019; Art Unit 2191

NON-PATENT DOCUMENTS
U: J. Du, N. Sehrawat, and W. Zwaenepoel. Performance Profiling of Virtual Machines.

*A copy of this reference is not being furnished with this Office action. (See MPEP § 707.05(a).)
U.S. Patent and Trademark Office, PTO-892 (Rev. 01-2001)

Performance Profiling of Virtual Machines

Jiaqing Du
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
jiaqing.du@epfl.ch

Nipun Sehrawat
University of Illinois at Urbana-Champaign, USA
sehrawa2@illinois.edu

Willy Zwaenepoel
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
willy.zwaenepoel@epfl.ch

Abstract

Profilers based on hardware performance counters are indispensable for performance debugging of complex software systems. All modern processors feature hardware performance counters, but current virtual machine monitors (VMMs) do not properly expose them to the guest operating systems. Existing profiling tools require privileged access to the VMM to profile the guest and are only available for VMMs based on paravirtualization. Diagnosing performance problems of software running in a virtualized environment is therefore quite difficult.

This paper describes how to extend VMMs to support performance profiling. We present two types of profiling in a virtualized environment: guest-wide profiling and system-wide profiling. Guest-wide profiling shows the runtime behavior of a guest. The profiler runs in the guest and does not require privileged access to the VMM. System-wide profiling exposes the runtime behavior of both the VMM and any number of guests. It requires profilers both in the VMM and in those guests. Not every VMM has the right architecture to support both types of profiling.
We determine the requirements for each of them, and explore the possibilities for their implementation in virtual machines using hardware assistance, paravirtualization, and binary translation. We implement both guest-wide and system-wide profiling for a VMM based on the x86 hardware virtualization extensions and system-wide profiling for a VMM based on binary translation. We demonstrate that these profilers provide good accuracy with only limited overhead.

Categories and Subject Descriptors: D.4 [Operating Systems]: Performance; C.4 [Performance of Systems]: Performance Attributes

General Terms: Performance, Design, Experimentation

Keywords: Performance Profiling, Virtual Machine, Hardware-assisted Virtualization, Binary Translation, Paravirtualization

1. Introduction

Profilers based on the hardware performance counters of modern processors are indispensable for performance debugging of complex software systems [21, 4, 23]. Developers rely on profilers to understand the runtime behavior, identify potential bottlenecks, and tune the performance of a program.

Performance counters are part of the processor's performance monitoring unit (PMU). They count hardware events such as retired instructions, cache misses, etc. When a performance counter reaches a pre-defined threshold, a counter overflow interrupt is generated. The profiler selects the event(s) to be monitored, and registers itself as the PMU counter overflow interrupt handler. When an interrupt occurs, it records the saved program counter (PC) and other relevant information. After the program is finished, it converts the sampled PC values to function names in the profiled program, and it generates a histogram that shows the frequency with which each function triggers the monitored hardware event. For instance, Table 1 shows a typical output of the widely used OProfile profiler [17] for Linux. The table presents the eight functions that consume the most cycles in a run of the profiled program.

PMU-based performance profiling in a native computing environment has been well studied. Mature profiling tools built upon PMUs exist in almost every popular operating system [17, 13]. They are extensively used by developers to tune software performance. This is, however, not the case in a virtualized environment, for the following two reasons.

On the one hand, running an existing PMU-based profiler in a guest does not result in useful output because, as far as we know, none of the current VMMs properly expose the PMU programming interfaces to guests. Most VMMs simply filter out guest accesses to the PMU. It is possible to run a guest profiler in restricted timer interrupt mode, but doing so results in limited profiling results. As more and more applications run in a virtualized environment, it is necessary to provide full-featured profiling for virtual machines.
In particular, as applications are moved to virtualization-based public clouds, the ability to profile applications in a virtual machine without the need for privileged access to the VMM allows users of public clouds to identify performance bottlenecks and to fully exploit the hardware resources they pay for.

On the other hand, while running a profiler in the VMM is possible, without the cooperation of the guest its sampled PC values cannot be converted to function names meaningful to the guest application developer. The data in Table 1 result from running a profiler in the VMM during the execution of a computation-intensive application in a guest. The first row shows that the CPU spends more than 98% of its cycles in the function vmx_vcpu_run(), which switches the CPU to run the guest. As the design of the profiler does not consider virtualization, all the CPU cycles consumed by the guest are accounted to this function in the VMM. Therefore, we cannot obtain detailed profiling data on the guest.

Currently, only XenOprof [18] supports detailed profiling of virtual machines running in Xen [6], a VMM based on paravirtualization. For VMMs based on hardware assistance and binary translation, no such tools exist. Enabling profiling in the VMM provides users and developers of virtualization solutions with a full-scale view of the whole software stack and its interactions with the hardware, helping them to tune the performance of the VMM, the guest, and the applications running in the guest.

% CYCLE   Function              Module
98.5529   vmx_vcpu_run          kvm-intel.ko
 0.2226   (no symbols)          libc.so
 0.1034   hpet_cpuhp_notify     vmlinux
 0.1034   native_patch          vmlinux
 0.0557   (no symbols)          bash
 0.0318   x86_decode_insn       kvm.ko
 0.0318   vga_update_display    qemu
 0.0318   get_call_destination  vmlinux

Table 1. A typical profiler output: the eight functions that consume the most cycles in a run of the profiled program.

In this paper we address the problem of performance profiling for three different virtualization techniques: hardware assistance, paravirtualization, and binary translation. We categorize profiling techniques in a virtualized environment into two types. Guest-wide profiling exposes the runtime characteristics of the guest kernel and all its active applications. It only requires a profiler running in the guest, similar to native profiling, i.e., profiling in a non-virtualized environment; the profiler does not require privileged access to the VMM, and the changes introduced to the VMM are transparent to the guest. System-wide profiling reveals the runtime behavior of both the VMM and any number of guests. It requires a profiler running in the VMM and in the profiled guests, and provides a full-scale view of the system.

The main contributions of this paper are:

1. We generalize the problem of performance profiling in a virtualized environment and propose two types of profiling: guest-wide profiling and system-wide profiling.

2. We analyze the challenges of achieving guest-wide and system-wide profiling for each of the three virtualization techniques. Synchronous virtual interrupt delivery to the guest is necessary for guest-wide profiling. The ability to convert samples belonging to a guest context into meaningful function names is required for system-wide profiling.

3. We present profiling solutions for virtualization based on hardware assistance and binary translation.

4. We demonstrate the feasibility and usefulness of virtual machine profiling by implementing both guest-wide and system-wide profiling for a VMM based on the x86 virtualization extensions and system-wide profiling for a VMM based on binary translation.
The rest of the paper is organized as follows. In Section 2 we review the structure and working principles of a profiler in a native environment. In Section 3 we analyze the challenges of supporting guest-wide and system-wide profiling for each of the three aforementioned virtualization techniques. In Section 4 we present the implementation of guest-wide and system-wide profiling in two VMMs, KVM and QEMU. We evaluate the accuracy, usefulness and performance of the resulting profilers in Section 5. In Section 6 we discuss some practical issues related to supporting virtual machine profiling in production environments. We describe related work in Section 7 and conclude in Section 8.

2. Native Profiling

Profiling is a widely used technique for dynamic program analysis. A profiler investigates the runtime behavior of a program as it executes. It determines how much of a hardware resource each function in a program consumes. A PMU-based profiler relies on performance counters to sample system states and figure out approximately how the profiled program behaves. Compared with other profiling techniques, such as code instrumentation [22, 12], PMU-based profiling provides a more accurate picture of the target program's execution as it is less intrusive and introduces fewer side effects.

The programming interface of a PMU is a set of performance counters and event selectors. When a performance counter reaches the pre-defined threshold, a counter overflow interrupt is generated by the interrupt controller and received by the CPU. Exploiting this hardware component for performance profiling, a PMU-based profiler generally consists of the following major components:

• Sampling configuration. The profiler registers itself as the counter overflow interrupt handler of the operating system, selects the monitored hardware events and sets the number of events after which an interrupt should occur. It programs the PMU hardware directly by writing to its registers.

• Sample collection. The profiler records the saved PC, the event type causing the interrupt, and the identifier of the interrupted process under the counter overflow interrupt context. The interrupt is handled by the profiler synchronously.

• Sample interpretation. The profiler converts the sampled PC values into function names of the profiled process by consulting its virtual memory layout and its binary file compiled with debugging information.

[Figure 1. Block diagram of a native PMU-based profiler.]

In a native environment, all three profiling components and their data structures reside in the operating system. They interact with each other through facilities provided by the operating system. Figure 1 shows a block diagram of a PMU-based profiler in a native environment.

In a virtualized environment, the VMM sits between the PMU hardware and the guests. The profiler's three components may be spread among the VMM and the guests. Their interactions may require communication between the VMM and the guests. In addition, the conditions for implementing these three components may not be satisfied in a virtualized environment. For instance, the sampling configuration component of a guest profiler may not be able to program the PMU hardware because of the interposition of the VMM. In the next section, we present a detailed discussion of the requirements for guest-wide and system-wide profiling for virtualization based on hardware extensions, paravirtualization, and virtualization based on binary translation.
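The paper's profilers program the PMU registers from inside the kernel. For readers who want to experiment with the same counters without writing kernel code, modern Linux wraps the PMU in the perf_event_open(2) system call. The sketch below is ours, not the paper's; it uses only the documented man-page interface, and it counts an event over a code region rather than sampling on overflow:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Thin wrapper: glibc provides no perf_event_open() stub. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;          /* event selector: generic HW event */
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* the monitored hardware event */
        attr.disabled = 1;                       /* start disarmed */
        attr.exclude_kernel = 1;                 /* count user-mode cycles only */

        int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 0.0;                 /* workload under measurement */
        for (int i = 0; i < 10000000; i++) x += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        printf("cycles: %llu\n", (unsigned long long)cycles);
        close(fd);
        return 0;
    }

An overflow-sampling profiler of the kind described above additionally asks the kernel for a signal or ring-buffer notification per N events; the counter/selector pairing is the same.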
3. Virtual Machine Profiling

3.1 Guest-wide Profiling

Challenges. By definition, guest-wide profiling runs a profiler in the guest and only monitors the guest. Although more information about the whole software stack can be obtained by employing system-wide profiling, sometimes guest-wide profiling is the only way to do performance profiling in a virtualized environment. As we explained before, users of a public cloud service are normally not granted the privilege to run a profiler in the VMM, which is necessary for conducting system-wide profiling. To achieve guest-wide profiling, the VMM needs to provide PMU multiplexing, i.e., saving and restoring PMU registers, and enable the implementation of the three profiling components in the guest. Since sample interpretation in guest-wide profiling is the same as in native profiling, we only present here the required facilities for sampling configuration and sample collection. We return to the topic of PMU multiplexing in Section 3.3.

To implement sampling configuration, the guest must be able to program the PMU registers, either directly or with the assistance of the VMM. To achieve sample collection, the guest must be able to collect correct samples under interrupt contexts, which requires that the VMM supports synchronous interrupt delivery to the guest. This means that, if the VMM injects an interrupt into a guest, that injected interrupt is handled first when the guest resumes its execution. For performance profiling, when a performance counter overflows, an interrupt is generated and first handled by the VMM. If the interrupt is generated when the guest code is being executed, the counter overflow is considered to be contributed by the guest. The VMM injects a virtual interrupt into the guest, which drives the profiler to take a sample. If the guest handles the injected interrupt synchronously when it resumes execution, it collects correct samples as in native profiling. If not, at the time when the injected virtual interrupt is handled, the real interrupt context has already been destroyed and the profiler obtains wrong sampling information. (A schematic sketch of this overflow-handling decision appears at the end of this subsection.)

Hardware assistance. The x86 virtualization extensions provide facilities that help implement guest-wide profiling. First, the guest can be configured with direct access to the PMU registers, which are model-specific registers (MSRs) in the x86. The save and restore of the relevant MSRs can also be done automatically by the CPU. Second, the guest can be configured to exit when interrupts occur. The hardware guarantees that event delivery to a guest is synchronous, so the VMM can forward to the guest all counter overflow interrupts contributed by it, and the guest profiler samples correct system states. We present our implementation of guest-wide profiling for virtualization based on the x86 hardware extensions in Section 4.1.

Paravirtualization. The major obstacle to implementing guest-wide profiling for VMMs based on paravirtualization is synchronous interrupt delivery to the guest. At least for Xen, this facility is currently not available. External events are delivered to the guest asynchronously. Mechanisms similar to synchronous signal delivery in a conventional OS should be employed to add this capability to paravirtualization-based VMMs. For sampling configuration, as a paravirtualized guest runs in a deprivileged mode and cannot access the PMU hardware, the VMM must implement the necessary programming interfaces to allow the guest to program the PMU indirectly.

Binary translation. For VMMs based on binary translation, sampling configuration can be achieved with the assistance of the VMM, which is able to identify instructions that access the PMU and rewrite them appropriately. Similar to paravirtualization, synchronous interrupt delivery to the guest is an engineering challenge. As far as we know, no VMMs based on binary translation support this feature.
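To make the synchronous-delivery requirement concrete, here is a schematic C sketch of the decision the VMM must make when a counter-overflow NMI arrives. This is our illustration, not code from the paper or from any real VMM; all types and helper names are hypothetical, and the privileged operations are stubbed out:

    #include <stdbool.h>
    #include <stddef.h>

    struct vcpu {
        bool was_in_guest_mode;   /* CPU was executing guest code at overflow */
        bool nmi_pending;         /* virtual NMI queued for the guest profiler */
    };

    /* Host-side NMI handler path: decide who the sample belongs to. */
    static void handle_counter_overflow(struct vcpu *v)
    {
        if (v != NULL && v->was_in_guest_mode) {
            /* Overflow contributed by the guest: queue a virtual interrupt.
             * With synchronous delivery, the guest profiler's handler runs
             * before any other guest instruction when the guest resumes,
             * so the sampled context is still intact. */
            v->nmi_pending = true;
        } else {
            /* Overflow occurred while VMM code ran: in guest-wide profiling
             * the PMU is simply off here; in system-wide profiling the VMM
             * profiler records the sample itself. */
        }
    }

    /* On the next VM-entry, inject the queued virtual NMI first. */
    static void prepare_vm_entry(struct vcpu *v)
    {
        if (v->nmi_pending) {
            v->nmi_pending = false;
            /* inject_virtual_nmi(v);  -- hypothetical privileged operation */
        }
    }

    int main(void)
    {
        struct vcpu v = { .was_in_guest_mode = true, .nmi_pending = false };
        handle_counter_overflow(&v);   /* guest-attributed overflow ...      */
        prepare_vm_entry(&v);          /* ... delivered at the next VM-entry */
        return 0;
    }

Asynchronous delivery breaks exactly the comment in the first branch: by the time the guest handles the interrupt, the interrupted context is gone.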
3.2 System-wide Profiling

Challenges. System-wide profiling reveals the runtime characteristics of both the VMM and the guests. It first requires that all three components of the profiler run in the VMM. Since the profiler resides in the VMM, it can program the PMU hardware directly and handle the counter overflow interrupts synchronously. The challenges for system-wide profiling come from the other two profiling components.

The first challenge for system-wide profiling comes from sample collection. If the counter overflow is triggered by a guest user process, the VMM profiler cannot obtain the identity of this process without the assistance of the guest operating system. This information is described by a global variable in the guest kernel, and the VMM does not know the internal memory layout of the guest. To solve this problem, the guest must share this piece of information with the VMM profiler.

The second challenge is interpreting samples belonging to the guests. Even if the VMM profiler is able to sample all the required information, sample interpretation is not possible because the VMM does not know the virtual memory layout of the guest processes or the guest kernel. This requires the guest to interpret its samples, which means that at least the sample interpretation component of a profiler should run in the guest.

One approach that addresses both the sample collection and the sample interpretation problem is to not let the VMM record samples corresponding to a guest, but to delegate this task to the guest. We call this approach full-delegation. It requires guest-wide profiling to be supported by the VMM. With this approach, during the profiling process, one profiler runs in the VMM and one runs in each guest. The VMM profiler is responsible for handling all counter overflow interrupts, but it only collects and interprets samples contributed by the VMM. For a sample not belonging to the VMM, a counter overflow interrupt is injected into the corresponding guest. The guest profiler is driven by the injected interrupts to collect and interpret samples contributed by the guest. Overall system-wide profiling results are obtained by merging the outputs of the VMM profiler and the guests.

An alternative solution is to let the VMM profiler collect all samples and to delegate the interpretation of guest samples to the corresponding guest [18]. We call this approach interpretation-delegation. With this solution, the guest makes the identity of the process to be run available to the VMM profiler. When a counter overflows, the VMM records the saved PC, the event type, and the identifier of the interrupted guest process, and sends it to the guest. After the guest receives the sample, it notifies the sample interpretation component of its profiler to convert the sample to a function name, in the same manner as a native profiler. After the profiling finishes, the results recorded in the guests are merged with those produced by the VMM profiler to obtain a system-wide view.
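The paragraph above fixes what a sample must carry under interpretation-delegation: the saved PC, the event type, and the identifier of the interrupted guest process. A minimal layout for the slots of the shared buffer discussed next might look like the following (our sketch; the field names, widths, and ring framing are assumptions, not taken from the paper):

    #include <stdint.h>

    /* One sample slot in the buffer shared by the VMM profiler and a
     * guest profiler (hypothetical layout). */
    struct sample_slot {
        uint64_t sampled_pc;       /* saved PC; for binary translation,
                                      already rewritten to a guest address */
        uint32_t event_type;       /* which hardware event overflowed */
        uint32_t vcpu_mode;        /* privilege mode of the virtual CPU */
        uint64_t guest_task_desc;  /* guest address of the descriptor of
                                      the interrupted guest process */
    };

    /* A simple single-producer/single-consumer ring: the VMM profiler
     * fills slots, the guest profiler drains them under its injected-
     * interrupt handler. */
    struct shared_sample_buffer {
        volatile uint32_t head;    /* next slot the producer writes */
        volatile uint32_t tail;    /* next slot the consumer reads */
        uint64_t current_task;     /* overwritten by the guest scheduler
                                      with the descriptor of the process
                                      about to run (see Section 4.2) */
        struct sample_slot slots[1024];
    };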
The interpretation-delegation approach for system-wide profiling requires explicit communication between the VMM and the guest. Their interaction rate approaches the rate of counter overflow interrupts, which can go up to thousands of times per second with a normal profiling configuration. Efficient communication methods should be used to avoid distortions in the profiling results. We choose to use a buffer shared among all the profiling participants for exchanging information. In addition, a guest should process samples collected for it in time. Otherwise, if a process terminates before the samples contributed by it are interpreted, there will be no way to interpret these samples, because the sample interpretation component needs to consult the virtual memory layout of this process.

Similar to full-delegation, interpretation-delegation can also be implemented by running one profiler in the VMM and one in each guest. However, the guest profiler does not need to access the PMU hardware. It only processes samples in the shared buffer, which are collected for it by the VMM profiler, by running its sample interpretation component when handling a virtual interrupt injected by the VMM profiler.

The choice between full-delegation and interpretation-delegation to implement system-wide profiling depends on whether synchronous interrupt delivery is supported by the VMM. If so, full-delegation is the preferred approach. If not, one should either choose interpretation-delegation or add support for synchronous interrupt delivery to the VMM. Full-delegation requires less engineering effort and is transparent to the profiled guest. Implementing synchronous interrupt delivery to the guest in software is, however, not trivial, and current VMMs based on paravirtualization and binary translation do not support this feature. Therefore, we choose the full-delegation approach to implement system-wide profiling for a VMM based on hardware assistance and the interpretation-delegation approach for a VMM based on binary translation (see Section 4).

Hardware assistance. For VMMs based on hardware extensions, since they have all the facilities to implement guest-wide profiling, the full-delegation approach can be employed to achieve system-wide profiling. This approach only requires minor changes to the VMM, as our implementation of system-wide profiling for an open-source VMM in Section 4.1 shows. System-wide profiling can also be achieved by the interpretation-delegation approach, but an efficient communication path between the guest and the VMM and extensions to the profilers running both in the VMM and in the guest require more engineering work than the full-delegation approach.

Paravirtualization. For VMMs based on paravirtualization, system-wide profiling can be implemented by interpretation-delegation. XenOprof uses this approach to perform system-wide profiling in Xen. Its implementation requires less engineering effort than in VMMs based on hardware assistance or binary translation, because Xen provides the hypercall mechanism that facilitates interactions between the VMM and the guest. The full-delegation approach may also work if the VMM supports guest-wide profiling.

Binary translation. For VMMs based on binary translation, system-wide profiling can be achieved through the interpretation-delegation approach. If the VMM supports synchronous interrupt delivery to the guest, the full-delegation approach also works.
When using interpretation-delegation, VMMs based on binary translation need to solve the following additional problem. If the execution of a guest triggers a counter overflow, the PC value sampled by the VMM profiler points to a translated instruction in the translation cache, not to the original instruction. Additional work is required to map the sampled PC back to the guest address where the original instruction is located. This address translation problem can be solved by the following idea. During the translation of guest instructions, we save the mapping from the address(es) of one or more translated instructions to the address of the original guest instruction in the address mapping cache, a counterpart of the translation cache. For each memory address of the translation cache, there is an entry in the address mapping cache, which points to the address holding the original guest instruction. For samples from a guest context, rather than storing the PC value itself, the VMM looks up the original instruction address in the address mapping cache and stores that address as part of the sample. This rewriting of the sampled PC value is transparent to the sample interpretation component in the guest.

3.3 PMU Multiplexing

Besides the requirements stated previously, another important question for both guest-wide and system-wide profiling is: what is the right time to save and restore PMU registers?

The first option is to save and restore these registers when the CPU switches between running guest code and running VMM code. We call this type of profiling CPU switch. Profiling results with CPU switch reflect the execution of the guest on the virtualized CPU, but not the guest's use of devices emulated by software. When the CPU switches to execute the VMM code that emulates the effects of a guest I/O operation, although the monitored hardware events are effectively being contributed by the guest, they are not accounted to it. In the case of guest-wide profiling, the PMU is turned off, and in the case of system-wide profiling the events are accounted to the VMM.

The second option is to save and restore the relevant registers when the VMM switches execution from one guest to another. We call this domain switch. This method accounts to a guest all the hardware events triggered by its execution and by the execution of the VMM while emulating devices on its behalf. In other words, domain switch PMU multiplexing reflects the characteristics of the entire virtual environment, including both the virtualized hardware and the virtualization software.

Guest-wide and system-wide profiling can choose either of the two approaches for PMU multiplexing. Generally, domain switch provides a more realistic picture of the underlying virtual environment. We compare the profiling results of these two methods in Section 5. The sketch below contrasts the two policies.
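As an illustration of the two multiplexing policies, the following C sketch shows where the save/restore pair is invoked under each policy. It is our reconstruction, not the paper's KVM patch; the MSR accessors are stubs and the register numbering is hypothetical:

    #include <stdint.h>

    #define NR_COUNTERS 4

    /* Hypothetical MSR accessors; a real VMM uses rdmsr/wrmsr here. */
    static uint64_t rdmsr_stub(uint32_t msr) { (void)msr; return 0; }
    static void     wrmsr_stub(uint32_t msr, uint64_t v) { (void)msr; (void)v; }

    struct pmu_state {
        uint64_t counters[NR_COUNTERS];  /* performance counter MSRs */
        uint64_t evtsels[NR_COUNTERS];   /* event selector MSRs */
    };

    void pmu_save(struct pmu_state *s, uint32_t ctr_base, uint32_t sel_base)
    {
        for (int i = 0; i < NR_COUNTERS; i++) {
            s->counters[i] = rdmsr_stub(ctr_base + i);
            s->evtsels[i]  = rdmsr_stub(sel_base + i);
        }
    }

    void pmu_restore(const struct pmu_state *s, uint32_t ctr_base,
                     uint32_t sel_base)
    {
        for (int i = 0; i < NR_COUNTERS; i++) {
            wrmsr_stub(sel_base + i, s->evtsels[i]);   /* re-arm selectors   */
            wrmsr_stub(ctr_base + i, s->counters[i]);  /* then counter state */
        }
    }

    /* CPU switch: the pair runs at every VM-exit/VM-entry, so events raised
     * while the VMM emulates a device on the guest's behalf are not charged
     * to the guest.  Domain switch: the pair runs only when the host moves
     * between threads tagged as belonging to different guests, so emulation
     * work is charged to the guest that caused it. */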
4. Implementation

We describe the implementation of both guest-wide and system-wide profiling for the kernel-based virtual machine (KVM) [16], a VMM based on hardware assistance. We also present how system-wide profiling is implemented in QEMU [7], a VMM based on binary translation. As both KVM and QEMU are considered hosted VMMs, we use the terms "VMM" and "host" interchangeably in this section.

Our implementation follows two principles. First, performance profiling should introduce as little overhead as possible to the execution of the guest. Otherwise, the monitoring results would be perturbed by the monitoring overhead. Second, performance profiling should generate as little performance overhead as possible for the VMM. It should not slow down the whole system too much. To achieve these goals, we only introduce additional switches between the VMM and the guest when absolutely necessary. For all our implementations, except during the profiling initialization phase, only virtual interrupt injection into the guest causes additional context switches, but these are inevitable for PMU-based performance profiling.

4.1 Hardware Assistance

KVM is a Linux kernel infrastructure which leverages hardware virtualization extensions to add a virtual machine monitor capability to Linux. With KVM, the VMM is a kernel module in the host Linux, while each virtual machine resides in a normal user-space process. Although KVM supports multiple hardware architectures, we choose the x86 with virtualization extensions to illustrate our implementation, because it has the most mature code.

The virtualization extensions augment the x86 with two new operation modes: host mode and guest mode. KVM runs in host mode, and its guests run in guest mode. Host mode is compatible with conventional x86, while guest mode is very similar to it but deprivileged in certain ways. Guest mode supports all four privilege levels and allows direct execution of the guest code. A virtual machine control structure (VMCS) is introduced to control various behaviors of a virtual machine. Two transitions are also defined: a transition from host mode to guest mode called a VM-entry, and a transition from guest mode to host mode called a VM-exit.

Regarding performance profiling, if a performance counter overflows when the CPU is in guest mode, the currently running guest is forced to exit, i.e., the CPU switches from guest mode to host mode. The VM-exit information field in the VMCS indicates that the current VM-exit is caused by a non-maskable interrupt (NMI). By checking this field, KVM is able to decide whether a counter overflow is contributed by a guest. This approach assumes all NMIs are caused by counter overflows in a profiling session. To be more precise, KVM could also check the content of all performance counters to make sure that NMIs are really caused by counter overflows.

Our guest-wide profiling implementation requires no modifications to the guest and its profiler. The guest profiler reads and writes the physical PMU registers directly as it does in native profiling. KVM is responsible for virtualizing the PMU hardware and forwarding NMIs due to performance counter overflows to the guest. A user can launch the profiler from the guest and do performance profiling exactly as in a native environment.

We implement system-wide profiling by the full-delegation approach, since KVM is built upon hardware virtualization extensions and supports synchronous virtual interrupt delivery to the guest. In a profiling session, we run one unmodified profiler instance in the host and one in each guest. These profiling instances work and cooperate as we discussed in Section 3.2. The only changes to KVM are clearing a bit in an APIC register after each VM-exit (see below) and injecting NMIs into a guest that causes a performance counter overflow.

When CPU switch is enabled, KVM saves all the relevant MSRs when a VM-exit happens and restores them when the corresponding VM-resume occurs. By configuring certain fields in the VMCS, this is done automatically in hardware.
When domain switch is enabled, we tag all (Linux kernel) threads belonging to a guest and group them into one domain. When the Linux kernel switches to a thread not belonging to the current domain, it saves and restores the relevant registers (in software).

In the process of implementing these two profiling techniques in KVM, we also observe the following two noteworthy facts.

First, in the x86 architecture, there is one bit of a register in the Advanced Programmable Interrupt Controller (APIC) that specifies the delivery of NMIs due to performance counter overflows. Clearing this mask bit enables interrupt delivery and setting it inhibits delivery. After the APIC sends a counter overflow NMI to the CPU, it automatically sets this bit. To allow subsequent NMIs to be delivered, a profiler should clear this bit after it handles each NMI. In theory, exposing the register containing this bit to the guest would require the virtualization of the APIC. However, the current implementation of KVM does not virtualize the APIC, but emulates it in software. To bypass this problem, we simply clear the bit after each VM-exit, no matter whether the exit is caused by a performance counter overflow or not.

Second, for guest-wide profiling with CPU switch, we find that the CPU receives NMIs due to counter overflows in host mode, typically right after a VM-exit. For guest-wide profiling, however, performance monitoring is only enabled in guest mode, and NMIs due to performance counter overflows are not supposed to happen in host mode. We could not determine the reason for this problem with 100% certainty because of the lack of a hardware debugger. One plausible explanation is that the VM-exit operation is not "atomic". It consists of a number of sub-operations, including saving and restoring MSRs. A counter may overflow during the execution of a VM-exit, but before performance monitoring is disabled. The corresponding NMI is not generated immediately, because the instruction executing when an NMI is received is completed before the NMI is generated [15]. The NMI due to a performance counter overflow in the middle of a VM-exit is thus generated after the VM-exit operation finishes, when the processor is in host mode. We solve this problem by registering an NMI handler in the host to catch those host counter overflows and inject the corresponding virtual NMIs into the guest.
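The APIC workaround described in the first fact above is small enough to sketch. The register involved is the local APIC's LVT performance-counter entry, which the hardware masks each time it delivers a counter-overflow interrupt; rewriting the entry with the mask bit clear re-arms delivery. The constants below match the Linux header definitions as we understand them, but the MMIO helper is a stub and the sketch is ours, not the paper's patch:

    #include <stdint.h>

    #define APIC_LVTPC      0x340       /* LVT performance counter register */
    #define APIC_DM_NMI     0x00400u    /* delivery mode: NMI */
    #define APIC_LVT_MASKED (1u << 16)  /* set by hardware on each delivery */

    /* Hypothetical MMIO write into the local APIC register page. */
    static void apic_write_stub(uint32_t reg, uint32_t val)
    {
        (void)reg; (void)val;
    }

    /* Called unconditionally after every VM-exit: rewrite the LVT entry
     * with delivery mode NMI and the mask bit (bit 16) clear, so the next
     * counter-overflow NMI can be delivered even though the guest never
     * sees this register. */
    void reenable_counter_overflow_nmis(void)
    {
        apic_write_stub(APIC_LVTPC, APIC_DM_NMI);
    }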
4.2 Binary Translation

We present the implementation of system-wide profiling by the interpretation-delegation approach in QEMU, a VMM based on binary translation. In this environment, the guest profiler runs in the guest kernel space; the guest runs in QEMU; QEMU runs in the user space of the host; and the host profiler runs in the host kernel space. The implementation takes more engineering effort than that of the full-delegation approach in KVM. It touches a number of major components in the whole software stack, including the host, the host profiler, the guest, and the guest profiler.

The conventional x86 is not virtualizable because some instructions do not trap when executed in the unprivileged mode. Dynamic binary translation solves this problem by rewriting those problematic instructions on the fly. A binary translator processes a basic block of instructions at a time. The translated code is stored in a buffer called the translation cache, which holds a recently used subset of all the translated code. Instead of the original guest code, the CPU actually executes the code in the translation cache.

Virtual interrupts injected into a guest are delivered asynchronously in QEMU. Once a virtual interrupt injection request is received, QEMU first unchains the translated basic block being executed and forces control back to itself after this basic block finishes execution. It then sets a flag of the guest virtual CPU to indicate the reception of an interrupt. The injected interrupt is handled when the guest resumes execution.

To reduce the VM-exit rate due to information exchange among the guest, QEMU, and the host, we design an efficient communication mechanism for interpretation-delegation. This is important because in a typical profiling session interactions among all these participants can occur at the rate of thousands of times per second. If each interaction involved one VM exit, the profiling results would be polluted and far from accurate. The key data structure underlying this communication mechanism is a buffer shared among the three participants. All the critical information in a profiling session, such as the PCs and pointers to process descriptors, is exchanged through this buffer. Each profiling participant reads the buffer directly whenever it needs any information, and no VM exits are triggered.

The shared buffer is allocated in the guest and shared through the following control channel. The guest exchanges information with QEMU through a customized virtual device added to QEMU and the corresponding device driver in the guest kernel. QEMU and the host kernel talk with each other through common user/kernel communication facilities provided by the host. After the address of the buffer is passed from the guest profiler to the host profiler, the guest profiler accesses the shared buffer by an address in the guest kernel space; QEMU uses this buffer through an address in its own address space; and the host profiler accesses it with an address in the host kernel space.

In our implementation, QEMU is responsible for rewriting the PC value sampled by the host profiler into an address of the guest pointing to the original guest code. To reduce the overall overhead of PC value rewriting, the address mapping cache proposed in Section 3.2 does not map the host address of each instruction in the translation cache to its corresponding guest address. Instead, the cache only maintains one entry per basic block. All the addresses of the instructions in a translated basic block are mapped to the starting address of the corresponding original basic block in the guest. This does not hurt the accuracy of performance profiling with functions as the interpretation granularity, for two reasons. First, a profiler always interprets any address pointing to the body of a function as the name of that function. Second, a basic block does not span more than one function, because it terminates right after the first branch instruction.
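A per-basic-block mapping of this kind can be expressed compactly. The following lookup is our sketch of the idea (hypothetical names; a real implementation would index by translation-cache offset rather than scanning linearly):

    #include <stdint.h>
    #include <stddef.h>

    /* One entry per translated basic block in the address mapping cache. */
    struct block_map_entry {
        uint64_t tc_start;      /* start of the block in the translation cache */
        uint64_t tc_end;        /* one past its last translated instruction */
        uint64_t guest_start;   /* guest address of the original basic block */
    };

    /* Rewrite a sampled host PC into a guest PC.  Returns 0 when the PC
     * does not fall inside the translation cache, i.e., the sample belongs
     * to the VMM itself rather than to translated guest code. */
    uint64_t rewrite_sampled_pc(const struct block_map_entry *cache, size_t n,
                                uint64_t host_pc)
    {
        for (size_t i = 0; i < n; i++) {
            if (host_pc >= cache[i].tc_start && host_pc < cache[i].tc_end)
                return cache[i].guest_start;   /* block-start granularity */
        }
        return 0;
    }

Mapping every sampled address to the block's start is exactly why function-level interpretation is unaffected: any address inside a function's body resolves to that function's name.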
Putting all the pieces together, the process of system-wide profiling for binary translation can be described as follows.

1. In the initialization phase, the host profiler is loaded into the host kernel, and the guest profiler is loaded into the guest kernel. A communication channel across all the profiling participants is established and a buffer is shared among them.

2. The user starts a profiling session by launching the host profiler. The host profiler sends a message to the guest to start the guest profiler.

3. While profiling is being conducted, the guest records the address pointing to the descriptor of each process right before it schedules the process to run. There is only one entry in the shared buffer for this address; the guest keeps overwriting this entry, because it is only useful when the execution of the corresponding process triggers a performance counter overflow. When a counter overflows and the overflow is contributed by the guest, the host profiler copies the sampled PC value, the event type, and the address of the descriptor of the corresponding guest process to a sampling slot in the shared buffer (see the sketch below). It then sends a signal to QEMU running in user space. After the counter overflow NMI is handled and QEMU is scheduled to run again, the signal from the host profiler is delivered first. The signal handler rewrites the sampled PC value, records the current privilege mode of the virtual CPU in the same sampling slot, and injects an NMI into the guest. Upon handling the injected NMI, the guest profiler processes all the available sampling slots one by one.

4. The user finishes the profiling session by stopping the host profiler. The host profiler sends a message to the guest to stop the guest profiler. The output of the host profiler and the guest profiler is merged together as the final profiling results.

Because the host knows little about the internals of a guest and the guest code is dynamically translated, the host profiler can only obtain limited runtime information about the guest in an NMI context. Both the guest and QEMU are required to help record or process sampling information on behalf of the host profiler. This leads to changes to all the participants involved in system-wide profiling based on interpretation-delegation.
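As a concrete illustration, the shared buffer's contents might be laid out as follows. The field names, types, and ring structure are assumptions made for this sketch; the paper does not specify the actual layout.

    #include <stdint.h>

    #define MAX_SLOTS 256

    /* One sample, filled in cooperatively: the host profiler writes pc,
     * event, and task; QEMU rewrites pc into a guest address and adds
     * the virtual CPU's privilege mode; the guest profiler consumes it. */
    struct sample_slot {
        uint64_t pc;     /* sampled PC (host, later rewritten to guest) */
        uint32_t event;  /* hardware event type, e.g. cycles, L2 misses */
        uint32_t cpl;    /* guest privilege mode at the time of sample  */
        uint64_t task;   /* guest address of current process descriptor */
    };

    /* Shared among guest, QEMU, and host; each participant reads it
     * directly, so exchanging this information triggers no VM exits. */
    struct shared_buffer {
        uint64_t current_task;   /* overwritten by guest on each switch */
        uint32_t head, tail;     /* ring indices for the sample slots   */
        struct sample_slot slots[MAX_SLOTS];
    };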
5. Evaluation

We first verify the accuracy of our profilers by comparing the results of native profiling with profiling in various virtualized environments. We then show how guest-wide profiling can be used to profile two guests simultaneously. Next, by comparing the results of CPU switch and domain switch for guest-wide profiling, we show that domain switch sometimes provides considerably more information about the guest's execution. We also demonstrate the power of guest-wide profiling by using it on a couple of examples to explain why one virtualization technique performs better than another. Finally, we quantify the overhead of our profilers by comparing the execution time of a computation-intensive program with and without profiling.

Our experiments involve two machines. The first is a Dell OptiPlex 745 desktop with one dual-core Intel Core2 E6400 processor, 2GB of DDR2 memory, and one Gigabit Ethernet NIC. The second is a Sun Fire X4600 M2 server with four quad-core AMD Opteron 8356 processors, 32GB of DDR2 memory, and a dozen Gigabit Ethernet NICs. Unless explicitly stated, all the experiments are conducted on the Intel machine.

For hardware-assisted virtualization, the VMM consists of the 2.6.32 Linux kernel with the KVM kernel module loaded and QEMU 0.11 in user space [7]. For virtualization based on binary translation, the VMM is QEMU 0.10.5 in user space, which runs on top of the 2.6.30 Linux kernel with virtualization extensions disabled. All guests are configured with one virtual CPU and run Linux with the 2.6.32 kernel. The profiler is OProfile 0.9.5.

For both guest-wide and system-wide profiling, CPU switch for KVM adds 170 lines of C code to the Linux kernel, while domain switch consists of 272 lines of C code. QEMU system-wide profiling with CPU switch introduces 1115 lines of C code to QEMU, the host kernel, and the guest kernel.

5.1 Computation-intensive Workload

To verify the accuracy of our profilers for computation-intensive workloads, we use the code given in Figure 2 as the profiled application, and we compare the output of native profiling with that of virtualized profiling. The program consists of an infinite loop executing two computation-intensive functions, compute_a() and compute_b(), which perform floating-point arithmetic and consume different numbers of CPU cycles. We run this program in two different processes, process1 and process2. We launch both processes at the same time and pin them to one CPU core.

    int main(int argc, char *argv[])
    {
        while (1) {
            compute_a();
            compute_b();
        }
    }

Figure 2. Code used to verify the accuracy of VM profilers for computation-intensive programs.

Table 2 to Table 5 present the output of profiling runs of this program, in which we measure the number of CPU cycles consumed, for native profiling (Table 2), guest-wide profiling in KVM (Table 3), system-wide profiling in KVM (Table 4), and system-wide profiling in QEMU (Table 5). (For system-wide profiling, only CPU cycles consumed in the guest are counted in the percentages.) The results for the VM profilers are roughly the same as those for the native profiler. As expected, the two processes, process1 and process2, consume roughly the same number of cycles, and the ratio between cycles consumed in compute_a() and in compute_b() is also roughly similar.

    % CYCLE   Function          Module
    40.3463   compute_a         process2
    38.2010   compute_a         process1
    10.6135   compute_b         process2
    10.2371   compute_b         process1
     0.1505   vsnprintf         vmlinux
     0.1129   (no symbols)      bash
     0.0376   (no symbols)      libc.so
     0.0376   mem_cgroup_read   vmlinux

Table 2. % of cycles consumed in two processes running the program given in Figure 2, native profiling.

    % CYCLE   Function                Module
    38.8114   compute_a               process1
    38.5913   compute_a               process2
    10.3815   compute_b               process2
    10.0880   compute_b               process1
     0.5503   native_apic_mem_write   vmlinux
     0.2201   (no symbols)            libc.so
     0.2201   schedule                vmlinux
     0.1101   (no symbols)            bash

Table 3. % of cycles consumed in two processes running the program given in Figure 2, KVM guest-wide profiling.

    % CYCLE   Function               Module
    39.9220   compute_a              process1
    39.4209   compute_a              process2
    10.3563   compute_b              process2
    10.0223   compute_b              process1
     0.0557   __switch_to            vmlinux
     0.0557   ata_sff_check_status   vmlinux
     0.0557   run_timer_softirq      vmlinux
     0.0557   update_wall_time       vmlinux

Table 4. % of cycles consumed in two processes running the program given in Figure 2, KVM system-wide profiling.

    % CYCLE   Function                   Module
    40.0000   compute_a                  process2
    36.2963   compute_a                  process1
     9.2593   compute_b                  process1
     8.5185   compute_b                  process2
     0.7407   update_wall_time           vmlinux
     0.3704   schedule                   vmlinux
     0.3704   tasklet_hi_schedule        vmlinux
     0.3704   cleanup_workqueue_thread   vmlinux

Table 5. % of cycles consumed in two processes running the program given in Figure 2, QEMU system-wide profiling.

These results are further confirmed by Figure 3, which shows the average and the standard deviation of the percentage of CPU cycles consumed by compute_a() and compute_b() over 10 runs with all four profilers. Our profilers provide stable results, with standard deviations ranging from 0.44% to 2.87%. Native profiling has the smallest variance, and system-wide profiling for QEMU has the largest.

Figure 3. Average and standard deviation of the percentage of cycles consumed by compute_a() and compute_b().

Figure 4 shows the results of simultaneous guest-wide profiling of two KVM guests running the program described in Figure 2. The percentage of CPU cycles consumed by each function is the same in both guests, and similar to the percentage for each function under native profiling, indicating the accuracy of our profiler when used with multiple virtual machines.

Figure 4. Average and standard deviation of the percentage of cycles consumed by compute_a() and compute_b() in two KVM guests, KVM guest-wide profiling.

Although the data in this experiment do not constitute a proof of correctness, they give us reasonable confidence that our design and implementation work well in terms of CPU cycles. We obtain similar results for instruction retirements.
5.2 Memory-intensive Workload

We use the program described in Figure 5 to demonstrate the operation of KVM guest-wide profiling with memory-intensive programs. This program makes uniformly distributed random accesses to a fixed-size region of memory. We run it with a working set of 512KB, for which the entire execution fits in the L2 cache, and with a working set of 2048KB, which causes many misses in the L2 cache.

    struct item {
        struct item *next;
        long pad[NUM_PAD];
    };

    void chase_pointer()
    {
        struct item *p = &randomly_connected_items;
        while (p != NULL)
            p = p->next;
    }

Figure 5. Code used to verify the accuracy of VM profilers for memory-intensive programs.

Table 6 presents the profiling results for L2 cache misses for the 512KB working set, and Table 7 presents the results for the 2048KB working set. The results clearly reflect the higher number of L2 misses with the larger working set.

    L2 Miss   Function           Module
    1250      chase_pointer      cache_test
     100      (no symbols)       bash
     100      (no symbols)       ld.so
     100      (no symbols)       libc.so
      50      sync_buffer        oprofile.ko
      50      do_notify_resume   vmlinux
      50      do_wp_page         vmlinux
      50      find_first_bit     vmlinux

Table 6. L2 cache misses (in thousands) for the program given in Figure 5 with a working set of 512KB, KVM guest-wide profiling.

    L2 Miss   Function                Module
    150750    chase_pointer           cache_test
      2050    native_apic_mem_write   vmlinux
       250    idle_cpu                vmlinux
       250    run_posix_cpu_timers    vmlinux
       200    account_user_time       vmlinux
       200    unmap_vmas              vmlinux
       200    update_curr             vmlinux
       150    do_timer                vmlinux

Table 7. L2 cache misses (in thousands) for the program given in Figure 5 with a working set of 2048KB, KVM guest-wide profiling.

Figure 6 shows the average number of L2 cache misses triggered by one pointer access of our memory-intensive benchmark. We run the benchmark with different working set sizes in four different computing environments. For system-wide profiling of both KVM and QEMU, we only count the cache misses reported by the guest profiler. The number of cache misses per pointer access for native Linux, the KVM guests, and the QEMU guest follows a similar pattern: after the size of the working set exceeds a certain value, namely the amount of L2 cache available to the benchmark, the miss rate increases dramatically. For native Linux and KVM, the available L2 cache is about 1024KB. For QEMU, it is 512KB, because QEMU involves the execution of more software components, such as the binary translator and the MMU emulation code. Beyond these points, the cache miss rate grows linearly with the working set size.

Figure 6. The number of L2 cache misses per pointer access for working set sizes from 500KB to 3500KB in four computing environments: native, KVM guest-wide, KVM system-wide, and QEMU guest-wide.
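The paper does not show how the items of Figure 5 are connected. The following is a hedged sketch of one standard way to build such a randomly connected list: shuffle the visiting order with Fisher-Yates and chain the items in that order, which scatters the accesses uniformly over the working set and defeats hardware prefetching. The sizes and names are illustrative assumptions.

    #include <stdlib.h>
    #include <stddef.h>

    #define NUM_PAD 7   /* pads each item to 64 bytes on an LP64 system */

    struct item {
        struct item *next;
        long pad[NUM_PAD];
    };

    /* 2048KB working set, as in the larger configuration above. */
    #define WORKING_SET (2048 * 1024)
    #define NUM_ITEMS   (WORKING_SET / sizeof(struct item))

    static struct item items[NUM_ITEMS];
    static struct item *head;

    static void connect_randomly(void)
    {
        static size_t idx[NUM_ITEMS];
        for (size_t i = 0; i < NUM_ITEMS; i++)
            idx[i] = i;
        /* Fisher-Yates shuffle: a uniformly random visiting order. */
        for (size_t i = NUM_ITEMS - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
        }
        /* Chain the items in shuffled order; the chase ends at NULL. */
        for (size_t i = 0; i + 1 < NUM_ITEMS; i++)
            items[idx[i]].next = &items[idx[i + 1]];
        items[idx[NUM_ITEMS - 1]].next = NULL;
        head = &items[idx[0]];
    }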
5.3 CPU Switch vs. Domain Switch

For guest-wide profiling, there are two possible places to save and restore the registers related to profiling. CPU switch saves and restores the relevant registers when the CPU switches from running guest code to VMM code, or vice versa. Domain switch does this when the VMM switches execution from one guest to another. To show the difference between CPU switch and domain switch, we use guest-wide profiling for KVM on a guest that receives TCP packets. In the experiment, as much TCP traffic as possible is pushed to the guest from a Gigabit NIC on a different machine. The virtual NIC used by the guest is an RTL8139.

    % INSTR   Function                    Module
    14.1047   csum_partial                vmlinux
     8.9527   csum_partial_copy_generic   vmlinux
     6.2500   copy_to_user                vmlinux
     3.9696   ipt_do_table                ip_tables.ko
     3.6318   tcp_v4_rcv                  vmlinux
     3.2095   (no symbols)                libc.so
     2.8716   ip_route_input              vmlinux
     2.7027   tcp_rcv_established         vmlinux

Table 8. Instruction retirements for TCP receive in a KVM guest, guest-wide profiling with CPU switch.

    % INSTR   Function                    Module
    31.0321   cp_interrupt                8139cp.ko
    18.3365   cp_rx_poll                  8139cp.ko
    14.1916   cp_start_xmit               8139cp.ko
     5.7782   native_apic_mem_write       vmlinux
     5.1331   native_apic_mem_read        vmlinux
     2.6215   csum_partial                vmlinux
     1.4411   csum_partial_copy_generic   vmlinux
     1.2901   copy_to_user                vmlinux

Table 9. Instruction retirements for TCP receive in a KVM guest, guest-wide profiling with domain switch.

Table 8 presents the eight functions with the largest number of instruction retirements with CPU switch, and Table 9 the same with domain switch. The total number of samples is 1184 with CPU switch vs. 7286 with domain switch. In other words, more than 80% of the instructions retired while receiving packets in the guest are spent outside the guest, inside the device emulation code of the VMM. The VMM spends a large number of instructions emulating the effects of the I/O operations in the virtual RTL8139 NIC and the virtual APIC. The top three functions in Table 9 are from the RTL8139 NIC driver, and the next two program the APIC. Only below those five appear the three guest functions that are at the top of Table 8. This example clearly shows that domain switch can provide more complete information than CPU switch for I/O-intensive programs.
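As a rough illustration of the domain-switch idea compared above, the sketch below saves and restores the profiling MSRs only when the scheduler crosses a domain boundary, so switches between threads of the same guest cost nothing. The hook, the helpers, and the exact MSR list are assumptions for the sketch, not the authors' actual kernel patch; only the Intel counter and control MSR numbers come from the architecture manual [15].

    #include <stdint.h>

    /* Hypothetical helpers; real kernels use rdmsrl()/wrmsrl(). */
    extern uint64_t rdmsr64(uint32_t msr);
    extern void wrmsr64(uint32_t msr, uint64_t val);

    #define NUM_PMC_MSRS 4
    static const uint32_t pmc_msrs[NUM_PMC_MSRS] = {
        0xC1, 0xC2, 0x186, 0x187  /* Intel: two counters + controls */
    };

    struct domain {
        int id;
        uint64_t saved_msrs[NUM_PMC_MSRS]; /* PMU state while off CPU */
    };

    /* Called from the context-switch path. Threads belonging to the
     * same guest carry the same domain tag, so only crossing a domain
     * boundary pays the save/restore cost. */
    void on_context_switch(struct domain *prev, struct domain *next)
    {
        if (prev->id == next->id)
            return;
        for (int i = 0; i < NUM_PMC_MSRS; i++) {
            prev->saved_msrs[i] = rdmsr64(pmc_msrs[i]);
            wrmsr64(pmc_msrs[i], next->saved_msrs[i]);
        }
    }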
5.4 The Power of Guest-wide Profiling

One of the advantages of guest-wide profiling is that it does not require access to the VMM. Nevertheless, it allows advanced performance debugging, as we demonstrate next with two examples.

We first profile the benchmark described in Figure 7 to show why hardware-supported nested paging [8] provides more efficient memory virtualization than shadow page tables [19]. This program is a familiar UNIX kernel micro-benchmark that stresses process creation and destruction [3]. (This experiment is conducted on our AMD machine, because KVM's modular implementation on AMD CPUs can easily be switched between shadow page tables and hardware-supported nested paging.)

    int main(int argc, char *argv[])
    {
        for (int i = 0; i < 32000; i++) {
            int pid = fork();
            if (pid < 0)
                return -1;
            if (pid == 0)
                return 0;
            waitpid(pid, NULL, 0);
        }
        return 0;
    }

Figure 7. A micro-benchmark that stresses process creation and destruction [3].

    CYCLE   Function                 Module
    1300    do_wp_page               vmlinux
    1100    do_wait                  vmlinux
     750    page_fault               vmlinux
     400    get_page_from_freelist   vmlinux
     400    wait_consider_task       vmlinux
     350    unmap_vmas               vmlinux
     200    flush_tlb_page           vmlinux
     200    native_flush_tlb_single  vmlinux

Table 10. Cycles (in millions) consumed in the program given in Figure 7 in a KVM guest with nested paging, KVM guest-wide profiling.

    CYCLE   Function                 Module
    5450    native_set_pmd           vmlinux
    5350    do_wp_page               vmlinux
    3500    native_flush_tlb_single  vmlinux
    3050    get_page_from_freelist   vmlinux
    2650    schedule                 vmlinux
    1100    native_flush_tlb         vmlinux
    1050    do_wait                  vmlinux
     950    page_fault               vmlinux

Table 11. Cycles (in millions) consumed in the program given in Figure 7 in a KVM guest with shadow page tables, KVM guest-wide profiling.

Running natively, we measure 4.97 seconds to create and destroy 32000 processes. With nested paging, the guest takes 5.52 seconds, slightly slower than native speed. When using shadow page tables, the execution time grows to 20.06 seconds. By profiling the benchmark in the guest, a developer can easily figure out which operations involved in process creation and destruction are expensive. Table 10 presents the eight functions that consume the most CPU cycles with nested paging, and Table 11 presents the same results for shadow page tables. Comparing the profiling results in these two tables, we observe that operations related to page table manipulation, such as native_set_pmd() and do_wp_page(), become much more expensive with shadow page tables. The shadow page table mechanism causes a large number of VM exits, including when the guest loads or updates a page table, when it accesses protected pages, and when it performs privileged operations like TLB flushes. With nested paging, most of these operations do not cause a VM exit.

Our second example is again TCP receive, similar to the experiment described in Section 5.3. The difference is that, instead of the RTL8139, we use the E1000 virtual NIC and a virtual NIC based on VirtIO [20]. VirtIO is a paravirtualized I/O framework that provides good I/O performance for virtual machines.

    % INSTR   Function                    Module
    25.2399   e1000_intr                  e1000.ko
    16.8906   e1000_irq_enable            e1000.ko
    12.1881   e1000_xmit_frame            e1000.ko
     4.6065   native_apic_mem_write       vmlinux
     4.4146   csum_partial                vmlinux
     3.3589   e1000_alloc_rx_buffers      e1000.ko
     3.2630   native_apic_mem_read        vmlinux
     3.0710   copy_user_intel             vmlinux

Table 12. Instruction retirements for TCP receive in a KVM guest with the E1000 virtual NIC, guest-wide profiling with domain switch.

    % INSTR   Function                    Module
    52.3312   native_safe_halt            vmlinux
     7.7244   native_apic_mem_write       vmlinux
     6.6806   csum_partial_copy_generic   vmlinux
     1.8903   native_write_cr0            vmlinux
     1.4614   ipt_do_table                ip_tables.ko
     0.9047   (no symbols)                libc.so
     0.9047   get_page_from_freelist      vmlinux
     0.9047   schedule                    vmlinux

Table 13. Instruction retirements for TCP receive in a KVM guest with the VirtIO virtual NIC, guest-wide profiling with domain switch.

Table 12 presents the profiling results of packet receive through the E1000 virtual NIC in a KVM guest. Similar to the results for the RTL8139, interrupt handling functions retire more than 40% of all instructions, because of the high network I/O interrupt rate. Table 13 presents the results for the VirtIO-based NIC. The function native_safe_halt() retires more than half of all instructions; this function executes the HLT instruction, which halts the CPU until the next external interrupt occurs. The frequent execution of this instruction in the guest shows that the guest is not saturated while handling 1Gbps TCP traffic. Moreover, in contrast with the data in Table 12, we do not find a single function related to interrupt handling, which indicates that the interrupt rate due to network I/O is low. Our profiling results validate the design of VirtIO, which improves virtualized I/O performance by batching I/O operations to reduce the number of VM exits.

As these two experiments demonstrate, guest-wide profiling with domain switch helps developers understand the underlying virtualized environment without the need for access to the VMM.
5.5 Profiling QEMU

With our system-wide profiling extensions for QEMU, we profile TCP receive on both the host and the guest. The experiment configuration is similar to the one described in Section 5.4, except that, instead of KVM, we use QEMU running in user space as the VMM. The virtual NIC is based on VirtIO. The observed TCP receive throughput is around 50MB/s. This amount of traffic saturates the physical CPU but does not keep the virtual CPU of the guest busy.

Table 14 presents the profiling results for the host part. The function cpu_x86_exec() retires a large portion of all instructions. Its functionality is similar to that of vmx_vcpu_run() in KVM: it switches the CPU from the host context to the guest context. Table 15 shows the results for the guest part, obtained by running a customized OProfile in the guest. The appearance of the function schedule() indicates that the virtual CPU is not saturated. The reason why the function strcat() retires the most instructions in the guest may be that the corresponding translated native code of this operation is expensive.

    % INSTR   Function                   Module
    68.9548   cpu_x86_exec               qemu
     6.0842   ldl_mmu                    qemu
     4.2902   helper_cc_compute_c        qemu
     1.7161   cpu_x86_handle_mmu_fault   qemu
     1.7161   phys_page_find_alloc       qemu
     1.4041   ld_phys                    qemu
     1.2480   tlb_set_page_exec          qemu
     0.6240   helper_cc_compute_all      qemu

Table 14. Instruction retirements for TCP receive in the QEMU host with the VirtIO virtual NIC, system-wide profiling with CPU switch.

    % INSTR   Function          Module
    10.5178   strcat            vmlinux
     3.8835   ipt_do_table      vmlinux
     2.7508   olpc_ec_cmd       vmlinux
     2.4272   schedule          vmlinux
     2.4272   slab_alloc        vmlinux
     2.2654   ip_route_input    vmlinux
     2.2654   skb_gro_receive   vmlinux
     1.9417   vring_add_buf     virtio_ring.ko

Table 15. Instruction retirements for TCP receive in the QEMU guest with the VirtIO virtual NIC, system-wide profiling with CPU switch.
5.6 Profiling Overhead

Profiling based on CPU performance counters inevitably slows down the profiled program, even in a native environment, because of the overhead of handling the counter overflow interrupts. In a virtualized environment, these interrupts need to be forwarded to the guest, adding more context switches between the host and the guest, and therefore more overhead. In addition, the VMM needs to save and restore the performance counters on a VM switch.

We evaluate the overhead of our profiling extensions by comparing the execution time, with and without profiling, of the program in Figure 2, modified to execute a fixed number of iterations. The program runs in the guest, and we take a sample every 5 million CPU cycles (or about 400 times per second).

Table 16 presents the results. In the native environment, the overhead of profiling is about 0.048%. For KVM guest-wide profiling, the overhead is about 0.386%. We further break this overhead down into two parts: additional context switches due to interrupt injection account for about 80% of the overall overhead, and interrupt handling in the guest takes the remaining 20%. For KVM system-wide profiling, the overhead is 0.441%. This is roughly the sum of the overheads of native and KVM guest-wide profiling, because KVM system-wide profiling also runs a profiler in the host Linux. System-wide profiling for QEMU incurs more overhead, around 0.942%, which comes from multiple sources. First, QEMU runs in user space, and forwarding an interrupt to the guest requires a change in CPU privilege level and a signal to the user-space process. Second, QEMU needs to query the address mapping cache and rewrite the sampled address. Third, frequent context switches also hurt the performance of QEMU's binary translation engine.

    Profiling environment   Execution time overhead
    Native                  0.048% ± 0.0042%
    KVM guest-wide          0.386% ± 0.0450%
    KVM system-wide         0.441% ± 0.0435%
    QEMU system-wide        0.942% ± 0.0441%

Table 16. Profiling overhead. The sample rate of the profiler is about 400 samples per second.

6. Discussion

Although both guest-wide and system-wide profiling are feasible and useful for diagnosing performance problems in a virtualized computing environment, a number of issues need to be considered before these techniques can be deployed in production, as discussed next.

Virtual PMU interface. Since the PMU is not a standardized hardware component of the x86 architecture, the programming interfaces for PMUs differ between hardware vendors and even between different models from the same vendor. In addition, different processors may support different profiling events. Therefore, for guest-wide profiling to be portable across processors, a proper interface between the guest profiler and the virtualized PMU must be defined. There are two ways to expose PMU interfaces to the guests. The first is to rely on the CPUID instruction to return the physical CPU family and model identifier to the guest. This information tells the guest profiler how to program the PMU hardware directly. The burden on the VMM is minimal, but the solution breaks one fundamental principle of virtualization: decoupling software from hardware. The second approach is to expose to the guest a standardized virtual PMU with a limited number of broadly used profiling events, such as CPU cycles, TLB and cache misses, etc. The guest profiler is extended to support this virtual PMU, and the VMM provides the mapping of operations between the virtual PMU and the underlying physical PMU. This approach decouples software from hardware, but imposes additional work on the VMM.
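To illustrate the first approach, the following sketch shows how a guest profiler might identify the CPU via CPUID before programming the PMU directly. This is a generic illustration, not code from the paper; the family/model decoding is simplified, and CPUID leaf 0xA is Intel-specific.

    #include <stdint.h>
    #include <stdio.h>

    /* Execute CPUID with the given leaf (GCC/Clang inline asm, x86). */
    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d)
    {
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(0));
    }

    int main(void)
    {
        uint32_t a, b, c, d;

        /* Leaf 1: family/model, which tells the profiler which
         * vendor-specific PMU programming interface to use
         * (simplified decoding of the extended fields). */
        cpuid(1, &a, &b, &c, &d);
        uint32_t family = ((a >> 8) & 0xF) + ((a >> 20) & 0xFF);
        uint32_t model  = ((a >> 4) & 0xF) | (((a >> 16) & 0xF) << 4);
        printf("family %u, model %u\n", family, model);

        /* On Intel, leaf 0xA describes the architectural PMU:
         * version, number of counters, and counter width. */
        cpuid(0xA, &a, &b, &c, &d);
        printf("PMU version %u, %u counters, %u bits wide\n",
               a & 0xFF, (a >> 8) & 0xFF, (a >> 16) & 0xFF);
        return 0;
    }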
Profiling accuracy. In addition to the statistical nature of sampling-based profiling, other factors potentially affect profiling accuracy in virtualized environments. The first problem is that the multiplexing of some hardware resources inevitably introduces noise into the profiling results. For instance, TLBs are flushed when switching between the VMM and the guests. If TLB misses are being monitored, the profiling results in a guest are perturbed by the execution of the VMM and/or other guests. This problem also exists in native profiling. Although it can be mitigated by cache/TLB entry tagging, profiling results for these events are still not guaranteed to be entirely accurate. The second problem is specific to profilers based on domain switch. If the VMM is interrupted to perform some action on behalf of another guest, the handling of this interrupt is incorrectly charged to the currently executing guest. The issue is similar to the resource accounting problem in a conventional operating system [5], and can possibly be solved by techniques such as early demultiplexing [10].

PMU emulation. In addition to virtualizing the PMU, for VMMs based on binary translation and for pure emulators, it is also possible to emulate the PMU hardware in software. In this case, the PMU of the physical CPU is not involved during the profiling process, and the entire functionality of the PMU is emulated in software. By emulating the PMU, the VMM can support events that are not implemented by the physical PMU; for instance, if the energy consumption of each CPU instruction were known, one could build an energy profiler in this way. We use PMU emulation to count instruction retirements. When a basic block is translated, we count the number of guest instructions in the block and insert a few instructions at the beginning of the translated basic block. When the translated block is executed, these instructions increase the emulated performance counter. If the emulated counter reaches the predefined threshold, an NMI is injected into the virtual CPU. The difficulty of PMU emulation lies in supporting a large number of hardware events: emulating these events may incur high overhead, and emulating some of them may not even be possible for a binary translator or an instruction-level CPU emulator.
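A minimal sketch of this counting scheme is shown below. The helper that the emitted block prologue would call is hypothetical, as is the NMI-injection function; the real translator would emit the increment inline rather than as a call.

    #include <stdint.h>

    extern void inject_virtual_nmi(void);  /* hypothetical VMM service */

    static uint64_t emulated_counter;
    static uint64_t sample_period = 5000000; /* e.g. one sample per 5M */

    /* The translator counts the guest instructions in each basic block
     * and emits a call to this helper at the head of the translated
     * block, so executing the block advances the emulated counter. */
    void pmu_emul_account(uint32_t insns_in_block)
    {
        emulated_counter += insns_in_block;
        if (emulated_counter >= sample_period) {
            emulated_counter -= sample_period;
            inject_virtual_nmi();  /* overflow: deliver a virtual NMI */
        }
    }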
7. Related Work

The XenOprof profiler [18] is the first profiler for virtual machines. According to our definitions, it does system-wide profiling. It is specifically designed for Xen, a VMM based on paravirtualization. A newer version of Xen, Xen HVM, supports hardware-assisted full virtualization. Xen HVM saves and restores MSRs when it performs domain switches, but it does not attribute samples from domain 0, in which all I/O device emulation is done, to any guest. VM exits in a domain that do not require the intervention of domain 0 are handled by Xen HVM in the context of that domain. As a result, guest-wide profiling in Xen HVM reflects neither the characteristics of the physical CPU nor those of the CPU plus the VMM.

Linux perf [2] is a new implementation of performance counter support for Linux. It runs in the Linux host and can profile a Linux guest running in KVM [1]. It obtains samples of the guest by observing its virtual CPU state. Because this is done outside the guest, only the PC and the CPU privilege mode can be recorded; the address of the descriptor of the current process is not known. As a result, Linux perf can only interpret samples that belong to the kernel of the Linux guest, and cannot handle samples contributed by user-space applications. The binary image and the virtual memory layout of the guest kernel, necessary for sample interpretation, are obtained through an explicit communication channel.

VMware vmkperf [14] is a performance monitoring utility for VMware ESX. It runs in the VMM and only records how many hardware events happen in a given time interval. It does not handle counter overflow interrupts, and it does not attribute events to functions. It does not support the profiling mechanisms presented in this paper.

VTSS++ [9] demonstrates a profiling technique similar to guest-wide profiling. It requires the cooperation of a profiler running in the guest and a PMU sampling tool running in the VMM. It relies on sampling timestamps to attribute hardware events sampled in the host to the corresponding threads in the guest. Although it does not require modifications to the VMM, VTSS++ requires access to the VMM to run the sampling tool, and the accuracy of the profiling results is affected by the estimation algorithm it uses.

The work in this paper builds on our earlier work [11], which proposes some basic ideas of virtual machine profiling and concentrates only on guest-wide profiling for VMMs based on hardware-assisted virtualization. This paper extends the earlier work in several ways. We implement system-wide profiling for a VMM based on binary translation. We also evaluate our implementations through extensive experiments to demonstrate the feasibility and usefulness of virtual machine profiling.

8. Conclusions

Profilers based on CPU performance counters help developers debug performance problems in complex software systems, but they are not well supported in virtual machine environments, making performance debugging in such environments hard. We define guest-wide profiling, which allows profiling of a guest without VMM access, and system-wide profiling, which allows profiling of the VMM and any number of guests. We study the requirements for each type of profiling: guest-wide profiling requires synchronous interrupt delivery to the guest, while system-wide profiling requires cooperation between the VMM and the guest to interpret samples belonging to the guest. We describe two approaches to implement this cooperation, full-delegation and interpretation-delegation. We develop a guest-wide and a system-wide profiler for a VMM based on hardware-assisted virtualization (KVM), and a system-wide profiler for a VMM based on binary translation (QEMU). We demonstrate the accuracy and the power of these profilers, and show that their performance overhead is very small. As more and more computing migrates to virtualization-based cloud infrastructures, better profiling tools for virtual machines will facilitate performance debugging and improve resource utilization in the cloud.

Acknowledgements

We would like to thank Mihai Dobrescu, Simon Schubert, and the anonymous reviewers for their valuable comments and help in improving this paper.

References

[1] Enhance perf to collect KVM guest os statistics from host side. 2010. http://lwn.net/Articles/378778.
[2] Performance Counters for Linux. 2010. http://lwn.net/Articles/310176.
[3] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[4] J.M. Anderson, L.M. Berc, J. Dean, S. Ghemawat, M.R. Henzinger, S.T.A. Leung, R.L. Sites, M.T. Vandevoorde, C.A. Waldspurger, and W.E. Weihl. Continuous profiling: where have all the cycles gone? Operating Systems Review, 1997.
[5] G. Banga, P. Druschel, and J.C. Mogul. Resource containers: A new facility for resource management in server systems. Operating Systems Review, 1998.
[6] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 9th ACM Symposium on Operating Systems Principles, 2003.
[7] F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the USENIX 2005 Annual Technical Conference, FREENIX Track, 2005.
[8] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne. Accelerating two-dimensional page walks for virtualized systems.
In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008.
[9] S. Bratanov, R. Belenov, and N. Manovich. Virtual machines: a whole new world for performance analysis. Operating Systems Review, 2009.
[10] P. Druschel and G. Banga. Lazy receiver processing (LRP): A network subsystem architecture for server systems. Operating Systems Review, 1996.
[11] J. Du, N. Sehrawat, and W. Zwaenepoel. Performance profiling in a virtualized environment. In Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010.
[12] S.L. Graham, P.B. Kessler, and M.K. McKusick. gprof: A call graph execution profiler. ACM SIGPLAN Notices, 1982.
[13] Intel Inc. Intel VTune Performance Analyzer, 2010. http://software.intel.com/en-us/intel-vtune/.
[14] VMware Inc. Vmkperf for VMware ESX 4.0, 2010.
[15] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide.
[16] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the Linux virtual machine monitor. In Linux Symposium, 2007.
[17] J. Levon and P. Elie. OProfile: A system profiler for Linux. 2010. http://oprofile.sourceforge.net.
[18] A. Menon, J.R. Santos, Y. Turner, G.J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, 2005.
[19] M. Rosenblum and T. Garfinkel. Virtual machine monitors: Current technology and future trends. Computer, 2005.
[20] R. Russell. virtio: towards a de-facto standard for virtual I/O devices. Operating Systems Review, 2008.
[21] B. Sprunt. The basics of performance-monitoring hardware. IEEE Micro, 2002.
[22] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, 1994.
[23] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance analysis using the MIPS R10000 performance counters. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, 1996.