
Commit f4031338 authored by Kyle Yan, committed by Prasad Sodagudi

Merge remote-tracking branch 'origin/tmp-5ed02dbb' into msm-next



* origin/tmp-5ed02dbb:
  Linux 4.12-rc3
  x86/ftrace: Make sure that ftrace trampolines are not RWX
  x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range()
  selftests/ftrace: Add a testcase for many kprobe events
  kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
  ftrace: Fix memory leak in ftrace_graph_release()
  ipv4: add reference counting to metrics
  net: ethernet: ax88796: don't call free_irq without request_irq first
  ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
  sctp: fix ICMP processing if skb is non-linear
  net: llc: add lock_sock in llc_ui_bind to avoid a race condition
  PCI/msi: fix the pci_alloc_irq_vectors_affinity stub
  blk-mq: Only register debugfs attributes for blk-mq queues
  x86/timers: Move simple_udelay_calibration past init_hypervisor_platform
  nvme: Quirk APST on Intel 600P/P3100 devices
  nvme: only setup block integrity if supported by the driver
  nvme: replace is_flags field in nvme_ctrl_ops with a flags field
  nvme-pci: consistencly use ctrl->device for logging
  bonding: Don't update slave->link until ready to commit
  test_bpf: Add a couple of tests for BPF_JSGE.
  bpf: add various verifier test cases
  bpf: fix wrong exposure of map_flags into fdinfo for lpm
  bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
  bpf: properly reset caller saved regs after helper call and ld_abs/ind
  bpf: fix incorrect pruning decision when alignment must be tracked
  arp: fixed -Wuninitialized compiler warning
  tcp: avoid fastopen API to be used on AF_UNSPEC
  net: move somaxconn init from sysctl code
  Input: elan_i2c - ignore signals when finishing updating firmware
  Input: elan_i2c - clear INT before resetting controller
  net: fix potential null pointer dereference
  drm/amdgpu: fix null point error when rmmod amdgpu.
  geneve: fix fill_info when using collect_metadata
  xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
  xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
  xfs: Fix missed holes in SEEK_HOLE implementation
  xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
  xfs: fix unaligned access in xfs_btree_visit_blocks
  powerpc: Add PPC_FEATURE userspace bits for SCV and DARN instructions
  powerpc/spufs: Fix hash faults for kernel regions
  powerpc: Fix booting P9 hash with CONFIG_PPC_RADIX_MMU=N
  powerpc/powernv/npu-dma.c: Fix opal_npu_destroy_context() call
  serial: altera_uart: call iounmap() at driver remove
  serial: imx: ensure UCR3 and UFCR are setup correctly
  drm/amd/powerplay: fix a signedness bugs
  drm/amdgpu: fix NULL pointer panic of emit_gds_switch
  drm/radeon: Unbreak HPD handling for r600+
  drm/amd/powerplay/smu7: disable mclk switching for high refresh rates
  drm/amd/powerplay/smu7: add vblank check for mclk switching (v2)
  drm/radeon/ci: disable mclk switching for high refresh rates (v2)
  drm/amdgpu/ci: disable mclk switching for high refresh rates (v2)
  virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
  be2net: Fix offload features for Q-in-Q packets
  vlan: Fix tcp checksum offloads in Q-in-Q vlans
  drm/amdgpu: fix fundamental suspend/resume issue
  net: phy: marvell: Limit errata to 88m1101
  net/phy: fix mdio-octeon dependency and build
  net: rtnetlink: bail out from rtnl_fdb_dump() on parse error
  net: fec: add post PHY reset delay DT property
  sctp: set new_asoc temp when processing dupcookie
  sctp: fix stream update when processing dupcookie
  MAINTAINERS/serial: Change maintainer of jsm driver
  ceph: check that the new inode size is within limits in ceph_fallocate()
  libceph: cleanup old messages according to reconnect seq
  x86/alternatives: Prevent uninitialized stack byte read in apply_alternatives()
  x86/PAT: Fix Xorg regression on CPUs that don't support PAT
  x86/watchdog: Fix Kconfig help text file path reference to lockup watchdog documentation
  x86/build: Permit building with old make versions
  x86/unwind: Add end-of-stack check for ftrace handlers
  Revert "x86/entry: Fix the end of the stack for newly forked tasks"
  tools/include: Sync kernel ABI headers with tooling headers
  perf tools: Put caller above callee in --children mode
  perf report: Do not drop last inlined frame
  perf report: Always honor callchain order for inlined nodes
  perf script: Add --inline option for debugging
  perf report: Fix off-by-one for non-activation frames
  perf report: Fix memory leak in addr2line when called by addr2inlines
  perf report: Don't crash on invalid maps in `-g srcline` mode
  thermal: broadcom: ns-thermal: default on iProc SoCs
  ti-soc-thermal: Fix a typo in a comment line
  ti-soc-thermal: Delete error messages for failed memory allocations in ti_bandgap_build()
  ti-soc-thermal: Use devm_kcalloc() in ti_bandgap_build()
  thermal: core: make thermal_emergency_poweroff static
  thermal: qoriq: remove useless call for of_thermal_get_trip_points()
  posix-timers: Make signal printks conditional
  drm/gma500/psb: Actually use VBT mode when it is found
  PCI/PM: Add needs_resume flag to avoid suspend complete optimization
  libceph: NULL deref on crush_decode() error path
  libceph: fix error handling in process_one_ticket()
  libceph: validate blob_struct_v in process_one_ticket()
  libceph: drop version variable from ceph_monmap_decode()
  libceph: make ceph_msg_data_advance() return void
  libceph: use kbasename() and kill ceph_file_part()
  partitions/msdos: FreeBSD UFS2 file systems are not recognized
  mlx5: fix bug reading rss_hash_type from CQE
  cdc-ether: divorce initialisation with a filter reset and a generic method
  block: fix an error code in add_partition()
  net/mlx5: Tolerate irq_set_affinity_hint() failures
  net/mlx5: Avoid using pending command interface slots
  net/mlx5e: IPoIB, handle RX packet correctly
  net/mlx5e: Fix warnings around parsing of TC pedit actions
  net/mlx5e: Properly enforce disallowing of partial field re-write offload
  net/mlx5e: Allow TC csum offload if applied together with pedit action
  net/sched: act_csum: Add accessors for offloading drivers
  net/mlx5e: Use the correct delete call on offloaded TC encap entry detach
  ptrace: Properly initialize ptracer_cred on fork
  cfg80211: make cfg80211_sched_scan_results() work from atomic context
  arm64: dts: hikey: Fix WiFi support
  arm64: dts: hi6220: Move board data from the dwmmc nodes to hikey dts
  arm64: dts: hikey: Add the SYS_5V and the VDD_3V3 regulators
  arm64: dts: hi6220: Move the fixed_5v_hub regulator to the hikey dts
  arm64: dts: hikey: Add clock for the pmic mfd
  mfd: dts: hi655x: Add clock binding for the pmic
  mmc: pwrseq_simple: Parse DTS for the power-off-delay-us property
  mmc: dt: pwrseq-simple: Invent power-off-delay-us
  drm: Fix deadlock retry loop in page_flip_ioctl
  drm: qxl: Delay entering atomic context during cursor update
  ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
  i2c: designware: Fix bogus sda_hold_time due to uninitialized vars
  Input: atmel_mxt_ts - add T100 as a readable object
  Input: edt-ft5x06 - increase allowed data range for threshold parameter
  efi-pstore: Fix write/erase id tracking
  PCI: imx6: Fix config read timeout handling
  switchtec: Fix minor bug with partition ID register
  switchtec: Use new cdev_device_add() helper function
  PCI: endpoint: Make PCI_ENDPOINT depend on HAS_DMA
  blk-throttle: force user to configure all settings for io.low
  blk-throttle: respect 0 bps/iops settings for io.low
  blk-throttle: output some debug info in trace
  blk-throttle: add hierarchy support for latency target and idle time
  kthread: Fix use-after-free if kthread fork fails
  futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
  leds: pca955x: Correct I2C Functionality
  nvme_fc: remove extra controller reference taken on reconnect
  nvme_fc: correct nvme status set on abort
  nvme_fc: set logging level on resets/deletes
  nvme_fc: revise comment on teardown
  nvme_fc: Support ctrl_loss_tmo
  nvme_fc: get rid of local reconnect_delay
  net: sched: cls_matchall: fix null pointer dereference
  blk-mq: remove blk_mq_abort_requeue_list()
  nvme: avoid to use blk_mq_abort_requeue_list()
  nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()
  nvme-rdma: support devices with queue size < 32
  vsock: use new wait API for vsock_stream_sendmsg()
  bonding: fix randomly populated arp target array
  net: Make IP alignment calulations clearer.
  mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read
  bonding: fix accounting of active ports in 3ad
  net: atheros: atl2: don't return zero on failure path in atl2_probe()
  mmc: cavium: Fix probing race with regulator
  of/platform: Make of_platform_device_destroy globally visible
  mmc: cavium: Prevent crash with incomplete DT
  ipv6: fix out of bound writes in __ip6_append_data()
  ALSA: hda - Update the list of quirk models
  ALSA: hda - Provide dual-codecs model option for a few Realtek codecs
  ALSA: hda - Apply dual-codec quirk for MSI Z270-Gaming mobo
  i2c: designware: Fix bogus sda_hold_time due to uninitialized vars
  i2c: i2c-tiny-usb: fix buffer not being DMA capable
  drm/radeon: Fix oops upon driver load on PowerXpress laptops
  acpi, nfit: Fix the memory error check in nfit_handle_mce()
  x86/MCE: Export memory_error()
  bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
  smsc95xx: Support only IPv4 TCP/UDP csum offload
  arp: always override existing neigh entries with gratuitous ARP
  arp: postpone addr_type calculation to as late as possible
  arp: decompose is_garp logic into a separate function
  arp: fixed error in a comment
  tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
  x86/boot: Use CROSS_COMPILE prefix for readelf
  xfs: avoid mount-time deadlock in CoW extent recovery
  xfrm: fix state migration copy replay sequence numbers
  selftests/powerpc: Fix TM resched DSCR test with some compilers
  mmc: cavium-octeon: Use proper GPIO name for power control
  mmc: cavium-octeon: Fix interrupt enable code
  mmc: sdhci-xenon: kill xenon_clean_phy()
  scsi: zero per-cmd private driver data for each MQ I/O
  scsi: csiostor: fix use after free in csio_hw_use_fwconfig()
  scsi: ufs: Clean up some rpm/spm level SysFS nodes upon remove
  serial: enable serdev support
  tty/serdev: add serdev registration interface
  serdev: Restore serdev_device_write_buf for atomic context
  serial: core: fix crash in uart_suspend_port
  tty: fix port buffer locking
  tty: ehv_bytechan: clean up init error handling
  serial: ifx6x60: fix use-after-free on module unload
  serial: altera_jtaguart: adding iounmap()
  serial: exar: Fix stuck MSIs
  serial: efm32: Fix parity management in 'efm32_uart_console_get_options()'
  serdev: fix tty-port client deregistration
  Revert "tty_port: register tty ports with serdev bus"
  drivers/tty: 8250: only call fintek_8250_probe when doing port I/O
  netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT
  crypto: skcipher - Add missing API setkey checks
  scsi: lpfc: fix build issue if NVME_FC_TARGET is not defined
  scsi: lpfc: Fix NULL pointer dereference during PCI error recovery
  mac80211: strictly check mesh address extension mode
  scsi: lpfc: update version to 11.2.0.14
  scsi: lpfc: Add MDS Diagnostic support.
  scsi: lpfc: Fix NVMEI's handling of NVMET's PRLI response attributes
  scsi: lpfc: Cleanup entry_repost settings on SLI4 queues
  scsi: lpfc: Fix debugfs root inode "lpfc" not getting deleted on driver unload.
  scsi: lpfc: Fix NVME I+T not registering NVME as a supported FC4 type
  scsi: lpfc: Added recovery logic for running out of NVMET IO context resources
  scsi: lpfc: Separate NVMET RQ buffer posting from IO resources SGL/iocbq/context
  scsi: lpfc: Separate NVMET data buffer pool fir ELS/CT.
  scsi: lpfc: Fix NMI watchdog assertions when running nvmet IOPS tests
  scsi: lpfc: Fix NVMEI driver not decrementing counter causing bad rport state.
  scsi: lpfc: Fix nvmet RQ resource needs for large block writes.
  scsi: lpfc: Adding additional stats counters for nvme.
  scsi: lpfc: Fix system crash when port is reset.
  scsi: lpfc: Fix used-RPI accounting problem.
  scsi: libfc: fix incorrect variable assignment
  scsi: sd: Ignore sync cache failures when not supported
  xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
  xfs: bad assertion for delalloc an extent that start at i_size
  xfs: fix warnings about unused stack variables
  xfs: BMAPX shouldn't barf on inline-format directories
  xfs: fix indlen accounting error on partial delalloc conversion
  ebtables: arpreply: Add the standard target sanity check
  ALSA: hda - No loopback on ALC299 codec
  netfilter: nf_tables: revisit chain/object refcounting from elements
  netfilter: nf_tables: missing sanitization in data from userspace
  netfilter: nf_tables: can't assume lock is acquired when dumping set elems
  netfilter: synproxy: fix conntrackd interaction
  netfilter: xtables: zero padding in data_to_user
  netfilter: nfnl_cthelper: reject del request if helper obj is in use
  netfilter: introduce nf_conntrack_helper_put helper function
  netfilter: don't setup nat info for confirmed ct
  netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch
  ALSA: usb-audio: fix Amanero Combo384 quirk on big-endian hosts
  cpufreq: dbx500: add a Kconfig symbol
  PM / hibernate: Declare variables as static
  PowerCap: Fix an error code in powercap_register_zone()
  RTC: rtc-cmos: Fix wakeup from suspend-to-idle
  PM / wakeup: Fix up wakeup_source_report_event()
  cpufreq: intel_pstate: Document the current behavior and user interface
  Revert "ACPI / button: Remove lid_init_state=method mode"
  tools/power/acpi: Add .gitignore file
  scsi: sg: don't return bogus Sg_requests
  scsi: sd: Write lock zone for REQ_OP_WRITE_ZEROES
  scsi: sd: Unlock zone in case of error in sd_setup_write_same_cmnd()
  ipvs: SNAT packet replies only for NATed connections
  xfrm: Fix NETDEV_DOWN with IPSec offload
  af_key: Fix slab-out-of-bounds in pfkey_compile_policy.
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
  xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
  esp4: Fix udpencap for local TCP packets.

Change-Id: I7a5b5e6940e910245074acaa622bd1f64c5cd92d
Signed-off-by: Kyle Yan <kyan@codeaurora.org>
parents 129a572f 5ed02dbb
+12 −4
@@ -59,20 +59,28 @@ button driver uses the following 3 modes in order not to trigger issues.
If the userspace hasn't been prepared to ignore the unreliable "opened"
events and the unreliable initial state notification, Linux users can use
the following kernel parameters to handle the possible issues:
A. button.lid_init_state=open:
A. button.lid_init_state=method:
   When this option is specified, the ACPI button driver reports the
   initial lid state using the returning value of the _LID control method
   and whether the "opened"/"closed" events are paired fully relies on the
   firmware implementation.
   This option can be used to fix some platforms where the returning value
   of the _LID control method is reliable but the initial lid state
   notification is missing.
   This option is the default behavior during the period the userspace
   isn't ready to handle the buggy AML tables.
B. button.lid_init_state=open:
   When this option is specified, the ACPI button driver always reports the
   initial lid state as "opened" and whether the "opened"/"closed" events
   are paired fully relies on the firmware implementation.
   This may fix some platforms where the returning value of the _LID
   control method is not reliable and the initial lid state notification is
   missing.
   This option is the default behavior during the period the userspace
   isn't ready to handle the buggy AML tables.

If the userspace has been prepared to ignore the unreliable "opened" events
and the unreliable initial state notification, Linux users should always
use the following kernel parameter:
B. button.lid_init_state=ignore:
C. button.lid_init_state=ignore:
   When this option is specified, the ACPI button driver never reports the
   initial lid state and there is a compensation mechanism implemented to
   ensure that the reliable "closed" notifications can always be delivered
+10 −9
.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
@@ -75,7 +76,7 @@ feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the ``intel_pstate`` scaling driver.
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
@@ -174,13 +175,13 @@ necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the ``intel_pstate`` scaling driver bypasses the scaling
As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if ``intel_pstate`` is used, scaling governors are not attached to
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the ``intel_pstate`` case they both determine the P-state to
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

@@ -257,7 +258,7 @@ are the following:

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the ``intel_pstate`` scaling driver is
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

@@ -274,7 +275,7 @@ are the following:
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some scaling drivers (e.g. ``intel_pstate``) attempt to provide
	Some scaling drivers (e.g. |intel_pstate|) attempt to provide
	information more precisely reflecting the current CPU frequency through
	this attribute, but that still may not be the exact current CPU
	frequency as seen by the hardware at the moment.
@@ -284,13 +285,13 @@ are the following:

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	``intel_pstate`` scaling driver is in use) the scaling algorithm
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	``intel_pstate`` case), as indicated by the string written to this
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

@@ -619,7 +620,7 @@ This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system.  It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
``intel_pstate``).
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.  This
means that either the hardware can be put into states in which it is able to
+1 −0
@@ -6,6 +6,7 @@ Power Management
   :maxdepth: 2

   cpufreq
   intel_pstate

.. only::  subproject and html

+755 −0

File added.

Preview size limit exceeded, changes collapsed.

+0 −281
Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing more details about the Intel P-State driver. Based on what callbacks
a cpufreq driver provides to the cpufreq core, it can support two types of
drivers:
- with target_index() callback: In this mode, the drivers using cpufreq core
simply provide the minimum and maximum frequency limits and an additional
interface target_index() to set the current frequency. The cpufreq subsystem
has a number of scaling governors ("performance", "powersave", "ondemand",
etc.). Depending on which governor is in use, cpufreq core will call for
transitions to a specific frequency using target_index() callback.
- setpolicy() callback: In this mode, drivers do not provide target_index()
callback, so cpufreq core can't request a transition to a specific frequency.
The driver provides minimum and maximum frequency limits and callbacks to set a
policy. The policy in cpufreq sysfs is referred to as the "scaling governor".
The cpufreq core can request the driver to operate in any of the two policies:
"performance" and "powersave". The driver decides which frequency to use based
on the above policy selection considering minimum and maximum frequency limits.

The Intel P-State driver falls under the latter category, which implements the
setpolicy() callback. This driver decides what P-State to use based on the
requested policy from the cpufreq core. If the processor is capable of
selecting its next P-State internally, then the driver will offload this
responsibility to the processor (aka HWP: Hardware P-States). If not, the
driver implements algorithms to select the next P-State.

Since these policies are implemented in the driver, they are not the same as the
cpufreq scaling governors implementation, even if they have the same name in
the cpufreq sysfs (scaling_governors). For example the "performance" policy is
similar to cpufreq’s "performance" governor, but "powersave" is completely
different than the cpufreq "powersave" governor. The strategy here is similar
to cpufreq "ondemand", where the requested P-State is related to the system load.

Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system; refer to the later section on "Per-CPU limits").

      max_perf_pct: Limits the maximum P-State that will be requested by
      the driver. It states it as a percentage of the available performance. The
      available (P-State) performance may be reduced by the no_turbo
      setting described below.

      min_perf_pct: Limits the minimum P-State that will be requested by
      the driver. It states it as a percentage of the max (non-turbo)
      performance level.

      no_turbo: Limits the driver to selecting P-States below the turbo
      frequency range.

      turbo_pct: Displays the percentage of the total performance that
      is supported by hardware that is in the turbo range. This number
      is independent of whether turbo has been disabled or not.

      num_pstates: Displays the number of P-States that are supported
      by hardware. This number is independent of whether turbo has
      been disabled or not.

For example, if a system has these parameters:
	Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
	Max non turbo ratio: 0x17
	Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show :
	max_perf_pct:100, which corresponds to 1 core ratio
	min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
	no_turbo:0, turbo is not disabled
	num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
	turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates
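
The arithmetic behind the example values above can be checked with a short
sketch. This is only an illustration of the formulas quoted in the example,
not the driver's code: the function name is hypothetical, and the rounding
directions (truncation for min_perf_pct, rounding up for turbo_pct) are
assumptions chosen because they reproduce the documented 24 and 39.

```python
def pstate_sysfs_values(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Derive the example sysfs values from the three hardware ratios.

    Hypothetical helper for illustration; it is not part of the driver.
    """
    # num_pstates = max 1 core ratio - max efficiency ratio + 1
    num_pstates = max_turbo_ratio - min_ratio + 1

    # min_perf_pct = max efficiency ratio / max 1 core ratio, as a percentage.
    # Assumption: the result is truncated (24.24 -> 24).
    min_perf_pct = 100 * min_ratio // max_turbo_ratio

    # turbo_pct = (max 1 core ratio - max non turbo ratio) / num_pstates.
    # Assumption: the result is rounded up (38.46 -> 39).
    turbo_states = max_turbo_ratio - max_non_turbo_ratio
    turbo_pct = (100 * turbo_states + num_pstates - 1) // num_pstates

    return {
        "num_pstates": num_pstates,
        "min_perf_pct": min_perf_pct,
        "turbo_pct": turbo_pct,
    }


if __name__ == "__main__":
    # Ratios from the example: 0x21 (turbo), 0x17 (non-turbo), 0x08 (minimum).
    print(pstate_sysfs_values(0x21, 0x17, 0x08))
```

With the example ratios the sketch yields num_pstates 26, min_perf_pct 24 and
turbo_pct 39, matching the values listed above.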

Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

There is one more sysfs attribute in /sys/devices/system/cpu/intel_pstate/
that can be used for controlling the operation mode of the driver:

      status: Three settings are possible:
      "off"     - The driver is not in use at this time.
      "active"  - The driver works as a P-state governor (default).
      "passive" - The driver works as a regular cpufreq one and collaborates
                  with the generic cpufreq governors (it sets P-states as
                  requested by those governors).
      The current setting is returned by reads from this attribute.  Writing one
      of the above strings to it changes the operation mode as indicated by that
      string, if possible.  If HW-managed P-states (HWP) are enabled, it is not
      possible to change the driver's operation mode and attempts to write to
      this attribute will fail.

cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

scaling_cur_freq: This displays the real frequency which was used during
the last sample period instead of what is requested. Some other cpufreq drivers,
like acpi-cpufreq, display what is requested (some changes are on the
way to fix this for acpi-cpufreq driver). The same is true for frequencies
displayed at /proc/cpuinfo.

scaling_governor: This displays the currently active policy. Since each CPU has a
cpufreq sysfs, it is possible to set a scaling governor to each CPU. But this
is not possible with Intel P-States, as there is one common policy for all
CPUs. Here, the last requested policy will be applicable to all CPUs. It is
suggested that one use the cpupower utility to change policy to all CPUs at the
same time.

scaling_setspeed: This attribute can never be used with Intel P-State.

scaling_max_freq/scaling_min_freq: This interface can be used similarly to
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However since frequencies
are converted to nearest possible P-State, this is prone to rounding errors.
This method is not preferred to limit performance.

affected_cpus: Not used
related_cpus: Not used

For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels.  The idea that frequency can be set to a single
frequency is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Per-CPU limits

The kernel command line option "intel_pstate=per_cpu_perf_limits" forces
the intel_pstate driver to use per-CPU performance limits.  When it is set,
the sysfs control interface described above is subject to limitations.
- The following controls are not available for either reading or writing:
	/sys/devices/system/cpu/intel_pstate/max_perf_pct
	/sys/devices/system/cpu/intel_pstate/min_perf_pct
- The following controls can be used to set performance limits, as far as the
architecture of the processor permits:
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- Users can still observe the turbo percentage and the number of P-States from
	/sys/devices/system/cpu/intel_pstate/turbo_pct
	/sys/devices/system/cpu/intel_pstate/num_pstates
- Users can read and write the system-wide turbo status via
	/sys/devices/system/cpu/intel_pstate/no_turbo

Support of energy performance hints
It is possible to provide hints to the HWP algorithms in the processor,
ranging from more performance-centric to more energy-centric. When the driver
is using HWP, two additional cpufreq sysfs attributes are presented for
each logical CPU.
These attributes are:
	- energy_performance_available_preferences
	- energy_performance_preference

To get the list of supported hints:
$ cat energy_performance_available_preferences
    default performance balance_performance balance_power power

The current preference can be read or changed via cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display the current effective setting. Users can write any of the valid
preference strings to this attribute. Users can always restore the power-on
default by writing "default".

Since threads can migrate to different CPUs, it is possible that the
new CPU has a different energy performance preference than the previous
one. To avoid such issues, either pin threads to specific CPUs or set the
same energy performance preference value on all CPUs.
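
Setting the same preference on every CPU could be scripted roughly as follows.
This is a minimal sketch under stated assumptions: the function name is
hypothetical, and the per-CPU attribute location
(cpu*/cpufreq/energy_performance_preference under /sys/devices/system/cpu) is
assumed from the cpufreq paths listed in the "Per-CPU limits" section. Writing
to sysfs normally requires root privileges.

```python
import glob
import os


def set_energy_performance_preference(preference,
                                      cpu_root="/sys/devices/system/cpu"):
    """Write one energy performance preference to every logical CPU.

    Hypothetical helper; the sysfs layout under cpu_root is an assumption.
    Returns the list of attribute files that were written.
    """
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq",
                           "energy_performance_preference")
    paths = sorted(glob.glob(pattern))

    # Validate against every CPU's advertised list before writing anything,
    # so that a rejected value does not leave the CPUs in a mixed state.
    for path in paths:
        available_path = os.path.join(
            os.path.dirname(path),
            "energy_performance_available_preferences")
        with open(available_path) as available_file:
            available = available_file.read().split()
        if preference not in available:
            raise ValueError(
                f"{preference!r} is not listed in {available_path}")

    for path in paths:
        with open(path, "w") as preference_file:
            preference_file.write(preference)
    return paths


if __name__ == "__main__":
    for written in set_energy_performance_preference("balance_performance"):
        print("updated", written)
```

The cpu_root parameter exists only so the helper can be pointed at a scratch
directory for testing instead of the live sysfs tree.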

Tuning Intel P-State driver

When the performance can be tuned using PID (Proportional Integral
Derivative) controller, debugfs files are provided for adjusting performance.
They are presented under:
/sys/kernel/debug/pstate_snb/

The PID tunable parameters are:
      deadband
      d_gain_pct
      i_gain_pct
      p_gain_pct
      sample_rate_ms
      setpoint

To adjust these parameters, some understanding of driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the power and performance relationship.
These limits are only useful when the "powersave" policy is active.

- To make the system more responsive to load changes, sample_rate_ms can
be adjusted (current default is 10ms).
- To make the system use higher performance, even if the load is lower, setpoint
can be adjusted to a lower number. This will also lead to faster ramp up time
to reach the maximum P-State.
If there are no derivative and integral coefficients, the next P-State will be
equal to:
	current P-State - ((setpoint - current cpu load) * p_gain_pct)

For example, if the current PID parameters are (which are the defaults for Core
processors like SandyBridge):
      deadband = 0
      d_gain_pct = 0
      i_gain_pct = 0
      p_gain_pct = 20
      sample_rate_ms = 10
      setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process will
continue as long as the load is more than the setpoint until the maximum P-State
is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16
So by changing the setpoint from 97 to 60, there is an increase of the
next P-State from 9 to 16. So this will make the processor execute at a higher
P-State for the same CPU load. If the load continues to be more than the
setpoint during next sample intervals, then P-State will go up again till the
maximum P-State is reached. But the ramp up time to reach the maximum P-State
will be much faster when the setpoint is 60 compared to 97.
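
The two worked examples above can be reproduced with a small simulation of the
proportional-only formula. This is a sketch, not the driver's implementation:
the function names are hypothetical, rounding to the nearest integer follows
the "8.6 (rounded to 9)" step in the text, and the maximum P-State of 0x21 is
borrowed from the earlier sysfs example as an assumption.

```python
def next_pstate(current, load, setpoint, p_gain_pct, max_pstate):
    """Proportional-only step: current - ((setpoint - load) * p_gain_pct).

    Hypothetical helper. p_gain_pct is a percentage, so 20 means 0.2.
    The result is rounded to the nearest integer and clamped to max_pstate.
    """
    correction = (setpoint - load) * p_gain_pct / 100
    return min(max_pstate, round(current - correction))


def samples_to_max(start, load, setpoint, p_gain_pct, max_pstate):
    """Count the sample intervals needed to reach max_pstate at constant load."""
    pstate = start
    samples = 0
    while pstate < max_pstate:
        following = next_pstate(pstate, load, setpoint, p_gain_pct, max_pstate)
        if following <= pstate:
            # The load is not above the setpoint, so the P-State never ramps up.
            raise ValueError("load does not exceed the setpoint")
        pstate = following
        samples += 1
    return samples


if __name__ == "__main__":
    MAX_PSTATE = 0x21  # assumption: taken from the sysfs example
    for setpoint in (97, 60):
        first = next_pstate(0x08, 100, setpoint, 20, MAX_PSTATE)
        ramp = samples_to_max(0x08, 100, setpoint, 20, MAX_PSTATE)
        print(f"setpoint={setpoint}: next P-State={first}, "
              f"samples to maximum={ramp}")
```

Under these assumptions, a constant load of 100 takes 25 sample intervals to
reach the maximum P-State with setpoint 97, but only 4 with setpoint 60.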

Debugging Intel P-State driver

Event tracing
To debug P-State transition, the Linux event tracing interface can be used.
There are two specific events, which can be enabled (provided the kernel
configs related to event tracing are enabled).

# cd /sys/kernel/debug/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107
	scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
		freq=2474476
cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2
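
For post-processing, the key=value fields of a pstate_sample record can be
pulled out of the trace text with a few lines of Python. This is a hedged
sketch: the function name is hypothetical, and it assumes every field is an
unsigned decimal integer, as in the sample output above.

```python
import re

# Matches fields such as "core_busy=107" or "freq=2474476".
_FIELD = re.compile(r"(\w+)=(\d+)")


def parse_pstate_sample(record):
    """Return the fields of one pstate_sample record as a dict of ints.

    Hypothetical helper. The record may span several wrapped lines, as in
    the example output. Returns None for other trace events.
    """
    _, marker, payload = record.partition("pstate_sample:")
    if not marker:
        return None
    return {name: int(value) for name, value in _FIELD.findall(payload)}


if __name__ == "__main__":
    sample = (
        "gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: "
        "core_busy=107\n"
        "\tscaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618\n"
        "\t\tfreq=2474476"
    )
    fields = parse_pstate_sample(sample)
    print(fields["from"], "->", fields["to"], "at", fields["freq"], "kHz")
```

Records for other events, such as cpu_frequency, yield None, so they can be
filtered out of a mixed trace.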


Using ftrace

If function level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set ftrace filter to intel_pstate_set_pstate.

# cd /sys/kernel/debug/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
...

# echo intel_pstate_set_pstate > set_ftrace_filter
# echo function > current_tracer
# cat trace | head -15
# tracer: function
#
# entries-in-buffer/entries-written: 80/80   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
 gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
     gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
          <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func