Introduction
Our Linux performance tuning guide: this is a working draft and will be updated frequently, so bookmark this page and revisit it for changes.
The guidelines in this article are the result of ~20 years of rigorous research and of supporting a ~25 million subscriber platform, tuned for: DB [Oracle, MariaDB, PostgreSQL], APP [PoS billing], Cluster [Hadoop].
Tuning is especially important for user-facing and revenue-critical systems.
Linux Kernel
Our meticulously tuned and annotated kernel sysctl.conf file for RHEL:
# XOMEDIA SETTINGS accurate as of 23Feb2023
#
# Enable sysreq Magic SysRq key:
kernel.sysrq = 1
#
# Memory parameters:
#
# NOTE: Shared memory and HugePages settings depend on server memory and
# application requirements (Ex. Oracle, PoS APP etc).
# NOTE: To view current values for SHMMAX, SHMALL or SHMMIN, use
# the ipcs command.
#
# [kernel.shmmax]
# Defines the maximum size (in bytes) of a single shared memory segment a Linux
# process can allocate.
# For performance reasons, it should be larger than the largest single shared
# memory allocation made by Oracle or any other application (Ex. PoS).
# If in doubt, set as large as the physical memory, for example:
# For systems with 64 GB RAM:
#kernel.shmmax = 68719476736
# For systems with 128 GB RAM:
#kernel.shmmax = 137438953472
# For systems with 256 GB RAM:
#kernel.shmmax = 274877906944
# [kernel.shmall]
# This parameter limits the total number of shared memory pages that can be
# used system-wide. Hence, SHMALL should always be at least ceil(shmmax/PAGE_SIZE).
# PAGE_SIZE is usually 4096 bytes (check with 'getconf PAGE_SIZE') unless you use
# Big Pages or HugePages, which support larger memory pages. A worked example
# follows the per-RAM values below.
# For systems with 64 GB of RAM:
#kernel.shmall = 16777216
# For systems with 128 GB RAM:
#kernel.shmall = 33554432
# For systems with 256 GB RAM:
#kernel.shmall = 67108864
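# Worked example (assuming a 64 GB server, 4 KiB pages and the 64 GB shmmax value above):
#   kernel.shmall >= ceil(shmmax / PAGE_SIZE) = 68719476736 / 4096 = 16777216 pages
#   (16777216 pages * 4 KiB = 64 GB of shared memory allowed system-wide)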
# Maximum number of shared memory segments system-wide:
kernel.shmmni = 8192
kernel.sem = 2048 65536 2048 1024
kernel.msgmni = 4096
kernel.msgmnb = 512000
kernel.msgmax = 65535
kernel.hung_task_timeout_secs = 0
# [vm.swappiness]
# Affects DB performance. Controls swappiness, i.e. how aggressively (memory)
# pages are swapped out of RAM. The value ranges from 0 to 100: 0 means avoid
# swapping as much as possible and 100 means swap aggressively. On newer kernels
# a value of 0 may cause the OOM killer (out-of-memory killer) to kill the
# process, so it can be safer to set the value to 1 to minimize swapping.
# The default value on a Linux system is 60. A higher value causes the kernel
# to use more swap space, whereas a lower value keeps more data/code in RAM.
# A smaller value is a good bet to improve DB performance.
# DB, PoS and HADOOP servers: set to 0 (maximum data in memory, swap only
# when out of memory).
# For other servers set to 20.
vm.swappiness = 20
# [vm.overcommit_memory]
# APPs acquire memory and free it when it is no longer needed. An APP can acquire
# too much memory and not release it (which can invoke the OOM killer).
# Possible values:
# 0: Heuristic overcommit (default); overcommit intelligently, based on kernel heuristics.
# 1: Allow overcommit anyway.
# 2: Don't overcommit beyond the overcommit ratio.
# A value of 2 yields better performance for DBs: it maximizes RAM utilization
# by the server process without any significant risk of getting killed by the OOM
# killer. An APP will be able to overcommit, but only within the overcommit ratio,
# reducing the risk of the OOM killer killing the process. Reliability is also
# improved, because memory beyond the allowed range is never overcommitted.
# On systems without swap, however, a value of 2 may cause problems.
# For lazy swap reservation set a value of 1, and make sure sufficient swap is
# allocated to avoid late process failures. Setting the value to 2 (strict swap
# reservation) requires larger swap allocations: at least the size of the
# physical memory.
# HADOOP: Set to 0. MAPR clusters should use only real memory, with swap
# (memory on disk) disabled.
# In DEV ENVs (except stability lab), set to 1.
# NOTE: We were debating whether to set to 2, but couldn’t as it mandates very large
# swap requirements.
#vm.overcommit_memory = 0
# [vm.overcommit_ratio]
# % of RAM included in the commit limit (used when vm.overcommit_memory = 2).
# The commit limit is swap + (overcommit_ratio/100) * RAM; for example, a value
# of 50 on a system with 2 GB of RAM and 2 GB of swap can commit up to 3 GB.
vm.overcommit_ratio = 50
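# To check the resulting commit limit and current commitment at runtime:
#   grep -i commit /proc/meminfo
# (CommitLimit is the limit; Committed_AS is the memory currently committed.)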
# [vm.vm_devzero_optimized]
# Required for Oracle version 11.2 and higher for performance reasons
# NOTE: RHEL 5 only, not required for RHEL 6.
#vm.vm_devzero_optimized = 0
# The rest of the "VM" parameters below apply to RH 6:
vm.min_free_kbytes = 1048576
vm.dirty_expire_centisecs = 500
# [vm.dirty_ratio]
# This is the same as vm.dirty_background_ratio / dirty_background_bytes
# except that the flushing is done in the foreground, blocking the application.
# So vm.dirty_ratio should be higher than vm.dirty_background_ratio. This will
# ensure that background processes kick in before the foreground processes to
# avoid blocking the application, as much as possible. You can tune the
# difference between the two ratios depending on your disk IO load.
# Keep it at 20 for servers with large memory (over 256 GB).
vm.dirty_ratio = 40
# [vm.dirty_background_ratio]
# The % of memory filled with dirty pages that need to be flushed to disk.
# Flushing is done in the background. The value of this parameter ranges
# from 0 to 100; however, a value lower than 5 may not be effective and
# some kernels do not internally support it. The default value is 10 on
# most Linux systems. A lower ratio can improve performance for write-intensive
# workloads, because Linux starts flushing dirty pages to disk in the
# background earlier.
vm.dirty_background_ratio = 3
vm.dirty_writeback_centisecs = 100
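# Rough worked example (assuming ~64 GB of available memory; the ratios apply
# to available memory, not total RAM):
#   background flushing starts at about  3% of 64 GB ~= 1.9 GB of dirty pages
#   writers are throttled/blocked at about 40% of 64 GB ~= 25.6 GB of dirty pages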
#
# Network parameters:
#
# Max queued TCP connections: the upper limit on the listen() backlog per
# socket, i.e. connections waiting to be accepted (default 128):
net.core.somaxconn = 8192
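# NOTE: The effective backlog is min(application listen() backlog, somaxconn),
# so the application must also pass a large backlog, e.g. listen(fd, 8192).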
#---
net.core.netdev_max_backlog = 300000
# HADOOP:
#net.core.netdev_max_backlog = 50000
#---
# Please see link to note on tuning backlog and budget:
net.core.netdev_budget = 3000
# NOTE: This was set to 'net.core.optmem_max = 65536' in other DOCs:
net.core.optmem_max = 33554432
# Send/Receive buffers:
net.core.rmem_max = 33554432
# Write buffer:
net.core.wmem_max = 33554432
# Receive buffer:
# NOTE: This was set to 'net.core.rmem_default = 262144' in other DOCs.
net.core.rmem_default = 8388608
# Write buffer:
# NOTE: This was set to 'net.core.wmem_default = 262144' in other DOCs.
net.core.wmem_default = 8388608
# [net.ipv4.tcp_mem] Set the low, pressure, and high thresholds for total
# TCP memory usage, in system pages (not bytes).
# For systems with 64 GB RAM:
#net.ipv4.tcp_mem = 6169824 8226432 12339648
# For systems with 96 GB RAM:
#net.ipv4.tcp_mem = 8388608 11361920 16777216
# For systems with 128 GB RAM:
#net.ipv4.tcp_mem = 12398784 16531712 24797568
# For systems with 192 GB RAM:
#net.ipv4.tcp_mem = 18605184 24806912 37210368
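# Example conversion (assuming 4 KiB pages): for the 64 GB values above, the high
# threshold is 12339648 pages * 4 KiB ~= 47 GiB of memory usable by TCP buffers.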
#---
# NOTE: This was set to 'net.ipv4.tcp_rmem = 4096 262144 33554432' in other DOCS.
net.ipv4.tcp_rmem = 4096 87380 33554432
# HADOOP:
#net.ipv4.tcp_rmem = 4096 87380 67108864
#---
# NOTE: This was set to 'net.ipv4.tcp_wmem = 4096 262144 33554432' in other DOCs.
net.ipv4.tcp_wmem = 4096 65536 33554432
# HADOOP:
#net.ipv4.tcp_wmem = 4096 65536 67108864
#---
#---
# PoS:
net.ipv4.tcp_tw_reuse = 1
# [net.ipv4.tcp_tw_recycle] We previously set this to 1 (default is 0, disabled).
# Update: Reference Appendix ‘[net.ipv4.tcp_tw_recycle] and [net.ipv4.tcp_timestamps]’.
# This parameter is deprecated and removed as of RHEL 8.
# [net.ipv4.tcp_tw_recycle] and [net.ipv4.tcp_timestamps] should be
# disabled if the server is behind NAT:
# https://access.redhat.com/solutions/18035
# https://access.redhat.com/solutions/35223
# It was used in the past for PoS only; there was no real reason to keep it, and
# even for PoS the recommendation was changed to leave it disabled (i.e. the
# default), since net.ipv4.tcp_tw_reuse is in use and sufficient:
#net.ipv4.tcp_tw_recycle = 0
#---
# [net.ipv4.tcp_keepalive_time] 30 for RAC DB servers:
net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
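# Worked example with the values above: an idle, dead peer is detected after
# tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
# = 180 + 9 * 60 = 720 seconds (12 minutes).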
# [net.ipv4.tcp_retries2] This value influences the timeout of an established
# TCP connection when retransmissions remain unacknowledged: TCP retransmits
# that many times before killing the connection. Important for HADOOP clusters.
# Oracle RAC DB servers: set to 3.
# HADOOP clusters: set to 2.
net.ipv4.tcp_retries2 = 5
# Oracle RAC DB servers set to 2:
net.ipv4.tcp_syn_retries = 1
# [net.ipv4.ip_local_port_range] Only for dedicated DB servers:
#net.ipv4.ip_local_port_range = 9000 65500
net.ipv4.tcp_sack = 1
net.ipv4.tcp_dsack = 0
# [net.ipv4.tcp_tw_recycle] and [net.ipv4.tcp_timestamps]
# Reference Appendix ‘[net.ipv4.tcp_tw_recycle] and [net.ipv4.tcp_timestamps]’
# Disable if the server is behind NAT:
# https://access.redhat.com/solutions/18035
# https://access.redhat.com/solutions/35223
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
# NOTE: Set to 'net.ipv4.tcp_syncookies = 1' in other DOCs.
net.ipv4.tcp_syncookies = 0
# NFS perf - For RHEL < 6.2, for RHEL 6.2 and above it is dynamic:
#sunrpc.tcp_slot_table_entries = 128
# Pages - RH 5 only:
#net.ipv4.udp_mem = 49634688 66179584 99269376
# [fs.file-max] System wide open file handles for Oracle:
#fs.file-max = 6815744
# ASYNC I/O:
fs.aio-max-nr = 3145728
# [vm.nr_hugepages]
# Applications with native HugePages pages support: Oracle, PostgreSQL, MySQL, JVM.
# Shared memory and HugePages depend on server memory and application
# requirements (Ex. Oracle, PoS etc).
# Use HugePages for DATABASEs and PoS server:
# For DB: vm.nr_hugepages should be the SGA size in 2M pages.
# For PoS change vm.hugetlb_shm_group accordingly.
# Example:
# Set vm.nr_hugepages to 0, or to a value larger than the Oracle cache, the PoS
# shared memory, or the total memory of all JVMs (sum of -Xmx + maxPermSize of
# all JVMs), expressed in 2MB pages, when the machine is a database server.
#vm.nr_hugepages =
# Group ID of the user allocating HugePages shared memory:
#vm.hugetlb_shm_group =
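# Worked example (hypothetical 32 GB Oracle SGA, 2 MB HugePages):
#   vm.nr_hugepages = 32 GB / 2 MB = 16384 (plus a small safety margin)
#   vm.hugetlb_shm_group = <GID of the group running the DB, e.g. the dba group>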
#EOF
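To load the settings without a reboot and spot-check a few values, a minimal sequence is (the output shown is illustrative):
~]# sysctl -p /etc/sysctl.conf
~]# sysctl net.core.somaxconn vm.swappiness
net.core.somaxconn = 8192
vm.swappiness = 20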
Transparent Huge Pages
Transparent Huge Pages (THP) is a memory management technique that can benefit performance by utilizing larger contiguous memory blocks for applications.
THP is usually enabled by default on Linux systems. Administrators can either disable THP altogether if it provides no benefit (e.g. for RDBMS workloads), or configure it to be used with memory-intensive applications.
THP Overview:
Traditional Memory Management: By default, Linux allocates memory in units called pages. The typical page size can vary depending on the system architecture (e.g., 4KB on most modern systems). When a program needs memory, it requests pages from the system. These pages may be scattered throughout physical memory.
Transparent Huge Pages (THP):
THP enables systems to automatically allocate larger contiguous memory blocks, called Huge Pages (usually 2MB or larger). When a program requests memory, the system attempts to allocate it from available Huge Pages.
Benefits:
For applications that work with large contiguous datasets, THP can significantly improve performance.
- Reduced Memory Overhead: Larger pages require fewer page table entries, leading to slightly lower memory overhead.
- Reduced Translation Lookaside Buffer (TLB) Misses: The TLB is a hardware cache that translates virtual memory addresses to physical addresses. With larger Huge Pages, fewer page table entries are needed, reducing TLB misses and improving memory access speed.
- Reduced Page Faults: Since Huge Pages are larger, fewer page faults (situations where the required data isn’t readily available in memory) might occur, leading to faster program execution.
Potential Drawbacks:
It may not be beneficial to enable THP, in which case you should either disable it globally, disable THP for certain applications, or separate database and application workloads onto different servers (see the madvise example after this list).
- Fragmentation: THP allocation can fragment memory, making it harder to allocate smaller chunks of memory for other processes. This can negatively impact the performance of applications with different memory access patterns.
- Compatibility Issues: Some applications may not work well with THP due to memory allocation assumptions.
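One middle-ground option (a sketch, not what we run in PROD) is to switch THP to madvise mode, so that only applications explicitly requesting huge pages via madvise(MADV_HUGEPAGE) get them:
~]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
~]# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never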
Disabling THP
This is how we’re disabling THP in PROD / RHEL 7.x:
Typically, folks disable it via GRUB:
GRUB_CMDLINE_LINUX="vconsole.keymap=us rd.lvm.lv=vg00/root rd.lvm.lv=vg00/swap crashkernel=auto rd.lvm.lv=vg00/usr vconsole.font=latarcyrheb-sun16 rhgb quiet net.ifnames=0 biosdevname=0 transparent_hugepage=never intel_idle.max_cstate=0 processor.max_cstate=1"
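The GRUB_CMDLINE_LINUX line above lives in /etc/default/grub on RHEL 7; after editing it, regenerate the GRUB configuration and reboot (the path shown is for BIOS systems; UEFI systems use /boot/efi/EFI/redhat/grub.cfg):
~]# grub2-mkconfig -o /boot/grub2/grub.cfg
~]# reboot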
We disable THP from /etc/rc.local… Execute chmod +x /etc/rc.d/rc.local to ensure that the script executes during boot:
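# Disable THP via both the older RHEL 6 path (redhat_transparent_hugepage)
# and the newer RHEL 6/7 path (transparent_hugepage), whichever exists: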
if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
fi
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi
… because when tuned is installed/enabled, it will auto-configure THP. We disable both the ktune and tuned services: the ktune service enables transparent hugepages (THP) by default for all profiles, and any time a tuned profile is enabled, whether at bootup via an init.d service or manually at the command line, transparent hugepages (THP) get re-enabled.
Note that if transparent_hugepage=never is appended to /boot/grub/grub.conf and the system is rebooted, the /sys/kernel/mm/transparent_hugepage/defrag option may still be enabled (e.g. set to always). However, this can safely be ignored, as THP is disabled and THP defrag will not come into effect.
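If tuned must remain enabled on a host, a possible alternative (a sketch, not our PROD setup; the profile name no-thp is made up) is a custom tuned profile that inherits an existing profile but keeps THP off:
~]# mkdir /etc/tuned/no-thp
~]# cat /etc/tuned/no-thp/tuned.conf
[main]
include=throughput-performance

[vm]
transparent_hugepages=never
~]# tuned-adm profile no-thp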
Verify (note the square brackets around 'never'):
Before:
~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
~]# grep AnonHugePages /proc/meminfo
AnonHugePages: 106496 kB
~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 56
thp_fault_alloc 121
thp_fault_fallback 0
thp_collapse_alloc 18
thp_collapse_alloc_failed 0
thp_split 3
thp_zero_page_alloc 0
After:
~]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
~]# cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]
~]# grep AnonHugePages /proc/meminfo
AnonHugePages: 4096 kB
~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 2
thp_fault_alloc 39
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
Transparent Huge Pages are a performance optimization technique in Linux that can benefit applications working with large contiguous data. However, it’s crucial to consider potential drawbacks and manage THP appropriately to avoid negative impacts on other applications.