Wednesday, September 18, 2019

Kubernetes 1.16: Custom Resources, Overhauled Metrics, and Volume Extensions

Authors: Kubernetes 1.16 Release Team

We’re pleased to announce the delivery of Kubernetes 1.16, our third release of 2019! Kubernetes 1.16 consists of 31 enhancements: 8 enhancements moving to stable, 8 enhancements in beta, and 15 enhancements in alpha.

Major Themes

Custom resources

CRDs are in widespread use as a Kubernetes extensibility mechanism and have been available in beta since the 1.7 release. The 1.16 release marks the graduation of CRDs to general availability (GA).

Overhauled metrics

Kubernetes has previously made extensive use of a global metrics registry to register metrics to be exposed. By implementing a metrics registry, metrics are registered in a more transparent manner. Previously, Kubernetes metrics have been excluded from any kind of stability requirements.

Volume Extension

There are quite a few enhancements in this release that pertain to volumes and volume modifications. Volume resizing support in CSI specs is moving to beta, which allows any CSI spec volume plugin to be resizable.

Additional Enhancements

Custom Resources Reach General Availability

CRDs have become the basis for extensions in the Kubernetes ecosystem. Started as a ground-up redesign of the ThirdPartyResources prototype, they have finally reached GA in 1.16 with apiextensions.k8s.io/v1, as the hard-won lessons of API evolution in Kubernetes have been integrated. As we transition to GA, the focus is on data consistency for API clients.

As you upgrade to the GA API, you’ll notice that several of the previously optional guard rails have become required and/or default behavior. Things like structural schemas, pruning unknown fields, validation, and protecting the *.k8s.io group are important for ensuring the longevity of your APIs and are now much harder to accidentally miss. Defaulting is another important part of API evolution and that support will be on by default for CRD.v1. The combination of these, along with CRD conversion mechanisms, is enough to build stable APIs that evolve over time, the same way that native Kubernetes resources have changed without breaking backward compatibility.
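As a rough illustration of two of these guard rails, the sketch below applies pruning and defaulting to a custom object against a structural schema. It is a toy model, not the API server's implementation: the function name `prune_and_default` and the `image`/`replicas` fields are hypothetical, and the schema only mirrors the general shape of an OpenAPI v3 schema.

```python
import copy

# Hypothetical structural schema for the "spec" of a custom resource.
# The shape loosely mirrors an OpenAPI v3 schema with typed properties.
SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "image": {"type": "string"},
        "replicas": {"type": "integer", "default": 1},
    },
}


def prune_and_default(obj, schema):
    """Return a copy of obj without unknown fields and with defaults applied."""
    if schema.get("type") != "object" or not isinstance(obj, dict):
        return copy.deepcopy(obj)
    result = {}
    for name, sub_schema in schema.get("properties", {}).items():
        if name in obj:
            # Known field: keep it, recursing into nested objects.
            result[name] = prune_and_default(obj[name], sub_schema)
        elif "default" in sub_schema:
            # Missing field with a declared default: fill it in.
            result[name] = copy.deepcopy(sub_schema["default"])
    # Fields not named in the schema are never copied, so they are pruned.
    return result


if __name__ == "__main__":
    submitted = {"image": "nginx", "replica": 3}  # note the misspelled field
    print(prune_and_default(submitted, SPEC_SCHEMA))
```

In this toy run the misspelled `replica` field is dropped and `replicas` falls back to its declared default, which is the kind of data consistency the GA API aims for.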

Updates to the CRD API won’t end here. We have ideas for features like arbitrary subresources, API group migration, and maybe a more efficient serialization protocol, but the changes from here are expected to be optional and complementary in nature to what’s already here in the GA API. Happy operator writing!

Details on how to work with custom resources can be found in the Kubernetes documentation.

Opening Doors With Windows Enhancements

Beta: Enhancing the workload identity options for Windows containers

Active Directory Group Managed Service Account (GMSA) support is graduating to beta and certain annotations that were introduced with the alpha support are being deprecated. GMSA is a specific type of Active Directory account that enables Windows containers to carry an identity across the network and communicate with other resources. Windows containers can now gain authenticated access to external resources. In addition, GMSA provides automatic password management, simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers.

Adding support for RunAsUserName as an alpha release. RunAsUserName is a string specifying the Windows identity (or username) used to run the entrypoint of the container, and is part of the newly introduced windowsOptions component of the securityContext (WindowsSecurityContextOptions).

Alpha: Improvements to setup & node join experience with kubeadm

Introducing alpha support for kubeadm, enabling Kubernetes users to easily join (and reset) Windows worker nodes to an existing cluster the same way they do for Linux nodes. Users can utilize kubeadm to prepare and add a Windows node to a cluster. When the operations are complete, the node will be in a Ready state and able to run Windows containers. In addition, we will also provide a set of Windows-specific scripts to enable the installation of prerequisites and CNIs ahead of joining the node to the cluster.

Alpha: Introducing support for Container Storage Interface (CSI)

Introducing CSI plugin support for out-of-tree providers, enabling Windows nodes in a Kubernetes cluster to leverage persistent storage capabilities for Windows-based workloads. This significantly expands the storage options of Windows workloads, adding onto a list that included FlexVolume and in-tree storage plugins. This capability is achieved through a host OS proxy that enables the execution of privileged operations on the Windows node on behalf of containers.

Introducing Endpoint Slices

The release of Kubernetes 1.16 includes an exciting new alpha feature: Endpoint Slices. These provide a scalable and extensible alternative to Endpoints resources. Behind the scenes, these resources play a big role in network routing within Kubernetes. Each network endpoint is tracked within these resources, and kube-proxy uses them for generating proxy rules that allow pods to communicate with each other so easily in Kubernetes.

Providing Greater Scalability

A key goal for Endpoint Slices is to enable greater scalability for Kubernetes Services. With the existing Endpoints resources, a single resource must include network endpoints representing all pods matching a Service. As Services start to scale to thousands of pods, the corresponding Endpoints resources become quite large. Simply adding or removing one endpoint from a Service at this scale can be quite costly. As the Endpoints resource is updated, every piece of code watching Endpoints will need to be sent a full copy of the resource. With kube-proxy running on every node in a cluster, a copy needs to be sent to every single node. At a small scale, this is not an issue, but it becomes increasingly noticeable as clusters get larger.

As a simple example, in a cluster with 5,000 nodes and a 1MB Endpoints object, any update would result in approximately 5GB transmitted (that’s enough to fill a DVD). This becomes increasingly significant given how frequently Endpoints can change during events like rolling updates on Deployments.

With Endpoint Slices, network endpoints for a Service are split into multiple resources, significantly decreasing the amount of data required for updates at scale. By default, Endpoint Slices are limited to 100 endpoints each.

For example, let’s take a cluster with 20,000 network endpoints spread over 5,000 nodes. Updating a single endpoint will be much more efficient with Endpoint Slices since each one includes only a tiny portion of the total number of network endpoints. Instead of transferring a big Endpoints object to each node, only the small Endpoint Slice that’s been changed has to be transferred. The net effect is that approximately 200x less data needs to be transferred for this update.

	Endpoints	Endpoint Slices
# of resources	1	20k / 100 = 200
# of network endpoints stored	1 * 20k = 20k	200 * 100 = 20k
size of each resource	20k * const = ~2.0 MB	100 * const = ~10 kB
watch event data transferred	~2.0 MB * 5k = 10 GB	~10 kB * 5k = 50 MB
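
The arithmetic in the table can be reproduced with a short sketch. The per-endpoint size of 100 bytes is an assumption implied by the table (20k endpoints in roughly 2.0 MB), and the function name is hypothetical; real object sizes will vary.

```python
import math

BYTES_PER_ENDPOINT = 100   # assumption: ~2.0 MB / 20k endpoints, per the table
MAX_PER_SLICE = 100        # default Endpoint Slice limit stated in the text


def watch_bytes_per_update(total_endpoints, nodes, sliced):
    """Bytes sent to all nodes when a single endpoint of a Service changes."""
    if sliced:
        # Only the one slice that contains the changed endpoint is resent.
        endpoints_in_resource = min(total_endpoints, MAX_PER_SLICE)
    else:
        # The whole Endpoints object is resent.
        endpoints_in_resource = total_endpoints
    return endpoints_in_resource * BYTES_PER_ENDPOINT * nodes


if __name__ == "__main__":
    endpoints, nodes = 20_000, 5_000
    print("slices needed:", math.ceil(endpoints / MAX_PER_SLICE))
    full = watch_bytes_per_update(endpoints, nodes, sliced=False)
    part = watch_bytes_per_update(endpoints, nodes, sliced=True)
    print(f"Endpoints:       {full / 1e9:.0f} GB")
    print(f"Endpoint Slices: {part / 1e6:.0f} MB")
    print(f"reduction:       {full // part}x")
```

Under these assumptions the sketch arrives at the same 10 GB versus 50 MB comparison, a factor of 200.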

The second primary goal for Endpoint Slices was to provide a resource that would be highly extensible and useful across a wide variety of use cases. One of the key additions with Endpoint Slices involves a new topology attribute. By default, this will be populated with the existing topology labels used throughout Kubernetes indicating attributes such as region and zone. Of course, this field can be populated with custom labels as well for more specialized use cases.

Endpoint Slices also include greater flexibility for address types. Each contains a list of addresses. An initial use case for multiple addresses would be to support dual stack endpoints with both IPv4 and IPv6 addresses.

The Kubernetes documentation has a lot more information about Endpoint Slices. There’s also a great KubeCon talk that provides more information on the initial rationale for developing Endpoint Slices. As an alpha feature in Kubernetes 1.16, they will not be enabled by default, but the docs cover how to enable them in your cluster.

Notable Feature Updates

Topology Manager, a new Kubelet component, aims to co-ordinate resource assignment decisions to provide optimized resource allocations.
IPv4/IPv6 dual-stack enables the allocation of both IPv4 and IPv6 addresses to Pods and Services.
Extensions for Cloud Controller Manager Migration.
Continued deprecation of extensions/v1beta1, apps/v1beta1, and apps/v1beta2 APIs. These extensions are now retired in 1.16!

Availability

Kubernetes 1.16 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials. You can also easily install 1.16 using kubeadm.

Release Team

This release is made possible through the efforts of hundreds of individuals who contributed both technical and non-technical content. Special thanks to the release team led by Lachlan Evenson, Principal Program Manager at Microsoft. The 32 individuals on the release team coordinated many aspects of the release, from documentation to testing, validation, and feature completeness.

As the Kubernetes community has grown, our release process represents an amazing demonstration of collaboration in open source software development. Kubernetes continues to gain new users at a rapid pace. This growth creates a positive feedback cycle where more contributors commit code, creating a more vibrant ecosystem. Kubernetes has had over 32,000 individual contributors to date and an active community of more than 66,000 people.

Release Mascot

The Kubernetes 1.16 release crest was loosely inspired by the Apollo 16 mission crest. It represents the hard work of the release-team and the community alike and is an ode to the challenges and fun times we shared as a team throughout the release cycle. Many thanks to Ronan Flynn-Curran of Microsoft for creating this magnificent piece.

Kubernetes 1.16 Release Mascot

Kubernetes Updates

Project Velocity

The CNCF has continued refining DevStats, an ambitious project to visualize the myriad contributions that go into the project. K8s DevStats illustrates the breakdown of contributions from major company contributors, as well as an impressive set of preconfigured reports on everything from individual contributors to pull request lifecycle times. This past year, 1,147 different companies and over 3,149 individuals contributed to Kubernetes each month. Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Ecosystem

The Kubernetes project leadership created the Security Audit Working Group to oversee the very first third-party Kubernetes security audit, in an effort to improve the overall security of the ecosystem.
The Kubernetes Certified Service Providers program (KCSP) reached 100 member companies, ranging from the largest multinational cloud, enterprise software, and consulting companies to tiny startups.
The first Kubernetes Project Journey Report was released, showcasing the massive growth of the project.

KubeCon + CloudNativeCon

The Cloud Native Computing Foundation’s flagship conference gathers adopters and technologists from leading open source and cloud native communities in San Diego, California from November 18-21, 2019. Join Kubernetes, Prometheus, Envoy, CoreDNS, containerd, Fluentd, OpenTracing, gRPC, CNI, Jaeger, Notary, TUF, Vitess, NATS, Linkerd, Helm, Rook, Harbor, etcd, Open Policy Agent, CRI-O, and TiKV as the community gathers for four days to further the education and advancement of cloud native computing. Register today!

Webinar

Join members of the Kubernetes 1.16 release team on Oct 22, 2019 to learn about the major features in this release. Register here.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Follow us on Twitter @Kubernetesio for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Stack Overflow
Share your Kubernetes story

2018.07.09

IPVS-Based In-Cluster Load Balancing Deep Dive

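Authors: Jun Du (Huawei), Haibin Xie (Huawei), Wei Liang (Huawei)

Note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.11

Introduction

Per the Kubernetes 1.11 release blog post, we announced that IPVS-based in-cluster service load balancing has graduated to general availability. In this blog, we will take you through a deep dive of the feature.

What Is IPVS?

IPVS (IP Virtual Server) is built on top of Netfilter and implements transport-layer load balancing as part of the Linux kernel.

IPVS is incorporated into LVS (Linux Virtual Server), where it runs on a host and acts as a load balancer in front of a cluster of real servers. IPVS can direct requests for TCP- and UDP-based services to the real servers, and make the services of the real servers appear as virtual services on a single IP address. Therefore, IPVS naturally supports Kubernetes Services.

Why IPVS for Kubernetes?

As Kubernetes grows in usage, the scalability of its resources becomes more and more important. In particular, the scalability of Services is paramount to the adoption of Kubernetes by developers and companies running large workloads.

Kube-proxy, the building block of service routing, has relied on the battle-hardened iptables to implement the core supported Service types such as ClusterIP and NodePort. However, iptables struggles to scale to tens of thousands of Services because it is designed purely for firewalling purposes and is based on in-kernel rule lists.

Even though Kubernetes already supports 5000 nodes as of release v1.6, kube-proxy with iptables is actually a bottleneck when scaling a cluster to 5000 nodes. One example is a 5000-node cluster using NodePort Services: if we have 2000 Services and each Service has 10 pods, this will cause at least 20000 iptables records on each worker node, which can make the kernel pretty busy.

On the other hand, using IPVS-based in-cluster service load balancing can help a lot in such cases. IPVS is specifically designed for load balancing and uses more efficient data structures (hash tables), allowing for almost unlimited scale.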
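
To make the scaling difference concrete, here is a minimal sketch that contrasts a sequential rule list with a hash-table lookup. It only counts comparisons in a toy model and does not reflect how the kernel implements iptables or IPVS; the function names and all addresses are made up for illustration.

```python
def build_rules(services, pods_per_service):
    """Create one ((virtual_ip, port), backend) record per pod."""
    rules = []
    for s in range(services):
        vip = f"10.96.{s // 256}.{s % 256}"
        for p in range(pods_per_service):
            rules.append(((vip, 80), f"10.244.{s % 256}.{p}:8080"))
    return rules


def linear_lookup(rules, key):
    """Scan the list in order; return (backend, comparisons made)."""
    for i, (rule_key, backend) in enumerate(rules, start=1):
        if rule_key == key:
            return backend, i
    return None, len(rules)


def hashed_lookup(table, key):
    """Look the key up in a dict; counted here as a single probe."""
    backends = table.get(key)
    return (backends[0] if backends else None), 1


if __name__ == "__main__":
    # The example from the text: 2000 Services with 10 pods each.
    rules = build_rules(services=2000, pods_per_service=10)
    table = {}
    for key, backend in rules:
        table.setdefault(key, []).append(backend)

    last_key = rules[-1][0]
    print("records:", len(rules))
    print("list comparisons:", linear_lookup(rules, last_key)[1])
    print("hash probes:", hashed_lookup(table, last_key)[1])
```

The point of the sketch is only that the cost of a list scan grows with the number of records, while a hash lookup stays roughly constant.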

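IPVS-Based Kube-proxy

Parameter Changes

Parameter: --proxy-mode

In addition to the existing userspace and iptables modes, IPVS mode is configured via --proxy-mode=ipvs. It implicitly uses IPVS NAT mode for service port mapping.

Parameter: --ipvs-scheduler

A new kube-proxy parameter, --ipvs-scheduler, has been added to specify the IPVS load balancing algorithm. If it is not configured, round-robin (rr) is the default.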

rr: round-robin
lc: least connection
dh: destination hashing
sh: source hashing
sed: shortest expected delay
nq: never queue
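
As a rough intuition for how three of these algorithms choose a backend, the sketch below gives toy Python versions of rr, lc, and sh. These are not the kernel’s implementations: the function names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash IPVS actually uses.

```python
import itertools
import zlib


def round_robin(backends):
    """rr: hand out backends in a repeating cycle."""
    return itertools.cycle(backends)


def least_connection(active_conns):
    """lc: pick the backend with the fewest active connections."""
    return min(active_conns, key=active_conns.get)


def source_hash(backends, client_ip):
    """sh: the same client IP always maps to the same backend."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]


if __name__ == "__main__":
    backends = ["10.244.0.235:8080", "10.244.1.237:8080"]
    rr = round_robin(backends)
    print([next(rr) for _ in range(4)])
    print(least_connection({backends[0]: 3, backends[1]: 1}))
    print(source_hash(backends, "192.0.2.7"))
```
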

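In the future, we can implement a Service-specific scheduler (potentially via annotation), which has higher priority and overrides this value.

Parameter: --cleanup-ipvs

Similar to the --cleanup-iptables parameter: if true, clean up the IPVS configuration and iptables rules that were created in IPVS mode.

Parameter: --ipvs-sync-period

The maximum interval at which IPVS rules are refreshed (e.g. ‘5s’, ‘1m’). Must be greater than 0.

Parameter: --ipvs-min-sync-period

The minimum interval at which IPVS rules are refreshed (e.g. ‘5s’, ‘1m’). Must be greater than 0.

Parameter: --ipvs-exclude-cidrs

A comma-separated list of CIDRs that the IPVS proxier should not touch when cleaning up IPVS rules, because the IPVS proxier cannot distinguish IPVS rules created by kube-proxy from the user’s original IPVS rules. If you are using the IPVS proxier alongside your own IPVS rules in the environment, this parameter should be specified; otherwise your original rules will be cleaned up.

Design Considerations

IPVS Service Network Topology

When creating a ClusterIP type Service, the IPVS proxier will do the following three things:

Make sure a dummy interface exists on the node (kube-ipvs0 by default)
Bind the Service IP addresses to the dummy interface
Create an IPVS virtual server for each Service IP address

Here is an example: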

# kubectl describe svc nginx-service
Name:           nginx-service
...
Type:           ClusterIP
IP:             10.102.128.4
Port:           http    3080/TCP
Endpoints:      10.244.0.235:8080,10.244.1.237:8080
Session Affinity:   None

# ip addr
...
73: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 1a:ce:f5:5f:c1:4d brd ff:ff:ff:ff:ff:ff
    inet 10.102.128.4/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn     
TCP  10.102.128.4:3080 rr
  -> 10.244.0.235:8080            Masq    1      0          0         
  -> 10.244.1.237:8080            Masq    1      0          0

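Please note that the relationship between a Kubernetes Service and IPVS virtual servers is 1:N. For example, consider a Kubernetes Service that has more than one IP address. An External IP type Service has two IP addresses: the cluster IP and the external IP. The IPVS proxier will then create 2 IPVS virtual servers, one for the cluster IP and another for the external IP. The relationship between a Kubernetes Endpoint (each IP + port pair) and an IPVS virtual server is 1:1.

Deleting a Kubernetes Service will trigger deletion of the corresponding IPVS virtual server, the IPVS real servers, and the IP addresses bound to the dummy interface.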
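
The 1:N relationship between a Service and its virtual servers can be sketched as follows. This is a simplified model, not kube-proxy code: the function name `virtual_servers` is hypothetical, the dict keys borrow the Service spec field names, and the external IP is a made-up documentation address.

```python
def virtual_servers(service):
    """List the (ip, port, protocol) virtual servers for a Service dict."""
    # One virtual server per Service IP address and per Service port.
    ips = [service["clusterIP"], *service.get("externalIPs", [])]
    return [
        (ip, port["port"], port.get("protocol", "TCP"))
        for ip in ips
        for port in service["ports"]
    ]


if __name__ == "__main__":
    svc = {
        "clusterIP": "10.102.128.4",
        "externalIPs": ["192.0.2.10"],  # made-up address for illustration
        "ports": [{"port": 3080, "protocol": "TCP"}],
    }
    for vs in virtual_servers(svc):
        print(vs)
```

For the Service above the sketch yields two virtual servers, one for the cluster IP and one for the external IP.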

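Port Mapping

There are three proxy modes in IPVS: NAT (masq), IPIP and DR. Only NAT mode supports port mapping. Kube-proxy leverages NAT mode for port mapping. The following example shows IPVS mapping Service port 3080 to Pod port 8080.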

TCP  10.102.128.4:3080 rr
  -> 10.244.0.235:8080            Masq    1      0          0         
  -> 10.244.1.237:8080            Masq    1      0          0

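Session Affinity

IPVS supports client IP session affinity (persistent connection). When a Service specifies session affinity, the IPVS proxier will set a timeout value (180 min = 10800 s by default) on the IPVS virtual server. For example: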

# kubectl describe svc nginx-service
Name:           nginx-service
...
IP:             10.102.128.4
Port:           http    3080/TCP
Session Affinity:   ClientIP

# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.102.128.4:3080 rr persistent 10800

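Iptables and Ipset in the IPVS Proxier

IPVS is for load balancing and cannot handle the other concerns in kube-proxy, e.g. packet filtering, hairpin masquerade tricks, SNAT, etc.

The IPVS proxier leverages iptables in the above scenarios. Specifically, the IPVS proxier will fall back on iptables in the following 4 scenarios:

kube-proxy is started with --masquerade-all=true
A cluster CIDR is specified at kube-proxy startup
Supporting LoadBalancer type Services
Supporting NodePort type Services

However, we don’t want to create too many iptables rules, so we adopt ipset to reduce the number of iptables rules. The following is the table of ipset sets that the IPVS proxier maintains:

Set name	Members	Usage
KUBE-CLUSTER-IP	All Service IP + port	Masquerade for cases where masquerade-all=true or clusterCIDR is specified
KUBE-LOOP-BACK	All Service IP + port + IP	Masquerade for resolving the hairpin issue
KUBE-EXTERNAL-IP	Service external IP + port	Masquerade for packets to external IPs
KUBE-LOAD-BALANCER	Load balancer ingress IP + port	Masquerade for packets to LoadBalancer type Services
KUBE-LOAD-BALANCER-LOCAL	Load balancer ingress IP + port with externalTrafficPolicy=local	Accept packets to load balancers with externalTrafficPolicy=local
KUBE-LOAD-BALANCER-FW	Load balancer ingress IP + port with loadBalancerSourceRanges	Drop packets for LoadBalancer type Services with loadBalancerSourceRanges specified
KUBE-LOAD-BALANCER-SOURCE-CIDR	Load balancer ingress IP + port + source CIDR	Accept packets for LoadBalancer type Services with loadBalancerSourceRanges specified
KUBE-NODE-PORT-TCP	NodePort type Service TCP port	Masquerade for packets to NodePort (TCP)
KUBE-NODE-PORT-LOCAL-TCP	NodePort type Service TCP port with externalTrafficPolicy=local	Accept packets to NodePort Services with externalTrafficPolicy=local
KUBE-NODE-PORT-UDP	NodePort type Service UDP port	Masquerade for packets to NodePort (UDP)
KUBE-NODE-PORT-LOCAL-UDP	NodePort type Service UDP port with externalTrafficPolicy=local	Accept packets to NodePort Services with externalTrafficPolicy=local

In general, for the IPVS proxier, the number of iptables rules is static, no matter how many Services or Pods we have.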

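Run kube-proxy in IPVS Mode

Currently, local-up scripts, GCE scripts and kubeadm support switching to IPVS proxy mode by exporting an environment variable (KUBE_PROXY_MODE=ipvs) or specifying a flag (--proxy-mode=ipvs). Before running the IPVS proxier, please ensure that the kernel modules required by IPVS are installed.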

ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4
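
One way to check for these modules on a node is to look at /proc/modules, as in the minimal sketch below. The helper names are hypothetical, and the check is only a hint: modules compiled directly into the kernel do not appear in /proc/modules, so a module reported as missing here may still be available.

```python
REQUIRED_MODULES = ["ip_vs", "ip_vs_rr", "ip_vs_wrr", "ip_vs_sh", "nf_conntrack_ipv4"]


def loaded_modules(proc_modules_text):
    """Parse /proc/modules content; the first field of each line is the name."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not listed as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    missing = missing_modules(text)
    print("missing:", ", ".join(missing) if missing else "none")
```
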

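Finally, for Kubernetes v1.10, the feature gate SupportIPVSProxyMode is set to true by default. For Kubernetes v1.11, the feature gate has been removed entirely. However, for Kubernetes versions before v1.10 you need to enable it explicitly with --feature-gates=SupportIPVSProxyMode=true.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below.

Thank you for your continued feedback and support.

Post questions (or answer questions) on Stack Overflow
Join the community portal for advocates on K8sPort
Follow us on Twitter @Kubernetesio for the latest updates
Chat with the community on Slack
Share your Kubernetes story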
