Minik8s is a container orchestration tool in the style of Kubernetes. It manages CRI-compatible containers across multiple machines and supports basic features such as container lifecycle management, dynamic and automatic scaling, and an integrated Serverless platform.
- flannel: Unified network abstraction for containers across multiple machines
- docker: Container management
- etcd: Storage
- ipvsadm: Underlying NAT implementation for services
- iproute2: Creation of virtual IPs
- cadvisor: Monitoring container status
# Run on all machines
sysctl -w net.ipv4.ip_forward=1
# Run on all machines
ip l a minik8s-proxy0 type dummy
ip l s minik8s-proxy0 up
Since both flannel and minik8s require etcd for storage, it is recommended to use separate etcd instances for each.
Taking a three-node setup as an example, suppose we have node1 (192.168.1.1/24), node2 (192.168.1.2/24), and node3 (192.168.1.3/24). Here, we'll configure the etcd for flannel storage on node1.
node1:
# Since etcd and flannel are actually run as services, you might need to use a program like tmux/screen for hosting
# Start etcd
etcd --listen-client-urls="http://192.168.1.1:2379" --advertise-client-urls="http://192.168.1.1:2379"
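flannel reads its network configuration from etcd under the prefix given by `--etcd-prefix`, so that key must exist before flanneld is started on the nodes below. A minimal example, assuming a flannel version that reads its configuration through the etcd v3 API (as the appendix of this document does) and the `10.5.0.0/16` pod network used there:

```sh
# Run once on node1, after etcd is up
ETCDCTL_API=3 etcdctl --endpoints=http://192.168.1.1:2379 \
  put /coreos.com/network/config '{ "Network": "10.5.0.0/16", "Backend": {"Type": "vxlan"} }'
```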
node1, node2, node3:
# Start flannel on each node (point --iface at that node's own IP, e.g. 192.168.1.2 on node2)
flannel --etcd-endpoints=http://192.168.1.1:2379 --iface=<node IP> --ip-masq=true --etcd-prefix=/coreos.com/network
At the same time, we need to make Docker use the network environment set up by flannel.
# Modify systemd docker parameters
# The modification should look like this
# cat /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=/run/docker_opts.env
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock $DOCKER_OPTS
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3
# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
OOMScoreAdjust=-500
[Install]
WantedBy=multi-user.target
# create docker opts file
source /run/flannel/subnet.env
echo "DOCKER_OPTS=\" --bip=$FLANNEL_SUBNET --ip-masq=$FLANNEL_IPMASQ --mtu=$FLANNEL_MTU\"" > /run/docker_opts.env
node2 will serve as the Master node of the cluster.
Assuming the project is cloned to /root/minik8s and the cadvisor binary is also located in /root/minik8s.
node2:
export API_SERVER=192.168.1.2
export PORT=8080
export NODE_CONFIG=/root/minik8s/examples/dash/node/worker2.yaml
make clean && make
# start cadvisor to use hpa
./cadvisor -port=8090 &>>/var/log/cadvisor.log
NODE_CONFIG=/root/minik8s/examples/dash/node/master.yaml; ./build/master
./build/kube-proxy
./build/kubelet
node1:
export API_SERVER=192.168.1.2
export PORT=8080
export NODE_CONFIG=/root/minik8s/examples/dash/node/worker1.yaml
# start cadvisor to use hpa
./cadvisor -port=8090 &>>/var/log/cadvisor.log
./build/kube-proxy
./build/kubelet
node3:
export API_SERVER=192.168.1.2
export PORT=8080
export NODE_CONFIG=/root/minik8s/examples/dash/node/worker3.yaml
# start cadvisor to use hpa
./cadvisor -port=8090 &>>/var/log/cadvisor.log
./build/kube-proxy
./build/kubelet
Kubectl is a management tool used for executing commands in Minik8s clusters. This section outlines the syntax of Kubectl, descriptions of command operations, and lists common examples. The Kubectl tool should be used on the physical node where the control plane is located.
Installation and Compilation
Compiling the entire project will also compile the Kubectl command-line tool:
make clean && make
Or compile separately:
go build minik8s/cmd/kubectl
Basic Syntax
Navigate to the /minik8s project directory and open a bash shell.
If compiled through the make script:
./build/kubectl [command] [TYPE] [NAME] [flags]
If compiled separately:
./kubectl [command] [TYPE] [NAME] [flags]
- `command`: Specifies the operation to perform on one or more resources, such as `create`, `get`, `describe`, `delete`.
- `TYPE`: Specifies the resource type. It is case-sensitive and can be given in singular, plural, or abbreviated form.
  - For example, the following commands produce the same result: `$ kubectl get pods d022d439-fc71-4bd7-820e-f1cf21f9567a` and `$ kubectl get pod d022d439-fc71-4bd7-820e-f1cf21f9567a`.
- `NAME`: Specifies the unique identifier of the resource. For Func resources it refers to the function's Name; for all other resources, `NAME` refers to the `UID` returned after creating the resource. If NAME is omitted, information about all resources of that type is displayed, for example `$ kubectl get pods`.
- `flags`: Specifies optional flags.
Operations
The following table includes brief descriptions and general syntax for all kubectl operations:
Operation | Syntax | Description |
---|---|---|
apply | kubectl apply [TYPE] -f FILENAME [flags] | Create resources from a file |
create | kubectl create [TYPE] -f FILENAME [flags] | Create resources from a file |
delete | kubectl del [TYPE] [NAME] [flags] | Delete resources |
describe | kubectl describe [TYPE] ([NAME]) [flags] | Show detailed status of one or all resources |
get | kubectl get [TYPE] ([NAME]) [flags] | List brief status of one or all resources |
update | kubectl update [TYPE] ([NAME]) -f FILENAME [flags] | Change resources from a file |
clear | kubectl clear | Clear all existing resources |
help | kubectl --help/-h | Help information |
`FILENAME`: The configuration file, in `yaml` or `json` format.
Examples
# Create a Pod based on the network-test.yaml configuration file
kubectl create pod -f examples/dash/pod/network-test.yaml
# Get brief information on all existing Pods
kubectl get pod
# Get brief information on the Pod with UID d022d439-fc71-4bd7-820e-f1cf21f9567a
kubectl get pod d022d439-fc71-4bd7-820e-f1cf21f9567a
# Get detailed information on all existing Pods
kubectl describe pod
# Delete the Pod with UID d022d439-fc71-4bd7-820e-f1cf21f9567a
kubectl del pod d022d439-fc71-4bd7-820e-f1cf21f9567a
Minik8s's architecture largely follows the best practices architecture provided in the course. It is primarily divided into two parts: the Control Plane (Master) and the Worker Nodes (Worker).
Core Components
- Control Plane (Master)
- ApiServer: Interacts with various components, persisting API objects into etcd.
- Scheduler: Responsible for scheduling newly created Pods.
- ControllerManager: Manages various Controllers.
- ReplicaSetController: Implements and manages ReplicaSets.
- HorizontalController: Implements and manages Horizontal Pod Autoscaling (HPA).
- DnsController: Implements and manages DNS.
- ServerlessController: Responsible for Serverless function calls and instance management.
- PodController: Manages Pod lifecycle and implements the Pod RestartPolicy.
- GpuServer: Manages GPU Jobs.
- Worker Nodes (Worker)
- Kubelet: Manages the lifecycle of Pods on each node.
- Kubeproxy: Configures node networking, implementing unified network abstraction.
- Other Components
- Kubectl: Command-line tool for interacting with the Control Plane.
- ApiClient: Client capable of communicating with the ApiServer.
Component Overview
Control Plane (Master)
- ApiServer: Exposes APIs, handling HTTP requests from users/components and persisting API objects into etcd.
- HttpServer: Receives HTTP requests from users/components.
- Handlers: Calls EtcdClient to handle CRUD operations on API objects.
- ServerlessFuncHandler: Forwards user function call requests to specific running instances, and returns results; implements recursive calls for Serverless functions.
- EtcdClient: Direct client for etcd, handling CRUD operations and offering a watch mechanism.
- Scheduler: Responsible for scheduling new Pods.
- ControllerManager: Manages Controllers, ensuring the cluster's actual state matches the desired state.
- Controller Basic Components
- Informer: Local data cache for Controllers, caching Object data and listening for updates.
- Reflector: Listens for Object updates.
- ThreadSafeStore: Stores Objects, ensuring thread safety.
- Workqueue: Contains events for Object changes, allowing Controllers to process objects.
- ReplicaSetController: Manages ReplicaSets, ensuring the desired number of Pods are running.
- HorizontalController: Manages HPA, making decisions based on resource usage on nodes.
- MetricsClient: Aggregates resource usage (e.g., for a class of Pods).
- CadvisorClient: Interacts with cadvisor on nodes to obtain resource usage data.
- DnsController: Manages DNS and HTTP request forwarding.
- ServerlessController: Manages the lifecycle of Serverless function instances.
- GpuServer: Submits GPU tasks to the cloud platform, generates scripts based on configurations, and downloads results.
- JobClient: Interacts with the cloud platform via ssh.
Worker Nodes (Worker)
- HeartbeatSender: Sends heartbeat from Worker Nodes to the Master, indicating healthy status.
- Kubelet: Manages Pods on each Node, interacting with the Master.
- PodManager
- CriClient
- Kubeproxy: Manages node networking, particularly container networking and Node interconnections, implementing and managing Services.
- ServiceManager: Implements and manages Services.
- IpvsClient
Other Components
- Kubectl: Command-line tool for interacting with the Control Plane.
- NodeManager: Manages initialization and deletion of master and worker nodes.
- ApiClient: Communicates with the ApiServer.
- RESTClient: REST Client for interacting with the ApiServer.
- ListerWatcher: Specializes in List and Watch operations.
- Logger: Manages and prints logs.
- Control Plane (Master)
- ApiServer
- ControllerManager
- HorizontalController
- cadvisor: https://github.com/google/cadvisor
- DnsController
- ServerlessController
- GpuServer
- Worker Nodes (Worker)
- Kubelet
- docker: https://github.com/moby/moby
- Kubeproxy
- Other Components
- Kubectl
- cobra: https://github.com/spf13/cobra
- viper: https://github.com/spf13/viper
- yaml: https://gopkg.in/yaml.v3
Main programming languages: `go 1.18`, `python`, `shell`.
The project adopts the git branching model proposed by Vincent Driessen. The main branches include:
- `master`: Used for official, stable versions available to users. All version releases and tagging are performed here. Developers may not push directly; only merges from `develop` are allowed.
- `develop`: The main branch for daily development. Developers check out `feat` and `fix` branches from it. After development, they submit pull requests, which are merged back into `develop` after peer review. Direct pushes are not allowed; changes enter only through pull requests after a feature or bug fix is complete.
- `feat`: Checked out from `develop` for new feature development. After development and testing, these branches are merged into `develop` through pull requests. Daily pushes by developers are allowed.
  - Named `feat/component/detail`, e.g., `feat/apiserver/handlers` for a branch developing the handlers functionality of the ApiServer component.
- `fix`: Checked out from `develop` for bug fixes (bugs identified during the `feat` process are resolved on the spot). After fixing and testing, these branches are merged into `develop`. Daily pushes by developers are allowed.
  - Named `fix/component/detail`, e.g., `fix/etcd/endpoint_config` for fixing endpoint configuration issues in etcd.
Branch Overview
Commit messages follow the format `<type>: <body>`. Types include:
- `feat` for new features.
- `fix` for bug fixes (mention the corresponding Issue ID in `<body>`).
- `test` for changes related to testing.
- `doc` for changes to comments/documentation.
- `refactor` for refactoring (without adding new features or fixing bugs).
Commit messages are automatically checked against these standards by the `.githooks/commit-msg` script.
When developing new features, team members discuss them in meetings. The member responsible for the feature creates a `feat` branch named `feat/component/detail` and carries out the development on that branch. After completion, the developer submits a pull request, which, after peer review by at least one other team member, can be merged into the `develop` branch.
Some Development Branch Flows
The project uses GitLab CI/CD. Pushing the repository to your GitLab instance enables the automated building and testing provided by GitLab (configuration file).
CI/CD is configured to run vet checks and unit tests, verify that the project builds, and provide automated build scripts.
Automated testing is conducted through test scripts and `go test`. It primarily consists of unit tests for small components (such as `client`) and functions (such as `ParseQuantity`). The goal is to verify that components work as required, quickly identify code errors, and assist development.
This section follows the basic conventions of Go testing: create a file named `*_test.go`, name the test functions in that file `TestXxx`, and run them with `go test [flags] [packages]`. A minimal illustration is shown after the note below.
- Note that some tests in the `*_test.go` files depend on the order of the functions, so you cannot run `go test` in parallel or reorder the test functions within a file.
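As an illustration only (not the project's actual test code), a `*_test.go` file in this style might look like the following; the `parseReplicas` helper is invented for the example:

```go
package util

import "testing"

// parseReplicas is a stand-in helper used only for this example.
func parseReplicas(s string) (int, bool) {
	n := 0
	for _, c := range s {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + int(c-'0')
	}
	return n, true
}

// TestParseReplicas is discovered and run by `go test ./...`.
func TestParseReplicas(t *testing.T) {
	if n, ok := parseReplicas("3"); !ok || n != 3 {
		t.Fatalf("expected 3, got %d (ok=%v)", n, ok)
	}
	if _, ok := parseReplicas("3x"); ok {
		t.Fatal("expected failure for non-numeric input")
	}
}
```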
Manual testing is performed with `postman` and the `kubectl` command-line tool, together with the `yaml/json` cases written in the `examples` folder. It covers integration testing and system testing of complex logic and component interactions.
The main reason for relying on manual testing was that the team was small and writing automated test scripts for every feature would have required substantial extra work. However, during development and before acceptance we carried out detailed and thorough manual testing of all requirements, including a range of possible boundary cases, to ensure the quality of the project code.
The table below lists all supported resource types (Kind) and their abbreviations. "Resource Kind" refers to the kind field in the corresponding resource YAML files; "Resource type" and "Abbreviated alias" correspond to the `TYPE` field in Kubectl commands:
Resource Kind | Resource type | Abbreviated alias |
---|---|---|
Node | nodes | node |
Pod | pods | pod |
ReplicaSet | replicasets | replicaset, rs |
Service | services | service, svc |
HorizontalPodAutoscaler | hpas | hpa |
Func | funcs | func, f |
Job | jobs | job, j |
DNS | dns | - |
Minik8s supports the Pod abstraction, enabling users to manage the lifecycle of pods, including starting and terminating them. If a container within a pod crashes or exits on its own, Minik8s restarts the pod. Users can obtain the status of pods with commands such as `kubectl get pod` and `kubectl describe pod`.
Minik8s allows specifying container commands, resource limits, and exposed ports for each container within a pod. Containers in the same pod can communicate with each other via `localhost` and can share files through volumes, enhancing flexibility and interoperability within the pod environment.
The pod abstraction can be defined using a YAML configuration file of type `Pod`. Here's an example:
apiVersion: v1
kind: Pod
metadata:
labels:
app: myapp
tier: frontend
name: succeed-failure
namespace: default
spec:
containers:
- image: lwsg/notice-server
imagePullPolicy: PullIfNotPresent
name: notice-server
ports:
- containerPort: 80
protocol: TCP
env:
- name: _NOTICE
value: 1
- image: "ubuntu:bionic"
imagePullPolicy: PullIfNotPresent
name: timer
command:
- sleep
args:
- 30s
resources:
limits:
cpu: 100m
memory: 200M
restartPolicy: RestartOnFailure
The example above shows a pod configuration with multiple containers, each with distinct settings. One container runs an Ubuntu image with a simple sleep command, demonstrating the ability to define specific commands and resource limits for each container.
Additionally, an example showcasing the configuration for shared file volumes within a pod is provided:
apiVersion: v1
kind: Pod
metadata:
labels:
app: myapp
tier: frontend
name: arg-volume-test
namespace: default
spec:
containers:
- image: lwsg/debug-server
imagePullPolicy: PullIfNotPresent
name: debug-server-write
volumeMounts:
- name: share
mountPath: "/share"
- image: lwsg/debug-server
imagePullPolicy: PullIfNotPresent
name: debug-server-read
volumeMounts:
- name: share
mountPath: "/share"
restartPolicy: Never
This portion of the example illustrates how to set up shared volumes in a pod, allowing different containers to access the same filesystem space. This feature is vital for scenarios where containers need to read and write shared data.
Minik8s supports inter-pod communication through the Container Network Interface (CNI). When a pod is launched, it is assigned a unique internal IP address. Pods can use these assigned IPs to communicate with other pods, whether they are located on the same node or different nodes. This functionality is essential for distributed applications that require seamless communication between different components.
Nodes in Minik8s are distinguished by their names, and it is crucial to ensure that different physical entities correspond to unique Node names within the system. This abstraction allows Minik8s to effectively manage resources and workloads across multiple nodes, providing a scalable and distributed environment for container orchestration.
The Node configuration in Minik8s requires that the `Name` field in a Node's config file be globally unique within the cluster. During the initialization of a Node, the system checks whether a Node with the same name already exists. If one does, the system further checks whether the config file is identical:
- If the config is identical, it reuses the existing Node instead of creating a new one.
- If the config differs, it reports an error to the user and exits. The user then needs to either modify the `Name` field in the config file or update the existing Node's configuration using a `put` request to apply the changes.
The Node abstraction can be specified through a YAML configuration file of type `Node`. An example is provided:
apiVersion: v1
kind: Node
metadata:
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: node1
kubernetes.io/os: linux
name: node1
spec:
podCIDR: 10.244.1.0/24
podCIDRs:
- 10.244.1.0/24
Starting the Kubelet on a node automatically registers that worker node with the cluster, and starting the master program on the Master node registers the Master node. Alternatively, a node can be registered manually (its status will be `Pending`); the status is updated to `Running` once the corresponding node's Kubelet starts.
A worker node sends a heartbeat to the master control plane to inform the control plane of the current state of the worker node. If the master control plane does not receive a heartbeat from a worker node for a period of time, it considers the node to be abnormal and deletes it.
Implementation
Once a worker node starts up, its Heartbeat Sender sends heartbeats to the Master node's Heartbeat Watcher. If the Heartbeat Watcher does not receive a heartbeat from a worker node for a period of time, it assumes that the node has failed and deletes its information from etcd.
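A minimal sketch of this mechanism in Go (the endpoint path, interval, and timeout below are illustrative assumptions, not the project's actual values):

```go
package heartbeat

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// SendHeartbeats posts the node name to the ApiServer at a fixed interval.
// The URL below is a hypothetical route used only for illustration.
func SendHeartbeats(apiServer, nodeName string) {
	for range time.Tick(5 * time.Second) {
		url := fmt.Sprintf("http://%s/api/heartbeat/%s", apiServer, nodeName)
		http.Post(url, "application/json", bytes.NewBufferString("{}"))
	}
}

// Watcher records the last heartbeat per node and evicts nodes that go silent.
type Watcher struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewWatcher() *Watcher {
	return &Watcher{lastSeen: make(map[string]time.Time)}
}

// Observe is called whenever a heartbeat arrives from a node.
func (w *Watcher) Observe(node string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastSeen[node] = time.Now()
}

// EvictStale deletes nodes whose last heartbeat is older than timeout,
// e.g. by removing the Node object from etcd through the ApiServer.
func (w *Watcher) EvictStale(timeout time.Duration, deleteNode func(string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for node, t := range w.lastSeen {
		if time.Since(t) > timeout {
			deleteNode(node)
			delete(w.lastSeen, node)
		}
	}
}
```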
The scheduler listens for Pod Create events, selects the physical node the Pod should run on according to the configured scheduling policy, and writes the chosen node's name into the Node name field of the Pod Spec via a Put request. The Kubelet on that physical node then observes the Pod Modify event and, on finding that the Pod has been newly scheduled to its own node, creates and runs it.
The scheduler supports three scheduling policies: `NodeAffinity`, `PodAntiAffinity`, and `Round Robin`:
- `NodeAffinity`: Pods can directly specify which Node they want to run on (by specifying the Node name in the yaml configuration file).
- `PodAntiAffinity`: Pods can specify that they should not run on the same Node as Pods carrying a certain label. The scheduler tries to satisfy the anti-affinity requirement as far as possible; if no current Node can satisfy it (e.g., every Node is already running Pods the new Pod cannot share a Node with), the configuration simply does not take effect.
- `Round Robin`: New Pods are dispatched to each Node in turn. Pods scheduled via `NodeAffinity` do not affect the RR queue, while Pods scheduled via `PodAntiAffinity` move the chosen node to the end of the RR queue.
Pod Anti-Affinity Configuration Example
apiVersion: v1
kind: Pod
metadata:
labels:
app: myapp
tier: frontend
scheduleAntiAffinity: large
name: myapp-schedule-large
namespace: default
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
scheduleAntiAffinity: tiny
containers:
- image: nginx
imagePullPolicy: Always
name: nginx
ports:
- containerPort: 80
protocol: TCP
resources: {}
restartPolicy: Always
Anti-affinity is configured through `requiredDuringSchedulingIgnoredDuringExecution` under `podAntiAffinity` in the `affinity` field. Specifically, the `labelSelector` indicates that the Pod should not be scheduled onto Nodes already running Pods whose labels match the given `matchLabels`.
- When scheduling, the scheduler first checks whether the newly created Pod specifies `NodeAffinity` (via the Node name field in the Pod's Spec). If so, the Pod is scheduled directly to that node; if not, the scheduler checks whether `PodAntiAffinity` is specified.
- If `PodAntiAffinity` is specified, scheduling is attempted using that policy; otherwise, the default `Round Robin` policy is used.
  - Under `PodAntiAffinity`, the new Pod's label selector is matched against the labels of the Pods already running on each Node to determine which Nodes the new Pod must not be dispatched to; if every Node is excluded, the anti-affinity configuration is ignored and `Round Robin` is used instead.
  - If scheduling via `PodAntiAffinity` succeeds, the chosen Node is moved to the end of the RR queue.
- The `Round Robin` policy is realized by maintaining a Node queue: each time, the first Node in the queue is taken and then placed at the end of the queue.
- Only Nodes in a normal state (Running) are considered for scheduling.
The Service abstraction is supported to allow access to a group of Pods. Users access a Service through a specified virtual IP, and Minik8s forwards the request to one of the backing Pods (the Service can be regarded as a front-end proxy for a group of Pods). The Service selects the Pods carrying the corresponding labels through its selector and load-balances traffic to these Pods using a Round Robin policy. The Service is updated dynamically as the set of Pods matching the selector changes (for example, deleted Pods are removed from the Service and newly started Pods are added).
The abstraction of the Service hides the exact location of the Pod, i.e., the Pod can be accessed through the IP provided by the Service regardless of the physical node on which it is running.
This configuration can be specified through a yaml configuration file of type Service, as shown in the following example:
apiVersion: v1
kind: Service
metadata:
labels:
app: notice
name: notices
namespace: default
spec:
ports:
- name: hello
port: 80
targetPort: 80
selector:
app: notice
clusterIP: 10.6.0.1
type: ClusterIP
Key Features
Users can define a virtual IP that encapsulates access to the Pods as access to the Service IP. Any IP address may be chosen as the Service's access address. The Pods backing the Service are selected via the selector, and the default load-balancing policy is round robin.
Implementation
Minik8s uses ipvs as the underlying NAT implementation. ipvs acts on the INPUT and POSTROUTING chains, so you need to make sure that the IP corresponding to the service is routed locally; otherwise the packets enter the FORWARD chain and the rules on the INPUT chain will not take effect.
Specifically, for each Service, its virtual IP address is bound to the virtual NIC `minik8s-proxy0`, and the corresponding rules are added via ipvs.
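For illustration, the rules set up for the example Service above correspond roughly to the following commands (the Pod IPs are made up, and the real kube-proxy adds these rules programmatically through the ipvs interface rather than by shelling out to ipvsadm):

```sh
# Bind the Service's virtual IP to the dummy interface so it is routed locally
ip addr add 10.6.0.1/32 dev minik8s-proxy0

# Create an ipvs virtual service with round-robin scheduling
ipvsadm -A -t 10.6.0.1:80 -s rr

# Add the backing Pods (example IPs) as real servers in NAT (masquerading) mode
ipvsadm -a -t 10.6.0.1:80 -r 10.5.32.2:80 -m
ipvsadm -a -t 10.6.0.1:80 -r 10.5.33.2:80 -m
```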
The ReplicaSet abstraction is supported. A ReplicaSet specifies a desired number of Pods (`replicas`) and monitors the state of those Pods.
When a Pod becomes abnormal (crashed or killed), a new Pod is automatically started from the Pod Spec template (or an existing Running Pod whose `label` matches the `selector` is taken over), bringing the number of Pods managed by the ReplicaSet (those whose `label` matches the `selector`) back up to the number specified by `replicas`. The Pod instances of a ReplicaSet can be deployed across multiple machines.
This configuration can be specified via a yaml configuration file of type ReplicaSet, as exemplified below:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
labels:
app: myapp
tier: frontend
name: myapp-replicas
namespace: default
spec:
replicas: 3
selector:
matchLabels:
tier: frontend
template:
metadata:
labels:
app: myapp
tier: frontend
spec:
containers:
- image: nginx
imagePullPolicy: Always
name: nginx
ports:
- containerPort: 80
protocol: TCP
resources: {}
restartPolicy: Always
Key Features
This functionality is handled by the ReplicaSet Controller, which keeps the number of Pods matching the `selector`'s `matchLabels` equal to `replicas`, deleting surplus Pods and creating new ones when there are too few. A sketch of this reconciliation appears after this paragraph.
When a ReplicaSet is created, if Pods whose `label` matches the ReplicaSet's `selector` already exist, the ReplicaSet takes over those Pods directly; new Pods are created from the `template` field only for the remaining shortfall.
When the `label` of a Pod managed by the ReplicaSet is updated, the Pod is rechecked against the ReplicaSet's `selector`; if it no longer matches, another Pod is taken over or created to maintain the desired count.
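A sketch of the reconciliation idea (not the project's actual controller code; `createPodFromTemplate` and `deletePod` stand in for calls made through the ApiClient):

```go
package replicaset

// reconcile drives the number of matching Pods toward the desired replica count.
func reconcile(desired int, matching []string,
	createPodFromTemplate func() error, deletePod func(name string) error) error {

	switch {
	case len(matching) < desired:
		// Too few Pods: create the shortfall from the Pod template.
		for i := 0; i < desired-len(matching); i++ {
			if err := createPodFromTemplate(); err != nil {
				return err
			}
		}
	case len(matching) > desired:
		// Too many Pods: delete the surplus.
		for _, name := range matching[desired:] {
			if err := deletePod(name); err != nil {
				return err
			}
		}
	}
	return nil
}
```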
Field Meaning
The meaning of each of these fields is as follows:
- `replicas`: The desired number of replicas. The ReplicaSet keeps the number of Pods it manages equal to this value.
- `selector`: Selects the Pods to manage: a Pod's label keys and values must match for it to be controlled by this ReplicaSet. The selector must match the labels of the Pod template (`template`), i.e., Pods created from the template must themselves be selectable, so that they are managed by the ReplicaSet.
- `template`: The Pod template describing the Pod objects that will be created when the ReplicaSet detects that it manages fewer Pods than `replicas`.
The HPA (`HorizontalPodAutoscaler`) abstraction is supported. It dynamically scales the number of ReplicaSet `replicas` up and down based on the real-time load of the Pods managed by the ReplicaSet, so that the total resources consumed by those Pods stay within the given limits. The real-time load is collected by cadvisor on each physical node (CPU and memory usage metrics are currently supported). HPA's Pod instances can be deployed across multiple machines.
Users can customize the resource metrics to be monitored and the corresponding scaling criteria in the configuration file, including CPU utilization and memory usage, and can also customize the scale-up/down policy to limit the speed and manner of scaling.
This configuration can be specified through a yaml configuration file of type HorizontalPodAutoscaler, as shown in the following example:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-practice-cpu-policy-scale-up
spec:
minReplicas: 3
maxReplicas: 6
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 20
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 20
scaleTargetRef:
apiVersion: apps/v1
kind: ReplicaSet
name: myapp-replicas
behavior:
scaleUp:
selectPolicy: Max
stabilizationWindowSeconds: 0
policies:
- type: Pods
value: 1
periodSeconds: 15
Key Features
- Scaling up and down: Taking scale-up as an example, when the load of the Pods managed by the HPA's target ReplicaSet rises above the value specified in the metrics of the scaling policy, the HPA increases the number of Pods (by modifying the `replicas` field of the managed ReplicaSet Spec), up to at most `maxReplicas`. Scaling down works analogously.
- As with ReplicaSet, the Pods created or removed during scaling are distributed across different nodes.
- Scaling policy: users can customize the scaling policy, including limits on the speed and time interval of scaling.
Meaning of fields
The meaning of each of these fields is as follows:
- `minReplicas`: the minimum number of `replicas` the HPA may scale down to.
- `maxReplicas`: the maximum number of `replicas` the HPA may scale up to.
- `metrics`: the resource metrics on which scaling decisions are based; defines the quantitative criteria against which the current metric values are compared.
  - `type`: the type of resource metric; currently only `Resource` is supported.
  - `resource`: information about the resource metric.
    - `name`: name of the resource; currently `cpu` and `memory` are supported.
    - `target`: target value of the resource.
      - `type`: currently supports `AverageValue` and `Utilization`.
      - `averageValue`: scaling is triggered when the average value of the metric across all relevant Pods exceeds this value.
        - AverageValue = Total / Current Instances
      - `averageUtilization`: scaling is triggered when the overall resource utilization (averaged over all relevant Pods) exceeds this percentage.
        - Utilization = Usage / Request (expressed as a percentage)
- `scaleTargetRef`: the object controlled by the HPA; currently only ReplicaSet is supported.
- `behavior`: scale-up and scale-down policies, where `scaleUp` and `scaleDown` configure the respective directions.
  - `stabilizationWindowSeconds`: the number of seconds that must elapse since the last auto-scaling event before the next one may occur.
  - `selectPolicy`: how the results of the individual policies in `policies` are combined (each policy bounds how many Pods a single scaling step may add or remove).
    - `Max`: choose the policy allowing the largest number of Pods to be added/removed.
    - `Min`: choose the policy allowing the smallest number of Pods to be added/removed.
    - `Disabled`: disables scaling in this direction (i.e., no automatic `scaleUp` or `scaleDown`).
  - `policies`: the specific policies (an array; several may be configured).
    - `type`:
      - `Pods`: the change in the number of Pods made by one scaling step must be less than or equal to `Value` (limits the absolute number of changes).
      - `Percent`: `Value` is interpreted as a percentage from 0 to 100; the change made by one scaling step must be less than or equal to that percentage of the currently available Pods (for example, with Value 100 a single step may add or remove at most as many Pods as currently exist, i.e., at most doubling the current number of Pods or removing all of them).
    - `PeriodSeconds`: the number of seconds that must have passed since the last auto-scaling event for this policy to take effect.
Default policy for scaling up and down
If no policy is defined in the configuration file, the defaults are as follows:
- Scale-up: when the resource metrics indicate that scaling up is required, the scale-up follows the higher of the following two rules, and may grow up to `maxReplicas`:
  - add at most 1 Pod every 15 seconds;
  - at most double the current number of Pods every 60 seconds;
  - `stabilizationWindowSeconds` is 0.
- Scale-down: when the resource metrics indicate that scaling down is required, the ReplicaSet may shrink down to `minReplicas`:
  - remove at most 100 Pods every 15 seconds;
  - `stabilizationWindowSeconds` is 300.
Principle of implementation
- Collection and monitoring of actual resource usage: cadvisor is deployed on each physical node to monitor real-time CPU and memory usage on that node (both the node's total resources and each container's usage). The HPAController in the control plane talks to cadvisor through its client and, whenever it needs data, requests the resource-utilization status for the most recent period (including samples at several points in time).
- Consolidation of utilization information: the Metric Client under the HPAController aggregates this per-container resource usage by Pod to obtain the actual resource usage of each Pod.
- Scaling decision: the HPAController derives the ReplicaSet's actual resource usage from the Pods' usage and decides whether to scale up or down based on the resource requirements in the corresponding HPA.
- Scaling execution: if it decides to scale, it does so according to the configured scaling policy, implemented by modifying the `replicas` field of the managed ReplicaSet Spec.
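For intuition, one common way to turn the aggregated utilization into a target replica count is the proportional rule used by Kubernetes' own HPA; minik8s' exact decision logic may differ, so treat this only as a sketch:

```go
package hpa

import "math"

// desiredReplicas computes a target replica count from average CPU utilization,
// clamped to [minReplicas, maxReplicas].
func desiredReplicas(current int, avgUtilization, targetUtilization float64,
	minReplicas, maxReplicas int) int {

	// Scale proportionally to how far the observed utilization is from the target.
	desired := int(math.Ceil(float64(current) * avgUtilization / targetUtilization))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}
```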
Note: minik8s DNS is quite different from Kubernetes DNS.
DNS allows you to define a domain name via a yaml configuration file and bind HTTP request paths under that domain to other HTTP services in the cluster.
By combining the domain name with paths, multiple HTTP services in the cluster can be aggregated under the same domain name.
This configuration can be specified in a yaml configuration file of type DNS, as shown in the following example:
apiVersion: v1
kind: DNS
name: dns-test
spec:
serviceAddress: 10.8.0.1
hostname: hello.world.minik8s
mappings:
- address: http://10.6.1.1:80
path: "/world"
- address: http://10.6.1.2:80
path: "/new/world"
The yaml configuration file for the Service corresponding to the `/world` path is shown below:
apiVersion: v1
kind: Service
metadata:
labels:
app: dns-test
name: dns-world
namespace: default
spec:
ports:
- name: notice
port: 80
targetPort: 80
selector:
app: world
clusterIP: 10.6.1.1
type: ClusterIP
Field Meaning
The meaning of each of these fields is as follows:
- `serviceAddress`: the virtual IP to which the domain name is bound.
- `hostname`: the primary domain name.
- `mappings`: the sub-path mappings (a list of paths).
  - `address`: the address (Service IP and port) to forward to.
  - `path`: the specific path.
In the configuration above, the user first creates a Service (IP `10.6.1.1`), which can be accessed as `ServiceIP:Port`.
After DNS and forwarding are configured, users and Pods can access the Service via `hello.world.minik8s:80/path`, which has the same effect as `ServiceIP:Port`.
Implementation
The domain-name-to-virtual-IP mapping is implemented through CoreDNS. The virtual IP is bound to an Nginx Pod, and sub-path forwarding is implemented through Nginx.
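A quick way to see the effect, using the example above (assuming the DNS name resolves inside the cluster):

```sh
# Direct access to the Service behind /world
curl http://10.6.1.1:80/

# Equivalent access through the domain name and sub-path,
# resolved by CoreDNS and forwarded by the Nginx Pod bound to 10.8.0.1
curl http://hello.world.minik8s:80/world
```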
The Minik8s control plane is fault tolerant (including the ApiServer, Controllers, Scheduler, Kubelet, Kubeproxy, etc.). Control plane components can be restarted after a crash, and existing Pods and Services keep running normally during and after the restart, unaffected by the crash.
In addition, the heartbeat mechanism described above automatically detects and deletes nodes that have been disconnected or down for a long time; such nodes can rejoin the cluster after recovery.
Minik8s supports writing CUDA programs for GPU applications and helps users submit them to the SJTU computing platform for compilation and execution.
Users only need to write a CUDA program and submit the corresponding Job through a yaml configuration file; Minik8s automatically generates a slurm script via its built-in server and uploads the program to the SJTU computing platform for compilation and running. After the job finishes, Minik8s automatically downloads the result for the user and can be configured to send a notification to the user's Shanghai Jiao Tong University mailbox.
The configuration can be specified in a yaml configuration file of type Job, as shown in the following example:
apiVersion: v1
kind: Job
metadata:
name: matrix-sum
namespace: default
spec:
cuFilePath: D:\SJTU\Minik8s\minik8s\pkg\gpu\cuda\sum_matrix\sum_matrix.cu
resultFileName: sum_matrix
resultFilePath: D:\SJTU\Minik8s\minik8s\pkg\gpu\cuda\sum_matrix
args:
numTasksPerNode: 1
cpusPerTask: 2
mail:
type: all
userName: albus_tan
Key Features
- Users can submit their CUDA programs by writing a Job-type yaml configuration file.
- A slurm script is automatically generated from the yaml configuration file, and the user's CUDA program is uploaded to the platform for compilation and running.
- After successful submission, the real-time execution status of the job (Pending, Running, Failed, Completed) can be obtained via the get job command.
- Users can be notified via their Shanghai Jiao Tong University mailbox when the job starts/finishes, and the result is automatically downloaded to the directory specified by the user.
Fields Meaning
- `cuFilePath`: path to the CUDA program the user wants to submit.
- `resultFileName`: name of the result file.
- `resultFilePath`: local path to which the result will be downloaded.
- `args`: configurable parameters for the task.
  - `numTasksPerNode`: number of tasks run per node.
  - `cpusPerTask`: number of CPUs used per task.
  - `gpuResources`: number of GPUs used.
- `mail`: notify the user of task status changes via the SJTU mailbox.
  - `type`: supports begin (notify when the task starts), end (notify when the task ends), fail (notify when the task fails), and all (notify on any status change).
  - `userName`: the user name of the user's Shanghai Jiao Tong University mailbox; if `albus_tan` is given here, the notification mail is sent to `albus_tan@sjtu.edu.cn`.
Implementation
- Job submission: the server listens for Job creation events, connects to the π 2.0 (I-Computing) cluster via an SSH client, compiles the .cu file with CUDA, and submits the job to the dgx2 queue (the GPU job queue).
- Job status retrieval: a background thread on the server periodically checks the job's status via the squeue and sacct commands and updates the status field of the corresponding Job.
- Automatic result download: the server's background thread listens for Job modifications; when the Job's status field shows completion, it downloads the execution results from the platform's Job result folder to the local destination path via sftp.
The Serverless platform runs programs at function granularity and supports auto-scaling and scale-to-0, as well as building function chains and communication between functions in a chain. Currently, Python functions are supported.
It is divided into two versions, v1 and v2; users can choose the version that fits the function's application scenario:
- `Serverless v1`: for the Unlikely Path (functions called very infrequently, such as error-handling functions). Each time the function is called, a Pod is created to run the corresponding function instance, and the Pod is destroyed immediately after the call completes, releasing its resources.
- `Serverless v2`: for normal functions. The first call creates an instance (cold start); subsequent calls reuse the instance (warm start) instead of creating new ones (the call returns roughly 10x faster than a cold start). If a function is called very often, multiple instances are created for it and the system automatically distributes call requests across them. When a function is not called for a while, the number of instances is gradually reduced until `scale-to-0`.
Both versions support a variety of workflows, including conditional branching and loops. Functions of the two versions can call each other and are mutually compatible; the user only needs to specify the desired version in the configuration file.
Key Features
- Users can define function content (function templates) and upload them to the system; templates can subsequently be modified and deleted.
- After uploading a function template, the user can call the function via an HTTP request, passing in parameters and getting the return value.
  - If the function has not finished executing for a long time, the call first returns the `id` of the invocation so that the user can query the result by that `id` later; if the function finishes within the default timeout, the result is returned to the user directly.
- Each function uploaded by a user runs in its own Pod to ensure isolation.
- Workflow: users can define workflow call relationships between multiple functions, supporting conditional branching, loops, etc. (concretely, the user specifies the condition to evaluate and which function to execute when it is true and which when it is false).
- Automatic scale-up: for a function template, the first call (when no instance exists) automatically creates a new instance; as the number of concurrent requests grows, the function is automatically scaled out to multiple instances, and a call request may be handled by any of them.
  - If you want the system to start preparing instances as soon as the template is defined (warming up the cold start), you can specify the number of initial instances so that the first call responds faster; this parameter defaults to 0.
  - The maximum and minimum number of instances of the function can be specified in the configuration.
- Scale-to-0: when there are no new requests for a period of time, the function's instances are gradually reduced until there are none left (or until the specified minimum number of instances is reached).
This configuration can be specified via a configuration file, an example is shown below:
# is_hello.env
API_SERVER=192.168.1.10
PORT=8080
VERSION=v2
NAME=is_hello
MAIN=examples/dash/func/is_hello.py
PRE_RUN=examples/dash/func/nop.sh
LEFT_BRANCH=append_world
RIGHT_BRANCH=append_world
ADDR=10.7.0.3
# is_hello.py
def run(arg):
return arg
def check(arg):
return arg == "hello"
Field Meaning
The `.py` file defines the function itself. The user needs to define two functions in the `.py` file:
- `run(arg)`: the main function body (equivalent to `main`), which should contain the main logic of the function. It takes a parameter `arg` (a string; if multiple parameters or other types are needed, the user must implement their own encoding and decoding) and may have a return value (likewise a string, with user-implemented encoding/decoding for anything more complex).
- `check(arg)`: called automatically after `run` finishes; it returns True or False, and this value decides which function is called next. If `check` returns True, the function named by `LEFT_BRANCH` in the `.env` file is executed next; otherwise the function named by `RIGHT_BRANCH` is executed. The parameter `arg` of `check` is the return value of `run`, so the workflow logic can branch based on `run`'s result.
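As a purely illustrative sketch (the actual `append_world.py` in the repository may differ), a branch function such as `append_world` follows the same `run`/`check` shape:

```python
# append_world.py (hypothetical example, same structure as is_hello.py)
def run(arg):
    # Main logic: append to the incoming string and return it.
    return arg + " world"

def check(arg):
    # Decide the next branch; returning False would route to RIGHT_BRANCH.
    return True
```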
The `.env` file defines the workflow and related configuration for the function:
- `API_SERVER`: the IP address of the cluster's ApiServer.
- `PORT`: the port of the cluster's ApiServer.
- `VERSION`: the Serverless version the function should use.
- `NAME`: the name of the function (must be globally unique; it is the function's unique identifier).
- `MAIN`: path to the file containing the function's content (currently only `.py` files are supported).
- `PRE_RUN`: a shell script executed before the function runs.
- `LEFT_BRANCH`: the function (by name) executed after the current function finishes if `check` returns True.
- `RIGHT_BRANCH`: the function (by name) executed after the current function finishes if `check` returns False.
- `ADDR`: the IP address of the Service corresponding to the function; requests to this address are forwarded to a concrete function instance through the Service's forwarding mechanism.
Usage
Users can define/modify/delete function templates and make function calls via HTTP requests. To simplify these operations, the scripts in the `./script` directory can be used for uploading, updating, calling, and asynchronously fetching the results of function templates (defining, modifying, and deleting templates can also be done via kubectl):
- `uploader.sh`: upload a new function template
- `updater.sh`: modify an already defined function template
- `call.sh`: call a function (the corresponding template must already be uploaded)
- `get-result.sh`: fetch the result of a function call asynchronously
# Upload all function templates (including entries and all functions that may be called in Workflow)
$ script/uploader.sh examples/dash/func/is_hello.env
$ script/uploader.sh examples/dash/func/append_world.env
$ script/uploader.sh examples/dash/func/append_branch.env
# View uploaded function templates
$ build/kubectl get func
# Call the function (with the function name and the incoming function parameters as arguments)
$ bash script/call.sh is_hello hi
# Updating function templates
$ bash script/updater.sh examples/dash/func/is_hello_modify.env
# Delete function template (parameter is function name)
$ build/kubectl del func append_branch
Implementation
See [Serverless](./doc/Serverless.md); the differences between the two implementations lie mainly in the internal interfaces.
- Function Template: a function template corresponds to a Func-type ApiObject; templates can be created, read, updated, and deleted through the following URLs:
  /api/funcs/template
  /api/funcs/template/:name   # name is the Name in the Spec of the function template
- Function Instance and Call: the actual function-call interface, divided into the interface exposed to the user and the interface used internally by the implementation.
- User interface
  - `POST /api/funcs/:name` (parameters passed in the body)
    - Creates and runs an instance based on the function template named name, returning the instance id `instanceId`.
    - The `instanceId` of this call is generated, and the server waits for the result of the function call to be written to `etcd`. If no result appears for a long time, `instanceId` is returned first, and the user can query the execution result later via `GET /api/funcs/:id`; if the result arrives within the timeout, it is returned to the user directly. Internally, the call is carried out through `PUT /api/funcs/:name/:id`.
  - `GET /api/funcs/:id`
    - Returns what the called function returned, looked up by `instanceId`.
- Internal interface
  - `PUT /api/funcs/:name/:id` (parameters passed in the body)
    - The `instanceId` parameter identifies the user's original invocation throughout the call chain and is also used to store the final result.
    - Calls the function identified by the name field; if the name field is RETURN, the call instead stores the function's return value.
In Serverless v1, when the internal interface `PUT /api/funcs/:name/:id` (parameters passed in the body) is called, the corresponding Pod is created directly. When the logic of the function named name finishes, the Pod automatically calls the next function to be executed (again through the `PUT /api/funcs/:name/:id` interface, with the name field set to the next function's name), recursively passing along the user's invocation id `instanceId`; at the same time, the Pod invokes the delete method on itself and destructs itself.
In Serverless v2, when the user creates/modifies a function template, the ServerlessController listens for the corresponding event and creates a ReplicaSet for the template (managing all Pod instances of the current function template) together with a corresponding Service (providing a unified entry point to all of the template's Pods). The ServerlessController is also responsible for periodically adjusting the `replicas` field of the ReplicaSet according to the frequency of recent function invocations, thereby implementing scaling (based on the timestamps of the most recent invocations, the number of invocations, and so on).
In this case, when the internal interface `PUT /api/funcs/:name/:id` (parameters passed in the body) is called, the HTTP request is forwarded to the corresponding Service, which forwards it to a Pod whose labels match (if that Pod does not respond, the request is forwarded to the next one, until timeout). The labels are generated automatically by the ServerlessController from the function's name when the user submits the func template to the ApiServer (the Pod template of the corresponding func server is generated at the same time). When the logic of the function named name finishes, the Pod calls the next function (again through the `PUT /api/funcs/:name/:id` interface, with the name field set to the next function's name) and recursively passes along the user's `instanceId`.
The number of Pods is managed by the func template's ReplicaSet. Whenever a new request calls a function, the timestamp in the corresponding func template's status is updated and a counter is incremented. At fixed intervals, the counters of all existing func templates are uniformly decreased by a certain value, and the `replicas` field of each ReplicaSet is kept equal to its counter, achieving scale-to-zero when a function is not called. A policy limits the upper bound of the counter, and the scaling policy is tuned for different counter values to achieve better behavior.
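A sketch of this counter-based scaling loop (illustrative only; the interval, decay, and bounds are assumptions, not the project's actual parameters):

```go
package serverless

import "sync"

// funcCounter tracks call activity per function template and derives the
// desired replica count for its ReplicaSet.
type funcCounter struct {
	mu       sync.Mutex
	counters map[string]int
	maxCount int // assumed upper bound on the counter
}

func newFuncCounter(maxCount int) *funcCounter {
	return &funcCounter{counters: make(map[string]int), maxCount: maxCount}
}

// onCall is invoked for every incoming function call.
func (c *funcCounter) onCall(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counters[name] < c.maxCount {
		c.counters[name]++
	}
}

// decayAndSync runs periodically: it decays every counter and sets each
// ReplicaSet's replicas field to the counter value (0 => scale-to-zero).
func (c *funcCounter) decayAndSync(decay int, setReplicas func(name string, replicas int)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, v := range c.counters {
		v -= decay
		if v < 0 {
			v = 0
		}
		c.counters[name] = v
		setReplicas(name, v)
	}
}
```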
The documentation for the implementation section can be found at the following link:
- API & API Objects
- ApiServer, ApiClient and ListWatch
- Scheduler
- Controller(Informer, ReplicaSetController and HorizontalController)
- Gpu Server
- Kubelet
- Kubeproxy
- Serverless
- Node Manager
- [CI/CD](./doc/CI CD.md)
- CNI
- Test
ResourceVersion checking is supported when updating API object resources, to avoid concurrency problems when multiple components/users update the same object at the same time. For example, if two components each update the A and B fields of an object based on the same version, the later Put would otherwise overwrite the earlier one, silently dropping part of the update.
Referring to the Kubernetes implementation, all resources have a `resourceVersion` field as part of their metadata. The ResourceVersion is a string that identifies the internal version of an object and can be used by clients to determine when the object has changed. When a record is to be updated, its version is checked against the previously saved value; if they do not match, the update fails with StatusConflict (HTTP status code 409).
Resource versioning is currently backed by etcd's mod_revision. However, applications should not rely on the implementation details of the versioning system; the implementation may change in the future, for example to a timestamp or a per-object counter.
Since the mod revision is only obtained after the Put operation on etcd, and the revision must be written back into the ResourceVersion field of the API object itself, the global revision number has to be maintained synchronously (through the `ResourceVersionManager`): getting the next mod revision, writing it into the object's ResourceVersion field, and storing the API object must happen as a unit, with only one such Get version / Set version / Store Object sequence in flight at a time. The implementation adds a lock `VLock` to guarantee this.
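A sketch of the check itself (simplified; the names `VLock` and `ResourceVersionManager` come from the document, but the code below only illustrates the idea and is not the actual implementation):

```go
package apiserver

import (
	"errors"
	"sync"
)

var errConflict = errors.New("StatusConflict: resource version mismatch (HTTP 409)")

type object struct {
	ResourceVersion string
	Data            []byte
}

type store struct {
	vLock   sync.Mutex // the "VLock": serializes get-version / set-version / store
	objects map[string]object
	nextRev func() string // e.g. backed by etcd's mod_revision via a ResourceVersionManager
}

// update rejects writes whose ResourceVersion is stale, then assigns the next revision.
func (s *store) update(key string, incoming object) error {
	s.vLock.Lock()
	defer s.vLock.Unlock()

	current, ok := s.objects[key]
	if ok && current.ResourceVersion != incoming.ResourceVersion {
		return errConflict // the client must re-read and retry
	}
	incoming.ResourceVersion = s.nextRev()
	s.objects[key] = incoming
	return nil
}
```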
Created via `client.Interface`, the wrapper interface is `ListWatcher`, which specializes in calling the `GetAll` and `WatchAll` methods of the corresponding resource.
Listen to an API object to be notified when it is created, modified or deleted.
- A watch request establishes a long-lived HTTP connection, and `http.Flusher` is used to flush each event to the requester in real time without closing the connection.
- Internally this is implemented with the `etcd` `Watch` mechanism: a `key` is watched and a channel is notified whenever the corresponding `value` is modified.
  - When a key-value pair is deleted, `etcd`'s `Watch` responds with an empty value `""`. Since many components need the content as it was before deletion, the previous value is included by using `clientv3.WithPrevKV()`.
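A minimal sketch of streaming etcd watch events to an HTTP client with `http.Flusher` (illustrative only; the route and payload format are assumptions, not the project's actual API):

```go
package apiserver

import (
	"fmt"
	"net/http"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchHandler streams etcd events for the requested key prefix until the client disconnects.
func watchHandler(cli *clientv3.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		// Include the previous value so deletions still carry the old content.
		wch := cli.Watch(r.Context(), r.URL.Path, clientv3.WithPrefix(), clientv3.WithPrevKV())
		for resp := range wch {
			for _, ev := range resp.Events {
				fmt.Fprintf(w, "%s %s %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
				flusher.Flush() // push the event to the requester immediately
			}
		}
	}
}
```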
An implementation of the `ListWatcher` interface; components that need to perform List and Watch operations on resources can easily set up listening through it:
- `Decoder`: converts the event types that `Watch` receives from the `ApiServer` into `watch.Event` types.
  - The event type coming from the `ApiServer` is the `Etcd` built-in event type; this indirection decouples the two, so changing the implementation only requires implementing the corresponding `Decoder` interface.
- `Reporter`: error handling; converts events that report errors, as well as errors produced during processing, into the standard `watch.Event` type.
- `chan Event`: the channel on which listened events are delivered. Components using `StreamWatcher` obtain this channel via `ResultChan()` of `watch.Interface` and read events from it; when finished, the channel must be closed via the `Stop()` method.
With reference to Kubernetes, two basic components, Informer and WorkQueue, are implemented to facilitate the implementation of all Controllers.
This is equivalent to a local cache of the different ApiObjects on each node, which greatly improves performance by avoiding frequent network requests to the ApiServer. Each ApiObject resource corresponds to one Informer (specified by `objType`). At startup, its `Reflector` fetches all ApiObject information from the ApiServer via `List` and stores it in the `ThreadSafeStore`. After that, the `Reflector` listens for change events of the corresponding ApiObjects via `Watch`, stores them in the `ThreadSafeStore`, and calls the registered `ResourceEventHandler` to handle them.
- `Reflector`: at startup, first fetches all ApiObject information from the ApiServer via `List` and stores it in the `ThreadSafeStore`, then listens for all corresponding ApiObject change events via `Watch` and notifies the `Informer` by putting the events into the `WorkQueue`. Both `List` and `Watch` are performed through the `listwatch.ListerWatcher` component.
- `ThreadSafeStore`: shared with the `Reflector`; stores the local cache of the corresponding ApiObject objects.
- `ResourceEventHandler`: registers responses to the various `Watch` events; components using an `Informer` can add their handlers via `AddEventHandler`.
- `WorkQueue`: every time the `Reflector` receives a new event, the event is put into this queue and waits for the `Informer` to process it in `run`, calling the corresponding registered `EventHandler` function.
Work queues allow multiple workers in a controller to consume object-related events at the same time, parallelizing processing and improving performance.
- The queue is thread-safe: read/write locks allow multiple threads to use it concurrently without concurrency issues.
- If the queue is empty during `Dequeue`, a condition variable waits for an `Enqueue` operation to wake it up before attempting to `Dequeue` again.
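A minimal sketch of such a queue, using a mutex plus condition variable (the project's actual WorkQueue may differ, e.g. by using read/write locks):

```go
package workqueue

import "sync"

// WorkQueue is a minimal blocking FIFO queue guarded by a mutex and condition variable.
type WorkQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []interface{}
}

func New() *WorkQueue {
	q := &WorkQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Enqueue appends an item and wakes up one waiting Dequeue.
func (q *WorkQueue) Enqueue(item interface{}) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, item)
	q.cond.Signal()
}

// Dequeue blocks until an item is available, then removes and returns it.
func (q *WorkQueue) Dequeue() interface{} {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait() // the lock is released while waiting and reacquired on wake-up
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}
```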
- Ziming Tan
- ApiServer + Etcd
- Developing API Object Fields
- ApiClient and ListWatcher
- Controller basic components
- ReplicaSet abstraction and its basic functionality
- Dynamic scaling HPA functionality
- Implement container scheduling functionality on multiple machines (Node abstraction and Scheduler).
- Completion of the GPU component
- DNS and Serverless Controller
- CNI and CI/CD experiments, kubectl refactoring, and other work
- Jiarui Wang
- Implement Pod abstraction for container lifecycle management
- Developing API object fields
- Implement kubelet node management functionality
- Implement CNI to support inter-pod communication
- Implement Service abstraction.
- Implement DNS abstraction to realize forwarding function
- Implement Serverless V1 and Serverless V2 functionality.
- gitlab CI/CD
- HPA scenario building and testing, kubectl refactoring, and others.
etcd needs to be installed on the control plane Master node.
cadvisor needs to be installed, deployed, and started beforehand on each worker node for the HPA functionality to work properly.
Deploy using binary
# download binary file
https://github.com/google/cadvisor/releases/latest
# run locally
./cadvisor -port=8090 &>>/var/log/cadvisor.log
# check process information
ps -aux | grep cadvisor
# check ports
netstat -anp | grep 8090
Deploying with docker
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8090:8090 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest
Port forwarding
This lets you view the cadvisor running on the remote machine from your local machine.
ssh -N minik8s-dev -L 8090:localhost:8090
Each node requires network configuration via flannel.
Flannel configures a layer 3 IPv4 overlay network. It creates a large internal network that spans every node in the cluster. In this overlay network, each node has a subnet that is used to assign IP addresses internally. When a pod is configured, the Docker bridge interface on each node assigns an address to each new container. Pods in the same host can communicate using the Docker bridge, while pods on different hosts use flanneld to encapsulate their traffic in UDP packets for routing to the appropriate destination.
Refer to the "Running manually" section of Running flannel.
If wget fails, consider uploading the file to the server manually.
sudo apt install etcd
wget https://github.com/flannel-io/flannel/releases/latest/download/flanneld-amd64 && chmod +x flanneld-amd64
sudo ./flanneld-amd64
docker run --rm --net=host quay.io/coreos/etcd
docker run --rm -e ETCDCTL_API=3 --net=host quay.io/coreos/etcd etcdctl put /coreos.com/network/config '{ "Network": "10.5.0.0/16", "Backend": {"Type": "vxlan"}}'
Check Ports
netstat -nap | grep 2380
Testing
docker run -it busybox sh
# Check IP of container
$ cat /etc/hosts
# ping another container (example address on the 10.5.0.0/16 flannel network)
ping -c3 10.5.0.2