You can use environment variables to overwrite the defaults. To run only a single run of allocations:

```bash
TESTRUNSCOUNT=1 ./runAllocation.sh 40 10
```


# Running Scenario Tests

The scenario test allows you to generate a variable number of allocations to
your cluster over time, simulating a game where clients arrive in an unsteady
pattern. The game servers used in the test are configured to shut down after
being allocated, simulating the GameServer churn that is expected during
normal game play.

## Kubernetes Cluster Setup

For the scenario test to achieve high throughput, you can create multiple groups
of nodes in your cluster. During testing (on GKE), we created a node pool for
the Kubernetes system components (such as the metrics server and DNS servers), a
node pool for the Agones system components (as recommended in the installation
guide), and a node pool for the game servers.

On GKE, to restrict the Kubernetes system components to their own set of nodes,
you can create a node pool with the taint
`components.gke.io/gke-managed-components=true:NoExecute`.

To prevent the Kubernetes system components from running on the game servers
node pool, that node pool was created with the taint
`scenario-test.io/game-servers=true:NoExecute`,
and the Agones system node pool used the standard taint
`agones.dev/agones-system=true:NoExecute`.

In addition, the GKE cluster was configured as a regional cluster to ensure high
availability of the cluster control plane.

The following commands were used to construct a cluster for testing:

```bash
gcloud container clusters create scenario-test --cluster-version=1.21 \
  --tags=game-server --scopes=gke-default --num-nodes=2 \
  --no-enable-autoupgrade --machine-type=n2-standard-2 \
  --region=us-west1 --enable-ip-alias --cluster-ipv4-cidr 10.0.0.0/10

gcloud container node-pools create kube-system --cluster=scenario-test \
  --no-enable-autoupgrade \
  --node-taints components.gke.io/gke-managed-components=true:NoExecute \
  --num-nodes=1 --machine-type=n2-standard-16 --region us-west1

gcloud container node-pools create agones-system --cluster=scenario-test \
  --no-enable-autoupgrade --node-taints agones.dev/agones-system=true:NoExecute \
  --node-labels agones.dev/agones-system=true --num-nodes=1 \
  --machine-type=n2-standard-16 --region us-west1

gcloud container node-pools create game-servers --cluster=scenario-test \
  --node-taints scenario-test.io/game-servers=true:NoExecute --num-nodes=1 \
  --machine-type n2-standard-2 --no-enable-autoupgrade \
  --region us-west1 --tags=game-server --scopes=gke-default \
  --enable-autoscaling --max-nodes=300 --min-nodes=175
```

## Agones Modifications

For the scenario tests, we modified the Agones installation in a number of ways.

First, we made sure that the Agones pods would _only_ run in the Agones node
pool by changing the node affinity in the deployments for the controller,
allocator service, and ping service to
`requiredDuringSchedulingIgnoredDuringExecution`.
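
Concretely, the changed stanza in each deployment ends up shaped roughly like
this (an illustrative sketch using the standard Kubernetes pod-spec fields; the
exact selector used in `scenario-values.yaml` may differ):

```yaml
affinity:
  nodeAffinity:
    # Requiring (rather than preferring) the agones-system label keeps
    # these pods off the kube-system and game-servers node pools.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agones.dev/agones-system
              operator: Exists
```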

We also increased the resources for the controller and allocator service pods,
and specified both requests and limits so that the pods were given the highest
quality of service class (Guaranteed).
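
Kubernetes assigns the Guaranteed QoS class only when requests and limits are
equal for every container. As a sketch (the CPU and memory figures here are
placeholders, not the values we used):

```yaml
resources:
  requests:
    cpu: "4"
    memory: 8Gi
  limits:          # equal to requests => Guaranteed QoS class
    cpu: "4"
    memory: 8Gi
```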

These configuration changes are captured in
[scenario-values.yaml](scenario-values.yaml) and can be applied during
installation using helm:

```bash
helm install my-release --namespace agones-system -f scenario-values.yaml agones/agones --create-namespace
```

Alternatively, these changes can be applied to an existing Agones installation
by running [`./configure-agones.sh`](configure-agones.sh).

## Fleet Settings

We used the following fleet configuration:

```yaml
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: scenario-test
spec:
  replicas: 10
  template:
    metadata:
      labels:
        gameName: simple-game-server
    spec:
      ports:
      - containerPort: 7654
        name: default
      health:
        initialDelaySeconds: 30
        periodSeconds: 60
      template:
        spec:
          tolerations:
          - effect: NoExecute
            key: scenario-test.io/game-servers
            operator: Equal
            value: 'true'
          containers:
          - name: simple-game-server
            image: gcr.io/agones-images/simple-game-server:0.10
            args:
            - -automaticShutdownDelaySec=60
            - -readyIterations=10
            resources:
              limits:
                cpu: 20m
                memory: 24Mi
              requests:
                cpu: 20m
                memory: 24Mi
```

and fleet autoscaler configuration:

```yaml
apiVersion: "autoscaling.agones.dev/v1"
kind: FleetAutoscaler
metadata:
  name: fleet-autoscaler-scenario-test
spec:
  fleetName: scenario-test
  policy:
    type: Buffer
    buffer:
      bufferSize: 2000
      minReplicas: 10000
      maxReplicas: 20000
```
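
The effect of the Buffer policy above can be sketched as simple arithmetic (a
simplified model, not Agones source code): the desired replica count is the
number of allocated game servers plus the buffer size, clamped to the min/max
bounds.

```shell
#!/usr/bin/env bash
# Simplified model of the Buffer policy:
#   desired = clamp(allocated + bufferSize, minReplicas, maxReplicas)
# Bounds taken from the FleetAutoscaler configuration above.
desired_replicas() {
  local allocated=$1 buffer=2000 min=10000 max=20000
  local want=$((allocated + buffer))
  if (( want < min )); then want=$min; fi
  if (( want > max )); then want=$max; fi
  echo "$want"
}

desired_replicas 500    # 500 allocated   -> 10000 (held up by minReplicas)
desired_replicas 15000  # 15000 allocated -> 17000 (allocated + buffer)
desired_replicas 19500  # 19500 allocated -> 20000 (capped by maxReplicas)
```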

To reduce pod churn in the system, the simple game servers are configured to
return themselves to `Ready` after each of their first 10 allocations, following
the [Reusing Allocated GameServers for more than one game
session](https://agones.dev/site/docs/integration-patterns/reusing-gameservers/)
integration pattern. After 10 simulated game sessions, the simple game servers
then exit automatically. The fleet configuration above sets each game session to
last for 1 minute, representing a short game.

## Configuring the Allocator Service

The allocator service uses gRPC. To be able to call the service, TLS and mTLS
have to be set up. For more information, see
[Allocator Service](https://agones.dev/site/docs/advanced/allocator-service/).
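
Once the certificates are in place, an allocation can be requested with any
gRPC client; for example, a sketch using `grpcurl` (the certificate file names
are placeholders, and this assumes the default `agones-allocator` service
exposed on port 443):

```bash
# Look up the allocator's external IP (assumes a LoadBalancer service).
EXTERNAL_IP=$(kubectl get service agones-allocator -n agones-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Issue a single allocation request with mTLS client credentials.
grpcurl -cert client.crt -key client.key -cacert ca.crt \
  -d '{"namespace": "default"}' \
  "${EXTERNAL_IP}:443" allocation.AllocationService/Allocate
```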

## Running the test

You can use the provided `runScenario.sh` script by providing one parameter (a
scenario file). The scenario file is a simple text file where each line
represents a "scenario" that the program will execute before moving to the next
scenario. A scenario is a duration and the number of concurrent clients to use,
separated by a comma. The program creates the desired number of clients, and
those clients send allocation requests to the allocator service for the scenario
duration. At the end of each scenario, the program prints out some statistics
for that scenario.
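
For example, a scenario file might look like the following (illustrative
contents, assuming the duration comes first on each line as described above):

```
10m,15
10m,50
5m,5
```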

Two sample scenario files are included in this directory: one sends a constant
rate of allocations for the duration of the test, and the other sends a
variable number of allocations.

Upon concluding, the program prints the overall statistics from the test.

```
./runScenario.sh variable.txt
...
2022-02-24 10:57:44.985216321 +0000 UTC m=+13814.879251454 :Running Scenario 24 with 15 clients for 10m0s
===================

Finished Scenario 24
Count: 100 Error: ObjectHasBeenModified
Count: 113 Error: TooManyConcurrentRequests
Count: 0 Error: NoAvailableGameServer
Count: 0 Error: Unknown

Scenario Failure Count: 213, Allocation Count: 15497

Total Failure Count: 6841, Total Allocation Count: 523204

Final Error Totals
Count: 0 Error: NoAvailableGameServer
Count: 0 Error: Unknown
Count: 3950 Error: ObjectHasBeenModified
Count: 2891 Error: TooManyConcurrentRequests


2022-02-24 11:07:45.677220867 +0000 UTC m=+14415.571255996
Final Total Failure Count: 6841, Total Allocation Count: 523204
```

Since error counts are gathered per scenario, it is recommended to keep each
scenario short (e.g. 10 minutes) so that errors can be narrowed down to a
particular window, even if the allocation rate stays at the same level for
longer than 10 minutes at a time.