US ATLAS Federated Operations Team Information

The US ATLAS Federated Operations (FedOps) Team supports the operation of centrally managed, containerized services via SLATE across the U.S. ATLAS complex. The team is made up of sysadmins from all the U.S. sites, who coordinate service deployments, monitor service performance, and provide first-line operational support.

Information for Operations Shifters

Not yet available.

Information for Sites

Setting Up a New Site

Helpful documentation for setting up a new site

The main contents of this section are based on (copied from) this Google document written by Lincoln Bryant and Ilija Vukotic:

https://docs.google.com/document/d/1NOc7EyOZpNTKwlXU_lP7wbiKHpwVqmZXNCYCGlOaXuI

Also included is information from the SLATE project documentation website:

https://slateci.io/docs/cluster/

The intention of this section is to provide a single document containing a comprehensive set of instructions to bring a new bare-metal SLATE server online running a squid service that works with both ATLAS and OSG jobs. The resulting squid instance can be used for both Frontier and cvmfs. The installation procedure has been checked on a test server set up specifically to validate the procedures documented here. Please report documentation issues to <xxxx@bnl.gov>.

Steps to set up a SLATE-based squid service on a new server

The steps in the process are:

  1. Set up a new server with Linux and whatever the local environment requires.
  2. Install SLATE.
  3. Install Kubernetes and configure a cluster.
  4. Set up squid.
  5. Add the new squid to the ATLAS CRIC and the OSG topology.
  6. Modify the site gatekeepers to know about the new squid.
  7. Set up SLATE monitoring if desired.

The attached file, slate-install-doc-v1.txt, shows the bash commands and their outputs for steps 2 and 3. This file was created on a special test server, so in places the commands must be modified to refer to the server that is being set up rather than the test environment.

The following instructions assume root access on the target server.

1. Set up the new server with Linux

While you can use a virtual machine, use of a bare metal server is preferred. The server should meet these minimum requirements to run a squid service:

  • 16GB RAM
  • 2 CPU cores
  • 100GB Disk
  • 1Gbps Connectivity
  • Port 3401/udp for external WLCG monitoring 
  • Port 32200/tcp for client access (e.g. within the site)

NB: If the server will also run an XCache service, then the hardware requirements are significantly higher - see: xxx. In this case, reserve one disk for use exclusively by the squid disk cache and remove it from the disk array servicing XCache.

Set this server up with CentOS 7 (or equivalent) and the usual local environment (accounts, firewalls, system management tools, etc.). It is always a good idea to reboot the server before proceeding to the next step.
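
The minimum hardware requirements listed above can be checked before going further. The following is a minimal sketch, not part of the SLATE documentation: the helper names (at_least, mem_gb, cpu_cores, disk_gb, report) and the CACHE_DIR variable are illustrative assumptions. Sizes are computed in binary units, so a nominal 100GB disk may report slightly less after formatting.

```shell
#!/bin/sh
# Pre-flight sketch: compare this host against the minimum requirements
# listed above (16GB RAM, 2 CPU cores, 100GB disk).
# All helper names and CACHE_DIR are illustrative assumptions.

# at_least ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED (integers).
at_least() {
    [ "$1" -ge "$2" ]
}

# Total RAM in GB, rounded up, from /proc/meminfo (MemTotal is in kB).
mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' /proc/meminfo
}

# Number of CPU cores available to this process.
cpu_cores() {
    nproc
}

# Size in GB of the filesystem holding DIR (df -Pk reports 1024-byte blocks).
disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1048576 }'
}

# report LABEL ACTUAL REQUIRED: print one OK/FAIL line per requirement.
report() {
    if at_least "$2" "$3"; then
        echo "OK    $1: $2 (minimum $3)"
    else
        echo "FAIL  $1: $2 (minimum $3)"
    fi
}

# Hypothetical: directory on the filesystem that will hold the squid cache.
CACHE_DIR="${CACHE_DIR:-/}"

report "RAM (GB)"  "$(mem_gb)"               16
report "CPU cores" "$(cpu_cores)"            2
report "Disk (GB)" "$(disk_gb "$CACHE_DIR")" 100
```

Network connectivity and the two ports are not covered by this sketch; they depend on the site firewall and are best checked once the squid service is running.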

2. Install SLATE

Following the SLATE installation documentation, install SLATE and then Kubernetes. NB: The # character normally used for the root command prompt has been replaced by the % character to avoid formatting issues in what Drupal displays. Also, many commands have their output suppressed to improve readability. See the attached file slate-install-doc-v1.txt, which contains a session run on a test server showing the full output of the commands.

If you do not have a SLATE account, create one and obtain a SLATE token for your new server. <More instructions needed>. Once you have the account, join the SLATE group for your site, or create a group for your site if it is new.

Using the SLATE token that you created when registering with SLATE in place of the example token shown (d43ZtGaODDEjulKrbvlzXe), create the file slate-token.sh containing:

#!/bin/sh
mkdir -p -m 0700 "$HOME/.slate"
if [ "$?" -ne 0 ] ; then
        echo "Not able to create $HOME/.slate" 1>&2
        exit 1
fi

echo "d43ZtGaODDEjulKrbvlzXe" > "$HOME/.slate/token"
if [ "$?" -ne 0 ] ; then
        echo "Not able to write token data to $HOME/.slate/token" 1>&2
        exit 1
fi
chmod 600 "$HOME/.slate/token"
echo 'https://api.slateci.io:443' > ~/.slate/endpoint

echo "SLATE access token successfully stored"

The suggested location for slate-token.sh is /root, but it can be anywhere on the system. When the script is run as root, the token is stored in /root/.slate/token. Now execute slate-token.sh:

[root@server ~]% chmod 755 slate-token.sh
[root@server ~]% ./slate-token.sh
SLATE access token successfully stored
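
Since the token grants access to the SLATE API, it can be worth confirming that the file was written with restrictive permissions. The following is a minimal sketch; the function name check_slate_token is an illustrative assumption, and it relies on the GNU version of stat found on CentOS.

```shell
#!/bin/sh
# Sketch: confirm that the SLATE token file exists, is non-empty, and is
# readable and writable only by its owner (mode 600, as set by slate-token.sh).
# check_slate_token is a hypothetical helper name.

check_slate_token() {
    token_file="${1:-$HOME/.slate/token}"
    # -s is true only for a file that exists and has a size greater than zero.
    if [ ! -s "$token_file" ]; then
        echo "Missing or empty token file: $token_file" 1>&2
        return 1
    fi
    # GNU stat: %a prints the permission bits in octal (e.g. 600).
    mode=$(stat -c '%a' "$token_file")
    if [ "$mode" != "600" ]; then
        echo "Unexpected permissions $mode on $token_file (expected 600)" 1>&2
        return 1
    fi
    echo "Token file $token_file looks OK"
}
```

Calling check_slate_token with no argument checks the default location under $HOME/.slate; a different path can be passed as the first argument.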

Next, download the SLATE tarball and its checksum file, and check that the download is not corrupted:

[root@server ~]% curl -LO https://jenkins.slateci.io/artifacts/client/slate-linux.sha256
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    85  100    85    0     0    292      0 --:--:-- --:--:-- --:--:--   293
[root@server ~]% curl -LO https://jenkins.slateci.io/artifacts/client/slate-linux.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1896k  100 1896k    0     0  6585k      0 --:--:-- --:--:-- --:--:-- 6607k
[root@server ~]% sha256sum -c slate-linux.sha256
slate-linux.tar.gz: OK

The attached file slate-install-doc-v1.txt shows the complete messages generated by running the curl commands.

If you do not get the final "slate-linux.tar.gz: OK" message, retry the download. If the check still fails after retrying, you are likely receiving a corrupted or compromised version of SLATE. Do not proceed, and ask for help.
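
The rule of not proceeding on a failed checksum can be enforced in a script. The following is a minimal sketch; verify_and_extract is a hypothetical helper name, and it assumes both files are in the current directory, as in the session above.

```shell
#!/bin/sh
# Sketch: extract the SLATE tarball only if its SHA-256 checksum verifies.
# verify_and_extract is a hypothetical helper name.

verify_and_extract() {
    sum_file="$1"    # e.g. slate-linux.sha256
    tarball="$2"     # e.g. slate-linux.tar.gz
    # sha256sum -c reads "<hash>  <filename>" lines from the checksum file
    # and exits non-zero on any mismatch or missing file.
    if ! sha256sum -c "$sum_file"; then
        echo "Checksum verification failed; not extracting $tarball" 1>&2
        return 1
    fi
    tar xzf "$tarball"
}
```

For example, verify_and_extract slate-linux.sha256 slate-linux.tar.gz leaves the slate binary in the current directory only when the check succeeds.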

Now install SLATE on the server:

[root@server ~]% tar xzvf slate-linux.tar.gz
slate
[root@server ~]% ls -ltr
total 5992
[Lines removed]
-rwxr-xr-x  1 1000 1000 4123632 Jul 30 14:37 slate
-rwxr-xr-x  1 root root     410 Aug  4 10:35 slate-token.sh
-rw-r--r--  1 root root      85 Aug  4 10:36 slate-linux.sha256
-rw-r--r--  1 root root 1941755 Aug  4 10:37 slate-linux.tar.gz
[root@server ~]% mv slate /usr/local/bin/slate 

This completes the SLATE installation. To check that SLATE is working, list the SLATE clusters:

[root@server ~]% slate cluster list
Name                 Admin        ID                 
Rice-CRC-OCI         rice-crc     cluster_wRzlo7q62VM
atlas-af-proto       mwt2         cluster_CwuDuKE43GA
[continues displaying info about more clusters]

3. Install Kubernetes (K8s) and configure a cluster

This section is based on the SLATE documentation with modifications to support the standard US ATLAS squid setup.

The first step is to disable SELinux, turn off swap, turn off the local firewall, and set the bridge netfilter sysctl parameters used by Kubernetes networking:

[root@server ~]% setenforce 0
setenforce: SELinux is disabled
[root@server ~]% sed -i --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
[root@server ~]% swapoff -a
[root@server ~]% sed -e '/swap/s/^/#/g' -i /etc/fstab
[root@server ~]% systemctl disable --now firewalld
[root@server ~]% cat <<EOF >  /etc/sysctl.d/k8s.conf
> net.bridge.bridge-nf-call-ip6tables = 1
> net.bridge.bridge-nf-call-iptables = 1
> EOF
[root@server ~]% sysctl --system
[SLATE-install-doc-v1.txt shows the output]
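
The settings above can be verified before installing Kubernetes. The following is a minimal sketch with hypothetical helper names. Note that the sed command shown above prepends a # each time it is run, so comment_swap is offered as a variant that skips lines that are already commented out. The net.bridge sysctl keys only exist once the br_netfilter kernel module is loaded, so bridge_filter_enabled simply reports failure if the key is absent.

```shell
#!/bin/sh
# Sketch: helpers to apply and confirm the host settings described above.
# All function names are illustrative assumptions.

# comment_swap FSTAB: comment out lines mentioning swap unless they already
# start with '#', so the edit is safe to run more than once.
comment_swap() {
    sed -i -e '/^[^#].*swap/s/^/#/' "$1"
}

# swap_is_off [FILE]: succeeds when no swap area is active.
# /proc/swaps holds one header line followed by one line per active swap area.
swap_is_off() {
    awk 'NR > 1 { found = 1 } END { exit found }' "${1:-/proc/swaps}"
}

# bridge_filter_enabled: succeeds when bridged traffic is passed to iptables.
bridge_filter_enabled() {
    [ "$(sysctl -n net.bridge.bridge-nf-call-iptables 2>/dev/null)" = "1" ]
}
```

A typical use would be to run swap_is_off and bridge_filter_enabled after sysctl --system and stop if either fails.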

Now use yum to install Docker and the Kubernetes packages (kubeadm, kubectl, and kubelet):

[root@server ~]% yum install yum-utils -y
[Usual yum output]
[root@server ~]% yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
[Usual yum output]
[root@server ~]% yum install docker-ce docker-ce-cli containerd.io -y
docker-ce-stable
[Usual yum output]
[root@server ~]% systemctl enable --now docker
Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.
[root@server ~]% cat <<EOF > /etc/yum.repos.d/kubernetes.repo
> [kubernetes]
> name=Kubernetes
> baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
> enabled=1
> gpgcheck=1
> repo_gpgcheck=1
> gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
> EOF
[root@server ~]% yum install -y kubeadm kubectl kubelet --disableexcludes=kubernetes
kubernetes/signature
[Usual yum output]

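Next, enable kubelet using systemctl, initialize the cluster with kubeadm, copy the admin configuration into place for kubectl, install the Calico network plugin, and remove the master taint so that pods can be scheduled on this single-node cluster: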

[root@server ~]% systemctl enable --now kubelet
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /usr/lib/systemd/system/kubelet.service.
[root@server ~]% kubeadm init --pod-network-cidr=192.168.0.0/16
[Many lines of output shown in SLATE-install-doc-v1.txt]
[root@server ~]% mkdir -p $HOME/.kube
[root@server ~]% sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@server ~]% sudo chown $(id -u):$(id -g) $HOME/.kube/config
[root@server ~]% kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
[Many lines of output shown in SLATE-install-doc-v1.txt]
[root@server ~]% kubectl get nodes
NAME                  STATUS   ROLES                  AGE     VERSION
server.example.edu   Ready    control-plane,master   5h52m   v1.21.3
[root@server ~]% kubectl taint nodes --all node-role.kubernetes.io/master-
node/server.example.edu untainted
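
After the network plugin is applied, a node can take some time to report Ready. The following is a minimal sketch for waiting on that state from a script; the function names and the default retry count and delay are illustrative assumptions, and the STATUS column is assumed to be the second field of kubectl get nodes output, as in the session above.

```shell
#!/bin/sh
# Sketch: wait until every node in the cluster reports STATUS "Ready".
# Function names and default timings are illustrative assumptions.

# all_nodes_ready: reads `kubectl get nodes --no-headers` output on stdin and
# succeeds only if there is at least one node and every STATUS is "Ready".
all_nodes_ready() {
    awk '$2 != "Ready" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# wait_for_nodes [TRIES] [DELAY]: poll kubectl until all nodes are Ready,
# giving up after TRIES attempts spaced DELAY seconds apart.
wait_for_nodes() {
    tries="${1:-30}"
    delay="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        if kubectl get nodes --no-headers 2>/dev/null | all_nodes_ready; then
            echo "All nodes are Ready"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    echo "Timed out waiting for nodes to become Ready" 1>&2
    return 1
}
```

Running wait_for_nodes with no arguments polls for up to about five minutes; a node shown as NotReady or with scheduling disabled is treated as not ready.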