[00:02:58] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:08:24] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:21:24] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:23:33] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (10MarkTraceur) Approved as manager! [00:29:52] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [00:35:54] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:02:50] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:13:52] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:14:16] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:20:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:32:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:37:10] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:38:52] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:56:26] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:57:06] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:57:12] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add qqq [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/685533 (owner: 10Legoktm) [02:00:05] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T0200) [02:01:46] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
[02:02:12] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:02:52] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:06:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.15 [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705522 [02:06:45] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.15 [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705522 (owner: 10TrainBranchBot) [02:13:22] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:13:42] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:15:26] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:24:56] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:24:56] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.15 [core] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705522 (owner: 10TrainBranchBot) [02:28:34] PROBLEM - Host an-worker1106 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:42] PROBLEM - k8s API server requests 
latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:34:40] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:41:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:42:18] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:43:59] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:45:00] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:49:18] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:51:06] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:51:52] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:57:56] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:02:26] (03CR) 10DannyS712: "I may not be able to be around during deployment, but this shouldn't need any testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705107 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [03:02:44] (03PS6) 10Juan90264: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) [03:03:20] (03CR) 10Juan90264: [C: 03+1] Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [03:03:37] (03CR) 10jerkins-bot: [V: 04-1] Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [03:04:30] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:13:29] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:13:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:19:04] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:19:24] PROBLEM - k8s API server requests latencies on 
ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:19:44] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:22:26] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:22:42] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:27:02] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:27:02] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:28:03] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:33:36] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:42:28] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:43:24] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:44:26] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:44:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:44:36] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:51:42] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:51:56] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:52:36] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:55:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:55:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: 
instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:56:12] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [03:57:06] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:01:36] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:02:39] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:02:46] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:04:16] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:08:19] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:08:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:16:02] PROBLEM - 
etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:18:06] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:23:44] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:23:52] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:25:32] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:31:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:33:10] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:33:16] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:42:50] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:43:38] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:44:50] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:44:52] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:50:44] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:55:16] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:56:18] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:56:32] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:59:56] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [04:59:59] 10SRE, 10SRE-Access-Requests, 
10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (10Kalliope) Thank you kindly @RLazarus! Much appreciated :) [05:02:16] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:02:58] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:09:52] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:09:56] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:13:42] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:15:30] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:18:20] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:24:04] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1326 MB (4% inode=95%): /tmp 1326 MB (4% inode=95%): /var/tmp 
1326 MB (4% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [05:26:02] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:27:04] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:29:10] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:29:36] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:31:02] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:34:56] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:37:32] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [05:42:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:44:05] (CR) Giuseppe Lavagetto: [C: +1] hieradata: enable TLS on memcached eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/702590 (https://phabricator.wikimedia.org/T271967) (owner: Effie Mouzeli)
[05:44:18] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:44:58] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[05:46:24] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:51:50] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:52:14] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:52:56] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:57:50] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:01:52] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:03:44] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:04:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:04:28] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:08:04] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:12:09] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:13:19] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:13:22] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:13:57] elukey: is ^ known? anything we can do about it other than silencing ?
[06:17:54] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:18:46] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:19:08] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:22:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:25:24] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:26:40] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:26:48] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=listWithCount https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:27:09] I'm silencing all of those
[06:31:42] Ty and morning godog
[06:32:29] cheers RhinosF1
[06:39:12] (PS1) Filippo Giunchedi: Revert "smokeping: don't poll authdns2001" [puppet] - https://gerrit.wikimedia.org/r/705608
[06:42:28] hiya RhinosF1
[06:42:42] DannyS712: you there?
[06:43:36] Hi Bsadowski1
[06:52:59] (CR) Giuseppe Lavagetto: [C: +1] rake: replace conftool_schema with generic json syntax [puppet] - https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi)
[06:52:59] godog: wow, I noticed some warnings yesterday but never seen all this spam, thanks for silencing!
[06:54:32] sure np!
[06:54:42] (CR) Filippo Giunchedi: [C: +2] rake: replace conftool_schema with generic json syntax [puppet] - https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882) (owner: Filippo Giunchedi)
[06:54:50] (PS2) Filippo Giunchedi: rake: replace conftool_schema with generic json syntax [puppet] - https://gerrit.wikimedia.org/r/705352 (https://phabricator.wikimedia.org/T286882)
[07:02:56] (CR) Filippo Giunchedi: [C: +1] "LGTM! Feel free to merge" [alerts] - https://gerrit.wikimedia.org/r/702477 (owner: Dave Pifke)
[07:03:39] SRE, SRE Observability, Patch-For-Review: Validate json files syntax in puppet CI - https://phabricator.wikimedia.org/T286882 (fgiunchedi) Open→Resolved a:fgiunchedi This is complete!
[07:05:29] (PS2) Filippo Giunchedi: Revert "smokeping: don't poll authdns2001" [puppet] - https://gerrit.wikimedia.org/r/705608
[07:07:26] (CR) Filippo Giunchedi: [C: +2] Revert "smokeping: don't poll authdns2001" [puppet] - https://gerrit.wikimedia.org/r/705608 (owner: Filippo Giunchedi)
[07:08:59] (CR) Elukey: [C: +2] profile::kubernetes::master: add comments and improve hiera lookups [puppet] - https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) (owner: Elukey)
[07:09:03] (PS2) Filippo Giunchedi: hieradata: add o11y services to service::catalog [puppet] - https://gerrit.wikimedia.org/r/705343
[07:09:08] (CR) Elukey: [C: +2] profile::kubernetes::master: add panel numbers to grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/705338 (owner: Elukey)
[07:09:15] (PS3) Elukey: profile::kubernetes::master: add panel numbers to grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/705338
[07:10:53] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica1004.wikimedia.org
[07:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:37] (CR) DannyS712: [C: +1] Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467 (owner: Tim Starling)
[07:11:43] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30258/console" [puppet] - https://gerrit.wikimedia.org/r/705343 (owner: Filippo Giunchedi)
[07:12:19] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[07:37:57] (PS3) Filippo Giunchedi: swift: use addresses for memcached [puppet] - https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835)
[07:37:59] (PS3) Filippo Giunchedi: swift: enable listing_formats on bullseye [puppet] - https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835)
[07:40:02] I'm looking for kind souls for a sanity check on these two reviews ^
[07:50:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2001.wikimedia.org
[07:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host idp-test2001.wikimedia.org
[07:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:19] SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (JMeybohm) My past impression of the nginx-ingress was that while it's okay for low traffic stuff you would start getting trouble with increased traffic. That probably is mostly due to the...
[07:59:49] (CR) Ayounsi: Adding 'quality-of-service' template for use on QFX/EX series switches. (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: Cathal Mooney)
[07:59:52] godog: checking
[08:00:25] IIRC there is a policy for production wikis not to depend on Cloud Services or Toolforge (e.g. local CSS or JS loading stuff by default). Anyone knows where to find that policy? (because T166138) Thanks!
[08:00:26] T166138: Please add Petit Formal Script to the UniversalLanguageSelector - https://phabricator.wikimedia.org/T166138
[08:00:44] elukey: sweet, thank you!
[08:00:57] (03CR) 10Elukey: [C: 03+1] swift: use addresses for memcached [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:02:23] !log racadm serveraction powercycle on an-worker1106 due to CPU soft lock-ups on host [08:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:31] (03CR) 10Elukey: [C: 03+1] "Left a nit, looks sane to me modulo the fact that I don't have a lot of familiarity with swift configs :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:05:55] godog: done! [08:06:24] (03Abandoned) 10Filippo Giunchedi: swift-account-stats.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670981 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:07:17] RECOVERY - Host an-worker1106 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [08:07:21] (03CR) 10Filippo Giunchedi: swift: enable listing_formats on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:07:26] elukey: thank you, appreciate it [08:07:58] <3 [08:09:02] (03CR) 10Elukey: [C: 03+1] swift: enable listing_formats on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:09:10] (03CR) 10Cathal Mooney: "> Patch Set 2:" (036 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [08:09:31] (03PS1) 10Volans: icinga-status: fix incompatibility with Py3.7 [puppet] - 10https://gerrit.wikimedia.org/r/705612 [08:09:46] (03PS4) 10Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches. 
[homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) [08:10:24] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: use addresses for memcached [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:10:30] (03PS6) 10Nikerabbit: Add stream configuration for ContentTranslation events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [08:10:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/705612 (owner: 10Volans) [08:10:36] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: enable listing_formats on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:11:11] (03CR) 10Volans: [C: 03+2] icinga-status: fix incompatibility with Py3.7 [puppet] - 10https://gerrit.wikimedia.org/r/705612 (owner: 10Volans) [08:17:44] (03CR) 10Ayounsi: [C: 03+1] "Thanks, LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [08:18:51] (03PS1) 10David Caro: prometheus: update alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/705615 [08:20:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/705615 (owner: 10David Caro) [08:20:40] (03CR) 10David Caro: "Tested by copy-pasting the query parameters from a current alert (`https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasour" [puppet] - 10https://gerrit.wikimedia.org/r/705615 (owner: 10David Caro) [08:21:12] (03CR) 10David Caro: [C: 03+2] prometheus: update alert dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/705615 (owner: 10David Caro) [08:21:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mw2352.codfw.wmnet [08:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:03] (03PS4) 10Muehlenhoff: Create an aqs-roots group, analogous to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [08:27:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2352.codfw.wmnet [08:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:57] (03PS2) 10DCausse: [cirrus] drop deprecated ores_articletopics config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661384 (https://phabricator.wikimedia.org/T273508) [08:35:10] (03CR) 10Muehlenhoff: "Ack, I've amended the patch to set Olja as approval contact. 
Will merge" [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [08:35:15] (03PS5) 10Muehlenhoff: Create an aqs-roots group, analogous to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [08:36:31] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1006, thanos-be1003, registry1003, registry2004, gitlab2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:36:55] (03CR) 10Muehlenhoff: [C: 03+2] Create an aqs-roots group, analogous to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [08:38:04] 10SRE, 10Platform Engineering, 10SRE-Access-Requests, 10Patch-For-Review: Root access to AQS cluster - https://phabricator.wikimedia.org/T285899 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This was approved in the IF meeting and I've merged the patch now. [08:38:31] (03CR) 10Btullis: Add a CNAME for analytics-presto.eqiad.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [08:42:37] (03PS2) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) [08:49:23] (03PS1) 10Joal: Add 5 minutes offset to gobblin webrequest timer [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) [08:51:02] elukey: If you have a minute, would you mind checking and possibly merging that one please (Andrew is off at the beginning of this week) --^? 
[08:55:57] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: registry1003, registry2004, labstore1006, gitlab2001, thanos-be1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:58:56] (03PS3) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) [09:09:11] (03PS2) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) [09:11:02] (03CR) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [09:11:21] (03CR) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [09:11:42] (03PS1) 10Jgiannelos: Add wmf-certificates dependency to docker images. [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 [09:13:10] (03PS2) 10Jgiannelos: Add wmf-certificates dependency to docker images. [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 [09:16:35] (03PS3) 10Jgiannelos: Add wmf-certificates dependency to container images. 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 [09:16:50] (03PS3) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) [09:17:48] (03PS1) 10Jbond: hiera cloud idp: Add the idp cloud as a service provider [puppet] - 10https://gerrit.wikimedia.org/r/705624 (https://phabricator.wikimedia.org/T286716) [09:18:10] (03PS4) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) [09:18:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30260/console" [puppet] - 10https://gerrit.wikimedia.org/r/705624 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [09:18:36] (03CR) 10Jgiannelos: "This patch is required so we can connect tegola in k8s to the swift cluster." [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 (owner: 10Jgiannelos) [09:19:00] 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Kormat) a:05Kormat→03Cmjohnson Hi @Cmjohnson, can you get us a new dimm please? Cheers :) [09:20:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] hiera cloud idp: Add the idp cloud as a service provider [puppet] - 10https://gerrit.wikimedia.org/r/705624 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [09:21:42] (03CR) 10MSantos: [C: 03+2] Add wmf-certificates dependency to container images. [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 (owner: 10Jgiannelos) [09:22:09] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [09:22:43] (03Merged) 10jenkins-bot: Add wmf-certificates dependency to container images. 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705623 (owner: 10Jgiannelos) [09:24:58] (03PS1) 10Jbond: WMCS branch: create a wmcs specific branch to add Delegated Authentication [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/705625 (https://phabricator.wikimedia.org/T286716) [09:26:44] PROBLEM - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/labs [09:28:57] (03PS1) 10JMeybohm: dragonfly::dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/705627 (https://phabricator.wikimedia.org/T286054) [09:29:33] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705628 [09:30:06] (03PS1) 10Kormat: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/705629 (https://phabricator.wikimedia.org/T286888) [09:30:28] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30262/console" [puppet] - 10https://gerrit.wikimedia.org/r/705627 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:30:38] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 [09:30:58] (03PS1) 10Majavah: metricsinfra: remove alertmanager from prometheus role [puppet] - 10https://gerrit.wikimedia.org/r/705632 (https://phabricator.wikimedia.org/T286335) [09:31:41] (03CR) 10Kormat: [C: 03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/705629 (https://phabricator.wikimedia.org/T286888) (owner: 10Kormat) [09:31:43] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon [puppet] - 
10https://gerrit.wikimedia.org/r/705627 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:32:10] jayme: is it safe to merge your puppet change? [09:32:19] kormat: yes, please include [09:32:20] (03CR) 10jerkins-bot: [V: 04-1] tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 (owner: 10Effie Mouzeli) [09:32:26] 10SRE, 10Traffic, 10Patch-For-Review: False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [09:32:32] jayme: 👍 [09:32:38] kormat: thanks [09:33:43] (03PS1) 10Volans: API change: use IcingaHosts instead of Icinga [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 [09:34:42] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 [09:36:56] (03PS1) 10Jgiannelos: Fix apt definition in blubber. [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705635 [09:39:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T281058 [09:39:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T281058 [09:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:14] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [09:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:54] 10SRE, 10SRE Observability: Puppet fail to properly refresh Icinga - https://phabricator.wikimedia.org/T184714 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Boldly resolving because I don't think we've seen this again, feel free to reopen [09:42:45] (03CR) 10MSantos: [C: 03+2] Fix apt definition in blubber. 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705635 (owner: 10Jgiannelos) [09:43:00] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/30261/prometheus2004.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [09:43:29] (03PS1) 10Volans: icinga-status: make mypy happy [puppet] - 10https://gerrit.wikimedia.org/r/705636 [09:43:37] (03CR) 10Jcrespo: [C: 03+1] "@godog what do you think about deploying this?" [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [09:43:50] (03Merged) 10jenkins-bot: Fix apt definition in blubber. [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705635 (owner: 10Jgiannelos) [09:44:50] (03CR) 10Elukey: Add 5 minutes offset to gobblin webrequest timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [09:46:57] (03PS1) 10JMeybohm: dragonfly: Enable dfdaemon on eqiad kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) [09:48:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30263/console" [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:50:24] (03CR) 10Joal: Add 5 minutes offset to gobblin webrequest timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [09:50:41] (03PS2) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705628 [09:51:46] (03PS2) 10Joal: Add 5 minutes offset to gobblin webrequest timer [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) [09:53:02] 
10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10aborrero) [09:53:17] (03PS3) 10Joal: Add 5 minutes offset to gobblin webrequest timer [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) [09:53:42] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705628 (owner: 10Jgiannelos) [09:54:37] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10aborrero) [09:54:41] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 [09:54:48] (03CR) 10jerkins-bot: [V: 04-1] tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 (owner: 10Effie Mouzeli) [09:56:14] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705628 (owner: 10Jgiannelos) [09:56:51] (03Abandoned) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705630 (owner: 10Effie Mouzeli) [09:57:08] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 [09:58:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Given we're not using eqiad to serve live traffic, this looks ok. 
I would like to couple this change with a deploy of an upgraded image th" [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:59:46] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 [09:59:47] RECOVERY - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad+prometheus/labs [10:01:15] (03CR) 10Jgiannelos: tegola-vector-tiles: allow connections to thanos-swift (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 (owner: 10Effie Mouzeli) [10:01:57] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 [10:03:24] (03PS1) 10Giuseppe Lavagetto: mwdebug: update two image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/705642 [10:03:45] (03CR) 10David Caro: "Can you test this and share the results?" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [10:05:12] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 (owner: 10Effie Mouzeli) [10:05:40] (03CR) 10Elukey: [C: 03+2] Add 5 minutes offset to gobblin webrequest timer [puppet] - 10https://gerrit.wikimedia.org/r/705621 (https://phabricator.wikimedia.org/T271232) (owner: 10Joal) [10:07:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [10:08:01] (03Merged) 10jenkins-bot: tegola-vector-tiles: allow connections to thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705640 (owner: 10Effie Mouzeli) [10:08:16] (03CR) 10Filippo Giunchedi: [C: 03+1] Move existing SLO dashboards towards common template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/699260 (owner: 10Herron) [10:09:00] (03CR) 10Volans: "> Patch Set 1:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [10:11:58] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[10:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] (03CR) 10David Caro: "> Patch Set 1:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [10:15:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: add servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/703835 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [10:18:49] (03Merged) 10jenkins-bot: mwdebug: add servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/703835 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [10:23:38] (03PS5) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) [10:24:00] (03PS2) 10JMeybohm: dragonfly: Enable dfdaemon on eqiad kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) [10:24:02] (03PS1) 10JMeybohm: prometheus::ops: Add jobs to scrape dragonfly supernodes [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) [10:26:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30264/console" [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:27:10] (03CR) 10Hashar: [C: 03+2] [WMF] its-phabricator: Urlencode POST to conduit [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [10:28:43] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:18] (03PS4) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 
(https://phabricator.wikimedia.org/T273642) [10:34:39] (03Merged) 10jenkins-bot: [WMF] its-phabricator: Urlencode POST to conduit [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [10:35:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps100[79].eqiad.wmnet [10:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:37] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10Joe) Hi and sorry for the late replies, just got back from my break and I'm catching up with the backlog. Just to... [10:39:40] (03CR) 10JMeybohm: [V: 03+1] "Will bring 28 new metrics (+ the go_ standard ones), some of them do have a medium cardinality as they include a unique label value per cl" [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:39:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: Allow non-roots to perform a rolling restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/703848 (owner: 10Giuseppe Lavagetto) [10:42:27] (03Merged) 10jenkins-bot: mwdebug: Allow non-roots to perform a rolling restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/703848 (owner: 10Giuseppe Lavagetto) [10:42:45] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1889 MB (7% inode=94%): /tmp 1889 MB (7% inode=94%): /var/tmp 1889 MB (7% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [10:43:31] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [10:43:32] !log 
hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps100[79].eqiad.wmnet [10:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:48] <_joe_> dcausse: can you take a look at elastic1039? [10:43:57] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:44:04] <_joe_> or if you're running some maintenance, just let me know :) [10:44:43] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:42] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [10:47:25] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) [10:47:41] (03PS1) 10Hashar: Update its-phabricator: Urlencode POST to conduit [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705650 (https://phabricator.wikimedia.org/T280197) [10:48:38] (03PS1) 10Jgiannelos: Temporary log all s3 SDK requests/responses [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 [10:49:13] (03CR) 10Jgiannelos: [C: 04-1] "Blocking merging this since its still WIP" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [10:50:17] (03CR) 10Jgiannelos: [C: 04-1] "This is a quick workaround that can give us a bit more clarity on whats happening with swift/s3 req/responses. 
We can revert after success" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [10:52:40] (03PS2) 10Hashar: Update its-phabricator: Urlencode POST to conduit [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705650 (https://phabricator.wikimedia.org/T280197) [10:53:48] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:05] (03CR) 10Jgiannelos: [C: 04-1] "A bit of context for this patch:" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [10:56:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10MoritzMuehlenhoff) 05Open→03Resolved Puppet has been fixed and the host rebooted, closing the racking task. Further setup of the mirror will happen via T286898 [10:57:08] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:39] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1100). [11:00:05] Phuedx, cjming, and DannyS712: A patch you scheduled for European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:30] I'm here - my patch is a typo fix [11:02:16] here as well - config change to turn off A/B test [11:03:17] I would also have two patches, if there is time for those [11:03:37] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [11:03:38] (03PS2) 10Volans: API change: use IcingaHosts instead of Icinga [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 [11:03:56] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:24] (03CR) 10Volans: "> Patch Set 1:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [11:05:29] o/ let’s start with cjming’s patch [11:05:54] (03PS2) 10Lucas Werkmeister (WMDE): Update config for language switching on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704867 (https://phabricator.wikimedia.org/T286459) (owner: 10Clare Ming) [11:06:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update config for language switching on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704867 (https://phabricator.wikimedia.org/T286459) (owner: 10Clare Ming) [11:06:33] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:53] (03Merged) 10jenkins-bot: Update config for language switching on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704867 (https://phabricator.wikimedia.org/T286459) (owner: 10Clare Ming) [11:07:36] cjming: your patch is on mwdebug2001, please test [11:08:17] oh, or did you want to deploy yourself? 
I see a deployment training in your phabricator profile ^^ [11:08:35] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) eqiad: | Host | Row | Host iface | switch iface| | lvs1013|**A**|enp4s0f0|xe-7/0/34| | lvs1014|A|enp4s0f1|xe-4/0/18| | lvs1015|A|enp5s0f0|xe-2/... [11:09:17] i just put in a request to get shell access so not sure i can just yet [11:09:34] ok [11:09:45] do you know how to test on mwdebug? [11:10:34] (03Abandoned) 10Reedy: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704611 (owner: 10RhinosF1) [11:10:37] (03Abandoned) 10Reedy: Localisation updates from https://translatewiki.net. [extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704612 (owner: 10RhinosF1) [11:13:12] Lucas_WMDE: cjming does and I'm testing as well :) [11:13:17] ok :) [11:14:57] can whoever is deploying take a look at my patch and let me know if I need to be around? There is nothing to test. Last time it wasn't deployed because I couldn't be here in time [11:16:58] Sam and I both tested and looks good! 
[11:17:26] alright, syncing [11:18:25] (03PS3) 10Muehlenhoff: Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 [11:18:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704867|Update config for language switching on pilot wikis (T286459)]] (duration: 00m 59s) [11:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:53] T286459: Turn off A/B test for language switching on pilot wikis - https://phabricator.wikimedia.org/T286459 [11:18:57] DannyS712: I can deploy your config change next [11:19:12] okay, I'm still here [11:19:14] (03PS3) 10Lucas Werkmeister (WMDE): Typo fix: "the the" -> "the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705107 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [11:19:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Typo fix: "the the" -> "the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705107 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [11:20:25] (03Merged) 10jenkins-bot: Typo fix: "the the" -> "the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705107 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [11:20:28] (03CR) 10Muehlenhoff: [C: 03+2] Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [11:21:19] syncing that one directly [11:22:13] (we still have the canaries, after all) [11:22:46] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:705107|Typo fix: "the the" -> "the" (T201491)]] (1/2, prod) (duration: 00m 57s) [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:54] T201491: Fix common typos in code - https://phabricator.wikimedia.org/T201491 [11:23:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:705107|Typo fix: "the the" -> "the" (T201491)]] 
(2/2, beta) (duration: 00m 56s) [11:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:16] zabe: I’m looking at your ckbwiki config change now [11:25:40] (03PS3) 10Lucas Werkmeister (WMDE): Add patroller group for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705498 (https://phabricator.wikimedia.org/T285221) (owner: 10Zabe) [11:26:50] thanks [11:27:56] zabe: it looks like some similar patches for other wikis also add the autopatrol right to the group; is it intentionally not added here? [11:28:27] (it’s in the autopatrolled group already, but IIUC it would make a difference if an admin promotes a user to patroller who isn’t autopatrolled yet) [11:29:20] on the other hand, some other wikis also only have 'patroller' => [ 'patrol' => true ], so I’d still be okay with merging this [11:29:22] just want to check first [11:29:32] (also, autopatrol could always be added later if it turns out to be necessary) [11:30:17] I left it out, because in their discussion they were talking about a 'New Page Reviewer' group, like in enwiki. And in enwiki it also doesn't contain autopatrol [11:30:45] alright, sure [11:30:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add patroller group for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705498 (https://phabricator.wikimedia.org/T285221) (owner: 10Zabe) [11:31:32] (03Merged) 10jenkins-bot: Add patroller group for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705498 (https://phabricator.wikimedia.org/T285221) (owner: 10Zabe) [11:32:36] zabe: your change is on mwdebug2001, can you test it? 
[11:32:44] doing [11:33:45] Lucas_WMDE: works the supposed way [11:34:36] ok, syncing [11:35:51] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:705498|Add patroller group for ckbwiki (T285221)]] (duration: 00m 57s) [11:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] T285221: Add patroller user group to Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T285221 [11:38:11] (03PS6) 10Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) [11:38:53] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Avoid using User::newFrom* methods (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:41:36] (03PS3) 10Zabe: Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 [11:41:58] Lucas_WMDE: hi, would it be possible to squeeze https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704996 in too? 
[11:42:01] (03PS4) 10Zabe: Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 [11:42:03] (03PS1) 10Jbond: C:apereo_cas: Add ability to support delegated authenticators [puppet] - 10https://gerrit.wikimedia.org/r/705657 (https://phabricator.wikimedia.org/T286716) [11:42:04] I can deploy it once you're done, or you can do it, up2you [11:42:25] urbanecm: feel free to do it right now, I still need to deploy the latest version of zabe’s change [11:42:39] Lucas_WMDE: okay, deploying it [11:42:41] should be quick [11:42:43] acak [11:42:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30266/console" [puppet] - 10https://gerrit.wikimedia.org/r/705657 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [11:42:47] (03PS2) 10Urbanecm: otrs_wikiwiki: Update logo to use VRT instead of OTRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704996 (https://phabricator.wikimedia.org/T280400) [11:42:50] *ack ^^ [11:42:50] (03CR) 10Urbanecm: [C: 03+2] otrs_wikiwiki: Update logo to use VRT instead of OTRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704996 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [11:43:30] (03CR) 10Zabe: Avoid using User::newFrom* methods (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:43:32] (03Merged) 10jenkins-bot: otrs_wikiwiki: Update logo to use VRT instead of OTRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704996 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [11:44:18] urbanecm: I added it to the calendar rtoo [11:44:22] thanks! 
[11:45:31] syncing [11:46:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:46:25] !log urbanecm@deploy1002 Synchronized static/images/project-logos: e52ae37dc2010ed2483328921a274e4934940791: otrs_wikiwiki: Update logo to use VRT instead of OTRS (T280400; 1/3) (duration: 00m 57s) [11:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:32] T280400: Change the user-visible domain of OTRS wiki - https://phabricator.wikimedia.org/T280400 [11:47:22] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: e52ae37dc2010ed2483328921a274e4934940791: otrs_wikiwiki: Update logo to use VRT instead of OTRS (T280400; 2/3) (duration: 00m 56s) [11:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:18] !log urbanecm@deploy1002 Synchronized logos/config.yaml: e52ae37dc2010ed2483328921a274e4934940791: otrs_wikiwiki: Update logo to use VRT instead of OTRS (T280400; 3/3) (duration: 00m 56s) [11:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:26] Lucas_WMDE: I'm done, thanks! [11:48:31] ok! 
[11:48:43] (03PS5) 10Lucas Werkmeister (WMDE): Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:48:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:49:34] (03Merged) 10jenkins-bot: Avoid using User::newFrom* methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705505 (owner: 10Zabe) [11:50:19] zabe: testing the change on mwdebug2001, feel free to test it as well [11:51:55] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [11:52:17] (03PS1) 10MSantos: maps: decrease tilerator CPU usage in imposm machines [puppet] - 10https://gerrit.wikimedia.org/r/705659 [11:52:26] everything seems okay to me, syncing [11:52:46] zabe: they can be synced in any order, right? [11:53:00] since wikitech.php doesn’t use the wmf function that changed its signature [11:53:41] I think so [11:54:25] ok [11:54:59] ^ godog puppet change applied cleanly on both backups and prometheus hosts, but please shout if you see something broken [11:55:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/wikitech.php: Config: [[gerrit:705505|Avoid using User::newFrom* methods]] (1/3) (duration: 00m 56s) [11:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:32] (03PS1) 10Jbond: P:idp: update profile to support delegated authenticators [puppet] - 10https://gerrit.wikimedia.org/r/705660 (https://phabricator.wikimedia.org/T286716) [11:55:40] (03PS1) 10Filippo Giunchedi: pontoon: initialize $_role on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/705661 [11:55:42] (03PS1) 10Filippo Giunchedi: pontoon: initialize user bare repo on bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/705662 [11:55:44] (03PS1) 10Filippo 
Giunchedi: pontoon: stop reading stack from hiera [puppet] - 10https://gerrit.wikimedia.org/r/705663 [11:55:45] jynus: ack, will do ! [11:55:47] (03PS1) 10Filippo Giunchedi: pontoon: create puppet client dir [puppet] - 10https://gerrit.wikimedia.org/r/705664 [11:55:49] (03PS1) 10Filippo Giunchedi: pontoon: add instructions [puppet] - 10https://gerrit.wikimedia.org/r/705665 [11:55:51] (03PS1) 10Filippo Giunchedi: pontoon: run puppet twice at enroll [puppet] - 10https://gerrit.wikimedia.org/r/705666 [11:55:53] (03PS1) 10Filippo Giunchedi: pontoon: always link hiera directory [puppet] - 10https://gerrit.wikimedia.org/r/705667 [11:56:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30267/console" [puppet] - 10https://gerrit.wikimedia.org/r/705660 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [11:56:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:705505|Avoid using User::newFrom* methods]] (2/3) (duration: 00m 56s) [11:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on maps1007.eqiad.wmnet with reason: Testing impact of tilerator [11:58:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on maps1007.eqiad.wmnet with reason: Testing impact of tilerator [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:33] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:705505|Avoid using User::newFrom* methods]] (3/3) (duration: 00m 56s) [11:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:43] !log EU config+backport window done [11:58:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:03] (03CR) 10Volans: "Thanks for the patch, few minor comments inline." (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:59:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=minio site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:00:26] ^that is expected until puppet runs on all hosts, although I will be checking if there are other issues [12:00:34] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Aklapper) @Volans (`#-sre-foundations`); @Joe, @jbond (`#-private`): Are these two Freenode channels closed/moved (see table above) by now? Thanks [12:03:45] _joe_: sorry just saw your ping about elastic1039 (it's related to T286497) [12:03:46] T286497: hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 [12:03:50] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Volans) @Aklapper AFAIK the closing of channels doesn't apply anymore, see https://meta.wikimedia.org/wiki/IRC/Migrating_to_Libera_Chat#Closing_Freenode_channels [12:04:16] <_joe_> dcausse: yeah sorry I saw that in the meantime [12:04:20] gehel: when you're around mind downtiming elastic1039? [12:04:57] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Aklapper) Does that mean this task should have `resolved` status? Or is there anything left to do? 
[12:05:11] (03CR) 10Filippo Giunchedi: "LGTM, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:09:47] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30269/console" [puppet] - 10https://gerrit.wikimedia.org/r/705659 (owner: 10MSantos) [12:11:37] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: decrease tilerator CPU usage in imposm machines [puppet] - 10https://gerrit.wikimedia.org/r/705659 (owner: 10MSantos) [12:12:51] (03CR) 10David Caro: [C: 03+1] "> Patch Set 1:" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [12:17:25] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:19:15] (03PS2) 10JMeybohm: Dragonfly is currently in evaluation phase but it might be wise to collect metrics right away to spot potential bottlenecks etc. [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) [12:19:18] (03PS3) 10JMeybohm: dragonfly: Enable dfdaemon on eqiad kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) [12:19:26] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) [12:19:56] (03CR) 10JMeybohm: Dragonfly is currently in evaluation phase but it might be wise to collect metrics right away to spot potential bottlenecks etc. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:20:12] I did something wrong for the new job, but because it is only affecting the new, not fully on production service, I will leave it as is for further debugging- working on it CC godog [12:20:37] (03CR) 10jerkins-bot: [V: 04-1] Dragonfly is currently in evaluation phase but it might be wise to collect metrics right away to spot potential bottlenecks etc. [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:20:54] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete, errors are gone. As a nice side effect we have the first Swift + B... [12:21:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30270/console" [puppet] - 10https://gerrit.wikimedia.org/r/705660 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [12:21:32] jynus: I'm not sure I understand, prometheus is not collecting or minio is not exporting or both ? 
[12:21:52] collection is failing [12:22:01] but I can manually curl from prometheus hosts [12:22:28] I need better understanding on what is failing :-) [12:22:29] (03PS3) 10JMeybohm: prometheus::ops: Add scraping config for dragonfly supernodes [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) [12:22:31] (03PS4) 10JMeybohm: dragonfly: Enable dfdaemon on eqiad kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) [12:22:38] jynus: ack, thanks, LMK how it goes [12:22:38] !log bump vcpus from 2 to 4 on ml-serve-ctrl VMs on Ganeti (load/cpu usage increased steadily since we deployed kubelets on them) [12:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:06] I will cry for help if I don't see it, but want to try to see it by myself first [12:23:12] !log reboot ml-serve-ctrl vms to pick up new vcores settings [12:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:apereo_cas: Add ability to support delegated authenticators [puppet] - 10https://gerrit.wikimedia.org/r/705657 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [12:24:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: update profile to support delegated authenticators [puppet] - 10https://gerrit.wikimedia.org/r/705660 (https://phabricator.wikimedia.org/T286716) (owner: 10Jbond) [12:26:49] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.519e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:27:33] RECOVERY - etcd request latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:27:47] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:29:03] thanos compact is me ^ [12:29:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:30:28] (03CR) 10JMeybohm: [C: 03+2] prometheus::ops: Add scraping config for dragonfly supernodes [puppet] - 10https://gerrit.wikimedia.org/r/705646 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:31:03] PROBLEM - Host ml-serve-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:17] RECOVERY - Host ml-serve-ctrl1002 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [12:34:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10ayounsi) Pushing the following (and similar on cr2) should do the trick. As it's only for a few days, and it would not be trivial... [12:35:10] godog, found it "http: TLS handshake error from 10.64.0.123:58274: remote error: tls: bad certificate" [12:35:29] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:35:45] RECOVERY - etcd request latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:35:49] will go for lunch, as this isn't urgent and we'll see where the misconfiguration happened [12:35:56] jynus: hah, curious but glad you found it. enjoy lunch [12:38:28] (03PS1) 10Jbond: apereo_cas: convert underscores to hyphens [puppet] - 10https://gerrit.wikimedia.org/r/705669 [12:42:56] (03PS2) 10Urbanecm: mediawiki/maintenance/growthexperiments.pp: Run updateMenteeData every day [puppet] - 10https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) [12:43:05] (03CR) 10Urbanecm: [C: 04-1] "not 100% ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: 10Urbanecm) [12:43:13] (03CR) 10Jbond: [C: 03+2] apereo_cas: convert underscores to hyphens [puppet] - 10https://gerrit.wikimedia.org/r/705669 (owner: 10Jbond) [12:44:02] !log installing systemd security updates on buster [12:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:39] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:50:01] RECOVERY - etcd request latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:59:27] (03CR) 10Kormat: [C: 03+2] mariadb: Migrate cron of check_private_data to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [13:01:44] (03CR) 10Jgiannelos: [C: 04-1] "Here is how the logs look like from my local tegola+swift env" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [13:02:23] (03PS1) 10DCausse: Revert "thanos-swift envoy listener: rewrite HTTP host header" [puppet] - 10https://gerrit.wikimedia.org/r/705480 [13:06:49] (03PS1) 10DCausse: rdf-streaming-updater: Do not use envoy for thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705671 (https://phabricator.wikimedia.org/T264006) [13:07:42] (03PS1) 10Jbond: O:idp: update idp to also pass the sshKey ldap attribute [puppet] - 10https://gerrit.wikimedia.org/r/705675 [13:08:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30272/console" [puppet] - 10https://gerrit.wikimedia.org/r/705675 (owner: 10Jbond) [13:10:07] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:idp: update idp to also pass the sshKey ldap attribute [puppet] - 10https://gerrit.wikimedia.org/r/705675 (owner: 10Jbond) [13:13:37] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=elastic1039.eqiad.wmnet [13:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T281058 [13:14:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T281058 [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:14:27] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [13:14:29] !log set/pooled=inactive on elastic1039 - disk failure - T285643 [13:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:38] T285643: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 [13:18:09] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:19:07] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:03] RECOVERY - etcd request latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:22:22] (03CR) 10Andrew Bogott: [C: 03+1] "seems better!" [puppet] - 10https://gerrit.wikimedia.org/r/701506 (https://phabricator.wikimedia.org/T285537) (owner: 10David Caro) [13:22:31] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:23:20] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.labs-ip-alias-dump: add a retry [puppet] - 10https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537) (owner: 10David Caro) [13:25:22] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T281058 [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T281058 [13:25:29] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [13:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps20(10|0[1-9]).codfw.wmnet [13:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:35] (03PS1) 10Jbond: hiera - cloud: add definition for idp02 [puppet] - 10https://gerrit.wikimedia.org/r/705681 [13:39:17] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Volans) AFAICT is pending ACLs and mode for the other remaining channels where we didn't had any operator. But I'm not authoritative on this, probably @Legoktm knows more [13:39:56] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add definition for idp02 [puppet] - 10https://gerrit.wikimedia.org/r/705681 (owner: 10Jbond) [13:40:11] (03CR) 10Volans: "> I'm guessing as non-root user?" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [13:40:30] (03CR) 10Effie Mouzeli: [C: 03+1] Temporary log all s3 SDK requests/responses [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [13:42:00] (03CR) 10RLazarus: [C: 03+1] "Whoops, sorry!" [puppet] - 10https://gerrit.wikimedia.org/r/705636 (owner: 10Volans) [13:42:45] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7204714, @JMeybohm wrote: > We also talked about using Istio Ingress in the past (envoy-based) which could be a good fit as well and we could... [13:43:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps200[89].codfw.wmnet [13:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [13:45:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [13:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:29] (03CR) 10Jgiannelos: [C: 03+2] Temporary log all s3 SDK requests/responses [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [13:47:32] (03Merged) 10jenkins-bot: Temporary log all s3 SDK requests/responses [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705651 (owner: 10Jgiannelos) [13:48:02] (03PS1) 10Zabe: Avoid using WikiPage::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705682 [13:49:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10cmooney) 
Looks good to me @ayounsi if you want to commit. I would totally agree btw, Netflow is probably handled in silicon, o... [13:50:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T281058 [13:50:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T281058 [13:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:09] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [13:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:28] (03PS1) 10Hnowlan: maps: reenable tilerator on codfw new cluster [puppet] - 10https://gerrit.wikimedia.org/r/705684 (https://phabricator.wikimedia.org/T269582) [13:50:50] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:17] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) [13:55:21] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705685 [13:55:49] (03PS1) 10Effie Mouzeli: admin: add tiller-flink ClusterRole in ci [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) [13:55:54] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet [13:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:39] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/705685 (owner: 10Jgiannelos) [13:56:44] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) Istio can be configured with native ingress resources, using the annotation: ` kubernetes.io/ingress.class: istio ` see https://istio.io/latest/docs/tasks/traffic-manageme... [13:59:01] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/705685 (owner: 10Jgiannelos) [13:59:06] (03PS2) 10Effie Mouzeli: admin: add tiller-flink ClusterRole in ci [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) [14:00:00] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ArielGlenn) [14:00:19] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[14:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:24] (03PS3) 10JMeybohm: admin: Bind tiller service account for ci to the tiller-flink ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) (owner: 10Effie Mouzeli) [14:01:35] (03CR) 10JMeybohm: [C: 03+1] admin: Bind tiller service account for ci to the tiller-flink ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) (owner: 10Effie Mouzeli) [14:02:11] (03CR) 10Effie Mouzeli: [C: 03+2] admin: Bind tiller service account for ci to the tiller-flink ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) (owner: 10Effie Mouzeli) [14:03:02] (03PS2) 10Volans: icinga-status: use None instead of False [puppet] - 10https://gerrit.wikimedia.org/r/705636 [14:03:15] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff) [14:03:17] (03PS1) 10Volans: icinga: adapt to newer icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/705687 [14:03:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:10] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/705636 (owner: 10Volans) [14:04:42] (03Merged) 10jenkins-bot: admin: Bind tiller service account for ci to the tiller-flink ClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/705686 (https://phabricator.wikimedia.org/T286646) (owner: 10Effie Mouzeli) [14:05:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:05:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705687 (owner: 10Volans) [14:07:08] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: registry1004, gitlab2001, labstore1006, thanos-be1003, registry1003, registry2004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [14:07:33] (03CR) 10Jelto: [V: 03+1] "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [14:07:44] PROBLEM - Check systemd state on db2116 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_lldpd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:00] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:07] (03CR) 10JMeybohm: [C: 03+2] dragonfly: Enable dfdaemon on eqiad kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/705639 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:09:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:44] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:06] (03CR) 10Herron: [V: 03+2 C: 03+2] Move existing SLO dashboards towards common template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/699260 (owner: 10Herron) [14:11:48] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:57] (03PS1) 10RLazarus: admin: Update jmads expiry to 2021-08-01, per email from mnovotny [puppet] - 10https://gerrit.wikimedia.org/r/705689 [14:12:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T281058 [14:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:20] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [14:12:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T281058 [14:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] (03PS1) 10Zabe: Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 [14:14:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: update two image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/705642 (owner: 10Giuseppe Lavagetto) [14:15:28] (03CR) 10jerkins-bot: [V: 04-1] Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 (owner: 10Zabe) [14:16:36] (03PS2) 10Zabe: Avoid using MWHttpRequest::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705690 [14:17:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [14:19:36] 10SRE, 10Analytics, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Deployment strategy and hardware requirement for new Flink 
based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Zbyszko) 05Open→03Resolved a:03Zbyszko Strategy was developed and is being implemented. [14:21:32] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [14:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [14:24:04] (03PS1) 10Jcrespo: prometheus: Add hosts_only false on minio job [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) [14:25:00] (03CR) 10Jcrespo: "I believe the scraping issue is that we are forced to use the full host name on config when using the https scheme." [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:25:15] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [14:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [14:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:36] (03CR) 10RLazarus: [C: 03+1] "LGTM, but ideally wait for Joe, just to make sure we're all on the same page." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) (owner: 10Filippo Giunchedi) [14:28:45] (03PS2) 10Jcrespo: prometheus: Add hosts_only=false on minio job [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) [14:30:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 18 hosts with reason: Deploying schema change to s8 T281058 [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 18 hosts with reason: Deploying schema change to s8 T281058 [14:30:20] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [14:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/705695 [14:33:19] 10SRE, 10Traffic, 10observability, 10Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (10lmata) [14:36:45] 10SRE, 10SRE Observability (FY2021/2022-Q2): node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (10lmata) [14:36:58] !log depool cp[1087-1090].eqiad.wmnet - T286069 [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:05] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 [14:37:05] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:37:31] 10SRE, 10Traffic, 10observability, 10Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - 
https://phabricator.wikimedia.org/T286554 (10fgiunchedi) The idea LGTM overall, something to lookout for though is that geo country in metric labels (if that's the implementation) could potent... [14:38:35] (03PS1) 10Ssingh: auditd: initial commit for the auditd module. [puppet] - 10https://gerrit.wikimedia.org/r/705696 [14:39:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705689 (owner: 10RLazarus) [14:40:36] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cp[1087-1090].eqiad.wmnet with reason: eqiad row D maintenance [14:40:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp[1087-1090].eqiad.wmnet with reason: eqiad row D maintenance [14:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:44] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) Of course, istio also offers its own custom resource definitions for a richer configuration: the istio gateway (https://istio.io/latest/docs/reference/config/networking/gat... [14:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:00] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 4 host(s) and their services with reason: eqiad row D maintenance ` cp[1... [14:41:06] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) [14:43:21] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Vgutierrez) [14:44:00] (03PS2) 10Ssingh: auditd: initial commit for the auditd module. 
[puppet] - 10https://gerrit.wikimedia.org/r/705696 [14:44:10] (03CR) 10Jcrespo: "I tested by manually editing the yaml file and this fixes the issue :-D: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targ" [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:44:28] godog, I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/705694 which I think fixes the issue [14:46:20] !log depool dns1002 - T286069 [14:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:27] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 [14:47:05] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:48:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dns1002.wikimedia.org with reason: eqiad row D maintenance [14:48:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dns1002.wikimedia.org with reason: eqiad row D maintenance [14:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` dns1... 
[14:49:44] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Vgutierrez) [14:49:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:15] ^^ that's us depooling dns1002 [14:50:27] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:50:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs1016.eqiad.wmnet with reason: eqiad row D maintenance [14:50:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs1016.eqiad.wmnet with reason: eqiad row D maintenance [14:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] (03CR) 10Ssingh: "This change is ready for review. PCC: https://puppet-compiler.wmflabs.org/compiler1003/30277/doh1001.wikimedia.org/index.html." [puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [14:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:22] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` lvs1... 
[14:51:40] !log depooled and scheduled downtime for kafka-main100[45] [14:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Do not use envoy for thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705671 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [14:53:19] (03CR) 10Jcrespo: "I am going to merge it now, as I tested it is the issue, @godog please review/update/move my edits on docs at a later time: https://wikite" [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:53:24] (03CR) 10Jcrespo: [C: 03+2] prometheus: Add hosts_only=false on minio job [puppet] - 10https://gerrit.wikimedia.org/r/705694 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:53:49] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:53:57] !log Start server-side upload for 7 large PNG files (T285708) [14:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:03] T285708: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T285708 [14:54:33] (03PS2) 10Giuseppe Lavagetto: mwdebug: update two image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/705642 [14:54:53] <_joe_> jouncebot: next [14:54:53] In 0 hour(s) and 5 minute(s): Switch buffer re-partition - Eqiad Row D(network change) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1500) [14:54:53] In 0 hour(s) and 5 minute(s): Switch buffer re-partition - Eqiad Row C(network change) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1500) [14:54:56] (03Merged) 10jenkins-bot: rdf-streaming-updater: Do not use envoy for thanos-swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/705671 (https://phabricator.wikimedia.org/T264006) 
(owner: 10DCausse) [14:54:58] <_joe_> urbanecm: ^^ [14:55:16] <_joe_> uh topranks / XioNoX wasn't just row D today? [14:55:40] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [14:55:44] _joe_: yes 100% it is only row D today. [14:55:55] _joe_: upps, I'm sorry. Missed that. My action only does upload large files (theoretically it can be done via web, too). [14:56:06] But i'm happy to stop the upload if you say so [14:56:23] <_joe_> urbanecm: it might fail, that's the only thing I wanted to tell you :) [14:56:36] yes, row d [14:56:39] My mistake on the calendar - updating now. [14:57:07] that's fine, the process is supposed to be failproof [14:57:42] fixed now. [14:59:04] (03PS1) 10Razzi: yarn: disable accepting jobs to queues [puppet] - 10https://gerrit.wikimedia.org/r/705698 (https://phabricator.wikimedia.org/T278423) [14:59:17] jouncebot: refresh [14:59:17] I refreshed my knowledge about deployments. [14:59:21] jouncebot: next [14:59:21] In 0 hour(s) and 0 minute(s): Switch buffer re-partition - Eqiad Row D(network change) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1500) [14:59:26] It's happy now [14:59:58] (03PS1) 10DCausse: rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 [15:00:04] topranks and XioNox: Dear deployers, time to do the Switch buffer re-partition - Eqiad Row D(network change) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1500). 
[15:00:05] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 (owner: 10DCausse) [15:00:25] eh, I never used that deployment bot [15:00:43] looks convenient [15:00:53] (03PS2) 10DCausse: rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 [15:01:05] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10Joe) Metrics can easily be collected with prometheus - in fact, istio ships with the correct annotations and thus should easily be picked up by our prometheus without adding any... [15:02:10] jynus: ack, thanks for the heads up [15:05:32] (03CR) 10Razzi: [C: 03+2] yarn: disable accepting jobs to queues [puppet] - 10https://gerrit.wikimedia.org/r/705698 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [15:06:14] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T281058 [15:06:18] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [15:06:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T281058 [15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:21] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:46] (03CR) 10RLazarus: [C: 03+2] admin: Update jmads expiry to 2021-08-01, per email from mnovotny [puppet] - 10https://gerrit.wikimedia.org/r/705689 (owner: 10RLazarus) [15:07:31] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers 
purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [15:08:42] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [15:09:34] (03CR) 10RLazarus: [C: 03+1] "Thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705687 (owner: 10Volans) [15:11:25] (03PS1) 10Dzahn: site/conftool/DHCP: decom mw1289, mw1290, mw1297 [puppet] - 10https://gerrit.wikimedia.org/r/705700 (https://phabricator.wikimedia.org/T280203) [15:11:47] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [15:12:26] (03CR) 10RLazarus: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/705636 (owner: 10Volans) [15:14:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1289.eqiad.wmnet [15:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [15:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1297.eqiad.wmnet [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:14] mutante: there is the row D maintenance going on right this moment [15:15:23] FYI [15:16:57] <_joe_> also depooling should use inactive [15:16:59] volans: oh, i'll stop if that is distracting. 
yep [15:19:15] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica1004.wikimedia.org [15:19:16] (03PS1) 10Filippo Giunchedi: puppetdb: rename stockpile mount var [puppet] - 10https://gerrit.wikimedia.org/r/705702 [15:19:18] (03PS1) 10Filippo Giunchedi: pontoon: update puppetdb::microsite [puppet] - 10https://gerrit.wikimedia.org/r/705703 [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:20] (03PS1) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/705704 [15:19:22] (03PS1) 10Filippo Giunchedi: puppetdb: wait for stockpile initialization instead of creating parents [puppet] - 10https://gerrit.wikimedia.org/r/705705 [15:19:24] (03PS1) 10Filippo Giunchedi: puppetdb: set permissions post-mount [puppet] - 10https://gerrit.wikimedia.org/r/705706 [15:21:11] !log pool cp[1087-1090].eqiad.wmnet - T286069 [15:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:17] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 [15:22:55] (03CR) 10DannyS712: [C: 03+1] Avoid using WikiPage::factory() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705682 (owner: 10Zabe) [15:23:01] !log pool dns1002 - T286069 [15:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:02] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:17] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:24] that's the repool of dns1002 :) [15:24:39] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/705702 (owner: 10Filippo Giunchedi) [15:25:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705703 (owner: 10Filippo Giunchedi) [15:28:21] (03CR) 10Volans: [C: 03+2] icinga-status: use None instead of False [puppet] - 10https://gerrit.wikimedia.org/r/705636 (owner: 10Volans) [15:29:03] (03CR) 10Volans: [C: 03+2] icinga: adapt to newer icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/705687 (owner: 10Volans) [15:29:28] (03CR) 10Volans: [C: 03+2] API change: use IcingaHosts instead of Icinga [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [15:32:57] (03PS1) 10Ssingh: rsyslog: send auditd/audispd logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/705707 [15:36:07] (03Merged) 10jenkins-bot: icinga: adapt to newer icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/705687 (owner: 10Volans) [15:36:09] (03Merged) 10jenkins-bot: API change: use IcingaHosts instead of Icinga [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705634 (owner: 10Volans) [15:37:28] (03CR) 10Jbond: puppetdb: wait for stockpile initialization instead of creating parents (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705705 (owner: 10Filippo Giunchedi) [15:41:57] (03PS3) 10Volans: sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 [15:42:00] (03PS3) 10Volans: sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 [15:42:02] (03PS2) 10Volans: sre.hosts.remove-downtime: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 [15:42:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:42:37] (03CR) 10Dzahn: [C: 03+1] sre.hosts.remove-downtime: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 (owner: 10Volans) [15:42:53] (Emergency syslog message) firing: Emergency syslog message - https://alerts.wikimedia.org [15:43:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: dumps NFS: failback dumps NFS to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/705417 (https://phabricator.wikimedia.org/T286600) (owner: 10Arturo Borrero Gonzalez) [15:44:56] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [15:45:58] (03CR) 10Effie Mouzeli: [C: 03+1] rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 (owner: 10DCausse) [15:46:47] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (10RLazarus) @thcipriani Can you approve for the deployment group please? (And, side note: the deployment group doesn't have an Approver[tm] written down yet -- that should be you,... [15:48:27] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:30] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (10RLazarus) [15:52:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) All works complete, no signs of any issues really, I had no ping loss on 16 pings towards 2 hosts connected off each member switch. Very h... 
[15:52:53] (Emergency syslog message) resolved: Emergency syslog message - https://alerts.wikimedia.org [15:53:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [15:57:19] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1289.eqiad.wmnet [15:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw129[07].eqiad.wmnet [15:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1289.eqiad.wmnet [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond42 and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1600). [16:00:05] DannyS712 and zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:41] o/ [16:02:03] 👋 looking [16:02:19] here [16:03:25] DannyS712: looks like yours is comments only, nothing to test after merging, correct? [16:03:40] zabe: anything you want to test after I merge yours? [16:03:44] yes indeed, nothing to merge [16:03:51] do you need me to stay around [16:03:59] no [16:04:12] kay, going ahead :) thanks both [16:04:23] (03CR) 10RLazarus: [C: 03+2] Typo fix: "the the" -> "the" [puppet] - 10https://gerrit.wikimedia.org/r/705096 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [16:05:11] zabe: actually, one thing -- can you first change the file resource to ensure=>absent? 
[16:05:21] that way puppet actually deletes the file, instead of ignoring it [16:05:29] then we can delete the resource in a second patch [16:05:54] (03PS1) 10Dzahn: typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 [16:06:05] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/705714" [puppet] - 10https://gerrit.wikimedia.org/r/705096 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [16:06:15] (03PS1) 10Brennen Bearnes: logging: format nginx access logs as JSON [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) [16:06:18] should just be file { '/usr/local/bin/sqldump': ensure => 'absent' } and you can delete all the other fields [16:06:51] (03CR) 10Jbond: puppetdb: set permissions post-mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705706 (owner: 10Filippo Giunchedi) [16:10:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1289.eqiad.wmnet [16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:26] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1289.eqiad.wmnet` - m... 
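Note on the two-patch sqldump removal discussed above: the reasoning is that a configuration agent only acts on resources that are declared, so deleting the declaration leaves the file on disk, while declaring it absent first removes it. The following is a toy Python model of that reasoning, not Puppet itself; `apply_catalog` and the `other-tool` path are hypothetical names used only for illustration.

```python
def apply_catalog(files_on_disk, catalog):
    """Toy model of one agent run: only paths declared in the catalog are touched."""
    result = set(files_on_disk)
    for path, ensure in catalog.items():
        if ensure == "absent":
            result.discard(path)  # declared absent: the file is deleted
        else:
            result.add(path)      # declared present: the file is created or kept
    return result


SQLDUMP = "/usr/local/bin/sqldump"
# The second path is a made-up neighbour, present only to show it is untouched.
disk = {SQLDUMP, "/usr/local/bin/other-tool"}

# Option A: simply drop the resource from the manifest. The file is now
# unmanaged, so the run ignores it and it lingers on disk.
after_drop = apply_catalog(disk, {})

# Option B (the two-patch approach): first declare the file absent...
after_absent = apply_catalog(disk, {SQLDUMP: "absent"})
# ...then a second patch removes the declaration; nothing is left to clean up.
after_cleanup = apply_catalog(after_absent, {})

print(SQLDUMP in after_drop)     # True
print(SQLDUMP in after_absent)   # False
print(SQLDUMP in after_cleanup)  # False
```

The model ignores everything a real agent does beyond create/delete (ownership, content, ordering); it only illustrates why the absent step comes before the resource is removed.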
[16:10:52] (03PS2) 10Brennen Bearnes: logging: format nginx access logs as JSON [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) [16:11:04] doing [16:11:24] ah thanks, was just commenting on the CR :) [16:11:25] (03PS1) 10Effie Mouzeli: admin: add comment about tillerClusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/705717 [16:11:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1290.eqiad.wmnet [16:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:56] (03PS5) 10Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches. [homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) [16:13:41] (03CR) 10DannyS712: "does whatever software this is using to check for typos work with multiple words?" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:13:58] (03PS6) 10Cathal Mooney: Adding 'quality-of-service' template for use on QFX/EX series switches. [homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) [16:14:15] rzl so am I done with my first puppet deployment window? [16:14:27] DannyS712: yep, thanks very much! [16:14:34] thanks for having me [16:15:53] (03CR) 10RLazarus: "> does whatever software this is using to check for typos work with multiple words?" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:16:23] (03PS2) 10Dzahn: typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 [16:16:45] (03CR) 10Dzahn: "testing if it adds " " around it or not by adding the bad string to README as an example" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:16:50] (03CR) 10jerkins-bot: [V: 04-1] typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:17:35] (03CR) 10Dzahn: "09:16:43 Typo found!" 
[puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:18:00] (03CR) 10Ayounsi: [C: 03+1] Adding 'quality-of-service' template for use on QFX/EX series switches. [homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:18:17] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 (owner: 10DCausse) [16:18:24] actually rzl is it too late to add another patch? [16:18:31] (03PS3) 10Dzahn: typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 [16:18:36] DannyS712: go for it [16:19:16] mutante: FYI it's grep -f so you should be all set: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/rake_modules/taskgen.rb#268 [16:19:20] (03PS1) 10Zabe: scap: set sqldump to ensure => 'absent' [puppet] - 10https://gerrit.wikimedia.org/r/705718 [16:20:06] (03PS4) 10Dzahn: typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 [16:20:17] rzl: thanks:) also PS2 vs PS3 shows it detects "the the" but not "the " [16:20:17] rzl https://gerrit.wikimedia.org/r/c/operations/puppet/+/705485 more typos [16:21:03] (03CR) 10Dzahn: "PS2 detected "the the" and voted -1, PS3 did not detect "the " and voted +2" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:21:28] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix indent of egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/705699 (owner: 10DCausse) [16:21:50] (03CR) 10RLazarus: [C: 03+2] logstash::output::kafka - fix typo "boostrap" [puppet] - 10https://gerrit.wikimedia.org/r/705485 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [16:21:53] rzl: https://gerrit.wikimedia.org/r/c/operations/puppet/+/705718 [16:21:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1290.eqiad.wmnet [16:21:54] (03PS5) 10DannyS712: 
typos: add "the the" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:02] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1290.eqiad.wmnet` - m... [16:22:13] DannyS712: you're fighting a valiant battle, but it's an uphill one ;) [16:22:19] zabe: thanks! one sec [16:22:22] (03CR) 10DannyS712: [C: 03+1] "PS5 removed addition of empty line to readme left over from testing in PS2, hope that was okay. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:23:09] rzl take a look at https://phabricator.wikimedia.org/T201491 for just how tall that hill is. I sent dozens of patches just for "the the" - https://gerrit.wikimedia.org/r/q/topic:%22typo-the-the%22+(status:open%20OR%20status:merged) [16:23:15] (03CR) 10Dzahn: "thanks for the edit, sure was ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/705714 (owner: 10Dzahn) [16:24:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1297.eqiad.wmnet [16:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:05] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
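Note on the typo check discussed above: the messages say the check uses `grep -f`, and that PS2 (containing "the the") was flagged while PS3 (containing only "the ") was not. That fits each line of the pattern file being treated as one whole pattern, spaces included. The sketch below emulates that behaviour with Python's `re` module. It is an illustration only, not the code in taskgen.rb; the function name `find_typos` and the sample strings are hypothetical, and real `grep` regex syntax differs from Python's in some details.

```python
import re


def find_typos(pattern_lines, text):
    """Return (line_number, pattern) pairs for every pattern matching a line of text.

    Each entry of pattern_lines is one whole pattern, as with a `grep -f` pattern
    file: a pattern containing a space is NOT split into separate words.
    """
    # Simplification: skip empty patterns (real grep treats an empty
    # pattern as matching every line).
    patterns = [p for p in pattern_lines if p]
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in patterns:
            if re.search(pattern, line):
                hits.append((lineno, pattern))
    return hits


# Hypothetical pattern list, one pattern per entry.
typos = ["the the", "boostrap"]

print(find_typos(typos, "restart the the service"))  # doubled word is flagged
print(find_typos(typos, "restart the service"))      # single "the " is not
```

Under these assumptions a multi-word entry such as "the the" only fires on the full doubled phrase, which matches the -1 on PS2 and the +2 on PS3 reported in the log.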
[16:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:28] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service,hadoop-hdfs-zkfc.service,hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:40] (03CR) 10RLazarus: [C: 03+2] "+2ing per the discussion on https://gerrit.wikimedia.org/r/692370 where the real review happened -- I asked Zabe to do the two-patch ensur" [puppet] - 10https://gerrit.wikimedia.org/r/705718 (owner: 10Zabe) [16:26:50] (03CR) 10Jbond: profile: restart postgres on first install / bootstrap (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [16:26:56] PROBLEM - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process [16:27:19] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [16:27:33] the above alert for hadoop is part of maintenance --^ [16:27:35] zabe: patch #1 merged, thank you! 
I'll wait 30 minutes to let Puppet run on the whole fleet, and it'll delete that file as it goes [16:28:03] zabe: if you can rebase patch #2 on top of it now, I can go ahead and merge it for you after that, no need to hang around [16:28:14] PROBLEM - Hadoop HDFS Zookeeper failover controller on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_ZKFC_process [16:29:32] (03PS6) 10Zabe: scap: Drop never-used 'sqldump' tool [puppet] - 10https://gerrit.wikimedia.org/r/692370 [16:29:41] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [16:31:31] (03PS1) 10DCausse: rdf-streaming-updater: use image version 2021-07-20-143040-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/705719 (https://phabricator.wikimedia.org/T264006) [16:31:53] (03PS5) 10DannyS712: Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508 [16:32:04] thanks [16:33:16] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [16:33:26] (03CR) 10Cathal Mooney: [C: 03+2] Adding 'quality-of-service' template for use on QFX/EX series switches. 
[homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:34:14] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:34:26] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:38] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:39] (03Merged) 10jenkins-bot: Adding 'quality-of-service' template for use on QFX/EX series switches. [homer/public] - 10https://gerrit.wikimedia.org/r/701499 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:34:40] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:00] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:01] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:05] rzl I looked to see if I 
had any other puppet patches open and https://gerrit.wikimedia.org/r/c/operations/puppet/+/676508 is from a few months ago. Any chance you can take a look? That one can be tested [16:35:06] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:06] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:06] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:08] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:09] PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:09] PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:10] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name 
java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:11] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:12] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:13] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:14] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:15] PROBLEM - Hadoop NodeManager on an-worker1109 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:16] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:17] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:18] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:19] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:20] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:21] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:22] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:23] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:26] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:28] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:28] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:29] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:38] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:41] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:41] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:48] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:50] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:51] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:52] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:54] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:58] PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:00] PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:01] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:01] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:01] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:02] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:04] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:04] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:05] PROBLEM - Hadoop NodeManager on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:06] elukey: ^ expected? 
[16:36:10] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:11] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:30] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:31] majavah: I wrote to #sre, maintenance from data engineering, too much spam but nothing on fire, people working on it [16:36:32] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:36] RECOVERY - Hadoop NodeManager on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:37] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use image version 2021-07-20-143040-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/705719 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [16:36:52] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:53] ah, of course I didn't look there that one time when it would have been relevant :P [16:37:18] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1297.eqiad.wmnet [16:37:38] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:46] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1297.eqiad.wmnet` - m... 
[16:37:58] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:10] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:12] (03CR) 10Dzahn: [C: 03+2] site/conftool/DHCP: decom mw1289, mw1290, mw1297 [puppet] - 10https://gerrit.wikimedia.org/r/705700 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [16:38:38] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:38] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:00] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:11] (03Merged) 10jenkins-bot: rdf-streaming-updater: use image version 2021-07-20-143040-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/705719 (https://phabricator.wikimedia.org/T264006) (owner: 10DCausse) [16:39:31] (03PS1) 10Volans: Class API: add on_error() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 [16:39:41] 10SRE, 10SRE-Access-Requests: Requesting access to 
deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (10thcipriani) >>! In T286961#7224931, @RLazarus wrote: > @thcipriani Can you approve for the deployment group please? (And, side note: the deployment group doesn't have an Approver... [16:39:50] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:07] DannyS712: traditionally we don't accept apache config changes in the puppet request window, because they're dangerous -- I can help you with this one but don't be surprised if others are rejected in future :) I'd like you to add an httpbb test for it, though, let me send you a pointer to how [16:40:24] (03PS2) 10Volans: Class API: add on_error() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 [16:40:56] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:57] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:58] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:58] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:08] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:11] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:56] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:00] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Language-Team (Language-2021-July-September): Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10abi_) A few things to do: 1. Message documentation [[ https://gerrit.wikimedia.org/r/c/operations/software/mailma... 
[16:42:10] DannyS712: https://wikitech.wikimedia.org/wiki/Httpbb is documentation on the tool, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/files/httpbb/appserver/test_redirects.yaml is the file you should add a test to, feel free to PM me with questions and don't worry if it ends up being after the end of the Puppet window, happy to review whenever [16:42:30] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:36] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:38] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:38] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:16] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:18] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:18] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:18] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:18] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:02] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:10] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:38] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:51] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[16:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:06] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:11] (03CR) 10RLazarus: "As discussed on IRC: happy to merge this with you (and no worries if it's outside the puppet request window) but please add an httpbb test" [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712) [16:45:42] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:42] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:43] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:55] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:07] Sorry for the hadoop spam all, we're working on it and in safe mode so no real risk here [16:46:38] (03PS1) 10Dzahn: site/conftool: add mw1437,mw1438 as canary jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/705721 (https://phabricator.wikimedia.org/T279309) 
[16:46:48] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:07] (03PS6) 10DannyS712: Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508 [16:47:14] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:48:01] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:48:10] (03PS1) 10Cathal Mooney: Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here. 
[homer/public] - 10https://gerrit.wikimedia.org/r/705722 (https://phabricator.wikimedia.org/T284592) [16:49:04] PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:14] RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:14] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:20] (03CR) 10DannyS712: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712) [16:49:26] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:32] (03CR) 10Volans: "This is a proposal, feel free to comment on the whole idea behind it, not only the implementation :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [16:49:48] PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:12] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:21] PROBLEM 
- Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:21] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:21] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:32] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:36] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:42] RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:51] DannyS712: ah, great -- I'm going to stop puppet on the appservers, merge and pull that to mwdebug2001 for you to test there (I'll verify with httpbb) [16:51:36] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:00] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:14] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:15] PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:15] I'm going to downtime the an-worker1*** and analytics1*** hosts to reduce the alert spam. [16:52:26] btullis: 🙏 [16:52:42] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:48] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:48] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:50] same testing as with operations/mediawiki-config changes? (i.e. use chrome extension to specify target server?) 
[16:52:53] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:58] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:01] DannyS712: yep, just so [16:53:02] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:03] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:03] btullis: if that helps you can use cumin aliases (as long as any valid query) in the downtime cookbook ;) [16:53:14] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:15] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 64 hosts with reason: dealing with an-master1001 rebuild issue [16:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:26] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:42] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 64 hosts with reason: dealing with an-master1001 rebuild issue [16:53:46] !log disabled puppet on A:mw to test https://gerrit.wikimedia.org/r/676508 [16:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:08] !sudo cookbook sre.hosts.downtime --minutes 60 -r 'dealing with an-master1001 rebuild issue' 'an-worker1*' [16:54:08] You have sudo in any project that you are a member of, excluding global projects (like bastion). Your sudo password is your wikitech wiki password. 
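The downtime invocation pasted above passes the host selection as its last argument, and the tip in channel was that any valid Cumin query, including an alias, can go there instead of a hostname glob. A minimal Python sketch of assembling that command line follows. It uses only the flags that appear in this log (`--minutes`, `-r`). The helper name `build_downtime_argv` is hypothetical, and the alias `A:hadoop-worker` is the one floated later in this log, not something verified here.

```python
import shlex


def build_downtime_argv(query, reason, minutes=60):
    """Build argv for the sre.hosts.downtime cookbook as it was invoked in the log.

    `query` is passed through untouched: a glob such as 'an-worker1*' or a
    Cumin alias such as 'A:hadoop-worker' (alias name is an assumption).
    """
    if not query.strip():
        raise ValueError("host query must not be empty")
    return [
        "sudo", "cookbook", "sre.hosts.downtime",
        "--minutes", str(minutes),
        "-r", reason,
        query,
    ]


argv = build_downtime_argv(
    "A:hadoop-worker",
    "dealing with an-master1001 rebuild issue",
)
# shlex.join quotes the reason so the line can be pasted into a shell.
print(shlex.join(argv))
```

Keeping the query as one opaque argument means the same helper covers both the glob used here and a role or alias query; whether a given alias exists is up to the Cumin configuration on the host.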
[16:54:23] (03CR) 10RLazarus: [C: 03+2] Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712) [16:54:25] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 20 hosts with reason: dealing with an-master1001 rebuild issue [16:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 20 hosts with reason: dealing with an-master1001 rebuild issue [16:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:40] lol wm-bot, btullis the cookbooks automatically log to SAL for you ;) [16:55:04] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:05] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:13] Oops. That was my first time running a cookbook outside of dry-run mode too. 
[16:55:28] that said, downtime doesn't prevent recoveries from showing up [16:55:28] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:32] RECOVERY - Hadoop HDFS Zookeeper failover controller on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_ZKFC_process [16:55:35] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:40] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:41] but it will prevent further problems from showing up [16:56:10] rzl let me know when to test [16:56:11] any existing problem that gets a recovery after a downtime still shows up [16:56:20] DannyS712: yep, running now, one more sec [16:56:45] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:50] technically you could prevent those by disabling notifications, but there is a risk of leaving them disabled afterwards, so we usually keep that as a last resort [16:57:07] DannyS712: okay, go ahead [16:57:42] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:57:50] (03CR) 10Cwhite: "Looks good! One issue I accidentally introduced, inline." (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) (owner: 10Brennen Bearnes) [16:58:00] (03CR) 10Ayounsi: [C: 03+1] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here. [homer/public] - 10https://gerrit.wikimedia.org/r/705722 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:58:38] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:58:44] rzl I tried to test but as far as I can tell nothing changed from the old code - its still sending me to Help:FAQ and then a MediaWiki redirect, instead of Special:MyLanguage/Manual:FAQ [16:58:50] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:58:54] (03CR) 10Cathal Mooney: [C: 03+2] Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here. 
[homer/public] - 10https://gerrit.wikimedia.org/r/705722 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:59:10] RECOVERY - Hadoop NodeManager on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:59:51] (03Merged) 10jenkins-bot: Previous change had incorrect file extention on the 'class-of-service' config template. Correcting here. [homer/public] - 10https://gerrit.wikimedia.org/r/705722 (https://phabricator.wikimedia.org/T284592) (owner: 10Cathal Mooney) [16:59:55] DannyS712: hmm [16:59:56] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:59:56] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:59:56] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:00:05] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1700). 
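The downtime behaviour described in channel a few minutes earlier (new problems are muted while a host is downtimed, but a recovery for a problem that already existed still gets announced) explains why RECOVERY lines keep arriving here. A toy Python model of that rule is sketched below. It only restates what was said in the channel and is not how the monitoring system itself is implemented; the `Event` type and `should_notify` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    host: str
    kind: str   # "PROBLEM" or "RECOVERY"
    time: int   # seconds, on any consistent clock


def should_notify(event, downtime_start, downtime_end):
    """Return True if the event would be announced under the rule from the chat.

    Inside the downtime window PROBLEM events are muted; RECOVERY events
    are always announced, which matches the recoveries still seen in the log.
    """
    in_downtime = downtime_start <= event.time < downtime_end
    if event.kind == "PROBLEM" and in_downtime:
        return False
    return True


# A one-hour downtime starting at t=50 (values are illustrative only).
start, end = 50, 50 + 3600
early_problem = Event("an-worker1105", "PROBLEM", 10)
muted_problem = Event("an-worker1105", "PROBLEM", 100)
recovery = Event("an-worker1105", "RECOVERY", 200)
print([should_notify(e, start, end) for e in (early_problem, muted_problem, recovery)])
```

Muting recoveries as well would need notifications disabled outright, which the channel notes is kept as a last resort because they can be left off by mistake.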
[17:00:26] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:00:36] DannyS712: you're testing on mwdebug2001? I'm getting the right result from httpbb [17:01:05] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:09] (and the wrong result on mwdebug2002, as expected since we haven't updated it ye) [17:01:10] *yet [17:01:12] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:38] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:53] yes, 2001 [17:01:55] RECOVERY - Check systemd state on an-worker1136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:03] but if its working for you I'll assume its an issue on my end [17:02:06] accraze: we're running slightly over the last window with a puppet deploy, but it'll be fine to coexist with your deploy, feel free to carry on [17:02:06] RECOVERY - Hadoop ResourceManager on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process [17:02:06] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:02:07] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:02:08] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:02:08] RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:02:40] (03PS3) 10Brennen Bearnes: logging: format nginx access logs as JSON [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) [17:03:13] DannyS712: oh you know what, I bet you're getting that 301 from the CDN [17:03:19] rzl: no deploy needed on our end, but thanks for the info :) [17:03:23] or no, that can't be right [17:03:28] (03CR) 10Brennen Bearnes: logging: format nginx access logs as JSON (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) (owner: 10Brennen Bearnes) [17:03:32] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:03:37] I'm getting the same though, hm [17:03:52] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:04:22] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:22] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:04:23] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:04:42] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:04:55] oh weird, something must be wrong with my mwdebug extension, I'm getting 'server: mw2337.codfw.wmnet' [17:05:09] okay I'll debug that later, I'm going to trust the httpbb result and re-enable puppet [17:05:26] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:05:38] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Move wikireplica service off of clouddb1019/1020" 
[puppet] - 10https://gerrit.wikimedia.org/r/705501 (https://phabricator.wikimedia.org/T286598) (owner: 10Andrew Bogott) [17:05:42] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:05:57] volans: thanks for the tip. So I should have called the downtime cookbook with the argument 'O:analytics_cluster::hadoop::worker' instead of a hostname glob, right? [17:06:15] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:06:26] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:06:34] RECOVERY - Check systemd state on an-worker1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:35] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:06:38] !log enabled puppet on A:mw [17:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:45] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:52] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:04] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:07:48] btullis: yes, you could have that way, or any more complex query [17:07:50] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:14] DannyS712: all done! it'll be live on all appservers as puppet rolls out over the next 30 minutes [17:08:15] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:23] Or even just 'A:hadoop-worker' ? [17:08:26] and I need to get something to eat, so I'm declaring the puppet window Definitely Closed :) [17:08:28] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:08:50] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:08:50] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:08:50] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:08:51] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:08:52] RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:54] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Andrew) [17:11:02] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:13:12] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:13:12] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:26] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:26] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:17:01] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) (owner: 10Brennen Bearnes) [17:19:52] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:19:52] RECOVERY - Hadoop NodeManager on analytics1064 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:21:58] RECOVERY - Hadoop NodeManager on an-worker1109 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:25:23] (03CR) 10RLazarus: [C: 03+2] scap: Drop never-used 'sqldump' tool [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [17:35:00] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:53:41] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Bstorm) [17:55:07] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-master1001.eqiad.wmnet with reason: REIMAGE [17:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:19] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-master1001.eqiad.wmnet with reason: 
REIMAGE [17:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:51] (03PS1) 10Jgreen: Switch fundraising.wm.o CNAME to point to frdata-eqiad.wm.o [dns] - 10https://gerrit.wikimedia.org/r/705728 (https://phabricator.wikimedia.org/T255435) [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1800) [18:03:50] (03CR) 10Jgreen: [C: 03+2] Switch fundraising.wm.o CNAME to point to frdata-eqiad.wm.o [dns] - 10https://gerrit.wikimedia.org/r/705728 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [18:05:04] !log authdns-update to point fundraising.wm.o CNAME to a new server [18:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:41] (03PS1) 10Cwhite: logstash: complete restbase transition to ECS [puppet] - 10https://gerrit.wikimedia.org/r/705729 (https://phabricator.wikimedia.org/T234565) [18:12:15] (03PS1) 10Razzi: yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/705732 (https://phabricator.wikimedia.org/T278423) [18:21:04] (03CR) 10Razzi: [C: 03+2] yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/705732 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [18:45:57] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10brennen) [18:54:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr Hi @Kormat - Chris is out this week, so moving over to @Jclark-ctr for him to check out this machine. 
(under warranty thru Nov 2023) Thanks, Willy
[18:57:20] (CR) Milimetric: "Similarly, this was tagging the wrong bug, perhaps the intended one was https://phabricator.wikimedia.org/T164456" [puppet] - https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: Muehlenhoff)
[18:57:36] !log Start server-side upload for 4 large PNG files (T285708)
[18:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:43] T285708: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T285708
[19:00:05] dancy and brennen: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1900).
[19:07:53] (PS9) Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133)
[19:09:20] (CR) jerkins-bot: [V: -1] Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: Juan90264)
[19:09:26] (PS10) Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133)
[19:09:57] (PS6) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[19:10:05] (CR) jerkins-bot: [V: -1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[19:10:57] (CR) jerkins-bot: [V: -1] Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170
(https://phabricator.wikimedia.org/T286133) (owner: Juan90264)
[19:11:43] (PS7) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[19:17:56] (CR) jerkins-bot: [V: -1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[19:20:13] (PS1) Jgreen: adjust monitoring for frdata and payments-listener roles [puppet] - https://gerrit.wikimedia.org/r/705735 (https://phabricator.wikimedia.org/T255435)
[19:21:49] (CR) Jgreen: [C: +2] adjust monitoring for frdata and payments-listener roles [puppet] - https://gerrit.wikimedia.org/r/705735 (https://phabricator.wikimedia.org/T255435) (owner: Jgreen)
[19:22:07] (CR) RLazarus: "The jenkins failure looks unrelated to this change? Looks like the relevant part is" (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[19:22:25] (PS11) Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133)
[19:24:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:43] (CR) jerkins-bot: [V: -1] Adding and use square wordmark for trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: Juan90264)
[19:33:56] (CR) Volans: "recheck" [software/spicerack] -
https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[19:34:40] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Janina Abrams - https://phabricator.wikimedia.org/T286927 (RLazarus) Just noticed @Ottomata is out of office. @odimitrijevic can you approve this for Analytics, or should...
[19:38:46] (PS1) RLazarus: admin: Upgrade cjming from ldap_only_users to users, add to deployment [puppet] - https://gerrit.wikimedia.org/r/705736 (https://phabricator.wikimedia.org/T286961)
[19:39:00] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (RLazarus)
[19:48:28] (CR) Volans: "> Patch Set 7:" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[19:53:46] (CR) Ssingh: [C: +2] admin: Upgrade cjming from ldap_only_users to users, add to deployment [puppet] - https://gerrit.wikimedia.org/r/705736 (https://phabricator.wikimedia.org/T286961) (owner: RLazarus)
[19:55:13] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01657 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[20:01:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (RLazarus) Open→Resolved a:RLazarus @cjming You're all set! Wait 30 minutes for the change to be rolled out everywhere, then you should have depl...
[20:04:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:05:25] (PS1) Urbanecm: updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/705748 (https://phabricator.wikimedia.org/T285811)
[20:06:49] (PS1) Urbanecm: updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705749 (https://phabricator.wikimedia.org/T285811)
[20:07:30] jouncebot: now
[20:07:30] For the next 0 hour(s) and 52 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T1900)
[20:18:33] urbanecm: I don't expect any train operations to happen during the upcoming window so you're welcome to do what you need to do.
[20:18:59] Thanks, that's very welcomed :)
[20:19:13] (CR) Urbanecm: [C: +2] updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/705748 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:19:22] (CR) Urbanecm: [C: +2] updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705749 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:27:55] (PS8) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[20:33:17] (CR) jerkins-bot: [V: -1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:34:03]
(PS9) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[20:40:28] (CR) RLazarus: "> Yes this is an old "bug" that happens *only* in CI, we weren't able to repro locally in any way." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:40:44] (Merged) jenkins-bot: updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/705748 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:40:46] (Merged) jenkins-bot: updateMenteeData: Make it possible to disable script per-wiki [extensions/GrowthExperiments] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705749 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:45:49] (CR) RLazarus: [C: +1] "LGTM - not sure if you answered the suggestion about using \bthe the\b so this doesn't trigger on e.g. "the theory" but I'm happy either w" [puppet] - https://gerrit.wikimedia.org/r/705714 (owner: Dzahn)
[20:46:53] (PS1) Reedy: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705750 (https://phabricator.wikimedia.org/T286679)
[20:47:30] (PS1) Urbanecm: Set wgGEMentorDashboardBackendEnabled properly [mediawiki-config] - https://gerrit.wikimedia.org/r/705740 (https://phabricator.wikimedia.org/T285811)
[20:48:00] (PS1) Reedy: Localisation updates from https://translatewiki.net. [extensions/VisualEditor] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705751 (https://phabricator.wikimedia.org/T286679)
[20:48:16] (Abandoned) Reedy: Localisation updates from https://translatewiki.net.
[extensions/VisualEditor] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705750 (https://phabricator.wikimedia.org/T286679) (owner: Reedy)
[20:49:30] (PS2) Urbanecm: Set wgGEMentorDashboardBackendEnabled properly [mediawiki-config] - https://gerrit.wikimedia.org/r/705740 (https://phabricator.wikimedia.org/T285811)
[20:49:40] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/GrowthExperiments/maintenance/updateMenteeData.php: dafd953eb5cd35bddbd2fd348b03066420a42362: updateMenteeData: Make it possible to disable script per-wiki (T285811) (duration: 00m 58s)
[20:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:48] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811
[20:50:02] (CR) Urbanecm: [C: +2] Set wgGEMentorDashboardBackendEnabled properly [mediawiki-config] - https://gerrit.wikimedia.org/r/705740 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:50:51] (Merged) jenkins-bot: Set wgGEMentorDashboardBackendEnabled properly [mediawiki-config] - https://gerrit.wikimedia.org/r/705740 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:53:16] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: caa5a076f39b051b01622aa3e4c9d716a8643eef: Set wgGEMentorDashboardBackendEnabled properly (T285811) (duration: 00m 57s)
[20:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:29] (CR) Urbanecm: [C: +1] "this is ready to go from Growth's perspective" [puppet] - https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:57:07] (CR) Urbanecm: [C: +1] "tagging few SREs to review this" [puppet] - https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[20:58:05] SRE, GrowthExperiments-MentorDashboard, Growth-Team (Current Sprint),
MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), and 2 others: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 (Urbanecm_WMF) Tagging #sre, as I need someone to merge https://ger...
[20:59:45] (PS1) Urbanecm: labs: Enable mentor dashboard backend everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/705742 (https://phabricator.wikimedia.org/T285811)
[21:00:01] (CR) Urbanecm: [C: +2] labs: Enable mentor dashboard backend everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/705742 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[21:00:42] (Merged) jenkins-bot: labs: Enable mentor dashboard backend everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/705742 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[21:00:59] not syncing this one, as it's labs only
[21:02:52] i'm done, at least for now :)
[21:14:55] (CR) Urbanecm: [C: -1] GrowthExperiments: Add more wikis to linkrecommendation experiment (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[21:15:55] (CR) RLazarus: [C: +1] "LGTM from a puppet standpoint, but I don't know anything about the script -- if you don't mind getting a +1 from someone on the team, I'll" [puppet] - https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[21:18:03] rzl: thanks for the review, ack. I'll get a +1 from another Growth engineer, too 🙂
[21:19:16] urbanecm: thanks! sorry to bounce you
[21:19:48] no problem, I totally understand that :).
[21:19:59] (and FYI I'll be online about another 40m today, otherwise I'll check first thing tomorrow)
[21:22:09] ack
[21:23:20] (CR) Urbanecm: [C: -1] "CR-2 by Kosta removed, approval was received" [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[21:25:06] (PS2) Urbanecm: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[21:25:52] (PS3) Urbanecm: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[22:11:59] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 861311 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[22:19:45] SRE, SRE-Access-Requests: Requesting access to deployment shell access for cjming - https://phabricator.wikimedia.org/T286961 (cjming) @RLazarus looks like I'm all set - thanks so much!
[23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window. Your patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210720T2300).
[23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:46:14] (PS1) Jdlrobson: Revert "Prepare for MediaWiki UI version 2" [extensions/MultimediaViewer] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705755