[04:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:16:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:26:02] PROBLEM - Host db1187.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:30:16] PROBLEM - Host db1186.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:37:25] (PS1) Andrew Bogott: nova.conf: use the default rpc_timeout value [puppet] - https://gerrit.wikimedia.org/r/820597
[04:39:05] (CR) Andrew Bogott: [C: +2] nova.conf: use the default rpc_timeout value [puppet] - https://gerrit.wikimedia.org/r/820597 (owner: Andrew Bogott)
[04:39:11] (PS2) Andrew Bogott: nova.conf: use the default rpc_timeout value [puppet] - https://gerrit.wikimedia.org/r/820597
[04:40:56] RECOVERY - Host db1186.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[04:54:09] (PS1) Andrew Bogott: Revert "Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS" [puppet] - https://gerrit.wikimedia.org/r/820598 (https://phabricator.wikimedia.org/T314522)
[04:55:09] (CR) Andrew Bogott: [C: +2] Revert "Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS" [puppet] - https://gerrit.wikimedia.org/r/820598 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[05:00:01] RECOVERY - Host db1187.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[05:57:29] RECOVERY - DNS on db1186.mgmt is OK: DNS OK: 0.015 seconds response time. db1186.mgmt.eqiad.wmnet returns 10.65.2.255 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:31] RECOVERY - DNS on db1187.mgmt is OK: DNS OK: 0.009 seconds response time. db1187.mgmt.eqiad.wmnet returns 10.65.3.0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:59:55] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:11:44] (CR) Muehlenhoff: [C: +1] "Looks good! We can keep the pre-existing hooks as convenience wrappers for now, since several are them are listed in existing documentatio" [puppet] - https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) (owner: Jbond)
[06:13:03] (PS2) Muehlenhoff: C:package_builder: Allow users to add a specific component to the build [puppet] - https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) (owner: Jbond)
[06:13:12] (PS2) Muehlenhoff: maps: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/812232
[06:16:57] (CR) Muehlenhoff: [C: +2] maps: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/812232 (owner: Muehlenhoff)
[06:19:05] (PS1) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[06:19:24] (PS3) Muehlenhoff: ircecho: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013)
[06:19:55] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:23:10] (CR) Slyngshede: "Starting point for LDAP wrapper library." [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[06:56:04] ops-codfw: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (jcrespo)
[06:57:42] ops-codfw, DC-Ops: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (jcrespo) Task similar to T314559, I wonder if just a loose cable or the power supplies died.
[06:59:03] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:59:21] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220805T0700)
[07:01:22] (PS1) Ayounsi: Netbox: add CSP headers [puppet] - https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356)
[07:04:29] (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36628/" [puppet] - https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356) (owner: Ayounsi)
[07:07:01] (CR) Muehlenhoff: [C: +2] ircecho: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:09:07] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (ayounsi) Similarly this has been alerting for 1d15h for failed PSU https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=es2021&service=IPMI+Sensor+Status For the rec...
[07:10:25] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:12:16] (PS1) Zabe: Start writing to cuc_actor everywhere except s4 and s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/820646 (https://phabricator.wikimedia.org/T233004)
[07:35:30] (PS1) Jcrespo: Fix privacy policy check for the mobile wiki [puppet] - https://gerrit.wikimedia.org/r/820647
[07:47:38] (CR) Jcrespo: "Tested manually:" [puppet] - https://gerrit.wikimedia.org/r/820647 (owner: Jcrespo)
[08:00:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:05:01] (PS1) Muehlenhoff: geoip: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820650
[08:08:16] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820650 (owner: Muehlenhoff)
[08:11:59] PROBLEM - Ensure legal html en.m.wp on en.m.wikipedia.org is CRITICAL: a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Privacy_policy\sclass=extiw\stitle=wmf:Privacy\spolicyPrivacy policy/a html not found https://phabricator.wikimedia.org/project/members/28/
[08:12:11] (PS1) Muehlenhoff: sbuild: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820652
[08:16:02] (PS2) Muehlenhoff: geoip: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820650
[08:16:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:16:54] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820652 (owner: Muehlenhoff)
[08:18:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820650 (owner: Muehlenhoff)
[08:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:37:11] (PS1) Jaime Nuche: scap gitignore: ignore all files under the `scap` directory [mediawiki-config] - https://gerrit.wikimedia.org/r/820653
[08:41:29] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:34] (CR) Majavah: [C: +1] "looks good, we don't have any stretch builder hosts in toolforge" [puppet] - https://gerrit.wikimedia.org/r/820652 (owner: Muehlenhoff)
[08:42:08] (PS1) Muehlenhoff: dnsrecursor: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820654
[08:42:38] (CR) Muehlenhoff: [C: +2] sbuild: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820652 (owner: Muehlenhoff)
[08:45:02] (CR) Muehlenhoff: [C: +2] lxc: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/812303 (owner: Muehlenhoff)
[08:47:39] (PS1) Muehlenhoff: gitlab_runner: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820656
[08:49:32] (PS1) Muehlenhoff: puppetdb: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820658
[08:50:18] (CR) Jbond: [C: -1] "lgtm but a couple of nits and a bug" [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[08:52:51] (CR) Jbond: Bootstrap etcd on the dse_k8s_etcd cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: Btullis)
[08:55:18] (CR) Jbond: [C: +2] wmflib:: new Wmflib::Dns:Srv type [puppet] - https://gerrit.wikimedia.org/r/820508 (owner: Jbond)
[08:56:01] (PS1) Muehlenhoff: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659
[08:57:03] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820654 (owner: Muehlenhoff)
[08:58:18] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (Aline_Bruenger_WMDE) In progress→Resolved @Dzahn I can now access the tables I need, thanks a lot!
[09:03:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:03:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:03:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[09:04:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[09:05:25] (CR) Jbond: C:package_builder: Allow users to add a specific component to the build (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) (owner: Jbond)
[09:05:27] (CR) Jbond: [C: +2] C:package_builder: Allow users to add a specific component to the build [puppet] - https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) (owner: Jbond)
[09:09:32] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[09:27:56] (PS1) Phuedx: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - https://gerrit.wikimedia.org/r/820666
[09:28:32] (PS2) Muehlenhoff: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659
[09:30:42] (PS2) Phuedx: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - https://gerrit.wikimedia.org/r/820666
[09:31:28] (CR) CI reject: [V: -1] mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[09:36:26] (PS2) Ayounsi: Netbox: add hourly postgres backups [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677)
[09:38:13] (CR) Ayounsi: Netbox: add hourly postgres backups (3 comments) [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[09:39:36] (PS3) Muehlenhoff: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659
[09:40:34] (PS2) Hnowlan: jobqueue: increase num_workers to 4 [deployment-charts] - https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914)
[09:42:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[09:44:00] (CR) jenkins-bot: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[09:44:09] (PS1) Hnowlan: jobqueue: drop memory requirement [deployment-charts] - https://gerrit.wikimedia.org/r/820663
[09:54:51] (CR) Giuseppe Lavagetto: [C: +1] jobqueue: drop memory requirement [deployment-charts] - https://gerrit.wikimedia.org/r/820663 (owner: Hnowlan)
[09:59:42] (CR) Hnowlan: [C: +2] jobqueue: drop memory requirement [deployment-charts] - https://gerrit.wikimedia.org/r/820663 (owner: Hnowlan)
[10:00:17] (PS1) Jcrespo: Improve logic and quality of life for local backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/820664
[10:01:02] (CR) CI reject: [V: -1] Improve logic and quality of life for local backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/820664 (owner: Jcrespo)
[10:02:05] (PS2) Jcrespo: Improve logic and quality of life for local backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/820664
[10:02:20] (CR) CI reject: [V: -1] wikimedia-org: add soundlogo.wm.org [dns] - https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) (owner: RhinosF1)
[10:02:55] (Merged) jenkins-bot: jobqueue: drop memory requirement [deployment-charts] - https://gerrit.wikimedia.org/r/820663 (owner: Hnowlan)
[10:02:57] (CR) CI reject: [V: -1] Improve logic and quality of life for local backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/820664 (owner: Jcrespo)
[10:05:20] (PS3) Jcrespo: Improve logic and quality of life for local backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/820664
[10:05:42] (CR) RhinosF1: "recheck" [dns] - https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) (owner: RhinosF1)
[10:07:15] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[10:08:01] (CR) Jcrespo: [C: -1] "It says certspotter user?" [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[10:10:11] (CR) JMeybohm: P:base::firewall: Add requestctl definitions to ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[10:11:06] (PS4) Muehlenhoff: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659
[10:11:10] (CR) Jcrespo: "I have sent https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/820664 which will make disabling the errors unnecessary." [puppet] - https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[10:11:36] (CR) Muehlenhoff: mediabackup: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[10:12:21] (CR) CI reject: [V: -1] mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[10:12:52] !log dbmaint at s4@codfw (T312863)
[10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[10:13:20] (PS5) Muehlenhoff: mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659
[10:13:37] (CR) Jcrespo: "+1 after indentation fixes" [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[10:14:34] (CR) Jcrespo: [C: +1] "Always touching my requires :-D" [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[10:17:30] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[10:21:20] (PS1) JMeybohm: Remove statsd exporter port from networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/820686
[10:22:55] (CR) Jbond: "have given a first pass, general structure looks good to me" [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[10:24:24] (CR) Jbond: [C: +1] geoip: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820650 (owner: Muehlenhoff)
[10:25:01] (CR) Jbond: [C: +1] puppetdb: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820658 (owner: Muehlenhoff)
[10:25:50] (CR) JMeybohm: [C: +1] "Apart from me missing the statsd port in networkpolicy, this looks good to me" [deployment-charts] - https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: Ori)
[10:26:01] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[10:26:33] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[10:27:10] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) @Dzahn, there is a -ctr email: maslam-ctr@wikimedia.org , would that solve the problem?
[10:28:21] (PS5) Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825)
[10:29:09] (PS6) Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825)
[10:29:10] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (RhinosF1) >>! In T292955#8133322, @diego wrote: > @Dzahn, there is a -ctr email: maslam-ctr@wikimedia.org , would that solve the problem? This is not linked to the...
[10:29:49] (CR) Jbond: P:base::firewall: Add requestctl definitions to ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[10:29:59] (CR) Jbond: P:base::firewall: Add requestctl definitions to ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[10:30:39] (PS1) Jcrespo: Move s4 eqiad snapshots from db1150 to db1145 [puppet] - https://gerrit.wikimedia.org/r/820687
[10:30:53] (PS2) Jcrespo: Move s4 eqiad snapshots from db1150 to db1145 [puppet] - https://gerrit.wikimedia.org/r/820687
[10:31:36] (PS7) Jbond: P:base::firewall: Add requestctl definitions to ferm [puppet] - https://gerrit.wikimedia.org/r/817307 (https://phabricator.wikimedia.org/T313825)
[10:33:12] (CR) Jcrespo: "It would be great to wait for db1150 to finish before applying this to db1145, so we keep having fresh backups of s4 (will revert then)." [puppet] - https://gerrit.wikimedia.org/r/820687 (owner: Jcrespo)
[10:35:09] (PS3) Jcrespo: dbbackups: Move s4 eqiad snapshots from db1150 to db1145 [puppet] - https://gerrit.wikimedia.org/r/820687
[10:36:09] (CR) Ladsgroup: [C: +1] dbbackups: Move s4 eqiad snapshots from db1150 to db1145 [puppet] - https://gerrit.wikimedia.org/r/820687 (owner: Jcrespo)
[10:36:42] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[10:37:15] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[10:39:09] (CR) Muehlenhoff: [C: +2] puppetdb: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820658 (owner: Muehlenhoff)
[10:39:13] (CR) Jcrespo: [C: +2] dbbackups: Move s4 eqiad snapshots from db1150 to db1145 [puppet] - https://gerrit.wikimedia.org/r/820687 (owner: Jcrespo)
[10:39:32] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356) (owner: Ayounsi)
[10:40:51] (CR) Muehlenhoff: [C: +2] geoip: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/820650 (owner: Muehlenhoff)
[10:46:57] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[10:49:52] I am going to self merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/820647 unless someone objects
[10:51:38] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[10:51:48] (CR) Jbond: "LGTM a few minor nits" [cookbooks] - https://gerrit.wikimedia.org/r/812380 (owner: Ayounsi)
[10:57:56] (CR) Muehlenhoff: [C: +2] mediabackup: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/820659 (owner: Muehlenhoff)
[11:03:30] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) @Muehlenhoff I just re-confirmed via call with @MunizaA that ssh-key is correct.
[11:19:43] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool after PDU maint on C5 (T310145)', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json
[11:34:40] T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[11:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool after PDU maint on C6 (T310145)', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json
[11:37:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool after PDU maint on D3 (T310146)', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json
[11:37:33] D3: test - ignore - https://phabricator.wikimedia.org/D3
[11:37:33] T310146: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146
[11:40:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[11:40:52] (PS1) Muehlenhoff: Fix sysuser config for mediabackup worker [puppet] - https://gerrit.wikimedia.org/r/820710
[11:44:51] (CR) Jcrespo: [C: +1] Fix sysuser config for mediabackup worker [puppet] - https://gerrit.wikimedia.org/r/820710 (owner: Muehlenhoff)
[11:44:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (ayounsi) a:Cmjohnson→Andrew Thanks for opening this task! I have a few questions and possibly ideas for improvements we could implement. Let’s start...
[11:45:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/820710 (owner: Muehlenhoff)
[11:50:29] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[11:52:38] (PS2) Muehlenhoff: Fix sysuser config for mediabackup worker [puppet] - https://gerrit.wikimedia.org/r/820710
[11:52:44] (PS1) Jelto: gitlab: reduce backup_keep_time to 2d [puppet] - https://gerrit.wikimedia.org/r/820712 (https://phabricator.wikimedia.org/T274463)
[11:54:43] (CR) Muehlenhoff: [C: +2] Fix sysuser config for mediabackup worker [puppet] - https://gerrit.wikimedia.org/r/820710 (owner: Muehlenhoff)
[11:56:57] (PS1) Ladsgroup: Revert "mariadb: Downtime D3 databases" [puppet] - https://gerrit.wikimedia.org/r/820668
[11:57:03] (PS2) Ladsgroup: Revert "mariadb: Downtime D3 databases" [puppet] - https://gerrit.wikimedia.org/r/820668
[11:57:27] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36629/console" [puppet] - https://gerrit.wikimedia.org/r/820712 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[11:57:44] (CR) Ladsgroup: [V: +2 C: +2] Revert "mariadb: Downtime D3 databases" [puppet] - https://gerrit.wikimedia.org/r/820668 (owner: Ladsgroup)
[12:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:05:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:08:02] (PS1) Ssingh: Revert "Revert "Revert "Depool codfw for PDU upgrade""" [dns] - https://gerrit.wikimedia.org/r/820669
[12:09:24] (PS2) Ssingh: Revert "Revert "Revert "Depool codfw for PDU upgrade""" [dns] - https://gerrit.wikimedia.org/r/820669
[12:13:04] SRE, ops-codfw, DC-Ops: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (Ladsgroup) It's depooled, I will disable notifications for this and {T314559}
[12:16:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:19:46] (PS1) Ladsgroup: Revert "mariadb: Disable notifications DBs in C5" [puppet] - https://gerrit.wikimedia.org/r/820670
[12:19:54] (PS2) Ladsgroup: Revert "mariadb: Disable notifications DBs in C5" [puppet] - https://gerrit.wikimedia.org/r/820670
[12:20:52] (CR) Ladsgroup: [V: +2 C: +2] Revert "mariadb: Disable notifications DBs in C5" [puppet] - https://gerrit.wikimedia.org/r/820670 (owner: Ladsgroup)
[12:20:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:20:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:20:59] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:22:26] <_joe_> Amir1: ^^ what's up with dbctl?
[12:22:47] _joe_: I probably messed it up let me try again
[12:23:16] _joe_: I can't commit
[12:23:29] https://www.irccloud.com/pastebin/BaQEp6uj/
[12:23:54] <_joe_> uh
[12:24:05] even diff is failing
[12:24:18] <_joe_> Amir1: I suspect there was some big damage done
[12:24:40] oh
[12:24:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:24:46] :(
[12:24:52] !log installing nano bugfix updates from bullseye point release
[12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:58] <_joe_> I have no idea right now but let me check
[12:25:07] <_joe_> cdanis: I'd appreciate your expertise too if you're around
[12:25:14] Thanks. This part is way outside of my expertise
[12:26:14] (PS1) JMeybohm: Collect/export helm list call latencies [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/820713
[12:26:29] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[12:26:48] <_joe_> ok, data is still all there
[12:27:57] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:28:25] <_joe_> DEBUG:urllib3.connectionpool:https://conf1007.eqiad.wmnet:4001 "GET /v2/keys/conftool/v1/mediawiki-config/codfw/dbconfig HTTP/1.1" 200 None
[12:28:29] <_joe_> uhhhh
[12:29:38] <_joe_> curl gets the data correctly though
[12:29:44] (PS1) Jbond: C:package_builder: ensure we also have WIKIMEDIA=yes when using component [puppet] - https://gerrit.wikimedia.org/r/820714
[12:30:05] I did sudo dbctl section all get, I don't know if that could have broken things
[12:30:14] <_joe_> Amir1: did you do any dbctl commit before ?
[12:30:20] <_joe_> like today I mean
[12:30:26] yeah I did today
[12:30:39] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:51] (CR) Jbond: [C: +2] C:package_builder: ensure we also have WIKIMEDIA=yes when using component [puppet] - https://gerrit.wikimedia.org/r/820714 (owner: Jbond)
[12:32:30] <_joe_> I really don't get what's wrong right now heh, I'll have to read the code
[12:33:15] (CR) Muehlenhoff: "I like the simplicity, but I think we could use the chance to move to PCI IDs entirely (except MDRAID obviously)? All the /proc probing is" [puppet] - https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) (owner: Jbond)
[12:34:41] (PS1) Jbond: C:package_builder: exit if wikimedia=yes is not set [puppet] - https://gerrit.wikimedia.org/r/820716
[12:35:11] (CR) Jbond: [V: +2 C: +2] C:package_builder: exit if wikimedia=yes is not set [puppet] - https://gerrit.wikimedia.org/r/820716 (owner: Jbond)
[12:38:53] (CR) Ayounsi: "Thanks for the feedback!" [cookbooks] - https://gerrit.wikimedia.org/r/812380 (owner: Ayounsi)
[12:40:31] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:42:54] (CR) Ori: [C: +1] "LGTM. Is there a mechanism for removing the cruft from already-rendered templates?" [deployment-charts] - https://gerrit.wikimedia.org/r/820686 (owner: JMeybohm)
[12:43:03] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[12:47:03] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/812380 (owner: Ayounsi)
[12:48:35] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[12:53:41] <_joe_> !log progressive repool of services in codfw
[12:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:16] <_joe_> ok, in ~ 1-5 minutes mw latency in codfw will be down
[12:55:58] <_joe_> Emperor: swift is now repooled for thumb generation
[12:58:40] (PS1) Ladsgroup: Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820673
[12:58:44] (PS2) Jcrespo: icinga: Fix privacy policy check for the mobile wiki [puppet] - https://gerrit.wikimedia.org/r/820647
[12:58:46] (PS1) Andrew Bogott: Openstack: use cloudcontrol1004/1004 as rabbitmq hosts [puppet] - https://gerrit.wikimedia.org/r/820728 (https://phabricator.wikimedia.org/T314522)
[12:58:51] (PS2) Ladsgroup: Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820673
[12:58:54] (PS3) Jcrespo: icinga: Fix privacy policy check for the mobile wiki [puppet] - https://gerrit.wikimedia.org/r/820647
[12:59:13] (CR) Ladsgroup: [V: +2 C: +2] Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820673 (owner: Ladsgroup)
[13:00:19] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns6001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:00:19] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:00:57] (CR) Andrew Bogott: [C: +2] Openstack: use cloudcontrol1004/1004 as rabbitmq hosts [puppet] - https://gerrit.wikimedia.org/r/820728 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[13:02:16] !log upload honeysql-clojure to puppet7 component
[13:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:59] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:05:28] (PS1) Ladsgroup: Revert "mariadb: Disable notifications DBs in C6" [puppet] - https://gerrit.wikimedia.org/r/820674
[13:05:41] (PS2) Ladsgroup: Revert "mariadb: Disable notifications DBs in C6" [puppet] - https://gerrit.wikimedia.org/r/820674
[13:05:43] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns5001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:06:59] (PS3) Ladsgroup: Revert "mariadb: Disable notifications DBs in C6" [puppet] - https://gerrit.wikimedia.org/r/820674
[13:07:09] (PS1) Andrew Bogott: hiera: fix a typo [puppet] - https://gerrit.wikimedia.org/r/820733
[13:08:11] (CR) Ladsgroup: [V: +2 C: +2] Revert "mariadb: Disable notifications DBs in C6" [puppet] - https://gerrit.wikimedia.org/r/820674 (owner: Ladsgroup)
[13:08:13] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns4001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:09:14] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "Revert "Depool codfw for PDU upgrade""" [dns] - 10https://gerrit.wikimedia.org/r/820669 (owner: 10Ssingh) [13:09:57] !log repool codfw [13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:04] (03PS1) 10Jbond: package_builder: use printf to populate component [puppet] - 10https://gerrit.wikimedia.org/r/820734 [13:10:17] (03CR) 10Jbond: [C: 03+2] package_builder: use printf to populate component [puppet] - 10https://gerrit.wikimedia.org/r/820734 (owner: 10Jbond) [13:10:41] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns2002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:10:43] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:10:43] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns6002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:10:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns5002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:11:29] (03CR) 10Andrew Bogott: [C: 03+2] hiera: fix a typo [puppet] - 10https://gerrit.wikimedia.org/r/820733 (owner: 10Andrew Bogott) [13:11:32] ^ is this known? [13:12:16] andrewbogott: did you also merge my change??? 
[13:12:28] nevermind [13:12:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:12:45] jbond: yes :) [13:12:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:13:07] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0102 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:13:25] * jbond looking [13:13:40] andrewbogott: looks like a cloud issue [13:14:20] jbond: the widespread puppet thing? I think I just fixed it [13:14:20] no value for profile::openstack::eqiad1::region' [13:14:24] ahh ok [13:14:36] ill kick of a failed run on cumin [13:14:50] <_joe_> sukhe: yeah it was my fault [13:14:56] <_joe_> I thought I fixed it fast enough [13:15:29] <_joe_> sukhe: gimme 1 min to verify [13:15:48] _joe_: np and thanks [13:16:33] (03PS1) 10Muehlenhoff: proxysql: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/820737 [13:18:05] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004371 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:18:23] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns1002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:18:46] (03PS1) 10Jbond: package_build: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820738 [13:20:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:20:26] (03PS2) 10Ori: New service: function-orchestrator 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) [13:20:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820737 (owner: 10Muehlenhoff) [13:20:58] (03CR) 10Ori: New service: function-orchestrator (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [13:21:02] (03PS3) 10Ori: New service: function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) [13:21:14] (03CR) 10Ori: [C: 03+2] New service: function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [13:22:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:23:35] (03CR) 10Jbond: [C: 03+2] package_build: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820738 (owner: 10Jbond) [13:25:57] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns4002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:30] (03Merged) 10jenkins-bot: New service: function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [13:26:33] (03CR) 10CDanis: [C: 03+1] icinga: Fix privacy policy check for the mobile wiki [puppet] - 10https://gerrit.wikimedia.org/r/820647 (owner: 10Jcrespo) [13:26:40] <_joe_> sukhe: it should recover (the dns errors) [13:26:43] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:45] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns3002 is OK: No errors 
detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:47] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns6002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:49] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:27:03] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:27:05] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:27:07] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool hosts with fragile power supply (T314559 T314628)', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json [13:27:14] T314628: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 [13:27:14] T314559: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 [13:28:03] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:28:13] RECOVERY - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:28:19] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:30:50] (03CR) 10Jcrespo: [C: 03+2] 
icinga: Fix privacy policy check for the mobile wiki [puppet] - 10https://gerrit.wikimedia.org/r/820647 (owner: 10Jcrespo) [13:31:01] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:32:40] (03PS1) 10Muehlenhoff: profile::docker::engine: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/820745 [13:35:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10Andrew) This is going very poorly. When I switch to the new cluster, most things seem to work, but VMs never get scheduled. I can see mes... [13:38:09] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:38:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820745 (owner: 10Muehlenhoff) [13:42:36] (03CR) 10Jcrespo: [C: 03+1] proxysql: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/820737 (owner: 10Muehlenhoff) [13:55:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:57:44] !log upload logstash-logback-encoder-7.2 to puppet7 component [13:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:49] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on authdns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-videoscaler.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:59:05] (03PS1) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 
10https://gerrit.wikimedia.org/r/820749 [13:59:45] (03CR) 10SBassett: [C: 03+1] Netbox: add CSP headers [puppet] - 10https://gerrit.wikimedia.org/r/820645 (https://phabricator.wikimedia.org/T296356) (owner: 10Ayounsi) [14:03:28] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul) [14:03:41] (03CR) 10Ssingh: dnsrecursor: Remove support for stretch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:04:27] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (10fnegri) [14:04:31] 10SRE, 10ops-codfw, 10serviceops: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10Papaul) [14:04:37] (03PS2) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 [14:06:05] (03PS4) 10Ayounsi: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 [14:06:15] (03CR) 10Ssingh: dnsrecursor: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:06:45] !log upload murphy-clojure to puppet7 component [14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:55] (03CR) 10Ayounsi: [C: 03+1] netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [14:07:13] (03CR) 10Ayounsi: [C: 03+1] netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [14:12:36] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (10fnegri) [14:13:06] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - 
https://phabricator.wikimedia.org/T313504 (10fnegri) [14:13:08] !log upload structured-logging-clojure to puppet7 component [14:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] 10SRE, 10SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (10fnegri) [14:17:12] !log upload truss-clojure to puppet7 component [14:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:31] (03PS2) 10JMeybohm: Collect/export helm list call latencies [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/820713 [14:19:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_uwsgi-striker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:16] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: all html is present. https://phabricator.wikimedia.org/project/members/28/ [14:20:35] (03CR) 10Muehlenhoff: dnsrecursor: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/820654 (owner: 10Muehlenhoff) [14:21:11] ^please note this recovery has been sponsored by cdanis reviews(TM), always have your cdanis reviews (TM) with very cold ice! 
:-D [14:23:12] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:23:19] !log upload encore-clojure to puppet7 component [14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:22] (03PS3) 10RhinosF1: wikimedia-org: add soundlogo.wm.org [dns] - 10https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) [14:26:49] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (10fnegri) [14:29:26] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [14:31:44] (03PS1) 10Majavah: P:puppet: fix inconsistent capitalisation in motd [puppet] - 10https://gerrit.wikimedia.org/r/820752 [14:31:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) @Marostegui Chris is on Vacation, I tried to fixed the issues @ayounsi mentioned above. For the install i will take a look at it later. [14:31:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:34:33] !log upload data-generators-clojure to puppet7 component [14:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:44] (03PS1) 10Andrew Bogott: nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/820758 [14:37:51] (03CR) 10CI reject: [V: 04-1] nova.conf: remove auth_strategy = keystone [puppet] - 10https://gerrit.wikimedia.org/r/820758 (owner: 10Andrew Bogott) [14:39:35] 10SRE, 10SRE-Access-Requests: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10sbassett) Hey @larissagaulia - you should now be invited to `#mediawiki_security`. Once you accept and join the channel, we can resolve this task. Thanks. 
[14:39:56] 10SRE, 10SRE-Access-Requests: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10sbassett) 05Open→03In progress [14:40:13] !log upload test-generative-clojure to puppet7 component [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] 10SRE, 10SRE-Access-Requests: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (10sbassett) p:05Triage→03Medium [14:40:54] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1185.eqiad.wmnet with OS bullseye [14:40:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:41:00] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye [14:43:49] !log upload fressian to puppet7 component [14:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:34] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2090 - https://phabricator.wikimedia.org/T314109 (10Papaul) [14:50:15] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2072 - https://phabricator.wikimedia.org/T313911 (10Papaul) [14:51:21] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Papaul) [14:51:33] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [14:51:49] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 
10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [14:52:29] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [14:54:55] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Ladsgroup) Manuel is also on vacation, I can take a look once you are done. Thanks. [14:56:07] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1185.eqiad.wmnet with reason: host reimage [14:57:13] !log upload nippy-clojure to puppet7 component [14:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:13] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Papaul) [15:02:40] (03CR) 10Ssingh: wikimedia-org: add soundlogo.wm.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) (owner: 10RhinosF1) [15:04:33] !log upload test-check-clojure to puppet7 component [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:30] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1190.eqiad.wmnet with OS bullseye [15:05:36] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1190.eqiad.wmnet with OS bullseye [15:06:00] (03PS4) 10RhinosF1: wikimedia-org: add soundlogo.wm.org [dns] - 10https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) [15:06:23] (03CR) 10Ahmon Dancy: [C: 03+1] scap gitignore: ignore all files under the 
`scap` directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820653 (owner: 10Jaime Nuche) [15:07:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820653 (owner: 10Jaime Nuche) [15:07:36] I'm deploying a no-op mediawiki-config change (new entry in .gitignore) [15:07:42] (03CR) 10Ssingh: [C: 03+1] wikimedia-org: add soundlogo.wm.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) (owner: 10RhinosF1) [15:08:06] (03CR) 10Ssingh: [V: 03+2 C: 03+2] wikimedia-org: add soundlogo.wm.org [dns] - 10https://gerrit.wikimedia.org/r/820667 (https://phabricator.wikimedia.org/T314626) (owner: 10RhinosF1) [15:08:09] oh, interesting [15:08:47] (03Merged) 10jenkins-bot: scap gitignore: ignore all files under the `scap` directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820653 (owner: 10Jaime Nuche) [15:09:02] !log upload test-chuck-clojure to puppet7 component [15:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:18] !log dancy@deploy1002 Started scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory [15:10:04] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1185.eqiad.wmnet with OS bullseye [15:10:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1185.eqiad.wmnet with OS bullseye completed: - db1... 
[15:11:54] !log upload jolokia to puppet7 component [15:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:59] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:820653]] scap gitignore: ignore all files under the `scap` directory (duration: 04m 41s) [15:14:49] (03PS1) 10Jcrespo: Revert "db2126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820676 [15:15:25] (03CR) 10Jcrespo: "This was done on a separate patch because it was new." [puppet] - 10https://gerrit.wikimedia.org/r/820676 (owner: 10Jcrespo) [15:15:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:16:03] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/820752 (owner: 10Majavah) [15:16:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:16:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:17:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:18:54] (03PS1) 10Jcrespo: Revert "db2143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820677 [15:19:23] !log upload puppetlabs-http-client-clojur to puppet7 component [15:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] (03CR) 10Jcrespo: [C: 03+2] Revert "db2126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820676 (owner: 10Jcrespo) [15:21:18] (03CR) 10Jcrespo: [C: 03+2] Revert "db2143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820677 (owner: 10Jcrespo) [15:21:47] 10SRE, 10Domains, 10Traffic, 10WMF-Communications, 10Patch-For-Review: Setup URL (soundlogo.wikimedia.org) for Sound Logo website - https://phabricator.wikimedia.org/T314626 (10ssingh) 05Open→03Resolved a:03ssingh ` dig soundlogo.wikimedia.org CNAME +short wikimediasoundlogo.go-vip.net. ` Thanks... 
[15:22:36] (03CR) 10Ottomata: [C: 03+1] "LGTM, would be happy to merge and shepherd on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [15:22:53] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1191.eqiad.wmnet with OS bullseye [15:22:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1191.eqiad.wmnet with OS bullseye [15:24:35] !log upload trapperkeeper-metrics-clojure to puppet7 component [15:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:08] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage [15:28:38] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: host reimage [15:30:16] !log milimetric@deploy1002 Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine [15:30:48] (03PS1) 10Jcrespo: Revert "dbproxy2002: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820678 [15:34:12] (03CR) 10Ladsgroup: [C: 03+1] Revert "dbproxy2002: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820678 (owner: 10Jcrespo) [15:34:40] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage [15:37:14] (03CR) 10Jcrespo: [C: 03+2] Revert "dbproxy2002: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/820678 (owner: 10Jcrespo) [15:38:13] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: host reimage [15:38:33] (03PS1) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong 
[puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) [15:39:16] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:40:10] (03CR) 10Ayounsi: "Going through my Gerrit dashboard, do you remember if this is still neede?" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [15:42:03] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1190.eqiad.wmnet with OS bullseye [15:42:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1190.eqiad.wmnet with OS bullseye completed: - db1... [15:48:44] (03Abandoned) 10Ayounsi: Look for a VC match before a device match [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/675116 (owner: 10Ayounsi) [15:50:14] (03CR) 10Jcrespo: [C: 04-1] "the diff has a mistake, the body is correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) (owner: 10Jcrespo) [15:51:54] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS bullseye [15:52:00] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1192.eqiad.wmnet with OS bullseye [15:52:20] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1191.eqiad.wmnet with OS bullseye [15:52:25] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1191.eqiad.wmnet with OS bullseye completed: - db1... [15:53:58] 10SRE, 10Domains, 10Traffic, 10WMF-Communications: Setup URL (soundlogo.wikimedia.org) for Sound Logo website - https://phabricator.wikimedia.org/T314626 (10Varnent) Thank you, @RhinosF1 and @ssingh! 
:) [15:54:36] (03PS2) 10Jcrespo: mariadb: Revert a few leftover disabled notif., belived to be wrong [puppet] - 10https://gerrit.wikimedia.org/r/820773 (https://phabricator.wikimedia.org/T313569) [15:55:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bullseye [15:56:03] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1193.eqiad.wmnet with OS bullseye [15:56:34] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 31 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [15:57:09] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) [16:00:47] (03CR) 10Jcrespo: [C: 04-1] "While this works, it requires more thinking about how to handle the different needs and precedence of the different settings (defaults, co" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/820664 (owner: 10Jcrespo) [16:03:43] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [16:04:54] !log milimetric@deploy1002 Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine (duration: 34m 38s) [16:05:05] !log milimetric@deploy1002 Started deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) [16:06:21] (03CR) 10Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [16:07:16] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [16:07:41] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [16:07:43] (03PS1) 10Faidon Liambotis: mirrors: also serve Tails with rsync [puppet] - 10https://gerrit.wikimedia.org/r/820777 [16:08:56] (03PS2) 10Samtar: maintain-views: Add pagetriage-copyvio to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/815215 (https://phabricator.wikimedia.org/T313281) (owner: 10Zabe) [16:10:28] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@8489923]: T304954: Automate imagesuggestion imports [16:10:33] T304954: Import data from hdfs to commonswiki_file - https://phabricator.wikimedia.org/T304954 [16:11:14] !log milimetric@deploy1002 Finished deploy [analytics/refinery@fe7bf9e]: Hotfix for webrequest load refine, now with FORCE :) (duration: 06m 09s) [16:11:21] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [16:12:25] (03CR) 10Samtar: [C: 03+1] maintain-views: Add pagetriage-copyvio to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/815215 (https://phabricator.wikimedia.org/T313281) (owner: 10Zabe) [16:12:32] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@8489923]: T304954: Automate imagesuggestion imports (duration: 02m 03s) [16:16:48] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [16:18:06] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:13] (KubernetesRsyslogDown) 
firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:21:58] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1192.eqiad.wmnet with OS bullseye [16:22:02] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1192.eqiad.wmnet with OS bullseye completed: - db1... [16:22:07] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1192.eqiad.wmnet with OS bullseye executed with er... [16:23:52] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [16:25:11] (PS1) Ayounsi: Bump pynetbox to ~= 6.6 [software/homer] - https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) [16:25:43] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bullseye [16:25:49] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1193.eqiad.wmnet with OS bullseye completed: - db1...
[16:26:07] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1194.eqiad.wmnet with OS bullseye [16:26:12] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1194.eqiad.wmnet with OS bullseye [16:27:09] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-tls [16:27:13] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=ats-be [16:27:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[56]\.codfw\.wmnet,service=varnish-fe [16:30:14] (CR) CI reject: [V: -1] Bump pynetbox to ~= 6.6 [software/homer] - https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: Ayounsi) [16:32:28] (PS1) Brennen Bearnes: phabricator: Change local.json group to www-data [puppet] - https://gerrit.wikimedia.org/r/820779 (https://phabricator.wikimedia.org/T313950) [16:34:47] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host db1195.eqiad.wmnet with OS bullseye [16:34:53] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye [16:37:28] (CR) Ayounsi: "recheck" [software/homer] - https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: Ayounsi) [16:37:55] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1194.eqiad.wmnet with reason: host reimage [16:41:25] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on
db1194.eqiad.wmnet with reason: host reimage [16:43:10] (CR) CI reject: [V: -1] Bump pynetbox to ~= 6.6 [software/homer] - https://gerrit.wikimedia.org/r/820778 (https://phabricator.wikimedia.org/T310745) (owner: Ayounsi) [16:49:36] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage [16:50:29] (PS2) Brennen Bearnes: phabricator: Change local.json group to www-data / world readable [puppet] - https://gerrit.wikimedia.org/r/820779 (https://phabricator.wikimedia.org/T313950) [16:53:16] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: host reimage [16:54:46] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1194.eqiad.wmnet with OS bullseye [16:54:52] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1194.eqiad.wmnet with OS bullseye completed: - db1... [17:02:59] SRE, Infrastructure-Foundations, netbox, netops: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi) Note that the generate_dns_snippets.py script might need to be adapted, see the error during a test on netbox-next. ` 2022-08-05 18:55:05,373 [INFO] Gathering de... [17:08:52] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1195.eqiad.wmnet with OS bullseye [17:08:58] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host db1195.eqiad.wmnet with OS bullseye completed: - db1...
[17:09:37] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:21:53] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Papaul) [17:21:57] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Papaul) @Cmjohnson @Marostegui , this is complete. I created all the 1g interfaces on for the nodes in row E and F and also added those interfaces... [17:22:07] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [17:34:46] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:43] SRE, SRE-Access-Requests: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (ssingh) Larissa has been invited to the channel. Once they will accept the invitation, we can mark this as resolved.
[18:01:00] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:05] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (VirginiaPoundstone) [18:16:01] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (DAbad) Approving for access [18:26:51] (PS1) BCornwall: WIP: Run OCSP functions even if certs fail [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) [18:35:56] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:41:31] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:25] (PS1) CDanis: vo-escalate: actually run every 15 seconds [puppet] - https://gerrit.wikimedia.org/r/820800 [18:53:15] (PS2) CDanis: vo-escalate: actually run every 15 seconds [puppet] - https://gerrit.wikimedia.org/r/820800 (https://phabricator.wikimedia.org/T313603) [18:54:11] (CR) CDanis: [C: +2] vo-escalate: actually run every 15 seconds [puppet] - https://gerrit.wikimedia.org/r/820800 (https://phabricator.wikimedia.org/T313603) (owner: CDanis) [19:22:00] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:24:05] (PS1) Ayounsi: Bump pynetbox to ~= 6.6 [software/spicerack] - https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) [19:28:42] (PS1) Ayounsi: Bump pynetbox to ~= 6.6 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/820808 (https://phabricator.wikimedia.org/T310745) [19:28:52] RECOVERY - Host
cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:15:25] (PS1) Ayounsi: Netbox: add housekeeping systemd timer [puppet] - https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) [20:18:45] (PS1) Stang: trwikivoyage: Create rollbacker user group [mediawiki-config] - https://gerrit.wikimedia.org/r/820815 (https://phabricator.wikimedia.org/T314678) [20:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:22:30] (PS2) Ayounsi: Netbox: add housekeeping systemd timer [puppet] - https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) [20:25:02] (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36631/" [puppet] - https://gerrit.wikimedia.org/r/820813 (https://phabricator.wikimedia.org/T311048) (owner: Ayounsi) [20:32:06] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:15] (CR) Ahmon Dancy: P:gerrit: add ipaddress to host_aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: Jbond) [20:36:04] (PS12) Mary Yang: Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [20:37:33] (CR) CI reject: [V: -1] Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [20:39:39] (PS13) Mary Yang: Add puppet profile and role files for WikiFunctions.
[puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [20:40:30] (CR) CI reject: [V: -1] Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [20:40:56] (PS14) Mary Yang: Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [20:41:57] (CR) CI reject: [V: -1] Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [20:44:06] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:48:43] (PS15) Mary Yang: Add puppet profile and role files for WikiFunctions. [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) [20:54:35] (CR) Mary Yang: "Hi Filippo and Daniel, we added a new API to WikiLambda designed specifically for health check purposes (https://gerrit.wikimedia.org/r/c/" [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [21:04:44] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:14] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:37:22] SRE, SRE-Access-Requests, SecTeam-Processed: Add Larissa Gaulia to #mediawiki_security - https://phabricator.wikimedia.org/T314616 (sbassett) [22:18:48] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly [22:20:49] !log
dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@71fe016]: Fix schedule_interval for image_recommendation_weekly (duration: 02m 01s) [22:22:41] (CR) Dzahn: [C: +1] "matches https://phabricator.wikimedia.org/P32283 and there is https://phabricator.wikimedia.org/T292955#8133333 but per https://phabricato" [puppet] - https://gerrit.wikimedia.org/r/820285 (https://phabricator.wikimedia.org/T292955) (owner: RhinosF1) [22:26:32] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (Dzahn) The key in https://phabricator.wikimedia.org/P32283 matches the key in Gerrit, and thanks for checking it. Per comments on T313299 now I just need to have a... [22:29:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (Dzahn) Hello @Ottomata or @odimitrijevic this is asking for your group approval.
Thanks [22:31:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:34:57] SRE, ops-eqiad, decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (wiki_willy) a:Cmjohnson [22:36:38] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (wiki_willy) a:Papaul [22:36:54] SRE, ops-codfw, DC-Ops: db2135 (C6) lost power supply redundancy - https://phabricator.wikimedia.org/T314628 (wiki_willy) a:Papaul [22:37:10] SRE, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314607 (wiki_willy) a:Papaul [22:37:27] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (wiki_willy) a:Papaul [22:39:58] SRE, Observability-Logging, Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (colewhite) Open→In progress We've enabled the Public Logs datasource in Grafana and forwarded scap.announce logs to it.
[22:55:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:00:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:08:52] (PS1) Dzahn: mirrors: drop /tails/ from remote_path for tails mirror [puppet] - https://gerrit.wikimedia.org/r/820824 [23:26:52] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:28:04] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36632/mirror1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/820777 (owner: Faidon Liambotis) [23:31:04] (CR) Dzahn: [C: +2] mirrors: also serve Tails with rsync [puppet] - https://gerrit.wikimedia.org/r/820777 (owner: Faidon Liambotis) [23:32:54] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:34:00] (Abandoned) Dzahn: mirrors: drop /tails/ from remote_path for tails mirror [puppet] - https://gerrit.wikimedia.org/r/820824 (owner: Dzahn) [23:46:18] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [23:55:28] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%