[00:00:42] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:16] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:18:16] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:48] SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (wiki_willy) Hi Filippo - is this following link the one used for phase monitoring? https://librenms.wikimedia.org/device/device=108/tab=health/metric=current/...
[01:30:40] PROBLEM - Host wdqs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:38:26] RECOVERY - Host wdqs2001 is UP: PING OK - Packet loss = 0%, RTA = 35.06 ms
[01:40:20] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:54] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:05] (PS6) Ladsgroup: mailman: Drop absented files and packages [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303)
[02:24:02] (CR) Ladsgroup: "> Patch Set 5: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[02:24:09] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303) (owner:
Ladsgroup)
[02:26:10] (CR) Ladsgroup: "PCC seems happy now https://puppet-compiler.wmflabs.org/compiler1002/810/lists1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[03:04:10] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup)
[03:07:35] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup)
[04:53:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316', diff saved to https://phabricator.wikimedia.org/P16609 and previous config saved to /var/cache/conftool/dbconfig/20210618-045355-marostegui.json
[04:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P16610 and previous config saved to /var/cache/conftool/dbconfig/20210618-045743-marostegui.json
[04:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16611 and previous config saved to /var/cache/conftool/dbconfig/20210618-045808-marostegui.json
[04:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:26] (PS1) Marostegui: db1155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/700259
[05:00:16] (CR) Marostegui: [C: +2] db1155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/700259 (owner: Marostegui)
[05:01:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316
(re)pooling @ 25%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16612 and previous config saved to /var/cache/conftool/dbconfig/20210618-050148-root.json
[05:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16613 and previous config saved to /var/cache/conftool/dbconfig/20210618-051652-root.json
[05:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131', diff saved to https://phabricator.wikimedia.org/P16614 and previous config saved to /var/cache/conftool/dbconfig/20210618-051712-marostegui.json
[05:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:09] (PS1) Lars Wirzenius: remove Lars Wirzenius (liw) from groups [puppet] - https://gerrit.wikimedia.org/r/700260
[05:18:55] (CR) jerkins-bot: [V: -1] remove Lars Wirzenius (liw) from groups [puppet] - https://gerrit.wikimedia.org/r/700260 (owner: Lars Wirzenius)
[05:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16615 and previous config saved to /var/cache/conftool/dbconfig/20210618-051942-root.json
[05:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:35] (PS2) Lars Wirzenius: remove Lars Wirzenius (liw) from groups [puppet] - https://gerrit.wikimedia.org/r/700260
[05:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16616 and previous config saved to /var/cache/conftool/dbconfig/20210618-053156-root.json
[05:31:59]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16617 and previous config saved to /var/cache/conftool/dbconfig/20210618-053445-root.json
[05:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16618 and previous config saved to /var/cache/conftool/dbconfig/20210618-054659-root.json
[05:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165', diff saved to https://phabricator.wikimedia.org/P16619 and previous config saved to /var/cache/conftool/dbconfig/20210618-054841-marostegui.json
[05:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16620 and previous config saved to /var/cache/conftool/dbconfig/20210618-054949-root.json
[05:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16621 and previous config saved to /var/cache/conftool/dbconfig/20210618-055122-root.json
[05:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:09] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:04:53] !log
marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16622 and previous config saved to /var/cache/conftool/dbconfig/20210618-060452-root.json
[06:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16623 and previous config saved to /var/cache/conftool/dbconfig/20210618-060625-root.json
[06:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:13:53] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:02] SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (sgrabarczuk)
[06:19:17] SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (sgrabarczuk) I think we're good. We're adjusting the plan. A minor delay in communication won't impact the switchover, IMHO.
[06:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16624 and previous config saved to /var/cache/conftool/dbconfig/20210618-062129-root.json
[06:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168', diff saved to https://phabricator.wikimedia.org/P16625 and previous config saved to /var/cache/conftool/dbconfig/20210618-062452-marostegui.json
[06:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16626 and previous config saved to /var/cache/conftool/dbconfig/20210618-063632-root.json
[06:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:11] (CR) Muehlenhoff: "There are other steps involved here as well (like removing your SSH key), but don't need to handle these; offboarding from production acce" [puppet] - https://gerrit.wikimedia.org/r/700260 (owner: Lars Wirzenius)
[06:58:04] (CR) Muehlenhoff: [C: +2] archiva: Switch to profile::nginx [puppet] - https://gerrit.wikimedia.org/r/699378 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[06:58:13] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica1002.wikimedia.org
[06:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210618T0700)
[07:02:43] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:17:47] PROBLEM - cassandra CQL 10.192.48.166:9042 on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[07:18:08] (PS3) Legoktm: Add shellbox to LVS [puppet] - https://gerrit.wikimedia.org/r/693959 (https://phabricator.wikimedia.org/T281423)
[07:18:10] (PS3) Legoktm: service: Switch shellbox to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/693960 (https://phabricator.wikimedia.org/T281423)
[07:18:12] (PS3) Legoktm: service: Switch shellbox to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/693961 (https://phabricator.wikimedia.org/T281423)
[07:18:14] (PS3) Legoktm: service: Switch shellbox to production [puppet] - https://gerrit.wikimedia.org/r/693962 (https://phabricator.wikimedia.org/T281423)
[07:19:32] (CR) Legoktm: [C: +2] Add shellbox to LVS [puppet] - https://gerrit.wikimedia.org/r/693959 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[07:22:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2010.codfw.wmnet
[07:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:49] SRE, SRE-Access-Requests: Replace production ssh keys for jgiannelos - https://phabricator.wikimedia.org/T285126 (Jgiannelos)
[07:27:01] (CR) Legoktm: [C: +2] service: Switch shellbox to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/693960 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[07:27:18] pybal will alert in a bit
[07:27:27] SRE, SRE-Access-Requests: Replace production ssh keys for jgiannelos - https://phabricator.wikimedia.org/T285126 (Jgiannelos) Adding my manager @ssastry for visibility
[07:28:13] RECOVERY - cassandra CQL 10.192.48.166:9042 on maps2010 is OK: TCP OK - 0.033 second response time on 10.192.48.166 port 9042 https://phabricator.wikimedia.org/T93886
[07:28:19] legoktm: :)
[07:28:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2010.codfw.wmnet
[07:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:45] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[07:31:49] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:4008]) https://wikitech.wikimedia.org/wiki/PyBal
[07:32:11] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 113 connections established with conf1004.eqiad.wmnet:4001 (min=114) https://wikitech.wikimedia.org/wiki/PyBal
[07:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16627 and previous config saved to /var/cache/conftool/dbconfig/20210618-073225-root.json
[07:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:49] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 113 connections established with conf1004.eqiad.wmnet:4001 (min=114) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:32:59] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2004.codfw.wmnet:4001 (min=78) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:33:05] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:4008]) https://wikitech.wikimedia.org/wiki/PyBal
[07:33:05] ACKNOWLEDGEMENT - PyBal
IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:4008]) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:33:15] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal
[07:33:31] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:4008]) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:33:37] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:4008]) https://wikitech.wikimedia.org/wiki/PyBal
[07:34:13] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2004.codfw.wmnet:4001 (min=58) https://wikitech.wikimedia.org/wiki/PyBal
[07:34:13] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.51:4008]) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:34:23] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) Legoktm deploying shellbox lvs https://wikitech.wikimedia.org/wiki/PyBal
[07:35:03] !log restarting pyball on lvs1016, lvs2010 to add shellbox
[07:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:37] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 78 connections established with conf2004.codfw.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[07:38:03] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 114 connections established with conf1004.eqiad.wmnet:4001 (min=114) https://wikitech.wikimedia.org/wiki/PyBal
[07:38:55] RECOVERY - PyBal IPVS diff check on lvs1016 is OK:
OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:39:57] (CR) Giuseppe Lavagetto: [C: +1] docker::baseimages: Push images with legacy names [puppet] - https://gerrit.wikimedia.org/r/700204 (owner: JMeybohm)
[07:40:34] (CR) JMeybohm: [C: +2] docker::baseimages: Push images with legacy names [puppet] - https://gerrit.wikimedia.org/r/700204 (owner: JMeybohm)
[07:41:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:41:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:08] (CR) JMeybohm: [C: +1] profile::kubernetes::deployment_server: add istioctl package [puppet] - https://gerrit.wikimedia.org/r/700203 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:44:40] !log restarting pybal on lvs1015, lvs2009 (active) - T281423
[07:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:45] T281423: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423
[07:44:57] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 66 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal
[07:45:19] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:45:55] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2004.codfw.wmnet:4001 (min=58) https://wikitech.wikimedia.org/wiki/PyBal
[07:47:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repool db1168 after schema change', diff saved to
https://phabricator.wikimedia.org/P16628 and previous config saved to /var/cache/conftool/dbconfig/20210618-074729-root.json
[07:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:21] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[07:50:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:50:48] (CR) Legoktm: [C: +2] service: Switch shellbox to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/693961 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[07:51:01] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:21] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:13] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:57:16] (PS1) Ayounsi: Unify netbox tokens [puppet] - https://gerrit.wikimedia.org/r/700314 (https://phabricator.wikimedia.org/T241259)
[07:58:24] (PS4) Legoktm: service: Switch shellbox to production [puppet] - https://gerrit.wikimedia.org/r/693962
(https://phabricator.wikimedia.org/T281423)
[07:58:53] (PS2) Legoktm: Add shellbox to discovery [dns] - https://gerrit.wikimedia.org/r/693965 (https://phabricator.wikimedia.org/T281423)
[08:01:49] (CR) Ema: [C: +1] Add shellbox to discovery [dns] - https://gerrit.wikimedia.org/r/693965 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:01:55] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:24] (PS1) Legoktm: configmaster: Add shellbox to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/700315 (https://phabricator.wikimedia.org/T281423)
[08:02:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16629 and previous config saved to /var/cache/conftool/dbconfig/20210618-080233-root.json
[08:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:38] (CR) Legoktm: [C: +2] service: Switch shellbox to production [puppet] - https://gerrit.wikimedia.org/r/693962 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:05:00] (CR) Legoktm: [C: +2] Add shellbox to discovery [dns] - https://gerrit.wikimedia.org/r/693965 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:06:28] !log legoktm@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=shellbox
[08:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:01] (PS1) Ladsgroup: microsites: Add Query Builder subpage to wdqs gui [puppet] - https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703)
[08:10:27] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused:
0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:11:17] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:16:03] (PS1) Legoktm: Revert "Add shellbox to discovery" [dns] - https://gerrit.wikimedia.org/r/700042
[08:16:12] (CR) Legoktm: [V: +2 C: +2] Revert "Add shellbox to discovery" [dns] - https://gerrit.wikimedia.org/r/700042 (owner: Legoktm)
[08:16:54] win 7
[08:16:58] ufff
[08:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16630 and previous config saved to /var/cache/conftool/dbconfig/20210618-081737-root.json
[08:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:38] (PS1) Legoktm: Add shellbox to discovery" try #2 [dns] - https://gerrit.wikimedia.org/r/700044 (https://phabricator.wikimedia.org/T281423)
[08:19:38] (CR) Legoktm: [C: +2] Add shellbox to discovery" try #2 [dns] - https://gerrit.wikimedia.org/r/700044 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:22:04] (PS1) Giuseppe Lavagetto: mediawiki::web::yaml_defs: also monkey-patch templated sites [puppet] - https://gerrit.wikimedia.org/r/700318
[08:22:23] (CR) Volans: [C: +1] "LGTM, but would be nice to have John confirm the one you're leaving is the right hiera 'path' to use."
[puppet] - https://gerrit.wikimedia.org/r/700314 (https://phabricator.wikimedia.org/T241259) (owner: Ayounsi)
[08:23:35] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29928/console" [puppet] - https://gerrit.wikimedia.org/r/700318 (owner: Giuseppe Lavagetto)
[08:24:49] (PS1) Legoktm: Revert "Add shellbox to discovery" try #2" [dns] - https://gerrit.wikimedia.org/r/700045
[08:25:04] (CR) Legoktm: [V: +2 C: +2] Revert "Add shellbox to discovery" try #2" [dns] - https://gerrit.wikimedia.org/r/700045 (owner: Legoktm)
[08:26:19] (PS1) Legoktm: Add shellbox to discovery" try #3 [dns] - https://gerrit.wikimedia.org/r/700326 (https://phabricator.wikimedia.org/T281423)
[08:26:31] (PS2) Legoktm: Add shellbox to discovery (try #3) [dns] - https://gerrit.wikimedia.org/r/700326 (https://phabricator.wikimedia.org/T281423)
[08:28:07] (PS1) JMeybohm: scaffold: The metrics-config is only needed if statsd is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/700319
[08:29:22] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:50] mmmh XioNoX, topranks, telia again? ^^^
[08:30:01] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (MoritzMuehlenhoff) Looking a Grafana, releases1002 doesn't max out the current resources, though? Unless there is planned work for more parallelisation or similar changes?
[08:30:13] PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:17] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:18] great
[08:30:19] PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:21] PROBLEM - Host ganeti5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:26] PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:37] :(
[08:30:52] !log cr1-codfw# set interfaces xe-5/1/2 disable
[08:30:53] oh here we go again
[08:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:59] sweet
[08:31:00] do we need to depool?
[08:31:01] PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:01] RECOVERY - Host asw1-eqsin is UP: PING WARNING - Packet loss = 71%, RTA = 238.75 ms
[08:31:01] RECOVERY - Host ganeti5001 is UP: PING WARNING - Packet loss = 50%, RTA = 242.65 ms
[08:31:02] thx XioNoX
[08:31:04] RECOVERY - Host cr3-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.16 ms
[08:31:04] RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 238.26 ms
[08:31:07] should be good now
[08:31:09] RECOVERY - Host ganeti5003 is UP: PING OK - Packet loss = 0%, RTA = 237.89 ms
[08:31:11] RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.96 ms
[08:31:25] can we try get them to actually fix it this time? :)
[08:31:28] yeah...
[08:31:48] er
[08:31:54] womp womp
[08:32:18] now that we're all here, can can throw a party or something
[08:32:23] (PS1) Giuseppe Lavagetto: Depool eqsin [dns] - https://gerrit.wikimedia.org/r/700323
[08:32:34] party_parrot.mp4
[08:32:38] after the depool we can all jump in the pool, sure.
[08:32:48] I'll just leave the patch there
[08:32:48] joe: shouldn't be needed to depool for now
[08:32:55] https://gerrit.wikimedia.org/r/c/operations/dns/+/700027 is still available fwiw :D
[08:33:01] ahahaha
[08:33:04] okok
[08:33:09] 🤦
[08:33:49] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:35:03] is it OK if I deploy my DNS change right now? https://gerrit.wikimedia.org/r/c/operations/dns/+/700326 (adding shellbox to discovery), or should I wait?
[08:35:17] cc ema
[08:35:25] RECOVERY - Host ganeti5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.74 ms
[08:35:38] I'll resolve the incident
[08:35:54] legoktm: go ahead
[08:36:12] (CR) Legoktm: [C: +2] Add shellbox to discovery (try #3) [dns] - https://gerrit.wikimedia.org/r/700326 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:37:48] > OK - authdns-update successful on all nodes!
[08:37:57] (PS2) Giuseppe Lavagetto: mediawiki::web::yaml_defs: also monkey-patch templated sites [puppet] - https://gerrit.wikimedia.org/r/700318
[08:38:00] legoktm: \o/
[08:38:12] (CR) Legoktm: [C: +2] configmaster: Add shellbox to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/700315 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[08:38:46] aaand that should be it!
[08:39:05] on the phone with telia
[08:39:35] !log finished adding shellbox LVS entry, https://shellbox.svc.eqiad.wmnet:4008/ and https://shellbox.svc.codfw.wmnet:4008/ now work (T281423)
[08:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:40] T281423: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423
[08:39:44] ema: thank you for all your help!
[08:40:07] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Legoktm)
[08:40:21] legoktm: thank you for doing everything right except for what I told you to do!
[08:41:11] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Legoktm) ` legoktm@cumin1001:~$ curl https://shellbox.svc.eqiad.wmnet:4008/healthz { "__": "Shellbox running", "pid": 9 } legoktm@cumin1001:~$ curl https...
[08:41:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[08:41:48] (ie: the only step that went wrong was my fault)
[08:42:30] I'll add some more notes to the docs after I get some snacks
[08:42:48] XioNoX: godspeed
[08:44:46] I'm going to re-enable the link but keep OSPF routing away from it
[08:46:17] (PS6) Jcrespo: bacula: Add new jobdefaults/schedule for Gitlab, full backups every day [puppet] - https://gerrit.wikimedia.org/r/700183 (https://phabricator.wikimedia.org/T274463)
[08:46:47] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org
[08:47:15] (CR) Jcrespo: [C: +2] "I saw no comment against it, so I will deploy as is, and we can always revert on raised issues."
[puppet] - https://gerrit.wikimedia.org/r/700183 (https://phabricator.wikimedia.org/T274463) (owner: Jcrespo)
[08:49:40] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29929/console" [puppet] - https://gerrit.wikimedia.org/r/700318 (owner: Giuseppe Lavagetto)
[08:49:45] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:49:50] !log eqsin-codfw link re-enabled but drained
[08:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[08:54:46] (PS1) Ayounsi: Fix dumps fail if a device has an empty (None) name [software/netbox-extras] - https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587)
[08:55:50] (PS2) Ayounsi: Fix dumps fail if a device has an empty (None) name [software/netbox-extras] - https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587)
[08:56:50] (CR) Ayounsi: "I haven't seen the error so that's a guess." [software/netbox-extras] - https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587) (owner: Ayounsi)
[08:57:03] (CR) Volans: [C: +1] "LGTM, one typo inline" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587) (owner: Ayounsi)
[09:05:25] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/700071 (owner: Hnowlan)
[09:07:46] (CR) Ayounsi: "Tested on netbox-next."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587) (owner: 10Ayounsi) [09:08:04] (03PS1) 10Jcrespo: bacula: Fix schedule and monitoring as a followup to 67ee5c0 [puppet] - 10https://gerrit.wikimedia.org/r/700351 (https://phabricator.wikimedia.org/T274463) [09:08:11] (03PS3) 10Ayounsi: Fix dumps fail if a device has an empty (None) name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/700348 (https://phabricator.wikimedia.org/T275587) [09:10:19] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix schedule and monitoring as a followup to 67ee5c0 [puppet] - 10https://gerrit.wikimedia.org/r/700351 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [09:12:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: also monkey-patch templated sites [puppet] - 10https://gerrit.wikimedia.org/r/700318 (owner: 10Giuseppe Lavagetto) [09:12:42] (03PS2) 10Jcrespo: bacula: Fix schedule and monitoring as a followup to 67ee5c0 [puppet] - 10https://gerrit.wikimedia.org/r/700351 (https://phabricator.wikimedia.org/T274463) [09:15:49] 10SRE, 10DNS, 10Traffic, 10netbox, 10cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10ayounsi) @nskaggs I'm triaging the #netbox tasks. Does WMCS has an opinion on that task or it's fine to proceed? [09:18:03] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix schedule and monitoring as a followup to 67ee5c0 [puppet] - 10https://gerrit.wikimedia.org/r/700351 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [09:19:23] 10Puppet, 10SRE, 10netbox: postgres::slave module type for includes parameter in inconsistent. - https://phabricator.wikimedia.org/T232358 (10ayounsi) Putting this on John's radar as it's Postgres and Puppet related. [09:19:28] (03CR) 10Muehlenhoff: [C: 03+1] "That looks fine. 
Another more KISS option would be to install locales-all (which might also help people running services in an non-English" [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [09:21:54] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:59] (03CR) 10Jbond: "See comment inline (IANADBA)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan) [09:26:18] 10SRE, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10aborrero) cool, thanks! [09:30:31] (03PS4) 10Jcrespo: bacula/gitlab: add a backup::set for gitlab and use it [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:32:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/700314 (https://phabricator.wikimedia.org/T241259) (owner: 10Ayounsi) [09:32:18] (03CR) 10Jcrespo: "I think this is right, based on bacula config:" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:44:02] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me: https://puppet-compiler.wmflabs.org/compiler1001/29931/" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:44:39] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [10:01:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [10:04:08] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:16] 10SRE, 10DNS, 10Traffic, 10netbox, 10cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10aborrero) It is fine to proceed. Moreover, after the cloudgw project, some of this may be already on netbox anyway! see https://netbox... [10:07:56] 10SRE, 10DNS, 10Traffic, 10netbox, 10cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10aborrero) The other topics you mentioned: * Regarding the service FQDNs. We don't need them. These FQDNs related to the edge network... [10:09:45] 10SRE, 10DNS, 10Traffic, 10netbox, 10cloud-services-team (Kanban): Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621 (10aborrero) * regarding the DNS server addresses. You are right, an intermediate service FQDN might be in order here. 
[10:26:59] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: mangle vhosts [puppet] - 10https://gerrit.wikimedia.org/r/700356 [10:31:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29932/console" [puppet] - 10https://gerrit.wikimedia.org/r/700356 (owner: 10Giuseppe Lavagetto) [10:35:02] (03PS1) 10Jbond: postgresql::slave: Ensure includes are arrays [puppet] - 10https://gerrit.wikimedia.org/r/700357 (https://phabricator.wikimedia.org/T232358) [10:37:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: mangle vhosts [puppet] - 10https://gerrit.wikimedia.org/r/700356 (owner: 10Giuseppe Lavagetto) [10:41:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29933/console" [puppet] - 10https://gerrit.wikimedia.org/r/700357 (https://phabricator.wikimedia.org/T232358) (owner: 10Jbond) [10:46:43] (03PS2) 10Jbond: postgresql::slave: Ensure includes are arrays [puppet] - 10https://gerrit.wikimedia.org/r/700357 (https://phabricator.wikimedia.org/T232358) [10:46:45] (03PS1) 10Jbond: R:postgres: drop unused postgress roles [puppet] - 10https://gerrit.wikimedia.org/r/700358 (https://phabricator.wikimedia.org/T232358) [10:47:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29934/console" [puppet] - 10https://gerrit.wikimedia.org/r/700357 (https://phabricator.wikimedia.org/T232358) (owner: 10Jbond) [10:48:04] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/700359 [11:04:18] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:30] (03PS5) 10Jbond: C:locales: Add and configure all locales [puppet] - 
10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) [11:05:33] (03CR) 10Jbond: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [11:16:27] (03CR) 10Jbond: [C: 03+2] R:postgres: drop unused postgress roles [puppet] - 10https://gerrit.wikimedia.org/r/700358 (https://phabricator.wikimedia.org/T232358) (owner: 10Jbond) [11:16:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] postgresql::slave: Ensure includes are arrays [puppet] - 10https://gerrit.wikimedia.org/r/700357 (https://phabricator.wikimedia.org/T232358) (owner: 10Jbond) [11:21:59] 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Marostegui) [11:25:06] (03PS6) 10Jbond: C:locales: Add and configure all locales [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) [11:27:40] (03CR) 10Jbond: "> Patch Set 4: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [11:27:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P16631 and previous config saved to /var/cache/conftool/dbconfig/20210618-112739-marostegui.json [11:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] 10Puppet, 10SRE, 10netbox, 10Patch-For-Review: postgres::slave module type for includes parameter in inconsistent. 
- https://phabricator.wikimedia.org/T232358 (10jbond) 05Open→03Resolved a:03jbond Fixed [11:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16633 and previous config saved to /var/cache/conftool/dbconfig/20210618-114015-root.json [11:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:45] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Aklapper) Great! See the last two bullet points in the first section on https://wikitech.wikimedia.org/wiki/Google_Search_Console_access [11:47:09] (03CR) 10Muehlenhoff: C:locales: Add and configure all locales (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [11:55:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16634 and previous config saved to /var/cache/conftool/dbconfig/20210618-115518-root.json [11:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:52] (03PS7) 10Jbond: C:locales: Add ability to customise the installed locales [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) [12:07:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16635 and previous config saved to /var/cache/conftool/dbconfig/20210618-120755-root.json [12:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:28] (03CR) 10Jbond: "thanks reverted back to older PS" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) 
[12:10:08] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:10:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16636 and previous config saved to /var/cache/conftool/dbconfig/20210618-121022-root.json [12:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:37] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/700206 (https://phabricator.wikimedia.org/T285086) (owner: 10Jbond) [12:22:51] (03PS1) 10Ssingh: admin: update SSH key for jgiannelos [puppet] - 10https://gerrit.wikimedia.org/r/700370 (https://phabricator.wikimedia.org/T285126) [12:22:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16637 and previous config saved to /var/cache/conftool/dbconfig/20210618-122259-root.json [12:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16638 and previous config saved to /var/cache/conftool/dbconfig/20210618-122526-root.json [12:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:07] (03CR) 10Ssingh: [C: 03+2] admin: update SSH key for jgiannelos [puppet] - 10https://gerrit.wikimedia.org/r/700370 (https://phabricator.wikimedia.org/T285126) (owner: 10Ssingh) [12:34:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace production ssh keys for jgiannelos - https://phabricator.wikimedia.org/T285126 (10ssingh) 05Open→03Resolved a:03ssingh @Jgiannelos: Thanks for the additional 
confirmation; the SSH key has been updated. Marking this as resolved, please feel free... [12:38:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16639 and previous config saved to /var/cache/conftool/dbconfig/20210618-123802-root.json [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:41] (03CR) 10Lars Wirzenius: "Thanks, Moritz, I'll not worry about disabling accounts or such myself." [puppet] - 10https://gerrit.wikimedia.org/r/700260 (owner: 10Lars Wirzenius) [12:53:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16640 and previous config saved to /var/cache/conftool/dbconfig/20210618-125306-root.json [12:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:14:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29935/console" [puppet] - 10https://gerrit.wikimedia.org/r/700359 (owner: 10Giuseppe Lavagetto) [13:15:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::web::yaml_defs: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/700359 (owner: 10Giuseppe Lavagetto) [13:15:39] (03PS2) 10Marostegui: wmnet: Promote db1130 to s5 master [dns] - 10https://gerrit.wikimedia.org/r/699136 (https://phabricator.wikimedia.org/T284529) [13:16:55] (03PS1) 10Zfilipin: selenium: Replace selenium npm script with selenium-test [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/700378 (https://phabricator.wikimedia.org/T274579) [13:18:33] (03PS3) 10Zfilipin: 
selenium: Upgrade WebdriverIO to v7 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [13:19:54] (03CR) 10Zfilipin: "PS3 is a rebase on top of https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/700378" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [13:24:15] (03CR) 10Zfilipin: [C: 03+1] selenium: Upgrade WebdriverIO to v7 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [13:29:32] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] (03PS7) 10Elukey: Add the custom_deploy.d directory with basic Istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) [13:44:26] (03PS7) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [13:47:49] (03CR) 10Elukey: "Changes made:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [14:06:51] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: add wikibase_rewrites boolean [puppet] - 10https://gerrit.wikimedia.org/r/700385 [14:06:53] (03PS1) 10Giuseppe Lavagetto: mediawiki: convert configurations to use wikibase_rewrites [puppet] - 10https://gerrit.wikimedia.org/r/700386 [14:12:15] (03PS1) 10Giuseppe Lavagetto: mediawiki: properly support wikibase rewrites [deployment-charts] - 10https://gerrit.wikimedia.org/r/700387 [14:12:43] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: properly support wikibase rewrites [deployment-charts] - 
10https://gerrit.wikimedia.org/r/700387 (owner: 10Giuseppe Lavagetto) [14:32:02] (03PS1) 10Muehlenhoff: Add helper tool for returning a user's current TGT (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/700389 (https://phabricator.wikimedia.org/T283242) [14:32:25] (03PS2) 10Muehlenhoff: Add helper tool for returning a user's current TGT (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/700389 (https://phabricator.wikimedia.org/T283242) [14:33:06] (03CR) 10jerkins-bot: [V: 04-1] Add helper tool for returning a user's current TGT (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/700389 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [14:44:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:46:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:52:28] I need to mess with mwdebug1001 for debugging T285125 [14:52:28] T285125: image sizes not displayed on beta - https://phabricator.wikimedia.org/T285125 [15:00:45] Amir1: it is friday, we had all hands, I don't think there is anybody around today so consider it the week-end already [15:01:08] Amir1: and that bug report is on beta , seems like it is the l10n cache being broken there [15:01:32] hashar: yeah I want to just compare values between production and beta [15:01:40] nothing related to deployment [15:01:51] with var_dump here and there [15:02:03] then clean it up with scap pull [15:02:41] be careful :] [15:04:58] definitely [15:05:08]
There's two other bugs that are likely more related with the l10n cache stuff [15:05:09] and you have a sane .plan so yeah that is good [15:05:27] then I don't quite know how to regenerate the l10n stuff nowadays [15:05:36] jerkins should be doing it [15:06:09] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/9407/console and does look to be [15:06:58] it's just that I merged this risky patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/693298/ [15:07:30] and I want to make sure everything is fine and this is very specific to file pages so part of me is really worried it might have been caused by this patch [15:07:59] but if it's happening other places, then it's not related and I can stop pulling my hair [15:31:26] (03PS1) 10David Caro: nova: collect VPS data from the hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/700393 [15:52:50] (03PS1) 10Elukey: Add istio 1.9.5 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/700396 (https://phabricator.wikimedia.org/T278192) [15:56:22] (03PS1) 10Elukey: Add support for istioctl 1.9.5 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/700397 (https://phabricator.wikimedia.org/T278192) [16:11:57] (03CR) 10Arturo Borrero Gonzalez: nova: collect VPS data from the hypervisor (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/700393 (owner: 10David Caro) [16:50:41] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Urbanecm) Tbh, I'm suspicious of the need to access here. The tokens provided are for connecting a (previously unconnected) site to a search console, while all Wikimedia projects are already... [17:36:22] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Edu) I'm honestly surprised by your answer @Urbanecm. This is an explicit problem in almost all Wikinews languages. 
As I stated earlier I need access for a short period of time, it can be up... [17:51:29] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10ssingh) Hi all. I have brought this up and are discussing this internally in SRE. I will update the ticket when I have more information. Thank you. [17:53:09] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Urbanecm) >>! In T285091#7163224, @Edu wrote: > I'm honestly surprised by your answer @Urbanecm. This is an explicit problem in almost all Wikinews languages. As I stated earlier I need acce... [18:13:09] (03CR) 10Sahilgrewalhere: [C: 03+1] selenium: Replace selenium npm script with selenium-test [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/700378 (https://phabricator.wikimedia.org/T274579) (owner: 10Zfilipin) [20:40:34] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [20:46:10] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [20:55:37] !log Remove doc1001:/srv/doc/mediawiki-core/wmf-1.36.0-wmf.31-testing [20:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:08] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:12:46] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:06:52] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:13:28] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:37:36] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:34] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [23:33:24] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state