[00:02:22] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:38] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:18] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm) I'd like to enable Score on one or two more projects of any size for a bit more testing before doin...
[01:02:54] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:14:08] (PS1) Legoktm: wmcs.toolforge.start_instance_with_prefix: Suppress bogus pylint warning [cookbooks] - https://gerrit.wikimedia.org/r/711025
[01:17:36] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:20:24] (PS2) Legoktm: sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707)
[01:25:42] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T0200)
[02:01:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:47] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.18 [core] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711026
[02:06:49] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.18 [core] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711026 (owner: TrainBranchBot)
[02:07:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:01] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:31] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.18 [core] (wmf/1.37.0-wmf.18) - https://gerrit.wikimedia.org/r/711026 (owner: TrainBranchBot)
[02:33:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:11] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:13] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:55] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:39:41] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:02:41] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2104 with weight 0 T287454', diff saved to https://phabricator.wikimedia.org/P16981 and previous config saved to /var/cache/conftool/dbconfig/20210810-041627-root.json
[04:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:37] T287454: Switchover s2 from
db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[04:17:14] In 45 minutes we're going to failover s2 master
[04:20:09] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:23:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454
[04:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:44] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[04:23:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Master switchover s2 T287454
[04:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:17] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:23] (PS1) Marostegui: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/711029 (https://phabricator.wikimedia.org/T287454)
[04:37:36] (Abandoned) Marostegui: mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/710516 (https://phabricator.wikimedia.org/T287454) (owner: Marostegui)
[04:38:12] (CR) Marostegui: [C: +2] mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/711029 (https://phabricator.wikimedia.org/T287454) (owner: Marostegui)
[04:38:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:40:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded:
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:00:04] marostegui and kormat: That opportune time is upon us again. Time for a s2 database master failover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T0500).
[05:00:09] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:00:11] going to go ahead
[05:00:34] !log Starting s2 codfw failover from db2107 to db2104 - T287454
[05:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:41] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[05:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T287454', diff saved to https://phabricator.wikimedia.org/P16982 and previous config saved to /var/cache/conftool/dbconfig/20210810-050051-root.json
[05:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:05] ro confirmed
[05:01:31] ro confirmed on plwiki :)
[05:01:58] mmm the script is taking ages
[05:02:05] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:10] might be the dns, let me check
[05:02:39] kormat: around?
[05:03:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:04:01] it is also not working from cumin in codfw
[05:04:47] Going to abort the maintenance
[05:04:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:05:25] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 139 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 as read-write again - master has not been swapped T287454', diff saved to https://phabricator.wikimedia.org/P16983 and previous config saved to /var/cache/conftool/dbconfig/20210810-050604-root.json
[05:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:11] T287454: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454
[05:06:12] ok, s2 is writable again
[05:06:15] master wasn't swapped
[05:06:23] I need to revert puppet too
[05:06:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:25] (PS1) Marostegui: Revert "mariadb: Promote db2104 to s2 master" [puppet] - https://gerrit.wikimedia.org/r/710713
[05:09:11] RECOVERY - MediaWiki exceptions and fatals
per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:10:38] (CR) Marostegui: [C: +2] Revert "mariadb: Promote db2104 to s2 master" [puppet] - https://gerrit.wikimedia.org/r/710713 (owner: Marostegui)
[05:11:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: repool after failed switchover', diff saved to https://phabricator.wikimedia.org/P16984 and previous config saved to /var/cache/conftool/dbconfig/20210810-051131-root.json
[05:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: repool after failed switchover', diff saved to https://phabricator.wikimedia.org/P16985 and previous config saved to /var/cache/conftool/dbconfig/20210810-052635-root.json
[05:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:07] SRE, InternetArchiveBot, Traffic, Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests (HTTP 415 errors) - https://phabricator.wikimedia.org/T269914
(jijiki) @Cyberpower678 please respond to the above comment
[05:32:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:34:35] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:41:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: repool after failed switchover', diff saved to https://phabricator.wikimedia.org/P16986 and previous config saved to /var/cache/conftool/dbconfig/20210810-054139-root.json
[05:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: repool after failed switchover', diff saved to https://phabricator.wikimedia.org/P16987 and previous config saved to /var/cache/conftool/dbconfig/20210810-055642-root.json
[05:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:01:17] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:05:17] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:14:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:26:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:36] (PS2) JMeybohm: Add dragonfly-peer and supernode cumin aliases [puppet] - https://gerrit.wikimedia.org/r/710528 (https://phabricator.wikimedia.org/T286054)
[06:36:33] (PS7) Jelto: profile::gitlab rsync latest backup to passive host [puppet] - https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867)
[06:38:57] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30526/console" [puppet] - https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[06:42:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:52] (CR) Jelto: [V: +1] "@JMeybohm thanks again, both comments should be fix now"
[puppet] - https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[06:50:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:37] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:07] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:08] marostegui: yep, I'm around. (if my answer seems delayed, it's replication lag, and definitely not that i forgot to set an alarm this morning 🥺 )
[06:58:52] kormat: check -data-persistence when you can
[06:59:26] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (ema) >>! In T285436#7238108, @Legoktm wrote: > @NRodriguez we had a slight mixup, but I've updated the checklist and we're all set to add your access, I just wanted to confirm wit...
[06:59:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:01:48] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (ema) p:Triage→Medium
[07:02:03] marostegui: oh, yikes
[07:02:17] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:42] (PS8) Jelto: profile::gitlab rsync latest backup to passive host [puppet] - https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867)
[07:05:10] (PS2) MMandere: Traffic: Add varnish prometheus exporter alert [alerts] - https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660)
[07:05:50] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30527/console" [puppet] - https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[07:08:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:18:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:18:13] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:19:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:21:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:22] (PS7) Elukey: Add the Kubeflow storage initializer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919)
[07:24:53] PROBLEM - WDQS high update lag on wdqs2003 is CRITICAL: 4.324e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:27:15] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:20] (CR) JMeybohm: [C: +1] Add the Kubeflow storage initializer docker image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[07:28:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:02] (CR) JMeybohm: [C: -1] profile::gitlab rsync latest backup to passive host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/710948
(https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[07:31:25] (CR) Giuseppe Lavagetto: [C: +1] toolhub: initial chart (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: BryanDavis)
[07:32:04] (CR) Vgutierrez: trafficserver: ensure sysconfdir exists on default instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/710969 (owner: Ema)
[07:32:38] (CR) Elukey: [V: +2 C: +2] Add the Kubeflow storage initializer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/710584 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[07:33:12] !log installing lynx security updates
[07:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:21] (PS5) Ema: trafficserver: ensure sysconfdir exists on default instance [puppet] - https://gerrit.wikimedia.org/r/710969
[07:48:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:03] (CR) Giuseppe Lavagetto: [C: +1] admin_ng: Switch mwdebug namespace to allow-mediawiki-psp [deployment-charts] - https://gerrit.wikimedia.org/r/710986 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[07:50:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:51:14] (CR) Giuseppe Lavagetto: [C: -1] "Would it make sense to only add the ptrace capability if the slow log timer is set to a value different from 0 (which disables it)" [deployment-charts] - https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[07:52:19] (CR) JMeybohm: [C: +2] admin_ng: Switch mwdebug namespace to allow-mediawiki-psp [deployment-charts] - https://gerrit.wikimedia.org/r/710986 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[07:52:33] (PS1) Elukey: kubeflow-kfserving: move to the Wikimedia storage-initializer [deployment-charts] - https://gerrit.wikimedia.org/r/711096 (https://phabricator.wikimedia.org/T272919)
[07:52:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:00] (Merged) jenkins-bot: admin_ng: Switch mwdebug namespace to allow-mediawiki-psp [deployment-charts] - https://gerrit.wikimedia.org/r/710986 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[07:56:44] (CR) Vgutierrez: [C: +1] trafficserver: ensure sysconfdir exists on default instance [puppet] - https://gerrit.wikimedia.org/r/710969 (owner: Ema)
[07:58:03] (CR) Ema: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30528/console" [puppet] - https://gerrit.wikimedia.org/r/710969 (owner: Ema)
[07:58:43] (CR) Ema: [V: +1 C: +2] trafficserver: ensure sysconfdir exists on default instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/710969 (owner: Ema)
[08:00:31] (CR) Muehlenhoff: [C: +1] "Looks good to me, thanks!"
[homer/public] - https://gerrit.wikimedia.org/r/710943 (https://phabricator.wikimedia.org/T286911) (owner: Ayounsi)
[08:01:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:43] !log installing openjdk-8 security updates on stretch
[08:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:46] !log upload thanos 0.21.1-1 and upgrade prometheus1004 / thanos-fe2001 to it - T288326
[08:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:53] T288326: thanos compact crash during downsampling and restart on invalid checksum for large block - https://phabricator.wikimedia.org/T288326
[08:07:32] (PS1) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/711097
[08:08:01] (CR) jerkins-bot: [V: -1] mediawiki::tlsproxy::yaml_defs: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/711097 (owner: Giuseppe Lavagetto)
[08:10:17] (PS2) JMeybohm: mediawiki: Add the SYS_PTRACE capability to the php container [deployment-charts] - https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315)
[08:11:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:13:03] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.524e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:13:05] (CR) Elukey: [C: +2] "Seems easy enough to just self-merge, please lemme know if anything is off :)" [deployment-charts] - https://gerrit.wikimedia.org/r/711096 (https://phabricator.wikimedia.org/T272919) (owner:
Elukey)
[08:13:20] (PS2) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/711097
[08:14:17] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:07] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:23] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:43] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:59] (PS3) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/711097
[08:16:21] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:41] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:01] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:18:13] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:18:14] (CR) Ayounsi: [C: +2] discard traffic to mx2002 tcp/25 [homer/public] - https://gerrit.wikimedia.org/r/710943 (https://phabricator.wikimedia.org/T286911) (owner: Ayounsi)
[08:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:54] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::tlsproxy::yaml_defs: fix file permissions [puppet] - https://gerrit.wikimedia.org/r/711097 (owner: Giuseppe Lavagetto)
[08:19:01] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:19:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:59] (Merged) jenkins-bot: discard traffic to mx2002 tcp/25 [homer/public] - https://gerrit.wikimedia.org/r/710943 (https://phabricator.wikimedia.org/T286911) (owner: Ayounsi)
[08:20:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:20:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:26:35] (CR) David Caro: "LGTM, though might be superseded with T287465, leaving the approval for the owners of the repository."
[cookbooks] - https://gerrit.wikimedia.org/r/711025 (owner: Legoktm)
[08:26:51] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:02] (PS1) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/711100
[08:28:34] (CR) David Caro: [C: +2] am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/710497 (owner: David Caro)
[08:28:36] (CR) David Caro: [V: +2 C: +2] am: added main function tests and small refactor [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/710497 (owner: David Caro)
[08:29:04] (CR) JMeybohm: mediawiki: Add the SYS_PTRACE capability to the php container (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[08:29:51] (CR) David Caro: [V: +2 C: +2] am: added main function tests and small refactor (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/710497 (owner: David Caro)
[08:30:01] (CR) David Caro: [V: +2 C: +2] global: linted and added vim files to gitignore [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/710498 (owner: David Caro)
[08:30:50] (CR) Giuseppe Lavagetto: [C: +1] mediawiki: Add the SYS_PTRACE capability to the php container [deployment-charts] - https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) (owner: JMeybohm)
[08:31:12] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::tlsproxy::yaml_defs: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/711100 (owner: Giuseppe Lavagetto)
[08:34:04] (CR) JMeybohm: [C: +2] mediawiki: Add the SYS_PTRACE capability to the php container [deployment-charts] -
10https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [08:34:35] (03CR) 10David Caro: aptrepo: Drop thirdparty/kubeadm-k8s-1-17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [08:36:27] (03PS2) 10Majavah: aptrepo: Drop thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/710669 [08:36:47] (03CR) 10Majavah: aptrepo: Drop thirdparty/kubeadm-k8s-1-17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [08:37:07] (03Merged) 10jenkins-bot: mediawiki: Add the SYS_PTRACE capability to the php container [deployment-charts] - 10https://gerrit.wikimedia.org/r/710987 (https://phabricator.wikimedia.org/T288315) (owner: 10JMeybohm) [08:38:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:38:56] (03CR) 10Filippo Giunchedi: "Nice job overall! See inline." [alerts] - 10https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [08:39:22] (03CR) 10David Caro: [C: 03+1] aptrepo: Drop thirdparty/kubeadm-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/710669 (owner: 10Majavah) [08:39:59] (03PS2) 10Giuseppe Lavagetto: mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 [08:40:01] (03PS1) 10Giuseppe Lavagetto: mwdebug: also use discovery listeners from puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/711101 [08:40:03] (03PS1) 10Giuseppe Lavagetto: mwdebug: add a small slow log timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/711102 [08:41:05] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:41:45] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: remove from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/710966 (owner: 10Giuseppe Lavagetto) [08:43:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: also use discovery listeners from puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/711101 (owner: 10Giuseppe Lavagetto) [08:44:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:46:12] (03CR) 10JMeybohm: [C: 03+1] mwdebug: add a small slow log timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/711102 (owner: 10Giuseppe Lavagetto) [08:46:31] (03Merged) 10jenkins-bot: mwdebug: also use discovery listeners from puppet [deployment-charts] - 10https://gerrit.wikimedia.org/r/711101 (owner: 10Giuseppe Lavagetto) [08:49:04] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:54] (03CR) 10JMeybohm: kubeflow-kfserving: move to the Wikimedia storage-initializer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/711096 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [08:53:14] (03PS1) 10Btullis: Fix the ownership of more druid directories [puppet] - 10https://gerrit.wikimedia.org/r/711103 (https://phabricator.wikimedia.org/T255148) [08:54:12] (03CR) 10Elukey: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/711103 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [08:54:39] (03PS1) 10ArielGlenn: Deal with other ways the run settings file may be corrupt [dumps] - 10https://gerrit.wikimedia.org/r/711104 (https://phabricator.wikimedia.org/T288192) [08:55:48] (03CR) 10ArielGlenn: [C: 03+2] Deal with other ways the run settings file may be corrupt [dumps] - 10https://gerrit.wikimedia.org/r/711104 (https://phabricator.wikimedia.org/T288192) (owner: 10ArielGlenn) [08:56:14] (03Merged) 10jenkins-bot: Deal with other ways the run settings file may be corrupt [dumps] - 10https://gerrit.wikimedia.org/r/711104 (https://phabricator.wikimedia.org/T288192) (owner: 10ArielGlenn) [08:58:23] !log ariel@deploy1002 Started deploy [dumps/dumps@170e394]: more resilience when reading bad run cache settings files [08:58:26] !log ariel@deploy1002 Finished deploy [dumps/dumps@170e394]: more resilience when reading bad run cache settings files (duration: 00m 03s) [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:34] 10Puppet, 10Infrastructure-Foundations, 10MW-on-K8s, 10Kubernetes: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (10JMeybohm) [09:00:22] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for 
release 'pinkunicorn' . [09:00:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:29] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:34] (03PS1) 10Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) [09:02:18] (03PS1) 10David Caro: wmsc.puppet_alert: force utf-8 encoding when opening files [puppet] - 10https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508) [09:03:05] (03PS1) 10Marostegui: db1107: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/711107 (https://phabricator.wikimedia.org/T288197) [09:04:32] !log removing stale Java 8 packages from logstash1024/1025/2023/2024/2025 (ELK7 Logstash cluster is on Java 11 for a while now) [09:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:49] (03CR) 10Marostegui: [C: 03+2] db1107: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/711107 (https://phabricator.wikimedia.org/T288197) (owner: 10Marostegui) [09:05:26] (03PS2) 10Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) [09:05:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:53] (03PS3) 10Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) [09:07:27] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) (owner: 10Marostegui) [09:07:52] (03CR) 10Btullis: [C: 03+2] Fix the ownership of more druid directories [puppet] - 10https://gerrit.wikimedia.org/r/711103 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [09:08:07] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:11:00] (03PS1) 10ArielGlenn: don't try to apply settings from a corrupt runsettings file [dumps] - 10https://gerrit.wikimedia.org/r/711109 (https://phabricator.wikimedia.org/T288192) [09:12:27] (03CR) 10ArielGlenn: [C: 03+2] don't try to apply settings from a corrupt runsettings file [dumps] - 10https://gerrit.wikimedia.org/r/711109 (https://phabricator.wikimedia.org/T288192) (owner: 10ArielGlenn) [09:14:36] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Qgil) Thank you for your quick replies! * SecurePoll needs Wimedia usernames (SUL). This might be tricky to extract. Options to do this seem to in... 
[09:17:33] !log running non-destructive test against s7/codfw (db2107/db2014) T288500 [09:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:41] T288500: db-switchover got stuck while failing over db2107 to db2104 - https://phabricator.wikimedia.org/T288500 [09:19:41] (03Merged) 10jenkins-bot: don't try to apply settings from a corrupt runsettings file [dumps] - 10https://gerrit.wikimedia.org/r/711109 (https://phabricator.wikimedia.org/T288192) (owner: 10ArielGlenn) [09:20:48] (03PS1) 10JMeybohm: mediawiki: Remove duplicate definition of FCGI_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/711110 [09:20:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:22:58] !log ariel@deploy1002 Started deploy [dumps/dumps@72ff209]: refuse to use info from corrupt run settings file [09:23:02] !log ariel@deploy1002 Finished deploy [dumps/dumps@72ff209]: refuse to use info from corrupt run settings file (duration: 00m 03s) [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:26:23] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:26:31] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:31:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:34:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: Remove duplicate definition of FCGI_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/711110 (owner: 10JMeybohm) [09:36:40] (03PS1) 10Elukey: knative-serving: add ca-certificates to the controller's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) [09:38:09] (03Merged) 10jenkins-bot: mediawiki: Remove duplicate definition of FCGI_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/711110 (owner: 10JMeybohm) [09:39:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:40:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:52] (03CR) 10JMeybohm: knative-serving: add ca-certificates to the controller's image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [09:42:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:13] 
(03PS2) 10Elukey: knative-serving: add ca-certificates to the controller's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) [09:45:33] (03CR) 10Elukey: knative-serving: add ca-certificates to the controller's image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [09:46:01] jouncebot: now [09:46:01] No deployments scheduled for the next 1 hour(s) and 13 minute(s) [09:46:24] I’ll deploy two no-op config changes now, since I’ll probably be away during the proper window later [09:47:01] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:16] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repoDatabase'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) [09:47:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBClientSettings['repoDatabase'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [09:48:12] (03Merged) 10jenkins-bot: Stop setting $wgWBClientSettings['repoDatabase'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708308 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [09:50:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:708308|Stop setting $wgWBClientSettings['repoDatabase'] (T257260)]] (duration: 00m 58s) [09:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [09:51:13] 
(03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientRepoDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708309 (https://phabricator.wikimedia.org/T257260) [09:51:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseClientRepoDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708309 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [09:52:11] (03PS1) 10Muehlenhoff: Remove access for toberto [puppet] - 10https://gerrit.wikimedia.org/r/711112 [09:52:21] (03Merged) 10jenkins-bot: Remove $wmgWikibaseClientRepoDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708309 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [09:52:46] (03PS1) 10Elukey: kubeflow,knative: use new controller image and docker registry endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/711113 (https://phabricator.wikimedia.org/T272919) [09:54:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708309|Remove $wmgWikibaseClientRepoDatabase (T257260)]] (1/2, prod) (duration: 00m 57s) [09:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:708309|Remove $wmgWikibaseClientRepoDatabase (T257260)]] (2/2, beta) (duration: 00m 57s) [09:55:18] alright, I’m done :) [09:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:02] ^ I can’t look into that, don’t have systemd journal access rights… [10:00:10] Lucas_WMDE: here's the log: https://phabricator.wikimedia.org/P16989 [10:00:14] i can't tell you what it _means_ though [10:00:39] 
thanks [10:00:49] I’m not sure what it means either but I suspect it’s not related to my deployments [10:00:54] ok :) [10:01:09] (03PS1) 10Marostegui: mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/711114 (https://phabricator.wikimedia.org/T287454) [10:01:34] it’s on a 5-minute timer, let’s see if it recovers on its own maybe [10:01:36] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/711114 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [10:02:31] kormat: that sounds like it means helm is still running from the last job [10:02:58] (03PS1) 10Giuseppe Lavagetto: services_proxy: remove inexistent listener from allowed list [puppet] - 10https://gerrit.wikimedia.org/r/711115 [10:03:54] Even though that shouldn't happen according to https://github.com/wikimedia/puppet/commit/e6d77cdac516472d435c4f5bc5f98076f3d11b40#diff-dae976558c36e3bd2ea4c14cb6a81cda1a675e2243726835c3ebeed1c5742c4e [10:05:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: remove inexistent listener from allowed list [puppet] - 10https://gerrit.wikimedia.org/r/711115 (owner: 10Giuseppe Lavagetto) [10:05:08] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/711114 (https://phabricator.wikimedia.org/T287454) (owner: 10Marostegui) [10:05:20] still failing [10:05:44] jayme: you were logging pinkunicorn stuff to SAL earlier, could the failed deploy_to_mwdebug.service be related to that? [10:06:42] oh, yeah. 
Sorry [10:06:51] we broke it...working [10:07:02] ok [10:10:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:29] (03PS1) 10JMeybohm: mediawiki: FPM needs to listen on 0.0.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/711116 [10:11:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: FPM needs to listen on 0.0.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/711116 (owner: 10JMeybohm) [10:14:15] (03CR) 10JMeybohm: [C: 04-1] knative-serving: add ca-certificates to the controller's image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [10:14:35] (03CR) 10JMeybohm: [C: 03+1] kubeflow,knative: use new controller image and docker registry endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/711113 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [10:15:07] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] mediawiki: FPM needs to listen on 0.0.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/711116 (owner: 10JMeybohm) [10:15:58] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:03] (03PS3) 10Elukey: knative-serving: add wmf-certificates to the controller's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) [10:16:22] (03CR) 10Elukey: "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [10:16:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for toberto [puppet] - 10https://gerrit.wikimedia.org/r/711112 (owner: 10Muehlenhoff) [10:18:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:47] (03CR) 10Elukey: [V: 03+2 C: 03+2] knative-serving: add wmf-certificates to the controller's image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711111 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [10:21:38] (03PS9) 10Dzahn: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:23:15] (03CR) 10Dzahn: profile::gitlab rsync latest backup to passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:24:14] (03CR) 10Elukey: [C: 03+2] kubeflow,knative: use new controller image and docker registry endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/711113 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [10:24:49] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:12] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:01] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:42] (03PS10) 10Dzahn: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:28:19] (03CR) 10jerkins-bot: [V: 04-1] profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:28:54] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] (03PS3) 10MMandere: Traffic: Add varnish prometheus exporter alert [alerts] - 10https://gerrit.wikimedia.org/r/710968 (https://phabricator.wikimedia.org/T283660) [10:32:07] (03PS11) 10Dzahn: profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:32:54] (03PS2) 10Giuseppe Lavagetto: mwdebug: add a small slow log timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/711102 [10:33:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:33:37] (03CR) 10Dzahn: [C: 03+1] "compiler output looks good and the same on both hosts on the surface, but when going into details and following the "change catalog" links" [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:34:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:05] <_joe_> hashar: it looks like CI for operations/deployment-charts is stuck [10:35:44] <_joe_> nevermind, the latest change actually ran [10:35:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: add a small slow log timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/711102 (owner: 10Giuseppe Lavagetto) [10:38:24] (03Merged) 10jenkins-bot: mwdebug: add a small slow log timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/711102 (owner: 10Giuseppe Lavagetto) [10:38:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30531/console" [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [10:45:05] 10SRE, 10Traffic: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) 05Open→03Resolved Thinking about this again I think we are good here now. One of the 2 checks was removed, the other stayed.... 
[10:47:04] (03PS1) 10JMeybohm: mwdebug: add a small slow log timeout, fix key [deployment-charts] - 10https://gerrit.wikimedia.org/r/711119 [10:47:22] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] mwdebug: add a small slow log timeout, fix key [deployment-charts] - 10https://gerrit.wikimedia.org/r/711119 (owner: 10JMeybohm) [10:48:43] 10SRE: Evaluate Nautobot fork of Netbox and decide whether to use. - https://phabricator.wikimedia.org/T288515 (10Peachey88) [10:49:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Jclark-ctr) @nskaggs we have gone through steps with dell and preformed hardware test looks to be operational now with no more errors if you can put back in service i w... [10:50:13] (03Merged) 10jenkins-bot: mwdebug: add a small slow log timeout, fix key [deployment-charts] - 10https://gerrit.wikimedia.org/r/711119 (owner: 10JMeybohm) [10:52:23] (03PS1) 10Btullis: Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) [10:52:31] !log Install 10.4.21 on db1096 (s5 and s6) [10:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:44] !log etherpad deleting 2 pads as requested in T288328 [10:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:54:16] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:55:59] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Dzahn) I found this page by searching for the email: https://meta.wikimedia.org/wiki/Wikimedians_of_Democratic_Republic_of_Congo_User_Group seems legit, will create [10:56:02] !log Install 10.4.21 on db1169 (s1) [10:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:57:11] (03PS1) 10Muehlenhoff: Add component/jdk8 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/711121 (https://phabricator.wikimedia.org/T287960) [10:59:44] (03CR) 10Muehlenhoff: [C: 03+2] Add component/jdk8 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/711121 (https://phabricator.wikimedia.org/T287960) (owner: 10Muehlenhoff) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:01:46] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:25] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Dzahn) I tried to use the new list creation method as described on https://wikitech.wikimedia.org/wiki/Mailman#Create_a_mailing_list but the "LISTNAME" parameter was missing. I [[ https... [11:03:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:10:30] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:39] (03PS1) 10Muehlenhoff: Apply MX role to mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) [11:11:08] (03CR) 10jerkins-bot: [V: 04-1] Apply MX role to mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [11:11:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:28] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) 05Open→03Stalled [11:12:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [11:12:43] 10SRE, 10serviceops, 10Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [11:15:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units 
failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:18:07] (03PS2) 10Muehlenhoff: Apply MX role to mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) [11:19:27] 10SRE, 10Infrastructure-Foundations: Evaluate Nautobot fork of Netbox and decide whether to use. - https://phabricator.wikimedia.org/T288515 (10cmooney) [11:20:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/711123 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [11:23:40] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Dzahn) @Qgil Here is an alphabetical list of email addresses of all users with shell access: It's public info pulled out of the public git repo bu... [11:24:50] 10SRE, 10Infrastructure-Foundations: Evaluate Nautobot fork of Netbox and decide whether to use. 
- https://phabricator.wikimedia.org/T288515 (10cmooney) p:05Triage→03Medium [11:26:56] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:27:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:37:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:37:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:39:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:30] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01151 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [11:49:33] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10CapitainAfrika) there is also https://fr.wikipedia.org/wiki/Projet:R%C3%A9publique_d%C3%A9mocratique_du_Congo/Atelier_virtuel_Kinshasa Because soon we will organize a conference [11:59:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:01:10] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:01:45] (03CR) 10Kormat: mariadb: Promote db1107 to m3 master. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) (owner: 10Marostegui) [12:02:59] (03PS4) 10Marostegui: mariadb: Promote db1107 to m3 master. [puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) [12:03:31] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1107 to m3 master. 
[puppet] - 10https://gerrit.wikimedia.org/r/711105 (https://phabricator.wikimedia.org/T288197) (owner: 10Marostegui) [12:04:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:06:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:07:21] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10MSantos) Thanks, @Elitre! [12:08:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:08:10] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [12:08:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:48] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:17:24] !log ppchelko@deploy1002 Started deploy [restbase/deploy@5791a7a]: Add count parameter to recommendations API T287227 [12:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] T287227: Recommendation API does not respect the count query parameter - https://phabricator.wikimedia.org/T287227 [12:19:06] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:19:54] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_stockpile_queue.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:11] (03CR) 10JMeybohm: [C: 03+1] profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [12:21:44] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:40] !log non-destructive (🤞) testing of db-switchover against s2/eqiad T288500 [12:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:48] T288500: db-switchover got stuck while failing over db2107 to db2104 - https://phabricator.wikimedia.org/T288500 [12:26:03] (03PS2) 10Btullis: Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) [12:28:10] jouncebot: now [12:28:10] No deployments scheduled for the next 3 hour(s) and 31 minute(s) [12:30:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:30:02] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) [12:30:04] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmgWBRepoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709504 (https://phabricator.wikimedia.org/T257260) [12:30:52] I’ll deploy those two config changes, more no-ops cleaning up the config :) [12:31:19] 
(deploy_to_mwdebug.service is still failing on deploy1002 btw) [12:31:33] (03CR) 10Lucas Werkmeister (WMDE): "wmf.17 is safely rolled out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [12:31:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [12:32:46] (03Merged) 10jenkins-bot: Stop setting $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [12:33:07] testing on mwdebug2001… [12:36:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:709503|Stop setting $wgWBRepoSettings['conceptBaseUri'] (T257260)]] (duration: 00m 58s) [12:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:45] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [12:36:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove wmgWBRepoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709504 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [12:37:38] (03Merged) 10jenkins-bot: Remove wmgWBRepoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709504 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [12:39:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (1/3, prod) (duration: 00m 57s) [12:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] RECOVERY - Check systemd state 
on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:05] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (2/3, beta) (duration: 00m 57s) [12:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:59] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Dzahn) I created the list via the web UI now and did not run into the "illegal name" issue there. List has been created now. @CapitainAfrika please register at https://lists.wikimedia... [12:42:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized tests/multiversion/StaticSettingsTest.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (3/3, test) (duration: 00m 57s) [12:42:27] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list - wikimediadrc-kinshasa - https://phabricator.wikimedia.org/T288410 (10Dzahn) 05Open→03Resolved a:03Dzahn [12:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:33] T257260: entitysources: Clean up any remainders of the legacy back/compat config in the mediawiki-config repository - https://phabricator.wikimedia.org/T257260 [12:42:45] alright, I’m done again :) [12:42:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:59] (03CR) 10Elukey: Switch one zookeeper node in the druid cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [12:45:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30532/console" [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [12:46:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:50:11] (03PS1) 10Elukey: knative-serving: force the controller to use ca certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/711127 (https://phabricator.wikimedia.org/T278194) [12:54:43] !log ppchelko@deploy1002 Finished deploy [restbase/deploy@5791a7a]: Add count parameter to recommendations API T287227 (duration: 37m 18s) [12:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:50] T287227: Recommendation API does not respect the count query parameter - https://phabricator.wikimedia.org/T287227 [12:56:05] (03CR) 10Jelto: [V: 03+1 C: 03+2] profile::gitlab rsync latest backup to passive host [puppet] - 10https://gerrit.wikimedia.org/r/710948 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [12:59:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:39] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:26] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:40] (03PS1) 10Elukey: kubeflow-kfserving: add quoting and refactor storage_init limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/711129 (https://phabricator.wikimedia.org/T272919) [13:18:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:19:28] !log installing perl security updates on Bullseye (older distros not affected) [13:19:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:52] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:21:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:26:36] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:32:31] !log updating bullseye installations to the latest state of testing [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:49:23] (03PS3) 10Btullis: Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) [13:49:52] (03CR) 10jerkins-bot: [V: 04-1] Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [13:50:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:55:07] (03PS1) 10Hnowlan: restbase: set lower check_disk thresholds for instance-data volume [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) [13:58:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:07:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:55] 10SRE, 10Traffic, 
10serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (10aborrero) I also tried using the pywikibot upload script, with similar result. This time, however, the script mentions ` action 'upload', server said: ('internal_api_error_DBQueryError', '[d8e17... [14:11:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:13:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:16:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:25:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:25:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:30:00] PROBLEM - etcd request latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 operation={get,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:31:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:31:48] RECOVERY - etcd request latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:33:56] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBRepoSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711138 (https://phabricator.wikimedia.org/T257260) [14:33:58] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseRepoEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711139 (https://phabricator.wikimedia.org/T257260) [14:34:00] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['entityNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711140 (https://phabricator.wikimedia.org/T257260) [14:34:02] (03PS1) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientEntityNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711141 (https://phabricator.wikimedia.org/T257260) [14:34:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:42:44] jouncebot: now [14:42:44] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [14:42:58] coool. 
Going to deploy horrible things [14:43:39] (03PS3) 10Ladsgroup: Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) [14:43:44] (03CR) 10Ladsgroup: [C: 03+2] Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [14:44:28] (03Merged) 10jenkins-bot: Reduce ten seconds from dispatch max time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710515 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [14:45:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:45:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:48:19] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710515|Reduce ten seconds from dispatch max time (T288175)]] (duration: 00m 58s) [14:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:28] T288175: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 [14:49:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:26] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 78 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:07:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:08] (03PS4) 10Btullis: Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) [15:10:51] (03CR) 10Btullis: [C: 03+2] Switch one zookeeper node in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:11:00] PROBLEM - Zookeeper Server on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [15:11:12] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:11:25] (03CR) 10Btullis: [C: 03+2] "Updated deployment plan in the ticket." 
[puppet] - 10https://gerrit.wikimedia.org/r/711120 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:26:34] (03PS1) 10Elukey: kubeflow: add wmf-certificates to the storage-initializer [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711151 (https://phabricator.wikimedia.org/T272919) [15:29:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 77 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:33:26] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:33:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:34:01] (03CR) 10Elukey: [V: 03+2 C: 03+2] kubeflow: add wmf-certificates to the storage-initializer [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/711151 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:34:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 39 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:35:21] (03PS1) 10ZPapierski: Add task manager data port configuration for flink session cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) [15:35:44] (03CR) 10jerkins-bot: 
[V: 04-1] Add task manager data port configuration for flink session cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) (owner: 10ZPapierski) [15:35:51] (03PS1) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [15:36:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:36:19] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:36:58] (03PS2) 10Elukey: kubeflow-kfserving: update chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/711129 (https://phabricator.wikimedia.org/T272919) [15:38:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:39:46] (03PS6) 10Ahmon Dancy: fpm-multiversion-base: Add php-wmerrors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) [15:39:54] (03PS1) 10Jcrespo: mediabackup: Add dummy passwords and keys for worker hosts [labs/private] - 10https://gerrit.wikimedia.org/r/711154 (https://phabricator.wikimedia.org/T276442) [15:40:23] (03PS1) 10Btullis: Switch the second zookeeper server in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711155 (https://phabricator.wikimedia.org/T255148) [15:40:35] (03CR) 10Ahmon Dancy: fpm-multiversion-base: Add php-wmerrors (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) (owner: 10Ahmon Dancy) [15:41:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:41:52] (03PS2) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [15:42:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Going to merge since this works fine (I hacked the specs with kubectl edit). As always, please follow up if you see something off. 
I haven" [deployment-charts] - 10https://gerrit.wikimedia.org/r/711127 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [15:43:14] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving: update chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/711129 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:43:38] (03CR) 10Elukey: [C: 03+1] Switch the second zookeeper server in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711155 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:43:54] (03CR) 10Btullis: [C: 03+2] Switch the second zookeeper server in the druid cluster [puppet] - 10https://gerrit.wikimedia.org/r/711155 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:45:16] (03PS3) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [15:45:35] (03CR) 10Krinkle: Move parsercache DB config to *Services.php (1/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [15:46:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:58] (03PS2) 10ZPapierski: Add task manager data port configuration for flink session cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) [15:51:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:52:37] (03PS2) 10Jcrespo: mediabackup: Add dummy passwords and keys for worker hosts [labs/private] - 10https://gerrit.wikimedia.org/r/711154 (https://phabricator.wikimedia.org/T276442) [15:54:26] (03CR) 10Giuseppe Lavagetto: [C: 
03+2] "I think there is a better approach available, namely running 3 instances of the same service with Restart=always is simpler and more elega" [puppet] - 10https://gerrit.wikimedia.org/r/710520 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [15:54:58] (03CR) 10Jcrespo: "Merging to test 711153" [labs/private] - 10https://gerrit.wikimedia.org/r/711154 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:55:16] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mediabackup: Add dummy passwords and keys for worker hosts [labs/private] - 10https://gerrit.wikimedia.org/r/711154 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:55:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:55:52] (03PS1) 10Btullis: Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) [15:56:42] (03CR) 10jerkins-bot: [V: 04-1] Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:57:21] (03PS2) 10Btullis: Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) [15:57:38] (03CR) 10Elukey: [C: 03+1] Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:57:52] (03CR) 10jerkins-bot: [V: 04-1] Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [15:58:07] (03PS2) 10Giuseppe Lavagetto: mediawiki: Migrate dispatching cron of testwikidatawiki to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/710519 
(https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [15:58:15] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Y on ps1-c1-codfw is CRITICAL: SNMP CRITICAL - ps1-c1-codfw-infeed-load-tower-B-phase-Y *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:27] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is CRITICAL: SNMP CRITICAL - ps1-c1-codfw-infeed-load-tower-B-phase-Z *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:59] PROBLEM - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is CRITICAL: SNMP CRITICAL - ps1-c1-codfw-infeed-load-tower-B-phase-X *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:04] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T1600). Please do the needful. [16:00:08] (03PS3) 10Btullis: Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) [16:00:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:01:43] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Y on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Y 458 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:01:51] !log installing c-ares security updates on buster [16:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:59] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-Z on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-Z 654 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:02:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC 
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30536/console" [puppet] - 10https://gerrit.wikimedia.org/r/710519 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [16:02:33] RECOVERY - ps1-c1-codfw-infeed-load-tower-B-phase-X on ps1-c1-codfw is OK: SNMP OK - ps1-c1-codfw-infeed-load-tower-B-phase-X 447 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:03:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki: Migrate dispatching cron of testwikidatawiki to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/710519 (https://phabricator.wikimedia.org/T288175) (owner: 10Ladsgroup) [16:04:35] (03PS4) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [16:05:28] 10SRE, 10LDAP-Access-Requests: LDAP Access to nda user group for TAndic - https://phabricator.wikimedia.org/T288527 (10Aklapper) Hi @TAndic, please see https://phabricator.wikimedia.org/project/profile/1564/ and the list there, plus please link any potential team docs to that page, for future requests. Thanks... 
[16:06:26] (03PS3) 10Krinkle: Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 [16:06:28] (03PS4) 10Krinkle: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 [16:06:30] (03PS4) 10Krinkle: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 [16:08:13] (03PS1) 10Muehlenhoff: Add library hint for c-ares [puppet] - 10https://gerrit.wikimedia.org/r/711163 [16:10:00] (03PS5) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [16:12:35] (03PS6) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [16:13:59] (03PS7) 10Jcrespo: mediabackup: Puppetize the media backup workers [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) [16:14:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:16:36] (03CR) 10Jcrespo: [C: 03+1] "This is ready to be merged as a first iteration: https://puppet-compiler.wmflabs.org/compiler1003/30539/ms-backup1001.eqiad.wmnet/fulldiff" [puppet] - 10https://gerrit.wikimedia.org/r/711153 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [16:17:44] (03CR) 10Btullis: [C: 03+2] Roll back recent change to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/711161 (https://phabricator.wikimedia.org/T255148) (owner: 10Btullis) [16:18:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. 
- https://phabricator.wikimedia.org/T288342 (10nskaggs) @aborrero can you ensure our future needs are expressed here? [16:18:43] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:17] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) 05Open→03Resolved Received the replacement switch. Rack in C1 U43. setup the mgmt password same as the server mgmt password. Update Netbox with new serial num... [16:20:20] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] fix shell for backup cronjob [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/710676 (https://phabricator.wikimedia.org/T288324) (owner: 10Jelto) [16:22:29] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:23:28] (03PS1) 10Dave Pifke: arclamp: add temporary excimer-k8s pipeline [puppet] - 10https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) [16:24:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (10cmooney) @nskaggs @aborrero might be better to add that to the parent task thanks. 
[16:25:05] PROBLEM - Zookeeper Server on an-druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [16:25:20] !log gitlab: run ansible to apply [[gerrit:710676|fix shell for backup cronjob]] (T288324) [16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:28] T288324: WARNING: In GitLab 14.0 we will begin removing all configuration backups older than yourgitlab_rails['backup_keep_time'] setting (currently set to: 259200) - https://phabricator.wikimedia.org/T288324 [16:26:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:28:59] (03PS3) 10JMeybohm: Add dragonfly-peer and supernode cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/710528 (https://phabricator.wikimedia.org/T286054) [16:29:01] (03PS1) 10JMeybohm: drafonfly: Clean up and document dragonfly classes [puppet] - 10https://gerrit.wikimedia.org/r/711168 [16:32:58] (03PS2) 10JMeybohm: drafonfly: Clean up and document dragonfly classes [puppet] - 10https://gerrit.wikimedia.org/r/711168 [16:33:52] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 [16:33:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. 
- btullis@cumin1001 [16:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] T287225: Add all prefixes defined in Blazegraph - https://phabricator.wikimedia.org/T287225 [16:36:57] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 (duration: 02m 10s) [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:27] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:42:46] (03PS1) 10MSantos: mobileapps: bump to 2021-08-10-143135-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/711170 [16:44:17] (03PS1) 10Btullis: Replace the username for btullis with Btullis [puppet] - 10https://gerrit.wikimedia.org/r/711171 (https://phabricator.wikimedia.org/T285754) [16:46:19] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.37.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711172 [16:46:21] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.37.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711172 (owner: 10Jeena Huneidi) [16:46:44] (03CR) 10Btullis: [C: 03+2] Replace the username for btullis with Btullis [puppet] - 10https://gerrit.wikimedia.org/r/711171 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [16:47:00] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.18 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/711172 (owner: 10Jeena Huneidi) [16:47:01] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.18 [16:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:51] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 [16:49:52] !log btullis@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 [16:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:03] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [16:54:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30541/console" [puppet] - 10https://gerrit.wikimedia.org/r/711168 (owner: 10JMeybohm) [16:57:07] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-08-10-143135-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/711170 (owner: 10MSantos) [16:59:10] (03CR) 10JMeybohm: [V: 03+1] "...whenever you have a minute." [puppet] - 10https://gerrit.wikimedia.org/r/711168 (owner: 10JMeybohm) [16:59:31] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-08-10-143135-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/711170 (owner: 10MSantos) [17:00:05] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T1700). 
[17:00:33] 10SRE, 10Release-Engineering-Team, 10Elections: Create list of developers eligible to vote on the 2021 board vote - https://phabricator.wikimedia.org/T288455 (10Tgr) If we could produce a list of commit author email addresses for all Gerrit commits merged in the given period, which doesn't seem that hard, th... [17:01:44] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:14] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 [17:02:15] !log btullis@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 [17:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:25] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10RobH) Can the setup of this device please be noted on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Atl... [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:19] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:05:07] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (Kanban): wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10dcaro) p:05Medium→03Triage a:05dcaro→03None [17:06:41] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:34] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [17:09:35] !log razzi@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [17:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/ [17:11:11] ps_%28service%29 [17:12:09] (03PS1) 10MSantos: Revert "mobileapps: bump to 2021-08-10-143135-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/710714 [17:12:21] (03PS2) 10Cwhite: hiera: add observability role_contacts [puppet] - 10https://gerrit.wikimedia.org/r/710617 [17:13:29] !log T288501 [WDQS] `ryankemper@wdqs2003:~$ sudo rm -fv /srv/wdqs/wikidata.jnl` [17:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:37] T288501: Blazegraph journal too large on wdqs2003 - https://phabricator.wikimedia.org/T288501 [17:15:32] (03CR) 10MSantos: [C: 03+2] Revert "mobileapps: bump to 2021-08-10-143135-production" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/710714 (owner: 10MSantos) [17:18:00] (03Merged) 10jenkins-bot: Revert "mobileapps: bump to 2021-08-10-143135-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/710714 (owner: 10MSantos) [17:18:23] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) updating the firmware now [17:19:33] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [17:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:19:46] !log T288501 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2005.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh wikidata journal to resolve disk issue" --blazegraph_instance blazegraph` on `cumin2001` tmux session `wdqs_data_xfer` [17:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:53] T288501: Blazegraph journal too large on wdqs2003 - https://phabricator.wikimedia.org/T288501 [17:23:36] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.18 (duration: 36m 35s) [17:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:57] PROBLEM - Host graphite2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:07] RECOVERY - Disk space on wdqs2003 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2003&var-datasource=codfw+prometheus/ops [17:32:43] RECOVERY - Host graphite2003 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [17:42:06] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) firmware updated, BIOS was current but iDRAC needed updating. Changed root password. [17:42:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) [17:43:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) [17:43:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:45:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:54:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) a:05Cmjohnson→03RobH [17:58:43] 10SRE, 10LDAP-Access-Requests: LDAP Access to nda user group for TAndic - https://phabricator.wikimedia.org/T288527 (10TAndic) Thanks, @Aklapper ! Should I file a new request through the task template linked or stick with this one? If sticking with this one: Username: TAndic Shell access: No (would like, pr... 
[18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T1800) [18:03:55] (03PS1) 10Cmjohnson: setup dumpsdata1004-5, dhpd, site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711182 (https://phabricator.wikimedia.org/T283290) [18:05:26] (03CR) 10Cmjohnson: [C: 03+2] setup dumpsdata1004-5, dhpd, site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711182 (https://phabricator.wikimedia.org/T283290) (owner: 10Cmjohnson) [18:08:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10RLazarus) [18:08:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10RLazarus) p:05Triage→03High [18:10:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dumpsdata1004.eqiad.wmnet ` The log can be found in... 
[18:12:25] PROBLEM - IPMI Sensor Status on maps2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:15:36] (03CR) 10Krinkle: [C: 03+1] arclamp: add temporary excimer-k8s pipeline [puppet] - 10https://gerrit.wikimedia.org/r/711166 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [18:24:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dumpsdata1005.eqiad.wmnet ` The log can be found in... [18:32:07] (03PS1) 10Cmjohnson: ganetia1023-24 setup netboot.cfg, dhcpd file, site.pp [puppet] - 10https://gerrit.wikimedia.org/r/711183 (https://phabricator.wikimedia.org/T283036) [18:33:02] (03CR) 10Legoktm: [C: 03+2] wmcs.toolforge.start_instance_with_prefix: Suppress bogus pylint warning (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/711025 (owner: 10Legoktm) [18:35:31] (03PS1) 10Zabe: logstash: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/711184 (https://phabricator.wikimedia.org/T273673) [18:36:19] (03Merged) 10jenkins-bot: wmcs.toolforge.start_instance_with_prefix: Suppress bogus pylint warning [cookbooks] - 10https://gerrit.wikimedia.org/r/711025 (owner: 10Legoktm) [18:36:28] (03Merged) 10jenkins-bot: sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [18:41:34] (03PS1) 10Cmjohnson: update partman receipe used for dumpsdtat1004 and 1005 to partman/custom/dumpsdata100X.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711185 (https://phabricator.wikimedia.org/T283290) [18:42:11] !log ryankemper@cumin2001 END (PASS) 
- Cookbook sre.wdqs.data-transfer (exit_code=0) [18:42:15] (03CR) 10Cmjohnson: [C: 03+2] ganetia1023-24 setup netboot.cfg, dhcpd file, site.pp [puppet] - 10https://gerrit.wikimedia.org/r/711183 (https://phabricator.wikimedia.org/T283036) (owner: 10Cmjohnson) [18:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:19] (03CR) 10jerkins-bot: [V: 04-1] update partman receipe used for dumpsdtat1004 and 1005 to partman/custom/dumpsdata100X.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711185 (https://phabricator.wikimedia.org/T283290) (owner: 10Cmjohnson) [18:43:35] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:44:19] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 4687 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:44:40] (03PS2) 10Cmjohnson: update partman receipe used for dumpsdtat1004/1005 to dumpsdata100X.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711185 (https://phabricator.wikimedia.org/T283290) [18:45:26] !log T288501 `data-transfer` of `wikidata.jnl` completed successfully. 
Host needs to catch up on ~22 hours of WDQS lag before being re-pooled [18:45:31] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:34] T288501: Blazegraph journal too large on wdqs2003 - https://phabricator.wikimedia.org/T288501 [18:45:51] RECOVERY - WDQS high update lag on wdqs2003 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 4779 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:46:11] (03CR) 10Cmjohnson: [C: 03+2] update partman receipe used for dumpsdtat1004/1005 to dumpsdata100X.cfg [puppet] - 10https://gerrit.wikimedia.org/r/711185 (https://phabricator.wikimedia.org/T283290) (owner: 10Cmjohnson) [18:46:15] !log T288501 (Misread grafana graph, `wdqs2003` only has 1.33 hours to catch up on) [18:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dumpsdata1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dumpsdata1004.eqia... [18:46:46] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:46:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dumpsdata1005.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dumpsdata1005.eqia... 
[18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:23] !log [WDQS] `ryankemper@wdqs2005:~$ sudo depool` (~1.26 hours of lag) [18:47:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [18:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dumpsdata1005.eqiad.wmnet ` T... [18:48:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dumpsdata1004.eqiad.wmnet ` T... [18:49:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ganeti1023.eqiad.wmnet ` The log can be found in `/var... [18:51:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ganeti1024.eqiad.wmnet ` The log can be found in `/var... 
[18:51:51] (03PS4) 10Legoktm: Add configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [18:51:53] (03PS1) 10Legoktm: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 [18:52:40] (03PS5) 10Legoktm: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [18:52:42] (03PS2) 10Legoktm: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 [18:54:19] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (10Legoktm) p:05High→03Medium [18:59:48] (03PS1) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) [19:00:04] jeena and twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T1900). 
[19:02:06] (03PS1) 10Jeena Huneidi: group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711188 [19:02:08] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711188 (owner: 10Jeena Huneidi) [19:02:34] (03PS1) 10RobH: new mc nodes install params [puppet] - 10https://gerrit.wikimedia.org/r/711189 (https://phabricator.wikimedia.org/T274925) [19:02:47] (03PS3) 10Legoktm: noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 [19:02:49] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.18 refs T281159 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711188 (owner: 10Jeena Huneidi) [19:02:51] (03CR) 10Legoktm: noc: Expose primary datacenter on conf/ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [19:03:08] (03PS2) 10RobH: new mc nodes install params [puppet] - 10https://gerrit.wikimedia.org/r/711189 (https://phabricator.wikimedia.org/T274925) [19:03:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: REIMAGE [19:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:44] (03CR) 10RobH: [C: 03+2] new mc nodes install params [puppet] - 10https://gerrit.wikimedia.org/r/711189 (https://phabricator.wikimedia.org/T274925) (owner: 10RobH) [19:03:46] (03PS2) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) [19:04:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:04:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: REIMAGE 
[19:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:20] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.18 refs T281159 [19:04:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE [19:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:27] T281159: 1.37.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T281159 [19:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: REIMAGE [19:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1023.eqiad.wmnet with reason: REIMAGE [19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: REIMAGE [19:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:06] (03PS1) 10RobH: fixing entry for new mc host [puppet] - 10https://gerrit.wikimedia.org/r/711191 (https://phabricator.wikimedia.org/T274925) [19:09:15] (03PS2) 10RobH: fixing entry for new mc host [puppet] - 10https://gerrit.wikimedia.org/r/711191 (https://phabricator.wikimedia.org/T274925) [19:09:26] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE [19:09:27] (03CR) 10RobH: [C: 03+2] fixing entry for new mc host [puppet] - 10https://gerrit.wikimedia.org/r/711191 (https://phabricator.wikimedia.org/T274925) (owner: 10RobH) [19:09:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:47] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: REIMAGE [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:12:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1039.eqiad.wmnet', 'mc1040.eqiad.w... [19:13:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [19:14:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:35] (03PS1) 10Michael DiPietro: add newhire mdipietro key [labs/private] - 10https://gerrit.wikimedia.org/r/711192 (https://phabricator.wikimedia.org/T287287) [19:15:18] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:15:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dumpsdata1004.eqiad.wmnet'] ` and were **ALL** successful. 
[19:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1024.eqiad.wmnet'] ` and were **ALL** successful. [19:16:05] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [19:16:07] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1023.eqiad.wmnet'] ` and were **ALL** successful. [19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dumpsdata1005.eqiad.wmnet'] ` and were **ALL** successful. 
[19:17:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:05] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] add newhire mdipietro key [labs/private] - 10https://gerrit.wikimedia.org/r/711192 (https://phabricator.wikimedia.org/T287287) (owner: 10Michael DiPietro) [19:20:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) [19:20:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (10Cmjohnson) 05Open→03Resolved all tasks completed [19:20:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) [19:21:22] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Cmjohnson) 05Open→03Resolved All tasks completed [19:21:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:40] (03CR) 10Legoktm: [V: 03+2 C: 03+2] fpm-multiversion-base: Add php-wmerrors [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) (owner: 10Ahmon Dancy) [19:24:24] (03CR) 10Legoktm: "Successfully published image docker-registry.discovery.wmnet/php7.2-fpm-multiversion-base:1.0.1" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710621 (https://phabricator.wikimedia.org/T285309) (owner: 10Ahmon Dancy) [19:25:07] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1039.eqiad.wmnet with reason: 
REIMAGE [19:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:08] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1040.eqiad.wmnet with reason: REIMAGE [19:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:22] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1039.eqiad.wmnet with reason: REIMAGE [19:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:41] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1040.eqiad.wmnet with reason: REIMAGE [19:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:36] (03CR) 10Ahmon Dancy: [C: 03+1] Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [19:44:37] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:09] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 1168 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:46:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:48:41] (03CR) 10Cwhite: [C: 03+2] logstash: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/711184 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [19:58:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 
(10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1038.eqiad.wmnet'] ` The log can be found in `/var/log... [20:00:03] (03CR) 10Eevans: [C: 03+1] maps: disable cassandra metrics collector [puppet] - 10https://gerrit.wikimedia.org/r/710984 (https://phabricator.wikimedia.org/T186567) (owner: 10Hnowlan) [20:02:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:03:47] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [20:05:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:06:38] (03CR) 10Eevans: [C: 03+1] cassandra: remove cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/710985 (https://phabricator.wikimedia.org/T186567) (owner: 10Hnowlan) [20:13:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:14:03] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:14:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1041.eqiad.wmnet', 'mc1042.eqiad.w... [20:15:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:15:43] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:20:00] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Install wiki-specific php extensions in the mediawiki production image - https://phabricator.wikimedia.org/T285309 (10Krinkle) [20:20:05] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Krinkle) [20:21:50] (03CR) 10Krinkle: [C: 03+1] "Beware of the scap trap nature of this change w.r.t sync order (given no atomic/restarts yet)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [20:22:10] legoktm: I've got a few wmf-config patches to roll out soon, could take yours with it and/or you mine. [20:22:18] looking at the train status now [20:22:41] jeena: all clear for your deploy window? [20:22:45] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Andrew) Chris replaced this drive (apparently possible without power-down) but now we need to rebuild the RAID. 
[20:22:54] Krinkle: please :) otherwise I was going to get to it during the backport window [20:22:57] yep all clear Krinkle [20:23:12] (03PS3) 10Krinkle: noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 [20:23:15] (03CR) 10Krinkle: [C: 03+2] noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 (owner: 10Krinkle) [20:23:57] (03Merged) 10jenkins-bot: noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 (owner: 10Krinkle) [20:26:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1041.eqiad.wmnet with reason: REIMAGE [20:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:22] (03CR) 10Eevans: "So, for posterity sake (at least), but also so I can be sure I understand:" [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [20:28:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1042.eqiad.wmnet with reason: REIMAGE [20:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:03] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1041.eqiad.wmnet with reason: REIMAGE [20:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:49] (03PS4) 10Krinkle: Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 [20:29:53] (03CR) 10Krinkle: [C: 03+2] Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [20:30:27] * Krinkle testing on mwdebug2002 [20:30:38] (03Merged) 10jenkins-bot: Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 
10Krinkle) [20:30:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1043.eqiad.wmnet with reason: REIMAGE [20:30:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) @ayounsi @cmooney cloudsw2-d5 is ready [20:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:56] (and mwmaint* for the previous noc change) [20:31:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1042.eqiad.wmnet with reason: REIMAGE [20:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:43] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10RobH) Please note 'hot swappable' just means 'can the disk be swapped without powering down the host' it doesn't have anything to do with the raid's automatic failure or rebuil... [20:31:48] I see scap logs still aren't working on the deploy/mwdebug dashboard [20:31:52] !log krinkle@deploy1002 Synchronized docroot/noc/: Ic013a93998f (duration: 01m 37s) [20:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:32:50] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1044.eqiad.wmnet with reason: REIMAGE [20:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:31] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1043.eqiad.wmnet with reason: REIMAGE [20:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:53] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1045.eqiad.wmnet with reason: REIMAGE [20:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1044.eqiad.wmnet with reason: REIMAGE [20:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1045.eqiad.wmnet with reason: REIMAGE [20:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:24] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: REIMAGE [20:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:31] (03PS5) 10Krinkle: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 [20:42:31] !log krinkle@deploy1002 Synchronized wmf-config/: Ic5ff34b (duration: 01m 08s) [20:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:27] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1038.eqiad.wmnet with reason: REIMAGE [20:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:11] (03CR) 10Krinkle: [C: 03+2] 
Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 (owner: 10Krinkle) [20:45:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1041.eqiad.wmnet', 'mc1042.eqiad.wmnet', 'mc1043.eqiad.wmnet', 'mc1044.eqiad.wmnet... [20:45:29] legoktm: yours ETA 5min [20:45:54] (03Merged) 10jenkins-bot: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 (owner: 10Krinkle) [20:46:43] > This page is using the deprecated ResourceLoader module "skins.vector.styles.legacy". [20:46:43] [1.37] The use of the `content` feature with ResourceLoaderSkinModule is deprecated. Use `content-media` instead. [1.37] The use of the `content-thumbnails` feature with ResourceLoaderSkinModule is deprecated. Use `content-media` instead. More information can be found at [[mw:Manual:ResourceLoaderSkinModule]]. [20:46:53] Jdlrobson: I'm noticing these in prod on group0 wikis. Not sure if that's expected.. [20:47:26] (03PS1) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [20:47:54] (03CR) 10jerkins-bot: [V: 04-1] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:48:43] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:50:28] T288563 is IMO serious enough to warrant a train rollback, it breaks special:Contributions in certain circumstances, which can easily render patrollers helpless when trying to fight vandalism :/. 
[20:50:28] T288563: TypeError: Argument 1 passed to MediaWiki\Revision\RevisionStore::newRevisionFromRowAndSlots() must be an instance of stdClass - https://phabricator.wikimedia.org/T288563 [20:50:35] cc jeena [20:50:48] (03PS2) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [20:51:01] Krinkle: you're going to sync my patch or it's open for me to sync? [20:51:22] (03CR) 10jerkins-bot: [V: 04-1] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:51:27] I was going to do it. [20:52:32] urbanecm: I can roll back after these in progress backports are done [20:52:49] thanks jeena :) [20:53:17] (03PS5) 10Krinkle: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 [20:53:23] left a note on the task too [20:53:47] 👍 thanks! [20:54:25] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: If7a8d6b6 (duration: 01m 22s) [20:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:00] (03CR) 10Krinkle: [C: 03+2] Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 (owner: 10Krinkle) [20:55:17] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for Web2Cit Advisory Board - https://phabricator.wikimedia.org/T288566 (10Ladsgroup) a:03Ladsgroup Hi, do you want the mailing list public (and open) or private? [20:55:44] (03Merged) 10jenkins-bot: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 (owner: 10Krinkle) [20:59:38] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for Web2Cit Advisory Board - https://phabricator.wikimedia.org/T288566 (10Scann) Private, thanks! 
[21:00:15] Krinkle: ack, I have a 1:1 rn so I won't be able to fully pay attention [21:02:04] jeena: ok, all yours [21:02:16] legoktm: ack, let's do it later then. once jeena is done, I'll stage it for testing. [21:02:23] !log krinkle@deploy1002 Synchronized wmf-config/: I3b54d163b6 (duration: 01m 09s) [21:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:33] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:56] Thanks Krinkle [21:08:03] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:15] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.37.0-wmf.18" [21:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:47] should I be concerned about the above icinga message? [21:09:54] I don't think so. It's been failing all day. [21:09:57] jeena: legoktm previously stated it's nothing to worry about [21:10:16] thanks all! 
[21:10:22] yw [21:10:46] (03PS1) 10Jeena Huneidi: Revert "group0 wikis to 1.37.0-wmf.18 refs T281159" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711205 [21:10:48] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group0 wikis to 1.37.0-wmf.18 refs T281159" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711205 (owner: 10Jeena Huneidi) [21:11:31] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.37.0-wmf.18 refs T281159" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711205 (owner: 10Jeena Huneidi) [21:13:52] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for Web2Cit Advisory Board - https://phabricator.wikimedia.org/T288566 (10Ladsgroup) 05Open→03Resolved Done now, you can access it in https://lists.wikimedia.org/postorius/lists/webtocit.lists.wikimedia.org/ Create an account if you haven'... [21:15:05] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for Web2Cit Advisory Board - https://phabricator.wikimedia.org/T288566 (10Scann) Oh wow, how quick! Amazing! Thank you so much! [21:16:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:17:57] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:18:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['mc1046.eqiad.wmnet', 'mc1047.eqiad.w... 
[21:19:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [21:20:22] Krinkle: rollback is done [21:30:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1046.eqiad.wmnet with reason: REIMAGE [21:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:47] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1047.eqiad.wmnet with reason: REIMAGE [21:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:04] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1046.eqiad.wmnet with reason: REIMAGE [21:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:23] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1047.eqiad.wmnet with reason: REIMAGE [21:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1048.eqiad.wmnet with reason: REIMAGE [21:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:44] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1049.eqiad.wmnet with reason: REIMAGE [21:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:58] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1048.eqiad.wmnet with reason: REIMAGE [21:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1050.eqiad.wmnet with reason: REIMAGE [21:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:15] PROBLEM - Ensure hosts are not performing a 
change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: mc1047, mc1040, an-web1001, labstore1006, mc1046 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:40:13] !log T288501 `ryankemper@wdqs2003:~$ sudo pool` [21:40:16] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1049.eqiad.wmnet with reason: REIMAGE [21:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:22] T288501: Blazegraph journal too large on wdqs2003 - https://phabricator.wikimedia.org/T288501 [21:40:25] !log [WDQS] `ryankemper@wdqs2005:~$ sudo pool` [21:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1051.eqiad.wmnet with reason: REIMAGE [21:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1050.eqiad.wmnet with reason: REIMAGE [21:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1052.eqiad.wmnet with reason: REIMAGE [21:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:40] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1051.eqiad.wmnet with reason: REIMAGE [21:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:51] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1053.eqiad.wmnet with reason: REIMAGE [21:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:39] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 
10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Jclark-ctr) [21:46:13] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Jclark-ctr) Performed factory reset, removed from rack, updated netbox [21:46:30] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Jclark-ctr) 05Open→03Resolved [21:46:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1052.eqiad.wmnet with reason: REIMAGE [21:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:58] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1054.eqiad.wmnet with reason: REIMAGE [21:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:59] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1053.eqiad.wmnet with reason: REIMAGE [21:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:59] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1054.eqiad.wmnet with reason: REIMAGE [21:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc1046.eqiad.wmnet', 'mc1047.eqiad.wmnet', 'mc1048.eqiad.wmnet', 'mc1049.eqiad.wmnet... 
[21:59:20] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: mc1040, mc1046, an-web1001, mc1048, mc1047, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:01:02] jee.na: thx [22:01:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:02:02] legoktm: what is 'php.servergroup' setting? [22:02:06] that could use a bit more context [22:02:16] I'm assuming it's not a php.ini value [22:03:10] maybe it's easier if we swap these commits around, also less scaptrap-y that way [22:03:37] and maybe that stuff could move to *Services then instead of relying inline that $dc is a complete match, which CI otherwise enforces [22:03:48] https://gerrit.wikimedia.org/g/operations/deployment-charts/+/cb0f052c509ff1a93b3fa2750791229629d3fe53/helmfile.d/services/mwdebug/values.yaml#34 [22:04:00] (03PS3) 10Krinkle: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [22:04:02] it's a helm value, which gets passed into the SERVERGROUP env variable [22:04:10] +1 on switching the order around [22:04:22] it's the equivalent of https://gerrit.wikimedia.org/g/operations/puppet/+/3ae617555a86bea5ee95b9f74893fc49b81ef0bc/modules/profile/manifests/mediawiki/httpd.pp#157 [22:04:25] (03PS4) 10Krinkle: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [22:04:30] (03PS6) 10Krinkle: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:04:34] which is "appserver" or "api_appserver" [22:04:53] 10SRE-tools, 
10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (Kanban): wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10nskaggs) p:05Triage→03Medium [22:05:03] legoktm: the k8s pipeline is still a big untrackable soup for me. Maybe a link like that wouldn't be overkill? [22:05:25] sure, OK if I amend? [22:05:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:05:34] yep, go ahead [22:06:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:06:26] thinking about *services, it probably wouldn't be defined there (the global). maybe a constant in multiversion/defines.php would be better [22:07:08] It's an awkward fit I guess, but we can clean that up when we thin-down/phase-out multiversion [22:11:22] Hm.. 
nvm, we need this data even outside MW contexts such as auto-prepend [22:11:29] (Services.php, that is) [22:12:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) 05Open→03Resolved [22:12:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:13:57] (03PS5) 10Legoktm: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 [22:13:59] (03PS7) 10Legoktm: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:14:05] Krinkle: ^ [22:15:36] probably time to add a "Kubernetes" section below https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#App_servers ? [22:16:04] Yeah. I was going to work on that with j.oe, but we could do it as well. [22:16:13] We can set up a meeting in which I ask a bunch of stupid questions :) [22:16:34] legoktm: thoughts re $dc key reliance? [22:16:49] maybe flatten down to if: 127, else: "...$dc..." for now. [22:17:36] I think that would be fine [22:18:27] also a stupid-question meeting is fine too :p [22:18:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:20:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:22:07] (03PS1) 10Krinkle: ProductionServices: Clarify that not even multiversion can be assumed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711217 [22:24:23] (03CR) 10Krinkle: [C: 03+2] Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [22:24:46] (03CR) 10Krinkle: [C: 03+2] ProductionServices: Clarify that not even multiversion can be assumed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711217 (owner: 10Krinkle) [22:25:00] legoktm: can you amend the third one? I'll test and sync these meanwhile. [22:25:09] (03Merged) 10jenkins-bot: Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [22:25:15] * Krinkle on mwdebug2002 [22:25:27] (03Merged) 10jenkins-bot: ProductionServices: Clarify that not even multiversion can be assumed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711217 (owner: 10Krinkle) [22:25:44] by third you mean the redis one? [22:29:22] (03PS2) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [22:30:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:30:47] (03PS3) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [22:31:14] (03CR) 10jerkins-bot: [V: 04-1] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:32:21] hm, I don't see a redis-labs.php [22:33:07] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:33:10] (03PS8) 10Legoktm: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:34:16] (03CR) 10jerkins-bot: [V: 04-1] Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:34:30] (03PS4) 10Zabe: zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) [22:40:32] Krinkle: uh, is that what you had in mind? still figuring out where/how to fix the test... [22:41:41] that is what I had in mind three revisions ago, yes [22:41:51] before I realized the global would not be reliably available in *Services.php [22:41:58] also, wmfLocalServices :) [22:42:05] > maybe flatten down to if: 127, else: "...$dc..." for now. [22:42:16] I meant this applied directly to how it was before, inline. 
[22:42:58] ohhh [22:43:22] well the 127 address is different depending on the DC [22:44:31] (03CR) 10Jforrester: [C: 03+1] Introduce $wmfUsingKubernetes to help with migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711186 (owner: 10Legoktm) [22:45:23] legoktm: huh [22:45:39] why? :D [22:45:46] port 12000 vs 12001 [22:46:20] because you need to connect to both the primary DC and the local DC [22:46:42] redis_local and redis_master [22:46:52] Hm.. right, for those references. not the default ones, right [22:46:58] so you always need both reachable [22:47:06] makes senses [22:47:12] sense* even [22:47:15] (03PS9) 10Legoktm: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:47:20] syncing the other thing now [22:47:33] ^ mostly j.oe's original patch, but with $wmfUsingKubernetes [22:48:06] (03PS10) 10Legoktm: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:50:33] !log krinkle@deploy1002 Synchronized wmf-config/: I8052636, I2038702b7e0 (duration: 01m 21s) [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:54:16] PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:54:43] (03PS2) 10Clare Ming: Enable user links feature for pilot wikis, modern vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710344 
(https://phabricator.wikimedia.org/T288274) [22:55:00] It's nice how the new Gerrit just makes those old test failures disappear once fixed [22:55:05] like the nothing ever was! [22:57:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:57:58] (03CR) 10Krinkle: [C: 03+2] Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:58:39] (03Merged) 10jenkins-bot: Adjust redis configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: 10Giuseppe Lavagetto) [22:59:04] legoktm: live on mwdebug2002, checking now.. [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210810T2300). [23:00:04] cjming: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:48] Error: unknown command "diff" for "helm" [23:00:52] hm, something is really broken [23:01:04] with deploying mwdebug to k8s, not that patch [23:01:35] i'm here and ready to deploy [23:02:07] legoktm: ok, logstash seems all good [23:02:29] go for it then, I'm poking at the k8s stuff [23:02:43] cool - starting now [23:02:58] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:03:03] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:29] cjming: oops sorry, that was for Krinkle. I think we have one patch left to sync [23:03:46] oh - so should i wait? [23:04:19] ack. scap is locked on deploy1002 [23:04:23] (to me) [23:06:54] !log krinkle@deploy1002 Synchronized wmf-config/: I13e88c303a, T284418 (duration: 01m 07s) [23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:02] T284418: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 [23:07:13] scap unlocked [23:07:46] cjming: will you be deploying, or are you waiting for one of the three deployers? [23:07:55] I'm here btw, if needed [23:08:13] I can deploy - it'll be my second time deploying - but it's a simple config change [23:08:29] am i ok to begin? [23:09:18] cjming: go for it :) [23:10:36] ok - starting [23:12:51] i'm ssh'd into mwdebug1002.eqiad.wmnet -- is this the right server to test on? [23:13:20] cjming: I recommend to use mwdebug200* while we're on codfw [23:13:26] it's saying "cannot delete non-empty directory: php-1.37.0-wmf.1/cache/l10n" [23:14:00] cjming: that message can be ignored [23:14:23] urbanecm: so run "scap pull" on mwdebug2002.eqiad.wmnet? [23:14:41] mwdebug2002.codfw.wmnet [23:14:49] But...did you merge the change? 
[23:15:52] i did - i rebased on deploy1002 [23:16:27] cjming: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/710344, the change that's listed on the calendar, is not merged yet [23:16:38] (and is not on the deployment host either) [23:17:03] gah 1 sec- you're right - forgot to merge (facepalm) [23:17:21] no worries :) [23:17:31] (03CR) 10Clare Ming: [C: 03+2] Enable user links feature for pilot wikis, modern vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710344 (https://phabricator.wikimedia.org/T288274) (owner: 10Clare Ming) [23:18:00] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:16] (03Merged) 10jenkins-bot: Enable user links feature for pilot wikis, modern vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710344 (https://phabricator.wikimedia.org/T288274) (owner: 10Clare Ming) [23:23:19] 10ops-codfw, 10DC-Ops: codfw: Netbox Error - https://phabricator.wikimedia.org/T288586 (10wiki_willy) [23:24:39] so i merged, ran scap pull on mwdebug2002, but still not seeing expected changes [23:25:00] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [23:26:08] cjming: you forgot to run git rebase at deploy1002 [23:26:47] you only did git fetch, then you should verify only your commit was fetched (via git log -p HEAD..@{u} -- you might want to alias this in your gitconfig), and then git rebase [23:27:09] once that's done, you can pull it to a debug host (any of the two active ones will work) [23:27:46] doh - yes, i forgot to rebase - just did and now we're testing [23:28:13] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:29:31] we see changes - yay -- still testing [23:29:51] great! [23:30:07] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:31:43] looking good - syncing now [23:32:05] (y) [23:33:25] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710344|Enable user links feature for pilot wikis, modern vector (T288274)]] (duration: 01m 08s) [23:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:34] T288274: Deploy user links feature to all wikis - https://phabricator.wikimedia.org/T288274 [23:33:50] cjming: looks it's time to congratulate you to your second deployment 🙂 [23:34:34] lol - urbanecm: thanks to your prompts [23:35:05] I'm always happy to help cjming :) [23:39:50] 🙌 [23:41:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:47:23] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:49:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:51:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:52:19] (03PS1) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) [23:52:49] (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [23:55:16] (03PS2) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673)