[00:00:01] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10SHust) Final Shopify reply below. Sorry everyone, I really tried... Thank you for your patience while I checked on this HSTS inquiry with my team. My team who manages this dug really... [00:28:01] (03CR) 10Cwhite: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:28:38] !log planet1002 - stopping apache2 to test alerting (active host is codfw) [00:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:39] (03CR) 10Cwhite: [C: 03+1] Address problems found by 'pint' [alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [00:34:46] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Papaul) @Marostegui unfortunately this server is out of warranty i will check and see if i have any disk onsite that i can use. Thanks [00:52:33] db2141 is not replicating is there an alert for it or something [00:52:45] never mind, it's the backup [00:57:11] 10Puppet, 10Infrastructure-Foundations, 10Instrument-ClientError, 10Observability-Logging, 10patch-welcome: Prevent Firefox and Chrome extensions from being able to trigger alerts - https://phabricator.wikimedia.org/T330680 (10Jdlrobson) That should be fine. FWIW it seems very few errors would slip throu... 
[01:02:56] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [01:04:26] (03PS1) 10Dzahn: switch releases.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893576 (https://phabricator.wikimedia.org/T330960) [01:05:39] (03PS1) 10Dzahn: switch releases.wikimedia.org backends rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/893577 (https://phabricator.wikimedia.org/T330960) [01:08:25] !log releases2002 - stopping apache2 to test alerting (active server is 1002 but should be switched) T327975 T330960 [01:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:33] T330960: switch releases.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T330960 [01:08:33] T327975: create blackbox::http monitoring for releases.wikimedia.org - https://phabricator.wikimedia.org/T327975 [01:18:17] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Dzahn) [01:18:40] (03PS1) 10Dzahn: switch doc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/893578 (https://phabricator.wikimedia.org/T330963) [01:20:38] (03PS1) 10Dzahn: doc.wikimedia.org: switch active host from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/893579 (https://phabricator.wikimedia.org/T330963) [01:21:37] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) Thank you [01:23:38] !log doc2001 - stopping apache2 to test alerting - active server is doc1002 but should be switched T327973 T330963 [01:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:46] T330963: switch doc.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T330963 [01:23:47] T327973: create blackbox::http 
monitoring for doc.wikimedia.org - https://phabricator.wikimedia.org/T327973 [01:53:42] (03CR) 10Krinkle: "Does this follow redirects? POST action=purge does not by itself perform a re-parse. The subsequent GET request after following the HTTP 3" [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [02:03:04] (03PS4) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) [02:03:28] (03CR) 10RLazarus: mediawiki-cache-warmup: Add POSTs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:46] (03CR) 10Krinkle: [C: 03+2] filebackend: Opinionated reformatting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891964 (owner: 10Reedy) [04:21:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:38] PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-aptrepo-apt2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:40] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag 
in last 10 minutes on alert1001 is CRITICAL: 1.007e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [04:50:02] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:32] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:14] PROBLEM - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:09:16] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330971 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:09:21] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10ops-monitoring-bot) [05:38:40] PROBLEM - Check systemd state on db2095 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s2.service,wmf_auto_restart_prometheus-mysqld-exporter@s7.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:02] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:00] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.002e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T0700) [07:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T0700). [07:09:50] <_joe_> uhm [07:10:07] <_joe_> it looks like mirrormaker in eqiad can't keep up with codfw [07:10:12] <_joe_> since the switchover [07:12:42] <_joe_> specifically for htmlCacheUpdate messages, which is weird [07:13:01] <_joe_> and not since the switchover, since 3 AM [07:17:28] (03PS1) 10Marostegui: db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893587 (https://phabricator.wikimedia.org/T330975) [07:17:46] !log Stop MySQL on db2095 T330975 [07:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:53] T330975: decommission db2095.codfw.wmnet - https://phabricator.wikimedia.org/T330975 [07:17:59] (03CR) 10Marostegui: [C: 03+2] db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/893587 (https://phabricator.wikimedia.org/T330975) (owner: 10Marostegui) [07:20:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:21:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:26:23] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) I am going to schedule this on Thursday 9th at 16:00 UTC - if someone has objections, 
please let me know! [07:26:38] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [07:28:25] (03PS2) 10ArielGlenn: make sure all of dumpsdata1001-7 permit rsync from/to each other [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) [07:28:59] (03PS3) 10ArielGlenn: make sure all of dumpsdata1001-7 permit rsync from/to each other [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) [07:29:01] (03PS1) 10Marostegui: check_private_data_report: Remove db2095 [puppet] - 10https://gerrit.wikimedia.org/r/893588 (https://phabricator.wikimedia.org/T326596) [07:29:30] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Remove db2095 [puppet] - 10https://gerrit.wikimedia.org/r/893588 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [07:29:39] (03CR) 10ArielGlenn: [C: 03+2] make sure all of dumpsdata1001-7 permit rsync from/to each other [puppet] - 10https://gerrit.wikimedia.org/r/893519 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [07:36:07] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1006.eqiad.wmnet with OS bullseye [07:37:27] (03PS1) 10Marostegui: control-mariadb-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893589 (https://phabricator.wikimedia.org/T330643) [07:37:28] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:37:31] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:37:38] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:37:45] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:37:50] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[07:38:06] (03CR) 10Marostegui: [C: 03+2] control-mariadb-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893589 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui) [07:38:21] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:38:23] !log started rsync of xmldatadumps/private from dumpsdata1001 in screen session as ariel on that host, to dumpsdata1005, no bandwidth cap [07:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:36] (03Merged) 10jenkins-bot: control-mariadb-11.0-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/893589 (https://phabricator.wikimedia.org/T330643) (owner: 10Marostegui) [07:39:48] (03PS3) 10Elukey: admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542) [07:40:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:41:38] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:42:08] these were due to the inference eqiad VIP being down --^ [07:45:47] (03CR) 10Elukey: [C: 03+2] admin_ng: set kserve values for ml-serve-{eqiad,codfw} clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/893513 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [07:47:55] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:48:11] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:48:19] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:48:32] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[07:53:19] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [07:54:48] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 20 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:58:55] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [08:00:05] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T0800). [08:00:14] morning! there are no trainees signed up today and no patches scheduled for deployment in the window. [08:00:53] that being the case, have a nice quiet day everyone and I'll see you all here next time! 
[08:03:25] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [08:05:34] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [08:08:40] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1006.eqiad.wmnet with reason: host reimage [08:09:45] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:14:01] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:15:26] !log started rsync of xmldatadumps/public from dumpsdata1001 in screen session as ariel on that host, to dumpsdata1005, no bandwidth cap [08:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:43] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:18:54] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [08:19:04] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1007.eqiad.wmnet with OS bullseye [08:21:19] (03PS4) 10ArielGlenn: Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) [08:28:05] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 
198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:28:15] 10SRE, 10Traffic: Add unique error IDs to 4xx responses - https://phabricator.wikimedia.org/T330973 (10Volans) Surely being able to distinguish them from the message would help, but still relies on the user to report the exact verbatim message they are getting, and rely on external information. Given that our... [08:29:17] (03PS2) 10Giuseppe Lavagetto: sre.discovery.datacenter: support a/p state when depooled [cookbooks] - 10https://gerrit.wikimedia.org/r/892999 [08:29:19] (03PS2) 10Giuseppe Lavagetto: sre.discovery.datacenter: uniform style [cookbooks] - 10https://gerrit.wikimedia.org/r/893000 [08:30:19] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:31:25] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/892999 (owner: 10Giuseppe Lavagetto) [08:32:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:32:27] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:27] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.discovery.datacenter: support a/p state when depooled (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/892999 (owner: 10Giuseppe Lavagetto) [08:33:09] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:33:25] (03CR) 10ArielGlenn: [C: 03+2] Add dumpsdata1006 and dumpsdata1007 as spare dumps hosts and rsync pullers [puppet] - 10https://gerrit.wikimedia.org/r/893031 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [08:34:35] (03Merged) 10jenkins-bot: sre.discovery.datacenter: support a/p state when depooled 
[cookbooks] - 10https://gerrit.wikimedia.org/r/892999 (owner: 10Giuseppe Lavagetto) [08:34:43] !log Stop MySQL on db2093 T330827 [08:34:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:50] T330827: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 [08:36:56] (03PS1) 10Marostegui: install_server: Remove db2093 [puppet] - 10https://gerrit.wikimedia.org/r/893643 (https://phabricator.wikimedia.org/T330827) [08:37:29] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:37:39] (03CR) 10Marostegui: [C: 03+2] install_server: Remove db2093 [puppet] - 10https://gerrit.wikimedia.org/r/893643 (https://phabricator.wikimedia.org/T330827) (owner: 10Marostegui) [08:38:34] (03PS1) 10Nicolas Fraison: hadoop: exclude an-worker1132 node from hdfs and yarn [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) [08:38:45] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [08:38:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:38:47] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1006.eqiad.wmnet with OS bullseye [08:39:33] (03PS3) 10ArielGlenn: for dumpsdata1004 through 1007 use the partman recipe for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/892437 (https://phabricator.wikimedia.org/T330573) [08:39:49] !log dcaro@cumin1001 START - Cookbook sre.dns.netbox [08:40:30] (03CR) 10ArielGlenn: [C: 03+2] for dumpsdata1004 through 1007 use the partman recipe for reimaging [puppet] - 
10https://gerrit.wikimedia.org/r/892437 (https://phabricator.wikimedia.org/T330573) (owner: 10ArielGlenn) [08:40:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.discovery.datacenter: uniform style [cookbooks] - 10https://gerrit.wikimedia.org/r/893000 (owner: 10Giuseppe Lavagetto) [08:41:13] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39908/console" [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [08:42:39] (03Merged) 10jenkins-bot: sre.discovery.datacenter: uniform style [cookbooks] - 10https://gerrit.wikimedia.org/r/893000 (owner: 10Giuseppe Lavagetto) [08:44:05] (03PS2) 10Nicolas Fraison: hadoop: exclude an-worker1132 node from hdfs and yarn [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) [08:45:19] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39909/console" [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [08:45:31] (03CR) 10Hashar: contint: regroup common firewalling rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) (owner: 10Hashar) [08:45:48] (03PS2) 10Hashar: contint: Jenkins master > controller [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) [08:45:55] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [08:45:59] (03PS6) 10JMeybohm: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (https://phabricator.wikimedia.org/T323943) (owner: 10SBassett) [08:46:03] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1010 to rack F4 - dcaro@cumin1001" [08:46:25] (03CR) 10Hashar: "I am rebasing on tip of production branch after the "unrelated" parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/887738/" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [08:47:34] (03CR) 10Stevemunene: [C: 03+1] hadoop: exclude an-worker1132 node from hdfs and yarn [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [08:49:27] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hadoop: exclude an-worker1132 node from hdfs and yarn [puppet] - 10https://gerrit.wikimedia.org/r/893644 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [08:51:33] (03PS1) 10Marostegui: common.yaml: Remove db2093 [puppet] - 10https://gerrit.wikimedia.org/r/893645 (https://phabricator.wikimedia.org/T330827) [08:51:47] (03PS2) 10Marostegui: common.yaml: Remove db2093 [puppet] - 10https://gerrit.wikimedia.org/r/893645 (https://phabricator.wikimedia.org/T330827) [08:52:18] (03CR) 10Marostegui: [C: 03+2] common.yaml: Remove db2093 [puppet] - 10https://gerrit.wikimedia.org/r/893645 (https://phabricator.wikimedia.org/T330827) (owner: 10Marostegui) [08:57:10] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moved cloudcephosd1010 to rack F4 - dcaro@cumin1001" [08:57:10] !log dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:57:31] !log dcaro@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1010 [08:58:07] !log dcaro@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1010 [08:58:43] !log root@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['cloudcephosd1010'] [08:59:25] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_pagetriage_cleanup_test2wiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:36] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Marostegui) [09:00:05] jnuche and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T0900). [09:00:35] morning, I'll deploy in 5 mins [09:04:33] !log root@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1010'] [09:05:43] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893666 (https://phabricator.wikimedia.org/T325588) [09:05:45] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893666 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:06:34] !log dcaro@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1010'] [09:07:00] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893666 (https://phabricator.wikimedia.org/T325588) (owner: 10TrainBranchBot) [09:08:41] (03CR) 10Hashar: [C: 03+1] "Noop on" [puppet] - 10https://gerrit.wikimedia.org/r/893412 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [09:10:35] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1007.eqiad.wmnet with OS bullseye [09:12:34] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10User-Joe: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888 (10hashar) 
[09:12:40] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890501 (https://phabricator.wikimedia.org/T292238) (owner: 10Majavah) [09:13:21] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1010'] [09:14:41] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.25 refs T325588 [09:14:47] T325588: 1.40.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T325588 [09:15:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [09:16:06] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Lydia_Pintscher) [09:16:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) (owner: 10Eevans) [09:17:33] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:39] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:18:19] (03PS1) 10Jbond: aptrepo::rsync: refactor to fix issues [puppet] - 10https://gerrit.wikimedia.org/r/893669 [09:19:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39910/console" [puppet] - 10https://gerrit.wikimedia.org/r/893669 (owner: 10Jbond) [09:19:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "Looking good! 
LGTM (see inline)" [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite) [09:19:59] (03PS2) 10Jbond: aptrepo::rsync: refactor to fix issues [puppet] - 10https://gerrit.wikimedia.org/r/893669 [09:20:19] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve1008.eqiad.wmnet with OS bullseye [09:20:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39911/console" [puppet] - 10https://gerrit.wikimedia.org/r/893669 (owner: 10Jbond) [09:21:33] 10ops-ulsfo: mr1-ulsfo dead? - https://phabricator.wikimedia.org/T330984 (10ayounsi) p:05Triage→03High [09:22:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fc0b2673710: Failed to establish a new connection: [Errno 111] Connection refused): /api?format=mediawiki&search=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FDarth_Vader https:/ [09:22:35] h.wikimedia.org/wiki/Citoid [09:23:15] ACKNOWLEDGEMENT - ps1-23-ulsfo-infeed-load-tower-B-single-phase on ps1-23-ulsfo is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:15] ACKNOWLEDGEMENT - ps1-23-ulsfo-infeed-load-tower-A-single-phase on ps1-23-ulsfo is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:15] ACKNOWLEDGEMENT - ps1-22-ulsfo-infeed-load-tower-B-single-phase on ps1-22-ulsfo is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T330984 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:15] ACKNOWLEDGEMENT - ps1-22-ulsfo-infeed-load-tower-A-single-phase on ps1-22-ulsfo is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:15] ACKNOWLEDGEMENT - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T330984 [09:23:16] ACKNOWLEDGEMENT - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T330984 [09:23:16] ACKNOWLEDGEMENT - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:17] ACKNOWLEDGEMENT - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.35.26.194 ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:23:17] ACKNOWLEDGEMENT - Juniper alarms on asw2-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.128.128.7 ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:23:18] ACKNOWLEDGEMENT - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T330984 [09:23:55] ACKNOWLEDGEMENT - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP ayounsi https://phabricator.wikimedia.org/T330984 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:23:55] ACKNOWLEDGEMENT - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP ayounsi https://phabricator.wikimedia.org/T330984 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:30] (03CR) 10JMeybohm: [C: 03+2] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (https://phabricator.wikimedia.org/T323943) (owner: 10SBassett) [09:26:03] (03CR) 10Jbond: [C: 03+2] apt: swap active and failover apt servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893409 (https://phabricator.wikimedia.org/T328907) (owner: 10Jbond) [09:26:05] 10SRE, 10Gerrit, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) For future reference, the configuration change in Gerrit is under `refs/meta/config` and is [[ https://gerrit.w... [09:26:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] aptrepo::rsync: refactor to fix issues [puppet] - 10https://gerrit.wikimedia.org/r/893669 (owner: 10Jbond) [09:27:01] (03PS67) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [09:28:12] (03CR) 10Giuseppe Lavagetto: "+1 on the mediawiki stuff, but see my comment on the kubernetes-generic one." 
[alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:28:43] (03PS4) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) [09:28:49] (03CR) 10Filippo Giunchedi: Add 'pint' integration (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:28:56] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1010.eqiad.wmnet with OS bullseye [09:29:33] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: apt.wikimedia.org post-switchover - https://phabricator.wikimedia.org/T330985 (10Clement_Goubert) [09:29:43] PROBLEM - Check systemd state on dumpsdata1005 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Marostegui) Any ETA to get these (or some) racked and installed? Thanks! 
[09:30:25] (03CR) 10Filippo Giunchedi: Address problems found by 'pint' (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:30:30] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: apt.wikimedia.org post-switchover - https://phabricator.wikimedia.org/T330985 (10Clement_Goubert) p:05Triage→03Medium a:05Clement_Goubert→03jBond_WMF [09:31:05] RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:47] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [09:31:57] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: apt.wikimedia.org post-switchover - https://phabricator.wikimedia.org/T330985 (10Clement_Goubert) 05Open→03Resolved Fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/893669 [09:32:43] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:33:43] (03PS3) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) [09:35:26] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [09:36:20] (03CR) 10Arturo Borrero Gonzalez: labstore1004: allow incoming HTTP connections from cloudcontrol servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [09:37:12] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity 
error - https://phabricator.wikimedia.org/T318783 (10ayounsi) 05Resolved→03Open The issue is back: > 2023-01-30 12:36:42 UTC Minor FPC 0 Minor Errors we need to follow up with JTAC for a replacement. [09:37:33] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Vgutierrez) >>! In T330906#8657917, @Ennomeijers wrote: > Thanks for the replies! Advising to use HTTPS over HTTP makes sense. > > But not supporting redirection from HTTP to... [09:38:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1007.eqiad.wmnet with reason: host reimage [09:40:02] (03CR) 10Filippo Giunchedi: [C: 03+2] Address problems found by 'pint' [alerts] - 10https://gerrit.wikimedia.org/r/893504 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:44:57] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1010.eqiad.wmnet with reason: host reimage [09:47:08] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Peachey88) [09:47:20] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [09:47:57] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:48:06] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1010.eqiad.wmnet with reason: host reimage [09:50:07] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1008.eqiad.wmnet with reason: host reimage [09:51:48] jnuche: I have added an UBN from yesterday as a train blocker. 
I don't think it is worth a rollback, it might be an issue with the infra [09:52:53] hashar: I saw it, I replied on wikimedia-releng, it doesn't look like a problem with wmf.25 to me [09:53:06] the problem was already happening with wmf.24 on the Russian wiki yesterday [09:53:21] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:58] (KubernetesCalicoDown) firing: (2) ml-serve1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:54:35] maybe it is related to the switch over somehow [09:54:54] I would suspect that, yeah [09:56:54] (03PS1) 10Elukey: profile::logstash: improve the ORES filter [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) [09:57:45] (03PS2) 10Elukey: profile::logstash: improve the ORES filter [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) [09:59:25] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [09:59:48] (03CR) 10Nicolas Fraison: [C: 03+2] Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [09:59:54] (03CR) 10Elukey: "Tested following https://wikitech.wikimedia.org/wiki/Logstash#Writing_&_testing_filters, all good :)" [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [10:02:22] !2 [10:02:28] (03CR) 10Ilias Sarantopoulos: [C: 03+1] profile::logstash: improve the ORES filter [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [10:03:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ml-serve1006.mgmt.eqiad.wmnet with reboot policy GRACEFUL [10:04:24] (03PS1) 10Jbond: kerberos: update the motd so its not to big [puppet] - 10https://gerrit.wikimedia.org/r/893674 [10:05:21] !log dcaro@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [10:05:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39912/console" [puppet] - 10https://gerrit.wikimedia.org/r/893674 (owner: 10Jbond) [10:06:01] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:45] PROBLEM - Host ml-serve1006 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:15] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1007.eqiad.wmnet with OS bullseye [10:08:36] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1126 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/893430 
(https://phabricator.wikimedia.org/T330988) [10:08:58] (KubernetesCalicoDown) firing: (2) ml-serve1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:09:18] (03PS1) 10Zabe: wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/893675 (https://phabricator.wikimedia.org/T327920) [10:09:57] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:31] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:58] (KubernetesCalicoDown) firing: (3) ml-serve1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:14:09] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [10:15:53] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:16:25] RECOVERY - Host ml-serve1006 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [10:16:33] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@9568478]: Re-Deploy Airflow upgrade branch for analytics_test [10:16:38] (03PS1) 10Jbond: apt: update the motd so its not to big [puppet] - 10https://gerrit.wikimedia.org/r/893676 [10:16:45] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@9568478]: Re-Deploy Airflow upgrade branch for analytics_test (duration: 00m 12s) 
[10:17:01] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39914/console" [puppet] - 10https://gerrit.wikimedia.org/r/893676 (owner: 10Jbond) [10:18:15] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10JMeybohm) [10:18:31] (03CR) 10Klausman: [C: 03+1] profile::logstash: improve the ORES filter [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [10:18:58] (KubernetesCalicoDown) firing: (3) ml-serve1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:21:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1008.eqiad.wmnet with OS bullseye [10:23:11] (03PS4) 10Fomafix: Add 'cbk' as alias for 'cbk-zam' [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) [10:23:27] (03PS4) 10Fomafix: Add 'bho' as alias for 'bh' [puppet] - 10https://gerrit.wikimedia.org/r/528782 (https://phabricator.wikimedia.org/T41968) [10:23:58] (KubernetesCalicoDown) resolved: (3) ml-serve1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:24:39] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/891390 (owner: 10PipelineBot) [10:27:34] 
10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [10:27:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1006.mgmt.eqiad.wmnet with reboot policy GRACEFUL [10:28:03] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1126 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/893431 (https://phabricator.wikimedia.org/T330991) [10:28:06] (03Abandoned) 10Marostegui: mariadb: Promote db1126 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/893430 (https://phabricator.wikimedia.org/T330988) (owner: 10Gerrit maintenance bot) [10:30:18] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/891390 (owner: 10PipelineBot) [10:32:30] (Traffic bill over quota) firing: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:32:37] (03CR) 10Jbond: "drive by comment" [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:38:14] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: apt.wikimedia.org post-switchover - https://phabricator.wikimedia.org/T330985 (10jbond) a:05jBond_WMF→03jbond [10:39:36] 10SRE, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern nickifeajika - https://phabricator.wikimedia.org/T330993 (10Miriam) [10:41:36] (03CR) 10Clément Goubert: [C: 03+2] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/893675 (https://phabricator.wikimedia.org/T327920) (owner: 10Zabe) [10:42:18] !log Running authdns-update for 893675 [10:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:41] (03PS1) 10Nicolas Fraison: hadoop_test: decomission an-test-worker1001 [puppet] - 10https://gerrit.wikimedia.org/r/893681 
[10:42:43] zabe: thanks [10:43:35] yw [10:43:57] (03CR) 10Btullis: [C: 03+1] hadoop_test: decomission an-test-worker1001 [puppet] - 10https://gerrit.wikimedia.org/r/893681 (owner: 10Nicolas Fraison) [10:44:26] (03CR) 10Stevemunene: [C: 03+1] hadoop_test: decomission an-test-worker1001 [puppet] - 10https://gerrit.wikimedia.org/r/893681 (owner: 10Nicolas Fraison) [10:44:28] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39915/console" [puppet] - 10https://gerrit.wikimedia.org/r/893681 (owner: 10Nicolas Fraison) [10:44:48] (03PS4) 10Filippo Giunchedi: Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) [10:44:59] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hadoop_test: decomission an-test-worker1001 [puppet] - 10https://gerrit.wikimedia.org/r/893681 (owner: 10Nicolas Fraison) [10:45:14] (03CR) 10Filippo Giunchedi: Add 'pint' integration (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:45:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [10:52:30] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [10:58:15] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes: use black [puppet] - 10https://gerrit.wikimedia.org/r/893005 (owner: 10Jbond) [10:58:19] (03CR) 10Jbond: [C: 03+2] cumin: update check_puppet_run_changes to remove inactive alerts hosts [puppet] - 10https://gerrit.wikimedia.org/r/893002 (owner: 10Jbond) [10:58:23] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes: refactor and switch to pql [puppet] - 10https://gerrit.wikimedia.org/r/893022 (owner: 10Jbond) [11:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1100). [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1100) [11:00:37] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:00] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this maintenance like we've done in codfw right? 
[11:02:16] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [11:06:41] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: alert2001.wikimedia.org, an-test-worker1001.eqiad.wmnet, aphlict2001.codfw.wmnet, aqs2001.codfw.wmnet, aqs2002.codfw.wmnet, aqs2003.codfw.wmnet, aqs2004.codfw.wmnet, aqs2005.codfw.wmnet, aqs2006.codfw.wmnet, aqs2007.codfw.wmnet, aqs2008.codfw.wmnet, aqs2009.codfw.wmnet, aqs2010.codf [11:06:41] aqs2011.codfw.wmnet, aqs2012.codfw.wmnet, cloudcephmon2004-dev.codfw.wmnet, cloudcephmon2005-dev.codfw.wmnet, cloudcephmon2006-dev.codfw.wmnet, clouddumps1001.wikimedia.org, clouddumps1002.wikimedia.org, cumin1001.eqiad.wmnet, idm-test1001.wikimedia.org, idm1001.wikimedia.org, idm2001.wikimedia.org, ms-be2067.codfw.wmnet, netmon1003.wikimedia.org, releases1002.eqiad.wmnet, releases2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Pupp [11:06:41] ck_puppet_run_changes [11:06:43] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: alert2001.wikimedia.org, an-test-worker1001.eqiad.wmnet, aphlict2001.codfw.wmnet, aqs2001.codfw.wmnet, aqs2002.codfw.wmnet, aqs2003.codfw.wmnet, aqs2004.codfw.wmnet, aqs2005.codfw.wmnet, aqs2006.codfw.wmnet, aqs2007.codfw.wmnet, aqs2008.codfw.wmnet, aqs2009.codfw.wmnet, aqs2010.codf [11:06:43] aqs2011.codfw.wmnet, aqs2012.codfw.wmnet, cloudcephmon2004-dev.codfw.wmnet, cloudcephmon2005-dev.codfw.wmnet, cloudcephmon2006-dev.codfw.wmnet, clouddumps1001.wikimedia.org, clouddumps1002.wikimedia.org, cumin1001.eqiad.wmnet, idm-test1001.wikimedia.org, idm1001.wikimedia.org, idm2001.wikimedia.org, ms-be2067.codfw.wmnet, netmon1003.wikimedia.org, releases1002.eqiad.wmnet, releases2002.codfw.wmnet 
https://wikitech.wikimedia.org/wiki/Pupp [11:06:43] ck_puppet_run_changes [11:11:26] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:13:06] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:16:39] btullis, nfraison: FYI the above is reporting aqs servers in codfw, they seem to change at every run [11:16:45] /etc/default/wikimedia-lvs-realserver with [11:16:46] -LVS_SERVICE_IPS="10.0.5.3" [11:16:47] +LVS_SERVICE_IPS="" [11:17:06] Anyone around to help debug my failed deploy? I have followed https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting but I'm stuck. The last thing that shows up in the logs is "[warning][main] [source/server/server.cc:642] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections [11:17:09] " which doesn't seem relevant to me. [11:17:19] but then puppet runs /usr/sbin/dpkg-reconfigure -p critical -f noninteractive wikimedia-lvs-realserver [11:17:25] that I guess resets it back to its value [11:19:52] and so on at each puppet run [11:21:28] mvolz: looks to me like citoid-staging-tls-proxy is healthy, that's just the last message it outputs on starting (a bit opaque for sure). according to `kubectl describe pod citoid-staging-77f98857d7-9k7hq` it looks like there's a problem with the citoid container itself [11:21:35] "Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: exec: "nodejs": executable file not found in $PATH: unknown" [11:22:21] when I do log citoid-staging, it just outputs nothing [11:22:26] :( [11:22:43] yeah I think it's totally failing to start - dunno why that container doesn't have nodejs installed or in PATH, looking into it [11:23:24] it's an update from node 12 to node 14 [11:23:30] maybe a problem with the commit... 
[11:23:57] might be a change between versions of the container too, trying to check it now [11:24:26] aha... the node14 container doesn't have `nodejs` as a binary, it *only* has `node` [11:24:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:24:43] 10SRE, 10serviceops, 10Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996 (10Clement_Goubert) [11:24:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Please, also update references and docs in Wikitech if any." [puppet] - 10https://gerrit.wikimedia.org/r/890501 (https://phabricator.wikimedia.org/T292238) (owner: 10Majavah) [11:26:04] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996 (10Clement_Goubert) p:05Triage→03Medium [11:26:08] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996 (10Clement_Goubert) [11:26:13] (03PS1) 10Hnowlan: citoid: use `node` instead of `nodejs` for node14 container [deployment-charts] - 10https://gerrit.wikimedia.org/r/893688 [11:28:48] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997 (10Clement_Goubert) [11:29:14] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997 (10Clement_Goubert) p:05Triage→03Medium [11:31:10] (03PS1) 10Cathal Mooney: Set switch-side MTU to 9192 for discovered links from servers [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/893690 (https://phabricator.wikimedia.org/T329799) [11:36:39] (03CR) 10Mvolz: [C: 03+2] citoid: use `node` instead of `nodejs` for node14 container [deployment-charts] - 10https://gerrit.wikimedia.org/r/893688 (owner: 10Hnowlan) [11:41:09] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997 (10Clement_Goubert) [11:41:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert) [11:41:36] (03Merged) 10jenkins-bot: citoid: use `node` instead of `nodejs` for node14 container [deployment-charts] - 10https://gerrit.wikimedia.org/r/893688 (owner: 10Hnowlan) [11:41:51] (03CR) 10Mvolz: "thanks! checking why this wasn't an issue when I updated zotero, I'm curious whether we need the "command:" in the template at all? 
Zotero" [deployment-charts] - 10https://gerrit.wikimedia.org/r/893688 (owner: 10Hnowlan) [11:42:29] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:42:47] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:44:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:46:00] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [11:46:21] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:47:15] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:47:41] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:48:01] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:48:50] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:49:59] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:51:49] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:53:06] (03PS1) 10Jbond: dispatch::web: fix documentation [puppet] - 10https://gerrit.wikimedia.org/r/893692 [11:53:08] (03PS1) 10Jbond: servie::docker: ensure we remove directories [puppet] - 10https://gerrit.wikimedia.org/r/893693 [11:55:38] (03CR) 10Jbond: [C: 03+2] dispatch::web: fix documentation [puppet] - 10https://gerrit.wikimedia.org/r/893692 (owner: 10Jbond) [11:55:42] (03CR) 10Jbond: [C: 
03+2] servie::docker: ensure we remove directories [puppet] - 10https://gerrit.wikimedia.org/r/893693 (owner: 10Jbond) [11:56:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39916/console" [puppet] - 10https://gerrit.wikimedia.org/r/893693 (owner: 10Jbond) [11:57:51] (03PS1) 10Ladsgroup: Revert "filebackend: Replace stringified class names with ::class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893652 (https://phabricator.wikimedia.org/T330942) [11:57:59] (03CR) 10CI reject: [V: 04-1] Revert "filebackend: Replace stringified class names with ::class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893652 (https://phabricator.wikimedia.org/T330942) (owner: 10Ladsgroup) [11:58:43] (03CR) 10Ayounsi: [C: 03+1] Set switch-side MTU to 9192 for discovered links from servers [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/893690 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [11:59:21] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Ennomeijers) I think this touches upon a fundamental question of how to model WD information as Linked Data. As currently stated in the [[ URL | Data Access article ]] the //co... 
[12:03:12] (03CR) 10Hnowlan: citoid: use `node` instead of `nodejs` for node14 container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/893688 (owner: 10Hnowlan) [12:06:06] (03CR) 10Cathal Mooney: [C: 03+2] Set switch-side MTU to 9192 for discovered links from servers [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/893690 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [12:07:06] (03Merged) 10jenkins-bot: Set switch-side MTU to 9192 for discovered links from servers [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/893690 (https://phabricator.wikimedia.org/T329799) (owner: 10Cathal Mooney) [12:12:38] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10akosiaris) >>! In T330165#8660042, @Marostegui wrote: > @ayounsi @akosiaris @Joe to confirm, we are going to depool eqiad before this maintenance like we... [12:12:53] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Vgutierrez) Probably HSTS is only being implemented by browsers. There is any particular reason to target the HTTP version or it could be bumped to HTTPS? Considering that we d... [12:15:05] 10SRE, 10ops-ulsfo: mr1-ulsfo dead? - https://phabricator.wikimedia.org/T330984 (10RobH) The scs is also only accessible via mr1-ulsfo, so I'll put in a remote hands for them to power cycle it manually. 
[12:16:57] 10SRE, 10Data-Persistence (work done), 10serviceops, 10Datacenter-Switchover, 10Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (10Clement_Goubert) a:05Clement_Goubert→03None [12:21:01] 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (10Zabe) [12:21:44] 10SRE, 10ops-ulsfo: mr1-ulsfo dead? - https://phabricator.wikimedia.org/T330984 (10RobH) Case Order #00839073 > mr1-ulsfo SRX300 S/N CV4521AN0962 located in 103.02.22:U41 (rear facing) has gone offline as of 3.5 hours ago, preventing access to our mgmt network. > > Can we have remote hands please go and remo... [12:22:54] (03PS1) 10Jbond: check_puppet_run_changes: remove hosts that have acked the puppet check [puppet] - 10https://gerrit.wikimedia.org/r/893701 [12:27:46] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10LWyatt) The BBC representative has now confirmed to me that they thank us for access to these welsh-language names for the map tiles on this service - that they found it very helpful, and have now transi... 
[12:28:27] (03PS1) 10Volans: sre.hosts.provision: pick the NIC with LinkUp [cookbooks] - 10https://gerrit.wikimedia.org/r/893706 [12:32:30] (Traffic bill over quota) firing: Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:34:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:31] (03PS1) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 [12:37:19] (03CR) 10Lucas Werkmeister (WMDE): "This works as far as I can tell, but might need more work before it’s ready to merge – I consider it mainly a proposal, as in “wouldn’t it" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [12:41:21] (03CR) 10Clément Goubert: wmf-update-known-hosts-production: Automatically download DNS (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [12:43:37] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 83.05 ms [12:44:15] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 72.24 ms [12:44:15] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.32 ms [12:44:29] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:44:35] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:53] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 
UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:09] 10SRE, 10ops-ulsfo: mr1-ulsfo dead? - https://phabricator.wikimedia.org/T330984 (10RobH) a:05RobH→03ayounsi They pulled power, plugged it back in, and now its online and responsive. I don't want to just resolve this task, as Arzhel may want to investigate device logs and see what caused this issue. [12:49:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:51:02] (03PS2) 10Jbond: check_puppet_run_changes: remove hosts that have acked the puppet check [puppet] - 10https://gerrit.wikimedia.org/r/893701 [12:52:30] (Traffic bill over quota) resolved: Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:53:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10JMeybohm) @Rmaung AIUI you have been granted all requested permissions already in {T266250}. The ssh key on file is a different, though: https://gerrit.... 
[12:58:16] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes: remove hosts that have acked the puppet check [puppet] - 10https://gerrit.wikimedia.org/r/893701 (owner: 10Jbond) [13:01:03] (03PS1) 10Hashar: Merge tag 'v3.5.5' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893723 (https://phabricator.wikimedia.org/T330663) [13:02:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10Datacenter-Switchover: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997 (10Volans) [13:05:53] (03PS2) 10Hashar: Merge tag 'v3.5.5' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893723 (https://phabricator.wikimedia.org/T330663) [13:10:29] (03PS2) 10Daniel Kinzler: Bump parsoid parser cache writes to 50%. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) [13:10:41] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) p:05Triage→03Medium [13:10:56] (03CR) 10Daniel Kinzler: Bump parsoid parser cache writes to 50%. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [13:15:07] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:18:04] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.5.5' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893723 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [13:18:34] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) >>! In T330165#8660202, @akosiaris wrote: >>>! 
In T330165#8660042, @Marostegui wrote: >> @ayounsi @akosiaris @Joe to confirm, we are going to... [13:19:09] (03PS1) 10Nicolas Fraison: airflow: set specific ariflow release on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/893732 [13:20:37] (03CR) 10Btullis: [C: 03+1] "I agree, this is the pragmatic approach here. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/893732 (owner: 10Nicolas Fraison) [13:21:14] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39917/console" [puppet] - 10https://gerrit.wikimedia.org/r/893732 (owner: 10Nicolas Fraison) [13:21:28] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] airflow: set specific ariflow release on an-airflow1003 [puppet] - 10https://gerrit.wikimedia.org/r/893732 (owner: 10Nicolas Fraison) [13:26:54] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) We are still seeing some error on the interface after changing the network cable last week. I will like to move the server from ge-6/0/6 to ge-6/0/1. Thanks [13:28:36] (03Merged) 10jenkins-bot: Merge tag 'v3.5.5' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893723 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [13:30:43] (03PS2) 10Jforrester: Revert "filebackend: Replace stringified class names with ::class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893652 (https://phabricator.wikimedia.org/T330942) (owner: 10Ladsgroup) [13:31:41] (03CR) 10Jforrester: "I've manually re-done this so we could land it if needed, but it looks like this is no longer thought at fault. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893652 (https://phabricator.wikimedia.org/T330942) (owner: 10Ladsgroup) [13:31:51] (03Abandoned) 10Jforrester: Revert "filebackend: Replace stringified class names with ::class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893652 (https://phabricator.wikimedia.org/T330942) (owner: 10Ladsgroup) [13:34:24] (03PS1) 10Jforrester: build: Pin php-code-coverage so it doesn't dirty the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893739 [13:40:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:42:06] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [13:42:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:45:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore1004: allow incoming HTTP connections from cloudcontrol servers [puppet] - 10https://gerrit.wikimedia.org/r/893514 (https://phabricator.wikimedia.org/T330916) (owner: 10Arturo Borrero Gonzalez) [13:45:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:46:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:46:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:47:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:47:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[13:48:38] !log dcaro@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dcaro@cumin1001" [13:48:43] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1010.eqiad.wmnet with OS bullseye [13:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:50:13] (03PS1) 10Jbond: klaxon: add docs [puppet] - 10https://gerrit.wikimedia.org/r/893740 [13:50:15] (03PS1) 10Jbond: systemd::timer::job: add more parameters [puppet] - 10https://gerrit.wikimedia.org/r/893741 [13:50:17] (03PS1) 10Jbond: klaxon: update to use sttemr::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/893742 [13:51:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:52:00] (03PS1) 10Hashar: Update Gerrit to v3.5.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) [13:52:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39919/console" [puppet] - 10https://gerrit.wikimedia.org/r/893741 (owner: 10Jbond) [13:52:37] (03CR) 10CI reject: [V: 04-1] Update Gerrit to v3.5.5 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [13:52:59] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:53:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:47] (03PS2) 10Jbond: klaxon: update to use sttemr::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/893742 [13:54:57] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [13:56:31] (03PS2) 10Jbond: systemd::timer::job: add more parameters [puppet] - 10https://gerrit.wikimedia.org/r/893741 [13:56:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39920/console" [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [13:57:19] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.5) - 
10https://gerrit.wikimedia.org/r/893743 (https://phabricator.wikimedia.org/T330663) (owner: 10Hashar) [13:57:33] (03PS3) 10Jbond: klaxon: update to use sttemr::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/893742 [13:57:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39921/console" [puppet] - 10https://gerrit.wikimedia.org/r/893741 (owner: 10Jbond) [13:57:48] (03CR) 10Jbond: [C: 03+2] klaxon: add docs [puppet] - 10https://gerrit.wikimedia.org/r/893740 (owner: 10Jbond) [13:58:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::timer::job: add more parameters [puppet] - 10https://gerrit.wikimedia.org/r/893741 (owner: 10Jbond) [13:58:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:58:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:58:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] systemd::timer::job: add more parameters [puppet] - 10https://gerrit.wikimedia.org/r/893741 (owner: 10Jbond) [13:59:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39922/console" [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1400). 
[14:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:00:20] I’m in a meeting right now, can deploy later during the window though [14:00:32] will leave it to you Lucas_WMDE :) [14:01:29] ok :) [14:01:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39923/console" [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [14:05:19] (03PS1) 10David Caro: wikireplicas: add GRANTS for cloudcontrols to replace labstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/893752 [14:06:35] (03PS2) 10David Caro: wikireplicas: add GRANTS for cloudcontrols to replace labstore1004 [puppet] - 10https://gerrit.wikimedia.org/r/893752 (https://phabricator.wikimedia.org/T331014) [14:07:14] 10Puppet, 10Infrastructure-Foundations, 10Packaging, 10Patch-For-Review: apt: improve apt failover orchestration - https://phabricator.wikimedia.org/T330849 (10jbond) >>! In T330849#8656648, @Volans wrote: > We should find a standard setup for those use cases, I can see Netbox having exactly the same issu... 
[14:08:02] (03PS1) 10Marostegui: wiki-replicas.sql: Add new grants [puppet] - 10https://gerrit.wikimedia.org/r/893753 (https://phabricator.wikimedia.org/T331014) [14:09:23] (03CR) 10Marostegui: [C: 03+2] "Looks good - this still needs a manual run" [puppet] - 10https://gerrit.wikimedia.org/r/893752 (https://phabricator.wikimedia.org/T331014) (owner: 10David Caro) [14:09:50] (03Abandoned) 10Marostegui: wiki-replicas.sql: Add new grants [puppet] - 10https://gerrit.wikimedia.org/r/893753 (https://phabricator.wikimedia.org/T331014) (owner: 10Marostegui) [14:13:03] o/ [14:16:49] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused Wikibase config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) [14:17:17] `grep -rl 'tmpUnconnectedPagePagePropMigrationStage\|tmpUnconnectedPagePagePropMigrationLegacyFormat'` confirms the settings are unused in wmf.25 [14:17:28] (and the wmf.24 is also harmless, in case the train gets rolled back) [14:18:08] ugh, diffConfig is broken again https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1899/console [14:18:15] fatal: unable to auto-detect email address (got 'nobody@1a9011334410.(none)') [14:18:24] which apparently means it can’t git stash [14:18:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) (owner: 10Lucas Werkmeister (WMDE)) [14:20:04] 10SRE, 10ops-ulsfo: mr1-ulsfo dead? - https://phabricator.wikimedia.org/T330984 (10ayounsi) 05Open→03Resolved Thanks for the quick turnaround! From previous logs it might be corrupted disk. Hopefully the fsck ran at boot time fixed it. If it happen again we will do a RMA. 
[14:20:34] filed T331020 for the error [14:20:35] T331020: diffConfig broken due to git stash failure - https://phabricator.wikimedia.org/T331020 [14:20:41] (deployment proceeds) [14:20:54] (03Merged) 10jenkins-bot: Remove unused Wikibase config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891571 (https://phabricator.wikimedia.org/T330410) (owner: 10Lucas Werkmeister (WMDE)) [14:21:08] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:891571|Remove unused Wikibase config variables (T330410)]] [14:21:14] T330410: Clean up last remnants of Special:UnconnectedPages / unexpectedUnconnectedPageProp migration - https://phabricator.wikimedia.org/T330410 [14:21:42] (03CR) 10Lucas Werkmeister (WMDE): "This appears to be broken: T331020" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893088 (owner: 10Krinkle) [14:23:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:891571|Remove unused Wikibase config variables (T330410)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:24:00] https://www.wikidata.org/wiki/Special:Version still loads, don’t think there’s anything more to test [14:24:06] syncing [14:29:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:891571|Remove unused Wikibase config variables (T330410)]] (duration: 08m 41s) [14:29:58] T330410: Clean up last remnants of Special:UnconnectedPages / unexpectedUnconnectedPageProp migration - https://phabricator.wikimedia.org/T330410 [14:30:07] (03PS1) 10David Caro: ferm: sort chains before comparing [puppet] - 10https://gerrit.wikimedia.org/r/893756 [14:32:35] !log UTC afternoon backport+config window done [14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:19] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? 
- https://phabricator.wikimedia.org/T330906 (10akosiaris) curl also implements HSTS. See https://curl.se/docs/hsts.html, but it is indeed primarily a mechanism to protect users of browsers. @Ennomeijers you are right abou... [14:34:55] (03PS4) 10Filippo Giunchedi: klaxon: update to use systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [14:35:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "Updated to set workingdirectory and delete the old unit, LGTM (tested in Pontoon and it is working)" [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [14:37:24] (03CR) 10Volans: [C: 03+1] "LGTM but I'm unfamiliar with the code, better to get someone else have a look too." [puppet] - 10https://gerrit.wikimedia.org/r/893756 (owner: 10David Caro) [14:38:23] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:39:44] (03PS1) 10Jbond: lvs::realserver: Ensure we have at least one real ip address [puppet] - 10https://gerrit.wikimedia.org/r/893758 [14:40:14] (03PS2) 10Andrew Bogott: OpenStack: collapse 'user' OpenStack role into 'reader' role [puppet] - 10https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759) [14:40:16] (03PS1) 10Andrew Bogott: OpenStack: rename 'projectadmin' role to 'member' role [puppet] - 10https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) [14:40:57] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:25] (03CR) 10CI reject: [V: 04-1] lvs::realserver: Ensure we have at least one real ip address [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [14:43:15] (03PS2) 10Jbond: lvs::realserver: Ensure we have at least one 
real ip address [puppet] - 10https://gerrit.wikimedia.org/r/893758 [14:44:48] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10BBlack) >>! In T330906#8657917, @Ennomeijers wrote: > Thanks for the replies! Advising to use HTTPS over HTTP makes sense. > > But not supporting redirection from HTTP to HTT... [14:45:34] (03CR) 10Jbond: [C: 03+2] klaxon: update to use systemd::job::timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893742 (owner: 10Jbond) [14:47:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add pint source for k8s [puppet] - 10https://gerrit.wikimedia.org/r/893407 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [14:49:11] (03PS1) 10David Caro: toolsdb_replica_cnf: configure if we want to redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/893760 (https://phabricator.wikimedia.org/T303663) [14:49:54] (beta, not prod) — If anyone in here hasn't seen T331019 and has any advice, be appreciated [14:49:55] T331019: Edits not saving on beta cluster (db replication error, corrupted table) - https://phabricator.wikimedia.org/T331019 [14:50:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/893756 (owner: 10David Caro) [14:52:27] (03CR) 10Andrew Bogott: [C: 03+1] "i've never used epp but otherwise this lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/893760 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [14:54:40] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Nikki) >>! In T330906#8659810, @Vgutierrez wrote: >>>! In T330906#8657917, @Ennomeijers wrote: >> But not supporting redirection from HTTP to HTTPS will in my opinion introduce... [14:54:51] (03CR) 10Jbond: "pcc shows that the only failing role is aqs in codfw which has no ip addresses ( was the motive for this change). 
This will mean that the" [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [14:55:03] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) [14:55:15] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) p:05Triage→03High [14:55:24] (03PS3) 10Jbond: lvs::realserver: Ensure we have at least one real ip address [puppet] - 10https://gerrit.wikimedia.org/r/893758 [14:57:49] (03PS2) 10David Caro: toolsdb_replica_cnf: configure if we want to redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/893760 (https://phabricator.wikimedia.org/T303663) [14:57:55] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [14:58:28] (03CR) 10David Caro: [C: 03+2] ferm: sort chains before comparing [puppet] - 10https://gerrit.wikimedia.org/r/893756 (owner: 10David Caro) [14:59:05] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/893762 (owner: 10Clément Goubert) [14:59:38] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10MatthewVernon) `kern.log` is 220M(!) full of errors about these two drives. [15:03:16] (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/output/893758/39926/" [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [15:03:38] (03CR) 10David Caro: [C: 03+2] "Yep, no more restarting ferm on every puppet run \o/" [puppet] - 10https://gerrit.wikimedia.org/r/893756 (owner: 10David Caro) [15:04:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? 
- https://phabricator.wikimedia.org/T310922 (10LSobanski) @Jclark-ctr the Thanos hosts had to be returned to the Thanos cluster so this request is in limbo until new hosts are procured (most likely ne... [15:07:16] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10LSobanski) [15:07:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) [15:07:27] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: HTTP URIs do not resolve from NL and DE? - https://phabricator.wikimedia.org/T330906 (10Ennomeijers) 05Stalled→03Resolved a:03Ennomeijers As I already mentioned earlier, the SPARQL endpoint and the RDF serialized data all use the HTTP version as the canonica... [15:07:33] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10LSobanski) [15:07:37] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10LSobanski) [15:08:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) T279621 is the best place to track the progress of this. 
[15:09:47] (03CR) 10Nicolas Fraison: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [15:12:11] (03PS1) 10Nicolas Fraison: aqs: set up lvs for aqs codfw [puppet] - 10https://gerrit.wikimedia.org/r/893763 [15:12:34] (03CR) 10Nicolas Fraison: [C: 04-2] "Still in progress" [puppet] - 10https://gerrit.wikimedia.org/r/893763 (owner: 10Nicolas Fraison) [15:13:32] (03CR) 10Nicolas Fraison: [V: 03+1 C: 04-2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39929/console" [puppet] - 10https://gerrit.wikimedia.org/r/893763 (owner: 10Nicolas Fraison) [15:15:19] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) >>! In T330670#8652102, @jbond wrote: > lgtm just some curiosity :) > >> After the above change, we will have three DNS boxes in the core DCs, with ns0 pointing to d... [15:16:47] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10jbond) >>! In T330670#8661042, @ssingh wrote: > Yes, that was the eventual plan: to do ns0 over all three dns rec boxes and similarly for ns1, just like we are doing for ns2.... [15:18:04] (03PS2) 10Nicolas Fraison: aqs: set up lvs for aqs codfw [puppet] - 10https://gerrit.wikimedia.org/r/893763 [15:20:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) This ticket had little activity in the last month. Did something happen offline that wasn't r... 
[15:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) an-worker1149 A7 U1 port CableId an-worker1150 B7 U38 port CableId an-worker1151 C7... [15:24:44] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/893787 (owner: 10Clément Goubert) [15:25:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @aborrero I've been getting the cloudsw configured in the background, which is nearly done. M... [15:25:52] (03PS3) 10Nicolas Fraison: aqs: set up lvs for aqs codfw [puppet] - 10https://gerrit.wikimedia.org/r/893763 [15:35:23] (03CR) 10Cwhite: [C: 03+2] toil: restart opensearch-dashboards every wednesday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891394 (https://phabricator.wikimedia.org/T327161) (owner: 10Cwhite) [15:35:35] (03CR) 10David Caro: [C: 03+2] toolsdb_replica_cnf: configure if we want to redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/893760 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:35:41] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) ` Input errors: Errors: 2497, Drops: 0, Framing errors: 2497, Runts: 0, Bucket drops: 0, [15:39:12] (03PS1) 10Nicolas Fraison: aqs: set DNS entry for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/893789 [15:39:21] (03CR) 10Nicolas Fraison: [C: 04-2] "WIP" [dns] - 10https://gerrit.wikimedia.org/r/893789 (owner: 10Nicolas Fraison) [15:41:07] !log restart db2099 T330218 [15:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:14] T330218: Inbound interface errors - https://phabricator.wikimedia.org/T330218 [15:41:56] (03CR) 
10Cwhite: [C: 03+2] profile::logstash: improve the ORES filter [puppet] - 10https://gerrit.wikimedia.org/r/893672 (https://phabricator.wikimedia.org/T325763) (owner: 10Elukey) [15:42:12] (03CR) 10Papaul: [C: 03+1] sre.hosts.provision: pick the NIC with LinkUp [cookbooks] - 10https://gerrit.wikimedia.org/r/893706 (owner: 10Volans) [15:42:16] (03CR) 10Herron: [C: 03+1] Add 'pint' integration [alerts] - 10https://gerrit.wikimedia.org/r/893505 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [15:42:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [15:44:30] (03PS2) 10Hashar: site: add contint2002 to ci::master role [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [15:45:00] (03CR) 10Hashar: "It is mostly likely going to fail cause a bunch of hiera settings are missing for the various profiles ;)" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [15:45:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [15:51:37] While thinking about the timeline for updating mathoid, I realized that T210704 suggests that restbase still runs on node 6. Is that still true? [15:51:38] T210704: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 [15:52:48] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Cmjohnson) a:03Cmjohnson Submitted a ticket with Dell for a new HDD. Create Dispatch: Success You have successfully submitted request SR163405094. 
[15:53:42] (03PS2) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 [15:54:28] (03CR) 10Lucas Werkmeister (WMDE): wmf-update-known-hosts-production: Automatically download DNS (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE)) [15:55:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:57:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10Cmjohnson) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson The server password was not set correctly, fixed and you should be good to go. [15:58:24] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix DNS typo in record for cr2-eqiad gr-3/3/0.2 - cmooney@cumin1001" [15:58:32] (03PS1) 10Filippo Giunchedi: profile: add pint source for analytics/ext/services [puppet] - 10https://gerrit.wikimedia.org/r/893794 (https://phabricator.wikimedia.org/T309182) [15:59:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix DNS typo in record for cr2-eqiad gr-3/3/0.2 - cmooney@cumin1001" [15:59:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:18] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (10Papaul) We reboot the server today and so far no more errors on the interface. leaving this task open for now. ` Input errors: Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0, Policed discards: 0, L3 i... 
[16:01:37] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson This is most likely a bad cable, I will fix today [16:01:41] (03PS1) 10David Caro: cloud: add openstack_controllers to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893795 [16:04:14] (03PS1) 10Giuseppe Lavagetto: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) [16:04:39] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39932/console" [puppet] - 10https://gerrit.wikimedia.org/r/893794 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:04:44] (03CR) 10Arturo Borrero Gonzalez: cloud: add openstack_controllers to cloud.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893795 (owner: 10David Caro) [16:04:56] (03CR) 10CI reject: [V: 04-1] filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:05:38] (03CR) 10Ladsgroup: [C: 03+1] Bump parsoid parser cache writes to 50%. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [16:05:39] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Papaul) a:05Papaul→03Marostegui @Marostegui disk replaced [16:06:05] (03CR) 10David Caro: cloud: add openstack_controllers to cloud.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893795 (owner: 10David Caro) [16:07:17] (03PS1) 10Jbond: icinga: update casts in icinga_status [puppet] - 10https://gerrit.wikimedia.org/r/893797 [16:07:53] (03CR) 10Btullis: [C: 03+1] "Yes, I agree that the effect of this patch is more desirable than the current situation. I'm working on getting clarification on the way " [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [16:08:25] (03PS2) 10Jforrester: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:08:29] (03CR) 10Jforrester: [C: 03+1] filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:08:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud: add openstack_controllers to cloud.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893795 (owner: 10David Caro) [16:08:54] (03PS1) 10Zabe: beta: Promote deployment-db11 as master, decom deployment-db09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893798 (https://phabricator.wikimedia.org/T331019) [16:08:59] (03CR) 10Jbond: [C: 03+2] lvs::realserver: Ensure we have at least one real ip address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/893758 (owner: 10Jbond) [16:09:12] (03CR) 10jenkins-bot: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:09:26] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T330681 (10Marostegui) Thank you Papaul ` Raw Size: 1.746 TB [0xdf8fe2b0 Sectors] Firmware state: =====> Rebuild <===== ` [16:11:29] (03CR) 10Ladsgroup: "let me double check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:12:11] jouncebot: nowandnext [16:12:11] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [16:12:11] In 0 hour(s) and 47 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1700) [16:12:27] (03CR) 10Zabe: [C: 03+2] beta: Promote deployment-db11 as master, decom deployment-db09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893798 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [16:12:54] (03CR) 10Samtar: [C: 03+2] "urgh :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893798 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [16:13:12] (03Merged) 10jenkins-bot: beta: Promote deployment-db11 as master, decom deployment-db09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893798 (https://phabricator.wikimedia.org/T331019) (owner: 10Zabe) [16:13:38] Another +4 [16:14:07] (03CR) 10Ladsgroup: "Let me test this in mwdebug2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:16:38] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: add pint source for analytics/ext/services [puppet] - 10https://gerrit.wikimedia.org/r/893794 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:18:48] (03CR) 10Ladsgroup: "it leads to this in mwdebug2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 
(https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:19:12] (03PS3) 10Hashar: site: add contint2002 to ci::master role [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:19:47] (03CR) 10Hashar: "I have added hieradata to disable Zuul scheduler and merger. Jenkins default to being masked from role::ci::master." [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:19:54] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:26:39] (03CR) 10Filippo Giunchedi: "Change LGTM, do you mind expanding on the context/current issue you have run into?" [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [16:26:49] (03CR) 10David Caro: [C: 03+2] cloud: add openstack_controllers to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/893795 (owner: 10David Caro) [16:27:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) I am still waiting on the partman recipe. [16:32:17] (03CR) 10Hashar: [C: 04-1] "So compiler is at https://puppet-compiler.wmflabs.org/output/867673/1647/" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:34:08] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) Hi, this is on my TODO, but these backends are a low priority for us at the moment (compared to the frontends, which are a really high p... 
[16:34:25] (03CR) 10Hashar: [C: 04-1] site: add contint2002 to ci::master role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [16:39:40] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) [16:39:52] cdanis: nothing from an oncall PoV, but there is a UBN (T330942) which has been ongoing all day re thumbnails/swift; more info in m_sec [16:39:53] T330942: Latest image thumbnails aren't replaced correctly after image reupload - https://phabricator.wikimedia.org/T330942 [16:40:26] (03CR) 10Giuseppe Lavagetto: filebackend: hotfix - make swift master follow the mediawiki master (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:42:13] (03PS3) 10Giuseppe Lavagetto: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) [16:42:56] (03CR) 10CI reject: [V: 04-1] filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:44:32] (03PS4) 10Giuseppe Lavagetto: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) [16:49:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:51:38] (03CR) 10Ladsgroup: "Fixes the 
issue:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:53:40] (03CR) 10Ladsgroup: [C: 03+1] "upload works in mwdebug2002 and deletes the thumbnails. This is good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:55:44] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) I understand that this is a low priority for you but it is not for me since i have to meet my install SLA's of 30days. I can remove the ops-cod... [16:56:23] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10MatthewVernon) Please don't do that; I'll try and get back to you before the end of the week. [16:57:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:57:51] (03Merged) 10jenkins-bot: filebackend: hotfix - make swift master follow the mediawiki master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893796 (https://phabricator.wikimedia.org/T330942) (owner: 10Giuseppe Lavagetto) [16:59:49] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]] [16:59:55] T330942: Latest image thumbnails aren't replaced correctly after image reupload - https://phabricator.wikimedia.org/T330942 [17:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1700). Please do the needful. 
[17:00:05] jnuche: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:22] (03CR) 10Volans: aqs: set DNS entry for aqs.svc.codfw.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/893789 (owner: 10Nicolas Fraison) [17:01:14] jnuche: o/ looking [17:01:27] rzl: thx :) [17:01:39] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [17:02:09] <_joe_> Amir1: can you test on mwdebug2001 by uploading a new copy of that image from a browser? [17:02:24] (03CR) 10Krinkle: vcl: fix X-Cache-Status on deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/648199 (https://phabricator.wikimedia.org/T269825) (owner: 10Ema) [17:02:32] _joe_: I tested it before the merge [17:02:50] <_joe_> ack so ok to proceed? [17:03:15] jnuche: stand by, out of caution I'm going to wait until _joe_ is finished using scap before merging this :) [17:03:24] <_joe_> rzl: thanks, yes [17:03:24] ack [17:03:43] <_joe_> sorry, it's an UBN! issue we've been chasing down since this morning [17:04:08] _joe_: no worries, mind pinging me when you're all clear? 
[17:04:38] (03CR) 10Hashar: [C: 04-1] site: add contint2002 to ci::master role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [17:06:00] _joe_: yup [17:06:09] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:17] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/893797 (owner: 10Jbond) [17:07:39] 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) @papaul in terms of the cables we will need to begin as follows. I'm assuming here we go with [[ https://www.fs.com/de-en/products/71644.html?attribute=675&id=... [17:07:59] 10SRE, 10Traffic: Add unique error IDs to 4xx responses - https://phabricator.wikimedia.org/T330973 (10RLazarus) Adding @CDanis as we were just talking about something along these lines. [17:08:35] <_joe_> rzl: we're almost done [17:09:06] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:893796|filebackend: hotfix - make swift master follow the mediawiki master (T330942)]] (duration: 09m 16s) [17:09:12] T330942: Latest image thumbnails aren't replaced correctly after image reupload - https://phabricator.wikimedia.org/T330942 [17:09:19] 👍 [17:09:48] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:21] <_joe_> rzl: you're free to go [17:16:25] thanks! 
[17:16:49] jnuche: lgtm, merging now -- will you want me to run puppet on deploy2002 so you can test? [17:17:01] (03CR) 10RLazarus: [C: 03+2] scap bootstrap: use new installation mechanism [puppet] - 10https://gerrit.wikimedia.org/r/893473 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [17:17:24] oh, or possibly on somewhere not-deploy2002, now that I think about it :) let me know [17:17:35] rzl: could you run puppet on releases2002.codfw.wmnet? that's safe and I can check there [17:17:42] will do [17:19:13] jnuche: done, go ahead [17:19:27] checking [17:20:42] jnuche: I'll tag you on a CR for releases, it should fix a puppet issue, but I'm not sure of the impact on scap and deployments [17:20:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10Rmaung) @JMeybohm yes, please! [17:20:47] Not urgent [17:21:07] claime: sure [17:21:43] (03PS1) 10Cwhite: prometheus: exclude extension errors from client-errors metrics [puppet] - 10https://gerrit.wikimedia.org/r/893437 (https://phabricator.wikimedia.org/T330680) [17:23:09] rzl: looks healthy [17:23:13] thanks a lot :) [17:23:23] thank you! 
[17:23:33] puppet window's finished [17:32:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [17:33:27] (03PS1) 10Jcrespo: Moving the working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) [17:33:51] (03CR) 10CI reject: [V: 04-1] Moving the working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) (owner: 10Jcrespo) [17:35:15] (03PS2) 10Jcrespo: Moving the working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) [17:35:36] (03CR) 10CI reject: [V: 04-1] Moving the working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) (owner: 10Jcrespo) [17:39:41] (03PS3) 10Jcrespo: Moving the working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) [17:44:06] (03CR) 10Jcrespo: [C: 03+2] "This is not a great solution and has issues, but better here than nowhere. I will do a proper solution when I return from vacations." [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) (owner: 10Jcrespo) [17:44:37] (03PS4) 10Jcrespo: dbbackups: Moving the recovery working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) [17:44:47] (03CR) 10Jcrespo: [V: 03+2] dbbackups: Moving the recovery working prototype/hack into production [puppet] - 10https://gerrit.wikimedia.org/r/893803 (https://phabricator.wikimedia.org/T329499) (owner: 10Jcrespo) [17:54:14] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-02-27-121715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/893809 [18:00:05] bd808: #bothumor I � Unicode. 
All rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1800). [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T1800) [18:01:25] o/ I have a version bump for developer portal today. Our first build including Lëtzebuergesch localizations. [18:01:48] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-02-27-121715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/893809 (owner: 10BryanDavis) [18:06:44] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-02-27-121715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/893809 (owner: 10BryanDavis) [18:08:43] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:09:12] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:09:20] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:10:05] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:10:22] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:10:49] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:12:14] (03PS2) 10Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) [18:13:45] 10SRE, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) Unless I'm mistaken, the only recourse we have at this point is to throw our "top-ten website" gut around and demand the improvements for the security of our users. I don't... 
[18:24:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:52] (03PS1) 10Superpes15: [itwiki] Assign 'changetags' flag only to sysop/bot/botadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893810 (https://phabricator.wikimedia.org/T331051) [18:28:41] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Papaul) Thank you. [18:32:25] (03PS1) 10Sbailey: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) [18:33:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:32] (03PS1) 10BCornwall: hieradata/common: Remove dns1001/2001 from authdns [puppet] - 10https://gerrit.wikimedia.org/r/893812 (https://phabricator.wikimedia.org/T321309) [18:34:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:49:22] (03CR) 10Ssingh: [C: 03+1] "Looks good. 
We should just make sure that the firmwares are updated on both hosts before we merge this so that we can avoid additional dow" [puppet] - 10https://gerrit.wikimedia.org/r/893812 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:50:43] (03CR) 10Ssingh: [C: 03+1] ntp/eqiad: set to dns1002 [dns] - 10https://gerrit.wikimedia.org/r/893566 (owner: 10BCornwall) [18:50:52] (03CR) 10Ssingh: [C: 03+1] ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539 (owner: 10BCornwall) [18:51:08] (03CR) 10BCornwall: [C: 03+2] ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539 (owner: 10BCornwall) [18:51:11] (03PS3) 10BCornwall: ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539 [18:51:24] (03CR) 10BCornwall: [V: 03+2] ntp/codfw: set to dns2002 [dns] - 10https://gerrit.wikimedia.org/r/893539 (owner: 10BCornwall) [18:51:28] (03PS2) 10BCornwall: ntp/eqiad: set to dns1002 [dns] - 10https://gerrit.wikimedia.org/r/893566 [18:51:37] (03CR) 10Ladsgroup: "Please enable the reads gradually to make sure there are no errors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [18:54:37] (03CR) 10BCornwall: [C: 03+2] ntp/eqiad: set to dns1002 [dns] - 10https://gerrit.wikimedia.org/r/893566 (owner: 10BCornwall) [18:57:07] (03PS1) 10Jcrespo: dbbackups: Add parallel as a dependency of mini_loader.sh [puppet] - 10https://gerrit.wikimedia.org/r/893813 (https://phabricator.wikimedia.org/T319383) [18:58:11] (03PS2) 10Jcrespo: dbbackups: Add parallel as a dependency of mini_loader.sh [puppet] - 10https://gerrit.wikimedia.org/r/893813 (https://phabricator.wikimedia.org/T319383) [19:02:56] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add parallel as a dependency of mini_loader.sh [puppet] - 10https://gerrit.wikimedia.org/r/893813 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [19:05:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage 
for host ms-fe2014.codfw.wmnet with OS bullseye [19:05:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye [19:06:20] (03PS2) 10Sbailey: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) [19:06:49] (03CR) 10Sbailey: Enable new Linter UI for namespace, tag and template for all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [19:08:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10wiki_willy) a:03Papaul [19:09:22] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10wiki_willy) a:03Cmjohnson [19:10:56] 10SRE, 10Traffic, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BBlack) Is there a reasonable shopify alternative that meets policy? That would be my question. If there isn't, we're stuck with this policy violation, but shouldn't st... 
[19:13:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [19:14:03] RECOVERY - MegaRAID on db2110 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:24:19] 10SRE, 10Traffic, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) There's [[ https://woocommerce.com/ | WooCommerce ]] which could be created alongside all other WordPress installations (VIP has unlimited sites, right?) with... [19:27:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage [19:31:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2014.codfw.wmnet with reason: host reimage [19:44:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:47:44] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:58:04] (03CR) 10BCornwall: [C: 03+2] hieradata/common: Remove dns1001/2001 from authdns [puppet] - 10https://gerrit.wikimedia.org/r/893812 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:59:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:59:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2014.codfw.wmnet with OS bullseye [19:59:43] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - 
https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye completed: - ms-f... [19:59:52] dns1001 and dns2001 will be passing errors as we perform OS upgrades [20:04:19] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns1001.wikimedia.org with OS bullseye [20:04:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns1001.wikimedia.org with OS bullseye [20:07:42] !log brett@cumin2002 START - Cookbook sre.dns.netbox [20:08:54] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:04] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:09:04] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:11:38] (03PS1) 10JHathaway: jaeger: add wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893439 (https://phabricator.wikimedia.org/T320553) [20:12:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye [20:12:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe2004.codfw.wmnet with OS bullseye [20:12:40] PROBLEM - Host 2620:0:861:1:208:80:154:10 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:1:208:80:154:10) [20:13:43] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: 
https://phabricator.wikimedia.org/T331059 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:13:47] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331059 (10ops-monitoring-bot) [20:14:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:16:42] PROBLEM - Recursive DNS on 208.80.154.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:19:08] RECOVERY - Host 2620:0:861:1:208:80:154:10 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:19:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:20:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1001.wikimedia.org with reason: host reimage [20:21:54] PROBLEM - Recursive DNS on 2620:0:861:1:208:80:154:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:23:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1001.wikimedia.org with reason: host reimage [20:23:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2001.wikimedia.org with OS bullseye [20:24:07] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2001.wikimedia.org with OS bullseye [20:29:10] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:29:12] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:29:14] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:33:03] (03CR) 10JHathaway: [C: 03+2] jaeger: add wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893439 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [20:33:06] (03CR) 10JHathaway: [V: 03+2 C: 03+2] jaeger: add wmf-certificates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893439 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [20:34:45] (JobUnavailable) firing: (3) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:37:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage [20:39:40] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:39:46] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2001.wikimedia.org with reason: host reimage [20:40:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage [20:43:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2001.wikimedia.org with reason: host reimage [20:43:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1001.wikimedia.org with OS bullseye [20:43:40] 10SRE, 10Traffic, 
10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns1001.wikimedia.org with OS bullseye completed: - dns1001 (**PASS**) - Downtimed on Icinga/Al... [20:43:43] (03PS1) 10BCornwall: Revert "ntp/codfw: set to dns2002" [dns] - 10https://gerrit.wikimedia.org/r/893774 [20:43:49] (03PS2) 10BCornwall: Revert "ntp/codfw: set to dns2002" [dns] - 10https://gerrit.wikimedia.org/r/893774 [20:45:19] RECOVERY - Recursive DNS on 208.80.154.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:46:24] (03PS1) 10JHathaway: jaeger: bump version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893440 (https://phabricator.wikimedia.org/T320553) [20:46:58] (03CR) 10JHathaway: [C: 03+2] jaeger: bump version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893440 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [20:46:59] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:47:00] (03CR) 10JHathaway: [V: 03+2 C: 03+2] jaeger: bump version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/893440 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [20:49:27] PROBLEM - Recursive DNS on 208.80.153.77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:51:19] (03CR) 10Sbailey: "Just need a +1 to be ready for backport window deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [20:51:35] RECOVERY - Recursive DNS on 2620:0:861:1:208:80:154:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:52:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:53:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:53:17] RECOVERY - Recursive DNS on 208.80.153.77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:54:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:55:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:55:39] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:57:43] (03CR) 10Ssingh: [C: 03+1] "(For when dns2001 actually completes!)" [dns] - 10https://gerrit.wikimedia.org/r/893774 (owner: 10BCornwall) [20:58:11] (03PS1) 10BCornwall: Revert "ntp/eqiad: set to dns1002" [dns] - 10https://gerrit.wikimedia.org/r/893775 [20:59:09] (03CR) 10Ssingh: [C: 03+1] "Nice work!" [dns] - 10https://gerrit.wikimedia.org/r/893775 (owner: 10BCornwall) [21:00:04] brennen and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230302T2100). [21:00:04] Superpes: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:15] * TheresNoTime can deploy [21:00:17] Hi :) [21:00:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:00:29] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:00:37] (03PS1) 10BCornwall: Revert "hieradata/common: Remove dns1001/2001 from authdns" [puppet] - 10https://gerrit.wikimedia.org/r/893776 [21:00:41] Oh hi TheresNoTime :P Thanks! [21:00:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 181, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893810 (https://phabricator.wikimedia.org/T331051) (owner: 10Superpes15) [21:01:29] (03CR) 10Ssingh: [C: 03+1] Revert "hieradata/common: Remove dns1001/2001 from authdns" [puppet] - 10https://gerrit.wikimedia.org/r/893776 (owner: 10BCornwall) [21:01:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:01:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 106, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2004.codfw.wmnet with OS bullseye [21:02:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe2004.codfw.wmnet with OS bullseye completed: -... 
[21:02:09] (03CR) 10BCornwall: [C: 03+2] Revert "ntp/eqiad: set to dns1002" [dns] - 10https://gerrit.wikimedia.org/r/893775 (owner: 10BCornwall) [21:02:09] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:02:21] (03Merged) 10jenkins-bot: [itwiki] Assign 'changetags' flag only to sysop/bot/botadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893810 (https://phabricator.wikimedia.org/T331051) (owner: 10Superpes15) [21:02:35] !log samtar@deploy2002 Started scap: Backport for [[gerrit:893810|[itwiki] Assign 'changetags' flag only to sysop/bot/botadmin (T331051)]] [21:02:40] T331051: Assign 'changetags' flag only to sysop/bot/botadmin groups on itwiki - https://phabricator.wikimedia.org/T331051 [21:04:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2001.wikimedia.org with OS bullseye [21:04:11] (03CR) 10BCornwall: [C: 03+2] Revert "ntp/codfw: set to dns2002" [dns] - 10https://gerrit.wikimedia.org/r/893774 (owner: 10BCornwall) [21:04:15] !log samtar@deploy2002 superpes and samtar: Backport for [[gerrit:893810|[itwiki] Assign 'changetags' flag only to sysop/bot/botadmin (T331051)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:04:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2001.wikimedia.org with OS bullseye completed: - dns2001 (**PASS**) - Downtimed on Icinga/Al... 
[21:04:21] (03PS3) 10BCornwall: Revert "ntp/codfw: set to dns2002" [dns] - 10https://gerrit.wikimedia.org/r/893774 [21:04:25] (03CR) 10BCornwall: [V: 03+2] Revert "ntp/codfw: set to dns2002" [dns] - 10https://gerrit.wikimedia.org/r/893774 (owner: 10BCornwall) [21:04:35] Superpes: that's on mwdebug now, can you test? [21:04:39] (03CR) 10BCornwall: [C: 03+2] Revert "hieradata/common: Remove dns1001/2001 from authdns" [puppet] - 10https://gerrit.wikimedia.org/r/893776 (owner: 10BCornwall) [21:04:51] TheresNoTime Tested! All's right :) [21:04:56] syncing [21:07:01] All done with the DNS work :) [21:07:13] (03CR) 10Ladsgroup: [C: 03+1] Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893811 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [21:07:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:10:38] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:893810|[itwiki] Assign 'changetags' flag only to sysop/bot/botadmin (T331051)]] (duration: 08m 03s) [21:10:45] T331051: Assign 'changetags' flag only to sysop/bot/botadmin groups on itwiki - https://phabricator.wikimedia.org/T331051 [21:10:45] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:10:50] and that's live Superpes :) [21:11:09] TheresNoTime Thanks (as always) :D [21:11:16] 10SRE, 10SRE-OnFire, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10BCornwall) I know that I ignore them. Perhaps rather than removing them entirely, we could tweak the detection... 
[21:14:27] PROBLEM - Check systemd state on parse2009 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:04] (03CR) 10Jdlrobson: [C: 03+1] "This LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/893437 (https://phabricator.wikimedia.org/T330680) (owner: 10Cwhite) [21:16:48] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331064 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:16:52] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331064 (10ops-monitoring-bot) [21:21:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) [21:21:54] (03PS7) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [21:22:16] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:22:34] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10Papaul) 05Open→03Resolved This is complete [21:22:54] !log close UTC late backport and config training [21:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:24] (03CR) 10Krinkle: [C: 03+1] eventlogging: Remove obsoleted navtiming schemas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog) [21:30:24] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:07] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T331067 (10aranyap) [21:37:07] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T331067 (10Jcross) Approved [21:40:55] (03PS1) 10JHathaway: jaeger-cert-manager: remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/893820 (https://phabricator.wikimedia.org/T320554) [21:46:20] PROBLEM - NTP peers on dns1001 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown https://wikitech.wikimedia.org/wiki/NTP [21:46:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP - https://phabricator.wikimedia.org/T331067 (10aranyap) [21:46:37] (03PS1) 10Krinkle: build: Fix missing git config for git-stash command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893821 (https://phabricator.wikimedia.org/T331020) [21:47:16] (03CR) 10CI reject: [V: 04-1] build: Fix missing git config for git-stash command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893821 (https://phabricator.wikimedia.org/T331020) (owner: 10Krinkle) [21:47:36] (03PS2) 10Krinkle: build: Fix missing git config for git-stash command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893821 (https://phabricator.wikimedia.org/T331020) [21:48:09] (03CR) 10JHathaway: [C: 03+2] jaeger-cert-manager: remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/893820 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:53:20] (03Merged) 10jenkins-bot: jaeger-cert-manager: remove [deployment-charts] - 10https://gerrit.wikimedia.org/r/893820 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [22:00:12] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: 
https://phabricator.wikimedia.org/T331068 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:00:16] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331068 (10ops-monitoring-bot) [22:10:52] (03PS35) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [22:11:04] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts wdqs[2001-2003].codfw.wmnet [22:14:36] (03PS1) 10Ryan Kemper: wdqs: decom wdqs200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/893825 (https://phabricator.wikimedia.org/T301167) [22:15:52] (03CR) 10Bking: [C: 03+1] wdqs: decom wdqs200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/893825 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [22:19:54] (03PS1) 10Raymond Ndibe: maintain_dbusers: seperate config from code changes [puppet] - 10https://gerrit.wikimedia.org/r/893826 (https://phabricator.wikimedia.org/T304040) [22:20:32] (03CR) 10Raymond Ndibe: "config has been seperated from code changes and can be found here https://gerrit.wikimedia.org/r/c/operations/puppet/+/893826" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [22:24:28] (03CR) 10Andrew Bogott: [C: 03+2] maintain_dbusers: seperate config from code changes [puppet] - 10https://gerrit.wikimedia.org/r/893826 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [22:28:51] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: decom wdqs200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/893825 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [22:31:43] ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T331073 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:31:48] 10SRE, 
10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T331073 (10ops-monitoring-bot) [22:33:25] 10ops-codfw, 10decommission-hardware: decommission wdqs200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T331074 (10RKemper) [22:34:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:56] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [22:45:29] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [22:47:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:48:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:10:31] (03CR) 10Cwhite: [C: 03+2] prometheus: exclude extension errors from client-errors metrics [puppet] - 10https://gerrit.wikimedia.org/r/893437 (https://phabricator.wikimedia.org/T330680) (owner: 10Cwhite) [23:17:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (10phaultfinder) [23:19:42] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:21:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:31:48] 10SRE, 10Traffic, 10Traffic-Icebox, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10Dzahn) Maybe it would make shopify reconsider if you merely mention that you _might_ consider using an alternative, combined with pointing out that it's "top 10 website". 
[23:39:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [23:39:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:39:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs[2001-2003].codfw.wmnet [23:49:13] (03PS1) 10Dzahn: releases: add monitor for releases-jenkins.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/893828 [23:57:00] (03PS1) 10Dzahn: httpbb: fix tests for releases.wikimedia.org, remove parsoid [puppet] - 10https://gerrit.wikimedia.org/r/893829 [23:57:20] (03PS2) 10Dzahn: httpbb: fix tests for releases.wikimedia.org, remove parsoid [puppet] - 10https://gerrit.wikimedia.org/r/893829 (https://phabricator.wikimedia.org/T330960) [23:58:37] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "[deploy1002:~] $ httpbb --hosts releases2002.codfw.wmnet ./test_releases.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/893829 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn)