[00:01:27] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:11] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:48] (PS6) Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - https://gerrit.wikimedia.org/r/713006
[00:48:17] (CR) jerkins-bot: [V: -1] nova-fullstack: send logs to ELK [puppet] - https://gerrit.wikimedia.org/r/713006 (owner: Andrew Bogott)
[00:53:21] SRE, Wikimedia-Logstash, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (Andrew) I'm slightly confused by the state of the filters for openstack/oslo. Right now I see: 15-filter_oslo_json.conf 1...
[00:54:19] (CR) Andrew Bogott: "@jbond, I have two questions about this.
The first I added to T234565; the second is a more general concern that I'm not making a proper " [puppet] - https://gerrit.wikimedia.org/r/713006 (owner: Andrew Bogott)
[01:02:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:57] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:13:55] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 18 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:26:07] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[02:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:49] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:17] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:55] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:27] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:03] PROBLEM - Check systemd state on planet1002 is
CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088 (s1 and s2) to upgrade', diff saved to https://phabricator.wikimedia.org/P17021 and previous config saved to /var/cache/conftool/dbconfig/20210816-044906-marostegui.json
[04:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:50] !log Upgrade db2088 (s1 and s2) to 10.4.21
[04:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3311 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17022 and previous config saved to /var/cache/conftool/dbconfig/20210816-045413-root.json
[04:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3312 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17023 and previous config saved to /var/cache/conftool/dbconfig/20210816-045430-root.json
[04:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:10] (PS1) Marostegui: db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/713090 (https://phabricator.wikimedia.org/T288720)
[04:58:55] (CR) Marostegui: [C: +2] db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/713090 (https://phabricator.wikimedia.org/T288720) (owner: Marostegui)
[05:02:41] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3311 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17024 and previous config saved
to /var/cache/conftool/dbconfig/20210816-050916-root.json
[05:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3312 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17025 and previous config saved to /var/cache/conftool/dbconfig/20210816-050934-root.json
[05:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3311 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17026 and previous config saved to /var/cache/conftool/dbconfig/20210816-052420-root.json
[05:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3312 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17027 and previous config saved to /var/cache/conftool/dbconfig/20210816-052437-root.json
[05:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:21] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:05] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:37:16] (PS1) Marostegui: dbproxy1017,dbroxy1021: Add db1132 to m5 proxies [puppet] - https://gerrit.wikimedia.org/r/713099 (https://phabricator.wikimedia.org/T288197)
[05:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3311 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17028 and previous config saved to /var/cache/conftool/dbconfig/20210816-053924-root.json
[05:39:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3312 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17029 and previous config saved to /var/cache/conftool/dbconfig/20210816-053941-root.json
[05:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:28] (CR) Marostegui: [C: +2] dbproxy1017,dbroxy1021: Add db1132 to m5 proxies [puppet] - https://gerrit.wikimedia.org/r/713099 (https://phabricator.wikimedia.org/T288197) (owner: Marostegui)
[05:52:46] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[05:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3311 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17030 and previous config saved to /var/cache/conftool/dbconfig/20210816-055427-root.json
[05:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2088:3312 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17031 and previous config saved to /var/cache/conftool/dbconfig/20210816-055445-root.json
[05:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:56] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:04] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:56] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:48] !log on votewiki, set voter-privacy option to 1 on all prior elections T288924
[06:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:56] T288924: Private keys visible to anonymous users in SecurePoll dump - https://phabricator.wikimedia.org/T288924
[06:52:45] (Storage over 90%) resolved: Storage over 90% - https://alerts.wikimedia.org
[07:01:44] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:03] (PS1) Filippo Giunchedi: thanos: fix /bucket location [puppet] - https://gerrit.wikimedia.org/r/713215
[07:07:39] (CR) Filippo Giunchedi: [C: +2] thanos: fix /bucket location [puppet] - https://gerrit.wikimedia.org/r/713215 (owner: Filippo Giunchedi)
[07:09:31] (PS1) Sahilgrewalhere: Merge "selenium: Upgrade WebdriverIO to v7" into wmf/stable [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/713216
[07:09:33] (PS1) Sahilgrewalhere: selenium: Update README.md file [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237)
[07:17:41] (Abandoned) Sahilgrewalhere: Merge "selenium: Upgrade WebdriverIO to v7" into
wmf/stable [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/713216 (owner: Sahilgrewalhere)
[07:18:19] (PS2) Sahilgrewalhere: selenium: Update README.md file [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237)
[07:26:48] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:47:12] !log Rename aft_feedback tables on db2115, db2131 - T250715
[07:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:22] T250715: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715
[07:52:44] (PS2) Kormat: mariadb: If semi-sync is enabled, always config master settings. [puppet] - https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500)
[07:54:13] (CR) Kormat: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500) (owner: Kormat)
[07:56:37] (CR) Kormat: "PCC looks sane: https://puppet-compiler.wmflabs.org/compiler1002/882/" [puppet] - https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500) (owner: Kormat)
[08:01:04] (CR) Awight: "Jdlrobson, can you also set $wgPopupsReferencePreviewsBetaFeature to false?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) (owner: Jdlrobson)
[08:01:38] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:44] (CR) Awight: [C: +1] "Please disregard my comment about `$wgPopupsReferencePreviewsBetaFeature`, I see that we did something really confusing and the configurat" [mediawiki-config] - https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) (owner: Jdlrobson)
[08:07:23] (PS3) Jcrespo: dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099 [puppet] - https://gerrit.wikimedia.org/r/712925 (https://phabricator.wikimedia.org/T287230)
[08:18:14] (CR) David Caro: [C: +2] wmsc.puppet_alert: force utf-8 encoding when opening files [puppet] - https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508) (owner: David Caro)
[08:26:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:48] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:28:00] !log repool wdqs eqiad (`confctl --quiet --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=true`) - codfw currently overloaded
[08:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:24] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[08:29:33] (PS2) David Caro:
wmcs.vps.puppet_alert: get the puppet files from config [puppet] - https://gerrit.wikimedia.org/r/712922 (https://phabricator.wikimedia.org/T288805)
[08:29:35] (PS3) David Caro: wmcs.vps.puppet_alert: allow disabling the puppet alerts [puppet] - https://gerrit.wikimedia.org/r/712923
[08:30:20] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[08:31:42] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:33:06] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (ayounsi) I bumped the threshold. All good now.
[08:35:17] (CR) David Caro: [C: +2] wmcs.vps.puppet_alert: get the puppet files from config [puppet] - https://gerrit.wikimedia.org/r/712922 (https://phabricator.wikimedia.org/T288805) (owner: David Caro)
[08:35:23] (CR) David Caro: [C: +2] wmcs.vps.puppet_alert: allow disabling the puppet alerts [puppet] - https://gerrit.wikimedia.org/r/712923 (owner: David Caro)
[08:44:28] beyond trying a test edit is there a means to identify the latest update times of "spam blacklist" and "title blacklist"
[08:47:13] (CR) Jcrespo: [C: +2] dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099 [puppet] - https://gerrit.wikimedia.org/r/712925 (https://phabricator.wikimedia.org/T287230) (owner: Jcrespo)
[08:49:54] !log replacing s2 with s4 on db2097 T287230
[08:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:02] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230
[08:56:02] (CR) Marostegui: [C: +1] mariadb: If
semi-sync is enabled, always config master settings. [puppet] - https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500) (owner: Kormat)
[08:57:15] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (dcaro) Thanks :)
[09:01:54] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:43] (CR) Ema: "A couple of suggestions!" [puppet] - https://gerrit.wikimedia.org/r/711543 (owner: Filippo Giunchedi)
[09:14:03] (CR) Ema: [C: +1] "Minor nits, LGTM though!" [puppet] - https://gerrit.wikimedia.org/r/712100 (owner: Filippo Giunchedi)
[09:16:07] (PS3) Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421)
[09:16:10] !log depooling wdqs codfw to allow catching up on lag
[09:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:42] (CR) Ema: "A couple of suggestions!" [puppet] - https://gerrit.wikimedia.org/r/712099 (owner: Filippo Giunchedi)
[09:23:15] (CR) Ema: [C: +1] pontoon: add service_names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/712098 (owner: Filippo Giunchedi)
[09:25:48] SRE, Infrastructure-Foundations, netops: Link failure between mr1-eqiad and asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (ayounsi) Looks like the whole of FPC8 failed, I opened JTAC case 2021-0816-0128. Good thing you saved the logs as they now fully rolled over.
[09:31:03] SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (fgiunchedi)
[09:35:22] SRE, Infrastructure-Foundations, netops: Link failure between mr1-eqiad and asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (cmooney) Thanks @ayounsi yeah I was looking there that seems to be the case. Logs from a host connected also seem to confirm the entire switch died: `...
[09:41:05] SRE, Infrastructure-Foundations, netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (cmooney)
[09:41:16] (CR) Filippo Giunchedi: pontoon: add service_names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/712098 (owner: Filippo Giunchedi)
[09:41:27] (CR) Filippo Giunchedi: pontoon: add sd module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/712099 (owner: Filippo Giunchedi)
[09:41:31] (PS5) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[09:41:33] (PS5) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[09:41:35] (PS5) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[09:41:37] (CR) Filippo Giunchedi: pontoon: add lb module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/712100 (owner: Filippo Giunchedi)
[09:42:37] SRE, Infrastructure-Foundations, netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (cmooney)
[09:44:37] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:25] (PS1) Ladsgroup: Add tags for wikidata edits [mediawiki-config] - https://gerrit.wikimedia.org/r/713225 (https://phabricator.wikimedia.org/T236893)
[09:47:17] SRE, Anti-Harassment,
Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (phuedx) @Jdlrobson noted that the "Mobile view" link at the bottom of every page has been blanked out by setting [[ https://vote.wikimedia.org/wiki/MediaWiki:Mobi...
[09:50:53] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover, User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) @fgiunchedi anything left to do for netops or is it ok to close the task?
[09:52:34] (PS1) Vgutierrez: varnish: Do not assume that UDS implies PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/713226
[09:52:57] (CR) Filippo Giunchedi: pontoon: add config command (3 comments) [puppet] - https://gerrit.wikimedia.org/r/711543 (owner: Filippo Giunchedi)
[09:53:00] (PS5) Filippo Giunchedi: pontoon: add config command [puppet] - https://gerrit.wikimedia.org/r/711543
[09:54:21] (CR) Vgutierrez: [C: +1] acmechief: acmechief: allow mx2002 [puppet] - https://gerrit.wikimedia.org/r/712277 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff)
[09:57:24] (CR) Vgutierrez: "This is affected by https://github.com/envoyproxy/envoy/issues/16682 and https://github.com/envoyproxy/envoy/pull/16598. According to envo" [puppet] - https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[10:00:00] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (cmooney) Open→Resolved
[10:01:00] SRE, Infrastructure-Foundations, netops: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (ayounsi) Some thoughts: * We need to find the good balance between config complexity and low latency for users, otherwise it's going to be a cat and mouse game, fixing special...
[10:01:29] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:18] (CR) Cathal Mooney: [C: +1] "LGTM :)" [homer/public] - https://gerrit.wikimedia.org/r/712467 (owner: Ayounsi)
[10:06:38] (PS6) Filippo Giunchedi: pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098
[10:06:40] (PS6) Filippo Giunchedi: pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099
[10:06:42] (PS6) Filippo Giunchedi: pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100
[10:07:57] (CR) Ema: [C: +1] pontoon: add service_names [puppet] - https://gerrit.wikimedia.org/r/712098 (owner: Filippo Giunchedi)
[10:08:15] jouncebot: now
[10:08:15] No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[10:08:20] coool
[10:08:26] deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/713225
[10:08:34] (CR) Ladsgroup: [C: +2] Add tags for wikidata edits [mediawiki-config] - https://gerrit.wikimedia.org/r/713225 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:08:47] (CR) Ema: [C: +1] pontoon: add sd module [puppet] - https://gerrit.wikimedia.org/r/712099 (owner: Filippo Giunchedi)
[10:09:04] (CR) Ema: [C: +1] pontoon: add lb module [puppet] - https://gerrit.wikimedia.org/r/712100 (owner: Filippo Giunchedi)
[10:09:11] SRE, Anti-Harassment, Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (phuedx) @dom_walden noted that after he'd jumped from one wiki to votewiki, the cookies returned by `https://vote.wikimedia.org/wiki/Secure:SpecialPoll/login/{$el...
[10:09:14] (Merged) jenkins-bot: Add tags for wikidata edits [mediawiki-config] - https://gerrit.wikimedia.org/r/713225 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:09:31] (CR) Ema: [C: +1] pontoon: add config command (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711543 (owner: Filippo Giunchedi)
[10:11:01] (CR) Lucas Werkmeister (WMDE): Add tags for wikidata edits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/713225 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:15:48] (CR) Ayounsi: [C: +2] Fix typo [homer/public] - https://gerrit.wikimedia.org/r/712467 (owner: Ayounsi)
[10:16:44] (Merged) jenkins-bot: Fix typo [homer/public] - https://gerrit.wikimedia.org/r/712467 (owner: Ayounsi)
[10:19:43] SRE, Infrastructure-Foundations, netops: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (cmooney) Agreed we need to balance complexity and usefulness. A few points: - I think it's too complex to consider doing this for peers at IXPs. - I would only anticipate...
[10:21:33] (PS1) Filippo Giunchedi: swift: ship uwsgi config for account/container server [puppet] - https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937)
[10:26:19] (PS1) Ladsgroup: Don't set termbox v2 tags yet [mediawiki-config] - https://gerrit.wikimedia.org/r/713233 (https://phabricator.wikimedia.org/T236893)
[10:26:39] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:41] (CR) Ladsgroup: [C: +2] "to unblock the deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/713233 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy.
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1030).
[10:30:23] (Merged) jenkins-bot: Don't set termbox v2 tags yet [mediawiki-config] - https://gerrit.wikimedia.org/r/713233 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:30:45] (CR) Ladsgroup: Add tags for wikidata edits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/713225 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[10:31:07] PROBLEM - Host backup1006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:32:24] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:713225|Add tags for wikidata edits (T236893)]] (duration: 00m 58s)
[10:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:34] T236893: Tag all edits made via Wikibase View and Wikibase Client - https://phabricator.wikimedia.org/T236893
[10:32:48] jynus: ^ backup1006 down is the downtime expired thing, right?
[10:33:20] yep
[10:33:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:19] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:58:27] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1100).
[11:00:05] Aca and Lucas_WMDE: A patch you scheduled for European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:17] o/
[11:00:21] I can deploy
[11:02:15] (CR) Effie Mouzeli: [C: +2] Add task manager data port configuration for flink session cluster [deployment-charts] - https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) (owner: ZPapierski)
[11:02:43] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:31] (PS9) Lucas Werkmeister (WMDE): Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[11:03:34] (CR) Lucas Werkmeister (WMDE): [C: +2] Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[11:03:44] (PS4) Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421)
[11:03:46] (PS3) Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421)
[11:03:48] (PS3)
Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421)
[11:03:50] (PS3) Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421)
[11:03:52] (PS3) Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421)
[11:03:54] (PS4) Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421)
[11:03:56] (PS1) Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421)
[11:03:58] (CR) Lucas Werkmeister (WMDE): [C: +1] "hm, one moment" [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[11:04:06] Aca: are you here?
[11:04:15] yeah, I'm here
[11:04:20] ok
[11:04:23] (CR) Lucas Werkmeister (WMDE): [C: +2] Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[11:04:50] I would also have one patch if there is time
[11:04:55] (Merged) jenkins-bot: Add namespace aliases for hr.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/710564 (https://phabricator.wikimedia.org/T287024) (owner: Acamicamacaraca)
[11:04:58] (Merged) jenkins-bot: Add task manager data port configuration for flink session cluster [deployment-charts] - https://gerrit.wikimedia.org/r/711152 (https://phabricator.wikimedia.org/T288531) (owner: ZPapierski)
[11:04:58] sure
[11:05:25] Aca: can you test the namespace aliases change on mwdebug2001?
[11:05:45] yeah
[11:05:51] (PS3) Zabe: Add extendedconfirmed on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322)
[11:05:55] on which wiki?
[11:06:10] hrwiki, I assume…
[11:06:33] yeah, obviously
[11:07:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:29] should I check any of the WikimediaDebug options
[11:08:43] no, only select the right backend (mwdebug2001)
[11:08:48] Or I should just turn it on
[11:08:53] oh, alright
[11:09:01] Done
[11:09:05] and it works?
[11:09:12] lemme see
[11:11:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:28] I can’t actually test it myself, the extension isn’t letting me select a different backend o_O
[11:11:28] Yeah, seems like everything is just fine. Every alias is working
[11:11:31] ok
[11:12:17] (CR) jerkins-bot: [V: -1] envoyproxy: Support alpn_protocols configuration [puppet] - https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[11:12:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710564|Add namespace aliases for hr.wiki (T287024)]] (duration: 00m 59s)
[11:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:01] T287024: Add namespace aliases to hrwiki - https://phabricator.wikimedia.org/T287024
[11:13:40] o_O namespaceDupes.php says “88995 links to fix, 88995 were resolvable, 0 were deleted.”
[11:13:47] and then “Oh noeees”
[11:13:51] wtf is that supposed to mean
[11:13:58] (that’s the dry run, without --fix)
[11:14:07] any namespaceDupes.php experts around?
[11:14:11] * Lucas_WMDE looks at the source [11:14:59] hmm [11:15:12] Lucas_WMDE: does it give any more output? [11:15:14] (03PS1) 10Vgutierrez: envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [11:15:28] majavah: yes, plenty [11:15:36] I suspect the important part is before the 88k lines [11:15:41] so let me run it again and pipe it into a file [11:16:42] there’s a handful of *** dest title exists and --add-prefix not specified [11:16:46] presumably that’s the problem [11:16:58] no idea why it doesn’t include this in the numbers at the end [11:17:43] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:17:49] Aca: do you have a suggestion for which prefix I should use? 
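Editor's sketch of the kind of conflict the dry run reports above. This is not the actual namespaceDupes.php logic, only a rough model: once a prefix becomes a namespace alias, a main-namespace page whose title starts with that prefix can no longer be reached and has to be moved into the real namespace; if the destination title already exists, the move is only possible when a prefix is supplied. The alias mapping, page titles, the `plan_moves` helper, and the assumption that the prefix is prepended inside the destination namespace are all hypothetical and chosen for illustration.

```python
from typing import Optional

# Hypothetical alias -> canonical namespace mapping, for illustration only.
ALIASES = {"WP": "Wikipedija"}


def plan_moves(main_ns_titles, existing_titles, add_prefix: Optional[str] = None):
    """Return (moves, unresolved) for main-namespace titles shadowed by an alias."""
    moves, unresolved = [], []
    for title in main_ns_titles:
        prefix, sep, rest = title.partition(":")
        if not sep or prefix not in ALIASES:
            continue  # title is not shadowed by any alias
        dest = f"{ALIASES[prefix]}:{rest}"
        if dest in existing_titles:
            if add_prefix is None:
                # Comparable to "dest title exists and --add-prefix not specified".
                unresolved.append(title)
                continue
            # Assumption: the prefix is prepended inside the destination namespace.
            dest = f"{ALIASES[prefix]}:{add_prefix}{rest}"
        moves.append((title, dest))
    return moves, unresolved


pages = ["WP:IRC", "WP:Help", "Zagreb"]
existing = {"Wikipedija:IRC"}

dry_run = plan_moves(pages, existing)
with_prefix = plan_moves(pages, existing, add_prefix="T287024/")
print(dry_run)
print(with_prefix)
```

In this model the run without a prefix leaves the conflicting page unresolved, while the run with a prefix moves it to a distinct title that can be reviewed or deleted afterwards.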
[11:18:19] I have no idea lol [11:18:20] for conflicts like [[WP:IRC]] = [[Wikipedija:IRC]] (the former page is now inaccessible) [11:18:28] hold on [11:18:37] I’ll look for other namespaceDupes.php invocations in SAL [11:18:44] see what other people usually do [11:18:46] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:19:11] looks like urbanecm usually prefixes with the task ID or “BROKEN” [11:20:21] (03PS2) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [11:20:23] (03PS2) 10Vgutierrez: envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [11:22:38] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes.php hrwiki --fix --add-prefix=T287024/ | tee T287024.out # T287024 [11:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:46] T287024: Add namespace aliases to hrwiki - https://phabricator.wikimedia.org/T287024 [11:24:03] (still running, there are tons of links to WP:IRC that need to be fixed apparently) [11:24:56] I'd like to hear Martin's opinion about this if he's active [11:25:26] To me, broken sounds good as a prefix, but idk [11:25:35] well, it’s running now… [11:26:10] !log namespaceDupes.php for T287024 finished [11:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:37] Aca: looks like all of those pages were shortcut redirects so you probably just want to delete them [11:29:45] (I added a comment on the task) [11:29:49] 
yeah, I saw [11:29:51] what was the other change you wanted to do? [11:30:25] ah nevermind that was zabe [11:30:57] https://gerrit.wikimedia.org/r/c/712754/ [11:31:05] I also added it to the calendar [11:31:29] I'll contact a hr.wiki admin to delete those. I thought Ivi104 would be here, but he didn't answer me. [11:31:34] ok [11:31:48] zabe: ack, I’m looking at it [11:34:28] zabe: looks good to me, do you know if there are any special considerations when deploying an auto-promoted right? [11:35:22] I don't know any [11:36:45] ok, looks like previous extendedconfirmed deployments in SAL had nothing special either [11:36:55] so I think I’m happy to deploy this [11:37:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add extendedconfirmed on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe) [11:37:46] will you be able to test it on mwdebug2001? [11:38:13] I guess the main test would be to extendedconfirmed-protect a page, for which you might need to be an admin [11:38:18] and I’m not sure if you are ^^ [11:38:30] (03Merged) 10jenkins-bot: Add extendedconfirmed on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe) [11:38:40] I can test the group itself. The autopromotion not really. I'm not a sysop there. [11:38:48] ok [11:39:24] alright, the change is on mwdebug2001 now [11:40:09] looks like it should show up on Special:UserRights [11:40:29] I still can’t test it myself because the x-wikimedia-debug extension won’t behave 😡 [11:40:37] unless I SSH into mwdebug1001 and pull the change there, I guess [11:42:01] Usually the autopromotion kicks in directly after the patch has been synced because there are many users who meet the requirements, so it is possible to see if the autopromotion works afterwards. [11:42:21] Also something isn't right, pls give me a minute to find the mistake in the patch.
[11:42:30] ok [11:42:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:50] (03PS1) 10Zabe: Fix extendedconfirmed for bots on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713255 (https://phabricator.wikimedia.org/T287322) [11:45:04] ahh [11:45:08] I should’ve noticed that [11:45:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix extendedconfirmed for bots on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713255 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe) [11:46:08] (03Merged) 10jenkins-bot: Fix extendedconfirmed for bots on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713255 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe) [11:46:33] zabe: pulled to mwdebug2001 [11:46:58] (03PS3) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [11:47:00] (03PS3) 10Vgutierrez: envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [11:47:09] Lucas_WMDE: it looks now good to me [11:47:12] ok [11:47:31] I would’ve expected to see some “implicit member of” in https://zh.wikipedia.org/w/index.php?uselang=en&title=Special:%E7%94%A8%E6%88%B7%E6%9D%83%E9%99%90/WhitePhosphorus [11:47:33] (e.g.) [11:47:35] but maybe that’s not how it works [11:47:37] I’ll sync it [11:48:52] (03CR) 10Kormat: [C: 03+2] mariadb: If semi-sync is enabled, always config master settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/711489 (https://phabricator.wikimedia.org/T288500) (owner: 10Kormat) [11:49:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:712754|Add extendedconfirmed on zhwiki (T287322)]] + Config: [[gerrit:713255|Fix extendedconfirmed for bots on zhwiki (T287322)]] (duration: 01m 01s) [11:49:38] I would only expect that if someone who fits the requirements does an edit via mwdebug because for these autopromotions the requirements are checked once per edit. [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:41] T287322: Add extendedconfirmed user right and extended confirmed protection on zhwiki - https://phabricator.wikimedia.org/T287322 [11:49:45] ah, got it [11:50:00] extendedconfirmed gets added as a real group on first edit, it's not explicit like autoconfirmed for some reason [11:50:16] jouncebot: next [11:50:16] In 5 hour(s) and 9 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1700) [11:50:40] (03PS1) 10Lucas Werkmeister (WMDE): Support null content in parser tag hook [extensions/Math] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712962 (https://phabricator.wikimedia.org/T288846) [11:50:48] I’ll also backport this ^ it’s super noisy in logstash [11:50:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
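Editor's sketch of the behaviour described above: the requirements are evaluated when a user saves an edit, and a user who passes is then added to the group as a regular stored member. This is only an illustration and not MediaWiki's implementation; the thresholds, the `User` class, and the `on_edit` function are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, not the values configured for zhwiki.
MIN_EDITS = 500
MIN_ACCOUNT_AGE_DAYS = 90


@dataclass
class User:
    name: str
    edit_count: int
    account_age_days: int
    groups: set = field(default_factory=set)  # stored ("real") group memberships


def on_edit(user: User) -> bool:
    """Count one edit and add extendedconfirmed if the user now qualifies."""
    user.edit_count += 1
    if "extendedconfirmed" in user.groups:
        return False  # already a member, nothing to do
    if user.edit_count >= MIN_EDITS and user.account_age_days >= MIN_ACCOUNT_AGE_DAYS:
        user.groups.add("extendedconfirmed")
        return True
    return False


veteran = User("ExampleVeteran", edit_count=1200, account_age_days=400)
newcomer = User("ExampleNewcomer", edit_count=3, account_age_days=2)

groups_before_edit = set(veteran.groups)  # eligible, but no edit has happened yet
veteran_promoted = on_edit(veteran)       # the first edit triggers the check
newcomer_promoted = on_edit(newcomer)     # does not meet the thresholds
print(groups_before_edit, veteran_promoted, newcomer_promoted)
```

Under this model an eligible user shows no membership until their next edit, which matches the expectation in the log that nothing appears on the user rights page before someone who qualifies actually edits.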
[11:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Support null content in parser tag hook [extensions/Math] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712962 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [11:51:17] first autopromotion happened: https://zh.wikipedia.org/w/index.php?title=Special:%E6%97%A5%E5%BF%97&logid=10712023 so it looks like it works [11:51:45] \o/ [11:54:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:58] there was a brief spike of “could not enqueue jobs” [11:59:02] but it’s already over again [11:59:22] so, uh, I hope that’s fine [11:59:59] Lucas_WMDE: forgot to say, thanks for deploying :) [12:00:04] no problem :) [12:00:11] (03CR) 10Tobias Andersson: [C: 03+1] "LGTM, not sure why it's marked as a failure here but success in the logs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [12:00:55] (03CR) 10Tobias Andersson: [C: 03+1] "LGTM seems to be the last mentions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711513 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [12:01:15] pages deleted, should be done now [12:01:20] ok! 
[12:01:25] tnx for deploying [12:01:39] (03CR) 10Tobias Andersson: [C: 03+1] Stop setting 'useTermsTableSearchFields' Wikibase option (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [12:01:54] (03CR) 10Tobias Andersson: [C: 03+1] Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711514 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [12:02:04] (03CR) 10Tobias Andersson: [C: 03+1] Remove $wmgWikibaseFineGrainedLuaTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711515 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [12:02:51] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:54] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/713244 (owner: 10L10n-bot) [12:08:28] (I’m still waiting for CI so I can backport that Math fix) [12:08:55] (03PS1) 10Tim Starling: Allow vote pages to be linked by title [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712963 [12:09:48] (03CR) 10Tim Starling: [C: 03+2] Allow vote pages to be linked by title [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712963 (owner: 10Tim Starling) [12:12:17] (03Merged) 10jenkins-bot: Support null content in parser tag hook [extensions/Math] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712962 (https://phabricator.wikimedia.org/T288846) (owner: 10Lucas Werkmeister (WMDE)) [12:12:54] ok! 
[12:13:28] testing on mwdebug1001 [12:13:42] seems to work, syncing [12:13:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add config command [puppet] - 10https://gerrit.wikimedia.org/r/711543 (owner: 10Filippo Giunchedi) [12:14:35] (03Merged) 10jenkins-bot: Allow vote pages to be linked by title [extensions/SecurePoll] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712963 (owner: 10Tim Starling) [12:14:37] !log clean up old /root/.my.cnf files T150446 [12:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:38] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2099.codfw.wmnet with reason: REIMAGE [12:15:44] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/Math/src/HookHandlers/ParserHooksHandler.php: Backport: [[gerrit:712962|Support null content in parser tag hook (T288846)]] (hopefully also fixes T288790) (duration: 00m 59s) [12:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:54] T288846: TypeError: Argument 1 passed to HookHandlers\ParserHooksHandler::mathTagHook() must be of the type string, null given - https://phabricator.wikimedia.org/T288846 [12:15:54] T288790: TypeError: Argument 1 passed to MediaWiki\Extension\Math\HookHandlers\ParserHooksHandler::mathTagHook() must be of the type string, null given, called in /srv/mediawiki/php-1.37.0-wmf.18/includes/parser/Parser.php on line 3966 - https://phabricator.wikimedia.org/T288790 [12:17:36] (03PS1) 10Kormat: mariadb: Remove commented out File for /root/.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/713257 (https://phabricator.wikimedia.org/T150446) [12:17:56] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2099.codfw.wmnet with reason: REIMAGE [12:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:18:06] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SecurePoll/includes/Pages/VotePage.php: allow linking by title (duration: 00m 58s) [12:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:36] !log EU backport+config window done (slightly belatedly) [12:22:41] I’ll deploy my config changes later [12:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:10] (03CR) 10Jcrespo: [C: 03+1] mariadb: Remove commented out File for /root/.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/713257 (https://phabricator.wikimedia.org/T150446) (owner: 10Kormat) [12:25:25] (03CR) 10Kormat: [C: 03+2] mariadb: Remove commented out File for /root/.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/713257 (https://phabricator.wikimedia.org/T150446) (owner: 10Kormat) [12:32:14] (03Abandoned) 10Kormat: mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [12:33:22] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:43:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:56] 
(03Abandoned) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [12:48:59] (03PS2) 10Jcrespo: mariadb: Remove obsolete mariadb.server init script [puppet] - 10https://gerrit.wikimedia.org/r/658953 (https://phabricator.wikimedia.org/T272559) [12:50:14] (03CR) 10Jcrespo: [C: 04-1] "I am not so sure about this, and will require more changes anyway- monitoring, etc." [puppet] - 10https://gerrit.wikimedia.org/r/455769 (owner: 10Jcrespo) [12:59:28] 10SRE, 10Infrastructure-Foundations, 10netops: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 (10ssingh) (Thanks Cathal for filing this task!) >>! In T288843#7284411, @ayounsi wrote: > Some thoughts: > [...] > @ssingh what's the timeline for Wikidough? So we know how to p... [13:01:37] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:44] jouncebot: now [13:23:44] No deployments scheduled for the next 3 hour(s) and 36 minute(s) [13:24:03] I’ll deploy the config cleanups from T288612 if that’s alright with everyone (should all be no-ops) [13:24:04] T288612: Remove outdated Wikibase settings from production config - https://phabricator.wikimedia.org/T288612 [13:25:32] (03CR) 10Lucas Werkmeister (WMDE): Stop setting 'useTermsTableSearchFields' Wikibase option (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:26:23] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting 'useTermsTableSearchFields' Wikibase option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) [13:26:25] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseClientUseTermsTableSearchFields [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/711513 (https://phabricator.wikimedia.org/T288612) [13:26:27] (03PS2) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711514 (https://phabricator.wikimedia.org/T288612) [13:26:29] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseFineGrainedLuaTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711515 (https://phabricator.wikimedia.org/T288612) [13:26:33] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting 'useTermsTableSearchFields' Wikibase option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:27:57] (03Merged) 10jenkins-bot: Stop setting 'useTermsTableSearchFields' Wikibase option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711512 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:30:26] (03CR) 10Dzahn: "not sure about disabling monitoring and logging. do we do that for (a lot of) other jobs? is the default not ok in this case?" [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:31:17] 🤦 I was wondering why wikidatawiki was read-only on mwdebug1001 and then I realized why [13:31:22] (it’s in eqiad) [13:32:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:26] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:711512|Stop setting 'useTermsTableSearchFields' Wikibase option (T288612)]] (duration: 00m 59s) [13:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:34] T288612: Remove outdated Wikibase settings from production config - https://phabricator.wikimedia.org/T288612 [13:33:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseClientUseTermsTableSearchFields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711513 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:34:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:28] (03Merged) 10jenkins-bot: Remove $wmgWikibaseClientUseTermsTableSearchFields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711513 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:36:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:711513|Remove $wmgWikibaseClientUseTermsTableSearchFields (T288612)]] (prod, 1/2) (duration: 00m 59s) [13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:711513|Remove $wmgWikibaseClientUseTermsTableSearchFields (T288612)]] (beta, 2/2) (duration: 00m 59s) [13:37:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711514 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:37:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:09] (03Merged) 10jenkins-bot: Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711514 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:40:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:711514|Stop setting $wgWBClientSettings['fineGrainedLuaTracking'] (T288612)]] (duration: 00m 58s) [13:40:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove $wmgWikibaseFineGrainedLuaTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711515 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:21] T288612: Remove outdated Wikibase settings from production config - https://phabricator.wikimedia.org/T288612 [13:40:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:40:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [13:40:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [13:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] (03PS1) 10Kormat: db-switchover: Only configure semisync once [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713263 (https://phabricator.wikimedia.org/T288500) [13:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:02] (03Merged) 10jenkins-bot: Remove $wmgWikibaseFineGrainedLuaTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711515 (https://phabricator.wikimedia.org/T288612) (owner: 10Lucas Werkmeister (WMDE)) [13:42:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:711515|Remove $wmgWikibaseFineGrainedLuaTracking (T288612)]] (duration: 00m 58s) [13:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:42] (03PS2) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db2100) to buster (db2098) [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) [13:48:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:17] (03PS2) 10Jcrespo: dbbackups: Reenable notifications on db2097, db2099 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/712926 (https://phabricator.wikimedia.org/T280979) [13:50:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:27] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications on db2097, db2099 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/712926 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [13:52:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1455.eqiad.wmnet'] ` The log can be found in `/var/log/w... [13:53:30] !log mw1455 - mysteriously showing a bunch of issues in icinga, broken packages, envoy, memcached etc, after recent fresh install, trying another reimage (T273915) [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:37] T273915: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 [13:58:44] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Andrew) @Cmjohnson, we don't need to do a lot regarding downtime here, but I would like to be present along with @dcaro when we shut this down. Is it possible to schedule this... 
[14:02:07] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1455.eqiad.wmnet with reason: REIMAGE [14:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] PROBLEM - Long running screen/tmux on maps1009 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 11038, 1729999s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:11:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1455.eqiad.wmnet with reason: REIMAGE [14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] (03PS1) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 [14:14:53] (03PS2) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 [14:15:04] (03PS1) 10Majavah: Add node12-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/713267 (https://phabricator.wikimedia.org/T284590) [14:18:53] (03CR) 10Majavah: [C: 03+2] Add node12-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/713267 (https://phabricator.wikimedia.org/T284590) (owner: 10Majavah) [14:19:25] (03CR) 10Jgiannelos: "`test-eqiad` broker list copied from:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (owner: 10Jgiannelos) [14:19:27] (03Merged) 10jenkins-bot: Add node12-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/713267 (https://phabricator.wikimedia.org/T284590) (owner: 10Majavah) [14:32:11] (03CR) 10Herron: [C: 03+2] acmechief: acmechief: allow mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/712277 
(https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [14:34:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1455.eqiad.wmnet'] ` and were **ALL** successful. [14:35:13] RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:38:05] RECOVERY - Check systemd state on mw1455 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:17] (03CR) 10Dzahn: [C: 03+2] microsites: Add Query Builder subpage to wdqs gui [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [14:40:21] RECOVERY - Check that envoy is running on mw1455 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:41:59] !log mw1455 - works fine after a reimage, unknown why it didnt last time, but ok :) [14:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:16] !log miscweb - deploying new microsite for Wikidata Query Builder subpage (T266703) [14:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] T266703: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 [14:43:28] Amir1: https://query-preview.wikidata.org/querybuilder ? [14:43:53] 404s for me? [14:44:10] yea, but is that the right one I am checking? [14:44:38] https://query.wikidata.org/querybuilder/ [14:44:43] drops the "preview" part [14:46:21] Amir1: the webserver config template does not have the "preview" servername, does it? 
[14:46:26] oh, it does [14:46:53] but the file that was edited in the gerrit change wasn't that one [14:46:55] oh something showed up [14:47:11] httpd::site { 'query-preview.wikidata.org': [14:47:12] content => template('profile/wdqs/httpd-query-preview.wikidata.org.erb'), [14:47:16] we have this, ok [14:47:25] yeah, the puppet part works fine [14:47:33] now I need to clean the env mess :D [14:47:40] but the change edits only modules/profile/templates/wdqs/httpd-query.wikidata.org.erb [14:47:48] there is no "preview" there [14:47:50] see that? [14:48:08] puppet did clone, but that is about the webserver config there [14:48:16] need to edit the other .erb afaict [14:51:08] (03PS3) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 [14:51:08] [miscweb1002:/etc/apache2/sites-enabled] $ diff 50-query-wikidata-org.conf 50-query-preview-wikidata-org.conf [14:51:18] (03CR) 10Kormat: [C: 03+2] "I've tested this in our pontoon environment, and it appears to do what we want." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713263 (https://phabricator.wikimedia.org/T288500) (owner: 10Kormat) [14:52:17] (03PS4) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) [14:53:50] (03Merged) 10jenkins-bot: db-switchover: Only configure semisync once [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713263 (https://phabricator.wikimedia.org/T288500) (owner: 10Kormat) [14:55:15] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/712389 (owner: 10Jgiannelos) [14:56:48] (03PS1) 10Dzahn: wdqs: sync query.wikidata.org preview config with prod config [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) [14:57:41] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/712389 (owner: 10Jgiannelos) [14:59:30] (03PS1) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [14:59:32] (03PS1) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [15:04:33] (03PS1) 10Kormat: Prepare for 0.7.2 release. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713276 [15:08:22] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:46] svcops is in our meeting, looking [15:08:50] will pull in everyone else if needed [15:09:15] ack, thank you rzl [15:09:18] I'm around if needed [15:09:43] I'm assuming ok to ack, will do so [15:09:48] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:09:54] yeah that's right [15:11:48] p50 and p90 latencies look normal, p99 is a mess [15:11:55] from citoid that is [15:13:37] (03CR) 10Kormat: [C: 03+2] Prepare for 0.7.2 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713276 (owner: 10Kormat) [15:15:47] (03CR) 10Ssingh: [C: 03+1] "matches https://github.com/envoyproxy/envoy/blob/main/examples/lua/envoy.yaml#L30 and iterating over each line will give us the required i" [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:16:49] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 63738 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [15:16:53] (03Merged) 10jenkins-bot: Prepare for 0.7.2 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/713276 (owner: 10Kormat) [15:20:31] zotero looks self-healed, just a network blip [15:20:43] neat that we didn't have to do anything, would be nice if we also didn't get paged :) but I'll take it! 
[15:22:11] (03PS5) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) [15:26:40] (03PS6) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) [15:37:11] !log [WDQS] Re-pooled `codfw`: `ryankemper@puppetmaster1001:~$ sudo -i confctl --quiet --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true` [15:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:23] (03CR) 10Dzahn: [C: 03+2] "approved by Jaime via email. going ahead and merging this" [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527) (owner: 10Ema) [15:49:03] 10SRE, 10SRE Observability (FY2021/2022-Q1): Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10herron) 05Open→03Resolved a:03herron [15:55:23] 10SRE-swift-storage, 10Wikimedia-Site-requests: Cannot delete "File:The Chorisettes tmp.jpg" on Commons: Error deleting file: An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T288968 (10Aklapper) [15:59:37] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [16:01:50] !log LDAP - added user tandic to nda group (T288527) [16:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:58] T288527: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 [16:03:05] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10Dzahn) approved by @JAnstee_WMF via mail @TAndic You have been added 
to the nda group as requested. Feel free to try it out. [16:03:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10Dzahn) 05Open→03Resolved a:03Dzahn [16:04:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10Dzahn) If there is an issue feel free to reopen the ticket. [16:06:40] (03CR) 10Dzahn: "https://query.wikidata.org/querybuilder/ works now" [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) (owner: 10Dzahn) [16:06:46] (03CR) 10Dzahn: "https://query.wikidata.org/querybuilder/ works now" [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [16:07:12] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Dzahn) https://query.wikidata.org/querybuilder/ works now [16:08:10] (03CR) 10Dzahn: "should be identical to the existing config without "preview". just that we should have done it the other way around and first preview, the" [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) (owner: 10Dzahn) [16:18:02] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10Dzahn) @KFrancis Hi, could you contact @dang for the WMDE NDA process? The email address is here in the task above. Thank you @dang We are going through the process described at https... 
[16:18:33] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10Dzahn) [16:26:39] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2711 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [16:28:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={rails,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:28:50] (03PS1) 10Cmjohnson: Fixing dhcpd entry for mw1453 [puppet] - 10https://gerrit.wikimedia.org/r/713293 (https://phabricator.wikimedia.org/T273915) [16:29:30] (03CR) 10Cmjohnson: [C: 03+2] Fixing dhcpd entry for mw1453 [puppet] - 10https://gerrit.wikimedia.org/r/713293 (https://phabricator.wikimedia.org/T273915) (owner: 10Cmjohnson) [16:29:32] (03CR) 10Dzahn: [C: 03+1] "oooh, that was my mistake then. some servers had 81 there and some 82 :) thanks for finding it" [puppet] - 10https://gerrit.wikimedia.org/r/713293 (https://phabricator.wikimedia.org/T273915) (owner: 10Cmjohnson) [16:31:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar, 10Patch-For-Review: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) mw1455 had issues but is now fine after simply repeating the cookbook one more time. (don't know why) mw1453 had the wrong MAC address... [16:34:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:36:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:09] !log restart logstash on logstash1008 [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:52] (03PS2) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [16:37:54] (03PS2) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [16:38:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar, 10Patch-For-Review: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1453.eqiad.wmnet ` The log can b... 
[16:40:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={rails,sidekiq} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:41] (03CR) 10Ssingh: envoyproxy: Support ciphersuite configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:40:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:03] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [16:43:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar, 10Patch-For-Review: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1453.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1453.eqiad.wmnet'] ` [16:43:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1003.eqiad.wmnet - https://phabricator.wikimedia.org/T288736 (10Cmjohnson) [16:44:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1003.eqiad.wmnet - https://phabricator.wikimedia.org/T288736 (10Cmjohnson) 05Open→03Resolved [16:44:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:10] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10ayounsi) p:05Triage→03High a:03Cmjohnson JTAC pointed out that the switch failure matches with a VCP (to the backup spine) going down: ` ayounsi@asw2... [16:45:42] (03CR) 10Ladsgroup: [C: 03+1] "I'm a bit confused why we even have preview but the patch looks sensible." [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) (owner: 10Dzahn) [16:51:18] PROBLEM - Host backup1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:23] (03CR) 10Dzahn: [C: 03+2] wdqs: sync query.wikidata.org preview config with prod config [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) (owner: 10Dzahn) [16:53:48] PROBLEM - Host wdqs2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:34] ^ even though the previous change talks about WDQS, this is definitely unrelated. that site lives on miscweb* [16:55:06] RECOVERY - Host wdqs2004 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [16:55:44] figured someone might just be rebooting it.. 
afk [16:55:58] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:56:10] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:56:53] (03CR) 10Dzahn: "in an ideal world we'd also add this to the httpbb tests for miscweb now" [puppet] - 10https://gerrit.wikimedia.org/r/713270 (https://phabricator.wikimedia.org/T266703) (owner: 10Dzahn) [16:57:12] RECOVERY - Query Service HTTP Port on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [16:57:59] (03CR) 10Ssingh: [C: 03+1] "(Ignore my previous comment, both configurations are fine and since you are already using it for alpn_protocols, we should be good.)" [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:58:10] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:00:04] ryankemper: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1700). 
[17:01:26] PROBLEM - Host wdqs2004 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:43] 10SRE, 10Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10razzi) [17:03:58] 10SRE, 10Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10razzi) a:03razzi [17:04:11] (03CR) 10Ssingh: envoyproxy: Support ECDH curves configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:05:00] RECOVERY - Host wdqs2004 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [17:05:21] jouncebot: now [17:05:21] For the next 0 hour(s) and 24 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1700) [17:05:29] jouncebot: next [17:05:29] In 0 hour(s) and 54 minute(s): Morning backport window Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1800) [17:05:43] !log asw2-a-eqiad> request virtual-chassis vc-port delete pic-slot 1 member 8 port 1 - T288834 [17:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:51] T288834: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 [17:06:00] (03PS1) 10Ladsgroup: Try to use EditStash before re-rendering [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712965 (https://phabricator.wikimedia.org/T288639) [17:07:30] (03CR) 10Ladsgroup: [C: 03+2] Try to use EditStash before re-rendering [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712965 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [17:08:48] !log asw2-a-eqiad> request virtual-chassis vc-port set pic-slot 1 member 8 port 1 - T288834 [17:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:13:14] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10ayounsi) p:05High→03Low a:05Cmjohnson→03ayounsi Keeping it a bit for monitoring, will close if no more interfaces errors. [17:15:07] (03CR) 10Effie Mouzeli: [C: 03+1] "It is a +1 for me too, I would like to have jbond take a look too before merging" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [17:17:02] !log installing new line card in slot1 cr1-eqiad T277339 [17:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:11] T277339: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 [17:24:14] (03Merged) 10jenkins-bot: Try to use EditStash before re-rendering [extensions/SpamBlacklist] (wmf/1.37.0-wmf.18) - 10https://gerrit.wikimedia.org/r/712965 (https://phabricator.wikimedia.org/T288639) (owner: 10Ladsgroup) [17:25:00] !log cr1-eqiad> request chassis fpc offline slot 5 - T277339 [17:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:09] T277339: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 [17:25:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:28:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 217, down: 48, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:30:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10ayounsi) `lang=diff [edit chassis] + fpc 1 { + pic 0 { + pic-mode 100G; + } + pic 1 { + pic-mode 100G; + } + pic 2 { +... [17:33:17] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.18/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:712965|Try to use EditStash before re-rendering (T288639)]] (duration: 00m 59s) [17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:25] T288639: SpamBlacklistHooks::onEditFilterMergedContent causes every edit to be rendered twice - https://phabricator.wikimedia.org/T288639 [17:33:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 217, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:08] !log installing new line card in slot1 cr2-eqiad T277339 [17:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:17] T277339: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 [17:35:28] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:35:40] 10SRE, 
10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1453.eqiad.wmnet ` The log can be found in `/var/log/w... [17:37:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (10Andrew) @aborrero is on holiday for a bit -- I'd like to hear from him as well but here's my current thought: With our... [17:47:20] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:49:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install new linecards into routers - https://phabricator.wikimedia.org/T277339 (10ayounsi) 05Open→03Resolved All done, thanks! 
[17:49:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1453.eqiad.wmnet with reason: REIMAGE [17:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 236, down: 16, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:52:39] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1453.eqiad.wmnet with reason: REIMAGE [17:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 233, down: 16, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:08] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) This is planned for tomorrow 17 August [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window Your patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:42] indeed, nothing to do [18:02:29] 10SRE, 10Anti-Harassment, 10Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (10Niharika) p:05Triage→03High @BBlack @mark Hello. Sorry to put undue pressure here but the Board election is due to start in two days and we want to test this... 
[18:04:34] PROBLEM - Memcached on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [18:08:15] 10SRE, 10Anti-Harassment, 10Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (10jrbs) >>! In T288938#7284392, @phuedx wrote: > @Jdlrobson noted that the "Mobile view" link at the bottom of every page has been blanked out by setting [[ https:/... [18:11:16] PROBLEM - PHP7 rendering on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:13:16] PROBLEM - Host mw1453 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:06] RECOVERY - Host mw1453 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:14:12] RECOVERY - PHP7 rendering on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:14:32] RECOVERY - Memcached on mw1453 is OK: TCP OK - 0.000 second response time on 10.64.0.60 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [18:14:46] PROBLEM - Check systemd state on mw1453 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:58] PROBLEM - mediawiki-installation DSH group on mw1455 is CRITICAL: Host mw1455 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:15:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:15:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1453.eqiad.wmnet'] ` and were **ALL** successful. 
[18:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:03] (03PS1) 10Majavah: kubernetes: Set php7.4 as the default backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713303 [18:16:28] RECOVERY - Check systemd state on mw1453 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:15] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [18:47:03] PROBLEM - mediawiki-installation DSH group on mw1453 is CRITICAL: Host mw1453 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:48:12] !log Restarted Jenkins due to stuck jobs. 
[18:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:34] (03PS1) 10Herron: retire kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) [19:04:44] (03PS2) 10Herron: retire role::kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) [19:06:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:36] (03PS3) 10Herron: retire role::kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) [19:08:40] (03PS4) 10Herron: retire role::kafka::monitoring and kafkamon[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) [19:09:28] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [19:09:58] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10KFrancis) @Dzahn Hello! Dat currently has an NDA on file for this access: any data from the WMDE LDAP Group/any data from the NDA LDAP Group. Would this also cover access for the re... [19:11:43] (03PS1) 10RLazarus: httpbb: Update tests to reflect rename from otrs-wiki to vrt-wiki. 
[puppet] - 10https://gerrit.wikimedia.org/r/713309 [19:12:55] (03CR) 10RLazarus: "Tested:" [puppet] - 10https://gerrit.wikimedia.org/r/713309 (owner: 10RLazarus) [19:13:11] (03CR) 10jerkins-bot: [V: 04-1] httpbb: Update tests to reflect rename from otrs-wiki to vrt-wiki. [puppet] - 10https://gerrit.wikimedia.org/r/713309 (owner: 10RLazarus) [19:13:46] (03PS2) 10RLazarus: httpbb: Update tests to reflect rename from otrs-wiki to vrt-wiki. [puppet] - 10https://gerrit.wikimedia.org/r/713309 (https://phabricator.wikimedia.org/T280400) [19:16:35] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM. I don't know what's up with the commit message validator." [puppet] - 10https://gerrit.wikimedia.org/r/713309 (https://phabricator.wikimedia.org/T280400) (owner: 10RLazarus) [19:17:43] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2056-production-search-omega-codfw on elastic2056 is OK: (C)100 gt (W)80 gt 44.75 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2056&panelId=37 [19:19:12] (03CR) 10Ahmon Dancy: [C: 03+1] httpbb: Update tests to reflect rename from otrs-wiki to vrt-wiki. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713309 (https://phabricator.wikimedia.org/T280400) (owner: 10RLazarus) [19:19:44] (03CR) 10Urbanecm: [C: 03+1] "good catch." [puppet] - 10https://gerrit.wikimedia.org/r/713309 (https://phabricator.wikimedia.org/T280400) (owner: 10RLazarus) [19:21:13] (03CR) 10RLazarus: [C: 03+2] "Thanks for the quick reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/713309 (https://phabricator.wikimedia.org/T280400) (owner: 10RLazarus) [19:34:31] (03CR) 10Andrew Bogott: [C: 03+1] "I scheduled a calendar invite to do this tomorrow when everyone is around." 
[puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [19:35:24] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) [19:36:01] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) [19:41:04] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) I've updated the checklist as T288355#7286220 confirms NDA on file, as well as the checkboxes for wikitech userinfo and ssh key (provided by @dang at time of request filing.) Th... [19:41:48] (03PS1) 10Cwhite: openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) [19:42:26] (03CR) 10jerkins-bot: [V: 04-1] openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:44:29] (03PS2) 10Cwhite: openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) [19:45:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:46:35] (03PS3) 10Cwhite: openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) [19:47:22] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) Ok, for the history of this group, I think we need the following approvals: [] - access request (or expansion) 
has sign off of WMF sponsor/manager (sponsor for volunteers, manag... [19:50:08] (03CR) 10Andrew Bogott: [C: 03+1] "Let's try it!" [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:51:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) [19:52:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) 05Open→03Resolved @dzahn mw1453 is installed and ready for you, the mac address was off in the dhcpd file. [19:55:22] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) >>! In T234565#7283933, @Andrew wrote: > I'm slightly confused by the state of the filters for openstack/oslo. `... [19:56:35] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10thcipriani) >>! In T288355#7286300, @RobH wrote: > Ok, for the history of this group, I think we need the following approvals: > > [] - access request (or expansion) has sign off of W... [19:57:39] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) Thank you for the update! With the above note, and the initial creation of this group having @toan as point of contact at WMDE to admin it, this group just needs his sign off to... 
[19:58:47] (03PS1) 10Ssingh: varnish: enable automatic mobile redirect of vote.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/713315 (https://phabricator.wikimedia.org/T288938) [19:58:49] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) a:03toan [20:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T2000). [20:03:17] (03CR) 10Cwhite: [C: 03+2] openstack: adapt nova_fullstack_test to emit ECS-compatibile logs [puppet] - 10https://gerrit.wikimedia.org/r/713314 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:04:14] 10SRE, 10LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (10RobH) As part of clinic duty rotation, I've dropped @NRodriguez a ping via google hangouts, just in case any notices for this task were either lost in phab notifications, or notif... [20:04:43] (03CR) 10BBlack: [C: 03+1] "Looks correct to me!" [puppet] - 10https://gerrit.wikimedia.org/r/713315 (https://phabricator.wikimedia.org/T288938) (owner: 10Ssingh) [20:05:58] (03CR) 10Ssingh: [C: 03+2] varnish: enable automatic mobile redirect of vote.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/713315 (https://phabricator.wikimedia.org/T288938) (owner: 10Ssingh) [20:12:16] (03PS1) 10Cwhite: openstack: cast record.msg to string in the formatter [puppet] - 10https://gerrit.wikimedia.org/r/713318 (https://phabricator.wikimedia.org/T234565) [20:13:10] (03CR) 10Andrew Bogott: [C: 03+2] openstack: cast record.msg to string in the formatter [puppet] - 10https://gerrit.wikimedia.org/r/713318 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:19:22] (03CR) 10Bstorm: [C: 03+2] "Thanks for this. 
It will seem less arcane a config if it's just what is needed." [puppet] - https://gerrit.wikimedia.org/r/713042 (owner: Majavah) [20:22:16] SRE, SRE-swift-storage: Cannot delete "File:The Chorisettes tmp.jpg" on Commons: Error deleting file: An unknown error occurred in storage backend "local-multiwrite" - https://phabricator.wikimedia.org/T288968 (Peachey88) [20:25:23] (CR) Andrew Bogott: [C: +1] wikireplicas: add labswiki manually to s6 and refactor a bit [puppet] - https://gerrit.wikimedia.org/r/711240 (https://phabricator.wikimedia.org/T287442) (owner: Bstorm) [20:26:52] (CR) Bstorm: [C: +2] wikireplicas: add labswiki manually to s6 and refactor a bit [puppet] - https://gerrit.wikimedia.org/r/711240 (https://phabricator.wikimedia.org/T287442) (owner: Bstorm) [20:35:44] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [20:35:46] !log bstorm@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [20:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:33] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [20:36:35] !log bstorm@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [20:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:26] (CR) Michael DiPietro: [C: +1] wikireplicas: add labswiki manually to s6 and refactor a bit [puppet] - https://gerrit.wikimedia.org/r/711240 (https://phabricator.wikimedia.org/T287442) (owner: Bstorm) [20:37:54] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [20:37:57] !log bstorm@cumin1001 Added views for new wiki: labswiki T287442 [20:37:57] !log bstorm@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [20:38:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:08] T287442: Create views for labswiki (wikitech) - https://phabricator.wikimedia.org/T287442 [20:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:25] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:11] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:51:23] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T2100). 
[21:00:12] (PS1) Cwhite: logstash: forward nova-fullstack logs to logstash [puppet] - https://gerrit.wikimedia.org/r/713323 (https://phabricator.wikimedia.org/T234565) [21:00:38] (PS1) Bstorm: wikireplicas: add CNAME for labswiki/wikitech [puppet] - https://gerrit.wikimedia.org/r/713324 (https://phabricator.wikimedia.org/T287442) [21:04:32] (CR) Bstorm: [C: +2] wikireplicas: add CNAME for labswiki/wikitech [puppet] - https://gerrit.wikimedia.org/r/713324 (https://phabricator.wikimedia.org/T287442) (owner: Bstorm) [21:10:11] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (NRodriguez) Hello everyone, I would still very much appreciate superset access! I will complete the two tasks this week: > Review and sign the L3 document > Coordinate obtaining... [21:10:40] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/713307 (https://phabricator.wikimedia.org/T252773) (owner: Herron) [21:11:54] (PS3) Cwhite: hiera: add observability role_contacts [puppet] - https://gerrit.wikimedia.org/r/710617 [21:12:19] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (NRodriguez) I have signed the L3 document and pinged danny! [21:14:25] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (DannyH) I approve Natalia's access. Thank you! 
[21:26:07] (CR) Cwhite: [C: +2] hiera: add observability role_contacts [puppet] - https://gerrit.wikimedia.org/r/710617 (owner: Cwhite) [21:28:09] !log dns4002: upgrade gdnsd package to 3.8.0-1~wmf1 [21:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:21] !log authdns1001: upgrade gdnsd package to 3.8.0-1~wmf1 [21:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:10] (CR) BryanDavis: [C: +1] "I am in favor of bumping the default container to the latest version, but it also feels like something that should have a warning period f" [software/tools-webservice] - https://gerrit.wikimedia.org/r/713303 (owner: Majavah) [22:13:59] !log dns[1235]002: upgrade gdnsd package to 3.8.0-1~wmf1 [22:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:15] (PS1) RobH: WMF employee Natalia Rodriguez ldap addition [puppet] - https://gerrit.wikimedia.org/r/713336 (https://phabricator.wikimedia.org/T285436) [22:16:58] SRE, LDAP-Access-Requests, Patch-For-Review: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (RobH) [22:17:06] (CR) RobH: [C: +2] WMF employee Natalia Rodriguez ldap addition [puppet] - https://gerrit.wikimedia.org/r/713336 (https://phabricator.wikimedia.org/T285436) (owner: RobH) [22:21:39] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (RobH) a:NRodriguez→None [22:22:00] SRE, LDAP-Access-Requests: Access request to superset for user natalia-rodriguez - https://phabricator.wikimedia.org/T285436 (RobH) Open→Resolved a:RobH @NRodriguez, Your access is now live for ldap wmf group user 'natalia-rodriguez' [22:52:34] (PS1) Zabe: Enable NewUserMessage on hiwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/713339 (https://phabricator.wikimedia.org/T287091) 
[22:53:35] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:54:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:55:01] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210816T2300). [23:00:04] zabe: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] o/ [23:00:11] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 255.32 ms [23:02:31] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 67, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:02:51] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 327, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:09:41] i can deploy today! [23:09:46] zabe: hi, are you still around? 
[23:11:26] urbanecm: yes [23:11:29] great [23:11:37] looks https://hi.wiktionary.org/w/index.php?title=%E0%A4%AE%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF:Newusermessage-template&action=history is configured [23:11:40] so let's do it [23:11:46] (CR) Urbanecm: [C: +2] Enable NewUserMessage on hiwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/713339 (https://phabricator.wikimedia.org/T287091) (owner: Zabe) [23:12:48] (Merged) jenkins-bot: Enable NewUserMessage on hiwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/713339 (https://phabricator.wikimedia.org/T287091) (owner: Zabe) [23:13:30] zabe: pulled to mwdebug2001, can you have a look? [23:14:54] urbanecm: seems to be working: https://hi.wiktionary.org/wiki/%E0%A4%B8%E0%A4%A6%E0%A4%B8%E0%A5%8D%E0%A4%AF_%E0%A4%B5%E0%A4%BE%E0%A4%B0%E0%A5%8D%E0%A4%A4%E0%A4%BE:Zabe_(test_4) [23:14:58] (y) [23:15:04] i'm not sure if https://hi.wiktionary.org/wiki/%E0%A4%AE%E0%A5%80%E0%A4%A1%E0%A4%BF%E0%A4%AF%E0%A4%BE%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF:Newusermessage-signatures is valid configuration [23:15:13] but as long as it doesn't error out/logspam... 
[23:15:27] ...i guess that's fine to sync out [23:16:42] nothing suspicious in logstash, syncing [23:18:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a14868bbdf442eede5711576c4b4da51df0ccd77: Enable NewUserMessage on hiwiktionary (T287091) (duration: 01m 00s) [23:18:23] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2059-production-search-omega-codfw on elastic2059 is CRITICAL: 146.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2059&panelId=37 [23:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:27] T287091: Enable NewUserMessage on hi.wiktionary - https://phabricator.wikimedia.org/T287091 [23:18:31] zabe: here you go. Anything else? [23:18:57] the alert is very likely unrelated, there's no way how this would cause search issues [23:19:19] no, thanks :) [23:19:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:19:34] great :) [23:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:18] !log Evening B&C window done [23:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:31] (PS1) Urbanecm: Growth mentor dashboard: Enable beta features only on beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/713366 (https://phabricator.wikimedia.org/T280307)