[00:01:00] (CR) Ppchelko: [C: +1] Invalidate the conf cache when Defines.php changes [mediawiki-config] - https://gerrit.wikimedia.org/r/704444 (owner: Tim Starling)
[00:02:02] (CR) Reedy: [C: +1] Invalidate the conf cache when Defines.php changes [mediawiki-config] - https://gerrit.wikimedia.org/r/704444 (owner: Tim Starling)
[00:03:31] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:04:03] (CR) Tim Starling: [C: +2] Invalidate the conf cache when Defines.php changes [mediawiki-config] - https://gerrit.wikimedia.org/r/704444 (owner: Tim Starling)
[00:04:44] (Merged) jenkins-bot: Invalidate the conf cache when Defines.php changes [mediawiki-config] - https://gerrit.wikimedia.org/r/704444 (owner: Tim Starling)
[00:04:52] jouncebot: now
[00:04:52] No deployments scheduled for the next 10 hour(s) and 55 minute(s)
[00:04:53] jouncebot: next
[00:04:54] In 10 hour(s) and 55 minute(s): European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1100)
[00:15:21] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: fix conf cache conflict with Defines.php noticed in beta (duration: 02m 09s)
[00:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:31] (PS1) Reedy: Add a nowandnext command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627)
[00:18:23] (CR) jerkins-bot: [V: -1] Add a nowandnext command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627) (owner: Reedy)
[00:19:34] (PS2) Reedy: Add a nowandnext command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627)
[00:20:31] (CR) jerkins-bot: [V: -1] Add a nowandnext command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627) (owner: Reedy)
[00:23:21] (PS3) Reedy: Add a nowandnext command [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/704446 (https://phabricator.wikimedia.org/T286627)
[01:20:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:26:41] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1394 MB (5% inode=94%): /tmp 1394 MB (5% inode=94%): /var/tmp 1394 MB (5% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[01:37:00] Excuse me. I forgot the password, username, and email address associated with my Wikimedia developer account. What should I do?
[02:03:06] Yining_Chen: see https://wikitech.wikimedia.org/wiki/Password_and_2FA_reset#For_users
[02:10:57] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (Legoktm) @jijiki can we depool mw2383 from scap if it's going to be down for an extended amount of time?
[02:19:45] !issync
[02:19:45] Syncing #wikimedia-operations (requested by legoktm)
[02:19:47] Set /cs flags #wikimedia-operations Reedy +Aiotv
[02:19:49] Set /cs flags #wikimedia-operations ottomata +Aiotv
[02:19:51] Set /cs flags #wikimedia-operations paravoid +Aiotv
[02:19:53] Set /cs flags #wikimedia-operations volans +Aiotv
[02:19:55] Set /cs flags #wikimedia-operations effie +Aiotv
[02:19:57] Set /cs flags #wikimedia-operations thcipriani +Aiotv
[02:19:59] Set /cs flags #wikimedia-operations Lucas_WMDE +Aiotv
[02:20:01] Set /cs flags #wikimedia-operations XioNoX +Aiotv
[02:20:03] Set /cs flags #wikimedia-operations dancy +Aiotv
[02:20:05] Set /cs flags #wikimedia-operations Nikerabbit +Aiotv
[02:20:07] Set /cs flags #wikimedia-operations urbanecm +Aiotv
[02:20:09] Set /cs flags #wikimedia-operations sbasset +Aiotv
[02:20:11] Set /cs flags #wikimedia-operations question_mark +Aiotv
[02:20:13] Set /cs flags #wikimedia-operations hashar +Aiotv
[02:20:15] Set /cs flags #wikimedia-operations apergos +Aiotv
[02:20:17] Set /cs flags #wikimedia-operations shdubsh +Aiotv
[02:20:19] Set /cs flags #wikimedia-operations bd808 +Aiotv
[02:20:21] Set /cs flags #wikimedia-operations Isarra +Aiotv
[02:20:23] Set /cs flags #wikimedia-operations James_F +Aiotv
[02:20:25] Set /cs flags #wikimedia-operations klausman +Aiotv
[02:20:27] Set /cs flags #wikimedia-operations wiki_willy +Aiotv
[02:20:29] Set /cs flags #wikimedia-operations greg-g +Aiotv
[02:25:18] !issync
[02:25:19] Syncing #wikimedia-operations (requested by legoktm)
[02:25:20] Set /cs flags #wikimedia-operations sbassett +Aiotv
[02:25:22] Set /cs flags #wikimedia-operations Urbanecm -Aiotv
[02:25:24] Set /cs flags #wikimedia-operations urbanecm +Aiotv
[02:27:30] !issync
[02:27:31] Syncing #wikimedia-operations (requested by legoktm)
[02:27:31] No updates for #wikimedia-operations
[02:29:13] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[02:38:07] (PS13) Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257)
[02:38:33] (CR) jerkins-bot: [V: -1] Initial image-suggestion-api helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: Nikki Nikkhoui)
[02:55:15] (PS14) Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257)
[03:56:51] legoktm: is it supposed to ping us as it does its thing or was that a mistake?
[04:19:35] (Abandoned) MusikAnimal: Disable code mirror by default [mediawiki-config] - https://gerrit.wikimedia.org/r/703633 (https://phabricator.wikimedia.org/T286270) (owner: MusikAnimal)
[04:56:07] (PS5) Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159)
[05:39:09] (PS1) KartikMistry: WIP: Configure the Event Platform backend to accept events in the content_translation_event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982)
[06:06:15] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 73 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:12:11] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 52 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:21:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 70 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:27:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 64 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:42:26] (CR) Elukey: [C: +1] druid: Add option to roll restart test druid worker java processes [cookbooks] - https://gerrit.wikimedia.org/r/704443 (https://phabricator.wikimedia.org/T283067) (owner: Razzi)
[06:45:19] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 94 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:57:07] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 45 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:05:39] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:41] (CR) Nikerabbit: [C: +1] TranslationAid: Handle empty message definition [extensions/Translate] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704404 (https://phabricator.wikimedia.org/T285830) (owner: Abijeet Patro)
[07:15:50] (CR) Nikerabbit: [C: +1] TranslationAid: Make sure to return successfully fetched definitions [extensions/Translate] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704405 (https://phabricator.wikimedia.org/T285830) (owner: Abijeet Patro)
[07:47:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T277118
[07:47:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T277118
[07:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:41] T277118: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118
[07:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T277118
[07:48:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T277118
[07:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:38] SRE, ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (jijiki) @Legoktm I will keep it in mind to mark it as inactive, thanks
[07:49:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T277118
[07:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T277118
[07:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:30] (CR) Volans: [C: +1] "LGTM, nice improvement! Couple of optional suggestions inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[07:58:20] SRE, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (fgiunchedi) >>! In T285835#7207987, @fgiunchedi wrote: > I think I was able to mitigate the problem by using a `part_size: 32mb` setting for multi-part uploa...
[08:01:19] (PS2) Volans: decorators: migrate to the wmflib version [software/spicerack] - https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905)
[08:06:21] (CR) jerkins-bot: [V: -1] decorators: migrate to the wmflib version [software/spicerack] - https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: Volans)
[08:22:07] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:24] (CR) Effie Mouzeli: [C: +2] rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - https://gerrit.wikimedia.org/r/693416 (owner: DCausse)
[08:29:05] (Merged) jenkins-bot: rdf-streaming-updater: use the flink-session-cluster chart [deployment-charts] - https://gerrit.wikimedia.org/r/693416 (owner: DCausse)
[08:48:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:49:02] (PS1) Btullis: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925)
[08:49:04] (PS1) Btullis: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925)
[08:52:35] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:55:25] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:59:16] (PS6) Effie Mouzeli: profile::osm_master: add tilenator users for kubepod subnets [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159)
[09:03:34] (CR) Volans: [C: +1] "Thanks for migrating this cookbook! LGTM, all comments are optional." (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:04:43] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (LSobanski) Adding @jcrespo for visibility.
[09:04:55] SRE, MediaWiki-Cache, Platform Engineering, Wikidata, and 4 others: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (Ladsgroup) Open→Resolved a:Ladsgroup This is done, I'll file a ticket about fully de...
[09:05:12] (CR) Volans: [C: -1] "LGTM, couple of typos inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:09:53] (CR) Ladsgroup: "> Patch Set 4: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[09:11:54] (PS5) Ladsgroup: arclamp: Migrate crons to systemd timer [puppet] - https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673)
[09:13:40] (PS1) Jelto: prometheus::ops add jobs to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170)
[09:14:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T277118
[09:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T277118
[09:14:26] T277118: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118
[09:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:26] SRE, Traffic, good first task: Move Varnish test infrastructure from Vagrant to Docker - https://phabricator.wikimedia.org/T286639 (ema)
[09:17:51] SRE, Traffic, good first task: Move Varnish test infrastructure from Vagrant to Docker - https://phabricator.wikimedia.org/T286639 (ema) p:Triage→Medium
[09:18:03] SRE, ops-eqiad, DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (LSobanski) @RobH Thanks for digging into it. Let's wait until we have people onsite and can crash cart into the host. Who would be the best person to assign this to so it gets scheduled as soon as possible?
[09:18:21] SRE, ops-eqiad, DC-Ops: (Need By: 2021-04-30) rack/setup/install backup100[4-7] - https://phabricator.wikimedia.org/T277327 (jcrespo) > You can still hot plug the disk bays, just manually have to remove the disk from the mdadm array Thanks for the correction. I think I was based on previous experien...
[09:20:02] (CR) Btullis: "> Patch Set 1: Code-Review+1" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[09:20:04] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30208/console" [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:23:42] (CR) Jcrespo: "I am going to merge this to continue the testing "break things fast" (for a new service), but please keep the feedback going- this will li" [puppet] - https://gerrit.wikimedia.org/r/704185 (owner: Jcrespo)
[09:26:49] (CR) Jelto: [V: +1] "Could you please take a look? I added jobs to scrape gitlab metrics. I'm not sure if the selection of the gitlab host and the class_names " [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:27:17] !log [urbanecm@mwmaint2002 /srv/mediawiki/php-1.37.0-wmf.14]$ time mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=testwiki # T285811
[09:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:24] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811
[09:29:58] (CR) Jgiannelos: "nit: There is a typo (tilerator not tilenator) on commit msg." [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli)
[09:30:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T277118
[09:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:15] T277118: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118
[09:30:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s4 T277118
[09:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:31] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:30:54] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:31:32] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T277118
[09:31:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T277118
[09:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:38] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:54] (CR) Jgiannelos: "Wouldn't it make sense to create new users for the new deployment to avoid future confusion? (tilerator will be decommissioned in the near " [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli)
[09:35:22] (CR) Effie Mouzeli: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli)
[09:39:58] (PS1) Urbanecm: mediawiki/maintenance/growthexperiments.pp: Run updateMenteeData every day [puppet] - https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811)
[09:43:56] SRE, Product-Analytics, SRE-Access-Requests, Patch-For-Review: Update membership info for iflorez - https://phabricator.wikimedia.org/T286509 (Vgutierrez) p:Triage→High Everything ready in our side but we need @Ottomata approval
[09:44:11] SRE, Product-Analytics, SRE-Access-Requests, Patch-For-Review: Update membership info for iflorez - https://phabricator.wikimedia.org/T286509 (Vgutierrez) p:High→Medium
[09:44:27] (PS2) Elukey: knative,kubeflow: improve the import of the build images [docker-images/production-images] - https://gerrit.wikimedia.org/r/701396
[09:45:47] (PS1) Elukey: Add missing envoy config files to istio proxyv2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/704507 (https://phabricator.wikimedia.org/T278192)
[09:48:47] (PS2) Elukey: Add missing envoy config files to istio proxyv2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/704507 (https://phabricator.wikimedia.org/T278192)
[09:49:19] (CR) Elukey: [V: +2 C: +2] knative,kubeflow: improve the import of the build images [docker-images/production-images] - https://gerrit.wikimedia.org/r/701396 (owner: Elukey)
[09:53:18] (CR) Klausman: [C: +1] Add missing envoy config files to istio proxyv2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/704507 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:55:14] (CR) Elukey: [V: +2 C: +2] Add missing envoy config files to istio proxyv2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/704507 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:55:33] (CR) David Caro: [C: +1] Remove legacy wiki replicas [cookbooks] - https://gerrit.wikimedia.org/r/704348 (https://phabricator.wikimedia.org/T260389) (owner: Nskaggs)
[10:07:35] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:09:23] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:09:24] (CR) Arturo Borrero Gonzalez: [C: +1] cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera [puppet] - https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: Jbond)
[10:09:35] (CR) Jcrespo: [C: +2] Revert "Revert "mediabackup: Install minio on the storage hosts and open port 9000"" [puppet] - https://gerrit.wikimedia.org/r/704185 (owner: Jcrespo)
[10:13:04] (PS1) Jcrespo: mediabackup: Assign correct default group to minio-user [puppet] - https://gerrit.wikimedia.org/r/704510 (https://phabricator.wikimedia.org/T276442)
[10:13:56] (CR) Jcrespo: [C: +2] mediabackup: Assign correct default group to minio-user [puppet] - https://gerrit.wikimedia.org/r/704510 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[10:20:04] (CR) Jbond: dragonfly: Trim newlines in config files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/704360 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[10:20:29] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[10:23:07] (CR) ZPapierski: [C: +1] [cirrus] switch more_like traffic to codfw 1/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/704389 (owner: DCausse)
[10:23:12] (CR) ZPapierski: [C: +1] [cirrus] switch more_like traffic to codfw 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/704390 (owner: DCausse)
[10:26:06] (PS2) Btullis: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925)
[10:26:08] (PS2) Btullis: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925)
[10:35:44] (CR) Volans: [C: -1] "Small issue inline" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[10:36:50] (CR) Volans: [C: +1] "LGTM, probably better to have Luca have a second look too and/or try it after merging both in DRY-RUN and possibly for real on a test clus" [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[10:40:01] PROBLEM - cassandra CQL 10.192.48.165:9042 on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[10:40:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: reimaging as buster replica
[10:40:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: reimaging as buster replica
[10:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:18] (PS3) Btullis: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925)
[10:43:20] (PS3) Btullis: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925)
[10:43:52] (PS1) Filippo Giunchedi: thanos: test moving compactor to thanos-fe2003 [puppet] - https://gerrit.wikimedia.org/r/704513 (https://phabricator.wikimedia.org/T285835)
[10:43:55] (PS1) Filippo Giunchedi: hieradata: undeploy statsite from Thanos [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835)
[10:44:58] (CR) Btullis: "> Patch Set 2: Code-Review-1" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: Btullis)
[10:45:29] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30209/console" [puppet] - https://gerrit.wikimedia.org/r/704513 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:46:48] (PS2) Filippo Giunchedi: hieradata: undeploy statsite from Thanos [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835)
[10:47:01] (Abandoned) Filippo Giunchedi: thanos: test moving compactor to thanos-fe2003 [puppet] - https://gerrit.wikimedia.org/r/704513 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:48:10] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30210/console" [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:48:29] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.002887 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[10:48:44] (CR) Filippo Giunchedi: [V: +1] "The same change has been active on the swift cluster for a little while now" [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:49:05] seeking (straightforward) +1s for https://gerrit.wikimedia.org/r/c/operations/puppet/+/704514 ^
[10:49:09] btullis: AFAICT also the logged message on line 95(old)/104(new) should go with them ;)
[10:50:47] (CR) Volans: [C: +1] "LGTM from what I can tell" [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:50:49] godog: done
[10:53:15] volans: Doh! Thanks.
[10:55:24] volans: TYVM sir, appreciate it
[10:55:27] (CR) JMeybohm: dragonfly::dfdaemon: Make profile and module ensureable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[10:55:39] np :)
[10:56:35] (CR) Filippo Giunchedi: [V: +1 C: +2] hieradata: undeploy statsite from Thanos [puppet] - https://gerrit.wikimedia.org/r/704514 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[10:58:17] (PS4) Btullis: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925)
[10:58:19] (PS4) Btullis: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:09] o/
[11:00:16] * urbanecm waves only to see there are no customers :(
[11:00:23] yup, nothing there
[11:01:27] hmm, since no one is around, I can do some flagged revs config changes but I'm a bit tired. I'll do that in fifteen minutes
[11:01:57] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[11:02:10] Amir1: I'll deploy a few pending patches i didn't schedule
[11:02:14] (PS1) Jcrespo: mediabackup: Add dummy passwords for mediabackup storage keys [labs/private] - https://gerrit.wikimedia.org/r/704517 (https://phabricator.wikimedia.org/T276442)
[11:02:25] urbanecm: sounds good to me
[11:02:31] (CR) Urbanecm: [C: +2] Change category name of Babel extension on Javanese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/702961 (https://phabricator.wikimedia.org/T286165) (owner: Labdajiwa)
[11:02:34] ping me once you're done
[11:02:55] (PS3) Urbanecm: Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[11:03:19] (Merged) jenkins-bot: Change category name of Babel extension on Javanese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/702961 (https://phabricator.wikimedia.org/T286165) (owner: Labdajiwa)
[11:03:42] will do
[11:03:45] (CR) Jcrespo: [V: +2 C: +2] mediabackup: Add dummy passwords for mediabackup storage keys [labs/private] - https://gerrit.wikimedia.org/r/704517 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[11:03:53] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[11:04:49] (PS4) Urbanecm: Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[11:04:57] (CR) Urbanecm: [C: +2] Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[11:05:05] (PS9) Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - https://gerrit.wikimedia.org/r/702099
[11:05:55] (Merged) jenkins-bot: Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[11:06:04] (CR) Hnowlan: [C: +2] maps: reimage maps2008 as buster replica in new cluster [puppet] - https://gerrit.wikimedia.org/r/702099 (owner: Hnowlan)
[11:06:28] (PS1) Jcrespo: mediabackup: Update config dir behaviour, add root passwords [puppet] - https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442)
[11:06:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4dc11d2333cbf70a4eb20f3fb94a9e363b41d2df: Change category name of Babel extension on Javanese Wikipedia (T286165) (duration: 02m 10s)
[11:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:38] T286165: Category name of Babel extension on Javanese Wikipedia - https://phabricator.wikimedia.org/T286165
[11:07:04] (CR) jerkins-bot: [V: -1] mediabackup: Update config dir behaviour, add root passwords [puppet] - https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[11:07:28] can someone remove mw2383 from puppet? It's depooled from user traffic, but it's still in scap, even though it always times out
[11:07:33] *from scap
[11:10:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 72027e136f10867f5db02043b7505390e49130d1: Disable indexing in NS_USER and NS_USER_TALK on bnwiki (T286152) (duration: 02m 07s)
[11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:33] T286152: Disable search engine indexing in specific namespaces of Bangla Wikipedia - https://phabricator.wikimedia.org/T286152
[11:11:42] (PS2) Jcrespo: mediabackup: Update config dir behaviour, add root passwords [puppet] - https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442)
[11:12:12] (CR) jerkins-bot: [V: -1] mediabackup: Update config dir behaviour, add root passwords [puppet] - https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[11:12:18] it's not in the dsh list that I can see, urbanecm
[11:13:34] apergos: but scap tries to reach it
[11:13:42] hrm
[11:13:46] [255]: ssh: connect to host mw2383.codfw.wmnet port 22: Connection timed out
[11:14:28] (CR) Hnowlan: [C: +1] "I'm okay with reusing tilerator for now but we should change this once tilerator is decommissioned." [puppet] - https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli)
[11:15:10] (CR) JMeybohm: [C: -1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[11:15:13] hey jynus, your dummy passwords are safe to merge I assume?
[11:15:24] apergos: not sure how to get its conftool status, iirc it needs to be inactive to be out of scap
[11:15:29] yes, hnowlan, sorry, i forgot about deploy
[11:15:34] no worries!
[11:15:47] that could be, I see that the scap dsh list has been redone since I looked last [11:15:49] https://phabricator.wikimedia.org/T286463 [11:15:57] so maybe it's the status indeed [11:16:07] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:17:01] { 'host': 'mw2383.codfw.wmnet', 'weight':30, 'enabled': False } [11:17:12] https://config-master.wikimedia.org/pybal/codfw/appservers-https (for future reference) [11:17:16] now I wonder what it needs to be [11:18:01] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:18:12] apergos: i think it shouldn't be there at all. 
`inactive means the server is not in the config we write at all` from https://wikitech.wikimedia.org/wiki/Conftool [11:18:19] (03PS1) 10Majavah: metricsinfra: Remove alertmanager apache proxy [puppet] - 10https://gerrit.wikimedia.org/r/704522 (https://phabricator.wikimedia.org/T286335) [11:18:27] (03PS3) 10Jcrespo: mediabackup: Update config dir behaviour, add root passwords [puppet] - 10https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) [11:19:01] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Update config dir behaviour, add root passwords [puppet] - 10https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:19:10] Amir1: i'm done, go ahead [11:19:21] okaaay [11:19:25] let me grab a coffee [11:19:56] (03PS4) 10Jcrespo: mediabackup: Update config dir behaviour, add root passwords [puppet] - 10https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) [11:20:23] (03PS2) 10Majavah: metricsinfra: Remove alertmanager apache proxy [puppet] - 10https://gerrit.wikimedia.org/r/704522 (https://phabricator.wikimedia.org/T286335) [11:20:47] (03CR) 10jerkins-bot: [V: 04-1] metricsinfra: Remove alertmanager apache proxy [puppet] - 10https://gerrit.wikimedia.org/r/704522 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [11:21:05] (03PS5) 10Jcrespo: mediabackup: Update config dir behaviour, add root passwords [puppet] - 10https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) [11:23:27] !log ariel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2383.codfw.wmnet [11:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:52] well I was just gonna log that but thanks to the tool for autologging [11:24:01] thanks apergos [11:24:01] urbanecm: ^^ let's see if that does what you need [11:24:11] 👍 [11:24:12] hopefully Amir1's deployments will be faster [11:24:27] RECOVERY - SSH on 
logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:24:54] and it indeed disappeared from https://config-master.wikimedia.org/pybal/codfw/appservers-https [11:24:57] so...thanks apergos :) [11:25:08] :-) [11:25:55] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Peachey88) ` !log ariel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2383.codfw.wmnet` [11:27:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:30:47] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:32:31] (03CR) 10Ladsgroup: [C: 03+2] flaggedrevs: Reduce levels for ruwiki to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700854 (https://phabricator.wikimedia.org/T284589) (owner: 10Ladsgroup) [11:33:13] (03Merged) 10jenkins-bot: flaggedrevs: Reduce levels for ruwiki to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700854 (https://phabricator.wikimedia.org/T284589) (owner: 10Ladsgroup) [11:33:24] (03CR) 10Dzahn: [C: 03+2] site/conftool: turn mw1422 into an mw appserver [puppet] - 10https://gerrit.wikimedia.org/r/704319 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [11:36:59] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 2354 MB (8% inode=95%): /tmp 2354 MB (8% inode=95%): /var/tmp 2354 MB (8% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [11:37:57] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on maps2008.codfw.wmnet with reason: REIMAGE [11:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:56] !log ladsgroup@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:700854|flaggedrevs: Reduce levels for ruwiki to 1 (T284589)]] (duration: 01m 05s) [11:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:03] T284589: Remove "reviewer" user group from ruwiki (Flagged Revs) - https://phabricator.wikimedia.org/T284589 [11:40:07] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2008.codfw.wmnet with reason: REIMAGE [11:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:04] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [11:43:53] (03PS1) 10Ladsgroup: Remove reviewer user group in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704525 (https://phabricator.wikimedia.org/T284589) [11:44:05] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [11:45:21] (03CR) 10Ladsgroup: [C: 03+2] Remove reviewer user group in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704525 (https://phabricator.wikimedia.org/T284589) (owner: 10Ladsgroup) [11:46:05] (03Merged) 10jenkins-bot: Remove reviewer user group in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704525 (https://phabricator.wikimedia.org/T284589) (owner: 10Ladsgroup) [11:47:33] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks for preparing a patch, merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704440 (https://phabricator.wikimedia.org/T286624) (owner: 10Samtar) [11:47:43] (03CR) 10Muehlenhoff: [C: 03+2] Update email address for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704440 (https://phabricator.wikimedia.org/T286624) (owner: 10Samtar) [11:49:30] !log ladsgroup@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:704525|Remove reviewer user group in ruwiki (T284589)]] (duration: 01m 05s) [11:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:37] T284589: Remove "reviewer" user group from ruwiki (Flagged Revs) - https://phabricator.wikimedia.org/T284589 [11:51:25] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1733 hosts [11:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:31] PROBLEM - Check systemd state on mw1422 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw1422.eqiad.wmnet with reason: new host [11:52:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw1422.eqiad.wmnet with reason: new host [11:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1733 hosts [11:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] !log mw1422 - new setup, not in prod yet [11:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:52] (03PS1) 10Ladsgroup: Make idwiki use protect mode of flaggedrevs [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/704527 (https://phabricator.wikimedia.org/T268317) [11:57:07] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [11:58:20] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [11:59:28] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [11:59:43] RECOVERY - Check systemd state on mw1422 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:53] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [12:01:51] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2008.codfw.wmnet with reason: Bootstrapping cassandra in new cluster [12:01:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2008.codfw.wmnet with reason: Bootstrapping cassandra in new cluster [12:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:52] !log upgrading python3-wmflib fleetwide to 0.0.8 (needed for new logout.d wrapper) [12:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:27] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:06:49] RECOVERY - tilerator on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.098 second response time 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:09:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1422.eqiad.wmnet [12:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:52] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:14:38] (03PS4) 10Jelto: role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) [12:15:18] !log mw1422 - scap pull [12:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:53] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30215/console" [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:23:26] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1733 hosts [12:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1733 hosts [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:17] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [12:32:07] (03PS1) 10Filippo Giunchedi: Remove statsite from swift/thanos, replaced by statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/704530 (https://phabricator.wikimedia.org/T285835) [12:32:09] (03PS1) 10Filippo Giunchedi: swift: use python3 packages as needed [puppet] - 10https://gerrit.wikimedia.org/r/704531 (https://phabricator.wikimedia.org/T285835) [12:32:11] 
(03PS1) 10Filippo Giunchedi: swift: move stats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) [12:32:13] (03PS1) 10Filippo Giunchedi: swift: split wmf rewrite into python3 [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) [12:33:13] (03CR) 10jerkins-bot: [V: 04-1] swift: move stats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:33:31] (03CR) 10jerkins-bot: [V: 04-1] swift: split wmf rewrite into python3 [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:34:16] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30216/console" [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:37:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rdb1005.eqiad.wmnet [12:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:55] moritzm: funnily enough ores in codfw went down [12:42:15] I hope it doesn't use the eqiad redis for pub/sub [12:43:41] !log Start server-side upload of 3 large image files (T285708) [12:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:48] T285708: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T285708 [12:44:27] (03PS2) 10Filippo Giunchedi: swift: move stats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) [12:44:30] (03PS2) 10Filippo Giunchedi: swift: split wmf rewrite into python3 [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) [12:46:39] RECOVERY - cassandra CQL 
10.192.48.165:9042 on maps2008 is OK: TCP OK - 0.033 second response time on 10.192.48.165 port 9042 https://phabricator.wikimedia.org/T93886 [12:47:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1005.eqiad.wmnet [12:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:39] (03CR) 10Volans: "Changes looks good, I can't guarantee there isn't any additional change to make them compatible with py3. One question inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:51:57] (03CR) 10Volans: swift: split wmf rewrite into python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:52:45] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:53:39] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.00549 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:54:04] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:54:50] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:55:11] (03CR) 10Volans: [C: 03+1] "Looks sane if puppet compiler is happy" [puppet] - 10https://gerrit.wikimedia.org/r/704531 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:57:03] (03CR) 10Andrew Bogott: [C: 03+1] cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera [puppet] - 10https://gerrit.wikimedia.org/r/702325 
(https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [13:01:03] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:29] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [13:03:15] (03PS1) 10Effie Mouzeli: rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) [13:06:07] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [13:06:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s1 T277118 [13:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s1 T277118 [13:09:32] T277118: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 [13:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:48] (03PS2) 10JMeybohm: admin_ng: Add a new tiller ClusterRole for flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/704326 (https://phabricator.wikimedia.org/T264006) [13:12:08] (03PS2) 10Effie Mouzeli: rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) 
[13:12:41] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Update membership info for iflorez - https://phabricator.wikimedia.org/T286509 (10Ottomata) Approved! [13:14:48] (03PS7) 10Effie Mouzeli: profile::osm_master: add tilerator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [13:14:49] greg-g: yeah, it's supposed to ping mostly for transparency purposes on what it's doing. Also I think it's good practice to tell people their rights have changed [13:14:55] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:13] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [13:22:36] (03CR) 10Btullis: [C: 03+2] druid: Add option to roll restart test druid worker java processes [cookbooks] - 10https://gerrit.wikimedia.org/r/704443 (https://phabricator.wikimedia.org/T283067) (owner: 10Razzi) [13:23:19] (03CR) 10Hnowlan: [C: 03+2] postgresql: ensure that pg_basebackup can access variables for resync [puppet] - 10https://gerrit.wikimedia.org/r/701888 (owner: 10Hnowlan) [13:23:42] (03PS1) 10Effie Mouzeli: Add rdf-streaming-updater user [puppet] - 10https://gerrit.wikimedia.org/r/704537 (https://phabricator.wikimedia.org/T264006) [13:25:26] (03Merged) 10jenkins-bot: druid: Add option to roll restart test druid worker java processes [cookbooks] - 10https://gerrit.wikimedia.org/r/704443 (https://phabricator.wikimedia.org/T283067) (owner: 10Razzi) [13:29:50] (03CR) 10JMeybohm: [C: 03+1] "Just nits, actually. This looks pretty solid and could also be merged as is - so +1!" 
(033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [13:30:36] (03PS5) 10Btullis: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) [13:31:02] (03PS1) 10Effie Mouzeli: Add tokens for rdf-streaming-updater service [labs/private] - 10https://gerrit.wikimedia.org/r/704538 (https://phabricator.wikimedia.org/T264006) [13:31:05] (03PS5) 10Btullis: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925) [13:33:43] (03PS1) 10Effie Mouzeli: rdf-streaming-updater: add kubernetes user [labs/private] - 10https://gerrit.wikimedia.org/r/704539 (https://phabricator.wikimedia.org/T26400) [13:34:48] (03PS2) 10Effie Mouzeli: Add rdf-streaming-updater kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/704537 (https://phabricator.wikimedia.org/T264006) [13:35:35] (03PS2) 10Effie Mouzeli: rdf-streaming-updater:Add tokens for service [labs/private] - 10https://gerrit.wikimedia.org/r/704538 (https://phabricator.wikimedia.org/T264006) [13:35:55] (03PS3) 10Effie Mouzeli: rdf-streaming-updater: Add tokens for service [labs/private] - 10https://gerrit.wikimedia.org/r/704538 (https://phabricator.wikimedia.org/T264006) [13:37:19] (03CR) 10Effie Mouzeli: [C: 03+2] rdf-streaming-updater: Add tokens for service [labs/private] - 10https://gerrit.wikimedia.org/r/704538 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [13:37:23] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] rdf-streaming-updater: Add tokens for service [labs/private] - 10https://gerrit.wikimedia.org/r/704538 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [13:37:39] (03CR) 10Effie Mouzeli: [C: 03+2] rdf-streaming-updater: add kubernetes user [labs/private] - 10https://gerrit.wikimedia.org/r/704539 
(https://phabricator.wikimedia.org/T26400) (owner: 10Effie Mouzeli) [13:37:42] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] rdf-streaming-updater: add kubernetes user [labs/private] - 10https://gerrit.wikimedia.org/r/704539 (https://phabricator.wikimedia.org/T26400) (owner: 10Effie Mouzeli) [13:37:56] (03PS2) 10Effie Mouzeli: rdf-streaming-updater: add kubernetes user [labs/private] - 10https://gerrit.wikimedia.org/r/704539 (https://phabricator.wikimedia.org/T26400) [13:38:05] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] rdf-streaming-updater: add kubernetes user [labs/private] - 10https://gerrit.wikimedia.org/r/704539 (https://phabricator.wikimedia.org/T26400) (owner: 10Effie Mouzeli) [13:38:36] (03CR) 10Vgutierrez: [C: 03+2] "Approved by Otto on the phab task https://phabricator.wikimedia.org/T286509#7212076" [puppet] - 10https://gerrit.wikimedia.org/r/704364 (https://phabricator.wikimedia.org/T286509) (owner: 10Vgutierrez) [13:38:46] (03PS3) 10Vgutierrez: admin: Add iflorez to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/704364 (https://phabricator.wikimedia.org/T286509) [13:42:42] effie: I've skipped your changes so they need to be merged [13:43:05] yeah it is private, I will merge them, thank you [13:43:07] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1733 hosts [13:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:28] (03PS1) 10Ottomata: Bump Refine spark_executor_memory to 8G [puppet] - 10https://gerrit.wikimedia.org/r/704541 (https://phabricator.wikimedia.org/T271232) [13:43:40] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Update membership info for iflorez - https://phabricator.wikimedia.org/T286509 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Done, please wait 30 minutes to allow puppet to run on the affected servers. 
[13:43:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1733 hosts [13:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:56] (03CR) 10jerkins-bot: [V: 04-1] Bump Refine spark_executor_memory to 8G [puppet] - 10https://gerrit.wikimedia.org/r/704541 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:44:21] (03PS2) 10Ottomata: Bump Refine spark_executor_memory to 8G [puppet] - 10https://gerrit.wikimedia.org/r/704541 (https://phabricator.wikimedia.org/T271232) [13:47:51] (03CR) 10Ottomata: [C: 03+2] Bump Refine spark_executor_memory to 8G [puppet] - 10https://gerrit.wikimedia.org/r/704541 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:48:37] effie: FYI i just puppet-merged your private patch for rd-streaming-updater dummy token [13:48:45] go ahead [13:48:46] thank you [13:50:05] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.375 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:51:17] (03CR) 10Effie Mouzeli: [C: 03+2] Add rdf-streaming-updater kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/704537 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [13:51:51] the mw alert seems to be related to api-appservers [13:51:51] (03CR) 10JMeybohm: [C: 03+1] rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [13:51:57] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&from=now-12h&to=now&var-datasource=codfw%20prometheus%2Fops [13:52:09] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10dcaro) @Cmjohnson any 
updates on this? As far as I can see everything is ok in that machine: ` root@cloudcephosd1018:~# cat /proc/mdstat Personalities : [raid1] [linear] [mult... [13:53:02] and latency tripled from this morning sigh [13:54:33] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=9&from=now-12h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [13:54:47] more or less from 7 UTC [13:57:43] (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [13:57:45] (03PS15) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) [13:58:10] whoops yeah, something sure did happen around 07:00, didn't it [13:58:25] (03CR) 10Btullis: [C: 03+2] Update sre.hadoop.roll-restart-workers cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [13:58:49] (03CR) 10JMeybohm: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [13:58:51] (03CR) 10Btullis: [C: 03+2] Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [14:02:03] (03Merged) 10jenkins-bot: Update sre.hadoop.roll-restart-masters cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704500 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [14:02:05] (03Merged) 10jenkins-bot: Update sre.hadoop.roll-restart-workers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/704501 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) 
[14:02:43] (03CR) 10Effie Mouzeli: [C: 03+2] admin_ng: Add a new tiller ClusterRole for flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/704326 (https://phabricator.wikimedia.org/T264006) (owner: 10JMeybohm) [14:03:25] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.3438 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:04:12] worker saturation looks like it's reasonably uniform across machines this time, I'll poke around on one more or less at random [14:05:08] (03Merged) 10jenkins-bot: admin_ng: Add a new tiller ClusterRole for flink-session-cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/704326 (https://phabricator.wikimedia.org/T264006) (owner: 10JMeybohm) [14:06:36] (03PS3) 10Effie Mouzeli: rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) [14:08:52] plenty of "script executed too slow, logging" in the error.log, looks like mostly action=query starting at ~07:00 but I don't see much of a pattern beyond that yet [14:09:09] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0625 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:09:12] also no clear change logged in SAL around that time [14:09:32] yeah [14:10:17] (03CR) 10Effie Mouzeli: [C: 03+2] rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [14:10:50] (03PS1) 10Muehlenhoff: idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 [14:12:10] rzl: ok if I try to restart php-fpm on one 
node and check its latency afterward? [14:12:30] elukey: yeah, I was starting to think about that -- go ahead [14:12:45] (03Merged) 10jenkins-bot: rdf-streaming-updater: add namespace for service [deployment-charts] - 10https://gerrit.wikimedia.org/r/704535 (https://phabricator.wikimedia.org/T264006) (owner: 10Effie Mouzeli) [14:13:14] if we really need to do a rolling restart on the api fleet without knowing why, I'm going to *hate* it :) but if it works it's better than not doing it [14:13:23] !log restart php-fpm on mw2370 [14:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:29] (03PS2) 10Muehlenhoff: idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 [14:13:30] (03CR) 10jerkins-bot: [V: 04-1] idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [14:13:32] (03CR) 10Volans: "some alternatives inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [14:14:20] rzl: I am using something like tail -f /var/log/apache2/other_vhosts_access.log | awk 'length($2) > 6' and I keep seeing a lot of entries even after the restart [14:14:40] ack [14:14:42] (IIRC the second field in the httpd logs is the duration) [14:14:51] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.3438 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:15:11] yep correct [14:16:15] (03CR) 10jerkins-bot: [V: 04-1] idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [14:16:44] (03CR) 10Filippo Giunchedi: "Thank you for the review" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:16:53] (03CR) 
10Lucas Werkmeister (WMDE): Add config for updated PropertySuggester beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [14:16:57] RECOVERY - Host mw2383 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [14:16:59] tendril looks clean afaics [14:17:06] elukey: gonna check a theory, it looks like latency increased linearly for 30 minutes and then held steady [14:17:18] so I wonder if there was a bad puppet change committed at 07:00? looking [14:17:19] elukey: https://orchestrator.wikimedia.org/web/clusters/ [14:17:59] PROBLEM - Memcached on mw2383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [14:18:01] PROBLEM - Apache HTTP on mw2383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:18:40] (nope, nothing) [14:19:20] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:19:20] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30217/console" [puppet] - 10https://gerrit.wikimedia.org/r/704530 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:19:57] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:19:57] kormat: are those warnings related to the schema changes that you kicked off this morning? 
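(editor's sketch) The awk one-liner quoted at 14:14:20 keeps only the access-log lines whose second whitespace-separated field is longer than six characters. Below is a rough Python equivalent, assuming, as said in the channel, that the second field is the request duration. The sample lines and the `slow_requests` name are made up for illustration and do not reflect the real log layout.

```python
def slow_requests(lines, min_digits=7):
    """Yield lines whose second whitespace-separated field has at least
    min_digits characters (same test as awk 'length($2) > 6')."""
    for line in lines:
        fields = line.split()
        # Lines with fewer than two fields are skipped, as awk would treat $2 as empty.
        if len(fields) >= 2 and len(fields[1]) >= min_digits:
            yield line


# Hypothetical sample lines: timestamp, duration, method, path.
sample = [
    "2021-07-14T14:14:20 1234567 GET /w/api.php",
    "2021-07-14T14:14:21 98765 GET /w/api.php",
    "2021-07-14T14:14:22 20456789 GET /w/api.php",
    "short",
]
slow = list(slow_requests(sample))
for entry in slow:
    print(entry)
```

If the duration is logged in microseconds (an assumption, not confirmed in the log), seven or more digits corresponds to requests that took at least one second.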
[14:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:03] (trying to parse the dashboard) [14:20:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Otherwise, this looks okay to me, though I also left a comment on T285098 about some strange suggester behavior that I noticed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [14:20:18] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Remove statsite from swift/thanos, replaced by statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/704530 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:20:19] PROBLEM - Host mw2383 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:37] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.4844 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:20:53] ^ 2383 is depooled, unrelated [14:20:57] elukey: all of the current warnings on orchestrator are 'expected', and should be ignored. [14:21:18] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:21:23] perfect so all good from the db point of view [14:21:24] i was (unhelpfully) pointing out that tendril is being replaced by orchestrator, and orchestrator will actually flag issues (unlike tendril) [14:21:31] elukey: 👍 [14:21:39] noted, thanks, really nice ui! 
[14:23:07] evnoy telemetry says search-https_codfw is what increased quite a bunch https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=14&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=All&from=1626245975967&to=1626246483746 [14:23:11] *envoy [14:23:34] like from 500ms to >2s [14:23:41] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:23:42] oh man, I completely forgot to look at the envoy dashboard [14:23:43] thank you jayme [14:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:25] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0625 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:24:51] IIUC this is elasticsearch [14:25:41] this probably isn't great! https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=53&orgId=1&refresh=1m&from=now-12h&to=now [14:25:44] request() /srv/mediawiki/php-1.37.0-wmf.12/vendor/ruflin/elastica/lib/Elastica/Multi/Search.php:150 [14:25:50] this is indeed in the slow log [14:25:59] I am here [14:26:05] rzl: ouch [14:26:10] looking in a sec [14:26:16] also the currusSearchIncomingLinkCount is quite high [14:26:19] dcausse: <3 I was about to ping you [14:26:21] whatever that is [14:26:25] ryankemper too --^ [14:26:36] (03PS1) 10David Caro: ceph: Update health alert url to runbook [puppet] - 10https://gerrit.wikimedia.org/r/704545 [14:27:31] (03CR) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [14:28:12] elastic2054 has load ~70 [14:28:37] jayme: is it okay with you if I drop off for a meeting? 
happy to reschedule it if you'd rather have more hands here [14:28:48] I took the liberty to sort decreasing jayme's graph so it is more clear when hovering what is the worst offender [14:29:32] rzl: sure. I've no idea what I'm doing but we will manage :) will ping you back if we don't [14:29:52] yeah, don't hesitate :) thanks! [14:30:36] ah, yeah. Sorry elukey. Selections and sorting are no URI parameters :/ [14:31:02] np :) [14:31:06] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=85&orgId=1&refresh=1m&from=now-12h&to=now is also interesting [14:31:15] !log runnning elasticsearch-madvise-random 1022 on elastic2054 [14:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:55] dcausse: the backlog for IncomingLinkCount started to climb [14:32:00] (if it is of any help) [14:32:17] yes I think it's a consequence of the server misbehaving [14:32:33] ah snap only one server causing this mess? [14:32:34] :( [14:32:58] better dashboard link to envoy telemetry for reference https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=14&orgId=1&from=now-24h&to=now&var-datasource=codfw%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=search-https_codfw&var-destination=search-https_eqiad [14:33:10] !log runnning elasticsearch-madvise-random ES_PID on elastic2045 [14:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:30] ACKNOWLEDGEMENT - Host mw2383 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T286463 [14:35:19] elastic@codfw thread pool is recovering [14:36:16] (03CR) 10Muehlenhoff: idm.logout: Remove the confirmation on failure (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [14:37:56] dcausse: latency of api appservers dropped [14:37:58] nice :) [14:38:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: 
name=mw1422.eqiad.wmnet [14:38:12] are elastic2054 and elastic2045 somehow special? [14:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:31] at least the former has way more disk IO than the others (~twice as much) [14:39:50] jayme: yes, we configure disk read_ahead to something low to avoid wasting page cache but this setting does not seem to be applied on these machines; the workaround is to run a custom tool to tell the elastic process to not anticipate reads [14:40:05] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:26] we should have a ticket for that, looking [14:40:32] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:43] dcausse: would it be ok for your team to follow up on this issue with some extra alert related to ES latency being too high? [14:40:46] 2045 is not pooled, 2054 is pooled [14:41:06] just to have some indication in the future if something is misbehaving [14:41:11] elukey: we do but seems like something complained earlier [14:41:25] ah sorry didn't see it [14:41:59] (03PS1) 10David Caro: ceph: Update dashboard links to tags [puppet] - 10https://gerrit.wikimedia.org/r/704547 [14:42:00] I think the check needs the latencies to drop for at least 30min to be captured [14:42:01] dcausse: ah, thanks. That's https://gerrit.wikimedia.org/r/c/operations/puppet/+/702791 then I suppose [14:42:21] (going back to istio-fun) [14:42:31] (going back to flink fun) [14:42:36] jayme: yes that's the workaround [14:42:43] ack, thanks [14:43:10] dcausse: is it correct that elastic2033, 2045 and 2048 are not pooled currently (while the others are)? 
[14:43:15] (03PS2) 10Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) [14:43:28] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:54] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:43:57] !log installing apache security updates on grafana* [14:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:44] (03PS1) 10Ottomata: Add prometheus precomputed query for express_router_request_duration_seconds [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) [14:44:58] jayme: if I am doing the flink fun, then what are you doing ?? [14:45:00] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [14:45:19] effie: I thought you might have thrown more reviews at me :D [14:45:30] hahaa [14:45:32] no I am done [14:45:50] great. 
So I can return to pretend working [14:46:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "Neato" [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [14:46:29] (03PS3) 10Dzahn: site/conftool: remove mw1281 through mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) [14:46:30] lol sure [14:46:52] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: move stats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/704532 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:46:55] (03CR) 10Ottomata: [C: 03+2] Add prometheus precomputed query for express_router_request_duration_seconds [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [14:47:00] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: use python3 packages as needed [puppet] - 10https://gerrit.wikimedia.org/r/704531 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:47:02] !log set mw2384 as inactive to investigate mw2383 issue - T286463 [14:47:03] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: split wmf rewrite into python3 [puppet] - 10https://gerrit.wikimedia.org/r/704533 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [14:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:08] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [14:47:30] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2384.codfw.wmnet [14:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:35] (03CR) 10Dzahn: [C: 03+2] site/conftool: remove mw1281 through mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/704284 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [14:48:07] jayme: if you want I can send random istio code reviews if you feel bored :D [14:48:42] oh, no. 
I'm still pretty busy dodging the ones that are still open. Thanks :) [14:49:51] PROBLEM - Host mw2384 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:16] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30219/console" [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [14:50:43] mutante: I only see elastic2043 being explicitly excluded, I see elastic2033 and elastic2048 in the cluster state [14:51:22] dcausse: I got that from https://config-master.wikimedia.org/pybal/codfw/search-https [14:51:30] !log installing apache security updates on puppet masters [14:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:27] dcausse: f.e. [cumin1001:~] $ sudo -i confctl select name=elastic2033.codfw.wmnet get [14:52:28] mutante: I'm not sure that an elastic host being depooled at the pybal level actually does anything, generally we "depool" by banning from the cluster via the ES api [14:52:36] mutante: oh interesting, I guess they're simply not receiving traffic from mw app servers but still are doing work due to internode communication [14:52:59] ^ yeah that's a better way to put it :) [14:53:02] ryankemper: ah, ok, then it might be irrelevant. ACK [14:53:35] alright, thanks. I had no idea. 
It just looks odd when you are used to mw appservers [14:53:57] RECOVERY - Host mw2384 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [14:55:32] yeah I've got a meeting coming up but I'll see about marking those as pooled after poking around and making sure there's not a reason they're de-pooled at the pybal level [14:55:47] cool, thanks ryankemper [14:57:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) a:03Dzahn [14:57:17] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) mwmaint1002 done [14:58:44] will also be looking into why readahead mitigation wasn't being applied; at first glance it looks like the associated timer `elasticsearch-disable-readahead.timer` fired once when we first put it in place ~12 days ago but isn't re-firing every 30 mins as intended [14:59:44] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "[mwmaint2002:~] $ /usr/bin/ldapsearch -x "uid=samtar*" confirms it has been changed in LDAP to this new address" [puppet] - 10https://gerrit.wikimedia.org/r/704440 (https://phabricator.wikimedia.org/T286624) (owner: 10Samtar) [15:01:30] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10RobH) Summary of troubleshooting so far to see why this was throttling the CPU: * update idrac and bios firmware to latest revisions ** this did not fix the issue when the host was returned to service post update * compar... 
[15:02:12] (03CR) 10Cwhite: "Are these metrics generated into a histogram by service-template-node?https://github.com/wikimedia/service-template-node/blob/master/lib/u" [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:03:48] (03PS1) 10JMeybohm: scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 [15:05:37] q:q! [15:05:51] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:06:31] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [15:06:38] (03PS2) 10JMeybohm: scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 [15:06:40] (03PS1) 10JMeybohm: scaffold: Enable monitoring in scaffold fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 [15:07:04] (03CR) 10jerkins-bot: [V: 04-1] scaffold: Enable monitoring in scaffold fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 (owner: 10JMeybohm) [15:07:14] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [15:07:29] (03PS1) 10Filippo Giunchedi: swift: handle bytes vs str for subprocess' output [puppet] - 10https://gerrit.wikimedia.org/r/704551 (https://phabricator.wikimedia.org/T285835) [15:08:08] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [15:08:24] (03CR) 10JMeybohm: Initial image-suggestion-api helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 
(https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:08:57] PROBLEM - mediawiki-installation DSH group on mw2384 is CRITICAL: Host mw2384 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:09:22] 10SRE: urldownloader2002 running out of disk space in root partition - https://phabricator.wikimedia.org/T286525 (10Dzahn) apt-get clean reduced disk usage a bit more http://squid-web-proxy-cache.1019090.n4.nabble.com/Netdb-state-too-big-td4687147.html [15:09:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:09:55] 10SRE: urldownloader2002 running out of disk space in root partition - https://phabricator.wikimedia.org/T286525 (10Dzahn) ` > Yes, but do you need that at all? > > http://www.squid-cache.org/mail-archive/squid-users/200007/0384.html > seems to imply that it's only useful in a parent-child setup (or cache > hier... [15:10:35] (03PS16) 10Nikki Nikkhoui: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) [15:10:57] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Expected to fail as it tests for an existing bug in the current scaffold version. 
Fix is in https://gerrit.wikimedia.org/r/c/operations/de" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 (owner: 10JMeybohm) [15:10:59] volans: only a small followup change for the usual bytes vs str I didn't catch (the rest works though) if you have the time https://gerrit.wikimedia.org/r/c/operations/puppet/+/704551 [15:11:06] (03CR) 10JMeybohm: [C: 03+2] scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 (owner: 10JMeybohm) [15:11:19] (03CR) 10jerkins-bot: [V: 04-1] scaffold: Enable monitoring in scaffold fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 (owner: 10JMeybohm) [15:11:21] (03CR) 10jerkins-bot: [V: 04-1] scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 (owner: 10JMeybohm) [15:11:46] or anyone really [15:12:05] (03PS3) 10Muehlenhoff: idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 [15:12:20] godog: in a meeting, I can look later on [15:12:24] (03CR) 10JMeybohm: [C: 03+1] Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:12:44] volans: ack, thanks [15:13:04] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10wiki_willy) a:05LSobanski→03Cmjohnson Moving over to @Cmjohnson, who will be back before John next week. Thanks, Willy [15:13:34] 10SRE: Update email address for samtar in ldap users - https://phabricator.wikimedia.org/T286624 (10Dzahn) 05Open→03Resolved a:03Dzahn change has been merged. So this should be resolved. confirmed in LDAP it already matches the new address as well. [15:13:38] but from a first look, wouldn't .encode/.decode be quicker? 
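(editor's sketch) The "bytes vs str" issue in this exchange is the usual Python 3 one: `subprocess.check_output` returns `bytes`, so code ported from Python 2 needs an explicit `.decode()` before treating the output as text. The snippet below is only an illustration of the `.decode` approach suggested in the channel, not the actual swift patch; the `run_text` helper and the child command are hypothetical.

```python
import subprocess
import sys


def run_text(cmd):
    """Run cmd and return its stdout as str instead of bytes."""
    raw = subprocess.check_output(cmd)  # bytes under Python 3
    return raw.decode("utf-8")


# Hypothetical command standing in for a stats script: it prints one "key: value" line.
out = run_text([sys.executable, "-c", "print('objects: 42')"])
count = int(out.split(":")[1])  # str methods now work; int() tolerates surrounding whitespace
print(count)
```

Passing `text=True` to `check_output` (Python 3.7+) is another way to get `str` back without calling `.decode()` by hand.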
[15:13:45] 10SRE: Update email address for samtar in ldap users - https://phabricator.wikimedia.org/T286624 (10Dzahn) a:05Dzahn→03None [15:14:18] (03CR) 10Kormat: [C: 03+1] "One non-binding comment." [puppet] - 10https://gerrit.wikimedia.org/r/704551 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [15:14:55] (03CR) 10jerkins-bot: [V: 04-1] idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [15:15:52] (03PS11) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [15:15:54] (03PS4) 10Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [15:15:56] (03PS1) 10Elukey: istio: improve base config.yaml for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/704552 (https://phabricator.wikimedia.org/T278192) [15:16:05] volans: sure that'd work too [15:16:26] kormat: I'm wondering if your comment is also bytes.decode ? (the comment is missing) [15:16:32] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:16:57] !log installing apache security updates on lists1001 (lists.wikimedia.org) [15:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] godog: it was. thanks, gerrit. no idea how that went missing. [15:17:47] lolz, thank you both I'll fix it [15:18:00] (03PS1) 10Ottomata: Fix express_router_request_duration_seconds precomputed promtheus query [puppet] - 10https://gerrit.wikimedia.org/r/704554 (https://phabricator.wikimedia.org/T272714) [15:18:14] (03CR) 10Ottomata: "Yes, and I got the metric naming wrong." 
[puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:18:44] (03CR) 10JMeybohm: [C: 03+2] Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:19:00] (03PS2) 10Filippo Giunchedi: swift: handle bytes vs str for subprocess' output [puppet] - 10https://gerrit.wikimedia.org/r/704551 (https://phabricator.wikimedia.org/T285835) [15:19:39] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01039 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:19:45] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01155 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:19:55] (03CR) 10Kormat: [C: 03+1] swift: handle bytes vs str for subprocess' output [puppet] - 10https://gerrit.wikimedia.org/r/704551 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [15:20:07] (03CR) 10Ottomata: "Here's what comes out of service-template-node for this metric:" [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:20:23] (03CR) 10Ottomata: [C: 03+2] Fix express_router_request_duration_seconds precomputed promtheus query [puppet] - 10https://gerrit.wikimedia.org/r/704554 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:20:41] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 1138 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:21:11] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:17] (03Merged) 
10jenkins-bot: Initial image-suggestion-api helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [15:21:20] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you volans + kormat" [puppet] - 10https://gerrit.wikimedia.org/r/704551 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [15:23:01] (03PS3) 10JMeybohm: scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 [15:23:03] (03PS2) 10JMeybohm: scaffold: Enable monitoring in scaffold fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 [15:24:12] (03CR) 10JMeybohm: [C: 03+2] "> Patch Set 1: Verified+2 Code-Review+2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 (owner: 10JMeybohm) [15:25:06] (03PS1) 10Dzahn: site/conftool: turn mw1423,mw1424,mw1425 into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/704556 (https://phabricator.wikimedia.org/T279309) [15:25:25] (03PS4) 10Muehlenhoff: idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 [15:25:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:26:20] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10BTullis) I have created myself a kerberos principal with the following command: ` btullis@krb1001:~$ sudo manage_principals.py create btullis --email_add... 
[15:26:37] (03Merged) 10jenkins-bot: scaffold: Fix trimming in of if blocks in annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/704549 (owner: 10JMeybohm) [15:26:49] (03Merged) 10jenkins-bot: scaffold: Enable monitoring in scaffold fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/704550 (owner: 10JMeybohm) [15:27:20] (03CR) 10Dzahn: "I see, please ignore my nitpick." [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup) [15:27:49] (03CR) 10Dzahn: [C: 03+1] role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [15:27:51] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:28:19] !log Start server-side upload of 3 large image files (T285708) [15:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:26] T285708: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T285708 [15:30:29] (03CR) 10Elukey: Add support for knative serving (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [15:31:28] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/704548 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:31:41] (03CR) 10Cwhite: [C: 03+1] Fix express_router_request_duration_seconds precomputed promtheus query [puppet] - 10https://gerrit.wikimedia.org/r/704554 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [15:32:21] (03PS1) 10Ottomata: Remove deprecated eventgate validation erroro throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/704558 [15:32:41] RECOVERY - k8s API server requests latencies on 
kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:33:47] (03CR) 10Ottomata: [C: 03+2] Remove deprecated eventgate validation erroro throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/704558 (owner: 10Ottomata) [15:34:25] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [15:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] !log installing apache security updates on otrs1001 (ticket.wikimedia.org) [15:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] !log [Elastic] Manually triggering readahead mitigation across whole fleet to prevent any further issues today: `ryankemper@cumin1001:~$ sudo cumin -b 12 'P{elastic*}' 'sudo systemctl restart elasticsearch-disable-readahead.service'` (still need to investigate why `elasticsearch-disable-readahead.timer` isn't re-firing every 30 mins as desired) [15:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:04] !log deploying eventgate-analytics with direct service-runner promethues support [15:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:27] !log installing klibc security updates [15:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:34] (03PS1) 10Ladsgroup: librenms: Drop absented crons [puppet] - 10https://gerrit.wikimedia.org/r/704559 (https://phabricator.wikimedia.org/T273673) [15:38:40] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10JMeybohm) [15:38:59] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704559 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [15:39:24] 
(03CR) 10Ahmon Dancy: [C: 03+2] TranslationAid: Handle empty message definition [extensions/Translate] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704404 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [15:39:28] (03CR) 10Ahmon Dancy: [C: 03+2] TranslationAid: Make sure to return successfully fetched definitions [extensions/Translate] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704405 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [15:40:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10JMeybohm) [15:41:28] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10JMeybohm) [15:42:07] (03CR) 10Ladsgroup: "PCC seems happy: https://puppet-compiler.wmflabs.org/compiler1003/853/" [puppet] - 10https://gerrit.wikimedia.org/r/704559 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [15:42:17] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10JMeybohm) [15:43:33] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005774 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:45:05] 10SRE, 10Anti-Harassment, 10Traffic, 10serviceops: Add IP Info (ASN & Geolocation) to requests to MediaWiki - https://phabricator.wikimedia.org/T251933 (10Niharika) [15:47:27] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005196 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:48:45] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - 
https://phabricator.wikimedia.org/T286065 (10JMeybohm) [15:49:08] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10JMeybohm) [15:56:32] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [15:57:40] (03PS1) 10Btullis: Enable kerberos for btullis [puppet] - 10https://gerrit.wikimedia.org/r/704562 (https://phabricator.wikimedia.org/T285754) [15:59:11] (03PS1) 10Ottomata: Tune Refine jobs in production hadoop [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) [15:59:28] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704563 [15:59:40] (03CR) 10jerkins-bot: [V: 04-1] Tune Refine jobs in production hadoop [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:00:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704562 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [16:00:35] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30220/console" [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:01:29] (03CR) 10Muehlenhoff: [C: 03+2] Deploy systemd-login logout.d script fleet-wide [puppet] - 10https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [16:01:46] (03Merged) 10jenkins-bot: TranslationAid: Handle empty message definition [extensions/Translate] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704404 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [16:01:49] (03Merged) 10jenkins-bot: TranslationAid: Make sure to return successfully fetched definitions [extensions/Translate] (wmf/1.37.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/704405 (https://phabricator.wikimedia.org/T285830) (owner: 10Abijeet Patro) [16:06:40] (03CR) 10RLazarus: httpbb: add tests for noc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [16:07:10] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:18] btullis: standup? [16:07:22] oops wrong room [16:07:45] (03PS2) 10Ottomata: Tune Refine jobs in production hadoop [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) [16:08:56] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30221/console" [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:11:05] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Translate: Backport: [[gerrit:704404|TranslationAid: Handle empty message definition (T285830)]] and [[gerrit:704405|TranslationAid: Make sure to return successfully fetched definitions (T285830)]] (duration: 01m 09s) [16:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] T285830: TypeError: trim() expects parameter 1 to be string, null given - https://phabricator.wikimedia.org/T285830 [16:12:26] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is CRITICAL: 0.2321 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [16:12:43] what, again? looking [16:12:51] oooof [16:12:53] deploy? 
[16:12:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_analytics_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:57] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:12:57] whoa, that's sudden [16:13:09] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.5938 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:13:28] dancy: can you roll back while we investigate please [16:13:32] yes [16:13:32] (03CR) 10Joal: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:13:33] rzl: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=10&orgId=1 [16:13:46] may be eventgate related [16:13:50] ottomata: o/ [16:13:51] oh yep [16:13:53] very large spike [16:14:04] ottomata: can you roll back please [16:14:08] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5565 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [16:14:09] but it may be only a spike [16:14:13] yeah [16:14:24] only on api [16:14:37] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:14:43] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_4592: Servers kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:14:43] ottomata: hm, hold that thought, don't roll back actually [16:14:54] latency is recovering [16:14:54] looks like it was a spike related to the deploy itself, yeah [16:15:05] I'll hold on the rollback as well [16:15:06] dancy: hold off too please [16:15:07] but it matches with eventgate's request [16:15:08] <3 [16:15:15] *requests [16:15:24] dancy: yours is probably innocent anyway, just unlucky timing :) sorry for the adrenaline [16:15:39] ottomata: can you confirm that you're here and looking please [16:15:43] also see eventgate's p99 
https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=14&orgId=1 [16:15:56] ins tandup reading backscroll [16:16:03] p50 went up a lot [16:16:29] ottomata: it seems that eventgate-analytics is showing up a regression in latency right after the deployment [16:16:35] see my graph above [16:16:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_4592: Servers kubernetes2007.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa [16:16:46] yeah --^ [16:16:52] something weird happened starting 16:09 [16:17:15] hMMMm [16:17:26] jynus: Andrew deploy at :07 afaics [16:17:26] ok FYI, this deploy is about direct promethues support [16:17:33] rather than using the statsd proemtheus exporter sidecare [16:17:48] is it possible that the addition of the prometheus http server in the eventgate container is causing this problem? [16:17:58] mediawiki channel EventBus (Unable to deliver all events: 504: Gateway Timeout) [16:18:17] i think i should rollback [16:18:20] ottomata: +1 [16:18:33] seems too unstable right now [16:18:44] uhhh been a long time, can I rollback with just a helmfile or do I need a revert? [16:18:48] helmfile command [16:19:13] jynus: ^^ ? [16:19:17] no idea, never done it :( [16:19:26] I don't know what helmfile is [16:19:28] found [16:19:28] https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_changes [16:19:29] reading [16:19:40] you do need a revert [16:19:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:59] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:13] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:14] hm, is this recovering ^ ? [16:20:37] is it possible that just the pod churn caused this? [16:20:49] it's been up and down a few times -- please prepare the revert, and we can decide before deploying it if it's stabilized [16:20:51] and that a rollback deploy would cause it to happen? [16:20:53] ok [16:21:04] I got some hints that envoy is doing circuit breaking, so mw is recovering :) [16:21:21] hmm, oof the revert is several commits, been testing in staging for days HMmMM [16:21:23] its a chart change [16:21:25] not just helmfile [16:21:53] elukey: yeah I suspect the circuit break may be flapping, and that's why we keep recovering and un-recovering [16:21:53] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:21:58] ok if I do the emergency helm based rollback? [16:22:33] oh [16:22:36] i thnk it auto rolled back [16:22:39] because the deploy failed [16:22:48] 4 Wed Jul 14 16:07:12 2021 SUPERSEDED eventgate-0.3.1 Upgrade "production" failed: timed out waiting for the co... [16:22:48] 5 Wed Jul 14 16:17:18 2021 DEPLOYED eventgate-0.2.14 Rollback to 3 [16:22:54] Fancy [16:23:04] the situation seems stable from the eventgate-analytics latency point of view [16:23:20] yeah if i do a helmfile diff on codfw [16:23:26] the change i tried to deploy is not applied [16:23:29] so, it auto rolled back! [16:23:30] cool! 
[16:23:43] that's a fancy feature [16:24:04] yeah, looking okay from here [16:24:19] hm [16:24:27] there was however, like 2 other spikes of latency about the main one, trials maybe? [16:25:08] only on api hosts [16:25:08] hm, haven't done anything to eventgate-main [16:25:32] the previous spike is unrelated (local port 80?) [16:25:57] s/about/after/ [16:25:58] you mean the ones an hourish ago? [16:26:01] oh [16:26:03] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1733 hosts [16:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:21] could be just an artefact of lvs [16:26:30] i think the deploy and rollback took many minutes to complete [16:26:38] ottomata: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=9&from=1626276391814&orgId=1&to=1626279991814&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [16:26:50] 10+minutes? 
[16:26:52] I think a lot of that is due to retries and circuit-breaking [16:26:55] from Envoy's POV: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=1626278199112&to=1626279999112&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-analytics [16:26:59] those are not worrying just in case it helps debugging [16:27:05] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 1733 hosts [16:27:06] (03CR) 10Klausman: [C: 03+1] istio: improve base config.yaml for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/704552 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [16:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:10] sorry, s/a lot of that/a lot of the oscillation in latency/ [16:27:12] yeah the rollback didn't finsh until 16:17 [16:27:17] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash-codfw,logstash7-codfw} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic=udp_localhost-err https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource= [16:27:17] ar-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:27:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10nskaggs) @wiki_willy If a server replacement is likely, we can wait until Dell resolves the issue completely. We would prefer not to put a potentially failing server back... 
[16:27:29] or, at least that's the timestamp in helm history [16:28:37] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:28:38] (03PS1) 10Ryan Kemper: elastic: Fix timer to fire continually [puppet] - 10https://gerrit.wikimedia.org/r/704567 (https://phabricator.wikimedia.org/T264053) [16:29:33] (03PS2) 10Ryan Kemper: elastic: Fix timer to fire continually [puppet] - 10https://gerrit.wikimedia.org/r/704567 (https://phabricator.wikimedia.org/T264053) [16:29:40] then there are ripple effects: sometimes, when app servers get higher latency, mysql connections are not freed and sometimes get saturated and create more issues back to all app servers [16:30:14] but I think that was minimal in this case [16:30:23] (03CR) 10Nskaggs: [C: 03+1] Allow access to nfs dumps for research-collaborations-api [puppet] - 10https://gerrit.wikimedia.org/r/704381 (https://phabricator.wikimedia.org/T286635) (owner: 10Reedy) [16:31:01] I think we're okay to call this resolved from MW's perspective, ottomata do you still need anything for incident response? 
[16:31:29] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704567 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [16:33:13] dancy: and just to close the loop, you're fine to leave your thing deployed, sorry for the quick guess :D [16:34:47] rzl: well....gotta figure out how to figure out what went wrong [16:34:53] 👍🏾 [16:34:54] this is deployed in eqiad right now [16:35:04] so if dc was switched back, we'd prob see it happen again [16:35:07] (03CR) 10Elukey: [C: 03+2] istio: improve base config.yaml for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/704552 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [16:36:10] ottomata: okay that's good to know [16:36:30] ottomata: and, obviously there's debugging to do but I meant in the immediate "we all just got paged, drop everything and fix it" sense of incident response :) [16:36:45] rzl: yeah i think in that case things are stable atm [16:36:49] emergency over anyway [16:36:55] thanks so much [16:37:13] should I do anything? do we need a doc with explanation? [16:38:13] I think an incident report would be a good idea, the lightweight one is probably plenty - https://wikitech.wikimedia.org/wiki/Incident_response/Lightweight_report [16:38:35] if you don't mind starting a draft I'm sure we can help fill in some detials [16:38:37] *details [16:39:00] k wil do! [16:39:28] thanks! [16:42:27] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:46:26] (03PS1) 10Ahmon Dancy: Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704382 (https://phabricator.wikimedia.org/T286521) [16:52:33] rzl https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-14_eventgate-analytics_latency_spike_caused_MW_app_server_overload [16:52:43] 👍 [16:52:45] (03CR) 10Ahmon Dancy: [C: 03+2] Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704382 (https://phabricator.wikimedia.org/T286521) (owner: 10Ahmon Dancy) [16:53:07] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10RobH) FYI: My understanding is with a crash cart, one can activate the lifecycle controller manually on rebooting the host in POST and then fail backwards to the last version of any firmware pushed. This option... 
[16:54:56] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Tune Refine jobs in production hadoop [puppet] - 10https://gerrit.wikimedia.org/r/704563 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [16:57:41] (03Merged) 10jenkins-bot: Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704382 (https://phabricator.wikimedia.org/T286521) (owner: 10Ahmon Dancy) [17:00:54] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/CentralAuth/includes/specials/SpecialCentralAutoLogin.php: Backport: [[gerrit:704382|Do not lock preferences row for a rememberpassword check (T286521)]] (duration: 01m 05s) [17:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:04] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [17:01:41] (03CR) 10Jforrester: [C: 03+1] "Makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [17:05:16] (03PS1) 10Inductiveload: Enable ProofreadPage status Change Tags on Beta Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) [17:05:24] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Update config dir behaviour, add root passwords [puppet] - 10https://gerrit.wikimedia.org/r/704518 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [17:06:47] (03CR) 10Tpt: [C: 03+1] Enable ProofreadPage status Change Tags on Beta Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [17:11:01] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.517e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [17:11:29] (03CR) 10Tpt: [C: 
03+1] "@Roan @Martin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [17:20:46] (03PS1) 10Muehlenhoff: Install generic systemd-logout.d logout script [puppet] - 10https://gerrit.wikimedia.org/r/704578 [17:21:36] (03CR) 10Urbanecm: [C: 03+2] "beta-only change, per request by ProofreadPage maintainer " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [17:21:52] (03PS1) 10Ahmon Dancy: Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704383 (https://phabricator.wikimedia.org/T286521) [17:22:26] (03Merged) 10jenkins-bot: Enable ProofreadPage status Change Tags on Beta Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [17:23:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704578 (owner: 10Muehlenhoff) [17:26:06] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [17:26:46] (03CR) 10Ahmon Dancy: [C: 03+2] Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704383 (https://phabricator.wikimedia.org/T286521) (owner: 10Ahmon Dancy) [17:27:22] (03PS2) 10Muehlenhoff: Install generic systemd-logind logout.d script [puppet] - 10https://gerrit.wikimedia.org/r/704578 [17:28:37] (03PS2) 10Bstorm: Remove legacy wiki replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/704348 (https://phabricator.wikimedia.org/T260389) (owner: 10Nskaggs) [17:30:45] (03CR) 10Muehlenhoff: [C: 03+2] Install generic systemd-logind logout.d script [puppet] - 10https://gerrit.wikimedia.org/r/704578 (owner: 10Muehlenhoff) 
[17:31:56] (03CR) 10Bstorm: [C: 03+2] Remove legacy wiki replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/704348 (https://phabricator.wikimedia.org/T260389) (owner: 10Nskaggs) [17:32:09] (03Merged) 10jenkins-bot: Do not lock preferences row for a rememberpassword check [extensions/CentralAuth] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704383 (https://phabricator.wikimedia.org/T286521) (owner: 10Ahmon Dancy) [17:34:41] (03Merged) 10jenkins-bot: Remove legacy wiki replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/704348 (https://phabricator.wikimedia.org/T260389) (owner: 10Nskaggs) [17:35:40] !log dancy@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/CentralAuth/includes/specials/SpecialCentralAutoLogin.php: Backport: [[gerrit:704383|Do not lock preferences row for a rememberpassword check (T286521)]] (duration: 01m 06s) [17:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:47] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [17:35:55] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 1733 hosts [17:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] (03PS3) 10Bstorm: cloud nfs: cleaning up the non-drbd setup [puppet] - 10https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) [17:39:33] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid's jvm daemons. 
- razzi@cumin1001 [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:44] (03CR) 10Volans: [C: 03+1] "LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [17:42:39] (03CR) 10Muehlenhoff: idm.logout: Remove the confirmation on failure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [17:42:42] (03PS5) 10Muehlenhoff: idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 [17:47:57] (03CR) 10Muehlenhoff: [C: 03+2] idm.logout: Remove the confirmation on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/704543 (owner: 10Muehlenhoff) [17:49:40] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 10 hosts [17:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Muehlenhoff out of all services on: 10 hosts [17:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:27] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 10 hosts [17:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Muehlenhoff out of all services on: 10 hosts [17:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:58] (03PS1) 10Jcrespo: mediabackup: Add monitoring to the minio storage server process [puppet] - 10https://gerrit.wikimedia.org/r/704580 (https://phabricator.wikimedia.org/T276442) [17:59:43] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Add monitoring to the minio storage server process [puppet] - 10https://gerrit.wikimedia.org/r/704580 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [18:00:05] dancy and brennen: How 
many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:04:10] That's missing a . And space [18:07:15] (03PS1) 10Jcrespo: mediabackup: Update check name to not have spaces [puppet] - 10https://gerrit.wikimedia.org/r/704582 (https://phabricator.wikimedia.org/T276442) [18:08:51] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Update check name to not have spaces [puppet] - 10https://gerrit.wikimedia.org/r/704582 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [18:10:22] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.004498 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [18:13:06] PROBLEM - MinIO server processes on backup1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name /usr/local/bin/minio server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:13:33] ^thats me doing tests [18:14:10] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] (03PS1) 10Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) [18:16:57] (03CR) 10jerkins-bot: [V: 04-1] systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) 
[18:18:47] (03PS2) 10Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) [18:19:37] (03PS1) 10Jcrespo: mediabackup: Update the method of detecting minio server processes [puppet] - 10https://gerrit.wikimedia.org/r/704585 (https://phabricator.wikimedia.org/T276442) [18:20:23] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Update the method of detecting minio server processes [puppet] - 10https://gerrit.wikimedia.org/r/704585 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [18:20:28] (03PS2) 10Jcrespo: mediabackup: Update the method of detecting minio server processes [puppet] - 10https://gerrit.wikimedia.org/r/704585 (https://phabricator.wikimedia.org/T276442) [18:24:35] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Update the method of detecting minio server processes [puppet] - 10https://gerrit.wikimedia.org/r/704585 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [18:24:46] (03PS1) 10Jcrespo: mariabackup: Fix config dir variable interpolation on module [puppet] - 10https://gerrit.wikimedia.org/r/704587 (https://phabricator.wikimedia.org/T276442) [18:25:26] RECOVERY - MinIO server processes on backup1004 is OK: PROCS OK: 1 process with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:26:13] I am going to do another test, to make sure the check is working as intended, I will make backup1004 alert and recover soon after it [18:27:53] PROBLEM - MinIO server processes on backup1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name /usr/local/bin/minio server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:29:07] RECOVERY - MinIO server processes on backup1005 is OK: PROCS OK: 1 process with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:29:37] PROBLEM - MinIO server processes on backup2007 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name /usr/local/bin/minio server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:29:42] (03PS1) 10Ottomata: eventgate-analytics - set num_workers: 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704588 (https://phabricator.wikimedia.org/T272714) [18:30:36] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [18:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:09] (03CR) 10Ottomata: "There's a bug in the eventgate-wikimedia + node-rdkafka-prometheus integration that causes service-runner prometheus to break when num_wor" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704588 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [18:31:14] I will stay around until the full deployment finishes, and make sure everything is green [18:34:31] (03CR) 10Jcrespo: [C: 03+2] mariabackup: Fix config dir variable interpolation on module [puppet] - 10https://gerrit.wikimedia.org/r/704587 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [18:34:33] (03CR) 10Ottomata: [C: 03+1] Don't show Kerberos ticket info in general [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [18:36:25] !log nskaggs@cumin1001 Added views for new wiki: banwikisource T284390 [18:36:25] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [18:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:31] T284390: Prepare and check storage layer for banwikisource - https://phabricator.wikimedia.org/T284390 [18:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:32] (03CR) 10Tpt: "> Merged. 
To answer the question, for beta changes, it's not strictly necessary to list them in a backport window (although that's possibl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704574 (https://phabricator.wikimedia.org/T286663) (owner: 10Inductiveload) [18:43:17] (03PS3) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [18:47:17] RECOVERY - MinIO server processes on backup2007 is OK: PROCS OK: 1 process with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [18:47:40] (03CR) 10Volans: "couple of comment/questions inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [18:49:28] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10mmodell) >>! In T285232#7199879, @Joe wrote: > I am starting to think that we should just mount a volume from the... 
[18:50:08] we should be ok now Re: minio server processes and I expect no more tests from now on [18:54:43] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:54:44] !log nskaggs@cumin1001 END (ERROR) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=97) [18:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:28] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [18:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:42] (03PS1) 10Ladsgroup: Fix deprecated offset() on invalid DOM [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704606 (https://phabricator.wikimedia.org/T185629) [19:00:05] dancy and brennen: (Dis)respected human, time to deploy MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1900). Please do the needful. [19:05:05] (03CR) 10Bstorm: "This will require one server-side change before merge. Let me get that." [puppet] - 10https://gerrit.wikimedia.org/r/704381 (https://phabricator.wikimedia.org/T286635) (owner: 10Reedy) [19:08:28] o/ [19:08:47] Hey Amir [19:08:49] Once the train has been deployed to group1, let me know, I have a quick backport to deploy [19:09:03] I'm also around for any issues with wikidata etc. [19:09:05] Train is still blocked at testwikis. :-( [19:09:10] :( [19:09:17] due to T286521 [19:09:18] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [19:09:23] (from .12) [19:09:44] is it okay if I make it UBN? [19:09:58] Sure. 
[19:10:44] then I quickly backport my patch :D [19:11:14] (03CR) 10Ladsgroup: [C: 03+2] Fix deprecated offset() on invalid DOM [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704606 (https://phabricator.wikimedia.org/T185629) (owner: 10Ladsgroup) [19:17:44] !log nskaggs@cumin1001 Added views for new wiki: dagwiki T284456 [19:17:44] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [19:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:51] T284456: Prepare and check storage layer for dagwiki - https://phabricator.wikimedia.org/T284456 [19:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:02] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/704381 (https://phabricator.wikimedia.org/T286635) (owner: 10Reedy) [19:24:20] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) All hosts have been setup including TLS. For now we are using Puppet's CA and certs, which for an internal service with internal IPs and that should n... [19:26:50] !log andrew@deploy1002 Started deploy [horizon/deploy@156a984]: fix trove-dashboard bug [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:05] (03CR) 10Cwhite: [C: 03+1] "Seems worth it to give it a shot." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/704588 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [19:28:46] (03CR) 10Ottomata: "Comment from Petr:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704588 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [19:28:51] (03Merged) 10jenkins-bot: Fix deprecated offset() on invalid DOM [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704606 (https://phabricator.wikimedia.org/T185629) (owner: 10Ladsgroup) [19:30:53] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:31:08] !log andrew@deploy1002 Finished deploy [horizon/deploy@156a984]: fix trove-dashboard bug (duration: 04m 18s) [19:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:39] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.14/resources: Backport: [[gerrit:704606|Fix deprecated offset() on invalid DOM (T185629)]] (duration: 01m 07s) [19:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:46] T185629: jquery.ui.ooMenu.js should not cause warning "jQuery.fn.offset() requires an element connected to a document" - https://phabricator.wikimedia.org/T185629 [19:53:22] (03PS2) 10RLazarus: icinga: Performance improvements to icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) [19:55:36] (03CR) 10RLazarus: icinga: Performance improvements to icinga-status (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:00:04] dancy and brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T1900). 
[20:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T2000). Please do the needful. [20:00:18] hmm. [20:00:56] hrm, is something up with the deploy calendar? jouncebot seems a bit confused. [20:01:21] (03PS1) 10Jcrespo: mediabackup: Remove deleted directories and files [puppet] - 10https://gerrit.wikimedia.org/r/704599 (https://phabricator.wikimedia.org/T276442) [20:01:23] (03PS1) 10Jcrespo: mediabackup: Enable prometheus monitoring of minio [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) [20:01:28] The calendar itself looks fine. [20:07:37] (03CR) 10Jcrespo: "I have no idea what I am doing, but I feel @godog should be made aware of what I am trying to do before I break something! 0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [20:09:07] (03CR) 10Jcrespo: [C: 04-1] "Oh, we need to open the firewall from the prometheus servers, but that is an easy fix. What about the other stuff?" [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [20:11:53] dancy, brennen: The bot selects all windows that overlap an announce time. The train entry is set to run from 19:00Z-21:00Z. When the 20:00Z window's time hit the lookup in the data set found 2 things to announce because of the overlap. [20:12:03] ah, yeah. [20:12:06] makes sense. [20:12:26] I had to go read the code to be sure that this was the reason though :) [20:15:34] Thanks for the info! 
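Note on the double announcement at 20:00Z: the explanation above (the bot announces every window that overlaps the announce time) can be sketched as below. This is a minimal illustration, not jouncebot's actual code. The `Window` type, the `utc` helper, the `windows_to_announce` function, and the half-open interval test are assumptions, and the 21:00Z end time of the Services window is made up for the example.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Window(NamedTuple):
    """A deployment calendar entry (hypothetical structure)."""
    title: str
    start: datetime
    end: datetime


def utc(hour: int) -> datetime:
    """Build a timestamp on 2021-07-14 at the given UTC hour."""
    return datetime(2021, 7, 14, hour, 0, tzinfo=timezone.utc)


def windows_to_announce(windows: List[Window], announce_time: datetime) -> List[Window]:
    """Return every window whose [start, end) interval contains announce_time."""
    return [w for w in windows if w.start <= announce_time < w.end]


WINDOWS = [
    # The train entry runs 19:00Z-21:00Z, as stated in the log.
    Window("MediaWiki train - American Version", utc(19), utc(21)),
    # End time assumed for illustration only.
    Window("Services – Graphoid / ORES", utc(20), utc(21)),
]

# At 20:00Z both entries overlap the announce time, so both are announced.
for w in windows_to_announce(WINDOWS, utc(20)):
    print(w.title)
```

Under these assumptions, a lookup at 19:00Z returns only the train entry, while a lookup at 20:00Z returns both, which matches the two announcements seen at 20:00:04. Matching on the start time alone (`w.start == announce_time`) would announce each window once.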
[20:19:53] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:28:55] (03PS1) 10Ladsgroup: Move saving user options to onTransactionPreCommitOrIdle [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704608 (https://phabricator.wikimedia.org/T286521) [20:29:12] (03PS1) 10Ladsgroup: Move saving user options to onTransactionPreCommitOrIdle [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704609 (https://phabricator.wikimedia.org/T286521) [20:30:04] dancy: I made a patch that allegedly fixes the train blocker, should I backport it to wmf.12 too? [20:30:09] I can deploy it now [20:30:13] Yes please! [20:30:35] (03CR) 10Ladsgroup: [C: 03+2] "UBN!" [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704609 (https://phabricator.wikimedia.org/T286521) (owner: 10Ladsgroup) [20:30:42] (03CR) 10Ladsgroup: [C: 03+2] "UBN!" [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704608 (https://phabricator.wikimedia.org/T286521) (owner: 10Ladsgroup) [20:31:10] being deployed then [20:31:37] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:32:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10wiki_willy) Ok, thanks @nskaggs. They're currently processing a server replacement. Simultaneously, especially with the long lead times for new servers, there's one mor... [20:35:26] dancy: once that's deployed, it takes a while to figure out if it really fixed the problem or not [20:35:39] nod. [20:35:50] do you want to roll the train right away or wait until tomorrow a couple of hours? [20:37:10] Lemme check with brennen. 
[20:38:57] it doesn't make a huge user-facing impact [20:39:12] it's like 4 errors every hour [20:48:06] Amir1: I'm going to hold off on rolling forward since I need to go afk soon and brennen isn't around at the moment to cover me. [20:48:41] oh, he's back. [20:48:47] ok, moving forward then. [20:48:48] yeah, here [20:49:21] haha [20:49:23] cool [20:49:28] I'm around [20:49:42] (03PS1) 10Ahmon Dancy: group0 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704604 [20:49:44] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704604 (owner: 10Ahmon Dancy) [20:49:46] (03PS1) 10Andrew Bogott: Galera: increase number of allowed connections [puppet] - 10https://gerrit.wikimedia.org/r/704605 [20:50:23] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704604 (owner: 10Ahmon Dancy) [20:51:36] (03Merged) 10jenkins-bot: Move saving user options to onTransactionPreCommitOrIdle [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704609 (https://phabricator.wikimedia.org/T286521) (owner: 10Ladsgroup) [20:51:49] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.14 [20:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:09] ok. .14 is at group0. [20:52:32] (03Merged) 10jenkins-bot: Move saving user options to onTransactionPreCommitOrIdle [core] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704608 (https://phabricator.wikimedia.org/T286521) (owner: 10Ladsgroup) [20:53:00] (03PS2) 10Andrew Bogott: Galera: increase number of allowed connections [puppet] - 10https://gerrit.wikimedia.org/r/704605 (https://phabricator.wikimedia.org/T286675) [20:53:15] cool, will monitor for new breakage. 
[20:58:16] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.14/includes/user/User.php: Backport: [[gerrit:704608|Move saving user options to onTransactionPreCommitOrIdle (T286521)]] (duration: 01m 05s) [20:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:23] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [21:04:38] Amir1: did the .12 backport for that get deployed? just noticed a deadlock error there. [21:04:58] brennen: testing it on mwdebug1002 atm [21:05:02] mwdebug2002 [21:05:06] Amir1: ack, thx [21:06:47] so far it looks like it doesn't bring wikipedia down, so I roll forward now [21:06:52] always a good sign [21:08:03] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.12/includes/user/User.php: Backport: [[gerrit:704609|Move saving user options to onTransactionPreCommitOrIdle (T286521)]] (duration: 01m 05s) [21:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:10] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521 [21:13:21] brennen: okay the patch is there, do you want to move group1? [21:13:25] or later? [21:14:56] Amir1: yeah, everything else seems quiet. let's go ahead and give it a try on group1 before end-of-US-daytime. [21:15:08] coool [21:15:26] rolling forward shortly [21:16:35] (03PS1) 10Brennen Bearnes: group1 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704631 [21:16:37] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704631 (owner: 10Brennen Bearnes) [21:17:15] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704631 (owner: 10Brennen Bearnes) [21:17:49] (03CR) 10Dave Pifke: [C: 03+1] "LGTM.
I re-applied in deployment-prep, ran `sudo crontab -u xenon -r`, ran `sudo puppet agent -t` again (to make sure the cron deletion s" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:18:37] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.14 [21:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:42] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.14 (duration: 01m 05s) [21:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:05] (03CR) 10Andrew Bogott: [C: 03+2] Galera: increase number of allowed connections [puppet] - 10https://gerrit.wikimedia.org/r/704605 (https://phabricator.wikimedia.org/T286675) (owner: 10Andrew Bogott) [21:24:09] (03CR) 10Dave Pifke: Fix NavtimingStaleBeacon false alarms, add test (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [21:25:48] Amir1: well, i notice a fresh deadlock error for .14 at 21:20 UTC. [21:26:37] brennen: hmm, okay, this might need to be a deferred update then [21:26:41] so i guess that's still a live issue. everything else seems stable, and it doesn't feel like rolling back the error rate is going to change anything. [21:27:00] er, sorry: i meant to say: it doesn't feel like rolling back to group0 would change the error rate. [21:27:08] yeah [21:47:16] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [21:48:00] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Duplicate Cable IDs & Accounting Discrepancies - https://phabricator.wikimedia.org/T285719 (10wiki_willy) [21:51:08] 10SRE, 10Product-Analytics, 10SRE-Access-Requests: Update membership info for iflorez - https://phabricator.wikimedia.org/T286509 (10mpopov) Thank you very much! 
[22:29:09] Reedy: https://github.com/wikimedia/mediawiki-extensions-WikimediaMessages/commit/b45f04a61d07c60afdbaa3ec916863481343dbaf#diff-76d1cbf93019e8f231434836d12ea276ba6673e9d45f6037b6a7aa93d85a9ba2 is 8 of them 14 [22:29:33] joyous [22:29:43] Question is whether it's damaging enough we should backport and run scap [22:29:56] RhinosF1: ~540 translations were deleted [22:30:01] fun [22:30:10] that's a lot to backport [22:30:28] I think most of them are in MW core [22:30:48] Personally yes but I use en-gb [22:31:01] So do I [22:31:07] I quite often find this stuff broken, repeatedly [22:31:15] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 653 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [22:31:31] is there an easy way to backport all 540 messages [22:31:42] but then an hour of scap [22:32:00] Need to find which commits the updates are in [22:32:05] Or if they were actually exported [22:32:13] It's possible that they never were [22:32:37] https://github.com/wikimedia/mediawiki/blob/master/languages/i18n/en-gb.json hasn't been touched for nearly 2 months [22:32:54] is there logs? [22:33:13] none of the MW core en-gb.json have been updated for months [22:33:54] (03PS1) 10RhinosF1: Localisation updates from https://translatewiki.net. [extensions/WikimediaMessages] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704611 [22:34:02] if needed [22:34:10] https://github.com/wikimedia/Wikibase/commit/c9cb5b23e5fe303a95ea68909b33f1b90dd785ae [22:35:11] (03CR) 10Volans: [C: 03+1] "LGTM and consider it a +1 also in case you add the set too, see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [22:35:28] (03PS1) 10RhinosF1: Localisation updates from https://translatewiki.net. 
[extensions/Wikibase] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704612 [22:36:18] Reedy: there's 2 patches covering 20 / 540 messages [22:37:20] It does look like most of them weren't exported, or haven't been reverted out automatically [22:38:03] most of the 540 show in cy ? [22:38:45] I haven't looked [22:39:00] But Raymond deleted them as being in the wrong language [22:39:03] So I'm guessing so [22:40:28] Is there a bot for cleaning it up? [22:40:44] What do you mean? [22:40:53] The exports delete them as per the linked commits [22:41:00] ah right [22:41:25] do we know a list of patches so we can backport or do we bother? [22:41:34] No, we don't have a list [22:41:44] it'd be a case of looking at the list of deleted commits and working out what repo it is [22:43:06] * RhinosF1 can't see the deleted contribs [22:44:46] Amir1: there's 520 messages to track first [22:45:58] i need sleep though [22:46:00] that'll be fun [22:46:20] I need to leave too, it's really late in the continent [22:46:26] yeah [22:49:59] I'm not on the continent but it's still nearly midnight [22:52:00] (03PS1) 10Bstorm: cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) [22:53:04] (03CR) 10Bstorm: [C: 03+2] cloud nfs: cleaning up the non-drbd setup [puppet] - 10https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210714T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:05:57] (03CR) 10Bstorm: "Interestingly, the marked-down one I'm suggesting here comes with no warnings against it like the "marked-up" does https://cbonte.github.i" [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [23:24:38] (03CR) 10Juan90264: [C: 03+1] Uninstall Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [23:44:10] (03PS3) 10RLazarus: icinga: Performance improvements to icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) [23:48:54] (03CR) 10RLazarus: [C: 03+2] "Thanks for the review!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704422 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus)