[00:00:04] RoanKattouw and Urbanecm: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T0000). [00:00:04] zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:20] that's because they were all ~~stolen~~repurposed from mediawiki imagescalers [00:00:52] Hey zabe, around? [00:00:55] !log legoktm@cumin1001 START - Cookbook sre.hosts.reimage for host thumbor2005.codfw.wmnet with OS stretch [00:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:05] yes [00:01:13] I can deploy today then :) [00:02:03] (03PS5) 10Urbanecm: Lossless optimization of the brwikimedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735745 (owner: 10Zabe) [00:02:08] (03CR) 10Urbanecm: [C: 03+2] Lossless optimization of the brwikimedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735745 (owner: 10Zabe) [00:02:54] (03Merged) 10jenkins-bot: Lossless optimization of the brwikimedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735745 (owner: 10Zabe) [00:04:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:53] zabe: can you test? 
it's at mwdebug1001 [00:06:07] urbanecm: looks good to me [00:06:48] thanks, syncing [00:07:58] (03PS3) 10Urbanecm: Migrate wmfHostnames to wmgHostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [00:08:05] (03CR) 10Urbanecm: [C: 03+2] "per Timo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [00:08:24] !log urbanecm@deploy1002 Synchronized static/images/project-logos: 59c3fe66a0d140ae21f7269150a256a5e9786b24: Lossless optimization of the brwikimedia logo (duration: 01m 04s) [00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:55] (03Merged) 10jenkins-bot: Migrate wmfHostnames to wmgHostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [00:11:03] zabe: second patch is at mwdebug1001 [00:11:15] (i know, hard to test, just please make sure nothing apparent breaks :)) [00:12:09] !log Purge https://en.wikipedia.org/static/images/project-logos/brwikimedia.png and respective HD variants [00:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:15] urbanecm: nothing breaks, so I think we are good [00:14:05] thanks [00:14:13] checking logs one more time [00:14:59] logs sound good [00:15:00] syncing [00:15:08] (03CR) 10Ryan Kemper: elasticsearch: disallow puppet to restart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: 10Ryan Kemper) [00:16:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:16:25] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 5110fe77bb982cca82c8d474339a2b73d02c8024: Migrate wmfHostnames to wmgHostnames (T45956) (duration: 01m 03s) [00:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:29] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [00:16:31] and, live [00:16:36] anything else zabe ? [00:17:02] no, thanks :) [00:17:10] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: disallow puppet to restart [puppet] - 10https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: 10Ryan Kemper) [00:17:56] !log T290902 Disabling puppet across all elastic*: `ryankemper@cumin1001:~$ sudo cumin '*elastic*' 'sudo disable-puppet "Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/739379"'` [00:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:59] T290902: cirrussearch: iron out and document procedure for puppet changes triggering restart - https://phabricator.wikimedia.org/T290902 [00:18:43] !log T290902 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/739379; running puppet agent on arbitrary elastic host: `ryankemper@elastic1051:~$ sudo run-puppet-agent --force` [00:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:49] ooh [00:18:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:54] !log UTC late B&C finished [00:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:33] !log T290902 Test host looks good, proceeding to rest of fleet `ryankemper@cumin1001:~$ sudo cumin -b 4 '*elastic*' 'sudo run-puppet-agent --force'` [00:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:43] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:25:57] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:26:46] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thumbor2005.codfw.wmnet with OS stretch [00:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:33] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:27:34] (03PS3) 10Ryan Kemper: query_service: Generalize prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [00:28:14] !log legoktm@cumin1001 START - Cookbook sre.hosts.reimage for host thumbor2006.codfw.wmnet with OS stretch [00:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:32] reimage is running in tmux, I'm going to step away for a bit [00:32:51] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:35:49] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:48:19] (03PS1) 10BryanDavis: wikireplicas: remove dependency on meta_p for user_properties_anon view [puppet] - 10https://gerrit.wikimedia.org/r/739680 (https://phabricator.wikimedia.org/T294652) [00:48:49] (03PS3) 10Krinkle: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [00:54:03] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thumbor2006.codfw.wmnet with OS stretch [00:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T0100). [01:13:01] (03PS1) 10Ladsgroup: Revert "Stop setting wgActorTableSchemaMigrationStage, no longer read in core" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) [01:13:09] (03PS2) 10Ladsgroup: Revert "Stop setting wgActorTableSchemaMigrationStage, no longer read in core" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) [01:14:16] (03PS2) 10Ryan Kemper: Add repository-swift plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/738979 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [01:14:18] (03PS3) 10Ladsgroup: Revert "Stop setting wgActorTableSchemaMigrationStage, no longer read in core" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) [01:24:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:25:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:27:03] jouncebot: nowandnext [01:27:03] For the next 0 hour(s) and 32 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T0100) [01:27:03] In 9 hour(s) and 32 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1100) [01:28:03] (03CR) 10Ryan Kemper: "Plugin build complete, merging" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/738979 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [01:28:06] (03CR) 10Ryan Kemper: [C: 03+2] Add repository-swift plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/738979 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [01:31:58] (03CR) 10Legoktm: [C: 03+2] Move thumbor2005 and thumbor2006 to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739677 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [01:32:34] (03CR) 10Krinkle: profile::mediawiki::php: support kubernetes in php-fatal-error.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [01:32:47] (03CR) 10Ladsgroup: [C: 03+2] "double checked. Noop for production." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [01:33:32] (03Merged) 10jenkins-bot: Revert "Stop setting wgActorTableSchemaMigrationStage, no longer read in core" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [01:35:29] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: NOOP - Config: [[gerrit:739633|Revert "Stop setting wgActorTableSchemaMigrationStage, no longer read in core" (T275246)]] (duration: 01m 04s) [01:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:43] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [01:39:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:25] (03PS1) 10Legoktm: thumbor: Add thumbor2005 [puppet] - 10https://gerrit.wikimedia.org/r/739684 (https://phabricator.wikimedia.org/T285477) [01:39:27] (03PS1) 10Legoktm: thumbor: Add thumbor2006 [puppet] - 10https://gerrit.wikimedia.org/r/739685 (https://phabricator.wikimedia.org/T285477) [01:39:29] (03PS1) 10Legoktm: conftool: Add thumbor2005 [puppet] - 10https://gerrit.wikimedia.org/r/739686 (https://phabricator.wikimedia.org/T285477) [01:39:31] (03PS1) 10Legoktm: conftool: Add thumbor2006 [puppet] - 10https://gerrit.wikimedia.org/r/739687 (https://phabricator.wikimedia.org/T285477) [01:42:37] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2005.codfw.wmnet [01:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[01:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:38] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2005.codfw.wmnet [01:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:05] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2006.codfw.wmnet [01:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul) [01:56:04] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2006.codfw.wmnet [01:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:41] (03CR) 10Legoktm: [C: 03+2] thumbor: Add thumbor2005 [puppet] - 10https://gerrit.wikimedia.org/r/739684 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [02:01:49] (03CR) 10Legoktm: [C: 03+2] thumbor: Add thumbor2006 [puppet] - 10https://gerrit.wikimedia.org/r/739685 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [02:02:34] PROBLEM - snapshot of s6 in eqiad on alert1001 is CRITICAL: snapshot for s6 at eqiad taken more than 3 days ago: Most recent backup 2021-11-15 01:46:07 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:07:26] (03CR) 10Legoktm: [C: 03+2] conftool: Add thumbor2005 [puppet] - 10https://gerrit.wikimedia.org/r/739686 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [02:07:33] (03CR) 10Legoktm: [C: 03+2] conftool: Add thumbor2006 [puppet] - 10https://gerrit.wikimedia.org/r/739687 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [02:08:09] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor2005.codfw.wmnet [02:08:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:14] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor2006.codfw.wmnet [02:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:15] (03PS4) 10Krinkle: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [02:46:44] (03CR) 10Krinkle: "made some minor changes, tested on mwdebug1002 and confirmed in logstash." [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [03:23:51] PROBLEM - Host db1131 is DOWN: PING CRITICAL - Packet loss = 100% [03:24:12] RECOVERY - Host db1131 is UP: PING WARNING - Packet loss = 90%, RTA = 0.28 ms [03:24:24] Hello [03:24:34] The paging certainly works [03:25:08] here [03:25:44] db1131 is a s6 api replica [03:26:24] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1131&var-port=9104 [03:27:13] it looks back to normal? [03:27:54] Nov 18 03:24:02 db1131 kernel: [15945956.681518] tg3 0000:04:00.0 eno1: Link is down [03:27:54] Nov 18 03:24:09 db1131 kernel: [15945963.733365] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex [03:28:05] here [03:28:07] network hiccup per https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1131&var-datasource=thanos&var-cluster=mysql&from=1637202481738&to=1637206081738 I guess [03:28:24] If it's all good, the page will have to be manually resolved. Host alerts don't auto-resolve. 
[03:28:53] socket errors still aren't back to zero, but I'm not sure if that's a sliding window or something [03:29:57] zero now [03:30:10] so it is [03:30:21] I thought we had alerts on primaries only [03:30:26] meh [03:30:33] this was a recent change [03:30:40] I guess if it keeps hiccupping like this we might reconsider it :) [03:30:59] looks like there's nothing to do, I'll go ahead and resolve in VO and we can get back to what we were doing [03:31:26] do we need a follow-up task or are we ok saying it was a network hiccup and that's it? [03:31:48] I dunno, having Amir1 here live is a weird new phenomenon :D [03:32:04] previously I'd say open a task so data persistence can look when they're awake [03:32:16] korm*at did the work, blame it on her :P [03:32:37] I'll create the ticket [03:32:45] I guess two separate questions, one is do we want to keep this as a paging alert, the other is does anyone need to investigate db1131 [03:33:17] the latter (investigating db1131) was what I meant for the follow-up ticket [03:33:23] network is not my strongest suit [03:33:27] nod [03:33:38] can't hurt to open the task, worst case m.anuel or someone will just close it in the morning [03:33:41] for paging, I think we need more data / time [03:33:43] sure [03:34:06] and yeah, re more data, agree -- if it's usually actionable we don't mind the occasional red herring [03:34:15] but maybe now is a good time to start collecting that data someplace [03:34:25] I'll add it to "pages for awareness" as usual, but beyond that not sure [03:35:00] Amir1: are you creating a ticket then? or should I? 
[03:35:05] I'm working on the ticket [03:36:48] ty :) [03:37:18] * legoktm goes back to being afk, ping if anything is needed [03:38:50] I'll try to look at the logs, but I doubt it'll be fruitful [03:39:10] T295952 [03:39:10] T295952: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 [03:39:19] SRE, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (RLazarus) [03:39:29] (just added #SRE so wikibugs will yell about it in here) [03:39:32] looks good, thanks! [03:39:43] maint bot would do it :P [03:40:46] * Amir1 goes back to read more about Wikipedia as part of his onboarding [03:40:56] won't be the first time I did a job that a robot could have done perfectly well! [03:41:06] I'm checking out too, thanks all <3 [03:41:30] my future plan is to automate myself out of my job [03:42:41] the fun thing is that Manuel will be awake in an hour or so :D [03:47:26] SRE, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Ladsgroup) daemon logs during the hiccup: ` Nov 18 03:23:15 db1131 systemd[1]: Created slice User Slice of UID 112. Nov 18 03:23:15 db1131 systemd[1]: Starting User Runtime Directory /run/user/112... Nov 18 03:23... [03:51:28] SRE, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Ladsgroup) Oh the syslog is juicy: ` Nov 18 03:22:06 db1131 kernel: [15945840.380917] tg3 0000:04:00.0 eno1: Link is down Nov 18 03:23:05 db1131 mysqld[3170]: 2021-11-18 3:23:05 2602606439 [ERROR] Event Schedule... 
[04:04:44] (PS3) Dzahn: miscweb: enable TLS [deployment-charts] - https://gerrit.wikimedia.org/r/739675 (https://phabricator.wikimedia.org/T281538) [04:05:31] (CR) Dzahn: [C: +2] miscweb: enable TLS [deployment-charts] - https://gerrit.wikimedia.org/r/739675 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn) [04:06:44] RECOVERY - snapshot of s6 in eqiad on alert1001 is OK: Last snapshot for s6 at eqiad (db1140.eqiad.wmnet:3316) taken on 2021-11-18 03:15:46 (552 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:12:25] SRE, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Ladsgroup) It's on C8, the access switch didn't log any errors in that port: https://librenms.wikimedia.org/device/device=162/tab=port/port=14869/ [04:15:01] (Merged) jenkins-bot: miscweb: enable TLS [deployment-charts] - https://gerrit.wikimedia.org/r/739675 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn) [04:17:56] (CR) Krinkle: [C: +1] profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: Giuseppe Lavagetto) [04:23:48] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [04:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:19] SRE, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Ladsgroup) It actually flapped several times (from kern.log): ` Nov 18 01:35:21 db1131 kernel: [15939435.175199] tg3 0000:04:00.0 eno1: Link is down Nov 18 01:35:40 db1131 kernel: [15939454.725701] tg3 0000:04:00... 
[04:26:36] I'm this close to depooling db1131 [04:26:58] (PS1) Dzahn: Revert "miscweb: enable TLS" [deployment-charts] - https://gerrit.wikimedia.org/r/739634 [04:28:02] Amir1: oh, the brand new alerting, did not get it though [04:28:12] downtime it? [04:28:37] it's acked but it's getting traffic so I don't want to downtime it [04:28:51] unless it's depooled [04:29:39] if it's flapping the ACK will be gone next flap though [04:29:53] downtime that expires in a couple hours wouldn't [04:30:15] if it flaps again, I'll depool it and then downtime it [04:30:34] I agree with you, either cable or NIC it seems, btw [04:30:39] alright [04:32:15] looks like when the cable is loose or it was touched by nearby work [04:32:28] except this isn't the time of day for that [04:32:56] yeah. It started earlier though [04:33:01] will see [04:33:07] Amir1: you can also try "racadm getsel" on mgmt console if you want [04:33:12] to check for hardware errors [04:33:47] oh thanks [04:37:43] > run ipmitool, it will ask for management password, that is stored in pwstore. [04:37:45] :D [04:38:19] ipmitool is one option. I'd just ssh to .mgmt, but you need the same password, yes, pwstore [04:38:49] does it [04:39:40] I forgot about the reencrypting for you, arr [04:40:13] sorry, too many things on the list [04:40:51] all good, you have way too many things to do [04:41:09] soo. ssh root@db1131.mgmt.eqiad.wmnet [04:41:34] Date/Time: 07/15/2020 04:40:05 [04:41:46] Severity: Critical [04:41:46] Description: The chassis is open while the power is off. 
[04:41:57] it was open 2 days ago, but that is all [04:42:05] is that expected? :) [04:42:38] no new hardware error though from today, so that was that [04:44:38] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:32] hello [05:02:40] let's depool it [05:03:07] (CR) Dzahn: [C: +2] Revert "miscweb: enable TLS" [deployment-charts] - https://gerrit.wikimedia.org/r/739634 (owner: Dzahn) [05:06:11] mutante: sure, on it [05:06:18] marostegui: ^ [05:06:22] wrong ping, sorry [05:06:43] haha [05:08:02] I would suggest tagging ops-eqiad and asking them to review the cable, it might be loose [05:08:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 due to network issues (T295952)', diff saved to https://phabricator.wikimedia.org/P17758 and previous config saved to /var/cache/conftool/dbconfig/20211118-050802-ladsgroup.json [05:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:07] T295952: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 [05:08:11] done [05:08:35] the patch looks good! [05:08:40] well done [05:09:47] SRE, ops-eqiad, DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Ladsgroup) Hi, ops-eqiad. Is it possible to check the Ethernet cable of db1131 (C8)? Thanks! 
[05:09:49] Amir1: you might want to check db1133.yaml and copy the alerts notification part and set it on db1131.yaml so it doesn't page again [05:10:09] in puppet that is [05:10:13] sure sure [05:11:30] once this is all fixed, we'd need to remove that before repooling [05:12:07] (Merged) jenkins-bot: Revert "miscweb: enable TLS" [deployment-charts] - https://gerrit.wikimedia.org/r/739634 (owner: Dzahn) [05:13:24] (PS1) Ladsgroup: mediawiki: Disable notification for db1131 [puppet] - https://gerrit.wikimedia.org/r/739698 (https://phabricator.wikimedia.org/T295952) [05:13:43] marostegui: is this good ^ [05:14:11] checking [05:14:24] (CR) Marostegui: [C: +1] mediawiki: Disable notification for db1131 [puppet] - https://gerrit.wikimedia.org/r/739698 (https://phabricator.wikimedia.org/T295952) (owner: Ladsgroup) [05:15:22] (CR) Ladsgroup: [V: +2 C: +2] mediawiki: Disable notification for db1131 [puppet] - https://gerrit.wikimedia.org/r/739698 (https://phabricator.wikimedia.org/T295952) (owner: Ladsgroup) [05:19:18] SRE, ops-eqiad, DBA, Patch-For-Review: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (Marostegui) p:Triage→High Raising this to high as this is the candidate master for s6. [05:34:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 1%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17759 and previous config saved to /var/cache/conftool/dbconfig/20211118-053438-root.json [05:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:02] SRE, DBA, cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (Marostegui) I have started to slowly repool this host. [05:41:59] (PS1) Marostegui: dbproxy1019: Depool clouddb1014. 
[puppet] - 10https://gerrit.wikimedia.org/r/739699 [05:42:44] (03CR) 10jerkins-bot: [V: 04-1] dbproxy1019: Depool clouddb1014. [puppet] - 10https://gerrit.wikimedia.org/r/739699 (owner: 10Marostegui) [05:43:39] (03PS2) 10Marostegui: dbproxy1019: Depool clouddb1014. [puppet] - 10https://gerrit.wikimedia.org/r/739699 [05:45:32] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:45:42] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1014. [puppet] - 10https://gerrit.wikimedia.org/r/739699 (owner: 10Marostegui) [05:47:05] !log Upgrade clouddb1014 [05:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:35] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1014." [puppet] - 10https://gerrit.wikimedia.org/r/739635 [05:48:15] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1014." [puppet] - 10https://gerrit.wikimedia.org/r/739635 (owner: 10Marostegui) [05:49:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 5%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17760 and previous config saved to /var/cache/conftool/dbconfig/20211118-054942-root.json [05:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17761 and previous config saved to /var/cache/conftool/dbconfig/20211118-060446-root.json [06:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:18] !log revoked all grants from wikiadmin and gave back an explicit list on db1156 (T249683) [06:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:22] T249683: Redefine mysql GRANTs for wikiadmin - 
https://phabricator.wikimedia.org/T249683 [06:19:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 20%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17762 and previous config saved to /var/cache/conftool/dbconfig/20211118-061949-root.json [06:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: After fixing GRANTs', diff saved to https://phabricator.wikimedia.org/P17763 and previous config saved to /var/cache/conftool/dbconfig/20211118-062048-root.json [06:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:24:23] oh oh [06:24:39] it's probably redundancy but still [06:31:46] !log revoked all grants from wikiadmin and gave back an explicit list on db1102:3312 (T249683) [06:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:51] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [06:34:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17764 and previous config saved to /var/cache/conftool/dbconfig/20211118-063453-root.json [06:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: After fixing GRANTs', diff saved to https://phabricator.wikimedia.org/P17765 and previous config saved to /var/cache/conftool/dbconfig/20211118-063552-root.json [06:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:57] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 40%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17766 and previous config saved to /var/cache/conftool/dbconfig/20211118-064957-root.json [06:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: After fixing GRANTs', diff saved to https://phabricator.wikimedia.org/P17767 and previous config saved to /var/cache/conftool/dbconfig/20211118-065055-root.json [06:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:09] (03PS1) 10Ladsgroup: maintenance: Add waitForReplication and sleep in migrateRevisionActorTemp [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739636 (https://phabricator.wikimedia.org/T275246) [07:02:27] (03CR) 10Elukey: profile::rsyslog: move Kafka TLS CA settings to the new bundle (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:05:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17768 and previous config saved to /var/cache/conftool/dbconfig/20211118-070500-root.json [07:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: After fixing GRANTs', diff saved to https://phabricator.wikimedia.org/P17769 and previous config saved to /var/cache/conftool/dbconfig/20211118-070559-root.json [07:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17770 and previous config saved to 
/var/cache/conftool/dbconfig/20211118-070620-marostegui.json [07:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:23] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [07:10:13] (03CR) 10Elukey: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [07:10:39] (03CR) 10Ladsgroup: [C: 03+2] "to catch the train. We don't have a train next week." [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739636 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [07:15:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17771 and previous config saved to /var/cache/conftool/dbconfig/20211118-072004-root.json [07:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:46] (03PS1) 10Cathal Mooney: Depool eqiad at DNS level to faciliate iBGP reconfig on CRs [dns] - 10https://gerrit.wikimedia.org/r/739703 (https://phabricator.wikimedia.org/T295672) [07:27:11] (03CR) 10Ayounsi: [C: 03+1] Depool eqiad at DNS level to faciliate iBGP reconfig on CRs [dns] - 10https://gerrit.wikimedia.org/r/739703 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [07:29:45] (03CR) 10Elukey: P::kerberos: automate principal management (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [07:30:24] (03PS12) 10Elukey: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [07:31:16] (03CR) 10Elukey: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [07:33:44] (03Merged) 10jenkins-bot: maintenance: Add waitForReplication and sleep in migrateRevisionActorTemp [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739636 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [07:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17772 and previous config saved to /var/cache/conftool/dbconfig/20211118-073507-root.json [07:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:47] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/maintenance/migrateRevisionActorTemp.php: Backport: [[gerrit:739636|maintenance: Add waitForReplication and sleep in migrateRevisionActorTemp (T275246)]] (duration: 01m 04s) [07:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:50] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [07:41:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:25] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Marostegui) 05Open→03Resolved Host fully repooled! 
[07:50:54] (03CR) 10Majavah: P::kerberos: automate principal management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736753 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [07:54:47] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:54:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [07:56:55] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:00:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:58] (03CR) 10Cathal Mooney: [C: 03+2] Depool eqiad at DNS level to faciliate iBGP reconfig on CRs [dns] - 10https://gerrit.wikimedia.org/r/739703 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [08:01:40] !log Depooling eqiad in authdns to allow for reconfiguration of CR routers on site (T295672) [08:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:44] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [08:04:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [08:09:13] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739611 (https://phabricator.wikimedia.org/T292503) (owner: 
10Cathal Mooney) [08:12:55] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks! I'll merge this next week." [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [08:19:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [08:20:08] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for - https://phabricator.wikimedia.org/T295898 (10Aklapper) @CGlenn: I'm afraid I cannot follow... You closed this ticket as resolved, so I assume that everything requested in this ticket had been successfully done. See also https://www.mediaw... [08:25:59] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:15] !log De-pool of Eqiad seems to be ok, transit/peering/transport links changed BW profile but nothing maxed, total LVS connections steady but have shifted to codfw. Proceeding to reconfigure iBGP policy on cr1-eqiad and cr2-eqiad manually. [08:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:44] (03CR) 10JMeybohm: [C: 03+2] Skip (re-)building helm dependencies during CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/739590 (owner: 10JMeybohm) [08:35:23] (03CR) 10Ema: [C: 03+2] varnish: remove code used to clean up old mtail scripts [puppet] - 10https://gerrit.wikimedia.org/r/739560 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [08:35:53] (03Merged) 10jenkins-bot: Skip (re-)building helm dependencies during CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/739590 (owner: 10JMeybohm) [08:37:59] (03CR) 10JMeybohm: [C: 03+1] "Nice, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:45:17] (03PS1) 10Vgutierrez: site: Reimage cp1090 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/739754 (https://phabricator.wikimedia.org/T290005) [08:45:33] !log installing mariadb-10.3 security updates on buster (as packaged in Debian, not the wmf-internal packages) [08:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:24] !log depool cp1090 to be reimaged as cache::upload_haproxy - T290005 [08:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:46:58] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1090 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/739754 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [08:48:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1090.eqiad.wmnet with OS buster [08:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:27] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster [08:49:37] (03PS1) 10JMeybohm: istio: Rebuild for new wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739755 [08:50:20] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] istio: Rebuild for new wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739755 (owner: 10JMeybohm) [08:52:16] (03PS13) 10JMeybohm: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve [deployment-charts] - 
10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) [08:53:01] (03CR) 10JMeybohm: [C: 03+1] "Snuck in a istio image version bump 😉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:53:07] (03CR) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [09:00:59] (03CR) 10Arturo Borrero Gonzalez: wmcs: use raw help formatter and module docs (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739563 (owner: 10David Caro) [09:01:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739562 (owner: 10David Caro) [09:02:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/739600 (owner: 10Majavah) [09:05:20] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10Aklapper) (For future reference, please use the template at https://phabricator.wikimedia.org/tag/ldap-access-requests/ to file such requests - thanks... [09:05:22] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [09:05:38] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Aklapper) > added to the "wmf" LDAP group @dzahn: Did this miss the [step to add the Phab account](https://wikitech.wikimedia.org/w/index.php?title=SRE/LDAP#Add_a_user_to_a_group) to #WMF-NDA (... 
[09:06:21] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is all rolled out. [09:09:16] (03PS4) 10Ema: cache: enable single backend experiment on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) [09:09:39] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [09:12:15] (03PS5) 10Ema: cache: enable single backend experiment on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) [09:14:18] (03PS1) 10MMandere: site: Add drmrs lvs instances [puppet] - 10https://gerrit.wikimedia.org/r/739757 (https://phabricator.wikimedia.org/T282787) [09:18:07] !log systemctl start prune-production-images.service on deneb - T287222 [09:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:12] T287222: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 [09:21:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:13] <_joe_> vgutierrez: ^^ dunno if that was worrisome or not [09:24:40] nope, the only host is being reimaged at the moment [09:24:43] thanks for pinging :) [09:27:13] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) [09:30:17] (03PS6) 10Ema: cache: enable single backend experiment on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) [09:30:32] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [09:32:05] (03PS1) 10Jelto: admin: add user saisuman to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/739761 (https://phabricator.wikimedia.org/T295550) [09:32:24] !log pool cp1090 (upload) running HAProxy as TLS terminator - T290005 [09:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:32:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1090.eqiad.wmnet with OS buster [09:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:05] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster c... 
[09:33:36] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [09:35:51] !log cp4021: depool to enable single backend experiment T288106 [09:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:59] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [09:36:27] (03CR) 10Ema: [C: 03+2] cache: enable single backend experiment on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [09:37:39] (03PS2) 10David Caro: wmcs: use argparse formatter and module docs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739563 [09:38:55] (03CR) 10David Caro: [C: 03+2] CI: add style checks and formatting script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) (owner: 10David Caro) [09:39:56] (03CR) 10jerkins-bot: [V: 04-1] CI: add style checks and formatting script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) (owner: 10David Caro) [09:40:57] (03CR) 10Elukey: [C: 03+2] istio: Fix main config, add basic NetworkPolicy for staging/ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:41:50] !log cp4021: stop ats-be and clear its cache T288106 [09:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:55] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [09:49:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:49:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2002.codfw.wmnet with OS buster [09:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS buster [09:54:55] (03PS1) 10Elukey: helmfile.d: add port 15021 to Istio's network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/739762 (https://phabricator.wikimedia.org/T290966) [09:55:35] jayme: --^ (if you have time later on) [09:55:39] (03PS1) 10Daniel Kinzler: Don't trust Title that if it exists pageId will be > 0 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739638 (https://phabricator.wikimedia.org/T295931) [09:55:56] 10SRE, 10Traffic, 10Patch-For-Review, 10User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) After setting `cache::single_backend_fqdn: cp4021.ulsfo.wmnet` in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for insta... 
[09:56:29] !log cp4021: repool w/ single backend experiment enabled T288106 [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [09:57:22] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [10:00:06] !log Re-enabling Equinix IXP port on cr1-eqiad following iBGP changes to address T295650 [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:12] T295650: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 [10:00:20] (03CR) 10Vgutierrez: [C: 03+1] site: Add drmrs lvs instances [puppet] - 10https://gerrit.wikimedia.org/r/739757 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:01:28] !log updating perf on buster hosts [10:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:44] (03CR) 10MMandere: [C: 03+2] site: Add drmrs lvs instances [puppet] - 10https://gerrit.wikimedia.org/r/739757 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:02:22] (03PS3) 10David Caro: CI: add style checks and formatting script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) [10:03:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:10] (03CR) 10David Caro: [C: 03+2] CI: add style checks and formatting script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) (owner: 10David Caro) [10:05:41] (03Merged) 10jenkins-bot: CI: add style checks and formatting script [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) (owner: 10David Caro) [10:06:45] (03CR) 10Elukey: [C: 03+2] "Trying it out, easy to revert in case :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739762 (https://phabricator.wikimedia.org/T290966) (owner: 10Elukey) [10:08:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:08:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:07] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10mmartorana) [10:11:25] (03PS1) 10Cathal Mooney: Revert "Depool eqiad at DNS level to faciliate iBGP reconfig on CRs" [dns] - 10https://gerrit.wikimedia.org/r/739639 [10:12:33] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool eqiad at DNS level to faciliate iBGP reconfig on CRs" [dns] - 10https://gerrit.wikimedia.org/r/739639 (owner: 10Cathal Mooney) [10:12:53] !log Re-pooling eqiad in DNS after completing iBGP policy changes on cr1-eqiad and cr2-eqiad T295672 [10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:58] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [10:17:40] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host lvs6001.drmrs.wmnet with OS buster [10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:50] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with 
OS buster [10:21:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2002.codfw.wmnet with OS buster [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:42] 10SRE, 10Infrastructure-Foundations: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS buster completed: - ganeti-test2002 (**PASS**) - Dow... [10:22:50] (03PS1) 10Muehlenhoff: sre.debmonitor.remove-hosts: Catch the RemoteError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/739765 [10:24:25] (03PS2) 10Muehlenhoff: sre.debmonitor.remove-hosts: Catch the RemoteError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/739765 [10:26:14] (03PS1) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 [10:26:34] (03PS1) 10Elukey: helmfile.d: allow gateways to talk with istiod [deployment-charts] - 10https://gerrit.wikimedia.org/r/739767 (https://phabricator.wikimedia.org/T290966) [10:26:56] (03PS2) 10Elukey: helmfile.d: allow gateways to talk with istiod [deployment-charts] - 10https://gerrit.wikimedia.org/r/739767 (https://phabricator.wikimedia.org/T290966) [10:28:15] (03PS14) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [10:33:55] (03CR) 10Jbond: [C: 03+1] exim4.conf.mx: switch 'data' to 'condition' in otrs config [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [10:33:57] (03PS1) 10Jcrespo: dbbackups: Switch s3 and x1 backup generation as an optimization [puppet] - 10https://gerrit.wikimedia.org/r/739768 (https://phabricator.wikimedia.org/T138562) [10:34:23] (03PS2) 10Jcrespo: dbbackups: Switch s3 and x1 backup generation as an optimization 
[puppet] - 10https://gerrit.wikimedia.org/r/739768 (https://phabricator.wikimedia.org/T138562) [10:35:13] (03CR) 10Elukey: [C: 03+2] helmfile.d: allow gateways to talk with istiod [deployment-charts] - 10https://gerrit.wikimedia.org/r/739767 (https://phabricator.wikimedia.org/T290966) (owner: 10Elukey) [10:35:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32481/console" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:35:20] (03CR) 10Jbond: [C: 03+2] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [10:35:37] (03CR) 10Jbond: [C: 03+2] f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [10:36:25] (03CR) 10jerkins-bot: [V: 04-1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [10:36:27] (03CR) 10jerkins-bot: [V: 04-1] f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [10:38:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[10:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:05] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switch s3 and x1 backup generation as an optimization [puppet] - 10https://gerrit.wikimedia.org/r/739768 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:41:32] (03CR) 10David Caro: [C: 03+2] controller: consider failure if any host fails [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: 10David Caro) [10:41:58] (03CR) 10jerkins-bot: [V: 04-1] controller: consider failure if any host fails [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: 10David Caro) [10:42:10] (03PS3) 10David Caro: controller: consider failure if any host fails [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) [10:42:59] (03CR) 10Btullis: [C: 03+1] sre.druid.roll-restart-workers: restart Druid exporter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 (owner: 10Elukey) [10:44:26] (03CR) 10David Caro: [C: 03+2] controller: consider failure if any host fails [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: 10David Caro) [10:45:38] (03Merged) 10jenkins-bot: controller: consider failure if any host fails [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: 10David Caro) [10:46:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Change completed successfully in eqiad. ##### Before ` cmooney@re0.cr1-eqiad> show route receive-protocol bgp 208.80.154.197 inet.0:... 
[10:56:37] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) 05Open→03Resolved The next-hop self policy has been applied on cr1-eqiad and cr2-eqiad, in the Confed_eqiad group, to address this issue. cr2-eqiad is... [10:57:11] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6001.drmrs.wmnet with OS buster [10:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:19] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with OS buster completed: - lvs6001 (**WARN**... [11:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1100). 
[11:02:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [11:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:13] (03Abandoned) 10Majavah: aptrepo: add component for rackspace openstack debs [puppet] - 10https://gerrit.wikimedia.org/r/737856 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah) [11:05:45] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host lvs6002.drmrs.wmnet with OS buster [11:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:53] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster [11:06:21] !log aborrero@apt1001:~ $ for i in $(ll /srv/wikimedia/incoming/ | grep aborrero | awk -F' ' '{print $NF}') ; do rm /srv/wikimedia/incoming/$i ; done [11:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:57] (03PS1) 10Hnowlan: discovery check: kartotherian is depooled in codfw [puppet] - 10https://gerrit.wikimedia.org/r/739771 [11:07:01] (03CR) 10JMeybohm: [C: 03+1] admin: add user saisuman to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/739761 (https://phabricator.wikimedia.org/T295550) (owner: 10Jelto) [11:07:09] !log added python-flask-oslolog_0.1~git20201012.7803a46-1 to bullseye-wikimedia (T295234) [11:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:12] T295234: Add keystone auth for dynamicproxy api - https://phabricator.wikimedia.org/T295234 [11:08:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:32] (03CR) 
10Vgutierrez: [V: 03+1 C: 03+2] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:08:40] !log run aborrero@apt1001:~$ sudo -i reprepro processincoming default [11:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:27] (03PS18) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [11:10:29] (03PS7) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [11:10:31] (03PS3) 10Jbond: Add Typing: And fix other minopr lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [11:11:43] (03PS1) 10Jgiannelos: Log tile instead of line number on error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739773 [11:12:14] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minopr lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [11:12:23] (03CR) 10jerkins-bot: [V: 04-1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [11:12:28] (03CR) 10jerkins-bot: [V: 04-1] f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [11:15:28] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Extend globbing for bridge check [cookbooks] - 10https://gerrit.wikimedia.org/r/739774 [11:16:44] (03PS19) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [11:16:49] (03PS8) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [11:17:45] jouncebot: next [11:17:45] In 0 hour(s) and 42 minute(s): UTC morning backport and config training 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1200) [11:17:51] (03PS9) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [11:18:07] (03CR) 10jerkins-bot: [V: 04-1] f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [11:23:38] (03PS4) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [11:24:41] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [11:26:10] !log aborrero@apt1001:~$ sudo -i reprepro processincoming default /srv/wikimedia/incoming/python-flask-keystone_0.2~git20201012.b5cd4da-1_amd64.changes (T295234) [11:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:13] T295234: Add keystone auth for dynamicproxy api - https://phabricator.wikimedia.org/T295234 [11:27:16] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS buster [11:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:26] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster [11:29:02] (03PS1) 10Vgutierrez: cache::haproxy: Fix unit_type for haproxy-mtail@tls.socket [puppet] - 10https://gerrit.wikimedia.org/r/739775 (https://phabricator.wikimedia.org/T290005) [11:29:27] !log aborrero@apt1001:~$ sudo -i reprepro export [11:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32482/console" [puppet] - 10https://gerrit.wikimedia.org/r/739775 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:38:17] (03Abandoned) 10Vgutierrez: cache::haproxy: Fix unit_type for haproxy-mtail@tls.socket [puppet] - 10https://gerrit.wikimedia.org/r/739775 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:43:55] (03Abandoned) 10Muehlenhoff: Fix role handling for canaries [puppet] - 10https://gerrit.wikimedia.org/r/731094 (owner: 10Muehlenhoff) [11:44:16] (03PS2) 10Giuseppe Lavagetto: Add apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) [11:44:24] (03CR) 10Giuseppe Lavagetto: Add apple-search deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [11:45:07] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6002.drmrs.wmnet with OS buster [11:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:16] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster completed: - lvs6002 (**WARN**... 
[11:45:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739761 (https://phabricator.wikimedia.org/T295550) (owner: 10Jelto) [11:46:31] (03PS1) 10Vgutierrez: prometheus:ops: Fix cache_haproxy_tls_mtail yaml filename [puppet] - 10https://gerrit.wikimedia.org/r/739776 (https://phabricator.wikimedia.org/T290005) [11:47:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [11:48:23] (03CR) 10Vgutierrez: [C: 03+2] prometheus:ops: Fix cache_haproxy_tls_mtail yaml filename [puppet] - 10https://gerrit.wikimedia.org/r/739776 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:49:32] CI Jenkins is restarting [11:51:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:55:23] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [12:00:05] Amir1, Lucas_WMDE, and apergos: Dear deployers, time to do the UTC morning backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1200). [12:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:08] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: update secrets for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739777 (https://phabricator.wikimedia.org/T295584) [12:00:16] * kart_ is here. [12:00:20] (03PS2) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [12:00:25] o/ [12:00:28] here. there is one patch in the window and no trainees [12:00:31] I’m here but not sure if CI works at the moment [12:00:36] the one patch looks straightforward enough. [12:00:39] oh? [12:00:39] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloud: update secrets for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739777 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [12:00:42] Oh. CI? [12:00:47] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 76.60 ms [12:00:54] maybe it’ll work, we can try [12:01:05] Was being restarted ~12 mins ago [12:01:14] 12 minutes seems a while [12:01:21] Have you met Java? :D [12:01:37] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/739765 (owner: 10Muehlenhoff) [12:01:45] no, I closed the door on it ever since it sent weird folks to collect the garbage... [12:01:52] lol [12:02:06] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10MoritzMuehlenhoff) But if we're replacing this with a new server, then let's decom the broken mw2280? 
[12:02:48] 2 patches seem stuck for the last 10/11 hours :P [12:03:16] (03PS1) 10Elukey: helmfile.d: use the istio label instead of app in Network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/739778 (https://phabricator.wikimedia.org/T290966) [12:03:39] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/739774 (owner: 10Muehlenhoff) [12:04:24] that's a different CI issue [12:05:06] sorry I had to restart Jenkins [12:05:10] (03CR) 10Jelto: [C: 03+2] admin: add user saisuman to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/739761 (https://phabricator.wikimedia.org/T295550) (owner: 10Jelto) [12:05:10] it was deadlocked :-\ [12:05:12] (03CR) 10JMeybohm: [C: 03+1] "Such fun!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739778 (https://phabricator.wikimedia.org/T290966) (owner: 10Elukey) [12:05:52] (03CR) 10Elukey: [C: 03+1] "If it is temporary let's add a comment (with a task-id) if any, so people will know :)" [puppet] - 10https://gerrit.wikimedia.org/r/739771 (owner: 10Hnowlan) [12:06:38] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS buster [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: add more secrets for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739779 (https://phabricator.wikimedia.org/T295584) [12:06:47] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster completed: - lvs6003 (**WARN**... 
[12:06:47] (03PS2) 10KartikMistry: Enable Tamil (ta) Section Translation in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739550 (https://phabricator.wikimedia.org/T294223) [12:07:00] just let us know when it is back, hashar? [12:07:03] apergos: Lucas_WMDE: the CI Jenkins usually comes back in a couple of minutes and is working again [12:07:09] ok thanks! [12:07:14] ok great! [12:07:15] apergos: do you want to deploy or should I? [12:07:17] that was bad timing I apologize [12:07:23] stuff happens [12:07:32] I should definitely have waited for after the deployment window [12:07:32] oh let me not hog the spotlight, Lucas_WMDE :-D [12:07:37] CI seems back. Thanks hashar ! [12:07:45] ok ^^ [12:08:02] (03PS3) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [12:08:13] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloud: add more secrets for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739779 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [12:09:05] Lucas_WMDE: Let me deploy then.. :) [12:09:14] ah ha! [12:09:29] a self deployer? wonderful! we love it. [12:09:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) 05Open→03Resolved @SCherukuwada you should have access now to `deployment` group. I'm closing this task. Feel free to re-open in case you have any probl... 
[12:10:22] ah, okay ^^ [12:10:23] (03CR) 10KartikMistry: [C: 03+2] Enable Tamil (ta) Section Translation in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739550 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry) [12:10:26] (03PS4) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [12:10:42] apergos: :) [12:10:43] hm, strange pattern of DjVuHandler errors in logspam-watch [12:11:09] (03Merged) 10jenkins-bot: Enable Tamil (ta) Section Translation in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739550 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry) [12:11:29] think we need to look at it before the deploy? [12:11:38] nah, probably not [12:11:41] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Extend globbing for bridge check [cookbooks] - 10https://gerrit.wikimedia.org/r/739774 (owner: 10Muehlenhoff) [12:12:03] (03PS3) 10Muehlenhoff: sre.debmonitor.remove-hosts: Catch the RemoteError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/739765 [12:12:29] okey dokey [12:13:40] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: fix ceph hiera key name for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739782 (https://phabricator.wikimedia.org/T295584) [12:14:14] medebug was good, so deploying.. [12:14:24] *mwdebug was good, so deploying.. [12:14:35] medebug would be a good hostname... 
[12:14:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloud: fix ceph hiera key name for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739782 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [12:15:01] microsoft hasn’t had to medebug since the release of xp [12:15:05] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:739550|Enable Tamil (ta) Section Translation in test wiki (T294223)]] (duration: 01m 05s) [12:15:06] lolol [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:08] (03CR) 10Muehlenhoff: [C: 03+2] sre.debmonitor.remove-hosts: Catch the RemoteError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/739765 (owner: 10Muehlenhoff) [12:15:09] T294223: Enable more languages for Section Translation in test wiki - https://phabricator.wikimedia.org/T294223 [12:15:53] !log Upgrade dbstore1007 to 10.4.22 T290841 T295970 [12:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:57] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 [12:15:58] T295970: Compile and package mariadb 10.4.22 - https://phabricator.wikimedia.org/T295970 [12:16:26] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mc1026.eqiad.wmnet [12:16:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mc1026.eqiad.wmnet [12:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:43] apergos: You want me to deploy XP on mwdebug? 
Let me try ;) [12:16:57] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mc1025.eqiad.wmnet [12:16:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mc1025.eqiad.wmnet [12:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:01] pretty sure that might run into some licensing issues :-P [12:17:10] :) [12:17:44] so how's it look? [12:17:49] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: fix ceph hiera key entry for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739786 (https://phabricator.wikimedia.org/T295584) [12:17:51] (03CR) 10Elukey: [C: 03+2] helmfile.d: use the istio label instead of app in Network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/739778 (https://phabricator.wikimedia.org/T290966) (owner: 10Elukey) [12:18:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:36] (03CR) 10Elukey: [C: 03+2] sre.druid.roll-restart-workers: restart Druid exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 (owner: 10Elukey) [12:18:42] (03PS2) 10Elukey: sre.druid.roll-restart-workers: restart Druid exporter [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 [12:19:02] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloud: fix ceph hiera key entry for cinder-backups @ codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/739786 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [12:20:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [12:21:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:58] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [12:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:34] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudcephosd1016.wikimedia.org [12:23:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudcephosd1016.wikimedia.org [12:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: use argparse formatter and module docs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739563 (owner: 10David Caro) [12:26:42] um... kart_ you done? [12:29:08] (03PS2) 10Hnowlan: discovery check: kartotherian is depooled in codfw [puppet] - 10https://gerrit.wikimedia.org/r/739771 [12:30:09] apergos: yeah yeah. [12:30:20] ok, great! 
[12:30:22] (03CR) 10Jelto: [C: 03+2] admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [12:30:37] Lucas_WMDE: did you have anything you wanted to sneak in at the last minute (or really, overflow from last week)? [12:30:44] nope [12:33:06] seems like the window is done eh [12:33:07] ? [12:38:48] (03CR) 10Jelto: [C: 03+2] admin: let parsoid-test-admins run 'sudo mysql..' on test servers [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [12:38:53] I think so, yeah [12:38:55] should we log it? [12:39:49] (03PS3) 10Jelto: admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [12:40:05] (03PS5) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [12:41:14] (03CR) 10Jbond: [C: 03+2] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [12:41:18] (03CR) 10Jbond: [C: 03+2] f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [12:41:49] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [12:42:37] (03Merged) 10jenkins-bot: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [12:42:39] (03Merged) 10jenkins-bot: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [12:46:39] (03PS3) 10Jelto: admin: let parsoid-test-admins run 'sudo mysql..' 
on test servers [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [12:49:49] (03CR) 10Elukey: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/739771 (owner: 10Hnowlan) [12:50:47] (03PS6) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [12:51:06] (03CR) 10Jgiannelos: "This change is ready for review." [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739773 (owner: 10Jgiannelos) [12:51:54] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:52:32] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [12:53:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [12:54:22] (03PS5) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [12:54:57] (03PS1) 10Elukey: helmfile.d: add default-deny and icmp to ml-serve's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/739791 (https://phabricator.wikimedia.org/T289834) [12:56:13] (03PS4) 10Jelto: admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [12:56:38] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudbackup1001-dev: relocate ceph auth config [labs/private] - 10https://gerrit.wikimedia.org/r/739792 (https://phabricator.wikimedia.org/T295584) [12:57:45] (03CR) 
10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: cloudbackup1001-dev: relocate ceph auth config [labs/private] - 10https://gerrit.wikimedia.org/r/739792 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [13:00:22] heh [13:00:31] we should probably have, oh well :-D [13:00:48] "window definitely over" <-- a bit late [13:01:04] (03CR) 10Arturo Borrero Gonzalez: "sigh, can't make hiera work https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32489/console" [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [13:01:13] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jelto) [13:03:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [13:04:05] (03PS7) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [13:04:26] (03CR) 10DCausse: [C: 03+2] Add CirrusSearch Old GC Hell alerting [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) (owner: 10Ebernhardson) [13:04:41] (03CR) 10Jbond: [C: 03+1] sre.hosts.reimage: additional check of remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/739603 (owner: 10Volans) [13:05:58] (03CR) 10Jbond: "Ready for review" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [13:06:09] (03CR) 10Elukey: [C: 03+2] helmfile.d: add default-deny and icmp to ml-serve's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/739791 (https://phabricator.wikimedia.org/T289834) (owner: 10Elukey) 
[13:06:30] (03Merged) 10jenkins-bot: Add CirrusSearch Old GC Hell alerting [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) (owner: 10Ebernhardson) [13:08:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jelto) 05Open→03Resolved a:03Jelto I added @ihurbain to `parsoid-test-admins`. All six mentioned above should have access now b... [13:14:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [13:17:38] (03PS2) 10Volans: sre.hosts.reimage: additional check of remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/739603 [13:22:03] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: additional check of remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/739603 (owner: 10Volans) [13:22:23] !log failover ganeti master in test cluster to ganeti-test2002 T284811 [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:27] T284811: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 [13:24:28] (03Merged) 10jenkins-bot: sre.hosts.reimage: additional check of remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/739603 (owner: 10Volans) [13:24:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - 
https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [13:25:17] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) @Cmjohnson @Jclark-ctr can you double check if the cable is loose? [13:26:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2003.codfw.wmnet [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:54] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10ihurbain) All good; took a little while for the whole auth to propagate (I guess), but I now have access. Thanks! [13:34:00] !log installing pam bugfix updates on bullseye hosts [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [13:39:24] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [13:40:07] (03CR) 10David Caro: cloud: codfw1dev: hiera update for new backup servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [13:41:09] (03PS1) 10Majavah: hieradata: add missing keys to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/739798 [13:42:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2003.codfw.wmnet [13:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:55] 
(LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [13:52:16] (03CR) 10Herron: [C: 03+2] exim4.conf.mx: switch 'data' to 'condition' in otrs config [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [13:52:23] (03CR) 10Muehlenhoff: [C: 03+2] Obsolete role::restbase::base [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff) [13:53:49] moritzm: feel free to multiple mine when ready to puppet-merge the role::restbase::base patch [13:53:57] ack [13:54:21] now merged [14:04:23] (03PS1) 10Muehlenhoff: Add MAC for testvm2003 [puppet] - 10https://gerrit.wikimedia.org/r/739800 [14:05:58] thx [14:08:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [14:08:58] (03CR) 10Muehlenhoff: [C: 03+2] Add MAC for testvm2003 [puppet] - 10https://gerrit.wikimedia.org/r/739800 (owner: 10Muehlenhoff) [14:11:10] (03PS1) 10Jgiannelos: tile-pregeneration: Avoid stopping execution on tegola error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739803 [14:12:37] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10ssastry) Thanks @dzahn and @jelto ... It looks like isabelle cannot setup a ssh tunnel via `ssh -L 8003:localhost:8003 testreduce1001... 
[14:14:42] (03Abandoned) 10Muehlenhoff: Retire role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/730836 (owner: 10Muehlenhoff) [14:18:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [14:19:51] !log installing testvm2003 [14:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] (03PS1) 10Elukey: profile::cache::kafka::webrequest: add pki settings [puppet] - 10https://gerrit.wikimedia.org/r/739806 (https://phabricator.wikimedia.org/T291905) [14:25:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32491/console" [puppet] - 10https://gerrit.wikimedia.org/r/739806 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:26:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [14:28:48] (03PS1) 10Jbond: puppet_compiler:puppetdb: We only need one puppetdb for all compilers [puppet] - 10https://gerrit.wikimedia.org/r/739808 [14:29:24] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler:puppetdb: We only need one puppetdb for all compilers [puppet] - 10https://gerrit.wikimedia.org/r/739808 (owner: 10Jbond) [14:31:57] (03PS1) 10Dzahn: miscweb: remove nodePort and re-enable TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/739810 (https://phabricator.wikimedia.org/T281538) [14:32:00] (03PS1) 10Jbond: Revert "Revert "mx2001: disable ldap validation"" [puppet] - 10https://gerrit.wikimedia.org/r/739641 [14:32:12] (03PS2) 10Jbond: Revert "Revert "mx2001: disable ldap validation"" [puppet] - 10https://gerrit.wikimedia.org/r/739641 [14:33:13] 
(03CR) 10Jbond: [C: 03+2] Revert "Revert "mx2001: disable ldap validation"" [puppet] - 10https://gerrit.wikimedia.org/r/739641 (owner: 10Jbond) [14:35:38] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: add missing keys to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/739798 (owner: 10Majavah) [14:36:40] (03CR) 10Dzahn: [C: 03+2] beta::autoupdater Don't mess with ${stage_dir}/php-master/cache/l10n [puppet] - 10https://gerrit.wikimedia.org/r/739620 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [14:37:45] !log roll-restarting sessionstore for java updates [14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [14:38:12] (03CR) 10Dzahn: [C: 03+2] miscweb: remove nodePort and re-enable TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/739810 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:38:21] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001 [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] (03CR) 10Hnowlan: [C: 03+2] discovery check: kartotherian is depooled in codfw [puppet] - 10https://gerrit.wikimedia.org/r/739771 (owner: 10Hnowlan) [14:42:53] (03CR) 10David Caro: Add Typing: And fix other minor lint issues (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [14:42:57] (03Merged) 10jenkins-bot: miscweb: remove nodePort and re-enable TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/739810 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:44:24] !log dzahn@deploy1002 helmfile [staging] Ran 
'sync' command on namespace 'miscweb' for release 'main' . [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:49] (03PS1) 10Jbond: Revert "Revert "Revert "mx2001: disable ldap validation""" [puppet] - 10https://gerrit.wikimedia.org/r/739645 [14:52:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Revert "Revert "mx2001: disable ldap validation""" [puppet] - 10https://gerrit.wikimedia.org/r/739645 (owner: 10Jbond) [14:52:58] (03PS1) 10Jbond: Revert "Revert "Revert "Revert "mx2001: disable ldap validation"""" [puppet] - 10https://gerrit.wikimedia.org/r/739826 [14:58:33] (03CR) 104nn1l2: Enable mapframe on the Indonesian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 104nn1l2) [15:01:06] (03CR) 10JMeybohm: [C: 03+1] "I'm definitely lacking rsyslog knowledge to assess all this in detail, but it looks sane to me (in some definition of sane)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:02:13] (03PS1) 10Jbond: otrs - aliases: fix up aliases file [puppet] - 10https://gerrit.wikimedia.org/r/739821 [15:02:45] (03CR) 10Jbond: [C: 03+2] otrs - aliases: fix up aliases file [puppet] - 10https://gerrit.wikimedia.org/r/739821 (owner: 10Jbond) [15:04:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:15] (03PS1) 10Hnowlan: restbase_dev: restore removed username parameter [puppet] - 10https://gerrit.wikimedia.org/r/739824 (https://phabricator.wikimedia.org/T235299) [15:06:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10odimitrijevic) Approved [15:08:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 
10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Ottomata) furud does not run any active services; it can be restarted anytime. [15:10:35] (03CR) 10Ema: [C: 03+1] profile::cache::kafka::webrequest: add pki settings [puppet] - 10https://gerrit.wikimedia.org/r/739806 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:14:08] (03PS1) 10Jbond: otrs_aliases: sort and unique emails [puppet] - 10https://gerrit.wikimedia.org/r/739825 [15:14:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001 [15:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:14:56] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools automatic topic subscriptions as beta feature on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739846 (https://phabricator.wikimedia.org/T290500) [15:16:29] !log roll restarting cassandra on codfw maps for java updates [15:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:00] (03PS1) 10Jbond: otrs-aliases: use full email as alias and match [puppet] - 10https://gerrit.wikimedia.org/r/739847 [15:18:19] (03CR) 10Jbond: [C: 03+2] otrs-aliases: use full email as alias and match [puppet] - 10https://gerrit.wikimedia.org/r/739847 (owner: 10Jbond) [15:19:21] (03PS1) 10Dzahn: miscweb: before enabling TLS, first remove nodePort line separately [deployment-charts] - 10https://gerrit.wikimedia.org/r/739848 (https://phabricator.wikimedia.org/T281538) [15:19:55] (LogstashKafkaConsumerLag) resolved: Too many messages in 
kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:22:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [15:24:39] (03CR) 10David Caro: [C: 03+1] "Essentially, it's better than before and (I think) it's not breaking anything, so LGTM" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [15:25:54] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jelto) >>! In T295900#7513678, @ssastry wrote: > Thanks @dzahn and @jelto ... It looks like isabelle cannot setup a ssh tunnel via `s... 
[15:26:32] (03CR) 10Dzahn: [C: 03+2] miscweb: before enabling TLS, first remove nodePort line separately [deployment-charts] - 10https://gerrit.wikimedia.org/r/739848 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:27:07] (03CR) 10MSantos: [C: 03+2] Log tile instead of line number on error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739773 (owner: 10Jgiannelos) [15:28:07] (03CR) 10Ppchelko: [C: 03+1] Don't trust Title that if it exists pageId will be > 0 [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739638 (https://phabricator.wikimedia.org/T295931) (owner: 10Daniel Kinzler) [15:28:21] (03Merged) 10jenkins-bot: Log tile instead of line number on error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739773 (owner: 10Jgiannelos) [15:31:24] (03Merged) 10jenkins-bot: miscweb: before enabling TLS, first remove nodePort line separately [deployment-charts] - 10https://gerrit.wikimedia.org/r/739848 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:33:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [15:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:46] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Review puppetmaster SSL configuration - https://phabricator.wikimedia.org/T268040 (10joanna_borun) [15:35:43] !log cr2-codfw# set interfaces et-1/0/3 disable - T295118 [15:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:47] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [15:36:28] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [15:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:49] (03PS1) 10Jelto: admin: let parsoid-test-admins see parsoid logs and restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) [15:38:58] (03CR) 10Dzahn: [C: 03+1] "double checked if we have other examples with the wildcard in the middle, and yes: start druid-*.service'. 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [15:39:24] !log lvs2007:~$ sudo service pybal stop - T295118 [15:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] (03PS1) 10Ssingh: wikidough: set CSP headers for the landing page [puppet] - 10https://gerrit.wikimedia.org/r/739853 [15:40:28] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2007.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:40:33] (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Avoid stopping execution on tegola error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739803 (owner: 10Jgiannelos) [15:40:45] (03PS8) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [15:40:54] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2015.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:41:42] (03Merged) 10jenkins-bot: tile-pregeneration: Avoid stopping execution on tegola error [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739803 (owner: 10Jgiannelos) [15:42:18] bblack: is that expected ^ ? 
[15:42:46] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:42:54] that's expected ^ [15:43:04] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:43:10] (03CR) 10jerkins-bot: [V: 04-1] Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:43:16] XioNoX: I don't think that's a direct effect of the lvs2007 depool or lvs2007 loss of connectivity [15:43:32] ok, good to know :) [15:43:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:43:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Jelto) [15:43:49] I think what you're seeing in the miscweb_411 thing, is that loss of connectivity to those k8s end-hosts has created a resiliency issue for that service (too many members of that service taken out at once) [15:44:12] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:18] (as in, I'd expect all of those servers listed are affected by your planned outage, and it's more than the threshold we've configured for minimum server counts) [15:44:37] this will create a problem, as actually-unreachable servers are still pooled to meet the minimum counts [15:44:42] bblack: none of those are affected [15:44:57] oh, I am trying to apply a change to that right now [15:45:05] I ran helmfile and it's not done yet [15:45:07] (03PS1) 10Muehlenhoff: ganeti: Add 
new Ganeti clusters in drmrs [software/spicerack] - 10https://gerrit.wikimedia.org/r/739855 [15:45:07] mutante: a change to what? [15:45:21] the miscweb on k8s [15:45:24] in codfw [15:45:38] oh, so, we assumed from timing it was related to the switch work, maybe it isn't! [15:46:06] mutante: I can wait for your work to be finished before doing my maintenance [15:46:07] it's not a problem that miscweb itself has an issue, but of course I dont want to affect anything else [15:46:18] either way, the "down but pooled" alert can be an indication of a real problem (if the down-but-pooled servers are non-functional, they still *are* getting traffic, which would be failing) [15:46:23] this should either finish or roll back in a minute [15:46:29] there is a timeout where it gives up [15:46:33] ok [15:46:44] mutante: let me know when you're all set [15:46:48] (03PS2) 10Ssingh: wikidough: set CSP headers for the landing page [puppet] - 10https://gerrit.wikimedia.org/r/739853 [15:46:51] Error: UPGRADE FAILED: timed out waiting for the condition [15:46:56] well, I am now ^ :p [15:47:03] alright, thanks! [15:47:20] Rollback was a success. 
[15:47:32] ^ stepping back now completely :) go ahead [15:47:36] (03Abandoned) 10Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [15:47:48] PROBLEM - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:47:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32493/console" [puppet] - 10https://gerrit.wikimedia.org/r/739853 (owner: 10Ssingh) [15:47:57] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/739855 (owner: 10Muehlenhoff) [15:48:13] hrmm, let me try to fix the icinga alert now [15:48:17] though [15:48:18] (03PS1) 10Jelto: admin: add wmde-fisch to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) [15:49:56] !log asw-b-codfw> request system power-off member 7 - T295118 [15:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:00] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [15:50:54] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:54] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:51:04] ^ there, those are fixed. 
laters [15:52:38] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:52:52] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:22] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:38] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:54] PROBLEM - Host ms-be2033 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:01] ^ expected [15:54:32] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:33] (03CR) 10Dzahn: [C: 03+1] "kind of wish uslfo and eqsin would also start with letters now :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/739855 (owner: 10Muehlenhoff) [15:54:34] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [15:54:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [15:54:57] ACK [15:55:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine, but changes to the permissions of an access group need IF meeting signoff, next meeting happening on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [15:57:06] PROBLEM - configured eth on lvs2007 is CRITICAL: ens2f1np1 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:59:50] (03CR) 10Muehlenhoff: "Depending on the use case for analytics-privatedata-users (needs clarification on task) the user will need a Kerberos principal (with an a" [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [16:00:14] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:00:22] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10wiki_willy) a:03Cmjohnson [16:00:34] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [16:03:08] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [16:07:28] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10wiki_willy) Hopefully it's just a loose cable. If not, just a FYI - the out of warranty date is listed as November 13, so we may or may not be able to RMA parts with Dell for it. Thanks, Willy >>... [16:07:32] vgutierrez, ema: Hey, is there any chance we could redirect the cache::haproxy TTFB metrics gathering to a stream that doesn't forward to logstash? 
(https://gerrit.wikimedia.org/r/c/operations/puppet/+/738422) [16:08:13] vgutierrez, ema: I'm seeing a massive increase in haproxy logs: https://logstash.wikimedia.org/goto/ca0c3e167e0b6c09ac6cc80f31b15104 [16:09:12] (03PS1) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [16:09:24] (03PS6) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [16:09:48] (03CR) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [16:11:26] (03PS7) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [16:13:50] (03PS1) 10Arturo Borrero Gonzalez: Revert "hiera: cloudbackup1001-dev: relocate ceph auth config" [labs/private] - 10https://gerrit.wikimedia.org/r/739858 [16:15:07] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] Revert "hiera: cloudbackup1001-dev: relocate ceph auth config" [labs/private] - 10https://gerrit.wikimedia.org/r/739858 (owner: 10Arturo Borrero Gonzalez) [16:16:10] (03CR) 10WMDE-Fisch: admin: add wmde-fisch to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [16:16:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32496/" [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [16:16:50] (03CR) 10David Caro: cloud: codfw1dev: hiera update for new backup servers (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [16:17:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10WMDE-Fisch) [16:18:25] (03PS1) 10Ahmon Dancy: Add a better placeholder for udp2log service value [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/739859 [16:18:51] (03CR) 10Ahmon Dancy: [C: 03+2] Add a better placeholder for udp2log service value [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/739859 (owner: 10Ahmon Dancy) [16:19:26] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Add new Ganeti clusters in drmrs [software/spicerack] - 10https://gerrit.wikimedia.org/r/739855 (owner: 10Muehlenhoff) [16:19:31] (03PS2) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [16:19:37] (03Merged) 10jenkins-bot: Add a better placeholder for udp2log service value [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/739859 (owner: 10Ahmon Dancy) [16:19:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [16:20:18] (03CR) 10David Caro: cloud: codfw1dev: hiera update for new backup servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [16:20:43] (03CR) 10Jelto: admin: let parsoid-test-admins see parsoid logs and restart php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [16:21:42] (03PS3) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 
10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [16:24:22] (03CR) 10Dzahn: [C: 03+1] "we can optionally have Joanna approve it earlier than the meeting" [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [16:28:14] (03PS10) 10EllenR: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [16:28:52] eh, tgr and subbu k-lined? [16:29:37] matrix got nuked [16:29:55] (03PS11) 10EllenR: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [16:32:33] (03CR) 10Lucas Werkmeister (WMDE): "> https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/9083/console : SUCCESS Please carefully r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:32:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [16:34:39] AntiComposite: ah! makes sense [16:34:58] (03CR) 10Jforrester: "Nice." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/739633 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [16:38:12] [11:36:29] <@amdj> (for clarification, all of matrix was accidentally k-lined, as it was placed on a /64 instead of a single IP) [16:39:00] (03CR) 10Jhernandez: [C: 04-1] "Typo to fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:40:21] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 178 probes of 725 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:41:20] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 86 probes of 718 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:41:36] (03CR) 10Subramanya Sastry: admin: let parsoid-test-admins see parsoid logs and restart php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739851 (https://phabricator.wikimedia.org/T295900) (owner: 10Jelto) [16:42:17] (03PS1) 10Jforrester: ExtensionDistributor: 1.37.0 is out now, so there's no beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739861 (https://phabricator.wikimedia.org/T289585) [16:43:59] (03PS1) 10Vgutierrez: cache::haproxy: Increase rsyslog config priority [puppet] - 10https://gerrit.wikimedia.org/r/739862 (https://phabricator.wikimedia.org/T290005) [16:44:23] (03CR) 10Cwhite: [C: 03+1] cache::haproxy: Increase rsyslog config priority [puppet] - 10https://gerrit.wikimedia.org/r/739862 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:45:42] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Increase rsyslog config priority [puppet] - 10https://gerrit.wikimedia.org/r/739862 
(https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:45:55] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10ssastry) >>! In T295900#7513863, @Jelto wrote: >>>! In T295900#7513678, @ssastry wrote: >> Thanks @dzahn and @je... [16:46:09] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Reedy) [16:46:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 725 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:47:34] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 8 probes of 718 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:52:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [16:54:29] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10sbassett) [16:57:50] (03CR) 10Reedy: [C: 03+1] ExtensionDistributor: 1.37.0 is out now, so there's no beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739861 (https://phabricator.wikimedia.org/T289585) (owner: 10Jforrester) [16:58:31] (03CR) 10Jhernandez: [C: 04-1] Set up beta test environment for QuickSurveys (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:59:30] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: introduce ceph rbd configuration [puppet] - 10https://gerrit.wikimedia.org/r/739870 (https://phabricator.wikimedia.org/T295584) [17:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:18] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: introduce ceph rbd configuration [puppet] - 10https://gerrit.wikimedia.org/r/739870 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [17:00:22] ✅ [17:02:25] (03PS1) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [17:03:00] (03CR) 10Lucas Werkmeister (WMDE): Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:03:05] (03CR) 10jerkins-bot: [V: 04-1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [17:03:25] (03PS2) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [17:04:01] (03CR) 10jerkins-bot: [V: 04-1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [17:04:29] (03PS3) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [17:17:55] (03CR) 10Eigyan: Set up beta test 
environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:18:30] (03PS1) 10AOkoth: sites: add new kubestage nodes [homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) [17:18:55] (03CR) 10Jhernandez: [C: 04-1] Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:19:24] (03PS4) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) [17:19:51] (03PS8) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [17:20:37] (03PS12) 10EllenR: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [17:21:20] (03PS13) 10EllenR: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) [17:21:34] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [17:21:43] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::cache::kafka::webrequest: add pki settings [puppet] - 10https://gerrit.wikimedia.org/r/739806 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:22:13] (03CR) 10Hnowlan: [C: 03+2] restbase_dev: restore removed username parameter [puppet] - 10https://gerrit.wikimedia.org/r/739824 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [17:22:52] (03CR) 10Dzahn: [C: 04-1] "based on previous comments, it needs the additional krb line and then a "[krb1001:~] $ sudo manage_principals.py create '. 
The m" [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [17:22:54] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [17:23:19] XioNoX: I assume you ^ [17:23:36] yep [17:24:12] :) [17:25:42] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10mmartorana) Hi, I have now provided all the required information. I am attaching here my SSH public key: `ssh-rsa... [17:26:11] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10mmartorana) [17:26:38] (03CR) 10Eigyan: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:27:19] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32500/console" [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [17:27:28] (03CR) 10Eigyan: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:28:16] (03PS2) 10Arturo Borrero Gonzalez: cloudbackup: introduce ceph rbd configuration [puppet] - 10https://gerrit.wikimedia.org/r/739870 (https://phabricator.wikimedia.org/T295584) [17:28:22] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:29:46] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Puppet last ran 1 day 
ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:30:58] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 30.01 ms [17:31:20] (03PS4) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [17:31:38] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [17:33:02] RECOVERY - Host thanos-be2002 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [17:33:06] Greetings All, is there anyone here that can deploy a change of mine to Beta cluster. It was scheduled earlier this week but put on hold due to change. The change was made and we would like to proceed as soon as we can if anyone can help. thanks [17:33:18] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [17:34:25] eigyan: it should just need a +2, a link? [17:34:26] RECOVERY - Host ms-be2033 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [17:34:32] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:34:42] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:35:28] RECOVERY - Host elastic2043 is UP: PING WARNING - Packet loss = 90%, RTA = 1238.56 ms [17:35:28] PROBLEM - SSH on elastic2043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:35:42] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Search [17:35:44] (03PS3) 10Arturo Borrero Gonzalez: cloudbackup: introduce ceph rbd configuration [puppet] - 10https://gerrit.wikimedia.org/r/739870 (https://phabricator.wikimedia.org/T295584) 
[17:35:56] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:37:16] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2043 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 641 days) https://wikitech.wikimedia.org/wiki/Search [17:37:24] RECOVERY - SSH on elastic2043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:41:08] (03PS2) 10Jgiannelos: tile-pregeneration: Silent cURL with faster timeout [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739538 [17:42:49] (03CR) 10Cwhite: [C: 03+1] "Tested manually in deployment-prep and production and appears to introduce no issue." [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:44:23] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:36] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) bumped this for early approval because of the high prio [17:45:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32502/" [puppet] - 10https://gerrit.wikimedia.org/r/739870 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [17:46:28] (03PS1) 10JMeybohm: Rewrite admin_ng helmfiles for local charts and fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/739890 [17:48:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:48:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) [17:48:58] (03PS2) 10JMeybohm: Rewrite admin_ng helmfiles for local charts and fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/739890 [17:49:10] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) all of row B has been updated, netbox updated, only item remaining is the removal of the old mgmt switches. [17:50:17] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: introduce base profiles [puppet] - 10https://gerrit.wikimedia.org/r/739892 (https://phabricator.wikimedia.org/T295584) [17:50:33] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Dzahn) Yea, so, this pretty much changed back to "as long as we are buying it again next time we are refreshing". So we can just turn this into decom either way. 
ACK [17:51:27] 10SRE, 10ops-codfw: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) [17:51:41] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) [17:51:49] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) p:05Low→03Medium [17:52:12] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) a:05Papaul→03None [17:52:42] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) correct me if i'm wrong @wiki_willy [17:52:55] PROBLEM - Host upload-lb.codfw.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:ed1a::2:b) [17:53:00] ummm [17:53:21] <_joe_> uh [17:53:26] uh oh [17:53:26] (03PS2) 10Arturo Borrero Gonzalez: cloudbackup: introduce base profiles [puppet] - 10https://gerrit.wikimedia.org/r/739892 (https://phabricator.wikimedia.org/T295584) [17:53:29] ipv6 broken internally? [17:53:29] I can reproduce [17:53:32] * akosiaris around [17:53:37] hello hello [17:53:39] I can reach it from my laptop [17:53:40] I don't have v6 to test [17:53:43] I can't [17:53:43] here [17:53:48] * volans here too [17:53:54] * volans no v6 [17:54:06] unreachable for me as well [17:54:08] 64 bytes from upload-lb.codfw.wikimedia.org (2620:0:860:ed1a::2:b): [17:54:09] majavah: can you post a mtr? 
[17:54:12] jayme: same [17:54:12] ping6 from home ^ [17:54:13] mtr --report-wide --show-ips --aslookup --tcp --port 443 en.wikipedia.org [17:54:15] here but won't be that useful [17:54:17] I can reproduce too [17:54:19] cdanis: already doing [17:54:20] maybe add a -6 [17:54:22] ty ty [17:54:24] hey [17:54:29] it's reachable directly [17:54:32] but not from cr2 [17:54:36] er, not from eqiad [17:54:37] I acked the page [17:54:54] starting a patch to depool codfw, just in case we want it [17:55:02] <_joe_> rzl: yes I was about to [17:55:10] cdanis: https://phabricator.wikimedia.org/P17776 [17:55:13] is v4 affected for anyone? [17:55:14] majavah: thanks [17:55:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:55:23] v4 works for me [17:55:26] v4 seems fine [17:55:34] both work for me [17:55:35] XioNoX: looks like not reachable from eqord either, given the traceroute ^ [17:55:37] v4 works for me [17:55:42] yeah, v6 stops at last hop for me, v4 works fine. [17:55:44] ok that's good [17:55:54] (03PS1) 10RLazarus: Depool codfw due to IPV6 connectivity issues [dns] - 10https://gerrit.wikimedia.org/r/739894 [17:55:58] ^ not merging yet but ready [17:56:13] as a data point, text-lb v6 still works for me [17:56:20] <_joe_> I have a 90% package loss on v6 to upload-lb codfw [17:56:24] cdanis: under majavah's one [17:56:39] also ingressing via eqord [17:56:41] interesting [17:56:56] actually given the DNS TTL, maybe best to depool first and ask questions later [17:57:00] rzl: +1 [17:57:01] opinions? and is anyone the IC? :) [17:57:05] * cwhite started doc [17:57:06] <_joe_> rzl: +1 [17:57:14] +! 
on depool, no idea what the issue is right now [17:57:14] for what it's worth, i also can't access it via ipv6 [17:57:19] text-lb and upload-lb are on separate LVS'es, right? [17:57:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:57:29] <_joe_> majavah: yes [17:57:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32503/" [puppet] - 10https://gerrit.wikimedia.org/r/739892 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [17:57:45] okay, if someone can check and +1 https://gerrit.wikimedia.org/r/739894 I'll go ahead [17:57:47] _joe_: ^ [17:57:47] <_joe_> in fact, that could be the point [17:58:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Depool codfw due to IPV6 connectivity issues [dns] - 10https://gerrit.wikimedia.org/r/739894 (owner: 10RLazarus) [17:58:09] (03CR) 10Majavah: [C: 03+1] Depool codfw due to IPV6 connectivity issues [dns] - 10https://gerrit.wikimedia.org/r/739894 (owner: 10RLazarus) [17:58:16] same here, last response is from cr2-eqord [17:58:20] (03CR) 10RLazarus: [C: 03+2] Depool codfw due to IPV6 connectivity issues [dns] - 10https://gerrit.wikimedia.org/r/739894 (owner: 10RLazarus) [17:58:22] (03CR) 10Legoktm: [C: 03+1] Depool codfw due to IPV6 connectivity issues [dns] - 10https://gerrit.wikimedia.org/r/739894 (owner: 10RLazarus) [17:58:22] <_joe_> I have a meeting now, but I think majavah was raising a good point [17:58:42] <_joe_> text-lb works and goes through the same cr2 [17:58:49] authdns-updating [17:58:58] <_joe_> so it must be cr2 -> lvs2007 I guess?
[17:59:05] https://phabricator.wikimedia.org/P17777 is an mtr from my VPS [17:59:41] RECOVERY - Host upload-lb.codfw.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [17:59:43] <_joe_> XioNoX: why is pybal still down on lvs2007? [17:59:53] lvs2007 is the one with maintenance [17:59:54] (for discussion later: could have just depooled upload, but I decided to keep it simple -- we could repool only text-lb later if we want to) [17:59:58] alright [17:59:58] authdns-update complete [18:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1800). [18:00:12] <_joe_> ok I need to get to a meeting [18:00:15] ping6 upload-lb.codfw.wikimedia.org now works for me [18:00:17] <_joe_> but ping me if I'm needed [18:00:43] after it's depooled, it went back up. [18:00:43] works for me too [18:00:45] https://phabricator.wikimedia.org/P17778 note the warnings at lines 30-33, I'm guessing those are already known? [18:00:48] o.O [18:00:49] I also have a meeting but lvs2007 is the that was done before this started. the ongoing work. right [18:00:50] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: specify administrative contact [puppet] - 10https://gerrit.wikimedia.org/r/739895 (https://phabricator.wikimedia.org/T295584) [18:01:09] yes [18:01:16] topranks: so re-enabling the link between asw-b and cr2-codfw fixed it [18:01:21] back working for me now [18:01:22] er, re-disabing [18:01:23] cool:) [18:01:26] mtr: https://phabricator.wikimedia.org/P17779 [18:01:34] same here [18:01:50] ok so it relates to the switch replacement you and pa.paul are doing then? [18:01:54] yeah [18:02:05] is this another sneaky layer2 adjacency issue or something [18:02:10] ok cool well there is a reason. [18:02:20] so should we keep codfw depooled? 
[18:02:49] yeah, better to keep depooled for now [18:03:09] that's the switch that returned i/o errors when committing? [18:03:19] akosiaris: yeah it got replaced [18:03:38] current theory is that the link went up while discarding v6 traffic only [18:04:02] ok, I guess then more traffic was blackholed too, not just ipv6 for upload-codfw, just not paging, so keep codfw depooled I'd say for the duration of the maint. [18:04:28] (03PS3) 10JMeybohm: Rewrite admin_ng helmfiles for local charts and fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/739890 [18:04:29] the LVS announcing the IP for upload-lb was connected to the unit being replaced right? [18:05:42] (03PS1) 10Cmjohnson: updating site.pp for an-test-coord1002. [puppet] - 10https://gerrit.wikimedia.org/r/739896 (https://phabricator.wikimedia.org/T293938) [18:06:09] I can't ssh at all to lvs2007 now [18:06:21] topranks: yeah lvs2007 is depooled [18:06:29] but now it's fully unreach? [18:06:57] cwhite, others: I need to drop off, I have an appointment -- need anything from me first? [18:07:20] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Cmjohnson) @marostegui the cable has been replaced, I am able to ssh into the server now. Please take a look [18:07:23] rzl: not from me [18:08:11] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp for an-test-coord1002. [puppet] - 10https://gerrit.wikimedia.org/r/739896 (https://phabricator.wikimedia.org/T293938) (owner: 10Cmjohnson) [18:08:55] (03PS1) 10Andrew Bogott: Keystone policy: add support for the keystonevalidate role [puppet] - 10https://gerrit.wikimedia.org/r/739902 (https://phabricator.wikimedia.org/T295234) [18:09:30] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Merged parts from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/737939/5/Rakefile over here."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/739890 (owner: 10JMeybohm) [18:11:00] (03CR) 10Jhernandez: [C: 04-1] Set up beta test environment for QuickSurveys (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:11:26] IIUC, the plan is to leave codfw depooled until switch maintenance is complete. Will someone doing the maintenance repool codfw afterwards? [18:11:50] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) Thanks @Cmjohnson - I can access the host too. Before it was also accessible but we saw the network flapping a few times. I am going to leave the host depooled for the weekend and if the... [18:11:51] Or at least notify someone who can repool? [18:13:04] (03Merged) 10jenkins-bot: Rewrite admin_ng helmfiles for local charts and fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/739890 (owner: 10JMeybohm) [18:13:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup: specify administrative contact [puppet] - 10https://gerrit.wikimedia.org/r/739895 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [18:13:42] (03PS9) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [18:14:11] (03PS5) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [18:18:29] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: require the lvm package [puppet] - 10https://gerrit.wikimedia.org/r/739903 (https://phabricator.wikimedia.org/T295584) [18:19:09] (03PS2) 10Arturo Borrero Gonzalez: cloudbackup: require the lvm package [puppet] - 10https://gerrit.wikimedia.org/r/739903 (https://phabricator.wikimedia.org/T295584) [18:19:20] (03PS14) 10Jhernandez: Set up
beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:20:35] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/compiler1001/32504/" [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [18:20:47] (03CR) 10Jhernandez: Set up beta test environment for QuickSurveys (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:21:04] yeah I can take care of it [18:21:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup: require the lvm package [puppet] - 10https://gerrit.wikimedia.org/r/739903 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [18:22:17] XioNoX: Thanks! [18:26:13] is there a certificate problem on https://en.wikipedia.beta.wmflabs.org/ , or is it just me? [18:26:26] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup: fix disk allocation [puppet] - 10https://gerrit.wikimedia.org/r/739905 (https://phabricator.wikimedia.org/T295584) [18:26:40] MatmaRex: https://phabricator.wikimedia.org/T296000 [18:26:51] thanks [18:27:14] (03PS6) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [18:27:50] (03PS7) 10AOkoth: site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) [18:28:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup: fix disk allocation [puppet] - 10https://gerrit.wikimedia.org/r/739905 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [18:28:20] (this seems like a good moment to note that the beta cluster has no official maintainer taking care of issues like that, T215217) [18:28:20] T215217: deployment-prep: Code stewardship request - 
https://phabricator.wikimedia.org/T215217 [18:36:20] (03CR) 10EllenR: Set up beta test environment for QuickSurveys (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:40:20] (03CR) 10Jhernandez: [C: 04-1] "https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/9087/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:41:19] (03CR) 10EllenR: Set up beta test environment for QuickSurveys (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [18:47:29] (03PS9) 10Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [18:47:34] (03CR) 10Jbond: "hopefully all sorted" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [18:49:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bullseye [18:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:09] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye [18:49:41] (03CR) 10MSantos: [C: 03+1] "could you elaborate on the 10 seconds max-time or add it to the commit message as well? After that I'm fine with you merging this patch." 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739538 (owner: 10Jgiannelos) [18:52:21] !log asw-b-codfw> request system reboot member 7 - T295118 [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:24] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [18:54:33] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:53] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:27] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [18:56:35] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [18:56:51] PROBLEM - Host ms-be2033 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:19] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:29] RECOVERY - Host ms-be2033 is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [18:57:31] RECOVERY - Host thanos-be2002 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [18:57:35] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.40 ms [18:57:43] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 30.25 ms [18:57:47] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705 [18:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:51] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 30.04 ms [18:57:52] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [18:58:07] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [18:58:28] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) [19:00:04] 
RoanKattouw and Urbanecm: May I have your attention please! UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T1900) [19:00:04] MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] hi [19:05:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2001.codfw.wmnet with reason: kernel upgrade [19:05:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2001.codfw.wmnet with reason: kernel upgrade [19:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bullseye [19:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:10] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye executed wit... [19:07:33] !log rebooting phab2001 to apply updated php and kernel packages [19:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:14] (03PS1) 10Andrew Bogott: disable_tool: run every 2 minutes rather than every 5 [puppet] - 10https://gerrit.wikimedia.org/r/739917 (https://phabricator.wikimedia.org/T170355) [19:10:36] i can deploy today [19:10:40] MatmaRex: still around? 
[19:11:29] yeah, hello urbanecm [19:11:31] thanks [19:12:17] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools automatic topic subscriptions as beta feature on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739846 (https://phabricator.wikimedia.org/T290500) (owner: 10Bartosz Dziewoński) [19:13:04] (03Merged) 10jenkins-bot: Enable DiscussionTools automatic topic subscriptions as beta feature on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739846 (https://phabricator.wikimedia.org/T290500) (owner: 10Bartosz Dziewoński) [19:13:29] !log upgrading php7.3 packages on phab1001 [19:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:37] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:13:45] * urbanecm closes beta ssh sessions and opens production ones [19:14:04] MatmaRex: mwdebug1001 has the change now. Can you test please? [19:14:11] ^ that is us not critical [19:14:14] looking [19:14:56] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:16:21] urbanecm: are you sure it's there? i don't see the expected effect [19:16:42] MatmaRex: let me double check [19:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:50] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Jclark-ctr) @Marostegui when i switched out cable earlier had left my laptop in another building i relayed information to chris. The connection activity led on nic was flashing abnormally after r...
[19:17:05] ACKNOWLEDGEMENT - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled 20after4 upgrading phab2001 https://wikitech.wikimedia.org/wiki/PyBal [19:17:05] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled 20after4 upgrading phab2001 https://wikitech.wikimedia.org/wiki/PyBal [19:17:25] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) updated site.pp, ran the script again and it made it to the debian installer but failed on raid cfg. [19:17:26] MatmaRex: I'm positive it is there https://www.irccloud.com/pastebin/9jBTWrv8/ [19:17:56] (assuming the observed configuration values are what i'm supposed to see) [19:17:59] oh, wait, i think i see [19:18:04] mutante: why not just depool? [19:18:21] i was testing on enwiki, but the train hasn't rolled out there yet [19:18:24] heh [19:18:29] and the feature this enables is in the last version only [19:18:36] it works as expected on mw.org, so that's fine [19:18:41] so, sync? [19:18:42] twentyafterfour: * [19:18:45] yeah. thanks [19:18:49] or are we going to backport sth? [19:19:03] no, it should be rolled out in like an hour, right? [19:19:06] RhinosF1: didn't realize it was pooled it's the non-active server [19:19:25] MatmaRex: correct, unless we run into any new blockers [19:19:42] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) Oh sweet - thanks for the information John. So maybe it was indeed a bad cable. I will check on Monday if we had some other flapping since the cable change, and if all goes well I will... [19:19:43] twentyafterfour: I'm guessing because the alert says down but pooled that it is? 
[19:19:44] yeah. it's too large to backport, if that happens, well, it won't be there [19:19:47] mutante was already started on the primary server when the alert popped up so he's finishing that before touching phab2001 again [19:20:11] got it [19:20:12] syncing [19:20:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4b4c0bca9aa6bceac86f40f03ad688b9d4481c58: Enable DiscussionTools automatic topic subscriptions as beta feature on most wikis (T290500) (duration: 01m 04s) [19:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:11] T290500: [Config Change] Deploy Automatic Topic Subscriptions as Beta Feature at Initial Wikis - https://phabricator.wikimedia.org/T290500 [19:21:32] MatmaRex: and, done [19:21:36] anything else i can do for you today? [19:21:44] urbanecm: thanks! [19:22:37] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: set CSP headers for the landing page [puppet] - 10https://gerrit.wikimedia.org/r/739853 (owner: 10Ssingh) [19:24:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [19:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:43] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:31:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:09] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:31:23] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:31:53] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:34:35] ok this new IPVS diff check is weird. we depooled git-ssh on phab2001 but now there are two alerts instead of one [19:34:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:59] two alerts on two separate LVS boxes [19:35:24] so, we rebooted a host and it is back up [19:35:34] we dont know why the alert is not going away [19:35:51] we wouldnt worry about the actual service, just the alert [19:36:15] I seem to remember this thing happened and pybal needed a restart and it was all ok again [19:36:40] but I don't want to just do that [19:43:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:46:03] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:46:11] (03PS1) 10Milimetric: analytics/systemd/sqoop: Add daily sqoop [puppet] - 10https://gerrit.wikimedia.org/r/739923 (https://phabricator.wikimedia.org/T290516) [19:48:45] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) p:05High→03Medium [19:51:04] (03CR) 10Dzahn: [C: 03+1] "looks good to me but check with Janis when to actually merge it" [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [19:51:35] (03CR) 10AOkoth: [C: 03+2] site: include new k8s hosts on kubestage group [puppet] - 10https://gerrit.wikimedia.org/r/739857 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [19:52:05] !log legoktm@cumin1001 conftool action : set/weight=10; selector: name=thumbor1006.eqiad.wmnet [19:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:53] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1003.eqiad.wmnet [19:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1004.eqiad.wmnet [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:49] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: run every 2 minutes rather than every 5 [puppet] - 10https://gerrit.wikimedia.org/r/739917 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:55:07] (03CR) 10Majavah: [C: 03+1] "looks good if the role exists" [puppet] - 10https://gerrit.wikimedia.org/r/739902 (https://phabricator.wikimedia.org/T295234) (owner: 10Andrew Bogott) [19:59:56] (03CR) 10Jeena Huneidi: [C: 03+2] Don't trust Title that if it exists pageId will be > 0 [core] (wmf/1.38.0-wmf.9) - 
https://gerrit.wikimedia.org/r/739638 (https://phabricator.wikimedia.org/T295931) (owner: Daniel Kinzler) [20:00:04] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T2000). [20:00:06] (PS1) Ryan Kemper: elasticsearch: sleep instead of waiting 4 write q [cookbooks] - https://gerrit.wikimedia.org/r/739924 [20:00:49] (PS2) Ryan Kemper: elasticsearch: sleep instead of waiting 4 write q [cookbooks] - https://gerrit.wikimedia.org/r/739924 [20:00:57] (CR) Gehel: [C: +1] "LGTM as a short term workaround" [cookbooks] - https://gerrit.wikimedia.org/r/739924 (owner: Ryan Kemper) [20:01:56] (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch: sleep instead of waiting 4 write q [cookbooks] - https://gerrit.wikimedia.org/r/739924 (owner: Ryan Kemper) [20:01:57] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705 [20:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:00] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [20:05:00] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705 [20:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:27] (KubernetesCalicoDown) firing: kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [20:08:28] ^ kubestage1003 - new setup ongoing [20:08:38] 
we dont know how to tell jinxer [20:11:10] (PS1) Legoktm: thumbor: Remove thumbor1003 and thumbor1004 from memcached [puppet] - https://gerrit.wikimedia.org/r/739925 (https://phabricator.wikimedia.org/T285479) [20:12:32] (PS1) Legoktm: conftool: Remove thumbor1003 and thumbor1004 [puppet] - https://gerrit.wikimedia.org/r/739926 (https://phabricator.wikimedia.org/T285479) [20:12:35] (PS1) Legoktm: Remove thumbor1003 and thumbor1004 from site.pp [puppet] - https://gerrit.wikimedia.org/r/739927 (https://phabricator.wikimedia.org/T285479) [20:13:07] (CR) Legoktm: [C: +2] thumbor: Remove thumbor1003 and thumbor1004 from memcached [puppet] - https://gerrit.wikimedia.org/r/739925 (https://phabricator.wikimedia.org/T285479) (owner: Legoktm) [20:13:18] (PS3) Jgiannelos: tile-pregeneration: Make curl calls silent [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739538 [20:14:31] (PS1) Dzahn: icinga: authorize Arnold to run commands [puppet] - https://gerrit.wikimedia.org/r/739928 (https://phabricator.wikimedia.org/T288645) [20:14:54] (PS2) Dzahn: icinga: authorize Arnold to run commands [puppet] - https://gerrit.wikimedia.org/r/739928 (https://phabricator.wikimedia.org/T288645) [20:15:47] (CR) Jgiannelos: tile-pregeneration: Make curl calls silent (1 comment) [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739538 (owner: Jgiannelos) [20:18:05] (CR) Ssingh: [C: +1] icinga: authorize Arnold to run commands [puppet] - https://gerrit.wikimedia.org/r/739928 (https://phabricator.wikimedia.org/T288645) (owner: Dzahn) [20:19:35] (CR) Dzahn: [C: +2] icinga: authorize Arnold to run commands [puppet] - https://gerrit.wikimedia.org/r/739928 (https://phabricator.wikimedia.org/T288645) (owner: Dzahn) [20:19:45] (Merged) jenkins-bot: Don't trust Title that if it exists pageId will be > 0 [core] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/739638 
(https://phabricator.wikimedia.org/T295931) (owner: Daniel Kinzler) [20:25:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:40] !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.9/includes/page/PageStore.php: Backport for T295931 (duration: 01m 04s) [20:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:43] T295931: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $pageId: must be greater than zero - https://phabricator.wikimedia.org/T295931 [20:27:42] !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.9/tests/phpunit/includes/page/PageStoreTest.php: Backport for T295931 (duration: 01m 03s) [20:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:42] (PS1) Jeena Huneidi: group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739931 [20:28:44] (CR) Jeena Huneidi: [C: +2] group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739931 (owner: Jeena Huneidi) [20:29:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:39] (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739931 (owner: Jeena Huneidi) [20:30:55] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293950 [20:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:01] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [20:31:58] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293950 (duration: 01m 03s) [20:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:38] PROBLEM - Juniper alarms on mr1-esams is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 91.198.174.247 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:37:06] RECOVERY - Juniper alarms on mr1-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:39:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:19] Logs look okay so I will deploy to all wikis now [20:41:07] (PS1) Jeena Huneidi: all wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739934 [20:41:09] (CR) Jeena Huneidi: [C: +2] all wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739934 (owner: Jeena Huneidi) [20:41:54] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - https://gerrit.wikimedia.org/r/739934 (owner: Jeena Huneidi) [20:43:07] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.9 refs T293950 [20:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:11] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [20:43:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:23] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=phab2001-vcs.codfw.wmnet [20:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:21] (CR) Dzahn: "we confirmed after this Arnold could now schedule downtimes etc also via web UI" [puppet] - https://gerrit.wikimedia.org/r/739928 (https://phabricator.wikimedia.org/T288645) (owner: Dzahn) [20:49:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:50:31] ^ that's since I tried setting the one in codfw to inactive.. 
hrmm [20:50:42] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:50:44] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:50:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [20:50:58] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:02] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:51:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705 [20:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:25] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [20:51:29] !log restart blazegraph on wdqs1006 (jvm stuck) [20:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [20:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:38] PROBLEM - Query Service HTTP Port on 
wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:53:24] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:53:24] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:53:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:48] twentyafterfour: ^ fixed..phhhew [20:53:53] arnoldokoth: ^ [20:54:14] so I basically just set it to inactive, back to no and then yes [20:54:26] got away without the restart so far [20:54:38] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:55:02] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:55:15] (PS13) Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [20:56:41] except we have the other new alert about compilation of etcd templates. been there too in the past.. ugh [20:57:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:10] !log [puppetmaster2001:/var/run/confd-template] $ sudo rm .git-ssh*.err [21:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:00:48] !log [puppetmaster1001:/var/run/confd-template] $ sudo rm .git-ssh*.err [21:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:56] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:00:58] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:01:07] what a fight.. but this fixed it per https://wikitech.wikimedia.org/wiki/Confd#Compilation_is_broken [21:01:28] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 5.593e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:01:28] always remembers afterwards how the same thing happened before :p [21:02:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:02:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [21:03:09] XioNoX: there were more lvs/pybal alerts that had nothing to do with lvs2007/2008, but I fixed them now. 
what is left is actually your maintenance. it might have looked like related again but was not [21:03:20] SRE, ops-codfw, serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (wiki_willy) Hi @Dzahn - there's an email thread with Alex, Lukasz, Faidon, and Mark around whether or not it's worth the extra cycles needed to replace mw2280.... [21:04:41] mutante: thanks! [21:07:44] (CR) MSantos: [C: +2] tile-pregeneration: Make curl calls silent [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739538 (owner: Jgiannelos) [21:07:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [21:08:52] (Merged) jenkins-bot: tile-pregeneration: Make curl calls silent [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/739538 (owner: Jgiannelos) [21:13:59] (CR) GeoffreyT2000: "Please abandon this change. All wikis are now on 1.38.0-wmf.9." 
[core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/737085 (https://phabricator.wikimedia.org/T291967) (owner: Ppchelko) [21:30:31] !log legoktm@cumin1001 conftool action : set/pooled=inactive; selector: name=thumbor1003.eqiad.wmnet [21:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:36] !log legoktm@cumin1001 conftool action : set/pooled=inactive; selector: name=thumbor1004.eqiad.wmnet [21:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:15] !log asw-b-codfw> request system power-off member 7 [21:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:48] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:17] (CR) Legoktm: [C: +2] conftool: Remove thumbor1003 and thumbor1004 [puppet] - https://gerrit.wikimedia.org/r/739926 (https://phabricator.wikimedia.org/T285479) (owner: Legoktm) [21:34:18] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:22] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:13] (CR) SBassett: [C: +1] "...for the allow list portion, as long as it doesn't become a free-for-all." 
[puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes) [21:35:20] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:04] PROBLEM - Host ms-be2033 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:08] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [21:36:59] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor1004.eqiad.wmnet [21:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:24] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [21:50:07] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor1004.eqiad.wmnet [21:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:22] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.121e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:53:10] ops-eqiad, decommission-hardware, serviceops: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (Legoktm) [21:53:33] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor1003.eqiad.wmnet [21:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:14] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [22:02:02] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [22:02:02] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 30.06 ms [22:02:10] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms [22:02:12] RECOVERY - 
Host thanos-be2002 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [22:02:14] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [22:02:32] RECOVERY - Host ms-be2033 is UP: PING OK - Packet loss = 0%, RTA = 30.08 ms [22:03:23] (CR) Legoktm: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes) [22:03:24] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:46] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor1003.eqiad.wmnet [22:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:55] ops-eqiad, decommission-hardware, serviceops: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (Legoktm) [22:08:23] (CR) Legoktm: [C: +2] Remove thumbor1003 and thumbor1004 from site.pp [puppet] - https://gerrit.wikimedia.org/r/739927 (https://phabricator.wikimedia.org/T285479) (owner: Legoktm) [22:16:28] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:00] (PS1) Ebernhardson: wdquery_service: Provide return-to url with auth checks [puppet] - https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) [22:37:56] RECOVERY - configured eth on lvs2007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:44:12] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [22:44:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:16] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [22:48:57] !log asw-b-codfw> request system power-off member 7 [22:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:10] ryankemper: note that we're still working on codfw row B7, it's having more issues than anticipated, I think there are 1 or 2 of your boxes there [22:51:00] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:24] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:30] PROBLEM - Host ms-be2033 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:42] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:45] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [22:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:48] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [22:52:56] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [22:53:04] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [22:54:38] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [22:55:02] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:56:48] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All 
metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [22:57:40] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [23:02:35] SRE, ops-codfw, serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (Dzahn) Thanks Willy! Alright, will wait for that. Regardless of the outcome we would remove the existing broken one, I think. [23:02:52] (PS1) Ryan Kemper: elasticsearch: add import for cluster check error [cookbooks] - https://gerrit.wikimedia.org/r/739943 [23:03:18] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [23:03:48] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [23:04:35] (PS2) Ryan Kemper: elasticsearch: add import for cluster check error [cookbooks] - https://gerrit.wikimedia.org/r/739943 (https://phabricator.wikimedia.org/T280221) [23:05:18] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [23:05:19] (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch: add import for cluster check error [cookbooks] - https://gerrit.wikimedia.org/r/739943 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper) [23:05:24] RECOVERY - Host thanos-be2002 is UP: PING OK - Packet 
loss = 0%, RTA = 30.00 ms [23:05:28] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 34.79 ms [23:05:46] RECOVERY - Host ms-be2033 is UP: PING OK - Packet loss = 0%, RTA = 30.07 ms [23:21:10] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:27:02] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:22] trying one more time what I tried this morning.. and timed out [23:27:55] and .. it worked ..and quickly:) [23:28:50] !log dzahn@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:24] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:34:47] (PS1) Dzahn: miscweb: try enabling TLS after nodePort is removed and we deployed [deployment-charts] - https://gerrit.wikimedia.org/r/739945 (https://phabricator.wikimedia.org/T281538) [23:35:17] well, that pybal alert is for sure related again to my deploying. 
the miscweb thing is new, not critical [23:35:56] sigh that I cause the alert again though [23:37:24] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:39:00] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled daniel_zahn new service setup https://wikite [23:39:00] edia.org/wiki/PyBal [23:39:00] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled daniel_zahn new service setup https://wikite [23:39:00] edia.org/wiki/PyBal [23:41:58] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:43:10] ACKNOWLEDGEMENT - PyBal backends health 
check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled daniel_zahn on it - not related to lvs2007 w [23:43:10] s://wikitech.wikimedia.org/wiki/PyBal [23:44:28] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1001.eqiad.wmnet,service=miscweb [23:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:32] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705 [23:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:36] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705