[00:00:05] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T0000).
[00:00:05] Jdlrobson, Juan_90264, and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:33] quite full window :)
[00:00:43] i can deploy today
[00:01:00] Jdlrobson: zabe: hi, around?
[00:01:08] present!
[00:01:22] (PS4) Urbanecm: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:01:32] hi, yes
[00:02:13] zabe: congrats on your OC appointment, btw
[00:02:50] :D
[00:02:51] thx
[00:05:32] Jdlrobson: I'm trying to understand what it does. If I get it right, you want to migrate to the new vector-2022 skin (rather than having one skin with "subskin", as before).
I don't understand why wgVectorSkinMigrationMode is only set for group0/group1 (other parts of the patch use the desktop-improvements dblist which includes group2 wikis too)
[00:06:38] * ebernhardson apparently put a patch in the wrong deploy window, but this is full enough i'll wait :)
[00:06:58] Basically trying to do the risky part of the work now
[00:07:25] End goal is the new Vector skin displays in Special:Preference on all desktop improvements, group 1 and group 0 wikis
[00:07:31] So group 2 desktop improvements wikis will be included
[00:07:43] all else (for example English Wikipedia) will not show the preference
[00:07:52] I'm planning to do that later in the week (likely Thursday AM)
[00:08:31] urbanecm: Ah but yes
[00:08:46] you are right desktop-improvements should have been set in VectorSkinMigrationMode
[00:08:48] (PS1) Esanders: Enable reply tool by default on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645)
[00:09:20] Jdlrobson: good i asked that question then :). Feel free to change the patch
[00:09:23] (CR) Esanders: [C: -1] "Scheduled for 7th Feb" [mediawiki-config] - https://gerrit.wikimedia.org/r/758988 (https://phabricator.wikimedia.org/T296645) (owner: Esanders)
[00:09:36] (PS5) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[00:09:38] and thanks for the rest of the explanation. Why do you say this part is risky? (In other words, is there anything special i should be paying attention to?)
[00:09:40] Does that make more sense now?
[00:09:53] Thanks for being my sounding board :)
[00:10:28] euwiki is a desktop-improvements wiki, mediawiki is group0 and itwiki is group1
[00:11:06] yup yup, patch makes sense now :). Thanks.
[00:12:04] * urbanecm waits for the "why is it risky" answer and then he'll +2
[00:12:37] So this change should be adding the skin to all non-wikipedias as well as arywiki, bnwiki, fawiki, frwiki, hewiki, idwiki, kowiki, ptwiki, srwiki, , thwiki, trwiki, vecwiki, viwiki
[00:12:55] it's not risky from a fatal point of view
[00:13:03] it's risky from a "i got opted out of new Vector" point of view
[00:13:06] anything using useskinversion will break
[00:13:13] but I think that's unlikely
[00:13:28] the data suggests that's working as expected
[00:13:40] Jdlrobson: right, so a product definition of "risky" rather than deployer definition of "risky"
[00:13:44] yep exactly
[00:13:54] (CR) Urbanecm: [C: +2] Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:13:58] let's deploy then :)
[00:13:59] I'm doing it today, so that I can revert tomorrow if anything is wrong according to my product manager or we hear about problems from community
[00:14:39] AntiComposite: just saw your comment, are you saying useskinversion will no longer work post-patch (or that's a potential consequence should sth be wrong)?
[00:14:47] (Merged) jenkins-bot: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:14:50] https://eu.wikipedia.org/wiki/Azala?useskinversion=1
[00:14:51] That's by design urbanecm
[00:15:01] ok
[00:15:15] If the wiki is in desktop-improvements, part of being in that group is being okay with bugs. If it's not, then they've specifically opted into the skin in preferences so should also be okay with bugs.
[00:15:43] Jdlrobson: patch is at mwdebug1001, please test :)
[00:15:46] testing :)
[00:15:59] gimme a few minutes on this one
[00:16:05] sure thing
[00:17:02] Okay test 1) Hebrew anonymous users are still getting the same default skin. While it looks the same, it's skin-vector2022 under the hood. So that's a pass.
[00:17:09] next test..
[00:17:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2006.codfw.wmnet with OS buster
[00:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:49] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS b...
[00:18:31] test 2) If i've opted out of new Vector on Hebrew, I'm still opted out in mwdebug1001 so that's also a pass.
[00:21:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:31] test 3) Wikivoyage, where vector is default, if I am opted out I am still opted out in debug1002
[00:22:48] test 4) Wikivoyage, where vector is default, if I am opted in I am still opted in in debug1001
[00:23:20] okay urbanecm let's sync it
[00:23:23] I'm happy it's working as expected.
[00:23:51] Jdlrobson: ad test 3, i assume there's a typo in the debug srv name?
the patch's not at mwdebug1002
[00:25:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:32] syncing
[00:27:47] oh that's a typo
[00:27:50] i tested 1001
[00:27:59] thought so
[00:29:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b2c13c64029cba3cd34f0e6144d322508fb4afb4: Enable migration mode on all group 0, group 1 and desktop-improvement wikis (T299927) (duration: 01m 58s)
[00:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:27] T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927
[00:29:27] Jdlrobson: it's live
[00:30:12] zabe: let's go to your patch now. Can you test it once it's at a mwdebug server?
[00:30:15] thanks urbanecm
[00:30:21] any time Jdlrobson
[00:30:22] now to monitor! wish me luck :)
[00:30:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:06] yeah, but only in the way that I confirm does not break down
[00:31:11] Sorry I'm late, I'm ready to deploy
[00:31:27] * it does not break
[00:31:38] stuff
[00:31:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:57] Juan_90264: hello!
please wait, I'll ping you once your patch is up next
[00:32:08] zabe: should be good enough, as it's refactoring only
[00:32:24] Okay urbanecm
[00:32:38] (CR) Urbanecm: [C: +2] Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:32:44] (PS5) Urbanecm: Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:32:52] (CR) Urbanecm: [C: +2] Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:32:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:31] (Merged) jenkins-bot: Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[00:34:32] zabe: pulled to mwdebug1001
[00:37:28] zabe: let me know how the testing goes.
[00:38:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:05] urbanecm: looks good to me, nothing seems to break and I can't find anything in logstash
[00:38:19] okay, let's try it then
[00:38:24] logstash's clear to me too
[00:38:52] Cs.php syncing now
[00:39:39] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 06444c16d29d78256d270564ae25ad887d3a2112: Start writing to some wmg* constants (T45956; 1/2) (duration: 00m 49s)
[00:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:42] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956
[00:40:12] db.php syncing now (although that's mwmaint-only AFAIK)
[00:40:46] !log urbanecm@deploy1002 Synchronized docroot/noc/db.php: 06444c16d29d78256d270564ae25ad887d3a2112: Start writing to some wmg* constants (T45956; 2/2) (duration: 00m 49s)
[00:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:51] zabe: and, live
[00:41:12] thanks for your help :)
[00:41:30] zabe: can you please keep an eye logstash for few minutes and shout if something's bad happening?
[00:41:38] just in case we missed sth
[00:41:52] (PS5) Urbanecm: Add 'wgUploadNavigationUrl' upload page of ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758497 (https://phabricator.wikimedia.org/T300466) (owner: Juan90264)
[00:42:12] of course
[00:42:19] (CR) Urbanecm: [C: +2] Add 'wgUploadNavigationUrl' upload page of ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758497 (https://phabricator.wikimedia.org/T300466) (owner: Juan90264)
[00:42:22] thanks zabe
[00:42:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:23] (Merged) jenkins-bot: Add 'wgUploadNavigationUrl' upload page of ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/758497 (https://phabricator.wikimedia.org/T300466) (owner: Juan90264)
[00:43:38] Juan_90264: your patch's at mwdebug1001, please test
[00:43:44] Ok
[00:43:49] (PS3) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[00:44:02] (PS4) Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927)
[00:46:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2006.codfw.wmnet with OS buster
[00:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:13] SRE,
ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS buste...
[00:49:25] Urbanecm: I tested and approved
[00:49:30] syncing
[00:50:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b560843182f2f8dc0b189cd80f021b60749c5c90: Add wgUploadNavigationUrl upload page of ptwikinews (T300466) (duration: 00m 50s)
[00:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:57] T300466: enable local upload on ptwikinews - https://phabricator.wikimedia.org/T300466
[00:51:06] Juan_90264: it's live
[00:51:12] so, i think we're done :)
[00:51:26] !log UTC late B&C window completed
[00:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:54] ebernhardson: hi, in case you wish to self-deploy sth now, feel free to, B&C is over.
[00:52:15] urbanecm: alright, thanks!
[00:52:33] (PS4) Ebernhardson: rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[00:52:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS buster
[00:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:25] Change working, thank Urbanecm!
[00:53:26] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[00:53:31] *thanks
[00:53:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:22] great :)
[00:56:47] (y)
[00:58:49] (CR) Ebernhardson: [C: +2] rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[00:59:32] (Merged) jenkins-bot: rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[01:03:13] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753788|rdf-streaming-updater: add the reconciliation stream (T279541)]] (duration: 00m 49s)
[01:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:16] T279541: Add a reconciliation strategy to the wdqs streaming updater - https://phabricator.wikimedia.org/T279541
[01:03:34] (PS1) Jdlrobson: Opt in link should be different in migration mode [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/758911 (https://phabricator.wikimedia.org/T299927)
[01:04:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:50] seems all good, i'm done
[01:06:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE
helmfile.d/services/mwdebug: sync on pinkunicorn
[01:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2007.codfw.wmnet with OS buster
[01:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:14] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:12:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS buster
[01:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:09] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[01:13:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve2007.codfw.wmnet with OS buster
[01:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:20] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:17:00] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[01:22:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS buster
[01:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:20] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[01:23:35] (CR) jerkins-bot: [V: -1] Opt in link should be different in migration mode [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/758911 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[01:40:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2007.codfw.wmnet with OS buster
[01:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:29] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:45:22] (CR) Krinkle: [C: +1] "LGTM, but test throroughly on mwdebug and watch Logstash for mwdebug prior to deployment."
[mediawiki-config] - https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[01:57:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2008.codfw.wmnet with OS buster
[01:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:45] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2008.codfw.wmnet with OS b...
[02:00:26] (CR) Krinkle: [C: +1] Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[02:06:13] (PS1) Andrew Bogott: nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs [puppet] - https://gerrit.wikimedia.org/r/758998 (https://phabricator.wikimedia.org/T291405)
[02:08:08] (CR) Andrew Bogott: [C: +2] nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs [puppet] - https://gerrit.wikimedia.org/r/758998 (https://phabricator.wikimedia.org/T291405) (owner: Andrew Bogott)
[02:18:30] (CR) Clare Ming: [C: +1] Migration mode enabled everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[02:29:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2008.codfw.wmnet with OS buster
[02:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:54] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host
ml-serve2008.codfw.wmnet with OS buste...
[02:36:29] SRE, Icinga, User-Ladsgroup: Request downtime hosts and servies privileges in Icinga - https://phabricator.wikimedia.org/T300660 (Papaul) @Ladsgroup thanks I will take care of that tomorrow.
[02:36:57] SRE, Icinga, User-Ladsgroup: Request downtime hosts and services privileges in Icinga - https://phabricator.wikimedia.org/T300660 (Papaul)
[02:40:02] (PS1) Andrew Bogott: Revert "nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs" [puppet] - https://gerrit.wikimedia.org/r/758913
[02:40:33] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul)
[02:42:05] SRE, ops-codfw, DC-Ops, Machine-Learning-Team, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) @elukey all yours leaving the task open since i don't have the Packing Slip to receive the servers in Coupa
[02:48:50] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[02:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:06] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (Papaul)
[04:44:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:00:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 104 jobs
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:27:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:10] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:36:02] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:37:32] (CR) Andrew Bogott: [C: +2] openstack: tell dynamicproxy to not edit dns records [puppet] - https://gerrit.wikimedia.org/r/756117 (https://phabricator.wikimedia.org/T295246) (owner: Majavah)
[06:46:52] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.)
timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[06:51:32] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[06:58:43] (CR) Elukey: [C: +1] Actually unset env vars that are activated by conda/activate.d/env_vars.sh [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/758983 (https://phabricator.wikimedia.org/T292699) (owner: Ottomata)
[06:58:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[06:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[06:59:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[06:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[06:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] !log marostegui@cumin1001 START - Cookbook
sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300402)', diff saved to https://phabricator.wikimedia.org/P19886 and previous config saved to /var/cache/conftool/dbconfig/20220202-070012-marostegui.json
[07:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:15] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:00:57] (PS1) Marostegui: add_linter_namespace_T300402.py: stop&start slave [software/schema-changes] - https://gerrit.wikimedia.org/r/759147 (https://phabricator.wikimedia.org/T300402)
[07:04:16] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:06:49] (CR) Marostegui: [V: +2 C: +2] "This has been used for s2 for linter already."
[software/schema-changes] - https://gerrit.wikimedia.org/r/759147 (https://phabricator.wikimedia.org/T300402) (owner: Marostegui)
[07:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300402)', diff saved to https://phabricator.wikimedia.org/P19887 and previous config saved to /var/cache/conftool/dbconfig/20220202-070722-marostegui.json
[07:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:07:55] (CR) Elukey: [C: +1] "LGTM! Just to be sure - in the ML case, when we'll have staging-ml-codfw (or something similar), we'll also get inference.staging.discover" [deployment-charts] - https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[07:09:23] (PS2) Marostegui: wmnet: Promote es1020 to es4 master [dns] - https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127)
[07:09:33] (CR) Elukey: [C: +1] Deploy ingress components to staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[07:10:03] (PS2) Marostegui: mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127)
[07:16:06] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:19:02] (PS1) Elukey: helmfile.d: add sa to ML's draftquality InferenceService transformers [deployment-charts] - https://gerrit.wikimedia.org/r/759151
[07:20:36] (CR) Marostegui: [C: +2] Revert "db2105: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758813 (owner: Marostegui)
[07:22:27] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19888 and previous config saved to /var/cache/conftool/dbconfig/20220202-072227-marostegui.json
[07:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:47] (CR) Ladsgroup: [C: +1] wmnet: Promote es1020 to es4 master [dns] - https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[07:24:16] (CR) Ladsgroup: [C: +1] mariadb: Promote es1020 to es4 master [puppet] - https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[07:28:40] (CR) Marostegui: [C: +2] db-production.php: Disable writes on es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/758715 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[07:29:21] (Merged) jenkins-bot: db-production.php: Disable writes on es4 [mediawiki-config] - https://gerrit.wikimedia.org/r/758715 (https://phabricator.wikimedia.org/T300127) (owner: Marostegui)
[07:30:10] (CR) Filippo Giunchedi: [C: +2] sre: check agent resources too in PuppetFailure [alerts] - https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628) (owner: Filippo Giunchedi)
[07:30:14] (PS2) Filippo Giunchedi: sre: check agent resources too in PuppetFailure [alerts] - https://gerrit.wikimedia.org/r/758860 (https://phabricator.wikimedia.org/T299628)
[07:30:51] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Disable writes on es4 T300127 (duration: 00m 51s)
[07:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:54] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127
[07:31:04] SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383
(10Ladsgroup) 05Open→03Resolved I boldly close this to remove it from our board. Reopen if you have issues accessing. [07:31:24] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10Ladsgroup) 05Open→03Resolved I boldly close this to remove it from our board. Reopen if you have issues accessing. [07:31:30] 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Ladsgroup) 05Open→03Resolved I boldly close this to remove it from our board. Reopen if you have issues accessing. [07:31:52] 10SRE, 10SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10Ladsgroup) 05Open→03Resolved I boldly close this to remove it from our board. Reopen if you have issues accessing. [07:32:09] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: configure mapping and cache rules for grafana-next-rw [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:32:24] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:32:41] (03CR) 10Filippo Giunchedi: hiera: add grafana-next-rw to grafana public_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:32:56] (03CR) 10Filippo Giunchedi: [C: 03+1] wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:33:24] (03CR) 10Filippo Giunchedi: [C: 03+1] idp, grafana: configure grafana-next-rw for sso [puppet] - 10https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:33:41] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: add grafana-next-rw to cors origins [puppet] - 10https://gerrit.wikimedia.org/r/757775 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:35:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [07:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:12] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es4 T300127 (duration: 00m 50s) [07:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:15] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127 [07:36:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [07:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [07:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to 
https://phabricator.wikimedia.org/P19889 and previous config saved to /var/cache/conftool/dbconfig/20220202-073731-marostegui.json [07:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [07:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:13] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758918 [07:38:21] (03CR) 10Marostegui: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758918 (owner: 10Marostegui) [07:38:53] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es4 T300127 [07:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:58] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es4 T300127 [07:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1020 with weight 10 T300127', diff saved to https://phabricator.wikimedia.org/P19890 and previous config saved to /var/cache/conftool/dbconfig/20220202-073918-root.json [07:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:23] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: update wmf-proxy-dashboard [07:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:44] (03CR) 10Elukey: [C: 03+2] helmfile.d: add sa to ML's draftquality InferenceService transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/759151 (owner: 10Elukey) [07:43:50] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:44:43] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: update wmf-proxy-dashboard (duration: 02m 19s) [07:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [07:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:59] 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10ayounsi) p:05Triage→03High [07:46:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "See inline for a thought" [puppet] - 10https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [07:46:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [07:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:56] (03CR) 10Filippo Giunchedi: [C: 03+1] apifeatureusage: increase logstash heap memory to 2G [puppet] - 10https://gerrit.wikimedia.org/r/758970 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [07:47:45] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: update wmf-proxy-dashboard (eqiad1) [07:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:48:26] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:51:51] 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10ayounsi) [07:51:53] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: update wmf-proxy-dashboard (eqiad1) (duration: 04m 09s) [07:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300402)', diff saved to https://phabricator.wikimedia.org/P19891 and previous config saved to /var/cache/conftool/dbconfig/20220202-075236-marostegui.json [07:52:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:39] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [07:52:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300402)', diff saved to https://phabricator.wikimedia.org/P19892 and previous config saved to /var/cache/conftool/dbconfig/20220202-075244-marostegui.json [07:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:53:55] 10SRE, 10Icinga, 10User-Ladsgroup: Request 
downtime hosts and services privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Ladsgroup) a:05Ladsgroup→03Papaul Awesome, I just assign it to you. Let me know if I can be of service. [07:56:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300402)', diff saved to https://phabricator.wikimedia.org/P19893 and previous config saved to /var/cache/conftool/dbconfig/20220202-075629-marostegui.json [07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:38] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10fgiunchedi) >>! In T294137#7668908, @Cmjohnson wrote: > @fgiunchedi I racked ms-fe1012 in the new cage e1. I believe it's going to be used to test the networ... [08:03:41] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:07:50] (03PS1) 10Filippo Giunchedi: hieradata: move pushgateway to prometheus1005 [puppet] - 10https://gerrit.wikimedia.org/r/759194 (https://phabricator.wikimedia.org/T296199) [08:08:40] (03PS1) 10Filippo Giunchedi: wmnet: move pushgateway to prometheus1005 [dns] - 10https://gerrit.wikimedia.org/r/759195 (https://phabricator.wikimedia.org/T296199) [08:09:06] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move pushgateway to prometheus1005 [puppet] - 10https://gerrit.wikimedia.org/r/759194 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:10:05] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move pushgateway to prometheus1005 [dns] - 10https://gerrit.wikimedia.org/r/759195 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:10:53] 10SRE, 10Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10Ladsgroup) I can't say for sure specially since it's part of base packages so it could be used 
anywhere but the only explicit usage is archiva and I hope we can find a usecase to just avoid using tha... [08:11:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19894 and previous config saved to /var/cache/conftool/dbconfig/20220202-081134-marostegui.json [08:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:05] (03PS1) 10Legoktm: toolforge: Install clang on the grid [puppet] - 10https://gerrit.wikimedia.org/r/759197 (https://phabricator.wikimedia.org/T300469) [08:17:26] (03PS3) 10Marostegui: mariadb: Promote es1020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127) [08:18:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es1020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/758716 (https://phabricator.wikimedia.org/T300127) (owner: 10Marostegui) [08:26:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19895 and previous config saved to /var/cache/conftool/dbconfig/20220202-082638-marostegui.json [08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:23] (03PS1) 10Ladsgroup: admin: Revoke saisuman ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/759199 [08:31:00] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Revoke saisuman ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/759199 (owner: 10Ladsgroup) [08:31:51] (03CR) 10JMeybohm: [C: 03+2] Create certificates for different FQDN's in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:31:56] (03CR) 10JMeybohm: [C: 03+2] Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) 
[08:35:20] 10SRE, 10SRE-Access-Requests: saisuman ssh keys have been uploaded to WMCS - https://phabricator.wikimedia.org/T300708 (10Ladsgroup) [08:35:32] (03Merged) 10jenkins-bot: Create certificates for different FQDN's in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/758880 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:35:34] (03Merged) 10jenkins-bot: Deploy ingress components to staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/758881 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [08:36:39] 10SRE, 10SRE-Access-Requests: saisuman ssh keys have been uploaded to WMCS - https://phabricator.wikimedia.org/T300708 (10SCherukuwada) Apologies for the inconvenience. Bad key management on my part, thanks for letting me know. [08:40:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [08:41:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300402)', diff saved to https://phabricator.wikimedia.org/P19896 and previous config saved to /var/cache/conftool/dbconfig/20220202-084143-marostegui.json [08:41:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:41:47] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 
(T300402)', diff saved to https://phabricator.wikimedia.org/P19897 and previous config saved to /var/cache/conftool/dbconfig/20220202-084150-marostegui.json [08:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300402)', diff saved to https://phabricator.wikimedia.org/P19898 and previous config saved to /var/cache/conftool/dbconfig/20220202-084709-marostegui.json [08:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:14] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:48:50] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Switchover es4 T300127 [08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:53] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127 [08:48:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Switchover es4 T300127 [08:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:33] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [08:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:42] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert [puppet] - 10https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:58:14] (03CR) 10Elukey: [C: 03+2] mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert 
[puppet] - 10https://gerrit.wikimedia.org/r/757661 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:59:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:00:05] !log Starting es4 eqiad failover from es1021 to es1020 - T300127 [09:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:08] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127 [09:00:33] Oh phone [09:00:45] *On [09:01:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1020 to es4 primary and set section read-write T300127', diff saved to https://phabricator.wikimedia.org/P19899 and previous config saved to /var/cache/conftool/dbconfig/20220202-090121-marostegui.json [09:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:50] (do we use a cookbook for these, OOI?) 
[09:02:09] Emperor: it's a script in dbtools [09:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19900 and previous config saved to /var/cache/conftool/dbconfig/20220202-090214-marostegui.json [09:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:22] ta [09:02:32] Probably should become a cookbook at least partially [09:02:34] (not that I really expect to need to do so myself at any point) [09:02:55] vo.ans is quite evangelical about the merits of cookbooking things :) [09:03:06] s/ about.*// [09:03:33] We really should cookbook this 😄 [09:04:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:04:16] (ThanosSidecarUnhealthy) firing: Thanos Sidecar is unhealthy. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:04:21] * volans hides [09:04:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:04:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote es1020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127) (owner: 10Marostegui) [09:04:58] Is there a direct API to ductless? 
[09:05:03] Dbctl [09:05:10] I hate autocorrect [09:05:18] (03CR) 10Marostegui: [C: 03+2] Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758918 (owner: 10Marostegui) [09:05:56] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758918 (owner: 10Marostegui) [09:06:03] (03PS3) 10Marostegui: wmnet: Promote es1020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/758717 (https://phabricator.wikimedia.org/T300127) [09:06:04] thanos is me btw [09:06:59] Amir1: a programming api, I believe yes, a webservice, I don't think so [09:07:08] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es4 T300127 (duration: 00m 50s) [09:07:09] Writes are now back enabled on es4, let's check that everything works fine [09:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:11] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127 [09:07:19] I mean programming [09:07:49] I hope it's used in spicerack so I can copy [09:07:57] yes, at least I was told to use it when I last tried to automate poolings and depoolings [09:07:59] (03CR) 10Filippo Giunchedi: [C: 03+2] service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:08:05] (03PS6) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) [09:08:06] Can we please focus on the switchover and leave the discussion for a bit later? 
[09:08:07] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [09:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:15] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:06] Writes seem to be flowing fine on the new master [09:09:10] The binlog shows them [09:09:16] (ThanosSidecarPrometheusDown) resolved: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:09:16] (ThanosSidecarUnhealthy) resolved: Thanos Sidecar is unhealthy. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:09:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:35] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [09:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:44] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [09:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:24] Writes are ok, I have checked a few blobs_id inserted on the binlog and they are being replicated fine to the rest of hosts in es4 [09:13:24] Wohoo [09:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1021 T300127', diff saved to https://phabricator.wikimedia.org/P19901 and previous config saved to /var/cache/conftool/dbconfig/20220202-091355-marostegui.json [09:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:59] T300127: Switchover es4 master es1021 -> es1020 - https://phabricator.wikimedia.org/T300127 [09:15:28] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10RhinosF1) [09:15:33] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - 
https://phabricator.wikimedia.org/T300703 (10RhinosF1) Can the alert be downtimed so it doesn't go off again? [09:17:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1019.eqiad.wmnet with OS buster [09:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19902 and previous config saved to /var/cache/conftool/dbconfig/20220202-091718-marostegui.json [09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:26] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS buster [09:19:06] (03PS1) 10Marostegui: es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759207 (https://phabricator.wikimedia.org/T300005) [09:20:00] (03CR) 10Marostegui: [C: 03+2] es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/759207 (https://phabricator.wikimedia.org/T300005) (owner: 10Marostegui) [09:20:16] (03PS5) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [09:20:18] (03PS4) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [09:21:46] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [09:28:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.provision for host es1021.mgmt.eqiad.wmnet with reboot policy GRACEFUL [09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:24] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300402)', diff saved to https://phabricator.wikimedia.org/P19903 and previous config saved to /var/cache/conftool/dbconfig/20220202-093223-marostegui.json [09:32:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:32:27] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [09:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300402)', diff saved to https://phabricator.wikimedia.org/P19904 and previous config saved to /var/cache/conftool/dbconfig/20220202-093231-marostegui.json [09:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [09:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:42] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [09:33:50] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / 
firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1011 [09:39:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1021.mgmt.eqiad.wmnet with reboot policy GRACEFUL [09:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:35] !log installing apache/apache-modsecurity2 security updates [09:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1021.eqiad.wmnet with OS bullseye [09:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:45:56] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1019.eqiad.wmnet with OS buster [09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:29] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS buster completed: - ganeti1019 (**PASS**)... [09:48:19] 10SRE, 10observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) Status update: kafka producers (rsyslog on regular nodes and mwdebug k8s) have been migrated to the new ca bundle. The next step is to migrate the logstash kafka consumer to the new bu... 
[09:53:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1008.eqiad.wmnet with OS buster [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:07] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1008.eqiad.wmnet with OS buster [10:00:22] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) I had a chat with @jbond about this yesterday, putting the summary here for future reference for those that will work on this. In Spicerack we... [10:01:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:31] (03PS2) 10Elukey: ml-services: update draftquality transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/758942 (https://phabricator.wikimedia.org/T298989) (owner: 10Accraze) [10:02:03] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:02:24] * volans here, acking VO [10:02:30] <_joe_> uh wtf? 
[10:02:37] here [10:02:39] here too [10:02:39] eqsin [10:02:54] * akosiaris around [10:03:20] it seems it had a quite deeper hole ~1h ago and didn't page AFAICT [10:03:21] checking 5xx dashboard too, looks like a spike tho [10:03:22] <_joe_> it seems to be recovering though [10:03:23] quick look at the network things seems fine [10:03:33] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [10:03:34] <_joe_> volans: yeah that was more narrow [10:03:41] <_joe_> which is what this alert is designed for [10:04:03] <_joe_> very short self-recovering spikes don't need our attention immediately, so they should not page [10:04:49] looking at smokeping we might have a problem with the primary transport links [10:04:51] https://smokeping.wikimedia.org/?displaymode=n;start=2022-02-02%2007:04;end=now;target=eqsin.Core.cr3-eqsin [10:06:27] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:31] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 04s) [10:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:36] monitoring v4/v6 latency [10:06:39] and loss [10:07:09] <_joe_> XioNoX: should we prepare to depool eqsin? [10:07:50] _joe_: nah, worst case we fail it over to the 2nd transport (now that we have one!) [10:09:10] (03CR) 10Elukey: [C: 03+2] "Rebased, all good, merging!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/758942 (https://phabricator.wikimedia.org/T298989) (owner: 10Accraze) [10:09:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:09:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [10:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:59] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [10:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [10:10:02] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s) [10:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1021.eqiad.wmnet with OS bullseye [10:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:25] (03CR) 10Volans: [C: 03+1] "I'm not familiar with the underlying process but the change looks sane to me." [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [10:18:21] (03PS1) 10Elukey: ml-services: update the editquality's transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759214 (https://phabricator.wikimedia.org/T298943) [10:21:16] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:20] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1008.eqiad.wmnet with OS buster [10:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:40] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1008.eqiad.wmnet with OS buster completed: - ganeti1008 (**PASS**)... [10:22:03] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[10:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [10:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [10:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298558)', diff saved to https://phabricator.wikimedia.org/P19905 and previous config saved to /var/cache/conftool/dbconfig/20220202-102717-marostegui.json [10:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:21] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [10:27:30] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:28:06] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:15] (03PS1) 10Marostegui: Revert "es1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758919 [10:32:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300402)', diff saved to https://phabricator.wikimedia.org/P19906 and previous config saved to /var/cache/conftool/dbconfig/20220202-103250-marostegui.json [10:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:53] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [10:33:26] (03CR) 10Marostegui: [C: 03+2] Revert "es1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758919 (owner: 10Marostegui) [10:33:37] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:34:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19907 and previous config saved to /var/cache/conftool/dbconfig/20220202-103401-root.json [10:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es1021 after reimage', diff saved to https://phabricator.wikimedia.org/P19908 and previous config saved to /var/cache/conftool/dbconfig/20220202-103436-marostegui.json [10:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 2%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19909 and previous config saved to /var/cache/conftool/dbconfig/20220202-103502-root.json [10:35:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges and recentchanges groups from s4 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19910 and previous config saved to /var/cache/conftool/dbconfig/20220202-103830-marostegui.json [10:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:40] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:38:46] _joe_, godog: I agree small spikes should not page, but I don't see why the first one didn't page and the second did, if you look at https://phabricator.wikimedia.org/F34940425 (same horizontal scale), the first one lasted longer than the second one and if I read correctly modules/monitoring/manifests/alerts/http_availability.pp the threshold is 99. [10:39:18] (03PS4) 10Phuedx: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) [10:39:43] (03CR) 10Gehel: [C: 03+2] sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [10:39:53] (03PS1) 10Muehlenhoff: Add Cumin alias for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/759218 [10:40:46] (03PS5) 10Gehel: sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [10:40:56] <_joe_> volans: it's a function of data binning and time of check [10:41:32] I graphed the exact metric used by the alert, not the one in the dashboard [10:41:35] <_joe_> I don't want to go do the calculations, but I think you have to consider you use a moving average and probably the checks happened at the right time for the second spike [10:42:52] (03PS1) 10Ladsgroup: admin: Fully deprecate sc-admin group [puppet] - 10https://gerrit.wikimedia.org/r/759219 [10:43:09] <_joe_> so the binning, if I read correctly, 
is done before the metric is read by the alert [10:43:15] (03PS2) 10Ladsgroup: admin: Fully deprecate sc-admin group [puppet] - 10https://gerrit.wikimedia.org/r/759219 [10:43:18] <_joe_> and we're reading a single point [10:43:20] (03PS5) 10Phuedx: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [10:43:30] (03PS3) 10Ladsgroup: admin: Fully deprecate sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/759219 [10:43:35] <_joe_> volans: so I would say we didn't do two checks during the dip in the first case [10:43:52] but even the plateau is below the threshold [10:43:59] around 92~90 [10:44:11] <_joe_> that looks very wrong heh [10:44:25] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7668721, @Cmjohnson wrote: > @MoritzMuehlenhoff 1006, 1016 and 1019 updated I can't connect to the serial console... 
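The sampling argument in the exchange above (the alert reads a single point of an already-averaged series, so whether a dip pages depends on when the checks happen to land) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 99 threshold is the value quoted in the discussion, while the 3-sample averaging window, the 5-step check interval, the "two consecutive failing checks" rule, the sample data, and every function name are hypothetical and are not taken from http_availability.pp or from the real alerting stack.

```python
from typing import List

THRESHOLD = 99.0  # availability percentage, as quoted in the discussion


def moving_average(series: List[float], window: int) -> List[float]:
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pages(series: List[float], window: int, interval: int, offset: int,
          needed: int = 2) -> bool:
    """Return True if `needed` consecutive checks read a value below THRESHOLD.

    Assumption: a check reads one point of the smoothed series every
    `interval` steps, starting at `offset`, and one passing check resets
    the failure counter.
    """
    smoothed = moving_average(series, window)
    failures = 0
    for t in range(offset, len(smoothed), interval):
        if smoothed[t] < THRESHOLD:
            failures += 1
            if failures >= needed:
                return True
        else:
            failures = 0
    return False


# Made-up data: 100% availability with a 4-sample dip to 90%.
dip = [100.0] * 10 + [90.0] * 4 + [100.0] * 16

if __name__ == "__main__":
    for offset in range(5):
        print(offset, pages(dip, window=3, interval=5, offset=offset))
```

With these made-up numbers the identical dip pages for only one of the five possible check alignments, which is the kind of effect being described: a dip that clearly crosses the threshold can still go unpaged if consecutive checks do not both land inside it.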
[10:44:28] I'm taking a look too [10:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298558)', diff saved to https://phabricator.wikimedia.org/P19911 and previous config saved to /var/cache/conftool/dbconfig/20220202-104453-marostegui.json [10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:56] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [10:46:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: Fully deprecate sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [10:47:07] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Fully deprecate sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [10:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19912 and previous config saved to /var/cache/conftool/dbconfig/20220202-104755-marostegui.json [10:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:34] (03CR) 10Muehlenhoff: admin: Fully deprecate sc-admins group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [10:50:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19913 and previous config saved to /var/cache/conftool/dbconfig/20220202-105006-root.json [10:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:53] (03CR) 10Ladsgroup: admin: Fully deprecate sc-admins group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [10:55:19] volans: yeah I think it is a compounding of factors, mainly using the 'global' prometheus instance since we want global availability. 
I'll open a task to port those over to alertmanager and thanos, that'll be more precise/reactive for sure [10:59:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19914 and previous config saved to /var/cache/conftool/dbconfig/20220202-105957-marostegui.json [10:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [11:01:56] (03PS4) 10Giuseppe Lavagetto: thanos::frontend: fix envoy configuration [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) [11:02:21] (03PS1) 10Marostegui: mariadb: Promote db1159 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/759222 (https://phabricator.wikimedia.org/T300329) [11:02:59] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [11:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19915 and previous config saved to /var/cache/conftool/dbconfig/20220202-110259-marostegui.json [11:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:17] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/759222 (https://phabricator.wikimedia.org/T300329) (owner: 10Marostegui) [11:05:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19916 and previous config saved to /var/cache/conftool/dbconfig/20220202-110509-root.json [11:05:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:16] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10jbond) > Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always t... [11:05:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [11:06:04] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [11:06:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [11:07:24] (03PS1) 10Btullis: Use the system default mysql prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/759223 (https://phabricator.wikimedia.org/T299762) [11:08:58] (03PS1) 10Vgutierrez: cache::envoy: Reduce downstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/759224 (https://phabricator.wikimedia.org/T271421) [11:10:50] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33544/console" [puppet] - 10https://gerrit.wikimedia.org/r/759224 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:12:17] (03PS1) 10Elukey: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 [11:12:49] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Reduce 
downstream_idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/759224 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:13:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33545/console" [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [11:15:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19917 and previous config saved to /var/cache/conftool/dbconfig/20220202-111502-marostegui.json [11:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [11:15:51] (03CR) 10Volans: [C: 03+2] dhcp: case-insensitive match if Dell serial number [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [11:16:50] (03CR) 10Volans: "Change looks sane, just a couple of minor comments inline." 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [11:18:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300402)', diff saved to https://phabricator.wikimedia.org/P19918 and previous config saved to /var/cache/conftool/dbconfig/20220202-111804-marostegui.json [11:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:08] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:19:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:24] (03PS2) 10Elukey: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 [11:20:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 15%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19919 and previous config saved to /var/cache/conftool/dbconfig/20220202-112013-root.json [11:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:51] (03CR) 10Jbond: [C: 04-1] "see inline for minor issue and nit" [puppet] - 10https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [11:21:14] (03PS1) 10Marostegui: parsercache.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/759226 (https://phabricator.wikimedia.org/T268869) [11:22:14] (03CR) 10JMeybohm: Move ml-serve-ctrl nodes to overlay fs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [11:23:30] (03CR) 10Marostegui: "This option has 
been tested on db1125 (testing host).The server boots up fine" [puppet] - 10https://gerrit.wikimedia.org/r/759226 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [11:25:35] (03Merged) 10jenkins-bot: dhcp: case-insensitive match if Dell serial number [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 (owner: 10Volans) [11:26:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] thanos::frontend: fix envoy configuration [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [11:28:16] <_joe_> !log depooling thanos-fe1001 for testing T300119 [11:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:19] T300119: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 [11:28:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300402)', diff saved to https://phabricator.wikimedia.org/P19920 and previous config saved to /var/cache/conftool/dbconfig/20220202-112849-marostegui.json [11:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:53] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:29:23] (03CR) 10Muehlenhoff: [C: 03+2] Addintional cache setting for puppetboard against idp-test [puppet] - 10https://gerrit.wikimedia.org/r/757868 (owner: 10Muehlenhoff) [11:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1123 (T298558)', diff saved to https://phabricator.wikimedia.org/P19921 and previous config saved to /var/cache/conftool/dbconfig/20220202-113007-marostegui.json [11:30:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:30:10] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [11:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19922 and previous config saved to /var/cache/conftool/dbconfig/20220202-113516-root.json [11:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300402)', diff saved to https://phabricator.wikimedia.org/P19923 and previous config saved to /var/cache/conftool/dbconfig/20220202-113558-marostegui.json [11:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:02] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:38:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: 
Maintenance [11:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:27] <_joe_> !log repooling thanos-fe1001 T300119 [11:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:30] T300119: Using port in Host header for thanos-swift / thanos-query breaks vhost selection - https://phabricator.wikimedia.org/T300119 [11:46:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:46:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298558)', diff saved to https://phabricator.wikimedia.org/P19924 and previous config saved to /var/cache/conftool/dbconfig/20220202-114639-marostegui.json [11:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:42] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [11:47:22] (03PS4) 10Ladsgroup: admin: Add dannyh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/758603 
(https://phabricator.wikimedia.org/T300579) [11:47:46] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) I talked to @jbond and it seems you can login with your CN in CAS, that's why "DannyH (WMF)" works but your ldap entry says y... [11:47:48] (03PS1) 10Vgutierrez: site: Reimage cp1087 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/759231 (https://phabricator.wikimedia.org/T271421) [11:48:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add dannyh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/758603 (https://phabricator.wikimedia.org/T300579) (owner: 10Ladsgroup) [11:48:40] !log depool cp1087 to be reimaged as cache::text_envoy - T271421 [11:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:43] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [11:49:43] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1087 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/759231 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:50:03] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) [11:50:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19925 and previous config saved to /var/cache/conftool/dbconfig/20220202-115020-root.json [11:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:36] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1087.eqiad.wmnet with OS buster [11:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:45] 10SRE, 10Traffic, 
10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster [11:51:00] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ladsgroup) 05Open→03Resolved You should be able to access it in half an hour, reopen if that's not the case. [11:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19926 and previous config saved to /var/cache/conftool/dbconfig/20220202-115103-marostegui.json [11:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:44] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10jbond) > I talked to @jbond and it seems you can login with your CN in CAS, that's why "DannyH (WMF)" works but your ldap entry says you... [11:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298558)', diff saved to https://phabricator.wikimedia.org/P19927 and previous config saved to /var/cache/conftool/dbconfig/20220202-115601-marostegui.json [11:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:04] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1200). 
[12:00:04] phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] o/ I can deploy today [12:00:15] go for it [12:00:19] unless phuedx wants to self-serve? [12:01:26] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10Joe) Thanks @rzl this looks like an excellent plan. I would suggest that when we move to 1.18, we might want to start from the `thanos-fe` cluster which would see fixing of a real iss... [12:02:22] phuedx: {{ping}} hi, around? [12:04:32] I'm treating that as a no and continuing to deploy a patch of my own [12:04:46] (03PS2) 10Majavah: prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) [12:04:56] (03PS14) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [12:05:11] (03CR) 10Majavah: [C: 03+2] prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [12:05:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19928 and previous config saved to /var/cache/conftool/dbconfig/20220202-120524-root.json [12:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19929 and previous config saved to /var/cache/conftool/dbconfig/20220202-120608-marostegui.json [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:06:16] (03Merged) 10jenkins-bot: prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [12:11:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19930 and previous config saved to /var/cache/conftool/dbconfig/20220202-121105-marostegui.json [12:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:20] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758443|prod: READ_NEW for CentralAuth hidden level migration (T289068)]] (duration: 00m 50s) [12:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:22] T289068: Normalise centralauth.gu_hidden - https://phabricator.wikimedia.org/T289068 [12:12:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:49] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10Vgutierrez) this looks great :) in traffic we're already using 1.18.3 from the envoy-future component, thanks @RLazarus [12:15:01] (03PS1) 10Muehlenhoff: Add cname for puppetboard-idptest [dns] - 10https://gerrit.wikimedia.org/r/759234 [12:15:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on 
pinkunicorn [12:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jbond) are we in a state to revery https://gerrit.wikimedia.org/r/c/operations/puppet/+/748740 yet? [12:20:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 65%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19932 and previous config saved to /var/cache/conftool/dbconfig/20220202-122027-root.json [12:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [12:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300402)', diff saved to https://phabricator.wikimedia.org/P19933 and previous config saved to /var/cache/conftool/dbconfig/20220202-122112-marostegui.json [12:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:26:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19934 and previous config saved to /var/cache/conftool/dbconfig/20220202-122610-marostegui.json [12:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet [12:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:31:22] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19936 and previous config saved to /var/cache/conftool/dbconfig/20220202-123127-marostegui.json [12:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:32:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1019.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [12:34:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1019.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19937 and previous config saved to /var/cache/conftool/dbconfig/20220202-123531-root.json [12:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19938 and previous config saved to /var/cache/conftool/dbconfig/20220202-123956-marostegui.json [12:39:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:59] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:40:59] (03PS15) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [12:41:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298558)', diff saved to https://phabricator.wikimedia.org/P19939 and previous config saved to /var/cache/conftool/dbconfig/20220202-124115-marostegui.json [12:41:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:41:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298558)', diff saved to https://phabricator.wikimedia.org/P19940 and previous config saved to /var/cache/conftool/dbconfig/20220202-124122-marostegui.json [12:41:23] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:20] taavi, all: Mibad. I had to step away. 
I should've updated my status [12:43:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1087.eqiad.wmnet with OS buster [12:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:00] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster completed: - cp1087 (**WARN*... [12:44:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:45:12] (03CR) 10Muehlenhoff: [C: 03+2] Add cname for puppetboard-idptest [dns] - 10https://gerrit.wikimedia.org/r/759234 (owner: 10Muehlenhoff) [12:45:53] taavi, Lucas_WMDE: Is the window closed? [12:46:02] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] Stop normalizing full text scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759240 (https://phabricator.wikimedia.org/T296631) [12:46:06] phuedx: we still have 15min left [12:46:27] My apologies for not updating my status [12:46:59] no worries [12:47:16] The two changes that I have scheduled are NOPs - they remove config variables that are no longer used [12:47:16] want me to deploy? [12:47:41] (03PS16) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [12:47:56] taavi: Sure. Thanks :) [12:48:11] I'm guessing they can't really be tested on mwdebug?
[12:48:17] (03PS6) 10Majavah: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [12:48:26] (03CR) 10Majavah: [C: 03+2] Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [12:48:30] (03CR) 10jerkins-bot: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:49:02] taavi: I can double-check that the WikimediaEvents extension is still serving assets as expected. Happy to do so on mwdebug [12:49:07] (03Merged) 10jenkins-bot: Clean-up decommisioned Print schema configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: 10Polishdeveloper) [12:49:48] phuedx: alright, the first one is on mwdebug1001 [12:50:04] Checking now [12:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19941 and previous config saved to /var/cache/conftool/dbconfig/20220202-125034-root.json [12:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:35] taavi: LGTM. 
I'm able to hit enwiki and have looked at the source for the ext.wikimediaEvents module [12:53:43] thanks, syncing [12:53:48] (03PS3) 10Elukey: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 [12:53:50] (03PS1) 10Elukey: role::ml_k8s::master: allow overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759243 [12:53:52] (03PS5) 10Majavah: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) (owner: 10Phuedx) [12:54:00] (03CR) 10Majavah: [C: 03+2] ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) (owner: 10Phuedx) [12:54:12] (03PS17) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [12:54:31] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:570625|Clean-up decommisioned Print schema configs (T196159)]] (duration: 00m 50s) [12:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:34] T196159: Remove instrumentation for Schema:Print - https://phabricator.wikimedia.org/T196159 [12:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19942 and previous config saved to /var/cache/conftool/dbconfig/20220202-125500-marostegui.json [12:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33549/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:55:28] (03CR) 10jerkins-bot: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 
10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [12:55:45] (03Merged) 10jenkins-bot: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) (owner: 10Phuedx) [12:55:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:05] phuedx: the second patch is available for testing too [12:56:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33550/console" [puppet] - 10https://gerrit.wikimedia.org/r/759243 (owner: 10Elukey) [12:56:50] (03PS4) 10Elukey: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 [12:57:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:57:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:35] (03CR) 10Elukey: Move ml-serve-ctrl nodes to overlay fs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [12:57:46] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::master: allow overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759243 (owner: 10Elukey) [12:58:07] taavi: LGTM [12:58:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:47] syncing [12:59:10] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:670224|ULS: Remove unused ULSEventLogging 
variable (T275894)]] (duration: 00m 49s) [12:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:13] T275894: Migrate UniversalLanguageSelector instrument to WikimediaEvents extension - https://phabricator.wikimedia.org/T275894 [12:59:16] just in time :D [12:59:24] anything else in the last half a minute or so? [12:59:33] :D [13:00:53] (03PS5) 10JMeybohm: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [13:02:22] (03CR) 10JMeybohm: [C: 03+1] Move ml-serve-ctrl nodes to overlay fs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [13:02:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19944 and previous config saved to /var/cache/conftool/dbconfig/20220202-130224-root.json [13:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:16] (03PS18) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:03:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33551/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:04:17] (03CR) 10jerkins-bot: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:04:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:04:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:38] (03CR) 10Jbond: [V: 03+1] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:07:51] (03CR) 10Jbond: [V: 03+1] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:08:19] (03PS6) 10Elukey: Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 [13:08:21] (03PS19) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [13:08:31] (03CR) 10Elukey: Move ml-serve-ctrl nodes to overlay fs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [13:09:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33552/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [13:09:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Install clang on the grid [puppet] - 10https://gerrit.wikimedia.org/r/759197 (https://phabricator.wikimedia.org/T300469) (owner: 10Legoktm) [13:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19945 and previous config saved to 
/var/cache/conftool/dbconfig/20220202-131006-marostegui.json [13:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] (03CR) 10Elukey: [C: 03+2] Move ml-serve-ctrl nodes to overlay fs [puppet] - 10https://gerrit.wikimedia.org/r/759225 (owner: 10Elukey) [13:12:51] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi) [13:15:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:58] this is me --^ [13:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19946 and previous config saved to /var/cache/conftool/dbconfig/20220202-131728-root.json [13:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:42] in theory the alert should clear soon, the bgp session is back to "Established" on cr1 [13:25:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19947 and previous config saved to /var/cache/conftool/dbconfig/20220202-132510-marostegui.json [13:25:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:25:15] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:25:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:20] !log rename cr3-ulsfo loopback terms in preparation of move to Capirca [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:40] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync on production [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:41] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync on canary [13:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:09] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync on production [13:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:25] (03CR) 10Jbond: P:admin: switch to using wmflib::dump_params (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753786 (owner: 10Jbond) [13:29:45] (03CR) 10Jbond: [C: 03+2] P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 (owner: 10Jbond) [13:30:05] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [13:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [13:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance 
[13:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync on canary [13:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on canary [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:17] (03PS4) 10Ayounsi: Move core routers loopback filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) [13:32:19] (03PS4) 10Ayounsi: Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) [13:32:21] (03PS2) 10Ayounsi: Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 [13:32:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19949 and previous config saved to /var/cache/conftool/dbconfig/20220202-133231-root.json [13:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:38] !log roll restarting eventgate-main to pick up stream-configs for rdf-streaming-updater.reconcile [13:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:49] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync on canary [13:32:49] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync on production [13:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:31] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on production [13:33:32] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:36] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync on canary [13:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:40] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync on canary [13:34:40] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync on production [13:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:42] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on canary [13:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:19] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync on production [13:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:44] (03PS4) 10Kormat: wmfdb/db: Improve error reporting. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 [13:36:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:37:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19951 and previous config saved to /var/cache/conftool/dbconfig/20220202-133713-marostegui.json [13:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:38:10] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10RhinosF1) [13:38:40] (03CR) 10Ayounsi: "As PS3 caused https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-01_ulsfo_network" [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:40:08] !log ULSFO routers: push Capirca generated loopback filters [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[13:40:26] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10fgiunchedi) [13:41:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:43:22] FYI this lag was due to logstash 5 consumer, likely to be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/758970 [13:43:45] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19952 and previous config saved to /var/cache/conftool/dbconfig/20220202-134659-marostegui.json [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:47:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19953 and previous config saved to /var/cache/conftool/dbconfig/20220202-134735-root.json [13:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:26] (03CR) 10Herron: [C: 03+1] Add Cumin alias for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/759218 (owner: 
10Muehlenhoff) [13:49:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:47] (03CR) 10Kormat: wmfdb/db: Improve error reporting. (034 comments) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [13:50:09] PROBLEM - VRRP status on cr3-ulsfo is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:50:09] !log move docker on ml-serve-ctrl* nodes from device mapper to overlay2 [13:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:35] PROBLEM - Host ml-serve-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:15] lovely [13:52:20] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10BBlack) 05Open→03Resolved Just doing some cleanup here, we did end up on the 2B path (replacement LVSes with 3x dual NIC cards) and have the hardware already racked. 
[13:52:44] (03CR) 10Herron: [C: 03+1] apifeatureusage: increase logstash heap memory to 2G [puppet] - 10https://gerrit.wikimedia.org/r/758970 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [13:55:15] RECOVERY - VRRP status on cr3-ulsfo is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:55:27] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [13:57:27] (03PS1) 10Urbanecm: logos/config.yaml: Fix zhwikinews-hans logo variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759247 [13:59:49] 10SRE-swift-storage: Bring ms-fe20[09-12] into service - https://phabricator.wikimedia.org/T300738 (10MatthewVernon) [14:02:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19954 and previous config saved to /var/cache/conftool/dbconfig/20220202-140204-marostegui.json [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19955 and previous config saved to /var/cache/conftool/dbconfig/20220202-140239-root.json [14:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 schema change', diff saved to https://phabricator.wikimedia.org/P19956 and previous config saved to /var/cache/conftool/dbconfig/20220202-140317-marostegui.json [14:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:43] (03PS1) 10MVernon: swift: add new proxy nodes to as proxyhosts and memcached_servers [puppet] - 
10https://gerrit.wikimedia.org/r/759248 (https://phabricator.wikimedia.org/T300738) [14:08:16] jouncebot: nowandnext [14:08:16] No deployments scheduled for the next 4 hour(s) and 51 minute(s) [14:08:16] In 4 hour(s) and 51 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [14:08:16] In 4 hour(s) and 51 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [14:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19957 and previous config saved to /var/cache/conftool/dbconfig/20220202-140818-root.json [14:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:27] (03CR) 10Urbanecm: [C: 03+2] "no op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759247 (owner: 10Urbanecm) [14:09:13] (03Merged) 10jenkins-bot: logos/config.yaml: Fix zhwikinews-hans logo variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759247 (owner: 10Urbanecm) [14:09:42] * urbanecm done [14:09:44] !log cr2-eqord: push Capirca generated loopback filters [14:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:04] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/759248 (https://phabricator.wikimedia.org/T300738) (owner: 10MVernon) [14:11:32] (03CR) 10Kormat: [C: 03+1] parsercache.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/759226 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [14:12:01] (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/759226 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [14:13:17] !log pool cp1087 running envoy as TLS terminator - T271421 [14:13:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:20] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [14:14:05] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [14:14:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) I have no objection, as long as Netbox knows where it is :) [14:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es1020 - as it is the master', diff saved to https://phabricator.wikimedia.org/P19958 and previous config saved to /var/cache/conftool/dbconfig/20220202-141455-marostegui.json [14:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:16] !log cr2-eqdfw: push Capirca generated loopback filters [14:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:35] RECOVERY - Host ml-serve-ctrl1002 is UP: PING WARNING - Packet loss = 75%, RTA = 1.81 ms [14:17:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19959 and previous config saved to /var/cache/conftool/dbconfig/20220202-141709-marostegui.json [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:27] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:21:52] (03PS1) 10Kormat: Use module-level loggers. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 [14:21:53] !log eqsin: push Capirca generated loopback filters [14:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19960 and previous config saved to /var/cache/conftool/dbconfig/20220202-142321-root.json [14:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:14] (03PS1) 10Btullis: Set the syslog programname of the movement metric scripts [puppet] - 10https://gerrit.wikimedia.org/r/759252 (https://phabricator.wikimedia.org/T295733) [14:29:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:31] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33553/console" [puppet] - 10https://gerrit.wikimedia.org/r/759252 
(https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [14:32:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19961 and previous config saved to /var/cache/conftool/dbconfig/20220202-143214-marostegui.json [14:32:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [14:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [14:32:18] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300402)', diff saved to https://phabricator.wikimedia.org/P19962 and previous config saved to /var/cache/conftool/dbconfig/20220202-143221-marostegui.json [14:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19963 and previous config saved to /var/cache/conftool/dbconfig/20220202-143825-root.json [14:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [14:39:58] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [14:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 
(T300402)', diff saved to https://phabricator.wikimedia.org/P19965 and previous config saved to /var/cache/conftool/dbconfig/20220202-144038-marostegui.json [14:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:41] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:44:54] !log codfw: push Capirca generated loopback filters [14:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:14] (03PS1) 10JMeybohm: Add k8s-ingress-staging LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/759253 (https://phabricator.wikimedia.org/T300740) [14:47:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] (03CR) 10Kormat: [C: 03+2] wmfdb/db: Improve error reporting. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [14:51:12] (03Merged) 10jenkins-bot: wmfdb/db: Improve error reporting. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/757666 (owner: 10Kormat) [14:53:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19966 and previous config saved to /var/cache/conftool/dbconfig/20220202-145329-root.json [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:32] 10SRE, 10Icinga, 10User-Ladsgroup: Request downtime hosts and services privileges in Icinga - https://phabricator.wikimedia.org/T300660 (10Papaul) p:05Triage→03Medium [14:55:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19967 and previous config saved to /var/cache/conftool/dbconfig/20220202-145542-marostegui.json [14:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2029.mgmt.codfw.wmnet with reboot policy FORCED [14:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:32] !log esams: push Capirca generated loopback filters [15:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:48] (03CR) 10Volans: "LGTM, one general style comment:" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 (owner: 10Kormat) [15:03:20] (03PS1) 10Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) [15:05:38] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed, and 2 others: sc-admins sudo rule is vulnerable to command injection risk - https://phabricator.wikimedia.org/T300665 (10sbassett) >>! In T300665#7670756, @Ladsgroup wrote: > I let the security team to process this first, they might have d... 
[15:05:44] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed, and 2 others: sc-admins sudo rule is vulnerable to command injection risk - https://phabricator.wikimedia.org/T300665 (10sbassett) p:05Triage→03Low [15:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P19968 and previous config saved to /var/cache/conftool/dbconfig/20220202-150832-root.json [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:32] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-staging LVS VIPs [dns] - 10https://gerrit.wikimedia.org/r/759253 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:10:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19969 and previous config saved to /var/cache/conftool/dbconfig/20220202-151047-marostegui.json [15:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:58] (03PS2) 10Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) [15:15:41] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33555/console" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:16:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2029.mgmt.codfw.wmnet with reboot policy FORCED [15:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] 
PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:26] (KubernetesCalicoDown) firing: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:17:38] this is me [15:18:29] (03PS1) 10Ottomata: airflow - Ensure mariadb-client installed [puppet] - 10https://gerrit.wikimedia.org/r/759255 [15:18:38] (03CR) 10Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:18:52] (03CR) 10Btullis: [C: 03+1] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: 10Ottomata) [15:18:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tls_helpers: fail if a listener is non existent [deployment-charts] - 10https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) (owner: 10Giuseppe Lavagetto) [15:18:59] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 113, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:23] (03PS1) 10Esanders: wgDiscussionToolsABTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759256 [15:19:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2029.mgmt.codfw.wmnet with reboot policy FORCED [15:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:49] (03CR) 10Btullis: [V: 03+1 C: 03+2] Set the syslog programname of the movement metric scripts [puppet] - 10https://gerrit.wikimedia.org/r/759252 
(https://phabricator.wikimedia.org/T295733) (owner: 10Btullis) [15:20:09] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33556/console" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:22:26] (KubernetesCalicoDown) resolved: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:22:39] (03Merged) 10jenkins-bot: tls_helpers: fail if a listener is non existent [deployment-charts] - 10https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) (owner: 10Giuseppe Lavagetto) [15:23:47] (03PS5) 10Ayounsi: Move core routers loopback filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) [15:23:49] (03PS5) 10Ayounsi: Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) [15:23:51] (03PS3) 10Ayounsi: Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 [15:25:19] (03PS1) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/759257 [15:25:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300402)', diff saved to https://phabricator.wikimedia.org/P19970 and previous config saved to /var/cache/conftool/dbconfig/20220202-152552-marostegui.json [15:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:25:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:25:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [15:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:39] (03CR) 10Ayounsi: [C: 03+2] Move core routers loopback filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [15:27:24] (03Merged) 10jenkins-bot: Move core routers loopback filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [15:27:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2029.mgmt.codfw.wmnet with reboot policy FORCED [15:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:55] (03PS1) 10JMeybohm: Add k8s-ingress-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) [15:28:57] (03PS1) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [15:30:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759257 (owner: 10Jbond) [15:30:11] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] !log aqu@deploy1002 
Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [15:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:32:00] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:04] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s) [15:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19972 and previous config saved to /var/cache/conftool/dbconfig/20220202-153206-marostegui.json [15:32:08] (03PS1) 10Zabe: Update kywiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759261 (https://phabricator.wikimedia.org/T300241) [15:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:09] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:32:11] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:20] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:32:21] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:05] (03CR) 10jenkins-bot: [V: 04-1] Update kywiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759261 (https://phabricator.wikimedia.org/T300241) (owner: 10Zabe) [15:34:17] (03CR) 10Elukey: [C: 03+1] Add k8s-ingress-staging to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/759259 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:34:30] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:40] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [15:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:49] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:03] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:07] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 03s) [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:24] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759261 (https://phabricator.wikimedia.org/T300241) (owner: 10Zabe) [15:39:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300402)', diff saved to 
https://phabricator.wikimedia.org/P19973 and previous config saved to /var/cache/conftool/dbconfig/20220202-153913-marostegui.json [15:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:17] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:41:35] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [15:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:01] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:09] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:52] !log aqu@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:00] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 08s) [15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:36] (03PS2) 10Ottomata: airflow - Ensure a mariadb client installed [puppet] - 10https://gerrit.wikimedia.org/r/759255 [15:50:53] (03PS1) 10Muehlenhoff: Simplify Ganeti partman setup for codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/759263 (https://phabricator.wikimedia.org/T298998) [15:51:33] (03CR) 10Ayounsi: [C: 03+2] Delete now unused analytics policy file [homer/public] - 
10https://gerrit.wikimedia.org/r/758470 (owner: 10Ayounsi) [15:54:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19974 and previous config saved to /var/cache/conftool/dbconfig/20220202-155418-marostegui.json [15:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:57] (03CR) 10Cwhite: [C: 03+2] apifeatureusage: increase logstash heap memory to 2G [puppet] - 10https://gerrit.wikimedia.org/r/758970 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [15:57:13] (03PS1) 10Muehlenhoff: idp-test/puppetboard: Grant access to cn=idptest-users [puppet] - 10https://gerrit.wikimedia.org/r/759264 [15:57:37] (03CR) 10Muehlenhoff: [C: 03+2] Simplify Ganeti partman setup for codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/759263 (https://phabricator.wikimedia.org/T298998) (owner: 10Muehlenhoff) [15:57:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/759248 (https://phabricator.wikimedia.org/T300738) (owner: 10MVernon) [15:58:36] (03PS1) 10Papaul: Add ganeti2029 and ganeti2030 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/759265 (https://phabricator.wikimedia.org/T298998) [15:58:55] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10Joe) >>! In T300324#7670904, @Vgutierrez wrote: > this looks great :) in traffic we're already using 1.18.3 from the envoy-future component, thanks @RLazarus I think the question we... 
[16:00:32] (03CR) 10Papaul: [C: 03+2] Add ganeti2029 and ganeti2030 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/759265 (https://phabricator.wikimedia.org/T298998) (owner: 10Papaul) [16:01:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10Papaul) [16:04:18] (03CR) 10MVernon: [C: 03+2] swift: add new proxy nodes to as proxyhosts and memcached_servers [puppet] - 10https://gerrit.wikimedia.org/r/759248 (https://phabricator.wikimedia.org/T300738) (owner: 10MVernon) [16:05:21] 10SRE, 10Traffic, 10envoy, 10serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10hnowlan) Based on the release notes I think the API gateway will most likely have no issue going straight to 1.21. If there are issues they will most likely be minor enough that we can... [16:07:15] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:09:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19975 and previous config saved to /var/cache/conftool/dbconfig/20220202-160923-marostegui.json [16:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:41] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:11:07] jouncebot now [16:11:07] No deployments scheduled for the next 2 hour(s) and 48 minute(s) 
[16:12:07] PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:43] (03Abandoned) 10Zabe: Update kywiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759261 (https://phabricator.wikimedia.org/T300241) (owner: 10Zabe) [16:12:59] RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [16:13:05] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:15:41] (03PS1) 10Majavah: O:puppetmaster::standalone: add role motd [puppet] - 10https://gerrit.wikimedia.org/r/759266 [16:16:07] PROBLEM - Host ms-fe2012 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:45] RECOVERY - Host ms-fe2012 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [16:19:15] !log rolling restart of swift frontends to bring new ones into service T300738 [16:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] T300738: Bring ms-fe20[09-12] into service - https://phabricator.wikimedia.org/T300738 [16:23:33] (03CR) 10Accraze: [C: 03+1] ml-services: update the editquality's transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759214 (https://phabricator.wikimedia.org/T298943) (owner: 10Elukey) [16:24:17] !log disable ldap email checks on mx2001 [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:25] (03CR) 10Jbond: [C: 03+2] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/759257 (owner: 10Jbond) [16:24:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300402)', diff saved to https://phabricator.wikimedia.org/P19976 and previous config 
saved to /var/cache/conftool/dbconfig/20220202-162428-marostegui.json [16:24:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:31] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [16:24:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300402)', diff saved to https://phabricator.wikimedia.org/P19977 and previous config saved to /var/cache/conftool/dbconfig/20220202-162435-marostegui.json [16:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:50] 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Jaxonvilleder) I think our [[ https://www.morningtoncabinetmakers.com.au/ | cabinets ]] does now! 
[16:25:58] (03CR) 10Elukey: [C: 03+2] ml-services: update the editquality's transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759214 (https://phabricator.wikimedia.org/T298943) (owner: 10Elukey) [16:26:24] (03CR) 10Eevans: [C: 03+1] restbase: remove restbase2011 [puppet] - 10https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928) (owner: 10Hnowlan) [16:26:44] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2009.codfw.wmnet [16:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2009.codfw.wmnet [16:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:21] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missi [16:29:21] [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:30:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[16:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:52] (03PS1) 10JHathaway: Revert "profile::apt::mirror: change apt mirror to deb.debian.org" [puppet] - 10https://gerrit.wikimedia.org/r/758921 [16:33:01] (03PS2) 10JHathaway: Revert "profile::apt::mirror: change apt mirror to deb.debian.org" [puppet] - 10https://gerrit.wikimedia.org/r/758921 [16:34:10] (03CR) 10Ayounsi: [C: 03+1] O:rpkivalidator: add bgpalerter to rpki servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [16:36:09] (03PS1) 10MVernon: conftool-data - add ms-fe20{09..12} [puppet] - 10https://gerrit.wikimedia.org/r/759269 (https://phabricator.wikimedia.org/T300738) [16:36:30] (03CR) 10JHathaway: [C: 03+2] Revert "profile::apt::mirror: change apt mirror to deb.debian.org" [puppet] - 10https://gerrit.wikimedia.org/r/758921 (owner: 10JHathaway) [16:36:36] (03CR) 10Filippo Giunchedi: [C: 03+1] conftool-data - add ms-fe20{09..12} [puppet] - 10https://gerrit.wikimedia.org/r/759269 (https://phabricator.wikimedia.org/T300738) (owner: 10MVernon) [16:37:04] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) >>! In T286898#7670905, @jbond wrote: > are we in a state to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/748740 yet? yup, thanks for the reminder, r... 
[16:37:05] (03CR) 10MVernon: [C: 03+2] conftool-data - add ms-fe20{09..12} [puppet] - 10https://gerrit.wikimedia.org/r/759269 (https://phabricator.wikimedia.org/T300738) (owner: 10MVernon) [16:38:55] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2009.codfw.wmnet [16:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2009.codfw.wmnet [16:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2009.codfw.wmnet [16:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:09] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2009.codfw.wmnet [16:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:51] !log standardising nginx weights for codfw swift proxies to match eqiad ones T300738 [16:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:54] T300738: Bring ms-fe20[09-12] into service - https://phabricator.wikimedia.org/T300738 [16:42:04] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2008.codfw.wmnet [16:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:08] (03PS2) 10Kormat: Use module-level loggers. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 [16:42:10] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2007.codfw.wmnet [16:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2006.codfw.wmnet [16:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:20] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2005.codfw.wmnet [16:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:32] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2010.codfw.wmnet [16:45:33] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2011.codfw.wmnet [16:45:33] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2012.codfw.wmnet [16:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:37] (03CR) 10Kormat: Use module-level loggers. 
(033 comments) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 (owner: 10Kormat) [16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:51] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2010.codfw.wmnet [16:45:51] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2011.codfw.wmnet [16:45:51] !log mvernon@puppetmaster1001 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2012.codfw.wmnet [16:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:53] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2010.codfw.wmnet [16:46:54] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2011.codfw.wmnet [16:46:54] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2012.codfw.wmnet [16:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:09] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2010.codfw.wmnet [16:47:09] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2011.codfw.wmnet [16:47:10] !log mvernon@puppetmaster1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2012.codfw.wmnet [16:47:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:49] (03PS3) 10Kormat: Use module-level loggers. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759250 [16:51:01] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-node-exporter.service,wmf_auto_restart_prometheus-swagger-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:43] (03CR) 10Ottomata: [C: 03+2] airflow - Ensure a mariadb client installed [puppet] - 10https://gerrit.wikimedia.org/r/759255 (owner: 10Ottomata) [16:53:51] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10TheDJ) Here is another related Envoy ticket about 1.0 support that might be useful: https://github.com/envoyproxy/envoy/issues/170 [16:56:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300402)', diff saved to https://phabricator.wikimedia.org/P19979 and previous config saved to /var/cache/conftool/dbconfig/20220202-165611-marostegui.json [16:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [16:57:45] (03PS1) 104nn1l2: kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759277 (https://phabricator.wikimedia.org/T298438) [16:59:02] (03Abandoned) 104nn1l2: kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759277 (https://phabricator.wikimedia.org/T298438) (owner: 104nn1l2) [16:59:08] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) [16:59:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:17] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics-test@3ad07a0]: (no justification provided) (duration: 00m 09s) [16:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:36] (03PS1) 104nn1l2: kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) [17:02:07] (03CR) 10Jbond: [C: 03+1] idp-test/puppetboard: Grant access to cn=idptest-users [puppet] - 10https://gerrit.wikimedia.org/r/759264 (owner: 10Muehlenhoff) [17:02:50] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/759266 (owner: 10Majavah) [17:02:52] (03CR) 10Andrew Bogott: [C: 03+2] O:puppetmaster::standalone: add role motd [puppet] - 10https://gerrit.wikimedia.org/r/759266 (owner: 10Majavah) [17:03:17] ^ timing, lol [17:04:33] (03CR) 10Zabe: "Any updates on this?" [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:05:25] (03CR) 10Majavah: [C: 04-1] graphite: migrate archiver crons to systemd timer jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:05:47] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1159 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/759222 (https://phabricator.wikimedia.org/T300329) (owner: 10Marostegui) [17:08:13] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10RhinosF1) [17:08:50] (03PS2) 10Zabe: graphite: migrate archiver crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) [17:10:03] (03CR) 10Zabe: graphite: migrate archiver crons to systemd timer jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 
10Zabe) [17:10:56] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10Ottomata) a:05Ottomata→03None [17:11:00] (03CR) 10Jbond: [V: 03+1] O:rpkivalidator: add bgpalerter to rpki servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [17:11:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19981 and previous config saved to /var/cache/conftool/dbconfig/20220202-171115-marostegui.json [17:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:41] (03PS2) 104nn1l2: kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) [17:13:50] 10SRE, 10LDAP-Access-Requests, 10User-Ladsgroup: Grant access to wmf LDAP group for Madalina Ana - https://phabricator.wikimedia.org/T300749 (10Ladsgroup) a:03Ladsgroup [17:15:02] 10SRE, 10ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (10cmooney) p:05Medium→03Low a:05Cmjohnson→03cmooney @Cmjohnson thanks for this. I can confirm I've access to all 3 servers now and things are looking good. I've put the task in my name rather than closing,... 
[17:19:26] (03CR) 10Hnowlan: [C: 03+2] restbase: remove restbase2011 [puppet] - 10https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928) (owner: 10Hnowlan) [17:19:52] (03PS3) 10Hnowlan: restbase: remove restbase2011 [puppet] - 10https://gerrit.wikimedia.org/r/757648 (https://phabricator.wikimedia.org/T299928) [17:21:00] (03CR) 104nn1l2: "Hey, please help me finish this today finally 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [17:21:40] 10SRE, 10LDAP-Access-Requests, 10User-Ladsgroup: Grant access to wmf LDAP group for Madalina Ana - https://phabricator.wikimedia.org/T300749 (10Ladsgroup) 05Open→03Resolved [17:26:12] (03PS1) 10Elukey: ml-services: set the new editquality's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759282 [17:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19982 and previous config saved to /var/cache/conftool/dbconfig/20220202-172620-marostegui.json [17:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:43] !log begin logstash upgrade (codfw) T299168 [17:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:45] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [17:26:48] (03CR) 10Accraze: [C: 03+1] ml-services: set the new editquality's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759282 (owner: 10Elukey) [17:28:34] jouncebot: next [17:28:34] In 1 hour(s) and 31 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [17:28:34] In 1 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [17:30:57] (03CR) 10Elukey: [C: 03+2] ml-services: set the new editquality's docker 
image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759282 (owner: 10Elukey) [17:32:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [17:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:09] 10SRE-swift-storage: Bring ms-fe20[09-12] into service - https://phabricator.wikimedia.org/T300738 (10MatthewVernon) 05Open→03Resolved [17:40:24] (03CR) 10Zabe: [C: 03+1] kywiki: update logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [17:41:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300402)', diff saved to https://phabricator.wikimedia.org/P19983 and previous config saved to /var/cache/conftool/dbconfig/20220202-174125-marostegui.json [17:41:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:29] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:41:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:41:35] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300402)', diff saved to https://phabricator.wikimedia.org/P19984 and previous config saved to /var/cache/conftool/dbconfig/20220202-174138-marostegui.json [17:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300402)', diff saved to https://phabricator.wikimedia.org/P19985 and previous config saved to /var/cache/conftool/dbconfig/20220202-174513-marostegui.json [17:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:24] !log end logstash upgrade (codfw) T299168 [17:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:26] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [17:57:53] (03PS1) 10Accraze: ml-services: bump editquality-transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759290 (https://phabricator.wikimedia.org/T298943) [17:58:59] (03PS1) 10Ladsgroup: Revert "Support audio on filepage in InstantCommons" [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/758925 (https://phabricator.wikimedia.org/T300751) [17:59:29] dancy: this will unblock the train https://gerrit.wikimedia.org/r/c/mediawiki/core/+/758925 [17:59:35] (03CR) 10Ladsgroup: [C: 03+2] Revert "Support audio on filepage in InstantCommons" [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/758925 (https://phabricator.wikimedia.org/T300751) (owner: 10Ladsgroup) [18:00:07] thx. Will you be handling the deployment? 
[18:00:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19986 and previous config saved to /var/cache/conftool/dbconfig/20220202-180018-marostegui.json [18:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:02] 10SRE, 10Infrastructure-Foundations, 10netops: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (10cmooney) Just to add an update here the main VXLAN/EVPN configuration has been added to the devices, and using some test servers kindly installed by dc-ops I've been able... [18:03:22] (03PS2) 10RLazarus: imagecatalog: Only run on the active deployment host [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) [18:06:13] (03CR) 10RLazarus: [C: 03+2] imagecatalog: Only run on the active deployment host [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [18:09:05] dancy: yup [18:09:14] Fantastic [18:13:06] (03CR) 10Elukey: [C: 03+2] ml-services: bump editquality-transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/759290 (https://phabricator.wikimedia.org/T298943) (owner: 10Accraze) [18:13:11] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:01] (03Merged) 10jenkins-bot: Revert "Support audio on filepage in InstantCommons" [core] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/758925 (https://phabricator.wikimedia.org/T300751) (owner: 10Ladsgroup) [18:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19987 and previous config saved to /var/cache/conftool/dbconfig/20220202-181522-marostegui.json [18:15:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:03] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:16] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.20/includes/filerepo/file/ForeignAPIFile.php: Backport: [[gerrit:758925|Revert "Support audio on filepage in InstantCommons" (T300751)]] (duration: 00m 51s) [18:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:19] T300751: InstantCommons is broken - https://phabricator.wikimedia.org/T300751 [18:17:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [18:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:43] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: Read log files more efficiently [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [18:22:47] (03CR) 10Ahmon Dancy: [C: 03+1] fix logspam-watch: sorting by column 6 is broken [puppet] - 10https://gerrit.wikimedia.org/r/758965 (https://phabricator.wikimedia.org/T300298) (owner: 10Ahmon Dancy) [18:22:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:22:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:24:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:04] Amir1: ty for the revert [18:26:21] yw [18:26:35] dancy: the train is clear from that blocker point of view [18:27:01] Fantastic. Thanks to everyone for getting things fixed up [18:27:04] Amir1: I moved the task to block next week [18:27:17] Thank Amir, I just poked people [18:27:35] Thanks [18:28:15] Np [18:29:23] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [18:29:44] (03CR) 10Jdlrobson: "recheck" [skins/Vector] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/758911 (https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [18:30:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300402)', diff saved to https://phabricator.wikimedia.org/P19988 and previous config saved to /var/cache/conftool/dbconfig/20220202-183027-marostegui.json [18:30:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:30:31] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [18:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300402)', diff saved to https://phabricator.wikimedia.org/P19989 and previous config saved to /var/cache/conftool/dbconfig/20220202-183034-marostegui.json [18:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:30] (03PS6) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 
10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [18:32:58] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [18:33:11] (03Abandoned) 10Jdlrobson: Opt in link should be different in migration mode [skins/Vector] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/758911 (https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [18:33:19] 10SRE, 10ops-eqiad, 10DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (10cmooney) 05In progress→03Resolved [18:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300402)', diff saved to https://phabricator.wikimedia.org/P19990 and previous config saved to /var/cache/conftool/dbconfig/20220202-183404-marostegui.json [18:34:06] (03PS1) 10Elukey: ml-services: add STORAGE_URI env vars to editquality's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/759294 [18:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:20] (03CR) 10Hashar: ci: Qemu image and snapshot creation (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [18:36:17] (03PS5) 10Jdlrobson: Migration mode enabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) [18:36:39] (03CR) 10Accraze: [C: 03+1] ml-services: add STORAGE_URI env vars to editquality's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/759294 (owner: 10Elukey) [18:40:20] (03CR) 10Ahmon Dancy: [C: 03+1] ci: Qemu image and snapshot creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [18:43:37] (03CR) 10Elukey: [C: 03+2] ml-services: add STORAGE_URI env vars to editquality's config 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/759294 (owner: 10Elukey) [18:49:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19991 and previous config saved to /var/cache/conftool/dbconfig/20220202-184909-marostegui.json [18:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:31] jouncebot next [18:50:31] In 0 hour(s) and 9 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [18:50:31] In 0 hour(s) and 9 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [18:52:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [18:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:34] (03PS1) 10Jdlrobson: Fix the opt in URl [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759306 (https://phabricator.wikimedia.org/T300097) [18:58:25] (03PS1) 10Herron: watchrat: check URLs from watchmouse not already covered by icinga [puppet] - 10https://gerrit.wikimedia.org/r/759297 (https://phabricator.wikimedia.org/T299147) [19:00:05] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900). [19:00:05] zabe, nn1l2, and Jdlrobson: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] dancy and brennen: May I have your attention please! Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T1900) [19:00:08] hi [19:00:25] I believe train log triage has been cancelled this week. 
[19:00:36] ^ or at least isn't scheduled for this window. [19:00:43] i can deploy today [19:00:45] o/ [19:01:04] Jdlrobson: hi, around? [19:01:19] (03CR) 10Urbanecm: [C: 03+2] kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [19:02:06] (03CR) 10jerkins-bot: [V: 04-1] kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [19:02:58] (03PS8) 10Ahmon Dancy: multiversion: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 [19:03:21] that test seems to be flaky ^ [19:03:34] HTTP/1.1 426 Upgrade Required? [19:03:36] that's...weird [19:03:48] (03CR) 10Urbanecm: [C: 03+2] kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [19:03:52] let's try once again [19:04:06] sometimes there are errors that are just gone if you "recheck" [19:04:09] hmm.. file_get_contents(https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language&smlangprop=dir%7Ccode%7Csite&smsiteprop=dbname&formatversion=2): failed to open stream: HTTP request failed! HTTP/1.1 426 Upgrade Required [19:04:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19992 and previous config saved to /var/cache/conftool/dbconfig/20220202-190414-marostegui.json [19:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:26] mutante: or re-+2, in this case. let's hope. 
[19:04:41] (03Merged) 10jenkins-bot: kywiki: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759278 (https://phabricator.wikimedia.org/T300241) (owner: 104nn1l2) [19:04:46] voilá [19:04:53] (03PS1) 10Dzahn: gitlab: add parameter to allow usign either acmechief or certbot [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:05:13] mutante: it's jenkins being jenkins [19:05:14] nn1l2: your patch is at mwdebug1001, please test. [19:05:20] urbanecm: Can you add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/622408 to your queue ? [19:05:40] LGTM [19:05:43] dancy: sure thing. Should I ping you when I'm done with deploying? [19:05:45] nn1l2: syncing [19:05:49] yes please. [19:05:52] will do [19:05:54] Thanks! [19:08:28] (03PS2) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:08:28] urbanecm: 426 issues are https://phabricator.wikimedia.org/T300366 I think, I've pinged traffic a few times about that but not sure if I was noticed or not [19:08:28] fallout from their TLS termination experiments, envoy does not support http 1.0 [19:08:28] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 335cbee: kywiki: update logo (1/3; T300241) (duration: 00m 50s) [19:08:28] sorry im late. 
I'm, here now urbanecm [19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:28] (03CR) 10Dzahn: "One thing here that I would still have to check is the "class { 'sslcert::dhparam': }", we do that in in the gerrit class but we don't do " [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [19:08:28] T300241: Changing the logo of the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T300241 [19:08:28] lots of drama this morning relating to yesterdays deploy hehe :) [19:08:44] thanks taavi. WIll ignore the failures, for now. [19:08:51] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 335cbee: kywiki: update logo (2/3; T300241) (duration: 00m 53s) [19:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:02] (03CR) 10Urbanecm: [C: 03+2] Fix the opt in URl [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759306 (https://phabricator.wikimedia.org/T300097) (owner: 10Jdlrobson) [19:09:06] and hi Jdlrobson [19:09:27] :) [19:09:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:40] !log urbanecm@deploy1002 Synchronized logos/config.yaml: 335cbee: kywiki: update logo (3/3; T300241) (duration: 00m 49s) [19:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:45] (03PS3) 10Urbanecm: Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:09:48] (03CR) 10Urbanecm: [C: 03+2] Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:09:50] !log Running homer to enable 
interface et-1/0/2 on cr1-eqiad (towards lsw1-e1-eqiad) to test connectivity. [19:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:52] nn1l2: your patch is live [19:10:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:57] !log Purge https://en.wikipedia.org/static/images/project-logos/{kywiki,kywiki-1.5x,kywiki-2x}.png (T300241) [19:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:03] (03Merged) 10jenkins-bot: Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:11:17] zabe: pulled to mwdebug1001, can you have a look? [19:11:21] yep [19:11:23] Yeah, Thanks! 
[19:12:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:23] urbanecm: stuff still seems to work, since it is only buildConfigCache.php, I can't really test it further [19:13:35] ack [19:14:07] (03PS3) 10Urbanecm: Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:14:12] (03CR) 10Urbanecm: [C: 03+2] Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:14:16] syncing [19:14:47] (03CR) 10jerkins-bot: [V: 04-1] Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:14:50] !log urbanecm@deploy1002 Synchronized multiversion/buildConfigCache.php: 83f1f6a: Consistently write to $wmgRealm the same value as to $wmfRealm (T45956) (duration: 00m 49s) [19:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:54] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [19:15:11] (03CR) 10Urbanecm: [C: 03+2] Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:15:46] (03CR) 10Dzahn: "so yea, the class { '::sslcert::dhparam': } include is in lots of places, we probably should have that." 
[puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [19:16:02] (03Merged) 10jenkins-bot: Migrate calls of wmf* constants to wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [19:16:32] zabe: second patch pulled to mwdebug1001 [19:16:46] as Timo said, please test thoroughly :) [19:17:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300402)', diff saved to https://phabricator.wikimedia.org/P19993 and previous config saved to /var/cache/conftool/dbconfig/20220202-191918-marostegui.json [19:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:22] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [19:19:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:07] zabe: did you see my message? 🙂 [19:20:23] yes, I'm testing [19:20:37] okay, waiting patiently :) [19:21:38] ok, lgtm. Multiple wikis are looking good. Also https://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&sishowalldb= still returns the correct info under wmf-config.
Nothing suspicious in Logstash [19:22:14] urbanecm: ^ [19:22:24] good [19:24:04] !log urbanecm@deploy1002 Synchronized wmf-config/: a48f8bd: Migrate calls of wmf* constants to wmg* constants (T45956) (duration: 00m 51s) [19:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:08] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [19:24:12] zabe: live. as yesterday, please monitor logstash for a while. [19:24:25] (03Merged) 10jenkins-bot: Fix the opt in URl [skins/Vector] (wmf/1.38.0-wmf.20) - 10https://gerrit.wikimedia.org/r/759306 (https://phabricator.wikimedia.org/T300097) (owner: 10Jdlrobson) [19:24:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:24:33] yep [19:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:28] Jdlrobson: your backport is at mwdebug1001 [19:25:30] please test [19:25:57] urbanecm testing... [19:26:16] LGTM urbanecm [19:26:20] syncing [19:27:23] (03PS6) 10Urbanecm: Migration mode enabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [19:27:42] (03CR) 10Urbanecm: [C: 03+2] Migration mode enabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757734 (https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [19:27:53] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.20/skins/Vector/includes/SkinVector.php: bdc20ddb357e6a993cc4a3d9ddbecac964843744: Fix the opt in URl (T300097) (duration: 00m 49s) [19:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:56] T300097: "Switch to old look" points to wrong section on wikis with new two-skin configuration - https://phabricator.wikimedia.org/T300097 [19:28:28] (03Merged) 10jenkins-bot: Migration mode enabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757734 
(https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [19:28:45] Jdlrobson: and config is at mwdebug1001 now, too [19:28:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:17] checking.. [19:31:30] 10SRE: bugzilla SSL - weak RSA key, RC4 usage - https://phabricator.wikimedia.org/T83768 (10Dzahn) [19:31:46] just waiting 5 mins on the logs [19:31:48] but it's looking good [19:33:09] okqa [19:33:11] *okqa [19:33:14] *okay [19:35:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:15] urbanecm: okay let's do it. [19:36:22] syncing! 
[19:36:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:39] (03PS1) 10Zabe: Migrate $wmfRealm calls to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) [19:37:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 62b2acb: Migration mode enabled everywhere (T299927) (duration: 00m 49s) [19:37:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:45] T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927 [19:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:55] Jdlrobson: should be live [19:37:57] anything else? [19:38:03] urbanecm: nope... just gonna watch the logs now [19:38:07] THANK YOU [19:38:12] any time [19:38:45] possibly see you later today :) [19:38:49] heh [19:38:52] dancy: I'm done [19:39:05] Thanks. 
+2'ing my change [19:39:11] (03CR) 10Ahmon Dancy: [C: 03+2] multiversion: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [19:40:02] (03Merged) 10jenkins-bot: multiversion: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [19:40:30] (03CR) 10Dzahn: "chatted with traffic, we don't need to include that class for DHE params here" [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [19:40:56] (03CR) 10jerkins-bot: [V: 04-1] cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:41:11] (03PS2) 10Zabe: Migrate $wmfRealm calls to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) [19:41:33] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:41:35] (03PS3) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:41:55] are we still in the backport window? [19:42:08] yeah. I'm finishing up a deploy right now. [19:42:38] If I may add a last-minute request? [19:42:42] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:622408|multiversion: Improve error message if wikiversions.php has wrong format]] (duration: 00m 49s) [19:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:48] I'm done. 
[19:42:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:50] Not sure if we've deployed 6 patches already [19:43:34] hauskatze: Whatcha got? [19:43:45] dancy: got https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/759307 [19:43:53] chapter wiki, no danger to do it IMHO [19:44:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:44:06] ok.. testable? [19:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:18] dancy: yup, on mwdebug, I can do it [19:44:23] ok.. +2'ing [19:44:25] (03CR) 10Ahmon Dancy: [C: 03+2] cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:44:30] checking listgrouprights for the change [19:44:51] (and I'll add the patch to the calendar afterwards, many thanks :) ) [19:44:57] (03CR) 10jerkins-bot: [V: 04-1] cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:45:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:26] (03CR) 10Ahmon Dancy: cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:45:28] re+2ing should fix 
it [19:45:32] (03CR) 10Ahmon Dancy: [C: 03+2] cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:46:14] (03Merged) 10jenkins-bot: cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759307 (https://phabricator.wikimedia.org/T300779) (owner: 10MarcoAurelio) [19:46:53] (03PS4) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:47:06] hauskatze: Ok. It's on mwdebug [19:47:16] 1 or 2 dancy ? :) [19:47:22] 1001 [19:47:32] ty, checking [19:47:43] (03PS1) 10Jdlrobson: Changes the labels of the Vector skins [skins/Vector] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/759308 (https://phabricator.wikimedia.org/T299927) [19:47:51] (03PS1) 10Herron: initial sketch of watchrat alert [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) [19:48:03] (03PS5) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:48:17] dancy: looks good to me, Special:ListGroupRights displays correctly the new capabilities as desired [19:48:25] ok.. 
deploying fully [19:49:05] (03PS6) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [19:49:22] !log dancy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759307|cowikimedia: Allow bureaucrats to remove sysop and bureaucrat flags (T300779)]] (duration: 00m 50s) [19:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:25] T300779: Allow co.wikimedia (chapter wiki) bureaucrats to revoke sysop and bureaucrat permissions - https://phabricator.wikimedia.org/T300779 [19:49:34] hauskatze: Done [19:49:42] Thanks dancy [19:49:50] I'll list the patch on Wikitech shortly too [19:49:54] too many places to log-in [19:50:11] (03CR) 10jerkins-bot: [V: 04-1] initial sketch of watchrat alert [alerts] - 10https://gerrit.wikimedia.org/r/759302 (https://phabricator.wikimedia.org/T299147) (owner: 10Herron) [19:50:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] Added to the calendar as well [19:57:16] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10Dzahn) 19:07 < taavi> 
fallout from their TLS termination experiments, envoy does not support http 1.0 [20:00:20] dancy and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T2000). [20:01:14] Starting. [20:01:16] o/ [20:01:30] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759305 [20:01:32] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759305 (owner: 10Ahmon Dancy) [20:02:11] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.20 refs T293961 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759305 (owner: 10Ahmon Dancy) [20:03:24] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.20 refs T293961 [20:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:28] T293961: 1.38.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T293961 [20:04:12] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10Majavah) >>! In T300366#7672412, @Dzahn wrote: > 19:07 < taavi> fallout from their TLS termination experiments, envoy does not support http 1.0 "TLS termination experiments" re... [20:04:14] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.20 refs T293961 (duration: 00m 49s) [20:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:19] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Dzahn) There are some user reports / IRC chatter and this ticket T300366 that seem like they are related to this. 
[20:07:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:09:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:38] hmm replag error spike there around deployment, but seems transient? [20:10:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:20] There have been steady replication lag complaints for .19 throughout the day [20:12:34] I think. [20:12:37] * dancy double checks [20:12:59] dancy brennen the writes had a spike around the rollout https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&refresh=1m [20:13:38] but it has recovered [20:14:22] dancy: are the replag errors more than usual? which wikis are affected? it looks like s4 from the graphs but I'm not sure [20:14:41] i recall a similar spike during last week's train, i think. [20:15:04] replag errors do seem... higher than usual overall? [20:15:09] * brennen pokes at logstash [20:15:17] (higher than usual but i suspect they were that way prior to .20) [20:15:55] yeah but I want to know what the underlying issue is (I don't think it's train related at least) [20:16:04] could simply be some maint running and causing issues [20:16:50] it looks like it is commons (which is s4) [20:17:16] Looking at the last 7 days of replag complaints, today's level looks about normal.
[20:17:39] (excluding the brief spike) [20:18:56] the only writes I know of there are from the actor migration [20:19:52] it will be done in a month or so :/ [20:20:10] anyway, nothing we can fix atm [20:22:11] thanks Amir1. [20:24:49] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10MusikAnimal) #xtools was affected by this, since it uses `file_get_contents` to fetch an on-wiki config. In case it helps others, I managed to get around it by changing the code... [20:27:09] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10BBlack) summarizing about the link above: apparently we do have HTTP/1.0 clients, and it does work with our other terminators, but not envoy. Envoy does have s... [20:42:29] (03PS1) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [20:43:08] (03CR) 10jerkins-bot: [V: 04-1] Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:50:01] (03PS2) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [20:50:29] (03CR) 10jerkins-bot: [V: 04-1] Add eBGP peering between CR routers and datacenter switches. [homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:54:35] (03PS3) 10Cathal Mooney: Add eBGP peering between CR routers and datacenter switches.
[homer/public] - 10https://gerrit.wikimedia.org/r/759331 (https://phabricator.wikimedia.org/T299758) [20:55:08] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33557/gitlab1001.wikimedia.org/change.gitlab1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [20:56:16] (03CR) 10Dzahn: [C: 04-1] "I am mixing up Gerrit and Gitlab stuff...fixing" [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [21:00:05] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T2000). [21:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220202T2100). 
[21:23:11] (03PS8) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [21:26:41] 10SRE, 10Commons, 10MediaWiki-File-management, 10Product-Infrastructure-Team-Backlog, and 6 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10Jdforrester-WMF) [21:28:36] (03PS9) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [21:32:04] 10SRE, 10ops-eqiad, 10Traffic: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (10Zabe) [21:32:40] (03CR) 10Dzahn: Regularly check MFA status of elevated Phabricator accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [21:32:56] which version of Znuny (ex-OTRS) are we using? [21:34:23] 6.0.37 I think [21:35:55] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33561/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [21:37:38] (03CR) 10Dzahn: [C: 04-1] "oh... except the "post hook" after cert renewal refreshes apache and we use nginx..?" 
[puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [21:39:55] thanks zabe [21:41:34] (03PS10) 10Dzahn: gitlab: parameter to allow using either acmechief or certbot for certs [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) [21:46:58] (03CR) 10Hashar: ci: Qemu image and snapshot creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [21:47:58] (03PS7) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [21:53:12] (03CR) 10Ahmon Dancy: [C: 03+1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [21:53:37] (03CR) 10Hashar: "I guess the next steps are:" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [21:55:00] (03CR) 10Hashar: "To clarify: I will do that tomorrow and report back. Thank you Dave and Ahmon for the reviews and hints!" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [22:06:35] (03PS1) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 [22:08:13] 10SRE, 10SRE-Access-Requests, 10Analytics, 10User-Ladsgroup: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10DannyH) @Ladsgroup Thank you, I appreciate it! 
[22:16:18] (03PS2) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) [22:18:17] (03CR) 10jerkins-bot: [V: 04-1] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [22:19:02] (03PS3) 10JHathaway: mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) [22:19:43] 10SRE, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10Aklapper) a:05ArielGlenn→03None @ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to r... [22:22:32] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Ahecht) Toolforge is still running PHP 7.3, which defaults to HTTP/1.0, so any PHP toolforge tools that access the API are going to be broken by this unless the... [22:22:38] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10Ahecht) My [[ https://randomincategory.toolforge.org/ | RandomInCategory ]] tool on toolforge was affected by this as well since it's using the standard PHP 7.3 installation on... [22:23:16] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33563/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/759299 (https://phabricator.wikimedia.org/T297411) (owner: 10Dzahn) [22:26:02] !log gitlab - introducing parameter to fetch TLS certs either with acmechief or certbot (if in cloud). 
Boolean $use_acmechief = lookup('profile::gitlab::use_acmechief'), confirmed noop in prod on gitlab1001.wikimedia.org ( T297411) [22:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:06] T297411: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 [22:26:55] 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10JBennett) I'm not aware of prior management decisions or risk assessments to prohibi... [22:27:43] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/759344 (https://phabricator.wikimedia.org/T299107) (owner: 10JHathaway) [22:47:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10wiki_willy) 05Open→03Resolved Resolving task since the kernel upgrade seems to have fixed this. Thanks, Willly [22:47:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10wiki_willy) [22:48:42] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10wiki_willy) 05Open→03Resolved Resolving task since the kernel upgrade seems to have fixed this. 
Thanks, Willly [22:48:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10wiki_willy) [22:48:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10wiki_willy) [22:55:59] (03CR) 10Dzahn: [C: 03+2] fix logspam-watch: sorting by column 6 is broken [puppet] - 10https://gerrit.wikimedia.org/r/758965 (https://phabricator.wikimedia.org/T300298) (owner: 10Ahmon Dancy) [22:56:40] (03PS1) 10Aklapper: mariadb/phabricator: Add auth table to GRANTS for phstats [puppet] - 10https://gerrit.wikimedia.org/r/759357 [22:58:17] (03PS2) 10Aklapper: mariadb/phabricator: Add auth table to GRANTS for phstats [puppet] - 10https://gerrit.wikimedia.org/r/759357 (https://phabricator.wikimedia.org/T299403) [22:58:31] (03CR) 10Dzahn: "Is /usr/local/bin/logspam used only by humans or also by other software?" 
[puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [22:58:38] (03PS1) 10Aklapper: Regularly check MFA status of elevated Phabricator accounts [puppet] - 10https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) [22:59:27] (03Abandoned) 10Aklapper: Regularly check MFA status of elevated Phabricator accounts [puppet] - 10https://gerrit.wikimedia.org/r/758963 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [23:00:01] (03CR) 10Aklapper: "This depends on https://gerrit.wikimedia.org/r/c/operations/puppet/+/759357" [puppet] - 10https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [23:05:38] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: Read log files more efficiently (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [23:07:59] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10Dzahn) If other users find their way here, it affects you if you are a HTTP/1.0 client for one way or another. see the summary at T271421#7672538 [23:08:52] (03CR) 10Brennen Bearnes: logspam: Read log files more efficiently (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [23:11:24] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Dzahn) [23:13:11] 10SRE, 10Traffic, 10WMF-Legal, 10Performance-Team (Radar), 10Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (10Krinkle) >>! From T298166: > Add the `no-transform` header […] and hope this deals w... 
[23:13:13] (03CR) 10Dave Pifke: [C: 03+1] ci: Qemu image and snapshot creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [23:14:17] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding while I do some manual testing on edge cases." [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [23:14:29] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Dzahn) [23:14:35] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10Dzahn) [23:20:49] (03PS4) 10Ahmon Dancy: logspam: Read log files more efficiently [puppet] - 10https://gerrit.wikimedia.org/r/758962 [23:21:14] (03CR) 10Ahmon Dancy: [C: 03+1] "Fixed bug if input file is empty." [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [23:23:00] 10SRE, 10GitLab: gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Dzahn) [23:23:55] 10SRE, 10GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Dzahn) [23:24:24] 10SRE, 10serviceops, 10GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Dzahn) [23:37:19] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10TheDJ) >>! In T271421#7672538, @BBlack wrote: > Envoy does have some config for this (cf https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/proto... 
[23:53:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10Andrew) [23:53:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Andrew) [23:54:33] 10SRE, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Andrew) 05Resolved→03Open I'm reopening because we need to decide if this kernel can be used long-term and, if so, the same should be applied to 1004. No tasks here for dcops though,... [23:55:29] 10SRE, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10wiki_willy) Sounds good @Andrew, thanks!