[00:00:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:28] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:03:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:28] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:05:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:25:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:25:20] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab1001.wikimedia.org
[00:28:35] (CR) Dzahn: [C: +1] "END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab1001.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[00:29:45] (CR) Dzahn: [C: +1] "mainly just FYI and about my comment changes" [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[00:35:26] (PS1) 
Mary Yang: Add alert manager alert receivers for the Abstract Wikipedia team. [puppet] - https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457)
[00:38:25] (CR) Mary Yang: "Hi Daniel, please let me know if this looks right. Thank you so much." [puppet] - https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[00:41:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:41:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:42:10] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 18.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:43:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 22.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:44:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:44:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 49.8 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:44:40] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:43] looking
[00:44:53] (assume it's post working hours for the oncallers)
[00:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:14] (PS1) Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/811794 (https://phabricator.wikimedia.org/T113916)
[00:45:16] (PS1) Krinkle: Remove unused 'ResourceLoaderImage' logging setting [mediawiki-config] - https://gerrit.wikimedia.org/r/811795
[00:46:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:46:52] rzl, is https://www.wikimediastatus.net not displaying stuff related?
[00:47:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 86.72 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:47:08] oop, back. Might've been issue on my end.
[00:47:49] perryprog: should be unrelated, that's hosted off our infrastructure exactly so that it'll keep working when we have trouble :)
[00:48:10] 👍
[00:48:16] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:48:30] for anyone following along, the varnish traffic drop alert is also unrelated to the shellbox issue, it's just a red herring from a traffic bump we got 30 minutes ago
[00:48:50] just got paged for shellbox, looking
[00:49:17] 👋
[00:49:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:49:31] jhathaway: wow, you work fast :)
[00:49:42] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[00:49:49] ha! I wish I could claim some action on my behalf
[00:49:57] is it the same shellbox spike thing?
[00:50:06] seems likely, that's what I'm checking on now
[00:50:13] Strikes fear into the infra, fixes self
[00:50:39] I think it would be useful if you can look into exec.log to get an indication of what the shellouts are to see if we can correspond it to some page edits or traffic
[00:50:55] (CR) Krinkle: [C: +2] noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:51:02] I forget if there's a separate score.log too
[00:51:03] ah yup https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-30m&to=now
[00:51:13] er, wrong paste
[00:51:36] https://grafana.wikimedia.org/goto/gaDAlp6nk?orgId=1
[00:51:43] (Merged) jenkins-bot: noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[00:51:52] (wish the grafana "copy link url" button actually touched my paste buffer, or failing that, I wish I could remember that it doesn't)
[00:52:10] legoktm: yeah good idea -- I'll be kind of hamfisted getting in there for the first time, but I'll see what I can find
[00:52:12] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:18] good practice for me if nothing else
[00:52:29] legoktm: where do I find the exec.log?
[00:52:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:40] mwlog100X:/srv/mw-log/
[00:52:45] thanks
[00:52:51] wait, thumbor failure? 
didn't this happen with mybib not too long ago
[00:53:18] oh, nice, I thought I was going to have to get a shell in the pod
[00:53:33] https://grafana.wikimedia.org/d/3SiE86Nnz/mediawiki-shellouts?orgId=1&from=now-3h&to=now scroll down to the bottom where score/lilypond is and you can see the spike from the MW side
[00:54:00] someone plugged some big pdf into mybib, mybib made a bunch of requests to scan through that pdf or something and it made some stuff angry. (2022-05-06 in my logs)
[00:54:32] the proportion of png vs audio shellouts seems pretty organic, that someone purged or parsed a bunch of pages with scores on them
[00:55:21] exec.log at least will let us narrow it down by wiki, which if it doesn't have URL/title, we hopefully will get lucky in webrequest/5xx logs
[00:56:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:56:33] 1005 '\''scripts/generatePngAndMidi.sh'\'''
[00:56:41] seems to be the most executed in that time period
[00:56:56] yeah, that's what the dashboard says
[00:57:07] what wiki was it on?
[00:57:16] enwiki
[00:57:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:57:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:58:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:58:42] almost entirely -- I also see one enwikisource and one mediawikiwiki, let me get an accurate count
[00:59:15] https://en.wikipedia.org/wiki/Special:RecentChangesLinked?hidebots=1&hidecategorization=1&target=Category%3APages_using_the_Score_extension&limit=500&days=30&enhanced=1&damaging__likelybad_color=c4&damaging__verylikelybad_color=c5&urlversion=2 no smoking gun
[01:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:45] nothing obvious in template/module edits either
[01:01:09] https://www.irccloud.com/pastebin/ZLHfpazY/
[01:02:11] my guess is that the frwiki/mw.org hits are just coincidences...I guess we need a way to correlate these shellouts to actual requests
[01:03:18] yeah, what does generatePngAndMidi do? What does midi mean in this context?
[01:03:39] it literally generates png and midi files :)
[01:03:52] it's the same MIDI you're thinking of :D these are shellouts for musical scores
[01:03:55] The MIDI file is so you can play it alongside the score
[01:03:55] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Score/+/refs/heads/master/scripts/generatePngAndMidi.sh
[01:03:57] seems like an odd combination, but I trust you!
[01:04:13] it really means music scores as in https://en.wikipedia.org/wiki/MIDI
[01:04:16] ah!!! 
that makes sense thanks perryprog
[01:04:49] we shell out to lilypad to do the actual work, shellbox's purpose is to be a security boundary to do that shellout safely
[01:05:00] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:02] lilypond scuse me
[01:05:20] got it, I hadn't put together all the pieces, thanks
[01:05:31] legoktm, would a purge re-call all that? Any page with a score tag should be in that category, and only pages with the score tag should also be calling generatePngAndMidi (https://gerrit.wikimedia.org/g/mediawiki/extensions/Score/+/d445b6740cb270c2aa1951809e7de88f073dfd74/includes/Score.php#675)
[01:05:40] That or a null edit
[01:06:16] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1001.wikimedia.org.service,rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab1001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:27] mmm
[01:06:50] I got that
[01:06:53] no...to trigger generatePngAndMidi you have to pass some unique content in the tag because it gets cached
[01:07:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:55] if that lilypond spike was when score tags were added to a bunch of articles, is it possible they wouldn't be reflected in that category yet, and so wouldn't show up on that RC query?
[01:08:00] or should that be caught up by now
[01:08:18] should be caught up by now. 
one "attack" would be to just "show preview" but never save it
[01:08:28] oh, cute
[01:08:35] Ah, right
[01:08:59] or you could even api.php?action=parse&text=...., etc.
[01:09:13] weird attack vector though
[01:09:21] a classic one of those "yeah, that'd work, but man, there are easier ways"
[01:09:28] my suggestion would be 1) try to correlate request IDs/shellouts to actual requests in webrequest/5xx. if that doesn't work 2) add more extensive debug logging to Score that effectively captures title and $_GET or the URL
[01:09:34] exactly
[01:09:35] s'cuse me if I'm stating the obvious as I have zero idea what I'm talking about, but https://logstash.wikimedia.org/goto/990bc0c351eb3c8f58da3f17b4e60b70 has 1,242 Shellbox\ShellboxError counts in the last hour, all for this edit https://en.wikipedia.org/w/index.php?title=Pictures_at_an_Exhibition&diff=1096844863&oldid=1096808949&diffmode=source ?
[01:09:46] !log gitlab1004 - systemctl reset-failed, clear icinga alerts about rsync to decom'ed machine
[01:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:35] TheresNoTime: not stating the obvious at all, that's super welcome! but I think the timestamp for that edit is after these requests were finished
[01:11:16] maybe while doing that edit they kept creating an invalid score, previewing, modifying, and repeat?
[01:11:16] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:24] ... unless that single edit somehow shelled out thousands of times in the course of parsing, took 23 minutes to save, and so got that timestamp at the end
[01:11:26] 1000 hits is pretty excessive though
[01:11:32] oh, or that, yeah
[01:11:35] That seems high unless it's somehow reparsing multiple times
[01:12:01] there's no way visualeditor could be tricked into doing this on every keystroke, is there?
[01:12:14] is `transform/wikitext/to/pagebundle` VE btw?
[01:12:16] ... I guess I can test that pretty easily
[01:12:23] umm, there's the optimistic pre-save thing
[01:12:39] oh man
[01:12:47] wow
[01:12:47] it does
[01:12:58] I love this??
[01:13:09] there's a small debounce window, but every change you get a new request
[01:13:31] Of the form: https://gist.github.com/perryprog/4f29f9c20ef7d9a10288265d940e33da
[01:13:35] * TheresNoTime makes note to test real-time preview with this :/
[01:13:50] VE doesn't do pre-save other than during the writing of the edit summary, not while editing.
[01:14:02] It does in the lilypond editor
[01:14:08] WikiEditor does have a 2s debounce if the text area is untouched to start a parse in the background.
[01:14:35] oooo
[01:14:41] yeah, but if you're sitting there with a musical score in front of you, transcribing it into the box one note at a time, and then looking back over at your score...
[01:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:57] okay so yeah I can reproduce it. the page ended up with a template wrapping it so the lilypond VE editor plugin isn't used on the page when you edit it now
[01:16:59] okay, well, that's an extremely cool bug, and we should find a good way to stop triggering it
[01:17:10] ahhh
[01:17:25] yea, that was the request linked in the gist above (though I deleted said gist because I forgot to remove the cookies line, lol)
[01:17:25] but if you remove the {{block thing and then switch from wikitext to VE, you can indeed get the visualeditor plugin for lilypond and that modal has a 2s debounce preview indeed.
[01:18:56] this seems bad but I'm not super convinced this triggered the 1k requests we saw
[01:19:50] also if I hold down "a" for a second or two into the editor it sends off like 5-6 requests
[01:21:23] I think r.zl's point about using it as a live preview while transcribing could make that reach a thousand or so
[01:21:48] rzl: can you look at api.log for that time window for action=parse and paction=parsefragment?
[01:21:57] good call, one sec
[01:22:10] rzl: sorry, action=visualeditor
[01:22:15] ack
[01:22:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:22] (PS2) Dzahn: site/gitlab: remove gitlab1001, update comments [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142)
[01:25:05] fwiw, this is from MWExtensionDialog as used in Score: https://codesearch.wmcloud.org/deployed/?q=MWExtensionPreviewDialog&i=nope&files=js%24&excludeFiles=&repos=
[01:25:22] which defaults to a 0.25s debounce as intended for templates and other static html parse responses
[01:25:43] e.g. not something that generates a Midi file and PNG with unclear Swift retention
[01:25:46] that seems reasonable for those applications, but I think we should increase it for score
[01:26:10] ack
[01:26:27] I assume that's overrideable in https://gerrit.wikimedia.org/g/mediawiki/extensions/Score/+/d445b6740cb270c2aa1951809e7de88f073dfd74/modules/ve-score/ve.ui.MWScoreDialog.js?
[01:26:29] possibly even a button to preview
[01:26:59] just tested score with real-time preview, once that's deployed you could quite easily have it re-render the score + audio file per change :/ that is a Not Good ™️ thing, right?
[01:27:02] although a way to get fast feedback for syntax issues etc is helpful for sure you wouldn't want to wait 10 seconds for every missing comma
[01:27:23] I wonder if lilypond supports a check versus full run flag that'd let that be possible
[01:27:35] we have client-side syntax highlighting for this, I guess it has a linter as well
[01:27:36] Krinkle: I think something like 2s would be more reasonable
[01:28:02] can we do this via concurrency instead? will the editor ever fire off a second parse request before the first one has finished, and if so, can we make it wait?
[01:28:24] not sure if that's more or less satisfying than a fixed interval, but it definitely saves us guessing what that interval should be
[01:28:32] rzl: not currently no, it's a blind debounce and afaik it doesn't even account for out of order responses
[01:28:34] satisfying user experience I mean
[01:28:37] nod
[01:28:57] (Ah, not parameterized https://gerrit.wikimedia.org/g/mediawiki/extensions/VisualEditor/+/1548904bc46377a986347af81bb13da194af76c7/modules/ve-mw/ui/dialogs/ve.ui.MWExtensionPreviewDialog.js#22)
[01:29:02] I say this not having meaningfully reviewed VE code in almost 7 years
[01:29:12] so do check with Editing :)
[01:29:13] I think this is probably the culprit. I sat here typing one character at a time for about 20 seconds (like rzl's transcription idea) and managed to trigger off 400 requests
[01:29:33] rzl, normally you set a timeout for e.g., 250ms and re-start that timer every new user input. That way only once the user stopped typing for 250ms is the request sent off.
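The timer-restart debounce described in the 01:29:33 message can be sketched as below. This is only an illustration under stated assumptions, not the VisualEditor code: `Scheduler`, `realTimers`, `debounce` and `schedulePreview` are hypothetical names, and the 2000 ms value is just the "something like 2s" suggested in the discussion. The timer functions are passed in so the behaviour can be checked without real waiting.

```typescript
// Hypothetical sketch of a restart-on-input debounce; not VisualEditor's implementation.
interface Scheduler<H> {
  set(cb: () => void, ms: number): H;
  clear(handle: H): void;
}

// Real timers for normal use; a fake scheduler can be substituted when testing.
const realTimers: Scheduler<ReturnType<typeof setTimeout>> = {
  set: (cb, ms) => setTimeout(cb, ms),
  clear: (handle) => clearTimeout(handle),
};

function debounce<A extends unknown[], H>(
  fn: (...args: A) => void,
  waitMs: number,
  timers: Scheduler<H>,
): (...args: A) => void {
  let pending: H | undefined;
  return (...args: A): void => {
    // Every new input cancels the previous timer and starts a fresh one,
    // so fn only runs once the input has been quiet for waitMs.
    if (pending !== undefined) timers.clear(pending);
    pending = timers.set(() => {
      pending = undefined;
      fn(...args);
    }, waitMs);
  };
}

// Assumed usage: only the last input of a quiet period reaches the callback.
const schedulePreview = debounce(
  (wikitext: string) => console.log("would request a preview for:", wikitext),
  2000, // assumed interval, taken from the "something like 2s" suggestion
  realTimers,
);
```

A debounce like this bounds the request rate per editor, but it does nothing about a request that is still running when the next one is sent.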
[01:29:35] Krinkle: ack, still appreciate the gut check
[01:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:09] perryprog: right -- the advantage of switching from that approach to "at most one concurrent request" is that it self-limits the impact on the infrastructure
[01:30:19] no matter how fast you type, you only tie up one shellbox worker
[01:30:27] (CR) Dzahn: "sorry, I forgot this already existed as well. I did merge your other change though related to this." [dns] - https://gerrit.wikimedia.org/r/744762 (https://phabricator.wikimedia.org/T247653) (owner: Majavah)
[01:30:29] oh, concurrent request, not concurrent javascript.
[01:30:33] oh! yes
[01:30:37] that makes more sense
[01:30:46] 18:26:59 just tested score with real-time preview, once that's deployed you could quite easily have it re-render the score + audio file per change :/ that is a Not Good ™️ thing, right? <-- I think if the score is changing, it needs a reasonable debounce, like 2s. if the score is the same and other things are changing, then it's a pretty cheap request
[01:30:56] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36208/gitlab1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[01:31:02] in addition to r.zl's point about concurrency
[01:31:29] perryprog: yes, what you describe is what it does now. it waits for 250ms of no input, then makes a real request. But it never cancels or cares about the on-going request to inform (not) yet starting another. 
This does have UX impact of course, if you have a slow connection or if it's going to time out, you might not want a 60s period of no preview just because 1 request got stuck
[01:31:37] we already debounce re-renders, so it's probably 2s+ :)
[01:31:42] I don't know how well we do cancellation within the parse API
[01:31:54] (^ for RTP, sorry, cross-talking)
[01:31:57] Krinkle: yeah that's a good point, you'd want to have a deadline on it for sure
[01:32:02] I mean the client can call abort() but I don't know how far into PHP we still meaningfully save resources through that
[01:32:11] note that the same flaw conceptually exists for and except those are so much faster that I doubt it is a problem in practice
[01:32:23] yep
[01:32:30] but syntaxhighlight doesn't use this base class
[01:32:35] Oh, hang on—is the midi being fully generated for this... midiless preview?
[01:32:37] because it has client-side syntax highlighting
[01:32:45] I wouldn't be surprised if that was also the more expensive of the two
[01:33:06] (PS1) Dzahn: gitlab: remove gitlab1001 from Hiera [puppet] - https://gerrit.wikimedia.org/r/811797 (https://phabricator.wikimedia.org/T307142)
[01:33:06] perryprog: I *think* we only generate MIDI when it's asked for but it's been half a year since I poked at this
[01:33:18] perryprog: maybe not midi, but MP3 for sure. I may've miscalled it earlier.
[01:33:24] I'm def seeing an mp3 request after each response
[01:33:24] you might be right, I think it needs the audio="1" on the score tag. Though that's another way to make shellbox even slower.
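The "at most one concurrent request" idea from 01:30:09, combined with the deadline raised at 01:31:57, could look roughly like this. It is a sketch under assumptions, not anything from VisualEditor or Score: `singleFlight`, `Send` and `deadlineMs` are hypothetical names, and `send` stands in for whatever actually issues the preview request. Only the newest input is kept while a request is running, and the deadline frees the client-side slot if a request gets stuck; it does not cancel work already started on the server.

```typescript
// Hypothetical sketch: at most one preview request in flight, newest input wins.
type Send<T> = (input: T, done: () => void) => void;

function singleFlight<T>(send: Send<T>, deadlineMs: number): (input: T) => void {
  let inFlight = false;
  // Wrapped in an object so that "nothing queued" is distinguishable from any value of T.
  let queued: { input: T } | undefined;

  const start = (input: T): void => {
    inFlight = true;
    let settled = false;
    const finish = (): void => {
      if (settled) return; // ignore a late completion after the deadline, or vice versa
      settled = true;
      clearTimeout(timer);
      inFlight = false;
      const next = queued;
      queued = undefined;
      if (next !== undefined) start(next.input);
    };
    // Deadline: release the slot even if the request never reports back.
    const timer = setTimeout(finish, deadlineMs);
    send(input, finish);
  };

  return (input: T): void => {
    if (inFlight) {
      queued = { input }; // overwrite any older pending input
      return;
    }
    start(input);
  };
}
```

With this shape, typing faster only replaces the queued input; the number of requests running at once stays at one regardless of typing speed, which is the self-limiting property described in the conversation.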
[01:33:51] I'm playing fast and loose with my definition of midi there
[01:33:55] tbh that's the coolest part about score, that you can actually listen to them :)
[01:34:04] true
[01:34:13] yeah it's a great extension
[01:34:36] okay so recapping
[01:35:05] long-term it would be neat to investigate a concurrency limit for those parses, but that will probably take Actual Engineering and let's assume it won't happen right away
[01:35:25] (CR) Dzahn: [C: +2] "VM is destroyed. node is removed from site.pp" [puppet] - https://gerrit.wikimedia.org/r/811797 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[01:35:33] near-term we ought to be able to tweak the debouncing interval to make this unimpactful in the normal case
[01:36:13] immediate-term, like without waiting until working hours -- do we want to do anything Big Hammerish to mitigate this? my instinct is no but I do want to have talked about it
[01:36:34] Actually, the preview request is /made with score="1" if it's enabled in the dialog/, despite the dialog not doing anything with the generated audio file in the preview, and that actually slows it down client-side a huge amount
[01:36:46] audio="1"*
[01:36:57] given we don't believe it was intentional, I think we can leave it as is for the night rzl
[01:37:24] And the edit in question had audio="1" enabled, so they were probably—unintentionally—generating a new audio file (that was going nowhere) every new request.
[01:37:33] rzl: +1 on summary, is the patch j.oe made to increase the number of replicas still in place?
[01:37:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:37:55] perryprog: oof, also nice catch. definitely worth a separate bug report
[01:37:56] legoktm: 90% sure, double-checking now
[01:39:12] probably doesn't need security status right? ("there are easier ways")
[01:39:19] this is T310557 parent-task wise yeah?
[01:39:19] T310557: Shellbox resource management - https://phabricator.wikimedia.org/T310557
[01:39:21] assuming it is, then I agree no action is needed besides filing tickets for the editing team to look at
[01:39:30] perryprog: right, and we just finished discussing it in a public channel
[01:39:36] that too
[01:40:10] weird, I can't find his patch in the deployment charts repo
[01:40:30] I think he just did it live
[01:40:35] I wonder if that increase actually happened or if we just talked about it, I'll- aha
[01:41:06] https://sal.toolforge.org/production?p=0&q=shellbox&d=
[01:41:32] maybe good to do it in the repo itself since we are actively relying on it as mitigation
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:34] shoutout to perryprog for actually testing the theory and figuring this out :)
[01:45:00] yeah! really appreciate TheresNoTime spotting that edit too
[01:45:09] ^^ seconded
[01:45:15] what's the saying? 
a broken lurker spots the issue once every other month?
[01:45:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:31] I have my rare moments
[01:46:00] tomorrow during the day I'll see if the extra shellbox instances are actually getting us anything -- I suspect we're never bursting into them at all, except for when we burst past them anyway
[01:46:15] if I'm wrong about that I'll update the repo to match reality, else I'll update reality to match the repo
[01:46:17] perryprog: I believe this is Linus's law - "given enough eyeballs, all bugs are shallow"
[01:46:25] sounds good
[01:46:50] the other thing we need is one or more writeups in phab for what we've talked about
[01:47:22] I can take a stab but someone who knows the editors better than me would do a much better job
[01:47:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:58] I can do the possible unnecessary midi generation assuming my hypothesis is correct (currently testing to make sure I was right)
[01:48:09] perryprog: I think you should file that as a separate task
[01:48:16] +1
[01:48:31] rzl: happy to review what you write and clarify/adjust as needed
[01:48:59] works for me
[01:49:20] I will, though if there is an issue there it's a contributing factor to the requests from the edit TNT noted being very slow.
[01:49:59] yeah, just cross-link them
[01:50:31] IMO conceptually they're two separate issues. 
one is that we're making requests too often, second is that those requests are more expensive than they need to be
[01:50:50] 🎯
[01:51:41] not to keep banging the concurrency drum but another nice thing about it is that making the requests less expensive also makes the experience get immediately snappier for the same resource cost, without having to adjust anything
[01:52:10] Ah, yeah, you're totally right. I blame my being tired on not noticing that (though I'd probably miss that anyway)
[01:52:16] I guess I can freely bang the concurrency drum as much as I want, up to as fast as I can hit it with one hand
[01:52:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:01] lol
[01:53:05] :D
[01:55:43] https://en.wikipedia.org/wiki/User_talk:Squandermania#How_do_you_edit_musical_scores?
[01:56:10] oh yeah!! we can just ask people stuff
[01:56:35] high privacy does not mean we have no transparency :P
[01:56:40] I learned to do SRE in a very different kind of place, and it's been years but I still forget that
[01:56:43] if your number is up, we'll find you
[01:57:15] * Krinkle quotes Person of Interest
[01:57:28] "you didn't do anything wrong" (yet.)
[01:58:27] if you haven't checked out their user page yet, I highly recommend it
[01:58:47] yep, checks out
[02:10:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:10:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:12:04] (CR) BCornwall: "pcc output targeting P:cache::varnish::frontend:" [puppet] - https://gerrit.wikimedia.org/r/811780 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall)
[02:12:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:12:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:16:44] Looks like I was correct and fluidsynth is being run for every preview-edit if you have the relevant options enabled, though (and I didn't notice this due to probably browser issues) it /is/ displayed in the preview dialog. This also can't be easily cached since even non-note changing things (like changes to text) will change the midi file lilypond emits. legoktm, do you think it's still worth filing a ticket if this is the case?
[02:17:28] if it is being displayed to users then I think we should keep it as is
[02:18:06] since the goal of VE is to be WYSIWYG-like
[02:18:24] Sounds good, it probably doesn't have that big of an impact overall anyway (though that's somewhat of a guess that that's the case).
[02:18:33] (Heh, sounds good... I need to go sleep)
[02:18:39] one loose end, I'm missing something
[02:19:01] I think this is probably the culprit. 
I sat here typing one character at a time for about 20 seconds (like rzl's transcription idea) and managed to trigger off 400 requests <-- at 0.25 sec, shouldn't you have seen 80 requests at most? [02:19:45] https://grafana.wikimedia.org/d/3SiE86Nnz/mediawiki-shellouts?orgId=1&from=1657157007240&to=1657157433112&viewPanel=2 [02:20:12] maybe it was closer to 30 or 40s? unsure, it was very unscientific [02:20:35] hrm [02:21:39] I was literally just typing a few sentences out of something on my desk but at a slightly slower speed than I normally type [02:22:13] nod okay [02:22:32] I wonder if we're gonna discover there's more requests unaccounted for, like there's some fanout somewhere [02:22:41] should be easy to repeat on test.wikipedia.org/wiki/Score with more precision/reproducibility if you want [02:22:59] yeah, might play with it tomorrow [02:23:26] wellll, I think it ends up being api.php?action=visualeditor -> restbase -> parsoid -> shellbox [02:23:49] oh cool, that graph shows generateAudio calls too, so that confirms it wasn't anything after all [02:24:19] (Or at least, was only about 25% of the big requests: https://grafana.wikimedia.org/d/3SiE86Nnz/mediawiki-shellouts?orgId=1&from=1657153468249&to=1657154956813&viewPanel=2) [02:24:20] yeah, I just used a score without audio while testing [02:37:59] 10SRE, 10Shellbox, 10serviceops, 10Sustainability (Incident Followup): Limit Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10RLazarus) [02:38:27] legoktm and others: ^ rough summary of what we talked about, edits welcome [02:38:57] I'll see about checking in with the Editing folks [02:43:20] ack, will look after dinner. ty! [02:43:29] oh, dinner! 
great idea [02:43:50] thanks for all the work <3 [02:44:48] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:20] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:30] PROBLEM - Host thumbor2005 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 3239.70 ms [03:03:46] RECOVERY - Host thumbor2005 is UP: PING OK - Packet loss = 0%, RTA = 35.22 ms [03:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:31] (03PS1) 10KartikMistry: Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811425 (https://phabricator.wikimedia.org/T311411) [04:00:59] (03PS1) 10KartikMistry: Update MT label for Flores [extensions/ContentTranslation] 
(wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811806 (https://phabricator.wikimedia.org/T311411) [04:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:22] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:30] PROBLEM - MegaRAID on db1176 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:43:31] ACKNOWLEDGEMENT - MegaRAID on db1176 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T312321 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:43:36] 10SRE, 10ops-eqiad: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10ops-monitoring-bot) [04:45:20] RECOVERY - Check systemd state on an-launcher1002 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:59] (03CR) 10Santhosh: [C: 03+1] Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811425 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [04:49:18] (03CR) 10Santhosh: [C: 03+1] Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811806 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [04:52:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:36] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:18] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:12] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:42] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 31 hosts with reason: Primary switchover s4 T311611 [05:12:57] T311611: Switchover s4 master - https://phabricator.wikimedia.org/T311611 [05:13:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 31 hosts with reason: Primary switchover s4 T311611 [05:14:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1138 with weight 0 T311611', diff saved to https://phabricator.wikimedia.org/P30933 and previous config saved to /var/cache/conftool/dbconfig/20220707-051406-ladsgroup.json [05:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:03] (03PS2) 10Ladsgroup: Switchover s4 master [puppet] - 10https://gerrit.wikimedia.org/r/810908 (https://phabricator.wikimedia.org/T311611) [05:19:08] (03CR) 10Ladsgroup: [C: 03+2] Switchover s4 master [puppet] - 10https://gerrit.wikimedia.org/r/810908 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [05:20:33] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Marostegui) p:05Triage→03Medium The raid is degraded: ` root@db1176:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name... 
[05:23:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:31] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) @Ladsgroup can you hit db1132 again? ` root@db1132.eqiad.wmnet[(none)]> UPDATE performance_schema.setup_consumers SET... [05:32:14] (03CR) 10Legoktm: [C: 04-1] "Overall looks good, all of these are minor concerns" [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [05:34:43] (03CR) 10Legoktm: [C: 03+1] Support percent-encoded array key syntax (031 comment) [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810552 (owner: 10Ori) [05:35:01] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [05:35:07] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [05:37:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:04] (03PS1) 10Marostegui: mariadb: Productionize db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811804 (https://phabricator.wikimedia.org/T311493) [05:42:05] (03CR) 10Marostegui: [C: 03+2] mariadb: 
Productionize db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811804 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:45:15] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) Died after 50 batches (around 50K queries, 5000 connections) [05:49:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix r123 syntax for special:codereview redirects [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [05:50:14] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) @Cmjohnson db1137 does not need 10G and can be moved Tuesday 15:30 UTC - I will get the host ready for you. [06:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T0600) [06:00:08] o/ [06:00:09] o/ [06:00:13] marostegui: are you ready? 
[06:00:17] Always [06:00:23] coolsies [06:00:25] !log Starting s4 eqiad failover from db1160 to db1138 - T311611 [06:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:28] T311611: Switchover s4 master - https://phabricator.wikimedia.org/T311611 [06:00:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T311611', diff saved to https://phabricator.wikimedia.org/P30935 and previous config saved to /var/cache/conftool/dbconfig/20220707-060037-ladsgroup.json [06:01:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T311611', diff saved to https://phabricator.wikimedia.org/P30936 and previous config saved to /var/cache/conftool/dbconfig/20220707-060112-ladsgroup.json [06:01:22] can edit now [06:01:37] I see recentchanges moving [06:03:32] (03PS2) 10Ladsgroup: Switchover s4 master [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) [06:03:36] (03CR) 10Ladsgroup: [C: 03+2] Switchover s4 master [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [06:03:40] (03CR) 10Slavina Stefanova: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [06:04:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Switchover s4 master [dns] - 10https://gerrit.wikimedia.org/r/810909 (https://phabricator.wikimedia.org/T311611) (owner: 10Ladsgroup) [06:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1160 T311611', diff saved to https://phabricator.wikimedia.org/P30937 and previous config saved to /var/cache/conftool/dbconfig/20220707-060743-ladsgroup.json [06:07:47] T311611: Switchover s4 master - https://phabricator.wikimedia.org/T311611 [06:10:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix r123 syntax for special:codereview redirects (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [06:10:47] Amir1: Remember to edit db1160 weight to give the previous' host weight [06:10:53] Otherwise once we repool it it will have 0 weight XD [06:11:20] we need to add it to the checklist. I forget it every time :D [06:11:40] Amir1: can I reboot db1160 now? [06:12:07] marostegui: sure, but let me know once you're done, I have around ten-fifteen schema changes on the line [06:12:11] sure [06:12:13] doing it now [06:16:05] marostegui: to double check, is "sudo dbctl instance db1160 edit" enough? no commit or etc? [06:16:44] Amir1: I was just sending the patch for that [06:16:44] haha [06:16:47] (no commit no) [06:17:06] (03PS1) 10Marostegui: switchover-tmpl.py: Reminder to set weights [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) [06:17:09] Amir1: ^ XD [06:17:40] (03CR) 10CI reject: [V: 04-1] switchover-tmpl.py: Reminder to set weights [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) (owner: 10Marostegui) [06:18:05] (03CR) 10Ladsgroup: switchover-tmpl.py: Reminder to set weights (031 comment) [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) (owner: 10Marostegui) [06:18:21] (03CR) 10Slavina Stefanova: [C: 03+1] openstack: move known nodes to the openstack lib [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810854 (owner: 10David Caro) [06:19:10] (03PS2) 10Marostegui: switchover-tmpl.py: Reminder to set weights [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) [06:20:34] (03CR) 10Ladsgroup: [C: 03+2] switchover-tmpl.py: Reminder to set weights [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) (owner: 10Marostegui) [06:21:05] Amir1: db1160 ready for you [06:21:10] (03Merged) 10jenkins-bot: switchover-tmpl.py: Reminder to 
set weights [software] - 10https://gerrit.wikimedia.org/r/811826 (https://phabricator.wikimedia.org/T311611) (owner: 10Marostegui) [06:21:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:21:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:21:36] marostegui: awesome [06:22:10] 14 schema changes needed 😭 [06:22:21] hahaha [06:26:23] creating +100 tickets by a script [06:28:42] @Amir1 Can I +2 my wmf backport changes in about 10-12 minutes? Will be helpful to deploy them quickly during backport window. [06:28:53] kart_: sure [06:30:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:48] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:15] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1003.wikimedia.org with OS bullseye [06:31:19] Amir1: Thanks! [06:32:40] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[06:33:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:54] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8806.service,thumbor@8809.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:57] Doing +2 for wmf.18/wmf.19 patches to be deployed in upcoming backport window.. [06:41:25] (03CR) 10KartikMistry: [C: 03+2] Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811425 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [06:41:31] (03CR) 10KartikMistry: [C: 03+2] Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811806 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [06:42:48] (03PS2) 10Giuseppe Lavagetto: mediawiki: fix r123 syntax for special:codereview redirects [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [06:43:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:47:32] (03PS1) 10Marostegui: site.pp: Remove insetup role from db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811836 (https://phabricator.wikimedia.org/T311493) [06:48:45] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811836 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [06:49:20] (03PS6) 10Ori: Initial Debian packaging [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810551 
(https://phabricator.wikimedia.org/T138093) [06:49:40] (03CR) 10Ori: Initial Debian packaging (038 comments) [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [06:56:25] !log dbmaint s7@eqiad T312288 [06:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:30] T312288: Drop unique index thread_root_2 from thread table on wmf wikis - https://phabricator.wikimedia.org/T312288 [06:57:14] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:17] (03Merged) 10jenkins-bot: Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811425 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [06:58:20] (03Merged) 10jenkins-bot: Update MT label for Flores [extensions/ContentTranslation] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811806 (https://phabricator.wikimedia.org/T311411) (owner: 10KartikMistry) [06:58:28] hmmm! [06:58:36] someone self-merging ahead of the window :-D [06:59:02] fortunately we have no trainees signed up :-P [07:00:03] !log dbmaint s2@eqiad T312288 [07:00:03] (03PS6) 10Ori: Support percent-encoded array key syntax [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810552 [07:00:04] Amir1, apergos, and jnuche: That opportune time is upon us again. Time for a UTC morning backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:05] apergos: Yes. That's me. 
[07:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:11] I knew it was :-P [07:00:20] I'll do self-deploy too :) [07:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:21] (03CR) 10Ori: Support percent-encoded array key syntax (031 comment) [software/varnish/libvmod-querysort] - 10https://gerrit.wikimedia.org/r/810552 (owner: 10Ori) [07:00:22] so: 2 patches in the window, no trainees, kart_ I assume you [07:00:27] yep. self-deploy :-D [07:00:30] Yes. [07:00:47] take it away, kart_ ! [07:00:59] 👍 [07:01:10] (hi there) [07:02:14] hello hello! [07:02:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:02:56] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:16] !log dbmaint s6@eqiad T312288 [07:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:20] T312288: Drop unique index thread_root_2 from thread table on wmf wikis - https://phabricator.wikimedia.org/T312288 [07:03:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:03:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:03:47] Deploying wmf.18 patch [07:04:16] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:04:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:05:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 
10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10ayounsi) > fixed up lvs[4005-4007].ulsfo.wmnet For context: T311290 The issue is twofold: 1/ the LVS hosts use SLAAC IPs on their... [07:06:16] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10MoritzMuehlenhoff) @Cmjohnson Newly procured Ganeti servers use 10G, but ganeti1020 still has 1G only. I'll get it ready by Tuesday. [07:07:07] (03PS2) 10JMeybohm: Use the generic services_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 [07:07:09] (03PS2) 10JMeybohm: Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 [07:07:11] !log dbmaint s3@eqiad T312288 [07:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:34] !log drain ganeti1020 T308331 [07:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:37] T308331: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 [07:07:44] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/ContentTranslation/modules/mw.cx.MachineTranslationManager.js: Backport: [[gerrit:811425|Update MT label for Flores (T311411)]] (duration: 03m 41s) [07:07:47] T311411: Update label for Flores - https://phabricator.wikimedia.org/T311411 [07:07:50] 10SRE, 10Image-Suggestions, 10Patch-For-Review: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10JMeybohm) a:03JMeybohm [07:08:07] OK. Deploying wmf.19 patch now.. 
[07:09:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:33] (03CR) 10JMeybohm: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/811699 (owner: 10Muehlenhoff) [07:10:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:11:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:13:36] (03PS1) 10Marostegui: db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811872 [07:13:51] Amir1: ^ going to merge that [07:14:04] awesome [07:14:06] thanks [07:14:20] (03CR) 10Marostegui: [C: 03+2] db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811872 (owner: 10Marostegui) [07:14:42] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/ContentTranslation/modules/mw.cx.MachineTranslationManager.js: Backport: [[gerrit:811806|Update MT label for Flores (T311411)]] (duration: 03m 20s) [07:14:46] T311411: Update label for Flores - https://phabricator.wikimedia.org/T311411 [07:15:32] PROBLEM - MariaDB Replica SQL: s3 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table liquidthreads_labswikimedia.thread doesnt exist on query. Default database: liquidthreads_labswikimedia. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:15:34] PROBLEM - MariaDB Replica SQL: s3 on db2094 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table liquidthreads_labswikimedia.thread doesnt exist on query. Default database: liquidthreads_labswikimedia. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:15:41] ^ that's me [07:15:42] fixing [07:16:15] I'm done with my changes/deployment <-- apergos [07:16:24] tests all look good, kart_ ? [07:16:36] apergos: all good! :) [07:16:40] sweet! [07:16:57] I'll hang about for another 10 mins or so in case any other self deployer wants to sneak something in [07:17:01] after that I'll be gone [07:17:58] RECOVERY - MariaDB Replica SQL: s3 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:18:02] RECOVERY - MariaDB Replica SQL: s3 on db2094 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:19:35] !log dbmaint s2@eqiad T312287 [07:19:39] !log dbmaint s7@eqiad T312287 [07:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:40] T312287: Remove default of empty value from thread.thread_modified/thread_created on wmf wikis - https://phabricator.wikimedia.org/T312287 [07:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:02] !log dbmaint s6@eqiad T312287 [07:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:15] !log dbmaint s3@eqiad T312287 [07:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:45] welp that's the 10 minute wait. no takers, so [07:27:58] !log UTC morning backport and config training window closed [07:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:07] see everyone next time! 
[07:28:43] !log dbmaint s6@eqiad T312286 [07:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:46] T312286: Adjust the field type of thread_history.th_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312286 [07:29:07] !log dbmaint s2@eqiad T312286 [07:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:21] !log dbmaint s7@eqiad T312286 [07:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:13] !log dbmaint s3@eqiad T312286 [07:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:53] (03PS1) 10Muehlenhoff: Remove access for aniketars [puppet] - 10https://gerrit.wikimedia.org/r/811874 [07:38:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for aniketars [puppet] - 10https://gerrit.wikimedia.org/r/811874 (owner: 10Muehlenhoff) [07:40:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:40:34] (03CR) 10JMeybohm: [C: 03+2] service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm) [07:42:30] (03PS1) 10Tim Starling: Add ucfirst overrides for PHP 7.2 -> 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) [07:43:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:45:58] (03PS2) 10Tim Starling: Add ucfirst overrides for the PHP 7.4 migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811875 (https://phabricator.wikimedia.org/T271736) [07:46:38] (03CR) 10Ayounsi: [C: 03+1] "That's great!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [07:48:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10Volans) >>! In T271144#8057351, @BCornwall wrote: > Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook.... [07:53:21] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:53:41] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10akosiaris) @Cmjohnson Could you please add some more information why mgmt flapping will be an ongoing issue ? Also, I see this is tagged #ops-eq... [07:55:03] (03PS1) 10Muehlenhoff: Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/811877 [07:59:10] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/811877 (owner: 10Muehlenhoff) [08:00:04] jnuche and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T0800). Please do the needful. 
[08:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:49] (03PS1) 10Jaime Nuche: all wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811880 (https://phabricator.wikimedia.org/T308072) [08:02:51] (03CR) 10Jaime Nuche: [C: 03+2] all wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811880 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [08:03:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix r123 syntax for special:codereview redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774943 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [08:03:49] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811880 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [08:06:02] (03CR) 10Muehlenhoff: [C: 03+2] profile::rsyslog::kubernetes: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/811699 (owner: 10Muehlenhoff) [08:07:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:15] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.19 refs T308072 [08:08:20] T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072 [08:09:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:10:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:10:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:11:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::sites: fix special chars escaping [puppet] - 
10https://gerrit.wikimedia.org/r/811881 [08:11:16] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::sites: fix special chars escaping [puppet] - 10https://gerrit.wikimedia.org/r/811881 [08:11:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:11:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::web::sites: fix special chars escaping [puppet] - 10https://gerrit.wikimedia.org/r/811881 (owner: 10Giuseppe Lavagetto) [08:17:01] (03PS1) 10Jelto: gitlab: fix IPs, hostname and regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/811882 [08:17:03] (03CR) 10Alexandros Kosiaris: "@Giuseppe, quick q: I think I should split this in 2 patches, 1 for the server records (_etcd-server-ssl._tcp.v3) that gets merged right b" [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [08:19:10] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Volans) Some fresh stats (May, June, July): `lang=shell $ grep -o '.*SSH.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-0{5,6,7}.log | grep PRO... 
[08:20:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] gitlab: fix IPs, hostname and regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/811882 (owner: 10Jelto) [08:23:05] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:17] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:28:03] (03CR) 10Giuseppe Lavagetto: Add conf100[789] in DNS SRV records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [08:35:13] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:45] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36210/console" [puppet] - 10https://gerrit.wikimedia.org/r/811882 (owner: 10Jelto) [08:42:32] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: fix IPs, hostname and regex for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/811882 (owner: 10Jelto) [08:43:04] (03PS1) 10Marostegui: mariadb: Decommission db2074 [puppet] - 10https://gerrit.wikimedia.org/r/811883 (https://phabricator.wikimedia.org/T311990) [08:43:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2074.codfw.wmnet [08:43:46] (03PS1) 10Jcrespo: dbprov: Update backup source for misc from db2078 to db2160 [puppet] - 10https://gerrit.wikimedia.org/r/811884 
(https://phabricator.wikimedia.org/T311493) [08:44:39] (03PS2) 10Alexandros Kosiaris: Add conf100[789] in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407) [08:44:41] (03PS1) 10Alexandros Kosiaris: Add client side conf100[789] in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/811885 (https://phabricator.wikimedia.org/T311407) [08:47:13] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [08:47:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add conf100[789] in DNS SRV records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [08:48:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2074 [puppet] - 10https://gerrit.wikimedia.org/r/811883 (https://phabricator.wikimedia.org/T311990) (owner: 10Marostegui) [08:48:46] (03CR) 10Elukey: [C: 03+1] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1160.eqiad.wmnet with reason: Maintenance [08:49:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1160.eqiad.wmnet with reason: Maintenance [08:51:02] 10ops-codfw, 10decommission-hardware: decommission db2074 - https://phabricator.wikimedia.org/T311990 (10Marostegui) a:03Papaul [08:51:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:11] 10ops-codfw, 10decommission-hardware: decommission db2074 - https://phabricator.wikimedia.org/T311990 (10Marostegui) @Papaul this is ready for you [08:52:20] (03PS1) 10Marostegui: db2076,db2077: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811909 (https://phabricator.wikimedia.org/T311475) [08:52:41] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts db2074.codfw.wmnet [08:52:43] 10ops-codfw, 10decommission-hardware: decommission db2074 - https://phabricator.wikimedia.org/T311990 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2074.codfw.wmnet` - db2074.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found phys... [08:53:05] (03CR) 10Marostegui: [C: 03+2] db2076,db2077: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/811909 (https://phabricator.wikimedia.org/T311475) (owner: 10Marostegui) [08:53:20] (03PS1) 10Urbanecm: Declare mediawiki.editgrowthconfig schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811910 (https://phabricator.wikimedia.org/T312148) [08:53:49] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (10Marostegui) [08:53:53] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (10Marostegui) [08:54:47] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:57:14] (03PS1) 10Jelto: wikimedia.org: remove gitlab-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/811912 (https://phabricator.wikimedia.org/T307142) [08:58:55] (03CR) 10Cathal Mooney: [C: 03+2] Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [08:59:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10Volans) Just for completeness, and to use the same wording I'm using for other tasks. Some clusters managed by the Traffic team have in... 
[08:59:57] (03Merged) 10jenkins-bot: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [09:02:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbstore1003.eqiad.wmnet [09:03:09] (03PS1) 10Marostegui: instances.yaml: Add db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811913 (https://phabricator.wikimedia.org/T311493) [09:04:08] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2161 [puppet] - 10https://gerrit.wikimedia.org/r/811913 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [09:05:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:06:00] (03PS1) 10Vgutierrez: haproxy: Log backend saturation detection [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) [09:06:43] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2074', diff saved to https://phabricator.wikimedia.org/P30938 and previous config saved to /var/cache/conftool/dbconfig/20220707-090700-marostegui.json [09:09:36] (03PS2) 10Vgutierrez: haproxy: Log backend saturation detection [puppet] - 10https://gerrit.wikimedia.org/r/811914 (https://phabricator.wikimedia.org/T306580) [09:09:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: Switch disk type back to plain [09:10:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: 
Switch disk type back to plain [09:10:49] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:11:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dbstore1003.eqiad.wmnet [09:13:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) @cmjohnson Hey. Drop me a line on this one perhaps. The issue is that the cloudnet assigned IPs do not seem to match the Vlans they ha... [09:14:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbstore1005.eqiad.wmnet [09:17:06] (03CR) 10Klausman: [C: 03+2] ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:17:11] (03PS13) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [09:17:41] !log draining ganeti2009 T311686 [09:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:45] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [09:21:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dbstore1005.eqiad.wmnet [09:22:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dbstore1007.eqiad.wmnet [09:22:20] (03PS1) 10Giuseppe Lavagetto: php::metrics: do not repeat comments on metrics in statustext [puppet] - 10https://gerrit.wikimedia.org/r/811918 [09:22:56] (03CR) 10CI reject: [V: 04-1] php::metrics: do not repeat comments on metrics in statustext [puppet] - 10https://gerrit.wikimedia.org/r/811918 (owner: 10Giuseppe Lavagetto) 
[09:24:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2161 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P30940 and previous config saved to /var/cache/conftool/dbconfig/20220707-092424-marostegui.json [09:24:27] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [09:24:43] (03PS2) 10Giuseppe Lavagetto: php::metrics: do not repeat comments on metrics in statustext [puppet] - 10https://gerrit.wikimedia.org/r/811918 [09:26:23] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:27:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php::metrics: do not repeat comments on metrics in statustext [puppet] - 10https://gerrit.wikimedia.org/r/811918 (owner: 10Giuseppe Lavagetto) [09:29:59] (03PS4) 10JMeybohm: Alert on helm releases in bad state [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) [09:30:01] (03PS1) 10JMeybohm: Split phpfpm_statustext_processes by php version [alerts] - 10https://gerrit.wikimedia.org/r/811920 [09:30:27] (03CR) 10JMeybohm: [C: 03+2] Alert on helm releases in bad state [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:31:02] * marostegui !log dbmaint s6@eqiad T312285 [09:31:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Split phpfpm_statustext_processes by php version [alerts] - 10https://gerrit.wikimedia.org/r/811920 (owner: 10JMeybohm) [09:31:08] !log dbmaint s6@eqiad T312285 [09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:11] T312285: Adjust the field type of user_message_state.ums_read_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312285 [09:32:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dbstore1007.eqiad.wmnet [09:33:12] 
!log dbmaint s2@eqiad T312285 [09:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:18] !log dbmaint s7@eqiad T312285 [09:33:20] !log dbmaint s3@eqiad T312285 [09:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:36] (03Merged) 10jenkins-bot: Alert on helm releases in bad state [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [09:33:57] (03CR) 10JMeybohm: [C: 03+2] Split phpfpm_statustext_processes by php version [alerts] - 10https://gerrit.wikimedia.org/r/811920 (owner: 10JMeybohm) [09:35:14] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:36:12] (03Merged) 10jenkins-bot: Split phpfpm_statustext_processes by php version [alerts] - 10https://gerrit.wikimedia.org/r/811920 (owner: 10JMeybohm) [09:37:24] (03CR) 10Kosta Harlan: [C: 03+1] Declare mediawiki.editgrowthconfig schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811910 (https://phabricator.wikimedia.org/T312148) (owner: 10Urbanecm) [09:37:48] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:38:38] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:38:45] (03CR) 10Urbanecm: [C: 03+2] Declare mediawiki.editgrowthconfig schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811910 (https://phabricator.wikimedia.org/T312148) (owner: 10Urbanecm) [09:39:33] (03Merged) 10jenkins-bot: Declare mediawiki.editgrowthconfig schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811910 (https://phabricator.wikimedia.org/T312148) (owner: 10Urbanecm) [09:39:59] 10SRE, 10Image-Suggestions, 10Patch-For-Review: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10kostajh) >>! In T312225#8060471, @gerritbot wrote: > Change 811733 **merged** by JMeybohm: > %%%[operations/puppet@production] service_proxy: Set SNI and Host header... [09:40:14] urbanecm: if you're syncing config patches now, maybe we could do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808208 ? [09:40:51] kostajh: do we have the SRE +1 that was mentioned in the -sre discussion the other day? [09:41:11] urbanecm: I just pinged them on phab, but could ping here if we're able to sync patches now [09:42:26] kostajh: sure, let's ask them. at the very least, `curl http://localhost:6030/public/image_suggestions/suggestions/cswiki/344465` works now, but let's get the +1 anyway. 
[09:42:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:43:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:43:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:43:39] urbanecm: I'm asking over in -sre [09:43:44] thanks [09:43:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8599f395bd3af2b27aa06cdc318d44e97efc8119: Declare mediawiki.editgrowthconfig schema (T312148) (duration: 03m 37s) [09:43:51] T312148: Add instrumentation to Special:EditGrowthConfig - https://phabricator.wikimedia.org/T312148 [09:44:27] !log installing 5.10.120-1~bpo10+1 kernels on buster hosts running Linux 5.10 [09:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:45:01] (03PS2) 10JMeybohm: service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225) [09:45:03] (03PS3) 10JMeybohm: Use the generic services_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751 [09:45:05] (03PS3) 10JMeybohm: Remove the need for charts to define services_proxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 [09:45:07] (03CR) 10Jelto: blackbox-check-http: only add hash page if severity=page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [09:46:32] (03PS3) 10RhinosF1: blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 [09:47:20] (03PS4) 10RhinosF1: blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 [09:47:44] (03PS5) 10RhinosF1: blackbox-check-http: only add hash page if severity=page [puppet] - 
10https://gerrit.wikimedia.org/r/811891 [09:47:46] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [09:47:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [09:51:25] (03CR) 10Jelto: [C: 03+1] "lgtm, but let's get a review from Filippo too." [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [09:52:01] (03CR) 10CI reject: [V: 04-1] blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [09:53:31] (03CR) 10Filippo Giunchedi: "Is this actually needed? For non-tls probes the metric won't be there anyways" [puppet] - 10https://gerrit.wikimedia.org/r/811715 (owner: 10Majavah) [09:53:37] (03PS6) 10RhinosF1: blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 [09:54:16] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 (owner: 10Majavah) [09:54:23] (03PS3) 10Filippo Giunchedi: prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 (owner: 10Majavah) [09:54:58] urbanecm: we have SRE's blessing [09:55:04] (03CR) 10Majavah: prometheus: blackbox: don't deploy tls alerts when tls is disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811715 (owner: 10Majavah) [09:55:06] 10SRE-swift-storage, 10User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) [09:55:24] (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:55:27] let's go for it then :) [09:55:29] (03PS16) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [09:55:46] urbanecm: thanks, will you deploy? I can verify [09:55:49] yup yup [09:56:23] (03CR) 10Urbanecm: [C: 03+2] "Joe and jayme gave their +1 in -sre at IRC, let's go ahead." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:57:14] (03Merged) 10jenkins-bot: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [09:57:39] (HelmReleaseBadStatus) firing: Helm release helm-state-metrics/main on k8s@codfw in state - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:57:41] kostajh: pulled to mwdebug1001, can you check please? [09:57:58] urbanecm: yep [09:59:33] urbanecm: hmm, first article has no suggestions found, though calling the API via the CLI does find suggestions. [09:59:56] :/ [10:00:03] second one worked though :) [10:00:04] kostajh: MW logs say `[df7c8904-0c7f-48c3-aaab-03df3d947754] /w/index.php?title=Arabsk%C3%A1_hudba&getasktype=image-recommendation&gesuggestededit=1&geclickid=ulh3kh9nh3f4rav3kngrp97ok2d2bb1l&genewcomertasktoken=32ki8ubce0o9hrj16vgrvemcn5mia9l0&veaction=edit&section=all Exception: Invalid source type for Arabská_hudba: `. can that be related? [10:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1000). 
[10:00:14] urbanecm: yeah that's the one I was checking [10:01:02] urbanecm: yeah, looks like the first suggestion is a .ogg file 08_5Mala_Alkasaat.ogg [10:01:14] oh, makes sense it doesn't work [10:01:25] should we deploy and deal with ogg files later? or revert? [10:01:42] I think we can deal with the ogg files later, but let me spot check to get a sense of frequency [10:01:47] sure [10:02:39] (HelmReleaseBadStatus) firing: (3) Helm release helm-state-metrics/main on k8s@codfw in state - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:02:55] urbanecm: eh, error rate is 3 out 6. let's fix the application code that is sorting and filtering candidate results, and try to deploy this later. [10:03:11] okay, reverting. there are also other errors in logstash, not sure if they're new [10:03:23] like `Deferred update 'MWCallableUpdate_GrowthExperiments\NewcomerTasks\TaskSetListener->run' failed to run.` [10:03:29] (and timeouts) [10:03:47] what timeout do you see? [10:04:27] [637cd55a-d12b-4426-88d0-7d27158f90f2] /w/index.php?title=Speci%C3%A1ln%C3%AD:Domovsk%C3%A1_str%C3%A1nka&source=personaltoolslink&namespace=0 Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 60 seconds was exceeded [10:04:37] a standard MW one. that the homepage took 60+ seconds to load. [10:04:44] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:04:48] (03PS1) 10Urbanecm: Revert "GrowthExperiments: Set GEImageRecommendationApiHandler" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811904 (https://phabricator.wikimedia.org/T306032) [10:04:54] hmm. 
[10:05:00] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "revert" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811904 (https://phabricator.wikimedia.org/T306032) (owner: 10Urbanecm) [10:05:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:05:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:06:09] kostajh: reverted, and removed from the mwdebug host. so, i guess we're done for now :) [10:06:37] yep [10:06:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:06:51] (03PS1) 10Marostegui: mariadb: Productionize db2165 [puppet] - 10https://gerrit.wikimedia.org/r/811926 (https://phabricator.wikimedia.org/T311493) [10:07:42] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2165 [puppet] - 10https://gerrit.wikimedia.org/r/811926 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [10:07:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) @Cmjohnson just an update here. I left cloudnet1005 alone, so we can piece back why the switch ports ended up on the wrong vlans (I unf... [10:08:26] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [10:08:59] (03CR) 10Jelto: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [10:09:01] urbanecm: though, I think I have an easy fix for this [10:09:03] (03CR) 10CI reject: [V: 04-1] blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [10:09:13] yeah? [10:10:58] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/811928 completely untested, so I guess it depends on your appetite to test in production. 
Otherwise I can verify this locally later today and we can try again another time. [10:11:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:11:59] kostajh: i have to leave in a couple of minutes, which is not enough time for the change, so let's try again later today? [10:12:07] (or after inspiration week i guess) [10:12:15] ok [10:12:19] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: blackbox: don't deploy tls alerts when tls is disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811715 (owner: 10Majavah) [10:12:25] (03PS2) 10Filippo Giunchedi: prometheus: blackbox: don't deploy tls alerts when tls is disabled [puppet] - 10https://gerrit.wikimedia.org/r/811715 (owner: 10Majavah) [10:12:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:12:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:13:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:15:56] (03CR) 10Abijeet Patro: "This change is ready for review." 
[extensions/Translate] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811929 (https://phabricator.wikimedia.org/T312293) (owner: 10Abijeet Patro) [10:20:09] (03CR) 10Filippo Giunchedi: [C: 03+2] P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 (owner: 10Majavah) [10:20:22] (03CR) 10Filippo Giunchedi: [C: 03+2] P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719 (owner: 10Majavah) [10:21:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "+1'd, not going to merge as I don't know if this is dependent on parents" [puppet] - 10https://gerrit.wikimedia.org/r/811719 (owner: 10Majavah) [10:23:26] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:26:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @cmooney the cloudnet servers were manually moved in netbox, so I don't know if the script would've picked up the vlan change. I find... 
[10:27:27] (03PS1) 10Filippo Giunchedi: thanos: split retention times based on resolution [puppet] - 10https://gerrit.wikimedia.org/r/811932 (https://phabricator.wikimedia.org/T311690) [10:27:29] (03PS1) 10Filippo Giunchedi: thanos: trim raw samples retention to 54 weeks [puppet] - 10https://gerrit.wikimedia.org/r/811933 (https://phabricator.wikimedia.org/T311690) [10:29:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36213/console" [puppet] - 10https://gerrit.wikimedia.org/r/811932 (https://phabricator.wikimedia.org/T311690) (owner: 10Filippo Giunchedi) [10:29:21] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36214/console" [puppet] - 10https://gerrit.wikimedia.org/r/811933 (https://phabricator.wikimedia.org/T311690) (owner: 10Filippo Giunchedi) [10:31:19] (03PS7) 10Filippo Giunchedi: blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [10:32:03] (03CR) 10CI reject: [V: 04-1] blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [10:32:20] !log draining ganeti2010 T311686 [10:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:33:49] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! In T307184#8017418, @Jgiannelos wrote: > Just a quick correction on the numbers: the current production container... 
[10:36:20] (03PS1) 10Giuseppe Lavagetto: mtail: fix regexes due to changes in apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/811934 [10:39:21] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Cmjohnson) @marostegui The server is under warranty, submitted a ticket to Dell for a replacement. Should be here tomorrow. You have successfully submitted request SR146038043. [10:40:21] (03CR) 10CI reject: [V: 04-1] mtail: fix regexes due to changes in apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/811934 (owner: 10Giuseppe Lavagetto) [10:40:41] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Cmjohnson) @Eevans Let me know when I am able to move these servers for you. [10:40:53] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Marostegui) Thanks! [10:40:55] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10jcrespo) Thank you, Chris! [10:41:08] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (10fgiunchedi) [10:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [10:48:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [10:49:11] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [10:49:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) @Cmjohnson ok. 
Is it possible that when you moved them you selected the wrong Vlan? If the script is assigning IPs from one Vlan, but... [10:50:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:52] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [10:57:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [10:59:02] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [10:59:44] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [11:00:27] !log installing intel-microcode security updates [11:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:04:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:04:25] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) I think a good estimate is this graph: https://grafana.wikimedia.org/goto/LgX0j2e7k?orgId=1 This is the rate of new... 
[11:06:03] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:08:44] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) Also just a heads up, current production is using `bucket: tegola-swift-v001` in case you want to cleanup the old co... [11:14:35] (03PS7) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [11:22:59] (03PS1) 10JMeybohm: Fix label in HelmReleaseBadStatus [alerts] - 10https://gerrit.wikimedia.org/r/811967 (https://phabricator.wikimedia.org/T310714) [11:24:41] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:27:56] (03PS2) 10JMeybohm: Fix labels in HelmReleaseBadStatus [alerts] - 10https://gerrit.wikimedia.org/r/811967 (https://phabricator.wikimedia.org/T310714) [11:29:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Configure wbsearchentities profile parameter on Test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [11:31:01] (03CR) 10JMeybohm: [C: 03+2] Fix labels in HelmReleaseBadStatus [alerts] - 10https://gerrit.wikimedia.org/r/811967 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [11:33:26] (03Merged) 10jenkins-bot: Fix labels in HelmReleaseBadStatus [alerts] - 10https://gerrit.wikimedia.org/r/811967 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [11:34:48] (03PS1) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [11:38:20] (03CR) 10CI 
reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [11:40:15] !log rolling back helm release tegola-vector-tiles/main to revision 11 on staging-eqiad because it's pending-upgrade since Mon Jun 27 09:45:56 2022 [11:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:37] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [11:42:06] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [11:42:39] (HelmReleaseBadStatus) firing: (3) Helm release eventstreams-internal/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:42:54] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [11:45:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:37] (03PS2) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [11:49:58] !log rolling back helm release eventstreams-internal/main to revision 3 on eqiad and codfw clusters because it's pending-upgrade since Mon Mar 21 21:36:56 2022 / Mon Mar 21 16:05:54 2022 [11:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:39] (HelmReleaseBadStatus) resolved: (2) Helm release eventstreams-internal/main on 
k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:53:01] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [11:53:52] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [11:59:33] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:39] (03PS3) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:06:58] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [12:13:16] (03PS4) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:17:39] (03CR) 10CI reject: [V: 04-1] Add a cookbook to change the storage type of a Ganeti VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [12:22:02] !log draining ganeti2015 T311686 [12:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:07] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [12:23:05] (03PS5) 10Muehlenhoff: Add a cookbook to change the storage type of a Ganeti 
VM [cookbooks] - 10https://gerrit.wikimedia.org/r/811970 (https://phabricator.wikimedia.org/T312116) [12:26:05] PROBLEM - ganeti-confd running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:27:11] PROBLEM - ganeti-noded running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:27:29] PROBLEM - ganeti-mond running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [12:28:45] ^ expected, forgot to add downtime in time [12:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [12:34:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:34:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:34:40] (03PS1) 10Jgiannelos: tegola: Add postgres upstreams for staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/811980 (https://phabricator.wikimedia.org/T312533) [12:37:08] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) >>! 
In T307184#8061049, @Jgiannelos wrote: > I think a good estimate is this graph: > https://grafana.wikimedia.org/g... [12:37:22] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [12:37:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed... [12:37:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [12:41:03] (03PS1) 10Kosta Harlan: ServiceImageRecommendationProvider: Don't fail on first validation error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811953 (https://phabricator.wikimedia.org/T312521) [12:42:52] (03PS1) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811954 (https://phabricator.wikimedia.org/T306032) [12:44:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS bullseye [12:44:08] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS bullseye [12:44:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:44:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:45:45] (03PS1) 10JMeybohm: k8s: Retry pod evictions on HTTP 429 from API server [software/spicerack] - 10https://gerrit.wikimedia.org/r/811983 
(https://phabricator.wikimedia.org/T260661) [12:46:06] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811953 (https://phabricator.wikimedia.org/T312521) (owner: 10Kosta Harlan) [12:52:06] (03CR) 10Ssingh: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/811780 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall) [12:52:15] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) > Thank you for the dashboard link! What do you mean "per 5mins" here? I'm asking because the graph linked shows a pe... [12:53:27] (03PS1) 10Jelto: blackbox-check-http: escape prometheus variables [puppet] - 10https://gerrit.wikimedia.org/r/811984 [12:53:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:53:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:56:27] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) Do we have an upper limit where the object count is going to end up being problematic in our container? 
[12:57:08] (03PS1) 10Majavah: P:vrts: fix probe port [puppet] - 10https://gerrit.wikimedia.org/r/811985 [12:58:18] (03Abandoned) 10Muehlenhoff: Add a helper function to query the disk type of a VM [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff) [13:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1300) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1300). [13:00:04] abijeet and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:12] hello! [13:00:14] hi [13:00:14] i can deploy today [13:00:18] thanks urbanecm [13:00:20] ok! [13:00:34] I started the merge process for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/811953 already [13:00:43] (03CR) 10Urbanecm: [C: 03+2] Translation unit deletion: Skip translation update if it doesn't exist [extensions/Translate] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811929 (https://phabricator.wikimedia.org/T312293) (owner: 10Abijeet Patro) [13:00:54] kostajh: ah, excellent. thanks. 
[13:00:59] abijeet: ^ [13:01:11] hello [13:01:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2010.codfw.wmnet with reason: host reimage [13:01:14] hi abijeet [13:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:03:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:04:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2010.codfw.wmnet with reason: host reimage [13:05:16] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Set GEImageRecommendationApiHandler (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811954 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:06:08] (03Merged) 10jenkins-bot: GrowthExperiments: Set GEImageRecommendationApiHandler (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811954 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [13:06:40] waiting for CI for the backport, as it's needed for the config change to work [13:07:58] (03Merged) 10jenkins-bot: ServiceImageRecommendationProvider: Don't fail on first validation error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811953 (https://phabricator.wikimedia.org/T312521) (owner: 10Kosta Harlan) [13:08:04] kostajh: do we also want to backport to wmf.18, just in case train rolls back? doesn't look like it's going to happen, but perhaps wise doing, to avoid the feature breaking? [13:08:21] urbanecm: seems unlikely, but yeah that's probably wise [13:08:40] kostajh: okay. 
pulled to mwdebug1001, please test [13:08:45] * urbanecm goes to upload the wmf.18 backport [13:08:46] (03PS1) 10Ottomata: Add support for airflow filesystem backend variables [puppet] - 10https://gerrit.wikimedia.org/r/811986 [13:08:50] urbanecm: ok looking [13:09:18] (03PS1) 10Urbanecm: ServiceImageRecommendationProvider: Don't fail on first validation error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811956 (https://phabricator.wikimedia.org/T312521) [13:09:23] (03CR) 10CI reject: [V: 04-1] Add support for airflow filesystem backend variables [puppet] - 10https://gerrit.wikimedia.org/r/811986 (owner: 10Ottomata) [13:09:25] (03CR) 10Urbanecm: [C: 03+2] ServiceImageRecommendationProvider: Don't fail on first validation error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811956 (https://phabricator.wikimedia.org/T312521) (owner: 10Urbanecm) [13:10:23] (03CR) 10Muehlenhoff: [C: 03+2] bigtop::hadoop: All hosts use the new GID/UID scheme by now [puppet] - 10https://gerrit.wikimedia.org/r/811680 (owner: 10Muehlenhoff) [13:11:19] urbanecm: arwiki has a few "no recommendation found" errors, but that seems to happen with or without this patch [13:11:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:11:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:11:43] I suspect it's due to the search index being updated with the new API, but we are using the old API in production [13:11:55] i see [13:12:02] urbanecm: anyway, the patch lgtm [13:12:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: 
apply [13:13:07] kostajh: okay. i see the timeouts and `Deferred update 'MWCallableUpdate_GrowthExperiments\NewcomerTasks\TaskSetListener->run' failed to run.` again. not sure what the deferred update is for. doesn't appear to show up in the growth-team logstash dashboard. given you say it works, perhaps we should sync and check the errors after? [13:13:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:46] urbanecm: the deferred update failure is because we cache calls to the image suggestion API, and processing of metadata. In this case, if it can't find the recommendation, the curl request just hangs. We should try to fix that, but it's not directly related [13:13:55] (03CR) 10Jelto: P:vrts: fix probe port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [13:14:05] ah, got it. yeah, that's fine to leave later IMO. [13:14:54] so, let's sync? i plan to sync ServiceImageRecommendationProvider.php first, then wmf-config/ProductionServices.php and then wmf-config/. does that look like a good sync order kostajh? [13:15:00] (03PS1) 10Marostegui: instances.yaml: Add db2165 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811988 (https://phabricator.wikimedia.org/T311493) [13:15:31] urbanecm: yes, I think so [13:15:39] doing. [13:15:47] (03PS1) 10Klausman: hiera/ores: Switch uwsgi to sending JSON logs to IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811989 [13:15:49] (03PS1) 10Joal: Fix profile::analytics::refinery::job::test::refine [puppet] - 10https://gerrit.wikimedia.org/r/811990 [13:16:15] urbanecm: you're going to put the config patch up on mwdebug first? [13:16:26] kostajh: i did pull both patches there [13:16:33] sorry if that wasn't clear [13:16:35] urbanecm: oh, I didn't realize that. 
[13:16:52] urbanecm: hmm, give me another minute then [13:16:55] sure [13:17:28] let me know if you want me to remove the config patch from mwdebug [13:17:35] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2165 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811988 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [13:17:50] I think it's fine, I don't think that is going to cause any issues between the mvp/production APIs. [13:18:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36216/console" [puppet] - 10https://gerrit.wikimedia.org/r/811989 (owner: 10Klausman) [13:18:07] I mean: I don't think the wmf.19 patch depends on the config patch in any way [13:18:18] okay [13:18:18] (03Merged) 10jenkins-bot: Translation unit deletion: Skip translation update if it doesn't exist [extensions/Translate] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811929 (https://phabricator.wikimedia.org/T312293) (owner: 10Abijeet Patro) [13:18:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:18:38] (03CR) 10Elukey: [V: 03+1 C: 03+1] hiera/ores: Switch uwsgi to sending JSON logs to IP instead of name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811989 (owner: 10Klausman) [13:18:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2165 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P30948 and previous config saved to /var/cache/conftool/dbconfig/20220707-131852-marostegui.json [13:18:57] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [13:19:20] kostajh: sorry, was that to say "let's sync"? 
[13:19:29] !log roll restart eventgate-main pods to add a new stream - T301878 [13:19:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:34] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [13:19:34] T301878: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 [13:19:36] urbanecm: I think it looks good, let's do it. [13:19:44] okay, syncing [13:19:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:19:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:19:55] abijeet: fetched your patch to mwdebug1002, can you test it please? 
[13:20:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:20:01] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:20:02] urbanecm, yes, thanks [13:20:13] (03PS8) 10RhinosF1: blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 [13:20:17] (03PS2) 10Klausman: ores: Switch uwsgi to sending JSON logs to IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811989 [13:20:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:43] (03PS3) 10Klausman: ores: Switch uwsgi to sending JSON logs to IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811989 [13:20:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Assign conf100[789] roles and add them to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/811729 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [13:20:52] (03PS2) 10Alexandros Kosiaris: Assign conf100[789] roles and add them to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/811729 (https://phabricator.wikimedia.org/T311407) [13:20:56] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [13:21:01] (03CR) 10Klausman: ores: Switch uwsgi to sending JSON logs to IP instead of name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811989 (owner: 10Klausman) [13:21:17] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [13:21:28] (03PS4) 10Klausman: ores: Switch uwsgi to sending JSON logs to IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811989 [13:21:39] (03CR) 10Klausman: [C: 03+2] ores: Switch uwsgi to sending JSON logs to IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811989 (owner: 10Klausman) [13:21:43] !log jmm@cumin2002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host ganeti2010.codfw.wmnet with OS bullseye [13:21:50] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS bullseye completed: - ganeti2010 (**FAIL**) - Downtimed on... [13:21:52] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS bullseye executed with errors: - ganeti2010 (**FAIL**) - D... [13:22:01] (03CR) 10Marostegui: [C: 03+2] dbprov: Update backup source for misc from db2078 to db2160 [puppet] - 10https://gerrit.wikimedia.org/r/811884 (https://phabricator.wikimedia.org/T311493) (owner: 10Jcrespo) [13:22:03] (03CR) 10JMeybohm: [C: 03+1] "egress network policies are in values.yaml already. Looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/811980 (https://phabricator.wikimedia.org/T312533) (owner: 10Jgiannelos) [13:22:22] urbanecm, damnit, I did not check whether I have permission to delete pages on mediawiki. Do you have permissions to do that? I'd like this page to be deleted: https://www.mediawiki.org/w/index.php?title=Translations:User:APatro_(WMF)/T312293/1/fr [13:22:52] abijeet: sure. [13:23:02] (03Abandoned) 10Jelto: blackbox-check-http: escape prometheus variables [puppet] - 10https://gerrit.wikimedia.org/r/811984 (owner: 10Jelto) [13:23:15] abijeet: deleted via mwdebug1002 [13:23:26] (03CR) 10Klausman: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36217/console" [puppet] - 10https://gerrit.wikimedia.org/r/811989 (owner: 10Klausman) [13:23:50] meh, accidentally used incorrect mwdebug server... 
[13:23:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, +1 to what Jelto said too (see inline)" [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [13:24:00] abijeet: should i restore and delete from correct srv? [13:24:27] urbanecm, no, just the deletion should be enough [13:24:29] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: df1393f10f987f7ddb41aea135ca43dcc6372715: ServiceImageRecommendationProvider: Dont fail on first validation error (T312521) (duration: 03m 30s) [13:24:32] okay, great [13:24:34] T312521: Add image: filter suggestions to prevent invalid source type errors - https://phabricator.wikimedia.org/T312521 [13:25:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:50] !log urbanecm@deploy1002 Synchronized wmf-config/ProductionServices.php: aa1d8c8ce27c4e19a621250b6fb3cefdb6b64574: GrowthExperiments: Set GEImageRecommendationApiHandler (T306032; 1/2) (duration: 03m 20s) [13:27:54] T306032: Adapt GrowthExperiments to new Image Suggestions API - https://phabricator.wikimedia.org/T306032 [13:28:03] urbanecm, just one more deletion please: https://www.mediawiki.org/w/index.php?title=Translations:User:APatro_(WMF)/T312293/2/fr [13:28:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:28:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:28:31] (03Merged) 10jenkins-bot: ServiceImageRecommendationProvider: Don't fail on 
first validation error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811956 (https://phabricator.wikimedia.org/T312521) (owner: 10Urbanecm) [13:28:39] abijeet: done (from mwdebug1002 this time) [13:29:11] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [13:29:25] ah right, thanks, last time it was from an incorrect server ... got it. [13:29:34] (03CR) 10Btullis: [C: 03+2] Fix profile::analytics::refinery::job::test::refine [puppet] - 10https://gerrit.wikimedia.org/r/811990 (owner: 10Joal) [13:29:38] urbanecm, works well, thanks [13:29:50] abijeet: okay, great. syncing! [13:29:54] yup [13:30:49] abijeet: fyi, i'm happy to give you +sysop at testwiki (should also have translate enabled) in case that'd be useful for tests like this one. [13:30:56] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10Jgiannelos) From a tegola development point of view I think it will be complicated to implement some sort of custom sharding logi... [13:31:16] urbanecm, yes, that would be useful [13:31:23] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [13:31:27] !log urbanecm@deploy1002 Synchronized wmf-config/: aa1d8c8ce27c4e19a621250b6fb3cefdb6b64574: GrowthExperiments: Set GEImageRecommendationApiHandler (T306032; 2/2) (duration: 03m 37s) [13:31:42] kostajh: config+backport should be fully live now. can you check please? [13:31:42] (03CR) 10Muehlenhoff: "Looks great, a few nits inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/811667 (owner: 10Slyngshede) [13:31:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:52] urbanecm: looking [13:31:52] (03PS1) 10Klausman: ores: Switch correct file to use IP instead of name [puppet] - 10https://gerrit.wikimedia.org/r/811995 [13:32:10] (03CR) 10Filippo Giunchedi: [C: 03+2] blackbox-check-http: only add hash page if severity=page [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [13:32:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:32:55] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36218/console" [puppet] - 10https://gerrit.wikimedia.org/r/811995 (owner: 10Klausman) [13:33:03] abijeet: granted. [13:33:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:33:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:33:20] urbanecm, thanks [13:33:27] any time [13:33:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:34:00] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:01] urbanecm: mixed results :( Some articles I just checked are now returning "Invalid JSON response for page". I am pretty sure I just checked these with mwdebug... [13:34:24] kostajh:( any idea what might cause that issue? [13:35:10] urbanecm: can you pull https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/811935 to a mwmaint host? 
That would help debug [13:35:18] or tell me how to do it, and I can do it for my account [13:35:32] certainly [13:35:52] urbanecm: actually, probably won't tell us anything useful here. hang on... [13:35:56] waiting [13:36:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:36:42] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:24] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:38:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:39:53] (CR) Elukey: [C: +1] ores: Switch correct file to use IP instead of name [puppet] - https://gerrit.wikimedia.org/r/811995 (owner: Klausman) [13:40:18] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [13:40:18] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 41 connections established with conf1004.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [13:40:45] (CR) Klausman: [V: +1 C: +2] ores: Switch correct file to use IP instead of name [puppet] - https://gerrit.wikimedia.org/r/811995 (owner: Klausman) [13:41:07] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/Translate/tag/PageTranslationHooks.php: af5174528885a72230be7346e355261383e91b5c: 
Translation unit deletion: Skip translation update if it doesnt exist (T312293) (duration: 03m 32s) [13:41:12] T312293: After all translations to a given locale are deleted, the localized page for that locale is not deleted - https://phabricator.wikimedia.org/T312293 [13:42:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:42:27] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (ssastry) I am hoping @Dzahn can answer the question for me for scandium since I don't know what the 10G requirement means. [13:42:31] abijeet: should be live [13:42:39] (CR) Giuseppe Lavagetto: "I don't think this is a good idea, in general. The current model is simple enough and uses parameters that describe a feature." 
[puppet] - https://gerrit.wikimedia.org/r/797223 (owner: Jbond) [13:42:46] hmm, the mw latency alert is worrying a bit [13:43:29] urbanecm, thanks again [13:43:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:44:25] no problem [13:44:32] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GrowthExperiments/includes/NewcomerTasks/AddImage/ServiceImageRecommendationProvider.php: 95c38bd0416d2bb14404526bdf8106ef033e77b3: ServiceImageRecommendationProvider: Dont fail on first validation error (T312521) (duration: 03m 24s) [13:44:36] T312521: Add image: filter suggestions to prevent invalid source type errors - https://phabricator.wikimedia.org/T312521 [13:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:45:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:47:07] SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (fgiunchedi) >>! In T307184#8061430, @Jgiannelos wrote: > Do we have an upper limit were the object count is going to end up being... [13:47:28] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:47:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:48:16] (PS3) Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) [13:48:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:48:48] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:49:44] !log draining ganeti2016 T311686 [13:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:48] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [13:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:51:05] MW latency went back down. looks like a temporary spike. [13:51:06] (CR) Giuseppe Lavagetto: [C: +1] "Code LGTM, but we can also just yank out all that code." 
[puppet] - https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: Jbond) [13:52:10] (PS2) Giuseppe Lavagetto: Turn mw_releases into a list [puppet] - https://gerrit.wikimedia.org/r/800758 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [13:52:44] PROBLEM - etcd tlsproxy SSL conf1007.eqiad.wmnet:4001 on conf1007 is CRITICAL: SSL CRITICAL - failed to verify conf1007.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 https://wikitech.wikimedia.org/wiki/Cergen [13:52:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:53:06] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 3 connections established with conf1004.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [13:53:12] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:53:52] (PS1) David Caro: WIP wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 [13:54:36] (CR) Giuseppe Lavagetto: [C: +2] Turn mw_releases into a list [puppet] - https://gerrit.wikimedia.org/r/800758 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [13:56:09] (CR) CI reject: [V: -1] WIP wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 (owner: David Caro) [13:58:00] did I miss an etcd maint. window? 
^^ [14:00:29] (PS2) Giuseppe Lavagetto: mtail: fix regexes due to changes in apache configuration [puppet] - https://gerrit.wikimedia.org/r/811934 [14:00:33] (PS1) Giuseppe Lavagetto: mediawiki/releases: fix expected parameter format [puppet] - https://gerrit.wikimedia.org/r/812000 [14:00:55] (PS2) Giuseppe Lavagetto: mediawiki/releases: fix expected parameter format [puppet] - https://gerrit.wikimedia.org/r/812000 [14:01:27] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [14:01:32] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [14:04:28] (CR) CI reject: [V: -1] mtail: fix regexes due to changes in apache configuration [puppet] - https://gerrit.wikimedia.org/r/811934 (owner: Giuseppe Lavagetto) [14:05:03] (CR) Giuseppe Lavagetto: [C: +2] mediawiki/releases: fix expected parameter format [puppet] - https://gerrit.wikimedia.org/r/812000 (owner: Giuseppe Lavagetto) [14:05:11] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Zabe) [14:05:53] (PS2) Ottomata: Add support for airflow filesystem backend variables [puppet] - https://gerrit.wikimedia.org/r/811986 [14:06:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:08:23] (PS10) Zabe: Remove obsolete apache configuration files [puppet] - https://gerrit.wikimedia.org/r/761718 [14:08:49] (PS3) Ottomata: Add support for airflow 
filesystem backend variables [puppet] - https://gerrit.wikimedia.org/r/811986 [14:12:17] PROBLEM - etcd tlsproxy SSL conf1009.eqiad.wmnet:4001 on conf1009 is CRITICAL: SSL CRITICAL - failed to verify conf1009.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 https://wikitech.wikimedia.org/wiki/Cergen [14:12:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [14:12:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [14:14:09] (CR) Volans: [C: +1] "LGTM, nit inline" [software/spicerack] - https://gerrit.wikimedia.org/r/811983 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm) [14:16:47] (PS1) Matthias Mullie: [ImageSuggestions] Remove from beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812002 [14:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1132', diff saved to https://phabricator.wikimedia.org/P30950 and previous config saved to /var/cache/conftool/dbconfig/20220707-141724-marostegui.json [14:17:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30951 and previous config saved to /var/cache/conftool/dbconfig/20220707-141740-root.json [14:18:06] (CR) Matthias Mullie: [C: +2] [ImageSuggestions] Remove from beta testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812002 (owner: Matthias Mullie) [14:19:39] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:20:01] (Merged) jenkins-bot: [ImageSuggestions] Remove from beta testwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/812002 (owner: Matthias Mullie) [14:20:05] PROBLEM - Check systemd state on conf1007 is CRITICAL: CRITICAL - degraded: The following units failed: etcd-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:18] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:03] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:59] PROBLEM - etcd tlsproxy SSL conf1008.eqiad.wmnet:4001 on conf1008 is CRITICAL: SSL CRITICAL - failed to verify conf1008.eqiad.wmnet against etcd-v3.eqiad.wmnet, conf1005.eqiad.wmnet, etcd.eqiad.wmnet, conf1004.eqiad.wmnet, conf1005, conf1004, conf1006.eqiad.wmnet, conf1006 https://wikitech.wikimedia.org/wiki/Cergen [14:23:07] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye [14:23:11] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[14:23:25] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [14:23:31] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [14:23:45] (JobUnavailable) firing: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:25:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:25:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:26:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [14:26:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:26:41] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36220/console" [puppet] - https://gerrit.wikimedia.org/r/811986 (owner: Ottomata) [14:26:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:26:59] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) [resuming this task, let me know if you 
instead prefer a separate one] Some clusters managed by the S... [14:27:33] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:17] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36221/console" [puppet] - https://gerrit.wikimedia.org/r/811986 (owner: Ottomata) [14:28:24] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) [14:28:31] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye [14:28:36] SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors... 
[14:29:31] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) [14:29:59] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) [14:31:15] PROBLEM - Check systemd state on conf1008 is CRITICAL: CRITICAL - degraded: The following units failed: etcd-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:31] (PS4) Ottomata: Add support for airflow filesystem backend variables [puppet] - https://gerrit.wikimedia.org/r/811986 (https://phabricator.wikimedia.org/T309622) [14:32:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30952 and previous config saved to /var/cache/conftool/dbconfig/20220707-143244-root.json [14:34:29] (CR) CI reject: [V: -1] Add support for airflow filesystem backend variables [puppet] - https://gerrit.wikimedia.org/r/811986 (https://phabricator.wikimedia.org/T309622) (owner: Ottomata) [14:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [14:36:49] (PS3) Vlad.shapik: Adjust the online tests to new changes in the thumbor functionality [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/811257 (https://phabricator.wikimedia.org/T312103) [14:38:18] (PS1) Matthias Mullie: [ImageSuggestions] Disable extension [mediawiki-config] - https://gerrit.wikimedia.org/r/812003 [14:38:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1006.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:38:57] (CR) Matthias Mullie: [C: -2] "DNM! This is prep in case something goes wrong once we start sending notifications." [mediawiki-config] - https://gerrit.wikimedia.org/r/812003 (owner: Matthias Mullie) [14:39:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:31] (Abandoned) Matthias Mullie: [ImageSuggestions] Disable extension [mediawiki-config] - https://gerrit.wikimedia.org/r/812003 (owner: Matthias Mullie) [14:39:47] (PS4) Vlad.shapik: Adjust the online tests to new changes in the thumbor functionality [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/811257 (https://phabricator.wikimedia.org/T312103) [14:41:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2010.codfw.wmnet to cluster codfw and group C [14:41:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2010.codfw.wmnet to cluster codfw and group C [14:43:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [14:43:38] (PS2) David Caro: wmcs: add openstack nodes down alerts [alerts] - https://gerrit.wikimedia.org/r/811999 [14:43:41] PROBLEM - Check systemd state on conf1009 is CRITICAL: CRITICAL - degraded: The following units failed: etcd-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:15] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:06] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) p:Triage→Medium a:klausman [14:47:10] (PS1) Reedy: composer.json: Swap "composer foo" for "@foo" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/812006 [14:47:12] (PS1) Reedy: composer.json: Split phpunit to its own command [mediawiki-config] - https://gerrit.wikimedia.org/r/812007 [14:47:13] jouncebot: nowandnext [14:47:13] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [14:47:13] In 1 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1600) [14:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30953 and previous config saved to /var/cache/conftool/dbconfig/20220707-144748-root.json [14:49:24] (CR) Reedy: [C: +2] composer.json: Swap "composer foo" for "@foo" [mediawiki-config] - https://gerrit.wikimedia.org/r/812006 (owner: Reedy) [14:49:26] (CR) Reedy: [C: +2] composer.json: Split phpunit to its own command [mediawiki-config] - https://gerrit.wikimedia.org/r/812007 (owner: Reedy) [14:50:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [14:50:15] (Merged) jenkins-bot: composer.json: Swap "composer foo" for "@foo" [mediawiki-config] - https://gerrit.wikimedia.org/r/812006 (owner: Reedy) [14:50:19] (Merged) jenkins-bot: composer.json: Split phpunit to its own command [mediawiki-config] - https://gerrit.wikimedia.org/r/812007 (owner: Reedy) [14:52:35] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:19] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (jhathaway) @Jclark-ctr have you had a chance to test 
your newly granted sudo permissions? [14:53:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:54:01] !log reedy@deploy1002 Synchronized composer.json: Cleanup (duration: 03m 19s) [14:54:35] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:55:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2010.codfw.wmnet to cluster codfw and group C [14:56:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:57:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:57:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:58:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:01:39] (PS5) Ottomata: Add support for airflow filesystem backend variables [puppet] - https://gerrit.wikimedia.org/r/811986 (https://phabricator.wikimedia.org/T309622) [15:01:49] RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30955 and previous config saved to /var/cache/conftool/dbconfig/20220707-150252-root.json [15:02:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:07:30] (PS1) Elukey: ml-services: Add knative and egress config for eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/812010 (https://phabricator.wikimedia.org/T301878) [15:08:14] (PS2) Elukey: ml-services: Add knative and egress config for eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/812010 (https://phabricator.wikimedia.org/T301878) [15:09:44] !log installing containerd security updates [15:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2010.codfw.wmnet to cluster codfw and group C [15:12:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1006.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1007.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:13:11] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Cmjohnson) [15:13:39] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff) [15:15:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti2016.codfw.wmnet with reason: Drop from ganeti cluster for eventual reimage [15:15:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 
days, 0:00:00 on ganeti2016.codfw.wmnet with reason: Drop from ganeti cluster for eventual reimage [15:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30956 and previous config saved to /var/cache/conftool/dbconfig/20220707-151756-root.json [15:21:10] (PS1) David Caro: tests: add a test to ensure that the runbook is accessible if there [alerts] - https://gerrit.wikimedia.org/r/812011 [15:21:26] (CR) Elukey: [V: +2 C: +2] Upgrade kserve images to upstream release 0.8 [docker-images/production-images] - https://gerrit.wikimedia.org/r/810841 (https://phabricator.wikimedia.org/T311982) (owner: Elukey) [15:21:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:24:52] (PS2) David Caro: tests: add a test to ensure that the runbook is accessible if there [alerts] - https://gerrit.wikimedia.org/r/812011 [15:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30957 and previous config saved to /var/cache/conftool/dbconfig/20220707-153300-root.json [15:34:31] (PS1) David Caro: wmcs: add test to ensure we add a runbook to each alert [alerts] - https://gerrit.wikimedia.org/r/812013 [15:35:24] (PS1) Btullis: Add partman recipes for the new dse-k8s VMs [puppet] - https://gerrit.wikimedia.org/r/812014 (https://phabricator.wikimedia.org/T310170) [15:38:05] (CR) Elukey: Add partman recipes for the new dse-k8s VMs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812014 (https://phabricator.wikimedia.org/T310170) (owner: Btullis) [15:38:56] SRE, ops-eqiad, DC-Ops, cloud-services-team 
(Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Cmjohnson) @cmooney it is most likely a manual change error. I did not completely delete the interface after removing the cloudcephosd hosts, I... [15:39:01] (PS1) Mforns: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812017 (https://phabricator.wikimedia.org/T290303) [15:39:06] SRE: uwsgi socket/UDP logger is broken if no other logger uses the same format - https://phabricator.wikimedia.org/T312550 (klausman) Upstream issue: https://github.com/unbit/uwsgi/issues/2456 [15:39:18] (CR) BCornwall: [C: +2] varnish: Enable Prometheus sysctl exporting [puppet] - https://gerrit.wikimedia.org/r/811780 (https://phabricator.wikimedia.org/T311445) (owner: BCornwall) [15:40:18] (PS1) Alexandros Kosiaris: Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/812018 (https://phabricator.wikimedia.org/T311407) [15:40:27] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Jclark-ctr) cloudweb1003 B2 u3 20220205 port34 cloudweb1004 D2 u33 20220109 port33 [15:40:55] (PS2) Btullis: Add partman recipes for the new dse-k8s VMs [puppet] - https://gerrit.wikimedia.org/r/812014 (https://phabricator.wikimedia.org/T310170) [15:41:17] (PS5) Ori: New service: function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) [15:42:06] (CR) Ori: New service: function-evaluator (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: Ori) [15:43:26] (CR) Btullis: Add partman recipes for the new dse-k8s VMs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812014 
(https://phabricator.wikimedia.org/T310170) (owner: Btullis) [15:45:39] (CR) Ottomata: [C: +1] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/812017 (https://phabricator.wikimedia.org/T290303) (owner: Mforns) [15:47:58] (CR) Alexandros Kosiaris: [C: +2] Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/812018 (https://phabricator.wikimedia.org/T311407) (owner: Alexandros Kosiaris) [15:48:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30958 and previous config saved to /var/cache/conftool/dbconfig/20220707-154804-root.json [15:48:51] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Cmjohnson) @cmooney I believe I found the error, in site.pp I failed to put a ^ before the hostname [15:49:51] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8813.service,thumbor@8817.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:17] (PS1) Cmjohnson: Adding cloudcontrol1006-7 and fix wmcs site.pp entries [puppet] - https://gerrit.wikimedia.org/r/812020 (https://phabricator.wikimedia.org/T306853) [15:53:43] (CR) Cmjohnson: [C: +2] Adding cloudcontrol1006-7 and fix wmcs site.pp entries [puppet] - https://gerrit.wikimedia.org/r/812020 (https://phabricator.wikimedia.org/T306853) (owner: Cmjohnson) [15:54:13] (PS1) Bartosz Dziewoński: Enable VisualEditor on thwikibooks by default [mediawiki-config] - https://gerrit.wikimedia.org/r/812021 
(https://phabricator.wikimedia.org/T308379) [15:55:07] (03Abandoned) 10Bartosz Dziewoński: Enable VisualEditor on thwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812021 (https://phabricator.wikimedia.org/T308379) (owner: 10Bartosz Dziewoński) [15:56:07] (03CR) 10Jgiannelos: [C: 03+2] tegola: Add postgres upstreams for staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/811980 (https://phabricator.wikimedia.org/T312533) (owner: 10Jgiannelos) [15:59:32] (03Merged) 10jenkins-bot: tegola: Add postgres upstreams for staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/811980 (https://phabricator.wikimedia.org/T312533) (owner: 10Jgiannelos) [15:59:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:59:41] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10bking) Step Zero: See https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Updating_Firmware Once you've... [15:59:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bullseye [15:59:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye [16:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:01] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [16:01:24] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [16:01:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [16:02:12] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye [16:02:17] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye [16:03:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30959 and previous config saved to /var/cache/conftool/dbconfig/20220707-160308-root.json [16:03:20] (03PS3) 10Btullis: Add partman recipes for the new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/812014 (https://phabricator.wikimedia.org/T310170) [16:03:32] (03PS4) 10BCornwall: varnish: add VarnishHighMmapCount [alerts] - 10https://gerrit.wikimedia.org/r/805873 (https://phabricator.wikimedia.org/T300723) [16:04:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:51] (03CR) 10Btullis: [C: 03+2] Add partman recipes for the new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/812014 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [16:12:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - 
https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [16:13:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bullseye [16:14:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye [16:14:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bullseye [16:14:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye [16:14:50] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1003.wikimedia.org with reason: host reimage [16:15:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [16:15:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [16:15:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1001.wikimedia.org with reason: host reimage [16:18:33] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1003.wikimedia.org with reason: host reimage [16:18:45] (03PS6) 10David Caro: novafullstack: add types and some names 
refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [16:18:47] (03PS4) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [16:18:49] (03CR) 10David Caro: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [16:18:51] (03PS2) 10David Caro: novafullstack: allow running on codfw [puppet] - 10https://gerrit.wikimedia.org/r/811318 [16:20:43] (03CR) 10CI reject: [V: 04-1] novafullstack: allow running on codfw [puppet] - 10https://gerrit.wikimedia.org/r/811318 (owner: 10David Caro) [16:21:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1001.wikimedia.org with reason: host reimage [16:25:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.wikimedia.org with OS bullseye [16:25:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bull... 
[16:29:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [16:30:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [16:33:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [16:33:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1003.wikimedia.org with OS bullseye [16:33:56] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye completed: - cloudel... [16:35:38] (03PS1) 10Cmjohnson: updating site.pp entry cloudnet1005-6 [puppet] - 10https://gerrit.wikimedia.org/r/812033 (https://phabricator.wikimedia.org/T304888) [16:35:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [16:36:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1001.wikimedia.org with OS bullseye [16:36:31] (03CR) 10CI reject: [V: 04-1] updating site.pp entry cloudnet1005-6 [puppet] - 10https://gerrit.wikimedia.org/r/812033 (https://phabricator.wikimedia.org/T304888) (owner: 10Cmjohnson) [16:36:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.... 
[16:40:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.wikimedia.org with reason: host reimage [16:42:23] (03PS1) 10Alexandros Kosiaris: Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/812035 (https://phabricator.wikimedia.org/T311407) [16:42:51] (03PS2) 10Cmjohnson: updating site.pp entry cloudnet1005-6 [puppet] - 10https://gerrit.wikimedia.org/r/812033 (https://phabricator.wikimedia.org/T304888) [16:43:24] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/812035 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [16:43:58] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp entry cloudnet1005-6 [puppet] - 10https://gerrit.wikimedia.org/r/812033 (https://phabricator.wikimedia.org/T304888) (owner: 10Cmjohnson) [16:44:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.wikimedia.org with reason: host reimage [16:47:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.wikimedia.org with OS bullseye [16:47:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.... 
[16:48:29] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343 [16:48:34] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [16:49:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.wikimedia.org with OS bullseye [16:49:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.... [16:49:32] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1002.wikimedia.org with OS bullseye [16:49:40] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1002.wikimedia.org with OS bullseye [16:51:55] (03PS2) 10Majavah: Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - 10https://gerrit.wikimedia.org/r/809633 [16:52:34] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 292 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 727, active_shards: 1165, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pending_t [16:52:34] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.95881949210707 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:52:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [16:52:47] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [16:52:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmn... [16:52:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet w... [16:53:14] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:52] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 292 threshold =0.2 breach: active_primary_shards: 727, number_of_in_flight_fetch: 0, timed_out: False, delayed_unassigned_shards: 0, initializing_shards: 0, number_of_nodes: 5, unassigned_shards: 292, number_of_pending_tasks: 0, status: yellow, number_of_data_nodes: 5, relocating_shards: 0, active_shards_percen [16:54:52] ber: 79.95881949210707, cluster_name: cloudelastic-chi-eqiad, active_shards: 1165, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:55:50] (03PS7) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [16:55:52] (03PS5) 10David 
Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [16:55:54] (03PS1) 10David Caro: novafullstack: generate prometheus stats too [puppet] - 10https://gerrit.wikimedia.org/r/812037 [16:57:36] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 292 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, active_primary_shards: 727, active_shards: 1165, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pending_t [16:57:36] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.95881949210707 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:59:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1005.wikimedia.org with OS bullseye [16:59:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudservices1005.wikimedi... [17:00:04] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1700). 
[17:00:44] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-07-07-111803-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/812038 [17:01:49] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1002.wikimedia.org with reason: host reimage [17:02:14] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1002.wikimedia.org with reason: host reimage [17:06:05] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-07-07-111803-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/812038 (owner: 10BryanDavis) [17:07:35] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (39) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudrabbit1001, cloudservices1003, cloudservices1004, dse-k8s-etcd1001, dse-k8s-etcd1002, dse-k8s-etcd1003, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011 [17:07:35] 012, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1002, thumbor1005, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [17:07:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:09:05] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - 
elasticsearch inactive shards 292 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, active_primary_shards: 727, initializing_shards: 0, number_of_data_nodes: 5, number_of_in_flight_fetch: 0, unassigned_shards: 292, active_shards: 1165, number_of_nodes: 5, status: yellow, timed_out: False, delayed_unassigned_shards: 0, relocating_ [17:09:05] 0, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.95881949210707 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:09:24] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-07-07-111803-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/812038 (owner: 10BryanDavis) [17:10:31] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:10:57] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:11:08] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:11:43] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:12:01] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:12:47] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:14:37] (03PS6) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [17:16:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [17:17:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) all but the cloudnets installed correctly, they're still presenting 
the dhcp error. I am thinking I may just blo... [17:17:37] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:17:57] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.wikimedia.org with OS bullseye [17:18:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1006.wikime... [17:18:38] (03CR) 10Aqu: [WIP] Build spark assembly for Spark3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [17:19:48] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:20:46] (03CR) 10Bearloga: [C: 03+1] Add sampling to android.breadcrumbs event stream. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847) (owner: 10Dbrant) [17:21:23] RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:43] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1006.wikimedia.org with OS bullseye [17:22:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.... [17:22:49] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1002.wikimedia.org with OS bullseye [17:22:56] 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1002.wikimedia.org with OS bullseye completed: - cloudel... 
[17:22:57] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 292 threshold =0.2 breach: relocating_shards: 0, initializing_shards: 0, status: yellow, timed_out: False, number_of_data_nodes: 6, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 1165, active_shards_percent_as_number: 79.95881949210707, active_prima [17:22:57] s: 727, number_of_in_flight_fetch: 0, unassigned_shards: 292, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:25:05] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: yellow, timed_out: False, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 80.02745367192861, unassigned_shards: 289, number_of_pending_tasks: 1, initializing_shards: 2, number_of_data_nodes: 6, delayed_unassigned_shards: 0, relocating_shards: 0, number_of_in_flight_fetch: 0, a [17:25:05] imary_shards: 727, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, active_shards: 1166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:25:12] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - 10https://gerrit.wikimedia.org/r/809633 (owner: 10Majavah) [17:25:41] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: yellow, number_of_in_flight_fetch: 0, active_primary_shards: 727, delayed_unassigned_shards: 0, active_shards_percent_as_number: 80.09608785175017, relocating_shards: 0, number_of_nodes: 6, number_of_pending_tasks: 0, number_of_data_nodes: 6, task_max_waiting_in_queue_millis: 0, timed_out: False, ac [17:25:41] rds: 1167, initializing_shards: 2, cluster_name: cloudelastic-chi-eqiad, 
unassigned_shards: 288 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:26:15] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 727, active_shards: 1167, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 288, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_f [17:26:15] tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.09608785175017 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:26:31] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_nodes: 6, timed_out: False, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, unassigned_shards: 288, active_primary_shards: 727, initializing_shards: 2, active_shards: 1167, active_shards_percent_as_number: 80.09608785175017, nu [17:26:31] pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 6, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [17:27:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:27:13] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 727, active_shards: 1167, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 288, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_f [17:27:13] tch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.09608785175017 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:27:31] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:28] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10Papaul) [17:31:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:33:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2176.mgmt.codfw.wmnet with reboot policy FORCED [17:37:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2177.mgmt.codfw.wmnet with reboot policy FORCED [17:38:07] (03PS7) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [17:38:54] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:03] (03PS1) 10Andrew Bogott: Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - 10https://gerrit.wikimedia.org/r/811960 [17:39:25] (03CR) 10Majavah: [C: 03+1] Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - 10https://gerrit.wikimedia.org/r/811960 (owner: 10Andrew Bogott) [17:39:28] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [17:40:31] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - 10https://gerrit.wikimedia.org/r/811960 (owner: 10Andrew Bogott) [17:40:59] 
jouncebot: nowandnext [17:40:59] For the next 0 hour(s) and 19 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1700) [17:40:59] In 0 hour(s) and 19 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1800) [17:41:09] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:44:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:44:32] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:47:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [17:49:12] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:49] (03CR) 10Andrew Bogott: [C: 03+2] openstack::trove: enable rabbitmq tls for api [puppet] - 10https://gerrit.wikimedia.org/r/795361 (https://phabricator.wikimedia.org/T297268) 
(owner: 10Majavah) [17:51:25] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [17:51:47] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [17:53:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:56:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:46] RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2176.mgmt.codfw.wmnet with reboot policy FORCED [17:58:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2177.mgmt.codfw.wmnet with reboot policy FORCED [17:58:46] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:59:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2176.mgmt.codfw.wmnet with reboot policy FORCED [18:00:05] jnuche and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T1800). 
[18:00:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:01:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:02:04] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:04:06] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:00] (03PS1) 10Majavah: O:openstack: remove profiles no longer used [puppet] - 10https://gerrit.wikimedia.org/r/812043 [18:05:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2177.mgmt.codfw.wmnet with reboot policy FORCED [18:06:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36222/console" [puppet] - 10https://gerrit.wikimedia.org/r/812043 (owner: 10Majavah) [18:06:54] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2074 - https://phabricator.wikimedia.org/T311990 (10Papaul) [18:07:27] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2074 - https://phabricator.wikimedia.org/T311990 (10Papaul) 05Open→03Resolved complete [18:07:32] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:10:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:16:34] !log 
brett@cumin1001 START - Cookbook sre.dns.netbox [18:18:42] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:22:12] cmjohnson1: Looks like we're butting heads here [18:22:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2176.mgmt.codfw.wmnet with reboot policy FORCED [18:22:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:22:42] !log brett@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:22:50] !log brett@cumin1001 START - Cookbook sre.dns.netbox [18:22:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2178.mgmt.codfw.wmnet with reboot policy FORCED [18:26:04] !log brett@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:31:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye [18:31:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall) 05Open→03Resolved Thank you for the help, @ssingh, @Volans and @ayounsi I've added the DNS records to only the primary i... 
[18:31:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye [18:33:12] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:12] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343 [18:36:15] T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 [18:38:24] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10Volans) Thanks @BCornwall for the quick turnaround and fix. I'll close the tmux then given the revert is not needed anymore. [18:39:03] (03PS1) 10Volans: reports/network: remove lvs* from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812050 (https://phabricator.wikimedia.org/T271144) [18:39:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.wikimedia.org with OS bullseye [18:39:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1006.wikime... 
[18:40:28] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:45] (03PS1) 10C. Scott Ananian: ParserOutput::mergeMapStrategy: don't crash if merging non-array values [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) [18:42:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2177.mgmt.codfw.wmnet with reboot policy FORCED [18:42:23] (03CR) 10Volans: [C: 03+2] reports/network: remove lvs* from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812050 (https://phabricator.wikimedia.org/T271144) (owner: 10Volans) [18:44:26] (03Merged) 10jenkins-bot: reports/network: remove lvs* from no IPv6 list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812050 (https://phabricator.wikimedia.org/T271144) (owner: 10Volans) [18:46:43] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1006.wikimedia.org with OS bullseye [18:46:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.... 
[18:47:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [18:47:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.wikimedia.org with OS bullseye [18:48:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcontrol1007.wikime... [18:48:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, and 2 others: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10Volans) I've removed the lvs prefix from the no IPv6 cluster list and now the Network report in Netbox confirms there are no lvs hos... [18:48:54] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [18:52:34] (03CR) 10Dzahn: "great fix. I think this is why people got pinged more than expected on IRC. thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [18:54:05] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:54:06] (03CR) 10Dzahn: "If I had realized this string is in there even though it's "only" critical and people have their own config that notifies them because of " [puppet] - 10https://gerrit.wikimedia.org/r/811891 (owner: 10RhinosF1) [18:54:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdb1006 (dev database) - https://phabricator.wikimedia.org/T306935 (10Jgreen) [18:57:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2178.mgmt.codfw.wmnet with reboot policy FORCED [18:57:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2179.mgmt.codfw.wmnet with reboot policy FORCED [18:57:59] PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8808.service,thumbor@8812.service,thumbor@8816.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:24] (03CR) 10Ottomata: [WIP] Build spark assembly for Spark3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [18:59:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.wikimedia.org with reason: host reimage [19:01:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2180.mgmt.codfw.wmnet with reboot policy FORCED [19:01:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jgreen) 05Open→03Resolved >>! 
In T306839#8045388, @Cmjohnson wrote: > @Jgreen i don't seem to have the template directory or 10.in file in my DNS repo to m... [19:03:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.wikimedia.org with reason: host reimage [19:05:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bullseye [19:05:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye comple... [19:05:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:07:58] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2175.mgmt.codfw.wmnet with reboot policy FORCED [19:09:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:12:53] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8804.service,thumbor@8805.service,thumbor@8806.service,thumbor@8807.service,thumbor@8809.service,thumbor@8811.service,thumbor@8813.service,thumbor@8817.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:15:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.wikimedia.org with OS bullseye [19:15:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started 
by cmjohnson@cumin1001 for host cloudcontrol1006.wikime... [19:15:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [19:15:52] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [19:16:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [19:16:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... [19:16:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [19:16:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [19:16:32] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [19:16:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... 
[19:17:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.wikimedia.org with OS bullseye [19:18:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1007.wikimedia.... [19:18:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [19:18:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [19:18:54] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [19:19:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... [19:19:55] 10SRE, 10Observability-Alerting, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) [19:20:05] 10SRE, 10Traffic: Create vm.max_map_count metrics for Prometheus - https://phabricator.wikimedia.org/T311445 (10BCornwall) 05Open→03Resolved Implemented and deployed to varnish servers. The `sysctl_vm_max_map_count` metric is now available. [19:21:30] (03CR) 10Subramanya Sastry: "See the slack thread where there is more discussion before backporting this." 
[core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) (owner: 10C. Scott Ananian) [19:22:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [19:23:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) all but cloudnet1006 has gone through the installer, cloudnet1006 is still giving the dhcp error. I did try deleting all the ports an... [19:25:31] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1006.wikimedia.org with reason: host reimage [19:28:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [19:28:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye [19:28:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [19:28:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut... 
[19:32:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1006.wikimedia.org with reason: host reimage [19:33:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Cmjohnson) [19:34:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Cmjohnson) 05Open→03Resolved these are finished @dcaro [19:36:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2179.mgmt.codfw.wmnet with reboot policy FORCED [19:36:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2180.mgmt.codfw.wmnet with reboot policy FORCED [19:36:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:43:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2181.mgmt.codfw.wmnet with reboot policy FORCED [19:43:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2182.mgmt.codfw.wmnet with reboot policy FORCED [19:44:25] (03CR) 10Dzahn: P:vrts: fix probe port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [19:46:10] (03CR) 10Dzahn: "per inline comment, I _do_ think it's good to merge a quick fix, but ALSO totally agree with what Jelto said. 
We do want a check from exte" [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [19:46:18] (03CR) 10Dzahn: [C: 03+2] P:vrts: fix probe port [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [19:46:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1006.wikimedia.org with OS bullseye [19:46:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcontrol1006.wikimedia.... [19:55:35] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Cmjohnson) a:03Cmjohnson [19:55:50] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup: Degraded RAID on db1176 - https://phabricator.wikimedia.org/T312321 (10Cmjohnson) a:03Cmjohnson [19:56:19] (03CR) 10Dzahn: [C: 03+2] P:vrts: fix probe port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811985 (owner: 10Majavah) [19:59:45] (03Abandoned) 10Dzahn: phabricator: de-duplicate list of VCS IPs and usage in module [puppet] - 10https://gerrit.wikimedia.org/r/753561 (owner: 10Dzahn) [20:00:05] brennen: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T2000). [20:00:05] mforns and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hello [20:00:15] hi! 
[20:01:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:01:49] howdy, I can deploy [20:01:57] * urbanecm waves [20:02:15] hiya urbanecm [20:02:36] (03PS2) 10Thcipriani: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812017 (https://phabricator.wikimedia.org/T290303) (owner: 10Mforns) [20:02:51] (03CR) 10Thcipriani: [C: 03+2] Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812017 (https://phabricator.wikimedia.org/T290303) (owner: 10Mforns) [20:02:59] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts doc1001.eqiad.wmnet [20:03:38] !log destroying former stretch backend of doc.wikimedia.org, replaced by doc1002 on buster (T247653) [20:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:42] T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 [20:04:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:05:08] (03Merged) 10jenkins-bot: Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812017 (https://phabricator.wikimedia.org/T290303) (owner: 10Mforns) [20:05:35] mforns: your change is on mwdebug1002, check please [20:05:43] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:05:44] thcipriani: doing! [20:06:00] (03PS3) 10Thcipriani: Enable VisualEditor on thwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791729 (https://phabricator.wikimedia.org/T308379) (owner: 10Klein Muçi) [20:06:53] sent an event from testwiki, waiting for it to show up in kafka [20:07:10] sounds good :) [20:08:13] i've got a backport patch queued up as well, did I miss a ping? 
[20:08:21] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:08:45] Guest2635: because your nick changed to Guest? [20:08:47] Guest2635: seeing as your username is Guest2635, that is possible [20:09:21] or if you've queued it at last minute, the bot might not have updated its data before sending the messages [20:09:39] but also I only see two patches for backport: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220707T2000 [20:10:29] looks like you're cscott and your patch is listed in the previous window :D [20:10:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:38] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doc1001.eqiad.wmnet [20:11:47] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for h... [20:12:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:24] (03CR) 10C. Scott Ananian: ParserOutput::mergeMapStrategy: don't crash if merging non-array values (031 comment) [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) (owner: 10C. Scott Ananian) [20:12:42] nice detective work MatmaRex :P [20:12:48] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:12] MatmaRex: fixed? i hope. 
[20:13:15] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) :) yw doc1001.eqiad.wmnet has now been destroyed (via decom cookbook). [20:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:28] i also apparently had a timezone issue [20:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:43] don't we all :) [20:13:46] thcipriani: I only see the regular production events coming from kafka, not the ones from test that I just sent... [20:14:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:24] the matrix-irc bridge keeps kicking me off my nick :( [20:14:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudweb1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:15:05] hrm mforns I can confirm your code is on mwdebug1002 [20:15:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudweb1004.mgmt.eqiad.wmnet with reboot policy FORCED [20:15:49] I don't see anything interesting in the logs (apart from a few info messages from testwiki) [20:16:06] thcipriani: I'm using mwdebug1002, but the version of WikimediaDebug might not be the latest one, would that matter? [20:16:32] I don't think it should, should be setting the same header (I would think) [20:16:37] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [20:16:59] you can always check the headers in the browser console [20:17:00] ok, thcipriani please feel free to revert. 
I will talk with andrew and see if there's something wrong [20:17:29] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) 05In progress→03Resolved the original ticket is resolved. doc1001 is gone a... [20:17:38] mforns: ok, will do, thank you for checking <3 [20:17:47] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) doc1001 has just been deleted. fixed one of the few remaining subtask. might be a good time to check the others and see where we are [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:58] thcipriani: thanks for the help! 
[20:18:22] (03PS1) 10Thcipriani: Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812074 [20:18:33] (03PS2) 10Dzahn: site/DHCP: decom doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) [20:18:36] mforns: sure thing, yw [20:18:47] (03CR) 10Thcipriani: [C: 03+2] Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812074 (owner: 10Thcipriani) [20:19:15] cscott: this is the patch you're hoping to get out in this window, correct? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/811961 [20:19:59] (03CR) 10C. Scott Ananian: "https://commons.wikimedia.org/w/index.php?title=File:Commons_Growth.svg should no longer crash once this patch is deployed." [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) (owner: 10C. Scott Ananian) [20:20:01] (03CR) 10Dzahn: [C: 03+2] "executed decom cookbook:" [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [20:20:32] yes, and https://commons.wikimedia.org/w/index.php?title=File:Commons_Growth.svg is the test url [20:20:44] thanks, I'll get that backported [20:20:56] (03CR) 10Thcipriani: [C: 03+2] ParserOutput::mergeMapStrategy: don't crash if merging non-array values [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) (owner: 10C. 
Scott Ananian) [20:21:04] (03CR) 10Dzahn: [C: 03+2] "Host doc1001.eqiad.wmnet not found: 3(NXDOMAIN)" [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [20:21:08] (03Merged) 10jenkins-bot: Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812074 (owner: 10Thcipriani) [20:22:05] (03CR) 10Thcipriani: [C: 03+2] Enable VisualEditor on thwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791729 (https://phabricator.wikimedia.org/T308379) (owner: 10Klein Muçi) [20:23:41] (03Merged) 10jenkins-bot: Enable VisualEditor on thwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791729 (https://phabricator.wikimedia.org/T308379) (owner: 10Klein Muçi) [20:24:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:57] MatmaRex: your change is live on mwdebug1002, check please [20:25:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudweb1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:25:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudweb1004.mgmt.eqiad.wmnet with reboot policy FORCED [20:25:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:25:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2182.mgmt.codfw.wmnet with reboot policy FORCED [20:26:16] looking [20:26:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:56] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:27:57] !log pt1979@cumin2002 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2181.mgmt.codfw.wmnet with reboot policy FORCED [20:29:29] thcipriani: sorry, i was a little confused by the site. the change looks good [20:30:15] MatmaRex: no worries, thanks for checking, I'll sync now: do these files need to go out in any particular order that you're aware of? [20:30:55] thcipriani: no, i think only one of them is used in production, and the other is used to generate it [20:31:23] ah, perfect, thank you: going live now [20:31:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:37] o/ :) [20:33:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:52] hey ottomata :] [20:34:16] howdy [20:34:30] thcipriani: is it possible to redeploy our change? [20:34:34] or too late? 
[20:34:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1010.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1015.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1014.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1011.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1013.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-jumbo1012.mgmt.eqiad.wmnet with reboot policy FORCED [20:34:53] !log thcipriani@deploy1002 Synchronized wmf-config/config/thwikibooks.yaml: Config: [[gerrit:791729|Enable VisualEditor on thwikibooks by default (T308379)]] (duration: 03m 25s) [20:34:55] mforns: it's probably possible, we've got about half the window left [20:35:01] T308379: Enable VisualEditor on thwikibooks - https://phabricator.wikimedia.org/T308379 [20:35:11] thcipriani: do you need another patch? [20:35:17] mforns: should I revert my revert or do you need changes? 
[20:35:27] thcipriani: no, same code [20:35:27] revert revert should be good [20:35:29] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:35:32] need to figure out why not working [20:35:49] also, it is a safe change to deploy as is, it only affects configs on testwiki [20:36:35] (03PS1) 10Thcipriani: Revert "Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811963 [20:36:50] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811963 (owner: 10Thcipriani) [20:38:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2176.codfw.wmnet with OS bullseye [20:38:11] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2176.codfw.wmnet with OS bullseye [20:38:39] !log thcipriani@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: Config: [[gerrit:791729|Enable VisualEditor on thwikibooks by default (T308379)]] (duration: 03m 13s) [20:38:46] ^ MatmaRex your change should be live now [20:39:31] (03Merged) 10jenkins-bot: Revert "Revert "Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811963 (owner: 10Thcipriani) [20:40:09] mforns: ottomata change is live 
on mwdebug1002 again if there's anything you want to check [20:40:26] thanks thcipriani, ottomata wanna pair in the batcave? [20:40:31] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:40:54] mforns: lets do slack huddle so my computer doesn't die [20:41:00] ok [20:41:03] * thcipriani mentally notes to budget for future batcave [20:42:09] (03Merged) 10jenkins-bot: ParserOutput::mergeMapStrategy: don't crash if merging non-array values [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811961 (https://phabricator.wikimedia.org/T312242) (owner: 10C. Scott Ananian) [20:43:10] cscott: after a lot of waiting on jenkins I'm happy to say: your change is on mwdebug1002, check please! [20:43:41] ok, checking! [20:43:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:13] (03CR) 10Dzahn: "@Filippo The original idea was that this would then be used in something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/810146" [puppet] - 10https://gerrit.wikimedia.org/r/811790 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [20:44:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:44:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:45:00] thcipriani: doesn't crash on mwdebug1002 yay [20:45:21] cscott: great, glad to hear it, I'll deploy :) [20:45:32] thanks! 
[20:45:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:05] !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.19/includes/parser/ParserOutput.php: Backport: [[gerrit:811961|ParserOutput::mergeMapStrategy: don't crash if merging non-array values (T312242)]] (duration: 03m 05s) [20:49:06] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@e0a8f03]: tune subgraph_mapping_weekly based on first prod run [20:49:09] T312242: Graph extension: Error: Cannot use object of type stdClass as array - https://phabricator.wikimedia.org/T312242 [20:49:15] cscott: should be live now [20:50:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:51:02] thcipriani: thanks. confirmed crasher is gone even with WikimediaDebug off [20:51:11] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@e0a8f03]: tune subgraph_mapping_weekly based on first prod run (duration: 02m 05s) [20:51:12] nice [20:51:20] thcipriani: we're still trying to understand why the events don't reach kafka, is it ok to deploy this anyway since it's just activated on testwiki? This way we can continue testing [20:51:41] mforns: sure if you're around keeping an eye on it, I can deploy for you [20:51:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:51:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:52] thcipriani: ok, thanks! 
[20:52:36] going live now [20:52:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:54:56] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) [20:55:42] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:812017|Migrate WikibaseTermboxInteraction from EventLogging to EventGate on testwiki (T290303)]] (duration: 03m 12s) [20:55:45] T290303: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 [20:56:04] thanks a lot thcipriani [20:56:48] mforns: sure thing! should be live everywhere now :) [20:57:52] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8060976, @Cmjohnson wrote: > @Eevans Let me know when I am able to move these servers for you. We're just waiting on T307802 (next week hopefully?)... [21:00:22] (03PS5) 10Krinkle: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) [21:00:59] thcipriani: window done? 
[21:01:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS bullseye [21:01:40] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2177.codfw.wmnet with OS bullseye [21:02:04] (03PS2) 10Krinkle: missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) [21:02:07] (03PS2) 10Krinkle: multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) [21:08:42] (03PS1) 10Alexandros Kosiaris: Remove conf1008, conf1009 from server etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/812080 [21:10:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove conf1008, conf1009 from server etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/812080 (owner: 10Alexandros Kosiaris) [21:12:46] (03CR) 10Nskaggs: [C: 03+1] "Minor comments below." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro) [21:16:16] PROBLEM - Zookeeper Server #page on conf1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [21:16:21] RECOVERY - etcd tlsproxy SSL conf1007.eqiad.wmnet:4001 on conf1007 is OK: SSL OK - Certificate etcd-v3.eqiad.wmnet valid until 2027-07-06 15:00:50 +0000 (expires in 1824 days) https://wikitech.wikimedia.org/wiki/Cergen [21:18:01] gonna silence that conf1007 alert, it's already tracked in T312539 [21:18:13] 10SRE: ms-be2028 on stretch - https://phabricator.wikimedia.org/T312595 (10Dzahn) [21:18:24] akosiaris: ah thanks, was just starting to look [21:18:24] akosiaris: thanks [21:18:53] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:18:55] PROBLEM - Check unit status of etcd-backup on conf1007 is CRITICAL: CRITICAL: Status of the systemd unit etcd-backup https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:18:57] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:15] (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:20:08] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [21:20:10] 10SRE: ms-be2028 on stretch - https://phabricator.wikimedia.org/T312595 
(10Dzahn) [21:20:39] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 7 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:20:40] 10SRE, 10SRE-swift-storage: ms-be2028 on stretch - https://phabricator.wikimedia.org/T312595 (10Dzahn) [21:21:51] (03CR) 10Krinkle: [C: 03+2] multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [21:22:34] (03Merged) 10jenkins-bot: multiversion: Factor out getTagsForWiki() for re-use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810147 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [21:22:38] akosiaris, jhathaway: I'll resolve in VO too if there's nothing to do [21:22:47] rzl: thanks [21:23:29] PROBLEM - PyBal connections to etcd on lvs3007 is CRITICAL: CRITICAL: 8 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [21:23:41] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1004.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [21:23:50] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Dzahn) >>! In T295793#8050412, @Jelto wrote: > `gitlab1001` and `gitlab2001` will be decommissioned soon in T307142. So r... 
[21:24:31] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:24:55] (03PS1) 10Papaul: Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/812083 (https://phabricator.wikimedia.org/T306849) [21:25:27] PROBLEM - PyBal connections to etcd on lvs6003 is CRITICAL: CRITICAL: 6 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [21:25:47] (03CR) 10Papaul: [C: 03+2] Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/812083 (https://phabricator.wikimedia.org/T306849) (owner: 10Papaul) [21:26:11] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:26:17] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:05] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:28:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:28:30] !log krinkle@deploy1002 Synchronized multiversion/MWMultiVersion.php: Ice5302f791fb1d5 (duration: 03m 18s) [21:29:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:29:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:29:15] (JobUnavailable) resolved: (2) Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:29:51] RECOVERY - PyBal connections to etcd on 
lvs3007 is OK: OK: 16 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [21:30:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:30:09] (03PS8) 10Krinkle: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821) [21:30:47] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:30:57] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 2 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [21:31:21] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Dzahn) In codfw we have seen flapping mgmt being fixed by one of 2 actions: - firmware / DRAC upgrades - DRAC hard resets [21:32:24] (03PS1) 10Alexandros Kosiaris: Add conf1008 in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/812084 (https://phabricator.wikimedia.org/T311407) [21:33:21] !log krinkle@deploy1002 Synchronized multiversion/: Ice5302f791fb1d5 (duration: 03m 18s) [21:33:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add conf1008 in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/812084 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [21:33:47] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 4 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:37:19] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [21:38:23] RECOVERY - PyBal connections to etcd on lvs6003 is OK: OK: 16 connections 
established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [21:38:29] RECOVERY - etcd tlsproxy SSL conf1008.eqiad.wmnet:4001 on conf1008 is OK: SSL OK - Certificate etcd-v3.eqiad.wmnet valid until 2027-07-06 15:00:50 +0000 (expires in 1824 days) https://wikitech.wikimedia.org/wiki/Cergen [21:41:57] (03PS1) 10Alexandros Kosiaris: Add conf1009 in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/812088 (https://phabricator.wikimedia.org/T311407) [21:44:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add conf1009 in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/812088 (https://phabricator.wikimedia.org/T311407) (owner: 10Alexandros Kosiaris) [21:44:27] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 71 connections established with conf1004.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [21:46:41] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [21:48:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36224/" [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:49:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10nskaggs) @cmooney Thank you for updating and linking these instructions. 
Yes, that is helpful [21:49:47] RECOVERY - etcd tlsproxy SSL conf1009.eqiad.wmnet:4001 on conf1009 is OK: SSL OK - Certificate etcd-v3.eqiad.wmnet valid until 2027-07-06 15:00:50 +0000 (expires in 1824 days) https://wikitech.wikimedia.org/wiki/Cergen [21:49:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2177.codfw.wmnet with OS bullseye [21:49:54] (03PS8) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [21:49:56] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2177.codfw.wmnet with OS bullseye executed with er... [21:50:00] (03CR) 10Andrew Bogott: [C: 03+2] labweb: point tlsproxy envoy at port 8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/811381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [21:50:59] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [21:51:19] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [21:52:38] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [21:52:51] 10SRE, 10SRE-swift-storage: ms-be2028 on stretch - https://phabricator.wikimedia.org/T312595 (10Dzahn) 05Open→03Invalid Invalid - there are actually more ms-be hosts on stretch. 
I did not get the correct list. [21:53:27] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:54:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2176.codfw.wmnet with OS bullseye [21:54:40] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [21:54:44] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2176.codfw.wmnet with OS bullseye executed with er... [21:56:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2176.codfw.wmnet with OS bullseye [21:56:08] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [21:56:11] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [21:56:13] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2176.codfw.wmnet with OS bullseye [21:56:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:39] (03PS4) 10Dzahn: doc: remove 
support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) [21:56:55] andrewbogott, bd808: the alert is you right ^ [21:57:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS bullseye [21:57:01] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [21:57:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2177.codfw.wmnet with OS bullseye [21:57:18] RhinosF1: yes, I was just hunting around for things to ack :) [21:57:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:32] andrewbogott: need anything? [21:57:55] andrewbogott: no problem, just wanted it to be clear in here that it was expected before anyone showed. [21:58:16] ACKNOWLEDGEMENT - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled Andrew Bogott more attempts to containerize Striker, work in progress... https://wikitech.wikimedia.org/wiki/PyBal [21:58:16] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled Andrew Bogott more attempts to containerize Striker, work in progress... https://wikitech.wikimedia.org/wiki/PyBal [21:58:19] rzl: no, all is well, just ack'd [21:58:35] andrewbogott: okay, is it expected that wikitech is down? 
:) [21:58:41] rzl: well, actually, maybe you could help us do this right, but there's not an outage that needs to interrupt your life [21:58:49] rzl: it's not desired but it is expected :( [21:58:51] rzl: we need somebody who understands the damn envoy proxy layer to help me make a patch that actually works :/ [21:58:53] got it [21:58:58] envoy I might be able to help with, what's up? [21:59:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/811381/2/hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml fixes toolsadmin, but breaks wikitech and horizon [21:59:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/811381 <- the most recent attempt [21:59:29] looking [21:59:52] (early heads up, I'll need to go in ~25 minutes for an appointment, happy to do what I can until then) [21:59:58] rzl: I'm going to revert in a minute or two but ping me if you come up with a fix before then :) [22:00:17] go ahead and revert please, we can fix after [22:00:19] somehow leaving the upstream using FQDN makes horizon and wikitech work (which makes no sense to me at all) [22:00:24] (03PS1) 10Andrew Bogott: Revert "labweb: point tlsproxy envoy at port 8080 for striker" [puppet] - 10https://gerrit.wikimedia.org/r/811965 [22:01:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [22:01:12] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[22:02:51] (03CR) 10BryanDavis: [C: 03+1] Revert "labweb: point tlsproxy envoy at port 8080 for striker" [puppet] - 10https://gerrit.wikimedia.org/r/811965 (owner: 10Andrew Bogott) [22:02:55] * andrewbogott just waiting for CI [22:02:58] (03CR) 10Andrew Bogott: [C: 03+2] Revert "labweb: point tlsproxy envoy at port 8080 for striker" [puppet] - 10https://gerrit.wikimedia.org/r/811965 (owner: 10Andrew Bogott) [22:03:49] (03CR) 10Dzahn: "noop on doc1002/doc2001" [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [22:04:41] (03PS1) 10Krinkle: noc: Minor improvements to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812091 [22:04:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [22:05:10] rzl: the weirdness I'm trying to accomplish is pointing the envoy proxy at port 8080 on a Docker container. That container doesn't have IPv6 because Docker and the default FQDN lookup maked envoy want to talk IPv6 to the upstreams. [22:05:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:05:25] *made [22:05:25] (those sites are back up btw) [22:06:05] andrewbogott: thanks for the try and the rollback [22:06:07] bd808: ahh, got it [22:06:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:01] --net=host for that container? [22:07:05] (03CR) 10Dzahn: [C: 03+1] "certainly looks right, is it expected that nothing about it is in netbox? do we have to run a cookbook?" 
[dns] - 10https://gerrit.wikimedia.org/r/811912 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [22:07:16] akosiaris: yes. for other strange reasons [22:07:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:52] bd808: nothing strange about trying to avoid docker networking. It's a mess waiting to bite you [22:07:57] akosiaris: I couldn't get the container to talk to the nutcracker on the host without using the host network [22:08:31] I probably should be doing all this in k8s instead of docker, but I thought I was making life easier... [22:08:51] may I bash that ? ^ [22:09:01] akosiaris: always :) [22:09:02] there's a subtle hint of irony in there ;-) [22:09:31] docker networking and life easier ... well it ain't it [22:10:44] I can do yet another thing which is changing the apache config that currently proxies toolsadmin.wm.o to a uwsgi container to instead reverse proxy to service running in the Docker container [22:11:04] that leaves the envoy dark magic alone [22:11:23] bd808: I don't see anything obviously wrong in the envoy config as shown in PCC -- I agree your whole situation is bizarre but I think it's all doing the right thing at least as far as envoy [22:11:40] docker networking I won't pretend to understand though [22:12:29] Do we have log info to know what actual URLs we were asking apache for when it displayed that 'hello world' page? [22:12:30] This pass the thing that went boom unexpectedly was envoy->apache on the host. I need to go look at all the vhosts to see if this makes any sense... [22:13:02] there isn't much to understand. You either stick to expose host port -> container port or you are in for a trip down a big ugly rabbit hole. 
[22:13:27] there is a reason we didn't want to adopt docker as is back when the sirens were begging [22:13:44] well, actually multiple reasons [22:13:52] (03CR) 10Krinkle: [C: 03+2] noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:13:55] the networking being one of them [22:14:20] (03CR) 10Krinkle: [C: 03+2] noc: Minor improvements to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812091 (owner: 10Krinkle) [22:14:32] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:14:34] (03Merged) 10jenkins-bot: noc: Add support for dblists to wiki.php config viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810148 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:14:49] Is there any containerization system that has... sanity-retaining networking? [22:15:05] andrewbogott: I figured out why it didn't work. There is a conf-enabled/50-server-status.conf vhost that listens explicitly on 127.0.0.1:80 that ate the traffic [22:15:08] (03Merged) 10jenkins-bot: noc: Minor improvements to wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812091 (owner: 10Krinkle) [22:15:32] bd808: the other moving part is of course labweb1001.wikimedia.org doesn't resolve to 127.0.0.1, it resolves to 208.80.154.160, so-- ahahaha [22:15:39] man, 30 seconds too late to be really helpful with that, huh [22:15:45] ok, that would do it! And I bet we need that for a health check [22:15:52] so this is why 127.0.0.1 worked for striker (not hitting apache) and failed for the rest (apache has unexpected config) [22:15:58] yeah. [22:16:17] Want me to move wikitech and horizon to port 8081? [22:16:23] Dumb but easy! 
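The failure mode worked out in the messages above (a vhost bound to 127.0.0.1:80 swallowing requests once the envoy upstream address changed from the host's FQDN to 127.0.0.1) can be sketched as a toy model of Apache's documented address-then-name vhost matching. This is a simplified illustration, not the actual labweb configuration: only the 127.0.0.1:80 listener and the 208.80.154.160 address come from the log, while the `VHost` class, the `select_vhost` function, the wildcard `*:80` vhosts, and the `server-status.invalid` name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VHost:
    listen_addr: str   # "*" stands for a wildcard <VirtualHost *:port>
    port: int
    server_name: str


def select_vhost(vhosts: List[VHost], local_addr: str, port: int,
                 host_header: str) -> Optional[VHost]:
    """Simplified model of Apache vhost selection (hypothetical helper)."""
    # Step 1: vhosts bound to the exact local address win; wildcard
    # vhosts are only considered when no exact address match exists.
    exact = [v for v in vhosts
             if v.listen_addr == local_addr and v.port == port]
    candidates = exact or [v for v in vhosts
                           if v.listen_addr == "*" and v.port == port]
    if not candidates:
        return None
    # Step 2: name-based matching happens only inside that candidate set.
    for v in candidates:
        if v.server_name == host_header:
            return v
    # No name matched: the first listed candidate acts as the default.
    return candidates[0]


# Assumed layout: one loopback-only vhost plus wildcard site vhosts.
VHOSTS = [
    VHost("127.0.0.1", 80, "server-status.invalid"),  # placeholder name
    VHost("*", 80, "wikitech.wikimedia.org"),
    VHost("*", 80, "horizon.wikimedia.org"),
]

if __name__ == "__main__":
    # Upstream reached via the host's public address: site vhost answers.
    print(select_vhost(VHOSTS, "208.80.154.160", 80,
                       "wikitech.wikimedia.org"))
    # Upstream reached via loopback: the loopback-bound vhost answers,
    # regardless of the Host header.
    print(select_vhost(VHOSTS, "127.0.0.1", 80,
                       "wikitech.wikimedia.org"))
```

Under these assumptions, a request for wikitech arriving on 127.0.0.1:80 never reaches the name-based wikitech vhost, which is consistent with the symptom described above; moving the sites to another port, or giving envoy a different upstream address, avoids the collision.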
[22:17:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [22:17:12] I was thinking more about making it possible to set an upstream_addr in a profile::tlsproxy::envoy::services entry [22:17:45] perryprog: niah, networking is always weird in these situation, but some. e.g. systemd-nspawn can truly easily support macvlans (docker supports macvlan too, but boy oh boy) and have the containers be right next to your host [22:18:10] That sounds... foot gun adjacent [22:18:23] kubernetes has 3 very simple rules that make networking mostly work without much pain [22:18:34] Jeez I need to read some networking books. [22:18:46] and one can distill them to 1 "thou shall not NAT" [22:18:48] I feel like I always know half of everything relevant. [22:20:12] lxc/lxd however don't have a very sanity preserving network implementation [22:20:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [22:20:47] but it's a very flexible one IIRC [22:20:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:20:55] haven't seen it in years in action though [22:21:15] (03PS9) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [22:21:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:21:29] freebsd jails... I am not gonna even talk about that [22:21:41] andrewbogott, bd808: anything else from me before I drop off? [22:21:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:21:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:22:05] rzl: I'm good. 
thanks for your time [22:22:08] (stashbot might need a restart or something) [22:22:09] rzl I don't think so. We might have another go at this so feel free to ignore any repeat of that same set of alerts in the next hour or so :) [22:22:27] rzl: stashbot doesn't ack logmsgbot now [22:22:31] ohh that's right [22:22:44] andrewbogott: I won't be at a keyboard anyway, but consider silencing so that you don't page a bunch of other people [22:22:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:23:21] rzl: I'll try but was already surprised at which host emitted the alerts [22:23:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2176.codfw.wmnet with OS bullseye [22:24:11] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2176.codfw.wmnet with OS bullseye completed: - db2... [22:25:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2178.codfw.wmnet with OS bullseye [22:25:55] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2178.codfw.wmnet with OS bullseye [22:26:02] (And sorry for the OT-ness, but since y'all are experts, if anyone /does/ have any good recommendations on "networking" broadly speaking in the sense of what's being discussed here, I'd love to have some.) 
[22:26:07] (03CR) 10Krinkle: [C: 03+2] missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [22:27:15] (03Merged) 10jenkins-bot: missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [22:32:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:33:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:33:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:34:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2177.codfw.wmnet with OS bullseye [22:34:13] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2177.codfw.wmnet with OS bullseye completed: - db2... 
[22:34:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:38:24] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:12] (03PS3) 10Krinkle: multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) [22:39:15] (03CR) 10Krinkle: [C: 03+2] multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [22:39:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2179.codfw.wmnet with OS bullseye [22:39:24] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2179.codfw.wmnet with OS bullseye [22:42:34] !log krinkle@deploy1002 Synchronized wmf-config/missing.php: I13a4ba0e307a (duration: 03m 33s) [22:44:34] (03Merged) 10jenkins-bot: multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [22:45:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2178.codfw.wmnet with reason: host reimage [22:48:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2178.codfw.wmnet with reason: host reimage [22:49:28] (03PS1) 10Andrew Bogott: labweb: move striker, wikitech, horizon behind envoy [puppet] - 10https://gerrit.wikimedia.org/r/812096 (https://phabricator.wikimedia.org/T306469) [22:50:04] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:51:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:51:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:52:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:52:58] (03CR) 10Krinkle: [C: 03+2] build: Add .editorconfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804682 (owner: 10Krinkle) [22:53:12] (03CR) 10BryanDavis: labweb: move striker, wikitech, horizon behind envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812096 (https://phabricator.wikimedia.org/T306469) (owner: 10Andrew Bogott) [22:53:34] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:42] (03Merged) 10jenkins-bot: build: Add .editorconfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804682 (owner: 10Krinkle) [22:55:13] (03CR) 10Ahmon Dancy: "Thanks for this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/804682 (owner: 10Krinkle) [22:55:51] (03PS2) 10Andrew Bogott: labweb: move striker, wikitech, horizon behind envoy [puppet] - 10https://gerrit.wikimedia.org/r/812096 (https://phabricator.wikimedia.org/T306469) [22:56:19] !log krinkle@deploy1002 Synchronized multiversion/: I1f2daab316 (duration: 03m 43s) [22:57:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:57:45] (03CR) 10BryanDavis: [C: 03+1] labweb: move striker, wikitech, horizon behind envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812096 (https://phabricator.wikimedia.org/T306469) (owner: 10Andrew Bogott) [22:58:02] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [22:58:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:58:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:58:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2179.codfw.wmnet with reason: host reimage [22:59:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:59:42] (03CR) 10Andrew Bogott: [C: 03+2] labweb: move striker, wikitech, horizon behind envoy [puppet] - 10https://gerrit.wikimedia.org/r/812096 (https://phabricator.wikimedia.org/T306469) (owner: 10Andrew Bogott) [23:02:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2179.codfw.wmnet with reason: host reimage [23:02:25] (03CR) 10Krinkle: [C: 03+2] Enable wgResourceLoaderUseObjectCacheForDeps for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811794 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [23:03:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
db2178.codfw.wmnet with OS bullseye [23:03:02] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [23:03:03] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [23:03:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2178.codfw.wmnet with OS bullseye completed: - db2... [23:04:04] (03PS1) 10Andrew Bogott: Revert "labweb: move striker, wikitech, horizon behind envoy" [puppet] - 10https://gerrit.wikimedia.org/r/812106 [23:05:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:05:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:06:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2180.codfw.wmnet with OS bullseye [23:06:16] pt1979@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[23:06:21] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2180.codfw.wmnet with OS bullseye [23:07:12] (03CR) 10Andrew Bogott: [C: 03+2] Revert "labweb: move striker, wikitech, horizon behind envoy" [puppet] - 10https://gerrit.wikimedia.org/r/812106 (owner: 10Andrew Bogott) [23:07:26] (03PS2) 10Krinkle: Enable wgResourceLoaderUseObjectCacheForDeps for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811794 (https://phabricator.wikimedia.org/T113916) [23:07:32] (03CR) 10Krinkle: [C: 03+2] Enable wgResourceLoaderUseObjectCacheForDeps for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811794 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [23:08:49] (03Merged) 10jenkins-bot: Enable wgResourceLoaderUseObjectCacheForDeps for group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811794 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [23:10:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:10:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:14:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:15:25] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet 
- https://phabricator.wikimedia.org/T306849 (10Papaul) [23:15:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:15:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:16:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:16:36] (03CR) 10BryanDavis: [C: 04-1] "This is going to fail in the same way that I22397754468abe1de3fed12a3e7e1fdff8d6d336 did. The issue is that all of our apaches have an exp" [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [23:16:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2179.codfw.wmnet with OS bullseye [23:16:56] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2179.codfw.wmnet with OS bullseye completed: - db2... 
[23:24:29] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 3 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [23:25:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: host reimage [23:26:05] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I9b97f79618 (duration: 03m 23s) [23:26:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2181.codfw.wmnet with OS bullseye [23:26:22] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2181.codfw.wmnet with OS bullseye [23:29:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: host reimage [23:43:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2180.codfw.wmnet with OS bullseye [23:43:28] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2180.codfw.wmnet with OS bullseye completed: - db2... [23:45:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: host reimage [23:49:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: host reimage