[00:00:04] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T0000).
[00:05:01] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occurred trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:12] (PS5) Legoktm: Remove putenv() for GDFONTPATH [mediawiki-config] - https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822)
[00:11:22] (CR) Legoktm: [C: +2] Remove putenv() for GDFONTPATH [mediawiki-config] - https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:12:51] (Merged) jenkins-bot: Remove putenv() for GDFONTPATH [mediawiki-config] - https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: Legoktm)
[00:15:28] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Remove putenv() for GDFONTPATH (duration: 00m 58s)
[00:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:18:16] (CR) Legoktm: [C: +2] shell: Fix $wgShellboxUrls by passing service name when creating BoxedCommand [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716103 (https://phabricator.wikimedia.org/T290193) (owner: Legoktm)
[00:18:21] (CR) Legoktm: [C: +2] Use the 'score' Shellbox if configured [extensions/Score] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716104 (https://phabricator.wikimedia.org/T290193) (owner: Legoktm)
[00:33:51] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:56] (Merged) jenkins-bot: shell: Fix $wgShellboxUrls by passing service name when creating BoxedCommand [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716103 (https://phabricator.wikimedia.org/T290193) (owner: Legoktm)
[00:37:38] (Merged) jenkins-bot: Use the 'score' Shellbox if configured [extensions/Score] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716104 (https://phabricator.wikimedia.org/T290193) (owner: Legoktm)
[00:40:41] (PS1) Legoktm: Don't set default $wgShellboxUrls to Score (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/719505
[00:40:46] (PS2) Legoktm: Don't set default $wgShellboxUrls to Score (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/719505
[00:41:51] (PS3) Legoktm: Don't set default $wgShellboxUrls to Score (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/719505
[00:42:06] (CR) Legoktm: [C: +2] Don't set default $wgShellboxUrls to Score (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/719505 (owner: Legoktm)
[00:42:51] (Merged) jenkins-bot: Don't set default $wgShellboxUrls to Score (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/719505 (owner: Legoktm)
[00:45:10] !log legoktm@deploy1002 sync-file aborted: shell: Fix $wgShellboxUrls by passing service name when creating BoxedCommand (T290193 (duration: 00m 07s)
[00:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:15] T290193: $wgShellboxUrls doesn't work for BoxedCommand - https://phabricator.wikimedia.org/T290193
[00:45:32] I aborted because I typo'd the message
[00:46:15] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.21/includes/shell/CommandFactory.php: shell: Fix $wgShellboxUrls by passing service name when creating BoxedCommand (T290193) (duration: 00m 58s)
[00:46:19] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:21] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/Score/includes/Score.php: Use the 'score' Shellbox if configured (T290193) (duration: 00m 57s)
[00:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:11] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Don't set default to Score (try #2) (duration: 00m 58s)
[00:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:39] (PS1) Milimetric: role::common::aqs: update druid mw datasource [puppet] - https://gerrit.wikimedia.org/r/719680
[01:40:59] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:11] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:39] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:09] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:41] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:35] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL -
wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:20:23] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs20
[02:20:23] .wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:28:37] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:43] ryankemper: umm ^^ wdqs down?
[02:31:48] legoktm: thanks for ping, that's not good...we've had instability in codfw over the last week that we've slapped a bandaid on by restarting codfw once hourly...might be that even restarting once per hour isn't enough right now :/
[02:31:50] looking right now
[02:32:18] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-1h&to=now
[02:32:45] seems like an uptick in errors started around ~2:12
[02:33:36] !log [WDQS] Restarting `wdqs-blazegraph` across all of `wdqs2*`
[02:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:01] PROBLEM - Query Service HTTP Port on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:34:36] !log [WDQS] For context I glanced at `ryankemper@cumin1001:~$ sudo -E cumin 'P{wdqs2*}' 'sudo systemctl status wdqs-blazegraph'` before doing the aforementioned restarts and they'd all last restarted between 25-28 minutes ago
[02:34:37] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[02:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:09] !log [WDQS] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&from=1631152574841&to=1631154942992 shows the availability pattern, anywhere we see missing data (null) represents time that blazegraph was locked up and therefore unable to report metrics
[02:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:33] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2004 is OK: TCP OK - 1.025 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[02:37:53] RECOVERY - Query Service HTTP Port on wdqs2004
is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:38:35] Grafana's not looking great, feels like the hosts are rapidly dropping out again
[02:40:07] Gonna go spelunking through logstash and see if there's an obviously bad actor, restarting blazegraph is just a bandaid and a very poor one at that currently
[02:40:29] is there any timeout that can be lowered?
[02:40:46] ACKNOWLEDGEMENT - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service andrew bogott T290630 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:46] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 983.20 seconds andrew bogott T290630 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:42:07] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:18] legoktm: The main issue is that when blazegraph's livelock/deadlock occurs it seems to kind of lose track of the query it's running (or more accurately, loses track of which threads are doing what work)
[02:42:34] We could lower the timeout on the nginx side, but I don't think that would actually stop the work blazegraph is doing on the other end
[02:42:40] * legoktm nods
[02:42:57] Currently from nginx's perspective, for a "downed" host it isn't hearing back from blazegraph at all so should be timing out around ~60s
[02:44:42] There's some steps/context in https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock of stuff we've done previously, but the tl;dr is that we need to be able to identify which queries are causing the issue (given that doing a restart across the fleet didn't even give us a few minutes of proper service availability)
[02:46:48] For one of these past incidents, blocking the largest source of `MalformedQueryException` helped, so I'm going to see if there are any large sources of those
[02:47:56] https://logstash.wikimedia.org/goto/fccba0471b06406728defef7be8bc982 here's a query to find malformed query exceptions, now I need to actually turn it into a visualization or similar so I can sort by source ip / user agent or similar
[02:48:14] ack, would it be helpful if I also looked in logstash? (having no other knowledge besides what you've said and what's on the wiki page)
[02:49:21] legoktm: it can't hurt :) half the battle is wrangling kibana to visualize the thing we actually want to see (and also playing around with different queries to an extent)
[02:49:43] usual disclaimers apply about if you've got real life stuff to be doing do that first, but otherwise any help is welcome
[02:51:40] * ryankemper is going to fire off a quick message to the WDQS mailing list
[02:52:12] ryankemper: looking at the 5xx logs, there's a lot of errors from "VideocatalogTopic/1.0"
[02:53:20] legoktm: any chance we see a spike in those errors starting around 2:12 or a little before?
[02:54:37] nope, they've been erroring ever since logrotate
[02:55:06] let me look around that time...
[03:00:23] nothing obvious sticks out
[03:00:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:03:11] yeah that's often the case :/
[03:03:14] * ryankemper finally finished writing the e-mail
[03:04:31] !log [WDQS] Dispatched email to Wikidata public mailing list about reduced service availability
[03:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:51] Okay so I remember in past incidents seeing a pie chart of error types, but can't remember exactly where that was.
I'll see if we have any relevant saved kibana dashboards, if not then it's a tossup between trying to recreate something similar or spelunking through IRC logs from the last couple incidents
[03:07:19] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 3.026 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:08:25] https://logstash.wikimedia.org/app/dashboards#/view/AXRQ-_Rf3_NNwgAUtTdZ?_g=h@c823129&_a=h@4a204bd this is half decent...I do see 2 x 100 of `VideocatalogTopic/1.0 (https://dailymotion.com/;` and 300 of `Twisted PageGetter`
[03:08:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:09:09] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:09:17] Twisted is a python framework so presumably someone's homegrown script...longshot but could be a candidate culprit
[03:11:38] !log stopping and restarting mariadb on clouddb1017 s1
[03:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:29] !log attempting to start replication on clouddb1017 s1 T290630
[03:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:33] T290630: clouddb1017 replag alerts - https://phabricator.wikimedia.org/T290630
[03:13:07] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 4.118 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:13:34] ryankemper: Twisted is pybal itself
[03:14:17] ah
[03:15:09] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:16:53] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:58] ryankemper: I'm going afk, but ping me if you need help with anything
[03:18:08] legoktm: sounds good, thanks for the help!
[03:19:46] Tried combining the existing wikidata query service kibana dashboard with a filter to get only malformedqueryexceptions: https://logstash.wikimedia.org/goto/3c98aae38ffe41fbef28fd7a4384264d
[03:19:59] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:20:53] there is a spike a few minutes before 2:12...there was also a similar spike around 1:41 with no corresponding deadlock issue, so that part doesn't quite line up
[03:21:21] but overall the pattern looks more related to the 2:12 error spike than other things I've been looking at (like the same dashboard w/o the filter)
[03:22:33] You know what, I need to filter out `wdqs1*` to narrow it down since it's only impacting `wdqs2*`
[03:24:49] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.730 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:25:00] Ah, filtering for only `wdqs2*` filters out many of these events, and the pattern isn't related to the 2:12 spike at all, so `MalformedQueryException` is likely a dead end: https://logstash.wikimedia.org/goto/ae95b797586201bc30e20e076b5e9ad2
[03:25:49] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled:
wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are mar
[03:25:49] but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:28:26] (PS2) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389
[03:29:04] !log [WDQS] There's no clear indication of them being a culprit, but by far the most common user agent is a dailymotion VideocatalogTopic UA (see https://logstash.wikimedia.org/goto/51f238e9010d0220e5d33c6c210be93e)
[03:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:14] * ryankemper can't remember if UAs count as PII so not including the full user agent
[03:30:22] it technically is, but I think including snippets of them is OK
[03:34:33] Could be worth a try banning them, I do know that across the application lifetime we do some automatic throttling (which gets reset when the service gets restarted)
[03:34:56] I don't have anything concrete besides that they're the most common UA by a good bit
[03:35:32] (CR) RLazarus: Cleanup: Replace all format() calls on string literals with f-strings. (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[03:35:36] Service availability is bad enough that the difference in availability after getting banned isn't too wide a gap as well...
[03:37:07] Well I'll get a patch up for now and then decide at that point
[03:37:38] (Abandoned) Rishabhbhat: Add sitename for kswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/718419 (owner: Rishabhbhat)
[03:37:46] It can't hurt much, I'm just not very optimistic about it helping either...but our options are quite limited and service availability is quite poor anyway
[03:38:11] +1 that it can't hurt
[03:38:39] plus their email is in the UA so you can send them a heads up that they're getting a bunch of 5xx responses, so something is wrong with their queries anyways
[03:38:47] (and that's why we blocked them)
[03:39:57] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:18] (PS1) Ryan Kemper: wdqs: [hotfix] ban dailymotion UA [puppet] - https://gerrit.wikimedia.org/r/719753
[03:40:39] legoktm: https://gerrit.wikimedia.org/r/719753
[03:41:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:41:33] (CR) Legoktm: [C: +1] wdqs: [hotfix] ban dailymotion UA [puppet] - https://gerrit.wikimedia.org/r/719753 (owner: Ryan Kemper)
[03:42:01] (CR) Ryan Kemper: [C: +2] wdqs: [hotfix] ban dailymotion UA [puppet] - https://gerrit.wikimedia.org/r/719753 (owner: Ryan Kemper)
[03:43:04] !log [WDQS] Running puppet agent on `wdqs[2001-2008].codfw.wmnet` to roll out https://gerrit.wikimedia.org/r/719753
[03:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:43:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:21] !log [WDQS] Restarting `wdqs-blazegraph` on `wdqs[2001-2008].codfw.wmnet`; if banning the dailymotion UA was sufficient then servers should come back up healthy and not drop back into deadlock
[03:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:47:59] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:48:57] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:49:15] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:49:41] Well I'll be damned...
[03:49:51] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:49:55] was that it?!
[03:50:44] legoktm: All the codfw hosts are back up properly
[03:51:03] Sometimes the first hunch is the right one :)
[03:51:20] note to future self: ban first and think later
[03:51:42] * ryankemper will dispatch an e-mail to dailymotion
[03:52:03] awesome :)
[03:54:37] legoktm: thanks for all your help!
and the initial ping as well
[03:55:10] yw :) glad I could help
[03:57:27] !log [WDQS] Dispatched e-mail to WDQS public mailing list informing them the outage is over; all that's left is the e-mail to the banned UA
[03:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:39] (PS1) Bstorm: wikireplicas: reduce the innodb_buffer_pool_size for s1 analytics [puppet] - https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630)
[04:09:58] (CR) Andrew Bogott: [C: +1] "I suggested 20% totally at random so we could use a nice round number if that feels better :)" [puppet] - https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: Bstorm)
[04:10:05] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:14:21] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:37:09] !log [WDQS] Dispatched e-mail to the banned user agent (dailymotion)
[04:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:43] SRE, DBA, observability, Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (Marostegui) I like option #3 too, but I also have to say that option #1 is also quite important. Having the passive DC in read-write is very dan...
[05:08:03] PROBLEM - Check systemd state on doh2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:17] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:11] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:53] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:03] RECOVERY - Check systemd state on doh2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:19] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:03] (CR) Marostegui: "We should probably also change the other s1 host, so they have the same configuration and can be exchanged without issues in the future."
[puppet] - https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: Bstorm)
[06:35:27] (PS4) Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[06:35:29] (PS7) Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[06:35:31] (PS5) Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[06:37:02] (CR) jerkins-bot: [V: -1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[06:37:47] (CR) Elukey: "Improved the kubeflow-kfserving-inference chart and changed the values.yaml accordingly :)" [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[06:44:10] SRE, Traffic, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T290599 (Dzahn) @ssingh Does it matter if durum1001 and durum1002 are in separate rows?
[06:47:03] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum1002.eqiad.wmnet
[06:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:53] SRE, Traffic, vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T290599 (Dzahn) Ready to create Ganeti VM durum1002.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row B with 2 vCPUs, 4GB of RAM, 15GB of disk in the private network.
[06:51:50] (PS1) Dzahn: site: add durum1002 with insetup role [puppet] - https://gerrit.wikimedia.org/r/719886 (https://phabricator.wikimedia.org/T290599)
[06:54:56] (CR) Alexandros Kosiaris: Update jabram's ssh key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) (owner: Alexandros Kosiaris)
[06:55:05] (CR) Alexandros Kosiaris: [C: +2] k8s: Instruct docker to keep logs at 100M [puppet] - https://gerrit.wikimedia.org/r/719550 (https://phabricator.wikimedia.org/T289578) (owner: Alexandros Kosiaris)
[07:00:35] .win 12
[07:00:38] ufff :)
[07:02:28] SRE, DBA, observability, Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (Volans) As for the feasibility, I thought about some ways to achieve more or less what you want without having alerts that change name based on...
[07:02:53] (PS1) Alexandros Kosiaris: k8s: Fix docker log-options key [puppet] - https://gerrit.wikimedia.org/r/719893 (https://phabricator.wikimedia.org/T289578)
[07:03:22] SRE, serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (elukey) Fine for me, what i had in mind was an alert if a namespace showed evicted pods for too much time (say days) since it seemed to be something that could be missed. Ok to close :)
[07:05:50] SRE, ops-eqiad, decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (Dzahn) When making an unrelated change in DNS I got an unexpected diff, the removal of asw-a-eqiad.mgmt.eqiad.wmnet, which made me hesitate. I asked around to make sure it's ok and eventuall...
[07:07:35] SRE, Wikifeeds, serviceops, Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) We didn't have time to follow up but it may be worth an incident doc. The envoy issue that we faced may be either something that could b...
[07:07:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum1002.eqiad.wmnet
[07:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:01] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:11:27] (PS1) Dzahn: DHCP: add MAC address for durum1002 [puppet] - https://gerrit.wikimedia.org/r/719894 (https://phabricator.wikimedia.org/T290599)
[07:13:17] (CR) Dzahn: [C: +2] site: add durum1002 with insetup role [puppet] - https://gerrit.wikimedia.org/r/719886 (https://phabricator.wikimedia.org/T290599) (owner: Dzahn)
[07:13:24] (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[07:14:06] (PS2) Alexandros Kosiaris: k8s: Fix docker log-options key [puppet] - https://gerrit.wikimedia.org/r/719893 (https://phabricator.wikimedia.org/T289578)
[07:14:15] (CR) Dzahn: [C: +2] DHCP: add MAC address for durum1002 [puppet] - https://gerrit.wikimedia.org/r/719894 (https://phabricator.wikimedia.org/T290599) (owner: Dzahn)
[07:15:13] (CR) Alexandros Kosiaris: [C: +2] k8s: Fix docker log-options key [puppet] - https://gerrit.wikimedia.org/r/719893 (https://phabricator.wikimedia.org/T289578) (owner: Alexandros Kosiaris)
[07:15:21] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:15:34] (PS2) Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M, followup [puppet] - https://gerrit.wikimedia.org/r/719551 (https://phabricator.wikimedia.org/T289578)
[07:19:18] (Merged) jenkins-bot: Cleanup: Replace all format() calls on string literals with f-strings. [software/spicerack] - https://gerrit.wikimedia.org/r/719389 (owner: RLazarus)
[07:19:40] (CR) Filippo Giunchedi: [C: +2] sre.switchdc.services: Temporarily exclude swift [cookbooks] - https://gerrit.wikimedia.org/r/719556 (https://phabricator.wikimedia.org/T287539) (owner: Legoktm)
[07:19:54] (CR) Filippo Giunchedi: "Thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/719556 (https://phabricator.wikimedia.org/T287539) (owner: Legoktm)
[07:21:24] (CR) Filippo Giunchedi: profile: restart postgres on first install / bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705704 (owner: Filippo Giunchedi)
[07:23:01] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:23:51] (PS1) Muehlenhoff: Add to absent_ldap group [puppet] - https://gerrit.wikimedia.org/r/719901
[07:25:20] SRE, Traffic, vm-requests, Patch-For-Review: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T290599 (Dzahn) @ssingh durum1002 has been created and OS is already installed. It sits at login and "insetup" now.
[07:25:33] SRE, Traffic, vm-requests, Patch-For-Review: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T290599 (Dzahn) Open→Resolved a:Dzahn
[07:25:46] (CR) Alexandros Kosiaris: Remove user greta from admin/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:28:06] (CR) Dzahn: Remove user greta from admin/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:28:59] (CR) Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/719901" [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:30:41] (CR) Muehlenhoff: Remove user greta from admin/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:30:59] SRE, DBA, observability, Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (LSobanski) p:Triage→Medium
[07:33:59] (CR) Muehlenhoff: Remove user greta from admin/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:34:24] (CR) Alexandros Kosiaris: Remove user greta from admin/ (2 comments) [puppet] - https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: Alexandros Kosiaris)
[07:34:35] (CR) Alexandros Kosiaris: [C: +2] Add to absent_ldap group [puppet] - https://gerrit.wikimedia.org/r/719901 (owner: Muehlenhoff)
[07:34:40] (CR) Alexandros Kosiaris: [C: +2] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/719901 (owner: 10Muehlenhoff) [07:35:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, but since this is also used in filter_and_plot.py and merging it would break runs of that script with the current data we have generat" [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) (owner: 10Krinkle) [07:38:43] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10akosiaris) Graph with latency percentiles comparing baremetal against both the IPv6 etcd egress rule fixed version and the non fixed v... [07:40:21] (03PS3) 10Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) [07:41:26] (03CR) 10Muehlenhoff: Remove user greta from admin/ (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris) [07:42:52] (03CR) 10Alexandros Kosiaris: Remove user greta from admin/ (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719536 (https://phabricator.wikimedia.org/T290423) (owner: 10Alexandros Kosiaris) [07:48:52] 10SRE, 10serviceops: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (10akosiaris) >>! In T290444#7341138, @elukey wrote: > Fine for me, what i had in mind was an alert if a namespace showed evicted pods for too much time (say days) since it seemed to some s... 
[07:54:58] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab::backup move backup cronjobs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [07:55:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:56:11] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:01] 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10Kormat) Note that this isn't the only alert that works like this for DBs. We also have: - MariaDB Replica IO: - MariaDB Replica Lag: - MariaDB... [08:09:59] (03CR) 10Jelto: [V: 03+2 C: 03+2] remove backup crontab managed by Ansible [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/719041 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [08:12:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.59 [software/spicerack] - 10https://gerrit.wikimedia.org/r/719913 [08:13:09] !log run ansible change 719041 on gitlab2001 [08:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:11] (03PS1) 10Marostegui: tables_to_check.txt: Add pagelinks table [software] - 10https://gerrit.wikimedia.org/r/719919 [08:17:28] (03CR) 10Marostegui: [C: 03+2] tables_to_check.txt: Add pagelinks table [software] - 10https://gerrit.wikimedia.org/r/719919 (owner: 10Marostegui) [08:17:53] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:18:22] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.59 [software/spicerack] - 10https://gerrit.wikimedia.org/r/719913 
(owner: 10Volans) [08:19:23] (03CR) 10Marostegui: [V: 03+2 C: 03+2] tables_to_check.txt: Add pagelinks table [software] - 10https://gerrit.wikimedia.org/r/719919 (owner: 10Marostegui) [08:20:42] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 55.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:21:09] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:37] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:23:25] !log run ansible change 719041 on gitlab1001 [08:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:14] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.59 [software/spicerack] - 10https://gerrit.wikimedia.org/r/719913 (owner: 10Volans) [08:29:18] (03PS1) 10Volans: Upstream release v0.0.59 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/719922 [08:37:21] PROBLEM - Check systemd state on cp5006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:32] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.59 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/719922 (owner: 10Volans) [08:41:16] (03PS2) 10Elukey: role::aqs: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/719680 (owner: 10Milimetric) [08:42:10] (03CR) 10Elukey: [C: 03+2] role::aqs: update druid mw datasource [puppet] - 
10https://gerrit.wikimedia.org/r/719680 (owner: 10Milimetric) [08:44:12] (03CR) 10Elukey: "We realized afterwards that aqs::next is needed as well, a new patch will follow :)" [puppet] - 10https://gerrit.wikimedia.org/r/719680 (owner: 10Milimetric) [08:44:59] (03Merged) 10jenkins-bot: Upstream release v0.0.59 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/719922 (owner: 10Volans) [08:50:36] !log uploaded spicerack_0.0.59 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [08:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:53] (03PS1) 10Joal: role::aqs_next: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/719929 [08:52:03] btullis: --^ please :) [08:52:35] (03CR) 10Btullis: [C: 03+2] role::aqs_next: update druid mw datasource [puppet] - 10https://gerrit.wikimedia.org/r/719929 (owner: 10Joal) [08:55:09] ^submitted and merged on puppetmasters. Shall I run puppet on all of the aqs and then run the `sre.aqs.roll-restart` cookbook now? [08:55:41] yep [08:56:03] it should take care of all aqs nodes (even the newer ones) since the cumin alias is role::aqs or role::aqs_next [08:56:07] btullis: if possible use cumin1001 for now please, I'm testing something on 2002 (was about to log it) [08:56:36] volans: ack, will do, thanks. 
[08:56:37] !log upgrading spicerack on cumin2002 to test the new release [08:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:32] thank you [08:58:58] !log filippo@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=codfw [08:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:04] !log filippo@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [08:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:32] !log move swift traffic fully to codfw to rebalance eqiad - T287539 [08:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:36] T287539: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 [08:59:44] jynus Emperor FYI ^ [08:59:57] hashar: o/ hiii if you have a minute for https://gerrit.wikimedia.org/r/c/integration/config/+/719548 I'd be super grateful :) [09:00:16] godog, should I change anything of what I am doing? 
[09:01:09] (03CR) 10JMeybohm: "Currently blocked by:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [09:01:57] (03PS1) 10Jelto: gitlab::backup remove deprication warning and deletion of config backup [puppet] - 10https://gerrit.wikimedia.org/r/719930 (https://phabricator.wikimedia.org/T288324) [09:04:18] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [09:04:19] RECOVERY - Check systemd state on cp5006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:32] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31036/console" [puppet] - 10https://gerrit.wikimedia.org/r/719930 (https://phabricator.wikimedia.org/T288324) (owner: 10Jelto) [09:06:06] (03PS1) 10DCausse: alertmanager: set search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) [09:09:41] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - btullis@cumin1001 [09:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
- btullis@cumin1001 [09:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:43] jynus: I don't think so no, I'll kick off the rebalance in a little bit and I'm curious to see what kind of impact it has on the backups but I'm not expecting any significant disruption [09:14:14] ok, thanks [09:15:03] I will communicate, although backups speed varies normally quite a lot during the day, due to load and size-to-title trends [09:15:12] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:20:00 on sretest1001.eqiad.wmnet with reason: testing reboot via ipmi [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:15] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on sretest1001.eqiad.wmnet with reason: testing reboot via ipmi [09:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:29] !log rebooting sretest1001 to test ipmi reboot via spicerack [09:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:51] e.g. 
speed seems to be 50 files/s during UTC nights but 38 files/s during day [09:17:12] (matching a bit latencies at https://grafana.wikimedia.org/d/000000584/swift-4gs?viewPanel=27&orgId=1&from=1630574221360&to=1631178961360&var-DC=eqiad&var-prometheus=thanos) [09:17:57] (03PS2) 10Volans: sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 [09:19:03] !log filippo@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [09:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:08] !log filippo@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=eqiad [09:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:18] makes sense yeah [09:19:33] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Addshore) Should I file a very similar ticket for ldap/wmde, or include it in this ticket? or rather wait and see t... [09:22:27] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, all the field types make sense as far as I can see. Nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) (owner: 10Ayounsi) [09:23:17] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (10Aklapper) I'd probably wait first where this ticket might go, in order to have conversations in one place (and I'm... 
[09:23:59] 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission mc1027.eqiad.wmnet - https://phabricator.wikimedia.org/T281618 (10jijiki) a:03Cmjohnson [09:24:23] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719526 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:27:25] PROBLEM - Check systemd state on mw2332 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:25] (03CR) 10Filippo Giunchedi: [C: 04-1] "We shouldn't be adding templated shell scripts (e.g. https://phabricator.wikimedia.org/T254480) but rather their "config". I see at least " [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [09:30:48] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 (10fgiunchedi) [09:37:01] RECOVERY - Check systemd state on mw2332 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:21] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 (owner: 10Volans) [09:37:46] !log swift eqiad add ms-be10[64-67] with initial weight - T290546 [09:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:50] T290546: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 [09:39:49] (03Merged) 10jenkins-bot: sre.hosts.decommission: use asset tag if mgmt fails [cookbooks] - 10https://gerrit.wikimedia.org/r/719135 (owner: 10Volans) [09:40:03] (03PS1) 10Volans: sre.hosts.ipmi-password-reset: adapt to new IPMI [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 [09:42:29] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.ipmi-password-reset: adapt to new 
IPMI [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 (owner: 10Volans) [09:42:47] PROBLEM - Check systemd state on mw2332 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:18] (03PS2) 10Volans: sre.hosts.ipmi-password-reset: adapt to new IPMI [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 [09:46:24] 10SRE, 10DBA, 10observability, 10Datacenter-Switchover: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (10Kormat) >>! In T290591#7341079, @Marostegui wrote: > I like option #3 too, but I also have to say that option #1 is also quite important. Having... [09:46:55] !log volans@cumin2002 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [09:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:31] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts mc1027.eqiad.wmnet [09:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:52] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10hnowlan) [09:51:10] (03PS1) 10Volans: sre.hosts.decommission: adapt to new IPMI API [cookbooks] - 10https://gerrit.wikimedia.org/r/719947 [09:52:06] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10hnowlan) [09:52:41] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10hnowlan) 05Open→03Resolved a:03hnowlan [09:54:21] RECOVERY - Check systemd state on mw2332 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:38] (03CR) 10Elukey: "Added some comments, mostly nits!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [09:54:43] (03PS1) 10Kormat: mariadb: Page for read-only status issues in both DCs [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) [09:56:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx2002.wikimedia.org [09:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] (03PS2) 10Kormat: mariadb: Page for read-only status issues in both DCs [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) [09:57:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 (owner: 10Volans) [09:58:03] (03CR) 10Volans: [C: 03+2] sre.hosts.ipmi-password-reset: adapt to new IPMI [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 (owner: 10Volans) [09:58:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/719947 (owner: 10Volans) [10:00:04] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1000). [10:00:53] (03CR) 10Kormat: [C: 04-2] "Needs discussion about whether this can/should be merged before the DC switchover..." 
[puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) (owner: 10Kormat) [10:02:05] (03Merged) 10jenkins-bot: sre.hosts.ipmi-password-reset: adapt to new IPMI [cookbooks] - 10https://gerrit.wikimedia.org/r/719946 (owner: 10Volans) [10:02:43] (03PS2) 10Volans: sre.hosts.decommission: adapt to new IPMI API [cookbooks] - 10https://gerrit.wikimedia.org/r/719947 [10:05:54] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: adapt to new IPMI API [cookbooks] - 10https://gerrit.wikimedia.org/r/719947 (owner: 10Volans) [10:06:39] (03PS3) 10Kormat: mariadb: Page for read-only status issues in both DCs [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) [10:08:48] (03Merged) 10jenkins-bot: sre.hosts.decommission: adapt to new IPMI API [cookbooks] - 10https://gerrit.wikimedia.org/r/719947 (owner: 10Volans) [10:09:01] (03CR) 10Kormat: [V: 03+1 C: 04-2] "PCC SUCCESS (NOOP 6 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31038/console" [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) (owner: 10Kormat) [10:09:25] (03CR) 10Marostegui: "Does this also exclude x2 from paging? as that one is supposed to be writable on the passive as well, making it a snowflake :(" [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) (owner: 10Kormat) [10:10:43] !log volans@cumin2002 START - Cookbook sre.hosts.decommission for hosts mc1027.eqiad.wmnet [10:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:44] (03CR) 10DCausse: "Adding the team, mainly to discuss if we want a separate channel for the alerts to avoid polluting the #wikimedia-search channel." [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [10:16:48] Hello everyone. 
I'd like to have access to the alerts dashboard but I get a "Service access denied due to missing privileges." error upon login. Could anyone please help me with the required privileges here? [10:17:19] PROBLEM - Check systemd state on mx2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:21] PROBLEM - Check whether ferm is active by checking the default input chain on mx2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:18:27] PROBLEM - spamassassin on mx2002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [10:18:55] scherukuwada: I would recommend you create a phabricator task and tag SRE [10:19:37] ^mx2002 is expected, testing things [10:19:52] Thank you, will do. [10:20:00] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1027.eqiad.wmnet [10:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:06] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2002 for hosts: `mc1027.eqiad.wmnet` - mc1027.eqiad.wmnet (... [10:20:24] effie: FYI ^^^ all yours! 
:) thanks for lending it for testing the new feature, now the decom cookbook falls back to the asset tag mgmt FQDN in case the hostname-based one is already gone [10:20:44] <3<3 [10:22:03] !log upgrading spicerack on cumin1001 [10:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:36] ACKNOWLEDGEMENT - DNS on mc1026.mgmt is CRITICAL: Domain mc1026.mgmt.eqiad.wmnet was not found by the server Effie Mouzeli Host has been decommd - The acknowledgement expires at: 2021-10-10 10:22:01. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:41] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) @Cmjohnson You can now remove any of the remaining hosts any given time, thank you! [10:25:48] (03PS1) 10Marostegui: echo_tables_to_check.txt: Similar to tables_to_check.txt [software] - 10https://gerrit.wikimedia.org/r/719968 [10:26:28] (03PS2) 10Marostegui: echo_tables_to_check.txt: Similar to tables_to_check.txt [software] - 10https://gerrit.wikimedia.org/r/719968 [10:27:50] (03CR) 10Marostegui: [C: 03+2] echo_tables_to_check.txt: Similar to tables_to_check.txt [software] - 10https://gerrit.wikimedia.org/r/719968 (owner: 10Marostegui) [10:28:04] (03PS1) 10Volans: sre.hosts.decommission: use italic for warnings [cookbooks] - 10https://gerrit.wikimedia.org/r/719969 [10:29:34] 10SRE, 10Observability-Alerting: New Web Readers Member Requests AlertManager Access - https://phabricator.wikimedia.org/T290643 (10SCherukuwada) [10:30:52] (03PS4) 10Volans: sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) [10:35:49] (03PS1) 10Effie Mouzeli: scaffold: add auto_prepend_file option for PHP [deployment-charts] - 10https://gerrit.wikimedia.org/r/719973 [10:36:17] (03CR) 10jerkins-bot: [V: 04-1] scaffold:
add auto_prepend_file option for PHP [deployment-charts] - 10https://gerrit.wikimedia.org/r/719973 (owner: 10Effie Mouzeli) [10:36:54] (03CR) 10Jbond: [C: 03+1] prometheus: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719526 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:37:53] (03CR) 10MMandere: [C: 03+2] prometheus: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719526 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:38:40] (03PS1) 10Vgutierrez: haproxy: Add H2 performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/719974 (https://phabricator.wikimedia.org/T290005) [10:40:35] (03PS2) 10Vgutierrez: haproxy: Add H2 performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/719974 (https://phabricator.wikimedia.org/T290005) [10:40:50] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [10:43:57] (03PS1) 10Muehlenhoff: Disable new config validation on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) [10:45:15] !log Removing peering to AS24218 at Equinix Singapore (cr3-eqsin) - network no longer uses this ASN. [10:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:56] !log Removing peering to old IPs of AS139931 (BSCCL) at Equinix Singapore (cr3-eqsin). 
[10:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet [10:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:48] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master [10:48:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on maps1007.eqiad.wmnet with reason: Resyncing from master [10:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1007.eqiad.wmnet [10:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [10:52:51] (03PS1) 10Volans: management: deprecate the Management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/719976 [10:53:12] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [10:54:01] (03PS2) 10Volans: management: deprecate the Management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/719976 [10:55:41] (03CR) 10Btullis: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [10:57:04] (03PS1) 10MMandere: thanos: Add drmrs DC Site 
[puppet] - 10https://gerrit.wikimedia.org/r/719977 (https://phabricator.wikimedia.org/T282787) [10:57:49] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:00:04] Amir1, Lucas_WMDE, and apergos: My dear minions, it's time we take the moon! Just kidding. Time for EU Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1100). [11:00:50] here, no patches scheduled in the window and no one signed up for the training [11:13:56] awww [11:33:03] joe: I have noticed docker-pkg lacks a 3.0.3 git tag :D `git tag 3.0.3 66b22ed && git push --tags` should do it ;) [11:35:59] (03CR) 10Ssingh: durum: switch to client-side UUID generation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: 10Ssingh) [11:36:07] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.49% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:40:35] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (owner: 10JMeybohm) [11:59:33] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/719977 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:02:51] (03PS4) 10Alexandros Kosiaris: Update jabram's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) [12:03:12] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "User confirmed on slack." 
[puppet] - 10https://gerrit.wikimedia.org/r/719530 (https://phabricator.wikimedia.org/T290433) (owner: 10Alexandros Kosiaris) [12:04:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace JAbrams' old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10akosiaris) 05Open→03Resolved a:03akosiaris User identity confirmed via slack, change merged. @JAbrams, give it 30mins, that's the amount of time this requi... [12:06:11] RECOVERY - spamassassin on mx2002 is OK: PROCS OK: 1 process with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [12:08:17] 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) >>! In T290445#7341141, @elukey wrote: > "but I still don't have a complete understanding of why this happened :)" I think that's th... [12:11:55] PROBLEM - spamassassin on mx2002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [12:33:56] (03PS1) 10Jbond: P:sre::check_user: update dependencies and ass proxy support [puppet] - 10https://gerrit.wikimedia.org/r/720004 (https://phabricator.wikimedia.org/T244792) [12:34:56] hmmm s/ass/add? [12:37:38] (03CR) 10Elukey: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [12:38:42] jbond: --^ +1 :D [12:43:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31041/console" [puppet] - 10https://gerrit.wikimedia.org/r/720004 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:43:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) [12:43:30] (03CR) 10Muehlenhoff: [C: 03+2] Add logoutd script for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/708479 (https://phabricator.wikimedia.org/T287566) (owner: 10Majavah) [12:43:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:sre::check_user: update dependencies and ass proxy support [puppet] - 10https://gerrit.wikimedia.org/r/720004 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:44:29] oops [12:44:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:30] ah snap we should have commented [12:46:33] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 28 Oct 2021 09:00:44 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:47:11] I like it as-is! good for the next dramatic reading of tasks [12:50:35] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2414 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:55:38] (03PS5) 10Effie Mouzeli: Decommission old eqiad memcached hosts [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) [12:55:40] (03CR) 10Volans: [C: 03+2] "Pure cosmetic change of the message logged into phabricator, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/719969 (owner: 10Volans) [12:56:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:15] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10akosiaris) p:05Triage→03Medium [12:56:17] (03CR) 10Volans: [C: 03+2] "As agreed merging to start testing it, being in the experimental/ directory should make clear that is not yet fully ready." [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [12:57:13] (03CR) 10Effie Mouzeli: [C: 03+2] Decommission old eqiad memcached hosts [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) (owner: 10Effie Mouzeli) [12:57:41] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:21] (03Merged) 10jenkins-bot: sre.hosts.decommission: use italic for warnings [cookbooks] - 10https://gerrit.wikimedia.org/r/719969 (owner: 10Volans) [12:59:46] (03Merged) 10jenkins-bot: sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [13:00:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:20] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [13:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:02] (PS1) Elukey: Set lvs_setup status for the inference service [puppet] - https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) [13:05:42] SRE, LDAP-Access-Requests, WMF-NDA-Requests: When WMF staff requests to be added to ldap/wmf, also add their Phabricator account to #WMF-NDA - https://phabricator.wikimedia.org/T290605 (akosiaris) p:Triage→Medium [13:05:50] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [13:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [13:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:06] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [13:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [13:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:16] (PS1) Volans: sre.experimental: add a title to the directory [cookbooks] - https://gerrit.wikimedia.org/r/720011 [13:09:30] SRE, Wikimedia-Mailing-lists, I18n: mailman3 encoding issues on unsubscription emails - 
https://phabricator.wikimedia.org/T290613 (akosiaris) p:Triage→Medium Hi @MarcoAurelio can you please clarify what the issue is? e.g. was something in the part that is marked as "[redacted]" badly encoded?... [13:11:56] !log planet1002 - re-enabling disabled puppet [13:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:41] (CR) Volans: [C: +2] "Just adding the directory title, self-merging" [cookbooks] - https://gerrit.wikimedia.org/r/720011 (owner: Volans) [13:14:02] SRE, ops-eqiad, decommission-hardware: Decommission asw-a-eqiad - https://phabricator.wikimedia.org/T218734 (Cmjohnson) @dzahn, I think that was a case of, he didn't realize mgmt addresses were still associated with the old network gear. Thanks for the heads up [13:15:14] (PS1) Ssingh: wikidough: add parking for wikidough.{net,org} [dns] - https://gerrit.wikimedia.org/r/720013 [13:16:03] (Merged) jenkins-bot: sre.experimental: add a title to the directory [cookbooks] - https://gerrit.wikimedia.org/r/720011 (owner: Volans) [13:16:19] (PS1) Muehlenhoff: wikitech logout.d: Fix section in .ini file [puppet] - https://gerrit.wikimedia.org/r/720014 (https://phabricator.wikimedia.org/T287566) [13:16:24] (CR) Dzahn: [C: +1] "heh, you got the domains. whois looks good" [dns] - https://gerrit.wikimedia.org/r/720013 (owner: Ssingh) [13:16:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:02] (PS1) Jbond: P::mariadb::grants::production: Add user to allow queries from check_user script [puppet] - https://gerrit.wikimedia.org/r/720015 (https://phabricator.wikimedia.org/T259746) [13:20:37] (PS5) JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/719295 [13:20:39] (PS6) JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - https://gerrit.wikimedia.org/r/719296 [13:20:41] (PS2) JMeybohm: Rakefile: Remove check_docker. It's already in utils.rb [deployment-charts] - https://gerrit.wikimedia.org/r/719539 [13:20:43] (PS4) JMeybohm: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) [13:22:02] (CR) JMeybohm: Rakefile: Add tasks to test and diff admin_ng (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) (owner: JMeybohm) [13:24:21] (CR) Ssingh: wikidough: add parking for wikidough.{net,org} (1 comment) [dns] - https://gerrit.wikimedia.org/r/720013 (owner: Ssingh) [13:25:24] (CR) Dzahn: [C: +1] ".org is better anyways for a Wikipedia-related project :)" [dns] - https://gerrit.wikimedia.org/r/720013 (owner: Ssingh) [13:28:51] (CR) Elukey: Rakefile: Add tasks to test and diff admin_ng (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) (owner: JMeybohm) [13:30:49] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/720015 (https://phabricator.wikimedia.org/T259746) (owner: Jbond) [13:31:10] (PS1) Dzahn: planet: remove ad.huikeshoven feed [puppet] - https://gerrit.wikimedia.org/r/720018 [13:31:40] (CR) Elukey: [C: +1] "LGTM (limited 
understanding of this Rakefile, but the overall result looks good)" [deployment-charts] - https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) (owner: JMeybohm) [13:32:46] (CR) Ssingh: [C: +2] wikidough: add parking for wikidough.{net,org} [dns] - https://gerrit.wikimedia.org/r/720013 (owner: Ssingh) [13:38:30] (PS1) Effie Mouzeli: tegola-vector-tiles: use v0.3 templates [deployment-charts] - https://gerrit.wikimedia.org/r/720019 [13:40:17] (PS2) Herron: Disable new config validation on Bullseye [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [13:40:29] (CR) Herron: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [13:41:08] (CR) Jbond: [C: +1] thanos: Add drmrs DC Site [puppet] - https://gerrit.wikimedia.org/r/719977 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [13:43:08] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/720014 (https://phabricator.wikimedia.org/T287566) (owner: Muehlenhoff) [13:43:13] (PS2) Dzahn: planet: remove ad.huikeshoven feed [puppet] - https://gerrit.wikimedia.org/r/720018 (https://phabricator.wikimedia.org/T289984) [13:45:20] (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719976 (owner: Volans) [13:45:37] (CR) Muehlenhoff: [C: +2] wikitech logout.d: Fix section in .ini file [puppet] - https://gerrit.wikimedia.org/r/720014 (https://phabricator.wikimedia.org/T287566) (owner: Muehlenhoff) [13:48:52] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host mx2002.wikimedia.org [13:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:48] (CR) MMandere: [C: +2] thanos: Add drmrs DC Site [puppet] - https://gerrit.wikimedia.org/r/719977 
(https://phabricator.wikimedia.org/T282787) (owner: MMandere) [13:50:33] (CR) Jelto: "I added some first thoughts. I think the way you specify the helmBinary will have no effect. So you have to specify the binary in every he" [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [13:52:23] (CR) Herron: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [13:54:41] (CR) Volans: [C: +2] management: deprecate the Management class [software/spicerack] - https://gerrit.wikimedia.org/r/719976 (owner: Volans) [13:54:56] (CR) Dzahn: [C: +2] planet: remove ad.huikeshoven feed [puppet] - https://gerrit.wikimedia.org/r/720018 (https://phabricator.wikimedia.org/T289984) (owner: Dzahn) [13:58:59] (PS1) Vgutierrez: haproxy: Add PROXY protocol support [puppet] - https://gerrit.wikimedia.org/r/720021 (https://phabricator.wikimedia.org/T290005) [14:00:06] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez) [14:01:17] (Merged) jenkins-bot: management: deprecate the Management class [software/spicerack] - https://gerrit.wikimedia.org/r/719976 (owner: Volans) [14:02:14] (CR) Muehlenhoff: Disable new config validation on Bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [14:03:11] (PS3) Muehlenhoff: Disable new config validation on Bullseye [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) [14:05:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [14:08:25] (PS8) Elukey: Add revscoring-editquality as 
first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [14:08:27] (PS6) Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [14:08:31] (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [14:10:16] (CR) Muehlenhoff: [C: +2] Disable new config validation on Bullseye [puppet] - https://gerrit.wikimedia.org/r/719975 (https://phabricator.wikimedia.org/T286911) (owner: Muehlenhoff) [14:11:39] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet [14:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] (PS1) Volans: sre.experimental.reimage: fix cumin query [cookbooks] - https://gerrit.wikimedia.org/r/720024 [14:15:03] RECOVERY - spamassassin on mx2002 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [14:16:01] RECOVERY - Check systemd state on mx2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:44] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [14:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:29] (CR) Jcrespo: [C: -1] "While adding a separate user would be the right way for any new application access control (e.g. 
we do it for backups), this wouldn't be i" [puppet] - https://gerrit.wikimedia.org/r/720015 (https://phabricator.wikimedia.org/T259746) (owner: Jbond) [14:25:24] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 2 hosts [14:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 2 hosts [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:41] RECOVERY - Check whether ferm is active by checking the default input chain on mx2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:27:11] (PS4) Kormat: mariadb: Page for read-only status issues in both DCs [puppet] - https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) [14:27:38] SRE, Infrastructure-Foundations, CAS-SSO, Patch-For-Review, User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (MoritzMuehlenhoff) [14:27:54] SRE, Infrastructure-Foundations, wikitech.wikimedia.org, cloud-services-team (Kanban): Add logout.d script for Wikitech - https://phabricator.wikimedia.org/T287566 (MoritzMuehlenhoff) Open→Resolved @Majavah, I've merged your patch and confirmed that it works fine :-) Thanks! [14:28:04] SRE, DBA, observability, Datacenter-Switchover, Patch-For-Review: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591 (Kormat) CR on hold until {T290665} is fixed, as otherwise the PCC output is incorrect. 
[14:31:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx2002.wikimedia.org [14:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx2002.wikimedia.org [14:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] (PS5) Ssingh: durum: switch to client-side UUID generation [puppet] - https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) [14:36:33] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31043/console" [puppet] - https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: Ssingh) [14:37:22] (CR) Ssingh: [V: +1] "rebased; no code change." [puppet] - https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: Ssingh) [14:37:24] (CR) Ssingh: [V: +1 C: +2] durum: switch to client-side UUID generation [puppet] - https://gerrit.wikimedia.org/r/719538 (https://phabricator.wikimedia.org/T289536) (owner: Ssingh) [14:38:32] (Abandoned) Jbond: P::mariadb::grants::production: Add user to allow queries from check_user script [puppet] - https://gerrit.wikimedia.org/r/720015 (https://phabricator.wikimedia.org/T259746) (owner: Jbond) [14:41:30] (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [14:42:00] (PS1) Volans: ipmi: improve dry-run mode for force_pxe() [software/spicerack] - https://gerrit.wikimedia.org/r/720027 [14:46:12] (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 
(https://phabricator.wikimedia.org/T286791) (owner: Elukey) [14:47:21] !log planet - deleting all state and lock files for the "en" feeds (T285251 T289984) [14:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:29] T285251: Wikimedia Planet not showing any external blog posts - https://phabricator.wikimedia.org/T285251 [14:51:54] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/720024 (owner: Volans) [14:59:12] .win 7 [14:59:17] nope [15:01:21] Phabricator is dead Encountered a processing exception, then another exception when trying to build a response for the first exception. [15:01:21] - PhabricatorClusterStrandedException: Unable to establish a connection to any database host (while trying "phabricator_feed"). All masters and replicas are completely unreachable. [15:01:21] AphrontConnectionLostQueryException: #2006: MySQL server has gone away [15:01:30] I'd file a phab task, but ¯\_(ツ)_/¯ [15:02:35] SRE are aware and are working on it [15:02:50] We noticed ~15 minutes ago ;) [15:04:45] should be fixed now [15:04:58] AntiComposite: expecting a recovery [15:05:07] works from here [15:05:10] great [15:05:25] RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (68 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [15:06:27] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for scherukuwada - https://phabricator.wikimedia.org/T290661 (jcrespo) [15:08:15] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for scherukuwada - https://phabricator.wikimedia.org/T290661 (jcrespo) Combining both tickets, my guess if you are asking one or the other is probably going to be wmf, but the clinic duty person will be able to guide you best, as we are refining the dif... 
[15:08:55] (PS1) Ssingh: durum: update reference to instructions in uuidv4.js [puppet] - https://gerrit.wikimedia.org/r/720043 [15:10:29] (CR) Ssingh: [C: +2] durum: update reference to instructions in uuidv4.js [puppet] - https://gerrit.wikimedia.org/r/720043 (owner: Ssingh) [15:10:36] (CR) Volans: [C: +2] sre.experimental.reimage: fix cumin query [cookbooks] - https://gerrit.wikimedia.org/r/720024 (owner: Volans) [15:10:39] (CR) Brennen Bearnes: [C: +1] gitlab::backup remove deprication warning and deletion of config backup [puppet] - https://gerrit.wikimedia.org/r/719930 (https://phabricator.wikimedia.org/T288324) (owner: Jelto) [15:13:39] (Merged) jenkins-bot: sre.experimental.reimage: fix cumin query [cookbooks] - https://gerrit.wikimedia.org/r/720024 (owner: Volans) [15:15:03] (PS1) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:16:14] (PS2) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:17:44] (CR) Elukey: Add revscoring-editquality as first ml-service to helmfile.d (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [15:20:55] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [15:21:11] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [15:21:33] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [15:22:37] RECOVERY - 
Wikitech-static main page has content on labweb1002 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Mon 15 Nov 2021 10:29:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static [15:22:53] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Mon 15 Nov 2021 10:29:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static [15:23:13] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: OK - Certificate wikitech-static.wikimedia.org will expire on Mon 15 Nov 2021 10:29:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikitech-static [15:24:01] (CR) Elukey: kubernetes: add revscoring-editquality in the services configs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [15:24:20] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [15:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:36] jouncebot now [15:24:37] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [15:25:02] (PS1) Dzahn: Revert "planet: remove ad.huikeshoven feed" [puppet] - https://gerrit.wikimedia.org/r/719694 [15:25:32] (CR) Ahmon Dancy: [C: +2] "Thanks Dave" [mediawiki-config] - https://gerrit.wikimedia.org/r/719610 (owner: Dave Pifke) [15:26:43] (Merged) jenkins-bot: pipeline: add comment redirecting to correct file [mediawiki-config] - https://gerrit.wikimedia.org/r/719610 (owner: Dave Pifke) [15:26:44] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for scherukuwada - https://phabricator.wikimedia.org/T290661 (Aklapper) [15:28:53] !log dancy@deploy1002 Synchronized .pipeline/config.yaml: Config: [[gerrit:719610|pipeline: add comment redirecting to correct file]] (duration: 00m 59s) [15:28:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:49] (CR) Ahmon Dancy: fpm-multiversion-base: add php-excimer extension (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/719609 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke) [15:32:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1001.eqiad.wmnet [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 94 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:40:51] (CR) Ebernhardson: "I'd generally prefer not polluting the discussion channel. 
Once a channel is taken over by bots talking actual communication tends to be m" [puppet] - https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: DCausse) [15:41:42] (CR) Jbond: [C: +1] ipmi: improve dry-run mode for force_pxe() [software/spicerack] - https://gerrit.wikimedia.org/r/720027 (owner: Volans) [15:41:47] (PS1) Volans: sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 [15:42:10] SRE, Traffic, vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (ssingh) [15:42:14] (CR) Volans: [C: +2] ipmi: improve dry-run mode for force_pxe() [software/spicerack] - https://gerrit.wikimedia.org/r/720027 (owner: Volans) [15:43:45] (CR) Jbond: [C: +1] sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 (owner: Volans) [15:44:05] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 40 probes of 618 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:46:04] (PS2) Volans: sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 [15:46:46] (CR) Jbond: [C: +1] sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 (owner: Volans) [15:47:45] (PS1) Ssingh: site: update roles for durum[12345]00[12] [puppet] - https://gerrit.wikimedia.org/r/720052 (https://phabricator.wikimedia.org/T290672) [15:47:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:48:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces 
up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:48:05] SRE, JavaScript, Maps (Kartographer): Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (MSantos) [15:49:48] (Merged) jenkins-bot: ipmi: improve dry-run mode for force_pxe() [software/spicerack] - https://gerrit.wikimedia.org/r/720027 (owner: Volans) [15:51:06] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in... [15:51:31] (Abandoned) Ssingh: site: update roles for durum[12345]00[12] [puppet] - https://gerrit.wikimedia.org/r/720052 (https://phabricator.wikimedia.org/T290672) (owner: Ssingh) [15:52:03] (CR) Volans: [C: +2] sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 (owner: Volans) [15:52:16] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for scherukuwada - https://phabricator.wikimedia.org/T290661 (SCherukuwada) If it's relevant, please note my shell account username is different. It's "saisuman". [15:54:53] (Merged) jenkins-bot: sre.experimental.reimage: fix file mode [cookbooks] - https://gerrit.wikimedia.org/r/720051 (owner: Volans) [15:58:26] (CR) Hashar: zuul: migrate cron of zuul_repack to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [16:00:04] legoktm, rzl, jelto, and arnoldokoth: Time to snap out of that daydream and deploy Switch Datacenter --live-test. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1600). 
[16:00:05] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:17] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [16:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:27] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (Cmjohnson) cloudcephosd1021 and 1022 disks are correct in non-raid. I did fix bios setting for 1021, it was set to continuously boot to the NIC. Installing again... [16:01:23] use the following command to view the live-test dc switch on cumin1001: tmux attach -t dc-switch-live-test [16:02:29] rzl <3 thanks for jouncebot patch [16:03:06] \o/ [16:04:25] (PS1) Krinkle: Add parse_light bench [software/benchmw] - https://gerrit.wikimedia.org/r/720055 (https://phabricator.wikimedia.org/T280497) [16:05:49] PROBLEM - Host mc1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:35] (PS1) Jbond: P:sre::check_user: add support for wikitech querys [puppet] - https://gerrit.wikimedia.org/r/720056 [16:10:04] (CR) jerkins-bot: [V: -1] P:sre::check_user: add support for wikitech querys [puppet] - https://gerrit.wikimedia.org/r/720056 (owner: Jbond) [16:10:38] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1001.eqiad.wmnet [16:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:48] (PS1) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml [puppet] - https://gerrit.wikimedia.org/r/720057 [16:12:19] (PS2) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml [puppet] - https://gerrit.wikimedia.org/r/720057 
[16:12:36] (CR) Bstorm: [C: +1] tool: Read name prefix from /etc/wmcs-project [software/tools-manifest] - https://gerrit.wikimedia.org/r/718004 (https://phabricator.wikimedia.org/T290325) (owner: Majavah) [16:12:44] (PS1) Inductiveload: Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [16:13:40] (PS4) Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) [16:13:50] (CR) Bstorm: kubeadm: psp: base-pod-security-policies.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720057 (owner: Arturo Borrero Gonzalez) [16:14:05] (CR) jerkins-bot: [V: -1] Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: Inductiveload) [16:14:27] (PS1) Jbond: realm.pp: only check numa fact if it exists [puppet] - https://gerrit.wikimedia.org/r/720060 [16:15:13] (PS3) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml [puppet] - https://gerrit.wikimedia.org/r/720057 [16:15:29] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/720060 (owner: Jbond) [16:15:57] (CR) Bstorm: [C: +1] kubeadm: psp: base-pod-security-policies.yaml [puppet] - https://gerrit.wikimedia.org/r/720057 (owner: Arturo Borrero Gonzalez) [16:16:13] (PS2) Krinkle: Add parse_light bench [software/benchmw] - https://gerrit.wikimedia.org/r/720055 [16:16:17] (Abandoned) Krinkle: Add parse_light bench [software/benchmw] - https://gerrit.wikimedia.org/r/720055 (owner: Krinkle) [16:16:25] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: psp: base-pod-security-policies.yaml [puppet] - https://gerrit.wikimedia.org/r/720057 (owner: Arturo Borrero 
Gonzalez) [16:17:01] (CR) Krinkle: "OK, that's no use though, this is for mw-k8s. I'll leave the old keys then." [software/benchmw] - https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) (owner: Krinkle) [16:18:41] (CR) Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: Milimetric) [16:22:34] SRE, ops-eqiad, decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (Cmjohnson) [16:23:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:55] SRE, ops-eqiad, decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (Cmjohnson) Open→Resolved removed from rack and updated netbox [16:24:03] SRE, ops-eqiad, decommission-hardware, serviceops: decommission mc1027.eqiad.wmnet - https://phabricator.wikimedia.org/T281618 (Cmjohnson) [16:24:13] SRE, ops-eqiad, decommission-hardware, serviceops: decommission mc1027.eqiad.wmnet - https://phabricator.wikimedia.org/T281618 (Cmjohnson) removed from rack and updated netbox [16:24:41] SRE, ops-eqiad, decommission-hardware, serviceops: decommission mc1027.eqiad.wmnet - https://phabricator.wikimedia.org/T281618 (Cmjohnson) Open→Resolved [16:26:37] (PS3) Krinkle: Fix 'load' title, add 'rl_startup', add 'parse_light' [software/benchmw] - https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) [16:26:39] (PS1) Krinkle: Update bench urls and improve url labels [software/benchmw] - https://gerrit.wikimedia.org/r/720061 [16:30:57] (PS3) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 
10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [16:31:15] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: mailman3 encoding issues on unsubscription emails - https://phabricator.wikimedia.org/T290613 (10MarcoAurelio) Hello @akosiaris. The whole subscriber/account name is a random sequence of letters and symbols, not only the "redacted" part. I've forwarded the email I go... [16:33:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:54] (03PS1) 10Elukey: Add kubernetes services ML tokens [labs/private] - 10https://gerrit.wikimedia.org/r/720062 [16:36:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (10Cmjohnson) [16:36:46] (03PS4) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [16:37:15] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add kubernetes services ML tokens [labs/private] - 10https://gerrit.wikimedia.org/r/720062 (owner: 10Elukey) [16:37:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated [16:37:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 (10Cmjohnson) [16:38:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1007.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T289118 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated [16:38:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (10Cmjohnson) [16:38:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated [16:39:23] (03PS1) 10Cwhite: o11y: add rsyslog alerts [alerts] - 10https://gerrit.wikimedia.org/r/720063 [16:40:05] (03PS2) 10Cwhite: o11y: add rsyslog alerts [alerts] - 10https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) [16:40:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (10Cmjohnson) [16:41:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:37] (03PS5) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [16:42:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated [16:43:33] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10Cmjohnson) 05Open→03Resolved I am resolving this task, if the issue persists please re-open [16:45:57] (03PS6) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [16:46:07] (03PS2) 10Inductiveload: Add IA-Upload tool 
domains to Commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) [16:47:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31046/console" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [16:47:22] (03PS1) 10DCausse: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) [16:49:17] (03CR) 10Elukey: [V: 03+1] "Pcc looks reasonable, names of variables etc.. will of course change depending on what we prefer. But the idea is to clearly split tokens " [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [16:52:25] (03PS2) 10Bstorm: wikireplicas: reduce the innodb_buffer_pool_size for s1 [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) [16:52:57] (03CR) 10Andrew Bogott: [C: 03+1] wikireplicas: reduce the innodb_buffer_pool_size for s1 [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [16:55:32] (03CR) 10Bstorm: wikireplicas: reduce the innodb_buffer_pool_size for s1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [16:55:43] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10odimitrijevic) Approved [16:57:06] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] ` [16:57:24] !log start 
cookbook sre.switchdc.mediawiki eqiad codfw --live-test this will generate some additional SAL logs here [16:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:23] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [16:58:26] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [16:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:42] (03PS2) 10DCausse: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) [16:58:48] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [16:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:46] (03CR) 10Marostegui: [C: 03+1] wikireplicas: reduce the innodb_buffer_pool_size for s1 [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [17:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1700). [17:00:19] (03CR) 10Marostegui: [C: 03+1] "If you want me to restart MySQL tomorrow early UTC morning just let me know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [17:00:41] (03CR) 10JMeybohm: admin_ng/main: Create istio-system namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 (owner: 10JMeybohm) [17:02:39] (03CR) 10Bstorm: wikireplicas: reduce the innodb_buffer_pool_size for s1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [17:02:45] (03CR) 10JMeybohm: [C: 03+2] Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) (owner: 10JMeybohm) [17:02:47] (03CR) 10Bstorm: [C: 03+2] wikireplicas: reduce the innodb_buffer_pool_size for s1 [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [17:02:49] (03CR) 10JMeybohm: [C: 03+2] Rakefile: Remove check_docker. It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539 (owner: 10JMeybohm) [17:02:52] (03CR) 10JMeybohm: [C: 03+2] admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 (owner: 10JMeybohm) [17:02:56] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 (owner: 10JMeybohm) [17:02:59] (03CR) 10JMeybohm: [C: 03+2] custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) (owner: 10JMeybohm) [17:03:03] (03CR) 10JMeybohm: [C: 03+2] Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 (owner: 10JMeybohm) [17:03:59] (03PS2) 10Majavah: aptrepo: drop k8s 1.18 updates [puppet] - 10https://gerrit.wikimedia.org/r/719401 [17:04:02] (03PS3) 10Majavah: aptrepo: drop k8s 1.18 repo 
[puppet] - 10https://gerrit.wikimedia.org/r/719402 [17:04:16] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [17:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:53] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [17:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:01] (03CR) 10JMeybohm: [C: 03+1] kube_env: Give usage when no arguments are passed [puppet] - 10https://gerrit.wikimedia.org/r/719562 (owner: 10Hnowlan) [17:05:24] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) 05Open→03Resolved Opengear's response was for me to update the f/w. It appears to be a newer version than the one Robh had installed. The newest version is cm71xx-4.11.0.flash and has... [17:05:40] about to test-run the appserver warmups in eqiad (passive DC) -- some appserver latency alerts are expected and are OK [17:05:43] (03Merged) 10jenkins-bot: Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 (owner: 10JMeybohm) [17:06:04] (03Merged) 10jenkins-bot: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) (owner: 10JMeybohm) [17:06:52] ACKNOWLEDGEMENT - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 91.67% of data under the critical threshold [5.0] Hnowlan Misleading alert, will be removed. 
https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [17:07:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [17:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:55] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [17:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:08] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [17:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] (03Merged) 10jenkins-bot: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 (owner: 10JMeybohm) [17:09:13] (03Merged) 10jenkins-bot: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 (owner: 10JMeybohm) [17:09:15] (03Merged) 10jenkins-bot: Rakefile: Remove check_docker. 
It's already in utils.rb [deployment-charts] - 10https://gerrit.wikimedia.org/r/719539 (owner: 10JMeybohm) [17:12:28] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [17:12:28] !log jelto@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2021-09-09 17:12:27.974410 [17:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:39] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [17:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:48] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [17:12:55] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [17:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [17:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:33] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [17:13:36] (03Merged) 10jenkins-bot: Rakefile: Add tasks to test and diff admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/719540 (https://phabricator.wikimedia.org/T266670) (owner: 10JMeybohm) [17:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:43] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [17:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:50] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [17:13:51] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [17:13:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:00] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [17:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [17:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:11] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [17:14:12] !log jelto@cumin1001 [DRY-RUN] MediaWiki read-only period ends at: 2021-09-09 17:14:12.502162 [17:14:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [17:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:18] (03CR) 10Marostegui: wikireplicas: reduce the innodb_buffer_pool_size for s1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719758 (https://phabricator.wikimedia.org/T290630) (owner: 10Bstorm) [17:17:48] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [17:20:56] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners [17:20:58] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-envoy-on-jobrunners (exit_code=0) [17:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:16] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:44] !log jelto@cumin1001 END (PASS) - Cookbook 
sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [17:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:13] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters [17:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0) [17:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [17:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:25] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [17:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:48] !log jelto@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [17:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:00] !log jelto@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [17:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:06] (03CR) 10JMeybohm: [C: 03+1] envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:39:46] (03CR) 10JMeybohm: [C: 03+1] envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:41:50] (03CR) 10Legoktm: "Ping! What's missing to move this forward? It has 3 +1s already." 
[puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [17:43:52] (03CR) 10Brennen Bearnes: [C: 04-1] gitlab / idp: open gitlab access to all users [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [17:46:18] (03CR) 10Brennen Bearnes: [C: 04-1] gitlab / idp: open gitlab access to all users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [17:49:14] (03PS1) 10Ebernhardson: query_service: Remove parts of gui from backend servers [puppet] - 10https://gerrit.wikimedia.org/r/720070 [17:49:52] (03CR) 10Ebernhardson: [C: 04-1] "We can't deploy this yet, patch serves as a reminder to clean this up when ready." [puppet] - 10https://gerrit.wikimedia.org/r/720070 (owner: 10Ebernhardson) [17:53:20] (03PS1) 10Urbanecm: Standardize indentation in several .yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720073 [17:53:22] (03PS1) 10Urbanecm: Deploy Growth features in dark modes to ~200 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720074 (https://phabricator.wikimedia.org/T290582) [17:53:25] jouncebot: next [17:53:25] In 5 hour(s) and 6 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T2300) [17:53:32] jouncebot: now [17:53:32] For the next 0 hour(s) and 6 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T1700) [17:53:44] (03PS1) 10RLazarus: sre.switchdc.mediawiki: Move some cookbooks from phase 8 to 9 [cookbooks] - 10https://gerrit.wikimedia.org/r/720075 [17:55:12] (03PS2) 10RLazarus: sre.switchdc.mediawiki: Move some cookbooks from phase 8 to 9 [cookbooks] - 10https://gerrit.wikimedia.org/r/720075 [18:00:04] (03PS1) 10Volans: sre.experimental.reimage: refactor PuppetDB update [cookbooks] - 
10https://gerrit.wikimedia.org/r/720076 [18:01:12] * urbanecm goes to deploy something [18:01:16] (03CR) 10Urbanecm: [C: 03+2] Standardize indentation in several .yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720073 (owner: 10Urbanecm) [18:02:22] (03Merged) 10jenkins-bot: Standardize indentation in several .yaml files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720073 (owner: 10Urbanecm) [18:03:11] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth features in dark modes to ~200 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720074 (https://phabricator.wikimedia.org/T290582) (owner: 10Urbanecm) [18:04:20] (03Merged) 10jenkins-bot: Deploy Growth features in dark modes to ~200 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720074 (https://phabricator.wikimedia.org/T290582) (owner: 10Urbanecm) [18:05:06] !log urbanecm@deploy1002 Synchronized wmf-config/config: no-op: 76c51f2753aed9dc8e06b63de6657c3c94371a3c: Standardize indentation in several .yaml files (duration: 00m 58s) [18:05:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:16] (03PS1) 10Ebernhardson: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) [18:06:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:06] (03CR) 10Ebernhardson: "plausibly we'll want to wait until the service is setup enough that wcqs.discovery.wmnet resolves before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [18:10:13] (03PS1) 10Cwhite: o11y: add logstash alerts [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) [18:10:56] (03PS3) 10Legoktm: sre.switchdc.mediawiki: Move some cookbooks from phase 8 to 9 [cookbooks] - 10https://gerrit.wikimedia.org/r/720075 (https://phabricator.wikimedia.org/T290677) (owner: 10RLazarus) [18:11:11] !log Run extensions/WikimediaMaintenance/createExtensionTables.php growthexperiments for wikis in P17258 (T290582) [18:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:16] T290582: Deploy Growth features to all remaining active versions of Wikipedia - https://phabricator.wikimedia.org/T290582 [18:12:37] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/initWikiConfig.php --phab=T290582 | tee ~/initwikiconfig.out # T290582 [18:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:35] (03CR) 10Legoktm: [C: 03+2] "I'll update the wiki page for the new phase." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/720075 (https://phabricator.wikimedia.org/T290677) (owner: 10RLazarus) [18:16:36] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [18:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:22] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [18:17:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:36] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [18:18:38] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: REIMAGE [18:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:42] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Move some cookbooks from phase 8 to 9 [cookbooks] - 10https://gerrit.wikimedia.org/r/720075 (https://phabricator.wikimedia.org/T290677) (owner: 10RLazarus) [18:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:17] !log urbanecm@deploy1002 sync-file aborted: 6af38d951f0ef9af369e2172c175628dc6e9a281: Deploy Growth features in dark modes to ~200 wikis (T290582) (duration: 00m 05s) [18:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:21] T290582: Deploy Growth features to all remaining active versions of Wikipedia - https://phabricator.wikimedia.org/T290582 [18:20:38] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. 
[18:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [18:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:20] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 6af38d951f0ef9af369e2172c175628dc6e9a281: Deploy Growth features in dark modes to ~200 wikis (T290582; 1/3) (duration: 00m 58s) [18:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:37] !log urbanecm@deploy1002 Synchronized wmf-config/config/: 6af38d951f0ef9af369e2172c175628dc6e9a281: Deploy Growth features in dark modes to ~200 wikis (T290582; 2/3) (duration: 01m 01s) [18:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:25] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6af38d951f0ef9af369e2172c175628dc6e9a281: Deploy Growth features in dark modes to ~200 wikis (T290582; 3/3) (duration: 00m 57s) [18:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:31] (03PS5) 10Legoktm: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [18:27:10] (03CR) 10Legoktm: [C: 04-1] "PS5 is a manual rebase. Setting -1 on behalf of kormat." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [18:29:26] (03PS6) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) [18:29:28] (03CR) 10Tjones: alertmanager: set search-platform team (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [18:29:30] (03PS1) 10Urbanecm: Growth: Push 44 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720082 (https://phabricator.wikimedia.org/T289680) [18:30:19] (03CR) 10Urbanecm: [C: 03+2] Growth: Push 44 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720082 (https://phabricator.wikimedia.org/T289680) (owner: 10Urbanecm) [18:31:12] (03Merged) 10jenkins-bot: Growth: Push 44 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720082 (https://phabricator.wikimedia.org/T289680) (owner: 10Urbanecm) [18:32:11] (03PS2) 10Brennen Bearnes: gitlab cas: update uid field to use uid not CN [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [18:33:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bc4f20437868b39ae2cc4eac8735ecb8bcd93157: Growth: Push 44 wikis out of dark mode (T289680) (duration: 00m 57s) [18:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:19] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [18:34:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:17] * urbanecm done [18:37:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:37] (03PS1) 10Legoktm: sre.switchdc: Switch mwdebug as part of MediaWiki, not services [cookbooks] - 10https://gerrit.wikimedia.org/r/720085 (https://phabricator.wikimedia.org/T290676) [18:49:33] 10SRE, 10Wikimedia-Mailing-lists, 10SecTeam-Processed, 10Security, 10Upstream: lists.wikimedia.org allows unsubscribing other users without prior confirmation (CVE-2021-40347) - https://phabricator.wikimedia.org/T289798 (10Legoktm) DSA issued: https://lists.debian.org/debian-security-announce/2021/msg001... [18:50:11] legoktm: no need to do anything about it now, but will we eventually need separate mwdebug-ro and mwdebug-rw services? [18:51:28] I doubt that the "mwdebug" name will ever get those, but I assume mw-on-k8s will introduce more service names [18:51:42] (03CR) 10RLazarus: [C: 03+1] sre.switchdc: Switch mwdebug as part of MediaWiki, not services [cookbooks] - 10https://gerrit.wikimedia.org/r/720085 (https://phabricator.wikimedia.org/T290676) (owner: 10Legoktm) [18:51:48] nod [18:53:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (10Papaul) [18:55:56] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10RobH) I've doublechecked the settings on cloudcephosd1021 and also updated its firmware (newer releases in the last couple weeks) all to no avail, it is still qui... 
[18:56:31] SRE, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-8), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ldelench_wmf) Appreciate everyone's help with this! @ArielGlenn this came up at a CommTech retro th...
[18:59:36] (PS2) Ebernhardson: Deploy query_service microsite for wcqs [puppet] - https://gerrit.wikimedia.org/r/717630
[19:00:58] I'll merge it after lunch and dry-run the service cookbook to make sure it gets excluded
[19:04:04] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:48] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:13:16] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:14:50] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:14:58] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[19:18:51] SRE, ops-eqiad, DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (Cmjohnson) Added asset tags to all of the switches
[19:29:15] (PS6) Krinkle: ci: Add 'bulleye' to docker lsbdistcodename hack [puppet] - https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[19:31:11] (PS2) Ebernhardson: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247)
[19:32:27] (PS2) Cwhite: o11y: add logstash alerts [alerts] - https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726)
[19:37:53] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:49] (PS1) Cwhite: logging: clean up legacy logstash alerts [puppet] - https://gerrit.wikimedia.org/r/720093 (https://phabricator.wikimedia.org/T288726)
[19:40:34] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:45:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:54:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:54:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:02:16] (PS7) Krinkle: ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774)
[20:05:16] SRE, Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (TheDJ) Note, it’s September !
[20:07:47] (CR) Bstorm: "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/717732 (https://phabricator.wikimedia.org/T284774) (owner: Krinkle)
[20:31:15] SRE, serviceops, wikidiff2, Community-Tech (CommTech-Sprint-8), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ArielGlenn) >>! In T285857#7343025, @ldelench_wmf wrote: > Appreciate everyone's help with this! @A...
[20:32:51] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org
[20:34:25] (PS1) Ryan Kemper: wdqs: remove codfw hourly restarts [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330)
[20:35:53] (PS2) Ryan Kemper: wdqs: remove codfw hourly restarts [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330)
[20:36:28] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper)
[20:38:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:42:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:49:14] (PS3) Inductiveload: Add IA-Upload tool domains to Commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241)
[20:52:51] (Traffic bill over quota) resolved: (2) Traffic bill over quota - https://alerts.wikimedia.org
[20:58:54] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:43] (CR) Krinkle: "I believe something like it is still needed, but it's only noticed when creating a new instance which doesn't happen all that often. I ran" [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[21:16:36] ops-ulsfo, DC-Ops, Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (RobH)
[21:16:44] ops-ulsfo, DC-Ops, Traffic: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (RobH)
[21:17:54] (PS2) Ebernhardson: blazegraph: LVS for WCQS step 1 [puppet] - https://gerrit.wikimedia.org/r/713959
[21:17:56] (CR) Ebernhardson: blazegraph: LVS for WCQS step 1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713959 (owner: Ebernhardson)
[21:17:59] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[21:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:25] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/720110
[21:27:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[21:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:09] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul) @ayounsi I have SONIC on all 4 switches .
[21:31:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:53] (CR) Herron: "still a wip but please see comment inline" [puppet] - https://gerrit.wikimedia.org/r/720110 (owner: Herron)
[21:44:40] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s1.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:58] ACKNOWLEDGEMENT - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s1.service andrew bogott T290630 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:09:10] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:24:54] SRE, MW-on-K8s, serviceops, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (dpifke)
[22:44:43] (CR) Legoktm: [C: +2] sre.switchdc: Switch mwdebug as part of MediaWiki, not services [cookbooks] - https://gerrit.wikimedia.org/r/720085 (https://phabricator.wikimedia.org/T290676) (owner: Legoktm)
[22:47:01] (CR) Krinkle: Unset logo config rather than set to false (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/719619 (owner: Jdlrobson)
[22:48:33] (Merged) jenkins-bot: sre.switchdc: Switch mwdebug as part of MediaWiki, not services [cookbooks] - https://gerrit.wikimedia.org/r/720085 (https://phabricator.wikimedia.org/T290676) (owner: Legoktm)
[22:53:26] SRE, ops-eqiad, DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (RobH) Update from sub-task and IRC discussions: Updating the firmware is the first thing to try. However, the system is currently experiencing tran...
[23:00:05] brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210909T2300).
[23:07:00] !log no takers on patches, ending backport & config training window.
[23:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:25] (CR) Legoktm: [C: +2] mailman: Remove absented file definitions [puppet] - https://gerrit.wikimedia.org/r/719484 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[23:56:26] elukey: I'm merging your labs/private change on the puppetmaster