[00:03:08] is there a server in codfw that I can run a browser on? [00:03:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:03:28] I think if I really want to test these races, I need either a browser window in codfw, or bpirkle [00:04:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:04:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:05:13] who lives like 1ms away from codfw [00:05:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:07:03] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [00:08:20] I'm like 150ms away, so I guess I do not count :p [00:08:25] (03CR) 10Krinkle: [C: 03+2] multiversion: Remove unused $cacheDir and writeToStaticCache (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818645 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:09:36] (03Merged) 10jenkins-bot: multiversion: Remove unused $cacheDir and writeToStaticCache (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818645 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:10:11] Maybe a bit longer than 1ms, but certainly closer than TimStarling :) [00:10:44] TimStarling: iirc we only run chromium and friends in a container these days, but a Ganeti VM might be fairly easy to create and install a few debian packages in like chromium or firefox-esr and whatever else you need. It's too bad we don't yet have Codfw WMCS regions (I think?) [00:11:10] Yup, eqiad only. [00:11:44] looking at the debug log, it would be hard to really test it, since as Krinkle predicted, I already had an edit token in my session [00:11:49] context... 
[00:12:39] the test is to enable the hotcat gadget and do a category edit on test.wikipedia.org. I did this with my browser and it shows a GET request for an edit token, followed immediately by a POST request for the edit [00:13:31] the GET request is served by codfw and the POST is served by eqiad, so it looks like a race [00:13:52] but in the debug log, there is no session write in the GET request, only a session read [00:14:09] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [00:14:18] maybe if I wiped my session cookie and then did the edit, it would get closer to testing it [00:15:04] but maybe that's too synthetic, we want to know if real users will hit issues [00:15:09] afaik the csrf tokens we serve from the API are all derived the same way as the UI ones, so it'd have to be a race with... well, it'd have to be early after login or directly after e.g. a session expiry being re-created from the remember-me cookie or some such and then race the lazy-init of the secret between the RTT from the GET request, and the dispatching of the next POST. [00:15:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:15:39] also, if hotcat uses mw.Api.js, then it'll re-try automatically [00:16:11] I can probably simulate it with curl [00:16:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:16:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:16:41] yeah, curl + jq and some basic to then use the edit token in some way, like api edit appendtext [00:16:48] bash* [00:17:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:18:36] apparently we log quite a few warning messages about LinkCache being asked about interwikis and special pages, e.g. 
WARNING "link to non-proper page: commons:User:CommonsDelinker" or "link to non-proper page: Speciaal:Contributions/Siebrand". Given that that seems very normal, that's either something to adjust to debug() or something someone thought the caller should check but doesn't. [00:18:54] noticed it when debugging nlwiki Main Page action=history [00:19:32] (03CR) 10Krinkle: [C: 03+2] multiversion: Remove unused $cacheDir (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818646 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:20:19] (03Merged) 10jenkins-bot: multiversion: Remove unused $cacheDir (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818646 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:22:42] well, Daniel thought it was Very Bad to ask LinkCache for a page that can't exist and that callers should be fixed [00:23:12] but of course Daniel is responsible for a lot of logspam for things like this [00:23:15] !log krinkle@deploy1002 Synchronized multiversion/: Ia3406eba4ab8bb (duration: 03m 22s) [00:23:33] like that global title backtrace that spams my logs every day [00:26:04] I tried a token request with curl from codfw to codfw, it took 528ms [00:26:46] (03PS1) 10Krinkle: CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) [00:27:12] too slow to really see an eqiad/codfw race [00:27:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:27:38] (03CR) 10CI reject: [V: 04-1] CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:27:54] TimStarling: assuming that the actual session bago write doesn't happen close to the end of the response, and/or that the session read doesn't happen in the first few ms of the next req.. yes. 
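The "curl + jq" idea from the discussion above (GET an edit token, then immediately POST an edit that uses it) can be sketched in Python roughly as follows. This is a minimal sketch, not what was actually run: the `USER_AGENT` string and the page title in the usage note are hypothetical, and a fresh cookie jar only gives an anonymous session, so loading a logged-in session cookie and pinning each request to a particular datacenter are left out.

```python
import json
import time
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

API = "https://test.wikipedia.org/w/api.php"  # the wiki named in the log
USER_AGENT = "token-race-sketch/0.1 (hypothetical contact)"  # assumption


def token_url(api=API):
    """URL for the GET request that fetches a CSRF (edit) token."""
    query = urllib.parse.urlencode(
        {"action": "query", "meta": "tokens", "type": "csrf", "format": "json"}
    )
    return f"{api}?{query}"


def edit_body(title, text, token):
    """Form-encoded body for the POST that appends text using that token."""
    return urllib.parse.urlencode(
        {
            "action": "edit",
            "title": title,
            "appendtext": text,
            "token": token,
            "format": "json",
        }
    ).encode("utf-8")


def run_once(title, text, api=API):
    """GET a token, then POST the edit right away with the same cookies.

    Returns (milliseconds spent on the token request, decoded edit response).
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = [("User-Agent", USER_AGENT)]

    start = time.monotonic()
    with opener.open(token_url(api)) as resp:
        token = json.load(resp)["query"]["tokens"]["csrftoken"]
    token_ms = (time.monotonic() - start) * 1000

    # Supplying data makes urllib send this request as a POST.
    request = urllib.request.Request(api, data=edit_body(title, text, token))
    with opener.open(request) as resp:
        result = json.load(resp)
    return token_ms, result
```

Calling `run_once("Sandbox", "\nrace test")` (hypothetical title) would perform a real edit, so it should only be pointed at a test wiki. The returned `token_ms` is the figure to compare with the 528ms mentioned above: if the token request alone takes that long, the gap before the POST is probably too wide to expose a cross-datacenter race.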
[00:28:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:28:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:28:35] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [00:29:15] (03PS2) 10Krinkle: CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) [00:29:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:30:14] (03CR) 10CI reject: [V: 04-1] CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:30:43] (03PS3) 10Krinkle: CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) [00:30:57] err. passed locally, but not when detached from the rest of it. 
[00:31:45] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [00:33:26] (03CR) 10Krinkle: [C: 03+2] CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:34:23] (03Merged) 10jenkins-bot: CirrusTest: Remove reference to 'unittest' realm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819203 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [00:35:53] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: Ieaea60a991e5 (duration: 03m 10s) [00:37:49] I'll switch over test2wiki now and we'll see if the performance is any better [00:39:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:39:34] TimStarling: ack, you mean testwiki from codfw to codfw vs eqiad isn't "better"? [00:39:58] (03PS2) 10Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821) [00:40:02] (03PS3) 10Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) [00:40:17] (03PS4) 10Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) [00:40:27] (03PS3) 10Krinkle: noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818650 [00:40:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:40:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:40:43] (03PS2) 10Krinkle: multiversion: Move labs-overrides 
responsibility to getStaticConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932) [00:40:48] probably not, but I don't want to get too involved with performance testing of testwiki given how weird it is [00:41:03] ack [00:41:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:41:38] (03CR) 10Tim Starling: [C: 03+2] Switch test2.wikipedia.org to multi-DC local routing mode [puppet] - 10https://gerrit.wikimedia.org/r/818991 (owner: 10Tim Starling) [00:41:46] * Krinkle afk for 10-20min [00:43:38] "Revision range includes commits from multiple committers" [00:48:12] is it ok to deploy these commits, moritzm denisse|m ? [00:49:47] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [00:50:34] Hello TimStarling: , they're safe to commit. Thank you!! 
[00:52:52] thanks [00:55:00] running puppet agent -tv on netmon1002 for moritzm [01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T0100) [01:04:17] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:18:45] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:23:27] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:24:41] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Initial ab run: {P32126} [01:30:45] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:35:37] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:00] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding. 
Need to make sure the timer doesn't run on beta" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:07] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:00:03] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:07:13] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:23] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:07:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[02:07:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.23 [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819221 [02:07:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.23 [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819221 (owner: 10TrainBranchBot) [02:08:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:09:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:17:26] 10SRE-swift-storage, 10Commons, 10Tracking-Neverending: Thumbnail/imagescaler (tracking) - https://phabricator.wikimedia.org/T43371 (10Bawolff) [02:17:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:31] (03PS3) 10Krinkle: multiversion: Move labs-overrides responsibility to getStaticConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932) [02:23:39] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.23 [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819221 (owner: 10TrainBranchBot) [02:24:27] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:29:13] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:29:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:30:27] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:31:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:31:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:33:28] (03CR) 10Tim Starling: [C: 03+1] multiversion: Untangle MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [02:36:41] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2059.codfw.wmnet with OS bullseye [02:36:49] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2059.codfw.wmnet with OS bullseye [02:38:47] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:46:03] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.5 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:46:15] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [02:48:23] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Search [02:53:19] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:53:55] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage [02:54:18] !log [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service` to clear `Query Service HTTP Port` && `WDQS SPARQL` alerts [02:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [02:55:17] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:56:13] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [02:56:27] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [02:58:13] (03CR) 10Tim Starling: [C: 03+1] multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [02:58:19] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2059.codfw.wmnet with reason: host reimage [02:59:19] (03PS3) 10Krinkle: multiversion: Untangle 
MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821) [02:59:21] (03PS5) 10Krinkle: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) [02:59:23] (03PS4) 10Krinkle: noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818650 [02:59:25] (03PS4) 10Krinkle: multiversion: Move labs-overrides responsibility to getStaticConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932) [02:59:27] (03PS1) 10Krinkle: multiversion: Remove unused readFromStaticCache() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819222 (https://phabricator.wikimedia.org/T169821) [02:59:29] (03PS1) 10Krinkle: CommonSettings.php: Remove side-effect from getConfigGlobals (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819223 (https://phabricator.wikimedia.org/T169821) [02:59:31] (03PS1) 10Krinkle: CommonSettings.php: Remove side-effect from getConfigGlobals (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819224 (https://phabricator.wikimedia.org/T169821) [03:01:56] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [03:01:58] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2059 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [03:03:06] (03CR) 10Krinkle: [C: 03+2] multiversion: Untangle MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:04:36] (03Merged) 10jenkins-bot: multiversion: Untangle 
MWConfigCacheGenerator from CS.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818648 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:05:34] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [03:06:02] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2059 is OK: SSL OK - Certificate search.discovery.wmnet valid until 2027-01-23 13:10:52 +0000 (expires in 1635 days) https://wikitech.wikimedia.org/wiki/Search [03:06:14] RECOVERY - Elasticsearch HTTPS for production-search-codfw on elastic2059 is OK: SSL OK - Certificate search.discovery.wmnet valid until 2027-01-23 13:10:52 +0000 (expires in 1635 days) https://wikitech.wikimedia.org/wiki/Search [03:10:44] (03CR) 10Krinkle: [C: 04-1] Move CirrusSearch settings from IS.php to ext-CirrusSearch.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [03:11:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:12:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:12:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:13:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:14:10] !log krinkle@mwmaint1002 pull aborted: (duration: 01m 31s) [03:14:13] !log krinkle@mwmaint2002 pull aborted: (duration: 01m 36s) [03:15:11] ^ the above is due to scap-pull not working on maint hosts as of last week, which I just learned about, due to php-fpm restarts being unconditional it seems. 
[03:15:13] Workaround: `scap pull --no-php-restart` [03:15:25] !log krinkle@deploy1002 Synchronized multiversion/: Ieaea60a991e5611 (duration: 03m 03s) [03:15:41] (03CR) 10Krinkle: [C: 03+2] multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:16:38] (03Merged) 10jenkins-bot: multiversion: Untangle MWConfigCacheGenerator from CS.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818649 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:18:30] (03CR) 10Krinkle: [C: 03+2] noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818650 (owner: 10Krinkle) [03:19:31] (03Merged) 10jenkins-bot: noc: Re-use getConfigGlobals() in wiki.php viewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818650 (owner: 10Krinkle) [03:20:07] (03CR) 10Krinkle: [C: 03+2] multiversion: Move labs-overrides responsibility to getStaticConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [03:20:09] (03CR) 10Krinkle: [C: 03+2] multiversion: Remove unused readFromStaticCache() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819222 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:20:57] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host elastic2059.codfw.wmnet with OS bullseye [03:21:03] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2059.codfw.wmnet with OS bullseye completed: - elastic2059 (... 
[03:21:07] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2059.codfw.wmnet with OS bullseye executed with errors: - el... [03:21:37] (03Merged) 10jenkins-bot: multiversion: Move labs-overrides responsibility to getStaticConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818651 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [03:21:40] (03Merged) 10jenkins-bot: multiversion: Remove unused readFromStaticCache() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819222 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:21:44] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I39a2b86065 (duration: 03m 19s) [03:22:19] !log krinkle@mwmaint1002 pull aborted: (duration: 00m 11s) [03:24:04] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:24:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:25:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:25:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:26:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:27:28] !log krinkle@deploy1002 Synchronized multiversion/: I6e97d39a3, Ib843ebced31 (duration: 03m 30s) [03:27:56] (03CR) 10Krinkle: [C: 03+2] CommonSettings.php: Remove side-effect from getConfigGlobals (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819223 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:29:08] (03Merged) 10jenkins-bot: CommonSettings.php: Remove side-effect from getConfigGlobals (1/2) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/819223 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:31:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:32:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:32:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:33:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:34:38] !log krinkle@deploy1002 Synchronized wmf-config/: I9b89c0ff5c2 (duration: 03m 32s) [03:34:44] (03CR) 10Krinkle: [C: 03+2] CommonSettings.php: Remove side-effect from getConfigGlobals (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819224 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:36:11] (03Merged) 10jenkins-bot: CommonSettings.php: Remove side-effect from getConfigGlobals (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819224 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [03:38:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:38:30] okay, that's enough syncs for today. 
if I keep going, we'll start to affect the load.php average latency since we're wiping apcu every time https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&viewPanel=45 [03:39:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:39:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:40:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:40:56] !log krinkle@deploy1002 Synchronized multiversion/: I0802db272695 (duration: 03m 10s) [03:45:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:46:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:46:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:47:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:54:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [03:59:38] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [04:00:26] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:02:38] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:04:08] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [04:04:58] !log [Elastic] Red cluster status in main codfw elasticsearch cluster (`https://search.svc.codfw.wmnet:9243`); culprit appears to be index `be_x_oldwiki_titlesuggest_1659407912`. Confusingly it has 2 replicas set so it's not clear to me how we got into this state starting from green (in the past we've gone into red status from indices that erroneously had 0 replicas in production) [04:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:44] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:07:10] !log [Elastic] Per `curl -s https://search.svc.codfw.wmnet:9243/_cat/aliases | grep -i be_x` I see `be_x_oldwiki_titlesuggest ` alias points to `be_x_oldwiki_titlesuggest_1658396688`. I think this means the red index is an old index from an in-progress reindex operation. I likely just need to delete `be_x_oldwiki_titlesuggest_1659407912` but doing some quick digging first [04:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:03] !log [Elastic] Blew away red index like so: `ryankemper@cumin1001:~$ curl -XDELETE https://search.svc.codfw.wmnet:9243/be_x_oldwiki_titlesuggest_1659407912`. Cluster is back to `green` status. 
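The alias check done above with `_cat/aliases | grep` can also be expressed as a small helper that lists which indices no alias points at. This is only a sketch: it assumes the default whitespace-separated `_cat/aliases` layout, with the alias in the first column and the index in the second, and the sample text is reconstructed from the names in the log rather than copied from real output.

```python
def unaliased_indices(cat_aliases_text, index_names):
    """Return the given index names that no alias currently points at."""
    aliased = set()
    for line in cat_aliases_text.splitlines():
        cols = line.split()
        # Assumed layout: column 0 is the alias, column 1 is the index.
        if len(cols) >= 2:
            aliased.add(cols[1])
    return [name for name in index_names if name not in aliased]


# Hypothetical sample line, built from the index names in the log.
sample = "be_x_oldwiki_titlesuggest be_x_oldwiki_titlesuggest_1658396688 - - - -\n"
candidates = [
    "be_x_oldwiki_titlesuggest_1658396688",
    "be_x_oldwiki_titlesuggest_1659407912",
]
print(unaliased_indices(sample, candidates))
```

An index turning up in this list is not proof that it is safe to delete. It only shows that live traffic is not being routed to it through an alias, which is the same evidence the log entry relied on.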
[04:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:26] !log [Elastic] Small amendment to my earlier statement; based off epoch time `be_x_oldwiki_titlesuggest_1659407912` was not an old index hanging around after a reindex operation, but rather the new one that the reindex operation was trying to create, but had not yet finished (therefore didn't switch over the aliases). It presumably got interrupted by the reimage of `elastic2059`. [04:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:50] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [04:32:48] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [04:38:45] (03PS1) 10KartikMistry: CX: Set MT threshold for publishing in Armenian WP to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819227 (https://phabricator.wikimedia.org/T313208) [04:45:02] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [05:12:00] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 101.3 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [05:19:22] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.4 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [05:24:13] !log dbmait x1@eqiad T314087 [05:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:24:17] T314087: Add primary key and drop unique index on cx_translators on wmf wikis - https://phabricator.wikimedia.org/T314087 [05:26:51] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10MoritzMuehlenhoff) [05:29:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P32127 and previous config saved to /var/cache/conftool/dbconfig/20220802-052923-root.json [05:30:10] (03PS1) 10Marostegui: db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819433 (https://phabricator.wikimedia.org/T314154) [05:34:19] (03PS1) 10Marostegui: db2088: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/819434 (https://phabricator.wikimedia.org/T313797) [05:35:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2088.codfw.wmnet [05:35:54] (03CR) 10Marostegui: [C: 03+2] db1181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819433 (https://phabricator.wikimedia.org/T314154) (owner: 10Marostegui) [05:36:24] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [05:36:32] PROBLEM - MariaDB read only s7 on db1181 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:36:58] PROBLEM - mysqld processes #page on db1181 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:37:24] PROBLEM - MariaDB Replica SQL: s7 #page on db1181 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:37:34] PROBLEM - MariaDB Replica IO: s7 #page on db1181 is CRITICAL: CRITICAL slave_io_state could not 
connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:37:48] <_joe_> uhhh [05:37:50] seems expected, just a case of notifications not getting properly disabled? [05:37:51] marostegui: all good? [05:38:14] <_joe_> yeah apparently puppet did not run in time [05:38:18] (given https://gerrit.wikimedia.org/r/819433) [05:38:26] hey [05:38:40] ah ok [05:38:41] No, downtiming isn't working [05:44:20] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [05:46:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:48:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:48:22] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2088.codfw.wmnet [05:49:47] (03CR) 10Marostegui: [C: 03+2] db2088: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/819434 (https://phabricator.wikimedia.org/T313797) (owner: 10Marostegui) [05:51:17] 10ops-codfw, 10decommission-hardware: decommission db2088 - https://phabricator.wikimedia.org/T313797 (10Marostegui) [05:51:33] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -... [05:52:15] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T0600). 
[06:07:11] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10MoritzMuehlenhoff) [06:16:39] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10MoritzMuehlenhoff) Happens since yesterday around 16:30 UTC: https://grafana.wikimedia.org/d/rsCfQfuZz/icinga?orgId=1&from=now-24h&to=now [06:24:38] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 100.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [06:31:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:40:45] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:46:01] (03PS1) 10Ladsgroup: pruneRevData: Make cleaning in larger batches [extensions/FlaggedRevs] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819077 (https://phabricator.wikimedia.org/T296380) [06:46:20] !log bounce icinga on alert1001 - T314353 [06:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:23] T314353: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 [06:46:56] (03CR) 10Ladsgroup: [C: 03+2] pruneRevData: Make cleaning in larger batches [extensions/FlaggedRevs] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819077 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [06:47:36] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) Indeed check max latency spiked up to 30+ min (!) 
around that time [06:51:21] (03Merged) 10jenkins-bot: pruneRevData: Make cleaning in larger batches [extensions/FlaggedRevs] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/819077 (https://phabricator.wikimedia.org/T296380) (owner: 10Ladsgroup) [06:53:55] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:55:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:55:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:56:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:56:24] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:819077|pruneRevData: Make cleaning in larger batches (T296380)]] (duration: 03m 26s) [06:56:28] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380 [06:58:10] !log restart rsyslog on ml-serve2006 [06:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:57] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:03:00] (03CR) 10Elukey: Revert "deployment_server: remove packages wrk, siege and lua-cjson" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819076 (owner: 10Dzahn) [07:04:48] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) @colewhite hi! Periodical ping to see if we can move forward with this task. IIRC there were some clients to move to the new bundle, what's the status? Thanks :) [07:05:33] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:06:31] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:08:16] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) > I looked more closely at one of them with tcpdump So are cross-DC connections happening in plain text? [07:09:02] (03CR) 10Slyngshede: [C: 03+1] "Looks good. The use of "service" is better than just a shell command." 
[puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [07:09:48] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Bump version number to 0.2 [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/819061 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [07:13:13] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10jcrespo) Is this just a duplicate/continuation of T196336 ? [07:18:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686 [07:18:10] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [07:18:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686 [07:20:27] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:22:36] !log bounce icinga on alert2001 - T314353 [07:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:39] T314353: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 [07:28:23] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) >>! In T314353#8122434, @jcrespo wrote: > Is this just a duplicate/continuation of T196336 ? It is definitely possible, I don't know ATM [07:29:52] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10jcrespo) Sanity check, not related to T164206#3434924: ` alert1001# df -h Filesystem Size Used Avail Use% Mounted on [...] none... 
[07:30:27] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) A restart of Icinga on alert1001 brought things back AFAICT, the "max check latency" (which is so far the one/only signal I found of sth being wrong) went back down within ~30 min of Icinga... [07:32:01] 10SRE, 10Observability-Alerting: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) [07:32:27] 10SRE, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1): Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10fgiunchedi) [07:33:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686 [07:33:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686 [07:33:23] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [07:34:00] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36563/console" [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [07:36:33] (03PS1) 10Marostegui: mariadb: Productionize db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819486 (https://phabricator.wikimedia.org/T311494) [07:39:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819486 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [07:41:39] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) 
memcached traffic. [07:44:52] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: cannot use multiple accounts - https://phabricator.wikimedia.org/T314251 (10Krd) I don't want to have the list content e-mail reroutet, but the list admin notifications. Is this also possible? [07:45:07] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) All cross-DC connections except the first had an associated statsd metric `MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh`, which imp... [07:46:29] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) >>! In T279664#8122493, @tstarling wrote: > @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at... [07:48:05] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: add script to grow the SSD partition for container databases [puppet] - 10https://gerrit.wikimedia.org/r/819095 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [07:49:21] (03PS2) 10Ladsgroup: mariadb: Set innodb_flush_neighbors to 0 in most dbs [puppet] - 10https://gerrit.wikimedia.org/r/815308 (https://phabricator.wikimedia.org/T313288) [07:49:56] !log upgrading drmrs ganeti clusters to 3.0.2 T312637 [07:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:59] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [07:50:49] (03CR) 10Marostegui: [C: 03+1] mariadb: Set innodb_flush_neighbors to 0 in most dbs [puppet] - 10https://gerrit.wikimedia.org/r/815308 (https://phabricator.wikimedia.org/T313288) (owner: 10Ladsgroup) [07:52:15] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Krd) I have changed the 
footers accordingly, and I think there is a good chance that this could solve a relevant part of the problem. Thank you! [07:57:11] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Set innodb_flush_neighbors to 0 in most dbs [puppet] - 10https://gerrit.wikimedia.org/r/815308 (https://phabricator.wikimedia.org/T313288) (owner: 10Ladsgroup) [08:01:03] 10SRE, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1): Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10Peachey88) [08:04:56] (03PS1) 10Marostegui: db2176-db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819492 (https://phabricator.wikimedia.org/T311494) [08:06:33] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:06:34] (03CR) 10Marostegui: [C: 03+2] db2176-db2182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819492 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [08:06:56] (03PS1) 10Slyngshede: Fix changelog formatting [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/819493 [08:07:23] (03CR) 10Slyngshede: [C: 03+2] Fix changelog formatting [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/819493 (owner: 10Slyngshede) [08:07:25] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Fix changelog formatting [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/819493 (owner: 10Slyngshede) [08:12:23] (03PS3) 10Slyngshede: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster [puppet] - 10https://gerrit.wikimedia.org/r/812290 [08:14:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [08:19:49] (03CR) 10Giuseppe Lavagetto: "See question inline: I think you can drop the change from profile::beta::mediawiki and just keep it into scap::target only." 
[puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:22:54] (03CR) 10Jaime Nuche: scap: allow `scap` user to login into deployment-prep scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:24:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: allow `scap` user to login into deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:24:49] (03CR) 10Slyngshede: c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [08:25:00] (03CR) 10Slyngshede: [C: 03+2] c:ganeti::prometheus Enable Prometheus exporter for Ganeti on Buster [puppet] - 10https://gerrit.wikimedia.org/r/812290 (owner: 10Slyngshede) [08:27:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: enable target bootstrap in beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/817762 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [08:38:14] (03PS1) 10Marostegui: mariadb: Disable notifications rack b5 [puppet] - 10https://gerrit.wikimedia.org/r/819497 (https://phabricator.wikimedia.org/T310070) [08:39:05] (03PS2) 10Marostegui: mariadb: Disable notifications rack b5 [puppet] - 10https://gerrit.wikimedia.org/r/819497 (https://phabricator.wikimedia.org/T310070) [08:39:17] (03CR) 10DCausse: [C: 03+1] Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 (owner: 10Ebernhardson) [08:39:47] (03CR) 10CI reject: [V: 04-1] mariadb: Disable notifications rack b5 [puppet] - 10https://gerrit.wikimedia.org/r/819497 (https://phabricator.wikimedia.org/T310070) (owner: 10Marostegui) [08:40:59] (03PS3) 10Marostegui: mariadb: Disable notifications rack b5 [puppet] - 
10https://gerrit.wikimedia.org/r/819497 (https://phabricator.wikimedia.org/T310070) [08:42:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications rack b5 [puppet] - 10https://gerrit.wikimedia.org/r/819497 (https://phabricator.wikimedia.org/T310070) (owner: 10Marostegui) [08:46:14] !log stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 T310070 [08:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:18] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [08:47:14] (03PS1) 10Marostegui: Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/819079 [08:47:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32130 and previous config saved to /var/cache/conftool/dbconfig/20220802-084740-root.json [08:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P32131 and previous config saved to /var/cache/conftool/dbconfig/20220802-084745-root.json [08:47:59] (03PS2) 10Muehlenhoff: osm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811232 (https://phabricator.wikimedia.org/T308013) [08:48:20] (03CR) 10Marostegui: [C: 03+2] Revert "db1181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/819079 (owner: 10Marostegui) [08:48:48] (03PS1) 10Marostegui: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/819080 [08:49:32] (03CR) 10Marostegui: [C: 03+2] Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/819080 (owner: 10Marostegui) [08:51:38] (03PS4) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [08:58:36] (03CR) 10Muehlenhoff: [C: 03+2] osm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811232 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:59:06] (03PS1) 10Daniel Kinzler: Remove temporary benchmark script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819499 [09:02:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32132 and previous config saved to /var/cache/conftool/dbconfig/20220802-090245-root.json [09:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P32133 and previous config saved to /var/cache/conftool/dbconfig/20220802-090250-root.json [09:04:56] (03CR) 10Elukey: [C: 03+2] Revert "deployment_server: remove packages wrk, siege and lua-cjson" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819076 (owner: 10Dzahn) [09:15:04] (03PS1) 10Marostegui: db2143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819502 (https://phabricator.wikimedia.org/T310070) [09:15:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2143', diff saved to https://phabricator.wikimedia.org/P32134 and previous config saved to /var/cache/conftool/dbconfig/20220802-091518-root.json [09:15:19] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:15:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:16:51] (03CR) 10Marostegui: [C: 03+2] db2143: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/819502 (https://phabricator.wikimedia.org/T310070) (owner: 10Marostegui) [09:17:47] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Marostegui) @Papaul all hosts in B5 have mysql off, you can power the hosts off as if you need. [09:17:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32135 and previous config saved to /var/cache/conftool/dbconfig/20220802-091749-root.json [09:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P32136 and previous config saved to /var/cache/conftool/dbconfig/20220802-091754-root.json [09:18:53] 10SRE, 10Performance-Team, 10Traffic, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Joe) >>! In T279664#8122378, @tstarling wrote: > A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimar... [09:19:03] (03PS1) 10Filippo Giunchedi: swift: account for 'bootable' and missing sector size in grow_ssd_part [puppet] - 10https://gerrit.wikimedia.org/r/819503 (https://phabricator.wikimedia.org/T314275) [09:19:35] (03CR) 10CI reject: [V: 04-1] swift: account for 'bootable' and missing sector size in grow_ssd_part [puppet] - 10https://gerrit.wikimedia.org/r/819503 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [09:20:05] (03PS1) 10Slyngshede: c:dynamicproxy move cronjob to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) [09:20:13] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:22:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [09:22:39] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [09:22:49] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:24:02] (03CR) 10Muehlenhoff: [C: 03+2] tilerator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811225 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:24:09] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:24:14] (03CR) 10Slyngshede: "This class doesn't appear to be used anywhere." [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:24:43] (03PS2) 10Filippo Giunchedi: swift: account for 'bootable' and missing sector size in grow_ssd_part [puppet] - 10https://gerrit.wikimedia.org/r/819503 (https://phabricator.wikimedia.org/T314275) [09:25:35] (03CR) 10Muehlenhoff: c:dynamicproxy move cronjob to systemd timer. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:25:53] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [09:26:27] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [09:26:58] (03PS1) 10Jbond: P:gerrit: add ipaddress to host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) [09:27:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36564/console" [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [09:28:30] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/819503 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [09:28:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [09:28:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[09:30:02] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [09:30:37] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: account for 'bootable' and missing sector size in grow_ssd_part [puppet] - 10https://gerrit.wikimedia.org/r/819503 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [09:30:49] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [09:31:27] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: fix re-registration issues [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) (owner: 10Jelto) [09:32:25] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:32:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32137 and previous config saved to /var/cache/conftool/dbconfig/20220802-093254-root.json [09:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P32138 and previous config saved to /var/cache/conftool/dbconfig/20220802-093259-root.json [09:35:13] 10SRE, 10Deployments, 10bacula, 10Parsoid (Tracking), 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10elukey) 05Open→03Resolved We can close this task and see if any clean up is needed in the follo... [09:36:01] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[09:37:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1019.eqiad.wmnet with OS bullseye [09:40:21] (03CR) 10Marostegui: [C: 03+1] auto_schema: Make alter non-blocking on master of primary dc [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [09:42:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [09:43:12] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [09:44:19] !log grow sdb3 by 100G on thanos-be2004 - T314275 [09:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:23] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 [09:44:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: Export sshkey for gerrit shared services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [09:45:37] (03PS20) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [09:47:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32139 and previous config saved to /var/cache/conftool/dbconfig/20220802-094759-root.json [09:48:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P32140 and previous config saved to /var/cache/conftool/dbconfig/20220802-094804-root.json [09:49:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[09:49:37] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage [09:51:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:52:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1019.eqiad.wmnet with reason: host reimage [09:52:56] 10SRE, 10vm-requests: Site: 2 VMs for failoid - https://phabricator.wikimedia.org/T280759 (10MoritzMuehlenhoff) 05Open→03Resolved This are in use since over a year, closing [09:53:08] (03PS1) 10Marostegui: instances.yaml: Depool db2079 [puppet] - 10https://gerrit.wikimedia.org/r/819509 (https://phabricator.wikimedia.org/T313885) [09:54:30] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Depool db2079 [puppet] - 10https://gerrit.wikimedia.org/r/819509 (https://phabricator.wikimedia.org/T313885) (owner: 10Marostegui) [09:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2079 from dbctl T313885', diff saved to https://phabricator.wikimedia.org/P32141 and previous config saved to /var/cache/conftool/dbconfig/20220802-095455-marostegui.json [09:54:59] T313885: decommission db2079 - https://phabricator.wikimedia.org/T313885 [09:56:34] (03PS5) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) [09:57:55] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Eqiad and codfw 1xVM per site for netboix - https://phabricator.wikimedia.org/T309029 (10jbond) 05Open→03Resolved This has been completed [09:58:17] (03PS1) 10Giuseppe Lavagetto: trafficserver: allow x-wikimedia-debug to pick a php backend [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) [10:02:05] (03PS1) 10Urbanecm: MentorTools: Ensure weight setter displays correct weight [extensions/GrowthExperiments] 
(wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819085 (https://phabricator.wikimedia.org/T314050) [10:02:39] (03CR) 10Urbanecm: [C: 03+2] ".23 is not at deployment host atm => merging" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819085 (https://phabricator.wikimedia.org/T314050) (owner: 10Urbanecm) [10:02:43] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10MatthewVernon) moss-fe2001 will need depooling in C2 before work on that rack starts. [10:03:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32143 and previous config saved to /var/cache/conftool/dbconfig/20220802-100304-root.json [10:03:09] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad/codfw: 1xVM per site for Netbox - https://phabricator.wikimedia.org/T309029 (10faidon) [10:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P32144 and previous config saved to /var/cache/conftool/dbconfig/20220802-100308-root.json [10:05:40] !log shutdown dbprov2002 backup2005 backup2008 T310070 [10:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:43] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [10:06:50] (03CR) 10Jbond: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 (owner: 10Jbond) [10:12:18] (03PS1) 10Marostegui: instances.yaml: Add db2175 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/819512 (https://phabricator.wikimedia.org/T311494) [10:12:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1019.eqiad.wmnet with OS bullseye [10:13:10] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2175 to dbctl [puppet] - 
10https://gerrit.wikimedia.org/r/819512 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [10:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2175 to s2 T311494', diff saved to https://phabricator.wikimedia.org/P32145 and previous config saved to /var/cache/conftool/dbconfig/20220802-101522-marostegui.json [10:15:26] T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494 [10:16:09] (03PS1) 10Marostegui: db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819513 (https://phabricator.wikimedia.org/T311494) [10:17:19] (03CR) 10Vgutierrez: [C: 04-1] C:varnish: fix varnish confd test data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond) [10:17:37] (03CR) 10Marostegui: [C: 03+2] db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819513 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [10:18:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P32146 and previous config saved to /var/cache/conftool/dbconfig/20220802-101813-root.json [10:19:33] ACKNOWLEDGEMENT - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:19:33] ACKNOWLEDGEMENT - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:19:33] ACKNOWLEDGEMENT - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:19:33] ACKNOWLEDGEMENT - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [10:20:38] (03PS1) 
10Marostegui: site.pp: Remove insetup from db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819514 [10:21:43] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819514 (owner: 10Marostegui) [10:23:35] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Introduce $wmgEntityUsageModifierLimitsStatement (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [10:24:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable usage tracking for statement for cebwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [10:26:37] (03Merged) 10jenkins-bot: MentorTools: Ensure weight setter displays correct weight [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819085 (https://phabricator.wikimedia.org/T314050) (owner: 10Urbanecm) [10:27:18] (03PS1) 10Marostegui: rename_su_normalized_T312972.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) [10:31:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:33:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P32147 and previous config saved to /var/cache/conftool/dbconfig/20220802-103318-root.json [10:34:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:34:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [10:34:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:35:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:37:01] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819518 [10:37:41] (03Abandoned) 10Ladsgroup: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819518 (owner: 10Gerrit maintenance bot) [10:39:34] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819519 [10:40:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) Hi Andrew, I'm unable to find any issue here. Looking at the cloud-in acl/filter on the CR routers there does is no rule that will block tra... [10:40:25] (03CR) 10CI reject: [V: 04-1] mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819519 (owner: 10Gerrit maintenance bot) [10:40:54] (03PS1) 10Marostegui: db1185-db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819520 (https://phabricator.wikimedia.org/T313569) [10:41:11] (03CR) 10Vgutierrez: [C: 03+1] "overall LGTM, please check inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto) [10:41:46] (03CR) 10Marostegui: [C: 03+2] db1185-db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/819520 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [10:42:16] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [10:42:37] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet 
[10:48:38] (03PS1) 10Filippo Giunchedi: swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) [10:49:04] (03CR) 10CI reject: [V: 04-1] swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [10:49:12] !log grow sda3 by 100G on thanos-be2004 - T314275 [10:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:15] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 [10:49:36] (03PS2) 10Filippo Giunchedi: swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) [10:50:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: T312626 btullis [10:50:21] T312626: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 [10:50:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1082.eqiad.wmnet with reason: T312626 btullis [10:50:32] (03CR) 10CI reject: [V: 04-1] swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [10:52:55] (03PS3) 10Filippo Giunchedi: swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) [10:54:02] (03PS1) 10WMDE-Fisch: Restore scrolling parameters into view with(out) sticky header [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819535 (https://phabricator.wikimedia.org/T312926) [10:55:01] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: allow x-wikimedia-debug to pick a php backend 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto) [10:55:22] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon) [10:55:34] Q: When the branch cut is done, but the branch not deployed yet (like .23 atm) backporting is as easy as merging the cherry-pick on that branch ... right? [10:55:52] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon) Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs from the active DC to fix up where MW fail... [10:57:49] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1122 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/819525 (https://phabricator.wikimedia.org/T314368) [10:57:53] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/819546 (https://phabricator.wikimedia.org/T314368) [11:00:47] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: allow x-wikimedia-debug to pick a php backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto) [11:04:13] PROBLEM - SSH on mw1324.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:07:00] (03PS1) 10FNegri: d/changelog: Prepare for 0.88 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 [11:14:11] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/819548 (https://phabricator.wikimedia.org/T314369) [11:14:16] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 
10https://gerrit.wikimedia.org/r/819549 (https://phabricator.wikimedia.org/T314369) [11:15:11] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/819550 (https://phabricator.wikimedia.org/T314370) [11:15:17] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/819551 (https://phabricator.wikimedia.org/T314370) [11:15:48] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10BTullis) @Cmjohnson - Apologies for all of the delay on this, I just kept missing you. I've now downtimed an-worker1082 for 3 days and I've shut it down already. If it's convenient yo... [11:17:33] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for EllenR - https://phabricator.wikimedia.org/T313821 (10ERayfield) Sorry, the person who can help me get the information you requested is out with covid. I will resubmit when I have the correct information. Regards, Ellen Rayfield [11:23:48] (03PS3) 10Jcrespo: Attempt to follow Wikimedia's Design Style Guide [software/pampinus] - 10https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017) [11:24:41] (03PS4) 10Jcrespo: Attempt to follow Wikimedia's Design Style Guide [software/pampinus] - 10https://gerrit.wikimedia.org/r/819025 (https://phabricator.wikimedia.org/T283017) [11:26:53] (03PS1) 10Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) [11:28:22] (03CR) 10CI reject: [V: 04-1] c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:29:34] (03PS2) 10Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) [11:30:44] (03CR) 10CI reject: [V: 04-1] c:spamassassin move Spamassassin updates from crontab to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:35:09] !log restart rsyslog on ml-serve1006 [11:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:36:58] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Restore scrolling parameters into view with(out) sticky header [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819535 (https://phabricator.wikimedia.org/T312926) (owner: 10WMDE-Fisch) [11:38:20] (03PS1) 10Elukey: ml-services: update editquality's en image to test ray workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/819555 (https://phabricator.wikimedia.org/T313915) [11:43:47] (03PS3) 10Slyngshede: c:spamassassin move Spamassassin updates from crontab to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) [11:43:54] (03CR) 10MVernon: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [11:44:12] (03PS2) 10Elukey: ml-services: update editquality's en image to test ray workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/819555 (https://phabricator.wikimedia.org/T313915) [11:44:29] (03CR) 10Majavah: d/changelog: Prepare for 0.88 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 (owner: 10FNegri) [11:46:30] !log dbmait s4@eqiad T314377 [11:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:33] T314377: Add primary key and drop unique index on revtag on wmf wikis - https://phabricator.wikimedia.org/T314377 [11:46:58] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: set fs label and run xfs_growfs on mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/819522 (https://phabricator.wikimedia.org/T314275) (owner: 10Filippo Giunchedi) [11:48:23] !log dbmait s7@eqiad T314377 [11:48:25] (03CR) 10Elukey: [C: 03+2] ml-services: update editquality's en image to test ray workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/819555 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [11:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1118 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/819557 (https://phabricator.wikimedia.org/T314380) [11:48:57] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/819558 (https://phabricator.wikimedia.org/T314380) [11:49:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36572/console" [puppet] - 
10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:50:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:51:57] (03CR) 10WMDE-Fisch: [C: 03+2] "Merging before the train starts in the late afternoon." [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819535 (https://phabricator.wikimedia.org/T312926) (owner: 10WMDE-Fisch) [11:54:46] !log dbmait s3@eqiad T314377 [11:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:49] T314377: Add primary key and drop unique index on revtag on wmf wikis - https://phabricator.wikimedia.org/T314377 [11:54:59] !log dbmait s8@eqiad T314377 [11:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:22] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond) [11:55:31] (03PS6) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) [11:55:43] marostegui: "dbmaint" so it will show up on https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance ? 
[11:55:55] PleaseStand: yep [11:56:16] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/819194 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [11:56:26] (03CR) 10CI reject: [V: 04-1] C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 (https://phabricator.wikimedia.org/T138093) (owner: 10Jbond) [11:57:12] marostegui: because I just saw the last three say "dbmait" not "dbmaint" [11:57:22] ah damn [11:57:27] good catch PleaseStand [11:57:33] !log dbmaint s8@eqiad T314377 [11:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:35] !log dbmaint s3@eqiad T314377 [11:57:37] !log dbmaint s7@eqiad T314377 [11:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:28] (03CR) 10Filippo Giunchedi: [C: 03+1] netmon: failover to netmon1003 [dns] - 10https://gerrit.wikimedia.org/r/819177 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [11:58:33] (03CR) 10Filippo Giunchedi: [C: 03+1] netmon: failover to netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/819179 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [11:58:39] (03PS6) 10Btullis: Add roles and cumin aliases for the new dse_k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) [12:01:12] !log dbmaint x1@eqiad T314087 [12:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:15] T314087: Add primary key and drop unique index on cx_translators on wmf wikis - https://phabricator.wikimedia.org/T314087 [12:02:33] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Make alter non-blocking on master of primary dc [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:03:26] (03Merged) 10jenkins-bot: auto_schema: 
Make alter non-blocking on master of primary dc [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:05:10] (03CR) 10Ladsgroup: [C: 04-1] rename_su_normalized_T312972.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:05:29] RECOVERY - SSH on mw1324.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:06:13] (03PS2) 10Marostegui: rename_su_normalized_T312972.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) [12:06:23] (03CR) 10Marostegui: rename_su_normalized_T312972.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:06:41] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:07:25] (03CR) 10Ladsgroup: [C: 03+1] rename_su_normalized_T312972.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:07:33] (03CR) 10Marostegui: [C: 03+2] rename_su_normalized_T312972.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:08:02] (03CR) 10Ladsgroup: [C: 03+1] "Actually the second part of name is different so the first one was correct too. 
Doesn't matter 😄" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:08:04] (03Merged) 10jenkins-bot: Restore scrolling parameters into view with(out) sticky header [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819535 (https://phabricator.wikimedia.org/T312926) (owner: 10WMDE-Fisch) [12:08:07] (03Merged) 10jenkins-bot: rename_su_normalized_T312972.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819515 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:08:28] (03CR) 10Jbond: [C: 03+1] "lgtm, minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/819553 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:11:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:12:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:12:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:12:57] (03PS5) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) [12:12:59] (03PS1) 10Jbond: ulog: ensure package is installed before trying to start the service [puppet] - 10https://gerrit.wikimedia.org/r/819560 [12:13:01] (03PS1) 10Jbond: beaker: fail if no changes preformed in first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/819561 [12:13:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:15:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:16:08] (03PS1) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - 10https://gerrit.wikimedia.org/r/819562 [12:16:19] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:16:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T312972)', diff saved to https://phabricator.wikimedia.org/P32148 and previous config saved to /var/cache/conftool/dbconfig/20220802-121624-marostegui.json [12:16:27] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:16:33] (03PS1) 10Marostegui: rename_su_normalized_T312972.py: typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819563 (https://phabricator.wikimedia.org/T312972) [12:17:05] (03CR) 10CI reject: [V: 04-1] rename_su_normalized_T312972.py: typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819563 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:17:38] (03PS2) 10Marostegui: rename_su_normalized_T312972.py: typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819563 (https://phabricator.wikimedia.org/T312972) [12:17:40] (03PS2) 10FNegri: d/changelog: Prepare for 0.88 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 [12:17:52] (03CR) 10Jbond: P:adduser: apply adduser before any packages are installed (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:18:06] (03CR) 10Jbond: [C: 03+2] ulog: ensure package is installed before trying to start the service [puppet] - 10https://gerrit.wikimedia.org/r/819560 (owner: 10Jbond) [12:18:10] (03CR) 10Jbond: [C: 03+2] beaker: fail if no changes preformed in first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/819561 (owner: 10Jbond) [12:18:31] (03CR) 10FNegri: d/changelog: Prepare for 0.88 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 (owner: 10FNegri) [12:18:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312972)', diff 
saved to https://phabricator.wikimedia.org/P32149 and previous config saved to /var/cache/conftool/dbconfig/20220802-121832-marostegui.json [12:18:35] (03PS1) 10Btullis: Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) [12:18:54] (03CR) 10Jbond: [C: 03+2] P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/818450 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:19:04] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db1185-db1195 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813234 (https://phabricator.wikimedia.org/T306928) (owner: 10Marostegui) [12:19:14] (03CR) 10Marostegui: [C: 03+2] rename_su_normalized_T312972.py: typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819563 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:19:50] (03Merged) 10jenkins-bot: rename_su_normalized_T312972.py: typo [software/schema-changes] - 10https://gerrit.wikimedia.org/r/819563 (https://phabricator.wikimedia.org/T312972) (owner: 10Marostegui) [12:20:30] (03CR) 10Btullis: [C: 03+2] Add roles and cumin aliases for the new dse_k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/813841 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis) [12:21:33] (03PS1) 10Ayounsi: Add mock peeringdb token [labs/private] - 10https://gerrit.wikimedia.org/r/819568 [12:23:01] (03PS2) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - 10https://gerrit.wikimedia.org/r/819562 [12:25:47] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02384 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:28:17] looking [12:28:28] (03PS1) 10Jbond: puppetmaster2004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/819572 
(https://phabricator.wikimedia.org/T314136) [12:29:57] (03PS1) 10Jbond: Revert "P:adduser: apply adduser before any packages are installed" [puppet] - 10https://gerrit.wikimedia.org/r/819537 [12:30:01] (03PS1) 10Jbond: Revert "beaker: fail if no changes preformed in first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/819539 [12:30:03] (03PS1) 10Jbond: Revert "ulog: ensure package is installed before trying to start..." [puppet] - 10https://gerrit.wikimedia.org/r/819540 [12:30:12] (03PS2) 10Jbond: Revert "P:adduser: apply adduser before any packages are installed" [puppet] - 10https://gerrit.wikimedia.org/r/819537 [12:30:22] (03PS3) 10Jbond: Revert "P:adduser: apply adduser before any packages are installed" [puppet] - 10https://gerrit.wikimedia.org/r/819537 [12:30:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:adduser: apply adduser before any packages are installed" [puppet] - 10https://gerrit.wikimedia.org/r/819537 (owner: 10Jbond) [12:33:36] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10Ottomata) AH! very cool! 
[12:33:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32150 and previous config saved to /var/cache/conftool/dbconfig/20220802-123338-marostegui.json [12:35:16] (03CR) 10Ottomata: [C: 03+1] Revert "testwiki: Add mediawiki.web_ui.interactions stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819014 (https://phabricator.wikimedia.org/T314151) (owner: 10Phuedx) [12:36:23] (03PS1) 10Marostegui: install_server: Do not reimage db2174-db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819576 [12:37:28] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2174-db2175 [puppet] - 10https://gerrit.wikimedia.org/r/819576 (owner: 10Marostegui) [12:37:51] (03PS1) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [12:38:51] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005839 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:40:11] (03PS1) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819541 (https://phabricator.wikimedia.org/T235067) [12:41:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36576/console" [puppet] - 10https://gerrit.wikimedia.org/r/819541 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:41:55] (03PS1) 10Elukey: Revert "ml-services: update editquality's en image to test ray workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/819542 [12:43:15] (03CR) 10CI reject: [V: 04-1] P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819541 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:43:25] (03PS2) 10Slyngshede: 
c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [12:43:34] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Add scap.cfg section for devtools environment [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/819194 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [12:43:42] (03PS1) 10MVernon: Hieradata: move restbase prod to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/819578 (https://phabricator.wikimedia.org/T309896) [12:44:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36577/console" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:44:45] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/819578 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [12:46:45] (03CR) 10Majavah: [C: 03+1] d/changelog: Prepare for 0.88 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 (owner: 10FNegri) [12:48:13] (03PS1) 10Jbond: P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819580 (https://phabricator.wikimedia.org/T235067) [12:48:16] (03PS1) 10Jbond: P:apt: apply apt before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819581 (https://phabricator.wikimedia.org/T235067) [12:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P32151 and previous config saved to /var/cache/conftool/dbconfig/20220802-124845-marostegui.json [12:49:59] (03PS7) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [12:51:15] (03CR) 10Elukey: [C: 03+2] Revert "ml-services: update editquality's en image to test ray workers" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/819542 (owner: 10Elukey) [12:51:39] (03CR) 10Jbond: [C: 03+2] P:adduser: apply adduser before any packages are installed [puppet] - 10https://gerrit.wikimedia.org/r/819580 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [12:51:45] (03PS8) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [12:52:33] (03PS9) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [12:53:26] (03PS3) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [12:54:55] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:19] (03CR) 10Brennen Bearnes: [C: 03+1] devtools: Configure keyholder for scap3 deployment of phabricator [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [12:56:06] (03CR) 10Jbond: [C: 03+2] puppetmaster2004: move to puppetmastr::backend role [puppet] - 10https://gerrit.wikimedia.org/r/819572 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond) [12:56:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:56:36] (03PS4) 10Slyngshede: c:raid::md move from crontab to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) [12:57:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36578/console" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) 
[12:59:44] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [12:59:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [12:59:49] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T1300). [13:00:04] samwilson, phuedx, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T1300) [13:00:08] o/ [13:00:13] o/ [13:00:15] let's do it [13:00:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [13:00:26] Hullo [13:00:34] hello everyone [13:00:39] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2028.codfw.wmnet with OS bullseye [13:00:45] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2028.codfw.wmnet with OS bullseye [13:00:55] (03PS2) 10Urbanecm: Enable RealtimePreview on Group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818647 (https://phabricator.wikimedia.org/T314150) (owner: 10Samwilson) [13:01:04] (03CR) 10Urbanecm: [C: 03+2] Enable RealtimePreview on Group 0 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818647 (https://phabricator.wikimedia.org/T314150) (owner: 10Samwilson) [13:02:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2013.codfw.wmnet with OS bullseye [13:02:43] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye [13:03:16] (03Merged) 10jenkins-bot: Enable RealtimePreview on Group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818647 (https://phabricator.wikimedia.org/T314150) (owner: 10Samwilson) [13:03:51] btw, is it intentional that the mobileapps/wikifeeds window overlaps with the UTC afternoon backport window? [13:03:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312972)', diff saved to https://phabricator.wikimedia.org/P32152 and previous config saved to /var/cache/conftool/dbconfig/20220802-130351-marostegui.json [13:03:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:03:55] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:03:55] I think it happened yesterday too [13:04:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:04:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [13:04:11] samwilson: your patch is at mwdebug1001, can you check please? 
[13:04:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [13:04:25] testing now, thanks [13:04:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32153 and previous config saved to /var/cache/conftool/dbconfig/20220802-130428-marostegui.json [13:05:36] Lucas_WMDE: not sure if it's intentional, but it's not a new thing (https://wikitech.wikimedia.org/wiki/Deployments/Archive/2022/06, for example) [13:05:44] urbanecm: all looks good, go for it [13:05:52] thanks, syncing! [13:06:05] ok [13:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32154 and previous config saved to /var/cache/conftool/dbconfig/20220802-130636-marostegui.json [13:07:13] (03PS2) 10Urbanecm: Revert "testwiki: Add mediawiki.web_ui.interactions stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819014 (https://phabricator.wikimedia.org/T314151) (owner: 10Phuedx) [13:07:17] (03CR) 10Urbanecm: [C: 03+2] Revert "testwiki: Add mediawiki.web_ui.interactions stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819014 (https://phabricator.wikimedia.org/T314151) (owner: 10Phuedx) [13:08:43] (03Merged) 10jenkins-bot: Revert "testwiki: Add mediawiki.web_ui.interactions stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819014 (https://phabricator.wikimedia.org/T314151) (owner: 10Phuedx) [13:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c2fb8a58d8f62e29a15ebee26198e79e4597d24c: Enable RealtimePreview on Group 0 wikis (T314150) (duration: 03m 21s) [13:09:31] T314150: Enable Realtime Preview on group0 - https://phabricator.wikimedia.org/T314150 [13:09:40] samwilson: 
it's live [13:09:54] great! thanks [13:09:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:59] phuedx: pulled to mwdebug1001 if you want to take a look [13:10:02] looks good [13:10:20] great! [13:10:39] urbanecm: On it [13:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:04] hi, may i have some late additions to the window? backports for wmf.23 [13:11:33] MatmaRex: sure thing! add them to the calendar please [13:11:56] (03PS1) 10Bartosz Dziewoński: Tighten conditions for incompatible skin warnings [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819543 (https://phabricator.wikimedia.org/T312632) [13:12:04] (03PS1) 10Bartosz Dziewoński: Fix reply buttons not being available on mobile [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819544 [13:12:05] urbanecm: LGTM. I've confirmed that the stream config isn't being sent to the client on enwiki or testwiki and have confirmed that it's still working on Beta enwiki [13:12:11] (03PS1) 10Bartosz Dziewoński: Don't infuse reply buttons if not in use [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819545 [13:12:14] great! 
syncing [13:13:36] <3 Thanks [13:13:41] done, thanks [13:13:57] (03CR) 10Urbanecm: [C: 03+2] Tighten conditions for incompatible skin warnings [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819543 (https://phabricator.wikimedia.org/T312632) (owner: 10Bartosz Dziewoński) [13:13:59] wmf.23 is not deployed anywhere yet, so i have no way to test [13:14:00] (03CR) 10Urbanecm: [C: 03+2] Fix reply buttons not being available on mobile [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819544 (owner: 10Bartosz Dziewoński) [13:14:03] (03CR) 10Urbanecm: [C: 03+2] Don't infuse reply buttons if not in use [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819545 (owner: 10Bartosz Dziewoński) [13:14:16] these are changes that were meant for train deployments, but we were late with reviewing them :) [13:14:35] (03PS1) 10Vgutierrez: trafficserver: Set thread count factor to 1.5x for ats9 instances [puppet] - 10https://gerrit.wikimedia.org/r/819585 (https://phabricator.wikimedia.org/T309651) [13:14:48] MatmaRex: yup. 
wmf.23 is not even yet at the deployment server, so i'll just merge them, and they'll ride the train [13:15:35] (03PS2) 10Vgutierrez: trafficserver: Set thread count factor to 1.5x for ats9 instances [puppet] - 10https://gerrit.wikimedia.org/r/819585 (https://phabricator.wikimedia.org/T309651) [13:15:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a4499e5ac23a0558bed276e2b74134590afc5c95: Revert "testwiki: Add mediawiki.web_ui.interactions stream" (T314151, T311268) (duration: 03m 19s) [13:16:00] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:16:00] Lucas_WMDE: your patch has an actual merge conflict unfortunately. can you rebase manually? also, let me know if you want me to deploy them, or if you prefer self-serving :) [13:16:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:01] T314151: Metrics Platform Event custom_data field isn't refined correctly - https://phabricator.wikimedia.org/T314151 [13:16:02] T311268: *WebUIActionsTracking migration to Metrics Platform - https://phabricator.wikimedia.org/T311268 [13:16:08] phuedx: should be live [13:17:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:17:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:17:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36579/console" [puppet] - 10https://gerrit.wikimedia.org/r/819585 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:17:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:18:07] urbanecm: Confirmed. 
Thanks [13:18:12] no problem :) [13:18:44] urbanecm: sorry, I didn’t see the ping [13:18:52] I’ll rebase, and then I’d like to deploy them myself :) [13:19:05] Lucas_WMDE: in that case, the floor is yours :) [13:19:19] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage [13:19:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage [13:19:29] ok, thanks :) [13:20:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:15] (03PS5) 10Lucas Werkmeister (WMDE): Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:20:17] (03PS6) 10Lucas Werkmeister (WMDE): Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32155 and previous config saved to /var/cache/conftool/dbconfig/20220802-132142-marostegui.json [13:22:56] (03CR) 10Ssingh: [C: 03+1] trafficserver: Set thread count factor to 1.5x for ats9 instances [puppet] - 10https://gerrit.wikimedia.org/r/819585 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:23:09] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2028.codfw.wmnet with reason: host reimage [13:23:14] PROBLEM - Check systemd state on puppetmaster2004 is CRITICAL: CRITICAL - degraded: The following units failed: remove_old_puppet_reports.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:19] (03PS6) 10Lucas Werkmeister (WMDE): Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:23:21] (03PS7) 10Lucas Werkmeister (WMDE): Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:23:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Set thread count factor to 1.5x for ats9 instances [puppet] - 10https://gerrit.wikimedia.org/r/819585 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [13:23:35] rearranged the code a bit more ^^ [13:23:45] after almost confusing myself about what the effective settings were [13:24:44] !log restarting ATS 9.x instances to apply https://gerrit.wikimedia.org/r/819585 - T309651 [13:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:48] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [13:24:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2013.codfw.wmnet with reason: host reimage [13:26:03] (03Merged) 10jenkins-bot: Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:26:29] checking on mwdebug1001 [13:26:39] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team - https://phabricator.wikimedia.org/T314319 (10MoritzMuehlenhoff) Looks good. 
Best to pick row A or D, but given that it replaces airflow1003 which is currently in row D, the latter is probably the best choice. [13:26:47] seems to do what it should, will sync [13:26:59] (03PS1) 10AOkoth: gitlab: enable restore on gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/819589 (https://phabricator.wikimedia.org/T296713) [13:26:59] order shouldn’t matter but IS.php before Wikibase.php makes more sense semantically :) [13:27:34] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetmaster2004.codfw.wmnet with OS buster [13:30:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, T309957 [13:30:04] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [13:30:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, T309957 [13:30:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:30:59] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937|Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s) [13:31:01] T296384: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 [13:31:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:53] (03Merged) 10jenkins-bot: Tighten conditions for incompatible skin warnings [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819543 (https://phabricator.wikimedia.org/T312632) (owner: 10Bartosz Dziewoński) [13:31:59] 10SRE, 
10Datacenter-Switchover: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10RhinosF1) 05Declined→03Open Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham. [13:32:05] (03Merged) 10jenkins-bot: Fix reply buttons not being available on mobile [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819544 (owner: 10Bartosz Dziewoński) [13:32:09] (03Merged) 10jenkins-bot: Don't infuse reply buttons if not in use [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819545 (owner: 10Bartosz Dziewoński) [13:32:43] (03PS10) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:33:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:34:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:34:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:34:34] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937|Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s) [13:34:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:35:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:35:23] 10SRE, 10Discovery, 10Elasticsearch: Collect metrics on CirrusSearch usage of PoolCounter - https://phabricator.wikimedia.org/T130617 (10RhinosF1) 05Declined→03Open Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham. 
[13:36:01] 10SRE, 10Discovery, 10Elasticsearch: Setup backups of elasticsearch indices - https://phabricator.wikimedia.org/T91404 (10RhinosF1) 05Declined→03Open Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham. [13:36:11] (03Merged) 10jenkins-bot: Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [13:36:13] testing the second change on mwdebug1001 [13:36:29] looks good, syncing [13:36:45] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10rook) I've been looking for how we see the routing of a subnet in openstack, but thus far have come up with little. How did you identify that there is... [13:36:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P32156 and previous config saved to /var/cache/conftool/dbconfig/20220802-133648-marostegui.json [13:37:27] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) on cr1-eqaid, we have all the interfaces setup for asw2-c and asw2-d move ` papaul@re0.cr1-eqiad> show interfaces terse | match xe-1/1 xe-1/1/0:0... 
[13:39:10] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:39:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2028.codfw.wmnet with OS bullseye [13:39:48] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2028.codfw.wmnet with OS bullseye completed: - elastic2028 (... [13:40:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:40:18] (03PS1) 10Ssingh: Depool codfw for PDU upgrade [dns] - 10https://gerrit.wikimedia.org/r/819591 (https://phabricator.wikimedia.org/T309957) [13:40:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933|Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s) [13:40:39] T296384: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 [13:41:00] alright, I think I’m done [13:41:11] anything else to deploy? 
[13:41:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:41:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:41:30] (03CR) 10BBlack: [C: 03+1] Depool codfw for PDU upgrade [dns] - 10https://gerrit.wikimedia.org/r/819591 (https://phabricator.wikimedia.org/T309957) (owner: 10Ssingh) [13:41:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) Hi @rook, you probably need to confirm within the cloud team, but as far as I am aware the cloudgw nodes are external to OpenStack completely,... [13:41:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2013.codfw.wmnet with OS bullseye [13:41:48] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye completed: - ganeti2013 (**PASS**) - Downtimed on... 
[13:42:07] !log UTC afternoon backport+config window done [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:42:29] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage [13:43:49] (03CR) 10Ssingh: [C: 03+2] Depool codfw for PDU upgrade [dns] - 10https://gerrit.wikimedia.org/r/819591 (https://phabricator.wikimedia.org/T309957) (owner: 10Ssingh) [13:45:11] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: host reimage [13:46:37] (03CR) 10FNegri: [C: 03+2] d/changelog: Prepare for 0.88 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 (owner: 10FNegri) [13:47:49] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.88 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/819547 (owner: 10FNegri) [13:50:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-be [13:50:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=varnish-fe [13:50:11] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2029.codfw.wmnet,service=ats-tls [13:50:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be [13:50:31] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=varnish-fe [13:50:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-be [13:51:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=varnish-fe [13:51:54] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: 
name=cp203[1-2].codfw.wmnet,service=ats-tls [13:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32157 and previous config saved to /var/cache/conftool/dbconfig/20220802-135155-marostegui.json [13:51:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:51:58] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:52:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[1-2].codfw.wmnet,service=ats-tls [13:52:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32158 and previous config saved to /var/cache/conftool/dbconfig/20220802-135226-marostegui.json [13:53:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=varnish-fe [13:53:13] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls [13:53:19] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-tls [13:53:34] !log depool and poweroff prometheus2005 - T310070 [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:36] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [13:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32159 and previous config saved to /var/cache/conftool/dbconfig/20220802-135435-marostegui.json [13:54:48] !log 
sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[3-4].codfw.wmnet,service=ats-be [13:55:30] (03PS1) 10Vivian Rook: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) [13:55:44] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/819578 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [13:55:53] (03CR) 10Jbond: [C: 04-1] PeeringDB API: initial commit (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [13:56:14] !log schedule poweroff for centrallog2002 at 16 utc - T310070 [13:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:18] PROBLEM - Host prometheus2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:35] (03CR) 10Vivian Rook: "The edits to cloudgw200[12]-dev.yaml are almost surely meaningless, if not outright wrong." [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [13:57:21] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=ats-be [13:57:26] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10MoritzMuehlenhoff) [13:57:30] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2032.codfw.wmnet,service=ats-be [13:57:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2030.codfw.wmnet,service=ats-tls [14:00:13] RECOVERY - Check systemd state on puppetmaster2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:01:35] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu [14:01:49] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on prometheus2005.codfw.wmnet with reason: pdu [14:03:35] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu [14:03:48] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on centrallog2002.codfw.wmnet with reason: pdu [14:04:04] !log grow sda/sdb 3 by 100G on thanos-be1001 - T314275 [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:07] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 [14:05:34] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster2004.codfw.wmnet with OS buster [14:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32160 and previous config saved to /var/cache/conftool/dbconfig/20220802-140940-marostegui.json [14:11:21] (03CR) 10JHathaway: [C: 03+1] Revert "P:adduser: apply adduser before any packages are installed" [puppet] - 10https://gerrit.wikimedia.org/r/819537 (owner: 10Jbond) [14:12:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye [14:12:50] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2060.codfw.wmnet with OS bullseye [14:18:04] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row B new 
PDUs - https://phabricator.wikimedia.org/T310070 (10Andrew) [14:21:34] brennen, hi ... so, we decided to back out the parsoid deployment for this week's train. arlo has +2ed the vendor repo revert ( https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/819608 ). [14:21:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement [14:22:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement [14:22:23] brennen, just wanted to give you a heads up before you rolled out the train. [14:22:28] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cd0b03ef-75d5-4a98-8161-1d31bb05694f) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services wi... [14:23:29] !log shutdown ms-be20[30,45,52] for PDU work T309957 [14:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [14:23:52] scott is cherry-picking it to wmf.23 right now. [14:24:30] subbu: thanks for heads up! d.ancy is primary this week, i don't think deployer should need to take any action here since it's not yet checked out on the deployment box. [14:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P32161 and previous config saved to /var/cache/conftool/dbconfig/20220802-142446-marostegui.json [14:25:39] ok. so, it is sufficient to cherrypick to wmf.23 then? [14:26:14] should be. [14:26:33] yeah, if stuff gets merged before it's on the deploy hosts.. it should "just work"(TM) [14:27:33] okay ... 
so, i'll +2 the cherrypick then. [14:27:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:04] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage [14:30:20] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10lmata) a:03fgiunchedi [14:32:31] (03CR) 10Jbond: [C: 04-1] "some more general comments. id also be tempted to break out the cache functionality to its own class e.g." [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [14:32:36] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2060.codfw.wmnet with reason: host reimage [14:35:12] 10SRE, 10Observability-Logging, 10vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (10MoritzMuehlenhoff) Looks good, could you use Row C but wait two days? I'm currently reimaging codfw ganeti nodes to bullseye and the ones in codfw/C should be done... [14:36:43] (03Abandoned) 10Jbond: Revert "beaker: fail if no changes preformed in first puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/819539 (owner: 10Jbond) [14:36:50] (03Abandoned) 10Jbond: Revert "ulog: ensure package is installed before trying to start..." 
[puppet] - 10https://gerrit.wikimedia.org/r/819540 (owner: 10Jbond) [14:37:36] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetmaster1004.eqiad.wmnet with OS buster [14:38:36] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10fgiunchedi) [14:39:03] (03CR) 10Jbond: "this causes a circular dependency on at least the jumbo kafka hosts" [puppet] - 10https://gerrit.wikimedia.org/r/819581 (https://phabricator.wikimedia.org/T235067) (owner: 10Jbond) [14:39:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32162 and previous config saved to /var/cache/conftool/dbconfig/20220802-143952-marostegui.json [14:39:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:39:56] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:40:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:40:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32163 and previous config saved to /var/cache/conftool/dbconfig/20220802-144013-marostegui.json [14:42:04] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 6 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lydia_Pintscher) [14:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32164 and previous config saved to 
/var/cache/conftool/dbconfig/20220802-144222-marostegui.json [14:43:27] 10SRE, 10serviceops-radar, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) >>! In T314118#8116129, @Dzahn wrote: > For "Apache HTTP on mw" I guess ideally it would be replaced by 2 things: > > - a p... [14:47:16] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10ssingh) [14:48:11] (03CR) 10Andrew Bogott: [C: 03+1] "This looks right to me. I'd like to get another +1 from someone regarding the non-codfw1dev bits." [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [14:49:05] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Marostegui) [14:50:29] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage [14:50:34] !log uploaded gnupg2 2.1.18-8~deb9u4+wmf1 to stretch-wikimedia [14:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:37] 10SRE-swift-storage: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275 (10fgiunchedi) >>! In T314275#8124331, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/jg7eXoIB6FQ6iqKi7n6a} [2022-08-02T14:04:04Z] grow sda/... 
[14:53:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: host reimage [14:54:52] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2060.codfw.wmnet with OS bullseye [14:54:57] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2060.codfw.wmnet with OS bullseye completed: - elastic2060 (... [14:57:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32166 and previous config saved to /var/cache/conftool/dbconfig/20220802-145728-marostegui.json [14:58:12] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=(appservers|api)-ro,name=codfw [14:58:46] (03CR) 10Cwhite: "Tested on beta-logs and appears to do the right thing." 
[puppet] - 10https://gerrit.wikimedia.org/r/817388 (https://phabricator.wikimedia.org/T166107) (owner: 10Cwhite) [14:58:55] PROBLEM - Host ms-be2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:59:03] PROBLEM - Host ms-be2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:59:29] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: T309957 [14:59:32] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [14:59:35] PROBLEM - Host cp2030 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:35] PROBLEM - Host cp2031 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:37] PROBLEM - Host cp2034 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:43] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: T309957 [14:59:57] PROBLEM - Host cp2029 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:57] PROBLEM - Host cp2033 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:03] PROBLEM - Host cp2032 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:04] (03CR) 10Ahmon Dancy: P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [15:00:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade [15:01:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: shutdown for PDU upgrade [15:04:35] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Andrew) [15:04:37] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: T309957 [15:04:40] T309957: (Need By:TBD) rack/setup/install row A new PDUs - 
https://phabricator.wikimedia.org/T309957 [15:04:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: T309957 [15:06:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, T310070 [15:06:19] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [15:06:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, T310070 [15:07:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement [15:07:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be[2030,2045,2052].codfw.wmnet with reason: shutdown for PDU replacement [15:07:40] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1afb1eaa-338e-4346-baff-e22c312e16f5) set by mvernon@cumin2002 for 3:00:00 on 3 host(s) and their services wi... 
[15:08:39] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu [15:08:53] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thanos-be2001.codfw.wmnet with reason: pdu [15:10:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, T310070 [15:10:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, T310070 [15:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P32167 and previous config saved to /var/cache/conftool/dbconfig/20220802-151234-marostegui.json [15:12:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nicely done" [puppet] - 10https://gerrit.wikimedia.org/r/817388 (https://phabricator.wikimedia.org/T166107) (owner: 10Cwhite) [15:13:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1004.eqiad.wmnet with OS buster [15:14:47] PROBLEM - Host mc2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade [15:15:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2024.codfw.wmnet with reason: shutdown for PDU upgrade [15:18:07] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:21] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:20:45] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:27] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:24:31] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:24:42] !log installing gnupg2 security updates [15:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312972)', diff saved to https://phabricator.wikimedia.org/P32169 and previous config saved to /var/cache/conftool/dbconfig/20220802-152740-marostegui.json [15:27:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:27:44] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:27:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:27:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet 
with reason: Maintenance [15:28:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:28:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T312972)', diff saved to https://phabricator.wikimedia.org/P32170 and previous config saved to /var/cache/conftool/dbconfig/20220802-152818-marostegui.json [15:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312972)', diff saved to https://phabricator.wikimedia.org/P32171 and previous config saved to /var/cache/conftool/dbconfig/20220802-153027-marostegui.json [15:31:56] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1042 site=eqiad tunnel=mc2024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:33:27] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) @BTullis Thanks, doing this now [15:35:50] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:36:27] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet [15:36:50] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2037.codfw.wmnet [15:36:56] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) [15:37:12] !log sukhe@cumin2002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade [15:37:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2040-2041].codfw.wmnet with reason: shutdown for PDU upgrade [15:38:06] PROBLEM - Host an-worker1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:38:26] (03CR) 10Krinkle: [C: 03+1] Remove temporary benchmark script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819499 (owner: 10Daniel Kinzler) [15:39:20] Krinkle: Congrats on the settings cache removal! [15:40:25] (03PS2) 10Ahmon Dancy: Add systemd timer to run scap stage-train on Tuesday morning [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) [15:43:46] RECOVERY - Host an-worker1082.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 1.35 ms [15:44:14] jouncebot nowandnext [15:44:14] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [15:44:14] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T1600) [15:44:24] 10SRE, 10ops-codfw, 10DBA: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Andrew) [15:45:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32172 and previous config saved to /var/cache/conftool/dbconfig/20220802-154533-marostegui.json [15:45:45] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: T309957 [15:45:45] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:48] T309957: (Need By:TBD) rack/setup/install row A 
new PDUs - https://phabricator.wikimedia.org/T309957 [15:45:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: T309957 [15:45:59] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) 05Open→03Resolved @btullis replaced the battery and powered on, everything looks good from my end. resolving [15:47:52] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:49:45] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: T309957 [15:49:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: T309957 [15:50:16] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10BTullis) Many thanks indeed @Cmjohnson. 
[15:50:55] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: T309957 [15:50:59] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [15:51:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: T309957 [15:52:48] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:53:07] (03PS2) 10Btullis: Add DNS SRV records for dse-k8s etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/819565 (https://phabricator.wikimedia.org/T313129) [15:55:04] (03PS1) 10Hnowlan: changeprop-jobqueue: increase number of workers to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/819627 [15:57:02] (03CR) 10Eevans: [C: 03+1] "I'm not familiar enough with the actual config changes to speak to them, but +1 from me for re-imaging to Buster with data in-situ." [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [15:57:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:13] (03CR) 10Brennen Bearnes: "My main thought here was that as a followup we should have a stop mechanism for this, but I see that's already mentioned on task (behaving" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T1600). [16:00:05] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36582/console" [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE)) [16:00:13] o/ [16:00:19] 👋 [16:00:23] hi! [16:00:30] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P32173 and previous config saved to /var/cache/conftool/dbconfig/20220802-160039-marostegui.json [16:00:47] looks like only maintenance jobs, so nothing to test on deployment, right? [16:00:48] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:17] nothing to test on deployment I think [16:01:23] but there is a related manual action I’d need [16:01:24] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:01:25] let me check [16:01:38] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:56] jbond: happen to be around? 
I'm jumping into a meeting, can try to multitask if needed :) [16:02:14] yup, wmde-analytics-daily-early.service is still running on stat1007 [16:02:17] (for almost two weeks now) [16:02:24] so if someone could systemctl stop that, it’d be great :) [16:02:35] (and then the max_execution_seconds added by the puppet patch should prevent it from running so long again) [16:03:25] I could kill the process myself (since it runs as analytics-wmde) but then y’all would be left with a failed systemd unit, so that wouldn’t be very nice of me [16:04:02] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:34] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1004.eqiad.wmnet [16:05:36] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [16:05:37] Lucas_WMDE: okay sounds good -- I can take care of that in a few minutes [16:05:44] ok, thanks! [16:06:00] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:06:59] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team - https://phabricator.wikimedia.org/T314319 (10BTullis) Many thanks @MoritzMuehlenhoff Creating that VM now. ` btullis@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --cluster eqi... 
[16:08:28] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:10:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:33] !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache an-airflow1004.eqiad.wmnet on all recursors [16:10:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1004.eqiad.wmnet on all recursors [16:14:01] (03PS2) 10Hnowlan: changeprop-jobqueue: increase number of workers to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/819627 [16:15:28] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:15:45] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312972)', diff saved to https://phabricator.wikimedia.org/P32174 and previous config saved to /var/cache/conftool/dbconfig/20220802-161545-marostegui.json [16:15:47] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [16:15:49] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [16:16:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [16:16:05] (03PS2) 10Andrew Bogott: Move labweb100[12] to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/817385 (https://phabricator.wikimedia.org/T313861) [16:16:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T312972)', diff saved to https://phabricator.wikimedia.org/P32175 and previous config saved to /var/cache/conftool/dbconfig/20220802-161607-marostegui.json [16:17:26] Lucas_WMDE: sorry for the delay! rerunning pcc with the correct hostname this time and then good to go [16:17:45] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36583/console" [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE)) [16:17:51] ok [16:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312972)', diff saved to https://phabricator.wikimedia.org/P32176 and previous config saved to /var/cache/conftool/dbconfig/20220802-161815-marostegui.json [16:18:19] hmm, I don't see the timer diffs I expected to see there, checking [16:18:31] is it only stat1007 and not the other statistics::explorer hosts? 
[16:19:34] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36584/console" [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE)) [16:20:27] aha, yep; [16:20:41] (03CR) 10RLazarus: [V: 03+1 C: 03+2] statistics::wmde::graphite: add max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/818440 (https://phabricator.wikimedia.org/T314130) (owner: 10Lucas Werkmeister (WMDE)) [16:22:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:23:10] yeah, only the one host AFAIK [16:23:17] 👍 [16:23:31] running puppet now, then I'll stop that service [16:23:39] alright, thanks! [16:25:00] PROBLEM - Host elastic2075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:14] PROBLEM - Host ganeti-test2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:gerrit: add ipaddress to host_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819506 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [16:25:20] !log rzl@stat1007:~$ sudo systemctl stop wmde-analytics-daily-early # wedged, timer will restart it now with max_runtime_seconds [16:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:24] PROBLEM - Host cp2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:24] PROBLEM - Host cp2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:37] thanks! 
[16:26:01] (03PS1) 10Giuseppe Lavagetto: scap: remove configuratoion for deploy* [puppet] - 10https://gerrit.wikimedia.org/r/819630 [16:26:03] (03PS1) 10Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/819631 [16:26:07] sure thing! [16:26:15] and now `systemctl list-timers` shows next/left for the timer again, yay [16:26:21] \o/ [16:26:44] PROBLEM - Host ganeti2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:09] 10SRE, 10SRE-OnFire, 10serviceops, 10serviceops-collab, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10JMeybohm) >>! In T313355#8091114, @CDanis wrote: > filter_victorops_calendar requires some persist... [16:27:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:14] I also have two more Puppet changes where I’d appreciate a review :) [16:27:18] which are https://gerrit.wikimedia.org/r/c/operations/puppet/+/819016/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/819017/ [16:27:24] doesn’t have to be now [16:27:34] PROBLEM - Host elastic2056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:34] PROBLEM - Host elastic2069.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:34] PROBLEM - Host elastic2076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:35] but I thought I’d try to get that task unstuck with my very limited Puppet skills ^^ [16:27:42] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: increase number of workers to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/819627 (owner: 10Hnowlan) [16:27:48] PROBLEM - Host ganeti-test2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:48] PROBLEM - Host ganeti-test2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:48] PROBLEM - Host mc2040.mgmt is DOWN: 
PING CRITICAL - Packet loss = 100% [16:27:48] PROBLEM - Host mc2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:34] PROBLEM - Host cloudbackup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:52] PROBLEM - Host thanos-be2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:28:54] looking! [16:28:55] (03CR) 10Ahmon Dancy: [C: 03+1] scap: remove configuratoion for deploy* [puppet] - 10https://gerrit.wikimedia.org/r/819630 (owner: 10Giuseppe Lavagetto) [16:29:02] PROBLEM - Host wdqs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:29:09] rzl: guessing it is the codfw maintenance [16:29:12] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:29:22] jynus: yes, sorry, I meant I was looking at Lucas_WMDE's patches :) [16:29:31] my fault, then [16:29:56] !log dancy@mwmaint1002 pull aborted: (duration: 00m 07s) [16:29:57] chat bot just interluded at the wrong time 0:-) [16:30:11] PROBLEM - Host ms-be2052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:30:12] not at all, appreciate the context [16:31:15] PROBLEM - Host elastic2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:15] PROBLEM - Host elastic2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:42] (03CR) 10Ahmon Dancy: scap: do not restart php on the mwmaint servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819631 (owner: 10Giuseppe Lavagetto) [16:31:55] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:09] (03Merged) 10jenkins-bot: changeprop-jobqueue: increase number of workers to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/819627 (owner: 10Hnowlan) [16:32:41] RECOVERY - Host elastic2069.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [16:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32177 and previous config saved to /var/cache/conftool/dbconfig/20220802-163321-marostegui.json [16:33:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:34:05] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the platform engineering team - https://phabricator.wikimedia.org/T314319 (10BTullis) 05Open→03Resolved [16:35:26] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [16:35:37] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [16:35:45] Lucas_WMDE: this seems fine to me but I'm not super familiar with inputrc, and given it's profile::base::production I'd like to get another pair of eyes on it :) anyone from Infrastructure Foundations ought to be able to give you a good review, lmk if you have trouble finding someone [16:35:49] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:36:05] * Lucas_WMDE looks at the staff page [16:36:21] #wikimedia-sre-foundations is where they live [16:36:40] thanks [16:38:20] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [16:45:13] RECOVERY - Host wdqs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.44 ms [16:46:55] RECOVERY - Host elastic2075.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:47:03] (03CR) 10Ahmon Dancy: scap: remove configuratoion for deploy* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819630 (owner: 10Giuseppe Lavagetto) [16:47:17] RECOVERY - Host ganeti-test2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.71 ms [16:48:03] RECOVERY - Host elastic2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms [16:48:03] RECOVERY - Host elastic2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [16:48:25] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P32178 and previous config saved to /var/cache/conftool/dbconfig/20220802-164827-marostegui.json [16:48:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [16:49:03] RECOVERY - Host ms-be2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [16:49:11] RECOVERY - Host ms-be2045.mgmt 
is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:49:22] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [16:49:41] RECOVERY - Host elastic2056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:49:41] RECOVERY - Host elastic2076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [16:49:57] RECOVERY - Host ganeti-test2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms [16:49:57] RECOVERY - Host ganeti-test2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:50:35] RECOVERY - Host cloudbackup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:50:55] RECOVERY - Host thanos-be2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [16:51:00] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [16:51:25] (03PS1) 10Giuseppe Lavagetto: wancache: temporarily remove mc-gp2002 from the gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/819634 [16:52:39] RECOVERY - Host cp2029 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [16:53:01] RECOVERY - Host ms-be2052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [16:53:07] RECOVERY - Host cp2030 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [16:53:09] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp2030 is CRITICAL: connect to address 10.192.0.32 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [16:53:15] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [16:53:15] PROBLEM - purged service on cp2030 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:53:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36585/console" [puppet] - 10https://gerrit.wikimedia.org/r/819634 (owner: 10Giuseppe Lavagetto) [16:53:17] RECOVERY - Host 
mc2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [16:53:45] RECOVERY - Host cp2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [16:53:45] RECOVERY - Host cp2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:54:05] PROBLEM - purged service on cp2029 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:54:05] PROBLEM - Check systemd state on cp2029 is CRITICAL: CRITICAL - degraded: The following units failed: esitest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:10] (03PS2) 10Giuseppe Lavagetto: wancache: temporarily remove mc-gp2002 from the gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/819634 [16:54:12] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [16:55:01] RECOVERY - Host ganeti2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:55:17] RECOVERY - purged service on cp2030 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:55:44] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36586/console" [puppet] - 10https://gerrit.wikimedia.org/r/819634 (owner: 10Giuseppe Lavagetto) [16:56:07] RECOVERY - purged service on cp2029 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:56:07] RECOVERY - Check systemd state on cp2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:07] RECOVERY - Host mc2040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:56:11] (03CR) 10Andrew Bogott: [C: 03+2] striker: remove legacy deployment [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [16:57:11] RECOVERY - Varnish HTTP 
upload-frontend - port 80 on cp2030 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:57:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1004.eqiad.wmnet [16:57:51] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:58:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:52] (03CR) 10Andrew Bogott: [C: 03+2] "Info: Computing checksum on file /etc/rsyslog.d/20-wmf-auto-restart-uwsgi-striker.conf" [puppet] - 10https://gerrit.wikimedia.org/r/819121 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:00:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[2030,2045,2052].codfw.wmnet [17:00:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2030,2045,2052].codfw.wmnet [17:00:30] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] striker: remove legacy settings [labs/private] - 10https://gerrit.wikimedia.org/r/819116 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [17:00:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-be [17:01:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=varnish-fe [17:01:00] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2029.codfw.wmnet,service=ats-tls 
[17:02:31] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312972)', diff saved to https://phabricator.wikimedia.org/P32179 and previous config saved to /var/cache/conftool/dbconfig/20220802-170333-marostegui.json [17:03:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [17:03:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:03:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:03:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [17:03:58] (03PS3) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) [17:04:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [17:04:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement [17:04:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:04:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: shutdown for PDU replacement [17:04:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:04:44] 
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:04:47] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d85c427a-fe27-4337-ba4f-b92100f4ccf6) set by mvernon@cumin2002 for 1 day, 0:00:00 on 6 host(s) and their serv... [17:04:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:05:01] (03PS3) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche) [17:05:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T312972)', diff saved to https://phabricator.wikimedia.org/P32180 and previous config saved to /var/cache/conftool/dbconfig/20220802-170503-marostegui.json [17:05:29] !log ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work T309957 [17:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:31] T309957: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 [17:06:01] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:06:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade [17:06:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc[2042-2043].codfw.wmnet with reason: shutdown for PDU upgrade [17:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1168 (T312972)', diff saved to https://phabricator.wikimedia.org/P32181 and previous config saved to /var/cache/conftool/dbconfig/20220802-170711-marostegui.json [17:09:37] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:18:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [17:20:01] (03CR) 10Dzahn: [C: 03+1] "I was wondering if this is the plan, to enable restore on both replicas, but I don't see why not. compiled this:" [puppet] - 10https://gerrit.wikimedia.org/r/819589 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [17:20:35] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-be [17:20:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=varnish-fe [17:20:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2030.codfw.wmnet,service=ats-tls [17:21:22] (03CR) 10Dzahn: [C: 03+2] admin: upgrade Aline Bruenger from ldap_only to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/819165 (https://phabricator.wikimedia.org/T314117) (owner: 10Dzahn) [17:21:53] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32182 and previous config saved to /var/cache/conftool/dbconfig/20220802-172217-marostegui.json 
[17:22:46] (03CR) 10Dzahn: [C: 03+2] admin: upgrade Szymon Grabarczuk from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/819166 (https://phabricator.wikimedia.org/T313616) (owner: 10Dzahn) [17:22:53] (03PS2) 10Dzahn: admin: upgrade Szymon Grabarczuk from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/819166 (https://phabricator.wikimedia.org/T313616) [17:23:54] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:23:56] (03CR) 10Dzahn: [C: 03+2] "bast1003 - Notice: /Stage[main]/Admin/Admin::Hashuser[alinebruenger]/Admin::User[alinebruenger]/User[alinebruenger]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/819165 (https://phabricator.wikimedia.org/T314117) (owner: 10Dzahn) [17:25:16] !log installing fribidi security updates [17:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:54] (03CR) 10Dzahn: "bast1003 - Notice: /Stage[main]/Admin/Admin::Hashuser[sgrabarczuk]/Admin::User[sgrabarczuk]/User[sgrabarczuk]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/819166 (https://phabricator.wikimedia.org/T313616) (owner: 10Dzahn) [17:27:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) [17:27:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [17:28:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:30:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10Dzahn) a:05Dzahn→03sgrabarczuk >>! In T313616#8119509, @Ottomata wrote: > @volans, sounds like ssh access is not needed for thi... [17:31:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Aline_Bruenger_WMDE - https://phabricator.wikimedia.org/T314117 (10Dzahn) a:05Dzahn→03Aline_Bruenger_WMDE >>! In T314117#8115143, @Volans wrote: > @Aline_Bruenger_WMDE ok, let's proceed for the simple access for now and you... [17:31:58] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: T310070 [17:32:01] T310070: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 [17:32:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: T310070 [17:35:18] !log installing node-moment security updates [17:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:05] (03PS1) 10Dzahn: deployment::server: add comment about usage of benchmarking tools [puppet] - 10https://gerrit.wikimedia.org/r/819649 (https://phabricator.wikimedia.org/T230178) [17:37:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P32183 and previous config saved to /var/cache/conftool/dbconfig/20220802-173723-marostegui.json [17:37:31] (03CR) 10Dzahn: Revert "deployment_server: remove packages wrk, siege and lua-cjson" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/819076 (owner: 10Dzahn) [17:38:14] (03PS11) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [17:38:55] (03PS2) 10Dzahn: deployment::server: add comment about usage of benchmarking tools [puppet] - 10https://gerrit.wikimedia.org/r/819649 (https://phabricator.wikimedia.org/T230178) [17:39:23] (03CR) 10Ayounsi: "Thanks, replying to your first batch of comment." [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [17:39:40] (03CR) 10Dzahn: [C: 03+2] "just a comment but also a follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/819649 (https://phabricator.wikimedia.org/T230178) (owner: 10Dzahn) [17:39:42] RECOVERY - Host cloudcephosd2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.89 ms [17:39:42] RECOVERY - Host cloudcephosd2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.54 ms [17:39:46] RECOVERY - Host cloudvirt2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [17:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2159', diff saved to https://phabricator.wikimedia.org/P32184 and previous config saved to /var/cache/conftool/dbconfig/20220802-174311-ladsgroup.json [17:45:08] 10SRE, 10Patch-For-Review, 10Platform Team Workboards (Green): Install wrk, siege and lua-cjson packages on deploy1001 - https://phabricator.wikimedia.org/T230178 (10Dzahn) 05Open→03Resolved [17:45:24] RECOVERY - Host cloudcephosd2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 62.17 ms [17:45:28] RECOVERY - Host cloudvirt2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [17:45:28] RECOVERY - Host cloudvirt2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 67.18 ms [17:46:48] PROBLEM - Host ml-cache2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:58] PROBLEM - Host ms-be2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:47:18] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on 
search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:47:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:49:05] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) >>! In T138093#8120299, @BBlack wrote: > So, something like this? > [...] Yep, this works! I think it would be useful to have que... [17:49:58] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [17:50:00] (03PS12) 10Ayounsi: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 [17:50:35] ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. Ryan Kemper https://phabricator.wikimedia.org/T310070 https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:50:35] ACKNOWLEDGEMENT - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
Ryan Kemper https://phabricator.wikimedia.org/T310070 https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:50:48] (03CR) 10Ayounsi: PeeringDB API: initial commit (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [17:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312972)', diff saved to https://phabricator.wikimedia.org/P32185 and previous config saved to /var/cache/conftool/dbconfig/20220802-175233-marostegui.json [17:52:37] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [17:53:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:57:29] (03CR) 10CI reject: [V: 04-1] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [17:58:54] PROBLEM - puppetmaster backend https on puppetmaster2004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [18:00:05] dancy and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T1800). [18:00:16] o/ [18:00:21] o/ [18:00:29] It's train tiiiiiiiiiiime!!!! 
[18:00:33] * dancy whoops it up [18:00:38] PROBLEM - Host elastic2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:42] PROBLEM - Host cp2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:42] PROBLEM - Host cp2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:51] [muted whoop] [18:00:52] dancy: BREAK ALL THE THINGS!!! [18:01:04] PROBLEM - Host ml-cache2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:06] PROBLEM - Host thanos-fe2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:15] The button has been pressed. [18:01:38] * Sario listens for explosions in the distance [18:01:40] PROBLEM - Host elastic2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:50] PROBLEM - Host elastic2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:20] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819654 (https://phabricator.wikimedia.org/T308076) [18:02:22] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819654 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [18:02:28] PROBLEM - Host mc2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:28] PROBLEM - Host mc2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:50] PROBLEM - Host moss-be2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:54] PROBLEM - Host ms-fe2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819654 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [18:03:42] PROBLEM - Host centrallog2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:20] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.23 refs T308076 [18:04:24] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076 [18:05:16] PROBLEM - Host ms-be2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:05:16] PROBLEM - Host ms-be2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:05:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:05:44] PROBLEM - Host ms-be2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:36] RECOVERY - Host cloudcephmon2005-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [18:07:40] RECOVERY - Host cloudservices2004-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.90 ms [18:08:08] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:08:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:08:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:08:34] (03PS1) 10RLazarus: site: Add mc2038 as mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/819655 (https://phabricator.wikimedia.org/T293012) [18:09:12] (03CR) 10Ssingh: [C: 03+1] site: Add mc2038 as mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/819655 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [18:10:16] PROBLEM - Kafka Broker Under Replicated Partitions on 
kafka-logging2001 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [18:10:20] RECOVERY - Host cloudcephmon2006-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [18:10:24] RECOVERY - Host cloudgw2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [18:10:24] RECOVERY - Host cloudnet2006-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.89 ms [18:10:24] RECOVERY - Host cloudnet2005-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.27 ms [18:10:24] RECOVERY - Host cloudservices2005-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [18:10:32] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:10:44] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 70 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [18:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:11:20] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [18:11:58] PROBLEM - Host lvs2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:14] (03PS1) 10RLazarus: mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) [18:14:30] PROBLEM - Host elastic2063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:14:30] PROBLEM - Host elastic2064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:14:30] PROBLEM - Host elastic2077.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [18:14:30] PROBLEM - Host elastic2078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:16:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:16:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade [18:16:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: shutdown for PDU upgrade [18:17:04] PROBLEM - Host wdqs2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:17:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:17:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:55] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2038.codfw.wmnet with reason: install [18:17:57] !log rzl@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2038.codfw.wmnet with reason: install [18:18:29] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2038.codfw.wmnet with reason: install [18:18:33] whoops :) [18:18:45] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2038.codfw.wmnet with reason: install [18:18:47] (03CR) 10Ssingh: [C: 03+1] mcrouter_wancache: Replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819656 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [18:19:00] heh. all good? 
[18:19:23] yep, I just can't type mw2038 if I want to downtime mc2038 [18:19:24] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:19:31] (03PS1) 10Dzahn: gerrit: switch gerrit2002 from gerrit:migration to gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/819672 (https://phabricator.wikimedia.org/T296713) [18:19:32] eagerly awaiting the --dwim flag for cookbooks [18:19:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:19:56] lucky mw2038 doesn't exist, or I might not have noticed [18:20:24] (03CR) 10RLazarus: [C: 03+2] site: Add mc2038 as mediawiki::memcached [puppet] - 10https://gerrit.wikimedia.org/r/819655 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [18:22:36] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [18:23:06] (03CR) 10Ori: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [18:23:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:30:13] (03CR) 10Dzahn: [C: 03+2] gerrit: switch gerrit2002 from gerrit:migration to gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/819672 (https://phabricator.wikimedia.org/T296713) (owner: 10Dzahn) [18:30:18] (03PS2) 10Dzahn: gerrit: switch gerrit2002 from gerrit:migration to gerrit role [puppet] - 10https://gerrit.wikimedia.org/r/819672 (https://phabricator.wikimedia.org/T296713) [18:30:50] RECOVERY - Host ms-be2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms [18:30:50] RECOVERY - Host ms-be2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms [18:31:08] RECOVERY - Host lvs2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [18:31:24] RECOVERY - Host ms-be2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 55.69 ms [18:31:34] RECOVERY - Host ms-be2046.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 45.22 ms [18:32:40] RECOVERY - Host ml-cache2002 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [18:32:46] RECOVERY - Host elastic2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.32 ms [18:32:48] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [18:33:06] RECOVERY - Host ml-cache2002.mgmt is UP: 
PING OK - Packet loss = 0%, RTA = 34.64 ms [18:33:16] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [18:33:38] RECOVERY - Host elastic2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 270.30 ms [18:33:38] RECOVERY - Host elastic2063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 268.62 ms [18:33:38] RECOVERY - Host elastic2077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [18:33:38] RECOVERY - Host elastic2064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 273.04 ms [18:33:38] RECOVERY - Host elastic2078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [18:33:50] RECOVERY - Host elastic2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.87 ms [18:33:52] RECOVERY - Host thanos-fe2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms [18:34:30] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:34:48] RECOVERY - Host moss-be2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.95 ms [18:34:54] RECOVERY - Host ms-fe2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms [18:35:08] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:35:18] RECOVERY - Host cp2031 is UP: PING OK - Packet loss = 0%, RTA = 31.75 ms [18:35:40] RECOVERY - Host 
cp2032 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [18:35:40] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp2032 is CRITICAL: connect to address 10.192.16.33 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [18:35:52] PROBLEM - IPMI Sensor Status on cp2032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:35:54] PROBLEM - purged service on cp2031 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:35:56] PROBLEM - purged service on cp2032 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:36:14] RECOVERY - Host wdqs2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [18:38:12] RECOVERY - purged service on cp2031 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:38:14] RECOVERY - purged service on cp2032 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:39:16] RECOVERY - Host cp2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.85 ms [18:39:16] RECOVERY - Host cp2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [18:39:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:39:58] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp2032 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:40:50] RECOVERY - Host mc2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.63 ms [18:40:51] RECOVERY - Host mc2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.71 ms [18:41:23] ACKNOWLEDGEMENT - HP RAID on ms-be2032 is CRITICAL: CRITICAL: Slot 3: OK: 
1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314427 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat [18:41:26] 10SRE, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10ops-monitoring-bot) [18:41:40] !log rzl@cumin2002 START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet [18:41:40] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet [18:44:37] PROBLEM - Check systemd state on gerrit2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:38] (03PS4) 10Dzahn: gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [18:45:30] ah, removed the downtime before rerunning puppet on icinga -- re-adding it briefly [18:45:44] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mc2038.codfw.wmnet with reason: install [18:46:00] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mc2038.codfw.wmnet with reason: install [18:46:04] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36591/" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [18:46:26] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078 [18:46:28] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 
[18:47:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:47:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:49:07] RECOVERY - Check systemd state on gerrit2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:14] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:51:48] dancy: hrm - i don't think the revert mentioned on T314395 got cherry-picked to wmf/1.39.0-wmf.23 [18:51:49] T314395: Parsoid rt-testing is still broken, parsoid needs a revert - https://phabricator.wikimedia.org/T314395 [18:51:54] ^ cc: subbu [18:52:17] oh, let me check. [18:52:42] i see it merged on master [18:54:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:54:59] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.23 refs T308076 (duration: 50m 39s) [18:55:01] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076 [18:55:38] (03PS1) 10Daniel Kinzler: ParsoidHandler: pass metrics object to HTMLTransformInput [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 [18:56:19] brennen, ok ... so, miscommunication on our team ... it didn't get cherry-picked it appears. [18:57:09] subbu, dancy: ack. can be synced before promotion to group0 then. [18:57:14] (03PS1) 10Subramanya Sastry: Revert "Bump wikimedia/parsoid to 0.16.0-a18" [vendor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819612 [18:57:21] ^ [18:57:52] OK. 
Holding off. [18:58:10] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Bump wikimedia/parsoid to 0.16.0-a18" [vendor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819612 (owner: 10Subramanya Sastry) [18:58:52] (03PS1) 10Daniel Kinzler: ParsoidHandler: fix page bundle input with no orig HTML. [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819613 [18:59:28] apologies for not catching that earlier. [18:59:36] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [19:00:10] i need to step afk for a bit. [19:00:44] ^ gerrit alert is because we added a role to a new host, downtiming [19:01:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine [19:01:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: new machine [19:01:42] dancy, should I +2 that? or will you and bren.nen +2 it before you sync? [19:02:03] subbu: Go for it! [19:02:50] (03CR) 10Subramanya Sastry: [C: 03+2] Revert "Bump wikimedia/parsoid to 0.16.0-a18" [vendor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819612 (owner: 10Subramanya Sastry) [19:03:13] oh, looks like brennen already +2ed it earlier ... i missed it. 
:) [19:03:54] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36591/" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:04:06] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "This also adds the missing ServerAlias" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:05:45] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:06:12] RECOVERY - IPMI Sensor Status on cp2032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:06:24] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:54] ACKNOWLEDGEMENT - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site daniel_zahn setting up new host https://wikitech.wikimedia.org/wiki/Gerrit [19:07:02] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [19:08:04] (03PS1) 10RLazarus: redis: Replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819697 (https://phabricator.wikimedia.org/T293012) [19:09:00] (03CR) 10Ssingh: [C: 03+1] redis: Replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819697 (https://phabricator.wikimedia.org/T293012) (owner: 10RLazarus) [19:10:48] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:11:23] !log gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org T313250 [19:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:27] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250 [19:11:30] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:38] PROBLEM - IPMI Sensor Status on sessionstore2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:13:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-be [19:13:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=varnish-fe [19:13:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2031.codfw.wmnet,service=ats-tls [19:15:45] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:21] !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install [19:17:26] !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2038.codfw.wmnet with reason: install [19:17:50] PROBLEM - IPMI Sensor Status on elastic2058 is CRITICAL: Sensor Type(s) 
Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:18:36] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:59] (03CR) 10CI reject: [V: 04-1] ParsoidHandler: pass metrics object to HTMLTransformInput [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 (owner: 10Daniel Kinzler) [19:20:39] (03CR) 10AOkoth: [C: 03+2] gitlab: enable restore on gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/819589 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [19:20:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:21:26] 10SRE, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q1): Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10Dzahn) >>! In T314353#8122412, @fgiunchedi wrote: > Indeed check max latency spiked up to 30+ min (!) around that time I feel like T196336 is/was also caused by... 
[19:21:32] sudo confctl select 'name=cp2032.codfw.wmnet,service=ats-be' set/pooled=yes [19:21:35] sudo confctl select 'name=cp2032.codfw.wmnet,service=varnish-fe' set/pooled=yes [19:21:38] sudo confctl select 'name=cp2032.codfw.wmnet,service=ats-tls' set/pooled=yes [19:21:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-be [19:21:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=varnish-fe [19:21:41] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2032.codfw.wmnet,service=ats-tls [19:21:41] ha [19:23:31] 10SRE, 10Icinga, 10SRE Observability, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Dzahn) also see T314353#8122412 - feels to me like these can both be the general load and latency on the icinga server [19:24:20] (03Merged) 10jenkins-bot: Revert "Bump wikimedia/parsoid to 0.16.0-a18" [vendor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819612 (owner: 10Subramanya Sastry) [19:25:25] (03PS2) 10Daniel Kinzler: ParsoidHandler: pass metrics object to HTMLTransformInput [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 [19:25:28] PROBLEM - IPMI Sensor Status on kafka-main2002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:25:32] PROBLEM - IPMI Sensor Status on logstash2034 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:25:45] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:26:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-fe2010.codfw.wmnet [19:26:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe2010.codfw.wmnet [19:28:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for thanos-fe2002.codfw.wmnet [19:28:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for thanos-fe2002.codfw.wmnet [19:29:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:30:06] (03CR) 10Subramanya Sastry: [C: 03+1] ParsoidHandler: fix page bundle input with no orig HTML. [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819613 (owner: 10Daniel Kinzler) [19:30:23] (03CR) 10Subramanya Sastry: [C: 03+1] ParsoidHandler: pass metrics object to HTMLTransformInput [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 (owner: 10Daniel Kinzler) [19:32:20] (03PS1) 10Ssingh: hiera: replace mc2024 with mc2038 [puppet] - 10https://gerrit.wikimedia.org/r/819706 (https://phabricator.wikimedia.org/T293012) [19:32:43] RECOVERY - Host cp2034 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [19:32:51] RECOVERY - Host cp2033 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [19:33:02] (03CR) 10Ssingh: "I *think* this should fix the:" [puppet] - 10https://gerrit.wikimedia.org/r/819706 (https://phabricator.wikimedia.org/T293012) (owner: 10Ssingh) [19:33:17] PROBLEM - purged service on cp2033 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:33:25] RECOVERY - Host mc-gp2002 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [19:33:29] PROBLEM - Check systemd state on cp2033 is CRITICAL: CRITICAL 
- degraded: The following units failed: esitest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:43] PROBLEM - purged service on cp2034 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:33:44] history [19:34:10] sukhe: there's a lot of it [19:34:15] ha [19:34:30] I need to close down my terminal windows :) [19:34:47] RECOVERY - purged service on cp2034 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:35:29] RECOVERY - purged service on cp2033 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:35:43] RECOVERY - Check systemd state on cp2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:59] sudo confctl select 'name=cp2033.codfw.wmnet,service=ats-be' set/pooled=yes [19:35:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be[2041,2046].codfw.wmnet [19:36:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2041,2046].codfw.wmnet [19:36:02] sudo confctl select 'name=cp2033.codfw.wmnet,service=varnish-fe' set/pooled=yes [19:36:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-be [19:36:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=varnish-fe [19:36:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2033.codfw.wmnet,service=ats-tls [19:36:05] sudo confctl select 'name=cp2033.codfw.wmnet,service=ats-tls' set/pooled=yes [19:36:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:36:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:37:29] !log sukhe@puppetmaster1001 
conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-be [19:37:29] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=varnish-fe [19:37:29] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2034.codfw.wmnet,service=ats-tls [19:38:21] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:42:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:43:07] RECOVERY - IPMI Sensor Status on sessionstore2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:09] (03CR) 10Andrew Bogott: [C: 03+2] mediawiki: Redirect developers.wm.o to developer.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/816175 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [19:47:28] (03CR) 10Andrew Bogott: [C: 03+2] wikimedia.org: Add developers.wikimedia.org alias [dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [19:47:35] (03PS3) 10Andrew Bogott: wikimedia.org: Add developers.wikimedia.org alias [dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [19:52:22] dancy: 'fraid my network situation is a bit tenuous, if you don't mind syncing that vendor/ revert. [19:52:42] Can do. [19:52:44] * brennen curses verizon at some length, not for the first time. 
[19:53:37] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:57] PROBLEM - Host mc2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:05] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [19:54:11] PROBLEM - Host ml-serve2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [vendor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819612 (owner: 10Subramanya Sastry) [19:54:49] PROBLEM - Host db2107.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:49] PROBLEM - Host db2109.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:57] PROBLEM - Host db2137.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:57] PROBLEM - Host db2143.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:54:57] PROBLEM - Host ores2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:03] PROBLEM - Host db2147.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:07] PROBLEM - Host db2159.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:07] PROBLEM - Host pc2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:07] PROBLEM - Host db2160.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:09] PROBLEM - Host pki2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:11] PROBLEM - Host prometheus2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:13] PROBLEM - Host db2177.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:13] PROBLEM - Host db2178.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:55:23] !log dancy@deploy1002 Started scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" [19:56:13] RECOVERY - Host db2147.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [19:56:37] RECOVERY - IPMI Sensor Status on logstash2034 is OK: Sensor Type(s) Temperature, Power_Supply 
Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:58:01] urbanecm: will you be doing the deployment window again today? I have two more patches to go in, i'll hit +2 on them now. [19:58:22] duesen: i can [19:58:53] (03CR) 10Daniel Kinzler: [C: 03+2] "merging for backport deployment" [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 (owner: 10Daniel Kinzler) [19:59:05] (03CR) 10Daniel Kinzler: [C: 03+2] "merging for backport deployment" [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819613 (owner: 10Daniel Kinzler) [19:59:22] subbu: Are you still around? [19:59:39] RECOVERY - Host ml-serve2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [19:59:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:59:49] !log dancy@deploy1002 Started deploy [gerrit/gerrit@94c5028]: (no justification provided) [19:59:51] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 01s) [20:00:05] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220802T2000). [20:00:05] danisztls, duesen, subbu, and xSavitar: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:14] o/ [20:00:21] RECOVERY - Host db2109.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [20:00:21] RECOVERY - Host db2107.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [20:00:25] RECOVERY - Host db2137.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.02 ms [20:00:25] RECOVERY - Host db2143.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [20:00:25] RECOVERY - Host mc2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.54 ms [20:00:25] RECOVERY - Host ores2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.51 ms [20:00:25] RECOVERY - Host db2159.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [20:00:26] RECOVERY - Host db2160.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [20:00:26] RECOVERY - Host pc2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [20:00:27] RECOVERY - Host pki2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms [20:00:27] RECOVERY - Host prometheus2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [20:00:28] RECOVERY - Host db2177.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [20:00:28] RECOVERY - Host db2178.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [20:01:00] (03CR) 10Dduvall: [C: 03+1] "Love it. Let's roll!" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [20:01:16] !log dancy@deploy1002 Started deploy [gerrit/gerrit@94c5028]: (no justification provided) [20:01:21] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 05s) [20:01:32] subbu: The revert is on mwdebug now. Can you test it? [20:01:59] dancy .. nothing really to test .. okay to sync. 
[20:02:05] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:02:08] i can deploy today [20:02:12] ok [20:02:17] dancy: but it looks like you're deploying too, so i'll wait for you to finish [20:02:38] thanks urbanecm -- i would have offered but my internet has been spotty all day [20:02:43] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:02:48] no problem [20:04:29] my patches will take a while to merge anyway. [20:04:31] (03CR) 10Andrew Bogott: "seems to work :) Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [20:04:51] there's also a config patch by danisztls, but i don't see them atm [20:04:55] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:05:27] (anyone any idea what's happening with mw@codfw?) 
[20:05:45] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:06:53] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:819612]] Revert "Bump wikimedia/parsoid to 0.16.0-a18" (duration: 11m 30s) [20:07:19] urbanecm: there's a lot of maintenance going on [20:07:23] urbanecm, subbu: I'm done [20:07:29] thanks [20:07:36] ty [20:09:53] PROBLEM - Juniper alarms on asw-b-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:10:41] urbanecm: lemme know when you're done and I'll roll to group0 [20:11:15] dancy: sure. shipping an low-urgency patch now, while waiting for CI. [20:11:21] (03PS2) 10Urbanecm: GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 [20:11:31] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 (owner: 10Urbanecm) [20:12:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:51] RECOVERY - Juniper alarms on asw-b-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:15:57] (03Merged) 10jenkins-bot: GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 (owner: 10Urbanecm) [20:16:01] finally [20:18:03] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:19:23] RECOVERY - IPMI Sensor Status on elastic2058 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:19:32] (03Merged) 10jenkins-bot: ParsoidHandler: pass metrics object to HTMLTransformInput [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819611 (owner: 10Daniel Kinzler) [20:20:48] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5fac0aaf8e76a6f8cc3302771eac068e4f866e5f: GrowthExperiments: Remove wgGEHomepageTutorialTitle (duration: 03m 26s) [20:21:17] (03Merged) 10jenkins-bot: ParsoidHandler: fix page bundle input with no orig HTML. [core] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819613 (owner: 10Daniel Kinzler) [20:21:41] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:22:09] duesen: pulled to mwdebug1001. can you check? [20:22:33] urbanecm: oh, you are fast, I was just about to log in ;) [20:22:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:56] duesen: oh, sorry :). want to finish the deployment? 
[20:23:12] urbanecm: no, i'm quite tired and afraid to mess things up :) I'm happy if you do it [20:23:20] okay, sounds good [20:23:30] let me know how the patches look at mwdebug1001 then :) [20:24:06] But... there is nothing to test on debug. The metrics one isn't testable at all. And the ParsoidHandler one will need to be on a host that has the parsoid endpoints enabled to be testable. [20:24:30] okay, in that case, let's sync it [20:24:33] subbu: can you verify that the RT tests look better once the patch has rolled out? [20:24:41] will do [20:24:44] urbanecm: let me just check that the site loads :=) [20:25:01] does for me! :) [20:25:57] urbanecm: yea, VE also comes up. So it's at least not totally broken ;) [20:26:04] let me know once synced and i'll verify ... since wmf.23 is only on group 0 ... oh hold on .. it isn't on group0 yet right? [20:26:25] let's hope :) [20:26:26] so, i cannot verify till wmf.23 rolls out to group 0. since this is a wmf.23 backport. [20:26:30] subbu: only testwiki [20:26:42] ah, ok. hold on then. let me create a page on testwiki before you sync. :) [20:26:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:18] Oh, testwiki isn't group 0, but group -1? [20:27:29] group 0.5 [20:27:40] ;) [20:28:13] ok, done. and verified it is broken right now ... will verify once synced. 
[20:29:39] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/ParsoidHandler.php: 322a960e3777bc01fa8823908340c36e3851a648: ParsoidHandler: pass metrics object to HTMLTransformInput (duration: 03m 19s) [20:29:53] metrics one synced, second one flying [20:30:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:52] (03PS1) 10Ssingh: Revert "Depool codfw for PDU upgrade" [dns] - 10https://gerrit.wikimedia.org/r/819619 [20:33:02] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/includes/Rest/Handler/HTMLTransformInput.php: 69e91528a5c6f372af520307dc2f4227b9981442: ParsoidHandler: fix page bundle input with no orig HTML (duration: 03m 22s) [20:33:08] duesen: subbu: synced [20:33:39] verified fixed. [20:34:06] great! [20:34:14] duesen: anything else to do? [20:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:35:50] urbanecm: if subbu says it's good, i'm happy :) [20:36:02] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/819619 (owner: 10Ssingh) [20:36:04] ack :) [20:36:10] !log UTC evening B&C window done [20:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:14] dancy: floor is yours [20:36:29] Thanks.. 
Rolling forward to group0 [20:36:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:36:42] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819741 (https://phabricator.wikimedia.org/T308076) [20:36:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:36:44] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819741 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [20:36:51] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS buster [20:37:01] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster [20:37:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:00] RECOVERY - Host cloudcontrol2005-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [20:38:00] RECOVERY - Host cloudgw2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [20:38:01] !log re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise T313250 T243027 T279509 [20:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] T243027: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 [20:38:07] T279509: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 [20:38:07] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250 [20:38:42] * urbanecm is confused. are we now using a bot to roll trains forward? 
[20:39:32] A bot account is used to push the wikiversions change to Gerrit [20:39:35] (03CR) 10Ssingh: [C: 03+2] Revert "Depool codfw for PDU upgrade" [dns] - 10https://gerrit.wikimedia.org/r/819619 (owner: 10Ssingh) [20:39:52] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "Depool codfw for PDU upgrade" [dns] - 10https://gerrit.wikimedia.org/r/819619 (owner: 10Ssingh) [20:40:15] (03PS2) 10Ssingh: Revert "Depool codfw for PDU upgrade" [dns] - 10https://gerrit.wikimedia.org/r/819619 [20:40:30] RECOVERY - Host cloudweb2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [20:40:50] (03CR) 10Andrew Bogott: [C: 03+2] openstack: wmcs-image-create: adapt for systemd based puppet runs [puppet] - 10https://gerrit.wikimedia.org/r/814857 (owner: 10Majavah) [20:41:01] i see [20:41:58] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/819619 (owner: 10Ssingh) [20:43:02] RECOVERY - Host cloudcephmon2004-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [20:43:46] RECOVERY - Host clouddb2002-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [20:43:46] RECOVERY - Host cloudgw2003-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [20:46:30] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819741 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [20:49:46] RECOVERY - Host cloudcontrol2001-dev.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.00 ms [20:50:35] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23 refs T308076 [20:50:38] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076 [20:51:07] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [20:51:21] Possible new error: .23 i/l/r/q/JoinGroupBase:138 Wikimedia\Rdbms\JoinGroupBase::addJoin: $table must be either string, JoinGro... 
[20:52:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:53:03] Rolling back [20:53:31] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819745 (https://phabricator.wikimedia.org/T308076) [20:53:33] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819745 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [20:53:48] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [20:53:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:53:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:54:30] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819745 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [20:54:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:58:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.22 refs T308076 [20:58:26] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076 [20:59:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:00:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [21:00:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:01:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:02:36] dancy: do you need a backport for the blocker? [21:02:53] (03PS1) 10Ladsgroup: Fix appending of join conds [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819621 (https://phabricator.wikimedia.org/T312421) [21:05:16] Amir1: yes please, that would be great [21:05:28] (03CR) 10Ladsgroup: [C: 03+2] Fix appending of join conds [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819621 (https://phabricator.wikimedia.org/T312421) (owner: 10Ladsgroup) [21:05:32] Awesome [21:05:45] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:06:06] no idea if this is my end, but have a vague "wikitech.wikimedia.org feels slow" [21:07:18] (consistently slower than en.wikipedia.org for example, ~8-10 seconds to load the main page? 
🤷) [21:07:36] it's quite fast for me [21:07:43] same [21:08:11] woo \o/ [21:08:30] (rather that than anything actually wrong) [21:11:09] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [21:11:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS buster [21:11:35] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster completed: - gerrit2002 (**PASS**)... [21:12:55] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:43] (03PS1) 10CDanis: Add VictorOps CLI tool & escalate_unpaged command [software/klaxon] - 10https://gerrit.wikimedia.org/r/819750 (https://phabricator.wikimedia.org/T313603) [21:15:45] (JobUnavailable) firing: (4) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:17:51] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:18:53] 10SRE-OnFire, 10observability, 
10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) During discussion with @fgiunchedi toda... [21:21:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:23:43] (03PS1) 10Ebernhardson: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - 10https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) [21:24:08] (03CR) 10Ebernhardson: [C: 04-1] "The dependant patch needs to be fully deployed to production before this can move forward." [deployment-charts] - 10https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: 10Ebernhardson) [21:24:47] (03PS2) 10Ebernhardson: Change CirrusSearchElasticaWrite partitioning key [deployment-charts] - 10https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) [21:25:19] (03Merged) 10jenkins-bot: Fix appending of join conds [extensions/CirrusSearch] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/819621 (https://phabricator.wikimedia.org/T312421) (owner: 10Ladsgroup) [21:25:23] RECOVERY - Host ps1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms [21:25:43] PROBLEM - ps1-b2-codfw-infeed-load-tower-A-phase-Y on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:43] PROBLEM - ps1-b2-codfw-infeed-load-tower-B-phase-Y on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:17] PROBLEM - ps1-b2-codfw-infeed-load-tower-B-phase-Z on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:23] PROBLEM - ps1-b2-codfw-infeed-load-tower-A-phase-X on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:25] PROBLEM - ps1-b2-codfw-infeed-load-tower-A-phase-Z on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:29] PROBLEM - ps1-b2-codfw-infeed-load-tower-B-phase-X on ps1-b2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:26:59] Amir1: Your change has been merged. 
[21:27:27] dancy: it's already being synced [21:27:34] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: deploy wmf-elasticsearch-search-plugins pkg - bking@cumin1001 - T314078 [21:27:37] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078 [21:28:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:28:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:29:13] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/includes/Sanity/Checker.php: Backport: [[gerrit:819621|Fix appending of join conds (T312421 T314439)]] (duration: 03m 15s) [21:29:18] T312421: Migrate usage of Database::select to SelectQueryBuilder in CirrusSearch - https://phabricator.wikimedia.org/T312421 [21:29:18] T314439: InvalidArgumentException: Wikimedia\Rdbms\JoinGroupBase::addJoin: $table must be either string, JoinGroup or SelectQueryBuilder - https://phabricator.wikimedia.org/T314439 [21:35:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:40:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:46:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:46:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:47:24] Amir1: Ready for roll forward to group0 again? [21:49:46] (03PS5) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) [21:51:58] dancy: yup [21:52:41] I thought we are already on group0. 
Didn't see a rollback [21:53:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:53:32] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819756 (https://phabricator.wikimedia.org/T308076) [21:53:34] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819756 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [21:54:22] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819756 (https://phabricator.wikimedia.org/T308076) (owner: 10TrainBranchBot) [21:56:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:58:17] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.23 refs T308076 [21:58:20] T308076: 1.39.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T308076 [21:58:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:59:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:59:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:00:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:01:16] (03PS2) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:02:45] (03CR) 10CI reject: [V: 04-1] extra ips 
for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:03:56] (03PS3) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:04:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:29] !log dancy@deploy1002 Started deploy [gerrit/gerrit@94c5028]: (no justification provided) [22:04:35] !log dancy@deploy1002 Finished deploy [gerrit/gerrit@94c5028]: (no justification provided) (duration: 00m 06s) [22:04:42] (03CR) 10CI reject: [V: 04-1] extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:04:44] well, no immediate .23 errors anyhow... [22:05:07] brennen: you were late to the party [22:05:38] my timing is clearly impeccable [22:05:52] (03PS4) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:05:54] (actually: highly, uh, peccable) [22:06:53] (03CR) 10Andrew Bogott: extra ips for codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:08:08] (03PS5) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:09:54] I'm calling it a day then [22:10:11] See you in a couple of hours 🤦🤦 [22:10:13] Thanks Amir! 
Have a good one [22:10:14] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:10:21] (03CR) 10CI reject: [V: 04-1] extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:10:35] (03CR) 10Dzahn: [C: 03+2] devtools: Configure keyholder for scap3 deployment of phabricator [puppet] - 10https://gerrit.wikimedia.org/r/819193 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [22:14:00] wotcha all, is T314444 expected/known? Seeing quite a few `Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction` errors [22:14:05] T314444: Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction - https://phabricator.wikimedia.org/T314444 [22:14:27] (03PS1) 10Ahmon Dancy: Add gerrit2002.wikimedia.org to scap targets list [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/819760 (https://phabricator.wikimedia.org/T243027) [22:14:52] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:15:17] TheresNoTime: Those happen from time to time. 
[22:15:28] The User::saveSettings() call in AuthManager resulting in locks is not new. [22:15:31] !log gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site /home) again after gerrit2002 was reimaged with buster T313250 T313972 [22:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:35] T313972: Add gerrit2002 as a replica of gerrit1001 - https://phabricator.wikimedia.org/T313972 [22:15:35] T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250 [22:15:46] https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@42b0d52&_a=h@c759faa eh this feels more than "time to time" dancy ? [22:15:50] (03PS6) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:16:02] 96/hour? [22:16:21] The main issue was that implementations of the LocalUserCreatedHook incorrectly called User::saveSettings themself. But those seems to have all been fixed. [22:16:28] (03CR) 10Dzahn: "linked to wrong ticket. correct would be https://phabricator.wikimedia.org/T313250" [puppet] - 10https://gerrit.wikimedia.org/r/819672 (https://phabricator.wikimedia.org/T296713) (owner: 10Dzahn) [22:17:43] there apparently is some race condition with centralauth autocreating account in metawiki and loginwiki [22:17:58] hm, causing T314442 ? 
[22:17:59] T314442: New accounts not attaching to meta/loginwiki - https://phabricator.wikimedia.org/T314442 [22:18:06] (03CR) 10Dzahn: [C: 03+1] Add gerrit2002.wikimedia.org to scap targets list [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/819760 (https://phabricator.wikimedia.org/T243027) (owner: 10Ahmon Dancy) [22:19:30] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:20:00] maybe [22:22:05] (03PS7) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:22:14] (03CR) 10Ahmon Dancy: [C: 03+2] Add gerrit2002.wikimedia.org to scap targets list [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/819760 (https://phabricator.wikimedia.org/T243027) (owner: 10Ahmon Dancy) [22:22:48] I'm taking a break. [22:23:46] (03Merged) 10jenkins-bot: Add gerrit2002.wikimedia.org to scap targets list [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/819760 (https://phabricator.wikimedia.org/T243027) (owner: 10Ahmon Dancy) [22:25:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:26:16] (ah ty zabe ref. T306636) [22:26:17] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636 [22:27:29] yw. 
I also closed the private one as a dupe since it doesn't really appear to have further useful information. [22:28:34] yeah good call :) [22:29:53] (03PS8) 10Andrew Bogott: extra ips for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:32:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:32:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:32:55] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36600/" [puppet] - 10https://gerrit.wikimedia.org/r/819593 (https://phabricator.wikimedia.org/T313977) (owner: 10Vivian Rook) [22:35:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:38:02] but actually the error rate only started growing a few hours ago https://logstash.wikimedia.org/app/discover#/?_g=(filters:!(),query:(language:lucene,query:'normalized_message:%22%5B%7BreqId%7D%5D%20%7Bexception_url%7D%20Wikimedia%5CRdbms%5CDBQueryError:%20Error%201213:%20Deadlock%20found%20when%20trying%20to%20get%20lock;%20try%20restarting%20transaction%20(db1181)%20Function:%20MediaWiki%5CUser%5CUserOptionsManager::saveOptionsInternal%20Query: [22:38:02] 
%20INSERT%20IGNORE%20INTO%20%60user_properties%60%22'),refreshInterval:(pause:!t,value:0),time:(from:now-15h,to:now))&_a=(columns:!(_source),filters:!(),index:'logstash-*',interval:auto,query:(language:lucene,query:'normalized_message:%22%20Error%201213:%20Deadlock%20found%20when%20trying%20to%20get%20lock;%20try%20restarting%20transaction%22'),sort:!()) [22:38:05] bah [22:38:25] https://logstash.wikimedia.org/goto/763c2a520d4e9e09923bf4bef9cce46d [22:39:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:39:22] RECOVERY - Host ps1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.86 ms [22:40:05] (03CR) 10Chad: [C: 03+1] gerrit: add hiera data for a second replica [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [22:40:45] zabe: yeah, something feels "off" about it all, I believe op873 noticed it this time around and it certainly feels like it's affecting more accounts than usual [22:41:04] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (10Dzahn) [22:48:37] (03PS1) 10Papaul: Add new PDU model [puppet] - 10https://gerrit.wikimedia.org/r/819763 (https://phabricator.wikimedia.org/T310070) [22:50:04] (03CR) 10CI reject: [V: 04-1] Add new PDU model [puppet] - 10https://gerrit.wikimedia.org/r/819763 (https://phabricator.wikimedia.org/T310070) (owner: 10Papaul) [22:53:42] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [22:53:48] (03PS2) 10Papaul: Add new PDU model [puppet] - 10https://gerrit.wikimedia.org/r/819763 (https://phabricator.wikimedia.org/T310070) [22:55:57] (03CR) 10Papaul: [C: 03+2] Add new PDU model [puppet] - 10https://gerrit.wikimedia.org/r/819763 (https://phabricator.wikimedia.org/T310070) (owner: 10Papaul) [22:56:30] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:00:48] (03PS1) 10Zabe: Start writing to cuc_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/819765 (https://phabricator.wikimedia.org/T233004) [23:01:16] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:16:42] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:47] (Device rebooted) firing: Alert for device ps1-a7-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:19:23] 10SRE, 10Observability-Logging, 10vm-requests: logstash collector nodes in codfw not row redundant - https://phabricator.wikimedia.org/T313408 (10colewhite) >>! In T313408#8124432, @MoritzMuehlenhoff wrote: > Looks good, could you use Row C but wait two days? I'm currently reimaging codfw ganeti nodes to bul... 
[23:22:08] (03CR) 10Cwhite: [C: 03+2] logstash: add rolling strategy to json logs [puppet] - 10https://gerrit.wikimedia.org/r/817388 (https://phabricator.wikimedia.org/T166107) (owner: 10Cwhite) [23:22:30] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [23:22:47] (Device rebooted) resolved: Device ps1-a7-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:24:27] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [23:26:34] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:32:46] (Device rebooted) firing: Alert for device ps1-b1-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:35:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET 
[23:37:46] (Device rebooted) resolved: Device ps1-b1-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:48:44] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET