[00:02:27] (PS1) Reedy: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886)
[00:03:37] (CR) jerkins-bot: [V: -1] Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:04:06] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) @ssingh doh5001.wikimedia.org is ready for you now. doh5002 on hold for lack of IP in that subnet.
[00:04:37] (PS2) Reedy: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886)
[00:05:12] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[00:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:55] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1008.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[00:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:59] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:06:35] (CR) Reedy: [C: +2] Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:06:58] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[00:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:20] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[00:07:21] (Merged) jenkins-bot: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:31] thcipriani sorry I missed the backport window, I'll reschedule for next week
[00:08:41] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T280886 (duration: 00m 57s)
[00:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:45] T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886
[00:12:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:11] (PS1) Dzahn: static-bugzilla: only load httpd modules actually needed, add gzipped test file, minimize [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538)
[00:46:17] (CR) Dzahn: [C: -2] "WIP, just storing it for right now" [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:06:07] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3806497136 and 183 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:33] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2399093560 and 117 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:37] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9858712400 and 605 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:47] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 878057424 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:05] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5608868672 and 381 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:21] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 570944 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:35] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 100856 and 198 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:43] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 148288 and 206 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:11] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 381664 and 293 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:39] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68848 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:24:03] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[01:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:00] SRE, LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (Ottomata) Approved
[01:39:23] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (Ottomata) Approved
[01:41:03] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 1.061e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:00:48] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift
[02:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:28] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift (duration: 04m 40s)
[02:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:48] !log post-deploy restart airflow-(webserver|scheduler) on an-airflow1001
[02:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:47] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:34] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[02:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:39] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:16:42] SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (ssingh) >>! In T284246#7133295, @Dzahn wrote: > @ssingh doh5001.wikimedia.org is ready for you now. doh5002 on hold for lack of IP in that subnet. Thanks...
[02:25:02] !log [WDQS] `ryankemper@wdqs1012:~$ sudo pool` (caught up on lag)
[02:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:33] !log T280382 `wdqs1005.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv`
[02:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:37] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:32:38] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:10] !log [WDQS] `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "repair overinflated wikidata jnl" --blazegraph_instance blazegraph`
[02:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:41:58] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2963 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[02:42:11] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.5385 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:42:23] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6349 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:42:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[02:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:46] ummm
[02:43:19] ryankemper: your cookbook had nothing to do with MW requests right?
[02:43:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:43:22] good evening 👋
[02:43:46] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.61 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[02:43:53] 👋 hrmm
[02:44:01] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:44:03] hi :P
[02:44:08] we had a couple blips of "Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad" earlier today
[02:44:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:44:18] nothing that hit the paging threshold though, I didn't put too much time into looking at it
[02:44:45] well, that was quick, yea, we did, they just did not trigger this
[02:44:53] but they were equally short
[02:45:03] appserver latency spiked and recovered, so whatever this was, it had real impact just now https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now
[02:45:09] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:46:01] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4444 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:46:14] the graph doesn't look awesome
[02:46:43] it looks like it could rebuild to a new peak each iteration (since it started looking funky ~30m ago)
[02:47:11] like some kind of "ringing" sort of system effect
[02:47:50] sure, could be something backing off and then un-backing-off
[02:48:28] started like 9 hours ago when zooming out
[02:48:54] % of mw servers with over 60% of workers busy
[02:49:05] ^ if you look at that for the last 2 days
[02:49:06] we had a spike of s1 errors there too https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=now-1h&to=now&forceLogin=true
[02:49:17] yeah there was a notable bump to a new plateau at ~15:50, then some little tiny spikes all along since
[02:49:24] explains why it affected both api_ and appservers this time
[02:49:24] but the past ~30m is much worse
[02:50:29] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-2d&orgId=1&to=now&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET&viewPanel=9
[02:50:50] yeah
[02:51:06] ^ there was a more-notable jump to a new plateau around the time of a deploy at 10:00
[02:51:25] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06349 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:51:27] peering at https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser but I don't know my way around it super well
[02:51:29] but I donno how stable this graph normally is, maybe that's "normal" :)
[02:52:26] that is a very complex sql statement :)
[02:53:41] https://tendril.wikimedia.org/activity shows the currently running slow queries
[02:54:02] yeah, just the worst of spike's already past
[02:54:23] the slow query graph does roughly match the appserver latency spikes
[02:54:41] and that looks like an unnaturally high number of "Tsum"
[02:54:42] 10:13 Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc2 primary T282761 (duration: 00m 58s)
[02:54:43] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[02:55:45] if parser cache isn't as full/hot then latency would go up but in theory it should trend downwards as it fills up
[02:56:20] it could be that we finally got around to a more interesting time-of-day for certain traffic
[02:56:39] (a new region with different hot content coming in)
[02:56:44] * legoktm nods
[02:57:01] I peeked at api.log btw and nothing obvious stood out to me, like excessive hammering
[02:57:17] https://phabricator.wikimedia.org/T282761#7131412
[02:58:03] this is actually peak US query load timeframe
[02:58:11] 5xx showed an increase in errors at the time of the page, but also, no single IP/range stood out, could've easily been the requests failing as a side-effect, not the cause
[02:58:52] esams is in its overnight lull, but the US is on its peak plateau-ish period and right about now eqsin is ramping up pretty hard for the day too
[03:00:33] SRE, serviceops, Patch-For-Review: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Dzahn) .
[03:01:04] https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=5&orgId=1&var-cluster=cache_text&var-site=codfw&var-site=eqiad&var-site=ulsfo&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-7d&to=now&refresh=15m
[03:01:23] ^ this is the one week view of the req graph shape, for "all but esams"
[03:01:39] all-but-esams this is peak time, whereas esams itself is on a completely different schedule :)
[03:02:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:04:07] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:04:10] it has been 17 hours since " pc1008.eqiad.wmnet with reason: Purging parsercache" and that's also about the time since the first spike in Mean average latency
[03:04:55] rzl: I figured out why /* IndexPager::buildQueryInfo (history page unfiltered) */ is the top slow query, someone is fetching history pages with &limit=2000
[03:05:38] 2021-06-04 02:45:58 [163e3d87-49b2-4070-8fa5-a4bc5efe6fc0] mw1275 enwiki 1.37.0-wmf.7 exception ERROR: [163e3d87-49b2-4070-8fa5-a4bc5efe6fc0] /w/index.php?title=Bronze_Age&offset=20110309020825%7C417887994&limit=2000&action=history Wiki
[03:05:38] media\RequestTimeout\RequestTimeoutException: The maximum execution time of 60 seconds was exceeded {"exception_url":"/w/index.php?title=Bronze_Age&offset=20110309020825%7C417887994&limit=2000&action=history","reqId":"163e3d87-49b2-4070-8
[03:05:38] fa5-a4bc5efe6fc0","caught_by":"entrypoint"}
[03:05:46] I guess if they're doing it fast enough that could be the cause here
[03:06:30] yeah seems like a solid theory
[03:06:38] uhmm.. can we change the max limit ?
[03:06:41] there's some elevation in 500s that matches too, but it's not huge
[03:06:52] the timeouts started at 2021-06-04 02:45:33 per exception.log
[03:07:05] https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?viewPanel=2&orgId=1&var-site=codfw&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST&from=now-3h&to=now
[03:07:06] no wait, that's when it rotated
[03:07:39] 02:42-02:46 was the biggest peak of the little 500s spike too
[03:18:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:20:21] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3492 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:20:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:35:27] (PS1) BBlack: cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274)
[03:36:33] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06349 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:37:03] (CR) RLazarus: [C: +1] cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274) (owner: BBlack)
[03:37:26] (CR) BBlack: [C: +2] cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274) (owner: BBlack)
[03:49:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:43] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1138 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[04:22:22] !log T280382 `wdqs2001.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv`
[04:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:27] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:25:40] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2002.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[04:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:21] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2002.codfw.wmnet with reason: REIMAGE
[04:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:33] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2002.codfw.wmnet with reason: REIMAGE
[04:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:33] (PS1) Marostegui: db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698085
[05:09:44] (CR) Marostegui: [C: +2] db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698085 (owner: Marostegui)
[05:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16287 and previous config saved to /var/cache/conftool/dbconfig/20210604-051010-root.json
[05:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:50] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:16:59] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[05:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:06] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[05:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:10] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[05:22:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[05:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:31] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[05:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:27] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[05:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:30] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[05:25:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16288 and previous config saved to /var/cache/conftool/dbconfig/20210604-052514-root.json
[05:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:41] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16289 and previous config saved to /var/cache/conftool/dbconfig/20210604-054017-root.json
[05:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16290 and previous config saved to /var/cache/conftool/dbconfig/20210604-055521-root.json
[05:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:42] SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (elukey) >>! In T284136#7133186, @colewhite wrote: > @KFrancis can you confirm an NDA on file for @Cervisiarius? @colewhite in https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Access_requests it is me...
[06:42:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 db1096:3315', diff saved to https://phabricator.wikimedia.org/P16291 and previous config saved to /var/cache/conftool/dbconfig/20210604-064242-marostegui.json
[06:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:50] !log Upgrade mysql on db1096:3315 db1096:3316
[06:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16292 and previous config saved to /var/cache/conftool/dbconfig/20210604-064807-root.json
[06:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16293 and previous config saved to /var/cache/conftool/dbconfig/20210604-064815-root.json
[06:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210604T0700)
[07:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16294 and previous config saved to /var/cache/conftool/dbconfig/20210604-070311-root.json
[07:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16295 and previous config saved to /var/cache/conftool/dbconfig/20210604-070319-root.json
[07:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:10] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:05:56] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:15:56] (PS1) Marostegui: install_server: Reimage db2113 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/698151 (https://phabricator.wikimedia.org/T283235)
[07:16:52] (CR) Marostegui: [C: +2] install_server: Reimage db2113 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/698151 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[07:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16296 and previous config saved to /var/cache/conftool/dbconfig/20210604-071815-root.json
[07:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16297 and previous config saved to /var/cache/conftool/dbconfig/20210604-071823-root.json
[07:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:12] !log Password reset for SUL User:Dominic_Mayers (T282656)
[07:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:16] T282656: User:Dominic_Mayers has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T282656
[07:22:27] (PS1) Muehlenhoff: Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456)
[07:22:29] (CR) Gehel: [C: +1] wdqs-internal: lower depool threshold to .3 [puppet] - https://gerrit.wikimedia.org/r/698069 (https://phabricator.wikimedia.org/T284264) (owner: Ryan Kemper)
[07:22:41] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:23:00] (CR) jerkins-bot: [V: -1] Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:23:45] (PS1) Marostegui: install_server: Do not reimage db2151 [puppet] - https://gerrit.wikimedia.org/r/698153
[07:24:13] !log cleanup now unused nginx mods and former deps on install* servers after switch towards nginx-light (various X11 libs and libxslt)
[07:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] (CR) Marostegui: [C: +2] install_server: Do not reimage db2151 [puppet] - https://gerrit.wikimedia.org/r/698153 (owner: Marostegui)
[07:29:38] !log cleanup now unused nginx mods and former deps on install* and puppetdb* servers after switch towards nginx-light (various X11 libs and libxslt) T164456
[07:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:42] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[07:29:55] (CR) Jcrespo: [C: +1] "I see nothing wrong here." [puppet] - https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: Bstorm)
[07:31:07] (PS2) Muehlenhoff: Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456)
[07:32:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16298 and previous config saved to /var/cache/conftool/dbconfig/20210604-073318-root.json
[07:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16299 and previous config saved to /var/cache/conftool/dbconfig/20210604-073326-root.json
[07:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:20] !log stop and upgrade db1150 T283235
[07:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:25] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[07:44:40] (PS1) Muehlenhoff: Switch scandium/testreduce to nginx-light [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456)
[07:52:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:56:21] (CR) Muehlenhoff: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/697988 also needs to be merged first" [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:08:50] (CR) Filippo Giunchedi: [V: +2 C: +2] New upstream release [debs/karma] - https://gerrit.wikimedia.org/r/697916 (owner: Filippo Giunchedi)
[08:09:33] (PS1) Jcrespo: dbbackups: Switchover eqiad s5 backups from db1145 to db1150 (buster) [puppet] - https://gerrit.wikimedia.org/r/698157 (https://phabricator.wikimedia.org/T283235)
[08:09:39] (PS2) Jcrespo: dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235)
[08:20:00] !log upgrade karma to 0.86-1
[08:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:17] (CR) Filippo Giunchedi: [C: +2] alertmanager: cc -operations on IRC for all SRE pages [puppet] - https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) (owner: Filippo Giunchedi)
[08:24:19] (PS1) Jcrespo: mariadb: Switchover s7&s8 codfw backups from db2100 to db2098 [puppet] - https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995)
[08:24:22] (CR) Filippo Giunchedi: [C: +2] alertmanager: highlight 'instance' label in alerts dashboard [puppet] - https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:25:49] (CR) Jcrespo: "To be reverted once T283995 is fixed (it may happen while I am off)."
[puppet] - 10https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995) (owner: 10Jcrespo) [08:29:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P16300 and previous config saved to /var/cache/conftool/dbconfig/20210604-082956-marostegui.json [08:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16301 and previous config saved to /var/cache/conftool/dbconfig/20210604-083232-root.json [08:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10fgiunchedi) Ack, thank you @RobH ! [08:33:39] !log Upgrade db1110 T283235 [08:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:43] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 [08:36:39] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10fgiunchedi) Disk is rebuilding, thank you @Cmjohnson [08:47:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16302 and previous config saved to /var/cache/conftool/dbconfig/20210604-084735-root.json [08:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:12] (03CR) 10Jbond: [C: 03+2] P:gitlab: open SSH port to the world [puppet] - 10https://gerrit.wikimedia.org/r/696024 (https://phabricator.wikimedia.org/T276144) (owner: 10Jbond) [08:55:22] PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 117.82, 104.03, 71.40 https://wikitech.wikimedia.org/wiki/Swift 
[08:56:46] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10fgiunchedi) I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs primarily for ease of management: on hardware faults on bare meta... [08:59:58] ms-be1053 is rebuilding a disk [09:00:06] (03CR) 10Hashar: "Gerrit is all happy under Java 11. Nothing refers to the java 8 binary based on (lsof -f). Lets remove the unused packages on Monday " [puppet] - 10https://gerrit.wikimedia.org/r/696591 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [09:00:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:01:10] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10MoritzMuehlenhoff) >>! In T243057#7133670, @fgiunchedi wrote: > I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs primarily for... [09:02:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16303 and previous config saved to /var/cache/conftool/dbconfig/20210604-090239-root.json [09:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:09] !log reboot cp1087 T278729 [09:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:13] T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 [09:06:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: 10Jbond) [09:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16304 and previous config saved to /var/cache/conftool/dbconfig/20210604-091742-root.json [09:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:44] 10SRE, 10DNS, 10Traffic, 10serviceops, and 2 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10jbond) [09:19:56] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10jbond) a:05Dzahn→03jbond The SSH port has now been opened as well [09:20:23] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10jbond) 05Resolved→03Open a:05jbond→03Dzahn [09:21:21] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - 
https://phabricator.wikimedia.org/T278729 (10ema) >>! In T278729#7132555, @Dzahn wrote: > https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1087 Thanks Daniel, after rebooting the host all the alerts are now gone. [09:24:24] (03PS14) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [09:25:19] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:25:27] (03PS8) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [09:26:03] (03CR) 10jerkins-bot: [V: 04-1] role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:26:06] (03CR) 10jerkins-bot: [V: 04-1] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [09:30:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet [09:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet [09:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [09:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:02] PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 113.44, 103.87, 98.84 https://wikitech.wikimedia.org/wiki/Swift [09:37:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [09:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:03] (03PS1) 
10Ssingh: Add doh5001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) [09:46:32] (03CR) 10jerkins-bot: [V: 04-1] Add doh5001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [09:46:42] PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 108.99, 101.84, 99.49 https://wikitech.wikimedia.org/wiki/Swift [09:47:04] !log pool cp1087 T278729 [09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:13] T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 [09:47:36] (03PS2) 10Ssingh: Add doh5001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) [09:50:48] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10ema) 05Open→03Resolved Tentatively closing. [09:55:24] (03CR) 10Cathal Mooney: [C: 03+1] "looks good to me." 
[homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [09:57:03] (03PS1) 10Jbond: P:docker::reporter: exclude all debian images [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) [09:57:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29792/console" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:02:44] (03PS11) 10Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 [10:02:48] (03CR) 10Hashar: [WMF] register our plugins as submodules (031 comment) [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 (owner: 10Hashar) [10:03:07] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) [10:03:35] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) 05Resolved→03Open [10:06:51] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) now seeing `lsb_release: command not found` on the docker buster instances ` var/lib/dpkg/info/debmonitor-client.postinst: line 16: lsb_release: comman... [10:08:58] (03PS2) 10Ssingh: acme_chief: authorize doh5001 host for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) [10:11:19] (03CR) 10Ssingh: "Already reviewed; removing extra host." 
[puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: 10Ssingh) [10:11:35] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10dang) [10:12:46] (03CR) 10Ssingh: [C: 03+2] acme_chief: authorize doh5001 host for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: 10Ssingh) [10:14:03] 10Puppet, 10SRE, 10User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10hashar) Dropping releng/CI, doesn't seem we have anything to do to complete the resolution of this task. It seems dec... [10:15:55] (03PS1) 10Ssingh: site: switch doh5001 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698166 [10:17:25] (03CR) 10Ssingh: [C: 03+2] site: switch doh5001 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698166 (owner: 10Ssingh) [10:22:28] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) >>! In T251918#7133852, @jbond wrote: > now seeing `lsb_release: command not found` on the docker buster instances > > ` > var/lib/dpkg/info/debmoni... [10:22:55] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Thank you @Papaul ! I see the device in librenms but looks like discovery isn't working. I've removed and added the device again without success: ht... 
[10:24:15] (03PS1) 10Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:24:58] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) > This is a packaging error I guess. The script tries to install debmonitor-client inside of the container. Exactly, see https://gerrit.wikimedia.org/r/... [10:29:53] (03CR) 10JMeybohm: [C: 04-1] "I would suggest to instead add a dependency to lsb-release" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:30:35] (03CR) 10Ssingh: "> Patch Set 2: Code-Review+1" [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [10:30:38] (03CR) 10Ssingh: [C: 03+2] Add doh5001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [10:30:55] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10dang) [10:31:14] (03Merged) 10jenkins-bot: Add doh5001 to BGP anycast in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [10:33:29] (03CR) 10Muehlenhoff: "Agreed, it's still a bit of an industry standard despite being somewhat dormant and other applications might also break silently, so it al" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:33:31] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:35:14] (03CR) 
10Muehlenhoff: "But we can also rely on /etc/os-release entirely, that's also fine (and stop using lsb_release unconditonally)" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:40:07] (03PS2) 10Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:41:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:41:35] (03CR) 10Jbond: "> Patch Set 1:" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:41:55] (03CR) 10Muehlenhoff: [C: 03+1] "Commit message needs updating, though :-)" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:43:09] (03PS3) 10Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:43:26] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697618 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [10:44:14] (03PS4) 10Jbond: postints: update postinstall to use /etc/os-release Our docker images don't have lsb_release this causes docker-reporter to /etc/os-release fail with the following message. 
As such use /etc/os-release directly [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:44:32] (03CR) 10Jbond: "> Patch Set 2:" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [10:45:23] (03PS5) 10Jbond: postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:45:47] (03PS6) 10Jbond: postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) [10:48:07] (03CR) 10Jbond: [C: 03+2] "> Patch Set 1: Code-Review+1" [software/netbox] - 10https://gerrit.wikimedia.org/r/698020 (owner: 10Jbond) [10:48:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] add .gitreview file [software/netbox] - 10https://gerrit.wikimedia.org/r/698020 (owner: 10Jbond) [10:49:07] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Ok I think I got the correct `addhost.php` incantation to add the device and get discovery to work properly: ` ./addhost.php ps2-test-d8-codfw.mgmt.... [10:51:54] (03CR) 10Jcrespo: [C: 03+2] mariadb: Switchover s7&s8 codfw backups from db2100 to db2098 [puppet] - 10https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995) (owner: 10Jcrespo) [10:52:42] (03PS1) 10Giuseppe Lavagetto: mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 [10:57:13] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Joe) should we ensure lsb_release is installed in the base images? 
[10:57:41] (03PS2) 10Giuseppe Lavagetto: mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 [10:59:06] !log Running homer for Gerrit 698162: Set up BGP peering to doh5001 in eqsin, triggering DoH /24 announcement there. [10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:24] (03CR) 10Jbond: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [11:03:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [11:03:46] 10SRE, 10puppet-compiler: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10hashar) Dropping #continuous-integration-config , resolving this task solely depends on changing the exit code in #puppet-compiler [11:05:17] (03CR) 10Jbond: [C: 03+2] postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [11:20:52] !log upload debmonitor-client_0.3.0-1+deb10u3_all.deb to apt [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:55] 10SRE, 10Traffic, 10netops: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10ssingh) doh5001 is also up; from Mumbai, we are reaching eqsin as desired: ` $ kdig @wikimedia-dns.org +nsid +tls-ca wikipedia.org ;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(E... 
[11:40:09] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:40:37] 10SRE, 10puppet-compiler, 10User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10jbond) [11:41:30] (03CR) 10Ayounsi: [C: 04-1] "https://phabricator.wikimedia.org/T284213 for the AM dashboard discussion so we don't get sidetracked on this CR." [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [11:50:47] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:51:26] (03PS1) 10Marostegui: install_server: Set the partitioning scheme to new pc* [puppet] - 10https://gerrit.wikimedia.org/r/698177 (https://phabricator.wikimedia.org/T282484) [11:52:34] (03CR) 10Marostegui: [C: 03+2] install_server: Set the partitioning scheme to new pc* [puppet] - 10https://gerrit.wikimedia.org/r/698177 (https://phabricator.wikimedia.org/T282484) (owner: 10Marostegui) [11:52:35] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:57:05] (03PS1) 10Joal: Update AQS druid datasource to 2021_05 [puppet] - 10https://gerrit.wikimedia.org/r/698178 [12:23:07] PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The following units failed: session-132628.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:16] (03PS1) 10Marostegui: dbproxy1019: Depool 
clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/698185 (https://phabricator.wikimedia.org/T283235) [12:25:39] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/698185 (https://phabricator.wikimedia.org/T283235) (owner: 10Marostegui) [12:27:57] !log Upgrade mysql on clouddb1015 T283235 [12:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:02] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 [12:28:20] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/698033 [12:29:04] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1015" [puppet] - 10https://gerrit.wikimedia.org/r/698033 (owner: 10Marostegui) [12:29:08] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) >>! In T251918#7133990, @Joe wrote: > should we ensure lsb_release is installed in the base images? I don't think we should. If it is really needed... [12:30:40] 10SRE, 10Continuous-Integration-Config: operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10hashar) [12:30:50] 10SRE, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them being created in some circumstances. 
- https://phabricator.wikimedia.org/T283163 (10cmooney) After discussion with @ayounsi on IRC he suggested looking at the use of the following command to address this: ` set protocols bgp group PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:34:01] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:45:39] 10SRE, 10User-jbond, 10ci-test-error: operations/puppet CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10hashar) [12:45:53] 10SRE, 10User-jbond, 10ci-test-error: operations/puppet CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10hashar) 05Open→03Resolved a:03hashar [12:46:02] !log Upgrade mysql on clouddb1016 T283235 [12:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:05] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1016 [puppet] - 10https://gerrit.wikimedia.org/r/698190 [12:46:06] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 [12:47:56] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/698034 [12:49:38] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/698034 (owner: 10Marostegui) [12:57:26] 10SRE, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them being created in some circumstances. - https://phabricator.wikimedia.org/T283163 (10ayounsi) That sounds great! 
Let's test it out next week. Thanks. [12:57:29] RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 (owner: 10Giuseppe Lavagetto) [13:23:57] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) I now see an error with docker-registry.wikimedia.org/kubernetes-fluentd-daemonset:0.0.1-20190122 ` E: Failed to fetch http://mirrors.wikimedia.org/deb... [13:24:24] (03PS2) 10Jbond: P:docker::reporter: exclude all jessie images [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) [13:26:10] (03Merged) 10jenkins-bot: mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 (owner: 10Giuseppe Lavagetto) [13:26:28] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Joe) The "last updated" on that page means nothing - it barely tells you when the script ran the last time. That image is old and should really be retired fro... [13:26:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. To fix the underlying issue it would be best if all jessie images were deleted from the registry entirely, I'll check how/if/w" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [13:27:18] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) >The "last updated" on that page means nothing - it barely tells you when the script ran the last time. 
I was beginning to wonder :) > That image is ol... [13:29:55] (03PS3) 10Jbond: P:docker::reporter: exclude all jessie images [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) [13:30:07] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:30:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29794/console" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [13:31:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:18] (03PS1) 10Elukey: hadoop: increase the HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) [13:31:55] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:32:14] <_joe_> the deploy1002 issue is me [13:32:19] <_joe_> but I'm fixing it [13:32:59] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:14] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [13:39:35] !log mwmaint1002: Running purge_parsercache_now.php on pc1008, server 3/4, ref T282761 [13:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:40] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [13:42:47] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 181 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:46:23] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:49:59] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 160 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:51:30] 10SRE, 10SRE-tools: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10MoritzMuehlenhoff) [13:53:35] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:57:22] (03PS1) 10Krinkle: Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605) [14:00:03] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi Thanks [14:03:55] 10SRE, 10SRE-tools, 10User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10jbond) [14:11:38] (03PS1) 10Ayounsi: Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429) [14:11:47] (03CR) 10jerkins-bot: [V: 04-1] Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429) (owner: 10Ayounsi) [14:14:26] (03CR) 10Krinkle: [C: 03+2] Allow talk pages to have a different ParserCache expiry [extensions/DiscussionTools] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/694314 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle) [14:14:49] (03PS2) 10Ayounsi: Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429) [14:15:17] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite) [14:16:30] (03PS1) 10Cwhite: admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) [14:20:39] (03Merged) 10jenkins-bot: Allow talk pages to have a different ParserCache expiry [extensions/DiscussionTools] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/694314 (https://phabricator.wikimedia.org/T280605) (owner: 
10Krinkle) [14:20:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [14:21:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:30] (03CR) 10Elukey: [C: 03+1] admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [14:22:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:13] (03CR) 10Cwhite: [C: 03+2] admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [14:27:00] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite) [14:29:46] (03PS1) 10Cwhite: admin: amend west1 uid to uidNumber from ldap [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) [14:32:13] (03PS1) 10Ayounsi: Add 185.71.138.0/24 to network::external [puppet] - 10https://gerrit.wikimedia.org/r/698206 (https://phabricator.wikimedia.org/T252132) [14:34:10] (03CR) 10Cwhite: "@moritz|@john: please cross-reference ldap to ensure this is a correct change. This uid has been this way for a long time." 
[puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [14:35:45] (03PS2) 10Ayounsi: Add 185.71.138.0/24 to network::external and diffscan [puppet] - 10https://gerrit.wikimedia.org/r/698206 (https://phabricator.wikimedia.org/T252132) [14:37:06] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite) 05Open→03Resolved The west1 added to nda group. Please feel free to reopen if you encounter any related issue. [14:38:23] (03CR) 10Krinkle: [C: 03+2] Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle) [14:38:29] (03CR) 10Elukey: "The uid looks good to me, really good catch. I see that the user has files owned only on stat1007, so we could run a quick script like the" [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [14:38:30] * Krinkle staging on mwdebug1002 [14:39:21] (03Merged) 10jenkins-bot: Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle) [14:39:58] (03PS1) 10Muehlenhoff: htmldumps: Add profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) [14:40:00] (03PS1) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698208 (https://phabricator.wikimedia.org/T164456) [14:41:28] !log krinkle@deploy1002 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org) [14:41:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [14:41:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:46] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix data structure name [puppet] - 10https://gerrit.wikimedia.org/r/698209 [14:42:42] eh... [14:43:10] * Krinkle reverts [14:43:50] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo... [14:44:33] OK, that was stupid of me. The backport introduces a method and a call at the same time, and I should have synced them separately. screwed by non-atomicity, saved by canaries. [14:44:39] re-syncing one by one now. [14:44:57] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/DiscussionTools/includes/: Iea41ab8599ffae (duration: 00m 59s) [14:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:03] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/DiscussionTools/extension.json: Iea41ab8599ffae (duration: 00m 56s) [14:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:42] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I434d9cfa29d84f (duration: 00m 56s) [14:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:20] (03PS2) 10Muehlenhoff: htmldumps: Switch to common profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) [14:55:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [15:00:45] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) Dear Mr Papaul Tshibamba, Thank
you for contacting Hewlett Packard Enterprise for your service needs. This email confirms that your request for... [15:03:29] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10karapayneWMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks in Wikimedia P... [15:04:55] (03PS1) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) [15:05:15] (03Abandoned) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698208 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:05:30] (03CR) 10jerkins-bot: [V: 04-1] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:05:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:08:06] (03PS1) 10Herron: prometheus::pop add retention size param and set to 100G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) [15:08:31] (03PS2) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) [15:10:40] (03PS2) 10Herron: prometheus::pop add retention size param and set to 100G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) [15:14:42] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/29796/" [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [15:15:17] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - 
https://phabricator.wikimedia.org/T283995 (10Papaul) I will be receiving the CPU on Monday [15:15:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) I have the main board on site, I will be replacing it on Monday. [15:16:37] (03CR) 10Ahmon Dancy: [C: 03+1] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 (owner: 10Hashar) [15:17:53] (03PS3) 10Herron: prometheus::pop add retention size param and set to 80G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) [15:19:12] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10LZaman) @colewhite Thanks for looking into this. Yes, my wikitech name is correct. Re the need for access: I need the dev sign on to look into issues and [[ https://idp.wikimedia.org/login?se... [15:19:19] (03CR) 10Filippo Giunchedi: "LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [15:20:46] (03PS4) 10Herron: prometheus::pop add retention size param and set to 80G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) [15:25:41] !log Adding 1:1 NAT configuration for fran2001 / analytics.codfw.wikimedia.org to pfw3-codfw (backup site) [15:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:53] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo... 
[15:43:13] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:03] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) +1! I'll plan deploy the patch above (now amended to 80G retention), move data and release the vdb device from prometheus3001 next week. [15:50:29] (03PS15) 10Herron: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:56:48] (03PS1) 10Ladsgroup: Enable wikisource group as langlink group of sourcewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698226 (https://phabricator.wikimedia.org/T275958) [16:02:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix etcd connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/698228 [16:02:31] (03PS1) 10Giuseppe Lavagetto: mwdebug: add etcd servers, datacenter [deployment-charts] - 10https://gerrit.wikimedia.org/r/698229 [16:09:48] (03CR) 10Herron: [C: 04-1] "Looks good to me overall, although at the moment the compiler shows an unexpected result (removal of ldap validation on both mxes) which I" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:34:20] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1040.eqiad.wmnet'] ` [16:37:19] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) Thank you! 
[16:38:34] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo... [16:43:11] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10colewhite) [16:44:17] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo... [17:10:18] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo... [17:14:50] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr) an-web1001 B1 U27 Port9 id#3025 [17:15:15] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr) [17:18:51] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [17:20:03] 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10WMDE-leszek) As an Engineering Manager at WMDE, I approve this request and confirm Kara's affiliation with WMDE.
[17:25:12] (03PS1) 10Razzi: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/698233 [17:25:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE [17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:00] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE [17:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:20] (03CR) 10Razzi: [C: 03+2] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/698233 (owner: 10Razzi) [17:33:15] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart [17:33:15] !log razzi@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) [17:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:25] !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart [17:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:29] (03PS2) 10Effie Mouzeli: mediawiki::alerts fix panelId for mediawiki exceptions alert [puppet] - 10https://gerrit.wikimedia.org/r/690540 (https://phabricator.wikimedia.org/T284301) [17:36:57] !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [17:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:29] (03PS1) 10Jgreen: add analytics-codfw.frdev.wikimedia.org A and PTR records [dns] - 10https://gerrit.wikimedia.org/r/698235 (https://phabricator.wikimedia.org/T284155) [18:03:49] (03CR) 10Jgreen: [C: 03+2] add analytics-codfw.frdev.wikimedia.org A and PTR records [dns] - 10https://gerrit.wikimedia.org/r/698235 (https://phabricator.wikimedia.org/T284155) (owner: 10Jgreen) [18:20:54] 
10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) per T276144#7133694 the SSH port is open to the world now. So that means this must be done! Let's close it ? [18:21:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:09] PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The following units failed: session-132777.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:18] (03CR) 10Andrew Bogott: [C: 03+1] galera: ensure that mariabackup is also installed [puppet] - 10https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: 10Bstorm) [18:22:51] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) Thanks! I think the SSH part was T276148 (shouldnt that be closed now? We can't... [18:23:44] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10BVershbow_WMF) @colewhite Thanks for your help! I read and signed the L3 acknowledgement and got the approval from @Ottomata. Anything left to do? [18:24:28] 10SRE, 10DNS, 10Traffic, 10serviceops, and 2 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) [18:25:06] 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) 05Open→03Resolved ` ACCEPT tcp -- anywhere gitlab.wikimedi... 
[18:26:01] 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10RStallman-legalteam) I can prepare this NDA. @karapayneWMDE - could you please confirm your email address here or to rstallman@wikimedia.org? I will likely route this for signatures... [18:28:22] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Sure you want to open ssh to the public before backups and logging tasks are done? [18:34:27] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` and were **ALL** successful. [18:35:11] 10SRE, 10Packaging, 10serviceops, 10Design-Systems-team-board (Vue.js Migration Team Radar): Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10egardner) [18:38:59] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29799/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [18:41:11] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) 18:38 < icinga-wm> PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wik... 
[18:50:11] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite) [18:57:01] RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:19] (03PS1) 10Cwhite: admin: add bvershbow to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/698240 (https://phabricator.wikimedia.org/T284248) [19:00:17] (03CR) 10Dzahn: "noop on scandium" [puppet] - 10https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [19:00:27] (03PS2) 10Jforrester: Provide nodejs12-slim and -devel based on Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346) [19:03:07] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29800/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [19:06:58] !log depool cp1087 - T278729 [19:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:02] T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 [19:07:13] (03CR) 10Dzahn: "noop on testreduce1001" [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [19:08:36] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) a:05JAnstee_WMF→03colewhite [19:10:59] (03CR) 10Cwhite: [C: 03+2] admin: add bvershbow to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/698240 (https://phabricator.wikimedia.org/T284248) (owner: 10Cwhite) [19:13:36] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Ben 
Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite) 05Open→03Resolved a:03colewhite Added to wmf group. Please feel free to reopen if you encounter any related issue. [19:17:33] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:52] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10BBlack) rsyslogd was down for repeatedly segfaulting on startup. I was able to strace the failure and see that it kept segfaulting while reading one of its own files in `/var/spool/rsyslog/` on sta... [19:21:04] (03CR) 10Cwhite: [C: 03+2] admin: replace SSH key for janstee [puppet] - 10https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: 10Dzahn) [19:21:11] (03PS2) 10Cwhite: admin: replace SSH key for janstee [puppet] - 10https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: 10Dzahn) [19:24:42] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (10ssingh) [19:24:45] mutante: ^ Monday :) [19:24:55] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10ssingh) [19:25:00] ^ Tuesday! 
[19:25:02] :D [19:25:43] and then we are done, till doh5002 [19:26:00] 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10colewhite) p:05Triage→03Medium [19:28:16] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10colewhite) [19:29:12] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10colewhite) p:05Triage→03Medium @KFrancis, can you assist @dang with setting up the NDA? [19:29:40] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10ssingh) [19:31:43] 10SRE, 10Traffic: ATS: origins server response data accounting issues - https://phabricator.wikimedia.org/T284290 (10colewhite) p:05Triage→03Medium [19:32:06] 10SRE, 10Traffic: Take response size into account in CDN HTTP requests throttling - https://phabricator.wikimedia.org/T284292 (10colewhite) p:05Triage→03Medium [19:32:23] sukhe: :) ok [19:32:30] 10SRE, 10SRE-tools, 10User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10colewhite) p:05Triage→03Medium [19:32:53] 10SRE, 10Traffic: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10colewhite) p:05Triage→03Medium [19:38:45] (03CR) 10Bstorm: [C: 03+2] galera: ensure that mariabackup is also installed [puppet] - 10https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: 10Bstorm) [19:45:53] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:53:01] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10ops-monitoring-bot) Script wmf-auto-reimage was 
launched by bblack on cumin1001.eqiad.wmnet for hosts: ` cp1087.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202106041952_bblack_1... [19:57:16] (03PS1) 10Bstorm: galera: clear up confusing xtrabackup parameter [puppet] - 10https://gerrit.wikimedia.org/r/698251 (https://phabricator.wikimedia.org/T284157) [19:58:53] PROBLEM - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 266219 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops [20:07:49] (03CR) 10Dzahn: "thanks Cole :)" [puppet] - 10https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: 10Dzahn) [20:08:15] (03PS2) 10Dzahn: bacula/gitlab: add a backup::set for gitlab and use it [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) [20:09:03] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: REIMAGE [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:11] RECOVERY - very high load average likely xfs on ms-be1053 is OK: OK - load average: 66.30, 73.88, 79.59 https://wikitech.wikimedia.org/wiki/Swift [20:11:15] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: REIMAGE [20:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:51] (03Abandoned) 10Joal: Update AQS druid datasource to 2021_05 [puppet] - 10https://gerrit.wikimedia.org/r/698178 (owner: 10Joal) [20:46:38] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:12] (03CR) 10MarcoAurelio: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/698039 (https://phabricator.wikimedia.org/T283522) (owner: 10MarcoAurelio) [20:56:25] 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] ` and were **ALL** successful. [20:59:02] !log repool cp1087 - T278729 [20:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:07] T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 [21:05:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 145 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:21] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10colewhite) Key has been updated. Please let us know if this action resolved the problem. [21:15:56] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4444 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:28:58] (03CR) 10MarcoAurelio: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) (owner: 10MarcoAurelio) [21:37:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:37:06] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:49:36] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10colewhite) p:05Triage→03Medium a:03colewhite [21:51:14] !log cwhite@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh4001.wikimedia.org [21:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:27] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4001.wikimedia.org [22:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:07] (03PS1) 10Cwhite: site: add doh4001 to role insetup and setup dhcp [puppet] - 10https://gerrit.wikimedia.org/r/698265 (https://phabricator.wikimedia.org/T284349) [23:01:54] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10colewhite) Cookbook ran successfully. Currently unprovisioned. [23:54:19] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10Dzahn) a:05colewhite→03JAnstee_WMF