[00:02:27] <wikibugs>	 (PS1) Reedy: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886)
[00:03:37] <wikibugs>	 (CR) jenkins-bot: [V: -1] Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:04:06] <wikibugs>	 SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (Dzahn) @ssingh doh5001.wikimedia.org is ready for you now. doh5002 on hold for lack of IP in that subnet.
[00:04:37] <wikibugs>	 (PS2) Reedy: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886)
[00:05:12] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[00:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:36] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:55] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1008.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage`
[00:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:59] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[00:06:35] <wikibugs>	 (CR) Reedy: [C: +2] Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:06:58] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[00:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:07] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:20] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[00:07:21] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable UCoC link from non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698074 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:07:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:31] <DannyS712>	 thcipriani sorry I missed the backport window, I'll reschedule for next week
[00:08:41] <logmsgbot>	 !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: T280886 (duration: 00m 57s)
[00:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:45] <stashbot>	 T280886: Add Code of Conduct link to the Universal Code of Conduct to all non technical wikis - https://phabricator.wikimedia.org/T280886
[00:12:25] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[00:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:11] <wikibugs>	 (PS1) Dzahn: static-bugzilla: only load httpd modules actually needed, add gzipped test file, minimize [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538)
[00:46:17] <wikibugs>	 (CR) Dzahn: [C: -2] "WIP, just storing it for right now" [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:06:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3806497136 and 183 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2399093560 and 117 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9858712400 and 605 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 878057424 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5608868672 and 381 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 570944 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 100856 and 198 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 148288 and 206 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 381664 and 293 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68848 and 321 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:24:03] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[01:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:51] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:00] <wikibugs>	 SRE, LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (Ottomata) Approved
[01:39:23] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (Ottomata) Approved
[01:41:03] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 1.061e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:00:48] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift
[02:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:28] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift (duration: 04m 40s)
[02:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:48] <ebernhardson>	 !log post-deploy restart airflow-(webserver|scheduler) on an-airflow1001
[02:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:47] <icinga-wm>	 PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:34] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[02:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:39] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:16:42] <wikibugs>	 SRE, Traffic, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (ssingh) >>! In T284246#7133295, @Dzahn wrote: > @ssingh doh5001.wikimedia.org is ready for you now. doh5002 on hold for lack of IP in that subnet.  Thanks...
[02:25:02] <ryankemper>	 !log [WDQS] `ryankemper@wdqs1012:~$ sudo pool` (caught up on lag)
[02:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:33] <ryankemper>	 !log T280382 `wdqs1005.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2        2.9T  998G  1.8T  36% /srv`
[02:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:37] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[02:32:38] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[02:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:10] <ryankemper>	 !log [WDQS] `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "repair overinflated wikidata jnl" --blazegraph_instance blazegraph`
[02:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:41:58] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2963 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[02:42:11] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.5385 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:42:23] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6349 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:42:25] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[02:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:46] <legoktm>	 ummm
[02:43:19] <legoktm>	 ryankemper: your cookbook had nothing to do with MW requests right?
[02:43:19] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:43:22] <rzl>	 good evening 👋
[02:43:46] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.61 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[02:43:53] <mutante>	 👋 hrmm
[02:44:01] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:44:03] <bblack>	 hi :P
[02:44:08] <rzl>	 we had a couple blips of "Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad" earlier today
[02:44:13] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:44:18] <rzl>	 nothing that hit the paging threshold though, I didn't put too much time into looking at it
[02:44:45] <mutante>	 well, that was quick, yea, we did, they just did not trigger this
[02:44:53] <mutante>	 but they were equally short
[02:45:03] <rzl>	 appserver latency spiked and recovered, so whatever this was, it had real impact just now https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now
[02:45:09] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[02:46:01] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4444 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:46:14] <bblack>	 the graph doesn't look awesome
[02:46:43] <bblack>	 it looks like it could rebuild to a new peak each iteration (since it started looking funky ~30m ago)
[02:47:11] <bblack>	 like some kind of "ringing" sort of system effect
[02:47:50] <rzl>	 sure, could be something backing off and then un-backing-off
[02:48:28] <mutante>	 started like 9 hours ago when zooming out 
[02:48:54] <mutante>	 % of mw servers with over 60% of workers busy
[02:49:05] <mutante>	 ^ if you look at that for the last 2 days
[02:49:06] <rzl>	 we had a spike of s1 errors there too https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=10&orgId=1&from=now-1h&to=now&forceLogin=true
[02:49:17] <bblack>	 yeah there was a notable bump to a new plateau at ~15:50, then some little tiny spikes all along since
[02:49:24] <rzl>	 explains why it affected both api_ and appservers this time
[02:49:24] <bblack>	 but the past ~30m is much worse
[02:50:29] <bblack>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-2d&orgId=1&to=now&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET&viewPanel=9
[02:50:50] <rzl>	 yeah
[02:51:06] <bblack>	 ^ there was a more-notable jump to a new plateau around the time of a deploy at 10:00
[02:51:25] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06349 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[02:51:27] <rzl>	 peering at https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser but I don't know my way around it super well
[02:51:29] <bblack>	 but I donno how stable this graph normally is, maybe that's "normal" :)
[02:52:26] <bblack>	 that is a very complex sql statement :)
[02:53:41] <legoktm>	 https://tendril.wikimedia.org/activity shows the currently running slow queries
[02:54:02] <rzl>	 yeah, just the worst of the spike's already past
[02:54:23] <mutante>	 the slow query graph does roughly match the appserver latency spikes
[02:54:41] <mutante>	 and that looks like an unnaturally high number of "Tsum"
[02:54:42] <legoktm>	 10:13 	<kormat@deploy1002> 	Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc2 primary T282761 (duration: 00m 58s)
[02:54:43] <stashbot>	 T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[02:55:45] <legoktm>	 if parser cache isn't as full/hot then latency would go up but in theory it should trend downwards as it fills up
[02:56:20] <bblack>	 it could be that we finally got around to a more interesting time-of-day for certain traffic
[02:56:39] <bblack>	 (a new region with different hot content coming in)
[02:56:44] * legoktm nods
[02:57:01] <legoktm>	 I peeked at api.log btw and nothing obvious stood out to me, like excessive hammering
[02:57:17] <mutante>	 https://phabricator.wikimedia.org/T282761#7131412
[02:58:03] <bblack>	 this is actually peak US query load timeframe
[02:58:11] <legoktm>	 5xx showed an increase in errors at the time of the page, but also, no single IP/range stood out, could've easily been the requests failing as a side-effect, not the cause
[02:58:52] <bblack>	 esams is in its overnight lull, but the US is on its peak plateau-ish period and right about now eqsin is ramping up pretty hard for the day too
[03:00:33] <wikibugs>	 SRE, serviceops, Patch-For-Review: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Dzahn) .
[03:01:04] <bblack>	 https://grafana.wikimedia.org/d/000000500/varnish-caching?viewPanel=5&orgId=1&var-cluster=cache_text&var-site=codfw&var-site=eqiad&var-site=ulsfo&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-7d&to=now&refresh=15m
[03:01:23] <bblack>	 ^ this is the one week view of the req graph shape, for "all but esams"
[03:01:39] <bblack>	 all-but-esams this is peak time, whereas esams itself is on a completely different schedule :)
[03:02:17] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:04:07] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:04:10] <mutante>	 it has been 17 hours since " pc1008.eqiad.wmnet with reason: Purging parsercache" and that's also about the time since the first spike in Mean average latency
[03:04:55] <legoktm>	 rzl: I figured out why /* IndexPager::buildQueryInfo (history page unfiltered) */ is the top slow query, someone is fetching history pages with &limit=2000
[03:05:38] <legoktm>	 2021-06-04 02:45:58 [163e3d87-49b2-4070-8fa5-a4bc5efe6fc0] mw1275 enwiki 1.37.0-wmf.7 exception ERROR: [163e3d87-49b2-4070-8fa5-a4bc5efe6fc0] /w/index.php?title=Bronze_Age&offset=20110309020825%7C417887994&limit=2000&action=history   Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 60 seconds was exceeded {"exception_url":"/w/index.php?title=Bronze_Age&offset=20110309020825%7C417887994&limit=2000&action=history","reqId":"163e3d87-49b2-4070-8fa5-a4bc5efe6fc0","caught_by":"entrypoint"}
[03:05:46] <rzl>	 I guess if they're doing it fast enough that could be the cause here
[03:06:30] <bblack>	 yeah seems like a solid theory
[03:06:38] <mutante>	 uhmm.. can we change the max limit ?
[03:06:41] <bblack>	 there's some elevation in 500s that matches too, but it's not huge
[03:06:52] <legoktm>	 the timeouts started at 2021-06-04 02:45:33 per exception.log
[03:07:05] <bblack>	 https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?viewPanel=2&orgId=1&var-site=codfw&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-status_type=5&var-method=GET&var-method=HEAD&var-method=POST&from=now-3h&to=now
[03:07:06] <legoktm>	 no wait, that's when it rotated
[03:07:39] <bblack>	 02:42-02:46 was the biggest peak of the little 500s spike too
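A rough way to check the theory above is to scan exception-log lines for history-page requests with a very large `limit`. The sketch below is only an illustration, assuming log lines that contain a `/w/index.php?...` URL like the one legoktm pasted; the function name `expensive_history_requests` and the `threshold` value are hypothetical and not part of any MediaWiki tooling.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the request URL inside a log line (assumed shape, based on the pasted entry).
URL_RE = re.compile(r"(/w/index\.php\?\S+)")


def expensive_history_requests(lines, threshold=1000):
    """Return (title, limit) pairs for action=history requests with limit >= threshold."""
    hits = []
    for line in lines:
        match = URL_RE.search(line)
        if not match:
            continue
        query = parse_qs(urlsplit(match.group(1)).query)
        if query.get("action") != ["history"]:
            continue
        try:
            limit = int(query.get("limit", ["0"])[0])
        except ValueError:
            continue  # non-numeric limit, ignore
        if limit >= threshold:
            hits.append((query.get("title", ["?"])[0], limit))
    return hits


# URL taken from the log entry quoted above.
sample = ("exception ERROR: /w/index.php?title=Bronze_Age"
          "&offset=20110309020825%7C417887994&limit=2000&action=history")
print(expensive_history_requests([sample]))
```

Counting the returned pairs per minute would give a crude view of whether such requests line up with the latency spikes; correlating by client would need fields this sketch does not parse.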
[03:18:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:20:21] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3492 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:20:25] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:35:27] <wikibugs>	 (PS1) BBlack: cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274)
[03:36:33] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06349 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:37:03] <wikibugs>	 (CR) RLazarus: [C: +1] cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274) (owner: BBlack)
[03:37:26] <wikibugs>	 (CR) BBlack: [C: +2] cache_text: block annoying reqs for now [puppet] - https://gerrit.wikimedia.org/r/698082 (https://phabricator.wikimedia.org/T284274) (owner: BBlack)
[03:49:44] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:43] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1138 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[04:22:22] <ryankemper>	 !log T280382 `wdqs2001.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2        2.9T  998G  1.8T  36% /srv`
[04:22:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:27] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[04:25:40] <ryankemper>	 !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2002.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[04:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:21] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2002.codfw.wmnet with reason: REIMAGE
[04:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:33] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2002.codfw.wmnet with reason: REIMAGE
[04:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:33] <wikibugs>	 (PS1) Marostegui: db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698085
[05:09:44] <wikibugs>	 (CR) Marostegui: [C: +2] db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/698085 (owner: Marostegui)
[05:10:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16287 and previous config saved to /var/cache/conftool/dbconfig/20210604-051010-root.json
[05:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:50] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:16:59] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[05:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:06] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[05:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:10] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[05:22:21] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[05:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:31] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer
[05:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:27] <ryankemper>	 !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage`
[05:24:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:30] <stashbot>	 T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[05:25:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16288 and previous config saved to /var/cache/conftool/dbconfig/20210604-052514-root.json
[05:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:41] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16289 and previous config saved to /var/cache/conftool/dbconfig/20210604-054017-root.json
[05:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P16290 and previous config saved to /var/cache/conftool/dbconfig/20210604-055521-root.json
[05:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
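The db1121 repool above steps through 25%, 50%, 75% and 100% about 15 minutes apart (05:10, 05:25, 05:40, 05:55). A minimal sketch of that kind of staged schedule follows, with the interval inferred from the timestamps in this log; `repool_schedule` is a hypothetical helper for illustration and does not call dbctl.

```python
from datetime import datetime, timedelta


def repool_schedule(start, steps=(25, 50, 75, 100), interval=timedelta(minutes=15)):
    """Return (time, percentage) pairs for a staged repool starting at `start`."""
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Start time rounded from the first db1121 repool entry in the log.
for when, pct in repool_schedule(datetime(2021, 6, 4, 5, 10)):
    print(f"{when:%H:%M} pool at {pct}%")
```

Ramping in steps gives a freshly repooled replica time to warm up before it takes full traffic, which is presumably why the log shows four commits rather than one.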
[06:12:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (elukey) >>! In T284136#7133186, @colewhite wrote: > @KFrancis can you confirm an NDA on file for @Cervisiarius?  @colewhite in https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Access_requests it is me...
[06:42:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 db1096:3315', diff saved to https://phabricator.wikimedia.org/P16291 and previous config saved to /var/cache/conftool/dbconfig/20210604-064242-marostegui.json
[06:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:50] <marostegui>	 !log Upgrade mysql on db1096:3315 db1096:3316
[06:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16292 and previous config saved to /var/cache/conftool/dbconfig/20210604-064807-root.json
[06:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16293 and previous config saved to /var/cache/conftool/dbconfig/20210604-064815-root.json
[06:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210604T0700)
[07:03:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16294 and previous config saved to /var/cache/conftool/dbconfig/20210604-070311-root.json
[07:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16295 and previous config saved to /var/cache/conftool/dbconfig/20210604-070319-root.json
[07:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:05:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:15:56] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db2113 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/698151 (https://phabricator.wikimedia.org/T283235)
[07:16:52] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db2113 to Buster and 10.4 [puppet] - https://gerrit.wikimedia.org/r/698151 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[07:18:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16296 and previous config saved to /var/cache/conftool/dbconfig/20210604-071815-root.json
[07:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16297 and previous config saved to /var/cache/conftool/dbconfig/20210604-071823-root.json
[07:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:12] <urbanecm>	 !log Password reset for SUL User:Dominic_Mayers  (T282656)
[07:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:16] <stashbot>	 T282656: User:Dominic_Mayers has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T282656
[07:22:27] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456)
[07:22:29] <wikibugs>	 (CR) Gehel: [C: +1] wdqs-internal: lower depool threshold to .3 [puppet] - https://gerrit.wikimedia.org/r/698069 (https://phabricator.wikimedia.org/T284264) (owner: Ryan Kemper)
[07:22:41] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:23:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:23:45] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db2151 [puppet] - https://gerrit.wikimedia.org/r/698153
[07:24:13] <moritzm>	 !log cleanup now unused nginx mods and former deps on install* servers after switch towards nginx-light (various X11 libs and libxslt)
[07:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db2151 [puppet] - https://gerrit.wikimedia.org/r/698153 (owner: Marostegui)
[07:29:38] <moritzm>	 !log cleanup now unused nginx mods and former deps on install* and puppetdb* servers after switch towards nginx-light (various X11 libs and libxslt) T164456
[07:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:42] <stashbot>	 T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[07:29:55] <wikibugs>	 (CR) Jcrespo: [C: +1] "I see nothing wrong here." [puppet] - https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: Bstorm)
[07:31:07] <wikibugs>	 (PS2) Muehlenhoff: Enable profile::nginx for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456)
[07:32:36] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:33:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316', diff saved to https://phabricator.wikimedia.org/P16298 and previous config saved to /var/cache/conftool/dbconfig/20210604-073318-root.json
[07:33:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P16299 and previous config saved to /var/cache/conftool/dbconfig/20210604-073326-root.json
[07:33:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:20] <jynus>	 !log stop and upgrade db1150 T283235
[07:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:25] <stashbot>	 T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[07:44:40] <wikibugs>	 (PS1) Muehlenhoff: Switch scandium/testreduce to nginx-light [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456)
[07:52:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[07:56:21] <wikibugs>	 (CR) Muehlenhoff: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/697988 also needs to be merged first" [puppet] - https://gerrit.wikimedia.org/r/698155 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[08:08:50] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] New upstream release [debs/karma] - https://gerrit.wikimedia.org/r/697916 (owner: Filippo Giunchedi)
[08:09:33] <wikibugs>	 (PS1) Jcrespo: dbbackups: Switchover eqiad s5 backups from db1145 to db1150 (buster) [puppet] - https://gerrit.wikimedia.org/r/698157 (https://phabricator.wikimedia.org/T283235)
[08:09:39] <wikibugs>	 (PS2) Jcrespo: dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235)
[08:20:00] <godog>	 !log upgrade karma to 0.86-1 
[08:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: cc -operations on IRC for all SRE pages [puppet] - https://gerrit.wikimedia.org/r/697943 (https://phabricator.wikimedia.org/T273716) (owner: Filippo Giunchedi)
[08:24:19] <wikibugs>	 (PS1) Jcrespo: mariadb: Switchover s7&s8 codfw backups from db2100 to db2098 [puppet] - https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995)
[08:24:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: highlight 'instance' label in alerts dashboard [puppet] - https://gerrit.wikimedia.org/r/697924 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:25:49] <wikibugs>	 (CR) Jcrespo: "To be reverted once T283995 is fixed (it may happen while I am off)." [puppet] - https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[08:29:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P16300 and previous config saved to /var/cache/conftool/dbconfig/20210604-082956-marostegui.json
[08:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16301 and previous config saved to /var/cache/conftool/dbconfig/20210604-083232-root.json
[08:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (fgiunchedi) Ack, thank you @RobH !
[08:33:39] <marostegui>	 !log Upgrade db1110 T283235 
[08:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:43] <stashbot>	 T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[08:36:39] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (fgiunchedi) Disk is rebuilding, thank you @Cmjohnson
[08:47:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16302 and previous config saved to /var/cache/conftool/dbconfig/20210604-084735-root.json
[08:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:12] <wikibugs>	 (CR) Jbond: [C: +2] P:gitlab: open SSH port to the world [puppet] - https://gerrit.wikimedia.org/r/696024 (https://phabricator.wikimedia.org/T276144) (owner: Jbond)
[08:55:22] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 117.82, 104.03, 71.40 https://wikitech.wikimedia.org/wiki/Swift
[08:56:46] <wikibugs>	 SRE, observability, Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (fgiunchedi) I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs primarily for ease of management: on hardware faults on bare meta...
[08:59:58] <godog>	 ms-be1053 is rebuilding a disk
[09:00:06] <wikibugs>	 (CR) Hashar: "Gerrit is all happy under Java 11.   Nothing refers to the java 8 binary based on (lsof -f).    Lets remove the unused packages on Monday " [puppet] - https://gerrit.wikimedia.org/r/696591 (https://phabricator.wikimedia.org/T268225) (owner: Hashar)
[09:00:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:01:10] <wikibugs>	 SRE, observability, Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (MoritzMuehlenhoff) >>! In T243057#7133670, @fgiunchedi wrote: > I thought about it a little more and I'm convinced we should strive to keep Prometheus on VMs primarily for...
[09:02:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:02:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16303 and previous config saved to /var/cache/conftool/dbconfig/20210604-090239-root.json
[09:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:09] <ema>	 !log reboot cp1087 T278729
[09:06:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:13] <stashbot>	 T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729
[09:06:26] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/697723 (https://phabricator.wikimedia.org/T275852) (owner: Jbond)
[09:17:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P16304 and previous config saved to /var/cache/conftool/dbconfig/20210604-091742-root.json
[09:17:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:44] <wikibugs>	 SRE, DNS, Traffic, serviceops, and 2 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (jbond)
[09:19:56] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (jbond) a:Dzahn→jbond The SSH port has now been opened as well
[09:20:23] <wikibugs>	 SRE, Traffic, GitLab (Initialization), Patch-For-Review, User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (jbond) Resolved→Open a:jbond→Dzahn
[09:21:21] <wikibugs>	 SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (ema) >>! In T278729#7132555, @Dzahn wrote: > https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1087  Thanks Daniel, after rebooting the host all the alerts are now gone.
[09:24:24] <wikibugs>	 (PS14) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[09:25:19] <wikibugs>	 (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:25:27] <wikibugs>	 (PS8) Jbond: mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826
[09:26:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:26:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826 (owner: Jbond)
[09:30:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet
[09:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet
[09:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[09:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:02] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 113.44, 103.87, 98.84 https://wikitech.wikimedia.org/wiki/Swift
[09:37:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[09:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:03] <wikibugs>	 (PS1) Ssingh: Add doh5001 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503)
[09:46:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add doh5001 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[09:46:42] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1053 is CRITICAL: CRITICAL - load average: 108.99, 101.84, 99.49 https://wikitech.wikimedia.org/wiki/Swift
[09:47:04] <ema>	 !log pool cp1087 T278729
[09:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:13] <stashbot>	 T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729
[09:47:36] <wikibugs>	 (PS2) Ssingh: Add doh5001 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503)
[09:50:48] <wikibugs>	 SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (ema) Open→Resolved Tentatively closing.
[09:55:24] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "looks good to me." [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[09:57:03] <wikibugs>	 (PS1) Jbond: P:docker::reporter: exclude all debian images [puppet] - https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918)
[09:57:53] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29792/console" [puppet] - https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:02:44] <wikibugs>	 (PS11) Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336
[10:02:48] <wikibugs>	 (CR) Hashar: [WMF] register our plugins as submodules (1 comment) [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336 (owner: Hashar)
[10:03:07] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond)
[10:03:35] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond) Resolved→Open
[10:06:51] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond) now seeing `lsb_release: command not found` on the docker buster instances  ` var/lib/dpkg/info/debmonitor-client.postinst: line 16: lsb_release: comman...
[10:08:58] <wikibugs>	 (PS2) Ssingh: acme_chief: authorize doh5001 host for Wikidough [puppet] - https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246)
[10:11:19] <wikibugs>	 (CR) Ssingh: "Already reviewed; removing extra host." [puppet] - https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: Ssingh)
[10:11:35] <wikibugs>	 SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (dang)
[10:12:46] <wikibugs>	 (CR) Ssingh: [C: +2] acme_chief: authorize doh5001 host for Wikidough [puppet] - https://gerrit.wikimedia.org/r/698014 (https://phabricator.wikimedia.org/T284246) (owner: Ssingh)
[10:14:03] <wikibugs>	 Puppet, SRE, User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (hashar) Dropping releng/CI, doesn't seem we have anything to do to complete the resolution of this task. It seems dec...
[10:15:55] <wikibugs>	 (PS1) Ssingh: site: switch doh5001 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/698166
[10:17:25] <wikibugs>	 (CR) Ssingh: [C: +2] site: switch doh5001 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/698166 (owner: Ssingh)
[10:22:28] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) >>! In T251918#7133852, @jbond wrote: > now seeing `lsb_release: command not found` on the docker buster instances >  > ` > var/lib/dpkg/info/debmoni...
[10:22:55] <wikibugs>	 SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) Thank  you @Papaul ! I see the device in librenms but looks like discovery isn't working. I've removed and added the device again without success: ht...
[10:24:15] <wikibugs>	 (PS1) Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:24:58] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (jbond) > This is a packaging error I guess. The script tries to install debmonitor-client inside of the container. Exactly, see https://gerrit.wikimedia.org/r/...
[10:29:53] <wikibugs>	 (CR) JMeybohm: [C: -1] "I would suggest to instead add a dependency to lsb-release" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:30:35] <wikibugs>	 (CR) Ssingh: "> Patch Set 2: Code-Review+1" [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[10:30:38] <wikibugs>	 (CR) Ssingh: [C: +2] Add doh5001 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[10:30:55] <wikibugs>	 SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (dang)
[10:31:14] <wikibugs>	 (Merged) jenkins-bot: Add doh5001 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/698162 (https://phabricator.wikimedia.org/T283503) (owner: Ssingh)
[10:33:29] <wikibugs>	 (CR) Muehlenhoff: "Agreed, it's still a bit of an industry standard despite being somewhat dormant and other applications might also break silently, so it al" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:33:31] <wikibugs>	 (CR) Jbond: "> Patch Set 1: Code-Review-1" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:35:14] <wikibugs>	 (CR) Muehlenhoff: "But we can also rely on /etc/os-release entirely, that's also fine (and stop using lsb_release unconditonally)" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:40:07] <wikibugs>	 (PS2) Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:41:27] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:41:35] <wikibugs>	 (CR) Jbond: "> Patch Set 1:" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:41:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Commit message needs updating, though :-)" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:43:09] <wikibugs>	 (PS3) Jbond: postints: update postinstall to check for lsb_release before using it [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:43:26] <wikibugs>	 (CR) Kormat: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/697618 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[10:44:14] <wikibugs>	 (PS4) Jbond: postints: update postinstall to use /etc/os-release Our docker images don't have lsb_release this causes docker-reporter to /etc/os-release fail with the following message.  As such use /etc/os-release directly [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:44:32] <wikibugs>	 (CR) Jbond: "> Patch Set 2:" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[10:45:23] <wikibugs>	 (PS5) Jbond: postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:45:47] <wikibugs>	 (PS6) Jbond: postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918)
[10:48:07] <wikibugs>	 (CR) Jbond: [C: +2] "> Patch Set 1: Code-Review+1" [software/netbox] - https://gerrit.wikimedia.org/r/698020 (owner: Jbond)
[10:48:15] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] add .gitreview file [software/netbox] - https://gerrit.wikimedia.org/r/698020 (owner: Jbond)
[10:49:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, observability, User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (fgiunchedi) Ok I think I got the correct `addhost.php` incantation to add the device and get discovery to work properly:  ` ./addhost.php ps2-test-d8-codfw.mgmt....
[10:51:54] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Switchover s7&s8 codfw backups from db2100 to db2098 [puppet] - https://gerrit.wikimedia.org/r/698158 (https://phabricator.wikimedia.org/T283995) (owner: Jcrespo)
[10:52:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: mwdebug: fix various issues with the deployment [deployment-charts] - https://gerrit.wikimedia.org/r/698172
[10:57:13] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (Joe) should we ensure lsb_release is installed in the base images?
[10:57:41] <wikibugs>	 (PS2) Giuseppe Lavagetto: mwdebug: fix various issues with the deployment [deployment-charts] - https://gerrit.wikimedia.org/r/698172
[10:59:06] <topranks>	 !log Running homer for Gerrit 698162: Set up BGP peering to doh5001 in eqsin, triggering DoH /24 announcement there.
[10:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:24] <wikibugs>	 (CR) Jbond: Add WMCS specific cloud role for syslog server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[11:03:22] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[11:03:46] <wikibugs>	 SRE, puppet-compiler: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (hashar) Dropping #continuous-integration-config , resolving this task solely depends on changing the exit code in #puppet-compiler
[11:05:17] <wikibugs>	 (CR) Jbond: [C: +2] postints: update postinstall to use /etc/os-release [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/698168 (https://phabricator.wikimedia.org/T251918) (owner: Jbond)
[11:20:52] <jbond>	 !log upload debmonitor-client_0.3.0-1+deb10u3_all.deb to apt
[11:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:55] <wikibugs>	 SRE, Traffic, netops: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (ssingh) doh5001 is also up; from Mumbai, we are reaching eqsin as desired:  ` $ kdig @wikimedia-dns.org +nsid +tls-ca wikipedia.org ;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(E...
[11:40:09] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:40:37] <wikibugs>	 SRE, puppet-compiler, User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (jbond)
[11:41:30] <wikibugs>	 (CR) Ayounsi: [C: -1] "https://phabricator.wikimedia.org/T284213 for the AM dashboard discussion so we don't get sidetracked on this CR." [alerts] - https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: Ema)
[11:50:47] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:51:26] <wikibugs>	 (PS1) Marostegui: install_server: Set the partitioning scheme to new pc* [puppet] - https://gerrit.wikimedia.org/r/698177 (https://phabricator.wikimedia.org/T282484)
[11:52:34] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Set the partitioning scheme to new pc* [puppet] - https://gerrit.wikimedia.org/r/698177 (https://phabricator.wikimedia.org/T282484) (owner: Marostegui)
[11:52:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:57:05] <wikibugs>	 (PS1) Joal: Update AQS druid datasource to 2021_05 [puppet] - https://gerrit.wikimedia.org/r/698178
[12:23:07] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The following units failed: session-132628.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:16] <wikibugs>	 (PS1) Marostegui: dbproxy1019: Depool clouddb1015 [puppet] - https://gerrit.wikimedia.org/r/698185 (https://phabricator.wikimedia.org/T283235)
[12:25:39] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1015 [puppet] - https://gerrit.wikimedia.org/r/698185 (https://phabricator.wikimedia.org/T283235) (owner: Marostegui)
[12:27:57] <marostegui>	 !log Upgrade mysql on clouddb1015 T283235
[12:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:02] <stashbot>	 T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[12:28:20] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1015" [puppet] - https://gerrit.wikimedia.org/r/698033
[12:29:04] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1015" [puppet] - https://gerrit.wikimedia.org/r/698033 (owner: Marostegui)
[12:29:08] <wikibugs>	 SRE, serviceops, Patch-For-Review, User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (JMeybohm) >>! In T251918#7133990, @Joe wrote: > should we ensure lsb_release is installed in the base images?  I don't think we should. If it is really needed...
[12:30:40] <wikibugs>	 SRE, Continuous-Integration-Config: operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (hashar)
[12:30:50] <wikibugs>	 SRE, Traffic, netops: BGP Policy on aggregate routes prevents them being created in some circumstances. - https://phabricator.wikimedia.org/T283163 (cmooney) After discussion with @ayounsi on IRC he suggested looking at the use of the following command to address this: ` set protocols bgp group <grou...
[12:32:13] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:34:01] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:45:39] <wikibugs>	 10SRE, 10User-jbond, 10ci-test-error: operations/puppet CI errors  not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10hashar)
[12:45:53] <wikibugs>	 10SRE, 10User-jbond, 10ci-test-error: operations/puppet CI errors  not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10hashar) 05Open→03Resolved a:03hashar
[12:46:02] <marostegui>	 !log Upgrade mysql on clouddb1016 T283235
[12:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:05] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1016 [puppet] - 10https://gerrit.wikimedia.org/r/698190
[12:46:06] <stashbot>	 T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235
[12:47:56] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/698034
[12:49:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1016" [puppet] - 10https://gerrit.wikimedia.org/r/698034 (owner: 10Marostegui)
[12:57:26] <wikibugs>	 10SRE, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them being created in some circumstances. - https://phabricator.wikimedia.org/T283163 (10ayounsi) That sounds great! Let's test it out next week. Thanks.
[12:57:29] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 (owner: 10Giuseppe Lavagetto)
[13:23:57] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) I now see an error with docker-registry.wikimedia.org/kubernetes-fluentd-daemonset:0.0.1-20190122  ` E: Failed to fetch http://mirrors.wikimedia.org/deb...
[13:24:24] <wikibugs>	 (03PS2) 10Jbond: P:docker::reporter: exclude all jessie images [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918)
[13:26:10] <wikibugs>	 (03Merged) 10jenkins-bot: mwdebug: fix various issues with the deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/698172 (owner: 10Giuseppe Lavagetto)
[13:26:28] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Joe) The "last updated" on that page means nothing - it barely tells you when the script ran the last time.  That image is old and should really be retired fro...
[13:26:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. To fix the underlying issue it would be best if all jessie images were deleted from the registry entirely, I'll check how/if/w" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond)
[13:27:18] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10jbond) >The "last updated" on that page means nothing - it barely tells you when the script ran the last time. I was beginning to wonder :)  > That image is ol...
[13:29:55] <wikibugs>	 (03PS3) 10Jbond: P:docker::reporter: exclude all jessie images [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918)
[13:30:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:30:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29794/console" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond)
[13:31:11] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:18] <wikibugs>	 (03PS1) 10Elukey: hadoop: increase the HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733)
[13:31:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:14] <_joe_>	 the deploy1002 issue is me
[13:32:19] <_joe_>	 but I'm fixing it
[13:32:59] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:14] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/698164 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond)
[13:39:35] <Krinkle>	 !log mwmaint1002: Running purge_parsercache_now.php on pc1008, server 3/4, ref T282761
[13:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:40] <stashbot>	 T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[13:42:47] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 181 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:46:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:49:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 160 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:51:30] <wikibugs>	 10SRE, 10SRE-tools: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10MoritzMuehlenhoff)
[13:53:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:57:22] <wikibugs>	 (03PS1) 10Krinkle: Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605)
[14:00:03] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi  Thanks
[14:03:55] <wikibugs>	 10SRE, 10SRE-tools, 10User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10jbond)
[14:11:38] <wikibugs>	 (03PS1) 10Ayounsi: Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429)
[14:11:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429) (owner: 10Ayounsi)
[14:14:26] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Allow talk pages to have a different ParserCache expiry [extensions/DiscussionTools] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/694314 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle)
[14:14:49] <wikibugs>	 (03PS2) 10Ayounsi: Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 (https://phabricator.wikimedia.org/T279429)
[14:15:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite)
[14:16:30] <wikibugs>	 (03PS1) 10Cwhite: admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136)
[14:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: Allow talk pages to have a different ParserCache expiry [extensions/DiscussionTools] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/694314 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle)
[14:20:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite)
[14:21:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:21:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite)
[14:22:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:23:13] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] admin: update west1 staff contact and email [puppet] - 10https://gerrit.wikimedia.org/r/698203 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite)
[14:27:00] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite)
[14:29:46] <wikibugs>	 (03PS1) 10Cwhite: admin: amend west1 uid to uidNumber from ldap [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136)
[14:32:13] <wikibugs>	 (03PS1) 10Ayounsi: Add 185.71.138.0/24 to network::external [puppet] - 10https://gerrit.wikimedia.org/r/698206 (https://phabricator.wikimedia.org/T252132)
[14:34:10] <wikibugs>	 (03CR) 10Cwhite: "@moritz|@john: please cross-reference ldap to ensure this is a correct change.  This uid has been this way for a long time." [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite)
[14:35:45] <wikibugs>	 (03PS2) 10Ayounsi: Add 185.71.138.0/24 to network::external and diffscan [puppet] - 10https://gerrit.wikimedia.org/r/698206 (https://phabricator.wikimedia.org/T252132)
[14:37:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10colewhite) 05Open→03Resolved west1 was added to the nda group.  Please feel free to reopen if you encounter any related issue.
[14:38:23] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle)
[14:38:29] <wikibugs>	 (03CR) 10Elukey: "The uid looks good to me, really good catch. I see that the user has files owned only on stat1007, so we could run a quick script like the" [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite)
[14:38:30] * Krinkle staging on mwdebug1002
[14:39:21] <wikibugs>	 (03Merged) 10jenkins-bot: Set wgDiscussionToolsTalkPageParserCacheExpiry to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698201 (https://phabricator.wikimedia.org/T280605) (owner: 10Krinkle)
[14:39:58] <wikibugs>	 (03PS1) 10Muehlenhoff: htmldumps: Add profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356)
[14:40:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698208 (https://phabricator.wikimedia.org/T164456)
[14:41:28] <logmsgbot>	 !log krinkle@deploy1002 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org)
[14:41:30] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff)
[14:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::web::yaml_defs: fix data structure name [puppet] - 10https://gerrit.wikimedia.org/r/698209
[14:42:42] <Krinkle>	 eh...
[14:43:10] * Krinkle reverts
[14:43:50] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo...
[14:44:33] <Krinkle>	 OK, that was stupid of me. The backport introduces a method and a call at the same time, and I should have synced them separately. screwed by non-atomicity, saved by canaries.
[14:44:39] <Krinkle>	 re-syncing one by one now.
[14:44:57] <logmsgbot>	 !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/DiscussionTools/includes/: Iea41ab8599ffae (duration: 00m 59s)
[14:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:03] <logmsgbot>	 !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/DiscussionTools/extension.json: Iea41ab8599ffae (duration: 00m 56s)
[14:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:42] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I434d9cfa29d84f (duration: 00m 56s)
[14:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:20] <wikibugs>	 (03PS2) 10Muehlenhoff: htmldumps: Switch to common profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356)
[14:55:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff)
[15:00:45] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) Dear Mr Papaul Tshibamba,  Thank you for contacting Hewlett Packard Enterprise for your service needs. This email confirms that your request for...
[15:03:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10karapayneWMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks in Wikimedia P...
[15:04:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456)
[15:05:15] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698208 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[15:05:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[15:05:37] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[15:08:06] <wikibugs>	 (03PS1) 10Herron: prometheus::pop add retention size param and set to 100G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057)
[15:08:31] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456)
[15:10:40] <wikibugs>	 (03PS2) 10Herron: prometheus::pop add retention size param and set to 100G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057)
[15:14:42] <wikibugs>	 (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/29796/" [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron)
[15:15:17] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) I will be receiving the CPU on Monday
[15:15:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) I have the main board on site, I will be replacing it on Monday.
[15:16:37] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 (owner: 10Hashar)
[15:17:53] <wikibugs>	 (03PS3) 10Herron: prometheus::pop add retention size param and set to 80G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057)
[15:19:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10LZaman) @colewhite  Thanks for looking into this. Yes, my wikitech name is correct. Re the need for access: I need the dev sign on to look into issues and [[ https://idp.wikimedia.org/login?se...
[15:19:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron)
[15:20:46] <wikibugs>	 (03PS4) 10Herron: prometheus::pop add retention size param and set to 80G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057)
[15:25:41] <topranks>	 !log Adding 1:1 NAT configuration for fran2001 / analytics.codfw.wikimedia.org to pfw3-codfw (backup site)
[15:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:53] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo...
[15:43:13] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:45:03] <wikibugs>	 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) +1!  I'll plan to deploy the patch above (now amended to 80G retention), move data and release the vdb device from prometheus3001 next week.
[15:50:29] <wikibugs>	 (03PS15) 10Herron: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[15:56:48] <wikibugs>	 (03PS1) 10Ladsgroup: Enable wikisource group as langlink group of sourcewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698226 (https://phabricator.wikimedia.org/T275958)
[16:02:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: fix etcd connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/698228
[16:02:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mwdebug: add etcd servers, datacenter [deployment-charts] - 10https://gerrit.wikimedia.org/r/698229
[16:09:48] <wikibugs>	 (03CR) 10Herron: [C: 04-1] "Looks good to me overall, although at the moment the compiler shows an unexpected result (removal of ldap validation on both mxes) which I" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[16:34:20] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] `  Of which those **FAILED**: ` ['cloudvirt1040.eqiad.wmnet'] `
[16:37:19] <wikibugs>	 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) Thank you!
[16:38:34] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo...
[16:43:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10colewhite)
[16:44:17] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo...
[17:10:18] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` The log can be found in `/var/lo...
[17:14:50] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr) an-web1001 B1 U27 Port9 id#3025
[17:15:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr)
[17:18:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[17:20:03] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10WMDE-leszek) As an Engineering Manager at WMDE, I approve this request and confirm Kara's affiliation with WMDE.
[17:25:12] <wikibugs>	 (03PS1) 10Razzi: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/698233
[17:25:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE
[17:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:00] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE
[17:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:20] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/698233 (owner: 10Razzi)
[17:33:15] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart
[17:33:15] <logmsgbot>	 !log razzi@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99)
[17:33:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:25] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.aqs.roll-restart
[17:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:29] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki::alerts fix panelId for mediawiki exceptions alert [puppet] - 10https://gerrit.wikimedia.org/r/690540 (https://phabricator.wikimedia.org/T284301)
[17:36:57] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
[17:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:29] <wikibugs>	 (03PS1) 10Jgreen: add analytics-codfw.frdev.wikimedia.org A and PTR records [dns] - 10https://gerrit.wikimedia.org/r/698235 (https://phabricator.wikimedia.org/T284155)
[18:03:49] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] add analytics-codfw.frdev.wikimedia.org A and PTR records [dns] - 10https://gerrit.wikimedia.org/r/698235 (https://phabricator.wikimedia.org/T284155) (owner: 10Jgreen)
[18:20:54] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) per T276144#7133694 the SSH port is open to the world now.     So that means this must be done!  Let's close it ?
[18:21:08] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The following units failed: session-132777.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] galera: ensure that mariabackup is also installed [puppet] - 10https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: 10Bstorm)
[18:22:51] <wikibugs>	 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) Thanks!  I think the SSH part was T276148 (shouldnt that be closed now? We can't...
[18:23:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10BVershbow_WMF) @colewhite Thanks for your help! I read and signed the L3 acknowledgement and got the approval from @Ottomata. Anything left to do?
[18:24:28] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10serviceops, and 2 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn)
[18:25:06] <wikibugs>	 10SRE, 10Traffic, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Dzahn) 05Open→03Resolved ` ACCEPT     tcp  --  anywhere             gitlab.wikimedi...
[18:26:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10RStallman-legalteam) I can prepare this NDA. @karapayneWMDE - could you please confirm your email address here or to rstallman@wikimedia.org? I will likely route this for signatures...
[18:28:22] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Sure you want to open ssh to the public before backups and logging tasks are done?
[18:34:27] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] `  and were **ALL** successful.
[18:35:11] <wikibugs>	 10SRE, 10Packaging, 10serviceops, 10Design-Systems-team-board (Vue.js Migration Team Radar): Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (10egardner)
[18:38:59] <icinga-wm>	 PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29799/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[18:41:11] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) 18:38 < icinga-wm> PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket                     https://wikitech.wik...
[18:50:11] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (10colewhite)
[18:57:01] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:19] <wikibugs>	 (03PS1) 10Cwhite: admin: add bvershbow to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/698240 (https://phabricator.wikimedia.org/T284248)
[19:00:17] <wikibugs>	 (03CR) 10Dzahn: "noop on scandium" [puppet] - 10https://gerrit.wikimedia.org/r/698152 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[19:00:27] <wikibugs>	 (03PS2) 10Jforrester: Provide nodejs12-slim and -devel based on Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346)
[19:03:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29800/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff)
[19:06:58] <bblack>	 !log depool cp1087 - T278729
[19:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:02] <stashbot>	 T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729
[19:07:13] <wikibugs>	 (CR) Dzahn: "noop on testreduce1001" [puppet] - https://gerrit.wikimedia.org/r/697988 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[19:08:36] <wikibugs>	 SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) a:JAnstee_WMF→colewhite
[19:10:59] <wikibugs>	 (CR) Cwhite: [C: +2] admin: add bvershbow to ldap only users [puppet] - https://gerrit.wikimedia.org/r/698240 (https://phabricator.wikimedia.org/T284248) (owner: Cwhite)
[19:13:36] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (colewhite) Open→Resolved a:colewhite Added to wmf group.  Please feel free to reopen if you encounter any related issue.
[19:17:33] <icinga-wm>	 RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:52] <wikibugs>	 SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (BBlack) rsyslogd was down for repeatedly segfaulting on startup.  I was able to strace the failure and see that it kept segfaulting while reading one of its own files in `/var/spool/rsyslog/` on sta...
[19:21:04] <wikibugs>	 (CR) Cwhite: [C: +2] admin: replace SSH key for janstee [puppet] - https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: Dzahn)
[19:21:11] <wikibugs>	 (PS2) Cwhite: admin: replace SSH key for janstee [puppet] - https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: Dzahn)
[19:24:42] <wikibugs>	 SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (ssingh)
[19:24:45] <sukhe>	 mutante: ^ Monday :)
[19:24:55] <wikibugs>	 SRE, Traffic, vm-requests: Please create two Ganeti VMs for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (ssingh)
[19:25:00] <sukhe>	 ^ Tuesday!
[19:25:02] <sukhe>	 :D
[19:25:43] <sukhe>	 and then we are done, till doh5002
[19:26:00] <wikibugs>	 SRE, LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (colewhite) p:Triage→Medium
[19:28:16] <wikibugs>	 SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (colewhite)
[19:29:12] <wikibugs>	 SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (colewhite) p:Triage→Medium @KFrancis, can you assist @dang with setting up the NDA?
[19:29:40] <wikibugs>	 SRE, Traffic, vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (ssingh)
[19:31:43] <wikibugs>	 SRE, Traffic: ATS: origins server response data accounting issues - https://phabricator.wikimedia.org/T284290 (colewhite) p:Triage→Medium
[19:32:06] <wikibugs>	 SRE, Traffic: Take response size into account in CDN HTTP requests throttling - https://phabricator.wikimedia.org/T284292 (colewhite) p:Triage→Medium
[19:32:23] <mutante>	 sukhe: :) ok
[19:32:30] <wikibugs>	 SRE, SRE-tools, User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (colewhite) p:Triage→Medium
[19:32:53] <wikibugs>	 SRE, Traffic: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (colewhite) p:Triage→Medium
[19:38:45] <wikibugs>	 (CR) Bstorm: [C: +2] galera: ensure that mariabackup is also installed [puppet] - https://gerrit.wikimedia.org/r/698064 (https://phabricator.wikimedia.org/T284157) (owner: Bstorm)
[19:45:53] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:53:01] <wikibugs>	 SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` cp1087.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202106041952_bblack_1...
[19:57:16] <wikibugs>	 (PS1) Bstorm: galera: clear up confusing xtrabackup parameter [puppet] - https://gerrit.wikimedia.org/r/698251 (https://phabricator.wikimedia.org/T284157)
[19:58:53] <icinga-wm>	 PROBLEM - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 266219 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops
[20:07:49] <wikibugs>	 (CR) Dzahn: "thanks Cole :)" [puppet] - https://gerrit.wikimedia.org/r/698071 (https://phabricator.wikimedia.org/T266249) (owner: Dzahn)
[20:08:15] <wikibugs>	 (PS2) Dzahn: bacula/gitlab: add a backup::set for gitlab and use it [puppet] - https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463)
[20:09:03] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: REIMAGE
[20:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:11] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1053 is OK: OK - load average: 66.30, 73.88, 79.59 https://wikitech.wikimedia.org/wiki/Swift
[20:11:15] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: REIMAGE
[20:11:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:51] <wikibugs>	 (Abandoned) Joal: Update AQS druid datasource to 2021_05 [puppet] - https://gerrit.wikimedia.org/r/698178 (owner: Joal)
[20:46:38] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:56:12] <wikibugs>	 (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/698039 (https://phabricator.wikimedia.org/T283522) (owner: MarcoAurelio)
[20:56:25] <wikibugs>	 SRE, ops-eqiad, Traffic: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] `  and were **ALL** successful.
[20:59:02] <bblack>	 !log repool cp1087 - T278729
[20:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:07] <stashbot>	 T278729: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729
[21:05:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 145 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:06:21] <wikibugs>	 SRE, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (colewhite) Key has been updated.  Please let us know if this action resolved the problem.
[21:15:56] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4444 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:28:58] <wikibugs>	 (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/698041 (https://phabricator.wikimedia.org/T283523) (owner: MarcoAurelio)
[21:37:02] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:37:06] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:49:36] <wikibugs>	 SRE, Traffic, vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (colewhite) p:Triage→Medium a:colewhite
[21:51:14] <logmsgbot>	 !log cwhite@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh4001.wikimedia.org
[21:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:27] <logmsgbot>	 !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4001.wikimedia.org
[22:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:07] <wikibugs>	 (PS1) Cwhite: site: add doh4001 to role insetup and setup dhcp [puppet] - https://gerrit.wikimedia.org/r/698265 (https://phabricator.wikimedia.org/T284349)
[23:01:54] <wikibugs>	 SRE, Traffic, vm-requests, Patch-For-Review: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (colewhite) Cookbook ran successfully.  Currently unprovisioned.
[23:54:19] <wikibugs>	 SRE, Analytics, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) a:colewhite→JAnstee_WMF