[00:00:39] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263734 MB (15% inode=79%): Bstorm Working on this via T284964 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [00:30:10] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [00:35:37] (03PS3) 10H.krishna123: repository: add .gitignore and README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [00:36:25] (03PS4) 10H.krishna123: repository: add .gitignore and README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [00:45:23] (03CR) 10H.krishna123: "Thank you, I have made changes to the commit message, I wonder if calling the component "repository" is appropriate in this scenario?" [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [01:25:47] (03PS5) 10H.krishna123: repository: add README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [02:07:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 [02:07:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 (owner: 10TrainBranchBot) [02:24:48] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:29:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 (owner: 10TrainBranchBot) [02:46:30] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:49:18] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [03:18:08] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [04:26:46] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [04:48:18] (03PS1) 10Marostegui: install_server: Do not reimage new pc20* [puppet] - 10https://gerrit.wikimedia.org/r/699840 [04:49:24] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage new pc20* [puppet] - 10https://gerrit.wikimedia.org/r/699840 (owner: 10Marostegui) [04:49:50] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2255.codfw.wmnet, mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2371.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2393.codfw.wmnet, mw2312.codfw.wmnet, mw2353.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2329.codfw.wmnet, mw2314.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2365.codfw [04:49:50] mw2355.codfw.wmnet, mw2406.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2351.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2254.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2384.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2383.codfw.wmnet, mw2301.codfw.wmnet, mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2303.codfw.wmnet, mw [04:49:50] fw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:50:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2255.codfw.wmnet, mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2371.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2393.codfw.wmnet, mw2312.codfw.wmnet, mw2353.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2303.codfw.wmnet, mw2314.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2365.codfw [04:50:20] mw2355.codfw.wmnet, mw2406.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2351.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2254.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2384.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2383.codfw.wmnet, mw2301.codfw.wmnet, 
mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw [04:50:20] fw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:50:28] uh? [04:51:12] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 348 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:59:05] (03CR) 10Marostegui: [C: 03+2] wikireplicas: re-enable notifications for clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/699791 (owner: 10Bstorm) [05:09:18] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-06-12 04:37:53 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:24:02] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:22] marostegui: what's up in codfw? [06:00:27] did you check? [06:01:09] looks like it takes 5 seconds to render the blank page [06:01:12] Jun 15 06:00:50 lvs2010 pybal[7422]: [apaches_80 ProxyFetch] WARN: mw2391.codfw.wmnet (enabled/partially up/pooled): Fetch failed (http://www.wikidata.org/wiki/Special:BlankPage), 5.001 s [06:01:29] is there anything going on? a schema change on s1? [06:01:49] also why is this not paging [06:02:43] * apergos peeks in [06:02:45] joe: schema change on s8 codfw [06:02:50] ah I see [06:03:04] why a schema change in s8 makes the blank page of enwiki not working??? [06:03:08] Amir1: ^^ [06:03:10] that's my question [06:03:13] s8 is wikidata [06:03:22] marostegui: lemme take a look on one server [06:03:25] codfw databases aren't used [06:03:26] wait wat [06:03:40] it can be caused by that [06:03:41] Amir1: TL;DR schema change on s8 codfw master (with replication stopped) is going on [06:03:50] s8 is being read by every wiki [06:04:00] 2021-06-15T06:00:30 202100644 10.192.17.10 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/504 247 GET http://www.wikidata.org/wiki/Special:BlankPage - text/html - - Twisted PageGetter - - - - 10.192.17.10 - - [06:04:02] actually most of the reads are not from wikidat [06:04:05] is that what's also causing the few hundred mw exceptions per minute? https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=now-24h&to=now [06:04:07] Amir1: all the time? [06:04:11] joe: yes [06:04:27] Amir1: yeah, but why does it affect everything if codfw master is stopped? if codfw isn't used [06:04:43] that is the question I don't have answer for it yet [06:04:51] marostegui: it's not affecting anything in eqiad [06:05:03] so it's not live traffic [06:05:18] aha, it means it's just depooling codfw mw hosts [06:05:19] fun [06:05:29] ok mystery solved [06:05:31] joe: ah ok pheeew [06:05:39] we now require the wikidata blank page as well [06:05:41] wikibase clients all... right [06:05:42] from pybal [06:06:03] so the enwiki blank page takes 100 ms [06:06:05] as usual [06:06:07] joe: not sure why it's being invoked in blank page. 
I'll do a xhgui on it [06:06:13] the wikidata one takes 200 seconds [06:06:22] Amir1: nevermind it was a red herring [06:06:33] I forgot we now also check the wikidata blank page [06:06:40] oh directly checked? ic [06:06:52] now, another mystery, why we do have exceptions? [06:06:54] yeah see the line I pasted above [06:06:56] probably not related [06:07:02] Amir1: the schema change :D [06:07:21] but this is eqiad: https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=now-24h&to=now [06:07:37] you mean another schema change? [06:07:48] you're being misguided [06:07:55] that is collective [06:08:03] it's "eqiad" as in "eqiad's statsd [06:08:15] legend [06:08:16] so eqiad's prometheus [06:08:23] but it's everything :) [06:08:27] oh of course [06:08:48] it's coming from a promethus exporter [06:08:48] [{reqId}] {exception_url} Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 60 seconds was exceeded <- this is the exception, happening mostly in codfw and mostly for wikidata [06:09:07] I saw the exporter in puppet a while ago [06:10:32] a general note: around 60-70% of reads on s8 are not from wikidata.org but any other clients. I measured it in December [06:10:39] !log roll OSPF link-protection to all routers - T167306 [06:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:45] T167306: ospf link-protection - https://phabricator.wikimedia.org/T167306 [06:12:24] (03CR) 10Ayounsi: [C: 03+2] Add OSPF link-protection to all P2P links [homer/public] - 10https://gerrit.wikimedia.org/r/698512 (https://phabricator.wikimedia.org/T167306) (owner: 10Ayounsi) [06:16:41] away [06:16:51] good morning, need coffee :D [06:33:00] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210615T0700) [07:12:08] (03PS3) 10Tobias Andersson: Remove idGeneratorRateLimiting from production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: 10Dat Nguyen) [07:17:38] (03CR) 10Jcrespo: [C: 03+2] "> Patch Set 4:" [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:17:45] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] repository: add README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:25:30] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:33:40] (03CR) 10Ladsgroup: [C: 04-2] "Virtual +1. 
It can go in after deployment of wmf.11 landed in production in full power (~Monday 28 June)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: 10Dat Nguyen) [07:43:15] (03PS1) 10QChris: Add .gitreview [debs/cfssl] - 10https://gerrit.wikimedia.org/r/699904 [07:43:17] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/cfssl] - 10https://gerrit.wikimedia.org/r/699904 (owner: 10QChris) [07:53:44] (03PS2) 10Jelto: add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) [07:59:01] (03PS3) 10Jelto: add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) [08:02:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:02:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:06:22] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:30] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:48] (03PS3) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [08:15:34] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) I am going to create a separate task to put db2100 into service, will only reopen this if crashes return. Thank to everyone that helped here. [08:16:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2254.codfw.wmnet, mw2371.codfw.wmnet, mw2338.codfw.wmnet, mw2386.codfw.wmnet, mw2365.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2384.codfw.wmnet, mw2407.codfw.wmnet, mw2363.codfw.wmnet, mw2268.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet, mw2255.codfw.wmnet, mw2409.codfw.wmnet, mw2357.codfw.wmnet, mw2269.codfw [08:16:40] mw2355.codfw.wmnet, mw2270.codfw.wmnet, mw2385.codfw.wmnet, mw2274.codfw.wmnet, mw2277.codfw.wmnet, mw2383.codfw.wmnet, mw2336.codfw.wmnet, mw2311.codfw.wmnet, mw2389.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:18:23] hmm...what's going on here [08:19:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2371.codfw.wmnet, mw2331.codfw.wmnet, mw2393.codfw.wmnet, mw2365.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2384.codfw.wmnet, mw2272.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2255.codfw.wmnet, mw2392.codfw.wmnet, mw2353.codfw.wmnet, mw2316.codfw.wmnet, mw2275.codfw.wmnet, mw2257.codfw.wmnet, mw2387.codfw [08:19:14] mw2406.codfw.wmnet, mw2385.codfw.wmnet, mw2277.codfw.wmnet, mw2388.codfw.wmnet, mw2273.codfw.wmnet, mw2258.codfw.wmnet, mw2311.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:19:49] 10SRE, 10Wikimedia-Mailing-lists: mailman3 unsubscribe link not showing in daily article list e-mails - https://phabricator.wikimedia.org/T284548 (10Krd) 05Open→03Resolved a:03Krd Thx. 
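For context on the PyBal alerts in this log: the failures come from PyBal's ProxyFetch monitor timing out on http://www.wikidata.org/wiki/Special:BlankPage while the s8 schema change piles up queries (see the "Fetch failed ... 5.001 s" line quoted at 06:01). Below is a minimal Python sketch of that kind of check, not PyBal's actual code; the URL and the roughly 5-second budget come from the log, while the Host-header approach and everything else is illustrative.

    import time
    import urllib.request

    def proxyfetch_ok(backend, host="www.wikidata.org", timeout=5.0):
        """Fetch Special:BlankPage via one backend; False means 'mark it down'."""
        req = urllib.request.Request(
            f"http://{backend}/wiki/Special:BlankPage",
            headers={"Host": host},          # steer the request to the right wiki on that backend
        )
        start = time.monotonic()
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                resp.read()
        except OSError:
            return False                     # timeout or connection failure
        return (time.monotonic() - start) <= timeout

    # e.g. proxyfetch_ok("mw2391.codfw.wmnet") would return False while the
    # wikidata blank page takes longer than 5 s to render.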
[08:22:24] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:25:05] PyBal backend errors related to schema change on s8 - https://phabricator.wikimedia.org/T284981 [08:27:20] 10SRE, 10docker-pkg, 10serviceops, 10Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Jelto) > So we have two options for a rebuild: @Joe I implemented option 1 in https://gerrit.wikimedia.org/r/699752 Could you take a look and add your feedback?... [08:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2081', diff saved to https://phabricator.wikimedia.org/P16524 and previous config saved to /var/cache/conftool/dbconfig/20210615-082857-marostegui.json [08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:32] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:30:14] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:31:34] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 291 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:32:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2080 db2083 db2084 db2091', diff saved to https://phabricator.wikimedia.org/P16525 and previous config saved to /var/cache/conftool/dbconfig/20210615-083233-marostegui.json [08:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:26] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:44:39] (03CR) 10Gilles: [C: 03+1] Enable canary events for NavigationTiming ext streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699789 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [08:50:30] RECOVERY - snapshot of x1 in eqiad on alert1001 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2021-06-15 08:31:27 (245 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091', diff saved to https://phabricator.wikimedia.org/P16526 and previous config saved to /var/cache/conftool/dbconfig/20210615-085938-marostegui.json [08:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2082', diff saved to https://phabricator.wikimedia.org/P16527 and previous config saved to /var/cache/conftool/dbconfig/20210615-085953-marostegui.json [08:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] (03CR) 10JMeybohm: [C: 04-1] mwdebug: include nutcracker and mcrouter pools in values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [09:02:07] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2081', diff saved to https://phabricator.wikimedia.org/P16528 and previous config saved to /var/cache/conftool/dbconfig/20210615-090206-marostegui.json [09:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2084', diff saved to https://phabricator.wikimedia.org/P16529 and previous config saved to /var/cache/conftool/dbconfig/20210615-090243-marostegui.json [09:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2083', diff saved to https://phabricator.wikimedia.org/P16530 and previous config saved to /var/cache/conftool/dbconfig/20210615-090650-marostegui.json [09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] (03CR) 10Jbond: [C: 03+1] "lgtm, but better get a +1 from S&F as well just incase" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [09:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3318', diff saved to https://phabricator.wikimedia.org/P16531 and previous config saved to /var/cache/conftool/dbconfig/20210615-090802-marostegui.json [09:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [09:17:07] (03CR) 10JMeybohm: [C: 03+1] "Following discussions on IRC etc. I think this is a good way to go for istio right now. With all this README's around, this LGTM 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:22:03] (03PS2) 10Effie Mouzeli: mwdebug: include nutcracker and mcrouter pools in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) [09:22:19] (03CR) 10Effie Mouzeli: mwdebug: include nutcracker and mcrouter pools in values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [09:23:46] PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:23:48] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:23:56] PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:06] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2086:3318, db2082', diff saved to https://phabricator.wikimedia.org/P16532 and previous config saved to /var/cache/conftool/dbconfig/20210615-092409-marostegui.json [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:21] this smells like a rack going down [09:24:32] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:32] PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:24:34] PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:43] or maybe the top of rack [09:24:51] uh [09:25:04] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2080', diff saved to https://phabricator.wikimedia.org/P16533 
and previous config saved to /var/cache/conftool/dbconfig/20210615-092511-marostegui.json [09:25:12] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:40] PROBLEM - Host ps1-604-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:25:48] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:51] 10SRE, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 3 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) All the hosts have been recovered. [09:26:38] * volans checking [09:26:50] it's not just one rack AFAICT [09:26:54] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:26:58] there was a maintence IIRC [09:27:07] but not today [09:27:33] ganeti5002 is rack 604, ganeti5001 is 603 [09:27:36] yep [09:27:51] and we got 2 racks in eqsin [09:27:54] RECOVERY - Host ganeti5001 is UP: PING WARNING - Packet loss = 90%, RTA = 237.77 ms [09:27:57] I can ssh [09:28:00] into them [09:28:07] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:28:22] (03PS1) 10Effie Mouzeli: hieradata: Use TLS codfw pool for memcached replication on canaries [puppet] - 10https://gerrit.wikimedia.org/r/699908 (https://phabricator.wikimedia.org/T284420) [09:28:24] wshould we depoll? [09:28:27] *depool [09:28:38] I tried to ssh to asw1-eqsin but the connection is broken now [09:28:52] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:54] PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:55] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:29:02] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:09] volans, I think so /cc: ema [09:29:20] lI'm on ganeti5001 and the connection is still up [09:29:21] I'm +1 to depool, and we can figure it out [09:29:21] network on eqsin? [09:29:22] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:23] yeah let's depool [09:29:27] * jbond here [09:29:37] +1 on depooling [09:29:42] what's up? [09:29:52] PROBLEM - Host netflow5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:53] PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:29:54] XioNoX: something is happening in eqsin [09:29:55] (03PS1) 10Majavah: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699909 [09:29:58] who takes care of depooling? 
[09:30:02] (03PS1) 10Ema: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 [09:30:04] XioNoX: looks like network issues with eqsin [09:30:10] Ema is depooling [09:30:14] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:14] please review https://gerrit.wikimedia.org/r/699910 [09:30:23] let's go for ema patch [09:30:29] to make a quick decision [09:30:30] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:31] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:30:36] +1 [09:30:38] ema: ship it [09:30:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:41] (03CR) 10Filippo Giunchedi: [C: 03+1] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:43] ship it yes [09:30:43] (03CR) 10Jbond: [C: 03+1] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:49] (03CR) 10Ema: [C: 03+2] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:49] PROBLEM - Host ncredir-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:30:50] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:30:52] PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:00] routers are up [09:31:01] from the switch logs it seems that mgmt interfaces went down all in once https://librenms.wikimedia.org/device/163/logs [09:31:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_eqsin} site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:10] sigh [09:31:12] (03Abandoned) 10Majavah: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699909 (owner: 10Majavah) [09:31:22] majavah, nothing against yours :-), but better not going indecisive [09:31:23] ifAdminStatus: up -> NULL [09:31:23] running authdns-update [09:31:31] * volans acked all pages [09:31:47] I will check user impact, user reports, etc. [09:32:04] switch is fine too [09:32:21] who's taking IC? [09:32:24] uh [09:32:26] XioNoX: did you see the up -> Null transitions? [09:32:35] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:32:35] (sorr,y I just now realized that string of beeps was a pile of pages) [09:32:39] elukey: where? [09:32:48] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:32:54] authdns updated successfully except for dns5001/dns5002 for obvious reasons [09:33:00] XioNoX: https://librenms.wikimedia.org/device/163/logs [09:33:18] ema: let's try to update them manually, I can ssh there [09:33:24] volans: go ahead [09:33:26] so maybe it's just eqsin-infra [09:33:37] yeah looks like [09:33:40] the problem and not eqsin<-?rest-of-the-world [09:33:49] which makes things worse [09:34:00] ema: ack, I'll update dns500* [09:34:15] volans: are you updating them? those are really the dns servers we need to update [09:34:21] yes [09:34:27] but I doubt they can reach the git repo? 
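For readers unfamiliar with what the "Depool eqsin" DNS change accomplishes: GeoDNS stops handing out eqsin addresses, and clients fall through to the next site for their region. A toy Python illustration follows; the failover ordering and data structures are invented for the example and are not the real gdnsd/operations-dns configuration.

    # Toy model of depooling a site from GeoDNS routing; not the real mechanism.
    SITE_PREFERENCE = {
        "asia": ["eqsin", "ulsfo", "codfw", "eqiad"],   # illustrative ordering only
    }
    depooled = {"eqsin"}

    def pick_site(region):
        """Return the first still-pooled site for a region."""
        for site in SITE_PREFERENCE[region]:
            if site not in depooled:
                return site
        raise RuntimeError("no pooled site left for " + region)

    print(pick_site("asia"))    # -> "ulsfo": a nearby site absorbs the depooled traffic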
[09:34:29] elukey: that's just snmp pooling issues [09:34:45] ack [09:34:47] I agree it looks like eqsin-infra [09:34:52] not eqsin-world [09:34:54] pull from gerrit worked [09:34:58] great [09:35:01] I'm just running authdn-update [09:35:03] because it went via the internet [09:35:04] no manual hacks [09:35:11] and see how it goes [09:35:13] authdns-local-update? [09:35:45] interestingly it seems v4 only? I can v6-ping text-lb.eqsin.wikimedia.org from alert1001 just fine [09:35:45] nothing on the routers logs [09:35:56] volans: please let us know when the process is finished [09:36:00] created an IC document https://docs.google.com/document/d/1_rV0RU9wZ0Y1VQUJkOq5L2uDUv-7XgOCuJyR6o5f_BY/edit (backfilling now) [09:36:01] fyi, user report in -commons [09:36:02] finished [09:36:06] 5001/2 updated [09:36:09] all the others failed [09:36:11] the clush [09:36:14] but that was covered by ema [09:36:24] ok so in the next 5 minutes, we should see the traffic shift [09:36:30] 10SRE, 10Traffic, 10netops: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10jcrespo) [09:36:30] yup authdns2001.wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002].wikimedia.org went fine [09:36:42] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 47.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:36:44] I installed mtr-tiny on alert1001 [09:36:47] jbond, created initial task: https://phabricator.wikimedia.org/T284986 [09:37:01] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Majavah) [09:37:18] I see ulsfo is peaking [09:37:30] smokepings confirms the connectivity issues - https://smokeping.wikimedia.org/?displaymode=n;start=2021-06-15%2006:37;end=now;target=eqsin.Hosts.bast5002 [09:37:40] but we're still down ~ 1 million requests per minute [09:38:11] 4 twitter reports saying down, so cleary affecting users [09:38:16] v4 from eqiad to bast5003 doesn't go trhough [09:38:24] it is weird, netflow5001 is reported down, I can ssh and ping cp5001 for example (both v4 and v6) [09:38:31] XioNoX: if that helps I can ping bast1003 from dns5001 but I can't telnet 22 [09:38:44] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Peachey88) [09:39:33] request rates are slowly recovering [09:39:40] XioNoX: bast5002 right? 
(I don't find 5003) [09:39:47] yeah [09:40:11] both v4 and v6 the telnet above [09:40:12] fwiw [09:40:30] marostegui, I cannot change topic because I am not op, can you from up to "eqsin network issues" [09:41:01] jynus: I' changing it now [09:41:10] now v6 seems to go throuh but v4 not [09:41:16] going to try to kill the telia transport link [09:41:23] ack [09:41:28] it might be the one missbehaving and only letting some traffic through [09:41:49] RECOVERY - Host cr3-eqsin is UP: PING WARNING - Packet loss = 60%, RTA = 238.58 ms [09:41:49] RECOVERY - Host netflow5001 is UP: PING WARNING - Packet loss = 60%, RTA = 239.70 ms [09:41:50] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.84 ms [09:41:50] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 237.93 ms [09:41:52] RECOVERY - Host bast5002 is UP: PING OK - Packet loss = 0%, RTA = 238.75 ms [09:41:52] RECOVERY - Host ncredir5002 is UP: PING OK - Packet loss = 0%, RTA = 238.21 ms [09:41:53] RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.27 ms [09:41:54] RECOVERY - Host ganeti5003 is UP: PING OK - Packet loss = 0%, RTA = 237.94 ms [09:41:54] RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 237.92 ms [09:41:54] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.68 ms [09:41:54] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.43 ms [09:41:54] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.49 ms [09:41:54] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.44 ms [09:41:54] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.51 ms [09:41:55] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.49 ms [09:41:55] RECOVERY - Host ps1-604-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.91 ms [09:41:56] RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 238.20 ms [09:41:58] cr2-codfw reports a bfd session down to cr3-eqsin [09:41:58] RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 239.00 ms [09:41:59] well [09:42:00] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 238.17 ms [09:42:03] lol [09:42:04] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:42:04] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 241.05 ms [09:42:05] :D [09:42:06] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 238.45 ms [09:42:07] I guess we have our answer :) [09:42:09] xD [09:42:11] 🎉 [09:42:13] nice call :D [09:42:27] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 239.31 ms [09:42:31] RECOVERY - Host ncredir-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 238.74 ms [09:42:33] !log cr1-codfw# set interfaces xe-5/1/2 disable [09:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:39] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 237.85 ms [09:42:42] nice :) [09:42:46] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is OK: OK - Certificate *.wikipedia.org will expire on Tue 16 Nov 2021 11:59:59 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:42:53] * jbond sorry computer crashed shortly after posting gdoc catching up [09:43:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch is correct (good job!), but I think we need to explicitly tell the system that we want to only run this timer once the baseimage" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [09:43:45] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:43:47] transport traffic is now going through the GRE tunnel so better to keep eqsin depooled for now [09:43:58] people report things are now ok in Asia, as far as users are concerned [09:44:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:44:07] XioNoX: ack [09:44:21] traffic is back to normal standards, I think the emergency is over [09:44:32] is eqsin still depooled? [09:44:38] yep [09:44:43] ack thx [09:44:54] let's open a case to Telia [09:45:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:14] yes, XioNoX thinks it's best given we're on the GRE tunnel for eqsin <-> prod connection [09:45:28] netbox is always last :D [09:46:36] traffic is back to pre-incident levels, at least in absolute number [09:46:46] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:11] joe: it's a 15m systemd timer, ofc it's last :D [09:47:29] I know, I just wanted to give you a hard time :D [09:49:50] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:50:05] anyone sees any high prio ongoing issues? other than follow up [09:50:11] user reports, etc? [09:50:25] don't think so jynus [09:50:33] all looks under control for now [09:50:38] sent the email to Telia NOC [09:50:43] volans: were you raising a case with Telia? [09:50:44] jbond, then that is the cue to "consider it closed" [09:50:48] answered :) [09:51:37] depending on Telia's answer I might turn the link back up with a lower preference, so no traffic goes through it but they can test it [09:51:52] so yeah, the actual user-facing outage lasted between 07:30 and 07:44 [09:51:56] as a follow up, can I get topic permissions here? [09:52:32] ack officially closing the incident I think the only follow up is to repool eqsin at some point [09:52:56] yep, we can track that on open ticket, plus doc needed, etc [09:53:05] yes creating ticket now [09:53:21] jbond, T284986 [09:53:23] T284986: Wikimedias eqsin datacenter has network connectivity issues (?)
- https://phabricator.wikimedia.org/T284986 [09:53:39] I just need to update the title [09:53:49] ack can repurpose that oen [09:53:54] we lost about 450k*14 requests, so about 6 million requests [09:54:19] (very rough guesstimate) [09:54:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:22] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jcrespo) [09:55:17] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jbond) Once telia issues have been resolved we need to repool ESQIN. @ayounsi can you confirm when we are good to repool [09:56:13] are the cp to cp connections over v4 only or also v6? [09:56:53] XioNoX: there are no cp to cp connections between DCs anymore [09:57:14] ema: I mean the eqsin to eqiad connections [09:57:22] whatever impacted the users :) [09:57:55] aka, if v6 connectivity between the DC was still working, would it uses it? [09:58:02] hmm that would depend on mw DNS records [09:58:06] ah, traffic to the origin is v4 only [09:58:12] here you go :) [10:00:02] s/mw/appserver|api/ :) [10:00:39] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:01:15] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [10:02:21] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [10:02:56] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [10:04:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:11:32] (03PS1) 10Jbond: P::mediawiki::mcrouter_wancache: add shad data for cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/699912 [10:12:35] (03CR) 10Jbond: [C: 03+2] P::mediawiki::mcrouter_wancache: add shad data for cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/699912 (owner: 10Jbond) [10:16:24] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (10hashar) For Beta that should be fixed via https://gerrit.wikimedia.org/r/c/operations/puppet/+/699912/ :) [10:16:47] jbond: nice, that should fix puppet on the deployment-deployXXXX instances 6-) [10:18:21] jbond: thank you! 
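The ad-hoc connectivity checks used during the incident above ("I can ping ... but I can't telnet 22", IPv4 and IPv6 behaving differently) can be scripted as a small probe. A sketch follows, using a TCP connection to port 22 as the test; the target hostname is purely illustrative.

    import socket

    def tcp_reachable(host, port=22, family=socket.AF_INET, timeout=5.0):
        """True if a TCP connection to host:port succeeds over the given address family."""
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError:
            return False                      # e.g. no AAAA record when asking for IPv6
        for af, stype, proto, _canon, addr in infos:
            try:
                with socket.socket(af, stype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(addr)
                    return True
            except OSError:
                continue                      # try the next resolved address
        return False

    for fam, name in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        ok = tcp_reachable("bast5002.wikimedia.org", family=fam)
        print(name, "reachable" if ok else "NOT reachable")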
[10:18:31] hashar: indeed all fixed [10:18:36] \o/ [10:18:59] guess I will get my lunch break now [10:20:04] :) [10:22:58] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10diego) [10:28:02] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:38:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Fix prometheus monitoring for Toolforge Ingress [puppet] - 10https://gerrit.wikimedia.org/r/699484 (https://phabricator.wikimedia.org/T284353) (owner: 10Majavah) [10:44:33] alright, telia is saying the outage is resolved [10:45:46] !log re-enable cr1-codfw:xe-5/1/2 [10:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:12] my mtr still works [10:46:36] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:40] PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:58] XioNoX: ^^^ [10:47:08] might be transient of the re-enable? [10:47:10] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:34] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:34] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:47:52] RECOVERY - Host ganeti5002 is UP: PING WARNING - Packet loss = 80%, RTA = 239.29 ms [10:47:54] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 237.92 ms [10:47:57] I re-disabled it [10:47:58] RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 238.16 ms [10:48:06] RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 237.96 ms [10:53:06] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:02:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [11:05:26] 10SRE, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi) [11:05:35] 10SRE, 10netops, 10Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10ayounsi) 05Open→03Resolved Closed! After 4 years and 1 week. [11:06:52] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10diego) [11:06:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10diego) [11:07:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [11:12:33] (03PS1) 10H.krishna123: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) [11:14:40] (03CR) 10H.krishna123: "I've just committed the skeleton code for the API backend for exposing data from the databases. 
There is no database functionality yet, bu" [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [11:20:17] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10Aklapper) [11:23:38] (03CR) 10Jcrespo: "I don't see any big issue with this, although it is a bit early regarding functionality. Will wait for your explanation live tomorrow." [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [11:41:08] (03CR) 10Zfilipin: "Main branch is failing on my machine. I'll update the configuration and rebase this patch." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [11:55:22] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:39] (03CR) 10H.krishna123: "Okay, sounds good, will go through it tomorrow." [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [12:06:18] c-9Ginu. [12:08:06] topranks: https://bash.toolforge.org/quip/AU7VV1aJ6snAnmqnK_0n [12:08:24] (yes tooting my own horn) [12:09:23] smart :) [12:09:39] unlike yours truly. doh! [12:09:58] hehe we've all been there [12:17:15] (03PS1) 10Marostegui: db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699925 [12:17:54] (03CR) 10Marostegui: [C: 03+2] db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699925 (owner: 10Marostegui) [12:29:32] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:53:37] 10SRE, 10Pybal, 10Traffic: PyBal healthchecks should specify User-Agent instead of using "Twisted PageGetter" - https://phabricator.wikimedia.org/T246431 (10ema) [12:55:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:57:29] (03PS2) 10Aklapper: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [12:58:42] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [12:58:45] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Aklapper) [12:58:53] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Aklapper) [12:58:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [13:02:36] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) Approved. 
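The software/bernard patch discussed above is described only as "skeleton code for the API backend ... no database functionality yet"; the actual code is not shown in this log. As a purely hypothetical illustration of such a skeleton (framework, routes and names are all assumed, using Flask):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/healthz")
    def healthz():
        # Liveness endpoint; useful before any database wiring exists.
        return jsonify(status="ok")

    @app.route("/api/v1/backups")
    def list_backups():
        # Placeholder: database-backed results would be returned here later.
        return jsonify(items=[])

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)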
[13:02:44] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [13:03:37] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) 05Open→03Resolved Marc has been added to VictorOps and the SRE team; resolving this task. Thanks, all. [13:10:17] !log disable puppet on canaries to deploy 699908 [13:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:46] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: Use TLS codfw pool for memcached replication on canaries [puppet] - 10https://gerrit.wikimedia.org/r/699908 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [13:15:27] !log enable puppet on canaries [13:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/699933 [13:21:59] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/699933 (owner: 10Marostegui) [13:23:20] !log Upgrade clouddb1018 [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:22] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/699855 [13:25:24] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/699855 (owner: 10Marostegui) [13:28:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) I Sent an email to Dell asking them to dispatch one of their Tech to do the troubleshooting on this server, since it is taking a while... [13:40:42] (03CR) 10David Caro: [C: 04-1] "Look ok, just got some questions" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:48:05] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) [13:50:13] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) Hi @ChristineDeKock: Can you please read through and sign the L3 (Acknowledgement of Wikimedia Server Access Responsibilities) document? Thank you! 
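The clouddb1018 upgrade above, like the earlier db20xx schema-change work, follows the usual depool, maintain, repool cycle. A sketch of that cycle as a script; the dbctl sub-commands are written from memory and should be treated as an assumption rather than a reference, and "run-schema-change" is a stand-in, not a real command.

    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def with_depooled(host, maintenance):
        """Depool a database host, run maintenance on it, then repool it."""
        run("dbctl", "instance", host, "depool")                  # assumed CLI
        run("dbctl", "config", "commit", "-m", f"Depool {host}")
        try:
            maintenance(host)                                      # e.g. ALTER TABLE / package upgrade
        finally:
            run("dbctl", "instance", host, "repool")
            run("dbctl", "config", "commit", "-m", f"Repool {host}")

    # with_depooled("db2081", lambda h: run("ssh", h, "sudo", "run-schema-change"))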
[13:53:29] (03PS5) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:02:31] (03CR) 10JMeybohm: [C: 04-1] "I feel like it would make more sense to incorporate this into the docker_registry_ha:web class as that one actually requires nginx and co" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:09:11] 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10Majavah) [14:10:26] 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10RoySmith) [14:10:34] (03CR) 10Ahmon Dancy: [C: 03+1] disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [14:11:46] (03CR) 10Ahmon Dancy: [C: 03+1] CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [14:15:08] (03CR) 10David Caro: [C: 03+1] O:base::resolving: make nameservers mandatory (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:16:54] (03CR) 10David Caro: "I'll wait until the tests are fixed 😊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:17:08] (03CR) 10JMeybohm: [C: 03+1] "I've not tried it (and it is helmfile, so there is a good chance the asterisk does not work 😊) but it looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:17:52] (03CR) 10Jbond: "thanks for the review see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:18:12] (03PS7) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:18:20] (03PS9) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:19:48] (03CR) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:20:06] (03PS1) 10Ssingh: admin: add christinedk to analytics-privatedata-users (with kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/699938 (https://phabricator.wikimedia.org/T284987) [14:25:49] !log re-enable cr1-codfw:xe-5/1/2 [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:17] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:42] (03CR) 10Sfigor: [C: 03+1] disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [14:30:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) Hi @ssingh. Thank you for your trouble. I have signed the L3 form. 
[14:30:17] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:31:01] (03PS10) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:31:36] (03PS11) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:31:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) [14:31:56] (03PS6) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:32:25] (03PS8) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:32:38] (03PS12) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:34:00] (03CR) 10Sfigor: [C: 03+1] gitlab_backup_keep_time to 3 days [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699464 (https://phabricator.wikimedia.org/T274463) (owner: 10Brennen Bearnes) [14:35:57] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:36:35] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:37:19] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:37:55] (03PS9) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:38:10] (03CR) 10Ssingh: [C: 03+2] admin: add christinedk to analytics-privatedata-users (with kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/699938 (https://phabricator.wikimedia.org/T284987) (owner: 10Ssingh) [14:39:27] (03PS10) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:41:05] (03PS7) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:41:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) 05Open→03Resolved a:03ssingh >>! In T284987#7158017, @ChristineDeKock wrote: > Hi @ssingh. Thank you for your trouble. I have signed the L3 fo... 
[14:41:28] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:41:51] (03PS11) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:43:03] (03CR) 10David Caro: [C: 03+1] O:base::resolving: drop the domain keyword and use the domain fact (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:43:50] (03CR) 10Sfigor: [C: 03+1] CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [14:48:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @robh mw1414-1422 were missing the mgmt cables. Fixed and they're good to go. it appears John racked the others, I will add that to my list. [14:48:23] (03PS8) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:48:25] (03PS12) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:48:30] (03PS13) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:48:59] (03PS4) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [14:49:01] (03PS1) 10Ema: varnish: add prometheus histogram varnish_processing_seconds [puppet] - 10https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) [14:50:36] (03CR) 10Effie Mouzeli: "> Patch Set 2: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:51:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] (03PS1) 10Razzi: yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) [14:51:31] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:52:49] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:52:58] (03CR) 10Elukey: [C: 03+1] yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [14:53:17] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:53:38] (03CR) 10jerkins-bot: [V: 04-1] varnish: add prometheus histogram varnish_processing_seconds [puppet] - 10https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [14:54:45] 10SRE, 10Traffic: 503 errors from varnish - 
https://phabricator.wikimedia.org/T284996 (10ssingh) p:05Triage→03High [14:55:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:57:12] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:59:41] (03PS13) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [15:00:16] (03CR) 10Razzi: [C: 03+2] yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [15:04:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10Cmjohnson) [15:04:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH These are ready for you whenever you get a chance [15:07:40] (03CR) 10Jbond: O:base::resolving: make nameservers mandatory (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:07:52] (03PS14) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [15:08:34] (03PS9) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [15:08:50] (03PS14) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [15:10:05] (03PS15) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [15:10:08] * jbond rebaseing hell :S [15:13:17] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:13:49] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:14:09] (03CR) 10Jbond: "CI issues seem to be unrelated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:14:25] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:15:35] (03CR) 10David Caro: [C: 03+1] O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:18:00] dcaro: thanks for the reviews just an fyi CI seems to be playing up at the moment so ill leave that change set for today [15:20:37] 
jbond: ack, np, thanks for the patches :) [15:22:27] np :) [15:24:55] (03PS2) 10MSantos: maps: fix SQL modules paths in import script [puppet] - 10https://gerrit.wikimedia.org/r/695240 [15:55:36] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-06-15 14:27:42 (577 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:56:20] my hint to go^ [16:11:02] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 60 days, 0:00:00 on an-master1002.eqiad.wmnet with reason: Update operating system to bullseye [16:11:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 60 days, 0:00:00 on an-master1002.eqiad.wmnet with reason: Update operating system to bullseye [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:20] (03CR) 10Majavah: "Thanks!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:11:36] (03CR) 10Majavah: [C: 03+2] python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:12:56] (03Merged) 10jenkins-bot: python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:14:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:30:10] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:12] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:16] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:26] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:28] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:32] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:34] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:36] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:36] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:42] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:43] we are in maintenance mode, sorry for the spam [16:30:44] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:44] PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:54] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:58] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:04] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:06] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:07] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:13] PROBLEM - Hadoop NodeManager on 
an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:14] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:16] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:18] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:19] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:20] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:21] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:22] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:26] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:27] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:34] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:36] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:46] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:50] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:54] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:56] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on 
an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:07] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:10] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:27] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:32] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:48] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:53] Hi all, we are aware of this issue and in readonly mode on hdfs, so this will not cause data loss [16:32:54] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:02] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:52] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:03] some alert aggregation MAY be needed :D [16:34:07] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:14] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:24] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:22] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:32] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:36] RECOVERY - Hadoop NodeManager on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:38] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:40] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:44] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:54] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:26] ok we are good :) [16:37:26] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:30] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:02] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:16] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:26] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:57] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:17] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:14] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:26] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:57] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:44] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:02] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:10] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:50] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:54] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, 
args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:04] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:44] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:10] Quick summary of what's going on with hadoop: we're in maintenance mode for an os upgrade, and have 1 active and 1 standby namenode. I meant to stop hadoop on the standby, but accidentally did so on the active; when I realized my mistake I restarted hadoop on the active and stopped hadoop on the standby, but the original standby had become the active and the original active was still recovering. Sorry for the spam! [16:45:34] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:42] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:07] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:48] RECOVERY - Hadoop NodeManager on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:36] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:00] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:14] RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:32] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:34] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:48] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:02] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:18] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:28] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:46] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:50:47] RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:27] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:34] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:50] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:54] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:06] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:20] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:26] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:40] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:44] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:56] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:14] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:42] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:54] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:34] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:36] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:20] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:57:18] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:09:44] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-master1002.eqiad.wmnet with reason: REIMAGE [17:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-master1002.eqiad.wmnet with reason: REIMAGE [17:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:26] (03CR) 10Dzahn: [C: 03+2] "approved by langcom" [dns] - 10https://gerrit.wikimedia.org/r/698521 (https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [17:14:46] (03PS3) 10Dzahn: Add dag to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/698521 
(https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [17:17:04] !log new Wikimedia language "dag" added - Dagbani (or Dagbane), also known as Dagbanli and Dagbanle, is a Gur language spoken in Ghana. [17:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:26] (03PS2) 10Dzahn: Add shi to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) (owner: 10Gerrit maintenance bot) [17:20:13] (03CR) 10Dzahn: [C: 03+2] "approved by langcom - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Tachelhit" [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) (owner: 10Gerrit maintenance bot) [17:21:10] !log new Wikimedia language "shi" added - Shilha /ˈʃɪlhə/ is a Berber language native to Shilha people. The endonym is Taclḥit /taʃlʜijt/, and in recent English publications the language is often rendered Tashelhiyt or Tashelhit. [17:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] (03PS1) 10Razzi: yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) [17:35:13] (03CR) 10jerkins-bot: [V: 04-1] yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [17:35:57] (03PS2) 10Razzi: yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) [17:39:15] (03CR) 10Razzi: [C: 03+2] yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [17:48:34] (03PS1) 10BryanDavis: toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 [17:50:04] (03CR) 10jerkins-bot: [V: 04-1] toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [17:54:54] !log testing upcoming Scap release on beta [17:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:04] (03PS2) 10BryanDavis: toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 [18:02:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:03:09] (03PS1) 10Cathal Mooney: Repool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699957 (https://phabricator.wikimedia.org/T284986) [18:06:08] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) I made a typo in the commit msg so this didn't link: https://gerrit.wikimedia.org/r/c/operations/dns/+/699957 [18:08:42] (03Abandoned) 10Cathal Mooney: Repool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699957 (https://phabricator.wikimedia.org/T284986) (owner: 10Cathal Mooney) [18:10:16] (03PS1) 10Cathal Mooney: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 [18:14:15] (03CR) 10Volans: [C: 03+1] "If the link has been stable for few hours LGTM, but make sure there is someone around for the next hour or so just in case." 
[dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:17:37] (03PS2) 10Cathal Mooney: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 [18:24:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:17] I was going to merge above revert and update DNS to re-pool eqsin. [18:25:20] Can anyone advise what dashboards might be good to track during/after? [18:25:41] I know how to check the authdns has changed and router graphs etc. for traffic patterns but I'm sure there are other things also.... [18:32:17] topranks: as a start I would keep an eye on https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=now-12h&to=now [18:32:34] you should see eqsin picking up traffic from esams, and preferably without an error spike :) [18:32:35] cool.... yep have that one open :) [18:32:55] yeah I can see the errors there earlier today. [18:33:55] between that, and having an eye on this channel for alerts, you should be all set [18:34:04] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:34:48] ok, thanks rzl I will proceed cautiously now. [18:34:54] I'm happy to ride along for a bit too, if you want some company just in case [18:35:11] (03CR) 10RLazarus: [C: 03+1] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:35:44] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:38:21] rzl: thanks, but did I not see that you weren't feeling so good? [18:38:33] that was yesterday! appreciate the thought though <3 [18:38:49] ah ok. well hope you're doing better :) [18:39:24] heh, better enough :) [18:40:00] btw forgive me if you already know this -- merging in the dns repo is similar to merging in the puppet repo, jenkins won't auto-submit when you +2 [18:40:14] cool thanks. [18:40:22] yes this is the first non-homer change I'm doing. [18:40:27] instead, hit the submit button in gerrit when you're ready, and then `sudo authdns-update` from any authdns host [18:40:32] I submitted there, looks ok. [18:40:48] I will do that now, and watch watch happens. [18:40:52] 👍 [18:40:59] scary ! [18:41:14] what's the worst that could happen :D [18:41:30] wikipedia breaks for most of Asia? we'll de-pool it again :) [18:41:40] it'll be fine :P [18:42:38] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) Ok @volans was kind enough to explain how I could just revert the original change instead: https://gerrit.wikimedia.org/r/c/... [18:45:16] presume I should run with "sudo" ? [18:45:54] yep [18:45:56] I do "sudo -s ..." [18:46:20] "sudo -s" and drop to root shell and run it? [18:46:57] can you do "sudo -s " ? not familiar with doing that. [18:50:02] I did it your way XioNoX, will need to look into exactly what the "-s" does in that scenario. 
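To illustrate the "sudo -s" question above: a minimal sketch of the two invocation styles, assuming a stock sudo configuration. Only authdns-update is taken from the log; nothing else here describes Wikimedia's actual setup.

    # run one command with root privileges (the form the SAL entries above record)
    sudo authdns-update

    # "sudo -s" starts a root shell (your login shell, run as root); commands are then typed inside it
    sudo -s
    authdns-update
    exit

    # a command can also be appended after -s: it is handed to that root shell via -c,
    # so in practice this behaves much like the plain "sudo authdns-update" form
    sudo -s authdns-update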
[18:50:20] dns in eqsin returning local IP again [18:51:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:03] I guess resolver caches need to time out before it really picks up though. [18:53:11] yep [18:55:46] seeing some small uptick in graphs now. [18:56:04] I've a test box in that region it's working from there too [18:56:13] (03PS4) 10Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) [18:57:27] yeah, looking good [18:58:17] (03CR) 10BryanDavis: "PCC diff: https://puppet-compiler.wmflabs.org/compiler1002/29891/tools-sgeexec-0906.tools.eqiad.wmflabs/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [19:00:00] topranks: maybe obvious, but don't forget we depooled eqsin pretty near its daily peak, and now we're repooling it at just about trough -- so don't be surprised when it comes back with way less traffic [19:00:12] yep good point. [19:00:18] (03CR) 10Bstorm: [C: 03+2] "It'll still be possible to explode grid nodes, but this is a whole lot less rope to do it with." [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [19:00:25] I set the graphs to 24h there to get a sense of where we'd be at this hour. [19:00:41] 👍 [19:01:52] that record is on a five-minute TTL, right? so we should be just about there, modulo misbehaved caches [19:02:11] so modulo all the sites that keep crap around for 24 hours "just because" [19:03:16] 10-min on dyna.wikimedia.org [19:03:34] we're close to where we were this time yesterday on the varnish graphs. [19:04:03] aspergos: yes of course :) [19:04:03] oops, that's what I get for not checking [19:08:00] Network traffic graphs have caught up - levels also similar to yesterday. [19:09:12] (03PS1) 10BryanDavis: toolhub: fix php.ini path [puppet] - 10https://gerrit.wikimedia.org/r/699966 [19:10:19] cool -- probably best to hang around for a while and keep an eye out for alerts as it soaks in, but looks like we're in good shape [19:10:32] don't forget to update the timeline in the incident doc, if you don't mind [19:10:45] and nice job :) [19:11:02] (03CR) 10Bstorm: [C: 03+2] toolhub: fix php.ini path [puppet] - 10https://gerrit.wikimedia.org/r/699966 (owner: 10BryanDavis) [19:12:50] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) CR merged and DNS updated. All looks good, dns servers are returning the eqsin IPs again and traffic is back to normal level... [19:13:52] rzl: that last point I'd forgot. [19:13:55] good call thanks. 
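On the TTL discussion above: one way to confirm how long resolvers may cache the record is to read the TTL column of a dig answer. This is only a sketch; dyna.wikimedia.org is the name quoted in the log, ns0.wikimedia.org is assumed here as an example authoritative server, and the numbers returned will vary by resolver and cache age.

    # ask an authoritative server: the second column is the configured TTL in seconds
    dig +noall +answer dyna.wikimedia.org @ns0.wikimedia.org

    # ask the local recursive resolver: the TTL counts down as the cached answer ages,
    # which is why repooled traffic ramps up over roughly one TTL (plus misbehaved caches)
    dig +noall +answer dyna.wikimedia.org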
[19:14:18] cheers for all the help :)
[19:15:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:02] (PS1) Ottomata: airflow::instance - allow access to API by default [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973)
[19:19:32] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29892/console" [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[19:20:43] (CR) Ottomata: [V: +1 C: +2] airflow::instance - allow access to API by default [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[19:55:33] (CR) Umherirrender: [C: -1] Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: Umherirrender)
[20:01:07] (PS5) Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199)
[20:01:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:44] (CR) Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: Umherirrender)
[20:26:38] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[20:32:42] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson)
[20:33:18] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson) a:Cmjohnson→RobH @robh the onsite work for this server is completed
[20:34:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:35:40] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:38:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:04:49] (PS1) Bstorm: nfs prometheus: change to strings for dir sizes [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964)
[21:06:35] (CR) Bstorm: "I've tested this live on the server, effectively. So its probably ready to go. I have another patch that I may try after this that turns t" [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964) (owner: Bstorm)
[21:29:04] (CR) Bstorm: [C: +2] "I'm going to go ahead and merge this to get the data on Grafana" [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964) (owner: Bstorm)
[21:29:58] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (Legoktm) @srodlund one more thing, in the 3rd paragraph, can we switch "Why we didn’t… ?" -> "Why didn’t we… ?" (spotted by @Krinkle)
[21:30:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:01:46] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:01:50] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:27:36] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:38:41] (PS3) Platonides: eswiki AbuseFilter config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797) (owner: Zabe)
[22:40:28] (CR) Platonides: [C: +1] eswiki AbuseFilter config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797) (owner: Zabe)
[22:58:10] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:32] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:52] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:02] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:59:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:48] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:15:32] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:15:40] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:16:20] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:16:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:16:36] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:17:02] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:21:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[23:38:14] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:53:28] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37