[00:00:39] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263734 MB (15% inode=79%): Bstorm Working on this via T284964 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [00:30:10] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [00:35:37] (03PS3) 10H.krishna123: repository: add .gitignore and README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [00:36:25] (03PS4) 10H.krishna123: repository: add .gitignore and README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [00:45:23] (03CR) 10H.krishna123: "Thank you, I have made changes to the commit message, I wonder if calling the component "repository" is appropriate in this scenario?" [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [01:25:47] (03PS5) 10H.krishna123: repository: add README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) [02:07:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 [02:07:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 (owner: 10TrainBranchBot) [02:24:48] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:29:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.10 [core] (wmf/1.37.0-wmf.10) - 10https://gerrit.wikimedia.org/r/699826 (owner: 10TrainBranchBot) [02:46:30] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:49:18] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [03:18:08] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [04:26:46] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [04:48:18] (03PS1) 10Marostegui: install_server: Do not reimage new pc20* [puppet] - 10https://gerrit.wikimedia.org/r/699840 [04:49:24] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage new pc20* [puppet] - 10https://gerrit.wikimedia.org/r/699840 (owner: 10Marostegui) [04:49:50] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2255.codfw.wmnet, mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2371.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2393.codfw.wmnet, mw2312.codfw.wmnet, mw2353.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2329.codfw.wmnet, mw2314.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2365.codfw [04:49:50] mw2355.codfw.wmnet, mw2406.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2351.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2254.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2384.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2383.codfw.wmnet, mw2301.codfw.wmnet, mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2303.codfw.wmnet, mw [04:49:50] fw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:50:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2255.codfw.wmnet, mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2371.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2393.codfw.wmnet, mw2312.codfw.wmnet, mw2353.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2303.codfw.wmnet, mw2314.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2365.codfw [04:50:20] mw2355.codfw.wmnet, mw2406.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2351.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2254.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2384.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2383.codfw.wmnet, mw2301.codfw.wmnet, 
mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw [04:50:20] fw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:50:28] uh? [04:51:12] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 348 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:59:05] (03CR) 10Marostegui: [C: 03+2] wikireplicas: re-enable notifications for clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/699791 (owner: 10Bstorm) [05:09:18] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-06-12 04:37:53 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:24:02] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:22] marostegui: what's up in codfw? [06:00:27] did you check? [06:01:09] looks like it takes 5 seconds to render the blank page [06:01:12] Jun 15 06:00:50 lvs2010 pybal[7422]: [apaches_80 ProxyFetch] WARN: mw2391.codfw.wmnet (enabled/partially up/pooled): Fetch failed (http://www.wikidata.org/wiki/Special:BlankPage), 5.001 s [06:01:29] is there anything going on? a schema change on s1? [06:01:49] also why is this not paging [06:02:43] * apergos peeks in [06:02:45] joe: schema change on s8 codfw [06:02:50] ah I see [06:03:04] why a schema change in s8 makes the blank page of enwiki not working??? [06:03:08] Amir1: ^^ [06:03:10] that's my question [06:03:13] s8 is wikidata [06:03:22] marostegui: lemme take a look on one server [06:03:25] codfw databases aren't used [06:03:26] wait wat [06:03:40] it can be caused by that [06:03:41] Amir1: TL;DR schema change on s8 codfw master (with replication stopped) is going on [06:03:50] s8 is being read by every wiki [06:04:00] 2021-06-15T06:00:30 202100644 10.192.17.10 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/504 247 GET http://www.wikidata.org/wiki/Special:BlankPage - text/html - - Twisted PageGetter - - - - 10.192.17.10 - - [06:04:02] actually most of the reads are not from wikidat [06:04:05] is that what's also causing the few hundred mw exceptions per minute? https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=now-24h&to=now [06:04:07] Amir1: all the time? [06:04:11] joe: yes [06:04:27] Amir1: yeah, but why does it affect everything if codfw master is stopped? if codfw isn't used [06:04:43] that is the question I don't have answer for it yet [06:04:51] marostegui: it's not affecting anything in eqiad [06:05:03] so it's not live traffic [06:05:18] aha, it means it's just depooling codfw mw hosts [06:05:19] fun [06:05:29] ok mystery solved [06:05:31] joe: ah ok pheeew [06:05:39] we now require the wikidata blank page as well [06:05:41] wikibase clients all... right [06:05:42] from pybal [06:06:03] so the enwiki blank page takes 100 ms [06:06:05] as usual [06:06:07] joe: not sure why it's being invoked in blank page. 
I'll do a xhgui on it [06:06:13] the wikidata one takes 200 seconds [06:06:22] Amir1: nevermind it was a red herring [06:06:33] I forgot we now also check the wikidata blank page [06:06:40] oh directly checked? ic [06:06:52] now, another mystery, why we do have exceptions? [06:06:54] yeah see the line I pasted above [06:06:56] probably not related [06:07:02] Amir1: the schema change :D [06:07:21] but this is eqiad: https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=now-24h&to=now [06:07:37] you mean another schema change? [06:07:48] you're being misguided [06:07:55] that is collective [06:08:03] it's "eqiad" as in "eqiad's statsd [06:08:15] legend [06:08:16] so eqiad's prometheus [06:08:23] but it's everything :) [06:08:27] oh of course [06:08:48] it's coming from a promethus exporter [06:08:48] [{reqId}] {exception_url} Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 60 seconds was exceeded <- this is the exception, happening mostly in codfw and mostly for wikidata [06:09:07] I saw the exporter in puppet a while ago [06:10:32] a general note: around 60-70% of reads on s8 are not from wikidata.org but any other clients. I measured it in December [06:10:39] !log roll OSPF link-protection to all routers - T167306 [06:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:45] T167306: ospf link-protection - https://phabricator.wikimedia.org/T167306 [06:12:24] (03CR) 10Ayounsi: [C: 03+2] Add OSPF link-protection to all P2P links [homer/public] - 10https://gerrit.wikimedia.org/r/698512 (https://phabricator.wikimedia.org/T167306) (owner: 10Ayounsi) [06:16:41] away [06:16:51] good morning, need coffee :D [06:33:00] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210615T0700) [07:12:08] (03PS3) 10Tobias Andersson: Remove idGeneratorRateLimiting from production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: 10Dat Nguyen) [07:17:38] (03CR) 10Jcrespo: [C: 03+2] "> Patch Set 4:" [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:17:45] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] repository: add README.md to the repository [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:25:30] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:33:40] (03CR) 10Ladsgroup: [C: 04-2] "Virtual +1. 
It can go in after deployment of wmf.11 landed in production in full power (~Monday 28 June)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: 10Dat Nguyen) [07:43:15] (03PS1) 10QChris: Add .gitreview [debs/cfssl] - 10https://gerrit.wikimedia.org/r/699904 [07:43:17] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/cfssl] - 10https://gerrit.wikimedia.org/r/699904 (owner: 10QChris) [07:53:44] (03PS2) 10Jelto: add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) [07:59:01] (03PS3) 10Jelto: add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) [08:02:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:02:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:06:22] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:30] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:48] (03PS3) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [08:15:34] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) I am going to create a separate task to put db2100 into service, will only reopen this if crashes return. Thank to everyone that helped here. [08:16:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2254.codfw.wmnet, mw2371.codfw.wmnet, mw2338.codfw.wmnet, mw2386.codfw.wmnet, mw2365.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2384.codfw.wmnet, mw2407.codfw.wmnet, mw2363.codfw.wmnet, mw2268.codfw.wmnet, mw2390.codfw.wmnet, mw2359.codfw.wmnet, mw2255.codfw.wmnet, mw2409.codfw.wmnet, mw2357.codfw.wmnet, mw2269.codfw [08:16:40] mw2355.codfw.wmnet, mw2270.codfw.wmnet, mw2385.codfw.wmnet, mw2274.codfw.wmnet, mw2277.codfw.wmnet, mw2383.codfw.wmnet, mw2336.codfw.wmnet, mw2311.codfw.wmnet, mw2389.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:18:23] hmm...what's going on here [08:19:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2371.codfw.wmnet, mw2331.codfw.wmnet, mw2393.codfw.wmnet, mw2365.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2384.codfw.wmnet, mw2272.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2255.codfw.wmnet, mw2392.codfw.wmnet, mw2353.codfw.wmnet, mw2316.codfw.wmnet, mw2275.codfw.wmnet, mw2257.codfw.wmnet, mw2387.codfw [08:19:14] mw2406.codfw.wmnet, mw2385.codfw.wmnet, mw2277.codfw.wmnet, mw2388.codfw.wmnet, mw2273.codfw.wmnet, mw2258.codfw.wmnet, mw2311.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:19:49] 10SRE, 10Wikimedia-Mailing-lists: mailman3 unsubscribe link not showing in daily article list e-mails - https://phabricator.wikimedia.org/T284548 (10Krd) 05Open→03Resolved a:03Krd Thx. 
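For context on the PyBal alerts in this log: the failures come from PyBal's ProxyFetch monitor timing out on http://www.wikidata.org/wiki/Special:BlankPage while the s8 schema change piles up queries (see the "Fetch failed ... 5.001 s" line quoted at 06:01). Below is a minimal Python sketch of that kind of check, not PyBal's actual code; the URL and the roughly 5-second budget come from the log, while the Host-header approach and everything else is illustrative.

    import time
    import urllib.request

    def proxyfetch_ok(backend, host="www.wikidata.org", timeout=5.0):
        """Fetch Special:BlankPage via one backend; False means 'mark it down'."""
        req = urllib.request.Request(
            f"http://{backend}/wiki/Special:BlankPage",
            headers={"Host": host},          # steer the request to the right wiki on that backend
        )
        start = time.monotonic()
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                resp.read()
        except OSError:
            return False                     # timeout or connection failure
        return (time.monotonic() - start) <= timeout

    # e.g. proxyfetch_ok("mw2391.codfw.wmnet") would return False while the
    # wikidata blank page takes longer than 5 s to render.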
[08:22:24] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:25:05] PyBal backend errors related to schema change on s8 - https://phabricator.wikimedia.org/T284981 [08:27:20] 10SRE, 10docker-pkg, 10serviceops, 10Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Jelto) > So we have two options for a rebuild: @Joe I implemented option 1 in https://gerrit.wikimedia.org/r/699752 Could you take a look and add your feedback?... [08:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2081', diff saved to https://phabricator.wikimedia.org/P16524 and previous config saved to /var/cache/conftool/dbconfig/20210615-082857-marostegui.json [08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:32] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:30:14] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:31:34] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 291 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:32:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2080 db2083 db2084 db2091', diff saved to https://phabricator.wikimedia.org/P16525 and previous config saved to /var/cache/conftool/dbconfig/20210615-083233-marostegui.json [08:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:26] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:44:39] (03CR) 10Gilles: [C: 03+1] Enable canary events for NavigationTiming ext streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699789 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [08:50:30] RECOVERY - snapshot of x1 in eqiad on alert1001 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2021-06-15 08:31:27 (245 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091', diff saved to https://phabricator.wikimedia.org/P16526 and previous config saved to /var/cache/conftool/dbconfig/20210615-085938-marostegui.json [08:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2082', diff saved to https://phabricator.wikimedia.org/P16527 and previous config saved to /var/cache/conftool/dbconfig/20210615-085953-marostegui.json [08:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] (03CR) 10JMeybohm: [C: 04-1] mwdebug: include nutcracker and mcrouter pools in values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [09:02:07] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2081', diff saved to https://phabricator.wikimedia.org/P16528 and previous config saved to /var/cache/conftool/dbconfig/20210615-090206-marostegui.json [09:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2084', diff saved to https://phabricator.wikimedia.org/P16529 and previous config saved to /var/cache/conftool/dbconfig/20210615-090243-marostegui.json [09:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2083', diff saved to https://phabricator.wikimedia.org/P16530 and previous config saved to /var/cache/conftool/dbconfig/20210615-090650-marostegui.json [09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] (03CR) 10Jbond: [C: 03+1] "lgtm, but better get a +1 from S&F as well just incase" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [09:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086:3318', diff saved to https://phabricator.wikimedia.org/P16531 and previous config saved to /var/cache/conftool/dbconfig/20210615-090802-marostegui.json [09:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [09:17:07] (03CR) 10JMeybohm: [C: 03+1] "Following discussions on IRC etc. I think this is a good way to go for istio right now. With all this README's around, this LGTM 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/697938 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:22:03] (03PS2) 10Effie Mouzeli: mwdebug: include nutcracker and mcrouter pools in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) [09:22:19] (03CR) 10Effie Mouzeli: mwdebug: include nutcracker and mcrouter pools in values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [09:23:46] PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:23:48] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:23:56] PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:06] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2086:3318, db2082', diff saved to https://phabricator.wikimedia.org/P16532 and previous config saved to /var/cache/conftool/dbconfig/20210615-092409-marostegui.json [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:21] this smells like a rack going down [09:24:32] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:32] PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:24:34] PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:43] or maybe the top of rack [09:24:51] uh [09:25:04] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2080', diff saved to https://phabricator.wikimedia.org/P16533 
and previous config saved to /var/cache/conftool/dbconfig/20210615-092511-marostegui.json [09:25:12] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:40] PROBLEM - Host ps1-604-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:25:48] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:51] 10SRE, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 3 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) All the hosts have been recovered. [09:26:38] * volans checking [09:26:50] it's not just one rack AFAICT [09:26:54] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:26:58] there was a maintence IIRC [09:27:07] but not today [09:27:33] ganeti5002 is rack 604, ganeti5001 is 603 [09:27:36] yep [09:27:51] and we got 2 racks in eqsin [09:27:54] RECOVERY - Host ganeti5001 is UP: PING WARNING - Packet loss = 90%, RTA = 237.77 ms [09:27:57] I can ssh [09:28:00] into them [09:28:07] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:28:22] (03PS1) 10Effie Mouzeli: hieradata: Use TLS codfw pool for memcached replication on canaries [puppet] - 10https://gerrit.wikimedia.org/r/699908 (https://phabricator.wikimedia.org/T284420) [09:28:24] wshould we depoll? [09:28:27] *depool [09:28:38] I tried to ssh to asw1-eqsin but the connection is broken now [09:28:52] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:54] PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:55] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:29:02] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:09] volans, I think so /cc: ema [09:29:20] lI'm on ganeti5001 and the connection is still up [09:29:21] I'm +1 to depool, and we can figure it out [09:29:21] network on eqsin? [09:29:22] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:23] yeah let's depool [09:29:27] * jbond here [09:29:37] +1 on depooling [09:29:42] what's up? [09:29:52] PROBLEM - Host netflow5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:53] PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:29:54] XioNoX: something is happening in eqsin [09:29:55] (03PS1) 10Majavah: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699909 [09:29:58] who takes care of depooling? 
[09:30:02] (03PS1) 10Ema: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 [09:30:04] XioNoX: looks like network issues with eqsin [09:30:10] Ema is depooling [09:30:14] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:14] please review https://gerrit.wikimedia.org/r/699910 [09:30:23] let's go for ema patch [09:30:29] to make a quick decision [09:30:30] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:31] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [09:30:36] +1 [09:30:38] ema: ship it [09:30:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:41] (03CR) 10Filippo Giunchedi: [C: 03+1] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:43] ship it yes [09:30:43] (03CR) 10Jbond: [C: 03+1] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:49] (03CR) 10Ema: [C: 03+2] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699910 (owner: 10Ema) [09:30:49] PROBLEM - Host ncredir-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:30:50] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:30:52] PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:00] routers are up [09:31:01] from the switch logs it seems that mgmt interfaces went down all in once https://librenms.wikimedia.org/device/163/logs [09:31:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_eqsin} site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:10] sigh [09:31:12] (03Abandoned) 10Majavah: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699909 (owner: 10Majavah) [09:31:22] majavah, nothing against yours :-), but better not going indecisive [09:31:23] ifAdminStatus: up -> NULL [09:31:23] running authdns-update [09:31:31] * volans acked all pages [09:31:47] I will check user impact, user reports, etc. [09:32:04] switch is fine too [09:32:21] who's taking IC? [09:32:24] uh [09:32:26] XioNoX: did you see the up -> Null transitions? [09:32:35] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:32:35] (sorr,y I just now realized that string of beeps was a pile of pages) [09:32:39] elukey: where? [09:32:48] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:32:54] authdns updated successfully except for dns5001/dns5002 for obvious reasons [09:33:00] XioNoX: https://librenms.wikimedia.org/device/163/logs [09:33:18] ema: let's try to update them manually, I can ssh there [09:33:24] volans: go ahead [09:33:26] so maybe it's just eqsin-infra [09:33:37] yeah looks like [09:33:40] the problem and not eqsin<-?rest-of-the-world [09:33:49] which makes things worse [09:34:00] ema: ack, I'll update dns500* [09:34:15] volans: are you updating them? those are really the dns servers we need to update [09:34:21] yes [09:34:27] but I doubt they can reach the git repo? 
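For readers unfamiliar with what the "Depool eqsin" DNS change accomplishes: GeoDNS stops handing out eqsin addresses, and clients fall through to the next site for their region. A toy Python illustration follows; the failover ordering and data structures are invented for the example and are not the real gdnsd/operations-dns configuration.

    # Toy model of depooling a site from GeoDNS routing; not the real mechanism.
    SITE_PREFERENCE = {
        "asia": ["eqsin", "ulsfo", "codfw", "eqiad"],   # illustrative ordering only
    }
    depooled = {"eqsin"}

    def pick_site(region):
        """Return the first still-pooled site for a region."""
        for site in SITE_PREFERENCE[region]:
            if site not in depooled:
                return site
        raise RuntimeError("no pooled site left for " + region)

    print(pick_site("asia"))    # -> "ulsfo": a nearby site absorbs the depooled traffic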
[09:34:29] elukey: that's just snmp pooling issues [09:34:45] ack [09:34:47] I agree it looks like eqsin-infra [09:34:52] not eqsin-world [09:34:54] pull from gerrit worked [09:34:58] great [09:35:01] I'm just running authdn-update [09:35:03] because it went via the internet [09:35:04] no manual hacks [09:35:11] and see how it goes [09:35:13] authdns-local-update? [09:35:45] interestingly it seems v4 only? I can v6-ping text-lb.eqsin.wikimedia.org from alert1001 just fine [09:35:45] nothing on the routers logs [09:35:56] volans: please let us know when the process is finished [09:36:00] created an IC document https://docs.google.com/document/d/1_rV0RU9wZ0Y1VQUJkOq5L2uDUv-7XgOCuJyR6o5f_BY/edit (backfilling now) [09:36:01] fyi, user report in -commons [09:36:02] finished [09:36:06] 5001/2 updated [09:36:09] all the others failed [09:36:11] the clush [09:36:14] but that was covered by ema [09:36:24] ok so in the next 5 minutes, we should see the traffic shift [09:36:30] 10SRE, 10Traffic, 10netops: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10jcrespo) [09:36:30] yup authdns2001.wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002].wikimedia.org went fine [09:36:42] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 47.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:36:44] I installed mtr-tiny on alert1001 [09:36:47] jbond, created initial task: https://phabricator.wikimedia.org/T284986 [09:37:01] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Majavah) [09:37:18] I see ulsfo is peaking [09:37:30] smokepings confirms the connectivity issues - https://smokeping.wikimedia.org/?displaymode=n;start=2021-06-15%2006:37;end=now;target=eqsin.Hosts.bast5002 [09:37:40] but we're still down ~ 1 million requests per minute [09:38:11] 4 twitter reports saying down, so cleary affecting users [09:38:16] v4 from eqiad to bast5003 doesn't go trhough [09:38:24] it is weird, netflow5001 is reported down, I can ssh and ping cp5001 for example (both v4 and v6) [09:38:31] XioNoX: if that helps I can ping bast1003 from dns5001 but I can't telnet 22 [09:38:44] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Peachey88) [09:39:33] request rates are slowly recovering [09:39:40] XioNoX: bast5002 right? 
(I don't find 5003) [09:39:47] yeah [09:40:11] both v4 and v6 the telnet above [09:40:12] fwiw [09:40:30] marostegui, I cannot change topic because I am not op, can you from up to "eqsin network issues" [09:41:01] jynus: I' changing it now [09:41:10] now v6 seems to go throuh but v4 not [09:41:16] going to try to kill the telia transport link [09:41:23] ack [09:41:28] it might be the one missbehaving and only letting some traffic through [09:41:49] RECOVERY - Host cr3-eqsin is UP: PING WARNING - Packet loss = 60%, RTA = 238.58 ms [09:41:49] RECOVERY - Host netflow5001 is UP: PING WARNING - Packet loss = 60%, RTA = 239.70 ms [09:41:50] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.84 ms [09:41:50] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 237.93 ms [09:41:52] RECOVERY - Host bast5002 is UP: PING OK - Packet loss = 0%, RTA = 238.75 ms [09:41:52] RECOVERY - Host ncredir5002 is UP: PING OK - Packet loss = 0%, RTA = 238.21 ms [09:41:53] RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.27 ms [09:41:54] RECOVERY - Host ganeti5003 is UP: PING OK - Packet loss = 0%, RTA = 237.94 ms [09:41:54] RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 237.92 ms [09:41:54] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.68 ms [09:41:54] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.43 ms [09:41:54] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.49 ms [09:41:54] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.44 ms [09:41:54] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.51 ms [09:41:55] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.49 ms [09:41:55] RECOVERY - Host ps1-604-eqsin is UP: PING OK - Packet loss = 0%, RTA = 238.91 ms [09:41:56] RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 238.20 ms [09:41:58] cr2-codfw reports a bfd session down to cr3-eqsin [09:41:58] RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 239.00 ms [09:41:59] well [09:42:00] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 238.17 ms [09:42:03] lol [09:42:04] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:42:04] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 241.05 ms [09:42:05] :D [09:42:06] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 238.45 ms [09:42:07] I guess we have our answer :) [09:42:09] xD [09:42:11] 🎉 [09:42:13] nice call :D [09:42:27] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 239.31 ms [09:42:31] RECOVERY - Host ncredir-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 238.74 ms [09:42:33] !log cr1-codfw# set interfaces xe-5/1/2 disable [09:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:39] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 237.85 ms [09:42:42] nice :) [09:42:46] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5007 is OK: OK - Certificate *.wikipedia.org will expire on Tue 16 Nov 2021 11:59:59 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:42:53] * jbond sorry computer crashed shortly after posting gdoc catching up [09:43:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch is correct (good job!), but I think we need to explicitly tell the system that we want to only run this timer once the baseimage" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [09:43:45] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [09:43:47] transport traffic is now going through the GRE tunnel so better to keep eqsin depooled for now [09:43:58] people report things are now ok in Asia, as far as users are concerned [09:44:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:44:07] XioNoX: ack [09:44:21] traffic is back to normal standards, I think the emergency is over [09:44:32] is eqsin still depooled? [09:44:38] yep [09:44:43] ack thx [09:44:54] let's open a case to Telia [09:45:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:14] yes, XioNoX thinks it's best given we're on the GRE tunnel for eqsin <-> prod connection [09:45:28] netbox is always last :D [09:46:36] traffic is back to pre-incident levels, at least in absolute number [09:46:46] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:11] joe: it's a 15m systemd timer, ofc it's last :D [09:47:29] I know, I just wanted to give you a hard time :D [09:49:50] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:50:05] anyone sees any high prio ongoing issues? other than follow up [09:50:11] user reports, etc? [09:50:25] don't think so jynus [09:50:33] all looks under control for now [09:50:38] sent the email to Telia NOC [09:50:43] volans: were you raising a case with Telia? [09:50:44] jbond, then that is the cue to "consider it closed" [09:50:48] answered :) [09:51:37] depending on Telia's answer I might turn the link back up with a lower preference, so no traffic goes through it but they can test it [09:51:52] so yeah, the actual user-facing outage lasted between 07:30 and 07:44 [09:51:56] as a follow up, can I get topic permissions here? [09:52:32] ack officially closing the incident I think the only follow up is to repool eqsin at some point [09:52:56] yep, we can track that on open ticket, plus doc needed, etc [09:53:05] yes creating ticket now [09:53:21] jbond, T284986 [09:53:23] T284986: Wikimedias eqsin datacenter has network connectivity issues (?)
- https://phabricator.wikimedia.org/T284986 [09:53:39] I just need to update the title [09:53:49] ack can repurpose that oen [09:53:54] we lost about 450k*14 requests, so about 6 million requests [09:54:19] (very rough guesstimate) [09:54:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:22] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jcrespo) [09:55:17] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jbond) Once telia issues have been resolved we need to repool ESQIN. @ayounsi can you confirm when we are good to repool [09:56:13] are the cp to cp connections over v4 only or also v6? [09:56:53] XioNoX: there are no cp to cp connections between DCs anymore [09:57:14] ema: I mean the eqsin to eqiad connections [09:57:22] whatever impacted the users :) [09:57:55] aka, if v6 connectivity between the DC was still working, would it uses it? [09:58:02] hmm that would depend on mw DNS records [09:58:06] ah, traffic to the origin is v4 only [09:58:12] here you go :) [10:00:02] s/mw/appserver|api/ :) [10:00:39] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:01:15] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [10:02:21] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [10:02:56] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [10:04:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:11:32] (03PS1) 10Jbond: P::mediawiki::mcrouter_wancache: add shad data for cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/699912 [10:12:35] (03CR) 10Jbond: [C: 03+2] P::mediawiki::mcrouter_wancache: add shad data for cloud proxies [puppet] - 10https://gerrit.wikimedia.org/r/699912 (owner: 10Jbond) [10:16:24] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (10hashar) For Beta that should be fixed via https://gerrit.wikimedia.org/r/c/operations/puppet/+/699912/ :) [10:16:47] jbond: nice, that should fix puppet on the deployment-deployXXXX instances 6-) [10:18:21] jbond: thank you! 
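The ad-hoc connectivity checks used during the incident above ("I can ping ... but I can't telnet 22", IPv4 and IPv6 behaving differently) can be scripted as a small probe. A sketch follows, using a TCP connection to port 22 as the test; the target hostname is purely illustrative.

    import socket

    def tcp_reachable(host, port=22, family=socket.AF_INET, timeout=5.0):
        """True if a TCP connection to host:port succeeds over the given address family."""
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError:
            return False                      # e.g. no AAAA record when asking for IPv6
        for af, stype, proto, _canon, addr in infos:
            try:
                with socket.socket(af, stype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(addr)
                    return True
            except OSError:
                continue                      # try the next resolved address
        return False

    for fam, name in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        ok = tcp_reachable("bast5002.wikimedia.org", family=fam)
        print(name, "reachable" if ok else "NOT reachable")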
[10:18:31] hashar: indeed all fixed [10:18:36] \o/ [10:18:59] guess I will get my lunch break now [10:20:04] :) [10:22:58] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10diego) [10:28:02] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:38:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Fix prometheus monitoring for Toolforge Ingress [puppet] - 10https://gerrit.wikimedia.org/r/699484 (https://phabricator.wikimedia.org/T284353) (owner: 10Majavah) [10:44:33] alright, telia is saying the outage is resolved [10:45:46] !log re-enable cr1-codfw:xe-5/1/2 [10:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:12] my mtr still works [10:46:36] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:40] PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:58] XioNoX: ^^^ [10:47:08] might be transient of the re-enable? [10:47:10] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:34] PROBLEM - Host ganeti5001 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:34] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:47:52] RECOVERY - Host ganeti5002 is UP: PING WARNING - Packet loss = 80%, RTA = 239.29 ms [10:47:54] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 237.92 ms [10:47:57] I re-disabled it [10:47:58] RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 238.16 ms [10:48:06] RECOVERY - Host ganeti5001 is UP: PING OK - Packet loss = 0%, RTA = 237.96 ms [10:53:06] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:02:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [11:05:26] 10SRE, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi) [11:05:35] 10SRE, 10netops, 10Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10ayounsi) 05Open→03Resolved Closed! After 4 years and 1 week. [11:06:52] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10diego) [11:06:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10diego) [11:07:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [11:12:33] (03PS1) 10H.krishna123: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) [11:14:40] (03CR) 10H.krishna123: "I've just committed the skeleton code for the API backend for exposing data from the databases. 
There is no database functionality yet, bu" [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [11:20:17] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10Aklapper) [11:23:38] (03CR) 10Jcrespo: "I don't see any big issue with this, although it is a bit early regarding functionality. Will wait for your explanation live tomorrow." [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [11:41:08] (03CR) 10Zfilipin: "Main branch is failing on my machine. I'll update the configuration and rebase this patch." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [11:55:22] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:39] (03CR) 10H.krishna123: "Okay, sounds good, will go through it tomorrow." [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [12:06:18] c-9Ginu. [12:08:06] topranks: https://bash.toolforge.org/quip/AU7VV1aJ6snAnmqnK_0n [12:08:24] (yes tooting my own horn) [12:09:23] smart :) [12:09:39] unlike yours truly. doh! [12:09:58] hehe we've all been there [12:17:15] (03PS1) 10Marostegui: db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699925 [12:17:54] (03CR) 10Marostegui: [C: 03+2] db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/699925 (owner: 10Marostegui) [12:29:32] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:53:37] 10SRE, 10Pybal, 10Traffic: PyBal healthchecks should specify User-Agent instead of using "Twisted PageGetter" - https://phabricator.wikimedia.org/T246431 (10ema) [12:55:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:57:29] (03PS2) 10Aklapper: api_db: Add working skeleton code for api_db, add dockerfile [software/bernard] - 10https://gerrit.wikimedia.org/r/699915 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [12:58:42] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [12:58:45] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Aklapper) [12:58:53] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Aklapper) [12:58:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [13:02:36] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) Approved. 
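The software/bernard patch discussed above is described only as "skeleton code for the API backend ... no database functionality yet"; the actual code is not shown in this log. As a purely hypothetical illustration of such a skeleton (framework, routes and names are all assumed, using Flask):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/healthz")
    def healthz():
        # Liveness endpoint; useful before any database wiring exists.
        return jsonify(status="ok")

    @app.route("/api/v1/backups")
    def list_backups():
        # Placeholder: database-backed results would be returned here later.
        return jsonify(items=[])

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)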
[13:02:44] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [13:03:37] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) 05Open→03Resolved Marc has been added to VictorOps and the SRE team; resolving this task. Thanks, all. [13:10:17] !log disable puppet on canaries to deploy 699908 [13:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:46] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: Use TLS codfw pool for memcached replication on canaries [puppet] - 10https://gerrit.wikimedia.org/r/699908 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [13:15:27] !log enable puppet on canaries [13:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/699933 [13:21:59] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/699933 (owner: 10Marostegui) [13:23:20] !log Upgrade clouddb1018 [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:22] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/699855 [13:25:24] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1018" [puppet] - 10https://gerrit.wikimedia.org/r/699855 (owner: 10Marostegui) [13:28:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) I Sent an email to Dell asking them to dispatch one of their Tech to do the troubleshooting on this server, since it is taking a while... [13:40:42] (03CR) 10David Caro: [C: 04-1] "Look ok, just got some questions" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:48:05] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) [13:50:13] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) Hi @ChristineDeKock: Can you please read through and sign the L3 (Acknowledgement of Wikimedia Server Access Responsibilities) document? Thank you! 
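The clouddb1018 upgrade above, like the earlier db20xx schema-change work, follows the usual depool, maintain, repool cycle. A sketch of that cycle as a script; the dbctl sub-commands are written from memory and should be treated as an assumption rather than a reference, and "run-schema-change" is a stand-in, not a real command.

    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def with_depooled(host, maintenance):
        """Depool a database host, run maintenance on it, then repool it."""
        run("dbctl", "instance", host, "depool")                  # assumed CLI
        run("dbctl", "config", "commit", "-m", f"Depool {host}")
        try:
            maintenance(host)                                      # e.g. ALTER TABLE / package upgrade
        finally:
            run("dbctl", "instance", host, "repool")
            run("dbctl", "config", "commit", "-m", f"Repool {host}")

    # with_depooled("db2081", lambda h: run("ssh", h, "sudo", "run-schema-change"))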
[13:53:29] (03PS5) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:02:31] (03CR) 10JMeybohm: [C: 04-1] "I feel like it would make more sense to incorporate this into the docker_registry_ha:web class as that one actually requires nginx and co" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:09:11] 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10Majavah) [14:10:26] 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10RoySmith) [14:10:34] (03CR) 10Ahmon Dancy: [C: 03+1] disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [14:11:46] (03CR) 10Ahmon Dancy: [C: 03+1] CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [14:15:08] (03CR) 10David Caro: [C: 03+1] O:base::resolving: make nameservers mandatory (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:16:54] (03CR) 10David Caro: "I'll wait until the tests are fixed 😊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:17:08] (03CR) 10JMeybohm: [C: 03+1] "I've not tried it (and it is helmfile, so there is a good chance the asterisk does not work 😊) but it looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:17:52] (03CR) 10Jbond: "thanks for the review see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:18:12] (03PS7) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:18:20] (03PS9) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:19:48] (03CR) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:20:06] (03PS1) 10Ssingh: admin: add christinedk to analytics-privatedata-users (with kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/699938 (https://phabricator.wikimedia.org/T284987) [14:25:49] !log re-enable cr1-codfw:xe-5/1/2 [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:17] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:42] (03CR) 10Sfigor: [C: 03+1] disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [14:30:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) Hi @ssingh. Thank you for your trouble. I have signed the L3 form. 
[14:30:17] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:31:01] (03PS10) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:31:36] (03PS11) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:31:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) [14:31:56] (03PS6) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:32:25] (03PS8) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:32:38] (03PS12) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:34:00] (03CR) 10Sfigor: [C: 03+1] gitlab_backup_keep_time to 3 days [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699464 (https://phabricator.wikimedia.org/T274463) (owner: 10Brennen Bearnes) [14:35:57] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:36:35] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:37:19] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:37:55] (03PS9) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:38:10] (03CR) 10Ssingh: [C: 03+2] admin: add christinedk to analytics-privatedata-users (with kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/699938 (https://phabricator.wikimedia.org/T284987) (owner: 10Ssingh) [14:39:27] (03PS10) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:41:05] (03PS7) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:41:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ssingh) 05Open→03Resolved a:03ssingh >>! In T284987#7158017, @ChristineDeKock wrote: > Hi @ssingh. Thank you for your trouble. I have signed the L3 fo... 
[14:41:28] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:41:51] (03PS11) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:43:03] (03CR) 10David Caro: [C: 03+1] O:base::resolving: drop the domain keyword and use the domain fact (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:43:50] (03CR) 10Sfigor: [C: 03+1] CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) (owner: 10Brennen Bearnes) [14:48:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @robh mw1414-1422 were missing the mgmt cables. Fixed and they're good to go. it appears John racked the others, I will add that to my list. [14:48:23] (03PS8) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:48:25] (03PS12) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:48:30] (03PS13) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:48:59] (03PS4) 10Ema: varnish: add timing data to varnishmtail [puppet] - 10https://gerrit.wikimedia.org/r/699223 (https://phabricator.wikimedia.org/T284576) [14:49:01] (03PS1) 10Ema: varnish: add prometheus histogram varnish_processing_seconds [puppet] - 10https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) [14:50:36] (03CR) 10Effie Mouzeli: "> Patch Set 2: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/699432 (https://phabricator.wikimedia.org/T284420) (owner: 10Effie Mouzeli) [14:51:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:21] (03PS1) 10Razzi: yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) [14:51:31] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:52:49] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:52:58] (03CR) 10Elukey: [C: 03+1] yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [14:53:17] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:53:38] (03CR) 10jerkins-bot: [V: 04-1] varnish: add prometheus histogram varnish_processing_seconds [puppet] - 10https://gerrit.wikimedia.org/r/699941 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [14:54:45] 10SRE, 10Traffic: 503 errors from varnish - 
https://phabricator.wikimedia.org/T284996 (10ssingh) p:05Triage→03High [14:55:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/698800 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:57:12] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:59:41] (03PS13) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [15:00:16] (03CR) 10Razzi: [C: 03+2] yarn: temporarily stop queues [puppet] - 10https://gerrit.wikimedia.org/r/699943 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [15:04:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10Cmjohnson) [15:04:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH These are ready for you whenever you get a chance [15:07:40] (03CR) 10Jbond: O:base::resolving: make nameservers mandatory (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:07:52] (03PS14) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [15:08:34] (03PS9) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [15:08:50] (03PS14) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [15:10:05] (03PS15) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [15:10:08] * jbond rebaseing hell :S [15:13:17] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:13:49] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:14:09] (03CR) 10Jbond: "CI issues seem to be unrelated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:14:25] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:15:35] (03CR) 10David Caro: [C: 03+1] O:base::resolving: drop the domain keyword and use the domain fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [15:18:00] dcaro: thanks for the reviews just an fyi CI seems to be playing up at the moment so ill leave that change set for today [15:20:37] 
jbond: ack, np, thanks for the patches :) [15:22:27] np :) [15:24:55] (03PS2) 10MSantos: maps: fix SQL modules paths in import script [puppet] - 10https://gerrit.wikimedia.org/r/695240 [15:55:36] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-06-15 14:27:42 (577 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:56:20] my hint to go^ [16:11:02] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 60 days, 0:00:00 on an-master1002.eqiad.wmnet with reason: Update operating system to bullseye [16:11:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 60 days, 0:00:00 on an-master1002.eqiad.wmnet with reason: Update operating system to bullseye [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:20] (03CR) 10Majavah: "Thanks!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:11:36] (03CR) 10Majavah: [C: 03+2] python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:12:56] (03Merged) 10jenkins-bot: python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [16:14:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:30:10] PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:12] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:16] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:24] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:26] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:27] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:28] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:32] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:34] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:36] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:36] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:40] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:42] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:43] we are in maintenance mode, sorry for the spam [16:30:44] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:44] PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:50] PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:54] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:30:58] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:04] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:06] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:07] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:08] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:12] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:13] PROBLEM - Hadoop NodeManager on 
an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:14] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:16] PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:17] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:18] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:19] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:20] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:21] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:22] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:26] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:27] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:28] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:34] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:36] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:38] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:46] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:48] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:50] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:54] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:56] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on 
an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:04] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:07] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:10] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:18] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:27] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:32] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:48] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:53] Hi all, we are aware of this issue and in readonly mode on hdfs, so this will not cause data loss [16:32:54] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:02] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:33:52] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:03] some alert aggregation MAY be needed :D [16:34:07] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:14] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:24] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:22] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:32] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:36] RECOVERY - Hadoop NodeManager on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:38] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:40] RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:44] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:36:54] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:26] ok we are good :) [16:37:26] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:37:30] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:02] RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:16] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:26] RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:38:57] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:17] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:40:14] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:26] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:57] RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:44] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:54] RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:02] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:10] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:50] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:43:54] RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, 
args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:04] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:44:44] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:10] Quick summary of what's going on with hadoop: we're in maintenance mode for an os upgrade, and have 1 active and 1 standby namenode. I meant to stop hadoop on the standby, but accidentally did so on the active; when I realized my mistake I restarted hadoop on the active and stopped hadoop on the standby, but the original standby had become the active and the original active was still recovering. Sorry for the spam! [16:45:34] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:42] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:07] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:46:48] RECOVERY - Hadoop NodeManager on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:36] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:00] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:14] RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:32] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:34] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:48] RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:02] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:18] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:28] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:50:46] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:50:47] RECOVERY - Hadoop NodeManager on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:27] RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:34] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:50] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:51:54] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:06] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:20] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:26] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:40] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:44] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:53:56] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:14] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:42] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:54:54] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:34] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:55:36] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:20] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:57:18] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:09:44] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-master1002.eqiad.wmnet with reason: REIMAGE [17:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-master1002.eqiad.wmnet with reason: REIMAGE [17:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:26] (03CR) 10Dzahn: [C: 03+2] "approved by langcom" [dns] - 10https://gerrit.wikimedia.org/r/698521 (https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [17:14:46] (03PS3) 10Dzahn: Add dag to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/698521 
(https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [17:17:04] !log new Wikimedia language "dag" added - Dagbani (or Dagbane), also known as Dagbanli and Dagbanle, is a Gur language spoken in Ghana. [17:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:26] (03PS2) 10Dzahn: Add shi to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) (owner: 10Gerrit maintenance bot) [17:20:13] (03CR) 10Dzahn: [C: 03+2] "approved by langcom - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Tachelhit" [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) (owner: 10Gerrit maintenance bot) [17:21:10] !log new Wikimedia language "shi" added - Shilha /ˈʃɪlhə/ is a Berber language native to Shilha people. The endonym is Taclḥit /taʃlʜijt/, and in recent English publications the language is often rendered Tashelhiyt or Tashelhit. [17:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] (03PS1) 10Razzi: yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) [17:35:13] (03CR) 10jerkins-bot: [V: 04-1] yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [17:35:57] (03PS2) 10Razzi: yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) [17:39:15] (03CR) 10Razzi: [C: 03+2] yarn: re-enable queues [puppet] - 10https://gerrit.wikimedia.org/r/699955 (https://phabricator.wikimedia.org/T278423) (owner: 10Razzi) [17:48:34] (03PS1) 10BryanDavis: toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 [17:50:04] (03CR) 10jerkins-bot: [V: 04-1] toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [17:54:54] !log testing upcoming Scap release on beta [17:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:04] (03PS2) 10BryanDavis: toolforge: setup some reasonable php.ini defaults [puppet] - 10https://gerrit.wikimedia.org/r/699956 [18:02:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:03:09] (03PS1) 10Cathal Mooney: Repool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699957 (https://phabricator.wikimedia.org/T284986) [18:06:08] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) I made a typo in the commit msg so this didn't link: https://gerrit.wikimedia.org/r/c/operations/dns/+/699957 [18:08:42] (03Abandoned) 10Cathal Mooney: Repool eqsin [dns] - 10https://gerrit.wikimedia.org/r/699957 (https://phabricator.wikimedia.org/T284986) (owner: 10Cathal Mooney) [18:10:16] (03PS1) 10Cathal Mooney: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 [18:14:15] (03CR) 10Volans: [C: 03+1] "If the link has been stable for few hours LGTM, but make sure there is someone around for the next hour or so just in case." 
[dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:17:37] (03PS2) 10Cathal Mooney: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 [18:24:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:17] I was going to merge above revert and update DNS to re-pool eqsin. [18:25:20] Can anyone advise what dashboards might be good to track during/after? [18:25:41] I know how to check the authdns has changed and router graphs etc. for traffic patterns but I'm sure there are other things also.... [18:32:17] topranks: as a start I would keep an eye on https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=now-12h&to=now [18:32:34] you should see eqsin picking up traffic from esams, and preferably without an error spike :) [18:32:35] cool.... yep have that one open :) [18:32:55] yeah I can see the errors there earlier today. [18:33:55] between that, and having an eye on this channel for alerts, you should be all set [18:34:04] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:34:48] ok, thanks rzl I will proceed cautiously now. [18:34:54] I'm happy to ride along for a bit too, if you want some company just in case [18:35:11] (03CR) 10RLazarus: [C: 03+1] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:35:44] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/699859 (owner: 10Cathal Mooney) [18:38:21] rzl: thanks, but did I not see that you weren't feeling so good? [18:38:33] that was yesterday! appreciate the thought though <3 [18:38:49] ah ok. well hope you're doing better :) [18:39:24] heh, better enough :) [18:40:00] btw forgive me if you already know this -- merging in the dns repo is similar to merging in the puppet repo, jenkins won't auto-submit when you +2 [18:40:14] cool thanks. [18:40:22] yes this is the first non-homer change I'm doing. [18:40:27] instead, hit the submit button in gerrit when you're ready, and then `sudo authdns-update` from any authdns host [18:40:32] I submitted there, looks ok. [18:40:48] I will do that now, and watch watch happens. [18:40:52] 👍 [18:40:59] scary ! [18:41:14] what's the worst that could happen :D [18:41:30] wikipedia breaks for most of Asia? we'll de-pool it again :) [18:41:40] it'll be fine :P [18:42:38] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) Ok @volans was kind enough to explain how I could just revert the original change instead: https://gerrit.wikimedia.org/r/c/... [18:45:16] presume I should run with "sudo" ? [18:45:54] yep [18:45:56] I do "sudo -s ..." [18:46:20] "sudo -s" and drop to root shell and run it? [18:46:57] can you do "sudo -s " ? not familiar with doing that. [18:50:02] I did it your way XioNoX, will need to look into exactly what the "-s" does in that scenario. 
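To illustrate the "sudo -s" question above: a minimal sketch of the two invocation styles, assuming a stock sudo configuration. Only authdns-update is taken from the log; nothing else here describes Wikimedia's actual setup.

    # run one command with root privileges (the form the SAL entries above record)
    sudo authdns-update

    # "sudo -s" starts a root shell (your login shell, run as root); commands are then typed inside it
    sudo -s
    authdns-update
    exit

    # a command can also be appended after -s: it is handed to that root shell via -c,
    # so in practice this behaves much like the plain "sudo authdns-update" form
    sudo -s authdns-update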
[18:50:20] dns in eqsin returning local IP again [18:51:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:53:03] I guess resolver caches need to time out before it really picks up though. [18:53:11] yep [18:55:46] seeing some small uptick in graphs now. [18:56:04] I've a test box in that region it's working from there too [18:56:13] (03PS4) 10Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) [18:57:27] yeah, looking good [18:58:17] (03CR) 10BryanDavis: "PCC diff: https://puppet-compiler.wmflabs.org/compiler1002/29891/tools-sgeexec-0906.tools.eqiad.wmflabs/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [19:00:00] topranks: maybe obvious, but don't forget we depooled eqsin pretty near its daily peak, and now we're repooling it at just about trough -- so don't be surprised when it comes back with way less traffic [19:00:12] yep good point. [19:00:18] (03CR) 10Bstorm: [C: 03+2] "It'll still be possible to explode grid nodes, but this is a whole lot less rope to do it with." [puppet] - 10https://gerrit.wikimedia.org/r/699956 (owner: 10BryanDavis) [19:00:25] I set the graphs to 24h there to get a sense of where we'd be at this hour. [19:00:41] 👍 [19:01:52] that record is on a five-minute TTL, right? so we should be just about there, modulo misbehaved caches [19:02:11] so modulo all the sites that keep crap around for 24 hours "just because" [19:03:16] 10-min on dyna.wikimedia.org [19:03:34] we're close to where we were this time yesterday on the varnish graphs. [19:04:03] aspergos: yes of course :) [19:04:03] oops, that's what I get for not checking [19:08:00] Network traffic graphs have caught up - levels also similar to yesterday. [19:09:12] (03PS1) 10BryanDavis: toolhub: fix php.ini path [puppet] - 10https://gerrit.wikimedia.org/r/699966 [19:10:19] cool -- probably best to hang around for a while and keep an eye out for alerts as it soaks in, but looks like we're in good shape [19:10:32] don't forget to update the timeline in the incident doc, if you don't mind [19:10:45] and nice job :) [19:11:02] (03CR) 10Bstorm: [C: 03+2] toolhub: fix php.ini path [puppet] - 10https://gerrit.wikimedia.org/r/699966 (owner: 10BryanDavis) [19:12:50] 10SRE, 10Traffic, 10netops, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) CR merged and DNS updated. All looks good, dns servers are returning the eqsin IPs again and traffic is back to normal level... [19:13:52] rzl: that last point I'd forgot. [19:13:55] good call thanks. 
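On the TTL discussion above: one way to confirm how long resolvers may cache the record is to read the TTL column of a dig answer. This is only a sketch; dyna.wikimedia.org is the name quoted in the log, ns0.wikimedia.org is assumed here as an example authoritative server, and the numbers returned will vary by resolver and cache age.

    # ask an authoritative server: the second column is the configured TTL in seconds
    dig +noall +answer dyna.wikimedia.org @ns0.wikimedia.org

    # ask the local recursive resolver: the TTL counts down as the cached answer ages,
    # which is why repooled traffic ramps up over roughly one TTL (plus misbehaved caches)
    dig +noall +answer dyna.wikimedia.org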
[19:14:18] cheers for all the help :)
[19:15:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:02] (PS1) Ottomata: airflow::instance - allow access to API by default [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973)
[19:19:32] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29892/console" [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[19:20:43] (CR) Ottomata: [V: +1 C: +2] airflow::instance - allow access to API by default [puppet] - https://gerrit.wikimedia.org/r/699968 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[19:55:33] (CR) Umherirrender: [C: -1] Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: Umherirrender)
[20:01:07] (PS5) Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199)
[20:01:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:44] (CR) Umherirrender: Add SpecialFewestrevisions to wgDisableQueryPageUpdate for wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/697151 (https://phabricator.wikimedia.org/T238199) (owner: Umherirrender)
[20:26:38] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[20:32:42] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson)
[20:33:18] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (Cmjohnson) a:Cmjohnson→RobH @robh the onsite work for this server is completed
[20:34:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:35:40] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:38:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:04:49] (PS1) Bstorm: nfs prometheus: change to strings for dir sizes [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964)
[21:06:35] (CR) Bstorm: "I've tested this live on the server, effectively. So its probably ready to go. I have another patch that I may try after this that turns t" [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964) (owner: Bstorm)
[21:29:04] (CR) Bstorm: [C: +2] "I'm going to go ahead and merge this to get the data on Grafana" [puppet] - https://gerrit.wikimedia.org/r/699973 (https://phabricator.wikimedia.org/T284964) (owner: Bstorm)
[21:29:58] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (Legoktm) @srodlund one more thing, in the 3rd paragraph, can we switch "Why we didn’t… ?" -> "Why didn’t we… ?" (spotted by @Krinkle)
[21:30:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:01:46] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:01:50] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:27:36] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:38:41] (PS3) Platonides: eswiki AbuseFilter config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797) (owner: Zabe)
[22:40:28] (CR) Platonides: [C: +1] eswiki AbuseFilter config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797) (owner: Zabe)
[22:58:10] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:32] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:58:52] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:02] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:59:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:48] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:15:32] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:15:40] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:16:20] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:16:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:16:36] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:17:02] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:21:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[23:38:14] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:53:28] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37