[01:49:55] RECOVERY - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:09:33] goood morning [06:15:01] superset 0.37.2 was finally released, including our patches (hopefully) [06:15:15] I am re-building now, and then I'll deploy on an-tool1005 for testing [06:15:22] hopefully we'll migrate this week :D [06:32:54] "How Netflix Manages Version Upgrades of Cassandra at Scale" [06:33:04] "We at Netflix have about 70% of our fleet on Apache Cassandra 2.1, while the remaining 30% is on 3.0" [06:33:07] !!! [06:34:21] I am definitely interested in this one [06:35:30] and there is also "Cassandra Upgrade in production : Strategies and Best Practices" [06:35:52] (03PS4) 10Elukey: Update to Superset 0.37.2 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/627738 (https://phabricator.wikimedia.org/T262162) [07:10:40] * elukey bbiab [07:43:43] 10Analytics-Clusters: Upgrade to Superset 0.37.x - https://phabricator.wikimedia.org/T262162 (10elukey) New version is out: 0.37.2 The new code is currently deployed on an-tool1005 (staging instance) for further testing. As soon as the Analytics team is done we'll also upgrade the production instance. [07:50:09] 10Analytics: Remove support for the (deprecated) Druid datasources (in favor of Druid Tables) on Superset - https://phabricator.wikimedia.org/T263972 (10elukey) [07:58:57] !log starting the process to decom the old hadoop test cluster [07:58:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:08:13] 10Analytics, 10Analytics-Wikistats: Wikistats Bug - https://phabricator.wikimedia.org/T263973 (10Kipala) [08:40:17] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) [08:42:18] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1028-1029].eqiad.wmnet... [08:43:37] Morning! [08:43:46] About to reinstall 1007 (backup was successful) [08:43:52] ack! [08:43:56] good morning :) [08:44:48] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1007.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-... [08:53:40] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1030-1031,1033-1039].e... [08:56:25] * elukey plays sad song for the hadoop test cluster's funeral [09:00:52] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1040-1041].eqiad.wmnet... 
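On the recovery notification above: the alert names a systemd unit run from a timer, so the usual first checks are the unit status, the timer schedule, and the recent journal. A minimal sketch, run on an-launcher1002; the `.timer`/`.service` suffixes are assumptions based on typical naming, only the base unit name appears in the alert:

```bash
# Status of the timer and of the service it triggers.
systemctl status monitor_refine_event_failure_flags.timer
systemctl status monitor_refine_event_failure_flags.service

# When it last fired and when it will fire next.
systemctl list-timers --all | grep monitor_refine_event_failure_flags

# Output from the most recent runs.
journalctl -u monitor_refine_event_failure_flags.service --since "12 hours ago"
```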
[09:17:24] Puppet still running, I think it had failed by this point last time, so fingers crossed. [09:17:56] 10Analytics-Clusters: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139 (10elukey) [09:18:00] 10Analytics: Enable Security (stronger authentication and data encryption) for the Analytics Hadoop cluster and its dependent services - https://phabricator.wikimedia.org/T211836 (10elukey) [09:18:05] 10Analytics-Clusters, 10Operations, 10decommission-hardware, 10ops-eqiad: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) 05Stalled→03Open a:05elukey→03Cmjohnson [09:24:38] klausman: ack for puppet, I think that some little errors will still show up but it shouldn't make the whole thing fail [09:24:52] Yeah, e.g. it tries to clone from gerrit, but hangs :( [09:25:12] From lsof: [09:25:14] git-remot 64064 root 3u IPv6 168839 0t0 TCP stat1007.eqiad.wmnet:34306->gerrit.wikimedia.org:https (SYN_SENT) [09:25:48] ah it doesn't get the proxy config before doing it then :( [09:26:43] Puppet has since moved on, but the end result will likely still be a fail [09:27:10] The spark bit has worked fine, though [09:29:26] the important bit is that it deploys ssh keys etc.. [09:29:39] then it reboots, re-run puppet and we can check what is still failing after it [09:29:45] Yep [09:30:36] Kormat and I spent quite some time on similar problems (Adding a config bit, but then you have to reconnect to activate it) when doing Ansible stuff for Anapaya [09:33:29] :) [09:33:33] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) @razzi I had to start the decom of the hadoop test cluster sooner, so all the testing env is now gone sorry. I think that we can proceed anyw... [09:33:41] taking a little break [09:39:59] 10Analytics-Radar, 10Domains, 10Operations, 10Traffic, and 2 others: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (10ArielGlenn) p:05Triage→03Medium [09:44:03] Hmm, it looks like /srv/analytics-wmde on wmde now has a missing group (gid 7081), which makes the wmde parts of puppet fail (it can't git fetch). What *should* the group there be? [09:44:39] Looking at the puppet role, analytics-wmde, I suspect [09:45:28] good point, not sure [09:45:45] it might be something very old [09:46:49] so is the dir managed by puppet? I guess not, in case I'd ping the wmde team [09:47:07] ah there is also another info [09:47:26] stat1007 is a bit "special" since there is a profile that runs some extra timers only on 1007 [09:47:34] Manually corrected the ownership, rerunning puppet [09:47:40] profile::statistics::explorer::misc_jobs [09:48:08] Nope, still failed [09:48:17] Notice: /Stage[main]/Statistics::Wmde::Graphite/Git::Clone[wmde/scripts]/Exec[git_pull_wmde/scripts]/returns: error: cannot open .git/FETCH_HEAD: Permission denied [09:49:05] so the class is class statistics::wmde [09:49:05] Hmmm [09:49:30] and yeah analytics-wmde system user/group are not fixed, created in there [09:49:34] so the gid etc.. changed [09:50:04] Some stuff under /srv/analytics-wmde is also owned by analytics-product:analytics-wmde. Is that correct? [09:50:23] no idea [09:50:38] it is sort-of self managed [09:51:57] Actually, I think this may be a bug in... transfer.py [09:52:55] ??
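On the hanging gerrit clone above: a rough sketch of how that can be diagnosed and worked around by hand on a stat host. The webproxy host/port and the clone URL below are assumptions for illustration, not something stated in the log:

```bash
# Confirm the clone is stuck in SYN_SENT towards gerrit (same symptom as the lsof output above).
sudo lsof -p "$(pgrep -f git-remote-https | head -1)" -a -i TCP

# Internal analytics hosts usually reach external HTTPS only through the HTTP proxy;
# webproxy.eqiad.wmnet:8080 is an assumption here.
export https_proxy=http://webproxy.eqiad.wmnet:8080

# Or set it only for git, so later clones pick it up without touching the environment.
git config --global http.proxy http://webproxy.eqiad.wmnet:8080

# Re-try the clone to verify the proxy path works (example repo/path).
git clone https://gerrit.wikimedia.org/r/analytics/refinery /tmp/refinery-test
```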
[09:52:59] Scratch that, we haven't done a restore [09:53:26] But it made me realize that transfer.py *does* have a potential bug regarding uids/gids vs. user/group names. [09:53:43] Anyway, puppet completed now after I fixed the .git ownership [09:53:52] ah! [09:53:56] all right nice [09:54:05] I'll do a reboot to be on the safe side. [09:54:17] +1 [09:54:42] Once the machine is back, I'll update the SSH fingerprints in the wiki and send the all-clear mail. [09:54:49] 10Analytics, 10Operations, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10ArielGlenn) p:05Triage→03Medium [10:02:08] !log force /srv/jupyterhub/deploy/create_virtual_env.sh on stat1007 after the reimage [10:02:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:02:40] wow all stat100x on buster! [10:02:45] nice job :) [10:04:36] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1007.eqiad.wmnet'] ` [10:08:19] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) I have cleaned up the puppet private repository from all certificates/configs not used, all good. The... [10:09:39] 10Analytics-Radar, 10Domains, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (10ArielGlenn) p:05Triage→03Medium [10:14:57] 10Analytics, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10klausman) a:03klausman [10:23:20] 10Analytics-Radar, 10Operations, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) Currently two outstanding UI issues: https://github.com/cloudera/hue/issues/1273 https://github.com/cloudera/hue/issues/1272 In theory those are not blocking the migration of... [10:32:56] klausman: all good afaics, if nothing stands up I am going to take a lunch break soon (usual couple of hours) [10:36:16] I get it as yes :) [10:36:18] * elukey lunch [10:39:42] yes, sorry, was busy making my own lunch :D [11:06:50] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10mforns) Hey @elukey, On friday @razzi and I encountered a puppet compiler error when trying to test your puppet change for the test cluster. Razzi cr... [11:07:21] hellooo teammm [11:08:10] (03PS4) 10Mforns: Improve path discovery in drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) [11:08:59] (03CR) 10Mforns: "Patch set 4 is just to improve some comments." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [11:13:32] BTW, in profile::analytics::cluster::packages::statistics there's a few packages which are only installed <= stretch along with a comment to review for Buster, now that all stat* hosts are on Buster sounds like a good time :-) [11:14:24] and also some conditional code which can be axed, like the inclusion of eventlogging::dependencies [11:34:00] hi team [12:14:07] 10Analytics, 10Analytics-Data-Quality: page_id is null where it shouldn't be in mediawiki history - https://phabricator.wikimedia.org/T259823 (10Meghajain171192) Hi @mforns @JAllemandou @Milimetric ! If this seems like a good first issue for a newbie to understand Wiki Datasets , could I look into this ? Will... [12:15:38] morning team! [12:43:34] moritzm: ack! [12:43:39] fdans: morning! [12:43:51] o/ [12:45:45] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) Yep not related but +1 on waiting, thanks! The admin module changes are battle tested so we can deploy directly user/groups on an-coord1001 e... [13:29:54] hey fdans, let me help you with the alarms that fired during the weekend, they are theoretically mine [13:31:50] hellooo mforns , the data quality ones? you mentioned that there's code to be deployed for them right? [13:32:09] I'm looking at the banner activity one [13:32:47] yes, the data quality ones are waiting for the deployment [13:33:21] BTW, fdans I might do a refinery deployment today, to unblock the deletion of mediawiki_job and netflow data [13:38:50] mforns: sounds good, lmk if I can help in any way [13:39:01] fdans: OK [14:03:20] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monit [14:03:36] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{a [14:03:36] onth}/{day} (Get top page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:46] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:52] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{p 
[14:03:52] site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:32] There is some maintenance ongoing for row D in eqiad, in theory only aqs1006 should go down [14:04:38] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:42] okok [14:05:02] so 2 cassandra instances down (on 1006) caused a temporary issue [14:05:12] hopefully the others are recovering soo [14:05:13] *soon [14:06:24] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:06:24] fdans: can you check wikistats and the AQS api? [14:07:42] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:07:52] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:07:52] elukey: could these AQS errors be related to me re-running the banner activity druid loader? [14:08:17] nono [14:09:28] elukey: aqs seems good :) [14:10:43] yep looks like it recovered [14:11:11] mforns: the SRE team is recabling the rack d4 in eqiad, it contains aqs1006 (running two cassandra instances) [14:11:23] https://grafana.wikimedia.org/d/000000483/cassandra-client-request?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-node=All&var-quantile=99p&from=now-3h&to=now [14:11:32] I see some read timeouts that eventually auto-resolved [14:11:39] ok ok [14:11:44] I think that in flight conns were dropped [14:11:45] sigh [14:27:42] o/ [14:31:51] klausman: this may be coincidence, but a job that was running on stat1006 (I think?) seems to have stopped right around when we were doing the restarts (September 24): https://dumps.wikimedia.org/other/pagecounts-ez/merged/2020/2020-09/ [14:32:16] it's a legacy job that we're in the final stages of porting over, but for now I think we need to help it along [14:32:16] Was this on a timer or a cronjob? [14:34:37] Also, is this using a venv? Luca mentioned that venvs need some TLC after the Buster update [14:35:17] milimetric: stat1007 right? [14:35:41] we don't really run anything on 1006 afaik [14:35:58] I can't remember, I thought it was 1006, it's zachte's stuff [14:35:59] the pagecounts-ez should be running from stat1007's ezachte's home no? [14:36:03] ahhh yes yes [14:36:06] okok so 1007 [14:36:06] oh ok, 1007 then [14:36:20] But that I only reinstalled today, so it's unrelated [14:36:28] yep yep [14:37:00] Maybe someone already decommissioned the job? [14:41:16] fdans: you'd be the next person to ask, did you turn off zachte's pagecounts-ez job? [14:41:43] (also fdans: we should talk about those last two wikistats things, we gotta fix and deploy today before the conference starts) [14:41:51] 10Analytics-Clusters, 10Discovery, 10Discovery-Search (Current work): mjolnir-kafka-msearch-daemon dropping produced messages after move to search-loader[12]001 - https://phabricator.wikimedia.org/T260305 (10Gehel) 05Open→03Resolved [14:41:55] 10Analytics-Clusters, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: Move mjolnir kafka daemon from ES to search-loader VMs - https://phabricator.wikimedia.org/T258245 (10Gehel) [14:42:01] brb in 10 minutes, making some polenta [14:42:34] milimetric: what conference?
[14:47:57] (03CR) 10Nuria: "As long as we have tested this works i see no problem but I do not think i understand the implications fully to CR." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629659 (https://phabricator.wikimedia.org/T263736) (owner: 10Joal) [14:48:40] elukey: fyi non-new hue is down https://hue.wikimedia.org/ [14:48:43] mforns: somethign weird happens to hue [14:48:55] hmmm [14:48:59] yes it seems too many mysql conns [14:49:15] I think that hue next might leak some connections [14:49:17] lemme check [14:50:51] now it works [14:51:11] this is really weird, and I didn't see anything horrible on mysql [14:54:00] mforns: I am re-running the banner druid job [14:54:08] I think I found something strange [14:54:15] elukey: aha [14:54:28] do you have a min for early bc? [14:54:32] yep [14:54:36] fdans: the conference being ApacheCon [14:54:41] (I suspect) [14:54:51] oh [14:54:54] but it starts tomorrow no? [14:55:17] As far as I know, yes [15:00:06] ping razzi fdans [15:00:13] well, they claimed in an email yesterday that it started today, but maybe they're just tricking us to get us to look at the programme [15:08:20] milimetric: I think that today there are some workshops and weird things from corporate people :D [15:09:24] ::shudder:: [15:09:54] !log execute set global max_connections=200 on an-coord1001's mariadb (hue reporting too many conns, but in reality the fault is from superset) [15:09:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:16:09] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Data: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (10sdkim) p:05Triage→03Medium [15:19:21] 10Analytics-Radar, 10Growth-Team, 10Product-Analytics, 10Product-Infrastructure-Data, 10Patch-For-Review: PrefUpdate captures user preference modifications at registration - https://phabricator.wikimedia.org/T260867 (10sdkim) [15:23:12] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) [15:24:02] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH the new ssds have been installed to these servers, I appreciate you fixing the raid and... [15:24:55] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, and 3 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10sdkim) a:05mpopov→03jlinehan [15:25:01] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, and 3 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10mpopov) Patch is out there but requires further discussion. [15:28:20] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Data: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (10sdkim) Product Infrastructure data as the new crowned owners of this schema will be reviewing and hoping to... 
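On the stalled pagecounts-ez job discussed earlier: the first step in "helping it along" is usually finding how it is scheduled on stat1007. A rough sketch; the ezachte username comes from the conversation above, but whether it is a crontab or a systemd timer is an open question in the log, so both are checked:

```bash
# Personal crontab of the job owner (username per the discussion above).
sudo crontab -l -u ezachte

# System-wide cron entries or timers mentioning pagecounts.
grep -ri pagecounts /etc/cron* 2>/dev/null
systemctl list-timers --all | grep -i pagecounts

# Is any related process still running (or did it die around the reimage)?
ps aux | grep -i '[p]agecounts'
```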
[15:33:01] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Data: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (10mpopov) `lang=SQL SELECT year, month, CONCAT_WS(', ', COLLECT_SET(event.property)) AS properties_affect... [15:38:11] mforns: so I found other kerberos errors in druid_loader.py [15:38:13] that is weird [15:40:51] ah no sorry [15:41:41] this is the job that failed [15:41:42] https://yarn.wikimedia.org/jobhistory/job/job_1600953045299_23399 [15:41:49] that is in the root queue [15:42:26] the reduce was killed, maybe for the resource constraints that Joseph talked about? [15:43:56] going off for ~2h, back after [15:53:53] some sort of problem with analytics sql? getting sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1040, 'Too many connections') from airflow talking to it [15:54:16] ebernhardson: there were some errors before, but I bumped the limit, should be good now [15:54:20] are you seeing more issues? [15:54:33] elukey: just got an error a few seconds ago [15:54:35] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10egardner) I just published a patch that adds a [[ https://gerrit.wikimedia.org/r/c/schema... [15:54:46] elukey: err, actually this log is older...lemme regen [15:55:39] ebernhardson: so there are 154 conns now and 200 is the limit so it should be fine in theory [15:55:44] superset is eating a lot of conns [15:55:55] elukey: restarted it, and yea it seems fine now. thanks ! [15:56:11] ebernhardson: thanks for reporting! I'll follow up on this :( [16:49:58] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet...
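On the "too many connections" errors above: a minimal sketch of how the per-user connection count and the limit can be inspected and bumped on an-coord1001's mariadb. The bump to 200 matches what was logged; the rest is a generic check, not a documented team procedure:

```bash
# Who is holding connections right now, grouped by user (superset vs hue vs airflow).
sudo mysql -e "SELECT user, COUNT(*) AS conns
               FROM information_schema.processlist
               GROUP BY user ORDER BY conns DESC;"

# Current limit and the peak usage since the server started.
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
               SHOW GLOBAL STATUS LIKE 'Max_used_connections';"

# Raise the limit at runtime (what was done here); it does not survive a restart
# unless the same value is also set in the puppet-managed my.cnf.
sudo mysql -e "SET GLOBAL max_connections = 200;"
```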
[17:18:14] 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) @elukey Can you do this Monday 5 October 1400UTC? [17:19:36] 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10elukey) Definitely yes! [17:21:56] 10Analytics-Clusters, 10Operations, 10ops-eqiad: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) @elukey Same thing with these...can we do them all Monday or will you need multiple days? [17:23:20] 10Analytics-Clusters, 10Operations, 10ops-eqiad: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10elukey) All on Monday is fine! [17:25:45] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10Meghajain171192) Hi @nuria, This is Megha Jain and I am a newbie to Wikimedia , but would love to contribute to the open sou... [17:25:54] 10Analytics-Clusters, 10Operations, 10ops-eqiad: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) Okay, great! [17:31:33] 10Analytics, 10Operations, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10JAllemandou) Idea: Could missing-revisions (T215001) be related to this? [17:36:50] 10Analytics, 10Analytics-Wikistats: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (10Nuria) [17:38:43] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10Nuria) @Meghajain171192 thanks for your interest, this is a ticket that requires access to private data and our computation e... 
[17:38:50] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10Nuria) Also , you need access to gerrit: https://www.mediawiki.org/wiki/Gerrit [17:40:39] 10Analytics, 10Analytics-Data-Quality: page_id is null where it shouldn't be in mediawiki history - https://phabricator.wikimedia.org/T259823 (10Nuria) @Meghajain171192 Please see my comments on {T263697} that also apply here, thanks for your interest [17:40:57] 10Analytics, 10Analytics-Wikistats, 10good first task: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (10Nuria) [17:41:24] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Nuria) 05Open→03Resolved [17:41:27] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Nuria) [17:41:41] mforns: I replied on alerts@, let me know if it makes sense [17:42:17] 10Analytics, 10Analytics-Kanban, 10EventStreams, 10Patch-For-Review: KafkaSSE: Cannot write SSE event, the response is already finished - https://phabricator.wikimedia.org/T261556 (10Nuria) 05Open→03Resolved [17:42:30] 10Analytics: Add Authentication/Encryption to Kafka Jumbo's clients - https://phabricator.wikimedia.org/T250146 (10Nuria) [17:43:13] razzi: o/ [17:43:30] if you want we can follow up on the oozie thing [17:43:39] I added some steps in the task [17:45:48] fdans: shall I try and fix those annotation/navigation bugs I mentioned Friday? [17:49:22] (03CR) 10Nuria: [C: 04-1] Improve path discovery in drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [18:03:25] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet... [18:08:41] (03CR) 10Mforns: Improve path discovery in drop-older-than (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [18:18:39] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&from=now-14d&to=now [18:18:58] from the 21st we have been growing the /var/lib/mysql dir on an-coord1001 a lot [18:19:10] and now I see an alarm for partition almost filled sigh [18:20:06] elukey: is an-coord1001 the one that hosts mysql for oozie ? [18:20:22] nuria: for hive/oozie/hue/etc.. [18:21:21] it may be the binlog, we are keeping too much of it [18:24:00] even if in theory we only keep the last 14 days [18:26:00] (03CR) 10Nuria: Improve path discovery in drop-older-than (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [18:27:27] elukey: are you looking in to the disk filling up? I'm interested in looking in to that with you [18:28:47] I am yes but I am wondering what it is best to do, not an easy one [18:29:49] elukey: is it teh new hue? 
[18:30:15] elukey: the new one we can possibly move to an entirely different host [18:31:06] nuria: for the moment the only problem is that the mariadb on an-coord1001 keeps 14d of binlog, and the files are consuming space now [18:31:19] not sure if hue, but its db seems tiny [18:31:26] elukey: mmm [18:32:43] so what I'd like to do now is purge some binary logs, say a couple of days, and get things in a stable situation [18:35:26] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] ` and were **ALL** successful. [18:37:00] !log execute "PURGE BINARY LOGS BEFORE '2020-09-15 00:00:00';" on an-coord1001's mariadb as attempt to recover space [18:37:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:37:37] it worked, will do more [18:37:58] !log execute "PURGE BINARY LOGS BEFORE '2020-09-20 00:00:00';" on an-coord1001's mariadb as attempt to recover space [18:38:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:38:20] ok we have 10G in the partition now [18:38:33] razzi: ok let me explain [18:38:56] so on an-coord1001 we have an old set up, that we are trying to migrate to something more modern/flexible [18:39:02] an-coord1001 replicates to db1108 [18:39:17] What exactly is replicated? [18:39:25] mariadb databases [18:39:37] on an-coord1001, we have these partitions: [18:39:40] Filesystem Size Used Avail Use% Mounted on [18:39:44] /dev/mapper/an--coord1001--vg-mysql 59G 50G 9.8G 84% /var/lib/mysql [18:39:47] /dev/mapper/an--coord1001--vg-srv 102G 19G 83G 19% /srv [18:40:05] it was a historical choice, but /srv is not used a lot, meanwhile /var/lib/mysql is [18:40:24] I see [18:40:42] 10Analytics, 10Event-Platform, 10Performance-Team, 10Product-Infrastructure-Data: Research and consider network connections made due to Event Platform - https://phabricator.wikimedia.org/T263049 (10Krinkle) To instrument this, and gauge any background/side impact during page load, I'd recommend creating tw... [18:40:44] now from https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&from=now-30d&to=now it seems that from the 21st the usage of /var/lib/mysql increased a lot [18:40:55] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data, 10Performance-Team (Radar): Research and consider network connections made due to Event Platform - https://phabricator.wikimedia.org/T263049 (10Krinkle) [18:41:22] and if you do 'ls -lht /var/lib/mysql' you'll see that it is mostly the binlog, basically where mariadb registers all its transactions etc.. [18:41:34] we keep 14d of binlog [18:41:56] I manually purged (via mariadb cli) some binlogs [18:42:31] and we have space now, but that growth is weird [18:43:14] there are also some useful commands to check [18:43:27] "sudo pvs" and "sudo lvs" [18:43:34] on an-coord1001 (you can execute them) [18:44:00] it gives you the LVM status of the partitions, if those can be expanded etc.. [18:44:41] now expanding an LVM volume is easy, but the partition on top of it might not like it very much [18:44:55] (expanding usually it is fine, shrinking not a lot) [18:45:29] an easy one could be to shrink /srv and expand /var/lib/mysql, but we should in theory stop mariadb, unmount the partitions, etc..
[18:45:34] so it needs maintenance time scheduled [18:46:07] and the major question is: why does it grow that much ? [18:46:35] elukey: mmm, ya [18:47:08] we are talking about 50/60G so it is a lot compared to our db sizes :D [18:47:29] the answer might be that we need to keep only a week of binlog [18:47:43] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [18:47:46] So you basically did that manually for now [18:48:18] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) Ok, updates: * an-worker1102 is now staged and ready for service owners to take it over. * I am working through the other hosts, rebuilding all... [18:49:00] elukey: the ibdata1 file is huge too [18:49:26] elukey: relatively speaking [18:49:36] mmm it is 332M no? [18:49:53] elukey: ya, but that seems real big [18:50:21] nuria: I am more worried about all the analytics-meta-bin, the last ones are all ~250/300MB in size [18:50:34] 272M analytics-meta-bin.017288 [18:50:34] 274M analytics-meta-bin.017256 [18:50:34] 275M analytics-meta-bin.017301 [18:50:34] 281M analytics-meta-bin.017287 [18:50:34] 288M analytics-meta-bin.017182 [18:50:37] 289M analytics-meta-bin.017260 [18:50:39] 292M hue_next [18:50:42] 306M superset_staging [18:50:44] 314M analytics-meta-bin.017261 [18:50:47] 319M hue [18:50:49] 319M superset_production [18:50:52] 320M analytics-meta-bin.017299 [18:50:54] 333M ibdata1 [18:50:57] 443M analytics-meta-bin.017300 [18:50:59] 3.6G druid [18:51:02] 5.4G oozie [18:51:04] 6.2G hive_metastore [18:52:07] elukey: and analytics-meta is ? [18:52:46] nuria: the binlog, basically where mariadb registers all the data changes / transactions /etc.. of the dbs [18:53:07] it is what gets replicated to db1108 [18:53:16] and we keep 14d from config [18:54:24] elukey: right , but from grafana seems that something is amiss from the 22nd [18:54:37] elukey: the only new thing is the new hue , right? [18:55:09] nuria: in theory yes, it could be that something is causing more log entries in the binlog, that grows those files [18:55:40] elukey: we can turn off the new hue [18:55:44] elukey: and see? [18:56:23] elukey: maybe not at all related [18:56:45] nuria: we are stable now so tomorrow I'll try to check what's inside the binlog, there is a tool to do that (I'll also ask some info to Manuel).. It could be Hue or another daemon, but timings match with Hue next [18:57:19] it is weird that the hue-next db is very tiny [18:57:55] elukey: but transactions in flight [18:58:05] elukey: will affect the size of the binlog right? [18:59:05] nuria: I am a little ignorant about what's inside the binlog, so not sure [18:59:24] elukey: i am like in kindergarten when it comes to that so yeah [19:00:12] nuria: interesting fact - before standup there were some errors reported for Hue and Airflow (Discovery) of python getting max conns reached from mysql.. and the biggest consumer was superset_production db [19:00:24] elukey: mmmm [19:00:38] so I had to raise the limit a bit (from 150 to 200) [19:00:59] elukey: i just looked and transactions are written to ibdata but not binlog (15 sec google so maybe totally off) [19:01:15] razzi: does what I wrote make sense? Doubts?
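A minimal sketch of the binlog checks discussed above: listing the files, purging old ones (the same statement that was logged), inspecting a recent binlog with mysqlbinlog (the tool elukey alludes to) to spot the chatty client, and the retention variable behind the 14 days. The specific file name is taken from the listing above as an example:

```bash
# How big the binlogs are on disk and what the server thinks it has.
sudo ls -lht /var/lib/mysql | head -20
sudo mysql -e "SHOW BINARY LOGS;"

# Drop everything older than a given date (same statement used above).
sudo mysql -e "PURGE BINARY LOGS BEFORE '2020-09-20 00:00:00';"

# Peek at what kind of statements fill a recent binlog, to identify the chatty
# client (Hue next? Superset?). DECODE-ROWS keeps row events human-readable.
sudo mysqlbinlog --base64-output=DECODE-ROWS --verbose \
    /var/lib/mysql/analytics-meta-bin.017300 | grep -iE 'superset|hue' | head

# Retention is 14 days today; lowering this is the config-side alternative.
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'expire_logs_days';"
```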
[19:02:21] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1103.eqiad.wmnet', 'an-worker1104.eqi... [19:04:46] Mostly makes sense. But the database will fill up the disk eventually; will we do the partition resize you mentioned? [19:05:22] razzi: if we understand what is the root cause of the growth there shouldn't be the need, but probably yes [19:05:44] Also, was there an alert? Or did you fix it in time? I don't see one [19:06:20] I keep an eye on icinga.wikimedia.org and #wikimedia-operations (where the partition filling up alert fired) [19:06:23] And by "fix" I mean "buy us more time" :) [19:07:07] sure, but it is better to buy time and doing a proper root cause rather than applying early fixes in my opinion [19:07:18] Definitely [19:09:38] I am leaning towards nuria's suspicion about hue-next being spammy on the binlog [19:10:28] I'll try to open a task and check tomorrow, hopefully it is a quick one (but in case I'll have to follow up with upstream etc.. sigh) [19:10:49] i am going to have dinner otherwise I'll get probably killed, ttl :) [19:11:57] byebye [19:21:35] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found... [19:21:42] :] razzi, I'll pause now for 20 mins, ping me if you wanna pair later [19:35:44] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1103.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1104.eqiad.wmnet'] ` [19:36:33] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1105.eqiad.wmnet'] ` [19:38:21] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [20:34:14] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1105.eqiad.wmnet', 'an-worker1106.eqi... [20:36:19] (03PS11) 10Milimetric: Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans) [20:40:07] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found... 
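On the partition-resize question above, a rough outline of what shrinking /srv and growing /var/lib/mysql would look like with LVM. This is a sketch rather than a reviewed runbook: the volume group and LV names are inferred from the df output earlier, the target size is just an example, the mariadb service name is assumed, and ext4 is assumed for both filesystems; it all needs a scheduled maintenance window:

```bash
# Current layout: physical volumes, volume groups, logical volumes, free extents.
sudo pvs && sudo vgs && sudo lvs

# Everything below requires mariadb stopped and both filesystems unmounted.
sudo systemctl stop mariadb
sudo umount /srv /var/lib/mysql

# Shrink /srv: check and resize the filesystem first, then reduce the LV.
# (lvreduce --resizefs -L 60G would do the same dance in one step.)
sudo e2fsck -f /dev/an-coord1001-vg/srv
sudo resize2fs /dev/an-coord1001-vg/srv 60G
sudo lvreduce -L 60G /dev/an-coord1001-vg/srv

# Grow the mysql LV into the freed extents and resize its filesystem to match.
sudo lvextend -l +100%FREE /dev/an-coord1001-vg/mysql
sudo resize2fs /dev/an-coord1001-vg/mysql

# Remount (via the fstab entries) and bring mariadb back.
sudo mount /srv
sudo mount /var/lib/mysql
sudo systemctl start mariadb
```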
[20:46:40] (03PS5) 10Razzi: Improve path discovery in drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [20:46:42] (03PS1) 10Razzi: Test using mocked file tree in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/630680 (https://phabricator.wikimedia.org/T263495) [20:57:34] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-... [21:00:46] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [21:05:53] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] ` and were **ALL** successful. [21:09:45] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1108.eqiad.wmnet', 'an-worker1109.eqi... [21:13:41] (03PS12) 10Milimetric: Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans) [21:14:19] (03PS1) 10Joal: Fix banner_activity_daily job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/630682 [21:35:54] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1109.eqiad.wmnet', 'an-worker1108.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-... [21:36:18] (03CR) 10Fdans: [C: 03+2] Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans) [21:36:30] \o/ [21:43:47] fdans: ta-ta-channn [21:43:57] fdans: please ping razzi when we deploy [21:44:04] nuria: already did :) [21:44:35] fdans: alredy deployed? [21:45:03] nuria: nope, will do either later today or tomorrow [21:45:11] fdans: ok! [21:45:21] fdans: sounds great [21:47:03] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1111.eqiad.wmnet', 'an-worker1112.eqi... [21:59:54] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:03:13] a-team I just set up the meetings for the dev pod, lmk if you see anything missing. 
still need to remove the grooming meeting on thursdays since those will be separated from next week onward [22:34:50] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1112.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet', 'an-... [22:35:19] fdans: grosking monday is split, it’s joint on Thursday [22:37:07] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:40:01] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:56:57] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1111.eqiad.wmnet ` The log can be found... [23:04:18] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1113.eqiad.wmnet ` The log can be found... [23:18:00] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet'] ` [23:18:11] 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1113.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1113.eqiad.wmnet'] `