[00:36:09] PROBLEM - Hadoop DataNode on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[00:37:38] Analytics-Radar, Growth-Scaling, Product-Analytics, Growth-Team (Current Sprint): Growth: shorten welcome survey retention to 90 days - https://phabricator.wikimedia.org/T275171 (MMiller_WMF)
[00:46:29] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:53:09] RECOVERY - Hadoop DataNode on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[00:53:57] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:26:09] (PS1) GoranSMilovanovic: T239205 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/667419
[02:26:25] (CR) GoranSMilovanovic: [V: +2 C: +2] T239205 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/667419 (owner: GoranSMilovanovic)
[07:37:39] good morning!
[07:37:44] again 1112, weird
[07:47:14] elukey: happy monday! qq: is it ok to sudo apt-get install npm on one of the aqs-test machines?
[07:48:37] (CR) Thiemo Kreuz (WMDE): [C: +1] "I don't understand every detail. But what I can see does make a lot of sense. Unfortunately I have no idea what the CI failure means." (2 comments) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[07:48:55] lexnasser: hello! In theory yes, but do you have a specific use case in mind?
[07:49:35] elukey: yeah, it’s needed to run the aqs tests
[07:52:01] lexnasser: sure sure, please go ahead and install it then
[07:59:28] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (MoritzMuehlenhoff)
[08:31:00] (CR) Thiemo Kreuz (WMDE): "I tried to understand what the code does both before as well as after the change. I find both hard to understand. Maybe we should just bel" (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[09:06:01] (CR) Awight: Support explicit "hive" script type (2 comments) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[09:09:52] (CR) Awight: Support explicit "hive" script type (2 comments) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[09:11:02] Analytics, SRE, ops-eqiad: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (elukey) Open→Resolved Recreated partition for /dev/sdl and re-mounted. Let's see if any error triggers. Closing for the moment; will reopen if I make the disk fail.
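A minimal sketch of how the suspect disk above might be verified before being re-added - the device name /dev/sdl is from the log; the commands are standard smartmontools/kernel tooling and an assumption, not the exact session that was run:

    # Overall SMART health self-assessment for the suspect disk
    sudo smartctl -H /dev/sdl
    # Attribute table: look for reallocated or pending sectors
    sudo smartctl -A /dev/sdl
    # Kernel log entries for the "medium errors" mentioned above
    sudo dmesg -T | grep -iE 'sdl|medium error'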
[09:11:38] !log remount /dev/sdl on an-worker1112 (wasn't able to make it fail)
[09:11:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:11:58] !log restart hadoop daemons on an-worker1112 to pick up the new disk
[09:12:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:25:04] Good morning elukey
[09:28:28] bonjour :)
[09:49:26] elukey: Do you want to brainstorm the an-worker1112 problems?
[09:50:48] joal: I don't have a clear idea yet, it seems that the oom killer is triggered and hits daemons (including the datanode)
[09:51:23] it seems to happen when the cluster sees some load, but other nodes see more or less the same
[09:51:31] and I am not running any test etc..
[09:51:59] this morning I re-added the disk that I was trying to force-fail (it showed some medium errors a while ago), otherwise we can't get the replacement
[09:52:07] and then I rebooted
[09:55:34] ack
[09:56:04] Mar 1 00:29:41 an-worker1112 systemd[1]: hadoop-hdfs-datanode.service: Main process exited, code=killed, status=9/KILL
[09:56:08] this is the oom killer
[09:56:28] now, why only on this host is still a mystery to me
[09:56:39] that's my question as well elukey
[09:58:42] I rebooted to clear some weird temporary settings/state that it could have been carrying
[09:59:10] the kernel that it runs is the same as on the majority of the other nodes
[10:03:53] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (MoritzMuehlenhoff) Two things: The pyall component isn't enabled on an-worker1096, so that's expected? And this seems to be a bug in the package, which should be reported? We're importing these unmo...
[10:12:53] moritzm: o/ - one clarification about --^ - is libpython38 supposed to be installable? IIRC from Friday I was getting errors from apt due to the fact that it was a virtual package (so rocm-gdb failed to install as well)
[10:13:37] you are right about the pyall component, there was a refactor in profile::python37 to include only 37 on stretch that I forgot
[10:13:41] (PS5) Awight: Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169)
[10:13:48] but IIRC I added the component manually on the host
[10:14:20] (CR) Awight: "The CI error turned out to be a rebase thing, the internal results format changed in I102faea5309f4538" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[10:19:01] (CR) Awight: Input table date column should be optional (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[10:24:18] moritzm: nevermind, pebcak
[10:24:24] :)
[10:29:15] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) @MoritzMuehlenhoff you are completely right, we indeed don't include pyall on buster nodes (since we have profile::python37). On Friday I tried to add the pyall component but probably used t...
[10:34:53] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) https://github.com/RadeonOpenCompute/ROCm/tree/roc-3.7.x https://github.com/RadeonOpenCompute/ROCm/tree/roc-3.8.x They support, even in 4.x, 18.x and 20.x, so in theory Python 3.6+ afaics....
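Regarding the libpython38 question above, a minimal sketch of how to tell whether a package is virtual (and hence not directly installable) - the Debian-style libpython3.8 spelling is an assumption here:

    # "Candidate: (none)" usually means the package is virtual or unavailable
    apt-cache policy libpython3.8
    # For a virtual package, this lists which real packages provide it
    apt-cache showpkg libpython3.8
    # Dry run to check whether rocm-gdb's dependencies can now be satisfied
    sudo apt-get install --dry-run rocm-gdb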
[10:37:41] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) Found https://github.com/RadeonOpenCompute/ROCm/issues/1236, which is exactly our issue. It seems that they are not really going to do anything about it..
[10:40:01] (CR) Thiemo Kreuz (WMDE): [C: +1] Input table date column should be optional (1 comment) [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[10:42:39] (PS4) Awight: Input table date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174)
[10:42:55] (CR) Awight: "PS 4: manual rebase" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[10:45:28] joal: if you have a moment https://gerrit.wikimedia.org/r/c/operations/puppet/+/667556/
[10:46:19] after this I'd like to disable segment query caching for broker + enable cache on historicals on the analytics cluster
[10:46:25] to see if anything weird comes up
[10:46:30] (before public)
[10:50:10] elukey: I thought we had already done that on druid? Was it on druid-analytics only?
[10:50:41] joal: it was on public only, never applied it to analytics
[10:50:50] and also https://gerrit.wikimedia.org/r/c/operations/puppet/+/667558
[10:50:56] Ah
[10:50:59] that is for analytics (the same that I prepped for public)
[10:53:23] elukey: all good for me
[10:54:07] joal: super :)
[10:55:18] Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (MoritzMuehlenhoff) >>! In T275896#6869618, @elukey wrote: > I'll follow up with upstream, but in the meantime what should we do? ROCm released 4.0 and I think this problem is there (will need to che...
[10:56:57] Analytics, SRE: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (MoritzMuehlenhoff)
[10:58:39] (CR) Awight: "V+1 smoke-tested on stat1004" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[11:02:19] !log roll restart druid-broker and druid-historical daemons on druid-analytics to pick up new cache settings (disable segment caching on broker and enable it on historicals)
[11:02:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:02:36] roll restart of brokers done, historicals are in progress
[11:10:12] (PS1) Awight: Fix typo: no "performer" field [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/667565 (https://phabricator.wikimedia.org/T272569)
[11:12:36] (CR) Awight: "V+1 smoke-tested on stat1004." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[11:13:08] elukey: quick question: from the metrics, it seems that we only use the an-druid1001 broker - could that be true?
[11:14:16] joal: I think that we set an-druid1001 in turnilo's config, and possibly another one for superset
[11:14:24] right
[11:14:26] ok
[11:14:51] I think that turnilo periodically pings an-druid1001 to see new datasources etc..
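On the broker-polling point above, a minimal sketch of the kind of datasource-discovery call a UI like turnilo makes - the hostname is from the log, while the port is the stock Druid broker default and an assumption for this cluster:

    # List the datasources currently known to the broker
    curl -s http://an-druid1001.eqiad.wmnet:8082/druid/v2/datasources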
[11:15:04] makes sense
[11:15:06] we could tune it, it may be unnecessary
[11:16:00] elukey: cache size for brokers is down to 0 - seems correct :)
[11:16:48] elukey: however it seems there still are cache requests
[11:19:30] joal: query cache is enabled in theory
[11:19:35] segment cache is disabled
[11:19:53] so historicals should be able to do some merging before returning results to the broker (if needed)
[11:20:19] elukey: yup, got that - maybe the cached-object metric is not configured correctly (says 0)
[11:21:11] joal: the caffeine cache metrics are only a few, no idea why; meanwhile I suspect the others are for segment caching only
[11:21:16] it is not clear from druid's docs
[11:21:45] cache misses are populated though, and cached objects are not
[11:21:56] \o
[11:22:16] elukey: should we maybe poke rocm upstream to release their deb-src's as Moritz mentioned?
[11:22:27] also cache metrics for historicals are empty, weird
[11:22:49] klausman_: o/ yep we could! Do you want to open a gh issue?
[11:22:50] yup elukey - I was about to mention
[11:24:07] elukey: something else - the 2 new HDFS hosts gently fill in - seems all good :)
[11:24:10] Will do
[11:24:34] joal: perfect! Is it ok if later on I reimage another couple?
[11:24:45] I'll pause when mw history runs
[11:26:02] elukey: no need to pause IMO - balancer is gentle :)
[11:26:17] elukey: if we could do a single batch as I mentioned, it probably would be better :)
[11:26:35] (CR) Thiemo Kreuz (WMDE): [C: +1] Input table date column should be optional [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667159 (https://phabricator.wikimedia.org/T193174) (owner: Awight)
[11:26:48] joal: nono, I mean reimaging the existing workers
[11:26:57] otherwise I'll keep working on the expansion
[11:27:16] Ahhhh - excuse me elukey - wrong track :)
[11:27:21] elukey: yes all good
[11:27:26] please continue the reimage
[11:27:40] nono too many tracks opened :D
[11:29:13] elukey: https://github.com/RadeonOpenCompute/ROCm/issues/1396
[11:29:40] (CR) Thiemo Kreuz (WMDE): [C: +1] Fix typo: no "performer" field [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/667565 (https://phabricator.wikimedia.org/T272569) (owner: Awight)
[11:30:06] joal: let me know if you see something weird in the following config that I made
[11:30:09] druid.historical.cache.useCache: true
[11:30:10] druid.historical.cache.populateCache: true
[11:30:10] druid.cache.sizeInBytes: 2147483648
[11:30:10] # For small clusters it is recommended to only enable caching on brokers
[11:30:11] (CR) Thiemo Kreuz (WMDE): [C: +1] Support explicit "hive" script type [analytics/reportupdater] - https://gerrit.wikimedia.org/r/667192 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[11:30:13] # See: http://druid.io/docs/latest/querying/caching.html
[11:30:15] druid.historical.cache.useCache: false
[11:30:18] druid.historical.cache.populateCache: false
[11:30:37] * elukey cries in a corner
[11:30:58] joal: I'm sorry - I should have been more careful when reading the code-review :S
[11:31:16] joal: I am the one to blame, monday is tough :D
[11:31:19] klausman: <3
[11:32:49] klausman: ok if, as an interim solution, we create profile::python38 and deploy it on the gpu nodes?
[11:34:41] !log roll restart historical daemons (again) on druid-analytics to remove stale config and enable (finally) segment caching.
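The paste above is the bug being discussed: the historical cache keys appear twice with conflicting values, so the trailing false entries defeat the intended settings. A minimal sketch of the intended block, in the same style as the paste and using only its property names and values (druid.cache.type is an assumption based on the caffeine mention earlier):

    druid.historical.cache.useCache: true
    druid.historical.cache.populateCache: true
    druid.cache.type: caffeine
    druid.cache.sizeInBytes: 2147483648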
[11:34:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:36:41] elukey: I think that's currently our best bet.
[11:55:53] !log roll restart druid broker on druid-analytics (again) to enable query cache settings (missing config due to typo)
[11:55:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:56:19] Analytics, SRE, Patch-For-Review: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) p: Triage→Medium
[12:00:03] joal: ok, druid analytics should finally be set!
[12:13:37] going afk for lunch!
[12:29:26] Hey WMF Analytics-Engineering: I just wanted to say that the new Anaconda-based Jupyter access to the Analytics Cluster is AWESOME <3
[12:32:38] (CR) Kosta Harlan: "This wasn't submitted -- should we (Growth) do that? Is there a manual deploy that needs to happen after the code is merged?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) (owner: Kosta Harlan)
[12:35:47] PROBLEM - Check the last execution of performance-asoranking on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit performance-asoranking https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:41:58] gilles: Hi - I assume this alert is of interest to you --^
[13:13:59] (PS2) WMDE-Fisch: Filter out oversampled events [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666933 (https://phabricator.wikimedia.org/T273454) (owner: Awight)
[13:14:59] (CR) WMDE-Fisch: [C: +1] "PS2: Manual rebase." [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/666933 (https://phabricator.wikimedia.org/T273454) (owner: Awight)
[13:26:41] Analytics, SRE: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (Ottomata)
[13:44:23] GoranSM: Thanks a lot for the feedback! Explicitly mentioning ottomata since he did all the work :)
[13:46:14] (CR) Joal: [C: +1] "LGTM - Letting mforns confirm as he knows the codebase better than I do" [analytics/refinery] - https://gerrit.wikimedia.org/r/667274 (owner: Ebernhardson)
[14:07:10] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (phuedx) I'm actually going to move this one back into **Needs...
[14:08:18] joal: just added the cache hit rate for historicals
[14:08:27] it seems to be increasing, which is good
[14:08:38] we'll see how it goes, for the moment the metrics look promising
[14:08:40] elukey: in meeting, will check in a bit
[14:08:43] lemme know if you spot anything weird
[14:08:45] ahh sorry
[14:34:17] Analytics, SRE, Patch-For-Review: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) The patch seems to work, I was able to install `rocm-gdb` v3.8 on an-worker1096 :) Tobias opened https://github.com/RadeonOpenCompute/ROCm/issues/1396
[14:43:50] elukey: around for a manual sync of someone to hue?
:)
[14:44:35] shell name "tonina" https://ldap.toolforge.org/user/tonina
[14:48:42] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1097.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[14:48:43] scrap that..... the ldap group is missing anyway
[14:48:46] need to file a ticket
[14:48:55] !log reimage an-worker1097 (gpu node) to debian buster
[14:49:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:21] addshore: sure! Also note that hue-next.w.org uses CAS, and it doesn't require a sync
[14:49:28] but it still suffers from some bugs
[14:49:35] oh right
[14:49:43] and only NDA is needed for access?
[14:50:08] yes correct
[14:50:12] cool!
[14:50:27] if it doesn't work lemme know and we'll sync for hue.w.o!
[14:50:37] addshore: what is the use case btw? Data exploration?
[14:52:33] yes, exploration, particularly around an error https://phabricator.wikimedia.org/T274149
[14:54:00] addshore: then if possible I'd suggest using superset's SQL Lab for small datasets, or hive/beeline/spark on stat100x for exploring bigger ones
[14:54:31] we don't really support hue that much, only for a few use cases, and in the bright future I'd really love to deprecate it
[14:54:40] (when we deprecate oozie)
[14:55:00] cool! I'll pass on those points
[14:57:41] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Jupyter Rewrite - https://phabricator.wikimedia.org/T224658 (nshahquinn-wmf) Because of the bug above, I [updated the Jupyter page on Wikitech](https://wikitech.wikimedia.org/w/index.php?title=Analytics/Systems/Jupyter&diff=1900898&oldid=...
[15:09:06] elukey: checking the historical-cache metrics - those are really great :)
[15:09:42] elukey: a hit rate of ~0.75 with a cache of 10Mb per machine is incredible :)
[15:11:20] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1097.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1097.eqiad.wmnet'] `
[15:13:01] hi! what's the process for patches to schemas/event/secondary? does it involve oversight from the Analytics team, or can the maintainers of the code using the given schema just +2 and submit?
[15:13:21] I skimmed https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Modifying_schemas and https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines but neither seems to talk about the process
[15:14:47] tgr_: this was a recent question in the better use of data working group. I'm not sure what the official answer was (they are still figuring some things out around the process for things like PII fields), but overall it does not require oversight from analytics
[15:14:52] if no PII, for sure.
[15:15:19] it would be good to add product data eng folks to the patches, since they are trying to create a common data dictionary
[15:15:29] and i wouldn't mind being CCed, but don't block on me
[15:15:48] mforns: mholloway ^
[15:16:20] the specific change is fairly boring
[15:17:45] yeah you can proceed with that
[15:19:15] ottomata: in my quest to standardize (and fix) our Java CI: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/667114
[15:20:24] thanks!
happy to CC you to patches (I think you were in this case); https://www.mediawiki.org/wiki/Git/Reviewers could also add you automatically though (as a reviewer, not CC, but I don't think there's much practical difference)
[15:21:08] is it by design that +2-ing a schema patch does not automatically submit it, or is it just something that's not set up yet?
[15:22:01] also, submitting the patch automatically deploys it, there's no manual process, right?
[15:24:42] tgr_: +2 without submit: not by design, i haven't done much gerrit config of that repo at all, so whatever is there is the default i guess?
[15:24:46] submitting the patch automatically deploys it:
[15:24:47] correct.
[15:25:20] except
[15:25:20] https://phabricator.wikimedia.org/T274901
[15:25:24] it is a little async :/
[15:27:21] This is one of the downsides of Puppet vs e.g. Ansible.
[15:27:49] (of course the pull nature of Puppet sometimes is great, you can't win every time with either approach)
[15:27:59] (CR) Gergő Tisza: "> This wasn't submitted -- should we (Growth) do that? Is there a manual deploy that needs to happen after the code is merged?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) (owner: Kosta Harlan)
[15:28:39] ottomata: do you mind if I add this to the wikitech docs?
[15:31:23] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1097.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[15:34:17] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:35:06] tgr_: please do
[15:35:27] man that bug i linked is annoying ^^
[15:35:31] it'll fix itself shortly
[15:37:25] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (elukey) This happens for an-worker1097's reimage: ` ┌───────────┤ [!!] Partition disks ├───────────┐ │...
[15:39:55] it worked elukey :)
[15:52:26] I may be a bit late to standup today, apologies
[15:54:42] elukey: did 1097 drop one disk?
[15:54:57] elukey: lsscsi |wc -l says "23" on that host
[15:55:20] yes definitely, I just realized it, it is due to https://phabricator.wikimedia.org/T274819
[15:55:23] my bad
[15:55:33] (the task is closed but we haven't re-added the partitions)
[15:56:11] Roger.
[15:56:33] (and not really "your bad", I would've missed it as well)
[15:59:13] I'll add this use case to the "list-of-things-to-remember-before-reimage"
[16:00:13] It's a bit annoying that the Dell controllers don't default to JBOD
[16:00:37] klausman: don't you love single-disk RAID zeros?
[16:01:08] I'm more of a RAID45 type of guy ;)
[16:02:59] http://marc.merlins.org/linux/raid45.html
[16:08:17] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:20:07] ACHHHH
[16:20:09] standup!??!?!
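Back to the dropped-disk check at 15:54, a minimal sketch for spotting a missing disk after a reimage - the expected count of 24 is an inference from "23" being flagged as one short:

    # Count SCSI devices seen by the kernel (24 expected here; 23 means one is missing)
    lsscsi | wc -l
    # List block devices with sizes to identify which one is absent
    lsblk -d -o NAME,SIZE,MODEL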
[16:20:59] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1097.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[16:29:35] Analytics, Analytics-Kanban: Move the puppet codebase from cdh to bigtop - https://phabricator.wikimedia.org/T274345 (fdans) Open→Resolved
[16:29:37] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (fdans)
[16:35:11] Analytics-Radar, Event-Platform, Product-Data-Infrastructure, Product-Analytics (Kanban): Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (fdans)
[16:37:21] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (JAllemandou) I ran the usual checking script: ` ====== stat1004 ====== total 0 ====== stat1005 ====== total 563444 -rw-r--r-- 1 agaduran wikidev 576830621 Aug 18 2020 Anaconda3-2020.07-Linux-x86_64.sh drwxrwxr-x...
[16:38:21] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (JAllemandou) @MGerlach : Can you please check if any file from the previous comment is needed for you? We'll drop them after your review, thanks :)
[16:42:31] Analytics-Radar, Machine-Learning-Team, SRE: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (fdans)
[16:43:33] Analytics, SRE, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (hashar) p: Low→Medium We had Nginx buffering disabled but there are still unreasonable delay to start a transfer. There are Java CI builds failing random...
[16:44:06] Analytics, Analytics-Kanban: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (fdans) Open→Resolved
[16:44:33] Analytics, Analytics-Kanban, Patch-For-Review: Repackage spark without hadoop, use provided hadoop jars - https://phabricator.wikimedia.org/T274384 (fdans) Open→Resolved
[16:44:35] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (fdans)
[16:45:37] Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (fdans)
[16:48:13] Analytics-Clusters, observability, User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (fdans)
[16:53:21] Analytics, Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (fdans) a: Ottomata
[16:53:37] Analytics, Analytics-Kanban, Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (fdans)
[16:55:32] Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (fdans)
[16:55:47] Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (fdans) p: Triage→Medium
[16:56:26] Analytics-Clusters, Analytics-Kanban, observability: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (fdans)
[16:57:08] Analytics-Clusters, User-Elukey: Update to CDH 6 or other up-to-date Hadoop distribution - https://phabricator.wikimedia.org/T203693 (fdans)
[16:57:11] Analytics, Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (fdans) Open→Resolved
[16:58:36] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1097.eqiad.wmnet'] ` and were **ALL** successful.
[17:03:25] Analytics, Wikidata, Wikidata-Query-Service: Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854 (fdans) p: Triage→High
[17:05:16] Analytics, Event-Platform, Wikidata, Wikidata-Query-Service: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901 (fdans) p: Triage→Medium
[17:07:00] Analytics, Cassandra, ContentTranslation, Event-Platform, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (JMeybohm)
[17:07:11] Analytics, Cassandra, ContentTranslation, Event-Platform, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (JMeybohm)
[17:11:41] Analytics: Automate the deployment procedure of Wikistats 2 to Production - https://phabricator.wikimedia.org/T274126 (fdans) p: Triage→Medium a: fdans
[17:13:32] razzi: https://phabricator.wikimedia.org/T273850#6871392
[17:13:42] Please let me know razzi if it's detailed enough :)
[17:13:47] ty!
Will do
[17:15:27] Analytics: Automate the deployment procedure of Wikistats 2 to Production - https://phabricator.wikimedia.org/T274126 (fdans) let's move this to use the deployment pipeline as described here: https://wikitech.wikimedia.org/wiki/Deployment_pipeline
[17:15:31] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (MGerlach) >>! In T276026#6871220, @JAllemandou wrote: > @MGerlach : Can you please check if any file from the previous comment is needed for you? > We'll drop them after your review, thanks :) I checked: None of th...
[17:16:58] joal: not sure how I can see the logs for the timer, I don't have sudo on stat1007
[17:17:08] Arf gilles
[17:18:17] gilles: /var/log/performance-asoranking/asoranking.log :)
[17:18:19] Analytics-Radar, Cassandra, ContentTranslation, Event-Platform, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (fdans)
[17:18:31] that works, thanks
[17:20:06] Analytics: Make camus (or gobblin) jobs run in `essential` or `production` queue - https://phabricator.wikimedia.org/T274298 (fdans) p: Triage→Medium
[17:21:54] elukey: I thought we had to use journalctl to see those logs?
[17:22:41] anyway, problem solved - thanks elukey :)
[17:25:24] Analytics: Make RefineFailuresChecker checker jobs use the same parameters as Refine jobs - https://phabricator.wikimedia.org/T274376 (fdans) p: Triage→Medium
[17:26:39] Analytics, PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (fdans) Hi @Aklapper, deleting the tag is ok with us!
[17:26:48] joal: we do, but then for some timers we also redirect to syslog/logfiles
[17:29:57] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (elukey) To keep the archives happy - disk formatted and re-added back in service.
[17:33:53] Analytics, Product-Infrastructure-Team-Backlog, Wikimedia Taiwan, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (fdans) p: Triage→High @JAllemandou to contact them as a first step
[17:34:09] elukey: in order to get that info (logged to both journal and file), I need to check the service execution, right?
[17:36:26] joal: it is in puppet in the timer definition, not sure about other quick(er) ways
[17:36:42] ack elukey - thanks :)
[17:40:12] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1098.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[17:40:25] !log reimage an-worker1098 (GPU worker node) to Buster
[17:40:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:44:43] Analytics: Devise a production way for pyspark jobs - https://phabricator.wikimedia.org/T274775 (fdans) p: Triage→Medium Even though pyspark is supported, we'll continue to only have spark-scala jobs working in the repos we own. This will be documented.
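On the timer-logs exchange earlier in the hour, a minimal sketch of the two places the output can land - the unit name is taken from the alert and the file path from the log; journal access typically needs privileges, which is why the plain log file helped here:

    # Journal output of the service the timer triggers (needs sudo/adm rights)
    sudo journalctl -u performance-asoranking --since today
    # Some timers additionally redirect output to a plain file (readable without sudo)
    cat /var/log/performance-asoranking/asoranking.log
    # Schedule and last/next activation of the timer itself
    systemctl list-timers 'performance-asoranking*'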
[17:45:09] Analytics: Document our level of support of pyspark-based jobs - https://phabricator.wikimedia.org/T274775 (fdans)
[17:45:54] Analytics-Clusters, Data-Persistence-Backup: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (fdans)
[17:51:50] Analytics, Analytics-Kanban, Patch-For-Review: Newpyter - SWAP Jupyter Rewrite - https://phabricator.wikimedia.org/T224658 (fkaelin) Another observation: I attempted to use `wmfdata` to avoid replicating spark session code. The wmf base conda env contains an older version, and upgrading it fails wit...
[17:54:42] (PS1) Gerrit maintenance bot: Add mnw.wiktionary to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/667678 (https://phabricator.wikimedia.org/T276125)
[17:55:36] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1098.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20...
[18:09:34] Analytics, Product-Infrastructure-Team-Backlog, Wikimedia Taiwan, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (JAllemandou) > @JAllemandou to contact them as a first step Actually the link provided...
[18:14:32] !log restart timer that wasn't running on an-worker1101: sudo systemctl restart prometheus-debian-version-textfile.timer
[18:14:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:31:30] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1098.eqiad.wmnet'] ` and were **ALL** successful.
[18:34:26] elukey: I'm curious if you have any tips on running superset with non-minified javascript, so I can debug the error I found and see if it has already been reported
[18:41:50] razzi: no idea, never done it :(
[18:42:08] razzi: it is ok to file a gh issue upstream in the meantime, reporting the problem etc..
[18:48:21] elukey: cool, will do
[18:54:13] ottomata: qq - can I re-enable puppet on an-test-client1001? If it stays too long without any puppet run it will fall out of puppetdb etc.. (so no cumin, etc..)
[18:54:39] it is not a problem, as soon as puppet runs it gets re-added, but then SREs wonder why the host is not there :D
[19:01:29] going afk folks!
[19:03:29] yes sorry!
[19:03:37] lemme just run puppet
[19:03:39] i'm still working on it
[19:03:53] will give it a run and disable again
[19:04:00] will try to remember to re-enable after i'm done today
[19:04:05] sorry it's very easy to forget that!
[19:14:09] Analytics-Radar, Platform Engineering Roadmap Decision Making, Epic, MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), and 2 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (WDoranWMF) @Marostegui is it correct that step 1.2 of this task is alread...
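On re-enabling puppet on an-test-client1001, a minimal sketch with stock puppet agent commands; any site-specific wrapper scripts are outside the log and not shown:

    # Re-enable the agent and trigger a one-off run so the host re-registers in PuppetDB
    sudo puppet agent --enable
    sudo puppet agent --test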
[19:17:49] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/667678 (https://phabricator.wikimedia.org/T276125) (owner: Gerrit maintenance bot)
[19:35:51] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (JAllemandou) Hi @razzi - Can you please take action on the above? Many thanks :)
[19:38:07] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (razzi) @JAllemandou Sure thing!
[19:47:02] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (razzi) All removed!
[19:47:27] Analytics: Check home/HDFS leftovers of agaduran - https://phabricator.wikimedia.org/T276026 (razzi) Open→Resolved a: razzi
[20:05:21] Analytics, FR-Tech-Analytics, Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (EYener) @mforns thanks for linking to T262433! I've added myself as a subscriber. Heads up for @mpopov, as I believe you autho...
[20:26:43] Analytics, PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (Aklapper) @fdans: Hi, thanks for the reply! Deletion is not possible; [archiving](https://www.mediawiki.org/wiki/Phabricator/Project_management#Archiving_a_project) would be. Could you ela...
[20:38:12] Analytics, Analytics-Kanban: Upgrade UA Parser to 1.5.1+ - https://phabricator.wikimedia.org/T272926 (Milimetric) {F34130594} is a quick investigation I did, mostly copied from the excellent guide on wikitech. Summary: * There are 18% changes in unique classifications, we should do this more often, onc...
[20:48:56] Analytics, Analytics-Kanban: Upgrade UA Parser to 1.5.1+ - https://phabricator.wikimedia.org/T272926 (JAllemandou) Thanks for the analysis @Milimetric - Let's make this happen :)
[20:56:11] ottomata: another ping on https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/667114 (adding configuration for sonar, so that it can run on CI build)
[20:57:43] (CR) Ottomata: [C: +2] Minimal configuration of Sonar maven plugin. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/667114 (https://phabricator.wikimedia.org/T264873) (owner: Gehel)
[20:57:48] thanks!
[20:57:50] gehel: i missed that one, sorry!
[20:58:01] no problem! that was the last one for now
[20:58:26] I still need to sort out the jenkins side, but that should allow having the same analysis on all java projects
[21:24:15] brew install reattach-to-user-namespace
[21:24:27] Oops, wrong paste
[21:24:44] !log rebalance kafka partitions for webrequest_upload partition 6
[21:24:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:25:47] razzi: should we remove the replica lag alert?
[21:25:53] the old one?
[21:26:48] ottomata: should be able to; do you know if the initial spike of partition lag will cause an alert?
I guess we can wait and find out
[21:31:12] it shouldn't, and if it did, we'd be getting them already :)
[21:31:28] the icinga alert is configured to retry 6 times every 5 minutes
[21:31:43] so the lag has to be actively increasing for more than 30 minutes before the alert will trigger
[21:49:58] Analytics, Documentation: Wikimedia history dump - undocumented "merge" event - https://phabricator.wikimedia.org/T276119 (Aklapper)
[21:50:27] Analytics, Documentation: Wikimedia history dump - undocumented "create-page" event - https://phabricator.wikimedia.org/T276120 (Aklapper)
[21:58:42] fkaelin: isaacj hello!
[21:58:55] just created a PR for wmfdata that adapts your conda shipping logic
[21:59:00] would love a review!
[21:59:01] * ottomata https://github.com/wikimedia/wmfdata-python/pull/22
[21:59:03] https://github.com/wikimedia/wmfdata-python/pull/22
[22:10:42] (Abandoned) Ebernhardson: refinery-drop-hive-partitions: Ensure verbose logging goes somewhere [analytics/refinery] - https://gerrit.wikimedia.org/r/661799 (owner: Ebernhardson)
[22:11:22] (PS1) Milimetric: Update UA-Parser to 1.5.2 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/667717 (https://phabricator.wikimedia.org/T272926)
[22:12:06] Analytics, Analytics-Kanban, Patch-For-Review: Upgrade UA Parser to 1.5.1+ - https://phabricator.wikimedia.org/T272926 (Milimetric) A bunch more work to do to package and deploy, but at least the basic patch is tested and done.
[22:45:24] Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) I installed superset from source to run the frontend in development mode, and reported the error I found upstream: https://github.com/apache/superset/issues/13396
[22:51:40] ottomata: re: replica lag alert: Cool! Let's enable it. Made a patch to do so: https://gerrit.wikimedia.org/r/c/operations/puppet/+/667724/
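A sketch of the check definition behind the 30-minute reasoning above (6 attempts x 5-minute retry interval); the directive names are standard Icinga 1 object syntax, while the actual puppet-generated values for this alert are an assumption:

    define service {
        service_description  Kafka max replica lag
        check_interval       5    ; minutes between checks while the service is OK
        retry_interval       5    ; minutes between rechecks after a soft failure
        max_check_attempts   6    ; notify only after 6 consecutive failures (~30 min)
    }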