[06:49:41] good morning
[07:03:28] 10Analytics: Sync urbanecm's LDAP account to Hue - https://phabricator.wikimedia.org/T274732 (10elukey) 05Open→03Resolved a:03elukey Done!
[07:32:18] Good morning
[07:35:40] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10JAllemandou)
[07:41:48] bonjour
[07:42:11] joal: I am doing some reboots for kernel upgrades, ok to reboot the druid clusters?
[07:42:11] o/
[07:42:16] sure!
[07:44:04] I am very curious to see if rebooting public leads to aqs timeouts
[07:44:50] ack elukey
[07:51:55] 10Analytics: Repackage spark without hadoop, use provided hadoop jars - https://phabricator.wikimedia.org/T274384 (10JAllemandou) Ok for me :)
[08:49:41] joal: if you feel adventurous this morning https://gerrit.wikimedia.org/r/c/operations/puppet/+/664099
[08:49:59] Let's have fun and try that elukey :)
[09:01:16] ok joal starting
[09:01:24] ack elukey :)
[09:04:11] restarting the namenode on an-master1002
[09:04:55] going to wait for full bootstrap, then I'll failover and restart the one on an-master1001, and failover again
[09:04:59] ack
[09:05:00] this is to update the configs
[09:05:22] then stop of zkfc daemons on both, formatZK + start of zkfc
[09:05:28] and finally roll restart of all datanodes
[09:05:44] ok - I don't know why, the ZK part is the one worrying me :)
[09:09:49] joal: change merged, I also didn't see any exec running on an-coord1001
[09:10:15] elukey: I hope I didn't make a mistake in that change :)
[09:10:31] I don't understand what you mean with no exec running on an-coord1001 :S
[09:11:24] joal: so those scripts do run automatically if $condition, and puppet evals the $condition only on an-coord1001
[09:11:30] Ah!
[09:11:37] I didn't get you were talking about my changes
[09:11:48] indeed, the files already exist so it all should be ok
[09:12:16] elukey: change is for next time we need to recreate them (I hope we won't find a bug at that moment :S)
[09:12:39] yes I know but I just reported that nothing triggered execs :)
[09:12:46] Ack :) Thanks for that
[09:16:15] restarted the namenode on an-master1001
[09:22:38] going to wait some mins for things to stabilize (GC wise)
[09:23:02] ack elukey - RPC chart looks great :)
[09:29:58] ok 1001 is the master again
[09:30:17] waiting a couple of mins and then proceeding with the stop zkfc + formatZK
[09:30:29] ack elukey - charts confirm ;)
[09:36:55] stopped the zkfc daemons, I can still see 1001 active and 1002 standby, proceeding with formatZK
[09:37:27] ack
[09:39:40] joal: all good :)
[09:39:52] \o/
[09:39:56] I still see 1001 as active and 1002 as standby
[09:40:02] and the zkfc logs confirm
[09:40:09] awesome
[09:40:20] now it is the turn of the datanodes
[09:40:24] * joal can't wait to see RPC charts stairs for port 8040 :)
[09:41:04] joal: I'd say batches of 2 DNs with 120s of sleep
[09:41:16] we are not really in a rush
[09:42:32] perfect
[09:58:04] elukey: is it expected that I see no change on RPC charts despite the rolling restart?
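A minimal sketch of the manual HA commands behind the steps described above (failover, stop zkfc, formatZK, start zkfc), assuming the standard hdfs haadmin / hdfs zkfc CLIs plus the kerberos-run-command wrapper quoted later in this log; the HA service IDs and the systemd unit name are placeholders, not taken from the log, and the batched datanode roll restart itself would normally be driven by cumin rather than by hand.

    # Which NameNode is active/standby? (service IDs below are placeholders;
    # the real ones come from dfs.ha.namenodes.* in hdfs-site.xml)
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

    # Fail over so the node about to be restarted is standby, then fail back afterwards.
    sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet

    # With both zkfc daemons stopped, re-initialise the failover state in ZooKeeper,
    # then start zkfc again (unit name assumed).
    sudo -u hdfs kerberos-run-command hdfs hdfs zkfc -formatZK
    sudo systemctl start hadoop-hdfs-zkfc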
[09:59:08] Oh my bad elukey - There are changes, just very small compared to the scale of the 8020 port, therefore almost not visible - my bad
[09:59:14] yeah :D
[10:00:19] there are a few corrupt blocks but not sure if the usual temporary weirdness or not
[10:00:51] hm
[10:02:17] (03CR) 10Awight: [C: 03+2] Update schema with core bucket labels (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: 10WMDE-Fisch)
[10:02:18] historically they popped up during roll restarts from time to time
[10:02:41] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 7 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[10:02:49] exactly
[10:04:14] (03PS10) 10Awight: Update schema with core bucket labels [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: 10WMDE-Fisch)
[10:05:11] (03CR) 10Awight: [C: 03+2] "I see we missed an `is_anonymous` flag, we can just use the absence of the `performer` data, I guess." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: 10WMDE-Fisch)
[10:05:57] (03CR) 10jerkins-bot: [V: 04-1] Update schema with core bucket labels [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: 10WMDE-Fisch)
[10:10:04] I'm seeing a CI failure for schemas/event/secondary fragment/analytics/common: https://integration.wikimedia.org/ci/job/generic-node10-docker/2533/console
[10:10:35] (03CR) 10Awight: [C: 03+2] "Looks like an unrelated test failure from fragment/analytics/common" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: 10WMDE-Fisch)
[10:16:39] * elukey bbiab
[10:18:53] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 15 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[10:35:51] so we are halfway through with the restarts, 15 is still very low, lemme check fsck
[10:36:30] elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
[10:36:33] Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
[10:36:36] The filesystem under path '/' has 0 CORRUPT files
[10:36:53] so this is a weirdness of the jmx metric
[10:37:04] I think that we can raise the threshold to 100
[10:37:13] just to avoid false positives
[10:51:05] elukey: Can we wait for the roll-restart to finish, checking if the problem resolves by itself?
[10:51:44] joal: nah it is fine, 15 blocks are really nothing and fsck reports zero
[10:51:52] for sure
[10:52:07] I have seen this issue before, I hoped that with recent versions it would have been fixed
[10:53:16] 40 hosts done, the number of RPCs for DNs seems more stable
[10:53:57] we also need to explicitly tear down the backup cluster and call the experiment done :)
[10:58:14] elukey: and bring back those hosts in the main cluster! Yay!
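A hedged sketch of how the alert's view and fsck's view can be compared: the alert above reads the CorruptBlocks gauge exposed by the NameNode's /jmx servlet, while fsck walks the actual namespace. The host/port, the optional SPNEGO flags and the use of jq are assumptions, not taken from the log.

    # JMX side: the CorruptBlocks gauge from the FSNamesystem bean
    # (add --negotiate -u : if the NameNode web UI requires SPNEGO auth).
    curl -s 'http://an-master1001.eqiad.wmnet:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' \
      | jq '.beans[0].CorruptBlocks'

    # Filesystem side: list files that actually have corrupt blocks; an empty
    # list (as above) suggests the gauge is a transient artifact of the roll restart.
    sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks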
[10:59:40] elukey: I don't know if it's related to the change you just applied, but definitely there is something visible: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=109&orgId=1&from=now-3h&to=now
[11:01:33] 10Analytics: Devise a production way for pyspark jobs - https://phabricator.wikimedia.org/T274775 (10JAllemandou)
[11:01:58] I think it may be a side effect of the roll restart or low peak time, but let's see what changes!
[11:02:19] ack
[11:02:40] ok so opening a task to repurpose the backup cluster to the main one
[11:02:43] (with buster!)
[11:10:51] (03CR) 10Awight: "This is waiting on schema and producer deployment." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight)
[11:26:17] joal: DN roll restart completed!
[11:26:20] \o/
[11:26:56] elukey: I assume that the corrupt-blocks issue will go away if you restart some JMX related stuff?
[11:27:33] joal: usually it clears itself after some time
[11:27:39] fsck reports all good
[11:27:45] so it is definitely a weirdness of jmx
[11:27:45] no problemo :)
[11:28:30] now I am going to roll reboot druid public
[11:28:35] let's see if aqs complains
[11:28:41] indeed elukey
[11:34:39] druid1004 rebooted and didn't trigger alerts
[11:34:43] this is good for the moment
[12:01:40] 10Analytics, 10Analytics-Kanban: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10elukey) p:05Triage→03High a:03elukey
[12:01:42] 10Analytics, 10Analytics-Kanban: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10elukey)
[12:04:28] 10Analytics, 10EventStreams: EventStreams socket stays connected without any traffic incoming - https://phabricator.wikimedia.org/T250912 (10SD0001) I too think I've been facing this issue – happened twice over the past 3 days, though IIRC those have been the only two occurrences this year. The bot's process r...
[12:24:41] 10Analytics, 10SRE, 10observability, 10serviceops, 10cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10MoritzMuehlenhoff)
[12:38:02] joal: roll restart of druid public completed, no issues :)
[12:46:46] * elukey lunch!
[12:46:47] elukey: this is excellent news :)
[12:47:00] enjoy your lunch elukey
[12:47:02] joal: it makes me feel hopeful for the next datasource drop :)
[12:47:22] * joal says shhhhhuuuuuut to elukey (for once :D)
[12:47:44] joal: yes yes but it breaks badly all the time already :D
[12:47:55] just jokin' :)
[12:47:58] Later!
[12:48:03] o/
[13:11:43] Guys, I have a migraine that just won't go away, taking today off.
[13:14:42] klausman: My wishes, and pain-killers :(
[13:15:07] Yeah, tried the usuals (Ibuprofen), but it's doing nothing
[13:16:27] :(
[14:14:31] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since February 9, 2021 - https://phabricator.wikimedia.org/T274617 (10JAllemandou) Closing the task as this is fixed - Please reopen if needed.
[14:14:46] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since February 9, 2021 - https://phabricator.wikimedia.org/T274617 (10JAllemandou) 05Open→03Resolved
[14:22:23] hi milimetric, I'm preparing the GII data. I think you mentioned a table to match country names to ISO-3 codes ... which is it?
[14:54:27] hey team!
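On the Druid public roll-reboot above (11:28 to 12:38), a minimal sketch of the kind of per-node check one might run after each reboot before moving to the next host; the hostnames and the broker port 8082 are assumptions, and in practice the absence of Icinga/AQS alerts was the signal actually used here.

    # Each Druid process answers GET /status/health with "true" once it is up.
    for host in druid1004 druid1005 druid1006; do
      printf '%s: ' "$host"
      curl -s --max-time 5 "http://${host}.eqiad.wmnet:8082/status/health" || printf 'unreachable'
      echo
    done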
[14:57:12] dsaez: Hi - US folks are off today :)
[14:57:24] Hi mforns
[14:58:52] hey joal,
[14:59:00] true, I forgot! thx
[14:59:30] dsaez: the table you're after is `canonical_data.countries` :)
[14:59:55] great! thx!
[15:28:17] 10Analytics: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) p:05Triage→03High
[15:35:39] RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)5 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[15:37:46] goood
[15:53:01] razzi: o/ morninggg
[15:53:21] I know that you will be so happy to reboot nodes for https://phabricator.wikimedia.org/T273278 :D
[15:53:53] I did some today, but we are missing some
[15:54:09] if you want we can check how to do matomo and archiva
[15:55:07] you can also do an-launcher, kafka and possibly schedule the stat100[4,6,7] reboots? (we need to send the email to announce@ etc..)
[15:55:45] ah snap but of course US folks don't work today, pebcak
[15:55:47] tomorrow then :)
[15:58:44] https://echarts.apache.org/en/index.html
[15:58:45] wow
[15:59:11] Yeah I've seen that elukey
[15:59:31] if we show this to milimetric it's the revival of dashiki :)
[17:28:37] !log restart hdfs namenodes on the main cluster to pick up new racking changes (worker nodes from the backup cluster)
[17:28:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:54:41] 10Analytics, 10Data-Persistence-Backup: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (10jcrespo)
[17:56:41] 10Analytics, 10Data-Persistence-Backup: Evaluate the need to generate and maintain zookeeper backups - https://phabricator.wikimedia.org/T274808 (10jcrespo) This is not something we ask analytics to take care of, but for the initial questions, I believe @elukey or @ottomata may be the person to know more about...
[18:09:15] 10Analytics, 10Patch-For-Review: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey)
[18:15:16] aaand done
[18:15:19] going to log off!
[18:15:25] have a good evening EU folks :)
[18:29:14] heh, nah, the problem dashiki was solving never had to do with charts, it's more information architecture
[18:29:52] charts are easy. This library's ok, nothing special, there are a ton of open source ones just like it, and we have d3 / vega skills anyway, it's a bit redundant
[18:51:16] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 5 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (10SBisson) @Ottomata from the app's perspective, changing the destination URL and the shape of the message for all 3 schema is almo...
[19:02:25] ack milimetric - thanks for the explanation to me, js-charts newby :)
[19:03:45] I think superset is the right tool, dashiki was supposed to be easier to use but clearly we weren't going to get there with just one or two devs juggling other projects
[19:05:57] makes perfect sense milimetric :)
[19:06:27] Now please stop working milimetric ;)
[19:06:49] oh I'm working out and watching Stargate SG-1, typing to you is just what I do to take my mind off the pain :)
[19:07:27] Ah! I wish I were a pain-killer more often :)
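A hedged example of the `canonical_data.countries` lookup suggested at 14:59:30 for mapping country names to ISO codes; the Hive CLI invocation, the column names and the input table are assumptions (the table may carry alpha-2 rather than alpha-3 codes), so check the real schema first.

    # Inspect the schema before relying on any column names.
    hive -e "DESCRIBE canonical_data.countries;"

    # Hypothetical join of a scratch table of country names against the canonical list;
    # my_db.gii_countries and the column names are placeholders.
    hive -e "
      SELECT g.country_name, c.iso_code
      FROM my_db.gii_countries g
      LEFT JOIN canonical_data.countries c
        ON lower(g.country_name) = lower(c.name);"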
[19:19:42] 10Quarry: Login broken - https://phabricator.wikimedia.org/T274815 (10Reedy)
[19:25:28] 10Quarry: Login broken - https://phabricator.wikimedia.org/T274815 (10Reedy) `lang=diff diff --git a/quarry/web/login.py b/quarry/web/login.py index f1de2ee..afd019d 100644 --- a/quarry/web/login.py +++ b/quarry/web/login.py @@ -31,7 +31,6 @@ def login(): oauth_token, user_agent=user_agent...
[19:35:39] 10Quarry: Login broken - https://phabricator.wikimedia.org/T274815 (10Framawiki) 05Open→03Resolved a:03Reedy Thanks for the revert reedy. Tried to fix recent Oauth errors with a loop to retry login, but some connections were still failing. Will commit that properly.
[20:00:25] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10mforns) Yes, or we can have another instance of a sanitization job, that reads from a separate include-list specific for non EventLogging data sets?
[20:13:40] 10Analytics, 10EventStreams, 10Services: To provide performer array in RC stream - https://phabricator.wikimedia.org/T218063 (10stjn) If `recentchange` has exactly the same data for edits as `revision-create` apart from these changes, I would be OK with it not having `performer` array, otherwise I would also...
[20:30:47] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 3907 ge 50 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[20:38:29] !log running hdfs fsck to troubleshoot corrupt blocks
[20:38:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:39:31] "sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks" returns 0 corrupt blocks
[20:46:28] RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)50 ge (W)30 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[20:47:01] 10Analytics, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10Peachey88)
[21:21:23] (03PS1) 10Framawiki: Support docker [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/664353
[21:34:28] (03PS2) 10Framawiki: Support docker [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/664353
[23:15:46] (03PS3) 10Framawiki: Support docker [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/664353