[00:22:45] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Consolidate labs / production sqoop lists to a single list - https://phabricator.wikimedia.org/T280549 (10razzi) [00:56:02] 10Analytics-Clusters, 10Analytics-Kanban, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10razzi) In our ops sync we decided to add victorops alerting for critical alerts, and I've started adding them to puppet... [05:56:57] good morning [06:00:39] !log stop timers on an-launcher1002 as prep step for an-coord1001 reimage [06:00:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:27:14] I may need to purge some binary logs on an-coord1001 again [06:27:58] (new days are ok, old ones are still using some space) [06:28:11] it will ease the reshape of the lvm volumes before the reimage [06:28:19] then we'll have more buffer for future events like this one [06:28:23] (hopefully) [06:45:56] !log PURGE BINARY LOGS BEFORE '2021-04-14 00:00:00'; on an-coord1001 to free some space before the reimage [06:45:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:48:55] all right we should be ready for the maintenance, stopping hive [06:49:09] !log stop all services on an-coord1001 as prep step for reimage [06:49:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:51:50] !log stop airflow on an-airflow1001 [06:51:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:07:21] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-coord1001.eqiad.wmnet'] ` The log can be... [07:08:02] partition reshape done, now reimaging [07:08:17] !log reimage an-coord1001 after partition reshape (/var/lib/mysql folded in /srv) [07:08:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:31:29] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-coord1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-coord1001.eqiad.wmnet'] ` [07:36:29] wow early start elukey :) [07:36:33] o/ [07:36:36] all good? [07:37:19] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-coord1001.eqiad.wmnet'] ` The log can be... [07:37:58] joal: bonjour! 
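(For reference, the binlog cleanup above boils down to something like the sketch below: check how much space the binary logs hold, then purge everything older than a cutoff the replica has already consumed. This is only an illustration, assuming pymysql and a root .my.cnf are available on the host; the actual operation was the single PURGE BINARY LOGS statement logged at 06:45:56.)

# Illustrative only: free space held by old MySQL/MariaDB binary logs.
# The cutoff must be older than anything a replica still needs.
import pymysql

CUTOFF = '2021-04-14 00:00:00'

conn = pymysql.connect(host='localhost', read_default_file='/root/.my.cnf')
try:
    with conn.cursor() as cur:
        cur.execute('SHOW BINARY LOGS')
        before = sum(row[1] for row in cur.fetchall())
        # pymysql interpolates the parameter client-side into a quoted literal.
        cur.execute('PURGE BINARY LOGS BEFORE %s', (CUTOFF,))
        cur.execute('SHOW BINARY LOGS')
        after = sum(row[1] for row in cur.fetchall())
        print('freed ~%.1f GiB of binlogs' % ((before - after) / 1024 ** 3))
finally:
    conn.close()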
Sort of, my partman recipe to preserve the data needed some adjustment, the installer failed a couple of times, my bad for some errors [07:38:08] I am running another time d-i now [07:38:13] let's see if it is the right one [07:41:19] PROBLEM - Number of Netflow realtime events received by Druid over a 30 minutes period on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1 [07:41:40] ok this time it was good, it is reimaging [07:42:34] the realtime indexing might have needed the db, forgot to stop it [07:49:01] 10Analytics, 10Patch-For-Review: Fix the open bugs for Hue - https://phabricator.wikimedia.org/T264896 (10elukey) [07:49:24] running puppet on an-coord1001 now [07:54:13] verified that the /srv partition has been preserved, puppet it is still installing packages (it takes a bit) [07:54:31] joal: when everything is up would you mind to double check with me that we are good before re-enabling timers and airflow? [07:57:18] sure elukey [07:57:23] when you wish :) [07:58:21] ETA 10 mins hopefully :) [07:58:35] ack! will keep some coffee warm :) [08:03:38] (03CR) 10Joal: "One nit :)" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681496 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [08:07:48] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-coord1001.eqiad.wmnet'] ` and were **ALL** successful. [08:10:32] Hi dcausse - thanks for the answer on Flink :) [08:10:56] joal: yw, not sure if it's super helpful tho :/ [08:10:58] dcausse: one wonder I have is resiliency - Are you happy with failures/restart so far? [08:11:09] dcausse: everything helps :) [08:11:44] we're about to test HA but it has to be resilient, that's a strong requirement for us [08:12:26] of course - it's a strong requirement for us as well - the question ar more along the line: how much does it cost to have it resilient enough, I guess :) [08:13:02] "HA" with flink means: restart from wherever the last successful checkpoint is if anything in the chain fails (jobmanager/taskmanager/..) [08:13:42] with k8s we hope to get that almost for "free" [08:13:46] dcausse: ok - I assume Flink needs some kind of HA-master, and sync point for this [08:14:24] with yarn I think it needs zookeeper [08:14:25] (03CR) 10Joal: [C: 03+2] "LGTM - Merging" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681390 (owner: 10Awight) [08:14:31] right dcausse [08:15:12] dcausse: and, I also assume the if the yarn job fails, the flink cluster dies, and the HA mode is gone, correct? 
[08:16:13] if you have something restarting the flink session cluster whenever it dies then it should resume its jobs from the HA state [08:16:41] ack dcausse - makes sense - thank you a lot for sharing :) [08:17:42] we're going to test that soon :) [08:17:52] RECOVERY - Number of Netflow realtime events received by Druid over a 30 minutes period on alert1001 is OK: (C)0 le (W)10 le 4.799e+05 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=41&fullscreen&orgId=1 [08:18:42] nice [08:18:53] druid is more resilient than I thought :) [08:19:16] joal: we should be back in business [08:19:37] \o/ [08:19:43] elukey: will test have/spark [08:20:02] thanks :) [08:20:21] now we have [08:20:21] /dev/mapper/an--coord1001--vg-srv 173G 97G 77G 56% /srv [08:20:30] so plenty more space for mysql in case it grows [08:20:38] we'll just need to be careful when adding things to /srv [08:23:38] (03Merged) 10jenkins-bot: Remove some lines tagged as unused by the linter [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681390 (owner: 10Awight) [08:24:11] 10Analytics-Clusters, 10Analytics-Kanban, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10fgiunchedi) For the specific problem I think you could also use a `case` switch (I think preferably using hiera variabl... [08:28:49] hive all good elukey [08:29:15] elukey: also - removing the `--verbose=true` from our beeline wrapper makes the operations logs disappear [08:29:58] spark works too [08:30:05] elukey: can you restart camus? [08:30:51] joal: I can yes, doing so now :) [08:31:28] !log re-enable timers on an-launcher1002 and airflow on an-airflow1001 after maintenance on an-coord1001 [08:31:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:31:34] elukey: basic tooling works (hive, beeline, spark, hue) - now let's prod tell us if anything else is broken [08:31:44] joal: only two nodes left to reimage, the hadoop masters :) [08:31:49] \o/ [08:32:01] elukey: my congrats for the persistence :) [08:32:28] \o/ [08:33:01] ah joal did you see https://issues.apache.org/jira/browse/HIVE-25020? [08:33:24] I wanted to inform you as well so you are aware [08:33:36] I have seen you have reverted the use of mariadsb driver, but hadn't followed up on why [08:34:19] great bug report elukey ) [08:35:08] it is a hack but for buster it should be ok [08:35:41] I'll also follow up with the Bigtop folks to see if it is the case to include the mysql jar in the hive deps [08:37:00] <3 [08:37:51] coffee time :) [08:38:51] cheers elukey :) [09:05:41] going to upgrade Hue as well! [09:08:56] !log upgrade hue on an-tool1009 to 4.9.0-2 [09:09:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:10:33] perfect, all good [09:22:58] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10elukey) All done! I'll follow up on https://issues.apache.org/jira/browse/HIVE-25020 but we should be good :) @Ottomata @razzi there is some potential... 
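(To make the Flink exchange above a bit more concrete: resilience comes from periodic checkpoints in the job plus ZooKeeper-backed HA metadata for the cluster, so whatever restarts a dead session cluster lets its jobs resume from the last successful checkpoint. The PyFlink snippet and config keys below are a minimal sketch with illustrative values, not the search team's actual setup.)

# Sketch: enable periodic checkpointing so a restarted job resumes from the
# last completed checkpoint instead of reprocessing from scratch.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)  # every 60s
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)

# The HA side is cluster configuration (flink-conf.yaml), not job code, e.g.:
#   high-availability: zookeeper
#   high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
#   high-availability.storageDir: hdfs:///flink/ha
# With that in place, YARN retries (or a k8s deployment) restarting the
# session cluster is enough for its jobs to pick up again.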
[09:23:41] 10Analytics-Clusters: Re-create deployment-aqs cluster - https://phabricator.wikimedia.org/T272722 (10elukey) a:05elukey→03None [09:24:56] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:25:20] wooops [09:25:45] 10Analytics-Clusters: Re-create deployment-aqs cluster - https://phabricator.wikimedia.org/T272722 (10elukey) [09:27:06] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:27:22] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:27:25] lovely [09:27:32] this is probably the overlord [09:27:43] hnowlan: Good morning - Cn you confirm the cassandra version we're using for the new cluster? [09:27:50] hnowlan: 3.11.10 is it? [09:27:57] yep [09:28:11] !log roll restart druid-overlord on druid* after an-coord1001 maintenance [09:28:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:28:35] elukey: Why was the druid-overlord unhappy after the restat? [09:29:13] joal: it complains in the logs about connection failures to the db, I think it doesn't retry by itself [09:30:29] restarting eventlogging_to_druid_editattemptstep_hourly.service [09:32:43] ack elukey [09:33:12] joal: 3.11.4, but anything 3.x compatible outside of the betas should work fine as a driver [09:33:24] ack hnowlan - Thanks [09:36:18] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:38:16] ta daaan [09:38:24] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:38:40] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:40:01] 10Analytics-Clusters: Re-create deployment-aqs cluster - https://phabricator.wikimedia.org/T272722 (10hnowlan) Known issue related to this - gitdeploy paths are incorrect for the purposes of init scripts https://gerrit.wikimedia.org/r/c/operations/puppet/+/677620 I believe all grants might not be complete on th... [09:44:50] elukey: shall I restart the failed oozie jobs? [09:46:23] ok looks like ou already did it elukey [09:46:34] yep! [09:46:37] thanks mate [09:46:57] * joal is too slow [09:47:32] nono I am fixing my messes :) [09:51:35] hnowlan: o/ eventlog1002 should be rebooted, interesting in doing it? [09:52:09] (brb) [09:56:51] elaragon: sounds good - anything fancy required? 
:) [09:57:05] elaragon: oops apologies, mis-tab [09:57:36] joal: heyo, would you have a moment for a chat on SparkSQL and dataframes? [09:57:48] Hi klausman :) [09:57:57] I have time [09:58:12] Let me pastebin a few things so you know what I;ve been up to [09:58:16] sure [09:59:50] DM? Don't want to spam the channel and drown out Luca's missives ;) [10:00:02] sure [10:03:30] here I am [10:03:37] (Rebooted my vm) [10:04:46] hey elukey - I'm up for rebooting eventlog1002, anything fancy required or just a normal reboot? [10:05:49] hnowlan: there is a systemd unit called 'eventlogging.service', that propagates to all the other eventlogging daemon service units, so in theory just systemctl stop eventlogging should suffice [10:06:11] but I reccomend to check it so you can see how it works etc.. (useful for eventlog1003) [10:06:35] https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1 is a dashboard to check [10:06:55] the mysql insertion rate is zero since we removed that a while ago [10:06:58] so don't worry about it [10:07:06] EL just pulls and pushes to Kafka [10:09:26] ack, looking now [10:12:14] and while the service is down the events just build up in kafka right? [10:12:19] why does it need to be rebooted btw? [10:17:29] (03PS1) 10Joal: [WIP] Update refinery-cassandra to cassandra 3.11 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) [10:20:01] hnowlan: kernel upgrades! [10:20:12] yes yes if the service is down the volume of events stops [10:20:18] ahhh cool [10:21:25] hnowlan: asking for permission to test a loading job on the new AQS cluster [10:22:27] joal: go for it [10:25:55] job launched hnowlan [10:27:51] nice [10:31:21] eventlog1002 rebooted, service is back up and it looks it's picking back up on the backlog [10:33:52] hnowlan: <3 nice! [10:33:59] going to lunch now, ttl! [10:37:24] enoy! [10:37:29] er enjoy. [10:52:27] (03PS2) 10Joal: [WIP] Update refinery-cassandra to cassandra 3.11 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) [11:04:24] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10awight) >>! In T210106#7019613, @phuedx wrote: > I propose dep... [11:06:07] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 6 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10awight) Putting on my team's board to reflect recent work. [11:19:27] (03PS1) 10Awight: Base class checkArgsSize [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681624 [11:19:29] (03PS1) 10Awight: Base class checkArgPrimitive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681625 [11:43:35] Progress oin the VRN/ATS front! [11:44:47] Using clever SQL and an even cleverererer Jonathan, I have now narrowed down the discrepancy of <1% of requests per hour (~15M total reqests, ~140k in VRN but not ATS) to mostly-the-response-size-is-0: https://phabricator.wikimedia.org/P15498 [11:50:27] And with that, I shall go do some grocery shopping before the electrician shows up. 
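(The reboot recipe elukey sketches above, roughly, as a script: stop the parent eventlogging.service, which propagates to the per-daemon units, make sure nothing is left running, then reboot; incoming events simply accumulate in Kafka in the meantime. A hypothetical helper, assuming the usual systemd unit naming on eventlog1002 and root privileges.)

# Hypothetical helper for the eventlog100x reboot procedure described above.
import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

run('systemctl', 'stop', 'eventlogging')          # propagates to the child units
leftover = run('systemctl', 'list-units', 'eventlogging*',
               '--state=running', '--no-legend').strip()
if leftover:
    raise SystemExit('eventlogging units still running:\n' + leftover)
run('systemctl', 'reboot')                        # events keep piling up in Kafka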
[11:51:51] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.9202 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [11:58:00] 10Analytics-Clusters, 10Analytics-Kanban: Migrate eventlog1002 to buster - https://phabricator.wikimedia.org/T278137 (10hnowlan) [12:07:32] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 6 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10phuedx) {F34409430} Nice! [12:31:57] (03PS3) 10Joal: [WIP] Update refinery-cassandra to cassandra 3.11 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) [12:39:10] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 6 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Jdrewniak) 05Open→03Resolved That looks good to me :) I t... [12:54:23] klausman: nice! [13:01:37] hnowlan: we have a successful running job :) [13:04:59] (03PS4) 10Joal: Update refinery-cassandra to cassandra 3.11 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) [13:08:23] joal: wow ! [13:21:22] joal: !!!!!! nice! [13:21:54] hnowlan: I'm prepping a patch for double loading [13:23:43] joal: fantastic [13:24:24] 10Analytics-Clusters: Re-create deployment-aqs cluster - https://phabricator.wikimedia.org/T272722 (10Ottomata) Would it be worth moving AQS to deployment pipeline? Even if you don't use it in prod k8s , having the docker image would allow you to use [[ https://github.com/wikimedia/puppet/blob/e1e13a59de3021afa... [13:30:33] (03PS1) 10Joal: Update cassandra jobs for double loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681678 (https://phabricator.wikimedia.org/T280649) [13:31:57] hi team!! [13:32:02] Hi mforns :) [13:35:07] hola! [13:38:46] (03PS2) 10Awight: Use base class methods to check argument type and convert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681389 [13:45:45] (03PS1) 10Joal: Cleanup cassandra double loading [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681682 (https://phabricator.wikimedia.org/T280649) [13:47:56] (03PS1) 10Awight: Remove unused imports [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681683 [13:52:25] (03CR) 10Hnowlan: [C: 03+1] "lgtm! Does this work as expected against the old cluster also?" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal) [13:56:23] (03CR) 10Gehel: "If not having unused imports is important for your project (and it should be), maybe it make sense to add a checkstyle rule that would pre" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681683 (owner: 10Awight) [13:56:56] (03CR) 10Joal: "> Patch Set 4: Code-Review+1" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal) [13:57:07] awight: Hey! I see a bunch of cleanups to the analytics/refinery project (great!) 
[13:57:51] we did some effort to have rules to prevent that kind of code to be added to projects in the first place. If that makes sense to you, I can probably find some time to adapt them for refinery [13:58:25] I did a first try some time ago, but no one was available to review / merge, so I'd prefer making sure that work is actually useful before starting again. [14:02:16] gehel: I like the idea personally, but I'm just a volunteer in that repo so should leave the decision to ottomata + joal, to pick some names. FWIW I'd also like to make some improvements to our Sonar configuration, see T279841. Maybe Search Platform has already done some customization there as well, that I could follow? [14:02:27] T279841: Improve Sonar job for analytics-refinery-source - https://phabricator.wikimedia.org/T279841 [14:02:57] awight: Ok, I'll check with them again [14:04:34] gehel: awight please proceed! the wwork you are both doing is super super appreciated [14:04:53] i think itt just didn't happen in the past because it was not prioritized, if you have time to make it happen i'm sure we will approve and support it [14:06:21] ottomata: Ok, I'll give it a try [14:07:13] gehel: Thanks, and feel free to CC me on anything if I can lend a hand, follow patterns, etc. [14:07:28] awight: thanks! [14:11:09] +1 to what ottomata said - thank you for our work awight and gehel :) [14:11:34] * awight semifearlessly launches a Java IDE like it's 1999 [14:14:28] (03CR) 10Hnowlan: [C: 03+1] "> Patch Set 4:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681605 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal) [14:20:56] (03CR) 10Hnowlan: "lgtm but I don't feel fully qualified to give a +1 😄" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681678 (https://phabricator.wikimedia.org/T280649) (owner: 10Joal) [14:31:42] 10Analytics: Improve Sonar job for analytics-refinery-source - https://phabricator.wikimedia.org/T279841 (10Gehel) @awight You should not actually need any access to SonarCloud. The approach we're taking is keeping most of the configuration in the project repository. I'll try to add some more details later on. [14:32:08] awight: ping me about ^ at some point, I have some knowledge that I can share [14:33:24] gehel: Good point, I was imagining that some config might live in Sonar but +1 that it's better to do this in commited dotfiles. [14:35:28] awight: I think the lack of branch coverage for unit test is because Sonar is expecting cobertura, but we're not configuring it in our maven build [14:35:35] that's probably not too difficult to add [14:36:33] joal: found the last problem for the yarn logs with the ACLs, ready to send the change to enable the capacity scheduler :) [14:38:30] actually, sonar expects jacoco and we might be using cobertura, or not have anything configured (I can't remember) [14:44:53] very cool elukey :) [14:45:03] gehel: I'm seeing something similar, Cobertura hasn't had a release since 2015 but JaCoCo looks marginally healthier, it's receiving minor updates. [15:01:09] gehel: Here's an example that I'll try to use, https://github.com/SonarSource/sonar-scanning-examples/tree/master/sonarqube-scanner-maven/maven-multimodule [15:10:42] a-team since tomorrow's a holiday, let's discuss hardware needs after standup today? [15:11:11] +1 [15:11:16] oh i have MEP sync today [15:11:28] what about before standup? [15:11:29] fdans: ? 
[15:11:39] i guess we kinda need everyone :/ [15:11:54] i could cancle mep sync today probably [15:12:13] * elukey bbiab [15:12:50] (03PS5) 10Awight: Validate the native "hive" report type [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) [15:13:39] \o/ fridge works again [15:22:11] I'd like to create deployment-eventlog06 in deployment-prep on buster, any considerations I should make before doing so or is it okay if it starts consuming events after provisioning? [15:22:21] oops meant to ping you on ^ elukey :) [15:22:45] hnowlan: it should be fine! [15:22:58] also if you break things there for a little while no one will care or notice :) [15:23:09] nice, thanks! [15:37:36] awight: interesting... the Java projects from the Search team are analyzed (with coverage) without aggregating the reports. [15:37:50] our jacoco config is in a parent pom shared by all projects: https://github.com/wikimedia/wikimedia-discovery-discovery-parent-pom/blob/master/pom.xml#L574 [15:40:27] even more interesting, the pom related to the readme you linked does not configure aggregation either: https://github.com/SonarSource/sonar-scanning-examples/blob/master/sonarqube-scanner-maven/maven-multimodule/pom.xml [15:40:28] elukey: so an-coord1001 has binlogs on /srv now? :D [15:40:44] ottomata: o/ yes! [15:40:52] cooooooo [15:41:10] we need to reboot the node to get some kernel security fixes applied, I'll do it next week [15:41:30] ottomata: I left some notes in the task about stuff to drop undr /srv, lemme know your thoughts [15:43:43] (03PS1) 10Awight: Rewrite WMDE Tech Wishes reports as native HiveQL [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/681707 (https://phabricator.wikimedia.org/T193169) [15:43:47] .win 11 [15:43:49] err [15:43:52] oh elukey i looked at that but i don't think i know anything about them? [15:43:58] they look like all superset backupss? [15:44:46] yes yes I just wanted a double check if we can drop, those may be related to Razzi's upgrade [15:45:53] I think my backups were in a different directory elukey [15:46:19] Old backups are unnecessary in any case, as long as we have a working replica, in my opinion [15:46:30] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680267 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [15:48:36] (03PS2) 10Mforns: Update job start dates to only backfill existing data [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680267 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [15:49:02] razzi: hi! yes I agree, I wanted to get a sign off from you two before proceeding [15:49:05] it will free more space [15:49:16] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM (after rebase)" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/680267 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [15:50:42] aha, I was looking for the backup I took, which was in my home directory and was removed with the reimage. I'll keep that in mind that if I want a persistent backup I should use /srv [15:52:43] razzi, ottomata ok if I drop those dirs then? [15:52:49] /srv/an-coord1002-backup [15:52:55] ok with me! [15:52:57] ok by me [15:52:58] /srv/backup_hivemeta [15:53:07] and the superset_prodetc.. 
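(On the "let's be careful what we add under /srv" note: a tiny, host-agnostic helper like the one below makes that easy to keep an eye on, printing the free headroom and the per-directory usage so it is obvious what to drop next time space gets tight. Nothing here is specific to an-coord1001.)

# Print free space on /srv and the size of each top-level directory under it.
import os
import shutil
import subprocess

SRV = '/srv'
usage = shutil.disk_usage(SRV)
print('%s: %dG used, %dG free' % (SRV, usage.used // 2**30, usage.free // 2**30))

for entry in sorted(os.scandir(SRV), key=lambda e: e.name):
    if entry.is_dir(follow_symlinks=False):
        out = subprocess.run(['du', '-sh', entry.path],
                             capture_output=True, text=True).stdout.split()
        print('  %s\t%s' % (out[0] if out else '?', entry.name))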
[15:53:08] ack :) [15:53:47] perfect [15:53:47] /dev/mapper/an--coord1001--vg-srv 173G 82G 92G 47% /srv [15:53:52] ottomata: --^ :) [15:53:58] plenty of space now [15:54:01] nice [15:54:07] ok i'll plan the partition rename [15:54:16] we need to be very careful though that we'll not add garbage under /srv [15:54:24] in the future I mean [15:57:17] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 (10Ottomata) [16:07:42] (03CR) 10Razzi: Combine labs_grouped_wikis and prod_grouped_wikis to grouped_wikis (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681496 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [16:08:20] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 (10Ottomata) Huh wow! Apparently Hive table and partition locations are also case insensitive! I just did a test where I moved lowercased the external tab... [16:13:36] 10Analytics, 10Analytics-Kanban: Move lexnasser's files before user deletion - https://phabricator.wikimedia.org/T280096 (10Ottomata) 05Open→03Resolved [16:15:07] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Isaac) > Also, our privacy policy prevent us from keeping data at the user level, so DP notions that are user centric will not really s... [16:20:49] (03PS2) 10Razzi: Combine labs_grouped_wikis and prod_grouped_wikis to grouped_wikis [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681496 (https://phabricator.wikimedia.org/T280549) [16:34:23] elukey: can you please remind me: have we refreshed druid[1-3] hardware yet? [16:35:18] https://phabricator.wikimedia.org/T255148 [16:35:21] elukey: ^ [16:35:37] (03CR) 10Joal: [C: 03+1] "LGTM :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/681496 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [16:37:06] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10JFishback_WMF) Hello all, I've completed the privacy risk analysis and shared it with the original requester: Du... [16:57:21] joal, ottomata the nodes should be in the racks waiting to get druid on top, going to update the task [16:57:39] ack - thanks elukey :) [17:02:03] 10Analytics-Clusters: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10elukey) The nodes are already racked (T274163) as an-druid100[3-5], we should need somebody to migrate druid100[1-3] to them :) Some notes: * druid100[1-3] are running the zookeeper daemons, so extra care w... [17:02:07] ottomata, joal --^ [17:10:06] MOAR machines :) [17:12:33] 10Analytics: [reportupdater] add --no-graphite flag - https://phabricator.wikimedia.org/T280823 (10mforns) [17:13:16] (03PS1) 10Mforns: Add --no-graphite flag [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) [17:16:20] (03CR) 10jerkins-bot: [V: 04-1] Add --no-graphite flag [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: 10Mforns) [17:16:58] is the mariadb stuff required for the new eventlog instance in deployment-prep given that it's disabled in prod? 
seems like there's problems with the puppet bits for it [17:19:23] (03PS2) 10Mforns: Add --no-graphite flag [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) [17:23:48] hnowlan: In theory some people in the past used the mysql stuff to test, but probably not recently.. for this test I think that you can just add the new vm without mysql support [17:24:46] ottomata: should we should deprecate mysql-support in deployment-prep? [17:27:41] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Htriedman) > - You mention processing 500,000 rows in the README. Am I correct in assuming this is the process: 1) gather top-50 viewed... [17:48:13] alright, that eventlog instance in deployment-prep will have to wait until Monday. getting errors about a self-signed cert in the puppetmaster chain during first run (!?!), not gonna solve that tonight [17:50:57] hnowlan: weird, did you manually sign the cert on the puppet master? [17:50:57] like puppet cert -s blabla [17:50:59] ah the CA chain, lovely :D [17:51:08] yes seems a problem for Monday :) [17:51:24] going afk as well, enjoy the time off folks! [17:51:40] this is on first run, haven't logged into the host yet (and can't cos the first run failed) [17:51:47] later! [18:07:51] hmmm elukey hnowlan i think we can do that. [18:08:10] streams that are migrated can be viewed in stream-beta.wmflabs.org [18:08:41] a bit of a carrot (or stick?) to get people excited to migrate if they haven't already [18:08:47] so yes hnowlan lets remove the myql bits [18:15:15] * gmodena waves [18:16:36] are elasticsearch indexes replicated somewhere in Hadoop land? I need to perform a batch of requests to MediaSearch, and join the result with data stored in Hadoop [18:16:47] (03CR) 10Awight: "It would be good to include a test for `--no-graphite`, but I guess this would have to be at the "run()" level? Or at least around `confi" (031 comment) [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: 10Mforns) [18:18:36] ebernhardson: ^^ see gmodena's question above [18:18:43] and I'd like to avoid hitting the HTTP endpoint (i need to lookup around 12 million records) [18:18:50] ottomata thx :) [18:27:28] * razzi lunch [18:27:40] gmodena: not really, although i have some scripts that can import an elasticsearch dump into hadoop [18:28:11] gmodena: the dumps are automatic every week, and there are scripts to import them, but that portion is one-off and not automated [18:29:18] ebernhardson is it something I could run/reproduce myself, or would it need some form of coordination? [18:29:53] gmodena: you could run them, but i can't promise it will just work :) I think i'm the only person who's ever run the scripts. Sec lemme find them [18:31:34] gmodena: you should be able to copy these, stat1007:~ebernhardson/projects/cirrus2hive/import-to-hdfs.sh takes in the gzip'd dump off nfs, decompresses and drops the raw json into hdfs. Decompression is necessary because hadoop can't split a .gz file across executors [18:31:56] ebernhardson awesome! 
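(A rough PySpark sketch of the cirrus2hive flow Erik describes, for orientation only: the gzipped dump is decompressed before landing in HDFS because a single .gz cannot be split across executors, the elasticsearch bulk format alternates an action line and a document line, so consecutive lines get paired, and the documents are then written out as a Hive table. The path and table name below are placeholders; the real scripts are the ones in stat1007:~ebernhardson/projects/cirrus2hive.)

# Placeholder paths/table; illustrates the split + upload steps, not the real scripts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw = spark.sparkContext.textFile('hdfs:///user/gmodena/cirrus_dump/enwiki-content.json')

# Bulk import format: line 0 is the action/metadata, line 1 the document,
# line 2 the next action, and so on. Pair them up by global position.
paired = raw.zipWithIndex().map(lambda li: (li[1] // 2, (li[1] % 2, li[0])))
docs = paired.groupByKey().map(lambda kv: dict(kv[1])[1])   # keep the document line

spark.read.json(docs).write.mode('overwrite') \
    .saveAsTable('gmodena.cirrus_enwiki_content')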
[18:31:59] 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure: [Metrics Platform] Define stream configuration syntax relevant to v1 release - https://phabricator.wikimedia.org/T273235 (10kzimmerman) a:05jlinehan→03DAbad Assigning to @DAbad for sign off [18:32:14] gmodena: then upload.py in same directory is a spark script that will read the raw json in elasticserach bulk import format and write back out as a hive table [18:32:32] ebernhardson many thanks. I'll give it a go, and let you know how it went :) [18:32:42] gmodena: sounds good, have fun :) [18:36:08] gmodena: actually there is one more in the middle i forgot, split.py has to go before upload.py, split.py takes the dump which is 2 json lines per document and combines those paired lines into a single row [18:36:49] (this is a little confusing because at one time this worked by fetching dumps via proxy from public sites, then switched to nfs with a partial but incomplete reworking) [18:37:09] ebernhardson ack! [18:40:47] ebernhardson sorry, one thing. Are the dumps fullsnapshots or incremental? [18:41:25] gmodena: full snapshots. Also there are a couple dumps for each wiki, content is articles, general is everything else. And then commonswiki is special and has file separated out [18:41:41] ebernhardson awesome sauce :) [18:43:15] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 3 others: VirtualPageView Event Platform Migration - https://phabricator.wikimedia.org/T238138 (10kzimmerman) [18:51:06] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10mforns) Should we, once this is done, remove code that does this at query time? I.e. the session length intermediate ta... [19:10:04] mforns: yt [19:10:05] ? [19:10:16] yepppp [19:10:17] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) Huge huge thanks to @JFishback_WMF for the privacy review! Everything makes sense from my side. I'll add... [19:10:27] pair with me to do event_sanitized path rename? [19:10:36] sure! bc? [19:10:40] ya [19:17:42] ottomata homedirs on stats nodes are mounted on the fuse_dfs volume (e.g. /mnt/hdfs/user/gmodena/), right? Just wanted to validate before pulling 100 something GB of dumps to the wrong fs :) [19:26:47] (03CR) 10Isaac Johnson: [C: 03+1] "ready to go per privacy review once these seven countries are filtered out. Thanks all for helping carry this through!" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [19:37:48] (03PS1) 10Gehel: Fix some checkstyle violations. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681772 [19:37:51] (03PS1) 10Gehel: Adding checkstyle configuration. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681773 [19:39:30] (03PS2) 10Gehel: Fix some checkstyle violations. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681772 [19:40:29] (03PS2) 10Gehel: Adding checkstyle configuration. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681773 [19:40:35] (03CR) 10jerkins-bot: [V: 04-1] Adding checkstyle configuration. 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681773 (owner: 10Gehel) [19:41:55] (03CR) 10Gehel: "Note that this checkstyle configuration is generating a ton of errors with the current state of the project. This isn't surprising. While " [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681773 (owner: 10Gehel) [19:47:09] ottomata err... nevermind. [19:47:33] gmodena: yes but those are readonly :) [19:47:46] if you want to write you need to use hdfs cli or api [19:48:05] ottomata ack. Thanks. [19:59:00] (03CR) 10Awight: [C: 03+1] "Taking a second look, I see there's already a mixed precedent, for example configure_logging accepts `params` rather than config. Somethi" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/681746 (https://phabricator.wikimedia.org/T280823) (owner: 10Mforns) [20:04:05] !log renaming event_santized hive table directories to lower case and repairing table partition paths - T280813 [20:04:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:04:08] T280813: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 [20:14:08] mforns FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. GC overhead limit exceeded [20:14:12] i thiink i need to do them one by one [20:28:36] (03CR) 10Ottomata: [C: 03+1] Fix some checkstyle violations. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/681772 (owner: 10Gehel) [21:01:01] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 (10Ottomata) Huh, my test must not have been representative. Marcel and I just paired to do this and a MSCK REPAIR TABLE was needed. Here's our procedure:... [21:30:03] !log temporariliy disabling sanitize_eventlogging_analytics_delayed jobs until T280813 is completed (probably tomorrow) [21:30:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:30:06] T280813: Rename event_sanitized partition directories to lowercase - https://phabricator.wikimedia.org/T280813 [21:30:28] mforns: still there? [22:14:56] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, 10Readers-Web-Backlog: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (10Jdlrobson) I honestly can't remember off the top of my head. @phuedx may k... [22:19:20] a-team FYI I have to leave for the day but am in the middle of the event_sanitized table rename, i will continue tomorrow. In the meantime i've disabled the refine sanitize jobs, so no new data will be sanitiized there for now. [22:20:06] ottomata: sorry I missed the other messages [22:20:12] s'ok1 [22:20:23] many of the repairs suceeded, but also many did not [22:20:30] my status is in an-launcher1002:/home/otto/event_sanitized_table_rename_T280813 [22:20:32] I see [22:20:45] ii need to run those repairs still, but first need to figure out why they failed due to memory issues [22:21:04] ok
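(For the record, the per-table fallback Andrew mentions, renaming each event_sanitized directory to lowercase and repairing one table per Hive session to stay clear of the GC overhead errors, looks roughly like the sketch below. The table list and base path are placeholders; the actual procedure is the one documented in T280813.)

# Illustrative loop: lowercase each table directory, then repair that table only.
import subprocess

BASE = 'hdfs:///wmf/data/event_sanitized'
TABLES = ['NavigationTiming', 'PrefUpdate', 'EditAttemptStep']   # placeholder list

def sh(*cmd):
    subprocess.run(cmd, check=True)

for table in TABLES:
    src, dst = '%s/%s' % (BASE, table), '%s/%s' % (BASE, table.lower())
    if src != dst:
        sh('hdfs', 'dfs', '-mv', src, dst)     # HDFS paths are case sensitive
    # One MSCK per beeline invocation keeps each HiveServer2 job small.
    sh('beeline', '-e', 'MSCK REPAIR TABLE event_sanitized.%s;' % table.lower())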