[00:15:48] 10Analytics: Move lexnasser's files before user deletion - https://phabricator.wikimedia.org/T280096 (10lexnasser)
[00:29:28] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:25:30] PROBLEM - Check unit status of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:40:40] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:52:06] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:27:42] PROBLEM - Check unit status of monitor_refine_mediawiki_job_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:51:46] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:03:16] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:25:50] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:37:16] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:58:05] good morning
[05:59:32] 10Analytics, 10Event-Platform, 10EventStreams: Hits from private AbuseFilters aren't in the stream - https://phabricator.wikimedia.org/T175438 (10Nirmos) This was resolved in T266298.
[06:44:49] it is interesting that https://yarn.wikimedia.org/cluster/app/application_1615988861843_166049 has been running for like 3h
[06:48:55] ah failed
[07:35:08] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:46:26] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:54:14] 10Analytics, 10Machine-Learning-Team, 10ORES, 10editquality-modeling: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (10LeijieWang)
[07:55:53] 10Analytics, 10Machine-Learning-Team, 10ORES, 10editquality-modeling: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (10elukey) Tagging @JAllemandou to know how much work this implies and if it is feasible :)
[08:22:44] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:25:04] there are multiple issues ongoing
[08:25:45] the produce canary events seems failing intermittently, but the main concern is the fact that refine is not behaving as expected
[08:26:19] refine mediawikijobs takes ages
[08:26:21] see https://yarn.wikimedia.org/cluster/app/application_1615988861843_166906
[08:26:42] and the logs point to spark connection failures between workers
[08:27:00] all shuffle related
[08:30:22] and also in communicating with the driver
[08:34:04] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:14:33] !log run "sudo kill `pgrep -f sqoop`" on an-launcher1002 to clean up old test processes still running
[09:14:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:15:24] razzi,milimetric ---^ all processes under your usernames :)
[09:15:52] likely not related to the problem but I found it while investigating
[09:16:43] ---
[09:17:14] I see one interesting metric common to a lot of workers and an-launcher1002, namely tcp socket attempts errors rising from midnight utc
[09:17:28] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=20&orgId=1&refresh=5m&from=now-12h&to=now&var-server=an-launcher1002&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics
[09:18:14] the cluster is relatively busy but not to justify timeouts (at least never seen something like that)
[09:18:41] ephemeral ports seems to be available on the nodes
[09:18:58] the metric matches with what I am seeing in the logs
[09:21:05] nothing really outstanding in librenms for the network ports afaics
[09:23:14] nothing in yarn started at midnight afaics
[09:23:28] then it must be job-specific
[09:29:19] there are some big jupyter notebooks though
[09:31:49] (03CR) 10Awight: "@mforns JFYI this patch is a priority fix for us. Long story short, our baseline data starts to roll out of the event db ~ May 1st, so it" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight)
[09:33:46] the jobs seem in progress, I wouldn't love to kill them
[09:41:35] there are a lot of sockets, from 300 to 500, on basically every hadoop worker for christinedk
[09:41:52] that is very suspicious
[09:43:59] !log kill application_1615988861843_164387 to see if any improvement to socket consumption is made
[09:44:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:44:09] (was consuming ~500GB of tam)
[09:44:10] *ram
[09:46:05] !log kill application_1615988861843_163186 for the same reason
[09:46:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:52:43] nope
[09:54:03] !log kill long running mediawiki-job refine erroring out application_1615988861843_166906
[09:54:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:02:16] !log roll restart yarn nodemanagers on hadoop prod (attempt to see if they entered in a weird state, graceful restart)
[10:02:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:10:21] it really seems affecting all workers
[10:10:22] https://thanos.wikimedia.org/graph?g0.range_input=12h&g0.max_source_resolution=0s&g0.expr=rate(node_netstat_Tcp_AttemptFails%7Binstance%3D~%27an-worker.*%27%7D%5B5m%5D)&g0.tab=0
[10:11:11] anything I can help with?
[10:12:37] (03PS1) 10Hnowlan: Add docker-compose environment with cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/679295
[10:15:22] hnowlan: hi! I see a lot of tcp syns from all workers to kafka-logging1001.eqiad.wmnet.9093, that is a new host in theory
[10:15:26] (03CR) 10jerkins-bot: [V: 04-1] Add docker-compose environment with cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/679295 (owner: 10Hnowlan)
[10:17:00] ahhh now I get it!
[10:17:04] it is in the rsyslog config!
[10:17:54] and we don't have it whitelisted in our firewall
[10:19:32] it is likely not related to our issues
[10:19:36] but lemme fix it
[10:23:16] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:32:01] thx for the sqoop kill, sorry had no idea I had zombies
[10:33:46] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:36:21] milimetric: np! :)
[10:36:51] I am going to lunch now, just sent an email about what I did this morning for alarms
[10:37:01] Andrew will probably have more clue about those timeouts..
[10:37:04] ttl!
[10:53:23] (03CR) 10Hnowlan: "Of course, this broke the tests, d'oh. Consider this a WIP for now." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/679295 (owner: 10Hnowlan)
[11:06:39] 10Analytics-Radar, 10Patch-For-Review, 10Unplanned-Sprint-Work, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10Lena_WMDE)
[11:06:56] 10Analytics-Radar, 10Patch-For-Review, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10Lena_WMDE)
[11:31:03] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight)
[11:53:18] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] scap: add analytics WMCS hosts instead of old deploy-prep hosts [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/677517 (owner: 10Hnowlan)
[12:17:38] hey team! :]
[12:37:33] hola mforns
[12:37:52] ciao elukey :]
[12:38:46] mforns: Awesome, thank you! Let me know if/when I can do anything to check on job outputs or debug.
[12:40:59] awight hi! the new code should be deployed now! I'm not aware if those reports were failing, or it's just an update... If report dates are pending, those should be running now-ish. Otherwise, they will run 00:00 UTC.
[12:42:17] mforns: Great, I'll let you know. Yes they were all failing with this same issue, unfortunately.
[12:43:21] awight: do you want me to have a quick look?
[12:45:02] mforns: 24-hour turnaround would be more than adequate, don't drop anything else for our sake :-)
[12:46:59] mforns: Of course, in the medium-term it would be fantastic to get access to the error logs, it might save a lot of me pestering your team.
[13:00:11] !log upgrade Hue to 4.9 on an-tool1009 - hue-next.wikimedia.org
[13:00:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:04:47] ah lovely, it worked in staging/test but not on an-tool1009
[13:07:55] I'd like to roll back the recent change to the restbase cassandra driver version in aqs if anyone has a sec - this will restore what we currently have running in prod https://gerrit.wikimedia.org/r/c/analytics/aqs/+/678859
[13:09:00] (03CR) 10Elukey: [C: 03+1] Revert "package: bump restbase-mod-table-cassandra" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/678859 (owner: 10Hnowlan)
[13:10:06] !log rollback hue-next to 4.8 - issues not present in staging
[13:10:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:13:21] thanks elukey!
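The per-user socket hunt described above (300-500 sockets per worker attributed to one user) can be reproduced with a small pipeline. This is a hedged sketch: on a real worker you would feed `ss -tnp` output into it, but here a tiny inline fixture stands in so the transformation is visible, and mapping pids back to usernames (e.g. via `ps -o user= -p PID`) is left out.

```shell
# Count sockets per owning pid from ss-style output.
# The fixture lines below imitate `ss -tnp`; on a real host, replace the
# printf with: ss -tnp
printf '%s\n' \
  'ESTAB 0 0 10.0.0.1:50010 10.0.0.2:9093 users:(("java",pid=101,fd=33))' \
  'ESTAB 0 0 10.0.0.1:50011 10.0.0.2:9093 users:(("java",pid=101,fd=34))' \
  'ESTAB 0 0 10.0.0.1:50012 10.0.0.2:9093 users:(("python",pid=202,fd=5))' \
  | grep -o 'pid=[0-9]*' \
  | cut -d= -f2 \
  | sort | uniq -c | sort -rn
```

On the fixture this prints pid 101 with 2 sockets first, which is the shape of evidence that pointed at the suspicious jupyter/yarn jobs above.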
[13:22:46] (03CR) 10Hnowlan: [C: 03+2] Revert "package: bump restbase-mod-table-cassandra" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/678859 (owner: 10Hnowlan)
[13:26:28] (03Merged) 10jenkins-bot: Revert "package: bump restbase-mod-table-cassandra" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/678859 (owner: 10Hnowlan)
[13:28:47] (03PS1) 10Hnowlan: Update aqs to 60c2b70 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/679333
[13:35:47] mforns: I think I see a glimmer of data at the end of the tunnel... Will keep you posted.
[13:36:01] hehe awight, OK
[13:39:51] ottomata: o/
[13:39:58] mornin, lemme know if I can hel
[13:40:00] *help
[13:40:18] it is super weird, but before doing more restarts etc.. I wanted to get your opinion to avoid messing the state more :D
[13:43:23] trying to roll forward hue again in hue-next to debug what's happening
[13:45:56] elukey: i think i know what is going on
[13:46:07] i think there is a bug in the logic for table_exclude_regex
[13:46:14] aside from that test_event alert
[13:46:23] i think these are all datasets that should not be refined
[13:46:35] but, i
[13:46:48] still surprised there was no refine fail alert when the job failed refining though
[13:46:51] so that is worrisome
[13:47:06] yeah I was very confused as well
[13:47:09] elukey: i don't think you will hurt anything by doing restarts
[13:47:14] this is all just in refine stuff
[13:47:26] yes yes I just wanted to have your opinion first
[13:47:29] awight: I see that reportupdater ran today at 5am UTC and updated all the affected reports successfully (the report CSVs look fine). I imagine by the code that the error happened when pushing metrics to graphite. Then, I think we'll have to wait until tomorrow 5am UTC for RU to run again, and see if metrics make it to graphite no?
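The `table_exclude_regex` bug ottomata mentions above was later described as a case-normalization problem (the regexes needed lower-cased table names to match). A quick shell illustration of why that matters; the table name and pattern here are made up for the demo, not taken from the actual refine config:

```shell
# A lowercase exclude pattern misses a mixed-case table name...
table='MediaWiki_Job_Events'
pattern='^mediawiki_job'
echo "$table" | grep -cE "$pattern" || true
# ...but matches once the name is normalized to lowercase too.
echo "$table" | tr 'A-Z' 'a-z' | grep -cE "$pattern"
```

This prints 0 then 1: the same pattern only works after both sides agree on case, which is the normalization the Refine fix applies.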
[13:48:02] ottomata: the refine_mediawiki_job takes hours to complete, and it seems failing over and over when the spark workers try to send heartbeats to the driver
[13:48:41] elukey: probably because all of the sudden it is considering way more data to refine than it had before
[13:48:41] this is why I thought there was somehow a networking problem going on (or an increased pressure on the cluster)
[13:48:54] yes yes now it makes more sense
[13:48:56] aye
[13:53:42] ok live patched hue-next, will file a pull request to upstream
[13:53:50] of course it fails when using cas
[13:53:51] ...
[13:54:02] but a one line change fixes it, goood
[13:57:08] ottomata: if you agree I'd fix the libmysql-java issue with two steps: 1) we forward port libmysql-java (with all the CVE fixed by debian etc..) to a special component in Buster (Moritz is ok with it), and we use it to enable the Buster upgrade. 2) I try to follow up via Jira with Apache Hive upstream, to understand why the mariadb jdbc driver doesn't support their sql for the metastore
[13:58:59] mforns: Aaah exactly, the error is due to an illegal graphite path. Unfortunately, it also means that the output files have cached bad data, I hadn't thought about that. Will update in the task.
[13:59:50] elukey: sounds good to me!
[14:00:59] awight: by "bad data" do you mean buckets with graphite-incompatible characters?
[14:04:21] 10Analytics-Radar, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) I just realized that the cached output files for all of these jobs have been genera...
[14:05:09] !log run build/env/bin/hue migrate on an-tool1009 after the hue upgrade
[14:05:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:06:30] mforns: Exactly. Awkward place for me to break the pipeline...
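The "illegal graphite path" above comes from bucket labels containing characters like `+` and space, which Graphite does not accept in metric path segments. A minimal sketch of sanitizing such a label before building a metric name; the replacement with `_` is an illustration only, not what reportupdater itself does:

```shell
# A bucket label from the broken reports contained '+' and a space;
# map both to '_' so the resulting Graphite path segment is legal.
bucket='100+ edits'
printf '%s\n' "$bucket" | tr '+ ' '__'
```

This yields `100__edits`, a label Graphite will store without error.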
[14:06:58] awight: do you want me to re-run some dates?
[14:07:23] mforns: Sure! Any one day for any of the jobs touched in the patch you merged...
[14:07:25] awight: although... the graphite metrics are only going to be sent for new dates, those will be good
[14:07:58] ah one thing: when you re-run we're in danger of duplicating the data going to graphite, so maybe choose the latest day to make it more clear what to purge later?
[14:08:52] awight: but you said the graphite metrics are not being sent now no?
[14:09:31] mforns: :-/ a few may have gone through, whenever the output contained no edit count buckets which include "+" or " ". Messy.
[14:09:57] I could write a shell thing to select exactly which files are bad, if that's helpful.
[14:10:07] The intermediate files are TSV, I think?
[14:10:52] awight: is it possible to remove data from graphite?
[14:11:15] yes, they are TSV
[14:11:57] mforns: That might be better, but I hear it's very low-level to edit the Graphite db.
[14:14:22] awight: I don't think we can leave the metrics clean, if we don't drop the current graphite data... for each day there's several buckets, some were sent and some others didn't IIUC. So, by rerunning a given date, we'll correct part of the metrics, but duplicate the ones that were sent already, no?
[14:14:57] mforns: Oof, in my heart I know that you're right. There's no way the graphite update was transactional...
[14:15:10] reportupdater doesn't look at the missing buckets, just the missing dates, when deciding which jobs to rerun
[14:15:59] if we delete some rows in the TSVs but leave just 1 row with date D, then date D will not get re-run...
[14:16:42] +1 we would have to work on a per-file level. But I think you're right about the other thing, it's likely that some files were already partially pushed.
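awight offers above to "write a shell thing to select exactly which files are bad". A hedged sketch of what that could look like: list report TSVs whose bucket labels contain the offending `+` character. The fixture files here stand in for the real reportupdater output; the column layout is illustrative.

```shell
# Create two throwaway TSVs, one with a bad '+' bucket label, one clean,
# then select only the bad one with grep -l.
d=$(mktemp -d)
printf 'date\tbucket\tvalue\n2021-04-01\t100+ edits\t5\n' > "$d/bad.tsv"
printf 'date\tbucket\tvalue\n2021-04-01\tall\t9\n'        > "$d/good.tsv"
grep -l '+' "$d"/*.tsv
rm -rf "$d"
```

Only `bad.tsv` is printed; pointed at the real report directory, the same `grep -l` would enumerate the files needing per-file cleanup discussed above.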
[14:18:20] awight: the only thing I can think of, is temporarily changing all the queries, to only output the buckets that were failing before, so that only the missing metrics are sent for the re-run period, when that is fixed, then recede to the complete queries again.
[14:18:54] awight: do you have a public link to the graphite metrics?
[14:19:13] Interesting workaround--but then we would be left with incomplete output files, if that matters.
[14:19:33] awight: I understand you don't look at the files right?
[14:21:00] awight: orrrr.... instead of not outputting the non-missing metrics, we could output them, but insert graphite-unsupported characters in them (oof this is becoming a bit too hacky...)
[14:22:00] mforns: True, I don't personally care about the intermediate files so this could work.
[14:23:18] Here's one of the metrics, https://graphite.wikimedia.org/S/u
[14:29:36] mforns: addshore is optimistic that we can purge the data from Graphite, so probably the easiest fix is to purge both the intermediate and downstream data.
[14:29:50] 10Analytics, 10Patch-For-Review: Add time interval limits to pageview API - https://phabricator.wikimedia.org/T261681 (10Milimetric) This approach sounds good to me, I just wouldn't make it depend linearly on the number of years. How about... 500ms * sqrt( numberOfYears )?
[14:30:17] awight: OK, yes, if deleting in graphite is possible, then that would be easiest on analytics side
[14:35:46] 10Analytics-Radar, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) After discussion, we think that some of the report data may have been pushed to Gra...
[14:37:57] mforns: I just upgraded hue-next to 4.9, that is a little better, hopefully less buggy (even if the problems that we found are still there, no upstream changes, if we want to fix them we'll need to send pull requests :( )
[14:38:22] elukey: lookin
[14:40:25] elukey: it seems the same problems are there still no? but yea, let's send changes upstream :]
[14:42:37] 10Analytics-Radar, 10Graphite, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight)
[14:42:48] mforns: yes as I said above some issues are still there, but the overall stability should be better
[14:42:54] ok
[14:43:03] can't do more than this sorry
[14:43:07] I know that you hate hue next
[14:43:50] tomorrow I'll move hue-next to hue so we'll complete the migration, and drop all the cloudera packages etc..
[14:46:10] 10Analytics-Radar, 10Graphite, 10WMDE-TechWish-Sprint-2021-03-31, 10WMDE-TechWish-Sprint-2021-04-14: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) I'm not sure how to manage the timing between the two purges, especia...
[14:51:12] elukey: no no, no hate for hue-next :] thanks for the upgrade!
[14:56:41] awight: one key thing is, deleting a metric is easy, deleting bits of data from a metric is not so easy (but still possible)
[14:56:46] Which was your case?
[14:57:26] addshore: +1 This is wholesale deleting of the metric and its history.
[14:57:45] Awesome!
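Wholesale deletion of a Graphite metric, as addshore describes above, typically means removing its Whisper files on the carbon host. A hedged sketch using a temporary directory to stand in for the storage tree; `/var/lib/carbon/whisper` is a common real location but the actual layout should be verified before deleting anything, and `-print` should always be reviewed before switching to `-delete`:

```shell
# Simulate a whisper tree, then show the find invocation that selects one
# metric subtree's .wsp files. The metric path here is illustrative.
root=$(mktemp -d)
mkdir -p "$root/MediaWiki/CodeMirror"
touch "$root/MediaWiki/CodeMirror/edits.wsp" "$root/MediaWiki/pageviews.wsp"
find "$root/MediaWiki/CodeMirror" -name '*.wsp' -print   # review this output
# then, once the list is right:  find ... -name '*.wsp' -delete
rm -rf "$root"
```

Only the targeted metric's file is listed; neighboring metrics under the same prefix are untouched, which matches the "deleting a metric is easy" half of the remark above.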
[15:06:44] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop coordinators to Debian Buster - https://phabricator.wikimedia.org/T278424 (10elukey)
[15:08:01] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:08:19] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:08:21] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Jhernandez)
[15:08:26] 10Analytics, 10Analytics-Kanban, 10Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (10elukey) a:03razzi
[15:11:00] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:11:15] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:11:36] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:11:54] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:14:03] 10Analytics, 10Data-Services, 10Developer-Advocacy (Apr-Jun 2021), 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez) p:05Triage→03High a:03Jhernandez
[15:14:08] 10Analytics, 10Data-Services, 10Developer-Advocacy (Apr-Jun 2021), 10cloud-services-team (Kanban): Mitigate breaking changes from the new Wiki Replicas architecture - https://phabricator.wikimedia.org/T280152 (10Jhernandez)
[15:24:36] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:25:22] RECOVERY - AQS root url on aqs1010 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:25:32] RECOVERY - AQS root url on aqs1011 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:26:22] RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:29:09] ^ this is me
[15:29:10] RECOVERY - AQS root url on aqs1014 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:30:02] aqs works on the new cluster now - logs indicated nothing of the sort but it turns out aqs needs to have `ALL` granted on `ALL KEYSPACES`
[15:31:06] RECOVERY - AQS root url on aqs1015 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:34:43] https://i.kym-cdn.com/entries/icons/original/000/005/613/Page22mh.jpg how did this get here i am not good with javascript
[15:35:09] anyway, there's a (fairly) stale dataset on aqs101[0-5] for functional testing (cc joal)
[15:35:22] hnowlan: ahhahahaha wow what a rabbit hole
[15:35:35] let's make sure to document all at the end
[15:35:40] for sure
[15:35:50] Joseph is out this week so testing might be delayed :(
[15:35:54] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:36:05] one positive of the testing is that it turns out we need no library bumps and we can avoid any schema stuff or concerns about old restbase libs
[15:36:38] also that docker compose thing is pretty handy for testing code changes even if I need to figure out a middle-ground as far as the tests we already have go
[15:37:11] nice
[15:41:54] 10Analytics, 10Cassandra: AQS Cassandra driver needs to be updated - https://phabricator.wikimedia.org/T278699 (10hnowlan) 05Open→03Declined
[15:41:56] 10Analytics, 10Cassandra: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (10hnowlan)
[15:42:36] 10Analytics, 10Cassandra: AQS Cassandra driver needs to be updated - https://phabricator.wikimedia.org/T278699 (10hnowlan) Solved in-schema for the new cluster.
[15:43:03] (03PS1) 10Ottomata: Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/679372
[15:46:08] 10Analytics, 10Cassandra: Dual loading from Hive into old and new AQS clusters - https://phabricator.wikimedia.org/T280155 (10hnowlan)
[15:46:50] 10Analytics, 10Analytics-Kanban, 10Cassandra: Dual loading from Hive into old and new AQS clusters - https://phabricator.wikimedia.org/T280155 (10hnowlan) p:05Triage→03Medium a:05elukey→03hnowlan
[15:48:57] (03CR) 10Ottomata: "We'll have to change the table include and exclude regexes in the puppet refine job configs to be normalized i.e. lower cased." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/679372 (owner: 10Ottomata)
[15:49:38] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:56:40] sorry for the late notice a-team: I got pulled into a working group, first meeting is now so I'll miss standup. I've been trying to debug the encoding problem and I enlisted Lex to help me. Also had a sync-up with cloud, they have better data on cross-wiki use cases, fewer users and smaller overall % of queries need that feature, so it should be easier to build something more custom to meet the use cases.
[16:01:00] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:01:45] 10Analytics, 10Machine-Learning-Team, 10ORES, 10editquality-modeling, 10artificial-intelligence: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (10calbon) Would it be possible to get all ORES scores ever as a one-off-job and...
[16:14:42] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:18:59] (03PS1) 10Elukey: Move sqoop-mediawiki-tables back to the com.mysql.jdbc.Driver [analytics/refinery] - 10https://gerrit.wikimedia.org/r/679387 (https://phabricator.wikimedia.org/T278424)
[16:26:00] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:36:39] hey a-team! I'm getting some errors in Hive, e.g. this job: task_1615988861843_168581_m_000000 Looks like there might be some underlying permissions issues?
[16:37:14] Nettrom: Hi! What is the error that you get?
[16:37:48] elukey: java.io.IOException: java.lang.reflect.InvocationTargetException caused by java.lang.NullPointerException
[16:38:07] wow
[16:39:58] Nettrom: I suggest to open a task with steps to repro if you have a min
[16:40:21] elukey: no problem will do!
[16:40:56] thanks!
[16:41:20] elukey: fwiw i was able to get the application logs on an-master1001
[16:41:21] sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1615988861843_168581
[16:41:29] took me a while to figure out how/where to run that
[16:41:37] i used to just be able to sudo as the user...
[16:41:49] ottomata: :)
[16:42:02] :)
[16:42:32] there should be a hdfs keytab on an-launcher too in theory (if you need it)
[16:43:00] oh, huh thought i tried that must have done that wrong
[16:43:07] lemme check
[16:43:22] yep yep it is there
[16:45:54] 10Analytics, 10Product-Analytics: Hive: create table statement failure - https://phabricator.wikimedia.org/T280168 (10nettrom_WMF)
[17:03:09] (03PS1) 10Lex Nasser: Refactor pageviews per-article endpoint [analytics/aqs] - 10https://gerrit.wikimedia.org/r/679398
[17:05:49] fdans: Can I assign this task to you for the Wikistats addition of the pageviews-per-country data: https://phabricator.wikimedia.org/T207171 ? And should I create a child task specifically for Wikistats, like the analog of https://phabricator.wikimedia.org/T263697?
[17:08:28] Nettrom: can you run
[17:08:53] hdfs dfs -chmod -R 755 /user/hive/warehouse/growth_welcomesurvey.db/monthly_overview
[17:08:57] i can't look at the data right now
[17:09:21] ottomata: done
[17:12:51] interesting Nettrom i was able to run that query into my own db
[17:13:26] oh
[17:13:27] Nettrom:
[17:13:31] the table already exists, right?
[17:13:58] ahh
[17:14:05] it does not but the underlying data does
[17:14:05] ottomata: sorry, I test ran it and it worked, let me delete it again
[17:14:12] ?
[17:14:19] drwxrwxrwt - nettrom hdfs 0 2021-04-14 17:11 /user/hive/warehouse/growth_welcomesurvey.db/monthly_overview_backup_test
[17:15:21] ottomata: I've deleted it now, let me test with a different table name to see what happens
[17:16:25] hmm, the test query now works
[17:16:35] sounds like an underlying permissions issue then, possibly
[17:16:40] I do get this though:
[17:16:49] chgrp: changing ownership of 'hdfs://analytics-hadoop/user/hive/warehouse/growth_welcomesurvey.db/monthly_overview_backup_test_foo': User nettrom does not belong to hadoop
[17:21:17] 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (10LGoto) a:05ovasileva→03nray
[17:27:22] interesting nettrom i think the hive managed databases/tables need a permissions fix. the default group ownership doesn't look right
[17:27:26] filing a task for that
[17:27:35] but, i don't see why your stuff would fail with the existing stuff
[17:29:05] 10Analytics: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (10Ottomata)
[17:29:50] Nettrom: at what point do you get that chgrp error?
[17:38:02] ottomata: what should we do with refine source? Do you prefer to wait for a review and build tomorrow?
[17:43:00] elukey: i'm still working on fixing some things
[17:43:05] mostly the failure email right now
[17:43:22] ah okok, then is it ok if we deploy tomorrow? (Just trying to figure out how much time to stay online)
[17:43:23] fkaelin: yt? i got a scala Try / Failure q
[17:43:27] elukey: yup
[17:43:31] or, i may do it today
[17:43:33] don't worry about it!
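The chgrp failure above ("User nettrom does not belong to hadoop") is a plain group-membership problem: the group in the chgrp target must be one of the user's groups. A quick local check anyone can run for their own account; `hadoop` is the group named in the error, and the check itself is generic POSIX, not anything Hive-specific:

```shell
# Print whether the current user is a member of the 'hadoop' group,
# using id -Gn (list of group names for the invoking user).
if id -Gn | tr ' ' '\n' | grep -qx hadoop; then
  echo "current user is in the hadoop group"
else
  echo "current user is NOT in the hadoop group"
fi
```

On the cluster the same check with the analytics users' accounts would show why Hive's post-write chgrp fails while the data write itself succeeds.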
[17:43:37] i can take care of it
[17:44:19] ottomata yep
[17:44:32] got a sec for a hangout? easier to share
[17:44:46] https://meet.google.com/rxb-bjxn-nip
[17:46:01] fkaelin: ^
[17:46:31] ack :)
[17:50:05] Analytics-Radar, observability, Graphite, Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (awight) I've taken screenshots of the historical CodeMirror grafana boards, to be sure we have a...
[17:52:32] * elukey afk!
[17:58:55] ottomata: the chgrp error shows up after the job ended, it's preceded by two lines with "Moving data to directory…" followed by the chgrp message
[17:59:26] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:59:35] (PS6) Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379)
[18:10:48] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:13:44] Analytics-Radar, observability, Graphite, Patch-For-Review, and 2 others: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (awight) {F34393714} {F34393715} {F34393716}
[18:20:05] ottomata: we're talking about the ad-blocker issue in session length (changing the domain the events are sent to), not sure if you're in favor of it or not, and how long (roughly) it would take to do?
[18:31:14] (CR) Milimetric: [C: +1] "Thx for the change, quick opinion on the naming. I'll merge after you decide."
(2 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/679398 (owner: Lex Nasser)
[18:32:03] (CR) Bearloga: [WIP] Metrics Platform context attribute schema fragment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:20:02] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 15.27 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[19:23:13] razzi: hi! we upgraded to superset 1.0 in Fundraising, and I'm noticing some browser-specific issues. We have some error messages fire in Firefox (for all permission levels, even created permissions) while everything works well in Chrome. Are you all noticing anything like that?
[19:37:42] (CR) Lex Nasser: Refactor pageviews per-article endpoint (2 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/679398 (owner: Lex Nasser)
[19:47:02] (PS2) Lex Nasser: Refactor pageviews per-article endpoint [analytics/aqs] - https://gerrit.wikimedia.org/r/679398
[19:54:46] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (nray) Open→Resolved >>! In T218835#6966613, @Mholloway wrote:...
[19:54:54] Hi eyener, I tested our instance using firefox and didn't find any firefox-specific issues. Care to link to any specific issues?
[19:55:16] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (nray) a: nray→Mholloway
[20:02:39] Thanks razzi! Unfortunately I can't share out much since our instance is Fundraising-specific and has additional security layers because of that. This helps, though, since it points to something with our internal setup specific to the FR instance
[20:10:28] (PS2) Ottomata: Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372
[20:12:30] (CR) jerkins-bot: [V: -1] Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372 (owner: Ottomata)
[20:18:55] (PS3) Ottomata: Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372
[20:20:50] (CR) jerkins-bot: [V: -1] Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372 (owner: Ottomata)
[20:24:16] (PS4) Ottomata: Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372
[20:25:44] (PS5) Ottomata: Fix bug in Refine where table regexes were not matching properly [analytics/refinery/source] - https://gerrit.wikimedia.org/r/679372
[21:02:35] Analytics, Better Use Of Data, Gerrit-Privilege-Requests, Product-Analytics, Product-Data-Infrastructure: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (Mholloway) p: Triage→Medium
[21:13:55] Analytics, Analytics-EventLogging, Better Use Of Data, Product-Analytics, and 2 others:
Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (Mholloway) Open→Resolved a: Mholloway
[21:14:41] Analytics-Radar, Event-Platform, Product-Data-Infrastructure, Product-Analytics (Kanban): Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (Mholloway)
[21:23:14] * razzi out for a walk
[21:32:34] Analytics, Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (Milimetric) a: lexnasser→None
[21:33:33] (CR) Milimetric: [C: +2] Refactor pageviews per-article endpoint [analytics/aqs] - https://gerrit.wikimedia.org/r/679398 (owner: Lex Nasser)
[22:35:16] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Jdlrobson) > The one remaining implementation is Popups/src/co...
[22:36:59] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Jdlrobson)
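The Refine patch being iterated on above ("table regexes were not matching properly", gerrit 679372) isn't shown in this log, but a common cause of that class of bug is unanchored matching, where a pattern meant to select one table also matches longer table names. A hypothetical shell illustration of the difference (the table names and pattern are made up; this is not the refinery/source code):

```shell
# Hypothetical: match a table-selection pattern against table names.
# Unanchored, 'pageview' also matches 'pageview_hourly'; anchoring with
# ^...$ restricts it to the exact name.
pattern='pageview'
for t in webrequest pageview_hourly pageview; do
  echo "$t" | grep -qE "$pattern"     && echo "unanchored: $pattern matches $t"
  echo "$t" | grep -qE "^${pattern}$" && echo "anchored:   $pattern matches $t"
done
```

With the anchored form only the exact table name `pageview` matches; unanchored, `pageview_hourly` is picked up as well.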