[00:23:17] (PS1) Milimetric: Fix typo on properties file [analytics/refinery] - https://gerrit.wikimedia.org/r/657449
[00:23:38] (CR) Milimetric: [V: +2 C: +2] Fix typo on properties file [analytics/refinery] - https://gerrit.wikimedia.org/r/657449 (owner: Milimetric)
[00:32:33] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (Nuria) >in a setting like the one you describe, what would the attacker know, and what wo...
[06:49:06] good morning
[07:15:10] interesting, from the middlemanager logs
[07:15:11] Permission denied: user=druid, access=READ_EXECUTE, inode="/wmf/tmp/druid/0019179-210107075406929-oozie-oozi-W-daily-druid-banner-activity-2021-1-20":analytics:hdfs:drwxr-x
[07:16:38] and indeed the oozie directories are owned by analytics:hdfs
[07:18:56] so in this case, the oozie hive action creates the dir but for some reason it doesn't inherit the druid group
[07:24:14] the other interesting bit is
[07:24:15] org.apache.hadoop.security.AccessControlException: Permission denied: user=druid, access=READ_EXECUTE, inode="/tmp/search_satisfaction_daily_2021-01-20":analytics-search:hdfs:drwxr-x---
[07:24:19] 111807- at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(
[07:30:21] ---
[07:34:48] I checked the other similar dirs under /tmp, and they have other bits set, as we discovered when we checked what was failing and what not
[07:35:39] Good morning
[07:36:08] Meh
[07:36:37] bonjour :)
[07:41:08] elukey: it seems that the assumption that hive was keeping parent user/group was wrong /*facepalm*/
[07:42:46] https://cwiki.apache.org/confluence/display/Hive/Permission+Inheritance+in+Hive
[07:42:57] That's what I'm reading as well
[07:42:58] External table directory inherits from parent directory.
[07:43:29] and the inherit.blabla setting was added in hive 0.9
[07:45:13] elukey: hive.warehouse.subdir.inherit.perms - Default Value: false
[07:45:28] Could it be?
[07:45:48] I was checking but I thought it was true for our version
[07:46:12] elukey: I thought that behaviour is what we had experienced :(
[07:48:29] joal: I think that we may have been tricked by the /tmp directory
[07:51:20] for example
[07:51:21] drwxrwxrwt 2 analytics hdfs 4096 Dec 20 01:05 /mnt/hdfs/tmp/0043755-201202074829419-oozie-oozi-W
[07:51:26] joal: --^
[07:51:56] but /tmp is owned by hdfs:hdfs
[07:52:04] so it kinda makes sense
[07:53:25] but it may have covered up what hive really does
[07:54:26] also the subdir.inherit.perms flag is, IIUC, only related to u-g-o perms, not ownership
[07:54:33] since Hive follows the BSD rule
[07:55:32] joal: I am not sure what Dan tested but afaics the directories are created by oozie no?
[07:55:47] elukey: They are created by Hive I think
[07:56:03] with the name of the oozie workflow?
[07:56:09] yes
[07:56:29] destination_directory=${temporary_directory}/${wf:id()}-hourly-druid-webrequests-${year}-${month}-${day}-${hour}
[07:56:46] and this into an .hql file
[07:57:06] nope, in oozie launcher, config parameter for a HQL file
[07:57:56] okok but it goes into an .hql file, so no interference from oozie
[07:58:04] shouldn't be
[07:58:25] elukey: I'm running a manual test now
[07:59:33] elukey: I created an external table in hive the same way oozie would have - look at /wmf/tmp/druid
[07:59:41] * joal cries in a corner
[08:00:53] ok, doing some more tests
[08:02:02] ah snap
[08:02:22] we can check in hadoop test with/without the subdir setting true/false
[08:03:10] need to step away for 10/20 mins, bbl
[08:03:10] elukey: from what I am experiencing now, it seems that oozie-hive (or oozie-beeline to be precise) doesn't behave the same way as user-beeline
[08:03:21] * elukey cries
[08:03:30] it is always oozie's fault :D
[08:03:31] later elukey
[08:28:15] back
[08:45:48] joal: any news?
[08:45:58] or anything that I can do to help
[08:46:21] elukey: I'm testing to try to understand what differs between hive-on-oozie and what is expected from us
[08:46:32] elukey: I can show you what I have so far
[08:46:44] cave?
[08:46:49] if you want yes but I don't want to slow you down, it is fine if you want to debug alone :)
[08:47:01] brainstorming will help
[08:47:39] ack!
[08:47:42] bc?
[08:47:49] yessir
[09:45:16] elukey: one thing I don't get
[09:46:18] elukey: how come hive manages to create the folder with correct ownership at table creation, but not to change permissions later on - This I don't get
[09:51:32] joal: will check
[09:54:09] joal: isn't it because the analytics user has write perms on the dir?
[09:54:18] the tmp druid dir I mean
[09:54:52] then the new dir uses the bsd rule for ownership, and it works
[09:55:20] elukey: hive started by analytics-user manages to create the folder with correct ownership (including druid group), but doesn't manage to change the perms of overwritten folder (because of druid group) - WEIRD!
[09:56:04] joal: changing the ownership is more tricky since in the general use case, if anybody were able to do it some attacks would be possible
[09:56:45] so I think that hdfs lets you create something new, but not to change ownership to something existing
[09:59:30] ok
[10:03:36] elukey: I'm making another test, I had another idea
[10:04:15] (PS4) Awight: Update schema with core bucket labels [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[10:12:45] email sent elukey
[10:14:20] (CR) Awight: [C: +1] Update schema with core bucket labels (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[10:17:12] joal: yep read it, +1
[11:15:20] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/657547/1/modules/profile/manifests/analytics/cluster/users.pp
[11:15:35] it should be enough, never really tried anything similar
[11:15:40] but it is not horrible
[11:19:35] !log block UA with 'python-requests.*' hitting AQS via Varnish
[11:19:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:51:19] heya team, joining early today
[11:51:52] do you guys need help with ops stuff?
[11:54:45] elukey, joal ^
[11:58:24] hey mforns, I think that we are pending a review of what we discovered + the puppet fix
[11:58:39] if it makes sense etc..
[11:58:47] ok, reading
[12:05:53] elukey: read the scrollback and the email, and it makes sense to me.
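The ownership behaviour discussed above (the "bsd rule", the READ_EXECUTE denial for user=druid, and the difficulty of changing the group of an existing directory) can be illustrated with a toy model. This is only a sketch under assumptions: it mimics the rules as described in the HDFS permissions guide (owner = creating user, group = parent's group; a non-superuser may change the group only on inodes it owns and only to a group it belongs to), it is not HDFS code, and the parent directory being owned by analytics:druid is an assumption inferred from the conversation. The names Inode, create_dir, can_list and can_chgrp are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # POSIX-style bits, e.g. 0o750 for drwxr-x---


def create_dir(parent: Inode, user: str, mode: int) -> Inode:
    """BSD rule: owner is the creating user, group is copied from the parent."""
    return Inode(owner=user, group=parent.group, mode=mode)


def can_list(inode: Inode, user: str, user_groups: set) -> bool:
    """READ_EXECUTE check: use owner bits, else group bits, else other bits."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7
    elif inode.group in user_groups:
        bits = (inode.mode >> 3) & 0o7
    else:
        bits = inode.mode & 0o7
    return bits & 0o5 == 0o5


def can_chgrp(inode: Inode, user: str, user_groups: set, new_group: str) -> bool:
    """A non-superuser must own the inode and be a member of the target group."""
    return user == inode.owner and new_group in user_groups


# The failing directory from the middlemanager log: analytics:hdfs drwxr-x---
failing = Inode(owner="analytics", group="hdfs", mode=0o750)
print(can_list(failing, "druid", {"druid"}))

# Assumed parent analytics:druid: a fresh child inherits the druid group.
parent = Inode(owner="analytics", group="druid", mode=0o750)
child = create_dir(parent, "analytics", 0o750)
print(child.group, can_list(child, "druid", {"druid"}))
```

Under this model, a user who is not in the druid group can create a druid-owned directory through inheritance but cannot chgrp an existing one to druid, which may be why adding the analytics user to the druid group (the puppet change linked above) was proposed.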
[12:06:30] mforns: super thanks :)
[12:06:43] I think we can wait for Dan/Andrew to judge if this is ok or not
[12:07:31] k
[12:18:01] Analytics, Event-Platform: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (mforns) Starting the migration of the schema now!
[12:18:12] Analytics, Event-Platform: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (mforns) Starting the migration of the schema now!
[12:19:01] going afk for lunch! bbl
[12:19:30] elukey: !
[12:19:35] If you're not yet gone
[12:21:17] gone already :)
[12:28:20] (CR) WMDE-Fisch: [C: +1] "Thanks for the fixes." (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[12:32:36] (PS1) Mforns: Add MobileWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657561 (https://phabricator.wikimedia.org/T267347)
[12:33:16] (CR) jerkins-bot: [V: -1] Add MobileWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657561 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[12:37:52] joal: can I help in sth, given that Luca is afk?
[12:38:18] Hi mforns - I tried to reach to an ongoing Luca but failed :)
[12:38:55] mforns: I wanted to ask if we could try the group change before he leaves, allowing me to test that queries work as expected with hive
[12:39:01] mforns: no big deal, it'll wait :)
[12:39:06] ok ok
[12:39:25] Thanks for asking :)
[12:39:33] mforns: how went your meeting yesterday?
[12:41:53] joal: good, we discussed a lot of stuff, didn't get deep into the sampling rate study, but I will discuss with bearloga further about the error and sampling. In any case, as suggested by him, as we can now change sampling rates just with config changes, it will be easy to start at a safe sampling rate like 1/100 and then move on to 1/10. And in the meantime, we can improve the error rate study to give an accuracy number together with the data.
[12:42:21] ack mforns
[12:44:06] joal: BTW, bearloga liked the idea of interpolating session length measurements between minute marks with a uniform distribution, he actually pasted a link to wikipedia with the explanation to the actual technique and paper: https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
[12:44:48] \o/
[13:03:18] Analytics, Event-Platform, Patch-For-Review: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (mforns)
[13:17:32] (PS1) Mforns: Add DesktopWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657570 (https://phabricator.wikimedia.org/T271164)
[13:18:13] (CR) jerkins-bot: [V: -1] Add DesktopWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657570 (https://phabricator.wikimedia.org/T271164) (owner: Mforns)
[13:23:53] good morning team :)
[13:24:05] Hi fdans
[13:42:58] Analytics, Event-Platform, Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (mforns)
[13:43:11] heya fdans :]
[14:00:29] Analytics, Analytics-Kanban, Event-Platform, Structured-Data-Backlog, Patch-For-Review: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351 (mforns)
[14:07:02] thanks ottomata! BTW, the schemas-event-secondary repo is failing tests in jenkins, because the analytics/legacy/test schema
[14:07:19] It's weird, is that expected?
[14:07:30] that schema didn't change since last year
[14:07:38] not expected at all, very strange
[14:07:42] will look in just a few mins
[14:07:46] k
[14:15:33] weird mforns no idea why that would happen
[14:15:57] ottomata: there's an incompat. between 1.0.0 and 1.1.0/1.2.0
[14:16:05] the event field is not required in 1.0.0
[14:16:21] it makes sense to me that it's failing now, but why wasn't it failing before?
[14:16:29] and how can we fix it?
[14:16:42] maybe we can just re add it as required....looking
[14:17:35] weird...
[14:17:40] it added event as required after 1.0.0
[14:17:56] mforns: it looks like it is sorting them incorrectly
[14:18:03] the other tests say e.g.
[14:18:03] 1.1.0 must be compatible with 1.0.0
[14:18:10] this says
[14:18:10] 1.0.0 must be compatible with 1.2.0
[14:18:15] aha
[14:18:29] mforns: dunno what's up
[14:18:33] don't have time to figure it out now
[14:18:37] let's skip jenkins ci
[14:18:40] ok
[14:18:51] (CR) Ottomata: [C: +1] Add MobileWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657561 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[14:19:03] (CR) Ottomata: [C: +1] Add DesktopWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657570 (https://phabricator.wikimedia.org/T271164) (owner: Mforns)
[14:19:04] mforns:
[14:19:09] +1 but haven't checked schema
[14:19:14] we should probably add a step there
[14:19:23] (CR) jerkins-bot: [V: -1] Add MobileWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657561 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[14:19:24] to temporarily disable the test skipping thing for these schemas
[14:19:26] locally
[14:19:28] and run tests, right?
[14:19:38] to make sure most things pass, like examples validating with schema
[14:19:39] no problem, have tested for errors other than camelcase
[14:19:43] nice
[14:19:44] yes yes
[14:19:45] perfect
[14:19:50] yeah we should make that an explicit step in our migration
[14:19:50] (CR) jerkins-bot: [V: -1] Add DesktopWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657570 (https://phabricator.wikimedia.org/T271164) (owner: Mforns)
[14:19:50] ok, will move forward
[14:19:52] will try to add that
[14:19:54] great, thank you
[14:19:58] i gotta prep for book club! :)
[14:19:58] thank you!
[14:20:07] k, good meeting!
[14:31:13] mforns: yeah the Kaplan-Meier estimator is used to estimate the survival function, survival being time-to-event where event can be anything (death, relapse) -- in our case "end of session". the K-M curve is nice because it provides a mapping to/from time T and % of sessions that survive up to time T, including a confidence interval!
[14:31:18] mforns: check out https://lifelines.readthedocs.io/en/latest/Survival%20analysis%20with%20lifelines.html
[14:31:44] thanks bearloga :]
[14:42:46] ottomata: o/ is it ok if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/657547 ?
[14:50:48] ok I take it as yes and in case I'll revert :)
[14:53:28] Notice: /Stage[main]/Profile::Analytics::Cluster::Users/User[analytics]/groups: groups changed analytics-privatedata-users to ['analytics-privatedata-users', 'druid']
[14:53:37] elukey@an-master1001:~$ id analytics
[14:53:37] uid=497(analytics) gid=497(analytics) groups=497(analytics),731(analytics-privatedata-users),499(druid)
[14:53:50] so it seems looking good
[14:54:07] joal: --^
[15:00:06] ack elukey !
[15:00:13] testing queries now
[15:00:23] joal: gimme 5 mins to roll it out
[15:00:27] (to all nodes etc..)
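The Kaplan-Meier estimator bearloga describes at 14:31 (mapping a time T to the share of sessions that survive up to T) can be sketched in a few lines of plain Python. This is a minimal illustration, not the team's method: the function name kaplan_meier and the sample session data are made up, "observed" is assumed to mean the end of the session was actually seen (False marks a censored session that was still open when measurement stopped), and the confidence interval mentioned in the chat is left out.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time t where at least one session end was observed.

    S(t) is the running product of (1 - d_i / n_i), where d_i is the number of
    observed ends at time t_i and n_i the number of sessions still at risk
    just before t_i.
    """
    ends = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # observed ends and censored sessions alike
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = ends.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # everyone leaving at t is no longer at risk
    return curve


# Made-up session lengths in minutes; False = session still open when last seen.
durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

For real analysis the lifelines library linked above is the more sensible choice, since it also provides confidence intervals.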
[15:00:28] sure elukey
[15:09:01] ok joal ready to test
[15:09:17] ack - I'll be slow, I'm in book-club meeting
[15:09:24] ah snap sure!
[15:09:33] sorry for the ping
[15:17:07] !log Kill mediawiki-wikitext-history-wf-2020-12 as it was stuck and failed
[15:17:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:19:50] elukey: I confirm it's all good! both manual and oozie test succeeded
[15:20:43] Thanks a lot elukey for making the group stuff happen :)
[15:21:13] yessssss \o/
[15:21:26] all right I am going to update the email thread so we can unblock miriam_
[15:21:30] err milimetric :D
[15:22:02] Ack elukey - he's in the meeting with me, so no rush :)
[15:22:30] I have meetings after this, so if something's on fire I can step out for a minute. What's on fire?
[15:22:52] I can also restart all the druid jobs, the commands are all ready
[15:23:07] elukey: shall I do that? ^
[15:24:04] milimetric: nono let's do it later, there is no rush :)
[15:24:06] nothing on fire
[15:30:25] Analytics: Remove support for the (deprecated) Druid datasources (in favor of Druid Tables) on Superset - https://phabricator.wikimedia.org/T263972 (elukey) Today I made a little test in our staging instance, namely commenting `DRUID_IS_ACTIVE = True` in the `superset_config.py` file. This caused the followi...
[15:49:57] (CR) Mforns: [V: +2 C: +2] Add MobileWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657561 (https://phabricator.wikimedia.org/T267347) (owner: Mforns)
[15:50:16] (CR) Mforns: [V: +2 C: +2] Add DesktopWebUIActionsTracking to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/657570 (https://phabricator.wikimedia.org/T271164) (owner: Mforns)
[15:57:34] hi, I'm about to reboot the Kerberos servers, just a headsup, don't expect any issue, since those are redundant
[15:58:11] * elukey braces for impact
[15:59:03] ottomata: do we consume kafka events in MediaWiki anywhere currently?
[16:01:58] Analytics, Event-Platform, Patch-For-Review: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 (Gilles) @Ottomata the migration appears to have broken our daemon subscribed to Kafka updates for these schemas: https://github.com/wikimedia/performance-navtiming...
[16:03:13] addshore: no, and there are difficulties doing so because the PHP kafka clients are not great.
[16:03:52] addshore: for MW events as source of truth purposes, some kind of Change data capture is probably the way to go
[16:03:52] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (mforns)
[16:03:54] https://phabricator.wikimedia.org/T120242
[16:06:39] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (mforns)
[16:07:24] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:07:36] getting the events back into mediawiki somehow is one of the big parts of the wikidata / wikibase changes system i guess. But there would be other ways of doing it, such as having another system listen to the stream, triggering apis of the mediawiki that should be acting upon the events, but that all just starts to get messy again
[16:08:22] addshore: i suspect that in the near future (years?) most of the use cases for event archs will be in read domains
[16:08:29] for mw
[16:08:38] e.g. materialized views of data generated by mw
[16:08:58] what would be making changes that you'd like to get back into wikidata/wikibase?
[16:09:31] well, it's exactly that, a materialized view in mediawiki (parsed page) using data generated by wikidata
[16:09:47] so perhaps it is the same problem, that will just be fixed by such a solution
[16:10:13] yaaa exactly! that i think is more tractable, esp if we solve https://phabricator.wikimedia.org/T120242
[16:10:42] the 2 parts are, wikipedia pages contain data from wikidata, that data could come from a stream rather than from the db, and also the page needs to be purged when the data changes, again, stream
[16:11:15] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:12:20] ya addshore if we get https://phabricator.wikimedia.org/T120242, we should be able to rely on a stream of data from MW to be consistent, and be able to create materialized views that we can rely on being up to date and (eventually) consistent with what is in MW MySQL
[16:12:47] <3
[16:14:45] (PS2) Awight: [WIP] Segment CodeMirror metrics by user edit count (sql) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T269986)
[16:15:34] krb reboots completed
[16:17:39] Analytics, Event-Platform, Patch-For-Review: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (mforns)
[16:17:49] Analytics, Event-Platform, Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (mforns)
[16:18:04] (PS3) Awight: [WIP] Segment CodeMirror metrics by user edit count [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T269986)
[16:19:51] moritzm: \o/
[16:21:42] very good war test, restarting the krb nodes doesn't cause any issue
[16:24:37] hmm addshore, 'parsed pages' are not stored in mw mysql right now anyway, right? those currently are even 'materialized views' in RESTBase IIUC?
[16:25:22] RESTBase + change-prop + jobqueue are event driven, just not consistently, and are more like a cache that can be stale and that's ok?
[16:25:59] (PS1) Awight: [WIP] Use edit count bucket sent by TemplateWizard [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T269986)
[16:27:26] joal: want to do the hand off of the ops week before standup? I have a meeting right after grooming, and this way we don't have to wait.
[16:27:44] sure - I have 3 minutes!
[16:27:49] cave!
[16:28:32] mforns: --^&
[16:29:10] joal: omw!
[16:35:53] Analytics, Performance-Team: Sharp drop of navtiming daemon metrics report rate on 2021-01-21 - https://phabricator.wikimedia.org/T272613 (Gilles) p:Triage→High
[16:36:19] (PS1) Awight: [WIP] Update event bucketing for visualeditor events [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657635 (https://phabricator.wikimedia.org/T269986)
[16:41:14] Analytics, Event-Platform, Patch-For-Review: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 (Ottomata) ? Weird. no your client should be working just fine with the change. The events are flowing into Kafka just fine: e.g. https://grafana-rw.wikimedia.or...
[16:45:37] a-team: I have to go to get my car after the current meeting (that ends in ~15 mins), so I'll join late standup (or get only to the other meeting). Will send e-scrum, sorry, but if I don't go now I'll not get my car :(
[16:46:14] don't worry! gogo
[16:47:17] Analytics, Performance-Team: Sharp drop of navtiming daemon metrics report rate on 2021-01-21 - https://phabricator.wikimedia.org/T272613 (Ottomata) From: https://phabricator.wikimedia.org/T271166#6765916 ? Weird. no your client should be working just fine with the change. The events are flowing into K...
[16:47:37] gilles: o/
[16:48:11] hi, I'm on a call and my yubikey won't prompt me for PIN so can't SSH into the host and check the service error logs yet
[16:48:18] https://phabricator.wikimedia.org/T272613#6765938
[16:48:21] I'll try rebooting my laptop
[16:48:29] after the call...
[16:49:03] ah, hadn't seen the end of your comment
[16:49:12] !log installed libsnappy-dev and python3-snappy on webperf1001
[16:49:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:49:33] worth putting on webperf2001 as well
[16:49:52] not sure if that is the fix yet, will comment if i figure it out in the next 10 mins before my meetings :)
[16:51:19] i think that did it
[16:51:22] will puppetize
[16:51:32] awesome, thanks for looking into it so quickly!
[17:01:26] Analytics, Performance-Team, Patch-For-Review: Sharp drop of navtiming daemon metrics report rate on 2021-01-21 - https://phabricator.wikimedia.org/T272613 (Ottomata) I think ^ did it!
[17:01:46] gilles: in case you didn't know, SRE changed bastions this week
[17:01:55] if you haven't updated your ssh config that could be why you can't log in
[17:02:07] https://wikitech.wikimedia.org/wiki/Bastion
[17:02:10] ah, that might be what was messing with my SSH, thanks
[17:02:25] lexnasser: I found it !!! https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
[17:02:30] it was trying to connect to bast3004
[17:02:57] lexnasser: drawing in data model section
[17:06:04] Analytics, Performance-Team, Patch-For-Review: Sharp drop of navtiming daemon metrics report rate on 2021-01-21 - https://phabricator.wikimedia.org/T272613 (Gilles) Open→Resolved a:Gilles It does look fixed, thanks!
[17:06:51] Analytics, Event-Platform, Patch-For-Review: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 (Ottomata) Fixed in {T272613}, python3-snappy needed to be installed for navtiming.py.
[17:12:54] Analytics: Newpytyer python kernels - https://phabricator.wikimedia.org/T272313 (Ottomata) > the intention to have people create spark sessions manually in a notebook Quick response, this ^ is the intention, using [[ https://github.com/wikimedia/wmfdata-python/pull/15 | wmfdata-python ]] to aid in SparkSes...
[17:23:24] Hi all!
[17:23:41] I’m running into a ”Connection to pypi.org timed out” error on stat1005 when trying to install https://github.com/wikimedia/wmfdata-python, any way someone could help me out with that?
[17:27:23] Andrew-WMDE: Hi - we're in meeting now - we'll be slow to answer :)
[17:28:00] Andrew-WMDE: It seems related to proxy not being set (https://wikitech.wikimedia.org/wiki/HTTP_proxy)
[17:31:33] That did the trick, thank you!
[17:31:41] \o/ Andrew-WMDE :)
[17:35:17] :D
[17:37:21] Analytics, Analytics-Kanban: Check home/HDFS leftovers of kaldari - https://phabricator.wikimedia.org/T271089 (JAllemandou) a:JAllemandou→razzi Reassigning, files to be checked with Luca.
[17:37:50] Analytics, Analytics-Kanban: Check home/HDFS leftovers of dcipoletti - https://phabricator.wikimedia.org/T271092 (JAllemandou) a:JAllemandou→razzi Reassigning for folders deletion (no file)
[17:42:08] Analytics, Analytics-Kanban, Event-Platform: Some refined events folders contain no data while they should - https://phabricator.wikimedia.org/T272177 (fdans) p:Triage→High a:mforns
[17:49:32] Analytics: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (fdans) p:Triage→Medium
[17:52:50] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (mforns)
[17:53:00] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (mforns)
[17:54:37] Analytics-Kanban, Better Use Of Data, Product-Analytics, Product-Infrastructure-Data: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (mforns)
[17:57:55] Analytics, Analytics-Kanban, Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (fdans) a:JAllemandou→elukey
[18:00:52] Analytics-Clusters, Patch-For-Review: Kerberos credential cache location - https://phabricator.wikimedia.org/T255262 (fdans)
[18:09:47] Analytics, Analytics-Kanban, Event-Platform, EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) a:Ottomata→elukey
[18:20:39] Analytics, Analytics-Kanban, Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (Milimetric) ping @Pchelolo: what's the latest plan on this?
[18:21:53] Analytics, Analytics-Kanban, Growth-Team, Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (Milimetric)
[18:21:57] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Milimetric)
[18:24:12] Analytics, Anti-Harassment, CheckUser, Privacy Engineering, and 2 others: SPIKE: consider all problems that might happen when we handle Google's privacy changes - https://phabricator.wikimedia.org/T265057 (fdans)
[18:26:13] Analytics, Analytics-Kanban, Event-Platform, EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata) Oo we'll also want eventstreams-internal.svc.* LVS set up too.
[18:27:42] mforns: could you help me with the deployment window today?
[18:27:53] i'll be in interview for the next hour, and the window starts in 30 mins
[18:28:06] ottomata: yes, I'm subscribed to it
[18:28:22] great, they'll ask to make sure i'm avail for the deploy, just tell them that you are
[18:28:29] ottomata: however, I'll be in a meeting at the same time, I will tell the people in the meeting that I might have to leave
[18:28:33] if they ok
[18:28:41] they usually want the requestor present for the deploy
[18:28:42] ottomata: ack for es-internal! If you have a moment today can you check if staging looks ok? If so I'll proceed with adding TLS + deploy to prod asap
[18:28:50] elukey: will do
[18:29:23] razzi: bc in 10 mins?
[18:29:39] ottomata: I'll be in the ops channel, and say I'm representing you as well, do I need to know anything about your patches?
[18:29:39] in the meantime if you want we can kick off the druid public reboots
[18:29:41] mforns: if https://grafana-rw.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=PrefUpdate&from=1611167374651&to=1611253774651 stops
[18:29:42] things are broken
[18:29:52] it modifies the PHP server side for EL
[18:29:57] but should just work the same for old events
[18:29:58] ottomata: will you be pingable?
[18:30:08] i will but in interview so might lag in response
[18:30:12] ok
[18:30:21] no problemo
[18:43:21] elukey: I'm available for bc now
[18:45:42] razzi: joining
[18:54:20] !log rebooting nodes for druid public cluster via cookbook
[18:54:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:59:07] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:00:24] ok this is due to druid reboot I assume -^
[19:00:35] joal: yep, we are aware :)
[19:00:44] ack razzi
[19:00:57] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:01:45] ok
[19:01:47] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:02:13] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:16:19] elukey: I pushed a last version of the hdfs-cleaner patch - it should be ready to go once DataframeToDruid timers have been updated
[19:20:45] joal: ack will review in a sec
[19:21:08] joal: I am wondering - should we keep less snapshots on druid cluster and replicate segments more? Like 3 times?
[19:21:38] elukey: a very worthy idea!
[19:21:44] We easily could
[19:21:51] Analytics, Analytics-Kanban: Check home/HDFS leftovers of dcipoletti - https://phabricator.wikimedia.org/T271092 (razzi) Open→Resolved Dropped /srv/home/dcipoletti.
[19:22:17] joal: I'll open a task :)
[19:24:29] milimetric: I think that the restart of all cassandra jobs at once is brutal for cassandra, I did the same the other day
[19:24:42] oof, sorry
[19:25:01] cassandra punished me too
[19:27:51] Analytics, Analytics-Kanban: Check home/HDFS leftovers of kaldari - https://phabricator.wikimedia.org/T271089 (razzi) Open→Resolved Removed /srv/home/kaldari.
[19:31:27] ebernhardson: quick question - you have oozie workflow in prep mode for query_clicks_hourly old dates (2021-01-05 and 07) - May I kill them?
[19:33:57] joal: hmm, checking
[19:35:15] joal: should be fine, not sure i understand how those came about.
[19:36:11] ebernhardson: I think jobs got suspended when the cluster has been roll-restart, and resume sometimes generate problems
[19:36:23] ahh, ok
[19:36:26] I had it backwards, tried to be gentle with druid, and was not so gentle with cassandra
[19:36:31] ebernhardson: If data is there, let's kill them, otherwise, let rerun
[19:36:50] milimetric: those error really are from new jobs failing?
[19:37:06] milimetric: I thought they were from old jobs
[19:37:39] joal: yea some other version of the job must have run, our downstream stuff all ran and i see data in hdfs
[19:37:45] ottomata: I see events coming in from prod for SuggestedTagActions
[19:38:01] ok perfect ebernhardson - killing the leftovers thanks :)
[19:38:07] nice!
[19:38:25] joal: will merge tomorrow morning with you ok?
[19:38:27] mforns: do you need help testing anything?
[19:38:30] * elukey afk!
[19:38:32] ottomata: curious about how you test server side changes from mwdebug1002?
[19:38:35] i show you!
[19:38:36] sure elukey - bye
[19:38:36] bc?
[19:38:43] joal: I think I only saw SLAs, which were indeed from the restarts. But the aqs outages I guess are related to me hammering Cassandra
[19:38:45] ok, omw!
[19:39:11] milimetric: really? I thought it was because of druid restarts
[19:39:16] * joal is confused :)
[19:40:32] I mean... Luca seemed to imply above that cassandra got hit, so maybe it's what caused him to restart AQS? But I'm looking at the jobs and all the hourly ones that ran seemed to have succeeded, and all the daily ones that are scheduled have not run yet (I scheduled them for the 21st as the 20th was done)
[19:40:58] actually, so it's really just one: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0020689-210107075406929-oozie-oozi-C/
[19:41:13] we'll clear it up later with Luca :)
[19:43:41] ok mforns with that backport deployed, we should be able to move forward with EL PHP schemas
[19:43:51] right!
[19:43:56] I'm going to wait until Monday I think to do any, just to let this bake for a bit
[19:44:04] but at least i can start monday morning! :)
[19:50:10] ottomata: would you have time now?
[19:50:29] OHhh joal i was about to go outside for 10 mins before the sun goes down!
[19:50:39] in 15 mins ok?
[19:50:42] i know it's late for you
[19:50:46] ottomata: all good
[19:50:57] i'll ping you when i'm back in case you are still on
[19:57:43] makes sense ottomata
[20:17:29] joal: back
[20:17:53] ottomata: I started something else - 5mins?
[20:19:46] yup!
[20:25:37] ottomata: ready!
[20:25:40] bc?
[20:25:43] ottomata: yessir
[20:29:45] * razzi afk to eat. Call me an-luncher
[20:31:56] * joal likes that joke :) --^
[20:32:25] Analytics-Radar, DBA: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (jcrespo) dbstore1004 again at 90% memory usage.
[20:46:06] hahaha
[20:47:36] Gone for tonight
[20:51:00] Analytics, Event-Platform, MW-1.36-notes (1.36.0-wmf.28; 2021-01-26), Patch-For-Review: QuickSurveyInitiation Event Platform Migration - https://phabricator.wikimedia.org/T271165 (Ottomata)
[20:51:06] Analytics, Event-Platform, Patch-For-Review: QuickSurveysResponses Event Platform Migration - https://phabricator.wikimedia.org/T271166 (Ottomata)
[21:15:31] Analytics-Clusters: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (Ottomata)
[21:48:10] Analytics, Analytics-Kanban, Event-Platform, EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (Ottomata) @elukey it works! I realized that since this service is not pr...
[22:18:15] Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (razzi)