[06:17:19] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) Thanks @Ottomata for your help figure out the issue IPv6 was not supported made use of IPv4. [06:18:06] 10Analytics, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10Aklapper) [06:19:01] 10Analytics, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10Aklapper) Introduced in T138027; T262882 might be a dup? [06:46:23] good morning! [06:46:35] so nothing seems to have exploded after the kafka ferm rules [06:48:23] !log run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 [06:48:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:48:38] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:01:55] * elukey bbiab [07:04:46] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10ArielGlenn) I'll put it in my queue, without cookie-licking it however. It would be best if someone working... [07:12:14] Hi team - Naé's teacher being absent today I'll be mostly off - possibly here during siesta but not more - I'll also miss tonight's meeting as melissa has an appointment - Will send an e-scrum with my last news [07:14:26] ack joal [07:20:08] I am doing the roll restart of both druid cluster for openjdk upgrades [08:06:31] druid analytics done, proceeding with druid public [08:09:55] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:10:12] 10Analytics-Radar, 10Operations, 10ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10elukey) @wiki_willy do we have a high level timeline about when we could have the host back in service? 
We are not in a hurry but it has been down from the end of March :( [08:10:24] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:02] this is weird [08:11:40] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:42] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:44] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:12:12] it seems the same problem as when we drop data [08:12:20] so probably some further tuning is needed [08:13:34] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:16:10] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:27:04] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:29:36] I think that the historicals need to have more connections opened for the brokers [08:29:44] like we did the last time [09:15:03] elukey: So upgrading to Buster, what is the process? More specifically, how are the following done: a) draining the machine b) upgrading (in place? full wipe (user data backups?)? is it actuated through puppet?) c) verification that it worked correctly. [09:16:15] klausman: good morning! [09:16:27] Ah yes, morning :) [09:16:45] so we do upgrades in place, and for the stat100x boxes it is sufficient to alert users before/after [09:16:56] With how much lead time? [09:17:09] most of our users are already aware that this is happening, lead time can be a couple of days max in my opinion [09:17:27] so we could do announce@ today and do the first on tue [09:17:31] err thur [09:17:48] in this case, there are some caveats [09:18:06] namely, we have /home -> /srv/home [09:18:18] and the /srv partition is a big raid 10 that we want to preserve [09:19:02] our dear kormat worked on a partman script and recipe for debian install to preserve the content of parittions [09:19:05] *partitions [09:19:28] A-ha. Good to know we have a scapegoat in the case of dataloss [09:19:29] but in our case, it may or may not work, so we'll need to be careful [09:19:37] (I'll explain in a bit why) [09:19:57] the generic workflow to reimage a bare metal host is (usually) [09:20:01] 1) ssh to cumin1001 [09:20:19] Hrm. 2.8T of data on /srv. 
Sounds like something that could be backed up [09:20:22] 2) run sudo -i wmf-auto-reimage $hostname -p TASKID [09:20:42] in puppet we have netboot.cfg, that lists the partman recipes for hosts [09:20:52] those will be picked up by the debian install process [09:21:14] and the stats to reimage have [09:21:14] stat100[4567]) echo reuse-parts-test.cfg partman/custom/reuse-analytics-stat-4dev.cfg ;; \ [09:22:14] the reuse-parts-test.cfg is handy since it doesn't allow d-i to proceed until a user hits the confirm button [09:22:28] so we can check via mgmt serial console before pulling the trigger [09:22:43] Neat [09:22:57] the main caveat of reuse-analytics-stat-4dev.cfg is that we assume that the kernel assigns md0 to our raid10 [09:23:20] At a previous job, I automated d-i to completely autonomically reimage hosts. Well, one day a coworker plugged the wrong USB stick into their workstation and... [09:23:22] something that should happen in theory, but kormat raised some warnings about it [09:23:48] there is also a reuse-parts.cfg that does not require any confirm of course [09:24:03] that is also great, I used both for the last round of Kafka reimages [09:24:23] so in theory, the work should be as simple as running wmf-reimage [09:24:37] (we'll need to get you to pwstore though, to be able to use mgmt) [09:24:52] in practice, if we wipe /srv it is a problem :D [09:25:11] so what we could do is to rsync/backup the home dirs to another stat100x host [09:25:26] just to be sure [09:25:34] we have plenty of TBs free atm [09:27:23] we have an rsync server on every stat100x host, but it runs as nobody so doing rsyncs is surely going to raise permission issues [09:27:32] the Data Persistence team created https://wikitech.wikimedia.org/wiki/Transfer.py [09:27:41] that I have never used, but it should be neat and easy [09:27:44] pwstore access I have (and confirmed working) [09:28:11] ahhh nice! [09:28:15] then we are all set [09:28:16] can stat hosts ssh to each other? [09:28:32] If so, tar'ing up user dirs might be the better option, permission-wise [09:29:09] Like tar c /home|ssh statX00Z 'cat > backup.tar' [09:29:18] I don't recall if group ids are consistent between stat boxes [09:29:44] tar usually backs up user names and groups, not uid/gid, unless told otherwise [09:30:11] all right then I remembered incorrectly, then it seems fine! [09:30:27] we cannot ssh though, netcat is an option [09:30:43] that is IIRC what transfer.py does (open ports temporarily etc..) [09:30:56] Ok. maybe put in zstd to speed things up a little, at basically zero CPU cost [09:31:09] Oh, there's a script already, perfect [09:31:38] yeah check https://wikitech.wikimedia.org/wiki/Transfer.py, should be handy [09:31:50] it runs from cumin1001 [09:32:29] so, overall the plan should be to [09:32:41] 1) announce the maintenance for stat1004 to our users [09:33:25] 2) back up the home dirs (maybe we could test transfer.py for a subset of the data, then do a pass before the maintenance to have more up to date files) [09:33:30] 3) reimage [09:34:16] the known side effect of upgrading is that python is upgraded to 3.7, so some venvs of users will need to be recreated probably [09:34:25] (like the ones for jupyter notebooks etc..) [09:34:32] 1.5) ask users to clean out old junk from their homes which will cut backups in half... [09:34:32] but it is not a big deal [09:34:50] moritzm: please don't add false hopes [09:34:52] :D [09:35:19] Ok, I'll start drafting a mail. We start with 1004, right?
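(A sketch of the home-dir copy discussed above, assuming the tar + zstd + netcat route rather than transfer.py; the destination host, port and paths are placeholders, not the real values used, and the netcat flags differ slightly between the traditional and OpenBSD variants.)
    # on the destination stat box: listen, decompress, unpack preserving permissions
    mkdir -p /srv/backup/stat1004-home
    nc -l -p 9876 | zstd -d | tar -C /srv/backup/stat1004-home -xpf -
    # on stat1004: stream /srv/home through zstd (nearly free CPU-wise, as noted above) to the listener
    tar -C /srv -cpf - home | zstd -3 | nc stat1008.eqiad.wmnet 9876
    # once the copy is verified, the reimage itself is the command quoted above, run from cumin1001
    sudo -i wmf-auto-reimage $hostname -p TASKID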
[09:35:24] exactly [09:35:43] Is there a way to lock out non-SRE users for the time of the transfer? [09:36:18] the only way that I know is to remove non-sre groups from the admin list in puppet for the host [09:36:27] Hrm. [09:36:29] that automatically removes ssh keys [09:36:32] but it is brutal [09:36:48] On thing I'd also do is disable cron, so it doesn't run stuff in the middle of a transfer [09:37:02] yep seems fine [09:37:10] Not sure if something similar can be done with systemd timers [09:37:14] I think it is fine to say that we'll backup data up to a certain dat [09:37:18] *date [09:37:27] klausman: systemctl stop *.timer [09:37:35] Yeah, and any change after X hour may get lost [09:37:47] exactly, it is perfectly acceptable [09:39:30] Ok, doing some writing :) [09:40:33] perfect :) [09:40:49] I am going to take my lunch break earlier today (need to do some errands) [09:43:29] Ack. Buon appetito! [09:43:47] grazie :) [10:08:01] Hi Hi analytics! Do you have any links to any policies around search terms being private data? :) [12:00:18] addshore: hello! No idea [12:00:31] I'll bring it up to my team later on [12:06:03] in the meantime, I am also restarting cassandra on aqs [12:06:12] (for openjdk) [12:30:20] !log stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto) [12:30:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:59:52] 10Analytics, 10observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10fgiunchedi) [13:24:03] hi teammmmm [13:24:12] holaaa [13:24:30] heyoo [13:28:01] :] [13:43:27] !log restart of hive/oozie/presto daemons on an-coord1001 [13:43:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:43:43] !log re-enable timers on an-launcher1002 after maintenance to an-coord1001 [13:43:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:18] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [14:11:32] so today the roll restart of cassandra caused some client latencies to go up [14:11:54] and this caused restbase alerts for wikifeeds, that uses AQS behind the scenes (TIL) [14:26:05] 10Analytics, 10Analytics-Kanban: analytics.wikimedia.org TLC - https://phabricator.wikimedia.org/T253393 (10mforns) We're keeping track of the design requirements of this task in the doc: https://docs.google.com/document/d/1u79ttV4_RMCJ9LLWqzGlQudlQ4w29i2xtcltQY-hjvM Please, request access if interested. [15:00:07] elukey: wow, so aqs is kind of tier 1.5 now [15:00:21] side note: how on earth do you go from https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/wikifeeds to the code of that repository? [15:00:32] like... I know the code is there somewhere gerrit... WHERE IS DA BUTTON [15:02:33] milimetric: not as easy as a button but changing "r/admin/repos" with "g" takes you there https://gerrit.wikimedia.org/g/mediawiki/services/wikifeeds [15:03:00] milimetric: so Giuseppe and Alex told me, I made the correlation only with graphs.. I think that wikifeeds calls restbase that calls aqs, not directly us [15:03:02] witch! BURN HIM [15:03:23] (Mikhail, not you Luca, you're fine) [15:03:45] (also, thx Mikhail :)) [15:04:13] milimetric: don't burn me! 
the chilly fall weather JUST started, let me have one day with a hoodie [15:04:44] hm, let's see what other options are there. There's drowning? [15:05:13] here at the witch hunting branch of the government we like to be sensitive to our clients [15:05:25] I appreciate that! [15:05:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 3 others: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 (10Ottomata) Canary events are now being produced for... [15:06:57] milimetric: i usually start at [15:06:58] https://gerrit.wikimedia.org/r/plugins/gitiles/?format=HTML [15:07:00] and find the repo from there [15:08:54] javascript:document.location.href=document.location.href.replace(/r\/admin\/repos/,'g') [15:08:57] there, now there's a button [15:09:27] ottomata: interesting, I usually use the BROWSE > Repositories and search for the repo in the filter input [15:09:45] milimetric: 👏 [15:11:45] yeah, that makes more sense, 'cause the repo is linked from everywhere in changes and such [15:13:26] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Gehel) [15:14:41] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Gehel) [15:16:38] stepping afk for a bit! [15:31:55] mforns: was the PR for restbase deployed? this request is still failing https://wikimedia.org/api/rest_v1/metrics/editors/by-country/ro.wikipedia/5..99-edits/2020/07 [15:32:53] nuria: checking [15:33:14] addshore: https://meta.wikimedia.org/wiki/Data_retention_guidelines there some info here as to what constitutes private data [15:34:04] mforns: the request [15:34:07] https://wikimedia.org/api/rest_v1/metrics/editors/by-country/ro.wikipedia/5..99-edits/2020/07 [15:34:12] is still failing [15:34:51] nuria: the same error, plus the route is still not fixed, I imagine it didn't get deployed [15:35:06] I can check with Pchelolo [15:35:23] mforns: please do [15:35:26] nuria: interesting, I was thinking around https://diff.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/ as that identified personal info in searches, so I was wondering to what degree search terms are counted as persoanl info [15:35:27] I didn't deploy yesterday, apologies, didn't get to it [15:35:31] will do right now [15:35:45] no problem Pchelolo, we were just checking [15:35:48] thanks a lot! [15:35:50] addshore: a lot of search data is Pii by mistake [15:36:06] addshore: cause people do write their credit card in teh serach box by mistake [15:36:17] :D classic humans [15:38:26] Pchelolo: super thanks [15:38:54] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10jlinehan) [15:44:07] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10Milimetric) I think it would be mutually beneficial to work on producing the events together, so we can help iterate on the schemas and implement the... [15:48:58] May be late for the standup. 
Running an errand atm, and the queues are longer than expected [15:54:01] klausman: ack [15:54:27] klausman: send e-scrum if you don't attend :) [15:56:31] 10Analytics-Radar, 10Operations, 10ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10wiki_willy) Let me check with @Cmjohnson . He's tied up with PDU upgrades this week, and he's out on vacation half of next week. But let's see if we can at least get a timeframe for you. Thanks,... [16:07:58] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10WDoranWMF) @Nuria From Petr: "Someone just needs to implement these, it shouldn’t be not too hard - 2 schemas already outlined, 2 hook handlers in E... [16:09:09] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10WDoranWMF) >>! In T262205#6463145, @Milimetric wrote: > I think it would be mutually beneficial to work on producing the events together, so we can h... [16:16:20] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) We should coordinate with @jlinehan and @Mholloway on the client error logging change, but yea I think that would work.... [16:35:59] elukey: how direct is the control over sudoers? arbitraty policy entries> [16:36:01] ? [16:36:41] If so, this is what we'd want: %whatevergroup sudoedit /this/file, /that/file [16:36:55] (I think) [16:37:11] klausman: so the rules are all in puppet, in the data.yaml file [16:37:22] (under modules/admin/etc..) [16:37:35] every group has a set of sudo policy that can be deployed etc.. [16:37:41] Ah, right. Then sudoedit might not work to narrow down. But if you can sudo to another user, editing is implied [16:37:59] i.e. 
we would still have shell access, which is likely wanted anyway [16:38:11] please check what we have in there and propose improvements if you think the current setting is not great, I'd love to know your opinion :) [16:38:24] Roger :) [16:40:29] nuria: as FYI https://phabricator.wikimedia.org/T262893 [16:40:39] (I asked for some extra budget to increase ram on an-coord1001) [16:41:03] ping razzi [16:41:20] ping milimetric [16:42:01] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) [17:15:26] milimetric: on https://docs.plus/p/wmf-analytics [17:15:33] milimetric: loading [17:33:21] (03CR) 10Nuria: [C: 03+2] Add a link in the footer to translate Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/626711 (https://phabricator.wikimedia.org/T261502) (owner: 10Paul Kernfeld) [17:34:46] (03Merged) 10jenkins-bot: Add a link in the footer to translate Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/626711 (https://phabricator.wikimedia.org/T261502) (owner: 10Paul Kernfeld) [17:42:19] I'm looking into the geoeditors thing [17:45:16] klausman: sudoedit is so confusing, and seems to have confused everyone on the internets: https://medium.com/@madushan1000/the-dangers-of-sudoedit-c433cbdade83 [17:45:35] I haven't found your explanation there of how it copies to file and just writes it back to the target [17:45:44] *there (I mean the internet) [17:46:16] Huh, interesting [17:46:29] The :e problem never occurred to me [17:46:45] I gues "Editors are just too damn powerful" is the conclusion [17:47:33] I think how you set up rights when guarding against people sudoedit-ing is just different from how you set them up in general, and I can see that it can give you more granularity, but it seems tricky either way [17:48:11] Wait... hang on. [17:48:41] That post is wrong [17:49:04] The vim in question runs not as root, but as the calling user, therefore, :e gives you zip extra access [17:49:42] running sudo sudoedit is something that should be filtered [17:50:33] mforns: apoligies, I didn't notice this in reviews, but this is what will fix analytics endpoints: https://github.com/wikimedia/restbase/pull/1278 [17:50:47] you've used swagger 2 spec, I didn't notice that [17:51:26] ok, that makes sense (re: sudoedit) :) [17:56:58] oh Pchelolo, thanks a lot, I was trying to deduce [17:58:59] Pchelolo: does this need any change from me? [17:59:27] I think no [18:00:06] at some point we should all come together and look into AQS major update [18:02:33] aha [18:03:13] you mean for other endpoints to use swagger3? 
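(To make the sudoedit point above concrete: a hypothetical sudoers rule of the shape klausman sketched. The group name and file paths are invented, and in practice these strings would be carried in the privileges lists in puppet's admin data.yaml rather than written into sudoers by hand.)
    # hypothetical: members of %analytics-ops may edit exactly these two files, nothing more
    %analytics-ops ALL = (root) NOPASSWD: sudoedit /etc/presto/config.properties, sudoedit /etc/hive/hive-site.xml
    # sudoedit copies the target to a temp file and launches $EDITOR as the invoking user,
    # so an editor escape such as vim's :e /etc/shadow only has the caller's own privileges;
    # the blog post's concern applies to running a full editor as root via plain sudo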
[18:03:38] sorry openapi3 [18:22:41] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) p:05Triage→03High [18:23:40] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) a:03mpopov [18:25:54] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) @mpopov If you prepare a schema patch I'll CR and make a patch to bump the version on... [18:43:11] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) In this case...I think this will be a backwards incompatible change, so we should prob... [18:45:48] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) FYI @cdanis is working on {T257527} which has expects to have `http.client_ip` in logs... [18:46:02] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) >>! In T262626#6464002, @Ottomata wrote: > In this case...I think this will be a backw... [18:48:31] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10mpopov) >>! In T262626#6463258, @Ottomata wrote: > We should coordinate with @mpopov and Product... [18:53:32] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) >>! In T262626#6464015, @mpopov wrote: > secondary:/fragment/analytics/common referenc... [19:02:12] nuria: what do we do then with the alarms? I can push a patch that reverts threshold? [19:08:11] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10Milimetric) >>! In T262261#6450148, @JAllemandou wrote: > Checks for discrenpencies (using previous comment setup): > > * page-delete > ... I noticed here th... [19:08:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10elukey) After merging the second change this was the output: ` Notice: /Stage[main]/Geoip::Data::Puppet/File[/usr/share/matomo/misc/GeoIP.dat]/ensure: defined content as '... [19:12:54] mforns: sorry, so many things. 
[19:13:22] mforns: i think having a threshold that also does not alarm when we need to is probably not the best either [19:13:38] mforns: now teh larm is in the combined metric if i remember right [19:13:53] we have alarms on all 4 useragent metrics right now [19:14:46] 10Analytics, 10Analytics-Kanban: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10elukey) The change was reverted, we have a couple of ideas about how to proceed, new patches will follow. [19:14:46] let me look at dashboards again [19:14:50] * elukey afk! [19:15:00] nuria, I think a good approach could be to have both daily and hourly alarms at different thresholds, like we do with traffic metrics [19:15:30] I believe one bad thing about these hourly useragent metrics is that they have both strong daily and strong weekly seasonality [19:16:06] and IIUC the RSVD algorithm only accounts for one of those, the one that is expressed by the matrix size [19:16:38] in the case of the hourly metrics, that means the weekly seasonality is making it harder to extract noise from signal [19:17:36] so, we will only detect very high deviations with the hourly metric, still good because it has shorter alarm time, [19:18:11] but if we used daily metrics additionaly, we could get rid of the double-seasonality (we'd only have weekly seasonality) [19:18:25] mforns: it is not the length of the ts though right? it is the periodicity with which the metric is computed [19:18:36] and the RSVD would be able to remove noise from signal fine (if my hypothesis is correct) [19:18:48] yes, I believe so [19:19:36] mforns: i think is not realted to the matrix size though rather the periodicity [19:19:58] mforns: with which the metric is computed [19:20:02] yes, for RSVD periodicity == matrix size [19:20:17] hello A-team, [19:20:17] I'm having a file issue on stat6 [19:20:17] And am just realizing that work began on stat6, is that right? the stat6 updates are beginning now? [19:20:24] well, one of the size, don't remember if width or height [19:22:10] In short: I was working on a file on stat6 and saved two output dfs to csv and suddenly I don't see the notebook I was working on and am having a hard time getting the csv files to my local via scp [19:23:09] iflorez: there is no work happening on stats6 now [19:23:14] iflorez: it will start on thu [19:23:51] thank you for that update. I'm not sure then why I may be having file trouble in that case. [19:27:19] mforns: mmmm [19:28:06] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10CDanis) >>! In T262626#6464008, @Ottomata wrote: > FYI @cdanis is working on {T257527} which has... [19:32:14] 10Analytics-Radar, 10Event-Platform, 10Platform Engineering: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10Milimetric) >>! In T262203#6459797, @JAllemandou wrote: > |database | rev_id | users | rev_id_count_gt_1 > |enwiki | 977034865 | [104.235... [19:37:50] mforns: I am not sure that reverting the threshold to 10 is of use [19:38:30] mforns_brb: let me think about this a bit more [19:38:44] iflorez: are you scp-ing files to stat1006 or from 1006? 
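(A rough formalisation of the RSVD/seasonality point above, assuming the job follows the usual robust-decomposition recipe of reshaping the series by one assumed period; the exact refinery implementation isn't shown here, so treat this as a sketch.)
    % hourly series y_1, ..., y_{np}, reshaped with the assumed period p as one matrix dimension:
    M \in \mathbb{R}^{p \times n}, \qquad M_{i,j} = y_{(j-1)p + i}
    % robust decomposition into low-rank seasonal signal, sparse anomalies and noise:
    M = L + S + E, \qquad \operatorname{rank}(L) \ll \min(p, n), \quad S \text{ sparse}
    % with hourly data and p = 24 the daily cycle fits into L, but the weekly (period-168) pattern
    % is not aligned with this shape: it inflates L or leaks into S and E and blurs the separation;
    % a daily series with p = 7 leaves only the weekly pattern to model, hence the cleaner signal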
[19:39:03] from stat6 to my local [19:40:21] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) > AFAIK Logstash/Kibana is the only such system we have that fits? Superset + Presto/... [19:49:56] nuria: I never thought of asking what was the problem that generated that anomaly back on 2020-04-20? [19:51:03] I think that a threshold=10 for hourly metrics is good, it will still catch big jumps, like the one we created these alarms for. [19:52:05] and then we can have a daily metric, that will take some more time to alarm, but won't have the double-seasonality problem, and I think will be more sensitive to small changes like the one on 2020-04-20 [19:54:59] mforns: taht makes a lot of sense yes [19:55:16] mforns: i have a meeting in a bit but I will send some patches [19:55:21] mforns: after that [19:55:41] ok [19:55:52] i'll be logging off in a bit [19:58:58] ottomata: Heya - do ou need help from me (ops-week) on camus? [20:02:46] joal: i think it is ok, a run is about to finish up here in 8 minutes [20:02:55] and i'll be able to see if it is actually progressing on those 3 partitions [20:02:59] but something is very weird eh! [20:03:06] ottomata: yes indeed! [20:03:09] why would the offsets be reset on these partitions from this one broker! [20:03:23] same thing that happened in may i guess....except for some reason last time they got stuck [20:03:33] ottomata: I need to write some scorecard, I'll be here in 8 mins - pleae let me know :) [20:03:38] ok ty [20:04:03] also, I'd like as well to understand how come camus restarted reading those partitions from start :( [20:05:11] indeed [20:05:27] i had not ever noticed that they got totally reset like that before [20:05:31] i only noticed that they got stuck [20:05:46] and could not understand what was causing them to get stuck [20:05:54] camus would just fail reading from kafka for some reason [20:07:37] hm :( [20:10:14] mforns: the aqs patch is still not working right? [20:11:18] nuria: they patched it, we used swagger2 syntax instead of openapi3, it should be fixed whenever they deploy it [20:11:53] mforns: can we create a ticket to migrate from swagger to openapi3? [20:12:06] they already did a pull request [20:12:12] pe-tr did [20:16:43] joal: it looks ok; those partitioins are progressing [20:17:31] ok ottomata - I assume there might be some more alerts. I'll triple check tomorrow morning in between kids-games :) [20:17:43] Thanks a lot for looking into that [20:24:16] 10Analytics, 10Event-Platform, 10Platform Engineering, 10Platform Engineering Roadmap Decision Making: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10eprodromou) OK, we're going to move this into our roadmap discussion. I think it's a small project,... 
[20:24:31] 10Analytics-Radar, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team): Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10eprodromou) p:05Triage→03Medium [20:33:18] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:33:42] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:36:36] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:36:53] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, and 2 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10mpopov) [20:55:10] 10Analytics-Radar, 10MW-1.35-notes, 10MW-1.36-notes (1.36.0-wmf.2; 2020-07-28), 10Multi-Content-Revisions (New Features), and 4 others: MCR: Import all slots from XML dumps - https://phabricator.wikimedia.org/T220525 (10eprodromou) 05Open→03Resolved Congratulations and good job! [22:46:16] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: [Entropy alarms] Restrict the RSVD analysis to the last N data-points - https://phabricator.wikimedia.org/T257691 (10Nuria) 05Open→03Resolved [22:54:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria) ping @Pchelolo and @mforns that were mentioning this today, we should probably schedule this work, correct? [22:54:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria) @colewhite I see the patch has no reviewers, should we pick it up? [22:54:11] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria)
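(On the swagger2-vs-openapi3 mixup and the AQS OpenAPI 3 task above, a generic before/after of the response syntax; the path and schema names are invented, this is not the actual AQS or wikifeeds spec.)
    # Swagger 2 style (what the PR accidentally used)
    swagger: "2.0"
    paths:
      /metrics/example:
        get:
          produces: [application/json]
          responses:
            '200':
              description: OK
              schema:
                $ref: '#/definitions/Result'
    # OpenAPI 3 equivalent
    openapi: "3.0.0"
    paths:
      /metrics/example:
        get:
          responses:
            '200':
              description: OK
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/Result'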