[06:17:19] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) Thanks @Ottomata for your help figure out the issue IPv6 was not supported made use of IPv4. [06:18:06] 10Analytics, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10Aklapper) [06:19:01] 10Analytics, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10Aklapper) Introduced in T138027; T262882 might be a dup? [06:46:23] good morning! [06:46:35] so nothing seems to have exploded after the kafka ferm rules [06:48:23] !log run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 [06:48:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:48:38] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:01:55] * elukey bbiab [07:04:46] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10ArielGlenn) I'll put it in my queue, without cookie-licking it however. It would be best if someone working... [07:12:14] Hi team - Naé's teacher being absent today I'll be mostly off - possibly here during siesta but not more - I'll also miss tonight's meeting as melissa has an appointment - Will send an e-scrum with my last news [07:14:26] ack joal [07:20:08] I am doing the roll restart of both druid cluster for openjdk upgrades [08:06:31] druid analytics done, proceeding with druid public [08:09:55] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:10:12] 10Analytics-Radar, 10Operations, 10ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10elukey) @wiki_willy do we have a high level timeline about when we could have the host back in service? 
We are not in a hurry but it has been down from the end of March :( [08:10:24] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:02] this is weird [08:11:40] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:42] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:11:44] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:12:12] it seems the same problem as when we drop data [08:12:20] so probably some further tuning is needed [08:13:34] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:16:10] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:27:04] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:29:36] I think that the historicals need to have more connections opened for the brokers [08:29:44] like we did the last time [09:15:03] elukey: So upgrading to Buster, what is the process? More specifically, how are the following done: a) draining the machine b) upgrading (in place? full wipe (user data backups?)? is it actuated through puppet?) c) verification that it worked correctly. [09:16:15] klausman: good morning! [09:16:27] Ah yes, morning :) [09:16:45] so we do upgrades in place, and for the stat100x boxes it is sufficient to alert users before/after [09:16:56] With how much lead time? [09:17:09] most of our users are already aware that this is happening, lead time can be a couple of days max in my opinion [09:17:27] so we could do announce@ today and do the first on tue [09:17:31] err thur [09:17:48] in this case, there are some caveats [09:18:06] namely, we have /home -> /srv/home [09:18:18] and the /srv partition is a big raid 10 that we want to preserve [09:19:02] our dear kormat worked on a partman script and recipe for debian install to preserve the content of parittions [09:19:05] *partitions [09:19:28] A-ha. Good to know we have a scapegoat in the case of dataloss [09:19:29] but in our case, it may or may not work, so we'll need to be careful [09:19:37] (I'll explain in a bit why) [09:19:57] the generic workflow to reimage a bare metal host is (usually) [09:20:01] 1) ssh to cumin1001 [09:20:19] Hrm. 2.8T of data on /srv. 
Sounds like something that could be backed up [09:20:22] 2) run sudo -i wmf-auto-reimage $hostname -p TASKID [09:20:42] in puppet we have netboot.cfg, that lists the partman recipes for hosts [09:20:52] those will be picked up by the debian install process [09:21:14] and the stats to reimage have [09:21:14] stat100[4567]) echo reuse-parts-test.cfg partman/custom/reuse-analytics-stat-4dev.cfg ;; \ [09:22:14] the reuse-parts-test.cfg is handy since it doesn't allow d-i to proceed until a user hits the confirm button [09:22:28] so we can check via mgmt serial console before pulling the trigger [09:22:43] Neat [09:22:57] the main caveat of reuse-analytics-stat-4dev.cfg is that we assume that the kernel assigns md0 to our raid10 [09:23:20] At a previous job, I automated d-i to completely autonomically reimage hosts. Well, one day a coworker plugged the wrong USB stick into their workstation and... [09:23:22] something that should happen in theory, but kormat raised some warnings about it [09:23:48] there is also a reuse-parts.cfg that does not require any confirm of course [09:24:03] that is also great, I used both for the last round of Kafka reimages [09:24:23] so in theory, the work should be as simple as running wmf-reimage [09:24:37] (we'll need to get you to pwstore though, to be able to use mgmt) [09:24:52] in practice, if we wipe /srv it is a problem :D [09:25:11] so what we could do is to rsync/backup the home dirs to another stat100x host [09:25:26] just to be sure [09:25:34] we have plenty of TBs free atm [09:27:23] we have an rsync server on every stat100x host, but it runs as nobody so doing rsyncs is surely going to raise permission issues [09:27:32] the Data Persistence team created https://wikitech.wikimedia.org/wiki/Transfer.py [09:27:41] that I have never used, but it should be neat and easy [09:27:44] pwstore access I have (and confirmed working) [09:28:11] ahhh nice! [09:28:15] then we are all set [09:28:16] can stat hosts ssh to each other? [09:28:32] If so, tar'ing up user dirs might be the better option, permission-wise [09:29:09] Like tar c /home|ssh statX00Z 'cat > backup.tar' [09:29:18] I don't recall if group ids are consistent between stat boxes [09:29:44] tar usually backs up user names and groups, not uid/gid, unless told otherwise [09:30:11] all right then I remembered incorrectly, then it seems fine! [09:30:27] we cannot ssh though, netcat is an option [09:30:43] that is IIRC what transfer.py does (open ports temporarily etc..) [09:30:56] Ok. maybe put in zstd to speed things up a little, at basically zero CPU cost [09:31:09] Oh, there's a script already, perfect [09:31:38] yeah check https://wikitech.wikimedia.org/wiki/Transfer.py, should be handy [09:31:50] it runs from cumin1001 [09:32:29] so, overall the plan should be to [09:32:41] 1) announce the maintenance for stat1004 to our users [09:33:25] 2) back up the home dirs (maybe we could test transfer.py for a subset of the data, then do a pass before the maintenance to have more up to date files) [09:33:30] 3) reimage [09:34:16] the known side effect of upgrading is that python is upgraded to 3.7, so some venvs of users will need to be recreated probably [09:34:25] (like the ones for jupyter notebooks etc..) [09:34:32] 1.5) ask users to clean out old junk from their homes which will cut backups in half... [09:34:32] but it is not a big deal [09:34:50] moritzm: please don't add false hopes [09:34:52] :D [09:35:19] Ok, I'll start drafting a mail. We start with 1004, right?
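(A sketch of the home-dir copy discussed above, assuming the tar + zstd + netcat route rather than transfer.py; the destination host, port and paths are placeholders, not the real values used, and the netcat flags differ slightly between the traditional and OpenBSD variants.)
    # on the destination stat box: listen, decompress, unpack preserving permissions
    mkdir -p /srv/backup/stat1004-home
    nc -l -p 9876 | zstd -d | tar -C /srv/backup/stat1004-home -xpf -
    # on stat1004: stream /srv/home through zstd (nearly free CPU-wise, as noted above) to the listener
    tar -C /srv -cpf - home | zstd -3 | nc stat1008.eqiad.wmnet 9876
    # once the copy is verified, the reimage itself is the command quoted above, run from cumin1001
    sudo -i wmf-auto-reimage $hostname -p TASKID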
[09:35:24] exactly [09:35:43] Is there a way to lock out non-SRE users for the time of the transfer? [09:36:18] the only way that I know is to remove non-sre groups from the admin list in puppet for the host [09:36:27] Hrm. [09:36:29] that automatically removes ssh keys [09:36:32] but it is brutal [09:36:48] On thing I'd also do is disable cron, so it doesn't run stuff in the middle of a transfer [09:37:02] yep seems fine [09:37:10] Not sure if something similar can be done with systemd timers [09:37:14] I think it is fine to say that we'll backup data up to a certain dat [09:37:18] *date [09:37:27] klausman: systemctl stop *.timer [09:37:35] Yeah, and any change after X hour may get lost [09:37:47] exactly, it is perfectly acceptable [09:39:30] Ok, doing some writing :) [09:40:33] perfect :) [09:40:49] I am going to take my lunch break earlier today (need to do some errands) [09:43:29] Ack. Buon appetito! [09:43:47] grazie :) [10:08:01] Hi Hi analytics! Do you have any links to any policies around search terms being private data? :) [12:00:18] addshore: hello! No idea [12:00:31] I'll bring it up to my team later on [12:06:03] in the meantime, I am also restarting cassandra on aqs [12:06:12] (for openjdk) [12:30:20] !log stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto) [12:30:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:59:52] 10Analytics, 10observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10fgiunchedi) [13:24:03] hi teammmmm [13:24:12] holaaa [13:24:30] heyoo [13:28:01] :] [13:43:27] !log restart of hive/oozie/presto daemons on an-coord1001 [13:43:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:43:43] !log re-enable timers on an-launcher1002 after maintenance to an-coord1001 [13:43:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:18] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [14:11:32] so today the roll restart of cassandra caused some client latencies to go up [14:11:54] and this caused restbase alerts for wikifeeds, that uses AQS behind the scenes (TIL) [14:26:05] 10Analytics, 10Analytics-Kanban: analytics.wikimedia.org TLC - https://phabricator.wikimedia.org/T253393 (10mforns) We're keeping track of the design requirements of this task in the doc: https://docs.google.com/document/d/1u79ttV4_RMCJ9LLWqzGlQudlQ4w29i2xtcltQY-hjvM Please, request access if interested. [15:00:07] elukey: wow, so aqs is kind of tier 1.5 now [15:00:21] side note: how on earth do you go from https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/wikifeeds to the code of that repository? [15:00:32] like... I know the code is there somewhere gerrit... WHERE IS DA BUTTON [15:02:33] milimetric: not as easy as a button but changing "r/admin/repos" with "g" takes you there https://gerrit.wikimedia.org/g/mediawiki/services/wikifeeds [15:03:00] milimetric: so Giuseppe and Alex told me, I made the correlation only with graphs.. I think that wikifeeds calls restbase that calls aqs, not directly us [15:03:02] witch! BURN HIM [15:03:23] (Mikhail, not you Luca, you're fine) [15:03:45] (also, thx Mikhail :)) [15:04:13] milimetric: don't burn me! 
the chilly fall weather JUST started, let me have one day with a hoodie [15:04:44] hm, let's see what other options are there. There's drowning? [15:05:13] here at the witch hunting branch of the government we like to be sensitive to our clients [15:05:25] I appreciate that! [15:05:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 3 others: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 (10Ottomata) Canary events are now being produced for... [15:06:57] milimetric: i usually start at [15:06:58] https://gerrit.wikimedia.org/r/plugins/gitiles/?format=HTML [15:07:00] and find the repo from there [15:08:54] javascript:document.location.href=document.location.href.replace(/r\/admin\/repos/,'g') [15:08:57] there, now there's a button [15:09:27] ottomata: interesting, I usually use the BROWSE > Repositories and search for the repo in the filter input [15:09:45] milimetric: 👏 [15:11:45] yeah, that makes more sense, 'cause the repo is linked from everywhere in changes and such [15:13:26] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Gehel) [15:14:41] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Gehel) [15:16:38] stepping afk for a bit! [15:31:55] mforns: was the PR for restbase deployed? this request is still failing https://wikimedia.org/api/rest_v1/metrics/editors/by-country/ro.wikipedia/5..99-edits/2020/07 [15:32:53] nuria: checking [15:33:14] addshore: https://meta.wikimedia.org/wiki/Data_retention_guidelines there some info here as to what constitutes private data [15:34:04] mforns: the request [15:34:07] https://wikimedia.org/api/rest_v1/metrics/editors/by-country/ro.wikipedia/5..99-edits/2020/07 [15:34:12] is still failing [15:34:51] nuria: the same error, plus the route is still not fixed, I imagine it didn't get deployed [15:35:06] I can check with Pchelolo [15:35:23] mforns: please do [15:35:26] nuria: interesting, I was thinking around https://diff.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/ as that identified personal info in searches, so I was wondering to what degree search terms are counted as persoanl info [15:35:27] I didn't deploy yesterday, apologies, didn't get to it [15:35:31] will do right now [15:35:45] no problem Pchelolo, we were just checking [15:35:48] thanks a lot! [15:35:50] addshore: a lot of search data is Pii by mistake [15:36:06] addshore: cause people do write their credit card in teh serach box by mistake [15:36:17] :D classic humans [15:38:26] Pchelolo: super thanks [15:38:54] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10jlinehan) [15:44:07] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10Milimetric) I think it would be mutually beneficial to work on producing the events together, so we can help iterate on the schemas and implement the... [15:48:58] May be late for the standup. 
Running an errand atm, and the queues are longer than expected [15:54:01] klausman: ack [15:54:27] klausman: send e-scrum if you don't attend :) [15:56:31] 10Analytics-Radar, 10Operations, 10ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10wiki_willy) Let me check with @Cmjohnson . He's tied up with PDU upgrades this week, and he's out on vacation half of next week. But let's see if we can at least get a timeframe for you. Thanks,... [16:07:58] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10WDoranWMF) @Nuria From Petr: "Someone just needs to implement these, it shouldn’t be not too hard - 2 schemas already outlined, 2 hook handlers in E... [16:09:09] 10Analytics, 10Event-Platform, 10Platform Engineering: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10WDoranWMF) >>! In T262205#6463145, @Milimetric wrote: > I think it would be mutually beneficial to work on producing the events together, so we can h... [16:16:20] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) We should coordinate with @jlinehan and @Mholloway on the client error logging change, but yea I think that would work.... [16:35:59] elukey: how direct is the control over sudoers? arbitraty policy entries> [16:36:01] ? [16:36:41] If so, this is what we'd want: %whatevergroup sudoedit /this/file, /that/file [16:36:55] (I think) [16:37:11] klausman: so the rules are all in puppet, in the data.yaml file [16:37:22] (under modules/admin/etc..) [16:37:35] every group has a set of sudo policy that can be deployed etc.. [16:37:41] Ah, right. Then sudoedit might not work to narrow down. But if you can sudo to another user, editing is implied [16:37:59] i.e. 
we would still have shell access, which is likely wanted anyway [16:38:11] please check what we have in there and propose improvements if you think the current setting is not great, I'd love to know your opinion :) [16:38:24] Roger :) [16:40:29] nuria: as FYI https://phabricator.wikimedia.org/T262893 [16:40:39] (I asked for some extra budget to increase ram on an-coord1001) [16:41:03] ping razzi [16:41:20] ping milimetric [16:42:01] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) [17:15:26] milimetric: on https://docs.plus/p/wmf-analytics [17:15:33] milimetric: loading [17:33:21] (03CR) 10Nuria: [C: 03+2] Add a link in the footer to translate Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/626711 (https://phabricator.wikimedia.org/T261502) (owner: 10Paul Kernfeld) [17:34:46] (03Merged) 10jenkins-bot: Add a link in the footer to translate Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/626711 (https://phabricator.wikimedia.org/T261502) (owner: 10Paul Kernfeld) [17:42:19] I'm looking into the geoeditors thing [17:45:16] klausman: sudoedit is so confusing, and seems to have confused everyone on the internets: https://medium.com/@madushan1000/the-dangers-of-sudoedit-c433cbdade83 [17:45:35] I haven't found your explanation there of how it copies to file and just writes it back to the target [17:45:44] *there (I mean the internet) [17:46:16] Huh, interesting [17:46:29] The :e problem never occurred to me [17:46:45] I gues "Editors are just too damn powerful" is the conclusion [17:47:33] I think how you set up rights when guarding against people sudoedit-ing is just different from how you set them up in general, and I can see that it can give you more granularity, but it seems tricky either way [17:48:11] Wait... hang on. [17:48:41] That post is wrong [17:49:04] The vim in question runs not as root, but as the calling user, therefore, :e gives you zip extra access [17:49:42] running sudo sudoedit is something that should be filtered [17:50:33] mforns: apoligies, I didn't notice this in reviews, but this is what will fix analytics endpoints: https://github.com/wikimedia/restbase/pull/1278 [17:50:47] you've used swagger 2 spec, I didn't notice that [17:51:26] ok, that makes sense (re: sudoedit) :) [17:56:58] oh Pchelolo, thanks a lot, I was trying to deduce [17:58:59] Pchelolo: does this need any change from me? [17:59:27] I think no [18:00:06] at some point we should all come together and look into AQS major update [18:02:33] aha [18:03:13] you mean for other endpoints to use swagger3? 
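(To make the sudoedit point above concrete: a hypothetical sudoers rule of the shape klausman sketched. The group name and file paths are invented, and in practice these strings would be carried in the privileges lists in puppet's admin data.yaml rather than written into sudoers by hand.)
    # hypothetical: members of %analytics-ops may edit exactly these two files, nothing more
    %analytics-ops ALL = (root) NOPASSWD: sudoedit /etc/presto/config.properties, sudoedit /etc/hive/hive-site.xml
    # sudoedit copies the target to a temp file and launches $EDITOR as the invoking user,
    # so an editor escape such as vim's :e /etc/shadow only has the caller's own privileges;
    # the blog post's concern applies to running a full editor as root via plain sudo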
[18:03:38] sorry openapi3 [18:22:41] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) p:05Triage→03High [18:23:40] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) a:03mpopov [18:25:54] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) @mpopov If you prepare a schema patch I'll CR and make a patch to bump the version on... [18:43:11] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) In this case...I think this will be a backwards incompatible change, so we should prob... [18:45:48] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) FYI @cdanis is working on {T257527} which has expects to have `http.client_ip` in logs... [18:46:02] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) >>! In T262626#6464002, @Ottomata wrote: > In this case...I think this will be a backw... [18:48:31] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10mpopov) >>! In T262626#6463258, @Ottomata wrote: > We should coordinate with @mpopov and Product... [18:53:32] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) >>! In T262626#6464015, @mpopov wrote: > secondary:/fragment/analytics/common referenc... [19:02:12] nuria: what do we do then with the alarms? I can push a patch that reverts threshold? [19:08:11] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10Milimetric) >>! In T262261#6450148, @JAllemandou wrote: > Checks for discrenpencies (using previous comment setup): > > * page-delete > ... I noticed here th... [19:08:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10elukey) After merging the second change this was the output: ` Notice: /Stage[main]/Geoip::Data::Puppet/File[/usr/share/matomo/misc/GeoIP.dat]/ensure: defined content as '... [19:12:54] mforns: sorry, so many things. 
[19:13:22] mforns: i think having a threshold that also does not alarm when we need to is probably not the best either [19:13:38] mforns: now teh larm is in the combined metric if i remember right [19:13:53] we have alarms on all 4 useragent metrics right now [19:14:46] 10Analytics, 10Analytics-Kanban: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10elukey) The change was reverted, we have a couple of ideas about how to proceed, new patches will follow. [19:14:46] let me look at dashboards again [19:14:50] * elukey afk! [19:15:00] nuria, I think a good approach could be to have both daily and hourly alarms at different thresholds, like we do with traffic metrics [19:15:30] I believe one bad thing about these hourly useragent metrics is that they have both strong daily and strong weekly seasonality [19:16:06] and IIUC the RSVD algorithm only accounts for one of those, the one that is expressed by the matrix size [19:16:38] in the case of the hourly metrics, that means the weekly seasonality is making it harder to extract noise from signal [19:17:36] so, we will only detect very high deviations with the hourly metric, still good because it has shorter alarm time, [19:18:11] but if we used daily metrics additionaly, we could get rid of the double-seasonality (we'd only have weekly seasonality) [19:18:25] mforns: it is not the length of the ts though right? it is the periodicity with which the metric is computed [19:18:36] and the RSVD would be able to remove noise from signal fine (if my hypothesis is correct) [19:18:48] yes, I believe so [19:19:36] mforns: i think is not realted to the matrix size though rather the periodicity [19:19:58] mforns: with which the metric is computed [19:20:02] yes, for RSVD periodicity == matrix size [19:20:17] hello A-team, [19:20:17] I'm having a file issue on stat6 [19:20:17] And am just realizing that work began on stat6, is that right? the stat6 updates are beginning now? [19:20:24] well, one of the size, don't remember if width or height [19:22:10] In short: I was working on a file on stat6 and saved two output dfs to csv and suddenly I don't see the notebook I was working on and am having a hard time getting the csv files to my local via scp [19:23:09] iflorez: there is no work happening on stats6 now [19:23:14] iflorez: it will start on thu [19:23:51] thank you for that update. I'm not sure then why I may be having file trouble in that case. [19:27:19] mforns: mmmm [19:28:06] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10CDanis) >>! In T262626#6464008, @Ottomata wrote: > FYI @cdanis is working on {T257527} which has... [19:32:14] 10Analytics-Radar, 10Event-Platform, 10Platform Engineering: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10Milimetric) >>! In T262203#6459797, @JAllemandou wrote: > |database | rev_id | users | rev_id_count_gt_1 > |enwiki | 977034865 | [104.235... [19:37:50] mforns: I am not sure that reverting the threshold to 10 is of use [19:38:30] mforns_brb: let me think about this a bit more [19:38:44] iflorez: are you scp-ing files to stat1006 or from 1006? 
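(A rough formalisation of the RSVD/seasonality point above, assuming the job follows the usual robust-decomposition recipe of reshaping the series by one assumed period; the exact refinery implementation isn't shown here, so treat this as a sketch.)
    % hourly series y_1, ..., y_{np}, reshaped with the assumed period p as one matrix dimension:
    M \in \mathbb{R}^{p \times n}, \qquad M_{i,j} = y_{(j-1)p + i}
    % robust decomposition into low-rank seasonal signal, sparse anomalies and noise:
    M = L + S + E, \qquad \operatorname{rank}(L) \ll \min(p, n), \quad S \text{ sparse}
    % with hourly data and p = 24 the daily cycle fits into L, but the weekly (period-168) pattern
    % is not aligned with this shape: it inflates L or leaks into S and E and blurs the separation;
    % a daily series with p = 7 leaves only the weekly pattern to model, hence the cleaner signal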
[19:39:03] from stat6 to my local [19:40:21] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, 10observability: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) > AFAIK Logstash/Kibana is the only such system we have that fits? Superset + Presto/... [19:49:56] nuria: I never thought of asking what was the problem that generated that anomaly back on 2020-04-20? [19:51:03] I think that a threshold=10 for hourly metrics is good, it will still catch big jumps, like the one we created these alarms for. [19:52:05] and then we can have a daily metric, that will take some more time to alarm, but won't have the double-seasonality problem, and I think will be more sensitive to small changes like the one on 2020-04-20 [19:54:59] mforns: taht makes a lot of sense yes [19:55:16] mforns: i have a meeting in a bit but I will send some patches [19:55:21] mforns: after that [19:55:41] ok [19:55:52] i'll be logging off in a bit [19:58:58] ottomata: Heya - do ou need help from me (ops-week) on camus? [20:02:46] joal: i think it is ok, a run is about to finish up here in 8 minutes [20:02:55] and i'll be able to see if it is actually progressing on those 3 partitions [20:02:59] but something is very weird eh! [20:03:06] ottomata: yes indeed! [20:03:09] why would the offsets be reset on these partitions from this one broker! [20:03:23] same thing that happened in may i guess....except for some reason last time they got stuck [20:03:33] ottomata: I need to write some scorecard, I'll be here in 8 mins - pleae let me know :) [20:03:38] ok ty [20:04:03] also, I'd like as well to understand how come camus restarted reading those partitions from start :( [20:05:11] indeed [20:05:27] i had not ever noticed that they got totally reset like that before [20:05:31] i only noticed that they got stuck [20:05:46] and could not understand what was causing them to get stuck [20:05:54] camus would just fail reading from kafka for some reason [20:07:37] hm :( [20:10:14] mforns: the aqs patch is still not working right? [20:11:18] nuria: they patched it, we used swagger2 syntax instead of openapi3, it should be fixed whenever they deploy it [20:11:53] mforns: can we create a ticket to migrate from swagger to openapi3? [20:12:06] they already did a pull request [20:12:12] pe-tr did [20:16:43] joal: it looks ok; those partitioins are progressing [20:17:31] ok ottomata - I assume there might be some more alerts. I'll triple check tomorrow morning in between kids-games :) [20:17:43] Thanks a lot for looking into that [20:24:16] 10Analytics, 10Event-Platform, 10Platform Engineering, 10Platform Engineering Roadmap Decision Making: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10eprodromou) OK, we're going to move this into our roadmap discussion. I think it's a small project,... 
[20:24:31] 10Analytics-Radar, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team): Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10eprodromou) p:05Triage→03Medium [20:33:18] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:33:42] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:36:36] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [20:36:53] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, and 2 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10mpopov) [20:55:10] 10Analytics-Radar, 10MW-1.35-notes, 10MW-1.36-notes (1.36.0-wmf.2; 2020-07-28), 10Multi-Content-Revisions (New Features), and 4 others: MCR: Import all slots from XML dumps - https://phabricator.wikimedia.org/T220525 (10eprodromou) 05Open→03Resolved Congratulations and good job! [22:46:16] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: [Entropy alarms] Restrict the RSVD analysis to the last N data-points - https://phabricator.wikimedia.org/T257691 (10Nuria) 05Open→03Resolved [22:54:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria) ping @Pchelolo and @mforns that were mentioning this today, we should probably schedule this work, correct? [22:54:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria) @colewhite I see the patch has no reviewers, should we pick it up? [22:54:11] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10Nuria)
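(On the swagger2-vs-openapi3 mixup and the AQS OpenAPI 3 task above, a generic before/after of the response syntax; the path and schema names are invented, this is not the actual AQS or wikifeeds spec.)
    # Swagger 2 style (what the PR accidentally used)
    swagger: "2.0"
    paths:
      /metrics/example:
        get:
          produces: [application/json]
          responses:
            '200':
              description: OK
              schema:
                $ref: '#/definitions/Result'
    # OpenAPI 3 equivalent
    openapi: "3.0.0"
    paths:
      /metrics/example:
        get:
          responses:
            '200':
              description: OK
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/Result'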