[00:05:35] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Remove postal code and longitude / latitude from geocoded data object on webrequest data - https://phabricator.wikimedia.org/T236740 (10kzimmerman) @Nuria Product Analytics hasn't used this data (except for once, maybe, a few years... [04:11:53] 10Analytics, 10MediaWiki-General, 10Platform Engineering: Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Pchelolo) [04:14:23] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Remove postal code and longitude / latitude from geocoded data object on webrequest data - https://phabricator.wikimedia.org/T236740 (10Nuria) [04:16:31] 10Analytics, 10Discovery-Search, 10MediaWiki-General: Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Pchelolo) [04:17:16] 10Analytics, 10Discovery-Search, 10MediaWiki-General: Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Pchelolo) [04:17:51] 10Analytics, 10MediaWiki-General, 10Platform Engineering: Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Pchelolo) [06:06:16] goood morning [06:20:33] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10elukey) @razzi as a note for the future, another useful test to do is via openssl s_client, like the following: ` echo y | openssl s_client -CApath /etc/ssl/c... [06:26:17] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10elukey) @razzi something useful to do in the task is also to make a list of domain -> backend that you will work on, so others can double check. Something li... [06:28:45] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10elukey) @SBisson Hi! As far as I can see your username is not in `analytics-privatedata-users`, but only in `researchers`, which is an old group not really meant to explore Hadoop data. As far as I can see from y... [06:30:49] 10Analytics, 10Operations, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:06] 10Analytics, 10Operations, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:08] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10elukey) [06:33:44] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) @Nuria can you review/approve?
I'll then merge and create the kerberos identity :) [06:41:32] !log decom analytics1056 from the hadoop cluster [06:41:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:00:23] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10Marostegui) p:05Triage→03Medium a:03elukey [07:31:40] 10Analytics, 10Analytics-Wikistats: Wikistats active editors metric reporting unrealistic numbers - https://phabricator.wikimedia.org/T265322 (10Lydia_Pintscher) I'm seeing similar things for Wikidata. Active editor and editor stats seem to be showing the exact same numbers: * https://stats.wikimedia.org/#/wik... [07:53:03] * elukey coffee [08:42:12] so I am checking presto versions [08:42:20] and something is odd [08:42:29] we have ii presto-server 0.266-2 all Presto Server [08:42:53] https://github.com/prestosql/presto/tags (the non-fb version) is 344 now [08:43:17] but prestodb, the one that I thought we had, is at 242 https://github.com/prestodb/presto/tags [08:45:49] so we are running prestosql [08:54:24] *Or,* we're running software from the future™ [08:54:35] Also, morning! [08:54:57] ahhh no wait https://gerrit.wikimedia.org/r/c/operations/debs/presto/+/543161 [08:55:02] morning! [08:55:50] so we have 226, and latest upstream is 242 [08:58:00] A-ha! That makes more sense. Pesky version numbers :) [09:12:31] elukey: I suppose netbox does not know whether a given machine contains a GPU? [09:13:00] klausman: yep exactly [09:13:16] So which machines are candidates for the rocm38 test? [09:13:48] you can find them in regex.yaml in puppet, an-worker1096-1101 [09:15:46] All five have GPUs? [09:16:59] Ah, right, they do [09:18:13] 6 have the gpu [09:18:21] So how about we drain 1101, I split it out of that group to get rocm38, we do a puppet run, watch the fireworks and decide what to do with it (move back to 33 or add it to the cluster with 38)? [09:18:35] yep makes sense [09:18:45] Yeah, I forgot to count 1100, somehow :) [09:18:58] I am currently decommissioning a node, so let's try to keep the hdfs datanode downtime minimal if possible [09:19:04] ack. [09:19:07] thanks :) [09:19:31] I mean, we can wait until the current decom is done, if the timing fits? [09:19:45] I'll just prep the Puppet change [09:23:53] nono it will take hours, let's do it [09:24:46] Can we drain the machine (of jobs, not data) without doing a puppet run? [09:28:13] so what I usually do is downtime, puppet disable and then systemctl stop hadoop-yarn-nodemanager [09:28:25] when the jvm containers are done, I proceed [09:28:47] you can also stop the hdfs datanode before installing as well [09:29:02] Ok, sounds good. Will send the Puppet change for rocm38 for your review and then drain the Yarn jobs [09:29:11] ack [09:32:19] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10GoranSMilovanovic) @elukey All is fine, thank you very much! @Milimetric @JAllemandou Given the comment in T265851#6560938, do you want me to close this task as resol... [09:33:29] I am doing an experiment with Presto TLS config, it may not work for a brief moment [09:37:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/635260 is ready for you.
I did a simple c&p of the base role and tweaked the REs [09:40:48] klausman: that is fine, but you can also add a host-level hiera override (which takes priority over regex.yaml) [09:41:07] for example, check under hieradata/hosts [09:41:22] you could create an-worker1101.yaml with profile::amd_gpu::rocm_version: '38' in it [09:42:13] Ah, that sounds cleaner [09:46:28] Ok, updated the change to use a host override [09:47:17] +1 [09:48:53] Thanks! Will now drain the machine as discussed [09:59:51] Hmm. apt1001 does not have rock-dkms-firmware. Investigating [10:01:50] q [10:05:59] elukey: I have a sneaking suspicion that because rock-dkms-firmware is in the same subdir as rock-dkms, the apt/reprepro setup breaks. [10:08:34] klausman: mmm so rock-dkms-firmware is not included when doing checkupdate/update? [10:09:45] Correct [10:10:01] http://repo.radeon.com/rocm/apt/3.8/pool/main/r/rock-dkms/ has it, and the grep-line in the reprepro config mentions it [10:10:15] But it does not show up in the reprepro tree [10:11:19] is it in the package list configured for the remote repo? I am wondering if it is something like mvisionx [10:12:49] it is there, I see Package: rock-dkms-firmware [10:12:55] weird [10:13:11] http://repo.radeon.com/rocm/apt/3.8/dists/xenial/main/binary-amd64/Packages has it, yeah [10:13:41] (also checked the .gz, just in case) [10:14:13] root@apt1001:/srv/wikimedia# reprepro lsbycomponent rock-dkms-firmware [10:14:16] rock-dkms-firmware | 1:3.8-30 | buster-wikimedia | thirdparty/amd-rocm38 | amd64 [10:14:19] klausman: --^ [10:14:43] so we have it in our repo [10:15:02] what error do you get? maybe it needs an apt-get update? [10:16:40] E: Unable to locate package rock-dkms-firmware [10:17:08] did an apt update, [10:18:02] Hmmm. Would there need to be an entry in apt/sources.list.d/wikimedia.list mentioning rocm? [10:18:12] ah nvm [10:18:25] /etc/apt/sources.list.d/repository_amd-rocm38.list exists and looks right [10:19:33] yep [10:19:44] https://apt.wikimedia.org/wikimedia/pool/thirdparty/amd-rocm38/r/rock-dkms/ has the package as well [10:19:51] and yet: [10:20:07] E: Unable to locate package rock-dkms-firmware [10:20:13] very weird [10:20:33] It also lists only the 3.3 rock-dkms package, no 3.8 [10:20:35] ahhhhhhhhh [10:20:56] rock-dkms-firmware | 1:3.8-30 | buster-wikimedia | thirdparty/amd-rocm38 | amd64 [10:21:07] elukey@an-worker1101:~$ cat /etc/debian_version [10:21:07] 9.13 [10:21:10] :) [10:21:24] we also need to checkupdate/update stretch-wikimedia [10:21:26] Oh [10:21:37] the workers are not on buster [10:23:14] Right. did an update on apt, now doing a puppet run (which should install 3.8) [10:24:27] super [10:24:42] mforns: have we looked at dagster?
https://dagster.io/ [10:25:31] it seems to me a little more polished than airflow and apache 2 all around (I think) [10:26:19] today we have a meeting with Jarek (an airflow PMC member) about our use cases and airflow 2.0 (in alpha but close to prime time) [10:26:44] if you want to jump in the meeting I can add you, it is at 17 CEST [10:26:44] Executors are still dask/celery just like airflow, but maybe dask is worth spinning up [10:28:14] um, I think Marcel has all my use cases in mind and more, so I’m not needed [10:28:57] !r rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place [10:29:13] erm [10:29:16] !log rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place [10:29:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:29:22] just saw dagster and it looks like a solid popular alternative, and maybe some crossover with an ML pipeline (maybe klausman is interested in it vs kubeflow) [10:29:40] Will have a look [10:29:51] let's also check license [10:30:13] It seems to me like one of these workflow orchestrator things should be able to handle both types of pipelines [10:30:17] They have something on GH which is APL-2 [10:30:35] yep, all apl 2 as far as I could see [10:31:18] it may be ambitious to try and do both, but maybe the right decision would be to wait. Flyte is getting better as well [10:32:35] we have also been waiting for a long time to replace oozie, we'd need to pull the trigger at some point [10:35:11] an-worker1101 is back up, with rocm38 [10:47:06] wow [10:47:07] !!! [10:47:40] all good? [10:48:04] No :) [10:48:20] There are still some deps missing from the reprepro config. Working on those right now [10:49:10] sure sure [10:54:14] https://gerrit.wikimedia.org/r/635279 if you're not lunching already [10:54:58] +1! [10:55:15] I assume those are all deps that don't need to be pulled via puppet explicitly [10:56:23] Correct [10:58:03] molto bene (very good) [10:58:06] :D [10:58:20] 10Analytics-Clusters, 10Operations: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (10Marostegui) p:05Triage→03Medium [10:59:59] * elukey lunch! [11:03:38] I think we're a bit screwed re: rocm38 on stretch [11:04:25] one of the dependency chains ends with (libstdc++-5-dev||libstdc++-7-dev) && (libgcc-5-dev||libgcc-7-dev) [11:04:34] None of these are available on stretch [11:07:15] I'll revert the puppet change (38 override) for now to get the machine back into a sane state [11:34:53] * klausman lunch [12:04:39] ouch :( [12:11:02] 10Quarry: Quarry down - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:11:54] 10Quarry: Quarry down for logged in user?? - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:13:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) The main problem with Presto seems to be that the puppet CA is not picked up as trusted source, so any... [12:15:33] 10Quarry: Quarry down for logged in user - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:23:29] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10JAllemandou) @GoranSMilovanovic the needed docs are updated (no sqoop page per se, but related sqoop usages in other pages). Feel free to close the task. Thanks!
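An aside on the rocm38 dependency dead-end above: a chain like (libstdc++-5-dev||libstdc++-7-dev) can be checked without touching the machine's DKMS state by letting apt simulate the install. A minimal sketch on a stretch host, using only the package names from the discussion:

    # Simulate the install (-s): apt resolves the full dependency tree and
    # reports the first unsatisfiable dependency without unpacking anything.
    sudo apt-get update
    apt-get install -s rock-dkms rock-dkms-firmware

    # Confirm the offending toolchain packages are really absent on stretch:
    apt-cache policy libstdc++-7-dev libgcc-7-dev

On stretch the policy query would come back empty for both, matching klausman's finding that none of them are available there.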
[12:25:15] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:29:50] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10JeanFred) I can confirm this behaviour − noticed it a few hours ago. [12:33:39] (03CR) 10Joal: [C: 04-1] "The fields should be removed from the refinery-core related files. RefineryGeocodeDatabaseResponse is the data-handler and GeocodeDatabase" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [12:40:50] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10JeanFred) Even logged-out, only the front-page seems accessible, eg https://quarry.wmflabs.org/query/runs/all returns 500 Internal Server Error [12:44:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) If I add: ` -Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts -Djavax.net.ssl.trustStorePassword=... [12:52:03] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10GoranSMilovanovic) 05Open→03Resolved @JAllemandou Fine. @elukey Thank you for your help and advice. Ticket closed as resolved. [12:54:19] ok so I think I got how to make presto use puppet TLS certs, trying again [12:54:24] 10Analytics, 10Analytics-Kanban, 10Design-Research: Setup and integrate analytics (Matomo) for Design Strategy Website - https://phabricator.wikimedia.org/T259322 (10Volker_E) [12:58:28] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10klausman) As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the follow... [12:59:01] elukey: what do you think the timeframe for everything-on-Buster looks like? [12:59:24] klausman: a couple of quarters at least [13:05:59] so the idea for hadoop (that is the big problem when migrating) is the following [13:06:19] - we migrate from cdh to bigtop 1.4 on stretch, upgrading hdfs etc.. [13:06:38] - when upstream releases 1.5, we upgrade to it on stretch (upgrading hdfs again) [13:07:10] - at this point, we'll have bigtop 1.5 with packages for stretch and buster, so we'll be able to reimage a couple of workers at a time [13:07:14] (to buster) [13:07:40] so what I have in mind is to do two hdfs upgrades and then an os upgrade [13:07:48] (hdfs 2.6 -> 2.8.5 -> 2.10) [13:08:21] does it make sense? [13:09:55] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10elukey) I would upgrade the two stats to 3.8 leaving the hadoop workers to 3.3, so we could keep testing the drivers keeping the stretch stack stable.... [13:15:31] Yeah, that sounds reasonable. [13:16:07] In the meantime, I think unless we find major breakage with rocm33, we should just stay on it. The parallel-workload and memory issues have been fixed by DKMS, so the pressure is not too high. [13:16:42] If we *do* find major breakage, we'll have to take a look at the intermediate versions, maybe there is one that fixes stuff but doesn't have dependency problems on Stretch.
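For the Presto TLS experiment above (and elukey's truncated openssl s_client tip from the morning), the shape of the verification is roughly the following. The host and port are hypothetical placeholders, not taken from the log:

    # Does the service certificate verify against the trusted CA bundle
    # (which must include the puppet CA for this check to succeed)?
    echo y | openssl s_client \
        -CApath /etc/ssl/certs \
        -connect an-coord1001.eqiad.wmnet:8281 2>/dev/null \
        | grep 'Verify return code'
    # Expected on success: "Verify return code: 0 (ok)"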
klausman: we could move only what's on buster to 3.8, so people on stat100x could get the latest drivers [13:19:05] or we can stay on 33 and wait, maybe we can check with people using gpus on the stats [13:19:25] with newer drivers it is also possible to use a more up to date tensorflow-rocm version [13:20:08] The Buster machines I think we should slowly upgrade [13:20:35] Are there any in that category besides 5 and 8? [13:20:47] I mean Buster+GPU [13:24:43] nope only those two [13:25:05] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10klausman) Sounds good to me. I will make an announcement that we'll have some disruption on stat1005 and 1008 soonish and then do the upgrade via pupp... [13:27:25] I'd propose Thu for the upgrade of one machine, but that's DC switchback day, and I'd rather not combine the two. [13:27:37] So Fri morning? [13:28:37] sounds good [13:33:01] !log move presto to puppet host TLS certificates [13:33:03] it works! [13:33:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:33:15] finally [13:34:57] !log upgrade superset's presto TLS config after the above changes [13:35:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:35:00] https://superset.wikimedia.org/superset/dashboard/73/ works fine [13:36:58] and the CLI tool works as well [13:37:18] hello teammm [13:39:05] o/ [13:44:49] taking a break! [13:44:53] :) [13:44:54] hi [13:45:06] luca and I are the same person [13:45:28] mforns: do you remember anything about the mediawiki_api_request events? [13:45:48] milimetric: ? [13:45:48] some of the okapi folks are asking about it, they want to use it [13:46:12] milimetric: is that an eventlogging schema? [13:46:13] but I see that collection is turned off and it's not on the whitelist, the last data left in August would have been deleted [13:46:19] it might be event bus [13:46:24] oh [13:46:25] is that sanitization different? [13:46:31] wait looking [13:49:16] milimetric: I can see data in /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2020/month=10/day=20/hour=10 no? [13:49:40] ahhhH!!! I forgot about the datacenter switchover [13:50:20] hm... but why is this not showing up in the partitions of the table? [13:50:37] hm... [13:51:37] oh... they do... :) [13:52:02] they sort above eqiad, so it looks like they're not there if you're superficial like me :) [13:53:15] oh [13:55:00] mforns: ok, so regarding the whitelist, do we use the same list for the EventBus stuff or did those plans move along [13:57:34] milimetric: yes, you're right, data seems to be fine [13:57:48] there's an overlap period where data is in both eqiad and codfw [13:57:54] and now only in codfw [13:58:19] oh cool, thx for checking [13:58:37] milimetric: the whitelist could work, yes, but right now EventLoggingSanitization filters out all data set names that have underscores in them [13:58:40] I know there's that open discussion on sanitization in the schema/etc [13:58:46] to prevent deleting data that is not EventLogging [13:58:52] yes [13:58:59] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10nshahquinn-wmf) If approval from @SBisson's manager is needed, that would be @Arrbee.
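A sketch of the mediawiki_api_request partition check being described above. The Hive table name event.mediawiki_api_request is inferred from the HDFS path and is an assumption:

    # Raw files on HDFS for the codfw side of the switchover:
    hdfs dfs -ls /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2020/month=10/day=20

    # Hive partitions; codfw sorts before eqiad, which is why the eqiad
    # partitions looked missing at first glance in the conversation above.
    hive -e 'SHOW PARTITIONS event.mediawiki_api_request;' | grep 'month=10/day=20'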
[13:59:08] hm, ok, but this data doesn't end up in event_sanitized [13:59:11] so it's deleted anyway [13:59:30] yes [14:01:32] klausman: one thing about the upgrade - we should check what is the version of tensorflow that works with 3.8 and ping miriam_ about it [14:01:54] milimetric: do you need that data to be sanitized and kept for longer? [14:02:14] I believe if we add it to the whitelist, it will be sanitized together with EL [14:02:46] you were saying it ignores _ [14:03:11] milimetric: I think I was confused... [14:03:27] it's the deletion scripts that do that! [14:03:29] there's some new interest in it. I was mostly trying to remember what we decided with it, there's an old task to ooziefy some transformations of it that nobody picked up [14:03:39] oh ok, so we can sanitize it, good [14:03:45] I think so, yes [14:04:25] the sanitization is persisting stuff indefinitely, thus an include-list is good, because it excludes new stuff by default [14:04:49] the deletion is dropping data, so exclude list is better, so that new stuff is kept by default [14:04:50] yep, it makes sense [14:08:02] klausman: like https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/commit/26764990921fdf3a98c0da7c7d81635496f116fa, not sure if it is included in the last tf version on pypi [14:08:06] we should make sure [14:08:09] otherwise it will not work [14:15:28] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) Ok so in https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/635304 I've removed `client_ip` from... [14:19:16] Good point [14:29:36] elukey, joal: do you guys want to enter the meeting with Jarek like 15 minutes earlier, to align our questions and brief? [14:29:51] Yes ! [14:29:57] mforns: --^ [14:30:07] ok! [14:30:25] see you in 15, elukey please join if you can :] [14:30:30] ack [14:37:42] yep! [14:38:47] mforns, joal let's also send e-scrum [14:38:53] yep [14:39:02] oh yes [14:40:42] elukey: https://phabricator.wikimedia.org/T246004#6564520 when you have a moment :) [14:42:51] ah yes sure, will do it after meetings :) [14:42:57] or is it super urgent? [14:47:17] a-team today's standup should be at 6 european? it's at 5 right now [14:47:21] joal, elukey pingggg [14:47:33] all those meetings are a mess :) [14:47:34] joining mforns [14:47:42] i see meeting starting in 13 mins [14:47:43] no? [14:47:53] nah, that conflicts with managers' [14:48:21] 10Analytics: Retain nonsensitive mediawiki_api_request logging data - https://phabricator.wikimedia.org/T265952 (10Nuria) We can keep data for longer than 90 days that has no identifying fields. Just need to submit a changeset that lists those fields. Please take a look at docs: https://wikitech.wikimedia.org/wi... [14:48:26] ok i guess you tell me when to join bc and i will be there!
;) [14:48:38] changing it [15:00:50] !log disabling sending EventLogging events to eventlogging-valid-mixed topic - T265651 [15:00:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:53] T265651: Disable eventlogging-valid-mixed topic - https://phabricator.wikimedia.org/T265651 [15:20:17] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Nuria) >If the intent is to decide whether all errors are from same user you can send the number of errors for t... [15:20:41] 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10Milimetric) Ok, so I modified the config to just `console.log(context.clientIp)` and ran it with the turnilo deployed on an-tool1007, on port 9092 (and tunneled). With that setup, it looked like... [15:20:55] (03PS2) 10Razzi: Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) [15:21:18] (03CR) 10Razzi: "Thanks @Joal for pointing me in the right direction. Let me know how the new changes look." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [15:21:37] (03CR) 10jerkins-bot: [V: 04-1] Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [15:23:14] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) @Isaac thanks for sharing these! I think the following points are most useful: > a blocklist of countries we'... [15:24:09] (03PS3) 10Razzi: Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) [15:28:54] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Amire80) I'll have to repeat that San Marino is a very extreme case :) However, monthly is an OK default, at least as a... [15:56:35] you guys, I don't understand how anything works [15:56:51] oh, confirmation bias, nvm [15:56:52] :) [16:18:44] ottomata: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/635319 [16:24:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10jcrespo) I can confirm backups have been flowing weekly as expected: `lines=10 +------+------------------------------------------+---... [16:35:53] elukey: ty! [16:38:02] (03CR) 10Joal: [C: 03+1] "LGTM! Thanks for the quick turnaround @Razzi. Let's see if anyone else wants to double check, otherwise let's merge." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [16:42:21] ping urandom [16:42:27] SORRY [16:42:30] ping fdans [16:42:39] ping razzi [16:43:16] fdans:, razzi : RETROOOO [16:43:22] yup yup [17:20:53] fdans: wanna brainbounce some JS with me for a sec? 
I just need a sanity check [17:20:59] (oh, it's lunch, anytime is fine) [17:21:13] milimetric: it's cool, I can do now for a lil bit [17:21:18] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10Nuria) For faster resolution of permits issues add #sre-access-requests to ticket, that way the persosn on clinic duty will get to work on it soon after ticket is filed. I understand that process is a bit confusin... [17:21:22] 'cavin [17:21:29] https://meet.google.com/kti-iybt-ekv?pli=1&authuser=1 [17:21:40] (cave's used up) [17:22:12] fdans: is the standup on thursday at the right time on the calendar? it collides with grosking... [17:25:31] (03CR) 10Nuria: [C: 03+1] "Looks good, let's make sure to test UDF in cluster before merging." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [17:26:53] Heya razzi - wish to do some pair testing on the cluster for --^? [17:27:35] joal: Yeah, give me a few minutes [17:27:41] sure [17:29:21] * elukey afk! [17:30:10] nuria: if you have time later on https://gerrit.wikimedia.org/r/c/operations/puppet/+/635227/ (I'll merge it tomorrow in case) [17:34:04] joal: batcave? [17:34:12] yes razzi - joinig [17:49:20] milimetric: is the move of event data from eqiad to codfw final? [17:49:40] mforns: mmm, what do you mean? [17:49:47] it is affecting at least one oozie job, and I was not fully aware of it [17:50:26] is this about the wikidata link job? [17:50:37] I see that another data set (mediawiki_page_move) has shifted over receiving events from eqiad to receiving them from codfw in the last weeks [17:50:39] the one that depends on every hour for like a month [17:51:04] I believe it depends of data for one week [17:51:16] ok, link me [17:51:21] k [17:51:58] milimetric: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/events/datasets.xml#L30 [17:52:07] see the hadrcoded datacenter=eqiad? [17:52:30] that is preventing the oozie job from running, even if data exists under datacenter=codfw [17:52:53] mforns: yeah, I know all about this, it's a hack [17:52:54] https://github.com/wikimedia/analytics-refinery/blob/7faf75747dfae8bb0a3363b2401249c9f165a9e5/oozie/wikidata/item_page_link/weekly/coordinator.xml#L111 [17:53:08] that's how it's used, that job is depending on 1 month + 7 days of page move data [17:53:53] basically, there's no way to tell it to depend on both eqiad and codfw, so we just picked one [17:53:55] but that is fine, the issue I see is that the datacenter is hardcoded in the config, while the data seems to be shifting from one datacenter to the other in the last weeks [17:54:01] oh [17:54:11] and the strategy has been, when the job should run, we fake the _SUCCESS flags in whichever one is hardcoded [17:54:29] Joseph did it a few weeks ago and I did it two weeks ago [17:54:30] milimetric: but do you know why the data is shifting to codfw recently? [17:54:53] yeah, that's the data center that's operational now, they're running all the mediawiki app servers from there [17:55:01] until October... I wanna say 27th? [17:55:17] oh [17:55:21] (Luca sent an email about the exact date) [17:56:01] so the nuance here is that the page move table has partitions for both codfw and eqiad [17:56:20] and the select statement doesn't specify the data center, so it's like "where datacenter like '%'" [17:56:30] ok [17:56:51] so the action item here is basically to copy the success files over to reqiad, then? 
[17:57:00] *eqiad [17:57:02] so it doesn't matter for the actual query, but to trigger the job, when data is available in codfw/year=2020/month=10/day=20/hour=5, you have to put a success flag in the corresponding eqiad folder [17:57:12] yes, one sec [17:57:13] ok, got it [17:58:27] heya mforns - has the train started? [17:58:40] joal: I'm gathering info now to start [17:58:50] do you want me to add sth? [17:58:55] great mforns - we have a patch for you with razzi [17:59:04] cool [18:00:31] (03CR) 10Razzi: [C: 03+2] Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [18:00:36] mforns: --^ [18:01:13] ok joal and razzi, will deploy in short [18:01:30] ack mforns - we're gently moving, please let us know ;) [18:01:52] mforns: sorry I can't find the commands, but the only thing to look out for really is that the success flags are in codfw, so a copy from there to eqiad is probably the best option [18:02:05] joal: what do you mean with gently moving? :O [18:02:19] mforns: refinery-source is ready, we're pushing a refinery patch soon [18:02:35] milimetric: thanks for looking :] no problemo [18:02:37] mforns: + updating etherpad train [18:02:40] joal: ah! ok ok [18:02:48] thanks! [18:11:42] (03PS1) 10Razzi: oozie: update webrequest/load hive jar version [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635352 (https://phabricator.wikimedia.org/T236740) [18:21:37] mforns: how does https://gerrit.wikimedia.org/r/c/analytics/refinery/+/635352/ look? [18:22:05] Also see the message on the train etherpad [18:22:38] razzi: LGTM! [18:23:34] razzi: I will merge that one after deploying source, thanks! [18:25:11] razzi: probably good to edit https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Changes_and_known_problems_since_2015-03-04 and add the change there [18:25:11] :) [18:28:06] 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10Milimetric) Got the little %^&*(er :) https://github.com/allegro/turnilo/pull/668 [18:30:05] ottomata: thanks, updated that [18:30:09] super helpful brainbounce, fdans, I love how much I get from just watching your reaction to stupid things I say :) [18:30:27] (PR sent to Turnilo, problem was very simple) [18:31:03] ottomata: tomorrow we'll deploy that change to events refining [18:37:43] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Make node-rdkafka an optional dependency of EventGate - https://phabricator.wikimedia.org/T266058 (10Ottomata) [18:54:24] milimetric: wat you didn't say anything stupid! [19:25:36] ottomata: I see this change is to be deployed today; Use camus + EventStreamConfig integration in CamusPartitionChecker [19:25:58] ottomata: do I need to bump up anything in refinery? Or restart anything? 
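Since the exact commands for the _SUCCESS workaround were not at hand earlier ("sorry I can't find the commands"), a minimal sketch of what faking the flag in the hardcoded eqiad path could look like. The base directory and the use of mediawiki_page_move are assumptions based on the paths mentioned in the discussion:

    # Assumed layout; the oozie coordinator waits on the eqiad _SUCCESS flag.
    BASE=/wmf/data/event/mediawiki_page_move
    HOUR='year=2020/month=10/day=20/hour=5'

    # Only fake the eqiad flag once the codfw data for that hour has landed.
    if hdfs dfs -test -e "${BASE}/datacenter=codfw/${HOUR}/_SUCCESS"; then
        hdfs dfs -mkdir -p "${BASE}/datacenter=eqiad/${HOUR}"
        hdfs dfs -touchz "${BASE}/datacenter=eqiad/${HOUR}/_SUCCESS"
    fi

In practice this would be looped over every hour the coordinator is still waiting on.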
yes, once that is deployed i'll make it happen [19:26:07] via puppet [19:26:12] ok ok, I'll leave that to you then [19:26:23] :] [19:38:44] (03PS1) 10Mforns: Update changelog.md for v0.0.137 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635373 [19:39:35] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635373 (owner: 10Mforns) [19:40:29] Starting build #60 for job analytics-refinery-maven-release-docker [19:50:49] Project analytics-refinery-maven-release-docker build #60: 09SUCCESS in 10 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/60/ [19:56:45] Starting build #27 for job analytics-refinery-update-jars-docker [19:57:03] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.0.137 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635378 [19:57:03] Project analytics-refinery-update-jars-docker build #27: 09SUCCESS in 18 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/27/ [19:59:52] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635378 (owner: 10Maven-release-user) [20:00:31] !log Deployed refinery-source v0.0.137 [20:00:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:06:43] 10Analytics, 10MediaWiki-General, 10Platform Team Workboards (Clinic Duty Team): Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Clarakosi) [20:07:13] 10Analytics, 10MediaWiki-General, 10Platform Team Workboards (Clinic Duty Team): Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Clarakosi) p:05Triage→03Low [20:22:36] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635352 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [20:24:49] !log Deploying refinery with scap for v0.0.137 [20:24:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:53:06] 10Analytics, 10Analytics-Kanban: actor_signature_per_project_family does not work for apps - https://phabricator.wikimedia.org/T258101 (10razzi) a:05Nuria→03razzi [20:59:07] !log Deploying refinery with refinery-deploy-to-hdfs (for 0.0.137) [20:59:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:41:16] 10Analytics: analytics.wikimedia.org incompatible on iOS and Android - https://phabricator.wikimedia.org/T266071 (10Peachey88) [22:47:07] 10Analytics, 10Product-Analytics, 10Structured-Data-Backlog: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (10nettrom_WMF)
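For reference, the deployment-train steps !logged above map roughly onto the commands below. Host names, paths and flags follow the documented process as remembered, not a transcript of what was actually run, so treat them as assumptions:

    # 1. On the deployment server (deployment.eqiad.wmnet): ship refinery
    #    to the analytics hosts. <sha> is a placeholder for the release commit.
    cd /srv/deployment/analytics/refinery
    scap deploy "Regular analytics weekly train [analytics/refinery@<sha>]"

    # 2. Then on the launcher host: sync refinery artifacts (jars, oozie
    #    code) to HDFS, authenticating with the deploy keytab first.
    sudo -u analytics-deploy kerberos-run-command analytics-deploy \
        /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs \
        --verbose --no-dry-run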