[00:05:35] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Remove postal code and longitude / latitude from geocoded data object on webrequest data - https://phabricator.wikimedia.org/T236740 (10kzimmerman) @Nuria Product Analytics hasn't used this data (except for once, maybe, a few years... [04:11:53] 10Analytics, 10MediaWiki-General, 10Platform Engineering: Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Pchelolo) [04:14:23] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Remove postal code and longitude / latitude from geocoded data object on webrequest data - https://phabricator.wikimedia.org/T236740 (10Nuria) [04:16:31] 10Analytics, 10Discovery-Search, 10MediaWiki-General: Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Pchelolo) [04:17:16] 10Analytics, 10Discovery-Search, 10MediaWiki-General: Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Pchelolo) [04:17:51] 10Analytics, 10MediaWiki-General, 10Platform Engineering: Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Pchelolo) [06:06:16] goood morning [06:20:33] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10elukey) @razzi as a note for the future, another useful test to do is via openssl s_client, like the following: ` echo y | openssl s_client -CApath /etc/ssl/c... [06:26:17] 10Analytics, 10Patch-For-Review, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10elukey) @razzi something useful to do in the task is also to make a list of domain -> backend that you will work on, so others can double check. Something li... [06:28:45] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10elukey) @SBisson Hi! As far as I can see your username is not in `analytics-privatedata-users`, but only in `researchers`, which is an old group not really meant to explore Hadoop data. As far as I can see from y... [06:30:49] 10Analytics, 10Operations, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:06] 10Analytics, 10Operations, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:08] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10elukey) [06:33:44] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) @Nuria can you review/approve?
I'll then merge and create the kerberos identity :) [06:41:32] !log decom analytics1056 from the hadoop cluster [06:41:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:00:23] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10Marostegui) p:05Triage→03Medium a:03elukey [07:31:40] 10Analytics, 10Analytics-Wikistats: Wikistats active editors metric reporting unrealistic numbers - https://phabricator.wikimedia.org/T265322 (10Lydia_Pintscher) I'm seeing similar things for Wikidata. Active editor and editor stats seem to be showing the exact same numbers: * https://stats.wikimedia.org/#/wik... [07:53:03] * elukey coffee [08:42:12] so I am checking presto versions [08:42:20] and something is odd [08:42:29] we have ii presto-server 0.266-2 all Presto Server [08:42:53] https://github.com/prestosql/presto/tags (the non-fb version) is 344 now [08:43:17] but prestodb, the one that I thought we had, is at 242 https://github.com/prestodb/presto/tags [08:45:49] so we are running prestosql [08:54:24] *Or,* we're running software from the future™ [08:54:35] Also, morning! [08:54:57] ahhh no wait https://gerrit.wikimedia.org/r/c/operations/debs/presto/+/543161 [08:55:02] morning! [08:55:50] so we have 226, and latest upstream is 242 [08:58:00] A-ha! That makes more sense. Pesky version numbers :) [09:12:31] elukey: I suppose netbox does not know whether a given machine contains a GPU? [09:13:00] klausman: yep exactly [09:13:16] So which machines are candidates for the rocm38 test? [09:13:48] you can find them in regex.yaml in puppet, an-worker1096-1101 [09:15:46] All five have GPUs? [09:16:59] Ah, right, they do [09:18:13] 6 have the gpu [09:18:21] So how about we drain 1101, I split it out of that group to get rocm38, we do a puppet run, watch the fireworks and decide what to do with it (move back to 33 or add it to the cluster with 38)? [09:18:35] yep makes sense [09:18:45] Yeah, I forgot to count 1100, somehow :) [09:18:58] I am currently decommissioning a node, so let's try to keep the hdfs datanode downtime minimal if possible [09:19:04] ack. [09:19:07] thanks :) [09:19:31] I mean, we can wait until the current decom is done, if the timing fits? [09:19:45] I'll just prep the Puppet change [09:23:53] nono it will take hours, let's do it [09:24:46] Can we drain the machine (of jobs, not data) without doing a puppet run? [09:28:13] so what I usually do is downtime, puppet disable and then systemctl stop hadoop-yarn-nodemanager [09:28:25] when the jvm containers are done, I proceed [09:28:47] you can also stop the hdfs datanode before installing as well [09:29:02] Ok, sounds good. Will send the Puppet change for rocm38 for your review and then drain the Yarn jobs [09:29:11] ack [09:32:19] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10GoranSMilovanovic) @elukey All is fine, thank you very much! @Milimetric @JAllemandou Given the comment in T265851#6560938, do you want me to close this task as resol... [09:33:29] I am doing an experiment with Presto TLS config, it may not work for a brief moment [09:37:48] https://gerrit.wikimedia.org/r/c/operations/puppet/+/635260 is ready for you.
I did a simple c&p of the base role and tweaked the REs [09:40:48] klausman: that is fine, but you can also add a host-level hiera override (which takes priority over regex.yaml) [09:41:07] for example, check under hieradata/hosts [09:41:22] you could create an-worker1101.yaml with profile::amd_gpu::rocm_version: '38' in it [09:42:13] Ah, that sounds cleaner [09:46:28] Ok, updated the change to use a host override [09:47:17] +1 [09:48:53] Thanks! Will now drain the machine as discussed [09:59:51] Hmm. apt1001 does not have rock-dkms-firmware. Investigating [10:01:50] q [10:05:59] elukey: I have a sneaking suspicion that because rock-dkms-firmware is in the same subdir as rock-dkms, the apt/reprepro setup breaks. [10:08:34] klausman: mmm so rock-dkms-firmware is not included when doing checkupdate/update? [10:09:45] Correct [10:10:01] http://repo.radeon.com/rocm/apt/3.8/pool/main/r/rock-dkms/ has it, and the grep-line in the reprepro config mentions it [10:10:15] But it does not show up in the reprepro tree [10:11:19] is it in the package list configured for the remote repo? I am wondering if it is something like mvisionx [10:12:49] it is there, I see Package: rock-dkms-firmware [10:12:55] weird [10:13:11] http://repo.radeon.com/rocm/apt/3.8/dists/xenial/main/binary-amd64/Packages has it, yeah [10:13:41] (also checked the .gz, just in case) [10:14:13] root@apt1001:/srv/wikimedia# reprepro lsbycomponent rock-dkms-firmware [10:14:16] rock-dkms-firmware | 1:3.8-30 | buster-wikimedia | thirdparty/amd-rocm38 | amd64 [10:14:19] klausman: --^ [10:14:43] so we have it in our repo [10:15:02] what error do you get? maybe it needs an apt-get update? [10:16:40] E: Unable to locate package rock-dkms-firmware [10:17:08] did an apt update, [10:18:02] Hmmm. Would there need to be an entry in apt/sources.list.d/wikimedia.list mentioning rocm? [10:18:12] ah nvm [10:18:25] /etc/apt/sources.list.d/repository_amd-rocm38.list exists and looks right [10:19:33] yep [10:19:44] https://apt.wikimedia.org/wikimedia/pool/thirdparty/amd-rocm38/r/rock-dkms/ has the package as well [10:19:51] and yet: [10:20:07] E: Unable to locate package rock-dkms-firmware [10:20:13] very weird [10:20:33] It also lists only the 3.3 rock-dkms package, no 3.8 [10:20:35] ahhhhhhhhh [10:20:56] rock-dkms-firmware | 1:3.8-30 | buster-wikimedia | thirdparty/amd-rocm38 | amd64 [10:21:07] elukey@an-worker1101:~$ cat /etc/debian_version [10:21:07] 9.13 [10:21:10] :) [10:21:24] we also need to checkupdate/update stretch-wikimedia [10:21:26] Oh [10:21:37] the workers are not on buster [10:23:14] Right. did an update on apt, now doing a puppet run (which should install 3.8) [10:24:27] super [10:24:42] mforns: have we looked at dagster?
https://dagster.io/ [10:25:31] it seems to me a little more polished than airflow and apache 2 all around (I think) [10:26:19] today we have a meeting with Jarek (an airflow PMC member) about our use cases and airflow 2.0 (in alpha but close to prime time) [10:26:44] if you want to jump in the meeting I can add you, it is at 17 CEST [10:26:44] Executors are still dask/celery just like airflow, but maybe dask is worth spinning up [10:28:14] um, I think Marcel has all my use cases in mind and more, so I’m not needed [10:28:57] !r rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place [10:29:13] erm [10:29:16] !log rocm38 install on an-worker1101 successful, rebooting to make sure everything is in place [10:29:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:29:22] just saw dagster and it looks like a solid popular alternative, and maybe some crossover with an ML pipeline (maybe klausman is interested in it vs kubeflow) [10:29:40] Will have a look [10:29:51] let's also check license [10:30:13] It seems to me like one of these workflow orchestrator things should be able to handle both types of pipelines [10:30:17] They have something on GH which is APL-2 [10:30:35] yep, all apl 2 as far as I could see [10:31:18] it may be ambitious to try and do both, but maybe the right decision would be to wait. Flyte is getting better as well [10:32:35] we have also been waiting for a long time to replace oozie, we'd need to pull the trigger at some point [10:35:11] an-worker1101 is back up, with rocm38 [10:47:06] wow [10:47:07] !!! [10:47:40] all good? [10:48:04] No :) [10:48:20] There are still some deps missing from the reprepro config. Working on those right now [10:49:10] sure sure [10:54:14] https://gerrit.wikimedia.org/r/635279 if you're not lunching already [10:54:58] +1! [10:55:15] I assume those are all deps that don't need to be pulled via puppet explicitly [10:56:23] Correct [10:58:03] molto bene (very good) [10:58:06] :D [10:58:20] 10Analytics-Clusters, 10Operations: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (10Marostegui) p:05Triage→03Medium [10:59:59] * elukey lunch! [11:03:38] I think we're a bit screwed re: rocm38 on stretch [11:04:25] one of the dependency chains ends with (libstdc++-5-dev||libstdc++-7-dev) && (libgcc-5-dev||libgcc-7-dev) [11:04:34] None of these are available on stretch [11:07:15] I'll revert the puppet change (38 override) for now to get the machine back into a sane state [11:34:53] * klausman lunch [12:04:39] ouch :( [12:11:02] 10Quarry: Quarry down - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:11:54] 10Quarry: Quarry down for logged in user?? - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:13:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) The main problem with Presto seems to be that the puppet CA is not picked up as trusted source, so any... [12:15:33] 10Quarry: Quarry down for logged in user - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:23:29] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10JAllemandou) @GoranSMilovanovic the needed docs are updated (no sqoop page per se, but related sqoop usages in other pages). Feel free to close the task. Thanks!
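An aside on the rocm38 dependency dead-end above: a chain like (libstdc++-5-dev||libstdc++-7-dev) can be checked without touching the machine's DKMS state by letting apt simulate the install. A minimal sketch on a stretch host, using only the package names from the discussion:

    # Simulate the install (-s): apt resolves the full dependency tree and
    # reports the first unsatisfiable dependency without unpacking anything.
    sudo apt-get update
    apt-get install -s rock-dkms rock-dkms-firmware

    # Confirm the offending toolchain packages are really absent on stretch:
    apt-cache policy libstdc++-7-dev libgcc-7-dev

On stretch the policy query would come back empty for both, matching klausman's finding that none of them are available there.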
[12:25:15] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10Count_Count) [12:29:50] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10JeanFred) I can confirm this behaviour − noticed it a few hours ago. [12:33:39] (03CR) 10Joal: [C: 04-1] "The fields should be removed from the refinery-core related files. RefineryGeocodeDatabaseResponse is the data-handler and GeocodeDatabase" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [12:40:50] 10Quarry: Quarry down for logged in users - https://phabricator.wikimedia.org/T265997 (10JeanFred) Even logged-out, only the front-page seems accessible, eg https://quarry.wmflabs.org/query/runs/all returns 500 Internal Server Error [12:44:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) If I add: ` -Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts -Djavax.net.ssl.trustStorePassword=... [12:52:03] 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (10GoranSMilovanovic) 05Open→03Resolved @JAllemandou Fine. @elukey Thank you for your help and advice. Ticket closed as resolved. [12:54:19] ok so I think I got how to make presto use puppet TLS certs, trying again [12:54:24] 10Analytics, 10Analytics-Kanban, 10Design-Research: Setup and integrate analytics (Matomo) for Design Strategy Website - https://phabricator.wikimedia.org/T259322 (10Volker_E) [12:58:28] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10klausman) As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the follow... [12:59:01] elukey: what do you think the timeframe for everything-on-Buster looks like? [12:59:24] klausman: a couple of quarters at least [13:05:59] so the idea for hadoop (that is the big problem when migrating) is the following [13:06:19] - we migrate from cdh to bigtop 1.4 on stretch, upgrading hdfs etc.. [13:06:38] - when upstream releases 1.5, we upgrade to it on stretch (upgrading hdfs again) [13:07:10] - at this point, we'll have bigtop 1.5 with packages for stretch and buster, so we'll be able to reimage a couple of workers at a time [13:07:14] (to buster) [13:07:40] so what I have in mind is to do two hdfs upgrades and then an os upgrade [13:07:48] (hdfs 2.6 -> 2.8.5 -> 2.10) [13:08:21] does it make sense? [13:09:55] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10elukey) I would upgrade the two stats to 3.8 leaving the hadoop workers to 3.3, so we could keep testing the drivers keeping the stretch stack stable.... [13:15:31] Yeah, that sounds reasonable. [13:16:07] In the meantime, I think unless we find major breakage with rocm33, we should just stay on it. The parallel-workload and memory issues have been fixed by DKMS, so the pressure is not too high. [13:16:42] If we *do* find major breakage, we'll have to take a look at the intermediate versions, maybe there is one that fixes stuff but doesn't have dependency problems on Stretch.
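For the Presto TLS experiment above (and elukey's truncated openssl s_client tip from the morning), the shape of the verification is roughly the following. The host and port are hypothetical placeholders, not taken from the log:

    # Does the service certificate verify against the trusted CA bundle
    # (which must include the puppet CA for this check to succeed)?
    echo y | openssl s_client \
        -CApath /etc/ssl/certs \
        -connect an-coord1001.eqiad.wmnet:8281 2>/dev/null \
        | grep 'Verify return code'
    # Expected on success: "Verify return code: 0 (ok)"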
klausman: we could move only what's on buster to 3.8, so people on stat100x could get the latest drivers [13:19:05] or we can stay on 33 and wait, maybe we can check with people using gpus on the stats [13:19:25] with newer drivers it is also possible to use a more up to date tensorflow-rocm version [13:20:08] The Buster machines I think we should slowly upgrade [13:20:35] Are there any in that category besides 5 and 8? [13:20:47] I mean Buster+GPU [13:24:43] nope only those two [13:25:05] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (10klausman) Sounds good to me. I will make an announcement that we'll have some disruption on stat1005 and 1008 soonish and then do the upgrade via pupp... [13:27:25] I'd propose Thu for the upgrade of one machine, but that's DC switchback day, and I'd rather not combine the two. [13:27:37] So Fri morning? [13:28:37] sounds good [13:33:01] !log move presto to puppet host TLS certificates [13:33:03] it works! [13:33:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:33:15] finally [13:34:57] !log upgrade superset's presto TLS config after the above changes [13:35:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:35:00] https://superset.wikimedia.org/superset/dashboard/73/ works fine [13:36:58] and the CLI tool works as well [13:37:18] hello teammm [13:39:05] o/ [13:44:49] taking a break! [13:44:53] :) [13:44:54] hi [13:45:06] luca and I are the same person [13:45:28] mforns: do you remember anything about the mediawiki_api_request events? [13:45:48] milimetric: ? [13:45:48] some of the okapi folks are asking about it, they want to use it [13:46:12] milimetric: is that an eventlogging schema? [13:46:13] but I see that collection is turned off and it's not on the whitelist, the last data left in August would have been deleted [13:46:19] it might be event bus [13:46:24] oh [13:46:25] is that sanitization different? [13:46:31] wait looking [13:49:16] milimetric: I can see data in /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2020/month=10/day=20/hour=10 no? [13:49:40] ahhhH!!! I forgot about the datacenter switchover [13:50:20] hm... but why is this not showing up in the partitions of the table? [13:50:37] hm... [13:51:37] oh... they do... :) [13:52:02] they sort above eqiad, so it looks like they're not there if you're superficial like me :) [13:53:15] oh [13:55:00] mforns: ok, so regarding the whitelist, do we use the same list for the EventBus stuff or did those plans move along [13:57:34] milimetric: yes, you're right, data seems to be fine [13:57:48] there's an overlap period where data is in both eqiad and codfw [13:57:54] and now only in codfw [13:58:19] oh cool, thx for checking [13:58:37] milimetric: the whitelist could work, yes, but right now EventLoggingSanitization filters out all data set names that have underscores in them [13:58:40] I know there's that open discussion on sanitization in the schema/etc [13:58:46] to prevent deleting data that is not EventLogging [13:58:52] yes [13:58:59] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10nshahquinn-wmf) If approval from @SBisson's manager is needed, that would be @Arrbee.
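A sketch of the mediawiki_api_request partition check being described above. The Hive table name event.mediawiki_api_request is inferred from the HDFS path and is an assumption:

    # Raw files on HDFS for the codfw side of the switchover:
    hdfs dfs -ls /wmf/data/event/mediawiki_api_request/datacenter=codfw/year=2020/month=10/day=20

    # Hive partitions; codfw sorts before eqiad, which is why the eqiad
    # partitions looked missing at first glance in the conversation above.
    hive -e 'SHOW PARTITIONS event.mediawiki_api_request;' | grep 'month=10/day=20'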
[13:59:08] hm, ok, but this data doesn't end up in event_sanitized [13:59:11] so it's deleted anyway [13:59:30] yes [14:01:32] klausman: one thing about the upgrade - we should check what is the version of tensorflow that works with 3.8 and ping miriam_ about it [14:01:54] milimetric: do you need that data to be sanitized and kept for longer? [14:02:14] I believe if we add it to the whitelist, it will be sanitized together with EL [14:02:46] you were saying it ignores _ [14:03:11] milimetric: I think I was confused... [14:03:27] it's the deletion scripts that do that! [14:03:29] there's some new interest in it. I was mostly trying to remember what we decided with it, there's an old task to ooziefy some transformations of it that nobody picked up [14:03:39] oh ok, so we can sanitize it, good [14:03:45] I think so, yes [14:04:25] the sanitization is persisting stuff indefinitely, thus an include-list is good, because it excludes new stuff by default [14:04:49] the deletion is dropping data, so exclude list is better, so that new stuff is kept by default [14:04:50] yep, it makes sense [14:08:02] klausman: like https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/commit/26764990921fdf3a98c0da7c7d81635496f116fa, not sure if it is included in the last tf version on pypi [14:08:06] we should make sure [14:08:09] otherwise it will not work [14:15:28] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) Ok so in https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/635304 I've removed `client_ip` from... [14:19:16] Good point [14:29:36] elukey, joal: do you guys want to enter the meeting with Jarek like 15 minutes earlier, to align our questions and brief? [14:29:51] Yes ! [14:29:57] mforns: --^ [14:30:07] ok! [14:30:25] see you in 15, elukey please join if you can :] [14:30:30] ack [14:37:42] yep! [14:38:47] mforns, joal let's also send e-scrum [14:38:53] yep [14:39:02] oh yes [14:40:42] elukey: https://phabricator.wikimedia.org/T246004#6564520 when you have a moment :) [14:42:51] ah yes sure, will do it after meetings :) [14:42:57] or is it super urgent? [14:47:17] a-team today's standup should be at 6 european? it's at 5 right now [14:47:21] joal, elukey pingggg [14:47:33] all those meetings are a mess :) [14:47:34] joining mforns [14:47:42] i see meeting starting in 13 mins [14:47:43] no? [14:47:53] nah, that conflicts with managers' [14:48:21] 10Analytics: Retain nonsensitive mediawiki_api_request logging data - https://phabricator.wikimedia.org/T265952 (10Nuria) We can keep data for longer than 90 days that has no identifying fields. Just need to submit a changeset that lists those fields. Please take a look at docs: https://wikitech.wikimedia.org/wi... [14:48:26] ok i guess you tell me when to join bc and i will be there!
;) [14:48:38] changing it [15:00:50] !log disabling sending EventLogging events to eventlogging-valid-mixed topic - T265651 [15:00:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:53] T265651: Disable eventlogging-valid-mixed topic - https://phabricator.wikimedia.org/T265651 [15:20:17] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Nuria) >If the intent is to decide whether all errors are from same user you can send the number of errors for t... [15:20:41] 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10Milimetric) Ok, so I modified the config to just `console.log(context.clientIp)` and ran it with the turnilo deployed on an-tool1007, on port 9092 (and tunneled). With that setup, it looked like... [15:20:55] (03PS2) 10Razzi: Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) [15:21:18] (03CR) 10Razzi: "Thanks @Joal for pointing me in the right direction. Let me know how the new changes look." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [15:21:37] (03CR) 10jerkins-bot: [V: 04-1] Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [15:23:14] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) @Isaac thanks for sharing these! I think the following points are most useful: > a blocklist of countries we'... [15:24:09] (03PS3) 10Razzi: Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) [15:28:54] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Amire80) I'll have to repeat that San Marino is a very extreme case :) However, monthly is an OK default, at least as a... [15:56:35] you guys, I don't understand how anything works [15:56:51] oh, confirmation bias, nvm [15:56:52] :) [16:18:44] ottomata: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/635319 [16:24:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10jcrespo) I can confirm backups have been flowing weekly as expected: `lines=10 +------+------------------------------------------+---... [16:35:53] elukey: ty! [16:38:02] (03CR) 10Joal: [C: 03+1] "LGTM! Thanks for the quick turnaround @Razzi. Let's see if anyone else wants to double check, otherwise let's merge." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [16:42:21] ping urandom [16:42:27] SORRY [16:42:30] ping fdans [16:42:39] ping razzi [16:43:16] fdans:, razzi : RETROOOO [16:43:22] yup yup [17:20:53] fdans: wanna brainbounce some JS with me for a sec? 
I just need a sanity check [17:20:59] (oh, it's lunch, anytime is fine) [17:21:13] milimetric: it's cool, I can do now for a lil bit [17:21:18] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10Nuria) For faster resolution of permits issues add #sre-access-requests to ticket, that way the persosn on clinic duty will get to work on it soon after ticket is filed. I understand that process is a bit confusin... [17:21:22] 'cavin [17:21:29] https://meet.google.com/kti-iybt-ekv?pli=1&authuser=1 [17:21:40] (cave's used up) [17:22:12] fdans: is the standup on thursday at the right time on the calendar? it collides with grosking... [17:25:31] (03CR) 10Nuria: [C: 03+1] "Looks good, let's make sure to test UDF in cluster before merging." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [17:26:53] Heya razzi - wish to do some pair testing on the cluster for --^? [17:27:35] joal: Yeah, give me a few minutes [17:27:41] sure [17:29:21] * elukey afk! [17:30:10] nuria: if you have time later on https://gerrit.wikimedia.org/r/c/operations/puppet/+/635227/ (I'll merge it tomorrow in case) [17:34:04] joal: batcave? [17:34:12] yes razzi - joinig [17:49:20] milimetric: is the move of event data from eqiad to codfw final? [17:49:40] mforns: mmm, what do you mean? [17:49:47] it is affecting at least one oozie job, and I was not fully aware of it [17:50:26] is this about the wikidata link job? [17:50:37] I see that another data set (mediawiki_page_move) has shifted over receiving events from eqiad to receiving them from codfw in the last weeks [17:50:39] the one that depends on every hour for like a month [17:51:04] I believe it depends of data for one week [17:51:16] ok, link me [17:51:21] k [17:51:58] milimetric: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/events/datasets.xml#L30 [17:52:07] see the hadrcoded datacenter=eqiad? [17:52:30] that is preventing the oozie job from running, even if data exists under datacenter=codfw [17:52:53] mforns: yeah, I know all about this, it's a hack [17:52:54] https://github.com/wikimedia/analytics-refinery/blob/7faf75747dfae8bb0a3363b2401249c9f165a9e5/oozie/wikidata/item_page_link/weekly/coordinator.xml#L111 [17:53:08] that's how it's used, that job is depending on 1 month + 7 days of page move data [17:53:53] basically, there's no way to tell it to depend on both eqiad and codfw, so we just picked one [17:53:55] but that is fine, the issue I see is that the datacenter is hardcoded in the config, while the data seems to be shifting from one datacenter to the other in the last weeks [17:54:01] oh [17:54:11] and the strategy has been, when the job should run, we fake the _SUCCESS flags in whichever one is hardcoded [17:54:29] Joseph did it a few weeks ago and I did it two weeks ago [17:54:30] milimetric: but do you know why the data is shifting to codfw recently? [17:54:53] yeah, that's the data center that's operational now, they're running all the mediawiki app servers from there [17:55:01] until October... I wanna say 27th? [17:55:17] oh [17:55:21] (Luca sent an email about the exact date) [17:56:01] so the nuance here is that the page move table has partitions for both codfw and eqiad [17:56:20] and the select statement doesn't specify the data center, so it's like "where datacenter like '%'" [17:56:30] ok [17:56:51] so the action item here is basically to copy the success files over to reqiad, then? 
[17:57:00] *eqiad [17:57:02] so it doesn't matter for the actual query, but to trigger the job, when data is available in codfw/year=2020/month=10/day=20/hour=5, you have to put a success flag in the corresponding eqiad folder [17:57:12] yes, one sec [17:57:13] ok, got it [17:58:27] heya mforns - has the train started? [17:58:40] joal: I'm gathering info now to start [17:58:50] do you want me to add sth? [17:58:55] great mforns - we have a patch for you with razzi [17:59:04] cool [18:00:31] (03CR) 10Razzi: [C: 03+2] Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [18:00:36] mforns: --^ [18:01:13] ok joal and razzi, will deploy in short [18:01:30] ack mforns - we're gently moving, please let us know ;) [18:01:52] mforns: sorry I can't find the commands, but the only thing to look out for really is that the success flags are in codfw, so a copy from there to eqiad is probably the best option [18:02:05] joal: what do you mean with gently moving? :O [18:02:19] mforns: refinery-source is ready, we're pushing a refinery patch soon [18:02:35] milimetric: thanks for looking :] no problemo [18:02:37] mforns: + updating etherpad train [18:02:40] joal: ah! ok ok [18:02:48] thanks! [18:11:42] (03PS1) 10Razzi: oozie: update webrequest/load hive jar version [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635352 (https://phabricator.wikimedia.org/T236740) [18:21:37] mforns: how does https://gerrit.wikimedia.org/r/c/analytics/refinery/+/635352/ look? [18:22:05] Also see the message on the train etherpad [18:22:38] razzi: LGTM! [18:23:34] razzi: I will merge that one after deploying source, thanks! [18:25:11] razzi: probably good to edit https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Changes_and_known_problems_since_2015-03-04 and add the change there [18:25:11] :) [18:28:06] 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10Milimetric) Got the little %^&*(er :) https://github.com/allegro/turnilo/pull/668 [18:30:05] ottomata: thanks, updated that [18:30:09] super helpful brainbounce, fdans, I love how much I get from just watching your reaction to stupid things I say :) [18:30:27] (PR sent to Turnilo, problem was very simple) [18:31:03] ottomata: tomorrow we'll deploy that change to events refining [18:37:43] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Make node-rdkafka an optional dependency of EventGate - https://phabricator.wikimedia.org/T266058 (10Ottomata) [18:54:24] milimetric: wat you didn't say anything stupid! [19:25:36] ottomata: I see this change is to be deployed today; Use camus + EventStreamConfig integration in CamusPartitionChecker [19:25:58] ottomata: do I need to bump up anything in refinery? Or restart anything? 
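Since the exact commands for the _SUCCESS workaround were not at hand earlier ("sorry I can't find the commands"), a minimal sketch of what faking the flag in the hardcoded eqiad path could look like. The base directory and the use of mediawiki_page_move are assumptions based on the paths mentioned in the discussion:

    # Assumed layout; the oozie coordinator waits on the eqiad _SUCCESS flag.
    BASE=/wmf/data/event/mediawiki_page_move
    HOUR='year=2020/month=10/day=20/hour=5'

    # Only fake the eqiad flag once the codfw data for that hour has landed.
    if hdfs dfs -test -e "${BASE}/datacenter=codfw/${HOUR}/_SUCCESS"; then
        hdfs dfs -mkdir -p "${BASE}/datacenter=eqiad/${HOUR}"
        hdfs dfs -touchz "${BASE}/datacenter=eqiad/${HOUR}/_SUCCESS"
    fi

In practice this would be looped over every hour the coordinator is still waiting on.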
yes, once that is deployed i'll make it happen [19:26:07] via puppet [19:26:12] ok ok, I'll leave that to you then [19:26:23] :] [19:38:44] (03PS1) 10Mforns: Update changelog.md for v0.0.137 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635373 [19:39:35] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/635373 (owner: 10Mforns) [19:40:29] Starting build #60 for job analytics-refinery-maven-release-docker [19:50:49] Project analytics-refinery-maven-release-docker build #60: 09SUCCESS in 10 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/60/ [19:56:45] Starting build #27 for job analytics-refinery-update-jars-docker [19:57:03] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.0.137 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635378 [19:57:03] Project analytics-refinery-update-jars-docker build #27: 09SUCCESS in 18 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/27/ [19:59:52] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635378 (owner: 10Maven-release-user) [20:00:31] !log Deployed refinery-source v0.0.137 [20:00:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:06:43] 10Analytics, 10MediaWiki-General, 10Platform Team Workboards (Clinic Duty Team): Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Clarakosi) [20:07:13] 10Analytics, 10MediaWiki-General, 10Platform Team Workboards (Clinic Duty Team): Proposal: drop kafka-php dependency from MediaWiki - https://phabricator.wikimedia.org/T265966 (10Clarakosi) p:05Triage→03Low [20:22:36] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/635352 (https://phabricator.wikimedia.org/T236740) (owner: 10Razzi) [20:24:49] !log Deploying refinery with scap for v0.0.137 [20:24:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:53:06] 10Analytics, 10Analytics-Kanban: actor_signature_per_project_family does not work for apps - https://phabricator.wikimedia.org/T258101 (10razzi) a:05Nuria→03razzi [20:59:07] !log Deploying refinery with refinery-deploy-to-hdfs (for 0.0.137) [20:59:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:41:16] 10Analytics: analytics.wikimedia.org incompatible on iOS and Android - https://phabricator.wikimedia.org/T266071 (10Peachey88) [22:47:07] 10Analytics, 10Product-Analytics, 10Structured-Data-Backlog: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (10nettrom_WMF)
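For reference, the deployment-train steps !logged above map roughly onto the commands below. Host names, paths and flags follow the documented process as remembered, not a transcript of what was actually run, so treat them as assumptions:

    # 1. On the deployment server (deployment.eqiad.wmnet): ship refinery
    #    to the analytics hosts. <sha> is a placeholder for the release commit.
    cd /srv/deployment/analytics/refinery
    scap deploy "Regular analytics weekly train [analytics/refinery@<sha>]"

    # 2. Then on the launcher host: sync refinery artifacts (jars, oozie
    #    code) to HDFS, authenticating with the deploy keytab first.
    sudo -u analytics-deploy kerberos-run-command analytics-deploy \
        /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs \
        --verbose --no-dry-run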