[00:04:31] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10jwang) @Jhernandez I made some formatting changes on the page. Let me know if it looks better on smaller screen. [01:10:25] 10Analytics-Radar, 10Performance-Team, 10Research, 10Epic, 10Patch-For-Review: Citation Usage: run third round of data collection - https://phabricator.wikimedia.org/T213969 (10Krinkle) 05Resolved→03Open a:05bmansurov→03Krinkle [01:13:21] 10Analytics-Radar, 10Performance-Team, 10Product-Analytics, 10Reading Depth: Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Krinkle) 05Resolved→03Open The production payload for readingDepth.js is still being transferred and parsed on all page vie... [01:13:37] 10Analytics-Radar, 10Performance-Team, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth: Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Krinkle) [01:13:47] 10Analytics-Radar, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth, 10Performance-Team (Radar): Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Krinkle) [01:14:31] 10Analytics-Radar, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth, 10Performance-Team (Radar): Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Krinkle) [01:16:41] 10Analytics-Radar, 10Research: Citation Usage: Can instrumentation code be removed? - https://phabricator.wikimedia.org/T262349 (10Krinkle) >>! In T213969#6445202, @gerritbot wrote: > Change 626016 **merged** by jenkins-bot: > [mediawiki/extensions/WikimediaEvents@master] citationUsage: Remove unused campaign... [01:17:04] 10Analytics-Radar, 10Performance-Team, 10Research: Citation Usage: Can instrumentation code be removed? 
- https://phabricator.wikimedia.org/T262349 (10Krinkle) a:03Krinkle [01:17:07] 10Analytics-Radar, 10Performance-Team, 10Research, 10Epic, 10Patch-For-Review: Citation Usage: run third round of data collection - https://phabricator.wikimedia.org/T213969 (10Krinkle) 05Open→03Resolved See T262349. [01:31:24] 10Analytics-Radar, 10Performance-Team, 10Research: Citation Usage: Can instrumentation code be removed? - https://phabricator.wikimedia.org/T262349 (10leila) @bmansurov I want to make sure you're aware of this. [06:54:37] 10Analytics-Clusters, 10Operations: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10MoritzMuehlenhoff) This was now fixed in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=20338#c5 And there's now also a bug in Debian to backport it to Buster: https://b... [07:25:47] !log restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage [07:25:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:25:50] this is bad --^ [07:25:55] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-cp_cluster=cache_text&var-datasource=eqsin%20prometheus%2Fops&var-instance=All&var-source=webrequest&viewPanel=20 [07:26:05] we dropped a lot of data from the two nodes :( [07:30:50] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -eqsin- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [07:31:29] * elukey cries in a corner [07:32:36] :( [07:32:42] * joal pat elukey on the back [07:32:54] kids days today, will be mostly off until standup [07:33:31] ack [07:34:09] so the problem is that vk-webrequest on cp50[12,10] kept using (I suppose) an old 
socket to kafka-jumbo1006, that was affected by the network outage [07:34:31] the icinga alert is an aggregated one, so it is listed under "icinga1001", and before leaving I didn't notice it [07:34:45] in fact there was no recovery for eqsin to alerts@ [07:34:58] and for hours those nodes have dropped data [07:36:41] going to send an email to alerts@ explaining what happened.. [07:43:05] elukey: o/ when using stat1005, I realized that a lot of memory is used by processes not doing anything (S and D state). I checked with some users, but for example agaduran mentioned he was not able to kill the corresponding pids. how can we free (some of) the memory again? [07:45:26] mgerlach: o/ [07:45:54] mgerlach: well there are also a lot of running processes consuming memory, if you check with top [07:47:26] elukey: I know but the top consumers top -o %MEM are all in S or D [07:47:58] mgerlach: so they are in S state for a bit, then R, I think those are doing I/O [07:48:07] and these processes are like that at least for a day [07:48:31] yeah there are python processes pulling from /mnt/data/xmldatadumps [07:48:56] we are soon going to bump ram on 1005 and 1008 to 1.5TB [07:49:10] hopefully these problems will be less and less [07:49:42] mgerlach: you can follow up with the owner of those processes, or move to a different stat temporarily [07:50:31] ok will do, I followed up with agaduran - he wanted to kill his D processes but said he wasn't able to [07:51:07] lemme check [07:55:29] mgerlach: mmm I think those processes are using the GPU? [07:55:40] could be [07:56:00] I mean, should we kill them? [07:56:07] or is it a new thing running??
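The memory hunt above (`top -o %MEM`, processes stuck in D state) can be scripted; a minimal sketch, Linux-only and stdlib-only, that walks /proc and lists processes in a given state — the field layout is the standard proc(5) one, nothing stat1005-specific:

```python
import os
import re

def proc_states(target_states=("D",)):
    """Return (pid, comm, state) tuples for processes in the given states.

    Parses /proc/<pid>/stat; the comm field is wrapped in parentheses
    and may itself contain spaces or parens, so the greedy match binds
    to the last ") <state>" occurrence, which is the real field boundary.
    """
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:  # process exited while we were scanning
            continue
        m = re.match(r"\d+ \((.*)\) (\S)", stat, re.DOTALL)
        if m and m.group(2) in target_states:
            results.append((int(pid), m.group(1), m.group(2)))
    return results

if __name__ == "__main__":
    # D = uninterruptible sleep, usually waiting on I/O or a device
    # (e.g. /dev/kfd); these ignore SIGKILL until the wait completes.
    for pid, comm, state in proc_states(("D",)):
        print(f"{pid}\t{state}\t{comm}")
```

This is also why agaduran could not kill his processes: D-state tasks only die once whatever kernel wait they are stuck in (here, plausibly the GPU driver) returns.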
[07:56:19] yesterday agaduran was talking about zombie processes [07:57:18] mgerlach: --^ [07:57:45] ahhh no wait [07:57:46] https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1&from=now-2d&to=now [07:57:53] the gpu on 1008 is being used [07:58:28] on 1005, if I run radeontop, I see that ~50% of the GPU's ram is used [07:58:49] but the gpu seems not to be doing anything [07:58:51] he moved to stat1008 since he said he couldn't run things on stat1005 anymore [07:58:57] so I think this is a weird state [07:59:07] so I checked and saw all those processes [07:59:34] it may be due to the ROCm drivers of the GPU, sometimes we get into this state [07:59:41] the GPU is unusable until we reboot [07:59:49] zombie was probably not the technically correct term [08:01:53] ok lemme try something [08:05:04] mgerlach: so the processes seem to be using /dev/kfd, that is the kernel driver for the gpu [08:05:17] I see a lot of keras etc.. [08:05:33] so I am pretty sure that the processes are waiting for the GPU [08:05:42] I attempted a GPU reset but it didn't work [08:07:29] klausman: o/ [08:12:05] mgerlach: Tobias is working on using more up-to-date drivers for the GPU for another task, I hope that this work will also make these issues disappear [08:13:05] elukey: thanks for taking a look and working on them (also thanks klausman: ) [08:15:37] so for the moment my only "solution" is to schedule a reboot [08:16:28] removing the kernel drivers (amdgpu and amdkfd) may work but I am pretty sure that there is a chance that the kernel will not like it :D [08:33:41] * elukey biab [08:54:50] Morning! [09:03:28] klausman@stat1005:~$ apt-cache search rock-dk [09:03:30] rock-dkms - rock-dkms driver in DKMS format. [09:03:36] Looks like the package is visible. [09:04:24] klausman: o/ - how was it imported in the end? [09:07:53] also, I am wondering what is best in our case [09:08:21] is it possible to force dkms to compile the kernel modules for the kernel we use on buster on say deneb?
[09:08:34] or are we going to just install dkms on 1005? [09:08:39] (curious about what's best) [09:09:41] I had run the cmdline I sent before you replied :-S [09:09:56] ahhh okok np [09:10:04] I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/626112 for consistency [09:10:10] I think for testing of the driver, just using the DKMS is easier/less overhead/less likely to go wrong [09:10:37] but probably not needed, if we want to just test [09:10:43] I'll keep it there, we'll see [09:10:47] We can a fully static (as in: no compilation) driver afterwards, I think [09:10:53] super [09:10:54] make* [09:11:08] Thing is: that will mean having to make a new package for every kernel upgrade [09:11:33] And we'd have to come up with a clever versioning scheme for that aspect [09:11:49] one note - on 1005 the situation is a bit weird, we have again the GPU "stalled".. there are some processes in D state with references of /dev/kfd on lsof, and the gpu reset commands that upstream suggests don't work :( [09:11:50] (I'm no Debian dev, I dunno if there already is something like that) [09:13:09] yep we can ask Moritz what's best at that point [09:13:22] So when is a good time to actually install the DKMS? And how do we inform people that there may/will be disruption? [09:13:36] good question :) [09:13:55] I think that (if we don't find another way) 1005 needs to be rebooted to make the GPU available again [09:14:10] that would be a good occasion to install dkms [09:14:40] Agreed [09:15:10] I do wonder if the rock-dkms conflicts with the amdkfd module [09:17:00] Does this machine have a remote access console, like an iLo or Drac?
[09:18:34] it does yes, DRAC [09:18:55] usually what I do is jump on cumin1001.eqiad.wmnet, and do [09:19:13] ssh root@stat1005.mgmt.eqiad.wmnet [09:19:20] the password is contained in pwstore [09:19:39] so we need to have you added in there [09:19:57] I also told Razzi yesterday to create a gpg key, so we'll add new people and re-encrypt one time [09:20:20] moritzm: hello :) can we ask to you for pwstore new members? [09:22:35] And my signed key is still visible nowhere :-/ [09:24:03] sigh :( [09:24:20] anyway, it is not a hard requirement to get you added to pwstore [09:26:49] for sure! [09:27:09] klausman: where did you send it? tklausmann@w.o I suppose? [09:27:54] I sent it to a bunch of SKS servers and keys.openpgp.org (and did the privacy dance). ID is 9B91773F42CB71E3 [09:28:06] (or EC3D2B2DAC6964AFC7134FB69B91773F42CB71E3 if you prefer) [09:28:47] Both SKS and k.o.o have my _unsigned_ key [09:29:09] 10Analytics-Radar, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth, and 2 others: Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10phuedx) >>! In T229042#6445204, @Krinkle wrote: > The production payload for readingDepth.js is still being tr... [09:32:41] (I have posted the new analytics-announce@ mailing list on some channels on slack) [09:51:44] :thumbsup: [09:54:52] klausman: for 1005, if you want we can send an email to some mailing lists (including announce@, but we can't really rely on it) saying that we need to reboot the host [09:55:30] as far as I know in theory we could even do it soon, alerting the current users [09:55:34] Amir1: o/ around by any chance? 
[09:56:09] Maybe a reboot with a lead time like shutdown -r 10m [09:56:22] sure [09:56:38] Then again, I suspect most jobs are headless in the sense that nobody is actually logged in and looking at console messages [09:56:49] IIRC what it is currently running now (the python processes) are crons, that can be re-ran [09:56:53] yes yes [09:57:34] Ok, how much time would you think is sufficient between the announcement on the ML and actually doing the reboot? [09:57:57] we can do it right afterwards, explaining why we had to do it etc.. [09:58:06] Alrighty. [09:58:24] I am going to forward to you the last email that I sent so you can see the mailing lists [09:58:29] Currently waiting on the pws bits, will let you know once I have it [09:59:44] we can do it even without it, issuing the shutdown from a regular ssh session no? [10:00:39] Sure, I just don't want to end up with a freeze-on-reboot and then have to wait for the access bits to unfsck the machine. Then again, you already have access, so we can use that [10:01:12] yes yes right [10:01:28] but I suspect that it will take a bit to have pwstore updated [10:01:41] as you prefer, I am available anytime to unblock you [10:01:56] Ok, then let's do it now. If the host hangs, we can figure it out then [10:02:22] Can I do the honors? :) [10:05:30] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) [10:06:17] klausman: of course! [10:06:33] Ok, I'll send out a mail, with a 5m lead time and then reboot stat1005 [10:06:44] +1 [10:06:58] please also use analytics-announce@ so we test it :) [10:07:02] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) @lmata I can start the work and ask for help from #observability for reviews and questions, thank you! 
[10:10:37] Should I put this in the SAL as well? [10:10:53] yes please, here and in #operations [10:11:28] !log Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442) [10:11:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:11:32] T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 [10:12:19] Reboot should happen any second now. [10:12:24] super [10:13:51] klausman: I forgot to mention something useful to do [10:13:57] `Please type in hostname of the machine to shutdown: Good thing I asked; I won't shutdown stat1005 ...` [10:14:00] Gah. [10:14:23] Ok, rebooting in 1m [10:14:27] elukey:yes? [10:14:39] on icinga1001/icinga2001 there is a script to add downtime to avoid alerts [10:14:59] like "sudo icinga-downtime -h kafka-jumbo1006 -d 1800 -r "maintenance"" [10:15:11] the -d are seconds [10:15:51] Hope I was fast enough ::) [10:15:59] `sudo icinga-downtime -h stat1005 -d 1800 -r "maintenance"` [10:16:07] icinga1001 seems the active one (I thought 2001 but anyway...) [10:16:16] perfect [10:21:05] How does one tell which is the active instance? [10:21:21] there is a big message in the motd if you ssh to 2001 [10:22:09] I am attached to the serial console of stat1005 and the host started booting now [10:22:17] Roger [10:22:29] I am pinging from cumin, to see when the net i/f is back [10:23:47] up and running [10:25:16] root@stat1005:~# rmmod amdgpu [10:25:18] Segmentation fault [10:25:20] Um. [10:25:51] Well, at least I could still remove amdkfd [10:25:59] now installing the kms [10:26:04] dkms* [10:28:28] Ah, missing kernel headers [10:31:20] ahahah interesting start [10:32:03] It's great that rock-dkms then just semi-quietly just doesn't build the .ko [10:32:31] All you get is: [10:32:33] Module build for kernel 4.19.0-10-amd64 was skipped since the [10:32:35] kernel headers for this kernel does not seem to be installed. 
[10:33:16] elukey: I'm around now [10:33:18] hey [10:33:27] We will likely need another reboot, since I fear that segfaulted rmmod may have wedged something [10:33:33] yeah [10:33:53] Amir1: hello hello, we are rebooting stat1005, I wanted to alert you since there are jobs from you (but probably crons) [10:34:30] yeah, no worries, I started it last night, I can restart it [10:37:52] The depmod phase of the DKMS install is taking forever :-/ [10:38:07] I fear some job already got started and is hosing the GPU interaction [10:39:42] Yup, hanging on finit_module() [10:39:52] Trying another reboot. [10:39:59] ack [10:49:30] elukey: I think the machine may be stuck, can you peek at the console? [10:50:47] sure [10:52:11] klausman: it seems so [10:52:36] forced a reboot from the console [10:52:44] thanks [10:52:50] (there is documentation in https://wikitech.wikimedia.org/wiki/Server_Lifecycle) [10:54:42] klausman: should be ready now [10:55:08] Having a look-see [10:56:53] Looking better this time around [10:58:17] Ok, install complete, rebooting again for good measure. [11:01:09] Ok, machine is back. Now to see if radeontop shows better data [11:03:24] And maintenance-over mail sent [11:04:07] sudo /opt/rocm/bin/rocm-smi --showmeminfo vram looks a lot better :) [11:04:33] Yup, updated the bug accordingly [11:04:33] 10Analytics-Clusters, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10klausman) Notes from the install: - `rmmod amdgpu` segfaulted. Not very encouraging. rock-dkms comes with a module blacklist, so another reboot wil... [11:04:49] super [11:05:21] radeontop also seems to still work correctly [11:05:36] I am trying tensorflow [11:05:43] all good [11:06:23] Now the question is if/how we turn the DKMS deb into a no-compile-needed one.
[11:06:33] yep [11:07:08] I mean, dpkg -L rock-dkms|xargs tar && tar2deb is an option, albeit a horrible one [11:07:49] Also, we should give stat1005 some soak time, maybe a week+, and then update stat1008 [11:07:58] I agree [11:08:08] Even if we don't make a "static" deb, this driver clearly works better. [11:08:28] I am also curious to know if it solves the hanging problem that we have seen earlier on [11:09:06] As in the hang I had earlier? or older issues before my time? [11:09:45] the latter [11:10:49] klausman: I think we can go to lunch now :) [11:11:28] Ack. [11:11:44] I shall have some spaghetti carbonara (the egg kind, not cream) [11:12:01] +1 :D [12:04:30] elukey: I think /usr/local/prometheus-amd-rocm-stats.py could use an update to export temps and stuff. Want me to take care of it? [12:23:42] klausman: yep if you want! [12:23:52] Where does the code live, repo-wise? [12:24:04] in operations puppet [12:24:12] righto. [12:24:49] One interesting question is: we used to have "Temp 1" with the old setup, which was just The Temperature metrics-wise (no labels) [12:24:52] the only caveat is that it runs also on 1008, so we'll need to write the code in a way that supports both [12:25:08] The new setup has three sensors (mem, junction, edge) [12:25:16] ah interesting [12:25:39] So I'd keep the old code for Temp, so 1008 still works, and add a location label for the new measurements [12:25:53] +1 I like it [12:25:58] The metric name, I'd keep the same [12:26:11] makes sense yes [12:26:11] Roger dodger, will do some hacking.
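The labelling plan sketched above might look roughly like this. To stay self-contained, this renders Prometheus text-exposition lines by hand rather than using the prometheus client library the real /usr/local/prometheus-amd-rocm-stats.py uses, and the JSON shape, metric name, and sensor key names are illustrative assumptions, not the actual rocm-smi output:

```python
import json

# Hypothetical rocm-smi-style JSON; the real output format may differ.
SAMPLE = json.dumps({
    "card0": {
        "Temperature (Sensor edge) (C)": "38.0",
        "Temperature (Sensor junction) (C)": "41.0",
        "Temperature (Sensor memory) (C)": "52.0",
    }
})

def exposition_lines(raw, metric="amd_rocm_gpu_temp_celsius"):
    """Render one gauge sample per sensor, keeping a single metric name
    and distinguishing the three sensors via a 'location' label."""
    lines = [f"# TYPE {metric} gauge"]
    for card, fields in json.loads(raw).items():
        for key, value in sorted(fields.items()):
            if not key.startswith("Temperature"):
                continue
            # "Temperature (Sensor edge) (C)" -> "edge"
            location = key.split("Sensor ")[1].split(")")[0]
            lines.append(
                f'{metric}{{card="{card}",location="{location}"}} {float(value)}'
            )
    return lines

if __name__ == "__main__":
    print("\n".join(exposition_lines(SAMPLE)))
```

Keeping one metric name with a `location` label (rather than three metric names) matches the plan in the conversation: dashboards that graphed the old single temperature keep working by aggregating or selecting over the label.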
Just need to open a task for it [12:36:15] 10Analytics-Clusters: Update prometheus-amd-rocm-stats Python script to work with new JSON output - https://phabricator.wikimedia.org/T262404 (10klausman) [12:36:50] 10Analytics-Clusters: Update prometheus-amd-rocm-stats Python script to work with new JSON output - https://phabricator.wikimedia.org/T262404 (10klausman) [12:53:02] hey teammm [12:59:49] hola hola [13:11:23] 'lo [13:25:58] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/626150 pretty please :) [13:27:06] Code has been tested manually on both 1005 and 1008 [13:27:41] +1ed, really nice [13:27:49] have you ever run puppet-merge [13:27:50] ? [13:28:23] Nope [13:28:39] ah then it is a good moment to start :) [13:28:49] so you should be able to +2 and submit in gerrit [13:29:06] The cr+2 button at the top right? [13:29:30] yep [13:29:37] this will merge it in the operations/puppet repo [13:29:52] then we'll need to merge it in the repos on the various puppet masters (they will pull from gerrit) [13:29:56] It's still saying "ready to submit", so I guess another click? [13:30:00] yeah [13:30:19] submitting... [13:31:02] Now running sudo puppet-merge [13:31:19] Only my changes visible, proceeding with merge [13:31:24] super [13:31:29] I don't have to explain much then :) [13:32:24] Stevie was helping :) [13:33:58] ahhhh that explains [13:34:03] I know kormat is kormat [13:34:18] I cannot reach that level [13:34:23] :D [13:35:31] script now live on both machines. So the cron spam should stop [13:36:03] 10Analytics-Clusters, 10Patch-For-Review: Update prometheus-amd-rocm-stats Python script to work with new JSON output - https://phabricator.wikimedia.org/T262404 (10klausman) Turns out, the Prometheus client libs do not allow for not specifying a label. Thus, I will use the "sensor1" location as described at th...
[13:37:26] 10Analytics-Clusters, 10Patch-For-Review: Update prometheus-amd-rocm-stats Python script to work with new JSON output - https://phabricator.wikimedia.org/T262404 (10klausman) 05Open→03Resolved Submitted and live on stat1005 and stat1008, confirmed working as intended. [13:37:48] And all done :) [13:38:02] joal: really nice analysis on the data loss [13:38:06] klausman: \o/ [13:38:43] klausman: if you want to update https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1 too :) [13:39:17] Good point [13:40:33] Done. [13:42:27] at this point we can add a selector for the gpu to check metrics from, having both seems a bit confusing [13:42:30] what do you think? [13:42:39] hopefully soon we'll have also the hadoop worker ones etc.. [13:49:42] So you mean a host selector? [13:50:03] Because the CPU will likely always be card1 :) [13:51:03] yep host selector, just added [13:51:41] perfect [13:52:05] What do you think would be a good soak time for the new driver? ten days? So we cover weekly jobs? [13:57:02] the gpu usage is very experimental for the moment, few people are using those, we could ping them to do some tests [13:57:11] for example, do we have miriam_ around? :) [13:57:39] hellooo elukey [13:57:43] ciaoooo [13:57:45] ciaoo [13:58:08] did I break something :D [13:58:22] miriam_: not sure if you have met (virtually) klausman, he is a new SRE that will work on ML infra in the future :) [13:58:37] and he is working on GPU drivers right now [13:58:58] more precisely, he updated the ones on stat1005, so I am wondering if you have time during these days to test if everything is ok [13:59:01] oh nice to meet you klausman, welcome! [13:59:23] elukey yes, I can do tomorrow morning UK time? [13:59:29] super, anytime [13:59:35] anything specific you want me to test? [14:00:04] e.g. parallel processes, heavy duty tasks? [14:00:06] nope, just regular things.. 
with more up to date drivers MAYBE we'll also get rid of the weird GPU hanging problem [14:00:21] (nope was for "anything specific" :P) [14:00:44] I ran a tf test and it worked, but I am a n00b [14:00:55] if your tests are ok we'll update also stat1008 [14:01:03] oki, will do tomorrow, should I then comment on the task for archive happiness? [14:01:06] (and we'll finally get the stats about memory usage) [14:01:16] \o/ yayy [14:01:19] https://phabricator.wikimedia.org/T260442 [14:01:23] thanks a lot :) [14:01:58] oh no prob! thanks to you and klausman for the amazing work! [14:02:41] hullo \o [14:02:48] and you're welcome :) [14:03:40] I'm on CEST, so your morning is my morning. I shall be around if anything pops up [14:04:12] klausman: Miriam works on image recognition / computer vision (among the million other things) for the Research team [14:04:20] Fancy [14:04:57] there is the non written rule that she is the GPU queen (and hence rules their usage and upgrades) [14:05:10] wfm :) [14:05:12] :D [14:05:31] Alright, I gotta run a quick errand. Be back in 20m or so. [14:06:02] jokes aside, we have been working together on this since she is a heavy user of the GPUs (I think also mgerlach), so before upgrading etc.. I usually sync with her [14:07:16] btw, how is the memory usage exported to Prom? The script I edited obviously doesn't have that info? [14:07:24] :D [14:07:28] 10Analytics-Radar, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth, and 3 others: Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10phuedx) I've [[ https://meta.wikimedia.org/w/index.php?title=Schema_talk%3AReadingDepth&type=revision&diff=204... 
[14:08:21] 10Analytics-Radar, 10Product-Analytics, 10Readers-Web-Backlog, 10Reading Depth, and 3 others: Reading_depth: deactivate eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10phuedx) 05Open→03Resolved a:03phuedx [14:11:03] klausman: correct we currently don't export it [14:11:36] so the python script needs to be modified, maybe when 1008 is upgraded? [14:15:54] ...all of a sudden i can't log into a bastion [14:16:03] elukey: can you? is it just me? [14:16:33] ottomata: goood morning! Which one? I usually ssh to esams [14:16:38] any of them [14:17:08] nono for me it works [14:17:11] hm [14:17:16] i seem to authenticate fine [14:17:16] does it timeout or something different? [14:17:27] debug1: Entering interactive session. [14:17:27] debug1: pledge: network [14:17:27] debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0 [14:17:27] client_loop: send disconnect: Broken pipe [14:17:33] it hangs for a while before disconnecting [14:18:49] I am wondering if it is your provider (maybe) [14:18:57] what do you get with traceroute? [14:21:42] still tracing [14:25:28] elukey: what is strange is i seem to authenticate fine, which would indicate to me that my provider isn't blocking anything [14:25:38] elukey: can you tail auth.log on bast1002 and see if anything looks weird? [14:25:40] as I log in? [14:26:52] sure [14:27:04] I was thinking maybe return traffic is mangled for some reason, this is why I asked [14:27:08] hm [14:27:21] k well still tracing, mostly * * * at this point, will wait til it finishes [14:27:52] also strange, my terminal seems to hang while logging in too! I can't ctrl-c it, i have to wait til it disconnects [14:28:10] elukey: lemme know when you are tailing [14:28:19] I am [14:28:31] k just attempted [14:28:35] currently hanging at [14:28:40] debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0 [14:45:10] elukey: will make a phab ticket regarding more metrics [14:46:59] ack!
[14:48:42] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Love the secondary sorting and how it works, thanks for making me read those docs. We don't actually need it here since we could probably" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/625586 (https://phabricator.wikimedia.org/T262184) (owner: 10Joal) [14:53:35] Joal: yt? [14:54:24] mforns: yt? [14:54:33] nuria: yes [14:54:39] mforns: for the thresholds [14:54:45] aha [14:54:56] mforns: did you look at the UA data? [14:55:01] yes [14:55:25] mforns: and does the threshold still seem too small? [14:55:28] I still think that 1.5 deviation units is small [14:55:47] mforns: i tested it [14:55:48] yes, remember it's not 1.5 entropy units, but normalized deviation units, right? [14:56:04] mforns: and less than that would not alarm on the significant last jump [14:56:13] oh, wow [14:56:23] which date was the jump you tested? [14:57:28] https://usercontent.irccloud-cdn.com/file/CZXBPhhF/Screen%20Shot%202020-09-09%20at%207.57.14%20AM.png [14:58:28] mforns: let me verify what happens if i run it with end date today [14:59:58] around 21st of april? [15:01:12] elukey: I won't make it to the standup (I have to attend the cultural orientation thingy), I trust you will relay my greatness appropriately [15:01:35] klausman: jajaja [15:01:56] As a German, spanish-speakers laughing is very funny [15:01:58] klausman: (spanish laugh) cause the "ja" gets mistaken [15:02:20] klausman: i know i know birgit always tells me that the "jajaja" [15:02:24] and the "ayayayaya" [15:02:24] klausman: ahahha okok [15:02:27] I've worked with assorted Spaniards and Mexicans before. [15:02:46] klausman: are just "confusing" [15:02:53] I can deal with it :) [15:03:09] I'll just start using ¡Ay, caramba!
all the time [15:03:45] mforns: tested with 1.5 for the last 3 months and no alarm is raised [15:03:53] mforns: see jump [15:04:18] mforns: the threshold depends on the nature of the phenomenon, and for UA it is small [15:04:20] Possibly escalating to caramba carajo [15:04:21] * elukey afk for ~20 mins [15:04:42] klausman: that sounds like you did spend a lot of time with mexicans yay [15:05:10] klausman: there is the carajo guey which comes up every hour i'd say [15:06:00] mforns: but that's ok cause it is specified per "measure" (ua entropy, pageview entropy...) [15:06:16] nuria: so the jump you tested is the lower-boundary ramp-up that happened around mid June? [15:07:40] (03CR) 10Nuria: ">We don't actually need it here since we could probably bubble-sort with an Atari and still finish in a millisecond" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/625586 (https://phabricator.wikimedia.org/T262184) (owner: 10Joal) [15:07:52] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: [Wikistats v2] Default selection for (active) editors is confusing for inexperienced users - https://phabricator.wikimedia.org/T213800 (10Milimetric) This has been deployed for a while, we just forgot it in the wrong column. [15:08:04] mforns: i tested catching the jump in early march you see on the graph above [15:09:12] nuria: I see, that one should be caught with a higher threshold [15:09:22] but you said you executed for the last 3 months? [15:09:37] and that jump happened on the 21st of April, no?
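A toy version of the check being tuned in this exchange (illustrative only — the real refinery job and its exact normalization may differ): compute the entropy of each day's distribution, then flag the latest value when its distance from the historical mean exceeds the threshold (here 1.5), measured in units of the historical standard deviation:

```python
import math
import statistics

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts,
    e.g. pageview counts per user agent for one day."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def is_anomalous(history, current, threshold=1.5):
    """Flag `current` when its deviation from the mean of `history`
    exceeds `threshold` standard deviations of that history.

    This is why the threshold is in "normalized deviation units", not
    entropy units: 1.5 means 1.5 sigmas, whatever the entropy scale.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(current - mean) / stdev > threshold

# Invented numbers: a stable series of daily UA entropies, then a jump.
baseline = [4.01, 4.03, 3.99, 4.02, 4.00, 3.98, 4.02]
print(is_anomalous(baseline, 4.02))  # within normal wiggle
print(is_anomalous(baseline, 4.60))  # a jump like the one on the graph
```

Because the deviation is normalized by the series' own spread, a very stable metric (like UA entropy) trips the alarm on small absolute moves, which is why a threshold that looks "small" can still be right for this measure.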
[15:09:55] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 (10Nuria) 05Open→03Resolved [15:09:57] 10Analytics-EventLogging, 10Analytics-Kanban, 10Analytics-Radar, 10Event-Platform, and 5 others: Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 (10Nuria) [15:09:59] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 (10Nuria) [15:10:17] 10Analytics, 10Analytics-Kanban, 10Two-Column-Edit-Conflict-Merge, 10User-awight: Sanitize and store historical conflict events - https://phabricator.wikimedia.org/T260965 (10Nuria) 05Open→03Resolved [15:10:26] mforns: i did two things [15:10:48] 1) test what was the threshold that would catch the late april/early mat jump [15:11:02] 2) that was ~1.5 [15:11:17] 10Analytics-Clusters: Add more metrics to prometheus-amd-rocm-stats Python script - https://phabricator.wikimedia.org/T262427 (10klausman) [15:11:17] 3) make sure that with 1.5 we do not get any "faux" alarms on teh alst 3 months [15:11:25] mforns: * the last 3 months [15:12:00] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (10Nuria) 05Open→03Resolved [15:12:21] 10Analytics, 10Analytics-Kanban: Fix cassandra/hyperswitch geoeditors field miscmatch - https://phabricator.wikimedia.org/T262017 (10Nuria) 05Open→03Resolved [15:12:48] 10Analytics, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10Nuria) [15:12:50] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - 
https://phabricator.wikimedia.org/T255123 (10Nuria) 05Open→03Resolved [15:12:55] a-team: I don't see anything in the train etherpad or ready-to-deploy (after I cleaned out stuff that doesn't belong there). Is there anything for me to deploy this week, should I wait until tomorrow to deploy refinery? [15:13:16] nuria: I see... [15:13:24] that's crazy [15:14:08] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add editors per country data to AQS API (geoeditors) - https://phabricator.wikimedia.org/T238365 (10Nuria) Leaving open until we finish docs and backfill. https://wikitech.wikimedia.org/wiki/Analytics/AQS/Geoeditors [15:14:31] ebernhardson: in the back of my mind, there's a table on HDFS that contains the cirrussearch indices (map of article to tokenized text) -- does this exist and, if so, what's the name of the table in Hive? [15:15:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Update data-purge for processed mediawiki_wikitext_history (6 snapshot kept, 3 would be sufficient) - https://phabricator.wikimedia.org/T237047 (10Nuria) 05Open→03Resolved [15:16:24] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: [Wikistats v2] Default selection for (active) editors is confusing for inexperienced users - https://phabricator.wikimedia.org/T213800 (10Nuria) 05Open→03Resolved [15:16:39] nuria: looking here on my side [15:17:15] nuria: by backfill did you mean you wanted to go back before 2020 (right now data starts on January 1st, but we could go back to 2018 I think). Or did you just mean the re-ordering that comes with Joseph's patch [15:17:37] isaacj: * i think* you are thinking of elastic rather than hive but maybe i am totally off [15:18:11] isaacj: you can look in "discovery" database on hive [15:18:33] isaacj: but i do not think tokenized text is there [15:19:17] nuria: yeah, i assume it was a table that was (regularly?) copied over into Hive from Elastic. 
i just have a vague memory of looking at it once and being excited to see all the tokenized text but now can't remember where that was. i just assumed Hive because I don't know how to access elastic :) [15:19:33] milimetric: I thought joseph's patch will reorder differently all subsequent data that uses it, is that correct? [15:20:17] nuria: right, I was just going to delete the data and reload everything, it only takes a few minutes and the endpoint's not public yet [15:20:28] milimetric: ya, +1 [15:20:45] nuria: but you said "backfill" and I wasn't sure if you wanted to fill data before 2020 [15:21:15] milimetric: we should load data as far back as we have it [15:21:22] milimetric: which is sometime in 2019? [15:21:36] milimetric: or maybe even earlier, i would need to look [15:21:37] I think 2018-01, I wasn't sure why we hadn't done that yet, thought it was on purpose [15:22:02] ok, I'll ask in standup just in case someone else remembers something and deploy afterwards [15:22:11] but so looks like nothing else to deploy, ok [15:22:20] milimetric: ya, 2018-01 [15:22:22] https://dumps.wikimedia.org/other/geoeditors/ [15:22:48] milimetric: so we need to load that far back, i think the 2020 is there cause lex probably wrote it without knowing how far back the data goes [15:23:36] milimetric: also, joseph's patch does not change the order though, right? 
[15:23:37] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/625586/1/oozie/cassandra/monthly/editors_bycountry.hql [15:23:43] milimetric: just the "way" we sort [15:24:28] well, that's the script that loads cassandra, and when we query cassandra we don't sort, we just output what's in there [15:24:46] so it changes how we sort as we load cassandra, and therefore the order that the data comes out in the API [15:24:51] isaacj: i have not seen such data but ebernhardson might know best, seems unlikely cause the format consumed by elastic might not be consumable by other tools/humans [15:25:28] milimetric: ah wait, before there was just a group by, not a sort by, right [15:25:38] yeah, just a shot in the dark. i'll email him in a bit to check if he doesn't see this [15:25:43] thanks! [15:26:42] isaacj: he will see it, do not worry, but he is on PST and 8 is a bit early [15:27:13] ahh -- good point :) i'm still getting used to my eastern time zone [15:29:42] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10lmata) sounds good, will move this to Radar and let me know when/if we can be of assistance :-) [15:31:13] 10Analytics, 10Operations, 10Patch-For-Review: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10Ottomata) Ok! I think we are good to go! We'll need to add a wgEventStreams stream config entry and then redeploy (or just restart) eventgate-logging-ex... [15:56:35] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10thcipriani) Is this meant for folks deploying? Are we going to use these like we use the current mwdebug hosts? Or is this s... [16:01:29] ping mforns [16:02:00] ping razzi: stand up or do you also have cultural orientation? 
[16:02:15] nuria: cultural orientation [16:02:29] klausman, razzi : please be so kind to send e-scrums to analytics-internal@ if you cannot attend standup [16:02:47] Right right. [16:08:40] 10Analytics, 10Analytics-Kanban, 10EventStreams: KafkaSSE: Cannot write SSE event, the response is already finished - https://phabricator.wikimedia.org/T261556 (10Milimetric) https://github.com/wikimedia/KafkaSSE/pull/6 [16:21:32] milimetric: FYI added comment to your KafkaSSE review [16:21:37] lets discuss :) [16:51:38] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) @thcipriani We will continue to use mwdebug* as we do (both for developers and SREs); the existing hosts will join t... [16:55:18] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) [17:11:03] * elukey afk! [17:19:17] yep saw ottomata, I figured you wanted to just warn on all those errors, since the consume ones were caught separately [17:19:38] but I can do what you say, I think your second suggestion sounds better, I just have to figure out what exactly that error type is [17:19:39] hm [17:21:40] like, my logic was "this is not the kind of problem that should alert anyone or trip anyone up as they're reading logs, only the kind of thing you'd want to look for if users report a problem, hence warn" [17:21:41] milimetric: probably have to make it up [17:22:38] milimetric: example, if the sse.start() call fails [17:22:42] we should probably error, right? [17:23:10] hm not sure [17:23:40] hmmmmm [17:23:51] i see we actually do call this._error in other places, e.g. 
_consume [17:23:53] so [17:24:08] milimetric maybe instead of making a custom Error [17:24:14] I'll double check, but I didn't think that tripped this handler, it gets caught outside [17:24:16] you could just put a catch after the this.sse.send() call [17:24:19] in _loop [17:24:28] and _error( , warn) there [17:26:09] yeah milimetric maybe that is the right thing: add a .catch after this.sse.send in _loop [17:26:23] that way you catch errors that happen due to sse send failing, and can warn those ones [17:26:25] specifically? [17:27:49] ok, seems more precise, but does add yet another place where error handling is done. It's definitely confusing trying to track the flow [17:29:48] agree [17:29:56] i guess that all could be refactored in some way [17:30:00] but maybe too much for this patch [17:33:19] ok, I can send that... though it's really weird, now any kind of test I try to run locally fails, and npm run test fails too, something's weird... [17:45:45] (updated) [17:47:19] 10Analytics, 10Event-Platform, 10Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (10Ottomata) @srodlund we are good to go! Let's post! [17:51:07] milimetric: i still only see one commit [17:51:12] 10Analytics, 10Event-Platform, 10Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (10srodlund) Awesome. As per our convo on IRC, I will post the first one tomorrow (9/10) and the next two over the next two weeks. [17:51:42] Oh it's a force push squash? [17:51:44] ottomata: oh I just amended as is our weird way, github doesn't care either way [17:51:51] ah [17:52:49] hm milimetric that is the same as before, just at the end of _loop instead of at the end of _start(), right? 
i was imagining it right after the sse.send call [17:52:56] that way we only do it for sse.send errors [17:53:05] inside of the _loop function [17:53:47] I’m confused, doesn’t the catch apply to that whole chain of promises started after sse.send? [17:53:58] yes but as it is [17:54:03] _loop does a bunch of things [17:54:08] if there's an error in e.g. [17:54:08] this._updateLatestOffsetsMap(kafkaMessage); [17:54:19] your current patch would catch it and warn [17:54:52] as is, it applies to the whole chain of promises returned by _loop [17:55:07] ottomata: I don't think so, that call is before the send [17:55:23] it's wrapped in a larger promise [17:56:08] return this._consume().then((kafkaMessage) => { this._updateLatestOffsetsMap , ... this.sse.send, ... } [17:56:48] right, that's the status quo, and my patch does this, I thought: [17:57:01] return this._consume().then((kafkaMessage) => { this._updateLatestOffsetsMap , ... this.sse.send.then(...).then(...).catch( *here* ), ... } [17:57:18] milimetric bc? [17:57:21] omw [18:14:40] isaacj: kinda/sorta, re cirrussearch indices in hadoop. There is ebernhardson.cirrus2hive which is the cirrussearch dumps imported to hadoop. This has a script to pull the data, but it's not a fully automated/updated thing. I just run it when i need to do some analysis [18:15:12] isaacj: this isn't tokenized and such though, it's just the raw json docs we index (example: https://en.wikipedia.org/wiki/Analytics?action=cirrusdump ) [18:16:51] the latest dump there is probably quite old, but if you need something i could probably run a new import [18:19:18] ebernhardson: this is great, thanks! 
no need to do an import right now -- i'll go through the JSON and see if it'll actually help us with what we were trying to do (quick way to compute how many articles a given word appears in) [18:19:44] isaacj: you could query cloudelastic.wikimedia.org for that probably [18:20:03] isaacj: cloudelastic.wikimedia.org is a replica of the production search indices available in wmcloud [18:20:10] you can use the full elasticsearch api [19:05:22] 10Analytics, 10Event-Platform, 10MediaWiki-libs-HTTP, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T262462 (10Reedy) Line 359 `lang=php curl_setopt( $ch, CURLOPT_POSTFIELDS, $req['body'] ); ` T... [19:11:57] 10Analytics, 10Event-Platform, 10MediaWiki-libs-HTTP, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T262462 (10Pchelolo) I bet the root cause is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/... [19:13:53] 10Analytics, 10Event-Platform, 10MediaWiki-libs-HTTP, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T262462 (10Reedy) >>! In T262462#6448331, @Pchelolo wrote: > I bet the root cause is https://gerrit... [19:14:27] 10Analytics, 10Event-Platform, 10MediaWiki-libs-HTTP, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T262462 (10Pchelolo) Ok, I know how to fix it. gimme 5 mins. [19:15:04] 10Analytics, 10Event-Platform, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T262462 (10Reedy) [20:09:10] milimetric: did not rejecting stop the loop? 
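(A hedged sketch of ebernhardson's cloudelastic suggestion for "how many articles does a given word appear in": the `_count` endpoint and `match` query are standard Elasticsearch API, but the index name `enwiki_content`, the `text` field, and the exact cloudelastic host/auth details are assumptions here, not verified.)

```javascript
// Hypothetical query builder for counting documents that contain a word.
// Index and field names are assumptions about the cirrussearch layout.
const countQuery = (word) => ({
  // `match` runs the word through the index's analyzer, so this counts
  // documents whose tokenized text contains it.
  query: { match: { text: word } },
});

// Usage against cloudelastic (Node 18+ global fetch; requires network
// access, and the real host/port for cloudelastic are not shown here):
// fetch('https://cloudelastic.wikimedia.org/enwiki_content/_count', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(countQuery('analytics')),
// }).then((r) => r.json()).then((r) => console.log(r.count));

console.log(JSON.stringify(countQuery('analytics')));
```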
[20:09:28] ottomata: yes, just doing nothing stops the loop [20:09:34] if you reject, the loop keeps going [20:09:58] (I updated the pull request and submitted a change in gerrit for eventstreams) [20:10:21] 10Analytics-Radar, 10Anti-Harassment, 10Product-Analytics: Capture special mute events in Prefupdate table - https://phabricator.wikimedia.org/T261461 (10Niharika) [20:11:16] great! merging [20:12:16] milimetric: i don't know how github does merges, but it seems there is a merge commit [20:12:23] you should probably reference that one from eventstreams [20:12:33] ah, ok, updating [20:17:29] k [20:18:42] milimetric: somehow you pushed the change directly? [20:18:46] in gerrit [20:18:46] ? [20:18:54] oh no! [20:19:07] oh looks right tho [20:19:08] is ok? [20:19:16] hmmm [20:19:25] i think that might not trigger the jenkins build pipeline though [20:19:33] OH [20:19:34] it did! [20:19:35] coo [20:19:37] ottomata: I had git push in my history from github, which I'm not used to, I usually have git push refs... [20:19:58] that repo should be set up to not do that [20:20:01] well at least the first step did [20:20:08] hopefully it will push a new docker image up to our repo [20:20:10] in a bit... [20:20:30] I can revert, push, and then send the change? [20:22:45] Gone for tonight - see you tomorrow team [20:31:00] milimetric: it's ok [20:31:03] it worked! [20:31:12] :) yay [20:31:16] see the PipelineBot [20:31:18] comment? [20:31:25] IMAGE: [20:31:25] docker-registry.wikimedia.org/wikimedia/mediawiki-services-eventstreams [20:31:25] TAGS: [20:31:26] 2020-09-09-201733-production, a3c26393ea9e6d20e9e49e4e129bf676316ebcf1 [20:33:00] milimetric: [20:33:01] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626223 [20:33:07] will make eventstreams in k8s use that version [20:33:14] let's let that be for now, deploy next week? [20:36:39] (in meeting for a bit, but will answer after) [20:36:58] np! i gotta run actually, ttyt! 
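(For reference, the `.catch` placement ottomata and milimetric settled on above — a catch attached directly to `this.sse.send` inside `_loop`, instead of one covering the whole chain — can be sketched with a toy model. Everything below is an invented stand-in for the KafkaSSE internals, with `send` hard-wired to fail so the difference between the two placements is visible.)

```javascript
// Toy model, not the actual KafkaSSE code: consume/updateOffsets/send are
// hypothetical stand-ins for this._consume, this._updateLatestOffsetsMap,
// and this.sse.send.
const log = [];

const consume = () => Promise.resolve({ offset: 1 });
const updateOffsets = (msg) => log.push(`offsets:${msg.offset}`);
const send = () => Promise.reject(new Error('response finished'));

// Placement 1: one catch on the whole loop body. This also swallows errors
// thrown by updateOffsets, which is the objection raised in the discussion.
const loopCatchAll = () =>
  consume()
    .then((msg) => {
      updateOffsets(msg);
      return send(msg);
    })
    .catch((err) => log.push(`warn(all): ${err.message}`));

// Placement 2: catch attached directly to send(), so only send failures
// are downgraded to warnings; any other error still rejects the outer chain.
const loopCatchSendOnly = () =>
  consume().then((msg) => {
    updateOffsets(msg);
    return send(msg).catch((err) => log.push(`warn(send): ${err.message}`));
  });

const done = loopCatchAll()
  .then(loopCatchSendOnly)
  .then(() => console.log(log.join('\n')));
```

Note that in placement 2, rejecting past the catch would keep the loop going, which matches milimetric's observation that doing nothing (swallowing the send error) is what stops it.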
[21:20:31] 10Analytics, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Krinkle) If I understand, from a non-SRE perspective, this proposal basically just means: * The `$_SERVER['SERVERGROUP']` e... [21:28:27] 10Analytics, 10Analytics-Kanban, 10EventStreams, 10Patch-For-Review: KafkaSSE: Cannot write SSE event, the response is already finished - https://phabricator.wikimedia.org/T261556 (10Milimetric) (we decided to deploy this next week, and we can do so by simply merging the change above (https://gerrit.wikime... [21:32:21] 10Analytics-Radar, 10Dumps-Generation, 10Okapi, 10Platform Engineering: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10RBrounley_WMF) Hey all - I'm starting to post our sprint overviews here to improve Okapi's dialogue on phabricator. I will add tickets in the Okapi board, feel free... [21:43:05] 10Analytics-Radar, 10Dumps-Generation: Sample HTML Dumps - Request for feedback - https://phabricator.wikimedia.org/T257480 (10RBrounley_WMF) Split this oversighted revision conversation into T262479 to continue the conversation. [23:33:22] (03CR) 10Nuria: [C: 03+1] Removing seasonality cycle as it is fixed once granularity is set (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623456 (https://phabricator.wikimedia.org/T257691) (owner: 10Nuria)