[00:27:53] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Wikidata, and 2 others: Query stats dashboard not updating - https://phabricator.wikimedia.org/T204415 (10mpopov) 05Open>03Resolved All good now :) [02:42:42] 10Analytics, 10Analytics-Kanban: heirloom-mailx fails trying to send out email from SWAP notebook - https://phabricator.wikimedia.org/T168103 (10Tbayer) PPS: I added [[https://wikitech.wikimedia.org/wiki/SWAP#Sending_emails_from_within_a_notebook | a section]] to the documentation. [05:54:51] 10Analytics, 10Operations, 10Traffic, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Pchelolo) > Ok, @Pchelolo gets the persistence award! Yay! I've got the award! > Let me understand: are there other headers we would need besides the accept one... [07:15:07] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can... [07:15:10] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet... [07:15:24] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can... 
[07:31:15] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet... [07:32:36] goood morning! [07:32:37] /dev/mapper/stat1005--vg-data 7.2T 6.4T 443G 94% /srv [07:32:39] lol :D [07:33:04] let's see who's the winner [07:58:50] there is a combination of things [07:58:56] some people with huge homes [07:59:08] and also 1.9T of logs rsynced (eventlogging, mw-api, etc..) [08:11:18] (03PS1) 10Joal: Correct geoeditors-load oozie SLA [analytics/refinery] - 10https://gerrit.wikimedia.org/r/464105 [08:14:54] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) After hitting by mistake F10 the host got stuck several times in: `Unified Server Configurator does not support console redirection` After... [08:32:13] 10Analytics, 10Operations, 10Traffic, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10phuedx) ☝️ Best I could do at short notice… [08:36:36] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) From System Setup I can see that `D0:94:66:5F:75:BC` (set in puppet) shows Link Status Connected, while the other NICs are disconnected, so... 
[08:41:33] I am checking saltrotate in puppet [08:41:46] mforns added the -b option but it seems not available [08:43:49] ahhh https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/454631/2/bin/saltrotate [08:43:55] so this needs to be deployed first [08:50:05] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) If I leave the PXE boot running (even if it seems stuck in a blank screen) I end up in: ``` Loading Linux 4.9.0-8-amd64 ... Loading initial... [08:50:13] no joy with an-coord1001 --^ [08:50:21] pxe boot stuck while loading the kernel [08:50:25] * elukey sigh [10:23:17] 10Analytics, 10EventBus, 10MediaWiki-General-or-Unknown, 10WMF-JobQueue, and 2 others: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 (10mobrovac) Time drifting is a known occurrence. From what I see, we have two options: - allow... 
[10:24:04] Followed up with some stat1005 users, now /dev/mapper/stat1005--vg-data 7.2T 5.7T 1.2T 84% /srv [10:24:08] better :) [10:24:24] I think though that we'd need a task to review the space used under /srv [10:24:33] 10Analytics, 10ChangeProp, 10EventBus, 10WMF-JobQueue, and 2 others: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 (10mobrovac) [10:24:33] we'll likely end up in the same mess again [10:48:52] hey all I'm going to SWAT A/B tests on page issues feature for Farsi/Japanese and Russian [10:49:09] you can expect higher events rate on our analytics servers ;) [10:49:25] we're bumping from 20% to 100% [10:49:29] https://phabricator.wikimedia.org/T200792 [10:54:53] hello raynor [10:56:04] I am reading the task but I don't get what events will increase [10:56:16] and more or less the amount of the increase [10:56:35] if these are events that go to the mysql datastore it could be a problem [11:00:23] elukey, it will be the ReadingDepthSchema [11:04:13] ah ok only one schema, good [11:04:18] and that one doesn't go into mysql [11:04:42] do you know more or less the amount of the increase? We have already been reached a high level of events/second for eventlogging [11:05:32] elukey: no, but next time I can ask for that estimate before swatting [11:06:04] but I think it's easy to calculate, the A/B test is enabled for 20% of users [11:06:49] on ru/jp/fa [11:07:10] raynor: thanks! 
I'd ask you to not deploy though until you have those numbers even for this use case [11:07:32] we might need to scale up eventlogging before [11:08:48] (03PS2) 10Fdans: Allow breakdown filtering in top metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/463964 (https://phabricator.wikimedia.org/T205725) [11:10:28] elukey https://phabricator.wikimedia.org/T200792#4631954 [11:12:57] and https://phabricator.wikimedia.org/T200792#4632005 [11:13:50] looks like jdlrobson and HaeB did some math [11:14:29] (03CR) 10Fdans: Allow breakdown filtering in top metrics (037 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/463964 (https://phabricator.wikimedia.org/T205725) (owner: 10Fdans) [11:14:34] currently it's ~60 events per second (at 20% rate), so it will jump to ~300events per second [11:14:36] (03PS3) 10Fdans: Allow breakdown filtering in top metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/463964 (https://phabricator.wikimedia.org/T205725) [11:15:24] raynor: from https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=5&fullscreen ReadingDepth is ~668 events/second [11:15:58] so it should jump to something like 900/s ? 
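The estimate discussed above can be sketched as simple arithmetic. This is only an illustration: `scaleRate` is a hypothetical helper, and it assumes the event rate is proportional to the sampling percentage and that the set of wikis stays the same, which still needs to be confirmed on the task.

```javascript
// Hypothetical helper: scale an observed event rate from one sampling
// percentage to another, assuming the rate is proportional to sampling.
function scaleRate(observedPerSecond, currentPercent, targetPercent) {
  return observedPerSecond * (targetPercent / currentPercent);
}

// Numbers quoted in the channel.
const abTestNow = 60;        // events/s from the A/B test at 20% sampling
const schemaTotalNow = 668;  // events/s shown on the dashboard for the schema

// Projected A/B test rate at 100% sampling.
const abTestLater = scaleRate(abTestNow, 20, 100);
// Only the A/B test share grows; the rest of the schema traffic is unchanged.
const schemaTotalLater = schemaTotalNow + (abTestLater - abTestNow);

console.log(abTestLater, schemaTotalLater); // 300 908
```

Under those assumptions the total lands near the ~900/s guessed in the channel, not at five times the dashboard number.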
[11:16:29] yup, but that is most probably overall reading depth [11:16:53] oh yes this is the number that I care about basically [11:16:57] elukey - I won't lie, I'm not sure, I was asked to swat that change and people who did the math are still sleeping [11:17:39] I think we can wait till SF morning SWAT [11:17:57] if it is not super urgent it would be great [11:18:03] because I don't I'll get the answer for you a this moment [11:18:15] I don't know if I can get the answer* [11:19:54] you basically gave me the answer, we'd just need to verify it :) [11:20:29] I don't mean to be pedantic and stop your work, really sorry, but we are already processing a lot of events and I want to make sure that we don't cause hard to eventlogging in the process [11:20:39] *harm [11:23:04] sure, it's good that someone is guarding that :) [11:23:20] I added to my notes that everytime before I'm asked to deploy something I need hard numbers ;) [11:27:23] thanks! [12:14:51] elukey: good call on asking to wait [12:15:17] * joal is here for the few minutes his children leave him during siesta [12:15:30] o/ [12:16:19] elukey: ReadingDepth schema is already sharing 1st place of top-rate-schemas with VirtualPageviews [12:17:04] Getting 5 times this rate is probably ok if we monitor closely and have some spares-workers around in case [12:17:47] joal: it should go from 600 to 900 e/s [12:17:56] so not 5 times iiuc [12:18:54] elukey: I have not read the task, but if we currently are at 20% and have ~800 events/s now, I'm assuming going to 100% would mean 800*5? [12:21:02] joal: not all events, there was some calculations in the task [12:21:33] we need to verify the calculations but it should be 60/s now -> ~300/s later [12:21:50] elukey: I also think we are making a mistake on schema - IIRC the one deployed earlier this week to 20% was not ReadingDepth [12:22:04] it was PageIssues [12:22:27] yes [12:22:39] ?? 
[12:23:08] 13:00:23 < raynor> elukey, it will be the ReadingDepthSchema [12:23:33] This has been misleading me in number assumptions [12:23:46] I think it will actually be the PageIssues schema [12:23:52] ah you are saying that raynor told me ReadingDepth schema instead of PageIssues [12:24:04] it could be yes https://phabricator.wikimedia.org/T200792#4631205 [12:24:10] "We're seeing around 60 events per second (and no errors[1])" [12:24:59] ReadingDepth per sec rate is [500-1k] while PageIssues is [40-80] [12:26:01] yep, I thought that the increase was a small part of ReadingDepth (namely only some smallish wikis) [12:26:20] well let's see what they say later on, but it is probably what you just said [12:26:52] elukey: just checked the patches - The bump to 20% included enwiki while the bump to 100% doesn't - I feel safe now :) [12:27:33] but they didn't merge anything today no? [12:27:39] we were safe :) [12:28:11] yup [12:28:56] in any case, if more people need to do these kind of tests, we'll likely need eventlog1003 [12:29:05] There are chances yes [12:29:34] elukey: do we wait next swat, or do we tell raynor to move on? [12:30:13] I'd prefer them to confirm what schema is impacted and the numbers for the jump before giving the green light [12:30:28] and since it is not urgent it can wait for the SF swat in my opinion [12:30:36] elukey: Sounds good :) [12:31:13] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) @Nuria that makes sense. Rather than limiting URL length (so that we don't get incomplete data), would it be a good idea to not report these errors? So I'd detect... 
[12:43:09] joal, elukey -> I already postponed that one [12:43:27] ack raynor - Thanks :) [12:43:34] it may be the PageIssues schema, I was going through the code and first I saw the ReadingDepth [12:43:46] so I wrote that, anyway, next time I'll ask for more information before swatting [12:54:06] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) From the install1002 point of view: ``` Oct 3 12:38:12 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2 Oc... [13:09:40] 10Analytics, 10Wikimedia-Stream, 10Patch-For-Review: Create /v2/schema/:schema_uri endpoint for eventstreams that proxies schemas from eventbus - https://phabricator.wikimedia.org/T160748 (10Ottomata) Hm, looks like I wrote the code for it but never configured it in prod [13:13:10] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) There are two changes needed, including https://gerrit.wikimedia.org/r... [13:14:46] joal: https://phabricator.wikimedia.org/T199121 [13:14:51] in case you haven't seen it [13:17:05] 10Analytics, 10Analytics-Kanban: heirloom-mailx fails trying to send out email from SWAP notebook - https://phabricator.wikimedia.org/T168103 (10Ottomata) Great! Yeah let's leave this open. The real problem is not solved. [13:22:55] ottomata: Thanks ! I subscribed to it yes :) [13:23:59] ok coo [13:24:05] just came my way again and wanted to make sure you knew [13:25:18] ;) [13:32:08] joal, elukey - so just to confirm, do you still need the numbers for PageIssues instrumentation? 
[13:32:26] yes please [13:35:55] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) Got some help from Faidon, one setting in the BIOS for the serial console wasn't correct (I've set `Serial Port Address set to Serial Device... [13:35:58] joal: --^ [13:36:10] the other issue was console redirection [13:36:26] basically all the output of debian install wasn't going to the serial port and I didn't see it [13:46:27] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) >>! In T204970#4637700, @elukey wrote: > After hitting by mistake F10 the host got stuck several times in: > > `Unified Server Configurator... [13:47:08] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can... [13:47:30] restarting the os install for an-coord1001, this time it should work [13:47:34] hopefully :D [14:03:46] milimetric: any ideas on how I can do this action item from the RFC meeting? 1. investigate if we need to support GET event intake anymore [14:03:56] i need to find out if we need to support non JS clients. [14:04:11] I was thinking on that ottomata [14:04:11] if we don't, then we can go POST with no more silly URI encoding of events [14:04:20] should I send an email to wmf engineering? [14:04:34] maybe also make a sub task for that very thing? [14:04:57] I mean, there's no way to find out for sure until it stops working I guess, an email should reach some of the people... 
hmm [14:05:18] well, opinions might be enough...oh i could check eventlogging extension to see if event.gif stuff is even built in anymore [14:05:58] let's see... how did it work, looking at the code [14:06:37] https://www.mediawiki.org/wiki/Analytics/Archive/Pixel_Service [14:07:16] https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/examples/varnish.vcl [14:07:38] ya [14:07:53] where's the varnish code again? [14:07:59] I guess that would tell us for sure [14:08:33] it is still listening for event.gif [14:09:02] actually it just listens to anything /becaon [14:09:30] but varnihskafka is filtering for "^/(beacon/)?event(\.gif)?\?" [14:09:34] let [14:09:38] let's just webrequest for event.gif!~ [14:10:44] yea [14:10:50] (you doing it?) [14:10:57] ya [14:11:01] gonna check the last couple of days [14:11:28] if it's less than one per day, I'd say it's safe to turn off :) [14:16:42] joal, elukey https://phabricator.wikimedia.org/T200792#4638420 (regarding the earlier patch) [14:16:49] milimetric: [14:16:49] +----+-----+---+--------+ [14:16:50] |year|month|day|count(1)| [14:16:50] +----+-----+---+--------+ [14:16:50] |2018| 10| 3| 3| [14:16:50] |2018| 10| 1| 6| [14:16:50] |2018| 10| 2| 5| [14:16:51] +----+-----+---+--------+ [14:17:00] gonna examine them now since it is so few [14:17:01] seems too low to matter [14:20:25] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` and were **ALL** successful. 
[14:23:40] raynor: thanks answered [14:26:48] heyaaa team [14:27:43] ottomata: o/ [14:27:47] qq [14:27:51] this is an1003 now [14:27:51] /dev/mapper/analytics1003--vg-mysql 197G 18G 180G 9% /var/lib/mysql [14:27:54] /dev/mapper/analytics1003--vg-data 4.0T 68G 3.8T 2% /srv [14:28:06] I have this on an-coord1001 [14:28:07] /dev/mapper/an--coord1001--vg-data 173G 61M 164G 1% /srv [14:28:21] we could split it into two but I am wondering about the sizes [14:28:45] 70G for mysql, 100 for /srv ? [14:28:52] or maybe 60/110 [14:32:39] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) [14:33:01] elukey: 50 / 120 ! :D [14:33:18] or maybe [14:33:20] 50 100 [14:33:23] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) Assigning to Rob to see if anything needs to be done from the DC ops side before closing. [14:33:28] keep 20 around for whichever needs it first? [14:34:13] nah let's assign all [14:34:34] k [14:35:46] elukey@analytics1003:/srv/deployment/analytics$ du -hs * [14:35:46] 4.0K refinery [14:35:47] 68G refinery-cache [14:35:49] ahhahha [14:36:41] so we keep 5 revs for refinery [14:36:48] that sees a bit a waste [14:37:31] keeping only three would free ~20G [14:37:39] more sorry [14:40:49] sure! [14:40:53] 3 is plenty [14:43:07] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) a:03Ottomata [14:43:34] oh milimetric NONE of the event.gif uris i found are for beacon [14:43:45] i searched for just event.gif since /beacon is optional in the varnishkafka regex [14:43:51] they are all just /wiki/File: uris [14:43:57] e.g. /wiki/File:Soccerball_current_event.gif [14:44:00] so 0 events found. 
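The finding above (a search for `event.gif` turning up only `/wiki/File:` pages) follows from how the quoted varnishkafka pattern is anchored. A minimal sketch, assuming the pattern is applied to the request path plus query string; the two intake URIs and their payloads are made-up samples, and only the `File:` URI comes from the log.

```javascript
// Pattern quoted in the channel for the varnishkafka filter.
const eventIntake = /^\/(beacon\/)?event(\.gif)?\?/;

// Sample URIs: the first two are hypothetical intake requests,
// the third is the File: page mentioned in the channel.
const samples = [
  '/beacon/event?%7B%22schema%22%3A%22Example%22%7D',
  '/event.gif?%7B%22schema%22%3A%22Example%22%7D',
  '/wiki/File:Soccerball_current_event.gif',
];

// The pattern is anchored at the start of the URI and requires a '?',
// so a page title that merely ends in "event.gif" does not match.
const matches = samples.map((uri) => eventIntake.test(uri));

console.log(matches); // [ true, true, false ]
```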
[14:44:13] (03PS1) 10Elukey: Reduce cached revs to save some space on hosts [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/464157 [14:44:18] ottomata: --^ [14:44:26] default is 5 [14:44:38] (03CR) 10Ottomata: [V: 032 C: 032] Reduce cached revs to save some space on hosts [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/464157 (owner: 10Elukey) [14:44:41] merged [14:44:48] aahhaah [14:44:53] speed of light [14:45:29] (03CR) 10Nuria: Correct geoeditors-load oozie SLA (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/464105 (owner: 10Joal) [14:46:34] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) One of the action items from the RFC meeting was: - investigate if we need to support GET event intake anymore I... [14:49:27] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10Nuria) @bmansurov: errors are reported as way for us to monitor client, so turning errors off is like "turning monitoring off" so that is probably not an acceptable option [14:50:04] sorry ottomata: baby 'mergency, that's great, no event.gif requests anymore, can remove the rule from varnish and declare it dead [14:51:32] yeah coo [14:52:41] ottomata: hola! [14:53:21] yoohoo [14:53:57] ottomata: for the event.gif ... 
I would love for that to be true but does not seem possible [14:54:16] nuria: it looks like everyone is use /beacon/event not /beacon/event.gif or just /event.gif [14:54:16] ottomata: we have a significant percentage of opera requests and opera doe snot support sendBeacon cc milimetric [14:54:23] its not send beacon that matters [14:54:40] its embedded in the html [14:54:44] if that doesn't happen anymore [14:54:49] then we don't need it [14:54:58] you can still manually send the event using JS [14:55:16] ottomata: but the code if sendbeacon is not available falls back to image [14:55:19] ottomata: right? [14:55:21] show me code! [14:55:24] we are looking for that [14:55:37] so people would embed event.gif in the html that's output by instrumentation php? [14:55:56] milimetric: html? no [14:56:04] ah wait [14:56:06] i think i found it [14:56:10] function ( url ) { document.createElement( 'img' ).src = url; }, [14:56:13] milimetric: it is a js code alone [14:56:20] milimetric: exactly [14:56:32] but then, can we change that code? [14:56:37] what? [14:56:37] we don't need to use img [14:56:42] ottomata: it creates an image element for browsers that do not suppoort sendbeaconb [14:56:43] we can just make a manual http post [14:56:50] where's that code? [14:56:59] milimetric: ext.eventLogging.core.js line 263 [14:57:04] ottomata: and how would we get the data in varnish? 
[14:57:13] nuria: no varnish [14:57:16] public POST endpoint [14:57:18] ottomata: i thought we could not extract post parameters [14:58:00] ok also interesting, that the image src that this code will place is not event.gif [14:58:01] ottomata: ah ok , for the new intake, i do not think that would work in older browsers w/o creating an iframe [14:58:02] nuria / ottomata: but that code is not requesting event.gif [14:58:03] it'll just be /beacon/event [14:58:07] yeah [14:58:12] right, just /beacon/event [14:58:12] but its still doing the same thing [14:58:16] doesn't matter if .gif or not [14:58:19] the backend will treat it the same [14:58:26] no, it matters for that varnish rule we want to remvoe [14:58:30] i was hoping that if anyone used the img thing it would at least be nice and make the src uri look like a .gif [14:58:31] we can remove it [14:58:32] but nope. [14:58:34] nonono [14:58:38] we don't care about that [14:58:46] what we want: is to ahve a SINGLE endpoint if we can [14:58:54] hopefully that only accepts POST [14:58:59] so we don't have to do the URI encoding crap [14:59:00] ottomata: milimetric and i have a meeting in 1 min, we can talk maybe at standup? [14:59:08] i have a meeting too! [14:59:13] it's in 15 min nuria [14:59:24] milimetric: you ARE RIGHT [14:59:36] ottomata: in older browsers a post like taht is not seamlessly [14:59:40] ottomata: thus the image [14:59:45] nuria: but if we have JS [14:59:50] why not just [14:59:52] hm... we can submit a form [14:59:58] create a form element and submit it [15:00:07] oh... no [15:00:10] that would render the result [15:00:14] milimetric: right [15:00:21] milimetric: thus the image technique [15:00:26] milimetric: that is older than the times [15:00:31] cc ottomata [15:00:36] yeah... there's no work-around for that still?' [15:00:42] jeez... 
stuck in the bronze age [15:00:48] milimetric: noy for older browsers, the workarround in sendbeacon [15:01:19] ottomata: other option is to say, we will only support sendbeacon going forward and quantify how much of our userbase we will be missing [15:01:28] if not navigator.sendBeacon { fetch(url, { method: POST, body: JSON.stringify(data) } ); } [15:01:29] ? [15:01:39] fetch doesn't work in old browsers [15:01:41] find [15:01:43] xmlhttprequest [15:01:43] but ajax would [15:01:47] yea [15:02:10] but ori knew that, wonder why he didn't use it [15:02:13] ottomata: ajax is async , not sync like an image [15:02:19] why do we want sync here? [15:02:26] milimetric: to not drop events [15:02:27] this if fire and forget anyway [15:02:39] varnish is always returning 204 [15:02:50] no matter what for this URI [15:02:51] that's all it does [15:03:03] varnishkafka then reads the varnishlogs and produces to kafka [15:03:16] nuria: we can force the request to go sync, but why does that matter, yeah, that's just about whether the client gets notified when the request succeeded [15:03:31] in the img case, if it doesn't work, it gets a 404 response, but the client doesn't handle that in any way [15:03:32] ya, new stream intake service will have both options [15:03:34] but be the same service [15:03:51] in the ajax request, if it returns a failure, nothing handles the failure and the promise is garbage collected later [15:03:53] POST with wait for event validation, etc. for appropriate HTTP response [15:03:58] POST with i don't care just give me 204 right now [15:04:29] milimetric: there are two reasons 1) image is suynchronous and simple vent in fire and forget and 2) images are not subjected to cross domain restrictions [15:04:50] *image is synchronous and simple rather, sorry [15:05:03] ottomata: the event intake is on teh same domain than the page? 
[15:05:05] I don't think older browsers that don't have sendBeacon have cross-domain protections [15:05:18] milimetric: they do, you can try [15:05:47] milimetric: they do not have a way to circumvent those orotections with CORS, that is correct [15:05:54] makes sense? [15:05:58] haha, IE 11 doesn't support sendBeacon, so yeah, jeez [15:06:13] yea [15:06:15] makes sense [15:06:18] milimetric: but there is the option of not supporting browsers that do not support sendbeacon [15:06:20] really [15:06:25] but ... jsonp [15:06:39] nah, I don't think so, we need to support IE 11 [15:06:41] milimetric: ya, that woudl work always [15:06:54] milimetric: wait, let me see the grade a support for mediawiki [15:07:24] yeah, IE 11 is 5.1% of requests, that's a lot [15:08:04] milimetric: ya, looks like it is on grade A support [15:08:07] milimetric: https://www.mediawiki.org/wiki/Compatibility#Modern_(Grade_A) [15:08:19] nuria: sure? [15:09:11] we can make the domain whatever [15:09:11] milimetric: now, ie11 might be able to go around cross domain with the right set of headers, let me see, that might also be another option [15:09:33] don't see why it shouldn't be the same uri it is now even [15:09:34] just POST [15:10:09] ottomata: if we do not have cross domain restrictions then yes, that opens other possibilities [15:12:25] nuria: did you want to chat a bit before the meeting? [15:13:07] ottomata: but doing it via ajax still has the drawback of events being drop likely at a higher rate than with an image beacon [15:13:25] ottomata: that might be fine (cc milimetric ) [15:13:52] hm... I don't see why it would drop more... the request should be the same [15:14:09] you're saying createElement is more reliable than XHR mechanisms? [15:14:22] like, internal browser errors? 
[15:14:27] milimetric: I think in reallity the way the browser does rscheduling on its single thread is prioritized [15:14:56] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) I spoke a little too soon ^. It looks like EventLogging extension code does not use the event.gif endpoint, it use... [15:15:03] milimetric: and that is why sendbeacon works in a way "on-its-own-thread" it comes to solve that problem, [15:15:11] ok, if you're talking about the browser crapping out more often on XHR than on createElement, yeah, I could believe that... but both of those should be super super rare, no? [15:15:38] milimetric: no, it is not errors, it is prioritization of sync operations over async ones on teh single thread js has [15:15:52] milimetric: makes sense? [15:15:59] meeting now [15:16:06] not to me :p [15:16:07] ! [15:16:35] ottomata: we can talk more about this , if you read https://www.w3.org/TR/beacon/ some of the problems it solves are described there [15:19:18] aye let's talk more [15:19:26] is XHR always blocking? is there anything else we could use? [15:20:48] XHR has async [15:22:02] yeah nuria sendbeacon is better for sure, but in the rare cases where it is not availbale, i don't think XHR will be that much worse than img tag, no? [15:36:02] ottomata: ops sync? 
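The fallback idea floated above (use `sendBeacon` where it exists, otherwise an asynchronous XHR POST) could look roughly like this. It is a sketch, not the EventLogging extension's code: `sendEvent` and its `env` parameter are hypothetical, and it assumes an intake endpoint that accepts POST bodies, which is what the new intake service discussion proposes rather than something confirmed here.

```javascript
// Hypothetical helper: POST an event with the best transport available.
// `env` stands in for the browser globals (pass `window` in a browser),
// which also lets the sketch run outside a browser.
function sendEvent(url, data, env) {
  const body = JSON.stringify(data);

  if (env.navigator && typeof env.navigator.sendBeacon === 'function') {
    // Preferred: the browser queues the request itself, so it can
    // still be delivered while the page is unloading.
    env.navigator.sendBeacon(url, body);
    return 'sendBeacon';
  }

  if (typeof env.XMLHttpRequest === 'function') {
    // Fallback for browsers without sendBeacon (IE 11 was the example
    // in the channel): asynchronous, fire-and-forget POST.
    const xhr = new env.XMLHttpRequest();
    xhr.open('POST', url, true);
    // text/plain is a CORS-safelisted content type, so this stays a
    // "simple" request and does not trigger a preflight.
    xhr.setRequestHeader('Content-Type', 'text/plain;charset=UTF-8');
    xhr.send(body);
    return 'xhr';
  }

  // Neither transport exists; the caller decides what to do.
  return 'none';
}
```

In a page this would be called as `sendEvent('/beacon/event', event, window)`. Whether XHR drops more events than the image technique during page unload is the open question in the discussion, and this sketch does not settle it.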
[15:36:07] OH k [15:36:09] sorry [15:40:04] (03PS1) 10Mforns: Rearrange bucket ranges in EventLoggingToDruid time measure bucketing [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/464171 (https://phabricator.wikimedia.org/T205641) [15:59:33] 10Analytics: Research: add participant list to some of AQS edit api operations - https://phabricator.wikimedia.org/T206137 (10Nuria) [16:02:04] 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-02 (1.32.0-wmf.24)), 10Patch-For-Review: Make sampling by session more obvious in eventlogging module - https://phabricator.wikimedia.org/T203612 (10Nuria) 05Open>03Resolved [16:02:14] 10Analytics-EventLogging, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-02 (1.32.0-wmf.24)), 10Patch-For-Review: Make sampling by session more obvious in eventlogging module - https://phabricator.wikimedia.org/T203612 (10Nuria) 05Resolved>03Open [16:03:30] [ing milimetric stranddup? [16:03:53] sorry - distracted [16:06:09] (03CR) 10Milimetric: "Thanks so much @sbassett" [analytics/wikimetrics] - 10https://gerrit.wikimedia.org/r/464059 (https://phabricator.wikimedia.org/T193780) (owner: 10SBassett) [16:07:40] wikimedia/analytics-wikimetrics#1 (master - ee6c5a0 : sbassett): The build has errored. [16:07:40] Change view : https://github.com/wikimedia/analytics-wikimetrics/compare/2651b1bed785...ee6c5a08fab3 [16:07:40] Build details : https://travis-ci.com/wikimedia/analytics-wikimetrics/builds/86762789 [16:43:02] nuria: will Analytics reserve a space in Bentley for the Tuesday before all-hands? [16:43:23] nuria: there is interest on Research end to be in a space that is closer to you all, in case we want to do collaborations, re-alignment, etc. [16:43:43] nuria: (I'm asking as we discussed whether we should be in a different venue than Bentley or not) [17:17:38] what's the easiest way to convert wiki DB names to the project names used in the analytics tables? [17:27:14] tgr, hi! 
if you're in Hive, you can join with the mediawiki_project_namespace_map table, which has the equivalence between hostname and dbname [17:27:28] in wmf_raw database [17:28:54] awesome, thanks! [17:32:18] it seems the request tables cut off the TLD and mediawiki_project_namespace_map uses full hostname, fortunately everything is on .org [17:34:33] * elukey off! [17:38:53] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) FINALLY GOT IT! https://beta-prometheus.wmflabs.org/beta/graph?g0.ran... [17:39:07] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) a:03Ottomata [17:39:32] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) [17:41:14] mforns: yt?? for the event _ReadingDeph schema... [17:43:53] mforns: do you maybe have the dblists in hive? [17:44:15] the contents of the wiki lists from the operations/mediawiki-config repo, I mean [17:46:33] tgr: as in the phisical sharding hosts? [17:47:02] leila: I think we might go hiking! will need to ask team but I doubt we would be in the financial district [17:47:18] nuria: okay. good to know. [17:48:12] mforns: yt? [17:48:19] nuria: they are used for lots of things. 
I'm mostly after physical sharding, yeah (there's a small.dblist containing the smallish wikis and I want to look up information on those; I think it matches with DB sharding relatively well) [17:48:50] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) Thats awesome!!!!🎉 🎉 🎉 🎉 🎉 🎉 @Thank you @Ottomata @Krenair and @fgiu... [17:48:53] tgr: we do not have that data in hive , no [17:49:20] so the lists at the bottom of https://noc.wikimedia.org/conf/ [17:54:49] ack, thanks. It's not too hard to make a big IN query by hand, it would be a nice convenience though. [17:57:47] tgr: wait , those are project names which are available on sitematrix? [17:58:26] tgr: cause we pull the sitematrix to the stats machine, that is meta data though, not actually physical hosts [17:59:33] no, I think sitematrix does not make that available except for a few special groups like "closed" [18:05:02] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Niedzielski) {icon thumbs-up}{icon thumbs-up}{icon thumbs-up}{icon thumbs-up}{ic... 
[18:05:21] (03PS1) 10Fdans: Allow several attempts to get latest data in top metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464375 (https://phabricator.wikimedia.org/T205915) [18:05:50] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) [18:05:55] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Krenair) so this is resolved? [18:06:24] tgr: ok, then I think we do not have that info [18:18:12] fdans: "This change also removes the short-lived but very honorable [18:18:12] availabilityBuffer property." ayayayaya [18:25:01] hey nuria tgr was afk, reading [18:26:15] mforns: k, for the redingDepth schema I think we need to consume also some capsule events right like browser? [18:26:30] nuria, could be interesting [18:27:06] hmmmm [18:27:55] I think we'll need to blacklist explicitly all userAgent subfields that we do not want to ingest, but that would work [18:29:15] nuria, I will try to ingest again adding browser_family and os_family [18:29:22] mforns: just ingest from sanitized db [18:29:27] mforns: right? [18:29:38] nuria, then it would ingest nulls no? [18:30:00] we'd have to blacklist it anyway [18:30:34] mforns: k, we can ingest from events as long as we are dropping > 90 days old data in druid, right? [18:30:51] nuria, that's totally an option [18:31:14] and dropping after 90 days is also what we would do anyway right? [18:31:39] mforns: we already drop data from druid after 90 days, cc joal [18:31:54] yea, depends on the data set config [18:33:42] nuria, but wouldn't it be too much "noise" having the whole userAgent fields? 
[18:36:01] Maybe ingest browser_family, os_family and is_bot [18:38:28] mforns: ya [18:38:43] mforns: as any perf splits w/o browser would not make much sense [18:38:59] mforns: those three should be sufficient [18:39:02] ok [18:39:27] will reload september with new time splits and those 3 fields [18:45:18] (03PS2) 10Joal: Correct geoeditors-load oozie SLA [analytics/refinery] - 10https://gerrit.wikimedia.org/r/464105 [18:48:04] (03CR) 10Joal: "Thanks for comment Nuria, did both :)" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/464105 (owner: 10Joal) [18:48:36] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [18:48:39] (03CR) 10Nuria: [V: 032 C: 032] Correct geoeditors-load oozie SLA [analytics/refinery] - 10https://gerrit.wikimedia.org/r/464105 (owner: 10Joal) [18:49:04] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [18:51:29] (03CR) 10Joal: [V: 031] "See comment inline @ottomata" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/463548 (https://phabricator.wikimedia.org/T202490) (owner: 10Joal) [18:58:00] Heya ottomata - Do you have a minute for names comments in PR ? [19:03:43] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) 05Open>03Resolved a:03herron >>! In T205736#4635602, @chelsyx wrote: > Thanks everyo... 
[19:04:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) Chatted with @Joe today. He followed up on a point of misunderstanding from the IRC meeting. He wanted to know sp... [19:06:35] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) BTW, I am totally open to other suggestions than service-template-node for Option 2. It's just that when balancing... [19:06:44] 10Analytics, 10ChangeProp, 10EventBus, 10WMF-JobQueue, and 2 others: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 (10Pchelolo) I found the reason for this to be happening. At least, one possible reason. And to be honest, I'm embarrassed. To support regex to... [19:09:52] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10chelsyx) Thank you @herron ! [19:12:44] 10Analytics, 10ChangeProp, 10EventBus, 10WMF-JobQueue, and 2 others: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 (10Pchelolo) [19:17:53] 10Analytics, 10ChangeProp, 10EventBus, 10WMF-JobQueue, and 2 others: ChangeProp logging KafkaConsumer is not connected - https://phabricator.wikimedia.org/T199444 (10Pchelolo) The previous comment also explains why we started seeing the errors after DC switchover. Topics are created on demand and while cod... [19:26:31] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) 05Open>03Resolved Yup! 
I can see events here > https://grafana-la... [19:35:36] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [19:43:03] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10Ottomata) Last week Dan and I met with the Product Analysts and discussed the question about using a git repository. My impression... [19:43:47] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10Krinkle) I'm documenting here a question from myself that came up in the IRC discussion hour and in TechCom meetings. What does co... [19:50:26] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10Ottomata) > whether and how one would, for example, draft a commit in one repository in a way that MediaWiki-EventLogging can disco... [19:54:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201643 (10Ottomata) In the EventBus system, all events have a `meta.schema_uri` field, which we populate with the relative URI path of a sche... [19:55:18] (03PS4) 10Joal: Add MediawikiXMLDumpsConverter spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/463370 (https://phabricator.wikimedia.org/T202490) [19:55:31] (03CR) 10Joal: "Thanks for review - Comments inline :)" (038 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/463370 (https://phabricator.wikimedia.org/T202490) (owner: 10Joal) [19:55:57] (03CR) 10Ottomata: [C: 031] "Ohhh right!" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/463548 (https://phabricator.wikimedia.org/T202490) (owner: 10Joal) [20:01:51] (03CR) 10Ottomata: Add MediawikiXMLDumpsConverter spark job (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/463370 (https://phabricator.wikimedia.org/T202490) (owner: 10Joal) [20:08:08] (03CR) 10Ottomata: [C: 031] "Hm, def good enough for now, but one day it'll be nice to be able to configure the bucket sizes. Can't think of a good way to do that atm" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/464171 (https://phabricator.wikimedia.org/T205641) (owner: 10Mforns) [20:08:23] thx [20:14:53] mforns: k let me know [20:15:14] nuria, loading, already 10 days there [20:15:38] now all dimensions are mixed up, but when I finish the month they will be ok [20:27:46] mforns: sounds good, will take a look then, i told gilles this would be useful for nav timing [20:27:55] I saw :] [20:28:09] I mean, I saw you asked him to review [20:47:06] PROBLEM - Number of segments reported as unavailable by the Druid Coordinators of the Analytics cluster on einsteinium is CRITICAL: 2740 gt 200 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_analytics&var-druid_datasource=All [20:53:28] nuria, mmmeh old dimensions still are there after complete reload (overwrite), will drop data and reload again [20:53:55] mforns: k [20:54:45] joal or ottomata : do you understand icinga alarm? [20:55:57] hm, could it be due to beginning of month druid loading? [20:56:11] it looks like there are just a lot of segments scheduled to load (from hdfs?) 
[20:56:27] but it is flapping, so i betcah it will recover [21:08:52] 10Analytics, 10Analytics-Kanban: Annotations need to use adjustedGraphData - https://phabricator.wikimedia.org/T206171 (10Milimetric) [21:08:56] 10Analytics, 10Analytics-Kanban: Annotations need to use adjustedGraphData - https://phabricator.wikimedia.org/T206171 (10Milimetric) p:05Triage>03High [21:09:48] nuria, ottomata, this happened as well at other times I was loading [21:09:49] (03PS1) 10Milimetric: Filter out annotations based on adjustedGraphData [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464428 (https://phabricator.wikimedia.org/T206171) [21:10:11] I remember joal and elukey discussing that they were false alarms [21:10:23] caused by ad-hoc loading in big chunks [21:11:37] mforns: ya, https://github.com/apache/incubator-druid/issues/3173 [21:12:05] mforns: but i just do not even understand the alarm itself [21:12:24] yea [21:12:36] mforns: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Handling_alarms_for_unavailable_segments [21:18:56] RECOVERY - Number of segments reported as unavailable by the Druid Coordinators of the Analytics cluster on einsteinium is OK: (C)200 gt (W)180 gt 0 https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&panelId=46&fullscreen&orgId=1&var-cluster=druid_analytics&var-druid_datasource=All [21:27:39] gilles: mforns is the one that implemented the buckets, i think for perf measures tehy shoudl be pretty useful [21:28:32] mforns is loading some data now to test out the buckets, we can show you in a bit [21:29:54] yes, the values look ok. maybe we'll need more fine-grained between 1s and 5s, but it's hard to say without trying it out first [21:30:19] it's probably fine [21:30:41] ok [21:36:17] gilles: we will load navigation timing with a similar scheme , does that sound fine? [21:36:25] sure [21:37:47] anyone seen this? 
https://hortonworks.com/press-releases/cloudera-hortonworks-announce-merger-create-worlds-leading-next-generation-data-platform-deliver-industrys-first-enterprise-data-cloud/ [21:39:24] bearloga: wow [21:49:18] gilles: Please remember to add to whitelist the data taht will be retained after 90 days: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml [21:50:53] nuria, I can not manage to remove the dataset... tried the instructions in the docs, but the dataset is still there, empty, but keeps the old dimension names... [21:51:24] mforns: i have had this problem before, did you disabled it before removing it? [21:51:29] nuria, yes [21:53:57] nuria, I have tried to use the coordinator DELETE request to the intervals, will try to use the POST request to the overlord... although I thought that would be to kill indexing jobs only [21:54:29] mforns: you tried these right? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Delete_segments_from_deep_storage [21:54:43] nuria, yea [21:55:36] mforns: i have done it sending the manually the delete tasks after disabling [21:55:53] nuria, but there are 2 options to do that in the docs [21:56:07] 1) DELETE request to coordinator asking to delete intervals [21:56:15] 2) POST request to the overlord [21:56:23] 1) does not work for me [21:56:35] removes the data, but not the data set [21:56:52] PROBLEM - Check the last execution of eventlogging_db_sanitization on db1107 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:57:06] wow [21:57:30] mforns: i have done it with 1) define rule to remove segments from deep storage 2) sending kill task [21:57:42] mforns: ya, wow [21:59:02] mforns: what handshake? [21:59:08] mforns: me no compredou [21:59:18] don't know.. [22:01:09] don't see anything weird for db1107 in grafana... 
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1107&var-network=eth0 [22:03:34] nuria, option 2) does not work for me either... [22:04:31] I remember having asked Luca to completely remove a data set from druid at some point, as said, removing data works, but not the data set, which still keeps the dimension names... [22:06:11] mforns: right, data is removed: hdfs dfs -ls /user/druid/deep-storage/event_ReadingDepth/ [22:06:35] mforns: so taht worked [22:06:47] well, data was removed before also [22:07:21] I was trying to remove the datasource [23:20:26] bye teaam
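(Editor's sketch, not part of the log.) The two removal options discussed from 21:53 onward can be written down as the requests they would send. The snippet only builds the request path and JSON body; it does not contact any Druid host. The paths are the standard Druid coordinator and overlord endpoints as far as I know them, and the interval is a made-up example. As noted in the chat, neither option removed the datasource entry with its old dimension names, so this is not a fix for that problem.

```python
import json

# Overlord endpoint that accepts task submissions via POST.
OVERLORD_TASK_PATH = "/druid/indexer/v1/task"


def coordinator_disable_path(datasource):
    """Path for a DELETE request that marks a datasource's segments unused."""
    return "/druid/coordinator/v1/datasources/{}".format(datasource)


def kill_task(datasource, start, end):
    """Body of a kill task, which deletes unused segments in the interval
    from deep storage. start and end are ISO-8601 dates or timestamps."""
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": "{}/{}".format(start, end),
    }


if __name__ == "__main__":
    # Datasource name taken from the hdfs path in the chat; interval is hypothetical.
    print("DELETE", coordinator_disable_path("event_ReadingDepth"))
    body = kill_task("event_ReadingDepth", "2018-09-01", "2018-10-01")
    print("POST", OVERLORD_TASK_PATH)
    print(json.dumps(body, indent=2))
```

The order matters: a kill task only touches segments that are already marked unused, which matches nuria's "after disabling" remark at 21:55:36.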