[04:48:33] (CR) Zhuyifei1999: "I'm wondering, considering that there is no way to navigate through the history of a query, is this really useful for quarry users?" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/509638 (https://phabricator.wikimedia.org/T223009) (owner: Framawiki)
[05:49:05] morning!
[05:49:11] it seems that EL is finally stable
[07:01:44] Analytics, Analytics-EventLogging, Analytics-Kanban, Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (elukey) Update: we were running 1.4.1.-1~stretch1, I have rolled back eventlogging to it and all instabilities went away. 1.4.3 seems a broken version f...
[07:03:01] pf - Thanks a lot elukey for the restarts of the weekend :(
[07:03:18] elukey: Don't hesitate to ping if I can be of any help :S
[07:04:10] joal: bonjour!
[07:04:16] o/
[07:05:09] yeah also my bad since when we rolled back I didn't check carefully the apt history log
[07:05:23] that was a mistake
[07:05:35] but we tested 1.4.3 and it is horrible as well :)
[07:13:26] :S
[07:23:01] (CR) Joal: [V: +1] "Thanks for the review fdans - Let's continue to look for agreement and have you merge it :)" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111) (owner: Joal)
[09:19:11] (PS1) Joal: Refactor mediawiki-page-history computation [analytics/refinery/source] - https://gerrit.wikimedia.org/r/509772 (https://phabricator.wikimedia.org/T190434)
[10:32:28] * elukey lunch!
[12:21:41] elukey: not sure if we have something for this, and/or if it could be useful: https://www.lightbend.com/blog/monitor-kafka-consumer-group-latency-with-kafka-lag-exporter
[12:56:19] joal: nice!
[12:56:39] I didn't read it fully, but it should be a replacement of Burrow + prometheus exporter?
[13:07:17] (PS22) Elukey: Add artifacts for Debian Buster and upgrade to 0.32rc2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/495182 (https://phabricator.wikimedia.org/T212243)
[13:36:18] elukey: I think that's what the thing does :)
[13:40:01] joal: thanks for the link! will read more carefully if it is better/worse than burrow :)
[13:40:18] joal: do you want to update AQS with the new druid datasource?
[13:40:29] elukey: no prob, I try to keep an eye for stuff in our stack ;)
[13:41:33] elukey: yes!
[13:41:55] IIRC marcel posted a code review
[13:41:58] going to merge it
[13:42:14] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509150/
[13:42:40] mforns: commit msg perfect, thanks! --^
[13:49:45] :]
[13:52:33] mforns,joal - aqs1004 is depooled and ready to be tested
[13:52:55] mforns: if you are interested, what I usually do is
[13:53:06] 1) merge the update in puppet (in this case, your change)
[13:53:23] 2) run puppet on all the aqs nodes via cumin (it only updates the config.yaml file, it doesn't restart aqs)
[13:53:50] 3) depool aqs1004 from the load balancer service, restart aqs (so changes are picked up) and let somebody verify
[13:54:06] 4) repool, verify that metrics are good
[13:54:09] 5) apply to all
[13:56:14] Analytics, Analytics-Kanban, EventBus, serviceops, Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (akosiaris) >>! In T222962#5172806, @Ottomata wrote: > Hm, question. Currently mediawiki-config ProductionService...
[13:59:03] elukey: looks good to me - answers requests with values that seem reasonable
[13:59:57] joal: all right, repooling then applying to all
[13:59:59] ack?
[14:00:03] yes !
[14:02:40] morning!
[14:02:51] office hours are starting!
[14:03:01] And probably nobody's here so we really have to advertise at some point :)
[14:06:34] moooorning!
[14:09:36] I'm here :]
[14:09:47] you don't count Marcel
[14:10:02] aahh
[14:10:02] we want [^a]-team members
[14:11:56] joal,mforns aqs updated!
[14:12:04] :D
[14:12:05] \o/ :)
[14:13:06] milimetric: would you mind sending the email about AQS data change?
[14:13:51] joal: sure, so wait, are you sending the internal one about the history schema?
[14:14:10] milimetric: can do, as you wish :)
[14:14:41] joal: it's totally up to you, you're the one who did most of the changes and I feel like sending the email would be like claiming credit in some way, so I don't wanna do that
[14:15:28] milimetric: You wrote those emails, reviewed the code, and help me a lot - No wrong claim here :)
[14:15:42] milimetric: I can nonetheless do it if you prefer :)
[14:19:18] joal: yeah, you send it, that's better :)
[14:19:33] ack milimetric :)
[14:21:41] milimetric: will actually do it after catching the kids - I'll triple check my ideas with you :)
[14:21:49] joal: no rush I think
[14:22:08] snapshot's been out for a bit, and docs on wikitech mention the change, should be fine
[14:22:29] Analytics, Analytics-Kanban, EventBus, serviceops, Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (Ottomata) Ok, in that case, it will be much less annoying to temporarily disable the events in mediawiki-config,...
[14:23:10] Analytics, Analytics-Kanban, EventBus, serviceops, Services (watching): Use new eventgate chart release analytics for eventgate-analytics service. - https://phabricator.wikimedia.org/T222962 (Ottomata)
[14:28:57] elukey: wait so is it still unstable with 1.4.1?
[14:31:46] milimetric: nope it is not, it was unstable with 1.4.3
[14:32:03] now it works fine (it has been since yesterday)
[14:32:25] elukey: thanks for that
[14:32:43] i'm inclined to get rid of newer versions in our apt and just leave 1.4.1 there
[14:32:54] mayyybe we can backport that change gilles needs to 1.4.1 v
[14:34:31] ottomata: sorry for the 1.4.3 confusion, I didn't think to check the apt history log :(
[14:34:48] me neither!
[14:34:58] it was strange that /var/cache/apt/archives didn't have 1.4.3
[14:35:02] i should have thought more about that
[14:35:25] ottomata: I have enabled debug logging on deployment-eventlog05
[14:35:31] sometimes the error happens in there
[14:35:37] with 1.4.1?
[14:35:42] nope 1.4.6
[14:35:44] oh ok
[14:35:45] aye
[14:35:52] maybe we can report something upstream
[14:35:54] so even lowish colume
[14:35:55] volume
[14:36:01] elukey: it is that bug we've been commenting, no?
[14:36:16] it should be yes
[14:36:24] for example, I see this weird thing
[14:36:25] May 13 14:20:09 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 14:20:09,864 [26126] (MainThread) kafka.coordinator.consumer [DEBUG] OffsetCommit for group eventlogging_processor_client_side_00 failed: [Error 25] UnknownMemberIdError: eventlogging_processor_client_side_00
[14:37:24] (CR) Fdans: [V: +2 C: +2] "We're good to merge, thanks joal!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111) (owner: Joal)
[14:37:37] but that might be due to the cgroup already rebalanced
[14:38:08] seems tricky to identify
[14:41:35] milimetric: not sure if my email makes sense, I was about to leave and probably didn't write a good summary :) TL;DR is that we deployed 1.4.6 ~10d ago, then rolled back to 1.4.3 mistakenly instead of 1.4.1
[14:41:41] (PS1) Fdans: Fix two issues with new time selector: [analytics/wikistats2] - https://gerrit.wikimedia.org/r/509867 (https://phabricator.wikimedia.org/T219112)
[14:41:47] (1.4.3 leads to horrible deadlocks)
[14:42:12] 1.4.1 has been running fine since yesterday
[14:43:38] elukey: no that made sense. I think I read the emails/irc in a weird order so I thought there was another restart this morning after the 1.4.1 switch. But I doubt others will be confused
[14:44:12] makes sense that the version we were using so far still works
[14:46:27] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[14:47:30] elukey:
[14:47:31] hm
[14:47:31] UnknownMemberIdError: eventlogging_processor_client_side_00
[14:47:36] that is very weird, that's the group id
[14:47:41] dunno why that would ever be unknown
[14:47:55] unless the group id has been expired
[14:48:00] due to lack of offset commits for a long time.
[14:48:29] .......
[14:48:38] elukey: i doubt this is what is happening: but.
[14:49:14] could it be that the lag we see isn't due to actual lag from rebalances, but because the consumer group expires in kafka, offsets are lost, and a rebalance causes the consumer group to re-subscribe and start from the beginning?
[14:49:17] seems unlikely.......
[14:54:40] Thanks for the merge fdans :)
[14:55:57] ottomata: my suspicion is that $something causes one processor to fail, and the other ones log some errors due to the group being rebalanced. If this happens regularly on a processor, they in turn cause an excessive amount of rebalances that creates constant lag
[14:56:08] your theory could be what's happening
[14:56:38] elukey: i think yours is more likely
[14:56:43] mine wouldn't be too hard to check
[14:56:53] plus we'd probably see logs of some sort about that
[14:57:02] but UnknownMemberIdError: eventlogging_processor_client_side_00 sounds very strange.
[14:57:19] hm oh maybe eventlogging_processor_client_side_00 is not the group id?
[14:57:21] checking...
[14:57:38] a-team I'll be 5 min late to standup, switching rooms
[14:58:06] ok i tis
[14:58:07] it is the group id
[14:58:11] and
[14:58:14] auto_offset_reset=earliest is set
[14:58:15] .
[14:58:16] hm
[14:59:31] hm yeah reading more, that does just seem like the consumer is rebalancing
[14:59:42] not that the group itself has been expired
[15:00:43] ping ottomata milimetric fdans standuppp
[15:09:46] Analytics, Core Platform Team Backlog, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (ori) SGTM. `UIDGenerator::newUUIDv4()` uses a good source of randomness, and the rate of A...
[15:22:18] Analytics, Core Platform Team Backlog, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (fdans) p:Triage→Normal
[15:25:44] Analytics, Analytics-SWAP: Upgrade R in SWAP notebooks to 3.4+ - https://phabricator.wikimedia.org/T222933 (fdans) p:Triage→Normal
[15:29:06] Analytics, Analytics-SWAP: Upgrade R in SWAP notebooks to 3.4+ - https://phabricator.wikimedia.org/T222933 (Ottomata) This isn't easy :( We use Debian's upstream R packages, and Debian Stretch. If/when Buster is available to install in prod from the SRE team, it looks like we could upgrade to 3.5.2: h...
[16:17:08] sorry
[16:17:41] ^ would it be easier to install an upgraded R on the stat machines?
[16:19:54] groceryheist: o/ I think that we have the same problem as pandas/etc..
[16:20:43] but in theory if we plan to upgrade the os on notebooks soon(ish) we could get a more up to date version without a lot of problems
[16:22:02] how's buster going? it's frozen no?
[16:23:26] groceryheist: it should be but we have already been testing it on our infra (the awesome benefits of having a lot of Debian Devels in the SRE team :)
[16:24:03] \m/
[16:41:29] if any of us wants to play with clickstream at "wide" scale (multiple wikis/dates) -> https://wikitech.wikimedia.org/wiki/User:Joal/Clickstream_historical
[17:12:02] * elukey off! o/
[17:59:56] Quarry, Patch-For-Review: Create custom 502 Bad Gateway error page - https://phabricator.wikimedia.org/T223018 (Framawiki) Open→Resolved a:Framawiki Works as expected, yeah! Now if the standard nginx bad gateway error page is shown, that originate from WMCS shared web proxy hosts and that say...
[18:23:23] Analytics, Analytics-Kanban, EventBus, serviceops, and 2 others: Use new eventgate chart release analytics for eventgate-analytics service. - https://phabricator.wikimedia.org/T222962 (Ottomata)
[18:35:03] Analytics, Analytics-Kanban, EventBus, serviceops, and 3 others: Set up LVS for eventgate-main on port 32192 - https://phabricator.wikimedia.org/T222899 (Ottomata) @akosiaris when you have time tomorrow: https://gerrit.wikimedia.org/r/c/operations/dns/+/509912 Thank you!
[18:35:29] Analytics, Product-Analytics: Identify imported revisions in mediawiki_history - https://phabricator.wikimedia.org/T221482 (Neil_P._Quinn_WMF)
[18:46:28] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (dr0ptp4kt) @elukey thanks for the follow up here. No need to block on...
[18:59:57] Analytics, Core Platform Team Backlog, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (Ottomata) Am fine with any uuid that matches the expected format. IIRC, we chose v4 becau...
[19:24:56] Analytics, Analytics-Kanban, EventBus, serviceops, and 3 others: Set up LVS for eventgate-main on port 32192 - https://phabricator.wikimedia.org/T222899 (Ottomata) a:akosiaris
[19:25:08] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[19:26:50] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[19:26:57] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Kanban (Doing), and 2 others: Make EventBus extension support configurable per-event/stream EventServiceName - https://phabricator.wikimedia.org/T222822 (Ottomata)
[19:26:59] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team Backlog (Watching / External), and 2 others: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to EventGate - https://phabricator.wikimedia.org/T211248 (Ottomata)
[19:27:10] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[19:27:13] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team Backlog (Watching / External), and 2 others: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to EventGate - https://phabricator.wikimedia.org/T211248 (Ottomata)
[19:27:19] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 3 others: Modern Event Platform: Stream Intake Service: Implementation - https://phabricator.wikimedia.org/T206785 (Ottomata)
[19:27:21] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[19:30:03] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team Backlog (Watching / External), and 2 others: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to EventGate - https://phabricator.wikimedia.org/T211248 (Ottomata) @Pchelolo {T218346} is pretty muc...
[19:30:48] Analytics, Analytics-EventLogging, Analytics-Kanban, EventBus, and 4 others: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main - https://phabricator.wikimedia.org/T218346 (Ottomata)
[20:40:56] (PS1) Fdans: Add 122 wikis to prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/509938 (https://phabricator.wikimedia.org/T220456)
[20:41:46] (PS2) Fdans: Add 122 wikis to prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/509938 (https://phabricator.wikimedia.org/T220456)
[20:42:16] (second PS is just removing extra trailing space)
[20:46:17] (CR) Nuria: [C: +2] Add 122 wikis to prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/509938 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[20:50:45] (CR) Fdans: [V: +2] Add 122 wikis to prod sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/509938 (https://phabricator.wikimedia.org/T220456) (owner: Fdans)
[21:22:38] Analytics, Core Platform Team Backlog, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (Krinkle) What is this `mediawiki.api-request` data ultimately for and why does it require...