[00:09:04] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 10 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [00:25:17] 10Analytics, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015#4206204 (10Milimetric) As discussed in tasking, I'm breaking down this task into: * model an annotations schema that makes it easy for the UI to render * find or implement a wa... [00:30:03] 10Analytics, 10Analytics-Wikistats: Read Dashiki annotations into Wikistats - https://phabricator.wikimedia.org/T194702#4206208 (10Milimetric) p:05Triage>03High [00:32:00] 10Analytics, 10Analytics-Wikistats: Render annotations on all Wikistats charts - https://phabricator.wikimedia.org/T194705#4206242 (10Milimetric) p:05Triage>03High [00:34:52] 10Analytics, 10Analytics-Wikistats: Organize annotations pages on meta by convention - https://phabricator.wikimedia.org/T194706#4206257 (10Milimetric) p:05Triage>03High [00:35:53] 10Analytics, 10Analytics-Wikistats: Make Dashiki Extension render annotations pages better - https://phabricator.wikimedia.org/T194708#4206281 (10Milimetric) p:05Triage>03Low [00:36:25] 10Analytics, 10Analytics-Wikistats: Interactively add annotations from Wikistats UI - https://phabricator.wikimedia.org/T194710#4206302 (10Milimetric) p:05Triage>03Lowest [00:37:53] 10Analytics, 10Analytics-Wikistats: Wiki popup form to add annotations on meta - https://phabricator.wikimedia.org/T194711#4206320 (10Milimetric) p:05Triage>03Lowest [05:35:00] 10Analytics-Legal, 10WMF-Legal, 10Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4206570 (10Rspeer) > the fact that 'London' is called 'Londres' in Frech is rather un-creative @Denny: Where is this reductionism getting you? 
You can pick one simple example at a tim... [06:12:43] 10Analytics, 10User-Elukey: Tests clone of pivot - https://phabricator.wikimedia.org/T194054#4206622 (10elukey) [06:12:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4206625 (10elukey) [07:10:41] o/ [07:13:46] o/ [07:18:37] elukey: do you want us to discuss druid moves, or should I go and prepare the deploy I wish to do? [07:21:04] joal: as you prefer! [07:21:11] same for me :) [07:21:32] let's chat in here about the next steps for druid then, so I can work on it while you deploy :) [07:21:48] ok [07:22:15] So, about druid - Both KIS and parquet extension have worked, except for the version mismatch thing [07:22:26] as far as I understood, two things are outstanding: [07:22:36] I'm assuming you'll need to repackage wih those extension before actually deploying [07:22:47] 1) druid-parquet 0.10 is not in our debian packages [07:23:12] 2) KIS seems to work and in theory it doesn't need any special puppet config except loading the extension [07:25:37] so from the packaging/puppet point of view, I think we'd need to: [07:26:03] 1) create a new druid debian package with druid-parquet, that will become 0.11.0-2 [07:26:30] 2) set use_cdh=false in puppet and add druid-avro/parquet/kis to the common extensions list [07:26:45] the last point of course just before deploying [07:27:06] joal: anything missing? Probably a bit more time to properly test KIS I suppose [07:27:09] ? [07:28:02] elukey: +1 for the todo - About testing, I assume we'll need real data in order to do so (with real volume) [07:28:29] elukey: If you have an idea of others things to test, I'm super interested Q! [07:28:35] elukey: Maybe alarms? [07:28:56] I'd say no as long as the metrics are ok, didn't see any chance in those [07:29:24] for the realtime testing, this will mean stopping realtime banner impressions for a bit right? 
[07:29:37] probably also removing the cron on analytics1003 [07:29:47] that respawns tranquillity [07:34:04] elukey: yes - it also involves changing the streaming job so that it outputs to kafka [07:35:24] joal: should we do it before upgrading to 0.11 or afterwards? I am fine either way, but if before we'd need to announce the downtime for banner impression just so people will know what it is happening [07:35:46] elukey: +1 for anouncing to the FR team [07:38:42] joal: also this change for 0.12 is interesting https://github.com/druid-io/druid/pull/4815#issuecomment-346155552 [07:39:59] indeed elukey - That change seems good [07:54:46] elukey: No more talking on druid - Shall I go and deploy? [07:57:14] ack! [07:59:30] (03PS4) 10Joal: Add optional datasource to druid loading workflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053 [07:59:32] (03PS2) 10Joal: Add snapshot to datasource-name (mw hist reduced) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429770 (https://phabricator.wikimedia.org/T193388) [08:02:14] (03PS4) 10Joal: Make mediawiki-history-reduced data permanent [analytics/refinery] - 10https://gerrit.wikimedia.org/r/427948 (https://phabricator.wikimedia.org/T192482) [08:02:56] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/427948 (https://phabricator.wikimedia.org/T192482) (owner: 10Joal) [08:03:36] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy (correct patch)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/427948 (https://phabricator.wikimedia.org/T192482) (owner: 10Joal) [08:05:32] (03PS5) 10Joal: Add optional datasource to druid loading workflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053 [08:05:34] (03PS3) 10Joal: Add snapshot to datasource-name (mw hist reduced) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429770 (https://phabricator.wikimedia.org/T193388) [08:11:25] (03PS4) 10Joal: Add snapshot to datasource-name (mw 
hist reduced) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429770 (https://phabricator.wikimedia.org/T193388) [08:13:54] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053 (owner: 10Joal) [08:14:47] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429770 (https://phabricator.wikimedia.org/T193388) (owner: 10Joal) [08:23:47] (03PS6) 10Joal: Update sqoop script to allow for parquet import [analytics/refinery] - 10https://gerrit.wikimedia.org/r/423235 [09:07:28] joal: new deb package ready and sent https://gerrit.wikimedia.org/r/#/c/433131/ [09:07:34] YAY :) [09:07:44] Now I am going to upload/deploy it in labs [09:07:55] elukey: I'm gonna need 2 patches, one for AQS conf (may I let you do that?), and one for sqoop (I'll do it) [09:08:15] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/423235 (owner: 10Joal) [09:08:32] sure thing [09:13:01] can you point me to the config that needs to be added? IIRC there is a code review with the proof of concept [09:13:56] elukey: https://gerrit.wikimedia.org/r/#/c/429765/2/config.example.wikimedia.yaml [09:14:42] elukey: 2 ways of doing it: adding a dict in puppet/hiera to fill in the datasources, or having single properties per datasources [09:14:49] elukey: I don't mind one or the other [09:15:17] elukey: It might become handier to have a dict if we plan to add some more sources, but I'm pretty sure we won't have tons [09:18:20] ack [09:19:02] elukey: do you want me to test your last deploy in druid-labs? [09:19:18] haven't done it yet, I am working on the aqs config first [09:19:23] ack! [09:19:27] Thanks elukey [09:35:53] joal: so the crontab entry seems to long :( [09:36:18] yes elukey :( [09:36:30] elukey: should we use conf files? 
[09:36:52] !log Deployed refinery using scap [09:36:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:37:11] !log Deploy refinery onto HDFS [09:37:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:39:19] never seen this error before, weird [09:45:20] it says "-":48: command too long [09:45:25] that doesn't make any sense [09:47:58] WAT? [09:50:20] it seems like a "-" is interpreted badly [09:50:36] you just added --output-format avrodata right? [09:51:09] ah wait I may have found sometimg [09:51:11] *something [09:51:22] correct elukey [09:51:37] look at this [09:51:38] --partition-value $(/bin/date --date="$(/bin/date +\%Y-\%m-15) -1 month" +'\%Y-\%m') --mappers 4 --output-format avrodata [09:51:46] anything missing? :D [09:52:02] Ahhhhh - Crap [09:53:03] hm [09:53:24] I think that --date="$(/bin/date +\%Y-\%m-15) -1 month" +'\%Y-\%m') is a bit weird, or seems to be [09:53:26] actually, can't see it :( [09:54:06] elukey: I didn't touch it :( [09:54:16] * joal doesn't understand [09:54:44] I think that for some reason adding the --output-format brought up the issue that was already there [09:57:04] joal: what do you mean that you can't see it? IRC doesn't show?
[09:57:28] elukey: you said something was missing, and I can't see it [09:57:36] ah sure ok [09:58:08] so I am trying to understand --date="$(/bin/date +\%Y-\%m-15) -1 month" +'\%Y-\%m') [09:58:31] err sorry [09:58:32] --partition-value $(/bin/date --date="$(/bin/date +\%Y-\%m-15) -1 month" +'\%Y-\%m') [09:58:59] It gives you a month-format value (2018-04) for the previous month [09:59:25] elukey@analytics1003:~$ a=$(/bin/date --date="$(/bin/date +\%Y-\%m-15) -1 month" +'\%Y-\%m') [09:59:28] elukey@analytics1003:~$ echo $a [09:59:30] \2018-\04 [09:59:42] elukey: the \ is for cron [09:59:48] to escape the % [10:00:55] yeah right that one is only interpreted by cron, shouldn't end up in the final thing that gets rendered [10:06:26] ok the explanation of "-":48 should be that the command on line 48 is too long [10:09:06] so I managed to make it work with [10:09:31] # Puppet Name: refinery-sqoop-mediawiki-private [10:09:32] PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python [10:09:32] 0 0 2 * * /usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables [10:09:47] basically doing it like the MAILTO [10:11:00] but not sure if the absence of export causes some trouble [10:13:50] so joal it seems to be an issue with a command that is too long [10:14:11] elukey: I would have guessed that :) [10:14:28] elukey: too long for cron?
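The --partition-value expression discussed above can be sketched in Python as follows. This is only an illustration, not code from the refinery repository: the function names are hypothetical. The shell version anchors on the 15th, presumably so that "-1 month" never starts from a day the previous month lacks; plain year/month arithmetic does not need that step. The second helper shows the cron-side escaping mentioned in the chat: in a crontab an unescaped % is turned into a newline, so it has to be written as \%.

```python
from datetime import date


def previous_month_partition(today: date) -> str:
    """Return the previous month as 'YYYY-MM' (hypothetical helper).

    Intended to mirror:
        date --date="$(date +%Y-%m-15) -1 month" +'%Y-%m'
    """
    # Step back one month by hand; January wraps to December of the prior year.
    if today.month == 1:
        return f"{today.year - 1:04d}-12"
    return f"{today.year:04d}-{today.month - 1:02d}"


def escape_for_crontab(command: str) -> str:
    """Escape '%' so cron does not treat it as a newline (hypothetical helper)."""
    return command.replace("%", r"\%")
```

Called with a date in May 2018, `previous_month_partition` yields the same `2018-04` value quoted in the conversation. Running the escaped form directly in a shell, as in the `echo $a` test above, leaves the backslashes in the output because only cron strips them.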
[10:15:57] apparently, I checked and it is 1008 chars logn [10:15:59] *long [10:19:07] but I cannot find a documentation that states the limit [10:23:56] joal: in the meantime, https://turnilo.wikimedia.org/ :) [10:24:58] awesome elukey :) [10:29:14] ok so I am going to finish the work for AQS, then I'll try to fix that cron [10:29:31] elukey: will try to understand as well [10:34:36] elukey: https://github.com/systemd-cron/crontab/commit/536daf9d826278514a919aa98380e15b01eacd94 [10:34:47] looks like we're gonna need to reduce the command size :( [10:38:19] !log Kill-Restart mediawiki-history-reduced ooie coordinator to pick up deployed changes [10:38:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:41:26] (03PS1) 10Joal: Correct bug introduced at rebase of mwh-reduced [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433137 [10:41:49] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433137 (owner: 10Joal) [10:44:56] * elukey cries [10:44:59] thanks joal [11:06:38] almost there but the new settings are rendered weirdly https://puppet-compiler.wmflabs.org/compiler02/11204/aqs1004.eqiad.wmnet/ [11:50:35] joal: i see we now have virtualpageviews available in https://superset.wikimedia.org/druiddatasourcemodelview/list/ \o/ ... [11:50:53] ...but how can one select SUM(view_count) as a metric from it? [11:51:19] (the dropdown only appears to offer COUNT(*), which doesn't quite make sense for this table) [11:51:43] HaeB: we also have https://turnilo.wikimedia.org/ now (still need some testing but it looks good) [11:51:58] (pivot's open source replacement) [11:52:41] HaeB: I have updated the conf - You should have a sum__view_count available :) [11:52:58] elukey: oh nice. was talking about that with faidon recently, didn't know it was already being worked on! 
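For the "command too long" cron error discussed earlier in this log, a small pre-deployment check could flag crontab entries before they are installed. The sketch below is hypothetical and not part of puppet or refinery. The limit of 1000 characters is an assumption: the conversation only establishes that a 1008-character entry was rejected, and nobody found documentation stating the exact limit. The check measures the whole line, not just the command part, so it errs on the side of flagging too much.

```python
# Assumed limit, not confirmed anywhere in the log: a 1008-character entry
# was rejected, so 1000 is used here as a placeholder. Verify it against
# the cron implementation actually in use.
ASSUMED_MAX_COMMAND = 1000


def find_overlong_entries(crontab_text: str, limit: int = ASSUMED_MAX_COMMAND):
    """Return (line_number, length) pairs for crontab lines longer than `limit`."""
    overlong = []
    for number, line in enumerate(crontab_text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments (such as "# Puppet Name: ...") are skipped.
        if not stripped or stripped.startswith("#"):
            continue
        if len(line) > limit:
            overlong.append((number, len(line)))
    return overlong
```

Moving long values such as PYTHONPATH out of the command and into a crontab variable line, as tried in the conversation, is one way to bring a flagged entry back under the limit.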
[11:53:02] joal: I think I finally have the patch ready for aqs [11:53:22] \o/ elukey :) [11:53:25] I have erb and yaml [11:53:39] joal: https://gerrit.wikimedia.org/r/433140 [11:54:19] ah no wait a nit is wrong, resending [11:54:54] joal: let me know when you want it deployed [11:55:28] elukey: looks great :) [11:56:18] <%- is very different that <% (spaces removed with '-', that otherwise are added before and after) [11:56:34] result in https://puppet-compiler.wmflabs.org/compiler02/11212/aqs1004.eqiad.wmnet/ [11:56:50] elukey: If you don't mind we'll deploy afer standup, so that my tests of other refinery patches are done (and a new datasource is available to test as well) [11:57:30] ack, whenever you want [11:58:32] actualy elukey, you can deploy that patch now if you want - The new conf will be ignored [12:00:23] joal: ack proceeding then [12:15:42] joal: deployed to aqs1004, is there a way to check that everything is fine? [12:16:14] elukey: reading the conf :) [12:16:18] elukey: I'll do that [12:17:04] elukey: conf looks good :) [12:17:17] the druid spaces were reduced from 4 to 2 to match the rest of the config [12:17:36] elukey: if we want to be sure there is no impact, we can restart AQS on aqs1004 [12:17:53] one! [12:19:00] *done [12:20:41] ok all done, config deployed [12:20:49] Awesome :) [12:20:52] Thanks a lot elukey :) [12:23:39] elukey: do we take a minute now to review my prez, or shall we do that post-standup? [12:26:41] oh yes let's do it! [12:26:44] if you have time now [12:27:54] yes ! [12:29:52] bc? [12:30:46] bc? [12:31:19] yes, 1 min [12:33:36] elukey: I'm in ! [13:40:01] o/ [13:43:57] o/ [13:46:44] 10Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4207619 (10Milimetric) @CristianCantoro, sorry for the delay. I think the goals for us are: 1. serve per-article stats from an API so they can be incorporated into @MusikAnimal's tool. 2. publish more complete dum... 
[13:48:50] joal / elukey: what do you think about loading a very abridged version of historical pageviews per article into the API? [13:49:36] if you look at that comment I just made above (https://phabricator.wikimedia.org/T188041#4207619), it looks like we can reduce the number of articles we need to load by a factor of 10, and I think that brings it into feasible range with the storage we have on Cassandra, no? [14:01:08] (03PS2) 10AndyRussG: [PLS. DON'T MERGE] Make banner activity Druid ingress from EventLogging [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432405 (https://phabricator.wikimedia.org/T186048) [14:02:20] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Migrate eventbus camus to Kafka jumbo - https://phabricator.wikimedia.org/T189713#4207661 (10Ottomata) [14:05:24] hey milimetric, we can chat during standup with Jo about this, I am less into sizing in cassandra than he is :D [14:07:43] cool, just a thought I had a I was breaking hearts on phab :) [14:16:29] (03CR) 10AndyRussG: [C: 04-2] "A few notes..." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/432405 (https://phabricator.wikimedia.org/T186048) (owner: 10AndyRussG) [14:18:04] 10Analytics, 10EventBus: TLS encryption for cross DC Kafka main MirrorMaker instances - https://phabricator.wikimedia.org/T194764#4207740 (10Ottomata) p:05Triage>03Normal [14:22:23] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207782 (10Ottomata) @bblack would you mind if I assigned this to someone on your team? 
[14:22:58] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4207783 (BBlack) a:Ottomata>Vgutierrez Done :) [14:24:26] Analytics: Decomission old analytics kafka cluster - https://phabricator.wikimedia.org/T183303#4207790 (Ottomata) [14:24:30] Analytics, Analytics-Cluster, Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#4207789 (Ottomata) [14:25:53] Analytics, Analytics-Cluster, Analytics-Kanban, Discovery, Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4207796 (Ottomata) @EBernhardson thanks for looking into this! I'd really like to defer to your best intuition here... [14:26:15] Analytics, Analytics-Cluster, Analytics-Kanban, Discovery, Patch-For-Review: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136#4207798 (Ottomata) @EBernhardson would you mind if I assigned this to you? [14:28:05] Analytics, Analytics-Cluster, Patch-For-Review: Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#4207801 (Ottomata) Now that main Kafka clusters have been upgraded to 1.x, we use a 1.x MirrorMaker, which so far is way more stable. I think we are ready to move forwa...
[14:38:13] (03PS1) 10Milimetric: Fix name of wikimedia-ui-base package [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/433155 [14:39:56] (03PS2) 10Milimetric: Stop the bar chart from incrementing its height [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/433155 (https://phabricator.wikimedia.org/T194431) [14:41:07] 10Analytics-Kanban, 10Patch-For-Review: Checklist for geowiki pipeline - https://phabricator.wikimedia.org/T190409#4207829 (10Milimetric) [14:42:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 2.447e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:42:48] ottomata: --^ [14:43:11] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Put view settings in URL so it can be shared and bookmarked - https://phabricator.wikimedia.org/T179444#4207841 (10Milimetric) a:03Milimetric [14:44:11] AH! [14:44:13] looking [14:44:15] again! sheesh [14:44:39] ottomata: could it be spurious data due to burrow's restart? [14:45:06] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Put view settings in URL so it can be shared and bookmarked - https://phabricator.wikimedia.org/T179444#4207851 (10Milimetric) [14:45:59] oh did you restart burrow? [14:45:59] it coudl be [14:46:03] things look pretty normal... 
[14:46:07] i do see the alg though [14:46:14] OH [14:46:14] yes [14:46:20] elukey: same thign that happened the last time [14:46:28] the topics it is reporting are no longer replicated [14:46:30] so it got auto-restarted [14:46:31] mirrored [14:46:39] all change-prop tpics [15:00:59] ping ottomata elukey joal fdans [15:01:01] coming [15:01:42] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Stream, 10Patch-For-Review, 10Wikimedia-Incident: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#4207931 (10Ottomata) [15:02:09] ping elukey [15:02:21] coming sorry, zookeeper is keeping me busy :) [15:12:36] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Stream, 10Patch-For-Review, 10Wikimedia-Incident: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#4207959 (10Ottomata) a:03Ottomata [15:15:16] 10Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4207967 (10CristianCantoro) @Milimetric, no problem. About 2.: >>! In T188041#4207619, @Milimetric wrote: > For 2, the scripts are available, yes: https://github.com/wikimedia/analytics-wikistats/tree/master/pagev... [15:15:39] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Put view settings in URL so it can be shared and bookmarked - https://phabricator.wikimedia.org/T179444#4207970 (10Milimetric) @DarTar, are you envisioning a dynamic DOI registration through something like https://mds.datacite.org/ ? Or statically regi... [15:50:25] 10Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4208144 (10MusikAnimal) > just call it "abridged" or something, making sure people know some pages are missing. What do you both think, would people/you be happy with something like this? Better than nothing, for s... [15:52:52] milimetric: I leave the AQS patch as-is without try/cath - ok? 
[15:53:52] joal: of course, I was just showing you a trick, same thing either way [15:54:02] ok :) [15:54:21] (03CR) 10Joal: [V: 032] Add a config param for druid datasources [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387) (owner: 10Joal) [15:55:31] !log bouncing main -> analytics MirrorMaker [15:55:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:57:18] (03PS1) 10Milimetric: Remove metrics table from additional partition groups [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433176 [15:57:42] mforns: check the above and if you're good I'll deploy it [15:57:55] anyone else want a deploy of refinery, joal were you gonna do one with Luca? [15:58:35] (03PS1) 10Joal: Update aqs to db369e6 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/433177 [15:59:06] milimetric: did one this morning [15:59:08] thanks [15:59:15] k, will do this to fix that cron then [15:59:24] milimetric: many thanks !! [15:59:43] I pathed the table but forgot about that bit :( [16:00:54] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/433177 (owner: 10Joal) [16:01:17] people I am failing over an1001 to an1002 [16:01:53] !log Deploy AQS using scap [16:01:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:03:42] (03CR) 10Mforns: "I think the code as you modified will work, but if we're removing the only table that has the metric partition, we can as well remove all " (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433176 (owner: 10Milimetric) [16:03:59] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208208 (10Vgutierrez) Right now the TLS server allows the client to pick up the curve to use, since j8u121 (8u171-b11-1~deb9u1 is deployed on k... 
[16:04:01] milimetric, ^ [16:05:26] k, will do [16:06:31] (03PS2) 10Milimetric: Remove metrics table from additional partition groups [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433176 [16:06:58] milimetric: I want to make clear that i think the idea of reducing dataset by only loading pages with more than 10 pageviews is s super good one eh? [16:07:06] milimetric: even if we don't do it now [16:11:01] (03CR) 10Mforns: [C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433176 (owner: 10Milimetric) [16:11:15] elukey: deploy successfull :) [16:11:15] joal: deploy all good? [16:11:18] nice :) [16:11:21] :) [16:11:25] milimetric, LGTM, you think we should test it? [16:11:34] elukey: do we test changing the conf param [16:11:35] ? [16:16:22] mforns: oh yeah, testing with --dry-run and if that's fine I'll deploy [16:16:38] milimetric, OK cool! [16:17:40] joal: sure we can [16:17:52] joal: will you send a puppet code change? [16:18:24] Will do ! [16:19:02] 10Analytics, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to usergroups for Marshall Miller - https://phabricator.wikimedia.org/T194550#4208251 (10herron) 05Open>03Resolved Ok @MMiller_WMF, you should be good to go! In case you haven't seen them already, there are instructions at ht... [16:19:40] mforns: it works fine, but there are no partitions to drop. Eh... good enough :) [16:19:44] (03CR) 10Milimetric: [V: 032] Remove metrics table from additional partition groups [analytics/refinery] - 10https://gerrit.wikimedia.org/r/433176 (owner: 10Milimetric) [16:20:56] !log deploying refinery to fix that partition drop cron [16:20:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:21:25] 10Analytics, 10EventBus, 10JobRunner-Service, 10MediaWiki-Database, and 5 others: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending - https://phabricator.wikimedia.org/T191282#4208257 (10mmodell) @jcrespo: should we close this then? 
Have you been able to verify that all are gone? [16:29:48] 10Analytics, 10EventBus, 10JobRunner-Service, 10MediaWiki-Database, and 5 others: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending - https://phabricator.wikimedia.org/T191282#4208307 (10jcrespo) p:05High>03Normal @mmodell Most of the errors are gone, but some are still happening. I think t... [16:35:13] !log finished deploying refinery, cron for dropping old mediawiki snapshots should now be good [16:35:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:00:53] elukey: I confirm this thing is a success :) [17:00:55] great [17:04:05] joal: nice! [17:08:45] going afk team! Zookeeper seems fine, all restarted except 3 kafka jumbo brokers that will be taken care by andrew :) [17:08:51] will be back online later on to check! [17:14:04] (03CR) 10Mforns: Hide "Load more rows..." once max data is visible in Table Chart (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/427774 (https://phabricator.wikimedia.org/T192407) (owner: 10Sahil505) [17:20:26] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://localhost:8092/v2/recentchange within 10 seconds. [17:20:27] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2004 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://localhost:8092/v2/recentchange within 10 seconds. [17:25:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Index and store page preview agreggates on Druid so they are visible in pivot/superset - https://phabricator.wikimedia.org/T192305#4208454 (10mforns) Virtualpageviews_hourly is in druid! https://pivot.wikimedia.org/#virtualpageviews-hourly/line-chart/2/EQU... [17:32:12] bye teammm [17:34:08] PROBLEM - Check if active EventStreams endpoint is delivering messages. 
on scb1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://localhost:8092/v2/recentchange within 10 seconds. [17:34:48] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb2005 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://localhost:8092/v2/recentchange within 10 seconds. [17:35:53] this is a false alarm ^ [17:35:56] not sure why yet [17:35:57] new check [17:36:19] PROBLEM - Check if active EventStreams endpoint is delivering messages. on scb1003 is CRITICAL: CRITICAL: No EventStreams message was consumed from http://localhost:8092/v2/recentchange within 10 seconds. [17:46:58] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1002 is OK: OK: An EventStreams message was consumed from http://scb1002.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. [17:49:49] there they come ^ [17:50:38] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2004 is OK: OK: An EventStreams message was consumed from http://scb2004.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. [18:04:19] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1001 is OK: OK: An EventStreams message was consumed from http://scb1001.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. [18:04:59] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb2005 is OK: OK: An EventStreams message was consumed from http://scb2005.codfw.wmnet:8092/v2/stream/recentchange within 10 seconds. [18:06:38] RECOVERY - Check if active EventStreams endpoint is delivering messages. on scb1003 is OK: OK: An EventStreams message was consumed from http://scb1003.eqiad.wmnet:8092/v2/stream/recentchange within 10 seconds. [18:10:57] Gone for tonight a-team - TOmorrow is kid's day, will be there in the evning [18:37:33] laters! 
[20:11:35] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4208839 (Ottomata) So, something like ^? [23:36:30] Analytics, EventBus, Services (doing), User-Elukey: Kafka sometimes misses to rebalance topics properly - https://phabricator.wikimedia.org/T179684#4209195 (Pchelolo) [23:36:35] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#4209196 (Pchelolo) [23:36:38] Analytics, EventBus, Services (done), User-Elukey: Investigate group.initial.rebalance.delay.ms Kafka setting - https://phabricator.wikimedia.org/T189618#4209192 (Pchelolo) Open>Resolved This was deployed to production, the number of rebalance log messages during the consumer startups dec...