[00:17:27] (CR) Milimetric: [C: +2] "I still think something's really weird with those tests but that's no reason to hold up the release. Let's look at that soon." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616629 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[00:21:32] (Merged) jenkins-bot: For Android and iOS we only count pageviews with x-Analytics marker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616629 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[00:53:05] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:36:49] Analytics: Can't connect to s8 from stat1006 - https://phabricator.wikimedia.org/T259185 (Ladsgroup) Open→Invalid a: elukey aah thanks!
[06:41:05] good morning!
[06:45:09] Morning
[06:49:29] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:43:32] Analytics-Clusters, Patch-For-Review, User-Elukey: Upgrade Druid to its latest upstream version (currently 0.18.1) - https://phabricator.wikimedia.org/T244482 (elukey) While reading https://druid.apache.org/docs/latest/operations/rolling-updates.html I noticed that the coordinator and the overlord da...
[07:54:26] fdans: o/
[07:54:42] HELLO
[07:55:01] if you are around at ~14 CEST we'll upgrade the druid public cluster (the one serving AQS)
[07:55:23] (when Marcel joins)
[07:55:36] I created an upgrade procedure in https://etherpad.wikimedia.org/p/analytics-druid-migration
[07:56:24] the things that I'd like to check while upgrading (and after) are
[07:56:32] 1) all the queries are working, wikistats is happy, etc.
[07:56:55] 2) indexation works, so loading a mw-reduced snapshot doesn't lead to errors
[07:57:37] the last point is a little bit tricky since we'd want to load not the latest snapshot (since it may impact traffic) but an older one
[07:57:57] we could re-run the entire mediawiki-reduced coordinator but it is huge
[07:58:25] so I am checking load_mediawiki_history_reduced.json.template in refinery to see if we can launch only one druid indexation manually
[08:01:27] say mediawiki_history_reduced_2020_03
[08:02:31] elukey: sounds good to me
[08:07:34] very nice, /wmf/data/wmf/mediawiki/history_reduced/snapshot=2020-03 is on hdfs
[08:07:47] I recovered a druid indexation spec from a middlemanager, we should be good
[08:18:04] ok I think the prep steps are done
[08:19:05] git-fat does not support Python-3 yet. Please use python2.
[08:19:09] /o\
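As context for the manual indexation elukey is preparing above: launching a single Druid indexation comes down to POSTing the recovered JSON spec to the overlord's task API (POST /druid/indexer/v1/task). A minimal sketch, assuming Java 11+, a local spec.json, and network access to the overlord host; this is not the exact tooling used in the chat:

```java
// Sketch: submit a single Druid indexation task by POSTing a recovered
// ingestion spec to the overlord's task API. Assumes Java 11+ and that
// spec.json is the JSON spec recovered from a middlemanager; direct access
// to the overlord host/port is an assumption.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class SubmitDruidTask {
    public static void main(String[] args) throws Exception {
        String spec = Files.readString(Path.of("spec.json"));
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://druid1004.eqiad.wmnet:8090/druid/indexer/v1/task"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(spec))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // On success the overlord replies with the new task id,
        // e.g. {"task":"index_hadoop_mediawiki_history_reduced_..."}
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```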
[08:20:52] (PS1) Elukey: Update Druid Parquet ingestion format class after cluster upgrade [analytics/refinery] - https://gerrit.wikimedia.org/r/617394 (https://phabricator.wikimedia.org/T244482)
[09:45:34] * elukey going afk for a bit for an errand
[10:10:30] (quick lunch before druid maintenance)
[11:46:57] mforns, fdans - when you are ready we can start the upgrade
[11:47:06] in the meantime, I'll do the prep steps
[12:06:08] aaand we are ready :)
[12:06:45] as a starter, I'll update one historical (druid1004's) and check how it goes
[12:09:09] it is reloading segments
[12:10:01] ok looks fine, nothing horrible in the logs
[12:10:27] elukey: helloooo
[12:10:31] sorry 10 mins late
[12:11:01] elukey: are you in da cave?
[12:13:00] mforns: nono I am in here
[12:13:03] (only)
[12:13:06] k
[12:13:10] looking for logs
[12:13:14] I spotted one puppet weirdness, checking
[12:14:13] elukey: we're seeing more of those errors from nav timing and resource timing
[12:15:06] milimetric: morning, is it related to the druid upgrade or alerts@ ?
[12:15:30] no, the large number problems
[12:16:06] ah okok (we are upgrading druid)
[12:16:41] mforns: I am fixing a puppet inconsistency for logging, then I should be able to proceed
[12:16:45] ok
[12:16:50] there was a theory that it was just one version of Firefox that wasn't out yet, so that would kind of make sense: as we get closer to release, more people are downloading that version
[12:17:08] I think it was Firefox 78 and there was some correlation between that and the errors
[12:17:17] mforns: we can sync quickly on bc to decide what to check etc..?
[12:17:25] yes, omw
[12:28:47] milimetric: I thought it was ok to re-run with DROP MALFORMED for that reason :(
[12:35:59] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:36:11] uop
[12:36:15] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:36:45] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:37:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:37:29] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:37:45] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
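The Icinga alerts above probe the AQS edits/per-page endpoint and fire on timeouts. A minimal sketch of the same kind of probe by hand against the public REST API; the page title and date range are illustrative:

```java
// Sketch: manually reproduce the AQS health probe that Icinga runs above,
// by timing a request against the public edits/per-page endpoint. A timeout
// here mirrors the alert condition.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AqsProbe {
    public static void main(String[] args) throws Exception {
        String url = "https://wikimedia.org/api/rest_v1/metrics/edits/per-page/"
            + "en.wikipedia.org/Main_Page/all-editor-types/daily/20200601/20200701";
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .timeout(Duration.ofSeconds(10)) // the Icinga check alerts on timeouts like this
            .build();
        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.printf("HTTP %d in %d ms%n",
            response.statusCode(), (System.nanoTime() - start) / 1_000_000);
    }
}
```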
[12:37:47] can I help elukey ?
[12:37:57] oh, recovering :]
[12:38:01] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:38:09] mmmm weird
[12:38:29] historical logs seem to have stopped
[12:38:31] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:38:41] for 7 minutes now
[12:38:49] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:39:10] mforns: they restarted, I think that my second restart of the historical on 1004 caused some conn pile up
[12:39:15] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:39:20] aha
[12:39:32] I see logs now
[12:39:48] in druid1004 not yet
[12:40:28] mforns: you need to check overlord-requests.log
[12:40:37] oh ok
[12:40:52] yes this was the problem before, log4j not set correctly :(
[12:40:58] so it was logging on the main historical.log
[12:41:05] but having both in one log file is a nightmare
[12:41:11] ok ok
[12:42:09] from the metrics I see a rapid broker lock up, like when we drop snapshots
[12:42:17] I think it is the same issue with conns piling up
[12:42:46] you mean at the time of the issue, or now?
[12:43:14] at the time of the issue yes, the metrics are still looking weird but there is a bit of lag from icinga to prometheus on this side
[12:43:30] going to upgrade 1005 ok?
[12:43:44] mforns: --^
[12:43:59] ok
[12:45:40] elukey: I can't find the log that you mentioned, overlord-requests.log
[12:45:59] on 1004?
[12:46:06] I find historical-requests and overlord-metrics
[12:46:14] both 1004 and 1005
[12:46:37] so
[12:47:06] sorry mforns too many things at once, not overlord but historical-requests.lgo
[12:47:09] *log
[12:47:11] I mixed up names
[12:47:16] ok ok
[12:48:05] this time it looks better, I'll wait a bit before proceeding
[12:48:33] the historicals are delicate since the brokers use all of them, and even if we have relatively tight timeouts it is super easy to pile up conns
[12:48:38] (from brokers to historicals)
[12:48:53] but I see logs in /var/log/druid/historical-requests.log on 1005
[12:48:54] so better
[12:50:17] yes, cool
[12:51:38] proceeding with 1006
[12:55:42] Analytics, Event-Platform: MEP development environment - https://phabricator.wikimedia.org/T259202 (Ottomata) I wonder if all we really need for frontend devs is to be able to submit and validate schemas. I think maybe the simplest thing to do would be to make a new 'eventgate-devserver' (or something)...
[12:56:25] elukey: yeah, it's totally ok, didn't mean to imply it wasn't, but I think it makes the fix more urgent, otherwise we'll have to rerun those jobs all day long
[12:56:28] o/ druid upgrade in place eh?
[12:56:30] elukey: I reviewed the ingestion spec, and I think it's cool. there's a couple of fields that we do not usually specify, like metadataUpdateSpec, dimensionExclusions, etc. But they are nullified, so I think we're fine.
[12:56:31] lemme know if you need anything!
[12:56:51] also, elukey: you should stop rerunning those, you have enough on your plate, that's ops week stuff
[12:56:51] :]
[12:57:00] agree with milimetric
[12:57:53] ottomata: yep for the moment all good! a little bit of connection piling up from a double historical restart, so I am taking it more gently
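One way to pace the rolling restarts described above is to poll Druid's historical readiness endpoint (/druid/historical/v1/readiness, which returns 200 once all assigned segments are loaded and 503 while still loading) before moving on to the next node. A sketch, with an illustrative host and polling interval:

```java
// Sketch: between rolling restarts of historicals, wait until the restarted
// node reports ready before touching the next one, to avoid broker
// connections piling up against a node that is still reloading segments.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class WaitForHistorical {
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    static boolean isReady(String host) {
        try {
            HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":8083/druid/historical/v1/readiness"))
                .timeout(Duration.ofSeconds(5))
                .build();
            return CLIENT.send(req, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
        } catch (Exception e) {
            return false; // treat connection errors as "not ready yet"
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = "druid1004.eqiad.wmnet"; // restart this node first, then poll it
        while (!isReady(host)) {
            System.out.println(host + " still loading segments, waiting...");
            Thread.sleep(10_000);
        }
        System.out.println(host + " is ready, safe to proceed to the next historical");
    }
}
```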
[12:58:20] milimetric: I need to do something with spark to avoid being a total n00b :D
[12:58:32] mforns: 1006 looks good, proceeding
[12:58:37] k
[13:01:41] 1007 done, 1008 is left to do
[13:01:57] then I'll proceed with overlords and middle managers
[13:02:03] (no indexations are running)
[13:04:23] k
[13:08:52] all good, historicals are upgraded
[13:08:58] doing overlords and middle managers
[13:13:20] :]
[13:13:27] done
[13:13:36] now it is the time of the brokers
[13:13:38] on all nodes?
[13:13:41] ok
[13:15:44] looks good on 1004, proceeding with the rest
[13:15:56] (lovely to have 5 nodes..)
[13:16:24] should I have done a cookbook? Yes
[13:16:26] bad luca
[13:18:52] hehe
[13:21:08] brokers done
[13:21:12] final step - coordiantors
[13:21:16] *coordinators
[13:24:25] mforns: everything upgraded
[13:24:33] :D
[13:24:54] launch ingestion of snapshot?
[13:25:08] did you see my comment on that?
[13:25:13] the ingestion spec?
[13:25:21] yes yes sorry I didn't answer, feel free to remove the extra bits
[13:25:31] I am checking with ssh -L 8081:druid1004.eqiad.wmnet:8081 druid1004.eqiad.wmnet that everything looks good
[13:25:39] let's also check that wikistats works properly
[13:27:53] from a quick check nothing is on fire
[13:28:04] mforns: feel free to start the indexation when you are ok
[13:28:36] looking at wikistats
[13:30:10] elukey: wikistats seems good, I even think it's snappier. Possible?
[13:30:32] mforns: could be yes!
[13:31:40] I think that they improved caching a lot
[13:31:50] let's try indexation!
[13:32:09] ok, launching now
[13:33:26] I got a syntax error in the spec
[13:33:27] one sec
[13:35:27] elukey: ok launched now
[13:35:46] I see it in the console
[13:36:57] the mapreduce seems in progress
[13:37:17] looking
[13:37:25] https://yarn.wikimedia.org/proxy/application_1592377297555_199162/mapreduce/job/job_1592377297555_199162
[13:37:38] elukey: do you know how to switch from overlord mode to coordinator mode?
[13:37:52] mforns: what do you mean?
[13:37:56] every now and then, the UI switches to overlord mode and I can't do much
[13:38:17] at the top right, there's the mode
[13:38:32] right now I get restricted mode, and I can't do anything
[13:38:46] now overlord mode again
[13:38:59] I keep refreshing until I get the coordinator mode...
[13:39:01] mforns: what port are you using?
[13:39:08] it doesn't happen to me
[13:39:09] 8090
[13:39:16] ah yes that is the overlord port
[13:39:19] try 8081
[13:39:21] oh ok
[13:39:23] ok
[13:39:43] tourist
[13:39:52] (me)
[13:40:11] ok I see ingestion
[13:40:40] ahahahha nono druid has so many ports, I always forget them
[13:41:24] the errors that I got from mismatching hadoop jars/classes were right at the beginning, the mapreduce job running is promising
[13:42:06] mforns: there is a new daemon called "router", that for the moment we are not using, that offers a better console. You can also run SQL queries from it, together with checking all the other things
[13:42:16] maybe it could be handy to have
[13:42:42] aha, yea
[13:43:14] if turnilo is not usable for some reason, then it would be nice to have the sql endpoint
[13:43:27] also, the coordinator can run as overlord if we wish
[13:43:34] (we can get rid of the overlord daemon)
[13:43:39] aha
[13:43:59] but i tried to keep the moving parts as few as possible
[13:44:20] another advantage of the sql endpoint maybe is to check the actual shape of the data, without the turnilo introspection layer
[13:44:42] for us yes it would be very handy
[13:44:46] but still, don't think it's urgent
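For reference, the SQL endpoint discussed above is exposed by Druid at POST /druid/v2/sql with a JSON body. A minimal sketch, assuming the router's default port 8888 and using the datasource name from this upgrade; both are illustrative:

```java
// Sketch: querying Druid's SQL endpoint, as the "router" daemon discussion
// above describes. Useful for checking the actual shape of the data without
// Turnilo's introspection layer.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSqlQuery {
    public static void main(String[] args) throws Exception {
        String body = "{\"query\": \"SELECT COUNT(*) AS row_count "
            + "FROM mediawiki_history_reduced_2020_03 LIMIT 1\"}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://druid1004.eqiad.wmnet:8888/druid/v2/sql"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON array of result rows
    }
}
```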
[14:06:46] mforns: the cr is https://gerrit.wikimedia.org/r/c/analytics/refinery/+/617394
[14:07:02] (when you have time)
[14:07:24] elukey: sure, but do we want to wait until ingestion is finished?
[14:07:39] ah yes I'll not merge it
[14:07:50] ok
[14:07:53] but it looks good so far, all the problems that I found were right at the beginning
[14:08:15] (CR) Mforns: [C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/617394 (https://phabricator.wikimedia.org/T244482) (owner: Elukey)
[14:30:47] hey mforns / fdans: why isn't edited pages available for all wikis? It's additive across wikis, no? https://stats.wikimedia.org/#/en.wikipedia.org/content/edited-pages/normal|line|2-year|editor_type~anonymous*group-bot*name-bot*user|monthly
[14:32:08] milimetric: not to aqs https://wikimedia.org/api/rest_v1/metrics/edited-pages/aggregate/all-projects/all-editor-types/all-page-types/all-activity-levels/monthly/2018060100/2020073000
[14:32:43] right, but why
[14:32:56] I mean, pages are specific to the wiki they're on
[14:33:18] so why wouldn't we be able to count 5 pages from enwiki + 3 pages from dewiki as 8 pages?
[14:35:39] fdans: as a side note, this metric is not additive on the user dimension, I didn't think we had examples like that
[14:37:02] oh wait I was about to say this breaks your idea that we talked about, rendering breakdowns as 30 separate rows instead of 10 rows with 3 values each, but it's not related, that would still be fine
[14:56:36] Analytics, Event-Platform: MEP development environment - https://phabricator.wikimedia.org/T259202 (Nuria) > It would be configurable with paths to local schema repo(s) and would only validate events (and log them to a file?) I think this would be fine, given that there is no official development enviro...
[15:51:41] (CR) Elukey: [V: +2] Update Druid Parquet ingestion format class after cluster upgrade [analytics/refinery] - https://gerrit.wikimedia.org/r/617394 (https://phabricator.wikimedia.org/T244482) (owner: Elukey)
[15:52:00] Analytics, Event-Platform: Instrumentation development environment on MEP platform. - https://phabricator.wikimedia.org/T259202 (Nuria)
[15:52:34] Analytics, Event-Platform: Instrumentation development environment on EventGate server - https://phabricator.wikimedia.org/T259202 (Nuria)
[16:46:51] a-team, someone needs something merged before deployment train???
[16:48:58] mforns: not me, I am happy to stand by while we do the deploy
[16:49:16] nuria: didn't you have that pageview definition change?
[16:49:28] mforns: it is merged
[16:49:33] mforns: let me triple check
[16:49:44] oh ok ok
[16:50:14] mforns: ya, https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/616629
[16:50:20] mforns: i added it to deployment train
[16:50:37] ottomata: the puppet jar update for refine, do you want me to do it? or do you want to do it?
[16:50:44] thx nuria
[16:50:46] i'd like to do it
[16:50:48] i'll do it next week
[16:50:48] thanks
[16:51:16] ok :]
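Returning to the edited-pages additivity question from 14:30–14:37: a small illustration (with made-up page names) of why distinct edited pages sum cleanly across wikis but not across a dimension like editor type:

```java
// Sketch illustrating the additivity discussion above: distinct edited pages
// can be summed across wikis (a page belongs to exactly one wiki), but not
// across editor types (the same page may be edited by several editor types,
// so per-type counts overlap). All page names and counts are made up.
import java.util.Set;

public class EditedPagesAdditivity {
    public static void main(String[] args) {
        // Across wikis: disjoint sets, so counts add up. 5 + 3 = 8.
        Set<String> enwiki = Set.of("en:A", "en:B", "en:C", "en:D", "en:E");
        Set<String> dewiki = Set.of("de:X", "de:Y", "de:Z");
        System.out.println(enwiki.size() + dewiki.size()); // 8, same as the union

        // Across editor types on one wiki: overlapping sets, so counts do NOT add.
        Set<String> editedByUsers = Set.of("en:A", "en:B");
        Set<String> editedByBots  = Set.of("en:B", "en:C"); // en:B edited by both
        // 2 + 2 = 4, but only 3 distinct pages were edited.
        System.out.println(editedByUsers.size() + editedByBots.size()); // 4, an over-count
    }
}
```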
[16:54:00] nuria: the doc says to restart refine after pageview definition, does it mean restart webrequest bundle?
[16:54:34] * after pageview definition changes
[17:04:40] * elukey afk!
[17:31:53] mforns: yes, but once we have a new jar version
[17:32:03] mforns: so a second change for the jar is needed
[17:32:09] mforns: which i can do
[17:52:40] nuria: ah, ok no prob I will do it
[18:01:41] (PS1) Mforns: Update changelog.md for 0.0.132 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/617510
[18:02:32] (CR) Mforns: [V: +2 C: +2] "Merging for deployment train" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/617510 (owner: Mforns)
[18:03:30] Starting build #55 for job analytics-refinery-maven-release-docker
[18:10:39] (PS1) Mforns: Bump up jar version for webrequest load to 0.0.132 [analytics/refinery] - https://gerrit.wikimedia.org/r/617514
[18:15:28] Project analytics-refinery-maven-release-docker build #55: SUCCESS in 11 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/55/
[18:25:28] Starting build #22 for job analytics-refinery-update-jars-docker
[18:25:45] (PS1) Maven-release-user: Add refinery-source jars for v0.0.132 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/617519
[18:25:46] Project analytics-refinery-update-jars-docker build #22: SUCCESS in 17 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/22/
[18:26:57] (CR) Mforns: [V: +2 C: +2] "LGTM! Merging for deployment train" [analytics/refinery] - https://gerrit.wikimedia.org/r/617519 (owner: Maven-release-user)
[18:27:51] !log deployed refinery-source v0.0.132
[18:27:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:28:59] (CR) Mforns: [V: +2 C: +2] "Self-merging for deployment train" [analytics/refinery] - https://gerrit.wikimedia.org/r/617514 (owner: Mforns)
[18:40:37] elukey: question about the LDAP permissions process. What verification is actually needed once the request gets to you or another SRE person? Meaning: if I create a Wikitech account for someone, have them reset the password, and then file a request for them on Phab, do they or their boss need to comment in the task? Or can we connect them with you or the SRE person over email?
[18:48:12] !log starting refinery deploy (for v0.0.132)
[18:48:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:51:48] nshahquinn: for an employee, only approval from their manager is needed
[19:04:51] nshahquinn: which, if they do not have a phab account, can be sent to SRE via e-mail
[19:14:00] !log finished refinery deploy (for v0.0.132)
[19:14:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:17:46] mforns: did we restart the bundle?
[20:54:46] nuria: got a sec for a brain bounce on event ingestion stuff?
[20:56:16] or mforns ?
[20:58:06] Analytics, VPS-Projects, Puppet: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (bd808)
[21:09:17] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Jclark-ctr)
[21:12:34] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Jclark-ctr) a: Jclark-ctr→Cmjohnson name rack position asset_tag switchport an-worker1102 A4 39 WMF5406 38 an-worker1103 A7 39 WMF5407 25 an-worker110...
[21:29:54] ottomata: here now with time
[21:30:22] ooo i gotta go! but i can tell you what I don't like and you can tell me what you think here
[21:30:39] ottomata: let's do it tomorrow if you have more time
[21:30:43] ottomata: no rush
[21:30:54] https://github.com/ottomata/wikimedia-eventutilities/blob/master/eventutilities-core/src/main/java/org/wikimedia/eventutilities/core/event/EventStreamConfig.java#L359-L370
[21:31:10] i don't like it but i don't know how to make it better
[21:31:18] so the getEventServiceUri functions
[21:31:23] take the event service name (from stream config)
[21:31:31] and look up a URI in the provided eventServiceToUriMap
[21:31:59] but because we need to produce to datacenter specific URIs for canary events (so events go to both datacenter prefixed topic names)
[21:32:10] i also need a way to get a datacenter specific URI
[21:32:44] https://github.com/ottomata/wikimedia-eventutilities/blob/master/eventutilities-core/src/main/java/org/wikimedia/eventutilities/core/event/EventStreamConfigFactory.java#L13
[21:32:50] these are the ones that we will eventually look up from config
[21:32:54] but either way
[21:33:11] what i've got now is an artificial event service URI name suffixed with the datacenter name
[21:33:37] it feels bad to make a method like that that assumes certain datacenter suffixed keys in the eventServiceToUriMap exist
[21:33:54] but that logic has to live somewhere...i could move it out into EventStream, but it isn't much better there
[21:33:58] i could put it in canary event only
[21:34:05] CanaryEventProducer or whatever
[21:34:23] make some special case lookup that assumes the dc specific URI is configured
[21:34:32] maybe that would be ok, dunno.
[21:34:40] ok that's my problem, lemme know if you have ideas.
[21:34:55] you can let me know here in IRC or in a comment on github, whatever you like
[21:34:57] l8rs!
[23:30:01] Analytics, Event-Platform, Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (srodlund) @Ottomata I'll be out on vacation until Aug 17. I can look at these posts when I get back or @apaskulin has agreed to do some initial copyediting wit...
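One possible shape for the CanaryEventProducer option ottomata floats at 21:33–21:34: keep EventStreamConfig datacenter-agnostic and let the canary producer own the convention that datacenter-suffixed keys exist in the map. Apart from the names quoted from the chat (EventStreamConfig, getEventServiceUri, CanaryEventProducer, eventServiceToUriMap), everything below is hypothetical:

```java
// Sketch: move the datacenter-suffix lookup out of EventStreamConfig and
// into the canary producer, which is the only code that needs per-datacenter
// URIs. The suffix convention (e.g. "eventgate-main-eqiad") is assumed, not
// taken from the linked repo, and fails loudly when a key is missing.
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CanaryEventProducer {
    private final Map<String, URI> eventServiceToUriMap;
    private final List<String> datacenters; // e.g. ["eqiad", "codfw"]

    public CanaryEventProducer(Map<String, URI> eventServiceToUriMap, List<String> datacenters) {
        this.eventServiceToUriMap = eventServiceToUriMap;
        this.datacenters = datacenters;
    }

    /**
     * Resolves the datacenter specific event service URI for a stream's
     * event service name, failing loudly if the convention-based key
     * is not configured.
     */
    public URI getEventServiceUri(String eventServiceName, String datacenter) {
        String key = eventServiceName + "-" + datacenter;
        URI uri = eventServiceToUriMap.get(key);
        if (uri == null) {
            throw new IllegalStateException(
                "No event service URI configured for " + key
                + "; canary events need a URI per datacenter.");
        }
        return uri;
    }

    /** Canary events are produced to every datacenter's prefixed topic. */
    public List<URI> getCanaryEventServiceUris(String eventServiceName) {
        return datacenters.stream()
            .map(dc -> getEventServiceUri(eventServiceName, dc))
            .collect(Collectors.toList());
    }
}
```

The trade-off, as the chat notes, is that the "keys must exist" assumption still lives somewhere; this sketch just confines it to the one caller that needs it instead of leaking it into the general-purpose config lookup.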