[02:01:17] Analytics, EventBus, MediaWiki-extensions-CentralAuth, MediaWiki-extensions-CentralNotice, and 4 others: EventBus jobs failing heavily because of CentralNotice and WikibaseRepo - https://phabricator.wikimedia.org/T225195 (Jdforrester-WMF) Can we just declare this Resolved or is further work under...
[08:18:56] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (elukey) @Ottomata I think I have fixed the spark2shell --master yarn issue with the last patch. Added the current issues in https://wikitec...
[09:08:42] Analytics, Analytics-Kanban, Operations, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (elukey) Resolved→Open p:High→Normal
[09:08:47] Analytics, Operations, Security, Services (watching), Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey)
[09:09:04] Analytics, Analytics-Kanban, Operations, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (elukey) We didn't discuss if SERVICE UNKNOWN needs to alarm or not for some services :)
[10:36:48] * elukey lunch!
[11:41:12] team, restarting the cassandra bundle
[12:51:36] mforns: o/
[12:52:57] mforns: so after a few new settings I discovered that hive works with kerberos (the CLI tool)
[12:53:08] but in theory it shouldn't have before
[12:53:21] what did you test when you told me that it was working?
[13:02:47] Hey folks. Anyone coming to the analytics systems hangout meeting?
[13:05:13] halfak: o/ I usually don't attend this meeting, I think that nobody else is around now from my team. I can join if you need some info but my knowledge about data processing is limited :)
[13:05:54] Oh! Gotcha. We usually see ottomata in this meeting.
[13:05:57] Is he OOO?
[13:07:16] halfak: he should be working today, usually he joins around this time (or a bit later)
[13:08:49] there you go --^
[13:08:50] :)
[13:09:38] thanks elukey
[13:09:43] ottomata, we're waiting on you :P
[13:10:53] AH!
[13:10:53] a 9am meeting?!
[13:10:53] ok!
[13:10:58] be there in a minute
[13:12:13] Analytics, Developer-Advocacy, MediaWiki-API, Reading-Admin, and 3 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079 (WDoranWMF)
[13:23:01] elukey, I just tested queries I think
[13:23:07] read-only
[13:25:11] I'm about to test oozie jobs with hive
[13:25:25] mforns: nono wait a sec please
[13:25:31] oh
[13:25:33] I am deploying refinery again in the testing cluster
[13:25:36] ok ok
[13:25:37] I need to test some things
[13:25:49] ok, no prob
[13:26:45] mforns: what things do you want to test?
[13:27:27] because I am now wondering if hive2 actions are needed
[13:27:35] I thought so, but I made some discoveries this morning
[13:27:50] so what I am trying to do is to restart the webrequest bundle with hive actions
[13:27:53] and see how it goes
[13:28:07] if it works I'll cry for 2 hours probably
[13:28:30] elukey, I was planning to execute the mediacounts job
[13:28:33] load
[13:28:46] read from webrequest and write to mediacounts table
[13:30:46] mforns: gimme 10 mins and I should be ready
[13:31:28] elukey, no rush man
[13:37:20] Analytics, EventBus, MassMessage, Operations, and 2 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (Pchelolo) Open→Resolved Report written. Please reopen if it's not sufficient.
[13:45:17] mforns: if you have time I can explain in bc what I am doing
[13:46:38] elukey, ok
[13:46:42] omw
[15:10:43] Analytics, Analytics-Kanban, Patch-For-Review: Refine should accept principal name for hive2 jdbc connection for DDL - https://phabricator.wikimedia.org/T228291 (Ottomata)
[15:17:04] Analytics, Reading Depth, Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (phuedx) >>! In T216063#5328423, @Milimetric wrote: > These failures are expected, though, the spec does not account for such prefixes, it just expect...
[15:21:12] ottomata: o/
[15:21:28] I fixed the spark2 shell --master yarn thing
[15:21:40] and also discovered that the hive CLI works now with the new settings :)
[15:21:45] (but not oozie's actions)
[15:23:00] really!
[15:23:10] tell me more!
[15:23:15] how does hive cli work?
[15:24:46] so there are two options that need to be in sync with coordinator and client
[15:25:07] <property>
[15:25:07]   <name>hive.metastore.kerberos.principal</name>
[15:25:07]   <value>hive/_HOST@WIKIMEDIA</value>
[15:25:07] </property>
[15:25:12] <property>
[15:25:12]   <name>hive.metastore.sasl.enabled</name>
[15:25:12]   <value>true</value>
[15:25:13] </property>
[15:25:28] if one of the two is missing, hive (cli) returns weird errors
[15:25:51] but basically the latter instructs hive cli to use sasl to authenticate via Thrift to the metastore
[15:26:01] and it seems to be working fine
[15:26:41] the former is kinda similar to what we need for jdbc and hive2-server
[15:26:53] ah nice
[15:26:55] ok cool
[15:27:07] nuria: I'm in bc if you want to talk pre standup about transcoding data
[15:27:11] removing the sasl spark2 options also made everything work with master yarn
[15:27:23] oh cool
[15:27:23] fdans: on meeting sorry
[15:27:28] sure!
[15:27:30] nice
[15:28:06] ottomata: we can test Refine when your change is ready and see how it goes
[15:28:15] (if you haven't already done it)
[15:29:05] elukey: i have already done it!
[15:29:09] oh i haven't tested in yarn
[15:29:34] Analytics, Reading Depth, Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (Milimetric) I meant the `operator_user=` and `iorg_domain_internal=` prefixes, which are not added by us in any way, so these failures are expected,...
[15:44:58] ottomata: bit of a random thought, but i wonder if the upload event should report when the items will expire. While testing locally with 1 hour delete-after i realized i can't tell the difference between "swift isn't returning files i expected, try again later" and "swift has deleted the files i expected, give up". Although i guess i could just retry a few times then declare failure
[15:46:08] perhaps not a particularly important edge case with longer delete-after, but perhaps worth thinking about
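A minimal sketch of the retry-then-give-up approach ebernhardson floats above, for a consumer expecting swift objects that may already have passed their delete-after window. This is not from swift-upload.py or the refinery code; fetch_fn, the attempt count, and the delay are all hypothetical placeholders, and the sketch only illustrates that without an expiry timestamp in the upload event the consumer cannot tell "not yet uploaded" from "already expired", so it just bounds the retries.

    import time

    def fetch_with_retries(fetch_fn, attempts=5, delay_seconds=30):
        """Try to fetch an expected swift object a few times before giving up.

        fetch_fn is a hypothetical callable that returns the object body or
        raises FileNotFoundError when the object is missing. Retry a bounded
        number of times, then declare failure.
        """
        for attempt in range(1, attempts + 1):
            try:
                return fetch_fn()
            except FileNotFoundError:
                if attempt == attempts:
                    raise RuntimeError(
                        'object still missing after %d attempts; '
                        'assuming it expired or was never uploaded' % attempts)
                time.sleep(delay_seconds)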
Although i guess could just retry a few times than declare [15:45:04] failure [15:46:08] perhaps not a particularly important edge case with longer delete-after, but perhaps worth thinking about [15:47:12] 10Analytics, 10Editing-team: Deletion of limn-edit-data repository - https://phabricator.wikimedia.org/T228982 (10Milimetric) @Neil_P._Quinn_WMF they weren't migrated, they only exist in the old repo: https://github.com/wikimedia/analytics-limn-edit-data [15:49:00] (03CR) 10EBernhardson: swift-upload.py to handle upload and event emitting (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: 10Ottomata) [15:52:08] 10Analytics, 10Discovery, 10Operations, 10Research-Backlog: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10EBernhardson) [16:03:41] (03CR) 10EBernhardson: swift-upload.py to handle upload and event emitting (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: 10Ottomata) [16:23:53] 10Analytics, 10Reading Depth, 10Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (10phuedx) Correct. Those prefixes must be added by whatever application is proxying our content. [17:00:13] a-team: do we do any grooming today or do we skip? [17:06:12] elukey, team, didn't we want to talk about transcodings etc? [17:06:58] I think so but nobody is in the bc :) [17:07:23] brb [17:11:43] mforns: ready to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519685/ ? [17:25:20] elukey, yes :] [17:25:31] (03CR) 10Ottomata: swift-upload.py to handle upload and event emitting (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: 10Ottomata) [17:25:33] elukey, and the webrequest one as well I think [17:28:46] 10Analytics, 10Operations, 10SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Mayakp.wiki) Thanks @Nuria for the query and suggestion. I will use Jupyter and Beeline in the meantime. Please let me know whenever my HUE access is granted. https://wikitech.wikimed... [17:28:53] mforns: link of the second one? [17:29:16] mforns: ah btw just finished the work in the testing cluster, the webrequest test bundle works again now [17:29:21] you are free to test :) [17:29:25] elukey, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519683/ [17:29:35] cool! [17:34:02] mforns: merged and deployed both [17:34:11] :D [17:34:20] will check [17:34:24] thanks! [17:34:29] all right going to dinner! [17:34:30] o/ [17:34:44] * elukey off! 
[18:59:57] Analytics, Analytics-Kanban: Add agent type split to wikistats pageviews - https://phabricator.wikimedia.org/T228937 (Nuria) Open→Resolved
[19:00:16] Analytics, Analytics-Kanban: Map doesn't redraw when returning from table view - https://phabricator.wikimedia.org/T226514 (Nuria) Open→Resolved
[19:15:58] Analytics, Analytics-Kanban, Patch-For-Review: Add new mediatypes to media classification refinery code - https://phabricator.wikimedia.org/T225911 (Nuria) In order for the code to take effect the job needs to be re-started
[19:21:30] Analytics, Analytics-Kanban, Patch-For-Review: API Request for unique devices for all wikipedia families is only showing data up to November 2018 - https://phabricator.wikimedia.org/T229254 (Nuria) Root issue was duplicated records on the dataset that we were loading in cassandra. The SQL has been c...
[19:30:35] PROBLEM - Throughput of EventLogging EventError events on icinga1001 is CRITICAL: 44.01 ge 30 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[19:37:40] mforns: i think this is TestSearchSatisfaction2 cc ebernhar|lunch dcausse
[19:41:57] Analytics, Discovery: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 (Nuria)
[19:42:11] Analytics, Discovery: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 (Nuria) a:EBernhardson
[19:48:13] RECOVERY - Throughput of EventLogging EventError events on icinga1001 is OK: (C)30 ge (W)20 ge 16.15 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[19:50:53] Analytics, Discovery: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 (Nuria) {F29932706}
[20:14:37] Analytics, Discovery, Discovery-Search (Current work), Patch-For-Review: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 (EBernhardson) p:Triage→High
[20:18:51] Analytics, Multimedia, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Milimetric) @Ramsey-WMF: we shouldn't make granular referer information public but you can always access the raw data. Talk to Product Analytics or jump on the Hado...
[20:34:45] nuria, thanks for looking into the alarm
[20:35:04] mforns: np, i think ebernhardson's got it, it seemed to subside soon after
[20:56:11] nuria: it subsided because the train was rolled back
[20:56:26] (not for this bug though)
[21:08:45] (CR) EBernhardson: swift-upload.py to handle upload and event emitting (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: Ottomata)
[21:24:20] Analytics, Analytics-Kanban: Page creation data stream died June 6 - https://phabricator.wikimedia.org/T228188 (Milimetric) FYI: this is already possible in the API, it's just not implemented in the UI. For example all content pages created by anonymous users last month on Estonian Wikipedia: https:...
[21:40:21] PROBLEM - Throughput of EventLogging EventError events on icinga1001 is CRITICAL: 37.47 ge 30 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[21:50:36] Analytics, Analytics-Kanban: Page creation data stream died June 6 - https://phabricator.wikimedia.org/T228188 (Nuria) Also, one side question for @kaldari: should we decommission these dashboards and make them redirect to wikistats?
[21:52:24] Analytics, Analytics-Kanban: Page creation data stream died June 6 - https://phabricator.wikimedia.org/T228188 (kaldari) Probably at some point, yes. We should take screenshots of all the views, though, and upload them to Commons so the info that isn't available through wikistats isn't lost entirely.
[21:54:12] Analytics, Discovery, Discovery-Search (Current work), Patch-For-Review: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 (Nuria) Errors are big again: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=1564678406935&to=1564696406935&var-sc...
[21:57:37] ebernhardson: errors are high again on the schema for EL, did we deploy?
[22:03:23] nuria: possibly, i wasn't watching
[22:03:45] ebernhardson: if so i think events are not validating: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=now-5h&to=now&var-schema=TestSearchSatisfaction2
[22:03:47] looks like yes, 40 min ago ... i'll just ship the fix
[22:03:53] i have a fix but it needs reviews and swat, etc.
[22:04:44] ebernhardson: ok, let me know how i can help
[22:13:03] elukey: https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
[22:13:06] :o
[22:13:07] :)
[22:26:48] while i generally support the idea, roll your own cluster-metadata-updates is a bit error prone ;)
[22:26:57] also the eventlogging errors should be tapering off now
[22:36:20] RECOVERY - Throughput of EventLogging EventError events on icinga1001 is OK: (C)30 ge (W)20 ge 19.8 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[22:44:49] ebernhardson: would the above deployment and fix have impacted data in Superset? I received an email about errors in superset (500 errors on graphs that include filters on derived user-agent info) when using the pageviews_daily data source, and was about to hop in here when I reloaded and it was suddenly fixed
[22:47:45] (CR) EBernhardson: swift-upload.py to handle upload and event emitting (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/525435 (https://phabricator.wikimedia.org/T227896) (owner: Ottomata)
[22:48:33] kzeta: no, or at least the only table it can impact in superset is test_search_satisfaction_hourly
[22:49:47] oddly though ... the error rate didn't drop back as far as expected :S i still have to look into it more.
[22:50:05] hmm, ok, back to digging into what's going on - thanks!
[22:58:13] ebernhardson: here to help if help is needed
[22:58:56] ebernhardson: ya, still half of events are not valid
[22:59:51] nuria: i know why and have a fix up, although i'm not sure why half ...
[22:59:59] well, actually i probably do know why [23:00:06] ebernhardson: let me look at error stream pone sec [23:01:05] its complaining that '10' isn't a number (because it's a string) [23:01:16] because we put a number in localStorage, but localStorage only holds strings so it gave back a string [23:01:22] ebernhardson: k [23:01:32] it's just poorly tested frontend stuff :( [23:02:00] ebernhardson: it can be changed in the schema and bump up version if needed be, or is it "truly" a number? [23:02:23] i wrote browsertests for WikimediaEvents at one point ... butit has to use some custom hacks to access the eventlogging data so it doesn't run in CI [23:02:25] ebernhardson: cause the value i see repeated is '8' [23:02:43] nuria: right, currently all events count as 8 events [23:03:07] basically its reporting the sampling rate, so things can be sample-rate corrected downstream [23:04:55] ebernhardson: i see, then up to you if you want to correct schema or code [23:10:29] shipped new js to users, should hopefully see this all subside in the next 10 min then i have to figure out how to correct and reprocess the events.. [23:11:04] well, correcting them is quite easy with a few lines of python or whatever decode/fix/reencode. but i have to figure out where to send them :) [23:18:13] ebernhardson: i can help you with reprocess events [23:18:24] ebernhardson: we have magics [23:20:31] ebernhardson: * i think* you might have permits to do this but i will feel better if we pair up: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Backfilling#Backfilling_a_kafka_eventlogging_%3CSchema%3E_topic [23:20:37] ebernhardson: just did same thing recently [23:21:23] reading [23:22:19] looks straight forward. I wasn't sure camus would put the events in the right hour if they came late [23:22:51] error rates are back to normal [23:23:04] ebernhardson: it is should work that way cause we always need to be able to reprocess [23:25:00] ebernhardson: ok, thanks for fix and do let me know if you decide to reprocess events , if done soon after the refine process will catch it [23:25:48] if done some days after we need to run an special refine and that happens on different machines that you probably do not have access to but it is real easy to do, any of us can do it [23:26:08] nuria: as for access, i don't think i can login to the eventlogging machines to run eventlogging-processor [23:26:19] although that looks like a script that simply pipes data around [23:26:30] oh, this wants me to clone it. ok [23:27:40] ebernhardson: ah ya, i think we were able to do this from stats machines but maybe now it needs you to be logged in to el machine to do it, if so just correct events, add it to ticket and we can do it [23:59:38] ebernhardson: if you look at volume of events i think it has not quite recovered yet: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=now-5h&to=now&var-schema=TestSearchSatisfaction2
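The "few lines of python or whatever decode/fix/reencode" ebernhardson mentions could look roughly like the sketch below: read raw JSON events, cast the stringified sampling-rate field (the '8'/'10' values that failed validation) back to an integer, and re-emit the corrected events for the backfilling procedure linked above. This is only an illustration under assumptions: SAMPLE_RATE_FIELD is a hypothetical name for whatever TestSearchSatisfaction2 actually calls that field, the events are assumed to be EventLogging capsules with the schema data nested under 'event', and producing the output back into kafka is left to the documented backfilling steps.

    import json
    import sys

    # Hypothetical name for the schema field carrying the sampling rate;
    # substitute whatever the TestSearchSatisfaction2 schema actually uses.
    SAMPLE_RATE_FIELD = 'sampleMultiplier'

    def fix_event(raw_line):
        """Decode one JSON event, cast a stringified sampling rate (e.g. '8')
        back to an int, and re-encode the event."""
        capsule = json.loads(raw_line)
        data = capsule.get('event', {})
        value = data.get(SAMPLE_RATE_FIELD)
        if isinstance(value, str) and value.isdigit():
            data[SAMPLE_RATE_FIELD] = int(value)
        return json.dumps(capsule)

    if __name__ == '__main__':
        # Read bad events on stdin, write corrected ones on stdout; feeding
        # the output back into the pipeline follows the backfilling doc.
        for line in sys.stdin:
            line = line.strip()
            if line:
                print(fix_event(line))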