[02:18:30] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:39:54] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:03:19] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (elukey) As FYI I'd need to advertise the host maintenance a couple of days in advance to users, so happy to work with Chris/John in any time...
[07:18:39] gooood morning
[07:27:24] Good morning!
[08:01:13] wow I thought we had fewer 1g hadoop workers, but analytics[1058-1075].
[08:01:24] so 18 nodes
[08:01:49] that is a lot, I think it makes sense to use labels after we move to bigtop
[08:01:56] Indeed elukey - I thought most of our newer nodes were 10g
[08:02:06] yes those are the older ones
[08:02:30] we'll add 24 more with 10g, but the above are not yet OOW
[08:02:41] Labels can't hurt - I'm not sure of how we'll use them yet though :)
[08:03:16] elukey: is the plan about rack-space the one we discussed last week (move the 1g nodes out of 10g racks)?
[08:03:25] my idea is to have the possibility to run our jobs only on say 10g nodes, especially the ones requiring shuffling etc..
[08:04:08] joal: yes we have some hosts to move around, I am working with dcops in https://phabricator.wikimedia.org/T267065
[08:04:43] elukey: super thanks
[08:05:35] there are some 1g hadoop worker nodes in 10g-capable racks to move around, to free space
[08:05:43] up
[08:08:25] Analytics: Check home/HDFS leftovers of niedzielski - https://phabricator.wikimedia.org/T267515 (MoritzMuehlenhoff)
[08:11:27] Analytics: Check home/HDFS leftovers of niedzielski - https://phabricator.wikimedia.org/T267515 (elukey) Open→Resolved a:elukey ` ====== stat1004 ====== total 0 ls: cannot access '/var/userarchive/niedzielski.tar.bz2': No such file or directory ====== stat1005 ====== total 0 ls: cannot access '/...
[08:13:02] joal: also we have https://phabricator.wikimedia.org/T267392 for the two nodes that are stuck in booting :(
[08:13:12] Right :(
[08:13:43] elukey: I'd the good ol'way of kicking them, but I think that's not modern in any way
[08:18:19] joal: already tried all kicks that I could think of, they don't move
[08:18:46] I suspect that they are dead, the raid controller seems to have issues
[08:19:07] the thing that makes me smile is that a reboot triggered it, they were running fine before
[08:19:59] Meh - https://media.giphy.com/media/xjqNH3Bml1gTC/source.gif
[08:20:29] old nodes are all like Schrödinger's cat, a reboot can make them dead or not :D
[08:20:49] :D
[09:03:16] Analytics, Data-release, Privacy Engineering, Research, Privacy: Evaluate a differentially private solution to release wikipedia's project-title-country data - https://phabricator.wikimedia.org/T267283 (JAllemandou) I have issues picturing how ε/δ changes will affect us :) If feasible, I'd l...
[09:11:54] Analytics, Analytics-Kanban: Set up automatic deletion/sanitization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (ayounsi) >>!
In T231339#6609214, @mforns wrote: > I think that the following fields are privacy-sensitive, and that they should not be kept. > ` > ip_dst...
[09:50:46] Analytics: Check home/HDFS leftovers of rodolfovalentim - https://phabricator.wikimedia.org/T266467 (elukey) Open→Resolved a:elukey All dropped!
[11:36:40] * elukey lunch!
[12:00:29] Hello everyone!
[12:00:37] What a week that was!
[12:12:11] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Data, and 2 others: Set up an instance of EventStreams in beta that will allow for consuming any stream - https://phabricator.wikimedia.org/T253069 (MSantos)
[13:11:33] helloooo team
[13:16:46] (PS1) Fdans: Add historical_raw job to load data from pagecounts_raw [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777)
[13:23:23] fdans! how is the everything?
[13:24:23] klausman: had a lot of champagne with my neighbors and avoided the subject with my in-laws! pretty good weekend :)
[13:26:58] Must be bizarre to have the move and the elections timed the way it worked out for you
[13:27:08] Plus, on top, COVID
[13:30:45] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (JAllemandou) Using unique-actors as main threshold metrics seems a nice idea. As @Milimetric was pointing the other day,...
[13:51:53] Analytics, Analytics-Kanban, Event-Platform, Operations, Traffic: Don't cache schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (Ottomata)
[13:57:14] Hi folks - ottomata good morning - I have a question for you about alarms from this weekend please :)
[13:57:26] sure!
i'm still checking email
[13:57:38] ottomata: I can wait for your emails to be checked :)
[13:57:54] let me look at 'notices' label first :p
[13:58:09] ok joal ask me
[13:58:13] nothing new for me there :)
[13:58:28] your question will not be a surprise to me at least
[13:58:35] I wondered if you expected me to take a look at any of the failures, given it all seems related to the caches issue
[13:58:47] ottomata: --^
[13:59:08] I've looked at and resolved all of the Refine failure reports
[13:59:17] esp. related to NewcomerTask
[13:59:46] The produce_canary_events is CRITICAL is still a problem but we don't know what to do about it yet
[13:59:51] but it isn't causing any functional issue
[13:59:58] since we are now producing canary events 4 times an hour
[14:00:01] instead of 1
[14:00:31] joal: those are the only alerts I've looked at, I don't think there are others related to refine events etc., right?
[14:00:53] Ok I had followed that one - About the caching of schemas - the problem comes from the internal cache of event-gate, right?
[14:01:04] I don't think so ottomata - thanks for letting me know
[14:01:18] joal: from frontend varnish caches yes
[14:01:29] since the canary job was configured to go through schema.wikimedia.org
[14:01:40] ottomata: I didn't take any action this weekend as you were already on the thing, but I still prefer to confirm :)
[14:01:48] ok, cool, ya all is well
[14:01:51] that stuff was my fault anyway
[14:02:05] no blame - just checks :)
[14:03:04] thanks ottomata!
[14:26:19] (CR) Neil P. Quinn-WMF: "I think it's ready to go as well! I suggested a few small documentation tweaks, but otherwise, it looks very good." (3 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[14:29:34] mforns: yt?
[14:31:25] ottomata: good morning!
One thing that seems interesting from bigtop 1.5 (the upcoming release)
[14:31:28] kafka 0.10.2.2 => 2.4.0
[14:31:41] oh very COOOOL
[14:31:41] kafka too?
[14:31:48] we don't need confluent's .deb anymore?
[14:32:20] if the deb is ok etc.. (didn't test it) could be doable yes :)
[14:32:32] same thing for flink, etc..
[14:32:59] the nice thing about bigtop is that they ship various docker images for any os supported, equipped with all the tools to build the packages
[14:33:06] and there is one for "trunk"
[14:33:28] so even if we have new things, we can self-build our new packages and contribute to upstream
[14:33:29] so great
[14:33:32] yep!
[14:33:34] I like it a lot
[14:34:01] there is less space for very custom things, this is the downside
[14:34:46] less space?
[14:35:00] oh you mean like maybe the weird stuff we do with the spark .deb?
[14:35:08] to support multi OS / python versions?
[14:35:56] (PS15) Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[14:36:40] (PS16) Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[14:36:50] (CR) Sbisson: Oozie job for Wikipedia Preview stats (3 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[14:38:55] ottomata: yes exactly, you parsed my italo-english words correctly :D
[14:39:14] :)
[14:51:33] (CR) Joal: "One important performance request in the query, other requests are more about parameter naming and are less important. Thanks!"
(6 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[14:53:27] joal: for your immense joy, this week we'll need to decide next steps for https://phabricator.wikimedia.org/T265971
[14:54:40] * joal hides and pretends not to have read
[14:54:55] ahahahha
[14:55:12] IN THEORY we'd just need to push the backup to hdfs
[14:55:36] and ask dcops for a misc node to refresh
[14:56:08] to add more joy, there is no hdfs client/keytab on thorium (rightfully)
[14:57:16] who would have guessed elukey
[15:03:41] hey teamm
[15:04:13] ottomata: just joined, 'sup?
[15:04:44] wanted to brain bounce real quick on the eventstreams stream config ideas we had, and also ask if you minded if i started to work on that part.
[15:04:53] if that's ok, then you could focus on the GUI part using the openapi spec?
[15:05:06] dunno if you already started that though
[15:06:28] no, didn't start yet, and no, I don't mind! I can do something else mep
[15:06:56] wanna batcave?
[15:06:59] ok cool, ya
[15:07:42] ok, omw
[15:14:23] * klausman running an errand, be back for the standup
[16:02:14] fdans standup?
[16:02:47] razzi standup?
[16:03:05] mforns: follow Andrew talking! :D
[16:03:19] what?
[16:03:40] I am joking, I was saying "pay attention to Andrew's turn!" :D
[16:03:53] (you were writing in here)
[16:04:00] (nevermind bad joke)
[16:04:11] (I send some wikilove anyway mforns )
[16:04:18] (<3)
[16:04:22] xDDD
[16:08:34] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Infrastructure-Data, and 4 others: Session Length Metric.
Web implementation - https://phabricator.wikimedia.org/T248987 (sdkim)
[16:08:39] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Infrastructure-Data: session_tick stream configs - https://phabricator.wikimedia.org/T256311 (sdkim) Open→Resolved a:sdkim
[16:08:52] Analytics: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (Milimetric)
[16:12:56] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Infrastructure-Data, and 4 others: Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (jlinehan)
[16:22:49] elukey: FYI this is in the 2.7.0 Kafka version that is about to be released
[16:22:50] https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key
[16:25:32] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:30:20] Analytics: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (Milimetric)
[16:31:26] Analytics-Radar, Product-Analytics: Consider recalculating revert rate - https://phabricator.wikimedia.org/T267053 (fdans)
[16:31:33] Analytics: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (Milimetric)
[16:31:57] Analytics-Radar, Product-Analytics: Content for analytics.wikimedia.org - https://phabricator.wikimedia.org/T267254 (fdans) cc @mforns
[16:37:00] Analytics: [Data quality stats] Add dsaez to receive traffic anomaly alarms - https://phabricator.wikimedia.org/T267356 (fdans) p:Triage→High
[16:37:30] Analytics-Radar, Operations, ops-eqiad: analytics1046/analytics1057 stuck in booting - https://phabricator.wikimedia.org/T267392 (fdans)
[16:39:01] Analytics, Analytics-Kanban,
Analytics-Wikistats: Static vertical footer position ignores list length, overlaps with list, makes lists on Wikistats unreadable - https://phabricator.wikimedia.org/T267467 (fdans) a:fdans
[16:42:50] Analytics, Analytics-Kanban: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (fdans) p:Triage→High a:fdans
[16:45:34] ottomata: I reply with another interesting thing https://issues.apache.org/jira/browse/BIGTOP-3225 :)
[16:46:54] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:48:12] !log reboot an-coord1002 to see if it updates kernel cpu instructions
[17:48:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:49:32] Analytics, Analytics-Kanban, Event-Platform, Operations, Traffic: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (Ottomata) p:High→Triage
[17:49:59] Analytics, Analytics-Kanban, Event-Platform, Operations, Traffic: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (Ottomata) p:Triage→High
[17:50:27] Analytics, Analytics-Kanban, Event-Platform, Operations: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (Ottomata)
[17:51:30] elukey: cool!
[17:51:33] what a future!
[17:52:41] :D
[18:33:01] !log manually start logrotate.timer apt.timer etc..
on an-launcher1002 - stopped since the last time that I have disabled timers
[18:33:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:33:21] after some digging with Razzi, i learned a big lesson
[18:33:56] when I do "systemctl stop *.timer" to disable timers on launcher1002, I effectively disable also the system ones, that are not re-enabled by puppet
[18:34:03] so logrotate didn't work because of this reason
[18:34:07] * elukey cries in a corner
[18:34:49] !log drop hdfs-balancer multi-gb log file from launcher1002
[18:34:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:34:50] it's ok elukey - you figured it out! And no services crashed
[18:35:00] razzi: I feel so stupid
[18:35:02] In the nick of time :)
[18:35:54] anyway, nice to debug in two a problem :)
[18:45:35] logging off folks, ttl! o/
[19:53:09] ottomata: after some thinking, I believe having EventStreams UI on a separate repo will be better... you OK?
[19:53:55] there will be quite a lot of files, that otherwise will pollute EventStreams repo, and make it harder to understand
[19:54:16] probably, independent deployments will also be desirable, no?
[19:54:37] Also, makes it easier to test
[20:03:22] mforns: +1
[20:03:23] sounds good to me1
[20:03:25] !
[20:03:26] k
[20:03:30] hmmm although
[20:03:42] that does mean that we will have to explicitly deploy it separately
[20:03:54] will need a webserver, etc.
[20:04:42] ottomata: couldn't we have it be a git submodule of eventstreams, and let eventstreams point to the build main.js?
[20:05:58] not sure if that's good though..
[20:06:42] ok, let me dig deeper
[20:22:45] Analytics, Product-Infrastructure-Data, Wikimedia-Logstash, observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (CDanis) Not sure I can capture the whole discussion but I'll try: 1. For Mediawiki client error logging, it w...
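Editor's note on the 18:33 timer lesson: the log reports that "systemctl stop *.timer" also stopped system timers (logrotate.timer, apt.timer "etc..") that puppet does not re-enable. A minimal sketch of one way to avoid that is to build an explicit list of job timers and leave the system ones out. Everything below is an assumption for illustration: the contents of SYSTEM_TIMERS (the log names only two and then says "etc.."), the unit name produce_canary_events.timer, and the helper names timers_to_stop and stop_command are hypothetical, not something the team is known to use.

```python
import fnmatch

# System timers the log says were stopped by accident. The log lists only
# these two and then "etc..", so the real set on a host is an assumption.
SYSTEM_TIMERS = {"logrotate.timer", "apt.timer"}


def timers_to_stop(all_timers, job_patterns):
    """Return the timers matching job_patterns, never a system timer."""
    selected = []
    for unit in all_timers:
        if unit in SYSTEM_TIMERS:
            # Skip system timers even if a broad pattern would match them.
            continue
        if any(fnmatch.fnmatch(unit, pattern) for pattern in job_patterns):
            selected.append(unit)
    return sorted(selected)


def stop_command(units):
    """Build an explicit 'systemctl stop' argv with named units, no glob."""
    return ["systemctl", "stop", *units] if units else []
```

Calling stop_command(timers_to_stop(units, ["produce_*.timer"])) with a list of unit names yields an argument list that names each timer explicitly, so a broad pattern can no longer reach logrotate.timer or apt.timer. The sketch only builds the command; it does not run it.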
[21:00:13] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) Thanks for making the table @lexnasser! A few thoughts below: === Total Pageviews vs. Unique Pageviews I checked...
[21:13:47] Analytics, Event-Platform, Product-Infrastructure-Data: Client-side error logging should use ECS fields when possible - https://phabricator.wikimedia.org/T267602 (Ottomata)
[21:15:44] Analytics, Event-Platform, Product-Infrastructure-Data: Client-side error logging should use ECS fields when possible - https://phabricator.wikimedia.org/T267602 (Ottomata) I think the `http` field might also be affected, and that one will be a bit trickier to reconcile. Our Event Schema: https://sc...
[21:16:17] mforns: sorry was in meetings n stuff!
[21:16:23] np!
[21:16:45] uhhh hm yeah i guess a submodule could work....
[21:16:46] hmmm
[21:16:53] hm
[21:16:59] i guess, hm
[21:17:40] ottomata: I found an example of app that uses node with express and vue, and I think we can put them together
[21:17:53] in eventstreams?
[21:17:55] I'd put all ui code within a ui folder, and that should work
[21:17:56] yes
[21:18:03] ah ok cool
[21:18:07] i think that will make maintenance simpler
[21:18:59] if there are problems at some point, extracting that to a separate repo will be easy
[21:19:04] ok
[21:19:05] cool
[21:19:24] ok, I'll let you know
[21:22:01] (PS17) Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[21:23:32] (CR) Sbisson: "Hi Joal, thanks for your review. I've made some changes but I also have some questions. And I want Neil to chime in at least in one place."
(5 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[21:33:27] (PS1) Fdans: Consider locked dimensions when exploding keys for AQS url [analytics/wikistats2] - https://gerrit.wikimedia.org/r/640252 (https://phabricator.wikimedia.org/T265322)
[21:34:14] (PS2) Fdans: Consider locked dimensions when exploding keys for AQS url [analytics/wikistats2] - https://gerrit.wikimedia.org/r/640252 (https://phabricator.wikimedia.org/T265322)
[21:43:17] (CR) Milimetric: [C: +2] Consider locked dimensions when exploding keys for AQS url [analytics/wikistats2] - https://gerrit.wikimedia.org/r/640252 (https://phabricator.wikimedia.org/T265322) (owner: Fdans)
[21:45:11] (Merged) jenkins-bot: Consider locked dimensions when exploding keys for AQS url [analytics/wikistats2] - https://gerrit.wikimedia.org/r/640252 (https://phabricator.wikimedia.org/T265322) (owner: Fdans)
[22:12:45] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) And regarding unique pageview threshold, I threw together this table of # of pages (and unique languages/projects)...
[22:30:09] byeeeee
[23:39:51] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (lexnasser) Thanks for everyone's input! Copying Isaac's style: # Total Pageviews vs. Unique Pageviews @mforns @JAllema...