[02:55:26] <wikibugs>	 Analytics, Analytics-Kanban, Chinese-Sites, Pageviews-Anomaly: Unusual high page view on Chinese Wikipedia - https://phabricator.wikimedia.org/T269065 (Shizhao)
[02:56:41] <wikibugs>	 Analytics, Analytics-Kanban, Chinese-Sites, Pageviews-Anomaly: Unusual high page view on Chinese Wikipedia - https://phabricator.wikimedia.org/T269065 (Shizhao) duplicate→Open
[02:57:29] <wikibugs>	 Analytics, Analytics-Kanban, Chinese-Sites, Pageviews-Anomaly: Unusual high page view on Chinese Wikipedia - https://phabricator.wikimedia.org/T269065 (Shizhao)
[06:09:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:25:40] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:41:22] <elukey>	 good morning
[06:41:37] <elukey>	 this nodemanager down is a little weird, can't find much from the logs
[06:56:00] <wikibugs>	 Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (elukey) @razzi very interesting use case, I am going to add in here what I usually do and we can translate this into a procedure on wikitech if you want. In this case, if you execute `dmesg -T` on t...
[06:57:56] <wikibugs>	 Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (elukey) Correction - in this case the umount command failed, telling me that the target was busy (so either yarn or hdfs daemons were reading from it). I had to stop both to umount :)
[06:58:15] <elukey>	 !log restart hdfs/yarn daemons on an-worker1097 to exclude a failed disk
[06:58:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:32:43] <elukey>	 !log restart hadoop daemons on an-worker1099 after reconfiguring a new disk
[07:32:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:35:10] <wikibugs>	 Analytics-Radar, SRE, ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (elukey) @razzi today I remembered this task by chance, I had to follow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk to add the new d...
[07:46:36] <joal>	 Good morning :)
[07:48:21] <joal>	 Nice summary of the procedure to keep the host up and running elukey  ---^
[07:48:56] <elukey>	 :)
[07:53:39] <joal>	 elukey: dumb question - There is an email alert for analytics1061 NodeManager - I assume this is unrelated to an-worker1097?
[07:54:04] <elukey>	 joal: yes it is unrelated, not sure what happened in there, didn't find much in the logs, but I'll re-check
[07:54:22] <joal>	 elukey: did you bring it back, or did it fix itself?
[07:54:55] <elukey>	 the latter
[07:55:00] <joal>	 elukey: I'm curiously investigating hadoop alerts, as the system is brand new
[07:55:03] <joal>	 ack
[07:59:39] <elukey>	 it shut down for some reason, there may be something buried in the logs that I don't see
[07:59:54] <joal>	 Weird
[08:00:25] <elukey>	 elukey@analytics1061:~$ grep SHUTDOWN /var/log/hadoop-yarn/yarn-yarn-nodemanager-analytics1061.log
[08:00:28] <elukey>	 2021-02-16 06:05:07,873 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG: 
[08:00:31] <elukey>	 SHUTDOWN_MSG: Shutting down NodeManager at analytics1061/10.64.21.113
[08:00:45] <joal>	 well, ok :)
[08:00:55] <joal>	 it indeed shut down :)
[08:02:20] <elukey>	 the only relevant log seems to be
[08:02:21] <elukey>	 org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
[08:02:31] <elukey>	 but I have never seen it
[08:03:13] <joal>	 hm
[08:05:42] <elukey>	 Feb 16 06:05:09 analytics1061 systemd[1]: hadoop-yarn-nodemanager.service: Main process exited, code=exited, status=255/n/a
[08:06:26] <elukey>	 Feb 16 00:18:22 analytics1061 kernel: [32804012.594207] cgroup: fork rejected by pids controller in /system.slice/hadoop-yarn-nodemanager.service
[08:06:53] <elukey>	 but this was hours before
[08:07:04] <joal>	 :(
[08:11:56] <joal>	 elukey: I was thinking about the corrupt block reports alerts - they seem to show up more often since we separated ports 8020 and 8040
[08:12:49] <joal>	 I wonder if it could be related
[08:13:34] <elukey>	 joal: not sure but I think we have too few datapoints to judge, yesterday I had to restart the namenodes and it seemed to match
[08:13:49] <elukey>	 moreover it is only the jmx metrics, fsck doesn't report anything weird
[08:14:02] <joal>	 right - true - ok I'm gonna calm down :)
[08:16:50] <elukey>	 it is something to keep in mind since block reports are flowing through the service port
[08:25:57] <wikibugs>	 Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (MoritzMuehlenhoff) p:Triage→Medium
[08:41:39] <elukey>	 2021-02-16 06:05:01,675 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
[08:41:42] <elukey>	 java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717)
[08:41:45] <elukey>	 joal: --^
[08:41:48] <elukey>	 this makes more sense
[08:41:51] <joal>	 Ah!
[08:41:53] <joal>	 OOM
[08:43:00] <elukey>	 I am going to create a little guide for the /Alerts page, grepping for OOM is a good way to avoid reading, I should've thought of it :D
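A minimal sketch of the "grep for OOM" shortcut discussed above, written in Python so it can be reused in a script. The pattern list is an assumption for illustration; the only strings taken from the log are `FATAL` and `java.lang.OutOfMemoryError`, and `find_fatal_lines` is a hypothetical helper name.

```python
import re

# Assumed patterns: only "FATAL" and "java.lang.OutOfMemoryError" come from
# the log excerpts pasted in the channel; extend as needed.
FATAL_PATTERNS = re.compile(r"\bFATAL\b|java\.lang\.OutOfMemoryError")


def find_fatal_lines(lines):
    """Return (line_number, text) pairs for lines that look like fatal errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if FATAL_PATTERNS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Sample input assembled from the lines quoted in the conversation.
sample_log = [
    "2021-02-16 06:05:01,675 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread",
    "java.lang.OutOfMemoryError: unable to create new native thread",
    "2021-02-16 06:05:07,873 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:",
]

for number, text in find_fatal_lines(sample_log):
    print(number, text)
```

In practice the input would be an open log file rather than a list, since the function only iterates over lines.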
[08:43:57] <wikibugs>	 Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), WMDE-TechWish (Sprint-2021-02-03): Compensate for sampling - https://phabricator.wikimedia.org/T273454 (awight) Now that the aggregation is deployed, we need to backfill by purging the following data since Jan 1...
[08:44:03] * elukey bbiab
[08:52:24] <wikibugs>	 Analytics: The most visited wiki in Uzbekistan on Feb 14th at 6am UTC is mediawiki.org - https://phabricator.wikimedia.org/T274823 (JAllemandou) I have done some checking:  - MaxMind database update was on Feb 9th and archived files got deleted on Feb 11th - This seems unrelated.  - There clearly seem to hav...
[09:36:39] <elukey>	 so I entered a big rabbit hole for the nodemanager :D
[09:36:48] <elukey>	 going to get a coffee and open a task
[09:36:52] <joal>	  /o\ :)
[09:36:54] <elukey>	 (and also update the docs)
[09:37:30] <elukey>	 joal: basically I think that we have different cgroup limits for number of tasks/processes that a nodemanager can have under its umbrella
[09:37:37] <elukey>	 ranging from 4k to 40k
[09:37:59] <elukey>	 analytics1061 has ~4k, and we may have hit the ceiling with the failure
[09:38:09] <joal>	 MEH?
[09:38:41] <elukey>	 4k tasks is not that many if we consider that a single jvm yarn "container" can have several subprocesses
[09:38:49] <elukey>	 bash -> jvm -> whatever -> etc..
[09:38:55] <elukey>	 and finally the real jvm that runs
[09:39:13] <elukey>	 if the host is busy with a lot of small tasks it might hit a 4k threshold easily
[09:39:24] <elukey>	 (at least this is my impression)
[09:39:54] <elukey>	 one thing that we need to do is to roll reboot the hadoop workers for kernel upgrades
[09:40:14] <elukey>	 that hopefully should bring a little bit more consistency
[09:40:26] <joal>	 ok
[09:40:42] <joal>	 it feels bizarre that the number is inconsistent between workers
[09:41:15] <elukey>	 the systemd default varies a lot, and it doesn't seem to be configured in the systemd config files, so I bet it comes from the kernels
[09:42:10] <joal>	 ack
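A rough sketch of how the per-host limits could be compared, assuming the value is collected on each worker with `systemctl show -p TasksMax hadoop-yarn-nodemanager.service`, which prints a `TasksMax=<value>` line. The host names and numbers below are made up for illustration, and both helper functions are hypothetical.

```python
def parse_tasks_max(systemctl_output):
    """Parse a 'TasksMax=<value>' line; return an int, or None for 'infinity'."""
    for line in systemctl_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if key == "TasksMax" and sep:
            return None if value == "infinity" else int(value)
    raise ValueError("no TasksMax line found")


def group_hosts_by_limit(outputs_by_host):
    """Map each distinct TasksMax value to the hosts reporting it."""
    groups = {}
    for host, output in outputs_by_host.items():
        groups.setdefault(parse_tasks_max(output), []).append(host)
    return groups


# Made-up sample data: not the real values from the cluster.
sample = {
    "host-a": "TasksMax=4915\n",
    "host-b": "TasksMax=4915\n",
    "host-c": "TasksMax=infinity\n",
}

for limit, hosts in group_hosts_by_limit(sample).items():
    print(limit, hosts)
```

More than one group in the result means the limit is not uniform across hosts, which is the inconsistency described above.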
[09:44:26] <joal>	 elukey: as for me I fell into the '中華電信MOD' rabbit hole
[09:44:59] <joal>	 elukey: https://phabricator.wikimedia.org/T274605
[09:46:23] <elukey>	 iiiinteresting!
[09:46:35] <joal>	 I'm gonna comment
[09:51:53] <wikibugs>	 Analytics, Product-Infrastructure-Team-Backlog, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (JAllemandou) Hi @Shizhao  and @cooltey, thanks for reporting. I have done some deeper analysis and my finding...
[09:52:02] <joal>	 elukey: if you're interested --^
[09:52:10] <wikibugs>	 Analytics, Product-Infrastructure-Team-Backlog, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (JAllemandou)
[09:52:46] <joal>	 elukey: https://en.wikipedia.org/wiki/CHT_MOD
[09:53:36] <joal>	 and while I'm here talking to you elukey - I looked at ranger/sentry a bit yesterday
[09:54:14] <joal>	 elukey: and look what I found: https://issues.apache.org/jira/browse/BIGTOP-3471?focusedCommentId=17273216&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17273216
[09:57:06] <elukey>	 joal: It would be good to get Ranger packaged, but I fear that configuring/testing/etc.. it is a looong process
[09:58:07] <joal>	 elukey: looking at config, it seems that we could get it set up without too much complication (also, latest version 2.1 has a plugin for presto :)
[09:59:43] <elukey>	 joal: sure, it is something that we could look at in the future
[10:00:06] <joal>	 elukey: I'm afraid we'll have to :)
[10:00:34] <elukey>	 after securing druid/kafka/etc..
[10:00:56] <elukey>	 there is some backlog of things to do, but they got deprioritized
[10:01:14] <joal>	 yeah - ranger can take care of kafka
[10:01:19] <joal>	 but not druid
[10:02:49] <elukey>	 joal: in my view we'd need to find a solution applicable to all kafka clusters in prod handling PII, I don't see Ranger as a solution for Kafka
[10:03:03] <joal>	 hm
[10:03:08] <elukey>	 maybe I am wrong, it could be simple
[10:03:49] <elukey>	 but there is some groundwork about adding fences to kafka first (like making sure that bypassing Ranger and hitting kafka directly can't be done unauthenticated/unencrypted/etc..)
[10:04:53] <joal>	 For kafka it seems that the RBAC is on enterprise-grade only
[10:05:51] <joal>	 of course elukey, encryption first, and even possibly kerberos
[10:13:45] <wikibugs>	 Analytics: Inconsistent systemd default task max on hadoop workers - https://phabricator.wikimedia.org/T274860 (elukey)
[10:21:30] <elukey>	 this was the rabbit hole --^
[10:34:38] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (GoranSMilovanovic)
[10:34:41] <elukey>	 joal: I am rolling out the same 4.19 kernel on all hadoop workers/masters
[10:34:55] <elukey>	 I am planning to roll reboot the test cluster, then the main cluster if you are ok
[10:34:59] <elukey>	 to bring a little consistency
[10:34:59] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (GoranSMilovanovic) p:Triage→High
[10:35:06] <joal>	 +1 elukey 
[10:35:13] <elukey>	 bueno
[10:35:30] <elukey>	 it will be the same kernel as buster (already running on some workers), so a good test anyway
[10:53:03] <wikibugs>	 Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (JAllemandou)
[10:56:44] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (elukey) Hi Goran! We have recently introduced a stricter umask policy for HDFS, namely new files/dirs don't have the other permission bits set by defaul...
[11:13:51] <wikibugs>	 Analytics, Product-Infrastructure-Team-Backlog, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (Antigng) >>! In T274605#6832642, @JAllemandou wrote: > Hi @Shizhao  and @cooltey, thanks for reporting. > I h...
[11:29:21] <elukey>	 reboot started, it will take a few hours probably, I am doing it veeery slowly
[11:29:31] <elukey>	 (via cookbook)
[11:30:10] <joal>	 ack
[11:56:44] <awight>	 @a-team, would it make sense for one of us (a dev from WMDE Technical Wishes) to apply for CR+2 and deployment access in some of the analytics repos, to better share the burden?  Or would that be a coordination nightmare, etc.?
[11:58:55] <joal>	 Hi awight - I assume it depends on repos and frequency of deployments
[11:59:48] <awight>	 joal: For sure.  The ones that have been blocky for us are schemas-event-secondary and reportupdater-queries
[12:00:23] <awight>	 My uneducated guess is that we could muck around in those without causing too much damage to other teams?
[12:00:24] <joal>	 About schemas, ottomata should be the one deciding
[12:01:18] <joal>	 about reportupdater queries, I can't recall how reportupdater gets updated in terms of deploys
[12:01:38] <awight>	 All I know is that it runs on a private analytics-runner machine.
[12:01:49] <joal>	 this is for sure awight :)
[12:01:52] <awight>	 We would need machine access in order to read the error logs, for example.
[12:02:03] <awight>	 hehe yeah I have the outsider knowledge.
[12:02:11] <joal>	 :)
[12:02:25] <joal>	 elukey: any perspective on awight's request? --^
[12:02:48] <awight>	 If it's a "maybe", I can file this as a Phab task for further discussion...
[12:03:11] <joal>	 a Phab is always a good idea - even if the answer is no at the end :)
[12:03:24] <awight>	 Excellent, will do!
[12:07:43] <wikibugs>	 Analytics-Radar, Add-Link, Growth-Structured-Tasks, Growth-Team (Current Sprint), Patch-For-Review: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server - https://phabricator.wikimedia.org/T266826 (kostajh) Open→Resolved Pipeline is set...
[12:11:50] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (awight)
[12:19:11] <wikibugs>	 Analytics, GrowthExperiments, Growth-Team (Current Sprint), MW-1.36-notes (1.36.0-wmf.30; 2021-02-09): eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas - https://phabricator.wikimedia.org/T273700 (kostajh) I think all of the items here have been fixed, but wmf....
[12:49:42] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (mforns) I think in general that's the way we should go! Give the teams the capability to test, deploy and manage their jobs independently. We are accumulating more and more data sets,...
[13:19:44] <wikibugs>	 Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), WMDE-TechWish (Sprint-2021-02-03): Compensate for sampling - https://phabricator.wikimedia.org/T273454 (mforns) @awight Re. graphite: I haven't ever dealt with back-filling graphite metrics. I'm not sure they ca...
[13:22:14] <wikibugs>	 Analytics: The most visited wiki in Uzbekistan on Feb 14th at 6am UTC is mediawiki.org - https://phabricator.wikimedia.org/T274823 (mforns) > There clearly seem to have a small number of IPs making most requests for projects having seen a change (en.wikipedia, commons.wikipedia` for instance). Thanks for loo...
[13:22:23] <elukey>	 joal: sorry I was on the phone!
[13:22:32] <joal>	 no prob elukey :)
[13:23:01] <elukey>	 awight: so reportupdater runs on an-launcher1002, which is analytics-only.. what I'd try to do is to publish logs somewhere (logstash?) so people can check
[13:23:40] <elukey>	 or we could have a separate vm that people can use to run their jobs
[13:23:49] <elukey>	 but it seems more of an overhead
[13:23:56] <elukey>	 anyway, I am open to discuss use cases :)
[13:24:21] <elukey>	 we moved jobs in one place to consolidate, stat100x hosts were a mixture of clients/schedulers
[13:24:31] <elukey>	 so we needed to clear out things and start from scratch :D
[13:26:26] <elukey>	 (going to have a quick lunch and I'll be back)
[13:49:58] <elukey>	 back
[14:05:40] <mforns>	 awight: hi! I saw there's some of your work that needs coordination with us? Do you want to have a short meeting? It could be faster than async
[14:25:16] <wikibugs>	 Analytics, Patch-For-Review: Decide to move or not to PrestoSQL/Trino - https://phabricator.wikimedia.org/T266640 (Ottomata) > If you are ok I'd package 0.246-1~wmf1 with the "custom" jar built on deneb, Sure of course!
[14:26:16] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (Milimetric) Big +2 from me for access to reportupdater-queries and any access needed to rerun jobs if needed.  Deployment there is a matter of puppet-sync, and queries are all complet...
[14:26:43] <wikibugs>	 Analytics-Clusters, Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Ottomata) 👏
[14:29:07] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (elukey) I am not happy about the idea of granting access to an-launcher1002, we have important credentials in there (and timers) that only our team should manage. If we want to have a...
[14:37:29] <wikibugs>	 Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 4 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (CBogen) >>! In T263875#6819431, @kzimmerman wrote: > @CBogen can you verify with...
[14:43:13] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (Ottomata) > schemas-event-secondary   Indeed! our team should not be a blocker for schemas/event/secondary.  The only reason we have 2 different schema repos is so that more people ca...
[14:46:28] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (Ottomata) Yes, we need to migrate the schemas and declare the streams.  Basically Steps 1 - 6 in the Migration Plan in {T259163}....
[14:50:22] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (awight) >>! In T274880#6833587, @elukey wrote: > [...] an-launcher1002 [...] we have important credentials  To be clear, I would also be happier using logstash and automatic deploymen...
[14:55:06] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/657362 (https://phabricator.wikimedia.org/T271902) (owner: Svantje Lilienthal)
[14:55:45] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659291 (owner: Awight)
[14:57:13] <wikibugs>	 (PS11) Awight: Update schema with core bucket labels [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[14:58:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update schema with core bucket labels [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[15:02:58] <wikibugs>	 Analytics, Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (Ottomata) p:Triage→High Moving this back to incoming so we can groom as a team (I think I missed the grooming session where it was moved ).  I think thi...
[15:07:10] <wikibugs>	 (PS12) Awight: Update schema with core bucket labels [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[15:08:43] <wikibugs>	 (CR) Awight: [C: +2] Update schema with core bucket labels [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[15:10:41] <wikibugs>	 (PS2) Mforns: Fix case of metric path [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659291 (owner: Awight)
[15:11:11] <wikibugs>	 (CR) Awight: "PS 10: Fixed a typo in one of the edit count buckets." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/656901 (https://phabricator.wikimedia.org/T269986) (owner: WMDE-Fisch)
[15:11:46] <wikibugs>	 Analytics, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (Ottomata) Just added this section to hopefully make the difference more clear: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#WMF_Schema_Repositories
[15:11:48] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] Fix case of metric path [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/659291 (owner: Awight)
[15:18:02] <awight>	 mforns: Thanks for the chat!  I was able to merge the schema.  I'm curious what is responsible for responding to the schema change and deploying to production (also causing the Hadoop schema migration).
[15:18:50] <ottomata>	 awight:  merging it makes it all happen 
[15:19:06] <ottomata>	 puppet will eventually cause the git repo to pull on the schema.wikimedia.org hosts
[15:19:13] <awight>	 ah ty
[15:19:26] <ottomata>	 refine uses the 'latest' schema
[15:19:29] <awight>	 so ~15 minutes on a normal day, good to know!
[15:19:36] <ottomata>	 so it will see the changes when it runs, and evolve the hive table
[15:19:50] <awight>	 neat
[15:20:05] <ottomata>	 this happens whenever there is new data in a new hour that refine hasn't done yet
[15:20:36] <ottomata>	 if your stream has canary_events_enabled (which all analyticsy streams should), there should always be data every hour
[15:20:43] <awight>	 ottomata: I'd be happy to champion migrating all my team's schemas, now that I'm getting experience.
[15:20:50] <ottomata>	 :D
[15:20:59] <awight>	 Yes m-forns was mentioning canary events, I like the idea very much.
[15:22:12] <ottomata>	 hmm awight  if you are inclined, you might be able to do most of the steps of the migration process
[15:22:12] <ottomata>	 https://phabricator.wikimedia.org/T259163
[15:22:29] <ottomata>	 at the very least you could probably create patches to migrate the schemas over
[15:22:43] <ottomata>	 we have to pre-evolve the hive table, and then finalize the upgrade with some puppet changes
[15:22:53] <awight>	 ottomata: :+1: hopefully I can help with the scaling by cleaning up my crumbs and bringing them back to our anthill.
[15:23:01] <ottomata>	 but the other parts are all in the schema repo and in mw-config
[15:23:12] <awight>	 TIL "pre-evolve" ;-)
[15:23:42] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/656210 (https://phabricator.wikimedia.org/T273471) (owner: Awight)
[15:23:43] <ottomata>	 yeah, it's how we handle events from both systems during the middle of the migration
[15:24:21] <awight>	 That sounds like a nasty problem, thanks for making it mostly transparent to me.
[15:27:29] <elukey>	 Added more info in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts, including a brief summary of what each hadoop daemon does (also attached to the related "0 processes running" alerts in puppet)
[15:27:53] <elukey>	 there are a few alerts to cover in puppet to complete the job but the important ones are there
[15:27:57] <elukey>	 please check the docs and let me know :)
[15:39:22] <icinga-wm>	 PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:50:02] <icinga-wm>	 RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:53:16] <ottomata>	 looking ^
[15:56:36] <ottomata>	 huh!  that was caused by the update to the template wizard schema, it looks like produce canary events saw it before eventgate did, probably due to differing state across the different schema.wm.org hosts.
[15:56:47] <ottomata>	 i think we need to stop relying on puppet for schema deploys
[15:57:04] <ottomata>	 probably to use scap?  
[15:57:06] <ottomata>	 will make a ticket
[16:03:52] <wikibugs>	 Analytics, Event-Platform, Release-Engineering-Team: Stop using puppet + git pull for auto deployment of schema repos - https://phabricator.wikimedia.org/T274901 (Ottomata)
[16:05:20] <wikibugs>	 Analytics, Event-Platform, Release-Engineering-Team: Stop using puppet + git pull for auto deployment of schema repos - https://phabricator.wikimedia.org/T274901 (Ottomata) I tagged RelEng here for advice.    I want a merge in gerrit to trigger a deployment of repository, basically just a git pull on...
[16:06:25] <awight>	 Is it possible that the new hive makes "select 'foo' as date" fail, because `date` is now a reserved word?
[16:06:44] <ottomata>	 awight: it is possible, what if you do as `date` ?
[16:06:48] <ottomata>	 wrapped in ` `
[16:06:50] <ottomata>	 `
[16:06:52] <ottomata>	 1
[16:06:53] <ottomata>	 1
[16:06:56] <ottomata>	 aghhhh formatting!
[16:07:04] <ottomata>	 wrapped in backticks
[16:07:44] <awight>	 +1 that works.  But if this changed recently, it looks like many hive scripts in reportupdater-queries must suddenly be failing?
[16:09:05] <ottomata>	 awight: yeah it is possible, we are slowly dealing with fallout from the upgrade last week
[16:09:16] <ottomata>	 If you've found something
[16:09:20] <ottomata>	 could you comment on https://phabricator.wikimedia.org/T274322 and add it?
[16:09:22] <awight>	 hehe btw don't try `date` in those scripts, you will get the shell substitution.
[16:09:36] <ottomata>	 oh!  haha
[16:09:38] <ottomata>	 yeah maybe escaping
[16:09:44] <awight>	 Sure, I'll push a patch
[16:09:44] <ottomata>	 or what if in single quotes?  sheesh
[16:09:45] <ottomata>	 i dunno
[16:09:50] <ottomata>	 thank you
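A small sketch of the two quoting layers discussed above: backticks so Hive treats `date` as an identifier, and single quotes so a POSIX shell does not run the backticks as command substitution. It is illustrative only and not how reportupdater-queries handles this; both helper names are hypothetical, and doubling an embedded backtick is an assumption about Hive's quoted-identifier rules.

```python
import shlex


def quote_hive_identifier(name):
    """Wrap a column name in backticks so Hive reads it as an identifier."""
    # Assumption: a literal backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def shell_safe(query):
    """Single-quote a query for a POSIX shell so backticks are not executed."""
    return shlex.quote(query)


query = "select 'foo' as " + quote_hive_identifier("date")
print(query)              # the text Hive should receive
print(shell_safe(query))  # the same text, safe to paste into a shell command
```

Inside double quotes or unquoted, the shell would replace `date` with the output of the `date` command, which is the substitution awight warns about.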
[16:10:35] <awight>	 Does reportupdater ignore the first column name and use it regardless, or is it sensitive?
[16:12:08] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Patch-For-Review, Performance-Team (Radar): Convert WikimediaEvents to use ResourceLoader packageFiles - https://phabricator.wikimedia.org/T253634 (Krinkle) @Jdlrobson @phuedx FYI this is riding the next train branch, might be worth some testing...
[16:14:04] <awight>	 Insensitive.  https://github.com/wikimedia/analytics-reportupdater/blob/master/reportupdater/executor.py#L150
[16:29:39] <wikibugs>	 (PS1) Ebernhardson: HivePartition.list must return valid HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/664597
[16:29:58] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Patch-For-Review, Performance-Team (Radar): Convert WikimediaEvents to use ResourceLoader packageFiles - https://phabricator.wikimedia.org/T253634 (phuedx) Excellent! I'd asked @Mholloway if he was looking for a merger just the other day. Great w...
[16:32:55] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (SBisson) There's one last change @nshahquinn-wmf would like to make to the InukaPageView schema before it gets protected but you...
[16:34:31] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (Ottomata) Ok, awesome.  If alright with you we'll just wait for all the schemas to be settled before proceeding (so we can try do...
[16:39:42] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (SBisson) @Ottomata Sure, will let you know when all 3 are settled.
[17:10:35] <awight>	 Was the jump from Hive 1.1.0 to 2.3.6?
[17:10:53] <joal>	 Indeed awight :)
[17:11:09] <joal>	 awight: we were 'just a bit late' :)
[17:16:03] <awight>	 joal: :-D no that's great news, congratulations!
[17:17:36] <sukhe>	 hello! I have a (possibly dumb) question while running a Hive query. when I run a query and move on to something else and in the meantime the query finishes, I see the backlog filled with messages like: "ExecutorAllocationManager: Removing executor 26 because it has been idle for 60 seconds (new desired total will be 1)"
[17:17:41] <sukhe>	 is there a way to filter these out so that I can see the actual query result? thanks!
[17:27:20] <sukhe>	 "21/02/16 17:25:19 INFO ContextCleaner: Cleaned accumulator 67" -> additional messages 
[17:28:03] <elukey>	 sukhe: hi! We have had a brand new version of hive since a few days ago, we still need to tune the logging levels :(
[17:28:34] <wikibugs>	 Analytics-Radar, Better Use Of Data, Event-Platform, Product-Data-Infrastructure: mw.user.generateRandomSessionId should return a UUID - https://phabricator.wikimedia.org/T266813 (Mholloway)
[17:31:03] <sukhe>	 elukey: oh that's fine! I just thought I was doing something wrong in the setup :P
[17:31:24] <elukey>	 sukhe: nono it is on our side, but please open a task so it is on our radar
[17:31:32] <elukey>	 other folks might have better suggestions :)
[17:31:50] <sukhe>	 thanks, will do!
[17:31:50] <elukey>	 I hope that eventually we'll get good log4j settings that avoid spam :(
[17:31:53] <elukey>	 <3
[17:34:34] <wikibugs>	 (PS1) Awight: Avoid reserved keyword `date` [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664605
[17:34:36] <wikibugs>	 (PS1) Awight: Drop redundant operations on literal date [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664606
[17:34:38] <wikibugs>	 (PS1) Awight: Escape reserved "date" column [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/664607
[17:38:20] <elukey>	 the hadoop workers are still rebooting, it might take a couple of hours more
[17:39:28] <elukey>	 razzi: when you have a moment can you review https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts and let me know if there are bits not clear etc.. ?
[17:39:56] <elukey>	 those are the descriptions attached to the hadoop icinga alerts
[17:40:36] * elukey bbiab
[17:42:32] <razzi>	 !log rebalance kafka partitions for atskafka_test_webrequest_text
[17:42:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:43:40] <wikibugs>	 Analytics: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (ssingh)
[17:44:10] <razzi>	 !log rebalance kafka partitions for netflow
[17:44:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:17:18] <wikibugs>	 Analytics-Radar, Product-Analytics: Provide a list of 100 most popular articles of Russian and English Wikipedias in terms of page views from Ukraine - https://phabricator.wikimedia.org/T273924 (LGoto) a:kzimmerman
[18:41:34] * razzi afk for a walk
[18:47:10] <wikibugs>	 (03CR) 10Joal: "Comments on comments - code is great - Thanks Erik :)" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664597 (owner: 10Ebernhardson)
[18:50:57] <elukey>	 sukhe: one question about the task - the hive query that you mentioned is done via spark?
[18:51:20] <sukhe>	 elukey: yes!
[18:51:33] <sukhe>	 important info that should have been added there? :P
[18:51:34] <elukey>	 ahhh okok, sorry I thought it was via hive/beeline
[18:51:39] <sukhe>	 I can update it! 
[18:51:51] <elukey>	 yes please, with what you use etc..
[18:52:05] <sukhe>	 doing
[18:52:07] <elukey>	 it is clear from the logs that it is spark but I had some brain fault while parsing :D
[18:52:11] <elukey>	 thanks :)
[18:52:44] <elukey>	 in the spark case I am not sure how much we can do, but we'll try
[18:53:21] <joal>	 elukey: sparksql logging is a known issue (to us I mean) - I wish we could find a log4j spec that would make it better!
[18:53:34] <sukhe>	 that's fine. is there another, better way of accessing Hive that I should know about? please link me to the manual -- happy to do the reading!
[18:53:45] <wikibugs>	 10Analytics: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (10ssingh)
[18:55:38] <sukhe>	 also sorry if the task was not clear: I am not even sure if this is a bug or a feature, so please feel free to close it/not work on it if it's a special corner case
[18:57:48] <wikibugs>	 10Analytics: The most visited wiki in Uzbekistan on Feb 14th at 6am UTC is mediawiki.org - https://phabricator.wikimedia.org/T274823 (10JAllemandou) > It's curious how the automated traffic detection didn't catch those, if they share IPs. Maybe we can improve the heuristics for this particular case.  The reason...
[18:58:28] <elukey>	 sukhe: we'll try to see what we can do!
[18:58:38] <elukey>	 happy to help :)
[19:00:09] <elukey>	 going to have dinner folks, I think that there are still 15 hosts left to reboot, the cookbook is gently proceeding (I should have fixed all the weird corner cases of the interface renaming)
[19:00:13] <elukey>	 will check later
[19:00:30] <elukey>	 the reboots are not impactful afaics to jobs, but it takes ages to reboot 60 hosts :D
[19:00:40] <elukey>	 and soon we'll have 84 /o\
[19:01:02] <joal>	 enjoy dinner elukey - gone for dinner as well :)
[19:01:24] <mforns>	 byeee guys
[19:02:04] <sukhe>	 elukey: thanks _/\_
[19:02:08] <sukhe>	 enjoy the dinner!
[19:18:04] <wikibugs>	 10Analytics: The most visited wiki in Uzbekistan on Feb 14th at 6am UTC is mediawiki.org - https://phabricator.wikimedia.org/T274823 (10ssingh) Thanks for opening this task, Marcel.  Joal, thanks for investigating this: it is helpful context for some past and possibly future alerts as well that we may have (had)...
[19:18:46] <wikibugs>	 10Analytics: The most visited wiki in Uzbekistan on Feb 14th at 6am UTC is mediawiki.org - https://phabricator.wikimedia.org/T274823 (10mforns) We could add a tag to pageviews generated by actors with high-traffic IPs. It would not change the way we process, count or classify traffic today, but we could use it to...
[19:20:20] <wikibugs>	 10Analytics: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (10ssingh) I edited the task but I wanted to add here as well that this happens when I run the Hive query from Spark.
[19:21:38] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "+1 from me, after joal's comments are addressed :]" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664597 (owner: 10Ebernhardson)
[19:25:28] <wikibugs>	 (03PS2) 10Ebernhardson: HivePartition.list must return valid HQL [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664597
[19:25:30] <wikibugs>	 (03CR) 10Ebernhardson: HivePartition.list must return valid HQL (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664597 (owner: 10Ebernhardson)
[19:42:34] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure, and 4 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10kzimmerman) Thanks @CBogen !
[20:09:40] <wikibugs>	 (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy - Thanks Erik :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664597 (owner: 10Ebernhardson)
[20:29:29] <wikibugs>	 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Product-Data-Infrastructure: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (10Mayakp.wiki)
[21:08:02] <wikibugs>	 10Analytics, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10GoranSMilovanovic) @elukey First of all: thanks for reaching out!  ` sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata:ana...
[21:14:28] <wikibugs>	 (03PS1) 10Milimetric: Fix use of reserved keywords [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664674 (https://phabricator.wikimedia.org/T274322)
[21:15:19] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10Milimetric)
[21:18:26] <wikibugs>	 (03Abandoned) 10Milimetric: Fix use of reserved keywords [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664674 (https://phabricator.wikimedia.org/T274322) (owner: 10Milimetric)
[21:20:02] <wikibugs>	 (03PS3) 10Milimetric: Update commons_file_usage_in_wikimedia_projects logic per Isaac [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656166 (https://phabricator.wikimedia.org/T271571)
[21:20:07] <wikibugs>	 (03PS4) 10Milimetric: Update commons_file_usage_in_wikimedia_projects logic per Isaac [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656166 (https://phabricator.wikimedia.org/T271571)
[21:20:17] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update commons_file_usage_in_wikimedia_projects logic per Isaac [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656166 (https://phabricator.wikimedia.org/T271571) (owner: 10Milimetric)
[21:20:35] <wikibugs>	 (03CR) 10Milimetric: "my apologies, never saw the comment :(" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/656166 (https://phabricator.wikimedia.org/T271571) (owner: 10Milimetric)
[21:22:52] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "I'm just assuming yall wanted this merged, but I may be wrong.  Easier to apologize than wait for permission, trying to merge Adam's other" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659230 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal)
[21:25:09] <wikibugs>	 (03PS3) 10Milimetric: Use the edit count bucket sent by TemplateData [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: 10Andrew-WMDE)
[21:30:28] <milimetric>	 (don't mind all that, I'm just cleaning up my mess and merging my sadly redundant patch with Adam's)
[21:42:09] <wikibugs>	 (03PS2) 10Milimetric: Avoid reserved keyword `date` [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664605 (owner: 10Awight)
[21:42:39] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Thanks Adam!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664605 (owner: 10Awight)
[21:48:09] <wikibugs>	 (03PS2) 10Milimetric: Drop redundant operations on literal date [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664606 (owner: 10Awight)
[21:48:26] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Thanks again :)" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664606 (owner: 10Awight)
[21:48:53] <wikibugs>	 (03Abandoned) 10Milimetric: Escape reserved "date" column [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/664607 (owner: 10Awight)
[21:54:27] <icinga-wm>	 PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:07:00] <wikibugs>	 10Analytics, 10Event-Platform: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.org to get stream config - https://phabricator.wikimedia.org/T274951 (10Ottomata)
[22:16:49] <wikibugs>	 10Analytics, 10Event-Platform: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.org to get stream config - https://phabricator.wikimedia.org/T274951 (10Ottomata) Oh, this is a bit more of a problem than just canary events.  Camus is using webpro...
[22:27:26] <wikibugs>	 10Analytics, 10Event-Platform: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.org to get stream config - https://phabricator.wikimedia.org/T274951 (10Ottomata) Ok, @akosiaris has webproxy turned back on for now.  We need to do 2 things:  - Mak...
[22:31:48] <razzi>	 !log rebalance kafka partitions for codfw.mediawiki.api-request
[22:31:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:37:11] <icinga-wm>	 RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:38:47] <musikanimal>	 hello analytics! FYI it seems every or most projects *except* Wikipedias experienced a dramatic increase in desktop traffic the past two days
[22:38:51] <musikanimal>	 e.g. https://pageviews.toolforge.org/siteviews/?platform=desktop&source=pageviews&agent=user&range=latest-20&sites=en.wikibooks.org|en.wikinews.org|en.wikiquote.org|en.wikisource.org|en.wikiversity.org|en.wikivoyage.org
[22:39:19] <musikanimal>	 the same seems to be true when I spot check other languages
[22:40:21] <musikanimal>	 not sure if it's the usual undeclared bots throwing these numbers, but it seems odd that nearly every project but the Wikipedias is affected
[22:51:25] <wikibugs>	 10Analytics, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09): eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas - https://phabricator.wikimedia.org/T273700 (10Etonkovidova) 05Open→03Resolved Thanks everybody for the com...
[23:30:49] <wikibugs>	 (03PS5) 10Eric Gardner: Update schema to handle quickview copy events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663)
[23:31:06] <wikibugs>	 (03PS2) 10Eric Gardner: Update schema to handle quickview playback events [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/663703 (https://phabricator.wikimedia.org/T263154)
[23:33:24] <wikibugs>	 (03CR) 10Eric Gardner: "This patch was +2ed but never got merged. Presumably it needed a rebase? I've just done so, but don't have submit authority here. If anyon" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661273 (https://phabricator.wikimedia.org/T263663) (owner: 10Eric Gardner)
[23:55:14] <wikibugs>	 (03CR) 10Milimetric: Fix unit tests that ensure certain requests fail and clean up all unit tests (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: 10Lex Nasser)