[00:03:47] <icinga-wm>	 RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:48:03] <icinga-wm>	 PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:48:07] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:31:01] <elukey>	 goood morning
[06:33:25] <elukey>	 !log reboot stat1005 to resolve weird GPU state (scheduled last week)
[06:33:30] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:36:38] <elukey>	 I quickly checked the applogs for yarn (failed refinement) and I see again shuffle errors
[06:36:59] <elukey>	 I am wondering if we could have spark metrics somehow, to monitor what's happening in these cases
[06:38:14] <elukey>	 https://issues.apache.org/jira/browse/SPARK-25642
[06:38:18] <elukey>	 fixed in 3.0
[06:38:20] <elukey>	 OF COURSE
[06:42:19] <elukey>	 mmm but the pull request seems belonging to before the jira
[06:42:46] <elukey>	 the jira seems to be related to more metrics
[06:44:54] <elukey>	 elukey@an-worker1080:~$ curl localhost:8141/metrics -s | grep -v '#' | grep -i shuffle
[06:44:57] <elukey>	 Hadoop_NodeManager_ShuffleConnections{name="ShuffleMetrics",} 2.8402165E7
[06:45:00] <elukey>	 Hadoop_NodeManager_ShuffleOutputsFailed{name="ShuffleMetrics",} 0.0
[06:45:02] <elukey>	 Hadoop_NodeManager_ShuffleOutputBytes{name="ShuffleMetrics",} 7.147219471227E12
[06:45:05] <elukey>	 Hadoop_NodeManager_ShuffleOutputsOK{name="ShuffleMetrics",} 6937516.0
[06:45:24] <elukey>	 that is not all outlined in the pull request but something
[07:00:18] <elukey>	 ah no my bad the above are probably not the external shuffle service metrics (the one that uses spark)
[07:02:53] <HeartGlow30797>	 a-team
[07:04:36] <joal>	 good morning team
[07:06:26] <joal>	 elukey: I think the problem of mediawiki-history was related as well :(
[07:07:27] <joal>	 elukey: I assume we have the ratio computation-and-IOs vs spark-shufflers is too high
[07:08:05] <elukey>	 joal: yeah it would be great to have metrics to confirm though, but I fear that only 3.0 exposes those
[07:08:20] <joal>	 I assume ou're right :(
[07:09:49] <elukey>	 I need to double check the prometheus config for the yarn nodemanagers
[07:09:53] <elukey>	 we set this
[07:09:54] <elukey>	 whitelistObjectNames: - 'Hadoop:service=NodeManager,name=NodeManagerMetrics' - 'Hadoop:service=NodeManager,name=ShuffleMetrics'
[07:10:20] <elukey>	 that are JMX objs, but maybeee there are also others
[07:10:56] <joal>	 ack
[07:11:38] <elukey>	 ah ok I see in https://issues.apache.org/jira/browse/SPARK-18364
[07:11:45] <elukey>	 "ExternalShuffleService exposes metrics as of SPARK-16405. However, YarnShuffleService does not."
[07:11:53] <elukey>	 fixed in 3.0
[07:12:55] <joal>	 elukey: in https://spark.apache.org/docs/2.2.2/monitoring.html , Metrics section, they advertise metrics being available through various sinks
[07:15:00] <elukey>	 joal: yes but what are those sinks? The history server, etc...? 
[07:15:51] <joal>	 from the page: Console, CSV, Jmx, Graphite, Slf4j
[07:16:40] <elukey>	 mmm but is it the driver exposing those, if configurdd?
[07:17:17] <elukey>	 because in the above pull request it seems that the spark yarn shuffle service itself exposes those (so our dedicated prometheus exporter can pull them)
[07:18:02] <joal>	 elukey: I think the metrics I advertised are handled by job - Not aligned with our metric system
[07:18:27] <joal>	 And indeed having the metrics available straight out of the node-manager would br a lot easier
[07:18:41] <joal>	 Fast move to spark 3.0 am I hearing?
[07:19:50] <elukey>	 this is a very good reason to do so, but probably in Q3, we have too many things :(
[07:20:02] <joal>	 yup - too many indeed
[07:21:27] <elukey>	 we use the binary distribution of spark atm IIRC, not sure how easy it would be to get 2.4.x source and backport that metrics patch and rebuild
[07:21:53] <elukey>	 if it was say 2/3 hours of work it would be worth it
[07:21:56] <joal>	 elukey: I assume it would be easier to move to spark 3.0 :)
[07:22:42] <joal>	 elukey: I might be wrong though :)
[07:22:44] <elukey>	 not sure, it depends how many things are changing, other users than us relying on 2.x vs 3.x, etc.. but I am ignorant on this :)
[07:51:54] <elukey>	 joal: checked with jconsole, I can confirm that we have only the shuffle metrics that yarn currently exposes (so not the spark ones )
[07:52:06] <joal>	 ack elukey - makes sense
[07:56:46] <joal>	 elukey: I just posted a patch for AQS druid datasource update if you have minute
[07:59:22] <elukey>	 ah yes
[08:03:19] <elukey>	 joal: do you have patience half an hour if I create a cookbook ?
[08:03:30] <joal>	 for sure :)
[09:02:10] <klausman>	 Morning
[09:03:44] <elukey>	 morning :)
[09:04:18] <klausman>	 So as far as I can tell the main change between rocm33 and 38 is that HCC goes away and is replaced by an llvm-based compiler
[09:06:09] <klausman>	 https://github.com/RadeonOpenCompute/ROCm/tree/8c835d14fc4597bcdd3048c850e77669dfaa49fd#Upgrading-to-This-Release That happened in 3.5.1
[09:15:07] <elukey>	 no idea where a fix for the weird GPU hang would be advertised :(
[09:15:26] <elukey>	 usually there are a lot of commits and fixes that come from the community, not sure how much they are advertised
[09:16:17] <klausman>	 There is one KI listed (and then unlisted) that talks about hangs, but that was with multi-VPU systems
[09:16:28] <klausman>	 GPU*
[09:22:19] <elukey>	 joal: ready for depooling
[09:25:05] <elukey>	 lemme know when you are ready
[09:34:33] <elukey>	 (afk, be back in a few)
[10:00:12] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10Miriam) Hi @MMiller_WMF all data is there, thanks! Merci @JAllemandou
[10:04:58] <wikibugs>	 10Analytics-Radar, 10Discovery, 10Operations, 10Recommendation-API, 10Patch-For-Review: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10fgiunchedi) Merged the patch above, apologies for the late action on this.  These are the steps I think are left to...
[10:31:16] <elukey>	 !log bootstrap an-worker111[0,2
[10:31:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:31:19] <elukey>	 uff
[10:31:24] <elukey>	 !log bootstrap an-worker111[0,2] as hadoop workers
[10:31:25] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:20] <joal>	 elukey: sorry I missed your ping - testing now
[10:34:10] <elukey>	 joal: ah no joal one sec, I have to depool
[10:34:20] <joal>	 Opps ok :)
[10:34:42] <elukey>	 joal: done, please go :)
[10:34:52] <joal>	 ok
[10:35:00] <joal>	 \o/
[10:35:09] <joal>	 good on my side elukey 
[10:37:57] <elukey>	 completing the roll restart
[10:39:18] <elukey>	 done!
[10:40:18] <elukey>	 joal: very good, all done via cookbook now
[10:40:25] <joal>	 Awesome :)
[10:43:38] <elukey>	 all right going to get lunch, ttl!
[10:43:51] <joal>	 later elukey 
[11:41:45] <fdans>	 morning team!
[11:43:48] <klausman>	 Heyooo
[11:44:12] <joal>	 Hi fdans, hi klausman
[12:35:33] <elukey>	 !log execute "PURGE BINARY LOGS BEFORE '2020-09-28 00:00:00';" on an-coord1001's mysql to free space - T264081
[12:35:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:35:35] <stashbot>	 T264081: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081
[13:31:55] <elukey>	 !log shutdown an-master1002 for ram expansion (64 -> 128G)
[13:31:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:35:18] <mforns>	 hey teamm
[13:36:57] <joal>	 Hi mforns 
[13:37:55] <mforns>	 :]
[13:39:39] <elukey>	 elukey@an-master1002:~$ free -m total        used        free      shared  buff/cache   available
[13:39:42] <elukey>	 Mem:         128600        8941      114959           9        4699      118808
[13:40:00] <joal>	 \o/ !! Moar RAM :)
[13:42:13] <elukey>	 I am also shutting down stat1005 and 1008
[13:42:34] <joal>	 ack elukey - even moar RAM :)
[13:42:46] <elukey>	 yes in there a lot, 1.5TB :D
[13:42:52] <joal>	 :D
[13:43:08] * joal will try a local spark job, just for fun :D
[13:54:56] <elukey>	 !log shutdown stat1005 for ram upgrade
[13:55:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:57:22] <joal>	 gone for kids, back at standup
[14:01:32] <milimetric>	 morning yall
[14:01:48] <milimetric>	 it's my ops week, I have a lot of backlog
[14:10:53] <elukey>	 ok I misremembered, it was 512G of ram per node (we thought about 1.5TB originally but the cost was too high)
[14:12:27] <klausman>	 Aw, now my torrents won't fit
[14:14:21] <elukey>	 :D
[14:24:36] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) @robh @wiki_willy  I am looking at the packing slip and what I have in the data center and it appears we're 4 DIMM short.  The pac...
[14:25:21] <elukey>	 !log shutdown an-master1001 for ram expansion
[14:25:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:25:58] <klausman>	 It still fascinates me how little is needed OS-side when upgrading memory. I mean, it's a fundamental aspect of the machine, and yet, you plop in hardware, boot, done.
[14:26:38] <elukey>	 yep!
[14:26:45] <klausman>	 And it's not just modern machines. An 8086 upgraded from 256K to 512K would work the same.
[14:30:46] <wikibugs>	 10Analytics-Clusters, 10Operations, 10ops-eqiad: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson)
[14:31:11] <wikibugs>	 10Analytics-Clusters, 10Operations, 10ops-eqiad: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) 05Open→03Resolved Task complete
[14:31:29] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson)
[14:39:32] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH)
[14:41:02] <elukey>	 !log shutdown stat1005 and stat1008 for ram expansion (1005 again)
[14:41:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:41:49] <elukey>	 quick coffee
[14:42:35] <fdans>	 joal elukey just updated the puppet patch to make the job daily instead of hourly https://gerrit.wikimedia.org/r/c/operations/puppet/+/629409
[14:44:48] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH) >>! In T260448#6517598, @Cmjohnson wrote: > @robh @wiki_willy  I am looking at the packing slip and what I have in the data center and...
[14:46:45] <milimetric>	 I'm confused about the drop-el-unsanitized-events timer.  It's toggling between CRITICAL and OK every time it runs it looks like, and there seems to be a bug in the script, here's the stack trace:
[14:46:56] <milimetric>	 https://www.irccloud.com/pastebin/8tjQX3Pa/
[14:47:03] <milimetric>	 two weird things
[14:47:14] <milimetric>	 1. why is it running unit tests... does it always do that?  We should get that to stop
[14:47:41] <milimetric>	 2. why is the timer status going back to OK automatically, I thought someone had to reset it once there was an error like this
[14:49:56] <wikibugs>	 (03CR) 10Nuria: [C: 03+2] "+2 to deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[14:50:26] <milimetric>	 I'm joining batcave early if anyone wants to be social
[14:50:36] <nuria>	 milimetric: running unit tests is on purpose , we designed the script that way
[14:50:54] <nuria>	 milimetric: see upcoming changes that will fix timer https://gerrit.wikimedia.org/r/c/analytics/refinery/+/631804/
[14:51:08] <nuria>	 milimetric: the "prioblem" is undeterministic
[14:51:13] <wikibugs>	 (03CR) 10Mforns: [V: 03+2] Fix directory expansion bug in refinery-drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631804 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[14:51:35] <milimetric>	 nuria: why not disable the timer until that fix is deployed?
[14:51:42] <nuria>	 milimetric: it does not happen always (depends of the directory path structure) and that makes the timer oscillate (i think) between broken and ok
[14:51:56] <milimetric>	 I see... weird
[14:52:09] <nuria>	 milimetric: ya, it can be disabled 
[14:52:16] <milimetric>	 k, disabling
[14:52:34] <milimetric>	 !log disabling drop-el-unsanitized-events timer until https://gerrit.wikimedia.org/r/c/analytics/refinery/+/631804/ is deployed
[14:52:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:06] <elukey>	 milimetric: puppet needs to be disabled as well, otherwise the firt puppet run restores the timer
[14:53:53] <milimetric>	 oh... makes sense
[14:54:11] <milimetric>	 never ... mind then, I'll maybe deploy a couple times if that fix is ready to go
[14:55:43] <elukey>	 also, the drop-el-unsanitized-events.service is in failed state and alerting in icinga
[14:55:43] <elukey>	 so we could do 'systemctl reset-failed drop-el-unsanitized-events.service'
[14:55:43] <elukey>	 to clean it up
[15:01:43] <nuria>	 klausman: ping standup
[15:01:50] <klausman>	 omw
[15:22:03] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson)
[15:46:32] <wikibugs>	 10Analytics-Radar, 10Operations, 10ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) The dell tech is back today with new power supplies,  he took the system down to the bare minimum and slowly started adding things back, and once he connected the backplane there was smo...
[16:18:40] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10fdans)
[16:22:44] <wikibugs>	 10Analytics, 10Growth-Team, 10Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10fdans) a:03Milimetric
[16:23:03] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10fdans)
[16:23:32] <wikibugs>	 10Analytics, 10Operations, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10fdans) Just pinging @Ottomata for when he's back from vacation.
[16:26:31] <wikibugs>	 10Analytics-Radar, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10fdans)
[16:28:08] <wikibugs>	 10Analytics, 10Operations, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10fdans)
[16:28:12] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10fdans)
[16:32:15] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) stat1008, I added all the DIMM and the server would not boot, I received the following error  UEFI0060: Power required by the syst...
[16:34:23] <wikibugs>	 10Analytics-Radar, 10Data-Services, 10cloud-services-team (Kanban): labstore1006 persistent high iowait - https://phabricator.wikimedia.org/T263329 (10fdans)
[16:34:55] <wikibugs>	 10Analytics-Radar, 10Data-Services, 10cloud-services-team (Kanban): labstore1006 persistent high iowait - https://phabricator.wikimedia.org/T263329 (10Milimetric) This looks like legitimate use, as people sometimes pull huge files over NFS at /mnt/public on the stat boxes.  Assuming this kind of usage will h...
[16:35:18] <wikibugs>	 10Analytics-Radar, 10Patch-For-Review, 10Product-Analytics (Kanban): Add DesktopWebUIActionsTracking fields to the allowlist - https://phabricator.wikimedia.org/T263143 (10fdans)
[16:36:05] <wikibugs>	 10Analytics-Clusters, 10Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10fdans)
[16:41:57] <wikibugs>	 10Analytics-Clusters, 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10fdans) a:03klausman
[16:43:02] <wikibugs>	 10Analytics: Make sure pageview API limits are well documented - https://phabricator.wikimedia.org/T261681 (10fdans) p:05Triage→03Medium
[16:43:25] <wikibugs>	 10Analytics: Make sure pageview API limits are well documented - https://phabricator.wikimedia.org/T261681 (10fdans) a:03Milimetric
[16:44:10] <wikibugs>	 10Analytics-Clusters: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10fdans)
[16:44:37] <wikibugs>	 10Analytics-Clusters: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10fdans) p:05Triage→03High
[16:46:13] <joal>	 elukey: yarn master is an-master1002 - is that expected?
[16:47:20] <wikibugs>	 10Analytics, 10Analytics-Kanban: pagecounts-ez of month 2020-08 is incomplete - https://phabricator.wikimedia.org/T262141 (10fdans) p:05Triage→03High a:03fdans
[16:48:19] <elukey>	 joal: ah yes my bad, I need to do the failover
[16:48:54] <joal>	 elukey: UI is broken when we have an-master1002 active, so I notice ;)
[16:49:20] <wikibugs>	 10Analytics, 10Product-Analytics, 10Platform Team Initiatives (Modern Event Platform (TEC2)): Retrofit event pipeline with bot detection code - https://phabricator.wikimedia.org/T263286 (10fdans) p:05Triage→03High
[16:49:22] <elukey>	 joal: should be ok now
[16:58:29] <joal>	 Thanks elukey :)
[16:59:55] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH)
[17:06:57] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) physically moved an-worker1111 from C8 to C2, updated network switch and netbox.  vlan and IP stay the same physically moved an-worker1113/1...
[17:09:14] <nuria>	 mforns: my meeting was cancelled want to talk entropy?
[17:09:45] <nuria>	 cc razzi
[17:09:46] <mforns>	 nuria: sure!
[17:09:57] <mforns>	 batcave?
[17:10:25] <mforns>	 o the batcave is busy
[17:10:26] <nuria>	 mforns: k
[17:10:35] <nuria>	 mforns: the tardis?
[17:11:02] <nuria>	 mforns: no, wait
[17:11:11] <mforns>	 yes! link for you razzi, if you want to come: https://meet.google.com/kti-iybt-ekv
[17:11:29] <elukey>	 !log bootstrap an-worker[1115-1117] as hadoop workers
[17:11:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:25:23] <elukey>	 we just crossed the 3PB mark on hdfs :D
[17:25:24] <elukey>	 https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=25&orgId=1
[17:25:44] <elukey>	 it is temporary since I'll need to remove nodes, but then we'll probably get back to it once the other hadoop workers will come
[17:36:29] <elukey>	 razzi: o/ still working with marcel?
[17:36:42] <razzi>	 elukey: just wrapped up
[17:37:11] <elukey>	 ah ok, I wanted to ask if you had questions about what I am doing with hadoop worker nodes etc.. but it might be too much in one morning then :D
[17:39:08] <razzi>	 elukey: I'm interested, wanna chat for a quick explanation?
[17:39:30] <elukey>	 sure, we can do in here or over meet
[17:40:21] <razzi>	 elukey: tardis?
[17:41:10] <elukey>	 razzi: can you give me the link?
[17:41:14] <elukey>	 i don't have it saved
[17:41:27] <razzi>	 dm'd
[17:49:34] <milimetric>	 elukey: \o/
[17:56:02] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10nshahquinn-wmf) >>! In T255028#6510728, @elukey wrote: > /me cries in a corner  Mistakes happen! You're still taking //excellent// care of our beloved anal...
[18:20:15] <elukey>	 !log manual creation of /opt/rocm -> /opt/rocm-3.3.0 on stat1008 to avoid failures in finding the lib dir
[18:20:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:25:30] * elukey afk!
[18:28:34] <mforns>	 heya a-team, I'm going to do a quick refinery deployment, to be able to move forward with the netflow and mediawiki_job deletion jobs. Anyone against? It's just the changes to the deletion script.
[18:30:25] <mforns>	 or any other change you guys want me to include??
[18:30:47] <mforns>	 unique devices changes maybe, joal?
[18:31:13] <joal>	 mforns: I can't think of anything now - unique-deviceS?
[18:31:33] <mforns>	 joal "Correct unique-devices per domain monthly dependency"
[18:31:38] <mforns>	 it's in CR
[18:32:42] <joal>	 wow - I thought that had been merged!!
[18:32:53] <joal>	 let me check mforns - thanks for looking!
[18:33:17] <mforns>	 joal I can merge
[18:33:23] <joal>	 mforns: please!
[18:33:45] <wikibugs>	 (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM! Merging for deployment." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/621515 (owner: 10Joal)
[18:34:10] <joal>	 mforns: indeed it should have been merged, and only the per-domain-monthly job needs to be restarted
[18:34:18] <mforns>	 ok, will do
[18:34:23] <joal>	 <3 mforns :)
[18:34:30] <mforns>	 :]
[18:34:33] <joal>	 Going for diner team, back after
[18:45:01] <mforns>	 !log deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
[18:45:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:00:54] <nuria>	 mforns: +1
[19:01:27] <mforns>	 ok, already deploying
[19:05:48] <mforns>	 !log finished deploying refinery to unblock deletion of raw mediawiki_job and raw netflow data
[19:05:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:09] <mforns>	 milimetric: I just deployed the fixes to the deletion script, I think we can re-enable drop-el-unsanitized-events timer
[19:13:20] <mforns>	 milimetric: may I?
[19:14:01] <mforns>	 !log restarted oozie coord unique_devices-per_domain-monthly after deployment
[19:14:02] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:15:21] <mforns>	 oh, milimetric it's active?
[19:15:36] <milimetric>	 mforns: sorry yes
[19:15:50] <mforns>	 ok ok, better!
[19:15:54] <milimetric>	 I never disabled it because of the puppet issue - I didn't want to disable puppet and go through all that
[19:16:01] <mforns>	 oh ok!
[19:16:07] <milimetric>	 thanks for the deploy, I could've done that but I missed it while I was eating lunch
[19:16:28] <mforns>	 milimetric: not a weekly deploy, I did it as part of netflow work
[19:16:52] <mforns>	 no problemo
[19:17:16] <milimetric>	 ok, cool (I still can do non-train deploys, but you're welcome to as well)
[19:17:42] <mforns>	 ok :]
[19:36:37] <mforns>	 nuria, razzi: I modified the deletion timers to work properly (hopefully :]), can you please review? I left a comment explaining: https://gerrit.wikimedia.org/r/c/operations/puppet/+/628895/  Thanks!
[19:38:35] <mforns>	 if you guys +1, I'll let Lu-ca merge tomorrow :]
[20:01:04] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Wikistats Bug - https://phabricator.wikimedia.org/T264660 (10Nevatovol)
[20:03:05] <joal>	 before I leave - milimetric - Have you investigated the SLA alert for virtual-pageviews? 
[20:03:45] <joal>	 The one to consider is action 4440 of job https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0057983-200303081945184-oozie-oozi-C
[20:04:00] <joal>	 The other one is the druid coordinator, dependent on that one
[20:04:10] <joal>	 It seems an hour of events has not been refined
[20:05:26] <milimetric>	 joal: thanks, not yet
[20:05:32] <milimetric>	 but looking now
[20:08:41] <joal>	 Thanks milimetric
[20:08:44] <joal>	 Gone for tonight :)
[20:52:23] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10egardner) Regarding filter-change events, here's how I'm approaching this for now:  1. Wh...
[21:25:35] <wikibugs>	 10Analytics, 10Event-Platform, 10EventStreams, 10Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (10Jdlrobson)
[21:30:45] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:54:52] <razzi>	 Out for a walk, back in a bit
[23:12:57] <wikibugs>	 10Analytics, 10observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10colewhite) @JAllemandou this is a known issue with the current Logstash configuration and one of the primary drivers behind adopting a Common Logging Schema (T234565).  In...