[05:56:47] good morning
[05:59:36] The two full disk partitions are related to yarn logs, will follow up
[06:05:19] !log truncate logs for application_1615988861843_158592 on analytics1061 - one partition full
[06:05:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:08:25] RECOVERY - Disk space on Hadoop worker on analytics1061 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:10:50] !log kill application_1615988861843_158592 on analytics1061 to allow space to recover (truncate of course in D state)
[06:10:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:17:12] !log kill application application_1615988861843_158645 to free space on analytics1070
[06:17:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:17:36] I am working with Aiko on this, tensorflow on Yarn for some reason spams the yarn logs a lot
[06:19:55] RECOVERY - Disk space on Hadoop worker on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:06:22] Analytics-Radar, WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight) Open→Resolved a: awight
[08:50:21] * elukey hates hue
[08:50:35] the new 4.9 version doesn't work in test, sigh
[08:50:41] the hive panel is broken
[09:02:47] anything obvious?
[09:09:13] good morning :)
[09:10:02] I am trying to revert some commits and rebuild, I suspect it is due to some changes to thrift, but upstream is not super collaborative
[09:10:24] https://github.com/cloudera/hue/issues/1997
[09:10:56] their release model is break first, patch later
[09:11:38] anyway, I'd like to upgrade hue behind hue-next.wikimedia.org since we are going to switch hue.wikimedia.org to it soon
[09:11:55] hue next uses the Hue github repo
[09:12:02] (self packaged by us)
[09:12:13] meanwhile hue.wikimedia.org uses the CDH 5 Hue
[09:12:24] is it always HEAD or does it go by tags?
[09:12:39] there are tags yes
[09:12:41] https://www.cloudera.com/downloads/paywall-expansion.html
[09:12:50] this is the main reason to drop CDH 5 Hue
[09:13:01] I'd like to clean up our repositories just in case
[09:13:09] Ah.
[09:13:20] yeah lovely I know :(
[09:13:54] (on the bright side, we switched to Bigtop just in time)
[09:14:36] So if Hue goes subscription-only, what will it be replaced with?
[09:15:25] in theory it is only their packages etc., the github repo's code is licensed under Apache 2.0
[09:15:49] and I really hope to drop Hue when airflow comes in, but not sure if we'll make it
[09:16:37] There is also https://phabricator.wikimedia.org/T264896 with the upstream bugs to fix
[09:16:55] (upstream is basically receptive to pull requests only)
[09:17:36] so the best thing to do is probably to assign somebody on the team to fix these, but it may take a while and not sure what the priority will be (low to lowest probably)
[09:50:23] I tried to roll back to 4.8, the prev version (in test), and got the same error, now I am confused
[10:10:51] Did you try the page before starting the update? Maybe it has always been broken that way?
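A hedged sketch of the disk-space cleanup being !logged above: the application id is the one from the alert, but the container-log directory is an assumption (it depends on yarn.nodemanager.log-dirs on the worker), and the user/keytab needed for the kill will vary by setup.

```bash
# Sketch only: free disk space eaten by a runaway Yarn application's container logs.
APP_ID="application_1615988861843_158592"
LOG_DIR="/var/lib/hadoop/data/b/yarn/logs"   # hypothetical yarn.nodemanager.log-dirs entry

# How much space are this application's container logs taking?
du -sh "${LOG_DIR}/${APP_ID}"

# Truncate the big log files in place. If the writer is stuck in D state (as above),
# the truncate itself can hang and the space may not come back until the app is gone.
find "${LOG_DIR}/${APP_ID}" -type f -size +1G -exec truncate -s 0 {} +

# Last resort: kill the offending Yarn application outright.
yarn application -kill "${APP_ID}"
```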
[10:13:32] It used to work, I didn't test it recently, and we have the same version running on an-tool1009 (the hue-next.wikimedia.org backend) that works fine
[10:14:23] the test cluster might have something special, but the errors are really cryptic
[10:18:13] `Exception in thread "IPC Client (1969564802) connection to an-master1001.eqiad.wmnet/10.64.5.26:8032 from klausman@WIKIMEDIA" java.lang.OutOfMemoryError: GC overhead limit exceeded` <-- oops.
[10:20:27] :)
[10:22:41] ok I think that I found the issue
[10:23:22] I have upgraded an-test-coord1001 to buster recently, as a prep-step/test for an-coord1001
[10:23:31] !log deploying aqs with updated cassandra libraries to aqs1004 while depooled
[10:23:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:23:40] part of the changes involve moving hive to org.mariadb.jdbc.Driver
[10:23:50] since the mysql one is not available anymore
[10:24:34] I suspect that Hue doesn't like it
[10:24:57] I pointed Hue test to an-coord1001 and all worked as expected
[10:25:20] this is a big problem :(
[10:25:41] isn't MariaDB the same as Mysql on the wire?
[10:26:01] hnowlan: little nit for the future - if possible add a deploy msg to scap like "testing only on aqs1004 blabla"
[10:26:41] elukey: good call, always realise that too late with scap :/
[10:27:21] hnowlan: yes yes, no big deal, it happens the same to me :)
[10:27:59] klausman: here be dragons, I think yes but they diverged in some things, maybe there is a specific query that mariadb doesn't digest
[10:28:27] :-S
[10:28:39] Does Hue at least log the failed query somewhere?
[10:30:33] yes yes, but I am 100% sure that upstream will not try to solve the problem since they support mysql
[10:31:33] Could you try running the query manually in a mariadb console?
[10:31:54] Could you try running the query manually in a mariadb console?
[10:32:03] oops, wrong window for up-arrow and return
[10:32:22] rolled back aqs on aqs1004, turns out there's some kind of problem with schemas the library has
[10:33:09] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.04.13?id=P5LEyngBA6MeBtBqgyBE
[10:37:11] hnowlan: good outcome of the test then!
[10:37:34] klausman: this is a good brainbounce, I thought that the query was for the hive metastore, but it may be for mariadb, will try!
[10:39:07] but then, it wouldn't be hue making it
[10:39:23] because hue talks only to the metastore
[10:39:27] mmmmm
[10:40:36] fg
[10:40:39] gah :)
[10:41:33] I am wondering how many other things will break if we upgrade an-coord1001
[10:41:50] I have an-coord1002 already upgraded, at least we could failover to it and test
[10:42:04] Hue seems the only thing complaining in hadoop test
[10:43:01] Yeah, just pointing things at 1002 and seeing where the smoke comes out might be a good test balloon. With advance warning etc.
[10:59:24] bbiab, lunch and groceries.
[11:08:33] lunch for me too
[11:43:22] Analytics: Top read repeats - https://phabricator.wikimedia.org/T280011 (dr0ptp4kt)
[11:52:07] good morning dear team
[12:25:12] hola!
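Since the suspicion above is the metastore's JDBC driver rather than Hue itself, here is a minimal sketch of how one might verify the driver config and try the failing query by hand, as suggested at 10:31. The hive-site.xml path and the metastore database name are assumptions, not verified against the real hosts.

```bash
# Sketch only: check which JDBC driver the Hive metastore is configured with after
# the buster upgrade, and whether the matching jar is actually installed.
grep -A1 'javax.jdo.option.ConnectionDriverName' /etc/hive/conf/hive-site.xml
# expected to show org.mariadb.jdbc.Driver after the change described above

# On buster the jar comes from libmariadb-java rather than libmysql-java
dpkg -L libmariadb-java | grep '\.jar$'

# Reproduce outside of Hue/Hive: run the query the metastore logs as failing directly
# in a mariadb console against the metastore database (name "hive_metastore" assumed).
sudo mysql hive_metastore -e "SHOW TABLES;"
```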
[12:44:41] Analytics, Event-Platform: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (Ottomata)
[13:04:08] Analytics, SRE, Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (fgiunchedi) p: Triage→Medium
[13:05:53] a-team FYI I will be merging some patches that change around Refine and related monitoring jobs today. Hopefully it won't trigger any false alerts but I can't promise! :)
[13:09:55] ottomata: would you have a few minutes to talk about ATS/Varnish-Kafka?
[13:11:56] mornin everyone
[13:14:27] ack!
[13:14:30] morning milimetric
[13:15:20] hiya elukey. I am having so much empathetic pain over your pain with Hue
[13:16:14] klausman: ya gimme a few, just merging something, then for sure
[13:17:31] sure, I'll be in the BC
[13:18:39] !log Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
[13:18:39] T273789
[13:18:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:18:41] T273789: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789
[13:19:03] milimetric: in this case it is a bigger mess, it seems to be the JDBC mariadb driver between hive server and metastore :(
[13:19:06] but only Hue complains
[13:19:42] yeah, maybe Hue's the only sane one in this case :P But nevertheless, I just wish it didn't exist
[13:20:12] as soon as we're done with Gobblin I'm going to push the Airflow migration as fast as I can
[13:24:02] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:24:36] me ^ fixing
[13:25:34] milimetric: I think Airflow is part of the problem, a lot of people use the hive editor via Hue so we'll have to empower Superset too :(
[13:25:59] i think once we get rid of oozie we can encourage people to stop using hue
[13:29:23] yes I agree, but we need to get Presto into better shape too
[13:29:45] otherwise we'll have to drop some use cases (like people using the hive editor in hue)
[13:30:00] I am not defending hue of course, just bringing up some points :)
[13:30:21] yeah, i think that would be good, but this might be a prime example of 'dropping systems' that we don't strictly need
[13:30:32] some people will lose out on some functionality, but we have too much stuff to support, ya know?
[13:30:40] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:30:48] checking ^
[13:31:00] anyway hopefully we'd have a manager to make that decision :p
[13:31:39] the main problem right now is that if I upgrade an-coord1001 to buster I break the Hue hive editor
[13:32:03] oof
[13:32:29] elukey: ... wait why? sorry maybe i missed the thread...
[13:32:46] klausman: almost with ya, just making sure refine still works!
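For reference, a hedged sketch of how the flapping refine_* units above might be inspected and cleared; the unit and host names come from the alert text (an-launcher1002), everything else is generic systemd usage.

```bash
# Sketch only: chase down a CRITICAL "Check unit status of refine_event" alert.
systemctl status refine_event              # last run's exit status and timestamps
journalctl -u refine_event --since today   # the actual error output / stack trace

# If the failure was transient (e.g. a race with the refinery-job deploy mentioned above),
# clearing the failed state lets the Icinga check recover before the next scheduled run:
sudo systemctl reset-failed refine_event
# ...or re-run the job right away:
sudo systemctl start refine_event
```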
[13:33:07] np, I'm not in a particular rush :)
[13:34:40] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:35:29] ottomata: on buster we don't have libmysql-java anymore, which brings com.mysql.jdbc.Driver. It is replaced by libmariadb-java, which provides org.mariadb.jdbc.Driver (that needs to be set also in the hive-site.xml)
[13:35:59] elukey: if we need a jar, we can probably get and deploy it via maven
[13:36:54] ottomata: sure, but it was dropped by debian and it would be another special use case
[13:37:05] I am still not sure why this fails
[13:37:22] I suspect that libmariadb-dev's jar may not be fully compatible
[13:37:42] I am wondering if the debian 11/sid versions might be better
[13:38:36] PROBLEM - Check unit status of refine_mediawiki_job_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:41:34] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:19:08] Analytics: Easy dimensional data visualization - https://phabricator.wikimedia.org/T280029 (Milimetric)
[14:19:42] RECOVERY - Check unit status of refine_mediawiki_job_events on an-launcher1002 is OK: OK: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:24:33] elukey: i wonder if relying on debian for language dependencies for things like java and python is just not the right thing to do anyway
[14:26:09] ottomata: for base libs it makes sense in my opinion
[14:26:37] the more we customize, the more we diverge from the debian security policies
[14:27:18] right, but if we were going to k8s, which we probably will for services like hue and hive-server, the deps would be in the docker image anyway
[14:27:39] it's a bit easier to manage the deps with the language package manager than with debian
[14:27:51] esp when debian upstream is not the one creating the service debian packages
[14:28:16] yes, but it is also a bigger problem when we need to track dependencies
[14:28:32] for example, a CVE is out and we need to figure out what to patch
[14:28:44] for stuff like python deps I would highly recommend using the pipeline blubber stuff if possible, it handles installing deps and containerising them pretty consistently
[14:29:18] yes, as long as we use something approved by SRE (and controlled) I am +1
[14:29:32] hnowlan: aye, would be good to get java in there nicely too
[14:29:47] although, hnowlan, python application deps are not really managed by SRE, right?
[14:29:56] they are just part of the image build process
[14:30:10] no, they're generally managed by dependent teams and/or whoever is building the service
[14:30:17] yeah I dunno if there's any java support yet
[14:30:17] right, i think that's going to be the general trend
[14:30:25] makes sense imo
[14:30:26] for java as well
[14:30:42] but elukey is right, it means that handling CVEs will be the job of the service-owning teams, not SRE
[14:31:42] and for extra libs etc. it makes complete sense, but it is surely easier for interpreters/std-libs to be managed via deb upstream
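A small sketch of the "get and deploy it via maven" idea floated at 13:35:59, assuming the MariaDB Connector/J artifact on Maven Central; the version number is only an example, not a recommendation.

```bash
# Sketch only: pull the MariaDB connector jar from Maven Central instead of relying
# on the debian package.
mvn dependency:get -Dartifact=org.mariadb.jdbc:mariadb-java-client:2.7.2

# The jar lands in the local maven repository; from there it could be published to
# Archiva or shipped alongside the service, at the cost of tracking CVEs ourselves.
ls ~/.m2/repository/org/mariadb/jdbc/mariadb-java-client/2.7.2/
```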
[14:31:43] yeah, and assumes that teams have the capability to do that... and that they have an awareness. it's a lot of variables, but it's also the path we started to go down when the pipeline was made a policy for services. that's not a useful answer but the ship's kinda sailing
[14:32:01] coming at this with little context but - if we were to get away from using system packages for java stuff at all, what would be the tool used for installing the dependencies?
[14:32:14] hnowlan: probably maven
[14:32:24] if in an image, that would just be part of the build
[14:32:37] if like we do now... i guess scap + archive + git-fat :/
[14:32:42] archiva*
[14:32:56] cool - in theory the maven step could just be part of the build step within the pipeline, doesn't necessarily *need* native support
[14:33:01] right
[14:33:11] elukey: but what is a 'std lib'? mysql client?
[14:33:12] not sure
[14:33:57] no, well, my thought was generic, not related to mysql
[14:34:00] services like hue/hive are built against specific versions of deps, if debian changes out the version underneath them, then breakages like this are probably likely
[14:34:25] yeah, i think for the build pipeline SRE/RelEng maintains certain base images
[14:34:32] like for specific versions of nodejs
[14:34:36] this goes deeper, namely that Debian chose to keep only mariadb-related libraries
[14:34:58] right, right, which hive/hue may not know about? since they depend on mysql libs?
[14:36:33] they offer both mysql and mariadb connector configs, but in this case there may be a bug
[14:36:40] aye
[14:39:00] elukey: guessing you are working on this on an-test-coord1001?
[14:39:08] i just ran puppet there and saw Notice: /Stage[main]/Bigtop::Hive::Server/Service[hive-server2]/ensure: ensure changed 'stopped' to 'running' (corrective)
[14:39:21] yes yes, I can stop now, please test anything
[14:39:56] I hoped to test the bullseye libmariadb-mysql package but hive doesn't pick it up in the classpath
[14:40:40] anyway, I think that for the moment we can move hue-next -> hue anyway
[14:41:01] the cloudera paywall is a little scary since we have their packages in our public apt repo
[14:45:14] (PS1) Hnowlan: Revert "package: bump restbase-mod-table-cassandra" [analytics/aqs] - https://gerrit.wikimedia.org/r/678859
[14:48:12] Analytics, Patch-For-Review: Fix the remaining bugs open for Hue next - https://phabricator.wikimedia.org/T264896 (elukey) https://github.com/cloudera/hue/issues/1997
[14:49:44] fdans: hola! Was https://gerrit.wikimedia.org/r/c/analytics/refinery/+/658348/ deployed?
[14:49:56] it is the only thing that I can see in the kanban
[14:50:01] the train is empty
[14:50:22] elukey: yes, it is deployed, thank you!
[14:50:31] not yet in done because backfilling
[14:51:46] perfect
[14:52:04] I am going to send an email to the team about the empty train, in case something comes up I can do it tomorrow
[15:15:56] elukey: i just noticed in the hadoop::directory define that if ensure is absent
[15:15:57] https://github.com/wikimedia/puppet/blob/production/modules/bigtop/manifests/hadoop/directory.pp#L53
[15:16:08] hadoop-hdfs-namenode is required
[15:16:19] but, that define can be used anywhere, including nodes that don't run a namenode
[15:16:24] i should remove the require, right?
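Relatedly, a sketch of how one might check whether a newer connector jar is actually visible to Hive, re the classpath complaint at 14:39:56. The /usr/share/java and /usr/lib/hive/lib paths are assumptions based on the usual Debian and Bigtop layouts, not verified on these hosts.

```bash
# Sketch only: Hive loads jars from its own lib directory (or HIVE_AUX_JARS_PATH),
# so a connector installed elsewhere on the host is invisible to it.
ls -l /usr/share/java/mariadb-java-client*.jar       # where libmariadb-java puts the jar
ls -l /usr/lib/hive/lib/ | grep -iE 'mysql|mariadb'  # what hive-server2/metastore can see

# One way to expose a manually fetched jar to Hive:
sudo ln -sf /usr/share/java/mariadb-java-client.jar /usr/lib/hive/lib/
sudo systemctl restart hive-server2 hive-metastore
```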
[15:17:06] OPHHHH but it can only be used in places that have the hdfs user
[15:17:06] hm
[15:18:00] oh no, that is everywhere
[15:18:07] ah, we just can't by default sudo to hdfs, uh huh
[15:18:10] ok yeah, that should be fine then
[15:19:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/678870
[15:30:59] ottomata: the camus timestamp extractor thing was able to look at multiple fields to extract the timestamp. That's still needed, right, to handle schema migration?
[15:31:28] ottomata: in theory yes, I think that the original reasoning was to avoid issuing directory creation requests if the namenode was down
[15:34:56] milimetric: hm. yes i think that's right
[15:35:12] k, cool, will do
[15:35:14] thx
[15:35:23] i think it's kind of a useful feature too, and probably not hard to implement... if you are extracting the time from the payload anyway
[15:35:44] i don't know if i would support fallback for multiple timestamp sources, e.g. kafka and then event data... or... hmm, i guess that would be cool
[15:35:47] but maybe that's difficult?
[15:35:57] elukey: aye ok
[15:48:25] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:01:16] a-team standup!
[16:04:28] Analytics-Radar, Growth-Scaling, Growth-Team (Current Sprint), Patch-For-Review, Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (nettrom_WMF) This work is 90% done. The notebook is updated and ready for automatic mont...
[16:16:03] bearloga: o/ ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/678864?
[16:16:19] elukey: yup!
[16:17:22] !log rebalance kafka partitions for webrequest_text partitions 19, 20
[16:17:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:18:09] elukey: thank you! I've got stat1007:/srv/discovery/venv all ready and all of reportupdater's dependencies installed there, so theoretically next time it runs it should backfill all the missing reports since feb 8
[16:18:50] elukey: please let me know if you continue to see errors from the systemd timer
[16:19:35] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:21:27] bearloga: yeah, I think we need to set /srv/discovery/venv/lib/python3.7/site-packages as PYTHONPATH
[16:21:37] without the " " quoting
[16:21:46] oh good, just commented ^
[16:21:59] silly me, I should have checked
[16:22:15] yes yes my bad
[16:22:16] ottomata elukey: ah! thank you, my bad
[16:22:57] perhaps golden main.sh should just use your venv's python though?
[16:23:17] bearloga: elukey ^?
[16:23:52] yes, it could be an option as well
[16:24:15] having the PYTHONPATH set seemed good but we can do anything
[16:24:16] ottomata: I don't want it dependent on my env
[16:24:47] ...? isn't it already?
[16:25:24] so /srv/discovery/venv/lib/python3.7/site-packages works fine, the timer is running now
[16:25:53] ottomata: it was, yes, but I want that job to be self-contained going forward
[16:26:05] milimetric: no rush, but two things:
[16:26:06] 1) could you take a look at my latest comment on https://phabricator.wikimedia.org/T261681 and leave your thoughts?
2) In the AQS code here https://github.com/wikimedia/analytics-aqs/blob/master/sys/pageviews.js#L198-L207, every time the per-article endpoint serves a request, it has to redefine the highlighted functions. From my understanding this is not the best practice - is there something I'm missing or is it just that the performance difference doesn't matter too much?
[16:27:09] elukey: wait, how? doesn't it need a separate patch to set pythonpath correctly?
[16:27:35] elukey: or did you just test it manually first?
[16:28:49] checked manually yes, I filed the patch to fix :)
[16:28:55] thanks so much!
[16:29:52] should have reviewed the other one more carefully :)
[16:33:03] bearloga: but... what's the difference between using the venv's python and adding its site-packages to PYTHONPATH in this case?
[16:33:19] oh, you just don't want that link in the script?
[16:33:46] RECOVERY - Check unit status of wikimedia-discovery-golden on stat1007 is OK: OK: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:33:51] goood
[16:36:29] ottomata: I suppose I could have updated main.sh to call that venv's python, sure. the puppet pythonpath solution just... felt better
[16:37:13] aye
[17:03:28] (PS1) GoranSMilovanovic: minor20210413 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/678902
[17:03:44] (CR) GoranSMilovanovic: [V: +2 C: +2] minor20210413 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/678902 (owner: GoranSMilovanovic)
[17:38:03] * elukey afk! ttl
[18:57:39] (PS1) Ottomata: ProduceCanaryEvents - include httpRequest body in failure message [analytics/refinery/source] - https://gerrit.wikimedia.org/r/678919 (https://phabricator.wikimedia.org/T274951)
[19:35:14] * razzi afk for lunch
[19:50:12] razzi: lemme know if you wanna sync, happy to join :)
[19:50:41] (i know there haven't been a lot of work days since our last sync :) )
[20:01:47] ottomata: yeah we can skip for today
[20:01:59] k!
[21:09:28] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:19:30] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:42:44] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:47:57] ^-- Looks like canary events fixed themselves again, but breaking twice in an hour is concerning
[21:52:44] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:17:23] Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) I implemented a check that works on both staging and production, using the appropriate header for production (x-cas-uid rather than x-remote-user). I re-enabled a...
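To close the loop on the 16:33 question about PYTHONPATH versus the venv's python, a hedged sketch of the two approaches: "main.sh" and the venv path come from the conversation above, while the report script name is purely hypothetical.

```bash
# Option 1 (roughly what the puppet patch discussed above does):
# point the system python at the venv's installed packages.
PYTHONPATH=/srv/discovery/venv/lib/python3.7/site-packages python3 generate_reports.py

# Option 2 (what was suggested for main.sh):
# call the venv's own interpreter directly.
/srv/discovery/venv/bin/python generate_reports.py

# Functionally similar here; option 2 also pins the interpreter version the venv was
# built with, while option 1 keeps using whatever /usr/bin/python3 happens to be.
```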